Preparing Data for ML Projects

Data quality makes or breaks your ML projects. You can have the fanciest algorithms and unlimited compute power, but garbage data produces garbage results. This guide walks you through preparing data for ML projects - from initial assessment to handling missing values, scaling features, and preventing data leakage. We'll cover what actually matters when building production systems.

Time required: 3-4 weeks for a production dataset

Prerequisites

Basic understanding of machine learning concepts (training, testing, validation)
Familiarity with Python, pandas, or similar data manipulation tools
Access to your raw dataset and documentation of its source
Understanding of your specific ML problem and target variable

Step-by-Step Guide

Conduct a Thorough Data Audit

Start by understanding exactly what you're working with. Pull basic statistics on your dataset - row count, column count, data types, and memory footprint. Run `info()` and `describe()` functions if you're using pandas. This initial scan reveals obvious problems like wrong data types or unrealistic value ranges. Next, check your data source documentation. Where did this data come from? How was it collected? What transformations already happened? These details matter because they affect how you'll handle edge cases later. If you're pulling data from a legacy database, understand the schema and any quirks in how fields are stored. Document everything you find. Create a simple spreadsheet tracking each column, its type, expected range, and any observed anomalies. This becomes invaluable when you're debugging model performance issues three months later.
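As a rough illustration of the audit described above, here is a minimal pandas sketch. The toy DataFrame, its column names, and the `audit` helper are all hypothetical stand-ins for your own data and tooling, not a standard API.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for a raw dataset.
df = pd.DataFrame({
    "age": [34, 51, np.nan, 29, 420],            # 420 is an unrealistic value
    "country": ["US", "US", "US", "US", "US"],   # same value in every row
    "signup_date": ["2023-01-05", "2023-01-06", None, "2023-01-09", "2023-01-10"],
})

def audit(frame: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: dtype, percent missing, distinct count, constant flag."""
    report = pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "pct_missing": frame.isna().mean() * 100,
        "n_unique": frame.nunique(dropna=True),
    })
    report["is_constant"] = report["n_unique"] <= 1
    return report

report = audit(df)
print(report)
print(df.describe())  # value ranges for numeric columns; the max of 420 stands out
```

A table like `report` can be exported as the starting point for the per-column tracking spreadsheet.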

Tip

Calculate the percentage of missing values per column - often this single metric guides your strategy
Look for suspicious patterns like entire columns with the same value or impossibly high/low outliers
Check timestamps for gaps or inconsistencies that might indicate data collection problems
Compare column distributions against your domain knowledge - does this data match reality?

Warning

Don't assume data is clean just because it came from an official source - audit everything
Avoid making cleanup decisions based on statistics alone - talk to domain experts first
Watch for seasonal patterns or time-based artifacts that might disappear in your training set

Handle Missing Values Strategically

Missing data isn't a single problem - it's several different problems depending on the pattern. First, identify whether data is missing completely at random (MCAR), missing at random (MAR), or missing not at random (MNAR). With MCAR, missingness is unrelated to any value in the data, so simple imputation is usually adequate. With MAR, missingness depends on other observed columns, so methods that use those columns tend to do better. MNAR is trickier because the missingness depends on the unobserved value itself, which means the missingness contains information. For numeric columns, you've got options. Mean imputation works for MCAR scenarios with low missingness rates (under 5%). Median imputation is more robust to outliers. Forward fill and backward fill work for time-series data where temporal relationships matter. For more complex patterns, consider multiple imputation or K-nearest neighbors imputation. For categorical data, creating a separate 'missing' category often works better than deleting rows. Before imputing anything, ask whether the missingness is important. Sometimes that missing value is a signal - it might indicate a sensor failure, a customer who didn't complete a form, or a process that wasn't followed. Preserve that information by creating a binary 'was_missing' feature alongside your imputed value.
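One way to combine median imputation with a 'was_missing' flag is scikit-learn's `SimpleImputer` with `add_indicator=True`. The sketch below assumes a hypothetical `income` column and a fixed row-based split; the point is that the median is learned from the training rows only.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical numeric column with gaps and one extreme value.
X = pd.DataFrame({"income": [40.0, 52.0, np.nan, 61.0, 1000.0, 48.0, np.nan, 55.0]})

# Split first (a fixed split here, for illustration), then impute.
X_train, X_test = X.iloc[:6], X.iloc[6:]

# add_indicator=True appends a binary column marking originally-missing values.
imputer = SimpleImputer(strategy="median", add_indicator=True)
train_imputed = imputer.fit_transform(X_train)  # median learned from training rows only
test_imputed = imputer.transform(X_test)        # same median reused on test rows

print(imputer.statistics_)  # training median, unaffected by the 1000.0 outlier
print(test_imputed)         # columns: imputed income, was_missing flag
```

Swapping `strategy="median"` for `"mean"` on this data would pull the fill value toward the 1000.0 row, which is the robustness difference mentioned above.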

Tip

Start simple - often mean or median imputation works well enough for real projects
Use sklearn's SimpleImputer or IterativeImputer for consistent, reproducible results
Keep your imputation logic simple enough that you can explain it to stakeholders
Test model performance with different imputation strategies - sometimes one approach dramatically outperforms others

Warning

Deleting rows with missing values wastes data - only do this if a row is missing most of its fields (say, more than 50%) or the row is truly invalid
Never impute on your entire dataset before splitting into train/test - this causes data leakage
Avoid forward fill on shuffled data - it only makes sense for time-series preserving temporal order

Remove or Transform Outliers

Outliers distort your model's understanding of normal patterns. A single extreme value can shift means, inflate variance estimates, and throw off distance-based algorithms. But here's the catch - sometimes outliers are real, important data points. A credit card fraudster making a $50,000 purchase isn't a data error; it's a critical signal. Use statistical methods to identify outliers objectively. The interquartile range (IQR) method flags values beyond 1.5 * IQR from the quartiles. Z-score method identifies values more than 3 standard deviations from the mean. Isolation Forest is more sophisticated and handles multivariate outliers. For domain-specific problems, sometimes a simple rule works best - transaction amounts below $0 are impossible, so those are definitely errors. You've got three choices for handling outliers: delete them (only if they're errors), transform them (log scaling, square root), or cap them (winsorization). For your ML project, capping extreme values often works better than deletion. Replace values beyond the 99th percentile with the 99th percentile value. This preserves row count while reducing distortion.
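The IQR rule and percentile capping described above might look like this in pandas. The `amount` values are made up, and the thresholds are computed from the training series and then reused on the test series.

```python
import pandas as pd

# Hypothetical transaction amounts; 500.0 and 900.0 are the extreme values.
train = pd.Series([12.0, 15.0, 14.0, 13.0, 16.0, 15.0, 14.0, 500.0], name="amount")
test = pd.Series([13.0, 900.0], name="amount")

# IQR rule: flag values more than 1.5 * IQR outside the quartiles.
q1, q3 = train.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = (train < lower) | (train > upper)

# Winsorization: cap at the 99th percentile learned from the training data.
cap = train.quantile(0.99)
train_capped = train.clip(upper=cap)
test_capped = test.clip(upper=cap)  # reuse the training cap, never recompute on test

print(train[is_outlier])
print(train_capped.max(), test_capped.max())
```

On a series this small the 99th percentile sits close to the extreme value itself; on a real dataset the cap would be far less sensitive to a single row.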

Tip

Visualize distributions with histograms and box plots before deciding on outlier treatment
Check if outliers cluster in specific subgroups - they might indicate a data quality issue in one source
Use domain expertise - ask the business if high values make sense for your use case
Compute outlier thresholds on the training set only, then apply those same thresholds to the test set to prevent leakage

Warning

Don't remove outliers before exploring them - you might delete your most valuable signals
Be careful with capping strategies on skewed distributions - they can create artificial clusters
Watch for seasonal outliers in time-series data - extreme values in December might be perfectly normal

Encode Categorical Variables Properly

Machine learning algorithms don't understand text categories. They need numbers. How you convert 'Red', 'Green', 'Blue' or 'Premium', 'Standard', 'Basic' into numbers significantly impacts model performance. Pick the wrong encoding and you're telling your algorithm that 'Blue' is twice as large as 'Red' when no such ordering exists. For tree-based models (Random Forests, XGBoost), label encoding usually works well enough because trees can isolate individual codes with repeated splits instead of relying on a linear relationship. One-hot encoding creates a binary column for each category, which works everywhere but explodes your feature count on high-cardinality columns. If a category appears only once or twice, it's often better to group it into an 'Other' category to reduce noise. For ordinal categories where order matters (Small, Medium, Large), explicitly assign ordered integers. For high-cardinality features, target encoding is a powerful alternative. It replaces each category with the mean target value for that category, which captures predictive information directly. Just be careful with target encoding on rare categories where the mean is unstable.
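A small sketch of one-hot and ordinal encoding with scikit-learn, fitted on training rows only. The `color` and `size` columns are hypothetical, and `handle_unknown="ignore"` is one possible policy for categories that first appear at prediction time.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

train = pd.DataFrame({
    "color": ["Red", "Green", "Blue", "Red"],
    "size": ["Small", "Large", "Medium", "Small"],
})
test = pd.DataFrame({
    "color": ["Green", "Purple"],   # 'Purple' never appears in training
    "size": ["Medium", "Large"],
})

# One-hot: no implied order; unseen categories become an all-zero row.
onehot = OneHotEncoder(handle_unknown="ignore")
color_train = onehot.fit_transform(train[["color"]]).toarray()
color_test = onehot.transform(test[["color"]]).toarray()

# Ordinal: the order is stated explicitly rather than left to alphabetical sorting.
ordinal = OrdinalEncoder(categories=[["Small", "Medium", "Large"]])
size_train = ordinal.fit_transform(train[["size"]])
size_test = ordinal.transform(test[["size"]])

print(onehot.categories_[0])  # column order of the one-hot output
print(color_test)
print(size_test.ravel())
```

Passing the category order explicitly matters: left to its default, the encoder would sort alphabetically and place Large before Medium before Small.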

Tip

Use pd.factorize() for quick label encoding or pd.get_dummies() for one-hot encoding
Group rare categories appearing in less than 1% of data into 'Other' to improve generalization
Apply categorical encoding after train/test split - learn encoding from training data only
For high-cardinality features (1000+ unique values), consider target encoding or embeddings instead of one-hot

Warning

One-hot encoding can create thousands of features - this increases model complexity and training time
Never apply one-hot encoding before splitting your data - you'll leak information about test categories into training
Target encoding can cause severe overfitting on small datasets - be cautious with rare categories

Scale and Normalize Numeric Features

Features on vastly different scales cause problems. If one feature ranges 0-1 and another ranges 0-1,000,000, many algorithms implicitly treat the big numbers as more important. Distance-based algorithms (KNN, K-means) become completely dominated by large-scale features. Gradient descent in neural networks trains slowly because the learning rate has to be small enough to stay stable along the large-scale features, which leaves progress along the small-scale ones crawling. Standardization (z-score normalization) subtracts the mean and divides by standard deviation, centering each feature around zero with unit variance. This works well for normally distributed features and most algorithms. Min-max scaling squashes everything to the 0-1 range, which works better for bounded features and neural networks. Robust scaling uses medians and quantiles instead of means and standard deviation, making it resistant to outliers. The critical rule: fit your scaler on training data only, then apply it to test and production data using those same parameters. Most people get this right eventually, but it's a common source of data leakage that inflates performance estimates. Store your fitted scaler alongside your model so production predictions use identical scaling.
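The fit-on-train, apply-everywhere rule can be sketched as follows. The two-feature arrays are invented for illustration, and the in-memory `pickle` round trip stands in for whatever persistence mechanism you actually use to ship the scaler with the model.

```python
import pickle

import numpy as np
from sklearn.preprocessing import StandardScaler

# Two hypothetical features on very different scales.
X_train = np.array([[0.1, 100_000.0], [0.5, 500_000.0], [0.9, 900_000.0]])
X_test = np.array([[0.5, 1_300_000.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # mean and std learned from training data
X_test_scaled = scaler.transform(X_test)        # same parameters reused, no refit

# Persist the fitted scaler so production applies identical parameters.
restored = pickle.loads(pickle.dumps(scaler))

print(scaler.mean_, scaler.scale_)
print(X_test_scaled)
```

The test row's second feature lands well above 2 standard deviations because it is judged against the training distribution, which is exactly what you want an honest evaluation to show.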

Tip

Use sklearn's StandardScaler for most cases - it's simple, standard, and works well
Don't scale binary features or features you've already engineered to specific ranges
Apply scaling after creating derived features - scale the final feature set
Use RobustScaler for datasets with extreme outliers that you can't remove

Warning

Never fit your scaler on the entire dataset before splitting - this causes data leakage
Don't scale categorical features that you've label encoded - they don't need it
Watch out when scaling target variables - you'll need to inverse-transform predictions back to original scale

Create Train-Test Splits Correctly

How you split data determines whether your model performance estimates are honest. Random splitting works for most cases - shuffle your data and allocate 70-80% to training, 20-30% to testing. But random splitting fails for time-series data. If you shuffle before splitting, your model uses future information to predict the past. Instead, use temporal splits - training data is everything before a date, test data is everything after. For imbalanced datasets where one class is rare, use stratified splitting. This ensures both train and test sets have the same class distribution. If you have 10% positive examples, both sets get 10%. sklearn's train_test_split handles this with the stratify parameter. For grouped data (multiple transactions per customer, multiple measurements per machine), group-based splitting ensures all data from the same entity goes to either train or test, never both. Consider creating a separate validation set for hyperparameter tuning. Your workflow becomes: train on training data, tune hyperparameters on validation data, report final metrics on test data. This three-way split is especially important for deep learning where hyperparameter choices are numerous.
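The three splitting strategies above (stratified, temporal, grouped) could be sketched like this. The DataFrame, its column names, and the cutoff date are hypothetical; which strategy applies depends on your data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit, train_test_split

# Hypothetical data: 10 customers, 10 daily rows each, 10% positive labels.
df = pd.DataFrame({
    "customer_id": np.repeat(np.arange(10), 10),
    "date": pd.date_range("2024-01-01", periods=100, freq="D"),
    "target": [1 if i % 10 == 0 else 0 for i in range(100)],
})

# Stratified split: both sets keep the 10% positive rate.
train_s, test_s = train_test_split(
    df, test_size=0.2, stratify=df["target"], random_state=42
)

# Temporal split: everything before the cutoff trains, everything after tests.
cutoff = pd.Timestamp("2024-03-21")
train_t, test_t = df[df["date"] < cutoff], df[df["date"] >= cutoff]

# Group split: all rows for a given customer land on one side only.
gss = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(gss.split(df, groups=df["customer_id"]))
train_g, test_g = df.iloc[train_idx], df.iloc[test_idx]

print(train_s["target"].mean(), test_s["target"].mean())
print(train_t["date"].max(), test_t["date"].min())
print(sorted(test_g["customer_id"].unique()))
```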

Tip

Use random_state parameter for reproducibility - set it to the same value every run
For time-series, always split by time - never shuffle dates before splitting
Create stratified splits for classification problems with class imbalance
Hold out 10-15% of data for final testing and don't touch it until you're ready to report results

Warning

Random splitting on time-series data creates unrealistic evaluation - future data leaks into training
Using test data for any decisions (feature selection, hyperparameter tuning, threshold adjustment) invalidates your results
Stratified splitting helps but doesn't solve severe class imbalance - consider SMOTE or class weighting too

Feature Engineering and Selection

Raw features rarely work well. Feature engineering creates new features that make patterns clearer. If you have date columns, extract year, month, day of week, whether it's a holiday. For customer data, calculate recency (days since last purchase), frequency (purchase count), monetary (total spent). These derived features often outperform raw data because they directly encode business logic. But more features aren't always better. Too many features introduce noise, slow training, and cause overfitting. Use correlation analysis to remove features that perfectly correlate with others - keeping one and dropping duplicates. Univariate feature selection tests each feature independently against your target variable, ranking them by predictive power. Permutation importance trains your model then shuffles each feature to see how much model performance drops - big drops mean important features. Tree-based models provide feature importance scores automatically. After training a Random Forest or XGBoost model, you can see which features the model actually uses. Keep the top 20-30 features contributing 80-90% of the importance, dropping low-importance features to simplify your model and reduce training time.
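As an illustration of the date and recency/frequency/monetary features mentioned above, here is a pandas sketch. The transaction table and the `snapshot` date (standing in for the moment of prediction) are assumptions made for the example.

```python
import pandas as pd

# Hypothetical transaction log.
tx = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2],
    "date": pd.to_datetime(
        ["2024-01-05", "2024-02-10", "2024-01-20", "2024-02-01", "2024-02-28"]
    ),
    "amount": [20.0, 35.0, 10.0, 15.0, 5.0],
})
snapshot = pd.Timestamp("2024-03-01")  # assumed prediction date

# Calendar features extracted from the date column.
tx["month"] = tx["date"].dt.month
tx["day_of_week"] = tx["date"].dt.dayofweek  # Monday = 0

# Recency, frequency, monetary per customer, using only data before the snapshot.
rfm = tx.groupby("customer_id").agg(
    last_purchase=("date", "max"),
    frequency=("date", "count"),
    monetary=("amount", "sum"),
)
rfm["recency_days"] = (snapshot - rfm["last_purchase"]).dt.days
rfm = rfm.drop(columns="last_purchase")

print(rfm)
```

Anchoring recency to an explicit snapshot date, rather than to today's date, keeps the feature reproducible and makes it obvious that no post-snapshot information is used.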

Tip

Start with domain expertise - work with business stakeholders to identify meaningful feature combinations
Use correlation matrices to identify and remove multicollinearity before modeling
Apply feature selection on training data only, then use the same features on test data
Combine automatic selection with manual review - sometimes business logic should override statistics

Warning

Feature selection before train-test splitting causes data leakage - always select features from training data
Avoid creating features that directly use test data or future information
Too much feature engineering wastes time - start simple and iterate based on model results

Address Class Imbalance

When your target variable is heavily imbalanced, standard accuracy metrics mislead you. If 99% of your data is 'no fraud' and 1% is 'fraud', a model that predicts everything as 'no fraud' gets 99% accuracy while being completely useless. You need strategies specifically designed for imbalance. Oversampling replicates minority class examples until classes are balanced. Undersampling removes majority class examples. Both work but have tradeoffs - oversampling can overfit on limited minority data, undersampling loses majority class information. SMOTE (Synthetic Minority Over-sampling Technique) creates synthetic minority examples by interpolating between existing ones. This often outperforms both simple oversampling and undersampling. For most projects, SMOTE on training data then evaluation on unsampled test data works well. Class weighting adjusts your loss function to penalize minority class errors more heavily. With XGBoost, set scale_pos_weight to the ratio of negative to positive examples. Tree-based models handle class imbalance reasonably well with proper weighting. For imbalanced problems, always use appropriate metrics - F1-score, precision-recall curves, and area under the PR curve instead of accuracy.
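To make the accuracy trap and the weighting arithmetic concrete, here is a small sketch on made-up labels (990 negatives, 10 positives). It only computes the weights; passing them to a model (for example as XGBoost's `scale_pos_weight` or a scikit-learn `class_weight` argument) is left out.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical labels: 1% positive class.
y_train = np.array([0] * 990 + [1] * 10)

# A "model" that always predicts the majority class.
y_pred = np.zeros_like(y_train)
accuracy = accuracy_score(y_train, y_pred)       # looks great
f1 = f1_score(y_train, y_pred, zero_division=0)  # reveals the model is useless

# Ratio of negatives to positives, the value suggested for scale_pos_weight.
n_neg, n_pos = np.bincount(y_train)
scale_pos_weight = n_neg / n_pos

# 'balanced' weights: n_samples / (n_classes * class_count).
weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y_train)
class_weight = dict(zip([0, 1], weights))

print(accuracy, f1)
print(scale_pos_weight, class_weight)
```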

Tip

Use F1-score or precision-recall AUC for imbalanced classification - accuracy misleads
Apply SMOTE or other resampling only to training data, evaluate on real unsampled test data
Try class weighting first - it's simpler than SMOTE and often works just as well
Combine SMOTE with class weighting for severe imbalance problems

Warning

Never apply SMOTE before splitting - this creates synthetic data that bleeds into test set
Using accuracy on imbalanced data hides poor performance - stakeholders will be disappointed in production
Extreme oversampling can cause overfitting - keep minority to majority ratio between 1:2 and 1:10

Validate Data Quality and Consistency

Before training any model, validate that your data makes sense. Check data type consistency - all phone numbers should be strings, all dates should parse correctly. Run integrity checks: do IDs have duplicates that shouldn't? Are foreign key relationships valid? For a customer dataset, verify that customer IDs exist before referencing them in transaction records. Create data quality tests as code. Check that age values fall between 0 and 120, that dates don't have future values, that numeric fields stay within expected ranges. These tests should run automatically on new data. I recommend using Great Expectations or similar data validation frameworks that maintain schemas and quality rules. Compare your processed data against raw data using summary statistics. If raw data has 1 million rows and processed data has 900k rows, you need to know exactly what happened to the 100k rows. Document every transformation - deleted duplicates (5k rows), removed out-of-range values (50k rows), etc. This transparency prevents nasty surprises when your model fails on real data later.
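Data quality tests as code do not require a framework to get started. The sketch below uses plain pandas; the `validate` function, the column names, and the sample frames are hypothetical, and a tool like Great Expectations would express the same rules declaratively.

```python
import pandas as pd

def validate(frame: pd.DataFrame, today: pd.Timestamp) -> list:
    """Return a list of human-readable rule violations (empty list means clean)."""
    problems = []
    if frame["customer_id"].duplicated().any():
        problems.append("duplicate customer_id")
    if not frame["age"].between(0, 120).all():
        problems.append("age outside 0-120")
    if (frame["signup_date"] > today).any():
        problems.append("signup_date in the future")
    return problems

today = pd.Timestamp("2024-06-01")  # fixed date so the check is reproducible
good = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "age": [30, 45, 61],
    "signup_date": pd.to_datetime(["2024-01-01", "2024-02-01", "2024-03-01"]),
})
bad = pd.DataFrame({
    "customer_id": [1, 1, 3],
    "age": [30, 150, 61],
    "signup_date": pd.to_datetime(["2024-01-01", "2024-02-01", "2025-01-01"]),
})

print(validate(good, today))
print(validate(bad, today))

# Row accounting: record how many rows each cleaning step removes.
deduped = bad.drop_duplicates(subset="customer_id")
row_log = {"raw": len(bad), "after_dedup": len(deduped), "removed": len(bad) - len(deduped)}
print(row_log)
```

In a pipeline you would typically raise an error or alert when the returned list is non-empty, so failures are loud rather than silent.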

Tip

Write validation checks that document your data quality assumptions
Use data profiling tools to automatically detect anomalies and distributions
Create a data quality scorecard tracking key metrics over time - degradation might indicate upstream problems
Document why you removed or modified data so others understand your decisions

Warning

Don't modify data silently - always document what you changed and why
Silent failures in data validation are dangerous - make validation checks loud and visible
Test your validation rules on real data - sometimes your assumptions about 'normal' are wrong

Create Proper Validation Metrics and Baselines

You can't know if your model performs well without knowing what 'good' means. Establish a baseline - the performance of a simple, dumb approach. For classification, the most common class baseline guesses the majority class every time. For regression, the mean baseline predicts the average target value. Any sophisticated model should beat these baselines significantly. Choose metrics aligned with your business goal, not just standard metrics. If you're building a fraud detection model, false negatives (missed fraud) might cost 100x more than false positives (blocked legitimate transactions). In that case, the precision-recall tradeoff matters more than overall accuracy. For regression, mean absolute error (MAE) in original units often communicates better to stakeholders than root mean squared error (RMSE). Set up cross-validation for more stable performance estimates. K-fold cross-validation splits your training data into K folds, trains on K-1 folds and evaluates on the held-out fold, repeating K times. This uses your limited data more efficiently, and the spread of scores across folds gives you a rough sense of how much performance varies. Time-series cross-validation uses expanding windows - train on weeks 1-4, test on week 5, then train on weeks 1-5, test on week 6.
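A minimal sketch of a majority-class baseline, stratified k-fold cross-validation, and the expanding-window example from the text. The synthetic dataset, the choice of logistic regression, and balanced accuracy as the metric are assumptions for illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, TimeSeriesSplit, cross_val_score

# Synthetic imbalanced data (80/20), purely for illustration.
X, y = make_classification(
    n_samples=500, n_features=6, n_informative=3, n_redundant=0,
    n_clusters_per_class=1, weights=[0.8, 0.2], class_sep=2.0, random_state=0,
)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Baseline: always predict the majority class.
baseline = cross_val_score(
    DummyClassifier(strategy="most_frequent"), X, y, cv=cv, scoring="balanced_accuracy"
)
model = cross_val_score(
    LogisticRegression(max_iter=1000), X, y, cv=cv, scoring="balanced_accuracy"
)
print(f"baseline {baseline.mean():.3f}, model {model.mean():.3f} +/- {model.std():.3f}")

# Expanding windows over six weeks: train on 1-4, test on 5; train on 1-5, test on 6.
weeks = np.arange(1, 7)
tscv = TimeSeriesSplit(n_splits=2, test_size=1)
folds = [(weeks[tr].tolist(), weeks[te].tolist()) for tr, te in tscv.split(weeks)]
print(folds)
```

Keeping the per-fold scores (rather than only the mean) is what lets you spot a model whose performance swings between data subsets.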

Tip

Establish baseline performance before building complex models - sometimes simple works great
Report confidence intervals, not just point estimates - 85% +/- 5% is more informative than 85%
Use stratified k-fold for imbalanced classification to ensure each fold has similar class ratios
Save cross-validation scores from all folds to identify unstable models that perform differently on different data subsets

Warning

Using test set performance to pick your final model is data leakage - use validation set for model selection
Don't use classification accuracy for imbalanced problems - it's fundamentally misleading
Avoid metrics that don't align with business goals - you'll optimize the wrong thing

Prevent and Detect Data Leakage

Data leakage happens when information from test data influences training, creating artificially inflated performance that disappears in production. This is catastrophic because you think you've built a great model, then it fails in the real world. Common sources include scaling before splitting, using future information in features, selecting features from the entire dataset, or including the target variable in X features (yes, this happens). Prevent leakage through workflow discipline. Always split first, then preprocess. Fit scalers on training data. Create features from training data only. Remove any feature that mathematically depends on information you wouldn't have at prediction time. If you're predicting loan defaults, don't include whether the loan later defaulted - you're trying to predict that! Detect leakage through suspicious performance differences. If your cross-validation score is 95% but production performance is 70%, you probably have leakage. Similarly, if test performance is much better than cross-validation performance, something's wrong. A small gap (1-3%) is expected, but larger gaps suggest problems. Create a simple holdout test set that you absolutely don't touch until the end - it catches leakage that other checks miss.
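One way to enforce the split-first, preprocess-second discipline is to wrap preprocessing and the model in a single scikit-learn `Pipeline`, so cross-validation refits every step inside each fold. The synthetic data and the specific steps below are illustrative assumptions, not a prescribed setup.

```python
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, purely for illustration.
X, y = make_classification(
    n_samples=400, n_features=8, n_informative=3, n_redundant=0,
    n_clusters_per_class=1, class_sep=2.0, random_state=0,
)

# Split first; the test set is not touched until the final score.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

# Inside each CV fold the imputer and scaler are fit on that fold's training part only.
cv_scores = cross_val_score(pipeline, X_train, y_train, cv=5)

pipeline.fit(X_train, y_train)
test_score = pipeline.score(X_test, y_test)
gap = abs(cv_scores.mean() - test_score)
scaler_mean = pipeline.named_steps["scale"].mean_  # learned from X_train only

print(f"cv {cv_scores.mean():.3f}, test {test_score:.3f}, gap {gap:.3f}")
```

A large `gap` is the kind of suspicious difference the text describes; with preprocessing inside the pipeline, one common cause of it is ruled out by construction.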

Tip

Create a reproducible preprocessing pipeline using sklearn Pipeline or similar - this makes leakage obvious
Use time-based validation for time-series instead of random splitting
Explicitly list which data you have available at prediction time - ensure all features respect this
Have someone else review your code for leakage - it's surprisingly easy to miss

Warning

Data leakage often happens subtly - be paranoid about it
Future information bleeding into features is common - think through your feature creation carefully
Using test metrics for any decision-making causes leakage - reserve test data for final evaluation only

Frequently Asked Questions

What percentage of time goes to data preparation vs modeling?

Commonly cited industry surveys put data preparation at 60-80% of project time, leaving only 20-40% for actual modeling. The exact figure varies by domain, but a split close to 80-20 is common. Don't underestimate data work - it's where most impact happens.

Should I remove outliers or keep them?

It depends. If outliers are data errors, remove or fix them. If they're real but extreme values, cap them or transform them with log scaling. Sometimes outliers are your most important signals - fraud cases are outliers. Always investigate first.

When should I do feature engineering?

After cleaning and handling missing values, but before final feature selection. Create features from training data only to prevent leakage. Use domain expertise combined with statistical analysis to guide which features to engineer.

How do I know if my data is clean enough?

Run data quality tests, compare processed vs raw data counts, validate against business rules, and check for suspicious patterns. Your model's generalization gap (train vs test performance) also indicates data quality - large gaps suggest problems.

What's the best way to handle missing data?

Start simple with mean or median imputation for numeric data. For categorical data, create a 'missing' category. If missingness exceeds 50% or is informative, preserve it as a separate feature. Test different strategies - sometimes one dramatically outperforms others.
