Data quality makes or breaks your ML projects. You can have the fanciest algorithms and unlimited compute power, but garbage data produces garbage results. This guide walks you through preparing data for ML projects - from initial assessment to handling missing values, scaling features, and preventing data leakage. We'll cover what actually matters when building production systems.
Prerequisites
- Basic understanding of machine learning concepts (training, testing, validation)
- Familiarity with Python, pandas, or similar data manipulation tools
- Access to your raw dataset and documentation of its source
- Understanding of your specific ML problem and target variable
Step-by-Step Guide
Conduct a Thorough Data Audit
Start by understanding exactly what you're working with. Pull basic statistics on your dataset - row count, column count, data types, and memory footprint. Run `info()` and `describe()` functions if you're using pandas. This initial scan reveals obvious problems like wrong data types or unrealistic value ranges. Next, check your data source documentation. Where did this data come from? How was it collected? What transformations already happened? These details matter because they affect how you'll handle edge cases later. If you're pulling data from a legacy database, understand the schema and any quirks in how fields are stored. Document everything you find. Create a simple spreadsheet tracking each column, its type, expected range, and any observed anomalies. This becomes invaluable when you're debugging model performance issues three months later.
- Calculate the percentage of missing values per column - often this single metric guides your strategy
- Look for suspicious patterns like entire columns with the same value or impossibly high/low outliers
- Check timestamps for gaps or inconsistencies that might indicate data collection problems
- Compare column distributions against your domain knowledge - does this data match reality?
- Don't assume data is clean just because it came from an official source - audit everything
- Avoid making cleanup decisions based on statistics alone - talk to domain experts first
- Watch for seasonal patterns or time-based artifacts that might disappear in your training set
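The audit checks above can be sketched in a few lines of pandas. This is a minimal illustration, not a complete profiling tool: the toy DataFrame, its column names, and the `audit` helper are all hypothetical stand-ins for your own data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataset standing in for your raw data.
df = pd.DataFrame({
    "age": [34, 51, np.nan, 29, 420],            # 420 is an unrealistic value
    "country": ["US", "US", "US", "US", "US"],   # same value in every row
    "signup_date": pd.to_datetime(
        ["2024-01-01", "2024-01-02", "2024-01-03", "2024-01-10", "2024-01-11"]
    ),
})


def audit(frame: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: dtype, percent missing, distinct values."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "missing_pct": frame.isna().mean() * 100,
        "n_unique": frame.nunique(dropna=True),
    })


report = audit(df)

# Columns with a single distinct value carry no information for a model.
constant_cols = report.index[report["n_unique"] <= 1].tolist()

# Largest gap between consecutive timestamps, a rough check for collection outages.
max_gap = df["signup_date"].sort_values().diff().max()

print(report)
print("constant columns:", constant_cols)
print("largest timestamp gap:", max_gap)
print(df["age"].describe())  # min/max expose the impossible age
```

The `report` frame is a reasonable starting point for the per-column tracking spreadsheet: export it and add the expected range and observed anomalies by hand.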
Handle Missing Values Strategically
Missing data isn't a single problem - it's several different problems depending on the pattern. First, identify whether data is missing completely at random (MCAR), missing at random (MAR), or missing not at random (MNAR). MCAR is typically handled with simple imputation. MAR, where missingness depends on other observed columns, usually calls for imputation that uses those columns. MNAR is trickier because the missingness itself contains information. For numeric columns, you've got options. Mean imputation works for MCAR scenarios with low missingness rates (under 5%). Median imputation is more robust to outliers. Forward fill and backward fill work for time-series data where temporal relationships matter, though backward fill copies later values into earlier rows, so avoid it when that would leak future information. For more complex patterns, consider multiple imputation or K-nearest neighbors imputation. For categorical data, creating a separate 'missing' category often works better than deleting rows. Before imputing anything, ask whether the missingness is important. Sometimes that missing value is a signal - it might indicate a sensor failure, a customer who didn't complete a form, or a process that wasn't followed. Preserve that information by creating a binary 'was_missing' feature alongside your imputed value.
- Start simple - often mean or median imputation works well enough for real projects
- Use sklearn's SimpleImputer or IterativeImputer for consistent, reproducible results
- Keep your imputation logic simple enough that you can explain it to stakeholders
- Test model performance with different imputation strategies - sometimes one approach dramatically outperforms others
- Deleting rows with missing values wastes data - only do this if a row is missing most of its fields (say, more than 50%) or is truly invalid
- Never impute on your entire dataset before splitting into train/test - this causes data leakage
- Avoid forward fill on shuffled data - it only makes sense for time-series preserving temporal order
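Here is a minimal sketch of median imputation with a 'was_missing' flag, fitted on training rows only. The `income` column and the positional split are hypothetical. The split is done by position just to keep the example deterministic, so use a proper split in practice.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical data: "income" has gaps and one extreme value.
df = pd.DataFrame({
    "income": [40.0, np.nan, 55.0, 61.0, np.nan, 48.0, np.nan, 1000.0],
})

# Positional split purely for a deterministic illustration.
train, test = df.iloc[:6], df.iloc[6:]

# add_indicator=True appends a binary "was missing" column next to the imputed one.
imputer = SimpleImputer(strategy="median", add_indicator=True)
train_imputed = imputer.fit_transform(train[["income"]])  # median learned from train only
test_imputed = imputer.transform(test[["income"]])        # same median reused on test

print("median learned from training rows:", imputer.statistics_[0])
print("median of the full dataset:", df["income"].median())
print(test_imputed)
```

The two printed medians differ because the full-dataset median is pulled toward the extreme value that only exists in the test rows. That difference is exactly the information that leaks when you impute before splitting.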
Remove or Transform Outliers
Outliers distort your model's understanding of normal patterns. A single extreme value can shift means, inflate variance estimates, and throw off distance-based algorithms. But here's the catch - sometimes outliers are real, important data points. A credit card fraudster making a $50,000 purchase isn't a data error; it's a critical signal. Use statistical methods to identify outliers objectively. The interquartile range (IQR) method flags values beyond 1.5 * IQR from the quartiles. Z-score method identifies values more than 3 standard deviations from the mean. Isolation Forest is more sophisticated and handles multivariate outliers. For domain-specific problems, sometimes a simple rule works best - transaction amounts below $0 are impossible, so those are definitely errors. You've got three choices for handling outliers: delete them (only if they're errors), transform them (log scaling, square root), or cap them (winsorization). For your ML project, capping extreme values often works better than deletion. Replace values beyond the 99th percentile with the 99th percentile value. This preserves row count while reducing distortion.
- Visualize distributions with histograms and box plots before deciding on outlier treatment
- Check if outliers cluster in specific subgroups - they might indicate a data quality issue in one source
- Use domain expertise - ask the business if high values make sense for your use case
- Derive outlier thresholds (percentiles, IQR fences) from the training set only, then apply those same thresholds to the test set to prevent leakage
- Don't remove outliers before exploring them - you might delete your most valuable signals
- Be careful with capping strategies on skewed distributions - they can create artificial clusters
- Watch for seasonal outliers in time-series data - extreme values in December might be perfectly normal
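The IQR rule and percentile capping described above might look like this in pandas. The synthetic `amount` values are made up for illustration, and capping the low end at the 1st percentile is an added assumption (the text only mentions the 99th).

```python
import numpy as np
import pandas as pd

# Synthetic training values plus a few hand-picked test values (hypothetical).
rng = np.random.default_rng(0)
train = pd.Series(rng.normal(100, 15, size=1000), name="amount")
test = pd.Series([95.0, 110.0, 5000.0, -20.0], name="amount")

# IQR fences, computed from training data only.
q1, q3 = train.quantile([0.25, 0.75])
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
test_flags = (test < lower_fence) | (test > upper_fence)

# Winsorization: cap at percentiles learned from training data.
# The low-end cap is an assumption added here for symmetry.
low_cap, high_cap = train.quantile([0.01, 0.99])
test_capped = test.clip(lower=low_cap, upper=high_cap)

print(pd.DataFrame({"raw": test, "flagged": test_flags, "capped": test_capped}))
```

Capping keeps all four test rows, which matches the goal of preserving row count while reducing distortion. Whether the flagged rows are errors or real signals is still a judgment call for someone who knows the domain.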
Encode Categorical Variables Properly
Machine learning algorithms don't understand text categories. They need numbers. How you convert 'Red', 'Green', 'Blue' or 'Premium', 'Standard', 'Basic' into numbers significantly impacts model performance. Pick wrong and you're telling your algorithm that 'Blue' is twice as good as 'Red' when that ordering doesn't exist. For tree-based models (Random Forests, XGBoost), label encoding usually works fine since trees split on thresholds rather than treating the integer codes as magnitudes. One-hot encoding creates a binary column for each category, which works everywhere but explodes your feature count on high-cardinality columns. If a category appears only once or twice, it's often better to group it into an 'Other' category to reduce noise. For ordinal categories where order matters (Small, Medium, Large), explicitly assign ordered integers (ordinal encoding). For high-cardinality features, target encoding is a powerful option. Target encoding replaces each category with the mean target value for that category - this captures predictive information directly. Just be careful with target encoding on rare categories where the mean is unstable.
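- Use pd.factorize() for quick label encoding or pd.get_dummies() for one-hot encoding while exploring; for modeling, sklearn's OneHotEncoder lets you learn the categories from training data and reuse them on test data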
- Group rare categories appearing in less than 1% of data into 'Other' to improve generalization
- Apply categorical encoding after train/test split - learn encoding from training data only
- For high-cardinality features (1000+ unique values), consider target encoding or embeddings instead of one-hot
- One-hot encoding can create thousands of features - this increases model complexity and training time
- Never fit an encoder on the full dataset before splitting - you'll leak information about test categories into training
- Target encoding can cause severe overfitting on small datasets - be cautious with rare categories
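As a rough sketch of these encoding choices, the example below fits each encoder on a training frame and reuses it on test rows. The `color` and `size` columns are hypothetical, and the 20% rarity threshold is only there because the toy frame has six rows (the guide suggests around 1% on real data).

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

train = pd.DataFrame({
    "color": ["Red", "Green", "Blue", "Red", "Green", "Red"],
    "size": ["Small", "Large", "Medium", "Small", "Medium", "Large"],
})
test = pd.DataFrame({"color": ["Blue", "Purple"], "size": ["Medium", "Small"]})

# Nominal feature: one-hot, with categories learned from training data.
# handle_unknown="ignore" encodes unseen test categories as all zeros.
onehot = OneHotEncoder(handle_unknown="ignore")
onehot.fit(train[["color"]])
test_color = onehot.transform(test[["color"]]).toarray()

# Ordinal feature: spell out the order instead of relying on alphabetical order.
ordinal = OrdinalEncoder(categories=[["Small", "Medium", "Large"]])
ordinal.fit(train[["size"]])
test_size = ordinal.transform(test[["size"]])

# Rare-category grouping: the list of frequent categories comes from train only.
freq = train["color"].value_counts(normalize=True)
frequent = freq[freq >= 0.2].index.tolist()  # toy threshold for a 6-row frame


def group_rare(values: pd.Series) -> pd.Series:
    """Map anything outside the frequent training categories to 'Other'."""
    return values.where(values.isin(frequent), "Other")


print(onehot.categories_[0], test_color, sep="\n")
print(test_size.ravel())
print(group_rare(test["color"]).tolist())
```

Note how 'Purple', which never appears in training, becomes a row of zeros under one-hot encoding and 'Other' under rare-category grouping. Either way the test set never changes what the encoders learned.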
Scale and Normalize Numeric Features
Features on vastly different scales cause problems. If one feature ranges 0-1 and another ranges 0-1,000,000, many algorithms effectively treat the big numbers as more important. Distance-based algorithms (KNN, K-means) become completely dominated by large-scale features. Gradient descent in neural networks trains slowly because the learning rate has to stay small enough to be stable for the large-scale features, which leaves progress on the small-scale features slow. Standardization (z-score normalization) subtracts the mean and divides by standard deviation, centering each feature around zero with unit variance. This works well for normally distributed features and most algorithms. Min-max scaling squashes everything to the 0-1 range, which works better for bounded features and neural networks. Robust scaling uses medians and quantiles instead of means and standard deviation, making it resistant to outliers. The critical rule: fit your scaler on training data only, then apply it to test and production data using those same parameters. Most people get this right eventually, but it's a common source of data leakage that inflates performance estimates. Store your fitted scaler alongside your model so production predictions use identical scaling.
- Use sklearn's StandardScaler for most cases - it's simple, standard, and works well
- Don't scale binary features or features you've already engineered to specific ranges
- Apply scaling after creating derived features - scale the final feature set
- Use RobustScaler for datasets with extreme outliers that you can't remove
- Never fit your scaler on the entire dataset before splitting - this causes data leakage
- Don't scale categorical features that you've label encoded - they don't need it
- Watch out when scaling target variables - you'll need to inverse-transform predictions back to original scale
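The fit-on-train rule and the advice to store the scaler can be sketched as follows. The arrays are made-up values, and the in-memory `pickle` round trip merely stands in for however you persist artifacts next to your model.

```python
import pickle

import numpy as np
from sklearn.preprocessing import StandardScaler

# Two features on very different scales (hypothetical values).
X_train = np.array([[0.1, 100_000.0], [0.5, 500_000.0], [0.9, 900_000.0]])
X_test = np.array([[0.5, 1_000_000.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # mean and std learned from train only
X_test_scaled = scaler.transform(X_test)        # the same parameters applied to test

# Persist the fitted scaler; a file or model registry would replace this in practice.
restored = pickle.loads(pickle.dumps(scaler))

print("train means after scaling:", X_train_scaled.mean(axis=0))
print("test row after scaling:", X_test_scaled)
print("restored scaler agrees:", np.allclose(restored.transform(X_test), X_test_scaled))
```

The scaled test value for the second feature lands above anything seen in training, which is expected: the test set is allowed to fall outside the training range, it just isn't allowed to influence the parameters.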
Create Train-Test Splits Correctly
How you split data determines whether your model performance estimates are honest. Random splitting works for most cases - shuffle your data and allocate 70-80% to training, 20-30% to testing. But random splitting fails for time-series data. If you shuffle before splitting, your model uses future information to predict the past. Instead, use temporal splits - training data is everything before a date, test data is everything after. For imbalanced datasets where one class is rare, use stratified splitting. This ensures both train and test sets have the same class distribution. If you have 10% positive examples, both sets get 10%. sklearn's train_test_split handles this with the stratify parameter. For grouped data (multiple transactions per customer, multiple measurements per machine), group-based splitting ensures all data from the same entity goes to either train or test, never both. Consider creating a separate validation set for hyperparameter tuning. Your workflow becomes: train on training data, tune hyperparameters on validation data, report final metrics on test data. This three-way split is especially important for deep learning where hyperparameter choices are numerous.
- Use random_state parameter for reproducibility - set it to the same value every run
- For time-series, always split by time - never shuffle dates before splitting
- Create stratified splits for classification problems with class imbalance
- Hold out 10-15% of data for final testing and don't touch it until you're ready to report results
- Random splitting on time-series data creates unrealistic evaluation - future data leaks into training
- Using test data for any decisions (feature selection, hyperparameter tuning, threshold adjustment) invalidates your results
- Stratified splitting helps but doesn't solve severe class imbalance - consider SMOTE or class weighting too
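The three splitting strategies above can be sketched side by side. The DataFrame, its column names, and the cutoff date are hypothetical. Which strategy applies depends on whether your rows are independent, grouped, or time-ordered.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit, train_test_split

# Hypothetical dataset: 20 customers x 5 rows each, one row per day, 10% positives.
df = pd.DataFrame({
    "customer_id": np.repeat(np.arange(20), 5),
    "date": pd.date_range("2024-01-01", periods=100, freq="D"),
    "label": [1 if i % 10 == 0 else 0 for i in range(100)],
})

# 1) Stratified split: both sides keep the 10% positive rate.
train_s, test_s = train_test_split(
    df, test_size=0.2, stratify=df["label"], random_state=42
)

# 2) Group split: every customer lands entirely in train or entirely in test.
gss = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(gss.split(df, groups=df["customer_id"]))
train_g, test_g = df.iloc[train_idx], df.iloc[test_idx]

# 3) Temporal split: train on everything before the cutoff, test on the rest.
cutoff = pd.Timestamp("2024-03-21")
train_t, test_t = df[df["date"] < cutoff], df[df["date"] >= cutoff]

print("positive rate train/test:", train_s["label"].mean(), test_s["label"].mean())
print("shared customers:", set(train_g["customer_id"]) & set(test_g["customer_id"]))
print("train ends:", train_t["date"].max(), "| test starts:", test_t["date"].min())
```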
Feature Engineering and Selection
Raw features rarely work well. Feature engineering creates new features that make patterns clearer. If you have date columns, extract year, month, day of week, whether it's a holiday. For customer data, calculate recency (days since last purchase), frequency (purchase count), monetary (total spent). These derived features often outperform raw data because they directly encode business logic. But more features aren't always better. Too many features introduce noise, slow training, and cause overfitting. Use correlation analysis to remove features that perfectly correlate with others - keeping one and dropping duplicates. Univariate feature selection tests each feature independently against your target variable, ranking them by predictive power. Permutation importance trains your model then shuffles each feature to see how much model performance drops - big drops mean important features. Tree-based models provide feature importance scores automatically. After training a Random Forest or XGBoost model, you can see which features the model actually uses. Keep the top 20-30 features contributing 80-90% of the importance, dropping low-importance features to simplify your model and reduce training time.
- Start with domain expertise - work with business stakeholders to identify meaningful feature combinations
- Use correlation matrices to spot highly correlated feature pairs and drop the redundant ones before modeling
- Apply feature selection on training data only, then use the same features on test data
- Combine automatic selection with manual review - sometimes business logic should override statistics
- Feature selection before train-test splitting causes data leakage - always select features from training data
- Avoid creating features that directly use test data or future information
- Too much feature engineering wastes time - start simple and iterate based on model results
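A minimal sketch of the date and recency/frequency/monetary features mentioned above, using a hypothetical transactions table. The `snapshot` timestamp is an assumed prediction time. Filtering history against it is what keeps future information out of the features.

```python
import pandas as pd

# Hypothetical transaction log.
tx = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2, 3],
    "date": pd.to_datetime([
        "2024-01-05", "2024-02-10", "2024-01-20",
        "2024-02-01", "2024-02-25", "2024-01-02",
    ]),
    "amount": [50.0, 20.0, 10.0, 30.0, 15.0, 200.0],
})

# Assumed prediction time: only transactions before it may feed the features.
snapshot = pd.Timestamp("2024-03-01")
history = tx[tx["date"] < snapshot]

# Recency, frequency, monetary per customer.
rfm = history.groupby("customer_id").agg(
    last_purchase=("date", "max"),
    frequency=("date", "count"),
    monetary=("amount", "sum"),
)
rfm["recency_days"] = (snapshot - rfm["last_purchase"]).dt.days
rfm = rfm.drop(columns="last_purchase")

# Calendar features extracted from the raw date column.
tx["month"] = tx["date"].dt.month
tx["day_of_week"] = tx["date"].dt.dayofweek  # Monday is 0
tx["is_weekend"] = tx["day_of_week"] >= 5

print(rfm)
print(tx[["date", "month", "day_of_week", "is_weekend"]])
```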
Address Class Imbalance
When your target variable is heavily imbalanced, standard accuracy metrics mislead you. If 99% of your data is 'no fraud' and 1% is 'fraud', a model that predicts everything as 'no fraud' gets 99% accuracy while being completely useless. You need strategies specifically designed for imbalance. Oversampling replicates minority class examples until classes are balanced. Undersampling removes majority class examples. Both work but have tradeoffs - oversampling can overfit on limited minority data, undersampling loses majority class information. SMOTE (Synthetic Minority Over-sampling Technique) creates synthetic minority examples by interpolating between existing ones. This often outperforms both simple oversampling and undersampling. For most projects, SMOTE on training data then evaluation on unsampled test data works well. Class weighting adjusts your loss function to penalize minority class errors more heavily. With XGBoost, set scale_pos_weight to the ratio of negative to positive examples. Tree-based models handle class imbalance reasonably well with proper weighting. For imbalanced problems, always use appropriate metrics - F1-score, precision-recall curves, and area under the PR curve instead of accuracy.
- Use F1-score or precision-recall AUC for imbalanced classification - accuracy misleads
- Apply SMOTE or other resampling only to training data, evaluate on real unsampled test data
- Try class weighting first - it's simpler than SMOTE and often works just as well
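- For severe imbalance, you can combine moderate resampling with class weighting, but validate the result - applying both at full strength can overcorrect toward the minority class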
- Never apply SMOTE before splitting - this creates synthetic data that bleeds into test set
- Using accuracy on imbalanced data hides poor performance - stakeholders will be disappointed in production
- Extreme oversampling can cause overfitting - keep minority to majority ratio between 1:2 and 1:10
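To illustrate why accuracy misleads and what class weighting looks like, here is a small sketch on synthetic data. SMOTE is not shown because it lives in the separate imbalanced-learn package. The logistic regression model and the generated dataset are illustrative choices, not recommendations.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, average_precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic data with roughly 5% positives.
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.95, 0.05], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# "Always predict the majority class": high accuracy, zero recall.
majority_pred = np.zeros_like(y_test)
baseline_acc = accuracy_score(y_test, majority_pred)
baseline_recall = recall_score(y_test, majority_pred)

# Class weighting: errors on the rare class cost more during training.
model = LogisticRegression(class_weight="balanced", max_iter=1000)
model.fit(X_train, y_train)
weighted_recall = recall_score(y_test, model.predict(X_test))
pr_auc = average_precision_score(y_test, model.predict_proba(X_test)[:, 1])

# The negative-to-positive ratio you would pass to XGBoost as scale_pos_weight.
neg_pos_ratio = (y_train == 0).sum() / (y_train == 1).sum()

print(f"baseline accuracy={baseline_acc:.3f}, recall={baseline_recall:.3f}")
print(f"weighted model recall={weighted_recall:.3f}, PR-AUC={pr_auc:.3f}")
print(f"negative/positive ratio={neg_pos_ratio:.1f}")
```

The test set is left at its natural class ratio, so the reported metrics reflect what the model would face in production.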
Validate Data Quality and Consistency
Before training any model, validate that your data makes sense. Check data type consistency - all phone numbers should be strings, all dates should parse correctly. Run integrity checks: do IDs have duplicates that shouldn't? Are foreign key relationships valid? For a customer dataset, verify that customer IDs exist before referencing them in transaction records. Create data quality tests as code. Check that age values fall between 0 and 120, that dates don't have future values, that numeric fields stay within expected ranges. These tests should run automatically on new data. I recommend using Great Expectations or similar data validation frameworks that maintain schemas and quality rules. Compare your processed data against raw data using summary statistics. If raw data has 1 million rows and processed data has 900k rows, you need to know exactly what happened to the 100k rows. Document every transformation - deleted duplicates (5k rows), removed out-of-range values (50k rows), etc. This transparency prevents nasty surprises when your model fails on real data later.
- Write validation checks that document your data quality assumptions
- Use data profiling tools to automatically detect anomalies and distributions
- Create a data quality scorecard tracking key metrics over time - degradation might indicate upstream problems
- Document why you removed or modified data so others understand your decisions
- Don't modify data silently - always document what you changed and why
- Silent failures in data validation are dangerous - make validation checks loud and visible
- Test your validation rules on real data - sometimes your assumptions about 'normal' are wrong
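Data quality tests as code can start as plain pandas checks before you adopt a framework. The sketch below uses a hypothetical customer table and a hypothetical `validate` helper. The rules mirror the examples in the text (unique IDs, ages between 0 and 120, no future dates).

```python
import pandas as pd


def validate(frame: pd.DataFrame, today: pd.Timestamp) -> list[str]:
    """Return a human-readable list of rule violations (empty means all passed)."""
    problems = []
    if frame["customer_id"].duplicated().any():
        problems.append("duplicate customer_id values")
    # between() is False for NaN, so missing ages are reported here too.
    if not frame["age"].between(0, 120).all():
        problems.append("age outside 0-120 or missing")
    if (frame["signup_date"] > today).any():
        problems.append("signup_date in the future")
    return problems


today = pd.Timestamp("2024-06-01")  # fixed date so the example is reproducible

raw = pd.DataFrame({
    "customer_id": [1, 2, 2, 3],
    "age": [34, 150, 28, 45],
    "signup_date": pd.to_datetime(
        ["2023-05-01", "2024-01-15", "2024-02-01", "2025-01-01"]
    ),
})

issues = validate(raw, today)
print("raw data issues:", issues)

# Clean step by step and log how many rows each step removed.
log = {}
step1 = raw.drop_duplicates(subset="customer_id")
log["duplicate ids removed"] = len(raw) - len(step1)
step2 = step1[step1["age"].between(0, 120)]
log["out-of-range ages removed"] = len(step1) - len(step2)
clean = step2[step2["signup_date"] <= today]
log["future dates removed"] = len(step2) - len(clean)

print(log)
print("clean data issues:", validate(clean, today))
```

In a pipeline you would raise an exception or fail the job when `issues` is non-empty, so the failure is loud rather than silent. The `log` dictionary is the row-count reconciliation: every dropped row is attributed to a named step.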
Create Proper Validation Metrics and Baselines
You can't know if your model performs well without knowing what 'good' means. Establish a baseline - the performance of a simple, dumb approach. For classification, the most common class baseline guesses the majority class every time. For regression, the mean baseline predicts the average target value. Any sophisticated model should beat these baselines significantly. Choose metrics aligned with your business goal, not just standard metrics. If you're building a fraud detection model, false negatives (missed fraud) might cost 100x more than false positives (blocked legitimate transactions). In that case, recall and the precision-recall tradeoff matter more than overall accuracy. For regression, mean absolute error (MAE) in original units often communicates better to stakeholders than root mean squared error (RMSE). Set up cross-validation for more stable performance estimates. K-fold cross-validation splits your training data into K folds, trains on K-1 folds and evaluates on the held-out fold, repeating K times. This uses your limited data more efficiently and shows you how much performance varies from fold to fold. Time-series cross-validation uses expanding windows - train on weeks 1-4, test on week 5, then train on weeks 1-5, test on week 6.
- Establish baseline performance before building complex models - sometimes simple works great
- Report confidence intervals, not just point estimates - 85% +/- 5% is more informative than 85%
- Use stratified k-fold for imbalanced classification to ensure each fold has similar class ratios
- Save cross-validation scores from all folds to identify unstable models that perform differently on different data subsets
- Using test set performance to pick your final model is data leakage - use validation set for model selection
- Don't use classification accuracy for imbalanced problems - it's fundamentally misleading
- Avoid metrics that don't align with business goals - you'll optimize the wrong thing
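A rough sketch of a majority-class baseline, stratified k-fold scoring, and expanding-window splits. The synthetic dataset and the logistic regression model are placeholders. The F1 scorer is built with `zero_division=0` so the baseline, which never predicts the positive class, scores 0 without warnings.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import StratifiedKFold, TimeSeriesSplit, cross_val_score

# Synthetic, moderately imbalanced data (placeholder for your training set).
X, y = make_classification(
    n_samples=500, n_features=8, weights=[0.8, 0.2], random_state=0
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
f1 = make_scorer(f1_score, zero_division=0)

# Baseline: always predict the most common class.
baseline_scores = cross_val_score(
    DummyClassifier(strategy="most_frequent"), X, y, cv=cv, scoring=f1
)
model_scores = cross_val_score(
    LogisticRegression(max_iter=1000), X, y, cv=cv, scoring=f1
)

print(f"baseline F1: {baseline_scores.mean():.3f}")
print(f"model F1: {model_scores.mean():.3f} +/- {model_scores.std():.3f}")
print("per-fold scores:", np.round(model_scores, 3))

# Expanding-window splits for time-ordered data: training always precedes testing.
timeline = np.arange(12).reshape(-1, 1)  # 12 ordered periods
folds = list(TimeSeriesSplit(n_splits=3).split(timeline))
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```

The fold-to-fold standard deviation is a spread estimate, not a formal confidence interval, but it is usually enough to tell a stable model from one that depends heavily on which rows it saw.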
Prevent and Detect Data Leakage
Data leakage happens when information from test data influences training, creating artificially inflated performance that disappears in production. This is catastrophic because you think you've built a great model, then it fails in the real world. Common sources include scaling before splitting, using future information in features, selecting features from the entire dataset, or including the target variable in X features (yes, this happens). Prevent leakage through workflow discipline. Always split first, then preprocess. Fit scalers on training data. Create features from training data only. Remove any feature that mathematically depends on information you wouldn't have at prediction time. If you're predicting loan defaults, don't include whether the loan later defaulted - you're trying to predict that! Detect leakage through suspicious performance differences. If your cross-validation score is 95% but production performance is 70%, you probably have leakage. Similarly, if test performance is much better than cross-validation performance, something's wrong. A small gap (1-3%) is expected, but larger gaps suggest problems. Create a simple holdout test set that you absolutely don't touch until the end - it catches leakage that other checks miss.
- Create a reproducible preprocessing pipeline using sklearn Pipeline or similar - this makes leakage obvious
- Use time-based validation for time-series instead of random splitting
- Explicitly list which data you have available at prediction time - ensure all features respect this
- Have someone else review your code for leakage - it's surprisingly easy to miss
- Data leakage often happens subtly - be paranoid about it
- Future information bleeding into features is common - think through your feature creation carefully
- Using test metrics for any decision-making causes leakage - reserve test data for final evaluation only
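One way to enforce the split-first, fit-on-train discipline is to wrap preprocessing and the model in a single sklearn Pipeline, sketched below on synthetic data with injected missing values. The step names and the choice of model are arbitrary.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with about 5% of values knocked out.
X, y = make_classification(n_samples=600, n_features=10, random_state=0)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.05] = np.nan

# Split first; the test set stays untouched until the very end.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

# Inside cross-validation the imputer and scaler are refit on each training fold,
# so the held-out fold never influences the preprocessing parameters.
cv_scores = cross_val_score(pipeline, X_train, y_train, cv=5)

# Final fit on all training data, then a single evaluation on the holdout.
pipeline.fit(X_train, y_train)
test_score = pipeline.score(X_test, y_test)
gap = abs(cv_scores.mean() - test_score)

print(f"CV accuracy: {cv_scores.mean():.3f}")
print(f"holdout accuracy: {test_score:.3f}")
print(f"gap: {gap:.3f}")
```

A gap well beyond a few percentage points between the cross-validation mean and the holdout score is the kind of suspicious difference worth investigating. The same fitted `pipeline` object is what you would ship, so production inputs go through identical preprocessing.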