Machine Learning Data Preprocessing and Feature Engineering

Machine learning data preprocessing and feature engineering are where most ML projects actually succeed or fail. You can have the fanciest algorithm, but if your data is messy and your features don't capture what matters, your model will flop. This guide walks you through the practical steps to clean, transform, and engineer features that make your ML models actually work in production.

Estimated time: 3-4 hours

Prerequisites

  • Basic understanding of Python and pandas library
  • Familiarity with SQL or basic database queries
  • Access to a dataset (CSV, database, or API)
  • Knowledge of what your target variable represents

Step-by-Step Guide

1

Audit Your Raw Data and Identify Quality Issues

Before you touch anything, spend time understanding what you're working with. Load your dataset and check its shape, data types, and basic statistics using pandas info() and describe() methods. Look for the obvious culprits: missing values, duplicates, outliers, and data type mismatches. Run a quick data quality report. Check how many values are null in each column, what percentage of records are duplicates, and whether numeric columns have realistic ranges. If you're working with 100,000 customer records and 40% of a critical column is missing, that's a problem you need to know about upfront. Document everything you find - this becomes your roadmap.
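
The audit described above can be sketched as a small report function. This is a minimal illustration, not a complete scorecard: the `quality_report` helper, the toy DataFrame, and its column names are all made up for the example.

```python
import numpy as np
import pandas as pd


def quality_report(df: pd.DataFrame) -> pd.DataFrame:
    """Summarize dtype, share of missing values, and cardinality per column."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "pct_missing": df.isnull().mean().mul(100).round(1),
        "n_unique": df.nunique(),
    })


# Hypothetical toy data standing in for a real customer table.
df = pd.DataFrame({
    "customer_id": ["a1", "a2", "a2", "a3", "a4"],
    "age": [34, 51, 51, np.nan, 290],          # 290 is an impossible value
    "plan": ["basic", "pro", "pro", None, "basic"],
})

report = quality_report(df)
n_duplicates = int(df.duplicated().sum())      # counts exact duplicate rows

print(report)
print(f"exact duplicate rows: {n_duplicates}")
print(df.describe())                           # the max age stands out here
```

Saving the report alongside your notes gives you a baseline to compare against after each cleaning step.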

Tip
  • Use df.isnull().sum() to quickly spot missing values across all columns
  • Check df.duplicated().sum() to identify exact duplicate rows
  • Run df.describe() to spot statistical anomalies and impossible value ranges
  • Create a data quality scorecard before you start preprocessing
Warning
  • Don't assume missing values are random - they often signal collection problems
  • Duplicates might be legitimate (same customer, multiple transactions) or errors
  • Outliers can be real business events or data entry mistakes - investigate, don't just delete
2

Handle Missing Values Strategically

Missing data isn't one-size-fits-all. The strategy depends on how much is missing and why. For columns missing less than 5%, dropping the affected rows might work. For 5-30% missing, imputation makes sense. Above 30%, you're usually better off dropping the feature entirely unless it's critical. Treat these cutoffs as rules of thumb, not hard limits. Choose your imputation method based on data patterns. Mean or median imputation works for numeric features, but if your data is time-ordered, forward fill (using the previous value) preserves structure better, and KNN imputation (using similar records) does the same when rows resemble each other. For categorical data, mode imputation or creating a 'missing' category keeps information intact. Test which approach minimizes error on your validation set - there's no universal right answer.
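
Here is a minimal sketch of median imputation with a missingness flag, assuming a single numeric feature. The arrays are toy values; the point is that the imputer learns its statistic from the training data and reuses it on the test data.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy single-feature data; the values are illustrative only.
X_train = np.array([[1.0], [2.0], [np.nan], [100.0]])
X_test = np.array([[np.nan], [5.0]])

# add_indicator=True appends a binary "was missing" column for each
# feature that had missing values during fit.
imputer = SimpleImputer(strategy="median", add_indicator=True)

X_train_imp = imputer.fit_transform(X_train)   # learns the median from train only
X_test_imp = imputer.transform(X_test)         # reuses the training median

print(imputer.statistics_)   # the learned training median
print(X_test_imp)            # imputed value plus the missingness flag
```

The median is used here because the toy column has an extreme value; swap in `strategy="mean"` or `KNNImputer` and compare on your validation set.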

Tip
  • Use SimpleImputer with strategy='mean' for normal numeric distributions
  • Try KNNImputer for features with strong correlations between records
  • Consider creating a 'was_missing' binary flag - missingness itself can be predictive
  • Never impute test data using test set statistics, use training set parameters only
Warning
  • Mean imputation reduces variance and can underestimate uncertainty
  • Deleting rows with any missing values often removes 50%+ of data unnecessarily
  • Don't use test set statistics for imputation - that's data leakage
3

Remove or Fix Duplicate and Inconsistent Records

Duplicates inflate your dataset and bias your model. But first, decide what 'duplicate' means for your problem. Exact row duplicates are obvious, but what about the same customer appearing twice with slight spelling variations in their name? These semantic duplicates matter. For exact duplicates, pandas drop_duplicates() handles it fast. For fuzzy matching (similar but not identical records), use libraries like difflib or fuzzywuzzy to find approximate matches. Clean inconsistencies systematically - standardize date formats, fix case sensitivity in categorical values, and trim whitespace. A customer record with 'New York' and another with 'new york' should map to the same thing. Standardize before you aggregate or model.
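
As a rough sketch of that workflow, the example below normalizes text, removes exact duplicates, then flags near-duplicates for review using difflib from the standard library. The data, the `similarity` helper, and the 0.9 cutoff are assumptions for illustration.

```python
from difflib import SequenceMatcher

import pandas as pd

# Hypothetical records with case, whitespace, and spelling variations.
df = pd.DataFrame({
    "customer": ["Jon Smith", "jon smith ", "John Smith", "Ana Lopez"],
    "city": ["New York", " new york", "New York", "Boston"],
})

# Normalize case and whitespace so trivially different strings compare equal.
for col in ["customer", "city"]:
    df[col] = df[col].str.strip().str.lower()

exact_dupes = df.duplicated(keep="first")
deduped = df[~exact_dupes].reset_index(drop=True)


def similarity(a: str, b: str) -> float:
    """Return a ratio in [0, 1]; 1.0 means the strings are identical."""
    return SequenceMatcher(None, a, b).ratio()


THRESHOLD = 0.9  # assumed cutoff; tune it on pairs you have labeled by hand
names = deduped["customer"].tolist()

# Collect candidate pairs for manual review instead of deleting them outright.
candidates = [
    (a, b)
    for i, a in enumerate(names)
    for b in names[i + 1:]
    if similarity(a, b) >= THRESHOLD
]
print(candidates)
```

Comparing every pair is quadratic, so on large tables you would normally compare only within blocks (same city, same postcode) first.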

Tip
  • Use df.duplicated(subset=['customer_id', 'transaction_date']) to find logical duplicates, then df.drop_duplicates() with the same subset to remove them
  • Standardize text with .str.lower() and .str.strip() before comparisons
  • Use the fuzzywuzzy library (now published as thefuzz) to match customer names at 90%+ similarity
  • Check for whitespace and special characters that break exact matching
Warning
  • Deleting all exact duplicates can remove legitimate repeat purchases or visits
  • Case sensitivity and whitespace often hide duplicates from basic detection
  • Be careful with fuzzy matching thresholds - too high misses real duplicates, too low creates false positives
4

Standardize and Normalize Your Data Types

A numeric column that's actually stored as text won't work in your model. Check dtypes on all columns and fix mismatches. Convert strings to datetime if you're working with timestamps, ensure IDs stay as strings (not floats), and turn categorical columns into proper categorical types. Numeric normalization comes later, but type consistency comes now. Boolean columns should be 0/1, not 'Yes'/'No'. Dates should be datetime objects so you can extract day-of-week or time-since features. If a column should be categorical but pandas read it as numeric, convert it. These seemingly small fixes prevent hours of debugging downstream.
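
The conversions above might look like this in pandas. It is a sketch on invented columns; `errors="coerce"` is one reasonable choice that turns unparseable values into missing values so they show up in your audit instead of raising.

```python
import pandas as pd

# Hypothetical raw table where everything arrived as text.
raw = pd.DataFrame({
    "customer_id": ["00123", "00456", "00789"],
    "signup_date": ["2023-01-15", "2023-02-28", "not a date"],
    "amount": ["19.99", "5", "n/a"],
    "is_active": ["Yes", "No", "Yes"],
    "plan": ["basic", "pro", "basic"],
})

df = raw.copy()

# IDs stay as strings so leading zeros survive. When reading a CSV,
# pass dtype={"customer_id": str} to read_csv for the same effect.
df["customer_id"] = df["customer_id"].astype(str)

# Unparseable dates become NaT and unparseable numbers become NaN.
df["signup_date"] = pd.to_datetime(df["signup_date"], errors="coerce")
df["amount"] = pd.to_numeric(df["amount"], errors="coerce")

# Map Yes/No to 1/0 and mark the plan column as categorical.
df["is_active"] = df["is_active"].map({"Yes": 1, "No": 0}).astype("int8")
df["plan"] = df["plan"].astype("category")

print(df.dtypes)
```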

Tip
  • Use pd.to_datetime() for timestamp columns to enable time-based operations
  • Convert low-cardinality string columns to category type for memory efficiency
  • Make sure model inputs are numeric (for example astype('int64') or astype('float32')) - most sklearn estimators can't fit on raw strings
  • Check df.dtypes immediately after loading data to catch issues early
Warning
  • Don't convert ID columns to numeric - you'll lose leading zeros and create mismatches
  • Assign category types before analysis so groupby, merge, and encoding steps all see consistent dtypes
  • Scientific notation in CSV files often reads as floats instead of integers
5

Detect and Handle Outliers Appropriately

Outliers aren't automatically bad data. A customer spending $50,000 in one transaction is an outlier, but it's real and important. The question is whether they break your model's assumptions or represent legitimate business events. For tree-based models, outliers matter less. For linear models or distance-based algorithms, they can dominate. Use the IQR (Interquartile Range) method to identify outliers: flag values beyond 1.5 * IQR from the quartiles. Then investigate each flagged column. If 95% of your users spend under $100 monthly but a few spend $50,000, those aren't errors - they're your high-value customers. Cap extreme outliers (winsorization) or log-transform skewed features instead of deleting them. The goal is making features work with your algorithm, not pretending edge cases don't exist.
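
A small sketch of the IQR rule, capping, and log transform on a toy spend column. The numbers are invented, and capping at the IQR fences is just one possible winsorization choice.

```python
import numpy as np
import pandas as pd

# Hypothetical monthly spend with one very large but legitimate customer.
spend = pd.Series([20, 35, 40, 55, 60, 75, 80, 95, 50_000], dtype="float64")

q1 = spend.quantile(0.25)
q3 = spend.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Flag first, then investigate; flagged rows are not deleted here.
is_outlier = (spend < lower) | (spend > upper)

# Option 1: cap values at the IQR fences (a form of winsorization).
capped = spend.clip(lower=lower, upper=upper)

# Option 2: compress the right tail with log(1 + x), keeping the ordering.
logged = np.log1p(spend)

print(f"bounds: [{lower}, {upper}], flagged: {int(is_outlier.sum())}")
```

In a real project, compute the bounds on training data only and reuse them on new data, just like imputation and scaling parameters.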

Tip
  • Calculate IQR bounds: Q1 - 1.5*IQR and Q3 + 1.5*IQR to identify outlier ranges
  • Use log transformation (np.log1p) for right-skewed features with extreme values
  • Create separate models for high-value segments if outliers represent distinct populations
  • Document which features had outliers handled and how - this matters for production
Warning
  • Deleting outliers reduces your dataset and loses valuable information
  • Outliers sometimes signal data collection errors, but sometimes signal real patterns
  • Hard caps on outliers (like capping at 99th percentile) can hurt model performance
6

Engineer Features from Raw Columns

Raw data rarely gives you the features you need. This is where you create signal. From a 'signup_date' column, extract account age, tenure in months, and whether they signed up during peak season. From 'transaction_amount', calculate log-transformed versions, moving averages, or ratios against customer average spend. Focus on features that capture business logic. If you're predicting churn, features like 'days_since_last_purchase', 'purchase_frequency', and 'spending_trend' matter way more than raw transaction counts. Use domain knowledge here - talk to your business stakeholders. They'll point out patterns your data alone won't reveal. Start with 2-3 features you're confident about, validate they improve your model, then expand systematically.
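
To make this concrete, here is a sketch that derives per-customer aggregates, recency, temporal flags, and a ratio feature from a toy transaction table. The table, the column names, and the `snapshot` date are assumptions; the snapshot marks the point after which no data may be used, which guards against leakage from the future.

```python
import pandas as pd

# Hypothetical transaction log.
tx = pd.DataFrame({
    "customer_id": ["a", "a", "a", "b", "b"],
    "date": pd.to_datetime(
        ["2024-01-05", "2024-02-10", "2024-03-01", "2024-01-20", "2024-01-27"]
    ),
    "amount": [20.0, 40.0, 60.0, 10.0, 30.0],
})
snapshot = pd.Timestamp("2024-03-31")  # assumed prediction date

# Per-customer aggregates (one row per customer).
features = tx.groupby("customer_id").agg(
    total_spend=("amount", "sum"),
    avg_spend=("amount", "mean"),
    n_purchases=("amount", "size"),
    last_purchase=("date", "max"),
)
features["days_since_last_purchase"] = (
    snapshot - features["last_purchase"]
).dt.days
features = features.drop(columns="last_purchase")

# Row-level temporal and ratio features.
tx["day_of_week"] = tx["date"].dt.dayofweek          # Monday = 0
tx["is_weekend"] = (tx["day_of_week"] >= 5).astype(int)
tx["amount_vs_avg"] = tx["amount"] / tx.groupby("customer_id")["amount"].transform("mean")

print(features)
```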

Tip
  • Extract temporal features: day_of_week, is_weekend, month, quarter from timestamps
  • Calculate aggregated features: sum, mean, max, min, std of groups (purchase totals per customer)
  • Create ratio features: current_purchase / average_purchase to capture relative behavior
  • Use domain logic: 'is_premium_customer' if spending > 90th percentile makes interpretation easier
Warning
  • Don't create features that directly contain your target variable - that's leakage
  • Features from future data (predicting churn but using post-churn data) introduce leakage
  • Too many engineered features cause overfitting - start conservative and validate each addition
7

Encode Categorical Variables Correctly

Categorical data needs conversion to numbers for most ML algorithms. You have options: one-hot encoding, ordinal encoding, target encoding, and frequency encoding. Each works in different situations. One-hot encoding creates a binary column for each category - a good default for linear models and when categories have no inherent order. Ordinal encoding (0, 1, 2, etc.) works when categories have a ranking (small, medium, large). Target encoding (mean target value per category) is powerful but risks overfitting on rare categories. Frequency encoding (how often each category appears) works when popularity predicts your target. For high-cardinality features (50+ unique categories), group rare categories into 'other' before encoding, or use target/frequency encoding to avoid creating 50 or more new columns. Test which encoding method improves your model's validation score.
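
The sketch below shows three of these encodings fitted on training data and applied to a test row containing a category never seen in training. The toy frames and column names are illustrative assumptions.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

train = pd.DataFrame({
    "color": ["red", "blue", "red"],
    "size": ["small", "large", "medium"],
})
test = pd.DataFrame({"color": ["green"], "size": ["medium"]})  # "green" is unseen

# One-hot: unseen categories become an all-zero row instead of an error.
onehot = OneHotEncoder(handle_unknown="ignore")
onehot.fit(train[["color"]])
color_test = onehot.transform(test[["color"]]).toarray()

# Ordinal: spell out the order explicitly; the default order is alphabetical.
ordinal = OrdinalEncoder(categories=[["small", "medium", "large"]])
size_train = ordinal.fit_transform(train[["size"]])

# Frequency: share of training rows per category, 0 for unseen categories.
freq = train["color"].value_counts(normalize=True)
color_freq_test = test["color"].map(freq).fillna(0.0)

print(color_test, size_train.ravel(), color_freq_test.tolist())
```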

Tip
  • Use pd.get_dummies() for one-hot encoding, and pass drop_first=True for linear models to avoid multicollinearity
  • For tree models, ordinal encoding of high-cardinality features often outperforms one-hot
  • Use target encoding cautiously with cross-validation to prevent overfitting on training data
  • Group categories with less than 1% frequency into 'other' to reduce dimensionality
Warning
  • One-hot encoding high-cardinality features creates thousands of columns and sparse data
  • Target encoding computed on the same rows it encodes leaks the target into the features - compute it out-of-fold with cross-validation
  • Never fit encoders on test data - fit on training data, then apply to test
8

Scale and Normalize Numeric Features

If one feature ranges 0-1 and another ranges 0-1,000,000, distance-based algorithms (KNN, SVM) will weight the larger one heavily regardless of actual importance, and gradient-trained models such as neural networks tend to converge more slowly. Scaling fixes this. StandardScaler rescales to mean 0 and standard deviation 1 - a sensible default for roughly bell-shaped features. MinMaxScaler scales to the 0-1 range - good when you need bounded inputs (it shifts the data, so zeros are only preserved when the training minimum is 0). RobustScaler handles outliers better by using median and interquartile range instead of mean and std dev. Tree-based models (Random Forest, XGBoost, LightGBM) don't need scaling - they split on feature values, not distances. Apply scaling after train-test split, fit the scaler on training data only, then transform both train and test sets. This prevents data leakage and ensures test data is scaled with the same parameters as your training data.
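
A minimal sketch of the split-then-scale order, using randomly generated features on very different scales. The data is synthetic and the split sizes are arbitrary.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic features: one in [0, 1], one in [0, 1,000,000].
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.uniform(0, 1, 200),
    rng.uniform(0, 1_000_000, 200),
])
y = rng.integers(0, 2, 200)

# Split first so the test rows never influence the scaling parameters.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

scaler = StandardScaler().fit(X_train)   # mean and std come from train only
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)      # test reuses the training parameters

print(X_train_s.mean(axis=0).round(6), X_train_s.std(axis=0).round(6))
```

The test set will not have exactly mean 0 and std 1 after this, and that is expected.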

Tip
  • Fit StandardScaler on training data with scaler.fit(X_train), then transform both sets
  • Use RobustScaler if your data has outliers that would distort StandardScaler's mean and standard deviation
  • Don't scale target variable unless using neural networks - it complicates interpretation
  • Keep scaler objects for production - you'll need identical scaling on new data
Warning
  • Fitting scaler on full dataset before splitting causes train-test leakage
  • Some algorithms need scaling, others don't - check sklearn docs for your specific model
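  • Dividing by the standard deviation by hand produces NaN or inf for zero-variance columns; sklearn's scalers guard against this, but constant columns carry no signal, so drop them before scaling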
9

Create Interaction and Polynomial Features Strategically

Sometimes two features together matter more than separately. Customer spending * customer tenure might predict lifetime value better than either alone. These interaction features capture non-linear relationships. Polynomial features (x^2, x^3) let linear models capture curved patterns while staying linear in their coefficients. But here's the trap: take 10 features, add every pairwise interaction and squared term, and you now have 65 features; go to degree 3 and it's 285. Most won't help, and you'll overfit. Use domain knowledge to guide feature interactions. If you think age and income together predict purchasing power, create age*income. Test each interaction on validation data. If it doesn't improve your model, remove it. Start with your top 3-5 most important features and create interactions between those, not every possible pair.
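
To see how quickly the feature count grows, the sketch below counts the output columns of sklearn's PolynomialFeatures on 10 inputs and contrasts that with a single hand-picked interaction. The array is dummy data, and treating columns 0 and 1 as age and income is an assumption for the example.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.arange(20, dtype=float).reshape(2, 10)   # 2 rows, 10 dummy features

# All degree-2 terms: 10 originals + 10 squares + 45 pairwise products.
full = PolynomialFeatures(degree=2, include_bias=False).fit(X)

# Pairwise products only, no squares: 10 originals + 45 products.
pairs_only = PolynomialFeatures(
    degree=2, interaction_only=True, include_bias=False
).fit(X)

print(full.n_output_features_)        # 65
print(pairs_only.n_output_features_)  # 55

# A single domain-driven interaction is often the better starting point.
age, income = X[:, 0], X[:, 1]        # hypothetical meaning of columns 0 and 1
age_x_income = age * income
```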

Tip
  • Use sklearn's PolynomialFeatures for systematic generation, then validate each feature
  • Create domain-specific interactions: spending*frequency for engagement score
  • Normalize features before creating polynomial features to keep scales manageable
  • Use feature selection tools to identify which interactions actually improve predictions
Warning
  • Polynomial expansion grows fast - a degree-2 expansion of a 100-feature set yields 5,000+ columns
  • Most interactions won't improve your model - they just add noise and overfitting
  • High-degree polynomials (x^4, x^5) rarely help and often hurt generalization
10

Handle Imbalanced Classes in Your Target Variable

If you're predicting fraud and 99% of transactions are legitimate, a model predicting 'not fraud' everywhere gets 99% accuracy but catches zero fraud. That's not useful. Imbalanced classification requires special handling. Check your target variable distribution first - anything more skewed than roughly 70-30 usually needs attention. You have several options. Oversampling creates copies of minority class samples (or synthetic ones via SMOTE). Undersampling removes majority class samples - faster but loses data. Class weights tell your model 'penalize mistakes on rare classes more heavily'. Threshold adjustment changes what probability counts as positive - if default is 50%, moving to 30% catches more fraud but increases false positives. Combine methods based on your problem: fraud detection needs high recall (catch fraud), marketing needs high precision (don't waste money on unlikely converters).
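
Two of these options, class weights and threshold adjustment, can be sketched with sklearn alone. The dataset is synthetic (about 5% positives) and the thresholds are the ones from the text; resampling with SMOTE is left out to keep the example dependency-free.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data: roughly 95% negatives, 5% positives.
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# class_weight="balanced" penalizes mistakes on the rare class more heavily.
model = LogisticRegression(class_weight="balanced", max_iter=1000)
model.fit(X_train, y_train)
proba = model.predict_proba(X_val)[:, 1]

# Lowering the threshold can only keep or raise recall; precision usually drops.
recalls, precisions = {}, {}
for threshold in (0.5, 0.3):
    pred = (proba >= threshold).astype(int)
    recalls[threshold] = recall_score(y_val, pred)
    precisions[threshold] = precision_score(y_val, pred, zero_division=0)
    print(threshold, round(precisions[threshold], 3), round(recalls[threshold], 3))
```

Pick the threshold on validation data using the metric your use case cares about, not accuracy.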

Tip
  • Use imblearn.over_sampling.SMOTE for synthetic minority oversampling
  • Set class_weight='balanced' in sklearn models to auto-weight by class frequency
  • Use stratified splits (StratifiedKFold for cross-validation, stratify=y in train_test_split) to maintain class ratios
  • Calculate precision, recall, and F1 score - accuracy alone is misleading for imbalanced data
Warning
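  • Oversampling before cross-validation puts copies (or synthetic neighbors) of the same samples in both training and validation folds, inflating scores - resample inside each training fold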
  • Don't apply SMOTE or oversampling to test set - only training data
  • Class weights and resampling both change your model's probability calibration
11

Perform Feature Selection to Reduce Dimensionality

You engineered dozens of features. Now half of them are noise. Feature selection removes features that don't help prediction, which simplifies your model, reduces overfitting, and speeds up training. You have several approaches: univariate statistical tests, model-based feature importance, and iterative elimination. For univariate, use SelectKBest with f_classif (classification) or f_regression (regression) to score features independently. For model-based, train a simple model (linear regression, Random Forest) and use its feature importance scores. Recursive Feature Elimination (RFE) repeatedly trains models and removes the weakest feature until you hit your target count. Start by removing obviously weak features (near-zero variance, high correlation with others), then use domain knowledge and validation performance to guide further elimination.
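
A short sketch of the first two stages: drop zero-variance columns, then score the rest with a univariate test. The data is synthetic, and k=5 is an arbitrary choice for the example. In a real project, fit the selectors on training data only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, VarianceThreshold, f_classif

# Synthetic data: 20 features, 5 of them informative, plus one constant column.
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=5, n_redundant=0, random_state=0
)
X = np.hstack([X, np.ones((X.shape[0], 1))])

# Stage 1: remove columns with zero variance (the constant column).
variance_filter = VarianceThreshold(threshold=0.0)
X_var = variance_filter.fit_transform(X)

# Stage 2: keep the k features with the highest ANOVA F-score.
selector = SelectKBest(f_classif, k=5).fit(X_var, y)
selected = selector.get_support(indices=True)

print(X.shape[1], "->", X_var.shape[1], "->", len(selected))
print("selected column indices:", selected)
```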

Tip
  • Calculate feature correlation matrix to identify redundant features with high correlation
  • Use SelectKBest(f_classif, k=20) to pick top-k features for initial screening
  • Random Forest feature_importances_ gives a quick first estimate, but it favors high-cardinality features - cross-check with permutation importance
  • Eliminate correlated features - keep the one with higher correlation to target
Warning
  • Feature importance varies by model type - Random Forest ranks differently than linear models
  • Don't choose feature selection threshold based on test set - use validation set only
  • Removing features that have low individual importance can hurt if they interact with others
12

Create Your Preprocessing and Feature Engineering Pipeline

Now package all these steps into a reproducible pipeline. sklearn's Pipeline class chains preprocessing and modeling steps so you apply identical transformations to train and test data. This prevents data leakage and makes your code cleaner. Your pipeline might look like: imputation -> outlier handling -> feature engineering -> encoding -> scaling -> feature selection. Save your fitted pipeline objects (preprocessor pickle files) for production. When new data arrives, load the same pipeline that was trained on your original data. This ensures consistency. Document each step, parameters used, and why you chose them. Future you will thank present you when you need to retrain or debug.
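
Here is one way such a pipeline could be assembled, shown as a sketch on a tiny invented churn table. The column names, the model choice, and the file name are assumptions; a real pipeline would add your own outlier handling and feature engineering steps.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical training table.
df = pd.DataFrame({
    "age": [25, 40, np.nan, 33, 58, 47, 29, 61],
    "monthly_spend": [20.0, 250.0, 80.0, np.nan, 40.0, 300.0, 15.0, 500.0],
    "plan": ["basic", "pro", "basic", "pro", np.nan, "pro", "basic", "pro"],
    "churned": [1, 0, 1, 0, 1, 0, 1, 0],
})
X, y = df.drop(columns="churned"), df["churned"]

numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
categorical = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])
preprocess = ColumnTransformer([
    ("num", numeric, ["age", "monthly_spend"]),
    ("cat", categorical, ["plan"]),
])
model = Pipeline([("preprocess", preprocess), ("clf", LogisticRegression())])
model.fit(X, y)   # every transformer is fitted on the training data only

# New data goes through the same fitted steps, including unseen categories.
new_rows = pd.DataFrame(
    {"age": [np.nan], "monthly_spend": [120.0], "plan": ["enterprise"]}
)
pred = model.predict(new_rows)

# Persist and reload the whole fitted pipeline as a single artifact.
with tempfile.TemporaryDirectory() as tmp:
    path = str(Path(tmp) / "churn_pipeline.joblib")   # hypothetical file name
    joblib.dump(model, path)
    reloaded = joblib.load(path)
pred_reloaded = reloaded.predict(new_rows)
```

Because preprocessing lives inside the pipeline, passing `model` to cross-validation refits every step per fold, which is what keeps validation scores honest.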

Tip
  • Use Pipeline to chain ColumnTransformer for different feature processing
  • Pickle your fitted pipeline with joblib.dump() for easy production deployment
  • Test your pipeline on a holdout set to catch leakage before going live
  • Log all hyperparameters and transformation steps in your experiment tracker
Warning
  • Never call fit() or fit_transform() on test data - transformers fitted on training data should only transform() it
  • Saved pipelines are version-specific - document sklearn version requirements
  • Pipeline steps execute sequentially, so order matters - for example, variance-based feature selection behaves differently before and after scaling
13

Validate Your Preprocessing Against Your Model Performance

The best preprocessing isn't always obvious. A feature you thought would help might hurt. Missing value imputation strategy A might outperform strategy B. The only real test is model performance. Set up cross-validation with your full pipeline and measure your target metric (accuracy, F1, AUC, RMSE - depends on your problem). Run A/B experiments: train two models with different preprocessing approaches on the same cross-validation folds and compare validation scores. If dropping outliers consistently gives 2% better performance than capping them, drop them. If target encoding consistently outperforms one-hot encoding by 1%, use target encoding. Keep detailed logs of which approach won. This becomes institutional knowledge for future projects. What works for predicting churn might not work for predicting purchase value.
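
An A/B comparison of this kind might be sketched as follows: two imputation strategies, identical folds, same metric. The data is synthetic with about 10% of values knocked out at random, and the model and metric are placeholder choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with roughly 10% of values removed at random.
X, y = make_classification(n_samples=600, n_features=10, random_state=0)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.1] = np.nan

# Fixed folds so both variants are scored on exactly the same splits.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

results = {}
for strategy in ("mean", "median"):
    pipe = make_pipeline(
        SimpleImputer(strategy=strategy),
        StandardScaler(),
        LogisticRegression(max_iter=1000),
    )
    scores = cross_val_score(pipe, X, y, cv=cv, scoring="roc_auc")
    results[strategy] = (scores.mean(), scores.std())

for strategy, (mean_auc, std_auc) in results.items():
    print(f"{strategy}: AUC {mean_auc:.3f} +/- {std_auc:.3f}")
```

If the gap between the two means is smaller than the fold-to-fold standard deviation, treat the variants as equivalent and keep the simpler one.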

Tip
  • Run cross-validation (cv=5 minimum) to get reliable performance estimates
  • Log preprocessing parameters and resulting model performance in a results table
  • Test on validation set, not test set, to avoid overfitting to test data
  • Document which preprocessing choices actually improved performance vs. intuition
Warning
  • Don't tune preprocessing on test set - only validation set
  • A small performance improvement might be noise - compare it against the spread of scores across folds, or use repeated cross-validation
  • Preprocessing that improves training accuracy but hurts validation accuracy is overfitting

Frequently Asked Questions

What's the difference between data preprocessing and feature engineering?
Preprocessing cleans raw data to make it usable - handling missing values, duplicates, and scaling. Feature engineering creates new predictive features from raw columns, like extracting day-of-week from timestamps or calculating spending ratios. Preprocessing fixes problems; feature engineering creates signal.
How much missing data is too much to impute?
Generally, impute when less than 30% is missing. Below 5%, dropping the affected rows is acceptable. Above 30%, imputed values start to dominate the column, so dropping it is usually safer. But context matters - if missing values themselves predict your target, keep them as a 'was_missing' flag even if 50% are missing.
Should I scale data before or after feature engineering?
Perform feature engineering first (creating new features), then scale. Scaling engineered features ensures they're on the same scale as raw features. Always fit scalers on training data only, then apply to test data. This prevents data leakage.
Can too many features hurt my model?
Yes. Too many features cause overfitting, slower training, and poor test performance. Use feature selection to keep only the most predictive features. Start with 20-30 features max and expand only if validation performance improves. Fewer, better features usually beat more, mediocre ones.
How do I handle categorical variables with 100+ unique values?
Group rare categories (under 1% frequency) into an 'other' category first. Then use target encoding or frequency encoding instead of one-hot encoding. One-hot encoding creates too many sparse columns. Target encoding captures predictive information in fewer dimensions.
