Machine Learning Data Preprocessing and Feature Engineering

Data preprocessing and feature engineering are where most ML projects succeed or fail. You can have the fanciest algorithm, but if your data is messy and your features don't capture what matters, your model will flop. This guide walks you through the practical steps to clean, transform, and engineer features so your ML models hold up in production.

Estimated time: 3-4 hours

Prerequisites

Basic understanding of Python and pandas library
Familiarity with SQL or basic database queries
Access to a dataset (CSV, database, or API)
Knowledge of what your target variable represents

Step-by-Step Guide

Audit Your Raw Data and Identify Quality Issues

Before you touch anything, spend time understanding what you're working with. Load your dataset and check its shape, data types, and basic statistics using pandas info() and describe() methods. Look for the obvious culprits: missing values, duplicates, outliers, and data type mismatches. Run a quick data quality report. Check how many values are null in each column, what percentage of records are duplicates, and whether numeric columns have realistic ranges. If you're working with 100,000 customer records and 40% of a critical column is missing, that's a problem you need to know about upfront. Document everything you find - this becomes your roadmap.
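The audit described above can be sketched in a few lines of pandas. The toy DataFrame and the quality_report helper are hypothetical, made up for illustration; swap in your own data and add whatever checks your domain needs.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for a real customer table.
df = pd.DataFrame({
    "customer_id": ["001", "002", "002", "003", "004"],
    "age": [34, 51, 51, np.nan, 230],            # 230 is an impossible value
    "monthly_spend": [42.0, 18.5, 18.5, np.nan, 99.0],
})

def quality_report(frame: pd.DataFrame) -> pd.DataFrame:
    """Per-column dtype, null count, null percentage, and unique count."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "n_null": frame.isnull().sum(),
        "pct_null": (frame.isnull().mean() * 100).round(1),
        "n_unique": frame.nunique(),
    })

report = quality_report(df)
duplicate_rows = int(df.duplicated().sum())      # exact duplicate rows only

print(report)
print("duplicate rows:", duplicate_rows)
print(df.describe())                             # scan min/max for unrealistic ranges
```

Saving the report alongside your notes gives you a baseline to compare against after each cleaning step.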

Tip

Use df.isnull().sum() to quickly spot missing values across all columns
Check df.duplicated().sum() to identify exact duplicate rows
Run df.describe() to spot statistical anomalies and impossible value ranges
Create a data quality scorecard before you start preprocessing

Warning

Don't assume missing values are random - they often signal collection problems
Duplicates might be legitimate (same customer, multiple transactions) or errors
Outliers can be real business events or data entry mistakes - investigate, don't just delete

Handle Missing Values Strategically

Missing data isn't one-size-fits-all. The strategy depends on how much is missing and why. As rough rules of thumb: for columns missing less than 5%, dropping the affected rows might work. For 5-30% missing, imputation makes sense. Above 30%, you're usually better off dropping the feature entirely unless it's critical. Choose your imputation method based on data patterns. Mean or median imputation works for numeric features, but if your data is time-ordered with seasonal patterns, forward fill (using the previous value) or KNN imputation (using similar records) preserves structure better. For categorical data, mode imputation or creating a 'missing' category keeps information intact. Test which approach minimizes error on your validation set - there's no universal right answer.
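Here is a minimal sketch of median imputation with a missingness flag, assuming a single hypothetical 'income' column. The key point is that the imputer learns its fill value from the training rows only and then reuses it on the test rows.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical train/test frames; 'income' is an illustrative column name.
train = pd.DataFrame({"income": [40.0, 50.0, np.nan, 60.0, 1000.0]})
test = pd.DataFrame({"income": [np.nan, 55.0]})

# Record missingness before imputing, since the flag itself can be predictive.
train["income_was_missing"] = train["income"].isnull().astype(int)
test["income_was_missing"] = test["income"].isnull().astype(int)

# Median is less sensitive than mean to the extreme value (1000.0).
imputer = SimpleImputer(strategy="median")
train["income"] = imputer.fit_transform(train[["income"]]).ravel()  # fit on train only
test["income"] = imputer.transform(test[["income"]]).ravel()        # reuse train median

print(train)
print(test)
```

Swapping SimpleImputer for KNNImputer keeps the same fit/transform pattern, so you can compare strategies on your validation set without changing the surrounding code.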

Tip

Use SimpleImputer with strategy='mean' for normal numeric distributions
Try KNNImputer for features with strong correlations between records
Consider creating a 'was_missing' binary flag - missingness itself can be predictive
Never impute test data using test set statistics - use training set parameters only

Warning

Mean imputation reduces variance and can underestimate uncertainty
Deleting rows with any missing values often removes 50%+ of data unnecessarily
Don't use test set statistics for imputation - that's data leakage

Remove or Fix Duplicate and Inconsistent Records

Duplicates inflate your dataset and bias your model. But first, decide what 'duplicate' means for your problem. Exact row duplicates are obvious, but what about the same customer appearing twice with slight spelling variations in their name? These semantic duplicates matter. For exact duplicates, pandas drop_duplicates() handles it fast. For fuzzy matching (similar but not identical records), use libraries like difflib or fuzzywuzzy to find approximate matches. Clean inconsistencies systematically - standardize date formats, fix case sensitivity in categorical values, and trim whitespace. A customer record with 'New York' and another with 'new york' should map to the same thing. Standardize before you aggregate or model.
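The sketch below standardizes text, drops exact duplicates, and then flags near-duplicate names with the standard library's difflib. The sample records and the 0.9 threshold are assumptions for illustration; tune the threshold on your own data and review flagged pairs before merging anything.

```python
import difflib
import pandas as pd

# Hypothetical customer records with case, whitespace, and spelling variations.
df = pd.DataFrame({
    "name": ["Jon Smith", "jon smith ", "John Smith", "Ana Lopez"],
    "city": ["New York", " new york", "New York", "Boston"],
})

# Standardize text before comparing: trim whitespace and lowercase.
for col in ["name", "city"]:
    df[col] = df[col].str.strip().str.lower()

# Exact duplicates become visible only after standardization.
deduped = df.drop_duplicates().reset_index(drop=True)

def similarity(a: str, b: str) -> float:
    """Similarity ratio in [0, 1] from difflib's SequenceMatcher."""
    return difflib.SequenceMatcher(None, a, b).ratio()

# Flag candidate fuzzy duplicates for manual review (do not auto-delete).
names = deduped["name"].tolist()
candidates = [
    (a, b)
    for i, a in enumerate(names)
    for b in names[i + 1:]
    if similarity(a, b) >= 0.9
]

print(deduped)
print(candidates)
```

The pairwise loop is quadratic in the number of names, so on large tables you would typically compare only within blocks (same city, same postcode) rather than across the whole dataset.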

Tip

Use df.duplicated(subset=['customer_id', 'transaction_date']) to find logical duplicates, and df.drop_duplicates() with the same subset to remove them
Standardize text with .str.lower() and .str.strip() before comparisons
Use the fuzzywuzzy library (now maintained under the name thefuzz) for matching customer names with 90%+ similarity
Check for whitespace and special characters that break exact matching

Warning

Deleting all exact duplicates can remove legitimate repeat purchases or visits
Case sensitivity and whitespace often hide duplicates from basic detection
Be careful with fuzzy matching thresholds - too high misses real duplicates, too low creates false positives

Standardize and Normalize Your Data Types

A numeric column that's actually stored as text won't work in your model. Check dtypes on all columns and fix mismatches. Convert strings to datetime if you're working with timestamps, ensure IDs stay as strings (not floats), and turn categorical columns into proper categorical types. Numeric normalization comes later, but type consistency comes now. Boolean columns should be 0/1, not 'Yes'/'No'. Dates should be datetime objects so you can extract day-of-week or time-since features. If a column should be categorical but pandas read it as numeric, convert it. These seemingly small fixes prevent hours of debugging downstream.
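A short sketch of these type fixes follows. The column names and values are hypothetical; the pattern is what matters: IDs stay strings, dates become datetimes, text numbers become numeric, and Yes/No becomes 0/1.

```python
import pandas as pd

# Hypothetical raw frame as it might arrive from a CSV: everything is text.
raw = pd.DataFrame({
    "customer_id": ["00123", "00456", "00789"],
    "signup_date": ["2023-01-15", "2023-06-01", "2024-02-29"],
    "amount": ["19.99", "250", "n/a"],
    "is_active": ["Yes", "No", "Yes"],
    "plan": ["basic", "pro", "basic"],
})

df = raw.copy()
df["customer_id"] = df["customer_id"].astype(str)                 # keep leading zeros
df["signup_date"] = pd.to_datetime(df["signup_date"])             # enables .dt accessors
df["amount"] = pd.to_numeric(df["amount"], errors="coerce")       # 'n/a' becomes NaN
df["is_active"] = df["is_active"].map({"Yes": 1, "No": 0})        # boolean as 0/1
df["plan"] = df["plan"].astype("category")                        # few distinct values

print(df.dtypes)
```

Using errors="coerce" turns unparseable entries into NaN instead of raising, which then feeds into whatever missing-value strategy you chose earlier.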

Tip

Use pd.to_datetime() for timestamp columns to enable time-based operations
Convert low-cardinality string columns (few distinct values, many repeats) to category type for memory efficiency
Use astype('int64') or astype('float32') to give numeric columns an explicit numeric dtype before passing them to sklearn
Check df.dtypes immediately after loading data to catch issues early

Warning

Don't convert ID columns to numeric - you'll lose leading zeros and create mismatches
Category type assignment must happen before analysis to avoid dtype warnings
Scientific notation in CSV files often reads as floats instead of integers

Detect and Handle Outliers Appropriately

Outliers aren't automatically bad data. A customer spending $50,000 in one transaction is an outlier, but it's real and important. The question is whether they break your model's assumptions or represent legitimate business events. For tree-based models, outliers matter less. For linear models or distance-based algorithms, they can dominate. Use the IQR (Interquartile Range) method to identify outliers: flag values beyond 1.5 * IQR from the quartiles. Then investigate each flagged column. If 95% of your users spend under $100 monthly but a few spend $50,000, those aren't errors - they're your high-value customers. Cap extreme outliers (winsorization) or log-transform skewed features instead of deleting them. The goal is making features work with your algorithm, not pretending edge cases don't exist.
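The IQR rule and the two gentler alternatives to deletion can be sketched like this. The spend values are made up, and capping at the IQR fences is just one winsorization-style choice; percentile caps are equally common.

```python
import numpy as np
import pandas as pd

# Hypothetical monthly spend with one very large (but possibly real) value.
spend = pd.Series([20, 35, 40, 45, 50, 55, 60, 80, 50_000], dtype=float)

# IQR fences: values beyond 1.5 * IQR from the quartiles are flagged.
q1, q3 = spend.quantile(0.25), spend.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = (spend < lower) | (spend > upper)

# Option 1: cap extreme values at the fences instead of deleting rows.
capped = spend.clip(lower=lower, upper=upper)

# Option 2: compress the right tail with log(1 + x), keeping the ordering.
logged = np.log1p(spend)

print("fences:", lower, upper)
print("flagged:", spend[is_outlier].tolist())
print("capped max:", capped.max())
```

Whichever option you pick, compute the fences on training data and reuse them on new data, the same way you would with an imputer or scaler.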

Tip

Calculate IQR bounds: Q1 - 1.5*IQR and Q3 + 1.5*IQR to identify outlier ranges
Use log transformation (np.log1p) for right-skewed features with extreme values
Create separate models for high-value segments if outliers represent distinct populations
Document which features had outliers handled and how - this matters for production

Warning

Deleting outliers reduces your dataset and loses valuable information
Outliers sometimes signal data collection errors, but sometimes signal real patterns
Hard caps on outliers (like capping at the 99th percentile) erase differences among the largest values and can hurt model performance when those differences matter

Engineer Features from Raw Columns

Raw data rarely gives you the features you need. This is where you create signal. From a 'signup_date' column, extract account age, tenure in months, and whether the customer signed up during peak season. From 'transaction_amount', calculate log-transformed versions, moving averages, or ratios against customer average spend. Focus on features that capture business logic. If you're predicting churn, features like 'days_since_last_purchase', 'purchase_frequency', and 'spending_trend' matter way more than raw transaction counts. Use domain knowledge here - talk to your business stakeholders. They'll point out patterns your data alone won't reveal. Start with 2-3 features you're confident about, validate they improve your model, then expand systematically.
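As an illustration, the sketch below derives temporal, aggregate, and ratio features from a hypothetical transaction table. The snapshot date is an assumption that stands in for "the moment you make the prediction"; only data up to that date should feed the features.

```python
import pandas as pd

# Hypothetical transaction log; column names are illustrative.
tx = pd.DataFrame({
    "customer_id": ["a", "a", "a", "b", "b"],
    "transaction_date": pd.to_datetime(
        ["2024-01-05", "2024-02-10", "2024-03-01", "2024-01-20", "2024-01-27"]
    ),
    "transaction_amount": [100.0, 50.0, 150.0, 20.0, 40.0],
})
snapshot = pd.Timestamp("2024-03-31")  # assumed prediction date

# Temporal features per transaction (Monday=0 ... Sunday=6).
tx["day_of_week"] = tx["transaction_date"].dt.dayofweek
tx["is_weekend"] = (tx["day_of_week"] >= 5).astype(int)

# Aggregated features per customer.
features = tx.groupby("customer_id").agg(
    total_spend=("transaction_amount", "sum"),
    avg_spend=("transaction_amount", "mean"),
    purchase_count=("transaction_amount", "count"),
    last_purchase=("transaction_date", "max"),
)
features["days_since_last_purchase"] = (snapshot - features["last_purchase"]).dt.days
features = features.drop(columns="last_purchase")

# Ratio feature: each purchase relative to that customer's average.
customer_avg = tx.groupby("customer_id")["transaction_amount"].transform("mean")
tx["amount_vs_avg"] = tx["transaction_amount"] / customer_avg

print(features)
print(tx[["customer_id", "is_weekend", "amount_vs_avg"]])
```

Filtering the transaction table to rows on or before the snapshot date, before aggregating, is what keeps future information out of the features.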

Tip

Extract temporal features: day_of_week, is_weekend, month, quarter from timestamps
Calculate aggregated features: sum, mean, max, min, std of groups (purchase totals per customer)
Create ratio features: current_purchase / average_purchase to capture relative behavior
Use domain logic: 'is_premium_customer' if spending > 90th percentile makes interpretation easier

Warning

Don't create features that directly contain your target variable - that's leakage
Features from future data (predicting churn but using post-churn data) introduce leakage
Too many engineered features cause overfitting - start conservative and validate each addition

Encode Categorical Variables Correctly

Categorical data needs conversion to numbers for most ML algorithms. You have options: one-hot encoding, ordinal encoding, target encoding, and frequency encoding. Each works in different situations. One-hot encoding creates a binary column for each category - a good fit for linear models and for categories with no inherent order. Ordinal encoding (0, 1, 2, etc.) works when categories have a ranking (small, medium, large). Target encoding (mean target value per category) is powerful but risks overfitting on rare categories. Frequency encoding (how often each category appears) works when popularity predicts your target. For high-cardinality features (50+ unique categories), group rare categories into 'other' before encoding, or use target/frequency encoding to avoid creating 50 or more new columns. Test which encoding method improves your model's validation score.
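Three of these encodings are sketched below with sklearn and pandas. The 'color' and 'size' columns are hypothetical. Every encoder is fit on the training frame and only applied to the test frame, including a category ('purple') that training never saw.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical train/test frames.
train = pd.DataFrame({
    "color": ["red", "blue", "green", "red"],
    "size": ["small", "large", "medium", "small"],
})
test = pd.DataFrame({"color": ["blue", "purple"], "size": ["medium", "large"]})

# One-hot: unordered categories; unseen values become an all-zero row.
onehot = OneHotEncoder(handle_unknown="ignore")
color_train = onehot.fit_transform(train[["color"]]).toarray()
color_test = onehot.transform(test[["color"]]).toarray()

# Ordinal: pass the ranking explicitly so small < medium < large.
ordinal = OrdinalEncoder(categories=[["small", "medium", "large"]])
size_train = ordinal.fit_transform(train[["size"]])
size_test = ordinal.transform(test[["size"]])

# Frequency: share of training rows per category; unseen values get 0.
freq = train["color"].value_counts(normalize=True)
color_freq_test = test["color"].map(freq).fillna(0.0)

print(onehot.categories_)
print(color_test)
print(size_test.ravel())
print(color_freq_test.tolist())
```

Passing the category order to OrdinalEncoder matters: without it, sklearn sorts categories alphabetically, which would rank 'large' below 'medium'.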

Tip

Use pd.get_dummies() for one-hot encoding; for unregularized linear models, pass drop_first=True to avoid perfect multicollinearity
For tree models, ordinal encoding of high-cardinality features often outperforms one-hot
Use target encoding cautiously with cross-validation to prevent overfitting on training data
Group categories with less than 1% frequency into 'other' to reduce dimensionality

Warning

One-hot encoding high-cardinality features creates thousands of columns and sparse data
Target encoding without out-of-fold computation leaks each row's own target value into its feature, which inflates validation scores
Never fit encoders on test data - fit on training data, then apply to test

Scale and Normalize Numeric Features

If one feature ranges 0-1 and another ranges 0-1,000,000, distance-based algorithms (KNN, SVM) will weight the larger one heavily regardless of actual importance, and gradient-trained models such as neural networks often train more slowly or less stably. Scaling fixes this. StandardScaler rescales each feature to mean 0 and standard deviation 1 - a sensible default when features are roughly bell-shaped. MinMaxScaler scales to the 0-1 range - good when you need bounded output, though a single extreme value compresses everything else. RobustScaler handles outliers better by using median and interquartile range instead of mean and std dev. Tree-based models (Random Forest, XGBoost, LightGBM) don't need scaling - they split on feature values, not distances. Apply scaling after the train-test split: fit the scaler on training data only, then transform both train and test sets. This prevents data leakage and ensures test data is scaled with the same parameters as the training data.
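The split-then-scale order can be sketched as follows, using synthetic data with two features on very different ranges. The data and variable names are illustrative only.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic features: one in [0, 1], one in [0, 1,000,000].
rng = np.random.default_rng(0)
X = np.column_stack([rng.uniform(0, 1, 200), rng.uniform(0, 1_000_000, 200)])
y = rng.integers(0, 2, 200)

# Split first, so the scaler never sees test rows.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)   # learn mean/std from train only
X_test_scaled = scaler.transform(X_test)         # apply the same mean/std

print("train means:", X_train_scaled.mean(axis=0).round(6))
print("train stds: ", X_train_scaled.std(axis=0).round(6))
print("test means: ", X_test_scaled.mean(axis=0).round(3))
```

The test set means will be close to zero but not exactly zero, which is expected: the test rows are scaled with the training parameters, not their own.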

Tip

Fit StandardScaler on training data with scaler.fit(X_train), then transform both sets
Use RobustScaler if your data has outliers that would distort StandardScaler's mean and standard deviation
Don't scale target variable unless using neural networks - it complicates interpretation
Keep scaler objects for production - you'll need identical scaling on new data

Warning

Fitting scaler on full dataset before splitting causes train-test leakage
Some algorithms need scaling, others don't - check sklearn docs for your specific model
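Hand-rolled scaling like (x - mean) / std produces NaN for zero-variance columns; sklearn's scalers guard against the division by zero, but constant columns carry no signal, so drop them before scaling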

Create Interaction and Polynomial Features Strategically

Sometimes two features together matter more than separately. Customer spending * customer tenure might predict lifetime value better than either alone. These interaction features capture effects that depend on both inputs at once. Polynomial features (x^2, x^3) let linear models fit curved patterns while staying linear in their coefficients. But here's the trap: a full degree-2 expansion of 10 features gives you 65 columns, and degree 3 gives you 285. Most won't help, and you'll overfit. Use domain knowledge to guide feature interactions. If you think age and income together predict purchasing power, create age*income. Test each interaction on validation data. If it doesn't improve your model, remove it. Start with your top 3-5 most important features and create interactions between those, not every possible pair.
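To make the feature growth concrete, here is a small sketch with sklearn's PolynomialFeatures. The two input columns (spending and tenure) are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Two hypothetical features per row: spending, tenure.
X = np.array([[2.0, 3.0],
              [1.0, 4.0]])

# Full degree-2 expansion: originals, squares, and the pairwise product.
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)
# column order: spending, tenure, spending^2, spending*tenure, tenure^2

# Interactions only: originals plus pairwise products, no squares.
inter = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_inter = inter.fit_transform(X)

# Feature-count growth: how many columns does degree 2 produce from 10 inputs?
wide = PolynomialFeatures(degree=2, include_bias=False).fit(np.zeros((1, 10)))
n_wide = wide.n_output_features_

print(X_poly[0])
print(X_inter[0])
print("degree-2 columns from 10 features:", n_wide)
```

Setting interaction_only=True is a cheap way to skip the squared terms when you only care about how features combine.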

Tip

Use sklearn's PolynomialFeatures for systematic generation, then validate each feature
Create domain-specific interactions: spending*frequency for engagement score
Normalize features before creating polynomial features to keep scales manageable
Use feature selection tools to identify which interactions actually improve predictions

Warning

Feature counts grow combinatorially - a degree-2 expansion of a 100-feature set yields 5,000+ columns
Most interactions won't improve your model - they just add noise and overfitting
High-degree polynomials (x^4, x^5) rarely help and often hurt generalization

Handle Imbalanced Classes in Your Target Variable

If you're predicting fraud and 99% of transactions are legitimate, a model predicting 'not fraud' everywhere gets 99% accuracy but catches zero fraud. That's not useful. Imbalanced classification requires special handling. Check your target variable distribution first - anything more skewed than 70-30 usually needs attention. You have several options. Oversampling creates copies of minority class samples (or synthetic ones via SMOTE). Undersampling removes majority class samples - faster but loses data. Class weights tell your model 'penalize mistakes on rare classes more heavily'. Threshold adjustment changes what probability counts as positive - if the default is 50%, moving to 30% catches more fraud but increases false positives. Combine methods based on your problem: fraud detection needs high recall (catch fraud), marketing needs high precision (don't waste money on unlikely converters).
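The sketch below combines a stratified split, class weights, and threshold adjustment using only sklearn. The synthetic dataset (about 5% positives) and the choice of logistic regression are assumptions for illustration, not a recommendation for any particular problem.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data: roughly 95% negatives, 5% positives.
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.95, 0.05], random_state=0
)

# stratify=y keeps the class ratio the same in both splits.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# class_weight='balanced' penalizes minority-class mistakes more heavily.
model = LogisticRegression(class_weight="balanced", max_iter=1000)
model.fit(X_train, y_train)
proba = model.predict_proba(X_val)[:, 1]

# Lowering the threshold trades precision for recall.
metrics = {}
for threshold in (0.5, 0.3):
    pred = (proba >= threshold).astype(int)
    metrics[threshold] = {
        "recall": recall_score(y_val, pred),
        "precision": precision_score(y_val, pred, zero_division=0),
    }
    print(threshold, metrics[threshold])
```

Resampling approaches such as SMOTE follow the same discipline: apply them to the training split only, and evaluate on untouched validation data.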

Tip

Use imblearn.over_sampling.SMOTE for synthetic minority oversampling
Set class_weight='balanced' in sklearn models to auto-weight by class frequency
Use StratifiedKFold for cross-validation and stratify=y in train_test_split to maintain class ratios in splits
Calculate precision, recall, and F1 score - accuracy alone is misleading for imbalanced data

Warning

Oversampling before cross-validation puts copies of the same samples in both training and validation folds, which inflates scores - resample inside each fold instead
Don't apply SMOTE or oversampling to test set - only training data
Class weights and resampling both change your model's probability calibration

Perform Feature Selection to Reduce Dimensionality

You engineered dozens of features. Now half of them are noise. Feature selection removes features that don't help prediction, which simplifies your model, reduces overfitting, and speeds up training. You have several approaches: univariate statistical tests, model-based feature importance, and iterative elimination. For univariate, use SelectKBest with f_classif (classification) or f_regression (regression) to score features independently. For model-based, train a simple model (linear regression, Random Forest) and use its coefficients or feature importance scores. Recursive Feature Elimination (RFE) repeatedly trains models and removes the weakest feature until you hit your target count. Start by removing obviously weak features (near-zero variance, high correlation with others), then use domain knowledge and validation performance to guide further elimination.
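A minimal sketch of the first two screening passes: drop constant columns, then keep the top-k features by univariate F-score. The synthetic data and k=5 are assumptions; in practice you would fit both selectors on training data only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, VarianceThreshold, f_classif

# Synthetic data: 20 features, 5 of them informative, plus one constant column.
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=5, n_redundant=0, random_state=0
)
X = np.hstack([X, np.ones((500, 1))])            # column 20 has zero variance

# Pass 1: remove zero-variance features (they carry no signal).
variance_filter = VarianceThreshold(threshold=0.0)
X_var = variance_filter.fit_transform(X)

# Pass 2: score each remaining feature independently and keep the best 5.
selector = SelectKBest(f_classif, k=5)
X_sel = selector.fit_transform(X_var, y)
kept = selector.get_support(indices=True)        # indices into X_var's columns

print("after variance filter:", X_var.shape)
print("after SelectKBest:   ", X_sel.shape)
print("kept column indices: ", kept.tolist())
```

Univariate scores ignore interactions between features, so treat this as a first screen and confirm the final feature set with validation performance.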

Tip

Calculate feature correlation matrix to identify redundant features with high correlation
Use SelectKBest(f_classif, k=20) to pick top-k features for initial screening
Random Forest feature_importances_ gives quick importance estimates, but impurity-based scores favor high-cardinality features - cross-check with permutation importance
Eliminate correlated features - keep the one with higher correlation to target

Warning

Feature importance varies by model type - Random Forest ranks differently than linear models
Don't choose feature selection threshold based on test set - use validation set only
Removing features that have low individual importance can hurt if they interact with others

Create Your Preprocessing and Feature Engineering Pipeline

Now package all these steps into a reproducible pipeline. sklearn's Pipeline class chains preprocessing and modeling steps so you apply identical transformations to train and test data. This prevents data leakage and makes your code cleaner. Your pipeline might look like: imputation -> outlier handling -> feature engineering -> encoding -> scaling -> feature selection. Save your fitted pipeline objects (preprocessor pickle files) for production. When new data arrives, load the same pipeline that was trained on your original data. This ensures consistency. Document each step, parameters used, and why you chose them. Future you will thank present you when you need to retrain or debug.
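One way to wire this together is a ColumnTransformer inside a Pipeline, saved with joblib. The column names, the tiny training frame, and the logistic regression model are hypothetical placeholders; this is a sketch of the structure, not a tuned setup.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical training data with missing values in every column.
X_train = pd.DataFrame({
    "age": [25, 32, np.nan, 51, 46, 38],
    "income": [30_000, 52_000, 61_000, np.nan, 88_000, 45_000],
    "plan": ["basic", "pro", "basic", np.nan, "pro", "basic"],
})
y_train = np.array([0, 1, 0, 1, 1, 0])

# Numeric columns: median imputation, then scaling.
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
# Categorical columns: explicit 'missing' category, then one-hot.
categorical = Pipeline([
    ("impute", SimpleImputer(strategy="constant", fill_value="missing")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])
preprocess = ColumnTransformer([
    ("num", numeric, ["age", "income"]),
    ("cat", categorical, ["plan"]),
])
pipeline = Pipeline([
    ("preprocess", preprocess),
    ("model", LogisticRegression(max_iter=1000)),
])
pipeline.fit(X_train, y_train)                   # every step is fit on train only

# New data goes through the exact same fitted transformations.
X_new = pd.DataFrame({"age": [29], "income": [np.nan], "plan": ["enterprise"]})
prediction = pipeline.predict(X_new)

# Persist and reload the fitted pipeline (temp path used for illustration).
path = os.path.join(tempfile.mkdtemp(), "pipeline.joblib")
joblib.dump(pipeline, path)
restored = joblib.load(path)
print(prediction, restored.predict(X_new))
```

Because imputation, encoding, and scaling live inside the pipeline, cross-validation refits them on each training fold automatically, which removes a whole class of leakage bugs.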

Tip

Use ColumnTransformer to apply different processing to numeric and categorical columns, and chain it with your model in a Pipeline
Pickle your fitted pipeline with joblib.dump() for easy production deployment
Test your pipeline on a holdout set to catch leakage before going live
Log all hyperparameters and transformation steps in your experiment tracker

Warning

Refitting any transformer on test data causes leakage - call fit on training data only, then transform on everything else
Saved pipelines are version-specific - document sklearn version requirements
Pipeline steps execute sequentially, so order matters - scaling before feature selection can give different results than scaling after

Validate Your Preprocessing Against Your Model Performance

The best preprocessing isn't always obvious. A feature you thought would help might hurt. Missing value imputation strategy A might outperform strategy B. The only real test is model performance. Set up cross-validation with your full pipeline and measure your target metric (accuracy, F1, AUC, RMSE - depends on your problem). Run A/B experiments: train two models with different preprocessing approaches on the same train-validation split and compare validation scores. If dropping outliers gives 2% better performance than capping them, drop them. If target encoding outperforms one-hot encoding by 1%, use target encoding. Keep detailed logs of which approach won. This becomes institutional knowledge for future projects. What works for predicting churn might not work for predicting purchase value.
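An A/B comparison of two preprocessing choices might look like the sketch below: same data, same folds, same model, with only the imputation strategy changed. The synthetic data, the injected 10% missingness, and the AUC metric are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with about 10% of values knocked out at random.
X, y = make_classification(n_samples=600, n_features=8, random_state=0)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.1] = np.nan

# Fixed folds so both variants are scored on identical splits.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

variants = {
    "mean": SimpleImputer(strategy="mean"),
    "median": SimpleImputer(strategy="median"),
}

results = {}
for name, imputer in variants.items():
    # Preprocessing sits inside the pipeline, so it is refit per fold.
    pipe = make_pipeline(imputer, StandardScaler(), LogisticRegression(max_iter=1000))
    scores = cross_val_score(pipe, X, y, cv=cv, scoring="roc_auc")
    results[name] = (scores.mean(), scores.std())
    print(f"{name}: AUC {scores.mean():.3f} +/- {scores.std():.3f}")
```

If the gap between two variants is smaller than the spread across folds, treat the result as a tie and prefer the simpler option.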

Tip

Run cross-validation (cv=5 minimum) to get reliable performance estimates
Log preprocessing parameters and resulting model performance in a results table
Test on validation set, not test set, to avoid overfitting to test data
Document which preprocessing choices actually improved performance vs. intuition

Warning

Don't tune preprocessing on test set - only validation set
A small performance improvement might be noise - use repeated cross-validation and compare the gain against the spread across folds
Preprocessing that improves training accuracy but hurts validation accuracy is overfitting

Frequently Asked Questions

What's the difference between data preprocessing and feature engineering?

Preprocessing cleans raw data to make it usable - handling missing values, duplicates, and scaling. Feature engineering creates new predictive features from raw columns, like extracting day-of-week from timestamps or calculating spending ratios. Preprocessing fixes problems; feature engineering creates signal.

How much missing data is too much to impute?

As a rule of thumb, impute when less than 30% is missing. Below 5%, dropping the affected rows is acceptable. Above 30%, the imputed values start to dominate the column, so it is often better to drop it. But context matters - if missing values themselves predict your target, keep a 'was_missing' flag even if 50% are missing.

Should I scale data before or after feature engineering?

Perform feature engineering first (creating new features), then scale. Scaling engineered features ensures they're on the same scale as raw features. Always fit scalers on training data only, then apply to test data. This prevents data leakage.

Can too many features hurt my model?

Yes. Too many features cause overfitting, slower training, and poor test performance. Use feature selection to keep only the most predictive features. Start with 20-30 features max and expand only if validation performance improves. Fewer, better features usually beat more, mediocre ones.

How do I handle categorical variables with 100+ unique values?

Group rare categories (under 1% frequency) into an 'other' category first. Then use target encoding or frequency encoding instead of one-hot encoding. One-hot encoding creates too many sparse columns. Target encoding captures predictive information in fewer dimensions.
