20 May 2026
Feature Engineering for Machine Learning: Practical Techniques
The often-overlooked craft of transforming raw data into informative features — covering encoding, scaling, missing values, interactions, and selection.
Why Feature Engineering Matters
A model is only as good as the information it receives. Raw data rarely arrives in a form that exposes the structure an algorithm needs. Thoughtful feature engineering can improve a model’s performance more than switching from a random forest to gradient boosting — it encodes domain knowledge directly into the input space.
Even with deep learning, tabular data still benefits enormously from hand-crafted features. Neural networks for tabular data consistently underperform gradient-boosted trees (XGBoost, LightGBM) when features are not engineered.
Handling Missing Values
Missing values must be addressed before most models can be trained. The right strategy depends on why data is missing:
| Mechanism | Description | Strategy |
|---|---|---|
| MCAR | Missing completely at random | Mean/median imputation safe |
| MAR | Missing depends on observed data | Model-based imputation |
| MNAR | Missing depends on the missing value itself | Flag + impute, investigate carefully |
Always add a missingness indicator feature when the fact that a value is missing carries signal:
from sklearn.impute import SimpleImputer
import numpy as np
# Add binary indicator columns before imputing
df["age_missing"] = df["age"].isna().astype(int)
imputer = SimpleImputer(strategy="median")
df["age"] = imputer.fit_transform(df[["age"]])
Encoding Categorical Variables
Categorical features need to be converted to numerical representations. The right choice depends on cardinality:
Ordinal encoding — for low-cardinality features with a natural order:
from sklearn.preprocessing import OrdinalEncoder
enc = OrdinalEncoder(categories=[["low", "medium", "high"]])
One-hot encoding — for low-cardinality nominals (< ~15 categories):
from sklearn.preprocessing import OneHotEncoder
enc = OneHotEncoder(sparse_output=False, handle_unknown="ignore")
Target encoding — for high-cardinality features. Replace each category with the mean target value for that category, estimated with cross-validation to avoid leakage:
x^i=∑j=i1[xj=xi]+α∑j=i1[xj=xi]⋅yj+αyˉ
where α is a smoothing factor that shrinks estimates toward the global mean for rare categories.
Numerical Transformations
Raw numerical features often have distributions that are poorly suited to linear models or distance-based algorithms.
Standardisation — zero mean, unit variance:
z=σx−μ
Min-max scaling — maps to [0,1]:
z=xmax−xminx−xmin
Log transform — compresses right-skewed distributions:
z=log(1+x)
Useful for revenue, counts, prices — any feature with a heavy right tail.
Box-Cox transform — generalises log transform with a learned λ:
z={λxλ−1logxλ=0λ=0
from sklearn.preprocessing import PowerTransformer
pt = PowerTransformer(method="box-cox") # requires x > 0
Datetime Features
Datetime columns encode rich cyclical information that needs to be extracted explicitly:
df["hour"] = df["timestamp"].dt.hour
df["day_of_week"] = df["timestamp"].dt.dayofweek
df["month"] = df["timestamp"].dt.month
df["is_weekend"] = (df["day_of_week"] >= 5).astype(int)
# Cyclical encoding for periodic features
import numpy as np
df["hour_sin"] = np.sin(2 * np.pi * df["hour"] / 24)
df["hour_cos"] = np.cos(2 * np.pi * df["hour"] / 24)
Cyclical encoding (sin/cos) ensures that hour 23 is close to hour 0 in feature space, which a raw integer would not capture.
Interaction Features
For linear models, interactions between features must be created explicitly:
x12=x1⋅x2
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_interactions = poly.fit_transform(X)
Tree-based models discover interactions automatically but still benefit from well-constructed ratio and difference features:
df["price_per_sqft"] = df["price"] / (df["sqft"] + 1)
df["age_since_renovation"] = df["year"] - df["last_renovation_year"]
Feature Selection
Adding too many features hurts generalisation (the curse of dimensionality) and slows training. Common selection strategies:
Variance threshold — remove near-constant features:
from sklearn.feature_selection import VarianceThreshold
sel = VarianceThreshold(threshold=0.01)
Mutual information — measures non-linear dependency between feature and target:
from sklearn.feature_selection import SelectKBest, mutual_info_classif
sel = SelectKBest(mutual_info_classif, k=20)
Permutation importance — train the model, then measure how much performance drops when each feature is randomly shuffled:
from sklearn.inspection import permutation_importance
result = permutation_importance(model, X_val, y_val, n_repeats=10)
SHAP values — model-agnostic, theoretically grounded attribution based on Shapley values. The gold standard for production feature selection.
Building a Preprocessing Pipeline
Use sklearn.pipeline.Pipeline to chain transformations safely — preventing train/test leakage:
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingClassifier
numeric_transformer = Pipeline([
("impute", SimpleImputer(strategy="median")),
("scale", PowerTransformer()),
])
categorical_transformer = Pipeline([
("impute", SimpleImputer(strategy="most_frequent")),
("encode", OneHotEncoder(handle_unknown="ignore")),
])
preprocessor = ColumnTransformer([
("num", numeric_transformer, numeric_cols),
("cat", categorical_transformer, categorical_cols),
])
model = Pipeline([
("preprocessor", preprocessor),
("classifier", GradientBoostingClassifier()),
])
model.fit(X_train, y_train)
All transformation statistics (means, encodings, etc.) are fitted on X_train only and applied to X_val/X_test.
Key Takeaways
- Feature engineering is often more impactful than model selection for tabular data
- Always add missingness indicator features when missing data carries signal
- Use cyclical encoding (sin/cos) for periodic features like hour, day, month
- Target encode high-cardinality categoricals with cross-validation to prevent leakage
- Wrap all preprocessing in a
Pipelineto ensure correct train/test separation - Use permutation importance or SHAP values to prune irrelevant features