Mar 5

Feature Engineering: Polynomial and Interaction Features

Mindli Team

AI-Generated Content


Linear models such as linear regression and logistic regression are prized for their interpretability and speed, but they make a strong assumption: that the relationship between your features and the target is linear. In the real world, relationships are often curved or involve synergy between variables. Feature engineering is the process of creating new input features from your existing data to improve model performance. Polynomial features and interaction features are two essential techniques for helping linear models capture these more complex, non-linear patterns, effectively allowing them to fit a wider range of real-world data.

The "Why": Capturing Non-Linearity in Linear Models

At its core, a simple linear regression model fits an equation of the form y = b0 + b1*x. It can only draw a straight line (or a flat plane/hyperplane). If the true relationship between x and y is curved (for instance, performance that increases with effort up to a point, then plateaus or declines), a straight line will be a poor fit. By creating polynomial features, such as x^2 or x^3, you give the model new variables to work with. The model equation can then become y = b0 + b1*x + b2*x^2, which is capable of modeling a parabolic curve. This transforms a constrained linear model into a powerful, flexible non-linear model while retaining the benefits of linear optimization techniques.
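To make this concrete, here is a minimal sketch using synthetic, noise-free data invented for illustration: an ordinary LinearRegression can only fit a line to quadratic data, but once an x^2 column is added, the same model recovers the curve exactly.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic curved data: y = 3 + 2x - x^2 (noise-free, for illustration)
x = np.linspace(-3, 3, 50)
y = 3 + 2 * x - x ** 2

# Plain linear fit: only a straight line is possible
lin = LinearRegression().fit(x.reshape(-1, 1), y)

# Add an x^2 column and the same linear model recovers the curve exactly
X_quad = np.column_stack([x, x ** 2])
quad = LinearRegression().fit(X_quad, y)
print(quad.intercept_, quad.coef_)  # approximately 3.0, [2.0, -1.0]
```

The fitted coefficients match the true generating equation, even though the estimator itself is still "linear": the non-linearity lives entirely in the engineered features.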

Generating Polynomial Features: Theory and Practice

The creation of polynomial features is systematic. For a given set of original features, you generate new features that are products of the original features raised to various powers up to a specified degree. For example, with two input features a and b and degree=2, the generated features would be: 1 (the bias term), a, b, a^2, a*b, and b^2.

In Python's scikit-learn library, the PolynomialFeatures transformer automates this process. Here is a fundamental example:

from sklearn.preprocessing import PolynomialFeatures
import numpy as np

# Sample data: two features, a and b
X = np.array([[2, 3], [1, 4]])
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)
print(X_poly)
# [[ 2.  3.  4.  6.  9.]
#  [ 1.  4.  1.  4. 16.]]

After this transformation, X_poly will contain the features: [a, b, a^2, a*b, b^2]. The include_bias=False parameter excludes the constant "1" column, as most linear models in scikit-learn add their own intercept. This automated generation ensures you consistently create all possible polynomial combinations, a task that is tedious and error-prone if done manually.

The Critical Art of Degree Selection

Choosing the polynomial degree is the most crucial decision in this process and directly addresses the risk of feature explosion. The number of features created grows combinatorially: for n input features and degree d, the number of output features (including the bias) is C(n + d, d) = (n + d)! / (n! * d!). For just 10 original features and degree=3, you would create 286 features. With degree=4, you get 1001.
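You can check this growth yourself with Python's math.comb; the small helper below is a sketch that mirrors the counting formula above.

```python
from math import comb

def n_poly_features(n_features, degree, include_bias=True):
    # Count monomials of total degree <= degree in n_features variables:
    # C(n_features + degree, degree), minus 1 if the bias column is dropped
    total = comb(n_features + degree, degree)
    return total if include_bias else total - 1

print(n_poly_features(10, 3))  # 286
print(n_poly_features(10, 4))  # 1001
```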

A high degree can lead to two major problems:

  1. Severe Overfitting: The model will fit the noise in your training data perfectly but fail to generalize to new data.
  2. Computational Burden: Training time increases, and you may run into memory issues.

Strategy for selection:

  • Start Low: Begin with degree=2 or 3. This often captures the most significant non-linear trends.
  • Use Cross-Validation: Systematically evaluate model performance (e.g., using GridSearchCV) on a validation set across degrees (2, 3, 4) to find the one that yields the best generalization.
  • Domain Knowledge: In physics or engineering, the underlying relationship might suggest a specific degree (e.g., kinetic energy involves v^2).
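The cross-validation strategy can be sketched as a loop over candidate degrees inside a pipeline, comparing validation scores (the quadratic dataset here is synthetic and invented for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic data with a true quadratic relationship plus noise
rng = np.random.default_rng(42)
X = rng.uniform(-2, 2, size=(200, 1))
y = X[:, 0] ** 2 + rng.normal(0, 0.2, size=200)

for degree in [1, 2, 3, 4]:
    model = make_pipeline(
        PolynomialFeatures(degree=degree, include_bias=False),
        StandardScaler(),
        LinearRegression(),
    )
    scores = cross_val_score(model, X, y, cv=5, scoring="r2")
    print(f"degree={degree}: mean R^2 = {scores.mean():.3f}")
```

On data like this, degree=1 scores poorly while degree=2 captures the curve; higher degrees add little and start to fit noise.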

Creating Targeted Features: Interaction-Only Mode

Sometimes, you don't want the pure polynomial terms (a^2, b^2) but are specifically interested in how features combine to influence the target. This is the concept of interaction. For instance, in a business context, the impact of marketing spend on sales might depend on the season (a marketing-spend * season interaction).

PolynomialFeatures has an interaction_only parameter for this purpose. When set to True with degree=2, it generates only the original features and their pairwise products: a, b, and a*b. It excludes a^2 and b^2. This is a powerful way to add complexity more sparingly and interpretably, as interaction terms have a clear meaning: the combined effect of two features.

Combining with Regularization for Feature Selection

Creating many polynomial features almost guarantees that many will be irrelevant or redundant. This is where regularization becomes an indispensable partner. Regularization techniques like Lasso (L1) and Ridge (L2) regression penalize model complexity.

  • Lasso (L1 Regularization): This is particularly powerful in this context. It can drive the coefficients of irrelevant polynomial or interaction terms all the way to zero, effectively performing feature selection. By applying Lasso after polynomial expansion, you can automatically identify the most important non-linear and interaction terms from a large, generated set.
  • Ridge (L2 Regularization): Shrinks coefficients towards zero but rarely sets them to exactly zero. It helps prevent overfitting by ensuring no single high-degree term gets an excessively large weight.

A robust workflow is: 1) Generate polynomial features to a moderate degree, 2) Scale all features (e.g., using StandardScaler), 3) Train a regularized linear model, and 4) Use cross-validation to tune both the polynomial degree and the regularization strength simultaneously.

Polynomial Features vs. Tree-Based Models

A fundamental question is when to use this engineered complexity versus choosing a different model. Tree-based models like Decision Trees, Random Forests, and Gradient Boosted Machines (e.g., XGBoost) natively handle non-linear relationships and interactions by recursively partitioning the feature space. They do not require you to manually create squared (a^2) or interaction (a*b) terms.

When to use polynomial features with linear models:

  • You require a highly interpretable model (coefficients for terms like a^2 or a*b have a clear meaning).
  • The dataset is very large, and linear models are computationally cheaper.
  • You have strong domain reason to believe the relationship follows a specific polynomial form.
  • You are working within a pipeline that requires linear models for other reasons (e.g., statistical inference).

When tree-based models may be preferable:

  • You have a complex, high-dimensional interaction landscape that is not easily captured by low-degree polynomials.
  • Model interpretability beyond feature importance is less critical.
  • You want to avoid the manual tuning of degree and regularization parameters.

Common Pitfalls

  1. Ignoring Feature Scaling: Polynomial features, especially high-degree terms, can have vastly different scales (e.g., age=30 vs. age^3=27000). Failing to standardize or normalize these features before training will cause any distance-based or regularization-based model to be biased towards the high-magnitude features. Always apply scaling after generating polynomial features.
  2. Automatically Using High Degrees: Setting degree=10 because "more is better" is a recipe for overfitting. The model will fit the training data's idiosyncratic noise, resulting in wild, nonsensical curves that perform poorly on any new data. Always validate the chosen degree on a held-out set.
  3. Forgetting the Computational Cost: Feature explosion isn't just a statistical problem; it's a practical one. Generating polynomial features for a dataset with 100 columns at degree 3 can create nearly 180,000 features, which may be impossible to fit in memory or train in a reasonable time. Start small and understand the combinatorial growth.
  4. Applying Polynomials to Categorical Features: Applying PolynomialFeatures to one-hot encoded categorical variables creates meaningless combinations (e.g., is_male * is_male). Either pre-separate your numeric and categorical columns, generating polynomials only for the numeric ones, or use the interaction_only flag to avoid these nonsensical self-interactions.

Summary

  • Polynomial and interaction features are engineered to help linear models capture non-linear relationships and synergistic effects between variables, transforming them into powerful, flexible tools.
  • Use PolynomialFeatures for automated generation, but carefully select the degree using cross-validation to avoid exponential feature explosion and severe overfitting.
  • The interaction_only=True parameter creates targeted features that model how the effect of one variable depends on the level of another, without creating pure square terms.
  • Regularization (Lasso/Ridge) is a critical companion technique to penalize complexity and perform feature selection from the large set of created polynomial terms.
  • Tree-based models natively handle non-linearity and interactions, making them a strong alternative to the manual feature engineering required for linear models. Your choice depends on the need for interpretability, computational constraints, and the suspected complexity of the relationships in your data.
