Mar 1

Recursive Feature Elimination

Mindli Team

AI-Generated Content

In machine learning, more features do not always mean a better model. Irrelevant or redundant variables introduce noise, increase computational cost, and can lead to overfitting. Recursive Feature Elimination (RFE) is a powerful wrapper method that systematically prunes your feature set by iteratively removing the least important features, leaving you with a compact, high-performing subset. This process simplifies your model, often improving its interpretability, generalization, and training speed.

Understanding the Core RFE Algorithm

At its heart, RFE is a backward elimination technique. It starts with all available features, fits a model, ranks the features by their importance, discards the weakest ones, and repeats the process on the reduced set until a specified number of features remains. The "importance" is derived from the model itself, making RFE a model-specific method.

The algorithm follows these concrete steps:

  1. Train a Model: Fit an estimator (e.g., a linear regression, SVM, or decision tree) on the entire dataset with all features.
  2. Rank Features: Compute a ranking criterion. For linear models, this is often the absolute magnitude of the coefficients. For tree-based models, it is typically a feature-importance measure such as Gini importance (mean decrease in impurity).
  3. Prune the Weakest Features: Remove the features with the smallest ranking scores. The number removed per iteration is controlled by the step parameter.
  4. Repeat: Re-train the model on the remaining features, rank again, and prune. This recursion continues until the desired number of features is reached.
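The four steps above can be sketched directly, assuming a linear model whose absolute coefficient magnitudes serve as the ranking criterion (the synthetic dataset and its sizes are purely illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data stands in for a real dataset
X, y = make_classification(n_samples=200, n_features=10,
                           n_informative=4, random_state=0)

remaining = list(range(X.shape[1]))  # start with every feature
n_to_select = 4

while len(remaining) > n_to_select:
    # 1. Train a model on the current feature subset
    model = LogisticRegression(max_iter=1000).fit(X[:, remaining], y)
    # 2. Rank features by absolute coefficient magnitude
    importances = np.abs(model.coef_).ravel()
    # 3. Prune the single weakest feature (step = 1)
    remaining.pop(int(np.argmin(importances)))
    # 4. Repeat until n_to_select features remain

print("Selected feature indices:", sorted(remaining))
```

This is essentially what scikit-learn's RFE does internally, minus the bookkeeping that produces the .support_ and .ranking_ attributes.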

In scikit-learn, this is implemented via the RFE class. You must choose an estimator and the target number of features (n_features_to_select). The .support_ attribute is a boolean mask indicating the selected features, and .ranking_ shows the elimination order (selected features have rank 1; the higher the rank, the earlier the feature was eliminated).

from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# X_train (assumed here to be a DataFrame with named columns)
# and y_train are defined elsewhere.

# Instantiate a model and the RFE selector
model = LogisticRegression(max_iter=1000)
selector = RFE(estimator=model, n_features_to_select=5, step=1)

# Fit to data
selector = selector.fit(X_train, y_train)

# See which features were kept
print("Selected features:", X_train.columns[selector.support_])
print("Feature rankings (1 = selected):", selector.ranking_)

Automating Selection with RFECV and Tuning Step Size

Manually specifying n_features_to_select is a guess. A more robust approach uses RFECV (Recursive Feature Elimination with Cross-Validation), which automatically determines the optimal number of features. RFECV performs RFE inside a cross-validation loop. For each feature subset size, it calculates a cross-validated performance score (e.g., accuracy, F1-score). The subset size with the highest mean score is chosen.

This process guards against overfitting the feature selection to a particular train/test split. In scikit-learn, RFECV also allows you to set a scoring metric and cv strategy. The RFECV object's .n_features_ attribute gives the optimal number, and .cv_results_['mean_test_score'] lets you visualize the performance versus feature count.

from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold

# Use CV to find the optimal number of features
cv_strategy = StratifiedKFold(n_splits=5)
selector_cv = RFECV(estimator=model, step=1, cv=cv_strategy, scoring='accuracy')
selector_cv.fit(X_train, y_train)

print(f"Optimal number of features: {selector_cv.n_features_}")

The step parameter controls efficiency. A step=1 removes one feature per iteration, which is thorough but computationally expensive with hundreds of features. A float step between 0 and 1 removes that fraction of the total feature count at each iteration, speeding up the process significantly. For example, with 100 features and step=0.1, each iteration removes 10 features. Choose a larger step for an initial coarse search and a step of 1 for the final, precise selection.
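The arithmetic can be sketched in pure Python, mirroring scikit-learn's rule of converting a float step once into int(max(1, step * n_features)) (the helper function name is my own):

```python
def elimination_schedule(n_features, n_to_select, step):
    """Return the number of features remaining after each RFE iteration."""
    # scikit-learn's conversion: a float step becomes a fixed count
    if 0.0 < step < 1.0:
        per_round = int(max(1, step * n_features))
    else:
        per_round = int(step)
    remaining, schedule = n_features, []
    while remaining > n_to_select:
        # never prune past the requested subset size
        remaining = max(n_to_select, remaining - per_round)
        schedule.append(remaining)
    return schedule

# 100 features down to 5, one at a time: 95 iterations
print(len(elimination_schedule(100, 5, 1)))
# step=0.1 removes 10 per round: [90, 80, ..., 10, 5]
print(elimination_schedule(100, 5, 0.1))
```

The second schedule finishes in 10 iterations instead of 95, which is why a fractional step is attractive for high-dimensional data.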

Comparing RFE with Filter and Embedded Methods

RFE is one of three main families of feature selection techniques. Understanding its place helps you choose the right tool.

  • Filter Methods: These methods (e.g., correlation score, chi-squared, mutual information) select features based on statistical tests, independent of any machine learning model. They are fast and scalable but ignore feature interactions and the model's specific biases. RFE is more computationally intensive but directly optimizes for the model's performance.
  • Embedded Methods: Techniques like Lasso regularization (L1) or tree-based importance perform feature selection during model training. They are efficient and model-aware. RFE can be seen as a more aggressive, wrapper-based approach that often provides a finer-grained ranking through its iterative process. Crucially, RFE can wrap around an embedded method; for instance, you can use a Linear SVM or a Logistic Regression with L1 penalty as the estimator inside RFE, combining their strengths.
  • Wrapper Methods (RFE's family): These methods, including RFE and forward selection, evaluate feature sets based on model performance. They are typically the most computationally expensive but can yield the best-performing subset for a given algorithm. RFE's backward elimination is particularly effective at detecting feature dependencies that filter methods might miss.
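To make the contrast concrete, here is an illustrative side-by-side on the same synthetic data: a filter (mutual information via SelectKBest), an embedded selector (L1-penalized logistic regression via SelectFromModel), and RFE as the wrapper. The dataset, sizes, and hyperparameters are assumptions for the sketch:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import (RFE, SelectFromModel, SelectKBest,
                                       mutual_info_classif)
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=10,
                           n_informative=4, random_state=0)

# Filter: model-independent statistical score
filt = SelectKBest(mutual_info_classif, k=4).fit(X, y)

# Embedded: selection happens during training via the L1 penalty
l1_model = LogisticRegression(penalty='l1', solver='liblinear', C=0.5)
emb = SelectFromModel(l1_model, max_features=4).fit(X, y)

# Wrapper: RFE retrains the model as it prunes
wrap = RFE(LogisticRegression(max_iter=1000),
           n_features_to_select=4).fit(X, y)

for name, sel in [('filter', filt), ('embedded', emb), ('wrapper', wrap)]:
    kept = [i for i, keep in enumerate(sel.get_support()) if keep]
    print(f"{name:8s} kept features: {kept}")
```

The three subsets often overlap but rarely coincide exactly, which is precisely the trade-off the bullets above describe.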

Building Integrated Feature Selection Pipelines

In a rigorous machine learning workflow, feature selection must be nested within cross-validation to prevent data leakage. Using a pipeline ensures that the RFE step is fitted only on the training fold of each CV split, and then transforms the validation fold. This is critical for getting a true estimate of your model's performance on new data.

A robust pipeline integrates RFE/RFECV with your final model and standardization. Scikit-learn's Pipeline and GridSearchCV make this seamless.

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV

# Create a pipeline: Scale -> Select Features -> Model
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('selector', RFE(estimator=LogisticRegression(max_iter=2000))),
    ('classifier', LogisticRegression(max_iter=2000))
])

# Define a parameter grid to search over
param_grid = {
    'selector__n_features_to_select': [5, 10, 15],  # Test different feature counts
    'selector__step': [1, 2],
    'classifier__C': [0.01, 0.1, 1.0]  # Regularization strength for the final model
}

# Perform a grid search with cross-validation
grid_search = GridSearchCV(pipe, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

print(f"Best CV score: {grid_search.best_score_:.3f}")
print(f"Best parameters: {grid_search.best_params_}")

This pipeline ensures that for every candidate parameter set in the grid search, the scaling and feature selection are properly fitted on the training folds, providing a valid and automated end-to-end workflow.
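Once the search finishes, the winning pipeline's fitted RFE step reveals which features survived. A self-contained sketch on synthetic data (the sizes and the reduced parameter grid are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, n_features=8,
                           n_informative=3, random_state=0)

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('selector', RFE(estimator=LogisticRegression(max_iter=2000))),
    ('classifier', LogisticRegression(max_iter=2000)),
])
grid = GridSearchCV(pipe, {'selector__n_features_to_select': [3, 5]},
                    cv=3, scoring='accuracy').fit(X, y)

# best_estimator_ is the pipeline refit on all training data;
# its selector step holds the final feature mask
mask = grid.best_estimator_.named_steps['selector'].support_
print("Kept feature indices:", [i for i, m in enumerate(mask) if m])
```

With a DataFrame, the same mask can index the column names to report features by name.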

Common Pitfalls

  1. Data Leakage from Improper Nesting: The most critical mistake is fitting RFE or RFECV on your entire dataset before splitting into train and test sets, or before cross-validation. This allows information from the test/holdout set to influence the feature selection, optimistically biasing your performance estimates. Always use RFECV or place RFE inside a Pipeline that is managed by cross_val_score or GridSearchCV.
  2. Ignoring the Estimator Choice: RFE's results are entirely dependent on the underlying estimator's ability to rank features. A poorly chosen model will lead to a poor feature subset. For example, using a linear model on non-linear data, or an unregularized model on high-dimensional data, will produce unreliable rankings. Always ensure your base estimator is appropriate for your data structure.
  3. Overlooking Multicollinearity: In the presence of highly correlated features, the importance ranking can become unstable. One of two correlated informative features might be arbitrarily eliminated early on because the other carries similar information. While RFE can handle this, the specific subset chosen may vary. It can be helpful to perform moderate correlation filtering before RFE or to use an estimator robust to multicollinearity (like Ridge regression) within the RFE process.
  4. Misinterpreting RFECV Output: The optimal number of features from RFECV is the one that maximizes cross-validated score on your training data. It is not a guarantee of performance on entirely new data. Furthermore, if the performance curve is flat across many feature counts, a smaller, more parsimonious model (fewer features) is often preferable for interpretability and deployment simplicity.
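Pitfall 1 in particular has a simple remedy: keep the selector inside the pipeline so that cross-validation refits it on each training fold. A minimal sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=200, n_features=20,
                           n_informative=4, random_state=0)

leak_free = Pipeline([
    ('selector', RFE(LogisticRegression(max_iter=1000),
                     n_features_to_select=5)),
    ('classifier', LogisticRegression(max_iter=1000)),
])

# cross_val_score clones and refits the whole pipeline per fold,
# so feature selection never sees the fold used for scoring
scores = cross_val_score(leak_free, X, y, cv=5, scoring='accuracy')
print("Mean CV accuracy:", round(scores.mean(), 3))
```

Fitting RFE on all of X first and only then cross-validating the classifier would leak information from every scoring fold into the selection.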

Summary

  • Recursive Feature Elimination (RFE) is a backward-selection wrapper method that iteratively removes the least important features based on a model's coefficients or feature importance scores.
  • RFECV automates the process by using cross-validation to identify the optimal number of features to select, providing a robust guard against overfitting in the feature selection stage.
  • Tuning the step parameter balances computational cost with selection granularity; use a larger step (or a percentage) for high-dimensional data.
  • RFE is more model-aware than filter methods and can be more thorough than single-pass embedded methods, but it is computationally more intensive. It can be effectively combined with embedded methods by using regularized models as its core estimator.
  • Always integrate RFE into a scikit-learn Pipeline to prevent data leakage and ensure the selection process is correctly validated during hyperparameter tuning and model evaluation.
