Feature Engineering: Frequency and Count Encoding
In the world of machine learning, raw data is rarely ready for modeling. This is especially true for categorical variables—non-numeric data like city names or product IDs. One-hot encoding, which creates a new binary column for each category, becomes unwieldy and computationally inefficient when a categorical feature has hundreds or thousands of unique values, a condition known as high cardinality. Frequency and count encoding offer powerful, efficient alternatives by transforming categories into meaningful numbers based on how often they appear, providing models with valuable ordinal information about category prevalence.
Understanding Count and Frequency Encoding
At their core, both techniques leverage the simple statistic of occurrence to create a new numerical feature. While the terms are sometimes used interchangeably, they represent two slightly different transformations.
Count Encoding replaces each category with the integer count of how many times it appears in the dataset. If "New York" appears 150 times in a "City" column, every row containing "New York" is assigned the value 150. This method directly embeds the raw frequency information into the feature.
Frequency Encoding is a normalization of count encoding. It replaces each category with its relative frequency or proportion. This is calculated by dividing the category's count by the total number of observations in the dataset. Using the same example, if the dataset has 1000 rows, "New York" would be encoded as 150 / 1000 = 0.15. This bounds the encoded values between 0 and 1, which can be beneficial for models sensitive to feature scale, like neural networks or gradient-based algorithms.
The implementation in Python using pandas is straightforward:
import pandas as pd
# Sample data
data = pd.DataFrame({'city': ['NYC', 'LA', 'NYC', 'Chicago', 'LA', 'LA', 'Chicago']})
# Count Encoding
count_map = data['city'].value_counts().to_dict()
data['city_count'] = data['city'].map(count_map)
# Frequency Encoding
freq_map = (data['city'].value_counts() / len(data)).to_dict()
data['city_freq'] = data['city'].map(freq_map)
This process converts a categorical column into a single, dense numerical column, dramatically reducing dimensionality compared to one-hot encoding.
Strategic Advantages and Ideal Use Cases
The primary advantage of these methods is dimensionality reduction. A categorical column with 100 unique values would generate 100 new columns with one-hot encoding but only one column with frequency encoding. This efficiency prevents the curse of dimensionality and reduces training time.
These encodings perform exceptionally well with tree-based models like Random Forests and Gradient Boosted Machines (e.g., XGBoost, LightGBM). These models make splits based on feature values. A frequency-encoded feature provides ordinal information—categories that appear more frequently might behave differently from rare categories, and the model can learn to split on these numerical thresholds. For high-cardinality features, this often outperforms one-hot encoding, which can force trees to create many inefficient splits across hundreds of sparse binary columns.
Furthermore, frequency can be a meaningful proxy for other characteristics. Common categories might represent a "default" or "majority" case, while rare categories could indicate niche groups or potential errors. Encoding this information directly gives the model a useful signal.
Handling Ties and Combining with Other Methods
A common challenge arises when two or more categories share the same count. This creates a tie in frequency. For example, if "Boston" and "Seattle" both appear exactly 42 times, they will receive the same encoded value. While this preserves the information that they are equally frequent, it collapses any distinction between them. To handle this, you can introduce a secondary, deterministic rule to break ties, such as sorting the tied categories alphabetically and assigning slightly different values, or adding a tiny amount of random noise. However, for tree-based models, a simple tie is often acceptable, as the model will treat them identically based on this feature alone.
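As a sketch of the alphabetical tie-breaking rule described above (the example cities and the size of the epsilon offset are illustrative choices):

```python
import pandas as pd

data = pd.DataFrame({'city': ['Boston', 'Seattle', 'Boston', 'Seattle', 'NYC']})

# Raw counts: Boston and Seattle tie at 2
counts = data['city'].value_counts()

# Break ties deterministically: add a tiny offset based on each category's
# alphabetical rank, small enough not to disturb the ordering between
# genuinely different counts
epsilon = 1e-6
offsets = pd.Series({cat: i * epsilon for i, cat in enumerate(sorted(counts.index))})
tie_broken = counts + offsets

data['city_count'] = data['city'].map(tie_broken)
```

The offset must stay well below 1 (the smallest possible gap between two distinct integer counts) so that it only separates exact ties.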
Frequency encoding is rarely used in isolation. It is most powerful when combined with other encoding methods or feature engineering techniques. A robust strategy is to use frequency encoding for high-cardinality features and one-hot encoding for low-cardinality, nominal features. You can also create interaction features by multiplying a frequency-encoded column with another numerical column (e.g., transaction frequency × average transaction amount). Another advanced tactic is to use target encoding (replacing a category with the mean of the target variable) for supervised tasks but blend it with frequency encoding to reduce overfitting, using frequency as a smoothing factor or as a separate correlated feature.
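The interaction idea above can be sketched as follows; the `merchant` and `amount` columns are hypothetical and the data is illustrative:

```python
import pandas as pd

# Hypothetical transaction data: 'merchant' is high-cardinality,
# 'amount' is a numeric column
df = pd.DataFrame({
    'merchant': ['A', 'B', 'A', 'C', 'A', 'B'],
    'amount':   [10.0, 50.0, 20.0, 5.0, 30.0, 40.0],
})

# Frequency-encode the merchant column
freq = df['merchant'].value_counts(normalize=True)
df['merchant_freq'] = df['merchant'].map(freq)

# Interaction feature: frequency x average transaction amount per merchant
avg_amount = df.groupby('merchant')['amount'].transform('mean')
df['freq_x_avg_amount'] = df['merchant_freq'] * avg_amount
```

A tree-based model can then split on the interaction directly rather than having to discover the combination itself.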
Comparison with One-Hot Encoding and Model Considerations
Choosing between frequency and one-hot encoding hinges on cardinality and model type. For features with low cardinality (e.g., less than 10 unique categories), one-hot encoding is usually safe and preserves all information without assuming an ordinal relationship. For high-cardinality features, one-hot encoding becomes a liability.
The critical trade-off is information loss versus efficiency. One-hot encoding is lossless from the perspective of the category identity but creates sparse, high-dimensional data. Frequency encoding is lossy—it discards the unique identity of each category and replaces it with a single number. If the specific identity of a rare category is critically important (e.g., a rare disease code), frequency encoding might wash out that signal. However, for many practical problems, the frequency of occurrence is a more general and useful signal than the exact name.
It's crucial to note that frequency encoding assumes the training set distribution is representative of future data. You must calculate the frequency mapping only on the training set and then apply it to validation and test sets. Calculating it on the entire dataset before splitting, or on the test set, leaks information and leads to overly optimistic performance estimates, a form of data leakage.
Common Pitfalls
Leaking Information from the Test Set. As mentioned, the most critical error is computing frequency counts using the entire dataset. Always fit the encoding (calculate the value_counts) on the training fold alone, then transform the training, validation, and test data using that learned mapping.
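A minimal sketch of this fit/transform discipline, including one simple policy for categories that appear only at test time (the default value of 0 for unseen categories is an assumption, not the only choice):

```python
import pandas as pd

train = pd.DataFrame({'city': ['NYC', 'LA', 'NYC', 'Chicago', 'LA', 'NYC']})
test = pd.DataFrame({'city': ['LA', 'Boston', 'NYC']})  # Boston never seen in training

# Fit: learn the frequency mapping from the training fold only
freq_map = (train['city'].value_counts() / len(train)).to_dict()

# Transform: apply that same mapping everywhere; unseen categories
# get a default value (here 0, treating them as "never observed")
train['city_freq'] = train['city'].map(freq_map)
test['city_freq'] = test['city'].map(freq_map).fillna(0)
```

Other reasonable defaults for unseen categories include the minimum training frequency or the frequency of a pooled "Other" bucket.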
Ignoring the Impact of Rare Categories. Categories that appear only once or twice (singletons and near-singletons) will get very small frequency codes. This can make them overly influential or create instability. A common solution is to group all rare categories (e.g., those with a count below a threshold like 10) into a single "Other" category before encoding, giving them a shared, more stable frequency value.
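A sketch of this rare-category grouping, with an illustrative threshold of 2 and made-up city data:

```python
import pandas as pd

data = pd.DataFrame({'city': ['NYC'] * 5 + ['LA'] * 4 + ['Fargo', 'Reno', 'Waco']})

# Group categories below a count threshold into a shared "Other" bucket
threshold = 2
counts = data['city'].value_counts()
rare = counts[counts < threshold].index
data['city_grouped'] = data['city'].where(~data['city'].isin(rare), 'Other')

# Frequency-encode the grouped column: the three singletons now share
# one stable frequency instead of three tiny, noisy ones
freq = data['city_grouped'].value_counts(normalize=True)
data['city_freq'] = data['city_grouped'].map(freq)
```

The threshold itself is a hyperparameter; in practice it is worth tuning alongside the model.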
Misapplying to Linear Models without Caution. Frequency encoding collapses each category into a single number, and a linear model like Logistic Regression will fit one coefficient to that feature, implicitly assuming the target varies linearly with frequency. If the true relationship is not monotonic (e.g., both very high and very low frequency categories are associated with high risk), the linear model will fail to capture it. For such models, one-hot encoding or more sophisticated techniques are often preferable.
Overlooking Category Drift. If the underlying frequency of categories changes over time (e.g., a new city becomes popular), an encoding map derived from old data will become inaccurate. Monitoring feature distributions in production and periodically retraining the encoding scheme is necessary for maintaining model performance.
Summary
- Frequency and count encoding replace high-cardinality categorical values with their occurrence statistics—either raw counts or normalized proportions—creating a single, dense numerical feature.
- They provide dimensionality reduction and inject useful ordinal information about category commonness, which is particularly effective for tree-based models and often outperforms one-hot encoding for features with many unique values.
- Always compute the encoding mapping strictly from the training data to prevent data leakage, and have a strategy for handling ties in frequency and grouping rare categories.
- These methods are lossy and assume frequency is a meaningful signal; they work best when combined with other techniques and are applied with an understanding of the model's assumptions, especially avoiding naive use with linear models without further validation.