Mar 2

Pandas GroupBy Multiple Aggregations

Mindli Team

AI-Generated Content


The groupby().agg() operation is one of the most powerful tools in the pandas library for data summarization. While simple aggregations like .sum() or .mean() are common, real-world analysis often demands you apply different aggregation functions to different columns efficiently within the same operation. Mastering multiple aggregations, reshaping the output, and calculating derived metrics is essential for moving from simple summaries to production-ready data reports.

Core Concept 1: Aggregation Dictionaries and Named Aggregation

The foundation of multiple aggregations is passing a dictionary, list, or tuple to the .agg() method. The most direct method is a dictionary where keys are column names and values are the aggregation functions (as strings, callables, or lists thereof) to apply.

Consider a sales DataFrame:

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'Region': ['North', 'North', 'South', 'South', 'South'],
    'Product': ['A', 'B', 'A', 'B', 'A'],
    'Revenue': [100, 150, 200, 175, 125],
    'Cost': [40, 60, 90, 70, 50],
    'Units': [5, 10, 20, 15, 10]
})

To get the total Revenue and average Cost per Region, you use an aggregation dictionary:

summary_dict = df.groupby('Region').agg({
    'Revenue': 'sum',
    'Cost': 'mean',
    'Units': ['min', 'max']  # Apply multiple functions to one column
})
print(summary_dict)

This creates a MultiIndex for columns. For clearer, more maintainable code, pandas offers named aggregation (NamedAgg) using tuples. This is especially useful when you need to assign specific names to each result column directly. The syntax uses the agg method with keyword arguments where the value is a tuple of (column, aggfunc).

summary_named = df.groupby('Region').agg(
    total_rev=pd.NamedAgg(column='Revenue', aggfunc='sum'),
    avg_cost=pd.NamedAgg(column='Cost', aggfunc='mean'),
    first_product=pd.NamedAgg(column='Product', aggfunc='first')
)
print(summary_named)

Named aggregation gives you explicit control over the final column names from the start, avoiding the generic names produced by the dictionary method.
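Named aggregation also supports a shorthand: you can pass a plain (column, aggfunc) tuple as each keyword argument, without constructing pd.NamedAgg explicitly. A minimal sketch using the same sales columns:

```python
import pandas as pd

df = pd.DataFrame({
    'Region': ['North', 'North', 'South', 'South', 'South'],
    'Revenue': [100, 150, 200, 175, 125],
    'Cost': [40, 60, 90, 70, 50],
})

# A plain tuple is shorthand for pd.NamedAgg(column=..., aggfunc=...)
summary = df.groupby('Region').agg(
    total_rev=('Revenue', 'sum'),
    avg_cost=('Cost', 'mean'),
)
print(summary)
```

Both spellings produce identical results; the tuple form is simply less verbose.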

Core Concept 2: Flattening Multi-Level Column Indexes

The dictionary-of-lists approach creates a hierarchical column MultiIndex. While informative, this structure can be cumbersome for further analysis or export. Flattening converts this multi-level index into a single level with readable column names.

The result from our first summary_dict has a MultiIndex. You can flatten it by joining the levels:

summary_dict_flattened = summary_dict.copy()
summary_dict_flattened.columns = ['_'.join(col).strip() for col in summary_dict_flattened.columns.values]
print(summary_dict_flattened)

This produces column names like Revenue_sum, Cost_mean, Units_min, and Units_max. This flat structure is much easier to filter, rename, or write to a CSV file. When using named aggregation, this step is often unnecessary because you defined the flat column names at the outset.
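An equivalent, more concise way to flatten a string-tuple MultiIndex is to map '_'.join over the columns directly. A small sketch on a reduced version of the data:

```python
import pandas as pd

df = pd.DataFrame({
    'Region': ['North', 'North', 'South'],
    'Units': [5, 10, 20],
})

summary = df.groupby('Region').agg({'Units': ['min', 'max']})
# map('_'.join) joins each (column, func) tuple into one flat name
summary.columns = summary.columns.map('_'.join)
print(summary.columns.tolist())  # ['Units_min', 'Units_max']
```

This only works when every level of the MultiIndex holds strings; otherwise convert the levels to strings first.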

Core Concept 3: Percent-of-Group Calculations with Transform

Sometimes you need a summary statistic to appear on every original row, not just once per group. This is where the .transform() method shines. It returns a Series (or DataFrame) with the same index as the original, where the aggregation function's result is broadcast back to each member of the group.

A classic use case is calculating a percent-of-group total. For instance, what percentage of each region's total revenue does each row represent?

# Calculate total revenue per region
region_revenue = df.groupby('Region')['Revenue'].transform('sum')
# Calculate the percentage
df['Pct_of_Region_Revenue'] = (df['Revenue'] / region_revenue * 100).round(1)
print(df[['Region', 'Revenue', 'Pct_of_Region_Revenue']])

The .transform('sum') computes the group sum and aligns it with df. This is different from .apply(), which can return aggregated or transformed data but is less optimized for this specific broadcasting operation. Use .transform() whenever you need to add a grouped summary as a new column in your source data.
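The shape difference between the two methods can be sketched directly:

```python
import pandas as pd

df = pd.DataFrame({
    'Region': ['North', 'North', 'South'],
    'Revenue': [100, 150, 200],
})

agg_result = df.groupby('Region')['Revenue'].sum()            # one row per group
tf_result = df.groupby('Region')['Revenue'].transform('sum')  # one row per input row

print(len(agg_result))  # 2
print(len(tf_result))   # 3 -- same length and index as df
```

Because tf_result shares df's index, it can be assigned straight into a new column; agg_result cannot without an explicit merge.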

Core Concept 4: First and Last Aggregation for Non-Numeric Columns

Aggregation isn't just for numbers. For categorical or datetime columns, functions like 'first', 'last', and 'size' are invaluable. They allow you to extract meaningful information from grouped non-numeric data.

Building on our sales data, suppose we want a summary per region that includes the first product sold and the most recent (last) product, along with the count of transactions:

non_numeric_summary = df.groupby('Region').agg({
    'Product': ['first', 'last', 'count'], # Works on strings
    'Revenue': 'sum'
})
print(non_numeric_summary)

This is particularly useful in time-series data ordered by a date column. Grouping by an entity (e.g., customer ID) and using 'first' and 'last' on a date column can quickly show acquisition and last activity dates. Remember, the order of 'first' and 'last' depends on the row order within the group unless you sort the data first.
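The sort-then-aggregate pattern can be sketched as follows; the customer and order_date columns here are hypothetical, standing in for any entity ID and timestamp:

```python
import pandas as pd

orders = pd.DataFrame({
    'customer': ['c1', 'c2', 'c1', 'c2'],
    'order_date': pd.to_datetime(['2024-03-05', '2024-01-20',
                                  '2024-01-10', '2024-02-15']),
})

# Sort first so 'first'/'last' mean earliest/latest chronologically,
# not whatever order the rows happened to arrive in
lifecycle = (orders.sort_values('order_date')
                   .groupby('customer')['order_date']
                   .agg(['first', 'last']))
print(lifecycle)
```

Without the sort_values call, c1's 'first' would be 2024-03-05, because that row appears earlier in the raw data.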

Core Concept 5: Building Comprehensive Summary Statistics Tables

The ultimate goal is to combine these techniques to create publication-ready summary tables. This involves selecting relevant columns, applying a tailored set of aggregation functions, and then cleaning the output for readability.

Let's build a comprehensive summary for our sales data per product:

# Define a robust aggregation dictionary
agg_config = {
    'Revenue': ['sum', 'mean', 'std'],  # Total, average, and variability
    'Cost': ['sum', 'mean'],
    'Units': ['sum', 'min', 'max', 'count'],
    'Region': lambda x: x.mode().iloc[0]  # Most frequent region for the product
}

product_summary = df.groupby('Product').agg(agg_config)

# Flatten the MultiIndex columns
product_summary.columns = [f'{col[0]}_{col[1]}' for col in product_summary.columns]
# Round numeric columns for display
product_summary = product_summary.round(2)
# Reset index to make 'Product' a column
final_summary_table = product_summary.reset_index()

print(final_summary_table)

This table provides a holistic view: financials, volume, and even the modal region for each product. The key is planning your agg_config dictionary to answer the specific business or research questions at hand.
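If you prefer to skip the flattening step entirely, the same kind of table can be built with named aggregation; the output column names below (revenue_total and so on) are illustrative choices, not pandas conventions:

```python
import pandas as pd

df = pd.DataFrame({
    'Product': ['A', 'B', 'A', 'B', 'A'],
    'Revenue': [100, 150, 200, 175, 125],
    'Units': [5, 10, 20, 15, 10],
})

# Named aggregation yields flat, pre-named columns in one step
product_summary = df.groupby('Product').agg(
    revenue_total=('Revenue', 'sum'),
    revenue_mean=('Revenue', 'mean'),
    units_total=('Units', 'sum'),
    n_rows=('Revenue', 'count'),
).round(2).reset_index()
print(product_summary)
```

The trade-off: named aggregation is clearer but more verbose, while the dictionary-of-lists form is compact but requires flattening afterward.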

Common Pitfalls

  1. Misaligned Indexes after Transform vs. Agg: Confusing .agg() and .transform() is a frequent error. .agg() returns a condensed DataFrame with one row per group, while .transform() returns a broadcast result with the same shape as the input. If you assign an .agg() result directly as a new column, its group-label index won't match the original row index and you'll get NaN values. Always use .transform() for per-row group calculations.
  2. Incorrect Flattening of MultiIndex Columns: When flattening columns with '_'.join, make sure the entries really are tuples. A common mistake is applying the join to a single-level index, whose entries are plain strings; '_'.join will then splice underscores between individual characters instead of raising an error. Check df.columns.nlevels first. For a genuine MultiIndex, a concise method is: df.columns = df.columns.map('_'.join).
  3. Applying Numeric Functions to Non-Numeric Data: Attempting to calculate the 'mean' of a string column will raise a TypeError. Be deliberate in your aggregation dictionary and use functions appropriate to the data type: 'first', 'last', 'count', and 'nunique' for categorical data; 'sum', 'mean', 'std' for numeric data. Older pandas versions silently dropped such "nuisance" columns, producing incomplete results; recent versions raise an error instead.
  4. Forgetting to Sort Before 'first'/'last': The 'first' and 'last' aggregations follow the row order as it appears in the DataFrame. If your data isn't sorted chronologically or logically, these values will be arbitrary. When order matters, sort the DataFrame by the relevant columns before aggregating, e.g. df.sort_values(['GroupCol', 'OrderCol']).
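The index-alignment pitfall is easy to reproduce. A minimal sketch: assigning an .agg() result as a new column silently produces NaN, while .transform() aligns correctly.

```python
import pandas as pd

df = pd.DataFrame({
    'Region': ['North', 'North', 'South'],
    'Revenue': [100, 150, 200],
})

# Pitfall: the .agg() result is indexed by group label ('North', 'South'),
# so assigning it aligns against df's 0..2 index and every value is NaN
df['bad'] = df.groupby('Region')['Revenue'].sum()
print(df['bad'].isna().all())  # True

# Fix: .transform() keeps the original row index
df['good'] = df.groupby('Region')['Revenue'].transform('sum')
print(df['good'].tolist())  # [250, 250, 200]
```

No error or warning is raised in the broken case, which is exactly why this pitfall is worth memorizing.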

Summary

  • Use an aggregation dictionary passed to .agg() to apply different functions (or lists of functions) to different columns efficiently. For more explicit control over output column names, use the named aggregation (NamedAgg) syntax with tuples.
  • Results from multiple aggregations create a MultiIndex for columns. Flatten this index by joining level names with '_'.join(col) to create a more analyzable, single-level column structure.
  • Use the .transform() method to broadcast group-level calculations (like a group sum) back to each row of the original DataFrame. This is the essential tool for calculating metrics like percent-of-group totals.
  • Aggregation functions like 'first' and 'last' are powerful for non-numeric columns, allowing you to extract representative categorical values or sequence-based information from grouped data.
  • Combine all these techniques to build comprehensive summary statistics tables. Start with a planned aggregation dictionary, apply groupby().agg(), flatten the columns, and format the result for clear communication.
