Mar 1

SQL Interview: Aggregation and Window Problems

Mindli Team

AI-Generated Content

Mastering complex SQL problems is often the defining hurdle in data science and analytics interviews. Your ability to elegantly transform and summarize data using aggregation and window functions directly signals your analytical maturity and technical precision. This guide moves beyond basic SELECT statements to tackle the nuanced patterns—like ranking, sequencing, and running calculations—that interviewers use to separate competent candidates from exceptional ones.

Core Concepts: From Aggregation to Windowing

The first step in solving any interview problem is identifying the correct tool. Standard aggregation (using GROUP BY with functions like SUM(), COUNT(), AVG()) collapses multiple rows into a single summary row per group. In contrast, a window function performs a calculation across a set of table rows that are somehow related to the current row, without collapsing them. You can think of a window as a peephole that slides over your result set, allowing each row to see and compute based on others.

For example, to find total company revenue, you'd use aggregation: SELECT SUM(revenue) FROM sales;. To find each employee's revenue alongside the departmental average, preserving all rows, you'd use a window function: SELECT employee_id, revenue, AVG(revenue) OVER (PARTITION BY department_id) AS dept_avg FROM sales;. The key distinction is that window functions add computed columns without reducing the number of rows returned, which is essential for problems requiring rankings or row-by-row comparisons.
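To make the distinction concrete, here is a minimal, hypothetical sketch using Python's built-in sqlite3 module (SQLite 3.25+ supports window functions); the sales table and its values are invented for illustration:

```python
import sqlite3

# A hypothetical sales table; the schema and values are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (employee_id INTEGER, department_id INTEGER, revenue REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [(1, 10, 100.0), (2, 10, 300.0), (3, 20, 50.0)],
)

# Aggregation collapses all rows into a single summary row.
total = conn.execute("SELECT SUM(revenue) FROM sales").fetchone()[0]
print(total)  # 450.0

# A window function keeps all three rows and adds a computed column.
rows = conn.execute(
    """
    SELECT employee_id, revenue,
           AVG(revenue) OVER (PARTITION BY department_id) AS dept_avg
    FROM sales
    ORDER BY employee_id
    """
).fetchall()
print(rows)  # [(1, 100.0, 200.0), (2, 300.0, 200.0), (3, 50.0, 50.0)]
```

Note that the window query still returns three rows: each row sees its department's average without being collapsed into it.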

Solving Top-N Per Group Problems

This is a quintessential interview question: "Find the top 3 highest-paid employees in each department." The brute-force method uses a correlated subquery, but this is inefficient. The modern, performant approach uses the ROW_NUMBER(), RANK(), or DENSE_RANK() window functions.

Your strategy is to: 1) partition your data by the group (department), 2) order within each partition by the metric of interest (salary descending), and 3) assign a rank. The three ranking functions differ only in how they treat ties. ROW_NUMBER() assigns unique sequential numbers, so filtering on it returns exactly N rows per department, breaking ties arbitrarily unless you add a tiebreaker to the ORDER BY. RANK() gives tied rows the same rank and leaves gaps after them. DENSE_RANK() gives tied rows the same rank without gaps, so filtering on it returns everyone in the top N salary levels, which may be more than N rows. For a strict "exactly three rows" requirement, ROW_NUMBER() is the safe choice; use DENSE_RANK() when tied employees should all be included.

WITH ranked_employees AS (
  SELECT
    employee_id,
    name,
    department_id,
    salary,
    DENSE_RANK() OVER (PARTITION BY department_id ORDER BY salary DESC) AS salary_rank
  FROM employees
)
SELECT * FROM ranked_employees WHERE salary_rank <= 3;

Always consider edge cases: What if a department has fewer than N employees? What if there are ties for the last position? Articulating these considerations shows thoroughness.
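The tie-handling differences are easy to demonstrate. Here is a hedged sketch using Python's sqlite3 module with an invented employees table containing a salary tie:

```python
import sqlite3

# Invented data: Ava and Ben are tied on salary.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees (employee_id INTEGER, name TEXT, department_id INTEGER, salary INTEGER)"
)
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?, ?)",
    [
        (1, "Ava", 10, 90000),
        (2, "Ben", 10, 90000),  # tied with Ava
        (3, "Cam", 10, 80000),
        (4, "Dee", 10, 70000),
    ],
)

rows = conn.execute(
    """
    SELECT name,
           ROW_NUMBER() OVER (ORDER BY salary DESC) AS rn,
           RANK()       OVER (ORDER BY salary DESC) AS rnk,
           DENSE_RANK() OVER (ORDER BY salary DESC) AS drnk
    FROM employees
    ORDER BY salary DESC, name
    """
).fetchall()

# ROW_NUMBER breaks the Ava/Ben tie arbitrarily (1 and 2 in some order);
# RANK gives both 1 and skips to 3; DENSE_RANK gives both 1 and continues at 2.
for row in rows:
    print(row)
```

Filtering `WHERE rnk <= 3` here would return all four names (Dee has rank 4? no, rank 4 is excluded, but the tie pushes Cam to rank 3), which is exactly the kind of tie behavior worth narrating aloud in an interview.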

Calculating Running Totals and Complex Rankings

Questions about running totals, moving averages, or "find the cumulative sales for each salesperson over time" test your understanding of the ORDER BY clause inside OVER(). PARTITION BY splits the data into independent groups where the calculation restarts; ORDER BY, by contrast, defines a window frame that grows row by row, producing cumulative results.

For a running total:

SELECT
  date,
  sales,
  SUM(sales) OVER (ORDER BY date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_total
FROM daily_sales;

When an aggregate window function has an ORDER BY but no explicit frame clause, the default frame is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, which matches the ROWS version above except that rows tied on the ordering value (peers) are included together. For a 7-day moving average, you would use AVG(sales) OVER (ORDER BY date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW), assuming exactly one row per date. Understanding how to manipulate this frame, including when RANGE (logical peer groups) behaves differently from ROWS (physical row counts), is advanced knowledge that impresses interviewers.
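A running total and a short moving average can be verified side by side. This is a sketch using Python's sqlite3 module with an invented daily_sales table; a 3-row window (2 PRECEDING) stands in for the 7-day version to keep the sample small:

```python
import sqlite3

# Invented daily sales figures, one row per date.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_sales (date TEXT, sales REAL)")
conn.executemany(
    "INSERT INTO daily_sales VALUES (?, ?)",
    [("2024-01-01", 10.0), ("2024-01-02", 20.0),
     ("2024-01-03", 30.0), ("2024-01-04", 40.0)],
)

rows = conn.execute(
    """
    SELECT date, sales,
           SUM(sales) OVER (ORDER BY date
                            ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_total,
           AVG(sales) OVER (ORDER BY date
                            ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS moving_avg_3
    FROM daily_sales
    ORDER BY date
    """
).fetchall()

# running_total accumulates: 10, 30, 60, 100
# moving_avg_3 averages up to 3 rows: 10, 15, 20, 30
for row in rows:
    print(row)
```

Notice that the moving average's frame is simply narrower at the start of the data, where fewer than three preceding rows exist; stating this edge case aloud is a cheap way to show rigor.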

Detecting Gaps and Consecutive Sequences

Problems like "find missing sequential IDs" or "identify periods of consecutive login days" require creative use of window functions to group sequential data. The core technique involves creating a grouping key by subtracting a row number from the sequence value.

For example, to group an ID list into islands of consecutive values (each gap is the range between two adjacent islands):

WITH islands AS (
  SELECT
    id,
    id - ROW_NUMBER() OVER (ORDER BY id) AS island_group
  FROM table_with_ids
)
SELECT MIN(id) AS island_start, MAX(id) AS island_end
FROM islands
GROUP BY island_group
ORDER BY island_start;

Here, id - ROW_NUMBER() OVER (ORDER BY id) produces the same constant for every member of a consecutive run; when an ID is missing, the constant changes. Grouping by this value isolates each island, and MIN(id)/MAX(id) mark its boundaries, so a gap runs from one island's end + 1 to the next island's start - 1. If the interviewer wants the gaps themselves, compute LEAD(id) OVER (ORDER BY id) and keep the rows where the next ID exceeds id + 1. For consecutive login days, apply the same subtraction to a day number derived from the date column.
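As a self-contained, hedged sketch using Python's sqlite3 module (the table_with_ids data is invented), the row-number subtraction groups consecutive IDs into islands, and a LEAD-based query reads the gaps off directly:

```python
import sqlite3

# Invented data: IDs 1-3 and 6-7 are present, so 4-5 is the gap.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table_with_ids (id INTEGER)")
conn.executemany("INSERT INTO table_with_ids VALUES (?)", [(1,), (2,), (3,), (6,), (7,)])

# id - ROW_NUMBER() is constant within each consecutive run.
islands = conn.execute(
    """
    WITH islands AS (
      SELECT id, id - ROW_NUMBER() OVER (ORDER BY id) AS island_group
      FROM table_with_ids
    )
    SELECT MIN(id) AS island_start, MAX(id) AS island_end
    FROM islands
    GROUP BY island_group
    ORDER BY island_start
    """
).fetchall()
print(islands)  # [(1, 3), (6, 7)] -> the gap is 4..5

# The LEAD-based variant reports each gap's boundaries directly.
gaps = conn.execute(
    """
    SELECT id + 1 AS gap_start, next_id - 1 AS gap_end
    FROM (SELECT id, LEAD(id) OVER (ORDER BY id) AS next_id FROM table_with_ids)
    WHERE next_id > id + 1
    ORDER BY gap_start
    """
).fetchall()
print(gaps)  # [(4, 5)]
```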

Calculating the Median

Most SQL dialects have no built-in MEDIAN() function (some, such as PostgreSQL and SQL Server, offer PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY value), but interviewers usually want the construction), so you must build it yourself. The median is the middle value, or the average of the two middle values, in an ordered set. The solution leverages ROW_NUMBER() and COUNT() as window functions.

WITH ordered_data AS (
  SELECT
    value,
    ROW_NUMBER() OVER (ORDER BY value) AS row_asc,
    COUNT(*) OVER () AS total_count
  FROM dataset
)
SELECT AVG(value) AS median
FROM ordered_data
WHERE row_asc IN (FLOOR((total_count + 1) / 2.0), CEIL((total_count + 1) / 2.0));

This approach works for both odd and even counts. It numbers all rows, finds the total count, and then selects the middle one or two row numbers to average. Be prepared to explain the logic behind the FLOOR and CEIL operations, which handle the indexing for the median position.
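The odd/even logic is worth sanity-checking against small inputs. Here is a sketch using Python's sqlite3 module; since SQLite's FLOOR/CEIL math functions are optional at build time, integer division stands in for them ((n + 1) / 2 and (n + 2) / 2 give the same pair of middle positions):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dataset (value REAL)")

def median(values):
    """Compute a median via ROW_NUMBER/COUNT window functions, as in the text."""
    conn.execute("DELETE FROM dataset")
    conn.executemany("INSERT INTO dataset VALUES (?)", [(v,) for v in values])
    return conn.execute(
        """
        WITH ordered_data AS (
          SELECT value,
                 ROW_NUMBER() OVER (ORDER BY value) AS row_asc,
                 COUNT(*) OVER () AS total_count
          FROM dataset
        )
        SELECT AVG(value)
        FROM ordered_data
        -- Integer division replaces FLOOR/CEIL: for odd n both expressions
        -- pick the single middle row; for even n they pick the two middle rows.
        WHERE row_asc IN ((total_count + 1) / 2, (total_count + 2) / 2)
        """
    ).fetchone()[0]

print(median([1, 3, 5]))       # odd count  -> 3.0
print(median([1, 3, 5, 100]))  # even count -> 4.0
```

The skewed value 100 in the even-count case also illustrates why the median, unlike the mean, resists outliers.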

Common Pitfalls

  1. Ignoring NULLs and Ties: Window functions like RANK() handle ties explicitly, but aggregate functions like SUM() and AVG() silently skip NULLs, which can distort totals and averages. Always state your assumptions: "I'll assume NULL values in the sales column represent zero and should be handled with COALESCE(sales, 0)."
  2. Confusing PARTITION BY with GROUP BY: PARTITION BY in a window function does not reduce rows; it only defines window boundaries. GROUP BY does collapse rows, so in any query that uses it, every non-aggregated column in the SELECT list must still appear in the GROUP BY clause. Keep the aggregation logic and the windowing logic clearly separated.
  3. Performance Blindness: A solution with multiple nested subqueries might work on small data but fail in production. When presenting your solution, note that window functions are generally more performant than correlated subqueries for ranking and running total problems, as they often require only a single pass over the data.
  4. Overcomplicating the Solution: Under time pressure, it's easy to jump into a complex CTE. First, verbally decompose the problem: "To find the top seller per region, I first need to rank sales within each region, then filter to rank one." Starting with clear, commented CTEs (Common Table Expressions) is better than a monolithic, unreadable query.
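The NULL pitfall from point 1 is easy to reproduce. This sketch (Python sqlite3, invented data) shows how a NULL silently shrinks the denominator of an average, and how COALESCE changes the answer:

```python
import sqlite3

# Invented data: one sales amount is NULL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?)", [(100.0,), (None,), (50.0,)])

# AVG() skips the NULL row entirely; COALESCE treats it as zero instead.
raw_avg, coalesced_avg = conn.execute(
    "SELECT AVG(amount), AVG(COALESCE(amount, 0)) FROM sales"
).fetchone()
print(raw_avg)        # 75.0 -- NULL excluded: (100 + 50) / 2
print(coalesced_avg)  # 50.0 -- NULL as zero: (100 + 0 + 50) / 3
```

Neither answer is universally "correct"; which one the interviewer wants depends on what a NULL means in the domain, which is exactly why stating the assumption out loud matters.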

Summary

  • Core Tool Selection: Use standard aggregation (GROUP BY) to collapse rows into summaries. Use window functions (OVER(), PARTITION BY) to perform calculations across rows while preserving the original dataset, essential for rankings and row-wise comparisons.
  • Master the Top-N Pattern: Combine PARTITION BY, ORDER BY, and ROW_NUMBER()/RANK() to solve "top N per group" problems efficiently, always considering edge cases like ties.
  • Leverage Window Frames for Trends: Use the ORDER BY clause in window functions to calculate running totals and moving averages by defining a sliding window frame over your ordered data.
  • Solve Sequences with Arithmetic: Detect gaps and consecutive sequences by creating a grouping key, typically by subtracting a row number from a sequential value, then grouping on that result.
  • Construct the Median: Build a median calculation using ROW_NUMBER() and COUNT() as window functions to find the middle value(s) in an ordered list, handling both odd and even counts.
  • Interview Strategy: Articulate your problem decomposition, choose window functions over correlated subqueries for clarity and performance, and explicitly address potential edge cases like NULLs and duplicates.
