MEDIAN in Redshift

Redshift has an interesting quirk in that when a MEDIAN function is called within a query, it implicitly give an ORDER BY for that column.

Window Functions in 5 seconds

Taking a step back, when working with window functions there's usually OVER and ORDER BY arguments and optionally, but often, a PARTITION BY argument as well.

If I wanted to order customer purchases by customer, I could do the following:

SELECT 
    CUSTOMER_ID,
    PRODUCT_NAME,
    ORDER_DATE,
    ROW_NUMBER() OVER (PARTITION BY CUSTOMER_ID ORDER BY ORDER_DATE) ORDER_NUMBER
FROM CUSTOMER_ORDERS

The MEDIAN function also uses an ORDER BY internally to determine the middle value, but doesn't use the same syntax as the above.

Ok, fine... But why does this matter?

The above is important to keep in mind since multiple MEDIAN functions in a query will mess up. You'll get the following error:

ERROR: within group ORDER BY clauses for aggregate functions must be the same

You'll get this error if you have any combination of LISTAGG, PERCENTILE_CONT, or MEDIAN in a query that do not share the same ORDER BY column.

The AWS Redshift Documentation shows the following example that will execute correctly when mixing these various window functions in a query:

select top 10 salesid, sum(pricepaid), 
percentile_cont(0.6) within group (order by salesid),
median (salesid)
from sales group by salesid, pricepaid;

Solutions if you need multiple MEDIANs

If you absolutely need to have various medians in a result set, most likely the easiest option is to make multiple subqueries and then join them together at the end.

For example:

WITH MED_ITEMS_BOUGHT AS (
SELECT 
    CUSTOMER_ID,
    MEDIAN(ITEMS_BOUGHT) MED_ITEMS_BOUGHT
FROM 
  CUSTOMER_ORDERS
GROUP BY 
  CUSTOMER_ID
),

MED_PAYMENT AS (
SELECT 
  CUSTOMER_ID,
  MEDIAN(PAYMENT) MED_PAYMENT
FROM 
  CUSTOMER_ORDERS
GROUP BY 
  CUSTOMER_ID
)

SELECT 
    MIB.CUSTOMER_ID,
    MIB.MED_ITEMS_BOUGHT,
    MP.MED_PAYMENT
FROM 
    MED_ITEMS_BOUGHT MIB,
    MED_PAYMENT MP
WHERE MIB.CUSTOMER_ID = MP.CUSTOMER_ID

MEDIAN in PostgreSQL

Unfortunately PostgreSQL doesn't have a native median function, but this doesn't really matter so much, as there is a percentile_cont function. percentile_cont(0.5) is equivalent to median.

You can see the description of percentile_cont in the official PostgreSQL documentation.

However, if you must have a median function (i.e you're using SQLAlchemy and want to seamlessly switch backends) there's a user-defined PL/pgSQL function that can mimic this functionality.

The code for median is available at this link.

Summary

When using MEDIAN in Redshift, you cannot use more than one in a query at a time. A workaround is to make subqueries and join later.

PostgreSQL does not have a native MEDIAN function, but this may be mimicked with the percentile_cont(0.5) function or an additional PL/pgSQL function.

Comments

comments powered by Disqus

Published

Category

Redshift

Tags

Contact