Calculating medians in PostgreSQL with percentile_cont
SkillsLogic is developing a big reporting and data dashboard solution and as part of that we need to calculate a lot of medians.
Remember from school, the median of a list of numbers is found by sorting the values from lowest to highest and then picking the middle one. If there’s no middle then the median is typically defined to be the mean of the two middle values.
PostgreSQL is our database of choice and with v9.4 it got a new function - percentile_cont - that makes it much easier to perform this type of calculation on a dataset (similar versions are available in SQL Server and Oracle).
The description of percentile_cont in the PostgreSQL documentation is a bit bewildering:
- percentile_cont - continuous percentile: returns a value corresponding to the specified fraction in the ordering, interpolating between adjacent input items if needed
In practice it’s quite straightforward to use percentile_cont to get a median average from a list. Take the following simplified dataset :
college_name | number_of_staff --------------------------------- Sheffield | 100 Rotherham | 150 Leeds | 300 Manchester | 400 Newcastle | 450
In this example the median number of staff is 300, in PostgreSQL the query might look like this (the 0.5 indicates we want the middle value and the ‘order by' is the column that contains the data):
SELECT percentile_cont(0.5) within group ( order by number_of_staff ) FROM reportdata; > 300
This is obviously a pretty simple example, but it gets more interesting if you need to find medians within categories in the result set. Take the next example dataset:
college_name | number_of_staff | college_type ---------------------------------------------------- Sheffield | 100 |FE Rotherham | 150 |FE Leeds | 300 |FE Manchester | 400 |Sixth Form Newcastle | 450 |Sixth Form
Here we have a median for FE colleges of 150, and a median of sixth form colleges of 425. By adding a group by to the query, PostgreSQL returns the correct median within each category:
SELECT college_type, percentile_cont(0.5) within group ( order by number_of_staff ) FROM reportdata2 GROUP BY college_type > FE; 150 > Sixth Form; 425
percentile_cont has been a real time saver and the output, when combined with Google Charts, is a nice data dashboard solution for our client.