Calculating medians in PostgreSQL with percentile_cont

Dashboards & Data Warehousing  / Bespoke Software

SkillsLogic is developing a big reporting and data dashboard solution and as part of that we need to calculate a lot of medians.

Remember from school, the median of a list of numbers is found by sorting the values from lowest to highest and then picking the middle one. If there’s no middle then the median is typically defined to be the mean of the two middle values.

PostgreSQL is our database of choice and with v9.4 it got a new function - percentile_cont - that makes it much easier to perform this type of calculation on a dataset (similar versions are available in SQL Server and Oracle).

The description of percentile_cont in the PostgreSQL documentation is a bit bewildering:

  • percentile_cont - continuous percentile: returns a value corresponding to the specified fraction in the ordering, interpolating between adjacent input items if needed

In practice it’s quite straightforward to use percentile_cont to get a median average from a list. Take the following simplified dataset :

	college_name 		| number_of_staff
	---------------------------------
	Sheffield		| 100
	Rotherham		| 150
	Leeds			| 300
	Manchester		| 400
	Newcastle		| 450

In this example the median number of staff is 300, in PostgreSQL the query might look like this (the 0.5 indicates we want the middle value and the ‘order by' is the column that contains the data):

	SELECT 		
		percentile_cont(0.5) within group ( order by number_of_staff )
	FROM		
		reportdata;

	> 300

This is obviously a pretty simple example, but it gets more interesting if you need to find medians within categories in the result set. Take the next example dataset:

	college_name 	| number_of_staff 	| college_type
	----------------------------------------------------
	Sheffield	| 100				|FE
	Rotherham	| 150				|FE
	Leeds		| 300				|FE
	Manchester	| 400				|Sixth Form
	Newcastle	| 450				|Sixth Form

Here we have a median for FE colleges of 150, and a median of sixth form colleges of 425. By adding a group by to the query, PostgreSQL returns the correct median within each category:

	SELECT
		college_type,
		percentile_cont(0.5) within group ( order by number_of_staff )
	FROM
		reportdata2
	GROUP BY
		college_type

	> FE; 150
	> Sixth Form; 425

percentile_cont has been a real time saver and the output, when combined with Google Charts, is a nice data dashboard solution for our client.

SkillsLogic develops data warehouse and business intelligence solutions.

Talk to us today and find out how we can help you get more insights from your data.


Contact Us

Contact Us

SkillsLogic Ltd,
Cooper Buildings,
Sheffield Technology Park,
Arundel Street,
Sheffield,
S1 2NS

+44 (0)114 383 0093
info@skillslogic.com


Follow us on Twitter: