All blog posts
Tech
How to do histograms in PostgreSQL
Learn how to create in-table histograms with PostgreSQL.
UPDATED FOR 2017 Now with easy subquery support and a more sensible argument order!
We adapted this excellent piece by Dmitri Fontaine and turned it into a function.
It can be invoked like this:
SELECT * FROM histogram($table_name_or_subquery, $column_name);
. . . to give sweet results like this, in a check of the distribution of 2016 political contributions in Vermont:
fec=# SELECT * FROM histogram('(SELECT * FROM small_donors_vt LIMIT 50000)', 'transaction_amt');
bucket | range | freq | bar
--------+-----------+------+-----------------
1 | [0,9] | 2744 | ******
2 | [10,19] | 5630 | *************
3 | [20,29] | 6383 | ***************
4 | [30,39] | 1290 | ***
5 | [40,49] | 369 | *
6 | [50,59] | 3541 | ********
7 | [60,69] | 174 |
8 | [70,79] | 313 | *
9 | [80,89] | 171 |
10 | [90,99] | 65 |
11 | [100,109] | 2363 | ******
12 | [110,119] | 51 |
13 | [120,129] | 115 |
14 | [130,139] | 32 |
15 | [140,146] | 11 |
16 | [150,159] | 187 |
17 | [160,169] | 24 |
18 | [170,177] | 33 |
19 | [180,189] | 19 |
20 | [191,199] | 24 |
21 | [200,200] | 795 | **
Use it yourself by adding this to your postgres setup:
CREATE OR REPLACE FUNCTION histogram(table_name_or_subquery text, column_name text)
RETURNS TABLE(bucket int, "range" numrange, freq bigint, bar text)
AS $func$
BEGIN
RETURN QUERY EXECUTE format('
WITH
source AS (
SELECT * FROM %s
),
min_max AS (
SELECT min(%s) AS min, max(%s) AS max FROM source
),
histogram AS (
SELECT
width_bucket(%s, min_max.min, min_max.max, 20) AS bucket,
numrange(min(%s)::numeric, max(%s)::numeric, ''[]'') AS "range",
count(%s) AS freq
FROM source, min_max
WHERE %s IS NOT NULL
GROUP BY bucket
ORDER BY bucket
)
SELECT
bucket,
"range",
freq::bigint,
repeat(''*'', (freq::float / (max(freq) over() + 1) * 15)::int) AS bar
FROM histogram',
table_name_or_subquery,
column_name,
column_name,
column_name,
column_name,
column_name,
column_name,
column_name
);
END
$func$ LANGUAGE plpgsql;
Note:
- You don't need to use a subquery - you can also just provide a table name as the first argument.
- Subqueries are useful for sampling. For example,
(SELECT * FROM bigtable LIMIT 500 ORDER BY RANDOM()
. Don't forget parentheses! - You can call this as
SELECT histogram()
orSELECT * FROM histogram()
. The latter form is much more legible!
Happy querying!
Ready for easy AI?
Skip the ML struggle and focus on your downstream application. We have built-in sample data so you can get started without sharing yours.