How to do histograms in PostgreSQL

How to do histograms in PostgreSQL

·

3 min read

UPDATED FOR 2017 Now with easy subquery support and a more sensible argument order!

We adapted this excellent piece by Dmitri Fontaine and turned it into a function.

It can be invoked like this:

SELECT * FROM histogram($table_name_or_subquery, $column_name);

. . . to give sweet results like this, in a check of the distribution of 2016 political contributions in Vermont:

fec=# SELECT * FROM histogram('(SELECT * FROM small_donors_vt LIMIT 50000)', 'transaction_amt');

 bucket |   range   | freq |       bar
--------+-----------+------+-----------------
      1 | [0,9]     | 2744 | ******
      2 | [10,19]   | 5630 | *************
      3 | [20,29]   | 6383 | ***************
      4 | [30,39]   | 1290 | ***
      5 | [40,49]   |  369 | *
      6 | [50,59]   | 3541 | ********
      7 | [60,69]   |  174 |
      8 | [70,79]   |  313 | *
      9 | [80,89]   |  171 |
     10 | [90,99]   |   65 |
     11 | [100,109] | 2363 | ******
     12 | [110,119] |   51 |
     13 | [120,129] |  115 |
     14 | [130,139] |   32 |
     15 | [140,146] |   11 |
     16 | [150,159] |  187 |
     17 | [160,169] |   24 |
     18 | [170,177] |   33 |
     19 | [180,189] |   19 |
     20 | [191,199] |   24 |
     21 | [200,200] |  795 | **

Use it yourself by adding this to your postgres setup:

CREATE OR REPLACE FUNCTION histogram(table_name_or_subquery text, column_name text)
RETURNS TABLE(bucket int, "range" numrange, freq bigint, bar text)
AS $func$
BEGIN
RETURN QUERY EXECUTE format('
  WITH
  source AS (
    SELECT * FROM %s
  ),
  min_max AS (
    SELECT min(%s) AS min, max(%s) AS max FROM source
  ),
  histogram AS (
    SELECT
      width_bucket(%s, min_max.min, min_max.max, 20) AS bucket,
      numrange(min(%s)::numeric, max(%s)::numeric, ''[]'') AS "range",
      count(%s) AS freq
    FROM source, min_max
    WHERE %s IS NOT NULL
    GROUP BY bucket
    ORDER BY bucket
  )
  SELECT
    bucket,
    "range",
    freq::bigint,
    repeat(''*'', (freq::float / (max(freq) over() + 1) * 15)::int) AS bar
  FROM histogram',
  table_name_or_subquery,
  column_name,
  column_name,
  column_name,
  column_name,
  column_name,
  column_name,
  column_name
  );
END
$func$ LANGUAGE plpgsql;

Note:

  1. You don't need to use a subquery - you can also just provide a table name as the first argument.
  2. Subqueries are useful for sampling. For example, (SELECT * FROM bigtable LIMIT 500 ORDER BY RANDOM(). Don't forget parentheses!
  3. You can call this as SELECT histogram() or SELECT * FROM histogram(). The latter form is much more legible!

Happy querying!


The best of AI, right in your inbox