Skip to main content

PostgreSQL - Count() Performance Boost

This blog post might be outdated!
This blog post was published more than one year ago and might be outdated!
· 4 min read
Florian Horn
Business Analyst Digital Sales

Counting is a slow measure in Postgresql, and it is even slower with a lot of data and complex statements. This issue is related with the kind of how the data consistency is implemented. As a result Postgresql must check the visiblity of all rows first and so it has to iterate over them all.

In our project context we use a static cache table with calculated values to avoid using the Postgresql count() method all the time, especially when the numbers did not change at all. So we created a SQL function, which will be called only if the corresponding data has changed.

Sadly, the performance drops heavily on the function call and this is not acceptable due to the fact that the request is synchronous and the user must wait for the request to finish.

In our application we have a lot of products and some of them can be activated respectively be deactivated and all of the products are related to a product group. In our static cache table we summarize information about the product groups and how much products overall and how much active products each one contains.

To fetch the update information we run a SQL statement like this:

SELECT
"pg"."productgroupId",
(
SELECT
count("productId")
FROM
"products" AS "p"
WHERE
"pg"."productgroupId" = "p"."productgroupId"
) as "numberOfAllProducts",
(
SELECT
count("productId")
FROM
"products" AS "ap"
WHERE
"pg"."productgroupId" = "ap"."productgroupId" AND "ap"."isObsolete" = FALSE
) as "numberOfActiveProducts"
FROM
"productgroups" "pg"
GROUP BY
"pg"."productgroupId";

With over 78.000 products and over 370 product groups, this call measures a timeload of approximately 30.000ms (yep, 30 seconds!).

So we optimized our statements by switching the explicit count() call with analyzing the estimated count information. We placed some logic in separate functions (count_estimate() and printf()):

SELECT
"pg"."productgroupId",
count_estimate(
printf(
'SELECT "productId" FROM "products" AS "p" WHERE % = "p"."productgroupId"',
"pg"."productgroupId"
)
) as "numberOfAllProducts",
count_estimate(
printf(
'SELECT "productId" FROM "products" AS "ap" WHERE % = "ap"."productgroupId" AND "ap"."isObsolete" = FALSE',
"pg"."productgroupId"
)
) as "numberOfActiveProducts"
FROM
"productgroups" "pg"
GROUP BY
"pg"."productgroupId";

Now our timeload is approximately 1.200ms, this is a over 2000 percentage boost. Not bad!

The printf() function is mandatory due to the fact that we want to inject information of the outer statement scope, but this scope is not accessable by the count_estimate() method itself. So we had to swap the SQL column reference (the product group id) to its literal value.

The count_estimate() function is defined as this:

CREATE OR REPLACE FUNCTION count_estimate(query text) RETURNS INTEGER AS
$BODY$
DECLARE
rec record;
ROWS INTEGER;
BEGIN
FOR rec IN EXECUTE 'EXPLAIN ' || query LOOP
ROWS := SUBSTRING(rec."QUERY PLAN" FROM ' rows=([[:digit:]]+)');
EXIT WHEN ROWS IS NOT NULL;
END LOOP;

RETURN ROWS;
END
$BODY$
LANGUAGE plpgsql;

The printf() function is defined as this:

CREATE OR REPLACE FUNCTION printf(fmt text, variadic args anyarray) returns text
language plpgsql AS $$
DECLARE
argcnt int = 1;
chrcnt int = 0;
fmtlen int;
CHR text;
output text = '';
BEGIN
fmtlen = LENGTH(fmt);
LOOP
chrcnt = chrcnt + 1;

-- ran out of format string? bail out
IF chrcnt > fmtlen THEN
EXIT;
END IF;

-- grab our char
CHR = substring(fmt, chrcnt, 1);

-- %% means output a single %, and skip them
IF CHR = '%' AND substring(fmt, chrcnt + 1, 1) = '%' THEN
output = output || '%';
chrcnt = chrcnt + 1;
continue;
END IF;

-- a % on its own means output an element from our arg list
IF CHR = '%' THEN
output = output || COALESCE(args[argcnt]::text, '');
argcnt = argcnt + 1;
continue;
END IF;

-- no special case? output the thing
output = output || CHR;
END LOOP;

RETURN output;
END;
$$;