PostgreSQL - Count() Performance Boost
Counting rows is slow in PostgreSQL, and it gets even slower with large tables and complex statements. The reason lies in the way PostgreSQL implements data consistency (MVCC): to produce an exact count, it must check the visibility of every row, so it has to iterate over all of them.
In our project context we use a static cache table with precalculated values to avoid calling PostgreSQL's count() over and over, especially when the numbers have not changed at all. For this we created a SQL function which is called only when the corresponding data has changed.
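Such a cache table could look like the following sketch (table and column names here are illustrative, not our actual schema):

```sql
-- Hypothetical cache table holding precomputed counts per product group.
CREATE TABLE "productgroupCounts" (
    "productgroupId"         INTEGER PRIMARY KEY,
    "numberOfAllProducts"    INTEGER NOT NULL DEFAULT 0,
    "numberOfActiveProducts" INTEGER NOT NULL DEFAULT 0,
    "updatedAt"              TIMESTAMPTZ NOT NULL DEFAULT now()
);
```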
Unfortunately, the performance drops heavily on that function call, which is not acceptable: the request is synchronous, so the user has to wait for it to finish.
In our application we have a lot of products which can be activated and deactivated, and every product belongs to a product group. In our static cache table we summarize, for each product group, how many products it contains overall and how many of them are active.
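For reference, a minimal version of the underlying schema could look like this (simplified and assumed for illustration; the real tables contain more columns):

```sql
CREATE TABLE "productgroups" (
    "productgroupId" SERIAL PRIMARY KEY
);

CREATE TABLE "products" (
    "productId"      SERIAL PRIMARY KEY,
    "productgroupId" INTEGER NOT NULL REFERENCES "productgroups",
    "isObsolete"     BOOLEAN NOT NULL DEFAULT FALSE
);
```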
To fetch the updated information we run a SQL statement like this:
```sql
SELECT
    "pg"."productgroupId",
    (
        SELECT count("productId")
        FROM "products" AS "p"
        WHERE "pg"."productgroupId" = "p"."productgroupId"
    ) AS "numberOfAllProducts",
    (
        SELECT count("productId")
        FROM "products" AS "ap"
        WHERE "pg"."productgroupId" = "ap"."productgroupId"
          AND "ap"."isObsolete" = FALSE
    ) AS "numberOfActiveProducts"
FROM "productgroups" "pg"
GROUP BY "pg"."productgroupId";
```
With over 78,000 products and over 370 product groups, this call takes approximately 30,000 ms (yep, 30 seconds!).
So we optimized the statement by replacing the explicit count() calls with the planner's estimated row counts. We placed the logic in two separate functions, count_estimate() and printf():
```sql
SELECT
    "pg"."productgroupId",
    count_estimate(
        printf(
            'SELECT "productId" FROM "products" AS "p" WHERE % = "p"."productgroupId"',
            "pg"."productgroupId"
        )
    ) AS "numberOfAllProducts",
    count_estimate(
        printf(
            'SELECT "productId" FROM "products" AS "ap" WHERE % = "ap"."productgroupId" AND "ap"."isObsolete" = FALSE',
            "pg"."productgroupId"
        )
    ) AS "numberOfActiveProducts"
FROM "productgroups" "pg"
GROUP BY "pg"."productgroupId";
```
Now the runtime is approximately 1,200 ms, an improvement of over 2,000 percent. Not bad!
The printf() function is necessary because we want to inject information from the outer statement scope, which is not accessible from inside count_estimate() itself. So we had to swap the SQL column reference (the product group id) for its literal value.
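To illustrate the substitution: for a product group with id 42 (a made-up value), the call

```sql
SELECT printf(
    'SELECT "productId" FROM "products" AS "p" WHERE % = "p"."productgroupId"',
    42
);
-- → SELECT "productId" FROM "products" AS "p" WHERE 42 = "p"."productgroupId"
```

produces the literal statement that count_estimate() then passes to EXPLAIN.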
The count_estimate() function is defined as follows:
```sql
CREATE OR REPLACE FUNCTION count_estimate(query text)
RETURNS INTEGER AS $BODY$
DECLARE
    rec  record;
    rows INTEGER;
BEGIN
    -- Run EXPLAIN on the given query and pull the planner's
    -- row estimate out of the first plan line that contains one.
    FOR rec IN EXECUTE 'EXPLAIN ' || query LOOP
        rows := SUBSTRING(rec."QUERY PLAN" FROM ' rows=([[:digit:]]+)');
        EXIT WHEN rows IS NOT NULL;
    END LOOP;
    RETURN rows;
END
$BODY$ LANGUAGE plpgsql;
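You can try count_estimate() on its own; note that it returns the planner's estimate, not an exact count, so the result depends on how current the table statistics are (running ANALYZE beforehand helps):

```sql
SELECT count_estimate('SELECT "productId" FROM "products"');
```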
The printf() function is defined as follows:
```sql
CREATE OR REPLACE FUNCTION printf(fmt text, variadic args anyarray)
RETURNS text
LANGUAGE plpgsql AS $$
DECLARE
    argcnt int = 1;
    chrcnt int = 0;
    fmtlen int;
    chr    text;
    output text = '';
BEGIN
    fmtlen = LENGTH(fmt);
    LOOP
        chrcnt = chrcnt + 1;
        -- ran out of format string? bail out
        IF chrcnt > fmtlen THEN
            EXIT;
        END IF;
        -- grab our char
        chr = substring(fmt, chrcnt, 1);
        -- %% means output a single %, and skip them
        IF chr = '%' AND substring(fmt, chrcnt + 1, 1) = '%' THEN
            output = output || '%';
            chrcnt = chrcnt + 1;
            CONTINUE;
        END IF;
        -- a % on its own means output an element from our arg list
        IF chr = '%' THEN
            output = output || COALESCE(args[argcnt]::text, '');
            argcnt = argcnt + 1;
            CONTINUE;
        END IF;
        -- no special case? output the char as-is
        output = output || chr;
    END LOOP;
    RETURN output;
END;
$$;
```

Tags: Performance, PostgreSQL