PostgreSQL - Count() Performance Boost

This blog post was published more than one year ago and might be outdated!

December 8, 2015 · 4 min read

Business Analyst Digital Sales

Counting is a slow measure in Postgresql, and it is even slower with a lot of data and complex statements. This issue is related with the kind of how the data consistency is implemented. As a result Postgresql must check the visiblity of all rows first and so it has to iterate over them all.

In our project context we use a static cache table with calculated values to avoid using the Postgresql count() method all the time, especially when the numbers did not change at all. So we created a SQL function, which will be called only if the corresponding data has changed.

Sadly, the performance drops heavily on the function call and this is not acceptable due to the fact that the request is synchronous and the user must wait for the request to finish.

In our application we have a lot of products and some of them can be activated respectively be deactivated and all of the products are related to a product group. In our static cache table we summarize information about the product groups and how much products overall and how much active products each one contains.

To fetch the update information we run a SQL statement like this:

SELECT
    "pg"."productgroupId",
    (
        SELECT 
            count("productId") 
        FROM 
            "products" AS "p" 
        WHERE 
            "pg"."productgroupId" = "p"."productgroupId"
    ) as "numberOfAllProducts",
    (
        SELECT 
            count("productId") 
        FROM 
            "products" AS "ap" 
        WHERE 
            "pg"."productgroupId" = "ap"."productgroupId" AND "ap"."isObsolete" = FALSE
    ) as "numberOfActiveProducts"
FROM
    "productgroups" "pg"
GROUP BY
    "pg"."productgroupId";

With over 78.000 products and over 370 product groups, this call measures a timeload of approximately 30.000ms (yep, 30 seconds!).

So we optimized our statements by switching the explicit count() call with analyzing the estimated count information. We placed some logic in separate functions (count_estimate() and printf()):

SELECT
    "pg"."productgroupId",
    count_estimate(
        printf(
            'SELECT "productId" FROM "products" AS "p" WHERE % = "p"."productgroupId"', 
            "pg"."productgroupId"
        )
    )  as "numberOfAllProducts",
    count_estimate(
        printf(
            'SELECT "productId" FROM "products" AS "ap" WHERE % = "ap"."productgroupId" AND "ap"."isObsolete" = FALSE', 
            "pg"."productgroupId"
        )
    ) as "numberOfActiveProducts"
FROM
    "productgroups" "pg"
GROUP BY
    "pg"."productgroupId";

Now our timeload is approximately 1.200ms, this is a over 2000 percentage boost. Not bad!

The printf() function is mandatory due to the fact that we want to inject information of the outer statement scope, but this scope is not accessable by the count_estimate() method itself. So we had to swap the SQL column reference (the product group id) to its literal value.

The count_estimate() function is defined as this:

CREATE OR REPLACE FUNCTION count_estimate(query text) RETURNS INTEGER AS
$BODY$
DECLARE
    rec   record;
    ROWS  INTEGER;
BEGIN
    FOR rec IN EXECUTE 'EXPLAIN ' || query LOOP
        ROWS := SUBSTRING(rec."QUERY PLAN" FROM ' rows=([[:digit:]]+)');
        EXIT WHEN ROWS IS NOT NULL;
    END LOOP;

    RETURN ROWS;
END
$BODY$ 
LANGUAGE plpgsql;

The printf() function is defined as this:

CREATE OR REPLACE FUNCTION printf(fmt text, variadic args anyarray) returns text
language plpgsql AS $$
DECLARE
    argcnt  int = 1;
    chrcnt  int = 0;
    fmtlen  int;
    CHR             text;
    output  text = '';
BEGIN
    fmtlen = LENGTH(fmt);
    LOOP
        chrcnt = chrcnt + 1;

        -- ran out of format string? bail out
        IF chrcnt > fmtlen THEN
            EXIT;
        END IF;

        -- grab our char
        CHR = substring(fmt, chrcnt, 1);

        -- %% means output a single %, and skip them
        IF CHR = '%' AND substring(fmt, chrcnt + 1, 1) = '%' THEN
            output = output || '%';
            chrcnt = chrcnt + 1;
            continue;
        END IF;

        -- a % on its own means output an element from our arg list
        IF CHR = '%' THEN
            output = output || COALESCE(args[argcnt]::text, '');
            argcnt = argcnt + 1;
            continue;
        END IF;

        -- no special case? output the thing
        output = output || CHR;
    END LOOP;

    RETURN output;
END;
$$;