STATS_AGG with NULL to save storage space in Continuous Aggregates

I’m using a Hypertable to store metrics which are quite sparse. So for, example, the table has around 6 columns but generally only one of these are set per row.

I’m using a continuous aggregate with STATS_AGG for each of these columns and that all works nicely. The query is something like this:

CREATE MATERIALIZED VIEW my_ca
    WITH (timescaledb.continuous) AS
    SELECT
"deviceId",
time_bucket(INTERVAL '1 hour', ts), 
STATS_AGG(a),
STATS_AGG(b)
from my_hypertable 
group by 1, 2;

The issue is that STATS_AGG seems to generate a non-null result even when there is no data. E.g. an empty column result looks like this:

(version:1,n:0,sx:0,sx2:0,sx3:0,sx4:0), 

The n is presumably the count, and indeed if you run AVERAGE on this, it will return NULL (which makes sense).

My hunch is that this takes up a lot of space in the CA compared to a NULL value (as NULLs get handled in a special way by Postgres, so I assume it’s the same for Timescale). Am I correct?

I imagine that these will compress very nicely when in the hypercore, but I’d still like to reduce the rowstore usage if possible.

If I’m correct that there is a storage saving here, is there a way to have this instead store NULL when n=0?

I notice that AVERAGE, for example, already handles NULL:

-- These both return NULL
select AVERAGE('(version:1,n:0,sx:0,sx2:0,sx3:0,sx4:0)'::StatsSummary1D);
select AVERAGE(NULL::StatsSummary1D);

I guess I could wrap STATS_AGG in a custom function which returns NULL but I’m not sure of the performance impact of that and I’m also not sure if treating n=0 as NULL is technically correct. Can n=0 be treated equivalently as NULL?

Is it worth pursuing this?

Hmm this is actually much harder than I thought.

Because STATS_AGG is an aggregate function, you can’t just wrap it in another function as it involves more complex accumulation and finalize behaviour.

I also tried this approach but Timescale complains that it’s not a suitable CA (I guess the nested SELECT hides the GROUP BY from Timescale):

CREATE OR REPLACE FUNCTION stats_agg_or_null(summary StatsSummary1D)
RETURNS StatsSummary1D
LANGUAGE plpgsql
AS $$
BEGIN
    IF summary IS NULL OR summary.n = 0 THEN
        RETURN NULL;
    ELSE
        RETURN summary;
    END IF;
END;
$$;

CREATE MATERIALIZED VIEW my_ca
    WITH (timescaledb.continuous) AS
    SELECT "deviceId", stats_agg_or_null("a"), stats_agg_or_null("b) 
   FROM (
    SELECT
      "deviceId",
      time_bucket(INTERVAL '1 hour', ts), 
      STATS_AGG(a),
      STATS_AGG(b)
    FROM my_hypertable 
    group by 1, 2
   );

OK actually using a FILTER WHERE seems to work:

CREATE MATERIALIZED VIEW my_ca
    WITH (timescaledb.continuous) AS
    SELECT
"deviceId",
time_bucket(INTERVAL '1 hour', ts), 
STATS_AGG(a) FILTER (WHERE a IS NOT NULL),
STATS_AGG(b) FILTER (WHERE b IS NOT NULL),
from my_hypertable 
group by 1, 2;

It’s maybe a bit hacky (this feels like it should be done by the STATS_AGG function), and it probably adds more cost to the refresh query but still does the job.