"Transactionness" of continuous aggregates refresh

We saw a problem where a long continuous aggregate refresh, invoked by the refresh policy, kept a transaction open (as reported by pg_stat_activity.backend_xid) for the whole duration of the run. This manifested as autovacuum being unable to clean up dead tuples until the refresh job completed.
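For reference, this is roughly the query we used to spot the long-lived transaction (a sketch against stock pg_stat_activity columns; the output ordering and column trimming are just illustrative):

```sql
-- List backends that currently pin the xid horizon,
-- oldest transaction first.
SELECT pid,
       backend_xid,
       backend_xmin,
       now() - xact_start AS xact_age,
       left(query, 60)    AS query
FROM pg_stat_activity
WHERE backend_xid IS NOT NULL
   OR backend_xmin IS NOT NULL
ORDER BY xact_start;
```

The refresh job's backend shows up here with a steadily growing xact_age while autovacuum's cleanup horizon stays pinned behind it.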

When looking at the API docs of add_continuous_aggregate_policy there is the following note:

Because each batch is an individual transaction, executing a policy in batches make the data visible for the users before the entire job is executed.

This can be understood as the refresh call performing multiple transactions within a single invocation, which contradicts the stalled autovacuum behaviour we observed.

Could you please clarify how a continuous aggregate refresh interacts with the xid horizon PostgreSQL tracks for every backend? The remediation differs depending on what it actually does:

  • If a single refresh job truly uses multiple transactions, so that the job’s backend_xid advances as each batch (one bucket by default) finishes, then our remediation should be to reduce the bucket width of the continuous aggregates.
  • If a single refresh job keeps one transaction open for the whole duration (and the documentation is somewhat inaccurate), then our remediation should be to limit max_batches_per_execution so the job releases its transaction more often.
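For concreteness, the two knobs in question look roughly like this (names and values are illustrative; max_batches_per_execution assumes a TimescaleDB version that supports batched refresh policies):

```sql
-- (a) Bucket width is fixed when the continuous aggregate is defined:
CREATE MATERIALIZED VIEW metrics_hourly
WITH (timescaledb.continuous) AS
SELECT time_bucket('1 hour', ts) AS bucket,  -- smaller width => smaller batches
       avg(value) AS avg_value
FROM metrics
GROUP BY bucket;

-- (b) Batch count per policy run is capped on the policy itself:
SELECT add_continuous_aggregate_policy('metrics_hourly',
  start_offset              => INTERVAL '7 days',
  end_offset                => INTERVAL '1 hour',
  schedule_interval         => INTERVAL '30 minutes',
  max_batches_per_execution => 4);
```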

Any insights into these nitty-gritty details are much appreciated.

OK, I have checked the code and it seems to be both:

  • the scheduled continuous aggregate refresh policy starts the refresh process in a backend. That backend has its own xid, and that xid doesn’t move for the whole run.
  • the refresh process calculates batches based on start and end times, bucket width, and invalidation thresholds. For each batch it starts a new transaction (actually two) over the SPI interface. These transactions are committed as each batch finishes, which is why aggregation results become available for querying incrementally.

So overall there is always a transaction open for the duration of the job, and we need to implicitly limit its runtime with max_batches_per_execution to allow vacuum to run efficiently. We also need to ensure that the buckets are small enough that a job can process multiple buckets and still stay within our desired wall-clock limit.
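An existing refresh policy can be adjusted without recreating it by patching the job config; a sketch, assuming the config key matches what timescaledb_information.jobs reports on your version:

```sql
-- Cap each policy run at 4 batches for all continuous aggregate
-- refresh jobs (verify the config key on your TimescaleDB version).
SELECT alter_job(
         job_id,
         config => config || '{"max_batches_per_execution": 4}'::jsonb)
FROM timescaledb_information.jobs
WHERE proc_name = 'policy_refresh_continuous_aggregate';
```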

It is probably also worth putting an upper cap on refresh job execution with the job’s max_runtime, but because it can waste compute (cancelling a job in the middle of a batch discards that batch’s progress), it shouldn’t be the primary control mechanism.
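As a secondary safety net, that cap can be set via alter_job (the job id placeholder is left as-is; look it up in timescaledb_information.jobs):

```sql
-- A job exceeding max_runtime is cancelled, so progress of the
-- in-flight batch is lost; already-committed batches are kept.
SELECT alter_job(<job_id>, max_runtime => INTERVAL '15 minutes');
```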