Deadlock Detected Between INSERTs and Retention Policy (drop_chunks) – Any Solutions?

Hi Timescale Community,

We’re running into a recurring deadlock issue in our TimescaleDB setup and would appreciate any advice or proven solutions.

Background:
We use TimescaleDB for storing raw machine data, with regular INSERTs via Kafka Connect (JDBC Sink). We also have a retention policy in place that uses drop_chunks to remove old data.

Additional Info:

  • TimescaleDB version: 2.21.4

  • PostgreSQL version: 17.7

Problem:
On 15./16.10.2025, we experienced a significant outage due to deadlocks between our regular INSERTs and the Timescale retention job. The issue persisted until we restarted the connector. Here’s a summary of what we observed:

  • Error:

    ERROR: deadlock detected Detail: Process 2027782 waits for ShareLock on transaction 158691927; blocked by process 2082871. Process 2082871 waits for ShareLock on transaction 158691928; blocked by process 2027782. Where: while locking tuple (0,4) in relation "dimension_slice"

  • Stacktrace:
    The deadlock occurs between INSERTs into our hypertable and the retention job (policy_retention) that calls drop_chunks. Both operations seem to lock Timescale metadata tables, especially _timescaledb_catalog.dimension_slice.

  • Impact:
    All subsequent transactions failed with current transaction is aborted, commands ignored until end of transaction block until the connector was restarted.

Analysis:

  • The deadlock happens even though drop_chunks only deletes old chunks, because both INSERTs and retention jobs need locks on shared catalog tables.

  • According to Timescale docs, drop_chunks requires exclusive locks, and can fail if there’s concurrent access—even if the data ranges don’t overlap.

Questions for the Community:

  1. Has anyone else experienced deadlocks between INSERTs and retention jobs? How did you resolve them?

  2. Are there best practices for scheduling retention jobs to avoid these conflicts?

  3. Is there a way to make drop_chunks less aggressive in locking, or to prioritize INSERTs?

  4. Would switching to custom retention jobs with tuned lock_timeout and retries be more robust than the built-in policy?

  5. Any other monitoring or mitigation strategies you recommend?

Thanks in advance for your help! Any pointers or shared experiences would be greatly appreciated.

Best,
Thomas