Query Timeouts Causing Instability
Summary
While investigating increased query timeouts, we suspected that our ClickHouse cluster might have been under-provisioned. Our ClickHouse Cloud instance had recently been upgraded from version 24.10 to 24.12. A bug in the newer version caused two of our five ClickHouse servers to become unresponsive to log queries while otherwise appearing healthy. We worked with ClickHouse support to resolve the issue, which ultimately required a rollback to the previous version.
Impact
- Source Ingestion: Paused for 30 minutes during the investigation, delaying ingestion by the same amount.
- Detection Delays: Increased timeouts led to query queue buildup, causing detection delays of up to 30 minutes in the worst cases.
- Data Loss Risk: Webhook-based sources without retry implementations may have lost data.
- Potential Data Duplication: In specific scenarios involving large S3 files (many GBs, 100k+ rows), some duplicate events may have appeared. We write to ClickHouse in batches of 50,000 rows, so if some batches were written successfully but ClickHouse failed before we finished processing the file, the already-written batches are re-ingested when the file is reprocessed after recovery.
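To make the duplication scenario concrete, the sketch below shows the batch-write pattern in simplified form. The helper names (read_rows, insert_batch, ack_file_notification) are hypothetical stand-ins for the real S3 reader, ClickHouse client, and queue acknowledgment, not our production code.

```python
# A minimal sketch (not production code) of the batch-write pattern described above.
from typing import Iterator

BATCH_SIZE = 50_000  # we write to ClickHouse in batches of 50k rows


def read_rows(s3_key: str) -> Iterator[dict]:
    """Stand-in for streaming rows out of a large S3 object."""
    yield from ()


def insert_batch(batch: list[dict]) -> None:
    """Stand-in for a ClickHouse INSERT; may fail partway through a file."""


def ack_file_notification(s3_key: str) -> None:
    """Stand-in for acknowledging the S3 notification on the source queue."""


def process_s3_file(s3_key: str) -> None:
    batch: list[dict] = []
    for row in read_rows(s3_key):
        batch.append(row)
        if len(batch) == BATCH_SIZE:
            insert_batch(batch)   # earlier batches may already be committed...
            batch.clear()
    if batch:
        insert_batch(batch)       # ...when a later insert fails mid-file
    # The notification is acknowledged only after the entire file is written, so a
    # mid-file ClickHouse failure leaves it unacked and the whole file is
    # reprocessed after recovery, re-inserting the batches that already succeeded.
    ack_file_notification(s3_key)
```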
Timeline (Timestamps in Pacific Time)
Friday, May 9
- 5:00 PT: ClickHouse completed server upgrade from 24.10 to 24.12
- 8:13 PT: Alerted to queue processors restarting due to unexpected EOF and I/O timeout errors from ClickHouse
- 12:44 PT: Opened case with ClickHouse support after determining ClickHouse was the cause.
- 13:00 PT: Scaled up the cluster hoping to alleviate issues, which inadvertently exacerbated the problem.
- 14:11 PT: Paused ingestion to facilitate ClickHouse team's investigation
- 14:40 PT: Resumed ingestion after ClickHouse implemented initial mitigations
- Afternoon/Evening: ClickHouse took additional actions that reduced timeouts but didn't fully resolve the issue
Saturday, May 10
- ClickHouse identified the root cause as related to the upgrade
- 11:00 PT: Cluster rolled back to version 24.10, restoring query timeouts to normal levels
Communication with ClickHouse
We continue to work with the ClickHouse team to understand the exact nature of the bug that caused this incident. Key points from our ongoing discussion include:
- ClickHouse's initial response was prompt, but diagnosis took longer than expected
- The issue was identified as a bug in version 24.12 that introduced a memory leak.
- We've requested advance notification of at least one week for future upgrades.
- We've also asked what improvements they plan to make to prevent similar issues and to improve their response to future incidents.
[This section will be updated as we continue communication with ClickHouse]
Remediation Plan
- Improve our incident management process so status page communications to customers are more consistent. Keeping a single incident open, rather than opening several, will keep updates consolidated and timely.
- Request at least one week's advance notice from ClickHouse for version upgrades
- Enhance monitoring for query timeouts with richer error information to increase alert fidelity. Excluding queries that time out because they scan a very long time range will make operational issues easier to spot.
- Implement more granular health checks for ClickHouse servers, and add exponential backoff with retries when ClickHouse is unresponsive (see the sketch after this list).
- Develop staging environment load tests for ClickHouse version upgrades
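As a rough illustration of the retry behavior referenced above, here is a minimal sketch of jittered exponential backoff around a ClickHouse call. The run_query callable, the error types caught, and the tuning values are assumptions for illustration, not a finished implementation.

```python
# A minimal sketch of exponential backoff with jitter around ClickHouse calls.
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_backoff(
    call: Callable[[], T],
    retries: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 30.0,
) -> T:
    """Retry a ClickHouse call with jittered exponential backoff."""
    for attempt in range(retries):
        try:
            return call()
        except (TimeoutError, ConnectionError):
            if attempt == retries - 1:
                raise  # give up and let the caller (or the queue) handle it
            delay = min(max_delay, base_delay * 2 ** attempt)
            time.sleep(delay + random.uniform(0, delay / 2))  # jitter avoids thundering herds
    raise RuntimeError("unreachable")


# Example usage with a hypothetical query function:
# result = with_backoff(lambda: run_query("SELECT count() FROM logs"))
```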
Technical Deep Dive: Data Integrity Safeguards
Our architecture employs several safeguards that prevented significant data loss during this incident:
- Message Queue Persistence: Source events (or S3 file notifications) remain unacknowledged in their originating queues until every event, or the entire file, is successfully processed, providing resilience against processing failures.
- Batched Processing with Acknowledgment Controls: We only acknowledge batch completion after successful ClickHouse writes, ensuring data isn't lost during processing.
- File-Based Tracking: For S3 and object storage sources, we track completion at the file level, providing an additional integrity layer.
- Dynamic Windowing for Detections: Detections that use the from and to built-in detection parameters have their windows automatically scaled so that no logs are missed during scans. Always include these parameters in your detections to avoid missing any log lines.
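For illustration only, the sketch below shows one way a detection window built from the from and to parameters can scale so late-arriving data is still covered: each scan starts where the previous one ended and stops short of data still being ingested. The field names and the way ingestion lag is measured here are assumptions, not the exact implementation.

```python
# A minimal sketch of the dynamic-windowing idea behind the `from`/`to` parameters.
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class DetectionWindow:
    frm: datetime  # bound to the `from` detection parameter
    to: datetime   # bound to the `to` detection parameter


def next_window(previous_to: datetime, ingestion_lag: timedelta) -> DetectionWindow:
    """Start where the last scan ended and stop short of data that is still being
    ingested, so the window grows automatically when queries or ingestion fall behind."""
    now = datetime.now(timezone.utc)
    return DetectionWindow(frm=previous_to, to=now - ingestion_lag)
```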
These mechanisms ensured that, while data was delayed, permanent loss was limited to webhook sources without retry capabilities (a small fraction of our total ingestion volume), and detections continued to scan all data as it arrived.
Resolved
ClickHouse identified the cause of the instability, and since they put the mitigation in place we have been stable all day.
Please reach out if you continue to experience any issues or timeouts running queries.
Monitoring
ClickHouse has implemented a mitigation, and we are monitoring system health for any further frequent query timeouts.
Investigating
After restoring service from the incident earlier today, we're seeing connectivity issues reaching the primary ClickHouse cluster.
We're actively investigating with the ClickHouse team and will provide an update as soon as we can.
Logs are still being processed without delay, but queries are occasionally timing out and need to be retried. Some duplicates may occur as a result.