Query Timeouts Causing Instability
Summary
While investigating increased query timeouts, we suspected that our ClickHouse cluster might have been under-provisioned. Our ClickHouse Cloud instance had recently been upgraded from version 24.10 to 24.12. A bug in the newer version caused two of our five ClickHouse servers to become unresponsive to log queries while otherwise appearing healthy. We worked with ClickHouse support to resolve the issue, which ultimately required a rollback to the previous version.
Impact
- Source Ingestion: Paused for 30 minutes during the investigation, delaying ingestion by the same amount.
- Detection Delays: Increased timeouts led to query queue buildup, causing detection delays of up to 30 minutes in the worst cases.
- Data Loss Risk: Webhook-based sources without retry implementations may have lost data.
- Potential Data Duplication: In specific scenarios involving large S3 files (many GBs, 100k+ rows), some duplicate events may have appeared. We write to ClickHouse in batches of 50,000 rows, so if some batches were written successfully but ClickHouse failed before we finished processing the file, the already-written batches are re-ingested when the file is reprocessed after recovery.
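To make the duplication scenario concrete, the sketch below shows the batch-write pattern in simplified form. The helper names (read_rows, insert_batch, ack_file_notification) are hypothetical stand-ins for the real S3 reader, ClickHouse client, and queue acknowledgment, not our production code.

```python
# A minimal sketch (not production code) of the batch-write pattern described above.
from typing import Iterator

BATCH_SIZE = 50_000  # we write to ClickHouse in batches of 50k rows


def read_rows(s3_key: str) -> Iterator[dict]:
    """Stand-in for streaming rows out of a large S3 object."""
    yield from ()


def insert_batch(batch: list[dict]) -> None:
    """Stand-in for a ClickHouse INSERT; may fail partway through a file."""


def ack_file_notification(s3_key: str) -> None:
    """Stand-in for acknowledging the S3 notification on the source queue."""


def process_s3_file(s3_key: str) -> None:
    batch: list[dict] = []
    for row in read_rows(s3_key):
        batch.append(row)
        if len(batch) == BATCH_SIZE:
            insert_batch(batch)   # earlier batches may already be committed...
            batch.clear()
    if batch:
        insert_batch(batch)       # ...when a later insert fails mid-file
    # The notification is acknowledged only after the entire file is written, so a
    # mid-file ClickHouse failure leaves it unacked and the whole file is
    # reprocessed after recovery, re-inserting the batches that already succeeded.
    ack_file_notification(s3_key)
```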
Timeline (Timestamps in Pacific Time)
Friday, May 9
- 5:00 PT: ClickHouse completed server upgrade from 24.10 to 24.12
- 8:13 PT: Alerted to queue processors restarting due to unexpected EOF and I/O timeout errors from ClickHouse
- 12:44 PT: Opened case with ClickHouse support after determining ClickHouse was the cause.
- 13:00 PT: Scaled up the cluster hoping to alleviate issues, which inadvertently exacerbated the problem.
- 14:11 PT: Paused ingestion to facilitate ClickHouse team's investigation
- 14:40 PT: Resumed ingestion after ClickHouse implemented initial mitigations
- Afternoon/Evening: ClickHouse took additional actions that reduced timeouts but didn't fully resolve the issue
Saturday, May 10
- ClickHouse identified the root cause as related to the upgrade
- 11:00 PT: Cluster rolled back to version 24.10, restoring query timeouts to normal levels
Communication with ClickHouse
We continue to work with the ClickHouse team to understand the exact nature of the bug that caused this incident. Key points from our ongoing discussion include:
- ClickHouse's initial response was prompt, but diagnosis took longer than expected
- The issue was identified as a bug in version 24.12 that introduced a memory leak.
- We've requested advance notification of at least one week for future upgrades.
- We've also asked what improvements they plan to make to prevent similar issues and to improve their response to future incidents.
[This section will be updated as we continue communication with ClickHouse]
Remediation Plan
- Improve our incident management process so status page communications to customers are more consistent. Keeping a single incident open, rather than opening several, will keep updates consolidated and timely.
- Request at least one week's advance notice from ClickHouse for version upgrades
- Enhance monitoring for query timeouts with richer error information to increase alert fidelity. Excluding queries that time out because they scan a very long time range will make operational issues easier to spot.
- Implement more granular health checks for ClickHouse servers, and add exponential backoff with retries when ClickHouse is unresponsive (see the sketch after this list).
- Develop staging environment load tests for ClickHouse version upgrades
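As a rough illustration of the retry behavior referenced above, here is a minimal sketch of jittered exponential backoff around a ClickHouse call. The run_query callable, the error types caught, and the tuning values are assumptions for illustration, not a finished implementation.

```python
# A minimal sketch of exponential backoff with jitter around ClickHouse calls.
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_backoff(
    call: Callable[[], T],
    retries: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 30.0,
) -> T:
    """Retry a ClickHouse call with jittered exponential backoff."""
    for attempt in range(retries):
        try:
            return call()
        except (TimeoutError, ConnectionError):
            if attempt == retries - 1:
                raise  # give up and let the caller (or the queue) handle it
            delay = min(max_delay, base_delay * 2 ** attempt)
            time.sleep(delay + random.uniform(0, delay / 2))  # jitter avoids thundering herds
    raise RuntimeError("unreachable")


# Example usage with a hypothetical query function:
# result = with_backoff(lambda: run_query("SELECT count() FROM logs"))
```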
Technical Deep Dive: Data Integrity Safeguards
Our architecture employs several safeguards that prevented significant data loss during this incident:
- Message Queue Persistence: Source events (or S3 file notifications) remain unacknowledged in their originating queues until every event, or the entire file, is successfully processed, providing resilience against processing failures.
- Batched Processing with Acknowledgment Controls: We only acknowledge batch completion after successful ClickHouse writes, ensuring data isn't lost during processing.
- File-Based Tracking: For S3 and object storage sources, we track completion at the file level, providing an additional integrity layer.
- Dynamic Windowing for Detections: Detections that use the from and to built-in detection parameters have their windows automatically scaled so that no logs are missed during scans. Always include these parameters in your detections to avoid missing any log lines.
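For illustration only, the sketch below shows one way a detection window built from the from and to parameters can scale so late-arriving data is still covered: each scan starts where the previous one ended and stops short of data still being ingested. The field names and the way ingestion lag is measured here are assumptions, not the exact implementation.

```python
# A minimal sketch of the dynamic-windowing idea behind the `from`/`to` parameters.
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class DetectionWindow:
    frm: datetime  # bound to the `from` detection parameter
    to: datetime   # bound to the `to` detection parameter


def next_window(previous_to: datetime, ingestion_lag: timedelta) -> DetectionWindow:
    """Start where the last scan ended and stop short of data that is still being
    ingested, so the window grows automatically when queries or ingestion fall behind."""
    now = datetime.now(timezone.utc)
    return DetectionWindow(frm=previous_to, to=now - ingestion_lag)
```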
These mechanisms ensured that, while data was delayed, permanent loss was limited to webhook sources without retry capabilities (a small fraction of our total ingestion volume), and detections continued to scan all data as it arrived.
Resolved
ClickHouse identified the cause of the instability, and since they put the mitigation in place we have been stable all day.
Please reach out if you continue to experience any issues or timeouts running queries.
Monitoring
ClickHouse has implemented a mitigation, and we are monitoring system health for any further frequent query timeouts.
Investigating
After restoring service from the incident earlier today, we're seeing connectivity issues reaching the primary ClickHouse cluster.
We're actively investigating with the ClickHouse team and will provide an update as soon as we can.
Logs are still being processed without delay, but queries are occasionally timing out and need to be retried. Some duplicates may occur as a result.