RunReveal Status
Apr 9, 2026
RunReveal Data Delay
RunReveal
Resolved · April 10 at 4:28 PM (in 1 day)

Incident Report: Service Degradation - April 9, 2026

Summary

On April 9, 2026, RunReveal experienced degraded performance caused by slow query execution on one of our ClickHouse data warehouses following a patch upgrade by ClickHouse Cloud. Data ingestion was delayed by up to 5 hours while we worked to mitigate the issue. The API and web app remained available throughout.

Timeline

  • ClickHouse Cloud applied a routine patch upgrade to our production warehouse.
  • Shortly after, query performance degraded significantly, causing ingestion pipelines to back up.
  • To unblock data ingestion, we migrated to a new ClickHouse warehouse pointing at the same coherent dataset in S3.
  • During recovery, our IP geolocation database (IPDB) failed to load due to degraded performance on a shared EFS volume, preventing some backend services from restarting cleanly.
  • We disabled IP geolocation enrichment to unblock service recovery.
  • Ingestion fully caught up and all services were restored.
  • IP geolocation enrichment was re-enabled after migrating IPDB from shared storage to per-pod downloads from S3.

Impact

  • Data ingestion was delayed by up to 5 hours during the incident window. No data was lost.
  • Log events ingested during the incident were not enriched with IP geolocation data (city, country, etc.). The events themselves were stored normally.
  • Search queries were slow or timed out during the period of ClickHouse degradation.

Resolution

  • We migrated query and ingestion workloads to a new ClickHouse warehouse, restoring normal performance.
  • IP geolocation enrichment was re-enabled after rearchitecting the IPDB delivery mechanism to eliminate the EFS dependency.
  • We are awaiting confirmation from ClickHouse Cloud on the root cause of the query performance degradation following their patch upgrade.
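For readers who want a concrete picture of the IPDB change, here is a minimal sketch of the general pattern: each pod downloads its own copy of the database to local disk at startup, and a failed download disables enrichment instead of blocking the restart. This is an illustration under stated assumptions, not RunReveal's production code. The names `load_ipdb`, `enrich`, and the `fetch` callable are hypothetical; in a real deployment `fetch` would wrap an S3 client download, which is omitted here so the example stays self-contained.

```python
import os
import tempfile
from pathlib import Path
from typing import Callable, Optional


def load_ipdb(fetch: Callable[[Path], None], dest: Path) -> Optional[Path]:
    """Download the IP database to pod-local disk; return None on failure.

    `fetch` is a hypothetical callable that writes the database to the
    given path (for example, a thin wrapper around an object-store client).
    """
    dest.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temp file in the same directory so the final rename is atomic.
    fd, tmp_name = tempfile.mkstemp(dir=dest.parent, suffix=".tmp")
    os.close(fd)
    tmp = Path(tmp_name)
    try:
        fetch(tmp)
        if tmp.stat().st_size == 0:
            raise ValueError("downloaded IP database is empty")
        os.replace(tmp, dest)  # atomic when source and target share a filesystem
        return dest
    except Exception as exc:
        # Degrade instead of crashing: the service starts without enrichment.
        print(f"ipdb unavailable, enrichment disabled: {exc}")
        tmp.unlink(missing_ok=True)
        return None


def enrich(event: dict, ipdb_path: Optional[Path]) -> dict:
    """Attach geolocation only when the database loaded; otherwise pass through."""
    if ipdb_path is None:
        return event
    # Placeholder lookup: a real service would query the database file here.
    return {**event, "geo": {"source": ipdb_path.name}}
```

With this shape, a slow or unavailable storage backend costs only the enrichment fields; events are still accepted and stored, which matches the behavior described in the Impact section.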
Monitoring · April 9 at 10:28 PM (18 hours earlier)

We were able to mitigate the issue and data ingestion has now fully caught up.

The root cause is still under investigation and we'll have an update for you tomorrow.

RunReveal experienced a period of degraded performance caused by slow query execution on one of our ClickHouse data warehouses.

During the incident, data ingestion was delayed by up to 5 hours, and IP geolocation enrichment was temporarily unavailable. The API and web app remained available throughout.

Monitoring · April 9 at 7:50 PM (3 hours earlier)

We're slowly re-enabling all of our queues to warm up our datastore. We've resolved a few unrelated issues that slowed down the fix. A more detailed status report will be posted here, and we expect service to be fully restored shortly.

Identified · April 9 at 5:57 PM (2 hours earlier)

We're continuing to work on the data delay issue. No data is being lost, and the API / web experience should continue working normally.

We're working on rolling back an upgrade to our underlying data store and have paused data ingestion while we do so. We will provide an update soon.

Investigating · April 9 at 4:04 PM (2 hours earlier)

We're dealing with a data delay caused by a recent change we released to production.
