RunReveal Status

Jun 13, 2025

RunReveal is investigating data delays in our platform

    Resolved

    RunReveal June 9th BYODB incident

    RunReveal deployed a change on June 9th to help our "bring your own database" customers more easily manage database migrations. This change allows customers to automatically update their databases, choose when to apply migrations, and no longer rely on the RunReveal team to keep their schemas up to date.

    Unfortunately there was a subtle bug that caused data delays for object storage and API polling sources, and data loss for webhook sources that do not retry failed requests.

    What was the bug

    Our bring your own database destinations have a variety of settings associated with them. These settings tell us how large the batches we write to their databases should be, what the flush frequency should be if the batch size is not reached, whether we should connect with custom headers, and so on.
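
    For illustration, the settings object looks roughly like the sketch below. The field names here are illustrative, not our exact schema; only the general shape matters for the bug described next.

    package types

    import "time"

    // Duration wraps time.Duration so it can carry custom JSON handling
    // (see the fix further down).
    type Duration time.Duration

    // DestinationSettings is an illustrative sketch of a BYODB destination's
    // settings object; the real field names and layout may differ.
    type DestinationSettings struct {
    	BatchSize     int               `json:"batchSize"`
    	FlushDuration Duration          `json:"flushDuration"`
    	CustomHeaders map[string]string `json:"customHeaders,omitempty"`
    }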

    As part of this change, the settings object was rewritten to carry additional context, and the format of the FlushDuration was subtly changed. When we read the flush duration it can arrive in two separate formats: a human-readable string like 30s, or an integer number of milliseconds like 30000, due to how JavaScript represents timestamps.

    We accidentally left the JSON marshaling logic unimplemented for the FlushDuration in this settings object, so Go's default behavior serialized it as int64 nanoseconds, while the custom FlushDuration unmarshalling treated that number as milliseconds. The result was a flush duration a million times longer than intended, which meant smaller sources that relied on this flush duration were essentially never being synced.
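
    A minimal standalone snippet (illustrative only, not our production code) shows the scale of the inflation:

    package main

    import (
    	"encoding/json"
    	"fmt"
    	"time"
    )

    func main() {
    	// Without a custom MarshalJSON, Go serializes a time.Duration as its
    	// underlying int64, which is nanoseconds.
    	flush := 30 * time.Second
    	b, _ := json.Marshal(flush)
    	fmt.Println(string(b)) // 30000000000

    	// The unmarshalling path fell back to treating a bare number as
    	// milliseconds, inflating the value by a factor of one million.
    	var ms int64
    	_ = json.Unmarshal(b, &ms)
    	fmt.Println(time.Duration(ms) * time.Millisecond) // 8333h20m0s, not 30s
    }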

    Below is the fix we implemented to get the flush duration into the format it was expected to be in:

    @@ -69,26 +69,31 @@ types/destinations.go
    type Duration time.Duration
    func (d *Duration) UnmarshalJSON(b []byte) error {
    	var s string
    	if err := json.Unmarshal(b, &s); err != nil {
    		// fallback to int milliseconds for compatibility with old code
    		var i int
    		if err2 := json.Unmarshal(b, &i); err2 == nil {
    			*d = Duration(time.Duration(i) * time.Millisecond)
    			return nil
    		}
    		return err
    	}
    	t, err := time.ParseDuration(s)
    	if err != nil {
    		return err
    	}
    	*d = Duration(t)
    	return nil
    }
    
    +func (d Duration) MarshalJSON() ([]byte, error) {
    +	t := time.Duration(d)
    +	return json.Marshal(t.String())
    +}
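
    As a sanity check, a round-trip test along these lines (illustrative, assuming the Duration type above lives in our types package) confirms that the fixed code writes the string form and still accepts the legacy integer-millisecond form:

    package types

    import (
    	"encoding/json"
    	"testing"
    	"time"
    )

    func TestDurationRoundTrip(t *testing.T) {
    	d := Duration(30 * time.Second)

    	// With MarshalJSON implemented, we emit the human-readable string form.
    	b, err := json.Marshal(d)
    	if err != nil {
    		t.Fatal(err)
    	}
    	if string(b) != `"30s"` {
    		t.Fatalf(`expected "30s", got %s`, b)
    	}

    	// The string form round-trips back to the same duration.
    	var back Duration
    	if err := json.Unmarshal(b, &back); err != nil {
    		t.Fatal(err)
    	}
    	if back != d {
    		t.Fatalf("round trip changed value: %v", time.Duration(back))
    	}

    	// The legacy integer-millisecond form is still accepted.
    	var legacy Duration
    	if err := json.Unmarshal([]byte("30000"), &legacy); err != nil {
    		t.Fatal(err)
    	}
    	if time.Duration(legacy) != 30*time.Second {
    		t.Fatalf("expected 30s, got %v", time.Duration(legacy))
    	}
    }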
    

    What was the impact of this issue

    This issue only impacted customers using bring your own database configurations. For sources we read from object storage and sources we poll APIs for, data was delayed during this incident. However, data we received from webhooks was lost between 2025-06-09T18:40:00Z and 2025-06-10T01:00:00Z for impacted customers.

    Sources that send us HTTP requests (webhooks) lost data during that roughly six-hour window if the sender did not retry failed requests once the API service was restored. We are deeply sorry about this. We believe that RunReveal going down or having operational issues shouldn't cause data loss for our customers, and we're working to make that true.

    What are we doing about this

    There are two key areas we can improve from this incident.

    1. This issue took us far too long to become aware of. It impacted only a subset of customers and a subset of those customers' sources. We need better per-source and per-customer monitoring: RunReveal processes hundreds of thousands of logs per second, and our monitoring is focused on high-level observability metrics. We also need to do better on smaller-volume log sources, with alerting that covers BYODB and Cloud customers alike.
    2. We need better resilience for webhook sources. It's not okay for our object storage and API polling sources to be resilient while webhooks are simply lost when we have operational issues. We plan to change how we handle webhooks by queueing them immediately upon ingest so they aren't lost; the queue can then safely hold logs for a while during any future operational issues.

    The work required to improve the resilience of webhook sources is very straightforward and well-scoped, so we expect to start working on this problem in the next few weeks.
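
    To sketch the direction (this is illustrative; the actual queue, routes, and handler names may differ), the webhook endpoint would durably enqueue the raw payload before acknowledging the sender, so a downstream outage delays processing rather than losing the event:

    package main

    import (
    	"context"
    	"io"
    	"log"
    	"net/http"
    )

    // Queue is a stand-in for whatever durable queue backs webhook ingest;
    // the interface and names here are hypothetical.
    type Queue interface {
    	Enqueue(ctx context.Context, sourceID string, payload []byte) error
    }

    // webhookHandler acknowledges the sender only after the raw payload has
    // been durably queued, so downstream processing problems delay the data
    // instead of dropping it.
    func webhookHandler(q Queue) http.HandlerFunc {
    	return func(w http.ResponseWriter, r *http.Request) {
    		body, err := io.ReadAll(r.Body)
    		if err != nil {
    			http.Error(w, "could not read body", http.StatusBadRequest)
    			return
    		}
    		sourceID := r.PathValue("source") // assumes a route like POST /sources/{source}
    		if err := q.Enqueue(r.Context(), sourceID, body); err != nil {
    			// Return a retryable status rather than silently dropping the event.
    			log.Printf("enqueue failed for source %s: %v", sourceID, err)
    			http.Error(w, "temporarily unavailable", http.StatusServiceUnavailable)
    			return
    		}
    		w.WriteHeader(http.StatusAccepted)
    	}
    }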

    Follow-up questions

    If you have follow-up questions, feel free to reach out to evan@ and alan@, and we'd be happy to walk you through the bug, the metrics we have on the incident, and our plans to make RunReveal even more resilient.

    Investigating

    This incident only impacted a small handful of bring your own database customers. We're working on a postmortem document to share the details and learnings from this incident and expect to have it posted by June 11th.

    Investigating

    We're monitoring a data delay in RunReveal.

    May 12, 2025

    Query Timeouts Causing Instability

      Resolved

      Summary

      While investigating increased query timeouts, we suspected that our ClickHouse cluster might have been under-provisioned. Our ClickHouse Cloud instance had recently been upgraded from version 24.10 to 24.12. A bug in the newer version caused two of our five ClickHouse servers to become unresponsive to log queries while otherwise appearing healthy. We worked with ClickHouse support to resolve the issue, which ultimately required rolling back to the previous version.

      Impact

      • Source Ingestion: Paused for 30 minutes during investigation, causing delays of that duration.
      • Detection Delays: Increased timeouts led to query queue buildup, causing detection delays of up to 30 minutes in the worst cases.
      • Data Loss Risk: Webhook-based sources without retry implementations may have lost data.
      • Potential Data Duplication: In specific scenarios involving large S3 files (many GBs, 100k+ rows), some duplicate events may have appeared. We write to ClickHouse in batches of 50k rows, so if some batches were written successfully but ClickHouse failed before we finished processing the file, the already-written events are duplicated when we reprocess the file after recovery.

      Timeline (Timestamps in Pacific Time)

      Friday, May 9

      • 5:00 PT: ClickHouse completed server upgrade from 24.10 to 24.12
      • 8:13 PT: Alerted to queue processors restarting due to unexpected EOF errors and I/O timeout errors from ClickHouse
      • 12:44 PT: Opened case with ClickHouse support after determining ClickHouse was the cause.
      • 13:00 PT: Scaled up the cluster hoping to alleviate issues, which inadvertently exacerbated the problem.
      • 14:11 PT: Paused ingestion to facilitate the ClickHouse team's investigation
      • 14:40 PT: Resumed ingestion after ClickHouse implemented initial mitigations
      • Afternoon/Evening: ClickHouse took additional actions that reduced timeouts but didn't fully resolve the issue

      Saturday, May 10

      • ClickHouse identified root cause related to the upgrade
      • 11:00 PT: Cluster rolled back to version 24.10, restoring query timeouts to normal levels

      Communication with ClickHouse

      We continue to work with the ClickHouse team to understand the exact nature of the bug that caused this incident. Key points from our ongoing discussion include:

      • Initial response time was prompt, but diagnosis took longer than expected
      • The issue was identified as a bug in version 24.12 that introduced a memory leak.
      • We've requested advance notification of at least one week for future upgrades.
      • We've also reached out to ask about the improvements they plan to make to avoid this in the future and improve the response for future incidents.

      [This section will be updated as we continue communication with ClickHouse]

      Remediation Plan

      • We're improving our incident management process to make communications to customers using the status page more consistent. By keeping one incident open instead of opening multiple, we'll keep communications tight and punctual.
      • Request at least one week's advance notice from ClickHouse for version upgrades
      • Enhance monitoring for query timeouts with better error information to increase alert fidelity. Excluding queries timing out because they're scanning a long time range means that we can see operational issues more clearly.
      • Implement more granular health checks for ClickHouse servers, and implement exponential backoff and retry when ClickHouse is unresponsive (a sketch of the retry approach follows this list).
      • Develop staging environment load tests for ClickHouse version upgrades
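
      As a sketch of the backoff-and-retry item above (illustrative names and limits, not our production code):

      package main

      import (
      	"context"
      	"fmt"
      	"math/rand"
      	"time"
      )

      // writeBatchWithRetry retries a ClickHouse batch insert with exponential
      // backoff and jitter instead of failing (and restarting the worker) on the
      // first unresponsive connection. writeBatch stands in for the real insert.
      func writeBatchWithRetry(ctx context.Context, writeBatch func(context.Context) error) error {
      	const maxAttempts = 6
      	backoff := 500 * time.Millisecond

      	var lastErr error
      	for attempt := 1; attempt <= maxAttempts; attempt++ {
      		if lastErr = writeBatch(ctx); lastErr == nil {
      			return nil
      		}

      		// Wait with jitter before retrying, doubling the delay each time.
      		jitter := time.Duration(rand.Int63n(int64(backoff / 2)))
      		select {
      		case <-time.After(backoff + jitter):
      		case <-ctx.Done():
      			return ctx.Err()
      		}
      		backoff *= 2
      	}
      	return fmt.Errorf("clickhouse write failed after %d attempts: %w", maxAttempts, lastErr)
      }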

      Technical Deep Dive: Data Integrity Safeguards

      Our architecture employs several safeguards that prevented significant data loss during this incident:

      1. Message Queue Persistence: Source events (or S3 file notifications) remain unacknowledged in their originating queues until all events or the whole file is successfully processed, providing resilience against processing failures.
      2. Batched Processing with Acknowledgment Controls: We only acknowledge batch completion after successful ClickHouse writes, ensuring data isn't lost during processing.
      3. File-Based Tracking: For S3 and object storage sources, we track completion at the file level, providing an additional integrity layer.
      4. Dynamic Windowing for Detections: Detections using the from and to built-in detection parameters have their windows automatically scaled such that no logs are missed during scans. Always be sure to include these in your detections to not miss a single log line.

      These mechanisms ensured that while data was delayed, permanent loss was limited to webhook sources without retry capabilities, which represent a small fraction of our total ingestion volume, and detections continued to scan and read all data as it arrived.
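
      A simplified sketch of the file-level processing loop (types and function names are illustrative, not our actual code) shows both the safeguard and the duplication edge case noted in the impact section above:

      package main

      import (
      	"context"
      	"fmt"
      )

      // Event and FileNotification are simplified placeholders for the real
      // ingestion types.
      type Event struct{ Raw []byte }

      type FileNotification struct{ Bucket, Key string }

      const batchSize = 50_000

      // processFile writes one object-storage file to ClickHouse in 50k-row
      // batches and only acknowledges the queue message once the whole file has
      // been written. If a batch fails partway through, the unacknowledged
      // message is redelivered and the file is reprocessed from the start: no
      // data is lost, but batches that already succeeded are written again,
      // which is where the duplicates described above come from.
      func processFile(
      	ctx context.Context,
      	msg FileNotification,
      	readRows func(FileNotification) ([]Event, error),
      	writeBatch func(context.Context, []Event) error,
      	ack func(FileNotification) error,
      ) error {
      	rows, err := readRows(msg)
      	if err != nil {
      		return fmt.Errorf("read %s/%s: %w", msg.Bucket, msg.Key, err)
      	}

      	for start := 0; start < len(rows); start += batchSize {
      		end := min(start+batchSize, len(rows))
      		if err := writeBatch(ctx, rows[start:end]); err != nil {
      			// Leave the notification unacknowledged so the file is retried.
      			return fmt.Errorf("write rows %d-%d: %w", start, end, err)
      		}
      	}

      	// Only now is the file considered done.
      	return ack(msg)
      }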

      Resolved

      ClickHouse identified the cause of the instability, and since they put a mitigation in place we have been stable all day.

      Please reach out if you continue to experience any issues or timeouts running queries.

      Monitoring

      ClickHouse has implemented a mitigation and we are monitoring system health to watch for any recurrence of the frequent query timeouts.

      Investigating

      After restoring service from the incident earlier today, we're seeing connectivity issues reaching the primary ClickHouse cluster.

      We're actively investigating it with the ClickHouse team and will provide an update as soon as we can.

      We're still processing logs and they aren't delayed, but queries are occasionally timing out and need to be retried. Some duplicates may occur as a result.

      May 09, 2025

      Source Error Handling Edge Case Caused Some Duplicates

        Resolved

        While rolling out the fix for handling source errors better, we did encounter another issue with our SaaS ClickHouse instance. See the new incident for details:

        https://runrevealstatus.com/events/report/506

        Monitoring

        We've implemented a fix and are monitoring the rollout.

        The root cause was not destinations, as we had originally anticipated.

        We determined the root cause to be related to how we parse individual S3 files in one particular source type (awsdns). Previously, we'd delegated all the error handling to individual sources. One file in one source was malformed and was causing an error to be bubbled up that shouldn't have been returned.

        We now handle all parsing issues across all object storage buckets using the same logic. If there's an issue parsing a file in an object storage bucket, we record it and expose it through our source errors, but do not terminate the queue.

        This will give our customers the visibility they need without affecting the processing of other sources.
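
        In rough outline (Go, with illustrative helper names passed in as parameters rather than our actual functions), the shared handling now looks like this:

        package main

        import (
        	"context"
        	"log"
        )

        // Source, ObjectRef, and Event are simplified stand-ins for the real types.
        type Source struct{ ID string }
        type ObjectRef struct{ Bucket, Key string }
        type Event struct{ Raw []byte }

        // handleObjectFile records a parse failure as a source error and keeps
        // going, instead of bubbling the error up and terminating the queue
        // worker that is processing other files.
        func handleObjectFile(
        	ctx context.Context,
        	src Source,
        	obj ObjectRef,
        	parseFile func(context.Context, ObjectRef) ([]Event, error),
        	recordSourceError func(context.Context, Source, ObjectRef, error),
        	writeEvents func(context.Context, Source, []Event) error,
        ) error {
        	events, err := parseFile(ctx, obj)
        	if err != nil {
        		// Expose the malformed file via source errors, but don't stop
        		// processing the rest of the bucket.
        		log.Printf("parse error for %s/%s: %v", obj.Bucket, obj.Key, err)
        		recordSourceError(ctx, src, obj, err)
        		return nil
        	}
        	return writeEvents(ctx, src, events)
        }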

        The impact was determined to be that some object storage sources may have seen a small number of duplicates, because the queues handling the malformed files terminated while processing other object files. No data was lost.

        Identified

        We've identified the issue and are implementing a fix.

        Investigating

        We're noticing transient connectivity issues with some destinations that are causing some batches to be retried, which may result in duplicates being written to those destinations.

        We're investigating and will update this as soon as we've got more information.

        No data has been lost.

        Dec 20, 2024

        Database Migration Requires Read-only Access


          Maintenance

          During this maintenance, the UI, API, and the ability to perform investigations will continue to operate as usual. The impact is that log ingestion will be delayed while a database migration is being applied.

          We will always strive to have zero downtime migrations, and this migration will help make database maintenance less impactful going forward. Unfortunately, there was no clear way to enable a zero downtime migration in this particular case.

          If you have any questions, comments or concerns, please let us know ASAP.