RunReveal is investigating data delays in our platform

RunReveal June 9th BYODB incident

On June 9th, RunReveal deployed a change to help our "bring your own database" (BYODB) customers more easily manage database migrations. This change lets customers update their databases automatically, choose when migrations are applied, and avoid relying on the RunReveal team to keep their schemas up to date.

Unfortunately, the change contained a subtle bug that caused data delays for object storage and API polling sources, and data loss for webhook sources that do not retry failed requests.

What was the bug

Our bring your own database destinations have a variety of settings associated with them. These settings tell us how large the batches we write to a customer's database should be, how frequently to flush a partial batch if the batch size isn't reached, whether to connect with custom headers, and so on.
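
For illustration only, a settings object along these lines captures the kinds of options involved. The field names here are hypothetical, not our actual schema:

package byodb

import "time"

// DestinationSettings is an illustrative sketch of per-destination BYODB
// settings; the real struct has more fields and different names.
type DestinationSettings struct {
	BatchSize     int               `json:"batchSize"`     // max events per batch written to the customer's database
	FlushDuration time.Duration     `json:"flushDuration"` // flush a partial batch after this long; in production this is a custom Duration type (see the fix below)
	CustomHeaders map[string]string `json:"customHeaders"` // extra headers to send when connecting
}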

As part of this change, the settings object was rewritten to carry additional context, and the format of the FlushDuration was subtly changed. When we read a flush duration it can arrive in two formats: a human-readable string like 30s, or an integer number of milliseconds like 30000, due to how JavaScript represents timestamps.

We accidentally left the JSON marshaling logic for the FlushDuration in this settings object unimplemented, so Go's default marshaling serialized it as an integer number of nanoseconds. Our custom unmarshalling logic then treated that number as milliseconds, making the flush duration a million times longer than expected. As a result, smaller sources that relied on this flush duration were essentially never being synced.
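
Here is a minimal, self-contained sketch of the failure mode (illustrative, not our production code): without a MarshalJSON method, a type backed by time.Duration serializes as its underlying nanosecond count, and reading that number back as milliseconds inflates it by a factor of one million.

package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// BuggyDuration has no MarshalJSON, so encoding/json falls back to its
// underlying int64 value, which for time.Duration is a count of nanoseconds.
type BuggyDuration time.Duration

func main() {
	b, _ := json.Marshal(BuggyDuration(30 * time.Second))
	fmt.Println(string(b)) // 30000000000 (nanoseconds)

	// The unmarshal side treated this number as milliseconds:
	var ms int64
	_ = json.Unmarshal(b, &ms)
	misread := time.Duration(ms) * time.Millisecond
	fmt.Println(misread) // 8333h20m0s (~347 days) instead of 30s
}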

Below is the fix we implemented, which marshals the flush duration back into the string format the unmarshalling code expects:

@@ -69,26 +69,31 @@ types/destinations.go
type Duration time.Duration
func (d *Duration) UnmarshalJSON(b []byte) error {
	var s string
	if err := json.Unmarshal(b, &s); err != nil {
		// fallback to int milliseconds for compatibility with old code
		var i int
		if err2 := json.Unmarshal(b, &i); err2 == nil {
			*d = Duration(time.Duration(i) * time.Millisecond)
			return nil
		}
		return err
	}
	t, err := time.ParseDuration(s)
	if err != nil {
		return err
	}
	*d = Duration(t)
	return nil
}

+func (d Duration) MarshalJSON() ([]byte, error) {
+	t := time.Duration(d)
+	return json.Marshal(t.String())
+}
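
With both methods in place, the value round-trips cleanly. A quick sanity check, assuming the Duration type above is in scope along with the encoding/json, fmt, and time imports:

orig := Duration(30 * time.Second)

b, _ := json.Marshal(orig)
fmt.Println(string(b)) // "30s" (a quoted string, not 30000000000)

var got Duration
_ = json.Unmarshal(b, &got)
fmt.Println(time.Duration(got)) // 30s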

What was the impact of this issue

This issue only impacted customers using bring your own database configurations. For sources we read from object storage and sources we poll APIs for, data was delayed during this incident. However, data we received from webhooks was lost between 2025-06-09T18:40:00Z and 2025-06-10T01:00:00Z for impacted customers.

For sources that send us HTTP requests (webhooks), data during that roughly six-hour window was lost if the sender did not retry its requests after the API service was restored. We are deeply sorry about this. We believe that RunReveal going down or having operational issues shouldn't cause data loss for our customers, and we're working on making that true.

What are we doing about this

There are two key areas we can improve from this incident.

  1. It took us far too long to become aware of this issue. It impacted only a subset of customers and a subset of those customers' sources, and we need better per-source and per-customer monitoring. RunReveal processes hundreds of thousands of logs per second, and our monitoring is focused on high-level observability metrics; we also need better alerting for smaller-volume log sources, for BYODB and Cloud customers alike.
  2. We need better resilience for webhook sources. It's not okay for our object storage and API polling sources to be resilient while webhooks are simply lost when we have operational issues. We plan to change the way we handle webhooks by queueing them immediately upon ingest so they aren't lost, and so we can safely hold logs in the queue during any future operational issues (a rough sketch of this approach follows below).

The work required to improve the resilience of webhook sources is very straightforward and well-scoped, so we expect to start working on this problem in the next few weeks.
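
As an illustration of the direction (a sketch, not a final design), a webhook handler along these lines would durably queue the raw payload before any processing, so a downstream outage delays data instead of losing it. The Queue interface and the way the source is identified here are hypothetical:

package webhook

import (
	"io"
	"net/http"
	"time"
)

// Queue is a hypothetical durable queue (e.g. backed by a message broker or
// object storage); the real implementation is still being designed.
type Queue interface {
	Enqueue(sourceID string, receivedAt time.Time, payload []byte) error
}

// Handler acknowledges the sender only after the raw payload is durably
// queued, so failures later in the pipeline can't lose the event.
func Handler(q Queue) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		payload, err := io.ReadAll(r.Body)
		if err != nil {
			http.Error(w, "failed to read body", http.StatusBadRequest)
			return
		}
		// Hypothetical: identify the source from a query parameter.
		sourceID := r.URL.Query().Get("source")
		if err := q.Enqueue(sourceID, time.Now().UTC(), payload); err != nil {
			// The payload was not stored; ask well-behaved senders to retry.
			http.Error(w, "temporarily unavailable", http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusAccepted)
	}
}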

Follow up questions

If you have follow up questions, feel free to reach out to evan@ and alan@, and we'd be happy to walk you through the bug, the metrics we have on the incident, and our plans to make RunReveal even more resilient.

Investigating

This incident only impacted a small handful of bring your own database customers. We're working on a postmortem document to share the details and learnings from this incident and expect to have it posted by June 11th.

Investigating

We're monitoring a data delay in RunReveal.