RunReveal Status

Jun 13, 2025

RunReveal is investigating data delays in our platform

    Resolved

    RunReveal June 9th BYODB incident

    RunReveal deployed a change on June 9th to help our "bring your own database" customers more easily manage database migrations. This change allows customers to automatically update their databases, choose when to apply migrations, and no longer rely on the RunReveal team to keep their schemas up to date.

    Unfortunately there was a subtle bug that caused data delays for object storage and API polling sources, and data loss for webhook sources that do not retry failed requests.

    What was the bug

    Our bring your own database destinations have a variety of settings associated with them. These settings tell us how large the batches we write to their databases should be, what the flush frequency should be if the batch size is not reached, whether we should connect with custom headers, and so on.
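
    For illustration, the settings object looks roughly like the sketch below. The field names here are illustrative, not our exact schema; only the general shape matters for the bug described next.

    package types

    import "time"

    // Duration wraps time.Duration so it can carry custom JSON handling
    // (see the fix further down).
    type Duration time.Duration

    // DestinationSettings is an illustrative sketch of a BYODB destination's
    // settings object; the real field names and layout may differ.
    type DestinationSettings struct {
    	BatchSize     int               `json:"batchSize"`
    	FlushDuration Duration          `json:"flushDuration"`
    	CustomHeaders map[string]string `json:"customHeaders,omitempty"`
    }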

    As part of this change, the settings object was rewritten to carry additional context, and the format of the FlushDuration was subtly changed. When we read the flush duration it can arrive in two separate formats: a human-readable string like 30s, or an integer number of milliseconds like 30000, due to how JavaScript represents timestamps.

    We accidentally left the JSON marshaling logic unimplemented for the FlushDuration in this settings object, so Go's default behavior serialized it as int64 nanoseconds, while the custom FlushDuration unmarshalling treated that number as milliseconds. The result was a flush duration a million times longer than intended, which meant smaller sources that relied on this flush duration were essentially never being synced.
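
    A minimal standalone snippet (illustrative only, not our production code) shows the scale of the inflation:

    package main

    import (
    	"encoding/json"
    	"fmt"
    	"time"
    )

    func main() {
    	// Without a custom MarshalJSON, Go serializes a time.Duration as its
    	// underlying int64, which is nanoseconds.
    	flush := 30 * time.Second
    	b, _ := json.Marshal(flush)
    	fmt.Println(string(b)) // 30000000000

    	// The unmarshalling path fell back to treating a bare number as
    	// milliseconds, inflating the value by a factor of one million.
    	var ms int64
    	_ = json.Unmarshal(b, &ms)
    	fmt.Println(time.Duration(ms) * time.Millisecond) // 8333h20m0s, not 30s
    }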

    Below is the fix we implemented to get the flush duration into the format it was expected to be in:

    @@ -69,26 +69,31 @@ types/destinations.go
    type Duration time.Duration
    func (d *Duration) UnmarshalJSON(b []byte) error {
    	var s string
    	if err := json.Unmarshal(b, &s); err != nil {
    		// fallback to int milliseconds for compatibility with old code
    		var i int
    		if err2 := json.Unmarshal(b, &i); err2 == nil {
    			*d = Duration(time.Duration(i) * time.Millisecond)
    			return nil
    		}
    		return err
    	}
    	t, err := time.ParseDuration(s)
    	if err != nil {
    		return err
    	}
    	*d = Duration(t)
    	return nil
    }
    
    +func (d Duration) MarshalJSON() ([]byte, error) {
    +	t := time.Duration(d)
    +	return json.Marshal(t.String())
    +}
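
    As a sanity check, a round-trip test along these lines (illustrative, assuming the Duration type above lives in our types package) confirms that the fixed code writes the string form and still accepts the legacy integer-millisecond form:

    package types

    import (
    	"encoding/json"
    	"testing"
    	"time"
    )

    func TestDurationRoundTrip(t *testing.T) {
    	d := Duration(30 * time.Second)

    	// With MarshalJSON implemented, we emit the human-readable string form.
    	b, err := json.Marshal(d)
    	if err != nil {
    		t.Fatal(err)
    	}
    	if string(b) != `"30s"` {
    		t.Fatalf(`expected "30s", got %s`, b)
    	}

    	// The string form round-trips back to the same duration.
    	var back Duration
    	if err := json.Unmarshal(b, &back); err != nil {
    		t.Fatal(err)
    	}
    	if back != d {
    		t.Fatalf("round trip changed value: %v", time.Duration(back))
    	}

    	// The legacy integer-millisecond form is still accepted.
    	var legacy Duration
    	if err := json.Unmarshal([]byte("30000"), &legacy); err != nil {
    		t.Fatal(err)
    	}
    	if time.Duration(legacy) != 30*time.Second {
    		t.Fatalf("expected 30s, got %v", time.Duration(legacy))
    	}
    }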
    

    What was the impact of this issue

    This issue only impacted customers using bring your own database configurations. For sources we read from object storage and sources we poll APIs for, data was delayed during this incident. However, data we received from webhooks was lost between 2025-06-09T18:40:00Z and 2025-06-10T01:00:00Z for impacted customers.

    Sources that send us HTTP requests (webhooks) lost data during that roughly six-hour window if the sender did not retry failed requests once the API service was restored. We are deeply sorry about this. We believe that RunReveal going down or having operational issues shouldn't cause data loss for our customers, and we're working to make that true.

    What are we doing about this

    There are two key areas we can improve from this incident.

    1. This issue took us far too long to become aware of. It impacted only a subset of customers and a subset of those customers' sources. We need better per-source and per-customer monitoring: RunReveal processes hundreds of thousands of logs per second, and our monitoring is focused on high-level observability metrics. We also need to do better on smaller-volume log sources, with alerting that covers BYODB and Cloud customers alike.
    2. We need better resilience for webhook sources. It's not okay for our object storage and API polling sources to be resilient while webhooks are simply lost when we have operational issues. We plan to change how we handle webhooks by queueing them immediately upon ingest so they aren't lost; the queue can then safely hold logs for a while during any future operational issues.

    The work required to improve the resilience of webhook sources is very straightforward and well-scoped, so we expect to start working on this problem in the next few weeks.
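
    To sketch the direction (this is illustrative; the actual queue, routes, and handler names may differ), the webhook endpoint would durably enqueue the raw payload before acknowledging the sender, so a downstream outage delays processing rather than losing the event:

    package main

    import (
    	"context"
    	"io"
    	"log"
    	"net/http"
    )

    // Queue is a stand-in for whatever durable queue backs webhook ingest;
    // the interface and names here are hypothetical.
    type Queue interface {
    	Enqueue(ctx context.Context, sourceID string, payload []byte) error
    }

    // webhookHandler acknowledges the sender only after the raw payload has
    // been durably queued, so downstream processing problems delay the data
    // instead of dropping it.
    func webhookHandler(q Queue) http.HandlerFunc {
    	return func(w http.ResponseWriter, r *http.Request) {
    		body, err := io.ReadAll(r.Body)
    		if err != nil {
    			http.Error(w, "could not read body", http.StatusBadRequest)
    			return
    		}
    		sourceID := r.PathValue("source") // assumes a route like POST /sources/{source}
    		if err := q.Enqueue(r.Context(), sourceID, body); err != nil {
    			// Return a retryable status rather than silently dropping the event.
    			log.Printf("enqueue failed for source %s: %v", sourceID, err)
    			http.Error(w, "temporarily unavailable", http.StatusServiceUnavailable)
    			return
    		}
    		w.WriteHeader(http.StatusAccepted)
    	}
    }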

    Follow-up questions

    If you have follow-up questions, feel free to reach out to evan@ and alan@, and we'd be happy to walk you through the bug, the metrics we have on the incident, and our plans to make RunReveal even more resilient.

    Investigating

    This incident only impacted a small handful of bring your own database customers. We're working on a postmortem document to share the details and learnings from this incident and expect to have it posted by June 11th.

    Investigating

    We're monitoring a data delay in RunReveal.

    May 12, 2025

    Query Timeouts Causing Instability

      Resolved

      Summary

      While investigating increased query timeouts, we suspected that our ClickHouse cluster might have been under-provisioned. Our ClickHouse Cloud instance had recently been upgraded from version 24.10 to 24.12. A bug in the newer version caused two of our five ClickHouse servers to become unresponsive to log queries while otherwise appearing healthy. We worked with ClickHouse support to resolve the issue, which ultimately required rolling back to the previous version.

      Impact

      • Source Ingestion: Paused for 30 minutes during investigation, causing delays of that duration.
      • Detection Delays: Increased timeouts led to query queue buildup, causing detection delays of up to 30 minutes in the worst cases.
      • Data Loss Risk: Webhook-based sources without retry implementations may have lost data.
      • Potential Data Duplication: In specific scenarios involving large S3 files (many GBs, 100k+ rows), some duplicate events may have appeared. We write to ClickHouse in batches of 50k rows, so if some batches were written successfully but ClickHouse failed before we finished processing the file, the already-written events are duplicated when we reprocess the file after recovery.

      Timeline (Timestamps in Pacific Time)

      Friday, May 9

      • 5:00 PT: ClickHouse completed server upgrade from 24.10 to 24.12
      • 8:13 PT: Alerted to queue processors restarting due to unexpected EOF errors and I/O timeout errors from ClickHouse
      • 12:44 PT: Opened case with ClickHouse support after determining ClickHouse was the cause.
      • 13:00 PT: Scaled up the cluster hoping to alleviate issues, which inadvertently exacerbated the problem.
      • 14:11 PT: Paused ingestion to facilitate the ClickHouse team's investigation
      • 14:40 PT: Resumed ingestion after ClickHouse implemented initial mitigations
      • Afternoon/Evening: ClickHouse took additional actions that reduced timeouts but didn't fully resolve the issue

      Saturday, May 10

      • ClickHouse identified root cause related to the upgrade
      • 11:00 PT: Cluster rolled back to version 24.10, restoring query timeouts to normal levels

      Communication with ClickHouse

      We continue to work with the ClickHouse team to understand the exact nature of the bug that caused this incident. Key points from our ongoing discussion include:

      • Initial response time was prompt, but diagnosis took longer than expected
      • The issue was identified as a bug in version 24.12 that introduced a memory leak.
      • We've requested advance notification of at least one week for future upgrades.
      • We've also reached out to ask about the improvements they plan to make to avoid this in the future and improve the response for future incidents.

      [This section will be updated as we continue communication with ClickHouse]

      Remediation Plan

      • We're improving our incident management process to make communications to customers using the status page more consistent. By keeping one incident open instead of opening multiple, we'll keep communications tight and punctual.
      • Request at least one week's advance notice from ClickHouse for version upgrades
      • Enhance monitoring for query timeouts with better error information to increase alert fidelity. Excluding queries timing out because they're scanning a long time range means that we can see operational issues more clearly.
      • Implement more granular health checks for ClickHouse servers, and implement exponential backoff and retry when ClickHouse is unresponsive (a sketch of the retry approach follows this list).
      • Develop staging environment load tests for ClickHouse version upgrades
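
      As a sketch of the backoff-and-retry item above (illustrative names and limits, not our production code):

      package main

      import (
      	"context"
      	"fmt"
      	"math/rand"
      	"time"
      )

      // writeBatchWithRetry retries a ClickHouse batch insert with exponential
      // backoff and jitter instead of failing (and restarting the worker) on the
      // first unresponsive connection. writeBatch stands in for the real insert.
      func writeBatchWithRetry(ctx context.Context, writeBatch func(context.Context) error) error {
      	const maxAttempts = 6
      	backoff := 500 * time.Millisecond

      	var lastErr error
      	for attempt := 1; attempt <= maxAttempts; attempt++ {
      		if lastErr = writeBatch(ctx); lastErr == nil {
      			return nil
      		}

      		// Wait with jitter before retrying, doubling the delay each time.
      		jitter := time.Duration(rand.Int63n(int64(backoff / 2)))
      		select {
      		case <-time.After(backoff + jitter):
      		case <-ctx.Done():
      			return ctx.Err()
      		}
      		backoff *= 2
      	}
      	return fmt.Errorf("clickhouse write failed after %d attempts: %w", maxAttempts, lastErr)
      }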

      Technical Deep Dive: Data Integrity Safeguards

      Our architecture employs several safeguards that prevented significant data loss during this incident:

      1. Message Queue Persistence: Source events (or S3 file notifications) remain unacknowledged in their originating queues until all events or the whole file is successfully processed, providing resilience against processing failures.
      2. Batched Processing with Acknowledgment Controls: We only acknowledge batch completion after successful ClickHouse writes, ensuring data isn't lost during processing.
      3. File-Based Tracking: For S3 and object storage sources, we track completion at the file level, providing an additional integrity layer.
      4. Dynamic Windowing for Detections: Detections using the from and to built-in detection parameters have their windows automatically scaled such that no logs are missed during scans. Always be sure to include these in your detections to not miss a single log line.

      These mechanisms ensured that while data was delayed, permanent loss was limited to webhook sources without retry capabilities, which represent a small fraction of our total ingestion volume, and detections continued to scan and read all data as it arrived.
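
      A simplified sketch of the file-level processing loop (types and function names are illustrative, not our actual code) shows both the safeguard and the duplication edge case noted in the impact section above:

      package main

      import (
      	"context"
      	"fmt"
      )

      // Event and FileNotification are simplified placeholders for the real
      // ingestion types.
      type Event struct{ Raw []byte }

      type FileNotification struct{ Bucket, Key string }

      const batchSize = 50_000

      // processFile writes one object-storage file to ClickHouse in 50k-row
      // batches and only acknowledges the queue message once the whole file has
      // been written. If a batch fails partway through, the unacknowledged
      // message is redelivered and the file is reprocessed from the start: no
      // data is lost, but batches that already succeeded are written again,
      // which is where the duplicates described above come from.
      func processFile(
      	ctx context.Context,
      	msg FileNotification,
      	readRows func(FileNotification) ([]Event, error),
      	writeBatch func(context.Context, []Event) error,
      	ack func(FileNotification) error,
      ) error {
      	rows, err := readRows(msg)
      	if err != nil {
      		return fmt.Errorf("read %s/%s: %w", msg.Bucket, msg.Key, err)
      	}

      	for start := 0; start < len(rows); start += batchSize {
      		end := min(start+batchSize, len(rows))
      		if err := writeBatch(ctx, rows[start:end]); err != nil {
      			// Leave the notification unacknowledged so the file is retried.
      			return fmt.Errorf("write rows %d-%d: %w", start, end, err)
      		}
      	}

      	// Only now is the file considered done.
      	return ack(msg)
      }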

      Resolved

      ClickHouse identified the cause of the instability, and since they put a mitigation in place we have been stable all day.

      Please reach out if you continue to experience any issues or timeouts running queries.

      Monitoring

      ClickHouse has implemented a mitigation and we are monitoring system health to watch for any recurrence of the frequent query timeouts.

      Investigating

      After restoring service from the incident earlier today, we're seeing connectivity issues reaching the primary ClickHouse cluster.

      We're actively investigating it with the ClickHouse team and will provide an update as soon as we can.

      We're still processing logs and they aren't delayed, but queries are occasionally timing out and need to be retried. Some duplicates may occur as a result.

      May 09, 2025

      Source Error Handling Edge Case Caused Some Duplicates

        Resolved

        While rolling out the fix for handling source errors better, we did encounter another issue with our SaaS ClickHouse instance. See the new incident for details:

        https://runrevealstatus.com/events/report/506

        Monitoring

        We've implemented a fix and are monitoring the rollout.

        The root cause was not destinations, as we had originally anticipated.

        We determined the root cause to be related to how we parse individual S3 files in one particular source type (awsdns). Previously, we'd delegated all the error handling to individual sources. One file in one source was malformed and was causing an error to be bubbled up that shouldn't have been returned.

        We now handle all parsing issues across all object storage buckets using the same logic. If there's an issue parsing a file in an object storage bucket, we record it and expose it through our source errors, but do not terminate the queue.

        This will give our customers the visibility they need without affecting the processing of other sources.
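
        In rough outline (Go, with illustrative helper names passed in as parameters rather than our actual functions), the shared handling now looks like this:

        package main

        import (
        	"context"
        	"log"
        )

        // Source, ObjectRef, and Event are simplified stand-ins for the real types.
        type Source struct{ ID string }
        type ObjectRef struct{ Bucket, Key string }
        type Event struct{ Raw []byte }

        // handleObjectFile records a parse failure as a source error and keeps
        // going, instead of bubbling the error up and terminating the queue
        // worker that is processing other files.
        func handleObjectFile(
        	ctx context.Context,
        	src Source,
        	obj ObjectRef,
        	parseFile func(context.Context, ObjectRef) ([]Event, error),
        	recordSourceError func(context.Context, Source, ObjectRef, error),
        	writeEvents func(context.Context, Source, []Event) error,
        ) error {
        	events, err := parseFile(ctx, obj)
        	if err != nil {
        		// Expose the malformed file via source errors, but don't stop
        		// processing the rest of the bucket.
        		log.Printf("parse error for %s/%s: %v", obj.Bucket, obj.Key, err)
        		recordSourceError(ctx, src, obj, err)
        		return nil
        	}
        	return writeEvents(ctx, src, events)
        }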

        The impact was determined to be that some object storage sources may have seen a small number of duplicates, because the queues handling the malformed files terminated while processing other object files. No data was lost.

        Identified

        We've identified the issue and are implementing a fix.

        Investigating

        We're noticing transient connectivity issues with some destinations that are causing some batches to be retried, which may result in duplicates being written to those destinations.

        We're investigating and will update this as soon as we've got more information.

        No data has been lost.

        Dec 20, 2024

        Database Migration Requires Read-only Access


          Maintenance

          During this maintenance, the UI, API, and the ability to perform investigations will continue to operate as usual. The impact is that log ingestion will be delayed while a database migration is being applied.

          We will always strive to have zero downtime migrations, and this migration will help make database maintenance less impactful going forward. Unfortunately, there was no clear way to enable a zero downtime migration in this particular case.

          If you have any questions, comments or concerns, please let us know ASAP.