RunReveal Status
Jan 14, 2026
1 day ago
RunReveal API failures and data delays
Monitoring January 14 at 4:22 AM

The fix was implemented successfully and all systems are recovering. Data that was received during the delay is being processed now and our queues should catch up within the next hour.

We will continue monitoring.

Identified January 14 at 3:41 AM (41 minutes earlier)

We've implemented a fix and are monitoring its effectiveness. Even if the fix is successful, it will take some time for data to catch up, as it is waiting in our queues for processing. No data has been lost; however, data is delayed and will be processed overnight.

We'll keep this status report open until our queues catch up.

Investigating January 14 at 2:14 AM (1 hour earlier)

We're investigating API failures and data delays. Our engineering team is working to identify and resolve the issue.

Dec 18, 2025
28 days ago
API Response Failures
Resolved December 18 at 4:34 PM

Our website and API were down for several minutes after we released a change with a database migration that didn't apply cleanly.

Our monitoring made us aware of this immediately, and after diagnosing the issue we initiated a rollback to restore service. This didn't affect data processing services.

Nov 18, 2025
2 months ago
Website & API outage due to upstream provider
RunReveal, Ingestion API, Web
Resolved November 18 at 11:52 PM (in 12 hours)

The Cloudflare incident is resolved. All systems are reporting nominal.

Identified November 18 at 11:53 AM (12 hours earlier)

Our website is hosted on Cloudflare, which is currently handling an outage: https://www.cloudflarestatus.com/incidents/8gmgl950y3h7

Webhook log ingestion may receive 500 errors during this time.

Object storage and API polling log ingestion will be unaffected.

Oct 14, 2025
3 months ago
Elevated Rate Of Log Query Timeouts
RunReveal, Web
Resolved October 20 at 11:22 PM (in 6 days)

We identified and resolved the root cause of the increased rate of 502 errors on the API: an increase in agents querying RunReveal without effectively using the database indexes. Other endpoints were not affected.

We've done two things to address this issue:

  • We're updating the logs query tool description to include instructions for how to query RunReveal in a way that better utilizes the indexes.

  • We've changed the status code for query timeouts from 502 to 408 to indicate to the client that the query should be adjusted to use an index (see the sketch below).
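
For API clients, the practical effect of this change is that a 408 now means "adjust the query" rather than "the server failed". The sketch below shows one way a Go client might distinguish the two cases; the endpoint URL, request shape, and header are illustrative assumptions, not RunReveal's documented API.

    package logquery

    import (
        "fmt"
        "net/http"
        "strings"
    )

    // RunQuery posts a log query and distinguishes a 408 (query timed out;
    // rewrite it to use an indexed column or a narrower time range) from
    // other failures. The endpoint and payload below are hypothetical.
    func RunQuery(client *http.Client, token, query string) error {
        req, err := http.NewRequest(http.MethodPost,
            "https://api.example.com/v1/query", // hypothetical endpoint
            strings.NewReader(query))
        if err != nil {
            return err
        }
        req.Header.Set("Authorization", "Bearer "+token)

        resp, err := client.Do(req)
        if err != nil {
            return err
        }
        defer resp.Body.Close()

        switch resp.StatusCode {
        case http.StatusOK:
            return nil
        case http.StatusRequestTimeout: // 408: adjust the query rather than blindly retrying
            return fmt.Errorf("query timed out: add an index-friendly filter or narrow the time range")
        default:
            return fmt.Errorf("query failed: %s", resp.Status)
        }
    }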

Identified October 14 at 9:47 PM (6 days earlier)

We've received reports of an elevated rate of log query timeouts for RunReveal Cloud. We began receiving reports on Tuesday and immediately began investigating. We're opening this incident retroactively to indicate that we're still working on a fix.

This should only affect queries run via the dashboard and via the AI chat or agent; it should not affect detection queries.

Data delivery has not been affected by this incident.

Oct 18, 2025
3 months ago
Data Processing Delays
Ingestion API
Resolved October 18 at 9:56 AM

Summary

On October 17, 2025, we experienced processing delays affecting several source types. Full recovery was achieved by October 18, 2025 at 02:56:50 UTC.

Impact

Affected Sources:

  • CloudTrail: Delays of 1-2 hours
  • Cloudflare (HTTP, WAF): Delays of 1-2 hours
  • GitHub: Delays of 1-2 hours
  • Other sources: Minimal lag

Data Integrity:

  • No data loss occurred
  • Some sources may have experienced duplicate events during the affected timeframe
  • All events were ultimately processed and delivered

Timeline (UTC)

  • 2025-10-17 06:16:00: Processing delays began
  • 2025-10-17 06:21:00: Monitoring alerts triggered
  • 2025-10-17 06:21:00+: Investigation began
  • 2025-10-17 08:12:50: Major backlogs cleared
  • 2025-10-17 23:20:50: Root cause identified and mitigation applied
  • 2025-10-18 02:56:50: Full recovery, all sources caught up

Root Cause

A configuration issue related to processing high-latency data sources caused resource contention in our event processing pipeline. This affected processing capacity for all sources sharing the same infrastructure.

Resolution

The immediate issue was resolved by adjusting source configurations. To prevent recurrence, we're taking the following actions:

    Action Items

    Short-term:

    • Implement monitoring of S3 object processing time
    • Add alerting for early detection of similar issues

    Long-term:

    • Architectural improvements to prevent cross-source performance impacts
    • Capacity planning for geographically distributed data sources

    Customer Impact

If you experienced delays in your log data during this timeframe, all data has now been processed and is available for analysis. If you notice any duplicate events from October 17, 2025, this is expected; they can be deduplicated using event IDs if necessary.
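
If you do choose to deduplicate, the idea is simply to keep the first occurrence of each event ID. A minimal Go sketch, assuming the events have already been fetched and that the Event type and ID field shown here are illustrative:

    package dedup

    // Event is a stand-in for an exported log event; only the ID matters here.
    type Event struct {
        ID  string
        Raw []byte
    }

    // Dedupe keeps the first occurrence of each event ID and drops the rest.
    func Dedupe(events []Event) []Event {
        seen := make(map[string]struct{}, len(events))
        out := events[:0]
        for _, e := range events {
            if _, ok := seen[e.ID]; ok {
                continue
            }
            seen[e.ID] = struct{}{}
            out = append(out, e)
        }
        return out
    }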

    For questions or concerns, please get in touch: contact@runreveal.com

    Jun 9, 2025
    7 months ago
    RunReveal is investigating data delays in our platform
    Resolved July 9 at 2:41 PM (in 30 days)

    This issue was erroneously open for quite some time.

    Resolved June 13 at 3:40 PM (26 days earlier)

    RunReveal June 9th BYODB incident

    RunReveal deployed a change on June 9th to help our "bring your own databases" customers more easily manage database migrations. This change allows customers to automatically update their databases, choose when to apply migrations, and not need to rely on the RunReveal team to keep their schemas up to date.

    Unfortunately there was a subtle bug that caused data delays for object storage and API polling sources, and data loss for webhook sources that do not retry failed requests.

    What was the bug

Our bring-your-own-database destinations have a variety of settings associated with them. These settings tell us how large the batches we write to the customer's database should be, what the flush frequency should be if the batch size is not reached, whether we should connect with custom headers, and so on.

As part of this change the settings object was rewritten to carry additional context, and the format of the FlushDuration was subtly changed. When we read the flush duration it can arrive in two separate formats: the first is a human-readable string like 30s, and the second is integer milliseconds like 30000, due to how JavaScript represents timestamps.

We accidentally left the JSON marshaling logic for the FlushDuration unimplemented in this settings object (so the value was serialized as raw nanoseconds), while the custom FlushDuration unmarshalling treated that number as milliseconds. This meant the flush duration came out a factor of one million longer than intended, so smaller sources that relied on this flush duration were essentially never being synced.

Below is the fix we implemented, which gets the flush duration into the format it was expected to be in:

    @@ -69,26 +69,31 @@ types/destinations.go
    type Duration time.Duration
    func (d *Duration) UnmarshalJSON(b []byte) error {
    	var s string
    	if err := json.Unmarshal(b, &s); err != nil {
    		// fallback to int milliseconds for compatibility with old code
    		var i int
    		if err2 := json.Unmarshal(b, &i); err2 == nil {
    			*d = Duration(time.Duration(i) * time.Millisecond)
    			return nil
    		}
    		return err
    	}
    	t, err := time.ParseDuration(s)
    	if err != nil {
    		return err
    	}
    	*d = Duration(t)
    	return nil
    }
    
    +func (d Duration) MarshalJSON() ([]byte, error) {
    +	t := time.Duration(d)
    +	return json.Marshal(t.String())
    +}
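
To illustrate the behavior after the fix, here is a small self-contained round-trip sketch that restates the Duration type from the diff above: marshaling now produces the human-readable string form, and unmarshalling still accepts both that string form and the legacy integer-milliseconds form.

    package main

    import (
        "encoding/json"
        "fmt"
        "time"
    )

    // Duration mirrors the type from the fix above: it marshals to a string
    // like "30s" and unmarshals from either that string or integer milliseconds.
    type Duration time.Duration

    func (d *Duration) UnmarshalJSON(b []byte) error {
        var s string
        if err := json.Unmarshal(b, &s); err != nil {
            // fallback to int milliseconds for compatibility with old code
            var i int
            if err2 := json.Unmarshal(b, &i); err2 == nil {
                *d = Duration(time.Duration(i) * time.Millisecond)
                return nil
            }
            return err
        }
        t, err := time.ParseDuration(s)
        if err != nil {
            return err
        }
        *d = Duration(t)
        return nil
    }

    func (d Duration) MarshalJSON() ([]byte, error) {
        return json.Marshal(time.Duration(d).String())
    }

    func main() {
        d := Duration(30 * time.Second)
        out, _ := json.Marshal(d)
        fmt.Println(string(out)) // "30s", not 30000000000

        var fromString, fromMillis Duration
        _ = json.Unmarshal([]byte(`"30s"`), &fromString)
        _ = json.Unmarshal([]byte(`30000`), &fromMillis)
        fmt.Println(time.Duration(fromString), time.Duration(fromMillis)) // 30s 30s
    }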
    

    What was the impact of this issue

This issue only impacted customers using bring-your-own-database configurations. For sources we read from object storage and sources we poll APIs for, data was delayed during this incident. However, data we received from webhooks was lost between 2025-06-09T18:40:00Z and 2025-06-10T01:00:00Z for impacted customers.

Webhook sources, where we receive an HTTP request from the sender, lost data during that roughly six-hour window if the sender did not retry failed requests after the API service was restored. We are deeply sorry about this. We believe that RunReveal going down or having operational issues shouldn't cause data loss for our customers, and we're working to make that true.

    What are we doing about this

    There are two key areas we can improve from this incident.

      The work required to improve the resilience of webhook sources is very straightforward and well-scoped so we expect to start working on this problem in the next few weeks.

      Follow up questions

If you have follow-up questions, feel free to reach out to evan@ and alan@, and we'd be happy to walk you through the bug, the metrics we have on the incident, and our plans to make RunReveal even more resilient.

      Investigating June 10 at 3:28 AM (4 days earlier)

      This incident only impacted a small handful of bring your own database customers. We're working on a postmortem document to share the details and learnings from this incident and expect to have it posted by June 11th.

      Investigating June 9 at 8:17 PM (7 hours earlier)

      We're monitoring a data delay in RunReveal.

      Jun 12, 2025
      7 months ago
      Website outage due to upstream provider
      Resolved June 12 at 9:42 PM (in 3 hours)

      This has been resolved. Please reach out if you see any other issues.

      Monitoring June 12 at 9:09 PM (33 minutes earlier)

      Cloudflare reports the issue as remediated and monitoring. We'll keep an eye on things too.

      Identified June 12 at 6:57 PM (2 hours earlier)

Our website is hosted on Cloudflare, which is currently handling an outage: https://www.cloudflarestatus.com/

      All log processing is unaffected and progressing normally.

      May 10, 2025
      8 months ago
      Query Timeouts Causing Instability
      Resolved May 12 at 11:22 PM (in 3 days)

      Summary

While investigating increased query timeouts, we suspected that our ClickHouse cluster might have been under-provisioned. Our ClickHouse Cloud instance had recently been upgraded from version 24.10 to 24.12. A bug in the newer version caused two of our five ClickHouse servers to become unresponsive to log queries while otherwise appearing healthy. We worked with ClickHouse support to resolve the issue, which ultimately required a rollback to the previous version.

      Impact

      • Source Ingestion: Paused for 30 minutes during investigation, causing delays of that duration.
      • Detection Delays: Increased timeouts led to Query queue buildup, causing detection delays of up to 30 minutes in the worst cases.
      • Data Loss Risk: Webhook-based sources without retry implementations may have lost data.
      • Potential Data Duplication: In specific scenarios involving large S3 files (many GBs, 100k+ rows), some duplicate events may have appeared. We write to ClickHouse in batches of 50k rows, so if some batches were written successfully but a ClickHouse failure occurred before we finished the file, the whole file is reprocessed after recovery and the already-written batches show up as duplicates (see the sketch after this list).
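
A simplified sketch of that duplication scenario, assuming a hypothetical insertBatch function: because there is no per-batch checkpoint, a failure partway through a file means the whole file is reprocessed later, and batches that already landed are written again.

    package ingest

    // Event and insertBatch are hypothetical stand-ins for the real pipeline.
    type Event struct{ Raw []byte }

    const batchSize = 50_000

    // processFile writes a file's events to ClickHouse in 50k-row batches.
    // If an insert fails midway, the error propagates and the whole file is
    // retried later, so batches that already succeeded are inserted again,
    // producing duplicates (at-least-once delivery).
    func processFile(events []Event, insertBatch func([]Event) error) error {
        for start := 0; start < len(events); start += batchSize {
            end := start + batchSize
            if end > len(events) {
                end = len(events)
            }
            if err := insertBatch(events[start:end]); err != nil {
                // No per-batch checkpoint: the caller reprocesses the
                // entire file after recovery.
                return err
            }
        }
        return nil
    }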

      Timeline (Timestamps in Pacific Time)

      Friday, May 9

      • 5:00 PT: ClickHouse completed server upgrade from 24.10 to 24.12
      • 8:13 PT: Alerted to queue processors restarting due to unexpected EOF errors and I/O timeout errors from ClickHouse
      • 12:44 PT: Opened case with ClickHouse support after determining ClickHouse was the cause.
      • 13:00 PT: Scaled up the cluster hoping to alleviate issues, which inadvertently exacerbated the problem.
      • 14:11 PT: Paused ingestion to facilitate ClickHouse team's investigation
      • 14:40 PT: Resumed ingestion after ClickHouse implemented initial mitigations
      • Afternoon/Evening: ClickHouse took additional actions that reduced timeouts but didn't fully resolve the issue

      Saturday, May 10

      • ClickHouse identified root cause related to the upgrade
      • 11:00 PT: Cluster rolled back to version 24.10, restoring query timeouts to normal levels

      Communication with ClickHouse

      We continue to work with the ClickHouse team to understand the exact nature of the bug that caused this incident. Key points from our ongoing discussion include:

      • Initial response time was prompt, but diagnosis took longer than expected
      • The issue was identified as a bug in version 24.12 that introduced a memory leak.
      • We've requested advance notification of at least one week for future upgrades.
      • We've also reached out to ask about the improvements they plan to make to avoid this in the future and improve the response for future incidents.

      [This section will be updated as we continue communication with ClickHouse]

      Remediation Plan

      • We're improving our incident management process to make communications to customers using the status page more consistent. By keeping one incident open instead of opening multiple, we'll keep communications tight and punctual.
      • Request at least one week's advance notice from ClickHouse for version upgrades
      • Enhance monitoring for query timeouts with better error information to increase alert fidelity. Excluding queries timing out because they're scanning a long time range means that we can see operational issues more clearly.
      • Implement more granular health checks for ClickHouse servers, and implement exponential backoff and retry when ClickHouse is unresponsive (see the sketch after this list)
      • Develop staging environment load tests for ClickHouse version upgrades
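
As an illustration of the backoff-and-retry item above, here is a minimal Go sketch; the attempt count and base delay are assumptions for illustration, not a description of our production settings.

    package retry

    import (
        "context"
        "fmt"
        "time"
    )

    // withBackoff retries op with exponential backoff (1s, 2s, 4s, ...) until
    // it succeeds, the attempts are exhausted, or the context is cancelled.
    func withBackoff(ctx context.Context, attempts int, op func(context.Context) error) error {
        delay := time.Second
        var err error
        for i := 0; i < attempts; i++ {
            if err = op(ctx); err == nil {
                return nil
            }
            select {
            case <-time.After(delay):
                delay *= 2
            case <-ctx.Done():
                return ctx.Err()
            }
        }
        return fmt.Errorf("clickhouse still unresponsive after %d attempts: %w", attempts, err)
    }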

      Technical Deep Dive: Data Integrity Safeguards

Our architecture employs several safeguards that prevented significant data loss during this incident.

These mechanisms ensured that while data was delayed, permanent loss was limited to webhook sources without retry capabilities, which represent a small fraction of our total ingestion volume, and detections were still scanning and reading all data as it arrived.

        Resolved May 11 at 3:43 AM (2 days earlier)

ClickHouse identified the cause of the instability, and since the mitigation they put in place we have been stable all day.

        Please reach out if you continue to experience any issues or timeouts running queries.

        Monitoring May 10 at 4:39 AM (23 hours earlier)

        Clickhouse has implemented a mitigation and we are monitoring the system health to watch for additional frequent query timeouts.

        Investigating May 10 at 2:24 AM (2 hours earlier)

        After restoring service from the incident earlier today, we're seeing connectivity issues reaching the primary Clickhouse cluster.

        We're actively investigating it with the Clickhouse team and will provide an update as soon as we can.

        We're still processing logs and they aren't delayed but occasionally queries are timing out, needing to be retried. Some duplicates may occur as a result.

        May 9, 2025
        8 months ago
        Data Ingestion Delayed for SaaS Pipeline
        Resolved May 9 at 10:37 PM (in 2 hours)

        We've now returned to a nominal state and are caught up on ingestion and detections.

We'll publish a full post-mortem as soon as we've pieced together all the information internally, with the details of the chain of events and what we'll do to help mitigate this in the future.

        For now, we'll be monitoring it closely throughout the weekend. Please stay tuned and subscribe to the blog to receive the update for the post mortem when we publish it.

        Monitoring May 9 at 9:46 PM (51 minutes earlier)

        We identified and remediated the issue with the help of our partners at Clickhouse.

        There were issues with the instances that we auto-scaled the database onto and as a result Clickhouse is following up internally and with AWS to do a root cause analysis.

We'll monitor the progress of our queues as they catch up, and we'll update customers with a postmortem of the incident and details of what happened once we know everything that occurred.

        Investigating May 9 at 8:57 PM (49 minutes earlier)

To follow up from the previous incident report, there were indeed issues with our SaaS RunReveal ClickHouse instance.

        We're investigating with the Clickhouse team now. Data ingestion will be delayed for customers of the SaaS pipeline, and customers utilizing BYODB may notice some duplicates as we work to resolve the issue.

        May 9, 2025
        8 months ago
        Source Error Handling Edge Case Caused Some Duplicates
        Resolved May 9 at 9:03 PM (in 1 hour)

        While rolling out the fix for handling source errors better, we did encounter another issue with our SaaS ClickHouse instance. See the new incident for details:

        https://runrevealstatus.com/events/report/506

        Monitoring May 9 at 8:08 PM (55 minutes earlier)

        We've implemented a fix and are monitoring the rollout.

        The root cause was not destinations as we had originally anticipated.

        We determined the root cause to be related to how we parse individual S3 files in one particular source type (awsdns). Previously, we'd delegated all the error handling to individual sources. One file in one source was malformed and was causing an error to be bubbled up that shouldn't have been returned.

        We now handle all parsing issues across all object storage buckets using the same logic. If there's an issue parsing a file in an object storage bucket, we record it and expose it through our source errors, but do not terminate the queue.

        This will give our customers the visibility they need without affecting the processing of other sources.
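
A simplified sketch of that approach, with hypothetical parseFile and recordSourceError helpers standing in for the real parsing and source-error reporting code: a malformed object is recorded and skipped rather than terminating the queue.

    package objstore

    // processObjects parses each object in a bucket. parseFile and
    // recordSourceError are hypothetical stand-ins for the real parsing
    // and source-error reporting code.
    func processObjects(keys []string,
        parseFile func(key string) error,
        recordSourceError func(key string, err error),
    ) {
        for _, key := range keys {
            if err := parseFile(key); err != nil {
                // Record the parse failure so it shows up in source errors,
                // but keep processing the remaining objects.
                recordSourceError(key, err)
            }
        }
    }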

The impact was determined to be that some object sources may have seen a small number of duplicates, because the queues responsible for handling the malformed files terminated while processing other object files. No data was lost.

        Identified May 9 at 7:54 PM (14 minutes earlier)

        We've identified the issue and are implementing a fix.

        Investigating May 9 at 7:36 PM (18 minutes earlier)

We're noticing some transient connectivity issues connecting to some destinations, which are causing some batches to be retried and may result in duplicates being written to destinations.

        We're investigating and will update this as soon as we've got more information.

        No data has been lost.

        May 7, 2025
        8 months ago
        Some Detections Delayed 1h
        Resolved May 7 at 7:57 PM (in 31 minutes)

        This incident has been resolved.

        Monitoring May 7 at 7:26 PM (31 minutes earlier)

        We released some improvements to how we load configuration for our services, but unfortunately our detections scheduler had a JSON unmarshaling error that was uncaught in testing. We're adding tests to ensure this doesn't happen again.

Detections were delayed by at most one hour. A release fixing the issue has been deployed, and we're now caught up again.
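
The kind of regression test we're adding can be sketched roughly as follows; the Config struct and fixture are hypothetical, but the idea is that unmarshalling a known-good configuration fixture in CI fails the build instead of the scheduler.

    package config

    import (
        "encoding/json"
        "testing"
    )

    // Config is a hypothetical stand-in for the scheduler's configuration.
    type Config struct {
        PollInterval string `json:"poll_interval"`
        MaxQueries   int    `json:"max_queries"`
    }

    func TestConfigUnmarshal(t *testing.T) {
        fixture := []byte(`{"poll_interval": "1m", "max_queries": 100}`)

        var cfg Config
        if err := json.Unmarshal(fixture, &cfg); err != nil {
            t.Fatalf("config no longer unmarshals: %v", err)
        }
        if cfg.PollInterval != "1m" || cfg.MaxQueries != 100 {
            t.Fatalf("unexpected config values: %+v", cfg)
        }
    }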

        Feb 12, 2025
        11 months ago
        Scheduled query delay
        Resolved February 12 at 11:45 PM

We encountered a bug that delayed many scheduled query runs. Queries written with the {from} and {to} syntax were still run with the appropriate windowing, but many runs were delayed past their expected timeframe.
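
For context on why the windowing was preserved: {from} and {to} are placeholders that get filled in with the boundaries of each scheduled window when the query runs, so a delayed run still covers its original window. A minimal sketch of that substitution, with the timestamp format as an assumption:

    package schedule

    import (
        "strings"
        "time"
    )

    // renderWindow substitutes the {from} and {to} placeholders with the
    // window boundaries, so a delayed run still scans its original window.
    // The timestamp format here is an assumption for illustration.
    func renderWindow(query string, from, to time.Time) string {
        r := strings.NewReplacer(
            "{from}", from.UTC().Format("2006-01-02 15:04:05"),
            "{to}", to.UTC().Format("2006-01-02 15:04:05"),
        )
        return r.Replace(query)
    }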

        This issue was identified by our monitoring systems earlier today and resolved after some debugging a few hours later.

        Nov 22, 2024
        1 year ago
        Nov 22 2024 API Downtime
        Resolved November 22 at 5:51 PM

A misconfiguration when rolling out a new feature caused our API to not restart successfully. We identified the issue immediately; it took 15 minutes to implement and review a fix, and deploying the fix took another 15 minutes.

This incident didn't impact our ability to collect logs or deliver them to our customers' destinations.
