Source Error Handling Edge Case Caused Some Duplicates

Resolved

While rolling out the fix for improved source error handling, we encountered a separate issue with our SaaS ClickHouse instance. See the new incident for details:

https://runrevealstatus.com/events/report/506

Monitoring

We've implemented a fix and are monitoring the rollout.

The root cause was not related to destinations, as we had originally suspected.

We determined the root cause to be in how we parse individual S3 files for one particular source type (awsdns). Previously, we had delegated all error handling to the individual sources. One malformed file in one source caused an error to bubble up that should never have been returned, as sketched below.
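
To make the failure mode concrete, here is a minimal Go sketch of the old pattern. None of these names (ObjectFile, parseAWSDNSFile, processQueue) come from our actual codebase; they only illustrate how one bad file could take down a whole queue.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical stand-ins for the real types in this sketch.
type ObjectFile struct {
	Key       string
	Malformed bool
}
type Event struct{ Raw string }

var errBadRecord = errors.New("malformed record")

// parseAWSDNSFile stands in for the source-specific parser.
func parseAWSDNSFile(f ObjectFile) ([]Event, error) {
	if f.Malformed {
		return nil, errBadRecord
	}
	return []Event{{Raw: f.Key}}, nil
}

// Old behavior (simplified): error handling was delegated to each
// source, so a parse error from one file bubbled up and terminated
// the queue, abandoning the files still in flight behind it.
func processQueue(files []ObjectFile) error {
	for _, f := range files {
		events, err := parseAWSDNSFile(f)
		if err != nil {
			return fmt.Errorf("parse %s: %w", f.Key, err) // kills the queue
		}
		fmt.Println("delivered", len(events), "events from", f.Key)
	}
	return nil
}

func main() {
	files := []ObjectFile{{Key: "a.log"}, {Key: "b.log", Malformed: true}, {Key: "c.log"}}
	if err := processQueue(files); err != nil {
		fmt.Println("queue terminated:", err) // c.log is never processed
	}
}
```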

We now handle all parsing issues across all object storage buckets using the same logic. If there's an issue parsing a file in an object storage bucket, we record it and expose it through our source errors, but do not terminate the queue.
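
A sketch of the new behavior, reusing the hypothetical types from the example above: parse failures are recorded and surfaced as source errors, and the loop continues, so one bad file no longer stops the queue. recordSourceError is illustrative, not our real API.

```go
// New behavior (simplified): shared handling for all object storage
// buckets. A parse failure is recorded as a source error and the
// queue moves on to the next file instead of terminating.
func processQueueResilient(files []ObjectFile) {
	for _, f := range files {
		events, err := parseAWSDNSFile(f)
		if err != nil {
			recordSourceError(f.Key, err) // surfaced via source errors
			continue                      // do not terminate the queue
		}
		fmt.Println("delivered", len(events), "events from", f.Key)
	}
}

// recordSourceError stands in for persisting the failure so it can
// be exposed to the customer alongside other source errors.
func recordSourceError(key string, err error) {
	fmt.Printf("source error recorded for %s: %v\n", key, err)
}
```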

This will give our customers the visibility they need without affecting the processing of other sources.

The impact was that some object storage sources may have seen a small number of duplicates: when the queues responsible for handling the malformed files terminated, the other object files they were processing at the time were redelivered and reprocessed on restart. No data was lost.

Identified

We've identified the issue and are implementing a fix.

Investigating

We're noticing transient connectivity issues when connecting to some destinations. These are causing some batches to be retried, which may result in duplicates being written to those destinations.

We're investigating and will update this page as soon as we have more information.

No data has been lost.