This incident has been resolved. All services are operating normally.
Summary: A failing physical disk under one of our Ceph OSDs caused slow I/O operations on our Postgres primary. The slow operations held database connections open longer than usual, saturating the connection pool. Once the pool was saturated, both read and write requests that could not acquire a connection within our timeout window were returned as 500 errors by the Public API. During the period of degraded storage performance, two Postgres read replicas accumulated WAL replay inconsistencies that prevented them from catching up automatically.
We marked the affected OSD out of the Ceph cluster, allowed the cluster to rebalance to the remaining healthy OSDs, and then rebuilt the two affected Postgres replicas from the primary to restore full replication health.
We are conducting an internal review to improve our detection of slow OSD operations and replication lag. A more detailed post-mortem will be shared with affected customers within 48 hours.
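For readers who want to see the failure mode in the first incident concretely, the following is a minimal sketch of how a saturated connection pool turns slow queries into 500 responses. It is an illustration only and not our production code: the `ConnectionPool` class, the `handle_request` function, the pool size, and the timeout values are all hypothetical.

```python
import queue


class PoolTimeout(Exception):
    """Raised when no connection becomes free within the timeout window."""


class ConnectionPool:
    """Hypothetical fixed-size pool; connections are plain strings here."""

    def __init__(self, size):
        self._free = queue.Queue(maxsize=size)
        for i in range(size):
            self._free.put(f"conn-{i}")

    def acquire(self, timeout):
        # Block for at most `timeout` seconds waiting for a free connection.
        try:
            return self._free.get(timeout=timeout)
        except queue.Empty:
            raise PoolTimeout("no connection available") from None

    def release(self, conn):
        self._free.put(conn)


def handle_request(pool, acquire_timeout=0.05):
    """Return an HTTP-style status code for one simulated request."""
    try:
        conn = pool.acquire(acquire_timeout)
    except PoolTimeout:
        # The pool is saturated, so the request fails before reaching the database.
        return 500
    try:
        return 200  # stand-in for running the actual query
    finally:
        pool.release(conn)


if __name__ == "__main__":
    pool = ConnectionPool(size=2)
    # Simulate slow storage: two in-flight queries hold every connection.
    held = [pool.acquire(0.01), pool.acquire(0.01)]
    print(handle_request(pool, acquire_timeout=0.01))
    # Once a slow query finishes and releases its connection, requests succeed again.
    pool.release(held[0])
    print(handle_request(pool, acquire_timeout=0.01))
```

In this sketch the request fails without ever touching the database, which is why reads and writes were affected alike once the pool was full.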
This incident has been resolved. All services are operating normally.
Summary: During a planned database infrastructure migration, agent data ingestion was temporarily interrupted. Historical data was not affected.
Duration: 7 minutes
Root cause: A brief service interruption occurred during the cutover phase of a database migration.
This incident has been resolved. All services are operating normally.
Summary: A breaking change in an upstream dependency caused rounding issues. The problem has been resolved.
Status is automatically updated every 60 seconds. For urgent issues, contact authors@kubeadapt.io. Subscribe to updates via RSS or email notifications.