Status is automatically updated every 60 seconds. For urgent issues, contact authors@kubeadapt.io. Subscribe to updates via RSS or email notifications.
Resolved
This incident has been resolved. All services are operating normally.
Summary: A failing physical disk under one of our Ceph OSDs caused slow I/O operations on our Postgres primary. The slow operations held database connections open longer than usual, saturating the connection pool. Once the pool was saturated, both read and write requests that could not acquire a connection within our timeout window were returned as 500 errors by the Public API. During the period of degraded storage performance, two Postgres read replicas accumulated WAL replay inconsistencies that prevented them from catching up automatically.
We marked the affected OSD out of the Ceph cluster, allowed the cluster to rebalance to the remaining healthy OSDs, and then rebuilt the two affected Postgres replicas from the primary to restore full replication health.
We are conducting an internal review to improve our detection of slow OSD operations and replication lag. A more detailed post-mortem will be shared with affected customers within 48 hours.
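As an illustration of the replication lag detection mentioned above, the following is a minimal sketch, not our production monitoring. It assumes the primary's current WAL position and a replica's replay position have already been fetched as text, for example with the standard Postgres functions pg_current_wal_lsn() and pg_last_wal_replay_lsn(). The function names and the threshold are hypothetical.

```python
def lsn_to_int(lsn: str) -> int:
    """Convert a Postgres LSN such as '0/16B3748' to an absolute byte position.

    An LSN is printed as two hexadecimal numbers separated by a slash:
    the high 32 bits and the low 32 bits of a 64-bit WAL position.
    """
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def replay_lag_bytes(primary_lsn: str, replica_replay_lsn: str) -> int:
    """Bytes of WAL the replica has not yet replayed."""
    return lsn_to_int(primary_lsn) - lsn_to_int(replica_replay_lsn)


def is_lagging(primary_lsn: str, replica_replay_lsn: str,
               threshold_bytes: int) -> bool:
    """True when replay lag exceeds a (hypothetical) alert threshold."""
    return replay_lag_bytes(primary_lsn, replica_replay_lsn) > threshold_bytes
```

A replica whose lag keeps growing across successive checks, rather than merely exceeding the threshold once, is the stronger signal that it is not catching up on its own.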
Identified
Ceph rebalance is complete.
The two affected Postgres replicas have been unable to recover automatically. The earlier storage issues caused inconsistencies in their WAL replay state that prevent them from continuing normal replication. We will rebuild these replicas from the primary database to fully restore replication health and rebalance read traffic.
Identified
We have identified the root cause. A physical disk underlying one of our Ceph OSDs is producing hardware-level read errors. The OSD daemon retries these operations rather than failing fast, which is why we see slow operations rather than outright failures. The cumulative effect of these slow operations is what is saturating the connection pool on Postgres and causing the 500 errors on the Public API for both read and write requests.
Our next step is to mark the affected OSD out of the cluster. Ceph will then automatically rebalance the affected placement groups to the remaining healthy OSDs. We expect rebalancing to take a couple of hours. During this period, the Public API will remain degraded and may continue to see elevated response times on some requests.
A disk replacement has been scheduled with our infrastructure team for the next maintenance window.
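To illustrate why retries surface as slow operations rather than errors, here is a toy model. It is not Ceph's actual retry logic; the function name, the per-attempt duration, and the attempt outcomes are all hypothetical.

```python
def read_block(attempt_outcomes, seconds_per_attempt, max_attempts):
    """Simulate reading one block from a flaky disk.

    attempt_outcomes: list of booleans, True where that attempt would succeed.
    Returns (succeeded, total_seconds_spent).
    """
    elapsed = 0.0
    for ok in attempt_outcomes[:max_attempts]:
        elapsed += seconds_per_attempt
        if ok:
            return True, elapsed
    return False, elapsed


# Hypothetical disk: three failed reads, then a successful one.
outcomes = [False, False, False, True]

# Retrying: the read eventually succeeds, but takes four attempts' worth of time.
retrying = read_block(outcomes, 0.5, max_attempts=10)

# Failing fast: the caller gets an error after a single attempt.
fail_fast = read_block(outcomes, 0.5, max_attempts=1)
```

In this model the retrying read reports success, so nothing looks broken to the caller. The only visible symptom is the added latency, which matches what we observed.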
Investigating
The issue has been narrowed down to our Ceph storage cluster. One of the OSDs is reporting slow operations, which is degrading I/O performance for any database data placed on that OSD. Read and write queries that touch the affected data are slow, and the resulting connection pool saturation continues to impact requests across the API even when they do not directly hit the affected data.
We are isolating the specific OSD and determining whether the cause is at the disk hardware level or the OSD daemon level.
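The connection pool saturation described above can be approximated with Little's law: the average number of connections in use equals the request rate multiplied by the average time each request holds a connection. The sketch below uses made-up numbers (pool size, request rate, and hold times are hypothetical, not measurements from this incident).

```python
def required_connections(requests_per_second, mean_hold_ms):
    """Little's law: average connections in use = arrival rate * hold time."""
    return requests_per_second * mean_hold_ms / 1000


def is_saturated(requests_per_second, mean_hold_ms, pool_size):
    """True when steady-state demand exceeds the pool, so requests must queue."""
    return required_connections(requests_per_second, mean_hold_ms) > pool_size


POOL_SIZE = 50  # hypothetical pool size
RPS = 200       # hypothetical request rate

baseline = required_connections(RPS, mean_hold_ms=50)   # healthy storage
degraded = required_connections(RPS, mean_hold_ms=500)  # slow I/O on some queries
```

With the same traffic, a tenfold increase in average hold time needs ten times as many connections. Once demand exceeds the pool, every request waits for a connection, including requests that never touch the affected data.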
Investigating
We have traced the elevated error rate to slow I/O on our Postgres primary. The cascade is as follows: write operations are directly impacted by storage write latency, and read operations are queueing because connections are being held longer than usual by the slow in-flight queries. As the connection pool saturates, both read and write requests that cannot acquire a connection within our timeout window are returned as 500 errors by the Public API.
Our Postgres data resides on Ceph-backed persistent volumes. We are now examining the Ceph cluster for the source of the slow I/O.
Investigating
500 errors on public API since around 16:30 TRT. Error rate at 3.8%, climbing slowly. Sentry shows sqlalchemy.exc.OperationalError: (psycopg2.OperationalError) connection timeout expired on a mix of endpoints, mostly the ones that write to Postgres.
App-side connection pool metrics look fine, not exhausted. But query wait times are elevated and pgbouncer waiting clients went from 0 to ~30 in the last 20 min. p99 latency at 2.1s, baseline 200ms.
Going to look at Postgres directly.
Investigating
We are currently investigating reports of issues affecting our services. Our engineering team has been alerted and is actively looking into the matter.
We will provide updates as we learn more.