Engineering VP Ines Sombra published Bluesky's outage post-mortem on April 10, 2026. The report attributes 12 hours and 43 minutes of downtime, which began on April 5, to a PostgreSQL cluster overloaded by a 450% traffic surge. The outage disrupted some 28 million users, Sombra wrote.
Bluesky runs a decentralized social network on the AT Protocol. Go-based services rely on PostgreSQL 16. The platform's growth to 28 million users increased scalability pressures.
Outage Timeline
Failures began at 14:37 UTC on April 5. Users saw login errors and empty feeds on iOS, Android, and web. Bluesky updated its status page 15 minutes later.
The outage peaked at 18:20 UTC. Engineers pinpointed the underlying issues by 22:45 UTC and scaled up resources. Systems reached 99% availability by 03:20 UTC on April 6, per Sombra.
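The reported downtime can be checked against the timeline. The sketch below assumes the outage ran from the first failures at 14:37 UTC on April 5 to the 99% availability mark at 03:20 UTC on April 6; the function name outageDuration is illustrative only.

```go
package main

import (
	"fmt"
	"time"
)

// outageDuration returns the elapsed time between two RFC 3339 timestamps.
func outageDuration(start, end string) (time.Duration, error) {
	s, err := time.Parse(time.RFC3339, start)
	if err != nil {
		return 0, err
	}
	e, err := time.Parse(time.RFC3339, end)
	if err != nil {
		return 0, err
	}
	return e.Sub(s), nil
}

func main() {
	// Start and end times taken from the timeline in the report.
	d, err := outageDuration("2026-04-05T14:37:00Z", "2026-04-06T03:20:00Z")
	if err != nil {
		panic(err)
	}
	fmt.Println(d) // 12h43m0s
}
```

The result matches the 12 hours and 43 minutes stated in the report.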
Datadog spotted latency spikes. Monitors missed connection pool exhaustion. Root cause analysis took three hours.
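Connection pool exhaustion of the kind the monitors missed can be observed from inside a Go service. The sketch below is an illustration, not Bluesky's monitoring code: it assumes the services use the standard database/sql pool and flags saturation from two samples of sql.DBStats. The 0.9 threshold and the sample values are hypothetical.

```go
package main

import (
	"database/sql"
	"fmt"
)

// saturated holds the decision logic on plain numbers so it is easy to test.
// The pool counts as saturated if callers had to wait for a connection since
// the last sample, or if utilization is at or above the threshold.
func saturated(inUse, maxOpen int, newWaits int64, threshold float64) bool {
	if newWaits > 0 {
		return true
	}
	if maxOpen <= 0 {
		return false // 0 means no limit is configured
	}
	return float64(inUse)/float64(maxOpen) >= threshold
}

// poolSaturated compares two consecutive snapshots from (*sql.DB).Stats().
func poolSaturated(prev, cur sql.DBStats, threshold float64) bool {
	return saturated(cur.InUse, cur.MaxOpenConnections,
		cur.WaitCount-prev.WaitCount, threshold)
}

func main() {
	// Hypothetical samples; in a service these would come from db.Stats().
	prev := sql.DBStats{MaxOpenConnections: 100, InUse: 40, WaitCount: 10}
	cur := sql.DBStats{MaxOpenConnections: 100, InUse: 97, WaitCount: 25}
	fmt.Println(poolSaturated(prev, cur, 0.9)) // true
}
```

Exporting InUse, WaitCount, and WaitDuration as metrics would let an alert fire on pool pressure directly, rather than only on the latency it causes.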
Root Cause Analysis
PostgreSQL replicas were overloaded. Primary nodes handled 15,000 queries per second, 320% over capacity. Go servers opened connection storms that saturated the cluster, the report states.
A viral topic on decentralized identity drove 2.1 million sign-ups in 24 hours. Read-heavy feed generation strained resources.
Financial Impact
User retention dropped 8% post-outage, internal analytics showed. Sombra cited a 14% projected quarterly revenue decline. Premium subscriptions cost 5 USD monthly.
PitchBook valued Bluesky at 1.2 billion USD as of April 10, 2026. Downtime cut subscription renewals and ad impressions.
Implemented Fixes
Teams set connection pool limits at 5,000 per node. Go services added circuit breakers for automatic throttling.
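The report does not describe how the circuit breakers were built. As a rough illustration of the pattern, the sketch below stops calling a failing backend after a run of consecutive errors and allows a trial call once a cooldown has passed. The names Breaker, NewBreaker, ErrOpen, and errBackend are hypothetical, as are the thresholds.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

var (
	ErrOpen    = errors.New("circuit breaker open")
	errBackend = errors.New("backend unavailable") // stand-in failure for the demo
)

// Breaker opens after maxFailures consecutive errors and rejects calls
// until cooldown has elapsed.
type Breaker struct {
	mu          sync.Mutex
	maxFailures int
	cooldown    time.Duration
	failures    int
	openedAt    time.Time
}

func NewBreaker(maxFailures int, cooldown time.Duration) *Breaker {
	return &Breaker{maxFailures: maxFailures, cooldown: cooldown}
}

// Call runs fn unless the breaker is open. After the cooldown, calls are
// let through again as trials; this simple version does not limit how many
// trials run concurrently.
func (b *Breaker) Call(fn func() error) error {
	b.mu.Lock()
	open := b.failures >= b.maxFailures && time.Since(b.openedAt) < b.cooldown
	b.mu.Unlock()
	if open {
		return ErrOpen
	}

	err := fn()

	b.mu.Lock()
	defer b.mu.Unlock()
	if err != nil {
		b.failures++
		if b.failures >= b.maxFailures {
			b.openedAt = time.Now() // open, or re-open after a failed trial
		}
		return err
	}
	b.failures = 0 // a success closes the breaker
	return nil
}

func main() {
	b := NewBreaker(3, 30*time.Second)
	for i := 1; i <= 5; i++ {
		err := b.Call(func() error { return errBackend })
		fmt.Printf("call %d: %v\n", i, err)
	}
}
```

In this run the first three calls reach the backend and fail, and the last two are rejected with ErrOpen without touching it. That is the throttling effect the fix aims for.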
Bluesky upgraded PostgreSQL to version 17. Kubernetes autoscaling now triggers at 70% CPU utilization. Load tests verify recovery in 30 minutes.
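To show what a 70% CPU target means in practice, the sketch below applies the scaling rule documented for the Kubernetes Horizontal Pod Autoscaler: desired replicas equal the current count times the ratio of observed to target utilization, rounded up, with no change inside a default 10% tolerance. It assumes Bluesky uses the standard autoscaler, which the report does not confirm, and the replica counts are hypothetical.

```go
package main

import (
	"fmt"
	"math"
)

// desiredReplicas mirrors the documented HPA formula:
// ceil(currentReplicas * currentUtilization / targetUtilization),
// skipping any change when the ratio is within 10% of 1.0.
func desiredReplicas(current int, cpuUtil, target float64) int {
	ratio := cpuUtil / target
	if math.Abs(ratio-1.0) <= 0.1 {
		return current // within tolerance: no scaling
	}
	return int(math.Ceil(float64(current) * ratio))
}

func main() {
	// Hypothetical deployment of 10 pods against the 70% target.
	fmt.Println(desiredReplicas(10, 95, 70)) // 14
	fmt.Println(desiredReplicas(10, 72, 70)) // 10
}
```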
Prometheus alerts fire when replica lag exceeds 5 seconds. Weekly chaos drills simulate 600% traffic spikes.
Key Developer Lessons
Sombra urges strict connection pool limits to guard against thundering herds and recommends read replicas for feed generation.
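One way to enforce a hard ceiling on concurrent database work is a counting semaphore. The sketch below is a generic illustration, not code from the report; the Limiter type and its methods are hypothetical names. With database/sql, a similar cap is available through (*sql.DB).SetMaxOpenConns.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// Limiter caps the number of concurrent holders at the channel's capacity.
type Limiter struct {
	slots chan struct{}
}

func NewLimiter(n int) *Limiter {
	return &Limiter{slots: make(chan struct{}, n)}
}

// Acquire waits for a free slot or gives up when ctx is done, so a burst of
// callers queues for a bounded time instead of piling onto the database.
func (l *Limiter) Acquire(ctx context.Context) error {
	select {
	case l.slots <- struct{}{}:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// TryAcquire takes a slot only if one is free right now.
func (l *Limiter) TryAcquire() bool {
	select {
	case l.slots <- struct{}{}:
		return true
	default:
		return false
	}
}

// Release frees a slot previously taken by Acquire or TryAcquire.
func (l *Limiter) Release() { <-l.slots }

func main() {
	lim := NewLimiter(2)
	ctx, cancel := context.WithTimeout(context.Background(), 50*time.Millisecond)
	defer cancel()

	// Three requests compete for two slots; none releases, so the third
	// waits until the context deadline and is then rejected.
	for i := 1; i <= 3; i++ {
		if err := lim.Acquire(ctx); err != nil {
			fmt.Printf("request %d rejected: %v\n", i, err)
			continue
		}
		fmt.Printf("request %d holds a slot\n", i)
	}
}
```

Bounding the wait with a context matters as much as the cap itself: without a deadline, a herd of callers simply moves from the database into an unbounded queue.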
Add backpressure in microservices. Prefer gRPC with load shedding over plain HTTP/2 APIs, keeping in mind that gRPC itself runs on HTTP/2. Use pgBadger for slow query profiling.
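Load shedding means rejecting excess requests early instead of letting them queue. The sketch below uses the standard net/http package rather than gRPC to stay dependency-free; in gRPC the same idea would sit in a server interceptor. The names shed, okHandler, and probe are hypothetical, as are the limits.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// shed wraps next and answers 503 once more than maxInFlight requests
// are being served at the same time.
func shed(maxInFlight int64, next http.Handler) http.Handler {
	var inFlight atomic.Int64
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if inFlight.Add(1) > maxInFlight {
			inFlight.Add(-1)
			w.Header().Set("Retry-After", "1") // hint for clients to back off
			http.Error(w, "overloaded", http.StatusServiceUnavailable)
			return
		}
		defer inFlight.Add(-1)
		next.ServeHTTP(w, r)
	})
}

// okHandler stands in for a feed endpoint.
var okHandler = http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
	fmt.Fprintln(w, "feed")
})

// probe sends one in-memory request and returns the status code.
func probe(h http.Handler) int {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/feed", nil))
	return rec.Code
}

func main() {
	fmt.Println(probe(shed(1, okHandler))) // 200
	fmt.Println(probe(shed(0, okHandler))) // 503
}
```

A limit of zero is used here only to force the rejection path in a single-request demo; a real limit would be derived from load tests.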
Bluesky shifted 40% of databases to Google Cloud SQL for 90-second geo-failover.
Industry Reactions
Mastodon's Q1 2026 review cited similar PostgreSQL problems. Bluesky's report had earned 4,200 GitHub stars within 12 hours of its publication on April 10.
EU regulators reviewed the incident under the Digital Services Act. Bluesky filed its report within 24 hours, the company stated.
Prevention Roadmap
Bluesky targets AI anomaly detection by Q3 2026 with 12 million USD in upgrades.
Cloudflare now serves 60% of static content. Blue-green deployments aim for zero downtime.
The Bluesky outage post-mortem offers a blueprint for platforms that must keep their databases reliable under sudden load. The service aims for 50 million users by the end of 2026.
