In April, we experienced four incidents that resulted in degraded performance across GitHub services.
April 05 08:11 UTC (lasting 47 minutes)
On April 5, between 8:11 and 8:58 UTC, several GitHub services were degraded. Web request error rates peaked at 6%, API request error rates peaked at 10%, and over 100,000 GitHub Actions workflows failed to start. The root cause was traced to a change in the database load balancer, which caused connection failures to multiple critical databases in one of our three data centers. We resolved the incident by rolling back the change and have implemented new measures to detect similar problems earlier in the deployment pipeline to minimize user impact moving forward.
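As an illustration of the kind of pre-deployment safeguard this implies, the sketch below gates a load balancer rollout on connection checks against each critical database. The endpoints, names, and gating logic are hypothetical assumptions for illustration, not GitHub's real topology or deployment tooling.

```python
# Hypothetical pre-rollout canary for a database load balancer change:
# verify that every critical database still accepts connections through
# the candidate configuration before promoting it.
import socket

CRITICAL_DATABASES = {  # illustrative endpoints, not GitHub's real topology
    "mysql-primary": ("10.0.1.10", 3306),
    "mysql-replica": ("10.0.1.11", 3306),
    "auth-db":       ("10.0.2.10", 3306),
}

def connections_healthy(endpoints, timeout=2.0):
    """Return True only if every critical database accepts a TCP connection."""
    for name, (host, port) in endpoints.items():
        try:
            with socket.create_connection((host, port), timeout=timeout):
                pass
        except OSError as err:
            print(f"canary failed for {name}: {err}")
            return False
    return True

if __name__ == "__main__":
    # Gate the rollout: abort (and roll back) if any connection check fails.
    if not connections_healthy(CRITICAL_DATABASES):
        raise SystemExit("aborting load balancer rollout")
```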
April 10 08:18 UTC (lasting 80 minutes)
On April 10, between 8:18 and 9:38 UTC, several services experienced increased error rates due to an overloaded primary database instance caused by an unbounded query. To mitigate the impact, we scaled up the instance and shipped an improved version of the query to run against read replicas. The incident resulted in a 17% failure rate for web-based repository file editing and failure rates between 1.5% and 8% for other repository management operations. Issue and pull request authoring were also heavily impacted, and work is ongoing to remove dependence on the impacted database primary. GitHub search saw a 5% failure rate due to reliance on the impacted primary database when authorizing repository access.
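The mitigation pattern described here, replacing an unbounded query with a bounded one that runs against read replicas, can be sketched as follows. The schema, page size, and use of sqlite3 as a stand-in replica connection are assumptions for illustration, not GitHub's implementation.

```python
# Minimal sketch: bound the query with keyset pagination and run it against a
# read replica instead of the primary database.
import sqlite3  # stands in for a connection to a read replica

# Before: unbounded scan that can overload the primary.
UNBOUNDED = "SELECT id, name FROM repositories WHERE owner_id = ?"

# After: keyset pagination with an explicit LIMIT so each call does bounded work.
BOUNDED = (
    "SELECT id, name FROM repositories "
    "WHERE owner_id = ? AND id > ? ORDER BY id LIMIT 1000"
)

def fetch_repositories(replica_conn, owner_id):
    """Stream results from a read replica in bounded pages."""
    last_id = 0
    while True:
        rows = replica_conn.execute(BOUNDED, (owner_id, last_id)).fetchall()
        if not rows:
            break
        yield from rows
        last_id = rows[-1][0]

if __name__ == "__main__":
    replica = sqlite3.connect(":memory:")
    replica.execute(
        "CREATE TABLE repositories (id INTEGER PRIMARY KEY, owner_id INTEGER, name TEXT)"
    )
    replica.executemany(
        "INSERT INTO repositories (owner_id, name) VALUES (?, ?)",
        [(1, f"repo-{i}") for i in range(2500)],
    )
    print(sum(1 for _ in fetch_repositories(replica, owner_id=1)))  # 2500
```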
April 10 18:33 UTC (lasting 30 minutes)
On April 10, between 18:33 and 19:03 UTC, several services were degraded due to a compute-intensive database query that prevented a key database cluster from serving other queries. Impact was widespread due to the critical dependency on this cluster’s data. GitHub Actions experienced delays and failures, GitHub API requests had a significant number of timeouts, all GitHub Pages deployments during the incident period failed, and Git Systems saw HTTP 50X error codes for a portion of raw file and repository archive download requests. Issues also experienced increased latency for creation and updates, and GitHub Codespaces saw timeouts for requests to create and resume a codespace. The incident was mitigated by rolling back the offending query. We have a mechanism to detect similar compute-intensive queries in CI testing, but identified a gap in that coverage and have addressed that to prevent similar issues in the future. In addition, we have implemented improvements to various services to be more resilient to this dependency and to detect and stop deployments with similar regressions.
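A CI-time guard of the kind described above could look roughly like the following sketch, which inspects MySQL-style EXPLAIN output for expensive plans. The field names and thresholds are assumptions for illustration, not GitHub's actual tooling.

```python
# Hedged sketch of a CI check that fails a test when a query's plan looks
# compute-intensive (large row estimates or a full table scan with no index).
def is_compute_intensive(plan_rows, max_rows_examined=100_000):
    """Flag MySQL-style EXPLAIN rows that examine too many rows or do a full scan."""
    for row in plan_rows:
        if row.get("rows", 0) > max_rows_examined:
            return True
        if row.get("type") == "ALL" and row.get("key") is None:  # full scan, no index
            return True
    return False

def assert_query_is_cheap(cursor, query, params=()):
    """Run EXPLAIN on a DB-API cursor in CI and fail if the plan looks expensive."""
    cursor.execute("EXPLAIN " + query, params)
    columns = [desc[0] for desc in cursor.description]
    plan = [dict(zip(columns, row)) for row in cursor.fetchall()]
    if is_compute_intensive(plan):
        raise AssertionError(f"compute-intensive query detected: {query}")
```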
April 11 08:18 UTC (lasting 3 days, 4 hours, 23 minutes)
Between April 11 and April 14, GitHub.com experienced significant delays (up to two hours) in delivering emails, particularly time-sensitive emails such as password reset and unrecognized device verification messages. Users without 2FA attempting to sign in on an unrecognized device were unable to complete device verification, and users attempting to reset their password were unable to complete the reset. The delays were caused by increased usage of a shared resource pool and by a separate internal job queue that became unhealthy, which prevented the mailer queue from processing. We have made immediate improvements to better detect and react to similar situations in the future, including a queue-bypass ability for time-sensitive emails and updated methods of detecting anomalous email delivery. The unhealthy job queue has been paused to prevent impact to other queues using shared resources.
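The two improvements mentioned, a queue bypass for time-sensitive emails and detection of anomalous delivery latency, might be sketched as follows. The queue names, job fields, threshold, and generic job-queue client are all hypothetical assumptions, not GitHub's internals.

```python
# Illustrative sketch assuming a generic job-queue client that exposes
# enqueue(queue_name, job); all names and thresholds are hypothetical.
import statistics

TIME_SENSITIVE = {"password_reset", "device_verification"}

def enqueue_email(queue, email_type, payload):
    """Bypass the shared mailer queue for time-sensitive emails so a backlog
    elsewhere cannot delay password resets or device verification."""
    name = "mailers_critical" if email_type in TIME_SENSITIVE else "mailers_default"
    queue.enqueue(name, {"type": email_type, **payload})

def delivery_latency_anomalous(recent_s, baseline_s, threshold=3.0):
    """Simple z-score-style check for anomalous email delivery latency,
    comparing recent deliveries against a baseline window (in seconds)."""
    baseline_mean = statistics.mean(baseline_s)
    baseline_std = statistics.pstdev(baseline_s) or 1.0
    return (statistics.mean(recent_s) - baseline_mean) / baseline_std > threshold
```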
Please follow our status page for real-time updates on status changes and post-incident recaps. To learn more about what we’re working on, check out the GitHub Engineering Blog.
The post GitHub Availability Report: April 2024 appeared first on The GitHub Blog.