January 4 return to work caused Slack to go into overload
Slack has published an analysis of its January 4 outage, when its service went down under the load of what was, for many, the first work day of 2021.
“During the Americas’ morning we got paged by an external monitoring service: Error rates were creeping up. We began to investigate. As initial triage showed the errors getting worse, we started our incident process,” Slack said in a post.
As the company was starting to investigate, its dashboard and alerting service became unavailable. Slack said it had to fall back on older methods of finding errors, which was possible because its metrics backends were thankfully still up.
It also rolled back some changes that had been pushed out that day, but those were quickly found not to be the cause of the outage.
“While our infrastructure seemed to generally be up and running, we observed signs that we were seeing widespread network degradation, which we escalated to AWS, our main cloud provider,” it explained.
Slack was still up at 6.57am PST, seeing 99% of messages sent successfully, versus the 99.999% send rate it usually clocks. The company said it usually sees a traffic pattern of mini-peaks at the top of each hour and half hour, as reminders and other kinds of automation trigger and send messages. It said it has standard scaling procedures in place to manage these peaks.
“However, the mini-peak at 7am PST — combined with the underlying network problems — led to saturation of our web tier,” Slack said. “As load increased so did the widespread packet loss. The increased packet loss led to much higher latency for calls from the web tier to its backends, which saturated system resources in our web tier.
“Slack became unavailable.”
Some of Slack’s instances were marked unhealthy due to not being able to reach the backends they depended on, and as a result, its systems attempted to replace the unhealthy instances with new instances. Simultaneously, Slack’s autoscaling system downscaled the web tier.
The downscaling also kicked several engineers off the instances they were using to investigate.
“We scale our web tier based on two signals. One is CPU utilization … and the other is utilization of available Apache worker threads. The network problems prior to 7:00am PST meant that the threads were spending more time waiting, which caused CPU utilization to drop,” Slack explained.
“This drop in CPU utilization initially triggered some automated downscaling. However, this was very quickly followed by significant automated upscaling as a result of increased utilization of threads as network conditions worsened and the web tier waited longer for responses from its backends.”
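The interaction between the two scaling signals can be illustrated with a minimal sketch. The thresholds, the `WebTierMetrics` type and the `scaling_decision` function are all hypothetical; Slack's post does not give its real values or its autoscaling code.

```python
from dataclasses import dataclass

# Hypothetical thresholds; Slack's real values are not given in the source.
CPU_SCALE_DOWN_BELOW = 0.30
CPU_SCALE_UP_ABOVE = 0.70
WORKERS_SCALE_UP_ABOVE = 0.80


@dataclass
class WebTierMetrics:
    cpu_utilization: float     # fraction of CPU in use, 0.0 to 1.0
    worker_utilization: float  # fraction of Apache worker threads busy, 0.0 to 1.0


def scaling_decision(m: WebTierMetrics) -> str:
    """Return "up", "down", or "hold" for one evaluation period."""
    # Either signal running hot is enough to add capacity.
    if (m.cpu_utilization > CPU_SCALE_UP_ABOVE
            or m.worker_utilization > WORKERS_SCALE_UP_ABOVE):
        return "up"
    # Low CPU on its own triggers a scale-down. This misfires when the CPU
    # is idle only because threads are blocked waiting on the network.
    if m.cpu_utilization < CPU_SCALE_DOWN_BELOW:
        return "down"
    return "hold"


# Early in the incident: threads wait on slow backends, so CPU drops.
early = WebTierMetrics(cpu_utilization=0.20, worker_utilization=0.60)
# Later: waits grow long enough that nearly every worker thread is occupied.
later = WebTierMetrics(cpu_utilization=0.15, worker_utilization=0.95)

print(scaling_decision(early))  # down
print(scaling_decision(later))  # up
```

In this sketch the same underlying fault produces opposite decisions a few minutes apart, which matches the sequence Slack describes: a brief automated downscale followed by a large upscale.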
Slack said it attempted to add 1,200 servers to its web tier between 7.01am and 7.15am PST.
“Unfortunately, our scale-up did not work as intended,” it said.
“The spike of load from the simultaneous provisioning of so many instances under suboptimal network conditions meant that provision-service hit two separate resource bottlenecks (the most significant one was the Linux open files limit, but we also exceeded an AWS quota limit).”
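To see why an open files limit caps how much provisioning can happen in parallel, consider the rough sketch below. It reads the current process's limit with Python's standard `resource` module (Unix only) and estimates how many concurrent jobs fit under it. The `max_parallel_provisions` helper and every number in it are assumptions for illustration; the source does not say how provision-service is implemented or how many descriptors each job uses.

```python
import resource


def max_parallel_provisions(soft_limit: int, fds_in_use: int,
                            fds_per_provision: int, reserve: int = 64) -> int:
    """Estimate how many provisioning jobs fit under the open files limit.

    All inputs are illustrative; `reserve` keeps some descriptors free for
    logging, health checks and similar housekeeping.
    """
    available = soft_limit - fds_in_use - reserve
    if available <= 0 or fds_per_provision <= 0:
        return 0
    return available // fds_per_provision


# The soft limit is what the kernel actually enforces for this process.
soft_limit, _hard_limit = resource.getrlimit(resource.RLIMIT_NOFILE)

if soft_limit == resource.RLIM_INFINITY:
    print("no open files limit is set for this process")
else:
    # Assumed figures: 200 descriptors already open, 4 needed per job.
    print(max_parallel_provisions(soft_limit, fds_in_use=200, fds_per_provision=4))
```

With a soft limit of 1,024, for example, the assumed figures leave room for 190 concurrent jobs, far fewer than the 1,200 instances Slack tried to bring up at once.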
Slack said while it was repairing the provision-service, it was still under capacity for its web tier because the scale-up was not working as expected. A large number of instances had been created, but most of them were not fully provisioned and were not serving. The large number of broken instances caused Slack to also hit its pre-configured autoscaling-group size limits, which determine the maximum number of instances in its web tier.
“These size limits are multiples of the number of instances that we normally require to serve our peak traffic,” it said, noting that while broken instances were being cleared and the investigation into connectivity problems was ongoing, monitoring dashboards were still down.
Provision-service came back online at 8.15am PST.
“We saw an improvement as healthy instances entered service. We still had some less-critical production issues which were mitigated or being worked on, and we still had increased packet loss in our network,” Slack said.
Its web tier now had a sufficient number of functioning hosts to serve traffic, but its load balancing tier was still showing an extremely high rate of health check failures to its web application instances due to network problems. The load balancers’ “panic mode” feature kicked in, and requests were balanced across all instances, including those that were failing health checks.
“This — plus retries and circuit breaking — got us back to serving,” it said.
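The general idea behind a load balancer's panic mode can be sketched in a few lines: when too large a share of hosts fail health checks, the checks themselves are treated as unreliable and traffic is spread across every host. The threshold value and the `eligible_hosts` function are hypothetical and are not taken from Slack's configuration.

```python
from typing import Dict, List

# Hypothetical threshold: below this healthy fraction, ignore health checks.
PANIC_THRESHOLD = 0.5


def eligible_hosts(health: Dict[str, bool],
                   panic_threshold: float = PANIC_THRESHOLD) -> List[str]:
    """Return the hosts a request may be routed to."""
    if not health:
        return []
    healthy = [host for host, ok in health.items() if ok]
    if len(healthy) / len(health) < panic_threshold:
        # Panic mode: so many hosts look unhealthy that the health checks
        # are suspect, so balance across every host regardless of status.
        return list(health)
    return healthy


# Normal operation: only the hosts passing health checks receive traffic.
normal = {"web-1": True, "web-2": True, "web-3": True, "web-4": False}
# Incident: packet loss makes most health checks fail.
incident = {"web-1": True, "web-2": False, "web-3": False, "web-4": False}

print(eligible_hosts(normal))    # ['web-1', 'web-2', 'web-3']
print(eligible_hosts(incident))  # ['web-1', 'web-2', 'web-3', 'web-4']
```

The trade-off is that some requests will reach hosts that really are broken, which is why this approach is usually paired with retries and circuit breaking, as Slack notes.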
By around 9.15am PST, Slack was “degraded, not down”.
“By the time Slack had recovered, engineers at AWS had found the trigger for our problems: Part of our AWS networking infrastructure had indeed become saturated and was dropping packets,” it said.
“On January 4th, one of our [AWS] Transit Gateways became overloaded. The TGWs are managed by AWS and are intended to scale transparently to us. However, Slack’s annual traffic pattern is a little unusual: Traffic is lower over the holidays, as everyone disconnects from work (good job on the work-life balance, Slack users!).
“On the first Monday back, client caches are cold and clients pull down more data than usual on their first connection to Slack. We go from our quietest time of the whole year to one of our biggest days quite literally overnight.”
While Slack said its own serving systems scaled quickly to meet such peaks in demand, its TGWs did not scale fast enough.
“During the incident, AWS engineers were alerted to our packet drops by their own internal monitoring, and increased our TGW capacity manually. By 10:40am PST that change had rolled out across all Availability Zones and our network returned to normal, as did our error rates and latency,” it wrote.
Slack said it has set itself a reminder to request a preemptive upscaling of its TGWs at the end of the next holiday season.
On May 12, 2020, Slack went down for several hours amid mass COVID-19-related teleworking.