Cloudflare experienced a major outage caused by a bug in the company’s firewall software, leaving many sites unreachable in its wake. The problem was fixed in roughly 30 minutes, and soon afterward the affected sites recovered.
What happened?
Users around the world had trouble reaching websites and were shown the message “502 Bad Gateway.” Several services that relied on Cloudflare, including AT&T, were knocked offline. The company soon acknowledged the widespread outage and said it was working on a fix, noting on its status page that it had identified a solution, was testing it, and would post an update once the issue was resolved.
Appear to have mitigated the issue causing the outage. Traffic restored. Working now to restore all services globally. More details to come as we have them.
— Matthew Prince 🌥 (@eastdakota) July 2, 2019
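For readers unfamiliar with the error, a 502 Bad Gateway means the edge proxy (here, Cloudflare) could not get a valid response from the service behind it. The minimal sketch below, which is not Cloudflare tooling and uses a placeholder URL, shows how a client-side script might poll an affected site and back off while it keeps returning 502 during an incident like this one.

```python
import time
import urllib.request
import urllib.error

# Minimal sketch, not Cloudflare tooling: poll a site fronted by a proxy/CDN
# and back off while it returns 502 Bad Gateway (the edge server could not get
# a valid response from the service behind it). The URL is a placeholder.
def wait_until_up(url: str, attempts: int = 5, delay: float = 2.0) -> bool:
    for attempt in range(attempts):
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                return 200 <= resp.status < 300   # site is reachable again
        except urllib.error.HTTPError as err:
            if err.code != 502:                   # a different error; stop retrying
                raise
            time.sleep(delay * (attempt + 1))     # simple linear backoff
    return False

# Example: wait_until_up("https://example.com/")
```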
CEO and company explain the cause of the outage
CEO Matthew Prince clarified the cause of the outage. According to him, a massive spike in CPU usage caused the primary and backup systems to fall over, which took down all of the company’s services, and there was no evidence of any attack. He added that the service responsible for the CPU spike had been shut down, traffic was back to normal levels, and the team was investigating what caused the issue in the first place.
Prince also confirmed that it was not a distributed denial-of-service (DDoS) attack, even though the team initially suspected one.
The company later explained in more detail that for about 30 minutes, visitors to Cloudflare-proxied sites received 502 errors caused by a massive spike in CPU utilization on its network. The spike was triggered by a bad software deploy, which was rolled back. Once the rollback completed, service returned to normal operation and all domains using Cloudflare returned to regular traffic levels. The company stressed that it was not an attack, as some had speculated, and said it was incredibly sorry for the outage, adding that its internal teams were meeting to perform a full post-mortem to understand the cause and prevent it from happening again.
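The article does not say what the bad deploy actually contained, so the following is purely illustrative: it shows how a single pathological firewall-style rule, here a regex with nested quantifiers, can pin a CPU core on its own, since the match time grows roughly exponentially with input length when the pattern fails. The rule and payload are hypothetical, not Cloudflare’s.

```python
import re
import time

# Purely illustrative; this is NOT Cloudflare's actual rule. A regex with
# nested quantifiers such as (a+)+ forces the engine to try exponentially
# many ways to split the input when the overall match fails, which can pin
# a CPU core on a single request.
PATHOLOGICAL_RULE = re.compile(r"(a+)+$")

for n in range(14, 23, 2):
    payload = "a" * n + "!"          # the trailing "!" guarantees the match fails
    start = time.perf_counter()
    PATHOLOGICAL_RULE.match(payload)
    elapsed = time.perf_counter() - start
    # Match time roughly quadruples for every two extra characters.
    print(f"{len(payload):>3} chars: {elapsed:.4f}s")
```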
The company takes full responsibility and apologizes.
The company also apologized for the outage, writing that it is fully aware that situations like this are frightening for its customers.
Places most affected by the outage
Europe and the East Coast of the United States were the areas most affected by the outage, as it occurred during business hours in those regions.