Cloudflare's post-mortem is worth 10 minutes of your time
X, ChatGPT, and Shopify all went down. The explanation of what happened is surprisingly useful.
What IT Pros Can Learn From Cloudflare's Very Bad Tuesday
Last Tuesday, Cloudflare had their worst outage since 2019.
Shopify, Indeed, ChatGPT, X, and Truth Social all went down (CNBC). Downdetector was also down, which is the kind of irony that writes itself.
The cause wasn't a sophisticated attack or a hardware failure. It was a config file that got too big.
But here's why I'm writing about it: Cloudflare published a detailed post-mortem within hours of restoring service, and it's one of the better incident reports I've read in a while. Whether or not you use Cloudflare, there's a lot here worth stealing for your own incident response process.
What actually happened
The issue was triggered by a change to one of their database systems' permissions, which caused the database to output multiple entries into a "feature file" used by their Bot Management system. That feature file doubled in size (Cloudflare).
The bot detection system uses a config file containing machine learning features, normally about 60 of them. The feature file ballooned beyond 200 entries due to duplicate data from the underlying database tables (MGX).
The software had a hardcoded limit of 200 features. When the file exceeded that, it didn't fail gracefully. It panicked and crashed, taking down core traffic routing across their network.
Database permission change → duplicate query results → oversized config file → hard limit exceeded → panic → widespread 5xx errors.
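To make the failure mode concrete, here's a minimal sketch of the pattern, not Cloudflare's actual code (the real component is a Rust proxy, hence the "panic" wording, and the real file holds ML feature definitions). The function names and numbers below are mine; the point is the difference between treating an oversized file as fatal and falling back to the last known-good config.

```python
# Minimal sketch of the failure pattern; not Cloudflare's actual code.
# FEATURE_LIMIT mirrors the hardcoded 200-entry ceiling described in the report;
# everything else (names, file contents) is made up for illustration.

FEATURE_LIMIT = 200


def load_features_strict(entries: list[str]) -> list[str]:
    """Roughly what happened: exceeding the limit is treated as fatal."""
    if len(entries) > FEATURE_LIMIT:
        # The equivalent of an unhandled panic: the process dies,
        # and traffic handling dies with it.
        raise RuntimeError(
            f"feature file has {len(entries)} entries, limit is {FEATURE_LIMIT}"
        )
    return entries


def load_features_graceful(entries: list[str], last_good: list[str]) -> list[str]:
    """A more forgiving alternative: reject the bad file, keep serving traffic
    with the last known-good configuration, and alert loudly."""
    if len(entries) > FEATURE_LIMIT:
        print(f"WARNING: oversized feature file ({len(entries)} entries); keeping last good config")
        return last_good
    return entries


if __name__ == "__main__":
    good_file = [f"feature_{i}" for i in range(60)]   # ~60 features, the normal case
    bad_file = good_file * 4                          # duplicates push it past 200

    print(len(load_features_graceful(bad_file, last_good=good_file)))  # 60: traffic keeps flowing
    load_features_strict(bad_file)                    # raises: the proxy "panics"
```

The limit itself isn't the problem; the problem is that hitting it killed the process instead of degrading to a known-good state.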
To make things worse, the feature file was regenerated every five minutes. Depending on which part of the ClickHouse cluster the query ran on, either a good or a bad configuration file was generated, causing intermittent cycles of failure and recovery (Rescana).
Services kept flickering up and down while engineers tried to diagnose the problem. Eventually all nodes got the bad file simultaneously and the system stabilized—in a fully broken state.
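If the flapping is hard to picture, here's a toy simulation of the five-minute cycle. Every number and name in it is made up for illustration; the only thing taken from the report is the idea that some database nodes produced a good file and others produced an oversized one.

```python
import random

# Toy simulation of the five-minute regeneration cycle. Node counts,
# probabilities, and timings are illustrative, not from Cloudflare's report.

FEATURE_LIMIT = 200
BASE_FEATURES = 60


def generate_feature_file(node_is_updated: bool) -> int:
    """Nodes with the new permissions see duplicate metadata rows and emit an oversized file."""
    return BASE_FEATURES * 4 if node_is_updated else BASE_FEATURES


random.seed(1)
updated_fraction = 0.5  # mid-rollout: half the cluster has the new permissions

for cycle in range(6):  # six cycles, roughly half an hour
    node_is_updated = random.random() < updated_fraction
    entries = generate_feature_file(node_is_updated)
    status = "PANIC (5xx errors)" if entries > FEATURE_LIMIT else "healthy"
    print(f"t+{cycle * 5:2d}m: {entries:3d} entries -> {status}")
```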
Core traffic was largely flowing as normal by 14:30 UTC. All systems were functioning normally by 17:06 UTC (Cloudflare).
Why this post-mortem is actually good
Most outage communications read like they were written by a lawyer and a PR team in a cage match. You get something like "we experienced degraded performance due to an internal issue" and that's it.
Cloudflare gave us timestamps, root causes, and—importantly—where they went wrong during the investigation itself.
They admitted the misdiagnosis.
Between 11:32 and 13:05 UTC, the Cloudflare team investigated elevated error rates, initially suspecting a hyper-scale DDoS attack due to fluctuating system behavior and the coincidental unavailability of the status page (Rescana).
For about 90 minutes, they were looking for attackers that didn't exist. And they told us that. They explained why the symptoms looked like a DDoS (intermittent failures, status page down) and when they realized it wasn't.
The technical analysis didn't try to hide their mistakes, handwave them away, or make excuses (Krebs on Security).
This is what a blameless post-mortem looks like in practice. You document the diagnostic process, including the dead ends, because that's how you actually learn something.
They led with what it wasn't.
The issue was not caused, directly or indirectly, by a cyber attack or malicious activity of any kind (Cloudflare).
When services go down, the first question is always "were we hacked?" Addressing that immediately lets everyone focus on the actual problem instead of spinning up threat scenarios.
They explained impact from the user's perspective.
For customers using Bot Management rules, all traffic received a bot score of zero, leading to large numbers of false positives and potential blocking of legitimate users (Rescana).
Not "bot management experienced issues" but "all traffic got scored as zero, which means your rules probably blocked legitimate users." That's actionable information.
The security detail worth noting
This one didn't get much attention, but it matters:
Cloudflare's WAF does a good job filtering out malicious traffic. Automated bots would thus probably get some penetration in the window with Cloudflare down (Krebs on Security).
For a few hours, the WAF wasn't filtering. If you were behind Cloudflare during the outage, it's worth checking your origin server logs from that window.
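A quick first pass could look like the sketch below: pull the origin requests that landed during the outage window out of a standard combined-format access log. The path, date, and window boundaries are placeholders on my part, so swap in the timestamps from the post-mortem and your own logging setup.

```python
from datetime import datetime, timezone

# Pull origin requests that landed during the outage window from a
# combined-format access log. LOG_PATH, the date, and the window are
# placeholders; adjust them for your own environment.

LOG_PATH = "/var/log/nginx/access.log"
WINDOW_START = datetime(2025, 11, 18, 11, 20, tzinfo=timezone.utc)
WINDOW_END = datetime(2025, 11, 18, 17, 10, tzinfo=timezone.utc)


def parse_ts(line: str) -> datetime | None:
    """Extract the [18/Nov/2025:11:28:03 +0000] timestamp from a combined-format line."""
    try:
        raw = line.split("[", 1)[1].split("]", 1)[0]
        return datetime.strptime(raw, "%d/%b/%Y:%H:%M:%S %z")
    except (IndexError, ValueError):
        return None


with open(LOG_PATH, encoding="utf-8", errors="replace") as f:
    for line in f:
        ts = parse_ts(line)
        if ts and WINDOW_START <= ts <= WINDOW_END:
            print(line, end="")  # eyeball these, or feed them into further filtering
```

Requests in that slice that you'd normally expect the WAF to have absorbed (scanner user agents, probes against admin paths, bursts from a single IP) deserve a closer look.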
There's also the dependency problem:
Many websites behind Cloudflare found they could not migrate away because the Cloudflare portal was unreachable and/or because they also were getting their DNS services from Cloudflare. Krebs on Security
When your CDN, WAF, DNS, and management portal all come from the same provider, an outage means you're just waiting. You can't even fail over manually because you can't reach the controls.
Not saying you need to rearchitect everything tomorrow, but it's worth thinking about which dependencies share a single point of failure.
What you can take from this
If your org doesn't do formal post-incident reviews, here's a framework worth considering (there's a minimal template sketch after the list):
Start with the timeline. When did symptoms start? When was it detected? When was the root cause identified? When was it resolved? Cloudflare's report has timestamps down to the minute.
Separate symptoms from root cause. Users saw 5xx errors. The actual problem was a config file exceeding a size limit. These are different things and both matter.
Document the investigation, not just the fix. Including the wrong turns. "We thought it was X because of Y, then realized it was Z" is more useful than "we fixed it."
Be specific. "A feature file exceeded the 200-entry hardcoded limit" is useful. "A configuration issue occurred" is not.
Explain impact in user terms. "Auth service returned errors" means less than "users couldn't log in for 47 minutes."
Assign follow-up items. A post-mortem without action items is just documentation. Make sure someone owns the fixes.
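One way to keep reviews honest is to write them against a fixed structure. Here's a minimal sketch of that idea; the field names are mine, not a standard, so rename them to match whatever your team actually tracks.

```python
from dataclasses import dataclass, field
from datetime import datetime

# A minimal post-incident review structure based on the framework above.
# Field names are illustrative, not a standard.


@dataclass
class ActionItem:
    description: str
    owner: str                       # a post-mortem without owners is just documentation
    due: datetime | None = None


@dataclass
class PostMortem:
    title: str
    symptoms: str                    # what users saw ("5xx errors on every request")
    root_cause: str                  # what actually broke ("feature file exceeded a 200-entry limit")
    impact_user_terms: str           # "users couldn't log in for 47 minutes", not "auth degraded"
    detected_at: datetime
    root_cause_identified_at: datetime
    resolved_at: datetime
    investigation_notes: list[str] = field(default_factory=list)  # include the wrong turns
    action_items: list[ActionItem] = field(default_factory=list)
```

The point isn't to load reviews into code; it's that a fixed set of fields makes it obvious when a write-up is missing its timeline, its user-facing impact, or an owner for each fix.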
The bigger picture
This wasn't an isolated incident.
This comes less than a month after Amazon Web Services suffered a daylong disruption, followed by a global outage of Microsoft's Azure cloud and 365 services (CNBC).
AWS. Azure. Cloudflare. All within about four weeks, and none of them were attacks.
A trio of autumn outages in a four-week period highlights how configuration and metadata errors in the cloud are becoming "the new power cuts" (Cybernews).
These are all variations on the same theme: someone changed a setting, it cascaded in ways nobody anticipated, and a lot of things broke. The systems are complex enough that predicting every failure mode is basically impossible.
The organizations that handle this well are the ones that document failures thoroughly and share what they learned. Cloudflare's post-mortem is a good example of that.
Cloudflare acknowledged this as their worst outage since 2019 (Cyber Press) and committed to a list of specific remediation steps. Whether they follow through is another question, but at least the commitments are public and specific.
Worth bookmarking
Cloudflare publishes all their incident post-mortems at blog.cloudflare.com/tag/post-mortem. Even if you're not a customer, they're solid reading for anyone managing production systems. Their June 2025 and September 2025 outage reports are also worth a look.
If you've got a post-incident review process that works well (or a horror story about one that didn't), reply and let me know. Always looking for examples.