"If they had a perfectly normalized database, no NULLing and formally verified code, this bug would not have happened."
That may be. What's not specified there is the immense, immense cost of driving a dev org on those terms. It limits, radically, the percent of engineers you can hire (to those who understand this and are willing to work this way), and it slows deployment radically.
Cloudflare may well need to transition to this sort of engineering culture, but there is no doubt that they would not be in the position they are in if they started with this culture -- they would have been too slow to capture the market.
I think critiques that have actionable plans for real dev teams are likely to be more useful than what, to me, reads as a sort of complaint from an ivory tower. Culture matters, shipping speed matters, quality matters, team DNA matters. That's what makes this stuff hard (and interesting!)
That's entirely right. Products have to transition from fast-moving exploratory products to boring infrastructure. We have different goals and expectations for an ecommerce web app vs. a database, or a database vs. the software controlling an insulin pump.
Having said that, at this point, Cloudflare's core DDOS-protection proxy should now be built more like an insulin pump than like a web app. This thing needs to never go down worldwide, much more than it needs to ship a new feature fast.
You are simplifying the control software of an insulin pump to a degree that does not match reality. I'm saying that because I actually reviewed the code of one, and the amount of safety consciousness on display there was off the charts compared to what you usually encounter in typical web development. You also under-estimate the dynamic nature of the environment these pumps operate in, as well as the amount of contingency planning that they embody: failure modes of each and every part in the pump were taken into consideration, and there are more such parts than you are most likely aware of. This includes material defects, defects resulting from abuse, wear & tear, parts being simply out of spec and so on.
To see this as the typical firmware that ships with say a calculator or a watch is to diminish the accomplishment considerably.
I had a former coworker who moved from the medical device industry to similar-to-cloudflare-web software. While he had some appreciation for the validation and intense QA they did (they didn't use formal methods, just heavy QA and deep specs), it became very clear to him that those approaches don't work with speed-of-release as a concern (his development cycles were annual, not weekly or daily). And they absolutely don't work in contexts where handling user abuse or reacting quickly is necessary. The contexts are just totally different.
An insulin pump is a good metaphor; insulin as a hormone has a lot of interactions, and the pump itself, if it doesn't want to unalive its user, should (most do not) account for external variables such as exercise, heart rate, sickness, etc. These variables are left for the user to deal with, which in this case makes for a subpar experience in managing the condition.
This bug might not have, but others would. Formal verification methods still rely on humans to input the formal specification, which is where problems happen.
As others point out, if they didn't really ship fast, they certainly would not have become profitable, and they would definitely not have captured the market to the extent they have.
But really, if the market was more distributed, and Cloudflare commanded 5% of the web as the biggest player, any single outage would have been limited in impact. So it's also about market behaviour: "nobody ever got fired for choosing IBM", as the saying went 40 years ago.
But does "formally verified code" really go in the same bag as "normalized database" and ensuring data integrity at the database level? The former is immensely complex and difficult; the other two are more like sound engineering principles?
Software people, especially those coming through Rust, are falling into the old trap of believing that if code is bug-free it is reliable. It isn't, because there is a world of faults outside the code, including but not limited to the developer's intentions.
This inverts everything because structuring to be fault tolerant, of the right things, changes what is a good idea almost entirely.
Rust generally forces you to acknowledge these faults. The problem is managing them in a sane way, which for Rust in many cases simply is failing loudly.
Compare that to many other languages, which prefer chugging along and hoping that no downstream corruption happens.
What I have seen work in the past is testing using a production backup as a final step prior to releasing, including applying database scripts. In this case, the permissions change would have been executed, the query would have run, and the failure would have been observed.
> It limits, radically, the percent of engineers you can hire (to those who understand this and are willing to work this way), and it slows deployment radically.
We could also invest in tooling to make this kind of thing easier. Unclear why humans need to hand-normalise the database schema - isn't this exactly the kind of thing compilers are good at?
I would just add that I've noticed organizations tend to calcify as they get bigger and older. Kind of like trees, they start out as flexible saplings, and over time develop hard trunks and branches. The rigidity gives them stability.
You're right that there's no way they could have gotten to where they are if they had prioritized data integrity and formal verification in all their practices. Now that they have so much market share, they might collapse under their own weight if their trunk isn't solid. Maybe investing in data integrity and strongly typed, functional programming that's formally verifiable is what will help them keep their market share.
Cultures are hard to change and I'm not suggesting an expectation for them to change beyond what is feasible or practical. I don't lead an engineering organization like it so I'm definitely armchairing here. I just see some of the logic of the argument that them adopting some of these methods would probably benefit everyone using their services.
When you're powering this large a fraction of the internet, is it even an option not to work like that? You'd think that with that kind of market cap, resource constraints should no longer be holding you back from doing things properly.
I work in formal verification at a FAANG. It is so wildly more expensive than traditional development that it is simply not feasible to apply it anywhere but absolutely the most critical paths, and even then, the properties asserted by formal verification are often quite a bit less powerful than necessary to truly guarantee something useful.
I want formal verification everywhere. I believe in provable correctness. I wish we could hire people capable of always writing software to that standard and maintaining those proofs alongside their work.
We really can’t, though. It’s a frustrating reality of being human — we know how to do it better, but nearly all of even the smartest engineers we can hire are not smart enough.
> we know how to do it better, but nearly all of even the smartest engineers we can hire are not smart enough.
This seems like a contradiction. If the smartest engineers you can hire are not smart enough to work within formal verification constraints then we in fact do not know how to do this.
If formal verification hinges on having perfect engineers then it’s useless because perfect engineers wouldn’t need formal verification.
I don’t understand why anyone should want this. Why should normal engineering efforts be held to the same standards as life-critical systems? Why would anyone expect that CloudFlare DDoS protection be built to the standards of avionics equipment?
Also if we’re being fair, avionics software is far narrower in scope than just “software in general”. And even with that Boeing managed to kill a bunch of people with shitty software.
The vast majority of Cloudflare's "customers" are paying 0 to 20 dollars a month, for virtually the same protection coverage and features as most of their 200 dollars/mo customers. That's not remotely in the realm of avionics price structure, be it software or hardware.
It is the aggregate they pay that counts here, not the individual payments.
A better comparison would be airline passengers paying for their tickets: they pay a few hundred bucks in the expectation that they will arrive at their destination.
Besides, it is not the customers that determine Cloudflare's business model, Cloudflare does. Note that their whole business is to prevent outages and that as soon as they become the cause of an outage they have invalidated their whole reason for existence. Of course you could then turn this into a statistical argument that as long as they prevent more outages than they cause they are a net benefit, but that's not what this discussion is about; it is first and foremost about the standard of development they are held up against.
Ericsson identified similar issues in their offering long ago and created a very capable solution and I'm wondering if that would not have been a better choice for this kind of project, even if it would have resulted in more resource consumption.
> as soon as they become the cause of an outage they have invalidated their whole reason for existence
This is a bar no engineering effort has ever met. “If you ever fail, even for a moment, there’s no reason for you to even exist.”
There have been 6 fatal passenger airplane crashes in the US this year alone. NASA only built 6 shuttles, and 2 of those were destroyed in flight, killing their crews. And these were life-preserving systems that failed.
Discussions around software engineering quality always seem to veer into spaces where we assign almost mythic properties to other engineering efforts in an attempt to paint software engineering as lazy or careless.
> Anyone in avionics software dev to give an opinion?
I've done some for fuel estimation of freighter jets (not quite avionics, but close enough to get a sense of the development processes), and the amount of rigor involved in that one project made me a better developer for the rest of my career. Was it slow? Yes, it was very slow. A couple of thousand lines of code, a multiple of that in tests, over a very long time compared to what it would normally take me.
But within the full envelope of possible inputs it performed exactly as advertised. The funny thing is that I'm not particularly proud of it, it was the process that kept things running even when my former games programmer mentality would have long ago said 'ship it'.
Some things you just need to do properly, or not at all.
I left out any commentary on `.unwrap()` from my original comment, but it’s an obvious example of something that should never have appeared in critical code.
Rust needs to get rid of .unwrap() and its kin. They're from pre-1.0 Rust, before many of the type system features and error handling syntax sugar were added.
There's no reason to use them as the language provides lots of safer alternatives. If you do want to trigger a panic, you can, but I'd also ask - why?
Alternatively, and perhaps even better, Rust needs a way to mark functions that can panic for any reason other than malloc failures. Any function that then calls a panicky function needs to be similarly marked. In doing this, we can statically be certain no such methods are called if we want to be rid of the behavior.
Perhaps something like:
    panic fn my_panicky_function() {
        None.unwrap(); // NB: `unwrap()` is also marked `panic` in stdlib
    }

    fn my_safe_function() {
        // with a certain compiler or Crates flag, this would fail to compile
        // as my_safe_function isn't annotated as `panic`
        my_panicky_function()
    }
The ideal future would be to have code that is 100% panic free.
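For reference, part of this can be approximated today with Clippy's restriction lints. A rough sketch (the `parse_limit` function is made up for illustration); it only covers your own crate, so it doesn't solve the transitive-dependency problem the proposal is aiming at:

    // Crate-level lints; enforced whenever `cargo clippy` runs (e.g. in CI).
    #![deny(clippy::unwrap_used)]
    #![deny(clippy::expect_used)]
    #![deny(clippy::panic)]

    fn parse_limit(raw: &str) -> u32 {
        // raw.parse().unwrap()     // would be rejected by the lints above
        raw.parse().unwrap_or(200)  // fall back instead of panicking
    }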
> There's no reason to use [panics] as the language provides lots of safer alternatives.
Dunno ... I think runtime assertions and the ability to crash a misbehaving program are a pretty important part of the toolset. If Rust required `Result`s to be wired up and down the entire call tree for the privilege of using a runtime assertion, I think it would be a lot less popular, and probably less safe in practice.
> Alternatively, and perhaps even better, Rust needs a way to mark functions that can panic for any reason other than malloc failures.
I 100% agree that a mechanism to prove that code can or cannot panic would be great, but why would malloc be special here? Folks who are serious about preventing panics will generally use `no-std` in order to prevent malloc in the first place.
> I’ve been seeing you blazing this trail since the incident and it feels short sighted and reductive.
Why is it inappropriate to be able to statically label the behavior?
Maybe I don't want my failure behavior dictated by a downstream dependency or distracted engineer.
The subject of how to fail is a big topic and is completely orthogonal to the topic of how can we know about this and shape our outcomes.
I would rather the policy be encoded with first class tools rather than engineering guidelines and runbooks. Let me have some additional control at what looks like to me not a great expense.
It doesn't feel "safe" to me to assume the engineer meant to do exactly this and all of the upstream systems accounted for it. I would rather the code explicitly declare this in a policy we can enforce, in an AST we can shallowly reason about.
How deep do you go? Being forced to label any function that allocates memory with ”panic”?
Right now, all the instances where the code can panic are labeled. Grep for unwrap, panic, expect, etc.
In all my years of professional Rust development I’ve never seen a potential panic pass code review without a discussion. Unless it was trivial like trying to build an invalid Regex from a static string.
I would guess at the individual team level they probably still behave like any other tech shop. When the end of the year comes the higher-ups still expect fancy features and accomplishments and saying "well, we spent months writing a page of TLA+ code" is not going to look as "flashy" as another team who delivered 20 new features. It would take someone from above to push and ask that other team who delivered 20 features, where is their TLA+ code verifying their correctness. But, how many people in the middle management chain would do that?
Thank you for putting this in such clear terms. It really is a Catch-22 problem for startups. Most of the time, you can't reach scale unless you cut some corners along the way, and when you reach scale, you benefit from NOT cutting those corners.
I'd not be surprised if the root of the issue was some engineer who didn't add a DB selector, because in other SQL engines a SELECT like that would select from the currently connected database rather than from all of them.
I’d be with you except that cloudflare prioritizes profit over doing a good job (layoffs, offshoring, etc). You don’t get to make excuses when you willingly reduced quality to keep your profits high.
Not to mention that perfectly normalizing a database always incurs join overhead that limits horizontal scalability. In fact, denormalization is required to achieve scale (with a trade-off).
I’m not sure how formal verification would’ve prevented this issue from happening. In my experience, it’s unusual to have to specify a database name in the query. How could formal verification have covered this outcome?
The recommendations don’t make sense saying that the query needed DISTINCT and LIMIT. Don’t forget that the incoming data was different (r0 and default did not return the same exact data, this is why the config files more than doubled in size), so using DISTINCT would have led to uncertain blending of data, producing neither result and hiding the double-database read altogether. Secondly, LIMIT only makes sense to use in conjunction with a failure circuit breaker (if LIMIT items is returned, fail the query). When does it make business-logic sense to LIMIT the query-in-question’s result? And do you think the authors would have known how to set the LIMIT to not exceed the configuration file consumers’ limitations?
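To make the circuit-breaker point concrete, a minimal sketch (the cap value and function are made up for illustration, not taken from Cloudflare's code): a limit is only safe if hitting it fails the pipeline instead of silently truncating the feature list.

    const FEATURE_CAP: usize = 200; // hypothetical consumer limit

    fn checked_features(rows: Vec<String>) -> Result<Vec<String>, String> {
        // Hitting the cap means the query returned more than expected,
        // so refuse to publish the config rather than truncate it.
        if rows.len() >= FEATURE_CAP {
            return Err(format!("feature list hit the cap ({} rows)", rows.len()));
        }
        Ok(rows)
    }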
The article says:
> “You can’t reliably catch that with more tests or rollouts or flags. You prevent it by construction—through analytical design.”
That’s the big design up front fallacy. Of course you can catch it reliably with more tests, and limit the damage with flags and rollouts. There’s zero guarantee that the analytical design would’ve caught this up front.
Why is being able to "capture the market" something we want to encourage? This leads to monopolies or oligopolies and makes possible various types of abuse that a free competitive market would normally correct.
If you're going to step into the role of managing a large percentage of public internet traffic, maybe you need to be held to a different standard and set of rules than a startup trying to get a foothold among dozens or hundreds of other competitors. Something more like a public utility than a private enterprise.
The three other replies you've gotten so far have given some generically applicable though still good answers, but I want to address something regarding Cloudflare specifically: a major part of their entire core goal and value proposition revolves around being able to defend their customers from continuously scaling, ever-larger hostile attacks. This isn't merely a case of "natural selection" or what a company/VCs might desire; rather, it's hard to see how, under the current (depressing, shitty) state of the Internet, it'd be possible to cheaply defend against terabit-plus class DDOS and the like without Cloudflare-level scale in turn. And "cheaply" is in fact critical too, because the whole point of resource exhaustion attacks is that they're purely economic: if it costs many times more to mitigate them than to launch and profit from them, then the attackers are going to win in the end. Ideally we'd be solving this collective action problem collectively, with standards amongst nations and ISPs to mitigate or eliminate botnets at the source, but we have to trundle along as best we can in the meantime, right? I'm not sure there is room for a large number of players in Cloudflare's role, and they've been a pretty dang decent one so far.
It doesn't matter what "we" "encourage". This is a natural selection process: all sorts of teams exist, and then the market decides to be captured by certain ones. We do not prescribe which attributes capture the market; we discover them.
> I base my paragraph on their choice of abandoning PostgreSQL and adopting ClickHouse (Bocharov 2018). The whole post is a great overview of trying to process data fast, without a single line on how to guarantee its logical correctness/consistency in the face of changes.
I'm completely mystified how the author concludes that the switch from PostgreSQL to ClickHouse shows the root of this problem.
1. If the point is that PostgreSQL is somehow less prone to error, it's not in this case. You can make the same mistake if you leave off the table_schema in information_schema.columns queries.
2. If the point is that Cloudflare should have somehow discovered this error through normalization and/or formal methods, perhaps he could demonstrate exactly how this would have (a) worked, (b) been less costly than finding and fixing the query through a better review process or testing, and (c) avoided generating other errors as a side effect.
I'm particularly mystified how lack of normalization is at fault. ClickHouse system.columns is normalized. And if you normalized the query result to remove duplicates that would just result in other kinds of bugs as in 2c above.
This sort of Monday morning quarterbacking is pointless and only serves as a way for random bloggers to try to grab credit without actually doing or creating any value.
> I disagree. I learnt good stuff from this article and it’s enough.
That's perfectly fine. It's also beside the point, though. You can learn without reading random people online cynically shit-talking others as a self-promotion strategy. This is junior dev energy manifesting a junior-level understanding of the whole problem domain.
There's not a lot to learn from claims that boil down to "don't have bugs".
Not to single you out in particular, but I see this sentiment among programmers a lot and to me it's akin to a structural engineer saying "I laughed out loud when he said they should analyze the forces in the bridge".
This article actually explains how this bug in particular could have been avoided. Sure you may not consider his approach realistic, but it's not at all saying "don't have bugs". In fact, not having formal verification or similar tooling in place, would be more like saying "just don't write buggy code".
Not commenting on the quality of this post but occasional writing that responds to an event provides a good opportunity to share thoughts that wouldn’t otherwise reach an audience. If you post advice without a concrete scenario you’re responding to, it’s both less tangible for your audience and less likely to find an audience when it’s easier to shrug off (or put off).
I'm using this incident to draw attention to Rust's panic behavior.
Rust could use additional language features to help us write mostly panic-free* code and statically catch even transitive dependencies that might subject us to unnecessary panics.
We've been talking about it on our team and to other Rust folks, and I think it's worth building a proposal around. Rust should have a way to statically guarantee this never happens. Opt-in at first, but eventually the default.
It's already in the box... there's a bunch of options from unwrap_or, etc... to actually checking the error result and dealing with it cleanly... that's not what happened.
Not to mention the possibility of just bumping up through Result<> chaining with an app specific error model. The author chose neither... likely because they want the app to crash/reload from an external service. This is often the best approach to an indeterminate or unusable state/configuration.
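A minimal sketch of those two alternatives, with a made-up `Config` type and error enum rather than the actual Cloudflare code: bubble the error up through an app-specific error model, and let the caller decide the policy (here, keep the last known-good config).

    #[derive(Debug)]
    enum ConfigError {
        TooManyFeatures(usize),
    }

    struct Config {
        features: Vec<String>,
    }

    const MAX_FEATURES: usize = 200; // illustrative limit

    // Bubble the problem up with an app-specific error...
    fn load_config(features: Vec<String>) -> Result<Config, ConfigError> {
        if features.len() > MAX_FEATURES {
            return Err(ConfigError::TooManyFeatures(features.len()));
        }
        Ok(Config { features })
    }

    // ...so the caller decides: fall back to the current config instead of panicking.
    fn refresh(current: Config, incoming: Vec<String>) -> Config {
        load_config(incoming).unwrap_or(current)
    }

Whether falling back or crash-and-reload is right is exactly the policy question raised above; the point is that the choice is explicit rather than a bare `unwrap()`.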
> This is often the best approach to an indeterminate or unusable state/configuration.
The engineers had more semantic tools at their disposal for this than a bare `unwrap()`.
This was a systems failure. A better set of tools in Rust would have helped mitigate some of the blow.
`unwrap()` is from pre-1.0 Rust, before many of the type system-enabled error safety features existed. And certainly before many of the idiomatic syntactic sugars were put into place.
I posted in another thread that Rust should grow annotation features to allow us to statically rid or minimize our codebase of panic behavior. Outside of malloc failures, we should be able to constrain or rid large classes of them with something like this:
    panic fn my_panicky_function() {
        None.unwrap(); // NB: `unwrap()` is also marked `panic` in stdlib
    }

    fn my_safe_function() {
        // with a certain compiler or Crates flag, this would fail to compile
        // as my_safe_function isn't annotated as `panic`
        my_panicky_function()
    }
Obviously just an idea, but something like this would be nice. We should be able to do more than just linting, and we should have tools that guarantee transitive dependencies can't blow off our feet with panic shotguns.
In any case, until something is done, this is not the last time we'll hear unwrap() horror stories.
I agree it should not have happened, but I don’t agree that the database schema is the core problem. The “logical single point of failure” here was created by the rapid, global deployment process. If you don’t want to take down all of prod, you can’t update all of prod at the same time. Gradual deployments are a more reliable defense against bugs than careful programming.
One of the things I find fascinating about this is that we don't blink twice about the idea that an update to a "hot" cache entry that's "just data" should propagate rapidly across caches... but we do have change management and gradual deployments for code updates and meaningful configuration changes.
Machine learning feature updates live somewhere in the middle. Large amounts of data, a need for unsupervised deployment that can react in seconds, somewhat opaque. But incredibly impactful if something bad rolls out.
I do agree with the OP that the remediation steps in https://blog.cloudflare.com/18-november-2025-outage/#remedia... seem undercooked. But I'd focus on something entirely different than trying to verify the creation of configuration files. There should be real attention to: "how can we take blue/green approaches to allowing our system to revert to old ML feature data and other autogenerated local caches, self-healing the same way we would when rolling out code updates?"
Of course, this has some risk in Cloudflare's context, because attackers may very well be overjoyed by a slower rollout of ML features that are used to detect their DDoS attacks (or a rollout that they can trigger to rollback by crafting DDoS attacks).
But I very much hope they find a happy medium. This won't be the last time that a behavior-modifying configuration file gets corrupted. And formal verification, as espoused by the OP, doesn't help if the problem is due to a bad business assumption, encoded in a verified way.
>Gradual deployments are a more reliable defense against bugs than careful programming
The challenge, as I understand it, is that the feature in question had an explicit requirement of fast, wide deployment because of the need to react in real time to changing external attacker behaviors.
Yeah, I don’t know how fast “fast” needs to be in this system; but my understanding is this particular failure would have been seen immediately on the first replica. The progression could still be aggressive after verifying the first wave.
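As a sketch of what that could look like (hypothetical helpers, not Cloudflare's actual pipeline): deploy a small first wave, gate on health, then widen aggressively.

    fn deploy_to(node: &str) {
        println!("deploying to {node}"); // stand-in for the real push
    }

    fn healthy(nodes: &[&str]) -> bool {
        // Stand-in for real checks: crash loops, error rates, 5xx counts...
        !nodes.is_empty()
    }

    fn staged_rollout(nodes: &[&str]) -> Result<(), String> {
        let waves = [0.01, 0.10, 0.50, 1.0]; // verify 1% before going wide
        let mut done = 0;
        for frac in waves {
            let target = ((nodes.len() as f64) * frac).ceil() as usize;
            for node in &nodes[done..target] {
                deploy_to(node);
            }
            done = target;
            if !healthy(&nodes[..done]) {
                return Err(format!("rollout halted after {done} nodes"));
            }
        }
        Ok(())
    }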
* The unwrap() in production code should have never passed code review. Damn, it should have been flagged by a linter.
* The deployment should have followed the blue/green pattern, limiting the blast radius of a bad change to a subset of nodes.
* In general, a company so much at the foundational level of internet connectivity should not follow the "move fast, break things" pattern. They did not have an overwhelming reason to hurry and take risks. This has burned a lot of trust, no matter the nature of the actual bug.
Unless you work at Cloudflare it seems very unlikely that you have enough information about systems and tradeoffs there to make these flat assertions about what "should have" happened. Systems can do worse things than crashing in response to unexpected states. Blue/green deployment isn't always possible (eg due to constrained compute resources) or practical (perhaps requiring greatly increased complexity), and is by no means the only approach to reducing deploy risk. We don't know that any of the related code was shipped with a "move fast, break things" mindset; the most careful developers still write bugs.
Actually learning from incidents and making systems more reliable requires curiosity and a willingness to start with questions rather than mechanically applying patterns. This is standard systems-safety stuff. The sort of false confidence involved in making prescriptions from afar suggests a mindset I don't want anywhere near the operation of anything critical.
unwrap() and the family of methods like it are a Rust anti-pattern from the early days of Rust. It dates back to before many of the modern error-handling and safety-conscious features of the language and type system.
Rust is being pulled in so many different directions by new users, directions the language perhaps never originally intended. Some engineers will be fine with panicky behavior, but a lot of others want to be able to statically guarantee that most panics (outside of perhaps memory allocation failures) cannot occur.
We need more than just a linter on this. A new language feature that poisons, marks, or annotates methods that can potentially panic (for reasons other than allocation) would be amazing. If you then call a method that can panic, you'll have to mark your own method as potentially panicky. The ideal future would be that in time, as more standard library and 3rd party library code adopts this, we can then statically assert our code cannot possibly panic.
As it stands, I'm pretty mortified that some transitive dependency might use unwrap() deep in its internals.
Cloudflare doesn't seem to have called it a "Root Cause Analysis" and, in fact, the term "root cause" doesn't appear to occur in Prince's report. I bring this up because there's a school of thought that says "root cause analysis" is counterproductive: complex systems are always balanced on the precipice of multicausal failure.
When I was at AWS, when we did postmortems on incidents we called it "root cause analysis", but it was understood by everyone that most incidents are multicausal and the actual analyses always ended up being fishbone diagrams.
Probably there are some teams which don't do this and really do treat RCA as trying to find a sole root cause, but I think a lot of "getting mad at RCA" is bikeshedding the terminology, and nothing to do with the actual practice.
Right, I'm not a semantic zealot on this point, but the post we're commenting on really does suggest that the Cloudflare incident had a root cause in basic database management failures, which is the substantive issue the root-cause-haters have with the term.
True, and I agree, but from their report they do seem to be doing Root Cause Analysis (RCA) even if they don't call it that.
RCA is a really bad way of investigating a failure. Simply put: if you show me your RCA, I know exactly where you couldn't be bothered to look any further.
I think most software engineers using RCA confuse the "cause" ("Why did this happen") with the solution ("We have changed this line of code and it's fixed"). These are quite different problem domains.
Using RCA to determine "Why did this happen" is only useful for explaining the last stages of an accident. It focuses on cause->effect relationships and tells a relatively simple story, but one that is easy to communicate - Hi there, managers and media! But RCA only encourages simple countermeasures, which will probably be ineffective and easily outrun by the complexity of real systems.
However, one thing RCA is really good at is allocating blame. If your organisation is using RCA then, whatever you pretend, your organisation has a culture of blame.
With a blame culture (rather than a reporting culture) your organisation is much more likely to fail again. You will lack operational resilience.
Of course it shouldn't have happened. But if you run infrastructure as complex as this, on the scale that they do, and with the agility that they need, then it was bound to happen eventually. No matter how good you are, there is always some extremely unlikely chain of events that will lead to a catastrophic outage. Given enough time, that chain will eventually happen.
While this blog post is pretty useless, it's a hell of a lot better than the LinkedIn posts about the outage... my god, I wish the "Not interested" button worked.
I'd be wanting to have some sort of a "dry run" of the produced artifact by the Rust code consuming it, or a deploy to some sort of a test environment, before letting it roll out to production. I've been surprised that there's been no mention of that sort of thing in the Cloudflare after-action or here.
> A central database query didn’t have the right constraints to express business rules. Not only it missed the database name, but it clearly needs a distinct and a limit, since these seem to be crucial business rules.
In a database, you wouldn't solve this with a distinct or a limit? You would make the schema guarantee uniqueness?
And yes, that wouldn't deal with cross database queries. But the solution here is just the filter by db name, the rest is table design.
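For illustration only (the table name below is an assumption, not Cloudflare's actual query), the difference is roughly this:

    // With the broader permissions, this sees the table in both the default
    // and r0 databases, so the result more than doubles.
    const UNDERCONSTRAINED: &str =
        "SELECT name, type FROM system.columns WHERE table = 'http_requests_features'";

    // Pinning the database restores the uniqueness the consumer assumed.
    const CONSTRAINED: &str =
        "SELECT name, type FROM system.columns WHERE table = 'http_requests_features' AND database = 'default'";

As said above, uniqueness is better guaranteed by the schema and the query shape than bolted on afterwards with DISTINCT.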
Adding distinct or group by to a query is not some advanced technique, as comments are suggesting. It does not slow down development one bit: if you expect a distinct result you put an explicit distinct in the query; it's not a "safety measure for insulin pumps". Scratching my head at what I've missed here, please enlighten me.
It did happen, and cloudflare should learn from it, but not just the technical reasons.
Instead of focusing on the technical reasons why, they should answer how such a change bubbled out to cause such a massive impact instead.
Why: Proxy fails requests
Why: Handlers crashed because of OOM
Why: Clickhouse returns too much data
Why: A change was introduced causing double the amount of data
Why: A central change was rolled out immediately to all clusters (single point of failure)
Why: There is either no standard operating procedure (gate) for releasing changes to the hot path of Cloudflare's network infra, or there are exemptions to it
While the ClickHouse change is important, I personally think it is crucial that Cloudflare tackles the processes, and possibly gates/controls rollouts for hot-path systems, no matter what kind of change it is; when you're at that scale it should be possible. But that is probably enough co-driving. To me it seems like a process issue more than a technical one.
Very quick rollout is crucial for this kind of service. On top of what you wrote, institutionalizing rollback by default if something catastrophically breaks should be the norm.
Been there in those calls, begging the people in charge (who perhaps shouldn't have been) "eh, maybe we should attempt a rollback to the last known good state?". But investigating further before making any change always seems to be the preferred action for these people. Can't be faulted for being cautious and doing things properly, right?
If I recall correctly it took CF 2 hours to roll back the broken changes.
So if I were in charge of Cloudflare (4-5k employees) I'd look both at the processes and at the people in charge.
I think the author is trying to apply a preconceived cause on to the cloudflare outage, but there’s not a fit.
E.g., they should try to work through how their own suggested fix would actually ensure the problem couldn’t happen. I don’t believe it would… lack of nullable fields and normalization typically simplify relational logic, but hardly prevent logical errors. Formal verification can prove your code satisfies a certain formal specification, but doesn’t prove your specification solves your business problem (or makes sense at all, in fact).
> but it clearly needs a distinct and a limit, since these seem to be crucial business rules.
Isn't that just... wrong? Throwing in an arbitrary limit (vs. maybe having some alert when the table is too long) would just silently truncate the list.
Anybody can be a backseat engineer by throwing out the industry's best practices like they were gospel, but you have to look at the entire system, not just the database part.
They are not going as far as to blame PostgreSQL, but their switch to ClickHouse seems to suggest that they see PostgreSQL as part of the equation. Would ClickHouse really prevent this type of error from occurring? PostgreSQL already has so many options for setting up solid constraints for data entry. Or do they not have anyone on the team anymore (or never had) who could set up a robust PostgreSQL database? Or are they just piggybacking on the latest trend?
Sure, a different database schema may have helped, but there are going to be bugs either way. In my view a more productive approach is to think about how to limit the blast radius when things inevitably do go wrong.
> FAANG-style companies are unlikely to adopt formal methods or relational rigor wholesale. But for their most critical systems, they should. It’s the only way to make failures like this impossible by design, rather than just less likely.
That relational rigor imposes what one chooses to be true; it isn’t a universal truth.
The frame problem and the qualification problem apply here.
The open domain frame problem == HALT.
When you can fit a problem into the relational model, things are nice, but not everything can be reduced to a trivial property.
That is why Codd had to add nulls etc.
You can choose to decide that the queen is rich OR pigs can fly; but a poor queen doesn’t result in flying pigs.
Choice over finite sets == finite indexes over sets == PEM
If you can restrict your problems to where the Entscheidungsproblem is solvable you can gain many benefits
No, their error was that they shouldn't be querying system tables to perform field discovery; the same method in PostgreSQL (pg_class or whatever it's called) would have had the same result. The simple alternative is to use "describe table <table_name>".
On top of that, they shouldn't be writing ad-hoc code that queries system tables mixed in with business logic; that kind of task belongs in a separate library (crappy application design).
Also, this should never have passed code review in the first place, but let's assume it did because errors happen, and this kind of atrocious code and flaky design is not uncommon.
As an example, they could be reading this data from CSV files *and* have made the same mistake. Conflating this with "database design errors" is just stupid - this is not a schema design error, this is a programmer error.
Yes, pretty basic looking mistakes that, from the outside, make many wonder how this got through. Though analyzing the post-mortem makes me think of the MV Dali crashing into the Francis Scott Key bridge in Baltimore: the whole thing started with a single loose wire which set off a cascading failure. CF's situation was similar in a few ways though finding a bad query (and .unwrap() in production code rather than test code) should have been a lot easier to spot.
Have any of the post-mortems addressed if any of the code that led to CloudFlare's outage was generated by AI?
> And CF doesn't have the "...or people will die" safety criticality.
I disagree with that. Just because you can't point to people falling off a bridge into the water doesn't mean that outages of the web at this scale will not lead to fatalities.
One of the things I recommend most engineers do when they write a bug is to first take a look and see if the bug is required. Very often, I see that the codebase doesn't need the bug added. Then I can just rewrite that code without the bug.
"This massive, accomplished engineering team whose software operates at a scale nearly no one else operates at missed this basic thing" is a hell of a take.
"If they had a perfectly normalized database, no NULLing and formally verified code, this bug would not have happened."
That may be. What's not specified there is the immense, immense cost of driving a dev org on those terms. It limits, radically, the percent of engineers you can hire (to those who understand this and are willing to work this way), and it slows deployment radically.
Cloudflare may well need to transition to this sort of engineering culture, but there is no doubt that they would not be in the position they are in if they started with this culture -- they would have been too slow to capture the market.
I think critiques that have actionable plans for real dev teams are likely to be more useful than what, to me, reads as a sort of complaint from an ivory tower. Culture matters, shipping speed matters, quality matters, team DNA matters. That's what makes this stuff hard (and interesting!)
That's entirely right. Products have to transition from fast-moving exploratory products to boring infrastructure. We have different goals and expectations for an ecommerce web app vs. a database, or a database vs. the software controlling an insulin pump.
Having said that, at this point, Cloudflare's core DDOS-protection proxy should now be built more like an insulin pump than like a web app. This thing needs to never go down worldwide, much more than it needs to ship a new feature fast.
Precisely. This is key infrastructure we're talking about, not some kind of webshop.
Yeah but the anti-DDOS feature needs to react to new methods all the time, it's not a static thing you build once and it works forever.
An insulin pump is very different. Your human body, insulin, and physics isn't changing any time soon.
> This thing needs to never go down worldwide
Quantity introduces a quality all of its own in terms of maintenance.
Ok, let's start off with holding them to the same standards as avionics software development. The formal verification can wait.
Are Cloudflare's customers willing to pay avionics software level prices?
Given that Cloudflare's market cap is 1/2 of Boeing's and they are not making a physical product I would say: Clearly, yes.
Anyone in avionics software dev to give an opinion?
I would presume there's the same issue as parent said:
And it's so easy to avoid, as well.
Put that in your CI pipeline, and voila. Global crash averted.
I'd say the equivalent of Erlang's supervisor trees is what is needed but once you go that route you might as well use Erlang.
Or just deploy containers with an orchestrator restarting them when failing?
It is not like an Erlang service would be able to make progress with an invalid config either.
I’ve been seeing you blazing this trail since the incident and it feels short sighted and reductive.
Rust is built on forcing the developer to acknowledge the complexity of reality. Unwrap acknowledges said complexity with a perfectly valid decision.
There are a few warts from early days like indexing and the ”as” operator where the easy path is doing the wrong thing.
But unwraps or expects are where Rust shines. Throwing up your hands is a perfectly reasonable response.
With your approach, what should Cloudflare have done?
Return an error, log it and return a 500 result due to invalid config? They could fail open, but then that opens another enormous can of worms.
There simply are no good options.
The issue rests upstream where deployments and effects between disparate services needs to be mapped and managed.
Which is a truly hard problem, rather than blaming the final piece throwing up its hand when given an invalid config.
Malloc is fair game.
Unwrap, slice access, etc. are not.
And now the endless bikeshedding has begun.
Thanks for making abundantly clear how such a feature wouldn’t solve a thing.
I would argue the largest CDN provider in the world is a critical path.
Why is this inherently slower?
There are, for example, languages or language features that work entirely on not allowing these things.
I ask because I feel like I'm missing something.
The article says: > “You can’t reliably catch that with more tests or rollouts or flags. You prevent it by construction—through analytical design.”
That’s the big design up front fallacy. Of course you can catch it reliably with more tests, and limit the damage with flags and rollouts. There’s zero guarantee that the analytical design would’ve caught this up front.
Why is being able to "capture the market" something we want to encourage? This leads to monopolies or oligopolies and makes possible various types of abuse that a free competitive market would normally correct.
If you're going to step into the role of managing a large percentage of public internet traffic, maybe you need to be held to a different standard and set of rules than a startup trying to get a foothold among dozens or hundreds of other competitors. Something more like a public utility than a private enterprise.
The three other replies you've gotten so far have given some generically applicable though still good answers, but I want to address something regarding Cloudflare specifically: a major part of their entire core goal and value proposition revolves around being able to defend their customers from continuously scaling, ever-larger hostile attacks. This isn't merely a case of "natural selection" or what a company/VCs might desire; it's hard to see how, under the current (depressing, shitty) state of the Internet, it'd be possible to cheaply defend against terabit-plus-class DDOS and the like without Cloudflare-level scale in turn. And "cheaply" is in fact critical, because the whole point of resource-exhaustion attacks is that they're purely economic: if it costs many times more to mitigate them than to launch and profit from them, then the attackers are going to win in the end. Ideally we'd be solving this collective-action problem collectively, with standards among nations and ISPs to mitigate or eliminate botnets at the source, but we have to trundle along as best we can in the meantime, right? I'm not sure there is room for a large number of players in Cloudflare's role, and they've been a pretty dang decent one so far.
It doesn't matter what "we" "encourage". This is a natural selection process: all sorts of teams exist, and then the market decides to be captured by certain ones. We do not prescribe which attributes capture the market; we discover them.
I assume wanting a company to succeed is fundamental to Hacker News. The world is better off with CF being around, for sure.
You would have to completely flip how funding works. As of now most VCs have abysmal returns, so heightening the bar is the last thing on their minds.
> I base my paragraph on their choice of abandoning PostgreSQL and adopting ClickHouse (Bocharov 2018). The whole post is a great overview of trying to process data fast, without a single line on how to guarantee its logical correctness/consistency in the face of changes.
I'm completely mystified how the author concludes that the switch from PostgreSQL to ClickHouse shows the root of this problem.
1. If the point is that PostgreSQL is somehow less prone to error, it's not in this case. You can make the same mistake if you leave off the table_schema in information_schema.columns queries.
2. If the point is that Cloudflare should have somehow discovered this error through normalization and/or formal methods, perhaps he could demonstrate exactly how this would have (a) worked, (b) been less costly than finding and fixing the query through a better review process or testing, and (c) avoided generating other errors as a side effect.
I'm particularly mystified how lack of normalization is at fault. ClickHouse system.columns is normalized. And if you normalized the query result to remove duplicates that would just result in other kinds of bugs as in 2c above.
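To make the comparison concrete, here's a sketch of the same class of mistake in both systems; the query shape is a reconstruction based on the public post-mortem, and the surrounding constants are illustrative only:

```rust
// Illustrative only: not Cloudflare's actual code, just the shape of the mistake.

// ClickHouse: without a database filter, system.columns returns one row per column
// per database the user can see (here both `default` and `r0`), so the result
// silently doubles the moment permissions widen.
const BUGGY_CLICKHOUSE: &str = "
    SELECT name, type
    FROM system.columns
    WHERE table = 'http_requests_features'
    ORDER BY name";

const FIXED_CLICKHOUSE: &str = "
    SELECT name, type
    FROM system.columns
    WHERE database = 'default'
      AND table = 'http_requests_features'
    ORDER BY name";

// PostgreSQL: the equivalent omission (leaving off table_schema) fails the same
// way, because information_schema.columns spans every schema you can read.
const BUGGY_POSTGRES: &str = "
    SELECT column_name, data_type
    FROM information_schema.columns
    WHERE table_name = 'http_requests_features'";
```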
This sort of Monday morning quarterbacking is pointless and only serves as a way for random bloggers to try to grab credit without actually doing or creating any value.
I disagree. I learnt good stuff from this article and it’s enough.
> I disagree. I learnt good stuff from this article and it’s enough.
That's perfectly fine. It's also beside the point though. You can learn without reading random people online cynically shit talking others as a self promotion strategy. This is junior dev energy manifesting junior level understanding of the whole problem domain.
There's not a lot to learn from claims that boil down to "don't have bugs".
I laughed out loud when he said Cloudflare should have formally verified its systems.
Not to single you out in particular, but I see this sentiment among programmers a lot and to me it's akin to a structural engineer saying "I laughed out loud when he said they should analyze the forces in the bridge".
You can't formally verify anything that uses consensus, which is the backbone of the entire web. It's a complete non-starter.
It's very similar to LinkedIn posts, where everybody seems to know better than the people actually running the platforms.
This article actually explains how this bug in particular could have been avoided. Sure, you may not consider his approach realistic, but it's not at all saying "don't have bugs". In fact, not having formal verification or similar tooling in place would be more like saying "just don't write buggy code".
> You can learn without reading random people online
Somebody has to write something in the first place for one to learn from it, even if the writing is disagreeable.
You failed to cite the comment you were replying to.
The comment is:
> You can learn without reading random people online cynically shit talking others as a self promotion strategy.
Not commenting on the quality of this post but occasional writing that responds to an event provides a good opportunity to share thoughts that wouldn’t otherwise reach an audience. If you post advice without a concrete scenario you’re responding to, it’s both less tangible for your audience and less likely to find an audience when it’s easier to shrug off (or put off).
What did you learn? The suggestions in the post seem pretty shallow and non-actionable.
Backdooring the internet is certainly a productive venture!
Like your comment? j/k :)
I'm using this incident to draw attention to Rust's panic behavior.
Rust could use additional language features to help us write mostly panic-free* code and statically catch even transitive dependencies that might subject us to unnecessary panics.
We've been talking about it on our team and to other Rust folks, and I think it's worth building a proposal around. Rust should have a way to statically guarantee this never happens. Opt-in at first, but eventually the default.
* with the exception of malloc failures, etc.
It's already in the box... there's a bunch of options from unwrap_or, etc... to actually checking the error result and dealing with it cleanly... that's not what happened.
Not to mention the possibility of just bumping up through Result<> chaining with an app specific error model. The author chose neither... likely because they want the app to crash/reload from an external service. This is often the best approach to an indeterminate or unusable state/configuration.
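A minimal sketch of the options described above, under illustrative names (this is not Cloudflare's code; `ConfigError`, `load_features`, and the limit are made up for the example):

```rust
use std::num::ParseIntError;

// An app-specific error model, as described above (names are illustrative).
#[derive(Debug)]
enum ConfigError {
    TooManyFeatures { got: usize, max: usize },
    BadEntry(ParseIntError),
}

impl From<ParseIntError> for ConfigError {
    fn from(e: ParseIntError) -> Self {
        ConfigError::BadEntry(e)
    }
}

const MAX_FEATURES: usize = 200; // illustrative

// Option 1: bubble errors up through Result with `?` instead of panicking.
fn load_features(raw: &[&str]) -> Result<Vec<u32>, ConfigError> {
    if raw.len() > MAX_FEATURES {
        return Err(ConfigError::TooManyFeatures { got: raw.len(), max: MAX_FEATURES });
    }
    let mut features = Vec::with_capacity(raw.len());
    for entry in raw {
        features.push(entry.parse::<u32>()?); // `?` converts ParseIntError via From
    }
    Ok(features)
}

fn main() {
    let raw = ["1", "2", "oops"];

    // Option 2: fall back to a known-good value instead of crashing the process.
    let features = load_features(&raw).unwrap_or_else(|e| {
        eprintln!("bad feature data ({e:?}); keeping last known-good config");
        Vec::new()
    });
    println!("{features:?}");

    // The pattern under discussion: this line would abort the whole process instead.
    // let features = load_features(&raw).unwrap();
}
```

Whether you propagate, substitute a default, or deliberately crash is exactly the policy decision being debated in this thread.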
> This is often the best approach to an indeterminate or unusable state/configuration.
The engineers had more semantic tools at their disposal for this than a bare `unwrap()`.
This was a systems failure. A better set of tools in Rust would have helped mitigate some of the blow.
`unwrap()` is from pre-1.0 Rust, before many of the type system-enabled error safety features existed. And certainly before many of the idiomatic syntactic sugars were put into place.
I posted in another thread that Rust should grow annotation features to allow us to statically rid or minimize our codebase of panic behavior. Outside of malloc failures, we should be able to constrain or rid large classes of them with something like this:
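A rough sketch of the idea follows; `#[no_panic]` here is hypothetical syntax, not an existing Rust attribute:

```rust
// Hypothetical syntax only: `#[no_panic]` is NOT an existing Rust attribute.
// The idea: the compiler rejects any code reachable from a `#[no_panic]` fn
// that can panic for reasons other than allocation failure.

#[no_panic]
fn parse_feature_count(raw: &str) -> Result<usize, std::num::ParseIntError> {
    raw.trim().parse() // fine: returns an error instead of panicking
}

#[no_panic]
fn first_feature(features: &[u32]) -> Option<u32> {
    features.first().copied() // fine: `.first()` instead of `features[0]`
}

#[no_panic]
fn broken_example(features: &[u32]) -> u32 {
    features[0] // would be a compile error: indexing may panic
}
```

(Something in this spirit exists today in crate form, the `no-panic` crate, which fails the build if a marked function can panic, though via a linker trick rather than a language-level guarantee.)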
Obviously just an idea, but something like this would be nice. We should be able to do more than just linting, and we should have tools that guarantee transitive dependencies can't blow off our feet with panic shotguns. In any case, until something is done, this is not the last time we'll hear unwrap() horror stories.
What you're suggesting is perfectly reasonable, I wouldn't object to labeling methods that can panic via bare unwrap...
I'm just saying that having a program immediately exit (via panic or not) could very well be the appropriate behavior.
I agree it should not have happened, but I don’t agree that the database schema is the core problem. The “logical single point of failure” here was created by the rapid, global deployment process. If you don’t want to take down all of prod, you can’t update all of prod at the same time. Gradual deployments are a more reliable defense against bugs than careful programming.
One of the things I find fascinating about this is that we don't blink twice about the idea that an update to a "hot" cache entry that's "just data" should propagate rapidly across caches... but we do have change management and gradual deployments for code updates and meaningful configuration changes.
Machine learning feature updates live somewhere in the middle. Large amounts of data, a need for unsupervised deployment that can react in seconds, somewhat opaque. But incredibly impactful if something bad rolls out.
I do agree with the OP that the remediation steps in https://blog.cloudflare.com/18-november-2025-outage/#remedia... seem undercooked. But I'd focus on something entirely different than trying to verify the creation of configuration files. There should be real attention to: "how can we take blue/green approaches to allowing our system to revert to old ML feature data and other autogenerated local caches, self-healing the same way we would when rolling out code updates?"
Of course, this has some risk in Cloudflare's context, because attackers may very well be overjoyed by a slower rollout of ML features that are used to detect their DDoS attacks (or a rollout that they can trigger to rollback by crafting DDoS attacks).
But I very much hope they find a happy medium. This won't be the last time that a behavior-modifying configuration file gets corrupted. And formal verification, as espoused by the OP, doesn't help if the problem is due to a bad business assumption, encoded in a verified way.
>Gradual deployments are a more reliable defense against bugs than careful programming
The challenge, as I understand it, is that the feature in question had an explicit requirement of fast, wide deployment because of the need to react in real time to changing external attacker behaviors.
Yeah, I don’t know how fast “fast” needs to be in this system; but my understanding is this particular failure would have been seen immediately on the first replica. The progression could still be aggressive after verifying the first wave.
> Gradual deployments are a more reliable defense against bugs than careful programming.
Wasn't this one of the key takeaways from the crowdstrike outage?
* The unwrap() in production code should never have passed code review. Damn, it should have been flagged by a linter (see the lint sketch after this list).
* The deployment should have followed the blue/green pattern, limiting the blast radius of a bad change to a subset of nodes.
* In general, a company so much at the foundational level of internet connectivity should not follow the "move fast, break things" pattern. They did not have an overwhelming reason to hurry and take risks. This has burned a lot of trust, no matter the nature of the actual bug.
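For what it's worth, a lint of this kind already exists in Clippy; a minimal sketch of enabling it crate-wide (the lint names are real Clippy lints, the function is illustrative):

```rust
// Crate-root attributes: turn the panicking shortcuts into hard `cargo clippy` errors.
#![deny(clippy::unwrap_used, clippy::expect_used, clippy::panic, clippy::indexing_slicing)]

use std::collections::HashMap;

fn lookup(features: &HashMap<String, u32>, key: &str) -> Option<u32> {
    // `features[key]` or `features.get(key).unwrap()` would now fail the lint;
    // the caller has to handle the missing-key case explicitly.
    features.get(key).copied()
}
```

These lints only cover your own crate, though; they won't catch an unwrap buried in a dependency.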
Unless you work at Cloudflare it seems very unlikely that you have enough information about systems and tradeoffs there to make these flat assertions about what "should have" happened. Systems can do worse things than crashing in response to unexpected states. Blue/green deployment isn't always possible (eg due to constrained compute resources) or practical (perhaps requiring greatly increased complexity), and is by no means the only approach to reducing deploy risk. We don't know that any of the related code was shipped with a "move fast, break things" mindset; the most careful developers still write bugs.
Actually learning from incidents and making systems more reliable requires curiosity and a willingness to start with questions rather than mechanically applying patterns. This is standard systems-safety stuff. The sort of false confidence involved in making prescriptions from afar suggests a mindset I don't want anywhere near the operation of anything critical.
I wish the burned trust would actually show up in their financial reports. Otherwise it's just "we don't like it, but we're going to use it anyway".
The scale of the outage was so big and global, that the biggest failure was indeed the blast radius.
> the blue/green pattern
?
This specific terminology was new to me, too: https://en.wikipedia.org/wiki/Blue%E2%80%93green_deployment
unwrap() and the family of methods like it are a Rust anti-pattern from the early days of Rust. It dates back to before many of the modern error-handling and safety-conscious features of the language and type system.
Rust is being pulled in so many different directions by new users, directions the language perhaps never originally intended. Some engineers will be fine with panicky behavior, but a lot of others want to be able to statically guarantee that most panics (outside of perhaps memory-allocation failures) cannot occur.
We need more than just a linter on this. A new language feature that poisons, marks, or annotates methods that can potentially panic (for reasons other than allocation) would be amazing. If you then call a method that can panic, you'll have to mark your own method as potentially panicky. The ideal future would be that in time, as more standard library and 3rd party library code adopts this, we can then statically assert our code cannot possibly panic.
As it stands, I'm pretty mortified that some transitive dependency might use unwrap() deep in its internals.
Cloudflare doesn't seem to have called it a "Root Cause Analysis" and, in fact, the term "root cause" doesn't appear to occur in Prince's report. I bring this up because there's a school of thought that says "root cause analysis" is counterproductive: complex systems are always balanced on the precipice of multicausal failure.
When I was at AWS, when we did postmortems on incidents we called it "root cause analysis", but it was understood by everyone that most incidents are multicausal and the actual analyses always ended up being fishbone diagrams.
Probably there are some teams which don't do this and really do treat RCA as trying to find a sole root cause, but I think a lot of "getting mad at RCA" is bikeshedding the terminology, and nothing to do with the actual practice.
Right, I'm not a semantic zealot on this point, but the post we're commenting on really does suggest that the Cloudflare incident had a root cause in basic database management failures, which is the substantive issue the root-cause-haters have with the term.
> to find a sole root cause
"Six billion years ago the dust around the young Sun coalesced into planets"
"Workaround: If we wait long enough, the earth will eventually be consumed by the sun."
https://xkcd.com/1822/
True, and I agree, but from their report they do seem to be doing Root Cause Analysis (RCA) even if they don't call it that.
RCA is a really bad way of investigating a failure. Simply put: if you show me your RCA, I know exactly where you couldn't be bothered to look any further.
I think most software engineers using RCA confuse the "cause" ("Why did this happen") with the solution ("We have changed this line of code and it's fixed"). These are quite different problem domains.
Using RCA to determine "Why did this happen" is only useful for explaining the last stages of an accident. It focuses on cause->effect relationships and tells a relatively simple story, but one that is easy to communicate (hi there, managers and media!). But RCA only encourages simple countermeasures, which will probably be ineffective and easily outrun by the complexity of real systems.
However, one thing RCA is really good at is allocating blame. If your organisation is using RCA then, whatever you pretend, your organisation has a culture of blame. With a blame culture (rather than a reporting culture) your organisation is much more likely to fail again. You will lack operational resilience.
then rename it to "root causes analysis"
Of course it shouldn't have happened. But if you run infrastructure as complex as this, on the scale that they do, and with the agility that they need, then it was bound to happen eventually. No matter how good you are, there is always some extremely unlikely chain of events that will lead to a catastrophic outcome. Given enough time, that chain will eventually happen.
While this blog post is pretty useless, it's a hell of a lot better than the LinkedIn posts about the outage... my god, I wish the "Not interested" button worked.
I'd want some sort of "dry run" of the produced artifact by the Rust code consuming it, or a deploy to some sort of test environment, before letting it roll out to production. I've been surprised that there's no mention of that sort of thing in the Cloudflare after-action or here.
> A central database query didn’t have the right constraints to express business rules. Not only it missed the database name, but it clearly needs a distinct and a limit, since these seem to be crucial business rules.
In a database, you wouldn't solve this with a distinct or a limit? You would make the schema guarantee uniqueness?
And yes, that wouldn't deal with cross database queries. But the solution here is just the filter by db name, the rest is table design.
Adding distinct or group by to a query is not some advanced technique, as some comments are suggesting. It does not slow down development one bit: if you expect a distinct result, you put an explicit distinct in the query. It's not a "safety measure for insulin pumps". Scratching my head at what I've missed here, please enlighten me.
Nothing in this thread about "this should not have happened because Cloudflare is too centralized?"
We have far better ideas and working prototypes for preventing this from happening again than to be up here trying to "fix Cloudflare."
Think bigger, y'all.
It did happen, and cloudflare should learn from it, but not just the technical reasons.
Instead of focusing on the technical reasons, they should answer how such a change bubbled out to cause such a massive impact.
Why: Proxy fails requests
Why: Handlers crashed (panicked when the oversized feature file exceeded a preallocated memory limit)
Why: Clickhouse returns too much data
Why: A change was introduced causing double the amount of data
Why: A central change was rolled out immediately to all clusters (single point of failure)
Why: There is no standard operating procedure (gate), or there are exemptions from it, for releasing changes to the hot path of Cloudflare's network infra
While the ClickHouse change is important, I personally think it is crucial that Cloudflare tackles the processes, and possibly gates/controls rollouts for hot-path systems, no matter what kind of change it is; at their scale it should be possible. But that is probably enough backseat driving. To me it seems like a process issue more than a technical one.
Very quick rollout is crucial for this kind of service. On top of what you wrote, institutionalizing rollback by default if something catastrophically breaks should be the norm.
Been there in those calls, begging to people in charge who perhaps shouldn't have been, "eh, maybe we should attempt a rollback to the last known good state?". But investigating further before making any change always seems to be the preferred action to these people. Can't be faulted for being cautious and doing things properly, right?
If I recall correctly it took CF 2 hours to roll back the broken changes.
So if I were in charge of Cloudflare (4-5k employees) I'd look at both the processes and the people in charge.
I think the author is trying to apply a preconceived cause on to the cloudflare outage, but there’s not a fit.
E.g., they should try to work through how their own suggested fix would actually ensure the problem couldn’t happen. I don’t believe it would… lack of nullable fields and normalization typically simplify relational logic, but hardly prevent logical errors. Formal verification can prove your code satisfies a certain formal specification, but doesn’t prove your specification solves your business problem (or makes sense at all, in fact).
I initially read the title as "Cloudflare outrage.." and I was thinking how nice someone is thinking of the poor engineers who crashed the Internet.
> No nullable fields.
If you take away nullability, you eventually get something like a special state that denotes absence and either:
- Assertions that the absence never happens.
- Untested half-baked code paths that try (and fail) to handle absence.
> formally verified
Yeah, this does prevent most bugs.
But it's horrendously expensive. Probably more expensive than the occasional Cloudflare incident
> but it clearly needs a distinct and a limit, since these seem to be crucial business rules.
Isn't that just... wrong? Throwing in an arbitrary limit (vs. maybe having some alert when the table is too long) would just silently truncate the list.
Anybody can be a backseat engineer by throwing out the industry's best practices like they were gospel, but you have to look at the entire system, not just the database part.
They are not going as far as to blame PostgreSQL, but their switch to ClickHouse seems to suggest that they see PostgreSQL as part of the equation. Would ClickHouse really prevent this type of error from occurring? PostgreSQL already has so many options for setting up solid constraints for data entry. Or do they not have anyone on the team anymore (or never had) who could set up a robust PostgreSQL database? Or are they just piggybacking on the latest trend?
Sure, a different database schema may have helped, but there are going to be bugs either way. In my view a more productive approach is to think about how to limit the blast radius when things inevitably do go wrong.
I was expecting a critique on the centralized nature of the infrastructure and the fragility that comes with it.
Hindsight bias is always easier but:
> FAANG-style companies are unlikely to adopt formal methods or relational rigor wholesale. But for their most critical systems, they should. It’s the only way to make failures like this impossible by design, rather than just less likely.
That relational rigor imposes what one chooses to be true; it isn’t a universal truth.
The frame problem and the qualification problem apply here.
The open domain frame problem == HALT.
When you can fit a problem into the relational model, things are nice, but not everything can be reduced to a trivial property.
That is why Codd had to add nulls, etc.
You can choose to decide that the queen is rich OR pigs can fly; but a poor queen doesn’t result in flying pigs.
Choice over finite sets == finite indexes over sets == PEM
If you can restrict your problems to where the Entscheidungsproblem is solvable you can gain many benefits
But it is horses for courses and sub TC.
rolls eyes
No, their error was that they shouldn't be querying system tables to perform field discovery; the same method in PostgreSQL (pg_class or whatever it's called) would have had the same result. The simple alternative is to use "describe table <table_name>".
On top of that, they shouldn't be writing ad-hoc code that queries system tables mixed in with business logic (crappy application design); that kind of task belongs in a separate library.
Also, this should never have passed code review in the first place, but lets assume it did because errors happen, and this kind of atrocious code and flaky design is not uncommon.
As an example, they could be reading this data from CSV files *and* have made the same mistake. Conflating this with "database design errors" is just stupid - this is not a schema design error, this is a programmer error.
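A sketch of the suggested alternative (illustrative Rust constants, not Cloudflare's code; `DESCRIBE TABLE` is real ClickHouse syntax and is addressed to exactly one table, so there's no cross-database result set to forget about):

```rust
// Ad-hoc catalog query, the shape that went wrong: the database scope is easy to forget.
const VIA_SYSTEM_TABLES: &str =
    "SELECT name, type FROM system.columns WHERE table = 'http_requests_features'";

// The alternative suggested above: DESCRIBE names exactly one table in exactly
// one database, so it cannot silently return columns from another database.
const VIA_DESCRIBE: &str = "DESCRIBE TABLE default.http_requests_features";
```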
Would be interesting to see the DDL of the table, to see if it had unique constraints.
The query not utilising a unique constraint/index should have raised a red flag.
Yes, pretty basic-looking mistakes that, from the outside, make many wonder how this got through. Though analyzing the post-mortem makes me think of the MV Dali crashing into the Francis Scott Key bridge in Baltimore: the whole thing started with a single loose wire, which set off a cascading failure. CF's situation was similar in a few ways, though a bad query (and .unwrap() in production code rather than test code) should have been a lot easier to spot.
Have any of the post-mortems addressed if any of the code that led to CloudFlare's outage was generated by AI?
> ...makes me think of the MV Dali crashing...
Yes. Though compared to Cloudflare's infrastructure, the Dali is a wooden rowboat. And CF doesn't have the "...or people will die" safety criticality.
> And CF doesn't have the "...or people will die" safety criticality.
I disagree with that. Just because you can't point to people falling off a bridge into the water doesn't mean that outages of the web at this scale will not lead to fatalities.
One of the things I recommend most engineers do when they write a bug is to first take a look and see if the bug is required. Very often, I see that the codebase doesn't need the bug added. Then I can just rewrite that code without the bug.
Are there outages that should have happened?
"This massive, accomplished engineering team whose software operates at a scale nearly no one else operates at missed this basic thing" is a hell of a take.
Honestly it's a quite lukewarm take.
See for example https://danluu.com/algorithms-interviews/. This sort of thing happens constantly.