yuliyp 4 hours ago

A war room / call is for coordination: for when you need the person draining the bad region to know that "oh that other region is bad, so we can't drain too fast" or "the metrics look better now".

For truly understanding an incident? Hell no. That requires heads-down *focus*. I've made a habit of taking a few hours later in the day after incidents to calmly go through and try to write a correct timeline. It's amazing how broken people's perceptions of what was actually happening at the time are (including my own!). Being able to go through it calmly, alone, provided tons of insights.

It's this type of analysis that leads to proper understanding and the right follow-ups rather than knee-jerk reactions.

  • master_crab 3 hours ago

    This. People keep commenting about it being performative. That’s orthogonal to its purpose. Even the original blogpost points out the limitation of singular focused effort without acknowledging it. It took the author weeks to figure out the actual issue.

    If FB had been down that long, they’d be out of business.

  • thwarted an hour ago

    It took me seven weeks (not full time, but from the initial incident to the final publishing) to do the research and write-up for a recent event. This included in-person interviews, data correlation, reading code, and revision control spelunking across multiple repositories to understand the series of events and decisions that led to the event, some of them months earlier. Some people were advocating "get it out because we have to move on", which I pushed back on. Once published, the feedback was positive and some folks acknowledged that knee-jerk follow-up reactions would have made things worse. But to get to the point where the post-incident review is valuable, someone has to put in the actual work and time to make it so. It should be a learning experience, not checking a box; otherwise, we're just spinning our wheels without making any progress.

  • DylanDmitri 3 hours ago

    I've been in some good coordinating calls for widespread incidents. Many unique individuals (15+) talked in a ten-minute period, sharing context on what their teams were seeing, what remediations had worked for them, etc.

belval 7 hours ago

> Could I run my terminals in there? Yes. Did I? Yes, for a while. Was I effective? Not really. I missed my desk, my normal chair, my big Thunderbolt monitor, my full-size (and yet entirely boring) keyboard, and a relatively odor-free environment.

Not Meta but at Amazon I always felt like war rooms are a place for some leader to scream at you and not much else. The reality is that debugging some retry storm, resource exhaustion or whatever won't happen in a room with 18 people talking over one another.

Give me a meeting link, I'll join and provide info as I find it, but this type of sweaty hackathon-style all-hands-on-deck was never productive for me.

  • Hasu 4 hours ago

    > I always felt like war rooms are a place for some leader to scream at you and not much else. The reality is that debugging some retry storm, resource exhaustion or whatever won't happen in a room with 18 people talking over one another.

    I once walked out of a war room (at a much smaller company that I wouldn't be at for much longer) that had devolved into finger-pointing and blame games. Half an hour later, my boss came out to find out what I was doing and I pointed at my screen and said, "This. This is what's wrong. Ship my fix and we're done here." The entire war room came to my desk to see and discuss the fix, which we shipped, which solved the issue.

    At my next job, I had to hold back laughter when the VP of Engineering, who was pushing mob programming, said, "Think about it. When we have an incident, when something is really important, what do we do? We all get in a room together. No one leaves the war room to go solve the incident on their own."

    • bloomingkales 4 hours ago

      "This. This is what's wrong. Ship my fix and we're done here."

      Lol. This is not hyperbole. Just about everyone has several stories like this, and they are quite hilarious in their utter absurdity. It's like these people get possessed by the spirit of Gordon Gekko in that exact moment and must absolutely play out the role to a tee. Then they become unpossessed and go skiing on weekends.

  • alabastervlog 3 hours ago

    Leaders are really into playing pretend and it seems to just get worse the higher they are on the ladder.

    Like, literally, an effective way to sell to them is to make them feel like they’re in a movie doing Super Important Things. LOL. Executive Disneyland.

    • SpicyLemonZest 24 minutes ago

      Much of a leader's job is to visibly perform leadership. It seems silly until the first time you need something big from a team whose managers do it poorly, and you realize that they're incapable of making commitments or setting priorities.

      The expectation that leaders will play pretend about a "war" and call everyone into a "war room" is just a part of what it means for an organization to commit that consistent high uptime is a top priority.

  • nine_zeros 6 hours ago

    >Not Meta but at Amazon I always felt like war rooms are a place for some leader to scream at you and not much else.

    It is for some "leader". The large tech industry is filled with phony leaders who don't understand how the job is done and what makes the doers tick.

    But they occupy the place of "leadership". They must be seen as doing something. So they are doing the something that they can - scream at people in a locked room.

    If they could actually solve technical problems or talk to their bosses like a real engineering leader, they would. But they literally are incapable of doing so.

    So war rooms and BS performative art it is.

    • teeray 40 minutes ago

      > It is for some "leader".

      Exactly. It’s so the leader can ask “do we have an update?” every 10 minutes when nothing has changed.

    • bloomingkales 4 hours ago

      The role of a leader is an age old role. When someone is thrown into leadership, I do believe a lot of adrenaline kicks in. You begin acting as if you are a leader similar to how a parent has parent senses and will run into a street to save any kid from a car (poor example, any human should, but hopefully you get my point). I think what you get in a war room is that primal "phenomenon" of "oh shit I'm the leader now". You have to weather the primal emotions, and get a cool head back on to fulfill the leadership role.

      If it's your first time, then yeah, you will probably handle it like a dick (or twat). You gotta take on an ancient role with humility.

trollied 6 hours ago

I used to be a Rachel-a-like in a past life. Really tight SLAs (mobile network infra etc, so people have to be able to make emergency calls, for example).

So many times I got bridged into a conference call whilst fixing things & doing RCAs against tight SLAs, as non-technical people didn't have any sort of idea that it was wasting my time. "I am fixing it, I will send updates as per contractual agreements" puts phone down.

On several occasions I got 2 emails after the fact - one praising me for resolving quickly, another asking me to please be nicer to executives. The calls stopped after 5 or 6 times.

Things have moved on these days, and it's much easier to coordinate such events on Slack etc. Thankfully!

  • Nifty3929 3 hours ago

    Please be empathetic to people who do not understand what is going on, but who have tremendous responsibility to the business. The business problem is always a superset of the technology problem.

    Yes, of course you aren't able to fix it while you are on the phone with them. A conference call will not fix the code. They know that too. But they also need meaningful information and updates in order to do their jobs, which often requires them to provide updates to others like important customers, shareholders, the CEO, or even the government. They may also need this in order to plan out other activities.

    Providing useful information and frequent updates (not "contractual" updates) to them with this in mind would go a long way toward solving the whole business problem that is created by the technical problem. It might also get them off your back sooner, and with more respect for you.

    There are two critical pieces of information that would help an executive very much: Do we know what the problem is? and Do we know what the solution is? A simple yes/no on both of those would be a great start.

    • ameliaquining an hour ago

      Communicating that information to executives needs to be the responsibility of someone who isn't currently heads-down debugging. Google's SRE Book suggests creating a "communications lead" role.

      • cratermoon an hour ago

        At one employer our site outage recovery runbook specifically stated that one person was to be designated to communicate status outside the tiger team and be a buffer between panicky people across the company and the technical folks fixing the problem.

    • aqueueaqueue an hour ago

      Surely more than one person is working on the fix? If you have a pair, one can pop off and give updates to a third technical person (maybe their manager or an incident manager) who can liaise.

willvarfar 3 hours ago

The "war room" or "tiger team" or whatever it's called is often a way to parachute in a handful of engineers that top management trusts to sort out the mess made by the masses. Often crusty old-timer engineers are kept around just to be called on in these scenarios.

  • Nifty3929 3 hours ago

    Yes, and this also gives the lie to ageism. I hear this from older-than-me people fairly often, that the reason that they can't get the job they want, or a promotion, or whatever, is 'ageism.'

    Meanwhile, I routinely see people older than me (I'm not young) being hired, promoted and generally shown great respect - because their years of experience have given them wisdom. They also remember how things developed over time and have more experience with details farther down the abstraction stack because those abstractions weren't around when they cut their teeth.

    I aspire to be one of those grey beards in the not so distant future. And I doubt my age will ever hold back my career, aside from changes in my personal choices (for retirement, fewer hours, etc).

    • pbronez 2 hours ago

      Yes, but that expertise may not be easily transferable. Two decades of experience with a firm is much more valuable to that specific firm than anywhere else. If you leave that place, you only have general lessons to apply elsewhere.

hackpelican 4 hours ago

In the places I’ve worked, a war room was always the place where we cut the bleeding and reverted the system to a working state. Never was the RCA the intended outcome of a war room, though we’d often reach the RCA in the silence of the meeting bridge while something deployed/rolled back.

Root cause analysis is definitely not a group activity, it’s best done in a place where one can have complete focus.

However, cutting the bleeding requires plenty of communication, weighing different options, having a higher-up sign off on a tradeoff, getting our ops team to coordinate towards some common goal, monitoring the recovery… etc.

  • sunshowers 3 hours ago

    So interestingly, I think root cause analysis can be a group effort, but I think it has to be done on a remote call where everyone is in front of a big monitor or two, and people can take breaks and such. I've been part of teams that have done root cause analysis over a call (sometimes many calls), and it's been quite effective.

  • afro88 2 hours ago

    IIRC, Facebook don't (or didn't) do rollbacks. They always fix forward. I guess hours-long incidents like this are the other edge of that double-edged sword.

    • claytonjy 19 minutes ago

      Language can be tricky here. If I revert to an older commit, literally rewriting history to remove newer, bad commits, I think we’d all consider that a rollback. But if I instead add a new commit which undoes the bad commits, is that a rollback or a roll forward?
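
      One way to make the distinction concrete is a toy model in Python. This is an illustration only and says nothing about Facebook's tooling: the commit tuples and the `rollback` and `revert` helpers are made up for the example. Both approaches end in the same system state; they differ only in whether the bad commits stay in the history.

```python
from typing import List, Tuple

# Toy model: a commit is (message, delta); the deployed state is the sum of deltas.
Commit = Tuple[str, int]


def state(history: List[Commit]) -> int:
    """Compute the deployed state implied by a history."""
    return sum(delta for _, delta in history)


def rollback(history: List[Commit], n_bad: int) -> List[Commit]:
    """Rewrite history: drop the last n_bad commits entirely."""
    return history[: len(history) - n_bad]


def revert(history: List[Commit], n_bad: int) -> List[Commit]:
    """Keep history: append new commits that undo the last n_bad, newest first."""
    bad = history[len(history) - n_bad:]
    undo = [(f"Revert: {msg}", -delta) for msg, delta in reversed(bad)]
    return history + undo


history = [("add cache", 1), ("tune pool", 2), ("bad change", 5)]
rolled_back = rollback(history, 1)  # two commits remain, the bad one is gone
reverted = revert(history, 1)       # four commits, the last one undoes the bad one
```

      In this model `state(rolled_back)` and `state(reverted)` are equal, so "rollback" versus "roll forward" describes the shape of the history rather than the resulting behavior.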

edflsafoiewq an hour ago

> This fbagent process ran as root, ran a bunch of subprocesses, called fork(), didn't handle a -1 return code, and then later went to kill that "wayward child".

In-band error codes strike again.

  • pedrocr 32 minutes ago

    This is a case of both in-band error codes and overloaded meanings of inputs colliding. Modern languages make both things much better but even in C the kill(2) interface seems much too clever. It seems it could have easily been a couple of different functions.
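
    A rough sketch of what "a couple of different functions" could look like, in Python for brevity. The names `signal_process` and `signal_group` are hypothetical wrappers, not an existing API; they only call `os.kill` and `os.killpg` after rejecting the overloaded values. Python's `os.fork` already raises `OSError` on failure instead of returning -1, so this sketch guards the kill side only.

```python
import os
import signal


def signal_process(pid: int, sig: int = signal.SIGTERM) -> None:
    """Signal exactly one process; reject the overloaded pid values."""
    # kill(2) gives pid <= 0 special meanings: 0 is the caller's process group,
    # -1 is every process the caller may signal, < -1 is a process group.
    if pid <= 0:
        raise ValueError(f"not a single-process pid: {pid}")
    os.kill(pid, sig)


def signal_group(pgid: int, sig: int = signal.SIGTERM) -> None:
    """Signal one explicitly named process group."""
    # On Linux, killpg(pgrp) is implemented as kill(-pgrp), so a pgid of 1
    # would turn into kill(-1); 0 would mean the caller's own group.
    if pgid <= 1:
        raise ValueError(f"not an explicit process group id: {pgid}")
    os.killpg(pgid, sig)
```

    With this split, an unchecked -1 from a failed spawn turns into a `ValueError` at the call site instead of a signal to nearly everything on the machine.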

dakiol 3 hours ago

As a software engineer I generally can help little when a non-trivial incident occurs, whether it is via war rooms or deep investigations. I do have some kind of access to some logs, traces and metrics (datadog, for instance), but in the end only the SREs or platform engineers are the ones who determine the root cause of any incident because they have 100% observability.

Simon_O_Rourke 25 minutes ago

Why do all these posts descend into the "I'm so awesome" archetype? Describe the damned problem and how it was resolved, and for goodness' sake stop trying to stroke that ego while you're doing it.

coldcode 6 hours ago

I've watched many war rooms at various employers.

At one (20 years ago), they met for six months to determine why our field offices' network connection to the home office was so pathetic and unusable. It was led by the head of networking. After all those meetings, it was decided that all 1000 independent field offices should upgrade their internet to T1 connections. It didn't help. Another six months went by, and I heard from my connections in networking that the real problem was the head of networking had installed a half-duplex low-speed ethernet card: all 1000 offices' data had been going through a pinhole. It was replaced, and suddenly everything was fine again, other than the hole in the offices' pockets for an unnecessary upgrade.

No one ever mentioned it publicly.

steveBK123 3 hours ago

The purpose of the war room is not to solve the problem but to perform the act of problem solving visibly for certain audiences.

hedayet an hour ago

Ex-Google SRE here with experience in multiple revenue-critical war rooms. At Google, war rooms were particularly useful because saying, "X is in a war room" (at least as late as 2017) gave X the credibility to say no to everything else. Having technically competent leaders made the experience enjoyable, because they weren’t just there to demand updates but actively contributed by writing queries and nudging the team in the right direction by asking a series of the right questions.

My worst experience with crisis management was with one particular team at another big tech company, where the leaders were ignorant about the technology—completely clueless about the service and its architecture. In cases like this, the issue becomes a binary 0/1 problem: the service is either broken (0) or running smoothly (1). When a leader lacks the technical knowledge to grasp the intermediate steps, their only contribution is yelling for updates—and that’s exactly what they did.

Bottom line: War rooms can be a space for deep work with good leadership (a combination of technical soundness and co-ordination skills under pressure). But they can quickly turn into hell when leadership lacks one of these two essential qualities—and resorts to yelling to cover their asses.

CapricornNoble 6 hours ago

I'm not familiar with the "War Room" in the context of computer network operations specifically, but I have deep experience running military operations centers and I'm reading this through that lens.

>People figured out that yes, they had run the machines out of memory, specifically with the push - the distribution of new bytecode to the web servers. Other people started taking steps to beat back some of the bloat that had been creeping in that summer, so the memory situation wouldn't be so bad. I suspect some others also dialed back the number of threads (simultaneous requests) on the smaller web servers to keep them from running quite as "hot".

Cross-functional information exchange. Who is coordinating or directing all these disparate actions? Who is fusing the knowledge gained from these actions? Who is disseminating a clearer picture of "what really happened"? Who is using that updated picture to frame new taskings for all the people doing these independent investigations? The answer to all those questions should be "the staff in the War Room", and the leadership in the War Room in particular. My take-away is that the author is arguing that their ability to pursue single-function actions within their domain of expertise was optimized in their work environment, and was degraded in the War Room. They aren't wrong.

>I guess a "war room" might work out if you have a bunch of stuff that has to happen to deal with a possible "crisis" and then it's just a matter of coordinating it. You don't have people doing "heads-down hack" stuff nearly as much in a case like that.

Exactly. Coordinating a bunch of stuff for crisis management = put those people in the War Room. Focused heads-down tasks = put those people where they can... focus. Now, that said, one thing I've come to HATE about working in a military headquarters is open offices for everyone who isn't the G-shop lead and his/her deputy. Everyone else is shoved into a cubicle farm, probably with ESPN blaring in the background on top of a half-dozen conversations and people constantly dropping by your desk to BS about cover sheets for TPS Reports. So even if you're NOT in the War Room, you can't focus.

  • icegreentea2 5 hours ago

    I feel like some things that consistently get in the way of the clean separation between (crudely speaking) deciders and doers, and keeping the doers out of the war room (so they can work effectively) are:

    Poor, or fear of poor communication. The "do-ers" become compelled to be in the war room to try to mitigate communication failures.

    Unclear decision making processes and ownership. People with high technical expertise (who would be top tier do-ers who maybe should be kept out of the war room) are kept around because their immediate feedback in the war room can significantly shift the decision making process and decisions made.

    I should be more specific - I believe there's often a desire (and it makes instinctive sense) to fall back to decision by consensus. Once everyone understands that this is how these things work, then obviously you want to pack the smartest, most competent people in the room, either because you're playing political games, and you want more "votes", or because you truly believe that you need the best people in the room to guide consensus.

    These are structural and cultural (non-cynical) issues that drive both doers and decision makers to -want- to keep smart, competent doers in the room, even though separation -should- lead to better outcomes.

gherkinnn 5 hours ago

War room. In the trenches. War stories. Pasty programmers and plump PMs using such terminology is a bit silly.

Fixing a printer that sometimes does something unexpected is not even a sailor's yarn, let alone a war story.

  • alabastervlog 2 hours ago

    “Telemetry” for “keylogging our fart app’s users” because everyone wishes they were doing something cool and/or meaningful.

bossyTeacher 6 hours ago

> I can't imagine doing that kind of multi-window parallel investigation stuff on a teeny little laptop screen with people right next to me on either side

This is it. Managers (I mean non technical folk) don't understand this. They don't understand that putting people physically together won't help you solve the issue faster. This is the same mentality that believes that typing code faster or generating more code is a good thing. The kind that believes that all employees need to always be physically together for "good stuff" to happen.

Sadly, they will never learn. Those managers and C-suite people will never read Rachel's post or investigate if their RTO policies are necessarily good for the business. These folks are just reading numbers on a spreadsheet without fully understanding what those numbers actually mean in their business.

Sadly, I don't see that ever changing, because that mentality provides a comforting worldview: the office gives you a sense of control, and having all your cows in the farm under your watchful eye (or that of your trusty shepherds) feels so intuitive that any alternatives are simply too uncomfortable to even think about.

  • bloomingkales 4 hours ago

    There was a head of a department that once forced everyone to uninstall iTunes because he believed it was reducing productivity. Feels like a never-ending battle with these types.

cratermoon an hour ago

"This fbagent process ran as root, ran a bunch of subprocesses, called fork(), didn't handle a -1 return code, and then later went to kill that "wayward child". Sending a signal (SIGKILL in this case) to "pid -1" on Linux sends it to everything but init and yourself. If you're root (yep) and not running in some kind of PID namespace (yep to that too), that's pretty much the whole world."

Key phrase "didn't handle a -1 return code".

Yuan, Ding, Yu Luo, Xin Zhuang, Guilherme Renna Rodrigues, Xu Zhao, Yongle Zhang, Pranay U Jain, and Michael Stumm. “Simple Testing Can Prevent Most Critical Failures.” Proceedings of the 11th Symposium on Operating Systems Design and Implementation (OSDI), 2014, 17. https://www.eecg.utoronto.ca/~yuan/papers/failure_analysis_o...

adolph 7 hours ago

It’s interesting to think about how broad a net must be cast to understand the state of a system.

> That was another rathole, and the answer was also a thing to behold: I couldn't see it in the checked-in source code because it had been fixed. Some other engineer on a completely unrelated project had tripped over it, figured it out, and sent a fix to the team which owned that program. They had committed it, so the source code looked fine.

  • esafak 6 hours ago

    And the fact that everyone benefits when people aren't just doing their own, narrowly-defined jobs.

  • tantalor 3 hours ago

    This is an obvious first thing to check when you are looking directly at source code.

    Oh, the code changed 1 week ago? Let's see the diff. Oooooooh!