
An outage of this magnitude is almost ALWAYS the direct and immediate fault of senior leadership's priorities and focus: pushing too hard in some areas, not listening to engineers on needed maintenance tasks, etc.


And engineers are never the cause of mistakes? There can't possibly be any data to back up the claim that major outages are more often caused by leadership. I've been in SIEs simply because someone pushed a network outage to a switch network. Statements like these only go to show how much we have to learn, humble ourselves, and stop blaming others all the time.


Leadership can include engineers responsible for technical priorities. If you're down for that long though, it's usually an organizational fuck-up because the priorities didn't include identifying and mitigating systemic failure modes. The proximate cause isn't all that important, and the people who set organizational priorities are by and large not engineers.


PROLONGED outages are a failure point that, more often than not, requires organizational dysfunction to happen.


Think of airplane safety. I think it is similar. A good culture makes it more likely that $root-cause is detected, tested, isolated, monitored, easy to roll back, and so on.



