
Crazy that they were single-homed.

Edit: After seeing the network diagram I have even more questions. What happens if CF is down? This all seems cobbled together and very prone to failures.



If CloudFlare is down, a significant portion of the Internet is down. Not that it's an excuse, but this isn't Microsoft or Apple; funds have to be allocated according to the likelihood of something being down. But by all means write a blog post and tell them what they're doing wrong and how you'd fix it. Maybe they'll hire you...


And you don't have to have the resources of Microsoft or Apple to plan and build for the eventuality that a provider becomes intermittent or unavailable. There are fundamental aspects of running an internet-facing service, and they failed at one of the most basic ones.


LOL ok, they "failed". They haven't had an outage like this in decades and this one only affected a small number of their clients. But sure, let's spend money on providing a backup for CF. Armchair QBs are the worst.


The issue isn't that they needed a backup to cloudflare. The problem was they only have a single internet provider at their datacenter, so they couldn't communicate with Cloudflare.

I've honestly never had a service with a single outbound path. Most datacenters where you rent colo have two or three providers as part of their network. In the cases where I've had to manage my own networking inside of a datacenter I always pick two providers in case one fails.

> Work is now underway to select a provider for a second transit connection directly into our servers — either via Megaport, or from a service with their own physical presence in 365’s New Jersey datacenter. Once we have this, we will be able to directly control our outbound traffic flow and route around any network with issues.

Having multiple transit options is High Availability 101 level stuff.
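
To make "route around any network with issues" concrete, here's a toy sketch of primary/backup egress selection. The provider names, local-pref values, and health flags are all made up for illustration, not anything from the post-mortem:

    # Hypothetical sketch of dual-homed egress selection: prefer the
    # primary transit, fall back when its health check fails.
    from dataclasses import dataclass

    @dataclass
    class Transit:
        name: str
        local_pref: int   # higher wins, BGP-style
        healthy: bool = True

    transits = [
        Transit("transit-a (primary)", local_pref=200),
        Transit("transit-b (backup)", local_pref=100),
    ]

    def best_egress(transits):
        up = [t for t in transits if t.healthy]
        if not up:
            raise RuntimeError("no usable egress at all")
        return max(up, key=lambda t: t.local_pref)

    print(best_egress(transits).name)   # transit-a (primary)
    transits[0].healthy = False         # primary starts blackholing
    print(best_egress(transits).name)   # transit-b (backup)

Real routers do this with BGP local-pref and route withdrawal rather than a health flag, but the failover logic is the same shape.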


> The issue isn't that they needed a backup to cloudflare. The problem was they only have a single internet provider at their datacenter, so they couldn't communicate with Cloudflare.

That's not the issue. With Cloudflare Magic Transit, packets come in from Cloudflare and egress normally. They were able to get packets from Cloudflare, but egress wasn't working to all destinations. I wasn't able to communicate with them from my CenturyLink DSL in Seattle, but when I forced a new IP that happened to be in a different /24 (I was seeing some other issues too), the Fastmail issues resolved, although the timing may be coincidental. Connecting via Verizon, via T-Mobile, or from a rented server in Seattle also worked.

It's kind of a shame they don't provide services over IPv6: if 5% of IPv4 failed and 5% of IPv6 failed, chances are good that the overall impact to users would be less than 5%, possibly much less, depending on exactly what the underlying issue was (which isn't disclosed). A physical link issue would affect both v4 and v6 traffic routed over it, but BGP announcement issues are often separate for the two.
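
A quick back-of-envelope on that dual-stack point, assuming (big assumption) the v4 and v6 failures hit independent sets of clients and that dual-stack clients fall back Happy Eyeballs style:

    # Toy numbers: the 5% figures are hypothetical, per the above.
    p_v4_broken = 0.05   # fraction of clients whose IPv4 path fails
    p_v6_broken = 0.05   # fraction of clients whose IPv6 path fails

    # A dual-stack client is fully cut off only if BOTH families fail:
    p_both = p_v4_broken * p_v6_broken
    print(f"dual-stack clients cut off: {p_both:.2%}")   # 0.25%

v4-only clients would still see the full 5%, so the overall impact depends on how much of your user base is dual-stack.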


You’d be surprised at how many things break when different routes are chosen. Like etcd, MySQL, and so much more.


Those are generally on internal networks and rarely need to communicate with the internet. They shouldn't be affected by this.


Twould be nice…


Yet they still had the outage. I take exception to being called an 'armchair QB' when most of my career has been spent being called in to repair failures like this, provide postmortem advice to weather future ones, and fix the technical and cultural issues that give rise to exactly this type of thinking: oh, it won't happen to us because it has never happened to us.


In your experience, what kind of cost multiple is involved in remediation of the kinds of failure you deal with?

Is it x2 or x100 or somewhere in between?


Since you need two (or more) of everything: two switches, two physical links, (hopefully) two physical racks or cabinets, and all that, it's a minimum of x2, but nowhere near x100. The cost for additional physical transit links is generally pretty reasonable, depending on the provider, and if you buy more links or more committed bandwidth you can negotiate better rates.

There are a lot of aspects to that, but the cost of doing all of the above is a lot less than the cost of not having it at the wrong moment and losing money that way. Each business needs to weigh that risk against how much they want to invest and how much downtime they think they can tolerate.
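
A toy model of that trade-off, with every number a placeholder you'd replace with your own:

    # Hypothetical figures only: yearly cost of a second transit link
    # vs. the expected cost of downtime without it.
    second_link_per_month = 2_000        # USD/month, made up
    redundancy_cost = second_link_per_month * 12

    outages_per_year = 1                 # assumed single-homed failure rate
    hours_per_outage = 6                 # assumed time to recover
    cost_per_hour = 10_000               # lost revenue, support load, churn

    expected_loss = outages_per_year * hours_per_outage * cost_per_hour
    print(f"redundancy:    ${redundancy_cost:,}/yr")   # $24,000/yr
    print(f"expected loss: ${expected_loss:,}/yr")     # $60,000/yr

If your numbers come out the other way around, single-homing may genuinely be the right call; that's the weighing I mean.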


Seems logical, thanks for engaging with the question.


You're essentially saying "They haven't had an outage yet, so they don't need redundancy". I hope you realize how bad of an idea that is.

Also calling them an armchair QB? Very mature. Their comment is more correct than yours.


> This all seems cobbled together and very prone to failures.

AFAIK it's not like FastMail has a crazy number of network-related outages, so overall it doesn't seem that "prone to failure". As with many things, it's a trade-off with complexity and costs.


I'd argue that often the CDN or transit isn't drop-in replaceable, so it's usually more than 2x the cost, since you have to maintain two architectures (or at least two abstractions). That includes the expertise, plus either forgoing the strengths of each provider or building really robust abstractions/adapters.


It is truly crazy that they do not have their own ASN and IP blocks.

I cannot imagine running a service like that with cobbled together DIA circuits and leased IPs.


> This all seems cobbled together and very prone to failures.

The entire internet.


To be fair, I've been a customer for 5+ years for my main personal email account, and this is the first outage that has impacted me.


15+ here, and there have been a few, but nothing particularly notable. Gmail has had outages too, and I couldn't tell you from personal experience which is more reliable, which is interesting given the big difference in the complexity of the two deployments (obviously Gmail also has the burden of much bigger scale).


I've had more Microsoft Office 365 outages than Fastmail outages in the past 10+ years, and I'm sure Microsoft has much deeper pockets than Fastmail.

Things break. More things will break in the future due to increased complexity, brittle network automation processes and poorly written code. You can mitigate failures to a certain extent, but you can't guarantee 100% uptime, even with a triple redundant system. Every business decision is a compromise among various constraints.


If CF is down, they're down.

The problem here is that there isn't an alternative to Cloudflare.

They say this in the article. None of their DDoS solutions can take the heat except for Cloudflare.

So, if you want resilience in the face of Cloudflare being down, you need to build another Cloudflare. Let me know when you build it. Lots of people will sign up.


There are several providers in this space. Path, Voxility, etc.


They're still single-homed, right?

They just added redundancy to their inbound/outbound routes.


Why assume CF is a simple box?

I think the whole point of CF is that it isn't.


You have to plug the cable in somewhere. If the hardware where the x-connect is plugged in dies, has issues, has to be rebooted, etc., you have an issue. And it's not like there have never been CF issues.


From the post-mortem, it doesn't sound like the problem was a single network cable. Redundant network switches have existed for a long time, and they're certainly using them (even if the post-mortem doesn't bother to mention it).

Their problem was that they only have two transit providers, and one of them black-holed about 3-5% of the internet. Since it was a routing issue, I'd guess it was either a misconfiguration, or that the traffic is being split across dozens of paths, and one path had a correlated failure.
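
If the "dozens of paths" guess is right, the arithmetic lines up neatly with the 3-5% figure. A sketch, where the hash is just a stand-in for whatever ECMP hashing is actually in play:

    # With flows hashed across N equal-cost paths, one bad path
    # silently blackholes roughly 1/N of flows.
    import hashlib

    N_PATHS = 32
    bad_path = 7   # pretend this path blackholes traffic

    def path_for(flow: str) -> int:
        return hashlib.sha256(flow.encode()).digest()[0] % N_PATHS

    flows = [f"10.0.{i // 256}.{i % 256}:443" for i in range(10_000)]
    broken = sum(1 for f in flows if path_for(f) == bad_path)
    print(f"{broken / len(flows):.1%} of flows blackholed")   # ~3%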


Why does there have to be just one cable?


Do you think there are two cables that are never bundled in the same underground tube?

All the redundancy in the world can’t protect you from some random person digging in the wrong place.


Sure, maybe the underground tube was breached while someone was moving goalposts?


Doesn't matter.

Single box or a million boxes, if they apply the wrong config, it doesn't work.

You dual-home to hedge your bets and hope no two ISPs have a fuckup at the same time.


I think you’re suggesting that some redundancy you get by accident — by having two ISPs — is better than the redundancy a single ISP could engineer.

That’s certainly possible in specific cases, but not a very good general principle to rely on. One CF could very well be better than two given ISPs.


I'd suggest dusting off some math and calculating JUST HOW MUCH better CF would need to be compared to the odds of two different ISPs failing at the same time.

We haven't had that happen in 10 years.
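
The back-of-envelope, assuming independent failures (which a shared conduit or upstream can violate):

    # Two ISPs at three nines each vs. one provider that must match them.
    isp_availability = 0.999                  # hypothetical, per ISP
    both_down = (1 - isp_availability) ** 2   # simultaneous outage
    minutes_per_year = 365 * 24 * 60

    print(f"combined availability: {1 - both_down:.6%}")                   # 99.999900%
    print(f"overlap downtime: {both_down * minutes_per_year:.1f} min/yr")  # ~0.5

So a single provider would need roughly six nines to match two independent three-nines ISPs.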

> That’s certainly possible in specific cases, but not a very good general principle to rely on. One CF could very well be better than two given ISPs.

You might think that if you have no idea what you are doing.


That’s a false dichotomy. You can absolutely have two ISPs, one of whom is CF.


Sorry, that's not a false dichotomy.

A second ISP isn't free; it has significant costs in terms of dollars and complexity. The question is, does CF plus another provider offer benefits significant enough to justify the additional costs? For that to make sense, you have to believe the redundancy CF provides is significantly lacking (and in a way that adding a second provider addresses). Maybe it's true, but it would be nuts to just assume it and start spending a lot of money.


The cost is a few k's at most, not exactly a massive cost for a medium-sized company or bigger.

We pay 10x that for power alone.



