Hacker News down, unwisely returning HTTP 200 for outage message (bibwild.wordpress.com)
399 points by pcvarmint on Jan 7, 2014 | 113 comments



As people have mentioned, this site is an exception to how to do things, in that PG actively does not care about search engine results. However, for those who are interested, here are a few ways you can handle a situation like this.

1. If you add a stale-serving directive (e.g. "stale-if-error") to your Cache-Control header, you can tell your CDN to hold on to things longer in the event of an outage. The CDN will look at the cache time you set and check in as normal when that expires, but if it gets back no response or a 503 (the Service Unavailable status code; if you work with CDNs this is your friend!), it will continue serving the stale content for as long as you tell it.

2. Let's say you're beyond that, or your site is too dynamic. Instead of setting up an error page that responds to everything, set up a redirect with the 302 status code (so crawlers and browsers know it's a temporary thing). Point that redirect at an error page and you're golden. The best part is that these types of requests use minimal resources.

What I do is keep a "maintenance" host up at all times that responds to a few domains. It responds to all requests with a redirect that points to an error page that issues the 503 maintenance code. Whenever there's an issue I just point things at that and stop caring while I deal with the problem. I've seen webservers go down for hours without people realizing anything was up, although with dynamic stuff everything is obviously limited. The other benefit to this system is that it makes planned maintenance a hell of a lot easier too.
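
For the curious, here's roughly what that looks like as a minimal sketch using Python's standard library (the port, paths, and Retry-After value are made up, not anyone's actual config): every request gets a temporary 302 to an outage page, and the outage page itself answers 503 with a Retry-After hint, so caches, crawlers, and monitors all know the situation is temporary.

    # Minimal sketch of a standing "maintenance" host (illustrative only):
    # 302 every request to /down.html, which answers 503 + Retry-After.
    from http.server import BaseHTTPRequestHandler, HTTPServer

    OUTAGE_BODY = b"<html><body><h1>We're down for maintenance. Back soon.</h1></body></html>"

    class MaintenanceHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path == "/down.html":
                # The outage page itself: 503 tells clients it's temporary,
                # Retry-After tells them when to check back.
                self.send_response(503)
                self.send_header("Retry-After", "3600")
                self.send_header("Content-Type", "text/html")
                self.send_header("Cache-Control", "no-cache")
                self.end_headers()
                self.wfile.write(OUTAGE_BODY)
            else:
                # Everything else: temporary redirect, so crawlers and
                # browsers keep the original URL and come back later.
                self.send_response(302)
                self.send_header("Location", "/down.html")
                self.end_headers()

    if __name__ == "__main__":
        HTTPServer(("", 8080), MaintenanceHandler).serve_forever()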

Oh, another thought: you can use a service like Dyn (their enterprise DNS) or Edgecast to do "failover DNS". Basically, if their monitoring systems notice an issue, they just change your DNS records for you to point at that maintenance domain. You can also trigger it manually for planned things.


> As people have mentioned, this site is an exception to how to do things, in that PG actively does not care about search engine results.

Is pleasing Google now the only reason to obey the HTTP spec?


Unfortunately: Yes.

For example, I had a hard time telling people that they should use a proper HTTP redirect from their domain "example.com" to "www.example.com" (or vice versa), instead of serving the content on both domains. All arguments about standards, best practice, etc. were unconvincing. But since Google started to punish "duplicate content", I have never had problems convincing people again.


Search result ranking is an obvious casualty, but it's not the only one. My Firefox start page (as an example) now has a cached but empty version of the HN pages that doesn't really look quite right. Any system that pulls content automatically, like that, or an offline reader, etc. is going to get confused.

It's simply good practice to get the HTTP status codes correct in the ways you outline.


HN was down much longer than I thought it was because the bad pages were stuck in caches. I had to do a force-reload to see that it was back up this morning.


HN was down much LESS time than I thought, because they told browsers that the "Sorry, but we're down" message should stay in cache until 2024! I believed the status report until today, because it didn't occur to me that someone would issue a status report on a temporary site-wide outage and instruct browsers to consider it permanent.


It was also down for much longer than a lot of services (e.g. status checkers) thought, since their only sensible method of working is to inspect the HTTP status. The result being that HN uptime looks better than it really is; of course, this isn't a paid service or something mission-critical, so exaggerating uptime doesn't really make much difference.


Similar problem on my phone and you can't Ctrl+F5 to force a reload on this device.


You can add a random query parameter to the url to force a reload if the option isn't given by other means.
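
For example (a throwaway sketch; the parameter name means nothing to the server, it just makes the URL look new to the cache):

    # Cache-busting by hand: tack a throwaway query parameter onto the URL
    # so caches treat it as a different resource and fetch it fresh.
    import time
    import urllib.request

    url = "https://news.ycombinator.com/?nocache=%d" % int(time.time())
    html = urllib.request.urlopen(url).read()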


Good point. Didn't think of that!


On some phones, holding Shift and hitting Enter will work for a hard refresh... (it's awkward, to say the least)


Same here. The only way I finally got out of the bad state was to click on the logo to go to /news; if I navigate to news.ycombinator.com I'm still seeing the "we're down" message in Chrome.


This is a strange reaction. I agree that perhaps "this is different", but I am hoping that the next time I see an article on HN about another website going down, or losing its data, the HN community shows as much compassion and understanding. In my experience, when it's not related to the work of pg, there tends to be a lot of "you should have done X and there is no excuse for not doing Y!!"

If I'm being honest, it's off-putting to watch the community-wide apologist response to what would normally be outrage over poor execution.


I think it's an expected response from a community that is largely composed of people who desire YC-bucks.

But you nail it. Had Gmail, Reddit or any other unpaid service gone down like this, the uproar would be heard from another galaxy.


People depend on email. This is just a link aggregating and discussion site. I think people should have some sense of perspective here.


Having spent months telling people that email is not reliable and can fail at any minute and must not be relied on, and then having to put up with the fallout when the shitty provider broke something, I can confirm that people get angry when email breaks.


"email is not reliable and can fail at any minute and must not be relied on,"

In what universe? If email were to stop working in most major corporations that I've worked in for the last 8+ years, the company would basically come to a halt.

Email, for many, many companies is the message/workflow bus, and if it stops - communication comes to a halt.

It is, after electricity, and the network, the one essential function in a company in 2013.

Telephones, photocopiers, and printers can all cease functioning with little impact in most technology companies, but not email.


And yet email is the most failure-prone thing out there. It was not meant to be relied upon to the degree it is today. And yes, many companies grind to a stop as soon as email is down. There have been no big improvements to the reliability of email in the last 40 years. I mean, what is the date format in email? Any sort of current standard? What date-stamp are most email readers trusting? Ever had an email that says it arrived in 1997? I still get those sometimes.


"Email is the most failureprone thing out there"

I don't know where you keep coming up with that. Email is relatively easy to make highly available, and an Exchange server in 2014, configured with even a modicum of skill, will likely keep running smoothly for the next 10 years. It's as close to five nines of availability as you can get on a software system.

"There has been no big improvements to the reliability email the last 40 years. "

That's just silly. Take 5 minutes to read through http://en.wikipedia.org/wiki/Email and you'll see what improvements have been made in the last 40 years.


Deliverability of email relies on a number of factors outside your control.

You also mention competence, which is available in variable quantities. You might be able to keep Exchange 2014 running solidly, but I've seen people doing scary things with MS SBS Server 2000 (and the Exchange that comes with it).


Eh, as someone who has had to run some large sites I'm a bit more sympathetic regardless of who it is. Shit happens, and they're just websites.


You may be sympathetic, but it seems like most people aren't. Most folks start saying what they'd do better and how they could improve the availability in a weekend.


"actively does not care about search engine results. "

Or data -- they tossed a day or so of submissions and comments when restoring from backup, AFAICT. I'm not saying that's a problem; 4chan loses data and works just fine.


That's a poor comparison. Users expect 4chan to lose that same data as part of normal operation anyways, so you can hardly consider it a problem. That's not the case for HN.


This was all me. I probably should have thought about it more, but just wanted to make it clear we knew something was wrong and were working on it. Load was not a concern.

Though the article is correct, with everything else that was going on, response codes and cache headers were the least of my worries.

I think the best takeaway is that you will go down at some point, so it's best to have a reasoned plan in place for when you do. Handling it in the heat of the moment means you'll miss things.


Yep, I think your takeaway is exactly the right one, about planning in advance for outages.

I didn't post to try and make HN look bad; using the wrong HTTP status code isn't a huge deal or anything. I just wanted to take the opportunity to discuss HTTP response codes, an issue near to my heart. (In my day job at an academic library, the fact that most of the vendors we deal with deliver error pages with 200s does interfere with things we'd like to do better.)

Thanks for the reply!


If it becomes a big deal, and you can't get the data from ThriftDB for some reason, I've got a copy of HN data (submission & comment content, user, points, item date) up to id 7018491, a comment by kashkhan, time-stamped 2014-01-05 23:56:34.

edit: Ack, I just realized that item_id got reset back to 7015126 on the reboot. My data matches HN up to 7015125, and then diverges after that.


Can you make it available somewhere? I'd be interested in that data, and sure that others would be too. Thanks!


Sure. This is very temporary, I'll be removing the links sometime tomorrow:

http://www.associatedtechs.com/tmp/hn_submissions_7015126.sq...

http://www.associatedtechs.com/tmp/hn_comments_7015126.sql

After a semi-random sampling, the comments file appears to contain nothing but comments pre-crash.

The submissions, however, got clobbered a little by the crawler at some point. There are some submissions in there pre-crash and some post-crash; I think everything's OK from 7015172 on, which only leaves 15 possibly damaged rows, and of those, I'd expect most of them didn't have id collisions. Sorting the old stuff from the new stuff could be done manually.

(Please let me know if there's anything I should be concerned about in those, or if they shouldn't be posted for some reason, or something. I'm recovering from flu and am still not entirely all here.)


Thanks!


HTTP 200 = "Cloudflare, please cache this status message instead of passing through a million requests to our dead server while it's busy restoring a backup".

PG doesn't care about HN's search listings, so there are no drawbacks to doing that.


> "PG doesn't care about HN's search listings"

Yup. re: "Why does HN have a relatively low Google PageRank?"

"Probably because we restrict their crawlers. But this is an excellent side effect, because the last thing I want is traffic from Google searches."

https://news.ycombinator.com/item?id=5808990

If you need to search for something that you know was on HN, HNSearch is a great tool. I use it all the time.


> "Probably because we restrict their crawlers. But this is an excellent side effect, because the last thing I want is traffic from Google searches."

Am I the only one who finds that puzzling?

This isn't Fight Club. It's not even Entrepreneur Club. It's a bunch of generally smart people talking about technology, with an emphasis on making money from it. It's one of my favorite sites, and I love it, but it's not an invitation-only club, is it?

(I also find it weird that one of the go-to sites for web-savvy people would be like, "yeah, screw status codes and how the open, linked, web is supposed to work".)

To be clear, I'm not Protesting a Great Evil. I just find it puzzling, as in, "That's odd, I must not understand what this is all about, after all."


I can't speak for PG, but I think the general idea is that a slow influx of new users is less likely to alter the nature of HN as everyone has a chance to acclimatize (avoiding some sort of Eternal September), and the people who really "need" to be on HN (people interested in startups I guess?) will know about HN already, or be told about it. That last part might be a little "fightclub-ish" I guess, but it seems to be working alright.


Couldn't he just turn off registrations for new accounts? Not saying he needs to get HN to the top for a "startup" search query. I found HN by a Google search while looking for a good laptop to run Linux on.


That happens. When too many people register accounts, registration is locked for the rest of the day and the create-account option is no longer there, only login.


IIRC sometimes the "Create Account" section of the login page is missing, but can still be accessed through https://new.ycombinator.com/submit

That's not how it is right now so I'm not sure if I am remembering it correctly.


IIRC (I might search for it later) that was a spambot fix. Apparently it was fairly effective - I presume the bots were smart enough to find the 'login' link on the front page then register an account from there but not much else.



It's not just puzzling, it's almost criminal considering how much of the culture and important decisions get discussed here.

A lot of the time you have people who are direct parties to <insert thing here> come and talk about it, only for it to become forever inaccessible because Google can't get its mittens on it.

Let's not even get started on how some URLs expire.


It's also because it saves a ton of resources not having to serve crawler requests.


Thanks for pointing out HNSearch. site:news.ycombinator.com (whatever) works nicely with Google too.


Not really. There were plenty of times when I tried to find an article from a few days back but Google was coming up blank even with `site:news.ycombinator.com`. I had to resort to scrolling through HN's Facebook bot page (it posts all the front-paged links).


>> HNSearch is a great tool. I use it all the time.

Agreed. So is Pinboard's search for #hn tags.


> pass through a million requests...

How does that make sense? As if CloudFlare would honor status codes but not take advantage of cache headers (which in this case stipulated an absurd 10 year expiration).


I seriously doubt Cloudflare's behaviour would be that stupid; wouldn't it momentarily cache error pages instead of hammering the server? At a minimum it would throttle/prevent concurrent requests.


Unfortunately, it took me 12 hours to find out the site was back up, because the outage page had been cached for me. It took a while to realize I had to do a hard refresh of the page.


That's only because HN set a 10 year expiration header, as pointed out by the article.


Other options would be to return status code 503 Service Unavailable with a Retry-After header, or drop connections while the dead server is busy.

Both of those would be much better than returning 200 OK.


CloudFlare also takes the load off your server if it is dead. It still has a "Try the original server" button, but that is more manual.


The cache was too long, though. I didn't think to try Ctrl+F5 until I saw that it was up according to Twitter.


Browsers tend to cache 200 OK responses. When HN came up (as reported on Twitter) I kept getting the error page until I busted the cache and reloaded. Yup, that's what 200 OK for an error page can cause: a regular reload will still show your _down_ page.


Well, by the letter of the RFC, browsers and middle boxes (all the invisible caching proxies out there at your ISP, etc.) are only supposed to cache if the cache/expires headers are set for that, but 200 is still a bad choice for most people running a site, for the other reasons listed above like crawlers/indexers. A 5xx is correct for a problem on the server, usually 500, 503, or 504 (http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html).

At the same time there are many erroneous implementations (intentional and unintentional) that, as you imply, just cache any 200 without checking cache headers. This is also a good practical reason to avoid doing this.
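
A rough sketch of the two behaviours being described (nowhere near a full RFC implementation, just the shape of the decision): a careful cache checks both the status code and the freshness headers before storing anything, while a sloppy one just stores any 200 it sees.

    # Rough sketch: careful cache vs. sloppy cache deciding whether to store.
    def careful_cache_may_store(status, cache_control):
        directives = dict(
            d.strip().partition("=")[::2] for d in cache_control.split(",") if d.strip()
        )
        if "no-store" in directives or "private" in directives:
            return False
        # Only store successful responses with an explicit freshness lifetime.
        return status == 200 and ("max-age" in directives or "s-maxage" in directives)

    def sloppy_cache_may_store(status, cache_control):
        return status == 200   # ignores the cache headers entirely

    # HN's outage page: 200 with a ten-year max-age, so even a careful cache keeps it.
    print(careful_cache_may_store(200, "max-age=315360000"))  # True
    # A plain 503 outage page gets stored by neither.
    print(careful_cache_may_store(503, ""))                   # False
    # The sloppy cache keeps any 200, headers or not.
    print(sloppy_cache_may_store(200, ""))                    # True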


I think that was actually caused by the incredibly high cache-control expire header, which claimed cached versions of the page should be valid for 10 years (!)


I had the same issue. Checking all day. I saw that it was back up so I tried it - and it was still down.

Finally thought to try reload and it gave the page.


I know that pg "actively does not care about search engine results", but the HTTP spec has other applications besides Google PageRank. It's hard to build amazing new technologies and improve the Web if people keep ignoring the standards without a good technical reason. Please, for the sake of the example set for others, send the proper HTTP codes.


This isn't bad for just Google. RSS aggregators were also getting the "everything's fine" message. I thought I had a bug in my aggregator until I went to the site and realized it was down.


He's right that for most sites this would be undesirable, but PG has stated that they aren't looking for a lot of Google traffic. But then, just because it doesn't matter to PG doesn't mean it doesn't matter at all.

I've more than once found myself Googling old HN threads I'd like to find, but can't. Google's search (and site search) is miles better than HN's, but HN intentionally limits Google's crawl rate, thus limiting the amount of content crawled and indexed.


Just hit HN this morning and got the downtime message. Then I remembered this post and did a hard refresh to get back to normal. Browsers certainly aggressively cached it.


Same here, not sure why that isn't touted as a more important reason to provide the correct status codes rather than simply saying Google search isn't important.

Standards are important.


Completely agree. I didn't even think of hard refreshing until I saw the "server is back up" status on Twitter.


I learned about doing a hard refresh on Twitter (I didn't know about this post). I think it's a big problem; there are a lot of users who can't currently access the site.


I am not a news junkie, but I realized how dependent my newsfeed is on Hacker News. Seeing the outage message about 6-8 times reminded me why I keep coming back and reading the quality entries of this forum.


Welcome back, HN - I had an unusually productive (yet unstimulating) day at the office.


Don't take this the wrong way, but after seeing hundreds of identical HN-outage productivity jokes/references on Twitter today, I really hope this is the last one I'll see in a while.


I finally added tail-recursion optimisation to my Lisp interpreter while HN was down. Unless I hear otherwise, I'm taking the Onion crown!


Wasn't really meant as a joke, but I completely understand where you're coming from. Haven't really been on Twitter today.


Someone should write a twitter bot that monitors HN status and tweets "Hacker News down, productivity up".

What the hell, I'll do it.


I was 8% more productive today. Down-vote at will.


BTW I just realised that Chrome has been serving me a cached version of HN the whole time. Didn't realise it was up again. When did it go back on-line? An hour ago? More?


Were you a member at computer-forums.net like 8 years ago? I know it's a long shot, but I remember someone with the same username as yours. I know it's not the most unique username, but I thought I would give it a shot.


No, sorry. I did come up with the name a long time ago, though (like 1999 or something). I've used this username on /. for a while.


More than 10 hours ago


Thanks.


By the way, HN still says it is down at this URL:

https://news.ycombinator.com/


The down pages had long cache times, you just need to refresh.


@HNStatus helped keep me updated, but it wasn't posted on the maintenance page for everyone until the end of the day.


And because of the caching, it wasn't posted on the maintenance page at all if you visited before it was added. (I was seeing it on one machine and not the other).


Now I know why I had to refresh when I first opened HN 10 minutes ago, even though it was up at that time.


I didn't know this was up until somebody told me to Ctrl-reload. Nice one with the 10-year cached soft-500 page. :P

Now that our brittle forum is back, let's get back to work nitpicking the Android UIs that aren't quite beautiful enough! This is not enough drop shadows!


Even if the author is correct that Google's ranking of Hacker News will be affected by just 24 hours of downtime, wouldn't the algorithm update itself back to normal over the next 24 hours?


Seems like a fair point in principle, except in this case HN is one of the few sites I don't need Google to get to and don't care about any other tools that might rely on returned status codes.


Tools like web browsers?


Of course. foobarian sends an email with the URL to a service and it returns the webpage, which he reads in emacs.


I know it's supposed to make you more productive when it's down (I was), but suddenly I felt clueless during my commute (which can take 2 hours total).


For anyone interested in HTTP statuses this is a great resource:

http://httpstatus.es/


Those people certainly don't know their Latin. Just kidding.


Question: Is HNSearch still being worked on? It has broken "link" and "parent" links for search results.


I just realized I'm addicted to HN; I kept refreshing the page a million times :) Thanks, HN.


Some people were speculating that HN banned Google and other search engines, but at least from /robots.txt I can't see that. Do they do any IP-based filtering? Does anyone have information on that? I'm just curious.


Maybe this was a prior error message, but the original CloudFlare "Origin Server" error was returning a 520, which is a CloudFlare-custom HTTP status code.

Edit: CloudFront -> CloudFlare


It's CloudFlare, not CloudFront.


By the logic of this blog post, a page like status.heroku.com should return a 503 when heroku is experiencing downtime and a 200 otherwise.

200 means that the page loaded as intended (which it did). It turns out that some of the page's content (the interesting stuff) was unable to be loaded, and the site's content reflected that.

A 503 would be appropriate if there were a server problem, which might actually have been the case, but with Cloudflare's landing page there was not actually a server problem (since Cloudflare served the substitute content properly without error).


I disagree with this. By that logic, 404 pages should also return 200, because the error page stating that the content couldn't be found was indeed rendered successfully.

The difference between your status.heroku.com example and this one is that in the former case you are seeking a page that tells you about their status, whereas in this case you were seeking HN's index but instead got a page about status, because there was a problem preventing you from getting what you wanted.


So by that logic a wide variety of Cloudflare's serving modes ought to return a 503, including the one where it preserves cached content because the underlying server is unavailable.

Suppose HN consisted of two content panes and one content pane was unavailable, and the unavailable pane was replaced by a message indicating a partial outage, should HN return a non-200 response code then?

200 means "this response is being served as intended without the server serving it having an error". With Cloudflare acting as a hybrid caching proxy and static site, it is appropriate for it to return a 200.

If HN were not using Cloudflare, the underlying server would probably just show an automatically generated error message of some kind and return a 503 status.


No.

See, if I ask for, say, `https://news.ycombinator.com/item?id=7015129`, and I _get_ the page that URL names (this thread), then that's a 200. Whether it was delivered from a Cloudflare cache or not, I got that page.

If I ask for `https://news.ycombinator.com/item?id=7015129`, and I get a "Sorry, this service is temporarily unavailable" message instead, I did not get what I asked for (200 "OK"); I got something else, because the thing I asked for was 503 Service Unavailable.

It's all about what the URL identifies, and whether it was in fact successfully delivered or not.


True, and in the case of the HN page that was up during the outage, it was actually the intended content at the time: due to some backend problems, it was a customized static page, not an error page.


Not every webpage is created equally.

When we say API, we usually think of something like the Twitter API: something we can send JSON or XML to and get back something easily parseable.

Sites like HN don't have that kind of service. Instead, they return HTML. That's fine. No one said an API can't return HTML.

When some part of the Twitter API becomes unavailable, the API server should return 503 when we ask those endpoints for a response. If those broken APIs also power some part of the frontend, then the frontend will not work properly. For example, viewing thumbnails from the dashboard may be down, but the rest of the page is functioning. In that case, you can't send 503 from the frontend; it doesn't make sense. So a crawler reading Twitter.com shouldn't see a 503 when it hits the home page.

When HN is down, it should return 503. There are two reasons. First, HN is not functional anymore. The page you see may just be a maintenance page configured in Nginx, much like a 404 error page. Second, HN is itself an API service. When it is broken, it is broken; it just doesn't work anymore. When you try to access it through a Python script, it doesn't return in the format you expect. When you debug, you realize it is not returning anything like you were expecting because most of the HTML structure is gone. This is not an API format change; it's simply that the backend is not functional anymore.
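
Concretely, something like this sketch (the parsing step is only hinted at in a comment, since the details don't matter): with a 200, the status check passes and the failure only shows up later as an empty or confusing result, whereas a 503 would fail loudly at the request itself.

    # Why a 200 on the outage page bites scripts: nothing signals that the
    # content is a "we're down" message, so the status check passes.
    import urllib.request

    # Had the outage page returned 503, urlopen would raise
    # urllib.error.HTTPError right here and the failure would be obvious.
    resp = urllib.request.urlopen("https://news.ycombinator.com/")

    if resp.status == 200:
        html = resp.read().decode("utf-8")
        # ... hand off to the parser, which now chokes on (or silently
        # misreads) the outage message instead of the front page ...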

And semantically, when your site is in maintenance mode, 503 makes sense. 200 is not that evil; it just doesn't give anyone a better clue.


That is ridiculous. You think the status intended to be conveyed for all HN URLs was "it is up"? It was intended that any automated tools trying to figure out if HN was up or not would decide "Yes, it's up."?

If that was intended, then it was a poor, misguided, unhelpful intention.

But an HN worker has already said it was not intended; there was not much intention involved at all. They just had other things to worry about (getting the site back up) and weren't thinking about it, due to a lack of pre-planning for how to handle an outage. https://news.ycombinator.com/item?id=7016141

At any rate, your ideology of HTTP status codes does not seem to match that of the actual HTTP designers, or of anyone trying to actually use HTTP status codes for anything. If you aren't going to use HTTP status codes for anything, then it hardly matters what they are, so there's no point in arguing about it. But as soon as you try writing software that uses HTTP status codes for anything, you will start hating sites that give you error messages with 200 OK response codes.


Precisely, I think people try to put too much semantic meaning into HTTP status codes. If you read the spec, most of the exotic semantics are actually part of the WebDAV spec and not actually the HTTP spec.

For something like a pure REST API, basic status codes are fine. It's when you start breaking the REST abstraction (which a human-readable landing page surely does) that it gets tempting to start misusing status codes.

For any kind of non-RESTful or procedural API, HTTP status codes are simply not adequate and application-specific error handling is necessary (HTTP response 200, app-specific-error 999, etc.).
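
In practice that usually means some kind of error envelope in the body, e.g. (field names invented for illustration, not any particular API's format):

    # Hypothetical "transport says 200, body carries the real outcome" envelope.
    import json

    response_body = json.dumps({
        "status": "error",
        "error_code": 999,   # app-specific code, not an HTTP status
        "message": "validation failed: 'email' is not a valid address",
    })
    # Served as: HTTP/1.1 200 OK, Content-Type: application/json
    print(response_body)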


If you don't think any meaning should be taken from http status codes, then why set/send them at all, why do they exist?

If any meaning is going to be taken from them at all, then the difference between "OK" and "Server Error" seems pretty basic and fundamental.


Well, to illustrate my point just look at the WebDAV spec vs the HTTP spec. WebDAV shows how you can add more semantics to HTTP response codes.

But when you think about it, for a pure REST API you really only need classic HTTP response codes.

Response codes only make sense in the context of a resource. Once you introduce query strings or start to stretch the meaning of HTTP verbs the abstraction starts to leak.

So when designing an API it's smart to just handle errors at the application level rather than at the protocol level.

Protocol-level errors are for things that are outside of the application. An error response with data validation errors should return a 200, for example.


At my work, we have had arguments about what searches in a REST API should return in the case that no results (literally, in our parlance, "no documents") were found. Is that a 404 or a 200?


Via the W3 spec:

> The request has succeeded. The information returned with the response is dependent on the method used in the request, for example:

Is a query that turns up no documents, when there are indeed no documents, a success? I would argue yes, as the search service executed the query accurately... in fact, if the query turned up documents when it shouldn't have, that would probably be a problem...

But the way the 404 is worded, it would also fulfill the meaning of "no results".

> The server has not found anything matching the Request-URI. No indication is given of whether the condition is temporary or permanent. The 410 (Gone) status code SHOULD be used if the server knows, through some internally configurable mechanism, that an old resource is permanently unavailable and has no forwarding address. This status code is commonly used when the server does not wish to reveal exactly why the request has been refused, or when no other response is applicable.


Does your search page display a list of results, or does it redirect to the best result? If the former, I'd say that a list of zero is still a valid search results 'document', and should therefore return a 200. But the latter could certainly return a 404, although I doubt that's how your search actually works :)
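
As a toy illustration of that position (made-up data and handler, no framework): the search itself succeeded, so the status is 200 and the empty list is just part of the result document.

    # Toy search handler: zero hits is still a successful search.
    DOCS = {
        "http caching": ["RFC 7234 notes", "CDN stale-serving tips"],
        "status codes": ["503 vs 200 during outages"],
    }

    def search(query):
        results = DOCS.get(query.lower(), [])
        return 200, {"query": query, "count": len(results), "results": results}

    print(search("status codes"))  # (200, {... 'count': 1 ...})
    print(search("zzzzz"))         # (200, {... 'count': 0, 'results': []})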


Haha, that is the sort of argument I'd avoid having at work because it would lead to bloodshed.


Either would arguably be a reasonable choice, and therefore you should use whichever is going to be more convenient for the actual use cases you know about.

Which is almost always 200. Partially because that's what everyone will expect because that's what everyone else does.


Does anyone know the reason for the downtime?



CloudFlare really messes a lot of things up. I've seen CloudFlare refuse to give me error responses from forms before. Enter a bad value, get a cached page of the empty form, lol. The server was trying to return a page explaining the bad entry, but CloudFlare refused to send it to me because it had a non-200 response.


Ah, this explains why HN has been telling me it's down for 2 days now. I had to open it in incognito to realize it was a cache issue.


who cares...


Of hundreds of comments I have made on this site, only one has been snarky. Here's my second: chill your tits.



