The Google Stack [pdf] (malteschwarzkopf.de)
158 points by mrry on June 3, 2015 | 40 comments



It's interesting to compare this with the Facebook stack, drawn by the same author:

http://malteschwarzkopf.de/research/assets/facebook-stack.pd...

…and it would be doubly interesting to see the "related work" arrows between the different companies' infrastructure platforms. We'd need a 3D visualization for that though.


I'd take these charts with a grain of salt, at least the FB one. Peregrine is no longer in service, IIRC. It was a friendly rival to Scuba, and was eventually replaced by Presto, which is not even mentioned. Also not mentioned are several important things like the pub/sub ecosystem built on Scribe. Haystack is long dead, except for maybe some archived stuff. Lastly, PHP-to-C++ has not been a thing since early 2012.


Original author here.

This is interesting -- I did my best to find out what's still in use and what is deprecated based on public information; happy to amend anything that's incorrect. (If you have publicly accessible written sources saying so, that'd be ideal!)

Note that owing to its origin (as part of my PhD thesis), this chart only mentions systems about which scientific, peer-reviewed papers have been published. That's why Scribe and Presto are missing; I couldn't find any papers about them. For Scribe, the Github repo I found explicitly says that it is no longer being maintained, although maybe it's still used internally.

Re Haystack: I'm surprised it's deprecated -- the f4 paper (2014) heavily implies that it is still used for "hot" blobs.

Re HipHop: ok, I wasn't sure if it was still current, since I had heard somewhere that it's been superseded. Couldn't find anything definite saying so, though. If you have a pointer, that'd be great.

BTW, one reason I posted these on Twitter was the hope of getting exactly this kind of fact-checking going, so I'm pleased the feedback is coming in :-)


HipHop's replacement was pretty widely reported: http://www.wired.com/2013/06/facebook-hhvm-saga/

This link has papers on pub/sub, HHVM, and so on: https://research.facebook.com/publications/

Re Haystack: possible I am misremembering, or the project to completely replace Haystack stalled since I left.

If you want to gather a more complete picture of infrastructure at these companies I suggest, well, not imposing the strange limitation of only reading peer-reviewed papers. Almost none of the stuff I worked on ended up in conference proceedings.


Thanks, I've added HHVM and marked HipHop as superseded.

I also added Wormhole, which I think is the pub/sub system you're referring to (published in NSDI 2015: https://www.usenix.org/system/files/conference/nsdi15/nsdi15...).

Updated version at original URL: http://malteschwarzkopf.de/research/assets/facebook-stack.pd...

Regarding the focus on academic papers: I agree that this does not reflect the entirety of what goes on inside companies (or indeed how open they are; FB and Google also release lots of open-source software). Certainly, only reading the papers would be a bad idea. However, a peer-reviewed paper (unlike, say, a blog post) is a permanent, citable reference and part of the scientific literature. That sets a quality bar (peer review judged the description plausible and accurate) and keeps the amount of information manageable. The sheer number of other sources makes them impractical to write up concisely, and once you go beyond published papers it is hard to say what ought to be included and what should not.

I don't think anyone should base their perception of how Google's or Facebook's stack works on these charts and bibliographies -- not least because they will quickly be out of date. However, I personally find them useful as a quick reference to comprehensive, high-quality descriptions of systems that are regularly mentioned in discussions :-)


Other ways to get information:

1. Ask the developers via email.

2. Fly out to SF, visit the campus, have lunch.

That works for FB (I do it all the time); Google people won't tell you anything.


Re HipHop: I think HHVM + Hack (Facebook's internal "improved PHP") has superseded it, but while HHVM is open-sourced, Hack isn't public.


Hack is just a piece of HHVM, it's open: http://hacklang.org/


Ah thanks, I didn't realize it was out. I read a separate article that said it hadn't been released yet -- it was probably outdated.


Does Spanner talk to Bigtable? From reading the paper, I thought it was built directly on Colossus.


Ah, you're right!

I misread "This section [...] illustrate[s] how replication and distributed transactions have been layered onto our Bigtable-based implementation." in the paper as meaning that Spanner is partly layered upon BigTable, but what it really means is that the implementation is based upon (as in, inspired by) BigTable.

Spanner actually has its own tablet implementation as part of the spanserver (stored in Colossus) and does not use BigTable. I've amended the diagram to reflect this.


Carlos, you are mistaken re: Haystack. I don't think it was ever planned to be replaced, it's always been in use as a hot blob storage engine since it was launched. There are three storage layers for blobs at Facebook: hot (Haystack @ 3.6x replication), warm (F4 @ 2.1x), cold (Cold Storage @ 1.4x). We have published papers describing each layer.

Haystack still handles the initial writes and the burst of heavy read traffic when an object is fresh. It has a higher space overhead of 3.6x because it's optimized for throughput and latency versus cost savings. The Haystack tier's size is now dwarfed by the size of the F4 and Cold Storage tiers, but it still handles almost all of the reads for user blobs on Facebook due to the age/demand exponential curve.

After a Haystack volume is locked and has cooled down enough, it moves to F4, which bumps its replication factor way down for cheaper, longer-term online storage. Cold Storage is then used for the older stuff that gets barely any reads but that you still want to keep online in perpetuity.
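
To make the lifecycle concrete, here's a toy sketch of the placement decision. The tier names and replication factors come from the description above; the BlobVolume type and the read-rate thresholds are invented purely for illustration and are not Facebook's actual policy:

    from collections import namedtuple

    # Hypothetical stand-in for a blob volume's bookkeeping state.
    BlobVolume = namedtuple('BlobVolume', ['locked', 'reads_per_day'])

    REPLICATION = {
        'haystack': 3.6,      # hot: optimized for throughput and latency
        'f4': 2.1,            # warm: cheaper, lower replication overhead
        'cold_storage': 1.4,  # cold: barely read, kept in perpetuity
    }

    def pick_tier(vol):
        """Pick the tier a blob volume would live in (illustrative thresholds)."""
        if not vol.locked:
            # Still taking writes and the initial burst of reads.
            return 'haystack'
        if vol.reads_per_day > 10.0:   # invented cut-off
            return 'haystack'
        if vol.reads_per_day > 0.1:    # invented cut-off
            return 'f4'
        return 'cold_storage'

    for vol in [BlobVolume(False, 5000.0), BlobVolume(True, 1.0), BlobVolume(True, 0.01)]:
        tier = pick_tier(vol)
        print(tier, REPLICATION[tier])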

That is why the team that works on media storage is called Everstore; they take storing people's photos and videos seriously and view it as a promise to keep your memories available forever. It feels really good to see photos from 5+ years ago and have them still work, and someday there will be 30-year-old photos and videos on Everstore as well.

Source: I built Haystack with a couple other people and founded the Everstore team. :-)


Presto is a copy of dremel. Putting it all together:

gfs -> hdfs

bigtable -> hbase

google mapreduce -> hadoop

dremel -> presto

protocol buffers -> swift
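
For anyone who hasn't seen it, the MapReduce model that Hadoop clones boils down to a map phase, a shuffle, and a reduce phase. A toy in-memory word count in Python (just to show the shape of the model, not a real framework):

    from collections import defaultdict

    def map_phase(doc):
        # Mapper: emit (word, 1) for every word in the input record.
        for word in doc.split():
            yield (word, 1)

    def reduce_phase(word, counts):
        # Reducer: sum the counts for a single word.
        return (word, sum(counts))

    docs = ['the quick brown fox', 'the lazy dog']

    # Shuffle: group intermediate values by key (a real framework does this for you).
    groups = defaultdict(list)
    for doc in docs:
        for word, count in map_phase(doc):
            groups[word].append(count)

    print([reduce_phase(word, counts) for word, counts in groups.items()])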


protocol buffers -> thrift



Looking at this diagram, I'm struck by how much it diverges from the typical entities invoked when we describe a tech stack.

There's almost no discussion of language or framework or "patterns" in the software idiom sense. Instead it's largely a wide variety of data stores and access layers.

Bad programmers worry about code, good programmers worry about data structures?


> Bad programmers worry about code, good programmers worry about data structures?

Bingo. At this scale, all that matters are the bounds a service can guarantee. So long as it can comply with the bounds, the underlying algorithm (let alone language, patterns, etc.) doesn't matter.

Facebook's development of the HHVM is a perfect example of this—they took a nearly universally despised language (from an engineering perspective) and tooled it to fit into the necessary constraints. While crucial for explaining their stack, it's not in any way necessary for any of the individual nodes.


(Tedious disclaimer: my opinion, not my employer's. Not representing anybody else. I'm an SRE at Google, and use some of this stuff all the time.)

In all honesty, this diagram is missing so much information that you won't learn very much from it about how we do things (and all the things you can't see are confidential). It's more like a map of the relationships between the few bits that have been published.

However, I would tend to agree with your final suggestion that data structures are much more interesting than code.


I think it's a matter of scale, rather than quality - we obviously use languages and frameworks and patterns at Google.


Scale of operations at Google, or conceptual scale of the diagram?

And if you feel like language and framework choice plays a significant enough role at Google that "stack" in the more traditional sense is something that's carefully considered by engineering, I'd love to hear about the details.


The scale of the diagram, though obviously a diagram of that scale can only be drawn for an organization with operations at that scale - so kind of both.

I've mostly done backend/data stuff at Google, so I can't speak to traditional web dev decisions, but I've written design docs which discuss the tradeoffs between using Bigtable and Spanner, or Flume vs. Mapreduce, or one serving strategy vs. another, for some specific thing. Maybe those choices are vaguely analogous to choosing between Postgres or Mongo, nginx or Apache, etc. I imagine the guys who write webapps (not Search, obviously, but internal apps or things like our help pages) consider whether to use App Engine, Angular, Django, etc. on a per-app basis.


This diagram is about services that run on a machine and vend data to other machines, not the libraries or frameworks (GWT, etc.) that implement the services. This diagram is more "zoomed out".


Everyone worries about code, but on top of code you have architecture.

Worrying about code makes other programmers' lives easier, just like worrying about endpoints or architecture.


It's interesting that much of this is externalized through Google Cloud in one way or another:

- GFS/Colossus = GCS

- Borg = Kubernetes

- Dremel = BigQuery

- BigTable = BigTable (naturally)

- FlumeJava, MapReduce, MillWheel = Dataflow
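
To show what "FlumeJava/MapReduce/MillWheel = Dataflow" means in practice, here's the same kind of word count written as a single composed pipeline rather than explicit map/shuffle/reduce jobs. This sketch uses the Apache Beam Python API (the open-sourced descendant of the Dataflow SDK), not the internal FlumeJava API:

    import apache_beam as beam

    # One deferred pipeline; the runner (local, Dataflow, ...) decides how to
    # execute it, whether as batch stages or as a streaming job.
    with beam.Pipeline() as p:
        (p
         | 'read'  >> beam.Create(['the quick brown fox', 'the lazy dog'])
         | 'words' >> beam.FlatMap(lambda line: line.split())
         | 'count' >> beam.combiners.Count.PerElement()
         | 'print' >> beam.Map(print))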


Kubernetes is not an externalized Borg. They are completely separate products that do more-or-less similar things. Similarly, Dataflow is completely new code that does similar things.

As far as I know, GCS, BigQuery, and BigTable actually are externalized offerings of their internal equivalents.


Right, there are clearly levels of differences between Google-internal and external. Some more than others.


GFS/Colossus are distributed file systems, which GCS doesn't really address (I think GCS is akin to AWS' S3?)


Yep, but GCS is probably powered by GFS


GFS is so dead few people even remember using it. But to say that something is "based on GFS" or "based on Colossus" is to say very little. If it stores data it is eventually "based on Colossus". You could say just as much if you said it was "based on ext4".


GFS was phased out many years ago - it has been replaced by Colossus.


In what fashion is GCS not a distributed file system?


Google has many services. This diagram means little if it does not specify what services use what parts of the stack.


All services need to process and store large amounts of data. For that they need the building blocks forming the core stack, which most projects (YouTube seems to be the big exception, from what I hear) share. That is what's shown.


An interesting exercise in blind men describing the elephant.


Malte Schwarzkopf is the first author of the Omega paper:

http://eurosys2013.tudos.org/wp-content/uploads/2013/paper/S...

...so not entirely blind.


One of the reasons Omega failed was because its authors went up in an ivory tower and ignored everything about what was actually happening in Google datacenters for years. That a diagram this full of WTF could be produced by one of the Omega authors does not surprise me.


It looks like there might be some missing parts to this. I was under the impression Google has a layer of MariaDB databases somewhere.


Are you thinking of this? https://github.com/youtube/vitess


That is definitely it; it looks like YouTube uses MySQL/MariaDB. Thanks for the link.


How about Mesa?



