Distributed systems theory for the distributed systems engineer

catwell · on May 13, 2016

For people who are not already into distributed systems but want to get started, I blogged my (very) short list of things to read last year [1].

Today I would add a fifth item to that list: "Why Logical Clocks are Easy", which is one of the best explanations of causality I have seen so far [2].

[1] https://blog.separateconcerns.com/2015-07-07-four-easy-reads...

[2] http://queue.acm.org/detail.cfm?id=2917756
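For a feel of what "causality" means here, below is a minimal vector clock sketch in Python. It is not taken from the linked article; the function names (`tick`, `merge`, `happened_before`, `concurrent`) and the node names are made up for illustration, and clocks are plain dicts mapping a node name to a counter.

```python
def tick(clock, node):
    """Return a copy of `clock` with `node`'s counter advanced by one."""
    new = dict(clock)
    new[node] = new.get(node, 0) + 1
    return new


def merge(a, b):
    """Pointwise maximum of two clocks (done when receiving a message)."""
    return {k: max(a.get(k, 0), b.get(k, 0)) for k in a.keys() | b.keys()}


def happened_before(a, b):
    """True if the event stamped `a` causally precedes the event stamped `b`."""
    keys = a.keys() | b.keys()
    no_greater = all(a.get(k, 0) <= b.get(k, 0) for k in keys)
    strictly_less = any(a.get(k, 0) < b.get(k, 0) for k in keys)
    return no_greater and strictly_less


def concurrent(a, b):
    """True if neither of two distinct events causally precedes the other."""
    return not happened_before(a, b) and not happened_before(b, a)


# Hypothetical three-node exchange: A sends to B, C acts on its own.
a1 = tick({}, "A")             # A does something locally
b1 = tick(merge({}, a1), "B")  # B receives A's message, then ticks
c1 = tick({}, "C")             # C never hears from A or B

print(happened_before(a1, b1))  # a1 is in b1's causal past
print(concurrent(b1, c1))       # b1 and c1 are causally unrelated
```

The point is that the clocks only give a partial order: some pairs of events are simply unordered, which is exactly what a single wall-clock timestamp hides.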

hyperpape · on May 13, 2016

Then add it ;)

We can't keep everything we ever write up to date, but there's little point in reading someone's 2015 list of what to read when they've decided there's a good addition in 2016.

catwell · on May 13, 2016

You are right, I will.

Jupe · on May 13, 2016

Practically speaking, I've learned much more from Aphyr's Jepsen Test framework (and write-ups about test results) than from any other single source.

Ref: https://aphyr.com/tags/Jepsen

krat0sprakhar · on May 13, 2016

I recently took a distributed systems course (https://roxanageambasu.github.io/ds2-class/) in school and our professor referred us to Prof Steve Gribble's videos which, IMHO, are extremely informative and fun to listen to.

Couldn't recommend it more - http://courses.cs.washington.edu/courses/csep552/13sp/video/

Class Webpage - http://courses.cs.washington.edu/courses/csep552/13sp/

hyperpape · on May 13, 2016

I'm not sure how it would read to someone who hasn't been previously reading anything on the subject, but I like aphyr's notes on a two day course on distributed systems as a high level overview of the topics involved: https://github.com/aphyr/distsys-class.

marinabercea · on May 13, 2016

Great timing and submission, thank you for posting! I've been meaning to get more in depth knowledge on distributed systems, but despite having access to several academic (text)books, I felt overwhelmed and didn't know where to start exactly and what sub-topics I might want to focus on.

Just downloaded 'Distributed Systems for Fun and Profit', a book recommended in the article, and sent it to my Kindle. It's a free PDF written by an engineer currently working for Stripe. It's only 62 pages and doesn't feel intimidating!

sciurus · on May 13, 2016

I'm looking forward to the publication of Martin Kleppmann's book Designing Data-Intensive Applications.

http://shop.oreilly.com/product/0636920032175.do?sortby=publ...

ibash · on May 13, 2016

Join safari books online and start reading it -- it's good.

nazgob · on May 13, 2016

I want to but I never read incomplete books. And it's still missing a few chapters.

davidw · on May 13, 2016

> But I’ve come to thinking that recommending a ton of theoretical papers is often precisely the wrong way to go about learning distributed systems theory (unless you are in a PhD program). Papers are usually deep, usually complex, and require both serious study, and usually significant experience to glean their important contributions and to place them in context. What good is requiring that level of expertise of engineers?

Bingo! We need some "O'Reilly style" distributed systems material. Most of us are not going to be designing new algorithms, but plugging in various pieces. A general understanding of those pieces, where they work well, and when to actually go to the research is kind of missing right now in that world.

Some other links that people might find interesting:

http://videlalvaro.github.io/2015/12/learning-about-distribu...

http://book.mixu.net/distsys/single-page.html

http://dancres.github.io/Pages/

collyw · on May 13, 2016

Pretty much the same as most aspects of IT. How often does anyone write their own sort routine? How often does that sort of thing get asked in interviews?

craigching · on May 13, 2016

Probably one of the more used books (by universities) on the topic is "Distributed Systems: Principles and Paradigms" by Tanenbaum and van Steen. I just finished a class that used this book and I understand that there are criticisms of it, but it did seem to me to be reasonable given the breadth of the subject. And most, if not all, of those papers are covered to some degree in this book.

Something I'm looking forward to: Pearson has returned the copyrights of the book to the authors and they are supposedly updating it. Could be interesting: http://www.distributed-systems.net/index.php?id=distributed-...

The main web site says the 3rd edition is nearing completion.

johnbender · on May 13, 2016

Can anyone familiar with the linked material comment on whether there is a standard model used in the proofs there and in the DS literature?

I'm thinking of something like Lamport's global time model from "On interprocess communication".

einarvollset · on May 13, 2016

No there is not.

dschiptsov · on May 13, 2016

MIT biology courses teach very fine distributed systems theory.)

rollulus · on May 13, 2016

> Gwen Shapira, SA superstar and now full-time engineer at Cloudera [...]

Gwen is at Confluent, the Kafka company. Doing a great job there!

kod · on May 13, 2016

The post in the OP is from 2014

einarvollset · on May 13, 2016

(Before you down vote: I have a PhD in distributed systems and fault tolerance. Okay, now you can down vote for the douchebaggery of this prescript)

I think a fundamental and very underrated paper and concept (which actually predates Paxos, yet which Lamport either ignored or was unaware of) is the notion of randomized consensus protocols. They are simpler than "structured" leader-type algorithms. Believe Ben Or's algorithm was first.

mjb · on May 13, 2016

> Believe Ben Or's algorithm was first.

Ben-Or's "Another Advantage of Free Choice" beat Rabin's "Randomized Byzantine Generals" by a couple of months in 1983. These algorithms show how much people over-extend results like FLP. The result is about a very particular system model, and the addition of even a very tiny extra piece (in Ben-Or's case, a random oracle) makes the consensus problem possible again.
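To make the "tiny extra piece" concrete, here is a rough Python sketch of a Ben-Or style randomized binary consensus for crash faults. It is a lock-step simulation under loudly stated assumptions, not the protocol as published: nobody actually crashes, asynchrony is modelled by each process hearing from only a random n - f of the n senders per phase, and the name `ben_or` and its parameters are made up for illustration.

```python
import random


def ben_or(inputs, f, rng, max_rounds=10_000):
    """Simulate Ben-Or style binary consensus; return each process's decision.

    inputs: list of 0/1 initial values, one per process.
    f: number of crash faults tolerated (requires n > 2f).
    rng: a random.Random instance, used for message loss and coin flips.
    """
    n = len(inputs)
    assert n > 2 * f, "this crash-fault variant needs n > 2f"
    x = list(inputs)        # current preference of each process
    decided = [None] * n    # decision of each process, once made

    for _ in range(max_rounds):
        # Phase 1: broadcast preferences; each process hears only n - f of them.
        proposals = []
        for _p in range(n):
            heard = rng.sample(x, n - f)
            ones = sum(heard)
            zeros = len(heard) - ones
            if ones > n / 2:
                proposals.append(1)      # saw a strict majority of all n for 1
            elif zeros > n / 2:
                proposals.append(0)      # saw a strict majority of all n for 0
            else:
                proposals.append(None)   # no majority seen: propose nothing

        # Phase 2: broadcast proposals; again each process hears only n - f.
        for p in range(n):
            heard = rng.sample(proposals, n - f)
            backed = [v for v in heard if v is not None]
            if backed:
                # Two different values cannot both have a strict majority,
                # so all non-None proposals in a round agree.
                x[p] = backed[0]
                if len(backed) >= f + 1 and decided[p] is None:
                    decided[p] = backed[0]
            else:
                # The randomized step: flip a local coin.
                x[p] = rng.randint(0, 1)

        if all(d is not None for d in decided):
            return decided

    raise RuntimeError("no decision within max_rounds (vanishingly unlikely)")


# When everyone starts with the same value, it is decided in the first round.
print(ben_or([1, 1, 1, 1, 1], 2, random.Random(0)))
# With mixed inputs the decided value depends on the coin flips,
# but all processes end up with the same one.
print(ben_or([0, 1, 0, 1, 1], 2, random.Random(0)))
```

Termination here is only probabilistic: the loop keeps flipping coins until enough preferences line up, which happens with probability 1 but has no fixed bound on the number of rounds. That is the trade for stepping outside the deterministic model FLP talks about.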

I wouldn't say that these algorithms were really ignored by Lamport when he wrote the Paxos paper. Again, they're solving a different problem in a different system model. If you want to pick on Lamport, talk about Liskov's Viewstamped Replication.

If anybody has a digital copy of Ben-Or's paper that isn't partially cut off, please make it available. Both the copy in the ACM library and the only copy the author himself has are missing some of the right hand side.

einarvollset · on May 13, 2016

I disagree - an ex-colleague at Cornell wrote a paper proving equivalence. Will have to dig that up..

athenot · on May 13, 2016

An illustration is bees vs. flies.

A bee trapped indoors will go for the sun: a great algorithm for a forest/thicket but fatal in a house or car with windows. A fly will randomly try until it succeeds, meaning it is slower at escaping a thicket but will eventually find the open door even if it's opposite from the window.

elviejo · on May 13, 2016

Where did you study for your degree? I would be interested in doing one.