Processing Billion-Node Graphs on an Array of Commodity SSDs

Smerity · on May 20, 2015

I cannot recommend FlashGraph strongly enough. FlashGraph was one of the first graph computation engines to enable near trivial analysis of the Web Data Commons Hyperlink Graph - 3.5 billion web pages and 128 billion links.

For smaller graphs, analysis using FlashGraph is hilariously quick.

If you're interested in how this is achieved, refer to [1]. From memory, Da Zheng said he created FlashGraph primarily as he wanted to prove how efficient the storage system was. Full details on the FlashGraph (though FlashGraph runs far faster now!) in the paper at [2].

Note: I'm a data scientist at Common Crawl, the dataset that the Web Data Commons Hyperlink Graph is based upon and the main developer of Flash Graph, Da Zheng, wrote a guest post for us on this very topic[3], so I'm rightfully biased as to thinking this is an amazing project!

[1]: http://www.cs.jhu.edu/~zhengda/sc13.pdf

[2]: https://www.usenix.org/system/files/conference/fast15/fast15...

[3]: http://blog.commoncrawl.org/2015/02/analyzing-a-web-graph-wi...

mrry · on May 20, 2015

An interesting comparison point: a single core on a late-2014 MacBook Pro can achieve runtimes for the same graph that are within a factor of 4 for WCC (461 seconds for FlashGraph versus 1700 seconds for the laptop).

http://www.frankmcsherry.org/graph/scalability/cost/2015/02/... (previously on HN: https://news.ycombinator.com/item?id=9001618)

There are also results for PageRank on that graph, which make the difference more pronounced. FlashGraph runs PageRank in 2041 seconds (I'm assuming for 30 iterations, per Section 4 of the paper), whereas the laptop takes 46000 seconds for 20 iterations.

Smerity · on May 20, 2015

Absolutely spot on - between FlashGraph and Frank McSherry's COST work, the two have really pushed the envelope on efficient large scale graph analysis.

Frank McSherry wrote a "call to arms" for the broader graph community at [1]. The main point of interest is that academia generally compared their work with existing distributed graph processing systems, celebrating when any achievements were made, yet not aware of the significant overheads brought on by the distributed approach. Both Frank's work (run on a single laptop) and FlashGraph (run on a single powerful machine) run far faster than the distributed approach and have very few disadvantages.

Note: I'm a data scientist at Common Crawl and Frank's graph computation discussion article was a guest post at our blog.

[1]: http://blog.commoncrawl.org/2015/04/evaluating-graph-computa...

nojvek · on May 20, 2015

I wish there was a deeper technical explanation on how they made such a difference with a layer of file system. Would this also improve relational db perf?

corysama · on May 20, 2015

Searching for "set-associative file system" brings up their publications

http://www.ncbi.nlm.nih.gov/pubmed/24402052

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3881961/

https://github.com/icoming/FlashGraph

jheriko · on May 20, 2015

i can't help but wonder if this has been tested vs. just letting the os virtualise memory for you... i'm assuming yes.