That example was mostly a toy. The real power comes from getting interactivity for both small data and big data: with the same code, you could scale up to terabytes of data on a cluster and still get results relatively fast, which is not something you can do with R.
I don't expect to beat R at small scale yet. There is some low-hanging fruit for single-node performance. For example, even for single-node data, we incur a "shuffle" to do the data exchange in aggregations. This is done to ensure the single-node program and the distributed program go through the same code path, to catch bugs. If we want to optimize further for single-node performance, we can have the optimizer remove the shuffle operation in the middle and just run the aggregations. Then this toy example would probably finish in the 100ms range.
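To make the trade-off concrete, here is a toy sketch in plain Python (not the engine's actual code) contrasting the two paths: a distributed-style aggregation that first "shuffles" rows so every key lands in exactly one partition, versus a direct single-node aggregation that skips the exchange entirely. Both produce the same result; the shuffle path just does an extra routing pass, which is the overhead an optimizer could eliminate on a single node.

```python
# Toy sketch of the two aggregation paths described above.
# Function names and data are illustrative, not from any real engine.
from collections import defaultdict

def shuffle_then_aggregate(partitions, num_partitions=2):
    """Distributed-style path: exchange rows by key, then aggregate per partition."""
    # Shuffle: route each (key, value) row to a partition by hash of the key,
    # so all rows for a given key end up in the same partition.
    exchanged = [defaultdict(int) for _ in range(num_partitions)]
    for part in partitions:
        for key, value in part:
            exchanged[hash(key) % num_partitions][key] += value
    # Merge per-partition results (each key appears in exactly one partition).
    result = {}
    for part in exchanged:
        result.update(part)
    return result

def direct_aggregate(partitions):
    """Single-node fast path: one hash table, no data exchange."""
    result = defaultdict(int)
    for part in partitions:
        for key, value in part:
            result[key] += value
    return dict(result)

data = [[("a", 1), ("b", 2)], [("a", 3), ("b", 4), ("c", 5)]]
assert shuffle_then_aggregate(data) == direct_aggregate(data) == {"a": 4, "b": 6, "c": 5}
```

On a cluster the shuffle is unavoidable, since the data genuinely lives on different machines; on a single node it is pure overhead kept only to share one code path, which is exactly why removing it is low-hanging fruit.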