Hacker News
How Google and Facebook are using R (dataspora.com)
81 points by soundsop on Feb 20, 2009 | 22 comments


I've been working on a startup that has relied heavily on R for more than a year. R is not at all an easy language to use or learn. I have never really been able to find what I would call a good resource for learning R. The online manual has incomprehensible examples, very rarely shows the output, and gives very vague descriptions of what any method does. Additionally, all the datatypes are bizarre.

I would recommend that anyone trying to learn R digest all the online books/materials on the Rattle site, and then use Rattle heavily and look at the logs it produces.

Also, getting R to work with PHP is unpleasant and involves reading a fair amount of Spanish comments.

All that hate said, I really love R and I don't know what I would be doing if it didn't exist.


Out of curiosity, where / what do you do? Feel free to email in profile if you don't want it public. I'm always curious about what people are using R for professionally.


One neat thing about R is that it's become standard in academic statistics to include an R implementation of your new idea with a journal paper. For example, Gareth James and his colleagues came up with a new method called the Gauss-Dantzig estimator for doing prediction in the case where the number of parameters is much larger than the number of data points. You can download the R code from his research web page here: http://www-rcf.usc.edu/~gareth/research/

This makes it much, much easier to try out new prediction methods on your own data. No more having to write code from the paper's description and hoping (praying) that you didn't get it entirely wrong! Instead you can use the researcher's own code to quickly figure out whether the new method is better or worse than previous methods on your own data.

That being said, R does take a lot of getting used to. Graphics in general are tricky, although the ggplot2 package makes some things easier and can produce pretty results: http://had.co.nz/ggplot2/

There also isn't a great story for using R on massive data sets which don't fit in main memory, so far as I know. It doesn't take much before you start hitting data for which an algorithm that requires O(n^2) memory will eat > 15 GB of RAM. At that point you're out of the territory of Amazon instances you can rent cheaply and into building a box just for R, or you're into refactoring your data so you can do the computation in pieces. So you do have to watch out for that a bit when using the default packages.
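To put rough numbers on that, here is a back-of-the-envelope sketch in Python (not R, and not anything from the packages mentioned above). It assumes the O(n^2) structure is a dense n-by-n matrix of 8-byte doubles and reads "15 GB" as 15 GiB; both are my assumptions. The function names are made up for illustration. The last function shows the "do the computation in pieces" idea on the simplest possible case, a mean computed one chunk at a time.

```python
import math
from itertools import islice

BYTES_PER_DOUBLE = 8  # assumption: dense matrix of 64-bit floats
GIB = 2**30           # assumption: "GB" is read as GiB here


def dense_matrix_gib(n):
    """Memory in GiB needed for a dense n-by-n matrix of doubles."""
    return n * n * BYTES_PER_DOUBLE / GIB


def largest_n_within(gib):
    """Largest n whose dense n-by-n double matrix fits in `gib` GiB."""
    return math.isqrt(int(gib * GIB) // BYTES_PER_DOUBLE)


def chunked_mean(values, chunk_size=10_000):
    """Mean of an iterable, holding at most chunk_size items at a time."""
    iterator = iter(values)
    total, count = 0.0, 0
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            break
        total += sum(chunk)
        count += len(chunk)
    if count == 0:
        raise ValueError("cannot take the mean of an empty iterable")
    return total / count


if __name__ == "__main__":
    # Number of rows at which a dense distance-style matrix hits 15 GiB
    print(largest_n_within(15))
    # Memory for 50,000 rows, in GiB
    print(round(dense_matrix_gib(50_000), 1))
    # Same answer as a one-shot mean, without materializing the data
    print(chunked_mean(range(1_000_001), chunk_size=1000))
```

Under those assumptions the limit is somewhere around 45,000 rows, which is not a lot of data by web standards. Chunking works nicely for things like sums and means; whether a given O(n^2) algorithm can be split up that way depends on the algorithm.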


Web data mining is so due for a renaissance. At big companies, purge rules eliminate old data to save on storage costs. At startups the focus is always on getting a product out, adding features & getting users. When guys start figuring out how to tease real value out of the data, this is going to snowball. And R is the right platform for figuring it out.


How so? Can you elaborate?


Sure; I'm assuming you're asking about my assertion that "R is the right platform ... " I've used S-PLUS fairly extensively, and I think there's a lot in common between R and S-PLUS. But R doesn't have the licensing constraints, which makes all of the difference in the world. To me, R represents a good high-level prototyping language with a very extensive native statistical library.

I would also add that I'm much more attached to the idea that web data mining is the future, than I am attached to the idea that R is the best platform. We'll see what happens!


R is essentially an open source S-Plus implementation. S-Plus licensing can be an issue, particularly when you are running on a cluster. I worked at a place that used S-Plus, and they had to limit the number of S-Plus slots available for running jobs because of licensing.


  I think there's a lot in common between R and S-PLUS.
Indeed. R is sometimes called "GNU S".


That would probably be easier to search for...


Are there any good videos online for learning the statistics behind R? Most of the R videos I've seen have been focused on the tool, not the statistics behind the tool.


Statistics is statistics -- just get a good introductory statistics textbook. MIT's OpenCourseWare is probably a good start, maybe this: http://ocw.mit.edu/OcwWeb/Mathematics/18-443Fall2003/CourseH...


You might want to check out the "Stats 202" course which is based on a class given at Stanford. R is the tool used throughout the course:

http://stats202.com/

All lectures given by David Mease at Google have been recorded and are available online:

http://www.youtube.com/results?search_type=&search_query...


There are a whole bunch of books, at least:

http://www.amazon.com/s/ref=nb_ss_gw?url=search-alias%3Dstri...



I hope they post the video soon. I'm an amateur R user, and it'd be great to see actual use cases from statisticians & developers from Google, Facebook, et al.


A good post with interesting pointers for someone who, like me, is eyeing R. A solid background in statistics is required; it's time for me to catch up!


I was at the meetup and there were a few bits about how Facebook uses R, Hadoop, and Python for different levels of data analysis. The guys sitting behind me were talking during the whole presentation, which was really annoying... please don't do that.


I just recently heard of R, so don't chew me up for this...but...

How is R different than Matlab (besides the license issue)?


R is a different language (vectorized and partially functional) with a more comprehensive stats library, so it is better for "statsy" things, whereas MATLAB is better for more "mathy" things like linear algebra.


You may be thinking of Octave, not R. Unfortunately I don't know enough about Octave vs. R (or Matlab vs. S-PLUS) to comment on the similarities and differences.


What about MATLAB? I like using it.


Does R have a Java binding?



