I'm gathering statistics on words, phrases, and a few other features from a medium-sized (about a terabyte) corpus. Several billion distinct items occur more than once (i.e., aren't hapaxes), and that's just the initial feature-collection pass. Next I look for correlations among those features. Making it all fit in 32 GB is a challenge: a lot of effort goes into bit-twiddling to pack things tightly, and into algorithms that try to be intelligent about what to keep and what to discard.
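One common family of "what to keep, what to discard" techniques for this kind of workload is streaming heavy-hitter summaries, which bound memory by tracking only candidate frequent items. Here's a minimal sketch of the Misra-Gries algorithm as an illustration (this is my own example, not necessarily the approach used here):

```python
def misra_gries(stream, k):
    """Misra-Gries summary using at most k-1 counters.

    Any item occurring more than n/k times in a stream of n items is
    guaranteed to survive; counts are undercounts by at most n/k.
    """
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No room: decrement every counter and drop any that hit zero.
            # This is what quietly discards the long tail of rare items.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters

# Frequent words survive; hapax-like items get squeezed out.
stream = ["the", "a", "the", "of", "the", "xyzzy", "the", "a"]
print(misra_gries(stream, 3))
```

The appeal for a terabyte-scale pass is that memory is fixed up front (k counters) regardless of how many distinct items the corpus contains; a second pass can then compute exact counts for just the surviving candidates.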