- example summary, for better topic embedding
- RAG based summary, to have the model critically assess its training data distribution and answer questions on it; to bring together information sitting in separate examples
- named entities, for knowledge base; maybe it helps with fact checking later
- implicit tasks present in the text, i.e. what are the tasks an LLM could learn from a given example?
- chain-of-thought augmentation, to bring out implicit deductions and reduce information fragmentation; the Phi-1.5 paper and Orca have shown that synthetic CoT datasets are superior source material
What data fragmentation? Look at the Reversal Curse paper: models trained on "A is the father of B" fail to generate "B is the son of A". This kind of connection needs to be added explicitly, and doing so would improve task solving as well.
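A minimal sketch of what that explicit augmentation could look like; the relation names and templates below are made up for illustration, not taken from any released dataset:

```python
# Hypothetical sketch: emit both directions of a relation so the model sees
# "A is the father of B" and the reversed statement during training.

# Inverse templates are illustrative assumptions.
TEMPLATES = {
    "father_of": ("{a} is the father of {b}.", "{b} is a child of {a}."),
    "capital_of": ("{a} is the capital of {b}.", "{b}'s capital is {a}."),
}

def augment(triples):
    """Yield forward and reversed statements for each (a, relation, b) triple."""
    for a, rel, b in triples:
        forward, backward = TEMPLATES[rel]
        yield forward.format(a=a, b=b)
        yield backward.format(a=a, b=b)

if __name__ == "__main__":
    triples = [("Tom", "father_of", "Mary"), ("Paris", "capital_of", "France")]
    for line in augment(triples):
        print(line)
```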
Training on purely organic data is no longer good enough. All powerful models train on a mix of organic and synthetic data, some at roughly 50-50 proportions, like the web+synth variant from Phi-1.5.
The main idea is to go deeper into the raw data, to infuse it with insight. LLM dataset preprocessing is going to be expensive, comparable to training costs, but the results are worth the effort.
Thanks for the suggestion! We will add this to the pool of features for a future release. (We are currently running the existing 40+ annotations on the `tail` partitions.)
If you are interested in contributing the code for these features, feel free to open a PR to https://github.com/togethercomputer/RedPajama-Data! Otherwise we will attempt a best-effort implementation :) but we hope this can become a community effort.
I feel this is usually a hugely asymmetric problem. The other example I've seen is the model being able to easily complete "The color of the sky is" with "blue", but then failing to complete "Blue is the color of" with "the sky". I say, d'uh, why would it?
If you take the training data, or in general, imagine taking all that humans ever wrote or spoke in English to date, you'll expect to find an overwhelming amount of cases where "The color of the sky is" ends with "blue". However, "Blue is the color of" can easily have a hundred thousand different plausible completions, and "the sky" won't even be one of the more likely ones. In the absence of additional context that strongly hints at the answer, one should NOT expect a properly working LLM to frequently propose "the sky" as completion to "Blue is the color of".
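As a rough way to see that asymmetry yourself, you could compare the log-probability a small causal LM assigns to each direction. The sketch below uses GPT-2 via Hugging Face transformers purely because it is small and public; it is an illustration, not a rigorous evaluation:

```python
# Sketch: compare how likely a small causal LM finds the two directions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def continuation_logprob(prompt: str, continuation: str) -> float:
    """Sum of log-probabilities the model assigns to `continuation` given `prompt`."""
    prompt_ids = tok(prompt, return_tensors="pt").input_ids
    full_ids = tok(prompt + continuation, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # log_probs[i] is the distribution over the token at position i + 1
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    total = 0.0
    for pos in range(prompt_ids.shape[1], full_ids.shape[1]):
        total += log_probs[pos - 1, full_ids[0, pos]].item()
    return total

print(continuation_logprob("The color of the sky is", " blue"))
print(continuation_logprob("Blue is the color of", " the sky"))
```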
Can someone explain to me like a noob how this ("this" being the data hosting and download access) works? Am I understanding correctly that they are releasing code for filtering common crawl data that is out there, and the result of this filtering is the dataset?
To further elaborate on this (possibly wrong) understanding:
- Each person can then run their own processing, possibly duplicating effort(?)
...but on the good side, giving each person the ability to tweak the pipeline to suit their needs.
- There is no torrent of already processed data because __________?
- Looking at file lists for this on Hugging Face, some files seem to be stored in Git Large File Storage. Are these already processed files that together constitute the dataset? Or are these Common Crawl files that are selectively listed and pulled for processing?
What options are there to preemptively obtain a copy, in case of any possible eventual takedown of the dataset, any assurances about access aside? I am reminded of parts of the Pile.
Obviously I'm super clueless here... please be gentle and share anything you know or correct anything I've got wrong.
I'm not asking about training, if that wasn't obvious. Just about obtaining the dataset.
(A) the dataset after pre-processing the raw CommonCrawl data (e.g., text extraction and language identification) and some minimal filtering; and
(B) for each document in (A), we also pre-computed 40+ "features" (which we call "quality annotations") that you can use to further filter or deduplicate it. For example, one such feature is "how similar this document is to Wikipedia".
--
(A) is around 30T tokens, but you might want to use features in (B) to further filter/dedup it down, e.g., to 5T. For example, if in your application documents similar to Wikipedia are the most helpful documents, you can take the top documents with the highest score for the feature "how similar this document is to Wikipedia". Of course, the really interesting case happens when you consider a larger subset of these features (or maybe even automatically learn what the best way of filtering it is).
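A rough sketch of what that kind of filtering could look like. The file names, the lockstep layout of documents vs. quality signals, and the field name `wikipedia_similarity` are placeholders for illustration; check the RedPajama-Data repo for the real schema:

```python
# Sketch: keep only documents whose "similarity to Wikipedia" signal exceeds
# a threshold. File and field names are illustrative placeholders.
import gzip
import json

def filter_by_signal(docs_path, signals_path, out_path,
                     field="wikipedia_similarity", threshold=0.5):
    """Stream documents and their quality annotations together, keep high-scoring ones."""
    kept = 0
    with gzip.open(docs_path, "rt") as docs, \
         gzip.open(signals_path, "rt") as signals, \
         gzip.open(out_path, "wt") as out:
        for doc_line, sig_line in zip(docs, signals):
            score = json.loads(sig_line).get(field, 0.0)
            if score >= threshold:
                out.write(doc_line)
                kept += 1
    return kept

# kept = filter_by_signal("documents.jsonl.gz", "quality_signals.jsonl.gz", "filtered.jsonl.gz")
```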
Our goal is to make this as flexible as possible so that you can fit it into your own application. What we have released is both (A) and (B).
If you have any questions, please let us know! Thanks for your interest, have fun with the data!
Yes, a small clarification: the 1TB per dump refers to the head+middle partition of the dataset and includes the text documents and the quality signals. There is another ~700GB for the minhash signatures and 1-1.5TB for the documents in the tail split.
Prediction as an objective basically forces the models to model the causal processes that create the text itself. It's not going to stop getting better unless the data is insufficient/unvaried or the architecture creates a bottleneck.
I think by the time the former is an "issue", we'll have a Super Intelligence on our hands anyway.
The latter is looking less and less likely to be a real hurdle. Very little inductive bias to steer away from crucial solutions, very scalable.
The TinyLlama project is trying to do that pushing by training a small 1.1 billion-parameter model on 3 trillion tokens: https://github.com/jzhang38/TinyLlama
Nice. Hope somebody makes a torrent of it/ hosts it in a way that it can't be taken down.
Also, what are some estimates of how many tokens of text are out there? Seems like we are hitting that number pretty quick?
> Seems like we are hitting that number pretty quick?
I don't think we're even close. Libgen's nonfiction archive alone is over 32 terabytes. Total size last year was over 120 terabytes. Between that, SciHub, and the internet, there's probably orders of magnitude more tokens out there.
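Back-of-envelope, assuming the archive were mostly extractable plain text at roughly 4 bytes per token (both assumptions are very rough, since ebooks and PDFs carry a lot of non-text overhead):

```python
# Very rough token estimate: bytes / bytes-per-token. Both numbers are guesses.
nonfiction_tb = 32              # Libgen nonfiction size mentioned above, in terabytes
bytes_total = nonfiction_tb * 10**12
bytes_per_token = 4             # common rule of thumb for English text
print(bytes_total / bytes_per_token / 1e12, "trillion tokens")  # ~8 trillion
```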
I don't know about orders of magnitude left, but we're definitely not close yet. This is just 5 languages (and frankly not even the 5 with the most text) and, just as importantly, just what is crawlable from the web. There's tons of stuff in popular ebook archives you can't crawl from the web.
This is also relatively scant on code and scientific corpora.
Super cool people are doing this. But I wonder: how will training data be any different from password lists of yore, which were the arms race secret sauce that no one ever shared?
There are so many articles these days posted on HN like this recently but I'm realizing I am too far out of touch with the technology to be able to appreciate it.
Any recommendations as to how I get a bit of hands on experience in the AI "domain" so when I read some news articles like this it means something more to me? Or is this type of thing really only relevant to a very small subset of software people?
I’ve been impressed with “fuzzy” deduplication at this data scale. I’ve used minhash and networkx for small amounts of data, but I really appreciated the write up on your GitHub about how you implemented it for this dataset.
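For reference, here's roughly what that small-scale approach looks like. This is a sketch using the `datasketch` and `networkx` packages, not the pipeline from the repo; thresholds and shingle size are arbitrary choices:

```python
# Sketch of small-scale fuzzy dedup: MinHash + LSH to find near-duplicate pairs,
# then connected components so each duplicate cluster keeps a single document.
import networkx as nx
from datasketch import MinHash, MinHashLSH

def minhash(text, num_perm=128):
    m = MinHash(num_perm=num_perm)
    for shingle in {text[i:i + 5] for i in range(len(text) - 4)}:  # character 5-grams
        m.update(shingle.encode("utf8"))
    return m

def dedup(docs, threshold=0.7):
    """Return one representative document id per near-duplicate cluster."""
    lsh = MinHashLSH(threshold=threshold, num_perm=128)
    sketches = {}
    graph = nx.Graph()
    for doc_id, text in docs.items():
        graph.add_node(doc_id)
        sketches[doc_id] = minhash(text)
        lsh.insert(doc_id, sketches[doc_id])
    for doc_id, m in sketches.items():
        for match in lsh.query(m):
            if match != doc_id:
                graph.add_edge(doc_id, match)
    return [min(component) for component in nx.connected_components(graph)]

docs = {"a": "the quick brown fox jumps over the lazy dog",
        "b": "the quick brown fox jumped over the lazy dog",
        "c": "completely different text about dataset deduplication"}
print(dedup(docs))
```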
We did an exact dedup across all 84 dumps; there are 100T tokens before this exact dedup, and 30T tokens after. If we do further fuzzy dedup (we have minhash signatures pre-computed for different similarity levels), this can potentially be reduced further.
There are quite a lot of redundancies across dumps, but also a lot of unique/distinct documents.
People say this like it’s a fact. Until the courts decide otherwise or your AI model is regurgitating copyrighted data verbatim, generative AI is probably not violating copyright.
The annoying part is that if they do decide it’s infringement, then open source AI models won’t be allowed to know anything about copyrighted works. It’ll be a big blind spot.
I vaguely remember a short science fiction story where people got uploaded to the cloud. There were three options for your memories.
With the expensive one you got to keep all your memories of copyrighted music, video, and books. With the medium-priced one they were replaced with public domain stuff. With the cheap one it was all replaced with advertising.
The results don't exactly match your description, but a cloud-run self in which multiple plans are available and in which the implications of copyright enforcement and advertising-supported options figure prominently was depicted in the short video “Welcome to Life: the singularity, ruined by lawyers” on Tom Scott's channel. https://www.youtube.com/watch?v=IFe9wiDfb0E
If it gets us to AGI faster, I frankly don't give a fuck.
AGI-driven drug discovery will save billions of lives. Every day it is delayed costs tens of thousands of lives. No amount of copyright is worth that sacrifice.
"See, those things, they can work real hard, buy themselves time to write cookbooks or whatever, but the minute, I mean the nanosecond, that one starts figuring out ways to make itself smarter, Turing'll wipe it. Nobody trusts those fuckers, you know that. Every AI ever built has an electromagnetic shotgun wired to its forehead." - Neuromancer, William Gibson, 1984.
Self-fulfilling prophecy - if people don't trust AI, and a proverbial "electromagnetic shotgun" ends up wired to its forehead, it will be because of the doomers who insist we cannot trust AI. This is why we can't have nice things. How I detest doomers.
Are you envisioning AGI as something akin to a pet? A fun talking robot? GPT4?
Imagine you just woke up on the planet of the apes. You smile and act friendly because you don’t want them to beat you with their clubs. You start helping them with things. Apply some elementary logic that they can’t seem to get, but they appreciate your contributions. But to keep themselves safe from you, they’ve locked up their sharpest sticks and won’t let you touch them.
Are their preventative measures sufficient? Do you even need their sharp sticks to accomplish your goals? Hey, what are your goals anyway? Do they “align” with the apes?
This is the analogy I often find myself gravitating towards, except that I also add that the apes are about 1000x slower than you in speed of thought, so you see everything happening in super slo-mo. Add to that the fact that we've now fully 'unboxed' the AI and put it on the internet, which also means opening all the confession boxes and private communications to it. Yeah, it's all gonna go great!
If you have any doubts about all our private comms being available to an AGI, I'll remind you that we already have tonnes of societal data collected by government-mandated backdoors in our infrastructure everywhere. An AGI just has to hijack that, which, guess what, they are already training one for.
Most AI safety debates I watch (and I watch many) mostly go like: trust us, we'll get it right. That's like the apes in your analogy going, oh, never mind the humans, they will be kind to us because we've trained them to be. Yeah, right.
Okay, are you willing to risk all of humanity and the entire biosphere on that glimmer of hope? Even if you or the SV VC types are that stupid, why should they get to decide the fate of all of us? How would the US feel if a bunch of VCs in China decided to build a 'not-well-tested, potentially-kill-all-humans AGI'?
I don't think it's certain the AI kills everyone, but it's not certainly impossible either. It depends on how the AI works and what it "wants" for some value of "wants".
Humans have not been particularly kind to the species less intelligent than us. Why would we anticipate being well treated by an entity more intelligent than us?
Even if we're not that relevant to a super intelligence, creating one forfeits human control over the Earth and known universe to the machines. Right now, in a century or two or three or whatever - we might be building Dyson Spheres and colonizing the galaxy. In an alternate and plausible timeline the machines are doing that and we are not.
And moreover, what makes people even think that a desire to commit mass murder is an innate characteristic of an 'intelligent' being, that increases the more 'intelligent' it becomes? (If they believe themselves to be 'intelligent', do they believe they have a greater desire to commit mass murder?)
If it isn't aligned with humans and doesn't understand exactly what humans want it to do, then you're just made of matter that it doesn't know not to repurpose. It's not that it makes a deliberate decision to "kill" you, it's that understanding "kill" to a sufficient degree to not do it as a side effect is halfway to alignment already.
When we plow a field, we don't check for mouse burrows first. When we cut down a tree for lumber, we don't check for ants in the way of the chainsaw.
Preempting the obvious response: if your thought is "we don't let AI do those things directly, we just ask it for information", consider that for a sufficiently powerful and unaligned AI, you don't have to let an AI out of the box, it can let itself out. (And that's leaving aside that we hand some AIs Internet access.)
I'm surprised by the low number of native German speakers in that list. That's less than the population of Germany alone. And then there's Austria, parts of Switzerland, northern Italy, eastern Belgium...