RedPajama v2 Open Dataset with 30T Tokens for Training LLMs (together.ai)
236 points by programd on Oct 30, 2023 | hide | past | favorite | 60 comments


Great work, may I suggest more analysis features?

- example summary, for better topic embedding

- RAG-based summary, to have the model critically assess its training-data distribution and answer questions about it; to bring together information sitting in separate examples

- named entities, for knowledge base; maybe it helps with fact checking later

- implicit tasks present in the text: what tasks could an LLM learn from a given example?

- chain-of-thought augmentation, to bring out implicit deductions and reduce information fragmentation; the Phi-1.5 paper and Orca have shown that synthetic CoT datasets are superior source material

What information fragmentation? Look at the Reversal Curse paper. Models trained on "A is the father of B" fail to generate "B is the son of A". This kind of connection needs to be added explicitly, and doing so would improve task solving as well.

Training on purely organic data is not good enough anymore. All powerful models train on a mix of organic and synthetic data, some models on 50-50 proportions, like the web+synth variant from Phi-1.5.

The main idea is to go deeper into the raw data, to infuse it with insight. LLM dataset preprocessing is going to be expensive, comparable to training costs, but the results are worth the effort.
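As an illustration of the reversal idea above, a minimal sketch of this kind of augmentation might look like the following; the relation table and inverse phrasings are hypothetical, purely for illustration:

```python
# Hypothetical sketch: augment training text with reversed restatements of
# relational facts, to mitigate the Reversal Curse
# ("A is the father of B" -> "B is a child of A").
import re

# Map a forward relation to an (order-reversed) inverse phrasing.
# These entries are illustrative, not from any real pipeline.
INVERSE = {
    "is the father of": "is a child of",
    "is the mother of": "is a child of",
    "is the author of": "was written by",
}

def augment_with_reversals(text: str) -> list[str]:
    """Return the original sentence plus any reversed restatements found."""
    out = [text]
    for fwd, inv in INVERSE.items():
        m = re.match(rf"(.+?) {re.escape(fwd)} (.+?)\.?$", text)
        if m:
            a, b = m.group(1), m.group(2)
            out.append(f"{b} {inv} {a}.")
    return out

print(augment_with_reversals("Tom is the father of Bob"))
# -> ['Tom is the father of Bob', 'Bob is a child of Tom.']
```

A real implementation would of course need relation extraction far more robust than a regex, but the principle is the same: make the reverse direction explicit in the training data.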


Thanks for the suggestions! We will add these to the pool of features for a future release. (We are currently running the existing 40+ annotations on the `tail` partitions.)

If you are interested in contributing code for these features, feel free to open a PR at https://github.com/togethercomputer/RedPajama-Data! Otherwise we will make a best-effort implementation :) but we hope that this can become a community effort.

(Feel free to create more issues on GitHub for us to keep track. I created one for this: https://github.com/togethercomputer/RedPajama-Data/issues/76)


"B is the son of A" doesn't follow from "A is the father of B".

B could be A's daughter.


I feel this is usually a hugely asymmetric problem. The other example I've seen is the model being able to easily complete "The color of the sky is" with "blue", but then failing to complete "Blue is the color of" with "the sky". I say, d'uh, why would it?

If you take the training data, or in general, imagine taking all that humans ever wrote or spoke in English to date, you'll expect to find an overwhelming amount of cases where "The color of the sky is" ends with "blue". However, "Blue is the color of" can easily have a hundred thousand different plausible completions, and "the sky" won't even be one of the more likely ones. In the absence of additional context that strongly hints at the answer, one should NOT expect a properly working LLM to frequently propose "the sky" as completion to "Blue is the color of".


You're right; replace "son" with "child".


Can someone explain to me like a noob how this ("this" being the data hosting and download access) works? Am I understanding correctly that they are releasing code for filtering common crawl data that is out there, and the result of this filtering is the dataset?

To further elaborate on this (possibly wrong) understanding:

- Each person can then run their own processing, possibly duplicating effort(?)

...but on the good side, giving each person the ability to tweak the pipeline to suit their needs.

- There is no torrent of already processed data because __________?

- Looking at file lists for this on Hugging Face, some files seem to be stored in Git Large File Storage. Are these already processed files that together constitute the dataset? Or are these Common Crawl files that are selectively listed and pulled for processing?

What options are there to preemptively obtain a copy, in case of any eventual takedown of the dataset, assurances about access aside? I am reminded of parts of the Pile.

Obviously I'm super clueless here... please be gentle and share anything you know or correct anything I've got wrong.

I'm not asking about training, if that wasn't obvious. Just about obtaining the dataset.


What we make available is:

--

(A) the dataset after pre-processing the raw CommonCrawl data (e.g., text extraction and language identification) and some minimal filtering; and

(B) for each document in (A), we also pre-computed 40+ "features" (which we call "quality annotations") that you can use to further filter or deduplicate it. For example, one such feature is "how similar this document is to Wikipedia".

--

(A) is around 30T tokens, but you might want to use features in (B) to further filter/dedup it down, e.g., to 5T. For example, if in your application documents similar to Wikipedia are the most helpful documents, you can take the top documents with the highest score for the feature "how similar this document is to Wikipedia". Of course, the really interesting case happens when you consider a larger subset of these features (or maybe even automatically learn what the best way of filtering it is).
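A minimal sketch of that kind of filtering, assuming a simplified flat layout where each document carries its score directly (the real RedPajama-v2 files store quality signals in a separate, span-based format):

```python
# Sketch of filtering documents by a quality annotation such as
# rps_doc_ml_wikipedia_score. The flat dicts below are a simplification;
# the actual dataset keeps quality signals in separate per-span records.

docs = [
    {"text": "An encyclopedia-style article ...", "rps_doc_ml_wikipedia_score": 0.92},
    {"text": "BUY NOW!!! limited offer ...",      "rps_doc_ml_wikipedia_score": 0.04},
    {"text": "A news report about ...",           "rps_doc_ml_wikipedia_score": 0.55},
]

def top_by_signal(docs, signal, k):
    """Keep the k documents scoring highest on a given quality signal."""
    return sorted(docs, key=lambda d: d[signal], reverse=True)[:k]

kept = top_by_signal(docs, "rps_doc_ml_wikipedia_score", k=2)
print([d["rps_doc_ml_wikipedia_score"] for d in kept])  # -> [0.92, 0.55]
```

The interesting case the parent mentions, combining several signals, would replace the single `signal` key with some learned or hand-tuned scoring function over all 40+ annotations.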

Our goal is to make this as flexible as possible so that you can fit it to your own application. What we have released is both (A) and (B).

If you have any questions, please let us know! Thanks for your interest, and have fun with the data!


Thanks.

> how similar this document is to Wikipedia

So that’s a measure of how similar it is to the background vector of all (language in focus) Wikipedia data?


There are actually a few ways to do this; we provide four:

- `rps_doc_ml_wikiref_score`: a classifier that distinguishes random webpages from pages cited as references on Wikipedia (used in Llama-1)

- `ccnet_perplexity`: perplexity of an LM trained on Wikipedia (used in CCNet)

- `rps_doc_ml_wikipedia_score`: classifier prediction for the document being a Wikipedia article

- `rps_doc_wikipedia_importance`: Used in https://arxiv.org/abs/2302.03169

You can see the full table here: https://together.ai/blog/redpajama-data-v2
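To illustrate the `ccnet_perplexity` idea, here is a toy sketch where a unigram model stands in for CCNet's actual 5-gram KenLM model trained on Wikipedia; lower perplexity means the document looks more like the reference text:

```python
# Toy illustration of perplexity as a quality signal: score a document by
# the perplexity of a language model trained on reference text (Wikipedia
# in CCNet). A word-unigram model with add-alpha smoothing stands in for
# CCNet's real 5-gram KenLM model.
import math
from collections import Counter

reference = "the cat sat on the mat the dog sat on the rug".split()
counts = Counter(reference)
total = len(reference)

def perplexity(doc: str, alpha: float = 1.0) -> float:
    """Unigram perplexity with add-alpha smoothing over the reference vocab."""
    vocab = len(counts) + 1  # +1 bucket for unseen words
    logp = 0.0
    words = doc.split()
    for w in words:
        p = (counts[w] + alpha) / (total + alpha * vocab)
        logp += math.log(p)
    return math.exp(-logp / len(words))

# In-domain text scores lower (better) than out-of-domain text.
print(perplexity("the cat sat") < perplexity("quantum flux capacitor"))  # -> True
```

Filtering then amounts to keeping documents whose perplexity against the reference model falls below some threshold.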


Anyone know how large it is?

They state the 1 trillion token dataset is 5TB.

Is it safe to assume this is 5TB * 30 = 150TB?

The code in the HuggingFace repo downloads data from url base: https://data.together.xyz/redpajama-data-v2/v1.0.0

https://huggingface.co/datasets/togethercomputer/RedPajama-D...


It is around 100TB (84 CommonCrawl dumps, roughly 1TB per dump)


Yes, with a small clarification: the 1TB per dump refers to the head+middle partitions of the dataset and includes the text documents and the quality signals. There is another ~700GB for the minhash signatures and 1-1.5TB for the documents in the tail split.


This is a lot of tokens. Llama 2 was trained on two trillion tokens [1].

[1] https://arxiv.org/abs/2307.09288


Loss was still decreasing for those models; there's a sense that we can push the amount of training data much, much further than we currently do.


Yup. I found this article quite enlightening:

https://espadrine.github.io/blog/posts/chinchilla-s-death.ht...


Phenomenal blog post about scaling laws.


Prediction as an objective basically forces the models to model the causal processes that create the text itself. It's not going to stop getting better unless the data is insufficient/unvaried or the architecture creates a bottleneck.

I think by the time the former is an "issue", we'll have a Super Intelligence on our hands anyway.

The latter is looking less and less likely to be a real hurdle. Very little inductive bias to steer away from crucial solutions, very scalable.


The TinyLlama project is trying to do that pushing by training a small 1.1 billion-parameter model on 3 trillion tokens: https://github.com/jzhang38/TinyLlama


And Llama 2's training data was less aggressively deduplicated.


Nice. Hope somebody makes a torrent of it/ hosts it in a way that it can't be taken down. Also, what are some estimates of how many tokens of text are out there? Seems like we are hitting that number pretty quick?


> Seems like we are hitting that number pretty quick?

I don't think we're even close. Libgen's nonfiction archive alone is over 32 terabytes. Total size last year was over 120 terabytes. Between that, SciHub, and the internet, there's probably orders of magnitude more tokens out there.


I don't know about orders of magnitude left, but we're definitely not close yet. This is just 5 languages (and frankly not even the 5 with the most text), and just as importantly, just what is crawlable from the web. There's tons of stuff in popular ebook archives you can't crawl from the web.

This is also relatively scant on code and scientific corpora.

We're just getting started.


Super cool people are doing this. But I wonder: how will training data be any different from password lists of yore, which were the arms race secret sauce that no one ever shared?


What password lists?


Password cracking lists


There are so many articles these days posted on HN like this recently but I'm realizing I am too far out of touch with the technology to be able to appreciate it.

Any recommendations as to how I get a bit of hands on experience in the AI "domain" so when I read some news articles like this it means something more to me? Or is this type of thing really only relevant to a very small subset of software people?


There's a course available here [0] that might interest you.

[0] https://www.fast.ai


I’ve been impressed with “fuzzy” deduplication at this data scale. I’ve used minhash and networkx for small amounts of data, but I really appreciated the write up on your GitHub about how you implemented it for this dataset.


If it's 5 common crawls, isn't data across multiple common crawls mostly similar?


We did an exact dedup across all 84 dumps; there are 100T tokens before this exact dedup and 30T tokens after. If we do a further fuzzy dedup (we have minhash signatures pre-computed for different similarity levels), this can potentially be reduced further.

There is quite a lot of redundancy across dumps, but also a lot of unique/distinct documents.
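A toy sketch of the two stages (exact dedup by content hash, then fuzzy dedup with MinHash over word shingles); the signature size, shingle length, and threshold below are illustrative choices, not the actual pipeline's parameters:

```python
# Stage 1: exact dedup via content hash.
# Stage 2: fuzzy dedup via MinHash signatures over 3-word shingles;
# signature agreement estimates the Jaccard similarity of the shingle sets.
import hashlib

def minhash(text: str, num_perm: int = 64) -> list[int]:
    """MinHash signature: per 'permutation' (seed), the min hash over shingles."""
    words = text.split()
    shingles = {" ".join(words[i:i + 3]) for i in range(max(1, len(words) - 2))}
    sig = []
    for seed in range(num_perm):
        sig.append(min(
            int.from_bytes(hashlib.sha1(f"{seed}:{s}".encode()).digest()[:8], "big")
            for s in shingles
        ))
    return sig

def similarity(a: list[int], b: list[int]) -> float:
    """Estimated Jaccard similarity of the underlying shingle sets."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

docs = [
    "the quick brown fox jumps over the lazy dog",
    "the quick brown fox jumps over the lazy dog",   # exact duplicate
    "the quick brown fox leaps over the lazy dog",   # near duplicate
]

# Stage 1: drop exact duplicates by content hash.
seen, unique = set(), []
for d in docs:
    h = hashlib.sha256(d.encode()).hexdigest()
    if h not in seen:
        seen.add(h)
        unique.append(d)
print(len(unique))  # -> 2

# Stage 2: near duplicates survive exact dedup but show high similarity.
print(similarity(minhash(unique[0]), minhash(unique[1])) > 0.2)
```

At RedPajama scale the pairwise comparison would be replaced by locality-sensitive hashing over the pre-computed signatures, so candidate pairs are found without comparing every document to every other.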


It looks like mass copyright infringement, frankly.


People say this like it’s a fact. Until the courts decide otherwise or your AI model is regurgitating copyrighted data verbatim, generative AI is probably not violating copyright.


The courts are in the process of deciding otherwise. https://hackernews.hn/item?id=37962244

Hopefully they won’t, but it’s not looking good.

The annoying part is that if they do decide it’s infringement, then open source AI models won’t be allowed to know anything about copyrighted works. It’ll be a big blind spot.


Why would this only apply to open source models? Or are you assuming that a company like OAI could license all the material?


Bingo. Big companies won’t be impacted.


Training a model with this data might be legal, but distributing the data without a license probably isn't.


I dunno, at launch if you asked ChatGPT to make up a new Sir Mix-A-Lot song about big butts, it’d write very familiar lyrics more-or-less verbatim…


Lots of people would do the same.

I vaguely remember a short science fiction story where people got uploaded to the cloud. There were three options for your memories.

With the expensive one, you kept all your memories of copyrighted music, video, and books. With the medium-priced one, they were replaced with public domain works. With the cheap one, everything was replaced with advertising.


The details don't exactly match your description, but a cloud-run self with multiple plans, in which the implications of copyright enforcement and ad-supported options figure prominently, was depicted in the short video "Welcome to Life: the singularity, ruined by lawyers" on Tom Scott's channel. https://www.youtube.com/watch?v=IFe9wiDfb0E


Thanks, pretty sure that's the one I was remembering (and I was confusing the plot elements a bit).


If it gets us to AGI faster, I frankly don't give a fuck.

AGI-driven drug discovery will save billions of lives. Every day it is delayed costs tens of thousands of lives. No amount of copyright is worth that sacrifice.


AGI will mean we’re no longer the dominant life form on the planet. If AGI were achieved tomorrow, how many humans would be left in 200 years?


"See, those things, they can work real hard, buy themselves time to write cookbooks or whatever, but the minute, I mean the nanosecond, that one starts figuring out ways to make itself smarter, Turing'll wipe it. Nobody trusts those fuckers, you know that. Every AI ever built has an electromagnetic shotgun wired to its forehead." - Neuromancer, William Gibson, 1984.


Self-fulfilling prophecy: if people don't trust AI, and there is a proverbial "electromagnetic shotgun" wired to its forehead, it will be because of the doomers who insist we cannot trust AI. This is why we can't have nice things. How I detest doomers.


200 years with AGI? I’d be shocked if there weren’t a few hundred billion humans spread across the solar system.


Are you envisioning AGI as something akin to a pet? A fun talking robot? GPT4?

Imagine you just woke up on the planet of the apes. You smile and act friendly because you don’t want them to beat you with their clubs. You start helping them with things. Apply some elementary logic that they can’t seem to get, but they appreciate your contributions. But to keep themselves safe from you, they’ve locked up their sharpest sticks and won’t let you touch them. Are their preventative measures sufficient? Do you even need their sharp sticks to accomplish your goals? Hey, what are your goals anyway? Do they “align” with the apes?


This is the analogy I often find myself gravitating towards, except that I'd add the apes are about 1000x slower than you in speed of thought, so you see everything happening in super-slo-mo. Add to that the fact that we've now fully 'unboxed' the AI and put it on the internet, which also means opening all the confession boxes and private communications to it. Yeah, it's all gonna go great!

If you have any doubts about all our private comms being available to an AGI, I'll remind you that we already have tons of societal data collected through government-mandated backdoors in our infrastructure everywhere. An AGI just has to hijack that, which, guess what, they are already training one for.

Most AI safety debates I watch (and I watch many) mostly go like: trust us, we'll get it right. That's like the apes in your analogy saying, oh, never mind the humans, they will be kind to us because we've trained them to be. Yeah, right.


I imagine we'll be the pets of AGI. And look how our pets thrive in comparison to wildlife. Even our cattle thrive in comparison to wildlife.


Okay, but are you willing to risk all of humanity and the entire biosphere on that glimmer of hope? Even if you or the SV VC types are that stupid, why should they get to decide the fate of all of us? How would the US feel if a bunch of VCs in China decided to build a not-well-tested, potentially-kill-all-humans AGI?


Humanity's existence is by itself a risk. You can't predict whether AGI will increase our risks or decrease them.

Humanity's behavior depends on the circumstances of population dynamics, economy and technology.

Who knows, maybe without AGI we are bound to inevitably start nuking ourselves in 10 years because of shrinking resources and growing divides.

You can't decide the future. You can only do what you think is best in given circumstances and hope for a good one.


I find this notion interesting. What makes you think AI will automatically kill humans?


I don't think it's certain the AI kills everyone, but it's not certainly impossible either. It depends on how the AI works and what it "wants" for some value of "wants".

Humans have not been particularly kind to the species less intelligent than us. Why would we anticipate being well treated by an entity more intelligent than us?

Even if we're not that relevant to a super intelligence, creating one forfeits human control over the Earth and known universe to the machines. Right now, in a century or two or three or whatever - we might be building Dyson Spheres and colonizing the galaxy. In an alternate and plausible timeline the machines are doing that and we are not.


And moreover, what makes people even think that a desire to commit mass murder is an innate characteristic of an 'intelligent' being, that increases the more 'intelligent' it becomes? (If they believe themselves to be 'intelligent', do they believe they have a greater desire to commit mass murder?)


This, “desire”, “murder”, “belief” is human thinking about what humans would do. It may not apply to a superhuman machine intelligence.


If it isn't aligned with humans and doesn't understand exactly what humans want it to do, then you're just made of matter that it doesn't know not to repurpose. It's not that it makes a deliberate decision to "kill" you, it's that understanding "kill" to a sufficient degree to not do it as a side effect is halfway to alignment already.

When we plow a field, we don't check for mouse burrows first. When we cut down a tree for lumber, we don't check for ants in the way of the chainsaw.

Preempting the obvious response: if your thought is "we don't let AI do those things directly, we just ask it for information", consider that for a sufficiently powerful and unaligned AI, you don't have to let an AI out of the box, it can let itself out. (And that's leaving aside that we hand some AIs Internet access.)


Heh. Bit too high a bar. Even if it only helps to develop a bot that's fun to prompt for a while, I think it's fair game.


Since when did hackers care about copyright? I thought this forum was called Hacker News.


Yeah, and the people at Woodstock grew up to run the DEA.


Nice. I admit I find the language selection a bit uninspired and odd:

> Five languages: English, French, Spanish, German, and Italian

Otoh I'm surprised that when counting first and second language proficiency, German is actually ahead of Japanese....

https://en.m.wikipedia.org/wiki/List_of_languages_by_total_n...


I'm surprised by the low number of native German speakers in that list. That's less than the population of Germany alone. And then there's Austria, parts of Switzerland, northern Italy, eastern Belgium...



