Gonna dig deeper, but I get the sense Wikipedia is preparing a native data format for LLM ingestion.


You're probably thinking of something more along the lines of Wikidata [1], which is over 10 years old.

[1]: https://www.wikidata.org/wiki/Wikidata:Main_Page


It's very much designed for and around Wikidata.

> Wikifunctions will allow easy access to large knowledge bases such as Wikidata, but also to binary input and output files

https://diff.wikimedia.org/2023/08/07/wikifunctions-is-start...

> Using the simple facts housed in Wikidata, you will be able to write functions that make calculations, provide a person’s age, estimate population densities, and more, and integrate the results into Wikipedia.

https://www.wikifunctions.org/wiki/Wikifunctions:FAQ

> In the future:

> - It will be possible to call Wikifunctions functions from other Wikimedia projects, and integrate their results in the output of the page.

> - It will be possible to use data from Wikidata in functions.
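To make the "person's age" example from the FAQ concrete: such a function might reduce to something like the sketch below. This is not the actual Wikifunctions API, just an illustration of the data flow against the public Wikidata SPARQL endpoint, using Douglas Adams (Q42) as the test entity.

    import requests
    from datetime import date

    # Sketch of a "person's age" function: pull date of birth (P569)
    # from the public Wikidata SPARQL endpoint and count the years.
    ENDPOINT = "https://query.wikidata.org/sparql"

    def age_of(entity_id: str) -> int:
        query = f"SELECT ?dob WHERE {{ wd:{entity_id} wdt:P569 ?dob . }} LIMIT 1"
        resp = requests.get(
            ENDPOINT,
            params={"query": query, "format": "json"},
            headers={"User-Agent": "wikifunctions-sketch/0.1 (demo)"},
        )
        resp.raise_for_status()
        # WDQS returns dateTime literals like "1952-03-11T00:00:00Z"
        dob_value = resp.json()["results"]["bindings"][0]["dob"]["value"]
        dob = date.fromisoformat(dob_value[:10])
        today = date.today()
        return today.year - dob.year - ((today.month, today.day) < (dob.month, dob.day))

    print(age_of("Q42"))  # years since Douglas Adams's date of birth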


Wikidata cannot express the "semantic" content of typical encyclopedic text, which is an express goal of Wikifunctions. So they will have to expand the data model quite a bit compared to what Wikidata has today. (This can be done quite cleanly, though, since the whole point of RDF is to be able to express general graphs, and this has been made even stronger with RDF*, which aligns fairly well with the "frame" semantics Wikifunctions plans to use for this purpose.)
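For a flavor of what that looks like: RDF* (RDF-star) lets you quote a whole triple and assert things about the statement itself. A sketch in Turtle-star notation; the ex: predicates are made up for illustration, and Wikidata's real RDF export uses its own scheme (see below in the thread):

    @prefix wd:  <http://www.wikidata.org/entity/> .
    @prefix wdt: <http://www.wikidata.org/prop/direct/> .
    @prefix ex:  <http://example.org/> .
    @prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

    # Plain triple: Douglas Adams (Q42) educated at (P69) St John's College (Q691283)
    wd:Q42 wdt:P69 wd:Q691283 .

    # Quoted triple: an assertion about the statement itself
    << wd:Q42 wdt:P69 wd:Q691283 >> ex:endTime "1974"^^xsd:gYear .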


Wikifunctions is arguably the prerequisite for that effort: even if you extended Wikidata with more semantic connections, you couldn't do anything with them.

This is a large part of why Abstract Wikipedia's goal is NLG, and almost all of the initial Wikifunctions facilitate NLG.


> NLG

What is NLG?


Natural Language Generation



Yes, but there are significant restrictions as to what can be expressed there: it's limited to assertions of the form 'well-known entity X has pre-defined property P with value Y (or "some value" or "no value"), with one further layer of "qualifiers"'. RDF itself is fully compositional, and RDF* extends that compositionality even further, providing a standard means to reify arbitrary RDF statements and make complex assertions about them.
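Concretely, Wikidata's RDF export encodes that single qualifier layer through dedicated statement nodes, roughly like this (a sketch; the real dumps use hash-named wds: statement nodes rather than blank nodes):

    @prefix wd:  <http://www.wikidata.org/entity/> .
    @prefix p:   <http://www.wikidata.org/prop/> .
    @prefix ps:  <http://www.wikidata.org/prop/statement/> .
    @prefix pq:  <http://www.wikidata.org/prop/qualifier/> .
    @prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

    # entity -> statement node -> value, with qualifiers hanging off the
    # statement node; one fixed layer, so qualifiers can't themselves be qualified
    wd:Q42 p:P69 _:stmt .
    _:stmt ps:P69 wd:Q691283 .                              # the value
    _:stmt pq:P580 "1971-01-01T00:00:00Z"^^xsd:dateTime .   # qualifier: start time
    _:stmt pq:P582 "1974-01-01T00:00:00Z"^^xsd:dateTime .   # qualifier: end time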


And it has full data dumps and a query service (https://query.wikidata.org/), so if you wanted to use it in an LLM project or as a side service, there's absolutely no problem with that.
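Besides SPARQL, every entity is also one HTTP GET away as JSON via Special:EntityData. A minimal sketch (real endpoint; error handling omitted):

    import requests

    # Fetch an entity's full JSON record from Wikidata's Special:EntityData
    # endpoint -- handy as a lightweight side service.
    url = "https://www.wikidata.org/wiki/Special:EntityData/Q42.json"
    entity = requests.get(url, headers={"User-Agent": "demo/0.1"}).json()["entities"]["Q42"]

    print(entity["labels"]["en"]["value"])  # "Douglas Adams"
    print(entity["claims"]["P569"][0]["mainsnak"]["datavalue"]["value"]["time"])
    # "+1952-03-11T00:00:00Z" -- note Wikidata's explicit leading "+"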


This was my first impression as well. Huge untapped possibilities with a project like this...


Isn't the whole point of LLMs that their "native data format" is unstructured?



