> I see a lot of these KG tools pop up, but they never solve the first problem I have, which is actually constructing the KG itself.
I have heard good things about Graphrag [1] (but what a stupid name). I did not have the time to try it properly, but it is supposed to build the knowledge graph itself somewhat transparently, using LLMs. This is a big stumbling block. At least vector stores are easy to understand and trivial to build.
It looks like KAG can do this from the summary on GitHub, but I could not really find how to do it in the documentation.
Indeed they seem to actually know/show how the sausage is made... but still, no fire-and-forget approach for any random dataset. Check out what you need to do if the default isn't working for you (scroll down to e.g. the entity_extraction settings). There is so much complexity there to deal with that I'd just roll my own extraction pipeline from the start, rather than learning someone else's complex setup (that you have to tweak for each new use case).
> I'd just roll my own extraction pipeline from the start, rather than learning someone else's complex setup
I have to agree. It’s actually quite a good summary of hacking with AI-related libraries these days. A lot of them get complex fast once you get slightly out of the intended path. I hope it’ll get better, but unfortunately it is where we are.
GraphRAG isn't quite a knowledge graph. It is a graph of document snippets with semantic relations, but it does not do fact extraction, nor can you do any reasoning over the structure itself.
This is a common issue I've seen from LLM projects that only kind-of understand what is going on here and try to pass off their vector database with semantic edge information as something that has a formal name.
Why stupid? It uses a graph in RAG: GraphRAG. If anything it's too generic, and multiple people who have the same idea now cannot use the name because Microsoft made the most noise about it.
It is trivial, completely devoid of any creativity, and most importantly quite difficult to google. It’s like they did not really think about it even for 5 seconds before uploading.
> If anything it's too generic, and multiple people who have the same idea now cannot use the name because Microsoft made the most noise about it.
Exactly! Anyway, I am not judging the software, which I have yet to try properly.
You may want to take a look at Graphiti, which accepts plaintext or JSON input and automatically constructs a KG. While it’s primarily designed to enable temporal use cases (where data changes over time), it works just as well with static content.
> Graphiti uses OpenAI for LLM inference and embedding. Ensure that an OPENAI_API_KEY is set in your environment. Support for Anthropic and Groq LLM inferences is available, too.
Don't have time to scan the source code myself, but are you using the OpenAI python library, so the server URL can easily be changed? Didn't see it exposed by your library, so hoping it can at least be overridden with an env var, so we could use local LLMs instead.
> We recommend that you put this on a local fork as we really want the service to be as lightweight and simple as possible, as we see this as a good entry point for new developers.
Sadly, it seems like you're recommending forking the library instead of allowing people to use local LLMs. You were smart enough to lock the PR from any further conversation at least :)
You can override the default OpenAI url using an environment variable (iirc, OPENAI_API_BASE). Any LLM provider / inference server offering an OpenAI-compatible API will work.
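For example (a rough sketch; this assumes the project uses the official `openai` Python client, and the local endpoint and model name are placeholders for whatever your server exposes):

```python
import os

# openai-python >= 1.0 reads OPENAI_BASE_URL; the older 0.x line read OPENAI_API_BASE.
os.environ["OPENAI_BASE_URL"] = "http://localhost:11434/v1"  # e.g. an Ollama endpoint
os.environ["OPENAI_API_KEY"] = "not-needed-locally"

from openai import OpenAI

client = OpenAI()  # picks up the environment variables above
resp = client.chat.completions.create(
    model="llama3",  # whatever your local server exposes
    messages=[{"role": "user", "content": "Hello from a local model"}],
)
print(resp.choices[0].message.content)
```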
Granted they use the `openai` python library (or another library/implementation that uses that same env var), hence my question in the previous-previous comment...
There are two paths to KG generation today and both are problematic in their own ways.
1. Natural Language Processing (NLP)
2. LLM
NLP is fast but requires a model that is trained on an ontology that works with your data. Once you do, it's a matter of simply feeding the model your bazillion CSVs and PDFs.
LLMs are slow but way easier to start with, as ontologies can be generated on the fly. This is a double-edged sword, however, as LLMs have a tendency to lose fidelity and consistency on edge naming.
I work in NLP, which is the most used in practice as it’s far more consistent and explainable in very large corpora. But the difficulty in starting a fresh ontology dead ends many projects.
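For a concrete sense of the NLP path, here is a minimal sketch with spaCy's pretrained pipeline (the model name and example sentence are just illustrations; mapping entities onto your ontology and extracting relations is the harder next step):

```python
# Requires: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Pfizer acquired Seagen in 2023 for roughly $43 billion.")

# The pretrained pipeline pulls typed entities out of text.
for ent in doc.ents:
    print(ent.text, ent.label_)  # e.g. Pfizer ORG, Seagen ORG, 2023 DATE
```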
This has always been the Hard Problem. For one, constructing an ontology that is comprehensive, flexible, and stable is huge effort. Then, taking the unstructured mess of documents and categorizing them is an entire industry in itself. Librarians have cataloging as a sub-specialty of library sciences devoted to this.
So yes, there's a huge pile of tools and software for working with knowledge graphs, but to date populating the graph is still the realm of human experts.
When you boil it down, the current LLMs could work effectively if a prompt engineer could figure out a converging loop of a librarian tasked with generating a hypertext web ring crossed with a wikipedia.
Perhaps one needs to manually create a starting point, then ask the LLM to propose links to various documents or follow an existing one.
Sufficiently loopable traversal should create a KG.
There is some automated named entity extraction and relationship building out of un/semi-structured data as part of the Neo4j onboarding now, to go with all these GraphRAG efforts (and maybe an honorable mention to WhyHow.ai too).
txtai automatically builds graphs using vector similarity as data is loaded. Another option is to use something like GLiNER and create entities on the fly, and then create relationships between those entities and/or documents. Or you can do both.
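A rough sketch of the GLiNER route (the model id, labels, example text, and co-occurrence edges are illustrative assumptions, not a prescribed pipeline):

```python
# Requires: pip install gliner
from gliner import GLiNER

model = GLiNER.from_pretrained("urchade/gliner_base")

text = "Acme Corp partnered with ETH Zurich on a pilot project in Basel during 2022."
labels = ["company", "university", "city", "year"]

entities = model.predict_entities(text, labels, threshold=0.5)
for ent in entities:
    print(ent["text"], "=>", ent["label"])

# One naive way to form edges: link every pair of entities that co-occur in a chunk.
edges = [(a["text"], b["text"]) for i, a in enumerate(entities) for b in entities[i + 1:]]
```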
Came here to say this and glad I am not the only one. Building out an ontology seems like quite an expensive process. It would be hard to convince my stakeholders to do this.
There are several ontologies already well built out. Utilities and pharma both have them as an example. They are built by committee of vendors and users. They take a bit to penetrate the approach and language used. Often they are built to be adaptable.
I've had good success with CIM for Utilities to build a network graph for modelling the distribution and transmission networks, adding sensor and event data for monitoring and analysis, about 15 years ago.
Anywhere there is a technology-focussed consortium of vendors and users building standards, you will likely find a prebuilt graph. When RDF was "hot", many of these groups spun out some attempt to model their domain.
In summary, if you need one look for one. Maybe there’s one waiting for you and you get to do less convincing and more doing.
> The white paper is only available for professional developers from different industries. We need to collect your name, contact information, email address, company name, industry type, position and your download purpose to verify your identity...
I've had just outstanding success with "view source" and grabbing the "on success" parameter out of the form. Some sites are bright enough to do real server-side work first, and some other sites will email the link, but I'd guess it's easily 75/25 for ones that include the link in the original page body, as does this one:
LLMs are not that different from humans: in both cases you have some limited working memory and you need to fit the most relevant context into it. This means that if you have a new knowledge base for LLMs, it should be useful for humans too. There should be a lot of cross-pollination between these tools.
But we need a theory on the differences too. Right now it is kind of random how we differentiate the tools. We need ergonomics for LLMs.
> This means that if you have a new knowledge base for LLMs, it should be useful for humans too. There should be a lot of cross-pollination between these tools.
This is realistic, but for that very reason it's going to be unpopular, unfortunately, because people expect magic / want zero effort.
When I need to build something for an LLM to use, I ask the LLM to build it. That way, by definition, the LLM has a built in understanding of how the system should work, because the LLM itself invented it.
Similarly, when I was doing some experiments with a GPT-4 powered programmer, in the early days I had to omit most of the context (just have method stubs). During that time I noticed that most of the code written by GPT-4 was consistently the same. So I could omit its context because the LLM would already "know" (based on its mental model) what the code should be.
> the LLM has a built in understanding of how the system should work,
That's not how an LLM works. It doesn't understand your question, nor the answer. It can only give you a statistically significant sequence of words that should follow what you gave it.
It has come to the point that we need benchmarks for (Graph)-Rag systems now, same as we have for pure LLMs. However vendors will certainly then optimize for the popular ones, so we need a good mix of public, private and dynamic eval datasets.
> Star our repository to stay up-to-date with exciting new features and improvements! Get instant notifications for new releases
That's not even correct, starring isn't going to do that. You'd need to smash that subscribe button and not forget the bell icon (metaphorically), not ~like~ star it.
If, on the other hand, it were a long, drawn-out animation of moving the mouse pointer to the button, hovering for a few seconds, and then slowly clicking while dragging the mouse away so that the button didn't select and they had to repeat the task again -- that would be art.
I like their description/approach for logical problem solving:
From section 2.2:
"The engine includes three types of operators: planning, reasoning, and retrieval, which transform natural language problems into problem solving processes that combine language and notation.
In this process, each step can use different operators, such as exact match retrieval, text retrieval, numerical calculation or semantic reasoning, so as to realize the integration of four different problem solving processes: Retrieval, Knowledge Graph reasoning, language reasoning and numerical calculation."
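As a toy illustration of that operator idea (not KAG's actual code; the operator names, stand-in implementations, and plan format are made up here):

```python
from typing import Callable

# Stand-in operators; a real engine would hit a KG, a retriever, or a calculator.
def exact_match(q: str) -> str:
    return f"[kg lookup: {q}]"

def text_retrieval(q: str) -> str:
    return f"[top passages for: {q}]"

def numeric(q: str) -> str:
    return str(eval(q, {"__builtins__": {}}))  # toy calculator, arithmetic only

OPERATORS: dict[str, Callable[[str], str]] = {
    "exact_match": exact_match,
    "text_retrieval": text_retrieval,
    "numeric": numeric,
}

# A "plan" is just an ordered list of (operator, input) steps produced by a planner.
plan = [("text_retrieval", "revenue of company X in 2022"), ("numeric", "1.07 * 3200")]
print([OPERATORS[op](arg) for op, arg in plan])
```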
Somehow this is the first time such a thing has popped up in my feed. Glad that someone (judging by the comments, not the only project) is working on this. Of course I am rather far from the field, but to me this feels like a step in the right direction for advancing AI past the hyperadvanced-parrot stage that current "AI" is at (at least per my perception).
All you’re doing here is “front loading” AI: instead of running slow and expensive LLMs at query time, you run them at index time.
It’s a method for data augmentation or, in database lingo, index building. You use LLMs to add context to chunks that doesn’t exist on either the word level (searchable by BM25) or the semantic level (searchable by embeddings).
A simple version of this would be to ask an LLM:
“List all questions this chunk is answering.” [0]
But you can do the same thing for time frames, objects, styles, emotions — whatever you need a “handle” for to later retrieve via BM25 or semantic similarity.
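A minimal sketch of that index-time augmentation, assuming an OpenAI-compatible client; the model name, prompt wording, marker format, and example chunk are illustrative, not a fixed recipe:

```python
from openai import OpenAI

client = OpenAI()

PROMPT = """Read the passage, briefly note what it covers, then output one
question per line after the marker QUESTIONS: and nothing else.

Passage:
{chunk}"""

def questions_for(chunk: str) -> list[str]:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # any capable model; smaller is cheaper at corpus scale
        messages=[{"role": "user", "content": PROMPT.format(chunk=chunk)}],
    )
    text = resp.choices[0].message.content
    # Parse out just the part we asked for -- the "unfancy" alternative to structured output.
    tail = text.split("QUESTIONS:", 1)[-1]
    return [q.strip() for q in tail.splitlines() if q.strip()]

chunks = ["The 2019 annual report notes a 12% rise in churn after the pricing change."]
# Each synthetic question is indexed alongside its source chunk, so BM25 or
# embedding search over the questions routes back to the original text.
index_entries = [
    {"chunk_id": i, "synthetic_question": q}
    for i, chunk in enumerate(chunks)
    for q in questions_for(chunk)
]
```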
I dreamed of doing that back in 2020, but it would’ve been prohibitively expensive. Because it requires passing your whole corpus through an LLM, possibly multiple times, once for each “angle”.
That being said, I recommend running any “Graph RAG” system you see here on HN over some 1% or so of your data. And then look inside the database. Look at all text chunks, original and synthetic, that are now in your index.
I’ve done this for a consulting client who absolutely wanted “Graph RAG”. I found the result to be an absolute mess. That is because these systems are built to cover a broad range of applications and are not adapted at all to your problem domain.
So I prefer working backwards:
What kinds of queries do I need to handle? What does the prompt to my query-time LLM need to look like? What context will the LLM need? How can I have this context for each of my chunks, and be able to search it by match or similarity? And now, how can I make an LLM return exactly that kind of context, with as few hallucinations and as little filler as possible, for each of my chunks?
This gives you a very lean, very efficient index that can do everything you want.
[0] For a prompt, you’d add context and give the model “space to think”, especially when using a smaller model. Also, you’d instruct it to use a particular format, so you can parse out the part that you need. This “unfancy” approach lets you switch out models easily and compare them against each other without having to care about different APIs for “structured output”.
Prompts are a great place to look for these, but the part you linked to isn't very important for knowledge graph generation. It is doing an initial semantic breakdown into more manageable chunks. The entity and fact extraction that actually turns this into a knowledge graph is this one:
GraphRAG and a lot of the semantic indexes are simply vector databases with pre-computed similarity edges, which you cannot perform any reasoning over (reasoning being the definition and intention of a knowledge graph).
This is probably worth looking at; it's the first open-source project I've seen that is actually using LLMs to generate knowledge graphs. It does look pretty primitive for that task, but it might be a useful reference for others going down this road.
To my knowledge most graph RAG implementations, including the Microsoft research project, rely on LLM entity extraction (subject-predicate-object triplets) to build the graph.
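Something along these lines (an illustrative sketch, not any specific project's implementation; the prompt, model name, and use of networkx as the store are assumptions):

```python
import json
import networkx as nx
from openai import OpenAI

client = OpenAI()

PROMPT = (
    "Extract (subject, predicate, object) triplets from the text. "
    'Return only a JSON array like [["s", "p", "o"], ...], with no other text.\n\n'
    "Text:\n{text}"
)

def extract_triplets(text: str) -> list[list[str]]:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": PROMPT.format(text=text)}],
    )
    # Assumes the model returns bare JSON; real pipelines validate, repair, and retry.
    return json.loads(resp.choices[0].message.content)

graph = nx.MultiDiGraph()
for s, p, o in extract_triplets("Marie Curie discovered polonium in 1898."):
    graph.add_edge(s, o, predicate=p)  # entities become nodes, predicates label edges
```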
>I found the result to be an absolute mess. That is because these systems are built to cover a broad range of applications and are not adapted at all to your problem domain.
Same findings here, re: legal text. Basic hybrid search performs better. In this use case the user knows what to look for, so the queries are specific. The advantage of graph RAG is when you need to integrate disparate sources for a holistic overview.
If you have to deal with domain-specific data, then this would not work as well. I mean, it will get you an incremental shift (based on what I see, it's just creating explicit relationships at index time instead of letting the model do it at runtime before generating an output; effective incrementally, but it depends on the type of data), yes, though not enough to justify redoing your own pipeline. You are likely better off with your current approach and developing robust evals.
If you want a transformational shift in terms of accuracy and reasoning, the answer is different. Many times RAG accuracy suffers because the text is out of distribution, and ICL does not work well. You get away with it if all your data is in the public domain in some form (ergo, the LLM was trained on it); otherwise you keep seeing the gaps with no way to bridge them. I published a paper around it and how to efficiently solve it, if interested. Here is a simplified blog post on the same: https://medium.com/@ankit_94177/expanding-knowledge-in-large...
Edit: Please reach out here or on email if you would like further details. I might have skipped too many things in the above comment.
At this point, the onus is on the developer to prove its value through A/B comparisons versus traditional RAG. No person/team has the bandwidth to try out this (n + 1)th solution.
I enjoy the explosion of tools. Only time will tell which ones stand the test of time. But this is my day job, so I never get tired of new tools; I can see how non-industry folks might find it overwhelming.
Can you expand on that?
Where do big enterprise orgs products fit in, eg Microsoft, Google?
What are the leading providers as you see them?
As an outsider it is bewildering. First I hear that llama_index is good, then I hear that it's overcomplicated slop. What sources or resources are reliable on this? How can we develop anything that will still stand in 12 months time?
It may help to think of these tools as being on opposite ends of a spectrum. As an analogy:
1. langchain, llamaindex, etc. are the equivalent of jQuery or ORMs for calling third-party LLMs. They're thin adapter layers providing a bit of consistency and common tasks across providers. Arguably like React, in that they are thin composition layers. So complaints about them being leaky abstractions are in the sense of an ORM getting in the way vs helping.
2. KG/graph RAG libraries are the LLM equivalent of graduating to a full-blown Lucene/Solr engine when regex + LIKE SQL statements aren't enough. These are intelligence engines that address index time, query time, or likely both. Thin libraries and those lacking standard benchmarks are a sign of experiments vs production relevance: unless you're just talking to one PDF, not likely what you want. IMO, no 'winners' here yet: llamaindex was part of an early wave of preprocessors that feed PDFs etc. to the KG, but it isn't winning the actual 'smart' KG/RAG. In contrast, MSR GraphRAG is popular and benchmarks well, but if you read the GitHub repo & paper, it's not intended for production use -- e.g., it addresses one family of infrequent queries you'd do in a RAG system ("n-hop"), but not the primary kinds like mixing semantic+keyword search with query rewriting, and it struggles with basics like updates.
Most VC infra/DB $ goes to a layer below the KG. For example, vector databases -- but vector DBs are relatively dumb blackboxes, you can think of them more like S3 or a DB index, while the LLM KG/AI quality work is generally a layer above. (We do train & tune our embedding models, but that's a tiny % of the ultimate win, mostly for smarter compression for handling scaling costs, not the bigger smarts.)
+ 1 to presentation being confusing! VC $ on agents, vector DB co's, etc, and well-meaning LLM enthusiasts are cranking out articles on small uses of LLMs, but in reality, these end up being pretty crappy in quality if you'd actually ship them. So once quality matters, you get into things like the KG/graph RAG work & evals, which is a lot more effort & grinding => smaller % of the infotainment & marketing going around.
(We do this stuff at real-time & data-intensive scales as part of Louie.AI, and are always looking for design partners, esp on graph rag, so happy to chat.)
IMO, none. Unfortunately, the landscape is changing too fast. Maybe things will stabilize, but for now I find experimentation a time-consuming but essential part of maintaining any ML stack.
But it's okay not to experiment with every new tool (it can be overwhelming to do this). The key is in understanding one's own stack and filtering out anything that doesn't fit into it.
> How can we develop anything that will still stand in 12 months time?
At the pace things are moving, likely nothing. You will have to keep making changes as and when you see newer things. One thing in your favor (arguably) is that every technique is very dependent on the dataset and the problem you are solving. So if you do not have the latest one implemented, you will be okay, as long as your evals and metrics are good. If that helps: skip the details, understand the basics, and go for your own implementation. One thing to look out for is new SOTA LLM releases and the jumps in capability. E.g., 4o did not announce it, but they started doing very well on vision (GPT-4 was okay, 4o is empirically quite a bit better). These things help when you update your pipeline.
Well, new LLMs keep coming out at a rapid rate, but since they're all trying to model language, they should all be fairly interchangeable and will potentially converge.
It’s not hard for a product to swap the underlying LLM for a given task.
I meant not a jump in text generation ability, but more like adding a completely new modality and the likes. With 4o, you can have a multimodal embedding space and provide more relevant context to a model for fewer tokens (and higher accuracy). Ideally everyone would get there, but upgrading your pipeline is more about getting the latest functionality faster rather than just a slightly better generation.
The issue is that this technology has no moat (other than the cost to create models and datasets).
There’s not a lot of secret sauce you can use that someone else can’t trivially replicate, given the resources.
It’s going to come down to good ol product design and engineering.
The issue is openai doesn’t seem to care about what their users want. (I don’t think their users know what they want either, but that’s another discussion)
They want more money to make bigger models in the hope that nobody else can or will.
They want to achieve regulatory capture as their moat.
For all their technical abilities at scaling LLM training and inference, I don’t get the feeling that they have great product direction.
Haha, I had heard that langchain was overcomplicated, self-contradictory slop and that llama index was better. I don't doubt it's bad as well.
Both are cut from the same cloth: typical inexperienced devs who made something cool in a new space and posted it on GitHub, but then immediately morphed into companies trying to trap users etc., without going through an organic lifecycle of growing, improving, and refactoring with the community.
But unfortunately it's like a game of musical chairs: we may get stuck with whoever is pushing their wares the hardest, rather than the actual best solution.
In fact, I'm wondering if that's what happened in the early noughties, and we had the misfortune of Java, and still have the misfortune of JavaScript.
This is the first project I've seen that is actually doing any kind of knowledge graph generation. Most are just precomputing similarity scores as edges between document snippets that act as their nodes. People have basically been calling their vector databases with an index a knowledge graph.
This is actually attempting fact extraction into an ontology so you can reason over this instead of reasoning in the LLM.
Not sure if this is addressing my question. As I understand it, the RAG augments the knowledge base by representing its content as a graph. But this graph representation needs to be linguistically represented such that an LLM can digest it by tokenizing and embedding.
There are lots of ways to go about RAG, many do not require graphs at all.
I recommend looking at some simple spark queries to get an idea of what’s happening.
What I’ve seen is using LLMs to identify what possible relationships some information may have by comparing it to the kinds of relationships in your database.
Then when building the spark query it uses those relationships to query relevant data.
The LLM never digests the graph. The system around the LLM uses the capabilities of graph data stores to find relevant context for it.
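A hedged sketch of that flow (the schema, model name, and Cypher template are assumptions; the point is that the LLM only picks relationship types, and the surrounding system builds and runs the graph query):

```python
import json
from openai import OpenAI

client = OpenAI()
SCHEMA_RELATIONSHIPS = ["SUPPLIES", "OWNS", "LOCATED_IN", "REPORTS_TO"]

def pick_relationships(question: str) -> list[str]:
    prompt = (
        f"Question: {question}\n"
        f"Which of these relationship types are relevant? {SCHEMA_RELATIONSHIPS}\n"
        "Answer with a JSON array of relationship names only."
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return json.loads(resp.choices[0].message.content)

rels = pick_relationships("Which suppliers feed our Berlin plant?")
# The system, not the LLM, turns the chosen relationships into a graph query.
cypher = f"MATCH (a)-[r:{'|'.join(rels)}]->(b) RETURN a, r, b LIMIT 25"
print(cypher)
```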
What you’ll find with most RAG systems is that the LLM plays a smaller part than you’d think.
It reveals semantic information (such as conceptual relationships) and generates final responses. The system around it is where the far more interesting work happens imo.
I'm talking about a knowledge graph that explicitly stores data (= knowledge) as a graph, and the question is how this solution establishes the connection to the LLM, so that the LLM uses the data... anyway, never mind :)