How do you combine the outputs? Wouldn't there be data duplication between unstr...

whakim · 2024-07-30T15:07:50 1722352070

We just skip several of unstructured's categories, such as tables and images. We also do some deduplication post-ANN as we want to optimize for novelty as well as relevance. That being said, how are you planning to embed an image or a table to make it searchable? It sounds simple in theory, but how do you generate an actually good image summary (without spending huge amounts of money filling OpenAI's coffers for negligible benefit)? How do you embed a table?

mkaszkowiak · 2024-07-30T21:19:59 1722374399

Thanks for answering! In my case, I don't directly use RAG; but rather post-process documents via LLMs to extract a set of specific answers. That's also why I've asked about deduplication - asking LLM to provide an answer from 2 different data sources (invalid unstructured table text & valid structured table contents) quickly ramps up errors.