> This is not an OCR problem, but of course, I can't have a look into the software itself, maybe OCR is still fiddling with the data even though we switched it off.
But the point stands either way; LLMs are prone to hallucinations already, so I would not trust them not to make OCR mistakes because they think the page probably says something different from what it actually does.
It was a problem with the JBIG2 compression codec, whose lossy mode saves space by reusing the bitmap of one symbol for other, visually similar symbols, effectively cutting and pasting characters from different parts of the page.
> But the point stands either way; LLMs are prone to hallucinations already, so I would not trust them not to make OCR mistakes because they think the page probably says something different from what it actually does.
Anyone trying to solve for the contents of a page uses context clues. Even human readers do.
You can OCR raw characters in isolation (performance is poor); use letter-frequency information; use a dictionary; use word frequencies; or use even more context to judge what content is more likely. More context results in far fewer errors (though a larger share of the remaining errors may be ones that significantly change the meaning).
A small LLM is just a good way to encode this kind of "how likely are these given alternatives" knowledge.
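A minimal sketch of that idea (my own illustration, not anything from the thread; it assumes GPT-2 via Hugging Face transformers as the "small LLM"): rescore candidate readings from an OCR front end with the LM so the "how likely is this text" prior can break ties between visually similar alternatives.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def lm_log_likelihood(text: str) -> float:
    """Total log-probability the LM assigns to the text (higher = more plausible)."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)
    # out.loss is the mean negative log-likelihood per predicted token,
    # so multiply back up to get a total that is comparable across candidates.
    return -out.loss.item() * (ids.shape[1] - 1)

# Two visually plausible readings of the same line; the LM prior breaks the tie.
candidates = [
    "The quick brown fox jurnps over the lazy dog.",  # 'rn' misread for 'm'
    "The quick brown fox jumps over the lazy dog.",
]
best = max(candidates, key=lm_log_likelihood)
print(best)
```

The key point is that the LM only chooses among alternatives the OCR stage actually proposed; it is encoding "how likely are these given alternatives", not generating text on its own.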
Traditional OCR engines like Tesseract crucially give you strong measures of their confidence, including when they employ dictionaries or the like to help with accuracy. LLMs, on the other hand, give you zero guarantees and have some pretty insane edge cases.
With a traditional OCR architecture maybe you'll get a symbol or two wrong, but an LLM can give you entirely new words or numbers not in the document, or even omit sections of the document. I'd never use an LLM for OCR like this.
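For instance, here is a hedged sketch of the "traditional OCR reports its confidence" point, using pytesseract (a Python wrapper around Tesseract); the image path and the 60-point threshold are made up for illustration:

```python
import pytesseract
from pytesseract import Output
from PIL import Image

# Word-level results, including a 0-100 confidence score per word.
data = pytesseract.image_to_data(Image.open("scan.png"), output_type=Output.DICT)

# Flag low-confidence words for re-checking instead of silently accepting them.
for word, conf in zip(data["text"], data["conf"]):
    conf = float(conf)  # Tesseract reports 0-100, or -1 for non-text rows
    if word.strip() and 0 <= conf < 60:  # arbitrary threshold for this sketch
        print(f"low confidence ({conf:.0f}): {word!r}")
```

There is no equivalent built-in signal telling you that an LLM invented a word or dropped a paragraph.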
If you use the LLM stupidly, sure. But you can take the LLM's pseudo-probabilities for the next symbol and use, e.g., Bayes' rule to combine them with how well each candidate actually matches the pixels on the page. You can also report the total uncertainty at the end.
Done properly, this should strictly improve the results.
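A toy sketch of that combination (the candidate readings and all the numbers are made up for illustration): the posterior score for each reading is the OCR model's log-likelihood of the pixels plus the LLM's log-prior for the text, and normalizing the posteriors gives the total uncertainty to report.

```python
import math

candidates = {
    # reading: (OCR log-likelihood of the pixels, LLM log-prior of the text)
    "Total: 8100 EUR": (-2.0, -9.5),
    "Total: 6100 EUR": (-6.0, -8.0),  # the LLM prefers it, the pixels do not
}

def log_posterior(scores):
    ocr_loglik, lm_logprior = scores
    return ocr_loglik + lm_logprior  # Bayes' rule, up to a constant

best = max(candidates, key=lambda r: log_posterior(candidates[r]))

# Normalized posterior probability of the winning reading = reported confidence.
logs = [log_posterior(s) for s in candidates.values()]
z = max(logs)
probs = [math.exp(l - z) for l in logs]
confidence = max(probs) / sum(probs)
print(best, f"posterior confidence ≈ {confidence:.2f}")
```

Here the pixel evidence outweighs the language prior, so the LLM can only nudge the result, not overwrite what is actually on the page.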
It depends what your use case is. At a low enough cost this would work for a project I'm doing where I really just need to be able to mostly search large documents. Less than 100% accuracy, or a lost or hallucinated paragraph here and there, wouldn't be a deal-killer, especially if the original page image is available to the user too.
Additionally, this might work if you are feeding the output to a bunch of humans to proofread.