> This is not an OCR problem, but of course, I can't have a look into the software itself, maybe OCR is still fiddling with the data even though we switched it off.
But the point stands either way; LLMs are prone to hallucinations already, so I would not trust them not to make OCR mistakes because they think the page probably says something different from what it actually does.
It was a problem with the JBIG2 compression codec, whose lossy mode saves space by reusing the bitmap of one symbol for other, visually similar symbols, effectively cutting and pasting characters from different parts of the page.
> But the point stands either way; LLMs are prone to hallucinations already, so I would not trust them not to make OCR mistakes because they think the page probably says something different from what it actually does.
Anyone trying to solve for the contents of a page uses context clues. Even human readers do.
You can OCR raw characters in isolation (performance is poor); use letter-frequency information; use a dictionary; use word frequencies; or use even more context to judge what content is more likely. More context results in far fewer errors (though a larger share of the remaining errors may be ones that significantly change the meaning).
A small LLM is just a good way to encode this kind of "how likely are these given alternatives" knowledge.
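A minimal sketch of that idea (my own illustration, not anything from the thread; it assumes GPT-2 via Hugging Face transformers as the "small LLM"): rescore candidate readings from an OCR front end with the LM so the "how likely is this text" prior can break ties between visually similar alternatives.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def lm_log_likelihood(text: str) -> float:
    """Total log-probability the LM assigns to the text (higher = more plausible)."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)
    # out.loss is the mean negative log-likelihood per predicted token,
    # so multiply back up to get a total that is comparable across candidates.
    return -out.loss.item() * (ids.shape[1] - 1)

# Two visually plausible readings of the same line; the LM prior breaks the tie.
candidates = [
    "The quick brown fox jurnps over the lazy dog.",  # 'rn' misread for 'm'
    "The quick brown fox jumps over the lazy dog.",
]
best = max(candidates, key=lm_log_likelihood)
print(best)
```

The key point is that the LM only chooses among alternatives the OCR stage actually proposed; it is encoding "how likely are these given alternatives", not generating text on its own.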
Traditional OCR engines like Tesseract crucially give you strong measures of their confidence, including when they employ dictionaries or the like to help with accuracy. LLMs, on the other hand, give you zero guarantees and have some pretty insane edge cases.
With a traditional OCR architecture maybe you'll get a symbol or two wrong, but an LLM can give you entirely new words or numbers not in the document, or even omit sections of the document. I'd never use an LLM for OCR like this.
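For instance, here is a hedged sketch of the "traditional OCR reports its confidence" point, using pytesseract (a Python wrapper around Tesseract); the image path and the 60-point threshold are made up for illustration:

```python
import pytesseract
from pytesseract import Output
from PIL import Image

# Word-level results, including a 0-100 confidence score per word.
data = pytesseract.image_to_data(Image.open("scan.png"), output_type=Output.DICT)

# Flag low-confidence words for re-checking instead of silently accepting them.
for word, conf in zip(data["text"], data["conf"]):
    conf = float(conf)  # Tesseract reports 0-100, or -1 for non-text rows
    if word.strip() and 0 <= conf < 60:  # arbitrary threshold for this sketch
        print(f"low confidence ({conf:.0f}): {word!r}")
```

There is no equivalent built-in signal telling you that an LLM invented a word or dropped a paragraph.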
If you use the LLM stupidly, sure. But you can take the LLM's pseudo-probabilities for the next symbol and use, e.g., Bayes' rule to combine them with how well each candidate actually matches the pixels on the page. You can also report the total uncertainty at the end.
Done properly, this should strictly improve the results.
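A toy sketch of that combination (the candidate readings and all the numbers are made up for illustration): the posterior score for each reading is the OCR model's log-likelihood of the pixels plus the LLM's log-prior for the text, and normalizing the posteriors gives the total uncertainty to report.

```python
import math

candidates = {
    # reading: (OCR log-likelihood of the pixels, LLM log-prior of the text)
    "Total: 8100 EUR": (-2.0, -9.5),
    "Total: 6100 EUR": (-6.0, -8.0),  # the LLM prefers it, the pixels do not
}

def log_posterior(scores):
    ocr_loglik, lm_logprior = scores
    return ocr_loglik + lm_logprior  # Bayes' rule, up to a constant

best = max(candidates, key=lambda r: log_posterior(candidates[r]))

# Normalized posterior probability of the winning reading = reported confidence.
logs = [log_posterior(s) for s in candidates.values()]
z = max(logs)
probs = [math.exp(l - z) for l in logs]
confidence = max(probs) / sum(probs)
print(best, f"posterior confidence ≈ {confidence:.2f}")
```

Here the pixel evidence outweighs the language prior, so the LLM can only nudge the result, not overwrite what is actually on the page.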
It depends what your use case is. At a low enough cost this would work for a project I'm doing where I really just need to be able to mostly search large documents. Less than 100% accuracy, or a lost or hallucinated paragraph here and there, wouldn't be a deal-killer, especially if the original page image is available to the user too.
Additionally, this might work if you are feeding the output to a bunch of humans to proofread.