What are you on about? What does it produce if not computer-readable characters?...

Groxx · on Aug 6, 2013

Does it produce ASCII? UTF? If no, it's not OCR.

edit: by the definition you seem to be going on, any facial recognition is also OCR, since you could consider a face a 'glyph' (edit: 'symbol'). The only 'text' thing here that I can see is that it is intended to be used on text, which allows some optimizations, not that it's actually text-based in any way.

Dylan16807 · on Aug 6, 2013

If you make a font out of faces and use them as repeated glyphs then yes, it's OCR. If you're not using identical symbols over and over, then I don't think you have a sane definition of 'glyph'.

eCa · on Aug 6, 2013

It produces symbols, not characters.

Say that the scanner internally splits the scan into regions of 10x10 pixels that it saves in memory. If another region differs in fewer than (say) 10% of the pixels, the two regions are assumed to be identical and the first one is used in the second place too. The regions have no semantic meaning.

OCR translates the scan into a character set.
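
A minimal sketch of the region-matching idea described above, to make the distinction concrete. The 10% threshold comes from the example; the function names and the list-of-rows bitmap representation are assumptions for illustration, not how any real scanner is known to work.

```python
def differing_fraction(a, b):
    """Fraction of pixels that differ between two equally sized regions."""
    total = sum(len(row) for row in a)
    diff = sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))
    return diff / total


def compress_regions(regions, threshold=0.10):
    """Store each region once; reuse an earlier region if it is 'close enough'.

    Returns (dictionary, indices): the stored bitmaps, and for each input
    region the index of the stored bitmap that stands in for it.
    """
    dictionary = []  # representative bitmaps kept in memory
    indices = []     # one dictionary index per input region
    for region in regions:
        for i, stored in enumerate(dictionary):
            if differing_fraction(region, stored) < threshold:
                indices.append(i)  # treat as identical to an earlier region
                break
        else:
            dictionary.append(region)  # no match found: store a new bitmap
            indices.append(len(dictionary) - 1)
    return dictionary, indices


# Hypothetical 10x10 regions: blank, blank with 5 stray pixels, half filled.
blank = [[0] * 10 for _ in range(10)]
near_blank = [row[:] for row in blank]
near_blank[0][:5] = [1] * 5
half = [[1] * 10 if r < 5 else [0] * 10 for r in range(10)]

dictionary, indices = compress_regions([blank, near_blank, half, blank])
print(len(dictionary), indices)  # 2 [0, 0, 1, 0]
```

The output is a list of bitmap indices. Nothing in it says which character, if any, a region represents, and the 5 stray pixels in the second region are silently lost.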

Dylan16807 · on Aug 6, 2013

The only thing that's missing is a mapping from 'symbol #28' into 'ascii #63'. Internally it's storing instances of symbols plus font data for those symbols.

Also, something to think about: an EBCDIC document accidentally printed as ASCII/8859-1 would have equally zero semantic meaning when fed into an OCR program. But I don't think anyone would argue it wasn't OCR.

Groxx · on Aug 6, 2013

That "only thing that's missing" is a very very big thing, and difficult to get correct. And where does it say it's storing font data for the symbols?

Dylan16807 · on Aug 6, 2013

A font doesn't need to be anything more than a series of bitmaps. And then each character location on the image, ignoring errors, references one of these bitmaps. That's how documents with embedded bitmap fonts generally work.

That mapping isn't a very big thing. Sometimes text-based PDFs don't even have it, and you don't notice unless you try to copy out and get the wrong letters.
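
To illustrate, here is a rough sketch of a document stored as a bitmap font plus symbol placements, with the symbol-to-character mapping as an optional extra. All names and the tiny 3x3 glyphs are made up for the example; this is not any real PDF or JBIG2 structure.

```python
# The "font" is nothing more than a list of bitmaps.
font = [
    ["#.#", "#.#", "###"],  # symbol 0 (happens to look like a U)
    [".#.", ".#.", ".#."],  # symbol 1 (happens to look like an I)
]

# Each placement is (x, y, symbol index): where to stamp which bitmap.
placements = [(0, 0, 0), (4, 0, 1), (8, 0, 0)]


def render(font, placements, width, height):
    """Rebuild the page image by stamping each referenced bitmap in place."""
    canvas = [["."] * width for _ in range(height)]
    for x, y, sym in placements:
        for dy, row in enumerate(font[sym]):
            for dx, pixel in enumerate(row):
                canvas[y + dy][x + dx] = pixel
    return ["".join(row) for row in canvas]


def extract_text(placements, to_char=None):
    """Copy out text. Only possible if a symbol -> character mapping exists."""
    if to_char is None:
        return None
    ordered = sorted(placements, key=lambda p: (p[1], p[0]))  # reading order
    return "".join(to_char[sym] for _, _, sym in ordered)


for line in render(font, placements, width=11, height=3):
    print(line)

print(extract_text(placements))                    # None: renders fine, no text
print(extract_text(placements, {0: "U", 1: "I"}))  # UIU
print(extract_text(placements, {0: "I", 1: "U"}))  # IUI: looks right, copies wrong
```

Rendering never consults the mapping, which is why a missing or wrong one goes unnoticed until someone copies text out.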

bestham · on Aug 6, 2013

OCR by definition outputs text, not binary data that resembles the bitmap of the input image.