Reading without Knowing Symbol Shapes

Conventional text recognition systems rely on a symbol shape classifier trained on huge image databases, yet they still cannot handle unseen fonts. We attempted to abandon shape training entirely and decode the text directly as a cryptogram. Imaging noise, character segmentation errors, and imperfect shape clustering make this task much harder than solving a simple substitution cipher. Our method starts by clustering isolated symbol shapes, makes guesses on the largest clusters, and eventually recovers the symbol identities with constraint satisfaction algorithms. Tested on 200 faxed business letters, the method recognized almost 80% of the words in two-thirds of the cases. (joint work with George Nagy)
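
To make the cryptogram view concrete, here is a minimal Python sketch of the idealized, noise-free case only: each word arrives as a tuple of shape-cluster IDs, and a backtracking search looks for a one-to-one assignment of letters to clusters such that every word is in a lexicon. The names `pattern` and `decode`, the toy lexicon, and the most-constrained-word-first ordering are illustrative assumptions, not the method from the talk, which also has to cope with segmentation errors and imperfect clusters and starts its guesses from the largest clusters.

```python
def pattern(seq):
    """Canonical repetition pattern, e.g. 'hello' -> (0, 1, 2, 2, 3)."""
    first_seen = {}
    return tuple(first_seen.setdefault(s, len(first_seen)) for s in seq)


def decode(cipher_words, lexicon):
    """Return a dict {cluster_id: letter} consistent with the lexicon, or None.

    Assumes (hypothetically) perfect clustering: one cluster per letter,
    one letter per cluster, and no segmentation errors.
    """
    # Index lexicon words by repetition pattern so that only words with
    # a matching shape-repetition structure are ever tried.
    by_pattern = {}
    for word in lexicon:
        by_pattern.setdefault(pattern(word), []).append(word)

    # Deduplicate, then try the words with the fewest candidates first.
    unique = list(dict.fromkeys(cipher_words))
    order = sorted(unique, key=lambda cw: len(by_pattern.get(pattern(cw), [])))

    def search(i, mapping, used):
        if i == len(order):
            return mapping
        cw = order[i]
        for cand in by_pattern.get(pattern(cw), []):
            new_map, new_used = dict(mapping), set(used)
            ok = True
            for cluster, letter in zip(cw, cand):
                if cluster in new_map:
                    if new_map[cluster] != letter:
                        ok = False  # contradicts an earlier guess
                        break
                elif letter in new_used:
                    ok = False  # letter already taken by another cluster
                    break
                else:
                    new_map[cluster] = letter
                    new_used.add(letter)
            if ok:
                result = search(i + 1, new_map, new_used)
                if result is not None:
                    return result
        return None  # no candidate fits: backtrack

    return search(0, {}, set())


if __name__ == "__main__":
    toy_lexicon = ["that", "hat", "at", "the", "tea", "eat"]
    # Cluster IDs standing in for the text "that hat at".
    words = [(1, 2, 3, 1), (2, 3, 1), (3, 1)]
    mapping = decode(words, toy_lexicon)
    print(" ".join("".join(mapping[c] for c in w) for w in words))
```

In this toy setting the repetition pattern of the first word already pins down three clusters, and the remaining words only confirm the assignment. With real scanned text, a single letter may be split across several clusters or several letters merged into one, so a strict one-to-one search like this one would need to be relaxed.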

Related papers: