OCR of books by word recognition

US 20090263019A1
Filed: 04/16/2008
Published: 10/22/2009
Est. Priority Date: 04/16/2008
Status: Active Grant

Abstract

Disclosed embodiments of the invention provide automated global optimization methods and systems of OCR, tailored to each document being digitized. A document-specific database is created from an OCR scan of a document of interest, which contains an exhaustive listing of words in the document. Images of each word, taken from all the fonts encountered, are entered into the database and mapped to a corresponding textual representation. After entry of a first instance of an image of a word written in a particular font, each new occurrence of the word in that font can be quickly recognized by image processing techniques. The disclosed methods and systems may be used in conjunction with adaptive character recognition training and word recognition training of the OCR engines.
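The document-specific dictionary described above can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration, not the patented implementation: word areas are modeled as small binary bitmaps (`Bitmap`), the match test is a pixel-mismatch ratio (`mismatch`) against a hypothetical `threshold`, and codes are plain integers. Mapping a code to its textual representation (by an OCR engine or an operator) is left out.

```python
from typing import List, Tuple

Bitmap = Tuple[Tuple[int, ...], ...]  # binarized word image, rows of 0/1 pixels


def mismatch(a: Bitmap, b: Bitmap) -> float:
    """Fraction of differing pixels; 1.0 when the two shapes differ in size."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        return 1.0
    total = sum(len(row) for row in a)
    diff = sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))
    return diff / total if total else 0.0


def code_document(areas: List[Bitmap], threshold: float = 0.1):
    """Return (codes, dictionary) for a sequence of segmented word areas."""
    dictionary: List[Tuple[Bitmap, int]] = []  # (reference image, code)
    codes: List[int] = []
    for area in areas:
        # Closest reference image seen so far, if any.
        best = min(dictionary, key=lambda e: mismatch(area, e[0]), default=None)
        if best is not None and mismatch(area, best[0]) <= threshold:
            codes.append(best[1])          # identified word: reuse its code
        else:
            new_code = len(dictionary)     # unidentified word: assign a new code
            dictionary.append((area, new_code))
            codes.append(new_code)
    return codes, dictionary
```

After the first instance of a word image is stored, later occurrences in the same font are resolved by image comparison alone. That is the speed-up the abstract describes, here with a deliberately naive comparison.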

20 Claims

1. A computer-implemented method of image-to-text processing, comprising the steps of:
- acquiring an image of a document having words written thereon;
  
  segmenting said image into areas, each area containing one of said words;
  
  using said areas, defining a dictionary containing reference images of said words, which comprise respective sequences of characters in respective fonts, along with respective codes corresponding to said words;
  
  comparing said areas to said reference images and classifying said words in said document that match said reference images as identified words and classifying said words that do not match any of said reference images as unidentified words;
  
  generating respective new codes for one or more of said unidentified words, and adding said one or more of said unidentified words and said respective new codes to said dictionary for use in comparing other said areas of said document; and
  
  outputting a coded version of said document.
  - 2. The method according to claim 1, wherein said words are written in system-recognized fonts and in system-unrecognized fonts, further comprising the steps of:
    - defining in said image first font areas, wherein said words thereof are written in one of said system-recognized fonts, and second font areas wherein said words thereof are written in one of said system-unrecognized fonts;
      
      associating recognition engines and verification dictionaries with said first font areas, respectively, wherein said verification dictionaries are likely to contain said words in respective ones of said first font areas;
      
      executing said recognition engines using said verification dictionaries, respectively, to obtain recognition results, said recognition engines being operative to categorize said words of said first font areas, into a category selected from the group of valid words and invalid words;
      
      authenticating at least a portion of said invalid words, respectively to define authenticated invalid words;
      
      modifying said verification dictionaries to include at least a portion of said authenticated invalid words to establish revised verification dictionaries;
      
      thereafter repeating said steps of executing, authenticating and modifying using said revised verification dictionaries as said verification dictionaries to obtain updated recognition results until a predefined quality level has been achieved; and
      
      reporting said updated recognition results.
  - 3. The method according to claim 2, further comprising the steps of:
    - defining in said image language-specific areas, wherein said words thereof are written in a single language; and
      
      selecting at least a portion of said verification dictionaries from language-specific dictionaries having words of said single language therein.
  - 4. The method according to claim 2, further comprising the steps of:
    - defining in said image domain-specific areas, wherein said words thereof are likely to be specific to a single domain; and
      
      selecting at least a portion of said verification dictionaries from domain-specific dictionaries having words of said single domain therein.
  - 5. The method according to claim 2, wherein said words comprise icons, further comprising the steps of:
    - arranging said icons in said second font areas in clusters according to shape;
      
      classifying said icons in said clusters with human assistance; and
      
      updating said recognition engines responsively to said step of classifying said icons.
  - 6. The method according to claim 2, wherein said recognition engines are operative to recognize said characters individually, further comprising the steps of:
    - categorizing said characters of said first font areas into a category selected from the group of valid characters and invalid characters; and
      
      adding at least a portion of said valid characters to a set of characters used by said recognition engines.
  - 7. The method according to claim 1, wherein at least a portion of said document is written in a first language, and wherein outputting a coded version comprises displaying said words in a second language that differs from said first language.
  - 8. The method according to claim 1, wherein at least a portion of said document is written in a first alphabet, and wherein outputting a coded version comprises displaying said words in a second alphabet that differs from said first alphabet.
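The loop recited in claim 2 (execute the recognition engines, authenticate some invalid words, modify the verification dictionaries, repeat until a quality level is reached) can be sketched as follows. This is a hedged illustration only: `refine`, `recognize`, `authenticate`, `target_quality`, and `max_rounds` are hypothetical names, the quality level is assumed to be the fraction of recognized words found in the dictionary, and the claim itself does not prescribe how authentication is performed.

```python
from collections import Counter
from typing import Callable, List, Set, Tuple


def refine(recognize: Callable[[Set[str]], List[str]],
           dictionary: Set[str],
           authenticate: Callable[[str, int], bool],
           target_quality: float = 0.95,
           max_rounds: int = 10) -> Tuple[List[str], Set[str], float]:
    """Repeat execute/authenticate/modify until the valid-word ratio is met."""
    dictionary = set(dictionary)           # work on a copy of the dictionary
    words: List[str] = []
    quality = 0.0
    for _ in range(max_rounds):
        words = recognize(dictionary)      # execute the recognition engine
        # Words absent from the verification dictionary are "invalid words".
        invalid = Counter(w for w in words if w not in dictionary)
        quality = 1.0 - sum(invalid.values()) / len(words) if words else 1.0
        if quality >= target_quality:
            break
        # Authenticate a portion of the invalid words (word, occurrence count).
        accepted = {w for w, n in invalid.items() if authenticate(w, n)}
        if not accepted:                   # nothing new to learn, so stop
            break
        dictionary |= accepted             # revised verification dictionary
    return words, dictionary, quality
```

A caller might authenticate by frequency, for example `authenticate=lambda word, count: count >= 2`, on the assumption that a string recurring in one document is more likely a real word than a misrecognition. Human review would fit the same callback.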

9. A computer software product for image-to-text processing, including a computer storage medium in which computer program instructions are stored, which instructions, when executed by a computer, cause the computer to acquire an image of a document having words written thereon, segment said image into areas, each area containing one of said words, using said areas, define a dictionary containing reference images of said words, which comprise respective sequences of characters in respective fonts, along with respective codes corresponding to said words, compare said areas to said reference images and classifying said words in said document that match said reference images as identified words and classifying said words that do not match any of said reference images as unidentified words, generate respective new codes for one or more of said unidentified words, and adding said one or more of said unidentified words and said respective new codes to said dictionary for use in comparing other said areas of said document, and output a coded version of said document.
  - 10. The computer software product according to claim 9, wherein said words are written in system-recognized fonts and in system-unrecognized fonts, wherein said instructions further cause said computer to define in said image first font areas, wherein said words thereof are written in one of said system-recognized fonts, and second font areas wherein said words thereof are written in one of said system-unrecognized fonts, associate recognition engines and verification dictionaries with said first font areas, respectively, wherein said verification dictionaries are likely to contain said words in respective ones of said first font areas, and iteratively execute said recognition engines using said verification dictionaries, respectively, to obtain recognition results, said recognition engines being operative to categorize said words of said first font areas, into a category selected from the group of valid words and invalid words, authenticate at least a portion of said invalid words, respectively to define authenticated invalid words, modify said verification dictionaries to include at least a portion of said authenticated invalid words to establish revised verification dictionaries until a predefined quality level has been achieved, and report said recognition results.
  - 11. The computer software product according to claim 10, wherein said instructions further cause said computer to define in said image language-specific areas, wherein said words thereof are written in a single language, and select at least a portion of said verification dictionaries from language-specific dictionaries having words of said single language therein.
  - 12. The computer software product according to claim 10, wherein said instructions further cause said computer to define in said image domain-specific areas, wherein said words thereof are likely to be specific to a single domain, and select at least a portion of said verification dictionaries from domain-specific dictionaries having words of said single domain therein.
  - 13. The computer software product according to claim 10, wherein said words comprise icons, wherein said instructions further cause said computer to arrange said icons in said second font areas in clusters according to shape, classify said icons in said clusters with human assistance, and update said recognition engines responsively to a classification of said icons.
  - 14. The computer software product according to claim 10, wherein said recognition engines are operative to recognize said characters individually, wherein said instructions further cause said computer to categorize said characters of said first font areas into a category selected from the group of valid characters and invalid characters; and add at least a portion of said valid characters to a set of characters used by said recognition engines.
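Claims 5 and 13 recite arranging icons in clusters according to shape and classifying the clusters with human assistance. A minimal sketch of that idea follows, under stated assumptions: icons are flattened binary images of equal size (`Shape`), shape similarity is Hamming distance, clustering is a simple greedy pass, and `ask_human` is a hypothetical callback standing in for the operator. None of these choices comes from the claims, and the engine update step is not shown.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Shape = Tuple[int, ...]  # flattened binary icon image; equal sizes assumed


def hamming(a: Shape, b: Shape) -> int:
    """Number of pixel positions at which two icons differ."""
    return sum(x != y for x, y in zip(a, b))


def cluster_by_shape(icons: Sequence[Shape], max_dist: int) -> List[List[int]]:
    """Greedy clustering: join the first cluster whose representative is close."""
    clusters: List[List[int]] = []       # each cluster holds icon indices
    for i, icon in enumerate(icons):
        for members in clusters:
            if hamming(icon, icons[members[0]]) <= max_dist:
                members.append(i)
                break
        else:
            clusters.append([i])         # no close cluster: start a new one
    return clusters


def label_clusters(icons: Sequence[Shape],
                   clusters: List[List[int]],
                   ask_human: Callable[[Shape], str]) -> Dict[int, str]:
    """Ask once per cluster and propagate the answer to every member."""
    labels: Dict[int, str] = {}
    for members in clusters:
        answer = ask_human(icons[members[0]])
        for i in members:
            labels[i] = answer
    return labels
```

The point of clustering first is that the operator answers once per cluster rather than once per icon. The resulting labels could then serve as training input for the recognition engines.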

15. A data processing system for image-to-text processing, comprising:
- a processor connectable to an optical scanner; and
  
  a memory accessible by said processor storing programs and data objects therein, said processor cooperative with said optical scanner to acquire an image of a document having words written thereon, segment said image into areas, each area containing one of said words, and using said areas, to define a dictionary containing reference images of said words, which comprise respective sequences of characters in respective fonts, along with respective codes corresponding to said words, compare said areas to said reference images and classifying said words in said document that match said reference images as identified words and classifying said words that do not match any of said reference images as unidentified words, generate respective new codes for one or more of said unidentified words, and adding said one or more of said unidentified words and said respective new codes to said dictionary for use in comparing other said areas of said document, and to output a coded version of said document.
  - 16. The data processing system according to claim 15, wherein said programs and said data objects comprise recognition engines and verification dictionaries, and wherein said words are written in system-recognized fonts and in system-unrecognized fonts, wherein said instructions further cause said computer to define in said image first font areas, wherein said words thereof are written in one of said system-recognized fonts, and second font areas wherein said words thereof are written in one of said system-unrecognized fonts, associate said recognition engines and said verification dictionaries with said first font areas, respectively, wherein said verification dictionaries are likely to contain said words in respective ones of said first font areas, and iteratively execute said recognition engines using said verification dictionaries, respectively, to obtain recognition results, said recognition engines being operative to categorize said words of said first font areas, into a category selected from the group of valid words and invalid words, authenticate at least a portion of said invalid words, respectively to define authenticated invalid words, modify said verification dictionaries to include at least a portion of said authenticated invalid words to establish revised verification dictionaries until a predefined quality level has been achieved, and report said recognition results.
  - 17. The data processing system according to claim 16, wherein said processor is operative to define in said image language-specific areas, wherein said words thereof are written in a single language, and select at least a portion of said verification dictionaries from language-specific dictionaries having words of said single language therein.
  - 18. The data processing system according to claim 16, wherein said processor is operative to define in said image domain-specific areas, wherein said words thereof are likely to be specific to a single domain, and select at least a portion of said verification dictionaries from domain-specific dictionaries having words of said single domain therein.
  - 19. The data processing system according to claim 16, wherein said words comprise icons, wherein said processor is operative to arrange said icons in said second font areas in clusters according to shape, classify said icons in said clusters with human assistance, and update said recognition engines responsively to a classification of said icons.
  - 20. The data processing system according to claim 16, wherein said recognition engines are operative to recognize said characters individually, wherein said processor is operative to categorize said characters of said first font areas into a category selected from the group of valid characters and invalid characters; and add at least a portion of said valid characters to a set of characters used by said recognition engines.

Current Assignee: International Business Machines Corporation
Original Assignee: International Business Machines Corporation
Inventors: Walach, Eugeniusz; Tzadok, Asaf
Granted Patent: US 8,014,604 B2
US Class Current: 382/176
CPC Class Codes:
- G06F 18/28: Determining representative ...
- G06V 30/1914: Determining representative ...
- G06V 30/226: of cursive writing
