
Automatic language identification system for multilingual optical character recognition

  • US 6,047,251 A
  • Filed: 09/15/1997
  • Issued: 04/04/2000
  • Est. Priority Date: 09/15/1997
  • Status: Expired due to Term
First Claim

1. A method for automatically determining one or more languages associated with text in a document, comprising the steps of:

    segmenting the document into a plurality of word tokens;

    forming at least one hypothesis of the characters in said word tokens;

    defining a dictionary for each one of plural languages;

    determining confidence factors with respect to said plural languages for said word hypotheses, which factors are based on whether the dictionary for a given language indicates whether a word hypothesis is found in that language;

    defining a plurality of regions in the document, each of which contains at least one word;

    determining a language confidence factor for each region, based upon the confidence factors associated with the words in the region; and

    clustering regions which have relatively high confidence factors for a given language to form a subzone that is identified with the given language.
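
The steps of claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation. It assumes segmentation and OCR have already produced, for each region, a list of word tokens with one or more character hypotheses per token. The toy dictionaries, the function names, the averaging rule, the 0.5 threshold, and the choice to cluster only consecutive regions are all hypothetical choices made for illustration.

```python
from collections import defaultdict

# Hypothetical toy dictionaries; a real system would load full word lists.
DICTIONARIES = {
    "english": {"the", "cat", "sat", "on", "mat"},
    "french": {"le", "chat", "est", "sur", "la", "table"},
}


def word_confidences(hypotheses):
    """Per-language confidence for one word token.

    Assumed rule: the fraction of the token's hypotheses that appear
    in that language's dictionary. `hypotheses` must be non-empty.
    """
    return {
        lang: sum(h.lower() in words for h in hypotheses) / len(hypotheses)
        for lang, words in DICTIONARIES.items()
    }


def region_confidences(region):
    """Per-language confidence for a region.

    A region is a non-empty list of word tokens, each token being a list
    of hypotheses. Assumed rule: the mean of the word-level confidences.
    """
    totals = defaultdict(float)
    for hypotheses in region:
        for lang, score in word_confidences(hypotheses).items():
            totals[lang] += score
    return {lang: total / len(region) for lang, total in totals.items()}


def cluster_subzones(regions, threshold=0.5):
    """Group consecutive regions that share a confident language.

    Returns a list of (language_or_None, [region indices]) pairs.
    Regions whose best score is below the threshold are labeled None.
    """
    subzones = []
    for index, region in enumerate(regions):
        scores = region_confidences(region)
        best = max(scores, key=scores.get)
        label = best if scores[best] >= threshold else None
        if subzones and subzones[-1][0] == label:
            subzones[-1][1].append(index)
        else:
            subzones.append((label, [index]))
    return subzones


# Four regions: two English lines followed by two French lines.
# The second token of the first region carries two competing hypotheses.
regions = [
    [["the"], ["cat", "cot"], ["sat"]],
    [["on"], ["the"], ["mat"]],
    [["le"], ["chat"], ["est"]],
    [["sur"], ["la"], ["table"]],
]
subzones = cluster_subzones(regions)
```

With this input, the first two regions are grouped into one subzone labeled "english" and the last two into one labeled "french". A fuller treatment would cluster regions by their two-dimensional position on the page rather than by reading order alone.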
