Automatic language identification system for multilingual optical character recognition

US 6,047,251 A
Filed: 09/15/1997
Issued: 04/04/2000
Est. Priority Date: 09/15/1997
Status: Expired due to Term

Abstract

The disclosed invention utilizes a dictionary-based approach to identify languages within different zones in a multi-lingual document. As a first step, a document image is segmented into various zones, regions and word tokens, using suitable geometric properties. Within each zone, the word tokens are compared to dictionaries associated with various candidate languages, and the language that exhibits the highest confidence factor is initially identified as the language of the zone. Subsequently, each zone is further split into regions. The language for each region is then identified, using the confidence factors for the words of that region. For any language determination having a low confidence value, the previously determined language of the zone is employed to assist the identification process.
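The zone-level step described in the abstract can be illustrated with a minimal sketch. It assumes each zone has already been reduced to a list of recognized word strings, and it uses the fraction of words found in a language's dictionary as a stand-in for the confidence factor. The `DICTIONARIES` table, the function names, and that scoring rule are illustrative assumptions, not details taken from the patent.

```python
# Hypothetical toy dictionaries; a real system would load full word lists.
DICTIONARIES = {
    "en": {"the", "house", "is", "red", "and"},
    "fr": {"la", "maison", "est", "rouge", "et"},
}


def language_scores(words, dictionaries=DICTIONARIES):
    """Return, per language, the fraction of words found in its dictionary."""
    scores = {lang: 0.0 for lang in dictionaries}
    if not words:
        return scores
    for word in words:
        for lang, vocab in dictionaries.items():
            if word.lower() in vocab:
                scores[lang] += 1.0
    return {lang: hits / len(words) for lang, hits in scores.items()}


def zone_language(words, dictionaries=DICTIONARIES):
    """Pick the language with the highest score as the zone's initial language."""
    scores = language_scores(words, dictionaries)
    best = max(scores, key=scores.get)
    return best, scores[best]


if __name__ == "__main__":
    print(zone_language(["La", "maison", "est", "rouge"]))
    print(zone_language(["the", "house", "est", "red"]))
```

The same scoring function could be reused on the words of a single region, which is how the abstract describes the second pass.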

16 Claims

1. A method for automatically determining one or more languages associated with text in a document, comprising the steps of:
- segmenting the document into a plurality of word tokens;
  
  forming at least one hypothesis of the characters in said word tokens;
  
  defining a dictionary for each one of plural languages;
  
  determining confidence factors with respect to said plural languages for said word hypotheses, which factors are based on whether the dictionary for a given language indicates whether a word hypothesis is found in that language;
  
  defining a plurality of regions in the document, each of which contains at least one word;
  
  determining a language confidence factor for each region, based upon the confidence factors associated with the words in the region; and
  
  clustering regions which have relatively high confidence factors for a given language to form a subzone that is identified with the given language.
- Dependent claims (2, 3, 4, 5, 6, 7, 8)
  - 2. The method of claim 1 wherein a hypothesis is formed only for words having a minimum length of at least two characters.
  - 3. The method of claim 1 wherein said confidence factors for hypothesized words are weighted in accordance with the lengths of the hypothesized words.
  - 4. The method of claim 1 further including the steps of determining a recognition probability for each hypothesis, and weighting said confidence factors in accordance with the recognition probabilities.
  - 5. The method of claim 1 wherein said confidence factors for hypothesized words are weighted in accordance with the frequencies with which the hypothesized words appear in the respective languages.
  - 6. The method of claim 1 wherein said initial hypothesis is formed by means of a classifier that is generic to each of said plural languages.
  - 7. The method of claim 1, wherein a separate dictionary is defined for each of said languages.
  - 8. The method of claim 1, wherein said dictionary is common to a plurality of said languages, and includes information indicating which languages contain words in the dictionary.
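The weighting options in claims 2 through 5 and the shared dictionary of claim 8 can be sketched together. The claims do not say how the weights are combined, so the product of word length, recognition probability, and word frequency used here is an assumption, as are the `SHARED_DICTIONARY` contents and the `word_confidence` name.

```python
# Hypothetical shared dictionary in the style of claim 8:
# word -> {language: relative frequency of the word in that language}.
SHARED_DICTIONARY = {
    "die": {"en": 0.2, "de": 0.9},
    "house": {"en": 0.6},
    "haus": {"de": 0.6},
}

MIN_WORD_LENGTH = 2  # claim 2: skip hypotheses shorter than two characters


def word_confidence(hypothesis, recognition_prob, dictionary=SHARED_DICTIONARY):
    """Return {language: confidence} for one OCR word hypothesis.

    The confidence is an assumed product of three weights:
    word length (claim 3), recognition probability (claim 4),
    and per-language word frequency (claim 5).
    """
    word = hypothesis.lower()
    if len(word) < MIN_WORD_LENGTH:
        return {}
    entry = dictionary.get(word, {})
    return {
        lang: len(word) * recognition_prob * freq
        for lang, freq in entry.items()
    }


if __name__ == "__main__":
    print(word_confidence("die", 1.0))   # ambiguous word, scored for both languages
    print(word_confidence("Haus", 0.5))  # found only in the German entry
    print(word_confidence("a", 0.9))     # too short, no hypothesis scored
```

A word missing from the dictionary yields an empty result, so it contributes nothing to any language's total.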

9. A method for automatically segmenting a document into homogenous language subzones, comprising the steps of:
- defining at least one zone in the document which contains a plurality of words;
  
  defining a dictionary for each one of plural languages;
  
  for each word in the zone, determining a confidence factor with respect to each of said plural languages, which factor is based on whether the respective dictionaries contain the word;
  
  identifying a zone language for the zone, based upon the confidence factors associated with the words in the zone;
  
  selecting a local region in the zone which contains at least one word;
  
  identifying a region language for the local region, based upon the confidence factor associated with the words in the region;
  
  determining whether the region language is the same as the zone language; and
  
  segregating the local region from other regions in the zone if its region language is not the same as the zone language.
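The zone-versus-region comparison in claim 9 might look like the following sketch. It assumes each word already carries a `{language: confidence}` mapping and that a language's score for a zone or region is the sum of its word confidences. The summation, the handling of regions with no scored words, and the function names are assumptions made for illustration.

```python
def best_language(word_scores):
    """Sum per-word {language: confidence} dicts and return the top language."""
    totals = {}
    for scores in word_scores:
        for lang, value in scores.items():
            totals[lang] = totals.get(lang, 0.0) + value
    return max(totals, key=totals.get) if totals else None


def segregate_regions(zone):
    """Split a zone's regions by whether they match the zone language.

    zone: list of regions; each region is a list of per-word score dicts.
    Returns (zone_language, kept_indices, segregated_indices).
    Regions with no scored words are kept with the zone (an assumption).
    """
    all_words = [scores for region in zone for scores in region]
    zone_lang = best_language(all_words)
    kept, segregated = [], []
    for index, region in enumerate(zone):
        region_lang = best_language(region)
        if region_lang is None or region_lang == zone_lang:
            kept.append(index)
        else:
            segregated.append(index)
    return zone_lang, kept, segregated


if __name__ == "__main__":
    zone = [
        [{"en": 1.0}, {"en": 1.0}],  # region 0: English
        [{"fr": 1.0}],               # region 1: French, differs from the zone
        [{"en": 1.0}],               # region 2: English
    ]
    print(segregate_regions(zone))
```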

10. A method for automatically determining one or more languages associated with text in a document, comprising the steps of:
- segmenting the document into a plurality of zones containing regions of word tokens;
  
  forming at least one hypothesis of the characters in said word tokens;
  
  defining a dictionary for each one of plural languages;
  
  for each hypothesized word, determining which ones of said dictionaries contain the hypothesis for the word and determining a confidence value for each language;
  
  identifying a zone language for each zone, based upon the confidence values associated with the words in the zone;
  
  identifying a region language for each region, based upon the confidence values associated with the words in the region;
  
  designating the zone language as the region language if the confidence values associated with the words in the region are not sufficiently high; and
  
  clustering regions in a zone which have the same region language to form a subzone that is identified with a particular language.
- Dependent claims (11, 12, 13, 14)
  - 11. The method of claim 10 wherein a hypothesis is formed only for words having a predetermined minimum number of characters greater than one.
  - 12. The method of claim 10 further including the step of weighting said confidence values in accordance with the lengths of the hypothesized words.
  - 13. The method of claim 10 further including the steps of determining a recognition probability for each hypothesis, and weighting said confidence values in accordance with the recognition probabilities.
  - 14. The method of claim 10 wherein said initial hypothesis is formed by means of a classifier that is generic to each of said plural languages.
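The last two steps of claim 10, falling back to the zone language for low-confidence regions and then clustering regions into subzones, can be sketched as below. The numeric threshold and the choice to merge only consecutive regions in reading order are assumptions. The claim only requires that regions with the same region language be clustered.

```python
from itertools import groupby

CONFIDENCE_THRESHOLD = 0.5  # hypothetical cut-off for "sufficiently high"


def resolve_region_languages(regions, zone_lang, threshold=CONFIDENCE_THRESHOLD):
    """Replace low-confidence region languages with the zone language.

    regions: list of (language, confidence) pairs in reading order.
    """
    return [lang if conf >= threshold else zone_lang for lang, conf in regions]


def cluster_subzones(region_langs):
    """Merge consecutive regions that share a language.

    Returns a list of (language, first_region_index, last_region_index).
    """
    subzones = []
    start = 0
    for lang, run in groupby(region_langs):
        length = len(list(run))
        subzones.append((lang, start, start + length - 1))
        start += length
    return subzones


if __name__ == "__main__":
    regions = [("en", 0.9), ("fr", 0.2), ("fr", 0.8), ("fr", 0.7)]
    langs = resolve_region_languages(regions, zone_lang="en")
    print(langs)
    print(cluster_subzones(langs))
```

In this example the second region's French guess is too weak, so it inherits the zone language and joins the first English subzone.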

15. A method for automatically determining one or more languages associated with text in a document, comprising the steps of:
- segmenting the document into a plurality of word tokens;
  
  forming at least one hypothesis of the characters in said word tokens;
  
  for each word hypothesis, determining a confidence factor that indicates whether the word is contained in each of said plural languages;
  
  defining a plurality of regions in the document, each of which contains at least one word;
  
  determining a language confidence factor for each region, based upon the confidence factors associated with the words in the region; and
  
  clustering regions which have relatively high confidence factors for a given language to form a subzone that is identified with the given language.

16. A method for automatically segmenting a document into homogenous language subzones, comprising the steps of:
- defining at least one zone in the document which contains a plurality of words;
  
  for each word in the zone, determining a confidence factor that indicates whether the word is contained in each of said plural languages;
  
  identifying a zone language for the zone, based upon the confidence factors associated with the words in the zone;
  
  selecting a local region in the zone which contains at least one word;
  
  identifying a region language for the local region, based upon the confidence factor associated with the words in the region;
  
  determining whether the region language is the same as the zone language; and
  
  segregating the local region from other regions in the zone if its region language is not the same as the zone language.

Current Assignee: Nuance Communications, Inc. (Microsoft Corporation)
Original Assignee: Caere Corp. (Mexico) (Microsoft Corporation)
Inventors: Yang, Jun; Kanungo, Tapas; Pon, Leonard K.; Bokser, Mindy R.; Choy, Kenneth Chan
Primary Examiner(s): Isen, Forester W.
Assistant Examiner(s): Edouard, Patrick Nestor
Application Number: US08/929,788
Time in Patent Office: 932 Days
Field of Search: 704/7, 704/9, 704/10, 382/175-177, 382/228, 382/231, 382/229, 382/181
US Class Current: 704/1
CPC Class Codes: G06V 30/242 Division of the character s...
