Method of automatic language identification for multi-lingual text recognition

US 20040006467A1
Filed: 11/29/2002
Published: 01/08/2004
Est. Priority Date: 10/18/2002
Status: Abandoned Application

Abstract

The disclosed invention uses a complex, estimation-based approach to identify the languages of portions of a multi-lingual text recognized from a bit-mapped image. In addition to traditional steps such as document segmentation, the method comprises new steps such as generating and testing hypotheses about the characters in the word tokens.

The method further includes defining a set of selected language models, estimating words via the language models, defining a set of dictionaries for language selection, estimating word correspondence with the chosen languages, and calculating a complex estimation for the word that takes into account most or all of the above-mentioned estimations.

The complex estimation may also include a factor of mutual correspondence of characters and/or words within the line and/or the text, mutual geometric correspondence of characters within the word and/or the line, linguistic correspondence of the word with its neighbors, and an estimation of word-token image reconstruction accuracy in the presence of distortion.
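
As a rough illustration of how a complex estimation could combine several partial estimates into one value, the sketch below uses a weighted average. The application does not give a formula, so the weighted-average rule, the field names in `WordEstimates`, the `[0, 1]` scale, and the weights are all assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass
class WordEstimates:
    """Partial estimates for one word hypothesis, each assumed to lie in [0, 1]."""
    recognition_quality: float          # character recognition quality
    dictionary_conformity: float        # conformity with the chosen dictionaries
    model_conformity: float             # conformity with the language model
    geometric_conformity: float = 1.0   # characters within the word and/or line
    neighbor_conformity: float = 1.0    # linguistic fit with neighboring words


# Illustrative weights only; they are not taken from the application.
WEIGHTS = {
    "recognition_quality": 3,
    "dictionary_conformity": 3,
    "model_conformity": 2,
    "geometric_conformity": 1,
    "neighbor_conformity": 1,
}


def complex_estimation(estimates: WordEstimates, weights=WEIGHTS) -> float:
    """Combine the partial estimates into one score using a weighted average."""
    total_weight = sum(weights.values())
    weighted_sum = sum(getattr(estimates, name) * w for name, w in weights.items())
    return weighted_sum / total_weight
```

Under these assumptions, two competing hypotheses for the same word token can be compared by their combined score, and a hypothesis that is found in a dictionary outranks an otherwise identical one that is not.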

203 Citations

13 Claims

1. A method for automatically determining one or more languages associated with text in a bit-mapped image, comprising the steps of:
- segmenting the image into a plurality of images of word tokens;
- recognizing separate characters in said images of word tokens;
- joining separate characters into groups presumably comprising words;
- forming at least one hypothesis about correspondence of a character group, presumably comprising a word, to a certain language;
- accepting the hypothesis about correspondence of the character group, presumably comprising a word, to a certain language;

wherein said step of forming a hypothesis about correspondence of the character group, presumably comprising a word, to a certain language further comprises at least the following steps: definition of a set of selected language models, and estimation of word correspondence with lingual and non-lingual models.
- Dependent claims:
  - 2. The method of claim 1, wherein the step of recognition of separate characters in said images of word token is performed by a classifier, that is generic to each of said plural languages.
  - 3. The method of claim 1, wherein the step of accepting the hypothesis about correspondence of the characters group, presumably comprising a word, to a certain language further comprises defining a set of dictionaries for the estimation of the word correspondence to a certain language, estimation of the word correspondence with defined dictionaries.
  - 4. The method of claim 3, wherein the defining of a set of dictionaries for the estimation of language correspondence of the text is made manually.
  - 5. The method of claim 3, wherein the defining of a set of dictionaries for the estimation of language correspondence of the text is made automatically.
  - 6. The method of claim 1, wherein the step of accepting the hypothesis about correspondence of the characters group, presumably comprising a word, to a certain language further comprises a calculation of complex estimation, said complex estimation including at least character recognition quality estimation, dictionary conformity estimation, including language models conformity estimation.
  - 7. The method of claim 6, wherein complex estimation further comprises calculation of a special factor of characters mutual correspondence.
  - 8. The method of claim 6, wherein complex estimation further comprises calculation of a special factor of words relative placement.
  - 9. The method of claim 7, wherein complex estimation further comprises a special factor of words correspondence calculation.
  - 10. The method of claim 9, wherein the special factor comprises geometric conformity of characters within the word.
  - 11. The method of claim 9, wherein the special factor comprises geometric conformity of characters within the line.
  - 12. The method of claim 9, wherein the special factor comprises a linguistic correspondence of word with neighbors.
  - 13. The method of claim 9, wherein the special factor includes accuracy estimation of a word reconstruction from token image, and also in the presence of distortion.
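
The hypothesis steps of claims 1 and 3 can be pictured with a toy sketch. It is not the applicant's implementation: the alphabet-based language models, the number pattern used as a non-lingual model, the tiny dictionaries, the equal weighting, and the names `form_hypotheses` and `accept_hypothesis` are all hypothetical choices made for illustration.

```python
import re

# Hypothetical toy language models: each language is described only by its alphabet.
LANGUAGE_ALPHABETS = {
    "english": set("abcdefghijklmnopqrstuvwxyz"),
    "russian": set("абвгдеёжзийклмнопрстуфхцчшщъыьэюя"),
}

# Hypothetical toy dictionaries, one per language (compare claim 3).
DICTIONARIES = {
    "english": {"text", "image", "language"},
    "russian": {"текст", "язык"},
}

# A toy non-lingual model: plain numbers such as "2002" or "10.5".
NON_LINGUAL = re.compile(r"\d+([.,]\d+)?")


def form_hypotheses(word, languages):
    """Return (label, score) hypotheses for a recognized character group, best first."""
    hypotheses = []
    if NON_LINGUAL.fullmatch(word):
        hypotheses.append(("non-lingual", 1.0))
    lowered = word.lower()
    for lang in languages:
        if not lowered:
            continue
        alphabet = LANGUAGE_ALPHABETS[lang]
        # Share of characters that belong to the language's alphabet.
        model_score = sum(ch in alphabet for ch in lowered) / len(lowered)
        # 1.0 if the word is found in the language's dictionary, else 0.0.
        dict_score = 1.0 if lowered in DICTIONARIES[lang] else 0.0
        score = 0.5 * model_score + 0.5 * dict_score
        if score > 0:
            hypotheses.append((lang, score))
    return sorted(hypotheses, key=lambda h: h[1], reverse=True)


def accept_hypothesis(word, languages, threshold=0.5):
    """Accept the best hypothesis if its score reaches the threshold, else None."""
    hypotheses = form_hypotheses(word, languages)
    if hypotheses and hypotheses[0][1] >= threshold:
        return hypotheses[0][0]
    return None
```

For example, `accept_hypothesis("Text", ["english", "russian"])` selects the English hypothesis, while a character group such as `"2002"` is matched by the non-lingual model rather than by any language.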

Current Assignee: ABBYY Software
Original Assignee: ABBYY Software
Inventors: Tereshchenko, Vadim; Anisimovich, Konstantin; Rybkin, Vladimir
Application Number: US10/305,499
Publication Number: US 20040006467A1
Field of Search
US Class Current: 704/251
CPC Class Codes: G06V 30/246 using linguistic properties...
