Automatic language identification by stroke geometry analysis

US 6,064,767 A
Filed: 01/16/1998
Issued: 05/16/2000
Est. Priority Date: 01/16/1998
Status: Expired due to Fees

Abstract

A computer-implemented process identifies an unknown language used to create a document. A set of training documents is defined in a variety of known languages and formed from a variety of text styles. Black and white electronic pixel images are formed of text material forming the training documents and the document in the unknown language. A plurality of line strokes are defined from the black pixels and point features are extracted from the strokes that are effective to characterize each of the languages. Point features from the unknown language are compared with point features from the known languages to identify one of the known languages that best represents the unknown language.
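
The step of defining line strokes from black pixels can be illustrated with a small region-growing sketch. This is only an illustration under stated assumptions: the 8-connected neighborhood, the raster-order choice of seed pixels, and the function name `grow_strokes` are hypothetical and are not taken from the patent, which speaks only of black pixels having "a selected relationship" with the seed pixels.

```python
from collections import deque


def grow_strokes(image):
    """Group black pixels (value 1) into connected regions by region growing.

    Assumptions (not from the patent): a pixel joins a region when it is
    8-connected to a pixel already in it, and each seed is the first
    unvisited black pixel met in raster order.
    """
    rows, cols = len(image), len(image[0])
    seen = [[False] * cols for _ in range(rows)]
    strokes = []
    for r in range(rows):
        for c in range(cols):
            if image[r][c] != 1 or seen[r][c]:
                continue
            # (r, c) is a new seed pixel; grow a region outward from it.
            seen[r][c] = True
            queue = deque([(r, c)])
            pixels = []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and image[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            strokes.append(pixels)
    return strokes


# A tiny made-up binary image: one diagonal stroke and one vertical stroke.
image = [
    [1, 0, 0, 0, 0],
    [0, 1, 0, 0, 1],
    [0, 0, 1, 0, 1],
    [0, 0, 0, 0, 1],
]
strokes = grow_strokes(image)
```

Each entry of `strokes` is a list of `(row, column)` pixel coordinates in the order they were reached from the seed.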

199 Citations

2 Claims

1. A computer automated method for identifying an unknown language used to create a document, including the steps of:

defining a set of training documents in a variety of known languages and formed from a variety of text styles;

forming black and white pixel images of text material defining said training documents and said document in said unknown language;

locating a plurality of seed black pixels from a region growing algorithm;

progressively locating black pixels having a selected relationship with said seed pixels to define a plurality of line stroke segments that connect to form a line stroke;

identifying black pixels to define a head and a tail black pixel for each said line stroke;

extracting point features from said line stroke segments, where the point features include a vertical position and slope of individual line stroke segments, and locally-averaged radius of curvature that are effective to characterize each of said languages;

forming feature profiles from said point features for an unknown language and each of said known languages; and

comparing said feature profile from said unknown language with each of said feature profiles from said known languages to identify one of said known languages that best represents said unknown language.
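
The point features named in claim 1 (vertical position and slope of stroke segments, and a locally-averaged radius of curvature) might be computed along the lines of the sketch below. The definitions here are assumptions, not the patent's: a segment is taken to join two consecutive stroke points, curvature is estimated from the circle through three consecutive points, and local averaging is done on curvature over a sliding window before inverting back to a radius. The names `circumradius`, `point_features`, and `window` are hypothetical.

```python
import math


def circumradius(p, q, r):
    """Radius of the circle through three points; infinite if collinear."""
    a = math.dist(p, q)
    b = math.dist(q, r)
    c = math.dist(p, r)
    # The cross product equals twice the signed area of triangle pqr.
    cross = (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0])
    if cross == 0:
        return math.inf
    # R = abc / (4 * area) with area = |cross| / 2.
    return a * b * c / (2.0 * abs(cross))


def point_features(points, window=3):
    """Compute per-segment and per-point features for one stroke.

    points: (x, y) pixel coordinates ordered from head to tail.
    Returns (segments, radii): one dict per segment, and one locally
    averaged radius of curvature per interior point.
    """
    segments = []
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        segments.append({
            "vertical_position": (y0 + y1) / 2.0,    # segment midpoint height
            "slope": math.atan2(y1 - y0, x1 - x0),   # angle in radians
        })

    # Curvature (1 / radius) at each interior point; 0 for straight runs.
    curvatures = []
    for p, q, r in zip(points, points[1:], points[2:]):
        radius = circumradius(p, q, r)
        curvatures.append(0.0 if math.isinf(radius) else 1.0 / radius)

    # Average curvature over a sliding window, then invert to a radius.
    radii = []
    half = window // 2
    for i in range(len(curvatures)):
        lo = max(0, i - half)
        hi = min(len(curvatures), i + half + 1)
        mean_curvature = sum(curvatures[lo:hi]) / (hi - lo)
        radii.append(math.inf if mean_curvature == 0 else 1.0 / mean_curvature)
    return segments, radii


# Made-up stroke points lying on a circle of radius 5 centered at the origin.
arc = [(5, 0), (3, 4), (0, 5), (-3, 4), (-4, 3)]
arc_segments, arc_radii = point_features(arc)

# Made-up straight diagonal stroke.
line_segments, line_radii = point_features([(0, 0), (1, 1), (2, 2)])
```

Averaging curvature rather than radius is a design choice made here so that straight runs, whose radius is infinite, do not swamp the local average.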

2. A method according to claim 1, wherein said step of comparing said feature profile from said unknown language with said feature profiles from said known language further includes the steps of:

generating from said feature profiles of a number of samples of each said known language a mean profile and a covariance matrix as a measure of profile variability for that language;

determining a Mahalanobis distance between said profile for a document in said unknown language and said mean profile for each said known language; and

selecting said known language having a minimum said Mahalanobis distance to best represent said unknown language.
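
The comparison described in claim 2 can be sketched as follows: estimate a mean profile and covariance matrix per known language, compute the Mahalanobis distance from the unknown profile to each mean, and pick the minimum. This is a minimal sketch, not the patent's implementation. The two-dimensional profiles, the language labels, the sample covariance with an `n - 1` divisor, and all function names are assumptions made for illustration, and the data are made up.

```python
import math


def mean_and_covariance(profiles):
    """Mean profile and sample covariance matrix of a list of profiles."""
    n = len(profiles)
    dim = len(profiles[0])
    mean = [sum(p[i] for p in profiles) / n for i in range(dim)]
    cov = [[sum((p[i] - mean[i]) * (p[j] - mean[j]) for p in profiles) / (n - 1)
            for j in range(dim)] for i in range(dim)]
    return mean, cov


def solve(matrix, vector):
    """Solve matrix * x = vector by Gaussian elimination with partial pivoting."""
    n = len(vector)
    aug = [row[:] + [vector[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[r][k] -= factor * aug[col][k]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def mahalanobis(profile, mean, cov):
    """sqrt(d^T * cov^-1 * d), where d = profile - mean."""
    diff = [a - b for a, b in zip(profile, mean)]
    weighted = solve(cov, diff)  # cov^-1 * d without forming the inverse
    return math.sqrt(sum(d * w for d, w in zip(diff, weighted)))


def identify_language(profile, training):
    """Return the known language whose mean profile is nearest."""
    stats = {lang: mean_and_covariance(samples)
             for lang, samples in training.items()}
    return min(stats, key=lambda lang: mahalanobis(profile, *stats[lang]))


# Made-up two-feature profiles for two hypothetical known languages.
training = {
    "language_a": [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    "language_b": [[5.0, 5.0], [6.0, 5.0], [5.0, 6.0], [6.0, 6.0]],
}
best = identify_language([0.5, 1.5], training)
```

Each language needs enough samples for its covariance matrix to be non-singular; `solve` raises `ValueError` otherwise.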

Current Assignee: Los Alamos National Security LLC (Government of the United States of America)
Original Assignee: Regents of the University of California (University of California)
Inventors: Muir, Douglas W.; Thomas, Timothy R.
Primary Examiner(s): MEHTA, BHAVESH M
Application Number: US09/008,225
Time in Patent Office: 851 Days
Field of Search: 382/190, 382/203, 382/224, 382/228, 382/229, 382/177, 382/192, 382/225, 382/218, 382/201, 707/500, 707/536
US Class Current: 382/190
CPC Class Codes: G06V 30/2445 Alphabet recognition, e.g. ...
