Method of identifying script of line of text

US 7,020,338 B1
Filed: 04/08/2002
Issued: 03/28/2006
Est. Priority Date: 04/08/2002
Status: Expired due to Fees

Abstract

A method of identifying the script of a line of text. A weight is first assigned to each n-gram in a group of documents of known scripts, where each n-gram is a sequence of numbers representing the k-mean cluster centroids of a known script that character segments in the documents of known scripts most closely match. A line of text made up of pixels is identified. The identified line of text is cropped so that only a percentage of the pixels remain. The cropped line is vertically and horizontally rescaled into gray-scale pixels. The vertical gray-scale pixels are replaced with the sequence number of the k-mean cluster centroid of a known script that they most closely match. The n-grams of the number sequence that represents the line of text are scored against the n-gram weights of the documents of known scripts. The highest score of the line of text is identified and compared to the scores of the documents of known scripts. The script of the line of text is determined to be the script of the document against which the line of text scores the highest.

17 Claims

1. A method of script identification, comprising the steps of:
- (a) assigning a weight for each of a user-definable number of n-grams in a user-definable number of documents of known scripts, where each of the user-definable number of documents of known scripts is assigned a score equal to the sum of the weights of the n-grams contained therein;
  
  (b) identifying a line of text in a document of unknown script, where the line of text includes pixels;
  
  (c) cropping the line of text identified in step (b);
  
  (d) rescaling the line of text cropped in step (c);
  
  (e) replacing the line of text rescaled in step (d) with at least one number associated with k-mean cluster centroids of script components to which at least one portion of the line of text most closely matches;
  
  (f) scoring the line of text replaced in step (e) against the user-definable number of documents of known scripts using the n-gram weights assigned in step (a);
  
  (g) identifying the highest score attained in step (f);
  
  (h) identifying the user-definable document of known script against which the highest score in step (f) was attained;
  
  (i) declaring the line of text identified in step (b) as having been written in the script identified in step (h); and
  
  (j) returning to step (b) if another line of text of unknown script is desired to be processed.
- - 2. The method of claim 1, wherein said step of assigning a weight for a user-definable number of n-gram in a user-definable number of documents of known scripts is comprised of the steps of:
    - (a) identifying a user-definable number of documents of known scripts;
      
      (b) selecting one of said user-definable number of documents identified in step (a);
      
      (c) identifying a line of text in the document selected in step (b);
      
      (d) cropping the line of text identified in step (c);
      
      (e) rescaling the line of text cropped in step (d);
      
      (f) replacing the line of text rescaled in step (e) with at least one number associated with k-mean cluster centroids of script components to which at least one portion of the line of text most closely matches;
      
      (g) identifying every n-gram in the result of step (f);
      
      (h) weighting each n-gram in the result of step (g);
      
      (i) returning to step (c) if another line of text is desired to be processed;
      
      (j) returning to step (b) if another document is desired to be processed;
      
      (k) identifying, for each of the user-definable number of documents of known scripts, each set of n-grams that are shared between the document and each of the user-definable number of documents of known scripts;
      
      (l) summing the weights of the n-grams in each set identified in step (k);
      
      (m) assigning the results of step (l) to the corresponding document of known script as its scores; and
      
      (n) if one of the user-definable number of documents does not receive its highest score in step (m) against a document of like script then reducing the contributions of each n-gram weight to the scores of the one of said user-definable number of documents by a user-definable amount and returning to step (k) for additional processing, otherwise stopping.
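Steps (k) through (n) of claim 2 can be illustrated with a minimal sketch. It assumes each document is already reduced to a set of n-grams, that `doc_weights[b][g]` is the weight of n-gram `g` under reference document `b`, and that a document is not scored against itself; none of these details is fixed by the claim, and all names are hypothetical. The weight reduction of step (n) is not shown because the claim leaves the amount user-definable; the sketch only detects the documents that would trigger it.

```python
def pairwise_scores(doc_ngrams, doc_weights):
    """Score every document against every other document (steps k to m).

    doc_ngrams:  {doc_name: set of n-gram tuples found in that document}
    doc_weights: {doc_name: {n-gram tuple: weight}}  (hypothetical layout)
    """
    scores = {}
    for a, grams_a in doc_ngrams.items():
        scores[a] = {}
        for b, grams_b in doc_ngrams.items():
            if a == b:
                continue  # assumption: skip self-comparison
            shared = grams_a & grams_b  # step (k): shared n-grams
            # steps (l) and (m): sum the weights of the shared n-grams
            scores[a][b] = sum(doc_weights[b][g] for g in shared)
    return scores


def misclassified(scores, doc_script):
    """Return documents whose best-scoring partner has a different script.

    These are the documents for which step (n) would reduce weight
    contributions and repeat from step (k).
    """
    bad = []
    for a, row in scores.items():
        best = max(row, key=row.get)
        if doc_script[best] != doc_script[a]:
            bad.append(a)
    return bad
```

An empty result from `misclassified` corresponds to the "otherwise stopping" branch of step (n).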
  - 3. The method of claim 1, wherein said step of cropping the line of text identified in step (b) is comprised of the steps of:
    - (a) deskewing the line of text;
      
      (b) producing a horizontal histogram of the pixels in the line of text, where each entry in the horizontal histogram is a sum of the pixels in a corresponding row of pixels in the line of text; and
      
      (c) selecting the lines of pixels in the line of text that represent approximately a user-definable percentage of the sum of pixels in the horizontal histogram.
  - 4. The method of claim 3, wherein the step of selecting the lines of pixels in the line of text that represent approximately a user-definable percentage of the sum of pixels in the horizontal histogram is comprised of the step of selecting the lines of pixels in the line of text that represent approximately ninety-five percent of the sum of pixels in the horizontal histogram.
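The histogram-based cropping of claims 3 and 4 can be sketched as follows. This is an illustration only: it assumes the line image is a list of rows of 0/1 pixel values (1 = ink), that deskewing has already been done, and that rows are trimmed greedily from the top and bottom. The claims do not say how the rows are selected, so the trimming strategy and the name `crop_line` are assumptions.

```python
def crop_line(rows, keep_fraction=0.95):
    """Crop a text-line image to the band of rows holding ~keep_fraction of the ink.

    rows: list of equal-length lists of pixel values (0 = background, 1 = ink).
    """
    # Horizontal histogram: one entry per pixel row, the sum of that row.
    histogram = [sum(row) for row in rows]
    total = sum(histogram)
    if total == 0:
        return rows  # blank line: nothing to crop

    top, bottom = 0, len(rows) - 1
    kept = total
    # Greedy trimming (an assumed strategy): drop the lighter outer row
    # while the remaining band still holds at least keep_fraction of the ink.
    while top < bottom:
        edge = top if histogram[top] <= histogram[bottom] else bottom
        if kept - histogram[edge] < keep_fraction * total:
            break
        kept -= histogram[edge]
        if edge == top:
            top += 1
        else:
            bottom -= 1
    return rows[top:bottom + 1]
```

With the default `keep_fraction=0.95`, blank margins and faint ascender or descender rows are dropped while the main body of the line is retained.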
  - 5. The method of claim 1, wherein said step of rescaling the line of text cropped in step (c) is comprised of the step of dividing the line of text into a user-definable number of vertical gray-scale pixels and a user-definable number of horizontal gray-scale pixels so that an aspect ratio of the line of text is maintained.
  - 6. The method of claim 5, wherein the step of dividing the line of text into a user-definable number of vertical gray-scale pixels and a user-definable number of horizontal gray-scale pixels so that an aspect ratio of the line of text is maintained is comprised of the step of dividing the line of text into eight vertical gray-scale pixels and a user-definable number of horizontal gray-scale pixels so that an aspect ratio of the line of text is maintained.
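The rescaling of claims 5 and 6 can be sketched with simple box averaging. The sketch assumes the cropped line is a list of rows of 0/1 pixels and that each gray-scale output pixel is the mean of the source pixels it covers; the claims do not name a resampling method, so the averaging and the function name `rescale_line` are assumptions.

```python
def rescale_line(rows, out_height=8):
    """Rescale a line image to out_height gray-scale rows, keeping the aspect ratio."""
    in_h, in_w = len(rows), len(rows[0])
    # Keep the aspect ratio: the width scales by the same factor as the height.
    out_w = max(1, round(in_w * out_height / in_h))

    def span(k, n_out, n_in):
        # Source index range covered by output cell k (always at least one pixel).
        start = k * n_in // n_out
        stop = max(start + 1, (k + 1) * n_in // n_out)
        return range(start, min(stop, n_in))

    out = []
    for r in range(out_height):
        out_row = []
        for c in range(out_w):
            cells = [rows[y][x]
                     for y in span(r, out_height, in_h)
                     for x in span(c, out_w, in_w)]
            # Gray level = fraction of ink in the covered source box.
            out_row.append(sum(cells) / len(cells))
        out.append(out_row)
    return out
```

The default of eight output rows follows claim 6; the number of output columns then follows from the aspect ratio of the input.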
  - 7. The method of claim 5, wherein said step of replacing the line of text rescaled in step (d) with at least one number associated with k-mean cluster centroid to which at least one portion of the line of text most closely matches is comprised of the steps of:
    - (a) comparing each of the user-definable number of vertical gray-scale pixels to a user-definable number of k-mean cluster centroids, where each of the user-definable number of k-mean cluster centroids has a unique number; and
      
      (b) assigning each of said user-definable number of vertical gray-scale pixels the unique number of the k-mean cluster centroid to which it best matches.
  - 8. The method of claim 7, wherein said step of comparing each of the user-definable number of vertical gray-scale pixels to a user-definable number of k-mean cluster centroids is comprised of the step of comparing each of the user-definable number of vertical gray-scale pixels to a user-definable number of k-mean cluster centroids, where the user-definable number of k-mean cluster centroids are k-mean cluster centroids of a user-definable sample of Latin script.
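The centroid matching of claims 7 and 8 can be sketched as a nearest-centroid lookup. The sketch assumes that the unit being compared is one column of the rescaled image (eight gray-scale values), that "best matches" means smallest squared Euclidean distance, and that each centroid's unique number is its index in the list. These are illustrative choices; the claims do not fix the distance measure.

```python
def quantize_columns(scaled_rows, centroids):
    """Replace each column of gray-scale pixels with the number of its nearest centroid.

    scaled_rows: list of rows (e.g. 8) of gray-scale values in [0, 1].
    centroids:   list of vectors, one value per row; the index is the centroid number.
    """
    n_cols = len(scaled_rows[0])
    symbols = []
    for c in range(n_cols):
        column = [row[c] for row in scaled_rows]
        # Assumed distance measure: squared Euclidean distance.
        best = min(
            range(len(centroids)),
            key=lambda k: sum((a - b) ** 2 for a, b in zip(column, centroids[k])),
        )
        symbols.append(best)
    return symbols
```

The returned list of centroid numbers is the number sequence from which n-grams are later taken.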
  - 9. The method of claim 1, wherein said step of scoring the line of text replaced in step (e) against the user-definable number of documents of known scripts using the n-gram weights assigned in step (a) is comprised of the steps of:
    - (a) identifying each n-gram in the result of step (e);
      
      (b) comparing each n-gram identified in step (a) against the n-grams of each of the user-definable number of documents of known scripts on a per document basis;
      
      (c) accumulating the weights of each n-gram in the user-definable number of documents for which a match occurred in step (b) on a per document basis; and
      
      (d) for each document, assigning the result of step (c) as the score of the line of text replaced in step (e) with respect to the document.
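The scoring steps of claim 9, together with steps (g) through (i) of claim 1, can be sketched as below. The sketch assumes n-grams are tuples of centroid numbers, that each distinct n-gram of the line is counted once per document, and that `doc_weights` and `doc_script` are plain dictionaries; these names and the choice of n = 3 are hypothetical.

```python
def ngrams(symbols, n=3):
    """Return every run of n consecutive centroid numbers as a tuple."""
    return [tuple(symbols[i:i + n]) for i in range(len(symbols) - n + 1)]


def score_line(symbols, doc_weights, n=3):
    """Score a quantized line against each document of known script.

    doc_weights: {doc_name: {n-gram tuple: weight}}  (hypothetical layout)
    """
    line_grams = set(ngrams(symbols, n))  # assumption: distinct n-grams only
    # Accumulate, per document, the weights of the n-grams that match.
    return {
        doc: sum(w for g, w in weights.items() if g in line_grams)
        for doc, weights in doc_weights.items()
    }


def identify_script(symbols, doc_weights, doc_script, n=3):
    """Declare the script of the document that yields the highest score."""
    scores = score_line(symbols, doc_weights, n)
    best_doc = max(scores, key=scores.get)
    return doc_script[best_doc], scores
```

Returning the full score dictionary alongside the declared script makes it easy to inspect how close the runner-up documents were.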
  - 10. The method of claim 2, wherein said step of cropping the line of text identified in step (b) is comprised of the steps of:
    - (a) deskewing the line of text;
      
      (b) producing a horizontal histogram of the pixels in the line of text, where each entry in the horizontal histogram is a sum of the pixels in a corresponding row of pixels in the line of text; and
      
      (c) selecting the lines of pixels in the line of text that represent approximately a user-definable percentage of the sum of pixels in the horizontal histogram.
  - 11. The method of claim 10, wherein the step of selecting the lines of pixels in the line of text that represent approximately a user-definable percentage of the sum of pixels in the horizontal histogram is comprised of the step of selecting the lines of pixels in the line of text that represent approximately ninety-five percent of the sum of pixels in the horizontal histogram.
  - 12. The method of claim 11, wherein said step of rescaling the line of text cropped in step (c) is comprised of the step of dividing the line of text into a user-definable number of vertical gray-scale pixels and a user-definable number of horizontal gray-scale pixels so that an aspect ratio of the line of text is maintained.
  - 13. The method of claim 12, wherein the step of dividing the line of text into a user-definable number of vertical gray-scale pixels and a user-definable number of horizontal gray-scale pixels so that an aspect ratio of the line of text is maintained is comprised of the step of dividing the line of text into eight vertical gray-scale pixels and a user-definable number of horizontal gray-scale pixels so that an aspect ratio of the line of text is maintained.
  - 14. The method of claim 13, wherein said step of replacing the line of text rescaled in step (d) with at least one number associated with k-mean cluster centroid to which at least one portion of the line of text most closely matches is comprised of the steps of:
    - (a) comparing each of the user-definable number of vertical gray-scale pixels to a user-definable number of k-mean cluster centroids, where each of the user-definable number of k-mean cluster centroids has a unique number; and
      
      (b) assigning each of said user-definable number of vertical gray-scale pixels the unique number of the k-mean cluster centroid to which it best matches.
  - 15. The method of claim 14, wherein said step of comparing each of the user-definable number of vertical gray-scale pixels to a user-definable number of k-mean cluster centroids is comprised of the step of comparing each of the user-definable number of vertical gray-scale pixels to a user-definable number of k-mean cluster centroids, where the user-definable number of k-mean cluster centroids are k-mean cluster centroids of a user-definable sample of Latin script.
  - 16. The method of claim 15, wherein said step of scoring the line of text replaced in step (e) against the user-definable number of documents of known scripts using the n-gram weights assigned in step (a) is comprised of the steps of:
    - (a) identifying each n-gram in the result of step (e);
      
      (b) comparing each n-gram identified in step (a) against the n-grams of each of the user-definable number of documents of known scripts on a per document basis;
      
      (c) accumulating the weights of each n-gram in the user-definable number of documents for which a match occurred in step (b) on a per document basis; and
      
      (d) assigning the result of step (c) as the score of the line of text replaced in step (f).
  - 17. The method of claim 2, wherein said step of weighting each n-gram in the result of step (g) is comprised of the step of calculating $W_{j} = \left((1 / N_{j}) \sum_{i} G_{ij}\right) / \left(\sum_{j} (1 / N_{j}) \sum_{i} G_{ij}\right)$, where $W_{j}$ is the n-gram weight for script j, where $G_{ij}$ is a normalized frequency of occurrence of n-gram G in line i of script j, and where $N_{j}$ is a total number of lines in script j.
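The weight formula of claim 17 can be worked through in a short sketch. It assumes that the normalized frequency of an n-gram in a line is its count divided by the number of n-grams in that line (the claim does not define the normalization) and that the input is a dictionary mapping each script to its lines, each line being a list of n-gram tuples. The function name `ngram_weights` is hypothetical.

```python
from collections import Counter


def ngram_weights(lines_by_script):
    """Compute W_j for every n-gram G and script j.

    lines_by_script: {script: [list of n-gram tuples for each line]}
    Returns {script: {n-gram: weight}}.
    """
    # Numerator: (1 / N_j) * sum_i G_ij, the mean normalized frequency per script.
    mean_freq = {}
    for script, lines in lines_by_script.items():
        totals = Counter()
        for grams in lines:
            for g, c in Counter(grams).items():
                totals[g] += c / len(grams)  # assumed normalization of G_ij
        mean_freq[script] = {g: t / len(lines) for g, t in totals.items()}

    # Denominator: the same quantity summed over all scripts.
    weights = {script: {} for script in mean_freq}
    all_grams = set().union(*mean_freq.values())
    for g in all_grams:
        denom = sum(m.get(g, 0.0) for m in mean_freq.values())
        for script, m in mean_freq.items():
            if g in m:
                weights[script][g] = m[g] / denom
    return weights
```

Under these assumptions, an n-gram seen in only one script gets weight 1 for that script, and an n-gram spread across scripts has weights that sum to 1 over the scripts.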

Current Assignee: National Security Agency
Original Assignee: The United States of America As Represented By The Director National Security Agency
Inventors: Cumbee, Carson S.
Primary Examiner(s): Mehta, Bhavesh M.
Assistant Examiner(s): Hung, Yubin
Application Number: US10/117,896
Time in Patent Office: 1,450 Days
Field of Search: 382/160, 382/174, 382/203, 382/229, 382/230, 382/296, 382/298, 704/8
US Class Current: 382/230
CPC Class Codes: G06F 40/263 Language identification; G06V 30/245 Font recognition
