Method of identifying script of line of text
First Claim
1. A method of script identification, comprising the steps of:
(a) assigning a weight for each of a user-definable number of n-grams in a user-definable number of documents of known scripts, where each of the user-definable number of documents of known scripts is assigned a score equal to the sum of the weights of the n-grams contained therein;
(b) identifying a line of text in a document of unknown script, where the line of text includes pixels;
(c) cropping the line of text identified in step (b);
(d) rescaling the line of text cropped in step (c);
(e) replacing the line of text rescaled in step (d) with at least one number associated with k-mean cluster centroids of script components to which at least one portion of the line of text most closely matches;
(f) scoring the line of text replaced in step (e) against the user-definable number of documents of known scripts using the n-gram weights assigned in step (a);
(g) identifying the highest score attained in step (f);
(h) identifying the user-definable document of known script against which the highest score in step (f) was attained;
(i) declaring the line of text identified in step (b) as having been written in the script identified in step (h); and
(j) returning to step (b) if another line of text of unknown script is desired to be processed.
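Steps (f) through (i) of the claim can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the claim does not say how the weights in step (a) are computed, so the sketch assumes they already exist as a per-script dictionary mapping each n-gram (a tuple of centroid numbers) to a weight. The names `ngrams`, `score_line`, `identify_script`, and the script labels `script_A` and `script_B` are all hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

NGram = Tuple[int, ...]


def ngrams(symbols: Sequence[int], n: int) -> List[NGram]:
    """Return every contiguous run of n centroid numbers in the line."""
    return [tuple(symbols[i:i + n]) for i in range(len(symbols) - n + 1)]


def score_line(symbols: Sequence[int], weights: Dict[NGram, float], n: int) -> float:
    """Step (f): sum the weights of the line's n-grams for one known script.

    N-grams absent from the weight table contribute 0 (an assumption).
    """
    return sum(weights.get(g, 0.0) for g in ngrams(symbols, n))


def identify_script(
    symbols: Sequence[int],
    weights_by_script: Dict[str, Dict[NGram, float]],
    n: int = 2,
) -> Tuple[str, Dict[str, float]]:
    """Steps (g)-(i): pick the known script with the highest score."""
    scores = {
        script: score_line(symbols, weights, n)
        for script, weights in weights_by_script.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical weights for two known scripts, keyed by bigrams of centroid numbers.
weights_by_script = {
    "script_A": {(1, 2): 1.0, (2, 3): 0.5},
    "script_B": {(4, 5): 2.0},
}
line = [1, 2, 3, 4]  # centroid numbers produced by step (e)
best, scores = identify_script(line, weights_by_script, n=2)
```

Step (j) corresponds to calling `identify_script` once per line of text; the weight tables are built once in step (a) and reused.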
Abstract
A method of identifying the script of a line of text by first assigning a weight to each n-gram in a group of documents of known scripts, where each n-gram is a sequence of numbers representing the k-means cluster centroids of a known script that character segments in the documents of known scripts most closely match. A line of text is identified, where the line of text is made up of pixels. The identified line of text is cropped so that only a percentage of the pixels remain. The cropped line is vertically and horizontally rescaled into gray-scale pixels. The vertical gray-scale pixels are replaced with the sequence number of the k-means cluster centroid of a known script that they most closely match. The n-grams of the number sequence that represents the line of text are scored against the n-gram weights of the documents of known text. The highest score of the line of text is identified and compared to the scores of the documents of known scripts. The script of the line of text is determined to be the script of the document against which the line of text scores the highest.
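The replacement of gray-scale pixels with centroid numbers can be illustrated with a short Python sketch. It rests on several assumptions not stated in the abstract: that "vertical gray-scale pixels" means each pixel column of the rescaled line, that the closest match is measured by squared Euclidean distance, and that the k-means centroids were trained beforehand. The function names and the sample data are hypothetical.

```python
from typing import List, Sequence


def nearest_centroid(
    column: Sequence[float], centroids: Sequence[Sequence[float]]
) -> int:
    """Return the index of the centroid closest to one pixel column.

    Squared Euclidean distance is an assumed metric; ties go to the
    lowest index.
    """
    def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(range(len(centroids)), key=lambda i: sq_dist(column, centroids[i]))


def quantize_line(
    image: Sequence[Sequence[float]], centroids: Sequence[Sequence[float]]
) -> List[int]:
    """Replace each pixel column of a rescaled line with a centroid number.

    `image` is a list of rows of gray-scale values; zip(*image) turns it
    into a sequence of columns.
    """
    return [nearest_centroid(col, centroids) for col in zip(*image)]


# A tiny 2-row by 3-column "line" and two made-up centroids (dark, bright).
image = [
    [0, 255, 0],
    [0, 255, 10],
]
centroids = [(0, 0), (255, 255)]
symbols = quantize_line(image, centroids)
```

The resulting list of centroid numbers is the number sequence whose n-grams are then scored against the weights of the known scripts.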
17 Claims
Specification