Method and apparatus for character recognition using stop words
Abstract
An adaptive OCR technique for character classification and recognition that does not require the input or use of ground truth derived from the image itself. A set of so-called stop words is employed for classifying symbols, e.g., characters, from any image. The stop words are identified independently of any particular image and are used for classification across any set of images in the same language, e.g., English. Advantageously, an adaptive OCR method is realized without requiring ground truth to be selected and input from each individual image to be recognized.
9 Claims
-
1. A method for training an image classifier, the method comprising:
-
identifying a plurality of stop words, each stop word being from a same language and having an associated definition in such language, the plurality of stop words being identified as a function of a linguistic model and the plurality of stop words having an expected recognition coverage level associated therewith, wherein the plurality of stop words is limited to the following stop words:
a, about, after, all, also, an, and, any, are, as, at, back, be, because, been, before, being, between, both, but, by, can, could, day, did, do, down, each, even, first, for, from, get, good, had, has, have, he, her, here, him, his, how, I, if, in, into, is, it, its, just, know, life, like, little, long, made, make, man, many, may, me, men, more, most, Mr., much, must, my, never, new, no, not, now, of, old, on, one, only, or, other, our, out, over, own, people, said, same, see, she, should, so, some, state, still, such, than, that, the, their, them, then, there, these, they, this, those, three, through, time, to, too, two, under, up, very, was, way, we, well, were, what, when, where, which, who, will, with, work, world, would, year, years, you, and your;
comparing the plurality of stop words to a plurality of individual words in an input image, each stop word and each individual word being treated as a separate symbol during the comparing;
identifying matches between particular ones of the stop words and particular ones of the individual words of the input image, wherein each particular stop word matches a same particular individual word throughout the input image, to form a plurality of recognized words;
segmenting the plurality of recognized words to form a plurality of character prototypes; and
training the image classifier using the plurality of character prototypes to recognize at least one character from the input image.
-
-
2. The method of claim 1 further comprising:
-
constructing a stop word classifier using a decision forest; and
applying the stop word classifier to the input image.
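As an illustration of the "expected recognition coverage level" recited in claim 1, the following minimal Python sketch measures what fraction of the word tokens in a text belong to the claimed stop-word list. Reading coverage as a token fraction is an assumption, as are the names `STOP_WORDS` and `coverage` and the letters-only tokenizer; none of these come from the claims.

```python
import re

# Stop-word list copied from claim 1. For matching, entries are lowercased
# and the trailing period of "Mr." is dropped (a simplification of this sketch).
_CLAIMED = """
a about after all also an and any are as at back be because been before being
between both but by can could day did do down each even first for from get good
had has have he her here him his how I if in into is it its just know life like
little long made make man many may me men more most Mr. much must my never new
no not now of old on one only or other our out over own people said same see she
should so some state still such than that the their them then there these they
this those three through time to too two under up very was way we well were what
when where which who will with work world would year years you your
""".split()

STOP_WORDS = frozenset(w.lower().rstrip(".") for w in _CLAIMED)


def coverage(text):
    """Return the fraction of word tokens in `text` that are stop words.

    Hypothetical helper: tokens are maximal runs of ASCII letters,
    compared case-insensitively. Returns 0.0 for text with no tokens.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in STOP_WORDS)
    return hits / len(tokens)


if __name__ == "__main__":
    sample = "The cat sat on the mat."
    print(f"coverage = {coverage(sample):.2f}")
```

In the sample sentence, three of the six tokens ("The", "on", "the") are in the list, so half of the words would be candidates for whole-word matching before any character has been recognized.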
-
-
3. The method of claim 2 wherein the segmenting the plurality of recognized words operation includes:
-
aligning particular ones of the recognized words; and
extracting a plurality of common character prototypes as a function of the aligning between the particular ones of the recognized words.
-
-
4. The method of claim 3 wherein the extracting the plurality of common character prototypes operation further comprises:
-
estimating a width for each character of the particular ones of the recognized words; and
shifting and matching at least one pair of recognized words using the estimated width for each character of the particular ones of the recognized words of the pair.
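The width estimation and shift-and-match steps of claims 3 and 4 can be sketched on toy data as follows. This is only an illustration under strong assumptions: word images are lists of equal-length strings of "0"/"1" pixels, every character in a word is given the same estimated width (image width divided by letter count), and similarity is plain pixel agreement. The names `estimate_char_width`, `cut`, `agreement`, `shift_and_match`, `GLYPHS`, and `render` are hypothetical and do not appear in the claims.

```python
# Toy 3x3 glyphs used only to build example word images.
GLYPHS = {
    "t": ["111", "010", "010"],
    "h": ["101", "111", "101"],
    "e": ["111", "110", "111"],
    "a": ["010", "111", "101"],
}


def render(word, pad=0):
    """Build a toy word image, optionally with `pad` blank columns on the left."""
    return ["0" * pad + "".join(GLYPHS[c][r] for c in word) for r in range(3)]


def estimate_char_width(image, word):
    """Uniform per-character width estimate: image width // letter count."""
    return len(image[0]) // len(word)


def cut(image, start, width):
    """Return the sub-image covering columns [start, start + width)."""
    return [row[start:start + width] for row in image]


def agreement(patch_a, patch_b):
    """Fraction of pixel positions at which two equal-sized patches agree."""
    pairs = [(p, q) for ra, rb in zip(patch_a, patch_b) for p, q in zip(ra, rb)]
    return sum(p == q for p, q in pairs) / len(pairs)


def shift_and_match(img_a, word_a, img_b, word_b, char, max_shift=1):
    """Locate the first `char` in both recognized words from the estimated
    widths, then slide the window in image B by up to `max_shift` columns
    and return (best agreement score, shift that achieved it)."""
    width_a = estimate_char_width(img_a, word_a)
    width_b = estimate_char_width(img_b, word_b)
    width = min(width_a, width_b)
    patch_a = cut(img_a, word_a.index(char) * width_a, width)
    start_b = word_b.index(char) * width_b
    best_score, best_shift = -1.0, 0
    for shift in range(-max_shift, max_shift + 1):
        start = start_b + shift
        if start < 0 or start + width > len(img_b[0]):
            continue  # window would fall outside image B
        score = agreement(patch_a, cut(img_b, start, width))
        if score > best_score:
            best_score, best_shift = score, shift
    return best_score, best_shift


if __name__ == "__main__":
    # "the" and "that" share the character "h".
    print(shift_and_match(render("the"), "the", render("that"), "that", "h"))
    # A blank left margin in the second image is absorbed by the shift.
    print(shift_and_match(render("the"), "the", render("that", pad=1), "that", "h"))
```

The shift search is what lets the estimated widths be approximate: a window that is off by a column can still be moved onto the shared character, and the best-matching patch is then a candidate common character prototype.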
-
-
5. The method of claim 3 wherein the input image is a printed document.
-
6. The method of claim 5 wherein the image classifier is part of an optical character recognition system.
-
7. The method of claim 2 wherein the constructing the stop word classifier operation further comprises:
-
generating a plurality of synthetic words corresponding to a particular stop word; and
identifying a feature vector for the particular stop word as a function of the corresponding plurality of synthetic words, and using the feature vector to construct the stop word classifier.
-
-
8. The method of claim 7 wherein the feature vector is a combination of a binary subsamples vector, a pixel correlation vector, a vertical runs count vector, and a horizontal runs count vector.
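To make the feature vector of claim 8 more concrete, here is a minimal Python sketch that computes three of the four named components for a small binary word image and concatenates them. The definitions are assumptions, since the claims do not spell them out: the binary subsample keeps every `step`-th pixel, and a run count is the number of maximal runs of foreground (1) pixels in a column or row. The pixel correlation vector is omitted because its definition is not given in this section. All function names are hypothetical.

```python
def count_runs(pixels):
    """Count maximal runs of consecutive foreground (non-zero) pixels."""
    runs, previous = 0, 0
    for pixel in pixels:
        if pixel and not previous:
            runs += 1  # a new foreground run starts here
        previous = pixel
    return runs


def binary_subsample(image, step=2):
    """Keep every `step`-th pixel in both directions, flattened row by row."""
    return [pixel for row in image[::step] for pixel in row[::step]]


def vertical_runs(image):
    """Foreground run count for each column, left to right."""
    return [count_runs(column) for column in zip(*image)]


def horizontal_runs(image):
    """Foreground run count for each row, top to bottom."""
    return [count_runs(row) for row in image]


def feature_vector(image, step=2):
    """Concatenate subsample, vertical-run, and horizontal-run components."""
    return binary_subsample(image, step) + vertical_runs(image) + horizontal_runs(image)


if __name__ == "__main__":
    image = [
        [1, 0, 1, 1],
        [1, 0, 0, 1],
        [0, 0, 1, 1],
        [1, 1, 0, 0],
    ]
    print(feature_vector(image))
```

In the setting of claim 7, a vector like this would be computed for each synthetic rendering of a stop word, and the resulting vectors would serve as training samples for the stop word classifier.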
-
9. An optical character recognition apparatus comprising:
-
a selector for selecting at least one input image from an image source, the input image having a plurality of individual words;
an image symbol generator for comparing a plurality of stop words to the input image, each stop word being from a same language and having an associated definition in such language, the plurality of stop words being identified as a function of a linguistic model and the plurality of stop words having an expected recognition coverage level associated therewith, identifying matches between particular ones of the stop words and particular ones of the individual words of the input image, wherein each particular stop word matches a same particular individual word throughout the input image, to form a plurality of matching words, segmenting the plurality of matching words to form a plurality of character prototypes, wherein the plurality of stop words is limited to the following stop words:
a, about, after, all, also, an, and, any, are, as, at, back, be, because, been, before, being, between, both, but, by, can, could, day, did, do, down, each, even, first, for, from, get, good, had, has, have, he, her, here, him, his, how, I, if, in, into, is, it, its, just, know, life, like, little, long, made, make, man, many, may, me, men, more, most, Mr., much, must, my, never, new, no, not, now, of, old, on, one, only, or, other, our, out, over, own, people, said, same, see, she, should, so, some, state, still, such, than, that, the, their, them, then, there, these, they, this, those, three, through, time, to, too, two, under, up, very, was, way, we, well, were, what, when, where, which, who, will, with, work, world, would, year, years, you, and your;
an image classifier for classifying at least one character from the input image using the plurality of character prototypes; and
an image recognizer for producing at least one recognized image from the image source using the at least one character.
-
Specification