Method and apparatus for character recognition
Abstract
A character recognizing apparatus has a post-processing unit which makes character strings including a plurality of conversion candidates, respectively, made by a character recognizing unit, and a full text searching unit performs a full text search for the character strings in a plurality of documents having been converted into text data, whereby the post-processing unit determines a correct character on the basis of results of the search to correct misrecognition.
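The post-processing described in the abstract can be illustrated with a minimal sketch. It assumes a very simple index file (a map from each character to the positions where it occurs in the reference text); the function names `build_index`, `count_occurrences`, and `pick_most_frequent`, the sample strings, and the tie-breaking rule (first candidate wins) are all hypothetical and are not taken from the patent.

```python
from collections import defaultdict

def build_index(text):
    """Map each character to the list of positions where it occurs (a toy index file)."""
    index = defaultdict(list)
    for pos, ch in enumerate(text):
        index[ch].append(pos)
    return index

def count_occurrences(candidate, text, index):
    """Count occurrences of candidate, checking only positions of its first character."""
    if not candidate:
        return 0
    return sum(1 for pos in index.get(candidate[0], ())
               if text.startswith(candidate, pos))

def pick_most_frequent(candidates, text, index):
    """Return the candidate string with the highest occurrence frequency, plus all counts."""
    counts = {c: count_occurrences(c, text, index) for c in candidates}
    best = max(candidates, key=lambda c: counts[c])  # ties: first candidate wins (assumption)
    return best, counts

# Hypothetical reference text and candidate strings for one misread word.
reference_text = "the modern computer is fast. a modern printer is slow."
index = build_index(reference_text)
best, counts = pick_most_frequent(["modern", "modem", "rnodern"], reference_text, index)
```

Here the candidate that actually occurs in the reference text is adopted as the series of correct characters; a production index would more likely be built over n-grams or words than over single characters.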
68 Claims
-
1. A character recognizing method in which reference text data to be referred to character recognition and an index file of the reference text data are provided, the method comprising the steps of:
-
recognizing an input character image indicating an input character of an input document as one or more conversion candidate characters denoting candidates for the input character for each of input character images indicating input characters of the input document, the one or more conversion candidate characters each being composed of text data;
selecting a series of search character images indicating a series of search input characters from the series of input character images;
selecting a plurality of particular conversion candidate character strings respectively corresponding to the series of search character images from the particular conversion candidate characters;
searching the reference text data, by using a full text searching technique based on the index file of the reference text data, for one or more particular character strings respectively agreeing with one particular conversion candidate character string for each of the particular conversion candidate character strings to count the number of particular character strings as an occurrence frequency of the particular conversion candidate character string in the reference text data for each of the particular conversion candidate character strings;
selecting a specific particular conversion candidate character string corresponding to the highest occurrence frequency among those of the particular conversion candidate character strings from the particular conversion candidate character strings; and
determining a series of specific particular conversion candidate characters composing the specific particular conversion candidate character string as a series of correct characters for the series of search character images.
-
calculating an evaluation value indicating a degree of certainty of one conversion candidate character for each of the conversion candidate characters corresponding to the input character images;
selecting one or more specific conversion candidate characters corresponding to the evaluation values higher than those of the other conversion candidate characters from the conversion candidate characters corresponding to one input character image for each of the input character images;
repeatedly selecting one specific conversion candidate character from the specific conversion candidate characters corresponding to one search character image for each of the search character images to produce a plurality of specific conversion candidate character strings respectively corresponding to the series of search character images; and
setting each specific conversion candidate character string as one particular conversion candidate character string.
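The selection steps just above (keep the candidates with the higher evaluation values for each character image, then pick one candidate per image in every possible way) might be sketched as follows. The `keep` parameter, the numeric scores, and the function names are assumptions made for illustration; the claim does not say how many candidates are kept.

```python
from itertools import product

def top_candidates(scored, keep=2):
    """scored: list of (character, evaluation_value); return the `keep` best characters."""
    ranked = sorted(scored, key=lambda cv: cv[1], reverse=True)
    return [ch for ch, _ in ranked[:keep]]

def candidate_strings(per_image_scores, keep=2):
    """Combine one kept candidate per character image into every possible string."""
    pools = [top_candidates(scored, keep) for scored in per_image_scores]
    return ["".join(combo) for combo in product(*pools)]

# Hypothetical recognizer output for three character images.
scores = [
    [("c", 0.9), ("e", 0.6), ("o", 0.2)],
    [("a", 0.95)],
    [("t", 0.7), ("l", 0.65), ("i", 0.1)],
]
strings = candidate_strings(scores)
print(strings)
```

The number of strings grows as the product of the pool sizes, which is one reason to keep the search series short or to prune candidates before searching.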
-
-
3. A character recognizing method according to claim 1, further comprising the steps of:
-
repeatedly selecting another series of search character images indicating another series of search input characters from the input character images; and
determining the series of correct characters each time the series of search character images is selected from the input character images to recognize all input character images.
-
-
4. A character recognizing method according to claim 1, in which the number of particular conversion candidate characters in each particular conversion candidate character string is fixed.
-
5. A character recognizing method according to claim 1, in which the series of search character images is interposed between two punctuation marks in the input image data.
-
6. A character recognizing method according to claim 1, in which the search input characters are expressed by a character type selected from the group consisting of a Japanese Hiragana character type, a Japanese Katakana character type and a Kanji character type.
-
7. A character recognizing method according to claim 1, in which the step of searching the reference text data comprises the steps of
searching the reference text data and the input document for one particular conversion candidate character string for each of the particular conversion candidate character strings to count an occurrence frequency of the particular conversion candidate character string in the reference text data and the input document for each of the particular conversion candidate character strings.
8. A character recognizing method according to claim 1, in which the step of recognizing an input character image includes the steps of
specifying a plurality of character regions existing in the input document; and
extracting each of the input character images from the character regions, the step of selecting a series of search character images includes the step of combining one or a series of particular input character images extracted from a final portion of one character region and one or a series of particular input character images extracted from a top portion of another character region into the series of search character images, for each pair of character regions, and the step of determining a series of specific particular conversion candidate characters includes the step of coupling a first character region and a second character region together in that order, in cases where one particular conversion candidate character string corresponding to one series of search character images obtained by combining one or a series of particular input character images extracted from a final portion of the first character region and one or a series of particular input character images extracted from a top portion of the second character region is selected as one specific particular conversion candidate character string, for each specific particular conversion candidate character string.
-
9. A character recognizing method according to claim 1, in which the step of searching the reference text data comprises the steps of
selecting a plurality of shortened conversion candidate character strings corresponding to a series of search character images from the particular conversion candidate character strings;
searching the reference text data for one shortened conversion candidate character string for each of the shortened conversion candidate character strings to count an occurrence frequency of the shortened conversion candidate character string in the reference text data for each of the shortened conversion candidate character strings;
selecting a specific shortened conversion candidate character string corresponding to the highest occurrence frequency among those of the shortened conversion candidate character strings from the shortened conversion candidate character strings;
producing a plurality of particular conversion candidate character strings respectively including the specific shortened conversion candidate character string and corresponding to the series of search character images; and
searching the reference text data for each particular conversion candidate character string to count an occurrence frequency of each particular conversion candidate character string in the reference text data.
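One possible reading of this two-stage search is sketched below. It assumes that a "shortened" string is simply a fixed-length prefix of each full candidate string, which the claim does not state; `short_len`, the helper names, and the sample text are hypothetical.

```python
def count(s, text):
    """Count (possibly overlapping) occurrences of s in text."""
    n, start = 0, text.find(s)
    while start != -1:
        n += 1
        start = text.find(s, start + 1)
    return n

def two_stage_select(candidates, text, short_len):
    """Stage 1: find the most frequent shortened string (here: a prefix, by assumption).
    Stage 2: search only the full candidates that contain it."""
    shortened = sorted({c[:short_len] for c in candidates})
    best_short = max(shortened, key=lambda s: count(s, text))
    survivors = [c for c in candidates if c.startswith(best_short)]
    counts = {c: count(c, text) for c in survivors}
    return max(survivors, key=counts.get), counts

reference_text = "the cat sat; a cat ate; eat well"
best, counts = two_stage_select(["cat", "cal", "eat", "eal"], reference_text, short_len=2)
```

In this example only two of the four full candidates are searched in the second stage, which is the saving the shortened search is meant to provide.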
-
-
10. A character recognizing method according to claim 1, in which the step of recognizing an input character image includes the steps of
specifying an input attribute of the input document, classifying the reference text data into the plurality of registered documents respectively specified by a registered attribute, and wherein the step of searching the reference text data comprises the steps of selecting one or more particular registered documents respectively specified by the registered attribute, which is the same as the input attribute of the input document, from the registered documents and searching the particular registered documents for one particular conversion candidate character string for each of the particular conversion candidate character strings to count an occurrence frequency of the particular conversion candidate character string in the reference text data for each of the particular conversion candidate character strings.
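The attribute-restricted search can be sketched by filtering the registered documents before counting. The dictionary layout with `"attribute"` and `"text"` keys, the attribute values, and the function name are assumptions for illustration only.

```python
def select_by_attribute(candidates, registered_documents, input_attribute):
    """Count candidates only in registered documents whose attribute matches the input's."""
    matching = [doc["text"] for doc in registered_documents
                if doc["attribute"] == input_attribute]
    counts = {c: sum(text.count(c) for text in matching) for c in candidates}
    return max(candidates, key=counts.get), counts

# Hypothetical registered documents, each tagged with one attribute.
registered = [
    {"attribute": "medical", "text": "the patient has a cold"},
    {"attribute": "legal", "text": "the patent has a claim"},
]
legal_choice, legal_counts = select_by_attribute(["patient", "patent"], registered, "legal")
medical_choice, _ = select_by_attribute(["patient", "patent"], registered, "medical")
```

The same pair of candidates resolves differently depending on the input attribute, which is the point of restricting the search to documents of the same kind.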
11. A character recognizing method according to claim 1, comprising the further step of:
-
preparing misrecognition data respectively composed of a misrecognized character string including a misrecognized character and a correct character string made of a plurality of correct characters, and wherein the step of searching the reference text data comprises the steps of:
searching the misrecognized character strings of the misrecognition data for one particular conversion candidate character string for each of the particular conversion candidate character strings;
recognizing the series of search character images as a series of correct characters composing a correct character string corresponding to one particular conversion candidate character string in the misrecognition data in cases where the particular conversion candidate character string exists in the misrecognized character strings; and
searching the reference text data for one particular conversion candidate character string for each of the particular conversion candidate character strings, in cases where any particular conversion candidate character string does not exist in the misrecognized character strings, to count an occurrence frequency of the particular conversion candidate character string in the reference text data for each of the particular conversion candidate character strings.
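The lookup order in these steps (misrecognition data first, reference text only as a fallback) might look like the following sketch. Representing the misrecognition data as a dictionary from misrecognized string to correct string is an assumption, as are the function name and sample data.

```python
def correct_series(candidates, misrecognition_data, reference_text):
    """Return the correct string from misrecognition data if any candidate is a
    known misrecognized string; otherwise fall back to occurrence frequency."""
    for candidate in candidates:
        if candidate in misrecognition_data:
            return misrecognition_data[candidate]
    # Fallback: highest occurrence frequency in the reference text.
    return max(candidates, key=reference_text.count)

# Hypothetical misrecognition data: misrecognized string -> correct string.
misrecognition_data = {"rnodern": "modern"}
reference_text = "a cat and a cat"

known = correct_series(["rnodern", "rnodem"], misrecognition_data, reference_text)
fallback = correct_series(["cal", "cat"], misrecognition_data, reference_text)
```

A dictionary hit avoids the full text search entirely, so frequently repeated recognizer errors are corrected cheaply.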
-
-
12. A character recognizing method according to claim 1, further comprising the steps of:
-
storing an input layout of the input character images of the input document; and
displaying a corrected document, which is obtained by replacing the series of search character images of the input document with the series of correct characters, in the input layout of the input document.
-
-
13. A character recognizing method according to claim 1, in which the step of selecting a series of search character images comprises the steps of:
-
detecting a series of particular input character images sandwiched by a pair of partition symbols from the series of input character images of the input document; and
setting the series of particular input character images as the series of search character images indicating the series of search input characters.
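Detecting series that lie between partition symbols can be sketched as a simple split. This sketch works on already recognized characters rather than on character images, and it treats the start and end of the text as if they were partition symbols, which is a simplification; the default symbol set (space and period) is also an assumption.

```python
def split_on_partitions(chars, partitions=frozenset(" .")):
    """Split a character sequence into the runs that lie between partition symbols."""
    series, current = [], []
    for ch in chars:
        if ch in partitions:
            if current:                      # close the run that just ended
                series.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:                              # trailing run with no closing symbol
        series.append("".join(current))
    return series

series = split_on_partitions("the cat. sat")
print(series)
```

Each returned run would then be handled as one series of search character images.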
-
-
14. A character recognizing method in which reference text data to be referred to character recognition and an index file of the reference text data are provided, the method comprising the steps of:
-
recognizing an input character image indicating an input character of an input document as one or more conversion candidate characters denoting candidates for the input character for each of input character images indicating input characters of the input document, the one or more conversion candidate characters each being composed of text data, the recognizing step including calculating an evaluation value indicating a degree of certainty of one conversion candidate character for each of the conversion candidate characters corresponding to the input character images;
selecting one or more particular conversion candidate characters corresponding to the evaluation values higher than those of the other conversion candidate characters from the conversion candidate characters corresponding to one input character image for each of the input character images;
selecting a series of search character images indicating a series of search input characters from the series of input character images;
selecting a plurality of particular conversion candidate character strings respectively corresponding to the series of search character images from the particular conversion candidate characters;
searching the reference text data, by using a full text searching technique based on the index file of the reference text data, for one or more particular character strings respectively agreeing with one particular conversion candidate character string for each of the particular conversion candidate character strings to select a specific particular conversion candidate character string occurring at the highest frequency in the reference text data from the particular conversion candidate character strings; and
determining a series of specific particular conversion candidate characters composing the specific particular conversion candidate character string as a series of correct characters for the series of search character images.
-
specifying a highest evaluation value among the evaluation values of the conversion candidate characters corresponding to one input character image for each of the input character images;
determining a threshold value lower than the highest evaluation value by a prescribed value for each of the input character images;
and adopting one or more conversion candidate characters having the evaluation values equal to or higher than the threshold value as the particular conversion candidate characters for each of the input character images.
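The relative threshold in these steps (the highest evaluation value minus a prescribed value) can be sketched as below. Integer scores and the name `margin` for the prescribed value are assumptions chosen to keep the arithmetic exact.

```python
def adopt_candidates(scored, margin):
    """scored: list of (character, evaluation_value).
    Adopt every candidate whose value is within `margin` of the highest value."""
    highest = max(value for _, value in scored)
    threshold = highest - margin
    return [ch for ch, value in scored if value >= threshold]

# Hypothetical evaluation values on a 0-100 scale.
ambiguous = adopt_candidates([("c", 90), ("e", 80), ("o", 20)], margin=15)
confident = adopt_candidates([("a", 95), ("o", 40)], margin=15)
```

Because the threshold moves with the best score, a clearly recognized image yields a single candidate, while an ambiguous image yields several.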
-
-
16. A character recognizing method according to claim 14, in which the step of selecting one or more particular conversion candidate characters includes the step of:
selecting one or more conversion candidate characters having the evaluation values equal to or higher than a threshold value as the particular conversion candidate characters for each of the input character images.
-
17. A character recognizing method according to claim 14, in which the step of selecting one or more particular conversion candidate characters comprises the steps of:
-
specifying a highest evaluation value among the evaluation values of the conversion candidate characters corresponding to one input character image for each of the input character images;
determining a threshold value lower than the highest evaluation value by a prescribed value for each of the input character images;
adopting one or more conversion candidate characters having the evaluation values equal to or higher than the threshold value as the particular conversion candidate characters for each of the input character images; and
judging the particular conversion candidate character having the highest evaluation value corresponding to one input character image as a misrecognized character, in cases where the plurality of particular conversion candidate characters corresponding to the input character image are adopted, for each of the input character images, and the step of selecting a series of search character images includes the step of:
selecting the series of search character images, respectively corresponding to the particular conversion candidate characters including the misrecognized character, from the input character images.
-
-
18. A character recognizing method according to claim 14, in which the step of selecting one or more particular conversion candidate characters includes the steps of:
-
adopting one or more conversion candidate characters having the evaluation values equal to or higher than a threshold value as the particular conversion candidate characters for each of the input character images; and
judging the particular conversion candidate character having the highest evaluation value corresponding to one input character image as a misrecognized character, in cases where the plurality of particular conversion candidate characters corresponding to the input character image are adopted, for each of the input character images, and the step of selecting a series of search character images includes the step of:
selecting the series of search character images, respectively corresponding to the particular conversion candidate characters including the misrecognized character, from the input character images.
-
-
19. A character recognizing method according to claim 14, in which the step of searching the reference text data comprises the step of
searching the reference text data and the input document for one particular conversion candidate character string for each of the particular conversion candidate character strings to select a specific particular conversion candidate character string occurring at the highest frequency in the reference text data and the input document from the particular conversion candidate character strings.
20. A character recognizing method according to claim 14, in which the step of searching the reference text data comprises the steps of
searching the reference text data for one particular conversion candidate character string for each of the particular conversion candidate character strings to count a first occurrence frequency of the particular conversion candidate character string in the reference text data for each of the particular conversion candidate character strings;
determining a threshold value lower than the highest first occurrence frequency by a prescribed value;
selecting one or more first selected conversion candidate character strings corresponding to the first occurrence frequencies equal to or higher than the threshold value among those of the particular conversion candidate character strings from the particular conversion candidate character strings;
searching the input document for one first selected conversion candidate character string for each of the first selected conversion candidate character strings to count a second occurrence frequency of the first selected conversion candidate character string in the input document for each of the first selected conversion candidate character strings; and
selecting a specific particular conversion candidate character string corresponding to the highest second occurrence frequency among those of the first selected conversion candidate character strings from the first selected conversion candidate character strings.
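The two-pass selection in these steps can be sketched as follows: shortlist by frequency in the reference text, then decide by frequency in the input document. The plain-string input document (standing in for the recognized text of the input document), the name `margin` for the prescribed value, and the sample texts are assumptions.

```python
def select_with_input_document(candidates, reference_text, input_text, margin):
    """First pass: keep candidates whose reference-text frequency is within `margin`
    of the best. Second pass: choose the one most frequent in the input document."""
    first = {c: reference_text.count(c) for c in candidates}
    threshold = max(first.values()) - margin
    shortlisted = [c for c in candidates if first[c] >= threshold]
    second = {c: input_text.count(c) for c in shortlisted}
    return max(shortlisted, key=second.get), shortlisted

reference_text = "cat cat cat cot cot cut"
input_text = "the cot is by the other cot near the cat"
choice, shortlisted = select_with_input_document(
    ["cat", "cot", "cut"], reference_text, input_text, margin=1)
```

In this example the reference text alone would favor "cat", but the input document's own usage tips the choice to "cot", illustrating why the second pass is useful for document-specific vocabulary.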
-
-
21. A character recognizing method according to claim 14, in which the step of recognizing an input character image includes the steps of
determining a first character image position of one input character image in the input document; and
extracting the input character image supposed to be placed at the first character image position from the input document, and the step of calculating an evaluation value includes the steps of again determining a second character image position of the input character image in cases where all evaluation values of the conversion candidate characters corresponding to the input character image supposed to be placed at the first character image position are lower than a threshold value;
again extracting the input character image supposed to be placed at the second character image position from the input document;
again recognizing the input character image placed at the second character image position as one or more conversion candidate characters; and
again calculating an evaluation value of each conversion candidate character corresponding to the input character image placed at the second character image position.
-
22. A character recognizing method according to claim 14, in which the step of recognizing an input character image includes the steps of specifying a plurality of character regions existing in the input document; and
extracting each of the input character images from the character regions, the step of selecting a series of search character images includes the step of combining one or a series of particular input character images extracted from a final portion of one character region and one or a series of particular input character images extracted from a top portion of another character region into the series of search character images, for each pair of character regions, and the step of determining a series of specific particular conversion candidate characters includes the step of coupling a first character region and a second character region together in that order, in cases where one particular conversion candidate character string corresponding to one series of search character images obtained by combining one or a series of particular input character images extracted from a final portion of the first character region and one or a series of particular input character images extracted from a top portion of the second character region is selected as one specific particular conversion candidate character string, for each specific particular conversion candidate character string.
-
23. A character recognizing method according to claim 14, in which the step of recognizing an input character image includes the step of
specifying an input attribute of the input document, further comprising the step of classifying the reference text data into the plurality of registered documents respectively specified by a registered attribute, and wherein the step of searching the reference text data comprises the steps of selecting one or more particular registered documents respectively specified by the registered attribute, which is the same as the input attribute of the input document, from the registered documents; and
searching the particular registered documents for one particular conversion candidate character string for each of the particular conversion candidate character strings to select a specific particular conversion candidate character string occurring at the highest frequency in the reference text data from the particular conversion candidate character strings.
-
24. A character recognizing method according to claim 14, comprising the further step of:
-
preparing misrecognition data respectively composed of a misrecognized character string including a misrecognized character and a correct character string made of a plurality of correct characters, and the step of searching the reference text data comprises the steps of:
searching the misrecognized character strings of the misrecognition data for one particular conversion candidate character string for each of the particular conversion candidate character strings;
recognizing the series of search character images as a series of correct characters composing a correct character string corresponding to one particular conversion candidate character string in the misrecognition data in cases where the particular conversion candidate character string exists in the misrecognized character strings; and
searching the reference text data for one particular conversion candidate character string for each of the particular conversion candidate character strings, in cases where any particular conversion candidate character string does not exist in the misrecognized character strings, to select a specific particular conversion candidate character string frequently occurring in the reference text data from the particular conversion candidate character strings.
-
-
25. A character recognizing method according to claim 14, further comprising the steps of:
-
storing an input layout of the input character images of the input document; and
displaying a corrected document, which is obtained by replacing the series of search character images of the input document with the series of correct characters, in the input layout of the input document.
-
-
26. A character recognizing method according to claim 14, in which the step of selecting a series of search character images comprises the steps of:
-
detecting a series of particular input character images sandwiched by a pair of partition symbols from the series of input character images of the input document; and
setting the series of particular input character images as the series of search character images indicating the series of search input characters.
-
-
27. A character recognizing apparatus, comprising:
-
character recognizing means for recognizing an input character image indicating an input character of an input document as one or more conversion candidate characters denoting candidates for the input character for each of input character images indicating input characters of the input document, the one or more conversion candidate characters each being composed of text data, selecting a series of search character images indicating a series of search input characters from the series of input character images and selecting a plurality of particular conversion candidate character strings respectively corresponding to the series of search character images from the particular conversion candidate characters;
reference text data storing means for storing reference text data indicating characters arranged in series in one or more registered documents and storing an index file of the reference text data;
full text searching means, using a full text searching technique based on the index file of the reference text data, for searching the reference text data stored by the reference text data storing means for one or more particular stored character strings respectively agreeing with one particular conversion candidate character string for each of the particular conversion candidate character strings recognized by the character recognizing means to count the number of particular stored character strings as an occurrence frequency of the particular conversion candidate character string in the reference text data for each of the particular conversion candidate character strings;
post-processing means for selecting a specific particular conversion candidate character string corresponding to the highest occurrence frequency among those of the particular conversion candidate character strings counted by the full text searching means from the particular conversion candidate character strings recognized by the character recognizing means, and determining a series of specific particular conversion candidate characters composing the specific particular conversion candidate character string as a series of correct characters for the series of search character images; and
text data outputting means for outputting the series of correct characters determined by the post-processing means as the series of search character images.
-
region dividing means for dividing the input document, in which a plurality of character regions are arranged in a particular order, into a plurality of regions of different attributes;
character extracting means for specifying the character regions existing in the particular order in the regions of the input document divided by the region dividing means and extracting each of the input character images from the character regions, each of the input character images extracted being recognized as the conversion candidate characters by the character recognizing means; and
region coupling means for coupling the character regions extracted by the character extracting means together in the particular order, in cases where one particular conversion candidate character string corresponding to one series of search character images of one series of search input characters which extends over two or more character regions divided by the region dividing means is selected as one specific particular conversion candidate character string by the post-processing means.
-
-
30. A character recognizing apparatus according to claim 27, further comprising:
attribute obtaining means for obtaining an input attribute of the input document, wherein the reference text data stored by the text data storing means is classified into the plurality of registered documents respectively specified by a registered attribute, one or more particular registered documents respectively specified by the registered attribute, which is the same as the input attribute of the input document obtained by the attribute obtaining means, are selected from the registered documents by the character recognizing means, and the particular registered documents are searched for one particular conversion candidate character string for each of the particular conversion candidate character strings by the full text searching means to count an occurrence frequency of the particular conversion candidate character string in the reference text data for each of the particular conversion candidate character strings.
-
31. A character recognizing apparatus according to claim 27, further comprising:
-
misrecognition data storing means for storing misrecognition data respectively composed of a misrecognized character string including a misrecognized character and a correct character string made of a plurality of correct characters, wherein the misrecognized character strings of the misrecognition data stored by the misrecognition data storing means are searched for one particular conversion candidate character string by the full text searching means for each of the particular conversion candidate character strings, the series of search character images is recognized by the post-processing means as a series of correct characters composing a correct character string corresponding to one particular conversion candidate character string in the misrecognition data in cases where the particular conversion candidate character string exists in the misrecognized character strings of the misrecognition data, and the reference text data is searched for one particular conversion candidate character string by the full text searching means for each of the particular conversion candidate character strings, in cases where any particular conversion candidate character string does not exist in the misrecognized character strings of the misrecognition data, to count an occurrence frequency of the particular conversion candidate character string in the reference text data for each of the particular conversion candidate character strings.
-
-
32. A character recognizing apparatus according to claim 27, further comprising:
-
layout storing means for storing an input layout of the input character images of the input document recognized by the character recognizing means; and
displaying means for displaying a corrected document, which is obtained by replacing the series of search character images of the input document selected by the character recognizing means with the series of correct characters determined by the post-processing means, in the input layout of the input document stored by the layout storing means.
-
-
33. A character recognizing apparatus according to claim 27, in which a series of particular input character images separated by a pair of partition symbols from a pair of input character images adjacent to the series of particular input character images is detected by the character recognizing means from the series of input character images of the input document, and the series of particular input character images is set as the series of search character images indicating the series of search input characters.
-
34. A character recognizing apparatus according to claim 33 in which the partition symbol is selected from the group consisting of a space, a period, a specific character, a specific symbol and a control code.
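A minimal sketch of selecting search series by partition symbols, assuming each input character image has already been reduced to its top conversion candidate character. The symbol set and the name `split_on_partitions` are illustrative assumptions, and the start and end of the document are treated as boundaries here, which the claim itself does not state.

```python
# Assumed partition symbols; the claim allows a space, a period,
# a specific character, a specific symbol, or a control code.
PARTITION_SYMBOLS = {" ", ".", ",", "\n"}


def split_on_partitions(top_candidates, partitions=PARTITION_SYMBOLS):
    """Return (start, end) index ranges of character runs between partition symbols."""
    spans = []
    start = None
    for i, ch in enumerate(top_candidates):
        if ch in partitions:
            # A partition symbol closes the run currently being collected.
            if start is not None:
                spans.append((start, i))
                start = None
        elif start is None:
            # First non-partition character opens a new run.
            start = i
    if start is not None:
        spans.append((start, len(top_candidates)))
    return spans


spans = split_on_partitions(list("the cat. sat"))
```

Each returned range identifies one series of input character images that could be handed to the search stage as a series of search character images.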
-
35. A character recognizing apparatus according to claim 27, in which an evaluation value indicating a degree of certainty of one conversion candidate character is calculated by the character recognizing means for each of the conversion candidate characters corresponding to the input character images, one or more specific conversion candidate characters corresponding to the evaluation values higher than those of the other conversion candidate characters are selected from the conversion candidate characters corresponding to one input character image by the character recognizing means for each of the input character images, one specific conversion candidate character from the specific conversion candidate characters corresponding to one search character image is repeatedly selected by the character recognizing means for each of the search character images to produce a plurality of specific conversion candidate character strings respectively corresponding to the series of search character images, and each specific conversion candidate character string is set as one particular conversion candidate character string by the character recognizing means.
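Producing candidate strings by repeatedly choosing one high-scoring candidate per search character image amounts to a Cartesian product over the per-image candidate lists. The sketch below assumes a fixed number of candidates kept per image (`keep`), which is one possible reading of "evaluation values higher than those of the other conversion candidate characters"; the function name and the input shape are hypothetical.

```python
from itertools import product


def candidate_strings(candidates_per_image, keep=2):
    """Build every candidate string from the top-scoring candidates of each image.

    candidates_per_image: list (one entry per search character image) of
        lists of (character, evaluation_value) pairs.
    keep: assumed number of highest-scoring candidates retained per image.
    """
    top = []
    for cands in candidates_per_image:
        ranked = sorted(cands, key=lambda c: c[1], reverse=True)
        top.append([ch for ch, _ in ranked[:keep]])
    # One character is chosen per image, in every possible combination.
    return ["".join(combo) for combo in product(*top)]


example = [
    [("c", 0.9), ("e", 0.8), ("o", 0.3)],
    [("a", 0.95)],
    [("t", 0.7), ("l", 0.65)],
]
strings = candidate_strings(example)
```

The number of strings grows multiplicatively with the series length, which is one reason to keep only a few candidates per image.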
-
36. A character recognizing apparatus according to claim 27, further comprising:
-
region dividing means for dividing an area of the input document into a plurality of regions having different attributes; and
character extracting means for specifying a character region divided by the region dividing means and extracting each of the input character images from the character region.
-
-
37. A character recognizing apparatus according to claim 27, in which a plurality of shortened conversion candidate character strings corresponding to a series of search character images is selected by the character recognizing means from the particular conversion candidate character strings, the reference text data is searched for one shortened conversion candidate character string by the full text searching means for each of the shortened conversion candidate character strings to count an occurrence frequency of the shortened conversion candidate character string in the reference text data for each of the shortened conversion candidate character strings, a specific shortened conversion candidate character string corresponding to the highest occurrence frequency among those of the shortened conversion candidate character strings is selected from the shortened conversion candidate character strings by the post-processing means, a plurality of particular conversion candidate character strings respectively including the specific shortened conversion candidate character string and corresponding to the series of search character images are produced by the character recognizing means, and the reference text data is searched for each particular conversion candidate character string by the full text searching means to count an occurrence frequency of each particular conversion candidate character string in the reference text data.
-
38. A character recognizing apparatus, comprising:
-
character recognizing means for recognizing an input character image indicating an input character of an input document as one or more conversion candidate characters denoting candidates for the input character for each of input character images indicating input characters of the input document, the one or more conversion candidate characters each being composed of text data, calculating an evaluation value indicating a degree of certainty of one conversion candidate character for each of the conversion candidate characters corresponding to the input character images, selecting one or more particular conversion candidate characters corresponding to the evaluation values higher than those of the other conversion candidate characters from the conversion candidate characters corresponding to one input character image for each of the input character images, selecting a series of search character images indicating a series of search input characters from the series of input character images, and selecting a plurality of particular conversion candidate character strings respectively corresponding to the series of search character images from the particular conversion candidate characters;
reference text data storing means for storing reference text data indicating characters arranged in series in one or more registered documents and storing an index file of the reference text data;
full text searching means, using the index file of the reference text data, for searching the reference text data stored by the reference text data storing means for one or more particular stored character strings respectively agreeing with one particular conversion candidate character string for each of the particular conversion candidate character strings produced by the character recognizing means to obtain a full text search result;
post-processing means for selecting a specific particular conversion candidate character string occurring at the highest frequency in the reference text data from the particular conversion candidate character strings according to the full text search result obtained by the full text searching means, and determining a series of specific particular conversion candidate characters composing the specific particular conversion candidate character string as a series of correct characters for the series of search character images; and
text data outputting means for outputting the series of correct characters determined by the post-processing means as the series of search character images.
character extracting means for determining a first character image position of one input character image in the input document, extracting the input character image supposed to be placed at the first character image position from the input document, again determining a second character image position of the input character image in cases where all evaluation values of the conversion candidate characters corresponding to the input character image supposed to be placed at the first character image position are lower than a threshold value and again extracting the input character image supposed to be placed at the second character image position from the input document, wherein the input character image placed at the second character image position is again recognized as one or more conversion candidate characters by the character recognizing means, and an evaluation value of each conversion candidate character corresponding to the input character image placed at the second character image position is again calculated by the character recognizing means.
-
-
42. A character recognizing apparatus according to claim 38, further comprising:
-
character extracting means for specifying a plurality of character regions existing in the input document and extracting each of the input character images from the character regions, wherein one or a series of particular input character images extracted from a final portion of one character region and one or a series of particular input character images extracted from a top portion of another character region are combined into the series of search character images by the character recognizing means, for each pair of character regions; and
region coupling means for coupling a first character region and a second character region extracted by the character extracting means together in that order, in cases where one particular conversion candidate character string corresponding to one series of search character images obtained by combining one or a series of particular input character images extracted from a final portion of the first character region and one or a series of particular input character images extracted from a top portion of the second character region is selected as one specific particular conversion candidate character string by the post-processing means, for each specific particular conversion candidate character string.
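The region coupling idea can be illustrated with a simplified sketch: the tail of one character region and the head of another are joined into a search string, and the pair is coupled when that junction string is found most often in the reference text. For brevity this assumes each region has already been reduced to a single recognized string (the claim works on candidate strings), and `couple_regions`, `tail`, and `head` are hypothetical names.

```python
def couple_regions(regions, reference, tail=2, head=2):
    """Return (i, j) pairs meaning region i is followed by region j.

    regions: recognized text of each character region (simplifying assumption).
    reference: reference text data as one string.
    tail, head: assumed number of characters taken from the end of the first
        region and the start of the second to form the junction string.
    """
    pairs = []
    for i, first in enumerate(regions):
        best_j, best_count = None, 0
        for j, second in enumerate(regions):
            if i == j:
                continue
            junction = first[-tail:] + second[:head]
            count = reference.count(junction)
            # Keep the successor whose junction string is most frequent.
            if count > best_count:
                best_j, best_count = j, count
        if best_j is not None:
            pairs.append((i, best_j))
    return pairs


reference_text = "character recognition"
order = couple_regions(["gnition", "chara", "cter reco"], reference_text)
```

Regions whose junction strings never occur in the reference text are left uncoupled in this sketch.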
-
-
43. A character recognizing apparatus according to claim 38, further comprising:
attribute obtaining means for obtaining an input attribute of the input document, wherein the reference text data stored by the text data storing means is classified into the plurality of registered documents respectively specified by a registered attribute, one or more particular registered documents respectively specified by the registered attribute, which is the same as the input attribute of the input document obtained by the attribute obtaining means, are selected from the registered documents by the character recognizing means, and the particular registered documents are searched for one particular conversion candidate character string for each of the particular conversion candidate character strings by the full text searching means to obtain a full text search result.
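Restricting the search to registered documents whose attribute matches the input document might look like the sketch below. The dictionary layout with `attribute` and `text` keys, and both function names, are assumptions made for illustration only.

```python
def select_registered_documents(documents, input_attribute):
    """Keep the text of registered documents whose attribute equals the input attribute."""
    return [d["text"] for d in documents if d["attribute"] == input_attribute]


def frequency_in(texts, candidate):
    """Total occurrence frequency of a candidate string across the selected texts."""
    return sum(text.count(candidate) for text in texts)


registered = [
    {"attribute": "patent", "text": "the claim recites a claim element"},
    {"attribute": "recipe", "text": "clam chowder with one clam"},
]
selected = select_registered_documents(registered, "patent")
claim_count = frequency_in(selected, "claim")
clam_count = frequency_in(selected, "clam")
```

Filtering by attribute keeps strings that are frequent only in unrelated documents from outvoting the correct candidate.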
-
44. A character recognizing apparatus according to claim 38, further comprising:
-
misrecognition data storing means for storing misrecognition data respectively composed of a misrecognized character string including a misrecognized character and a correct character string made of a plurality of correct characters, wherein the misrecognized character strings of the misrecognition data stored by the misrecognition data storing means are searched for one particular conversion candidate character string by the full text searching means for each of the particular conversion candidate character strings, the series of search character images is recognized by the post-processing means as a series of correct characters composing a correct character string corresponding to one particular conversion candidate character string in the misrecognition data in cases where the particular conversion candidate character string exists in the misrecognized character strings of the misrecognition data, and the reference text data is searched for one particular conversion candidate character string by the full text searching means for each of the particular conversion candidate character strings, in cases where any particular conversion candidate character string does not exist in the misrecognized character strings of the misrecognition data, to obtain a full text search result.
-
-
45. A character recognizing apparatus according to claim 38, further comprising:
-
layout storing means for storing an input layout of the input character images of the input document recognized by the character recognizing means; and
displaying means for displaying a corrected document, which is obtained by replacing the series of search character images of the input document selected by the character recognizing means with the series of correct characters determined by the post-processing means, in the input layout of the input document stored by the layout storing means.
-
-
46. A character recognizing apparatus according to claim 38, in which a series of particular input character images separated by a pair of partition symbols from a pair of input character images adjacent to the series of particular input character images is detected by the character recognizing means from the series of input character images of the input document, and the series of particular input character images is set as the series of search character images indicating the series of search input characters.
-
47. A character recognizing apparatus according to claim 46 in which the partition symbol is selected from the group consisting of a space, a period, a specific character, a specific symbol and a control code.
-
48. A character recognizing apparatus according to claim 38, further comprising:
-
region dividing means for dividing an area of an input document into a plurality of regions having different attributes; and
character extracting means for specifying a character region divided by the region dividing means and extracting each of the input character images from the character region.
-
-
49. A character recognizing apparatus according to claim 38, in which a highest evaluation value among the evaluation values of the conversion candidate characters corresponding to one input character image is specified by the character recognizing means for each of the input character images, a threshold value lower than the highest evaluation value by a prescribed value is determined by the character recognizing means for each of the input character images, and one or more conversion candidate characters having the evaluation values equal to or higher than the threshold value are adopted as the particular conversion candidate characters by the character recognizing means for each of the input character images.
-
50. A character recognizing apparatus according to claim 38, in which one or more conversion candidate characters having the evaluation values equal to or higher than a threshold value are adopted as the particular conversion candidate characters by the character recognizing means for each of the input character images.
-
51. A character recognizing apparatus according to claim 38, in which a highest evaluation value among the evaluation values of the conversion candidate characters corresponding to one input character image is specified by the character recognizing means for each of the input character images, a threshold value lower than the highest evaluation value by a prescribed value is determined by the character recognizing means for each of the input character images, one or more conversion candidate characters having the evaluation values equal to or higher than the threshold value are adopted as the particular conversion candidate characters by the character recognizing means for each of the input character images, the particular conversion candidate character having the highest evaluation value corresponding to one input character image is judged as a misrecognized character by the character recognizing means, in cases where the plurality of particular conversion candidate characters corresponding to the input character image are adopted, for each of the input character images, and the series of search character images respectively corresponding to the particular conversion candidate characters including the misrecognized character are selected from the input character images by the character recognizing means.
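The relative threshold described here (the highest evaluation value minus a prescribed value) and the rule that an image with several adopted candidates is treated as possibly misrecognized can be sketched as below. Integer evaluation values and the `margin` parameter are assumptions chosen to keep the arithmetic exact; the function names are hypothetical.

```python
def adopt_candidates(cands, margin=10):
    """Adopt candidates whose evaluation value is within `margin` of the best one.

    cands: list of (character, evaluation_value) pairs for one input character image.
    margin: the assumed "prescribed value" subtracted from the highest evaluation value.
    """
    best = max(value for _, value in cands)
    threshold = best - margin
    return [ch for ch, value in cands if value >= threshold]


def suspect_positions(candidates_per_image, margin=10):
    """Indexes of images where more than one candidate is adopted.

    Under the claim, the top candidate of such an image is judged as a
    misrecognized character, so these positions seed the search series.
    """
    return [
        i
        for i, cands in enumerate(candidates_per_image)
        if len(adopt_candidates(cands, margin)) > 1
    ]


images = [
    [("c", 90), ("e", 85), ("o", 30)],  # two close candidates: ambiguous
    [("a", 95), ("o", 60)],             # one clear winner
]
adopted_first = adopt_candidates(images[0])
suspects = suspect_positions(images)
```

A relative threshold adapts to each image: a blurry character with uniformly low scores still yields several candidates, whereas a fixed threshold might yield none.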
-
52. A character recognizing apparatus according to claim 38, in which one or more conversion candidate characters having the evaluation values equal to or higher than a threshold value are adopted as the particular conversion candidate characters by the character recognizing means for each of the input character images, the particular conversion candidate character having the highest evaluation value corresponding to one input character image is judged as a misrecognized character by the character recognizing means, in cases where the plurality of particular conversion candidate characters corresponding to the input character image are adopted, for each of the input character images, and the series of search character images respectively corresponding to the particular conversion candidate characters including the misrecognized character are selected from the input character images by the character recognizing means.
-
53. A character recognizing method in which reference text data to be referred to character recognition and an index file of the reference text data are provided, comprising the steps of:
-
recognizing a character image provided so as to include a single character to be recognized as one or more conversion candidate characters for the character, the one or more conversion candidate characters each being composed of text data, and the character image being newly and repeatedly provided so that one or more conversion candidate characters are obtained for every character image;
producing a plurality of search character strings based on the one or more conversion candidate characters;
searching the reference text data, by using a full text search technique based on the index file of the reference text data, for each of the plurality of search character strings to provide an occurrence frequency of each of the search character strings included in the reference text data; and
determining a character most appropriate for the character image by using the occurrence frequency of each of the search character strings.
the first producing step is a step of producing the plurality of search character strings by combining one of the one or more conversion candidate characters for a first character with one of the one or more conversion candidate characters for a second character; and
the second producing step includes the steps of receiving an occurrence frequency obtained responsively to the search character strings produced at the first producing step, removing any of the search character strings, having an occurrence frequency which is lower than a predetermined value, and producing again the plurality of search character strings by adding one of the one or more conversion candidate characters for a third character to remaining strings of the search character strings.
-
-
55. A character recognizing method according to claim 54, wherein the searching step includes a first search step and a second search step, in which
the first search step is a step of searching the reference text data for the search character strings produced by the first producing step and providing information about the occurrence frequency to the second producing step; and
the second search step is a step of searching the reference text data for the search character strings produced by the second producing step.
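The alternation of producing and searching steps can be read as an incremental search with pruning: two-character strings are searched first, infrequent ones are dropped, and the survivors are extended by the next character's candidates. The sketch below generalizes this to any number of further characters, which is an assumption beyond the "third character" named in claim 54; `incremental_search` and `min_count` are hypothetical names, and `str.count` stands in for the index-based search.

```python
def incremental_search(candidates_per_char, reference, min_count=1):
    """Grow search strings one character at a time, pruning infrequent prefixes.

    candidates_per_char: list of candidate-character lists, one per character
        (at least two characters are expected).
    reference: reference text data as one string.
    min_count: assumed "predetermined value"; strings below it are removed.
    Returns a dict mapping each final search string to its occurrence frequency.
    """
    # First producing step: pair candidates of the first and second characters.
    strings = [a + b for a in candidates_per_char[0] for b in candidates_per_char[1]]
    for next_cands in candidates_per_char[2:]:
        # First search step plus pruning of low-frequency strings.
        strings = [s for s in strings if reference.count(s) >= min_count]
        # Second producing step: extend the survivors by one more character.
        strings = [s + c for s in strings for c in next_cands]
    # Second search step: frequencies of the final strings.
    return {s: reference.count(s) for s in strings}


reference_text = "the cat sat on the mat"
counts = incremental_search([["c", "e"], ["a", "o"], ["t", "l"]], reference_text)
```

Here only `ca` survives the first search, so two final strings are searched instead of the eight that a full Cartesian product would produce.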
-
-
56. A character recognizing method according to claim 53, wherein the recognizing step includes calculating an evaluation value indicating a degree of certainty of each of the conversion candidate characters.
-
57. A character recognizing method according to claim 56, wherein the plurality of search character strings are changeable with regard to at least one of the number of characters and a type of characters.
-
58. A character recognizing method according to claim 56, wherein the plurality of search character strings comprise a fixed number of characters.
-
59. A character recognizing method according to claim 56, wherein the producing step includes the steps of:
-
determining whether or not one of the one or more conversion candidate characters, which is the highest in the occurrence frequency, is similar to others of one or more conversion candidate characters using the evaluation value and a threshold value thereof; and
designating as a misrecognized character the character having the one of the one or more conversion candidate characters similar to the others of the one or more conversion candidate characters.
-
-
60. A character recognizing method according to claim 56, wherein the producing step is a step of producing the plurality of search character strings from the one or more conversion candidate characters using the evaluation value and a threshold value thereof set so as to distinguish a misrecognized character from the one or more conversion candidate characters.
-
61. A character recognizing method according to claim 60, wherein the step of determining the correct character includes the steps of:
-
comparing the occurrence frequency of each of the plurality of search character strings with each other;
selecting one of the plurality of search character strings, which has a highest value of the occurrence frequency, from the plurality of search character strings; and
obtaining the correct character from the one of the plurality of search character strings.
-
-
62. A character recognizing method according to claim 61, further comprising a step of correcting the misrecognized character with the correct character.
-
63. A character recognizing method according to claim 53, wherein the reference text data comprise chain information of characters which constitute a word and function as a knowledge database.
-
64. A character recognizing method according to claim 53, wherein the searching step includes the step of searching, in a full text search technique, the text data of the one or more conversion candidate characters for each of the plurality of search character strings.
-
65. A character recognizing method according to claim 53, further comprising the steps of:
-
providing a document including a character region including the character image to be recognized;
dividing the character region from the document; and
extracting the character image of every character from the character region, and providing the extracted character to the recognizing step.
-
-
66. A character recognizing method according to claim 65, wherein the searching step includes the step of instructing the extracting step to re-extract the characters one by one from the character region.
-
67. A character recognizing method according to claim 65, further comprising the step of recombining the correct characters of the character regions using a search for a character string connecting the correct character at an end of one region to the correct character at an end of another region.
-
68. A character recognizing apparatus, in which reference text data to be referred to character recognition and an index file of the reference text data are provided, comprising:
-
recognizing means for recognizing a character image provided so as to include a single character to be recognized as one or more conversion candidate characters for the character, the one or more conversion candidate characters each being composed of text data;
producing means for producing a plurality of search character strings based on the one or more conversion candidate characters;
searching means for searching the reference text data, by using a full text search technique based on the index file of the reference text data, for each of the plurality of search character strings to provide an occurrence frequency of each of the search character strings included in the reference text data; and
determining means for determining a character most appropriate for the character image by using the occurrence frequency of each of the search character strings.
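The searching and determining means can be illustrated with a toy positional index. The claims do not specify the structure of the index file, so the per-character position lists below are purely an assumption, as are the names `build_index`, `occurrence_frequency`, and `most_frequent`.

```python
from collections import defaultdict


def build_index(text):
    """Map each character to the ascending list of positions where it occurs."""
    index = defaultdict(list)
    for pos, ch in enumerate(text):
        index[ch].append(pos)
    return index


def occurrence_frequency(index, query):
    """Count occurrences of `query` using only the index, not the raw text."""
    if not query:
        return 0
    # Start from every position of the first character, then keep only the
    # starts whose following positions hold the following characters.
    starts = set(index.get(query[0], ()))
    for offset, ch in enumerate(query[1:], start=1):
        positions = set(index.get(ch, ()))
        starts = {s for s in starts if s + offset in positions}
        if not starts:
            break
    return len(starts)


def most_frequent(index, search_strings):
    """Determine the search string with the highest occurrence frequency."""
    return max(search_strings, key=lambda s: occurrence_frequency(index, s))


index = build_index("the cat sat on the mat")
at_count = occurrence_frequency(index, "at")
best = most_frequent(index, ["cal", "cat", "eat"])
```

Because the index answers each query from position lists, the reference text does not need to be rescanned for every candidate string.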
-
Specification