Redigitization system and service
Abstract
A system and method to error correct extant electronic documents is disclosed. An electronic document may be rasterized to obtain a pixel representation of the electronic document (e.g., raster image). One or more optical character recognition (OCR) tasks may be performed on the raster image of the electronic document. Errors discovered by the OCR tasks may be corrected and a customized error corrected version of the electronic document may be created and stored. If the author of the electronic document is known, the raster image may be compared to a personalized tf*idf error dictionary associated with the author to determine known OCR errors specific to the author. The raster image may also be compared to a personalized electronic error dictionary associated with the author to determine known typographical errors specific to the author.
26 Citations
13 Claims
1. A method comprising:
rasterizing an electronic document to obtain a raster image of the electronic document;
determining the author of the electronic document;
performing one or more optical character recognition (OCR) tasks on the raster image of the electronic document, performing the OCR tasks including identifying digitization errors in the electronic document based on a comparison to a personalized tf*idf error dictionary associated with the author to determine known OCR errors specific to the author, the personalized tf*idf error dictionary representing (i) whether a term occurred or not, (ii) how many times the term occurred, (iii) what percent of words are each term, (iv) using log instead of linear scales for the number of occurrences of the term, (v) or a combination thereof;
correcting errors discovered by the OCR tasks; and
creating a customized error corrected version of the electronic document.
Dependent claims: 2, 3, 4, 13.

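As an illustration only, the four term representations listed in claim 1 (occurrence, count, percentage, and log-scaled count) can be sketched in Python. The function names `term_frequency` and `tfidf_error_dictionary`, the input layout (one list of observed error tokens per document by the same author), and the smoothed idf formula are assumptions made for this sketch; the claim does not specify them.

```python
import math
from collections import Counter


def term_frequency(tokens, scheme="log"):
    """Weight each term under one of the four representations in claim 1."""
    counts = Counter(tokens)
    total = sum(counts.values())
    if scheme == "binary":    # (i) whether the term occurred or not
        return {t: 1.0 for t in counts}
    if scheme == "count":     # (ii) how many times the term occurred
        return {t: float(c) for t, c in counts.items()}
    if scheme == "percent":   # (iii) what percent of words are each term
        return {t: c / total for t, c in counts.items()}
    if scheme == "log":       # (iv) log instead of linear scale
        return {t: 1.0 + math.log(c) for t, c in counts.items()}
    raise ValueError(f"unknown scheme: {scheme}")


def tfidf_error_dictionary(error_docs, scheme="log"):
    """Build a hypothetical per-author error dictionary.

    error_docs: list of token lists, one per document by the author.
    The smoothed idf, ln((1 + N) / (1 + df)) + 1, is an assumption.
    """
    n_docs = len(error_docs)
    doc_freq = Counter(t for doc in error_docs for t in set(doc))
    tf = term_frequency([t for doc in error_docs for t in doc], scheme)
    return {
        t: w * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1.0)
        for t, w in tf.items()
    }
```

For example, `term_frequency(["teh", "teh", "rn", "teh"], "percent")` weights "teh" at 0.75 and "rn" at 0.25, while the "binary" scheme weights both at 1.0.
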
5. An apparatus comprising:
a processor circuit;
a processing engine to access stored electronic documents;
an optical character recognition (OCR) engine under control of the processor;
a rasterizer to create raster images of the electronic documents;
an OCR error dictionary stored in a memory;
a personalized tf*idf dictionary stored in the memory; and
a personalized electronic error dictionary stored in the memory,
the OCR engine to perform OCR tasks on the raster images using the OCR error dictionary, the personalized tf*idf dictionary, and the personalized electronic error dictionary, the OCR tasks to:
determine errors in the raster images;
correct the errors in the raster images;
create a customized error-corrected version of the electronic document; and
store the customized error-corrected version of the electronic document in the memory.

6. An article of manufacture comprising a non-transitory computer-readable storage medium containing instructions that if executed enable a system to:
access an electronic document;
obtain a pixel representation of a first version of an electronic document;
perform one or more optical character recognition (OCR) tasks on the pixel representation of the electronic document;
correct errors in the first version of the electronic document discovered by the OCR tasks;
search for a second version of the electronic document; and
create a customized error corrected version of the electronic document based on a comparison of a differential error rate between the first version of the electronic document and the second version of the electronic document.
Dependent claims: 7, 8, 9, 10, 11, 12.

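Claim 6 does not state how the differential error rate is computed. One possible reading, sketched below, treats each version's error rate as the fraction of its tokens that appear in a set of known errors, and takes the difference between the two versions. The names `error_rate`, `differential_error_rate`, and `pick_base_version`, and the rule of preferring the version with the lower rate, are hypothetical choices for this sketch.

```python
def error_rate(tokens, known_errors):
    """Fraction of tokens that match a known error (0.0 for an empty input)."""
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if t in known_errors) / len(tokens)


def differential_error_rate(first, second, known_errors):
    """Error rate of the first version minus that of the second version."""
    return error_rate(first, known_errors) - error_rate(second, known_errors)


def pick_base_version(first, second, known_errors):
    """Assumed rule: start from whichever version has the lower error rate."""
    if differential_error_rate(first, second, known_errors) > 0:
        return second
    return first


first = "teh cat sat on teh mat".split()
second = "the cat sat on teh mat".split()
known = {"teh"}
base = pick_base_version(first, second, known)  # selects the second version
```

Here the first version has two known errors in six tokens and the second has one, so the differential is positive and the second version is chosen as the starting point for correction.
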
Specification