Method and apparatus for processing alphanumeric and graphic information to create a data base
Abstract
The invention concerns a method and an apparatus for processing alphanumeric and graphics information recorded in page form on a system to create a data base which can be searched and edited, wherein the following steps take place automatically:
A) determination of digitized page images;
B) a first phase of processing digitized page images providing: verification that these pages are sequential, determination of the characteristic elements on each page, determination of the angle of deviation, mask formation, window formatting, alignment correction, identification of characteristic elements, image segmentation; recording in separate files the digitized page images, the digital image entities and the digitized segmented images accessible for editing;
C) a second processing phase consisting of optical character recognition of the components relative to the alphanumeric data from the segmented images and recording the data in a file to create a data base which can be searched.
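The sequential-page verification in step B can be illustrated with a minimal sketch. It assumes the characteristic element used for pagination is a printed page number that has already been read from each scan; the function name `find_pagination_errors` and the list-of-integers input are hypothetical and do not come from the patent.

```python
from typing import List, Optional, Tuple


def find_pagination_errors(detected: List[Optional[int]]) -> List[Tuple[int, str]]:
    """Return (scan_index, reason) for every scan that breaks the page sequence.

    `detected` holds the page number read from each scan, in scan order,
    or None when no page number could be found (hypothetical input format).
    """
    errors: List[Tuple[int, str]] = []
    expected: Optional[int] = None
    for index, number in enumerate(detected):
        if number is None:
            errors.append((index, "no page number detected"))
            # Assume the unreadable scan is still one physical page.
            if expected is not None:
                expected += 1
            continue
        if expected is not None and number != expected:
            errors.append((index, f"expected page {expected}, found {number}"))
        # Resynchronize on the number actually found.
        expected = number + 1
    return errors
```

Scans flagged this way would be candidates for the manual correction that the claims provide for pagination errors.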
10 Claims
1. Method for processing page form documents, said documents comprising discrete information portions, each of said portions comprising text and graphic fields, to create a data base stored in a computer system of digital representations of said page form documents which can be searched and edited, comprising the following steps:

A. creating digitally formatted documents in bitmap form comprising digital representations of said page form documents;
B. a first processing phase comprising the steps of;
    (1) identifying characteristic elements of each page of said page form documents in order to verify correct pagination of said digitally formatted documents,
    (2) determining by calculation what angle of rotation must be applied to properly orient each text field of each digitally formatted document for subsequent Optical Character Recognition conversion of said text fields,
    (3) creating a bitmap mask of said characteristic elements of each page of said page form documents, while allowing for said angle of rotation,
    (4) identifying said characteristic elements on each digitally formatted document in order to compare and verify said characteristics with said bitmap mask,
    (5) window-formatting said digitally formatted documents to separate the text and graphics fields each of said portions into blocks of digital information which can be separately accessed,
    (6) segmenting said blocks to distinguish text and graphics fields so that said fields may be separately stored,
    (7) correcting and aligning only said text fields by taking into account said angle of rotation to create aligned text fields,
    (8) reconstructing said digitally formatted documents from said aligned text fields and graphics fields so that each portion of said digitally formatted documents may be separately stored,
    (9) storing said digitally formatted documents, each portion of said digitally formatted documents, said blocks of digital information which can be separately accessed, and said text and graphic fields, in files which can be edited, and
    (10) manually correcting errors of digitization, pagination, indexing, segmenting and alignment, and
C. a second processing phase comprising Optical Character Recognition conversion of characters contained within said aligned text fields and storing said characters in a file which can be searched.

Dependent claims: 2, 3, 4, 5, 6, 7
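Step B(2) requires the angle of rotation to be determined "by calculation" but the claim does not say how. The sketch below shows one common way to do it, a projection-profile search, and is offered only as an illustration under that assumption, not as the patented method. The function name `estimate_skew_angle`, the search range and the step size are hypothetical. Under the code's convention, a foreground pixel at row `y`, column `x` is mapped to row `y*cos(a) + x*sin(a)`, and the returned angle `a` is the one that concentrates ink into the fewest rows.

```python
import numpy as np


def estimate_skew_angle(bitmap, max_angle=10.0, step=0.5):
    """Return the rotation (degrees) that best aligns text lines with image rows.

    `bitmap` is a 2-D array whose nonzero entries are foreground (ink) pixels.
    Candidate angles from -max_angle to +max_angle are tried in `step` increments.
    """
    ys, xs = np.nonzero(bitmap)
    if ys.size == 0:
        return 0.0  # blank field: nothing to align
    best_angle, best_score = 0.0, -1.0
    for angle in np.arange(-max_angle, max_angle + step, step):
        theta = np.deg2rad(angle)
        # Row index of every ink pixel after rotating by the candidate angle.
        rows = np.round(ys * np.cos(theta) + xs * np.sin(theta)).astype(int)
        # Horizontal projection profile: ink count per rotated row.
        profile = np.bincount(rows - rows.min()).astype(float)
        # Well-aligned text gives a few tall peaks, so the sum of squares is large.
        score = float(np.sum(profile ** 2))
        if score > best_score:
            best_angle, best_score = float(angle), score
    return best_angle
```

The returned value would then feed the alignment correction of step B(7), which the claim applies only to text fields.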
8. Apparatus for processing page form documents, said documents comprising discrete information portions, each of said portions comprising text and graphics fields, to create a data base stored in a computer system of digital representations of said page form documents which can be searched and edited, said apparatus comprising:

a. means, including scanner means, for creating digitally formatted documents in bitmap form comprising digital representations of said page form documents;
b. a computer for controlling said scanner;
c. means for identifying characteristic elements of each page of said page form documents in order to verify correct pagination of said digitally formatted documents;
d. means for determining by calculation what angle of rotation must be applied to properly orient each text field of each digitally formatted document for subsequent Optical Character Recognition conversion of said text fields;
e. means for creating a bitmap mask of said characteristic elements of each page of said page form documents, while allowing for said angle of rotation;
f. means for identifying said characteristic elements on each digitally formatted document in order to compare and verify said characteristics with said bitmap mask;
g. means for window-formatting said digitally formatted documents to separate the text and graphics fields each of said portions into blocks of digital information which can be separately accessed;
h. means for segmenting said blocks to distinguish text and graphics fields so that said fields may be separately stored;
i. means for correcting and aligning only said text fields by taking into account said angle of rotation to create aligned text fields;
j. means for reconstructing said digitally formatted documents from said aligned text fields and graphics fields so that each portion of said digitally formatted documents may be separately stored;
k. means for storing said digitally formatted documents, each portion of said digitally formatted documents, said blocks of digital information which can be separately accessed, and said text and graphics fields, in files which can be edited;
l. means for manually correcting errors of digitization, pagination, indexing, segmenting and alignment, said manual correction means including a visual graphics display console;
m. means for Optical Character Recognition conversion of characters contained within said aligned text fields and for storing said characters in a file which can be searched;
n. a sub-system for archiving and searching said data base; and
o. a laser printer.

Dependent claims: 9, 10
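Elements k, m and n describe separately stored text and graphics fields, a searchable file of recognized characters, and a search sub-system. A minimal sketch of how such records and a word index might be modeled follows. The claims specify no storage layout, so the `Field` and `Page` classes, the bounding-box tuple and `build_index` are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple


@dataclass
class Field:
    kind: str                        # "text" or "graphics" (assumed labels)
    bbox: Tuple[int, int, int, int]  # (left, top, right, bottom) in page pixels
    ocr_text: str = ""               # filled by the OCR phase, text fields only


@dataclass
class Page:
    number: int
    fields: List[Field] = field(default_factory=list)


def build_index(pages: List[Page]) -> Dict[str, Set[int]]:
    """Map each lowercase word in a text field to the page numbers containing it."""
    index: Dict[str, Set[int]] = {}
    for page in pages:
        for item in page.fields:
            if item.kind != "text":
                continue  # graphics fields are stored but never indexed
            for word in item.ocr_text.lower().split():
                token = word.strip(".,;:()")
                if token:
                    index.setdefault(token, set()).add(page.number)
    return index
```

Looking up a word in the returned dictionary gives the pages whose text fields contain it, while graphics fields remain available through each page's `fields` list.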
Specification