System and method for data extraction from digital images
Abstract
A system and method for the extraction of textual data from a digital image using a data pattern comprised of visible and invisible characters to locate the data to be extracted and, upon finding such data, populating the fields of an associated database with the extracted visible data. The digital image to be processed is first compared against master document images contained in a database. Upon determining the proper master document image, a template having predefined data zones is applied to the image to create zone images. The zone images are optically read and converted into a character file, which is then parsed with the pattern to locate the text to be extracted. Upon finding data matching the pattern, that data is extracted and the visible portions are used to populate data fields in a database record associated with the digital image.
In an alternate embodiment, if the extracted data cannot be successfully matched, a validation file of the unmatched data is created for review by an operator. In a further embodiment, if the scanned digital image cannot be matched with an existing master document image, a new master document image can be created from the unmatched digital image. In another alternate embodiment, alternate patterns can be used to search the data files allowing for variation in format of the data being extracted.
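The pattern matching described in the abstract can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes a pattern can be modeled as a list of regular-expression segments, with invisible characters (tabs, spaces, line breaks) written as explicit segments and only selected visible segments mapped to data fields. The names Segment, compile_pattern and extract, and the sample invoice text, are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Segment:
    regex: str                   # characters this segment must match
    field: Optional[str] = None  # target data field; None means "locate only"


def compile_pattern(segments: List[Segment]) -> "re.Pattern[str]":
    """Join the nonoverlapping segments, in order, into one regular expression."""
    parts = []
    for seg in segments:
        if seg.field:
            parts.append(f"(?P<{seg.field}>{seg.regex})")  # captured segment
        else:
            parts.append(f"(?:{seg.regex})")               # anchor-only segment
    return re.compile("".join(parts))


def extract(zone_text: str, segments: List[Segment]) -> Optional[Dict[str, str]]:
    """Return the captured visible data, or None when the pattern is not found."""
    found = compile_pattern(segments).search(zone_text)
    if found is None:
        return None
    return {name: value.strip() for name, value in found.groupdict().items()}


# Hypothetical pattern: visible labels and invisible whitespace locate the data,
# and only the two captured segments populate fields.
INVOICE_PATTERN = [
    Segment(r"Invoice No\.:"),
    Segment(r"[ \t]+"),                       # invisible: tab or spaces
    Segment(r"\d{6}", field="invoice_number"),
    Segment(r"\r?\n"),                        # invisible: line break
    Segment(r"Date:"),
    Segment(r"[ \t]+"),
    Segment(r"\d{2}/\d{2}/\d{4}", field="invoice_date"),
]

zone_text = "ACME Corp\nInvoice No.:\t004217\nDate:  03/15/2004\nTotal due"
record = extract(zone_text, INVOICE_PATTERN)
print(record)
```

Modeling the invisible characters as their own segments keeps the labels and whitespace out of the stored fields while still using them to locate the data.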
13 Claims
1. A system for the extraction of textual data from a digital image using predefined patterns based on visible and invisible characters contained in the textual data, comprising:
database means for storing data base records comprising;
a master document image database comprised of at least one table containing at least one master document image;
a template database comprised of at least one table comprising at least one template associated with the master document image, the template having at least one zone, the zone associated with a unique pattern comprised of one or more nonoverlapping segments, each segment containing one or more characters, with selected ones of the segments being associated with a data field in an extracted data base record;
an extracted data database comprised of at least one table of extracted data base records, each record comprised of at least one data field for storing textual information extracted from the digital image;
an image comparator in communication with the database means and receiving therefrom the master document image, the image comparator having an input for receiving the digital image, the image comparator comparing the master document image to the digital image and providing an output indicative of the success of the comparison;
a template mapper in communication with the database means and the output of the image comparator and having an input for receiving the digital image and, on receiving the image comparator output indicating a successful comparison, retrieving the template from the template database associated with the successfully compared master document image and applying the template to the digital image, the template mapper providing as an output an image of each zone associated with the applied template;
a zone optical character reader (OCR) in communication with the template mapper and receiving the output thereof, the zone OCR creating a zone data file of the characters in each zone image and providing the zone data file as an output;
a zone pattern comparator in communication with the database means and the output of the zone OCR, the zone pattern comparator retrieving from the template database the pattern associated with the zone and comparing the pattern to the zone data file, and, in the event that the pattern is found, extracting the data matching the pattern into an extracted data file, the zone pattern comparator providing the extracted data file as an output; and
an extracted data parser in communication with the database means and the output of the zone pattern comparator, the parser parsing the data in the extracted data file for populating the data field of the database record associated with the pattern, the parser providing as an output the populated database record to the extracted data database for storage therein.
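The template mapper step of claim 1, which outputs an image of each zone, can be illustrated with a small sketch. It assumes, purely for illustration, that an image is a list of pixel rows and that each zone is a rectangle given as (left, top, right, bottom); the claim does not specify either representation, and crop and apply_template are hypothetical names.

```python
from typing import Dict, List, Tuple

Image = List[List[int]]          # rows of pixel values (assumed representation)
Box = Tuple[int, int, int, int]  # left, top, right, bottom; right/bottom exclusive


def crop(image: Image, box: Box) -> Image:
    """Cut one rectangular zone out of the image."""
    left, top, right, bottom = box
    return [row[left:right] for row in image[top:bottom]]


def apply_template(image: Image, zones: Dict[str, Box]) -> Dict[str, Image]:
    """Return one zone image per zone defined in the template."""
    return {name: crop(image, box) for name, box in zones.items()}


# A 4x4 test image whose pixel value encodes its position (row * 4 + column).
image = [[r * 4 + c for c in range(4)] for r in range(4)]
zone_images = apply_template(image, {"center": (1, 1, 3, 3), "top_row": (0, 0, 4, 1)})
print(zone_images["center"])
```

Each zone image would then be passed to the zone OCR separately, so that the pattern comparison only sees text from its own zone.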
an image enhancer having an input for receiving a digital image of a document to be made into a master document image, the image enhancer comprising;
selection means for selectably choosing from the following operations;
deskewing, registration, line management, fix white text, noise removal, character enhancement, image orientation, image enhancement, and image extraction; and
means for performing each of the selected operations on the digital image to produce an enhanced digital image;
a zone mapper in communication with the image enhancer and receiving therefrom the enhanced image, the zone mapper comprising;
means for selecting one or more regions of said enhanced image and defining each selected region as a zone;
means for selecting a zone and selectably removing images and data contained in a selected zone; and
means for associating each zone defined for the enhanced image with a template;
OCR reader means for converting material in a selected zone into a data file of characters;
a pattern mapper in communication with the OCR reader means and receiving therefrom the data file for a selected zone, the pattern mapper comprising;
means for selecting from the data file a sequence of characters to be used to define a pattern;
means for creating a data template having one or more nonoverlapping data segments, each segment containing one or more characters contained in the pattern;
means for selectably associating with each data segment one or more of the following characteristics;
capture data indicator, data type, data format, element length, table name, field name, field type, field length, and validation indicator; and
means for associating the pattern and its associated characteristics with the zone; and
means for storing the template and the associated zones, patterns and characteristics in a database.
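The characteristics listed above for each data segment suggest a record layout along the following lines. This is a sketch under assumptions: the claim names the characteristics but not their types or storage format, so the class names, field types and defaults here are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class DataSegment:
    characters: str                       # characters taken from the pattern
    capture: bool = False                 # capture data indicator
    data_type: Optional[str] = None       # e.g. "numeric", "date" (assumed values)
    data_format: Optional[str] = None     # e.g. "MM/DD/YYYY" (assumed values)
    element_length: Optional[int] = None
    table_name: Optional[str] = None
    field_name: Optional[str] = None
    field_type: Optional[str] = None
    field_length: Optional[int] = None
    needs_validation: bool = False        # validation indicator


@dataclass
class Zone:
    box: Tuple[int, int, int, int]        # region of the enhanced image
    pattern: List[DataSegment] = field(default_factory=list)


@dataclass
class Template:
    master_image_id: str
    zones: List[Zone] = field(default_factory=list)


def captured_fields(template: Template) -> List[Tuple[Optional[str], Optional[str]]]:
    """List the (table, field) targets of every segment flagged for capture."""
    return [
        (seg.table_name, seg.field_name)
        for zone in template.zones
        for seg in zone.pattern
        if seg.capture
    ]


template = Template(
    master_image_id="invoice-v1",
    zones=[Zone(
        box=(40, 20, 400, 80),
        pattern=[
            DataSegment(characters="Invoice No.:"),
            DataSegment(characters="004217", capture=True, data_type="numeric",
                        element_length=6, table_name="invoices",
                        field_name="invoice_number", field_type="text",
                        field_length=6, needs_validation=True),
        ],
    )],
)
print(captured_fields(template))
```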
5. The system of claim 4 wherein the system further comprises a computer system being responsive to a program operating therein, said programmed computer system comprising the image comparator, the template mapper, the zone optical reader, the zone pattern comparator, the extracted data parser, the database means, and the means for creating a master document image.
6. A method for the extraction of textual data from a digital image containing character images using predefined patterns based on visible and invisible characters contained in the textual data, comprising:
a) selecting from a database a master document image having associated therewith a template, the template having a zone, the zone having associated therewith one pattern comprised of at least one data segment each containing a data sequence of one or more characters;
b) creating an unpopulated database table having one or more data records, each data record having one or more data fields for containing visible character data extracted from the digital image and associating the database table with the master document image and the database record with the digital image, and, for at least one of the data segments containing visible data associating it with a database field in the database;
c) comparing the digital image to the master document image and upon a successful match occurring;
applying the associated template and the zone therein to the digital image, performing optical character recognition on the character images within the zone, creating a zone data file containing the characters optically read from the zone;
comparing the zone data file with the pattern associated with the zone;
extracting the data in the zone data file that matches the pattern, and, for each data segment associated with a data field, populating the data field with the visible data extracted from the zone data file corresponding to that data segment.
processing the file containing the unmatched digital image to create a new master document and associated template, zones and patterns comprising the steps of;
selectably removing images and data not required to be extracted from the unmatched digital image to form a new master document image;
defining on the unmatched digital image one or more nonoverlapping zones of data to be extracted to form a template;
performing optical character recognition on the character images contained in each zone and converting the character images into a data file of characters;
selecting from the data file for each zone a sequence of characters to be used to define a pattern of data to be extracted;
creating a data template having one or more nonoverlapping data segments, each segment containing one or more characters contained in the pattern;
selectably associating with each data segment one or more of the following characteristics;
capture data indicator, data type, data format, element length, table name, data field name, data field type, data field length, and validation indicator;
associating the pattern and its associated characteristics with its respective zone; and
storing the template and the associated zones, patterns and characteristics in the database.
9. The method of claim 7 further comprising the step of creating, in the event no zone data matching the pattern is found, a validation file containing the zone data file for operator review.
10. The method of claim 6 further comprising the step of scanning a printed document containing data to be extracted and entered into the database record to create a digital image of the printed document for processing.
11. The method of claim 6 wherein the template further comprises one or more alternate zones, each alternate zone associated with a unique alternate pattern that is similar to but differs from the pattern associated with the other zones and the step of comparing the zone data file with the zone pattern further comprises:
selecting one of the alternate zones in the event no match is found with the pattern in the zone;
comparing the selected alternate zone and its associated pattern with the zone data file;
repeating the foregoing steps until a match is found with one of the alternate zones;
in the event no match is found, creating a validation file containing the zone data for which no match was found; and
in the event of a match, performing the steps of extracting and populating using the data pattern found in the matching alternate zones.
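The alternate-zone fallback of claim 11 can be sketched as a loop over candidate patterns. This illustration assumes each pattern is a regular expression with named groups and that the validation file can be stood in for by an in-memory list; both are simplifications, and match_with_alternates and process_zone are hypothetical names.

```python
import re
from typing import Dict, List, Optional


def match_with_alternates(zone_text: str, primary: str,
                          alternates: List[str]) -> Optional[Dict[str, str]]:
    """Try the primary pattern, then each alternate in turn, until one matches."""
    for pattern in [primary, *alternates]:
        found = re.search(pattern, zone_text)
        if found:
            return found.groupdict()
    return None


def process_zone(zone_text: str, primary: str, alternates: List[str],
                 validation_queue: List[str]) -> Optional[Dict[str, str]]:
    """Extract on a match; otherwise queue the zone data for operator review."""
    data = match_with_alternates(zone_text, primary, alternates)
    if data is None:
        validation_queue.append(zone_text)
    return data


# Two hypothetical patterns for the same field in different date formats.
PRIMARY = r"Date:[ \t]+(?P<date>\d{2}/\d{2}/\d{4})"
ALTERNATES = [r"Date:[ \t]+(?P<date>\d{4}-\d{2}-\d{2})"]

queue: List[str] = []
print(process_zone("Date: 03/15/2004", PRIMARY, ALTERNATES, queue))
print(process_zone("Date: 2004-03-15", PRIMARY, ALTERNATES, queue))
print(process_zone("Dated sometime in March", PRIMARY, ALTERNATES, queue))
print(queue)
```

Trying the patterns in a fixed order keeps the behavior predictable when more than one alternate could match the same zone data.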
12. A method for the extraction of textual data from a digital image using predefined patterns based on visible and invisible characters contained in the textual data, comprising:
a) creating at least one master document image having associated therewith at least one template, the template having at least one zone, each zone having associated with it one predefined pattern comprised of one or more data segments containing a data sequence of one or more characters;
b) creating an unpopulated database table having one or more data records, each data record having one or more data fields for containing visible character data extracted from the digital image and associating the database table with the master document image and the database record with the digital image, and, for at least one of the data segments containing visible data associating it with a database field;
c) storing the database record, master document image and associated template, zone and pattern in a database;
d) comparing the digital image to a master document image retrieved from the database;
e) upon an unsuccessful match;
1) selecting a new master document image from the database for comparison until a match is found; and
2) in the event no match is found, creating a file for the unmatched image and alerting the operator that no match has been found;
f) upon a successful match;
1) applying the associated template and each zone therein to the digital image,
2) performing optical character recognition on the character images within each zone,
3) creating, for each zone, a zone data file containing the characters read from the zone;
4) selecting a zone data file;
5) comparing the selected zone data file with the pattern associated with the selected zone;
6) in the event no data matching the pattern is found, creating a validation file containing the zone data file for operator review; and
7) in the event data matching the pattern is found, extracting the data in the zone data file that matches the pattern, and, for each data segment associated with a data field, populating the data field with the visible data extracted from the zone data file corresponding to that data segment.
g) selecting, if additional zones are present, the next zone and repeating steps d-f; and
h) selecting, if additional digital images are present, the next digital image to be processed and repeating steps d-g.
determining if the data in the populated data field requires validation, and on determining that validation is required, creating a second validation file containing at least the data field for operator review.
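The control flow of claims 12 and 13 can be sketched as one function. This is a simplified illustration under stated assumptions: image comparison and OCR are passed in as callables (is_match and read_zone, both hypothetical) because the claims do not fix how they are implemented, patterns are regular expressions with named groups, and the unmatched-image file and the two validation files are represented by plain lists.

```python
import re
from typing import Any, Callable, Dict, List, Optional, Tuple


def process_image(image: Any, masters: List[dict],
                  is_match: Callable[[Any, dict], bool],
                  read_zone: Callable[[Any, dict], str],
                  unmatched_images: List[Any],
                  unmatched_zones: List[Tuple[str, str]],
                  field_review: List[Dict[str, str]]) -> Optional[Dict[str, str]]:
    # Steps d-e: try each master document image until one matches.
    master = next((m for m in masters if is_match(image, m)), None)
    if master is None:
        unmatched_images.append(image)   # stand-in for "file for the unmatched image"
        return None
    record: Dict[str, str] = {}
    # Step f: apply the template, read each zone, compare it with its pattern.
    for zone in master["zones"]:
        text = read_zone(image, zone)
        found = re.search(zone["pattern"], text)
        if found is None:
            unmatched_zones.append((zone["name"], text))  # first validation file
            continue
        record.update(found.groupdict())
        if zone.get("validate"):                          # claim 13: second file
            field_review.append(found.groupdict())
    return record


# Hypothetical stand-ins: "images" are dicts that already carry their OCR text.
masters = [{"id": "invoice", "zones": [
    {"name": "header", "pattern": r"Invoice No\.:\s+(?P<invoice_number>\d+)",
     "validate": True},
    {"name": "footer", "pattern": r"Total:\s+(?P<total>\d+\.\d{2})"},
]}]
image = {"kind": "invoice",
         "text": {"header": "Invoice No.: 004217", "footer": "Tota1: 12.50"}}

unmatched_images, unmatched_zones, field_review = [], [], []
record = process_image(
    image, masters,
    is_match=lambda img, m: img["kind"] == m["id"],
    read_zone=lambda img, z: img["text"][z["name"]],
    unmatched_images=unmatched_images,
    unmatched_zones=unmatched_zones,
    field_review=field_review,
)
print(record, unmatched_zones, field_review)
```

In the sample data the footer contains a simulated OCR misread ("Tota1"), so that zone goes to operator review while the header field is still extracted.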
Specification