Document analysis system for integration of paper records into a searchable electronic database
Abstract
Electronic extraction of information from fields within documents comprises identifying a document by comparison to a template library, identifying data fields based on size and position, extracting data from the fields, and applying recognition. Line identification employs shaded region identification, line capture and gap filling, line segment clustering, and optional line rotation. Fingerprinting methods compare line segments found in a document with line definitions for templates to identify the template that best matches the document. Templates for new form types are defined by identifying and determining a location and size for lines, boxes, or shaded regions located within the form. Form fields based on location are then defined, any text within each field is recognized, and field identifiers and content descriptors are assigned and stored to define the template. Identification of unmatched documents is facilitated by clustering unidentified documents for use in identification or creation of a new form template.
22 Claims
-
1. A computer-readable medium, the medium being characterized in that:
the computer-readable medium contains code which, when executed in a processor, performs document analysis by the steps of:
electronically receiving at least one input scan containing at least one field for containing data;
analyzing the input scan to identify lines and fields within the input scan, by the steps of:
locating at least one shaded region or line segment;
filtering any shaded region found;
detecting and filling in any gaps in any located line segment;
clustering any line segments co-located within a specified shift distance; and
determining a length and a location for each line segment or line segment cluster;
comparing the analyzed input scan against a library of form templates;
identifying the form template that best matches the input scan;
based on the identified form template, identifying at least one field or line within the input scan; and
extracting data from the identified field or line.
Dependent claims: 2, 3, 4, 5, 6, 7, 8
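The line-analysis sub-steps of claim 1 (gap filling, clustering within a shift distance, and determining a length and location) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and does not come from the patent: the `Segment` type, the `max_gap` and `shift` parameters, the choice to treat all segments as horizontal, and the use of the mean position as a cluster's location. The sketch also ignores whether clustered segments overlap along the line direction.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    pos: int    # perpendicular coordinate (the row of a horizontal line)
    start: int  # where the segment begins along the line direction
    end: int    # where the segment ends along the line direction

    @property
    def length(self) -> int:
        return self.end - self.start


def fill_gaps(segments, max_gap):
    """Merge segments on the same row that are separated by at most max_gap."""
    merged = []
    for seg in sorted(segments, key=lambda s: (s.pos, s.start)):
        last = merged[-1] if merged else None
        if last and last.pos == seg.pos and seg.start - last.end <= max_gap:
            merged[-1] = Segment(last.pos, last.start, max(last.end, seg.end))
        else:
            merged.append(seg)
    return merged


def cluster_segments(segments, shift):
    """Group segments whose rows lie within `shift` of the cluster's first row."""
    clusters = []
    for seg in sorted(segments, key=lambda s: s.pos):
        if clusters and seg.pos - clusters[-1][0].pos <= shift:
            clusters[-1].append(seg)
        else:
            clusters.append([seg])
    return clusters


def summarize(cluster):
    """Reduce a cluster to one segment: mean row, full extent along the line."""
    pos = round(sum(s.pos for s in cluster) / len(cluster))
    return Segment(pos, min(s.start for s in cluster), max(s.end for s in cluster))


def analyze_lines(segments, max_gap=5, shift=2):
    filled = fill_gaps(segments, max_gap)
    return [summarize(c) for c in cluster_segments(filled, shift)]
```

Calling `analyze_lines` on raw detections returns one `Segment` per line or line cluster, whose `pos`, `start`, and `length` correspond to the location and length named in the claim.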
-
9. A computer-readable medium, the medium being characterized in that:
the computer-readable medium contains code which, when executed in a processor, matches an input scan to a form template by the steps of:
for every line segment identified on the input scan, comparing the position and length of the line segment with at least one line definition from a form template contained in a form template library; and
determining the offset between the input scan line segment and the form template line definition;
using the determined offsets for all input scan line segments, determining a score related to the goodness of fit between the input scan and the form template; and
determining which form template most closely matches the input scan by comparing the score for each form template against scores for other form templates in the form template library.
Dependent claims: 10, 11
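The offset-based matching of claim 9 can be sketched as follows. The claim does not fix a particular offset measure or scoring formula, so the choices here are assumptions for illustration only: the `Line` type, an offset defined as the sum of absolute differences in position and length, pairing each scan line with its nearest template line, and a score equal to the mean offset (lower meaning a better fit).

```python
from typing import NamedTuple


class Line(NamedTuple):
    x: float
    y: float
    length: float


def line_offset(scan_line, template_line):
    """Assumed offset measure: absolute differences in position and length."""
    return (abs(scan_line.x - template_line.x)
            + abs(scan_line.y - template_line.y)
            + abs(scan_line.length - template_line.length))


def fit_score(scan_lines, template_lines):
    """Mean offset of each scan line to its nearest template line (lower is better)."""
    if not scan_lines or not template_lines:
        return float("inf")
    offsets = [min(line_offset(s, t) for t in template_lines) for s in scan_lines]
    return sum(offsets) / len(offsets)


def best_template(scan_lines, library):
    """Return the name of the template whose score is lowest across the library."""
    return min(library, key=lambda name: fit_score(scan_lines, library[name]))
```

Here `library` is assumed to be a dictionary mapping a template name to its list of line definitions.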
-
12. A computer-readable medium, the medium being characterized in that:
the computer-readable medium contains code which, when executed in a processor, matches an input scan to a form template by the steps of:
determining an overall line length of identified line segments on the input scan;
ordering form templates in a form template library by comparing the overall line length definition for each template to the input scan overall line length;
separating the input scan line segments into a vertical line class and a horizontal line class;
ordering each class by clustering the perpendicular positioning of each line segment in the class and then sorting each cluster by the parallel positioning of each line segment in the cluster;
beginning with the first form template according to the form template order and employing dynamic programming methodologies, determining an alignment and score for each of the vertical and horizontal line classes based on comparisons of line position and length;
concatenating the alignments from the vertical and horizontal classes to obtain an overall score for the form template;
if more form templates remain in the library, repeating for each form template; and
determining which form template most closely matches the input scan by comparing the overall score for each form template against scores for other form templates in the form template library.
Dependent claims: 13, 14
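The steps of claim 12 can be sketched with a standard global-alignment recurrence. This is one possible reading, not the patented implementation. The `Line` type, the `shift` and `gap` values, and the mismatch cost are assumptions. The sketch also simplifies in two ways: it returns only the alignment cost rather than the alignment itself, and it combines the horizontal and vertical classes by adding their costs (lower is better) instead of concatenating alignments.

```python
from typing import NamedTuple


class Line(NamedTuple):
    orientation: str  # "h" for horizontal, "v" for vertical
    perp: float       # perpendicular position (y for "h", x for "v")
    par: float        # starting position along the line direction
    length: float


def order_class(lines, shift=3.0):
    """Cluster by perpendicular position, then sort each cluster by parallel position."""
    ordered, cluster = [], []
    for line in sorted(lines, key=lambda l: l.perp):
        if cluster and line.perp - cluster[0].perp > shift:
            ordered.extend(sorted(cluster, key=lambda l: l.par))
            cluster = []
        cluster.append(line)
    ordered.extend(sorted(cluster, key=lambda l: l.par))
    return ordered


def mismatch(a, b):
    """Assumed cost of pairing two lines: differences in position and length."""
    return abs(a.perp - b.perp) + abs(a.par - b.par) + abs(a.length - b.length)


def align(scan, template, gap=50.0):
    """Global alignment cost by dynamic programming; `gap` penalizes an unpaired line."""
    n, m = len(scan), len(template)
    cost = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i * gap
    for j in range(1, m + 1):
        cost[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i][j] = min(
                cost[i - 1][j - 1] + mismatch(scan[i - 1], template[j - 1]),
                cost[i - 1][j] + gap,   # scan line left unpaired
                cost[i][j - 1] + gap,   # template line left unpaired
            )
    return cost[n][m]


def template_score(scan_lines, template_lines):
    """Sum the alignment costs of the horizontal and vertical classes."""
    total = 0.0
    for orientation in ("h", "v"):
        s = order_class([l for l in scan_lines if l.orientation == orientation])
        t = order_class([l for l in template_lines if l.orientation == orientation])
        total += align(s, t)
    return total


def match(scan_lines, library):
    """Visit templates in order of overall line length closeness; return best name and scores."""
    scan_total = sum(l.length for l in scan_lines)
    order = sorted(
        library,
        key=lambda name: abs(sum(l.length for l in library[name]) - scan_total),
    )
    scores = {name: template_score(scan_lines, library[name]) for name in order}
    return min(scores, key=scores.get), scores
```

In this sketch the ordering by overall line length only determines the visiting order. A fuller version might stop early once a template scores below some threshold, which is presumably why the ordering is useful.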
-
15. A computer-readable medium, the medium being characterized in that:
the computer-readable medium contains code which, when executed in a processor, performs form template definition by the steps of:
electronically receiving an instance of a new form type;
identifying at least some lines, boxes, or shaded regions located within the form instance;
determining a location and size for each identified line, box, or shaded region;
from the location and size determined for the identified lines, boxes, or shaded regions, defining at least one form field having an associated form field location;
optionally recognizing any text within each defined form field;
based on the content of any recognized text for a form field and the associated form field location, assigning an associated form field identifier and an associated form field content descriptor for each form field; and
storing the line locations, form field identifiers, associated form field locations, and associated form field content descriptors to define a form template for the new form type.
Dependent claims: 16, 17, 18, 19
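A form template of the kind described in claim 15 could be stored as a small record. The sketch below is illustrative only. The `Box` and `FormField` types, the `describe` rule for guessing a content descriptor, and the dictionary layout are all hypothetical. Text recognition is assumed to have been done elsewhere, with its output passed in as `labels`, a mapping from box index to recognized label text.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Box:
    x: int
    y: int
    width: int
    height: int


@dataclass
class FormField:
    identifier: str
    location: Box
    content_descriptor: str


def describe(label):
    """Hypothetical rule: guess the expected content type from the label text."""
    lowered = label.lower()
    if "date" in lowered:
        return "date"
    if any(word in lowered for word in ("amount", "total", "qty")):
        return "number"
    return "text"


def define_template(form_type, lines, boxes, labels):
    """Build a storable template from detected lines, boxes, and recognized labels."""
    fields = []
    for index, box in enumerate(boxes):
        label = labels.get(index, "")
        identifier = label.strip().lower().replace(" ", "_") or f"field_{index}"
        fields.append(FormField(identifier, box, describe(label)))
    return {
        "form_type": form_type,
        "lines": [list(line) for line in lines],
        "fields": [asdict(f) for f in fields],
    }


def store_template(template):
    """Serialize the template; a real system would write this to a template library."""
    return json.dumps(template, sort_keys=True)
```

Fields without recognized text fall back to a positional identifier such as `field_2`, which keeps the optional recognition step of the claim genuinely optional.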
-
20. A computer-readable medium, the medium being characterized in that:
the computer-readable medium contains code which, when executed in a processor, performs identification of unidentified input scans by the steps of:
identifying a plurality of input scans that have failed to be matched to a template during a document analysis procedure;
performing a document analysis procedure by selecting one unidentified input scan as a template and using the remaining unidentified input scans as input scans;
placing any input scans that match into an unidentified input scan cluster; and
matching the unidentified input scan cluster to an existing form template from another source or to a new form template defined using the unidentified input scan cluster.
Dependent claims: 21, 22
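The clustering of unmatched scans in claim 20 can be sketched as a greedy loop: one unidentified scan is taken as a provisional template, and every remaining scan that matches it joins its cluster. The names and the matching rule are assumptions for illustration. In particular, `similar` stands in for the full document analysis procedure and merely compares sorted line positions within a tolerance. The final step of the claim, matching each cluster to an existing or newly defined template, is not shown.

```python
def similar(seed, other, tolerance=3):
    """Stand-in matcher: same number of lines, each within `tolerance` of its counterpart."""
    a, b = sorted(seed[1]), sorted(other[1])
    return len(a) == len(b) and all(abs(p - q) <= tolerance for p, q in zip(a, b))


def cluster_unmatched(scans, matches=similar):
    """Greedily group scans, using the first remaining scan as a provisional template."""
    remaining = list(scans)
    clusters = []
    while remaining:
        seed = remaining.pop(0)
        cluster, rest = [seed], []
        for scan in remaining:
            if matches(seed, scan):
                cluster.append(scan)
            else:
                rest.append(scan)
        remaining = rest
        clusters.append(cluster)
    return clusters
```

Each scan is assumed to be a `(name, line_positions)` pair. Because the loop is greedy, the resulting clusters can depend on the order in which scans are presented.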
Specification