Systems and methods for automatically identifying document information
First Claim
1. A computer-implemented method of processing an electronic document, comprising:
defining, by a processor, a set of canonical features for a document type and a plurality of attributes for a canonical feature;
receiving, by the processor, an electronic document of the document type;
identifying a set of text rectangles from the electronic document;
obtaining a comparison set of reference document codifications, one of the comparison set of reference document codifications comprising a plurality of canonical feature codifications, one of the plurality of canonical feature codifications comprising one or more attribute values for one or more of the plurality of attributes of one of the set of canonical features as the one canonical feature appears in the one reference document;
for each current canonical feature of the set of canonical features:
selecting a set of canonical feature codifications from the comparison set of reference document codifications;
determining a set of possible data types for the current canonical feature from the set of canonical feature codifications;
calculating a frequency of occurrence for each of the set of possible data types;
filtering out each of the set of canonical feature codifications for which the frequency of occurrence of the corresponding data type is below a threshold to obtain a filtered set of canonical feature codifications; and
identifying a match between one of the set of text rectangles and one of the filtered set of canonical feature codifications;
for each of the set of text rectangles, selecting one of the matching canonical feature codifications as a final canonical feature codification for the text rectangle.
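The data-type frequency filter recited in the per-feature steps above can be illustrated with a minimal sketch. It is not the patented implementation. The `FeatureCodification` shape, the `filter_by_data_type_frequency` name, and the reading of "frequency of occurrence" as the share of selected codifications carrying a given data type are all assumptions made for illustration; the claim does not fix any of them.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class FeatureCodification:
    """One canonical feature as it appears in one reference document (hypothetical shape)."""
    feature: str       # canonical feature name, e.g. "invoice_date" (illustrative)
    data_type: str     # value of an assumed "data type" attribute
    reference_id: str  # reference document this codification came from


def filter_by_data_type_frequency(codifications, feature, threshold):
    """Keep the codifications of `feature` whose data type is frequent enough.

    Assumption: frequency is the fraction of the selected codifications
    that carry a given data type.
    """
    # Select the codifications for the current canonical feature.
    selected = [c for c in codifications if c.feature == feature]
    if not selected:
        return []
    # Determine the possible data types and how often each occurs.
    counts = Counter(c.data_type for c in selected)
    total = len(selected)
    # Filter out codifications whose data type falls below the threshold.
    return [c for c in selected if counts[c.data_type] / total >= threshold]


# Illustrative comparison set: "date" occurs in 3 of 4 invoice_date codifications.
comparison_set = [
    FeatureCodification("invoice_date", "date", "ref-1"),
    FeatureCodification("invoice_date", "date", "ref-2"),
    FeatureCodification("invoice_date", "date", "ref-3"),
    FeatureCodification("invoice_date", "text", "ref-4"),
    FeatureCodification("total", "currency", "ref-1"),
]
filtered = filter_by_data_type_frequency(comparison_set, "invoice_date", threshold=0.5)
```

With a threshold of 0.5, the single "text" codification (frequency 0.25) is dropped and the three "date" codifications remain as candidates for matching against text rectangles.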
Abstract
A computer-implemented method comprises defining a set of canonical features for a document type and a plurality of attributes for a canonical feature; identifying a set of text rectangles from an electronic document; obtaining a comparison set of reference document codifications, one of which comprises a plurality of canonical feature codifications, one of which comprises one or more attribute values for one or more of the plurality of attributes of one of the set of canonical features as the one canonical feature appears in the one reference document; for each current canonical feature of the set of canonical features: selecting a set of canonical feature codifications from the comparison set and identifying a match between one of the set of text rectangles and one of the set of canonical feature codifications; and, for each of the set of text rectangles, selecting one of the matching canonical feature codifications.
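The matching and selection steps summarized in the abstract can be sketched as follows. The abstract does not say how a match is identified or how the final codification is chosen, so everything specific here is an assumption for illustration: the `TextRectangle` and `FeatureCodification` shapes, the regular-expression recognizers in `DATA_TYPE_PATTERNS`, the distance cutoff, and the nearest-position selection rule.

```python
import math
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class TextRectangle:
    """A text region found in the incoming document (hypothetical shape)."""
    text: str
    x: float
    y: float


@dataclass(frozen=True)
class FeatureCodification:
    """A canonical feature as seen in a reference document (hypothetical shape)."""
    feature: str
    data_type: str
    x: float
    y: float


# Assumed recognizers for two illustrative data types.
DATA_TYPE_PATTERNS = {
    "date": re.compile(r"\d{4}-\d{2}-\d{2}"),
    "currency": re.compile(r"\$\d+(\.\d{2})?"),
}


def distance(rect, cod):
    """Euclidean distance between a rectangle and a codification position."""
    return math.hypot(rect.x - cod.x, rect.y - cod.y)


def matches(rect, cod, max_distance):
    """Assumed match rule: text fits the data type and the positions are close."""
    pattern = DATA_TYPE_PATTERNS.get(cod.data_type)
    if pattern is None or not pattern.fullmatch(rect.text):
        return False
    return distance(rect, cod) <= max_distance


def select_final(rects, codifications, max_distance=50.0):
    """For each rectangle, pick the nearest matching codification, if any."""
    final = {}
    for rect in rects:
        candidates = [c for c in codifications if matches(rect, c, max_distance)]
        if candidates:
            final[rect] = min(candidates, key=lambda c: distance(rect, c))
    return final


rects = [
    TextRectangle("2021-03-04", 100, 20),
    TextRectangle("$19.99", 400, 700),
    TextRectangle("Thank you", 50, 750),
]
codifications = [
    FeatureCodification("invoice_date", "date", 110, 25),
    FeatureCodification("due_date", "date", 130, 40),
    FeatureCodification("total", "currency", 410, 690),
]
final = select_final(rects, codifications)
```

In this example the date rectangle matches both date codifications and is assigned the nearer one (`invoice_date`), the amount is assigned `total`, and the free-text rectangle matches nothing and is left unassigned.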
21 Citations
18 Claims
1. A computer-implemented method of processing an electronic document, comprising:
defining, by a processor, a set of canonical features for a document type and a plurality of attributes for a canonical feature;
receiving, by the processor, an electronic document of the document type;
identifying a set of text rectangles from the electronic document;
obtaining a comparison set of reference document codifications, one of the comparison set of reference document codifications comprising a plurality of canonical feature codifications, one of the plurality of canonical feature codifications comprising one or more attribute values for one or more of the plurality of attributes of one of the set of canonical features as the one canonical feature appears in the one reference document;
for each current canonical feature of the set of canonical features:
selecting a set of canonical feature codifications from the comparison set of reference document codifications;
determining a set of possible data types for the current canonical feature from the set of canonical feature codifications;
calculating a frequency of occurrence for each of the set of possible data types;
filtering out each of the set of canonical feature codifications for which the frequency of occurrence of the corresponding data type is below a threshold to obtain a filtered set of canonical feature codifications; and
identifying a match between one of the set of text rectangles and one of the filtered set of canonical feature codifications;
for each of the set of text rectangles, selecting one of the matching canonical feature codifications as a final canonical feature codification for the text rectangle.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11
12. One or more non-transitory storage media storing instructions which, when executed by one or more computing devices, cause performance of a method of processing an electronic document, the method comprising:
defining a set of canonical features for a document type and a plurality of attributes for a canonical feature;
receiving an electronic document of the document type;
identifying a set of text rectangles from the electronic document;
obtaining a comparison set of reference document codifications, one of the comparison set of reference document codifications comprising a plurality of canonical feature codifications, one of the plurality of canonical feature codifications comprising one or more attribute values for one or more of the plurality of attributes of one of the set of canonical features as the one canonical feature appears in the one reference document;
for each current canonical feature of the set of canonical features:
selecting a set of canonical feature codifications from the comparison set of reference document codifications;
determining a set of possible data types for the current canonical feature from the set of canonical feature codifications;
calculating a frequency of occurrence for each of the set of possible data types;
filtering out each of the set of canonical feature codifications for which the frequency of occurrence of the corresponding data type is below a threshold to obtain a filtered set of canonical feature codifications; and
identifying a match between one of the set of text rectangles and one of the filtered set of canonical feature codifications;
for each of the set of text rectangles, selecting one of the matching canonical feature codifications as a final canonical feature codification for the text rectangle.
Dependent claims: 13, 14, 15, 16, 17, 18
Specification