Intelligent extraction and organization of data from unstructured documents
First Claim
1. A system comprising a computer configured with instructions encoded on a non-transitory computer readable medium, the instructions operable to, when executed, cause the computer to perform acts comprising:
a) organizing a plurality of data elements from an input document into a set of groups by repeatedly, until satisfaction of a completion condition, performing a grouping process comprising:
i) performing a matching determination of whether a subject data element satisfies a matching criterion for an existing group;
ii) performing an overlap determination of whether the subject data element is associated with a horizontal extent that overlaps a horizontal neighbor of one of the existing group's data elements; and
iii) selectively adding the subject data element to the existing group based on the matching determination and the overlap determination, wherein when the subject data element is associated with a horizontal extent that overlaps a horizontal neighbor of one of the existing group's data elements the subject data element is excluded from the existing group;
b) assigning each group from the set of groups to a column based on horizontal positions of data elements from groups that have already been assigned to columns;
c) for each data element, assigning that data element to a row based on that data element's vertical position; and
d) for each data element, adding that data element to a structured document in the row assigned to that data element and in the column assigned to the group which comprises that data element;
wherein each data element has data comprising one or more non-whitespace characters.
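The grouping process in step a) can be illustrated with a minimal sketch. It is not the patented implementation: the `Element` type, the helper names, the row tolerance, and the choice of "extent overlaps some group member" as the matching criterion are all assumptions made for illustration. The completion condition is assumed to be "every element has been processed once".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Element:  # hypothetical representation of a data element
    text: str
    left: float   # left edge of the horizontal extent
    right: float  # right edge of the horizontal extent
    top: float    # vertical position of the element's line


def overlaps(a, b):
    """True when the horizontal extents of a and b intersect."""
    return a.left < b.right and b.left < a.right


def horizontal_neighbors(member, elements, row_tol=0.5):
    """Nearest elements to the left and right of `member` on the same line."""
    same_row = sorted(
        (e for e in elements
         if e is not member and abs(e.top - member.top) <= row_tol),
        key=lambda e: e.left,
    )
    left = [e for e in same_row if e.right <= member.left]
    right = [e for e in same_row if e.left >= member.right]
    return left[-1:] + right[:1]


def matches(subject, group):
    """Assumed matching criterion: subject overlaps some group member."""
    return any(overlaps(subject, m) for m in group)


def group_elements(elements):
    groups = []
    # Assumed completion condition: each element is visited exactly once.
    for subject in sorted(elements, key=lambda e: (e.top, e.left)):
        for group in groups:
            if not matches(subject, group):            # step i
                continue
            blocked = any(                             # step ii
                overlaps(subject, n)
                for m in group
                for n in horizontal_neighbors(m, elements)
                if n is not subject
            )
            if not blocked:                            # step iii
                group.append(subject)
                break
        else:
            groups.append([subject])  # no existing group accepted the subject
    return groups


sample = [
    Element("Name", 0, 4, 0), Element("Price", 10, 15, 0),
    Element("Fruit and vegetables", 0, 20, 1),  # heading spanning both columns
    Element("Apple", 0, 5, 2), Element("1.20", 10, 14, 2),
]
groups = group_elements(sample)
```

In this sample the wide heading matches both existing groups but overlaps a horizontal neighbor of each, so it is excluded from both and ends up in a group of its own.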
Abstract
Data elements from an input document can be automatically organized into rows and columns in a structured output document using a grouping process that applies matching criteria based on horizontal position, data content and horizontal extent, and that tests for horizontal overlaps between data elements and neighbors of data elements in existing groups. Columns are then assigned to those groups based on horizontal positions of data elements from groups that have already been assigned to columns. Rows may be assigned to data elements based on those data elements' vertical positions.
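The column and row assignment described here might look roughly like the following sketch. The `Cell` type, the ordering of groups by left edge, and the rule "one column to the right of every already-placed group that ends before this group starts" are assumptions chosen for illustration, not details taken from the patent. Groups are assumed to be non-empty.

```python
from typing import NamedTuple


class Cell(NamedTuple):  # hypothetical data element
    text: str
    left: float
    right: float
    top: float


def assign_columns(groups):
    """Return one column index per group, in the order of `groups`."""
    def bounds(i):
        return (min(c.left for c in groups[i]), max(c.right for c in groups[i]))

    column_of = {}
    for i in sorted(range(len(groups)), key=bounds):
        left, _ = bounds(i)
        # Columns of already-placed groups lying entirely to the left.
        to_the_left = [column_of[j] for j in column_of if bounds(j)[1] <= left]
        column_of[i] = max(to_the_left) + 1 if to_the_left else 0
    return [column_of[i] for i in range(len(groups))]


def assign_rows(cells, tol=0.5):
    """Map each cell to a row index based on its vertical position."""
    row_of, last_top, row = {}, None, -1
    for cell in sorted(cells, key=lambda c: c.top):
        if last_top is None or cell.top - last_top > tol:
            row += 1
            last_top = cell.top
        row_of[cell] = row
    return row_of


def build_table(groups):
    """Place every cell at (its own row, its group's column)."""
    columns = assign_columns(groups)
    row_of = assign_rows([c for g in groups for c in g])
    grid = [["" for _ in range(max(columns) + 1)]
            for _ in range(max(row_of.values()) + 1)]
    for group, col in zip(groups, columns):
        for cell in group:
            r = row_of[cell]
            grid[r][col] = (grid[r][col] + " " + cell.text).strip()
    return grid


table = build_table([
    [Cell("Name", 0, 4, 0), Cell("Apple", 0, 5, 2)],
    [Cell("Price", 10, 15, 0), Cell("1.20", 10, 14, 2)],
    [Cell("Fruit and vegetables", 0, 20, 1)],  # heading in its own group
])
```

With this input the two data columns continue above and below the spanning heading, which occupies its own row.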
33 Citations
18 Claims
1. A system comprising a computer configured with instructions encoded on a non-transitory computer readable medium, the instructions operable to, when executed, cause the computer to perform acts comprising:
a) organizing a plurality of data elements from an input document into a set of groups by repeatedly, until satisfaction of a completion condition, performing a grouping process comprising:
i) performing a matching determination of whether a subject data element satisfies a matching criterion for an existing group;
ii) performing an overlap determination of whether the subject data element is associated with a horizontal extent that overlaps a horizontal neighbor of one of the existing group's data elements; and
iii) selectively adding the subject data element to the existing group based on the matching determination and the overlap determination, wherein when the subject data element is associated with a horizontal extent that overlaps a horizontal neighbor of one of the existing group's data elements the subject data element is excluded from the existing group;
b) assigning each group from the set of groups to a column based on horizontal positions of data elements from groups that have already been assigned to columns;
c) for each data element, assigning that data element to a row based on that data element's vertical position; and
d) for each data element, adding that data element to a structured document in the row assigned to that data element and in the column assigned to the group which comprises that data element;
wherein each data element has data comprising one or more non-whitespace characters.
Dependent claims: 2, 3, 4, 5, 6, 7, 16, 17
8. A method for automatically organizing data from one or more input documents into rows and columns in a structured output document, the method comprising:
a) obtaining a set of matching criteria that define matches based on horizontal position, data content and horizontal extent associated with data elements;
b) obtaining a plurality of data elements from an input document corresponding to information for rows and columns in the structured output document;
c) generating a set of data element groups by repeatedly, until satisfaction of a completion condition, applying the set of matching criteria to the data elements and testing for horizontal overlaps between data elements and neighbors of data elements in existing groups; and
d) applying the data elements and set of data element groups, respectively, to rows and columns of the output document to generate a structured output document with columns extending vertically across any internal headings in the one or more input documents that extend across multiple columns;
wherein each data element has data comprising one or more non-whitespace characters.
Dependent claims: 9, 10, 11, 12, 13, 14, 18
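As one possible reading of step a) of claim 8, a set of matching criteria could be modeled as a list of predicates over pairs of data elements. Everything in the sketch below is hypothetical: the `Item` type, the tolerances, the numeric pattern, and the rule that a match occurs when any one criterion holds are illustrative assumptions, not details from the claims.

```python
import re
from typing import NamedTuple


class Item(NamedTuple):  # hypothetical data element
    text: str
    left: float
    right: float


# Assumed notion of "numeric-looking" content: digits and common punctuation.
NUMERIC = re.compile(r"^[\d.,$%()\-]+$")


def same_left_edge(a, b, tol=2.0):
    """Horizontal position: left edges line up within a tolerance."""
    return abs(a.left - b.left) <= tol


def numeric_and_right_aligned(a, b, tol=2.0):
    """Data content plus position: both look numeric and share a right edge."""
    both_numeric = bool(NUMERIC.match(a.text)) and bool(NUMERIC.match(b.text))
    return both_numeric and abs(a.right - b.right) <= tol


def extents_overlap(a, b):
    """Horizontal extent: the two extents intersect."""
    return a.left < b.right and b.left < a.right


CRITERIA = [same_left_edge, numeric_and_right_aligned, extents_overlap]


def is_match(a, b, criteria=CRITERIA):
    """Assumed combination rule: any single criterion is enough."""
    return any(criterion(a, b) for criterion in criteria)


amounts_match = is_match(Item("1,200.00", 52, 60), Item("35.10", 55, 60))
label_matches_amount = is_match(Item("Total", 0, 5), Item("9.99", 50, 54))
```

Keeping the criteria in a plain list makes it easy to swap in stricter or looser rules for a particular document layout without touching the grouping loop.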
15. A system for automatically organizing data from one or more input documents into rows and columns in a structured output document, the system comprising:
a) means for grouping data elements from an input document into a set of groups based on satisfaction of matching criteria and testing for horizontal overlaps between data elements and neighbors of other data elements; and
b) means for assigning columns to data elements from the input document based on the set of groups;
wherein
A) the means for grouping data elements from the input document into the set of groups based on satisfaction of matching criteria and testing for horizontal overlaps between data elements and neighbors of other data elements and the means for assigning columns to data elements from the input document based on the set of groups are implemented with a single computer;
B) the computer is operable to generate a structured output document having columns extending across internal subheadings regardless of whether the internal subheadings prevent data elements in the columns from being separated from data elements in other columns by continuous vertical whitespace.
Specification