Intelligent extraction and organization of data from unstructured documents
Abstract
Data elements from an input document can be automatically organized into rows and columns in a structured output document using a grouping process that automatically applies matching criteria based on horizontal position, data content and horizontal extent, tests for horizontal overlaps between data elements and neighbors of data elements in existing groups, and assigns columns to those groups based on horizontal positions of data elements from groups that have already been assigned to columns. Rows may be assigned to data elements based on those data elements' vertical positions.
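The row assignment by vertical position mentioned in the abstract can be illustrated with a minimal sketch. The `Element` type, the `assign_rows` function and the `tolerance` parameter are hypothetical names introduced only for this illustration; the source does not specify how vertical positions are compared, so a simple fixed tolerance is assumed here.

```python
from dataclasses import dataclass


@dataclass
class Element:
    """Hypothetical data element: text plus a bounding position."""
    text: str
    x0: float  # left edge
    x1: float  # right edge
    y: float   # vertical position, increasing down the page


def assign_rows(elements, tolerance=2.0):
    """Cluster elements into rows by vertical position (assumed rule).

    Elements are visited top to bottom; an element joins the current row
    when its vertical position is within `tolerance` of the previously
    placed element, otherwise it starts a new row.
    """
    rows = []
    for el in sorted(elements, key=lambda e: e.y):
        if rows and el.y - rows[-1][-1].y <= tolerance:
            rows[-1].append(el)
        else:
            rows.append([el])
    # Order each row left to right for readability.
    return [sorted(row, key=lambda e: e.x0) for row in rows]


elements = [
    Element("Qty", 50, 70, 10.0),
    Element("Item", 0, 30, 10.5),
    Element("Apple", 0, 35, 20.0),
    Element("3", 62, 70, 20.4),
]
rows = assign_rows(elements)
row_texts = [[e.text for e in row] for row in rows]
```

In this sketch, small vertical jitter (10.0 versus 10.5) is absorbed by the tolerance, so "Item" and "Qty" land on the same row.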
18 Claims
1. A system comprising a computer configured with instructions encoded on a non-transitory computer readable medium, the instructions operable to, when executed, cause the computer to perform a set of acts comprising:
a) obtaining a plurality of data elements from an input document corresponding to information for rows and columns in a structured output document;
b) generating a set of data element groups based on applying a set of matching criteria to the data elements after the data elements have been assigned to a set of existing groups, each of which comprises one or more data elements, and testing for horizontal overlaps between neighbors of data elements in a first existing group and data elements in a second existing group and combining the first and second existing groups based on determining the existence of a match and absence of neighbor overlap between the first and second existing groups; and
c) assigning the data elements to columns based on the set of data element groups to generate the structured output document with columns having data elements separated from data elements in neighboring columns by whitespace and extending vertically across any internal headings in the one or more input documents regardless of whether the internal headings cause vertical whitespace between neighboring columns to be discontinuous.
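One possible reading of the group-combining step recited above (a match plus the absence of neighbor overlap) is sketched below. This is an illustrative interpretation, not the claimed implementation: `Element`, `neighbors`, `groups_match`, `neighbor_overlap` and `merge_groups` are hypothetical names, "neighbors" is assumed to mean the nearest elements to the left and right on the same row, and the match rule (edge alignment within a tolerance) is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Element:
    """Hypothetical data element with horizontal extent and a row index."""
    text: str
    x0: float
    x1: float
    row: int


def overlaps(a, b):
    """True when the horizontal extents of a and b intersect."""
    return a.x0 < b.x1 and b.x0 < a.x1


def neighbors(el, all_elements):
    """Assumed definition: nearest elements left and right on the same row."""
    same_row = sorted((e for e in all_elements if e.row == el.row),
                      key=lambda e: e.x0)
    i = same_row.index(el)
    return same_row[max(i - 1, 0):i] + same_row[i + 1:i + 2]


def groups_match(g1, g2, tol=3.0):
    """Assumed criterion: some pair has aligned left or right edges."""
    return any(abs(a.x0 - b.x0) <= tol or abs(a.x1 - b.x1) <= tol
               for a in g1 for b in g2)


def neighbor_overlap(g1, g2, all_elements):
    """True when a neighbor of an element in g1 overlaps an element in g2."""
    return any(overlaps(n, b)
               for a in g1
               for n in neighbors(a, all_elements) if n not in g1
               for b in g2)


def merge_groups(groups, all_elements, tol=3.0):
    """Repeatedly combine groups that match and show no neighbor overlap."""
    groups = [list(g) for g in groups]
    merged = True
    while merged:
        merged = False
        for i in range(len(groups)):
            for j in range(i + 1, len(groups)):
                g1, g2 = groups[i], groups[j]
                blocked = (neighbor_overlap(g1, g2, all_elements)
                           or neighbor_overlap(g2, g1, all_elements))
                if groups_match(g1, g2, tol) and not blocked:
                    groups[i] = g1 + g2
                    del groups[j]
                    merged = True
                    break
            if merged:
                break
    return groups


elements = [
    Element("Item", 0, 30, 0), Element("Qty", 50, 70, 0),
    Element("Apple", 0, 35, 1), Element("3", 62, 70, 1),
    Element("Banana", 0, 40, 2), Element("12", 56, 70, 2),
]
# Start with every element in its own existing group.
result = merge_groups([[e] for e in elements], elements)
group_texts = [[e.text for e in g] for g in result]
```

In this sketch the overlap test is applied in both directions, which is a design choice of the example; left-aligned text and right-aligned numbers end up in two separate groups that could then be mapped to columns.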
7. A method for automatically organizing data from one or more input documents into rows and columns in a structured output document, the method comprising:
a) obtaining a set of matching criteria that define matches between data elements;
b) obtaining a plurality of data elements from an input document corresponding to information for rows and columns in the structured output document;
c) generating a set of data element groups based on applying the set of matching criteria to the data elements after the data elements have been assigned to a set of existing groups, each of which comprises one or more data elements, and testing for horizontal overlaps between neighbors of data elements in a first existing group and data elements in a second existing group and combining the first and second existing groups based on determining the existence of a match and absence of neighbor overlap between the first and second existing groups; and
d) assigning the data elements to columns of the output document based on the set of data element groups to generate a structured output document with columns having data elements separated from data elements in neighboring columns by whitespace and extending vertically across any internal headings in the one or more input documents regardless of whether the internal subheadings cause vertical whitespace between neighboring columns to be discontinuous.
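Step a) of the method treats the matching criteria as something obtained separately from the data elements. A minimal sketch of that idea, assuming the three kinds of criteria named in the abstract (horizontal position, data content, horizontal extent), is to represent each criterion as a function over a pair of elements. All names here (`Element`, `aligned`, `same_kind`, `similar_extent`, `is_match`) and all thresholds are hypothetical, and requiring every criterion to hold is an assumption; the source does not say how criteria are combined.

```python
import re
from typing import Callable, NamedTuple


class Element(NamedTuple):
    """Hypothetical data element with text and horizontal extent."""
    text: str
    x0: float
    x1: float


Criterion = Callable[[Element, Element], bool]


def aligned(tol: float = 3.0) -> Criterion:
    """Horizontal position: left or right edges line up within `tol`."""
    return lambda a, b: abs(a.x0 - b.x0) <= tol or abs(a.x1 - b.x1) <= tol


def content_kind(text: str) -> str:
    """Crude content classifier used only for this illustration."""
    if re.fullmatch(r"[-+]?[\d,]*\.?\d+%?", text.strip()):
        return "number"
    return "text"


def same_kind() -> Criterion:
    """Data content: both elements are numbers, or both are text."""
    return lambda a, b: content_kind(a.text) == content_kind(b.text)


def similar_extent(max_ratio: float = 3.0) -> Criterion:
    """Horizontal extent: widths differ by at most `max_ratio`."""
    def check(a: Element, b: Element) -> bool:
        wa, wb = a.x1 - a.x0, b.x1 - b.x0
        return max(wa, wb) <= max_ratio * max(min(wa, wb), 1e-9)
    return check


def is_match(a: Element, b: Element, criteria) -> bool:
    """Assumed combination rule: a match requires every criterion to hold."""
    return all(criterion(a, b) for criterion in criteria)


criteria = [aligned(), same_kind(), similar_extent()]
three = Element("3", 62, 70)
twelve = Element("12", 56, 70)
apple = Element("Apple", 0, 35)
total = Element("Total", 60, 70)
```

Keeping the criteria as a plain list makes it straightforward to swap thresholds or add a criterion without touching the grouping logic that consumes them.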
13. A non-transitory computer readable medium encoded with instructions operable to, when executed, cause a computer to perform a set of acts comprising:
a) obtaining a plurality of data elements from an input document corresponding to information for rows and columns in a structured output document;
b) generating a set of data element groups based on applying a set of matching criteria to the data elements after the data elements have been assigned to a set of existing groups, each of which comprises one or more data elements, and testing for horizontal overlaps between neighbors of data elements in a first existing group and data elements in a second existing group and combining the first and second existing groups based on determining the existence of a match and absence of neighbor overlap between the first and second existing groups; and
c) assigning the data elements to columns based on the set of data element groups to generate the structured output document with columns having data elements separated from data elements in neighboring columns by whitespace and extending vertically across any internal headings in the one or more input documents regardless of whether the internal headings cause vertical whitespace between neighboring columns to be discontinuous.
Specification