Method and apparatus for extracting relevant data
First Claim
1. A method of extraction, comprising:
accessing at least a first set of data of a first document, the first document including markup language, wherein the first set of data includes a first selected subset and a second selected subset, such that the second selected subset of data is a subset of the first selected subset of data, the first selected subset at least partly specifying document data, the second selected subset at least partly specifying document data;
accessing at least a second set of data of a second document, the second document including markup language;
determining a first edit sequence between at least part of the first set of data and at least part of the second set of data, the first edit sequence including any of insertions, deletions, substitutions, matches, and repetitions, including;
considering at least repetitions for inclusion in the first edit sequence between at least part of the first set of data and at least part of the second set of data;
finding a first corresponding subset of the second set of data, the first corresponding subset having a correspondence to the first selected subset, the correspondence at least partly found by determining the first edit sequence;
determining a second edit sequence between at least part of the first set of data and at least part of the second set of data, the first set of data including at least part of the first selected subset, the second set of data including at least part of the first corresponding subset, the second edit sequence including any of insertions, deletions, substitutions, matches, and repetitions, including;
considering at least repetitions for inclusion in the second edit sequence between at least part of the first set of data and at least part of the second set of data, the first set of data including at least part of the first selected subset; and
finding a second corresponding subset of the second set of data, the second corresponding subset having a correspondence to the second selected subset, the correspondence at least partly found by determining the second edit sequence;
wherein subsequent sets of data of documents are received, the documents including markup language, document data of the subsequent sets of data are determined by finding corresponding data of the subsequent sets of data, the corresponding data of the subsequent sets correspond to the selected data of earlier sets of data, the corresponding data of the subsequent sets are identified as selected data of the subsequent sets of data, the selected data of the subsequent sets of data at least partly specifying document data, and at least one of selected data of the earlier sets and the selected data of the subsequent data at least partly determine corresponding data of later sets of data, the earlier sets of data are received earlier than the subsequent sets of data, and the later sets of data are received later than the subsequent sets of data.
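The coarse-to-fine structure of claim 1 (locate the counterpart of the larger selection first, then align only inside that region to locate the nested selection) can be illustrated with a short sketch. This is not the patented method: it assumes documents are already flattened into token lists, and it uses Python's `difflib.SequenceMatcher` as a stand-in aligner, which produces insert, delete, replace, and equal operations but has no repetition operation. The names `map_span` and `find_nested` are hypothetical.

```python
from difflib import SequenceMatcher


def map_span(src, dst, start, end):
    """Map the half-open span src[start:end] onto dst.

    Stand-in alignment: only tokens inside "equal" blocks are treated as
    corresponding. Returns a half-open (lo, hi) span in dst, or None.
    """
    matcher = SequenceMatcher(None, src, dst, autojunk=False)
    hits = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            continue
        for k in range(i2 - i1):
            if start <= i1 + k < end:
                hits.append(j1 + k)
    if not hits:
        return None
    return min(hits), max(hits) + 1


def find_nested(doc1, doc2, outer, inner):
    """Two passes: map the outer span, then map the inner span within it."""
    outer2 = map_span(doc1, doc2, *outer)
    if outer2 is None:
        return None, None
    # Second alignment is restricted to the outer selection and its counterpart.
    sub1 = doc1[outer[0]:outer[1]]
    sub2 = doc2[outer2[0]:outer2[1]]
    rel = map_span(sub1, sub2, inner[0] - outer[0], inner[1] - outer[0])
    if rel is None:
        return outer2, None
    return outer2, (outer2[0] + rel[0], outer2[0] + rel[1])


# Hypothetical token streams for two versions of a page.
doc1 = ["html", "div", "h1", "Title", "/h1", "p", "Old text", "/p", "/div", "/html"]
doc2 = ["html", "nav", "/nav", "div", "h1", "Title", "/h1",
        "p", "New text", "/p", "/div", "/html"]

outer2, inner2 = find_nested(doc1, doc2, outer=(1, 9), inner=(5, 8))
print(doc2[inner2[0]:inner2[1]])
```

Restricting the second pass to the region found by the first keeps unrelated parts of the second document from competing for the inner match.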
9 Assignments
0 Petitions
Abstract
The present invention relates to a method and apparatus for extracting relevant data. A first and a second set of data are accessed. The first set includes selected data. An edit sequence is determined between the first and the second sets, including considering at least repetitions for inclusion in the edit sequence. Corresponding data of the second set, having a correspondence to the selected data, are found at least partly by determining the edit sequence. A first and a second tree of data are accessed. The first tree includes selected data. An edit sequence is determined between the first and the second trees, including considering at least repetitions for inclusion in the edit sequence. Corresponding data of the second tree, having a correspondence to the selected data, are found at least partly by determining the edit sequence.
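As a rough illustration of an edit sequence that considers repetitions, the sketch below extends the classic edit-distance dynamic program with a "repeat" operation: a token of the second sequence may be explained as one more copy of a token of the first sequence that has already been consumed. The unit costs, the `repeat_cost` default, and the function names `edit_sequence` and `corresponding` are assumptions made for this example; the source does not specify costs or an algorithm, and this operates on flat token lists rather than trees.

```python
def edit_sequence(a, b, repeat_cost=0.5):
    """Return (total_cost, ops) aligning token list a with token list b.

    Each op is (name, index_in_a, index_in_b); an index is None when the
    operation does not touch that sequence.
    """
    n, m = len(a), len(b)
    cost = [[float("inf")] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            cands = []  # (cost, op name, previous i, previous j)
            if i > 0:
                cands.append((cost[i - 1][j] + 1, "delete", i - 1, j))
            if j > 0:
                cands.append((cost[i][j - 1] + 1, "insert", i, j - 1))
            if i > 0 and j > 0:
                same = a[i - 1] == b[j - 1]
                name = "match" if same else "substitute"
                step = 0 if same else 1
                cands.append((cost[i - 1][j - 1] + step, name, i - 1, j - 1))
                if same:
                    # b[j-1] is one more copy of a[i-1]; a[i-1] stays consumed.
                    cands.append((cost[i][j - 1] + repeat_cost, "repeat", i, j - 1))
            best = min(cands, key=lambda c: c[0])
            cost[i][j], back[i][j] = best[0], best[1:]
    # Walk the back-pointers from the end to recover the edit sequence.
    ops, i, j = [], n, m
    while (i, j) != (0, 0):
        name, pi, pj = back[i][j]
        ops.append((name,
                    i - 1 if name != "insert" else None,
                    j - 1 if name != "delete" else None))
        i, j = pi, pj
    ops.reverse()
    return cost[n][m], ops


def corresponding(ops, selected):
    """Indices in b aligned (by match, substitute, or repeat) to selected indices of a."""
    return [bj for name, ai, bj in ops
            if ai in selected and name in ("match", "substitute", "repeat")]


# Hypothetical example: one table row in the first document, three in the second.
first = ["table", "tr", "/table"]
second = ["table", "tr", "tr", "tr", "/table"]
total, ops = edit_sequence(first, second)
print(total, corresponding(ops, {1}))
```

With a repeat cheaper than an insertion, the extra rows in the second sequence are attributed to the selected row instead of being treated as unrelated insertions, so the selection maps to all three rows.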
180 Citations
115 Claims
1. A method of extraction, comprising:
accessing at least a first set of data of a first document, the first document including markup language, wherein the first set of data includes a first selected subset and a second selected subset, such that the second selected subset of data is a subset of the first selected subset of data, the first selected subset at least partly specifying document data, the second selected subset at least partly specifying document data;
accessing at least a second set of data of a second document, the second document including markup language;
determining a first edit sequence between at least part of the first set of data and at least part of the second set of data, the first edit sequence including any of insertions, deletions, substitutions, matches, and repetitions, including;
considering at least repetitions for inclusion in the first edit sequence between at least part of the first set of data and at least part of the second set of data;
finding a first corresponding subset of the second set of data, the first corresponding subset having a correspondence to the first selected subset, the correspondence at least partly found by determining the first edit sequence;
determining a second edit sequence between at least part of the first set of data and at least part of the second set of data, the first set of data including at least part of the first selected subset, the second set of data including at least part of the first corresponding subset, the second edit sequence including any of insertions, deletions, substitutions, matches, and repetitions, including;
considering at least repetitions for inclusion in the second edit sequence between at least part of the first set of data and at least part of the second set of data, the first set of data including at least part of the first selected subset; and
finding a second corresponding subset of the second set of data, the second corresponding subset having a correspondence to the second selected subset, the correspondence at least partly found by determining the second edit sequence;
wherein subsequent sets of data of documents are received, the documents including markup language, document data of the subsequent sets of data are determined by finding corresponding data of the subsequent sets of data, the corresponding data of the subsequent sets correspond to the selected data of earlier sets of data, the corresponding data of the subsequent sets are identified as selected data of the subsequent sets of data, the selected data of the subsequent sets of data at least partly specifying document data, and at least one of selected data of the earlier sets and the selected data of the subsequent data at least partly determine corresponding data of later sets of data, the earlier sets of data are received earlier than the subsequent sets of data, and the later sets of data are received later than the subsequent sets of data.
Dependent claims: 2-34
35. A method of extracting relevant data, comprising:
accessing at least a first set of data of a first document, the first document including markup language, wherein the first set of data includes selected data of the first document, the selected data at least partly specifying document data;
accessing at least a second set of data of a second document, the second document including markup language;
determining a first edit sequence between at least part of the first set of data and at least part of the second set of data, the first edit sequence including any of insertions, deletions, substitutions, matches and repetitions, including;
considering at least repetitions for inclusion in the first edit sequence between at least part of the first set of data and at least part of the second set of data;
finding corresponding data of the second set of data, the corresponding data having a correspondence to the selected data, the correspondence at least partly found by determining the first edit sequence;
if two or more corresponding data are found, then;
selecting larger selected data, at least part of the larger selected data including a larger subtree in a tree representation of the first set of data, the larger subtree including the selected data;
determining a second edit sequence between at least part of the first set of data and at least part of the second set of data, the first set of data including at least part of the larger selected data, the second edit sequence including any of insertions, deletions, substitutions, matches and repetitions, including;
considering at least repetitions for inclusion in the second edit sequence between at least part of the first set of data and at least part of the second set of data;
finding corresponding data of the second set of data, the corresponding data having a correspondence to the larger selected data, the correspondence at least partly found by determining the second edit sequence; and
finding corresponding data of the second set of data, the corresponding data having a correspondence to the selected data, the correspondence at least partly found by determining the second edit sequence;
wherein subsequent sets of data of documents are received, the documents including markup language, document data of the subsequent sets of data are determined by finding corresponding data of the subsequent sets of data, the corresponding data of the subsequent sets correspond to the selected data of earlier sets of data, the corresponding data of the subsequent sets are identified as selected data of the subsequent sets of data, the selected data of the subsequent sets of data at least partly specifying document data, and at least one of selected data of the earlier sets and the selected data of the subsequent data at least partly determine corresponding data of later sets of data, the earlier sets of data are received earlier than the subsequent sets of data, and the later sets of data are received later than the subsequent sets of data.
Dependent claims: 36-46
47. A method of extraction, comprising:
accessing at least a first set of data of a first document, the first document including markup language, wherein the first set of data includes selected data, the selected data at least partly specifying document data;
accessing at least a second set of data of a second document, the second document including markup language;
determining document data of the second set of data, by finding corresponding data of the second set of data, the corresponding data having a correspondence to the selected data of the first set of data, the correspondence at least partly determined by a first edit sequence between at least part of the first set of data and at least part of the second set of data, the first edit sequence including any of insertions, deletions, substitutions, matches, and repetitions, including;
considering at least repetitions for inclusion in the first edit sequence between at least part of the first set of data and at least part of the second set of data;
identifying the corresponding data of the second set of data as selected data of the second set of data, the selected data at least partly specifying document data;
accessing at least a third set of data of a third document, the third document including markup language; and
determining document data of the third set of data, by finding corresponding data of the third set of data, the corresponding data having a correspondence to at least one of the selected data of the first set of data and the selected data of the second set of data, the correspondence at least partly determined by a second edit sequence between at least part of the third set of data and at least one of at least part of the first set of data and at least part of the second set of data, the second edit sequence including any of insertions, deletions, substitutions, matches, and repetitions, including;
considering at least repetitions for inclusion in the second edit sequence between at least part of the third set of data and at least one of at least part of the first set of data and at least part of the second set of data;
wherein subsequent sets of data of documents are received, the documents including markup language, document data of the subsequent sets of data are determined by finding corresponding data of the subsequent sets of data, the corresponding data of the subsequent sets correspond to the selected data of earlier sets of data, the corresponding data of the subsequent sets are identified as selected data of the subsequent sets of data, the selected data of the subsequent sets of data at least partly specifying document data, and at least one of selected data of the earlier sets and the selected data of the subsequent data at least partly determine corresponding data of later sets of data, the earlier sets of data are received earlier than the subsequent sets of data, and the later sets of data are received later than the subsequent sets of data.
Dependent claims: 48-81
82. A method of extraction, comprising:
accessing at least a first set of data of a first document, the first document including markup language, wherein the first set of data includes selected data, the selected data at least partly specifying document data;
accessing at least a second set of data of a second document, the second document including markup language;
determining document data of the second set of data, by finding corresponding data of the second set of data, the corresponding data having a correspondence to the selected data of the first set of data, the correspondence at least partly determined by a first tree-based edit sequence between at least part of the first set of data and at least part of the second set of data, the first tree-based edit sequence including any of insertions, deletions, substitutions, matches, and repetitions, including;
considering at least repetitions for inclusion in the first tree-based edit sequence between at least part of the first set of data and at least part of the second set of data;
identifying the corresponding data of the second set of data as selected data of the second set of data, the selected data at least partly specifying document data;
accessing at least a third set of data of a third document, the third document including markup language; and
determining document data of the third set of data, by finding corresponding data of the third set of data, the corresponding data having a correspondence to at least one of the selected data of the first set of data and the selected data of the second set of data, the correspondence at least partly determined by a second tree-based edit sequence between at least part of the third set of data and at least one of at least part of the first set of data and at least part of the second set of data, the second tree-based edit sequence including any of insertions, deletions, substitutions, matches, and repetitions, including;
considering at least repetitions for inclusion in the second tree-based edit sequence between at least part of the third set of data and at least one of at least part of the first set of data and at least part of the second set of data;
wherein subsequent sets of data of documents are received, the documents including markup language, document data of the subsequent sets of data are determined by finding corresponding data of the subsequent sets of data, the corresponding data of the subsequent sets correspond to the selected data of earlier sets of data, the corresponding data of the subsequent sets are identified as selected data of the subsequent sets of data, the selected data of the subsequent sets of data at least partly specifying document data, and at least one of selected data of the earlier sets and the selected data of the subsequent data at least partly determine corresponding data of later sets of data, the earlier sets of data are received earlier than the subsequent sets of data, and the later sets of data are received later than the subsequent sets of data.
Dependent claims: 83-115
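The closing "wherein" clause shared by the independent claims describes a chain: data found in a newly received document becomes the selected data used when the next document arrives. The sketch below illustrates only that chaining, under loudly stated assumptions: documents are flat token lists, the aligner is Python's `difflib.SequenceMatcher` (a stand-in with no repetition operation and no tree awareness), and `map_span` and `track` are hypothetical names.

```python
from difflib import SequenceMatcher


def map_span(src, dst, start, end):
    """Map half-open span src[start:end] onto dst via "equal" blocks, or None."""
    matcher = SequenceMatcher(None, src, dst, autojunk=False)
    hits = [j1 + k
            for tag, i1, i2, j1, j2 in matcher.get_opcodes() if tag == "equal"
            for k in range(i2 - i1) if start <= i1 + k < end]
    return (min(hits), max(hits) + 1) if hits else None


def track(versions, selection):
    """Carry a selection from the first document through later documents.

    Each successful mapping becomes the reference for the next document.
    If a mapping fails, the last successful (earlier) reference is kept.
    """
    results = [selection]
    ref_doc, ref_sel = versions[0], selection
    for nxt in versions[1:]:
        mapped = map_span(ref_doc, nxt, *ref_sel)
        results.append(mapped)
        if mapped is not None:
            ref_doc, ref_sel = nxt, mapped
    return results


# Hypothetical snapshots of one page received at three different times.
v1 = ["html", "div", "Price", "10", "/div", "/html"]
v2 = ["html", "ad", "/ad", "div", "Price", "12", "/div", "/html"]
v3 = ["html", "ad", "/ad", "ad", "/ad", "div", "Price", "15", "/div",
      "footer", "/footer", "/html"]

spans = track([v1, v2, v3], selection=(1, 5))
for doc, span in zip([v1, v2, v3], spans):
    print(doc[span[0]:span[1]])
```

Re-anchoring on the most recent successful match lets the selection follow gradual layout drift, while falling back to an earlier reference covers a snapshot in which nothing could be matched.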
Specification