Methods and systems to train models to extract and integrate information from data sources
First Claim
1. A non-transitory computer readable storage medium storing at least one program configured for execution by at least one processor of a computer system, the at least one program comprising instructions to:
obtain a domain model comprising a set of entity types having corresponding properties and relationships between entities in a set of entities, wherein the domain model is characterized by a domain grammar;
receive a first tag layout of a first source document obtained from a first information source associated with the domain model, the first tag layout comprising:
(i) a plurality of user-provided navigational tags, wherein a user-provided navigational tag in the plurality of a user-provided navigational tags indicates a navigational position of the first source document relative to a second source document, from the first information source, navigationally connected with the first source document, and
(ii) a plurality of corresponding user-identified tokens in the first source document, wherein a user-identified token in the plurality of corresponding user-identified tokens includes a portion of content of the first source document;
select a page grammar in plurality of page grammars for the first source document in accordance with the plurality of user provided navigational tags;
extract information from a third source document having a predefined degree of tag layout similarity to the first source document using the page grammar, wherein the second source document is obtained from a second information source; and
transform the information extracted from the second source document in accordance with the domain grammar, thereby extracting and integrating information from a plurality of information sources.
Abstract
Methods and systems to model and acquire data from a variety of data and information sources, to integrate the data into a structured database, and to manage the continuing reintegration of updated data from those sources over time. For any given domain, a variety of individual information and data sources that contain information relevant to the schema can be identified. Data elements associated with a schema may be identified in a training source, such as by user tagging. A formal grammar may be induced appropriate to the schema and layout of the training source. A Hidden Markov Model (HMM) corresponding to the grammar may learn where in the sources the elements can be found. The system can automatically mutate its schema into a grammar matching the structure of the source documents. By following an inverse transformation sequence, data that is parsed by the mutated grammar can be fit back into the original grammar structure, matching the original data schema defined through domain modeling. Features disclosed herein may be implemented with respect to web-scraping and data acquisition, and to represent data in support of data-editing and data-merging tasks. A schema may be defined with respect to a graph-based domain model.
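The role the abstract gives to the Hidden Markov Model, learning where in a source the schema elements can be found, can be illustrated with a small decoding sketch. Everything in it is an assumption made for illustration and is not taken from the patent: the state names (OTHER, NAME, PRICE), the three coarse observation classes, and the hand-set probabilities, which stand in for values that would be estimated from user-tagged training pages. It uses standard Viterbi decoding to label each token of a page fragment with its most likely schema field.

```python
import math

# Hypothetical hidden states: two schema fields plus a catch-all for untagged text.
STATES = ["OTHER", "NAME", "PRICE"]

# Hand-set probabilities standing in for values learned from tagged training pages.
START = {"OTHER": 0.8, "NAME": 0.1, "PRICE": 0.1}
TRANS = {
    "OTHER": {"OTHER": 0.6, "NAME": 0.3, "PRICE": 0.1},
    "NAME": {"OTHER": 0.3, "NAME": 0.5, "PRICE": 0.2},
    "PRICE": {"OTHER": 0.7, "NAME": 0.1, "PRICE": 0.2},
}
EMIT = {
    "OTHER": {"markup": 0.8, "word": 0.15, "number": 0.05},
    "NAME": {"markup": 0.05, "word": 0.9, "number": 0.05},
    "PRICE": {"markup": 0.05, "word": 0.05, "number": 0.9},
}


def observe(token):
    """Map a raw token to a coarse observation class (an illustrative choice)."""
    if token.startswith("<"):
        return "markup"
    if any(ch.isdigit() for ch in token):
        return "number"
    return "word"


def viterbi(tokens):
    """Return the most likely state sequence for a non-empty token list."""
    obs = [observe(t) for t in tokens]
    # scores[s] is the best log-probability of any path that ends in state s.
    scores = {s: math.log(START[s]) + math.log(EMIT[s][obs[0]]) for s in STATES}
    back = []  # one dict of back-pointers per step after the first
    for o in obs[1:]:
        new_scores, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: scores[p] + math.log(TRANS[p][s]))
            new_scores[s] = (
                scores[prev] + math.log(TRANS[prev][s]) + math.log(EMIT[s][o])
            )
            pointers[s] = prev
        scores = new_scores
        back.append(pointers)
    # Walk the back-pointers from the best final state to recover the path.
    state = max(STATES, key=lambda s: scores[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


def extract(tokens):
    """Group tokens by decoded field, dropping tokens labelled OTHER."""
    fields = {}
    for token, label in zip(tokens, viterbi(tokens)):
        if label != "OTHER":
            fields.setdefault(label, []).append(token)
    return {name: " ".join(parts) for name, parts in fields.items()}


page = ["<td>", "Acme", "Widget", "</td>", "<td>", "19.99", "</td>"]
print(viterbi(page))
print(extract(page))
```

In a system of the kind the abstract describes, the transition structure would presumably follow the grammar induced from the training source rather than a fixed table; the sketch only shows how decoding assigns tokens to fields once such a model exists.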
24 Claims
1. A non-transitory computer readable storage medium storing at least one program configured for execution by at least one processor of a computer system, the at least one program comprising instructions to:
obtain a domain model comprising a set of entity types having corresponding properties and relationships between entities in a set of entities, wherein the domain model is characterized by a domain grammar;
receive a first tag layout of a first source document obtained from a first information source associated with the domain model, the first tag layout comprising:
(i) a plurality of user-provided navigational tags, wherein a user-provided navigational tag in the plurality of a user-provided navigational tags indicates a navigational position of the first source document relative to a second source document, from the first information source, navigationally connected with the first source document, and
(ii) a plurality of corresponding user-identified tokens in the first source document, wherein a user-identified token in the plurality of corresponding user-identified tokens includes a portion of content of the first source document;
select a page grammar in plurality of page grammars for the first source document in accordance with the plurality of user provided navigational tags;
extract information from a third source document having a predefined degree of tag layout similarity to the first source document using the page grammar, wherein the second source document is obtained from a second information source; and
transform the information extracted from the second source document in accordance with the domain grammar, thereby extracting and integrating information from a plurality of information sources.
Dependent claims: 2, 3, 6, 9, 12, 15, 21, 22
4. A system for extracting and integrating information from one or more sources, comprising:
at least one processor;
memory; and
at least one program stored in the memory and executable by the at least one processor, the at least one program comprising instructions to:
obtain a domain model comprising a set of entity types having corresponding properties and relationships between entities in a set of entities, wherein the domain model is characterized by a domain grammar;
receive a first tag layout of a first source document obtained from a first information source associated with the domain model, the first tag layout comprising:
(i) a plurality of user-provided navigational tags, wherein a user-provided navigational tag in the plurality of a user-provided navigational tags indicates a navigational position of the first source document relative to a second source document, from the first information source, navigationally connected with the first source document, and
(ii) a plurality of corresponding user-identified tokens in the first source document, wherein a user-identified token in the plurality of corresponding user-identified tokens includes a portion of content of the first source document;
select a page grammar in plurality of page grammars for the first source document in accordance with the plurality of user provided navigational tags;
extract information from a third of source document having a predefined degree of tag layout similarity to the first source document using the page grammar, wherein the second source document is obtained from a second information source; and
transform the information extracted from the second source document in accordance with the domain grammar, thereby extracting and integrating information from a plurality of information sources.
Dependent claims: 7, 10, 13, 16, 18, 19, 23
5. A computer-implemented method for extracting and integrating information from one or more sources, comprising:
obtaining a domain model comprising a set of entity types having corresponding properties and relationships between entities in a set of entities, wherein the domain model is characterized by a domain grammar;
receiving a first tag layout of a first source document obtained from a first information source associated with the domain model, the first tag layout comprising:
(i) a plurality of user-provided navigational tags, wherein a user-provided navigational tag in the plurality of a user-provided navigational tags indicates a navigational position of the first source document relative to a second source document, from the first information source, navigationally connected with the first source document, and
(ii) a plurality of corresponding user-identified tokens in the first source document, wherein a user-identified token in the plurality of corresponding user-identified tokens includes a portion of content of the first source document;
selecting a page grammar in plurality of page grammars for the first source document in accordance with the plurality of user provided navigational tags;
extracting information from a third of source document having a predefined degree of tag layout similarity to the first source document using the page grammar, wherein the second source document is obtained from a second information source; and
transforming the information extracted from the second source document in accordance with the domain grammar, thereby extracting and integrating information from a plurality of information sources.
Dependent claims: 8, 11, 14, 17, 20, 24
Specification