METHODS AND SYSTEMS TO TRAIN MODELS TO EXTRACT AND INTEGRATE INFORMATION FROM DATA SOURCES
Abstract
Methods and systems to model and acquire data from a variety of data and information sources, to integrate the data into a structured database, and to manage the continuing reintegration of updated data from those sources over time. For any given domain, a variety of individual information and data sources that contain information relevant to the schema can be identified. Data elements associated with a schema may be identified in a training source, such as by user tagging. A formal grammar may be induced appropriate to the schema and layout of the training source. A Hidden Markov Model (HMM) corresponding to the grammar may learn where in the sources the elements can be found. The system can automatically mutate its schema into a grammar matching the structure of the source documents. By following an inverse transformation sequence, data that is parsed by the mutated grammar can be fit back into the original grammar structure, matching the original data schema defined through domain modeling. Features disclosed herein may be implemented with respect to web-scraping and data acquisition, and to represent data in support of data-editing and data-merging tasks. A schema may be defined with respect to a graph-based domain model.
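The abstract's statement that an HMM may learn where schema elements are found from user-tagged training sources can be illustrated with a minimal sketch of supervised parameter estimation. It assumes that tagged token sequences are available as (token, tag) pairs and uses add-one smoothing. The function name `estimate_hmm`, the `<start>` pseudo-state, and the sample data are hypothetical and are not taken from the specification.

```python
from collections import Counter, defaultdict

def estimate_hmm(tagged_sequences, smoothing=1.0):
    """Estimate transition and emission probabilities from tagged token sequences."""
    trans = defaultdict(Counter)   # previous tag -> counts of next tag
    emit = defaultdict(Counter)    # tag -> counts of emitted token
    states, vocab = set(), set()
    for seq in tagged_sequences:
        prev = "<start>"           # hypothetical pseudo-state for sequence starts
        for token, tag in seq:
            trans[prev][tag] += 1
            emit[tag][token] += 1
            states.add(tag)
            vocab.add(token)
            prev = tag

    def normalize(counts, outcomes):
        # Additive smoothing keeps unseen outcomes at a small non-zero probability.
        total = sum(counts.values()) + smoothing * len(outcomes)
        return {o: (counts[o] + smoothing) / total for o in outcomes}

    trans_p = {p: normalize(trans[p], states) for p in ["<start>", *sorted(states)]}
    emit_p = {s: normalize(emit[s], vocab) for s in sorted(states)}
    return trans_p, emit_p

# Illustrative user-tagged training data (assumed, not from the source).
training = [
    [("Name:", "label"), ("Ada", "person"), ("Lovelace", "person")],
    [("Name:", "label"), ("Alan", "person")],
]
trans_p, emit_p = estimate_hmm(training)
```

The resulting tables could then drive a decoder that tags tokens in documents the user has not tagged.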
51 Claims
1. A computer program product including a computer readable medium having computer program logic stored therein, the computer program logic comprising:
- logic to cause a processor to analyze user-input tag data, including computer readable tags and associated tokens, corresponding to at least a portion of a first set of one or more documents, and based on the analysis, to assign tags to non-user-tagged tokens within one or more of a second portion of the first set of documents, or a second set of one or more documents;
- wherein each token includes one or more computer readable elements of a corresponding document; and
- wherein the computer readable tags include one or more of, data tags to associate corresponding tokens with an entity-type of a graph-structured domain model, and navigation tags to associate corresponding tokens with one or more document navigation actions.
Dependent claims: 2-21.
22. A system, comprising:
- means for analyzing user-input tag data, including computer readable tags and associated tokens, corresponding to at least a portion of a first set of one or more documents; and
- means for assigning the computer readable tags to tokens within one or more of a second portion of the first set of one or more documents and a second set of one or more documents based on the analysis;
- wherein each token includes one or more computer readable elements of a corresponding document; and
- wherein the computer readable tags include one or more of, data tags to associate corresponding tokens with an entity-type of a graph-structured domain model, and navigation tags to associate corresponding tokens with one or more document navigation actions.
23. A method, comprising:
- analyzing user-input tag data, including computer readable tags and associated tokens, corresponding to at least a portion of a first set of one or more documents; and
- assigning the computer readable tags to tokens within one or more of a second portion of the first set of one or more documents and a second set of one or more documents based on the analysis;
- wherein each token includes one or more computer readable elements of a corresponding document;
- wherein the computer readable tags include one or more of, data tags to associate corresponding tokens with an entity-type of a graph-structured domain model, and navigation tags to associate corresponding tokens with one or more document navigation actions; and
- wherein the analyzing is performed within a suitably programmed computer system.
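As a rough illustration of claims 1, 22, and 23, the sketch below analyzes a handful of user-tagged tokens and then assigns tags to tokens the user never tagged. It is a deliberately simple frequency model, not the HMM-based approach described in the abstract. The `Token` class, the element-path feature, and the `data:` and `nav:` tag prefixes are all hypothetical names chosen for the example.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class Token:
    element: str  # hypothetical feature: the document element the token came from
    text: str

def train(tagged):
    """Count how often the user gave each tag to tokens from each element."""
    counts = defaultdict(Counter)
    for token, tag in tagged:
        counts[token.element][tag] += 1
    return counts

def assign_tags(model, tokens):
    """Give each token the most frequent user tag for its element, or None if unseen."""
    result = []
    for tok in tokens:
        seen = model.get(tok.element)
        result.append((tok, seen.most_common(1)[0][0] if seen else None))
    return result

# User-input tag data over a portion of a first set of documents (assumed data).
user_tagged = [
    (Token("td.name", "Ada Lovelace"), "data:Person"),       # data tag -> entity type
    (Token("td.org", "Analytical Engines Ltd"), "data:Company"),
    (Token("a.next", "Next page"), "nav:follow_link"),       # navigation tag -> action
]
model = train(user_tagged)

# Non-user-tagged tokens from a second document.
untagged = [
    Token("td.name", "Alan Turing"),
    Token("a.next", "Next"),
    Token("div.ad", "Buy now"),
]
predicted = assign_tags(model, untagged)
```

Tokens from elements the user never tagged, such as `div.ad` here, are left without a tag rather than forced into an entity type.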
24. A computer program product including a computer readable medium having computer program logic stored therein, the computer program logic comprising:
- logic to cause a processor to implement a graph-structured domain model schema having a plurality of entity types, a plurality of permissible directional relationship arcs amongst the entity types, and cardinalities associated with the relationship arcs; and
- logic to cause the processor to interpret the graph-structured domain model schema as a domain grammar over sequences of tags, wherein each tag corresponds to one of an entity-type and a path between entity types in the domain model schema.
Dependent claims: 25-27.
28. A system, comprising:
- means for implementing a graph-structured domain model schema having a plurality of entity types, a plurality of permissible directional relationship arcs amongst the entity types, and cardinalities associated with the relationship arcs; and
- means for interpreting the graph-structured domain model schema as a domain grammar over sequences of tags, wherein each tag corresponds to one of an entity-type and a path between entity types in the domain model schema.
29. A method, comprising:
- implementing a graph-structured domain model schema having a plurality of entity types, a plurality of permissible directional relationship arcs amongst the entity types, and cardinalities associated with the relationship arcs; and
- interpreting the graph-structured domain model schema as a domain grammar over sequences of tags, wherein each tag corresponds to one of an entity-type and a path between entity types in the domain model schema;
- wherein the implementing and the interpreting are performed within a suitably programmed computer system.
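One way to picture claims 24, 28, and 29 is a small schema whose arcs and cardinalities are read as grammar rules over tag sequences. The sketch below is an assumption-laden simplification: the schema, the `"one"` and `"many"` cardinality labels, and the function names are hypothetical, `"one"` is treated as "at most one", and only entity-type tags are parsed (path tags are merely enumerated).

```python
def to_grammar(entities, arcs):
    """Read each entity type as a rule: its own tag followed by its arc targets."""
    grammar = {e: [] for e in entities}
    for src, _name, dst, cardinality in arcs:
        grammar[src].append((dst, cardinality))
    return grammar

def tag_vocabulary(entities, arcs):
    """Tags are either entity types or (single-arc) paths between entity types."""
    return list(entities) + [f"{src}.{name}.{dst}" for src, name, dst, _ in arcs]

def parse(grammar, symbol, tags, pos):
    """Match `symbol` at tags[pos]; return the position after it, or None."""
    if pos >= len(tags) or tags[pos] != symbol:
        return None
    pos += 1
    for child, cardinality in grammar[symbol]:
        if cardinality == "one":            # at most one child of this type
            nxt = parse(grammar, child, tags, pos)
            if nxt is not None:
                pos = nxt
        else:                               # "many": zero or more children
            while (nxt := parse(grammar, child, tags, pos)) is not None:
                pos = nxt
    return pos

def accepts(grammar, root, tags):
    return parse(grammar, root, tags, 0) == len(tags)

# Hypothetical schema: a company employs many people; a person has one phone.
entities = ["Company", "Person", "Phone"]
arcs = [("Company", "employs", "Person", "many"),
        ("Person", "has_phone", "Phone", "one")]
grammar = to_grammar(entities, arcs)

ok = accepts(grammar, "Company", ["Company", "Person", "Phone", "Person"])
bad = accepts(grammar, "Company", ["Company", "Person", "Phone", "Phone"])
```

The second sequence is rejected because a second `Phone` after the same `Person` would violate the singular cardinality on that arc. Each recursive call consumes a tag before descending, so cyclic schemas do not cause unbounded recursion.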
30. A computer program product including a computer readable medium having computer program logic stored therein, the computer program logic comprising:
- logic to cause a processor to enforce a graph-structured domain model schema with respect to a database, including to define named relations between entities of different entity types and to associate one of a plurality of cardinality rules with each named relation, wherein the plurality of cardinality rules include a singular cardinality constraint, a multiple cardinality constraint that allows values to accumulate over time, and a multiple cardinality constraint that does not allow values to accumulate over time.
Dependent claims: 31-38.
39. A system, comprising:
- means for enforcing a graph-structured domain model schema with respect to a database, including means for defining named relations between entities of different entity types and means for associating one of a plurality of cardinality rules with each named relation, wherein the plurality of cardinality rules include a singular cardinality constraint, a multiple cardinality constraint that allows values to accumulate over time, and a multiple cardinality constraint that does not allow values to accumulate over time.
40. A method, comprising:
- enforcing a graph-structured domain model schema with respect to a database, including defining named relations between entities of different entity types and associating one of a plurality of cardinality rules with each named relation, wherein the plurality of cardinality rules include a singular cardinality constraint, a multiple cardinality constraint that allows values to accumulate over time, and a multiple cardinality constraint that does not allow values to accumulate over time;
- wherein the enforcing is performed within a suitably programmed computer system.
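The three cardinality rules of claims 30, 39, and 40 can be sketched as an update policy applied each time values for a named relation are acquired. The reading used here is an assumption: "accumulate over time" is taken to mean that newly acquired values are merged with earlier ones, and "does not accumulate" that they replace them. The class names, relation names, and data are hypothetical.

```python
from enum import Enum

class Cardinality(Enum):
    SINGULAR = "singular"
    MULTI_ACCUMULATE = "multiple, accumulating"
    MULTI_REPLACE = "multiple, non-accumulating"

class RelationStore:
    """Holds the values of named relations and enforces one rule per relation."""

    def __init__(self, rules):
        self.rules = rules      # relation name -> Cardinality
        self.values = {}        # (entity, relation) -> list of values

    def update(self, entity, relation, new_values):
        rule = self.rules[relation]
        key = (entity, relation)
        new_values = list(new_values)
        if rule is Cardinality.SINGULAR:
            if len(new_values) > 1:
                raise ValueError(f"{relation} allows a single value")
            self.values[key] = new_values
        elif rule is Cardinality.MULTI_ACCUMULATE:
            merged = self.values.get(key, [])
            # Keep earlier values and append only the ones not yet stored.
            self.values[key] = merged + [v for v in new_values if v not in merged]
        else:
            # Non-accumulating: the latest acquisition replaces earlier values.
            self.values[key] = new_values

store = RelationStore({
    "ceo": Cardinality.SINGULAR,
    "press_mentions": Cardinality.MULTI_ACCUMULATE,
    "board_members": Cardinality.MULTI_REPLACE,
})
store.update("acme", "press_mentions", ["2009 article"])
store.update("acme", "press_mentions", ["2010 article"])
store.update("acme", "board_members", ["Ann", "Bob"])
store.update("acme", "board_members", ["Ann", "Cy"])
store.update("acme", "ceo", ["Dana"])
```

After these updates, the accumulating relation holds both articles, while the non-accumulating relation holds only the most recent board.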
41. A computer program product including a computer readable medium having computer program logic stored therein, the computer program logic comprising:
- logic to cause a processor to acquire a data object, including one or more corresponding properties, from an information source;
- logic to cause the processor to incorporate the data object into a database of data objects;
- logic to cause the processor to revise one or more features associated with the data object in response to user-input, and to preserve an original state of the data object in the database; and
- logic to cause the processor to re-acquire the data object, including the one or more corresponding properties, from one or more of the information source and another information source, and to determine that the re-acquired data object is redundant to the original state of the data object.
Dependent claims: 42, 43.
44. A system, comprising:
- means for acquiring a data object, including one or more corresponding properties, from an information source;
- means for incorporating the data object into a database of data objects;
- means for revising one or more features associated with the data object in response to user-input, and for preserving an original state of the data object in the database; and
- means for re-acquiring the data object, including the one or more corresponding properties, from one or more of the information source and another information source, and for determining that the re-acquired data object is redundant to the original state of the data object.
45. A method, comprising:
- acquiring a data object, including one or more corresponding properties, from an information source;
- incorporating the data object into a database of data objects;
- revising one or more features associated with the data object in response to user-input, and preserving an original state of the data object in the database; and
- re-acquiring the data object, including the one or more corresponding properties, from one or more of the information source and another information source, and determining that the re-acquired data object is redundant to the original state of the data object;
- wherein the acquiring, the incorporating, the revising, and the re-acquiring are performed within a suitably programmed computer system.
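The acquire, revise, and re-acquire steps of claims 41, 44, and 45 can be sketched by storing two copies of each object: the state as originally acquired and the current, possibly user-edited state. Comparing a re-acquired object against the original state rather than the edited one is what lets redundancy be detected after edits. The `Database` and `Record` classes and the return labels are hypothetical, and what should happen when the source has genuinely changed is left to the caller.

```python
from dataclasses import dataclass

@dataclass
class Record:
    original: dict   # properties exactly as first acquired from the source
    current: dict    # properties after any user revisions

class Database:
    def __init__(self):
        self.records = {}

    def acquire(self, key, props):
        """Incorporate an acquired object; report 'new', 'redundant', or 'changed'."""
        rec = self.records.get(key)
        if rec is None:
            self.records[key] = Record(original=dict(props), current=dict(props))
            return "new"
        if props == rec.original:
            # Same as the preserved original state: nothing to merge,
            # and the user's revisions are left untouched.
            return "redundant"
        return "changed"  # hypothetical label; handling is out of scope here

    def revise(self, key, **changes):
        """Apply user edits to the current state only."""
        self.records[key].current.update(changes)

db = Database()
first = db.acquire("acme", {"name": "ACME Corp", "city": "Boston"})
db.revise("acme", name="Acme Corporation")
second = db.acquire("acme", {"name": "ACME Corp", "city": "Boston"})
third = db.acquire("acme", {"name": "ACME Corp", "city": "Chicago"})
```

Without the preserved original, the second acquisition would look different from the edited record and could overwrite the user's correction.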
46. A computer program product including a computer readable medium having computer program logic stored therein, the computer program logic comprising:
- graphical user interface logic to cause a processor to receive user-input graph-based domain model schema definitions and example entity type values, and to display the user-input example values within textual descriptions of cardinality rules of the domain model definitions.
47. A system, comprising:
- means for receiving user-input graph-based domain model schema definitions and example entity type values; and
- means for displaying the user-input example values within textual descriptions of cardinality rules of the domain model definitions.
48. A method, comprising:
- receiving user-input graph-based domain model schema definitions and example entity type values; and
- displaying the user-input example values within textual descriptions of cardinality rules of the domain model definitions;
- wherein the receiving and the displaying are performed within a suitably programmed computer system.
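Claims 46 through 48 describe showing the user's own example values inside plain-language descriptions of cardinality rules. A minimal sketch of that idea, using string templates, follows. The template wording, rule keys, and function names are hypothetical; a real interface would render this text in a GUI rather than return a string.

```python
def quote_list(values):
    """Render example values as a quoted, 'or'-separated list."""
    return " or ".join(f'"{v}"' for v in values)

# Hypothetical wording for each cardinality rule.
TEMPLATES = {
    "singular":
        "Each {src} (such as {src_ex}) has at most one {rel} (such as {dst_ex}).",
    "multiple_accumulating":
        "Each {src} (such as {src_ex}) can have many {rel} values (such as {dst_ex}), "
        "and newly acquired values are added to the earlier ones.",
    "multiple_replacing":
        "Each {src} (such as {src_ex}) can have many {rel} values (such as {dst_ex}), "
        "and newly acquired values replace the earlier ones.",
}

def describe_rule(rule, src, rel, src_examples, dst_examples):
    """Embed the user's example values in the description of a cardinality rule."""
    return TEMPLATES[rule].format(
        src=src,
        rel=rel,
        src_ex=quote_list(src_examples),
        dst_ex=quote_list(dst_examples),
    )

sentence = describe_rule("singular", "Company", "CEO", ["Acme"], ["Jane Doe"])
```

Seeing "Acme" and "Jane Doe" in the sentence lets a user check a rule against familiar data instead of abstract type names.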
49. A computer program product including a computer readable medium having computer program logic stored therein, the computer program logic comprising:
- Viterbi logic to cause a processor to determine a sequence of best-predecessor states, of length equal to a number of tokens in a token sequence being parsed, for each of a plurality of states of a computational process; and
- encoding logic to cause the processor to store each of the sequences of best-predecessor states in a run-length-encoded list.
50. A system, comprising:
- means for determining a sequence of best-predecessor states, of length equal to a number of tokens in a token sequence being parsed, for each of a plurality of states of a computational process; and
- means for storing each of the sequences of best-predecessor states in a run-length-encoded list.
51. A method, comprising:
- determining a sequence of best-predecessor states, of length equal to a number of tokens in a token sequence being parsed, for each of a plurality of states of a computational process; and
- storing each of the sequences of best-predecessor states in a run-length-encoded list;
- wherein the determining and the storing are performed within a suitably programmed computer system.
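Claims 49 through 51 combine Viterbi decoding with run-length-encoded storage of the best-predecessor sequences. The sketch below is one possible reading, assuming a non-empty token sequence and log-probability tables supplied as nested dictionaries. Each state keeps one predecessor entry per token (the first entry is `None`), stored as `[value, run_length]` pairs. The function names and the toy model are hypothetical.

```python
import math

def rle_append(runs, value):
    """Append a value to a run-length-encoded list of [value, count] pairs."""
    if runs and runs[-1][0] == value:
        runs[-1][1] += 1
    else:
        runs.append([value, 1])

def rle_get(runs, index):
    """Return the value at a logical index of a run-length-encoded list."""
    for value, length in runs:
        if index < length:
            return value
        index -= length
    raise IndexError(index)

def viterbi_rle(tokens, states, log_start, log_trans, log_emit):
    score = {s: log_start[s] + log_emit[s][tokens[0]] for s in states}
    back = {s: [] for s in states}
    for s in states:
        rle_append(back[s], None)          # the first token has no predecessor
    for tok in tokens[1:]:
        new_score = {}
        for s in states:
            best = max(states, key=lambda p: score[p] + log_trans[p][s])
            new_score[s] = score[best] + log_trans[best][s] + log_emit[s][tok]
            rle_append(back[s], best)      # one entry per token, per state
        score = new_score
    state = max(states, key=lambda s: score[s])
    path = [state]
    for t in range(len(tokens) - 1, 0, -1):
        state = rle_get(back[state], t)    # follow best predecessors backwards
        path.append(state)
    path.reverse()
    return path, back

# Toy two-state model: "w" is a word-like token, "d" a digit-like token.
states = ["name", "phone"]
log = math.log
log_start = {"name": log(0.9), "phone": log(0.1)}
log_trans = {"name": {"name": log(0.7), "phone": log(0.3)},
             "phone": {"name": log(0.1), "phone": log(0.9)}}
log_emit = {"name": {"w": log(0.9), "d": log(0.1)},
            "phone": {"w": log(0.1), "d": log(0.9)}}
tokens = ["w", "w", "d", "d", "d"]
path, back = viterbi_rle(tokens, states, log_start, log_trans, log_emit)
```

Because a state's best predecessor tends to stay the same across long stretches of similar tokens, the encoded lists are often much shorter than the token sequence. The linear scan in `rle_get` keeps the sketch short; a production decoder would more likely walk the runs backwards during the trace.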
Specification