Learning facts from semi-structured text
First Claim
1. A computer-implemented method of learning facts, comprising:
at a computer system including one or more processors and memory storing one or more programs, the one or more processors executing the one or more programs to perform the operations of:
accessing an object within a fact repository, wherein the object includes a name and one or more seed facts;
identifying a set of documents having content and associated with the object name, each document in the set having at least a first predefined number of distinct seed facts in common with the seed facts of the object;
for each of the documents in the identified set:
identifying in the document a contextual pattern associated with the respective seed facts in the document;
confirming that the document includes at least a second predefined number of instances of content matching the contextual pattern in addition to the respective seed facts; and
only when the confirming is successful, extracting an extracted fact from a respective instance of content matching the contextual pattern and merging the extracted fact into the object;
wherein the first predefined number is greater than one and the second predefined number is greater than one.
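The document-selection step of the claim can be illustrated with a minimal Python sketch. It is not the patented implementation: the `FactObject` class, the `select_documents` function, and the substring-matching test for "has a seed fact" are all hypothetical choices made for illustration. Seed facts are assumed to be attribute-value pairs stored in a dictionary, so each one is counted at most once (that is, distinctly).

```python
from dataclasses import dataclass


@dataclass
class FactObject:
    """Hypothetical stand-in for an object in a fact repository."""
    name: str
    seed_facts: dict  # attribute -> value, e.g. {"Born": "1815"}


def count_distinct_seed_facts(text: str, obj: FactObject) -> int:
    """Count seed facts whose attribute and value both occur in the text.

    Assumption: plain case-insensitive substring matching is good enough
    to decide that a document "has" a seed fact.
    """
    lowered = text.lower()
    return sum(
        1
        for attr, value in obj.seed_facts.items()
        if attr.lower() in lowered and value.lower() in lowered
    )


def select_documents(documents, obj: FactObject, first_threshold: int = 2):
    """Keep documents that mention the object name and share at least
    `first_threshold` distinct seed facts with the object."""
    if first_threshold <= 1:
        # The claim requires the first predefined number to exceed one.
        raise ValueError("first_threshold must be greater than one")
    return [
        doc
        for doc in documents
        if obj.name.lower() in doc.lower()
        and count_distinct_seed_facts(doc, obj) >= first_threshold
    ]
```

Requiring more than one shared seed fact reduces the chance that a document merely mentions the object name in passing or coincidentally contains a single matching value.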
Abstract
A method and system of learning, or bootstrapping, facts from semi-structured text is described. Starting with a set of seed facts associated with an object, documents associated with the object are identified. The identified documents are checked to determine if each has at least a first predefined number of seed facts. If a document does have at least a first predefined number of seed facts, a contextual pattern associated with the seed facts is identified and other instances of content in the document matching the contextual pattern are identified. If the document includes at least a second predefined number of the other instances of content matching the contextual pattern, then facts may be extracted from the other instances.
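The pattern-identification and extraction steps described in the abstract can be sketched as follows. This is a simplified illustration under stated assumptions, not the method of the specification: it assumes each fact sits on its own line, treats the text surrounding a seed attribute and value on that line as the "contextual pattern", and uses hypothetical function names (`induce_template`, `template_to_regex`, `extract_new_facts`).

```python
import re
from collections import Counter


def induce_template(lines, seed_facts):
    """Return the most common (prefix, middle, suffix) context that
    surrounds a seed attribute and its value on one line, or None."""
    counts = Counter()
    for line in lines:
        for attr, value in seed_facts.items():
            a = line.find(attr)
            if a < 0:
                continue
            v = line.find(value, a + len(attr))
            if v < 0:
                continue
            counts[(line[:a], line[a + len(attr):v], line[v + len(value):])] += 1
    if not counts:
        return None
    return counts.most_common(1)[0][0]


def template_to_regex(template):
    """Turn the literal context into a regex with attr/value slots."""
    prefix, middle, suffix = template
    return re.compile(
        re.escape(prefix) + r"(?P<attr>.+?)"
        + re.escape(middle) + r"(?P<value>.+?)"
        + re.escape(suffix)
    )


def extract_new_facts(document, seed_facts, second_threshold=2):
    """Extract non-seed facts matching the induced pattern, but only if
    at least `second_threshold` such instances exist in the document."""
    lines = [ln.strip() for ln in document.splitlines() if ln.strip()]
    template = induce_template(lines, seed_facts)
    if template is None:
        return {}
    regex = template_to_regex(template)
    candidates = {}
    for line in lines:
        match = regex.fullmatch(line)
        if match and match["attr"] not in seed_facts:
            candidates[match["attr"]] = match["value"]
    # Confirmation step: too few additional matches means no extraction.
    return candidates if len(candidates) >= second_threshold else {}
```

For a document whose lines include `<tr><td>Born</td><td>1815</td></tr>` and similar rows, seeding with the "Born" and "Died" facts would induce the table-row context as the pattern, and other rows in the same layout would be offered as new facts. The extracted dictionary could then be merged into the object's facts, for example with `{**seed_facts, **new_facts}`.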
19 Claims
1. A computer-implemented method of learning facts, comprising:
at a computer system including one or more processors and memory storing one or more programs, the one or more processors executing the one or more programs to perform the operations of:
accessing an object within a fact repository, wherein the object includes a name and one or more seed facts;
identifying a set of documents having content and associated with the object name, each document in the set having at least a first predefined number of distinct seed facts in common with the seed facts of the object;
for each of the documents in the identified set:
identifying in the document a contextual pattern associated with the respective seed facts in the document;
confirming that the document includes at least a second predefined number of instances of content matching the contextual pattern in addition to the respective seed facts; and
only when the confirming is successful, extracting an extracted fact from a respective instance of content matching the contextual pattern and merging the extracted fact into the object;
wherein the first predefined number is greater than one and the second predefined number is greater than one.
Dependent claims: 2, 3, 4, 5, 6
7. A system for learning facts, comprising:
one or more processors; and
memory storing one or more modules having instructions for execution by the one or more processors, including instructions:
to access an object within a fact repository, wherein the object includes a name and one or more seed facts;
to identify a set of documents having content and associated with the object name, each document in the set having at least a first predefined number of distinct seed facts in common with the seed facts of the object;
for each of the documents in the identified set:
to identify in the document a contextual pattern associated with the respective seed facts in the document; and
to confirm that the document includes at least a second predefined number of instances of content matching the contextual pattern in addition to the respective seed facts; and
to extract an extracted fact only from a respective instance of content matching the contextual pattern and merge the extracted fact into the object;
wherein the first predefined number is greater than one and the second predefined number is greater than one.
Dependent claims: 8, 9, 10, 11, 12
13. A computer readable storage medium storing one or more programs for execution by a computer system, the one or more programs comprising instructions for:
accessing an object within a fact repository, wherein the object includes a name and one or more seed facts;
identifying a set of documents having content and associated with the object name, each document in the set having at least a first predefined number of distinct seed facts in common with the seed facts of the object;
for each of the documents in the identified set:
identifying in the document a contextual pattern associated with the respective seed facts in the document;
confirming that the document includes at least a second predefined number of instances of content matching the contextual pattern in addition to the respective seed facts; and
only when the confirming is successful, extracting an extracted fact from a respective instance of content matching the contextual pattern and merging the extracted fact into the object;
wherein the first predefined number is greater than one and the second predefined number is greater than one.
Dependent claims: 14, 15, 16, 17, 18
19. A system for learning facts, comprising:
one or more processors; and
memory storing one or more modules having instructions for execution by the one or more processors;
means for accessing an object within a fact repository, wherein the object includes a name and one or more seed facts;
means for identifying a set of documents having content and associated with the object name, each document in the set having at least a first predefined number of distinct seed facts in common with the seed facts of the object;
means, for each of the documents in the identified set:
for identifying in the document a contextual pattern associated with the respective seed fact in the document;
for confirming that the document includes at least a second predefined number of instances of content matching the contextual pattern in addition to the respective seed facts; and
only when the confirming is successful, for extracting an extracted fact from a respective instance of content matching the contextual pattern and merging the extracted fact into the object;
wherein the first predefined number is greater than one and the second predefined number is greater than one.
Specification