Extracting facts from unstructured information
Abstract
A computer-implemented technique is described herein for extracting facts from unstructured text documents provided by one or more information sources. The technique performs this operation using a pipeline that involves, at least in part, providing a corpus of information items, extracting candidate facts from the information items, merging synonymous argument values associated with the candidate facts, organizing the candidate facts into relation clusters, and assessing the confidence level of the candidate facts within the relation clusters.
20 Claims
1. A system comprising:
- a processing device; and
- a computer-readable storage medium storing machine-readable instructions which, when executed by the processing device, cause the processing device to:
  - identify a plurality of sentences in an entity-tagged corpus that include at least two tagged entities, wherein the entity-tagged corpus is derived from a collection of information items that include unstructured information comprising the plurality of sentences;
  - parse the plurality of sentences to obtain parsed sentences representing parts of individual sentences as parse trees;
  - identify a plurality of relations in the parsed sentences, wherein respective relations identify a first argument value associated with a first named entity that corresponds to a subject expressed in a respective parsed sentence, a second argument value associated with a second named entity that corresponds to an object expressed in the respective parsed sentence, and a relation value which reflects a corresponding relationship expressed in the respective parsed sentence, the corresponding relationship being between the first named entity and the second named entity;
  - form one or more relation clusters based at least on the identified relations, respective relation clusters grouping together relations associated with a same first argument type expressed in the unstructured information, a same second argument type expressed in the unstructured information, and a same relation value expressed in the unstructured information;
  - generate confidence score information for the relations in said one or more relation clusters to provide scored relations, wherein the confidence score information reflects relative confidence that individual relations express factually true relationships between individual subjects and individual objects and the confidence score information is based at least on a parsing confidence reflecting confidence in the parsing of the plurality of sentences to obtain the parse trees;
  - output final extracted facts by selecting a subset of the scored relations based at least on the confidence score information; and
  - store the final extracted facts in a data store, the final extracted facts in the data store being accessible via a user computing device coupled to a computer network.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14
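The clustering step of claim 1 groups relations that share a first argument type, a second argument type, and a relation value. A minimal Python sketch of that grouping follows. The `Relation` record, its field names, the type labels, and the sample data are illustrative assumptions, not taken from the patent.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Relation:
    """Hypothetical record for one extracted relation (field names are assumptions)."""
    arg1: str        # first argument value (subject)
    arg1_type: str   # type of the first named entity
    relation: str    # relation value
    arg2: str        # second argument value (object)
    arg2_type: str   # type of the second named entity


def cluster_relations(relations):
    """Group relations by (first argument type, second argument type, relation value)."""
    clusters = defaultdict(list)
    for r in relations:
        clusters[(r.arg1_type, r.arg2_type, r.relation)].append(r)
    return dict(clusters)


# Illustrative input only.
relations = [
    Relation("Ada Lovelace", "PERSON", "born in", "London", "CITY"),
    Relation("Alan Turing", "PERSON", "born in", "London", "CITY"),
    Relation("Acme Corp", "COMPANY", "based in", "London", "CITY"),
]
clusters = cluster_relations(relations)
for key, members in clusters.items():
    print(key, [m.arg1 for m in members])
```

Keying the dictionary on the three-part tuple keeps each cluster homogeneous, so any later scoring can compare relations of the same kind against one another.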
15. A method implemented by one or more computing devices, the method comprising:
- receiving a collection of information items from one or more information sources via a computer network, the information items presenting unstructured information;
- identifying a plurality of sentences in the unstructured information that mention at least two entities;
- parsing the plurality of sentences in the unstructured information to obtain parsed sentences representing parts of individual sentences as parse trees;
- identifying a plurality of relations in the parsed sentences, wherein respective relations identify a first argument value associated with a first named entity that corresponds to a subject expressed in a respective parsed sentence, a second argument value associated with a second named entity that corresponds to an object expressed in the respective parsed sentence, and a relation value which reflects a corresponding relationship between the first named entity and the second named entity expressed in the respective parsed sentence;
- merging synonymous argument values within the plurality of relations to provide a set of argument-merged facts;
- forming relation clusters based at least on the argument-merged facts, individual relation clusters grouping together relations associated with a same first argument type identified from the unstructured information, a same second argument type identified from the unstructured information, and a same relation value identified from the unstructured information;
- generating confidence score information for the relations in said relation clusters to provide scored relations, the confidence score information reflecting at least a parsing confidence in the parsing of the plurality of sentences to obtain the parse trees;
- outputting final extracted facts by selecting a subset of the scored relations based at least on the confidence score information; and
- providing access to the final extracted facts to one or more knowledge-consuming computer-implemented applications.
Dependent claims: 16, 17, 18, 19
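Claim 15 adds a step that merges synonymous argument values before clustering. The claim does not say how synonyms are detected or which surface form is kept, so the sketch below makes two assumptions: the synonym groups are supplied by the caller, and the most frequently observed form in each group becomes the canonical value. The function name `merge_synonyms` and the sample facts are hypothetical.

```python
from collections import Counter


def merge_synonyms(facts, synonym_groups):
    """Replace each argument value by its group's canonical form and drop duplicates.

    Assumption: the canonical form is the most frequent member of the group,
    with ties broken alphabetically so the result is deterministic.
    """
    counts = Counter()
    for arg1, _, arg2 in facts:
        counts[arg1] += 1
        counts[arg2] += 1

    canonical = {}
    for group in synonym_groups:
        best = min(group, key=lambda value: (-counts[value], value))
        for value in group:
            canonical[value] = best

    merged = {
        (canonical.get(arg1, arg1), rel, canonical.get(arg2, arg2))
        for arg1, rel, arg2 in facts
    }
    return sorted(merged)


# Illustrative (subject, relation value, object) triples.
facts = [
    ("Bill Gates", "founded", "Microsoft"),
    ("William Gates", "founded", "Microsoft Corp."),
    ("Bill Gates", "born in", "Seattle"),
]
groups = [{"Bill Gates", "William Gates"}, {"Microsoft", "Microsoft Corp."}]
merged = merge_synonyms(facts, groups)
print(merged)
```

After merging, the two "founded" triples collapse into one argument-merged fact, so they land in the same relation cluster instead of being counted as separate claims.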
20. A computer-readable storage medium storing computer-readable instructions, the computer-readable instructions, when executed by one or more processor devices, performing acts comprising:
- receiving a collection of information items from one or more information sources via a computer network, the information items presenting unstructured information;
- identifying a plurality of sentences in the unstructured information that mention at least two entities;
- parsing the plurality of sentences in the unstructured information to obtain parsed sentences representing parts of individual sentences as parse trees;
- identifying a plurality of relations between entities in the parsed sentences, wherein respective relations identify a first argument value associated with a first entity that corresponds to a subject expressed in a respective parsed sentence, a second argument value associated with a second entity that corresponds to an object expressed in the respective parsed sentence, and a relation value which reflects a corresponding relationship between the first entity and the second entity expressed in the respective parsed sentence;
- generating confidence score information for individual relations to provide scored relations, the confidence score information reflecting at least a parsing confidence in the parsing of the plurality of sentences to obtain the parse trees; and
- outputting final extracted facts by selecting a subset of the scored relations based at least on the confidence score information.
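The last two steps of claim 20, scoring relations and selecting a subset by score, can be sketched as follows. The claim requires only that the score reflect at least a parsing confidence. The scoring formula here, which multiplies parsing confidence by relative support, is an invented illustration, as are the `support` field, the `threshold` default, and the sample data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoredCandidate:
    """Hypothetical candidate relation with the signals used for scoring."""
    triple: tuple            # (subject, relation value, object)
    parse_confidence: float  # assumed to be reported by the parser, in [0, 1]
    support: int             # assumed extra signal: sentences expressing the triple


def confidence(candidate, max_support):
    """Illustrative score: parsing confidence weighted by relative support."""
    return candidate.parse_confidence * (candidate.support / max_support)


def select_facts(candidates, threshold=0.5):
    """Return triples whose score meets the threshold, highest score first."""
    if not candidates:
        return []
    max_support = max(c.support for c in candidates)
    scored = [(confidence(c, max_support), c) for c in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c.triple for score, c in scored if score >= threshold]


# Illustrative candidates only.
candidates = [
    ScoredCandidate(("Ada Lovelace", "born in", "London"), 0.9, 4),
    ScoredCandidate(("Ada Lovelace", "born in", "Paris"), 0.8, 1),
    ScoredCandidate(("Alan Turing", "born in", "London"), 0.7, 3),
]
final_facts = select_facts(candidates)
print(final_facts)
```

A fixed threshold is one simple way to pick the subset; keeping the top-scoring relation per subject, or the top k per cluster, would fit the claim language equally well.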
Specification