Extracting facts from unstructured text
Abstract
A system and method for extracting facts from unstructured text files are disclosed. Embodiments of the disclosed system and method may receive a text file as input and perform extraction and disambiguation of entities, as well as extract topics and facts. The facts are extracted by comparing against a fact template store and associating facts with events or topics. The extracted facts are stored in a data store.
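The comparison against a fact template store can be pictured with a small sketch. The claims describe a fact template model that identifies keywords for specific fact identifiers along with keyword weights; everything else here is an assumption made for illustration: the `FactTemplate` class, the example templates and weights, the sentence-level matching, and the `threshold` parameter are hypothetical and are not taken from the patent.

```python
import re
from dataclasses import dataclass


@dataclass
class FactTemplate:
    fact_id: str           # hypothetical fact identifier
    keyword_weights: dict  # keyword -> weight (illustrative values)


# Hypothetical template store; a real system would load this from a database.
TEMPLATES = [
    FactTemplate("acquisition", {"acquired": 0.6, "bought": 0.4, "deal": 0.2}),
    FactTemplate("resignation", {"resigned": 0.7, "stepped": 0.2}),
]


def score_sentence(sentence, template):
    """Sum the weights of the template keywords that occur in the sentence."""
    tokens = set(re.findall(r"[a-z]+", sentence.lower()))
    return sum(template.keyword_weights.get(token, 0.0) for token in tokens)


def extract_facts(text, threshold=0.5):
    """Return (fact_id, character_offset, score) for each sentence/template match."""
    facts = []
    # Naive sentence split on ., ! and ?; an assumption, not the patent's method.
    for match in re.finditer(r"[^.!?]+[.!?]?", text):
        sentence = match.group()
        for template in TEMPLATES:
            score = score_sentence(sentence, template)
            if score >= threshold:
                facts.append((template.fact_id, match.start(), round(score, 2)))
    return facts


if __name__ == "__main__":
    sample = "Acme acquired Widgetron in a large deal. Its chief executive resigned on Monday."
    for fact in extract_facts(sample):
        print(fact)
```

The character offset is kept with each match so that a later step could relate the position of a fact to the positions of extracted entities and topics.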
17 Claims
1. A method comprising:
receiving, by an entity extraction computer, an electronic document having unstructured text, wherein the electronic document is a text file;
extracting, by the entity extraction computer, an entity identifier from the unstructured text in the electronic document;
extracting, by a topic extraction computer, a topic identifier from the unstructured text in the electronic document;
extracting, by a fact extraction computer, a fact identifier from the unstructured text in the electronic document by comparing text string structures in the unstructured text to a fact template database, wherein the fact template database having stored therein a fact template model identifying keywords pertaining to specific fact identifiers and corresponding keyword weights; and
associating, by a fact relatedness estimator computer, the entity identifier with the topic identifier and the fact identifier to determine a confidence score indicative of a degree of accuracy of extraction of the fact identifier, wherein the confidence score is based at least in part on a spatial distance between a part of the unstructured text in the electronic document from where the fact identifier was extracted and a part of the unstructured text from where at least one of the topic identifier or the entity identifier was extracted.
Dependent claims: 2, 3, 4, 5, 6
7. A system comprising:
one or more server computers having one or more processors executing computer readable instructions for a plurality of computer modules including:
an entity extraction module which receives an electronic document having unstructured text and extracts an entity identifier from the unstructured text in the electronic document, wherein the electronic document is a text file;
a topic extraction module which extracts a topic identifier from the unstructured text in the electronic document;
a fact extraction module which extracts a fact identifier from the unstructured text in the electronic document by comparing text string structures in the unstructured text to a fact template database, wherein the fact template database having stored therein a fact template model identifying keywords pertaining to specific fact identifiers and corresponding keyword weights; and
a fact relatedness estimator module which associates the entity identifier with the topic identifier and the fact identifier to determine a confidence score indicative of a degree of accuracy of extraction of the fact identifier, wherein the confidence score is based at least in part on a spatial distance between a part of the unstructured text in the electronic document from where the fact identifier was extracted and a part of the unstructured text from where at least one of the topic identifier or the entity identifier was extracted.
Dependent claims: 8, 9, 10, 11, 12
13. A non-transitory computer readable medium having stored thereon computer executable instructions instructive of a method comprising:
receiving, by an entity extraction computer, an electronic document having unstructured text, wherein the electronic document is a text file;
extracting, by the entity extraction computer, an entity identifier from the unstructured text in the electronic document;
extracting, by a topic extraction computer, a topic identifier from the unstructured text in the electronic document;
extracting, by a fact extraction computer, a fact identifier from the unstructured text in the electronic document by comparing text string structures in the unstructured text to a fact template database, the fact template database having stored therein a fact template model identifying keywords pertaining to specific fact identifiers and corresponding keyword weights; and
associating, by a fact relatedness estimator computer, the entity identifier with the topic identifier and the fact identifier to determine a confidence score indicative of a degree of accuracy of extraction of the fact identifier, wherein the confidence score is based at least in part on a spatial distance between a part of the unstructured text in the electronic document from where the fact identifier was extracted and a part of the unstructured text from where at least one of the topic identifier or the entity identifier was extracted.
Dependent claims: 14, 15, 16, 17
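The independent claims state only that the confidence score is based at least in part on a spatial distance between where the fact identifier was extracted and where a topic or entity identifier was extracted. One possible reading is sketched below. The function name `distance_confidence`, the use of character offsets, and the linear decay normalized by document length are all assumptions chosen for illustration; the claims do not specify a formula.

```python
def distance_confidence(fact_pos, anchor_positions, doc_length):
    """Hypothetical confidence in [0, 1] that falls as the fact moves away
    from the nearest entity or topic mention.

    fact_pos         -- character offset where the fact was extracted
    anchor_positions -- character offsets of entity/topic mentions
    doc_length       -- total number of characters in the document
    """
    if not anchor_positions or doc_length <= 0:
        return 0.0  # nothing to relate the fact to
    nearest = min(abs(fact_pos - pos) for pos in anchor_positions)
    # Linear decay is an assumption; any decreasing function of distance would fit.
    return max(0.0, 1.0 - nearest / doc_length)


if __name__ == "__main__":
    # A fact 10 characters from the nearest mention in a 1000-character document.
    print(distance_confidence(100, [90, 400], 1000))
```

In a fuller system this distance term would presumably be combined with other signals, such as the keyword-weight score from template matching, since the claims say the score is based only "at least in part" on distance.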
Specification