Automatic annotation for training and evaluation of semantic analysis engines
Abstract
Implementations include systems and methods that generate data for training or evaluating semantic analysis engines. For example, a method may include receiving documents from a corpus that includes an authoritative set of documents from an authoritative source. Each document in the authoritative set may be associated with an entity. A second set of documents from the corpus, which does not overlap with the authoritative set, may include at least one link to a document in the authoritative set, the at least one link being associated with anchor text. For each document in the second set, the method may include identifying entity mentions in the document based on the anchor text. The method may include associating the entity mention with the entity in a graph-structured knowledge base or associating entity types with the entity mention. The method may also include training a semantic analysis engine using the identified entity mentions and associations.
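As an illustration of the mention-identification step described in the abstract, the following is a minimal sketch, not the patented implementation. It assumes the second-set documents are HTML and that the authoritative source can be recognized by its host name. The host `authoritative.example`, the class `AnchorMentionParser`, the function `extract_mentions`, and the use of the URL path as the document identifier are all hypothetical choices made for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

# Hypothetical authoritative source; the source text does not name one.
AUTHORITATIVE_HOST = "authoritative.example"


class AnchorMentionParser(HTMLParser):
    """Collects (anchor_text, linked_document_id) pairs from one HTML document."""

    def __init__(self):
        super().__init__()
        self.mentions = []
        self._target = None  # identifier of the authoritative link currently open
        self._text = []      # text fragments seen inside that link

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        parsed = urlparse(href)
        if parsed.hostname == AUTHORITATIVE_HOST:
            # Assumption: the URL path identifies the authoritative document.
            self._target = parsed.path
            self._text = []

    def handle_data(self, data):
        if self._target is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._target is not None:
            # Collapse internal whitespace in the anchor text.
            anchor = " ".join("".join(self._text).split())
            if anchor:
                self.mentions.append((anchor, self._target))
            self._target = None


def extract_mentions(html):
    """Return every entity mention found in one second-set document."""
    parser = AnchorMentionParser()
    parser.feed(html)
    return parser.mentions
```

Links that point anywhere other than the assumed authoritative host are ignored, so only anchor text tied to an authoritative document becomes an entity mention.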
53 Citations
21 Claims
1. A computer system comprising:
- at least one processor; and
- memory storing instructions that, when executed by the at least one processor, causes the computer system to perform operations comprising:
  - receiving documents from a corpus, the corpus comprising:
    - an authoritative set of documents from an authoritative source, each document in the authoritative set being associated with an entity, and
    - a second set of documents, the second set being documents that are not in the authoritative set and that are not copies of documents in the authoritative set but that each include at least one link to a document in the authoritative set, the at least one link being associated with anchor text,
  - identifying, for each document in the second set, entity mentions in the document based on the anchor text, each entity mention including the anchor text and an identifier of the linked-to authoritative document,
  - associating the identified entity mentions with respective entity types based on content in the linked-to authoritative document, and
  - training an entity tagging engine using the identified entity mentions and the entity types associated with the entity mentions.
Dependent claims: 2, 3, 4, 5, 6, 7, 8
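One way to picture how typed entity mentions could become training data for an entity tagging engine is to convert them into token-level BIO labels. The sketch below is an assumption-laden illustration: the claim does not prescribe BIO tagging, and the `ENTITY_TYPES` table, the document identifiers, and the function `bio_tags` are hypothetical. In the claim the types come from content in the linked-to authoritative document; here they are simply looked up in a hand-written table.

```python
import re

# Hypothetical type lookup keyed by authoritative document identifier.
# How types are derived from document content is not specified here,
# so this table is an assumption made for illustration.
ENTITY_TYPES = {"/wiki/Paris": "LOCATION", "/wiki/Marie_Curie": "PERSON"}


def bio_tags(text, mentions):
    """Label whitespace-delimited tokens with BIO tags.

    mentions: list of (start, end, doc_id) character spans in text.
    Returns a list of (token, tag) pairs.
    """
    labeled = []
    for match in re.finditer(r"\S+", text):
        tag = "O"
        for start, end, doc_id in mentions:
            entity_type = ENTITY_TYPES.get(doc_id)
            # A token belongs to a mention if it starts inside the span.
            if entity_type and start <= match.start() < end:
                prefix = "B" if match.start() == start else "I"
                tag = f"{prefix}-{entity_type}"
                break
        labeled.append((match.group(), tag))
    return labeled
```

The resulting (token, tag) pairs are in the shape that many sequence-labeling trainers accept, which is why this representation was chosen for the sketch.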
9. A computer system comprising:
- at least one processor;
- memory storing a graph-structured knowledge base; and
- memory storing instructions that, when executed by at least one processor, causes the computer system to perform operations comprising:
  - accessing an authoritative set of documents from a corpus of documents, the authoritative set of documents being from an authoritative source, each document in the authoritative set being associated with a respective entity in the graph-structured knowledge base,
  - identifying a second set of documents from the corpus of documents, the second set being documents that are not in the authoritative set and that are not copies of documents in the authoritative set but that each include at least one hyperlink to a document in the authoritative set, the at least one hyperlink being associated with anchor text,
  - for each document in the second set:
    - identifying an entity mention in the document based on the anchor text, the entity mention including the anchor text and an identifier of the linked-to authoritative document, and
    - associating the entity mention with the entity in the graph-structured knowledge base associated with the linked-to authoritative document, and
  - training an entity matching engine using the identified entity mentions and associated entities.
Dependent claims: 10, 11, 12, 13
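To make the association between mentions and knowledge-base entities concrete, the sketch below aggregates (anchor text, document identifier) pairs into a table of how often each anchor text refers to each entity. This is only a simple frequency baseline offered as an assumption about what training data for an entity matching engine might look like; the claim does not describe this table. The `DOC_TO_ENTITY` mapping, the `kb:` identifiers, and both function names are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical mapping from authoritative document identifier to an entity
# identifier in the knowledge base (both invented for illustration).
DOC_TO_ENTITY = {
    "/wiki/Paris": "kb:paris_france",
    "/wiki/Paris,_Texas": "kb:paris_texas",
}


def build_alias_table(mentions):
    """Count how often each anchor text links to each knowledge-base entity.

    mentions: iterable of (anchor_text, doc_id) pairs.
    Returns {case-folded anchor text: Counter({entity_id: count})}.
    """
    table = defaultdict(Counter)
    for anchor, doc_id in mentions:
        entity_id = DOC_TO_ENTITY.get(doc_id)
        if entity_id is not None:  # skip links with no known entity
            table[anchor.casefold()][entity_id] += 1
    return dict(table)


def most_likely_entity(table, anchor):
    """Return the entity most often linked from this anchor text, or None."""
    counts = table.get(anchor.casefold())
    return counts.most_common(1)[0][0] if counts else None
```

A table like this captures ambiguity directly: the same anchor text can map to several entities, and the counts indicate which reading is most common in the second-set documents.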
14. A computer-implemented method comprising:
- obtaining, using at least one processor, a first document in a corpus of documents that has a link to an authoritative document, the link being associated with anchor text, the authoritative document being from an authoritative source and being associated with an entity, and the first document being from a source other than the authoritative source;
- determining, using the at least one processor, whether a majority of content of the first document matches content from one of the documents in the authoritative source;
- identifying at least one entity mention in the first document when it is determined that the majority of the content does not match content from one of the documents in the authoritative source, the entity mention including the anchor text, an identifier of the linked-to authoritative document, and a position of the mention within the content of the first document;
- storing the entity mentions in memory;
- repeating the obtaining, determining, identifying, and storing for other documents in the corpus; and
- evaluating a semantic analysis engine using the stored entity mentions and information associated with the documents in the authoritative source.
Dependent claims: 15, 16, 17, 18, 19, 20, 21
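The copy check and the evaluation step in the method above can be sketched as follows. Both functions are illustrative assumptions: the claim does not say how a "majority of content" match is measured, so `is_probable_copy` uses a bag-of-words overlap with a hypothetical 0.5 threshold, and `precision_recall` is one common way to score an engine against stored mentions, not necessarily the one the patent intends.

```python
from collections import Counter


def is_probable_copy(candidate, authoritative, threshold=0.5):
    """Approximate 'a majority of content matches' with token overlap.

    Returns True when more than `threshold` of the candidate's tokens
    (counted with multiplicity) also occur in the authoritative text.
    """
    cand = Counter(candidate.lower().split())
    auth = Counter(authoritative.lower().split())
    total = sum(cand.values())
    if total == 0:
        return False
    shared = sum((cand & auth).values())  # multiset intersection
    return shared / total > threshold


def precision_recall(predicted, gold):
    """Score predicted mentions against the stored (gold) mentions.

    Both arguments are sets of (doc_id, position, entity_id) tuples.
    """
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall
```

Documents flagged by `is_probable_copy` would be skipped before mention extraction, so that mirrored copies of authoritative pages do not contribute mentions to the evaluation set.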
Specification