Annotating entities using cross-document signals
Abstract
Techniques for annotating an entity in a document corpus using cross-document signals. A method includes determining which documents in a document corpus mention an entity of interest, clustering the documents that mention an entity of interest according to a temporal signal, a structural signal and/or a content signal, thereby forming at least one cluster of documents, and annotating at least one document in the at least one cluster of documents by marking each occurrence of the entity in the at least one document.
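The first and last steps of the abstract (finding the documents that mention an entity, then marking each occurrence) can be illustrated with a minimal sketch. The matching rule (case-insensitive, whole-word string match), the inline `<ENTITY>` tag, and the sample corpus are all assumptions made for illustration; the source does not specify how mentions are detected or how the annotation is encoded.

```python
import re


def _entity_pattern(entity):
    # Assumption: a mention is a case-insensitive, whole-word match of the name.
    return re.compile(r"\b" + re.escape(entity) + r"\b", re.IGNORECASE)


def find_mentions(text, entity):
    """Return the (start, end) character span of every occurrence of the entity."""
    return [m.span() for m in _entity_pattern(entity).finditer(text)]


def annotate(text, entity, tag="ENTITY"):
    """Mark each occurrence by wrapping it in a hypothetical inline tag."""
    return _entity_pattern(entity).sub(
        lambda m: f"<{tag}>{m.group(0)}</{tag}>", text
    )


# Hypothetical three-document corpus.
corpus = {
    "doc1": "Acme Corp opened a new plant. Analysts praised Acme Corp.",
    "doc2": "The weather was mild all week.",
    "doc3": "Shares of acme corp rose on Tuesday.",
}

# Step 1: determine which documents mention the entity of interest.
mentioning = {
    doc_id: text
    for doc_id, text in corpus.items()
    if find_mentions(text, "Acme Corp")
}

# Final step: annotate each mentioning document by marking every occurrence.
annotated = {
    doc_id: annotate(text, "Acme Corp") for doc_id, text in mentioning.items()
}
```

The clustering step that sits between these two is omitted here; a real system would likely also need alias and co-reference handling, which plain string matching does not provide.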
20 Citations
18 Claims
1. A method for annotating an entity in a document corpus using cross-document signals, the method comprising:
- determining which documents in a document corpus of multiple documents mention an entity of interest;
- clustering the documents that mention an entity of interest according to similarities across a temporal signal, a structural signal and a content signal, thereby forming multiple clusters of documents;
- annotating each document in the multiple clusters of documents with an annotation by marking each occurrence of the entity in each document;
- calculating a confidence measure for each occurrence of the entity in each document in each of the multiple clusters, wherein said confidence measure comprises the sum of (i) a measure of similarity between the given occurrence of the entity and the entity of interest, and (ii) a measure of similarity between the documents within the cluster of the given document via
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17
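The clustering and confidence steps of claim 1 can be sketched as follows. Everything concrete here is an assumption: the claim text is cut off before it states how the intra-cluster similarity is computed, so this sketch substitutes the mean pairwise similarity of the cluster. The `Doc` fields, the three per-signal similarity functions, the equal weighting, the greedy clustering rule, and the use of `difflib.SequenceMatcher` for the mention similarity are likewise illustrative choices, not the patented method.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from itertools import combinations


@dataclass(frozen=True)
class Doc:
    day: int            # temporal signal (hypothetical: publication day)
    source: str         # structural signal (hypothetical: site or section)
    tokens: frozenset   # content signal (hypothetical: bag of words)


def combined_similarity(a, b):
    """Assumed: unweighted mean of temporal, structural and content similarity."""
    temporal = 1.0 / (1.0 + abs(a.day - b.day))
    structural = 1.0 if a.source == b.source else 0.0
    union = a.tokens | b.tokens
    content = len(a.tokens & b.tokens) / len(union) if union else 0.0
    return (temporal + structural + content) / 3.0


def cluster(docs, threshold=0.5):
    """Greedy sketch: join the first cluster holding a sufficiently similar doc."""
    clusters = []
    for doc in docs:
        for members in clusters:
            if any(combined_similarity(doc, m) >= threshold for m in members):
                members.append(doc)
                break
        else:
            clusters.append([doc])
    return clusters


def cluster_similarity(members):
    """Assumed term (ii): mean pairwise similarity; 0.0 for a single document."""
    pairs = list(combinations(members, 2))
    if not pairs:
        return 0.0
    return sum(combined_similarity(a, b) for a, b in pairs) / len(pairs)


def confidence(mention, entity, members):
    """Sum of (i) mention-to-entity similarity and (ii) intra-cluster similarity."""
    mention_sim = SequenceMatcher(None, mention.lower(), entity.lower()).ratio()
    return mention_sim + cluster_similarity(members)


a = Doc(0, "news", frozenset({"acme", "plant", "opens"}))
b = Doc(1, "news", frozenset({"acme", "plant", "expansion"}))
c = Doc(30, "blog", frozenset({"acme", "cartoon", "coyote"}))
clusters = cluster([a, b, c])
score = confidence("Acme Corp", "Acme Corp", clusters[0])
```

Under these assumptions each term lies in [0, 1], so the summed confidence lies in [0, 2]; an exact-name mention inside a tightly related cluster scores near the top of that range.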
18. A method for annotating an entity in a document corpus, the method comprising:
- processing each document in a corpus of multiple documents obtained from at least one online source to identify which documents mention an entity of interest;
- using a description of the entity derived from a database to generate at least one context feature for the entity;
- processing each of the documents that mention the entity of interest by comparing text from each document with the at least one context feature and grouping the documents with a comparison similarity above a pre-determined threshold into a cluster of documents;
- annotating each document in the cluster of documents with an annotation by marking each occurrence of the entity in each document;
- calculating a confidence measure for each occurrence of the entity in each document in the cluster of documents, wherein said confidence measure comprises the sum of (i) the comparison similarity between the given occurrence of the entity and the entity of interest, and (ii) a measure of similarity between the documents within the cluster of documents via
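The context-feature steps of claim 18 can be sketched as below. The claim does not say what form a context feature takes or how the comparison similarity is measured, so this sketch assumes the features are the non-stopword tokens of the entity description and the similarity is the fraction of those features found in a document. The stopword list, sample description, sample documents, and threshold value are hypothetical.

```python
import re

# Hypothetical minimal stopword list; a real system would use a fuller one.
STOPWORDS = frozenset({"a", "an", "the", "of", "and", "in", "is", "by", "are"})


def tokenize(text):
    """Lowercase alphanumeric tokens with stopwords removed."""
    return {t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS}


def context_features(description):
    """Assumed: context features are the content words of the description."""
    return tokenize(description)


def comparison_similarity(doc_text, features):
    """Assumed: fraction of the context features that appear in the document."""
    if not features:
        return 0.0
    return len(tokenize(doc_text) & features) / len(features)


def group_by_context(docs, features, threshold):
    """Keep documents whose comparison similarity is strictly above the threshold."""
    grouped = {}
    for doc_id, text in docs.items():
        score = comparison_similarity(text, features)
        if score > threshold:
            grouped[doc_id] = score
    return grouped


# Hypothetical database description of the entity of interest.
features = context_features("Acme Corp is a manufacturer of industrial anvils")

# Hypothetical documents that already passed the entity-mention check.
docs = {
    "d1": "Acme Corp, the industrial manufacturer, shipped new anvils.",
    "d2": "Acme Corp sponsors a cartoon about a coyote.",
    "d3": "Acme anvils are sold by the industrial supplier.",
}
cluster_scores = group_by_context(docs, features, threshold=0.5)
```

Here `d2` mentions the name but shares little context with the description, so it falls below the threshold and is left out of the cluster; this is the disambiguation effect that the context comparison appears intended to provide.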
Specification