Annotating entities using cross-document signals

US 9,465,865 B2
Filed: 08/16/2012
Issued: 10/11/2016
Est. Priority Date: 05/29/2012
Status: Expired due to Fees


Abstract

Techniques for annotating an entity in a document corpus using cross-document signals. A method includes determining which documents in a document corpus mention an entity of interest, clustering the documents that mention an entity of interest according to a temporal signal, a structural signal and/or a content signal, thereby forming at least one cluster of documents, and annotating at least one document in the at least one cluster of documents by marking each occurrence of the entity in the at least one document.
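As an illustration of the first and last steps in the abstract (finding the documents that mention an entity, then marking each occurrence), the following is a minimal sketch. It assumes plain string matching on the entity name; the function names `documents_mentioning` and `mark_occurrences`, the bracketed tag format, and the sample corpus are hypothetical and do not come from the patent.

```python
import re


def _entity_pattern(entity):
    # Whole-word, case-insensitive match on the entity's surface form (an assumption).
    return re.compile(r"\b" + re.escape(entity) + r"\b", re.IGNORECASE)


def documents_mentioning(corpus, entity):
    """Return the ids of documents whose text mentions the entity."""
    pattern = _entity_pattern(entity)
    return [doc_id for doc_id, text in corpus.items() if pattern.search(text)]


def mark_occurrences(text, entity, tag="ENTITY"):
    """Wrap every occurrence of the entity in a bracketed marker (hypothetical format)."""
    pattern = _entity_pattern(entity)
    return pattern.sub(lambda m: f"[{tag}:{m.group(0)}]", text)


# Hypothetical two-document corpus keyed by document id.
corpus = {
    "d1": "Apple released a new phone. Analysts praised Apple.",
    "d2": "The orchard sells pears.",
}

mentioning = documents_mentioning(corpus, "Apple")
annotated = {doc_id: mark_occurrences(corpus[doc_id], "Apple") for doc_id in mentioning}
```

The clustering step that sits between these two operations is not covered by this sketch.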

20 Citations

18 Claims

1. A method for annotating an entity in a document corpus using cross-document signals, the method comprising:
- determining which documents in a document corpus of multiple documents mention an entity of interest;
  
  clustering the documents that mention an entity of interest according to similarities across a temporal signal, a structural signal and a content signal, thereby forming multiple clusters of documents;
  
  annotating each document in the multiple clusters of documents with an annotation by marking each occurrence of the entity in each document;
  
  calculating a confidence measure for each occurrence of the entity in each document in each of the multiple clusters, wherein said confidence measure comprises the sum of (i) a measure of similarity between the given occurrence of the entity and the entity of interest, and (ii) a measure of similarity between the documents within the cluster of the given document via
  - 2. The method of claim 1, comprising marking an occurrence of the entity in at least one document as not applicable, in order to opt out of at least one of an incorrect and uncertain annotation.
  - 3. The method of claim 1, comprising applying a model to each cluster that provides weights to features of a signal to guide said annotating.
  - 4. The method of claim 1, wherein a temporal signal corresponds to a situation in which a document is from the same time epoch as another document.
  - 5. The method of claim 4, wherein the time epoch is measured via at least one granularity including at least one of minutes, hours, days, weeks, and months.
  - 6. The method of claim 1, wherein a structural signal corresponds to a situation in which a document is part of a larger document arrangement.
  - 7. The method of claim 1, wherein said determining comprises using a dictionary of entities.
  - 8. The method of claim 7, wherein the dictionary contains a description corresponding to each entity.
  - 9. The method of claim 1, comprising training a set of documents with labeled entities to determine at least one of a threshold for clustering documents and a document similarity weight.
  - 10. The method of claim 1, comprising annotating a single entity in a single document.
  - 11. The method of claim 10, comprising determining an entity-to-context match for a mention of the entity from the document in an entity database without considering a signal from other documents or other entities in the document.
  - 12. The method of claim 1, comprising annotating multiple entities in a single document.
  - 13. The method of claim 12, comprising determining an entity-to-context match for each mention of the multiple entities from the document in an entity database without considering a signal from other documents.
  - 14. The method of claim 1, comprising annotating a single entity in multiple documents.
  - 15. The method of claim 14, comprising determining at least one of an entity-to-context match and a document-to-document match for a mention of the entity from the document in an entity database, taking into account similarity and temporal proximity to other documents without considering a signal from other entities in the document.
  - 16. The method of claim 1, comprising annotating multiple entities in multiple documents.
  - 17. The method of claim 16, wherein said determining includes considering signals from multiple entities in each document as well as temporal and textual similarity to other documents in the corpus.
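One possible reading of claims 1, 4, 5 and 6 can be sketched as follows. The claim text shown here is truncated after "via", so the similarity functions below are assumptions rather than the patented method: the temporal signal is modeled as publication dates within a fixed window of days, the structural signal as a shared `thread` identifier for the larger document arrangement, and the content signal as Jaccard word overlap. The `Doc` type, the equal weighting, the greedy clustering and the threshold value are likewise hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Doc:
    doc_id: str
    text: str
    published: date
    thread: str  # hypothetical id of the larger document arrangement


def jaccard(a, b):
    """Content signal: word-set overlap between two texts (an assumed measure)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0


def same_epoch(a, b, days=7):
    """Temporal signal: both documents fall within the same window of days."""
    return abs((a.published - b.published).days) < days


def doc_similarity(a, b):
    """Equal-weight average of the temporal, structural and content signals."""
    temporal = 1.0 if same_epoch(a, b) else 0.0
    structural = 1.0 if a.thread == b.thread else 0.0
    return (temporal + structural + jaccard(a.text, b.text)) / 3.0


def cluster(docs, threshold=0.5):
    """Greedy clustering: join the first cluster holding a sufficiently similar document."""
    clusters = []
    for doc in docs:
        for group in clusters:
            if any(doc_similarity(doc, other) >= threshold for other in group):
                group.append(doc)
                break
        else:
            clusters.append([doc])
    return clusters


def confidence(mention_similarity, doc, group):
    """Sum of (i) mention-to-entity similarity and (ii) mean similarity to cluster peers."""
    peers = [d for d in group if d is not doc]
    cluster_similarity = (
        sum(doc_similarity(doc, d) for d in peers) / len(peers) if peers else 0.0
    )
    return mention_similarity + cluster_similarity


a = Doc("a", "apple launches new iphone", date(2012, 5, 1), "t1")
b = Doc("b", "apple iphone launch event", date(2012, 5, 3), "t1")
c = Doc("c", "apple pie recipe with cinnamon", date(2011, 1, 1), "t9")
clusters = cluster([a, b, c])
```

In this sample, documents `a` and `b` share a thread, fall within the same week and overlap in wording, so they land in one cluster, while `c` forms its own. A trained model of the kind mentioned in claims 3 and 9 would replace the fixed weights and threshold used here.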

18. A method for annotating an entity in a document corpus, the method comprising:
- processing each document in a corpus of multiple documents obtained from at least one online source to identify which documents mention an entity of interest;
  
  using a description of the entity derived from a database to generate at least one context feature for the entity;
  
  processing each of the documents that mention the entity of interest by comparing text from each document with the at least one context feature and grouping the documents with a comparison similarity above a pre-determined threshold into a cluster of documents;
  
  annotating each document in the cluster of documents with an annotation by marking each occurrence of the entity in each document;
  
  calculating a confidence measure for each occurrence of the entity in each document in the cluster of documents, wherein said confidence measure comprises the sum of (i) the comparison similarity between the given occurrence of the entity and the entity of interest, and (ii) a measure of similarity between the documents within the cluster of documents via
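The context-feature and threshold-grouping steps of claim 18 can be illustrated with a minimal sketch. It assumes that context features are the non-stopword terms of the entity's database description and that the comparison similarity is the fraction of those terms found in a document; the stopword list, function names, sample description and threshold are hypothetical and are not taken from the patent.

```python
import re

# Small illustrative stopword list (an assumption, not from the patent).
STOPWORDS = {"a", "an", "the", "of", "and", "in", "that", "is"}


def context_features(description):
    """Derive context features: lowercase terms of the description minus stopwords."""
    return {w for w in re.findall(r"[a-z]+", description.lower()) if w not in STOPWORDS}


def comparison_similarity(text, features):
    """Fraction of the context features that appear in the document text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return len(words & features) / len(features) if features else 0.0


def group_by_context(docs, features, threshold=0.3):
    """Group the documents whose comparison similarity meets the threshold."""
    return [
        doc_id
        for doc_id, text in docs.items()
        if comparison_similarity(text, features) >= threshold
    ]


# Hypothetical database description and documents that all mention the entity.
features = context_features("Technology company that designs phones and computers")
docs = {
    "d1": "Apple, the company, unveiled new phones and computers",
    "d2": "Apple orchards had a strong harvest",
}
cluster_ids = group_by_context(docs, features)
```

Here `d1` shares three of the five context features and clears the threshold, while `d2` shares none and is left out of the cluster.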

Current Assignee: International Business Machines Corporation
Original Assignee: International Business Machines Corporation
Inventors: De, Sushovan; Singh, Amit K.; Visweswariah, Karthik
Primary Examiner(s): Baderman, Scott
Assistant Examiner(s): Edwards, Jason T
Application Number: US13/587,011
Publication Number: US 20130325849A1
Time in Patent Office: 1,517 Days
Field of Search: 715/230, 715/737, 707/737
US Class Current: 1/1
CPC Class Codes: G06F 16/355 Class or cluster creation o...
