Scalable lookup-driven entity extraction from indexed document collections

US 8,782,061 B2
Filed: 06/24/2008
Issued: 07/15/2014
Est. Priority Date: 06/24/2008
Status: Active Grant

Abstract

A set of documents is filtered for entity extraction. A list of entity strings is received. A set of token sets that covers the entity strings in the list is determined. An inverted index generated on a first set of documents is queried using the set of token sets to determine a set of document identifiers for a subset of the documents in the first set. A second set of documents identified by the set of document identifiers is retrieved from the first set of documents. The second set of documents is filtered to include one or more documents of the second set that each includes a match with at least one entity string of the list of entity strings. Entity recognition may be performed on the filtered second set of documents.
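
The steps in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patented implementation: the in-memory index, the whitespace tokenizer, all function names, and in particular the rule used to pick the covering token sets (one single-token set per entity, using the token found in the fewest documents) are hypothetical choices made for the example. A token set is assumed to cover an entity string when every token in the set occurs in that string.

```python
from collections import defaultdict


def tokenize(text):
    """Lowercase whitespace tokenization (an assumption; no tokenizer is fixed by the source)."""
    return text.lower().split()


def build_inverted_index(docs):
    """Map each token to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def covering_token_sets(entities, index):
    """Pick one token set per entity: here, just the entity's rarest token.

    Illustrative rule only. Any document that mentions the entity must
    contain this token, so no true match is lost by querying for it.
    """
    cover = set()
    for entity in entities:
        tokens = tokenize(entity)
        rarest = min(tokens, key=lambda t: len(index.get(t, ())))
        cover.add(frozenset([rarest]))
    return cover


def query_index(index, token_sets):
    """Union, over token sets, of the documents containing every token of the set."""
    doc_ids = set()
    for token_set in token_sets:
        postings = [index.get(t, set()) for t in token_set]
        doc_ids |= set.intersection(*postings)
    return doc_ids


def contains_entity(text, entity):
    """Exact match: the entity's tokens appear as a contiguous run in the document."""
    doc, ent = tokenize(text), tokenize(entity)
    return any(doc[i:i + len(ent)] == ent for i in range(len(doc) - len(ent) + 1))


def filter_documents(docs, entities):
    """Index lookup first, then exact matching on the retrieved subset only."""
    index = build_inverted_index(docs)
    cover = covering_token_sets(entities, index)
    candidate_ids = query_index(index, cover)             # document identifiers
    candidates = {d: docs[d] for d in candidate_ids}      # second set of documents
    return {d: text for d, text in candidates.items()
            if any(contains_entity(text, e) for e in entities)}
```

The index query may return documents that contain a covering token but not the full entity string, which is why the final matching pass over the retrieved subset is still needed.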

20 Claims

1. A method for filtering a set of documents, comprising:
- receiving a list of entity strings;
  
  determining a set of token sets that covers the entity strings in the list, the number of tokens in the set of token sets being less than the number of words of the entity strings in the list of entity strings;
  
  querying an inverted index generated on a first set of documents using the set of token sets to determine a set of document identifiers for a subset of the documents in the first set;
  
  retrieving from the first set of documents a second set of documents, which is a subset of the first set of documents, identified by the set of document identifiers; and
  
  filtering the second set of documents to include one or more documents of the second set that each include a match with at least one entity string of the list of entity strings.
- - 2. The method of claim 1, further comprising:
    - performing entity recognition on the filtered second set of documents.
  - 3. The method of claim 1, wherein said querying comprises:
    - querying the inverted index with a plurality of batch queries, each batch query using a subset of the set of token sets to query the inverted index.
  - 4. The method of claim 3, wherein said determining a set of token sets that covers the entity strings in the list comprises:
    - selecting the subset of the set of token sets that minimizes a sum of a first cost associated with said querying and a second cost associated with said retrieving and said filtering.
  - 5. The method of claim 4, wherein said selecting comprises:
    - defining the first cost according to
  - 6. The method of claim 5, wherein said selecting further comprises:
    - minimizing the sum according to a greedy heuristic.
  - 7. The method of claim 6, wherein said minimizing comprises:
    - initializing the covering set of token sets;
      
      generating a set of candidate token sets;
      
      calculating an initial benefit for each candidate token set in the set of candidate token sets for inclusion in the covering set of token sets;
      
      including in the covering set of token sets a candidate token set in the set of candidate token sets having the greatest calculated initial benefit;
      
      updating any candidate token sets included in the covering set of token sets affected by said including; and
      
      iterating said including and updating.
  - 8. The method of claim 1, wherein said determining a set of token sets that covers the entity strings in the list comprises:
    - generating a set of signature strings for the entity strings in the list, and determining a set of token sets that cover the signature strings;
      
      wherein said querying comprises:
      
      querying the inverted index using the set of token sets that cover the signature strings to determine the set of document identifiers for a subset of the documents in the first set; and
      
      wherein said filtering comprises:
      
      filtering the second set of documents to include one or more documents of the second set that each include an approximate mention of at least one entity string of the list of entity strings.
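
The greedy selection recited in claims 4 to 7 can be sketched as a weighted set cover. This is only a sketch under stated assumptions: the cost formula of claim 5 is not reproduced in this listing, so the cost used here (a fixed per-query cost plus a per-document cost for every document a token set matches) and the benefit (entities newly covered per unit cost) are hypothetical stand-ins, as are the function and parameter names.

```python
def greedy_cover(entities, candidates, doc_counts, query_cost=1.0, doc_cost=1.0):
    """Greedily pick candidate token sets until every entity is covered.

    entities   : dict mapping entity string -> set of its tokens
    candidates : iterable of frozenset token sets
    doc_counts : dict mapping candidate -> number of documents it matches
    The cost model is an assumption made for illustration.
    """
    # A candidate covers an entity when all of its tokens occur in the entity.
    covers = {c: {e for e, toks in entities.items() if c <= toks} for c in candidates}
    cost = {c: query_cost + doc_cost * doc_counts[c] for c in covers}

    uncovered = set(entities)
    chosen = []  # the covering set of token sets, initially empty

    # Initial benefit of each candidate: entities covered per unit cost.
    benefit = {c: len(covers[c]) / cost[c] for c in covers}

    while uncovered:
        best = max(benefit, key=benefit.get)
        if benefit[best] == 0:
            raise ValueError("candidates cannot cover all entities")
        chosen.append(best)
        newly_covered = covers[best] & uncovered
        uncovered -= newly_covered
        # Update only the candidates affected by this inclusion.
        for c in benefit:
            if covers[c] & newly_covered:
                benefit[c] = len(covers[c] & uncovered) / cost[c]
    return chosen
```

In this sketch a multi-token set such as `{"canon", "eos"}` can win over several single-token sets when it covers more entities while matching few documents, which is the trade-off between query cost and retrieval cost that claim 4 describes.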

9. A system for filtering a set of documents, comprising:
- a computer processor; and
  
  storage coupled to the computer processor, the storage including computer code configured to be executed by the computer processor, the computer code comprising:
  
  a document identifier filter, that includes a covering token set determiner and an inverted index querier, wherein the covering token set determiner is configured to receive a list of entity strings and to determine a set of token sets that covers the entity strings in the list, the number of tokens in the set of token sets being less than the number of words of the entity strings in the list of entity strings, and the inverted index querier is configured to query an inverted index generated on a first set of documents using the set of token sets to determine a set of document identifiers for a subset of the documents in the first set;
  
  a document retriever configured to retrieve from the first set of documents a second set of documents, which is a subset of the first set of documents identified by the set of document identifiers; and
  
  an entity string matcher configured to filter the second set of documents to include one or more documents of the second set that each include a match with at least one entity string of the list of entity strings.
- - 10. The system of claim 9, the computer code further comprising:
    - an entity recognition module configured to perform entity recognition on the filtered second set of documents.
  - 11. The system of claim 9, wherein the inverted index querier is configured to query the inverted index with a plurality of batch queries, with each batch query using a subset of the set of token sets to query the inverted index.
  - 12. The system of claim 11, wherein the covering token set determiner is configured to select the subset of the set of token sets that minimizes a sum of a first cost associated with the document identifier filter performing the batch queries and a second cost associated with the document retriever retrieving the second set of documents and with the entity string matcher filtering the second set of documents.
  - 13. The system of claim 12, wherein the first cost is defined as
  - 14. The system of claim 13, wherein the covering token set determiner is configured to minimize the sum according to a greedy heuristic.
  - 15. The system of claim 14, wherein the covering token set determiner is configured to initialize the covering set of token sets, to generate a set of candidate token sets, to calculate an initial benefit for each candidate token set in the set of candidate token sets for inclusion in the covering set of token sets, to include in the covering set of token sets a candidate token set in the set of candidate token sets having the greatest calculated initial benefit, and to update any candidate token sets included in the covering set of token sets affected by inclusion of the candidate token set in the covering set of token sets.
  - 16. The system of claim 9, the computer code further comprising a signature generator configured to generate a set of signatures for the entity strings in the list.

17. A computer program product comprising a computer-readable device having computer program logic recorded thereon for enabling a processor-based system to filter a set of documents, the computer program product comprising:
- a first program logic that enables the processor-based system to receive a list of entity strings;
  
  a second program logic that enables the processor-based system to determine a set of token sets that covers the entity strings in the list, the number of tokens in the set of token sets being less than the number of words of the entity strings in the list of entity strings;
  
  a third program logic that enables the processor-based system to query an inverted index generated on a first set of documents using the set of token sets to determine a set of document identifiers for a subset of the documents in the first set;
  
  a fourth program logic that enables the processor-based system to retrieve from the first set of documents a second set of documents, which is a subset of the first set of documents, identified by the set of document identifiers; and
  
  a fifth program logic that enables the processor-based system to filter the second set of documents to include one or more documents of the second set that each include a match with at least one entity string of the list of entity strings.
- - 18. The computer program product of claim 17, further comprising:
    - a sixth program logic that enables the processor-based system to perform entity recognition on the filtered second set of documents.
  - 19. The computer program product of claim 17, wherein the third program logic module comprises:
    - logic for enabling the processor-based system to query the inverted index with a plurality of batch queries, each batch query using a subset of the set of token sets to query the inverted index.
  - 20. The computer program product of claim 17, wherein the second program logic module comprises:
    - logic for enabling the processor-based system to select the subset of the set of token sets that minimizes a sum of a first cost associated with querying the inverted index and a second cost associated with retrieving the second set of documents and filtering the second set of documents.
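
The batch querying recited in claims 3, 11, and 19 can be illustrated with a short sketch. Everything named here is an assumption: `run_query` is a hypothetical callable standing in for whatever index interface is available, the fixed batch size is arbitrary, and each batch is interpreted as a disjunction of conjunctions, that is (t1 AND t2) OR (t3) OR so on.

```python
from itertools import islice


def batches(items, size):
    """Yield successive lists of at most `size` items."""
    it = iter(items)
    while chunk := list(islice(it, size)):
        yield chunk


def batch_query(token_sets, run_query, batch_size=2):
    """Issue one query per batch of token sets and union the returned ids.

    run_query is a hypothetical callable: it receives a list of token sets
    and returns the set of ids of documents matching at least one of them.
    """
    doc_ids = set()
    for batch in batches(token_sets, batch_size):
        doc_ids |= run_query(batch)
    return doc_ids


def make_in_memory_runner(index):
    """Build a run_query over a plain dict index (token -> set of doc ids)."""
    def run_query(batch):
        result = set()
        for token_set in batch:
            postings = [index.get(t, set()) for t in token_set]
            result |= set.intersection(*postings)
        return result
    return run_query
```

Grouping token sets into batches keeps each individual query small while still letting the index evaluate several token sets per round trip. How the token sets are assigned to batches is left open here.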

Current Assignee: Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee: Microsoft Corporation
Inventors: Agrawal, Sanjay; Chakrabarti, Kaushik; Chaudhuri, Surajit; Ganti, Venkatesh
Primary Examiner(s): LU, KUEN S
Application Number: US12/144,675
Publication Number: US 20090319500A1
Time in Patent Office: 2,212 Days
Field of Search: 707/755, 707/802
US Class Current: 707/755
CPC Class Codes:
G06F 16/93 Document management systems
G06F 40/295 Named entity recognition
