Scalable lookup-driven entity extraction from indexed document collections

US 20090319500A1
Filed: 06/24/2008
Published: 12/24/2009
Est. Priority Date: 06/24/2008
Status: Active Grant

Abstract

A set of documents is filtered for entity extraction. A list of entity strings is received. A set of token sets that covers the entity strings in the list is determined. An inverted index generated on a first set of documents is queried using the set of token sets to determine a set of document identifiers for a subset of the documents in the first set. A second set of documents identified by the set of document identifiers is retrieved from the first set of documents. The second set of documents is filtered to include one or more documents of the second set that each includes a match with at least one entity string of the list of entity strings. Entity recognition may be performed on the filtered second set of documents.
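
The flow described in the abstract can be sketched in a few lines of Python. This is a minimal illustration and not the patented implementation. The whitespace tokenizer, the in-memory index, the helper names (`build_inverted_index`, `covering_token_sets`, `query_index`, `filter_documents`), and the naive cover that uses every token of each entity string are all assumptions made for the example.

```python
from collections import defaultdict

def tokenize(text):
    """Lowercase whitespace tokenizer (an assumption; the source does not fix one)."""
    return text.lower().split()

def build_inverted_index(docs):
    """Map each token to the set of document identifiers that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for tok in tokenize(text):
            index[tok].add(doc_id)
    return index

def covering_token_sets(entities):
    """Naive cover (assumption): one token set per entity string, holding all its tokens."""
    return {frozenset(tokenize(e)) for e in entities if tokenize(e)}

def query_index(index, token_sets):
    """Union, over token sets, of the documents that contain every token of the set."""
    doc_ids = set()
    for ts in token_sets:
        postings = [index.get(t, set()) for t in ts]
        doc_ids |= set.intersection(*postings)
    return doc_ids

def contains_mention(doc_tokens, entity_tokens):
    """True if the entity's tokens occur contiguously, in order, in the document."""
    n = len(entity_tokens)
    return any(doc_tokens[i:i + n] == entity_tokens
               for i in range(len(doc_tokens) - n + 1))

def filter_documents(docs, entities):
    """Query the index, retrieve the candidate documents, keep those with a real match."""
    index = build_inverted_index(docs)
    candidate_ids = query_index(index, covering_token_sets(entities))
    retrieved = {d: docs[d] for d in candidate_ids}          # the "second set"
    entity_tokens = [t for t in map(tokenize, entities) if t]
    return {d: text for d, text in retrieved.items()
            if any(contains_mention(tokenize(text), e) for e in entity_tokens)}

# Hypothetical toy data.
docs = {
    1: "the canon eos 350d camera",
    2: "eos 350d sold by canon",      # has every token, but no contiguous mention
    3: "nikon d80 review",
}
entities = ["canon eos 350d", "nikon d80"]
matches = filter_documents(docs, entities)
```

In this toy run the index query returns documents 1, 2, and 3 as candidates, and the final string-matching step drops document 2, which is why the last filtering step is still needed after the index lookup.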

21 Claims

1. A method for filtering a set of documents, comprising:
- receiving a list of entity strings;
  
  determining a set of token sets that covers the entity strings in the list;
  
  querying an inverted index generated on a first set of documents using the set of token sets to determine a set of document identifiers for a subset of the documents in the first set;
  
  retrieving from the first set of documents a second set of documents identified by the set of document identifiers; and
  
  filtering the second set of documents to include one or more documents of the second set that each include a match with at least one entity string of the list of entity strings.
- - 2. The method of claim 1, further comprising:
    - performing entity recognition on the filtered second set of documents.
  - 3. The method of claim 1, wherein said querying comprises:
    - querying the inverted index with a plurality of batch queries, each batch query using a subset of the set of token sets to query the inverted index.
  - 4. The method of claim 3, wherein said determining a set of token sets that covers the entity strings in the list comprises:
    - selecting the set of token sets that minimizes a sum of a first cost associated with said querying and a second cost associated with said retrieving and said filtering.
  - 5. The method of claim 4, wherein said selecting comprises:
    - defining the first cost according to $C_{idx} (\sum_{i = 1}^{K} \sum_{t \in Tokens (T_{i})} \langle D (t) \rangle) + C_{ini} \lceil \frac{K}{B} \rceil,$ wherein T_i = an ith token set of the set of token sets, Tokens(T_i) = a set of tokens in T_i, D(t) = a number of document identifiers determined for a token t, K = a number of entries in the set of tokens, B = a maximum number of allowable token sets for querying the inverted index, C_idx = a cost associated with each document identifier determined for each entry of the set of tokens during said querying, and C_ini = an initialization cost associated with each batch query; and
      
      defining the second cost according to $C_{doc} \sum_{i} \langle D (T_{i}) \rangle,$ wherein D(T_i) = a number of document identifiers determined for the ith token set, and C_doc = a cost for each document of the second set associated with said retrieving and said filtering.
  - 6. The method of claim 5, wherein said selecting further comprises:
    - minimizing the sum according to a greedy heuristic.
  - 7. The method of claim 6, wherein said minimizing comprises:
    - initializing the covering set of token sets;
      
      generating a set of candidate token sets;
      
      calculating an initial benefit for each candidate token set in the set of candidate token sets for inclusion in the covering set of token sets;
      
      including in the covering set of token sets a candidate token set in the set of candidate token sets having the greatest calculated initial benefit;
      
      updating any candidate token sets included in the covering set of token sets affected by said including; and
      
      iterating said including and updating.
  - 8. The method of claim 1, wherein said determining a set of token sets that covers the entity strings in the list comprises:
    - generating a set of signature strings for the entity strings in the list, and determining a set of token sets that cover the signature strings;
      
      wherein said querying comprises:
      
      querying the inverted index using the set of token sets that cover the signature strings to determine the set of document identifiers for a subset of the documents in the first set; and
      
      wherein said filtering comprises:
      
      filtering the second set of documents to include one or more documents of the second set that each include an approximate mention of at least one entity string of the list of entity strings.
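
The two cost terms recited in claim 5 can be evaluated directly once the document counts are known. The sketch below is only an illustration of the arithmetic. It reads K as the number of token sets in the cover (consistent with the sum over i = 1..K), and the function names, the document counts, and the constant values are all hypothetical.

```python
import math

def query_cost(token_sets, doc_freq, c_idx, c_ini, batch_size):
    """First cost: C_idx * (sum over token sets and their tokens of D(t))
    plus C_ini * ceil(K / B), where K is read here as the number of token sets."""
    k = len(token_sets)
    postings_read = sum(doc_freq[t] for ts in token_sets for t in ts)
    return c_idx * postings_read + c_ini * math.ceil(k / batch_size)

def fetch_cost(token_set_doc_counts, c_doc):
    """Second cost: C_doc * (sum over token sets of D(T_i))."""
    return c_doc * sum(token_set_doc_counts)

# Hypothetical per-token document counts D(t).
doc_freq = {"canon": 1000, "eos": 200, "nikon": 500}

# Cover A: a two-token set is more selective, so fewer documents are fetched.
cover_a = [{"canon", "eos"}, {"nikon"}]
cost_a = (query_cost(cover_a, doc_freq, c_idx=1, c_ini=100, batch_size=2)
          + fetch_cost([50, 500], c_doc=20))      # assumed D(T_i) = 50 and 500

# Cover B: single tokens read fewer postings but fetch many more documents.
cover_b = [{"canon"}, {"nikon"}]
cost_b = (query_cost(cover_b, doc_freq, c_idx=1, c_ini=100, batch_size=2)
          + fetch_cost([1000, 500], c_doc=20))    # assumed D(T_i) = 1000 and 500
```

With these made-up numbers cover A costs 1800 + 11000 = 12800 and cover B costs 1600 + 30000 = 31600, which shows the trade-off that claim 4 asks the selection to balance: a slightly cheaper index query can lead to a much more expensive retrieval and filtering step.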

9. A system for filtering a set of documents, comprising:
- a document identifier filter that includes a covering token set determiner and an inverted index querier, wherein the covering token set determiner is configured to receive a list of entity strings and to determine a set of token sets that covers the entity strings in the list, and the inverted index querier is configured to query an inverted index generated on a first set of documents using the set of token sets to determine a set of document identifiers for a subset of the documents in the first set;
  
  a document retriever configured to retrieve from the first set of documents a second set of documents identified by the set of document identifiers; and
  
  an entity string matcher configured to filter the second set of documents to include one or more documents of the second set that each include a match with at least one entity string of the list of entity strings.
- - 10. The system of claim 9, further comprising:
    - an entity recognition module configured to perform entity recognition on the filtered second set of documents.
  - 11. The system of claim 9, wherein the inverted index querier is configured to query the inverted index with a plurality of batch queries, with each batch query using a subset of the set of token sets to query the inverted index.
  - 12. The system of claim 11, wherein the covering token set determiner is configured to select the set of token sets that minimizes a sum of a first cost associated with the document identifier filter performing the batch queries and a second cost associated with the document retriever retrieving the second set of documents and with the entity string matcher filtering the second set of documents.
  - 13. The system of claim 12, wherein the first cost is defined as $C_{idx} (\sum_{i = 1}^{K} \sum_{t \in Tokens (T_{i})} \langle D (t) \rangle) + C_{ini} \lceil \frac{K}{B} \rceil,$ wherein T_i = an ith token set of the set of token sets, Tokens(T_i) = a set of tokens in T_i, D(t) = a number of document identifiers determined for a token t, K = a number of entries in the set of tokens, B = a maximum number of allowable token sets for querying the inverted index, C_idx = a cost associated with each document identifier determined for each entry of the set of tokens during said querying, and C_ini = an initialization cost associated with each batch query; and
    
    wherein the second cost is defined as $C_{doc} \sum_{i} \langle D (T_{i}) \rangle,$ wherein D(T_i) = a number of document identifiers determined for the ith token set, and C_doc = a cost for each document of the second set associated with said retrieving and said filtering.
  - 14. The system of claim 13, wherein the covering token set determiner is configured to minimize the sum according to a greedy heuristic.
  - 15. The system of claim 14, wherein the covering token set determiner is configured to initialize the covering set of token sets, to generate a set of candidate token sets, to calculate an initial benefit for each candidate token set in the set of candidate token sets for inclusion in the covering set of token sets, to include in the covering set of token sets a candidate token set in the set of candidate token sets having the greatest calculated initial benefit, and to update any candidate token sets included in the covering set of token sets affected by inclusion of the candidate token set in the covering set of token sets.
  - 16. The system of claim 9, further comprising a signature generator configured to generate a set of signatures for the entity strings in the list.
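
The greedy selection outlined in claims 7 and 15 (initialize the cover, generate candidates, score each candidate, add the best one, update, repeat) resembles the standard greedy heuristic for weighted set cover. The sketch below follows that standard heuristic and is not taken from the patent. The claims do not define "benefit", so the code assumes benefit = newly covered entity strings divided by the candidate's cost. It also assumes that a token set covers an entity string when it is a subset of that string's tokens, that candidates are token subsets of size at most 2, and that the caller supplies a positive, hypothetical `cost_of` function.

```python
from itertools import combinations

def generate_candidates(entity_tokens, max_size=2):
    """Candidate token sets: all token subsets of each entity up to max_size (assumption)."""
    cands = set()
    for toks in entity_tokens:
        for n in range(1, min(max_size, len(toks)) + 1):
            cands.update(frozenset(c) for c in combinations(sorted(toks), n))
    # Sort for a deterministic iteration order, so ties break the same way every run.
    return sorted(cands, key=lambda c: (len(c), sorted(c)))

def greedy_cover(entities, cost_of, max_size=2):
    """Repeatedly add the candidate with the greatest benefit until all entities are covered."""
    entity_tokens = [frozenset(e.lower().split()) for e in entities]
    uncovered = {i for i, toks in enumerate(entity_tokens) if toks}
    candidates = generate_candidates(entity_tokens, max_size)
    # Which entity strings each candidate covers (candidate is a subset of the entity's tokens).
    covers = {c: {i for i, toks in enumerate(entity_tokens) if c <= toks}
              for c in candidates}
    cover = []                                   # initialize the covering set
    while uncovered:
        # Benefits are re-evaluated against what is still uncovered; this plays the
        # role of the "updating" step, done by recomputation for simplicity.
        best = max(candidates, key=lambda c: len(covers[c] & uncovered) / cost_of(c))
        cover.append(best)
        uncovered -= covers[best]
    return cover

# Hypothetical document counts and cost constants.
doc_freq = {"canon": 1000, "eos": 20, "350d": 30, "400d": 40, "nikon": 500, "d80": 25}
C_IDX, C_DOC = 1, 20

def cost_of(token_set):
    """Rough cost estimate (assumption): postings read plus documents fetched,
    bounding the fetched documents by the rarest token's count."""
    counts = [doc_freq[t] for t in token_set]
    return C_IDX * sum(counts) + C_DOC * min(counts)

entities = ["canon eos 350d", "canon eos 400d", "nikon d80"]
chosen = greedy_cover(entities, cost_of)
```

On this toy input the heuristic first picks the token set {eos}, which is cheap and covers two entity strings at once, and then {d80} for the remaining one. That is the effect claim 20 describes: tokens shared by several entity strings reduce the size of the covering set.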

17. A method for ad-hoc entity extraction, comprising:
- filtering a first set of documents to generate a second set of documents that includes documents of the first set having a match with at least one entity string in a list of entity strings; and
  
  performing entity recognition on the second set of documents.
- - 18. The method of claim 17, wherein said filtering comprises:
    - querying an inverted index generated on the first set of documents using a set of covering token sets to determine a set of document identifiers for a subset of the documents in the first set;
      
      retrieving a subset of documents from the first set identified by the set of document identifiers; and
      
      performing entity string matching on the retrieved subset of documents.
  - 19. The method of claim 18, further comprising:
    - performing said querying and said entity string matching in a manner that balances a first cost of said querying and a second cost of said entity string matching to reduce a sum of the first and second costs.
  - 20. The method of claim 18, further comprising:
    - generating the set of covering tokens to include at least some tokens that are included in a plurality of the entity strings to reduce a total number of tokens included in the set of covering tokens.
  - 21. The method of claim 17, wherein said filtering comprises:
    - filtering the first set of documents to generate the second set of documents to include documents of the first set having an approximate mention of at least one entity string in the list of entity strings.

Current Assignee
Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee
Microsoft Corporation
Inventors
Chaudhuri, Surajit; Ganti, Venkatesh; Agrawal, Sanjay; Chakrabarti, Kaushik

Granted Patent

US 8,782,061 B2
US Class Current

1/1
CPC Class Codes

G06F 16/93 Document management systems

G06F 40/295 Named entity recognition
