Scalable lookup-driven entity extraction from indexed document collections
Abstract
A set of documents is filtered for entity extraction. A list of entity strings is received. A set of token sets that covers the entity strings in the list is determined. An inverted index generated on a first set of documents is queried using the set of token sets to determine a set of document identifiers for a subset of the documents in the first set. A second set of documents identified by the set of document identifiers is retrieved from the first set of documents. The second set of documents is filtered to include one or more documents of the second set that each includes a match with at least one entity string of the list of entity strings. Entity recognition may be performed on the filtered second set of documents.
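The steps in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patented implementation. It assumes that a token set "covers" an entity string when every token in the set occurs in that string, so any document containing the string must contain all tokens of the set. The cover is chosen with a simple rarest-token heuristic, which stands in for whatever cover-selection procedure the specification describes. All function names and the sample data are hypothetical.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase word tokenizer (an assumption; the source does not fix one)."""
    return re.findall(r"\w+", text.lower())

def build_inverted_index(docs):
    """Map each token to the set of ids of documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for tok in set(tokenize(text)):
            index[tok].add(doc_id)
    return index

def choose_cover(entities, index):
    """Pick one single-token set per entity string: its rarest token.

    Simplifying heuristic for illustration only. Ties are broken
    alphabetically so the result is deterministic.
    """
    cover = set()
    for entity in entities:
        toks = tokenize(entity)
        rarest = min(toks, key=lambda t: (len(index.get(t, ())), t))
        cover.add(frozenset([rarest]))
    return cover

def candidate_ids(cover, index):
    """Query the index: union over token sets of the intersection of postings."""
    ids = set()
    for token_set in cover:
        postings = [index.get(t, set()) for t in token_set]
        ids |= set.intersection(*postings)
    return ids

def contains_entity(text, entities):
    """True if the document contains some entity string as a contiguous token run."""
    doc_toks = tokenize(text)
    for entity in entities:
        e = tokenize(entity)
        n = len(e)
        if any(doc_toks[i:i + n] == e for i in range(len(doc_toks) - n + 1)):
            return True
    return False

def filter_documents(docs, entities, index):
    """Retrieve candidate documents via the index, then keep only true matches."""
    ids = candidate_ids(choose_cover(entities, index), index)
    return {d: docs[d] for d in sorted(ids) if contains_entity(docs[d], entities)}

# Hypothetical sample data.
docs = {
    1: "The new Canon EOS 5D camera ships today",
    2: "Canon announced quarterly results",
    3: "Nikon D300 review and sample images",
    4: "Weather report for Seattle",
    5: "EOS firmware notes mention the 5D",
}
entities = ["Canon EOS 5D", "Nikon D300"]

index = build_inverted_index(docs)
cover = choose_cover(entities, index)
filtered = filter_documents(docs, entities, index)
```

In this sample the cover holds two tokens, fewer than the five words in the entity list. The index query returns documents 1, 3, and 5; the match check then drops document 5, which contains a cover token but no full entity string. Entity recognition would run only on the documents that remain.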
27 Citations
20 Claims
1. A method for ad-hoc entity extraction, comprising:
- filtering a first set of documents to generate a second set of documents that includes documents of the first set based at least on a set of token sets that covers entity strings in a list of entity strings, the number of tokens in the set of token sets being less than the number of words of the entity strings in the list of entity strings, the set of tokens generated based on the entity strings in the list of the entity strings; and
- performing entity recognition on the second set of documents.
(Dependent claims: 2, 3, 4, 5, 6, 7, 8)
9. A system for ad-hoc entity extraction, comprising:
- one or more processors;
- a document filter configured to filter a first set of documents to generate a second set of documents that includes documents of the first set based at least on a set of token sets that covers entity strings in a list of entity strings, the number of tokens in the set of token sets being less than the number of words of the entity strings in the list of entity strings, the set of tokens generated based on the entity strings in the list of the entity strings; and
- an entity recognition module configured to perform entity recognition on the second set of documents.
(Dependent claims: 10, 11, 12, 13, 14)
15. A computer program product comprising a computer-readable storage memory having computer program logic recorded thereon for enabling a processor-based system to perform ad-hoc entity extraction according to a method that comprises:
- filtering a first set of documents to generate a second set of documents that includes documents of the first set based at least on a set of token sets that covers entity strings in a list of entity strings, the number of tokens in the set of token sets being less than the number of words of the entity strings in the list of entity strings, the set of tokens generated based on the entity strings in the list of the entity strings; and
- performing entity recognition on the second set of documents.
(Dependent claims: 16, 17, 18, 19, 20)
Specification