Scalable lookup-driven entity extraction from indexed document collections
Abstract
A set of documents is filtered for entity extraction. A list of entity strings is received. A set of token sets that covers the entity strings in the list is determined. An inverted index generated on a first set of documents is queried using the set of token sets to determine a set of document identifiers for a subset of the documents in the first set. A second set of documents identified by the set of document identifiers is retrieved from the first set of documents. The second set of documents is filtered to include one or more documents of the second set that each includes a match with at least one entity string of the list of entity strings. Entity recognition may be performed on the filtered second set of documents.
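The flow described in the abstract can be sketched in a few lines of Python. This is only an illustrative sketch, not the patented implementation: the function names (`build_inverted_index`, `candidate_doc_ids`, `filter_documents`), the whitespace tokenizer, the toy documents, and the hand-picked token sets are all assumptions. It also assumes that a set of token sets "covers" the entity list when every entity string contains all tokens of at least one token set.

```python
from collections import defaultdict

def tokenize(text):
    # Assumption: lowercase whitespace tokenization stands in for a real tokenizer.
    return text.lower().split()

def build_inverted_index(docs):
    # Map each token to the set of document identifiers that contain it.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for tok in set(tokenize(text)):
            index[tok].add(doc_id)
    return index

def candidate_doc_ids(index, token_sets):
    # A document is a candidate if it contains every token of at least one token set.
    ids = set()
    for token_set in token_sets:
        postings = [index.get(tok, set()) for tok in token_set]
        if postings:  # skip empty token sets, which would otherwise match everything
            ids |= set.intersection(*postings)
    return ids

def contains_entity(text, entity):
    # Exact match of the entity's tokens as a contiguous run in the document.
    doc_toks, ent_toks = tokenize(text), tokenize(entity)
    n = len(ent_toks)
    return any(doc_toks[i:i + n] == ent_toks for i in range(len(doc_toks) - n + 1))

def filter_documents(docs, index, entities, token_sets):
    ids = candidate_doc_ids(index, token_sets)      # query the inverted index
    second_set = {i: docs[i] for i in ids}          # retrieve the identified documents
    return {i: t for i, t in second_set.items()     # keep documents with a real match
            if any(contains_entity(t, e) for e in entities)}

# Hypothetical toy data.
docs = {
    1: "the canon eos 350d camera review",
    2: "nikon d40 digital camera",
    3: "canon printer ink",
    4: "weather report for tuesday",
    5: "eos token price news",
}
entities = ["canon eos 350d", "canon eos 400d", "nikon d40"]
token_sets = [{"eos"}, {"d40"}]  # 2 tokens cover entity strings totalling 8 words

index = build_inverted_index(docs)
candidates = candidate_doc_ids(index, token_sets)
matches = filter_documents(docs, index, entities, token_sets)
print(sorted(candidates), sorted(matches))
```

In this toy run the index query narrows five documents to three candidates, and the final filter drops document 5, which contains the token "eos" but no entity string. Only the candidates are ever retrieved and scanned, which is the point of querying the index first.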
20 Claims
1. A method for filtering a set of documents, comprising:
- receiving a list of entity strings;
- determining a set of token sets that covers the entity strings in the list, the number of tokens in the set of token sets being less than the number of words of the entity strings in the list of entity strings;
- querying an inverted index generated on a first set of documents using the set of token sets to determine a set of document identifiers for a subset of the documents in the first set;
- retrieving from the first set of documents a second set of documents, which is a subset of the first set of documents, identified by the set of document identifiers; and
- filtering the second set of documents to include one or more documents of the second set that each include a match with at least one entity string of the list of entity strings.
Dependent claims: 2, 3, 4, 5, 6, 7, 8
9. A system for filtering a set of documents, comprising:
- a computer processor; and
- storage coupled to the computer processor, the storage including computer code configured to be executed by the computer processor, the computer code comprising:
  - a document identifier filter, that includes a covering token set determiner and an inverted index querier, wherein the covering token set determiner is configured to receive a list of entity strings and to determine a set of token sets that covers the entity strings in the list, the number of tokens in the set of token sets being less than the number of words of the entity strings in the list of entity strings, and the inverted index querier is configured to query an inverted index generated on a first set of documents using the set of token sets to determine a set of document identifiers for a subset of the documents in the first set;
  - a document retriever configured to retrieve from the first set of documents a second set of documents, which is a subset of the first set of documents identified by the set of document identifiers; and
  - an entity string matcher configured to filter the second set of documents to include one or more documents of the second set that each include a match with at least one entity string of the list of entity strings.
Dependent claims: 10, 11, 12, 13, 14, 15, 16
17. A computer program product comprising a computer-readable device having computer program logic recorded thereon for enabling a processor-based system to filter a set of documents, the computer program product comprising:
- a first program logic that enables the processor-based system to receive a list of entity strings;
- a second program logic that enables the processor-based system to determine a set of token sets that covers the entity strings in the list, the number of tokens in the set of token sets being less than the number of words of the entity strings in the list of entity strings;
- a third program logic that enables the processor-based system to query an inverted index generated on a first set of documents using the set of token sets to determine a set of document identifiers for a subset of the documents in the first set;
- a fourth program logic that enables the processor-based system to retrieve from the first set of documents a second set of documents, which is a subset of the first set of documents, identified by the set of document identifiers; and
- a fifth program logic that enables the processor-based system to filter the second set of documents to include one or more documents of the second set that each include a match with at least one entity string of the list of entity strings.
Dependent claims: 18, 19, 20
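Each independent claim requires that the set of token sets contain fewer tokens than the entity strings contain words. One way such a cover could be found is sketched below. The claims shown here do not say how the cover is determined, so this uses a plain greedy set-cover heuristic restricted to single-token sets, purely as an assumption; the function name `greedy_token_cover` and the sample entities are hypothetical. "Covers" is again assumed to mean that every entity string contains all tokens of at least one token set.

```python
def greedy_token_cover(entities):
    # Assumption: lowercase whitespace tokenization.
    token_lists = [e.lower().split() for e in entities]
    # Entities with no tokens cannot be covered, so they are ignored.
    uncovered = {i for i, toks in enumerate(token_lists) if toks}
    cover = []
    while uncovered:
        # Count how many still-uncovered entities contain each token.
        counts = {}
        for i in uncovered:
            for tok in set(token_lists[i]):
                counts[tok] = counts.get(tok, 0) + 1
        # Pick the token covering the most entities; ties go to the
        # alphabetically first token so the result is deterministic.
        best = max(sorted(counts), key=counts.get)
        cover.append(frozenset([best]))
        uncovered = {i for i in uncovered if best not in token_lists[i]}
    return cover

# Hypothetical entity list.
entities = ["canon eos 350d", "canon eos 400d", "nikon d40"]
cover = greedy_token_cover(entities)
tokens_in_cover = sum(len(ts) for ts in cover)
words_in_entities = sum(len(e.split()) for e in entities)
print(cover, tokens_in_cover, words_in_entities)
```

Here two single-token sets cover three entity strings totalling eight words, so the size condition of the claims holds. The heuristic only minimizes the number of tokens; it takes no account of how often a token occurs in the documents, so a practical cover would presumably also need to prefer tokens that keep the candidate set small.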
Specification