Phrase extraction using subphrase scoring
First Claim
1. A computer implemented method of extracting a set of phrases from a plurality of documents, the method comprising:
- for each document;
identifying a plurality of candidate phrases occurring within the document, wherein a candidate phrase includes two or more consecutive words that are determined to occur in the document, and
scoring candidate phrases in the document to produce respective document phrase scores for the candidate phrases for the document, the document phrase score for a candidate phrase being based on attributes of individual occurrences of the candidate phrase in the document, with at least some candidate phrases appearing repeatedly having a higher document phrase score than candidate phrases appearing once;
for a candidate phrase of the plurality of the candidate phrases;
creating, via a processor, a combined score for the candidate phrase based on a plurality of different document phrase scores for the candidate phrase for respective different documents; and
selecting the candidate phrase for inclusion in the extracted set based on the combined score for the candidate phrase and based on the document phrase scores for the candidate phrase.
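The steps of this claim can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names, the position-based occurrence weight, the use of a sum as the combined score, and both thresholds are assumptions chosen only to make the three stages (per-document scoring, combining across documents, selection) concrete.

```python
from collections import defaultdict


def candidate_phrases(words, max_len=3):
    """Yield (start, phrase) for every run of 2..max_len consecutive words."""
    for n in range(2, max_len + 1):
        for i in range(len(words) - n + 1):
            yield i, " ".join(words[i:i + n])


def document_phrase_scores(text, max_len=3):
    """Score each candidate phrase from its individual occurrences.

    Assumed occurrence attribute: position. An occurrence starting within
    the first 10 words adds 2.0, any later occurrence adds 1.0. Scores are
    summed over occurrences, so a repeated phrase outscores a phrase that
    appears once with the same weight.
    """
    words = text.lower().split()
    scores = defaultdict(float)
    for start, phrase in candidate_phrases(words, max_len):
        scores[phrase] += 2.0 if start < 10 else 1.0
    return dict(scores)


def extract_phrases(documents, min_combined=3.0, min_doc_score=2.0):
    """Select phrases using both the combined score and the per-document scores."""
    by_phrase = defaultdict(list)
    for doc in documents:
        for phrase, score in document_phrase_scores(doc).items():
            by_phrase[phrase].append(score)

    selected = set()
    for phrase, doc_scores in by_phrase.items():
        combined = sum(doc_scores)  # assumed combination rule: plain sum
        # Selection looks at the combined score and at the document scores.
        if combined >= min_combined and max(doc_scores) >= min_doc_score:
            selected.add(phrase)
    return selected


docs = [
    "information retrieval system",
    "an information retrieval engine",
    "retrieval of data",
]
selected = extract_phrases(docs)
```

With these three short documents, only a phrase that occurs in more than one document can reach the assumed combined threshold of 3.0.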
Abstract
An information retrieval system uses phrases to index, retrieve, organize and describe documents. Phrases are extracted from the document collection. Documents are then indexed according to their included phrases, using phrase posting lists. The phrase posting lists are stored in a cluster of index servers. The phrase posting lists can be tiered into groups, and sharded into partitions. Phrases in a query are identified based on possible phrasifications. A query schedule is created from the phrases, and then optimized to reduce query processing and communication costs. The execution of the query schedule is managed to further reduce or eliminate query processing operations at various ones of the index servers.
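The abstract does not define "phrasification". As a hedged illustration, the sketch below assumes it means one way of splitting a query into consecutive phrases, where each multi-word piece must be a known extracted phrase and single words are always allowed. The function name `phrasifications` and the `known_phrases` set are hypothetical.

```python
def phrasifications(words, known_phrases):
    """Return every split of `words` into consecutive known phrases or single words."""
    if not words:
        return [[]]
    results = []
    for end in range(1, len(words) + 1):
        head = " ".join(words[:end])
        # A single word is always a valid piece; longer pieces must be known phrases.
        if end == 1 or head in known_phrases:
            for rest in phrasifications(words[end:], known_phrases):
                results.append([head] + rest)
    return results


query = ["new", "york", "restaurants"]
options = phrasifications(query, {"new york"})
```

Here `options` holds two candidate phrasifications: the query as three single words, and the query with "new york" treated as one phrase. A system along the lines of the abstract would then choose among such candidates before building a query schedule.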
21 Claims
1. A computer implemented method of extracting a set of phrases from a plurality of documents, the method comprising:
- for each document;
identifying a plurality of candidate phrases occurring within the document, wherein a candidate phrase includes two or more consecutive words that are determined to occur in the document, and
scoring candidate phrases in the document to produce respective document phrase scores for the candidate phrases for the document, the document phrase score for a candidate phrase being based on attributes of individual occurrences of the candidate phrase in the document, with at least some candidate phrases appearing repeatedly having a higher document phrase score than candidate phrases appearing once;
for a candidate phrase of the plurality of the candidate phrases;
creating, via a processor, a combined score for the candidate phrase based on a plurality of different document phrase scores for the candidate phrase for respective different documents; and
selecting the candidate phrase for inclusion in the extracted set based on the combined score for the candidate phrase and based on the document phrase scores for the candidate phrase.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11
12. A system for extracting a set of phrases from a plurality of documents, the system comprising:
- one or more computer readable media comprising executable instructions; and
- one or more processors configured to execute the instructions, wherein execution of the instructions causes the system to;
for each document;
identify a plurality of candidate phrases occurring in the document, wherein a candidate phrase includes two or more consecutive words that are determined to occur in the document;
score candidate phrases in the document to produce respective document phrase scores for the candidate phrases for the document, the document phrase score for a candidate phrase being based on attributes of individual occurrences of the candidate phrase in the document, with at least some candidate phrases appearing repeatedly having a higher document phrase score than candidate phrases appearing once, and
for a candidate phrase of the plurality of the candidate phrases;
create, via the one or more processors, a combined score for the candidate phrase based on a plurality of different document phrase scores for the candidate phrase for respective different documents; and
selecting the candidate phrase for inclusion in the extracted set based on the combined score for the candidate phrase and based on the document phrase scores for the candidate phrase.
Dependent claims: 13, 14, 15, 16, 17, 18, 19, 20, 21
Specification