Phrase extraction using subphrase scoring
First Claim
1. A computer implemented method of extracting a set of valid phrases from a plurality of documents, the method comprising:
- for each document:
identifying a plurality of candidate phrases contained in the document, wherein a candidate phrase includes multiple consecutive words that appear in the document, wherein identifying a candidate phrase includes scanning through words of the document to identify the multiple consecutive words of the candidate phrase contained in the document;
scoring each candidate phrase in the document to produce a document phrase score for the candidate phrase for the document, the document phrase score being based on each instance of the candidate phrase that appears in the document, wherein scoring the candidate phrases in the document to produce the document phrase score comprises:
scoring a plurality of instances of the candidate phrase in the document to produce a plurality of instance phrase scores for the candidate phrase for the document, the instance phrase scores being based on a location of the instance of the candidate phrase within the document and being based on a position of the instance of the candidate phrase relative to a sequence of words containing the instance of the candidate phrase; and
combining the plurality of instance phrase scores of the candidate phrase in the document into the document phrase score;
for each candidate phrase:
creating, via a processor, a combined score for the candidate phrase based on a plurality of different document phrase scores for the candidate phrase for respective different documents; and
determining whether the candidate phrase is a valid phrase based on the combined score for the candidate phrase and based on the document phrase scores for the candidate phrase.
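The steps recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation: the claim does not say how instances are weighted, how scores are combined, or what thresholds apply, so every constant and function name below is a hypothetical choice made for illustration. "Location within the document" is approximated by the word offset of the instance, "position relative to a sequence of words" by whether the instance starts or ends its sentence, and both combination steps are assumed to be simple sums.

```python
from collections import defaultdict

# Hypothetical tuning constants; the claim does not specify any values.
MAX_PHRASE_LEN = 3       # longest candidate phrase, in words
EARLY_WEIGHT = 2.0       # boost for instances within the first 10 words
BOUNDARY_WEIGHT = 1.5    # boost for instances that start or end a sentence
MIN_COMBINED = 4.0       # threshold on the combined (cross-document) score
STRONG_DOC_SCORE = 2.0   # a document "supports" a phrase at or above this score
MIN_STRONG_DOCS = 2      # number of supporting documents required


def candidate_instances(sentences):
    """Scan the words and yield (phrase, word_offset, at_boundary) for every
    run of 2..MAX_PHRASE_LEN consecutive words inside a sentence."""
    offset = 0
    for words in sentences:
        for start in range(len(words)):
            for n in range(2, MAX_PHRASE_LEN + 1):
                end = start + n
                if end > len(words):
                    break
                at_boundary = start == 0 or end == len(words)
                yield " ".join(words[start:end]), offset + start, at_boundary
        offset += len(words)


def document_phrase_scores(text):
    """Score each instance of each candidate phrase, then combine the
    instance scores (here: by summing) into one document phrase score."""
    sentences = [s.lower().split() for s in text.split(".") if s.strip()]
    scores = defaultdict(float)
    for phrase, offset, at_boundary in candidate_instances(sentences):
        instance_score = 1.0
        if offset < 10:          # location of the instance within the document
            instance_score *= EARLY_WEIGHT
        if at_boundary:          # position relative to the containing sentence
            instance_score *= BOUNDARY_WEIGHT
        scores[phrase] += instance_score
    return dict(scores)


def extract_valid_phrases(documents):
    """Combine document phrase scores across documents and keep phrases that
    pass both the combined-score test and the per-document test."""
    per_phrase = defaultdict(list)
    for text in documents:
        for phrase, score in document_phrase_scores(text).items():
            per_phrase[phrase].append(score)
    valid = set()
    for phrase, doc_scores in per_phrase.items():
        combined = sum(doc_scores)
        strong_docs = sum(1 for s in doc_scores if s >= STRONG_DOC_SCORE)
        if combined >= MIN_COMBINED and strong_docs >= MIN_STRONG_DOCS:
            valid.add(phrase)
    return valid
```

For example, `extract_valid_phrases(["New York is large. I love New York.", "She moved to New York.", "Blue cheese is tasty."])` keeps only "new york", because it is the only candidate that scores well in more than one document. The final test uses both the combined score and the individual document scores, mirroring the last step of the claim.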
Abstract
An information retrieval system uses phrases to index, retrieve, organize and describe documents. Phrases are extracted from the document collection. Documents are then indexed according to their included phrases, using phrase posting lists. The phrase posting lists are stored in a cluster of index servers. The phrase posting lists can be tiered into groups, and sharded into partitions. Phrases in a query are identified based on possible phrasifications. A query schedule is created from the phrases, and then optimized to reduce query processing and communication costs. The execution of the query schedule is managed to further reduce or eliminate query processing operations at various ones of the index servers.
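Two ideas from the abstract, possible phrasifications of a query and sharded phrase posting lists, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the abstract does not say how phrasifications are enumerated or how shards are assigned, so the recursive enumeration, the `ShardedPhraseIndex` class, and the modulo-by-document-id partitioning are all hypothetical. Tiering and query scheduling are not modeled.

```python
from collections import defaultdict


def phrasifications(words, known_phrases, max_len=3):
    """Return every way to split `words` into single words and known
    multi-word phrases (each split is one possible phrasification)."""
    if not words:
        return [[]]
    results = []
    for n in range(1, min(max_len, len(words)) + 1):
        head = " ".join(words[:n])
        # A single word is always allowed; longer runs must be known phrases.
        if n == 1 or head in known_phrases:
            for rest in phrasifications(words[n:], known_phrases, max_len):
                results.append([head] + rest)
    return results


class ShardedPhraseIndex:
    """Phrase posting lists split into partitions (shards).
    Assumption: a document is assigned to shard doc_id % num_shards."""

    def __init__(self, num_shards):
        self.shards = [defaultdict(list) for _ in range(num_shards)]

    def add(self, doc_id, phrases):
        shard = self.shards[doc_id % len(self.shards)]
        for phrase in set(phrases):
            shard[phrase].append(doc_id)

    def lookup(self, phrase):
        # Query every shard and merge the partial posting lists.
        hits = []
        for shard in self.shards:
            hits.extend(shard.get(phrase, []))
        return sorted(hits)
```

With `known_phrases = {"new york"}`, the query words `["new", "york", "pizza"]` yield two phrasifications: three single words, or the phrase "new york" followed by "pizza". In a real cluster each shard would live on a separate index server, so `lookup` would fan out over the network rather than loop in-process.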
20 Claims
-
11. A computer readable medium, having stored thereon, computer program code that, when executed, causes a computer system to extract a set of valid phrases from a plurality of documents, by:
- for each document:
identifying a plurality of candidate phrases contained in the document, wherein a candidate phrase includes multiple consecutive words that appear in the document, wherein identifying a candidate phrase includes scanning through words of the document to identify the multiple consecutive words of the candidate phrase;
scoring each candidate phrase in the document to produce a document phrase score for the candidate phrase for the document, the document phrase score being based on each instance of the candidate phrase that appears in the document, wherein scoring the candidate phrases in the document to produce the document phrase score comprises:
scoring a plurality of instances of the candidate phrase in the document to produce a plurality of instance phrase scores for the candidate phrase for the document, the instance phrase scores being based on a location of the instance of the candidate phrase within the document and being based on a position of the instance of the candidate phrase relative to a sequence of words containing the instance of the candidate phrase; and
combining the plurality of instance phrase scores of the candidate phrase in the document into the document phrase score; and
for each candidate phrase:
creating, via a processor of the computer system, a combined score for the candidate phrase based on a plurality of different document phrase scores for the candidate phrase for respective different documents; and
determining whether the candidate phrase is a valid phrase based on the combined score for the candidate phrase and based on the document phrase scores for the candidate phrase.
Specification