Phrase extraction using subphrase scoring
First Claim
1. A computer implemented method of extracting a set of valid phrases from a plurality of documents, the method comprising:
- for each document:
  - identifying a plurality of candidate phrases contained in the document, wherein a candidate phrase includes multiple consecutive words that appear in the document;
  - scoring candidate phrases in the document to produce document phrase scores for the candidate phrases for the document, the document phrase scores for a candidate phrase being based on instances of the candidate phrase that appear in the document, wherein scoring a candidate phrase in the document to produce a document phrase score includes:
    - determining for the candidate phrase two or more subphrases within the candidate phrase, wherein a subphrase contains two or more words; and
    - scoring each determined subphrase in the document as a function of the position of the subphrase relative to a sequence of words containing the candidate phrase;
- for at least one of the candidate phrases:
  - creating, via a processor, a combined score for the candidate phrase based on a plurality of different document phrase scores for the candidate phrase for respective different documents; and
  - determining whether the candidate phrase is a valid phrase based on the combined score for the candidate phrase and based on the document phrase scores for the candidate phrase.
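The steps recited in the claim can be illustrated with a minimal Python sketch. It is not the patented implementation: the position weight `1 / (1 + offset)`, the thresholds `min_combined`, `min_doc_score` and `min_docs`, the whitespace tokenization, and every function name are hypothetical choices made only for illustration. Candidates are limited to three or four words here so that each candidate has at least two subphrases of two or more words.

```python
from collections import defaultdict


def candidate_phrases(words, min_len=3, max_len=4):
    """Yield every run of min_len..max_len consecutive words as a tuple."""
    for n in range(min_len, max_len + 1):
        for start in range(len(words) - n + 1):
            yield tuple(words[start:start + n])


def subphrases(phrase):
    """Yield (offset, subphrase) for each contiguous run of 2+ words."""
    for n in range(2, len(phrase) + 1):
        for offset in range(len(phrase) - n + 1):
            yield offset, phrase[offset:offset + n]


def position_weight(offset):
    """Assumed weighting: subphrases nearer the start of the phrase count more."""
    return 1.0 / (1 + offset)


def document_phrase_scores(words):
    """Score each candidate from the positions of its subphrases, per instance."""
    scores = defaultdict(float)
    for phrase in candidate_phrases(words):
        for offset, _sub in subphrases(phrase):
            scores[phrase] += position_weight(offset)
    return dict(scores)


def combined_scores(per_document):
    """Sum the document phrase scores of each candidate across documents."""
    combined = defaultdict(float)
    for doc_scores in per_document:
        for phrase, score in doc_scores.items():
            combined[phrase] += score
    return dict(combined)


def valid_phrases(documents, min_combined=5.0, min_doc_score=2.0, min_docs=2):
    """Keep candidates that pass both the combined and the per-document test."""
    per_document = [document_phrase_scores(d.lower().split()) for d in documents]
    valid = set()
    for phrase, total in combined_scores(per_document).items():
        supporting = sum(
            1 for d in per_document if d.get(phrase, 0.0) >= min_doc_score
        )
        if total >= min_combined and supporting >= min_docs:
            valid.add(" ".join(phrase))
    return valid


docs = ["new york city is large", "i love new york city", "the weather is nice"]
print(valid_phrases(docs))
```

In this toy corpus only "new york city" occurs in two documents, so it is the only candidate that satisfies both the combined-score test and the per-document test. A real system would use different weights and thresholds.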
Abstract
An information retrieval system uses phrases to index, retrieve, organize and describe documents. Phrases are extracted from the document collection. Documents are then indexed according to their included phrases, using phrase posting lists. The phrase posting lists are stored in a cluster of index servers. The phrase posting lists can be tiered into groups, and sharded into partitions. Phrases in a query are identified based on possible phrasifications. A query schedule is created from the phrases, and then optimized to reduce query processing and communication costs. The execution of the query schedule is managed to further reduce or eliminate query processing operations at various ones of the index servers.
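As a rough illustration of what "possible phrasifications" of a query might look like, the sketch below enumerates every way to split a query into consecutive segments. It assumes that a phrasification is a partition of the query words in which single words are always allowed and multi-word segments must belong to a known phrase set. The abstract does not define the term, so the function name `phrasifications` and the `known_phrases` set are hypothetical.

```python
def phrasifications(words, known_phrases):
    """Return every split of words into consecutive segments.

    Single words are always allowed; a multi-word segment is allowed
    only if it appears in known_phrases (an assumed phrase inventory).
    """
    if not words:
        return [[]]
    results = []
    for end in range(1, len(words) + 1):
        head = " ".join(words[:end])
        if end == 1 or head in known_phrases:
            for rest in phrasifications(words[end:], known_phrases):
                results.append([head] + rest)
    return results


query = "new york restaurants".split()
known = {"new york", "york restaurants"}
for split in phrasifications(query, known):
    print(split)
```

Each resulting split is one candidate interpretation of the query. Choosing among the candidates and building a query schedule from the chosen one is outside the scope of this sketch.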
26 Claims
1. A computer implemented method of extracting a set of valid phrases from a plurality of documents, the method comprising:
- for each document:
  - identifying a plurality of candidate phrases contained in the document, wherein a candidate phrase includes multiple consecutive words that appear in the document;
  - scoring candidate phrases in the document to produce document phrase scores for the candidate phrases for the document, the document phrase scores for a candidate phrase being based on instances of the candidate phrase that appear in the document, wherein scoring a candidate phrase in the document to produce a document phrase score includes:
    - determining for the candidate phrase two or more subphrases within the candidate phrase, wherein a subphrase contains two or more words; and
    - scoring each determined subphrase in the document as a function of the position of the subphrase relative to a sequence of words containing the candidate phrase;
- for at least one of the candidate phrases:
  - creating, via a processor, a combined score for the candidate phrase based on a plurality of different document phrase scores for the candidate phrase for respective different documents; and
  - determining whether the candidate phrase is a valid phrase based on the combined score for the candidate phrase and based on the document phrase scores for the candidate phrase.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 15
10. A non-transitory computer readable medium, having stored thereon, computer program code that, when executed, causes a computer system to extract a set of valid phrases from a plurality of documents, by:
- for each document:
  - identifying a plurality of candidate phrases contained in the document, wherein a candidate phrase includes multiple consecutive words that appear in the document;
  - scoring candidate phrases in the document to produce document phrase scores for the candidate phrases for the document, the document phrase scores for a candidate phrase being based on instances of the candidate phrase that appear in the document, wherein scoring a candidate phrase in the document to produce a document phrase score for the candidate phrase includes:
    - determining for the candidate phrase two or more subphrases within the candidate phrase, wherein a subphrase contains two or more words; and
    - scoring each determined subphrase in the document as a function of the position of the subphrase relative to a sequence of words containing the candidate phrase; and
- for at least one of the candidate phrases:
  - creating, via a processor of the computer system, a combined score for the candidate phrase based on a plurality of different document phrase scores for the candidate phrase for respective different documents; and
  - determining whether the candidate phrase is a valid phrase based on the combined score for the candidate phrase and based on the document phrase scores for the candidate phrase.
Dependent claims: 11, 12, 13, 14, 16
17. A system for extracting a set of valid phrases from a plurality of documents, the system comprising:
- one or more computer readable media comprising executable instructions; and
- one or more processors configured to execute the instructions, wherein execution of the instructions causes the system to:
  - for each document:
    - identify a plurality of candidate phrases contained in the document, wherein a candidate phrase includes multiple consecutive words that appear in the document;
    - score candidate phrases in the document to produce document phrase scores for the candidate phrases for the document, the document phrase scores for a candidate phrase being based on instances of the candidate phrase that appear in the document, wherein scoring a candidate phrase in the document to produce a document phrase score for the candidate phrase includes:
      - determining for the candidate phrase two or more subphrases within the candidate phrase, wherein a subphrase contains two or more words; and
      - scoring each determined subphrase in the document as a function of the position of the subphrase relative to a sequence of words containing the candidate phrase; and
  - for at least one of the candidate phrases:
    - create, via a processor of the computer system, a combined score for the candidate phrase based on a plurality of different document phrase scores for the candidate phrase for respective different documents; and
    - determine whether the candidate phrase is a valid phrase based on the combined score for the candidate phrase and based on the document phrase scores for the candidate phrase.
Dependent claims: 18, 19, 20, 21, 22, 23, 24, 25, 26
Specification