Bifurcated document relevance scoring
First Claim
1. A computer implemented method for bifurcated document relevance scoring of documents in a document collection, the method comprising:
indexing a plurality of documents in the document collection by:
providing a set of phrases;
for a plurality of documents in the document collection:
identifying a plurality of phrases from the set of phrases that occurs in the document;
for each phrase in a plurality of the identified phrases, scoring the phrase to produce a phrase relevance score for the phrase with respect to the document, and storing the phrase relevance score for the document in a phrase posting list for the phrase;
receiving a search query of three or more words;
determining a set of valid phrases in the search query by:
decomposing, by at least one processor of a computer system, the query into a plurality of candidate phrasifications, including different groupings of words of the query, each candidate phrasification comprising a disjoint union of component phrases, and each component phrase including at least one word or related word of the query;
scoring, by at least one of the processors of the computer system, at least two candidate phrasifications, wherein the candidate phrasifications include one or more component phrases, wherein the scoring is based on a probability of occurrence of each of the candidate phrasification's component phrases in a corpus of documents and is based on the number of component phrases constituting the candidate phrasification, wherein candidate phrasifications having relatively fewer component phrases are weighted higher than candidate phrasifications having relatively more component phrases;
comparing, by at least one of the processors of the computer system, a score for each scored candidate phrasification to a threshold value; and
selecting, by at least one of the processors of the computer system, at least one candidate phrasification, wherein the scores of each selected candidate phrasification exceed a threshold value and identifying the component phrase(s) of the selected candidate phrasification(s) as valid phrases for the search query;
for each valid phrase for the search query, obtaining from the phrase posting list for the valid phrase the phrase relevance score for documents in which the valid phrase occurs; and
for documents in which a valid phrase of the query occurs, scoring the document to produce a final relevance score using the phrase relevance scores for the document and based on the valid phrases of the search query.
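The query phrasification steps recited above (decompose, score, compare to a threshold, select) can be sketched in Python as follows. This is only an illustrative sketch, not the patented implementation. The claim does not give a scoring formula, so the log-probability sum, the per-component penalty `COMPONENT_PENALTY`, the fallback `UNKNOWN_PROB`, the probability table `PHRASE_PROB`, and the threshold are all hypothetical. The sketch also assumes that groupings consist of consecutive query words and ignores "related words".

```python
from itertools import combinations
from math import log

# Hypothetical corpus probabilities for known phrases (illustrative values only).
PHRASE_PROB = {
    "new york": 1e-3,
    "new": 1e-2,
    "york": 1e-4,
    "restaurants": 1e-3,
    "new york restaurants": 1e-5,
    "york restaurants": 1e-8,
}
UNKNOWN_PROB = 1e-9      # assumed fallback for phrases missing from the table
COMPONENT_PENALTY = 4.0  # assumed cost per component phrase; favors fewer components


def phrasifications(words):
    """Yield every split of `words` into consecutive, non-overlapping phrases."""
    n = len(words)
    for k in range(n):  # k is the number of cut points between words
        for cuts in combinations(range(1, n), k):
            bounds = (0, *cuts, n)
            yield [" ".join(words[a:b]) for a, b in zip(bounds, bounds[1:])]


def score(phrases):
    """Sum of component log-probabilities minus a penalty per component."""
    log_prob = sum(log(PHRASE_PROB.get(p, UNKNOWN_PROB)) for p in phrases)
    return log_prob - COMPONENT_PENALTY * len(phrases)


def valid_phrases(query, threshold):
    """Collect component phrases of every phrasification scoring above threshold."""
    selected = [c for c in phrasifications(query.split()) if score(c) > threshold]
    return {phrase for candidate in selected for phrase in candidate}
```

For a three-word query such as "new york restaurants", `phrasifications` yields four candidates; with the hypothetical values above and a threshold of -25.0, only the one-component and the "new york" + "restaurants" candidates are selected.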
2 Assignments
0 Petitions
Abstract
An information retrieval system uses phrases to index, retrieve, organize, and describe documents. Phrases are extracted from the document collection. Documents are then indexed according to their included phrases, using phrase posting lists. The phrase posting lists are stored in a cluster of index servers. The phrase posting lists can be tiered into groups and sharded into partitions. Phrases in a query are identified based on possible phrasifications. A query schedule is created from the phrases and then optimized to reduce query processing and communication costs. The execution of the query schedule is managed to further reduce or eliminate query processing operations at various ones of the index servers.
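The split between index-time phrase scoring and query-time document scoring can be illustrated with a small Python sketch. It is a sketch under stated assumptions, not the system described here: the phrase relevance score (occurrences divided by document length) and the final score (a plain sum over valid phrases) are hypothetical choices, and the function names `build_index` and `final_scores` are invented for illustration. Tiering, sharding, and query scheduling are omitted.

```python
from collections import defaultdict


def count_phrase(tokens, phrase_tokens):
    """Count occurrences of a phrase as a run of consecutive tokens."""
    m = len(phrase_tokens)
    return sum(tokens[i:i + m] == phrase_tokens for i in range(len(tokens) - m + 1))


def build_index(docs, phrases):
    """Index time: build phrase -> {doc_id: phrase relevance score}."""
    posting_lists = defaultdict(dict)
    for doc_id, text in docs.items():
        tokens = text.lower().split()
        for phrase in phrases:
            count = count_phrase(tokens, phrase.split())
            if count:
                # Assumed score: occurrences normalized by document length.
                posting_lists[phrase][doc_id] = count / len(tokens)
    return posting_lists


def final_scores(posting_lists, valid_phrases):
    """Query time: combine stored phrase scores for the query's valid phrases."""
    scores = defaultdict(float)
    for phrase in valid_phrases:
        for doc_id, phrase_score in posting_lists.get(phrase, {}).items():
            scores[doc_id] += phrase_score  # assumed combination: simple sum
    return dict(scores)


docs = {1: "new york restaurants in new york", 2: "york minster"}
index = build_index(docs, ["new york", "restaurants", "york"])
ranked = final_scores(index, ["new york", "restaurants"])
```

Here only document 1 receives a final score, because document 2 contains neither valid phrase; the per-phrase scores were computed once at index time and are merely looked up at query time.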
194 Citations
18 Claims
3. A computer implemented method of bifurcated document relevance scoring of documents in a document collection, the method comprising:
storing an index of documents, the index comprising:
for each of a plurality of phrases, a phrase posting list identifying the documents that contain the phrase, and for each identified document, a phrase relevance score for the phrase with respect to the document;
receiving a search query of three or more words;
determining a set of valid phrases in the search query by:
decomposing, by at least one processor of a computer system, the query into a plurality of candidate phrasifications, including different groupings of words of the query, each candidate phrasification comprising a disjoint union of component phrases, and each component phrase including at least one word or related word of the query;
scoring, by at least one of the processors of the computer system, at least two candidate phrasifications, wherein the candidate phrasifications include one or more component phrases, wherein the scoring is based on a probability of occurrence of each of the candidate phrasification's component phrases in a corpus of documents and is based on the number of component phrases constituting the candidate phrasification, wherein candidate phrasifications having relatively fewer component phrases are weighted higher than candidate phrasifications having relatively more component phrases;
comparing, by at least one of the processors of the computer system, a score for each scored candidate phrasification to a threshold value; and
selecting, by at least one of the processors of the computer system, at least one candidate phrasification, wherein the scores of each selected candidate phrasification exceed a threshold value and identifying the component phrase(s) of the selected candidate phrasification(s) as valid phrases for the search query;
for each valid phrase for the search query, obtaining from the phrase posting list for the valid phrase the phrase relevance score for documents in which the valid phrase occurs; and
for documents in which a valid phrase of the query occurs, scoring the document to produce a final relevance score using the phrase relevance scores for the document and based on the valid phrases of the search query.
Dependent claims: 4, 5, 6
7. A computer program product stored on a computer readable medium and comprising instructions that when executed cause a computer system to:
index a plurality of documents in the document collection by:
providing a set of phrases;
for a plurality of documents in the document collection:
identifying a plurality of phrases from the set of phrases that occurs in the document;
for a plurality of the identified phrases, scoring the phrase to produce a phrase relevance score for the phrase with respect to the document, and storing the phrase relevance score for the document in a phrase posting list for the phrase;
receive a search query of three or more words;
determining a set of valid phrases in the search query by:
decomposing, by at least one processor of a computer system, the query into a plurality of candidate phrasifications, including different groupings of words of the query, each candidate phrasification comprising a disjoint union of component phrases, and each component phrase including at least one word or related word of the query;
scoring, by at least one of the processors of the computer system, at least two candidate phrasifications, wherein the candidate phrasifications include one or more component phrases, wherein the scoring is based on a probability of occurrence of each of the candidate phrasification's component phrases in a corpus of documents and is based on the number of component phrases constituting the candidate phrasification, wherein candidate phrasifications having relatively fewer component phrases are weighted higher than candidate phrasifications having relatively more component phrases;
comparing, by at least one of the processors of the computer system, a score for each scored candidate phrasification to a threshold value; and
selecting, by at least one of the processors of the computer system, at least one candidate phrasification, wherein the scores of each selected candidate phrasification exceed a threshold value and identifying the component phrase(s) of the selected candidate phrasification(s) as valid phrases for the search query;
for each valid phrase for the search query, obtaining from the phrase posting list for the valid phrase the phrase relevance score for documents in which the valid phrase occurs; and
for documents in which a valid phrase of the query occurs, scoring the document to produce a final relevance score using the phrase relevance scores for the document and based on the valid phrases of the search query.
Dependent claims: 8
9. A computer program product stored on a tangible computer readable medium and comprising instructions that when executed cause a computer system to:
store an index of documents, the index comprising:
for each of a plurality of phrases, a phrase posting list identifying the documents that contain the phrase, and for each identified document, a phrase relevance score for the phrase with respect to the document;
receive a search query of three or more words;
determine a set of valid phrases in the search query by:
decomposing, by at least one processor of a computer system, the query into a plurality of candidate phrasifications, including different groupings of words of the query, each candidate phrasification comprising a disjoint union of component phrases, and each component phrase including at least one word or related word of the query;
scoring, by at least one of the processors of the computer system, at least two candidate phrasifications, wherein the candidate phrasifications include one or more component phrases, wherein the scoring is based on a probability of occurrence of each of the candidate phrasification's component phrases in a corpus of documents and is based on the number of component phrases constituting the candidate phrasification, wherein candidate phrasifications having relatively fewer component phrases are weighted higher than candidate phrasifications having relatively more component phrases;
comparing, by at least one of the processors of the computer system, a score for each scored candidate phrasification to a threshold value; and
selecting, by at least one of the processors of the computer system, at least one candidate phrasification, wherein the scores of each selected candidate phrasification exceed a threshold value and identifying the component phrase(s) of the selected candidate phrasification(s) as valid phrases for the search query;
for each valid phrase for the search query, obtaining from the phrase posting list for the valid phrase the phrase relevance score for documents in which the valid phrase occurs; and
for documents in which a valid phrase of the query occurs, scoring the document to produce a final relevance score using the phrase relevance scores for the document and based on the valid phrases of the search query.
Dependent claims: 10, 11, 12
13. A system for bifurcated document relevance scoring of documents in a document collection, the system comprising:
an indexing system configured for indexing a plurality of documents in the document collection, the indexing system comprising:
one or more memories configured for storing a set of phrases; and
one or more processors configured for:
for a plurality of documents in the document collection, identifying a plurality of phrases from the set of phrases that occurs in the document; and
for each phrase in a plurality of the identified phrases, scoring the phrase to produce a phrase relevance score for the phrase with respect to the document, and storing the phrase relevance score for the document in a phrase posting list for the phrase;
a search system comprising:
a first server configured for receiving a search query of three or more words; and
one or more processors configured for:
decomposing, by at least one processor of a computer system, the query into a plurality of candidate phrasifications, including different groupings of words of the query, each candidate phrasification comprising a disjoint union of component phrases, and each component phrase including at least one word or related word of the query;
scoring, by at least one of the processors of the computer system, at least two candidate phrasifications, wherein the candidate phrasifications include one or more component phrases, wherein the scoring is based on a probability of occurrence of each of the candidate phrasification's component phrases in a corpus of documents and is based on the number of component phrases constituting the candidate phrasification, wherein candidate phrasifications having relatively fewer component phrases are weighted higher than candidate phrasifications having relatively more component phrases;
comparing, by at least one of the processors of the computer system, a score for each scored candidate phrasification to a threshold value; and
selecting, by at least one of the processors of the computer system, at least one candidate phrasification, wherein the scores of each selected candidate phrasification exceed a threshold value and identifying the component phrase(s) of the selected candidate phrasification(s) as valid phrases for the search query;
for each valid phrase for the search query, obtaining from the phrase posting list for the valid phrase the phrase relevance score for documents in which the valid phrase occurs; and
for documents in which a valid phrase of the query occurs, scoring the document to produce a final relevance score using the phrase relevance scores for the document and based on the valid phrases of the search query.
Dependent claims: 14
15. A system for bifurcated document relevance scoring of documents in a document collection, the system comprising:
one or more memories configured for storing an index of documents, the index comprising, for each of a plurality of phrases, a phrase posting list identifying the documents that contain the phrase, and for each identified document, a phrase relevance score for the phrase with respect to the document;
a search server comprising:
a first server configured for receiving a search query of three or more words;
one or more processors configured for determining a set of valid phrases in the search query by:
decomposing, by at least one processor of a computer system, the query into a plurality of candidate phrasifications, including different groupings of words of the query, each candidate phrasification comprising a disjoint union of component phrases, and each component phrase including at least one word or related word of the query;
scoring, by at least one of the processors of the computer system, at least two candidate phrasifications, wherein the candidate phrasifications include one or more component phrases, wherein the scoring is based on a probability of occurrence of each of the candidate phrasification's component phrases in a corpus of documents and is based on the number of component phrases constituting the candidate phrasification, wherein candidate phrasifications having relatively fewer component phrases are weighted higher than candidate phrasifications having relatively more component phrases;
comparing, by at least one of the processors of the computer system, a score for each scored candidate phrasification to a threshold value; and
selecting, by at least one of the processors of the computer system, at least one candidate phrasification, wherein the scores of each selected candidate phrasification exceed a threshold value and identifying the component phrase(s) of the selected candidate phrasification(s) as valid phrases for the search query;
for each valid phrase for the search query, obtaining from the phrase posting list for the valid phrase the phrase relevance score for documents in which the valid phrase occurs; and
for documents in which a valid phrase of the query occurs, scoring the document to produce a final relevance score using the phrase relevance scores for the document and based on the valid phrases of the search query.
Dependent claims: 16, 17, 18
Specification