Query phrasification
First Claim
1. A computer-implemented method for identifying valid phrases in an input text comprising a plurality of three or more words, the method comprising:
- decomposing, by at least one processor of a computer system, the input text into a plurality of candidate phrasifications, including different groupings of words of the input text, each candidate phrasification comprising a disjoint union of component phrases, and each component phrase including at least one word or related word of the input text;
- scoring, by at least one of the processors of the computer system, at least two of the candidate phrasifications, wherein the candidate phrasifications include two or more component phrases, and wherein the scoring is based on a probability of occurrence of each of the candidate phrasification's component phrases, and is based on the number of component phrases constituting the candidate phrasification, wherein candidate phrasifications having relatively fewer component phrases are weighted higher than candidate phrasifications having relatively more component phrases;
- comparing, by at least one of the processors of the computer system, a score for each scored candidate phrasification to a threshold value;
- selecting, by at least one of the processors of the computer system, at least one candidate phrasification, wherein the scores of each selected phrasification exceeds a chosen threshold value; and
- identifying, by at least one of the processors of the computer system, the component phrases of each selected phrasification as valid phrases for the input text.
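The steps recited in claim 1 (decompose, score, compare to a threshold, select) can be illustrated with a minimal Python sketch. It is not the patented implementation. The probability table `PHRASE_PROB`, the floor `UNSEEN_PROB`, the multiplier `LENGTH_PENALTY`, and all function names are hypothetical, chosen only for illustration. The sketch also assumes that component phrases are contiguous word runs and ignores the "related word" option in the claim.

```python
import math
from itertools import combinations

# Hypothetical phrase probabilities; a real system would estimate them from a corpus.
PHRASE_PROB = {
    "new": 0.05,
    "york": 0.01,
    "pizza": 0.03,
    "new york": 0.02,
    "york pizza": 0.0001,
    "new york pizza": 0.001,
}
UNSEEN_PROB = 1e-6    # assumed floor for phrases missing from the table
LENGTH_PENALTY = 0.1  # assumed per-phrase multiplier; fewer phrases score higher


def phrasifications(words):
    """Yield every split of `words` into contiguous, non-overlapping phrases."""
    n = len(words)
    for k in range(n):  # k is the number of cut points between words
        for cuts in combinations(range(1, n), k):
            bounds = (0, *cuts, n)
            yield [" ".join(words[a:b]) for a, b in zip(bounds, bounds[1:])]


def score(phrases):
    """Product of phrase probabilities, discounted once per component phrase."""
    prob = math.prod(PHRASE_PROB.get(p, UNSEEN_PROB) for p in phrases)
    return prob * LENGTH_PENALTY ** len(phrases)


def select_phrasifications(text, threshold):
    """Return the multi-phrase candidates whose score exceeds `threshold`."""
    words = text.lower().split()
    return [
        candidate
        for candidate in phrasifications(words)
        if len(candidate) >= 2 and score(candidate) > threshold
    ]


if __name__ == "__main__":
    print(select_phrasifications("New York pizza", threshold=1e-6))
```

With these made-up numbers, the call prints `[['new york', 'pizza']]`: the two-phrase grouping clears the threshold, while "new" + "york pizza" and the three single-word grouping fall below it. The component phrases of each selected phrasification would then be treated as the valid phrases for the input text.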
Abstract
An information retrieval system uses phrases to index, retrieve, organize and describe documents. Phrases are extracted from the document collection. Documents are then indexed according to their included phrases, using phrase posting lists. The phrase posting lists are stored in a cluster of index servers. The phrase posting lists can be tiered into groups, and sharded into partitions. Phrases in a query are identified based on possible phrasifications. A query schedule is created from the phrases, and then optimized to reduce query processing and communication costs. The execution of the query schedule is managed to further reduce or eliminate query processing operations at various ones of the index servers.
37 Claims
1. A computer-implemented method for identifying valid phrases in an input text comprising a plurality of three or more words, the method comprising:
- decomposing, by at least one processor of a computer system, the input text into a plurality of candidate phrasifications, including different groupings of words of the input text, each candidate phrasification comprising a disjoint union of component phrases, and each component phrase including at least one word or related word of the input text;
- scoring, by at least one of the processors of the computer system, at least two of the candidate phrasifications, wherein the candidate phrasifications include two or more component phrases, and wherein the scoring is based on a probability of occurrence of each of the candidate phrasification's component phrases, and is based on the number of component phrases constituting the candidate phrasification, wherein candidate phrasifications having relatively fewer component phrases are weighted higher than candidate phrasifications having relatively more component phrases;
- comparing, by at least one of the processors of the computer system, a score for each scored candidate phrasification to a threshold value;
- selecting, by at least one of the processors of the computer system, at least one candidate phrasification, wherein the scores of each selected phrasification exceeds a chosen threshold value; and
- identifying, by at least one of the processors of the computer system, the component phrases of each selected phrasification as valid phrases for the input text.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18
19. A system comprising:
- a computer program product stored on a tangible computer readable medium for identifying valid phrases in an input text comprising a plurality of three or more words and comprising instructions that when executed cause a computer system to:
  - decompose the input text into a plurality of candidate phrasifications, including different groupings of words of the input text, each candidate phrasification comprising a disjoint union of component phrases, and each component phrase including at least one word or related word of the input text;
  - score at least two of the candidate phrasifications, wherein the candidate phrasifications include two or more component phrases, and wherein the scoring is based on a probability of occurrence of each of the candidate phrasification's component phrases, and is based on the number of component phrases constituting the candidate phrasification, wherein candidate phrasifications having relatively fewer component phrases are weighted higher than candidate phrasifications having relatively more component phrases;
  - compare a score for each scored candidate phrasification to a threshold value;
  - select at least one candidate phrasification, wherein the scores of each selected phrasification exceeds a chosen threshold value; and
  - identify the phrases of each selected phrasification as valid phrases for the input text; and
- one or more processors configured for executing the instructions.
Dependent claims: 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37
Specification