Index server architecture using tiered and sharded phrase posting lists
First Claim
1. A method of indexing documents based on phrases occurring in the documents, the method comprising:
- selecting a phrase posting list associated with a phrase, and identifying a plurality of documents having at least one occurrence of the phrase;
determining a length of the phrase posting list via operation of a processor;
if the length of the phrase posting list is less than a first predetermined length, associating the phrase posting list with one of a plurality of first tier index servers;
if the length of the phrase posting list is greater than the first predetermined length;
dividing the phrase posting list into a plurality of shards, each shard including a subset of the plurality of the documents;
associating each phrase posting list shard with a corresponding selected second tier index server, wherein the number of shards correspond to the number of second tier index servers; and
if the length of the phrase posting list is greater than a second predetermined length that is greater than the first predetermined length;
dividing the phrase posting list into a plurality of shards; and
associating each phrase posting list shard with a corresponding selected third tier index server, wherein the number of shards correspond to the number of third tier index servers, and wherein the number of third tier index servers is an integer multiple of the number of second tier index servers.
3 Assignments
0 Petitions
Accused Products
Abstract
An information retrieval system uses phrases to index, retrieve, organize and describe documents. Phrases are extracted from the document collection. Documents are the indexed according to their included phrases, using phrase posting lists. The phrase posting lists are stored in an cluster of index servers. The phrase posting lists can be tiered into groups, and sharded into partitions. Phrases in a query are identified based on possible phrasifications. A query schedule based on the phrases is created from the phrases, and then optimized to reduce query processing and communication costs. The execution of the query schedule is managed to further reduce or eliminate query processing operations at various ones of the index servers.
431 Citations
12 Claims
-
1. A method of indexing documents based on phrases occurring in the documents, the method comprising:
-
selecting a phrase posting list associated with a phrase, and identifying a plurality of documents having at least one occurrence of the phrase; determining a length of the phrase posting list via operation of a processor; if the length of the phrase posting list is less than a first predetermined length, associating the phrase posting list with one of a plurality of first tier index servers; if the length of the phrase posting list is greater than the first predetermined length; dividing the phrase posting list into a plurality of shards, each shard including a subset of the plurality of the documents; associating each phrase posting list shard with a corresponding selected second tier index server, wherein the number of shards correspond to the number of second tier index servers; and if the length of the phrase posting list is greater than a second predetermined length that is greater than the first predetermined length; dividing the phrase posting list into a plurality of shards; and associating each phrase posting list shard with a corresponding selected third tier index server, wherein the number of shards correspond to the number of third tier index servers, and wherein the number of third tier index servers is an integer multiple of the number of second tier index servers. - View Dependent Claims (2)
-
-
3. A method of indexing documents based on phrases occurring in the documents, the method comprising:
-
storing a first plurality of phrase posting lists in one or more memories of a first set of index servers, wherein a length of the first plurality of phrase posting lists is less than a first predetermined length, each of the first plurality of phrase posting lists being associated with a phrase and a set of documents each of which includes at least one occurrence of the phrase and being associated with one of the first set of index servers; storing a second plurality of phrase posting lists in one or more memories of a number of second index servers, wherein a length of the second plurality of phrase posting lists is greater than the first predetermined length, each of the second plurality of phrase posting lists being associated with a phrase and a set of documents each of which includes at least one occurrence of the phrase and being partitioned into a number of shards, each shard associated with a corresponding one of a number of the second set of index servers; and storing a third plurality of phrase posting lists in one or more memories of a number of third set of index servers, wherein a length of the third plurality of phrase posting lists is greater than a second predetermined length that is greater than the first predetermined length, each of the third plurality of phrase posting lists being associated with a phrase and a set of documents each of which includes at least one occurrence of the phrase and being partitioned into a number of shards, each shard associated with a corresponding one of a number of the third set of index servers, wherein the number of third index servers is an integer multiple of the number of second index servers.
-
-
4. A phrase-based indexing system for indexing documents according to phrases, the system comprising:
-
a first number of first tier of index servers configured to associate a complete phrase posting list for each of a plurality of phrases, each phrase posting list associated with a phrase and a list of documents having at least one occurrence of the phrase; a second number of second tier of index servers, wherein the second number is an integer multiple of the first number, configured to associate a shard of a phrase posting list for each of a plurality of phrases, each phrase posting list shard associated with a phrase and a list of documents having at least one occurrence of the phrase; and a third number of third tier of index servers, wherein the third number is an integer multiple of the second number, configured to store a shard of a phrase posting list for each of a plurality of phrases, each phrase posting list shard associated with a phrase and a list of documents having at least one occurrence of the phrase, wherein the first tier of index servers is configured to store phrase posting lists having a length less than a first predetermined length; wherein the second tier of index servers is configured to store phrase posting lists having a length greater than the first predetermined length; and wherein the third tier of index servers is configured to store phrase posting lists having a length greater than a second predetermined length that is greater than the first predetermined length, wherein each index server executes on a computer that includes a processor and a memory.
-
-
5. A phrase-based indexing system for indexing documents according to phrases, the system comprising:
-
a first number of first tier of index servers, wherein each of the first tier index servers is configured to store a complete phrase posting list for each of a plurality of phrases, each phrase posting list associated with a phrase and a list of documents having at least one occurrence of the phrase; a second number of second tier of index servers, wherein the second number is an integer multiple of the first number, wherein each of the second tier index servers is configured to store a shard of a phrase posting list for each of a plurality of phrases, each phrase posting list shard associated with a phrase and a list of documents having at least one occurrence of the phrase; and a third number of third tier of index servers, wherein the third number is an integer multiple of the second number, wherein each of the third tier index servers is configured to store a shard of a phrase posting list for each of a plurality of phrases, each phrase posting list shard associated with a phrase and a list of documents having at least one occurrence of the phrase, wherein; the first tier of index servers stores phrase posting lists having a length less than a first predetermined length; the second tier of index servers stores phrase posting lists having a length greater than the first predetermined length; and the third tier of index servers stores phrase posting lists having a length greater than a second predetermined length that is greater than the first predetermined length, wherein each index server executes on a computer that includes a processor and a memory. - View Dependent Claims (6)
-
-
7. A phrase-based indexing system for indexing documents according to phrases, the system comprising:
-
a first tier of index servers, comprising; N index servers, each of which stores a phrase posting list for each of a plurality of phrases, each phrase posting list associated with a phrase and a list of documents having at least one occurrence of the phrase; M tiers of index servers, wherein; M=3 to a predetermined number; each Mth tier includes T index servers, where M=3, T is an integer multiple of N; and where M>
3, T is an integer multiple T in an (M−
I)th tier; andeach Mth tier index server stores a shard of a phrase posting list for each of a plurality of phrases, each phrase posting list shard associated with a phrase and a list of documents having at least one occurrence of the phrase; and responsive to the length of the phrase posting list being greater than a second predetermined length that is greater than the first predetermined length; dividing the phrase posting list into a plurality of shards; and associating each phrase posting list shard with a corresponding selected third tier index server, wherein the number of shards correspond to the number of third tier index servers, and wherein the number of third tier index servers is an integer multiple of the number of second tier index servers, wherein each index server executes on a computer that includes a processor and a memory.
-
-
8. A computer implemented method of indexing documents based on phrases occurring in the documents, the method comprising:
-
for each phrase, creating, via operation of a processor, a phrase posting list identifying a number of documents, each of which is related to the phrase, a length of the phrase posting list being the number of documents; defining a number M tiers, where; M is at least 1; each tier is associated with a minimum length L, such that L is zero for the first tier, and L for the Mth tier is greater than L for the (M−
1) Mth tier;each tier is associated with a number of shards S, such that S for the Mth tier is an integer multiple of S for the (M−
1)th tier;for each phrase posting list, associating the phrase posting list to the tier for which the length of the phrase posting list is greater than or equal to the minimum length of that tier and less than the minimum length of the succeeding tier; for each tier, establishing a set of index servers to store the set of all phrase posting lists associated with that tier, such that; each phrase posting list associated with the tier is divided into the S shards for that tier, with each document in the phrase posting list being assigned one and only one of the shards; each index server stores some plurality of the shards of the phrase posting lists associated with its tier; and each shard of each phrase posting list assigned to the tier is stored in at least one server associated to this tier.
-
-
9. A computer program product embodied on a computer readable storage medium comprising instructions for causing a computer system:
-
to select a phrase posting list associated with a phrase, and identify a plurality of documents having at least one occurrence of the phrase; to determine a length of the phrase posting list; if the length of the phrase posting list is less than a first predetermined length, to associate the phrase posting list with one of a plurality of first tier index servers; if the length of the phrase posting list is greater than the first predetermined length; to divide the phrase posting list into a plurality of shards, each shard including a subset of the plurality of the documents; to associate each phrase posting list shard with a corresponding selected second tier index server, wherein the number of shards correspond to the number of second tier index servers; and if the length of the phrase posting list is greater than a second predetermined length that is greater than the first predetermined length; to divide the phrase posting list into a plurality of shards; and to associate each phrase posting list shard with a corresponding selected third tier index server, wherein the number of shards correspond to the number of third tier index servers, and wherein the number of third tier index servers is an integer multiple of the number of second tier index servers. - View Dependent Claims (10)
-
-
11. A computer program product embodied on a computer readable storage medium comprising instructions for causing a computer system:
-
to store a first plurality of phrase posting lists in a first set of index servers, wherein a length of the first plurality of phrase posting lists is less than a first predetermined length, each of the first plurality of phrase posting lists being associated with a phrase and a set of documents each of which includes at least one occurrence of the phrase and being associated with one of the first set of index servers; to store a second plurality of phrase posting lists in a number of second index servers, wherein a length of the second plurality of phrase posting lists is greater than the first predetermined length, each of the second plurality of phrase posting lists being associated with a phrase and a set of documents each of which includes at least one occurrence of the phrase and being partitioned into a number of shards, each shard associated with a corresponding one of a number of the second set of index servers; and to store a third plurality of phrase posting lists in one or more memories of a number of third set of index servers, wherein a length of the third plurality of phrase posting lists is greater than a second predetermined length that is greater than the first predetermined length, each of the third plurality of phrase posting lists being associated with a phrase and a set of documents each of which includes at least one occurrence of the phrase and being partitioned into a number of shards, each shard associated with a corresponding one of a number of the third set of index servers, wherein the number of third index servers is an integer multiple of the number of second index servers.
-
-
12. A computer program product embodied on a computer readable storage medium comprising instructions for causing a computer system:
-
for each phrase occurring in a plurality of documents, to create a phrase posting list identifying a number of documents, each of which is related to the phrase, a length of the phrase posting list being the number of documents; to define a number M tiers, where; M is at least 1; each tier being associated with a minimum length L, such that L is zero for the first tier, and L for the Mth tier is greater than L for the (M−
1) Mth tier;each tier is associated with a number of shards S, such that S for the Mth tier is an integer multiple of S for the (M−
1)th tier;for each phrase posting list, to associate the phrase posting list to the tier for which the length of the phrase posting list is greater than or equal to the minimum length of that tier and less than the minimum length of the succeeding tier; for each tier, to establish a set of index servers to store the set of all phrase posting lists associated with that tier, such that; each phrase posting list associated with the tier is divided into the S shards for that tier, with each document in the phrase posting list being assigned one and only one of the shards; each index server stores some plurality of the shards of the phrase posting lists associated with its tier; and each shard of each phrase posting list assigned to the tier is stored in at least one server associated to this tier.
-
Specification