Index server architecture using tiered and sharded phrase posting lists
First Claim
1. A method of indexing documents based on phrases occurring in the documents, the method comprising:
- selecting a phrase posting list associated with a phrase, the phrase posting list identifying a number of documents having at least one occurrence of the phrase;
determining, with one or more processors, a length of the phrase posting list, the length of the phrase posting list being based on the number of documents having at least one occurrence of the phrase;
when the length of the phrase posting list is less than a first predetermined length, associating the phrase posting list with one of a plurality of first tier index servers;
when the length of the phrase posting list is greater than the first predetermined length;
dividing the phrase posting list into a plurality of shards, each shard identifying a subset of the plurality of the documents identified by the posting list; and
associating each phrase posting list shard with a corresponding selected second tier index server.
2 Assignments
0 Petitions
Accused Products
Abstract
An information retrieval system uses phrases to index, retrieve, organize and describe documents. Phrases are extracted from the document collection. Documents are the indexed according to their included phrases, using phrase posting lists. The phrase posting lists are stored in an cluster of index servers. The phrase posting lists can be tiered into groups, and sharded into partitions. Phrases in a query are identified based on possible phrasifications. A query schedule based on the phrases is created from the phrases, and then optimized to reduce query processing and communication costs. The execution of the query schedule is managed to further reduce or eliminate query processing operations at various ones of the index servers.
-
Citations
20 Claims
-
1. A method of indexing documents based on phrases occurring in the documents, the method comprising:
-
selecting a phrase posting list associated with a phrase, the phrase posting list identifying a number of documents having at least one occurrence of the phrase; determining, with one or more processors, a length of the phrase posting list, the length of the phrase posting list being based on the number of documents having at least one occurrence of the phrase; when the length of the phrase posting list is less than a first predetermined length, associating the phrase posting list with one of a plurality of first tier index servers; when the length of the phrase posting list is greater than the first predetermined length; dividing the phrase posting list into a plurality of shards, each shard identifying a subset of the plurality of the documents identified by the posting list; and associating each phrase posting list shard with a corresponding selected second tier index server. - View Dependent Claims (2, 3, 4, 5, 6)
-
-
7. A method of indexing documents based on phrases occurring in the documents, the method comprising:
-
storing a first plurality of phrase posting lists in one or more memories of a first set of index servers, each of the first plurality of phrase posting lists being associated with a phrase and identifying a set of documents having at least one occurrence of the phrase, wherein a length of the each of the first plurality of phrase posting lists is less than a first predetermined length; and storing a second plurality of phrase posting lists in one or more memories of a second set of index servers, each of the second plurality of phrase posting lists being associated with a phrase and identifying a set of documents having at least one occurrence of the phrase, wherein a length of the second plurality of phrase posting lists is greater than the first predetermined length, each the second plurality of phrase posting lists being partitioned into a number of shards. - View Dependent Claims (8, 9, 10, 11)
-
-
12. A phrase-based indexing system for indexing documents according to phrases, the system comprising:
-
a first number of first tier index servers configured to store phrase posting lists for each of a plurality of phrases, each phrase posting list associated with one of the plurality of phrases and identifying a number of documents having at least one occurrence of the phrase; and a second number of second tier index servers, configured to store shards of phrase posting lists, the phrase posting lists being associated with a phrase and a list of documents having at least one occurrence of the phrase, and each phrase posting list shard being associated with a phrase and a subset of the list of documents that are associated with the phrase posting list for the phrase; wherein the first tier index servers are configured to store phrase posting lists having a length less than a first predetermined length; wherein the second tier index servers are configured to store phrase posting lists having a length greater than the first predetermined length. - View Dependent Claims (13, 14, 15, 16)
-
-
17. A method of indexing documents based on phrases occurring in the documents of a document collection, the method comprising:
-
determining, with one or more processors, a plurality of phrase posting lists, each phrase posting list associated with a phrase that occurs in the documents and identifying a number of documents having at least one occurrence of the phrase; associating a first subset of the plurality of phrase posting lists with a first tier of index servers; associating a second subset of the plurality of phrase posting lists with a second tier of index servers; and dividing a phrase posting list from the second subset into a plurality of shards, each shard identifying a subset of the plurality of the documents identified by the posting list, wherein the first tier index servers are configured to store phrase posting lists having a length less than a first predetermined length; wherein the second tier index servers are configured to store phrase posting lists having a length greater than the first predetermined length. - View Dependent Claims (18, 19, 20)
-
Specification