Index server architecture using tiered and sharded phrase posting lists
First Claim
1. A method of indexing documents of a document collection in an indexing system that includes a plurality of index servers, the method comprising:
- determining a phrase posting list associated with a first phrase, the phrase posting list identifying documents of the document collection associated with the first phrase;
dividing the phrase posting list for the first phrase into a plurality of different shards, each shard identifying a subset of the documents identified by the posting list;
storing each different shard of the phrase posting list for the first phrase on a corresponding different index server;
storing a plurality of shards of different phrase posting lists on a first index server;
storing a plurality of shards of different phrase posting lists on a second index server; and
within each of the first and second index servers, for each shard of a phrase posting list, ordering the shard according to document identifiers of the documents included in the shard.
2 Assignments
0 Petitions
Accused Products
Abstract
An information retrieval system uses phrases to index, retrieve, organize and describe documents. Phrases are extracted from the document collection. Documents are the indexed according to their included phrases, using phrase posting lists. The phrase posting lists are stored in an cluster of index servers. The phrase posting lists can be tiered into groups, and sharded into partitions. Phrases in a query are identified based on possible phrasifications. A query schedule based on the phrases is created from the phrases, and then optimized to reduce query processing and communication costs. The execution of the query schedule is managed to further reduce or eliminate query processing operations at various ones of the index servers.
233 Citations
51 Claims
-
1. A method of indexing documents of a document collection in an indexing system that includes a plurality of index servers, the method comprising:
-
determining a phrase posting list associated with a first phrase, the phrase posting list identifying documents of the document collection associated with the first phrase; dividing the phrase posting list for the first phrase into a plurality of different shards, each shard identifying a subset of the documents identified by the posting list; storing each different shard of the phrase posting list for the first phrase on a corresponding different index server; storing a plurality of shards of different phrase posting lists on a first index server; storing a plurality of shards of different phrase posting lists on a second index server; and within each of the first and second index servers, for each shard of a phrase posting list, ordering the shard according to document identifiers of the documents included in the shard. - View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10)
-
-
11. A computer-implemented method comprising:
-
receiving a query for information from the document collection, wherein the query includes a first phrase; accessing a plurality of index servers to retrieve identifiers of documents on a phrase posting list associated with the first phrase, wherein the phrase posting list identifies documents of the document collection associated with the first phrase, and is divided into a plurality of shards, each shard identifying a subset of the plurality of the documents identified by the posting list, and each different shard being stored on a corresponding different index server of the plurality of index servers; serving a response to the query, wherein the response is based on information in the phrase posting list associated with the first phrase; and storing a plurality of shards of different phrase posting lists on a first index server; storing plurality of shards of different phrase posting lists on a second index server; and within each of the first and second index servers, for each shard of a phrase posting list, ordering the shard according to document identifiers of the documents included in the shard. - View Dependent Claims (12, 13, 14, 15, 16, 17)
-
-
18. An indexing system for indexing documents in a document collection, the system comprising:
-
a plurality of index servers; one or more memory devices configured store executable instructions; and one or more processors configured to execute the stored instructions to cause the system to; determine a phrase posting list associated with a first phrase, the phrase posting list identifying documents of the document collection associated with the first phrase; divide the phrase posting list for the first phrase into a plurality of shards, each shard identifying a subset of the documents identified by the posting list; and store each shard of the phrase posting list for the first phrase on a corresponding index server; determine a phrase posting list associated with a second phrase, the phrase posting list identifying a number of documents of the document collection having at least one occurrence of the second phrase; divide the phrase posting list for the second phrase into a plurality of shards, each shard identifying a subset of the plurality of the documents identified by the posting list; and store each shard of the phrase posting list for the second phrase on a corresponding index server, wherein shards of the phrase posting list for the first phrase and shards of the phrase posting list for the second phrase, which identify the same document, are stored on the same index server. - View Dependent Claims (19, 20, 21)
-
-
22. An indexing system for indexing documents of a document collection, the system comprising:
-
a plurality of index servers; one or more memory devices configured store executable instructions; and one or more processors configured to execute the stored instructions to cause the system to; receive a query for information from the document collection, wherein the query includes a first phrase and a second phrase; access a plurality of index servers to retrieve identifiers of documents on a phrase posting list associated with the first phrase, wherein the phrase posting list identifies documents of the document collection associated with the first phrase, and is divided into a plurality of shards, each shard identifying a subset of the plurality of the documents identified by the posting list, and each shard being stored on a corresponding index server of the plurality of index servers; access a plurality of the index servers to retrieve identifiers of documents on a phrase posting list associated with the second phrase, wherein the phrase posting list identifies documents of the document collection associated with the second phrase, and is divided into a plurality of shards, each shard identifying a subset of the plurality of the documents identified by the posting list, and each shard being stored on a corresponding index server of the plurality of index servers; and serve a response to the query, wherein the response is based on information in the phrase posting list associated with the first phrase and is based on information in the phrase posting list associated with the second phrase. - View Dependent Claims (23, 24, 25, 26, 27, 28, 29, 30)
-
-
31. A method of indexing documents of a document collection in an indexing system that includes a plurality of index servers, the method comprising:
-
determining a phrase posting list associated with a first phrase, the phrase posting list identifying documents of the document collection associated with the first phrase; dividing the phrase posting list for the first phrase into a plurality of different shards, each shard identifying a subset of the documents identified by the posting list; storing each different shard of the phrase posting list for the first phrase on a corresponding different index server; determining a phrase posting list associated with a second phrase, the phrase posting list identifying a number of documents of the document collection having at least one occurrence of the second phrase; dividing the phrase posting list for the second phrase into a plurality of different shards, each shard identifying a subset of the plurality of the documents identified by the posting list; and storing each different shard of the phrase posting list for the second phrase on a corresponding different index server, wherein shards of the phrase posting list for the first phrase and shards of the phrase posting list for the second phrase, which identify the same document, are stored on the same index server. - View Dependent Claims (32, 33, 34, 35, 36, 37, 38, 39)
-
-
40. A computer-implemented method comprising:
-
receiving a query for information from the document collection, wherein the query includes a first phrase and a second phrase; accessing a plurality of index servers to retrieve identifiers of documents on a phrase posting list associated with the first phrase, wherein the phrase posting list identifies documents of the document collection associated with the first phrase, and is divided into a plurality of shards, each shard identifying a subset of the plurality of the documents identified by the posting list, and each different shard being stored on a corresponding index server of the plurality of different index servers; accessing a plurality of the index servers to retrieve identifiers of documents on a phrase posting list associated with the second phrase, wherein the phrase posting list identifies documents of the document collection associated with the second phrase, and is divided into a plurality of shards, each shard identifying a subset of the plurality of the documents identified by the posting list, and each shard being stored on a corresponding index server of the plurality of index servers; and serving a response to the query, wherein the response is based on information in the phrase posting list associated with the first phrase and is based on information in the phrase posting list associated with the second phrase. - View Dependent Claims (41, 42, 43, 44, 45)
-
-
46. An indexing system for indexing documents in a document collection, the system comprising:
-
a plurality of index servers; one or more memory devices configured store executable instructions; and one or more processors configured to execute the stored instructions to cause the system to; determine a phrase posting list associated with a first phrase, the phrase posting list identifying documents of the document collection associated with the first phrase; divide the phrase posting list for the first phrase into a plurality of shards, each shard identifying a subset of the documents identified by the posting list; store each shard of the phrase posting list for the first phrase on a corresponding index server; store a plurality of shards of different phrase posting lists on a first index server; store a plurality of shards of different phrase posting lists on a second index server; and within each of the first and second index servers, for each shard of a phrase posting list, order the shard according to document identifiers of the documents included in the shard. - View Dependent Claims (47, 48, 49)
-
-
50. An indexing system for indexing documents of a document collection, the system comprising:
-
a plurality of index servers; one or more memory devices configured store executable instructions; and one or more processors configured to execute the stored instructions to cause the system to; receive a query for information from the document collection, wherein the query includes a first phrase; access a plurality of index servers to retrieve identifiers of documents on a phrase posting list associated with the first phrase, wherein the phrase posting list identifies documents of the document collection associated with the first phrase, and is divided into a plurality of shards, each shard identifying a subset of the plurality of the documents identified by the posting list, and each shard being stored on a corresponding index server of the plurality of index servers; and serve a response to the query, wherein the response is based on information in the phrase posting list associated with the first phrase; store a plurality of shards of different phrase posting lists on a first index server; store plurality of shards of different phrase posting lists on a second index server; and within each of the first and second index servers, for each shard of a phrase posting list, order the shard according to document identifiers of the documents included in the shard. - View Dependent Claims (51)
-
Specification