Index updating using segment swapping
First Claim
1. A computer implemented method of maintaining a phrase index for a plurality of documents in a document collection, the method comprising:
providing a set of phrase posting lists, each phrase posting list associated with a phrase;
establishing a plurality of segments, each segment associated with a subset of the plurality of the documents;
periodically updating each segment by:
for documents associated with the segment, identifying phrases contained in the document, and updating the phrase posting list for each such phrase to include the document;
sharding the phrase posting lists for the identified phrases into a plurality of segment shards, each segment shard containing a disjoint subset of the list of documents in the segment that contain the phrase associated with the phrase posting list;
associating each segment shard with an index shard, such that at least one index shard is associated with a plurality of segment shards, each index shard being served by an index server;
determining a recently updated segment having updated segment shards;
for at least one index shard being served:
determining the index shard's associated updated segment shards, and merging the updated segment shards with the index shard to form an updated index shard; and
replacing the index shard with the updated index shard.
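The steps of the claim can be illustrated with a minimal in-memory sketch. It is not the patented implementation: the names `extract_phrases`, `shard_of`, `build_segment_shards`, `IndexServer`, and `update_segment` are hypothetical, the phrase identifier is a toy (single words and adjacent word pairs), and sharding by document ID modulo the shard count is an assumption chosen only because it yields disjoint document subsets per segment shard.

```python
from collections import defaultdict

NUM_INDEX_SHARDS = 2  # hypothetical shard count for this sketch


def extract_phrases(text):
    """Toy phrase identifier: each word and each adjacent word pair is a phrase."""
    words = text.lower().split()
    return set(words) | {" ".join(pair) for pair in zip(words, words[1:])}


def shard_of(doc_id):
    """Assumed sharding rule: documents are assigned to shards by ID."""
    return doc_id % NUM_INDEX_SHARDS


def build_segment_shards(segment_docs):
    """Split a segment's posting lists into shards with disjoint document sets.

    segment_docs maps doc_id -> text.
    Returns {shard_id: {phrase: set_of_doc_ids}}.
    """
    shards = defaultdict(lambda: defaultdict(set))
    for doc_id, text in segment_docs.items():
        for phrase in extract_phrases(text):
            shards[shard_of(doc_id)][phrase].add(doc_id)
    return shards


class IndexServer:
    """Serves one index shard, which is associated with one segment shard per segment."""

    def __init__(self, shard_id):
        self.shard_id = shard_id
        self.segment_shards = {}  # segment_id -> {phrase: set_of_doc_ids}
        self.index_shard = {}     # phrase -> sorted doc IDs; this is what queries read

    def swap_in(self, segment_id, updated_segment_shard):
        """Merge an updated segment shard and replace the served index shard."""
        pending = dict(self.segment_shards)
        pending[segment_id] = updated_segment_shard
        merged = defaultdict(set)
        for segment_shard in pending.values():
            for phrase, doc_ids in segment_shard.items():
                merged[phrase] |= doc_ids
        updated_index_shard = {p: sorted(d) for p, d in merged.items()}
        # The old shard keeps serving until these assignments replace it.
        self.segment_shards = pending
        self.index_shard = updated_index_shard


def update_segment(servers, segment_id, segment_docs):
    """Re-index one segment and push its shards to the matching index servers."""
    segment_shards = build_segment_shards(segment_docs)
    for server in servers:
        server.swap_in(segment_id, segment_shards.get(server.shard_id, {}))
```

In this sketch, calling `update_segment` again for an existing segment ID rebuilds only that segment's shards. Each server then forms a new merged shard and replaces the old one in a single assignment, so the postings of the other segments are carried over unchanged.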
Abstract
An information retrieval system uses phrases to index, retrieve, organize and describe documents. Phrases are extracted from the document collection. Documents are then indexed according to their included phrases, using phrase posting lists. The phrase posting lists are stored in a cluster of index servers. The phrase posting lists can be tiered into groups, and sharded into partitions. Phrases in a query are identified based on possible phrasifications. A query schedule is created from the phrases, and then optimized to reduce query processing and communication costs. The execution of the query schedule is managed to further reduce or eliminate query processing operations at various ones of the index servers.
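As a rough illustration of what "possible phrasifications" of a query could mean, the sketch below enumerates every way to split a query into contiguous chunks, where each chunk is either a single word or a phrase from a known phrase set. The abstract does not define the procedure, so the function name `phrasifications`, the `known_phrases` set, and the rule that single words are always allowed are all assumptions made for this example.

```python
def phrasifications(words, known_phrases):
    """Yield each split of `words` into single words and known multi-word phrases.

    Assumption for this sketch: a single word is always an acceptable chunk,
    while a multi-word chunk must appear in `known_phrases`.
    """
    if not words:
        yield []
        return
    for end in range(1, len(words) + 1):
        chunk = " ".join(words[:end])
        if end == 1 or chunk in known_phrases:
            for rest in phrasifications(words[end:], known_phrases):
                yield [chunk] + rest


# Hypothetical phrase set and query, used only to show the call shape.
known = {"new york", "york city", "new york city"}
candidates = list(phrasifications("new york city hotels".split(), known))
```

For this input, `candidates` holds four splits, ranging from four single words to `["new york city", "hotels"]`. A query scheduler would then choose among such candidates.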
245 Citations
10 Claims
1. A computer implemented method of maintaining a phrase index for a plurality of documents in a document collection, the method comprising:
providing a set of phrase posting lists, each phrase posting list associated with a phrase;
establishing a plurality of segments, each segment associated with a subset of the plurality of the documents;
periodically updating each segment by:
for documents associated with the segment, identifying phrases contained in the document, and updating the phrase posting list for each such phrase to include the document;
sharding the phrase posting lists for the identified phrases into a plurality of segment shards, each segment shard containing a disjoint subset of the list of documents in the segment that contain the phrase associated with the phrase posting list;
associating each segment shard with an index shard, such that at least one index shard is associated with a plurality of segment shards, each index shard being served by an index server;
determining a recently updated segment having updated segment shards;
for at least one index shard being served:
determining the index shard's associated updated segment shards, and merging the updated segment shards with the index shard to form an updated index shard; and
replacing the index shard with the updated index shard.
Dependent claims: 2, 3, 4, 5
6. A computer program product stored on a computer readable medium and comprising instructions that when executed cause a computer system to:
provide a set of phrase posting lists, each phrase posting list associated with a phrase;
establish a plurality of segments, each segment associated with a subset of a plurality of the documents;
periodically update each segment by:
for documents associated with the segment, identifying phrases contained in the document, and updating the phrase posting list for each such phrase to include the document;
sharding the phrase posting lists for the identified phrases into a plurality of segment shards, each segment shard containing a disjoint subset of the list of documents in the segment that contain the phrase associated with the phrase posting list;
associating each segment shard with an index shard, such that at least one index shard is associated with a plurality of segment shards, each index shard being served by an index server;
determine a recently updated segment having updated segment shards;
for at least one index shard being served:
determine the index shard's associated updated segment shards, and merging the updated segment shards with the index shard to form an updated index shard; and
replace the index shard with the updated index shard.
Dependent claims: 7, 8, 9, 10
Specification