Indexing system
First Claim
1. A system comprising:
- distributed computing devices represented by leaf nodes;
memory storing an index of documents, the index being distributed across multiple computing devices, the documents being assigned to respective computing devices, and wherein a first document is in a first set of documents assigned to a first leaf node and a second document is in a second set of documents assigned to the first leaf node, the first document being a base document, wherein;
terms in the first document are identified as document sharded and posting lists for the terms in the first document are stored in fast memory at the first leaf node, andposting lists for at least some terms in the second document that are term sharded and stored at computing devices other than the first leaf node; and
at least one root computing device that includes;
at least one processor,memory storing instructions that, when executed by the at least one processor, cause the root computing device to map documents to computing devices and map term-sharded terms to computing devices, andmemory storing instructions that, when executed by the at least one processor cause the system to access posting lists from the index in response to queries.
2 Assignments
0 Petitions
Accused Products
Abstract
A hybrid-sharded index includes document-sharded posting lists and term-sharded posting lists. Implementations include systems and methods using a distributed hybrid-sharded index. For example, a method may include receiving, at a root node, a query having a first term and a second term and determining, that the first term is term-sharded. The method may also include retrieving a term-sharded posting list for the first term from a first leaf node that stores the term-sharded posting list and determining, at the root node, a second leaf node that stores a document-sharded posting list for the second term. The method may include sending the second term and a sub-set of documents from the term-sharded posting list to the second leaf node, the sub-set being documents assigned to the second leaf node; and generating a search result using a response received from the second leaf node.
237 Citations
23 Claims
-
1. A system comprising:
-
distributed computing devices represented by leaf nodes; memory storing an index of documents, the index being distributed across multiple computing devices, the documents being assigned to respective computing devices, and wherein a first document is in a first set of documents assigned to a first leaf node and a second document is in a second set of documents assigned to the first leaf node, the first document being a base document, wherein; terms in the first document are identified as document sharded and posting lists for the terms in the first document are stored in fast memory at the first leaf node, and posting lists for at least some terms in the second document that are term sharded and stored at computing devices other than the first leaf node; and at least one root computing device that includes; at least one processor, memory storing instructions that, when executed by the at least one processor, cause the root computing device to map documents to computing devices and map term-sharded terms to computing devices, and memory storing instructions that, when executed by the at least one processor cause the system to access posting lists from the index in response to queries. - View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10)
-
-
11. A data storage system comprising:
-
a plurality of leaf computing devices in a distributed system; a root computing device in communication with the plurality of leaf computing devices, wherein at least one of the leaf computing devices includes; memory, at least some of which is fast-access memory, and at least some which is disk memory, the memory being configured in arrays; and processors for accessing the memory and processing posting lists stored in the memory, each array being accessible at least to one or more processors of the at least one leaf computing device, and wherein the memory stores; documents assigned to the at least one leaf computing device, document-sharded posting lists for terms appearing in documents of a first set of the documents, the document-sharded posting lists being stored in the fast-access memory, and term-sharded posting lists for terms appearing in remaining documents, the terms being assigned to respective leaf computing devices of the plurality of leaf computing devices regardless of the leaf computing device assignment of documents in which the terms appear, the term-sharded posting lists being stored primarily in the disk memory, wherein within each term-sharded posting list, references to documents are organized by the leaf computing device to which the documents are assigned. - View Dependent Claims (12, 13, 14, 15, 16)
-
-
17. A method comprising:
-
receiving, using at least one processor of a root node in a distributed environment, a query having a first term and a second term; determining, using the at least one processor of the root node, that the first term is term-sharded; retrieving a term-sharded posting list for the first term from a first leaf node that stores the term-sharded posting list, the first leaf node being one of a plurality of leaf nodes in the distributed environment; determining, using the at least one processor of the root node, a second leaf node from the plurality of leaf nodes that stores a document-sharded posting list for the second term; sending the second term and a sub-set of documents from the term-sharded posting list to the second leaf node, the sub-set being documents assigned to the second leaf node; and generating a search result using a response received from the second leaf node. - View Dependent Claims (18, 19, 20, 21, 22, 23)
-
Specification