Systems and methods for indexing content for fast and scalable retrieval
Abstract
Systems and methods for query processing and indexing of documents in connection with a content store in a computing system are provided. In various embodiments, an indexing model is provided that is optimized for fast, efficient and scalable retrieval of documents satisfying a query. The model makes mixed use of forward and inverted indexing representations and includes algorithms for achieving a balance between the two representations. When processing queries, fast and efficient generation of reverse chronologically ordered posting lists is enabled, so that logical operators can be executed efficiently on query result sets. A term expand index is also provided, wherein the overall terms included in the term expand index are decomposed into a plurality of lexicon files, which are combined when convenient so that queries of the content in the content store remain fast and scalable.
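The abstract does not say how the reverse chronologically ordered posting lists are built or combined. As a minimal sketch, assuming that document identifiers are assigned in increasing order over time (so that descending identifier order is reverse chronological order), logical AND and OR over two such lists can each be done in a single merge pass. The function names `intersect_desc` and `union_desc` are hypothetical and do not come from the patent.

```python
import heapq


def intersect_desc(a, b):
    """Logical AND of two posting lists sorted in descending doc-ID order."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] > b[j]:
            i += 1  # a[i] is newer than anything left in b; skip it
        else:
            j += 1  # b[j] is newer than anything left in a; skip it
    return out


def union_desc(a, b):
    """Logical OR of two descending posting lists, without duplicates."""
    out = []
    for doc_id in heapq.merge(a, b, reverse=True):
        if not out or out[-1] != doc_id:
            out.append(doc_id)
    return out


newest_first_a = [9, 7, 4, 2]
newest_first_b = [8, 7, 2, 1]
print(intersect_desc(newest_first_a, newest_first_b))
print(union_desc(newest_first_a, newest_first_b))
```

Because both inputs share the same ordering, the outputs keep it as well, so the result of one operator can be fed directly into the next.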
35 Claims
1. A method for generating an index set for use in connection with the querying of documents in a content store, the method comprising:
updating said index set by storing in the index set a collection of single document indices that index terms from a set of documents, wherein each single document index of the collection of document indices indexes terms from a single document of the set of documents according to a position of the terms in the single document;
monitoring a size of the collection of single document indices;
detecting that the size of the collection has attained a specified size;
in response to detecting that the size of the collection has attained the specified size, generating a new multiple document index by converting a subset of the collection of single document indices to the new multiple document index, wherein:
the new multiple document index indexes terms from the subset, sorted by term;
wherein the subset includes at least two single document indices that index terms from at least two single documents of the set of documents, and wherein the subset includes fewer than all single document indices;
wherein:
converting the subset to the new multiple document index causes the subset to be removed from the collection of single document indices;
each document in said content store is indexed by a single document index that indexes terms of the document if and only if the document is not indexed by a multiple document index that indexes terms of multiple documents;
the method is performed by one or more computing devices.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18.
19. An apparatus for generating an index set for use in connection with the querying of documents in a content store, the apparatus comprising:
one or more processors coupled to one or more storage devices;
wherein execution of one or more instructions stored on the one or more storage devices causes the one or more processors to perform:
updating said index set by storing in the index set a collection of single document indices that index terms from a set of documents, wherein each single document index of the collection of document indices indexes terms from a single document of the set of documents according to a position of the terms in the single document;
monitoring a size of the collection of single document indices;
detecting that the size of the collection has attained a specified size;
in response to detecting that the size of the collection has attained the specified size, generating a new multiple document index by converting a subset of the collection of single document indices to the new multiple document index, wherein:
the new multiple document index indexes terms from the subset, sorted by term;
wherein the subset includes at least two single document indices that index terms from at least two single documents of the set of documents, and wherein the subset includes fewer than all single document indices;
wherein:
converting the subset to the new multiple document index causes the subset to be removed from the collection of single document indices;
each document in said content store is indexed by a single document index that indexes terms of the document if and only if the document is not indexed by a multiple document index that indexes terms of multiple documents.
20. A method for querying documents in a content store, the method comprising:
accessing an index set including:
a first plurality of single document indices that index terms of a first set of documents according to a position of the terms in the first set of documents, wherein each single document index of the first plurality of single document indices indexes terms of a single document of the first set of documents, and
a second plurality of multiple document indices that index terms of a second set of documents, sorted by term, wherein each multiple document index of the second plurality of multiple document indices indexes terms of multiple documents of the second set of documents;
wherein each multiple document index of the second plurality of multiple document indices is generated by converting a subset of a collection of single document indices when a size of the collection of single document indices reaches a specified size;
receiving at least one query term for processing against said index set;
scanning the first plurality of single document indices for said at least one query term and retrieving a first set of document identifiers for documents that satisfy the at least one query term; and
querying against the second plurality of multiple document indices with said at least one query term and retrieving a second set of document identifiers for documents that satisfy the at least one query term;
wherein each document in said content store is indexed by a single document index in the first plurality of single document indices if and only if the document is not indexed by a multiple document index in the second plurality of multiple document indices;
wherein the method is performed by one or more computing devices.
Dependent claims: 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34.
35. An apparatus for querying documents in a content store, the apparatus comprising:
one or more processors coupled to one or more storage devices;
wherein execution of one or more instructions stored on the one or more storage devices causes the one or more processors to perform:
accessing an index set including:
a first plurality of single document indices that index terms of a first set of documents according to a position of the terms in the first set of documents, wherein each single document index of the first plurality of single document indices indexes text of a single document of the first set of documents, and
a second plurality of multiple document indices that index terms of a second set of documents, sorted by term, wherein each multiple document index of the second plurality of multiple document indices indexes terms of multiple documents of the second set of documents;
wherein each multiple document index of the second plurality of multiple document indices is generated by converting a subset of a collection of single document indices when a size of the collection of single document indices reaches a specified size;
receiving at least one query term for processing against said index set;
scanning the first plurality of single document indices for said at least one query term and retrieving a first set of document identifiers for documents that satisfy the at least one query term; and
querying against the second plurality of multiple document indices with said at least one query term and retrieving a second set of document identifiers for documents that satisfy the at least one query term;
wherein each document in said content store is indexed by a single document index in the first plurality of single document indices if and only if the document is not indexed by a multiple document index in the second plurality of multiple document indices.
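The index-generation steps recited in claims 1 and 19 can be illustrated with a small in-memory sketch. This is not the patented implementation. The class name `IndexSet`, the whitespace tokenizer, the choice to measure "size" as a count of single document indices, and the choice to convert the oldest indices first are all assumptions made for illustration. Document identifiers are assumed to be unique.

```python
from collections import defaultdict


class IndexSet:
    """Toy index set: forward indices per document plus merged inverted indices."""

    def __init__(self, specified_size=4, subset_size=3):
        # The claimed subset has at least two indices and fewer than all of them.
        if not 2 <= subset_size < specified_size:
            raise ValueError("need 2 <= subset_size < specified_size")
        self.specified_size = specified_size
        self.subset_size = subset_size
        self.single = {}  # doc_id -> terms in document position order
        self.multi = []   # each entry: list of (term, [doc_ids]) sorted by term

    def add_document(self, doc_id, text):
        # Single document index: terms stored by their position in the document.
        self.single[doc_id] = text.lower().split()
        # Monitor the collection size and convert once it attains the threshold.
        if len(self.single) >= self.specified_size:
            self._convert_subset()

    def _convert_subset(self):
        # Assumption: take the oldest indices (dicts keep insertion order).
        subset_ids = list(self.single)[: self.subset_size]
        postings = defaultdict(list)
        for doc_id in subset_ids:
            terms = self.single.pop(doc_id)  # conversion removes it from the collection
            for term in set(terms):
                postings[term].append(doc_id)
        # Multiple document index: entries sorted by term.
        self.multi.append(sorted((t, sorted(ids)) for t, ids in postings.items()))


index_set = IndexSet(specified_size=4, subset_size=3)
for doc_id, text in enumerate(["a b", "b c", "c d", "d e"], start=1):
    index_set.add_document(doc_id, text)
print(index_set.single)
print(index_set.multi)
```

Popping each converted document out of `single` is what keeps the "if and only if" condition true in this sketch: a document is reachable through exactly one of the two representations at any time.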
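The query steps of claims 20 and 35 can be sketched in the same spirit. The data layout below is an assumption: single document indices are a mapping from document identifier to a position-ordered term list, and each multiple document index is a pair of parallel lists (sorted terms, postings). The function names are hypothetical. The claims recite retrieving two sets of document identifiers; taking their union in the usage lines is an illustrative choice, not something the claims state.

```python
from bisect import bisect_left


def scan_single(single_indices, term):
    """Linear scan of every single document index for the query term."""
    return {doc_id for doc_id, terms in single_indices.items() if term in terms}


def lookup_multi(multi_indices, term):
    """Binary search for the query term in each term-sorted multiple document index."""
    hits = set()
    for terms, postings in multi_indices:
        i = bisect_left(terms, term)
        if i < len(terms) and terms[i] == term:
            hits.update(postings[i])
    return hits


def query(single_indices, multi_indices, term):
    """Return the first and second sets of document identifiers."""
    return scan_single(single_indices, term), lookup_multi(multi_indices, term)


# Hypothetical data: documents 4 and 5 are still in single document indices,
# while documents 1 to 3 were already converted into one multiple document index.
single_indices = {4: ["d", "e"], 5: ["a", "d"]}
multi_indices = [(["a", "b", "c", "d"], [[1], [1, 2], [2, 3], [3]])]

first, second = query(single_indices, multi_indices, "d")
print(sorted(first | second))
```

Since each document lives in exactly one of the two representations, the two result sets never overlap, so combining them needs no deduplication beyond the set union shown here.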
Specification