Multiple index based information retrieval system
Abstract
An information retrieval system uses phrases to index, retrieve, organize and describe documents. Phrases are identified that predict the presence of other phrases in documents. Documents are then indexed according to their included phrases. The document index is partitioned into multiple indexes, including a primary index and a secondary index. The primary index stores phrase posting lists with relevance rank ordered documents. The secondary index stores excess documents from the posting lists in document order.
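The primary/secondary partition described in the abstract can be illustrated with a minimal sketch. It is not the patented implementation: the `Posting` type, the `split_posting_list` function, the `primary_capacity` parameter, and the tie-breaking rule are all hypothetical names and choices introduced here for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Posting:
    """One posting-list entry (hypothetical type, not from the patent)."""
    doc_id: int
    score: float  # relevance score of the document for the phrase


def split_posting_list(postings, primary_capacity):
    """Split one phrase's postings into a primary and a secondary list.

    Assumption: the primary list keeps the `primary_capacity` highest-scoring
    documents in relevance rank order; the excess goes to the secondary list
    in document (doc_id) order.
    """
    # Rank by descending score; break ties by doc_id (an arbitrary choice).
    ranked = sorted(postings, key=lambda p: (-p.score, p.doc_id))
    primary = ranked[:primary_capacity]
    # Excess documents are stored in document order, not rank order.
    secondary = sorted(ranked[primary_capacity:], key=lambda p: p.doc_id)
    return primary, secondary


postings = [
    Posting(7, 0.2),
    Posting(3, 0.9),
    Posting(12, 0.5),
    Posting(5, 0.1),
    Posting(9, 0.7),
]
primary, secondary = split_posting_list(postings, primary_capacity=3)
print([p.doc_id for p in primary])    # [3, 9, 12]
print([p.doc_id for p in secondary])  # [5, 7]
```

With this split, every document in the secondary list scores no higher than the lowest-ranked document in the primary list, which matches the relationship between the two indexes that the abstract describes.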
17 Claims
1. A computer-implemented method comprising:
assigning each phrase identified in a document collection a phrase number based on frequency of occurrence of the phrase in the document collection, wherein each indexed document has a document identifier;
creating a phrase-sharded index for a search engine by, for each identified phrase:
  assigning the phrase to a server of a plurality of index servers based on a hash of the assigned phrase number, and
  storing a posting list of identifiers of documents of the document collection that contain the phrase on the assigned index server;
identifying, using the index, documents responsive to a search query; and
providing information about the identified documents to a requestor of the search query.
Dependent claims: 2, 3, 4, 5, 6
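The steps of claim 1 can be sketched in a few lines. This is a rough illustration under stated assumptions, not the claimed implementation: the claim does not specify how frequency maps to a phrase number or which hash is used, so the descending-frequency numbering, the alphabetical tie-break, SHA-256, and all function names below are hypothetical.

```python
import hashlib
from collections import Counter, defaultdict


def assign_phrase_numbers(doc_phrases):
    """Number phrases by frequency across the collection.

    Assumption: the most frequent phrase gets number 0; ties are broken
    alphabetically so the numbering is deterministic.
    """
    freq = Counter(p for phrases in doc_phrases.values() for p in phrases)
    ordered = sorted(freq, key=lambda p: (-freq[p], p))
    return {phrase: number for number, phrase in enumerate(ordered)}


def server_for(phrase_number, num_servers):
    """Pick an index server from a hash of the phrase number (SHA-256 assumed)."""
    digest = hashlib.sha256(str(phrase_number).encode("ascii")).digest()
    return int.from_bytes(digest[:8], "big") % num_servers


def build_sharded_index(doc_phrases, num_servers):
    """Store each phrase's posting list on exactly one (simulated) server."""
    numbers = assign_phrase_numbers(doc_phrases)
    shards = [defaultdict(list) for _ in range(num_servers)]
    for doc_id in sorted(doc_phrases):
        for phrase in set(doc_phrases[doc_id]):
            n = numbers[phrase]
            shards[server_for(n, num_servers)][n].append(doc_id)
    return numbers, shards


def lookup(phrase, numbers, shards):
    """Return the document identifiers that contain `phrase`."""
    n = numbers.get(phrase)
    if n is None:
        return []
    return shards[server_for(n, len(shards))].get(n, [])


docs = {
    1: ["new york", "hot dog"],
    2: ["new york"],
    3: ["hot dog", "new york", "blue sky"],
}
numbers, shards = build_sharded_index(docs, num_servers=4)
print(numbers)                             # {'new york': 0, 'hot dog': 1, 'blue sky': 2}
print(lookup("hot dog", numbers, shards))  # [1, 3]
```

Because the server is derived from the phrase number alone, a query handler can recompute the same hash to find the one server that holds a phrase's posting list, without consulting a routing table.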
7. An information retrieval system for retrieving information from a corpus of documents, the system comprising:
a primary index server system comprising a primary index, the primary index including primary phrase posting lists, each primary phrase posting list being associated with a phrase; and
a secondary index server system comprising a secondary index, the secondary index including secondary phrase posting lists, each secondary phrase posting list being associated with a primary phrase posting list in the primary index, and including documents that contain the phrase that is associated with the primary phrase posting list in the primary index and which have relevance scores less than the relevance score of a lowest ranked document in the primary phrase posting list for the phrase,
wherein the primary index server system comprises multiple machines, and wherein each phrase is assigned an identification number and has a primary phrase posting list located on one of the machines.
Dependent claims: 8, 9, 10
11. A non-transitory computer-readable medium having stored thereon instructions that, when executed by at least one processor, cause a computing system to perform operations including:
assigning each phrase identified in a document collection a phrase number based on frequency of occurrence of the phrase in the document collection, wherein each document has a respective document identifier; and
creating a phrase-sharded index for a search engine by, for each identified phrase:
  assigning the phrase to a server of a plurality of index servers based on a hash of the assigned phrase number, and
  storing a posting list of identifiers of documents of the document collection that contain the phrase on the assigned index server;
identifying, using the index, documents responsive to a search query; and
providing information about the identified documents to a requestor of the search query.
Dependent claims: 12, 13, 14, 15, 16, 17
Specification