METHOD AND APPARATUS FOR SEARCH
Abstract
Methods and apparatuses for search are provided and related to the field of search technology. A method may include: performing term segmentation for grabbed documents to count a term frequency of each term, the term frequency of the term representing a number of the grabbed documents containing the term; generating a high frequency term inverted index and a low frequency term inverted index respectively, wherein the high frequency term inverted index contains terms having a term frequency higher than a predefined threshold, and the low frequency term inverted index contains terms having a term frequency not higher than the predefined threshold; and loading the high frequency term inverted index and the low frequency term inverted index respectively to different retrieval modules, the different retrieval modules respectively corresponding to mutually independent storage devices.
14 Claims
1. A method for search, comprising:
performing term segmentation for grabbed documents to count a term frequency of each term, the term frequency of the term representing a number of the grabbed documents containing the term;
generating a high frequency term inverted index and a low frequency term inverted index respectively, wherein the high frequency term inverted index contains terms having a term frequency higher than a predefined threshold, and the low frequency term inverted index contains terms having a term frequency not higher than the predefined threshold; and
loading the high frequency term inverted index and the low frequency term inverted index respectively to different retrieval modules, the retrieval modules respectively corresponding to mutually independent storage devices.
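As an illustration only, the indexing steps of claim 1 can be sketched in Python as follows. The whitespace tokenizer, the function names `segment` and `build_split_indexes`, and the sample documents are assumptions made for this sketch and do not come from the claims; real term segmentation and the loading of each index onto its own storage device are outside its scope.

```python
from collections import defaultdict


def segment(text):
    # Hypothetical stand-in for term segmentation: lowercase whitespace split.
    return text.lower().split()


def build_split_indexes(documents, threshold):
    """Build (high_index, low_index) from a dict of doc_id -> text.

    The term frequency used here is the number of documents that contain
    the term, matching the definition given in claim 1.
    """
    postings = defaultdict(set)
    for doc_id, text in documents.items():
        for term in set(segment(text)):
            postings[term].add(doc_id)

    high_index, low_index = {}, {}
    for term, doc_ids in postings.items():
        # Strictly higher than the threshold goes to the high frequency index;
        # everything else goes to the low frequency index.
        target = high_index if len(doc_ids) > threshold else low_index
        target[term] = sorted(doc_ids)
    return high_index, low_index


docs = {
    1: "search engine index",
    2: "inverted index search",
    3: "search latency",
}
high_index, low_index = build_split_indexes(docs, threshold=1)
```

In this sketch the two dictionaries would each be handed to a separate retrieval module; how those modules map onto independent storage devices is left open.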
2. The method according to claim 1, further comprising:
generating a search result by receiving, by at least one of the retrieval modules, a keyword in a search phrase and retrieving a document list for the keyword in the inverted index corresponding to the retrieval module which received the keyword.
3. The method according to claim 2, wherein generating the search result comprises:
receiving the keyword by a first retrieval module of the retrieval modules to determine whether there is a document list for the keyword in the inverted index corresponding to the first retrieval module;
if there is the document list for the keyword in the inverted index corresponding to the first retrieval module, generating the document list as the search result; and
if there is no document list for the keyword in the inverted index corresponding to the first retrieval module, inputting the keyword to a second retrieval module of the retrieval modules to search for the document list for the keyword in the inverted index corresponding to the second retrieval module to generate the search result.
4. The method according to claim 3, wherein the first retrieval module comprises a retrieval module loaded with the low frequency term inverted index, and the second retrieval module comprises a retrieval module loaded with the high frequency term inverted index.
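A minimal sketch of the two-stage lookup described in claims 3 and 4, assuming each retrieval module simply wraps an in-memory dictionary. The class `RetrievalModule`, the function `search_keyword`, and the choice to return an empty list when neither module holds the keyword are assumptions for illustration; the claims do not specify that last case.

```python
class RetrievalModule:
    """Hypothetical retrieval module holding one inverted index."""

    def __init__(self, name, inverted_index):
        self.name = name
        self.inverted_index = inverted_index

    def lookup(self, keyword):
        # Returns the document list, or None if the keyword is not indexed here.
        return self.inverted_index.get(keyword)


def search_keyword(keyword, first_module, second_module):
    # Query the first module; fall back to the second only on a miss.
    result = first_module.lookup(keyword)
    if result is not None:
        return result
    result = second_module.lookup(keyword)
    # Assumption: an unknown keyword yields an empty result.
    return result if result is not None else []


# Per claim 4, the first module carries the low frequency index and the
# second module carries the high frequency index.
low_module = RetrievalModule("low", {"engine": [1], "inverted": [2]})
high_module = RetrievalModule("high", {"search": [1, 2, 3]})
```

For example, `search_keyword("search", low_module, high_module)` misses in the low frequency module and is answered by the high frequency module.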
5. The method according to claim 1, wherein loading the high frequency term inverted index and the low frequency term inverted index respectively to different retrieval modules comprises:
loading the high frequency term inverted index into a first retrieval module, and loading the low frequency term inverted index into a second retrieval module;
wherein loading the low frequency term inverted index into the second retrieval module further comprises:
generating M data blocks each comprising document lists for N low frequency terms, wherein M and N are integers larger than 1; and
saving in a hash bucket each of the low frequency terms and a storage location for a data block containing the document list for this low frequency term.
6. The method according to claim 2, wherein loading the high frequency term inverted index and the low frequency term inverted index respectively to different retrieval modules comprises:
loading the high frequency term inverted index into a first retrieval module, and loading the low frequency term inverted index into a second retrieval module;
wherein loading the low frequency term inverted index into the second retrieval module further comprises:
generating M data blocks each comprising document lists for N low frequency terms, wherein M and N are integers larger than 1; and
saving in a hash bucket each of the low frequency terms and a storage location for a data block containing the document list for this low frequency term;
if the retrieval module which received the keyword is the second retrieval module, retrieving the document list for the keyword comprises:
searching for the keyword in the hash bucket;
if the keyword exists in the hash bucket, acquiring from the hash bucket the storage location of the data block in which the document list for the keyword is located; and
extracting the document list for the keyword from the data block by reading and traversing the data block in which the document list for the keyword is located from the storage location.
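The block layout and lookup of claims 5 and 6 might be sketched as below. This is an assumption-laden illustration: an in-memory `io.BytesIO` buffer stands in for the storage device, blocks are serialized as JSON, a Python `dict` plays the role of the hash bucket, and a storage location is modeled as an `(offset, length)` pair. None of these choices is stated in the claims.

```python
import io
import json


def build_blocks(low_index, terms_per_block):
    """Pack document lists into blocks of N terms and record block locations."""
    storage = io.BytesIO()   # stand-in for the storage device
    hash_bucket = {}         # term -> (offset, length) of its data block
    terms = sorted(low_index)
    for start in range(0, len(terms), terms_per_block):
        group = terms[start:start + terms_per_block]
        block = [[term, low_index[term]] for term in group]
        payload = json.dumps(block).encode("utf-8")
        offset = storage.tell()
        storage.write(payload)
        for term in group:
            hash_bucket[term] = (offset, len(payload))
    return storage, hash_bucket


def lookup(keyword, storage, hash_bucket):
    # Step 1: search for the keyword in the hash bucket.
    location = hash_bucket.get(keyword)
    if location is None:
        return None
    # Step 2: read the whole data block from its storage location.
    offset, length = location
    storage.seek(offset)
    block = json.loads(storage.read(length).decode("utf-8"))
    # Step 3: traverse the block to extract the keyword's document list.
    for term, doc_ids in block:
        if term == keyword:
            return doc_ids
    return None


low_index = {"engine": [1], "inverted": [2], "latency": [3], "shard": [4, 5]}
storage, hash_bucket = build_blocks(low_index, terms_per_block=2)  # M = 2, N = 2
```

Grouping several short document lists into one block means a single read fetches the block, at the cost of a short traversal inside it; the hash bucket only needs to store one location per term.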
7. The method according to claim 2, wherein the search phrase comprises a plurality of keywords, and generating the search result comprises:
generating a search result for each of the keywords respectively by receiving, by at least one of the retrieval modules, the keywords to be searched and retrieving a document list for the keyword in the inverted index corresponding to the retrieval module which received the keywords; and
acquiring a final search result by performing an intersection operation on the generated search results for the keywords.
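The intersection step of claim 7 can be illustrated with a two-pointer merge over sorted document lists. The helper names `intersect_sorted` and `search_phrase`, the assumption that document lists are sorted by document identifier, and the shortest-list-first ordering are choices made for this sketch rather than details given in the claim.

```python
def intersect_sorted(a, b):
    """Intersect two ascending lists of document identifiers."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def search_phrase(keywords, lookup):
    """Retrieve one document list per keyword, then intersect them all."""
    lists = [lookup(keyword) for keyword in keywords]
    if not lists:
        return []
    # Starting from the shortest list keeps intermediate results small.
    lists.sort(key=len)
    result = lists[0]
    for other in lists[1:]:
        result = intersect_sorted(result, other)
        if not result:
            break
    return result


# Hypothetical per-keyword lookup; in the claimed method each keyword would be
# answered by whichever retrieval module holds it.
index = {"search": [1, 2, 3], "index": [1, 2], "inverted": [2]}


def lookup(keyword):
    return index.get(keyword, [])
```

Here `search_phrase(["search", "index"], lookup)` keeps only the documents that contain both keywords.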
8. An apparatus for search, comprising:
a term segmentation unit configured to perform term segmentation for grabbed documents to count a term frequency of each term, the term frequency of the term representing a number of the grabbed documents containing the term;
an inverted index generation unit configured to generate a high frequency term inverted index and a low frequency term inverted index respectively, wherein the high frequency term inverted index contains terms having a term frequency higher than a predefined threshold, and the low frequency term inverted index contains terms having a term frequency not higher than the predefined threshold; and
an inverted index loading unit configured to load the high frequency term inverted index and the low frequency term inverted index respectively to different retrieval modules, the different retrieval modules respectively corresponding to mutually independent storage devices.
9. The apparatus according to claim 8, further comprising:
a search result generation unit configured to generate a search result by receiving, by at least one of the retrieval modules, a keyword in a search phrase and retrieving a document list for the keyword in the inverted index corresponding to the retrieval module which received the keyword.
10. The apparatus according to claim 9, wherein the search result generation unit comprises:
a determination subunit configured to determine, by receiving the keyword by a first retrieval module of the retrieval modules, whether there is a document list for the keyword in the inverted index corresponding to the first retrieval module;
a first generation subunit configured to generate the document list as the search result if there is the document list for the keyword in the inverted index corresponding to the first retrieval module; and
a second generation subunit configured to input the keyword to a second retrieval module of the retrieval modules to search for the document list for the keyword in the inverted index corresponding to the second retrieval module to generate the search result if there is no document list for the keyword in the inverted index corresponding to the first retrieval module.
11. The apparatus according to claim 10, wherein the first retrieval module comprises a retrieval module loaded with the low frequency term inverted index, and the second retrieval module comprises a retrieval module loaded with the high frequency term inverted index.
12. The apparatus according to claim 8, wherein the inverted index loading unit is further configured to:
load the high frequency term inverted index into a first retrieval module, and load the low frequency term inverted index into a second retrieval module;
wherein the inverted index loading unit further comprises:
a data block generation subunit configured to generate M data blocks each comprising document lists for N low frequency terms, wherein M and N are integers larger than 1; and
a saving subunit configured to save in a hash bucket each of the low frequency terms and a storage location for a data block containing the document list for this low frequency term.
13. The apparatus according to claim 9, wherein the inverted index loading unit is further configured to:
load the high frequency term inverted index into a first retrieval module, and load the low frequency term inverted index into a second retrieval module;
wherein the inverted index loading unit further comprises:
a data block generation subunit configured to generate M data blocks each comprising document lists for N low frequency terms, wherein M and N are integers larger than 1; and
a saving subunit configured to save in a hash bucket each of the low frequency terms and a storage location for a data block containing the document list for this low frequency term;
wherein the search result generation unit further comprises:
a searching subunit configured to search for the keyword in the hash bucket if the retrieval module which received the keyword is the second retrieval module;
an acquisition subunit configured to acquire from the hash bucket the storage location of the data block in which the document list for the keyword is located, if the keyword exists in the hash bucket; and
an extraction subunit configured to extract the document list for the keyword from the data block by reading and traversing the data block in which the document list for the keyword is located from the storage location.
14. The apparatus according to claim 9, wherein the search result generation unit further comprises:
a third generation subunit configured to generate, when the search phrase comprises a plurality of keywords, a search result for each of the keywords respectively by inputting the keyword to be searched into at least one of the retrieval modules respectively and retrieving a document list for the keyword in the inverted index corresponding to the retrieval module to which the keyword was inputted; and
an intersection subunit configured to acquire a final search result by performing an intersection operation on the generated search results for the keywords.
Specification