Text retrieval method and system using signature of nearby words
First Claim
1. A method for retrieving relevant documents in a corpus of documents based on a search query, the method comprising the steps of:
storing the corpus of documents in a storage device;
inputting the corpus of documents and the search query on an input device;
generating an index term signature for each index term in the corpus of documents, the index term signature being based on a hash function of a predetermined number of adjacent terms adjacent to the index term;
generating a list containing the index terms in the corpus of documents, the list associating each index term with a document identifier and corresponding index term signatures occurring in the document;
generating a query signature for the search query excluding a reference term, the query signature being based on the hash function of the adjacent query terms adjacent to the reference term;
comparing the query signature to the index term signatures in the list to identify index term signatures that match the query signature, the reference term of the query signature being equivalent to a searched index term of the list; and
outputting a document list indicating the documents that contain the identified index term signatures on an output device.
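The steps recited in claim 1 can be illustrated with a minimal sketch. It is not the patented implementation: the function names (`signature`, `build_index`, `search`), the use of CRC-32 as the hash function, the choice of the first query term as the reference term, and the use of only the terms that follow an index term are all assumptions made for illustration.

```python
import zlib
from collections import defaultdict

def signature(adjacent_terms):
    """Hash a sequence of adjacent terms into one integer (CRC-32 is an assumed choice)."""
    return zlib.crc32(" ".join(adjacent_terms).encode("utf-8"))

def build_index(corpus, window):
    """Build the list: index term -> set of (document identifier, signature) pairs."""
    index = defaultdict(set)
    for doc_id, text in corpus.items():
        terms = text.lower().split()
        for i, term in enumerate(terms):
            following = terms[i + 1 : i + 1 + window]
            # Assumption: terms with fewer than `window` followers get no signature.
            if len(following) == window:
                index[term].add((doc_id, signature(following)))
    return index

def search(index, query):
    """Return the documents whose index term signature matches the query signature."""
    terms = query.lower().split()
    reference, rest = terms[0], terms[1:]   # assumed: first query term is the reference term
    query_sig = signature(rest)             # query signature excludes the reference term
    postings = index.get(reference, ())
    return sorted({doc_id for doc_id, sig in postings if sig == query_sig})

# Hypothetical three-document corpus, indexed with a window of two following terms.
corpus = {
    "d1": "the quick brown fox jumps",
    "d2": "a quick red fox sleeps",
    "d3": "brown fox quick",
}
index = build_index(corpus, window=2)
print(search(index, "quick brown fox"))
```

In this sketch the window must equal the number of query terms minus one, so a three-word query is answered from an index built with `window=2`. Because a hash can collide, a matching signature identifies a candidate document rather than proving that the phrase occurs in it.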
5 Assignments
0 Petitions
Abstract
A method for searching a document corpus for query terms includes generating a list of document terms including a term signature for each term based upon characteristics of a number of adjacent terms. The term signatures can be generated by generating a bit vector for each term within a predetermined adjacent number of terms from each document term, such as through application of a hash function. The bit vectors can then be combined to form the term signature. The word signature alternatively can be generated using one or more morphological properties of the terms. The predetermined adjacent number of terms can be the number of search terms minus one, and may precede, follow, or both precede and follow the document term for which the term signature is generated. A search signature is generated for the query terms excluding a reference term, based upon the predetermined characteristics. The term signature of the reference term is compared with the search signature, and an indication is provided when the term signature of the reference term and the search signature match.
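The bit-vector construction described in the abstract can be sketched as follows. The abstract does not say how the bit vectors are combined; bitwise OR is assumed here, as are the 64-bit width, the CRC-32 hash, and the helper names `term_bits`, `combine`, and `term_signature`.

```python
import zlib

WIDTH = 64  # assumed signature width in bits

def term_bits(term):
    """Bit vector for one term: a single bit selected by hashing the term."""
    return 1 << (zlib.crc32(term.encode("utf-8")) % WIDTH)

def combine(terms):
    """Combine the per-term bit vectors into one signature (bitwise OR is assumed)."""
    sig = 0
    for t in terms:
        sig |= term_bits(t)
    return sig

def term_signature(terms, i, before, after):
    """Signature of terms[i], built from up to `before` preceding and `after` following terms."""
    start = max(0, i - before)
    neighbors = terms[start:i] + terms[i + 1 : i + 1 + after]
    return combine(neighbors)

doc = "text retrieval using signature of nearby words".split()
query = ["using", "signature", "of"]
reference = "signature"

# Search signature: every query term except the reference term.
search_sig = combine([t for t in query if t != reference])

# Term signature of the reference term as it occurs in the document (position 3),
# using one preceding and one following term.
doc_sig = term_signature(doc, doc.index(reference), before=1, after=1)

print(doc_sig == search_sig)
```

With OR as the combining step the signature does not record the order of the neighboring terms, and distinct terms may share a bit, so a match is best treated as a candidate to be confirmed against the document text.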
113 Citations
34 Claims
1. Method claim as recited in full under First Claim above. Dependent claims: 2-16.
17. An apparatus for retrieving relevant documents from a corpus of documents based on a search query, the apparatus comprising:
storage means for storing the corpus of documents;
input means for inputting the corpus of documents and the search query;
a controller for retrieving relevant documents from the corpus of documents, the controller comprising:
    index term signature generating means for generating an index term signature for each index term in the corpus of documents, the index term signature being based on a hash function of a predetermined number of adjacent terms adjacent to the index term;
    list generating means for generating a list containing the index terms in the corpus of documents, the list associating an index term with a document identifier and corresponding index term signatures occurring in the document;
    query signature generating means for generating a query signature for the search query excluding a reference term, the query signature being based on the hash function of the adjacent query terms adjacent to the reference term; and
    comparing means for comparing the query signature to the index term signatures in the list to identify index term signatures that match the query signature of the reference term, the reference term of the query signature being equivalent to the index term of the list; and
output means for outputting a document list indicating the documents that contain the identified index term signatures.
Dependent claims: 18-34.
Specification