Method for automatically selecting collections to search in full text searches
Abstract
A method of selecting a subset of a plurality of document collections for searching in response to a predetermined query is based on accessing a meta-information data file that describes the query-significant search terms that are present in a particular document collection, correlated to normalized document usage frequencies of such terms within the documents of each document collection. By accessing the meta-information data file, a relevance score for each of the document collections is determined. The method then returns an identification of the subset of the plurality of document collections having the highest relevance scores for use in evaluating the predetermined query. The meta-information data file may be constructed to include document-normalized term frequencies and other contextual information that can be evaluated in the application of a query against a particular document collection. This other contextual information may include term proximity, capitalization, and phraseology, as well as document-specific information such as, but not limited to, collection name, document type, document title, authors, date of publication, publisher, keywords, summary description of contents, price, language, country of publication, and publication name. Statistical data for the collection may include, but is not limited to, the number of documents in the collection, the total size of the collection, the average document size, and the average number of words in the base document collection.
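As a rough illustration of what one record in such a meta-information data file might hold, the sketch below summarizes a collection into document-normalized term frequencies plus a few of the statistics the abstract mentions. The names `CollectionRecord` and `build_record`, the whitespace tokenization, and the field layout are all assumptions made for illustration; the source does not prescribe a particular format.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class CollectionRecord:
    """Hypothetical meta-file record for one document collection."""
    name: str
    num_documents: int = 0
    avg_words_per_document: float = 0.0
    # term -> sum over documents of (occurrences / document length)
    normalized_term_freq: dict = field(default_factory=dict)
    # term -> number of documents in the collection that contain the term
    doc_count: dict = field(default_factory=dict)


def build_record(name, documents):
    """Summarize a list of plain-text documents into one record."""
    record = CollectionRecord(name=name, num_documents=len(documents))
    total_words = 0
    for text in documents:
        words = text.lower().split()  # naive tokenization (assumption)
        if not words:
            continue
        total_words += len(words)
        for term, count in Counter(words).items():
            # Normalize each count by the length of its own document.
            record.normalized_term_freq[term] = (
                record.normalized_term_freq.get(term, 0.0) + count / len(words)
            )
            record.doc_count[term] = record.doc_count.get(term, 0) + 1
    if documents:
        record.avg_words_per_document = total_words / len(documents)
    return record
```

Normalizing each count by its document's length keeps a few very long documents from dominating a collection's summary, which is one plausible reading of "document-normalized term frequencies".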
20 Claims
1. A method of selecting a subset of a set of document collections containing documents to search based upon a predetermined query text including a search term, said method comprising the steps of:
a) accessing a meta-file representative of said set of document collections, including a search term occurrence list;
b) determining a document frequency term for said search term relative to each of said document collections within said set of document collections and an inverse collection frequency term for said set of document collections, said inverse collection frequency term being proportional to a ratio of the number of documents in said set of document collections and the number of documents in set of document collections that include said search term;
c) determining a term ranking for each of said document collections that is proportional to the respective said document frequency terms and said inverse collection frequency term;
d) selecting said subset of said set of document collections based on the relative term ranking of each of said document collections.
Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9)
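Steps a) through d) of claim 1 can be read as a collection-level analogue of TF-IDF weighting. The following is a minimal sketch under stated assumptions, not the patented implementation: the meta-file is modeled as a plain dictionary, the document frequency term is taken to be the share of a collection's documents containing the term, and the function names `rank_collections` and `select_collections` are hypothetical.

```python
def rank_collections(meta_file, term):
    """Return {collection name: term ranking} for one search term.

    meta_file maps collection name -> (num_documents, {term: docs_with_term}).
    This layout is an assumption made for illustration.
    """
    total_docs = sum(n for n, _ in meta_file.values())
    docs_with_term = sum(counts.get(term, 0) for _, counts in meta_file.values())
    if docs_with_term == 0:
        return {name: 0.0 for name in meta_file}
    # Inverse collection frequency: a single value for the whole set, the
    # ratio of all documents to the documents that contain the term.
    icf = total_docs / docs_with_term
    rankings = {}
    for name, (num_docs, counts) in meta_file.items():
        # Document frequency term: share of this collection's documents
        # that contain the term (one possible choice, assumed here).
        df = counts.get(term, 0) / num_docs if num_docs else 0.0
        rankings[name] = df * icf
    return rankings


def select_collections(meta_file, term, k=2):
    """Pick the k collections with the highest term ranking."""
    rankings = rank_collections(meta_file, term)
    return sorted(rankings, key=rankings.get, reverse=True)[:k]
```

Because the inverse collection frequency is shared by every collection for a single term, it does not change the ordering in this one-term sketch. It would start to matter when rankings for several query terms are combined, since rarer terms would then carry more weight.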
10. A method of selecting a subset of a set of document collections to search based upon a predetermined query text optionally including any of a search term, a pre-search qualifier, and a post-search qualifier, said method comprising the steps of:
a) accessing a meta-information data file that includes a plurality of records representing said set of document collections, each said document collection representing a plurality of documents;
b) pre-qualifying a set of said plurality of records based upon said pre-search qualifier, if any;
c) determining a search term frequency value for each of said pre-qualified set of said plurality of records with respect to said search term if any, said search term frequency values being normalized against a common factor representative of the frequency of qualifying occurrences of said search term within said documents of said pre-qualified set;
d) determining a search term ranking for each of said pre-qualified set of said plurality of records based upon said frequency values and said common factor; and
e) selecting said subset of said set of document collections to search based on said search term rankings and said post-search qualifier, if any.
Dependent Claims (11, 12, 13)
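The optional pre-search and post-search qualifiers of claim 10 can be pictured as filters applied before and after ranking. The sketch below is illustrative only: the record layout (dictionaries with `name` and `term_counts` keys), the use of predicates for the qualifiers, and the choice of the total term count across the pre-qualified set as the "common factor" are all assumptions, and `select_with_qualifiers` is a hypothetical name.

```python
def select_with_qualifiers(records, term=None, pre=None, post=None):
    """Select collection names using optional qualifiers.

    records: list of dicts with 'name' and 'term_counts' keys (assumed layout).
    pre:     optional predicate on a record, applied before ranking.
    post:    optional predicate on (record, ranking), applied after ranking.
    """
    # b) Pre-qualify the records, if a pre-search qualifier was given.
    qualified = [r for r in records if pre is None or pre(r)]

    # c) Normalize each record's frequency against a common factor: here,
    #    the total occurrences of the term across the pre-qualified set.
    if term is None:
        ranked = [(r, 0.0) for r in qualified]
    else:
        common = sum(r["term_counts"].get(term, 0) for r in qualified)
        ranked = [
            (r, r["term_counts"].get(term, 0) / common if common else 0.0)
            for r in qualified
        ]

    # d) Order the records by their search term ranking, highest first.
    ranked.sort(key=lambda pair: pair[1], reverse=True)

    # e) Apply the post-search qualifier, if any, to the ranked records.
    return [r["name"] for r, score in ranked if post is None or post(r, score)]
```

A call might pass `pre=lambda r: r["language"] == "en"` to restrict the candidates and `post=lambda r, score: score > 0` to drop collections that never mention the term; both predicates and the `language` field are hypothetical examples.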
14. A method for selecting a subset of a set of document collections to search dependent on a predetermined query term, each said document collection including a plurality of documents, said set of document collections being represented as a meta-index that stores search terms and statistical data representative of said set of document collections and said document collections being represented by respective collection indexes that store search terms and statistical data representative of the documents within respective document collections, said method comprising the steps of:
a) determining a collection ranking for each said document collection with respect to said predetermined query term with reference to said meta-index, each said collection ranking being normalized with respect to the qualified occurrence of said predetermined query term within the documents of said set of document collections;
c) identifying said document collections within said subset of document collections potentially most relevant for searching based on said predetermined query term.
Dependent Claims (15, 16, 17, 18, 19, 20)
Specification