Search engine and method with improved relevancy, scope, and timeliness
Abstract
A search engine and a method achieve timeliness of documents returned in a search result by a relevancy feedback mechanism driven by the frequency with which a URL is returned in recent searches. The relevancy feedback mechanism includes one or more random processes which determine whether or not a cached or indexed web page associated with a URL in the search result should be refreshed. In addition, the random processes also determine whether or not hyperlinks in the cached or indexed web page should be followed to access related web pages. Accesses of web pages resulting from the operations of the random processes are used to update any document index maintained by the search engine. Relevancy scoring functions implemented in look-up tables are also disclosed. A more accurate relevancy scoring function is achieved using a lexicon based on anchortexts of extracted hyperlinks of web documents.
113 Citations
40 Claims
1. A method for providing a training set to build a statistical relevancy scoring function to be used in a search engine, comprising:
(a) identifying an initial set of hypertext documents as a training set of relevant documents;
(b) identifying all hyperlinks included in each hypertext document of the training set identified and their associated anchortexts;
(c) including the hypertext documents pointed to by the hyperlinks identified in (b) in the training set; and
(d) including the anchortexts associated with the hyperlinks identified in (b) in a lexicon.
Dependent claims: 2, 3, 4, 5, 6
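Steps (a) through (d) of claim 1 can be illustrated with a minimal sketch. It is not the patented implementation: the names `AnchorExtractor` and `expand_training_set`, the caller-supplied `fetch` function, the whitespace normalization of anchortexts, and the resolution of relative links with `urljoin` are all assumptions made for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class AnchorExtractor(HTMLParser):
    """Collects (href, anchortext) pairs from one hypertext document."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (href, anchortext) pairs
        self._href = None    # href of the <a> element currently open
        self._text = []      # text fragments seen inside that element

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse runs of whitespace (an assumed normalization).
            text = " ".join("".join(self._text).split())
            self.links.append((self._href, text))
            self._href = None


def expand_training_set(initial_urls, fetch):
    """Return (training_set, lexicon) built from the initial documents.

    `fetch` is a hypothetical callable mapping a URL to its HTML text.
    """
    training_set = set(initial_urls)            # step (a)
    lexicon = set()
    for url in initial_urls:
        parser = AnchorExtractor()
        parser.feed(fetch(url))                 # step (b)
        parser.close()
        for href, anchortext in parser.links:
            training_set.add(urljoin(url, href))  # step (c)
            if anchortext:
                lexicon.add(anchortext)           # step (d)
    return training_set, lexicon
```

The sketch expands the training set by one hop only; whether the expansion is repeated on the newly added documents is left open here.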
7. A method for providing a relevancy scoring function for scoring documents in a search result, comprising:
compiling a lexicon including a plurality of terms that can be used in a search query for a search engine;
for each term of the lexicon, identifying from a corpus of documents those in which the term appears, and computing a document frequency based on relative numbers of the identified documents and the documents in the corpus; and
creating a look-up table, indexed by the document frequency and a term frequency, for storing a value of the relevancy scoring function, the term frequency being the frequency of occurrence of a term in a given document, the relevancy scoring function being a function of the term frequency and the document frequency.
Dependent claims: 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18
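The look-up table of claim 7 can be sketched as follows. The claim does not fix the scoring function or the table granularity, so `placeholder_score` (a TF-IDF-like stand-in), the bucket counts, and all function names below are assumptions chosen only to show how a precomputed table indexed by document frequency and term frequency might be built and consulted.

```python
import math


def document_frequency(term, corpus):
    """Fraction of corpus documents in which the term appears."""
    containing = sum(1 for doc in corpus if term in doc.split())
    return containing / len(corpus)


def placeholder_score(tf, df):
    # Assumed stand-in for the relevancy scoring function:
    # grows with term frequency, shrinks as the term gets more common.
    return tf * math.log(1.0 / df) if df > 0 else 0.0


def build_lookup_table(score, max_tf=8, df_buckets=10):
    """Precompute score values on a grid of (df bucket, tf)."""
    table = [[0.0] * (max_tf + 1) for _ in range(df_buckets)]
    for b in range(df_buckets):
        df = (b + 0.5) / df_buckets      # midpoint of the bucket
        for tf in range(max_tf + 1):
            table[b][tf] = score(tf, df)
    return table


def lookup(table, tf, df):
    """Read a score from the table instead of recomputing it."""
    df_buckets = len(table)
    max_tf = len(table[0]) - 1
    b = min(int(df * df_buckets), df_buckets - 1)  # clamp df == 1.0
    return table[b][min(tf, max_tf)]               # clamp large tf
```

At query time only `lookup` is called, so scoring a document costs two index operations per term regardless of how expensive the underlying function is.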
19. An adaptive feedback method for ensuring timeliness of a collection of web pages, comprising:
extracting a URL from a search result; and
determining whether or not a web page corresponding to the URL is present in whole or in part in the collection, wherein:
when the web page is determined to be present in the collection, downloading and replacing the web page in the collection with a first probability; and
when the web page is determined not to be present in the collection, downloading and including the web page in the collection.
Dependent claims: 20, 21, 22, 23, 24, 25, 26, 27, 28, 29
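The adaptive feedback method of claim 19 amounts to a probabilistic refresh rule, sketched below under stated assumptions: the collection is modeled as a plain dictionary keyed by URL, `download` is a hypothetical caller-supplied function, and the default value of `refresh_probability` (the claim's "first probability") is arbitrary.

```python
import random


def refresh_from_search_result(result_urls, collection, download,
                               refresh_probability=0.1, rng=random):
    """Apply the refresh rule to every URL extracted from a search result.

    Returns the list of URLs that were actually downloaded.
    """
    fetched = []
    for url in result_urls:
        if url in collection:
            # Page already held: re-download only with the given probability.
            if rng.random() < refresh_probability:
                collection[url] = download(url)
                fetched.append(url)
        else:
            # Page not held: always download and add it.
            collection[url] = download(url)
            fetched.append(url)
    return fetched
```

Because the rule runs once per returned URL, pages that appear in many recent search results are refreshed more often in expectation, which is the feedback effect the abstract describes.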
30. An adaptive feedback system for ensuring timeliness of a collection of web pages, comprising:
a query processor that extracts a URL from a search result;
a cache including a process for determining whether or not a web page corresponding to the URL is present in whole or in part in the collection;
a web crawler coupled to the processor, wherein:
when the web page is determined to be present in the collection, the web crawler downloads and replaces the web page in the collection with a first probability; and
when the web page is determined not to be present in the collection, the web crawler downloads and includes the web page in the collection.
Dependent claims: 31, 32, 33, 34, 35, 36, 37, 38, 39, 40
Specification