Search engine and method with improved relevancy, scope, and timeliness
Abstract
A search engine and a method achieve timeliness of documents returned in a search result by a relevancy feedback mechanism driven by the frequency with which a URL is returned in recent searches. The relevancy feedback mechanism includes one or more random processes which determine whether or not a cached or indexed web page associated with a URL in the search result should be refreshed. In addition, the random processes also determine whether or not hyperlinks in the cached or indexed web page should be followed to access related web pages. Accesses of web pages resulting from the operations of the random processes are used to update any document index maintained by the search engine. Relevancy scoring functions implemented in look-up tables are also disclosed. A more accurate relevancy scoring function is achieved using a lexicon based on anchortexts of extracted hyperlinks of web documents.
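The patent requires only that the refresh probability be a function of how often a URL has appeared in recent search results; it does not fix a particular formula. A minimal sketch of one such mapping, assuming a linear, capped function over a sliding window of recent results (the function name and the window size are illustrative, not from the patent):

```python
from collections import Counter

def refresh_probability(appearances: int, window_size: int) -> float:
    """Map a URL's appearance count in recent search results to a
    refresh probability. The linear, capped mapping is an assumption
    of this sketch; the patent only requires that refresh frequency
    be a function of search-result frequency."""
    if window_size <= 0:
        return 0.0
    return min(1.0, appearances / window_size)

# URL appearance counts observed over a window of 100 recent results.
recent = Counter({"http://example.com/news": 40, "http://example.com/old": 2})
p_hot = refresh_probability(recent["http://example.com/news"], 100)   # 0.4
p_cold = refresh_probability(recent["http://example.com/old"], 100)   # 0.02
```

Under this choice, a URL that surfaces often in searches is refreshed proportionally more often, which is the feedback property the abstract describes.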
26 Claims
1. A computer implemented method for adaptive feedback ensuring timeliness of a collection of web pages retrieved from servers in a computer network, the method comprising:

in a computer system having access to the servers in the computer network, extracting one or more universal resource locators (URLs) from a result of searching for web pages that are served by servers in the computer network; and

for each URL extracted, determining whether or not a web page corresponding to the URL is present in whole or in part in the collection, which is a cache of URLs and corresponding web pages accessed by a crawler, wherein:

when the web page is determined to be present in the collection, refreshing by the crawler the web page in the collection by requesting a current copy of the web page from a corresponding one of the servers in the computer network in accordance with a first probability, such that, due to the first probability, a frequency of refreshing the web page over a period of time by the crawler is a function of a frequency with which the URL that is extracted appears in a plurality of the results of searching over the period of time; and

when the web page is determined not to be present in the collection, downloading by the crawler the web page from a corresponding one of the servers in the computer network and including the web page in the collection.

View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15)
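The method of claim 1 can be sketched as a short loop over the URLs extracted from one search result: cached pages are refreshed with the first probability, uncached pages are downloaded and added. All names are hypothetical, `fetch` stands in for an HTTP download, and the linear probability mapping is an assumption of the sketch, since the claim only requires that refresh frequency be a function of search-result frequency.

```python
import random
from collections import Counter

def process_search_result(result_urls, collection, fetch, counts, window=100):
    """Sketch of the claim-1 feedback loop. `collection` is the crawler's
    cache (URL -> page); `counts` accumulates how often each URL has
    appeared in search results over the window."""
    for url in result_urls:
        counts[url] += 1                       # feedback: URL seen in a result
        if url in collection:                  # page already in the collection
            p = min(1.0, counts[url] / window) # assumed 'first probability'
            if random.random() < p:
                collection[url] = fetch(url)   # refresh the cached copy
        else:                                  # page not in the collection
            collection[url] = fetch(url)       # download and include it

cache: dict = {}
process_search_result(["http://a.example", "http://b.example"],
                      cache, lambda u: f"<html>{u}</html>", Counter())
```

After the call both pages are in the cache via the download branch; on later results containing the same URLs, only the probabilistic refresh branch runs.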
16. An adaptive feedback system for ensuring currency of a collection of web pages retrieved from servers in a computer network, the system comprising:

a crawler that accesses the servers in the computer network;

a query processor that extracts one or more URLs from a search result; and

a cache for storing the collection of web pages accessed by the crawler, the cache having a document processor that determines whether or not a web page corresponding to each URL extracted from the search result is present in whole or in part in the collection, wherein:

when the web page is determined by the document processor to be present in the collection, the crawler refreshes the web page in the collection in accordance with a first probability, such that, due to the first probability, a frequency of refresh of the web page over a period of time by the crawler is a function of a frequency with which the URL corresponding to the web page appears in a plurality of the search results over the period of time; and

when the web page is determined not to be present in the collection, the crawler downloads the web page from a corresponding one of the servers in the computer network and includes the web page in the collection.

View Dependent Claims (17, 18, 19, 20, 21, 22, 23, 24, 25, 26)
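Claim 16 recites the same feedback mechanism as an apparatus built from a crawler, a query processor, and a cache with a document processor. A minimal sketch of that decomposition, with all class and method names hypothetical and the probability mapping again an assumption:

```python
import random

class QueryProcessor:
    """Extracts URLs from a search result. Here the result is already a
    list of URL strings; a real processor would parse result markup."""
    def extract(self, search_result):
        return list(search_result)

class Cache:
    """Holds the collection; the membership test plays the role of the
    claim's 'document processor'."""
    def __init__(self):
        self.pages = {}
    def contains(self, url):
        return url in self.pages

class Crawler:
    """Accesses servers; `fetch` is injected so the sketch stays offline."""
    def __init__(self, fetch):
        self.fetch = fetch

class AdaptiveFeedbackSystem:
    def __init__(self, crawler, cache, query_processor, window=100):
        self.crawler, self.cache, self.qp = crawler, cache, query_processor
        self.window = window
        self.counts = {}
    def observe(self, search_result):
        for url in self.qp.extract(search_result):
            self.counts[url] = self.counts.get(url, 0) + 1
            if self.cache.contains(url):
                p = min(1.0, self.counts[url] / self.window)  # assumed mapping
                if random.random() < p:
                    self.cache.pages[url] = self.crawler.fetch(url)  # refresh
            else:
                self.cache.pages[url] = self.crawler.fetch(url)      # download

system = AdaptiveFeedbackSystem(Crawler(lambda u: "page:" + u),
                                Cache(), QueryProcessor())
system.observe(["http://x.example"])
```

Injecting `fetch` into the crawler keeps the component boundaries of the claim visible while letting the sketch run without network access.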
Specification