Search engine and method with improved relevancy, scope, and timeliness
Abstract
A search engine and a method achieve timeliness of documents returned in a search result by a relevancy feedback mechanism driven by the frequency with which a URL is returned in recent searches. The relevancy feedback mechanism includes one or more random processes which determine whether or not a cached or indexed web page associated with a URL in the search result should be refreshed. In addition, the random processes also determine whether or not hyperlinks in the cached or indexed web page should be followed to access related web pages. Accesses of web pages resulting from the operations of the random processes are used to update any document index maintained by the search engine. Relevancy scoring functions implemented in look-up tables are also disclosed. A more accurate relevancy scoring function is achieved using a lexicon based on anchortexts of extracted hyperlinks of web documents.
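The patent requires only that the refresh probability be a function of how often a URL has appeared in recent search results; it does not fix a particular formula. A minimal sketch of one such mapping, assuming a linear, capped function over a sliding window of recent results (the function name and the window size are illustrative, not from the patent):

```python
from collections import Counter

def refresh_probability(appearances: int, window_size: int) -> float:
    """Map a URL's appearance count in recent search results to a
    refresh probability. The linear, capped mapping is an assumption
    of this sketch; the patent only requires that refresh frequency
    be a function of search-result frequency."""
    if window_size <= 0:
        return 0.0
    return min(1.0, appearances / window_size)

# URL appearance counts observed over a window of 100 recent results.
recent = Counter({"http://example.com/news": 40, "http://example.com/old": 2})
p_hot = refresh_probability(recent["http://example.com/news"], 100)   # 0.4
p_cold = refresh_probability(recent["http://example.com/old"], 100)   # 0.02
```

Under this choice, a URL that surfaces often in searches is refreshed proportionally more often, which is the feedback property the abstract describes.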
26 Claims
1. A computer implemented method for adaptive feedback ensuring timeliness of a collection of web pages retrieved from servers in a computer network, the method comprising:

in a computer system having access to the servers in the computer network, extracting one or more universal resource locators (URLs) from a result of searching for web pages that are served by servers in the computer network; and

for each URL extracted, determining whether or not a web page corresponding to the URL is present in whole or in part in the collection, which is a cache of URLs and corresponding web pages accessed by a crawler, wherein:

when the web page is determined to be present in the collection, refreshing by the crawler the web page in the collection by requesting a current copy of the web page from a corresponding one of the servers in the computer network in accordance with a first probability, such that, due to the first probability, a frequency of refreshing the web page over a period of time by the crawler is a function of a frequency with which the URL that is extracted appears in a plurality of the results of searching over the period of time; and

when the web page is determined not to be present in the collection, downloading by the crawler the web page from a corresponding one of the servers in the computer network and including the web page in the collection.

View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15)
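The method of claim 1 can be sketched as a short loop over the URLs extracted from one search result: cached pages are refreshed with the first probability, uncached pages are downloaded and added. All names are hypothetical, `fetch` stands in for an HTTP download, and the linear probability mapping is an assumption of the sketch, since the claim only requires that refresh frequency be a function of search-result frequency.

```python
import random
from collections import Counter

def process_search_result(result_urls, collection, fetch, counts, window=100):
    """Sketch of the claim-1 feedback loop. `collection` is the crawler's
    cache (URL -> page); `counts` accumulates how often each URL has
    appeared in search results over the window."""
    for url in result_urls:
        counts[url] += 1                       # feedback: URL seen in a result
        if url in collection:                  # page already in the collection
            p = min(1.0, counts[url] / window) # assumed 'first probability'
            if random.random() < p:
                collection[url] = fetch(url)   # refresh the cached copy
        else:                                  # page not in the collection
            collection[url] = fetch(url)       # download and include it

cache: dict = {}
process_search_result(["http://a.example", "http://b.example"],
                      cache, lambda u: f"<html>{u}</html>", Counter())
```

After the call both pages are in the cache via the download branch; on later results containing the same URLs, only the probabilistic refresh branch runs.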
16. An adaptive feedback system for ensuring currency of a collection of web pages retrieved from servers in a computer network, the system comprising:

a crawler that accesses the servers in the computer network;

a query processor that extracts one or more URLs from a search result; and

a cache for storing the collection of web pages accessed by the crawler, the cache having a document processor that determines whether or not a web page corresponding to each URL extracted from the search result is present in whole or in part in the collection, wherein:

when the web page is determined by the document processor to be present in the collection, the crawler refreshes the web page in the collection in accordance with a first probability, such that, due to the first probability, a frequency of refresh of the web page over a period of time by the crawler is a function of a frequency with which the URL corresponding to the web page appears in a plurality of the search results over the period of time; and

when the web page is determined not to be present in the collection, the crawler downloads the web page from a corresponding one of the servers in the computer network and includes the web page in the collection.

View Dependent Claims (17, 18, 19, 20, 21, 22, 23, 24, 25, 26)
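Claim 16 recites the same feedback mechanism as an apparatus built from a crawler, a query processor, and a cache with a document processor. A minimal sketch of that decomposition, with all class and method names hypothetical and the probability mapping again an assumption:

```python
import random

class QueryProcessor:
    """Extracts URLs from a search result. Here the result is already a
    list of URL strings; a real processor would parse result markup."""
    def extract(self, search_result):
        return list(search_result)

class Cache:
    """Holds the collection; the membership test plays the role of the
    claim's 'document processor'."""
    def __init__(self):
        self.pages = {}
    def contains(self, url):
        return url in self.pages

class Crawler:
    """Accesses servers; `fetch` is injected so the sketch stays offline."""
    def __init__(self, fetch):
        self.fetch = fetch

class AdaptiveFeedbackSystem:
    def __init__(self, crawler, cache, query_processor, window=100):
        self.crawler, self.cache, self.qp = crawler, cache, query_processor
        self.window = window
        self.counts = {}
    def observe(self, search_result):
        for url in self.qp.extract(search_result):
            self.counts[url] = self.counts.get(url, 0) + 1
            if self.cache.contains(url):
                p = min(1.0, self.counts[url] / self.window)  # assumed mapping
                if random.random() < p:
                    self.cache.pages[url] = self.crawler.fetch(url)  # refresh
            else:
                self.cache.pages[url] = self.crawler.fetch(url)      # download

system = AdaptiveFeedbackSystem(Crawler(lambda u: "page:" + u),
                                Cache(), QueryProcessor())
system.observe(["http://x.example"])
```

Injecting `fetch` into the crawler keeps the component boundaries of the claim visible while letting the sketch run without network access.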
Specification