Search engine and method with improved relevancy, scope, and timeliness

US 20050004943A1
Filed: 04/26/2004
Published: 01/06/2005
Est. Priority Date: 04/24/2003
Status: Active Grant

Abstract

A search engine and a method achieve timeliness of documents returned in a search result through a relevancy feedback mechanism driven by the frequency with which a URL is returned in recent searches. The relevancy feedback mechanism includes one or more random processes which determine whether or not a cached or indexed web page associated with a URL in the search result should be refreshed. The random processes also determine whether or not hyperlinks in the cached or indexed web page should be followed to access related web pages. Accesses of web pages resulting from the operations of the random processes are used to update any document index maintained by the search engine. Relevancy scoring functions implemented in look-up tables are also disclosed. A more accurate relevancy scoring function is achieved using a lexicon based on anchortexts of extracted hyperlinks of web documents.

113 Citations

40 Claims

1. A method for providing a training set to build a statistical relevancy scoring function to be used in a search engine, comprising:
- (a) identifying an initial set of hypertext documents as a training set of relevant documents;
  
  (b) identifying all hyperlinks included in each hypertext document of the training set identified and their associated anchortexts;
  
  (c) including the hypertext documents pointed to by the hyperlinks identified in (b) in the training set; and
  
  (d) including the anchortexts associated with the hyperlinks identified in (b) in a lexicon.
- Dependent Claims (2, 3, 4, 5, 6)
  - 2. A method as in claim 1, further comprising repeating steps (b)-(d).
  - 3. A method as in claim 1, further comprising ascertaining using an independent method the relevance of the terms in the lexicon to their associated documents.
  - 4. A method as in claim 1, wherein the training set is used to tune a scoring function.
  - 5. A method as in claim 1, wherein the lexicon includes terms consisting of more than one word.
  - 6. A method as in claim 1, further comprising clustering of terms in the lexicon.
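
Steps (a)-(d) of claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation: `AnchorExtractor`, `expand_training_set`, the `fetch` callable, and the `rounds` parameter are hypothetical names introduced here, and the sketch assumes hrefs can be used directly as document identifiers (no relative-URL resolution).

```python
from html.parser import HTMLParser


class AnchorExtractor(HTMLParser):
    """Collects (href, anchortext) pairs from one hypertext document."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (href, anchortext) pairs
        self._href = None    # href of the <a> element currently open, if any
        self._text = []      # text fragments seen inside that element

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Normalize whitespace and case; multi-word terms are kept whole.
            anchortext = " ".join("".join(self._text).split()).lower()
            if anchortext:
                self.links.append((self._href, anchortext))
            self._href = None


def expand_training_set(seed_urls, fetch, rounds=1):
    """Steps (a)-(d) of claim 1; rounds > 1 repeats (b)-(d) as in claim 2.

    `fetch` is a hypothetical callable mapping a URL to its HTML, or None.
    """
    training_set = set(seed_urls)    # (a) initial set of relevant documents
    lexicon = {}                     # anchortext -> set of linked URLs
    frontier = list(seed_urls)
    for _ in range(rounds):
        next_frontier = []
        for url in frontier:
            html = fetch(url)
            if html is None:
                continue
            parser = AnchorExtractor()
            parser.feed(html)
            for href, anchortext in parser.links:      # (b) links and anchortexts
                if href not in training_set:           # (c) add linked documents
                    training_set.add(href)
                    next_frontier.append(href)
                lexicon.setdefault(anchortext, set()).add(href)  # (d) lexicon
        frontier = next_frontier
    return training_set, lexicon
```

Keeping the linked URLs next to each anchortext is a design choice of this sketch: it records which documents each lexicon term was used to describe, which is the kind of term-to-document association that claims 3 and 11 refer to.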

7. A method for providing a relevancy scoring function for scoring documents in a search result, comprising:
- compiling a lexicon including a plurality of terms that can be used in a search query for a search engine;
  
  for each term of the lexicon, identifying from a corpus of documents those in which the term appears, and computing a document frequency based on relative numbers of the identified documents and the documents in the corpus; and
  
  creating a look-up table, indexed by the document frequency and a term frequency, for storing a value of the relevancy scoring function, the term frequency being the frequency of occurrence of a term in a given document, the relevancy scoring function being a function of the term frequency and the document frequency.
- Dependent Claims (8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18)
  - 8. A method as in claim 7, wherein the relevancy scoring function is the product of a function of the document frequency and a function of the term frequency.
  - 9. A method as in claim 7, wherein the relevancy scoring function represents the odds that a document is relevant to a term.
  - 10. A method as in claim 7, wherein the relevancy scoring function is compiled by tallying a number of times that a document is adjudged to be relevant to an included term and the number of times a document is adjudged to be not relevant to the included term.
  - 11. A method as in claim 10, wherein the documents adjudged are each referenced by a hyperlink in which the included term appears in the anchortext of the hyperlink.
  - 12. A method as in claim 7, further comprising computing the term frequency for each of the identified documents, and the relevancy scoring function being a function of the term frequencies associated with the identified documents.
  - 13. A method as in claim 7, wherein the lexicon and the corpus are deemed a set of terms and known relevant documents for each term.
  - 14. A method as in claim 7, wherein the relevancy scoring function is derived from the ratio of two scoring functions.
  - 15. A method as in claim 12, wherein when the search result is responsive to a query including more than one term from the lexicon, a document returned in the search result is assigned the sum of all values of the relevancy scoring function associated with all the terms from the lexicon included in the query.
  - 16. A method as in claim 12, wherein the relevancy scoring function is compiled statistically using the entire corpus.
  - 17. A method as in claim 12, wherein the relevancy scoring function is compiled statistically using a selected fraction of the corpus.
  - 18. A method as in claim 7, further comprising smoothing the adjacent entries of the look-up table.
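
The look-up table of claim 7 can be sketched as follows, under several assumptions that do not come from the claims: an 8x8 table, logarithmic bucketing of both indices, single-word terms matched against token lists, and add-one counts to avoid division by zero (this is not the adjacent-entry smoothing of claim 18). All function names are hypothetical. Cell values are odds of relevance tallied from judgments (claims 9 and 10), and a multi-term query sums the per-term values (claim 15).

```python
import math
from collections import Counter

DF_BUCKETS = 8   # assumed table dimensions, not taken from the source
TF_BUCKETS = 8


def df_bucket(df):
    """Map a document frequency in (0, 1] to a row index (log scale, assumed)."""
    return min(DF_BUCKETS - 1, int(-math.log2(df)))


def tf_bucket(tf):
    """Map a raw term count >= 1 to a column index (log scale, assumed)."""
    return min(TF_BUCKETS - 1, int(math.log2(tf)))


def document_frequencies(corpus, lexicon):
    """corpus: doc_id -> list of tokens. Returns term -> fraction of documents."""
    n = len(corpus)
    return {t: sum(1 for toks in corpus.values() if t in toks) / n
            for t in lexicon}


def build_table(corpus, dfs, judgments):
    """judgments: iterable of (term, doc_id, is_relevant) tuples."""
    rel, non = Counter(), Counter()
    for term, doc_id, is_relevant in judgments:
        tf = corpus[doc_id].count(term)
        if tf == 0 or dfs.get(term, 0) == 0:
            continue
        cell = (df_bucket(dfs[term]), tf_bucket(tf))
        (rel if is_relevant else non)[cell] += 1
    # Odds of relevance per cell; the +1 terms are an assumed prior.
    return [[(rel[(i, j)] + 1) / (non[(i, j)] + 1) for j in range(TF_BUCKETS)]
            for i in range(DF_BUCKETS)]


def score(query_terms, doc_tokens, dfs, table):
    """Sum the table values over the lexicon terms present in the document."""
    total = 0.0
    for term in query_terms:
        tf = doc_tokens.count(term)
        if tf > 0 and dfs.get(term, 0) > 0:
            total += table[df_bucket(dfs[term])][tf_bucket(tf)]
    return total
```

Once the table is built, scoring a document costs one table read per query term, which is the practical appeal of storing the scoring function as a look-up table rather than evaluating a formula at query time.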

19. An adaptive feedback method for ensuring timeliness of a collection of web pages, comprising:
- extracting a URL from a search result; and
  
  determining whether or not a web page corresponding to the URL is present in whole or in part in the collection, wherein:
  
  when the web page is determined to be present in the collection, downloading and replacing the web page in the collection with a first probability; and
  
  when the web page is determined not to be present in the collection, downloading and including the web page in the collection.
- Dependent Claims (20, 21, 22, 23, 24, 25, 26, 27, 28, 29)
  - 20. A method as in claim 19, wherein the first probability depends on an age of the web page in the collection.
  - 21. A method as in claim 19, wherein the web pages are collected in a cache.
  - 22. A method as in claim 19, wherein the web pages are indexed in a document index.
  - 23. A method as in claim 21, wherein the cache incorporates a replacement policy that favors retaining most recently accessed web pages.
  - 24. A method as in claim 21, wherein the cache is indexed using a hash signature of a URL.
  - 25. A method as in claim 19, wherein the first probability depends on one or more of: source parameters, a type of the URL, an index size, and a workload of a web crawler.
  - 26. A method as in claim 19, further comprising extracting hyperlinks from the web page corresponding to the URL and downloading the web pages corresponding to the hyperlinks each with a second probability.
  - 27. A method as in claim 26, wherein the second probability depends on the number of hyperlinks in the web page.
  - 28. A method as in claim 26, wherein the second probability further depends on an age of the URL.
  - 29. A method as in claim 22, further comprising updating the document index using information obtained from accessing the web page corresponding to the URL.
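
The randomized refresh decision of claims 19, 20, 26, and 27 can be sketched in Python as below. The claims do not give formulas for the two probabilities, so the half-life form of `refresh_probability` and the per-page budget in `follow_probability` are assumptions, as are all names, the `download` and `extract_links` callables, and the choice to follow links only when the page itself was just downloaded.

```python
import random
import time


def refresh_probability(age_seconds, half_life=86400.0):
    """First probability (assumed form): grows toward 1 as the stored copy ages."""
    return 1.0 - 0.5 ** (age_seconds / half_life)


def follow_probability(num_links, budget=5.0):
    """Second probability (assumed form): about `budget` links followed per page."""
    return min(1.0, budget / num_links) if num_links else 0.0


def process_result_url(url, collection, download, extract_links,
                       rng=random, now=time.time):
    """Apply the claim 19 decision to one URL extracted from a search result.

    collection: dict mapping url -> (page, fetched_at)
    download, extract_links: hypothetical callables supplied by the caller.
    Returns the list of URLs downloaded during this call.
    """
    fetched = []
    if url in collection:
        _, fetched_at = collection[url]
        if rng.random() >= refresh_probability(now() - fetched_at):
            return fetched                  # keep the stored copy this time
    page = download(url)                    # absent pages are always downloaded
    collection[url] = (page, now())
    fetched.append(url)
    links = extract_links(page)
    p2 = follow_probability(len(links))
    for link in links:                      # each link followed with probability p2
        if rng.random() < p2:
            collection[link] = (download(link), now())
            fetched.append(link)
    return fetched
```

Because the function is called only for URLs that appear in search results, frequently returned URLs get more chances to be refreshed, which is the feedback effect the abstract describes.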

30. An adaptive feedback system for ensuring timeliness of a collection of web pages, comprising:
- a query processor that extracts a URL from a search result;
  
  a cache including a process for determining whether or not a web page corresponding to the URL is present in whole or in part in the collection;
  
  a web crawler coupled to the processor, wherein:
  
  when the web page is determined to be present in the collection, the web crawler downloads and replaces the web page in the collection with a first probability; and
  
  when the web page is determined not to be present in the collection, the web crawler downloads and includes the web page in the collection.
- Dependent Claims (31, 32, 33, 34, 35, 36, 37, 38, 39, 40)
  - 31. A system as in claim 30, wherein the first probability depends on an age of the web page in the collection.
  - 32. A system as in claim 30, wherein the web pages are collected in a cache.
  - 33. A system as in claim 30, wherein the web pages are indexed in a document index.
  - 34. A system as in claim 32, wherein the cache incorporates a replacement policy that favors retaining most recently accessed web pages.
  - 35. A system as in claim 32, wherein the cache is indexed using a hash signature of a URL.
  - 36. A system as in claim 30, wherein the first probability depends on one or more of: source parameters, a type of the URL, an index size, and a workload of a web crawler.
  - 37. A system as in claim 30, wherein the processor extracts hyperlinks from the web page corresponding to the URL and directs the web crawler to download the web pages corresponding to the hyperlinks each with a second probability.
  - 38. A system as in claim 37, wherein the second probability depends on the number of hyperlinks in the web page.
  - 39. A system as in claim 37, wherein the second probability further depends on an age of the URL.
  - 40. A system as in claim 33, wherein the processor updates the document index using information obtained from accessing the web page corresponding to the URL.
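
The cache described in claims 34 and 35 (indexed by a hash signature of the URL, with a replacement policy that favors the most recently accessed pages) could look like the following sketch. `PageCache` is a hypothetical name, SHA-1 is an assumed choice of hash (the claims say only "a hash signature"), and strict least-recently-used eviction is one possible reading of the replacement policy.

```python
import hashlib
from collections import OrderedDict


class PageCache:
    """Bounded page store keyed by a hash signature of the URL."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._entries = OrderedDict()   # signature -> page, oldest access first

    @staticmethod
    def signature(url):
        # Assumed hash choice; any stable digest of the URL would serve.
        return hashlib.sha1(url.encode("utf-8")).hexdigest()

    def __contains__(self, url):
        return self.signature(url) in self._entries

    def get(self, url):
        key = self.signature(url)
        if key not in self._entries:
            return None
        self._entries.move_to_end(key)          # mark as most recently accessed
        return self._entries[key]

    def put(self, url, page):
        key = self.signature(url)
        self._entries[key] = page
        self._entries.move_to_end(key)
        while len(self._entries) > self.capacity:
            self._entries.popitem(last=False)   # evict least recently accessed
```

For example, with a capacity of 2, storing pages for "a" and "b", reading "a", and then storing "c" evicts "b", since "a" was accessed more recently.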

Current Assignee: Affini, Inc.
Original Assignee: Affini, Inc.
Inventors: Chang, William I.
Granted Patent: US 7,917,483 B2
US Class Current: 1/1
CPC Class Codes:
G06F 16/9535 Search customisation based ...
G06F 16/9538 Presentation of query results
