Method and apparatus for focused crawling

US 7,080,073 B1
Filed: 06/24/2002
Issued: 07/18/2006
Est. Priority Date: 08/18/2000
Status: Expired due to Term

Abstract

The present invention pertains to the field of computer software. More specifically, the present invention relates to dynamic discovery of documents or information through a focused crawler or search engine.

463 Citations

19 Claims

1. A method of focused crawling, comprising:
- accessing a query input;
  
  crawling a plurality of documents continually, the documents including links to each other, and the crawling at least partly guided by a crawl metric, wherein the crawl metric quantifies priority for crawling links emanating from a certain document within the crawling, the crawl metric at least partly determined by a first mechanism, the first mechanism including a first combination, the first combination including a first plurality of one or more procedures, the first plurality of one or more procedures including evaluating relevance of documents using a link structure of the crawled documents, wherein the evaluating relevance of documents using a link structure of the crawled documents is performed repeatedly and continually, and wherein the evaluating relevance of documents using a link structure of the crawled documents includes:
  
  accessing a first plurality of documents from a database of a plurality of received documents, the plurality of received documents including crawled documents, the first plurality of documents to be ranked,
  
  generating a graph of the first plurality of documents,
  
  assigning weights to a plurality of nodes of the graph, wherein nodes of the graph represent the documents and edges represent links between the documents,
  
  finding an assignment of weights to one or more nodes of the graph, by propagating weights through the graph, the assignment of weight to a node based at least in part on calculating a weighted sum of weights propagated from neighboring nodes, and
  
  generating a ranked list of at least the first plurality of documents, the ranked list at least partly generated from the graph; and
  
  returning target documents, the target documents being relevant to the query input, the target documents found from the plurality of crawled documents, the target documents returned at least partly based on a search metric, the search metric quantifying relevance or importance of a document to the query input, the search metric at least partly determined by a second mechanism, the second mechanism including a second combination, the second combination being different from the first combination, the second combination including a second plurality of one or more procedures, the second plurality of procedures including evaluating relevance of documents using a template, the template including a plurality of one or more template portions, at least one of the template portions including a second plurality of one or more hierarchical levels.
- Dependent claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15):
  - 2. The method of claim 1, wherein relevance includes importance.
  - 3. The method of claim 1, wherein at least one of the first mechanism and the second mechanism includes:
    - associating a weight to each of the evaluated relevances of the procedures; and
      
      combining the evaluated relevances and the weights of the evaluated relevances.
  - 4. The method of claim 1, wherein one or more of:
    - 1) the first plurality of one or more hierarchical levels and
      
      2) the second plurality of one or more hierarchical levels, includes at least one or more heading levels and one or more content levels.
  - 5. The method of claim 1, wherein evaluating relevance includes evaluating relevance of at least a first document and one or more of a first plurality of one or more referring documents and a second plurality of one or more referring documents, each of the first plurality of one or more referring documents referring to the first document directly, and each of the second plurality of referring documents referring to the first document indirectly through one or more documents.
  - 6. The method of claim 1, wherein the procedure, of the first plurality of one or more procedures, of evaluating relevance of documents using a link structure of the crawled documents, further comprises:
    - expanding the graph with a second plurality of one or more documents from the database, wherein a third plurality includes a union of the first plurality of documents and the second plurality of documents, and the third plurality of documents is smaller than the plurality of received documents.
  - 7. The method of claim 1, wherein the procedure, of the first plurality of one or more procedures, of evaluating relevance of documents using a link structure of the crawled documents, further comprises:
    - expanding the graph with a second plurality of one or more documents from the database, such that a third plurality includes a union of the first plurality of documents and the second plurality of documents, and the third plurality of documents is smaller than the plurality of received documents, the second plurality including one or more of:
      
      1) one or more documents connected within a first specified number of links in a forward direction from one or more documents of the first plurality of documents, the forward direction being forward from the first plurality of documents, and
      
      2) one or more documents connected within a second specified number of links in a backward direction from one or more documents of the first plurality of documents, the backward direction being backward from the first plurality of documents.
  - 8. The method of claim 1, wherein the procedure, of the first plurality of one or more procedures, of evaluating relevance of documents using a link structure of the crawled documents, further comprises:
    - expanding the graph with a second plurality of one or more documents from the database, such that a third plurality includes a union of the first plurality of documents and the second plurality of documents, and the third plurality of documents is smaller than the plurality of received documents, the second plurality including one or more of:
      
      1) all documents connected within a first specified number of links in a forward direction from one or more documents of the first plurality of documents, the forward direction being forward from the first plurality of documents, and
      
      2) all documents connected within a second specified number of links in a backward direction from one or more documents of the first plurality of documents, the backward direction being backward from the first plurality of documents.
  - 9. The method of claim 1, wherein the first plurality of documents includes recently received documents of the plurality of received documents.
  - 10. The method of claim 1, wherein the procedure, of the first plurality of one or more procedures, of evaluating relevance of documents using a link structure of the crawled documents, further comprises:
    - shrinking the graph by removing one or more nodes of the graph.
  - 11. The method of claim 1, wherein the procedure, of the first plurality of one or more procedures, of evaluating relevance of documents using a link structure of the crawled documents, further comprises:
    - shrinking the graph by combining one or more sets of one or more nodes of the graph.
  - 12. The method of claim 11, wherein the combining is based on common characteristics of the nodes or relationships between the nodes.
  - 13. The method of claim 1, wherein the propagating weights through the graph occurs up to a limited node distance.
  - 14. The method of claim 1, wherein weights assigned to a document include at least one of relevance of the document to the query input and importance of the document independent of the query input.
  - 15. The method of claim 1, wherein the second plurality of procedures further includes one or more of:
    - 1) evaluating relevance of documents using logical expressions of keywords and phrases,
      
      2) evaluating relevance of documents using a link structure of the crawled documents, and
      
      3) evaluating relevance based on freshness of documents.
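
The link-structure procedure recited in claim 1 (build a graph, propagate weights as a weighted sum over neighboring nodes, produce a ranked list, and use the result to prioritize links for crawling) can be illustrated with a minimal Python sketch. This is not the patented implementation. The function names, the `damping` factor, the even split of a referrer's weight over its out-links, and the rule that an uncrawled URL inherits the highest weight among the documents linking to it are all assumptions chosen for illustration. Capping `rounds` is one possible reading of the limited node distance in claim 13.

```python
import heapq
from collections import defaultdict


def propagate_weights(links, initial, damping=0.5, rounds=3):
    """Propagate node weights along links for a fixed number of rounds.

    links:   dict mapping a document id to the list of ids it links to
    initial: dict mapping a document id to its starting weight
    damping and rounds are hypothetical tuning parameters.
    """
    # Reverse adjacency: for each node, the documents that link to it.
    inbound = defaultdict(list)
    for src, targets in links.items():
        for dst in targets:
            inbound[dst].append(src)

    nodes = set(links) | set(inbound) | set(initial)
    weights = {n: initial.get(n, 0.0) for n in nodes}
    for _ in range(rounds):
        updated = {}
        for n in nodes:
            # Weighted sum over neighbors: each referrer splits its
            # current weight evenly across its outgoing links.
            incoming = sum(weights[src] / len(links[src]) for src in inbound[n])
            updated[n] = initial.get(n, 0.0) + damping * incoming
        weights = updated
    return weights


def ranked_list(weights):
    """Document ids sorted by descending weight (ties broken by id)."""
    return sorted(weights, key=lambda n: (-weights[n], n))


def crawl_order(weights, frontier_links):
    """Order uncrawled URLs by an assumed crawl metric.

    frontier_links: dict mapping a crawled document id to the uncrawled
    URLs it links to. Each URL takes the highest weight among the
    documents that link to it (an illustrative choice).
    """
    best = {}
    for doc, urls in frontier_links.items():
        for url in urls:
            best[url] = max(best.get(url, 0.0), weights.get(doc, 0.0))
    heap = [(-w, url) for url, w in best.items()]
    heapq.heapify(heap)
    order = []
    while heap:
        _, url = heapq.heappop(heap)
        order.append(url)
    return order


if __name__ == "__main__":
    links = {"a": ["b", "c"], "b": ["c"], "c": []}
    w = propagate_weights(links, {"a": 1.0}, damping=0.5, rounds=2)
    print(ranked_list(w))
    print(crawl_order(w, {"a": ["u1"], "c": ["u2", "u1"], "b": ["u3"]}))
```

Because each round moves weight across exactly one link, after `rounds` iterations a node has only received contributions from documents at most that many links away, which keeps the computation bounded when the graph is re-evaluated repeatedly during a crawl.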

16. A method, comprising:
- performing a plurality of focused crawls, wherein each of the plurality of focused crawls comprises:
  
  accessing a query input;
  
  crawling a plurality of documents, the documents including links to each other, and the crawling at least partly guided by a crawl metric, the crawl metric at least partly determined by a first mechanism, the first mechanism including a first combination, the first combination including evaluating relevance of documents using a link structure of the crawled documents wherein the evaluating relevance of documents using a link structure of the crawled documents is performed repeatedly and continually, and wherein the evaluating relevance of documents using a link structure of the crawled documents includes:
  
  accessing a first plurality of documents from a database of a plurality of received documents, the plurality of received documents including crawled documents, the first plurality of documents to be ranked,
  
  generating a graph of the first plurality of documents,
  
  assigning weights to a plurality of nodes of the graph wherein nodes of the graph represent the documents and edges represent links between the documents,
  
  finding an assignment of weights to one or more nodes of the graph, by propagating weights through the graph, the assignment of weight to a node based at least in part on calculating a weighted sum of weights propagated from neighboring nodes, and
  
  generating a ranked list of at least the first plurality of documents, the ranked list at least partly generated from the graph; and
  
  returning target documents, the target documents being relevant to the query input, the target documents found from the plurality of crawled documents, the target documents returned at least partly based on a search metric, the search metric quantifying relevance or importance of a document to the query input, the search metric at least partly determined by a second mechanism, the second mechanism including a second combination, the second combination being different from the first combination, the second combination including one or more of
  
       1) evaluating relevance of documents using logical expressions of keywords and phrases,
  
       2) evaluating relevance of documents using a template including a plurality of one or more template portions, at least one of the template portions including a plurality of one or more hierarchical levels,
  
       3) evaluating relevance of documents using a link structure of the crawled documents, and
  
       4) evaluating relevance based on freshness of documents,
  
  wherein the method is performed on at least one of
  
       1) a first processor and
  
       2) one or more of a first plurality of one or more processors.
- Dependent claims (17, 18, 19):
  - 17. The method of claim 16, wherein relevance includes importance.
  - 18. The method of claim 16, wherein evaluating relevance of documents includes evaluating relevance of at least a first document and a second document, the second document referring to the first document.
  - 19. The method of claim 16, wherein evaluating relevance includes evaluating relevance of at least a first document and one or more of a first plurality of one or more referring documents and a second plurality of one or more referring documents, each of the first plurality of one or more referring documents referring to the first document directly, and each of the second plurality of referring documents referring to the first document indirectly through one or more documents.
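
The search metric in claim 16 may combine several procedures: logical expressions of keywords, a template with hierarchical levels, and document freshness. The Python sketch below shows one way such a combination could look, using the weighted combination described in claim 3. It is an illustration under stated assumptions, not the method disclosed in the specification. The `Document` fields, the two template levels (`heading` and `content`, after claim 4), the substring matching, the exponential freshness decay with `half_life_days`, and the default procedure weights are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Document:
    heading: str      # heading-level text of the document
    content: str      # content-level text of the document
    age_days: float   # time since the document was last updated


def keyword_score(doc, all_of=(), any_of=()):
    """Logical expression: every term in all_of AND at least one in any_of."""
    text = f"{doc.heading} {doc.content}".lower()
    has_all = all(term.lower() in text for term in all_of)
    has_any = not any_of or any(term.lower() in text for term in any_of)
    return 1.0 if has_all and has_any else 0.0


def template_score(doc, template):
    """Score a document against a template with hierarchical levels.

    template maps a level name ('heading' or 'content') to a pair
    (terms, level_weight). The score is the weight-averaged fraction
    of terms matched at each level.
    """
    total = 0.0
    weight_sum = 0.0
    for level, (terms, level_weight) in template.items():
        text = getattr(doc, level).lower()
        matched = sum(1 for t in terms if t.lower() in text)
        total += level_weight * (matched / len(terms) if terms else 0.0)
        weight_sum += level_weight
    return total / weight_sum if weight_sum else 0.0


def freshness_score(doc, half_life_days=30.0):
    """Exponential decay: the score halves every half_life_days."""
    return 0.5 ** (doc.age_days / half_life_days)


def search_metric(doc, template, all_of=(), any_of=(), weights=(0.4, 0.4, 0.2)):
    """Weighted combination of the three per-procedure relevance scores."""
    scores = (
        keyword_score(doc, all_of, any_of),
        template_score(doc, template),
        freshness_score(doc),
    )
    return sum(w * s for w, s in zip(weights, scores))


if __name__ == "__main__":
    doc = Document("Focused crawling", "A crawler guided by link structure", 30.0)
    template = {"heading": (["crawling"], 2.0), "content": (["link", "graph"], 1.0)}
    print(search_metric(doc, template, all_of=["crawl"], any_of=["link", "template"]))
```

Keeping each procedure as a separate function that returns a score in the range 0 to 1 lets the crawl metric and the search metric use different combinations of the same building blocks, as the claims require the two combinations to differ.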

Current Assignee: Aurea Software Incorporated (ESW Capital, LLC)
Original Assignee: FirstRain, Inc. (Ignite Technologies Incorporated)
Inventors: Jiang, Dongming; Singh, Jaswinder Pal; Wang, Randolph; Krishnamurthy, Arvind
Primary Examiner(s): Ali, Mohammad

Application Number: US10/179,476
Time in Patent Office: 1,485 Days
Field of Search: 707/1-10, 707/100-104.1, 707/200-205, 715/513
US Class Current: 1/1
CPC Class Codes

G06F 16/951   Indexing; Web crawling tech...

G06F 40/131   Fragmentation of text files...

G06F 40/143   Markup, e.g. Standard Gener...

Y10S 707/99937   Sorting
