
System and method for prioritizing websites during a webcrawling process

  • US 7,966,337 B2
  • Filed: 06/23/2008
  • Issued: 06/21/2011
  • Est. Priority Date: 03/29/2006
  • Status: Expired due to Fees
First Claim

1. A prioritization method, comprising:

  • extracting, by a web crawler in a computing system, a set of candidate web pages to be crawled, wherein said computing system comprises a memory unit, and wherein said memory unit comprises said web crawler, said set of candidate web pages, an online analysis software application, an offline analysis software application, a web page score database, and a website score database;

    simultaneously executing, by a computer processor of said computing system, said online analysis software application and said offline analysis software application in order to simultaneously perform an online analysis and an offline analysis;

    associating, by said online analysis software application, each web page in said set of candidate web pages with a website in a computer network;

    verifying, by said online analysis software application, that said set of candidate web pages comprise data for analysis;

    determining online, by said online analysis software application that a first website score for said website is not available in said website score database;

    requesting, by said web crawler, a first sample set of web pages from said website, wherein said first sample set of web pages does not include a total set of web pages from said website;

    first analyzing offline, by said offline analysis software application, each sample web page of said first sample set of web pages with a plurality of offline heuristics;

    generating, by said offline analysis software application, a first group of web page scores for each said sample web page of said first sample set of web pages based on results of said first analyzing offline;

    storing, each said first group of web page scores in said web page score database;

    determining, by said offline analysis software application, that a number of Web pages in said first sample set of web pages has reached a predetermined threshold;

    generating, by said offline analysis software application in response to said determining, that said number of Web pages in said first sample set of web pages has reached said predetermined threshold, a first final web page score for each said web page of said first sample set of web pages, wherein each said first final web page score is generated by combining each web page score within each said first group of web page scores;

    storing, each said first final web page score in said web page score database;

    generating, by said offline analysis software application in response to said determining online, a first website score for said website, wherein said first website score is generated by combining said first final web page scores for said first sample set of web pages;

    storing, said first website score in said website score database;

    associating, by said online analysis software application, said first website score for said website with associated web pages in said set of candidate web pages;

    prioritizing, said set of candidate web pages with respect to a first associated website score for each web page in said candidate set of web pages;

    retrieving, by said web crawler, first content from said set of candidate web pages using said prioritizing;

    extracting, by said online analysis software application, first hyperlinks from said first content;

    storing said first hyperlinks in said memory unit.
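
The scoring and prioritizing steps of the claim can be illustrated with a minimal sketch. The claim only says that scores are generated by "combining" and does not name the heuristics, the combining function, or the threshold value, so everything concrete here is an assumption made for illustration: the two heuristics, the use of an arithmetic mean, the threshold of 3, the default score of 0.0 for unscored websites, and all function names are hypothetical.

```python
from statistics import mean
from urllib.parse import urlparse

# Hypothetical offline heuristics: each maps page text to a score in [0, 1].
def keyword_heuristic(text):
    return 1.0 if "forum" in text.lower() else 0.0

def length_heuristic(text):
    return min(len(text) / 1000.0, 1.0)

HEURISTICS = [keyword_heuristic, length_heuristic]
SAMPLE_THRESHOLD = 3  # assumed value; the claim only says "predetermined threshold"

def site_of(url):
    """Associate a web page with its website (here: the URL's host)."""
    return urlparse(url).netloc

def score_website(sample_pages, heuristics=HEURISTICS, threshold=SAMPLE_THRESHOLD):
    """Score a sample of pages (dict of url -> text) from one website.

    Returns (final_page_scores, website_score). The website score is None
    while the sample is still smaller than the threshold.
    """
    # One group of scores per sample page, one score per heuristic.
    groups = {url: [h(text) for h in heuristics]
              for url, text in sample_pages.items()}
    if len(groups) < threshold:
        return {}, None
    # Combine each group into a final page score (mean is an assumption).
    final = {url: mean(group) for url, group in groups.items()}
    # Combine the final page scores into the website score (mean again).
    return final, mean(list(final.values()))

def prioritize(candidates, website_scores):
    """Order candidate URLs by their website's score, highest first.

    Websites without a stored score default to 0.0 (an assumption).
    """
    return sorted(candidates,
                  key=lambda url: website_scores.get(site_of(url), 0.0),
                  reverse=True)
```

In this sketch a crawler would call `score_website` once a sample for a website has been fetched, store the result under `site_of(url)`, and then pass the candidate set through `prioritize` before retrieving content.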
