
System and method for focused re-crawling of web sites

  • US 7,882,099 B2
  • Filed: 03/25/2008
  • Issued: 02/01/2011
  • Est. Priority Date: 12/21/2005
  • Status: Expired due to Fees

1. A method of crawling the Web, said method comprising:

  • crawling Web pages on the Web starting from a given set of seed Universal Resource Locators (URLs);

    partitioning crawled Web pages into sets of relevant and irrelevant Web pages;

    discovering from said sets of relevant and irrelevant Web pages:

    a set of exclusion patterns based on determining that all Web pages present in said irrelevant set of Web pages satisfy at least one exclusion pattern of said set of exclusion patterns, and no Web pages present in said relevant set of Web pages satisfy any pattern of said set of exclusion patterns; and

    a set of inclusion patterns based on determining that all Web pages present in said relevant set of Web pages satisfy at least one inclusion pattern of said set of inclusion patterns, and no Web pages present in said irrelevant set of Web pages satisfy any pattern of said set of inclusion patterns;

    re-partitioning irrelevant Web pages containing a link to at least one relevant page as relevant;

    forming a first set containing words appearing in URLs from relevant Web pages and a second set containing words appearing in URLs from irrelevant Web pages; and

    restricting subsequent crawling of the Web pages through said set of exclusion and inclusion patterns.
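
The steps recited in claim 1 can be illustrated with a minimal sketch. It is not the patented implementation: it assumes that a "pattern" is a single word taken from a page's URL, that relevance labels are already given, and that re-partitioning is applied before pattern discovery. The function names (`url_words`, `repartition`, `discover_patterns`, `allowed`) and the example.com URLs are hypothetical.

```python
import re

def url_words(url):
    """Split a URL into lowercase alphanumeric words (assumed tokenization)."""
    return {w for w in re.split(r"[^a-z0-9]+", url.lower()) if w}

def repartition(relevant, irrelevant, links):
    """Treat irrelevant pages that link to at least one relevant page as relevant."""
    promoted = {u for u in irrelevant if links.get(u, set()) & relevant}
    return relevant | promoted, irrelevant - promoted

def discover_patterns(covered, avoided):
    """Find words such that every URL in `covered` contains at least one of them
    and no URL in `avoided` contains any. Returns None if no such set exists."""
    avoided_words = set().union(*(url_words(u) for u in avoided))
    patterns = set()
    for url in covered:
        distinctive = url_words(url) - avoided_words
        if not distinctive:
            return None  # this URL cannot be separated by single words
        patterns |= distinctive
    return patterns

def allowed(url, inclusion, exclusion):
    """Restrict crawling: follow a URL only if it matches an inclusion
    pattern and matches no exclusion pattern."""
    words = url_words(url)
    return bool(words & inclusion) and not (words & exclusion)

# Hypothetical labeled crawl results.
relevant = {
    "http://example.com/products/laptop-123",
    "http://example.com/products/phone-456",
}
irrelevant = {
    "http://example.com/forum/thread-9",
    "http://example.com/login",
    "http://example.com/catalog",
}
# Outgoing links per page; the catalog page links to a relevant page.
links = {"http://example.com/catalog": {"http://example.com/products/laptop-123"}}

relevant, irrelevant = repartition(relevant, irrelevant, links)
inclusion = discover_patterns(relevant, irrelevant)   # words seen only in relevant URLs
exclusion = discover_patterns(irrelevant, relevant)   # words seen only in irrelevant URLs

print(allowed("http://example.com/products/tablet-789", inclusion, exclusion))
print(allowed("http://example.com/forum/thread-10", inclusion, exclusion))
```

In this sketch the two word sets formed from relevant and irrelevant URLs double as the candidate patterns. A real crawler would likely use richer patterns (for example URL prefixes or regular expressions) and would have to handle the case where no separating pattern set exists.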
