Identifying transient portions of web pages

  • US 8,086,953 B1
  • Filed: 12/19/2008
  • Issued: 12/27/2011
  • Est. Priority Date: 12/19/2008
  • Status: Active Grant
First Claim

1. A method comprising:

    retrieving a first version of a web page and a second version of the web page, the web page being associated with a website;

    parsing the first version of the page into a first set of tokens and the second version of the web page into a second set of tokens;

    inserting a first set of fingerprints associated with the first set of tokens into a first data structure, and a second set of fingerprints associated with the second set of tokens into a second data structure;

    comparing the fingerprints included in the first and second data structures;

    marking tokens whose associated fingerprints appear only in one data structure as transient;

    identifying a path associated with every token in the web page;

    identifying a subtree count comprising a number of times the path appears in other web pages associated with the website;

    identifying a marked subtree count comprising a number of times the content associated with the path changes between versions of the respective web pages;

    comparing the subtree count with the marked subtree count; and

    determining whether the path is a transient path based upon the comparison.
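
The steps recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation. Several choices are assumptions made only for illustration: a token is taken to be a text node, a path is the chain of enclosing tag names, a fingerprint is a SHA-1 hash of path plus text, the data structures are Python sets, and the final comparison is a ratio checked against a hypothetical `threshold` parameter. The claim itself only says the two counts are compared. All function and class names below are hypothetical.

```python
import hashlib
from collections import Counter
from html.parser import HTMLParser


class PathTokenizer(HTMLParser):
    """Collects (path, text) tokens; the path is the chain of open tags.

    Simplification: void tags such as <br> stay on the stack until an
    enclosing element closes, which is acceptable for a sketch.
    """

    def __init__(self):
        super().__init__()
        self.stack = []
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop up to and including the matching open tag.
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.tokens.append(("/" + "/".join(self.stack), text))


def tokenize(html):
    parser = PathTokenizer()
    parser.feed(html)
    parser.close()
    return parser.tokens


def fingerprint(path, text):
    # Assumption: hash of path and text; the claim does not fix a scheme.
    return hashlib.sha1(f"{path}\x00{text}".encode("utf-8")).hexdigest()


def transient_tokens(html_a, html_b):
    """Return both token lists and the tokens marked as transient."""
    tokens_a, tokens_b = tokenize(html_a), tokenize(html_b)
    set_a = {fingerprint(*t) for t in tokens_a}  # first data structure
    set_b = {fingerprint(*t) for t in tokens_b}  # second data structure
    only_one = set_a ^ set_b  # fingerprints present in exactly one set
    marked = [t for t in tokens_a + tokens_b if fingerprint(*t) in only_one]
    return tokens_a, tokens_b, marked


def transient_paths(page_versions, threshold=0.5):
    """page_versions: iterable of (first_version, second_version) HTML pairs
    for pages of one website. threshold is a hypothetical cutoff."""
    subtree_count = Counter()  # pages in which a path appears
    marked_count = Counter()   # pages in which content at the path changed
    for html_a, html_b in page_versions:
        tokens_a, tokens_b, marked = transient_tokens(html_a, html_b)
        subtree_count.update({path for path, _ in tokens_a + tokens_b})
        marked_count.update({path for path, _ in marked})
    return {
        path
        for path, total in subtree_count.items()
        if marked_count[path] / total >= threshold
    }


# Two pages of a hypothetical site, each fetched twice; only the clock differs.
example_pages = [
    ("<html><body><div><p>Hello</p><span>10:01</span></div></body></html>",
     "<html><body><div><p>Hello</p><span>10:02</span></div></body></html>"),
    ("<html><body><div><p>About</p><span>11:15</span></div></body></html>",
     "<html><body><div><p>About</p><span>11:16</span></div></body></html>"),
]
print(transient_paths(example_pages))  # {'/html/body/div/span'}
```

In this example the `span` path appears on both pages and its content changes on both, so it is reported as transient, while the `p` path appears on both pages and never changes. Counting each path once per page is one reading of the subtree count and marked subtree count; other counting schemes would fit the claim language equally well.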
