Identifying transient portions of web pages
Abstract
Systems, methods and computer readable media for identifying transient content in web pages. Transient content can be identified, for example, by parsing different versions of the same web page into tokens, and inserting fingerprints associated with the tokens into data structures. The data structures can be compared to each other to identify differences between the web pages, thereby identifying transient content associated with the web pages.
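The comparison described in the abstract can be illustrated with a minimal sketch. The regex tokenizer, the use of SHA-1 as the fingerprint, Python sets as the "data structures", and the function names below are all assumptions made for illustration; the abstract does not specify any of them.

```python
import hashlib
import re


def tokenize(html: str) -> list[str]:
    """Split markup into tags and text runs (a deliberately crude tokenizer)."""
    parts = re.findall(r"<[^>]+>|[^<]+", html)
    return [p.strip() for p in parts if p.strip()]


def fingerprint(token: str) -> str:
    """Hash a token. SHA-1 is an arbitrary choice for this sketch."""
    return hashlib.sha1(token.encode("utf-8")).hexdigest()


def transient_tokens(first_html: str, second_html: str) -> tuple[list[str], list[str]]:
    """Return the tokens whose fingerprints appear in only one version."""
    first_tokens = tokenize(first_html)
    second_tokens = tokenize(second_html)
    # One fingerprint set per version stands in for the "data structures".
    first_set = {fingerprint(t) for t in first_tokens}
    second_set = {fingerprint(t) for t in second_tokens}
    only_first = [t for t in first_tokens if fingerprint(t) not in second_set]
    only_second = [t for t in second_tokens if fingerprint(t) not in first_set]
    return only_first, only_second


# Two hypothetical versions of the same page; only the counter differs.
v1 = "<html><body><h1>News</h1><p>Visitors today: 120</p></body></html>"
v2 = "<html><body><h1>News</h1><p>Visitors today: 245</p></body></html>"
only_v1, only_v2 = transient_tokens(v1, v2)
print(only_v1, only_v2)
```

In this example the heading and all tags share fingerprints across both versions, so only the visitor-counter text is reported as transient.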
20 Claims
1. A method comprising:
retrieving a first version of a web page and a second version of the web page, the web page being associated with a website;
parsing the first version of the page into a first set of tokens and the second version of the web page into a second set of tokens;
inserting a first set of fingerprints associated with the first set of tokens into a first data structure, and a second set of fingerprints associated with the second set of tokens into a second data structure;
comparing the fingerprints included in the first and second data structures;
marking tokens whose associated fingerprints appear only in one data structure as transient;
identifying a path associated with every token in the web page;
identifying a subtree count comprising a number of times the path appears in other web pages associated with the website;
identifying a marked subtree count comprising a number of times the content associated with the path changes between versions of the respective web pages;
comparing the subtree count with the marked subtree count; and
determining whether the path is a transient path based upon the comparison.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11
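The last two steps of claim 1 compare a subtree count with a marked subtree count, but the claim does not say how the comparison decides the outcome. The sketch below assumes one plausible rule: a path is treated as transient when the fraction of its occurrences that changed meets a threshold. The 0.5 threshold, the slash-separated path strings, and the function name are hypothetical.

```python
from collections import Counter
from typing import Mapping


def transient_paths(
    subtree_counts: Mapping[str, int],
    marked_counts: Mapping[str, int],
    threshold: float = 0.5,  # assumed cutoff, not taken from the claims
) -> set[str]:
    """Return paths whose content changed in at least `threshold` of occurrences."""
    result = set()
    for path, total in subtree_counts.items():
        if total == 0:
            continue  # path never observed, nothing to compare
        changed = marked_counts.get(path, 0)
        if changed / total >= threshold:
            result.add(path)
    return result


# Hypothetical counts gathered across pages of one website.
subtree_counts = Counter({"html/body/div#ad": 10, "html/body/h1": 10, "html/body/p": 8})
marked_counts = Counter({"html/body/div#ad": 9, "html/body/p": 1})

paths = transient_paths(subtree_counts, marked_counts)
print(paths)
```

Here the advertisement container changed in 9 of 10 occurrences and is classified as transient, while the paragraph path changed in only 1 of 8 and is not.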
12. A system comprising:
a data processing apparatus comprising one or more computers; and
a memory apparatus in data communication with the data processing apparatus and storing instructions defining modules that are executable by the data processing apparatus, the modules comprising:
a retrieval module operable to retrieve at least two versions of a web page associated with a website;
a parser operable to parse the versions of the web page into sets of tokens being associated with a respective version of the web page;
a data structure generator operable to generate data structures for the versions of the web pages, the data structures comprising an entry for each token in the respective version of the web page;
a fingerprint generator operable to generate fingerprints associated with individual tokens and to insert the fingerprints into the respective data structure at a respective location associated with the token described by the respective fingerprint;
a content analysis module operable to compare the fingerprints associated with respective portions of the data structures, and to identify transient content associated with the web page based upon the comparison; and
a path analysis module operable to identify a path associated with the tokens marked as transient within the web page, a subtree count comprising a number of times the path appears in other web pages associated with the website, and a marked subtree count comprising a number of times the content associated with the path changes between versions of the respective web pages, wherein the path analysis module compares the subtree count with the marked subtree count and determines whether the path is a transient path based upon the comparison.
Dependent claims: 13, 14, 15, 16, 17
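Claim 12 associates each token with a location and, for marked tokens, a path, without defining either format in the text shown here. As a rough illustration, the sketch below assumes that a token is a run of text and that its path is the chain of enclosing element names. The `PathTokenizer` class, the `tokens_with_paths` helper, and the `VOID_TAGS` list are hypothetical; only the standard-library `html.parser.HTMLParser` is real.

```python
from html.parser import HTMLParser

# Elements without a closing tag; kept off the stack (partial list, assumed).
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class PathTokenizer(HTMLParser):
    """Collect (path, text) pairs, where path is the chain of open elements."""

    def __init__(self) -> None:
        super().__init__()
        self.stack: list[str] = []
        self.tokens: list[tuple[str, str]] = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop up to and including the matching open tag.
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.tokens.append(("/".join(self.stack), text))


def tokens_with_paths(html: str) -> list[tuple[str, str]]:
    parser = PathTokenizer()
    parser.feed(html)
    parser.close()
    return parser.tokens


page = "<html><body><h1>News</h1><div><p>Visitors today: 120</p></div></body></html>"
tokens = tokens_with_paths(page)
print(tokens)
```

Paths produced this way could serve as the keys for per-path counts gathered across the pages of a website.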
18. One or more non-transitory computer readable media having software program code operable to identify transient content, comprising modules operable to:
retrieve a first version of a web page and a second version of the web page, the web page being associated with a website;
parse the first version of the page into a first set of tokens and the second version of the web page into a second set of tokens;
insert a first set of fingerprints associated with the first set of tokens into a first data structure, and a second set of fingerprints associated with the second set of tokens into a second data structure;
compare the fingerprints included in the first and second data structures;
mark the tokens whose associated fingerprints appear only in one data structure as transient;
identify a path associated with every token in the web page;
identify a subtree count comprising a number of times the path appears in other web pages associated with the website;
identify a marked subtree count comprising a number of times the content associated with the path changes between versions of the respective web pages;
compare the subtree count with the marked subtree count; and
determine whether the path is a transient path based upon the comparison.
Dependent claims: 19, 20
Specification