Discrete Wavelet Transform Method for Document Structure Similarity
First Claim
1. A method for determining document structure similarity, comprising:
segmenting path sequences (206) of Document Object Model (DOM) trees (120, 462) from a number of web pages (202) into B components (561);
determining path signals (210) corresponding to the path sequences (206) based on a count of the occurrences of particular paths in the Bth component (571);
transforming unique path signals (210) into discrete wavelet signals (214) (572); and
analyzing the discrete wavelet signals (214) at multiple DOM tree resolution levels (573).
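The segmenting and counting steps of the claim can be illustrated with a minimal Python sketch. It rests on assumptions that the claim does not state: the path sequence is split into contiguous, near-equal components, and the path signal of each unique path is the list of its occurrence counts in components 1 through B. The function names `segment` and `path_signals` and the sample paths are hypothetical.

```python
from collections import Counter


def segment(path_sequence, num_components):
    """Split a path sequence into num_components contiguous, near-equal parts."""
    n = len(path_sequence)
    bounds = [(i * n) // num_components for i in range(num_components + 1)]
    return [path_sequence[bounds[i]:bounds[i + 1]] for i in range(num_components)]


def path_signals(path_sequence, num_components):
    """Map each unique path to a list of its counts in each component (assumed reading)."""
    counts = [Counter(part) for part in segment(path_sequence, num_components)]
    return {
        path: [component_counts[path] for component_counts in counts]
        for path in sorted(set(path_sequence))
    }


# Illustrative path sequence, as an HTML parser might emit it for one page.
paths = [
    "/html/body/div", "/html/body/div/p", "/html/body/div/p", "/html/body/div",
    "/html/body/ul", "/html/body/ul/li", "/html/body/ul/li", "/html/body/ul/li",
]
signals = path_signals(paths, 4)  # B = 4 components of two paths each
```

With B = 4, the list items fall entirely in the last two components, so the signal for `/html/body/ul/li` is `[0, 0, 1, 2]`, while `/html/body/div` yields `[1, 1, 0, 0]`.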
Abstract
Examples of the present disclosure may include methods, systems, and computer readable media with executable instructions. An example method for determining document structure similarity can include segmenting path sequences (206) of Document Object Model (DOM) trees (120, 462) from a number of web pages (202) into B components (561). Path signals (210) corresponding to the path sequences (206) are determined based on a count of the occurrences of particular paths in the Bth component (571), and unique path signals (210) are transformed into discrete wavelet signals (214) (572). The discrete wavelet signals (214) are analyzed at multiple DOM tree resolution levels (573).
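The transformation of a path signal into a discrete wavelet signal can be sketched with the Haar wavelet, which claims 13 and 15 name. The sketch below is one possible form, not necessarily the one used in the disclosure: it assumes the averaging (unnormalized) Haar variant and a signal length that is a power of two. The function names `haar_step` and `haar_transform` are hypothetical.

```python
def haar_step(signal):
    """One Haar level: pairwise averages (approximation) and half-differences (detail)."""
    approx = [(signal[i] + signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    return approx, detail


def haar_transform(signal):
    """Full decomposition: coarsest approximation plus details ordered coarse to fine."""
    if len(signal) == 0 or len(signal) & (len(signal) - 1):
        raise ValueError("signal length must be a power of two")
    approx = list(signal)
    details = []
    while len(approx) > 1:
        approx, detail = haar_step(approx)
        details.insert(0, detail)  # keep the coarsest detail level first
    return approx, details


# A path signal over B = 4 components (illustrative values).
approx, details = haar_transform([0, 0, 1, 2])
# approx == [0.75], details == [[-0.75], [0.0, -0.5]]
```

Each level halves the signal length, so a signal over B components yields log2(B) resolution levels, from the overall average count down to differences between adjacent components.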
15 Claims
1. A method for determining document structure similarity, comprising:
segmenting path sequences (206) of Document Object Model (DOM) trees (120, 462) from a number of web pages (202) into B components (561);
determining path signals (210) corresponding to the path sequences (206) based on a count of the occurrences of particular paths in the Bth component (571);
transforming unique path signals (210) into discrete wavelet signals (214) (572); and
analyzing the discrete wavelet signals (214) at multiple DOM tree resolution levels (573).
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12.
13. A non-transitory computer-readable medium (681, 795) having computer-executable instructions stored thereon, the computer-executable instructions comprising instructions (682) that, if executed by one or more processors (680, 794), cause the one or more processors (680, 794) to:
segment path sequences (206) of Document Object Model (DOM) trees (120, 462) of two web pages (202) into a number of equal components (561);
determine path signals (210) corresponding to the path sequences (206) based on a count of the occurrences of particular paths in at least one component (571);
transform unique path signals (210) into Haar wavelet signals; and
compare the Haar wavelet signals for similarity at multiple DOM tree (120, 462) resolution levels.
Dependent claims: 14.
15. A computing system for determining document structure comprising:
an HTML parser (204) to parse Document Object Model (DOM) trees from a number of web pages (202) into path sequences (206);
a path sequence segmentation module (208) to segment the path sequences (206) into B components and determine path signals (210) corresponding to the path sequences (206) based on a count of the occurrences of particular paths in the Bth component;
a Haar wavelet transformation module (212) transforming unique path signals (210) into discrete wavelet signals (214); and
an analyzer (216) to compute a cumulative distance value of the discrete wavelet signals (214) at multiple DOM tree resolution levels.
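The analyzer's cumulative distance value can be illustrated with a short Python sketch. The claims do not specify the distance metric or how the levels are combined, so the choices here are assumptions: Haar approximations computed by pairwise averaging, Euclidean distance at each level, equal weight for every level, and an all-zero signal for a path that appears in only one page. The names `approximations` and `cumulative_distance` and the sample signals are hypothetical.

```python
import math


def approximations(signal):
    """Haar approximations of a signal, from finest (the signal itself) to coarsest."""
    if len(signal) == 0 or len(signal) & (len(signal) - 1):
        raise ValueError("signal length must be a power of two")
    levels = [[float(x) for x in signal]]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([(prev[i] + prev[i + 1]) / 2 for i in range(0, len(prev), 2)])
    return levels


def cumulative_distance(signals_a, signals_b, num_components):
    """Sum per-level Euclidean distances over the union of paths in both pages."""
    zero = [0] * num_components  # stand-in signal for a path missing from one page
    total = 0.0
    for path in set(signals_a) | set(signals_b):
        levels_a = approximations(signals_a.get(path, zero))
        levels_b = approximations(signals_b.get(path, zero))
        for a, b in zip(levels_a, levels_b):
            total += math.dist(a, b)
    return total


# Path signals for two pages over B = 4 components (illustrative values).
page_a = {"/html/body/div": [1, 1, 0, 0], "/html/body/ul/li": [0, 0, 1, 2]}
page_b = {"/html/body/div": [1, 1, 0, 0], "/html/body/ul/li": [0, 0, 2, 1]}
distance = cumulative_distance(page_a, page_b, 4)
```

In this example the two pages differ only in how the list items are distributed between the last two components. The finest level contributes a distance of sqrt(2), while the coarser levels average that difference away and contribute zero, which shows how a multi-resolution comparison can treat small local shifts as minor.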
Specification