Discrete wavelet transform method for document structure similarity
First Claim
1. A method for determining document structure similarity, comprising:
segmenting, by a computing device, path sequences of Document Object Model (DOM) trees from a number of web pages into B components;
determining path signals corresponding to the path sequences based on a count of the occurrences of particular paths in the Bth component, wherein determining path signals comprises weighting the path signals based on path sequence characteristics of a DOM tree;
transforming unique path signals into discrete wavelet signals;
analyzing the discrete wavelet signals at multiple DOM tree resolution levels, wherein analyzing the discrete wavelet signals comprises:
computing a distance value for every common signal path of two DOM trees; and
summing the distance values as a final tree distance for each of the two DOM trees; and
outputting a document structure similarity decision based on the analyses of the discrete wavelet signals.
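The segmenting and counting steps recited above can be illustrated with a minimal sketch. It rests on assumptions the claim does not spell out: a path sequence is taken to be a list of root-to-node tag paths in document order, the B components are contiguous slices of near-equal length, and the weighting is a hypothetical per-path callable (`weight`) that defaults to 1. The names `segment` and `path_signals` are illustrative only.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Sequence


def segment(seq: Sequence[str], b: int) -> List[List[str]]:
    """Split a path sequence into b contiguous components of near-equal length."""
    if b <= 0:
        raise ValueError("b must be positive")
    n = len(seq)
    # Integer boundaries keep every element in exactly one component.
    bounds = [i * n // b for i in range(b + 1)]
    return [list(seq[bounds[i]:bounds[i + 1]]) for i in range(b)]


def path_signals(
    seq: Sequence[str],
    b: int,
    weight: Callable[[str], float] = lambda path: 1.0,  # assumed weighting hook
) -> Dict[str, List[float]]:
    """Map each unique path to a length-b signal of weighted occurrence counts."""
    signals: Dict[str, List[float]] = defaultdict(lambda: [0.0] * b)
    for idx, component in enumerate(segment(seq, b)):
        for path in component:
            signals[path][idx] += weight(path)
    return dict(signals)


# Hypothetical path sequence for a small page.
example_sequence = [
    "html", "html/body",
    "html/body/div", "html/body/div/p",
    "html/body/div/p", "html/body/div",
    "html/body/div/p", "html/body/div/p",
]
example_signals = path_signals(example_sequence, 4)
```

With B = 4, each unique path yields a four-sample signal whose b-th entry is the weighted count of that path in the b-th component.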
Abstract
Examples of the present disclosure may include methods, systems, and computer readable media with executable instructions. An example method for determining document structure similarity can include segmenting path sequences (206) of Document Object Model (DOM) trees (120, 462) from a number of web pages (202) into B components (561). Path signals (210) corresponding to the path sequences (206) are determined based on a count of the occurrences of particular paths in the Bth component (571), and unique path signals (210) are transformed into discrete wavelet signals (214) (572). The discrete wavelet signals (214) are analyzed at multiple DOM tree resolution levels (573).
30 Citations
14 Claims
1. A method for determining document structure similarity, comprising:
segmenting, by a computing device, path sequences of Document Object Model (DOM) trees from a number of web pages into B components;
determining path signals corresponding to the path sequences based on a count of the occurrences of particular paths in the Bth component, wherein determining path signals comprises weighting the path signals based on path sequence characteristics of a DOM tree;
transforming unique path signals into discrete wavelet signals;
analyzing the discrete wavelet signals at multiple DOM tree resolution levels, wherein analyzing the discrete wavelet signals comprises:
computing a distance value for every common signal path of two DOM trees; and
summing the distance values as a final tree distance for each of the two DOM trees; and
outputting a document structure similarity decision based on the analyses of the discrete wavelet signals.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11
12. A non-transitory computer-readable medium having computer-executable instructions stored thereon, the computer-executable instructions comprising instructions that, when executed by one or more processors, cause the one or more processors to:
segment path sequences of Document Object Model (DOM) trees of two web pages into a number of equal components;
determine path signals corresponding to the path sequences based on a count of the occurrences of particular paths in at least one component, wherein determining path signals comprises weighting the path signals based on path sequence characteristics of a DOM tree;
transform unique path signals into Haar wavelet signals;
compare the Haar wavelet signals for similarity at multiple DOM tree resolution levels, wherein comparing the Haar wavelet signals comprises:
computing a distance value for every common signal path of two DOM trees; and
summing the distance values as a final tree distance for each of the two DOM trees; and
output a page similarity decision based on the comparison of the Haar wavelet signals.
Dependent claims: 13
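The Haar transform and multi-resolution comparison recited in claim 12 might be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: it uses the unnormalized average/difference form of the Haar transform, requires signal lengths that are powers of two, measures each resolution level with a Euclidean distance, and ignores paths that appear in only one tree. The function names `haar_levels` and `tree_distance` are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def haar_levels(signal: Sequence[float]) -> List[List[float]]:
    """Return Haar detail coefficients per level (finest first), then the final average."""
    n = len(signal)
    if n == 0 or n & (n - 1):
        raise ValueError("signal length must be a power of two")
    approx = [float(x) for x in signal]
    levels: List[List[float]] = []
    while len(approx) > 1:
        pairs = list(zip(approx[0::2], approx[1::2]))
        levels.append([(a - b) / 2 for a, b in pairs])  # detail at this level
        approx = [(a + b) / 2 for a, b in pairs]        # coarser approximation
    levels.append(approx)
    return levels


def tree_distance(
    signals_a: Dict[str, Sequence[float]],
    signals_b: Dict[str, Sequence[float]],
) -> float:
    """Sum, over paths common to both trees, the per-level coefficient distances."""
    total = 0.0
    for path in signals_a.keys() & signals_b.keys():
        levels_a = haar_levels(signals_a[path])
        levels_b = haar_levels(signals_b[path])
        # Assumed metric: Euclidean distance at each resolution level.
        total += sum(math.dist(ca, cb) for ca, cb in zip(levels_a, levels_b))
    return total
```

A similarity decision could then compare the returned distance against a threshold chosen for the application; the claim does not fix a threshold value, so any particular choice would be an assumption.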
14. A computing system coupled to a non-transitory computer readable medium having computer-executable instructions stored thereon to determine document structure similarity when executed by one or more processors, the instructions comprising:
an HTML parser to parse Document Object Model (DOM) trees from a number of web pages into path sequences;
a path sequence segmentation module to segment the path sequences into B components and determine path signals corresponding to the path sequences based on a count of the occurrences of particular paths in the Bth component, wherein determining path signals comprises weighting the path signals based on path sequence characteristics of a DOM tree;
a Haar wavelet transformation module transforming unique path signals into discrete wavelet signals; and
an analyzer to:
compute a distance value of the discrete wavelet signals at multiple DOM tree resolution levels for every common signal path of the multiple DOM trees;
sum the distance values as a final tree distance for each of the multiple DOM trees; and
output a document structure similarity decision based on the discrete wavelet signals.
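The HTML parser element of claim 14 turns a page into a path sequence. One way to approximate that step with the Python standard library is sketched below. It assumes that a path is the slash-joined chain of tag names from the root to each element, recorded in document order, and it tracks open tags with a simple stack rather than building a browser-grade DOM. The class `PathSequenceParser` and function `path_sequence` are hypothetical names.

```python
from html.parser import HTMLParser
from typing import List

# HTML void elements never have a closing tag, so they are not pushed on the stack.
VOID_TAGS = {
    "area", "base", "br", "col", "embed", "hr", "img",
    "input", "link", "meta", "source", "track", "wbr",
}


class PathSequenceParser(HTMLParser):
    """Collect the root-to-element tag path of every element, in document order."""

    def __init__(self) -> None:
        super().__init__()
        self._stack: List[str] = []
        self.paths: List[str] = []

    def handle_starttag(self, tag, attrs):
        self.paths.append("/".join(self._stack + [tag]))
        if tag not in VOID_TAGS:
            self._stack.append(tag)

    def handle_endtag(self, tag):
        # Tolerate unclosed children by popping back to the matching open tag.
        if tag in self._stack:
            while self._stack.pop() != tag:
                pass


def path_sequence(html: str) -> List[str]:
    """Parse an HTML string and return its path sequence."""
    parser = PathSequenceParser()
    parser.feed(html)
    parser.close()
    return parser.paths
```

The resulting list could feed a segmentation step that splits it into B components; real-world pages with heavily malformed markup may need a more forgiving parser than this sketch.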
Specification