Automatic visual segmentation of webpages
Abstract
To provide valuable information regarding a webpage, the webpage must be divided into distinct, semantically coherent segments for analysis. A set of heuristics allows a segmentation algorithm to identify more accurately an optimal number of segments for a given webpage or any portion thereof. A first heuristic estimates the optimal number of segments for any given webpage or portion thereof. A second heuristic coalesces segments where the number of segments identified far exceeds the optimal number recommended. A third heuristic coalesces segments corresponding to a portion of a webpage with much unused whitespace and little content. A fourth heuristic coalesces segments of nodes that have a recommended number of segments below a certain threshold into segments of other nodes. A fifth heuristic recursively analyzes and splits segments that correspond to webpage portions surpassing a certain threshold portion size.
22 Claims
1. A method to divide a webpage into semantic units, comprising computer-executed steps of:
estimating a target optimal number that represents how many semantic units should be associated with said webpage;
identifying what fraction of said webpage is occupied by a rendered area associated with a node of a DOM tree corresponding to said webpage;
determining whether the fraction multiplied by said target optimal number is below a threshold number; and
in response to determining that said fraction multiplied by said target optimal number is below the threshold number, merging a first semantic unit associated with said node with a second semantic unit associated with a second node into a single semantic unit;
wherein the method is performed by one or more computing devices.
Dependent claims: 2, 3, 4, 5, 12, 13, 14, 15, 16
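The merge test recited in claim 1 can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the patented implementation: the `Node` class, the pixel areas, the default threshold of 1.0, and the choice to fold a small node into the unit immediately before it are all hypothetical. The claim itself only requires merging with a unit associated with a second node.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical stand-in for a DOM node with a rendered area in pixels."""
    name: str
    rendered_area: float


def recommended_units(node, page_area, target_units):
    """Fraction of the page occupied by the node, multiplied by the target number."""
    return (node.rendered_area / page_area) * target_units


def should_merge(node, page_area, target_units, threshold=1.0):
    """True when fraction * target falls below the threshold (threshold value assumed)."""
    return recommended_units(node, page_area, target_units) < threshold


def merge_small_units(nodes, page_area, target_units, threshold=1.0):
    """Group nodes into units, folding each below-threshold node into the
    preceding unit. Picking the preceding unit as merge partner is an
    assumption; a small first node simply starts its own unit."""
    units = []
    for node in nodes:
        if units and should_merge(node, page_area, target_units, threshold):
            units[-1].append(node.name)
        else:
            units.append([node.name])
    return units


nodes = [
    Node("header", 150_000),
    Node("breadcrumb", 20_000),
    Node("main", 700_000),
    Node("footer", 130_000),
]
units = merge_small_units(nodes, page_area=1_000_000, target_units=8)
```

With these made-up areas, only the breadcrumb scores below the threshold (0.02 of the page times 8), so it is merged into the header's unit while the other nodes keep their own units.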
6. A method to divide a webpage into semantic units, comprising computer-executed steps of:
estimating a target optimal number that represents how many semantic units should be associated with said webpage;
identifying what fraction of said webpage is occupied by a rendered area on said webpage, wherein said rendered area corresponds to a semantic unit;
determining whether the fraction multiplied by said target optimal number exceeds a threshold number; and
in response to determining that said fraction multiplied by said target optimal number exceeds said threshold number, dividing said semantic unit into a plurality of semantic units;
wherein the method is performed by one or more computing devices.
Dependent claims: 7, 8, 9, 10, 11, 17, 18, 19, 20, 21, 22
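The split test recited in claim 6, applied recursively as the abstract describes for the fifth heuristic, might be sketched as follows. This is a sketch under assumptions rather than the method as actually implemented: the `Unit` class, the threshold of 2.0, and dividing a unit along its child regions are hypothetical choices, since the claim does not say how the division is carried out.

```python
from dataclasses import dataclass, field


@dataclass
class Unit:
    """Hypothetical semantic unit: a rendered area plus optional sub-regions."""
    name: str
    area: float
    children: list = field(default_factory=list)


def split_large_units(unit, page_area, target_units, threshold=2.0):
    """Return the final list of units. A unit is divided when
    fraction * target exceeds the threshold and it has child regions
    to divide into (dividing along children is an assumption)."""
    score = (unit.area / page_area) * target_units
    if score > threshold and unit.children:
        result = []
        for child in unit.children:
            # Re-apply the same test to each resulting piece.
            result.extend(
                split_large_units(child, page_area, target_units, threshold)
            )
        return result
    return [unit]


page = Unit("body", 1_000_000, [
    Unit("nav", 100_000),
    Unit("content", 900_000, [
        Unit("article", 600_000),
        Unit("sidebar", 300_000),
    ]),
])
leaves = [u.name for u in split_large_units(page, page_area=1_000_000, target_units=8)]
```

In this example the body and content regions exceed the threshold and are divided, the nav region stays whole because it scores below the threshold, and the article and sidebar regions stay whole because they have no sub-regions left to divide into.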
Specification