Information block extraction apparatus and method for Web pages
Abstract
A method and apparatus for identifying coherent areas within a Web page. First, a Web page is parsed into an HTML DOM tree and an HTML tag token stream. Next, repeated patterns are induced from the Web page. After improper repeated patterns are filtered out and corresponding instances of the remaining patterns are generated, the patterns are mapped back to their corresponding regions in the Web page. Based on these mappings, a hierarchical RST tree containing information blocks is generated. Information items within the information blocks are detected and then used to generate a hierarchical structural information block tree. Information blocks from the structural information block tree are then classified into text information blocks and link information blocks. Based on this classification and on block semantic similarity, the blocks are clustered and then grouped into semantic information blocks. The semantic information blocks contain main text information blocks and related link blocks, which can be labeled if necessary.
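The first two steps of the abstract (producing an HTML tag token stream and inducing repeated patterns from it) can be illustrated with a minimal Python sketch. This is not the patented procedure: the names `TagTokenizer`, `tag_token_stream`, and `repeated_patterns` are hypothetical, and a naive count of fixed-length tag n-grams stands in for whatever pattern induction and filtering the specification actually defines.

```python
from collections import Counter
from html.parser import HTMLParser


class TagTokenizer(HTMLParser):
    """Collects start and end tags as a flat token stream (attributes and text are dropped)."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        self.tokens.append(f"<{tag}>")

    def handle_endtag(self, tag):
        self.tokens.append(f"</{tag}>")


def tag_token_stream(html):
    """Return the list of tag tokens for an HTML string."""
    parser = TagTokenizer()
    parser.feed(html)
    parser.close()
    return parser.tokens


def repeated_patterns(tokens, length, min_count=2):
    """Assumed stand-in for pattern induction: count overlapping tag n-grams
    of the given length and keep those seen at least min_count times."""
    windows = Counter(
        tuple(tokens[i:i + length]) for i in range(len(tokens) - length + 1)
    )
    return {pattern: count for pattern, count in windows.items() if count >= min_count}


sample = (
    "<ul>"
    "<li><a href='/a'>A</a></li>"
    "<li><a href='/b'>B</a></li>"
    "<li><a href='/c'>C</a></li>"
    "</ul>"
)
tokens = tag_token_stream(sample)
patterns = repeated_patterns(tokens, length=4)
```

On this sample the list-item pattern `<li> <a> </a> </li>` is found three times, alongside shifted variants that straddle two items. Discarding such shifted variants is one plausible reading of the "filtering out improper repeated patterns" step, though the abstract does not say which criteria are used.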
17 Claims
1. A method for segmenting a Web page into information blocks with coherent contents comprising:
generating a structural information block tree of the Web page;
clustering and merging the structural information blocks; and
labeling the semantic of the resulting blocks.
9. An apparatus for segmenting a Web page into information blocks with coherent contents comprising:
a structural information block extracting unit generating a structural information block tree of the Web page; and
a semantic information block extracting unit clustering and merging the structural information blocks and labeling the semantic of the resulting blocks.
17. A method for segmenting a Web page into information blocks with coherent contents comprising the steps of:
extracting structural information blocks from the Web page; and
generating semantic information blocks based on the structural information blocks.
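The step from structural to semantic information blocks can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the claimed method: the `Block` type, the link-text ratio, and its 0.5 threshold are hypothetical, and merging adjacent blocks that share a label stands in for the clustering by block semantic similarity that the abstract mentions.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass
class Block:
    """Hypothetical structural information block, reduced to two counts."""
    text_chars: int  # total characters of visible text in the block
    link_chars: int  # characters of that text lying inside <a> elements


def classify(block, link_ratio_threshold=0.5):
    """Label a block 'link' when most of its text is anchor text, else 'text'.
    The ratio test and threshold are assumptions made for this sketch."""
    ratio = block.link_chars / block.text_chars if block.text_chars else 0.0
    return "link" if ratio > link_ratio_threshold else "text"


def group_adjacent(blocks):
    """Merge runs of consecutive blocks with the same label into one group."""
    return [(label, list(run)) for label, run in groupby(blocks, key=classify)]


blocks = [Block(500, 20), Block(300, 0), Block(80, 72), Block(60, 60)]
groups = group_adjacent(blocks)
```

Here the two prose-heavy blocks form one text group and the two anchor-heavy blocks form one link group. A fuller version would compare block contents before merging, since adjacency alone cannot tell related link blocks from unrelated ones.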
Specification