Method of establishing a plain text document from a HTML document
First Claim
1. A method of establishing a plain text document from a HTML document, comprising the steps of:
(A) acquiring a HTML document defined by HTML elements, each HTML element composed of tags and content between the tags;
(B) pre-processing the HTML document by omitting some of the HTML elements, whereby the rest of the HTML document comprises at least one target tag and at least one corresponding content;
(C) using a data structure to store the remaining tags of the pre-processed HTML document;
(D) grouping the remaining HTML elements with the remaining tags stored in the data structure of the pre-processed HTML document into at least one target group according to the target tag(s), the step (D) further comprises the steps of;
(D-11) sequentially searching for a first content near the target tag from the rest of the HTML document, and identifying the first content as a first base content;
(D-12) sequentially searching for next content near the target tag from the first base content, and if there is no next content near the target tag, implementing the step (D-15);
(D-13) if an interval between the next content of the step (D-12) and the base content is smaller than a predetermined threshold, identifying the next content of the step (D-12) as a current base content, and repeating the step (D-12), otherwise, implementing the step (D-14);
(D-14) grouping the first content and the current base content(s) into a target group, and identifying the next content as another first base content, implementing the step (D-12); and
(D-15) grouping the first base content into one of the target groups; and
(E) identifying the target group(s) most related to a title of the HTML document by comparing correlation(s) between the target group(s) and the title, and establishing a plain text document having the content of the identified target group.
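The grouping loop of steps (D-11) to (D-15) can be sketched as follows. This is only an illustration under stated assumptions, not the patented implementation: each content is assumed to carry an integer position in the pre-processed document, the "interval" is assumed to be the difference between two positions, and the names `Content` and `group_by_interval` are hypothetical.

```python
from typing import List, Tuple

# Assumption: a content is (position in the pre-processed document, text).
Content = Tuple[int, str]


def group_by_interval(contents: List[Content], threshold: int) -> List[List[str]]:
    """Group position-ordered contents; an interval >= threshold starts a new group."""
    groups: List[List[str]] = []
    if not contents:
        return groups
    # (D-11) the first content found becomes the first base content
    base_pos, first_text = contents[0]
    current = [first_text]
    for pos, text in contents[1:]:        # (D-12) look for the next content
        if pos - base_pos < threshold:    # (D-13) close enough: extend the current run
            current.append(text)
        else:                             # (D-14) close the group and start a new one
            groups.append(current)
            current = [text]
        base_pos = pos                    # the next content is the base for the following test
    groups.append(current)                # (D-15) no next content: group what is left
    return groups
```

With a threshold of 5, contents at positions 0, 2, 3, 20, 22 and 50 would fall into three groups under this sketch, because the gaps 3 to 20 and 22 to 50 are not smaller than the threshold.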
Abstract
The present invention provides a method of establishing a plain text document from a HTML document. The method includes the steps of: (A) acquiring a HTML document defined by HTML elements, each composed of tags and content between the tags; (B) pre-processing the HTML document by omitting some of the tags (including the content between those tags), whereby the rest of the HTML document comprises at least one target tag (including content between the target tags); (C) using a data structure to store the remaining tags of the pre-processed HTML document; (D) grouping the remaining tags (including the content between the remaining tags) stored in the data structure of the pre-processed HTML document into at least one target group according to the target tag(s); and (E) identifying the target group(s) most related to a title of the HTML document by comparing correlation(s) between the target group(s) and the title, and establishing a plain text document having the content of the identified target group.
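Steps (A) to (C) of the abstract can be illustrated with Python's standard `html.parser` module. The sketch below rests on several assumptions that the source does not state: the sets `OMITTED`, `TARGETS` and `VOID` are made-up examples of which elements are dropped, which tags count as target tags, and which tags never have a closing tag; the class name `TargetCollector` and the flat list of records are likewise hypothetical stand-ins for the "data structure" of step (C).

```python
from html.parser import HTMLParser
from typing import List, Optional, Tuple

OMITTED = {"script", "style", "nav", "footer"}  # assumption: elements dropped in step (B)
TARGETS = {"p", "div", "td"}                    # assumption: tags treated as target tags
VOID = {"br", "hr", "img", "input", "link", "meta"}  # tags with no closing tag


class TargetCollector(HTMLParser):
    """Parse HTML, skip omitted elements, and record target tags with their content."""

    def __init__(self) -> None:
        super().__init__()
        self.skip_tag: Optional[str] = None  # name of the omitted element we are inside
        self.skip_depth = 0                  # nesting depth of that same element
        self.stack: List[str] = []           # open tags that survived pre-processing
        self.tag_index = 0                   # running count of surviving start tags
        self.records: List[Tuple[int, str, str]] = []  # (tag index, tag, text)

    def handle_starttag(self, tag, attrs):
        if self.skip_tag is not None:
            if tag == self.skip_tag:
                self.skip_depth += 1
            return
        if tag in OMITTED:
            self.skip_tag, self.skip_depth = tag, 1
            return
        if tag in VOID:
            return
        self.stack.append(tag)
        self.tag_index += 1

    def handle_endtag(self, tag):
        if self.skip_tag is not None:
            if tag == self.skip_tag:
                self.skip_depth -= 1
                if self.skip_depth == 0:
                    self.skip_tag = None
            return
        if tag in self.stack:
            # Pop back to the matching open tag, tolerating unclosed tags.
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if self.skip_tag is not None or not text or not self.stack:
            return
        if self.stack[-1] in TARGETS:
            self.records.append((self.tag_index, self.stack[-1], text))
```

After `collector = TargetCollector()` and `collector.feed(html)`, `collector.records` holds the remaining target tags in document order. The tag index in each record is one possible position measure for the interval test of step (D).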
15 Claims
1. A method of establishing a plain text document from a HTML document, as recited in full under First Claim above.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13
14. A method of establishing a plain text document from a HTML document, comprising the steps of:
(A) acquiring a HTML document defined by HTML elements, each composed of tags and content between the tags;
(B) pre-processing the HTML document by omitting some of the HTML elements, whereby the rest of the HTML document comprises at least one target tag and at least one corresponding content;
(C) using a data structure to store the remaining tags of the pre-processed HTML document;
(D) grouping the remaining HTML elements with the remaining tags stored in the data structure of the pre-processed HTML document into at least one target group according to the target tag(s);
(E) identifying the target group(s) most related to a title of the HTML document by comparing correlation(s) between the target group(s) and the title, and establishing a plain text document having the content of the identified target group, wherein the target group(s) most related to the title of the HTML document is identified by the steps;
(E-1) if there is no sub-group in the target group(s), identifying the target group most related to the title of the HTML document by comparing correlation(s) between the target group(s) and the title;
(E-2) calculating similarities of the target groups not be identified in the step (E-1) to the most title-related target group based on a vector space model to identify the target groups having the similarities higher than a predetermined threshold, and establishing the plain text document having the content of the identified target groups;
(E-3) if there is (are) sub-group(s) in the target group(s), identifying the sub-group most related to the title of the HTML document by comparing correlation(s) between the sub-groups and the title;
(E-4) if there is only one sub-group, establishing the plain text document having the content of the identified sub-group; and
(E-5) if there are more than one sub-groups, calculating similarities of the other sub-groups to the most title-related sub-group based on a vector space model to identify the sub-groups having the similarities higher than a predetermined threshold, and establishing the plain text document having the content of the identified sub-groups.
Dependent claims: 15
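The selection logic of steps (E-1) to (E-5) can be sketched with a simple vector space model. The claim does not say how the "correlation" with the title is computed or how terms are weighted, so this sketch assumes raw term-frequency vectors and cosine similarity for both the title comparison and the group-to-group comparison. The function names are hypothetical, and groups and sub-groups are treated alike as a list of candidate texts.

```python
import math
import re
from collections import Counter
from typing import List


def tf_vector(text: str) -> Counter:
    """Bag-of-words term-frequency vector over lowercased word tokens."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity of two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def select_groups(title: str, groups: List[str], threshold: float) -> List[str]:
    """Pick the most title-related group, then add the groups similar to it."""
    if not groups:
        return []
    title_vec = tf_vector(title)
    vectors = [tf_vector(g) for g in groups]
    # (E-1)/(E-3): assumption: correlation with the title is cosine similarity.
    best = max(range(len(groups)), key=lambda i: cosine(vectors[i], title_vec))
    # (E-2)/(E-5): keep the other candidates whose similarity to the best one
    # is higher than the threshold; a single candidate is returned as-is (E-4).
    keep = [i for i in range(len(groups))
            if i == best or cosine(vectors[i], vectors[best]) > threshold]
    return [groups[i] for i in keep]
```

Joining the returned texts with newlines would give the plain text document. A real system would likely add stop-word removal or TF-IDF weighting, which this sketch omits for brevity.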
Specification