Method of establishing a plain text document from a HTML document

  • US 8,392,820 B2
  • Filed: 12/01/2009
  • Issued: 03/05/2013
  • Est. Priority Date: 12/01/2008
  • Status: Active Grant
First Claim

1. A method of establishing a plain text document from a HTML document, comprising the steps of:

    (A) acquiring a HTML document defined by HTML elements, each HTML element composed of tags and content between the tags;

    (B) pre-processing the HTML document by omitting some of the HTML elements, whereby the rest of the HTML document comprises at least one target tag and at least one corresponding content;

    (C) using a data structure to store the remaining tags of the pre-processed HTML document;

    (D) grouping the remaining HTML elements with the remaining tags stored in the data structure of the pre-processed HTML document into at least one target group according to the target tag(s), the step (D) further comprises the steps of:

    (D-11) sequentially searching for a first content near the target tag from the rest of the HTML document, and identifying the first content as a first base content;

    (D-12) sequentially searching for next content near the target tag from the first base content, and if there is no next content near the target tag, implementing the step (D-15);

    (D-13) if an interval between the next content of the step (D-12) and the base content is smaller than a predetermined threshold, identifying the next content of the step (D-12) as a current base content, and repeating the step (D-12), otherwise, implementing the step (D-14);

    (D-14) grouping the first content and the current base content(s) into a target group, and identifying the next content as another first base content, implementing the step (D-12); and

    (D-15) grouping the first base content into one of the target groups; and

    (E) identifying the target group(s) most related to a title of the HTML document by comparing correlation(s) between the target group(s) and the title, and establishing a plain text document having the content of the identified target group.
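
The grouping loop of steps (D-11) to (D-15) and the title comparison of step (E) can be illustrated with a minimal Python sketch. It is not the patented implementation. Several details are assumptions made for illustration only: each content is represented as a hypothetical (position, text) pair, the "interval" is taken to be the difference between positions, and the "correlation" with the title is approximated by counting shared words, since the claim does not define either measure. The names group_by_interval, title_overlap, and extract_plain_text are likewise hypothetical.

```python
from typing import List, Tuple

# Assumption: a content is (position in the pre-processed document, text).
Content = Tuple[int, str]


def group_by_interval(contents: List[Content], threshold: int) -> List[List[Content]]:
    """Group consecutive contents whose gap is below `threshold`."""
    groups: List[List[Content]] = []
    current: List[Content] = []
    for item in contents:
        # D-13: keep extending the group while the gap to the current
        # base content (the last item added) is below the threshold.
        if current and item[0] - current[-1][0] < threshold:
            current.append(item)
        else:
            # D-14: close the finished group; this item becomes the
            # first base content of a new group (D-11 on the first pass).
            if current:
                groups.append(current)
            current = [item]
    # D-15: no next content remains, so the open group is emitted.
    if current:
        groups.append(current)
    return groups


def title_overlap(group: List[Content], title: str) -> int:
    """Assumed correlation measure: number of words shared with the title."""
    title_words = set(title.lower().split())
    group_words = set(" ".join(text for _, text in group).lower().split())
    return len(title_words & group_words)


def extract_plain_text(contents: List[Content], title: str, threshold: int) -> str:
    """Step (E): return the text of the group most related to the title."""
    groups = group_by_interval(contents, threshold)
    if not groups:
        return ""
    best = max(groups, key=lambda g: title_overlap(g, title))
    return "\n".join(text for _, text in best)


sample = [
    (0, "Site menu"),
    (1, "Login"),
    (10, "Plain text extraction explained"),
    (11, "The method groups nearby content"),
    (30, "Copyright notice"),
]
result = extract_plain_text(sample, "Plain text extraction", threshold=3)
```

With the sample data, the contents at positions 10 and 11 form one group, and that group shares the most words with the title, so its text becomes the plain text output. A real system would likely use a stronger similarity measure than word overlap; the simple count is used here only to keep the sketch self-contained.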
