Method of establishing a plain text document from a HTML document

US 8,392,820 B2
Filed: 12/01/2009
Issued: 03/05/2013
Est. Priority Date: 12/01/2008
Status: Active Grant

Abstract

The present invention provides a method of establishing a plain text document from a HTML document. The method includes the steps of (A) acquiring a HTML document defined by HTML elements, each composed of tags and content between the tags; (B) pre-processing the HTML document by omitting some of the tags (including the content between those tags), whereby the rest of the HTML document comprises at least one target tag (including content between the target tags); (C) using a data structure to store the remaining tags of the pre-processed HTML document; (D) grouping the remaining tags (including the content between the remaining tags) stored in the data structure of the pre-processed HTML document into at least one target group according to the target tag(s); and (E) identifying the target group(s) most related to a title of the HTML document by comparing correlation(s) between the target group(s) and the title, and establishing a plain text document having the content of the identified target group.
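Steps (A) to (C) can be illustrated with a minimal sketch built on Python's standard `html.parser`. It is not the patented implementation. The names `Record`, `Preprocessor`, and `preprocess` are hypothetical, the choice of omitted tags and target tags is an assumption borrowed from the dependent claims, and attributing each piece of content to the most recently opened retained tag is a simplification.

```python
from dataclasses import dataclass
from html.parser import HTMLParser

# Assumptions for illustration: which tags are dropped and which are targets.
DROP_WITH_CONTENT = {"script", "style"}   # tags whose content is discarded too
DROP_TAG_ONLY = {"a", "span", "img"}      # tags ignored, their content kept
TARGET_TAGS = {"p", "br"}


@dataclass
class Record:
    index: int       # ordinal of the retained tag that introduced this content
    tag: str         # name of that tag
    text: str        # content with whitespace normalised
    is_target: bool  # True when the tag is one of TARGET_TAGS


class Preprocessor(HTMLParser):
    """Collects the title and a flat list of Records (steps A to C)."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.records = []
        self._skip = 0           # nesting depth inside dropped elements
        self._in_title = False
        self._tag = ""           # most recently opened retained tag
        self._index = -1         # ordinal of that tag

    def handle_starttag(self, tag, attrs):
        if tag in DROP_WITH_CONTENT:
            self._skip += 1
        elif tag == "title":
            self._in_title = True
        elif tag not in DROP_TAG_ONLY:
            self._index += 1
            self._tag = tag

    def handle_endtag(self, tag):
        if tag in DROP_WITH_CONTENT and self._skip:
            self._skip -= 1
        elif tag == "title":
            self._in_title = False

    def handle_data(self, data):
        text = " ".join(data.split())
        if not text or self._skip:
            return                      # blank runs and dropped content
        if self._in_title:
            self.title += text
        elif self.records and self.records[-1].index == self._index:
            self.records[-1].text += " " + text   # same element, e.g. around <a>
        elif self._index >= 0:
            self.records.append(
                Record(self._index, self._tag, text, self._tag in TARGET_TAGS))


def preprocess(html):
    """Return (title, records) for an HTML string."""
    parser = Preprocessor()
    parser.feed(html)
    parser.close()
    return parser.title, parser.records
```

HTML comments are dropped without extra code because `HTMLParser.handle_comment` does nothing unless overridden. The stored `index` values give later steps a way to measure the interval between two pieces of content.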

15 Claims

1. A method of establishing a plain text document from a HTML document, comprising the steps of:
- (A) acquiring a HTML document defined by HTML elements, each HTML element composed of tags and content between the tags;
  
  (B) pre-processing the HTML document by omitting some of the HTML elements, whereby the rest of the HTML document comprises at least one target tag and at least one corresponding content;
  
  (C) using a data structure to store the remaining tags of the pre-processed HTML document;
  
  (D) grouping the remaining HTML elements with the remaining tags stored in the data structure of the pre-processed HTML document into at least one target group according to the target tag(s), the step (D) further comprises the steps of;
  
  (D-11) sequentially searching for a first content near the target tag from the rest of the HTML document, and identifying the first content as a first base content;
  
  (D-12) sequentially searching for next content near the target tag from the first base content, and if there is no next content near the target tag, implementing the step (D-15);
  
  (D-13) if an interval between the next content of the step (D-12) and the base content is smaller than a predetermined threshold, identifying the next content of the step (D-12) as a current base content, and repeating the step (D-12), otherwise, implementing the step (D-14);
  
  (D-14) grouping the first content and the current base content(s) into a target group, and identifying the next content as another first base content, implementing the step (D-12); and
  
  (D-15) grouping the first base content into one of the target groups; and
  
  (E) identifying the target group(s) most related to a title of the HTML document by comparing correlation(s) between the target group(s) and the title, and establishing a plain text document having the content of the identified target group.
  - 2. The method of establishing a plain text document from a HTML document as claimed in claim 1, wherein blank lines, tags <!-- -->, <script>, </script>, <style>, </style>, <a>, </a>, <span>, </span> and <img>, and the content in the tag <!-- -->, the content between the tags <script> and </script>, and the content between the tags <style> and </style> are omitted in the pre-processing step while the content between tags <body> and </body> is retained in the pre-processed HTML document.
  - 3. The method of establishing a plain text document from a HTML document as claimed in claim 1, wherein the data structure further stores an information about the rest of the HTML document of the pre-processed HTML document.
  - 4. The method of establishing a plain text document from a HTML document as claimed in claim 3, wherein the information comprises indices of the HTML elements, lengths of the contents of the HTML elements and indications of the target tags.
  - 5. The method of establishing a plain text document from a HTML document as claimed in claim 3 further comprises the step of identifying the title of the HTML document using the information stored in the data structure, and deleting the information about the HTML element containing the title and the information about the HTML element(s) ahead of the HTML element containing the title from the data structure before the step (D).
  - 6. The method of establishing a plain text document from a HTML document as claimed in claim 1, wherein the predetermined threshold is 1 to 5.
  - 7. The method of establishing a plain text document from a HTML document as claimed in claim 1, wherein the step (D) further comprises the step of:
    - (D-21) identifying the contents near all the target tags and grouping the contents into different target groups according to the target tags.
  - 8. The method of establishing a plain text document from a HTML document as claimed in claim 7, wherein each of the target groups are further grouped into sub-groups according to a predetermined threshold of intervals between the target tags.
  - 9. The method of establishing a plain text document from a HTML document as claimed in claim 8, wherein the predetermined threshold is from 1 to 10.
  - 10. The method of establishing a plain text document from a HTML document as claimed in claim 1, wherein the contents of the HTML elements identified by a specific searching method are used as the content of the target group(s) if there is no target tag in the pre-processed HTML document.
  - 11. The method of establishing a plain text document from a HTML document as claimed in claim 10, wherein the searching method comprises the steps of:
    - identifying the HTML element having the longest content in the pre-processed HTML document;
      
      shifting a current HTML element to a candidate HTML element having the contents with lengths longer than a first predetermined threshold and having intervals with the longest HTML element smaller than a second predetermined threshold;
      
      repeating the shifting step for the HTML elements ahead of the longest HTML elements until there is no candidate HTML element, and identifying the final current HTML element as a base content, and repeating the shifting step for the HTML elements behind the longest HTML elements until there is no candidate HTML element, and identifying the final current HTML element as an ending HTML element; and
      
      using the contents of the starting and ending HTML elements, and those of the HTML elements between the starting and ending HTML elements as the content of the target group.
  - 12. The method of establishing a plain text document from a HTML document as claimed in claim 11, wherein the second threshold ranges from 1 to 10.
  - 13. The method of establishing a plain text document from a HTML document as claimed in claim 1, wherein the target tags comprise tags <p> and <br>.
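The grouping loop recited in steps (D-11) to (D-15) of claim 1 can be read as a single pass over the positions of the content found near target tags. The sketch below is one possible reading, not the patented implementation. The function name `group_by_interval` is hypothetical, and measuring the "interval" as the difference between element indices is an assumption.

```python
def group_by_interval(indices, threshold=5):
    """Split ascending element indices into groups of nearby content.

    indices   -- positions of content near target tags, in document order
    threshold -- assumed interval limit; claim 6 recites a range of 1 to 5
    """
    groups = []
    if not indices:
        return groups

    current = [indices[0]]                 # (D-11) first base content
    for nxt in indices[1:]:                # (D-12) look for the next content
        if nxt - current[-1] < threshold:  # (D-13) close enough: new base content
            current.append(nxt)
        else:                              # (D-14) close the group, start another
            groups.append(current)
            current = [nxt]
    groups.append(current)                 # (D-15) no next content: flush the group
    return groups
```

With `threshold=5`, calling `group_by_interval([4, 5, 7, 20, 22, 40])` splits the positions wherever two neighbours are five or more indices apart. Because the comparison is strictly "smaller than", a threshold of 1 would leave every content item in its own group under this index-difference reading.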

14. A method of establishing a plain text document from a HTML document, comprising the steps of:
- (A) acquiring a HTML document defined by HTML elements, each composed of tags and content between the tags;
  
  (B) pre-processing the HTML document by omitting some of the HTML elements, whereby the rest of the HTML document comprises at least one target tag and at least one corresponding content;
  
  (C) using a data structure to store the remaining tags of the pre-processed HTML document;
  
  (D) grouping the remaining HTML elements with the remaining tags stored in the data structure of the pre-processed HTML document into at least one target group according to the target tag(s);
  
  (E) identifying the target group(s) most related to a title of the HTML document by comparing correlation(s) between the target group(s) and the title, and establishing a plain text document having the content of the identified target group, wherein the target group(s) most related to the title of the HTML document is identified by the steps;
  
  (E-1) if there is no sub-group in the target group(s), identifying the target group most related to the title of the HTML document by comparing correlation(s) between the target group(s) and the title;
  
  (E-2) calculating similarities of the target groups not be identified in the step (E-1) to the most title-related target group based on a vector space model to identify the target groups having the similarities higher than a predetermined threshold, and establishing the plain text document having the content of the identified target groups;
  
  (E-3) if there is (are) sub-group(s) in the target group(s), identifying the sub-group most related to the title of the HTML document by comparing correlation(s) between the sub-groups and the title;
  
  (E-4) if there is only one sub-group, establishing the plain text document having the content of the identified sub-group; and
  
  (E-5) if there are more than one sub-groups, calculating similarities of the other sub-groups to the most title-related sub-group based on a vector space model to identify the sub-groups having the similarities higher than a predetermined threshold, and establishing the plain text document having the content of the identified sub-groups.
  - 15. The method of establishing a plain text document from a HTML document as claimed in claim 14, wherein the predetermined threshold is 0.6.
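Steps (E-1) and (E-2) of claim 14, with the 0.6 threshold of claim 15, can be sketched with a bag-of-words vector space model and cosine similarity. This is an illustration under assumptions, not the patented implementation. The claims do not say how the correlation with the title is computed, so cosine similarity is used for that step as well, and the names `vectorize`, `cosine`, and `select_groups` are hypothetical.

```python
import math
import re
from collections import Counter


def vectorize(text):
    """Term-frequency vector over lowercase word tokens."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a, b):
    """Cosine similarity of two term-frequency vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def select_groups(title, groups, threshold=0.6):
    """Return indices of the groups to keep in the plain text document.

    (E-1) pick the group most correlated with the title (assumed: cosine);
    (E-2) also keep every other group whose similarity to that group
          is higher than the threshold.
    """
    if not groups:
        return []
    vectors = [vectorize(g) for g in groups]
    title_vec = vectorize(title)
    best = max(range(len(groups)), key=lambda i: cosine(title_vec, vectors[i]))
    return [i for i in range(len(groups))
            if i == best or cosine(vectors[best], vectors[i]) > threshold]
```

The same function could be applied to sub-groups for steps (E-3) to (E-5), since those steps repeat the comparison at sub-group level.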

Current Assignee: esobi, Inc.
Original Assignee: esobi, Inc.
Inventors: Tsai, Hong-Yang; Hung, Chi-Hau
Primary Examiner(s): Thai, Xuan
Assistant Examiner(s): Hillery, Nathan
Application Number: US12/628,513
Publication Number: US 20100146381A1
Time in Patent Office: 1,190 Days
Field of Search: 715/229, 715/234, 715/239
US Class Current: 715/229
CPC Class Codes:
G06F 16/84 Mapping; Conversion
G06F 40/143 Markup, e.g. Standard Gener...
