System and Method for Web Content Extraction

US 20120303636A1
Filed: 12/14/2009
Published: 11/29/2012
Est. Priority Date: 12/14/2009
Status: Active Grant

Abstract

A method and system for extracting Web content is disclosed. In one embodiment, Web content in a Webpage is extracted by identifying paragraphs in the Web content based on line-break node determination. A range of text-body associated with the identified paragraphs is then identified using a maximum scoring subsequence. Further, the identified text-body is refined using a heuristic rule of substantially horizontal alignment. Furthermore, one or more titles and one or more images associated with the Web content are extracted. Moreover, the Web content, including the identified paragraphs, the one or more titles and the one or more images, is output.
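
The text-body step in the abstract relies on a maximum scoring subsequence. The following is a minimal Python sketch of that idea, assuming each identified paragraph has already been given a numeric score (positive for body-like text, negative for navigation-like text). The function name, the scoring convention and the sample scores are illustrative assumptions, not taken from the patent.

```python
from typing import List, Tuple


def max_scoring_subsequence(scores: List[float]) -> Tuple[int, int, float]:
    """Return (start, end, total) for the contiguous run with the highest sum.

    `end` is exclusive. This is the classic linear-time scan; the patent
    text here does not specify how its MSS is computed.
    """
    if not scores:
        raise ValueError("scores must be non-empty")
    best_start, best_end, best_total = 0, 1, scores[0]
    run_start, run_total = 0, 0.0
    for i, score in enumerate(scores):
        if run_total <= 0:
            # A non-positive running sum cannot help; start a new run here.
            run_start, run_total = i, score
        else:
            run_total += score
        if run_total > best_total:
            best_start, best_end, best_total = run_start, i + 1, run_total
    return best_start, best_end, best_total


# Hypothetical per-paragraph scores (assumption for illustration only).
paragraph_scores = [-2.0, -1.0, 3.0, 4.0, -1.0, 5.0, -3.0, -2.0]
start, end, total = max_scoring_subsequence(paragraph_scores)
print(start, end, total)
```

Paragraphs with indices from `start` up to but not including `end` would be treated as the candidate text-body range; a weakly negative paragraph inside the range is kept when the surrounding paragraphs outweigh it.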

24 Citations


15 Claims

1. A computer implemented method for Web content extraction, comprising:
- extracting Web content in a Webpage by identifying paragraphs, one or more titles and one or more images in the Web content based on line-break node determination; and
  
  outputting the Web content including the identified paragraphs, the one or more titles, and the one or more images.
- Dependent claims (2, 3, 4, 5, 6, 7, 8, 9, 10):
  - 2. The computer implemented method of claim 1, wherein extracting the Web content in the Webpage by identifying the paragraphs, the one or more titles and one or more images comprises:
    - generating a sequence of leaf nodes in an order of their presence in the Web content in the Webpage;
      
      determining line-break nodes in the generated sequence of leaf nodes;
      
      grouping the sequence of leaf nodes into multiple subsets of leaf nodes including at least one of text leaf nodes and non-text leaf nodes based on the determined line-break nodes;
      
      determining whether the multiple subsets of leaf nodes include one or more subsets of non-text leaf nodes; and
      
      if so, removing the one or more subsets of non-text leaf nodes from the multiple subsets of leaf nodes.
  - 3. The computer implemented method of claim 2, wherein outputting the Web content including the identified paragraphs, the one or more titles and the one or more images comprises:
    - outputting the Web content using text associated with remaining multiple subsets of leaf nodes including the text leaf nodes.
  - 4. The computer implemented method of claim 2, wherein, in generating the sequence of leaf nodes, the Web content comprises formatted content in a markup language, and wherein the markup language is selected from the group consisting of hypertext markup language (HTML), wireless markup language (WML) and extensible markup language (XML).
  - 5. The computer implemented method of claim 3, wherein grouping the sequence of leaf nodes into multiple subsets of leaf nodes comprises:
    - grouping the sequence of leaf nodes into the multiple subsets of leaf nodes based on leaf nodes that are substantially successive to a nearest line-break node.
  - 6. The computer implemented method of claim 5, wherein outputting the Web content using text associated with the remaining multiple subsets of leaf nodes including the text leaf nodes comprises:
    - determining a subsequence of grouped text leaf nodes from the remaining multiple subset of leaf nodes including the text leaf nodes; and
      
      outputting the Web content using text associated with the subsequence of grouped text leaf nodes.
  - 7. The computer implemented method of claim 6, wherein determining the subsequence of grouped text leaf nodes from the remaining multiple subsets of leaf nodes comprises:
    - determining the subsequence of grouped text leaf nodes using visual features of a text format in the remaining multiple subsets of leaf nodes.
  - 8. The computer implemented method of claim 7, wherein, in determining the subsequence of grouped text leaf nodes, the visual features of the text format in the remaining multiple subsets of leaf nodes are selected from the group consisting of font size, font color, and whether the text is in a link.
  - 9. The computer implemented method of claim 7, wherein determining the subsequence of grouped text leaf nodes using the visual features of the text format in the remaining multiple subsets of leaf nodes, comprises:
    - determining a range of text-body associated with each of the remaining multiple subsets of leaf nodes using a maximum scoring subsequence (MSS) that is based on the visual features for each of the remaining multiple subsets of leaf nodes; and
      
      forming the subsequence of grouped text leaf nodes by filtering out the remaining multiple subsets of leaf nodes based on a heuristic rule of substantially horizontal alignment obtained from the determined MSS.
  - 10. The computer implemented method of claim 2, wherein generating the sequence of leaf nodes in the order of their presence in the Web content in the Webpage comprises:
    - obtaining a document object model (DOM) tree associated with the Web content by parsing code in the markup language of the Webpage; and
      
      generating the sequence of leaf nodes in the order of their presence in the Web content using the obtained DOM tree.
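
The DOM-based step recited in claim 10 can be sketched as follows. This is a minimal illustration, assuming well-formed XHTML input so that Python's standard `xml.dom.minidom` can build the DOM tree; the `leaf_sequence` helper, the `(kind, value)` tuple layout and the sample page are assumptions made for this example and do not come from the patent.

```python
from typing import List, Tuple
from xml.dom import minidom


def leaf_sequence(node) -> List[Tuple[str, str]]:
    """Collect leaves of a DOM subtree in document order (depth-first).

    A leaf is either a non-blank text node or an element with no children
    (for example <br/> or <img/>). Comments and other node types are skipped.
    """
    leaves: List[Tuple[str, str]] = []
    for child in node.childNodes:
        if child.nodeType == child.TEXT_NODE:
            text = child.data.strip()
            if text:
                leaves.append(("text", text))
        elif child.nodeType == child.ELEMENT_NODE:
            if child.hasChildNodes():
                leaves.extend(leaf_sequence(child))
            else:
                leaves.append(("element", child.tagName))
    return leaves


# Hypothetical well-formed page used only for illustration.
page = (
    "<html><body><h1>Title</h1>"
    "<p>First <b>bold</b> text.<br/>Next line.</p>"
    "<img src='a.png'/></body></html>"
)
dom = minidom.parseString(page)
leaves = leaf_sequence(dom.documentElement)
for kind, value in leaves:
    print(kind, value)
```

Real-world HTML is often not well-formed, so a production parser would likely need a more tolerant DOM builder; the traversal itself would stay the same.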

11. A non-transitory computer-readable storage medium having instructions that, when executed by a computing device, causes the computing device to perform a method of Web content extraction, comprising:
- extracting Web content in a Webpage by identifying paragraphs, one or more titles and one or more images in the Web content based on line-break node determination; and
  
  outputting the Web content including the identified paragraphs, the one or more titles and the one or more images.
- Dependent claims (12):
  - 12. The non-transitory computer-readable storage medium of claim 11, wherein extracting the Web content in the Webpage by identifying the paragraphs, the one or more titles and the one or more images comprises:
    - generating a sequence of leaf nodes in an order of their presence in the Web content in the Webpage;
      
      determining line-break nodes in the generated sequence of leaf nodes;
      
      grouping the sequence of leaf nodes into multiple subsets of leaf nodes including at least one of text leaf nodes and non-text leaf nodes based on the determined line-break nodes;
      
      determining whether the multiple subsets of leaf nodes include one or more subsets of non-text leaf nodes; and
      
      if so, removing the one or more subsets of non-text leaf nodes from the multiple subsets of leaf nodes.
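
The grouping and removal steps recited above can be sketched as follows. This is a minimal illustration, assuming the leaf sequence has already been generated and each leaf has already been classified as text, image or line-break; the `Leaf` tuple model, the helper names and the sample data are assumptions for this example, and the patent text here does not say which nodes count as line-break nodes.

```python
from typing import List, Tuple

# Hypothetical leaf model: (kind, value), where kind is "text", "image" or "break".
Leaf = Tuple[str, str]


def group_by_line_breaks(leaves: List[Leaf]) -> List[List[Leaf]]:
    """Split a leaf sequence into subsets, starting a new subset at each break."""
    groups: List[List[Leaf]] = []
    current: List[Leaf] = []
    for leaf in leaves:
        if leaf[0] == "break":
            if current:
                groups.append(current)
            current = []
        else:
            current.append(leaf)
    if current:
        groups.append(current)
    return groups


def drop_non_text_groups(groups: List[List[Leaf]]) -> List[List[Leaf]]:
    """Keep only subsets that contain at least one text leaf."""
    return [g for g in groups if any(kind == "text" for kind, _ in g)]


# Sample sequence (illustrative only).
sample = [
    ("text", "Home"),
    ("break", "br"),
    ("image", "banner.png"),
    ("break", "div"),
    ("text", "First sentence."),
    ("text", "Second sentence."),
    ("break", "p"),
    ("image", "fig1.png"),
    ("text", "Caption"),
]
groups = group_by_line_breaks(sample)
paragraphs = drop_non_text_groups(groups)
print(len(groups), len(paragraphs))
```

In this sample the subset holding only `banner.png` is removed, while the subset that mixes an image with caption text is kept because it contains a text leaf.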

13. A system for Web content extraction, comprising:
- a printer;
  
  a processor coupled to the printer; and
  
  memory coupled to the processor, wherein the memory includes a Web content extraction module, and wherein the Web content extraction module includes a paragraph identification module for extracting Web content in a Webpage by identifying paragraphs, one or more titles and one or more images in the Web content based on line-break node determination, and wherein the Web content extraction module includes an output module for outputting the extracted Web content to the printer.
- Dependent claims (14, 15):
  - 14. The system of claim 13, wherein the paragraph identification module generates a sequence of leaf nodes in an order of their presence in the Web content in the Webpage, and wherein the paragraph identification module determines line-break nodes in the generated sequence of leaf nodes, and wherein the paragraph identification module groups the sequence of leaf nodes into multiple subsets of leaf nodes including at least one of text leaf nodes and non-text leaf nodes based on the determined line-break nodes, and wherein the paragraph identification module determines whether the multiple subsets of leaf nodes include one or more subsets of non-text leaf nodes, and wherein the paragraph identification module removes the one or more subsets of non-text leaf nodes from the multiple subsets of leaf nodes.
  - 15. The system of claim 14, wherein the output module outputs the Web content using text associated with remaining multiple subsets of leaf nodes including the text leaf nodes, wherein the Web content comprises formatted content in a markup language, and wherein the markup language is selected from the group consisting of hypertext markup language (HTML), wireless markup language (WML) and extensible markup language (XML).

Current Assignee: Hewlett Packard Enterprise Development LP (Hewlett-Packard Enterprise Company)
Original Assignee: Hewlett-Packard Development Company, L.P. (HP Inc.)
Inventors: Luo, Ping; Fan, Jian; Liu, Samson J.; Xiong, Yuhong; Liu, Jerry J.

Granted Patent: US 8,819,028 B2
US Class Current: 707/748
CPC Class Codes:
G06F 16/986 Document structures and sto...
G06F 3/1246 by handling markup language...
