System and Method for Web Content Extraction
Abstract
A method and system for extracting Web content are disclosed. In one embodiment, Web content in a Webpage is extracted by identifying paragraphs in the Web content based on line-break node determination. The range of the text body associated with the identified paragraphs is then identified using a maximum scoring subsequence, and the identified text body is refined using a heuristic rule of substantially horizontal alignment. One or more titles and one or more images associated with the Web content are also extracted. Finally, the Web content, including the identified paragraphs, the one or more titles, and the one or more images, is output.
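The maximum scoring subsequence step can be illustrated with a short sketch. It assumes that each identified paragraph has already been given a numeric score, positive for paragraphs that look like body text and negative for ones that look like navigation or advertising. The abstract does not say how scores are assigned, so the function name `max_scoring_subsequence` and the sample scores are hypothetical. The sketch uses the standard linear-time maximum-sum contiguous run scan, which may differ from the patented procedure.

```python
from typing import List, Tuple


def max_scoring_subsequence(scores: List[float]) -> Tuple[int, int, float]:
    """Return (start, end, total) for the contiguous run with the highest sum.

    `end` is exclusive. If no run has a positive total (for example, the
    list is empty or every score is negative), (0, 0, 0.0) is returned.
    """
    best_start, best_end, best_total = 0, 0, 0.0
    run_start, run_total = 0, 0.0
    for i, score in enumerate(scores):
        if run_total <= 0:
            # A non-positive running total cannot help; restart the run here.
            run_start, run_total = i, score
        else:
            run_total += score
        if run_total > best_total:
            best_start, best_end, best_total = run_start, i + 1, run_total
    return best_start, best_end, best_total


# Hypothetical per-paragraph scores (assumption: not taken from the patent).
paragraph_scores = [-2.0, -1.0, 3.0, 4.0, -1.0, 5.0, -3.0, -2.0]
start, end, total = max_scoring_subsequence(paragraph_scores)
print(start, end, total)  # paragraphs 2..5 form the candidate text body
```

With these sample scores, the low-scoring paragraph at index 4 is kept inside the selected range because the paragraphs around it score highly enough to outweigh it, which is the behaviour one would want for a short caption or pull quote inside an article body.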
24 Citations
15 Claims
1. A computer implemented method for Web content extraction, comprising:
- extracting Web content in a Webpage by identifying paragraphs, one or more titles and one or more images in the Web content based on line-break node determination; and
- outputting the Web content including the identified paragraphs, the one or more titles, and the one or more images.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10
11. A non-transitory computer-readable storage medium having instructions that, when executed by a computing device, causes the computing device to perform a method of Web content extraction, comprising:
- extracting Web content in a Webpage by identifying paragraphs, one or more titles and one or more images in the Web content based on line-break node determination; and
- outputting the Web content including the identified paragraphs, the one or more titles and the one or more images.
Dependent claims: 12
13. A system for Web content extraction, comprising:
- a printer;
- a processor coupled to the printer; and
- memory coupled to the processor, wherein the memory includes a Web content extraction module, and wherein the Web content extraction module includes a paragraph identification module for extracting Web content in a Webpage by identifying paragraphs, one or more titles and one or more images in the Web content based on line-break node determination, and wherein the Web content extraction module includes an output module for outputting the extracted Web content to the printer.
Dependent claims: 14, 15
Specification