Extracting principal content from web pages
Abstract
Extracting principal content from Web pages includes identifying and classifying items on the Web page, building a list of candidates, calculating candidate scores, selecting a top score candidate, performing clean up processing for the top score candidate, and performing final page processing for the top score candidate. Candidate scores may vary according to a number of paragraphs and images grouped according to size. A word length of CJK (Chinese-Japanese-Korean) text may be determined according to punctuation therein. Candidate scores may be modified according to a number of containers and pieces, where a container is a Web page element associated with the tags ‘body’, ‘div’, ‘td’, ‘li’, or ‘article/section’, and pieces are candidates that do not include other candidates. Candidate scores may also be modified according to a number of ratios corresponding to text and link density.
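The text and link density ratios mentioned above can be illustrated with a minimal sketch. The abstract does not define the ratios, so the two formulas used here (link text characters over all text characters, and text characters per opening tag) are assumptions, as are the names `density_ratios` and `_TextStats`.

```python
from html.parser import HTMLParser


class _TextStats(HTMLParser):
    """Counts text characters overall and inside <a> elements (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.total_chars = 0   # all visible text characters
        self.link_chars = 0    # text characters that sit inside a link
        self.tag_count = 0     # number of opening tags seen
        self._link_depth = 0   # > 0 while inside an <a> element

    def handle_starttag(self, tag, attrs):
        self.tag_count += 1
        if tag == "a":
            self._link_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self._link_depth > 0:
            self._link_depth -= 1

    def handle_data(self, data):
        n = len(data.strip())
        self.total_chars += n
        if self._link_depth > 0:
            self.link_chars += n


def density_ratios(html_fragment):
    """Return (link_density, text_density) for one candidate's HTML.

    Both definitions are assumptions made for illustration:
    link_density = link text chars / all text chars
    text_density = all text chars / number of opening tags
    """
    stats = _TextStats()
    stats.feed(html_fragment)
    stats.close()
    link_density = stats.link_chars / stats.total_chars if stats.total_chars else 0.0
    text_density = stats.total_chars / max(stats.tag_count, 1)
    return link_density, text_density
```

Under these assumed definitions, a navigation list made up entirely of links has a link density of 1.0, while a block of article text with few links has a much lower value, so a score modifier could penalize candidates whose link density is high.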
30 Claims
1. A method of extracting principal content from Web pages, comprising:
identifying and classifying items on the Web page;
building a list of candidates;
calculating candidate scores that vary according to a plurality of weights assigned to paragraphs and images of the Web page, wherein a particular paragraph is provided with a first weight based on a number of lines in the particular paragraph and is provided with a second weight based on a number of words in the particular paragraph, the first weight being independent of the second weight and a particular image is provided with a weight based on a size of the particular image;
selecting a top score candidate;
performing clean up processing for the top score candidate; and
performing final page processing for the top score candidate.
Dependent claims: 2-15.
16. A non-transitory computer-readable medium containing software that extracts principal content from Web pages, the software comprising:
executable code that identifies and classifies items on the Web page;
executable code that builds a list of candidates;
executable code that calculates candidate scores that vary according to a plurality of weights assigned to paragraphs and images of the Web page, wherein a particular paragraph is provided with a first weight based on a number of lines in the particular paragraph and is provided with a second weight based on a number of words in the particular paragraph, the first weight being independent of the second weight and a particular image is provided with a weight based on a size of the particular image;
executable code that selects a top score candidate;
executable code that performs clean up processing for the top score candidate; and
executable code that performs final page processing for the top score candidate.
Dependent claims: 17-30.
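The scoring and selection steps recited in claims 1 and 16 can be sketched roughly as follows. The claims state only that each paragraph receives one weight based on its number of lines and a second, independent weight based on its number of words, and that each image receives a weight based on its size. The thresholds, the weight values, the additive combination, and all names below (`Paragraph`, `Image`, `Candidate`, `line_weight`, `word_weight`, `image_weight`, `score`, `select_top`) are assumptions chosen for illustration, not values taken from the patent.

```python
from dataclasses import dataclass, field


@dataclass
class Paragraph:
    lines: int   # number of rendered lines in the paragraph
    words: int   # number of words in the paragraph


@dataclass
class Image:
    width: int
    height: int


@dataclass
class Candidate:
    name: str
    paragraphs: list = field(default_factory=list)
    images: list = field(default_factory=list)


def line_weight(lines):
    """First paragraph weight, based only on line count (assumed thresholds)."""
    if lines >= 5:
        return 3.0
    if lines >= 2:
        return 1.5
    return 0.5


def word_weight(words):
    """Second paragraph weight, based only on word count (assumed scale and cap)."""
    return min(words / 50.0, 4.0)


def image_weight(image):
    """Image weight based on pixel area (assumed size groups)."""
    area = image.width * image.height
    if area >= 200_000:
        return 2.0
    if area >= 40_000:
        return 1.0
    return 0.0


def score(candidate):
    """Sum the independent paragraph weights and the image weights (assumed combination)."""
    paragraph_part = sum(
        line_weight(p.lines) + word_weight(p.words) for p in candidate.paragraphs
    )
    image_part = sum(image_weight(im) for im in candidate.images)
    return paragraph_part + image_part


def select_top(candidates):
    """Return the candidate with the highest score."""
    return max(candidates, key=score)
```

With these assumed values, a candidate holding two multi-line paragraphs and one large image outscores a sidebar holding a single short line and an icon, so `select_top` returns the former. Because the line weight and the word weight are computed by separate functions, either can be tuned without affecting the other.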
Specification