EXTRACTING PRINCIPAL CONTENT FROM WEB PAGES
First Claim
1. A method of extracting principal content from Web pages, comprising:
identifying and classifying items on the Web page;
building a list of candidates;
calculating candidate scores;
selecting a top score candidate;
performing clean up processing for the top score candidate; and
performing final page processing for the top score candidate.
Abstract
Extracting principal content from Web pages includes identifying and classifying items on the Web page, building a list of candidates, calculating candidate scores, selecting a top score candidate, performing clean up processing for the top score candidate, and performing final page processing for the top score candidate. Candidate scores may vary according to a number of paragraphs and images grouped according to size. A word length of CJK (Chinese-Japanese-Korean) text may be determined according to punctuation therein. Candidate scores may be modified according to a number of containers and pieces, where a container is a Web page element that is associated with the tags ‘body’, ‘div’, ‘td’, ‘li’, ‘article/section’, and pieces are candidates that do not include other candidates. Candidate scores may be modified according to a number of ratios corresponding to text and link density.
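The scoring and selection steps summarized in the abstract can be illustrated with a minimal sketch. It is not the patented method: the `Candidate` fields, the weights, and the way link density discounts the score are all assumptions chosen for illustration, since the abstract states only that scores may depend on paragraphs, images, and text and link density ratios.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Hypothetical stand-in for a page element under consideration."""
    tag: str          # e.g. "div", "td", "article"
    text_chars: int   # characters of visible text inside the element
    link_chars: int   # characters of that text that sit inside links
    paragraphs: int   # number of paragraph-like children
    images: int       # number of sufficiently large images


def link_density(c: Candidate) -> float:
    """Share of the candidate's text that is link text (0.0 to 1.0)."""
    # An element with no text at all is treated as pure navigation.
    return c.link_chars / c.text_chars if c.text_chars else 1.0


def score(c: Candidate) -> float:
    """Illustrative score: reward paragraphs, images, and text volume,
    then discount by link density. The weights are arbitrary assumptions."""
    base = 10.0 * c.paragraphs + 5.0 * c.images + c.text_chars / 100.0
    return base * (1.0 - link_density(c))


def select_top(candidates: list[Candidate]) -> Candidate:
    """Return the candidate with the highest score."""
    return max(candidates, key=score)
```

Under these assumed weights, a link-heavy navigation block scores close to zero, while a block with many paragraphs and few links dominates, so `select_top` would favor the article body over menus and footers.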
44 Claims
1. A method of extracting principal content from Web pages, comprising:
identifying and classifying items on the Web page;
building a list of candidates;
calculating candidate scores;
selecting a top score candidate;
performing clean up processing for the top score candidate; and
performing final page processing for the top score candidate.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16
17. Computer software, provided in non-transitory computer-readable media, that extracts principal content from Web pages, the software comprising:
executable code that identifies and classifies items on the Web page;
executable code that builds a list of candidates;
executable code that calculates candidate scores;
executable code that selects a top score candidate;
executable code that performs clean up processing for the top score candidate; and
executable code that performs final page processing for the top score candidate.
Dependent claims: 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32
33. A method of identifying separate portions of a principal item provided on multiple Web pages, comprising:
determining a last non-empty segment of the path of a URL containing a portion of the principal item;
determining a similarity measure between the URL and other URLs on the page corresponding to the last non-empty segment of the path;
examining Web link text for words indicating additional portions of the principal item on Web pages corresponding to sufficiently similar URLs; and
determining whether additional portions of the principal item are provided on at least some of the Web pages according to the similarity measure and the presence of words indicating additional portions.
Dependent claims: 34, 35, 36, 37, 38
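The multi-page detection recited in claim 33 can be sketched as follows. This is one possible reading, not the claimed implementation: it assumes the similarity measure compares the last non-empty path segments using `difflib.SequenceMatcher`, and the cue-word set `CONTINUATION_WORDS`, the 0.8 threshold, and all function names are hypothetical, since the claim does not specify them.

```python
from difflib import SequenceMatcher
from urllib.parse import urlparse

# Hypothetical cue words; the claim does not enumerate them.
CONTINUATION_WORDS = {"next", "more", "continue", "page"}


def last_segment(url: str) -> str:
    """Return the last non-empty segment of the URL path, or '' if none."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    return segments[-1] if segments else ""


def similarity(url_a: str, url_b: str) -> float:
    """Similarity in [0, 1] between the last non-empty path segments."""
    return SequenceMatcher(None, last_segment(url_a), last_segment(url_b)).ratio()


def has_continuation_word(link_text: str) -> bool:
    """True if the link text contains a word hinting at a further portion."""
    words = (w.strip(".,:;!?") for w in link_text.lower().split())
    return any(w in CONTINUATION_WORDS for w in words)


def find_continuations(page_url, links, threshold=0.8):
    """Return hrefs that look like further portions of the same item.

    `links` is an iterable of (href, link_text) pairs found on the page.
    A link qualifies only if its URL is sufficiently similar to the page
    URL and its text contains a continuation word.
    """
    return [
        href
        for href, text in links
        if href != page_url
        and similarity(page_url, href) >= threshold
        and has_continuation_word(text)
    ]
```

Requiring both conditions mirrors the final step of the claim, which combines the similarity measure with the presence of indicative words; a similar URL with unrelated link text, or a "Next" link to a dissimilar URL, is rejected.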
39. Computer software, provided in a non-transitory computer-readable medium, that identifies separate portions of a principal item provided on multiple Web pages, the software comprising:
executable code that determines a last non-empty segment of the path of a URL containing a portion of the principal item;
executable code that determines a similarity measure between the URL and other URLs on the page corresponding to the last non-empty segment of the path;
executable code that examines Web link text for words indicating additional portions of the principal item on Web pages corresponding to sufficiently similar URLs; and
executable code that determines whether additional portions of the principal item are provided on at least some of the Web pages according to the similarity measure and the presence of words indicating additional portions.
Dependent claims: 40, 41, 42, 43, 44
Specification