Extracting principal content from web pages

  • US 9,152,730 B2
  • Filed: 07/31/2012
  • Issued: 10/06/2015
  • Est. Priority Date: 11/10/2011
  • Status: Active Grant
First Claim

1. A method of extracting principal content from Web pages, comprising:

    identifying and classifying items on the Web page;

    building a list of candidates;

    calculating candidate scores that vary according to a plurality of weights assigned to paragraphs and images of the Web page, wherein a particular paragraph is provided with a first weight based on a number of lines in the particular paragraph and is provided with a second weight based on a number of words in the particular paragraph, the first weight being independent of the second weight and a particular image is provided with a weight based on a size of the particular image;

    selecting a top score candidate;

    performing clean up processing for the top score candidate; and

    performing final page processing for the top score candidate.
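
The scoring and selection steps of the claim can be illustrated with a minimal sketch. It is not the patented implementation: the class names, the weight formulas, and the constants (the caps of 10 lines and 200 words, and the divisors) are all hypothetical choices made for illustration. It only shows the shape of the claim: each paragraph gets one weight from its line count and a separate weight from its word count, each image gets a weight from its size, and the candidate with the highest total is selected.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Paragraph:
    lines: int
    words: int


@dataclass
class Image:
    width: int
    height: int


@dataclass
class Candidate:
    name: str
    paragraphs: List[Paragraph] = field(default_factory=list)
    images: List[Image] = field(default_factory=list)


def line_weight(p: Paragraph) -> float:
    # First weight: depends only on the number of lines (hypothetical cap of 10).
    return float(min(p.lines, 10))


def word_weight(p: Paragraph) -> float:
    # Second weight: depends only on the number of words, computed
    # independently of line_weight (hypothetical cap and divisor).
    return min(p.words, 200) / 20.0


def image_weight(img: Image) -> float:
    # Image weight: based on image size, here its pixel area (hypothetical scale).
    return (img.width * img.height) / 100_000.0


def score(c: Candidate) -> float:
    # Candidate score: sum of all paragraph weights and image weights.
    text = sum(line_weight(p) + word_weight(p) for p in c.paragraphs)
    pics = sum(image_weight(i) for i in c.images)
    return text + pics


def select_top(candidates: List[Candidate]) -> Candidate:
    # Select the top score candidate.
    return max(candidates, key=score)


article = Candidate("article", [Paragraph(6, 90), Paragraph(4, 60)], [Image(600, 400)])
sidebar = Candidate("sidebar", [Paragraph(1, 5)], [Image(100, 100)])
top = select_top([article, sidebar])
print(top.name)
```

With these made-up inputs, the long, image-bearing block outscores the short sidebar. The clean-up and final page processing steps are not sketched here because the claim does not say what they consist of.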
