Extracting principal content from web pages

US 9,152,730 B2
Filed: 07/31/2012
Issued: 10/06/2015
Est. Priority Date: 11/10/2011
Status: Active Grant


Abstract

Extracting principal content from Web pages includes identifying and classifying items on the Web page, building a list of candidates, calculating candidate scores, selecting a top score candidate, performing clean up processing for the top score candidate, and performing final page processing for the top score candidate. Candidate scores may vary according to a number of paragraphs and images grouped according to size. A word length of CJK (Chinese-Japanese-Korean) text may be determined according to punctuation therein. Candidate scores may be modified according to a number of containers and pieces and wherein a container is a Web page element that is associated with tags ‘body’, ‘div’, ‘td’, ‘li’, ‘article/section’ and pieces are candidates that do not include other candidates. Candidate scores may be modified according to a number of ratios corresponding to text and link density.
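
The flow summarized in the abstract can be sketched as a small pipeline. This is an illustrative sketch only, not the patented implementation: `Candidate`, `select_top_candidate`, `extract_principal_content`, and the `score`, `clean_up`, and `finalize` callables are hypothetical names introduced here, and the sketch assumes that identifying and classifying items has already produced the candidate list.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class Candidate:
    """Hypothetical stand-in for a page element that may hold the principal content."""
    name: str
    text: str


def select_top_candidate(
    candidates: Iterable[Candidate],
    score: Callable[[Candidate], float],
) -> Optional[Candidate]:
    """Return the highest-scoring candidate, or None if there are no candidates."""
    return max(candidates, key=score, default=None)


def extract_principal_content(
    candidates: Iterable[Candidate],
    score: Callable[[Candidate], float],
    clean_up: Callable[[Candidate], Candidate],
    finalize: Callable[[Candidate], str],
) -> Optional[str]:
    """Select the top candidate, then apply clean up and final page processing."""
    top = select_top_candidate(candidates, score)
    if top is None:
        return None
    return finalize(clean_up(top))
```

The scoring, clean up, and final processing steps are passed in as functions because the abstract names them without fixing how they work; any real scoring function would replace the placeholder supplied by the caller.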

30 Claims

1. A method of extracting principal content from Web pages, comprising:
- identifying and classifying items on the Web page;
- building a list of candidates;
- calculating candidate scores that vary according to a plurality of weights assigned to paragraphs and images of the Web page, wherein a particular paragraph is provided with a first weight based on a number of lines in the particular paragraph and is provided with a second weight based on a number of words in the particular paragraph, the first weight being independent of the second weight and a particular image is provided with a weight based on a size of the particular image;
- selecting a top score candidate;
- performing clean up processing for the top score candidate; and
- performing final page processing for the top score candidate.
- Dependent claims (2 to 15):
  - 2. A method, according to claim 1, where a word length of CJK (Chinese-Japanese-Korean) text is determined according to punctuation therein.
  - 3. A method, according to claim 1, wherein calculating candidate scores includes determining an Initial Score using the formula:
    - $$\text{Initial Score} = 1000 \times \bigl(1.5 \times (N_{\text{5-line-paragraphs}} + N_{\text{80-word-paragraphs}}) + N_{\text{3-line-paragraphs}} + N_{\text{50-word-paragraphs}} + 3 \times N_{\text{large-images}} - 0.5 \times (N_{\text{small-images}} + N_{\text{skipped-images}})\bigr)$$

      where $N_{\text{5-line-paragraphs}}$ is a number of five line paragraphs in the candidate, $N_{\text{80-word-paragraphs}}$ is a number of 80 word paragraphs in the candidate, $N_{\text{3-line-paragraphs}}$ is a number of 3 line paragraphs in the candidate, $N_{\text{50-word-paragraphs}}$ is a number of 50 word paragraphs in the candidate, $N_{\text{large-images}}$ is a number of images that contain at least 50K pixels or have a width of at least 350 pixels and a height of at least 75 pixels, $N_{\text{skipped-images}}$ is a number of images that have a size not exceeding 5×5 or are present on Web page as references to at least one of: blacklisted and quarantined sites, and all other images count toward $N_{\text{small-images}}$.
  - 4. A method, according to claim 1, wherein candidate scores are modified according to a number of containers and pieces and wherein a container is a Web page element that is associated with tags ‘body’, ‘div’, ‘td’, ‘li’, ‘article/section’ and pieces are candidates that do not include other candidates.
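  - 5. A method, according to claim 4, wherein candidate scores are modified using a formula:
    - $$\text{Modified Initial Score} = \text{Initial Score} \times \left(\tfrac{1}{3} + \tfrac{2}{3} \times \tfrac{1}{3} \times \left(\frac{1}{N_{\text{pieces}}} + \frac{1}{N_{\text{candidates}}} + \frac{1}{N_{\text{containers}}}\right)\right).$$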
  - 6. A method, according to claim 4, wherein candidate scores are modified according to a number of ratios corresponding to text and link density.
  - 7. A method, according to claim 6, wherein the ratios include a ratio of a length of regular text for the candidate to a total length of regular text, a number of words in regular text of the candidate to a total number of words in regular text, a length of regular text above the candidate to a total length of regular text, and a number of words in regular text above the candidate to a total number of words in regular text.
  - 8. A method, according to claim 7, wherein candidate scores are modified using a formula:
    - $$\text{Modified Initial Score}_{n+1} = \text{Modified Initial Score}_{n} \times \frac{\text{Percentage}_{n} \times (\text{ratio}^{\text{degree}})_{n} + (100 - \text{Percentage}_{n})}{100}$$

      wherein $\text{Percentage}_{n}$ and $\text{ratio}^{\text{degree}}$ are predetermined values that are empirically determined for each of n ratios.
  - 9. A method, according to claim 1, further comprising:
    - following selecting the top score candidate, determining if the top score candidate meets predetermined criteria; and
    - if the top score candidate does not meet predetermined criteria, determining if a different top score candidate should be selected.
  - 10. A method, according to claim 9, wherein a first set of formulas is used to determine the top score candidate and a second, different, set of formulas is used to determine if a different top score candidate should be selected.
  - 11. A method, according to claim 10, wherein a different top score candidate is not used if using the second set of formulas results in the same top score candidate as using the first set of formulas.
  - 12. A method, according to claim 9, wherein the predetermined criteria is selected from the group consisting of:
    - whether the top score candidate has less than 25 embedded containers, wherein a container is a Web page element that is associated with tags ‘body’, ‘div’, ‘td’, ‘article/section’, whether the top score candidate has no embedded other candidates and whether the top score candidate has no more than three embedded candidates that have other embedded candidates.
  - 13. A method, according to claim 1, wherein performing clean up processing includes removing floating elements, link boxes, navigation panels, and videos from unknown sources.
  - 14. A method, according to claim 1, wherein performing clean up processing includes reformatting the top scoring candidate by rewriting the text thereof as a new HTML page using only feasible HTML tags and attributes and ignoring stylistic deficiencies and unnecessary elements in the original page format.
  - 15. A method, according to claim 1, further comprising:
    - a user providing an indication of portions of a displayed top score candidate; and
    - clipping portions indicated by the user, wherein the portions are subsequently used by other software.
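
The scoring formulas in claims 3 and 5 can be worked through in a few lines of Python. This is a minimal sketch under stated assumptions, not the patented implementation: the names `classify_image`, `Counts`, `initial_score`, and `structure_modified_score` are hypothetical; "50K pixels" is read as 50,000; "a size not exceeding 5×5" is read as both width and height being at most 5; and the paragraph counts are supplied by the caller, since the claims do not spell out how a paragraph is assigned to each bucket.

```python
from dataclasses import dataclass


def classify_image(width: int, height: int, blocked_source: bool = False) -> str:
    """Return 'skipped', 'large', or 'small' using the claim 3 thresholds.

    blocked_source stands for an image referenced from a blacklisted or
    quarantined site (how such sites are detected is not covered here).
    """
    if blocked_source or (width <= 5 and height <= 5):  # assumed reading of "5x5"
        return "skipped"
    if width * height >= 50_000 or (width >= 350 and height >= 75):
        return "large"
    return "small"


@dataclass
class Counts:
    """Per-candidate tallies that feed the Initial Score formula."""
    five_line_paragraphs: int = 0
    eighty_word_paragraphs: int = 0
    three_line_paragraphs: int = 0
    fifty_word_paragraphs: int = 0
    large_images: int = 0
    small_images: int = 0
    skipped_images: int = 0


def initial_score(c: Counts) -> float:
    """Initial Score as written in claim 3."""
    return 1000 * (
        1.5 * (c.five_line_paragraphs + c.eighty_word_paragraphs)
        + c.three_line_paragraphs
        + c.fifty_word_paragraphs
        + 3 * c.large_images
        - 0.5 * (c.small_images + c.skipped_images)
    )


def structure_modified_score(
    score: float, n_pieces: int, n_candidates: int, n_containers: int
) -> float:
    """Modified Initial Score as written in claim 5 (all counts assumed >= 1)."""
    if min(n_pieces, n_candidates, n_containers) < 1:
        raise ValueError("counts must be at least 1 to avoid division by zero")
    return score * (
        1 / 3 + (2 / 3) * (1 / 3) * (1 / n_pieces + 1 / n_candidates + 1 / n_containers)
    )
```

For example, a candidate with two five-line paragraphs, one 80-word paragraph, three 3-line paragraphs, two 50-word paragraphs, one large image, four small images, and two skipped images gives 1000 × (4.5 + 3 + 2 + 3 − 3) = 9500. With one piece, one candidate, and one container, the claim 5 multiplier is 1/3 + 2/3 = 1, so the score is unchanged; larger counts pull the multiplier down toward 1/3.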

16. A non-transitory computer-readable medium containing software that extracts principal content from Web pages, the software comprising:
- executable code that identifies and classifies items on the Web page;
- executable code that builds a list of candidates;
- executable code that calculates candidate scores that vary according to a plurality of weights assigned to paragraphs and images of the Web page, wherein a particular paragraph is provided with a first weight based on a number of lines in the particular paragraph and is provided with a second weight based on a number of words in the particular paragraph, the first weight being independent of the second weight and a particular image is provided with a weight based on a size of the particular image;
- executable code that selects a top score candidate;
- executable code that performs clean up processing for the top score candidate; and
- executable code that performs final page processing for the top score candidate.
- Dependent claims (17 to 30):
  - 17. A non-transitory computer-readable medium, according to claim 16, where a word length of CJK (Chinese-Japanese-Korean) text is determined according to punctuation therein.
  - 18. A non-transitory computer-readable medium, according to claim 16, wherein executable code that calculates candidate scores determines an Initial Score using the formula:
    - $$\text{Initial Score} = 1000 \times \bigl(1.5 \times (N_{\text{5-line-paragraphs}} + N_{\text{80-word-paragraphs}}) + N_{\text{3-line-paragraphs}} + N_{\text{50-word-paragraphs}} + 3 \times N_{\text{large-images}} - 0.5 \times (N_{\text{small-images}} + N_{\text{skipped-images}})\bigr)$$

      where $N_{\text{5-line-paragraphs}}$ is a number of five line paragraphs in the candidate, $N_{\text{80-word-paragraphs}}$ is a number of 80 word paragraphs in the candidate, $N_{\text{3-line-paragraphs}}$ is a number of 3 line paragraphs in the candidate, $N_{\text{50-word-paragraphs}}$ is a number of 50 word paragraphs in the candidate, $N_{\text{large-images}}$ is a number of images that contain at least 50K pixels or have a width of at least 350 pixels and a height of at least 75 pixels, $N_{\text{skipped-images}}$ is a number of images that have a size not exceeding 5×5 or are present on Web page as references to at least one of: blacklisted and quarantined sites, and all other images count toward $N_{\text{small-images}}$.
  - 19. A non-transitory computer-readable medium, according to claim 16, wherein candidate scores are modified according to a number of containers and pieces and wherein a container is a Web page element that is associated with tags ‘body’, ‘div’, ‘td’, ‘li’, ‘article/section’ and pieces are candidates that do not include other candidates.
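  - 20. A non-transitory computer-readable medium, according to claim 19, wherein candidate scores are modified using a formula:
    - $$\text{Modified Initial Score} = \text{Initial Score} \times \left(\tfrac{1}{3} + \tfrac{2}{3} \times \tfrac{1}{3} \times \left(\frac{1}{N_{\text{pieces}}} + \frac{1}{N_{\text{candidates}}} + \frac{1}{N_{\text{containers}}}\right)\right).$$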
  - 21. A non-transitory computer-readable medium, according to claim 19, wherein candidate scores are modified according to a number of ratios corresponding to text and link density.
  - 22. A non-transitory computer-readable medium, according to claim 21, wherein the ratios include a ratio of a length of regular text for the candidate to a total length of regular text, a number of words in regular text of the candidate to a total number of words in regular text, a length of regular text above the candidate to a total length of regular text, and a number of words in regular text above the candidate to a total number of words in regular text.
  - 23. A non-transitory computer-readable medium, according to claim 22, wherein candidate scores are modified using a formula:
    - $$\text{Modified Initial Score}_{n+1} = \text{Modified Initial Score}_{n} \times \frac{\text{Percentage}_{n} \times (\text{ratio}^{\text{degree}})_{n} + (100 - \text{Percentage}_{n})}{100}$$

      wherein $\text{Percentage}_{n}$ and $\text{ratio}^{\text{degree}}$ are predetermined values that are empirically determined for each of n ratios.
  - 24. A non-transitory computer-readable medium, according to claim 16, further comprising:
    - executable code that determines if the top score candidate meets predetermined criteria following selecting the top score candidate; and
    - executable code that determines if a different top score candidate should be selected if the top score candidate does not meet the predetermined criteria.
  - 25. A non-transitory computer-readable medium, according to claim 24, wherein a first set of formulas is used to determine the top score candidate and a second, different, set of formulas is used to determine if a different top score candidate should be selected.
  - 26. A non-transitory computer-readable medium, according to claim 25, wherein a different top score candidate is not used if using the second set of formulas results in the same top score candidate as using the first set of formulas.
  - 27. A non-transitory computer-readable medium, according to claim 24, wherein the predetermined criteria is selected from the group consisting of:
    - whether the top score candidate has less than 25 embedded containers, wherein a container is a Web page element that is associated with tags ‘body’, ‘div’, ‘td’, ‘li’, ‘article/section’, whether the top score candidate has no embedded other candidates and whether the top score candidate has no more than three embedded candidates that have other embedded candidates.
  - 28. A non-transitory computer-readable medium, according to claim 16, wherein executable code that performs clean up processing removes floating elements, link boxes, navigation panels, and videos from unknown sources.
  - 29. A non-transitory computer-readable medium, according to claim 16, wherein executable code that performs clean up processing reformats the top scoring candidate by rewriting the text thereof as a new HTML page using only feasible HTML tags and attributes and ignoring stylistic deficiencies and unnecessary elements in the original page format.
  - 30. A non-transitory computer-readable medium, according to claim 16, further comprising:
    - executable code that clips portions indicated by the user, wherein the portions are subsequently used by other software.
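
The ratio-based adjustment of claims 8 and 23 and the acceptance criteria of claims 12 and 27 can be sketched as follows. This is an illustration under assumptions, not the patented implementation: `apply_ratio_modifiers` and `acceptance_checks` are hypothetical names, the `(ratio, degree, percentage)` values in the usage note are made up for the arithmetic (the claims say only that they are determined empirically), and the three criteria are reported separately because the claims list them as a group without stating how they combine.

```python
from typing import Dict, Iterable, Tuple


def apply_ratio_modifiers(
    score: float, modifiers: Iterable[Tuple[float, float, float]]
) -> float:
    """Apply the claim 8 / claim 23 update once per (ratio, degree, percentage).

    Each step computes:
        score * (percentage * ratio**degree + (100 - percentage)) / 100
    so a percentage of 0 leaves the score unchanged and a percentage of 100
    scales it by ratio**degree.
    """
    for ratio, degree, percentage in modifiers:
        score *= (percentage * ratio ** degree + (100 - percentage)) / 100
    return score


def acceptance_checks(
    n_embedded_containers: int,
    n_embedded_candidates: int,
    n_embedded_candidates_with_children: int,
) -> Dict[str, bool]:
    """Evaluate each criterion listed in claims 12 and 27 on its own."""
    return {
        "fewer_than_25_embedded_containers": n_embedded_containers < 25,
        "no_embedded_candidates": n_embedded_candidates == 0,
        "at_most_3_nested_embedded_candidates": n_embedded_candidates_with_children <= 3,
    }
```

With a starting score of 9500 and the illustrative modifiers `[(0.5, 1, 40), (0.25, 0.5, 20)]`, the first step multiplies by (40 × 0.5 + 60) / 100 = 0.8 and the second by (20 × 0.5 + 80) / 100 = 0.9, giving roughly 6840.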

Current Assignee: Bending Spoons SpA
Original Assignee: Evernote Corp. (Bending Spoons SpA)
Inventors: Bignert, Jakob; Coarna, Gabriel Alexandru
Primary Examiner(s): Perveen, Rehana
Assistant Examiner(s): Tran, Loc
Application Number: US13/563,060
Publication Number: US 20130124513A1
Time in Patent Office: 1,162 Days
Field of Search: 707/728, 707/727
US Class Current: 1/1
CPC Class Codes: G06F 16/353 into predefined classes; G06F 16/958 Organisation or management ...
