EXTRACTING PRINCIPAL CONTENT FROM WEB PAGES

US 20130124513A1
Filed: 07/31/2012
Published: 05/16/2013
Est. Priority Date: 11/10/2011
Status: Active Grant

Abstract

Extracting principal content from Web pages includes identifying and classifying items on the Web page, building a list of candidates, calculating candidate scores, selecting a top score candidate, performing clean up processing for the top score candidate, and performing final page processing for the top score candidate. Candidate scores may vary according to a number of paragraphs and images grouped according to size. A word length of CJK (Chinese-Japanese-Korean) text may be determined according to punctuation therein. Candidate scores may be modified according to a number of containers and pieces, where a container is a Web page element that is associated with tags ‘body’, ‘div’, ‘td’, ‘li’, ‘article/section’ and pieces are candidates that do not include other candidates. Candidate scores may be modified according to a number of ratios corresponding to text and link density.
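
As a rough illustration of the container definition in the abstract, the sketch below counts container elements in an HTML string using Python's standard `html.parser`. The names `CONTAINER_TAGS`, `ContainerCounter`, and `count_containers` are hypothetical, and reading ‘article/section’ as the two tags `article` and `section` is an assumption, not something the abstract states.

```python
from html.parser import HTMLParser

# Tags the abstract associates with containers. Treating 'article/section'
# as two separate tags is an assumption made for this sketch.
CONTAINER_TAGS = {"body", "div", "td", "li", "article", "section"}


class ContainerCounter(HTMLParser):
    """Counts start tags whose name is in CONTAINER_TAGS."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names already lowercased.
        if tag in CONTAINER_TAGS:
            self.count += 1


def count_containers(html: str) -> int:
    """Return the number of container elements found in the HTML text."""
    parser = ContainerCounter()
    parser.feed(html)
    parser.close()
    return parser.count
```

A count like this could feed the container term used when candidate scores are modified; how the patented method actually walks the page is not described in the abstract.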

44 Claims

1. A method of extracting principal content from Web pages, comprising:
- identifying and classifying items on the Web page;
- building a list of candidates;
- calculating candidate scores;
- selecting a top score candidate;
- performing clean up processing for the top score candidate; and
- performing final page processing for the top score candidate.
- Dependent claims (2-16):
  - 2. A method, according to claim 1, wherein candidate scores vary according to a number of paragraphs and images grouped according to size.
  - 3. A method, according to claim 2, where a word length of CJK (Chinese-Japanese-Korean) text is determined according to punctuation therein.
  - 4. A method, according to claim 2, wherein calculating candidate scores includes determining an Initial Score using the formula:
    - $$\text{Initial Score} = 1000 \times \bigl(1.5 \times (N_{\text{5-line-paragraphs}} + N_{\text{80-word-paragraphs}}) + N_{\text{3-line-paragraphs}} + N_{\text{50-word-paragraphs}} + 3 \times N_{\text{large-images}} - 0.5 \times (N_{\text{small-images}} + N_{\text{skipped-images}})\bigr)$$

      where $N_{\text{5-line-paragraphs}}$ is a number of five line paragraphs in the candidate, $N_{\text{80-word-paragraphs}}$ is a number of 80 word paragraphs in the candidate, $N_{\text{3-line-paragraphs}}$ is a number of 3 line paragraphs in the candidate, $N_{\text{50-word-paragraphs}}$ is a number of 50 word paragraphs in the candidate, $N_{\text{large-images}}$ is a number of images that contain at least 50K pixels or have a width of at least 350 pixels and a height of at least 75 pixels, $N_{\text{skipped-images}}$ is a number of images that have a size not exceeding 5×5 or are present on the Web page as references to at least one of: blacklisted and quarantined sites, and all other images count toward $N_{\text{small-images}}$.
  - 5. A method, according to claim 2, wherein candidate scores are modified according to a number of containers and pieces and wherein a container is a Web page element that is associated with tags ‘body’, ‘div’, ‘td’, ‘li’, ‘article/section’ and pieces are candidates that do not include other candidates.
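  - 6. A method, according to claim 5, wherein candidate scores are modified using a formula:
    - $$\text{Modified Initial Score} = \text{Initial Score} \times \left(\tfrac{1}{3} + \tfrac{2}{3} \times \tfrac{1}{3} \times \left(\frac{1}{N_{\text{Pieces}}} + \frac{1}{N_{\text{Candidates}}} + \frac{1}{N_{\text{Containers}}}\right)\right)$$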
  - 7. A method, according to claim 5, wherein candidate scores are modified according to a number of ratios corresponding to text and link density.
  - 8. A method, according to claim 7, wherein the ratios include a ratio of a length of regular text for the candidate to a total length of regular text, a number of words in regular text of the candidate to a total number of words in regular text, a length of regular text above the candidate to a total length of regular text, and a number of words in regular text above the candidate to a total number of words in regular text.
  - 9. A method, according to claim 8, wherein candidate scores are modified using a formula:
    - $$\text{Modified Initial Score}_{n+1} = \text{Modified Initial Score}_{n} \times \frac{\text{Percentage}_{n} \times (\text{ratio}^{\text{degree}})_{n} + (100 - \text{Percentage}_{n})}{100}$$

      wherein $\text{Percentage}_{n}$ and $\text{ratio}^{\text{degree}}$ are predetermined values that are empirically determined for each of n ratios.
  - 10. A method, according to claim 1, further comprising:
    - following selecting the top score candidate, determining if the top score candidate meets predetermined criteria; and
    - if the top score candidate does not meet predetermined criteria, determining if a different top score candidate should be selected.
  - 11. A method, according to claim 10, wherein a first set of formulas is used to determine the top score candidate and a second, different, set of formulas is used to determine if a different top score candidate should be selected.
  - 12. A method, according to claim 11, wherein a different top score candidate is not used if using the second set of formulas results in the same top score candidate as using the first set of formulas.
  - 13. A method, according to claim 10, wherein the predetermined criteria is selected from the group consisting of:
    - whether the top score candidate has less than 25 embedded containers, wherein a container is a Web page element that is associated with tags ‘body’, ‘div’, ‘td’, ‘li’, ‘article/section’;
    - whether the top score candidate has no embedded other candidates; and
    - whether the top score candidate has no more than three embedded candidates that have other embedded candidates.
  - 14. A method, according to claim 1, wherein performing clean up processing includes removing floating elements, link boxes, navigation panels, and videos from unknown sources.
  - 15. A method, according to claim 1, wherein performing clean up processing includes reformatting the top scoring candidate by rewriting the text thereof as a new HTML page using only feasible HTML tags and attributes and ignoring stylistic deficiencies and unnecessary elements in the original page format.
  - 16. A method, according to claim 1, further comprising:
    - a user providing an indication of portions of a displayed top score candidate; and
    - clipping portions indicated by the user, wherein the portions are subsequently used by other software.
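
The three scoring formulas in claims 4, 6, and 9 can be sketched as plain functions. This is a minimal illustration under stated assumptions: the class and function names are hypothetical, the paragraph and image counts are taken as given inputs (the claims do not say how a paragraph is assigned to a category), and `(ratio^degree)_n` is read as the n-th ratio raised to the n-th predetermined degree.

```python
from dataclasses import dataclass


@dataclass
class CandidateCounts:
    """Per-candidate counts named in claim 4 (field names are hypothetical)."""
    five_line_paragraphs: int = 0
    eighty_word_paragraphs: int = 0
    three_line_paragraphs: int = 0
    fifty_word_paragraphs: int = 0
    large_images: int = 0
    small_images: int = 0
    skipped_images: int = 0


def initial_score(c: CandidateCounts) -> float:
    """Initial Score as written in claim 4."""
    return 1000 * (
        1.5 * (c.five_line_paragraphs + c.eighty_word_paragraphs)
        + c.three_line_paragraphs
        + c.fifty_word_paragraphs
        + 3 * c.large_images
        - 0.5 * (c.small_images + c.skipped_images)
    )


def structure_adjusted(score: float, n_pieces: int, n_candidates: int,
                       n_containers: int) -> float:
    """Claim 6 modification; assumes every count is at least 1."""
    return score * (1 / 3 + 2 / 3 * 1 / 3
                    * (1 / n_pieces + 1 / n_candidates + 1 / n_containers))


def ratio_adjusted(score: float, ratio: float, degree: float,
                   percentage: float) -> float:
    """One step of the claim 9 recurrence for a single ratio."""
    return score * (percentage * ratio ** degree + (100 - percentage)) / 100
```

With this reading, a candidate whose piece, candidate, and container counts are all 1 keeps its full Initial Score under the claim 6 step, and larger counts pull the multiplier toward 1/3. The percentage and degree values are described in claim 9 as empirically determined, so any numbers passed in here are placeholders.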

17. Computer software, provided in non-transitory computer-readable media, that extracts principal content from Web pages, the software comprising:
- executable code that identifies and classifies items on the Web page;
- executable code that builds a list of candidates;
- executable code that calculates candidate scores;
- executable code that selects a top score candidate;
- executable code that performs clean up processing for the top score candidate; and
- executable code that performs final page processing for the top score candidate.
- Dependent claims (18-32):
  - 18. Computer software, according to claim 17, wherein candidate scores vary according to a number of paragraphs and images grouped according to size.
  - 19. Computer software, according to claim 18, where a word length of CJK (Chinese-Japanese-Korean) text is determined according to punctuation therein.
  - 20. Computer software, according to claim 18, wherein executable code that calculates candidate scores determines an Initial Score using the formula:
    - $$\text{Initial Score} = 1000 \times \bigl(1.5 \times (N_{\text{5-line-paragraphs}} + N_{\text{80-word-paragraphs}}) + N_{\text{3-line-paragraphs}} + N_{\text{50-word-paragraphs}} + 3 \times N_{\text{large-images}} - 0.5 \times (N_{\text{small-images}} + N_{\text{skipped-images}})\bigr)$$

      where $N_{\text{5-line-paragraphs}}$ is a number of five line paragraphs in the candidate, $N_{\text{80-word-paragraphs}}$ is a number of 80 word paragraphs in the candidate, $N_{\text{3-line-paragraphs}}$ is a number of 3 line paragraphs in the candidate, $N_{\text{50-word-paragraphs}}$ is a number of 50 word paragraphs in the candidate, $N_{\text{large-images}}$ is a number of images that contain at least 50K pixels or have a width of at least 350 pixels and a height of at least 75 pixels, $N_{\text{skipped-images}}$ is a number of images that have a size not exceeding 5×5 or are present on the Web page as references to at least one of: blacklisted and quarantined sites, and all other images count toward $N_{\text{small-images}}$.
  - 21. Computer software, according to claim 18, wherein candidate scores are modified according to a number of containers and pieces and wherein a container is a Web page element that is associated with tags ‘body’, ‘div’, ‘td’, ‘li’, ‘article/section’ and pieces are candidates that do not include other candidates.
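  - 22. Computer software, according to claim 21, wherein candidate scores are modified using a formula:
    - $$\text{Modified Initial Score} = \text{Initial Score} \times \left(\tfrac{1}{3} + \tfrac{2}{3} \times \tfrac{1}{3} \times \left(\frac{1}{N_{\text{Pieces}}} + \frac{1}{N_{\text{Candidates}}} + \frac{1}{N_{\text{Containers}}}\right)\right)$$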
  - 23. Computer software, according to claim 21, wherein candidate scores are modified according to a number of ratios corresponding to text and link density.
  - 24. Computer software, according to claim 23, wherein the ratios include a ratio of a length of regular text for the candidate to a total length of regular text, a number of words in regular text of the candidate to a total number of words in regular text, a length of regular text above the candidate to a total length of regular text, and a number of words in regular text above the candidate to a total number of words in regular text.
  - 25. Computer software, according to claim 24, wherein candidate scores are modified using a formula:
    - $$\text{Modified Initial Score}_{n+1} = \text{Modified Initial Score}_{n} \times \frac{\text{Percentage}_{n} \times (\text{ratio}^{\text{degree}})_{n} + (100 - \text{Percentage}_{n})}{100}$$

      wherein $\text{Percentage}_{n}$ and $\text{ratio}^{\text{degree}}$ are predetermined values that are empirically determined for each of n ratios.
  - 26. Computer software, according to claim 17, further comprising:
    - executable code that determines if the top score candidate meets predetermined criteria following selecting the top score candidate; and
    - executable code that determines if a different top score candidate should be selected if the top score candidate does not meet the predetermined criteria.
  - 27. Computer software, according to claim 26, wherein a first set of formulas is used to determine the top score candidate and a second, different, set of formulas is used to determine if a different top score candidate should be selected.
  - 28. Computer software, according to claim 27, wherein a different top score candidate is not used if using the second set of formulas results in the same top score candidate as using the first set of formulas.
  - 29. Computer software, according to claim 26, wherein the predetermined criteria is selected from the group consisting of:
    - whether the top score candidate has less than 25 embedded containers, wherein a container is a Web page element that is associated with tags ‘body’, ‘div’, ‘td’, ‘li’, ‘article/section’;
    - whether the top score candidate has no embedded other candidates; and
    - whether the top score candidate has no more than three embedded candidates that have other embedded candidates.
  - 30. Computer software, according to claim 17, wherein executable code that performs clean up processing removes floating elements, link boxes, navigation panels, and videos from unknown sources.
  - 31. Computer software, according to claim 17, wherein executable code that performs clean up processing reformats the top scoring candidate by rewriting the text thereof as a new HTML page using only feasible HTML tags and attributes and ignoring stylistic deficiencies and unnecessary elements in the original page format.
  - 32. Computer software, according to claim 17, further comprising:
    - executable code that clips portions indicated by the user, wherein the portions are subsequently used by other software.
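
Claims 26 to 29 describe checking the selected candidate against a predetermined criterion and, if it fails, consulting a second set of formulas. The sketch below shows one possible shape for that control flow. The `Candidate` type, the three criterion functions, and `select_top_candidate` are hypothetical names, the scoring functions are passed in as placeholders, and treating "embedded candidates" as directly nested children is an assumption.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Candidate:
    """Hypothetical candidate record used only for this sketch."""
    name: str
    embedded_containers: int = 0
    # Candidates nested directly inside this one (assumed interpretation).
    children: List["Candidate"] = field(default_factory=list)


def fewer_than_25_containers(c: Candidate) -> bool:
    return c.embedded_containers < 25


def no_embedded_candidates(c: Candidate) -> bool:
    return not c.children


def at_most_three_nested(c: Candidate) -> bool:
    # Counts embedded candidates that themselves embed other candidates.
    return sum(1 for child in c.children if child.children) <= 3


def select_top_candidate(
    candidates: List[Candidate],
    first_score: Callable[[Candidate], float],
    second_score: Callable[[Candidate], float],
    criterion: Callable[[Candidate], bool],
) -> Candidate:
    """Pick the top candidate, falling back to a second scoring pass."""
    top = max(candidates, key=first_score)
    if criterion(top):
        return top
    # The second formula set decides whether a different candidate is used.
    # If it picks the same candidate, that candidate is kept (claim 28).
    return max(candidates, key=second_score)
```

The claims do not disclose the second set of formulas, so `second_score` stands in for whatever alternative scoring an implementation would supply.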

33. A method of identifying separate portions of a principal item provided on multiple Web pages, comprising:
- determining a last non-empty segment of the path of a URL containing a portion of the principal item;
- determining a similarity measure between the URL and other URLs on the page corresponding to the last non-empty segment of the path;
- examining Web link text for words indicating additional portions of the principal item on Web pages corresponding to sufficiently similar URLs; and
- determining whether additional portions of the principal item are provided on at least some of the Web pages according to the similarity measure and the presence of words indicating additional portions.
- Dependent claims (34-38):
  - 34. A method, according to claim 33, wherein the words indicating additional portions include page numbers.
  - 35. A method, according to claim 33, wherein the words indicating additional portions include “next”.
  - 36. A method, according to claim 33, wherein the similarity measure is selected from the group consisting of:
    - edit distance, Levenshtein's distance, and a length of a Largest Common Substring.
  - 37. A method, according to claim 36, further comprising:
    - confirming consistency of the portions.
  - 38. A method, according to claim 37, wherein confirming consistency includes extracting an article title from the portions and comparing the article title with a previously-determined original title.
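
The steps of claims 33 to 36 can be illustrated with a short sketch that compares the last non-empty path segment of each link against that of the current page and then checks the link text. Everything named here (`last_segment`, `edit_distance`, `hints_more_pages`, `continuation_links`, and the `max_distance` threshold of 2) is hypothetical; the claims do not state what counts as "sufficiently similar".

```python
from urllib.parse import urlsplit


def last_segment(url: str) -> str:
    """Return the last non-empty segment of the URL path, or '' if none."""
    segments = [s for s in urlsplit(url).path.split("/") if s]
    return segments[-1] if segments else ""


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed one dynamic-programming row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def hints_more_pages(link_text: str) -> bool:
    """True if the link text is a page number or contains the word 'next'."""
    text = link_text.strip().lower()
    return text.isdigit() or "next" in text.split()


def continuation_links(page_url, links, max_distance=2):
    """Return hrefs that look like further pages of the same item.

    `links` is an iterable of (href, link_text) pairs; `max_distance`
    is an assumed similarity threshold.
    """
    base = last_segment(page_url)
    found = []
    for href, text in links:
        similar = edit_distance(base, last_segment(href)) <= max_distance
        if similar and href != page_url and hints_more_pages(text):
            found.append(href)
    return found
```

For a page at `https://example.com/news/story-123.html`, a link to `story-123-2.html` labelled "2" would be kept, while a link to `story-124.html` labelled "Related story" would be dropped because its text gives no hint of a continuation.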

39. Computer software, provided in a non-transitory computer-readable medium, that identifies separate portions of a principal item provided on multiple Web pages, the software comprising:
- executable code that determines a last non-empty segment of the path of a URL containing a portion of the principal item;
- executable code that determines a similarity measure between the URL and other URLs on the page corresponding to the last non-empty segment of the path;
- executable code that examines Web link text for words indicating additional portions of the principal item on Web pages corresponding to sufficiently similar URLs; and
- executable code that determines whether additional portions of the principal item are provided on at least some of the Web pages according to the similarity measure and the presence of words indicating additional portions.
- Dependent claims (40-44):
  - 40. Computer software, according to claim 39, wherein the words indicating additional portions include page numbers.
  - 41. Computer software, according to claim 39, wherein the words indicating additional portions include “next”.
  - 42. Computer software, according to claim 39, wherein the similarity measure is selected from the group consisting of:
    - edit distance, Levenshtein's distance, and a length of a Largest Common Substring.
  - 43. Computer software, according to claim 42, further comprising:
    - executable code that confirms consistency of the portions.
  - 44. Computer software, according to claim 43, wherein executable code that confirms consistency extracts an article title from the portions and compares the article title with a previously-determined original title.
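
Claim 42 lists the length of a largest common substring as one possible similarity measure, and claim 44 describes a title comparison without saying how it is done. The sketch below combines the two as one hedged possibility: it reuses the common-substring length to compare titles. The function names, the whitespace and case normalization, and the `min_share` threshold of 0.8 are all assumptions made for illustration.

```python
def longest_common_substring_length(a: str, b: str) -> int:
    """Length of the longest run of characters shared by both strings."""
    best = 0
    prev = [0] * (len(b) + 1)
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, 1):
            # Extend the common run ending at the previous characters,
            # or reset to 0 when the current characters differ.
            cur.append(prev[j - 1] + 1 if ca == cb else 0)
            best = max(best, cur[j])
        prev = cur
    return best


def normalize(title: str) -> str:
    """Lowercase and collapse whitespace (an assumed normalization)."""
    return " ".join(title.lower().split())


def titles_consistent(title: str, original_title: str,
                      min_share: float = 0.8) -> bool:
    """Accept a portion whose title shares most of the shorter title."""
    a, b = normalize(title), normalize(original_title)
    if not a or not b:
        return False
    shared = longest_common_substring_length(a, b)
    return shared / min(len(a), len(b)) >= min_share
```

Under these assumptions, a second page titled "Extracting Content from Web Pages - Page 2" would be accepted as consistent with an original title of "Extracting Content from Web Pages", while an unrelated "Privacy Policy" page would be rejected.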

Current Assignee: Bending Spoons SpA
Original Assignee: Evernote Corp. (Bending Spoons SpA)
Inventors: Bignert, Jakob; Coarna, Gabriel Alexandru
Granted Patent: US 9,152,730 B2
US Class Current: 707/728
CPC Class Codes:
G06F 16/353 into predefined classes
G06F 16/958 Organisation or management ...
