Extracting structured data from weblogs

US 10,180,986 B2
Filed: 10/12/2015
Issued: 01/15/2019
Est. Priority Date: 06/16/2005
Status: Active Grant


Abstract

Methods and apparatus for extracting structured data from weblogs are disclosed. In some examples, the methods and apparatus include retrieving a feed referenced on a webpage of the weblog and, in response to determining that the feed does not contain a first portion of a weblog post: creating, via a processor, a representation of the weblog post based on a second portion of the weblog post included in the feed; searching, via the processor, the weblog for the second portion of the weblog post; when the second portion of the weblog post is found in the weblog, identifying, via the processor, a node associated with the second portion in the webpage; and modifying, via the processor, the representation based on information from within the node to reconstruct the weblog post.
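
The flow described in the abstract can be illustrated with a minimal sketch. It is not the patented implementation: it assumes the webpage is well-formed XHTML so the standard-library XML parser can read it, and the names `strip_artifacts`, `find_node`, `reconstruct`, and the dictionary keys `summary` and `content` are hypothetical. The "node associated with the second portion" is interpreted here as the smallest element whose text contains the feed snippet.

```python
import re
import xml.etree.ElementTree as ET

# Trailing markers that feed summarization commonly leaves behind (assumed set).
ARTIFACT_RE = re.compile(r"(\[\.\.\.\]|\.\.\.|…)\s*$")


def collapse_ws(text):
    """Collapse runs of whitespace so feed text and page text compare cleanly."""
    return re.sub(r"\s+", " ", text).strip()


def strip_artifacts(snippet):
    """Remove a trailing ellipsis marker from a summarized feed snippet."""
    return ARTIFACT_RE.sub("", collapse_ws(snippet)).strip()


def find_node(root, snippet):
    """Return the element with the shortest text that still contains the snippet."""
    matches = []
    for elem in root.iter():
        text = collapse_ws("".join(elem.itertext()))
        if snippet in text:
            matches.append((len(text), elem))
    if not matches:
        return None
    return min(matches, key=lambda m: m[0])[1]


def reconstruct(post, page_xhtml):
    """Fill in the full content of a post from the page when the feed is partial."""
    snippet = strip_artifacts(post["summary"])
    if not snippet:
        return post  # nothing to search for
    node = find_node(ET.fromstring(page_xhtml), snippet)
    if node is None:
        return post  # snippet not found on the page; keep the partial post
    full = dict(post)
    full["content"] = collapse_ws("".join(node.itertext()))
    return full


page = (
    "<html><body>"
    "<div class='post'><h2>Title</h2>"
    "<p>Hello world, this is the full text of the post.</p></div>"
    "<div class='post'><p>Other entry.</p></div>"
    "</body></html>"
)
post = {"title": "Title", "summary": "Hello world, this is..."}
print(reconstruct(post, page)["content"])
```

Real weblog pages are rarely well-formed XML, so a production version would need a tolerant HTML parser in place of `xml.etree.ElementTree`.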

224 Citations

21 Claims

1. A method of extracting weblog posts from a weblog, the method comprising:
- retrieving a feed referenced on a webpage of the weblog; and
  
  in response to determining that the feed does not contain a first portion of a weblog post:
  
  creating, via a processor, a representation of the weblog post based on a second portion of the weblog post included in the feed;
  
  filtering the representation of the weblog post to remove summarization artefacts;
  
  searching, via the processor, the weblog for the filtered representation of the second portion of the weblog post;
  
  when the second portion of the weblog post is found in the weblog, identifying, via the processor, a node associated with the second portion in the webpage;
  
  extracting, via the processor, information from markup language contained within the node associated with the second portion of the webpage; and
  
  modifying, via the processor, the representation based on the information extracted from within the node to reconstruct the weblog post.
- View Dependent Claims (2, 3, 4, 5, 6, 7)
  - 2. A method as defined in claim 1, further including, in response to determining that the feed contains the first portion and the second portion, mapping the first and second portions into the representation of the weblog post.
  - 3. A method as defined in claim 1, wherein the first portion is at least one of a date for the weblog post, a permalink of the weblog post, a post title of the weblog post, an author of the weblog post, or a summary of the weblog post.
  - 4. A method as defined in claim 1, wherein the determining that the feed does not contain the first portion of the weblog post is based on at least one of a presence of tags, a percentage of posts including ellipses, or a variance in length of the weblog post.
  - 5. A method as defined in claim 1, further including, in response to determining that the feed does not contain a date of the weblog post and at least one of a summary or a full description of the weblog post:
    - extracting dates from the markup language of the webpage;
      
      sorting the extracted dates into ordered lists;
      
      filtering the ordered lists to determine which of the lists correspond to entry dates of the weblog post;
      
      segmenting the weblog into entries based on dates from the filtered list as markers for the entries;
      
      extracting the weblog post from the weblog entries based on post title markers; and
      
      identifying a permalink and an author of the weblog post.
  - 6. A method as defined in claim 5, wherein the filtering of the ordered lists includes:
    - extracting lists whose dates belong to a current year or a past year;
      
      extracting non-singleton date lists;
      
      extracting lists whose dates conform to a similar format;
      
      extracting lists whose dates decrease monotonically;
      
      extracting lists with most recent dates;
      
      extracting lists with a longest date string representation; and
      
      extracting lists with a greatest number of dates.
  - 7. A method as defined in claim 1, further including screen scraping the weblog.
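
The date-list filtering recited in claims 5 and 6 can be sketched as follows. This is a hedged illustration that covers only some of the listed filters (current or past year, non-singleton, decreasing order, most recent dates, greatest number of dates); the format-similarity and string-length filters are omitted. The function name `filter_date_lists` and the input shape (lists of already-parsed `datetime.date` values in page order) are assumptions, and "decrease monotonically" is read loosely as non-increasing so that several posts on the same day are allowed.

```python
from datetime import date


def filter_date_lists(candidates, today):
    """Pick the candidate list most likely to hold the entry dates of the posts.

    candidates: list of lists of datetime.date, each in page order.
    Returns the chosen list, or None if no list survives the filters.
    """
    kept = [
        dates
        for dates in candidates
        if len(dates) > 1                                  # non-singleton lists only
        and all(d.year <= today.year for d in dates)       # current or past year
        and all(a >= b for a, b in zip(dates, dates[1:]))  # newest first
    ]
    # Among survivors, prefer the most recent leading date, then the most dates.
    return max(kept, key=lambda dates: (dates[0], len(dates)), default=None)


today = date(2005, 6, 16)
archive = [date(2005, 6, 10), date(2005, 6, 8), date(2005, 6, 1)]
sidebar = [date(2005, 1, 1), date(2005, 3, 1)]    # increasing order, rejected
single = [date(2005, 6, 15)]                      # singleton, rejected
future = [date(2006, 1, 1), date(2005, 12, 1)]    # future year, rejected
print(filter_date_lists([sidebar, single, future, archive], today))
```

The surviving list would then serve as the entry markers used to segment the page into posts.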

8. An apparatus for extracting weblog posts from a weblog, the apparatus comprising:
- a web crawler to retrieve a feed referenced on a webpage of the weblog;
  
  a feed classifier to determine whether the feed contains a first portion of a weblog post; and
  
  a wrapper to, in response to determining that the feed does not contain a first portion of a weblog post:
  
  filter the representation of the weblog post to remove summarization artefacts;
  
  create a representation of the weblog post based on a second portion of the weblog post included in the feed;
  
  search the weblog for the filtered representation of the second portion of the weblog post;
  
  when the second portion of the weblog post is found in the weblog, identify a node associated with the second portion in the webpage;
  
  extract information from markup language contained within the node associated with the second portion of the webpage; and
  
  modify the representation based on the information extracted from within the node to reconstruct the weblog post, at least one of the web crawler, the feed classifier, or the wrapper implemented by a logic circuit.
- View Dependent Claims (9, 10, 11, 12, 13, 14)
  - 9. An apparatus as defined in claim 8, wherein the wrapper is to, in response to determining that the feed contains the first portion and the second portion, map the first and second portions into the representation of the weblog post.
  - 10. An apparatus as defined in claim 8, wherein the first portion is at least one of a date for the weblog post, a permalink of the weblog post, a post title of the weblog post, an author of the weblog post, or a summary of the weblog post.
  - 11. An apparatus as defined in claim 8, wherein the feed classifier is to determine whether the feed contains the first portion of the weblog post based on at least one of a presence of tags, a percentage of posts including ellipses, or a variance in length of the weblog post.
  - 12. An apparatus as defined in claim 8, further including a date extractor to, in response to determining that the feed does not contain a date of the weblog post and at least one of a summary or a full description of the weblog post:
    - extract dates from the markup language of the webpage;
      
      sort the extracted dates into ordered lists;
      
      filter the ordered lists to determine which of the lists correspond to entry dates of the weblog post;
      
      segment the weblog into entries based on dates from the filtered list as markers for the entries;
      
      extract the weblog post from the weblog entries based on post title markers; and
      
      identify a permalink and an author of the weblog post.
  - 13. An apparatus as defined in claim 12, wherein to filter the ordered lists, the date extractor is to:
    - extract lists whose dates belong to a current year or a past year;
      
      extract non-singleton date lists;
      
      extract lists whose dates conform to a similar format;
      
      extract lists whose dates decrease monotonically;
      
      extract lists with most recent dates;
      
      extract lists with a longest date string representation; and
      
      extract lists with a greatest number of dates.
  - 14. An apparatus as defined in claim 8, wherein the wrapper is to screen scrape the weblog.
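
The feed classifier of claim 11 relies on three signals: the presence of tags, the percentage of posts including ellipses, and the variance in post length. One possible way to combine them is sketched below. The function name `feed_looks_partial`, the thresholds, and the decision rule are all assumptions made for illustration; the claims do not specify how the signals are weighted.

```python
import re
from statistics import pvariance

# Matches an opening markup tag such as <p> or <a href="...">.
TAG_RE = re.compile(r"<[a-zA-Z][^>]*>")


def feed_looks_partial(descriptions, ellipsis_ratio=0.5, min_variance=100.0):
    """Guess whether a feed carries summaries rather than full posts.

    descriptions: the description text of each item in the feed.
    ellipsis_ratio and min_variance are hypothetical thresholds.
    """
    if not descriptions:
        return True  # no content at all, so nothing can be mapped directly
    has_tags = any(TAG_RE.search(d) for d in descriptions)
    ending_in_ellipsis = sum(
        d.rstrip().endswith(("...", "…")) for d in descriptions
    ) / len(descriptions)
    length_variance = pvariance([len(d) for d in descriptions])
    if ending_in_ellipsis >= ellipsis_ratio:
        return True  # most items were visibly cut off
    # Tag-free items of near-identical length suggest fixed-length truncation.
    return not has_tags and length_variance < min_variance


partial = [
    "The quick brown fox jumps over the...",
    "Another entry that was cut off at...",
    "Yet another summary truncated here...",
]
full = [
    "<p>Short post.</p>",
    "<p>" + "long " * 80 + "</p>",
    "<p>Medium length post with a link.</p>",
]
print(feed_looks_partial(partial), feed_looks_partial(full))
```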

15. A computer readable storage hardware device or storage disc comprising instructions, that when executed, cause a machine to at least:
- retrieve a feed referenced on a webpage of the weblog; and
  
  in response to determining that the feed does not contain a first portion of a weblog post:
  
  create a representation of the weblog post based on a second portion of the weblog post included in the feed;
  
  filter the representation of the weblog post to remove summarization artefacts;
  
  search the weblog for the filtered representation of the second portion of the weblog post;
  
  when the second portion of the weblog post is found in the weblog, identify a node associated with the second portion in the webpage;
  
  extract information from markup language contained within the node associated with the second portion of the webpage; and
  
  modify the representation based on the information extracted from within the node to reconstruct the weblog post.
- View Dependent Claims (16, 17, 18, 19, 20, 21)
  - 16. A hardware device as defined in claim 15, wherein the instructions, when executed, cause the machine to, when the feed contains the first portion and the second portion, map the first and second portions into the representation of the weblog post.
  - 17. A hardware device as defined in claim 15, wherein the first portion is at least one of a date of the weblog post, a permalink of the weblog post, a post title of the weblog post, an author of the weblog post, or a summary of the weblog post.
  - 18. A hardware device as defined in claim 15, wherein the instructions, when executed, cause the machine to determine the feed does not contain the first portion of the weblog post based on at least one of a presence of tags, a percentage of posts including ellipses, or a variance in length of the weblog post.
  - 19. A hardware device as defined in claim 15, wherein the instructions, when executed, cause the machine to, in response to determining that the feed does not contain a date of the weblog post and at least one of a summary or a full description of the weblog post:
    - extract dates from the markup language of the webpage;
      
      sort the extracted dates into ordered lists;
      
      filter the ordered lists to determine which of the lists correspond to entry dates of the weblog post;
      
      segment the weblog into entries based on dates from the filtered list as markers for the entries;
      
      extract the weblog post from the weblog entries based on post title markers; and
      
      identify a permalink and an author of the weblog post.
  - 20. A hardware device as defined in claim 19, wherein to filter the ordered lists, the instructions, when executed, cause the machine to:
    - extract lists whose dates belong to a current year or a past year;
      
      extract non-singleton date lists;
      
      extract lists whose dates conform to a similar format;
      
      extract lists whose dates decrease monotonically;
      
      extract lists with most recent dates;
      
      extract lists with a longest date string representation; and
      
      extract lists with a greatest number of dates.
  - 21. A hardware device as defined in claim 15, wherein the instructions, when executed, cause the machine to screen scrape the weblog.

Current Assignee: Buzzmetrics, Ltd. (Nielsen Holdings plc)
Original Assignee: Buzzmetrics, Ltd. (Nielsen Holdings plc)
Inventors: Glance, Natalie
Primary Examiner(s): Burke, Jeff A

Application Number: US14/881,071
Publication Number: US 20160117390A1
Time in Patent Office: 1,191 Days
Field of Search: None
US Class Current
CPC Class Codes
- G06F 16/245   Query processing
- G06F 16/28   Databases characterised by ...
- G06F 16/80   of semi-structured data, e....
- G06F 16/951   Indexing; Web crawling tech...
- G06F 16/958   Organisation or management ...
- G06F 40/205   Parsing
