Extracting structured data from weblogs
First Claim
1. A method of extracting weblog posts from a weblog, the method comprising:
- retrieving a feed referenced on a webpage of the weblog; and
- in response to determining that the feed does not contain a first portion of a weblog post;
- creating, via a processor, a representation of the weblog post based on a second portion of the weblog post included in the feed;
- filtering the representation of the weblog post to summarization artefacts;
- searching, via the processor, the weblog for the filtered representation of the second portion of the weblog post;
- when the second portion of the weblog post is found in the weblog, identifying, via the processor, a node associated with the second portion in the webpage;
- extracting, via the processor, information from markup language contained within the node associated with the second portion of the webpage; and
- modifying, via the processor, the representation based on the information extracted from within the node to reconstruct the weblog post.
Abstract
Methods and apparatus for extracting structured data from weblogs are disclosed. In some examples, the methods and apparatus include retrieving a feed referenced on a webpage of the weblog and, in response to determining that the feed does not contain a first portion of a weblog post, creating, via a processor, a representation of the weblog post based on a second portion of the weblog post included in the feed, searching, via the processor, the weblog for the second portion of the weblog post, when the second portion of the weblog post is found in the weblog, identifying, via the processor, a node associated with the second portion in the webpage, and modifying, via the processor, the representation based on information from within the node to reconstruct the weblog post.
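The flow described in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation, only one possible reading under stated assumptions: the feed item is a plain dictionary, the webpage is well-formed XHTML that the standard library XML parser accepts, a summary counts as truncated when it ends in an ellipsis-style marker, and the "node associated with the second portion" is taken to be the deepest element whose text contains the filtered summary. All function names and the artefact pattern are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Assumed summarization artefacts: a trailing "[...]", "...", "…" or "read more".
ARTEFACTS = re.compile(r"(\[\s*(?:\.\.\.|…)\s*\]|\.\.\.|…|read more)\s*$", re.IGNORECASE)


def normalize(text):
    """Collapse runs of whitespace so feed text and page text compare cleanly."""
    return " ".join(text.split())


def filter_artefacts(summary):
    """Strip trailing summarization artefacts, repeating in case several are stacked."""
    text = normalize(summary)
    while True:
        stripped = ARTEFACTS.sub("", text).rstrip()
        if stripped == text:
            return text
        text = stripped


def is_truncated(summary):
    """Assumed heuristic: the feed lacks the full post if filtering removed something."""
    return filter_artefacts(summary) != normalize(summary)


def find_post_node(node, snippet):
    """Return the deepest element whose text content contains the snippet, or None."""
    if not snippet or snippet not in normalize("".join(node.itertext())):
        return None
    for child in node:
        found = find_post_node(child, snippet)
        if found is not None:
            return found
    return node


def reconstruct_post(feed_item, page_html):
    """Build a post representation from the feed, then complete it from the page."""
    post = {"title": feed_item["title"], "content": normalize(feed_item["summary"])}
    if not is_truncated(feed_item["summary"]):
        post["complete"] = True  # the feed already carries the whole post
        return post
    post["complete"] = False
    snippet = filter_artefacts(feed_item["summary"])
    node = find_post_node(ET.fromstring(page_html), snippet)
    if node is not None:
        # Replace the partial text with what the matching node contains.
        post["content"] = normalize("".join(node.itertext()))
        post["markup"] = ET.tostring(node, encoding="unicode").strip()
        post["complete"] = True
    return post


# Hypothetical sample data for illustration only.
PAGE = """<html><body>
<div id="sidebar">About this blog</div>
<div class="entry"><h2>Parser notes</h2><p>Today I tested the new parser on <em>ten thousand</em> blogs. It handled most of them well.</p></div>
</body></html>"""
ITEM = {"title": "Parser notes",
        "summary": "Today I tested the new parser on ten thousand blogs. It [...]"}

post = reconstruct_post(ITEM, PAGE)
print(post["content"])
```

Real weblog pages are rarely well-formed XML, so a practical version would need a lenient HTML parser. The deepest-match rule also returns only the paragraph that contains the summary text, so a multi-paragraph post would need a step that widens the match to the enclosing post container.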
224 Citations
21 Claims
1. A method of extracting weblog posts from a weblog, the method comprising:
- retrieving a feed referenced on a webpage of the weblog; and
- in response to determining that the feed does not contain a first portion of a weblog post;
- creating, via a processor, a representation of the weblog post based on a second portion of the weblog post included in the feed;
- filtering the representation of the weblog post to summarization artefacts;
- searching, via the processor, the weblog for the filtered representation of the second portion of the weblog post;
- when the second portion of the weblog post is found in the weblog, identifying, via the processor, a node associated with the second portion in the webpage;
- extracting, via the processor, information from markup language contained within the node associated with the second portion of the webpage; and
- modifying, via the processor, the representation based on the information extracted from within the node to reconstruct the weblog post.
Dependent claims: 2, 3, 4, 5, 6, 7
8. An apparatus for extracting weblog posts from a weblog, the apparatus comprising:
- a web crawler to retrieve a feed referenced on a webpage of the weblog;
- a feed classifier to determine whether the feed contains a first portion of a weblog post; and
- a wrapper to, in response to determining that the feed does not contain a first portion of a weblog post;
- filter the representation of the weblog post to summarization artefacts;
- create a representation of the weblog post based on a second portion of the weblog post included in the feed;
- search the weblog for the filtered representation of the second portion of the weblog post;
- when the second portion of the weblog post is found in the weblog, identify a node associated with the second portion in the webpage;
- extract information from markup language contained within the node associated with the second portion of the webpage; and
- modify the representation based on the information extracted from within the node to reconstruct the weblog post, at least one of the web crawler, the feed classifier, or the wrapper implemented by a logic circuit.
Dependent claims: 9, 10, 11, 12, 13, 14
15. A computer readable storage hardware device or storage disc comprising instructions, that when executed, cause a machine to at least:
- retrieve a feed referenced on a webpage of the weblog; and
- in response to determining that the feed does not contain a first portion of a weblog post;
- create a representation of the weblog post based on a second portion of the weblog post included in the feed;
- filter the representation of the weblog post to summarization artefacts;
- search the weblog for the filtered representation of the second portion of the weblog post;
- when the second portion of the weblog post is found in the weblog, identify a node associated with the second portion in the webpage;
- extract information from markup language contained within the node associated with the second portion of the webpage; and
- modify the representation based on the information extracted from within the node to reconstruct the weblog post.
Dependent claims: 16, 17, 18, 19, 20, 21
Specification