Harvesting data from page
First Claim
1. A method for automatically mining data, the method comprising:
identifying, a portion of the page which includes the data to be mined;
determining that the portion of the page links to another page;
identifying, a portion of the linked page which includes the data to be mined;
wherein, the identifying portions of the pages includes;
using a feed representation created for the page by an entity;
selecting the portions of the pages as having the data to be mined, wherein the selected portions of the pages include content that is referenced in the feed representation;
retrieving and storing, the selected portions of the pages as having the data to be mined;
data mining the selected portions of the pages that have been retrieved and stored;
identifying a first portion of a second page that is referenced in the feed representations and a second portion of the second page that is not referenced in the feed representation;
retrieving and storing the second portion and not the first portion of the second page.
Abstract
Among other disclosures, computer-implemented methods and computer program products for obtaining data from a page are described. A method can include initiating a harvesting process for a page available in a computer system. The method can include identifying a feed representation that has been created for the page. The method can include retrieving and storing, as part of the harvesting process, at least a portion from the page based on information in the identified feed representation. The feed representation can include at least excerpts of content from the page. The feed representation can include at least one representation selected from: an RSS feed, an Atom feed, an XML feed, an RDF feed, a serialized data feed representation, and combinations thereof.
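The harvesting process summarized in the abstract can be illustrated with a minimal sketch. It is not the patented implementation: it assumes the feed is RSS 2.0, that the page has already been split into text portions, and that a portion counts as "referenced" when it contains a feed item's title or description as a substring. The names feed_excerpts, select_portions, and harvest are hypothetical and do not come from the source.

```python
import xml.etree.ElementTree as ET
from typing import List


def feed_excerpts(rss_xml: str) -> List[str]:
    """Collect the title and description text of every <item> in an RSS 2.0 feed."""
    root = ET.fromstring(rss_xml)
    excerpts = []
    for item in root.iter("item"):
        for tag in ("title", "description"):
            text = (item.findtext(tag) or "").strip()
            if text:
                excerpts.append(text)
    return excerpts


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so comparisons ignore layout differences."""
    return " ".join(text.lower().split())


def select_portions(portions: List[str], excerpts: List[str]) -> List[str]:
    """Keep the page portions that contain at least one feed excerpt.

    Assumption: substring containment stands in for 'referenced in the feed'.
    """
    wanted = [normalize(e) for e in excerpts]
    return [p for p in portions if any(w in normalize(p) for w in wanted)]


def harvest(portions: List[str], rss_xml: str, store: List[str]) -> List[str]:
    """Retrieve and store the portions selected with the feed; return them."""
    selected = select_portions(portions, feed_excerpts(rss_xml))
    store.extend(selected)
    return selected
```

Called with a list of page portions, the feed XML, and an empty list as the store, harvest leaves navigation or advertising text out of the store as long as the feed does not quote it. A real harvester would likely need fuzzier matching than plain substring tests.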
227 Citations
19 Claims
2. A computer-implemented method for automatically mining data, the computer-implemented method comprising:
identifying, in a portion of a page which includes the data to be mined;
identifying, a portion of a linked page as having data to be mined, wherein the linked page is accessible from the portion of the page;
wherein, the identifying portions of the page and the linked page includes;
using a feed representation that has been created for the page by another entity;
wherein, the portions of the page and the linked page selected as having the data to be mined include content that is referenced in the feed representation;
retrieving and storing, the portions of the page and the linked page having the data to be mined;
data mining the portions of the page and the linked page that have been retrieved and stored;
identifying a first portion of a second page that is referenced in the feed representation and a second portion of the second page that is not referenced in the feed representation;
retrieving and storing the second portion and not the first portion of the third page.
3. A computer-implemented method for automatically mining data, the computer-implemented method comprising:
identifying a portion of a page which includes the data to be mined;
wherein, the portion of the page is selected as having the data to be mined based on determining whether various portions of the page include content that is referenced in a feed representation associated with the page;
identifying a second page linked to the portion of the page;
further selecting a portion of the second page which includes the data to be mined, wherein the portion of the second page that is selected includes content also referenced in the feed representation;
retrieving and storing, the portion of the page and the portion of the second page linked to the page, both portions being identified as having the data to be mined;
data mining the portion of the page and the portion of the second page linked to the page that have been retrieved and stored;
identifying a first portion of a third page that is referenced in the feed representation and a second portion of the third page that is not referenced in the feed representation;
retrieving and storing the second portion and not the first portion of the third page.
Dependent claims: 4, 5, 6, 7, 8, 9, 10, 11, 12
13. A computer-implemented method for automatically mining data, the computer-implemented method comprising:
identifying a first page as a target for content retrieval, the first page including first content portions;
identifying a feed representation that has been created for the first page, the identified feed representation including multiple feed entries each corresponding to some of the first content portions;
identifying a second page linked to at least one of the first content portions of the first page;
identifying the second page as a target for content retrieval, wherein, the second page includes second content portions;
matching content of the multiple feed entries to the first content portions of the first page and the second content portions of the second page;
retrieving those of the first and second content portions of the first and second pages based on the results of the matching;
and storing those of the first and second content portions that are retrieved based on the results of the matching;
performing additional processing, including data mining, on those of the first and second content portions that are stored;
identifying a third content portion of a third page that match the content of at least one of the multiple feed entries and a fourth content portion of the third page that does not match any of the multiple feed entries;
retrieving and storing the fourth content portion and not the third content portion of the third page.
Dependent claims: 14, 15, 16, 17, 18
19. A system for automatically mining data of a webpage, the system comprising:
a first device comprising a processor and memory, cooperating to function as;
a first identifying unit configured to identify a portion of a webpage which includes data to be mined;
a second identifying unit configured to identify another webpage that is linked to the portion of the webpage as a linked webpage;
a third identifying unit configured to identify a portion of the linked webpage which also includes the data to be mined using a feed representation associated with the webpage, and further identify the portion of the webpage and the portion of the linked webpage;
an analyzing unit configured to analyze the webpage and the linked webpage using the feed representation;
wherein, based on the analysis, the first device identifies the portions of the webpage and the linked webpage as having the data to be mined, based on whether or not the portions include content that is referenced in the feed representation;
a retrieving unit configured to retrieve and store, the portions from the webpage and the linked webpage including the data to be mined;
a mining unit configured to data mine the portions of the webpage and the linked webpage that have been retrieved and stored;
a second device comprising a processor and memory, cooperating to function as;
a providing unit configured to provide the webpage, the linked webpage and the feed representation for the webpage;
a fourth identifying unit configured to identify a first portion of a second webpage that is referenced in the feed representation and a second portion of the second webpage that is not referenced in the feed representation;
a second retrieving unit configured to retrieve and store the second portion and not the first portion of the second webpage.
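The independent claims above share two recurring steps: sorting a page's portions by whether the feed references them, and following links out of the referenced portions to a further page. The sketch below illustrates one possible reading of those steps. It rests on stated assumptions: portions arrive pre-extracted as text plus outgoing links, and "matching" is plain substring containment. Portion, partition_by_feed, and linked_pages are hypothetical names, not terms from the patent.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Portion:
    """One block of page content plus the URLs it links to (hypothetical model)."""
    text: str
    links: List[str] = field(default_factory=list)


def _norm(text: str) -> str:
    """Lowercase and collapse whitespace before comparing."""
    return " ".join(text.lower().split())


def partition_by_feed(
    portions: List[Portion], feed_entries: List[str]
) -> Tuple[List[Portion], List[Portion]]:
    """Split portions into (referenced, unreferenced) relative to the feed entries.

    Assumption: a portion is 'referenced' if it contains any feed entry's text.
    """
    entries = [_norm(e) for e in feed_entries]
    referenced, unreferenced = [], []
    for portion in portions:
        body = _norm(portion.text)
        if any(entry in body for entry in entries):
            referenced.append(portion)
        else:
            unreferenced.append(portion)
    return referenced, unreferenced


def linked_pages(referenced: List[Portion]) -> List[str]:
    """Return the distinct URLs linked from referenced portions, in first-seen order."""
    seen: List[str] = []
    for portion in referenced:
        for url in portion.links:
            if url not in seen:
                seen.append(url)
    return seen
```

Under this reading, the referenced list of the first page and of each linked page is what gets retrieved, stored, and mined. For the further page recited in the final two claim elements, the unreferenced list is the one that is stored.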
Specification