ADAPTIVE GATHERING OF STRUCTURED AND UNSTRUCTURED DATA SYSTEM AND METHOD
Abstract
Content is obtained from a webpage accessed via a URI, which URI is obtained from a URI queue. The content is parsed for price and product information according to a parse map, with the resulting parse result being stored. The priority of URIs in the URI queue is adjusted based on analysis of the parse result for changes in price and product attributes and according to other criteria. The parse map may be one associated with the URI or a general-purpose parse map. The parse result may be validated by human- and machine-based systems, including by graphically labeling price and product information in the content for human confirmation or correction.
22 Claims
1. A computer implemented method of obtaining information from a webserver, the method comprising:
obtaining a first URI from a prioritized URI queue;
utilizing the first URI at a first URI access time to request first content from the webserver;
parsing the first content a first time for first price and product information and saving the result as a first parse result;
utilizing the first URI at a second URI access time to request second content from the webserver;
parsing the second content for second price and product information, and saving the result as a second parse result; and
determining that the first parse result is different than the second parse result and setting a time for accessing the first URI in the prioritized URI queue based on the difference.
(Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10)

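The rescheduling step of claim 1 could be sketched as follows. This is only an illustration under stated assumptions: the claim says the next access time is set "based on the difference" but does not give a rule, so the halving and doubling policy, the interval bounds, and the names `next_interval` and `reschedule` are all hypothetical.

```python
import heapq

MIN_INTERVAL = 3600.0        # hypothetical floor: one hour
MAX_INTERVAL = 7 * 86400.0   # hypothetical ceiling: one week


def next_interval(first_result, second_result, current_interval):
    """Return the delay (in seconds) before the URI is accessed again."""
    if first_result != second_result:
        # The two parse results differ, so check this URI sooner (assumed policy).
        return max(MIN_INTERVAL, current_interval / 2)
    # No change detected, so back off (assumed policy, not stated in the claim).
    return min(MAX_INTERVAL, current_interval * 2)


def reschedule(queue, uri, now, first_result, second_result, current_interval):
    """Push the URI back onto a heap-based queue keyed by next access time."""
    interval = next_interval(first_result, second_result, current_interval)
    heapq.heappush(queue, (now + interval, uri))
    return interval
```

Here the prioritized URI queue is modeled as a min-heap of `(next_access_time, uri)` pairs, so the URI due soonest is always popped first.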
11. A computer implemented method of classifying a first webpage containing information regarding a product and grouping the first webpage with prior webpages containing information regarding the product, the method comprising:
obtaining a first parse result comprising a first set of price and product information parsed from the first webpage and a first identifier associated with the first webpage;
utilizing at least a first algorithm to determine a category for the first webpage from a category taxonomy;
utilizing at least a second algorithm to extract a first set of price and product attributes from the first parse result;
obtaining prior sets of price and product attributes and prior identifiers for other webpages associated with the determined category;
weighing at least one of the product attributes in the first set of price and product attributes and in the prior sets of price and product attributes heavier than other of the attributes;
clustering the weighted price and product attributes to identify webpages with similar price and product attributes;
identifying within each cluster a set of price and product attributes which shares the maximum number of weighted price and product attributes with the other sets of price and product attributes; and
assigning the identifier associated with the set of price and product attributes which shares the maximum number of weighted price and product attributes with the other sets of price and product attributes as a common identifier for all of the products in the cluster.

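The last two steps of claim 11 (finding the attribute set that shares the most weighted attributes with the others, and using its identifier for the whole cluster) might look like the sketch below. It assumes the clustering has already been done and that each page's attributes are a flat dictionary; the weight table, the attribute names, and the functions `shared_weight` and `common_identifier` are hypothetical.

```python
WEIGHTS = {"brand": 3.0, "model": 3.0}  # hypothetical: attributes weighed heavier
DEFAULT_WEIGHT = 1.0                    # hypothetical weight for all other attributes


def shared_weight(a, b):
    """Sum the weights of the attributes on which two pages agree."""
    return sum(
        WEIGHTS.get(key, DEFAULT_WEIGHT)
        for key in a.keys() & b.keys()
        if a[key] == b[key]
    )


def common_identifier(cluster):
    """cluster maps a page identifier to its attribute dict.

    Returns the identifier whose attributes share the most weight with the
    other members; ties go to the identifier that sorts first.
    """
    def score(identifier):
        attrs = cluster[identifier]
        return sum(
            shared_weight(attrs, other)
            for other_id, other in cluster.items()
            if other_id != identifier
        )

    return max(sorted(cluster), key=score)
```

For example, with three pages that all agree on brand, two that agree on model, and two that agree on color, the page sharing both the model and the color with its neighbors is selected.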
12. A computer implemented method of obtaining information from a webserver, the method comprising:
obtaining a first URI from a prioritized URI queue;
utilizing the first URI at a first URI access time to request first content from the webserver;
parsing the first content a first time for first price and product information and saving the result as a first parse result; and
determining that the first parse result does not contain price and product information and removing the first URI from the prioritized URI queue.

13. A computer implemented method of obtaining information from a webserver, the method comprising:
obtaining a first URI from a prioritized URI queue;
utilizing the first URI at a first URI access time to request first content from the webserver;
parsing the first content a first time for first price and product information and saving the result as a first parse result; and
determining whether the first parse result contains a listing webpage or a product webpage; and
if the first parse result contains a listing webpage, reducing the time to the next URI check of the first URI in the prioritized URI queue;
else increasing the time to the next URI check of the first URI in the prioritized URI queue.

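The queue adjustments in claims 12 and 13 can be illustrated together. This is a minimal sketch, assuming the queue is a dictionary from URI to seconds until the next check, that "does not contain price and product information" means neither field was parsed, and that the adjustment is a fixed multiplicative factor; none of these details come from the claims, and the function names are hypothetical.

```python
def prune_if_empty(queue, uri, parse_result):
    """Claim 12 sketch: drop the URI when no price or product data was parsed."""
    if not parse_result.get("price") and not parse_result.get("product"):
        queue.pop(uri, None)
        return True
    return False


def adjust_for_page_type(queue, uri, page_type, factor=2.0):
    """Claim 13 sketch: listing pages are rechecked sooner, other pages later.

    `factor` is an assumed tuning parameter, not something the claim specifies.
    """
    if page_type == "listing":
        queue[uri] = queue[uri] / factor
    else:
        queue[uri] = queue[uri] * factor
    return queue[uri]
```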
14. A computer implemented method of obtaining information from a webserver, the method comprising:
obtaining a first URI from a prioritized URI queue;
utilizing the first URI at a first URI access time to request first content from the webserver;
parsing the first content a first time for first price and product information according to a first parse map and saving the result as a first parse result;
determining that a data type of a price or product attribute in the parse result does not match an allowed data type; and
validating the parse map.

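The data-type check in claim 14 might be sketched as below. The table of allowed types, the attribute names, and the helper functions are assumptions made for illustration; the claim only states that a mismatch against an allowed data type triggers validation of the parse map.

```python
from decimal import Decimal, InvalidOperation


def is_decimal(value):
    """True if the value can be read as a decimal number."""
    try:
        Decimal(value)
        return True
    except (InvalidOperation, TypeError, ValueError):
        return False


# Hypothetical table of allowed data types, one checker per attribute.
ALLOWED = {
    "price": is_decimal,
    "name": lambda v: isinstance(v, str) and bool(v.strip()),
}


def mismatched_attributes(parse_result):
    """List the parsed attributes whose values fail their allowed-type check."""
    return [
        key
        for key, check in ALLOWED.items()
        if key in parse_result and not check(parse_result[key])
    ]


def needs_parse_map_validation(parse_result):
    """A mismatch suggests the parse map is pointing at the wrong element."""
    return bool(mismatched_attributes(parse_result))
```

A price field that parses as "Add to cart", for instance, would fail the decimal check and flag the parse map for review.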
15. A computer implemented method of determining a parse map for parsing price and product information from first content obtained via a first URI, the method comprising:
obtaining a first list comprising HTML and CSS elements;
obtaining a second list comprising price and product attributes, which attributes are each associated with a label;
associating at least a first element of the first list with at least a first attribute of the second list, which association is a first parse map;
obtaining first content via a first URI, which first content comprises HTML and CSS elements;
modifying the first content to graphically identify the portion of the first content encompassed by the first element with the label associated with the first attribute; and
transmitting to a second computer the modified first content.
(Dependent claims: 16, 17, 18, 19, 20)

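As a rough illustration of the labeling step in claim 15, the sketch below treats a parse map as a dictionary from a CSS class name to a label and marks matching elements with an outline and a tooltip. The class names, the styling, and the regular-expression approach are all assumptions; the pattern only matches a `class` attribute whose value is exactly the mapped class, and a real system would more likely use a proper HTML parser.

```python
import html
import re

# Hypothetical parse map: CSS class of an element -> label of the attribute.
PARSE_MAP = {"product-price": "Price", "product-title": "Product name"}


def label_content(content, parse_map):
    """Return the content with each mapped element outlined and labeled."""
    for css_class, label in parse_map.items():
        # Match an opening tag up to and including class="<css_class>".
        pattern = re.compile(r'<[a-zA-Z][^>]*\bclass="%s"' % re.escape(css_class))
        marker = ' style="outline: 2px solid red" title="%s"' % html.escape(
            label, quote=True
        )
        # Append the marker attributes right after the class attribute.
        content = pattern.sub(lambda m, marker=marker: m.group(0) + marker, content)
    return content
```

The modified content could then be sent to a reviewer, who sees at a glance which portion of the page each label was attached to.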
21. A method of adding URIs to a URI queue, practiced by a first computer comprising a memory, the method comprising:
with the first computer, receiving a base URI and sample non-product webpages, sample product pages, sample listing webpages, and sample category webpages associated with the base URI;
with the first computer, verifying that the first computer is allowed to crawl a website accessed via the base URI and downloading content from the website;
with the first computer, identifying in the content at least one of a site name, a crawl delay, URI structures associated with the listing pages, product pages, and non-product pages, URI deduplication rules for the website;
with the first computer, determining a crawling strategy as at least one of a sitemap-based crawling strategy or a wild crawl based crawling strategy; and
for each URI identified thereby, adding the identified URI to a URI queue and setting a time to next check the identified URI.

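Two steps of claim 21, verifying permission to crawl and setting a next-check time for each identified URI, could be sketched with Python's standard `urllib.robotparser`. The claim does not say that robots.txt is the mechanism used, so that choice, the user-agent string, and the helper names `crawl_policy` and `enqueue` are assumptions.

```python
from urllib.robotparser import RobotFileParser


def crawl_policy(robots_txt, base_uri, user_agent="example-bot"):
    """Return (allowed, crawl_delay) for the base URI from robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, base_uri), parser.crawl_delay(user_agent)


def enqueue(queue, uris, now, delay):
    """Add each URI to the queue, spacing the first checks by the crawl delay."""
    for position, uri in enumerate(uris):
        queue[uri] = now + position * delay
    return queue
```

Spacing the initial checks by the crawl delay is one simple way to respect the site's stated limit; the sitemap-based and wild-crawl strategies named in the claim are not modeled here.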
22. A computing apparatus for obtaining information from a webserver, the apparatus comprising a processor and a memory storing instructions that, when executed by the processor, configure the apparatus to:
obtain a first URI from a prioritized URI queue;
utilize the first URI at a first URI access time to request first content from the webserver;
parse the first content a first time for first price and product information and save the result as a first parse result;
utilize the first URI at a second URI access time to request second content from the webserver;
parse the second content for second price and product information, and save the result as a second parse result; and
determine that the first parse result is different than the second parse result and set a time for accessing the first URI in the prioritized URI queue based on the difference.

Specification