Adaptive gathering of structured and unstructured data system and method
Abstract
Content is obtained from a webpage accessed via a URI, which URI is obtained from a URI queue. The content is parsed for price and product information according to a parse map, with the resulting parse result being stored. The priority of URIs in the URI queue is adjusted based on analysis of the parse result for changes in price and product attributes and according to other criteria. The parse map may be one associated with the URI or a general-purpose parse map. The parse result may be validated by human- and machine-based systems, including by graphically labeling price and product information in the content for human confirmation or correction.
22 Claims
1. A computer implemented method of obtaining information from a webserver, the method comprising:
by a first computer processor, obtaining a first Uniform Resource Identifier (“URI”) from a prioritized URI queue;
by the first computer processor, utilizing the first URI at a first URI access time to request first content from the webserver;
by a second computer processor, parsing the first content a first time for first price and product information and saving the result as a first parse result in a first computer memory;
by the first computer processor, utilizing the first URI at a second URI access time to request second content from the webserver;
by the second computer processor, parsing the second content for second price and product information, and saving the result as a second parse result in the first computer memory; and
by the second computer processor, determining that the first parse result is different than the second parse result and setting a time for accessing the first URI in the prioritized URI queue based on the difference.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10
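The re-scheduling step of claim 1 can be illustrated with a minimal sketch. It assumes a heap-based queue ordered by next access time and a simple halve-or-double policy; the interval bounds, the `QueueEntry` type, the example URI, and the behavior when nothing changed are all illustrative assumptions, not details taken from the claim.

```python
import heapq
from dataclasses import dataclass, field

MIN_INTERVAL = 3600.0        # assumed floor: one hour
MAX_INTERVAL = 7 * 86400.0   # assumed ceiling: one week


@dataclass(order=True)
class QueueEntry:
    """Hypothetical queue entry, ordered only by its next access time."""
    next_access: float
    uri: str = field(compare=False)
    interval: float = field(compare=False, default=86400.0)


def changed_fields(first: dict, second: dict) -> set:
    """Return the attribute names whose values differ between two parse results."""
    keys = first.keys() | second.keys()
    return {k for k in keys if first.get(k) != second.get(k)}


def reschedule(entry: QueueEntry, first: dict, second: dict, now: float) -> QueueEntry:
    """Set the next access time from the difference between two parse results."""
    if changed_fields(first, second):
        # Assumed policy: revisit sooner when the page changed.
        interval = max(MIN_INTERVAL, entry.interval / 2)
    else:
        # Assumed policy (not in the claim): back off when nothing changed.
        interval = min(MAX_INTERVAL, entry.interval * 2)
    return QueueEntry(next_access=now + interval, uri=entry.uri, interval=interval)


queue = []
heapq.heappush(queue, QueueEntry(0.0, "https://shop.example/item/1"))
entry = heapq.heappop(queue)

first_parse = {"price": "19.99", "title": "Widget"}
second_parse = {"price": "17.99", "title": "Widget"}

updated = reschedule(entry, first_parse, second_parse, now=1000.0)
heapq.heappush(queue, updated)
```

Here the price changed between the two accesses, so the assumed policy halves the revisit interval before the entry is pushed back onto the queue.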
11. A computer implemented method of classifying a first webpage containing information regarding a product and grouping the first webpage with prior webpages containing information regarding the product, the method comprising:
by a first computer processor, obtaining a first parse result comprising a first set of price and product information parsed from the first webpage and a first identifier associated with the first webpage;
by the first computer processor, utilizing at least a first algorithm to determine a category for the first webpage from a category taxonomy;
by the first computer processor, utilizing at least a second algorithm to extract a first set of price and product attributes from the first parse result;
by the first, second, or third computer processor, obtaining prior sets of price and product attributes and prior identifiers for other webpages associated with the determined category;
by the first computer processor, weighing at least one of the product attributes in the first set of price and product attributes and in the prior sets of price and product attributes heavier than other of the attributes;
by the first computer processor, clustering the weighted price and product attributes to identify webpages with similar price and product attributes;
by the first computer processor, identifying within each cluster a set of price and product attributes which shares the maximum number of weighted price and product attributes with the other sets of price and product attributes; and
by the first computer processor, assigning the identifier associated with the set of price and product attributes which shares the maximum number of weighted price and product attributes with the other sets of price and product attributes as a common identifier for all of the products in the cluster.
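The weighting, clustering, and common-identifier steps of claim 11 can be sketched as follows. This is only an illustration: the greedy single-pass clustering, the similarity threshold, the attribute names, and the weights are assumptions, and the category-determination step is left out entirely.

```python
def weighted_overlap(a: dict, b: dict, weights: dict) -> float:
    """Sum the weights of the attributes on which two attribute sets agree."""
    return sum(weights.get(k, 1.0) for k in a if k in b and a[k] == b[k])


def cluster(pages: dict, weights: dict, threshold: float) -> list:
    """Greedy clustering: join the first cluster whose seed page is similar enough."""
    clusters = []
    for page_id, attrs in pages.items():
        for members in clusters:
            seed = pages[members[0]]
            if weighted_overlap(attrs, seed, weights) >= threshold:
                members.append(page_id)
                break
        else:
            clusters.append([page_id])
    return clusters


def common_identifier(members: list, pages: dict, weights: dict) -> str:
    """Pick the member sharing the most weighted attributes with the other members."""
    def score(pid):
        return sum(
            weighted_overlap(pages[pid], pages[other], weights)
            for other in members
            if other != pid
        )
    return max(members, key=score)


# Hypothetical attribute sets, keyed by page identifier.
pages = {
    "p1": {"brand": "Acme", "model": "X100", "color": "red", "price": "19.99"},
    "p2": {"brand": "Acme", "model": "X100", "color": "blue", "price": "19.99"},
    "p3": {"brand": "Acme", "model": "X100", "color": "red", "price": "18.49"},
    "p4": {"brand": "Bolt", "model": "Z9", "color": "red", "price": "5.00"},
}
# Assumed weights: brand and model count for more than other attributes.
weights = {"brand": 2.0, "model": 3.0}

clusters = cluster(pages, weights, threshold=5.0)
labels = {
    pid: common_identifier(members, pages, weights)
    for members in clusters
    for pid in members
}
```

With these inputs the three Acme pages fall into one cluster and all receive `p1` as their common identifier, because `p1` shares the most weighted attributes with the other two.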
12. A computer implemented method of obtaining information from a webserver, the method comprising:
by a first computer processor, obtaining a first Uniform Resource Identifier (“URI”) from a prioritized URI queue;
by the first computer processor, utilizing the first URI at a first URI access time to request first content from the webserver;
by a second computer processor, parsing the first content a first time for first price and product information and saving the result as a first parse result in a first computer memory; and
by the second computer processor, determining that the first parse result does not contain price and product information and removing the first URI from the prioritized URI queue.
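A minimal sketch of the pruning step in claim 12 follows. It assumes the queue is a plain mapping from URI to next access time and that "does not contain price and product information" means none of a few expected keys holds a value; both are simplifying assumptions, as are the key names.

```python
# Assumed keys that count as price and product information.
PRICE_PRODUCT_KEYS = ("price", "title", "sku")


def has_price_and_product_info(parse_result: dict) -> bool:
    """True if at least one expected key holds a non-empty value."""
    return any(parse_result.get(key) for key in PRICE_PRODUCT_KEYS)


def prune(queue: dict, uri: str, parse_result: dict) -> bool:
    """Remove uri from the queue when its parse result is empty; return True if removed."""
    if has_price_and_product_info(parse_result):
        return False
    queue.pop(uri, None)
    return True


queue = {
    "https://shop.example/about": 100.0,
    "https://shop.example/p/1": 200.0,
}
removed_about = prune(queue, "https://shop.example/about", {"price": None, "title": ""})
removed_product = prune(queue, "https://shop.example/p/1", {"price": "19.99"})
```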
13. A computer implemented method of obtaining information from a webserver, the method comprising:
by a first computer processor, obtaining a first Uniform Resource Identifier (“URI”) from a prioritized URI queue;
by the first computer processor, utilizing the first URI at a first URI access time to request first content from the webserver;
by a second computer processor, parsing the first content a first time for first price and product information and saving the result as a first parse result in a first computer memory; and
by the second computer processor, determining whether the first parse result contains a listing webpage or a product webpage; and
if the first parse result contains a listing webpage, by the second computer processor, reducing the time to the next URI check of the first URI in the prioritized URI queue;
else increasing the time to the next URI check of the first URI in the prioritized URI queue.
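The branch in claim 13 can be sketched as below. How a listing page is told apart from a product page is not stated in the claim; this sketch assumes a parse result carries an `items` list and treats more than one item as a listing. The scaling factor is likewise an assumption.

```python
def page_kind(parse_result: dict) -> str:
    """Assumed heuristic: more than one parsed item means a listing page."""
    items = parse_result.get("items", [])
    return "listing" if len(items) > 1 else "product"


def next_check_interval(current: float, parse_result: dict, factor: float = 2.0) -> float:
    """Shorten the interval for listing pages, lengthen it otherwise."""
    if page_kind(parse_result) == "listing":
        return current / factor
    return current * factor


listing = {"items": [{"title": "A"}, {"title": "B"}, {"title": "C"}]}
product = {"items": [{"title": "A", "price": "19.99"}]}

listing_interval = next_check_interval(86400.0, listing)
product_interval = next_check_interval(86400.0, product)
```

Listing pages tend to surface new product URIs, which is one plausible reason to revisit them more often than individual product pages.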
14. A computer implemented method of obtaining information from a webserver, the method comprising:
by a first computer processor, obtaining a first Uniform Resource Identifier (“URI”) from a prioritized URI queue;
by the first computer processor, utilizing the first URI at a first URI access time to request first content from the webserver;
by a second computer processor, parsing the first content a first time for first price and product information according to a first parse map and saving the result as a first parse result in a first computer memory;
by the second computer processor, determining that a data type of a price or product attribute in the parse result does not match an allowed data type; and
by the second computer processor, validating the parse map.
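The data-type check in claim 14 can be sketched as a table of per-attribute validators. The attribute names and the rules for what counts as an allowed type are assumptions; the sketch only flags a parse map for validation and does not perform the validation itself.

```python
from decimal import Decimal, InvalidOperation


def is_price(value) -> bool:
    """True if the value parses as a finite decimal number."""
    try:
        return Decimal(str(value)).is_finite()
    except InvalidOperation:
        return False


def is_text(value) -> bool:
    """True if the value is a non-blank string."""
    return isinstance(value, str) and bool(value.strip())


# Assumed allowed data types, expressed as validator functions.
ALLOWED_TYPES = {"price": is_price, "title": is_text}


def mismatched_attributes(parse_result: dict) -> list:
    """Return the attributes whose values fail their allowed-type check."""
    return sorted(
        name
        for name, check in ALLOWED_TYPES.items()
        if name in parse_result and not check(parse_result[name])
    )


def needs_parse_map_validation(parse_result: dict) -> bool:
    """A type mismatch suggests the parse map may be selecting the wrong element."""
    return bool(mismatched_attributes(parse_result))


good = {"price": "19.99", "title": "Widget"}
bad = {"price": "Add to cart", "title": "Widget"}
```

In the `bad` example the price field holds button text, a typical symptom of a parse map that points at the wrong element after a page redesign.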
15. A computer implemented method of determining a parse map for parsing price and product information from first content obtained via a first Uniform Resource Identifier (“URI”), the method comprising;
by a first computer processor, obtaining a first list comprising HyperText Markup Language (“HTML”) and Cascading Style Sheet (“CSS”) elements;
by the first computer processor, obtaining a second list comprising price and product attributes, which attributes are each associated with a label;
by the first computer processor, associating at least a first element of the first list with at least a first attribute of the second list, which association is a first parse map;
by a second computer processor, obtaining first content via a first URI, which first content comprises HTML and CSS elements;
by the first computer processor, modifying the first content to graphically identify the portion of the first content encompassed by the first element with the label associated with the first attribute; and
transmitting to a third computer processor the modified first content.
Dependent claims: 16, 17, 18, 19, 20
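The labeling step of claim 15 can be sketched with Python's standard `html.parser` module. The sketch assumes the parse map keys on CSS class names and that "graphically identify" can be approximated by an outline style plus a `data-label` attribute; the `PARSE_MAP` contents, the `Labeler` class, and the highlight style are all hypothetical. Comments, doctypes, and script or style content are not handled.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical parse map: CSS class name -> attribute label.
PARSE_MAP = {"price": "Price", "product-title": "Product name"}
HIGHLIGHT = "outline: 2px solid red"  # assumed visual marker


class Labeler(HTMLParser):
    """Re-emit HTML, outlining mapped elements and tagging them with their label."""

    def __init__(self, parse_map):
        super().__init__(convert_charrefs=True)
        self.parse_map = parse_map
        self.out = []

    def _render(self, tag, attrs, closing=">"):
        attrs = list(attrs)
        classes = (dict(attrs).get("class") or "").split()
        label = next((self.parse_map[c] for c in classes if c in self.parse_map), None)
        if label is not None:
            # Merge with any existing inline style rather than duplicating the attribute.
            style = dict(attrs).get("style") or ""
            attrs = [(k, v) for k, v in attrs if k != "style"]
            attrs.append(("data-label", label))
            attrs.append(("style", (style + "; " if style else "") + HIGHLIGHT))
        rendered = "".join(
            f" {k}" if v is None else f' {k}="{escape(v, quote=True)}"'
            for k, v in attrs
        )
        return f"<{tag}{rendered}{closing}"

    def handle_starttag(self, tag, attrs):
        self.out.append(self._render(tag, attrs))

    def handle_startendtag(self, tag, attrs):
        self.out.append(self._render(tag, attrs, " />"))

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(escape(data, quote=False))


def label_content(html_text: str, parse_map: dict) -> str:
    """Return a copy of html_text with mapped elements visually marked."""
    parser = Labeler(parse_map)
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)


page = '<div><h1 class="product-title">Widget</h1><span class="price">$19.99</span></div>'
labeled = label_content(page, PARSE_MAP)
```

The modified content could then be sent to a reviewer, who sees each outlined region alongside its label and can confirm or correct the association.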
21. A method of adding Uniform Resource Identifiers (“URIs”) to a URI queue, practiced by a first computer comprising a processor and a memory, the method comprising;
with the first computer processor, receiving a base URI and sample non-product webpages, sample product pages, sample listing webpages, and sample category webpages associated with the base URI;
with the first computer processor, verifying that the first computer is allowed to crawl a website accessed via the base URI and downloading content from the website;
with the first computer processor, identifying in the content at least one of a site name, a crawl delay, URI structures associated with the listing pages, product pages, and non-product pages;
with the first computer processor, determining a crawling strategy as at least one of a sitemap-based crawling strategy or a wild crawl based crawling strategy; and
for each URI identified thereby, with the first computer processor, adding the identified URI to a URI queue and setting a time to next check the identified URI.
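The permission check, crawl-delay lookup, and strategy choice in claim 21 can be sketched with the standard `urllib.robotparser` module. The sketch assumes that permission is read from robots.txt, that a listed sitemap implies the sitemap-based strategy, and that next-check times are spaced by the crawl delay; the user agent name and example URIs are hypothetical. It parses robots.txt lines passed in directly, so it needs no network access, and `site_maps()` requires Python 3.8 or later.

```python
import heapq
from urllib.robotparser import RobotFileParser

USER_AGENT = "example-bot"  # hypothetical crawler name


def plan_crawl(base_uri, robots_lines):
    """Check crawl permission and pick a strategy from robots.txt content."""
    robots = RobotFileParser()
    robots.parse(robots_lines)
    if not robots.can_fetch(USER_AGENT, base_uri):
        return None  # not allowed to crawl this site
    sitemaps = robots.site_maps() or []
    return {
        "robots": robots,
        "crawl_delay": robots.crawl_delay(USER_AGENT) or 0,
        # Assumed rule: prefer sitemap-based crawling when a sitemap is listed.
        "strategy": "sitemap" if sitemaps else "wild",
        "sitemaps": sitemaps,
    }


def enqueue(queue, plan, uris, now):
    """Add each permitted URI with a next-check time spaced by the crawl delay."""
    allowed = [u for u in uris if plan["robots"].can_fetch(USER_AGENT, u)]
    for i, uri in enumerate(allowed):
        heapq.heappush(queue, (now + i * plan["crawl_delay"], uri))
    return queue


robots_txt = [
    "User-agent: *",
    "Crawl-delay: 5",
    "Disallow: /cart",
    "Sitemap: https://shop.example/sitemap.xml",
]
plan = plan_crawl("https://shop.example/", robots_txt)
queue = enqueue(
    [],
    plan,
    [
        "https://shop.example/p/1",
        "https://shop.example/cart",
        "https://shop.example/p/2",
    ],
    now=100.0,
)
```

The disallowed `/cart` URI is skipped, and the two product URIs are queued five seconds apart to respect the declared crawl delay.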
22. A computing apparatus for obtaining information from a webserver, the apparatus comprising a processor and a memory storing instructions that, when executed by the processor, configure the apparatus to:
obtain, by the processor, a first Uniform Resource Identifier (“URI”) from a prioritized URI queue;
utilize, by the processor, the first URI at a first URI access time to request first content from the webserver;
parse, by the processor, the first content a first time for first price and product information and save the result as a first parse result;
utilize, by the processor, the first URI at a second URI access time to request second content from the webserver;
parse, by the processor, the second content for second price and product information, and save the result as a second parse result; and
determine, by the processor, that the first parse result is different than the second parse result and set a time for accessing the first URI in the prioritized URI queue based on the difference.
Specification