System and method for extracting structured data from classified websites
First Claim
1. A computer implemented method of automatically extracting data from a classified website comprising:
on a server system having one or more processors and memory storing one or more programs for execution by the one or more processors:
determining that a website is an area specific classified website based at least in part upon determining that the website is geographically localized;
accessing page models for other classified websites;
identifying a listing page in the classified website based on similarity of the listing page to the page models;
creating a listing page model for the listing page comprising:
identifying one or more dynamic regions within the listing page;
determining a type of information associated with a respective dynamic region of the one or more identified dynamic regions;
creating a listing page template that identifies the one or more dynamic regions and their associated type of information; and
storing the listing page template;
extracting data from the classified website based at least in part on the listing page model; and
saving the extracted data in a database responsive to a classified site query by a user.
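The model-creation steps recited above (identify dynamic regions, determine each region's type of information, build and store a template, then extract) can be illustrated with a minimal sketch. It is not the patented implementation. It assumes each listing has already been segmented into an aligned list of text fields, treats a field as "dynamic" when its text differs across sample listings, and guesses types with a few hypothetical regular expressions. The names `TYPE_PATTERNS`, `find_dynamic_regions`, `infer_type`, `build_template`, and `extract` are invented for illustration.

```python
import re
from typing import Dict, List

# Hypothetical type detectors. The claim does not say how a region's
# type of information is determined; regexes are an assumption here.
TYPE_PATTERNS = [
    ("price", re.compile(r"^\$\d[\d,]*(\.\d{2})?$")),
    ("date", re.compile(r"^\d{4}-\d{2}-\d{2}$")),
    ("phone", re.compile(r"^\(?\d{3}\)?[ -]?\d{3}-\d{4}$")),
]


def find_dynamic_regions(samples: List[List[str]]) -> List[int]:
    """Return field indices whose text differs across the sample listings."""
    width = min(len(sample) for sample in samples)
    return [i for i in range(width) if len({s[i] for s in samples}) > 1]


def infer_type(values: List[str]) -> str:
    """Pick the first pattern that matches every observed value, else 'text'."""
    for name, pattern in TYPE_PATTERNS:
        if all(pattern.match(value) for value in values):
            return name
    return "text"


def build_template(samples: List[List[str]]) -> Dict[int, str]:
    """Map each dynamic field index to its inferred type of information."""
    return {
        i: infer_type([sample[i] for sample in samples])
        for i in find_dynamic_regions(samples)
    }


def extract(record: List[str], template: Dict[int, str]) -> Dict[str, str]:
    """Apply a stored template to one listing, keeping only dynamic fields."""
    return {f"{kind}_{i}": record[i] for i, kind in template.items()}
```

Fields that are identical in every sample (labels such as "Price:") are treated as static page furniture and left out of the template, so only the varying fields are extracted.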
Abstract
Systems, methods, and computer readable storage media are provided for automatically extracting data from a classified website. A website is determined to be a classified website based on a set of heuristics. Page models for other classified websites are then accessed. The page models may include listing page models, detail page models, and/or city page models. A listing page in the classified website is identified based on its similarity to the page models for the other classified websites. A listing page model is then created for that listing page. Once the model has been created, data is extracted from the classified website based at least in part on the listing page model. Similar processes are performed for identifying a details page, creating a details page model, and extracting data from the classified website using the details page model.
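The abstract does not say how similarity between a candidate page and the stored page models is measured. As a rough sketch only, the example below assumes each page model is a count of HTML start tags and compares pages by cosine similarity, returning the best-matching model label when the score clears a threshold. The names `TagCounter`, `tag_profile`, `cosine`, and `classify_page`, and the `0.8` threshold, are illustrative assumptions rather than anything stated in the patent.

```python
from collections import Counter
from html.parser import HTMLParser
from math import sqrt
from typing import Dict, Optional


class TagCounter(HTMLParser):
    """Counts start tags as a crude structural signature of a page."""

    def __init__(self) -> None:
        super().__init__()
        self.counts: Counter = Counter()

    def handle_starttag(self, tag, attrs):
        self.counts[tag] += 1


def tag_profile(html: str) -> Counter:
    """Build a tag-count profile (an assumed page-model representation)."""
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    return parser.counts


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two tag-count profiles."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def classify_page(
    html: str, page_models: Dict[str, Counter], threshold: float = 0.8
) -> Optional[str]:
    """Return the label of the most similar page model, or None if below threshold."""
    profile = tag_profile(html)
    best_label, best_score = None, 0.0
    for label, model in page_models.items():
        score = cosine(profile, model)
        if score > best_score:
            best_label, best_score = label, score
    return best_label if best_score >= threshold else None
```

In this sketch, `page_models` would hold one profile per known page kind (for example listing, detail, and city pages drawn from other classified websites), and a page of a new site is labeled a listing page when the listing profile is its closest match.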
24 Claims
1. A computer implemented method of automatically extracting data from a classified website (recited in full under First Claim above).
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10
11. A server system, for automatically extracting data from a classified website, comprising:
one or more processors; and
memory storing one or more programs to be executed by the one or more processors;
the one or more programs comprising instructions for:
determining that a website is an area specific classified website based at least in part upon determining that the website is geographically localized;
accessing page models for other classified websites;
identifying a listing page in the classified website based on similarity of the listing page to the page models;
creating a listing page model for the listing page comprising:
identifying one or more dynamic regions within the listing page;
determining a type of information associated with a respective dynamic region of the one or more identified dynamic regions;
creating a listing page template that identifies the one or more dynamic regions and their associated type of information; and
storing the listing page template;
extracting data from the classified website based at least in part on the listing page model; and
saving the extracted data in a database responsive to a classified site query by a user.
Dependent claims: 12, 13, 14, 15, 16, 17
18. A non-transitory computer readable storage medium storing one or more programs configured for execution by a computer, the one or more programs comprising instructions for:
determining that a website is an area specific classified website based at least in part upon determining that the website is geographically localized;
accessing page models for other classified websites;
identifying a listing page in the classified website based on similarity of the listing page to the page models;
creating a listing page model for the listing page comprising:
identifying one or more dynamic regions within the listing page;
determining a type of information associated with a respective dynamic region of the one or more identified dynamic regions;
creating a listing page template that identifies the one or more dynamic regions and their associated type of information; and
storing the listing page template;
extracting data from the classified website based at least in part on the listing page model; and
saving the extracted data in a database responsive to a classified site query by a user.
Dependent claims: 19, 20, 21, 22, 23, 24
Specification