Method and system for identifying targeted data on a web page
Abstract
A method and system is provided that in a fully automated manner crawls web sites and identifies specific types of web pages, then extracts targeted data from those web pages. One or more text nodes containing product-related information on a first web page are first identified, and the locations of those text nodes are described using one or more vectors. The vectors are then analyzed to identify one or more patterns and to generate a model from those patterns that discriminates between text nodes that contain product-related information and text nodes that do not contain product-related information on a second web page. The model can then be used to crawl web sites to identify and extract targeted data, or the model can be installed on a user's computer to identify and extract targeted information from web sites as the user is browsing.
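As a rough illustration of the first two steps in the abstract (finding text nodes and describing their locations with vectors), the sketch below walks an HTML document and records, for each non-empty text node, the chain of open tags leading to it. The tag-path encoding, the class name TextNodeLocator, and the helper locate_text_nodes are assumptions made for this example; the abstract does not say how the vectors are actually constructed.

```python
from html.parser import HTMLParser

# Tags that never take a closing tag; they must not be pushed on the path stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TextNodeLocator(HTMLParser):
    """Collects (location_vector, text) pairs for every non-empty text node.

    The location vector here is simply the tuple of open tags from the
    root down to the text node (a hypothetical encoding).
    """

    def __init__(self):
        super().__init__()
        self.path = []    # stack of currently open tags, root first
        self.nodes = []   # list of (tuple_of_tags, text)

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.path.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if tag in self.path:
            while self.path.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.nodes.append((tuple(self.path), text))


def locate_text_nodes(html):
    """Return the (location_vector, text) pairs found in an HTML string."""
    parser = TextNodeLocator()
    parser.feed(html)
    parser.close()
    return parser.nodes


if __name__ == "__main__":
    page = ("<html><body><div><h1>Acme Kettle</h1><span>$19.99</span></div>"
            "<p>About us</p></body></html>")
    for vector, text in locate_text_nodes(page):
        print(vector, text)
```

A tag path is only one possible location descriptor; sibling indices or attribute values could be added to the tuple without changing the overall shape of the sketch.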
21 Claims
1. A method for identifying product-related information on a web page, the method comprising:
a. identifying one or more text nodes containing product-related information on a first web page;
b. using one or more vectors to describe the locations of the text nodes containing potential product-related information on the first web page;
c. analyzing one or more of the vectors to identify one or more patterns; and
d. generating a model that discriminates between text nodes that contain product-related information and text nodes that do not contain product-related information on a second web page.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11
12. A system comprising:
a first computer having a first computer-readable medium containing a copy of source code for a first web page, one or more first computer programs configured to parse the copy of the source code to identify all text nodes and analyze the text nodes to identify any text nodes that contain product-related information;
one or more second computer programs configured to generate vectors describing the location of the text nodes containing product-related information, analyze one or more of the vectors to identify one or more patterns and generate one or more models that discriminate between text nodes that contain product-related information and text nodes that do not contain product-related information on a second web page; and
a second computer coupled to the first computer having a second computer-readable medium, wherein the one or more models are transmitted to the second computer, stored in the second computer-readable medium, and used to identify and extract information about one or more products available for sale on one or more merchant web pages.
Dependent claims: 13, 14, 15, 16, 17, 18, 19, 20
21. A method for identifying and extracting product-related information from a web page, the method comprising:
a. locating potential product-related text nodes on a first web page;
b. creating a representation space that describes the potential product-related text nodes on the first web page;
c. analyzing the representation space to identify one or more patterns;
d. using the patterns to generate one or more models that discriminate between product-related text nodes and non product-related text nodes on a second web page.
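The pattern-analysis and model-generation steps recited in the claims can be pictured with a deliberately simple sketch. It assumes each text node on the first page has already been reduced to a location vector (here a tuple of tag names) and labeled as product-related or not. The "pattern" is a location seen more often with product text than without, and the "model" is the set of such locations. The names generate_model and classify, the tuple encoding, and the counting rule are all illustrative assumptions; the claims do not prescribe a particular learning technique.

```python
from collections import Counter


def generate_model(labeled_nodes):
    """Build a location-based model from labeled text nodes of a first page.

    labeled_nodes: iterable of (location_vector, is_product) pairs, where
    location_vector is a tuple of tag names (a hypothetical encoding).
    Returns the set of location vectors that occur more often on
    product-related nodes than on other nodes.
    """
    product_counts = Counter()
    other_counts = Counter()
    for vector, is_product in labeled_nodes:
        if is_product:
            product_counts[vector] += 1
        else:
            other_counts[vector] += 1
    return {v for v, n in product_counts.items() if n > other_counts[v]}


def classify(model, vectors):
    """Flag each location vector from a second page as product-related or not."""
    return [v in model for v in vectors]


if __name__ == "__main__":
    first_page = [
        (("html", "body", "div", "h1"), True),
        (("html", "body", "div", "span"), True),
        (("html", "body", "p"), False),
        (("html", "body", "div", "span"), True),
    ]
    model = generate_model(first_page)
    second_page = [
        ("html", "body", "div", "span"),
        ("html", "body", "p"),
        ("html", "body", "footer"),
    ]
    print(classify(model, second_page))  # [True, False, False]
```

Exact-match lookup on whole paths is brittle across differently structured sites; a practical system would likely generalize the patterns, for example by matching path suffixes or by training a statistical classifier on the vectors.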
Specification