Method and apparatus for improved web scraping
First Claim
1. A method for improved web scraping, comprising the steps of:
obtaining a results page for a given web site/query;
determining whether the source of said results was previously requested;
IF said source was previously requested, THEN retrieving known links from database;
comparing said known links to links on said results page;
determining whether “N” good links have been found;
IF said “N” good links have been found, THEN identifying said “N” good links;
building a stack of potential “begin hits” HTML tags and strings for each of selections “1” through “N”;
comparing entries of said stack to find “best” combination of said “begin hits” HTML tags and strings;
writing to and updating configuration file so as to terminate process;
OTHERWISE;
returning to said step of parsing said results page to identify all links;
OTHERWISE;
parsing said results page to identify all links;
presenting list of said links to user;
manually selecting “N” good links; and
returning to said step of identifying said “N” good links.
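
The "begin hits" steps of the claim can be illustrated with a minimal Python sketch. It is not the patented implementation. It assumes that a "begin hit" is a string of HTML that immediately precedes a good link on the results page, and that the "best" combination is the longest such string shared by all N good links. The function names `begin_hit_candidates` and `best_begin_hit` and the `max_len` parameter are hypothetical.

```python
from __future__ import annotations

from collections import Counter


def begin_hit_candidates(page: str, link: str, max_len: int = 40) -> list[str]:
    """Return the strings that immediately precede `link` in `page`, longest first."""
    pos = page.find(link)
    if pos < 0:
        return []
    context = page[max(0, pos - max_len):pos]
    # Every suffix of the preceding context is a potential "begin hit".
    return [context[i:] for i in range(len(context))]


def best_begin_hit(page: str, good_links: list[str], max_len: int = 40) -> str | None:
    """Pick the longest candidate shared by all N good links.

    Assumption: "best" means "longest common"; the claim does not define it.
    """
    # One stack entry per good link, selections 1 through N.
    stack = [begin_hit_candidates(page, link, max_len) for link in good_links]
    if not stack or any(not entry for entry in stack):
        return None
    counts = Counter(c for entry in stack for c in set(entry))
    shared = [c for c, n in counts.items() if n == len(stack)]
    return max(shared, key=len) if shared else None
```

Under these assumptions, the string returned by `best_begin_hit` is the kind of value that would be written to the configuration file for the site in question.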
Abstract
Method and apparatus to enable the parser component of a web search engine to adapt in response to frequent web page format changes at web sites. The parser “learns,” from a set of defined HTTP links, how to find and parse web pages returned from a search engine query. The invention intelligently locates various tokens/strings that will correctly extract attributes associated with the returned item. The present invention may operate either automatically or in a user-assisted fashion.
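
As an illustration of how a located token/string might be used to extract links from a results page, the following Python sketch scans a page for each occurrence of a learned begin string and collects the text up to a closing delimiter. It is a sketch under stated assumptions, not the method of the specification: the function name `extract_links` and the `end_hit` parameter are hypothetical, since the source names only “begin hits.”

```python
def extract_links(page: str, begin_hit: str, end_hit: str = '"') -> list[str]:
    """Collect the text found between each `begin_hit` and the next `end_hit`."""
    if not begin_hit or not end_hit:
        raise ValueError("begin_hit and end_hit must be non-empty")
    links = []
    pos = 0
    while True:
        start = page.find(begin_hit, pos)
        if start < 0:
            break
        start += len(begin_hit)
        end = page.find(end_hit, start)
        if end < 0:
            break
        links.append(page[start:end])
        # Resume scanning after the closing delimiter.
        pos = end + len(end_hit)
    return links
```

If the web site changes its page format, the begin string stops matching and the function returns no links, which is the condition under which the parser would need to relearn its tokens.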
Specification