Method and apparatus for improved web scraping
Abstract
A method and apparatus that enable the parser component of a web search engine to adapt to frequent changes in web page format at web sites. The parser “learns,” from a set of defined HTTP links, how to find and parse web pages returned from a search engine query. The invention locates the tokens and strings that correctly extract the attributes associated with each returned item. The invention may operate either automatically or in a user-assisted fashion.
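The learning idea in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: it assumes a “begin hits” marker can be approximated as the longest string that immediately precedes the anchor tag of every known good link, and the names `learn_begin_marker`, `extract_links`, and the `width` parameter are hypothetical.

```python
import re

# Matches an opening anchor tag and captures its href value (double-quoted only).
LINK_RE = re.compile(r'<a\s[^>]*href="([^"]+)"', re.IGNORECASE)


def contexts_before(html, link, width=40):
    """Return the text (up to `width` chars) preceding each anchor that points to `link`."""
    return [
        html[max(0, m.start() - width):m.start()]
        for m in LINK_RE.finditer(html)
        if m.group(1) == link
    ]


def common_suffix(a, b):
    """Return the longest string that both `a` and `b` end with."""
    n = 0
    while n < min(len(a), len(b)) and a[-1 - n] == b[-1 - n]:
        n += 1
    return a[len(a) - n:]


def learn_begin_marker(html, good_links):
    """Assumed heuristic: the marker is the suffix shared by the context of every good link."""
    contexts = [c for link in good_links for c in contexts_before(html, link)]
    if not contexts:
        return ""
    marker = contexts[0]
    for c in contexts[1:]:
        marker = common_suffix(marker, c)
    return marker


def extract_links(html, marker):
    """Return the hrefs of all anchors that directly follow the learned marker."""
    if not marker:
        return []
    pattern = re.compile(re.escape(marker) + r'<a\s[^>]*href="([^"]+)"', re.IGNORECASE)
    return pattern.findall(html)


page = (
    '<div class="nav"><a href="/home">Home</a></div>'
    '<li class="hit"><a href="/r/1">One</a></li>'
    '<li class="hit"><a href="/r/2">Two</a></li>'
    '<li class="hit"><a href="/r/3">Three</a></li>'
)
marker = learn_begin_marker(page, ["/r/1", "/r/2"])
results = extract_links(page, marker)
```

In this example two known links are enough to learn a marker that also picks up the third result while skipping the navigation link. A real page would likely need several candidate markers to be compared, as the claims describe.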
2 Claims
1. A method for improved web scraping, comprising the steps of:
obtaining a results page for a given web site/query;
determining whether the source of said results was previously requested;
IF said source was previously requested, THEN retrieving known links from database;
comparing said known links to links on said results page;
determining whether “N” good links have been found;
IF said “N” good links have been found, THEN identifying said “N” good links;
building a stack of potential “begin hits” HTML tags and strings for each of selections “1” through “N”;
comparing entries of said stack to find “best” combination of said “begin hits” HTML tags and strings;
writing to and updating configuration file so as to terminate process;
OTHERWISE;
returning to said step of parsing said results page to identify all links;
OTHERWISE;
parsing said results page to identify all links;
presenting list of said links to user;
manually selecting “N” good links; and
returning to said step of identifying said “N” good links.
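The branching recited in claim 1 can be sketched in Python as follows. This is an illustrative reading of the claim language, not the patentee's code: the in-memory dictionary `known_db` stands in for the claimed database, `ask_user` is a hypothetical callback for the manual selection step, and storing the user's selection for later runs is an assumption.

```python
def select_good_links(source, page_links, known_db, n, ask_user):
    """Return (good_links, mode) for one results page.

    source:     identifier of the web site/query (hypothetical key)
    page_links: links parsed from the results page
    known_db:   dict mapping source -> previously accepted links
    n:          number of good links required
    ask_user:   callback that shows the links and returns the user's picks
    """
    known = known_db.get(source)  # None means the source was never requested
    if known is not None:
        known_set = set(known)
        good = [link for link in page_links if link in known_set]
        if len(good) >= n:
            # Automatic path: enough known links are still on the page.
            return good[:n], "automatic"
    # User-assisted path: reached for a new source, or when fewer than
    # N known links were found on the current page.
    chosen = list(ask_user(page_links))[:n]
    known_db[source] = chosen  # assumption: remember the picks for next time
    return chosen, "user-assisted"


db = {}
pick_first_two = lambda links: links[:2]  # stand-in for a human choice
first = select_good_links("example", ["/a", "/b", "/c"], db, 2, pick_first_two)
second = select_good_links("example", ["/x", "/a", "/b"], db, 2, pick_first_two)
third = select_good_links("example", ["/x", "/y", "/a"], db, 2, pick_first_two)
```

The first call falls through to the user because the source is new, the second succeeds automatically from the stored links, and the third returns to the user because only one known link remains on the page. The good links returned by either path would then feed the “begin hits” stack-building and comparison steps.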
2. Apparatus for improved web scraping, comprising:
means for obtaining a results page for a given web site/query;
means for determining whether the source of said results was previously requested;
IF said source was previously requested, THEN means for retrieving known links from database;
means for comparing said known links to links on said results page;
means for determining whether “N” good links have been found;
IF said “N” good links have been found, THEN means for identifying said “N” good links;
building a stack of potential “begin hits” HTML tags and strings for each of selections “1” through “N”;
means for comparing entries of said stack to find “best” combination of said “begin hits” HTML tags and strings;
means for writing to and updating configuration file so as to terminate process;
OTHERWISE;
means for returning to said step of parsing said results page to identify all links;
OTHERWISE;
means for parsing said results page to identify all links;
means for presenting list of said links to user;
means for manually selecting “N” good links; and
means for returning to said step of identifying said “N” good links.
Specification