Method and apparatus for improved web scraping

  • US 20040167876A1
  • Filed: 02/21/2003
  • Published: 08/26/2004
  • Est. Priority Date: 02/21/2003
  • Status: Active Grant
First Claim

1. A method for improved web scraping, comprising the steps of:

    obtaining a results page for a given web site/query;

    determining whether the source of said results was previously requested;

    IF said source was previously requested, THEN retrieving known links from database;

    comparing said known links to links on said results page;

    determining whether “N” good links have been found;

    IF said “N” good links have been found, THEN identifying said “N” good links;

    building a stack of potential “begin hits” HTML tags and strings for each of selections “1” through “N”;

    comparing entries of said stack to find “best” combination of said “begin hits” HTML tags and strings;

    writing to and updating configuration file so as to terminate process;

    OTHERWISE;

    returning to said step of parsing said results page to identify all links;

    OTHERWISE;

    parsing said results page to identify all links;

    presenting list of said links to user;

    manually selecting “N” good links; and

    returning to said step of identifying said “N” good links.
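
The control flow recited in claim 1 can be illustrated with a minimal Python sketch. This is an illustration under stated assumptions, not the patented implementation: the claim does not define how the "best" combination is chosen, so the sketch assumes it is the longest string that immediately precedes every good link. The names `find_links`, `candidate_begin_hits`, `best_begin_hit`, `learn_config` and the `choose_links` callback (standing in for the manual selection step) are hypothetical, the "database" is modelled as a plain set of known links, and the configuration is returned as a dictionary instead of being written to a file.

```python
import re

# Assumption: links are simple double-quoted hrefs inside <a ...> tags.
LINK_RE = re.compile(r'<a\s[^>]*href="([^"]*)"', re.IGNORECASE)


def find_links(page):
    """Parse the results page and return (href, offset) for every link."""
    return [(m.group(1), m.start()) for m in LINK_RE.finditer(page)]


def candidate_begin_hits(page, offset, max_len=40):
    """Build a stack of strings that end right before one good link.

    Candidates are the suffixes of the preceding text, shortest first.
    """
    context = page[max(0, offset - max_len):offset]
    return [context[-k:] for k in range(1, len(context) + 1)]


def best_begin_hit(page, good_offsets):
    """Compare the stacks and keep the longest candidate shared by all.

    'Longest shared' is this sketch's reading of "best", not the claim's.
    """
    stacks = [set(candidate_begin_hits(page, off)) for off in good_offsets]
    if not stacks:
        return ""
    shared = set.intersection(*stacks)
    return max(shared, key=len) if shared else ""


def learn_config(page, known_links, n, choose_links):
    """Follow the claimed branches and return a configuration dictionary."""
    links = find_links(page)
    # Branch 1: source seen before, so compare known links with the page.
    good = [(href, off) for href, off in links if href in known_links]
    if len(good) < n:
        # Branch 2: present all links and let a person pick N good ones.
        chosen = set(choose_links([href for href, _ in links], n))
        good = [(href, off) for href, off in links if href in chosen]
    good = good[:n]
    return {"begin_hits": best_begin_hit(page, [off for _, off in good])}
```

A caller would pass the fetched page, the links stored for that source (an empty set if the source was never requested), the value of N, and a function that asks the user to pick links. The returned dictionary could then be serialized to whatever configuration format the scraper uses.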
