Systems and methods for identifying and extracting data from HTML pages

  • US 6,920,609 B1
  • Filed: 08/24/2000
  • Issued: 07/19/2005
  • Est. Priority Date: 08/24/2000
  • Status: Expired due to Term
First Claim

1. A computer implemented method of identifying desired content in HTML formatted web pages, comprising the steps of:

    selecting a model page, wherein the model page includes content data and a plurality of HTML tags for formatting the content data;

    identifying a first area of interest in the model page;

    parsing the model page to generate a first string of symbols for the plurality of HTML tags, the generated symbols in the first string representing only HTML tags, wherein the first area of interest is identified by a first portion of the first string of symbols;

    retrieving a second web page associated with a different URL than the model page;

    parsing the second web page to generate a second string of symbols for a plurality of HTML tags of the second web page, the generated symbols in the second string representing only HTML tags; and

    comparing the first and second symbol strings to determine whether the second string includes a second portion similar to the first portion of the first string, wherein the second portion corresponds to a second area of interest in the second page.
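The steps recited in claim 1 can be illustrated with a minimal sketch. It is not the patent's implementation: the names `TagSymbolizer`, `tag_symbols`, and `find_similar_portion`, the sample pages, the use of tag names as symbols, the sliding-window search, and the 0.8 threshold are all assumptions made for illustration. Similarity is measured here with Python's `difflib.SequenceMatcher`, a stand-in for whatever comparison the patent actually contemplates.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser


class TagSymbolizer(HTMLParser):
    """Emits one symbol per HTML tag and ignores all text content."""

    def __init__(self):
        super().__init__()
        self.symbols = []

    def handle_starttag(self, tag, attrs):
        self.symbols.append(tag)          # e.g. "table"

    def handle_endtag(self, tag):
        self.symbols.append("/" + tag)    # e.g. "/table"


def tag_symbols(html):
    """Parse a page into a string (list) of symbols representing only tags."""
    parser = TagSymbolizer()
    parser.feed(html)
    parser.close()
    return parser.symbols


def find_similar_portion(model_symbols, start, end, other_symbols, threshold=0.8):
    """Slide a window the size of the model's area of interest over the
    second page's symbols; return (start, end, score) for the best window
    if it meets the (assumed) threshold, otherwise None."""
    portion = model_symbols[start:end]
    width = len(portion)
    best = None
    for i in range(len(other_symbols) - width + 1):
        window = other_symbols[i:i + width]
        score = SequenceMatcher(None, portion, window).ratio()
        if best is None or score > best[2]:
            best = (i, i + width, score)
    if best is not None and best[2] >= threshold:
        return best
    return None


# Hypothetical model page: the area of interest is the <table> ... </table> run.
MODEL = ("<html><body><h1>News</h1>"
         "<table><tr><td>Price</td><td>42</td></tr></table></body></html>")
# Hypothetical second page (different URL) with extra markup before the table.
OTHER = ("<html><body><div><p>Ad</p></div>"
         "<table><tr><td>Cost</td><td>17</td></tr></table></body></html>")

model_syms = tag_symbols(MODEL)
other_syms = tag_symbols(OTHER)
first_portion = (model_syms.index("table"), model_syms.index("/table") + 1)
match = find_similar_portion(model_syms, *first_portion, other_syms)
print(match)  # → (6, 14, 1.0)
```

Because the symbols encode only tags, the differing cell text ("Price"/"42" versus "Cost"/"17") does not affect the comparison; only the surrounding markup structure does.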

  • 9 Assignments