Extracting information from Web pages

  • US 7,519,621 B2
  • Filed: 05/04/2004
  • Issued: 04/14/2009
  • Est. Priority Date: 05/04/2004
  • Status: Expired due to Fees
First Claim

1. A computer-implemented method for identifying webpage content, the method comprising:

    receiving from a memory storage device a string of HTML source code that includes tags;

    determining the sequence in which tags occur in the string;

    using the sequence to identify one or more sub-sequences in which tags occur in the string, each sub-sequence being associated with a portion of the string that starts with the first tag of the sub-sequence and ends with the last tag of the sub-sequence;

    determining whether the identified sub-sequences define webpage content constituting an entire webpage listing, the determining including:

    applying a first set of criteria to filter the identified sub-sequences, the first set of criteria including a requirement that an identified sub-sequence be repeated in tandem, either exactly or approximately, in the string;

    removing from further consideration sub-sequences that do not satisfy the first set of criteria;

    grouping the remaining sub-sequences into groups, wherein sub-sequences are grouped together in a group when they do not overlap and are similar, as determined by a measure based on edit distance;

    calculating a score for each group, the score for a group being associated with each sub-sequence in the group, the score being indicative of the likelihood that sub-sequences in the group define webpage content constituting entire webpage listings;

    identifying overlapping sub-sequences between different groups, wherein identifying includes selecting each sub-sequence in a group and comparing the selected sub-sequence against sub-sequences of other groups for one or more overlapping word tokens;

    removing from further consideration all identified overlapping sub-sequences between different groups except sub-sequences from the group having a highest associated score among sub-sequences currently selected; and

    returning and storing in the memory storage device the sub-sequences that were not removed from further consideration.
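
The first steps of the claim (ordering the tags, finding tandem repeats, and grouping similar non-overlapping sub-sequences by edit distance) can be illustrated with a minimal Python sketch. It is not the patented implementation: the regular expression, the function names, the restriction to exact repeats, and the `max_ratio` threshold are all assumptions made for illustration, and the scoring and cross-group overlap removal steps are left out because the claim does not specify a formula for them.

```python
import re

# Assumed, simplified tag pattern; real HTML parsing is more involved.
TAG_RE = re.compile(r"<\s*(/?)\s*([a-zA-Z][a-zA-Z0-9]*)[^>]*>")

def tag_sequence(html):
    """Return tag names in order of occurrence; closing tags keep their '/'."""
    return [m.group(1) + m.group(2).lower() for m in TAG_RE.finditer(html)]

def tandem_repeats(names, min_len=2, min_copies=2):
    """Find exact tandem repeats as (start_index, length, copies) tuples."""
    found = []
    n = len(names)
    for length in range(min_len, n // min_copies + 1):
        i = 0
        while i + 2 * length <= n:
            unit = names[i:i + length]
            copies = 1
            # Count how many adjacent copies of the unit follow position i.
            while names[i + copies * length:i + (copies + 1) * length] == unit:
                copies += 1
            if copies >= min_copies:
                found.append((i, length, copies))
                i += copies * length
            else:
                i += 1
    return found

def edit_distance(a, b):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]

def group_similar(names, spans, max_ratio=0.25):
    """Greedily group (start, end) spans that do not overlap and are similar.

    max_ratio is a hypothetical threshold on edit distance relative to the
    longer span; the claim only says the measure is based on edit distance.
    """
    groups = []
    for start, end in spans:
        tokens = names[start:end]
        for group in groups:
            fits = all(
                (end <= s or e <= start)
                and edit_distance(tokens, names[s:e])
                <= max_ratio * max(len(tokens), e - s)
                for s, e in group
            )
            if fits:
                group.append((start, end))
                break
        else:
            groups.append([(start, end)])
    return groups

html = ("<ul><li><a href='x'>A</a></li><li><a href='y'>B</a></li>"
        "<li><a>C</a></li></ul>")
names = tag_sequence(html)
repeats = tandem_repeats(names)
# Expand each repeat into one (start, end) span per copy.
spans = [(s + k * length, s + (k + 1) * length)
         for s, length, copies in repeats for k in range(copies)]
groups = group_similar(names, spans)
print(repeats)  # [(1, 4, 3)]
print(groups)   # [[(1, 5), (5, 9), (9, 13)]]
```

In this example the three `<li>` items share the tag pattern `li, a, /a, /li`, so they surface as a single tandem repeat with three copies and end up in one group. Handling the approximate repeats that the claim also allows would require a tolerance in `tandem_repeats`, which this sketch omits.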
