Extracting information from Web pages

  • US 20050251536A1
  • Filed: 05/04/2004
  • Published: 11/10/2005
  • Est. Priority Date: 05/04/2004
  • Status: Active Grant
First Claim

1. A computer-implemented method for identifying Web page content, the method comprising:

    receiving a string of HTML source code that includes tags;

    determining the sequence in which tags occur in the string;

    using the sequence to identify sub-sequences in which tags occur in the string, each sub-sequence being associated with the portion of the string that starts with the first tag of the sub-sequence and ends with the last tag of the sub-sequence;

    removing from further consideration sub-sequences that do not satisfy criteria for being classified as associated with a portion of the string that defines Web page content constituting an entire listing;

    grouping into groups sub-sequences that were not removed in the previous step, wherein sub-sequences that are similar, as determined by a measure based on edit distance, and do not overlap are grouped together in a group;

    calculating a score for each group, the score for a group being indicative of the likelihood that sub-sequences in the group are associated with portions of the string that define Web page content constituting entire listings, the score for a group being associated with each sub-sequence in the group;

    identifying each portion of the string that represents Web page content and is an overlap, a portion of the string being an overlap when it is associated with more than one sub-sequence;

    for each portion of the string identified as an overlap, selecting sub-sequences associated with the portion of the string and removing from further consideration all the currently selected sub-sequences except the sub-sequence having a highest associated score among sub-sequences currently selected; and

    returning the sub-sequences that were not removed from further consideration.
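
As an illustration only, the claimed steps can be sketched in Python. This is a minimal sketch under stated assumptions, not the patented implementation. The claim does not say how tags are parsed, which similarity threshold is used, or how group scores are calculated. The regular expression, the 0.25 threshold, the caller-supplied scores, and all function names below are therefore hypothetical. The overlap step is one greedy reading of "keep the sub-sequence having the highest associated score".

```python
import re

# Assumption: a simple regex stands in for a real HTML tokenizer.
# It will misread tags whose attribute values contain '>'.
TAG_RE = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>")


def tag_sequence(html):
    """Return [(label, start, end)] for each tag, in the order it occurs."""
    return [(m.group(1) + m.group(2).lower(), m.start(), m.end())
            for m in TAG_RE.finditer(html)]


def span(tags, first, last):
    """Portion of the string covered by the sub-sequence tags[first..last]."""
    return tags[first][1], tags[last][2]


def edit_distance(a, b):
    """Levenshtein distance between two sequences of tag labels."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # substitute
        prev = cur
    return prev[-1]


def similar(a, b, threshold=0.25):
    """Hypothetical measure: edit distance normalized by the longer length."""
    longest = max(len(a), len(b))
    return longest == 0 or edit_distance(a, b) / longest <= threshold


def overlaps(s1, s2):
    """True when two half-open (start, end) spans share any character."""
    return s1[0] < s2[1] and s2[0] < s1[1]


def resolve_overlaps(scored):
    """Greedy reading of the overlap step.

    scored is a list of ((start, end), score) pairs. Spans are visited from
    highest to lowest score, and a span is kept only if it does not overlap
    a span that was already kept.
    """
    kept = []
    for sp, score in sorted(scored, key=lambda item: -item[1]):
        if not any(overlaps(sp, k) for k, _ in kept):
            kept.append((sp, score))
    return sorted(kept)


if __name__ == "__main__":
    html = "<ul><li><a href='x'>A</a></li><li><a href='y'>B</a></li></ul>"
    tags = tag_sequence(html)
    labels = [label for label, _, _ in tags]

    first, second, whole = span(tags, 1, 4), span(tags, 5, 8), span(tags, 1, 8)
    print(similar(labels[1:5], labels[5:9]))  # the two <li> items are compared

    # Scores are made up for the demonstration; the claim leaves them open.
    for (start, end), score in resolve_overlaps(
            [(first, 0.9), (second, 0.9), (whole, 0.5)]):
        print(score, html[start:end])
```

In this toy input, the two list items have identical tag sub-sequences and do not overlap, so they would fall into one group. The span covering both items overlaps each of them and carries a lower (made-up) score, so the greedy pass drops it.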
