Extracting information from Web pages
First Claim
1. A computer-implemented method for identifying webpage content, the method comprising:
- receiving from a memory storage device a string of HTML source code that includes tags;
determining the sequence in which tags occur in the string;
using the sequence to identify one or more sub-sequences in which tags occur in the string, each sub-sequence being associated with a portion of the string that starts with the first tag of the sub-sequence and ends with the last tag of the sub-sequence;
determining whether the identified sub-sequences define webpage content constituting an entire webpage listing, the determining including;
applying a first set of criteria to filter the identified sub-sequences, the first set of criteria including a requirement that an identified sub-sequence be repeated in tandem, either exactly or approximately, in the string;
removing from further consideration sub-sequences that do not satisfy the first set of criteria;
grouping the remaining sub-sequences into groups, wherein sub-sequences are grouped together in a group when they do not overlap and are similar, as determined by a measure based on edit distance;
calculating a score for each group, the score for a group being associated with each sub-sequence in the group, the score being indicative of the likelihood that sub-sequences in the group define webpage content constituting entire webpage listings;
identifying overlapping sub-sequences between different groups, wherein identifying includes selecting each sub-sequence in a group and comparing the selected sub-sequence against sub-sequences of other groups for one or more overlapping word tokens;
removing from further consideration all identified overlapping sub-sequences between different groups except sub-sequences from the group having a highest associated score among sub-sequences currently selected; and
returning and storing in the memory storage device the sub-sequences that were not removed from further consideration.
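The first two steps of the claim (determining the tag sequence, and associating a sub-sequence with the portion of the string it spans) can be sketched as follows. This is a minimal illustration, not the patented implementation: the regular expression `TAG_RE` and the helpers `tag_sequence` and `portion` are hypothetical names, and a regex tokenizer is a simplification that ignores comments, scripts, and attribute values containing `>`.

```python
import re

# Assumption: a simple regex is enough to find tags in well-behaved HTML.
TAG_RE = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>")

def tag_sequence(html):
    """Return (token, start, end) for each tag, in document order."""
    seq = []
    for m in TAG_RE.finditer(html):
        token = m.group(1) + m.group(2).lower()  # e.g. "li" or "/li"
        seq.append((token, m.start(), m.end()))
    return seq

def portion(html, seq, i, j):
    """Portion of the string spanned by the sub-sequence seq[i:j].

    It starts at the first tag of the sub-sequence and ends at the
    end of its last tag (j is exclusive).
    """
    return html[seq[i][1]:seq[j - 1][2]]

html = "<ul><li><b>A</b> one</li><li><b>B</b> two</li></ul>"
seq = tag_sequence(html)
tokens = [token for token, _, _ in seq]
first_item = portion(html, seq, 1, 5)
```

Here `tokens` is the tag sequence for the whole string, and `first_item` is the text associated with the sub-sequence `li, b, /b, /li`.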
Abstract
Methods and apparatus, including computer program products, for identifying Web page content with a granularity finer than individual Web pages, e.g., finer than individual HTML documents. The invention provides a computer-implemented method for identifying Web page content. The method includes receiving a string of markup language source code that includes tags. The method includes identifying sub-sequences in which tags occur in the string. Each sub-sequence is associated with the portion of the string that starts with the first tag of the sub-sequence and ends with the last tag of the sub-sequence. The sub-sequences identified are ones that satisfy criteria for being classified as associated with a portion of the string that defines Web page content constituting an entire listing. The criteria include a requirement that an identified sub-sequence be repeated in tandem, either exactly or approximately, in the string. The method includes returning the identified sub-sequences.
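The tandem-repeat requirement can be illustrated with a brute-force search over a tag sequence. This sketch covers only exact repeats; the approximate case mentioned in the abstract is not handled. The function name `tandem_repeats` and its parameters `min_len` and `min_copies` are assumptions made for illustration, not terms from the patent.

```python
def tandem_repeats(tokens, min_len=2, min_copies=2):
    """Find exact tandem repeats in a list of tag tokens.

    Returns (start, unit_len, copies) triples: the unit
    tokens[start:start + unit_len] occurs `copies` times back to back.
    """
    n = len(tokens)
    found = []
    for unit in range(min_len, n // min_copies + 1):
        start = 0
        while start + unit * min_copies <= n:
            first = tokens[start:start + unit]
            copies = 1
            # Count how many adjacent copies of the unit follow.
            while tokens[start + copies * unit:start + (copies + 1) * unit] == first:
                copies += 1
            if copies >= min_copies:
                found.append((start, unit, copies))
                start += unit * copies  # skip past the whole repeat
            else:
                start += 1
    return found

tokens = ["ul", "li", "b", "/b", "/li", "li", "b", "/b", "/li", "/ul"]
repeats = tandem_repeats(tokens)
```

For this input the only repeat found is the four-tag unit `li, b, /b, /li` starting at index 1 and occurring twice. A production system would more likely use a suffix-based method than this cubic scan.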
50 Citations
10 Claims
1. A computer-implemented method for identifying webpage content, the method comprising:
- receiving from a memory storage device a string of HTML source code that includes tags;
determining the sequence in which tags occur in the string;
using the sequence to identify one or more sub-sequences in which tags occur in the string, each sub-sequence being associated with a portion of the string that starts with the first tag of the sub-sequence and ends with the last tag of the sub-sequence;
determining whether the identified sub-sequences define webpage content constituting an entire webpage listing, the determining including;
applying a first set of criteria to filter the identified sub-sequences, the first set of criteria including a requirement that an identified sub-sequence be repeated in tandem, either exactly or approximately, in the string;
removing from further consideration sub-sequences that do not satisfy the first set of criteria;
grouping the remaining sub-sequences into groups, wherein sub-sequences are grouped together in a group when they do not overlap and are similar, as determined by a measure based on edit distance;
calculating a score for each group, the score for a group being associated with each sub-sequence in the group, the score being indicative of the likelihood that sub-sequences in the group define webpage content constituting entire webpage listings;
identifying overlapping sub-sequences between different groups, wherein identifying includes selecting each sub-sequence in a group and comparing the selected sub-sequence against sub-sequences of other groups for one or more overlapping word tokens;
removing from further consideration all identified overlapping sub-sequences between different groups except sub-sequences from the group having a highest associated score among sub-sequences currently selected; and
returning and storing in the memory storage device the sub-sequences that were not removed from further consideration.
- View Dependent Claims (2, 3, 4, 5)
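The grouping step of claim 1 puts sub-sequences together when they do not overlap and are similar under a measure based on edit distance. The claim does not specify the measure, so the sketch below assumes one: Levenshtein distance over tag tokens, normalized by the longer length and compared against a hypothetical `threshold`. The names `edit_distance`, `similar`, `overlaps`, and `group_spans` are likewise illustrative.

```python
def edit_distance(a, b):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,               # delete x
                           cur[j - 1] + 1,            # insert y
                           prev[j - 1] + (x != y)))   # substitute
        prev = cur
    return prev[-1]

def similar(a, b, threshold=0.25):
    """Assumed measure: distance divided by the longer length."""
    longest = max(len(a), len(b)) or 1
    return edit_distance(a, b) / longest <= threshold

def overlaps(s, t):
    """True when half-open index ranges s and t share a position."""
    return s[0] < t[1] and t[0] < s[1]

def group_spans(tokens, spans, threshold=0.25):
    """Greedily group spans that are pairwise non-overlapping and similar."""
    groups = []
    for span in spans:
        seq = tokens[span[0]:span[1]]
        for g in groups:
            if all(not overlaps(span, o)
                   and similar(seq, tokens[o[0]:o[1]], threshold) for o in g):
                g.append(span)
                break
        else:
            groups.append([span])  # no compatible group: start a new one
    return groups

tokens = ["li", "b", "/b", "/li", "li", "b", "/b", "br", "/li", "p", "/p"]
groups = group_spans(tokens, [(0, 4), (4, 9), (9, 11)])
```

The two list items differ by a single inserted `br` tag, so they land in the same group, while the `p, /p` span forms a group of its own.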
6. A computer program product, tangibly embodied in a machine-readable storage device, for identifying webpage content, the computer program product including instructions to cause data processing apparatus to:
- receive a string of HTML source code that includes tags;
determine the sequence in which the tags occur in the string;
use the sequence to identify one or more sub-sequences in which the tags occur in the string, each sub-sequence being associated with a portion of the string that starts with the first tag of the sub-sequence and ends with the last tag of the sub-sequence;
determine whether the identified sub-sequences define webpage content constituting an entire webpage listing, the determining including;
applying a first set of criteria to filter the identified sub-sequences, the first set of criteria including a requirement that an identified sub-sequence be repeated in tandem, either exactly or approximately, in the string;
removing from further consideration sub-sequences that do not satisfy the first set of criteria;
grouping the remaining sub-sequences into groups, wherein sub-sequences are grouped together in a group when they do not overlap and are similar, as determined by a measure based on edit distance;
calculating a score for each group, the score for a group being associated with each sub-sequence in the group, the score being indicative of the likelihood that sub-sequences in the group define webpage content constituting entire webpage listings;
identifying overlapping sub-sequences between different groups, wherein identifying includes selecting each sub-sequence in a group and comparing the selected sub-sequence against sub-sequences of other groups for one or more overlapping word tokens;
removing from further consideration all identified overlapping sub-sequences between different groups except sub-sequences from the group having a highest associated score among sub-sequences currently selected; and
returning and storing the sub-sequences that were not removed from further consideration.
- View Dependent Claims (7, 8, 9, 10)
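The final steps (scoring groups, then discarding sub-sequences that overlap a sub-sequence from a higher-scoring group) admit more than one reading. The sketch below shows one possible interpretation: groups are visited from highest to lowest score, and a sub-sequence survives unless it overlaps a survivor from an earlier group. The claims do not define the scoring function, so scores are supplied as inputs here; the function name `resolve_overlaps` and the representation of sub-sequences as half-open ranges of word-token positions are assumptions.

```python
def overlaps(s, t):
    """True when half-open token ranges s and t share at least one token."""
    return s[0] < t[1] and t[0] < s[1]

def resolve_overlaps(scored_groups):
    """Keep overlapping spans only from the higher-scoring group.

    scored_groups: list of (score, [(start, end), ...]) pairs.
    Returns the surviving (score, span) pairs, best group first.
    """
    kept = []
    ranked = sorted(scored_groups, key=lambda g: g[0], reverse=True)
    for score, spans in ranked:
        # Snapshot: only survivors from higher-scoring groups can block a span.
        higher = [span for _, span in kept]
        for span in spans:
            if not any(overlaps(span, h) for h in higher):
                kept.append((score, span))
    return kept

# Hypothetical scores and token ranges, for illustration only.
scored = [
    (0.4, [(3, 7), (12, 15)]),
    (0.9, [(0, 5), (5, 10)]),
]
survivors = resolve_overlaps(scored)
```

In this example the span `(3, 7)` from the lower-scoring group shares tokens with both spans of the higher-scoring group and is dropped, while `(12, 15)` overlaps nothing and is returned alongside them.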
Specification