Extracting information from Web pages
Abstract
Methods and apparatus, including computer program products, for identifying Web page content with a granularity finer than individual Web pages, e.g., finer than individual HTML documents. The invention provides a computer-implemented method for identifying Web page content. The method includes receiving a string of markup language source code that includes tags. The method includes identifying sub-sequences in which tags occur in the string. Each sub-sequence is associated with the portion of the string that starts with the first tag of the sub-sequence and ends with the last tag of the sub-sequence. The sub-sequences identified are ones that satisfy criteria for being classified as associated with a portion of the string that defines Web page content constituting an entire listing. The criteria include a requirement that an identified sub-sequence be repeated in tandem, either exactly or approximately, in the string. The method includes returning the identified sub-sequences.
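As an illustration of the terms used in the abstract, the following minimal sketch extracts the tag sequence from a string and maps a sub-sequence of tags back to the portion of the string it spans. The names `TAG_RE`, `tag_sequence`, and `portion` are hypothetical, and the regular expression is a simplification that assumes well-formed tags; the patent text does not prescribe any particular implementation.

```python
import re

# Simplified tag matcher (assumption): an optional "/", a tag name, then anything up to ">".
TAG_RE = re.compile(r"<\s*(/?)\s*([A-Za-z][A-Za-z0-9]*)[^>]*>")

def tag_sequence(source):
    """Return (tag_name, start_offset, end_offset) triples in document order."""
    tags = []
    for m in TAG_RE.finditer(source):
        name = ("/" if m.group(1) else "") + m.group(2).lower()
        tags.append((name, m.start(), m.end()))
    return tags

def portion(source, tags, first, last):
    """Portion of the string from the start of tag `first` through the end of tag `last`."""
    return source[tags[first][1]:tags[last][2]]

html = "<ul><li><a href='/a'>A</a></li><li><a href='/b'>B</a></li></ul>"
tags = tag_sequence(html)
names = [name for name, _, _ in tags]
first_listing = portion(html, tags, 1, 4)  # tags li, a, /a, /li
```

Here the sub-sequence made of tags 1 through 4 is associated with the first `<li>...</li>` item, which is the kind of portion the abstract calls a listing.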
20 Claims
1. A computer-implemented method for identifying Web page content, the method comprising:
receiving a string of HTML source code that includes tags;
determining the sequence in which tags occur in the string;
using the sequence to identify sub-sequences in which tags occur in the string, each sub-sequence being associated with the portion of the string that starts with the first tag of the sub-sequence and ends with the last tag of the sub-sequence;
removing from further consideration sub-sequences that do not satisfy criteria for being classified as associated with a portion of the string that define Web page content constituting an entire listing;
grouping into groups sub-sequences that were not removed in the previous step, wherein sub-sequences that are similar, as determined by a measure based on edit distance, and do not overlap are grouped together in a group;
calculating a score for each group, the score for a group being indicative of the likelihood that sub-sequences in the group are associated with portions of the string that define Web page content constituting entire listings, the score for a group being associated with each sub-sequence in the group;
identifying each portion of the string that represents Web page content and is an overlap, a portion of the string being an overlap when it is associated with more than one sub-sequence;
for each portion of the string identified as an overlap, selecting sub-sequences associated with the portion of the string and removing from further consideration all the currently selected sub-sequences except the sub-sequence having a highest associated score among sub-sequences currently selected; and
returning the sub-sequences that were not removed from further consideration.
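The grouping, scoring, and overlap steps of claim 1 can be sketched roughly as follows. This is only an illustrative reading, not the patented implementation: candidate sub-sequences are assumed to be already filtered and are given as inclusive `(first_tag, last_tag)` index spans; the similarity threshold `max_ratio`, the placeholder `score` formula, and the greedy overlap resolution are all assumptions introduced for this example.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences of tag names."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]

def overlaps(p, q):
    """True when two inclusive (first, last) tag-index spans share a tag."""
    return p[0] <= q[1] and q[0] <= p[1]

def group_similar(names, spans, max_ratio=0.25):
    """Group spans whose tag sequences are similar and that do not overlap."""
    groups = []
    for span in sorted(spans):
        seq = names[span[0]:span[1] + 1]
        for group in groups:
            ref = names[group[0][0]:group[0][1] + 1]
            close = edit_distance(seq, ref) <= max_ratio * max(len(seq), len(ref))
            if close and not any(overlaps(span, other) for other in group):
                group.append(span)
                break
        else:
            groups.append([span])
    return groups

def score(group):
    # Placeholder score (assumption): repeats beyond the first, weighted by span length.
    return (len(group) - 1) * (group[0][1] - group[0][0] + 1)

def resolve_overlaps(groups):
    """Greedy approximation: visit spans by descending group score, drop any that overlap a kept span."""
    scored = sorted(((score(g), s) for g in groups for s in g), key=lambda t: (-t[0], t[1]))
    kept = []
    for _, span in scored:
        if not any(overlaps(span, k) for k in kept):
            kept.append(span)
    return sorted(kept)

names = ["ul", "li", "a", "/a", "/li", "li", "a", "/a", "/li", "/ul"]
spans = [(0, 9), (1, 4), (5, 8), (2, 3), (6, 7)]  # whole list, two items, two links
result = resolve_overlaps(group_similar(names, spans))
```

With these inputs the two `li` spans form the highest-scoring group, so the nested link spans and the enclosing list span are dropped and `result` holds `(1, 4)` and `(5, 8)`.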
2. A computer-implemented method for identifying Web page content, the method comprising:
receiving a string of markup language source code that includes tags;
identifying sub-sequences in which tags occur in the string, each sub-sequence being associated with the portion of the string that starts with the first tag of the sub-sequence and ends with the last tag of the sub-sequence, the sub-sequences identified being ones that satisfy criteria for being classified as associated with a portion of the string that define Web page content constituting an entire listing, the criteria including a requirement that an identified sub-sequence be repeated in tandem, either exactly or approximately, in the string; and
returning the identified sub-sequences.
Dependent claims: 3, 4, 5, 6, 7, 8, 9.
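The tandem-repeat criterion of claim 2 might be illustrated as below. The sketch compares fixed-length, adjacent blocks of tag names and uses the similarity ratio of Python's `difflib.SequenceMatcher` in place of a specific distance measure, which is a substitution made for brevity; the function name `tandem_repeats` and the parameters `min_len` and `min_ratio` are hypothetical.

```python
from difflib import SequenceMatcher

def tandem_repeats(names, min_len=2, min_ratio=0.9):
    """Return (start, period, copies) for runs of adjacent, similar blocks of tags."""
    found = []
    n = len(names)
    for period in range(min_len, n // 2 + 1):
        i = 0
        while i + 2 * period <= n:
            unit = names[i:i + period]
            copies = 1
            j = i + period
            # Extend the run while the next block is similar enough to the first one.
            while (j + period <= n and
                   SequenceMatcher(None, unit, names[j:j + period]).ratio() >= min_ratio):
                copies += 1
                j += period
            if copies >= 2:
                found.append((i, period, copies))
                i = j  # skip past the run just recorded
            else:
                i += 1
    return found

names = ["ul"] + ["li", "a", "/a", "/li"] * 3 + ["/ul"]
repeats = tandem_repeats(names)
```

A `min_ratio` of 1.0 admits only exact repeats, while lower values admit approximate ones. Because the blocks have a fixed length, this simplification does not handle repeats that differ by inserted or deleted tags.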
10. A computer-implemented method for generating an index for Web pages, the method comprising:
crawling the Internet and retrieving a string of markup language source code that includes tags;
identifying sub-sequences in which tags occur in the string, each sub-sequence being associated with the portion of the string that starts with the first tag of the sub-sequence and ends with the last tag of the sub-sequence, the sub-sequences identified being ones that satisfy criteria for being classified as associated with a portion of the string that define Web page content constituting an entire listing, the criteria including a requirement that an identified sub-sequence be repeated in tandem, either exactly or approximately, in the string; and
indexing the source code associated with the identified sub-sequences.
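The indexing step of claim 10 could look something like the following sketch, which builds a small inverted index over listing fragments that are assumed to have been identified already. Crawling is omitted, and `strip_tags`, `build_index`, and the sample listings are hypothetical illustrations rather than anything specified in the claim.

```python
import re
from collections import defaultdict

def strip_tags(fragment):
    """Replace markup tags with spaces so that only visible text remains."""
    return re.sub(r"<[^>]+>", " ", fragment)

def build_index(listings):
    """Map each lowercase word to the ids of the listings whose text contains it."""
    index = defaultdict(set)
    for listing_id, fragment in enumerate(listings):
        for word in re.findall(r"[a-z0-9]+", strip_tags(fragment).lower()):
            index[word].add(listing_id)
    return index

# Sample fragments standing in for the portions returned by the identification step.
listings = [
    "<li><a href='/1'>Red bicycle</a> <b>$40</b></li>",
    "<li><a href='/2'>Blue bicycle</a> <b>$55</b></li>",
]
index = build_index(listings)
```

Indexing each listing separately, rather than the whole page, lets a query such as "red bicycle" resolve to the first listing only, which reflects the finer-than-page granularity described in the abstract.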
11. A computer program product, tangibly embodied in an information carrier, for identifying Web page content, the computer program product being operable to cause data processing apparatus to:
receive a string of HTML source code that includes tags;
determine the sequence in which tags occur in the string;
use the sequence to identify sub-sequences in which tags occur in the string, each sub-sequence being associated with the portion of the string that starts with the first tag of the sub-sequence and ends with the last tag of the sub-sequence;
remove from further consideration sub-sequences that do not satisfy criteria for being classified as associated with a portion of the string that define Web page content constituting an entire listing;
group into groups sub-sequences that were not removed in the previous step, wherein sub-sequences that are similar, as determined by a measure based on edit distance, and do not overlap are grouped together in a group;
calculate a score for each group, the score for a group being indicative of the likelihood that sub-sequences in the group are associated with portions of the string that define Web page content constituting entire listings, the score for a group being associated with each sub-sequence in the group;
identify each portion of the string that represents Web page content and is an overlap, a portion of the string being an overlap when it is associated with more than one sub-sequence;
for each portion of the string identified as an overlap, select sub-sequences associated with the portion of the string and remove from further consideration all the currently selected sub-sequences except the sub-sequence having a highest associated score among sub-sequences currently selected; and
return the sub-sequences that were not removed from further consideration.
12. A computer program product, tangibly embodied in an information carrier, for identifying Web page content, the product comprising instructions operable to cause data processing apparatus to:
receive a string of markup language source code that includes tags;
identify sub-sequences in which tags occur in the string, each sub-sequence being associated with the portion of the string that starts with the first tag of the sub-sequence and ends with the last tag of the sub-sequence, the sub-sequences identified being ones that satisfy criteria for being classified as associated with a portion of the string that define Web page content constituting an entire listing, the criteria including a requirement that an identified sub-sequence be repeated in tandem, either exactly or approximately, in the string; and
return the identified sub-sequences.
Dependent claims: 13, 14, 15, 16, 17, 18, 19.
20. A computer program product, tangibly embodied in an information carrier, for generating an index for Web pages, the computer program product being operable to cause data processing apparatus to:
crawl the Internet and retrieve a string of markup language source code that includes tags;
identify sub-sequences in which tags occur in the string, each sub-sequence being associated with the portion of the string that starts with the first tag of the sub-sequence and ends with the last tag of the sub-sequence, the sub-sequences identified being ones that satisfy criteria for being classified as associated with a portion of the string that define Web page content constituting an entire listing, the criteria including a requirement that an identified sub-sequence be repeated in tandem, either exactly or approximately, in the string; and
index the source code associated with the identified sub-sequences.
Specification