Extracting information from Web pages

US 20050251536A1
Filed: 05/04/2004
Published: 11/10/2005
Est. Priority Date: 05/04/2004
Status: Active Grant

Abstract

Methods and apparatus, including computer program products, for identifying Web page content with a granularity finer than individual Web pages, e.g., finer than individual HTML documents. The invention provides a computer-implemented method for identifying Web page content. The method includes receiving a string of markup language source code that includes tags. The method includes identifying sub-sequences in which tags occur in the string. Each sub-sequence is associated with the portion of the string that starts with the first tag of the sub-sequence and ends with the last tag of the sub-sequence. The sub-sequences identified are ones that satisfy criteria for being classified as associated with a portion of the string that define Web page content constituting an entire listing. The criteria includes a requirement that an identified sub-sequence be repeated in tandem, either exactly or approximately, in the string. The method includes returning the identified sub-sequences.
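The tag sequence and the exact form of the tandem-repeat requirement described in the abstract can be illustrated with a minimal Python sketch. The names `tag_sequence`, `exact_tandem_repeats`, and `portion`, the regular expression, and the `min_len` and `max_len` parameters are assumptions made for illustration and do not come from the patent. Only exact repetition is handled here; approximate repetition would need a similarity measure.

```python
import re

# Simplified tag matcher (an assumption): ignores comments, scripts and CDATA.
TAG_RE = re.compile(r"<\s*(/?)\s*([A-Za-z][A-Za-z0-9]*)[^>]*>")

def tag_sequence(html):
    """Return (tag_token, start_offset, end_offset) triples in document order."""
    return [(m.group(1) + m.group(2).lower(), m.start(), m.end())
            for m in TAG_RE.finditer(html)]

def exact_tandem_repeats(tokens, min_len=2, max_len=200):
    """Return (start_index, unit_length, copies) for units of tags that
    repeat back to back at least twice."""
    n = len(tokens)
    found = []
    for length in range(min_len, min(max_len, n // 2) + 1):
        i = 0
        while i + 2 * length <= n:
            unit = tokens[i:i + length]
            copies = 1
            while tokens[i + copies * length:i + (copies + 1) * length] == unit:
                copies += 1
            if copies >= 2:
                found.append((i, length, copies))
                i += copies * length  # skip past the whole run
            else:
                i += 1
    return found

def portion(html, seq, start, length):
    """Source text from the first tag to the last tag of a sub-sequence."""
    return html[seq[start][1]:seq[start + length - 1][2]]

html = ("<ul><li><a href='x'>One</a></li><li><a href='y'>Two</a></li>"
        "<li><a href='z'>Three</a></li></ul>")
seq = tag_sequence(html)
tokens = [t for t, _, _ in seq]
for start, length, copies in exact_tandem_repeats(tokens):
    print(copies, "copies; first:", portion(html, seq, start, length))
```

In this sketch each repeated unit maps back to the span of source text between its first and last tag, which is how the abstract associates a sub-sequence with a portion of the string.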

20 Claims

1. A computer-implemented method for identifying Web page content, the method comprising:
- receiving a string of HTML source code that includes tags;
  
  determining the sequence in which tags occur in the string;
  
  using the sequence to identify sub-sequences in which tags occur in the string, each sub-sequence being associated with the portion of the string that starts with the first tag of the sub-sequence and ends with the last tag of the sub-sequence;
  
  removing from further consideration sub-sequences that do not satisfy criteria for being classified as associated with a portion of the string that define Web page content constituting an entire listing;
  
  grouping into groups sub-sequences that were not removed in the previous step, wherein sub-sequences that are similar, as determined by a measure based on edit distance, and do not overlap are grouped together in a group;
  
  calculating a score for each group, the score for a group being indicative of the likelihood that sub-sequences in the group are associated with portions of the string that define Web page content constituting entire listings, the score for a group being associated with each sub-sequence in the group;
  
  identifying each portion of the string that represents Web page content and is an overlap, a portion of the string being an overlap when it is associated with more than one sub-sequence;
  
  for each portion of the string identified as an overlap, selecting sub-sequences associated with the portion of the string and removing from further consideration all the currently selected sub-sequences except the sub-sequence having a highest associated score among sub-sequences currently selected; and
  
  returning the sub-sequences that were not removed from further consideration.
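The grouping, scoring, and overlap-resolution steps of claim 1 can be sketched as follows. This is only an illustration under stated assumptions: the claim does not say how the score is computed, so `score` below is a hypothetical placeholder, the `threshold` of 0.25 is arbitrary, and the greedy passes in `group_candidates` and `resolve_overlaps` are one possible reading of the claim language rather than the patented procedure. Candidates are half-open `(start, end)` index ranges into a list of tag tokens.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution
        prev = cur
    return prev[-1]

def similar(a, b, threshold=0.25):
    """Assumed similarity test: edit distance normalised by the longer length."""
    longest = max(len(a), len(b)) or 1
    return edit_distance(a, b) / longest <= threshold

def overlaps(s, t):
    return s[0] < t[1] and t[0] < s[1]

def group_candidates(cands, tokens):
    """Greedily put each candidate into the first group whose members it does
    not overlap and whose first member it resembles."""
    groups = []
    for c in sorted(cands):
        for g in groups:
            first = g[0]
            if (all(not overlaps(c, m) for m in g)
                    and similar(tokens[c[0]:c[1]], tokens[first[0]:first[1]])):
                g.append(c)
                break
        else:
            groups.append([c])
    return groups

def score(group):
    """Hypothetical score: tokens covered, weighted by repeats; 0 for singletons."""
    return (len(group) - 1) * sum(end - start for start, end in group)

def resolve_overlaps(groups):
    """Where candidates overlap, keep only the one with the highest group score."""
    scored = sorted(((score(g), c) for g in groups for c in g),
                    key=lambda sc: (-sc[0], sc[1]))
    kept = []
    for _, c in scored:
        if all(not overlaps(c, k) for k in kept):
            kept.append(c)
    return sorted(kept)
```

With tag tokens for a three-item list and candidates `[(1, 5), (5, 9), (9, 13), (1, 9), (2, 4)]`, the three list items fall into one group, and the two candidates that overlap them are discarded.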

2. A computer-implemented method for identifying Web page content, the method comprising:
- receiving a string of markup language source code that includes tags;
  
  identifying sub-sequences in which tags occur in the string, each sub-sequence being associated with the portion of the string that starts with the first tag of the sub-sequence and ends with the last tag of the sub-sequence, the sub-sequences identified being ones that satisfy criteria for being classified as associated with a portion of the string that define Web page content constituting an entire listing, the criteria including a requirement that an identified sub-sequence be repeated in tandem, either exactly or approximately, in the string; and
  
  returning the identified sub-sequences.
  - 3. The method of claim 2, wherein identifying sub-sequences includes:
    - determining the sequence in which tags occur in the string;
      
      using the sequence to identify sub-sequences in which tags occur in the string; and
      
      removing from further consideration all sub-sequences that satisfy criteria for being classified as not associated with a portion of the string that define Web page content constituting an entire listing.
  - 4. The method of claim 3, wherein identifying sub-sequences includes:
    - grouping into groups sub-sequences that were not removed from consideration, wherein sub-sequences that are similar, as determined by a measure based on edit distance, and do not overlap are grouped together in a group; and
      
      calculating a score for each group, the score for a group being indicative of the likelihood that sub-sequences in the group are associated with portions of the string that define Web page content constituting entire listings, the score for a group being associated with each sub-sequence in the group.
  - 5. The method of claim 4, wherein identifying sub-sequences includes:
    - identifying each portion of the string that represents Web page content and is an overlap, a portion of the string being an overlap when it is associated with more than one sub-sequence; and
      
      for each portion of the string identified as an overlap, selecting sub-sequences associated with the portion of the string and removing from further consideration all the currently selected sub-sequences except the sub-sequence having a highest associated score among sub-sequences currently selected.
  - 6. The method of claim 5, wherein identifying sub-sequences includes:
    - for each group, identifying any portion of the string that is delimited by and does not include source code associated with the sub-sequences of the group, and identifying and adding to the group one or more sub-sequences still under consideration that is associated with the identified portion of string.
  - 7. The method of claim 6, wherein identifying sub-sequences includes:
    - generating a tokenized version of the string, the tokenized version including tag tokens, each of which representing a corresponding tag in the string, the sequence and sub-sequences of tag tokens in the tokenized version being the same as the sequence and sub-sequences of tags in the string; and
      
      using the tokenized version to identify the sequence and sub-sequences in which tags occur in the string.
  - 8. The method of claim 7, wherein the tokenized version includes word token that represent source code defining Web page content;
    - and identifying a portion of the string that is delimited by source code associated with the sub-sequences of a group includes identifying a word token that represents the portion.
  - 9. The method of claim 3, wherein criteria for a sub-sequence being considered to be classified as not associated with a portion of the string that define Web page content constituting an entire listing include any combination of:
    - including a close tag of a first type without including a preceding open tag of the first type;
      
      including an open tag of the first type without including a succeeding close tag of the first type;
      
      including only one tag or more than 200 tags;
      
      including another sub-sequence that is approximately repeated only once in tandem within the sub-sequence being considered;
      
      including another sub-sequence that is exactly repeated in tandem at least once within the sub-sequence being considered; and
      
      including a portion of source code that represents Web page content and, furthermore, is less than 20 characters.
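The tokenized version of claims 7 and 8 and some of the rejection criteria of claim 9 can be sketched in Python as below. Several choices here are assumptions and not taken from the patent: the tokenizing regular expression, the `VOID` set of tags treated as needing no close tag, the minimum repeated-unit length of two tags, and the reading of the 20-character criterion as applying to the total text content of a candidate. The "approximately repeated only once" criterion is omitted, since claim 9 allows any combination of the criteria.

```python
import re

# Either a tag (groups 1 and 2) or a run of text between tags (group 3).
TOKEN_RE = re.compile(r"<\s*(/?)\s*([A-Za-z][A-Za-z0-9]*)[^>]*>|([^<]+)")
VOID = {"br", "hr", "img", "input", "meta", "link"}  # assumed: no close tag expected

def tokenize(html):
    """Return ('tag', name) and ('word', text) tokens in document order."""
    tokens = []
    for m in TOKEN_RE.finditer(html):
        if m.group(2):
            tokens.append(("tag", m.group(1) + m.group(2).lower()))
        elif m.group(3) and m.group(3).strip():
            tokens.append(("word", m.group(3).strip()))
    return tokens

def is_unbalanced(tags):
    """True for a close tag with no preceding open tag, or an open tag
    with no succeeding close tag."""
    counts = {}
    for t in tags:
        if t.startswith("/"):
            name = t[1:]
            if counts.get(name, 0) == 0:
                return True
            counts[name] -= 1
        elif t not in VOID:
            counts[t] = counts.get(t, 0) + 1
    return any(counts.values())

def has_exact_tandem_repeat(tags, min_unit=2):
    """True if some unit of at least min_unit tags repeats back to back."""
    n = len(tags)
    for length in range(min_unit, n // 2 + 1):
        for i in range(n - 2 * length + 1):
            if tags[i:i + length] == tags[i + length:i + 2 * length]:
                return True
    return False

def reject(tokens, min_tags=2, max_tags=200, min_chars=20):
    """Apply the sketched subset of the claim 9 criteria to one candidate."""
    tags = [v for kind, v in tokens if kind == "tag"]
    text = " ".join(v for kind, v in tokens if kind == "word")
    return (not (min_tags <= len(tags) <= max_tags)
            or is_unbalanced(tags)
            or has_exact_tandem_repeat(tags)
            or len(text) < min_chars)

one = "<li><a href='/1'>Blue bicycle, lightly used</a></li>"
print(reject(tokenize(one)))        # a single listing is kept
print(reject(tokenize(one + one)))  # two listings in one candidate are rejected
```

Rejecting candidates that contain an internal tandem repeat keeps a span covering several listings from being mistaken for one entire listing.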

10. A computer-implemented method for generating an index for Web pages, the method comprising:
- crawling the Internet and retrieving a string of markup language source code that includes tags;
  
  identifying sub-sequences in which tags occur in the string, each sub-sequence being associated with the portion of the string that starts with the first tag of the sub-sequence and ends with the last tag of the sub-sequence, the sub-sequences identified being ones that satisfy criteria for being classified as associated with a portion of the string that define Web page content constituting an entire listing, the criteria including a requirement that an identified sub-sequence be repeated in tandem, either exactly or approximately, in the string; and
  
  indexing the source code associated with the identified sub-sequences.
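The indexing step of claim 10 could, for example, take the form of a small inverted index keyed by listing rather than by page. The sketch below assumes the listings have already been identified and that no crawling is performed; the function names, the word-splitting rule, and the example URL are hypothetical and do not come from the patent.

```python
import re
from collections import defaultdict

def strip_tags(fragment):
    """Replace every tag with a space so only the visible text remains."""
    return re.sub(r"<[^>]*>", " ", fragment)

def index_listings(index, url, listings):
    """Map each word to the (url, listing_position) pairs that contain it."""
    for position, fragment in enumerate(listings):
        words = set(re.findall(r"[a-z0-9]+", strip_tags(fragment).lower()))
        for word in words:
            index[word].add((url, position))

index = defaultdict(set)
index_listings(index, "http://example.com/ads", [
    "<li><a href='/1'>Blue bicycle</a></li>",
    "<li><a href='/2'>Red bicycle</a></li>",
])
print(sorted(index["bicycle"]))
```

Keying entries by listing position is what gives the finer-than-page granularity mentioned in the abstract: a query for "blue" resolves to one listing, not to the whole page.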

11. A computer program product, tangibly embodied in an information carrier, for identifying Web page content, the computer program product being operable to cause data processing apparatus to:
- receive a string of HTML source code that includes tags;
  
  determine the sequence in which tags occur in the string;
  
  use the sequence to identify sub-sequences in which tags occur in the string, each sub-sequence being associated with the portion of the string that starts with the first tag of the sub-sequence and ends with the last tag of the sub-sequence;
  
  remove from further consideration sub-sequences that do not satisfy criteria for being classified as associated with a portion of the string that define Web page content constituting an entire listing;
  
  group into groups sub-sequences that were not removed in the previous step, wherein sub-sequences that are similar, as determined by a measure based on edit distance, and do not overlap are grouped together in a group;
  
  calculate a score for each group, the score for a group being indicative of the likelihood that sub-sequences in the group are associated with portions of the string that define Web page content constituting entire listings, the score for a group being associated with each sub-sequence in the group;
  
  identify each portion of the string that represents Web page content and is an overlap, a portion of the string being an overlap when it is associated with more than one sub-sequence;
  
  for each portion of the string identified as an overlap, select sub-sequences associated with the portion of the string and remove from further consideration all the currently selected sub-sequences except the sub-sequence having a highest associated score among sub-sequences currently selected; and
  
  return the sub-sequences that were not removed from further consideration.

12. A computer program product, tangibly embodied in an information carrier, for identifying Web page content, the product comprising instructions operable to cause data processing apparatus to:
- receive a string of markup language source code that includes tags;
  
  identify sub-sequences in which tags occur in the string, each sub-sequence being associated with the portion of the string that starts with the first tag of the sub-sequence and ends with the last tag of the sub-sequence, the sub-sequences identified being ones that satisfy criteria for being classified as associated with a portion of the string that define Web page content constituting an entire listing, the criteria including a requirement that an identified sub-sequence be repeated in tandem, either exactly or approximately, in the string; and
  
  return the identified sub-sequences.
  - 13. The product of claim 12, wherein the instructions to identify sub-sequences includes instructions to:
    - determine the sequence in which tags occur in the string;
      
      use the sequence to identify sub-sequences in which tags occur in the string; and
      
      remove from further consideration all sub-sequences that satisfy criteria for being classified as not associated with a portion of the string that define Web page content constituting an entire listing.
  - 14. The product of claim 13, wherein the instructions to identify sub-sequences includes instructions to:
    - group into groups sub-sequences that were not removed from consideration, wherein sub-sequences that are similar, as determined by a measure based on edit distance, and do not overlap are grouped together in a group; and
      
      calculate a score for each group, the score for a group being indicative of the likelihood that sub-sequences in the group are associated with portions of the string that define Web page content constituting entire listings, the score for a group being associated with each sub-sequence in the group.
  - 15. The product of claim 14, wherein the instructions to identify sub-sequences includes instructions to:
    - identify each portion of the string that represents Web page content and is an overlap, a portion of the string being an overlap when it is associated with more than one sub-sequence; and
      
      for each portion of the string identified as an overlap, select sub-sequences associated with the portion of the string and remove from further consideration all the currently selected sub-sequences except the sub-sequence having a highest associated score among sub-sequences currently selected.
  - 16. The product of claim 15, wherein the instructions to identify sub-sequences includes instructions to:
    - for each group, identify any portion of the string that is delimited by and does not include source code associated with the sub-sequences of the group, and identify and add to the group one or more sub-sequences still under consideration that is associated with the identified portion of string.
  - 17. The product of claim 16, wherein the instructions to identify sub-sequences includes instructions to:
    - generate a tokenized version of the string, the tokenized version including tag tokens, each of which representing a corresponding tag in the string, the sequence and sub-sequences of tag tokens in the tokenized version being the same as the sequence and sub-sequences of tags in the string; and
      
      use the tokenized version to identify the sequence and sub-sequences in which tags occur in the string.
  - 18. The product of claim 17, wherein:
    - the tokenized version includes word token that represent source code defining Web page content; and
      
      identifying a portion of the string that is delimited by source code associated with the sub-sequences of a group includes identifying a word token that represents the portion.
  - 19. The product of claim 13, wherein criteria for a sub-sequence being considered to be classified as not associated with a portion of the string that define Web page content constituting an entire listing include any combination of:
    - including a close tag of a first type without including a preceding open tag of the first type;
      
      including an open tag of the first type without including a succeeding close tag of the first type;
      
      including only one tag or more than 200 tags;
      
      including another sub-sequence that is approximately repeated only once in tandem within the sub-sequence being considered;
      
      including another sub-sequence that is exactly repeated in tandem at least once within the sub-sequence being considered; and
      
      including a portion of source code that represents Web page content and, furthermore, is less than 20 characters.

20. A computer program product, tangibly embodied in an information carrier, for generating an index for Web pages, the computer program product being operable to cause data processing apparatus to:
- crawl the Internet and retrieve a string of markup language source code that includes tags;
  
  identify sub-sequences in which tags occur in the string, each sub-sequence being associated with the portion of the string that starts with the first tag of the sub-sequence and ends with the last tag of the sub-sequence, the sub-sequences identified being ones that satisfy criteria for being classified as associated with a portion of the string that define Web page content constituting an entire listing, the criteria including a requirement that an identified sub-sequence be repeated in tandem, either exactly or approximately, in the string; and
  
  index the source code associated with the identified sub-sequences.

Current Assignee: PageBites, Inc. (Singularity Im, Inc.)
Original Assignee: PageBites, Inc. (Singularity Im, Inc.)
Inventors: Harik, Ralph
Granted Patent: US 7,519,621 B2
Field of Search
US Class Current: 1/1
CPC Class Codes:
G06F 40/143 Markup, e.g. Standard Gener...
G06F 40/284 Lexical analysis, e.g. toke...
