Extracting information from Web pages

US 7,519,621 B2
Filed: 05/04/2004
Issued: 04/14/2009
Est. Priority Date: 05/04/2004
Status: Expired due to Fees

First Claim

1. A computer-implemented method for identifying webpage content, the method comprising:

receiving from a memory storage device a string of HTML source code that includes tags;

determining the sequence in which tags occur in the string;

using the sequence to identify one or more sub-sequences in which tags occur in the string, each sub-sequence being associated with a portion of the string that starts with the first tag of the sub-sequence and ends with the last tag of the sub-sequence;

determining whether the identified sub-sequences define webpage content constituting an entire webpage listing, the determining including:

applying a first set of criteria to filter the identified sub-sequences, the first set of criteria including a requirement that an identified sub-sequence be repeated in tandem, either exactly or approximately, in the string;

removing from further consideration sub-sequences that do not satisfy the first set of criteria;

grouping the remaining sub-sequences into groups, wherein sub-sequences are grouped together in a group when they do not overlap and are similar, as determined by a measure based on edit distance;

calculating a score for each group, the score for a group being associated with each sub-sequence in the group, the score being indicative of the likelihood that sub-sequences in the group define webpage content constituting entire webpage listings;

identifying overlapping sub-sequences between different groups, wherein identifying includes selecting each sub-sequence in a group and comparing the selected sub-sequence against sub-sequences of other groups for one or more overlapping word tokens;

removing from further consideration all identified overlapping sub-sequences between different groups except sub-sequences from the group having a highest associated score among sub-sequences currently selected; and

returning and storing in the memory storage device the sub-sequences that were not removed from further consideration.
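
The first steps of the claim (determining the tag sequence, then finding sub-sequences repeated in tandem) can be illustrated with a minimal sketch. It handles exact repetition only, and the names `TAG_RE`, `tag_sequence`, and `exact_tandem_repeats` are hypothetical helpers chosen for illustration, not taken from the patent. The regular expression is a simplification that ignores comments, scripts, and malformed markup.

```python
import re

# Hypothetical pattern (an assumption, not from the patent): matches an open
# or close tag and captures the optional "/" and the tag name.
TAG_RE = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>")


def tag_sequence(html):
    """Return the tags of `html` in document order, e.g. ['ul', 'li', '/li', '/ul']."""
    return [(slash + name).lower() for slash, name in TAG_RE.findall(html)]


def exact_tandem_repeats(tags, min_len=2, max_len=200):
    """Return (start, unit_length, copies) for runs in which a tag
    sub-sequence is immediately followed by at least one exact copy of itself.

    Only the first phase of each run is reported; rotations of the same
    repeating unit are skipped because the scan jumps past the whole run.
    """
    n = len(tags)
    found = []
    for length in range(min_len, min(max_len, n // 2) + 1):
        start = 0
        while start + 2 * length <= n:
            unit = tags[start:start + length]
            copies = 1
            # Count how many consecutive copies of `unit` follow it.
            while tags[start + copies * length:start + (copies + 1) * length] == unit:
                copies += 1
            if copies >= 2:
                found.append((start, length, copies))
                start += copies * length
            else:
                start += 1
    return found
```

For a three-item list such as `<ul><li><b>A</b></li><li><b>B</b></li><li><b>C</b></li></ul>`, the sketch reports one run: the four-tag unit `li, b, /b, /li` starting at tag index 1 and occurring three times.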

Abstract

Methods and apparatus, including computer program products, for identifying Web page content with a granularity finer than individual Web pages, e.g., finer than individual HTML documents. The invention provides a computer-implemented method for identifying Web page content. The method includes receiving a string of markup language source code that includes tags. The method includes identifying sub-sequences in which tags occur in the string. Each sub-sequence is associated with the portion of the string that starts with the first tag of the sub-sequence and ends with the last tag of the sub-sequence. The sub-sequences identified are ones that satisfy criteria for being classified as associated with a portion of the string that defines Web page content constituting an entire listing. The criteria include a requirement that an identified sub-sequence be repeated in tandem, either exactly or approximately, in the string. The method includes returning the identified sub-sequences.
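
The abstract allows a sub-sequence to be repeated "approximately", and the claims measure similarity with edit distance. The sketch below shows one way such a test could look: a standard Levenshtein distance over tag tokens, normalized by the longer sequence. The function names and the `max_ratio` threshold of 0.25 are assumptions made for illustration; the text here does not state a threshold.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences of tag tokens."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        cur = [i]
        for j, y in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,              # delete x
                cur[j - 1] + 1,           # insert y
                prev[j - 1] + (x != y),   # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]


def approximately_repeated(first, second, max_ratio=0.25):
    """True if two adjacent tag sub-sequences are similar enough to count
    as an approximate tandem repeat. `max_ratio` is a hypothetical threshold."""
    longest = max(len(first), len(second))
    if longest == 0:
        return False
    return edit_distance(first, second) / longest <= max_ratio
```

Under these assumptions, `['li', 'a', '/a', '/li']` followed by `['li', 'a', 'img', '/a', '/li']` counts as an approximate repeat (one insertion over five tags), while a following `['div', 'p', '/p', '/div']` does not.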

10 Claims

1. A computer-implemented method for identifying webpage content, the method comprising:
- receiving from a memory storage device a string of HTML source code that includes tags;
  
  determining the sequence in which tags occur in the string;
  
  using the sequence to identify one or more sub-sequences in which tags occur in the string, each sub-sequence being associated with a portion of the string that starts with the first tag of the sub-sequence and ends with the last tag of the sub-sequence;
  
  determining whether the identified sub-sequences define webpage content constituting an entire webpage listing, the determining including:
  
  applying a first set of criteria to filter the identified sub-sequences, the first set of criteria including a requirement that an identified sub-sequence be repeated in tandem, either exactly or approximately, in the string;
  
  removing from further consideration sub-sequences that do not satisfy the first set of criteria;
  
  grouping the remaining sub-sequences into groups, wherein sub-sequences are grouped together in a group when they do not overlap and are similar, as determined by a measure based on edit distance;
  
  calculating a score for each group, the score for a group being associated with each sub-sequence in the group, the score being indicative of the likelihood that sub-sequences in the group define webpage content constituting entire webpage listings;
  
  identifying overlapping sub-sequences between different groups, wherein identifying includes selecting each sub-sequence in a group and comparing the selected sub-sequence against sub-sequences of other groups for one or more overlapping word tokens;
  
  removing from further consideration all identified overlapping sub-sequences between different groups except sub-sequences from the group having a highest associated score among sub-sequences currently selected; and
  
  returning and storing in the memory storage device the sub-sequences that were not removed from further consideration.
- Dependent claims (2, 3, 4, 5)
  - 2. The method of claim 1, wherein grouping the remaining sub-sequences into groups includes:
    - for each group, identifying any portion of the string that is delimited by and does not include source code associated with the sub-sequences of the group, and identifying and adding to the group one or more sub-sequences still under consideration and associated with the identified portion of string.
  - 3. The method of claim 2, wherein determining the sequence in which tags occur in the string includes:
    - generating a tokenized version of the string, the tokenized version including tag tokens, each tag token representing a corresponding tag in the string, the sequence and sub-sequences of tag tokens in the tokenized version being the same as the sequence and sub-sequences of tags in the string; and
      
      using the tokenized version to identify the sequence and sub-sequences in which tags occur in the string.
  - 4. The method of claim 3, wherein:
    - the tokenized version includes one or more word tokens that represent source code defining webpage content; and
      
      identifying a portion of the string that is delimited by source code associated with the sub-sequences of a group includes identifying a word token that represents the portion of the string.
  - 5. The method of claim 1, wherein removing from consideration sub-sequences that do not satisfy the first set of criteria includes removing sub-sequences that include any combination of:
    - a close tag of a first type without including a preceding open tag of the first type;
      
      an open tag of the first type without including a succeeding close tag of the first type;
      
      only one tag or more than 200 tags;
      
      another sub-sequence that is approximately repeated only once in tandem within the sub-sequence being considered;
      
      another sub-sequence that is exactly repeated in tandem at least once within the sub-sequence being considered; and
      
      a portion of source code that represents webpage content and, furthermore, is less than 20 characters.
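
Some of the removal criteria in claim 5 can be sketched as a simple predicate. The version below covers the unmatched-tag checks, the bounds on tag count, and the minimum content length; the two criteria about nested tandem repeats are left out. All names are hypothetical, the `VOID_TAGS` exemption for elements without close tags is an added assumption, and treating "content" as the stripped text of the whole candidate is one possible reading of the claim.

```python
# Assumption: elements that never take a close tag are exempt from matching.
VOID_TAGS = frozenset({"br", "hr", "img", "input", "link", "meta"})


def has_unmatched_tag(tags):
    """True if a close tag has no preceding open tag of the same type, or
    an open tag has no succeeding close tag of the same type."""
    open_counts = {}
    for tag in tags:
        if tag in VOID_TAGS:
            continue
        if tag.startswith("/"):
            name = tag[1:]
            if open_counts.get(name, 0) == 0:
                return True  # close tag without a preceding open tag
            open_counts[name] -= 1
        else:
            open_counts[tag] = open_counts.get(tag, 0) + 1
    # Any open tag left uncounted has no succeeding close tag.
    return any(count > 0 for count in open_counts.values())


def should_remove(tags, content, min_tags=2, max_tags=200, min_chars=20):
    """Apply a subset of the claim 5 criteria to one candidate sub-sequence.

    `tags` is the candidate's tag sequence and `content` is the text it
    encloses. The limits mirror the figures in the claim: only one tag or
    more than 200 tags, and content shorter than 20 characters.
    """
    if not (min_tags <= len(tags) <= max_tags):
        return True
    if has_unmatched_tag(tags):
        return True
    return len(content.strip()) < min_chars
```

A candidate such as `['li', 'a', '/a', '/li']` enclosing "Senior engineer, Mountain View, CA" passes this filter, while the same tags enclosing "short" would be removed.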

6. A computer program product, tangibly embodied in a machine-readable storage device, for identifying webpage content, the computer program product including instructions to cause data processing apparatus to:
- receive a string of HTML source code that includes tags;
  
  determine the sequence in which the tags occur in the string;
  
  use the sequence to identify one or more sub-sequences in which the tags occur in the string, each sub-sequence being associated with a portion of the string that starts with the first tag of the sub-sequence and ends with the last tag of the sub-sequence;
  
  determine whether the identified sub-sequences define webpage content constituting an entire webpage listing, the determining including:
  
  applying a first set of criteria to filter the identified sub-sequences, the first set of criteria including a requirement that an identified sub-sequence be repeated in tandem, either exactly or approximately, in the string;
  
  removing from further consideration sub-sequences that do not satisfy the first set of criteria;
  
  grouping the remaining sub-sequences into groups, wherein sub-sequences are grouped together in a group when they do not overlap and are similar, as determined by a measure based on edit distance;
  
  calculating a score for each group, the score for a group being associated with each sub-sequence in the group, the score being indicative of the likelihood that sub-sequences in the group define webpage content constituting entire webpage listings;
  
  identifying overlapping sub-sequences between different groups, wherein identifying includes selecting each sub-sequence in a group and comparing the selected sub-sequence against sub-sequences of other groups for one or more overlapping word tokens;
  
  removing from further consideration all identified overlapping sub-sequences between different groups except sub-sequences from the group having a highest associated score among sub-sequences currently selected; and
  
  returning and storing the sub-sequences that were not removed from further consideration.
- Dependent claims (7, 8, 9, 10)
  - 7. The product of claim 6, wherein the instructions to group the remaining sub-sequences into groups include instructions to:
    - for each group, identify any portion of the string that is delimited by and does not include source code associated with the sub-sequences of the group, and identify and add to the group one or more sub-sequences still under consideration and associated with the identified portion of string.
  - 8. The product of claim 7, wherein the instructions to determine the sequence in which tags occur in the string includes instructions to:
    - generate a tokenized version of the string, the tokenized version including tag tokens, each tag token representing a corresponding tag in the string, the sequence and sub-sequences of tag tokens in the tokenized version being the same as the sequence and sub-sequences of tags in the string; and
      
      use the tokenized version to identify the sequence and sub-sequences in which tags occur in the string.
  - 9. The product of claim 8, wherein:
    - the tokenized version includes one or more word tokens that represent source code defining webpage content; and
      
      identifying a portion of the string that is delimited by source code associated with the sub-sequences of a group includes identifying a word token that represents the portion of the string.
  - 10. The product of claim 6, wherein removing from consideration sub-sequences that do not satisfy the first set of criteria includes removing sub-sequences that include any combination of:
    - a close tag of a first type without including a preceding open tag of the first type;
      
      an open tag of the first type without including a succeeding close tag of the first type;
      
      only one tag or more than 200 tags;
      
      another sub-sequence that is approximately repeated only once in tandem within the sub-sequence being considered;
      
      another sub-sequence that is exactly repeated in tandem at least once within the sub-sequence being considered; and
      
      a portion of source code that represents webpage content and, furthermore, is less than 20 characters.
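
The final steps of the independent claims (comparing sub-sequences across groups and keeping only those from the highest-scoring group where they overlap) can be read as a greedy pass over groups in descending score order. The sketch below is one such reading, not necessarily the patented procedure. It assumes each sub-sequence is a half-open span of token positions `(start, end)` and that scores are already computed; the claims do not give a scoring formula, so the scores here are plain inputs. `overlaps` and `resolve_overlaps` are hypothetical names.

```python
def overlaps(a, b):
    """Two half-open token spans (start, end) overlap if they share a token."""
    return a[0] < b[1] and b[0] < a[1]


def resolve_overlaps(groups):
    """Keep, among overlapping spans from different groups, only the spans
    belonging to the higher-scoring group.

    `groups` is a list of (score, spans) pairs. The result is a list of
    surviving spans per group, in the same order as the input.
    """
    survivors = [[] for _ in groups]
    kept = []  # (span, group_index) pairs accepted so far
    # Visit groups from highest to lowest score; ties keep input order.
    order = sorted(range(len(groups)), key=lambda i: groups[i][0], reverse=True)
    for i in order:
        for span in groups[i][1]:
            clash = any(overlaps(span, other) for other, g in kept if g != i)
            if not clash:
                kept.append((span, i))
                survivors[i].append(span)
    return survivors
```

With a group scored 0.9 holding spans `(0, 5)` and `(5, 10)` and a group scored 0.4 holding `(3, 7)` and `(12, 15)`, the lower-scoring span `(3, 7)` is dropped because it overlaps `(0, 5)`, and `(12, 15)` survives.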

Current Assignee: PageBites, Inc. (Singularity Im, Inc.)
Original Assignee: PageBites, Inc. (Singularity Im, Inc.)
Inventors: Harik, Ralph
Primary Examiner(s): Rones; Charles
Assistant Examiner(s): Khoshnoodi; Fariborz
Application Number: US10/838,982
Publication Number: US 20050251536A1
Time in Patent Office: 1,806 Days
Field of Search: 707/509, 707/517, 707/510, 707/518, 707/513, 707/519, 707/200, 707/3
US Class Current: 1/1
CPC Class Codes:
G06F 40/143 Markup, e.g. Standard Gener...
G06F 40/284 Lexical analysis, e.g. toke...
