System of generating new schema based on selective HTML elements

US 9,460,231 B2
Filed: 03/28/2011
Issued: 10/04/2016
Est. Priority Date: 03/26/2010
Status: Active Grant

Abstract

The present invention provides a system which is able to detect similar web page elements which are described in mark-up language, such that the content of those elements can be captured. Text content may then be sent to a text classifier for further analysis.

9 Claims

1. A method of automatically generating a mark-up language schema, the method comprising the steps of:
- a) receiving a plurality of training samples, the or each training sample identifying one or more mark-up language elements stored within an online data resource;
  
  b) for each of the plurality of received training samples, automatically generating a candidate mark-up language schema;
  
  c) for each of the plurality of candidate mark-up language schema, comparing that candidate schema with the remainder of the candidate schemas to determine how many of the schema match and selecting a candidate mark-up language schema if the proportion of matching candidate schema exceeds a predetermined threshold;
  
  d) if none of the plurality of candidate mark-up language schema matches a sufficient number of the other schema, generating a further mark-up language schema and executing a further instance of step c); and
  
  e) reiterating step d) until one of the candidate schemas matches with a sufficient number of the other schema.
- 2. A method as claimed in claim 1, wherein the or each training sample comprises a uniform resource locator.
- 3. A method as claimed in claim 1, wherein the or each training sample further comprises a text sequence.
- 4. A method of analysing mark-up language text, the method comprising the steps of:
  - i) applying a mark-up language schema to an online data resource, the mark-up language schema comprising a plurality of mark-up language elements;
    
    ii) identifying one or more data elements comprised within the online data resource, the or each data elements being associated with a particular mark-up language element; and
    
    iii) extracting those data elements identified in step ii), wherein the mark-up language schema is generated using a method in accordance with claim 1.
- 5. A non-transitory computer readable storage medium storing computer executable code for performing a method according to claim 1.
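
As an illustration only, steps a) to e) of claim 1 can be read as a consensus vote over candidate schemas. The Python sketch below is not taken from the patent: the schema representation (any hashable value), the use of plain equality as the "match" test, the callables `make_candidate` and `make_further`, and the `max_rounds` safety limit are all assumptions introduced for the example.

```python
from typing import Callable, Hashable, List, Optional, Sequence

Schema = Hashable  # assumption: a schema is any comparable, hashable value


def select_schema(candidates: Sequence[Schema], threshold: float) -> Optional[Schema]:
    """Step c): return the first candidate that matches more than
    `threshold` (a proportion between 0 and 1) of the other candidates."""
    for i, candidate in enumerate(candidates):
        others = [c for j, c in enumerate(candidates) if j != i]
        if not others:
            continue
        # Assumption: two schemas "match" when they compare equal.
        matches = sum(1 for other in others if other == candidate)
        if matches / len(others) > threshold:
            return candidate
    return None


def generate_schema(
    samples: Sequence[object],
    make_candidate: Callable[[object], Schema],
    make_further: Callable[[List[Schema]], Schema],
    threshold: float = 0.5,
    max_rounds: int = 10,
) -> Optional[Schema]:
    """Steps b) to e). `max_rounds` is a hypothetical guard so the loop ends."""
    candidates = [make_candidate(s) for s in samples]  # step b)
    for _ in range(max_rounds):
        chosen = select_schema(candidates, threshold)  # step c)
        if chosen is not None:
            return chosen
        candidates.append(make_further(candidates))    # step d), repeated per e)
    return None


# Hypothetical training samples: each already carries an element path.
samples = [
    ("div.post", "p.body"),
    ("div.post", "p.body"),
    ("div.sidebar", "span.ad"),
    ("div.post", "p.body"),
]
chosen = generate_schema(samples, make_candidate=lambda s: s,
                         make_further=lambda cands: cands[0])
print(chosen)
```

In this sketch the first candidate matches two of the three others, so it is selected without generating any further schema. A real system would need a richer notion of matching than equality.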

6. An apparatus for generating a mark-up language schema, the apparatus comprising:
- a processing system including one or more processors and one or more storage memories, the processing system being configured to perform at least the steps of:
  
  a) receiving a plurality of training samples, the or each training sample identifying one or more mark-up language elements stored within an online data resource;
  
  b) for each of the plurality of received training samples, automatically generating a candidate mark-up language schema;
  
  c) for each of the plurality of candidate mark-up language schema, comparing that candidate schema with the remainder of the candidate schemas to determine how many of the schema match and selecting a candidate mark-up language schema if the proportion of matching candidate schema exceeds a predetermined threshold;
  
  d) if none of the plurality of candidate mark-up language schema matches a sufficient number of the other schema, generating a further mark-up language schema and executing a further instance of step c);
  
  e) reiterating step d) until one of the candidate schemas matches with a sufficient number of the other schema.
- 8. An apparatus as claimed in claim 6, wherein the or each training sample comprises a uniform resource locator.
- 9. An apparatus as claimed in claim 6, wherein the or each training sample further comprises a text sequence.

7. An apparatus for analysing mark-up language text, the apparatus comprising:
- a processing system including one or more processors and one or more storage memories, the processing system being configured to perform at least the steps of:
  
  a) receiving a plurality of training samples, the or each training sample identifying one or more mark-up language elements stored within an online data resource;
  
  b) for each of the plurality of received training samples, automatically generating a candidate mark-up language schema;
  
  c) for each of the plurality of candidate mark-up language schema, comparing that candidate schema with the remainder of the candidate schemas to determine how many of the schema match and selecting a candidate mark-up language schema if the proportion of matching candidate schema exceeds a predetermined threshold;
  
  d) if none of the plurality of candidate mark-up language schema matches a sufficient number of the other schema, generating a further mark-up language schema and executing a further instance of step c); and
  
  e) reiterating step d) until one of the candidate schemas matches with a sufficient number of the other schema;
  
  f) applying a mark-up language schema to an online data resource, the mark-up language schema comprising a plurality of mark-up language elements;
  
  g) identifying one or more data elements comprised within the online data resource, the or each data elements being associated with a particular mark-up language element; and
  
  h) extracting those data elements identified in step g);
  
  wherein the mark-up language schema is generated using steps a)-e).
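
Steps f) to h) of claim 7 (and steps i) to iii) of claim 4) describe applying a schema to a page and extracting the associated data elements. The following Python sketch shows one possible reading using only the standard library's `html.parser`. The schema format (a set of `(tag, class)` pairs), exact matching on the whole `class` attribute, and the names `SchemaExtractor` and `extract` are assumptions made for the example, not details from the patent. It also assumes schema elements are not void elements such as `img`.

```python
from html.parser import HTMLParser


class SchemaExtractor(HTMLParser):
    """Collects the text content of elements named in a schema."""

    def __init__(self, schema):
        super().__init__()
        self.schema = set(schema)   # hypothetical format: {(tag, class), ...}
        self.capture_tag = None     # tag name of the element being captured
        self.depth = 0              # nesting depth of that same tag name
        self.extracted = []

    def handle_starttag(self, tag, attrs):
        if self.capture_tag is None:
            # Identify an element associated with a schema entry.
            if (tag, dict(attrs).get("class")) in self.schema:
                self.capture_tag, self.depth = tag, 1
                self.extracted.append("")
        elif tag == self.capture_tag:
            self.depth += 1         # nested element with the same tag name

    def handle_endtag(self, tag):
        if tag == self.capture_tag:
            self.depth -= 1
            if self.depth == 0:
                self.capture_tag = None

    def handle_data(self, data):
        # Extract text only while inside a matched element.
        if self.capture_tag is not None:
            self.extracted[-1] += data


def extract(html, schema):
    """Apply `schema` to `html` and return the text of each matched element."""
    parser = SchemaExtractor(schema)
    parser.feed(html)
    parser.close()
    return [text.strip() for text in parser.extracted]


page = (
    '<div class="post"><p class="body">Hello <b>world</b></p>'
    '<p class="ad">Buy</p><p class="body">Second</p></div>'
)
texts = extract(page, {("p", "body")})
print(texts)
```

The extracted strings are the kind of text content that, per the abstract, may then be passed to a text classifier.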

Current Assignee: British Telecommunications PLC (BT Group PLC)
Original Assignee: British Telecommunications PLC (BT Group PLC)
Inventors: Thompson, Simon G; Nguyen, Duong T; Thint, Marcus Alfred; Gharib, Hamid
Primary Examiner(s): Paula, Cesar
Assistant Examiner(s): Huang, Jian
Application Number: US13/637,483
Publication Number: US 20130019163A1
Time in Patent Office: 2,017 Days
Field of Search: 715/234, 707/3
US Class Current: 1/1
CPC Class Codes:
G06F 16/00   Information retrieval; Data...
G06F 16/80   of semi-structured data, e....
G06F 17/00   Digital computing or data p...
G06F 40/143   Markup, e.g. Standard Gener...
G06F 40/154   Tree transformation for tre...
