System of generating new schema based on selective HTML elements
First Claim
1. A method of automatically generating a mark-up language schema, the method comprising the steps of:
a) receiving a plurality of training samples, the or each training sample identifying one or more mark-up language elements stored within an online data resource;
b) for each of the plurality of received training samples, automatically generating a candidate mark-up language schema;
c) for each of the plurality of candidate mark-up language schema, comparing that candidate schema with the remainder of the candidate schemas to determine how many of the schema match and selecting a candidate mark-up language schema if the proportion of matching candidate schema exceeds a predetermined threshold;
d) if none of the plurality of candidate mark-up language schema matches a sufficient number of the other schema, generating a further mark-up language schema and executing a further instance of step c); and
e) reiterating step d) until one of the candidate schemas matches with a sufficient number of the other schema.
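Steps a) to e) can be read as a voting loop over candidate schemas. The following is a minimal sketch of that reading, not the patented implementation. It assumes, purely for illustration, that a training sample is the tuple of tag names leading from the document root to the selected element, that a schema is the trailing part of that path, and that "generating a further schema" means generalising by dropping the outermost tag. The names `generate_schema`, `select_schema` and `SAMPLES` are hypothetical, and the loop is bounded so that it returns `None` rather than iterating forever when no candidate reaches the threshold.

```python
from typing import Optional, Sequence, Tuple

# Assumed representation: root-to-element path of tag names.
Sample = Tuple[str, ...]
Schema = Tuple[str, ...]


def generate_schema(sample: Sample, depth: int) -> Schema:
    """Assumed generalisation rule: keep only the last `depth` tags of the path."""
    return sample[-depth:]


def select_schema(samples: Sequence[Sample], threshold: float = 0.5) -> Optional[Schema]:
    """Return the first candidate whose share of matching peers exceeds `threshold`."""
    if len(samples) < 2:
        raise ValueError("need at least two training samples to compare")
    max_depth = max(len(s) for s in samples)
    # Start with the most specific candidates (step b), then generalise (steps d, e).
    for depth in range(max_depth, 0, -1):
        candidates = [generate_schema(s, depth) for s in samples]
        # Step c: compare each candidate with the remainder of the candidates.
        for i, candidate in enumerate(candidates):
            others = candidates[:i] + candidates[i + 1:]
            matches = sum(1 for other in others if other == candidate)
            if matches / len(others) > threshold:
                return candidate
    return None  # no candidate reached the threshold at any depth


# Hypothetical training samples: three list items and one unrelated paragraph.
SAMPLES = [
    ("html", "body", "div", "ul", "li"),
    ("html", "body", "div", "ul", "li"),
    ("html", "body", "section", "ul", "li"),
    ("html", "body", "article", "p"),
]
```

With these samples and a threshold of 0.5, the full paths agree for only one of three peers, so the sketch keeps generalising until the two-tag suffix `("ul", "li")` is shared by two of the three other candidates.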
Abstract
The present invention provides a system which is able to detect similar web page elements which are described in mark-up language, such that the content of those elements can be captured. Text content may then be sent to a text classifier for further analysis.
9 Claims
1. A method of automatically generating a mark-up language schema, the method comprising the steps of:
a) receiving a plurality of training samples, the or each training sample identifying one or more mark-up language elements stored within an online data resource;
b) for each of the plurality of received training samples, automatically generating a candidate mark-up language schema;
c) for each of the plurality of candidate mark-up language schema, comparing that candidate schema with the remainder of the candidate schemas to determine how many of the schema match and selecting a candidate mark-up language schema if the proportion of matching candidate schema exceeds a predetermined threshold;
d) if none of the plurality of candidate mark-up language schema matches a sufficient number of the other schema, generating a further mark-up language schema and executing a further instance of step c); and
e) reiterating step d) until one of the candidate schemas matches with a sufficient number of the other schema.
Dependent claims: 2, 3, 4, 5
6. An apparatus for generating a mark-up language schema, the apparatus comprising:
a processing system including one or more processors and one or more storage memories, the processing system being configured to perform at least the steps of;
a) receiving a plurality of training samples, the or each training sample identifying one or more mark-up language elements stored within an online data resource;
b) for each of the plurality of received training samples, automatically generating a candidate mark-up language schema;
c) for each of the plurality of candidate mark-up language schema, comparing that candidate schema with the remainder of the candidate schemas to determine how many of the schema match and selecting a candidate mark-up language schema if the proportion of matching candidate schema exceeds a predetermined threshold;
d) if none of the plurality of candidate mark-up language schema matches a sufficient number of the other schema, generating a further mark-up language schema and executing a further instance of step c);
e) reiterating step d) until one of the candidate schemas matches with a sufficient number of the other schema.
Dependent claims: 8, 9
7. An apparatus for analysing mark-up language text, the apparatus comprising:
a processing system including one or more processors and one or more storage memories, the processing system being configured to perform at least the steps of;
a) receiving a plurality of training samples, the or each training sample identifying one or more mark-up language elements stored within an online data resource;
b) for each of the plurality of received training samples, automatically generating a candidate mark-up language schema;
c) for each of the plurality of candidate mark-up language schema, comparing that candidate schema with the remainder of the candidate schemas to determine how many of the schema match and selecting a candidate mark-up language schema if the proportion of matching candidate schema exceeds a predetermined threshold;
d) if none of the plurality of candidate mark-up language schema matches a sufficient number of the other schema, generating a further mark-up language schema and executing a further instance of step c); and
e) reiterating step d) until one of the candidate schemas matches with a sufficient number of the other schema;
f) applying a mark-up language schema to an online data resource, the mark-up language schema comprising a plurality of mark-up language elements;
g) identifying one or more data elements comprised within the online data resource, the or each data elements being associated with a particular mark-up language element; and
h) extracting those data elements identified in step g);
wherein the mark-up language schema is generated using steps a)-e).
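Steps f) to h) describe applying a schema to a page and pulling out the data elements it identifies. The sketch below illustrates one possible reading using Python's standard `html.parser` module. It assumes, for illustration only, that a schema is a tuple of tag names that must match the innermost open elements, and that a "data element" is the text sitting directly inside a matching element. `SchemaExtractor`, `extract_elements` and `PAGE` are hypothetical names that do not come from the patent.

```python
from html.parser import HTMLParser
from typing import List, Tuple

# HTML void elements never get a closing tag, so they are not tracked.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class SchemaExtractor(HTMLParser):
    """Collects text whose enclosing tag path ends with the given schema."""

    def __init__(self, schema: Tuple[str, ...]):
        super().__init__()
        if not schema:
            raise ValueError("schema must contain at least one tag name")
        self.schema = schema
        self.stack: List[str] = []      # currently open tags, outermost first
        self.extracted: List[str] = []  # data elements found so far

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed children.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        # Steps g) and h): identify and extract text under a matching path.
        if text and tuple(self.stack[-len(self.schema):]) == self.schema:
            self.extracted.append(text)


def extract_elements(html: str, schema: Tuple[str, ...]) -> List[str]:
    """Step f): apply the schema to a document and return the extracted text."""
    parser = SchemaExtractor(schema)
    parser.feed(html)
    parser.close()
    return parser.extracted


# Hypothetical page: two list items and one unrelated paragraph.
PAGE = ("<html><body><ul><li>First</li><li>Second</li></ul>"
        "<p>Other</p></body></html>")
```

Calling `extract_elements(PAGE, ("ul", "li"))` collects the text of the two list items and skips the paragraph. Text nested inside further child elements of a matching element is not collected in this simplified version.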
Specification