Machine learning system for extracting structured records from web pages and other text sources
Abstract
A method for extracting a structured record (190) from a document (100) is described where the structured record includes information related to a predetermined subject matter (120), with this information being organized into categories within the structured record. The method comprises the steps of identifying a span of text (130) in the document (100) according to criteria associated with the predetermined subject matter and processing (150) the span of text to extract at least one text element associated with at least one of the categories of the structured record (190) from the document (100).
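The two steps of the method (identify a span, then extract categorized text elements from it) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the patent: the subject matter ("contact details"), the cue-word criteria, the two categories, and the use of regular expressions in place of whatever trained model an actual embodiment would use.

```python
import re

# Hypothetical criteria for the subject matter "contact details":
# a span is any paragraph containing at least one cue word.
CUE_WORDS = ("contact", "email", "phone")

# Hypothetical categories of the structured record, one pattern each.
CATEGORY_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "phone": re.compile(r"\+?\d[\d\s()-]{6,}\d"),
}


def identify_spans(document):
    """Step 1: return the paragraphs that satisfy the subject-matter criteria."""
    paragraphs = [p.strip() for p in document.split("\n\n") if p.strip()]
    return [p for p in paragraphs if any(w in p.lower() for w in CUE_WORDS)]


def extract_record(span):
    """Step 2: pull at most one text element per category out of a span."""
    record = {}
    for category, pattern in CATEGORY_PATTERNS.items():
        match = pattern.search(span)
        if match:
            record[category] = match.group(0)
    return record


def extract_records(document):
    """Run both steps and keep only the non-empty records."""
    records = (extract_record(span) for span in identify_spans(document))
    return [r for r in records if r]
```

Restricting extraction to an identified span, rather than scanning the whole document, is what keeps unrelated text (navigation, footers) from contributing spurious elements to the record.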
27 Claims
1. A method for extracting a structured record from a document, said structured record including information related to a predetermined subject matter, said information to be organized into categories within said structured record, said method comprising the steps of:
identifying a span of text in said document according to criteria associated with said predetermined subject matter; and
processing said span of text to extract at least one text element associated with at least one of said categories of said structured record from said document.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21
22. A method for training a classifier to classify for text based elements in a collection of text based elements according to a characteristic, said method comprising the steps of:
forming a feature vector corresponding to each text based element;
forming a sequence of said feature vectors corresponding to each of said text based elements in said collection of text based elements;
labeling each text based element according to said characteristic thereby forming a sequence of labels corresponding to said sequence of feature vectors; and
training a predictive algorithm based on said sequence of labels and said corresponding sequence of said feature vectors, said algorithm trained to generate new label sequences from an input sequence of feature vectors thereby classifying text based elements that form said input sequence of feature vectors.
Dependent claims: 23, 24, 25
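The training steps recited in claim 22 can be sketched as follows. This is only an illustration under stated assumptions: the claim says "a predictive algorithm" without naming one, so a greedy perceptron sequence tagger is swapped in here, and the feature names, the label set, and the helper functions (`featurize`, `active_features`, `train`, `predict`) are all hypothetical.

```python
from collections import defaultdict


def featurize(element):
    """Form a feature vector (name -> bool) for one text based element."""
    return {
        "has_digit": any(c.isdigit() for c in element),
        "has_at": "@" in element,
        "is_title": element.istitle(),
        "is_lower": element.islower(),
    }


def active_features(vector, prev_label):
    """Names of the features that fire, plus the previous label and a bias."""
    names = [name for name, on in vector.items() if on]
    return names + ["prev=" + prev_label, "bias"]


def best_label(weights, labels, active):
    """Highest-scoring label for one position (ties go to the first label)."""
    return max(labels, key=lambda y: sum(weights[(f, y)] for f in active))


def train(feature_seqs, label_seqs, epochs=10):
    """Perceptron training on paired feature-vector and label sequences."""
    labels = sorted({y for seq in label_seqs for y in seq})
    weights = defaultdict(float)
    for _ in range(epochs):
        for vectors, gold in zip(feature_seqs, label_seqs):
            prev = "<start>"
            for vector, y in zip(vectors, gold):
                active = active_features(vector, prev)
                guess = best_label(weights, labels, active)
                if guess != y:
                    # Move weight toward the gold label, away from the guess.
                    for f in active:
                        weights[(f, y)] += 1.0
                        weights[(f, guess)] -= 1.0
                prev = y  # condition the next step on the gold label
    return weights, labels


def predict(weights, labels, vectors):
    """Generate a new label sequence for an input sequence of feature vectors."""
    out, prev = [], "<start>"
    for vector in vectors:
        prev = best_label(weights, labels, active_features(vector, prev))
        out.append(prev)
    return out
```

Including the previous label among the features is what makes this a sequence model rather than an element-by-element classifier: the label chosen for one element influences the score of the next.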
26. An apparatus adapted for extracting a structured record from a document, said structured record including information related to a predetermined subject matter, said information to be organized into categories within said structured record, said apparatus comprising:
processor means adapted to operate in accordance with a predetermined instruction set;
said apparatus in conjunction with said instruction set, being adapted to perform the method of:
identifying a span of text in said document according to criteria associated with said predetermined subject matter; and
processing said span of text to extract at least one text element associated with at least one of said categories of said structured record from said document.
27. An apparatus adapted to train a classifier to classify for text based elements in a collection of text based elements according to a characteristic, said apparatus comprising:
processor means adapted to operate in accordance with a predetermined instruction set;
said apparatus in conjunction with said instruction set, being adapted to perform the method of:
forming a feature vector corresponding to each text based element;
forming a sequence of said feature vectors corresponding to each of said text based elements in said collection of text based elements;
labeling each text based element according to said characteristic thereby forming a sequence of labels corresponding to said sequence of feature vectors; and
training a predictive algorithm based on said sequence of labels and said corresponding sequence of said feature vectors, said algorithm trained to generate new label sequences from an input sequence of feature vectors thereby classifying text based elements that form said input sequence of feature vectors.
Specification