System and method for automatic generation of features from datasets for use in an automated machine learning process
First Claim
1. A computer implemented method for generating a statistical classifier, comprising:
receiving a designation of a training dataset comprising a plurality of raw data instances each including a set of data objects assigned at least one value;
applying a function to each raw data instance to calculate a set of first results comprising a collection of complex objects having a plurality of parameters built from a plurality of primitive types, wherein the function is selected from a plurality of functions;
extracting a plurality of values from the set of first results;
generating a plurality of candidate classification features, each respective candidate classification feature including:
(i) the function that outputs a complex object, selected from the plurality of functions,
(ii) at least one condition selected from a plurality of conditions, and
(iii) a value selected from the plurality of extracted values,
wherein the respective candidate classification feature outputs an output value computed by the at least one condition that compares between the complex object output of the function and the value selected from the plurality of extracted values;
selecting a subset of pivotal classification features from the generated candidate classification features according to a correlation requirement between at least one classification variable and each respective candidate classification feature; and
generating a statistical classifier for classification of the at least one classification variable based on the selected subset of pivotal features applied to a new training dataset.
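As an informal illustration only, and not the claimed implementation, the candidate-generation steps of this claim can be sketched in Python. Everything named here is an assumption made for the example: the `words` function, the single `contains` condition, the `CandidateFeature` class, and the two-instance toy dataset do not come from the patent.

```python
from dataclasses import dataclass
from typing import Any, Callable


def words(instance: dict) -> set:
    """Hypothetical function: maps a raw instance to a complex object (a set of words)."""
    return set(instance["text"].lower().split())


# Hypothetical condition library: each condition compares a complex object with a value.
CONDITIONS = {
    "contains": lambda obj, value: value in obj,
}


@dataclass(frozen=True)
class CandidateFeature:
    """A (function, condition, value) triple, mirroring items (i) to (iii) of the claim."""
    function: Callable[[dict], Any]
    condition: str
    value: Any

    def __call__(self, instance: dict) -> bool:
        # Output value: the condition applied to the function's output and the stored value.
        return CONDITIONS[self.condition](self.function(instance), self.value)


def generate_candidates(dataset, functions):
    """Apply each function, extract values from the first results, and build candidates."""
    candidates = []
    for fn in functions:
        first_results = [fn(inst) for inst in dataset]
        # Extract the distinct values found inside the complex objects.
        extracted = sorted({v for obj in first_results for v in obj})
        for cond in CONDITIONS:
            for value in extracted:
                candidates.append(CandidateFeature(fn, cond, value))
    return candidates


training = [{"text": "cheap pills now"}, {"text": "meeting at noon"}]
candidates = generate_candidates(training, [words])
```

Each element of `candidates` is callable on a raw instance and returns a Boolean, so the candidates can be evaluated against any dataset whose instances have the same shape.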
Abstract
There is provided a method for generating features for use in an automated machine learning process, comprising: receiving a first training dataset comprising unclassified raw data instances each including a set of objects of arbitrary types; applying a function to each data instance to calculate a set of first results; generating a set of classification features each including the function for application to a newly received data instance to calculate a second result, and a condition defined by a respective member of the set of first results applied to the second result; applying each classification feature to each instance of an unclassified second training dataset to generate a set of extracted features; selecting a subset of pivotal classification features from the set of classification features according to a correlation requirement between classification variable(s) and each respective member of the set of extracted features; and documenting the subset of pivotal features.
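The selection step described above can be illustrated with a short Python sketch. It is a simplification under stated assumptions: the correlation requirement is taken to be a minimum absolute Pearson correlation, the threshold `min_abs_corr`, the three toy features, the string dataset, and the label values are all hypothetical, and the patent does not prescribe this particular measure.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences; 0.0 if either is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def select_pivotal(features, dataset, labels, min_abs_corr=0.5):
    """Keep the features whose extracted values correlate with the classification variable."""
    pivotal = []
    for name, feature in features.items():
        # Apply the classification feature to every instance to get the extracted feature.
        extracted = [float(feature(inst)) for inst in dataset]
        if abs(pearson(extracted, labels)) >= min_abs_corr:
            pivotal.append(name)
    return pivotal


# Hypothetical classification features over string instances.
features = {
    "length > 3": lambda inst: len(inst) > 3,
    "starts with a": lambda inst: inst.startswith("a"),
    "is non-empty": lambda inst: len(inst) > 0,
}
second_dataset = ["ab", "abcd", "xy", "wxyz", "a", "abcde"]
labels = [0, 1, 0, 1, 0, 1]  # hypothetical values of the classification variable

pivotal = select_pivotal(features, second_dataset, labels)
```

In this toy run only the length feature tracks the labels. The prefix feature is uncorrelated with them and the non-empty feature is constant, so both are dropped.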
21 Claims
1. A computer implemented method for generating a statistical classifier, comprising:
receiving a designation of a training dataset comprising a plurality of raw data instances each including a set of data objects assigned at least one value;
applying a function to each raw data instance to calculate a set of first results comprising a collection of complex objects having a plurality of parameters built from a plurality of primitive types, wherein the function is selected from a plurality of functions;
extracting a plurality of values from the set of first results;
generating a plurality of candidate classification features, each respective candidate classification feature including:
(i) the function that outputs a complex object, selected from the plurality of functions,
(ii) at least one condition selected from a plurality of conditions, and
(iii) a value selected from the plurality of extracted values,
wherein the respective candidate classification feature outputs an output value computed by the at least one condition that compares between the complex object output of the function and the value selected from the plurality of extracted values;
selecting a subset of pivotal classification features from the generated candidate classification features according to a correlation requirement between at least one classification variable and each respective candidate classification feature; and
generating a statistical classifier for classification of the at least one classification variable based on the selected subset of pivotal features applied to a new training dataset.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 20
15. A system for generating a statistical classifier, comprising:
a data interface for communicating with a storage unit storing thereon at least one training dataset comprising a plurality of raw data instances each including a set of data objects assigned at least one value, and storing a plurality of functions adapted to process each data instance to generate a plurality of sets of first results comprising a collection of complex objects having a plurality of parameters built from a plurality of primitive types;
a program store storing code; and
a processor coupled to the data interface and the program store for implementing the stored code, the code comprising:
code to apply at least one function from the plurality of functions to each data instance to calculate a set of first results comprising a collection of complex objects having a plurality of parameters built from a plurality of primitive types;
code to extract a plurality of values from the set of first results;
code to generate a plurality of candidate classification features, each respective candidate classification feature including:
(i) the function that outputs a complex object, selected from the plurality of functions,
(ii) at least one condition selected from a plurality of conditions, and
(iii) a value selected from the extracted values,
wherein the respective candidate classification feature outputs an output value computed by the at least one condition that compares between the complex object output of the function and the value selected from the plurality of extracted values;
code to select a subset of pivotal features from the generated candidate classification features according to at least one correlation requirement between at least one classification variable and each respective candidate classification feature; and
code to generate a statistical classifier for classification of the at least one classification variable based on the selected subset of pivotal features applied to a new training dataset.
Dependent claims: 16, 17, 18, 19
21. A computer program product comprising a non-transitory computer readable storage medium storing program code thereon for implementation by a processor of a system for generating a statistical classifier, the program code comprising:
instructions to receive a designated training dataset comprising a plurality of raw data instances each including a set of data assigned at least one value;
instructions to apply a function to each raw data instance to calculate a set of first results comprising a collection of complex objects having a plurality of parameters built from a plurality of primitive types, wherein the function is selected from a plurality of functions;
instructions to extract a plurality of values from the set of first results;
instructions to generate a plurality of candidate classification features, each respective candidate classification feature including:
(i) the function that outputs a complex object, selected from the plurality of functions,
(ii) at least one condition selected from a plurality of conditions, and
(iii) a value selected from the plurality of extracted values,
wherein the respective candidate classification feature outputs an output value computed by the at least one condition that compares between the complex object output of the function and the value selected from the plurality of extracted values;
instructions to select a subset of pivotal classification features from the generated candidate classification features according to a correlation requirement between at least one classification variable and each respective candidate classification feature; and
instructions to generate a statistical classifier for classification of the at least one classification variable based on the selected subset of pivotal features applied to a new training dataset.
Specification