Systems and methods for statistical modeling of complex data sets
Abstract
The present invention comprises methods and a computer-readable medium comprising programming code for automatic statistical modeling of data. In an embodiment, the analysis is completely automatic with a single computer stroke providing an analysis of data with at least nineteen independent variables. The output is provided in multiple formats, as for example, graphs, reports, spreadsheet files, and an electronic calculator. In an embodiment, approximations of missing data can be automatically calculated and used in the model. The modeling systems and software of the present invention may be used to provide information necessary for manufacturing, business models, scientific endeavors, transportation schedules, and other practical applications.
36 Claims
- 1. A computer implemented method to find a mathematical equation that fits a data set having one dependent variable and at least one independent variable comprising determining the relative contribution of the at least one independent variable to the dependent variable, and defining separate functions that each describe the contribution of a single independent variable to the dependent variable, wherein the functions used to describe the contribution of an independent variable to the dependent variable are derived using residuals of the dependent variable, wherein the residuals comprise the portion of the dependent variable for which a contributing independent variable has not been defined.
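As a minimal sketch of the residual idea in claim 1, the Python below fits one function per independent variable, with the second function fitted to whatever the first leaves unexplained. The helper names (`fit_line`, `residuals`), the restriction to straight-line functions, and the data are illustrative assumptions and are not taken from the patent.

```python
def fit_line(x, y):
    """Ordinary least squares for y ≈ m*x + b; returns (m, b)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    m = sxy / sxx
    return m, mean_y - m * mean_x


def residuals(x, y, m, b):
    """Portion of y not described by the fitted function m*x + b."""
    return [yi - (m * xi + b) for xi, yi in zip(x, y)]


# Hypothetical data: y depends on two uncorrelated independent variables.
x1 = [-1.0, -1.0, 1.0, 1.0]
x2 = [-1.0, 1.0, -1.0, 1.0]
y = [2 * a + 3 * b + 1 for a, b in zip(x1, x2)]

# First function: contribution of x1 to y.
m1, b1 = fit_line(x1, y)
r1 = residuals(x1, y, m1, b1)

# Second function: contribution of x2, derived from the residuals, not from y.
m2, b2 = fit_line(x2, r1)
r2 = residuals(x2, r1, m2, b2)
```

Because `x1` and `x2` are uncorrelated in this toy data set, fitting `x2` against the residuals recovers its coefficient cleanly; with correlated variables the sequential fit would generally differ from a joint fit.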
- 6. A computer implemented method to find a mathematical equation that fits a data set having one dependent variable and at least one independent variable comprising the steps of:
(a) identifying the independent variable that makes the largest contribution to the dependent variable as the first most important independent variable;
(b) plotting the dependent variable versus transformations of the first most important independent variable to determine a function that provides a model having the best fit to the data;
(c) identifying the independent variable that makes the next largest contribution to the dependent variable as the next most important independent variable;
(d) plotting the residuals of the dependent variable versus transformations of the next most important variable to determine a function that comprises the best fit of the next most important independent variable to the residuals, wherein the residuals of the dependent variable comprise the portion of the dependent variable for which a contributing independent variable has not yet been defined; and
(e) repeating steps (c) and (d) to identify the next most important independent variable until an optimal number of independent variables having associated functions to describe the dependent variable have been determined.
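Steps (a) through (e) can be sketched as a greedy loop over the residuals. This is an illustration under stated assumptions, not the patented implementation: the claim speaks of plotting, whereas the sketch uses R² as a numerical stand-in for judging fit; the `TRANSFORMS` set and the `min_r2` stopping threshold are hypothetical, since the claim defines neither the transformations nor what makes the number of variables "optimal"; and variable choice and transformation choice are merged into one search for brevity.

```python
import math

# Hypothetical candidate transformations; the claim does not enumerate them.
TRANSFORMS = {
    "x": lambda v: v,
    "x^2": lambda v: v * v,
    "exp(x)": math.exp,
}


def fit_line(x, y):
    """Least-squares fit of y ≈ m*x + b; returns (m, b, r_squared)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    if sxx == 0 or syy == 0:
        return 0.0, mean_y, 0.0  # constant predictor, or nothing left to explain
    m = sxy / sxx
    return m, mean_y - m * mean_x, sxy * sxy / (sxx * syy)


def stepwise_fit(columns, y, min_r2=0.05):
    """Greedily add one (variable, transformation) term per pass.

    columns maps a variable name to its list of values. Returns the chosen
    terms as (name, transform_label, m, b) plus the final residuals.
    """
    residual = list(y)          # before any term is chosen, all of y is unexplained
    remaining = dict(columns)
    terms = []
    while remaining:
        best = None
        for name, values in remaining.items():
            for label, fn in TRANSFORMS.items():
                m, b, r2 = fit_line([fn(v) for v in values], residual)
                if best is None or r2 > best[0]:
                    best = (r2, name, label, m, b)
        r2, name, label, m, b = best
        if r2 < min_r2:         # assumed stopping rule
            break
        fn = TRANSFORMS[label]
        values = remaining.pop(name)
        residual = [r - (m * fn(v) + b) for r, v in zip(residual, values)]
        terms.append((name, label, m, b))
    return terms, residual


# Hypothetical balanced data set: y = 5*x1^2 + 2*x2.
x1 = [0.0, 1.0, 2.0, 3.0] * 3
x2 = [-1.0] * 4 + [0.0] * 4 + [1.0] * 4
y = [5 * a * a + 2 * b for a, b in zip(x1, x2)]
terms, leftover = stepwise_fit({"x1": x1, "x2": x2}, y)
```

On this data the loop selects `x1` with the `x^2` transformation first, then `x2` with the identity transformation, and stops because no variables remain.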
- 18. A computer implemented method to find a mathematical equation that fits a data set while minimizing the number of terms in the final model comprising the steps of:
(a) organizing the data as one dependent variable (y) and at least one independent variable (x1, x2, . . . xn−1, xn);
(b) determining which independent variable comprises the most significant contribution to the dependent variable by using a program code that performs the following substeps;
(i) plotting the values of the dependent variable against an initial set of selected functions (Finitial) of each independent variable (x1, x2, x3, . . . xn−1, xn);
(ii) analyzing how well each function describes the values for the dependent variable (y) for each independent variable; and
(iii) choosing an independent variable (x1) which comprises best fit for any one of the predetermined number of analyzed functions;
(c) determining a function, f(x1), and constants, m1 and b1, from an expanded set of functions, which best describes the independent variable comprising the most significant contribution to the dependent variable;
(d) determining the residuals (y−ŷ1), where ŷ1=m1*f(x1)+b1 is the calculated value of (y) for x1;
(e) determining the next most significant independent variable by plotting the value of the residuals (y−ŷ1) against an initial set of functions of the remaining independent variables (x2, x3, . . . xn−1, xn) and choosing the independent variable (x2) which comprises best fit for any one of the predetermined number of analyzed functions;
(f) determining a function, f(x2), and constants, m2 and b2, from an expanded set of functions, which best describes the independent variable comprising the next most significant contribution to the residuals for the dependent variable (y−ŷ1);
(g) determining the residuals (y−ŷ1,2)=y−((m1′*f(x1))+(m2′*f(x2))+b′);
(h) plotting selected functions of the remaining independent variables (x3, . . . xn−1, xn) versus the second level residuals (y−ŷ1,2) in order to determine the next most significant independent variable (x3);
(i) determining a function f(x3), and new constants, m3 and b3, which best describes the mathematical relationship between x3 and (y−ŷ1,2) from a second expanded set of pre-selected functions (FS2);
(j) repeating steps (g)-(i) using increasing levels of residuals (y−ŷ1,2,3, . . . n−1) to characterize additional independent variables (x4, . . . xn−1, xn) until an optimal number of functions to describe the dependent variable (y) have been identified and described; and
(k) generating an equation which includes at least one optimized function for at least one independent variable to describe the value of the dependent variable for the entire data set.
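Step (g) writes the second-level residuals with primed constants (m1′, m2′, b′). One plausible reading, which is an assumption here and not something the claim states, is that the constants are re-estimated jointly once both functions are known. The sketch below does that with an ordinary least-squares fit solved through the normal equations; `solve`, `joint_refit`, `second_level_residuals`, and the data are hypothetical names and values chosen for illustration.

```python
def solve(a, rhs):
    """Solve a small square linear system by Gauss-Jordan elimination
    with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [r] for row, r in zip(a, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def joint_refit(features, y):
    """Least-squares fit of y ≈ sum(m_k * features[k]) + b.

    features is a list of already transformed columns, e.g. [f(x1), f(x2)].
    Returns ([m_1', m_2', ...], b').
    """
    cols = features + [[1.0] * len(y)]          # last column carries the intercept
    gram = [[sum(p * q for p, q in zip(ci, cj)) for cj in cols] for ci in cols]
    rhs = [sum(p * q for p, q in zip(ci, y)) for ci in cols]
    *ms, b = solve(gram, rhs)
    return ms, b


def second_level_residuals(features, y):
    """Return (y - y_hat_1,2, [m1', m2'], b') for the jointly refitted model."""
    ms, b = joint_refit(features, y)
    fitted = [sum(m * col[i] for m, col in zip(ms, features)) + b
              for i in range(len(y))]
    return [yi - fi for yi, fi in zip(y, fitted)], ms, b


# Hypothetical transformed columns f(x1) and f(x2), deliberately correlated.
f1 = [0.0, 1.0, 4.0, 9.0]
f2 = [1.0, 0.0, 2.0, 1.0]
y = [2 * a + 3 * b + 4 for a, b in zip(f1, f2)]
resid, ms, b_prime = second_level_residuals([f1, f2], y)
```

When the transformed columns are correlated, as in this toy data, the jointly refitted constants can differ from the m1, b1 and m2, b2 obtained one variable at a time, which is one reason a refit at each residual level may be useful.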
- 19. A computer-readable medium on which is encoded programming code to find a mathematical equation that fits a data set having one dependent variable and at least one independent variable comprising program code for determining the relative contribution of the at least one independent variable to the dependent variable and for defining separate functions that each describe the contribution of a single independent variable to the dependent variable, wherein the functions used to describe the contribution of an independent variable to the dependent variable are derived using residuals of the dependent variable, wherein the residuals comprise the portion of the dependent variable for which a contributing independent variable has not been defined.
- 24. A computer-readable medium on which is encoded programming code to find a mathematical equation that fits a data set having one dependent variable and at least one independent variable comprising:
(a) program code for identifying the independent variable that makes the largest contribution to the dependent variable as the first most important independent variable;
(b) program code for plotting the dependent variable versus transformations of the first most important independent variable to determine a function that provides a model having the best fit to the data;
(c) program code for identifying the independent variable that makes the next largest contribution to the dependent variable as the next most important independent variable;
(d) program code for plotting the residuals of the dependent variable versus transformations of the next most important variable to determine a function that comprises the best fit of the next most important independent variable to the residuals, wherein the residuals of the dependent variable comprise the portion of the dependent variable for which a contributing independent variable has not yet been defined; and
(e) program code for repeating steps (c) and (d) to identify the next most important independent variable until an optimal number of independent variables having associated functions to describe the dependent variable have been determined.
- 36. A computer-readable medium on which is encoded programming code to find a mathematical equation that fits a data set while minimizing the number of terms in the final model comprising:
(a) program code for organizing the data as one dependent variable (y) and at least one independent variable (x1, x2, . . . xn−1, xn);
(b) program code for determining which independent variable comprises the most significant contribution to the dependent variable by using a program code that performs the following substeps;
(i) plotting the values of the dependent variable against an initial set of selected functions (Finitial) of each independent variable (x1, x2, x3, . . . xn−1, xn);
(ii) analyzing how well each function describes the values for (y) for each independent variable; and
(iii) choosing an independent variable (x1) which comprises best fit for any one of the predetermined number of analyzed functions;
(c) program code for determining a function, f(x1), and constants, m1 and b1, from an expanded set of functions, which best describes the mathematical relationship between the independent variable comprising the most significant contribution to (y);
(d) program code for determining the residuals (y−ŷ1), where ŷ1=m1*f(x1)+b1 is the calculated value of (y) for x1;
(e) program code for determining the next most significant independent variable (x2) by plotting the value of the residuals (y−ŷ1) against an initial set of functions of the remaining independent variables (x2, x3, . . . xn−1, xn) and choosing the independent variable (x2 for example) which comprises best fit for any one of the predetermined number of analyzed functions;
(f) program code for determining a function, f(x2), and constants, m2 and b2, from an expanded set of functions, which best describes the mathematical relationship between the independent variable comprising the next most significant contribution to (y);
(g) program code for determining the residuals (y−ŷ1,2)=y−((m1′*f(x1))+(m2′*f(x2))+b′);
(h) program code for plotting selected functions of the remaining independent variables (x3, . . . xn−1, xn) versus the second level residuals (y−ŷ1,2) in order to determine the next most significant independent variable (x3);
(i) program code for determining a function f(x3), and new constants, m3 and b3, which best describes the mathematical relationship between x3 and (y−ŷ1,2) from a second expanded set of pre-selected functions (FS2);
(j) program code for repeating steps (g)-(i) using increasing levels of residuals (y−ŷ1,2,3, . . . n−1) to characterize additional independent variables (x4, . . . xn−1, xn) until an optimal number of functions to describe the dependent variable (y) have been identified and described; and
(k) program code for generating an equation which includes at least one optimized function for at least one independent variable to describe the value of the dependent variable for the entire data set.
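Steps (b) and (c) describe two stages: the most significant variable is first chosen using a small initial function set (Finitial), and its function is then chosen from an expanded set. The sketch below illustrates that split under assumptions of its own: the contents of `INITIAL` and `EXPANDED` are hypothetical, R² stands in for the plotting and "how well each function describes" language of the claim, and `sqrt(x)` restricts the example to non-negative data.

```python
import math

# Stand-in for Finitial: a deliberately small screening set (assumed contents).
INITIAL = {
    "x": lambda v: v,
    "x^2": lambda v: v * v,
}

# Stand-in for the expanded set: the initial functions plus more candidates.
EXPANDED = dict(INITIAL)
EXPANDED.update({
    "x^3": lambda v: v ** 3,
    "sqrt(x)": math.sqrt,   # requires non-negative values
})


def r_squared(x, y):
    """Squared Pearson correlation between x and y (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return 0.0 if sxx == 0 or syy == 0 else sxy * sxy / (sxx * syy)


def best_score(values, target, functions):
    """Best R² achieved by any function in the given set for one variable."""
    return max(r_squared([fn(v) for v in values], target)
               for fn in functions.values())


def screen_then_refine(columns, target):
    """Stage 1: pick the variable using INITIAL. Stage 2: pick its function
    from EXPANDED. target is y on the first pass, or the current residuals."""
    chosen = max(columns, key=lambda name: best_score(columns[name], target, INITIAL))
    label = max(EXPANDED,
                key=lambda lab: r_squared([EXPANDED[lab](v) for v in columns[chosen]],
                                          target))
    return chosen, label


# Hypothetical data: y depends on the cube of x2 only.
columns = {
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "x2": [3.0, 1.0, 4.0, 1.0, 5.0, 9.0],
}
y = [2 * v ** 3 + 1 for v in columns["x2"]]
variable, function = screen_then_refine(columns, y)
```

Screening with a small set keeps the first stage cheap when there are many independent variables; the larger set is only evaluated for the single variable that survives screening.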
Specification