METHOD, APPARATUS, AND COMPUTER-READABLE MEDIUM FOR VISUALIZING RELATIONSHIPS BETWEEN PAIRS OF COLUMNS
First Claim
1. A method executed by one or more computing devices for visualizing relationships between pairs of columns, the method comprising:
identifying, by at least one of the one or more computing devices, a relationship classification corresponding to two columns in a plurality of columns based on a data type of each column in the two columns, wherein the data type of each column in the plurality of columns comprises either categorical data or numerical data and wherein the relationship classification comprises one of categorical-categorical, categorical-numerical, and numerical-numerical;
applying, by at least one of the one or more computing devices, one or more statistical measures to data in the two columns to generate association data quantifying a plurality of relationships between data values in a first column of the two columns and data values in a second column of the two columns, wherein the one or more statistical measures are determined based at least in part on the relationship classification; and
transforming, by at least one of the one or more computing devices, the association data into a visualization, wherein the visualization comprises one or more indicators corresponding to one or more relationships in the plurality of relationships and wherein a layout of the visualization is determined based on the relationship classification.
Abstract
An apparatus, computer-readable medium, and computer-implemented method for visualizing relationships between pairs of columns, comprising identifying a relationship classification corresponding to two columns in a plurality of columns based on a data type of each column in the two columns, applying one or more statistical measures to data in the two columns to generate association data quantifying a plurality of relationships between data values in a first column of the two columns and data values in a second column of the two columns, wherein the one or more statistical measures are determined based at least in part on the relationship classification, and transforming the association data into a visualization, wherein the visualization comprises one or more indicators corresponding to one or more relationships in the plurality of relationships and wherein a layout of the visualization is determined based on the relationship classification.
85 Citations
90 Claims
-
1. A method executed by one or more computing devices for visualizing relationships between pairs of columns, the method comprising:
identifying, by at least one of the one or more computing devices, a relationship classification corresponding to two columns in a plurality of columns based on a data type of each column in the two columns, wherein the data type of each column in the plurality of columns comprises either categorical data or numerical data and wherein the relationship classification comprises one of categorical-categorical, categorical-numerical, and numerical-numerical;
applying, by at least one of the one or more computing devices, one or more statistical measures to data in the two columns to generate association data quantifying a plurality of relationships between data values in a first column of the two columns and data values in a second column of the two columns, wherein the one or more statistical measures are determined based at least in part on the relationship classification; and
transforming, by at least one of the one or more computing devices, the association data into a visualization, wherein the visualization comprises one or more indicators corresponding to one or more relationships in the plurality of relationships and wherein a layout of the visualization is determined based on the relationship classification.
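The identifying step of claim 1 can be illustrated with a minimal sketch. The function names and the rule used to decide whether a column is numerical (every non-missing value is an int or float) are assumptions made for illustration; the claim only requires that each column be treated as categorical or numerical.

```python
from typing import Sequence


def column_type(values: Sequence[object]) -> str:
    """Hypothetical rule: a column is numerical when all non-missing values are numbers."""
    present = [v for v in values if v is not None]
    all_numbers = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in present
    )
    return "numerical" if present and all_numbers else "categorical"


def relationship_classification(first: Sequence[object], second: Sequence[object]) -> str:
    """Return one of the three classifications named in claim 1."""
    # Sorting puts "categorical" before "numerical", so a mixed pair is
    # always reported as "categorical-numerical" regardless of column order.
    kinds = sorted([column_type(first), column_type(second)])
    return "-".join(kinds)
```

A caller could then choose statistical measures and a layout by looking up the returned string.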
-
2. The method of claim 1, wherein the categorical data comprises one or more of nominal data and ordinal data.
-
3. The method of claim 1, further comprising:
receiving, by at least one of the one or more computing devices, a selection of two column identifiers corresponding to the two columns in the plurality of columns prior to identifying the relationship classification for the two columns of data.
-
4. The method of claim 1, further comprising:
determining, by at least one of the one or more computing devices, a plurality of relationship classifications corresponding to a plurality of pairs of columns in the plurality of columns based on the data type of each column in each pair of columns, wherein each column in the plurality of columns has a corresponding relationship count;
for each pair of columns in the plurality of pairs of columns:
applying, by at least one of the one or more computing devices, one or more global statistical measures to data in the pair of columns to determine whether a significant relationship exists between data values in a first column of the pair of columns and data values in a second column of the pair of columns, wherein the one or more global statistical measures are determined based at least in part on the relationship classification; and
incrementing, by at least one of the one or more computing devices, the relationship count corresponding to each column in the pair of columns based at least in part on a determination that a significant relationship exists between data values in the first column of the pair of columns and data values in the second column of the pair of columns; and
transmitting, by at least one of the one or more computing devices, a plurality of relationship indicators corresponding to the plurality of columns, wherein each relationship indicator corresponds to a column in the plurality of columns and indicates the relationship count of that column.
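The per-pair loop of claim 4 can be sketched as follows. The significance test is passed in as a callable because the claim leaves the specific measure to the relationship classification; the names `relationship_counts` and `is_significant` are hypothetical.

```python
from itertools import combinations
from typing import Callable, Dict, Sequence


def relationship_counts(
    columns: Dict[str, Sequence],
    is_significant: Callable[[Sequence, Sequence], bool],
) -> Dict[str, int]:
    """Count, for each column, how many other columns it is significantly related to."""
    counts = {name: 0 for name in columns}
    # Visit every unordered pair of columns exactly once.
    for first, second in combinations(columns, 2):
        if is_significant(columns[first], columns[second]):
            # A significant relationship increments the count of both columns.
            counts[first] += 1
            counts[second] += 1
    return counts
```

The resulting dictionary is what a relationship indicator per column would display.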
-
-
5. The method of claim 4, wherein applying one or more global statistical measures to data in the pair of columns to determine whether a significant relationship exists between data values in a first column of the pair of columns and data values in a second column of the pair of columns comprises one of:
applying a Pearson correlation based at least in part on a determination that the relationship classification comprises numerical-numerical and a determination that data values in both columns of the pair of columns are continuous data values;
applying a Spearman correlation based at least in part on a determination that the relationship classification comprises numerical-numerical and a determination that data values in at least one column of the pair of columns are ordinal data values;
applying a Chi-squared test and Cramér's V measure based at least in part on a determination that the relationship classification comprises categorical-categorical; or
applying one or more of a one-way Analysis of Variance (ANOVA) test or a plurality of one-sample T-tests based at least in part on a determination that the relationship classification comprises categorical-numerical.
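The selection logic in claim 5 can be sketched as a simple dispatch, together with pure-Python Pearson and Spearman coefficients for the numerical-numerical case. The function names and flags are hypothetical, and the coefficients are the standard textbook definitions rather than anything specified in the claims.

```python
import math
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; undefined (division by zero) if either column is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation computed as Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def global_measure_name(classification: str, both_continuous: bool = False,
                        any_ordinal: bool = False) -> str:
    """Map a relationship classification to the measure listed in claim 5."""
    if classification == "numerical-numerical":
        if both_continuous:
            return "pearson"
        if any_ordinal:
            return "spearman"
        raise ValueError("numerical pair must be continuous or ordinal")
    if classification == "categorical-categorical":
        return "chi-squared + cramers-v"
    if classification == "categorical-numerical":
        return "anova or one-sample t-tests"
    raise ValueError(f"unknown classification: {classification}")
```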
-
-
6. The method of claim 4, wherein applying one or more global statistical measures to data in the pair of columns to determine whether a significant relationship exists between data values in a first column of the pair of columns and data values in a second column of the pair of columns comprises:
-
applying the one or more global statistical measures to determine a strength of relationship between data values in a first column of the pair of columns and data values in a second column of the pair of columns; determining whether the strength of relationship is above a predetermined threshold; and determining that a significant relationship exists between data values in a first column of the pair of columns and data values in a second column of the pair of columns based at least in part on a determination that the strength of relationship is above the predetermined threshold.
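The threshold test of claim 6 reduces to a single comparison. In this sketch the default threshold of 0.5 is a placeholder, and taking the absolute value (so that strong negative correlations also count) is an assumption not stated in the claim.

```python
def is_significant(strength: float, threshold: float = 0.5) -> bool:
    """Return True when the strength of relationship is above the threshold."""
    # abs() is an assumption: it treats a correlation of -0.8 as strong.
    return abs(strength) > threshold
```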
-
-
7. The method of claim 4, wherein each relationship indicator in the plurality of relationship indicators comprises a circle having a size proportional to the relationship count of the corresponding column.
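One way to realize the circle of claim 7 is to make the circle's area proportional to the relationship count. Reading "size" as area rather than radius is an assumption, as is the hypothetical scale factor `area_per_relationship`.

```python
import math


def indicator_radius(relationship_count: int, area_per_relationship: float = 100.0) -> float:
    """Radius of a circle whose area is proportional to the relationship count."""
    area = area_per_relationship * relationship_count
    return math.sqrt(area / math.pi)
```

Under this reading, a column with four significant relationships gets a circle with twice the radius of a column with one.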
-
8. The method of claim 1, wherein applying one or more statistical measures to data in the two columns to generate association data quantifying a plurality of relationships between data values in a first column of the two columns and data values in a second column of the two columns comprises:
applying one or more global statistical measures to the data in the two columns to generate global association data, wherein the one or more global statistical measures are based at least in part on the relationship classification; and
applying one or more categorical statistical measures to the data in the two columns to generate categorical association data based at least in part on a determination that the relationship classification comprises either categorical-categorical or categorical-numerical, wherein the one or more categorical statistical measures are based at least in part on the relationship classification.
-
-
9. The method of claim 8, wherein applying one or more global statistical measures based at least in part on the relationship classification comprises one of:
applying a Pearson correlation based at least in part on a determination that the relationship classification comprises numerical-numerical and a determination that data values in both columns of the pair of columns are continuous data values;
applying a Spearman correlation based at least in part on a determination that the relationship classification comprises numerical-numerical and a determination that data values in at least one column of the pair of columns are ordinal data values;
applying a Chi-squared test and Cramér's V measure based at least in part on a determination that the relationship classification comprises categorical-categorical; or
applying one or more of a one-way ANOVA test or a plurality of one-sample T-tests based at least in part on a determination that the relationship classification comprises categorical-numerical.
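For the categorical-categorical branch of claim 9, the Chi-squared statistic and Cramér's V can be computed directly from the two columns. This is a sketch using the standard formulas; it returns only the statistic and effect size, not a p-value, and the function name is hypothetical.

```python
import math
from collections import Counter
from typing import Hashable, Sequence, Tuple


def chi_squared_and_cramers_v(
    first: Sequence[Hashable], second: Sequence[Hashable]
) -> Tuple[float, float]:
    """Return (chi-squared statistic, Cramér's V) for two categorical columns."""
    n = len(first)
    observed = Counter(zip(first, second))   # joint counts per category pair
    row_totals = Counter(first)              # counts per category, first column
    col_totals = Counter(second)             # counts per category, second column
    chi2 = 0.0
    for row, row_total in row_totals.items():
        for col, col_total in col_totals.items():
            expected = row_total * col_total / n
            chi2 += (observed.get((row, col), 0) - expected) ** 2 / expected
    # Cramér's V normalizes chi-squared to the range [0, 1].
    k = min(len(row_totals), len(col_totals))
    v = math.sqrt(chi2 / (n * (k - 1))) if k > 1 else 0.0
    return chi2, v
```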
-
-
10. The method of claim 8, wherein the relationship classification comprises categorical-categorical and wherein applying one or more categorical statistical measures to generate categorical association data comprises:
-
determining an observed frequency of co-occurrence of categories in the second column with categories in the first column; determining an expected frequency of co-occurrence of the categories in the second column with the categories in the first column; and generating the categorical association data quantifying each relationship between each category in the first column and each category in the second column based at least in part on the observed frequency of co-occurrence and the expected frequency of co-occurrence.
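The observed and expected co-occurrence frequencies of claim 10 can be sketched per category pair. Combining them into a standardized (Pearson) residual, (observed - expected) / sqrt(expected), is an assumption; the claim only says the association data is based on both frequencies.

```python
import math
from collections import Counter
from typing import Dict, Hashable, Sequence, Tuple


def cell_associations(
    first: Sequence[Hashable], second: Sequence[Hashable]
) -> Dict[Tuple[Hashable, Hashable], Tuple[int, float, float]]:
    """Map each (category, category) pair to (observed, expected, residual)."""
    n = len(first)
    observed = Counter(zip(first, second))
    row_totals, col_totals = Counter(first), Counter(second)
    result = {}
    for row in row_totals:
        for col in col_totals:
            # Expected count if the two columns were independent.
            expected = row_totals[row] * col_totals[col] / n
            seen = observed.get((row, col), 0)
            # Positive residual: the pair co-occurs more often than expected.
            result[(row, col)] = (seen, expected, (seen - expected) / math.sqrt(expected))
    return result
```

Each residual could then drive the color, number, or shape of one categorical association indicator.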
-
-
11. The method of claim 10, wherein the visualization comprises:
a global relationship indicator corresponding to the global association data;
a first axis comprising a first plurality of category indicators representing a plurality of categories of the first column, wherein the first plurality of category indicators are sorted according to a sorting criterion and wherein each category indicator visually represents a category attribute of the corresponding category;
a second axis comprising a second plurality of category indicators representing a plurality of categories in the second column, wherein the second plurality of category indicators are sorted according to the sorting criterion and wherein each category indicator visually represents the category attribute of the corresponding category;
a plurality of categorical association indicators corresponding to the categorical association data, wherein each categorical association indicator visually represents a relationship between a category in the plurality of categories in the first column and a category in the plurality of categories in the second column; and
an interface configured to receive a user input relating to one or more of: the sorting criterion, the category attribute visually represented by each category indicator, one or more category indicators in the first plurality of category indicators, or one or more category indicators in the second plurality of category indicators.
-
-
12. The method of claim 11, wherein the plurality of categorical association indicators are arranged in rows corresponding to the first plurality of category indicators and columns corresponding to the second plurality of category indicators.
-
13. The method of claim 11, wherein each categorical association indicator in the plurality of categorical association indicators comprises one or more of:
a color, a number, or a shape.
-
14. The method of claim 11, wherein the category attribute comprises one or more of:
a name of a corresponding category, an intrinsic rank of a corresponding category, a frequency of a corresponding category, or a strength of association between a corresponding category in a column and all categories in another column.
-
15. The method of claim 11, further comprising:
receiving, by at least one of the one or more computing devices, via the interface, a user input relating to one or more of: the sorting criterion, the category attribute visually represented by each category indicator, one or more category indicators in the first plurality of category indicators, or one or more category indicators in the second plurality of category indicators; and
updating, by at least one of the one or more computing devices, one or more of: the global relationship indicator, the first plurality of category indicators, the second plurality of category indicators, or the categorical association indicators based at least in part on the user input.
-
-
16. The method of claim 15, wherein the user input comprises a selection of one or more category indicators in the first plurality of category indicators and a selection of one or more category indicators in the second plurality of category indicators and further comprising:
applying, by at least one of the one or more computing devices, the one or more global statistical measures to the data in one or more categories of the first column corresponding to the one or more category indicators and one or more categories of the second column corresponding to the one or more category indicators to generate new global association data;
applying, by at least one of the one or more computing devices, the one or more categorical statistical measures to the data in the one or more categories of the first column and the one or more categories of the second column to generate new categorical association data; and
updating, by at least one of the one or more computing devices, the visualization based at least in part on one or more of the new global association data or the new categorical association data;
wherein the one or more categories of the first column correspond to the selected one or more category indicators in the first plurality of category indicators and wherein the one or more categories of the second column correspond to the selected one or more category indicators in the second plurality of category indicators.
-
-
17. The method of claim 8, wherein the relationship classification comprises categorical-numerical and wherein applying one or more categorical statistical measures to generate categorical association data comprises either:
-
calculating results of a plurality of one-sample T-tests for categories in the first column and ranges of data values in the second column to generate the categorical association data quantifying each relationship between each category in the first column and each range of data values in the second column; or determining an observed frequency of co-occurrence of data values within ranges of data values in the second column with categories in the first column; determining an expected frequency of co-occurrence of data values within the ranges of data values in the second column with the categories in the first column; and generating the categorical association data quantifying each relationship between each category in the first column and each range of data values in the second column based at least in part on the observed frequency of co-occurrence and the expected frequency of co-occurrence.
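The second alternative in claim 17 (observed versus expected frequencies over ranges of the numerical column) can be sketched by first binning the numerical values. Equal-width bins, the bin count, and the function names are assumptions; the claim does not say how the ranges are formed.

```python
from collections import Counter
from typing import Dict, Hashable, Sequence, Tuple


def bin_index(value: float, low: float, high: float, bins: int) -> int:
    """Index of the equal-width bin containing value; the maximum falls in the last bin."""
    if high == low:
        return 0
    return min(int((value - low) / (high - low) * bins), bins - 1)


def category_range_frequencies(
    categories: Sequence[Hashable], values: Sequence[float], bins: int = 2
) -> Dict[Tuple[Hashable, int], Tuple[int, float]]:
    """Map each (category, range index) to (observed count, expected count)."""
    low, high = min(values), max(values)
    ranges = [bin_index(v, low, high, bins) for v in values]
    n = len(values)
    observed = Counter(zip(categories, ranges))
    category_totals, range_totals = Counter(categories), Counter(ranges)
    return {
        (cat, rng): (
            observed.get((cat, rng), 0),
            # Expected count if category and range were independent.
            category_totals[cat] * range_totals[rng] / n,
        )
        for cat in category_totals
        for rng in range_totals
    }
```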
-
-
18. The method of claim 17, wherein the visualization comprises:
a global relationship indicator corresponding to the global association data;
a first axis comprising a plurality of category indicators representing a plurality of categories of the first column, wherein the plurality of category indicators are sorted according to a sorting criterion and wherein each category indicator visually represents a category attribute of the corresponding category;
a second axis comprising a distribution of data values in the second column; and
a plurality of categorical association indicators corresponding to the categorical association data, wherein each categorical association indicator visually represents a relationship between a corresponding category in the plurality of categories in the first column and one or more ranges of data values in the second column; and
an interface configured to receive a user input relating to one or more of: the sorting criterion, the category attribute visually represented by each category indicator, one or more category indicators in the plurality of category indicators, or a range of data values in the distribution of data values in the second column.
-
-
19. The method of claim 18, wherein the visualization further comprises:
-
a plurality of categorical distribution indicators corresponding to a distribution visualization type, wherein each categorical distribution indicator visually represents a distribution of data values in the second column corresponding to a category in the plurality of categories of the first column; wherein the interface is further configured to receive a user input relating to the distribution visualization type.
-
-
20. The method of claim 19, further comprising:
receiving, by at least one of the one or more computing devices, via the interface, a user input relating to one or more of: the sorting criterion, the category attribute visually represented by each category indicator, one or more category indicators in the plurality of category indicators, the range of data values in the distribution of data values in the second column, or the distribution visualization type; and
updating, by at least one of the one or more computing devices, one or more of: the global relationship indicator, the plurality of category indicators, the categorical association indicators, or the plurality of categorical distribution indicators based at least in part on the user input.
-
-
21. The method of claim 18, wherein each categorical association indicator in the plurality of categorical association indicators comprises one or more of:
a color, a number, or a shape.
-
22. The method of claim 18, wherein the category attribute comprises one or more of:
a sum of a corresponding category, a mean of a corresponding category, a frequency of a corresponding category, or a strength of association between a corresponding category in a column and all data values in another column.
-
23. The method of claim 18, further comprising:
-
receiving, by at least one of the one or more computing devices, via the interface, a selection of a range of data values in the distribution of data values in the second column; applying, by at least one of the one or more computing devices, the one or more global statistical measures to the data in the first column and data corresponding to the selected range of data values in the second column to generate new global association data; applying, by at least one of the one or more computing devices, the one or more categorical statistical measures to the data in the first column and data corresponding to the selected range of data values in the second column to generate new categorical association data quantifying each relationship between each category in the first column and the selected range of data values in the second column; and updating, by at least one of the one or more computing devices, the visualization with one or more of the new global association data or the new categorical association data.
-
-
24. The method of claim 23, wherein updating the visualization with one or more of the new global association data or the new categorical association data comprises:
transmitting one or more new categorical association indicators corresponding to the new categorical association data, wherein each new categorical association indicator in the one or more new categorical association indicators visually represents a relationship between a corresponding category in the plurality of categories in the first column and the selected range of data values in the second column.
-
25. The method of claim 24, wherein updating the visualization with one or more of the new global association data or the new categorical association data further comprises:
transmitting one or more remaining distribution indicators, wherein each remaining distribution indicator in the one or more remaining distribution indicators corresponds to a category in the plurality of categories in the first column and visually represents an attribute of the distribution of data values in the second column for that category relative to the selected range of data values in the second column for that category.
-
26. The method of claim 25, wherein each remaining distribution indicator visually represents a distance between a bound of the selected range of data values and a bound of a range of data values which includes a minimum percentage of all data values for that category.
-
27. The method of claim 26, wherein each remaining distribution indicator visually represents a quantity of data values for that category required to reach the minimum percentage.
-
28. The method of claim 27, wherein each remaining distribution indicator comprises a triangle, wherein the triangle is positioned relative to the selected range of data values based on the distance, and wherein the height of the triangle visually represents the quantity of data values for that category required to reach the minimum percentage.
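Claims 26 to 28 describe an indicator with two quantities: a distance between range bounds and a count of data values needed to reach a minimum percentage. The sketch below is one possible reading, not the patent's stated computation: it assumes the selected range is extended upward only, and that the minimum percentage is given as a fraction of that category's values. All names are hypothetical.

```python
import math
from typing import Sequence, Tuple


def remaining_indicator(
    values: Sequence[float], low: float, high: float, min_fraction: float
) -> Tuple[float, int]:
    """Return (distance to extend the upper bound, number of extra values needed)."""
    needed_total = math.ceil(min_fraction * len(values))
    inside = sum(1 for v in values if low <= v <= high)
    missing = max(0, needed_total - inside)
    if missing == 0:
        # The selected range already covers the minimum fraction.
        return 0.0, 0
    above = sorted(v for v in values if v > high)
    if len(above) < missing:
        raise ValueError("not enough values above the selected range")
    # Smallest upper bound that brings in exactly the missing number of values.
    new_high = above[missing - 1]
    return new_high - high, missing
```

In the triangle of claim 28, the first returned value would set the triangle's position relative to the selected range and the second its height.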
-
29. The method of claim 23, further comprising:
-
applying, by at least one of the one or more computing devices, the one or more categorical statistical measures to the data in the first column and data corresponding to a plurality of subsets of the selected range of data values to generate subset categorical association data quantifying each relationship between each category in the first column and each subset in the plurality of subsets of the selected range of data values in the second column; and updating, by at least one of the one or more computing devices, the visualization with the subset categorical association data.
-
-
30. The method of claim 23, wherein receiving, via the interface, a selection of a range of data values in the distribution of data values in the second column comprises:
-
detecting, via the interface, a user input beginning at a starting point in the distribution of data values; detecting, via the interface, a continuation of the user input to a current position beyond the starting point in the distribution of data values; and setting the range of data values to be the range between the starting point and the current position.
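The range selection of claim 30 amounts to tracking a start point and a current point of a drag. A minimal sketch, assuming positions are already expressed as data values on the distribution axis; ordering the two bounds so that a drag in either direction works is an added convenience, not something the claim requires.

```python
from typing import Tuple


def selected_range(start: float, current: float) -> Tuple[float, float]:
    """Range between the drag's starting point and its current position."""
    # min/max make the result independent of drag direction.
    return (min(start, current), max(start, current))
```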
-
-
31. An apparatus for visualizing relationships between pairs of columns, the apparatus comprising:
one or more processors; and
one or more memories operatively coupled to at least one of the one or more processors and having instructions stored thereon that, when executed by at least one of the one or more processors, cause at least one of the one or more processors to:
identify a relationship classification corresponding to two columns in a plurality of columns based on a data type of each column in the two columns, wherein the data type of each column in the plurality of columns comprises either categorical data or numerical data and wherein the relationship classification comprises one of categorical-categorical, categorical-numerical, and numerical-numerical;
apply one or more statistical measures to data in the two columns to generate association data quantifying a plurality of relationships between data values in a first column of the two columns and data values in a second column of the two columns, wherein the one or more statistical measures are determined based at least in part on the relationship classification; and
transform the association data into a visualization, wherein the visualization comprises one or more indicators corresponding to one or more relationships in the plurality of relationships and wherein a layout of the visualization is determined based on the relationship classification.
-
32. The apparatus of claim 31, wherein the categorical data comprises one or more of nominal data and ordinal data.
-
33. The apparatus of claim 31, wherein at least one of the one or more memories has further instructions stored thereon that, when executed by at least one of the one or more processors, cause at least one of the one or more processors to:
receive a selection of two column identifiers corresponding to the two columns in the plurality of columns prior to identifying the relationship classification for the two columns of data.
-
34. The apparatus of claim 31, wherein at least one of the one or more memories has further instructions stored thereon that, when executed by at least one of the one or more processors, cause at least one of the one or more processors to:
determine a plurality of relationship classifications corresponding to a plurality of pairs of columns in the plurality of columns based on the data type of each column in each pair of columns, wherein each column in the plurality of columns has a corresponding relationship count;
for each pair of columns in the plurality of pairs of columns:
apply one or more global statistical measures to data in the pair of columns to determine whether a significant relationship exists between data values in a first column of the pair of columns and data values in a second column of the pair of columns, wherein the one or more global statistical measures are determined based at least in part on the relationship classification; and
increment the relationship count corresponding to each column in the pair of columns based at least in part on a determination that a significant relationship exists between data values in the first column of the pair of columns and data values in the second column of the pair of columns; and
transmit a plurality of relationship indicators corresponding to the plurality of columns, wherein each relationship indicator corresponds to a column in the plurality of columns and indicates the relationship count of that column.
-
-
35. The apparatus of claim 34, wherein the instructions that, when executed by at least one of the one or more processors, cause at least one of the one or more processors to apply one or more global statistical measures to data in the pair of columns to determine whether a significant relationship exists between data values in a first column of the pair of columns and data values in a second column of the pair of columns further cause at least one of the one or more processors to perform one of:
applying a Pearson correlation based at least in part on a determination that the relationship classification comprises numerical-numerical and a determination that data values in both columns of the pair of columns are continuous data values;
applying a Spearman correlation based at least in part on a determination that the relationship classification comprises numerical-numerical and a determination that data values in at least one column of the pair of columns are ordinal data values;
applying a Chi-squared test and Cramér's V measure based at least in part on a determination that the relationship classification comprises categorical-categorical; or
applying one or more of a one-way Analysis of Variance (ANOVA) test or a plurality of one-sample T-tests based at least in part on a determination that the relationship classification comprises categorical-numerical.
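For the categorical-numerical branch, the one-way ANOVA F statistic can be computed from the grouped values. This sketch uses the standard between-group and within-group sums of squares and returns only the F statistic, not a p-value; the function name is hypothetical.

```python
from collections import defaultdict
from typing import Hashable, Sequence


def one_way_anova_f(categories: Sequence[Hashable], values: Sequence[float]) -> float:
    """F statistic comparing the mean of the numerical column across categories."""
    groups = defaultdict(list)
    for category, value in zip(categories, values):
        groups[category].append(value)
    n, k = len(values), len(groups)
    grand_mean = sum(values) / n
    # Variation of the group means around the grand mean.
    ss_between = sum(
        len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups.values()
    )
    # Variation of individual values around their own group mean.
    ss_within = sum(
        (v - sum(g) / len(g)) ** 2 for g in groups.values() for v in g
    )
    # Undefined (division by zero) when k == 1, n == k, or ss_within == 0.
    return (ss_between / (k - 1)) / (ss_within / (n - k))
```

A larger F suggests that the category explains more of the variation in the numerical column.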
-
-
36. The apparatus of claim 34, wherein the instructions that, when executed by at least one of the one or more processors, cause at least one of the one or more processors to apply one or more global statistical measures to data in the pair of columns to determine whether a significant relationship exists between data values in a first column of the pair of columns and data values in a second column of the pair of columns further cause at least one of the one or more processors to:
-
apply the one or more global statistical measures to determine a strength of relationship between data values in a first column of the pair of columns and data values in a second column of the pair of columns; determine whether the strength of relationship is above a predetermined threshold; and determine that a significant relationship exists between data values in a first column of the pair of columns and data values in a second column of the pair of columns based at least in part on a determination that the strength of relationship is above the predetermined threshold.
-
-
37. The apparatus of claim 34, wherein each relationship indicator in the plurality of relationship indicators comprises a circle having a size proportional to the relationship count of the corresponding column.
-
38. The apparatus of claim 31, wherein the instructions that, when executed by at least one of the one or more processors, cause at least one of the one or more processors to apply one or more statistical measures to data in the two columns to generate association data quantifying a plurality of relationships between data values in a first column of the two columns and data values in a second column of the two columns further cause at least one of the one or more processors to:
apply one or more global statistical measures to the data in the two columns to generate global association data, wherein the one or more global statistical measures are based at least in part on the relationship classification; and
apply one or more categorical statistical measures to the data in the two columns to generate categorical association data based at least in part on a determination that the relationship classification comprises either categorical-categorical or categorical-numerical, wherein the one or more categorical statistical measures are based at least in part on the relationship classification.
-
-
39. The apparatus of claim 38, wherein the instructions that, when executed by at least one of the one or more processors, cause at least one of the one or more processors to applying one or more global statistical measures based at least in part on the relationship classification further cause at least one of the one or more processors to:
-
apply a Pearson correlation based at least in part on a determination that the relationship classification comprises numerical-numerical and a determination that data values in both columns of the pair of columns are continuous data values; apply a Spearman correlation based at least in part on a determination that the relationship classification comprises numerical-numerical and a determination that data values in at least one column of the pair of columns are ordinal data values; apply a Chi-squared test and Cramer's V measure based at least in part on a determination that the relationship classification comprises categorical-categorical; or apply one or more of a one-way ANOVA test or a plurality of one-sample T-tests based at least in part on a determination that the relationship classification comprises categorical-numerical.
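By way of a non-limiting illustration, the selection of a global statistical measure from the relationship classification might be sketched as follows. The names `Relationship`, `select_global_measures`, and the `any_ordinal` flag are hypothetical and do not appear in the claims; the function returns only the names of the measures and does not compute them.

```python
from enum import Enum


class Relationship(Enum):
    """The three relationship classifications recited in the claims."""
    CATEGORICAL_CATEGORICAL = "categorical-categorical"
    CATEGORICAL_NUMERICAL = "categorical-numerical"
    NUMERICAL_NUMERICAL = "numerical-numerical"


def select_global_measures(relationship, any_ordinal=False):
    """Return the names of the global measures for a classification.

    `any_ordinal` is a hypothetical flag: True when at least one of the
    two numerical columns holds ordinal rather than continuous values.
    """
    if relationship is Relationship.NUMERICAL_NUMERICAL:
        return ["spearman"] if any_ordinal else ["pearson"]
    if relationship is Relationship.CATEGORICAL_CATEGORICAL:
        return ["chi_squared", "cramers_v"]
    if relationship is Relationship.CATEGORICAL_NUMERICAL:
        # The claim permits ANOVA, one-sample T-tests, or both;
        # this sketch arbitrarily picks ANOVA.
        return ["one_way_anova"]
    raise ValueError(f"unknown relationship: {relationship!r}")
```

For example, `select_global_measures(Relationship.NUMERICAL_NUMERICAL, any_ordinal=True)` returns `["spearman"]`.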
-
-
40. The apparatus of claim 38, wherein the relationship classification comprises categorical-categorical and wherein the instructions that, when executed by at least one of the one or more processors, cause at least one of the one or more processors to apply one or more categorical statistical measures to generate categorical association data further cause at least one of the one or more processors to:
-
determine an observed frequency of co-occurrence of categories in the second column with categories in the first column; determine an expected frequency of co-occurrence of the categories in the second column with the categories in the first column; and generate the categorical association data quantifying each relationship between each category in the first column and each category in the second column based at least in part on the observed frequency of co-occurrence and the expected frequency of co-occurrence.
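By way of a non-limiting illustration, the observed and expected frequencies of co-occurrence might be computed as in the sketch below. The function name `category_associations` is hypothetical, and the use of a Pearson residual to combine the two frequencies is an assumption; the claim requires only that the association data be based on both.

```python
from collections import Counter


def category_associations(first, second):
    """Map each (category_a, category_b) pair to (observed, expected, residual).

    `first` and `second` are equal-length sequences of category labels,
    one entry per row.
    """
    if len(first) != len(second):
        raise ValueError("columns must have the same number of rows")
    n = len(first)
    row_totals = Counter(first)
    col_totals = Counter(second)
    observed = Counter(zip(first, second))
    result = {}
    for a, row_total in row_totals.items():
        for b, col_total in col_totals.items():
            obs = observed[(a, b)]
            # Expected count if the two columns were independent.
            exp = row_total * col_total / n
            # Assumption: a Pearson residual quantifies the relationship.
            result[(a, b)] = (obs, exp, (obs - exp) / exp ** 0.5)
    return result
```

A positive residual indicates that the pair of categories co-occurs more often than independence would predict, and a negative residual indicates less often.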
-
-
41. The apparatus of claim 40, wherein the visualization comprises:
-
a global relationship indicator corresponding to the global association data; a first axis comprising a first plurality of category indicators representing a plurality of categories of the first column, wherein the first plurality of category indicators are sorted according to a sorting criterion and wherein each category indicator visually represents a category attribute of the corresponding category; a second axis comprising a second plurality of category indicators representing a plurality of categories in the second column, wherein the second plurality of category indicators are sorted according to the sorting criterion and wherein each category indicator visually represents the category attribute of the corresponding category; a plurality of categorical association indicators corresponding to the categorical association data, wherein each categorical association indicator visually represents a relationship between a category in the plurality of categories in the first column and a category in the plurality of categories in the second column; and an interface configured to receive a user input relating to one or more of: the sorting criterion, the category attribute visually represented by each category indicator, one or more category indicators in the first plurality of category indicators, or one or more category indicators in the second plurality of category indicators.
-
-
42. The apparatus of claim 41, wherein the plurality of categorical association indicators are arranged in rows corresponding to the first plurality of category indicators and columns corresponding to the second plurality of category indicators.
-
43. The apparatus of claim 41, wherein each categorical association indicator in the plurality of categorical association indicators comprises one or more of:
- a color, a number, or a shape.
-
44. The apparatus of claim 41, wherein the category attribute comprises one or more of:
- a name of a corresponding category, an intrinsic rank of a corresponding category, a frequency of a corresponding category, or a strength of association between a corresponding category in a column and all categories in another column.
-
45. The apparatus of claim 41, wherein at least one of the one or more memories has further instructions stored thereon that, when executed by at least one of the one or more processors, cause at least one of the one or more processors to:
-
receive, via the interface, a user input relating to one or more of: the sorting criterion, the category attribute visually represented by each category indicator, one or more category indicators in the first plurality of category indicators, or one or more category indicators in the second plurality of category indicators; and update one or more of: the global relationship indicator, the first plurality of category indicators, the second plurality of category indicators, or the categorical association indicators based at least in part on the user input.
-
-
46. The apparatus of claim 45, wherein the user input comprises a selection of one or more category indicators in the first plurality of category indicators and a selection of one or more category indicators in the second plurality of category indicators and wherein at least one of the one or more memories has further instructions stored thereon that, when executed by at least one of the one or more processors, cause at least one of the one or more processors to:
-
apply the one or more global statistical measures to the data in one or more categories of the first column corresponding to the one or more category indicators and one or more categories of the second column corresponding to the one or more category indicators to generate new global association data; apply the one or more categorical statistical measures to the data in the one or more categories of the first column and the one or more categories of the second column to generate new categorical association data; and update the visualization based at least in part on one or more of the new global association data or the new categorical association data; wherein the one or more categories of the first column correspond to the selected one or more category indicators in the first plurality of category indicators and wherein the one or more categories of the second column correspond to the selected one or more category indicators in the second plurality of category indicators.
-
-
47. The apparatus of claim 38, wherein the relationship classification comprises categorical-numerical and wherein the instructions that, when executed by at least one of the one or more processors, cause at least one of the one or more processors to apply one or more categorical statistical measures to generate categorical association data further cause at least one of the one or more processors to either:
-
calculate results of a plurality of one-sample T-tests for categories in the first column and ranges of data values in the second column to generate the categorical association data quantifying each relationship between each category in the first column and each range of data values in the second column; or determine an observed frequency of co-occurrence of data values within ranges of data values in the second column with categories in the first column; determine an expected frequency of co-occurrence of data values within the ranges of data values in the second column with the categories in the first column; and generate the categorical association data quantifying each relationship between each category in the first column and each range of data values in the second column based at least in part on the observed frequency of co-occurrence and the expected frequency of co-occurrence.
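By way of a non-limiting illustration, the second alternative above (observed versus expected co-occurrence of categories with ranges of data values) might be sketched as follows. The equal-width ranges, the default of four ranges, and the names `bin_index` and `category_range_associations` are assumptions; the claim does not specify how the ranges are formed.

```python
from collections import Counter


def bin_index(value, low, high, bins):
    """Index of the equal-width range containing value (top edge inclusive)."""
    if high == low:
        return 0
    i = int((value - low) / (high - low) * bins)
    return min(i, bins - 1)


def category_range_associations(categories, values, bins=4):
    """Map (category, range_index) to (observed, expected) frequencies."""
    if len(categories) != len(values):
        raise ValueError("columns must have the same number of rows")
    if not values:
        return {}
    low, high = min(values), max(values)
    # Assumption: ranges are equal-width bins over the observed span.
    ranges = [bin_index(v, low, high, bins) for v in values]
    n = len(values)
    cat_totals = Counter(categories)
    range_totals = Counter(ranges)
    observed = Counter(zip(categories, ranges))
    return {
        (c, r): (observed[(c, r)], cat_totals[c] * range_totals[r] / n)
        for c in cat_totals
        for r in range_totals
    }
```

Ranges that contain no data values are omitted from the result, since their expected frequency would be zero for every category.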
-
-
48. The apparatus of claim 47, wherein the visualization comprises:
-
a global relationship indicator corresponding to the global association data; a first axis comprising a plurality of category indicators representing a plurality of categories of the first column, wherein the plurality of category indicators are sorted according to a sorting criterion and wherein each category indicator visually represents a category attribute of the corresponding category; a second axis comprising a distribution of data values in the second column; and a plurality of categorical association indicators corresponding to the categorical association data, wherein each categorical association indicator visually represents a relationship between a corresponding category in the plurality of categories in the first column and one or more ranges of data values in the second column; and an interface configured to receive a user input relating to one or more of: the sorting criterion, the category attribute visually represented by each category indicator, one or more category indicators in the plurality of category indicators, or a range of data values in the distribution of data values in the second column.
-
-
49. The apparatus of claim 48, wherein the visualization further comprises:
-
a plurality of categorical distribution indicators corresponding to a distribution visualization type, wherein each categorical distribution indicator visually represents a distribution of data values in the second column corresponding to a category in the plurality of categories of the first column; wherein the interface is further configured to receive a user input relating to the distribution visualization type.
-
-
50. The apparatus of claim 49, wherein at least one of the one or more memories has further instructions stored thereon that, when executed by at least one of the one or more processors, cause at least one of the one or more processors to:
-
receive, via the interface, a user input relating to one or more of: the sorting criterion, the category attribute visually represented by each category indicator, one or more category indicators in the plurality of category indicators, the range of data values in the distribution of data values in the second column, or the distribution visualization type; and update one or more of: the global relationship indicator, the plurality of category indicators, the categorical association indicators, or the plurality of categorical distribution indicators based at least in part on the user input.
-
-
51. The apparatus of claim 48, wherein each categorical association indicator in the plurality of categorical association indicators comprises one or more of:
- a color, a number, or a shape.
-
52. The apparatus of claim 48, wherein the category attribute comprises one or more of:
- a sum of a corresponding category, a mean of a corresponding category, a frequency of a corresponding category, or a strength of association between a corresponding category in a column and all data values in another column.
-
53. The apparatus of claim 48, wherein at least one of the one or more memories has further instructions stored thereon that, when executed by at least one of the one or more processors, cause at least one of the one or more processors to:
-
receive via the interface, a selection of a range of data values in the distribution of data values in the second column; apply the one or more global statistical measures to the data in the first column and data corresponding to the selected range of data values in the second column to generate new global association data; apply the one or more categorical statistical measures to the data in the first column and data corresponding to the selected range of data values in the second column to generate new categorical association data quantifying each relationship between each category in the first column and the selected range of data values in the second column; and update the visualization with one or more of the new global association data or the new categorical association data.
-
-
54. The apparatus of claim 53, wherein the instructions that, when executed by at least one of the one or more processors, cause at least one of the one or more processors to update the visualization with one or more of the new global association data or the new categorical association data further cause at least one of the one or more processors to:
transmit one or more new categorical association indicators corresponding to the new categorical association data, wherein each new categorical association indicator in the one or more new categorical association indicators visually represents a relationship between a corresponding category in the plurality of categories in the first column and the selected range of data values in the second column.
-
55. The apparatus of claim 54, wherein the instructions that, when executed by at least one of the one or more processors, cause at least one of the one or more processors to update the visualization with one or more of the new global association data or the new categorical association data further cause at least one of the one or more processors to:
transmit one or more remaining distribution indicators, wherein each remaining distribution indicator in the one or more remaining distribution indicators corresponds to a category in the plurality of categories in the first column and visually represents an attribute of the distribution of data values in the second column for that category relative to the selected range of data values in the second column for that category.
-
56. The apparatus of claim 55, wherein each remaining distribution indicator visually represents a distance between a bound of the selected range of data values and a bound of a range of data values which includes a minimum percentage of all data values for that category.
-
57. The apparatus of claim 56, wherein each remaining distribution indicator visually represents a quantity of data values for that category required to reach the minimum percentage.
-
58. The apparatus of claim 57, wherein each remaining distribution indicator comprises a triangle, wherein the triangle is positioned relative to the selected range of data values based on the distance, and wherein the height of the triangle visually represents the quantity of data values for that category required to reach the minimum percentage.
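By way of a non-limiting illustration, one possible reading of the distance and quantity recited in claims 56 to 58 is sketched below for a single category. The sketch assumes that the selected range is extended upward only, that the minimum percentage is supplied as a fraction, and that both bounds of the selected range are inclusive; none of these details is stated in the claims, and the name `remaining_indicator` is hypothetical.

```python
import math


def remaining_indicator(values, low, high, min_fraction=0.5):
    """Return (distance, quantity) for one category, or None if unreachable.

    distance: how far the upper bound must move so that the range holds
              at least `min_fraction` of the category's values.
    quantity: how many additional values are needed to reach that share.
    """
    inside = sum(1 for v in values if low <= v <= high)
    needed = math.ceil(min_fraction * len(values)) - inside
    if needed <= 0:
        return 0.0, 0  # the selected range already holds the minimum share
    above = sorted(v for v in values if v > high)
    if needed > len(above):
        # Assumption: only upward extension is considered in this sketch.
        return None
    new_high = above[needed - 1]
    return new_high - high, needed
```

Under this reading, the distance would position the triangle relative to the selected range and the quantity would set its height.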
-
59. The apparatus of claim 53, wherein at least one of the one or more memories has further instructions stored thereon that, when executed by at least one of the one or more processors, cause at least one of the one or more processors to:
-
apply the one or more categorical statistical measures to the data in the first column and data corresponding to a plurality of subsets of the selected range of data values to generate subset categorical association data quantifying each relationship between each category in the first column and each subset in the plurality of subsets of the selected range of data values in the second column; and update the visualization with the subset categorical association data.
-
-
60. The apparatus of claim 53, wherein the instructions that, when executed by at least one of the one or more processors, cause at least one of the one or more processors to receive, via the interface, a selection of a range of data values in the distribution of data values in the second column further cause at least one of the one or more processors to:
-
detect, via the interface, a user input beginning at a starting point in the distribution of data values; detect, via the interface, a continuation of the user input to a current position beyond the starting point in the distribution of data values; and set the range of data values to be the range between the starting point and the current position.
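By way of a non-limiting illustration, the range selection described above might be tracked with a small state object such as the following. The class name `RangeSelection` and its method names are hypothetical, and ordering the two bounds so that a drag in either direction yields a valid range is an assumption.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class RangeSelection:
    """Tracks a drag-style selection over a distribution of data values."""
    start: Optional[float] = None
    current: Optional[float] = None

    def begin(self, position: float) -> None:
        """User input begins at a starting point in the distribution."""
        self.start = position
        self.current = position

    def extend(self, position: float) -> None:
        """User input continues to a new current position."""
        if self.start is None:
            raise RuntimeError("extend() called before begin()")
        self.current = position

    def selected_range(self) -> Optional[Tuple[float, float]]:
        """Range between the starting point and the current position."""
        if self.start is None or self.current is None:
            return None
        # Assumption: bounds are ordered so either drag direction works.
        return (min(self.start, self.current), max(self.start, self.current))
```

A caller would invoke `begin` when the input starts, `extend` on each movement, and read `selected_range` to drive recomputation of the association data.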
-
-
32. The apparatus of claim 31, wherein the categorical data comprises one or more of nominal data and ordinal data.
-
-
61. At least one non-transitory computer-readable medium storing computer-readable instructions that, when executed by one or more computing devices, cause at least one of the one or more computing devices to:
-
identify a relationship classification corresponding to two columns in a plurality of columns based on a data type of each column in the two columns, wherein the data type of each column in the plurality of columns comprises either categorical data or numerical data and wherein the relationship classification comprises one of categorical-categorical, categorical-numerical, and numerical-numerical; apply one or more statistical measures to data in the two columns to generate association data quantifying a plurality of relationships between data values in a first column of the two columns and data values in a second column of the two columns, wherein the one or more statistical measures are determined based at least in part on the relationship classification; and transform the association data into a visualization, wherein the visualization comprises one or more indicators corresponding to one or more relationships in the plurality of relationships and wherein a layout of the visualization is determined based on the relationship classification.
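By way of a non-limiting illustration, the identification of a relationship classification from the data type of each column might be sketched as follows. Inferring the data type from the Python type of each value is an assumption made for brevity (a deployed system might instead rely on schema metadata, and integer-coded ordinal categories would be misread as numerical here); the function names are hypothetical.

```python
from numbers import Number


def column_type(values):
    """Return 'numerical' if every non-missing value is a number, else 'categorical'."""
    present = [v for v in values if v is not None]
    is_numeric = bool(present) and all(
        isinstance(v, Number) and not isinstance(v, bool) for v in present
    )
    return "numerical" if is_numeric else "categorical"


def relationship_classification(first, second):
    """Return one of the three classifications recited in the claim."""
    # Sorting puts 'categorical' before 'numerical', so a mixed pair
    # always yields 'categorical-numerical' regardless of column order.
    kinds = sorted((column_type(first), column_type(second)))
    return "-".join(kinds)
```

For example, a column of product names paired with a column of prices would be classified as categorical-numerical whichever column is passed first.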
-
62. The at least one non-transitory computer-readable medium of claim 61, wherein the categorical data comprises one or more of nominal data and ordinal data.
-
63. The at least one non-transitory computer-readable medium of claim 61, further storing computer-readable instructions that, when executed by at least one of the one or more computing devices, cause at least one of the one or more computing devices to:
receive a selection of two column identifiers corresponding to the two columns in the plurality of columns prior to identifying the relationship classification for the two columns of data.
-
64. The at least one non-transitory computer-readable medium of claim 61, further storing computer-readable instructions that, when executed by at least one of the one or more computing devices, cause at least one of the one or more computing devices to:
-
determine a plurality of relationship classifications corresponding to a plurality of pairs of columns in the plurality of columns based on the data type of each column in each pair of columns, wherein each column in the plurality of columns has a corresponding relationship count; for each pair of columns in the plurality of pairs of columns: apply one or more global statistical measures to data in the pair of columns to determine whether a significant relationship exists between data values in a first column of the pair of columns and data values in a second column of the pair of columns, wherein the one or more global statistical measures are determined based at least in part on the relationship classification; and increment the relationship count corresponding to each column in the pair of columns based at least in part on a determination that a significant relationship exists between data values in the first column of the pair of columns and data values in the second column of the pair of columns; and transmit a plurality of relationship indicators corresponding to the plurality of columns, wherein each relationship indicator corresponds to a column in the plurality of columns and indicates the relationship count of that column.
-
-
65. The at least one non-transitory computer-readable medium of claim 64, wherein the instructions that, when executed by at least one of the one or more computing devices, cause at least one of the one or more computing devices to apply one or more global statistical measures to data in the pair of columns to determine whether a significant relationship exists between data values in a first column of the pair of columns and data values in a second column of the pair of columns further cause at least one of the one or more computing devices to perform one of:
-
applying a Pearson correlation based at least in part on a determination that the relationship classification comprises numerical-numerical and a determination that data values in both columns of the pair of columns are continuous data values; applying a Spearman correlation based at least in part on a determination that the relationship classification comprises numerical-numerical and a determination that data values in at least one column of the pair of columns are ordinal data values; applying a Chi-squared test and Cramer's V measure based at least in part on a determination that the relationship classification comprises categorical-categorical; or applying one or more of a one-way Analysis of Variance (ANOVA) test or a plurality of one-sample T-tests based at least in part on a determination that the relationship classification comprises categorical-numerical.
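By way of a non-limiting illustration, the Chi-squared statistic and Cramer's V measure for a categorical-categorical pair might be computed as in the sketch below. The function name is hypothetical, and the sketch returns only the statistic and the effect size; a p-value would additionally require the chi-squared distribution, which is omitted here.

```python
import math
from collections import Counter


def chi_squared_and_cramers_v(first, second):
    """Return (chi2, V) for two equal-length sequences of category labels."""
    if len(first) != len(second):
        raise ValueError("columns must have the same number of rows")
    n = len(first)
    rows = Counter(first)
    cols = Counter(second)
    observed = Counter(zip(first, second))
    chi2 = 0.0
    for a, row_total in rows.items():
        for b, col_total in cols.items():
            expected = row_total * col_total / n
            chi2 += (observed[(a, b)] - expected) ** 2 / expected
    # Cramer's V normalizes chi2 by the sample size and table dimensions.
    k = min(len(rows), len(cols)) - 1
    v = math.sqrt(chi2 / (n * k)) if k > 0 else 0.0
    return chi2, v
```

Cramer's V ranges from 0 (no association) to 1 (each category in one column determines the category in the other), which makes it convenient for comparison against a threshold.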
-
-
66. The at least one non-transitory computer-readable medium of claim 64, wherein the instructions that, when executed by at least one of the one or more computing devices, cause at least one of the one or more computing devices to apply one or more global statistical measures to data in the pair of columns to determine whether a significant relationship exists between data values in a first column of the pair of columns and data values in a second column of the pair of columns further cause at least one of the one or more computing devices to:
-
apply the one or more global statistical measures to determine a strength of relationship between data values in a first column of the pair of columns and data values in a second column of the pair of columns; determine whether the strength of relationship is above a predetermined threshold; and determine that a significant relationship exists between data values in a first column of the pair of columns and data values in a second column of the pair of columns based at least in part on a determination that the strength of relationship is above the predetermined threshold.
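By way of a non-limiting illustration, the threshold test above, together with the per-column relationship count of claim 64, might be sketched as follows for numerical columns. The threshold value of 0.7, the use of the absolute Pearson coefficient as the strength of relationship, and the names `pearson` and `relationship_counts` are assumptions; pairs with other relationship classifications would substitute a different measure through the `strength` parameter.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant column has no measurable linear relationship
    return sxy / math.sqrt(sxx * syy)


def relationship_counts(columns, threshold=0.7, strength=pearson):
    """Count, per column, the pairs whose strength exceeds the threshold.

    `columns` maps a column name to its list of values.
    """
    counts = {name: 0 for name in columns}
    for a, b in combinations(columns, 2):
        # Assumption: the magnitude of the coefficient is the strength.
        if abs(strength(columns[a], columns[b])) > threshold:
            counts[a] += 1
            counts[b] += 1
    return counts
```

The resulting counts could then drive the relationship indicators, for example by scaling the size of a circle drawn for each column.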
-
-
67. The at least one non-transitory computer-readable medium of claim 64, wherein each relationship indicator in the plurality of relationship indicators comprises a circle having a size proportional to the relationship count of the corresponding column.
-
68. The at least one non-transitory computer-readable medium of claim 61, wherein the instructions that, when executed by at least one of the one or more computing devices, cause at least one of the one or more computing devices to apply one or more statistical measures to data in the two columns to generate association data quantifying a plurality of relationships between data values in a first column of the two columns and data values in a second column of the two columns further cause at least one of the one or more computing devices to:
-
apply one or more global statistical measures to the data in the two columns to generate global association data, wherein the one or more global statistical measures are based at least in part on the relationship classification; and apply one or more categorical statistical measures to the data in the two columns to generate categorical association data based at least in part on a determination that the relationship classification comprises either categorical-categorical or categorical-numerical, wherein the one or more categorical statistical measures are based at least in part on the relationship classification.
-
-
69. The at least one non-transitory computer-readable medium of claim 68, wherein the instructions that, when executed by at least one of the one or more computing devices, cause at least one of the one or more computing devices to apply one or more global statistical measures based at least in part on the relationship classification further cause at least one of the one or more computing devices to perform one of:
-
applying a Pearson correlation based at least in part on a determination that the relationship classification comprises numerical-numerical and a determination that data values in both columns of the pair of columns are continuous data values; applying a Spearman correlation based at least in part on a determination that the relationship classification comprises numerical-numerical and a determination that data values in at least one column of the pair of columns are ordinal data values; applying a Chi-squared test and Cramer's V measure based at least in part on a determination that the relationship classification comprises categorical-categorical; or applying one or more of a one-way ANOVA test or a plurality of one-sample T-tests based at least in part on a determination that the relationship classification comprises categorical-numerical.
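By way of a non-limiting illustration, the F statistic of a one-way ANOVA test for a categorical-numerical pair might be computed as in the sketch below. The function name `one_way_anova_f` is hypothetical, and the sketch stops at the F statistic; converting it to a p-value would require the F distribution, which is omitted here.

```python
def one_way_anova_f(categories, values):
    """Return the one-way ANOVA F statistic for values grouped by category."""
    if len(categories) != len(values):
        raise ValueError("columns must have the same number of rows")
    groups = {}
    for c, v in zip(categories, values):
        groups.setdefault(c, []).append(v)
    n, k = len(values), len(groups)
    if k < 2 or n <= k:
        raise ValueError("need at least two groups and more rows than groups")
    grand_mean = sum(values) / n
    means = {c: sum(g) / len(g) for c, g in groups.items()}
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (means[c] - grand_mean) ** 2 for c, g in groups.items())
    # Variation of the values around their own group mean.
    ss_within = sum((v - means[c]) ** 2 for c, g in groups.items() for v in g)
    if ss_within == 0:
        return float("inf") if ss_between > 0 else 0.0
    return (ss_between / (k - 1)) / (ss_within / (n - k))
```

A large F value indicates that the means of the numerical column differ between categories by more than the spread within each category would explain.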
-
-
70. The at least one non-transitory computer-readable medium of claim 68, wherein the relationship classification comprises categorical-categorical and wherein the instructions that, when executed by at least one of the one or more computing devices, cause at least one of the one or more computing devices to apply one or more categorical statistical measures to generate categorical association data further cause at least one of the one or more computing devices to:
-
determine an observed frequency of co-occurrence of categories in the second column with categories in the first column; determine an expected frequency of co-occurrence of the categories in the second column with the categories in the first column; and generate the categorical association data quantifying each relationship between each category in the first column and each category in the second column based at least in part on the observed frequency of co-occurrence and the expected frequency of co-occurrence.
-
-
71. The at least one non-transitory computer-readable medium of claim 70, wherein the visualization comprises:
-
a global relationship indicator corresponding to the global association data; a first axis comprising a first plurality of category indicators representing a plurality of categories of the first column, wherein the first plurality of category indicators are sorted according to a sorting criterion and wherein each category indicator visually represents a category attribute of the corresponding category; a second axis comprising a second plurality of category indicators representing a plurality of categories in the second column, wherein the second plurality of category indicators are sorted according to the sorting criterion and wherein each category indicator visually represents the category attribute of the corresponding category; a plurality of categorical association indicators corresponding to the categorical association data, wherein each categorical association indicator visually represents a relationship between a category in the plurality of categories in the first column and a category in the plurality of categories in the second column; and an interface configured to receive a user input relating to one or more of: the sorting criterion, the category attribute visually represented by each category indicator, one or more category indicators in the first plurality of category indicators, or one or more category indicators in the second plurality of category indicators.
-
-
72. The at least one non-transitory computer-readable medium of claim 71, wherein the plurality of categorical association indicators are arranged in rows corresponding to the first plurality of category indicators and columns corresponding to the second plurality of category indicators.
-
73. The at least one non-transitory computer-readable medium of claim 71, wherein each categorical association indicator in the plurality of categorical association indicators comprises one or more of:
- a color, a number, or a shape.
-
74. The at least one non-transitory computer-readable medium of claim 71, wherein the category attribute comprises one or more of:
- a name of a corresponding category, an intrinsic rank of a corresponding category, a frequency of a corresponding category, or a strength of association between a corresponding category in a column and all categories in another column.
-
75. The at least one non-transitory computer-readable medium of claim 71, further storing computer-readable instructions that, when executed by at least one of the one or more computing devices, cause at least one of the one or more computing devices to:
-
receive, via the interface, a user input relating to one or more of: the sorting criterion, the category attribute visually represented by each category indicator, one or more category indicators in the first plurality of category indicators, or one or more category indicators in the second plurality of category indicators; and update one or more of: the global relationship indicator, the first plurality of category indicators, the second plurality of category indicators, or the categorical association indicators based at least in part on the user input.
-
-
76. The at least one non-transitory computer-readable medium of claim 75, wherein the user input comprises a selection of one or more category indicators in the first plurality of category indicators and a selection of one or more category indicators in the second plurality of category indicators and further storing computer-readable instructions that, when executed by at least one of the one or more computing devices, cause at least one of the one or more computing devices to:
-
apply the one or more global statistical measures to the data in one or more categories of the first column corresponding to the one or more category indicators and one or more categories of the second column corresponding to the one or more category indicators to generate new global association data; apply the one or more categorical statistical measures to the data in the one or more categories of the first column and the one or more categories of the second column to generate new categorical association data; and update the visualization based at least in part on one or more of the new global association data or the new categorical association data; wherein the one or more categories of the first column correspond to the selected one or more category indicators in the first plurality of category indicators and wherein the one or more categories of the second column correspond to the selected one or more category indicators in the second plurality of category indicators.
-
-
77. The at least one non-transitory computer-readable medium of claim 68, wherein the relationship classification comprises categorical-numerical and wherein the instructions that, when executed by at least one of the one or more computing devices, cause at least one of the one or more computing devices to apply one or more categorical statistical measures to generate categorical association data further cause at least one of the one or more computing devices to either:
-
calculate results of a plurality of one-sample T-tests for categories in the first column and ranges of data values in the second column to generate the categorical association data quantifying each relationship between each category in the first column and each range of data values in the second column; or determine an observed frequency of co-occurrence of data values within ranges of data values in the second column with categories in the first column; determine an expected frequency of co-occurrence of data values within the ranges of data values in the second column with the categories in the first column; and generate the categorical association data quantifying each relationship between each category in the first column and each range of data values in the second column based at least in part on the observed frequency of co-occurrence and the expected frequency of co-occurrence.
-
-
78. The at least one non-transitory computer-readable medium of claim 77, wherein the visualization comprises:
-
a global relationship indicator corresponding to the global association data; a first axis comprising a plurality of category indicators representing a plurality of categories of the first column, wherein the plurality of category indicators are sorted according to a sorting criterion and wherein each category indicator visually represents a category attribute of the corresponding category; a second axis comprising a distribution of data values in the second column; and a plurality of categorical association indicators corresponding to the categorical association data, wherein each categorical association indicator visually represents a relationship between a corresponding category in the plurality of categories in the first column and one or more ranges of data values in the second column; and an interface configured to receive a user input relating to one or more of: the sorting criterion, the category attribute visually represented by each category indicator, one or more category indicators in the plurality of category indicators, or a range of data values in the distribution of data values in the second column.
-
-
79. The at least one non-transitory computer-readable medium of claim 78, wherein the visualization further comprises:
-
a plurality of categorical distribution indicators corresponding to a distribution visualization type, wherein each categorical distribution indicator visually represents a distribution of data values in the second column corresponding to a category in the plurality of categories of the first column; wherein the interface is further configured to receive a user input relating to the distribution visualization type.
-
-
80. The at least one non-transitory computer-readable medium of claim 79, further storing computer-readable instructions that, when executed by at least one of the one or more computing devices, cause at least one of the one or more computing devices to:
-
receive, via the interface, a user input relating to one or more of: the sorting criterion, the category attribute visually represented by each category indicator, one or more category indicators in the plurality of category indicators, the range of data values in the distribution of data values in the second column, or the distribution visualization type; and
update one or more of: the global relationship indicator, the plurality of category indicators, the categorical association indicators, or the plurality of categorical distribution indicators based at least in part on the user input.
-
-
81. The at least one non-transitory computer-readable medium of claim 78, wherein each categorical association indicator in the plurality of categorical association indicators comprises one or more of:
- a color, a number, or a shape.
-
82. The at least one non-transitory computer-readable medium of claim 78, wherein the category attribute comprises one or more of:
- a sum of a corresponding category, a mean of a corresponding category, a frequency of a corresponding category, or a strength of association between a corresponding category in a column and all data values in another column.
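By way of a hedged example, three of the category attributes listed above (sum, mean, and frequency) can be computed per category and then used as the sorting criterion for the category indicators of claim 78. The function names `category_attributes` and `sort_categories` are hypothetical, and the fourth attribute (strength of association) is omitted because the claim does not fix a particular measure for it.

```python
from collections import defaultdict


def category_attributes(categories, values):
    """Group the second-column values by category and summarize each group."""
    grouped = defaultdict(list)
    for category, value in zip(categories, values):
        grouped[category].append(value)
    return {
        category: {
            "sum": sum(group),
            "mean": sum(group) / len(group),
            "frequency": len(group),
        }
        for category, group in grouped.items()
    }


def sort_categories(attributes, criterion, descending=True):
    """Order category names by the chosen attribute (assumed sorting criterion)."""
    return sorted(
        attributes, key=lambda c: attributes[c][criterion], reverse=descending
    )


attrs = category_attributes(["x", "y", "x", "y", "y"], [10, 1, 20, 2, 3])
by_sum = sort_categories(attrs, "sum")
by_frequency = sort_categories(attrs, "frequency")
```

Changing the `criterion` argument corresponds to a user input relating to the sorting criterion, which would reorder the category indicators along the first axis.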
-
83. The at least one non-transitory computer-readable medium of claim 78, further storing computer-readable instructions that, when executed by at least one of the one or more computing devices, cause at least one of the one or more computing devices to:
-
receive, via the interface, a selection of a range of data values in the distribution of data values in the second column;
apply the one or more global statistical measures to the data in the first column and data corresponding to the selected range of data values in the second column to generate new global association data;
apply the one or more categorical statistical measures to the data in the first column and data corresponding to the selected range of data values in the second column to generate new categorical association data quantifying each relationship between each category in the first column and the selected range of data values in the second column; and
update the visualization with one or more of the new global association data or the new categorical association data.
-
-
84. The at least one non-transitory computer-readable medium of claim 83, wherein the instructions that, when executed by at least one of the one or more computing devices, cause at least one of the one or more computing devices to update the visualization with one or more of the new global association data or the new categorical association data further cause at least one of the one or more computing devices to:
transmit one or more new categorical association indicators corresponding to the new categorical association data, wherein each new categorical association indicator in the one or more new categorical association indicators visually represents a relationship between a corresponding category in the plurality of categories in the first column and the selected range of data values in the second column.
-
85. The at least one non-transitory computer-readable medium of claim 84, wherein the instructions that, when executed by at least one of the one or more computing devices, cause at least one of the one or more computing devices to update the visualization with one or more of the new global association data or the new categorical association data further cause at least one of the one or more computing devices to:
transmit one or more remaining distribution indicators, wherein each remaining distribution indicator in the one or more remaining distribution indicators corresponds to a category in the plurality of categories in the first column and visually represents an attribute of the distribution of data values in the second column for that category relative to the selected range of data values in the second column for that category.
-
86. The at least one non-transitory computer-readable medium of claim 85, wherein each remaining distribution indicator visually represents a distance between a bound of the selected range of data values and a bound of a range of data values which includes a minimum percentage of all data values for that category.
-
87. The at least one non-transitory computer-readable medium of claim 86, wherein each remaining distribution indicator visually represents a quantity of data values for that category required to reach the minimum percentage.
-
88. The at least one non-transitory computer-readable medium of claim 87, wherein each remaining distribution indicator comprises a triangle, wherein the triangle is positioned relative to the selected range of data values based on the distance, and wherein the height of the triangle visually represents the quantity of data values for that category required to reach the minimum percentage.
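One possible reading of the remaining distribution indicator of claims 86 to 88 is sketched below, for illustration only. The claims do not say which bound of the selected range is compared or how the range containing the minimum percentage is found; the sketch assumes that only the upper bound of the selected range is extended, one data value at a time, until the minimum fraction of the category's values is covered. The function name `remaining_indicator` and its parameters are hypothetical.

```python
import math


def remaining_indicator(values, lo, hi, min_fraction):
    """Return (distance, quantity) for one category, or None if unreachable.

    Assumed interpretation:
      quantity = extra data values needed so that the range covers at least
                 min_fraction of all values for the category
      distance = how far the upper bound must move beyond hi to include them
    """
    inside = [v for v in values if lo <= v <= hi]
    required = math.ceil(min_fraction * len(values))
    quantity = required - len(inside)
    if quantity <= 0:
        return (0, 0)  # the selected range already reaches the minimum

    above = sorted(v for v in values if v > hi)
    if len(above) < quantity:
        return None  # extending the upper bound alone cannot reach the minimum

    new_hi = above[quantity - 1]
    return (new_hi - hi, quantity)


indicator = remaining_indicator(list(range(1, 11)), lo=1, hi=4, min_fraction=0.5)
```

Under this reading, the triangle of claim 88 would be drawn offset from the selected range by the returned distance, with a height proportional to the returned quantity.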
-
89. The at least one non-transitory computer-readable medium of claim 83, further storing computer-readable instructions that, when executed by at least one of the one or more computing devices, cause at least one of the one or more computing devices to:
-
apply the one or more categorical statistical measures to the data in the first column and data corresponding to a plurality of subsets of the selected range of data values to generate subset categorical association data quantifying each relationship between each category in the first column and each subset in the plurality of subsets of the selected range of data values in the second column; and update the visualization with the subset categorical association data.
-
-
90. The at least one non-transitory computer-readable medium of claim 83, wherein the instructions that, when executed by at least one of the one or more computing devices, cause at least one of the one or more computing devices to receive, via the interface, a selection of a range of data values in the distribution of data values in the second column further cause at least one of the one or more computing devices to:
-
detect, via the interface, a user input beginning at a starting point in the distribution of data values; detect, via the interface, a continuation of the user input to a current position beyond the starting point in the distribution of data values; and set the range of data values to be the range between the starting point and the current position.
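The three steps above can be sketched as a small state holder that records where a drag gesture began and where it currently is, and reports the range between the two. This is a minimal illustration, not the claimed implementation; the class name `RangeSelector` and its method names are hypothetical, and positions are assumed to be already converted from screen coordinates into data values on the second axis.

```python
class RangeSelector:
    """Tracks a press-and-drag gesture over a one-dimensional value axis."""

    def __init__(self):
        self.start = None
        self.current = None

    def begin(self, position):
        # User input begins at a starting point in the distribution.
        self.start = position
        self.current = position

    def move(self, position):
        # User input continues to a current position.
        if self.start is None:
            raise RuntimeError("move() called before begin()")
        self.current = position

    def selected_range(self):
        # The range lies between the starting point and the current position,
        # returned in ascending order so dragging in either direction works.
        if self.start is None:
            return None
        return (min(self.start, self.current), max(self.start, self.current))


selector = RangeSelector()
selector.begin(30.0)
selector.move(55.0)
selected = selector.selected_range()
```

Each call to `move` could trigger the recomputation of claim 83, so that the association indicators follow the selection as it grows or shrinks.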
-
-
62. The at least one non-transitory computer-readable medium of claim 61, wherein the categorical data comprises one or more of nominal data and ordinal data.
-
Current Assignee: Informatica LLC (Informatica, Inc. (California))
-
Original Assignee: Informatica LLC (Informatica, Inc. (California))
-
Inventors: Convertino, Gregorio; Sun, Maoyuan
-
CPC Class Codes: G06F 17/18 for evaluating statistical ...; G06F 40/18 of spreadsheets form-fillin...; G06T 11/206 Drawing of charts or graphs