Multivariate insight discovery approach

US 10,255,345 B2
Filed: 10/09/2014
Issued: 04/09/2019
Est. Priority Date: 10/09/2014
Status: Active Grant



Abstract

A raw dataset including measures and dimensions is processed, by a preprocessing module, using an algorithm that produces a preprocessed dataset such that at least one type of statistical analysis of the preprocessed dataset yields equal results to the same type of statistical analysis of the raw dataset. The preprocessed dataset is then analyzed by a statistical analysis module to identify subsets of the preprocessed dataset that include a non-random structure or pattern. The analysis of the preprocessed dataset includes the at least one type of statistical analysis that produces the same results for both the preprocessed and raw datasets. The identified subsets are then ranked by a statistical ranker based on the analysis of the preprocessed dataset and a subset is selected for visualization based on the rankings. A visualization module then generates a visualization of the selected identified subset that highlights a non-random structure of the selected subset.
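The requirement that an analysis give equal results on the preprocessed and raw datasets can be illustrated with averages. The following is a minimal sketch, not the patented implementation: it assumes a toy dataset with hypothetical attributes `region` and `city` and a hypothetical measure `revenue`, aggregates over `city`, and keeps a sum and a count per remaining cell so that the per-region mean is unchanged.

```python
from collections import defaultdict

# Hypothetical raw rows: (region, city, revenue). Names and values are illustrative only.
raw_rows = [
    ("West", "Seattle", 10.0),
    ("West", "Seattle", 20.0),
    ("West", "Portland", 30.0),
    ("East", "Boston", 40.0),
    ("East", "Boston", 60.0),
]


def aggregate_over_city(rows):
    """Collapse the 'city' attribute, keeping a sum and a count per region."""
    cells = defaultdict(lambda: {"sum": 0.0, "count": 0})
    for region, _city, revenue in rows:
        cells[region]["sum"] += revenue
        cells[region]["count"] += 1
    return dict(cells)


def mean_from_raw(rows, region):
    """Average revenue for a region computed directly from the raw rows."""
    values = [rev for reg, _city, rev in rows if reg == region]
    return sum(values) / len(values)


def mean_from_cells(cells, region):
    """Average revenue for a region computed from the aggregated cell."""
    cell = cells[region]
    return cell["sum"] / cell["count"]


cells = aggregate_over_city(raw_rows)
for region in ("West", "East"):
    # The aggregated data reproduces the raw per-region average.
    assert mean_from_raw(raw_rows, region) == mean_from_cells(cells, region)
```

Storing only a pre-computed average per cell would lose the weights needed to combine cells, which is one plausible reason to keep both values.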

44 Citations

20 Claims

1. A method comprising:
- identifying, by the at least one processor, data types of a raw dataset and models of data of the raw dataset to determine attribute hierarchies;
  
  generating, by the at least one processor, a reduced dataset from the raw dataset based on the determined attribute hierarchies by:
  
  mapping, by the at least one processor, the attribute hierarchies to identify sets of equivalent attributes and measures;
  
  for each set of equivalent attributes, selecting, by the at least one processor, one of the equivalent attributes and discarding the remaining equivalent attributes; and
  
  for each set of equivalent attributes, selecting, by the at least one processor, one of the equivalent measures and discarding the remaining equivalent measures;
  
  aggregating over at least one attribute of the reduced dataset, by the at least one processor, to generate a preprocessed dataset with the same relevant statistical properties of the raw dataset, such that at least one type of statistical analysis produces the same results when applied to the preprocessed dataset as when applied to the raw dataset;
  
  identifying, by the at least one processor, subsets of the preprocessed dataset that include data that exhibits non-random patterns by performing the at least one type of statistical analysis;
  
  generating a score for each of the identified subsets of the preprocessed dataset, by the at least one processor, based on the data that exhibits non-random patterns included in each of the identified subsets;
  
  ranking each of the identified subsets for presence of non-random data structures, by the at least one processor, based on the score generated for each of the identified subsets;
  
  selecting, by the at least one processor, an identified subset based on the ranking of the identified subset; and
  
  generating, by the at least one processor, a visualization that highlights a non-random structure of the selected identified subset.
  - 2. The method of claim 1, wherein the preprocessed dataset includes at least one online analytical processing (OLAP) cube and generating the preprocessed dataset further comprises:
    - discarding a measure of the raw dataset based on more than half of the values of the measure being zero;
      
      determining hierarchical relationships between attributes of the raw dataset; and
      
      storing “Sum” and “Count” values for corresponding measures of an attribute based on an aggregation type of the attribute being “Average”.
  - 3. The method of claim 2, wherein generating the preprocessed dataset includes at least one of:
    - aggregating over attributes of the raw dataset containing 99% of the same value;
      
      aggregating over attributes of the raw dataset with a cardinality greater than a threshold value;
      
      or aggregating over all of the attributes of the raw dataset in order of decreasing cardinality until the dataset has a threshold size.
  - 4. The method of claim 1, wherein the at least one type of statistical analysis includes an analysis of variance (ANOVA) test, and the method further comprises:
    - analyzing only subsets of the preprocessed dataset that consist of an attribute and a measure and subsets that consist of two attributes and a measure; and
      
      generating a score for each identified subset is based on an effect size of ANOVA for the identified subset.
  - 5. The method of claim 1, wherein the visualization includes at least one of a mark representing different values of an attribute, a mark type for each type of data point representation, a mark color property associated with a measure or an attribute, or a mark size property associated with a value of a measure, and the method further comprises:
    - selecting an attribute of the selected identified subset for the color property based on attribute hierarchies; and
      
      selecting a mark type based on a cardinality of an attribute of the selected identified subset.
  - 6. The method of claim 5, wherein the selecting an attribute for the color property based on the attribute hierarchies includes one of:
    - determining that an attribute is one of included as a mark and included on an x-axis with a cardinality less than 10;
      
      or determining that an attribute is at a higher hierarchy level than an attribute used as a mark with a cardinality less than 10.
  - 7. The method of claim 1, wherein mapping the attribute hierarchies to identify sets of equivalent attributes comprises generating an attribute hierarchy map comprising identified hierarchical relationships between attributes.
  - 8. The method of claim 1, wherein identifying sets of equivalent measures comprises calculating a correlation coefficient for each of a plurality of pairs of measures and determining whether the calculated coefficient for each pair of measures is greater than a predefined threshold to be considered as equivalent measures.
  - 9. The method of claim 1, further comprising:
    - selecting a predetermined number of highest scoring identified subsets; and
      
      generalizing a visualization for each of the selected highest scoring identified subsets, each visualization highlighting the non-random structure of the selected highest scoring identified subset.
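The scoring and ranking steps of claims 1, 4 and 9 can be sketched as follows. This is an illustration under stated assumptions, not the patented method: the claims name "an effect size of ANOVA" without fixing a formula, so eta squared (between-group sum of squares divided by total sum of squares) is assumed here; the function names, the attribute and measure names, and the data are hypothetical; and the sketch scores only one-attribute subsets on raw rows rather than on an OLAP cube.

```python
from collections import defaultdict


def eta_squared(groups):
    """One-way ANOVA effect size: SS_between / SS_total (assumed score)."""
    all_values = [v for values in groups.values() for v in values]
    grand_mean = sum(all_values) / len(all_values)
    ss_total = sum((v - grand_mean) ** 2 for v in all_values)
    if ss_total == 0:
        return 0.0  # a constant measure carries no structure
    ss_between = sum(
        len(values) * (sum(values) / len(values) - grand_mean) ** 2
        for values in groups.values()
    )
    return ss_between / ss_total


def score_subsets(rows, attributes, measure):
    """Score each (attribute, measure) subset and rank by descending score."""
    scores = {}
    for attribute in attributes:
        groups = defaultdict(list)
        for row in rows:
            groups[row[attribute]].append(row[measure])
        scores[(attribute, measure)] = eta_squared(groups)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical data: revenue differs strongly by region, barely by channel.
rows = [
    {"region": "West", "channel": "web", "revenue": 10.0},
    {"region": "West", "channel": "store", "revenue": 12.0},
    {"region": "East", "channel": "web", "revenue": 30.0},
    {"region": "East", "channel": "store", "revenue": 32.0},
]

ranked = score_subsets(rows, ["region", "channel"], "revenue")
top_n = ranked[:1]  # a predetermined number of highest scoring subsets
```

With this data the `region` subset ranks first, so it would be the one selected for visualization.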

10. A system comprising:
- one or more processors; and
  
  a machine-readable storage medium storing a set of instructions that, when executed by the one or more processors, cause the system to perform operations comprising:
  
  identifying data types of a raw dataset and models of data of the raw dataset to determine attribute hierarchies;
  
  generating a reduced dataset from the raw dataset based on the determined attribute hierarchies by:
  
  mapping the attribute hierarchies to identify sets of equivalent attributes and measures;
  
  for each set of equivalent attributes, selecting one of the equivalent attributes and discarding the remaining equivalent attributes; and
  
  for each set of equivalent attributes, selecting one of the equivalent measures and discarding the remaining equivalent measures;
  
  aggregating over at least one attribute of the reduced dataset to generate a preprocessed dataset with the same relevant statistical properties of the raw dataset, such that at least one type of statistical analysis produces the same results when applied to each of the preprocessed dataset and the raw dataset;
  
  identifying subsets of the preprocessed dataset that include data that exhibits non-random patterns by performing the at least one type of statistical analysis;
  
  generating a score for each of the identified subsets of the preprocessed dataset based on the data that exhibits the non-random patterns included in each of the identified subsets;
  
  ranking each of the identified subsets for presence of non-random data structures, based on the score generated for each of the identified subsets;
  
  selecting an identified subset based on the ranking of the identified subset; and
  
  generating a visualization of the selected identified subset that highlights a non-random structure of the selected identified subset.
  - 11. The system of claim 10, wherein the preprocessed dataset includes at least one online analytical processing (OLAP) cube and the operations further comprise:
    - discarding a measure of the raw dataset based on more than half of the values of the measure being zero;
      
      determining hierarchical relationships between attributes of the raw dataset; and
      
      storing “Sum” and “Count” values for corresponding measures of an attribute based on an aggregation type of the attribute being “Average”.
  - 12. The system of claim 11, wherein generating the preprocessed dataset includes at least one of:
    - aggregating over attributes of the raw dataset containing 99% of the same value;
      
      aggregating over attributes of the raw dataset with a cardinality greater than a threshold value;
      
      or aggregating over all of the attributes of the raw dataset in order of decreasing cardinality until the dataset has a threshold size.
  - 13. The system of claim 10, wherein the at least one type of statistical analysis includes an analysis of variance (ANOVA) test and the operations further comprise:
    - analyzing only subsets of the preprocessed dataset that consist of an attribute and a measure and subsets that consist of two attributes and a measure; and
      
      generating a score for each identified subset based on an effect size of ANOVA for the identified subset.
  - 14. The system of claim 10, wherein the visualization includes at least one of a mark representing different values of an attribute, a mark type for each type of data point representation, a mark color property associated with a measure or an attribute, or a mark size property associated with a value of a measure, and the operations further comprise:
    - selecting an attribute of the selected identified subset for the color property based on attribute hierarchies; and
      
      selecting a mark type based on a cardinality of an attribute of the selected identified subset.
  - 15. The system of claim 14, wherein the selecting the attribute for the color property based on the attribute hierarchies includes one of:
    - determining that an attribute is one of included as a mark or included on an x-axis with a cardinality less than 10;
      
      or determining that an attribute is at a higher hierarchy level than an attribute used as a mark with a cardinality less than 10.
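The preprocessing rules recited in claims 11 and 12 can be sketched as simple column checks. This is a rough illustration only: the function names and sample columns are hypothetical, "containing 99% of the same value" is read here as "the most frequent value accounts for at least 99% of rows", and the cardinality threshold is a caller-supplied assumption because the claims do not fix one.

```python
from collections import Counter


def mostly_zero(values):
    """True when more than half of the measure's values are zero."""
    return sum(1 for v in values if v == 0) > len(values) / 2


def attributes_to_aggregate(columns, max_cardinality, dominant_share=0.99):
    """Pick attributes that are near-constant or have too many distinct values."""
    selected = []
    for name, values in columns.items():
        counts = Counter(values)
        top_share = counts.most_common(1)[0][1] / len(values)
        if top_share >= dominant_share or len(counts) > max_cardinality:
            selected.append(name)
    return selected


# Hypothetical attribute columns with 100 rows each.
columns = {
    "country": ["US"] * 99 + ["CA"],     # 99% of rows share one value
    "order_id": list(range(100)),        # cardinality 100
    "region": ["West", "East"] * 50,     # balanced, cardinality 2
}

to_aggregate = attributes_to_aggregate(columns, max_cardinality=50)
discard_discount = mostly_zero([0, 0, 0, 5])  # 3 of 4 values are zero
keep_revenue = not mostly_zero([0, 0, 1, 2])  # exactly half is not "more than half"
```

Here `country` and `order_id` would be aggregated over, while `region` would be kept as a dimension for analysis.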

16. A non-transitory machine-readable storage medium including instructions that, when executed on at least one processor of a machine, cause the machine to perform operations comprising:
- identifying data types of a raw dataset and models of data of the raw dataset to determine attribute hierarchies;
  
  generating a reduced dataset from the raw dataset based on the determined attribute hierarchies by:
  
  mapping the attribute hierarchies to identify sets of equivalent attributes and measures;
  
  for each set of equivalent attributes, selecting one of the equivalent attributes and discarding the remaining equivalent attributes; and
  
  for each set of equivalent attributes, selecting one of the equivalent measures and discarding the remaining equivalent measures;
  
  aggregating over at least one attribute of the reduced dataset to generate a preprocessed dataset with the same relevant statistical properties of the raw dataset, such that at least one type of statistical analysis produces the same results when applied to each of the preprocessed dataset and the raw dataset;
  
  identifying subsets of the preprocessed dataset that include data that exhibits non-random patterns by performing the at least one type of statistical analysis;
  
  generating a score for each of the identified subsets of the preprocessed dataset based on the data that exhibits the non-random patterns included in each of the identified subsets;
  
  ranking each of the identified subsets for presence of non-random data structures, based on the score generated for each of the identified subsets;
  
  selecting an identified subset based on the ranking of the identified subset; and
  
  generating a visualization of the selected identified subset that highlights a non-random structure of the selected identified subset.
  - 17. The machine-readable storage medium of claim 16, wherein the preprocessed dataset includes at least one online analytical processing (OLAP) cube and generating the preprocessed dataset includes:
    - discarding a measure of the raw dataset based on more than half of the values of the measure being zero;
      
      determining hierarchical relationships between attributes of the raw dataset;
      
      storing “Sum” and “Count” values for corresponding measures of an attribute based on an aggregation type of the attribute being “Average”.
  - 18. The machine-readable storage medium of claim 16, wherein the at least one type of statistical analysis includes an analysis of variance (ANOVA) test, and the operations further comprise:
    - analyzing only subsets of the preprocessed dataset that consist of an attribute and a measure and subsets that consist of two attributes and a measure;
      
      generating a score for each identified subset based on an effect size of ANOVA for the identified subset.
  - 19. The machine-readable storage medium of claim 18, wherein the visualization includes at least one of a mark representing different values of an attribute, a mark type for each type of data point representation, a mark color property associated with a measure or an attribute, or a mark size property associated with a value of a measure, and the operations further comprise:
    - selecting an attribute of the selected identified subset for the color property based on attribute hierarchies; and
      
      selecting a mark type based on a cardinality of an attribute of the selected identified subset.
  - 20. The machine-readable storage medium of claim 19, wherein the selecting the attribute for the color property based on attribute hierarchies includes one of:
    - determining that an attribute is one of included as a mark and included on an x-axis with a cardinality less than 10;
      
      or determining that an attribute is at a higher hierarchy level than an attribute used as a mark with a cardinality less than 10.
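The color and mark selection of claims 19 and 20 can be sketched as below. The claim wording leaves some room for interpretation, so this is one possible reading rather than the patented logic: rule 1 picks the mark or x-axis attribute if its cardinality is below 10, and rule 2 otherwise walks up the hierarchy from the mark attribute to the first ancestor with cardinality below 10. The function names, the `parent_of` mapping, and the mark-type cutoff of 20 are invented for illustration; the claims only say the mark type depends on cardinality.

```python
def pick_color_attribute(mark_attr, x_axis_attr, cardinality, parent_of, limit=10):
    """Choose an attribute for the color property (one reading of claim 20)."""
    # Rule 1: an attribute used as the mark or on the x-axis, with low cardinality.
    for attr in (mark_attr, x_axis_attr):
        if attr is not None and cardinality[attr] < limit:
            return attr
    # Rule 2: an attribute higher in the hierarchy than the mark attribute.
    ancestor = parent_of.get(mark_attr)
    while ancestor is not None:
        if cardinality[ancestor] < limit:
            return ancestor
        ancestor = parent_of.get(ancestor)
    return None  # no suitable color attribute found


def choose_mark_type(attribute_cardinality, bar_limit=20):
    """Hypothetical mapping from cardinality to mark type (cutoff is assumed)."""
    return "bar" if attribute_cardinality <= bar_limit else "point"


# Hypothetical hierarchy city -> state -> country, plus an unrelated 'month'.
cardinality = {"city": 250, "state": 50, "country": 4, "month": 12}
parent_of = {"city": "state", "state": "country"}

color_attr = pick_color_attribute("city", "month", cardinality, parent_of)
mark_type = choose_mark_type(cardinality["city"])
```

In this example neither `city` nor `month` qualifies under rule 1, so the sketch falls back to `country`, the nearest ancestor of `city` with fewer than 10 distinct values.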


Current Assignee: Business Objects Incorporated (SAP SE)
Original Assignee: Business Objects Incorporated (SAP SE)
Inventors: Moser, Flavia; MacAulay, Alexander Kennedy; Gosper, Julian
Primary Examiner(s): Hershley, Mark E
Application Number: US14/511,047
Publication Number: US 20160103902A1
Time in Patent Office: 1,643 Days
Field of Search: 707602

CPC Class Codes
G06F 16/248   Presentation of query results
G06F 16/254   Extract, transform and load...
G06F 16/26   Visual data mining; Browsin...
G06F 16/283   Multi-dimensional databases...
G06T 11/206   Drawing of charts or graphs
