Automatic consistent sampling for data analysis

US 8,892,525 B2
Filed: 09/06/2013
Issued: 11/18/2014
Est. Priority Date: 07/19/2011
Status: Expired due to Fees

1 Assignment

0 Petitions

Abstract

A method, computer program product, and system for analyzing data within one or more databases, comprising selecting one or more databases for analysis, each database comprising one or more database objects comprising one or more data values, applying a function to each data value in each database object within the one or more databases, where the function produces function values limited to a predetermined range, identifying for analysis the data values producing a certain function value within the predetermined range to form a sampled data set, and analyzing the sampled data set to determine relationships between the database objects within and across the one or more databases.
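The sampling step described in the abstract can be illustrated with a minimal sketch. It assumes the function is a hash function (as in claim 5), that the predetermined range is the integers 0 to 15, and that the "certain function value" is 0. The names `bucket`, `sample_column`, `NUM_BUCKETS`, and `SAMPLE_BUCKET`, and the example columns, are hypothetical and do not come from the patent text. Because a value's function value depends only on the value itself, the same value is either sampled in every column where it appears or in none, which is what allows sampled columns to be compared with one another.

```python
import hashlib

NUM_BUCKETS = 16   # assumed size of the predetermined range
SAMPLE_BUCKET = 0  # assumed "certain function value"


def bucket(value, num_buckets=NUM_BUCKETS):
    """Map a data value to an integer in [0, num_buckets)."""
    digest = hashlib.sha256(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets


def sample_column(values, sample_bucket=SAMPLE_BUCKET, num_buckets=NUM_BUCKETS):
    """Keep only the distinct values whose function value equals sample_bucket."""
    return {v for v in values if bucket(v, num_buckets) == sample_bucket}


# Hypothetical columns: every orders.customer_id also appears in customers.id.
orders_customer_id = range(1, 1001)
customers_id = range(1, 2001)

sampled_orders = sample_column(orders_customer_id)
sampled_customers = sample_column(customers_id)

# Containment between the full columns is preserved between the samples.
print(sampled_orders <= sampled_customers)
```

Under these assumptions, each sample holds roughly one sixteenth of the distinct values, and a subset relationship between two full columns still holds between their samples.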

28 Citations

6 Claims

1. A computer-implemented method of analyzing data within one or more databases, comprising:

selecting one or more databases for analysis, each database comprising one or more database objects comprising one or more data values, wherein the data values in each database object are arranged in columns;

applying a function to each data value in each database object within the one or more databases, wherein the function produces function values limited to a predetermined range;

identifying for analysis the data values producing a certain function value within the predetermined range to form a sampled data set;

identifying for analysis the data values that produce function values other than the certain function value and reside in one or more columns lacking high cardinality to form an unsampled data set, wherein a column has a high cardinality when data values in the column satisfy one or more from a group of a predetermined cardinality threshold and a predetermined selectivity threshold; and

analyzing the sampled data set with the unsampled data set by matching data values within these data sets to determine relationships between the database objects within and across the one or more databases.

2. The method of claim 1, wherein said analysis further comprises determining one or more primary key-foreign key relationships between the database objects within and across the one or more databases.

3. The method of claim 1, wherein a column is a high cardinality column if a number of data values in the column that produce the certain function value exceeds the predetermined cardinality threshold.

4. The method of claim 1, wherein a column is a high cardinality column if a number of unique data values in the column divided by the number of data values in the column exceeds the predetermined selectivity threshold.

5. The method of claim 1, wherein the function is a hash function.

6. The method of claim 1, wherein the database objects are tables.
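One possible reading of the high-cardinality tests in claims 3 and 4, and of the sampled and unsampled data sets in claim 1, is sketched below. The threshold values, the bucket count, and the names `bucket`, `is_high_cardinality`, and `split_column` are assumptions made for illustration and are not taken from the patent. The sketch also assumes that "number of data values" in claim 3 counts rows rather than distinct values, which the claim language does not settle.

```python
import hashlib

NUM_BUCKETS = 16              # assumed size of the predetermined range
SAMPLE_BUCKET = 0             # assumed "certain function value"
CARDINALITY_THRESHOLD = 1000  # assumed value for the claim 3 test
SELECTIVITY_THRESHOLD = 0.5   # assumed value for the claim 4 test


def bucket(value):
    """Hash a data value into the range [0, NUM_BUCKETS)."""
    digest = hashlib.sha256(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_BUCKETS


def is_high_cardinality(values):
    """Return True if the column passes either threshold test."""
    values = list(values)
    if not values:
        return False
    # Claim 3: count of values that produce the certain function value.
    sampled_hits = sum(1 for v in values if bucket(v) == SAMPLE_BUCKET)
    # Claim 4: unique values divided by total values.
    selectivity = len(set(values)) / len(values)
    return (sampled_hits > CARDINALITY_THRESHOLD
            or selectivity > SELECTIVITY_THRESHOLD)


def split_column(values):
    """Return (sampled, unsampled) sets of distinct values for one column."""
    values = list(values)
    sampled = {v for v in values if bucket(v) == SAMPLE_BUCKET}
    if is_high_cardinality(values):
        unsampled = set()  # high-cardinality columns contribute the sample only
    else:
        unsampled = {v for v in values if bucket(v) != SAMPLE_BUCKET}
    return sampled, unsampled


# Hypothetical columns: a three-valued status code and a unique identifier.
status = ["open", "closed", "pending"] * 100
order_id = list(range(1, 1001))

status_sampled, status_unsampled = split_column(status)
id_sampled, id_unsampled = split_column(order_id)
```

In this sketch the low-cardinality `status` column keeps all of its distinct values across the two sets, while the unique `order_id` column is represented only by its sample. The sets from different columns could then be matched against each other, for example by set containment, to look for candidate primary key-foreign key relationships as in claim 2.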

Current Assignee: International Business Machines Corporation
Original Assignee: International Business Machines Corporation
Inventors: Gorelik, Alexander
Primary Examiner(s): Fan, Shiow-Jy
Application Number: US14/019,823
Publication Number: US 20140012819A1
Time in Patent Office: 438 Days
Field of Search: 707/690, 707/687, 707/E17.005, 707/999.1
US Class Current: 707/690
CPC Class Codes:
G06F 16/21   Design, administration or m...
G06F 16/2228   Indexing structures
G06F 16/2282   Tablespace storage structur...
G06F 16/2365   Ensuring data consistency a...
