Partition boundary determination using random sampling on very large databases

US 7,024,401 B2
Filed: 07/02/2001
Issued: 04/04/2006
Est. Priority Date: 07/02/2001
Status: Expired due to Fees

Abstract

A system and method utilizing random sampling for partition analysis on very large databases. The method utilizes a random sampling algorithm that provides results accurate to within a few percentage points for large homogeneous databases. The accuracy is not affected by the size of the database and is determined primarily by the size of the sample. The system and method for approximate partition analysis reduces the time required for an analysis to a fraction of the time required for an exact analysis. The reduction in time thereby permits more frequent and timely analyses of database partition sizes.
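
The abstract's statement that accuracy is governed by the sample size rather than the database size is what one would expect from simple random sampling. As a rough illustration only: the formula below is the standard normal approximation for the margin of error of a sampled proportion, it is not taken from the patent, and the function name `margin_of_error` is hypothetical.

```python
import math


def margin_of_error(sample_size: int, p: float = 0.5, z: float = 1.96) -> float:
    """Approximate margin of error for a fraction p estimated from a
    simple random sample of `sample_size` records.

    Assumptions (not from the patent): normal approximation, z = 1.96 for
    roughly 95% confidence, and a database much larger than the sample.
    Note that the database size N does not appear in the formula.
    """
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    return z * math.sqrt(p * (1.0 - p) / sample_size)


if __name__ == "__main__":
    # Worst case p = 0.5; the error shrinks as the sample grows.
    for s in (1_000, 10_000, 100_000):
        print(f"S = {s:>7}: +/- {100 * margin_of_error(s):.2f} percentage points")
```

Under these assumptions a sample of 1,000 records gives a worst-case error of about three percentage points, whether the database holds a million records or a billion.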

23 Claims

1. A method for database partition boundary determination in a database management system (DBMS), the method comprising:
- providing a pre-configured number S defining a default sample size in a database analysis program;
- selectively receiving by the database analysis program a particular number defining a desired sample size and setting said number S equal to said particular number;
- providing a seed value to the database analysis program for initializing a random number algorithm;
- randomly sampling S records of the database by the database analysis program using the random sampling algorithm, wherein said S records are different each time said method is utilized with different seed values, and wherein said S records are different for successive utilizations of said method if at least one record has been added to or deleted from said database between successive utilizations of said method;
- storing statistics for each of said S records as stored statistics including a record key for each record; and
- producing an approximation partition analysis based on said stored statistics, wherein said approximation partition analysis is not mathematically exact.
- Dependent claims (2, 3, 4, 5, 6, 7, 8):
  - 2. The method as set forth in claim 1, further comprising sorting said stored statistics by key prior to producing said partition analysis.
  - 3. The method as set forth in claim 1, wherein said storing said statistics includes storing said statistics in a memory.
  - 4. The method as set forth in claim 3, wherein said storing said statistics in said memory includes compressing the statistics prior to storing in said memory.
  - 5. The method as set forth in claim 3, further including sorting said stored statistics by key prior to producing said partition analysis.
  - 6. The method as set forth in claim 5, wherein said producing said approximation partition analysis includes defining multiple partition boundaries.
  - 7. The method as set forth in claim 6, further including:
    - accessing all database records in an arbitrary sequence;
    - iteratively filling all of said partitions except the last with said accessed records to a maximum byte count; and
    - storing remaining accessed records in the last of said partitions.
  - 8. The method as set forth in claim 1, wherein said randomly sampling said S records includes randomly sampling the S records utilizing dataspaces including:
    - at least one index dataspace;
    - at least one key dataspace; and
    - at least one statistics dataspace.
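
The flow recited in claims 1, 2 and 6 (seeded sampling of S record keys, sorting by key, then deriving multiple partition boundaries) can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the database is modeled as an in-memory list of keys, boundaries are taken at equally spaced positions in the sorted sample, and `DEFAULT_SAMPLE_SIZE` and `approximate_boundaries` are hypothetical names.

```python
import random

DEFAULT_SAMPLE_SIZE = 1000  # hypothetical pre-configured default for S


def approximate_boundaries(keys, num_partitions, seed, sample_size=None):
    """Return num_partitions - 1 approximate boundary keys.

    keys           -- list of record keys (stand-in for the database)
    num_partitions -- number of partitions wanted (assumed >= 1)
    seed           -- seed for the random number generator
    sample_size    -- optional override of the default sample size S
    """
    if not keys:
        raise ValueError("keys must not be empty")
    s = DEFAULT_SAMPLE_SIZE if sample_size is None else sample_size
    s = min(s, len(keys))          # cannot sample more records than exist
    rng = random.Random(seed)      # a different seed gives a different sample
    sample = rng.sample(keys, s)   # S distinct records, chosen uniformly
    sample.sort()                  # sort the stored keys (claim 2)
    # Take the sampled key at each 1/num_partitions position as a boundary.
    return [sample[(i * s) // num_partitions] for i in range(1, num_partitions)]


if __name__ == "__main__":
    database_keys = list(range(100_000))
    print(approximate_boundaries(database_keys, 4, seed=42, sample_size=2000))
```

The boundaries are estimates: rerunning with another seed moves them slightly, which is consistent with the claim language that the analysis "is not mathematically exact".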

9. A method for database partition boundary determination comprising:
- providing a pre-configured number S defining a default sample size;
- selectively receiving by the database analysis program a particular number defining a desired sample size and setting said number S equal to said particular number;
- providing a seed value for initializing a random number algorithm;
- randomly sampling S records of the database using the random sampling algorithm, wherein said S records are different each time said method is utilized with different seed values, and wherein said S records are different for successive utilizations of said method if at least one record has been added to or deleted from said database between successive utilizations of said method, wherein said randomly sampling S records further includes;
  - generating a table of S number pairs (Y_j, I_j), j = 1, 2, ..., S, wherein all Y and all I are initially set to zero;
  - initializing a reservoir of records to an empty state;
  - setting an index M to said reservoir equal to zero;
  - generating a sequence of N non-repeating random numbers U_1, U_2, ..., U_N, 0 ≤ U ≤ 1, wherein N is the number of records in the database;
  - performing additional steps for each random number U_k generated, k = 1, 2, ..., N, including;
    - skipping the next record in the database if U_k is less than the smallest value of Y in said table of number pairs; and
    - updating the table if a Y less than U_k exists by performing further steps including;
      - setting M equal to its current value plus one;
      - replacing the smallest Y in the table with U_k;
      - setting the I value paired with the smallest Y equal to M; and
      - storing all or part of the next record of the database in said reservoir of stored records, wherein the current value of M is a reservoir index to said stored record;
- storing statistics for each of said S records as stored statistics including a record key for each record; and
- producing an approximation partition analysis based on said stored statistics, wherein said approximation partition analysis is not mathematically exact.
- Dependent claims (10):
  - 10. The method as set forth in claim 9, wherein said updating the table further includes arranging the table in a heap with respect to Y.
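
The single-pass sampling steps of claims 9 and 10 can be sketched roughly as below. This is an illustrative reading, not the patented code: it assumes Python's `heapq` as the heap of claim 10, treats `random.Random.random()` as the source of the U_k values (repeats are possible in principle but vanishingly unlikely), and stores whole records in the reservoir. The function name `reservoir_sample` is hypothetical.

```python
import heapq
import random


def reservoir_sample(records, s, seed):
    """One sequential pass over `records`, returning up to `s` sampled records.

    The table holds S pairs (Y, I) arranged as a min-heap on Y, so table[0]
    always carries the smallest Y. I is a 1-based index into the reservoir;
    I == 0 marks a pair that was never filled.
    """
    if s <= 0:
        raise ValueError("s must be positive")
    rng = random.Random(seed)
    table = [(0.0, 0)] * s   # all Y and all I start at zero (a valid heap)
    reservoir = []           # starts empty
    m = 0                    # reservoir index M starts at zero
    for record in records:
        u = rng.random()     # U_k for the k-th record
        if u < table[0][0]:
            continue         # U_k below the smallest Y: skip this record
        m += 1                                # M = M + 1
        heapq.heapreplace(table, (u, m))      # replace smallest Y, set I = M
        reservoir.append(record)              # store the record at index M
    # The surviving pairs point at the sampled records in the reservoir.
    return [reservoir[i - 1] for _, i in table if i > 0]


if __name__ == "__main__":
    print(sorted(reservoir_sample(range(10_000), 10, seed=7)))
```

In this sketch the reservoir also accumulates records whose table entries were later displaced; only the entries still referenced by the table at the end form the sample. Keeping the S largest random numbers seen is what makes every record equally likely to be selected.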

11. A method for database partition boundary determination comprising:
- providing a pre-configured number S defining a default sample size;
- selectively receiving by the database analysis program a particular number defining a desired sample size and setting said number S equal to said particular number;
- providing a seed value for initializing a random number algorithm;
- randomly sampling S records of the database using the random sampling algorithm, wherein said S records are different each time said method is utilized with different seed values, and wherein said S records are different for successive utilizations of said method if at least one record has been added to or deleted from said database between successive utilizations of said method, wherein said randomly sampling S records further comprises;
  - generating a table of S number pairs (Y_j, I_j), j = 1, 2, ..., S, wherein all Y and all I are initially set to zero;
  - generating a sequence of N non-repeating random numbers U_1, U_2, ..., U_N, 0 ≤ U ≤ 1, wherein N is the number of records in the database; and
  - performing additional steps for each random number U_i generated, i = 1, 2, ..., N, including;
    - ignoring U_i if U_i is less than the smallest value of Y in said table of number pairs; and
    - updating the table if a Y less than U_i exists by performing further steps including;
      - replacing the smallest Y in the table with U_i;
      - setting the I value paired with the smallest Y equal to i; and
  - reading S records from the database corresponding to I_j, j = 1, 2, ..., S, wherein I_j is an index to a record in the database;
- storing statistics for each of said S records as stored statistics including a record key for each record; and
- producing an approximation partition analysis based on said stored statistics, wherein said approximation partition analysis is not mathematically exact.
- Dependent claims (12):
  - 12. The method as set forth in claim 11, wherein said updating the table further includes arranging the table in a heap with respect to Y.
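
Claims 11 and 12 differ from claims 9 and 10 in that only record indices are kept while the random numbers are generated, and the S selected records are read afterwards. A minimal sketch of that variant follows, assuming direct access to records by a 1-based index; `sample_record_indices`, `read_sampled_records` and the `read_record` callback are hypothetical names, not part of the patent.

```python
import heapq
import random


def sample_record_indices(n, s, seed):
    """Choose up to `s` distinct record indices from 1..n without reading records.

    The table of (Y, I) pairs is a min-heap on Y; I holds the index i of the
    record whose random number U_i currently occupies that slot.
    """
    if s <= 0:
        raise ValueError("s must be positive")
    rng = random.Random(seed)
    table = [(0.0, 0)] * s   # all Y and all I start at zero
    for i in range(1, n + 1):
        u = rng.random()     # U_i
        if u < table[0][0]:
            continue         # ignore U_i: it is below the smallest Y
        heapq.heapreplace(table, (u, i))  # replace smallest Y, set I = i
    # Sorting the indices lets the later reads proceed in storage order.
    return sorted(i for _, i in table if i > 0)


def read_sampled_records(read_record, indices):
    """Read only the selected records; `read_record(i)` is a caller-supplied
    function returning the record at 1-based index i."""
    return [read_record(i) for i in indices]


if __name__ == "__main__":
    data = [f"key-{k:05d}" for k in range(1, 1001)]
    picked = sample_record_indices(len(data), 5, seed=3)
    print(read_sampled_records(lambda i: data[i - 1], picked))
```

This variant needs N to be known up front and the database to support reads by position, but it touches only S records rather than scanning all N.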

13. A database partition boundary determination system comprising:
- a first computer program routine having a random number generating algorithm;
- a second computer program routine having a random sampling facility utilizing said first program routine to randomly read records from a database and store statistics for each read record including a record key, wherein said read records are different each time said second routine is utilized with different seed values, and wherein said read records are different for successive utilizations of said second routine if at least one record has been added to or deleted from said database between successive utilizations of said second routine; and
- a third computer program routine for generating a partition boundary analysis based on said stored statistics, wherein said partition boundary analysis is an approximation and is not mathematically exact.
- Dependent claims (14, 15, 16, 17, 18, 19):
  - 14. The system of claim 13, further comprising a fourth computer program routine for sorting said stored statistics by key prior to producing said partition analysis.
  - 15. The system of claim 13, further including a memory for storing said statistics.
  - 16. The system of claim 15, further comprising a fourth computer program routine for sorting said stored statistics by key prior to producing said partition analysis.
  - 17. The system of claim 16, wherein said partition analysis includes means for performing an analysis of multiple partition boundaries.
  - 18. The system of claim 13, further comprising:
    - means for accessing all database records in an arbitrary sequence;
    - means for iteratively filling all of said partitions except the last with said accessed records to a maximum byte count; and
    - means for storing remaining accessed records in the last of said partitions.
  - 19. The system of claim 13, further comprising:
    - means for utilizing at least one index dataspace;
    - means for utilizing at least one key dataspace; and
    - means for utilizing at least one statistics dataspace.
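
The partition-filling steps recited in claims 7 and 18 (fill every partition except the last up to a maximum byte count, then place the remainder in the last partition) might look like the sketch below. The claims do not say how a record that would overflow a partition is handled; moving it to the next partition is an assumption made here, and `fill_partitions` is a hypothetical name.

```python
def fill_partitions(record_sizes, num_partitions, max_bytes):
    """Assign records, in the order given, to partitions.

    record_sizes   -- byte size of each record, in access order
    num_partitions -- number of partitions (assumed >= 1)
    max_bytes      -- byte limit for every partition except the last
    Returns a list of partitions, each a list of record positions.
    """
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    partitions = [[] for _ in range(num_partitions)]
    last = num_partitions - 1
    current = 0   # partition currently being filled
    used = 0      # bytes already placed in the current partition
    for position, size in enumerate(record_sizes):
        # Assumption: a record that would exceed the limit starts the next
        # partition, unless the current one is empty or is already the last.
        if current < last and used > 0 and used + size > max_bytes:
            current += 1
            used = 0
        partitions[current].append(position)
        used += size
    return partitions


if __name__ == "__main__":
    # Seven 40-byte records, three partitions, 100-byte limit.
    print(fill_partitions([40] * 7, 3, 100))  # → [[0, 1], [2, 3], [4, 5, 6]]
```

Because the last partition absorbs whatever is left, its size is unbounded in this sketch; choosing the number of partitions and the byte limit so that the remainder stays small is left to the caller.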

20. A database partition boundary determination system comprising:
- a first computer program routine having a random number generating algorithm;
- a second computer program routine having a random sampling facility utilizing said first program routine to randomly read records from a database and store statistics for each read record including a record key, wherein said read records are different each time said second routine is utilized with different seed values, and wherein said read records are different for successive utilizations of said second routine if at least one record has been added to or deleted from said database between successive utilizations of said second routine, wherein said random sampling facility further comprises;
  - means for generating a table of S number pairs (Y_j, I_j), j = 1, 2, ..., S, wherein all Y and all I are initially zero;
  - means for initializing a reservoir of records to an empty state;
  - means for setting an index M to said reservoir equal to zero;
  - means for generating a sequence of N non-repeating random numbers U_1, U_2, ..., U_N, 0 ≤ U ≤ 1, wherein N is the number of records in the database; and
  - means, for each random number U_k generated, k = 1, 2, ..., N, comprising;
    - means to skip the next record in said database if U_k is less than the smallest value of Y in said table of number pairs; and
    - means to update the table if a Y less than U_k exists, comprising;
      - a means to set M equal to its current value plus one;
      - means to replace the smallest Y in the table with U_k;
      - means to set the I value paired with the smallest Y equal to M; and
      - means to store all or part of the next record of said database in said reservoir of stored records, wherein the current value of M is a reservoir index to said stored record; and
- a third computer program routine for generating a partition boundary analysis based on said stored statistics, wherein said partition boundary analysis is an approximation and is not mathematically exact.
- Dependent claims (21):
  - 21. The system of claim 20 wherein the means to update the table further comprises means to arrange the table in a heap with respect to Y.

22. A database partition boundary determination system comprising:
- a first computer program routine having a random number generating algorithm;
- a second computer program routine having a random sampling facility utilizing said first program routine to randomly read records from a database and store statistics for each read record including a record key, wherein said read records are different each time said second routine is utilized with different seed values, and wherein said read records are different for successive utilizations of said second routine if at least one record has been added to or deleted from said database between successive utilizations of said second routine, wherein said random sampling facility further comprises;
  - means for generating a table of S number pairs (Y_j, I_j), j = 1, 2, ..., S, wherein all Y and all I are initially zero;
  - means for generating a sequence of N non-repeating random numbers U_1, U_2, ..., U_N, 0 ≤ U ≤ 1, wherein N is the number of records in the database;
  - means, for each random number U_i generated, i = 1, 2, ..., N, comprising;
    - means to ignore U_i if U_i is less than the smallest value of Y in said table of number pairs; and
    - means to update the table if a Y less than U_i exists, comprising;
      - means to replace the smallest Y in the table with U_i;
      - means to set the I value paired with the smallest Y equal to i; and
  - means for reading S records from the database corresponding to I_j, j = 1, 2, ..., S, wherein I_j is an index to a record in the database; and
- a third computer program routine for generating a partition boundary analysis based on said stored statistics, wherein said partition boundary analysis is an approximation and is not mathematically exact.
- Dependent claims (23):
  - 23. The system of claim 22 wherein the means to update the table further comprises a means to arrange the table in a heap with respect to Y.

Current Assignee: International Business Machines Corporation
Original Assignee: International Business Machines Corporation
Inventors: Harper, John William; Slishman, Gordon Robert
Primary Examiner(s): Alam, Shahid
Assistant Examiner(s): Truong, Cam Y
Application Number: US09/897,853
Publication Number: US 20030004944A1
Time in Patent Office: 1,737 Days
Field of Search: 707/2, 707/4, 707/7, 707/100, 707/1, 707/3, 707/200, 711/173
US Class Current: 1/1
CPC Class Codes:
G06F 16/284   Relational databases
Y10S 707/99932   Access augmentation or opti...
Y10S 707/99934   Query formulation, input pr...
Y10S 707/99937   Sorting
