Random sampling as a built-in function for database administration and replication

US 7,028,054 B2
Filed: 07/02/2001
Issued: 04/11/2006
Est. Priority Date: 07/02/2001
Status: Expired due to Fees

First Claim

Patent Images

1. A computer implemented method for administration and replication of a database, comprising:

providing a database management system with a built-in random sampling facility configured as an integral part of said database management system, whereby the random sampling facility has access to low level functions and buffers of the database management system, executing said random sampling facility from within the database management to perform a replication operation on said database;

defining a database record sample size S wherein S is an integer value greater than one;

randomly sampling S records of the database using said random sampling facility;

storing statistics for each of said S records, wherein said statistics include a record key for each record; and

, producing an extrapolated replication partition analysis based on said statistics, wherein said replication operation is based on said extrapolated replication partition analysis.

View all claims

1 Assignment

Timeline View

Assignment View

0 Petitions

Accused Products

Abstract

A database management system and method for administration and replication having a built-in random sampling facility for approximation partition analysis on very large databases. The method utilizes a random sampling algorithm that provides results accurate to within a few percentage points for large homogeneous databases. The accuracy is not affected by the size of the database and is determined primarily by the size of the sample. The system and method for approximate partition analysis reduces the time required for an analysis to a fraction of the time required for an exact analysis. The database management system is configured with the random sampling facility built-in thereby enabling even greater efficiency by reducing communication overhead between an analysis program and the database management system to a fraction of the overhead required when sampling is performed by a separate analysis program. The reduction in time thereby permits frequent and timely analyses for replication and administration of database partitions.

Citations

22 Claims

1. A computer implemented method for administration and replication of a database, comprising:
- providing a database management system with a built-in random sampling facility configured as an integral part of said database management system, whereby the random sampling facility has access to low level functions and buffers of the database management system, executing said random sampling facility from within the database management to perform a replication operation on said database;
  
  defining a database record sample size S wherein S is an integer value greater than one;
  
  randomly sampling S records of the database using said random sampling facility;
  
  storing statistics for each of said S records, wherein said statistics include a record key for each record; and
  
  , producing an extrapolated replication partition analysis based on said statistics, wherein said replication operation is based on said extrapolated replication partition analysis.
- View Dependent Claims (2, 3, 4)
- - 2. The method as set forth in claim 1, wherein said defining said sample size S includes:
    - defining a default sample size;
      
      selectively receivIng a desired sample size; and
      
      , setting said sample size S as said default sample size when the desired sample size is not selectively received, and setting said sample size S as said desired sample size when the desired sample size is selectively received.
  - 3. The method as set forth in claim 1, further comprising:
    - producing a partial replication partition analysis based on said statistics.
  - 4. The method as set forth in claim 3, wherein said defining said sample size S includes:
    - defining a default sample size;
      
      selectively receMng a desired sample size; and
      
      , setting said sample size S as said default sample size when the desired sample size is not selectively received, and setting said sample size S as said desired sample size when the desired sample size is selectively received.

5. A computer implemented method for database administration and replication, comprising:
- providing a database management system with an integrated a random sampling facility as an integral part of the database management system, the random sampling facility having access to low level functions and buffers;
  
  selecting a default sample size value S, wherein S is an integer value greater than one;
  
  selectively receiving a desired sample size value D, wherein D is an integer value greater than zero, and setting said default sample size value S to said desired sample size value D when said desired sample size value D is received;
  
  randomly sampling S records of the database using said random sampling facility;
  
  storing statistics for each of said S records, wherein said statistics include a record key for each record;
  
  producing at least one of an extrapolated replication partition analysis based on said statistics and a partial replication partition analysis based on said statistics; and
  
  , performing a replication operation on said database.
- View Dependent Claims (6, 7, 8, 9, 10, 11)
- - 6. The method as set forth in claim 5, further comprising:
    - sorting said stared statistics by key prior to producing said at least one of said extrapolated replication partition analysis and said partial replication partition analysis partition analysis.
  - 7. The method as set forth in claim 6, further comprising:
    - accessing all database records in an arbitrary sequence;
      
      iteratively filling all partitions except a last partition with said accessed records to a maximum byte count; and
      
      , storing remaining accessed records in the last partition.
  - 8. The method as set forth in claim 5, wherein said storing statistics Includes storing said statistics in a memory.
  - 9. The method as set forth in claim 8, wherein said storing statistics includes storing said statistics in said memory in a compressed format.
  - 10. The method as set forth in claim 5, wherein said producing at least one of said partition analyses includes the step of defining multiple partition boundaries.
  - 11. The method as set forth in claim 5, wherein said sampling said S records includes randomly sampling the S records utilizing dataspaces including:
    - at least one index dataspace;
      
      at least one key dataspece; and
      
      , at least one statistics dataspace.

12. A computer implemented method for database administration and replication, comprising:
- providing a database management system with an integrated a random sampling facility as an integral part of the database management system, the random sampling facility having access to low level functions and buffers;
  
  selecting a default sample size value S, wherein S is an integer value greater than one and represents a value of said selected default sample size;
  
  selectively receiving a desired sample size value D, wherein D is an integer value greater than zero, and setting said defauk sample size value S to said desired sample size value D when said desired sample size value D is received;
  
  randomly sampling S records of the database using said random sampling facility;
  
  storing statistics for each of said S records, wherein said statistics include a record key for each record;
  
  producing at least one of an extrapolated replication partition analysis based on said statistics and a partial replication partition analysis based on said statistics; and
  
  , performing a replication operation on said database, wherein the selecting said default sample size value D further includes;
  
  generating a table of S number pairs (Y_j,I_j), j=1, 2, . . . , S, wherein all Y and all I are initially set to zero;
  
  initializing a reservoir of records to an empty state;
  
  setting an index M to said reservoir and initializing the index M to zero;
  
  generating a sequence of N non-repeating random numbers U₁, U₂, . . . , U_N, 0<
  
  U<
  
  1, wherein N is the number of records in the database; and
  
  , performing additional steps for each random number U_kgenerated, k=1,2, . . . , N, the additional steps including;
  
  skipping the next record in the database if U_kis less than the smallest value of Y in said table of S number pairs; and
  
  , updating the table of S number pairs if a Y less than U_kexists by performing further steps including;
  
  setting said index M equal to its current value plus one;
  
  replacing the smallest Y in the table of S number pairs with U_k;
  
  setting the I value paired with the smallest Y equal to M; and
  
  , storing all or part of the next record of the database in said reservoir of stored records, wherein the current value of M is a reservoir index to said stored record.
- View Dependent Claims (13)
- - 13. The method as set forth in claim 12, wherein the step of updating the table of S number pairs further includes:
    - arranging the table in a heap with respect to Y.

14. A database management system (DBMS) for managing an associated database, the DBMS comprising:
- random sampling facility configured as part of the database management system and having access to low level functions and buffers of the database management system;
  
  first database analysis tools using said integrated random sampling facility for generating extrapolated reports on database content;
  
  second database analysis tools using said integrated random sampling facility for generating extrapolated reports on database size;
  
  database replication tools adapted to execute at least one of a complete replication having output partition sizes determined by extrapolating a random sample of said database, and a partial replication in which the data stored in the partial replication comprises a random sample of said database;
  
  a pre-configured number S defining a default sample size, wherein S is an integer value greater than one;
  
  a means for selectively receiving a particular number defining a desired sample size and setting said number S equal to said particular number;
  
  a means for randomly sampling S records of the database using said random sampling facility;
  
  a means for storing statistics for each of said S records, wherein said statistics include a record key for each record; and
  
  , a means for producing at least one of;
  
  an extrapolated database content analysis based on said statistics;
  
  an extrapolated partition analysis based on said statistics; and
  
  , a partial partition analysis based on said statistics.
- View Dependent Claims (15, 16)
- - 15. A database management system of claim 14, further comprising:
    - a means for sorting said stored statistics by key prior to producing at least one of said analyses.
  - 16. The database management system of claim 14, further comprising:
    - a means for utilizing at least one index dataspace;
      
      a means for utilizing at least one key dataspace; and
      
      , a means for utilizing at least one statistics dataspace.

17. A database management system (DBMS) for managing an associated database, the DBMS comprising:
- random sampling facility configured as part of the database management system and having access to low level functions and buffers of the database management system;
  
  first database analysis tools using said integrated random sampling facility for generating extrapolated reports on database content;
  
  second database analysis tools using said integrated random sampling facility for generating extrapolated reports on database size;
  
  database replication tools adapted to execute at least one of a complete replication having output partition sizes determined by extrapolating a random sample of said database, and a partial replication in which the data stored in the partial replication comprises a random sample of said database;
  
  a pre-configured number S defining a default sample size, wherein S is an integer value greater than one;
  
  a means for selectively receiving a particular number defining a desired sample size and setting said number S equal to said particular number;
  
  a means for randomly sampling S records of the database using said random sampling facility;
  
  a means for storing statistics for each of said S records, wherein said statistics include a record key for each record; and
  
  , a means for producing at least one of an extrapolated database content analysis based on said statistics;
  
  an extrapolated partition analysis based on said statistics; and
  
  , a partial partition analysis based on said statistics, wherein said means for randomly sampling S records further comprises;
  
  a means for generating a table of S number pairs (Y_j,I_j), j=1,2, . . . , S, wherein all Y and all I are initially zero;
  
  a means for initializing a reservoir of records to an empty state;
  
  a means for setting an index M to said reservoir and initializing the index M to zero;
  
  a means for generating a sequence of N non-repeating random numbers U₁, U₂, . . . , U_N, O<
  
  U<
  
  1, wherein N is the number of records in the database; and
  
  , a means, for each random number U_kgenerated, k=1,2, . . . , N;
  
  comprising;
  
  a means to skip the next record in said database if U_kis less than the smallest value of Y in said table of number pairs; and
  
  , a means to update the table if a Y less than U_kexists, comprising;
  
  a means to set said index M equal to its current value plus one;
  
  a means to replace the smallest Y in the table with U_k;
  
  a means to set the I value paired with the smallest Y equal to M; and
  
  , a means to store all or part of the next record of said database in said reservoir of stored records, wherein the current value of M is a reservoir index to said stored record.
- View Dependent Claims (18, 19, 20, 21, 22)
- - 18. The database management system of claim 17 wherein the means to update the table further comprises:
    - a means to arrange the table in a heap with respect to Y.
  - 19. The database management system of claim 17, wherein said means for storing statistics comprises a means for storing said statistics in memory.
  - 20. The database management system of claim 19, further comprising;
    - a means for sorting said stored statistics by key prior to producing at least one of said analyses.
  - 21. The database management system of claim 20, wherein said partition analyses include analyses of multiple partition boundaries.
  - 22. The database management system of claim 21, further comprising:
    - a means for accessing all database records in an arbitrary sequence;
      
      a means for iteratively filling all of said partitions except the last with said accessed records to a maximum byte count; and
      
      , a means for storing remaining accessed records in the last of said partitions.

Specification

Resources

Litigation Campaign Assessment

Current Assignee
International Business Machines Corporation
Original Assignee
International Business Machines Corporation
Inventors
Harper, John William, Slishman, Gordon Robert
Primary Examiner(s)
Corrielus, Jean M.
Assistant Examiner(s)
Chen, Chongshan

Application Number

US09/897,803
Publication Number

US 20030004973A1
Time in Patent Office

1,744 Days
Field of Search

707/200, 707/205, 707/6, 707/7, 707/1, 707/2, 707/3, 707/102, 711/100, 711/173
US Class Current

1/1
CPC Class Codes

G06F 16/284 Relational databases

Y10S 707/99953 Recoverability

Random sampling as a built-in function for database administration and replication

First Claim

1 Assignment

0 Petitions

Accused Products

Abstract

Citations

22 Claims

Specification

Solutions

Use Cases

Quick Links

Random sampling as a built-in function for database administration and replication

First Claim

1 Assignment

Subscription Required

Subscription Required

0 Petitions

Subscription Required

Accused Products

Subscription Required

Abstract

Citations

22 Claims

Specification

Subscription Required

Solutions

Use Cases

Quick Links