Random sampling as a built-in function for database administration and replication
First Claim
1. A computer implemented method for administration and replication of a database, comprising:
- providing a database management system with a built-in random sampling facility configured as an integral part of said database management system, whereby the random sampling facility has access to low level functions and buffers of the database management system, executing said random sampling facility from within the database management to perform a replication operation on said database;
defining a database record sample size S wherein S is an integer value greater than one;
randomly sampling S records of the database using said random sampling facility;
storing statistics for each of said S records, wherein said statistics include a record key for each record; and
, producing an extrapolated replication partition analysis based on said statistics, wherein said replication operation is based on said extrapolated replication partition analysis.
1 Assignment
0 Petitions
Accused Products
Abstract
A database management system and method for administration and replication having a built-in random sampling facility for approximation partition analysis on very large databases. The method utilizes a random sampling algorithm that provides results accurate to within a few percentage points for large homogeneous databases. The accuracy is not affected by the size of the database and is determined primarily by the size of the sample. The system and method for approximate partition analysis reduces the time required for an analysis to a fraction of the time required for an exact analysis. The database management system is configured with the random sampling facility built-in thereby enabling even greater efficiency by reducing communication overhead between an analysis program and the database management system to a fraction of the overhead required when sampling is performed by a separate analysis program. The reduction in time thereby permits frequent and timely analyses for replication and administration of database partitions.
-
Citations
22 Claims
-
1. A computer implemented method for administration and replication of a database, comprising:
-
providing a database management system with a built-in random sampling facility configured as an integral part of said database management system, whereby the random sampling facility has access to low level functions and buffers of the database management system, executing said random sampling facility from within the database management to perform a replication operation on said database;
defining a database record sample size S wherein S is an integer value greater than one;
randomly sampling S records of the database using said random sampling facility;
storing statistics for each of said S records, wherein said statistics include a record key for each record; and
,producing an extrapolated replication partition analysis based on said statistics, wherein said replication operation is based on said extrapolated replication partition analysis. - View Dependent Claims (2, 3, 4)
-
-
5. A computer implemented method for database administration and replication, comprising:
-
providing a database management system with an integrated a random sampling facility as an integral part of the database management system, the random sampling facility having access to low level functions and buffers;
selecting a default sample size value S, wherein S is an integer value greater than one;
selectively receiving a desired sample size value D, wherein D is an integer value greater than zero, and setting said default sample size value S to said desired sample size value D when said desired sample size value D is received;
randomly sampling S records of the database using said random sampling facility;
storing statistics for each of said S records, wherein said statistics include a record key for each record;
producing at least one of an extrapolated replication partition analysis based on said statistics and a partial replication partition analysis based on said statistics; and
,performing a replication operation on said database. - View Dependent Claims (6, 7, 8, 9, 10, 11)
-
-
12. A computer implemented method for database administration and replication, comprising:
-
providing a database management system with an integrated a random sampling facility as an integral part of the database management system, the random sampling facility having access to low level functions and buffers;
selecting a default sample size value S, wherein S is an integer value greater than one and represents a value of said selected default sample size;
selectively receiving a desired sample size value D, wherein D is an integer value greater than zero, and setting said defauk sample size value S to said desired sample size value D when said desired sample size value D is received;
randomly sampling S records of the database using said random sampling facility;
storing statistics for each of said S records, wherein said statistics include a record key for each record;
producing at least one of an extrapolated replication partition analysis based on said statistics and a partial replication partition analysis based on said statistics; and
,performing a replication operation on said database, wherein the selecting said default sample size value D further includes;
generating a table of S number pairs (Yj,Ij), j=1, 2, . . . , S, wherein all Y and all I are initially set to zero;
initializing a reservoir of records to an empty state;
setting an index M to said reservoir and initializing the index M to zero;
generating a sequence of N non-repeating random numbers U1, U2, . . . , UN, 0<
U<
1, wherein N is the number of records in the database; and
,performing additional steps for each random number Uk generated, k=1,2, . . . , N, the additional steps including;
skipping the next record in the database if Uk is less than the smallest value of Y in said table of S number pairs; and
,updating the table of S number pairs if a Y less than Uk exists by performing further steps including;
setting said index M equal to its current value plus one;
replacing the smallest Y in the table of S number pairs with Uk;
setting the I value paired with the smallest Y equal to M; and
,storing all or part of the next record of the database in said reservoir of stored records, wherein the current value of M is a reservoir index to said stored record. - View Dependent Claims (13)
-
-
14. A database management system (DBMS) for managing an associated database, the DBMS comprising:
-
random sampling facility configured as part of the database management system and having access to low level functions and buffers of the database management system;
first database analysis tools using said integrated random sampling facility for generating extrapolated reports on database content;
second database analysis tools using said integrated random sampling facility for generating extrapolated reports on database size;
database replication tools adapted to execute at least one of a complete replication having output partition sizes determined by extrapolating a random sample of said database, and a partial replication in which the data stored in the partial replication comprises a random sample of said database;
a pre-configured number S defining a default sample size, wherein S is an integer value greater than one;
a means for selectively receiving a particular number defining a desired sample size and setting said number S equal to said particular number;
a means for randomly sampling S records of the database using said random sampling facility;
a means for storing statistics for each of said S records, wherein said statistics include a record key for each record; and
,a means for producing at least one of;
an extrapolated database content analysis based on said statistics;
an extrapolated partition analysis based on said statistics; and
,a partial partition analysis based on said statistics. - View Dependent Claims (15, 16)
-
-
17. A database management system (DBMS) for managing an associated database, the DBMS comprising:
-
random sampling facility configured as part of the database management system and having access to low level functions and buffers of the database management system;
first database analysis tools using said integrated random sampling facility for generating extrapolated reports on database content;
second database analysis tools using said integrated random sampling facility for generating extrapolated reports on database size;
database replication tools adapted to execute at least one of a complete replication having output partition sizes determined by extrapolating a random sample of said database, and a partial replication in which the data stored in the partial replication comprises a random sample of said database;
a pre-configured number S defining a default sample size, wherein S is an integer value greater than one;
a means for selectively receiving a particular number defining a desired sample size and setting said number S equal to said particular number;
a means for randomly sampling S records of the database using said random sampling facility;
a means for storing statistics for each of said S records, wherein said statistics include a record key for each record; and
,a means for producing at least one of an extrapolated database content analysis based on said statistics;
an extrapolated partition analysis based on said statistics; and
,a partial partition analysis based on said statistics, wherein said means for randomly sampling S records further comprises;
a means for generating a table of S number pairs (Yj,Ij), j=1,2, . . . , S, wherein all Y and all I are initially zero;
a means for initializing a reservoir of records to an empty state;
a means for setting an index M to said reservoir and initializing the index M to zero;
a means for generating a sequence of N non-repeating random numbers U1, U2, . . . , UN, O<
U<
1, wherein N is the number of records in the database; and
,a means, for each random number Uk generated, k=1,2, . . . , N;
comprising;
a means to skip the next record in said database if Uk is less than the smallest value of Y in said table of number pairs; and
,a means to update the table if a Y less than Uk exists, comprising;
a means to set said index M equal to its current value plus one;
a means to replace the smallest Y in the table with Uk;
a means to set the I value paired with the smallest Y equal to M; and
,a means to store all or part of the next record of said database in said reservoir of stored records, wherein the current value of M is a reservoir index to said stored record. - View Dependent Claims (18, 19, 20, 21, 22)
-
Specification