Determination of sampling characteristics based on available memory
Abstract
A portion of data records of a full input data set are imported into memory of a computer system for processing by an executing application. The full input data set includes data records of a dimensionally-modeled fact collection. An amount of the data of the full input set to import is determined based on an amount of available memory of the computer system. The sampling characteristics for sampling the full input data set are determined based on the amount of the data that can be imported and on characteristics of the full input data set and application involved. The full input data set is then sampled and a portion of the records are imported into the memory of the computer system for processing. The sampling characteristics are determined such that analysis as a result of processing by the executing application of the sampled portion of the records imported is representative of the analysis that could otherwise be carried out on the full input data set, with a calculable statistical relevance.
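The sizing step described in the abstract can be pictured with the minimal sketch below. It is an illustration only, not the patented method: the function names, the `headroom` parameter, and the use of an average record size and a total record count are all assumptions introduced here.

```python
def records_that_fit(available_bytes: int, avg_record_bytes: int,
                     headroom: float = 0.5) -> int:
    """Estimate how many records fit in memory.

    `headroom` (hypothetical) is the share of available memory the
    import may use; the rest is left for the executing application.
    """
    if avg_record_bytes <= 0:
        raise ValueError("avg_record_bytes must be positive")
    usable_bytes = int(available_bytes * headroom)
    return usable_bytes // avg_record_bytes


def sampling_fraction(total_records: int, available_bytes: int,
                      avg_record_bytes: int, headroom: float = 0.5) -> float:
    """Fraction of the full input data set to sample, capped at 1.0."""
    if total_records <= 0:
        return 1.0
    fit = records_that_fit(available_bytes, avg_record_bytes, headroom)
    return min(1.0, fit / total_records)
```

For example, under these assumptions a data set of 1,000,000 records averaging 200 bytes each, imported into 100,000,000 bytes of available memory with half reserved as headroom, gives a sampling fraction of 0.25.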
17 Claims
1. A computer-implemented method comprising:
- importing a portion of a full input data set into memory accessible by one or more computing devices for processing by an executing application, wherein the full input data set comprises a plurality of data records representing a dimensionally-modeled fact collection and wherein each record comprises a plurality of characteristics of an entity, the importing of the portion of the full input data set comprising:
- determining an amount of the data of the full input data set to import based on an amount of available memory accessible to the one or more computing devices;
- based on the determined amount of the data to import and on characteristics of the full input data set other than the total size of the full input data set, determining sampling characteristics for sampling the full input data set to sample less than every record of the full input set, wherein the determined sampling characteristics are determined in part based on an indication of how many of the records have one or more particular values for one or more particular entity characteristics such that if a determined statistical significance achievable using the determined sampling characteristics does not meet a user-provided desired statistical significance for records with the particular values for the particular entity characteristics, the determined sampling characteristics are adjusted to meet the user-provided desired statistical significance for such records; and
- causing the portion of the full input data set to be imported into the memory of the computer system, including sampling the full input data set, to determine the portion of the records to import, in accordance with the determined sampling characteristics,
- wherein the adjusted sampling characteristics are such that analysis as a result of processing by the executing application of the sampled portion of the full input data set is representative of the analysis that could otherwise be carried out on the full input data set, with a calculable statistical relevance,
- wherein sampling the full input data set using the adjusted sampling characteristics is performed in a deterministic manner such that records of the full input data set having the particular values for the particular entity characteristics are included in the sampled portion of the full input data set in an amount sufficient to meet the user-provided desired statistical significance for such records, and
- wherein sampling the full input data set to determine the portion of the records to import includes applying a hash algorithm for sampling based on a key that is the value for each record at a particular dimension, the dimension in question being one that identifies a particular entity or user, wherein remaining information in each data record includes information regarding behavior or traits of the particular entity or user identified by the key for that record.
Dependent claims: 2, 3, 4, 5
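The hash-based, deterministic sampling recited in the final clause of claim 1 can be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: SHA-256 is a stand-in for the unspecified hash algorithm, and `keep_key`, `sample_records`, and the `user_id` field are hypothetical names.

```python
import hashlib


def keep_key(key: str, fraction: float) -> bool:
    """Deterministically decide whether an entity key is in the sample.

    The key is hashed to a 64-bit bucket; keys whose bucket falls below
    fraction * 2**64 are kept. The same key always gets the same answer.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")
    return bucket < int(fraction * 2**64)


def sample_records(records, key_field: str, fraction: float):
    """Keep every record whose entity key is selected by keep_key."""
    return [r for r in records if keep_key(str(r[key_field]), fraction)]


if __name__ == "__main__":
    # Ten hypothetical users with ten behavior records each.
    records = [{"user_id": f"u{i % 10}", "event": i} for i in range(100)]
    sample = sample_records(records, "user_id", 0.3)
    print(sorted({r["user_id"] for r in sample}))
```

Because the decision depends only on the key, all records for a selected entity are imported together, so the behavior or traits recorded for that entity stay complete in the sample, and repeated runs select the same entities.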
6. A computer-implemented method comprising:
- importing a portion of a full input data set into memory accessible by one or more computing devices for processing by an executing application, wherein the full input data set comprises a plurality of data records representing a dimensionally-modeled fact collection and wherein each record comprises a plurality of characteristics of an entity, the importing of the portion of the full input data set comprising:
- determining a nominal amount of the data of the full input set to import based on an amount of available memory accessible to the one or more computing devices;
- based on the determined amount of the data to import and on characteristics of the full input data set other than the total size of the full input data set, determining sampling characteristics for sampling the full input data set;
- determining a characteristic associated with the determined amount of data, the characteristic relating to how many of the records have one or more particular values for one or more particular entity characteristics;
- causing the computing device to adjust the determined sampling characteristics when a user-provided desired statistical significance for records having the particular values for the particular entity characteristics is not met by the determined characteristic associated with the determined amount of data; and
- causing the portion of the full input data set to be imported into the memory of the computer system, including sampling the full input data set, to determine the portion of the records to import, in accordance with the adjusted sampling characteristics,
- wherein sampling the full input data set using the adjusted sampling characteristics is performed in a deterministic manner such that records of the full input data set having the particular values for the particular entity characteristics are included in the sampled portion of the full input data set in an amount sufficient to meet the user-provided desired statistical significance for such records, and
- wherein sampling the full input data set to determine the portion of the records to import includes applying a hash algorithm for sampling based on a key that is the value for each record at a particular dimension, the dimension in question being one that identifies a particular entity or user, wherein remaining information in each data record includes information regarding behavior or traits of the particular entity or user identified by the key for that record.
Dependent claims: 7, 8, 9
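The adjustment step in claim 6 can be illustrated with the sketch below. The claim does not say how statistical significance is measured, so this example assumes it is expressed as a maximum margin of error on an estimated proportion at roughly 95% confidence, and it uses the standard normal-approximation sample-size formula. All function and parameter names are hypothetical.

```python
import math


def required_sample_size(margin_of_error: float, z: float = 1.96,
                         p: float = 0.5) -> int:
    """Records needed to estimate a proportion within margin_of_error.

    Normal-approximation formula n = z^2 * p * (1 - p) / e^2, with the
    conservative choice p = 0.5 and z = 1.96 (about 95% confidence).
    """
    return math.ceil(z * z * p * (1 - p) / margin_of_error ** 2)


def adjusted_fraction(base_fraction: float, subgroup_count: int,
                      margin_of_error: float) -> float:
    """Raise the sampling fraction for a subgroup if it would be too thin.

    subgroup_count is how many records have the particular values for
    the particular entity characteristics.
    """
    if subgroup_count <= 0:
        return base_fraction
    needed = required_sample_size(margin_of_error)
    expected = base_fraction * subgroup_count
    if expected >= needed:
        return base_fraction  # the memory-based fraction already suffices
    return min(1.0, needed / subgroup_count)
```

Under these assumptions, a 5% margin of error needs 385 records. A subgroup of 10,000 records sampled at a memory-derived fraction of 0.01 would yield only about 100, so the fraction for that subgroup is raised to 0.0385. A subgroup of 200 records cannot reach 385, so the sketch returns 1.0 and every such record is kept.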
10. A computing device comprising processing circuitry and memory circuitry, wherein the computing device is configured to import an amount of data of a full input data set that is a portion of the full input data set, wherein the full input data set comprises a plurality of data records representing a dimensionally-modeled fact collection and each record comprises a plurality of characteristics of an entity, wherein the portion of the full input data set to import has been determined by:
- determining an amount of the data of the full input set to import based on an amount of available memory accessible to the one or more computing devices;
- based on the determined amount of the data to import and on characteristics of the full input data set other than the total size of the full input data set, determining sampling characteristics for sampling the full input data set, wherein the determined sampling characteristics are determined in part based on an indication of how many of the records have one or more particular values for one or more particular entity characteristics such that if a determined statistical significance achievable using the determined sampling characteristics does not meet a user-provided desired statistical significance for records with the particular values for the particular entity characteristics, the determined sampling characteristics are adjusted to meet the user-provided desired statistical significance for such records; and
- sampling the full input data set, to determine the portion of the records to import, in accordance with the determined sampling characteristics,
- wherein the adjusted sampling characteristics are such that analysis as a result of processing by the executing application of the sampled portion of the full input data set is representative of the analysis that could otherwise be carried out on the full input data set, with a calculable statistical relevance,
- wherein sampling the full input data set using the adjusted sampling characteristics is performed in a deterministic manner such that records of the full input data set having the particular values for the particular entity characteristics are included in the sampled portion of the full input data set in an amount sufficient to meet the user-provided desired statistical significance for such records, and
- wherein sampling the full input data set to determine the portion of the records to import includes applying a hash algorithm for sampling based on a key that is the value for each record at a particular dimension, the dimension in question being one that identifies a particular entity or user, wherein remaining information in each data record includes information regarding behavior or traits of the particular entity or user identified by the key for that record.
Dependent claims: 11, 12
13. A computer program product comprising at least one computer-readable storage medium having computer program instructions stored therein which are operable to cause at least one computing device to:
- import a portion of a full input data set into memory accessible by one or more computing devices for processing by an executing application, wherein the full input data set comprises a plurality of data records representing a dimensionally-modeled fact collection and wherein each record comprises a plurality of characteristics of an entity, the importing of the portion of the full input data set comprising:
- determine an amount of the data of the full input set to import based on an amount of available memory accessible to the one or more computing devices;
- based on the determined amount of the data to import and on characteristics of the full input data set other than the total size of the full input data set, determine sampling characteristics for sampling the full input data set, wherein the determined sampling characteristics are determined in part based on an indication of how many of the records have one or more particular values for one or more particular entity characteristics such that if a determined statistical significance achievable using the determined sampling characteristics does not meet a user-provided desired statistical significance for records with the particular values for the particular entity characteristics, the determined sampling characteristics are adjusted to meet the user-provided desired statistical significance for such records; and
- cause the portion of the full input data set to be imported into the memory of the computer system, including sampling the full input data set, to determine the portion of the full input data set to import, in accordance with the determined sampling characteristics,
- wherein the adjusted sampling characteristics are such that analysis as a result of processing by the executing application of the sampled portion of the full input data set is representative of the analysis that could otherwise be carried out on the full input data set, with a calculable statistical relevance,
- wherein the computer program instructions operable to cause the at least one computing device to sample the full input data set to determine the portion of the full input data set to import includes computer program instructions operable to cause the at least one computing device to sample the full input data set in a deterministic manner such that records of the full input data set having the particular values for the particular entity characteristics are included in the sampled portion of the full input data set in an amount sufficient to meet the user-provided desired statistical significance for such records, and
- wherein sampling the full input data set to determine the portion of the records to import includes applying a hash algorithm for sampling based on a key that is the value for each record at a particular dimension, the dimension in question being one that identifies a particular entity or user, wherein remaining information in each data record includes information regarding behavior or traits of the particular entity or user identified by the key for that record.
Dependent claims: 14, 15, 16, 17
Specification