Determination of sampling characteristics based on available memory

  • US 7,818,534 B2
  • Filed: 05/09/2007
  • Issued: 10/19/2010
  • Est. Priority Date: 05/09/2007
  • Status: Expired due to Fees
First Claim

1. A computer-implemented method comprising:

  • importing a portion of a full input data set into memory accessible by one or more computing devices for processing by an executing application, wherein the full input data set comprises a plurality of data records representing a dimensionally-modeled fact collection and wherein each record comprises a plurality of characteristics of an entity, the importing of the portion of the full input data set comprising:

    determining an amount of the data of the full input data set to import based on an amount of available memory accessible to the one or more computing devices;

    based on the determined amount of the data to import and on characteristics of the full input data set other than the total size of the full input data set, determining sampling characteristics for sampling the full input data set to sample less than every record of the full input set,

    wherein the determined sampling characteristics are determined in part based on an indication of how many of the records have one or more particular values for one or more particular entity characteristics such that if a determined statistical significance achievable using the determined sampling characteristics does not meet a user-provided desired statistical significance for records with the particular values for the particular entity characteristics, the determined sampling characteristics are adjusted to meet the user-provided desired statistical significance for such records; and

    causing the portion of the full input data set to be imported into the memory of the computer system, including sampling the full input data set, to determine the portion of the records to import, in accordance with the determined sampling characteristics,

    wherein the adjusted sampling characteristics are such that analysis as a result of processing by the executing application of the sampled portion of the full input data set is representative of the analysis that could otherwise be carried out on the full input data set, with a calculable statistical relevance,

    wherein sampling the full input data set using the adjusted sampling characteristics is performed in a deterministic manner such that records of the full input data set having the particular values for the particular entity characteristics are included in the sampled portion of the full input data set in an amount sufficient to meet the user-provided desired statistical significance for such records, and

    wherein sampling the full input data set to determine the portion of the records to import includes applying a hash algorithm for sampling based on a key that is the value for each record at a particular dimension, the dimension in question being one that identifies a particular entity or user, wherein remaining information in each data record includes information regarding behavior or traits of the particular entity or user identified by the key for that record.
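
The steps recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation, and everything named here is hypothetical: the function names, the use of an average record size to turn available memory into a record budget, the normal-approximation sample-size formula used as a stand-in for the claim's unspecified "statistical significance", and the choice of SHA-256 as the hash. The sketch derives a sampling rate from a memory budget, raises it when a segment of interest would otherwise be under-sampled, and then keeps or drops each record by hashing the entity key.

```python
import hashlib
import math


def rate_from_memory(available_bytes, avg_record_bytes, total_records):
    """Sampling rate implied by the memory budget (assumed sizing model)."""
    budget_records = available_bytes // avg_record_bytes
    return min(1.0, budget_records / total_records)


def required_sample_size(z, margin):
    """Records needed for a given z-score and margin of error.

    Assumption: worst-case proportion p = 0.5, so p * (1 - p) = 0.25.
    """
    return math.ceil(z * z * 0.25 / (margin * margin))


def adjust_rate(rate, segment_count, needed):
    """Raise the rate so the segment is expected to yield `needed` records."""
    if segment_count == 0:
        return rate
    return min(1.0, max(rate, needed / segment_count))


def keep(key, rate):
    """Deterministic keep/drop decision based on a hash of the entity key."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # value in [0, 2**64)
    return bucket < int(rate * 2**64)


def sample(records, key_field, rate):
    """Keep every record whose key hashes below the rate threshold."""
    return [r for r in records if keep(r[key_field], rate)]


if __name__ == "__main__":
    # Hypothetical fact records: one row per (user, event).
    records = [
        {"user_id": f"user-{i % 500}", "event": f"click-{i}"}
        for i in range(5000)
    ]
    rate = rate_from_memory(
        available_bytes=100_000, avg_record_bytes=100, total_records=5000
    )
    # Suppose 2,000 records belong to the segment the user cares about.
    rate = adjust_rate(rate, 2000, required_sample_size(1.96, 0.05))
    subset = sample(records, "user_id", rate)
    print(f"rate={rate:.3f}, kept {len(subset)} of {len(records)} records")
```

Because the decision depends only on the key, all records for a given user are kept or dropped together, and re-running the import selects the same users. Note that this sketch only raises the expected number of segment records; a stricter guarantee of the kind the claim recites would need an additional check on the actual count.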

  • 5 Assignments