Measuring confidence of file clustering and clustering based file classification

US 8,214,365 B1
Filed: 02/28/2011
Issued: 07/03/2012
Est. Priority Date: 02/28/2011
Status: Active Grant

Abstract

A uniformity of a cluster of samples is determined, and a corresponding raw confidence value is calculated. A confidence interval weight is calculated using a confidence interval to determine reliability of the uniformity. A trace length weight is calculated, as a function of traces of the samples. An n-gram weight is calculated, as a function of numbers of n-grams generated by the samples. A compactness weight is calculated, as a function of the similarity of the samples. A cluster weight is calculated as a function of the four above-described weights. A cluster confidence measurement is calculated as a function of the cluster weight and the raw confidence value. When a new sample is assigned to the cluster, an assignment confidence measurement is calculated, as a function of the cluster's confidence measurement and the sample's trace length, n-grams and similarity.
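The trace length and n-gram inputs mentioned in the abstract can be illustrated with a minimal sketch. It assumes a trace is an ordered list of call names and that an n-gram is a sliding window of `n` consecutive calls; the function names, the default `n = 3`, and the sample trace are hypothetical and do not come from the patent.

```python
from typing import Sequence, Set, Tuple


def unique_ngrams(trace: Sequence[str], n: int = 3) -> Set[Tuple[str, ...]]:
    """Return the distinct n-grams (windows of n consecutive calls) in a trace.

    A trace shorter than n yields an empty set.
    """
    return {tuple(trace[i:i + n]) for i in range(len(trace) - n + 1)}


def trace_features(trace: Sequence[str], n: int = 3) -> Tuple[int, int]:
    """Return (trace length, number of unique n-grams) for one sample."""
    return len(trace), len(unique_ngrams(trace, n))


# Hypothetical runtime trace of one sample (illustrative call names only).
trace = ["open", "read", "write", "open", "read", "write", "close"]
length, ngram_count = trace_features(trace, n=3)
print(length, ngram_count)
```

A repetitive trace produces few unique n-grams relative to its length, which is why the two quantities are tracked separately in this sketch.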

20 Claims

1. A computer implemented method for quantifying a confidence level in a quality of a cluster of samples, wherein the samples are clustered according to runtime behavior, the method comprising the steps of:
- determining, by at least one computer, a uniformity of the cluster, the uniformity of the cluster being determined as a function of at least a ratio of a most frequently occurring unique sample label present in the cluster to a total number of unique sample labels present in the cluster;
  
  assigning, by the at least one computer, a raw confidence value to the cluster, the raw confidence value being a function of the determined uniformity of the cluster;
  
  calculating, by the at least one computer, a confidence interval weight for the cluster, the confidence interval weight being calculated by using a confidence interval to determine reliability of the determined uniformity of the cluster;
  
  calculating, by the at least one computer, a trace length weight for the cluster, the trace length weight being calculated as a function of lengths of traces generated by the samples in the cluster;
  
  calculating, by the at least one computer, an n-gram weight for the cluster, the n-gram weight being calculated as a function of numbers of unique n-grams generated by the samples in the cluster;
  
  calculating, by the at least one computer, a compactness weight for the cluster, the compactness weight being calculated as a function of similarity of samples in the cluster to a point of reference;
  
  calculating, by the at least one computer, a cluster weight for the cluster, the cluster weight being calculated as a function of the confidence interval weight, the trace length weight, the n-gram weight and the compactness weight; and
  
  assigning, by the at least one computer, a cluster confidence measurement to the cluster, the cluster confidence measurement being a function of the cluster weight and the cluster raw confidence value.
- - 2. The method of claim 1 wherein determining, by the at least one computer, the uniformity of the cluster further comprises:
    - reading, by the at least one computer, a label of each sample in the cluster;
      
      determining, by the at least one computer, a total number of unique sample labels that are present in the cluster;
      
      determining, by the at least one computer, the most frequently occurring unique sample label present in the cluster;
      
      determining, by the at least one computer, a number of samples in the cluster with the most frequently occurring unique sample label present in the cluster; and
      
      determining, by the at least one computer, a percentage of the total number of unique sample labels that are present in the cluster comprised by the number of samples in the cluster with the most frequently occurring unique sample label.
  - 3. The method of claim 1 wherein determining, by the at least one computer, the uniformity of the cluster further comprises:
    - reading, by the at least one computer, a label of each sample in the cluster;
      
      determining, by the at least one computer, a total number of unique sample labels that are present in the cluster; and
      
      for each unique sample label present in the cluster, determining, by the at least one computer, its percentage of the total number of unique sample labels that are present in the cluster; and
      
      determining, by the at least one computer, the uniformity of the cluster as a function of percentages of the total number of unique sample labels that are present comprised by multiple unique sample labels present in the cluster.
  - 4. The method of claim 1 wherein determining, by the at least one computer, the uniformity of the cluster further comprises:
    - reading, by the at least one computer, a label of each sample in the cluster;
      
      determining, by the at least one computer, a total number of unique sample labels that are present in the cluster;
      
      for each unique sample label present in the cluster, determining, by the at least one computer, a percentage of a total number of samples that are present in the cluster having that unique sample label; and
      
      determining, by the at least one computer, the uniformity of the cluster as a function of percentages of the total number of samples that are present comprised by samples with each multiple unique sample label present in the cluster.
  - 5. The method of claim 1 wherein assigning, by the at least one computer, a raw confidence value to the cluster, the raw confidence value being a function of the determined uniformity of the cluster, further comprises:
    - calculating, by the at least one computer, the raw confidence value by using a sigmoid function to map the determined uniformity of the cluster to a nonlinear scale.
  - 6. The method of claim 1 wherein calculating, by the at least one computer, a confidence interval weight for the cluster further comprises:
    - using, by the at least one computer, a standard adjusted Wald confidence interval to determine the reliability of the determined uniformity of the cluster.
  - 7. The method of claim 1 wherein calculating, by the at least one computer, a trace length weight for the cluster further comprises:
    - calculating, by the at least one computer, the trace length weight for the cluster as a function of an average of the lengths of traces generated by the samples in the cluster.
  - 8. The method of claim 1 wherein calculating, by the at least one computer, a trace length weight for the cluster further comprises:
    - calculating, by the at least one computer, the trace length weight for the cluster as a function of lengths of traces into security sensitive calls made by the samples in the cluster.
  - 9. The method of claim 1 wherein calculating, by the at least one computer, an n-gram weight for the cluster, further comprises:
    - calculating, by the at least one computer, the n-gram weight for the cluster as a function of numbers of unique n-grams generated by the samples in the cluster making security sensitive calls.
  - 10. The method of claim 1 wherein calculating, by the at least one computer, a compactness weight for the cluster further comprises:
    - calculating, by the at least one computer, the compactness weight for the cluster as a function of a similarity of each sample in the cluster at a feature vector level to a prototype sample.
  - 11. The method of claim 1 wherein calculating, by the at least one computer, a cluster weight for the cluster further comprises:
    - weighting, by the at least one computer, at least one of the interval weight, the trace length weight, the n-gram weight and the compactness weight according to at least one weighting factor; and
      
      adding, by the at least one computer, the confidence interval weight, the trace length weight, the n-gram weight and the compactness weight.
  - 12. The method of claim 1 wherein assigning, by the at least one computer, a cluster confidence measurement to the cluster further comprises:
    - calculating, by the at least one computer, the cluster confidence measurement by multiplying the cluster weight and the cluster raw confidence value.
  - 13. The method of claim 1 further comprising:
    - assigning, by the at least one computer, a new sample to the cluster, based on the runtime behavior of the new sample; and
      
      calculating, by the at least one computer, an assignment confidence measurement concerning the assignment of the new sample to the cluster as a function of the confidence measurement assigned to the cluster, a length of a trace generated by the new sample, a number of n-grams present in the trace generated by the new sample, and a similarity of the new sample at a feature vector level to a reference point concerning the cluster.
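
The steps of claims 1 through 12 can be sketched in code under several stated assumptions. Uniformity is read here as the share of samples carrying the most frequent label, the adjusted Wald interval is taken to be the Agresti-Coull form, the interval weight is assumed to be one minus the interval width, and the `saturating` mapping, its scale constants, the sigmoid parameters, and the equal weighting factors are all hypothetical choices. The claims only require that each quantity be "a function of" its inputs, so this is one possible instance, not the patented implementation.

```python
import math
from collections import Counter
from statistics import mean
from typing import Sequence, Tuple


def uniformity(labels: Sequence[str]) -> float:
    """Share of samples with the most frequent label (assumed reading; labels non-empty)."""
    return Counter(labels).most_common(1)[0][1] / len(labels)


def raw_confidence(u: float, steepness: float = 10.0, midpoint: float = 0.5) -> float:
    """Map uniformity to a nonlinear scale with a sigmoid (claim 5); parameters are assumptions."""
    return 1.0 / (1.0 + math.exp(-steepness * (u - midpoint)))


def adjusted_wald_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Adjusted Wald (Agresti-Coull) interval for a proportion, clipped to [0, 1]."""
    n_adj = n + z * z
    p_adj = (successes + z * z / 2.0) / n_adj
    half = z * math.sqrt(p_adj * (1.0 - p_adj) / n_adj)
    return max(0.0, p_adj - half), min(1.0, p_adj + half)


def confidence_interval_weight(labels: Sequence[str]) -> float:
    """Assumed mapping: a narrower interval around the uniformity gives a higher weight."""
    top = Counter(labels).most_common(1)[0][1]
    lo, hi = adjusted_wald_interval(top, len(labels))
    return 1.0 - (hi - lo)


def saturating(x: float, scale: float) -> float:
    """Hypothetical squashing of a non-negative quantity into [0, 1)."""
    return x / (x + scale)


def cluster_confidence(
    labels: Sequence[str],
    trace_lengths: Sequence[int],
    ngram_counts: Sequence[int],
    similarities: Sequence[float],  # similarity of each sample to a prototype, in [0, 1]
    factors: Tuple[float, float, float, float] = (0.25, 0.25, 0.25, 0.25),
) -> float:
    weights = (
        confidence_interval_weight(labels),            # claim 6
        saturating(mean(trace_lengths), scale=100.0),  # claim 7: average trace length
        saturating(mean(ngram_counts), scale=50.0),    # unique n-gram counts
        mean(similarities),                            # claim 10: compactness
    )
    # Claim 11: weight each term, then add them.
    cluster_weight = sum(f * w for f, w in zip(factors, weights))
    # Claim 12: multiply the cluster weight by the raw confidence value.
    return cluster_weight * raw_confidence(uniformity(labels))


pure = cluster_confidence(["a"] * 10, [120] * 10, [60] * 10, [0.9] * 10)
mixed = cluster_confidence(["a"] * 5 + ["b"] * 5, [120] * 10, [60] * 10, [0.9] * 10)
print(pure > mixed)
```

With equal weighting factors that sum to one and every weight in the unit interval, the result stays between 0 and 1, which makes clusters comparable; other factor choices would need their own normalization.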

14. At least one non-transitory computer readable medium storing a computer program product for quantifying a confidence level in a quality of a cluster of samples, wherein the samples are clustered according to runtime behavior, the computer program product comprising:
- program code for determining a uniformity of the cluster, the uniformity of the cluster being determined as a function of at least a ratio of a most frequently occurring unique sample label present in the cluster to a total number of unique sample labels present in the cluster;
  
  program code for assigning a raw confidence value to the cluster, the raw confidence value being a function of the determined uniformity of the cluster;
  
  program code for calculating a confidence interval weight for the cluster, the confidence interval weight being calculated by using a confidence interval to determine reliability of the determined uniformity of the cluster;
  
  program code for calculating a trace length weight for the cluster, the trace length weight being calculated as a function of lengths of traces generated by the samples in the cluster;
  
  program code for calculating an n-gram weight for the cluster, the n-gram weight being calculated as a function of numbers of unique n-grams generated by the samples in the cluster;
  
  program code for calculating a compactness weight for the cluster, the compactness weight being calculated as a function of similarity of samples in the cluster to a point of reference;
  
  program code for calculating a cluster weight for the cluster, the cluster weight being calculated as a function of the confidence interval weight, the trace length weight, the n-gram weight and the compactness weight; and
  
  program code for assigning a cluster confidence measurement to the cluster, the cluster confidence measurement being a function of the cluster weight and the cluster raw confidence value.
- - 15. The computer program product of claim 14 wherein the program code for determining the uniformity of the cluster further comprises:
    - program code for reading a label of each sample in the cluster;
      
      program code for determining a total number of unique sample labels that are present in the cluster;
      
      program code for determining the most frequently occurring unique sample label present in the cluster;
      
      program code for determining a number of samples in the cluster with the most frequently occurring unique sample label present in the cluster; and
      
      program code for determining a percentage of the total number of unique sample labels that are present in the cluster comprised by the number of samples in the cluster with the most frequently occurring unique sample label.
  - 16. The computer program product of claim 14 wherein the program code for calculating a trace length weight for the cluster further comprises:
    - program code for calculating the trace length weight for the cluster as a function of lengths of traces into security sensitive calls made by the samples in the cluster.
  - 17. The computer program product of claim 14 wherein the program code for calculating an n-gram weight for the cluster, further comprises:
    - program code for calculating the n-gram weight for the cluster as a function of numbers of unique n-grams generated by the samples in the cluster making security sensitive calls.
  - 18. The computer program product of claim 14 wherein the program code for calculating a compactness weight for the cluster further comprises:
    - program code for calculating the compactness weight for the cluster as a function of a similarity of each sample in the cluster at a feature vector level to a prototype sample.
  - 19. The computer program product of claim 14 further comprising:
    - program code for assigning a new sample to the cluster, based on the runtime behavior of the new sample; and
      
      program code for calculating an assignment confidence measurement concerning the assignment of the new sample to the cluster as a function of the confidence measurement assigned to the cluster, a length of a trace generated by the new sample, a number of n-grams present in the trace generated by the new sample, and a similarity of the new sample at a feature vector level to a reference point concerning the cluster.
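
The assignment confidence measurement of claims 13 and 19 can be sketched in the same hedged way. The claims name the four inputs but not how they are combined, so the weighted sum of per-sample terms, the `saturating` mapping, its scale constants, and the equal factors below are assumptions made only for illustration.

```python
from typing import Tuple


def saturating(x: float, scale: float) -> float:
    """Hypothetical squashing of a non-negative quantity into [0, 1)."""
    return x / (x + scale)


def assignment_confidence(
    cluster_conf: float,   # confidence measurement already assigned to the cluster
    trace_length: int,     # length of the trace generated by the new sample
    ngram_count: int,      # number of n-grams present in that trace
    similarity: float,     # similarity to the cluster's reference point, in [0, 1]
    factors: Tuple[float, float, float] = (1 / 3, 1 / 3, 1 / 3),
) -> float:
    """Scale the cluster's confidence by a per-sample weight (assumed combination)."""
    sample_weight = (
        factors[0] * saturating(trace_length, 100.0)
        + factors[1] * saturating(ngram_count, 50.0)
        + factors[2] * similarity
    )
    return cluster_conf * sample_weight


# A new sample with a long, varied trace that sits close to the reference point
# inherits more of the cluster's confidence than a short, dissimilar one.
strong = assignment_confidence(0.8, trace_length=400, ngram_count=200, similarity=0.95)
weak = assignment_confidence(0.8, trace_length=10, ngram_count=5, similarity=0.40)
print(strong > weak)
```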

20. A computer system for quantifying a confidence level in a quality of a cluster of samples, wherein the samples are clustered according to runtime behavior, the computer system comprising:
- a processor;
  
  computer memory;
  
  a cluster uniformity determining module residing in the system memory, configured for determining a uniformity of the cluster as a function of at least the uniformity of the cluster being determined as a function of at least a ratio of a most frequently occurring unique sample label present in the cluster to a total number of unique sample labels present in the cluster;
  
  a raw confidence calculating module residing in the system memory, configured for assigning a raw confidence value to the cluster, the raw confidence value being a function of the determined uniformity of the cluster;
  
  a confidence interval weight calculating module residing in the system memory, configured for calculating a confidence interval weight for the cluster, the confidence interval weight being calculated by using a confidence interval to determine reliability of the determined uniformity of the cluster;
  
  a trace length weight calculating module residing in the system memory, configured for calculating a trace length weight for the cluster, the trace length weight being calculated as a function of lengths of traces generated by the samples in the cluster;
  
  an n-gram weight calculating module residing in the system memory, configured for calculating an n-gram weight for the cluster, the n-gram weight being calculated as a function of numbers of unique n-grams generated by the samples in the cluster;
  
  a compactness weight calculating module residing in the system memory, configured for calculating a compactness weight for the cluster, the compactness weight being calculated as a function of similarity of samples in the cluster to a point of reference;
  
  a cluster weight calculating module residing in the system memory, configured for calculating a cluster weight for the cluster, the cluster weight being calculated as a function of the confidence interval weight, the trace length weight, the n-gram weight and the compactness weight; and
  
  a cluster confidence measurement calculating module residing in the system memory, configured for assigning a cluster confidence measurement to the cluster, the cluster confidence measurement being a function of the cluster weight and the cluster raw confidence value.

Current Assignee: CA, Inc. (d/b/a CA Technologies) (Broadcom, Inc.)
Original Assignee: Symantec Corporation (NortonLifeLock Inc.)
Inventors: Manadhata, Pratyusa Kumar; Bhatkar, Sandeep B.; Griffin, Kent E.
Primary Examiner(s): Vy, Hung T
Application Number: US13/036,864
Time in Patent Office: 491 Days
Field of Search: 707/758, 707/737, 707/738, 707/739, 707/749, 707/750, 706/54
US Class Current: 707/737
CPC Class Codes:
G06F 18/24137   Distances to cluster centroïds
G06F 21/566   Dynamic detection, i.e. det...
G06F 21/577   Assessing vulnerabilities a...
G06V 30/268   Lexical context
