Systems and methods for clustering data samples

US 9,152,703 B1
Filed: 02/28/2013
Issued: 10/06/2015
Est. Priority Date: 02/28/2013
Status: Active Grant

Abstract

A computer-implemented method for clustering data samples may include (1) identifying a plurality of samples, (2) identifying a plurality of candidate features, (3) identifying a plurality of candidate distance functions, (4) selecting a distance function by (i) selecting a set of features based on determining that a result of clustering a training set of samples using the set of features and the distance function fits an expected clustering of the training set of samples more closely than results from using an alternative set of features and (ii) determining that the result of clustering the training set using the set of features and the distance function fits the expected clustering of the training set of samples more closely than a best result of any other distance function, and (5) clustering the plurality of samples using the set of features and the distance function. Various other methods and systems are also disclosed.
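
The five steps in the abstract can be sketched in Python as follows. This is a minimal illustration under assumptions that do not come from the patent text: the threshold-based single-linkage `cluster` routine, the Rand-index `rand_index` accuracy metric, the greedy add-one-feature search, the two distance functions, and the toy training data are all hypothetical stand-ins chosen to keep the example self-contained.

```python
from itertools import combinations

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def chebyshev(a, b):
    return max(abs(x - y) for x, y in zip(a, b))

def cluster(samples, features, dist, threshold=1.0):
    """Assumed clustering step: samples within `threshold` share a cluster."""
    proj = [[s[f] for f in features] for s in samples]
    labels = list(range(len(samples)))
    for i, j in combinations(range(len(samples)), 2):
        if dist(proj[i], proj[j]) <= threshold:
            old, new = labels[j], labels[i]
            labels = [new if lab == old else lab for lab in labels]
    return labels

def rand_index(labels, expected):
    """Assumed accuracy metric: fraction of sample pairs grouped as expected."""
    pairs = list(combinations(range(len(labels)), 2))
    agree = sum((labels[i] == labels[j]) == (expected[i] == expected[j])
                for i, j in pairs)
    return agree / len(pairs)

def select_features(train, expected, candidates, dist):
    """Step 4(i): keep a feature only if it improves the training fit."""
    chosen, best = [], -1.0
    for f in candidates:
        score = rand_index(cluster(train, chosen + [f], dist), expected)
        if score > best:
            chosen, best = chosen + [f], score
    return chosen, best

def select_distance(train, expected, candidates, distances):
    """Step 4(ii): compare the best result of every candidate distance."""
    results = {name: select_features(train, expected, candidates, d)
               for name, d in distances.items()}
    name = max(results, key=lambda n: results[n][1])
    return name, results[name][0], results[name][1]

# Toy training set: three families, two samples each, two candidate features.
train = [(0.0, 0.0), (0.6, 0.6), (0.0, 3.0), (0.6, 3.6), (3.0, 0.0), (3.6, 0.6)]
expected = [0, 0, 1, 1, 2, 2]
distances = {"manhattan": manhattan, "chebyshev": chebyshev}

name, feats, score = select_distance(train, expected, [0, 1], distances)

# Step 5: cluster the full sample set with the winning combination.
samples = train + [(0.3, 0.2), (3.2, 0.4)]
final = cluster(samples, feats, distances[name])
```

On this toy data the Chebyshev distance with both features reproduces the expected grouping, while the Manhattan distance does not, so the Chebyshev combination is the one carried into the final clustering step.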

20 Claims

1. A computer-implemented method for clustering data samples, at least a portion of the method being performed by a computing device comprising at least one processor, the method comprising:
- identifying a plurality of samples to cluster;
- identifying a plurality of candidate features for clustering the plurality of samples;
- identifying a plurality of candidate distance functions for clustering the plurality of samples;
- selecting a distance function from the plurality of candidate distance functions for clustering the plurality of samples at least in part by:
  - selecting a set of features from the plurality of candidate features for clustering the plurality of samples based at least in part on determining that a result of clustering a training set of samples using the set of features and the distance function fits an expected clustering of the training set of samples more closely than an additional result of clustering the training set of samples using an alternative set of features from the plurality of candidate features and the distance function, according to a predefined clustering accuracy metric;
  - determining that the result of clustering the training set of samples using the set of features and the distance function fits the expected clustering of the training set of samples more closely than a best result of clustering the training set of samples for each candidate distance function, aside from the distance function, within the plurality of candidate distance functions, according to the predefined clustering accuracy metric;
- clustering the plurality of samples using the set of features and the distance function.
- Dependent claims (2, 3, 4, 5, 6, 7):
  - 2. The computer-implemented method of claim 1, wherein:
    - the set of features comprises a subset of the plurality of features;
    - the alternative set of features comprises the subset and an additional feature from within the plurality of features;
    - selecting the set of features comprises adding the additional feature with the set of features to create the alternative set of features to determine whether the additional feature improves upon the result of clustering according to the predefined clustering accuracy metric.
  - 3. The computer-implemented method of claim 1, wherein:
    - the alternative set of features comprises a subset of the plurality of features;
    - the set of features comprises the subset and an additional feature from within the plurality of features;
    - selecting the set of features comprises adding the additional feature with the alternative set of features to create the set of features to determine whether the additional feature improves upon the result of clustering according to the predefined clustering accuracy metric.
  - 4. The computer-implemented method of claim 1, wherein:
    - the plurality of samples comprises a plurality of malware samples;
    - the training set of samples comprises a set of malware variants from a plurality of malware families;
    - the predefined clustering accuracy metric comprises a level of correspondence between at least one cluster of the plurality of malware samples and at least one malware family of the plurality of malware families.
  - 5. The computer-implemented method of claim 1, wherein selecting the set of features comprises:
    - ordering the plurality of candidate features by single-feature clustering efficacy to create an ordered list of candidate features;
    - iterating through the ordered list of candidate features and adding to the set of features each candidate feature from the ordered list of candidate features that improves clustering of the training set of samples when added to the set of features.
  - 6. The computer-implemented method of claim 1, wherein the predefined clustering accuracy metric comprises at least one of:
    - a measure of inter-cluster distance;
    - a measure of intra-cluster closeness.
  - 7. The computer-implemented method of claim 1, further comprising classifying at least one sample within the plurality of samples according to a cluster in which the sample falls after clustering the plurality of samples using the set of features and the distance function.
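
The ordered selection recited in claim 5 can be illustrated with a short Python sketch. It is not taken from the specification: `score` is a hypothetical callable that returns the clustering accuracy on the training set for a given feature list, and the feature names and score table are invented purely to exercise the loop.

```python
def ordered_forward_selection(candidates, score):
    """Rank features by single-feature efficacy, then add each one that helps."""
    # Ordered list of candidate features, best single-feature score first.
    ordered = sorted(candidates, key=lambda f: score([f]), reverse=True)
    selected, best = [], float("-inf")
    for feature in ordered:
        trial = score(selected + [feature])
        if trial > best:  # keep the feature only if clustering improves
            selected.append(feature)
            best = trial
    return selected, best

# Hypothetical accuracy table standing in for "cluster the training set and
# compare against the expected clustering".
toy_scores = {
    frozenset(["imports"]): 0.6,
    frozenset(["strings"]): 0.5,
    frozenset(["size"]): 0.2,
    frozenset(["imports", "strings"]): 0.8,
    frozenset(["imports", "strings", "size"]): 0.7,
}

def toy_score(features):
    return toy_scores[frozenset(features)]

selected, best = ordered_forward_selection(["size", "strings", "imports"], toy_score)
```

In this toy run the third feature is rejected because adding it lowers the score, which is the behavior the claim describes for a feature that does not improve clustering of the training set.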

8. A system for clustering data samples, the system comprising:
- an identification module programmed to:
  - identify a plurality of samples to cluster;
  - identify a plurality of candidate features for clustering the plurality of samples;
  - identify a plurality of candidate distance functions for clustering the plurality of samples;
- a selection module programmed to select a distance function from the plurality of candidate distance functions for clustering the plurality of samples at least in part by:
  - selecting a set of features from the plurality of candidate features for clustering the plurality of samples based at least in part on determining that a result of clustering a training set of samples using the set of features and the distance function fits an expected clustering of the training set of samples more closely than an additional result of clustering the training set of samples using an alternative set of features from the plurality of candidate features and the distance function, according to a predefined clustering accuracy metric;
  - determining that the result of clustering the training set of samples using the set of features and the distance function fits the expected clustering of the training set of samples more closely than a best result of clustering the training set of samples for each candidate distance function, aside from the distance function, within the plurality of candidate distance functions, according to the predefined clustering accuracy metric;
- a clustering module programmed to cluster the plurality of samples using the set of features and the distance function;
- at least one processor configured to execute the identification module, the selection module, and the clustering module.
- Dependent claims (9, 10, 11, 12, 13, 14):
  - 9. The system of claim 8, wherein:
    - the set of features comprises a subset of the plurality of features;
    - the alternative set of features comprises the subset and an additional feature from within the plurality of features;
    - selecting the set of features comprises adding the additional feature with the set of features to create the alternative set of features to determine whether the additional feature improves upon the result of clustering according to the predefined clustering accuracy metric.
  - 10. The system of claim 8, wherein:
    - the alternative set of features comprises a subset of the plurality of features;
    - the set of features comprises the subset and an additional feature from within the plurality of features;
    - the selection module is programmed to select the set of features by adding the additional feature with the alternative set of features to create the set of features to determine whether the additional feature improves upon the result of clustering according to the predefined clustering accuracy metric.
  - 11. The system of claim 8, wherein:
    - the plurality of samples comprises a plurality of malware samples;
    - the training set of samples comprises a set of malware variants from a plurality of malware families;
    - the predefined clustering accuracy metric comprises a level of correspondence between at least one cluster of the plurality of malware samples and at least one malware family of the plurality of malware families.
  - 12. The system of claim 8, wherein the selection module is programmed to select the set of features by:
    - ordering the plurality of candidate features by single-feature clustering efficacy to create an ordered list of candidate features;
    - iterating through the ordered list of candidate features and adding to the set of features each candidate feature from the ordered list of candidate features that improves clustering of the training set of samples when added to the set of features.
  - 13. The system of claim 8, wherein the predefined clustering accuracy metric comprises at least one of:
    - a measure of inter-cluster distance;
    - a measure of intra-cluster closeness.
  - 14. The system of claim 8, wherein the clustering module is further programmed to classify at least one sample within the plurality of samples according to a cluster in which the sample falls after clustering the plurality of samples using the set of features and the distance function.
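
Claim 13 names inter-cluster distance and intra-cluster closeness without fixing how either is measured. One possible reading, offered only as an assumption, is the mean pairwise distance between samples in different clusters and the mean pairwise distance between samples in the same cluster. The Euclidean distance, the function names, and the sample points below are all hypothetical.

```python
from itertools import combinations

def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def intra_inter(points, labels, dist=euclidean):
    """Return (mean same-cluster distance, mean cross-cluster distance)."""
    intra, inter = [], []
    for i, j in combinations(range(len(points)), 2):
        d = dist(points[i], points[j])
        if labels[i] == labels[j]:
            intra.append(d)  # closeness within a cluster
        else:
            inter.append(d)  # separation between clusters
    mean = lambda xs: sum(xs) / len(xs) if xs else 0.0
    return mean(intra), mean(inter)

points = [(0, 0), (0, 1), (4, 0), (4, 1)]
labels = [0, 0, 1, 1]
intra, inter = intra_inter(points, labels)
```

Under this reading a good clustering has a small first value and a large second value; how the two are combined into a single accuracy figure is left open here.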

15. A non-transitory computer-readable medium comprising one or more computer-executable instructions that, when executed by at least one processor of a computing device, cause the computing device to:
- identify a plurality of samples to cluster;
- identify a plurality of candidate features for clustering the plurality of samples;
- identify a plurality of candidate distance functions for clustering the plurality of samples;
- select a distance function from the plurality of candidate distance functions for clustering the plurality of samples at least in part by:
  - selecting a set of features from the plurality of candidate features for clustering the plurality of samples based at least in part on determining that a result of clustering a training set of samples using the set of features and the distance function fits an expected clustering of the training set of samples more closely than an additional result of clustering the training set of samples using an alternative set of features from the plurality of candidate features and the distance function, according to a predefined clustering accuracy metric;
  - determining that the result of clustering the training set of samples using the set of features and the distance function fits the expected clustering of the training set of samples more closely than a best result of clustering the training set of samples for each candidate distance function, aside from the distance function, within the plurality of candidate distance functions, according to the predefined clustering accuracy metric;
- cluster the plurality of samples using the set of features and the distance function.
- Dependent claims (16, 17, 18, 19, 20):
  - 16. The non-transitory computer-readable medium of claim 15, wherein:
    - the set of features comprises a subset of the plurality of features;
    - the alternative set of features comprises the subset and an additional feature from within the plurality of features;
    - selecting the set of features comprises adding the additional feature with the set of features to create the alternative set of features to determine whether the additional feature improves upon the result of clustering according to the predefined clustering accuracy metric.
  - 17. The non-transitory computer-readable medium of claim 15, wherein:
    - the alternative set of features comprises a subset of the plurality of features;
    - the set of features comprises the subset and an additional feature from within the plurality of features;
    - selecting the set of features comprises adding the additional feature with the alternative set of features to create the set of features to determine whether the additional feature improves upon the result of clustering according to the predefined clustering accuracy metric.
  - 18. The non-transitory computer-readable medium of claim 15, wherein:
    - the plurality of samples comprises a plurality of malware samples;
    - the training set of samples comprises a set of malware variants from a plurality of malware families;
    - the predefined clustering accuracy metric comprises a level of correspondence between at least one cluster of the plurality of malware samples and at least one malware family of the plurality of malware families.
  - 19. The non-transitory computer-readable medium of claim 15, wherein selecting the set of features comprises:
    - ordering the plurality of candidate features by single-feature clustering efficacy to create an ordered list of candidate features;
    - iterating through the ordered list of candidate features and adding to the set of features each candidate feature from the ordered list of candidate features that improves clustering of the training set of samples when added to the set of features.
  - 20. The non-transitory computer-readable medium of claim 15, wherein the predefined clustering accuracy metric comprises at least one of:
    - a measure of inter-cluster distance;
    - a measure of intra-cluster closeness.
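
The "level of correspondence" between a cluster and a malware family in claim 18 is not defined in the claim language. A simple way to picture it, assumed here rather than drawn from the patent, is per-cluster purity: the share of a cluster's members that belong to its most common family. The function name and the family labels `fam_a` and `fam_b` are hypothetical.

```python
from collections import Counter, defaultdict

def family_correspondence(cluster_labels, family_labels):
    """Map each cluster to (dominant family, fraction of members in it)."""
    members = defaultdict(list)
    for cluster_id, family in zip(cluster_labels, family_labels):
        members[cluster_id].append(family)
    report = {}
    for cluster_id, families in members.items():
        family, count = Counter(families).most_common(1)[0]
        report[cluster_id] = (family, count / len(families))
    return report

clusters = [0, 0, 0, 1, 1]
families = ["fam_a", "fam_a", "fam_b", "fam_b", "fam_b"]
report = family_correspondence(clusters, families)
```

Here cluster 1 corresponds fully to one family, while cluster 0 mixes two families and so scores lower.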

Current Assignee: Veritas Technologies, LLC (Whitehouse Group Ltd.)
Original Assignee: Symantec Corporation (NortonLifeLock Inc.)
Inventors: Satish, Sourabh
Primary Examiner(s): Mofiz, Apu
Assistant Examiner(s): Agharahimi, Farhad
Application Number: US13/780,765
Time in Patent Office: 950 Days
Field of Search: 726/22
US Class Current: 1/1
CPC Class Codes:
G06F 16/35 Clustering; Classification
G06Q 30/0185 Product, service or busines...