System and method for clustering files and assigning a maliciousness property based on clustering
First Claim
1. A system, comprising:
an interface configured to receive a file;
a processor configured to:
transform file contents using a space-filling curve;
down-sample the transformed file contents to generate a sample locus;
perform a hashing operation on the sample locus and assign a cluster identifier to the file based at least in part on a result of the hashing operation;
in response to a determination that the cluster identifier is not present in a data store, determine a set of candidate nearest neighbors for the cluster identifier;
for each candidate nearest neighbor included in the set of candidate nearest neighbors, determine a set of existing cluster identifiers present in the data store;
for each existing cluster identifier included in the set of existing cluster identifiers, determine a set of member loci;
determine an edit distance between the sample locus and each of the respective loci in the set of member loci; and
in response to a determination that at least a first locus included in the set of member loci is within a threshold edit distance of the sample locus, assign one or more properties to the file based at least in part on properties associated with the first locus, wherein at least one property assigned to the file is an indicator of maliciousness; and
a memory coupled to the processor and configured to provide the processor with instructions.
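As a rough illustration of the first three processor steps recited above (transform, down-sample, hash), the following Python sketch lays file bytes along a Hilbert curve, averages fixed-size tiles, and hashes the result. The claim does not name a particular curve, grid size, down-sampling rule, or hash function. The Hilbert curve, the 16x16 grid, the 4x4 tile averaging, the 16-level quantisation, SHA-256, and every function name below are assumptions chosen for illustration only.

```python
import hashlib


def hilbert_d2xy(order, d):
    """Map distance d along a Hilbert curve to (x, y) on a 2**order square grid."""
    x = y = 0
    t = d
    s = 1
    while s < (1 << order):
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:
            # Rotate/flip the quadrant so consecutive cells stay adjacent.
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y


def sample_locus(data, order=4, block=4):
    """Hypothetical transform + down-sample step producing a short byte string."""
    n = 1 << order
    cells = n * n
    grid = [[0] * n for _ in range(n)]
    for d in range(cells):
        # Assumption: pick one byte per cell at a proportional file offset.
        value = data[d * len(data) // cells] if data else 0
        x, y = hilbert_d2xy(order, d)
        grid[y][x] = value
    m = n // block
    out = bytearray()
    for by in range(m):
        for bx in range(m):
            total = sum(
                grid[by * block + j][bx * block + i]
                for j in range(block)
                for i in range(block)
            )
            # Average the tile, then quantise 0..255 down to 0..15.
            out.append(total // (block * block) // 16)
    return bytes(out)


def cluster_id(locus):
    """Assumption: the cluster identifier is a truncated SHA-256 of the locus."""
    return hashlib.sha256(locus).hexdigest()[:16]
```

Because an exact hash gives unrelated identifiers to loci that differ by a single symbol, a scheme of this kind needs the nearest-neighbor and edit-distance steps recited in the remaining limitations to relate a new file to existing clusters.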
Abstract
A file is received. File contents are transformed using a space-filling curve. The results are down-sampled to generate a sample locus. A cluster identifier is assigned to the file. In response to a determination that the cluster identifier is not present in a data store, a set of candidate nearest neighbors is determined for the cluster identifier. For each candidate nearest neighbor, a set of existing cluster identifiers present in the data store is determined. For each existing cluster identifier, a set of member loci is determined. An edit distance between the sample locus and each of the member loci is determined. Finally, in response to a determination that a first locus in the set of member loci is within a threshold edit distance of the sample locus, one or more properties associated with the first locus are assigned to the file.
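The lookup described in the abstract (the path taken when the cluster identifier is not yet in the data store) might be sketched as follows. This is a minimal sketch under stated assumptions: the data store is modeled as a plain dictionary from cluster identifier to a list of (locus, properties) pairs, the candidate nearest neighbors are supplied by the caller because the text here does not say how they are computed, "within a threshold" is read as less than or equal to the threshold, and Levenshtein distance stands in for the unspecified edit distance. All names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (strings or bytes)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # deletion
                cur[j - 1] + 1,            # insertion
                prev[j - 1] + (ca != cb),  # substitution (free if equal)
            ))
        prev = cur
    return prev[-1]


def assign_properties(sample, new_id, store, candidate_ids, threshold):
    """Return properties inherited from a close member locus, or None.

    store: dict mapping cluster id -> list of (locus, properties) pairs
           (an assumed layout, not taken from the source).
    candidate_ids: candidate nearest neighbors for new_id, supplied by the
           caller because their derivation is not described here.
    """
    if new_id in store:
        # The hit path is not covered by this sketch.
        return None
    # Keep only the candidates that actually exist in the data store.
    existing = [cid for cid in candidate_ids if cid in store]
    for cid in existing:
        for locus, props in store[cid]:
            if edit_distance(sample, locus) <= threshold:
                return dict(props)
    return None
```

For example, with a store holding the locus "AABBCC" flagged as malicious under cluster "c1", a new sample "AABBCD" is one substitution away and would inherit that flag at a threshold of 1 but not at a threshold of 0.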
28 Claims
1. A system, comprising:
an interface configured to receive a file;
a processor configured to:
transform file contents using a space-filling curve;
down-sample the transformed file contents to generate a sample locus;
perform a hashing operation on the sample locus and assign a cluster identifier to the file based at least in part on a result of the hashing operation;
in response to a determination that the cluster identifier is not present in a data store, determine a set of candidate nearest neighbors for the cluster identifier;
for each candidate nearest neighbor included in the set of candidate nearest neighbors, determine a set of existing cluster identifiers present in the data store;
for each existing cluster identifier included in the set of existing cluster identifiers, determine a set of member loci;
determine an edit distance between the sample locus and each of the respective loci in the set of member loci; and
in response to a determination that at least a first locus included in the set of member loci is within a threshold edit distance of the sample locus, assign one or more properties to the file based at least in part on properties associated with the first locus, wherein at least one property assigned to the file is an indicator of maliciousness; and
a memory coupled to the processor and configured to provide the processor with instructions.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16
17. A method, comprising:
receiving a file;
transforming file contents using a space-filling curve;
down-sampling the transformed file contents to generate a sample locus;
performing a hashing operation on the sample locus and assigning a cluster identifier to the file based at least in part on a result of the hashing operation;
in response to a determination that the cluster identifier is not present in a data store, determining a set of candidate nearest neighbors for the cluster identifier;
for each candidate nearest neighbor included in the set of candidate nearest neighbors, determining a set of existing cluster identifiers present in the data store;
for each existing cluster identifier included in the set of existing cluster identifiers, determining a set of member loci;
determining an edit distance between the sample locus and each locus included in the set of member loci; and
in response to determining that at least a first locus included in the set of member loci is within a threshold edit distance of the sample locus, assigning one or more properties to the file based at least in part on properties associated with the first locus, wherein at least one property assigned to the file is an indicator of maliciousness.
Dependent claims: 18, 19, 20, 21, 22, 23, 24, 25, 26
27. A system, comprising:
an interface configured to receive a file;
a processor configured to:
transform file contents using a space-filling curve;
down-sample the transformed file contents to generate a sample locus;
assign a cluster identifier to the file based at least in part on the sample locus;
in response to a determination that the cluster identifier is not present in a data store, determine a set of candidate nearest neighbors for the cluster identifier;
for each candidate nearest neighbor included in the set of candidate nearest neighbors, determine a set of existing cluster identifiers present in the data store;
for each existing cluster identifier included in the set of existing cluster identifiers, determine a set of member loci;
determine an edit distance between the sample locus and each locus included in the set of member loci; and
in response to determining that at least a first locus included in the set of member loci is within a threshold edit distance of the sample locus, assign one or more properties to the file based at least in part on properties associated with the first locus, wherein at least one property assigned to the file is an indicator of maliciousness; and
a memory coupled to the processor and configured to provide the processor with instructions.
28. A method, comprising:
receiving a file;
transforming file contents using a space-filling curve;
down-sampling the transformed file contents to generate a sample locus;
assigning a cluster identifier to the file based at least in part on the sample locus;
in response to determining that the cluster identifier is not present in a data store, determining a set of candidate nearest neighbors for the cluster identifier;
for each candidate nearest neighbor included in the set of candidate nearest neighbors, determining a set of existing cluster identifiers present in the data store;
for each existing cluster identifier included in the set of existing cluster identifiers, determining a set of member loci;
determining an edit distance between the sample locus and each locus included in the set of member loci; and
in response to determining that at least a first locus included in the set of member loci is within a threshold edit distance of the sample locus, assigning one or more properties to the file based at least in part on properties associated with the first locus, wherein at least one property assigned to the file is an indicator of maliciousness.
Specification