Automatic generation of generic file signatures
Abstract
Systems and methods to automatically generate signatures used to detect malware are provided. The systems and methods use machine learning techniques to build an over-trained heuristic model to analyze software, cluster identified patterns, validate the clusters against known reputational metrics, automatically create signatures and, in some examples, deploy such signatures to remote computing devices.
20 Claims
1. A method for automatically generating signatures for detecting malware, comprising:
collecting a set of static attributes from a malware dataset and a goodware dataset;
generating a plurality of decision trees from the set of static attributes, wherein each decision tree in the plurality of decision trees comprises a plurality of terminal nodes;
identifying, for each sample in a known-file dataset, a pattern of terminal nodes to which the sample is mapped by the plurality of decision trees, wherein the pattern of terminal nodes of the sample comprises a representation of a terminal node from each decision tree within the plurality of decision trees to which the sample has been mapped;
generating a cluster of samples comprising samples in the known file dataset that have identical patterns of terminal nodes;
validating the cluster of samples against a reputation value range to determine a purity of the cluster of samples; and
generating, based at least in part on the purity of the cluster of samples, a signature for identifying additional files that are similar to the samples in the cluster of samples.
Dependent claims: 2, 3, 4, 5, 6, 7
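The identifying and clustering steps of claim 1 can be illustrated with a minimal sketch. It is not the patented implementation: the two toy trees are hard-coded rather than generated from malware and goodware datasets, and the attribute names (`section_count`, `entropy`, `import_count`, `file_size_kb`), thresholds, and file names are hypothetical. It only shows how each sample is mapped to one terminal node per tree and how samples with identical patterns fall into the same cluster.

```python
from collections import defaultdict

# Hypothetical toy trees (assumed, not learned from data).
# Internal node = (attribute, threshold, left_subtree, right_subtree);
# terminal node = an integer id that is unique within its tree.
TREES = [
    ("section_count", 5, 0, ("entropy", 7.0, 1, 2)),
    ("import_count", 10, ("file_size_kb", 100, 0, 1), 2),
]

def leaf_of(tree, sample):
    """Walk one tree and return the id of the terminal node the sample reaches."""
    node = tree
    while not isinstance(node, int):
        attribute, threshold, left, right = node
        node = left if sample[attribute] <= threshold else right
    return node

def leaf_pattern(trees, sample):
    """Pattern of terminal nodes: one terminal-node id per tree, in tree order."""
    return tuple(leaf_of(tree, sample) for tree in trees)

def cluster_by_pattern(trees, samples):
    """Group sample names whose terminal-node patterns are identical."""
    clusters = defaultdict(list)
    for name, attributes in samples.items():
        clusters[leaf_pattern(trees, attributes)].append(name)
    return dict(clusters)

# Hypothetical static attributes for three known files.
SAMPLES = {
    "a.exe": {"section_count": 8, "entropy": 7.6, "import_count": 3, "file_size_kb": 40},
    "b.exe": {"section_count": 9, "entropy": 7.9, "import_count": 5, "file_size_kb": 60},
    "c.exe": {"section_count": 4, "entropy": 5.1, "import_count": 40, "file_size_kb": 900},
}

CLUSTERS = cluster_by_pattern(TREES, SAMPLES)
# a.exe and b.exe reach the same terminal node in every tree, so they share a cluster.
```

Because membership requires an exact match across every tree, adding trees makes clusters narrower, which fits the abstract's description of an over-trained model.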
8. A system to automatically generate signatures used to detect malware, comprising:
an attribute collection module, stored in memory, that collects a set of static attributes from a malware dataset and a goodware dataset;
a heuristic module, stored in memory, that generates a plurality of decision trees from the set of static attributes, wherein each decision tree in the plurality of decision trees comprises a plurality of terminal nodes;
a clustering module, stored in memory, that:
identifies, for each sample in a known-file dataset, a pattern of terminal nodes to which the sample is mapped by the plurality of decision trees, wherein the pattern of terminal nodes of the sample comprises a representation of a terminal node from each decision tree within the plurality of decision trees to which the sample has been mapped; and
generates a cluster of samples comprising samples in the known file dataset that have identical patterns of terminal nodes;
a cluster validation module, stored in memory, that validates the cluster of samples against a reputation value range to determine a purity of the cluster of samples;
a signature creation module, stored in memory, that creates, based at least in part on the purity of the cluster of samples, a signature for identifying additional files that are similar to the samples in the cluster of samples; and
at least one physical processor that executes the attribute collection module, the heuristic module, the clustering module, the cluster validation module, and the signature creation module.
Dependent claims: 9, 10, 11, 12, 13, 14
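The cluster validation and signature creation modules of claim 8 might be sketched as follows. Everything concrete here is an assumption made for illustration: the claims do not define the reputation scale, how purity is computed, or what a signature contains. The sketch assumes reputation scores from -100 (known bad) to 100 (known good), defines purity as the fraction of cluster members whose score falls inside a chosen range, and uses the cluster's terminal-node pattern itself as the signature.

```python
def cluster_purity(members, reputation, low, high):
    """Fraction of cluster members whose reputation lies in [low, high] (assumed definition)."""
    in_range = sum(1 for name in members if low <= reputation[name] <= high)
    return in_range / len(members)

def make_signature(pattern, members, reputation, low, high, min_purity):
    """Return a signature for the cluster if it is pure enough, otherwise None."""
    purity = cluster_purity(members, reputation, low, high)
    if purity < min_purity:
        return None  # mixed cluster: no signature is emitted
    # Assumed signature format: the shared terminal-node pattern plus its purity.
    return {"leaf_pattern": pattern, "purity": purity}

def matches(signature, pattern):
    """A new file matches when its terminal-node pattern equals the signature's pattern."""
    return signature["leaf_pattern"] == pattern

# Hypothetical reputation scores: -100 is known bad, 100 is known good.
REPUTATION = {"a.exe": -90, "b.exe": -75, "c.exe": 80, "d.exe": -80}

# Both members are in the malicious range [-100, -50], so purity is 1.0.
pure = make_signature((2, 0), ["a.exe", "b.exe"], REPUTATION, -100, -50, 1.0)
# One member is in the range and one is not, so purity is 0.5 and nothing is emitted.
mixed = make_signature((0, 2), ["c.exe", "d.exe"], REPUTATION, -100, -50, 1.0)
```

Requiring a purity of 1.0 is the strictest choice; a lower `min_purity` would emit more signatures at the cost of tolerating clusters that mix good and bad files.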
15. A non-transitory computer-readable medium comprising computer-executable instructions that, when executed by at least one processor of a computing device, cause the computing device to:
collect a set of static attributes from a malware dataset and a goodware dataset;
generate a plurality of decision trees from the set of static attributes, wherein each decision tree in the plurality of decision trees comprises a plurality of terminal nodes;
identify, for each sample in a known-file dataset, a pattern of terminal nodes to which the sample is mapped by the plurality of decision trees, wherein the pattern of terminal nodes of the sample comprises a representation of a terminal node from each decision tree within the plurality of decision trees to which the sample has been mapped;
generate a cluster of samples comprising samples in the known file dataset that have identical patterns of terminal nodes;
validate the cluster of samples against a reputation value range to determine a purity of the cluster of samples; and
generate, based at least in part on the purity of the cluster of samples, a signature for identifying additional files that are similar to the samples in the cluster of samples.
Dependent claims: 16, 17, 18, 19, 20
Specification