Matrix factorization for automated malware detection
Abstract
Disclosed herein is a system and method for automatically identifying potential malware files or benign files among files that are not known to be malware. Vector distances for select features of the files are compared to vectors of both known malware files and benign files. Based on the distance measures, a malware score is obtained for the unknown file. If the malware score exceeds a threshold, a researcher may be notified of the potential malware, or the file may be automatically classified as malware if the score is significantly high.
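The scoring described in the abstract can be illustrated with a minimal sketch. The abstract does not give a scoring formula, so the inverse-distance-weighted vote over the `k` nearest labeled vectors used here is an assumption, as are the function names `malware_score` and `triage` and the two threshold values.

```python
import math

def malware_score(vec, malware_vecs, benign_vecs, k=3):
    """Score in [0, 1]: share of inverse-distance weight held by malware
    among the k nearest labeled vectors (an assumed formula, not the patent's)."""
    labeled = [(math.dist(vec, m), 1.0) for m in malware_vecs]
    labeled += [(math.dist(vec, b), 0.0) for b in benign_vecs]
    nearest = sorted(labeled)[:k]
    # Small epsilon avoids division by zero for exact matches.
    weights = [1.0 / (d + 1e-9) for d, _ in nearest]
    weighted = sum(w * label for w, (_, label) in zip(weights, nearest))
    return weighted / sum(weights)

def triage(score, alert_threshold=0.5, auto_threshold=0.9):
    """Two hypothetical thresholds: notify a researcher, or classify automatically."""
    if score > auto_threshold:
        return "classified_malware"
    if score > alert_threshold:
        return "alert_researcher"
    return "no_action"

# Illustrative feature vectors (made up for this sketch).
known_malware = [(1.0, 1.0), (0.9, 1.1)]
known_benign = [(0.0, 0.0), (0.1, -0.1)]
print(triage(malware_score((0.95, 1.0), known_malware, known_benign)))
```

A file whose nearest neighbors are mostly known malware receives a score near 1, while a file surrounded by benign neighbors receives a score near 0.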
17 Claims
1. A malware detection system comprising:
- at least one processor;
- a feature identifier configured to generate a matrix of files and associated machines having a plurality of features associated with the files and machines, the feature identifier further configured to apply matrix factorization to the matrix of files and associated machines to generate a machine matrix and a file matrix, and is configured to perform dimensional reduction to identify a group of features from the plurality of features that are most informative features, wherein the group of features is a fixed number of features and comprises a subset of the plurality of features;
- a malware database comprising files of known malware and a plurality of features associated with the known malware;
- a comparison engine configured to identify for a file a number of other files that are similar to the file from the matrix of files and the malware database and to score the file based on a closeness of the other files to the file; and
- malware classification component configured to identify potential malware based on the score of the file and is further configured to create an alert if the score for the file exceeds a first threshold score.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10
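As a rough illustration of the factorization step recited in claim 1, the sketch below approximates a small file-by-machine matrix as the product of a file matrix and a machine matrix. The claim does not name a factorization algorithm, so the regularized gradient-descent fit, the function names `factorize` and `reconstruction_error`, and all hyperparameters are assumptions chosen for brevity.

```python
import random

def factorize(matrix, rank=2, steps=2000, lr=0.05, reg=0.01, seed=0):
    """Approximate matrix (files x machines) as file_m times machine_m transposed.
    Plain gradient descent; an assumed method, not taken from the claims."""
    rng = random.Random(seed)
    n_files, n_machines = len(matrix), len(matrix[0])
    file_m = [[rng.uniform(0.1, 0.5) for _ in range(rank)] for _ in range(n_files)]
    machine_m = [[rng.uniform(0.1, 0.5) for _ in range(rank)] for _ in range(n_machines)]
    for _ in range(steps):
        for i in range(n_files):
            for j in range(n_machines):
                pred = sum(file_m[i][k] * machine_m[j][k] for k in range(rank))
                err = matrix[i][j] - pred
                for k in range(rank):
                    f, g = file_m[i][k], machine_m[j][k]
                    file_m[i][k] += lr * (err * g - reg * f)
                    machine_m[j][k] += lr * (err * f - reg * g)
    return file_m, machine_m

def reconstruction_error(matrix, file_m, machine_m):
    """Sum of squared differences between the matrix and its factorized form."""
    total = 0.0
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            pred = sum(a * b for a, b in zip(file_m[i], machine_m[j]))
            total += (value - pred) ** 2
    return total

# Rows are files, columns are machines; 1 means the file was seen on that machine.
presence = [[1, 1, 0], [1, 1, 0], [0, 0, 1], [0, 0, 1]]
file_m, machine_m = factorize(presence)
print(reconstruction_error(presence, file_m, machine_m))
```

Each row of `file_m` is a low-dimensional vector for one file, which is the kind of representation a comparison engine could use to measure closeness between files.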
11. A method for identifying unknown malware files from a plurality of files comprising:
- receiving a plurality of files from a plurality of machines each file and each machine having a plurality of features each of the plurality of files associated with at least one of the plurality of machines upon which the file resides;
- building a multidimensional matrix associating the plurality of files with the plurality of machines;
- identifying from the multidimensional matrix a group of features that are most informative in describing a file in the matrix wherein the group of features is a fixed number of features and comprises a subset of the plurality of features for the plurality of files and the plurality of machines;
- determining a malware score for at least one file in the plurality of files;
- determining if the at least one file is potential malware by comparing the malware score for the at least one file against a threshold malware score; and
- generating an alert when the at least one file is determined to be potential malware;
- wherein the preceding steps are performed by at least one processor.
Dependent claims: 12, 13, 14, 15, 16
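The step of identifying a fixed number of most informative features can be sketched as follows. Claim 11 does not say how informativeness is measured, so this example substitutes a simple stand-in, ranking features by variance across files, and the name `top_k_features` is hypothetical.

```python
from statistics import pvariance

def top_k_features(rows, k):
    """Keep the k feature columns with the highest variance.
    Variance ranking is an assumed stand-in for 'most informative'."""
    n_features = len(rows[0])
    variances = [pvariance([row[j] for row in rows]) for j in range(n_features)]
    ranked = sorted(range(n_features), key=lambda j: variances[j], reverse=True)
    keep = sorted(ranked[:k])  # restore original column order
    reduced = [[row[j] for j in keep] for row in rows]
    return keep, reduced

# Three files, three made-up features; the first feature never varies.
rows = [[1, 0, 5], [1, 1, 9], [1, 0, 1]]
keep, reduced = top_k_features(rows, k=2)
print(keep, reduced)
```

Because `k` is fixed in advance, every file ends up with a vector of the same length, which keeps later distance computations comparable across files.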
17. A computer readable storage device having computer readable instructions that when executed cause at least one computing device to:
- receive a plurality of files from a plurality of machines each file and each machine having a plurality of features;
- build a multidimensional matrix associating the plurality of files with the plurality of machines;
- identify from the multidimensional matrix a group of features that are most informative in describing a machine in the matrix wherein the group of features is a fixed number of features and comprises a subset of the plurality of features;
- determine a vector distance for files on the at least one machine with corresponding vectors for a plurality of known malware files in a malware database wherein the vector and corresponding vectors are based on the group of features;
- determine a malware score for at least one machine in the plurality of machines by adding the determined distance; and
- determine if the at least one machine is compromised by malware by comparing the malware score for the at least one malware against a threshold malware score.
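The machine-level score in claim 17 is obtained by adding distances, which the sketch below illustrates. Two details are assumptions because the claim leaves them open: each file contributes its distance to the nearest known malware vector, and a machine is flagged when the summed distance falls below the threshold (smaller distance meaning closer to known malware). The names `machine_score` and `is_compromised` are hypothetical.

```python
import math

def machine_score(file_vecs, malware_vecs):
    """Sum, over the machine's files, of the distance to the nearest known malware vector."""
    return sum(min(math.dist(f, m) for m in malware_vecs) for f in file_vecs)

def is_compromised(score, threshold):
    """Assumed direction: a low summed distance indicates proximity to known malware."""
    return score < threshold

# Made-up feature vectors for the files on two machines.
known_malware = [(1.0, 1.0)]
machine_a = [(1.0, 1.0), (1.0, 1.5)]
machine_b = [(0.0, 0.0), (4.0, 5.0)]
for name, files in (("A", machine_a), ("B", machine_b)):
    score = machine_score(files, known_malware)
    print(name, score, is_compromised(score, threshold=1.0))
```

A plain sum grows with the number of files on a machine, so a practical system would likely need some normalization; the claim itself specifies only the addition.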
Specification