Generation and use of trained file classifiers for malware detection
First Claim
1. A computing device comprising:
a memory configured to store instructions to generate a trained file classifier; and
a processor configured to execute the instructions from the memory to perform operations comprising:
accessing information identifying multiple files and identifying classification data for the multiple files, wherein the classification data indicates, for a particular file of the multiple files, whether the particular file includes malware;
generating a feature vector representing the particular file of the multiple files, the feature vector including:
zero-skip n-gram data indicating occurrences of adjacent characters in printable characters representing the particular file;
skip n-gram data indicating occurrences of non-adjacent characters in the printable characters representing the particular file; and
n-gram data indicating occurrences of groups of entropy indicators in a set of entropy indicators derived from file entropy data for the particular file, each entropy indicator of the set of entropy indicators having a value representing entropy of a corresponding chunk of the particular file;
generating the trained file classifier using the feature vector and the classification data as supervised training data; and
transmitting the trained file classifier to a remote computing device via a network, wherein the trained file classifier is executable by the remote computing device to restrict access to a file or to restrict execution of the file based on a classification result generated by execution of the trained file classifier.
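The three feature families named in the claim can be illustrated with a minimal Python sketch. It is not the patented implementation. The helper names (`printable_chars`, `skip_bigrams`, `entropy_indicators`), the choice of bigrams for the n-grams, the 256-byte chunk size, and the eight entropy bins mapped to the letters a through h are all assumptions made for illustration; the claim fixes none of them.

```python
import math
import string
from collections import Counter

# Assumed definition of "printable characters": printable ASCII minus control whitespace.
PRINTABLE = set(string.printable.encode("ascii")) - set(b"\t\n\r\x0b\x0c")

def printable_chars(data: bytes) -> str:
    """Keep only the printable ASCII bytes of a file, in order."""
    return "".join(chr(b) for b in data if b in PRINTABLE)

def skip_bigrams(text: str, skip: int) -> Counter:
    """Count character pairs with `skip` characters between them.

    skip=0 yields adjacent pairs (zero-skip bigrams);
    skip>=1 yields non-adjacent pairs (skip bigrams).
    """
    step = skip + 1
    return Counter(text[i] + text[i + step] for i in range(len(text) - step))

def chunk_entropy(chunk: bytes) -> float:
    """Shannon entropy of a non-empty chunk, in bits per byte (0.0 to 8.0)."""
    total = len(chunk)
    return -sum((c / total) * math.log2(c / total) for c in Counter(chunk).values())

def entropy_indicators(data: bytes, chunk_size: int = 256, bins: int = 8) -> str:
    """Map each chunk's entropy to a letter ('a' = lowest bin) so the
    resulting sequence can itself be turned into n-grams."""
    letters = []
    for start in range(0, len(data), chunk_size):
        h = chunk_entropy(data[start:start + chunk_size])
        bin_index = min(int(h / 8.0 * bins), bins - 1)
        letters.append(chr(ord("a") + bin_index))
    return "".join(letters)

# Usage on made-up bytes: one constant chunk followed by one uniform chunk.
sample = b"\x00" * 256 + bytes(range(256))
text = printable_chars(sample)
features = {
    "zero_skip": skip_bigrams(text, 0),
    "skip_1": skip_bigrams(text, 1),
    "entropy_bigrams": skip_bigrams(entropy_indicators(sample), 0),
}
```

Reusing `skip_bigrams` on the entropy-indicator string is a design shortcut: once each chunk is reduced to a symbol, the same pair-counting routine applies to both the printable characters and the entropy sequence.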
Abstract
A method includes accessing information identifying multiple files and identifying classification data for the multiple files, where the classification data indicates, for a particular file of the multiple files, whether the particular file includes malware. The method also includes generating n-gram vectors for the multiple files by, for each file, generating an n-gram vector indicating occurrences of character pairs in printable characters representing the file. The method further includes generating and storing a file classifier using the n-gram vectors and the classification data as supervised training data.
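As a rough illustration of the n-gram vector described in the abstract, the sketch below marks which character pairs occur among a file's printable characters. The 95-character alphabet (ASCII 0x20 to 0x7E), the fixed vector layout, and the function name `pair_occurrence_vector` are assumptions; the abstract does not specify an encoding.

```python
# Assumed alphabet: the 95 printable ASCII characters from space to tilde.
ALPHABET = [chr(c) for c in range(0x20, 0x7F)]
INDEX = {ch: i for i, ch in enumerate(ALPHABET)}

def pair_occurrence_vector(data: bytes) -> list:
    """Return a 95*95 vector with 1 at each position whose character pair
    occurs at least once among the file's printable characters."""
    # Non-printable bytes are dropped first, so pairs are formed over the
    # printable characters only (an interpretation, not a stated rule).
    text = [chr(b) for b in data if 0x20 <= b < 0x7F]
    vector = [0] * (len(ALPHABET) ** 2)
    for first, second in zip(text, text[1:]):
        vector[INDEX[first] * len(ALPHABET) + INDEX[second]] = 1
    return vector

# Usage on made-up bytes: the printable characters are "abab".
vec = pair_occurrence_vector(b"ab\x00ab")
```

A fixed-length vector like this gives every file the same feature layout, which is convenient when the vectors are fed to a learner as supervised training data.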
20 Claims
1. A computing device, as recited in full under First Claim above.
Dependent claims: 2, 3, 4, 5, 6
7. A method comprising:
accessing information identifying multiple files and identifying classification data for the multiple files, wherein the classification data indicates, for a particular file of the multiple files, whether the particular file includes malware;
generating a feature vector representing the particular file of the multiple files, the feature vector including:
zero-skip n-gram data indicating occurrences of adjacent characters in printable characters representing the particular file;
skip n-gram data indicating occurrences of non-adjacent characters in the printable characters representing the particular file; and
n-gram data indicating occurrences of groups of entropy indicators in a set of entropy indicators derived from file entropy data for the particular file, each entropy indicator of the set of entropy indicators having a value representing entropy of a corresponding chunk of the particular file;
generating a trained file classifier using the feature vector and the classification data as supervised training data; and
transmitting the trained file classifier to a remote computing device via a network, wherein the trained file classifier is executable by the remote computing device to restrict access to a file or to restrict execution of the file based on a classification result generated by execution of the trained file classifier.
Dependent claims: 8, 9, 10, 11, 12, 13, 14, 15
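The supervised-training step of the method can be sketched as follows. The claim does not name a model type, so a simple perceptron is swapped in here purely as a stand-in. The function names, the sparse dictionary feature format, and the toy samples and labels are all made up for illustration.

```python
from collections import defaultdict

def train_perceptron(samples, labels, epochs=10):
    """Train on sparse feature dicts (feature name -> value).

    labels: 1 = file includes malware, 0 = benign (the classification data).
    Returns (weights, bias).
    """
    weights = defaultdict(float)
    bias = 0.0
    for _ in range(epochs):
        for features, label in zip(samples, labels):
            score = bias + sum(weights[name] * value for name, value in features.items())
            predicted = 1 if score > 0 else 0
            error = label - predicted  # -1, 0, or +1
            if error:
                # Standard perceptron update: nudge weights toward the true label.
                for name, value in features.items():
                    weights[name] += error * value
                bias += error
    return dict(weights), bias

def classify(model, features):
    """Return 1 (malware) or 0 (benign) for one feature dict."""
    weights, bias = model
    score = bias + sum(weights.get(name, 0.0) * value for name, value in features.items())
    return 1 if score > 0 else 0

# Toy, made-up training data: n-gram counts per file plus a label per file.
samples = [{"ab": 1, "zz": 1}, {"ab": 1}, {"zz": 2}, {"cd": 1}]
labels = [1, 0, 1, 0]
model = train_perceptron(samples, labels)
```

Each feature dict plays the role of a feature vector and each label the role of the classification data; any other supervised learner could replace the perceptron without changing that pairing.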
16. A computer-readable storage device storing instructions that, when executed, cause a computer to perform operations comprising:
accessing information identifying multiple files and identifying classification data for the multiple files, wherein the classification data indicates, for a particular file of the multiple files, whether the particular file includes malware;
generating a feature vector representing the particular file of the multiple files, the feature vector including:
zero-skip n-gram data indicating occurrences of adjacent characters in printable characters representing the particular file;
skip n-gram data indicating occurrences of non-adjacent characters in the printable characters representing the particular file; and
n-gram data indicating occurrences of groups of entropy indicators in a set of entropy indicators derived from file entropy data for the particular file, each entropy indicator of the set of entropy indicators having a value representing entropy of a corresponding chunk of the particular file;
generating and storing a trained file classifier using the feature vector and the classification data as supervised training data; and
causing the trained file classifier to be transmitted to a remote computing device via a network, wherein the trained file classifier is executable by the remote computing device to restrict access to a file or to restrict execution of the file based on a classification result generated by execution of the trained file classifier.
Dependent claims: 17, 18, 19, 20
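The storing, transmitting, and remote-enforcement steps might look roughly like the sketch below. It assumes the classifier is a linear model sent as JSON and that the remote device blocks execution when the score is positive. The serialization format, the function names, and the decision rule are hypothetical, and the network transfer itself is omitted: the `payload` bytes stand in for whatever would be sent.

```python
import json

def serialize_model(weights: dict, bias: float) -> bytes:
    """Pack a linear classifier into bytes suitable for storage or transmission."""
    return json.dumps({"weights": weights, "bias": bias}).encode("utf-8")

def load_model(payload: bytes):
    """Remote side: rebuild (weights, bias) from the received bytes."""
    obj = json.loads(payload.decode("utf-8"))
    return obj["weights"], obj["bias"]

def allow_execution(model, features: dict) -> bool:
    """Remote side: return False (restrict execution) when the file's
    feature vector is classified as malware, True otherwise."""
    weights, bias = model
    score = bias + sum(weights.get(name, 0.0) * value for name, value in features.items())
    return score <= 0

# Usage with made-up weights: "zz" is treated as a malware-indicating n-gram.
payload = serialize_model({"zz": 1.0}, 0.0)   # producer side
remote_model = load_model(payload)            # consumer side, after transfer
```

Keeping the transmitted artifact to plain weights means the remote device only needs the scoring function, not the training code.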
Specification