Methods and apparatus for evaluating and extracting signatures of computer viruses and other undesirable software entities
First Claim
1. A method for operating a digital data processor to obtain one or more valid signatures of an undesirable software entity, the digital data processor including a memory that is bidirectionally coupled to the digital data processor, the method comprising the steps of:
- storing in the memory a corpus of computer programs that are representative of computer programs that are likely to be infected by an undesirable software entity;
inputting to the digital data processor at least one portion of the undesirable software entity, the at least one portion including a sequence of bytes of the undesirable software entity that are likely to remain substantially invariant from a first instance of the undesirable software entity to a second instance of the undesirable software entity;
storing the at least one inputted portion in the memory;
selecting at least one candidate signature of the undesirable software entity from the stored at least one portion of the undesirable software entity;
constructing with the digital data processor a list of unique n-grams from the sequence of bytes, each of the unique n-grams being comprised of from one to a chosen maximal number of sequential bytes (B) of the sequence of bytes, the constructed list of unique n-grams being stored in the memory;
for each of the unique n-grams of the stored list, estimating with the digital data processor a probability of an occurrence of the unique n-gram within sequences of bytes obtained from the stored corpus of computer programs;
for each candidate signature that is comprised of one or more of the unique n-grams, estimating with the digital data processor a false-positive probability of an occurrence of the candidate signature within the sequences of bytes obtained from the corpus of computer programs;
comparing the estimated false-positive probabilities of the candidate signatures with one another and with a set threshold probabilities, the threshold probabilities having values selected to reduce a likelihood of an occurrence of a false positive indication during the use of a signature; and
outputting at least one signature for subsequent use in identifying an occurrence of the undesirable software entity or a modified version of the undesirable software entity, the outputted at least one signature being determined to exhibit a false alarm probability that is comparable to or less than a lowest false alarm probability of others of the candidate signatures.
1 Assignment
0 Petitions
Accused Products
Abstract
A method, and apparatus for accomplishing the method, to extract and/or evaluate a signature of a computer virus or other undesirable software entity. The method includes a first step of inputting to a digital data processor at least one portion of a undesirable software entity, the at least one portion including a sequence of bytes of the undesirable software entity that is likely to remain substantially invariant from one instance of that entity to another instance, and it is from this portion or portions that candidate computer virus signatures are drawn. A second step constructs a list of unique n-grams from the sequence of bytes, each of the unique n-grams being comprised of from one to a specified maximum number of sequential bytes of the sequence of bytes. A third step estimates, for each of the unique n-grams, a probability of an occurrence of a unique n-gram within sequences of bytes obtained from a corpus of computer programs that are typically executed upon the digital data processor. For each candidate signature that is comprised of one or more of the unique n-grams, a fourth step estimates a probability of an occurrence of the candidate virus signature within the sequences of bytes obtained from the corpus. A fifth step accepts the candidate signature as a valid signature if the estimated probability of the occurrence of the candidate virus signature is less than a threshold probability. The threshold probabilities have values selected to reduce the possibility of an occurrence of a false positive indication during the subsequent use of the valid virus signature by a virus scanner.
428 Citations
34 Claims
-
1. A method for operating a digital data processor to obtain one or more valid signatures of an undesirable software entity, the digital data processor including a memory that is bidirectionally coupled to the digital data processor, the method comprising the steps of:
-
storing in the memory a corpus of computer programs that are representative of computer programs that are likely to be infected by an undesirable software entity; inputting to the digital data processor at least one portion of the undesirable software entity, the at least one portion including a sequence of bytes of the undesirable software entity that are likely to remain substantially invariant from a first instance of the undesirable software entity to a second instance of the undesirable software entity; storing the at least one inputted portion in the memory; selecting at least one candidate signature of the undesirable software entity from the stored at least one portion of the undesirable software entity; constructing with the digital data processor a list of unique n-grams from the sequence of bytes, each of the unique n-grams being comprised of from one to a chosen maximal number of sequential bytes (B) of the sequence of bytes, the constructed list of unique n-grams being stored in the memory; for each of the unique n-grams of the stored list, estimating with the digital data processor a probability of an occurrence of the unique n-gram within sequences of bytes obtained from the stored corpus of computer programs; for each candidate signature that is comprised of one or more of the unique n-grams, estimating with the digital data processor a false-positive probability of an occurrence of the candidate signature within the sequences of bytes obtained from the corpus of computer programs; comparing the estimated false-positive probabilities of the candidate signatures with one another and with a set threshold probabilities, the threshold probabilities having values selected to reduce a likelihood of an occurrence of a false positive indication during the use of a signature; and outputting at least one signature for subsequent use in identifying an occurrence of the undesirable software entity or a modified version of the undesirable software entity, the outputted at least one signature being determined to exhibit a false alarm probability that is comparable to or less than a lowest false alarm probability of others of the candidate signatures. - View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19)
-
-
20. Apparatus for obtaining a valid signature of an undesirable software entity, comprising:
-
a memory for storing a corpus of computer programs that are representative of computer programs that are likely to be infected by an undesirable software entity; means for inputting and for storing in the memory at least one portion of an undesirable software entity, the at least one portion including a sequence of bytes of the undesirable software entity that is likely to remain substantially invariant from a first instance of the undesirable software entity to a second instance of the undesirable software entity; a first circuit for selecting at least one candidate signature of the undesirable software entity from the at least one portion of the undesirable software entity that is likely to remain substantially invariant, said first circuit including means for accessing the stored at least one portion of the undesirable software entity to identify the sequence of bytes of the undesirable software entity that are likely to remain substantially invariant; a second circuit for constructing a list of unique n-grams, each of the unique n-grams being comprised of from one to n sequential bytes of the sequence of bytes; a third circuit for estimating, for each of the unique n-grams, a probability of an occurrence of a unique n-gram within sequences of bytes obtained from said stored corpus of computer programs; a fourth circuit for estimating, for each candidate signature that is comprised of one or more of the unique n-grams, a probability of an occurrence of the candidate signature within the sequences of bytes obtained from the corpus of computer programs; a fifth circuit for comparing candidate signatures with one another and with a set threshold probabilities, the threshold probabilities having a value selected to reduce a likelihood of an occurrence of a false positive indication during the use of a signature; and circuit means for outputting at least one signature for subsequent use in identifying an occurrence of the undesirable software entity or a modified version of the undesirable software entity, the outputted at least one signature being determined to exhibit a false alarm probability that is comparable to or less than a lowest false alarm probability of others of the candidate signatures. - View Dependent Claims (21, 22, 23, 24, 25)
-
-
26. A method for operating a digital data processor to obtain one or more valid signatures of an undesirable software entity, the digital data processor including a memory that is bidirectionally coupled to the digital data processor, the method comprising the steps of:
-
storing in the memory a corpus of computer programs that are representative of computer programs that are likely to be infected by an undesirable software entity; inputting to the digital data processor at least one portion of the undesirable software entity, the at least one portion including a sequence of bytes of the undesirable software entity that are likely to remain substantially invariant from a first instance of the undesirable software entity to a second instance of the undesirable software entity; storing the at least one inputted portion in the memory; selecting at least one candidate signature of the undesirable software entity from the stored at least one portion of the undesirable software entity; constructing with the digital data processor a list of unique n-grams from the sequence of bytes, the constructed list being stored in the memory; for each of the unique n-grams of the stored list, estimating with the digital data processor a probability of an occurrence of the unique n-gram within sequences of bytes obtained from the stored corpus of computer programs; for each candidate signature that is comprised of one or more of the unique n-grams, estimating with the digital data processor a false-positive probability of an occurrence of the candidate signature within the sequences of bytes obtained from the corpus of computer programs; and comparing the estimated false-positive probabilities of the candidate signatures with one another and with a set of threshold probabilities, the threshold probabilities having values selected to reduce a likelihood of an occurrence of a false positive indication during the use of any signature with a false-positive probability less than the threshold, the step of comparing including the steps of, discarding any candidate signatures for which an occurrence of a predetermined number of selected bytes is more common than a predetermined threshold; evaluating an exact-match probability for all remaining candidate signatures; discarding any candidate signatures having an exact-match false positive probability that is above an exact-match threshold; retaining n candidate signatures having the lowest estimated probabilities; for each remaining candidate signature i, evaluating a fragment false positive probability fragi ; for each remaining candidate signature i, also evaluating an m-mismatch false positive probability, starting with m=1, and incrementing m until the false positive probability exceeds an m-mismatch threshold;
space="preserve" listing-type="equation">setting M.sub.i =m-1;for all candidate signatures that correspond to a particular undesirable software entity, choosing as a best signature the signature having a largest value of M; and recording a chosen best signature for each undesirable software entity in a signature database for subsequent use in identifying an occurrence of the undesirable software entity or a modified version of the undesirable software entity. - View Dependent Claims (27, 28, 29, 30)
-
-
31. A method for operating a digital data processor to obtain one or more valid signatures of an entity of interest, the digital data processor including a memory that is bidirectionally coupled to the digital data processor, the method comprising the steps of:
-
storing in the memory a corpus of digital information that is representative of digital information with which the entity is likely to be found to be associated with; inputting to the digital data processor at least one portion of a digitally encoded representation of an entity of interest, the at least one portion being likely to remain substantially invariant from a first instance of the entity of interest to a second instance of the entity of interest; storing the at least one inputted portion in the memory; selecting at least one candidate signature of the entity of interest from the at least one portion of the representation of the entity of interest, each candidate signature including a subset of the representation of the entity of interest; constructing with the digital data processor a list of signature components which can be combined to form a candidate signature, each signature component of the list being distinguishable from other signature components of the list and being comprised of at least one byte of data; storing the list of signature components in the memory; for each of the stored signature components, estimating with the digital data processor a probability of an occurrence of the signature component within the stored corpus of digital information; for each candidate signature that includes one or more of the signature components, estimating with the digital data processor a false-positive probability of an exact occurrence or an approximate occurrence of the candidate signature within a set of digital information that is statistically similar to the corpus of digital information; comparing the estimated false-positive probabilities of the candidate signatures with one another and with a set of threshold probabilities, the threshold probabilities having values selected to reduce a likelihood of an occurrence of a false positive indication during the use of any signature having a false-positive probability that is less than the threshold; and outputting at least one signature for subsequent use in identifying an occurrence of the entity of interest or an occurrence of a modified version of the entity of interest, the outputted at least one signature being determined to exhibit a false alarm probability that is comparable to or less than a lowest false alarm probability of others of the candidate signatures. - View Dependent Claims (32)
-
-
33. A method for operating a digital data processor to validate a signature of an undesirable software entity, the digital data processor including a memory that is bidirectionally coupled to the digital data processor, the method comprising the steps of:
-
storing in the memory a corpus of computer programs that are representative of computer programs that are likely to be infected by an undesirable software entity; inputting to the digital data processor at least one candidate signature of the undesirable software entity, the at least one candidate signature including a sequence of bytes of the undesirable software entity that are likely to remain substantially invariant from a first instance of the undesirable software entity to a second instance of the undesirable software entity; storing the at least one inputted portion in the memory; constructing with the digital data processor a list of unique n-grams from the sequence of bytes, each of the unique n-grams being comprised of from one to a chosen maximal number of sequential bytes (B) of the sequence of bytes, the constructed list of unique n-grams being stored in the memory; for each of the unique n-grams of the stored list, estimating with the digital data processor a probability of an occurrence of a unique n-gram within sequences of bytes obtained from the stored corpus of computer programs; for each candidate signature that is comprised of one or more of the unique n-grams, estimating with the digital data processor a probability of an occurrence of the candidate signature within the sequences of bytes obtained from the corpus of computer programs; comparing the candidate signature with at least one threshold probability having a value selected to reduce a likelihood of an occurrence of a false positive indication during the use of the signature; accepting or rejecting the candidate signature as a result of the execution of the step of comparing; and outputting at least one accepted signature for subsequent use in identifying an occurrence of the undesirable software entity or a modified version of the undesirable software entity, the outputted at least one accepted signature being determined to exhibit a false alarm probability that is comparable to or less than a lowest false alarm probability of others of the candidate signatures.
-
-
34. A method for operating a digital data processor to determine threshold probabilities against which candidate signatures of undesirable software entities are compared, the digital data processor including a memory that is bidirectionally coupled to the digital data processor, the method comprising the steps of:
-
providing a corpus of computer programs that are stored in the memory; segregating the stored corpus of computer programs into a probe set, a training set and a test set; selecting byte-strings from the probe set as candidate signatures; evaluating the selected byte-strings against byte-strings found in the training set to determine exact-match and fuzzy-match probabilities; determining a frequency of occurrence of the candidate signatures within the test set, including fuzzy-matches of the signatures within the test set; for each candidate signature, producing lists for indicating an estimated probability of an occurrence of the candidate signature within the training set and a determined frequency of occurrence of exact-matches and fuzzy-matches within the test set; determining a set of threshold probabilities in accordance with a criteria that provides an acceptable false-positive probability while not excluding a significant number of the candidate signatures; and outputting at least one signature for subsequent use in identifying an occurrence of the undesirable software entity or a modified version of the undesirable software entity, the outputted at least one signature being determined to exhibit a false alarm probability that is comparable to or less than a lowest false alarm probability of others of the candidate signatures.
-
Specification