Systems, methods, and media for outputting a dataset based upon anomaly detection

US 8,381,299 B2
Filed: 02/28/2007
Issued: 02/19/2013
Est. Priority Date: 02/28/2006
Status: Active Grant

Abstract

Systems, methods, and media for outputting a dataset based upon anomaly detection are provided. In some embodiments, methods for outputting a dataset based upon anomaly detection: receive a training dataset having a plurality of n-grams, which plurality includes a first plurality of distinct training n-grams each being a first size; compute a first plurality of appearance frequencies, each for a corresponding one of the first plurality of distinct training n-grams; receive an input dataset including first input n-grams each being the first size; define a first window in the input dataset; identify as being first matching n-grams, the first input n-grams in the first window that correspond to the first plurality of distinct training n-grams; compute a first anomaly detection score for the input dataset using the first matching n-grams and the first plurality of appearance frequencies; and output the input dataset based on the first anomaly detection score.

283 Citations

93 Claims

1. A method for outputting a dataset based upon anomaly detection, the method comprising:
- receiving a training dataset having a plurality of n-grams that includes a first plurality of distinct training n-grams, wherein each of the first plurality of distinct training n-grams is a first size;
  
  computing a first plurality of appearance frequencies, wherein each of the first plurality of appearance frequencies corresponds to one of the first plurality of distinct training n-grams;
  
  receiving an input dataset including first input n-grams, wherein each of the first input n-grams is the first size;
  
  defining a first window in the input dataset;
  
  identifying first matching n-grams by determining whether the first input n-grams in the first window correspond to one of the first plurality of distinct training n-grams;
  
  computing a first anomaly detection score for the input dataset using the first matching n-grams and the first plurality of appearance frequencies, wherein the first anomaly detection score is indicative of the presence of anomalous n-grams in the input dataset; and
  
  outputting the input dataset based on the first anomaly detection score.
- - 2. The method of claim 1, further comprising:
    - defining a second window in the input dataset;
      
      identifying second matching n-grams by determining whether the first input n-grams in the second window correspond to one of the first plurality of distinct training n-grams; and
      
      computing a second anomaly detection score for the input dataset using the second matching n-grams and first plurality of appearance frequencies.
  - 3. The method of claim 2, further comprising determining which of the first anomaly detection score and the second anomaly detection score is higher.
  - 4. The method of claim 2, wherein the plurality of n-grams in the training dataset also includes a second plurality of distinct training n-grams that are each a second size, and the input dataset also includes second input n-grams that are each the second size, and wherein the method further comprises:
    - computing a second plurality of appearance frequencies, wherein each of the second plurality of appearance frequencies corresponds to one of the second plurality of distinct training n-grams;
      
      identifying third matching n-grams by determining whether the second input n-grams in the first window correspond to one of the second plurality of distinct training n-grams;
      
      computing a third anomaly detection score for the input dataset using the third matching n-grams and the second plurality of appearance frequencies; and
      
      determining based upon the third anomaly detection score whether the input dataset contains an anomaly.
  - 5. The method of claim 4, wherein the first size and the second size are randomly or pseudo-randomly chosen.
  - 6. The method of claim 1, wherein the first plurality of distinct training n-grams comprises grouped n-grams and the first matching n-grams comprise grouped n-grams.
  - 7. The method of claim 1, further comprising excluding from the plurality of n-grams in the training dataset a portion of the training dataset that includes malicious code.
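
The windowed scoring recited in claims 1 to 3 can be illustrated with a minimal sketch. The claims do not fix a scoring formula, so the one used here (the mean of `1 - f(g) / f_max` over the n-grams in a window, with unmatched n-grams contributing 1) is an assumption, as are the function names `train_frequencies`, `window_score`, and `max_window_score`.

```python
from collections import Counter


def ngrams(data: bytes, n: int) -> list[bytes]:
    """Return every overlapping n-gram of size n in data."""
    return [data[i:i + n] for i in range(len(data) - n + 1)]


def train_frequencies(training: bytes, n: int) -> dict[bytes, float]:
    """Appearance frequency of each distinct training n-gram."""
    counts = Counter(ngrams(training, n))
    total = sum(counts.values())
    return {g: c / total for g, c in counts.items()}


def window_score(window: bytes, freqs: dict[bytes, float], n: int) -> float:
    """Assumed score in [0, 1]; higher means more anomalous.

    Matching n-grams contribute less the more frequent they were in
    training; n-grams absent from training contribute 1.
    """
    grams = ngrams(window, n)
    if not grams:
        return 0.0
    f_max = max(freqs.values(), default=1.0)
    return sum(1.0 - freqs.get(g, 0.0) / f_max for g in grams) / len(grams)


def max_window_score(data: bytes, freqs: dict[bytes, float], n: int,
                     window_size: int, step: int) -> float:
    """Score several windows over the input and keep the highest score."""
    last_start = max(len(data) - window_size, 0)
    return max(
        window_score(data[s:s + window_size], freqs, n)
        for s in range(0, last_start + 1, step)
    )
```

A caller might compare the result of `max_window_score` against a threshold of its own choosing to decide whether to output or withhold the input dataset; the claims leave that policy open.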

8. A method for outputting a dataset based upon anomaly detection, the method comprising:
- receiving a training dataset having a plurality of n-grams that includes a first plurality of distinct training n-grams, wherein each of the first plurality of distinct training n-grams is a first size;
  
  selecting the first plurality of distinct training n-grams on a random basis, pseudo-random basis, or secret basis;
  
  receiving an input dataset including first input n-grams, wherein each of the first input n-grams is the first size;
  
  determining a first matching count of the first input n-grams that correspond to one of the first plurality of distinct training n-grams;
  
  determining a first total count of the first input n-grams;
  
  determining a first anomaly detection score using the first matching count and the first total count, wherein the first anomaly detection score is indicative of the presence of anomalous n-grams in the input dataset; and
  
  outputting the input dataset based on the first anomaly detection score.
- - 9. The method of claim 8, further comprising storing the first plurality of distinct training n-grams in a Bloom filter.
  - 10. The method of claim 9, wherein the Bloom filter uses at least two hash functions.
  - 11. The method of claim 8, wherein the plurality of n-grams in the training dataset also includes a second plurality of distinct training n-grams that are each a second size, and the input dataset also includes second input n-grams that are each the second size, and wherein the method further comprises:
    - determining a second matching count of the second input n-grams that correspond to one of the second plurality of distinct training n-grams;
      
      determining a second total count of the second input n-grams;
      
      determining a second anomaly detection score using the second matching count and the second total count; and
      
      determining based upon the second anomaly detection score whether the input dataset contains an anomaly.
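
As a rough illustration of claims 8 to 10, the sketch below stores distinct training n-grams in a small Bloom filter with two hash functions and scores an input by its matching count and total count. The filter size, the use of seeded SHA-256 as the hash family, the score `1 - matching / total`, and all names are assumptions; the random or secret selection of training n-grams recited in claim 8 is not modeled.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter; size and hash construction are illustrative."""

    def __init__(self, size_bits: int = 1 << 16, num_hashes: int = 2):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, item: bytes):
        # Derive each hash function by prefixing a distinct seed byte.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(bytes([seed]) + item).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item: bytes) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: bytes) -> bool:
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


def train(training: bytes, n: int) -> BloomFilter:
    """Insert every n-gram of the training data into a new filter."""
    model = BloomFilter()
    for i in range(len(training) - n + 1):
        model.add(training[i:i + n])
    return model


def anomaly_score(data: bytes, model: BloomFilter, n: int) -> float:
    """Assumed score: fraction of input n-grams not found in the model."""
    total = len(data) - n + 1          # total count of input n-grams
    if total <= 0:
        return 0.0
    matching = sum(1 for i in range(total) if data[i:i + n] in model)
    return 1.0 - matching / total
```

A Bloom filter never reports a stored n-gram as missing, so input that consists only of training n-grams scores exactly 0 here; false positives can only lower the score of anomalous input, and become rarer as the filter grows.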

12. A method for outputting a dataset based upon anomaly detection, the method comprising:
- receiving a first training dataset having a plurality of n-grams that includes a first plurality of distinct training n-grams, wherein each of the first plurality of distinct training n-grams is a first size;
  
  receiving a second training dataset having a plurality of n-grams that includes a second plurality of distinct training n-grams, wherein each of the second plurality of distinct training n-grams is the first size;
  
  computing a first plurality of appearance frequencies, wherein each of the first plurality of appearance frequencies corresponds to one of the first plurality of distinct training n-grams;
  
  computing a first plurality of uniformities of distribution, wherein each of the first plurality of uniformities of distribution corresponds to one of the first plurality of distinct training n-grams;
  
  computing a second plurality of uniformities of distribution, wherein each of the second plurality of uniformities of distribution corresponds to one of the second plurality of distinct training n-grams;
  
  determining a first plurality of most-heavily weighted n-grams from the first plurality of distinct training n-grams using at least one of:
  
  the first plurality of appearance frequencies;
  
  the first plurality of uniformities of distribution; and
  
  the second plurality of uniformities of distribution;
  
  selecting a subset of the first plurality of most-heavily weighted n-grams, wherein the subset includes m n-grams and at least one of the n-grams in the subset is outside of the top m of the first plurality of most-heavily weighted n-grams;
  
  receiving an input dataset including first input n-grams, wherein each of the plurality of first input n-grams is the first size;
  
  obtaining a subset of a second plurality of most-heavily weighted n-grams from the first input n-grams that correspond to the subset of the first plurality of distinct training n-grams;
  
  classifying the input dataset as containing an anomaly using the subset of the first plurality of most-heavily weighted n-grams and the subset of the second plurality of most-heavily weighted n-grams; and
  
  outputting a dataset based upon the classifying of the input dataset.
- - 13. The method of claim 12, wherein the plurality of n-grams in the first training dataset also includes a third plurality of distinct training n-grams that are each a second size, the plurality of n-grams in the second training dataset also includes a fourth plurality of distinct training n-grams that are each the second size, and the input dataset also includes second input n-grams that are each the second size, and wherein the method further comprises:
    - computing a second plurality of appearance frequencies, wherein each of the second plurality of appearance frequencies corresponds to one of the third plurality of distinct training n-grams;
      
      computing a third plurality of uniformities of distribution, wherein each of the third plurality of uniformities of distribution corresponds to one of the third plurality of distinct training n-grams;
      
      computing a fourth plurality of uniformities of distribution, wherein each of the fourth plurality of uniformities of distribution corresponds to one of the fourth plurality of distinct training n-grams;
      
      determining a third plurality of most-heavily weighted n-grams from the third plurality of distinct training n-grams using at least one of:
      
      the second plurality of appearance frequencies;
      
      the third plurality of uniformities of distribution; and
      
      the fourth plurality of uniformities of distribution;
      
      determining a fourth plurality of most-heavily weighted n-grams using the first plurality of most-heavily weighted n-grams and the third plurality of most-heavily weighted n-grams; and
      
      classifying the input dataset as containing an anomaly using the fourth plurality of most-heavily weighted n-grams.
  - 14. The method of claim 12, further comprising:
    - receiving a third training dataset having a plurality of n-grams that includes a third plurality of distinct training n-grams, wherein each of the third plurality of distinct training n-grams is the first size and contains malicious code;
      
      computing a third plurality of uniformities of distribution, wherein each of the third plurality of uniformities of distribution corresponds to one of the third plurality of distinct training n-grams; and
      
      determining a third plurality of most-heavily weighted n-grams from the third plurality of distinct training n-grams using at least one of:
      
      the first plurality of appearance frequencies;
      
      the first plurality of uniformities of distribution; and
      
      the third plurality of uniformities of distribution; and
      
      classifying the input dataset as containing an anomaly using a subset of the third plurality of most-heavily weighted n-grams and the subset of the second plurality of most-heavily weighted n-grams.
  - 15. The method of claim 12, wherein obtaining the subset of the second plurality of most-heavily weighted n-grams comprises:
    - identifying a plurality of matching n-grams from the first input n-grams that correspond to the first plurality of distinct training n-grams;
      
      computing a second plurality of appearance frequencies, wherein each of the second plurality of appearance frequencies corresponds to one of the plurality of matching n-grams;
      
      computing a third plurality of uniformities of distribution, wherein each of the third plurality of uniformities of distribution corresponds to one of the plurality of matching n-grams; and
      
      determining the second plurality of most-heavily weighted n-grams from the plurality of matching n-grams using at least one of:
      
      the second plurality of appearance frequencies; and
      
      the third plurality of uniformities of distribution; and
      
      selecting the subset of the second plurality of most-heavily weighted n-grams corresponding to the subset of the first plurality of most-heavily weighted n-grams.
  - 16. The method of claim 12, wherein the first training dataset comprises a dataset free of known instances of malicious code.
  - 17. The method of claim 12, wherein the second training dataset comprises a dataset containing at least one known instance of malicious code.
  - 18. The method of claim 12, wherein determining the first plurality of most-heavily weighted n-grams comprises:
    - computing a weight for each of the first plurality of distinct training n-grams; and
      
      selecting as the first plurality of most-heavily weighted n-grams certain of the plurality of distinct training n-grams having the highest weights.
  - 19. The method of claim 12, wherein the first size is randomly or pseudo-randomly determined.
  - 20. The method of claim 19, wherein the first size is kept secret.
  - 21. The method of claim 12, wherein the number of n-grams in the first plurality of most-heavily weighted n-grams is predetermined.
  - 22. The method of claim 21, wherein the number of n-grams in the first plurality of most-heavily weighted n-grams can be adjusted.
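
The weighting and subset selection of claims 12 and 18 might be sketched as follows. The claims do not define "uniformity of distribution" or the weight formula, so this example assumes uniformity is the normalized entropy of an n-gram's occurrences across the files of a training set, and assumes a weight of `frequency * uniformity_in_first_set * (1 - uniformity_in_second_set)`. Every function name is hypothetical.

```python
import math
import random
from collections import Counter


def ngram_counts(data: bytes, n: int) -> Counter:
    """Count each n-gram of size n in one file."""
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))


def uniformity(gram: bytes, per_file: list[Counter]) -> float:
    """Assumed measure: normalized entropy of gram's counts across files."""
    counts = [c[gram] for c in per_file]
    total = sum(counts)
    if total == 0 or len(per_file) < 2:
        return 0.0
    entropy = -sum((c / total) * math.log(c / total) for c in counts if c)
    return entropy / math.log(len(per_file))


def heaviest_ngrams(first_set: list[bytes], second_set: list[bytes],
                    n: int, k: int) -> list[bytes]:
    """Rank the first set's distinct n-grams by weight and keep the top k."""
    a_counts = [ngram_counts(f, n) for f in first_set]
    b_counts = [ngram_counts(f, n) for f in second_set]
    merged = sum(a_counts, Counter())
    total = sum(merged.values())
    weights = {
        g: (c / total) * uniformity(g, a_counts) * (1.0 - uniformity(g, b_counts))
        for g, c in merged.items()
    }
    return sorted(weights, key=weights.get, reverse=True)[:k]


def select_subset(ranked: list[bytes], m: int, rng: random.Random) -> list[bytes]:
    """Pick m n-grams with at least one from outside the top m."""
    if len(ranked) <= m:
        raise ValueError("need more than m ranked n-grams")
    inside = rng.sample(ranked[:m], m - 1)
    outside = rng.choice(ranked[m:])
    return inside + [outside]
```

Forcing one member from outside the top m means an attacker who can guess the top-ranked n-grams still cannot predict the exact subset used for classification, which appears to be the purpose of that limitation.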

23. A method for outputting a dataset based upon anomaly detection, the method comprising:
- receiving a training dataset having a plurality of n-grams that includes a first plurality of distinct training n-grams, wherein each of the first plurality of distinct training n-grams is a first size;
  
  computing a first plurality of appearance frequencies, wherein each of the first plurality of appearance frequencies corresponds to one of the first plurality of distinct training n-grams;
  
  obtaining a first pseudo count associated with the first plurality of appearance frequencies;
  
  computing a first total count of the number of n-grams of the plurality of n-grams in the training dataset that are the first size;
  
  computing a first maximum possible count of distinct n-grams of the first size in the plurality of n-grams;
  
  computing a second total count of the first plurality of distinct training n-grams;
  
  computing a first smoothing factor;
  
  computing a first probability that the first plurality of distinct training n-grams are found in the training dataset using at least one of:
  
  the first plurality of appearance frequencies, the first pseudo count, the first total count, the second total count, and the first smoothing factor;
  
  computing a first consistency score of the plurality of n-grams in the training dataset that are the first size using the first maximum possible count and the first probability;
  
  receiving an input dataset including first input n-grams, wherein each of the first input n-grams is the first size;
  
  obtaining a second consistency score of the first input n-grams;
  
  classifying the input dataset using the first consistency score and the second consistency score; and
  
  outputting a dataset based upon the classifying of the input dataset.
- - 24. The method of claim 23, wherein the plurality of n-grams in the training dataset also includes a second plurality of distinct training n-grams that are each a second size, and the method further comprises:
    - computing a second plurality of appearance frequencies, wherein each of the second plurality of appearance frequencies corresponds to one of the second plurality of distinct training n-grams;
      
      obtaining a second pseudo count associated with the second plurality of appearance frequencies;
      
      computing a third total count of the number of n-grams of the plurality of n-grams in the training dataset that are the second size;
      
      computing a second maximum possible count of distinct n-grams of the second size in the plurality of n-grams;
      
      computing a fourth total count of the second plurality of distinct training n-grams;
      
      computing a second smoothing factor;
      
      computing a second probability that the second plurality of distinct training n-grams are found in the training dataset using at least one of:
      
      the second plurality of appearance frequencies, the second pseudo count, the third total count, the fourth total count, and the second smoothing factor;
      
      computing a second consistency score of the plurality of n-grams in the training dataset that are the second size using the second maximum possible count and the second probability; and
      
      classifying the input dataset using the second consistency score.
  - 25. The method of claim 24, wherein the first size is greater than the second size.
  - 26. The method of claim 24, further comprising computing a third probability that the second plurality of distinct training n-grams are found in the training dataset given a presence of the first plurality of distinct training n-grams.
  - 27. The method of claim 23, wherein obtaining the second consistency score of the first input n-grams comprises:
    - identifying a plurality of matching n-grams from the first input n-grams that correspond to the first plurality of distinct training n-grams;
      
      computing a second plurality of appearance frequencies, wherein each of the second plurality of appearance frequencies corresponds to one of the plurality of matching n-grams;
      
      obtaining a second pseudo count associated with the second plurality of appearance frequencies;
      
      computing a third total count of the first input n-grams;
      
      computing a third maximum possible count of distinct n-grams of the first input n-grams;
      
      computing a fourth total count of distinct n-grams of the first input n-grams;
      
      computing a second smoothing factor;
      
      computing a second probability that the distinct n-grams of the first input n-grams are found in the input dataset using at least one of:
      
      the second plurality of appearance frequencies, the second pseudo count, the third total count, the fourth total count, and the second smoothing factor; and
      
      computing a third consistency score of the first input n-grams using the third maximum possible count and the second probability.
  - 28. The method of claim 27, further comprising classifying the input dataset as containing an anomaly if the second consistency score is below a threshold value.
  - 29. The method of claim 23, wherein the training dataset comprises a first count of the top-most-frequently occurring n-grams extracted from network data traffic.
  - 30. The method of claim 29, further comprising adjusting the first maximum possible count.
  - 31. The method of claim 30, wherein adjusting the first maximum possible count comprises:
    - determining a second count of discarded n-grams not chosen as part of the top-most-frequently occurring n-grams; and
      
      adding the first count of the top-most-frequently occurring n-grams and the second count of the top-most-frequently occurring n-grams to provide the first maximum possible count.
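
The quantities named in claims 23 and 27 (appearance frequencies, pseudo count, total count, maximum possible count, distinct count, smoothing factor, probability, consistency score) can be wired together as in the sketch below. The claims do not give the formulas, so everything here is an assumption: additive smoothing with pseudo count `alpha`, a smoothing factor `c = N / (N + L - k)` that reserves probability mass for unseen n-grams, and a consistency score equal to one plus the mean log-probability divided by `log(L)`.

```python
import math
from collections import Counter


def smoothed_probabilities(counts: Counter, max_distinct: int,
                           alpha: float = 0.5) -> dict[bytes, float]:
    """Assumed additive smoothing over the observed n-grams."""
    total = sum(counts.values())       # total count N
    distinct = len(counts)             # count k of distinct n-grams
    # Assumed smoothing factor: share of mass kept for observed n-grams.
    c = total / (total + max_distinct - distinct)
    denom = distinct * alpha + total
    return {g: c * (alpha + n_i) / denom for g, n_i in counts.items()}


def consistency_score(data: bytes, n: int, alpha: float = 0.5) -> float:
    """Assumed score: near 0 for uniform-looking data, higher when repetitive."""
    counts = Counter(data[i:i + n] for i in range(len(data) - n + 1))
    total = sum(counts.values())
    if total == 0:
        return 0.0
    max_distinct = 256 ** n            # maximum possible count L for bytes
    probs = smoothed_probabilities(counts, max_distinct, alpha)
    mean_log_p = sum(n_i * math.log(probs[g]) for g, n_i in counts.items()) / total
    return 1.0 + mean_log_p / math.log(max_distinct)


def is_anomalous(training_score: float, input_score: float,
                 margin: float = 0.1) -> bool:
    """Assumed rule: flag input whose score falls well below training."""
    return input_score < training_score - margin
```

Under these assumptions, data in which every n-gram appears once is assigned the uniform probability `1 / L` and scores approximately 0, while highly repetitive data scores noticeably higher. The `margin` parameter stands in for the threshold mentioned in claim 28.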

32. A non-transitory computer-readable medium containing computer-executable instructions that, when executed by a processor, cause the processor to perform a method for outputting a dataset based upon anomaly detection, the method comprising:
- receiving a training dataset having a plurality of n-grams that includes a first plurality of distinct training n-grams, wherein each of the first plurality of distinct training n-grams is a first size;
  
  computing a first plurality of appearance frequencies, wherein each of the first plurality of appearance frequencies corresponds to one of the first plurality of distinct training n-grams;
  
  receiving an input dataset including first input n-grams, wherein each of the first input n-grams is the first size;
  
  defining a first window in the input dataset;
  
  identifying first matching n-grams by determining whether the first input n-grams in the first window correspond to one of the first plurality of distinct training n-grams;
  
  computing a first anomaly detection score for the input dataset using the first matching n-grams and the first plurality of appearance frequencies, wherein the first anomaly detection score is indicative of the presence of anomalous n-grams in the input dataset; and
  
  outputting the input dataset based on the first anomaly detection score.
- - 33. The medium of claim 32, the method further comprising:
    - defining a second window in the input dataset;
      
      identifying second matching n-grams by determining whether the first input n-grams in the second window correspond to one of the first plurality of distinct training n-grams; and
      
      computing a second anomaly detection score for the input dataset using the second matching n-grams and first plurality of appearance frequencies.
  - 34. The medium of claim 33, the method further comprising determining which of the first anomaly detection score and the second anomaly detection score is higher.
  - 35. The medium of claim 33, wherein the plurality of n-grams in the training dataset also includes a second plurality of distinct training n-grams that are each a second size, and the input dataset also includes second input n-grams that are each the second size, and wherein the method further comprises:
    - computing a second plurality of appearance frequencies, wherein each of the second plurality of appearance frequencies corresponds to one of the second plurality of distinct training n-grams;
      
      identifying third matching n-grams by determining whether the second input n-grams in the first window correspond to one of the second plurality of distinct training n-grams;
      
      computing a third anomaly detection score for the input dataset using the third matching n-grams and the second plurality of appearance frequencies; and
      
      determining based upon the third anomaly detection score whether the input dataset contains an anomaly.
  - 36. The medium of claim 35, wherein the first size and the second size are randomly or pseudo-randomly chosen.
  - 37. The medium of claim 32, wherein the first plurality of distinct training n-grams comprises grouped n-grams and the first matching n-grams comprise grouped n-grams.
  - 38. The medium of claim 32, the method further comprising excluding from the plurality of n-grams in the training dataset a portion of the training dataset that includes malicious code.

39. A non-transitory computer-readable medium containing computer-executable instructions that, when executed by a processor, cause the processor to perform a method for outputting a dataset based upon anomaly detection, the method comprising:
- receiving a training dataset having a plurality of n-grams that includes a first plurality of distinct training n-grams, wherein each of the first plurality of distinct training n-grams is a first size;
  
  selecting the first plurality of distinct training n-grams on a random basis, pseudo-random basis, or secret basis;
  
  receiving an input dataset including first input n-grams, wherein each of the first input n-grams is the first size;
  
  determining a first matching count of the first input n-grams that correspond to one of the first plurality of distinct training n-grams;
  
  determining a first total count of the first input n-grams;
  
  determining a first anomaly detection score using the first matching count and the first total count, wherein the first anomaly detection score is indicative of the presence of anomalous n-grams in the input dataset; and
  
  outputting the input dataset based on the first anomaly detection score.
- - 40. The medium of claim 39, the method further comprising storing the first plurality of distinct training n-grams in a Bloom filter.
  - 41. The medium of claim 40, wherein the Bloom filter uses at least two hash functions.
  - 42. The medium of claim 39, wherein the plurality of n-grams in the training dataset also includes a second plurality of distinct training n-grams that are each a second size, and the input dataset also includes second input n-grams that are each the second size, and wherein the method further comprises:
    - determining a second matching count of the second input n-grams that correspond to one of the second plurality of distinct training n-grams;
      
      determining a second total count of the second input n-grams;
      
      determining a second anomaly detection score using the second matching count and the second total count; and
      
      determining based upon the second anomaly detection score whether the input dataset contains an anomaly.
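
Claims 11 and 42 extend the count-based score to a second n-gram size. A minimal sketch of that idea follows, using plain Python sets in place of a Bloom filter for brevity. The size range, the rule that any one size exceeding the threshold flags the input, and the names `pick_sizes`, `train_models`, and `contains_anomaly` are assumptions.

```python
import random


def distinct_ngrams(data: bytes, n: int) -> set[bytes]:
    """Set of distinct n-grams of size n in data."""
    return {data[i:i + n] for i in range(len(data) - n + 1)}


def pick_sizes(rng: random.Random, count: int = 2,
               low: int = 2, high: int = 8) -> list[int]:
    """Choose distinct n-gram sizes pseudo-randomly from [low, high)."""
    return rng.sample(range(low, high), count)


def train_models(training: bytes, sizes: list[int]) -> dict[int, set[bytes]]:
    """Build one set of distinct training n-grams per size."""
    return {n: distinct_ngrams(training, n) for n in sizes}


def mismatch_score(data: bytes, model: set[bytes], n: int) -> float:
    """Fraction of input n-grams of size n that are absent from the model."""
    total = len(data) - n + 1
    if total <= 0:
        return 0.0
    matching = sum(1 for i in range(total) if data[i:i + n] in model)
    return 1.0 - matching / total


def contains_anomaly(data: bytes, models: dict[int, set[bytes]],
                     threshold: float) -> bool:
    """Flag the input if the score for any n-gram size exceeds threshold."""
    return any(
        mismatch_score(data, model, n) > threshold
        for n, model in models.items()
    )
```

Scoring with more than one size, especially sizes an attacker cannot predict, makes it harder to craft input that blends in at every granularity at once.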

43. A non-transitory computer-readable medium containing computer-executable instructions that, when executed by a processor, cause the processor to perform a method for outputting a dataset based upon anomaly detection, the method comprising:
- receiving a first training dataset having a plurality of n-grams that includes a first plurality of distinct training n-grams, wherein each of the first plurality of distinct training n-grams is a first size;
  
  receiving a second training dataset having a plurality of n-grams that includes a second plurality of distinct training n-grams, wherein each of the second plurality of distinct training n-grams is the first size;
  
  computing a first plurality of appearance frequencies, wherein each of the first plurality of appearance frequencies corresponds to one of the first plurality of distinct training n-grams;
  
  computing a first plurality of uniformities of distribution, wherein each of the first plurality of uniformities of distribution corresponds to one of the first plurality of distinct training n-grams;
  
  computing a second plurality of uniformities of distribution, wherein each of the second plurality of uniformities of distribution corresponds to one of the second plurality of distinct training n-grams;
  
  determining a first plurality of most-heavily weighted n-grams from the first plurality of distinct training n-grams using at least one of:
  
  the first plurality of appearance frequencies;
  
  the first plurality of uniformities of distribution; and
  
  the second plurality of uniformities of distribution;
  
  selecting a subset of the first plurality of most-heavily weighted n-grams, wherein the subset includes m n-grams and at least one of the n-grams in the subset is outside of the top m of the first plurality of most-heavily weighted n-grams;
  
  receiving an input dataset including first input n-grams, wherein each of the plurality of first input n-grams is the first size;
  
  obtaining a subset of a second plurality of most-heavily weighted n-grams from the first input n-grams that correspond to the subset of the first plurality of distinct training n-grams;
  
  classifying the input dataset as containing an anomaly using the subset of the first plurality of most-heavily weighted n-grams and the subset of the second plurality of most-heavily weighted n-grams; and
  
  outputting a dataset based upon the classifying of the input dataset.
- - 44. The medium of claim 43, wherein the plurality of n-grams in the first training dataset also includes a third plurality of distinct training n-grams that are each a second size, the plurality of n-grams in the second training dataset also includes a fourth plurality of distinct training n-grams that are each the second size, and the input dataset also includes second input n-grams that are each the second size, and wherein the method further comprises:
    - computing a second plurality of appearance frequencies, wherein each of the second plurality of appearance frequencies corresponds to one of the third plurality of distinct training n-grams;
      
      computing a third plurality of uniformities of distribution, wherein each of the third plurality of uniformities of distribution corresponds to one of the third plurality of distinct training n-grams;
      
      computing a fourth plurality of uniformities of distribution, wherein each of the fourth plurality of uniformities of distribution corresponds to one of the fourth plurality of distinct training n-grams;
      
      determining a third plurality of most-heavily weighted n-grams from the third plurality of distinct training n-grams using at least one of:
      
      the second plurality of appearance frequencies;
      
      the third plurality of uniformities of distribution; and
      
      the fourth plurality of uniformities of distribution;
      
      determining a fourth plurality of most-heavily weighted n-grams using the first plurality of most-heavily weighted n-grams and the third plurality of most-heavily weighted n-grams; and
      
      classifying the input dataset as containing an anomaly using the fourth plurality of most-heavily weighted n-grams.
  - 45. The medium of claim 43, the method further comprising:
    - receiving a third training dataset having a plurality of n-grams that includes a third plurality of distinct training n-grams, wherein each of the third plurality of distinct training n-grams is the first size and contains malicious code;
      
      computing a third plurality of uniformities of distribution, wherein each of the third plurality of uniformities of distribution corresponds to one of the third plurality of distinct training n-grams; and
      
      determining a third plurality of most-heavily weighted n-grams from the third plurality of distinct training n-grams using at least one of:
      
      the first plurality of appearance frequencies;
      
      the first plurality of uniformities of distribution; and
      
      the third plurality of uniformities of distribution; and
      
      classifying the input dataset as containing an anomaly using a subset of the third plurality of most-heavily weighted n-grams and the subset of the second plurality of most-heavily weighted n-grams.
  - 46. The medium of claim 43, wherein obtaining the subset of the second plurality of most-heavily weighted n-grams comprises:
    - identifying a plurality of matching n-grams from the first input n-grams that correspond to the first plurality of distinct training n-grams;
      
      computing a second plurality of appearance frequencies, wherein each of the second plurality of appearance frequencies corresponds to one of the plurality of matching n-grams;
      
      computing a third plurality of uniformities of distribution, wherein each of the third plurality of uniformities of distribution corresponds to one of the plurality of matching n-grams; and
      
      determining the second plurality of most-heavily weighted n-grams from the plurality of matching n-grams using at least one of the second plurality of appearance frequencies; and
      
      the third plurality of uniformities of distribution; and
      
      selecting the subset of the second plurality of most-heavily weighted n-grams corresponding to the subset of the first plurality of most-heavily weighted n-grams.
  - 47. The medium of claim 43, wherein the first training dataset comprises a dataset free of known instances of malicious code.
  - 48. The medium of claim 43, wherein the second training dataset comprises a dataset containing at least one known instance of malicious code.
  - 49. The medium of claim 43, wherein determining the first plurality of most-heavily weighted n-grams comprises:
    - computing a weight for each of the first plurality of distinct training n-grams; and
      
      selecting as the first plurality of most-heavily weighted n-grams certain of the plurality of distinct training n-grams having the highest weights.
  - 50. The medium of claim 43, wherein the first size is randomly or pseudo-randomly determined.
  - 51. The medium of claim 50, wherein the first size is kept secret.
  - 52. The medium of claim 43, wherein the number of n-grams in the first plurality of most-heavily weighted n-grams is predetermined.
  - 53. The medium of claim 52, wherein the number of n-grams in the first plurality of most-heavily weighted n-grams can be adjusted.

54. A non-transitory computer-readable medium containing computer-executable instructions that, when executed by a processor, cause the processor to perform a method for outputting a dataset based upon anomaly detection, the method comprising:
- receiving a training dataset having a plurality of n-grams that includes a first plurality of distinct training n-grams, wherein each of the first plurality of distinct training n-grams is a first size;
  
  computing a first plurality of appearance frequencies, wherein each of the first plurality of appearance frequencies corresponds to one of the first plurality of distinct training n-grams;
  
  obtaining a first pseudo count associated with the first plurality of appearance frequencies;
  
  computing a first total count of the number of n-grams of the plurality of n-grams in the training dataset that are the first size;
  
  computing a first maximum possible count of distinct n-grams of the first size in the plurality of n-grams;
  
  computing a second total count of the first plurality of distinct training n-grams;
  
  computing a first smoothing factor;
  
  computing a first probability that the first plurality of distinct training n-grams are found in the training dataset using at least one of:
  
  the first plurality of appearance frequencies, the first pseudo count, the first total count, the second total count, and the first smoothing factor;
  
  computing a first consistency score of the plurality of n-grams in the training dataset that are the first size using the first maximum possible count and the first probability;
  
  receiving an input dataset including first input n-grams, wherein each of the first input n-grams is the first size;
  
  obtaining a second consistency score of the first input n-grams;
  
  classifying the input dataset using the first consistency score and the second consistency score; and
  
  outputting a dataset based upon the classifying of the input dataset.
- - 55. The medium of claim 54, wherein the plurality of n-grams in the training dataset also includes a second plurality of distinct training n-grams that are each a second size, and the method further comprises:
    - computing a second plurality of appearance frequencies, wherein each of the second plurality of appearance frequencies corresponds to one of the second plurality of distinct training n-grams;
      
      obtaining a second pseudo count associated with the second plurality of appearance frequencies;
      
      computing a third total count of the number of n-grams of the plurality of n-grams in the training dataset that are the second size;
      
      computing a second maximum possible count of distinct n-grams of the second size in the plurality of n-grams;
      
      computing a fourth total count of the second plurality of distinct training n-grams;
      
      computing a second smoothing factor;
      
      computing a second probability that the second plurality of distinct training n-grams are found in the training dataset using at least one of:
      
      the second plurality of appearance frequencies, the second pseudo count, the third total count, the fourth total count, and the second smoothing factor;
      
      computing a second consistency score of the plurality of n-grams in the training dataset that are the second size using the second maximum possible count and the second probability; and
      
      classifying the input dataset using the second consistency score.
  - 56. The medium of claim 55, wherein the first size is greater than the second size.
  - 57. The medium of claim 55, the method further comprising computing a third probability that the second plurality of distinct training n-grams are found in the training dataset given a presence of the first plurality of distinct training n-grams.
  - 58. The medium of claim 54, wherein obtaining the second consistency score of the first input n-grams comprises:
    - identifying a plurality of matching n-grams from the first input n-grams that correspond to the first plurality of distinct training n-grams;
      
      computing a second plurality of appearance frequencies, wherein each of the second plurality of appearance frequencies corresponds to one of the plurality of matching n-grams;
      
      obtaining a second pseudo count associated with the second plurality of appearance frequencies;
      
      computing a third total count of the first input n-grams;
      
      computing a third maximum possible count of distinct n-grams of the first input n-grams;
      
      computing a fourth total count of distinct n-grams of the first input n-grams;
      
      computing a second smoothing factor;
      
      computing a second probability that the distinct n-grams of the first input n-grams are found in the input dataset using at least one of:
      
      the second plurality of appearance frequencies, the second pseudo count, the third total count, the fourth total count, and the second smoothing factor; and
      
      computing a third consistency score of the first input n-grams using the third maximum possible count and the second probability.
  - 59. The medium of claim 58, the method further comprising classifying the input dataset as containing an anomaly if the second consistency score is below a threshold value.
  - 60. The medium of claim 54, wherein the training dataset comprises a first count of the top-most-frequently occurring n-grams extracted from network data traffic.
  - 61. The medium of claim 60, the method further comprising adjusting the first maximum possible count.
  - 62. The medium of claim 61, wherein adjusting the first maximum possible count comprises:
    - determining a second count of discarded n-grams not chosen as part of the top-most-frequently occurring n-grams; and
      
      adding the first count of the top-most-frequently occurring n-grams and the second count of the top-most-frequently occurring n-grams to provide the first maximum possible count.
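
Claims 60 to 62 describe keeping only the most frequently occurring n-grams and then restoring the discarded ones to the maximum possible count. The bookkeeping might be sketched in Python as follows, assuming byte-level n-grams. The function names are hypothetical, and the additive-smoothing formula used for the probability is an illustrative assumption that the claims do not recite. The consistency score itself is not modeled here.

```python
from collections import Counter


def top_k_model(data: bytes, n: int, k: int):
    """Keep the k most frequent n-grams and adjust the maximum possible count."""
    counts = Counter(data[i:i + n] for i in range(len(data) - n + 1))
    kept = dict(counts.most_common(k))
    first_count = len(kept)                   # n-grams chosen as top-most-frequent
    second_count = len(counts) - first_count  # discarded n-grams
    max_possible = first_count + second_count
    return kept, max_possible


def smoothed_probability(gram: bytes, kept: dict, max_possible: int,
                         pseudo_count: float = 1.0) -> float:
    """Additive smoothing (an assumption): (count + a) / (N + a * V)."""
    total = sum(kept.values())
    return (kept.get(gram, 0) + pseudo_count) / (total + pseudo_count * max_possible)


kept, max_possible = top_k_model(b"abababcd", n=2, k=2)
p_seen = smoothed_probability(b"ab", kept, max_possible)
p_unseen = smoothed_probability(b"zz", kept, max_possible)
```

With this choice of denominator, the probabilities of the kept n-grams plus one pseudo-count share for each remaining slot sum to one, which is why the discarded n-grams have to be counted back in.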

63. A system for outputting a dataset based upon anomaly detection, the system comprising:
- a digital processing device that:
  
  receives a training dataset having a plurality of n-grams that includes a first plurality of distinct training n-grams, wherein each of the first plurality of distinct training n-grams is a first size;
  
  computes a first plurality of appearance frequencies, wherein each of the first plurality of appearance frequencies corresponds to one of the first plurality of distinct training n-grams;
  
  receives an input dataset including first input n-grams, wherein each of the first input n-grams is the first size;
  
  defines a first window in the input dataset;
  
  identifies first matching n-grams by determining whether the first input n-grams in the first window correspond to one of the first plurality of distinct training n-grams;
  
  computes a first anomaly detection score for the input dataset using the first matching n-grams and the first plurality of appearance frequencies, wherein the first anomaly detection score is indicative of the presence of anomalous n-grams in the input dataset; and
  
  outputs the input dataset based on the first anomaly detection score.
- - 64. The system of claim 63, wherein the digital processing device also:
    - defines a second window in the input dataset;
      
      identifies second matching n-grams by determining whether the first input n-grams in the second window correspond to one of the first plurality of distinct training n-grams; and
      
      computes a second anomaly detection score for the input dataset using the second matching n-grams and first plurality of appearance frequencies.
  - 65. The system of claim 64, wherein the digital processing device also determines which of the first anomaly detection score and the second anomaly detection score is higher.
  - 66. The system of claim 64, wherein the plurality of n-grams in the training dataset also includes a second plurality of distinct training n-grams that are each a second size, and the input dataset also includes second input n-grams that are each the second size, and wherein the digital processing device also:
    - computes a second plurality of appearance frequencies, wherein each of the first plurality of appearance frequencies corresponds to one of the second plurality of distinct training n-grams;
      
      identifies third matching n-grams by determining whether the second input n-grams in the first window correspond to one of the second plurality of distinct training n-grams;
      
      computes a third anomaly detection score for the input dataset using the third matching n-grams and the second plurality of appearance frequencies; and
      
      determines based upon the third anomaly detection score whether the input dataset contains an anomaly.
  - 67. The system of claim 66, wherein the first size and the second size are randomly or pseudo-randomly chosen.
  - 68. The system of claim 63, wherein the first plurality of distinct training n-grams comprises grouped n-grams and the first matching n-grams comprise grouped n-grams.
  - 69. The system of claim 63, wherein the digital processing device also excludes from the plurality of n-grams in the training dataset a portion of the training dataset that includes malicious code.
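
The windowed scoring of claims 63 to 65 can be illustrated with a short Python sketch. The claims do not state how the anomaly detection score is computed from the matching n-grams and the appearance frequencies, so the formula below (one minus the mean normalized training frequency of the n-grams in the window) is an assumption chosen for illustration, as are all function names, the window width, and the step.

```python
from collections import Counter


def ngrams(data: bytes, n: int):
    """Return every contiguous n-byte substring of data."""
    return [data[i:i + n] for i in range(len(data) - n + 1)]


def train_frequencies(training: bytes, n: int) -> dict:
    """Relative appearance frequency of each distinct training n-gram."""
    counts = Counter(ngrams(training, n))
    total = sum(counts.values())
    return {gram: count / total for gram, count in counts.items()}


def window_score(window: bytes, freqs: dict, n: int) -> float:
    """Assumed score: unmatched n-grams contribute 0, so more of them raise the score."""
    grams = ngrams(window, n)
    top = max(freqs.values(), default=0.0)
    if not grams:
        return 0.0
    if top == 0.0:
        return 1.0
    matched_mass = sum(freqs.get(gram, 0.0) / top for gram in grams)
    return 1.0 - matched_mass / len(grams)


def max_window_score(data: bytes, freqs: dict, n: int, width: int, step: int):
    """Score each window and return (highest score, offset of that window)."""
    last_start = max(len(data) - width, 0)
    scored = [(window_score(data[o:o + width], freqs, n), o)
              for o in range(0, last_start + 1, step)]
    return max(scored, key=lambda pair: pair[0])


freqs = train_frequencies(b"GET /index.html HTTP/1.1 " * 4, n=3)
score, offset = max_window_score(b"GET /index.html " + b"\x90" * 16, freqs,
                                 n=3, width=8, step=4)
```

Taking the maximum over windows corresponds to determining which of the per-window scores is higher, so a short anomalous region is not diluted by a long run of ordinary content.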

70. A system for outputting a dataset based upon anomaly detection, the system comprising:
- a digital processing device that:
  
  receives a training dataset having a plurality of n-grams that includes a first plurality of distinct training n-grams, wherein each of the first plurality of distinct training n-grams is a first size;
  
  selects the first plurality of distinct training n-grams on a random basis, pseudo-random basis, or secret basis;
  
  receives an input dataset including first input n-grams, wherein each of the first input n-grams is the first size;
  
  determines a first matching count of the first input n-grams that correspond to one of the first plurality of distinct training n-grams;
  
  determines a first total count of the first input n-grams;
  
  determines a first anomaly detection score using the first matching count and the first total count, wherein the first anomaly detection score is indicative of the presence of anomalous n-grams in the input dataset; and
  
  outputs the input dataset based on the first anomaly detection score.
- - 71. The system of claim 70, wherein the digital processing device also stores the first plurality of distinct training n-grams in a Bloom filter.
  - 72. The system of claim 71, wherein the Bloom filter uses at least two hash functions.
  - 73. The system of claim 70, wherein the plurality of n-grams in the training dataset also includes a second plurality of distinct training n-grams that are each a second size;
    - and the input dataset also includes second input n-grams that are each the second size, and wherein the digital processing device also:
      
      determines a second matching count of the second input n-grams that correspond to one of the second plurality of distinct training n-grams;
      
      determines a second total count of the second input n-grams;
      
      determines a second anomaly detection score using the second matching count and the second total count; and
      
      determines based upon the second anomaly detection score whether the input dataset contains an anomaly.
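
Claims 70 to 72 combine a Bloom filter of training n-grams with a score built from a matching count and a total count. The Python sketch below is one possible reading under stated assumptions: the score is taken to be the fraction of input n-grams not found in the filter, the two hash functions are derived from SHA-256 with different prefixes, and the class name, filter size, and n-gram size are hypothetical choices not taken from the claims.

```python
import hashlib


class BloomFilter:
    """Bit-array set membership with no false negatives and rare false positives."""

    def __init__(self, size_bits: int = 1 << 16, num_hashes: int = 2):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, item: bytes):
        # One bit position per hash function; the prefix byte makes each hash distinct.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(bytes([i]) + item).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item: bytes) -> None:
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item: bytes) -> bool:
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))


def anomaly_score(data: bytes, bloom: BloomFilter, n: int) -> float:
    """Assumed score: 1 - (matching count / total count) of the input n-grams."""
    total = len(data) - n + 1
    if total <= 0:
        return 0.0
    matching = sum(1 for i in range(total) if data[i:i + n] in bloom)
    return 1.0 - matching / total


training = b"GET /index.html HTTP/1.1"
bloom = BloomFilter()
for i in range(len(training) - 5 + 1):
    bloom.add(training[i:i + 5])

score_seen = anomaly_score(training, bloom, n=5)
score_unseen = anomaly_score(b"\x90" * 20, bloom, n=5)
```

A Bloom filter never reports a stored n-gram as absent, so content made up entirely of training n-grams scores zero under this definition; false positives can only lower the score of unfamiliar content.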

74. A system for outputting a dataset based upon anomaly detection, the system comprising:
- a digital processing device that:
  
  receives a first training dataset having a plurality of n-grams that includes a first plurality of distinct training n-grams, wherein each of the first plurality of distinct training n-grams is a first size;
  
  receives a second training dataset having a plurality of n-grams that includes a second plurality of distinct training n-grams, wherein each of the second plurality of distinct training n-grams is the first size;
  
  computes a first plurality of appearance frequencies, wherein each of the first plurality of appearance frequencies corresponds to one of the first plurality of distinct training n-grams;
  
  computes a first plurality of uniformities of distribution, wherein each of the first plurality of uniformities of distribution corresponds to one of the first plurality of distinct training n-grams;
  
  computes a second plurality of uniformities of distribution, wherein each of the second plurality of uniformities of distribution corresponds to one of the second plurality of distinct training n-grams;
  
  determines a first plurality of most-heavily weighted n-grams from the first plurality of distinct training n-grams using at least one of:
  
  the first plurality of appearance frequencies;
  
  the first plurality of uniformities of distribution; and
  
  the second plurality of uniformities of distribution;
  
  selects a subset of the first plurality of most-heavily weighted n-grams, wherein the subset includes m n-grams and at least one of the n-grams in the subset is outside of the top m of the first plurality of most-heavily weighted n-grams;
  
  receives an input dataset including first input n-grams, wherein each of the plurality of first input n-grams is the first size;
  
  obtains a subset of a second plurality of most-heavily weighted n-grams from the first input n-grams that correspond to the subset of the first plurality of distinct training n-grams;
  
  classifies the input dataset as containing an anomaly using the subset of the first plurality of most-heavily weighted n-grams and the subset of the second plurality of most-heavily weighted n-grams; and
  
  outputs a dataset based upon the classifying of the input dataset.
- - 75. The system of claim 74, wherein the plurality of n-grams in the first training dataset also includes a third plurality of distinct training n-grams that are each a second size, the plurality of n-grams in the second training dataset also includes a fourth plurality of distinct training n-grams that are each the second size, and the input dataset also includes second input n-grams that are each the second size, wherein the digital processing device also:
    - computes a second plurality of appearance frequencies, wherein each of the second plurality of appearance frequencies corresponds to one of the third plurality of distinct training n-grams;
      
      computes a third plurality of uniformities of distribution, wherein each of the third plurality of uniformities of distribution corresponds to one of the third plurality of distinct training n-grams;
      
      computes a fourth plurality of uniformities of distribution, wherein each of the fourth plurality of uniformities of distribution corresponds to one of the fourth plurality of distinct training n-grams;
      
      determines a third plurality of most-heavily weighted n-grams from the third plurality of distinct training n-grams using at least one of:
      
      the second plurality of appearance frequencies;
      
      the third plurality of uniformities of distribution; and
      
      the fourth plurality of uniformities of distribution;
      
      determines a fourth plurality of most-heavily weighted n-grams using the first plurality of most-heavily weighted n-grams and the third plurality of most-heavily weighted n-grams; and
      
      classifies the input dataset as containing an anomaly using the fourth plurality of most-heavily weighted n-grams.
  - 76. The system of claim 74, wherein the digital processing device also:
    - receives a third training dataset having a plurality of n-grams that includes a third plurality of distinct training n-grams, wherein each of the third plurality of distinct training n-grams is the first size and contains malicious code;
      
      computes a third plurality of uniformities of distribution, wherein each of the third plurality of uniformities of distribution corresponds to one of the third plurality of distinct training n-grams; and
      
      determines a third plurality of most-heavily weighted n-grams from the third plurality of distinct training n-grams using at least one of:
      
      the first plurality of appearance frequencies;
      
      the first plurality of uniformities of distribution; and
      
      the third plurality of uniformities of distribution; and
      
      classifies the input dataset as containing an anomaly using a subset of the third plurality of most-heavily weighted n-grams and the subset of the second plurality of most-heavily weighted n-grams.
  - 77. The system of claim 74, wherein the digital processing device in obtaining the subset of the second plurality of most-heavily weighted n-grams also:
    - identifies a plurality of matching n-grams from the first input n-grams that correspond to the first plurality of distinct training n-grams;
      
      computes a second plurality of appearance frequencies, wherein each of the second plurality of appearance frequencies corresponds to one of the plurality of matching n-grams;
      
      computes a third plurality of uniformities of distribution, wherein each of the third plurality of uniformities of distribution corresponds to one of the plurality of matching n-grams; and
      
      determines the second plurality of most-heavily weighted n-grams from the plurality of matching n-grams using at least one of:
      
      the second plurality of appearance frequencies; and
      
      the third plurality of uniformities of distribution; and
      
      selects the subset of the second plurality of most-heavily weighted n-grams corresponding to the subset of the first plurality of most-heavily weighted n-grams.
  - 78. The system of claim 74, wherein the first training dataset comprises a dataset free of known instances of malicious code.
  - 79. The system of claim 74, wherein the second training dataset comprises a dataset containing at least one known instance of malicious code.
  - 80. The system of claim 74, wherein the digital processing device in determining the first plurality of most-heavily weighted n-grams also:
    - computes a weight for each of the first plurality of distinct training n-grams; and
      
      selects as the first plurality of most-heavily weighted n-grams certain of the plurality of distinct training n-grams having the highest weights.
  - 81. The system of claim 74, wherein the first size is randomly or pseudo-randomly determined.
  - 82. The system of claim 81, wherein the first size is kept secret.
  - 83. The system of claim 74, wherein the number of n-grams in the first plurality of most-heavily weighted n-grams is predetermined.
  - 84. The system of claim 83, wherein the number of n-grams in the first plurality of most-heavily weighted n-grams can be adjusted.
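
The weighting and subset selection of claims 74, 80, and 83 might look roughly like the following Python sketch. It assumes the weight of an n-gram is simply its relative appearance frequency in the first training dataset (the claims also permit uniformities of distribution, which are not modeled here), and it picks a subset of size m at random so that at least one member falls outside the top m. All names and parameter values are hypothetical.

```python
import random
from collections import Counter


def heaviest_ngrams(training: bytes, n: int, pool_size: int) -> list:
    """Rank distinct n-grams by weight (assumed: relative frequency), heaviest first."""
    counts = Counter(training[i:i + n] for i in range(len(training) - n + 1))
    total = sum(counts.values())
    weights = {gram: count / total for gram, count in counts.items()}
    return sorted(weights, key=weights.get, reverse=True)[:pool_size]


def select_subset(ranked: list, m: int, rng: random.Random) -> list:
    """Choose m n-grams with at least one taken from outside the top m."""
    if len(ranked) <= m:
        raise ValueError("need more than m ranked n-grams")
    top, rest = ranked[:m], ranked[m:]
    outside = rng.randint(1, min(m, len(rest)))  # how many come from below the top m
    return rng.sample(top, m - outside) + rng.sample(rest, outside)


ranked = heaviest_ngrams(b"the quick brown fox jumps over the lazy dog " * 3,
                         n=3, pool_size=20)
subset = select_subset(ranked, m=5, rng=random.Random(0))
```

Drawing part of the subset from below the top m means an observer who knows the training data still cannot predict exactly which n-grams will be checked, which is in keeping with the dependent claims that keep the n-gram size random or secret.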

85. A system for outputting a dataset based upon anomaly detection, the system comprising:
- a digital processing device that:
  
  receives a training dataset having a plurality of n-grams that includes a first plurality of distinct training n-grams, wherein each of the first plurality of distinct training n-grams is a first size;
  
  computes a first plurality of appearance frequencies, wherein each of the first plurality of appearance frequencies corresponds to one of the first plurality of distinct training n-grams;
  
  obtains a first pseudo count associated with the first plurality of appearance frequencies;
  
  computes a first total count of the number of n-grams of the plurality of n-grams in the training dataset that are the first size;
  
  computes a first maximum possible count of distinct n-grams of the first size in the plurality of n-grams;
  
  computes a second total count of the first plurality of distinct training n-grams;
  
  computes a first smoothing factor;
  
  computes a first probability that the first plurality of distinct training n-grams are found in the training dataset using at least one of:
  
  the first plurality of appearance frequencies, the first pseudo count, the first total count, the second total count, and the first smoothing factor;
  
  computes a first consistency score of the plurality of n-grams in the training dataset that are the first size using the first maximum possible count and the first probability;
  
  receives an input dataset including first input n-grams, wherein each of the first input n-grams is the first size;
  
  obtains a second consistency score of the first input n-grams;
  
  classifies the input dataset using the first consistency score and the second consistency score; and
  
  outputs a dataset based upon the classifying of the input dataset.
- - 86. The system of claim 85, wherein the plurality of n-grams in the training dataset also includes a second plurality of distinct training n-grams that are each a second size, and wherein the digital processing device also:
    - computes a second plurality of appearance frequencies, wherein each of the second plurality of appearance frequencies corresponds to one of the second plurality of distinct training n-grams;
      
      obtains a second pseudo count associated with the second plurality of appearance frequencies;
      
      computes a third total count of the number of n-grams of the plurality of n-grams in the training dataset that are the second size;
      
      computes a second maximum possible count of distinct n-grams of the second size in the plurality of n-grams;
      
      computes a fourth total count of the second plurality of distinct training n-grams;
      
      computes a second smoothing factor;
      
      computes a second probability that the second plurality of distinct training n-grams are found in the training dataset using at least one of:
      
      the second plurality of appearance frequencies, the second pseudo count, the third total count, the fourth total count, and the second smoothing factor;
      
      computes a second consistency score of the plurality of n-grams in the training dataset that are the second size using the second maximum possible count and the second probability; and
      
      classifies the input dataset using the second consistency score.
  - 87. The system of claim 86, wherein the first size is greater than the second size.
  - 88. The system of claim 86, wherein the digital processing device also computes a third probability that the second plurality of distinct training n-grams are found in the training dataset given a presence of the first plurality of distinct training n-grams.
  - 89. The system of claim 85, wherein the digital processing device in obtaining the second consistency score of the first input n-grams also:
    - identifies a plurality of matching n-grams from the first input n-grams that correspond to the first plurality of distinct training n-grams;
      
      computes a second plurality of appearance frequencies, wherein each of the second plurality of appearance frequencies corresponds to one of the plurality of matching n-grams;
      
      obtains a second pseudo count associated with the second plurality of appearance frequencies;
      
      computes a third total count of the first input n-grams;
      
      computes a third maximum possible count of distinct n-grams of the first input n-grams;
      
      computes a fourth total count of distinct n-grams of the first input n-grams;
      
      computes a second smoothing factor;
      
      computes a second probability that the distinct n-grams of the first input n-grams are found in the input dataset using at least one of:
      
      the second plurality of appearance frequencies, the second pseudo count, the third total count, the fourth total count, and the second smoothing factor; and
      
      computes a third consistency score of the first input n-grams using the third maximum possible count and the second probability.
  - 90. The system of claim 89, wherein the digital processing device also classifies the input dataset as containing an anomaly if the second consistency score is below a threshold value.
  - 91. The system of claim 85, wherein the training dataset comprises a first count of the top-most-frequently occurring n-grams extracted from network data traffic.
  - 92. The system of claim 91, wherein the digital processing device also adjusts the first maximum possible count.
  - 93. The system of claim 92, wherein the digital processing device in adjusting the first maximum possible count also:
    - determines a second count of discarded n-grams not chosen as part of the top-most-frequently occurring n-grams; and
      
      adds the first count of the top-most-frequently occurring n-grams and the second count of the top-most-frequently occurring n-grams to provide the first maximum possible count.


Current Assignee
Trustees Of Columbia University In The City Of New York (Columbia University)
Original Assignee
Trustees Of Columbia University In The City Of New York (Columbia University)
Inventors
Stolfo, Salvatore J; Wang, Ke; Parekh, Janak
Primary Examiner(s)
McNally, Michael S

Application Number

US12/280,969
Publication Number

US 20100064368A1
Time in Patent Office

2,183 Days
Field of Search

726/24
US Class Current

726/24
CPC Class Codes

G06F 21/56   Computer malware detection ...

G06F 21/564   by virus signature recognition

G06F 2221/034   Test or assess a computer o...

H04L 63/1416   Event detection, e.g. attac...

H04L 63/1425   Traffic logging, e.g. anoma...


283 Citations

93 Claims
