Systems, methods, and media for outputting a dataset based upon anomaly detection
First Claim
1. A method for outputting a dataset based upon anomaly detection, the method comprising:
receiving a training dataset having a plurality of n-grams that includes a first plurality of distinct training n-grams, wherein each of the first plurality of distinct training n-grams is a first size;
computing a first plurality of appearance frequencies, wherein each of the first plurality of appearance frequencies corresponds to one of the first plurality of distinct training n-grams;
receiving an input dataset including first input n-grams, wherein each of the first input n-grams is the first size;
defining a first window in the input dataset;
identifying first matching n-grams by determining whether the first input n-grams in the first window correspond to one of the first plurality of distinct training n-grams;
computing a first anomaly detection score for the input dataset using the first matching n-grams and the first plurality of appearance frequencies, wherein the first anomaly detection score is indicative of the presence of anomalous n-grams in the input dataset; and
outputting the input dataset based on the first anomaly detection score.
Abstract
Systems, methods, and media for outputting a dataset based upon anomaly detection are provided. In some embodiments, methods for outputting a dataset based upon anomaly detection: receive a training dataset having a plurality of n-grams, which plurality includes a first plurality of distinct training n-grams each being a first size; compute a first plurality of appearance frequencies, each for a corresponding one of the first plurality of distinct training n-grams; receive an input dataset including first input n-grams each being the first size; define a first window in the input dataset; identify as being first matching n-grams, the first input n-grams in the first window that correspond to the first plurality of distinct training n-grams; compute a first anomaly detection score for the input dataset using the first matching n-grams and the first plurality of appearance frequencies; and output the input dataset based on the first anomaly detection score.
283 Citations
93 Claims
-
1. A method for outputting a dataset based upon anomaly detection, the method comprising:
-
receiving a training dataset having a plurality of n-grams that includes a first plurality of distinct training n-grams, wherein each of the first plurality of distinct training n-grams is a first size; computing a first plurality of appearance frequencies, wherein each of the first plurality of appearance frequencies corresponds to one of the first plurality of distinct training n-grams; receiving an input dataset including first input n-grams, wherein each of the first input n-grams is the first size; defining a first window in the input dataset; identifying first matching n-grams by determining whether the first input n-grams in the first window correspond to one of the first plurality of distinct training n-grams; computing a first anomaly detection score for the input dataset using the first matching n-grams and the first plurality of appearance frequencies, wherein the first anomaly detection score is indicative of the presence of anomalous n-grams in the input dataset; and outputting the input dataset based on the first anomaly detection score. - View Dependent Claims (2, 3, 4, 5, 6, 7)
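As an illustration only, the steps of claim 1 could be sketched in Python as follows. The claim does not fix a scoring formula, so the score used here (one minus the mean normalized training frequency of the n-grams in the window) is an assumption, as are the byte-oriented n-grams, the function names, and the `threshold` parameter.

```python
from collections import Counter


def ngrams(data: bytes, n: int) -> list:
    """Return every contiguous n-gram of size n found in data."""
    return [data[i:i + n] for i in range(len(data) - n + 1)]


def train_frequencies(training: bytes, n: int) -> Counter:
    """Appearance frequency of each distinct training n-gram."""
    return Counter(ngrams(training, n))


def window_score(freqs: Counter, data: bytes, n: int, start: int, length: int) -> float:
    """Hypothetical score in [0, 1]; higher means more anomalous."""
    window = data[start:start + length]          # the window in the input dataset
    grams = ngrams(window, n)
    if not grams:
        return 0.0                               # nothing to score
    if not freqs:
        return 1.0                               # no training n-grams can match
    max_freq = max(freqs.values())
    # Matching n-grams contribute their normalized training frequency;
    # n-grams never seen in training contribute nothing.
    weight = sum(freqs[g] / max_freq for g in grams if g in freqs)
    return 1.0 - weight / len(grams)


def output_if_normal(freqs, data, n, start, length, threshold=0.5):
    """Output the input dataset only when the score stays at or below threshold."""
    score = window_score(freqs, data, n, start, length)
    return data if score <= threshold else None
```

In this sketch a window made up entirely of the most frequent training n-grams scores 0.0, and a window with no matching n-grams scores 1.0.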
-
-
8. A method for outputting a dataset based upon anomaly detection, the method comprising:
-
receiving a training dataset having a plurality of n-grams that includes a first plurality of distinct training n-grams, wherein each of the first plurality of distinct training n-grams is a first size; selecting the first plurality of distinct training n-grams on a random basis, pseudo-random basis, or secret basis; receiving an input dataset including first input n-grams, wherein each of the first input n-grams is the first size; determining a first matching count of the first input n-grams that correspond to one of the first plurality of distinct training n-grams; determining a first total count of the first input n-grams; determining a first anomaly detection score using the first matching count and the first total count, wherein the first anomaly detection score is indicative of the presence of anomalous n-grams in the input dataset; and outputting the input dataset based on the first anomaly detection score. - View Dependent Claims (9, 10, 11)
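The count-based scoring of claim 8 can be illustrated with a short Python sketch. It assumes byte n-grams, a seeded pseudo-random selection standing in for the "random basis, pseudo-random basis, or secret basis", and a score defined as one minus the ratio of the matching count to the total count; none of these choices is dictated by the claim, and all names are hypothetical.

```python
import random


def distinct_ngrams(data: bytes, n: int) -> set:
    """Set of distinct n-grams of size n in data."""
    return {data[i:i + n] for i in range(len(data) - n + 1)}


def select_training_ngrams(training: bytes, n: int, k: int, seed: int) -> set:
    """Pick k distinct training n-grams pseudo-randomly.

    The seed plays the role of a secret: without it, the selected
    subset cannot be reproduced.
    """
    candidates = sorted(distinct_ngrams(training, n))  # stable order before sampling
    rng = random.Random(seed)
    return set(rng.sample(candidates, k))


def count_score(selected: set, data: bytes, n: int) -> float:
    """Hypothetical score: 1 - matching count / total count."""
    grams = [data[i:i + n] for i in range(len(data) - n + 1)]
    total = len(grams)                                 # total count of input n-grams
    if total == 0:
        return 0.0
    matching = sum(1 for g in grams if g in selected)  # matching count
    return 1.0 - matching / total
```

Because only a subset of the training n-grams is used, the threshold applied to such a score would need to account for the fraction selected.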
-
-
12. A method for outputting a dataset based upon anomaly detection, the method comprising:
-
receiving a first training dataset having a plurality of n-grams that includes a first plurality of distinct training n-grams, wherein each of the first plurality of distinct training n-grams is a first size; receiving a second training dataset having a plurality of n-grams that includes a second plurality of distinct training n-grams, wherein each of the second plurality of distinct training n-grams is the first size; computing a first plurality of appearance frequencies, wherein each of the first plurality of appearance frequencies corresponds to one of the first plurality of distinct training n-grams; computing a first plurality of uniformities of distribution, wherein each of the first plurality of uniformities of distribution corresponds to one of the first plurality of distinct training n-grams; computing a second plurality of uniformities of distribution, wherein each of the second plurality of uniformities of distribution corresponds to one of the second plurality of distinct training n-grams; determining a first plurality of most-heavily weighted n-grams from the first plurality of distinct training n-grams using at least one of:
the first plurality of appearance frequencies;
the first plurality of uniformities of distribution; and
the second plurality of uniformities of distribution;
selecting a subset of the first plurality of most-heavily weighted n-grams, wherein the subset includes m n-grams and at least one of the n-grams in the subset is outside of the top m of the first plurality of most-heavily weighted n-grams; receiving an input dataset including first input n-grams, wherein each of the plurality of first input n-grams is the first size; obtaining a subset of a second plurality of most-heavily weighted n-grams from the first input n-grams that correspond to the subset of the first plurality of distinct training n-grams; classifying the input dataset as containing an anomaly using the subset of the first plurality of most-heavily weighted n-grams and the subset of the second plurality of most-heavily weighted n-grams; and outputting a dataset based upon the classifying of the input dataset. - View Dependent Claims (13, 14, 15, 16, 17, 18, 19, 20, 21, 22)
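The subset-selection step of claim 12 (m n-grams, at least one of which falls outside the top m) can be illustrated as below. This sketch covers only ranking and selection. It weights n-grams by appearance frequency alone, which is one of the options the claim lists, and it draws the subset from a larger candidate pool; the pool size, the seeded generator, and all function names are assumptions.

```python
import random
from collections import Counter


def rank_by_frequency(training: bytes, n: int) -> list:
    """Distinct training n-grams, most-heavily weighted first.

    The weight here is the raw appearance frequency; ties are broken by
    the n-gram's byte value so the ranking is deterministic.
    """
    counts = Counter(training[i:i + n] for i in range(len(training) - n + 1))
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ordered]


def select_subset(ranked: list, m: int, pool: int, seed: int) -> list:
    """Choose m n-grams from the top `pool`, at least one outside the top m."""
    if not (1 <= m < pool <= len(ranked)):
        raise ValueError("require 1 <= m < pool <= len(ranked)")
    rng = random.Random(seed)
    outside = rng.choice(ranked[m:pool])        # guaranteed to rank below the top m
    remaining = [g for g in ranked[:pool] if g != outside]
    return [outside] + rng.sample(remaining, m - 1)
```

Drawing from beyond the top m means the selected set cannot be predicted from the training data alone, which appears to be the purpose of the "outside of the top m" limitation.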
-
-
23. A method for outputting a dataset based upon anomaly detection, the method comprising:
-
receiving a training dataset having a plurality of n-grams that includes a first plurality of distinct training n-grams, wherein each of the first plurality of distinct training n-grams is a first size; computing a first plurality of appearance frequencies, wherein each of the first plurality of appearance frequencies corresponds to one of the first plurality of distinct training n-grams; obtaining a first pseudo count associated with the first plurality of appearance frequencies; computing a first total count of the number of n-grams of the plurality of n-grams in the training dataset that are the first size; computing a first maximum possible count of distinct n-grams of the first size in the plurality of n-grams; computing a second total count of the first plurality of distinct training n-grams; computing a first smoothing factor; computing a first probability that the first plurality of distinct training n-grams are found in the training dataset using at least one of:
the first plurality of appearance frequencies, the first pseudo count, the first total count, the second total count, and the first smoothing factor; computing a first consistency score of the plurality of n-grams in the training dataset that are the first size using the first maximum possible count and the first probability; receiving an input dataset including first input n-grams, wherein each of the first input n-grams is the first size; obtaining a second consistency score of the first input n-grams; classifying the input dataset using the first consistency score and the second consistency score; and outputting a dataset based upon the classifying of the input dataset. - View Dependent Claims (24, 25, 26, 27, 28, 29, 30, 31)
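Claim 23 names the quantities involved (pseudo count, total counts, maximum possible count, smoothing factor, probability, consistency score) but gives no formulas. The Python sketch below substitutes two standard techniques purely for illustration: additive smoothing for the probability, and one minus the normalized entropy of the smoothed distribution for the consistency score. Both substitutions, the byte-oriented n-grams, the `tolerance` parameter, and every function name are assumptions rather than the claimed computation.

```python
import math
from collections import Counter


def consistency(data: bytes, n: int, pseudo: float = 1.0) -> float:
    """Hypothetical consistency score in [0, 1]; higher means more regular data."""
    if pseudo <= 0:
        raise ValueError("pseudo count must be positive")
    counts = Counter(data[i:i + n] for i in range(len(data) - n + 1))
    total = sum(counts.values())        # total count of n-grams of this size
    distinct = len(counts)              # count of distinct observed n-grams
    max_distinct = 256 ** n             # maximum possible distinct byte n-grams
    denom = total + pseudo * max_distinct   # additive-smoothing normalizer

    entropy = 0.0
    for c in counts.values():
        p = (c + pseudo) / denom        # smoothed probability of a seen n-gram
        entropy -= p * math.log(p)
    unseen = max_distinct - distinct
    if unseen:
        p0 = pseudo / denom             # probability shared by each unseen n-gram
        entropy -= unseen * p0 * math.log(p0)
    return 1.0 - entropy / math.log(max_distinct)


def same_profile(training: bytes, incoming: bytes, n: int, tolerance: float = 0.1) -> bool:
    """Classify the input as normal when its score is close to the training score."""
    return abs(consistency(training, n) - consistency(incoming, n)) <= tolerance
```

Under these assumptions, data whose n-grams are spread evenly over every possible value scores near 0, while highly repetitive data scores closer to 1.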
-
-
32. A non-transitory computer-readable medium containing computer-executable instructions that, when executed by a processor, cause the processor to perform a method for outputting a dataset based upon anomaly detection, the method comprising:
-
receiving a training dataset having a plurality of n-grams that includes a first plurality of distinct training n-grams, wherein each of the first plurality of distinct training n-grams is a first size; computing a first plurality of appearance frequencies, wherein each of the first plurality of appearance frequencies corresponds to one of the first plurality of distinct training n-grams; receiving an input dataset including first input n-grams, wherein each of the first input n-grams is the first size; defining a first window in the input dataset; identifying first matching n-grams by determining whether the first input n-grams in the first window correspond to one of the first plurality of distinct training n-grams; computing a first anomaly detection score for the input dataset using the first matching n-grams and the first plurality of appearance frequencies, wherein the first anomaly detection score is indicative of the presence of anomalous n-grams in the input dataset; and outputting the input dataset based on the first anomaly detection score. - View Dependent Claims (33, 34, 35, 36, 37, 38)
-
-
39. A non-transitory computer-readable medium containing computer-executable instructions that, when executed by a processor, cause the processor to perform a method for outputting a dataset based upon anomaly detection, the method comprising:
-
receiving a training dataset having a plurality of n-grams that includes a first plurality of distinct training n-grams, wherein each of the first plurality of distinct training n-grams is a first size; selecting the first plurality of distinct training n-grams on a random basis, pseudo-random basis, or secret basis; receiving an input dataset including first input n-grams, wherein each of the first input n-grams is the first size; determining a first matching count of the first input n-grams that correspond to one of the first plurality of distinct training n-grams; determining a first total count of the first input n-grams; determining a first anomaly detection score using the first matching count and the first total count, wherein the first anomaly detection score is indicative of the presence of anomalous n-grams in the input dataset; and outputting the input dataset based on the first anomaly detection score. - View Dependent Claims (40, 41, 42)
-
-
43. A non-transitory computer-readable medium containing computer-executable instructions that, when executed by a processor, cause the processor to perform a method for outputting a dataset based upon anomaly detection, the method comprising:
-
receiving a first training dataset having a plurality of n-grams that includes a first plurality of distinct training n-grams, wherein each of the first plurality of distinct training n-grams is a first size; receiving a second training dataset having a plurality of n-grams that includes a second plurality of distinct training n-grams, wherein each of the second plurality of distinct training n-grams is the first size; computing a first plurality of appearance frequencies, wherein each of the first plurality of appearance frequencies corresponds to one of the first plurality of distinct training n-grams; computing a first plurality of uniformities of distribution, wherein each of the first plurality of uniformities of distribution corresponds to one of the first plurality of distinct training n-grams; computing a second plurality of uniformities of distribution, wherein each of the second plurality of uniformities of distribution corresponds to one of the second plurality of distinct training n-grams; determining a first plurality of most-heavily weighted n-grams from the first plurality of distinct training n-grams using at least one of:
the first plurality of appearance frequencies;
the first plurality of uniformities of distribution; and
the second plurality of uniformities of distribution;
selecting a subset of the first plurality of most-heavily weighted n-grams, wherein the subset includes m n-grams and at least one of the n-grams in the subset is outside of the top m of the first plurality of most-heavily weighted n-grams; receiving an input dataset including first input n-grams, wherein each of the plurality of first input n-grams is the first size; obtaining a subset of a second plurality of most-heavily weighted n-grams from the first input n-grams that correspond to the subset of the first plurality of distinct training n-grams; classifying the input dataset as containing an anomaly using the subset of the first plurality of most-heavily weighted n-grams and the subset of the second plurality of most-heavily weighted n-grams; and outputting a dataset based upon the classifying of the input dataset. - View Dependent Claims (44, 45, 46, 47, 48, 49, 50, 51, 52, 53)
-
-
54. A non-transitory computer-readable medium containing computer-executable instructions that, when executed by a processor, cause the processor to perform a method for outputting a dataset based upon anomaly detection, the method comprising:
-
receiving a training dataset having a plurality of n-grams that includes a first plurality of distinct training n-grams, wherein each of the first plurality of distinct training n-grams is a first size; computing a first plurality of appearance frequencies, wherein each of the first plurality of appearance frequencies corresponds to one of the first plurality of distinct training n-grams; obtaining a first pseudo count associated with the first plurality of appearance frequencies; computing a first total count of the number of n-grams of the plurality of n-grams in the training dataset that are the first size; computing a first maximum possible count of distinct n-grams of the first size in the plurality of n-grams; computing a second total count of the first plurality of distinct training n-grams; computing a first smoothing factor; computing a first probability that the first plurality of distinct training n-grams are found in the training dataset using at least one of:
the first plurality of appearance frequencies, the first pseudo count, the first total count, the second total count, and the first smoothing factor; computing a first consistency score of the plurality of n-grams in the training dataset that are the first size using the first maximum possible count and the first probability; receiving an input dataset including first input n-grams, wherein each of the first input n-grams is the first size; obtaining a second consistency score of the first input n-grams; classifying the input dataset using the first consistency score and the second consistency score; and outputting a dataset based upon the classifying of the input dataset. - View Dependent Claims (55, 56, 57, 58, 59, 60, 61, 62)
-
-
63. A system for outputting a dataset based upon anomaly detection, the system comprising:
-
a digital processing device that: receives a training dataset having a plurality of n-grams that includes a first plurality of distinct training n-grams, wherein each of the first plurality of distinct training n-grams is a first size; computes a first plurality of appearance frequencies, wherein each of the first plurality of appearance frequencies corresponds to one of the first plurality of distinct training n-grams; receives an input dataset including first input n-grams, wherein each of the first input n-grams is the first size; defines a first window in the input dataset; identifies first matching n-grams by determining whether the first input n-grams in the first window correspond to one of the first plurality of distinct training n-grams; computes a first anomaly detection score for the input dataset using the first matching n-grams and the first plurality of appearance frequencies, wherein the first anomaly detection score is indicative of the presence of anomalous n-grams in the input dataset; and outputs the input dataset based on the first anomaly detection score. - View Dependent Claims (64, 65, 66, 67, 68, 69)
-
-
70. A system for outputting a dataset based upon anomaly detection, the system comprising:
-
a digital processing device that: receives a training dataset having a plurality of n-grams that includes a first plurality of distinct training n-grams, wherein each of the first plurality of distinct training n-grams is a first size; selects the first plurality of distinct training n-grams on a random basis, pseudo-random basis, or secret basis; receives an input dataset including first input n-grams, wherein each of the first input n-grams is the first size; determines a first matching count of the first input n-grams that correspond to one of the first plurality of distinct training n-grams; determines a first total count of the first input n-grams; determines a first anomaly detection score using the first matching count and the first total count, wherein the first anomaly detection score is indicative of the presence of anomalous n-grams in the input dataset; and outputs the input dataset based on the first anomaly detection score. - View Dependent Claims (71, 72, 73)
-
-
74. A system for outputting a dataset based upon anomaly detection, the system comprising:
-
a digital processing device that: receives a first training dataset having a plurality of n-grams that includes a first plurality of distinct training n-grams, wherein each of the first plurality of distinct training n-grams is a first size; receives a second training dataset having a plurality of n-grams that includes a second plurality of distinct training n-grams, wherein each of the second plurality of distinct training n-grams is the first size; computes a first plurality of appearance frequencies, wherein each of the first plurality of appearance frequencies corresponds to one of the first plurality of distinct training n-grams; computes a first plurality of uniformities of distribution, wherein each of the first plurality of uniformities of distribution corresponds to one of the first plurality of distinct training n-grams; computes a second plurality of uniformities of distribution, wherein each of the second plurality of uniformities of distribution corresponds to one of the second plurality of distinct training n-grams; determines a first plurality of most-heavily weighted n-grams from the first plurality of distinct training n-grams using at least one of:
the first plurality of appearance frequencies;
the first plurality of uniformities of distribution; and
the second plurality of uniformities of distribution;
selects a subset of the first plurality of most-heavily weighted n-grams, wherein the subset includes m n-grams and at least one of the n-grams in the subset is outside of the top m of the first plurality of most-heavily weighted n-grams; receives an input dataset including first input n-grams, wherein each of the plurality of first input n-grams is the first size; obtains a subset of a second plurality of most-heavily weighted n-grams from the first input n-grams that correspond to the subset of the first plurality of distinct training n-grams; classifies the input dataset as containing an anomaly using the subset of the first plurality of most-heavily weighted n-grams and the subset of the second plurality of most-heavily weighted n-grams; and outputs a dataset based upon the classifying of the input dataset. - View Dependent Claims (75, 76, 77, 78, 79, 80, 81, 82, 83, 84)
-
-
85. A system for outputting a dataset based upon anomaly detection, the system comprising:
-
a digital processing device that: receives a training dataset having a plurality of n-grams that includes a first plurality of distinct training n-grams, wherein each of the first plurality of distinct training n-grams is a first size; computes a first plurality of appearance frequencies, wherein each of the first plurality of appearance frequencies corresponds to one of the first plurality of distinct training n-grams; obtains a first pseudo count associated with the first plurality of appearance frequencies; computes a first total count of the number of n-grams of the plurality of n-grams in the training dataset that are the first size; computes a first maximum possible count of distinct n-grams of the first size in the plurality of n-grams; computes a second total count of the first plurality of distinct training n-grams; computes a first smoothing factor; computes a first probability that the first plurality of distinct training n-grams are found in the training dataset using at least one of:
the first plurality of appearance frequencies, the first pseudo count, the first total count, the second total count, and the first smoothing factor; computes a first consistency score of the plurality of n-grams in the training dataset that are the first size using the first maximum possible count and the first probability; receives an input dataset including first input n-grams, wherein each of the first input n-grams is the first size; obtains a second consistency score of the first input n-grams; classifies the input dataset using the first consistency score and the second consistency score; and outputs a dataset based upon the classifying of the input dataset. - View Dependent Claims (86, 87, 88, 89, 90, 91, 92, 93)
-
Specification