Identification of anomalous data records
Abstract
Identifying anomalies or outliers in a set of data records employs a distance or similarity measure between features of record pairs that depends upon the frequencies of the feature values in the set. Feature distances may be combined for a total distance between record pairs. An outlier is indicated for a certain score that may be based upon the pairwise distances. Outliers may be employed to detect intrusions in computer networks.
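The flow described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the patented method: the per-feature distance `(1 - Freq(vi)) * (1 - Freq(vj))`, the summation over features, the mean-distance score, the `THRESHOLD` value, and the sample records are all assumptions chosen for the example.

```python
from collections import Counter


def value_frequencies(records, feature):
    """Relative frequency of each value of one feature across the dataset."""
    counts = Counter(record[feature] for record in records)
    return {value: n / len(records) for value, n in counts.items()}


def feature_distance(vi, vj, freq):
    """Stand-in distance (an assumption): 0 for a match; a mismatch
    costs more when both values are rare in the dataset."""
    if vi == vj:
        return 0.0
    return (1.0 - freq[vi]) * (1.0 - freq[vj])


def record_distance(a, b, freqs):
    """Total distance between two records: sum of the per-feature distances."""
    return sum(feature_distance(a[f], b[f], freq) for f, freq in freqs.items())


def anomaly_scores(records, features):
    """Score each record by its mean total distance to every other record."""
    freqs = {f: value_frequencies(records, f) for f in features}
    scores = []
    for i, current in enumerate(records):
        others = [r for j, r in enumerate(records) if j != i]
        total = sum(record_distance(current, other, freqs) for other in others)
        scores.append(total / len(others))
    return scores


# Hypothetical dataset: two common kinds of record and one rare one.
records = (
    [{"protocol": "tcp", "service": "http"}] * 5
    + [{"protocol": "udp", "service": "dns"}] * 4
    + [{"protocol": "icmp", "service": "echo"}]
)
THRESHOLD = 0.7  # hypothetical cut-off standing in for the "certain score"
scores = anomaly_scores(records, ["protocol", "service"])
outliers = [i for i, s in enumerate(scores) if s >= THRESHOLD]
print(outliers)  # prints [9]
```

Only the lone icmp/echo record crosses the threshold here, because every one of its mismatches involves a rare value. A different scoring rule, such as distance to the k nearest records, would fit the abstract equally well.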
21 Claims
1. A method for execution by one or more digital processors, the method for detecting whether a current record in a dataset of records is an anomalous record, comprising:
defining a feature of the records in the dataset;
calculating, by the one or more digital processors, a plurality of pairwise distances between a value of the feature in the current record and values of the feature in at least some of the records in the dataset, where each pairwise distance is either:
(A) small for mismatches between the values when both of the values rarely occur in the dataset records or where the distance is large for mismatches between the values when both of the values commonly occur in the dataset records;
or (B) large for mismatches between the values when both of the values rarely occur in the dataset records or where the distance is small when both of the values commonly occur in the dataset records; and
in response to the plurality of distances d, producing a score for the current record;
indicating that the current record is anomalous if the score meets a predetermined criterion;
wherein the distances are responsive to the frequency Freq(vi) of the value vi of the feature in the current record and to the frequency Freq(vj) of the value vj of the feature in another of the records in the dataset; and
calculating each of the pairwise distances d from a value vi of the feature in the current record and a value vj of the feature in the other of the dataset records according to the relation
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11
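Alternatives (A) and (B) of claim 1 differ only in how a mismatch is weighted by the frequencies of the two values. The relation that the claim relies on is not reproduced in this text, so the two functions below are illustrative stand-ins chosen only to show the opposite behaviours; `distance_a`, `distance_b`, and the sample values are hypothetical.

```python
from collections import Counter


def frequencies(values):
    """Freq(v): fraction of dataset records that carry value v."""
    counts = Counter(values)
    return {v: n / len(values) for v, n in counts.items()}


def distance_a(vi, vj, freq):
    """Illustrative (A)-style distance: a mismatch between two rare
    values is small, a mismatch between two common values is large."""
    return 0.0 if vi == vj else freq[vi] * freq[vj]


def distance_b(vi, vj, freq):
    """Illustrative (B)-style distance: a mismatch between two rare
    values is large, a mismatch between two common values is small."""
    return 0.0 if vi == vj else (1.0 - freq[vi]) * (1.0 - freq[vj])


# Hypothetical feature column: two common values and two rare ones.
values = ["http"] * 45 + ["dns"] * 45 + ["telnet"] * 5 + ["finger"] * 5
freq = frequencies(values)

common_a = distance_a("http", "dns", freq)     # both values common
rare_a = distance_a("telnet", "finger", freq)  # both values rare
common_b = distance_b("http", "dns", freq)
rare_b = distance_b("telnet", "finger", freq)

print(rare_a < common_a)  # prints True
print(rare_b > common_b)  # prints True
```

Both stand-ins return zero for matching values and depend on Freq(vi) and Freq(vj), which is the only property the claim text available here lets one state with confidence.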
12. A storage medium containing instructions executable in a digital processor for carrying out a method comprising:
defining a feature of the records in the dataset;
calculating a plurality of pairwise distances between a value vi of the feature in the current record and values vj of the feature in at least some of the records in the dataset, where each pairwise distance d is either:
(A) small for mismatches between vi and vj when both vi and vj rarely occur in the dataset records or where the distance is large for mismatches between vi and vj when both vi and vj commonly occur in the dataset records;
or (B) large for mismatches between vi and vj when both vi and vj rarely occur in the dataset records or where the distance is small when both vi and vj commonly occur in the dataset records; and
in response to the plurality of distances d, producing a score for the current record;
indicating that the current record is anomalous if the score meets a predetermined criterion;
wherein the distances are responsive to the frequency Freq(vi) of the value vi of the feature in the current record and to the frequency Freq(vj) of the value vj of the feature in another of the records in the dataset; and
calculating each of the pairwise distances d from a value vi of the feature in the current record and a value vj of the feature in the other of the dataset records according to the relation
Dependent claims: 13
14. Apparatus for detecting whether a current record in a dataset is an anomalous record, comprising:
a feature digital preprocessor for defining a feature of the records in the dataset;
a distance calculator for calculating a plurality of pairwise distances between a value of the feature in the current record and a value of the feature in at least some of the records in the dataset, where each pairwise distance is either:
(A) small for mismatches between the values when both values rarely occur in the dataset records or where the distance is large for mismatches between the values when both of the values commonly occur in the dataset records;
or (B) large for mismatches between the values when both values rarely occur in the dataset records or where the distance is small when both of the values commonly occur in the dataset records;
an outlier detector for producing a score for the current record in response to the plurality of distances, and for indicating that the current record is anomalous if the score meets a predetermined criterion;
wherein the distances are responsive to the frequency Freq(vi) of the value vi of the feature in the current record and to the frequency Freq(vj) of the value vj of the feature in another of the records in the dataset; and
each of the pairwise distances d from a value vi of the feature in the current record and a value vj of the feature in the other of the dataset records is calculated according to the relation
Dependent claims: 15
16. A system for detecting intrusions in a computer attached to a network, comprising:
a data-capture module for receiving message records from the network;
a feature digital preprocessor for defining a feature of the records in the dataset;
a distance calculator for calculating a plurality of pairwise distances between a value of the feature in the current record and a value of the feature in at least some of the records in the dataset, where each pairwise distance is either:
(A) small for mismatches between the values when both values rarely occur in the dataset records or where the distance is large for mismatches between the values when both of the values commonly occur in the dataset records;
or (B) large for mismatches between the values when both values rarely occur in the dataset records or where the distance is small when both of the values commonly occur in the dataset records;
an outlier detector for producing a score for the current record in response to the plurality of distances, and for indicating that the current record is anomalous if the score meets a predetermined criterion;
wherein the distances are responsive to the frequency Freq(vi) of the value vi of the feature in the current record and to the frequency Freq(vj) of the value vj of the feature in another of the records in the dataset; and
each of the pairwise distances d from a value vi of the feature in the current record and a value vj of the feature in the other of the dataset records is calculated according to the relation
Dependent claims: 17, 18, 19, 20, 21
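The four elements of claim 16 can be pictured as cooperating components. The sketch below is one possible arrangement under stated assumptions, not the claimed system: the class names, the sliding-window size, the `service` field, the (B)-style stand-in distance, the mean-distance score, and the 0.4 threshold are all hypothetical.

```python
from collections import Counter, deque


class DataCapture:
    """Keeps the most recent message records (window size is an assumption)."""

    def __init__(self, max_records=1000):
        self.records = deque(maxlen=max_records)

    def receive(self, message):
        self.records.append(message)


class FeaturePreprocessor:
    """Defines the feature: here the hypothetical 'service' field."""

    def extract(self, message):
        return message["service"]


class DistanceCalculator:
    """Frequency-dependent stand-in distance between the current value
    and each earlier value; a mismatch between rare values costs more."""

    def distances(self, current, others):
        counts = Counter(others)
        counts[current] += 1
        total = sum(counts.values())

        def d(vi, vj):
            if vi == vj:
                return 0.0
            return (1 - counts[vi] / total) * (1 - counts[vj] / total)

        return [d(current, v) for v in others]


class OutlierDetector:
    """Scores a record by its mean distance and applies a fixed threshold."""

    def __init__(self, threshold):
        self.threshold = threshold

    def is_anomalous(self, distances):
        if not distances:
            return False
        return sum(distances) / len(distances) >= self.threshold


def check_message(message, capture, pre, calc, detector):
    """Compare a new message with the stored window, then store it."""
    others = [pre.extract(m) for m in capture.records]
    capture.receive(message)
    return detector.is_anomalous(calc.distances(pre.extract(message), others))


capture, pre = DataCapture(), FeaturePreprocessor()
calc, detector = DistanceCalculator(), OutlierDetector(threshold=0.4)
for service in ["http"] * 50 + ["dns"] * 45:  # hypothetical normal traffic
    capture.receive({"service": service})

telnet_flag = check_message({"service": "telnet"}, capture, pre, calc, detector)
http_flag = check_message({"service": "http"}, capture, pre, calc, detector)
print(telnet_flag, http_flag)  # prints True False
```

Keeping the modules separate mirrors the claim's element list and lets the distance function or the criterion be swapped without touching capture or feature extraction.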
Specification