Machine learning based botnet detection with dynamic adaptation

US 8,402,543 B1
Filed: 03/25/2011
Issued: 03/19/2013
Est. Priority Date: 03/25/2011
Status: Active Grant

Abstract

Embodiments of the invention address the problem of detecting bots in network traffic based on a classification model learned during a training phase using machine learning algorithms based on features extracted from network data associated with either known malicious or known non-malicious client and applying the learned classification model to features extracted in real-time from current network data. The features represent communication activities between the known malicious or known non-malicious client and a number of servers in the network.
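
The feature extraction described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: it assumes flow records arrive as `(client, server)` pairs and uses the number of flows as the measure of communication activity. The function name `extract_features` and the sample addresses are hypothetical.

```python
from collections import defaultdict


def extract_features(flows, servers):
    """Build one data instance per client: a vector with one entry per server,
    each entry counting the flows seen between that client and that server."""
    index = {server: i for i, server in enumerate(servers)}
    instances = defaultdict(lambda: [0] * len(servers))
    for client, server in flows:
        if server in index:  # flows to servers outside the feature set are skipped
            instances[client][index[server]] += 1
    return dict(instances)


# Hypothetical flow records observed during one time window.
flows = [("10.0.0.5", "s1"), ("10.0.0.5", "s1"), ("10.0.0.5", "s3"), ("10.0.0.9", "s2")]
servers = ["s1", "s2", "s3"]
instances = extract_features(flows, servers)
```

Each resulting vector has one axis per server, which matches the per-server features recited in the claims. Packet or byte counts could replace flow counts without changing the structure.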

27 Claims

1. A method for botnet detection in a network, comprising:
- extracting, by a processor of a computer system and from first network traffic data exchanged between a malicious client and a plurality of servers in the network, a malicious data instance comprising a first plurality of features, corresponding to the plurality of servers, each representing a measure of communication activity between the malicious client and a first corresponding server in the first network traffic data;
  
  extracting, by the processor and from second network traffic data exchanged between a non-malicious client and the plurality of servers, a non-malicious data instance comprising a second plurality of features, corresponding to the plurality of servers, each representing the measure of communication activity between the non-malicious client and a second corresponding server in the second network traffic data;
  
  including the malicious data instance and the non-malicious data instance in a training data set comprising a plurality of malicious data instances and non-malicious data instances, wherein each data instance of the plurality of malicious data instances and non-malicious data instances is associated with one of a plurality of clients comprising the malicious client and the non-malicious client;
  
  generating, by the processor and using a pre-determined machine learning algorithm, a classification model based on the training data set, wherein the classification model is adapted to, when applied to one or more malicious data instance, generate a malicious label, wherein the classification model is further adapted to, when applied to one or more non-malicious data instance, generate a non-malicious label;
  
  extracting, by the processor and from third network traffic data exchanged between an unclassified client and the plurality of servers, an unclassified data instance comprising a third plurality of features, corresponding to the plurality of servers, each representing a measure of communication activity between the unclassified client and a third corresponding server in the third network traffic data;
  
  generating, by the processor, a classification label of the unclassified data instance by applying the classification model to the unclassified data instance, wherein the classification label comprises the malicious label; and
  
  identifying, in response to the classification label comprising the malicious label, the unclassified client as associated with a botnet.
- View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 17, 18, 19, 20, 24)
- - 2. The method of claim 1, further comprising:
    - identifying each of the plurality of clients as one of malicious and non-malicious based on a pre-determined list, wherein a first data instance of the plurality of malicious data instances and non-malicious data instances is identified as malicious in response to identifying a first client associated with the first data instance as malicious based on the pre-determined list, and wherein a second data instance of the plurality of malicious data instances and non-malicious data instances is identified as non-malicious in response to identifying a second client associated with the second data instance as non-malicious based on the pre-determined list.
  - 3. The method of claim 1, wherein the measure of communication activity comprises at least one selected from a group consisting of a number of flows exchanged during a pre-determined length of time, a number of packets exchanged during the pre-determined length of time, and a number of bytes exchanged during the pre-determined length of time.
  - 4. The method of claim 1, wherein the pre-determined machine learning algorithm comprises a support vector machine (SVM) algorithm, and wherein the classification model comprises a decision surface of the SVM.
  - 5. The method of claim 4, wherein the decision surface comprises a maximum margin hyperplane in a multi-dimensional space having a plurality of axes corresponding to the plurality of servers, each of the plurality of axes having coordinates representing the measure of communication activity between a fourth corresponding server and any of the plurality of clients, wherein generating the classification model comprises:
    - representing each data instance of the plurality of malicious data instances and non-malicious data instances by a node in the multi-dimensional space;
      
      formulating a constrained optimization criterion of the SVM algorithm based on the training data set, wherein variables in the constrained optimization criterion comprise a normal vector w of the maximum margin hyperplane and an intercept b of the maximum margin hyperplane;
      
      converting the constrained optimization criterion to an unconstrained formulation using a pre-determined mathematical procedure;
      
      determining the maximum margin hyperplane based on a first mathematical solution of the unconstrained formulation for solving w and b, wherein the maximum margin hyperplane segregates a first plurality of nodes corresponding to a plurality of malicious data instances and a second plurality of nodes corresponding to a plurality of non-malicious data instances, and wherein applying the classification model comprises;
      
      computing a function $f(x_{\text{unclassified}}) = \operatorname{sign}[w^T x_{\text{unclassified}} + b]$ based on the first mathematical solution for w and b, where $x_{\text{unclassified}}$ represents, in a vector format, the third plurality of features of the unclassified data instance and T represents a transpose operator, and determining the classification label based on a value of the function f(x).
  - 6. The method of claim 5, wherein the unconstrained formulation comprises a first term represented by $w^T w$ and a second term represented by $\alpha_i y_i [w^T x_i + b]$, where $\alpha_i$ represents an $i$-th multiplier included in a summation operation in the unconstrained formulation and associated with an $i$-th data instance of the plurality of malicious data instances and non-malicious data instances in the training data set, $y_i$ represents an $i$-th classification label included in the summation operation and associated with the $i$-th data instance, and $x_i$ represents, in the vector format, an $i$-th plurality of features of the $i$-th data instance in the summation operation, wherein the summation operation is performed over $1 \le i \le N$ where N represents a number of data instances in the training data set.
  - 7. The method of claim 6, further comprising:
    - expanding the training data set based on a plurality of additional servers in the network in addition to the plurality of servers by adding a plurality of additional features, corresponding to the plurality of additional servers, to each of the plurality of malicious data instances and non-malicious data instances, each of the plurality of additional features representing the measure of communication activity between an additional server and one of the plurality of clients, wherein adding the plurality of additional features comprises;
      
      adding a first additional feature to expand the malicious data instance, wherein the first additional feature is extracted from first additional network traffic data exchanged between the malicious client and the additional server, the first additional feature representing the measure of communication activity between the malicious client and the additional server; and
      
      adding a second additional feature to expand the non-malicious data instance, wherein the second additional feature is extracted from second additional network traffic data exchanged between the non-malicious client and the additional server, the second additional feature representing the measure of communication activity between the non-malicious client and the additional server;
      
      generating an expanded classification model by at least;
      
      revising the unconstrained formulation of the SVM algorithm to generate a revised unconstrained formulation by at least substituting the first term with $\lambda^2 w^T w + v^T v$ and substituting the second term with $\alpha_i y_i [\lambda (w^T x_i + b) + v^T \hat{x}_i + \hat{b}]$, where $\lambda$ represents a level of consistency between the classification model and the expanded training data set, $v$ and $\hat{b}$ represent a normal vector and an intercept, respectively, of another maximum margin hyperplane in another multi-dimensional space having another plurality of axes corresponding to the plurality of additional servers, each of the another plurality of axes having coordinates representing the measure of communication activity associated with a corresponding additional server, and $\hat{x}_i$ represents, in the vector format, an $i$-th plurality of additional features added to the $i$-th data instance in the summation operation, wherein the summation operation;
      
      determining the another maximum margin hyperplane based on a second mathematical solution of the revised unconstrained formulation for solving $v$ and $\hat{b}$, wherein $v$ and $\hat{b}$ are solved by at least substituting the first mathematical solution for w and b into the revised unconstrained formulation;
      
      expanding the unclassified data instance by at least adding a third additional feature extracted from third additional network traffic data exchanged between the unclassified client and the additional server, the third additional feature representing the measure of communication activity between the unclassified client and the additional server;
      
      generating another classification label of the expanded unclassified data instance by applying the expanded classification model to the expanded unclassified data instance, wherein the another classification label comprises the malicious label; and
      
      further identifying, in response to the another classification label comprising the malicious label, the unclassified client as associated with the botnet.
  - 8. The method of claim 1, further comprising:
    - revising the training data set at least by;
      
      identifying a candidate data instance from the plurality of malicious data instances and non-malicious data instances in the training data set based on a pre-determined criterion; and
      
      removing the candidate data instance from the training data set;
      
      revising the classification model based on the revised training data set, wherein the classification label of the unclassified data instance is generated subsequent to revising the classification model.
  - 9. The method of claim 8, wherein the candidate data instance is identified from the training data set based on a timestamp of when the candidate data instance was added to the training data set.
  - 10. The method of claim 8, wherein the candidate data instance is identified from the training data set based on a pre-determined measure representing contribution from the candidate data instance to the classification model.
  - 11. The method of claim 6, further comprising:
    - revising the training data set at least by;
      
      identifying a candidate data instance based on a pre-determined measure representing contribution from the candidate data instance to the classification model, wherein the candidate data instance is represented as the $i$-th data instance of the plurality of malicious data instances and non-malicious data instances in the training data set, wherein the contribution is determined based on the $i$-th multiplier $\alpha_i$ included in the summation operation and associated with the candidate data instance; and
      
      removing the candidate data instance from the training data set;
      
      revising the classification model based on the revised training data set, wherein the classification label of the unclassified data instance is generated subsequent to revising the classification model.
  - 12. The method of claim 2, further comprising:
    - extracting, from fourth network traffic data exchanged between another malicious client and the plurality of servers, another malicious data instance comprising a fourth plurality of features, corresponding to the plurality of servers, each representing a measure of communication activity between the another malicious client and a fourth corresponding server in the fourth network traffic data, wherein the classification model, when applied to the another malicious data instance, generates the non-malicious label;
      
      revising the training data set at least by including the another malicious data instance in the training data set; and
      
      revising the classification model based on the revised training data set, wherein the classification label of the unclassified data instance is generated subsequent to revising the classification model.
  - 13. The method of claim 12, further comprising:
    - identifying the another malicious client based on an updated version of the pre-determined list.
  - 17. The system of claim 1, wherein the pre-determined machine learning algorithm comprises a support vector machine (SVM) algorithm, and wherein the classification model comprises a decision surface of the SVM.
  - 18. The system of claim 17, wherein the decision surface comprises a maximum margin hyperplane in a multi-dimensional space having a plurality of axes corresponding to the plurality of servers, each of the plurality of axes having coordinates representing the measure of communication activity between a fourth corresponding server and any of the plurality of clients, wherein generating the classification model comprises:
    - representing each data instance of the plurality of malicious data instances and non-malicious data instances by a node in the multi-dimensional space;
      
      formulating a constrained optimization criterion of the SVM algorithm based on the training data set, wherein variables in the constrained optimization criterion comprise a normal vector w of the maximum margin hyperplane and an intercept b of the maximum margin hyperplane;
      
      converting the constrained optimization criterion to an unconstrained formulation using a pre-determined mathematical procedure;
      
      determining the maximum margin hyperplane based on a first mathematical solution of the unconstrained formulation for solving w and b, wherein the maximum margin hyperplane segregates a first plurality of nodes corresponding to a plurality of malicious data instances and a second plurality of nodes corresponding to a plurality of non-malicious data instances, and wherein applying the classification model comprises;
      
      computing a function $f(x_{\text{unclassified}}) = \operatorname{sign}[w^T x_{\text{unclassified}} + b]$ based on the first mathematical solution for w and b, where $x_{\text{unclassified}}$ represents, in a vector format, the third plurality of features of the unclassified data instance and T represents a transpose operator, and determining the classification label based on a value of the function f(x).
  - 19. The system of claim 18, wherein the unconstrained formulation comprises a first term represented by $w^T w$ and a second term represented by $\alpha_i y_i [w^T x_i + b]$, where $\alpha_i$ represents an $i$-th multiplier included in a summation operation in the unconstrained formulation and associated with an $i$-th data instance of the plurality of malicious data instances and non-malicious data instances in the training data set, $y_i$ represents an $i$-th classification label included in the summation operation and associated with the $i$-th data instance, and $x_i$ represents, in the vector format, an $i$-th plurality of features of the $i$-th data instance in the summation operation, wherein the summation operation is performed over $1 \le i \le N$ where N represents a number of data instances in the training data set.
  - 20. The system of claim 19, wherein the feature extractor is further configured to:
    - expand the training data set based on a plurality of additional servers in the network in addition to the plurality of servers by adding a plurality of additional features, corresponding to the plurality of additional servers, to each of the plurality of malicious data instances and non-malicious data instances, each of the plurality of additional features representing the measure of communication activity between an additional server and one of the plurality of clients, wherein adding the plurality of additional features comprises;
      
      adding a first additional feature to expand the malicious data instance, wherein the first additional feature is extracted from first additional network traffic data exchanged between the malicious client and the additional server, the first additional feature representing the measure of communication activity between the malicious client and the additional server; and
      
      adding a second additional feature to expand the non-malicious data instance, wherein the second additional feature is extracted from second additional network traffic data exchanged between the non-malicious client and the additional server, the second additional feature representing the measure of communication activity between the non-malicious client and the additional server; and
      
      expand the unclassified data instance by at least adding a third additional feature extracted from third additional network traffic data exchanged between the unclassified client and the additional server, the third additional feature representing the measure of communication activity between the unclassified client and the additional server, wherein the model generator is further configured to generate an expanded classification model by at least;
      
      revising the unconstrained formulation of the SVM algorithm to generate a revised unconstrained formulation by at least substituting the first term with $\lambda^2 w^T w + v^T v$ and substituting the second term with $\alpha_i y_i [\lambda (w^T x_i + b) + v^T \hat{x}_i + \hat{b}]$, where $\lambda$ represents a level of consistency between the classification model and the expanded training data set, $v$ and $\hat{b}$ represent a normal vector and an intercept, respectively, of another maximum margin hyperplane in another multi-dimensional space having another plurality of axes corresponding to the plurality of additional servers, each of the another plurality of axes having coordinates representing the measure of communication activity associated with a corresponding additional server, and $\hat{x}_i$ represents, in the vector format, an $i$-th plurality of additional features added to the $i$-th data instance in the summation operation, wherein the summation operation; and
      
      determining the another maximum margin hyperplane based on a second mathematical solution of the revised unconstrained formulation for solving $v$ and $\hat{b}$, wherein $v$ and $\hat{b}$ are solved by at least substituting the first mathematical solution for w and b into the revised unconstrained formulation;
      
      wherein the online classifier is further configured to;
      
      generate another classification label of the expanded unclassified data instance by applying the expanded classification model to the expanded unclassified data instance, wherein the another classification label comprises the malicious label; and
      
      further identify, in response to the another classification label comprising the malicious label, the unclassified client as associated with the botnet.
  - 24. The system of claim 19, wherein the feature extractor is further configured to:
    - revise the training data set at least by;
      
      identifying a candidate data instance based on a pre-determined measure representing contribution from the candidate data instance to the classification model, wherein the candidate data instance is represented as the $i$-th data instance of the plurality of malicious data instances and non-malicious data instances in the training data set, wherein the contribution is determined based on the $i$-th multiplier $\alpha_i$ included in the summation operation and associated with the candidate data instance; and
      
      removing the candidate data instance from the training data set;
      
      revise the classification model based on the revised training data set, wherein the classification label of the unclassified data instance is generated subsequent to revising the classification model.
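
The SVM claims name only the two terms of the unconstrained formulation, not the full expression. As a hedged reading, assume that the "pre-determined mathematical procedure" is the method of Lagrange multipliers and that labels are encoded as $y_i \in \{+1, -1\}$. Neither assumption is stated in the claims. Under them, the textbook hard-margin SVM would take this form:

```latex
\begin{aligned}
\text{constrained:}\quad & \min_{w,\,b}\ \tfrac{1}{2}\, w^T w
  \quad \text{s.t.}\quad y_i\,[w^T x_i + b] \ge 1,\quad 1 \le i \le N \\
\text{unconstrained:}\quad & L(w, b, \alpha) = \tfrac{1}{2}\, w^T w
  - \sum_{i=1}^{N} \alpha_i \bigl( y_i\,[w^T x_i + b] - 1 \bigr),\quad \alpha_i \ge 0 \\
\text{decision:}\quad & f(x_{\text{unclassified}})
  = \operatorname{sign}\bigl[\, w^T x_{\text{unclassified}} + b \,\bigr]
\end{aligned}
```

In this standard form the solution satisfies $w = \sum_i \alpha_i y_i x_i$, so an instance with $\alpha_i = 0$ has no influence on the hyperplane. That is one plausible reason the claims use the multiplier $\alpha_i$ as the measure of an instance's contribution to the model.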

14. A system for botnet detection in a network, comprising:
- a hardware processor;
  
  a feature extractor executing on the hardware processor and configured to;
  
  extract, from first network traffic data exchanged between a malicious client and a plurality of servers in the network, a malicious data instance comprising a first plurality of features, corresponding to the plurality of servers, each representing a measure of communication activity between the malicious client and a first corresponding server in the first network traffic data;
  
  extract, from second network traffic data exchanged between a non-malicious client and the plurality of servers, a non-malicious data instance comprising a second plurality of features, corresponding to the plurality of servers, each representing the measure of communication activity between the non-malicious client and a second corresponding server in the second network traffic data;
  
  include the malicious data instance and the non-malicious data instance in a training data set comprising a plurality of malicious data instances and non-malicious data instances, wherein each data instance of the plurality of malicious data instances and non-malicious data instances is associated with one of a plurality of clients comprising the malicious client and the non-malicious client; and
  
  extract, from third network traffic data exchanged between an unclassified client and the plurality of servers, an unclassified data instance comprising a third plurality of features, corresponding to the plurality of servers, each representing the measure of communication activity between the unclassified client and a third corresponding server in the third network traffic data;
  
  a model generator operatively coupled to the feature extractor, executing on the hardware processor, and configured to;
  
  generate, using a pre-determined machine learning algorithm, a classification model based on the training data set, wherein the classification model is adapted to, when applied to one or more malicious data instance, generate a malicious label, wherein the classification model is further adapted to, when applied to one or more non-malicious data instance, generate a non-malicious label;
  
  an online classifier operatively coupled to the model generator, executing on the hardware processor, and configured to;
  
  generate a classification label of the unclassified data instance by applying the classification model to the unclassified data instance, wherein the classification label comprises the malicious label; and
  
  identify, in response to the classification label comprising the malicious label, the unclassified client as associated with a botnet; and
  
  a repository coupled to the online classifier and configured to store the plurality of malicious data instances and non-malicious data instances, the unclassified data instance, and the classification model.
- View Dependent Claims (15, 16, 21, 22, 23, 25, 26)
- - 15. The system of claim 14, further comprising an acquisition module configured to:
    - identify each of the plurality of clients as one of malicious and non-malicious based on a pre-determined list; and
      
      obtain the first, second, and third network traffic data from the network, wherein a first data instance of the plurality of malicious data instances and non-malicious data instances is identified as malicious in response to identifying a first client associated with the first data instance as malicious based on the pre-determined list, and wherein a second data instance of the plurality of malicious data instances and non-malicious data instances is identified as non-malicious in response to identifying a second client associated with the second data instance as non-malicious based on the pre-determined list.
  - 16. The system of claim 14, wherein the measure of communication activity comprises at least one selected from a group consisting of a number of flows exchanged during a pre-determined length of time, a number of packets exchanged during the pre-determined length of time, and a number of bytes exchanged during the pre-determined length of time.
  - 21. The system of claim 14, wherein the feature extractor is further configured to:
    - revise the training data set at least by;
      
      identifying a candidate data instance from the plurality of malicious data instances and non-malicious data instances in the training data set based on a pre-determined criterion; and
      
      removing the candidate data instance from the training data set;
      
      revise the classification model based on the revised training data set, wherein the classification label of the unclassified data instance is generated subsequent to revising the classification model.
  - 22. The system of claim 21, wherein the candidate data instance is identified from the training data set based on a timestamp of when the candidate data instance was added to the training data set.
  - 23. The system of claim 21, wherein the candidate data instance is identified from the training data set based on a pre-determined measure representing contribution from the candidate data instance to the classification model.
  - 25. The system of claim 15, wherein the feature extractor is further configured to:
    - extract, from fourth network traffic data exchanged between another malicious client and the plurality of servers, another malicious data instance comprising a fourth plurality of features, corresponding to the plurality of servers, each representing a measure of communication activity between the another malicious client and a fourth corresponding server in the fourth network traffic data, wherein the classification model, when applied to the another malicious data instance, generates the non-malicious label;
      
      revise the training data set at least by including the another malicious data instance in the training data set; and
      
      revise the classification model based on the revised training data set, wherein the classification label of the unclassified data instance is generated subsequent to revising the classification model.
  - 26. The system of claim 25, wherein the acquisition module is further configured to:
    - identify the another malicious client based on an updated version of the pre-determined list.
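
The training-set revisions recited in the dependent claims (removal by timestamp, removal by contribution, and addition of missed malicious instances) can be sketched as below. This is an illustration under stated assumptions, not the patented implementation. The `DataInstance` type, the function names, and the thresholds are hypothetical. The claims do not say whether high- or low-contribution instances are removed, so dropping instances with a small multiplier is an assumption.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DataInstance:
    client: str
    features: List[float]
    label: int          # +1 malicious, -1 non-malicious (assumed encoding)
    added_at: float     # time the instance joined the training set
    alpha: float = 0.0  # multiplier from the most recent training run


def prune_by_age(training_set, now, max_age):
    """Drop instances added more than max_age time units before now."""
    return [d for d in training_set if now - d.added_at <= max_age]


def prune_by_contribution(training_set, min_alpha):
    """Drop instances whose multiplier falls below min_alpha (assumed policy)."""
    return [d for d in training_set if d.alpha >= min_alpha]


def add_missed_malicious(training_set, candidates, predict):
    """Append known-malicious instances that the current model labels non-malicious."""
    missed = [d for d in candidates if d.label == +1 and predict(d.features) == -1]
    return training_set + missed


# Hypothetical training set with timestamps and multipliers.
training_set = [
    DataInstance("a", [5, 0], +1, added_at=10.0, alpha=0.8),
    DataInstance("b", [0, 5], -1, added_at=90.0, alpha=0.0),
    DataInstance("c", [1, 6], -1, added_at=95.0, alpha=0.6),
]
recent = prune_by_age(training_set, now=100.0, max_age=30.0)
influential = prune_by_contribution(training_set, min_alpha=0.1)
```

After any of these revisions the model would be retrained on the revised set before the next unclassified instance is labeled, as the claims require.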

27. A non-transitory computer readable medium storing instructions for identifying a botnet in a network, the instructions, when executed by a processor of a computer, comprising functionality for:
- extracting, from first network traffic data exchanged between a malicious client and a plurality of servers in the network, a malicious data instance comprising a first plurality of features, corresponding to the plurality of servers, each representing a measure of communication activity between the malicious client and a first corresponding server in the first network traffic data;
  
  extracting, from second network traffic data exchanged between a non-malicious client and the plurality of servers, a non-malicious data instance comprising a second plurality of features, corresponding to the plurality of servers, each representing the measure of communication activity between the non-malicious client and a second corresponding server in the second network traffic data;
  
  including the malicious data instance and the non-malicious data instance in a training data set comprising a plurality of malicious data instances and non-malicious data instances, wherein each data instance of the plurality of malicious data instances and non-malicious data instances is associated with one of a plurality of clients comprising the malicious client and the non-malicious client;
  
  generating, using a pre-determined machine learning algorithm, a classification model based on the training data set, wherein the classification model is adapted to, when applied to one or more malicious data instance, generate a malicious label, wherein the classification model is further adapted to, when applied to one or more non-malicious data instance, generate a non-malicious label;
  
  extracting, from third network traffic data exchanged between an unclassified client and the plurality of servers, an unclassified data instance comprising a third plurality of features, corresponding to the plurality of servers, each representing a measure of communication activity between the unclassified client and a third corresponding server in the third network traffic data;
  
  generating a classification label of the unclassified data instance by applying the classification model to the unclassified data instance, wherein the classification label comprises the malicious label; and
  
  identifying, in response to the classification label comprising the malicious label, the unclassified client as associated with a botnet.
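
The train-then-classify flow of the independent claims can be sketched end to end as below. The sketch swaps in a different solver from the one in the SVM claims: it fits a linear SVM by per-sample subgradient descent on the regularized hinge loss rather than by solving the multiplier-based formulation. The function names, hyperparameters, label encoding (+1 malicious, -1 non-malicious), and sample data are all assumptions.

```python
def train_linear_svm(instances, labels, epochs=50, lr=0.1, lam=0.01):
    """Fit w and b by per-sample subgradient descent on the regularized hinge loss.

    labels are +1 (malicious) or -1 (non-malicious); this encoding is an assumption.
    """
    w = [0.0] * len(instances[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(instances, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin < 1:
                # Inside the margin or misclassified: move the hyperplane toward y * x.
                w = [wi + lr * (y * xi - lam * wi) for wi, xi in zip(w, x)]
                b += lr * y
            else:
                # Correct with enough margin: apply only the regularization shrink.
                w = [wi * (1 - lr * lam) for wi in w]
    return w, b


def classify(w, b, x):
    """Label x by sign(w^T x + b); a score of zero is treated as non-malicious."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return "malicious" if score > 0 else "non-malicious"


# Hypothetical per-server flow counts for two servers.
train_x = [[5, 0], [6, 1], [0, 5], [1, 6]]
train_y = [+1, +1, -1, -1]
w, b = train_linear_svm(train_x, train_y)
label = classify(w, b, [7, 0])
```

A client whose instance receives the malicious label would then be identified as associated with a botnet, which is the final step of each independent claim.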

Current Assignee: The Boeing Co.
Original Assignee: Narus, Inc. (Gen Digital Inc.)
Inventors: Ranjan, Supranamaya; Chen, Feilong
Primary Examiner(s): Vu, Kim
Assistant Examiner(s): King, John B
Application Number: US13/072,290
Time in Patent Office: 725 Days
Field of Search: 726/1-3, 726/11-14, 726/22-25, 713/188, 709/223-225
US Class Current: 726/23
CPC Class Codes:
H04L 2463/144 Detection or countermeasure...
H04L 63/1416 Event detection, e.g. attac...
