Detect cyber incidents with machine learning: our model in 5 key steps!

As the role of Artificial Intelligence grows in companies, from predictive maintenance to price optimization, new so-called ‘intelligent’ tools are being developed for cybersecurity. How do these tools exploit recent developments in Machine Learning? What steps should be taken to develop an intelligent and, above all, relevant detection solution in this context?

From static detection methods to behavioral analysis

As attacks evolve ever faster and become increasingly sophisticated, the SOC (Security Operations Center) is forced to rethink its approach and its existing tools, because static detection mechanisms are becoming obsolete:

  • The historical approach relies on recognizing known behaviors and fingerprints (e.g. malware signatures). This method, called misuse-based detection, produces explicit alerts that are easy for operational staff to analyse, but it can only recognize attacks that have already occurred and been documented.
  • The new approach aims to analyse actions that deviate from the behavior normally observed, without having to explicitly and exhaustively define what a malicious act is (e.g. the behavior of an individual who deviates from that of their colleagues). This anomaly-based approach makes it possible to detect attacks that have never been explicitly described to the tools, but it requires high volumes of data.

The anomaly-based approach exploits the correlation capabilities of unsupervised learning algorithms, which highlight links within unlabeled data (i.e. data not categorized as normal or abnormal).

Recipe: anomaly detection on a bed of machine learning

To know whether Machine Learning is appropriate for your context, the best solution is to build a PoC (Proof of Concept). How do you implement it? What are the key points to look out for? Here are the key steps in our development.

Starter, main or dessert: define the use case

Doing Machine Learning is good, knowing why is better. Defining a use case means answering the question ‘What do you want to observe?’ and determining the means available to answer it.

In our context, a use case is a threat scenario involving one or more groups of accounts (malicious administrators, exfiltration of sensitive data, etc.). To evaluate these scenarios, several criteria must be taken into consideration:

  • Utility: what would be the impact if the scenario were to happen?
  • Data availability: what are the available sources of useful data?
  • Data complexity: is the available data structured (numbers, tables) or unstructured (images, text)?

We chose to work on the compromise of service accounts: some of them hold extensive rights, and their automated actions generate relatively structured data. In the context of a PoC, a limited scope and homogeneous, easily accessible data sources are essential to obtain concrete, exploitable results before considering more ambitious use cases.

Ingredient weighing: determine the data model

In order to make the best use of the data, it is necessary to define the behavior to be modeled from the available information. This is where business expertise comes in: can an isolated action be a sign of compromise, or must a series of actions be considered to detect malicious behavior?

We started with a model based on the analysis of individual logs grouped by family (e.g. connections, access to resources, etc.) to evaluate overall behavior. However, a model that is too simple will miss weak signals hidden in the correlations between actions, while a representation that is too complex will increase processing time and be more sensitive to modeling biases.
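To make this concrete, here is a minimal sketch of the principle, assuming hypothetical log fields (account, event_type, timestamp): individual logs are aggregated into one behavior vector per account and per hour. This is only an illustration, not the PoC's actual data model.

```python
# Minimal sketch (not the PoC's actual model): aggregate raw logs into one
# feature vector per account and per hour. Field names are hypothetical.
import pandas as pd

# Toy logs; in practice these would come from the log collector / SIEM.
logs = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2023-01-02 08:05", "2023-01-02 08:20", "2023-01-02 09:10",
        "2023-01-02 08:15", "2023-01-02 23:45",
    ]),
    "account": ["svc_backup", "svc_backup", "svc_backup", "svc_web", "svc_web"],
    "event_type": ["logon", "file_access", "logon", "logon", "file_access"],
})

# Count each event type per account and per hour: one row then describes
# "what this account did during this hour".
features = (
    logs.assign(hour=logs["timestamp"].dt.floor("h"))
        .groupby(["account", "hour", "event_type"])
        .size()
        .unstack(fill_value=0)
)
print(features)
```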

Selection of tools: choose the algorithm

Several types of algorithms can be used to detect anomalies:

  • Some try to isolate each point: if a point is easy to isolate, it is far from the others and therefore more abnormal (a sketch of this family is given after this list).
  • Clustering algorithms create groups of points that look alike, then calculate the center of gravity of each group to represent the average behavior: if a point is too far from its center, it is considered abnormal.
  • Less common, autoencoders are artificial neural networks that learn to reproduce normal behavior with fewer parameters: the reconstruction error can then be used as an anomaly score.
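As an illustration of the first family, a well-known representative is the Isolation Forest algorithm. Here is a minimal sketch with scikit-learn on toy numeric data (illustrative only, not necessarily the setup used in our PoC): the lower the score, the easier the point is to isolate and the more abnormal it is.

```python
# Minimal sketch of the point-isolation family with scikit-learn's
# IsolationForest on toy data (illustrative, not the PoC's real features).
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
normal = rng.normal(loc=0.0, scale=1.0, size=(200, 2))   # usual behavior
strays = rng.uniform(low=-6, high=6, size=(5, 2))        # a few odd points
X = np.vstack([normal, strays])

model = IsolationForest(n_estimators=100, random_state=42).fit(X)

# score_samples: the lower the score, the easier the point is to isolate,
# hence the more abnormal it is considered.
scores = model.score_samples(X)
print("Indices of the 5 most abnormal points:", np.argsort(scores)[:5])
```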

Other approaches also exist, including more exotic ones such as artificial immune systems, which mimic biological mechanisms to create an evolving detection tool. However, it should not be forgotten that a simple, well-optimized tool is often more effective than an overly complex one.

We selected the k-means clustering algorithm: already used in bank fraud detection, it simplifies re-training, which allows the tool to remain adaptable as behaviors change.
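The idea can be sketched as follows with scikit-learn, on toy data: the distance of a point to its nearest cluster center serves as its anomaly score. The number of clusters and the alert threshold shown here are purely illustrative.

```python
# Minimal sketch of k-means-based anomaly scoring with scikit-learn
# (toy data; the number of clusters and the threshold are arbitrary).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
group_a = rng.normal(loc=[0, 0], scale=0.3, size=(100, 2))  # first behavior group
group_b = rng.normal(loc=[3, 3], scale=0.3, size=(100, 2))  # second behavior group
strays = rng.uniform(low=-2, high=5, size=(5, 2))           # a few odd points
X = np.vstack([group_a, group_b, strays])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# transform() gives the distance of each point to every centroid;
# the distance to the nearest centroid is used as the anomaly score.
scores = kmeans.transform(X).min(axis=1)

# Flag the points whose score exceeds, say, the 97th percentile.
threshold = np.percentile(scores, 97)
print(f"{(scores > threshold).sum()} points flagged as abnormal")
```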

All these algorithms can also be enhanced, depending on the chosen behavior model, to take a series of actions into account. For example, convolutional or recurrent neural networks can be added upstream to handle time series.

Preparation of ingredients: transform the data

Once the algorithm has been selected, the raw data must be processed to make it usable. This process is carried out in several steps:

  • Cleaning: correction of parsing errors, removal of unnecessary information and addition of missing information.
  • Enrichment: adding data from other sources and reprocessing fields to highlight information (e.g. indicating whether a date is a public holiday…).
  • Transformation: creation of binary columns for qualitative data (e.g. account name, event type, etc.) that cannot be directly transformed into numbers (one column for each unique value, indicating whether the value is present or not).
  • Normalization: rescaling the values so that they all fall between 0 and 1 (to prevent one field from outweighing the others). The last two steps are sketched just after this list.
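A minimal sketch of these last two steps with pandas and scikit-learn, assuming hypothetical fields (account, event_type, bytes_sent):

```python
# Minimal sketch of the transformation and normalization steps
# (hypothetical column names, not the PoC's real schema).
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

events = pd.DataFrame({
    "account": ["svc_backup", "svc_web", "svc_backup"],
    "event_type": ["logon", "file_access", "logon"],
    "bytes_sent": [1200, 850000, 3100],
})

# Transformation: one binary column per unique value of each qualitative field.
encoded = pd.get_dummies(events, columns=["account", "event_type"])

# Normalization: rescale every column to [0, 1] so that no field
# (e.g. bytes_sent) outweighs the binary ones.
scaled = pd.DataFrame(MinMaxScaler().fit_transform(encoded),
                      columns=encoded.columns)
print(scaled)
```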

Due to the variety of possible events and the complexity of the logs, we chose to automate this process: for each field, the algorithm detects the type of data and selects the appropriate transformation from a predefined library. The operator can then interact with the tool to modify that choice before continuing the process.
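The logic behind this automation can be sketched as follows (a simplified illustration, not the actual tool, with only two transformations in the "library"):

```python
# Simplified illustration of automatic transformation selection
# (not the actual tool): inspect each column's type and pick a
# transformation from a small predefined library.
import pandas as pd
from pandas.api.types import is_numeric_dtype
from sklearn.preprocessing import MinMaxScaler

def auto_transform(df: pd.DataFrame) -> pd.DataFrame:
    """Return a numeric, normalized version of df, choosing a
    transformation per column based on its detected type."""
    pieces = []
    for column in df.columns:
        if is_numeric_dtype(df[column]):
            # Quantitative field -> rescale to [0, 1].
            values = MinMaxScaler().fit_transform(df[[column]])
            pieces.append(pd.DataFrame(values, columns=[column], index=df.index))
        else:
            # Qualitative field -> one binary column per unique value.
            pieces.append(pd.get_dummies(df[column], prefix=column))
    # In the real tool, the operator can review and override these choices
    # before the process continues.
    return pd.concat(pieces, axis=1)
```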

Seasoning: test and optimize the tool

Once the model has been defined, the algorithm chosen and the data transformed, the resulting tool should be able to raise alerts on anomalies. Do these alerts make sense, or are they false positives?

In order to evaluate the performance of the tool, we performed two types of tests:

  • Intrusion simulation, by performing malicious actions to check whether they are detected as abnormal (this approach can also be carried out by injecting “false” logs directly into the data sets, as sketched after this list).
  • Analysis of anomalies by checking whether the alerts raised actually correspond to malicious behavior.
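A minimal sketch of the “false logs” variant of the first test, reusing the toy k-means scoring idea from above: forged abnormal records are appended to the data set, and we check whether they end up with the highest anomaly scores.

```python
# Minimal sketch of the forged-log test, reusing the toy k-means scoring
# from above (illustrative only, not the PoC's actual test harness).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(1)
normal = rng.normal(loc=[0, 0], scale=0.3, size=(200, 2))

# Inject a few forged "malicious" records far from the normal behavior.
forged = np.array([[4.0, 4.0], [5.0, -3.0], [-4.0, 4.5]])
X = np.vstack([normal, forged])
is_forged = np.array([False] * len(normal) + [True] * len(forged))

kmeans = KMeans(n_clusters=1, n_init=10, random_state=1).fit(X)
scores = kmeans.transform(X).min(axis=1)

# Check that the forged records rank among the highest anomaly scores.
top = np.argsort(scores)[-len(forged):]
print("Forged records detected:", int(is_forged[top].sum()), "out of", len(forged))
```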

Many parameters can be adjusted in these algorithms to refine detection. Performance optimization is an iterative process: change the parameters and observe the effect on a validation data set. Time-consuming when done manually, it can be accelerated by the AutoML approach, which seeks to automate some of these steps using optimization algorithms.
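Here is one minimal sketch of such an iterative loop, assuming the k-means model above and using the silhouette score as the selection criterion (one possible choice among many; the PoC's actual criteria may differ):

```python
# Minimal sketch of an iterative tuning loop: try several values of k and
# keep the one with the best silhouette score on validation data.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.RandomState(2)
X_val = np.vstack([
    rng.normal(loc=[0, 0], scale=0.3, size=(100, 2)),
    rng.normal(loc=[3, 3], scale=0.3, size=(100, 2)),
    rng.normal(loc=[0, 4], scale=0.3, size=(100, 2)),
])

best_k, best_score = None, -1.0
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=2).fit_predict(X_val)
    score = silhouette_score(X_val, labels)
    if score > best_score:
        best_k, best_score = k, score

print(f"Best k = {best_k} (silhouette = {best_score:.2f})")
```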

However, parameter optimization is not enough: the results of our PoC have shown that the quality of detection based on behavioral analysis depends largely on the relevance of the behaviors defined before the algorithm is developed.

ML or not ML: that may not be the question

Despite its undeniable advantages, Machine Learning is a tool to be used rationally: frameworks are becoming increasingly accessible and easy to use, but defining the use case and the behavior model remain crucial steps. These choices, where business expertise is essential, will lastingly influence the choice of data, the selection of the detection algorithm and the tests to be performed.

The question is no longer ‘Where can I put Machine Learning in my SOC?’, but rather ‘Of all the approaches available, which is the most effective to address my problem?’.

To find out, there is only one solution: fire up the stoves!

To go further…

… here are the tools used during our PoC:

  • IDE
    • PyCharm: a clear and practical development environment with efficient library management
  • Language
    • Python: a language widely used in the field of Data Science with many powerful libraries
  • Libraries
    • Scikit-learn: complete Machine Learning library (supervised, unsupervised…)
    • Pandas: complex processing of data tables
    • NumPy: manipulation of matrices and vectors
    • Matplotlib, Seaborn: plotting graphs for data visualization