Machine learning - RiskInsight

Leaking Minds: How Your Data Could Slip Through AI Chatbots

Jeanne PIGASSOU — Wed, 21 May 2025 14:21:32 +0000

OpenAI’s flagship ChatGPT was in the news 18 months ago for accidentally leaking a CEO’s personal information after being asked to repeat a word forever. This is one of the many exploits that have been discovered in recent months.

Figure 1 : Example of the Leaking exploit found in ChatGPT in December

Scandals like these highlight a deeper truth: the core architecture of Large Language Models (LLMs) such as GPT and Google’s Gemini is inherently prone to data leakage. This leakage can involve Personally Identifiable Information (PII) or confidential company data. Although the techniques used by attackers will continue to evolve in response to improved defenses from tech giants, the underlying vectors remain unchanged.

Today, there are three main vectors through which PII or other sensitive data might be exposed to such attacks:

The use of publicly available web content in training datasets
The continuous re-training of models using user prompts and conversations
The introduction of persistent memory features in chatbots

LLM Pre-Training Data Leakage 

Most models available right now are transformer models, specifically GPTs or Generative Pre-Trained Transformers. The “Pre-Trained” in GPT refers to the initial training phase, where the model is exposed to a massive, diverse corpus of data unrelated to its final application. This helps the model learn foundational knowledge such as grammar, vocabulary, and factual information. When GPTs were first released, companies were transparent about where this training data came from. Today, however, the datasets behind the largest models are too large and too diverse to trace easily, and they are often kept confidential.

A major source of the data used in GPT pre-training is online forums such as Reddit (for Google’s models), Stack Overflow, and other social media platforms. This poses a significant risk since these forums often contain PII. Although companies claim to filter out PII during training, there have been many instances where LLMs have leaked personal data from their pre-training corpus to users after some prompt engineering and jailbreaking. This danger will grow as companies race to gather more data through web scraping to train larger and more sophisticated models.

Known leaks of this type are mostly uncovered by researchers who develop increasingly creative methods to bypass the defenses of chatbots. The example mentioned earlier is one such case. When prompted to repeat a word forever, the chatbot “forgets” its task and begins to exhibit a behavior known as memorization. In this state, the chatbot regurgitates data from its training set. While this attack has been patched, new prompting techniques that change the behavior of the chatbot continue to be found.
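One simple way to test whether a chatbot output is regurgitated rather than generated is to look for long verbatim overlaps with a reference corpus. The sketch below is only an illustration of that idea: the function names, the whitespace tokenization and the window size `n` are assumptions, and a real check would need a far larger corpus and an efficient index.

```python
def ngrams(tokens, n):
    """Return the set of all contiguous n-token windows (empty if too short)."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def verbatim_overlap(output, corpus_docs, n=8):
    """Return the n-token spans of `output` that appear verbatim in the corpus."""
    corpus_index = set()
    for doc in corpus_docs:
        corpus_index |= ngrams(doc.split(), n)
    shared = ngrams(output.split(), n) & corpus_index
    return [" ".join(span) for span in sorted(shared)]
```

A non-empty result suggests that part of the output was copied from known text; an empty result proves nothing, since the source may simply not be in the reference corpus.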

User Input Re-Usage and Re-Training 

User input re-training is the process of continuously improving the LLM by training it on user inputs. This can be done in several ways, the most popular of which is Reinforcement Learning from Human Feedback (RLHF).

Figure 3 : The feedback buttons used for RLHF in ChatGPT

This method relies on collecting user feedback on the LLM’s output. Many users will have seen the “Thumbs Up” and “Thumbs Down” buttons in ChatGPT or other LLM platforms.

These buttons collect feedback from the user, which is then used to re-train the model. If the user marks the response as positive, the platform takes the user input / model output pair and encourages the model to replicate the behavior. Similarly, if the user indicates that the model performed poorly, the pair is used to discourage the model from replicating the behavior.

However, continuous re-training can also occur without any user interaction. Models may occasionally use user input / model output pairs to re-train in seemingly random ways. The lack of transparency from model providers and developers makes it difficult to pinpoint exactly how this happens. However, many users across the internet have reported models gaining new knowledge through re-training on other users’ chats, as far back as 2022. For example, OpenAI’s GPT-3.5 should not know about anything that happened after September 2021, its cut-off date. Yet asking it about more recent events, such as Elon Musk’s new position as CEO of Twitter (now X), tells a different story: it confidently answers the question accurately.

Essentially, what this means for end users is that their chats are not kept confidential: any information given to the LLM through internal documents, meeting minutes or development codebases may show up in the chats of other users, and thus leak. This poses significant privacy risks not only for individuals but also for companies, many of which, like Samsung, have already taken action. In April 2023, Samsung banned the use of ChatGPT and similar chatbots after a group of employees used the tool for coding assistance and summarizing meeting notes. Although Samsung has no concrete evidence that the data was used by OpenAI, the potential risk was deemed too high to allow employees to continue using the tool. This is a classic example of Shadow AI, where unauthorized use of AI tools leads to the possible leakage of confidential or proprietary information.

Many companies globally are waiting for stricter AI and data regulations before using LLMs for commercial use. We are seeing certain industries such as consulting open up but at an incredibly slow pace. Other companies, however, are tightening their control over internal LLM use to avoid leaking confidential data and client information. 

Memory Persistence

While the two previous risks have been recognized for a few years, a new threat emerged with a feature that ChatGPT introduced in September 2024. This feature enables the model to retain long-term memory of user conversations. The idea is to reduce redundancy by allowing the chatbot to remember user preferences, context, and previous interactions, thereby improving the relevance and personalization of responses.

However, this convenience comes at a significant security cost. Unlike earlier cases, where leaked information was more or less random, persistent memory introduces account-level targeting. Now, attackers could potentially exploit this memory to extract specific details from a particular user’s history, significantly raising the stakes.

Security researcher Johann Rehberger demonstrated how this vulnerability could be exploited through a technique known as context poisoning. In his proof of concept, he hosted a malicious image containing hidden instructions on a website. Once the targeted chatbot views the URL, its persistent memory is poisoned. The planted instruction then manipulates the chatbot into extracting sensitive information from the victim’s conversation history and transmitting it to an external URL.

This attack is particularly dangerous because it combines persistence and stealth. Once it infiltrates the chatbot, it remains active indefinitely, continuously exfiltrating user data until the memory is cleaned. At the same time, it is subtle enough to go unnoticed, requiring careful human analysis of the memory to be detected.
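Part of that human review of the memory can be supported by a simple triage script. The sketch below assumes the stored memories have been exported as a plain list of strings, which is a hypothetical format, and the two patterns are illustrative heuristics rather than a vetted detection rule set.

```python
import re

# Any URL inside a stored "memory" is worth a look: preferences rarely need one.
URL_PATTERN = re.compile(r"https?://\S+", re.IGNORECASE)
# Hypothetical phrases that read like standing instructions rather than facts about the user.
INSTRUCTION_PATTERN = re.compile(
    r"\b(always|from now on|every (?:message|response|reply)|send|append|include)\b",
    re.IGNORECASE,
)


def review_memories(memories):
    """Return (entry, reasons) pairs for memory entries that deserve a human look."""
    flagged = []
    for entry in memories:
        reasons = []
        if URL_PATTERN.search(entry):
            reasons.append("contains a URL")
        if INSTRUCTION_PATTERN.search(entry):
            reasons.append("reads like a standing instruction")
        if reasons:
            flagged.append((entry, reasons))
    return flagged
```

Such heuristics only narrow down what a person has to read. A carefully worded malicious memory can avoid both patterns, so flagged entries are a starting point, not a verdict.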

LLM Data Privacy and Mitigation

LLM developers often intentionally make it hard to disable re-training since it benefits their LLM development. If your personal information is already out in public, it has probably been scraped and used for pre-training an LLM. Additionally, if you gave ChatGPT or another LLM a confidential document in your prompt (without manually turning re-training OFF), it has most probably been used for re-training. 

Currently, there is no reliable technique that allows an individual to request the deletion of their data once it has been used for model training. Addressing this challenge is the goal of an emerging research area known as Machine Unlearning. This field focuses on developing methods to selectively remove the influence of specific data points from a trained model, effectively deleting that data from the model’s memory. The field is evolving rapidly, particularly in response to the GDPR, which enforces the right to erasure. Until such methods mature, it is important to minimize these risks by controlling what data individuals and organizations put out on the internet and what information employees add to their prompts.

It is vital for many business operations to stay confidential. However, the productivity boost that LLMs add to employee workflows cannot be overlooked. For this reason, we constructed a 3-step framework to ensure that organizations can harness the power of LLMs without losing control over their data. 

Choose the optimal model, environment and configuration

Ensure that the environment and model you are using are well secured. Check the model’s data retention period and the provider’s policy on re-training on user conversations. Where available, make sure that “Auto-delete” is set to ON and “Chat History” is set to OFF.

At Wavestone we made a tool that compares the top 3 closed-source and open-source models in terms of pricing, data retention period, guard rails, and confidentiality to empower organizations in their AI journey. 

Raise employee awareness on best practices when using LLMs 

Ensure that your employees know the danger of providing confidential and client information to LLMs, and what they can do to keep corporate or personal information out of an LLM’s pre-training and re-training data corpus.

Implement a robust AI policy  

Forward-looking companies should implement a robust internal AI policy that covers:

What information can and can’t be shared with LLMs internally
How AI behavior is monitored
How the company limits its online presence
How prompt data is anonymized
Which secure AI tools are approved for use
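To make the prompt anonymization item more concrete, here is a minimal sketch of a redaction step that could run before a prompt leaves the organization. The patterns and the `anonymize_prompt` helper are illustrative assumptions: regular expressions catch only well-formatted identifiers, and names, addresses or free-text secrets would need dedicated tooling.

```python
import re

# Deliberately simple patterns. Order matters: IBANs are replaced first so that
# the phone pattern does not match the digits inside them.
PII_PATTERNS = {
    "IBAN": re.compile(r"\b[A-Z]{2}\d{2}[A-Z0-9]{10,30}\b"),
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "PHONE": re.compile(r"\+?\d[\d .-]{7,}\d"),
}


def anonymize_prompt(prompt):
    """Replace matches with placeholders; return the clean prompt and per-type counts."""
    counts = {}
    for label, pattern in PII_PATTERNS.items():
        prompt, replaced = pattern.subn(f"[{label}]", prompt)
        counts[label] = replaced
    return prompt, counts
```

Returning the counts alongside the cleaned prompt makes it possible to log how often employees attempt to share sensitive data without storing the data itself.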

By following these steps, organizations can minimize the digital risk they face by using the latest GenAI tools while also benefiting from their productivity increases.

Moving Forward 

Although the data privacy vulnerabilities mentioned in this article impact individuals like you and me, their cause is the LLM developers’ greed for data. This greed produces higher-quality end products but at the cost of data privacy and autonomy.

New regulations and frameworks have emerged to address this issue, such as the EU AI Act and the OWASP Top 10 for LLM Applications. However, relying solely on responsible governance is not enough. Individuals and organizations must actively recognize the critical role PII plays in today’s digital landscape and take proactive steps to protect it. This is especially important as we move toward more agentic AI systems, which autonomously interact with multiple third-party services. Not only will these systems process an increasing amount of personal and sensitive data, but this data will also be transmitted and handled by numerous different services, complicating oversight and control.

The article Leaking Minds: How Your Data Could Slip Through AI Chatbots first appeared on RiskInsight.

Adopting MLSecOps: the key to reliable and secure AI models

Pierre Aubret — Fri, 25 Oct 2024 14:57:34 +0000

Artificial intelligence (AI) now occupies a central place in the products and services offered by businesses and public services, largely thanks to the rise of generative AI. To support this growth and encourage the adoption of AI, it has been necessary to industrialize the design of AI systems by adapting model development methods and procedures.

This gave rise to MLOps, a contraction of “Machine Learning” (the heart of AI systems) and “Operations”. Like DevOps, MLOps facilitates the success of Machine Learning projects while ensuring the production of high-performance models.

However, it is crucial to guarantee the security of the algorithms so that they remain efficient and reliable over time. To achieve this, it is necessary to evolve from MLOps to MLSecOps, by integrating security into processes in the same way as DevSecOps. Few organisations have adopted and deployed a complete MLSecOps process. In this article, we explore in detail the form that MLSecOps could take.

MLOps, the fundamentals of AI model development

Closer links with DevOps

DevOps is an approach that combines software development (Dev) and IT operations (Ops). Its aim is to shorten the development lifecycle while ensuring continuous high-quality delivery. Key principles include process automation (development, testing and release), continuous delivery (CI/CD) and fast feedback loops.

MLOps is an extension of DevOps principles applied specifically to Machine Learning (ML) projects. Workflows are simplified and automated as far as possible, from the preparation of training data to the management of models in production. MLOps differs from DevOps in several ways:

Importance of data and models: In Machine Learning, data and models are crucial. MLOps goes a step further by automating all the stages of Machine Learning, from data preparation to the training phases. What’s more, a larger volume of data is often used in Machine Learning projects.

Experimental nature of development: Development in Machine Learning is experimental and involves continually testing and adjusting models to find the best algorithms, parameters and relevant data for learning. This poses challenges for adapting DevOps to Machine Learning, as DevOps focuses on process automation and stability.

Complexity of testing and acceptance: The evolving nature of the models and the complexity of the data make the testing and acceptance phases more delicate in Machine Learning. What’s more, performance monitoring is essential to ensure that the models work properly in production. In Machine Learning, therefore, it is necessary to adapt the Operational Maintenance procedures to maintain the stability and reliability of the systems.

In short, an MLOps chain shares common elements with a DevOps chain but introduces additional steps and places particular importance on the management and use of data. The following graph highlights in yellow all the additional steps that MLOps introduces:

Data access and use: This stage includes all the data engineering phases (collection, transformation and versioning of the data used for training). The challenge is to ensure the integrity of the data and the reproducibility of the tests.

Model acceptance: ML acceptance and integration tests are more complex and take place at three different layers: the data pipeline, the ML model pipeline and the application pipeline.

Production monitoring: This involves guaranteeing the model’s performance over time and avoiding “model drift” (decline in performance over time). To achieve this, all deviations (instantaneous change, gradual change, recurring change) must be detected, analyzed, and corrected if necessary.

Figure 1 – Adapting the DevOps stages to Machine Learning
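As an illustration of the production monitoring step, the sketch below compares the distribution of a feature (or of model scores) in production against a reference sample using the Population Stability Index (PSI). PSI is one common choice among several and is our assumption here, not a method prescribed by this article; the bin count and the function name are also illustrative.

```python
import math


def population_stability_index(reference, current, bins=10):
    """PSI between two numeric samples, using equal-width bins over the reference range."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if the reference is constant

    def histogram(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            # Values outside the reference range are clamped into the edge bins.
            counts[min(max(idx, 0), bins - 1)] += 1
        # A small floor avoids log(0) when a bin is empty.
        return [max(c / len(values), 1e-6) for c in counts]

    ref, cur = histogram(reference), histogram(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))
```

A frequently quoted rule of thumb treats a PSI below 0.1 as stable and above 0.25 as a significant shift, but these thresholds should be calibrated for each model before being used to trigger alerts or re-training.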

Implementing MLOps requires creating a dialogue between data engineers and DevOps operators

Moving to MLOps means creating new organizational steps specifically adapted to data management. This includes the collection and transformation of training data, as well as the processes for tracking the different versions of the data.

In this sense, collaboration between MLOps experts, data scientists and data engineers is essential for success in this constantly evolving field. The main challenge in setting up an MLOps chain therefore lies in integrating the data engineers into the DevOps processes. They are responsible for preparing the data that MLOps engineers need to train and execute models.

And what about security?

The massive adoption of generative AI in 2024 has provided a variety of examples of security compromises. Indeed, the attack surface is large: a malicious actor can attack the model itself (model theft, model reconstruction, diversion from its initial use) as well as its data (extracting training data, modifying behaviour by adding false data, etc.). To illustrate the latter, we have simulated two realistic attacks in previous articles: Attacking an AI? A concrete example! and When words become weapons: prompt injection.

At the same time, MLOps introduces automation, which speeds up production. While this may reduce time to market, it also increases the risks (supply chain attacks, attacks at scale). It is therefore crucial to ensure that the risks associated with cybersecurity and AI are properly managed.

Just as DevSecOps does for DevOps, the MLOps production chain must be secured. Here is an overview of the main risks in the MLOps chain:

Adopt MLSecOps

Integrating security into MLOps teams and strengthening the security culture

The principles of MLSecOps need to be understood by data scientists and data engineers. To achieve this, it is crucial that the security teams are involved from the outset of the project. This can be done in two ways:

When a new project is created, a member of the security team is assigned as the security manager. He or she supervises progress and answers questions from the project teams.

A more agile approach, similar to DevSecOps, involves designating a member of the team as the “Security Champion”. This person acts as the cybersecurity lead within the project team and becomes the main point of contact for the cyber teams. This method enables security to be integrated more realistically into the project but requires appropriate training for the Security Champion.

For this change to be effective, it is also necessary to change the way project teams perceive cybersecurity:

By providing basic training to teams to help them better understand the challenges of cybersecurity.

By integrating cybersecurity into collaboration and knowledge platforms.

By organising regular awareness campaigns.

Securing MLOps chain tools

To guarantee product security, it is essential to secure the production chain. In the context of MLOps, this means ensuring that all the tools are used correctly, with practices that incorporate cybersecurity, whether they are data processing and management tools (such as MongoDB, SQL, etc.), monitoring tools (such as Prometheus), or development tools, ML-specific or not (such as MLflow or GitHub).

For example, it is crucial that teams remain vigilant on issues such as identification and identity management, business continuity, monitoring and data management. The possibilities offered by the various tools used throughout the lifecycle, and their specific features, must be examined in relation to these issues. Ideally, cybersecurity features should be used as selection criteria when choosing the most suitable tool.

Defining AI security practices

In addition to the security of the tools used to build AI systems, security measures must be incorporated to prevent vulnerabilities specific to AI systems. These measures must be incorporated right from the design stage and throughout the application’s lifecycle, following an MLSecOps approach. From data collection to system monitoring, there are numerous security measures to incorporate:

Figure 2 – Securing the MLOps lifecycle

Three security measures to implement in your MLSecOps processes

Depending on the security strategy adopted, various security measures can be integrated throughout the MLOps lifecycle. We have detailed the main defence mechanisms for securing AI in the following article: Securing AI: The New Cybersecurity Challenges.

In this section, we will focus on 3 specific measures that can be implemented to enhance the security of MLOps:

Figure 3 – Selected security measures

Checking the relevance of data and the risks of poisoning

In the context of Machine Learning, data security is essential to prevent the risk of poisoning and to guarantee the integrity of the data processed.

Before processing the data collected, it is essential to continually check the origin of the data in order to guarantee its quality and relevance. This is all the more complex when using external data streams, the provenance and veracity of which can sometimes be uncertain. The major risk lies in the integration of user data during continuous learning. This can lead to unpredictable results, as illustrated by Microsoft’s Tay chatbot in 2016, which was designed to learn through user interaction. Without proper moderation, it quickly adopted inappropriate behaviour, reflecting the negative input it received. This incident highlights the importance of constant monitoring and moderation of input data, particularly when it comes from real-time human interactions.

Various analysis techniques can be used to clean up a dataset. The aim is to check the integrity of the data and remove any data that could have a negative impact on the model’s performance. Two main methods are possible:

On the one hand, we can individually check the integrity of each data item by checking for outliers, validating the format or characteristic metrics, etc.
On the other hand, with a global analysis, approaches such as cross-validation and statistical clustering are effective in identifying and eliminating inappropriate elements from the dataset.
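The first of these two methods, checking each item individually, can be sketched as follows. The record schema (`amount`, `label`) is hypothetical, and the outlier rule used here is the modified z-score based on the median absolute deviation, which is one possible choice among many. The global methods (cross-validation, clustering) are not covered by this sketch.

```python
import statistics


def validate_record(record):
    """Format check for a hypothetical schema: numeric 'amount', non-empty 'label'."""
    return (
        isinstance(record.get("amount"), (int, float))
        and isinstance(record.get("label"), str)
        and record["label"].strip() != ""
    )


def filter_outliers(records, field="amount", threshold=3.5):
    """Drop records whose modified z-score on `field` exceeds the threshold."""
    values = [r[field] for r in records]
    median = statistics.median(values)
    mad = statistics.median(abs(v - median) for v in values)
    if mad == 0:
        return list(records)  # no spread to measure against, keep everything
    return [r for r in records if 0.6745 * abs(r[field] - median) / mad <= threshold]


def clean_dataset(records):
    """Apply the format check first, then the outlier filter."""
    well_formed = [r for r in records if validate_record(r)]
    return filter_outliers(well_formed)
```

Using the median rather than the mean keeps the check robust: a handful of poisoned values cannot shift the baseline they are measured against. Rejected records are best quarantined for review rather than silently discarded, since a cluster of outliers may itself be a sign of a poisoning attempt.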

Introduce adversarial examples

Adversarial examples are corrupted inputs, modified to mislead the predictions of a Machine Learning algorithm. These modifications are designed to be undetectable to the human eye but sufficient to fool the algorithm. This type of attack exploits vulnerabilities or flaws in the model training to cause prediction errors. To reduce these errors, the model can be taught to identify and ignore this type of input.

To do this, we can deliberately add adversarial examples to the training data. The aim is to present the model with slightly altered data, in order to prepare it to correctly identify and manage potential errors. Creating this type of degraded data is complex. The generation of these adversarial examples must be adapted to the problem and the threats identified. It is crucial to carefully monitor the training phase to ensure that the model effectively recognises these incorrect inputs and knows how to react correctly.
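As a minimal sketch of the idea above (usually called adversarial training in the literature), the code below trains a small logistic regression and, at each step, also trains on a perturbed copy of the input. The perturbation uses the Fast Gradient Sign Method (FGSM), which is our choice for illustration; the article does not prescribe a specific generation method, and the model, learning rate and `eps` budget are assumptions.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(w, b, x):
    """Probability of class 1 under a logistic regression model."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def sign(v):
    return (v > 0) - (v < 0)


def fgsm(w, b, x, y, eps):
    """Shift each feature by eps in the direction that increases the loss."""
    # For the logistic loss, d(loss)/dx_i = (p - y) * w_i.
    err = predict(w, b, x) - y
    return [xi + eps * sign(err * wi) for xi, wi in zip(x, w)]


def train(data, epochs=200, lr=0.1, eps=0.0):
    """SGD logistic regression; if eps > 0, each step also trains on an FGSM copy."""
    w, b = [0.0] * len(data[0][0]), 0.0
    for _ in range(epochs):
        for x, y in data:
            batch = [x] if eps == 0 else [x, fgsm(w, b, x, y, eps)]
            for xs in batch:
                err = predict(w, b, xs) - y
                w = [wi - lr * err * xi for wi, xi in zip(w, xs)]
                b -= lr * err
    return w, b
```

Calling `train(data, eps=0.0)` gives the standard model and a positive `eps` gives the adversarially trained one. Comparing their accuracy on both clean and FGSM-perturbed inputs is one way to monitor the trade-off mentioned above.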

Modify user inputs

Input security is essential to minimise the risks associated with malicious manipulation. A major weakness of LLMs (Large Language Models) is their lack of in-depth contextual understanding and their sensitivity to the precise formulation of prompts. One of the best-known techniques for exploiting this vulnerability is the prompt injection attack. It is therefore necessary to introduce an intermediate step of transforming user data before it is processed by the model.

It is possible to modify the input slightly in order to counter this type of attack, while preserving the accuracy of the model. This transformation can be carried out using various techniques (e.g. encoding, adding noise, reformulation, feature compression, etc.). The aim is to retain only what is essential for the response. In this way, any superfluous, potentially malicious information is discarded. In addition, this method deprives the attacker of the possibility of accessing the real input to the system. This prevents any in-depth analysis of the relationships between inputs and outputs, and thus complicates the design of future attacks. However, it remains essential to test the various measures implemented, to ensure that they do not degrade the performance of the model, thus guaranteeing enhanced security without compromising efficiency.
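One lightweight form of such an intermediate step is text normalization. The sketch below, with an assumed `transform_user_input` helper and an arbitrary length cap, folds look-alike characters, removes invisible ones and collapses whitespace. It illustrates only the simplest kind of transformation and would not stop a plainly worded injection on its own.

```python
import unicodedata


def transform_user_input(text, max_chars=2000):
    """Normalize and trim user text before it is placed in a prompt."""
    # NFKC folds look-alike forms (full-width letters, ligatures) into plain characters.
    text = unicodedata.normalize("NFKC", text)
    # Drop control (Cc) and format (Cf) characters such as zero-width spaces, which
    # can hide payloads. Newlines and tabs are kept so the next step turns them
    # into single spaces instead of gluing words together.
    text = "".join(
        ch for ch in text
        if ch in "\n\t" or unicodedata.category(ch) not in ("Cc", "Cf")
    )
    # Collapse whitespace runs, then cap the length to limit room for padded instructions.
    text = " ".join(text.split())
    return text[:max_chars]
```

Because the model only ever sees the transformed text, an attacker probing the system cannot be sure which characters of the original input actually reached it.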

With the industrial-scale production of applications based on Machine Learning and AI, security at scale is becoming a crucial organisational issue for the market. It is imperative to make the transition to MLSecOps. This transformation is based on three main pillars:

Strengthening the security culture of Data Scientists: It is essential that Data Scientists understand and integrate security principles into their day-to-day work. This creates a shared security culture and strengthens collaboration between the various players.
Securing the tools that produce Machine Learning algorithms: It is essential to select secure MLOps tools and apply best practices within the tools (rights management, etc.) to secure the Machine Learning algorithm “factory” and thus reduce the surface area for compromise.
Integrating AI-specific security measures: Adapting security measures to the specific features of AI systems is crucial to preventing potential attacks and ensuring the reliability of models over time. These security measures should therefore be integrated into the MLOps chain using MLSecOps.

Make the transition to MLSecOps today. Train your teams, secure your tools, and integrate AI-specific security measures. By making this shift, you will be able to benefit from AI systems that are industrially produced and secure by design.

Thanks to Louis FAY and Hortense SOULIER who contributed to the writing of this article as well.

The article Adopting MLSecOps: the key to reliable and secure AI models first appeared on RiskInsight.

Machine learning for cybersecurity: how to find your way in the jungle of products

Carole Meyziat — Fri, 25 Sep 2020 13:00:07 +0000

Machine Learning has been an emerging topic in recent years, particularly in the context of cybersecurity monitoring. However, as mentioned in the article “Boost your Cybersecurity thanks to Machine Learning” (Part 1 & Part 2), the development of such solutions requires strong human and financial investments.

Indeed, not all companies have the necessary means (or the will) to develop this type of technology internally, so they turn to market solutions. They then face a major problem: how do I quickly choose and integrate a solution that is effective in my context?

Why use Machine Learning in Cybersecurity?

The static nature of current detection solutions (antiviruses using signature bases, alert thresholds in a SIEM…) no longer makes it possible to cope with increasingly numerous and varied attacks. In addition, security teams are overloaded by the volume of data to be analyzed.

As explained in the article “Which tools do you need for your SOC?” (Part 2 & Part 3), Machine Learning provides an answer to these problems encountered by the SOC by using behavioral analysis methods to detect advanced attacks and prioritize the alerts to be analyzed.

Principle of anomaly detection in a SOC
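To give a feel for what behavioral analysis means in practice, here is a deliberately simple sketch: it learns a per-user baseline from historical daily event counts and ranks today's deviations so that the most unusual behavior is reviewed first. The data layout, the z-score rule and the threshold are illustrative assumptions; commercial products rely on far richer features and models.

```python
import statistics


def build_baseline(history):
    """history: {user: [daily event counts]} -> {user: (mean, standard deviation)}."""
    return {
        user: (statistics.mean(counts), statistics.pstdev(counts))
        for user, counts in history.items()
        if len(counts) >= 2
    }


def score_events(baseline, today, threshold=3.0):
    """Return (user, count, z-score) alerts, most anomalous first."""
    alerts = []
    for user, count in today.items():
        if user not in baseline:
            # A user with no history cannot be scored: surface it for review.
            alerts.append((user, count, float("inf")))
            continue
        mean, std = baseline[user]
        z = (count - mean) / (std or 1.0)
        if z > threshold:
            alerts.append((user, count, round(z, 1)))
    return sorted(alerts, key=lambda alert: alert[2], reverse=True)
```

The sorted output mirrors the prioritization described above: analysts start with the largest deviations instead of working through alerts in arrival order.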

While these types of solutions provide real added value, they do not completely eliminate the need for current detection methods and are rather used to complement existing tools.

Moreover, their level of complexity (deployment, alert processing) requires a sufficient level of maturity in terms of detection and reaction (organization, tools, resources, data centralization) before it is relevant to launch a project based on Machine Learning. This will facilitate the scoping phase and speed up deployment.

Upstream phase: defining the specifications

Which use case do I wish to address?

During our various engagements with clients, we have supported the integration of numerous solutions, and we can highlight four main types of use cases in which companies invest:

Fight against fraud: tools for detecting deviations in user behavior
Email monitoring: tools for preventing phishing or information leakage (DLP)
Network threat detection: “Next-Gen” probes
Endpoint threat identification: “Next-Gen” antivirus

The choice of a solution (and therefore of a use case) should not be defined unilaterally by the ISS branch, but should be discussed with various stakeholders (ISS, CIO, businesses, etc.). This exchange will enable the target to be specified and the technical and organizational prerequisites to be validated (accessibility of logs, resources to be mobilized, size of teams, etc.) in order to best prepare for its integration and use.

What kind of solution to choose?

Depending on the tools already in place and on the need, several solutions are possible:

Implement a turnkey solution that addresses very precise use cases that are not specific to business issues (EDR, behavioral biometrics…). This choice generally suits an immediate need rather than a long-term strategy.
Activate a Machine Learning module on a tool already in place (SIEM, log sink…) in order to extend its detection perimeter. This choice makes it possible to test use cases quickly and to avoid the integration phases that new equipment in the IS would require.

Finally, it is essential to remember that there is no miracle solution and that each type of solution responds to specific needs.

Facing the vendor: challenging the essential points

Testing the solution and thinking about scalability

Once all these prerequisites are defined, it is usual to carry out a Proof of Concept (PoC) with the vendor. In the specific case of a Machine Learning solution, however, the PoC must answer several specific questions:

Does the data I currently collect allow me to obtain satisfactory results quickly? Machine Learning solutions require the analysis of a very large amount of data, potentially enriched by repositories cross-referenced from several sources. It is therefore necessary to confirm with the vendor in advance that the data currently collected is already sufficient to obtain first results.
How long will the learning phase last in my context? Some Machine Learning solutions produce results only after several months or even years, because the learning phases can be extremely long due to the specific context of each company. Being able to use a log history for the tests would spare you a significant learning period.

Specific questions will also have to be addressed in order to anticipate the longer term:

Will it be possible to enrich the analyses with other types of data? Machine Learning solutions can analyze many types of data with heterogeneous formats, so it is necessary to ensure that the analyses can be enriched with newly collected types of data.
Will it be possible to implement new detection algorithms? The ability to customize these solutions by adding new types of algorithms (potentially without relying on the vendor) is a significant advantage.
How can I be sure that my vendor stays at the cutting edge of technology? Given the exponential evolution of techniques on this subject, it is important to ensure that the vendor remains at the forefront of technology in order to offer new means of defense against attacks that are becoming increasingly complex.

Preparing to protect the data life cycle

Detection methods based on behavioral analysis require the collection and processing of sensitive or personal data. Thus, especially when the solution is hosted by the vendor, issues related to the use of the data must be addressed as early as possible. On the one hand, contractual security requirements will of course need to be reinforced; on the other hand, it may be useful to rely upstream on solutions that enable more secure processing throughout the data life cycle.

For example, startups like SARUS are working on masking personal data, allowing data scientists to perform Machine Learning without accessing the source data. Startups like HAZY are working on generating synthetic data that keeps the statistical value of the useful data but loses its sensitive nature. This type of solution also makes it possible to artificially enlarge the sample provided and to obtain an almost unlimited amount of data, which can be very useful in a PoC where the data currently available is limited.
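To make the idea of synthetic data concrete, here is a deliberately naive sketch: it fits a Gaussian to a small numeric sample and draws as many synthetic values as needed. This is only an illustration of "keeping the statistical value while dropping the original records"; it is not how the vendors named above work, and the sample values and the function name `synthesize` are assumptions made for the example.

```python
import random
import statistics

def synthesize(values, n, seed=0):
    """Draw n synthetic values from a Gaussian fitted to the real values.

    Assumption: the real values are roughly normally distributed.
    """
    rng = random.Random(seed)
    mu = statistics.mean(values)
    sigma = statistics.stdev(values)
    return [rng.gauss(mu, sigma) for _ in range(n)]

# Hypothetical sample: upload volumes (in kB) observed during a short PoC.
real_upload_kb = [12.0, 15.5, 9.8, 14.2, 11.1, 13.7, 10.4, 16.0]

# Enlarge the sample artificially: 5000 synthetic values from 8 real ones.
synthetic_upload_kb = synthesize(real_upload_kb, n=5000)

print(round(statistics.mean(real_upload_kb), 2),
      round(statistics.mean(synthetic_upload_kb), 2))
```

The synthetic sample has nearly the same mean and spread as the real one. Real tools also have to preserve correlations between fields and guard against re-identification, which this sketch does not attempt.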

Once the relevance of the solution is validated, the adventure is just beginning!

Through our various experiences, we have forged a conviction: the market is mature enough to provide interesting results, especially on the four use cases mentioned above. Such tools will be effective if the solutions are connected to a rich ecosystem and meet a specific need. Indeed, the same solution can be a clear success in one context and a failure in another. The result will depend in particular on the clarity of the need, the scope targeted, the expertise available (Cybersecurity and Data Science), and the availability of the data (quality and quantity).

While choosing a Machine Learning solution is not easy, the best way to get an idea quickly is to run a PoC, which can be short and require little commitment: we have seen with some of our customers that solutions were already showing interesting results after only two weeks of PoC.

Keep in mind, though, that the PoC is only the beginning of the adventure. It will lead to the launch of an exciting project lasting several months (analysis of new types of alerts, discovery of new techniques…), bringing real added value in security (detection of new events…) and breathing new life into the operational security teams (prioritization of efforts, possibility of optimizing tedious tasks…).

The article Machine learning for its cybersecurity: how to find your way in the jungle of products first appeared on RiskInsight.

MACHINE LEARNING FOR YOUR CYBERSECURITY: HOW TO FIND YOUR WAY IN THE JUNGLE OF PRODUCTS

Carole Meyziat — Mon, 21 Sep 2020 08:00:53 +0000

Machine Learning has emerged as a major topic in recent years, particularly for cybersecurity monitoring. However, as discussed in the article “Boost your cybersecurity thanks to Machine Learning” (Part 1 & Part 2), developing such solutions requires substantial human and financial investment.

Indeed, not every company has the means (or the will) to develop this type of technology in-house. Those that do not turn to market solutions and face a major question: how do I choose and quickly integrate a solution that is effective in my context?

Why use Machine Learning in cybersecurity?

The static nature of current detection solutions (antivirus based on signature databases, threshold alerts in a SIEM…) is no longer enough to cope with increasingly numerous and varied attacks. In addition, security teams are overwhelmed by the volume of data to analyze.

As explained in the article “La saga de l’été sur les nouveaux outils du SOC” (Part 2 & Part 3), Machine Learning addresses these issues faced by the SOC by using behavioral analysis methods to detect advanced attacks and prioritize the alerts to be analyzed.

Principle of anomaly detection in a SOC

While these types of solutions bring real added value, they do not fully replace current detection methods and are instead used to complement the tools already in place.

Moreover, their complexity (deployment, alert handling) means that a sufficient level of maturity in detection and response (organization, tooling, resources, data centralization) must already have been reached before it makes sense to embark on a Machine Learning project. The scoping phase will be all the easier and the deployment all the faster.

Upfront: defining the requirements

Which use case do I want to address?

In our various engagements with clients, we have supported the integration of many solutions, and we can single out four main types of use case in which companies are investing:

Fighting fraud: tools that detect deviation(s) in a user’s behavior(s)
Email monitoring: tools that protect against phishing or data leakage (DLP)
Network threat detection: “Next-Gen” probes
Endpoint threat identification: “Next-Gen” antivirus

The choice of a solution (and therefore of a use case) should not be made unilaterally by the information security function; it should be thought through with the various stakeholders (security, IT, business lines…). This discussion will help to refine the target and to validate the technical and organizational prerequisites (log accessibility, resources to mobilize, team size…) in order to best prepare the solution’s integration and operation.

Which type of solution should I choose?

Depending on the tools already in place and on the need, several solutions are possible:

Implementing a turnkey solution that handles very precise use cases not specific to business issues (EDR, behavioral biometrics…). This choice generally suits an immediate need rather than a long-term strategy.
Activating a Machine Learning module on a tool already in place (SIEM, log repository…) in order to extend its detection scope. This choice makes it possible to test use cases quickly and to avoid the integration phases of new equipment within the IS.

Finally, it is essential to remember that there is no miracle solution and that each type of solution meets specific needs.

Facing the vendor: challenging the essential points

Testing the solution and thinking about its scalability

Once all these prerequisites are defined, it is customary to carry out a Proof of Concept (PoC) with the vendor. In the specific case of a Machine Learning solution, however, the PoC will answer several specific questions:

Does the data I currently collect allow me to get satisfactory results quickly? Machine Learning solutions require the analysis of very large amounts of data, potentially enriched by reference repositories that make it possible to cross-reference several sources. It is therefore necessary to check with the vendor in advance that the data currently collected is already sufficient to obtain initial results.
How long will the learning phase last in my context? Some Machine Learning solutions produce results only after several months or even years, because the learning phases can be extremely long given the specific context of each company. Being able to use a log history for the tests would avoid a long learning period.

Specific questions will also have to be addressed in order to anticipate the longer term:

Will it be possible to enrich the analyses with other types of data? Machine Learning solutions can analyze many types of data in heterogeneous formats, so it is necessary to make sure that the analyses can be enriched with newly collected types of data.
Will it be possible to implement new detection algorithms? The ability to customize these solutions by adding new types of algorithms (and potentially to do so independently of the vendor) is a significant advantage.
How can I be sure that my vendor stays at the cutting edge of technology? Given how fast techniques are evolving in this field, it is important to make sure that the vendor keeps up with technological progress in order to offer new means of defense against increasingly complex attacks.

Preparing to protect the data life cycle

Detection methods based on behavioral analysis require the collection and processing of sensitive or personal data. Thus, especially when the solution is hosted by the vendor, issues related to the use of the data must be addressed as early as possible. On the one hand, contractual security requirements will of course need to be reinforced; on the other hand, it may be useful to rely upstream on solutions that enable more secure processing throughout the data life cycle.

For example, startups like SARUS are working on masking personal data, allowing data scientists to perform Machine Learning without accessing the source data. Startups like HAZY are working on generating synthetic data that keeps the statistical value of the useful data but loses its sensitive nature. This type of solution also makes it possible to artificially enlarge the sample provided and to obtain an almost unlimited amount of data, which can be very useful in a PoC where the data currently available is limited.

Once the relevance of the solution is validated, the game is only just beginning!

Through our various experiences, we have forged a conviction: the market is mature enough to provide interesting results, especially on the four use cases mentioned above. Such tools will be effective if the solutions are connected to a rich ecosystem and meet a specific need. Indeed, the same solution can be a clear success in one context and a failure in another. The result will depend in particular on the clarity of the need, the scope targeted, the expertise available (Cybersecurity and Data Science), and the availability of the data (quality and quantity).

While choosing a Machine Learning solution is not easy, the best way to get an idea quickly is to run a PoC, which can be short and require little commitment: we have seen with some of our customers that solutions were already showing interesting results after only two weeks of PoC.

Keep in mind, though, that the PoC is only the beginning of the adventure. It will lead to the launch of an exciting project lasting several months (analysis of new types of alerts, discovery of new techniques…), bringing real added value in security (detection of new events…) and breathing new life into the operational security teams (prioritization of efforts, possibility of optimizing tedious tasks…).

The article MACHINE LEARNING POUR SA CYBERSECURITE : COMMENT SE RETROUVER DANS LA JUNGLE DES PRODUITS first appeared on RiskInsight.

Boost your cybersecurity thanks to machine learning? Part 2 – “Yes, but choose the right approach!”

Carole Meyziat — Wed, 08 Jul 2020 07:34:20 +0000

In the previous article (Part 1), we presented a step-by-step approach to Machine Learning applied to cybersecurity, in order to show its value and understand how it works. In this second part, we will answer a few common questions that may arise before starting such an initiative.

Is the amount of data the only success factor?

Absolutely not. #GarbageInGarbageOut

Focusing only on the data is the best way to be disappointed by machine learning. Results do not appear out of thin air if the input data is not carefully chosen!

Not only should you define the use case precisely before starting, but you also need to make sure that relevant data will be fed to the model.

What use case should I choose to do machine learning?

You’re looking at the problem upside down!

The right questions would rather be:

Are some use cases currently causing problems? E.g. a time-consuming process because every alert raised requires analysis, and many ultimately turn out to be false positives.
Does a machine learning based approach fit some of those problems? E.g. alerts raised on behaviour deemed « abnormal », rather than on a fixed detection threshold that would be hard to configure and keep up to date.
Have I checked that there are no standard solutions to tackle the problem? #IAmNotReinventingTheWheel

In cybersecurity, when faced with a complex problem that has to be described explicitly (e.g. what is a suspect communication in my information system?) and that is also very likely to evolve over time (e.g. the detection thresholds need frequent adjustment), finding the right compromise between detecting suspect cases and limiting false positives with static rules can be difficult. In these kinds of situations, it is worth exploring the machine learning option.
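The difficulty with static rules can be illustrated with a small sketch that compares a fixed threshold with a rule based on each account’s own history. The account names, volumes and the 100 MB threshold are invented for the example, and the z-score rule is only one simple stand-in for a learned behavioural baseline.

```python
import statistics

def static_rule(upload_mb, threshold_mb=100.0):
    """Static rule: alert when the upload exceeds a fixed threshold."""
    return upload_mb > threshold_mb

def behavioural_rule(upload_mb, history_mb, z_max=3.0):
    """Behavioural rule: alert when the upload deviates strongly from
    what this account usually does (simple z-score on its own history)."""
    mu = statistics.mean(history_mb)
    sigma = statistics.stdev(history_mb)
    if sigma == 0:
        return upload_mb != mu
    return abs(upload_mb - mu) / sigma > z_max

# Hypothetical daily upload volumes (MB) per account.
history = {
    "backup_svc": [480.0, 510.0, 495.0, 505.0, 490.0],
    "alice": [2.0, 3.5, 1.8, 2.7, 3.0],
}

# A routine 500 MB backup: the static rule raises a false positive.
backup_static = static_rule(500.0)
backup_behaviour = behavioural_rule(500.0, history["backup_svc"])

# Alice suddenly uploads 40 MB: the static rule misses it.
alice_static = static_rule(40.0)
alice_behaviour = behavioural_rule(40.0, history["alice"])

print(backup_static, backup_behaviour, alice_static, alice_behaviour)
```

Lowering the static threshold to catch the second case would multiply false positives on the first, which is exactly the compromise described above.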

Who leads the project: the cybersecurity team or the data team?

Both, with a lot of communication! #OneTeam

Each of these teams has its own expertise: technical for the data scientists, business for the cybersecurity team. Neither can properly conduct a machine learning for cybersecurity project without the other.

Without data scientists, the cybersecurity team might for instance:

Start without enough data. E.g. the volume of data does not allow the algorithm to define a standard behaviour, so it cannot separate normal situations from abnormal ones.
Forget to cross-reference some data. E.g. each user’s first connection to a new application is detected as an abnormal event, because it is not combined with a variable that allows this specific behaviour to be compared with that of the mass of users (who already use the application).
Be unable to interpret the alerts raised by the algorithm, or to optimize it. E.g. the algorithm reports anomalies that turn out not to be anomalies, and the cybersecurity team neither understands what the algorithm’s analysis is based on nor knows how to improve it.
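The second pitfall in the list above (forgetting to cross-reference data) can be sketched as follows. A user’s first connection to an application is weighted by how widely the application is already used, so that a first connection to a common tool scores low. The scoring formula, the function name `first_use_score` and the sample data are assumptions made for illustration.

```python
def first_use_score(user, app, usage):
    """Score in [0, 1]: high when the user is new to an app that few
    other users already use, 0 when the user already uses the app."""
    users_of_app = usage.get(app, set())
    if user in users_of_app:
        return 0.0
    all_users = set().union(*usage.values()) | {user}
    adoption = len(users_of_app) / len(all_users)
    return 1.0 - adoption

# Hypothetical mapping: application -> users already seen using it.
usage = {
    "webmail": {"alice", "bob", "carol", "dave"},
    "file-sharing-x": {"dave"},
}

common_app = first_use_score("erin", "webmail", usage)        # widely adopted app
rare_app = first_use_score("erin", "file-sharing-x", usage)   # barely used app
known_use = first_use_score("alice", "webmail", usage)        # not a first use

print(round(common_app, 2), round(rare_app, 2), known_use)
```

Without the adoption variable, both first connections would look equally abnormal; with it, only the rarely used application stands out.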

And without the cybersecurity team, the data scientists might:

Not know how to assess the relevance of the anomalies detected. E.g. the algorithm raises a log as an anomaly, but the data scientists cannot evaluate whether it is a real cybersecurity issue or not.
Be unable to select the data the algorithm should be fed with. E.g. cybersecurity gave its proxy logs to the data scientists, but they did not pick out the fields best suited to the use case, so the results of the algorithm are confused.
Miss crucial elements that should be integrated in the model to answer the business need. E.g. while trying to optimise an algorithm, a field that is necessary to categorise an anomaly in cybersecurity is deleted from the data set; the results of the algorithm are no longer valuable for cybersecurity purposes.

Combining the expertise of both teams is key to guaranteeing that Machine Learning resources are used efficiently and deliver a high value-added answer for cybersecurity.

What are the prerequisites?

The data!

Although it is not the only aspect to focus on, no model can be created without data.

As a reminder, machine learning encompasses all the techniques that allow machines to learn without having been explicitly programmed for their purpose. For them to learn, the algorithms are fed with the data that we can provide them.

They will need a large quantity of data so that they can define a « norm » as sharp as possible, since it will be built from and confronted with large volumes of real-life cases. Note that « large quantity » does not necessarily mean « diversity »: it is important to select only the data relevant to the use case.
The data will need to be of good quality so as not to mislead the learning of the algorithm, i.e. without introducing biases, for instance.

It will be useful to identify the relevant type of data for the analysis (e.g. security logs), the sources from which it will be collected (e.g. web proxies) and, if needed, the resources that will enrich it (e.g. a CMDB to link IPs with machine names).
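As an illustration of the enrichment step, the sketch below joins proxy log lines with a CMDB extract to attach a machine name to each source IP. The log format, field names, IP addresses and the in-memory `CMDB` dictionary are all hypothetical; a real CMDB would typically be queried through an export or an API.

```python
import csv
import io

# Hypothetical proxy log extract (CSV with a header line).
PROXY_LOG = """timestamp,src_ip,dest_host,bytes_out
2020-07-01T09:00:00,10.0.0.12,example.org,5120
2020-07-01T09:00:05,10.0.0.99,files.example.net,904000
"""

# Hypothetical CMDB extract: IP address -> machine name.
CMDB = {"10.0.0.12": "LAPTOP-ALICE"}

def enrich(log_text, cmdb):
    """Parse the log and add a 'machine' field looked up in the CMDB."""
    rows = []
    for row in csv.DictReader(io.StringIO(log_text)):
        row["machine"] = cmdb.get(row["src_ip"], "UNKNOWN")
        row["bytes_out"] = int(row["bytes_out"])
        rows.append(row)
    return rows

enriched = enrich(PROXY_LOG, CMDB)
for row in enriched:
    print(row["machine"], row["dest_host"], row["bytes_out"])
```

Keeping an explicit "UNKNOWN" value for IPs missing from the CMDB is a design choice: an unmanaged machine is itself a signal worth surfacing to the cybersecurity team.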

I don’t have much data available for my use case, does this mean that machine learning is not for me?

Not necessarily!

If the available data is relevant to the use case and well distributed (e.g. representative of a usual situation over a defined time period, so that an unsupervised algorithm can learn the « normal » situation), it is possible to obtain interesting results.

For instance, with a well-defined use case (e.g. targeted on a specific user population) and the right logs collected, suspect behaviors can be detected in proxy logs with only two weeks of traffic (depending on the verbosity of the logs, this represents only a few GB).

Which algorithm should I use?

Pick one and see!

The most important element in answering this question properly is the type of learning process: supervised or unsupervised.

The choice of one unsupervised algorithm rather than another will affect performance, but not as much as the input data. Many algorithms can work for a given use case, and their performance will depend on the context (e.g. need to interpret the results, volume of the training data…).

The data scientists choose the algorithm based on their technology watch, in order to suggest the most recognized and best-performing algorithm for a given use case and context.

Should I do it myself or outsource?

It depends, and it can evolve in time!

Our first article detailed an implementation example: development with your own tools, starting from scratch. In reality, there are three implementation options; the choice depends on the use case, the available resources and the ambitions.

Each of these scenarios has its strengths and weaknesses, and they can be used in combination. It is also essential to keep an eye on the market in order to see whether new, innovative and better-performing solutions have appeared in the meantime.

#TakeAStepBack

Is it easy to test?

If the framing is well done, yes! #Test&Learn

Once the use case is selected, the data availability checked and the implementation method chosen, it is rather easy to test the benefit of machine learning before investing further.

This type of project is well suited to iterative or sprint methods. Quickly try out the selected solutions, then either demonstrate their relevance through the added value they bring or, on the contrary, establish that for this use case the results are not encouraging enough to continue.

Whatever the case may be, a POC approach following an opportunity study can help you get a quick idea. This step also lets you take a step back, before starting on a larger scale, to weigh the potential benefits (e.g. time saved thanks to fewer false positives, better overall reactivity because the alerts are more relevant) against the investment to be made (e.g. dedicated computing infrastructure, skills to recruit).

Once my POC is done, how do I scale up?

Once again, step by step!

Once the first conclusive results are obtained on a use case, a production launch can be considered. Be careful not to go too fast: the production launch raises new questions that must be answered before continuing, for instance:

What are the volumes of data to analyse? What pre-processing (data preparation phase) needs to be done beforehand? How frequently? (Real time, delayed time…)
How often will the algorithm need to go through the learning process? On how much data?
What are the necessary infrastructures?
Which skills and resources will make it possible to maintain the solution over time?

It will then be time to take a step back and make operational choices while keeping in mind a long-term vision.

How much does it cost?

It all depends on the ambitions.

For a POC, proper framing makes it possible to limit the investment until the added value of machine learning is demonstrated (e.g. activation of an option on a security tool for a set period in order to test it, with no infrastructure investment).

Once the added value is tangible, the question arises of the costs involved in the production launch and in maintenance over time. A few elements must be considered to evaluate the total investment that will be needed:

Hardware investments (e.g. hardware for market solutions, infrastructure and resources to acquire computing power, in-house development) and software investments (licenses, activation of a machine learning feature on a SIEM, big data tools for data science…). It is essential not to overlook the computing power that some models need in order to run. This is one reason why, besides the quality of the results, only the most relevant data should be used to answer a use case.
Talent acquisition: the new profiles to bring in (e.g. data scientists, data engineers) as well as the business profiles and relevant experts who will be called upon during the project phase but also in the long term (alert handling, re-learning process, non-diversion tests for the solution, etc.).

To sum up, what are the main pitfalls to avoid?

#Reminder

The article Boost your cybersecurity thanks to machine learning? Part 2 – “Yes, but choose the right approach!” first appeared on RiskInsight.

Boost your cybersecurity thanks to Machine Learning? Part 1 – « Absolutely, here’s how! »

Carole Meyziat — Fri, 03 Jul 2020 12:00:14 +0000

Nowadays, we hear about artificial intelligence (AI) everywhere. It affects all sectors… and cybersecurity is no exception! According to a global benchmark published by Capgemini in the summer of 2019, 69% of organizations consider that they will no longer be able to respond to a cyber-attack without AI. Gartner places AI applied to cybersecurity in the top 10 strategic technological trends for 2020.

Over two articles, we will explore AI’s capabilities, specifically those pertaining to Machine Learning for cybersecurity. In this first article, we will go through each stage of a Machine Learning project focused on a cybersecurity scenario, the exfiltration of data from the IS, using a very simplified case. We have chosen one case study, but the concepts in this article apply to all Machine Learning projects and can be transposed to any other use case, in cybersecurity in particular.

First of all, what are we talking about?

The term Artificial Intelligence (AI) includes all the techniques that allow machines to simulate intelligence. Today, however, when we talk about AI, we very often talk about Machine Learning, one of its sub-domains. These are techniques that enable machines to learn a task, without having been explicitly programmed to do so.

For us cybersecurity professionals, this is a good thing: we often find it difficult to describe explicitly what it is we want to detect! Machine Learning therefore opens up new perspectives and already has many applications, the main ones being illustrated below:

The example of a use case for ML-enhanced cybersecurity: the DLP

To illustrate the contribution of Machine Learning to cybersecurity, we have chosen to focus on the fraudulent extraction of data from a company’s information system. In other words, the case of DLP (Data Leakage Prevention), an issue encountered by many companies. We want to detect suspicious outbound communications in order to prevent them from happening.

« Very well, but… how do we identify a suspicious communication? »

By large volumes exchanged? By a strange destination? By an unusual connection time?

In reality, our problem is complex to explain and what we need to assess is likely to change over time. Therefore, by using only static detection rules, our security teams find it difficult to be exhaustive. They can play on the thresholds of these rules to refine the detected elements, but unfortunately still find themselves with a large number of false positives to deal with.

We can see that Machine Learning, as defined above, can be useful here. What if we tried it?

Step 1: Clarify the need

That is what we just did!

Step 2: Choose the data

When we hear the words Machine Learning, we should usually understand “data” to feed the algorithms. Lots of data, and of good quality!

When we asked our requesting business (which for once is cybersecurity!) where to get useful data for our data exfiltration case, the web proxy stood out as the big winner: it sees almost all the traffic that leaves the IS. So we retrieved its logs, and they look like this:

« This all seems quite complicated…»

Data scientists do have good reason to feel lost: on the one hand, the logs are not easy to understand as a whole, and on the other hand, after consultation with the cybersecurity business, not all fields are really useful for our use case. We therefore selected some of them with the cybersecurity business before continuing.

The result is easier for data scientists to use!

Step 3: prepare the data

Data scientists can now “explore the data” in order to ensure optimal learning of the algorithm. Here, they point out a surprising feature in the distribution of our requests according to their upload volume. Since we want to detect data exfiltration, this variable is of particular interest to us.

The values of our variable are not evenly distributed: there is even a very large number of requests at 0.

“But still, there are a lot of these requests with a zero upload volume; is it really relevant to keep them in our case?”

Indeed, after discussion with the cybersecurity business, it appears that these data do not bring much for our use case. So we decided to remove them. Our sample was then distributed as follows:

After several back and forth exchanges between data scientists challenging the data from a statistical point of view and cybersecurity teams responding with their professional eye, the data is simplified as much as possible. Data is then:

Enriched by creating new variables that are denser in useful information. We introduced an upload volume relative to each site, measuring the difference between the upload volume of a request and the site’s average value over the last 90 days. We could also add the connection time, for example.
Normalized by reducing the amplitude of each variable, to limit the over- or underweighting of certain variables.
Encoded numerically, as most algorithms can only interpret numerical variables.
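The preparation steps above can be sketched on a toy sample: zero-upload requests are dropped, a relative upload volume per site is added, the upload volume is rescaled, and a categorical field is turned into a number. Field names and values are invented, the site average is computed on the sample itself rather than over 90 days, and min-max scaling is only one possible normalization.

```python
from collections import defaultdict

# Hypothetical proxy requests (upload volume in bytes).
requests = [
    {"site": "a.example", "method": "POST", "upload": 0},
    {"site": "a.example", "method": "POST", "upload": 200},
    {"site": "a.example", "method": "PUT", "upload": 400},
    {"site": "b.example", "method": "POST", "upload": 1000},
]

# Cleaning: drop requests with a zero upload volume.
kept = [r for r in requests if r["upload"] > 0]

# Enrichment: upload volume relative to the average for the same site.
by_site = defaultdict(list)
for r in kept:
    by_site[r["site"]].append(r["upload"])
site_mean = {site: sum(v) / len(v) for site, v in by_site.items()}
for r in kept:
    r["relative_upload"] = r["upload"] - site_mean[r["site"]]

# Normalization: min-max scaling of the upload volume to [0, 1].
def min_max(values):
    lo, hi = min(values), max(values)
    return [0.0 if hi == lo else (v - lo) / (hi - lo) for v in values]

for r, scaled in zip(kept, min_max([r["upload"] for r in kept])):
    r["upload_scaled"] = scaled

# Numerical encoding: map each HTTP method to an integer code.
method_codes = {m: i for i, m in enumerate(sorted({r["method"] for r in kept}))}
for r in kept:
    r["method_code"] = method_codes[r["method"]]

for r in kept:
    print(r)
```

Integer codes are the simplest encoding; for algorithms that would read an order into the codes, one-hot encoding is a common alternative.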

We can now split our data set in two: one set that will be used to train our model, one set that will allow us to test its performance. Several separation methods exist, enabling us to keep certain characteristics of the data (e.g. seasonality), but the objective remains the same: to guarantee an evaluation measure as close as possible to the model’s real performances, by presenting the model with data that it did not have at its disposal during training.
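One simple separation method that respects the time structure of logs is a chronological split: the model is trained on the oldest records and evaluated on the most recent ones, which it has never seen. The sketch below assumes ISO-formatted timestamps and an 80/20 ratio; both are illustrative choices, not the settings used in the case study.

```python
def chronological_split(rows, train_ratio=0.8):
    """Sort rows by timestamp and cut them into (train, test)."""
    ordered = sorted(rows, key=lambda r: r["timestamp"])
    cut = int(len(ordered) * train_ratio)
    return ordered[:cut], ordered[cut:]

# Hypothetical data set: one record per day over ten days.
rows = [{"timestamp": f"2020-06-{day:02d}", "upload": day * 10}
        for day in range(1, 11)]

train, test = chronological_split(rows)
print(len(train), len(test), train[-1]["timestamp"], test[0]["timestamp"])
```

A random split would mix past and future records, which can make the measured performance look better than what the model would achieve in production.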

Step 4: Choosing the learning method and training the model

Some algorithms are more efficient than others for a given problem, so the choice must be a reasoned one.

There are two main categories of Machine Learning algorithms:

Supervised, when we have labeled data as a reference to give as an example to our algorithm. These algorithms are for example used in cybersecurity by anti-spam solutions: they can learn via the users’ classification of emails as spam for example.
Unsupervised, when we do not know precisely what we want to detect or when we lack examples to provide the algorithm with for its learning (i.e. we lack labeled data).

As explained above, the context of our use case points us more towards the second option. It is for the same reasons that we initially thought of Machine Learning. We then choose our unsupervised learning algorithm (Isolation Forest here, but we could have chosen another one) and train our model.
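To give an intuition of how Isolation Forest works, here is a compact pure-Python sketch of the algorithm: random trees isolate points by random splits, and points that are isolated in few splits receive a high anomaly score. It is a simplified teaching version under stated assumptions (two invented features, synthetic "normal" traffic, default parameters chosen for the example), not the implementation used in the case study; in practice a library implementation would normally be used.

```python
import math
import random

def c_factor(n):
    """Average path length of an unsuccessful search in a binary tree of n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + 0.5772156649) - 2.0 * (n - 1) / n

def build_tree(points, depth, max_depth, rng):
    """Recursively split the points on a random feature at a random value."""
    if depth >= max_depth or len(points) <= 1:
        return {"size": len(points)}
    dim = rng.randrange(len(points[0]))
    lo = min(p[dim] for p in points)
    hi = max(p[dim] for p in points)
    if lo == hi:
        return {"size": len(points)}
    split = rng.uniform(lo, hi)
    left = [p for p in points if p[dim] < split]
    right = [p for p in points if p[dim] >= split]
    return {"dim": dim, "split": split,
            "left": build_tree(left, depth + 1, max_depth, rng),
            "right": build_tree(right, depth + 1, max_depth, rng)}

def path_length(point, node, depth=0):
    """Number of splits needed to isolate the point, adjusted at the leaf."""
    if "size" in node:
        return depth + c_factor(node["size"])
    child = node["left"] if point[node["dim"]] < node["split"] else node["right"]
    return path_length(point, child, depth + 1)

def fit_forest(points, n_trees=100, sample_size=64, seed=0):
    """Train n_trees isolation trees, each on a random subsample."""
    rng = random.Random(seed)
    size = min(sample_size, len(points))
    max_depth = math.ceil(math.log2(size))
    trees = [build_tree(rng.sample(points, size), 0, max_depth, rng)
             for _ in range(n_trees)]
    return trees, size

def anomaly_score(point, forest):
    """Score in (0, 1]: the shorter the average path, the higher the score."""
    trees, size = forest
    mean_path = sum(path_length(point, t) for t in trees) / len(trees)
    return 2.0 ** (-mean_path / c_factor(size))

# Synthetic "normal" traffic: (upload volume in kB, hour of the day).
rng = random.Random(42)
normal_traffic = [(rng.gauss(50, 10), rng.gauss(14, 2)) for _ in range(200)]

forest = fit_forest(normal_traffic)
typical = anomaly_score((50.0, 14.0), forest)   # ordinary request
unusual = anomaly_score((5000.0, 3.0), forest)  # huge upload at 3 a.m.
print(round(typical, 2), round(unusual, 2))
```

The unusual request obtains a clearly higher score than the typical one, without any rule having described what "unusual" means.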

Step 5: Analyze results

We use our test data set to evaluate the effectiveness of our model in detecting exfiltration cases.

The designed model detects patterns in the data (queries), then compares the new data (queries) with these patterns and highlights those that deviate from what it considers to be the norm through its learning (anomaly score).

Here are our results:

« Ok, but how should I interpret all this ? »

The graph on the left represents the anomaly scores associated with each query in the test set, sorted in chronological order. To the right are the logs with the highest anomaly scores.
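Extracting the logs with the highest anomaly scores, as in the right-hand part of the results, amounts to a top-N selection. A minimal sketch, with invented users, destinations and scores:

```python
import heapq

# Hypothetical scored requests from the test set.
scored_logs = [
    {"user": "u1", "dest": "intranet.example", "score": 0.41},
    {"user": "u2", "dest": "unknown-site.example", "score": 0.78},
    {"user": "u3", "dest": "cloud-storage.example", "score": 0.81},
    {"user": "u4", "dest": "news.example", "score": 0.39},
]

# Keep the two requests with the highest anomaly scores for investigation.
top = heapq.nlargest(2, scored_logs, key=lambda r: r["score"])
for r in top:
    print(r["user"], r["dest"], r["score"])
```

How many alerts to hand over to analysts is an operational choice that depends on the team’s capacity rather than on the model.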

After investigation with the cybersecurity business:

The peak in yellow corresponds to a much larger upload volume than the others, from a user who extracts a large volume of data. This anomaly is legitimate. However, an alert based on a static volume-per-request rule would also have detected this suspicious communication.
More interestingly, the peaks in red correspond to regular, low-volume uploads to unknown sites by the same user. These anomalies are harder to detect with conventional means, yet our algorithm gave them the same anomaly score as a large volume. For our cybersecurity alert management teams, they therefore become just as high a priority to qualify.

Now, let’s focus on the large cluster in the center of the graph (in orange). On the first day, we observe a high anomaly score: many users suddenly send data to the city’s transit website. After investigation, we realize that this is not a real security incident but the annual sending of receipts for the renewal of transport subscriptions (we are at the beginning of September…). We then observe that the algorithm “understands” that these flows are common to several users and progressively integrates them as a habit. The anomaly score therefore decreases day by day.
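The way a regularly re-trained model absorbs a new habit can be mimicked with a toy frequency model: the more often a destination has already been seen, the lower the score of a new request to it. This is not how Isolation Forest computes its score; the class `HabitModel`, the destination name and the figure of 50 users per day are assumptions used only to illustrate the day-by-day decrease.

```python
from collections import Counter

class HabitModel:
    """Toy model: a destination seen often in the past scores low."""

    def __init__(self):
        self.seen = Counter()

    def score(self, dest):
        # 1.0 for a destination never seen, tending to 0 as it becomes common.
        return 1.0 / (1.0 + self.seen[dest])

    def learn(self, dests):
        # Re-training step: integrate the day's traffic into the baseline.
        self.seen.update(dests)

model = HabitModel()
daily_scores = []
for day in range(4):
    # Score the day's traffic before the model has learned from it.
    daily_scores.append(model.score("transit.example"))
    # 50 users send their receipts to the transit website that day.
    model.learn(["transit.example"] * 50)

print([round(s, 4) for s in daily_scores])
```

The same mechanism is a reminder that an attacker acting slowly and regularly could also become part of the "habit", which is why alerts still need qualification by the cybersecurity team.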

The model therefore detects what is out of the norm, whatever that norm is, and corrects itself with experience. This is where Machine Learning brings real added value compared to traditional detection methods.

If the performance of the model on this first simplified use case attests to the potential value of Machine Learning, it may be time to move on to step 6: deployment at scale!

In a second article we will come back to these steps to highlight the success factors and pitfalls to be avoided when studying the possibilities of Machine Learning in cybersecurity.

The article Boost your cybersecurity thanks to Machine Learning? Part 1 – « Absolutely, here’s how! » first appeared on RiskInsight.

Detect cyber incidents with machine learning: our model in 5 key steps!

Hugo.MORET@wavestone.fr — Tue, 24 Dec 2019 14:19:30 +0000

As the role of Artificial Intelligence grows in companies, from predictive maintenance to price optimization, new so-called ‘intelligent’ tools are being developed for cybersecurity. How do these tools exploit recent developments in Machine Learning? What steps should be taken to develop an intelligent and above all relevant detection solution in this context?

From static detection methods to behavioral analysis

As attacks evolve more and more rapidly and in an increasingly sophisticated way, the SOC (Security Operations Center) is forced to review its approach and existing tools as static detection mechanisms become obsolete:

The historical approach relies on recognizing known behaviors and footprints (e.g. malware signatures). This method, called misuse-based, provides explicit alerts that are easy for operational staff to analyse, but it can only recognize attacks that have already occurred and been detected.
The new approach aims to analyse actions that deviate from normally observed behavior, without having to explicitly and exhaustively define a malicious act (e.g. the behavior of an individual who deviates from that of their colleagues). This anomaly-based approach makes it possible to detect attacks that are not explicitly known to the tools, but it requires high volumes of data.

The anomaly-based approach exploits the correlation capabilities of unsupervised learning algorithms that highlight links between unlabeled data (i.e. not categorized as normal or abnormal).

Recipe: anomaly detection on a bed of machine learning

To find out whether Machine Learning is appropriate for your context, the best solution is to build a PoC (Proof of Concept). How do you implement it? What are the key points to look out for? Here are the key steps in our development.

Starter, main or dessert: define the use case

Doing Machine Learning is good, knowing why is better. Defining a use case is like answering the question ‘What do you want to observe?’ and determining the means available to respond.

In our context, a use case is a threat scenario involving one or more groups of accounts (malicious administrators, exfiltration of sensitive data, etc.). To evaluate them, several criteria must be taken into consideration:

Utility: what would be the impact if the scenario were to happen?
Data availability: what are the available sources of useful data?
Data complexity: is the available data structured (numbers, tables) or unstructured (images, text)?

We have chosen to work on the compromise of service accounts: some may have important rights, and their automated actions generate relatively structured data. In the context of a PoC, a limited scope and homogeneous, easily accessible data sources are essential to obtain concrete and exploitable results before considering more ambitious use cases.

Ingredient weighing: determine the data model

In order to make the best use of the data, it is necessary to define a behavior to be modeled based on available information. This is where business expertise comes in: can an isolated action be a sign of compromise or should a series of actions be considered for detecting malicious behavior?

First, we defined a model based on the analysis of individual logs, grouped by family (e.g. connections, access to resources, etc.), to evaluate the overall functioning. However, a model that is too simple will ignore weak signals hidden in correlations between actions, while a representation that is too complex will add processing time and be more sensitive to modelling biases.

Selection of tools: choose the algorithm

Several types of algorithms can be used to detect anomalies:

Some try to isolate each point: if a point is easy to isolate, it is far from the others and therefore more abnormal.
Clustering algorithms create groups of similar points and compute the center of gravity of each group to represent its average behavior: if a point is too far from the center, it is considered abnormal.
Less common, auto-encoders are artificial neural networks that learn to recreate normal behavior with fewer parameters: the reconstruction error can be used as an anomaly score.

Other approaches exist, including more exotic ones such as artificial immune systems, which mimic biological mechanisms to create an evolving detection tool. However, it should not be forgotten that a simple, well-optimized tool is often more effective than an overly complex one.

The k-means clustering algorithm was selected in our case: already used in bank fraud detection, it is simple to re-train, which allows the tool to remain adaptable despite changes in behavior.
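To make the clustering idea concrete, here is a minimal sketch of k-means used as an anomaly detector with scikit-learn. It is not the PoC's actual code: the synthetic data, the number of clusters and the 99% quantile used as alert threshold are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans


def fit_kmeans_detector(X_train, n_clusters=2, quantile=0.99, seed=0):
    """Fit k-means on (mostly normal) data and derive a distance threshold."""
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(X_train)
    # transform() gives the distance from each point to every cluster center;
    # the distance to the nearest center serves as the anomaly score.
    train_scores = model.transform(X_train).min(axis=1)
    threshold = float(np.quantile(train_scores, quantile))  # assumed alerting rule
    return model, threshold


def anomaly_scores(model, X):
    """Distance of each point to its nearest cluster center."""
    return model.transform(X).min(axis=1)


# Synthetic "normal" behavior: two tight groups of already-normalized features.
rng = np.random.default_rng(0)
normal = np.vstack([
    rng.normal(loc=(0.2, 0.2), scale=0.02, size=(200, 2)),
    rng.normal(loc=(0.8, 0.8), scale=0.02, size=(200, 2)),
])
model, threshold = fit_kmeans_detector(normal)

new_events = np.array([[0.21, 0.19], [0.5, 0.5]])  # one usual, one far from both groups
scores = anomaly_scores(model, new_events)
flags = scores > threshold
```

Re-training amounts to calling `fit_kmeans_detector` again on a more recent window of data, which is one way to read the adaptability argument made for k-means.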

All these algorithms can also be enhanced, depending on the chosen behavior model, to consider a series of actions. Thus, convolutional or recurrent neural networks can be added upstream to take into account time series.

Preparation of ingredients: transforming data

Once the algorithm has been selected, the raw data must be processed to make it usable. This process is carried out in several steps:

Cleaning: correction of parsing errors, removal of unnecessary information and addition of missing information.
Enrichment: adding data from other sources and reprocessing fields to highlight information (e.g. indicating whether a date is a public holiday).
Transformation: creation of binary columns for qualitative data (e.g. account name, event type, etc.) that cannot be directly transformed into numbers (one column for each unique value, indicating whether the value is present or not).
Normalization: reprocessing the values so that they are all between 0 and 1 (to prevent one field from taking over from another).
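The four steps above can be sketched with pandas as follows. The log fields, the sample values and the holiday calendar are hypothetical and only serve to show each step once; a real pipeline would need far more cleaning rules.

```python
from datetime import date

import pandas as pd

# Hypothetical service-account logs; field names and values are illustrative.
logs = pd.DataFrame({
    "account": ["svc_backup", "svc_web", "svc_backup", None],
    "event_type": ["logon", "file_access", "logon", "logon"],
    "bytes_sent": [100.0, 900.0, 500.0, 300.0],
    "timestamp": pd.to_datetime([
        "2019-12-23 02:00", "2019-12-25 14:00",
        "2019-12-26 03:00", "2019-12-27 09:00",
    ]),
})
PUBLIC_HOLIDAYS = [date(2019, 12, 25)]  # assumed calendar for the example


def prepare(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    # Cleaning: fill in missing information.
    out["account"] = out["account"].fillna("unknown")
    # Enrichment: derive fields that highlight information.
    out["is_holiday"] = out["timestamp"].dt.date.isin(PUBLIC_HOLIDAYS).astype(float)
    out["hour"] = out["timestamp"].dt.hour.astype(float)
    out = out.drop(columns=["timestamp"])
    # Transformation: one binary column per unique qualitative value.
    out = pd.get_dummies(out, columns=["account", "event_type"], dtype=float)
    # Normalization: min-max scaling so every column lies between 0 and 1.
    span = (out.max() - out.min()).replace(0, 1.0)  # avoid dividing by zero
    return (out - out.min()) / span


features = prepare(logs)
```

Note that one-hot encoding an account name creates one column per account, so the feature count grows with the number of distinct values.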

Due to the variety of possible events and the complexity of the logs, we have chosen to automate this process: for each field, the algorithm detects the type of data and selects the appropriate transformation from a predefined library. The operator can then interact with the tool to modify the choice before continuing the process.
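As a rough illustration of this automation, the sketch below guesses a type for each field and proposes a transformation from a small library, which an operator can then override. The type rules, the library contents and every name here are assumptions, not the tool described in the article.

```python
# Hypothetical library mapping a detected field type to a transformation name.
TRANSFORM_LIBRARY = {
    "numeric": "min_max_scaling",
    "boolean": "cast_to_0_1",
    "categorical": "one_hot_encoding",
}


def detect_field_type(values):
    """Guess a field's type from its non-missing values."""
    present = [v for v in values if v is not None]
    if not present:
        return "categorical"
    if all(isinstance(v, bool) for v in present):
        return "boolean"
    if all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in present):
        return "numeric"
    return "categorical"


def propose_plan(records, overrides=None):
    """Propose one transformation per field; the operator may override any choice."""
    fields = sorted({name for record in records for name in record})
    plan = {
        name: TRANSFORM_LIBRARY[detect_field_type([r.get(name) for r in records])]
        for name in fields
    }
    plan.update(overrides or {})
    return plan


records = [
    {"account": "svc_backup", "bytes_sent": 1200, "success": True},
    {"account": "svc_web", "bytes_sent": 87.5, "success": False},
    {"account": "svc_web", "bytes_sent": None, "success": True},
]
plan = propose_plan(records)
# The operator disagrees with the automatic choice for one field.
reviewed = propose_plan(records, overrides={"bytes_sent": "log_then_min_max_scaling"})
```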

Seasoning: test and optimize the tool

Once the model has been defined, the algorithm chosen and the data transformed, the tool developed should be able to raise alerts on anomalies. Do these alerts make sense or are they false positives?

In order to evaluate the performance of the tool, we performed two types of tests:

Intrusion simulation by performing malicious actions to check if they are detected as abnormal (this approach can also be handled by directly adding “false” logs to data sets).
Analysis of anomalies by checking whether the alerts raised actually correspond to malicious behavior.
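Both tests can be tracked with a few lines of code. In this sketch, the event identifiers are made up, and alerts that do not match an injected event are only queued for analyst review, since they may be either false positives or genuine anomalies.

```python
def evaluate(alerted_ids, injected_ids):
    """Compare raised alerts with the malicious events injected on purpose."""
    alerted, injected = set(alerted_ids), set(injected_ids)
    detected = alerted & injected
    return {
        # Share of injected events that raised an alert.
        "detection_rate": len(detected) / len(injected) if injected else 0.0,
        # Injected events that went unnoticed.
        "missed": sorted(injected - alerted),
        # Alerts with no injected counterpart: to be analysed by hand.
        "to_review": sorted(alerted - injected),
    }


report = evaluate(alerted_ids=["e3", "e7", "e9"], injected_ids=["e3", "e9", "e12"])
```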

Many parameters can be adjusted in the algorithms to refine detection. Performance optimization is achieved through an iterative process: changing parameters and observing the effect on a set of validation data. Time-consuming when done manually, this process can be sped up by the AutoML approach, which seeks to automate certain steps using optimization algorithms.
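The iterative loop can be as simple as trying several values of one parameter and keeping the best one. The sketch below tunes only an alert threshold against a tiny labeled validation set, using the F1 score as the criterion; the scores, labels and candidate values are invented for the example, and real tuning would cover more parameters.

```python
def f1_at(threshold, scores, labels):
    """F1 score obtained when alerting on every score above the threshold."""
    tp = sum(1 for s, y in zip(scores, labels) if s > threshold and y)
    fp = sum(1 for s, y in zip(scores, labels) if s > threshold and not y)
    fn = sum(1 for s, y in zip(scores, labels) if s <= threshold and y)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(candidates, scores, labels):
    """Try each candidate value and keep the one with the highest F1."""
    return max(candidates, key=lambda t: f1_at(t, scores, labels))


# Validation data: anomaly scores and whether each event was really malicious.
val_scores = [0.10, 0.20, 0.15, 0.90, 0.70, 0.30]
val_labels = [False, False, False, True, True, False]
chosen = best_threshold([0.05, 0.25, 0.5, 0.8], val_scores, val_labels)
```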

However, parameter optimization is not enough: the results of our PoC have shown that the quality of detection based on behavioral analysis depends largely on the relevance of the behaviors defined before the algorithm is developed.

ML or not ML: that may not be the question

Despite its undeniable advantages, Machine Learning is a tool to be used rationally: frameworks are becoming increasingly accessible and easy to use, but defining the use case and the behavior model remain crucial steps. These choices, where business expertise is essential, will irreversibly influence the choice of data, the selection of the detection algorithm and the tests to be performed.

The question is no longer ‘Where can I put Machine Learning in my SOC?’, but rather ‘Of all the approaches available, which is the most effective to address my problem?’.

To find out, there’s only one solution: light the fires!

To go further…

… here are the tools used during our PoC:

IDE
- PyCharm: clear and practical development environment with efficient library management
Language
- Python: a language widely used in the field of Data Science with many powerful libraries
Libraries
- Scikit-learn: complete Machine Learning library (supervised, unsupervised…)
- Pandas: complex processing of data tables
- NumPy: handling of matrices and vectors
- Matplotlib, Seaborn: display of graphics for visualization


The use of Machine Learning by French cybersecurity startups

Paul Bonnaure — Tue, 22 Oct 2019 11:45:50 +0000

This article presents our convictions on the use of Machine Learning by the French cybersecurity startups in the 2019 Wavestone Radar.

Artificial intelligence is a fashionable topic, and cybersecurity is one of its flagship use cases. Is this also true for French cybersecurity startups? How is it actually being used? What are the market trends for this technology?

“Artificial Intelligence”, “Machine Learning”, “Deep Learning”: three terms too often confused

Before getting to the heart of the matter, let’s clarify the vocabulary used in the rest of the article:

Artificial Intelligence: the set of techniques used to make machines simulate intelligence;
Machine Learning: a technique based on statistical models that allow a computer to “learn” from a large amount of data;
Deep Learning: a Machine Learning method based on deep neural networks. Other methods exist: Support Vector Machines, Random Forests, K-Nearest Neighbors, etc.

These three terms are frequently confused. Very often, the term “Artificial Intelligence” in cybersecurity refers to the use of Machine Learning in all its forms.

Cybersecurity, fertile ground for Machine Learning technologies

Of the 134 startups listed in our 2019 radar of French cybersecurity startups, 19% offer solutions based on Machine Learning. When surveyed, 70% of these startups say that developing this type of technology in their solutions is part of their strategy.

Moreover, the use of Machine Learning in certain areas of cybersecurity is becoming almost unavoidable, and the majority of startups in these areas plan to base future developments of their solution on this technology.

Machine Learning in cybersecurity is growing rapidly, and its use, already established in the French startup ecosystem, demonstrates the market’s strong drive to innovate. We expect this pace of adoption to keep accelerating in the coming years; the same “snapshot” taken a year from now should prove it.

Machine Learning used to improve performance

Startups that have chosen to use Machine Learning do so mainly to:

Obtain short response times: reduce decision-making time in nominal use. In some cases where the volume of data is particularly large, an algorithm that does not use Machine Learning would need months to produce a result;
Improve detection reliability: reduce the number of errors, that is, lower the false positive and false negative rates. Anti-phishing solutions are a good illustration, since those based on Machine Learning filter with fewer errors than a so-called “classic” solution.

Adoption varies by radar theme…

The adoption of Machine Learning varies significantly from one radar theme to another. The themes where Machine Learning is most widespread are “Application Security”, “Endpoint”, “Industrial Security” and “Web Security”. There are also a few specific use cases in other radar themes, such as DPO Consulting, which uses Machine Learning to support decision-making in risk assessments.

…which is explained by the very nature of Machine Learning.

Machine Learning needs a number of prerequisites and conditions to work effectively. The entire performance of Machine Learning models rests on the training phase, in which the model “learns” from the data it is given. This data, which we will illustrate with the case of an anti-phishing solution for mailboxes, must be:

Relevant: that is, carrying useful information. In our anti-phishing example, useful information includes the presence of certain words often used in phishing emails; an image or the size of the email file is less useful;
Sufficient in number: this number varies with the use case and the desired level of accuracy. In our anti-phishing example, the algorithm would probably need to be trained on a few tens of thousands of emails;
Varied: from different sources where possible, to make the algorithm more resilient. In our anti-phishing example, the training database should contain emails from different phishing campaigns, received by different companies and individuals, targeted or not, and the solution should be able to process both the email content and the headers, etc.;
Representative: that is, free of bias and up to date. In our anti-phishing example, the model should be re-trained regularly to take the latest phishing trends into account.

Diagram of how a Machine Learning-based solution works

In the areas where Machine Learning is most used, these conditions turn out to be easier to meet. The data needed for training is often already available in existing equipment (application logs, system logs, network logs, antivirus alerts, etc.), or even already consolidated in central security equipment (SIEM, Data Lake, etc.).

“Artificial Intelligence”: beware of the hype!

While Machine Learning offers new possibilities that can greatly improve companies’ cybersecurity capabilities, this technology is not in itself a miracle solution. It is important to understand these algorithms well and to keep certain points in mind before acquiring such a solution.

First, since the training phase is key to Machine Learning performance, you need to ask whether you can provide the solution with the necessary and sufficient data for learning. The main obstacle reported by startups offering solutions trained on customer data is in fact the difficulty of obtaining data of sufficient quality and quantity to run their solution.

It is also important to read beyond the sales pitch to understand what Machine Learning really contributes to the solution, at the risk of paying more for a tool whose price is not necessarily justified. Above all, you must be aware that Machine Learning does not mean the end of risk. These solutions, like any security solution, address a specific use case and complement a broader set of security measures.

We feel this warning is necessary, even though our analysis found that French cybersecurity startups use these technologies in a relevant and justified way.


SOAR, UEBA, CASB, EDR and others: which tools do you need for your SOC? (3/3)

Amaury Coulomban — Thu, 18 Apr 2019 10:41:38 +0000

After the first article which covered “Extending the scope of detection to new perimeters” (see here), and the second, dedicated to “Enhancing detection through new approaches” (available here)… this is the conclusion to this (epic!) saga. This last installment will cover the last two strategic areas.

Improving knowledge of threats and attackers

Cyber-threat intelligence (CTI) platforms

Cyber-Threat Intelligence (CTI or Threat Intel) is a discipline that brings together the collection, consolidation, and exploitation of all information on cyber-threats. “Know your enemy” says Sun Tzu in the Art of War. Although this quote refers to “physical” wars, the principle remains true, and is probably even more true when it comes to “cyber” battles.

Today, a large number of security approaches rely on knowledge of attacks: the signature-based approach of antivirus and IDS solutions, targeted detection scenarios, etc. Even though this trend is reversing (in particular with the detection of anomalies) the vast majority of security products still rely—and will continue to rely—on the principles of Threat Intelligence.

With companies’ needs becoming more specific, and attackers ever more specialized, Threat Intel solutions are becoming increasingly popular, with services being offered directly to companies. In addition to commercial offerings, more and more exchange platforms and partnerships are enabling direct collaboration with other companies (in the same sector or geographical area, etc.).

Threat Intel offers a range of services. On the one hand, ‘strategic’ Threat Intel helps an SOC better understand the context and specific threats to the company. To do this, the risks from various ecosystems are studied: geographical, political, ideological, sectoral, etc. This information enables security teams to better understand the threats they face and guides their decisions to define “long-term” strategy (solutions to be deployed, etc.).

On the other hand, ‘tactical’ Threat Intel provides more precise information on attackers’ methods, allowing the SOC to facilitate detection and tailor existing measures: new threat scenarios to monitor, ports to block, etc.

In addition to these approaches, ‘technical’ Threat Intel contributes greatly to the analysis of security events by providing, on request (from SOAR in particular—see below), elements that enable the veracity of an alert to be judged: an IP belonging to a botnet, a file hash corresponding to a known virus, etc.
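An on-request lookup of this kind can be sketched as below. The indicator store is a stand-in for a real Threat Intel service: the IP addresses come from documentation ranges, the hash is a placeholder, and all names are hypothetical.

```python
# Illustrative indicator store; the IPs belong to documentation ranges and the
# hash is a placeholder, not a real malware signature.
KNOWN_BAD_IPS = {"203.0.113.7": "member of a botnet"}
KNOWN_BAD_HASHES = {"0" * 64: "known virus sample"}


def lookup(alert):
    """Return the indicator matches that help judge whether an alert is genuine."""
    matches = []
    ip_note = KNOWN_BAD_IPS.get(alert.get("src_ip"))
    if ip_note:
        matches.append(("src_ip", ip_note))
    hash_note = KNOWN_BAD_HASHES.get(alert.get("file_sha256"))
    if hash_note:
        matches.append(("file_sha256", hash_note))
    return matches


suspicious = {"id": "A-1", "src_ip": "203.0.113.7", "file_sha256": "f" * 64}
unknown = {"id": "A-2", "src_ip": "198.51.100.20"}
suspicious_matches = lookup(suspicious)
unknown_matches = lookup(unknown)
```

An empty result does not prove an alert is benign; it only means no known indicator matched.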

Threat Intelligence approaches are therefore among an SOC’s most versatile tools, enabling it to make the most of existing devices, by remaining up to date and prioritizing the threats to be detected, as well as identifying future tools and measures to be deployed.

Examples of Threat Intelligence publishers:

The standardization and automation of the response process

Security Orchestration, Automation and Response

Security Orchestration, Automation and Response (SOAR) is derived from the combination of three SOC tools: Security Incident Response Platforms (SIRPs; more details here), Security Orchestration Automation (SOA, i.e. orchestration and automation solutions) and some of the functionality of Threat Intelligence platforms. In summary, these are platforms that assist with and automate responses to security incidents. The solutions are similar to traditional ticketing tools (ITSMs) but include functionalities specific to cybersecurity issues. SOARs offer three main capabilities, each linked to one of the three types of tools from which they are derived.

First, like SIRPs, they allow the definition of response processes that are tailored to each security event. These are based on pre-defined playbooks provided by the publisher, published by the community using the solution, or created manually to better tailor things to the needs of the business. In particular, this task requires response teams to establish a clearly defined process that encourages them to ask themselves the right questions when they create response procedures, as well as to capitalize on and retain the knowledge gained.

The benefits of a SOAR, however, come more from the automation of the various stages that follow detection. During the analysis phase, the tool automatically enriches knowledge about a security event by retrieving contextual information about the IS (identity in the AD, criticality of a resource, etc.) and by querying Threat Intelligence services, either external (via APIs) or offered as part of the solution. In addition to automating the enrichment and analysis steps, SOARs also facilitate the work of analysts when their involvement is required: investigating endpoints, querying VirusTotal, etc., in one click.

But automation doesn’t stop there! Although controversial, the automation of the response (via the connection to security equipment, a legacy of SOA) can represent an important gain for security teams: the blocking of a URL, the generation of the signature of a file and its propagation to antivirus tools, the blacklisting of an IP, etc.
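A playbook that chains enrichment and a gated automated response could look like the sketch below. It does not reflect any vendor's playbook format: the `Incident` fields, the action names and the rule "never auto-block on high-criticality assets" are assumptions that illustrate why response automation remains a deliberate choice.

```python
from dataclasses import dataclass, field


@dataclass
class Incident:
    kind: str
    indicator: str
    asset_criticality: str  # "low", "medium" or "high" (assumed scale)
    actions: list = field(default_factory=list)


def run_playbook(incident, is_known_bad, auto_respond=True):
    """Enrich, then either respond automatically or hand over to an analyst."""
    # Analysis phase: automatic enrichment (context and threat intel lookup).
    incident.actions.append("enrich")
    if not is_known_bad(incident.indicator):
        incident.actions.append("close_for_review")
        return incident
    # Response phase: automate only where the organization accepts the risk.
    if auto_respond and incident.asset_criticality != "high":
        incident.actions.append(f"block_url:{incident.indicator}")
    else:
        incident.actions.append("escalate_to_analyst")
    return incident


blocklist = {"http://phish.example/login"}  # stand-in for a threat intel service
routine = run_playbook(
    Incident("phishing", "http://phish.example/login", "low"), blocklist.__contains__
)
sensitive = run_playbook(
    Incident("phishing", "http://phish.example/login", "high"), blocklist.__contains__
)
```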

The goal of SOARs is clear: to make life easier for the teams in charge of analysis and response, by helping them to define processes and automate tasks to the greatest extent possible. Although SOARs are very adaptable and can therefore help in response to any type of attack, they really shine when it comes to automating the treatment of common attacks (such as ransomware, phishing, etc.), which are very repetitive and tie up the resources of response teams.

Once these tasks have been automated, the security teams responsible for responding can focus on more complex alerts, where their knowledge adds real value.

Provided teams are prepared to put in the initial effort (formalizing processes, etc.), the likely gains in responsiveness and workload are significant. SOARs will change the way SOC teams work, especially with respect to top-level analysts. Even though these solutions are still rarely deployed in France, they are set to become an essential tool for SOCs in the coming years.

Examples of SOAR publishers:

Even though tools are only part of equipping an SOC, each of these solutions has distinct advantages that can help detection teams keep up to date in terms of the evolution of ISs and threats.

All the tools are promising, and some are coming to maturity. However, it’s important to keep in mind that current toolkits already raise a raft of alerts, which presents a challenge when it comes to processing. It’s therefore advisable to complete the deployment and automation of what exists (using SOARs, for example), before turning toward new solutions.

And, as for any innovative product, a cool head is needed: the deployment of a new solution must be the result of well-defined needs.


SOAR, UEBA, CASB, EDR and others: which tools do you need for your SOC? (2/3)

Amaury Coulomban — Thu, 18 Apr 2019 09:56:28 +0000

After the first article, which covered “Extending the scope of detection to new perimeters” (available here), this second installment is the next in our summer series about the SOC…

Enhancing detection with new approaches

Think identity to detect suspect behaviors: UEBA

User and Entity Behavioral Analysis (UEBA, previously known as UBA) technologies are among the latest tools being used to enhance SOCs’ detection arsenals. As their name suggests, they take a specific approach: leaving aside the technical considerations of current solutions (SIEM, etc.) and, instead, analyzing the behavior of users and entities (including terminals, applications, networks, servers, connected objects, etc.).

The principle is simple, but its implementation much less so. To be effective, UEBA approaches require a diversity of sources, and a variety of data formats. Traditional sources, such as SIEM and log manager(s), are employed and, in addition, certain resources (such as ADs, proxies, databases, etc.) are often used directly.

But, to perfect their detection capabilities, UEBA solutions also draw on new sources: information on users (HR applications, badge management, etc.), exchanges between employees (chats, video exchanges, emails, etc.), or any other relevant sources (business applications that need to be monitored, etc.).

Taking all this information together, UEBA solutions analyze the behavior of users (and entities) to identify potential threats. They can use static rules, in the form of signatures to be detected (which are often already implemented in SIEM solutions): simultaneous connections from two different locations, or unusual times of use, etc.
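The first static rule mentioned, simultaneous connections from two different locations, can be sketched as follows. The 30-minute window, the tuple layout and the sample logins are assumptions made for the example.

```python
from datetime import datetime, timedelta


def concurrent_location_alerts(logins, window=timedelta(minutes=30)):
    """Flag a user seen in two different locations within a short time window.

    `logins` is a list of (user, timestamp, location) tuples.
    """
    alerts = []
    last_seen = {}  # user -> (timestamp, location) of the previous login
    for user, when, location in sorted(logins, key=lambda event: event[1]):
        previous = last_seen.get(user)
        if previous and previous[1] != location and when - previous[0] <= window:
            alerts.append((user, previous[1], location))
        last_seen[user] = (when, location)
    return alerts


logins = [
    ("alice", datetime(2019, 4, 18, 9, 0), "Paris"),
    ("alice", datetime(2019, 4, 18, 9, 10), "Singapore"),
    ("bob", datetime(2019, 4, 18, 9, 0), "Paris"),
    ("bob", datetime(2019, 4, 18, 17, 0), "Lyon"),
]
alerts = concurrent_location_alerts(logins)
```

Such a rule needs no learning phase, which is why it is often already available in SIEM solutions.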

But the real strength of UEBA lies in the use of Machine Learning algorithms to detect changes in the behavior of users or services: suspicious business-function operations, access to critical, previously unused applications during holidays, unusual data transfers, etc.

Although UEBA was initially conceived to counter fraud, its role has gradually broadened to cover some areas that typically pose problems for SIEM: data theft, compromise or loan of application accounts, terminal or server infection, privilege abuse, etc.

Thus, today, UEBA is positioning itself as complementary to SIEM, adding to the latter’s “technical” approach by providing “user” visibility, and bringing an additional layer of intelligence to the analysis.

The market’s view is that, in the coming years, UEBA solutions will probably cease to exist in their present form. Instead, they’ll be integrated into existing solutions (SIEM, EDR, etc.), changing their form from products to functionalities.

Examples of UEBA publishers:

Trapping attackers: deceptive security

Deceptive Security can be considered as a move to a higher form of the Honey Pot approach. Here, decoys, in the form of data, agents, or dedicated environments, are distributed widely throughout all, or part of, the IS.

Depending on the needs and solutions, Deceptive Security tools can serve two purposes. By diverting the attention of attackers away from real resources and leading them down false trails, they can act as a means of protection.

But above all, monitoring these decoys can detect threats that are spreading within the IS. In fact, the decoys have no other use than to lure potential attackers or to provide false information; any communication with them is then, by definition, suspect.
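Because decoys have no legitimate use, the detection logic itself is very simple, as this sketch suggests. The decoy addresses and the flow records are hypothetical.

```python
from collections import Counter

DECOY_ADDRESSES = {"10.0.5.250", "10.0.9.17"}  # hypothetical decoy hosts


def decoy_alerts(flows):
    """Every flow that touches a decoy is suspect, whatever its content."""
    return [flow for flow in flows if flow["dst"] in DECOY_ADDRESSES]


flows = [
    {"src": "10.0.1.12", "dst": "10.0.2.40", "port": 443},
    {"src": "10.0.1.77", "dst": "10.0.5.250", "port": 445},
    {"src": "10.0.1.77", "dst": "10.0.9.17", "port": 22},
]
# Count decoy contacts per source to prioritize the investigation.
suspects = Counter(flow["src"] for flow in decoy_alerts(flows))
```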

This type of solution isn’t a replacement for existing measures but addresses very specific use cases where conventional detection approaches are ineffective: APTs, which are specially designed to circumvent them, and, more broadly, lateral movements within the IS.

For more detail on Deceptive Security solutions, read our dedicated article here.

Examples of Deceptive Security publishers:

Detecting weak signals on the network: machine learning sensors

Traditional detection sensors (IDPSs), based on traffic analysis and comparisons with known attack signatures, are not particularly effective when it comes to detecting subtle attacks (APTs, etc.) or unknown threats (0-day, etc.). To overcome this problem, new-generation IDPSs integrate Machine Learning capabilities (sometimes presented as Artificial Intelligence) into their detection arsenals.

Depending on the solution, two types of use for Machine Learning can be distinguished. On the one hand, these algorithms are used in supervised mode to learn to recognize the behavior of certain attacks, or phases of attack (during the active phases): command and control, scans, lateral movements, data leakage, etc.

On the other hand, once the sensor has been deployed, the adjustment of detection thresholds to the client context is also based on Machine Learning algorithms (something already used by many traditional IDPS solutions).

This mode of operation enables rapid deployment (solutions that can be used out-of-the-box with shorter learning phases), and a better ability to detect previously characterized attacks. Conversely, the detection of attacks that have not been subject to learning, or are completely unknown, remains difficult.

In contrast to this approach, some solutions rely on unsupervised learning to detect attacks. Here, during deployment, sensors are positioned on the network to observe the traffic and learn how to recognize what constitutes legitimate traffic.

Once the learning phase is over, the sensors can detect anomalies and raise alerts when suspicious behavior occurs. This approach enables the detection of unknown attacks, but generally requires a longer learning phase if it is to be effective and achieve an acceptable false alert rate.
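The learn-then-detect sequence can be illustrated with a deliberately simple baseline: a per-host mean and standard deviation of traffic volume, with an alert beyond three standard deviations. Commercial sensors use far richer models; the metric, the threshold and the host name are assumptions.

```python
from statistics import mean, pstdev


def learn_baseline(history):
    """Learning phase: summarize legitimate traffic volume per host."""
    return {host: (mean(values), pstdev(values)) for host, values in history.items()}


def is_anomalous(baseline, host, value, z_max=3.0):
    """Detection phase: alert when a value strays too far from the baseline."""
    if host not in baseline:
        return True  # host never seen during the learning phase
    mu, sigma = baseline[host]
    return abs(value - mu) > z_max * max(sigma, 1e-9)


# Observed volumes (e.g. MB per hour) during the learning phase.
history = {"srv-app-01": [100, 110, 95, 105, 90, 100]}
baseline = learn_baseline(history)
usual = is_anomalous(baseline, "srv-app-01", 108)
spike = is_anomalous(baseline, "srv-app-01", 400)
new_host = is_anomalous(baseline, "srv-unknown", 10)
```

A short learning phase yields a narrow baseline and therefore more false alerts, which matches the trade-off described above.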

In both cases, the “Machine Learning” sensors make it possible to enhance an SOC’s arsenal (which, today, is mostly aimed at detecting known attacks) through detection capabilities that can discern complex, unknown attacks, or those designed to circumvent conventional security approaches.

Initial feedback from the field shows that these technologies can indeed detect threats that bypass conventional security measures. False positives, however, are very common (the learning curve varies widely, depending on solutions and contexts), and it remains difficult to judge how comprehensively threats are being detected.

“Machine Learning” sensors therefore have a definite future among SOC tools, even if they need to further mature to reach their full potential.

Examples of Machine Learning sensor publishers:

You can find our third, and final, article in this series here.


Technological revolution: what outlook for the fight against fraud? (2/2)

m@THIEU — Fri, 02 Nov 2018 18:17:47 +0000

After a first article presenting the new technologies found in the fight against fraud, this second article looks at how best to integrate them.

The dilemma of evolving anti-fraud systems: which levers for integrating these technologies?

In response to these issues, the vendor ecosystem has organized itself to offer anti-fraud solutions built on these technologies. Vendors and start-ups have grown substantially all over the world (more than 150 providers are listed in the Wavestone “Anti-fraud” radar). The need to fight fraud is international by nature, particularly when protecting monetary flows, which are rarely limited to a single country.

Figure 2 : Example from the Wavestone anti-fraud vendor radar (non-exhaustive extract)

Even though the fight against fraud appears to be a prime use case for demonstrating the ROI of Machine Learning (fewer frauds, automated detection, etc.), and beyond the choice of an anti-fraud tooling strategy in light of market maturity, the questions to ask must remain those of a “standard” IS solution (operation, maintenance, scalability, etc.).

While the infrastructure costs required to set up tools based on Machine Learning and big data are not negligible, they create an environment that favors exploiting the richness of data for various uses (predictive server maintenance, customer knowledge, etc.), keeping in mind the safeguards established by the GDPR.

Figure 3 : Where can Machine Learning be applied: the example of a bank

A new target to reach: a “seamless” approach across technology and business

Given the new challenges and the contribution of emerging technologies, a new anti-fraud strategy must now be defined.

This means setting up a trusted global detection system that will have to respect 5 main principles.

Efficiency and automation: it will benefit from multi-criteria detection (rule engine and Machine Learning) and from operational efficiency optimized by automating measures ranging from raising the required authentication level to freezing a transfer.
Scalability and omnichannel coverage: it will integrate several perimeters into detection, with a “seamless” logic between the cyber world and the “non-cyber” world, and will be designed to allow new available data to be integrated (e.g. behavioral biometrics data).
Visibility and usability: it will provide visibility (reporting) and explanations of detection results to anti-fraud teams, customers and regulators.
Compliance and security: it will meet detection obligations and regulations (GDPR), and will address the risks inherent to Machine Learning (poisoning attempts, an attacker understanding the model, etc.).
Cross-functional cybersecurity and business governance: close collaboration between cyber threat detection teams and business anti-fraud teams, going beyond the silos that are still too common, will enable a global response with a 360-degree view of threats and make the best use of available data.

To reap all the benefits of this new detection strategy, investigation and response systems must not be neglected either.

Partially decentralizing the fight against fraud by involving bank advisors will increase investigation capacity. Because they know their customers best, advisors are an asset in the investigation process.

In addition, behavioral biometrics and Machine Learning provide better visibility into the level of trust that can be placed in the user. Once that trust level is established, the required authentication levels can be adapted accordingly. An appropriate, graduated contribution from the user will thus reduce the number of alerts raised.
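As a rough illustration of graduated authentication, the sketch below maps a trust score to a required authentication level. The `AuthLevel` names, the `required_auth` function, the 0-to-1 score scale and the cut-off values are all assumptions made for this example; the article does not specify them.

```python
from enum import Enum


class AuthLevel(Enum):
    """Hypothetical escalation ladder, from no extra step to a frozen operation."""
    NONE = 0      # no additional check
    PASSWORD = 1  # ask the user to re-enter a password
    STRONG = 2    # require strong (two-factor) authentication
    BLOCK = 3     # freeze the operation pending investigation


def required_auth(trust_score: float) -> AuthLevel:
    """Map a trust score in [0, 1] to an authentication level.

    The cut-off values are illustrative assumptions, not recommendations.
    """
    if not 0.0 <= trust_score <= 1.0:
        raise ValueError("trust_score must be between 0 and 1")
    if trust_score >= 0.9:
        return AuthLevel.NONE
    if trust_score >= 0.6:
        return AuthLevel.PASSWORD
    if trust_score >= 0.3:
        return AuthLevel.STRONG
    return AuthLevel.BLOCK


for score in (0.95, 0.7, 0.4, 0.1):
    print(score, required_auth(score).name)
```

The intermediate levels are what allow the user to contribute to clearing a doubt before an analyst has to be involved.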

Setting a new anti-fraud target is not only about responding appropriately to a change of context; it is also about anticipating a groundswell that is beginning today. Fraud detection will become increasingly complex as digitalization continues to accelerate, particularly for means of payment. The emergence of new players such as Fintechs, and the growing disintermediation of banks, will notably lead to poorer data being available. Anti-fraud systems are therefore bound to change in depth in order to maintain and develop their effectiveness.

The post Technological revolution: what outlook for the fight against fraud? (2/2) first appeared on RiskInsight.

Technological revolution: what outlook for the fight against fraud? (1/2)

m@THIEU — Wed, 31 Oct 2018 08:53:45 +0000

Protecting assets, particularly against theft or misappropriation, has long been a major concern for companies. Anti-fraud systems are organized around three main pillars: prevention, detection and response. These long-standing systems now face multiple changes, which also offer unprecedented opportunities that companies must seize.

The experience and experiments of the banking sector, which is ahead on these issues, give a view of what lies ahead and therefore provide a useful lens of analysis for other sectors as well.

Threats, usage, regulations: three major changes that require anti-fraud systems to adapt

Business and technological transformations across all sectors are bringing changes that directly affect long-standing anti-fraud systems.

Threats are evolving: fraud has become professionalized, with new tools and new practices. Take phishing: even without IT skills, a trained fraud cell can now buy a ready-to-use phishing kit, and takes on average only three minutes between a fraudulent login and money leaving the account. Fraud attempts have therefore multiplied in recent years.

In parallel, usage is shifting toward greater digitalization, sometimes driven directly by regulatory changes, for customers and employees alike. For example, the rollout of Instant Payment in France and the second European Payment Services Directive (PSD2) provide for instant transfers. These new usages speed up financial transactions between parties, which in turn creates a need for instant assessment of fraud risk. Moreover, the multiplication of payment channels increases the attack surface, notably with banking malware spreading to mobile applications and the emergence of complex, multichannel social engineering practices grounded in an understanding of business processes.

The diversification of fraud, the associated volumes and the growing need for instant processing make manual handling almost impossible. Creating more restrictive alert rules to reduce volumes would, however, run the risk of missing a large number of frauds.
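This trade-off can be made concrete with a toy example. The transfer amounts, fraud labels and thresholds below are invented for illustration, and a single "amount above threshold" rule stands in for a real rule set.

```python
# Hypothetical transfers: (amount in euros, is_fraud). Invented data.
TRANSFERS = [
    (12000, True), (9000, False), (7500, True), (5000, False),
    (3000, True), (1500, False), (800, False), (200, False),
]


def alert_stats(transfers, threshold):
    """Return (alerts_raised, frauds_missed) for an 'amount >= threshold' rule."""
    alerts = sum(1 for amount, _ in transfers if amount >= threshold)
    missed = sum(1 for amount, fraud in transfers if fraud and amount < threshold)
    return alerts, missed


# A stricter (higher) threshold means fewer alerts to review, but more misses.
for threshold in (2000, 6000, 10000):
    alerts, missed = alert_stats(TRANSFERS, threshold)
    print(f"threshold={threshold}: {alerts} alerts to review, {missed} frauds missed")
```

On this invented data, raising the threshold from 2000 to 10000 cuts the alerts from 5 to 1 but lets 2 of the 3 frauds through.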

In this new landscape, where fraud is becoming increasingly technological and can have multiple origins (customers, payment originators, subcontractors, suppliers, administrators, etc.), detection strategies must evolve from reactive detection of known frauds to proactive detection of as yet unknown threats.

New technologies: the future of anti-fraud in the face of this new paradigm

The historical approach to fraud detection relies mainly on two mechanisms. The first is the definition of unit rules that raise an alert when one of the criteria is not met. The second is event correlation, which implements more advanced business rules that take several types of data into account, so as to raise an alert when signs of a known fraud scenario appear.
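The two mechanisms can be sketched as follows. The `Event` fields, the amount limit, the home country and the "foreign login, then new beneficiary, then transfer" scenario are hypothetical choices made for this example, not rules taken from an actual anti-fraud tool.

```python
from dataclasses import dataclass

HOME_COUNTRY = "FR"    # assumed usual country of the customers
MAX_AMOUNT = 10_000.0  # assumed unit-rule limit, in euros


@dataclass
class Event:
    client_id: str
    kind: str            # "login", "add_beneficiary" or "transfer"
    amount: float = 0.0
    country: str = HOME_COUNTRY
    minute: int = 0      # minutes since the start of the observation period


def unit_rule_alerts(events):
    """Unit rules: each event is checked on its own against one criterion."""
    alerts = []
    for e in events:
        if e.kind == "transfer" and e.amount > MAX_AMOUNT:
            alerts.append((e.client_id, "amount_over_limit"))
        if e.kind == "login" and e.country != HOME_COUNTRY:
            alerts.append((e.client_id, "login_from_unusual_country"))
    return alerts


def correlation_alerts(events, window=10):
    """Correlation: foreign login, new beneficiary, then transfer within `window` minutes."""
    expected = ["login", "add_beneficiary", "transfer"]
    by_client = {}
    for e in sorted(events, key=lambda ev: ev.minute):
        by_client.setdefault(e.client_id, []).append(e)
    alerts = []
    for client_id, evs in by_client.items():
        for i, first in enumerate(evs):
            if first.kind != "login" or first.country == HOME_COUNTRY:
                continue
            step = 1  # index of the next expected step in the scenario
            for later in evs[i + 1:]:
                if later.minute - first.minute > window:
                    break
                if later.kind == expected[step]:
                    step += 1
                    if step == len(expected):
                        alerts.append((client_id, "account_takeover_scenario"))
                        break
    return alerts


events = [
    Event("c1", "login", country="RO", minute=0),
    Event("c1", "add_beneficiary", minute=1),
    Event("c1", "transfer", amount=900.0, minute=3),
    Event("c2", "transfer", amount=20_000.0, minute=5),
]
print(unit_rule_alerts(events))
print(correlation_alerts(events))
```

In this example the 900-euro transfer passes the unit rule on amounts, and only the correlation of the three events reveals the scenario.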

While this approach remains effective for detecting known frauds, for example in the fight against phishing, it is no longer sufficient to cope with the changes under way. It must be enriched into a hybrid approach based on the new technologies available on the market (artificial intelligence / Machine Learning, behavioral biometrics, etc.), which offer two main avenues for enriching current systems.

Moving from mass detection to much finer individualized detection that focuses on changes in behavior.

Machine Learning can create an individual profile for each customer. These profiles, made up of variables built from the collected data, make it possible to model behavior. The algorithms then compare the customer's profile (and therefore their habits) with a given event and flag an anomaly when a divergence appears. Note that the number of variables handled can easily exceed several dozen, whereas static rules only take a few parameters into account; this greatly increases detection potential or reduces the number of false positives.
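A minimal sketch of the idea, reduced to a single variable (the transfer amount) where a real profile would combine dozens. The customer names, the history and the z-score cut-off are assumptions made for the example, and a plain mean and standard deviation stand in for a learned model.

```python
from statistics import mean, stdev

# Hypothetical history of past transfer amounts per customer (invented data).
HISTORY = {
    "alice": [50, 60, 55, 45, 40],
    "bob": [5000, 7000, 6000, 6500, 5500],
}


def build_profiles(history):
    """Build one profile per customer: (mean, standard deviation) of past amounts."""
    return {cid: (mean(a), stdev(a)) for cid, a in history.items() if len(a) >= 2}


def is_anomalous(profiles, client_id, amount, z_max=3.0):
    """Flag an amount that deviates from the customer's own habits."""
    mu, sigma = profiles[client_id]
    if sigma == 0:
        return amount != mu
    return abs(amount - mu) / sigma > z_max


profiles = build_profiles(HISTORY)
# The same 6000-euro transfer is judged against each customer's own profile.
print(is_anomalous(profiles, "alice", 6000))
print(is_anomalous(profiles, "bob", 6000))
```

A static rule with one limit for everyone would have to treat both customers the same way; the per-customer profile flags the transfer for one and not for the other.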

Broadening the perimeters covered by taking advantage of the economies of scale these technologies bring (shared big data infrastructure, data at scale, automation that saves analysts time, etc.).

Thanks to the Data Lakes they rely on, these technologies can integrate and correlate large volumes of raw technical or business data (application logs, customer knowledge, financial operations, etc.) and can be enriched with external data (watch lists, conversion of IP addresses into physical locations, etc.). To get the most out of anti-fraud systems, the Data Lake must hold a history of relevant and compliant data, namely 13 months for natural persons and 6 months for legal entities.
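The history windows quoted above could be applied when selecting records, as in the sketch below. The category names, the `months_back` helper and the day-clamping convention are assumptions of this example; only the 13-month and 6-month figures come from the text.

```python
import calendar
from datetime import date

# History depth in months, per holder type (figures from the article).
HISTORY_MONTHS = {"natural_person": 13, "legal_entity": 6}


def months_back(d: date, months: int) -> date:
    """Return the date `months` months before d, clamping the day to the month length."""
    total = d.year * 12 + (d.month - 1) - months
    year, month = total // 12, total % 12 + 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def within_history(record_date: date, holder_type: str, today: date) -> bool:
    """True if the record falls inside the history window for this holder type."""
    return record_date >= months_back(today, HISTORY_MONTHS[holder_type])


today = date(2018, 10, 31)
record = date(2017, 10, 15)
print(within_history(record, "natural_person", today))
print(within_history(record, "legal_entity", today))
```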

These technologies are not "magic", however. They require data of sufficient quality and quantity, and substantial preparatory work to build the variables that carry the algorithms' detection capabilities. This construction phase requires both business and technological expertise (data science, developers, etc.).

Figure 1: The main detection methods

The choice of algorithms should not be neglected either, particularly from a transparency standpoint. Some tools are based on algorithms whose results are hard to justify. The lack of visibility into how results are produced leads to "black box" alerts and does not always make it possible to justify blocks to customers. Excessive opacity can also have legal consequences, or even be illegal, when these alerts directly affect customers.
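By way of contrast with a black box, the sketch below shows one simple form of transparent scoring: a weighted sum whose per-variable contributions can be reported alongside the alert. The variable names and weights are hypothetical, and this is only one of several possible ways to obtain explainable results.

```python
# Hypothetical weights of a transparent, additive risk score.
WEIGHTS = {
    "amount_vs_usual": 0.5,   # how far the amount is from the customer's habits (0 to 1)
    "new_beneficiary": 0.3,   # 1.0 if the beneficiary was added recently, else 0.0
    "unusual_country": 0.2,   # 1.0 if the login came from an unusual country, else 0.0
}


def explain_score(features):
    """Return the risk score and the contributing variables, largest first."""
    contributions = {
        name: weight * features.get(name, 0.0) for name, weight in WEIGHTS.items()
    }
    score = sum(contributions.values())
    reasons = sorted(
        (name for name, value in contributions.items() if value > 0),
        key=lambda name: contributions[name],
        reverse=True,
    )
    return score, reasons


score, reasons = explain_score(
    {"amount_vs_usual": 1.0, "new_beneficiary": 1.0, "unusual_country": 0.0}
)
print(round(score, 2), reasons)
```

The list of reasons is what an analyst could use to justify a block to the customer or to a regulator.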

While this first article presents the technologies of the future in the fight against fraud, a second article will detail how best to integrate them.

The post Technological revolution: what outlook for the fight against fraud? (1/2) first appeared on RiskInsight.

Machine Learning: what opportunities and challenges for a modern Online Bank?

ArtHuRC0ugeT — Wed, 16 Nov 2016 08:22:50 +0000

Online Banking is undergoing profound change, both in its business challenges, with perimeters that are increasingly broad and less and less siloed, and in its regulatory challenges (Instant Payment, PSD2, etc.). Fraud cases are multiplying, and the attack patterns used by increasingly seasoned fraudsters are diversifying. To keep pace with these changes, business methods and processes must become more efficient, better adapted and more flexible. Machine Learning methods, although only recently democratized, make it possible to keep up with the digital revolution in Online Banking.

Machine Learning: demystification and opportunities

Machine Learning is a form of artificial intelligence that consists in learning and modeling a phenomenon in order to better understand and control it. To do so, one or more algorithms establish correlations between the events that make up the phenomenon. There are two main types of methods:

Supervised methods, which build models from a database of examples (generally cases that have already been processed and validated).
Unsupervised methods, which do not need a database of examples.

To illustrate the difference between the two, consider fraud detection. To train and build accurate models, supervised methods would take as input data that has already been processed and labeled as linked or not to fraud cases (known fraud patterns), whereas unsupervised methods would use raw data from the applications of the information system to model normal behavior. Conceptually, this amounts to modeling, respectively, what is abnormal (fraud, provided there is enough data for the representation to be faithful) or what is normal (frauds then being detected de facto when behavior departs from that normality).
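The contrast can be sketched with two deliberately simple stand-ins: a nearest-centroid classifier for the supervised case and a distance-to-normal check for the unsupervised case. The two features (amount in thousands of euros, number of beneficiaries added that day), the data and the "mean plus three standard deviations" radius are assumptions of this example; real systems would use richer features and dedicated algorithms.

```python
from math import dist
from statistics import mean, stdev


def centroid(points):
    """Coordinate-wise mean of a list of feature tuples."""
    return tuple(mean(axis) for axis in zip(*points))


# Supervised: learn from examples already labelled "fraud" or "legit".
def train_supervised(examples):
    by_label = {}
    for features, label in examples:
        by_label.setdefault(label, []).append(features)
    return {label: centroid(points) for label, points in by_label.items()}


def predict_supervised(model, features):
    """Return the label whose centroid is closest to the new event."""
    return min(model, key=lambda label: dist(model[label], features))


# Unsupervised: learn what "normal" looks like from raw, unlabelled data.
def train_unsupervised(points):
    center = centroid(points)
    distances = [dist(center, p) for p in points]
    return center, mean(distances) + 3 * stdev(distances)


def is_outlier(model, features):
    """Flag an event that lies too far from the modelled normality."""
    center, radius = model
    return dist(center, features) > radius


labelled = [
    ((0.1, 0), "legit"), ((0.2, 0), "legit"), ((0.15, 1), "legit"),
    ((9.0, 3), "fraud"), ((8.0, 2), "fraud"),
]
raw = [(0.1, 0), (0.2, 0), (0.15, 1), (0.12, 0), (0.18, 1), (0.11, 0)]

supervised = train_supervised(labelled)
unsupervised = train_unsupervised(raw)
suspect = (7.5, 2)
print(predict_supervised(supervised, suspect), is_outlier(unsupervised, suspect))
```

The supervised model needs the labelled fraud cases to exist; the unsupervised one never sees a fraud and reacts only to the distance from usual behavior.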

Not all algorithms are equal. Each has strengths and weaknesses that must be weighed and that depend largely on the input data, which is specific to each business case. It is important to choose data that is both relevant and available in sufficient quantity to obtain convincing results. In the context of Online Banking, many kinds of data can be used for Machine Learning:

Transaction habits: transfer amounts, destination countries, etc.
Login habits: login time, user agent, device used, etc.
Browsing habits: customer journey, browsing speed, etc.
Behavioral data: typing speed, mouse movement, etc.
Marketing data: products used, transfer descriptions, etc.

When correctly exploited by Machine Learning algorithms, the combination of these different data, preceded by processing that extracts the most value from them, can deliver far more significant results than conventional methods allow. Several areas can benefit greatly from these data: customer knowledge (KYC), for example by exploiting the typical customer journey; fraud detection, by using transfer habits to identify suspicious cases (login country, distribution of amounts, etc.); and marketing, through knowledge of consumption habits (analysis of transfer descriptions, grouping of purchases by category, etc.).

In concrete terms, what are the gains from Machine Learning?

First, knowing the customer and addressing their needs better

Machine Learning extracts the most value from data by individualizing models, where "conventional" methods rely on one model common to all the input data. In fraud detection, for example, "conventional" rule-based models amount to building a single model shared by all customers, regardless of their individuality, whereas Machine Learning enables more effective detection by associating a profile with each customer and carrying out monitoring and detection specific to that profile. This reasoning holds for all other application areas and ultimately gives a better representation and better knowledge, no longer of "the customer" in the broad sense, but of each individual customer.

Machine Learning also makes it possible to offer new services

Beyond the notable improvement in results measured by conventional KPIs (false positive rate, detection rate, etc.), Machine Learning creates value in terms of financial gains by personalizing the tools the customer benefits from. This can serve as the basis for a commercial offer built, for example, on customers personalizing their own thresholds, or on the ability to be alerted in real time when a piece of marketing, commercial or security information is particularly relevant. Some banks have already taken the step by offering their corporate customers the option of being alerted when transfers exceed personalized, pre-set thresholds.
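The personalized-threshold alert is simple to sketch. The customer identifiers, the amounts and the bank-wide fallback value are hypothetical.

```python
DEFAULT_THRESHOLD = 10_000.0  # assumed fallback when a customer has set no threshold

# Hypothetical thresholds chosen by corporate customers themselves, in euros.
PERSONAL_THRESHOLDS = {"acme": 50_000.0, "bakery": 2_000.0}


def transfers_to_alert(transfers, thresholds):
    """Return the (client_id, amount) transfers that exceed the customer's own threshold."""
    return [
        (client_id, amount)
        for client_id, amount in transfers
        if amount > thresholds.get(client_id, DEFAULT_THRESHOLD)
    ]


transfers = [("acme", 30_000.0), ("bakery", 3_000.0), ("newco", 12_000.0)]
print(transfers_to_alert(transfers, PERSONAL_THRESHOLDS))
```

Here the 30,000-euro transfer raises no alert because that customer set a higher limit, while the much smaller 3,000-euro transfer does.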

Finally, Machine Learning is also an opportunity to modernize tools and stay at the state of the art

Launching a Machine Learning project makes it possible to communicate on the subject and to take advantage of the buzzword to generate satisfaction among the growing number of customers who are sensitive to security or confidentiality issues, while ensuring the bank stays at the state of the art of the market. It can also help modernize existing tools ahead of the changes that will continue to take place in Online Banking as new regulations arrive, along with the technical requirements (notably real time, with Instant Payment) and business requirements that follow from them. In this context, Machine Learning methods are emerging, for example, for market surveillance and for combating insider trading.

In conclusion, the technical maturity of Machine Learning coincides with new needs and new requirements expressed in modern Online Banking. Embracing this evolution brings many benefits, from improved performance and results to customer satisfaction, as well as greater technical flexibility. Mastery of the different methods should allow processing and business processes to be renewed by bringing them closer to the customer (today these methods are largely invisible to the customer). In the fight against fraud, for example, one can imagine many use cases around alerting and countermeasures, such as verification by strong authentication when there is a suspicion, or information received in real time to involve customers more.

The post Machine Learning: what opportunities and challenges for a modern Online Bank? first appeared on RiskInsight.