IA - RiskInsight

GenAI Guardrails – Why do you need them & Which one should you use?

Nicolas Lermusiaux — Wed, 11 Feb 2026 09:10:19 +0000

The rise of generative AI and Large Language Models (LLMs) like ChatGPT has disrupted digital practices. More companies choose to deploy applications integrating these language models, but this integration comes with new vulnerabilities, identified by OWASP in its Top 10 LLM 2025 and Top 10 for Agentic Applications 2026. Faced with these new risks and new regulations like the AI Act, specialized solutions, named guardrails, have emerged to secure interactions (by analysing semantically all the prompts and responses) with LLMs and are becoming essential to ensure compliance and security for these applications.

The challenge of choosing a guardrails solution

As guardrails solutions multiply, organizations face a practical challenge: selecting protection mechanisms that effectively reduce risk without compromising performance, user experience, or operational feasibility.

Choosing guardrails is not limited to blocking malicious prompts. It requires balancing detection accuracy, false positives, latency, and the ability to adapt filtering to the specific context, data sources, and threat exposure of each application. In practice, no single solution addresses all use cases equally well, making guardrail selection a contextual and risk-driven decision.

An important diversity of solutions

Overview of guardrails solutions (not exhaustive)

In 2025, the AI security and LLM guardrails landscape experienced significant consolidation. Major cybersecurity vendors increasingly sought to extend their portfolios with protections dedicated to generative AI, model usage, and agent interactions. Rather than building these capabilities from scratch, many chose to acquire specialized startups to rapidly integrate AI-native security features into their existing platforms, such as SentinelOne with Prompt Security or Check Point with Lakera.

This trend illustrates a broader shift in the cybersecurity market: protections for LLM-based applications are becoming a standard component of enterprise security offerings, alongside more traditional controls. Guardrails and runtime AI protections are no longer niche solutions, but are progressively embedded into mainstream security stacks to support enterprise-scale AI adoption

The main criteria to choose your guardrails

With so many guardrails’ solutions, choosing the right option becomes a challenge. The most important criteria to focus on are:

Filtering effectiveness, to reduce exposure to malicious prompts while limiting false positives
Latency, to ensure a user-friendly experience
Personalisation capabilities, to adapt filtering to business-specific contexts and risks
Operational cost, to support scalability over time

Key Results & Solutions Profiles

To get an idea of the performances the guardrails in the market, we tested several solutions across these criteria and a few profiles stood out:

Some solutions offer rapid deployment and effective baseline protection with minimal configuration, making them suitable for organizations seeking immediate risk reduction. These solutions typically perform well out of the box but provide limited customization.
Other solutions emphasize flexibility and fine-grained control. While these frameworks enable advanced filtering strategies, they often exhibit poor default performance and require significant configuration effort to reach good protection levels.

As a result, selecting a guardrails solution depends less on raw detection scores and more on the expected level of customization, operational maturity, and acceptable setup effort.

Focus on Cloud Providers’ guardrails

As most LLM-based applications are deployed in cloud environments, native guardrails offered by cloud providers represent a pragmatic first layer of protection. These solutions are easy to activate, cost-effective, and integrate seamlessly into existing cloud workflows.

Using automated red-teaming techniques, we observed that cloud-native guardrails consistently blocked most of the common prompt injection and jailbreak attempts. The overall performance of the guardrails available on Azure, AWS and GCP were similar, confirming their relevance as baseline protection mechanisms for production workloads.

Sensitivity Configuration

The configuration of several of the Cloud provider’s solutions allows us to set a sensitivity level to the guardrails configured in order to adapt the detection to the required level for the considered use-case.

AWS Bedrock Guardrails configuration

Customization

Beyond sensitivity tuning, fine-grained customization is essential for effective guardrails protections. Each application has specific filtering requirements, driven by business context, regulatory constraints, and threat exposure.

Personalization is required at multiple levels:

Business context: blocking application-specific forbidden topics, such as competitors, confidential projects, or regulated information
Threat mitigation: adapting filters to address high-impact attacks, including indirect prompt injection
Data flow awareness: within a single application, different data sources require different filtering strategies. User inputs, retrieved documents, and tool outputs should not be filtered identically.

Applying uniform filtering across all inputs significantly limits effectiveness and may create blind spots. Guardrails must therefore be designed as part of the application architecture, not as a single monolithic filter.

Guardrails position in your application’s infrastructure

Key Insights

This study highlights several key insights:

No single guardrails solution fits all use cases, trade-offs exist between ease of deployment, performance, and customization
Cloud-native guardrails provide an effective and low-effort baseline for most cloud-hosted applications
Advanced use cases require configurable solutions capable of adapting filtering logic to application context and data flows

Guardrails should be selected based on risk exposure, operational maturity, and long-term maintainability rather than raw detection scores alone.

Guardrails have become a necessary component of LLM-based applications, and a wide range of solutions is now available. Selecting the right guardrails requires identifying the solution that best aligns with an organization’s specific risks, constraints, and application architecture.

Depending on your profile we have several suggestions for you:

If your application is already deployed in a cloud environment, using the guardrails provided by the cloud provider is a good solution.
If you want better control over the filtering solution, deploying one of the open-source guardrails solutions may be the most suitable option.
You want the best and have the capacity, you can issue an RFI or RFP to compare different solutions and select the most tailored to your needs.

Finally, guardrails alone are not sufficient to protect your applications. Secure LLM applications also rely on properly configured tools, strict IAM policies, and robust security architecture to prevent more severe exploitation scenarios.

Cet article GenAI Guardrails – Why do you need them & Which one should you use? est apparu en premier sur RiskInsight.

Cybersecurity Startups Radar: 2025, AI at the service of cybersecurity

Ahmed Amine Ghariani — Fri, 20 Jun 2025 14:17:53 +0000

AI at the service of cybersecurity: a concrete step forward

Every year since 2020, Wavestone has identified Swiss cybersecurity startups in its eponymous radar. While AI has established itself as a cross-disciplinary subject in all fields, the 2025 Radar focuses on the use of artificial intelligence as a tool, not just as a subject to be secured, but as a technology at the very heart of the cyber response.

Several startups are using AI to automate, enhance or personalize their solutions:

Egonym uses generative AI to anonymize faces in images and videos while preserving useful traits like age and emotion — striking a rare balance between privacy and utility.

Hafnova applies real-time AI to detect, block, and report threats across critical infrastructures with high responsiveness and minimal delay.

Aurigin combats deepfake-based fraud in real time using multimodal AI that simultaneously analyzes voice, image, and text to validate identities.

RedCarbon delivers autonomous AI agents capable of handling complex cybersecurity tasks such as threat detection, hunting, and compliance monitoring — significantly reducing analyst workload.

Baited leverages AI and OSINT to generate hyper-realistic phishing simulations, enabling organizations to test and train employees under real-world conditions.

It’s good to see AI becoming an essential defensive weapon contributing to the defense of our information systems.

Strong momentum around threat detection, response and monitoring

The second strong trend this year is the emergence or reinforcement of startups specializing in intrusion detection, suspicious behavior detection, incident response and continuous supervision.

This segment, already well established historically, is undoubtedly gaining strength with several new entries:

RedCarbon: AI agents for threat detection & automated hunting.

Swiss Security Hub: continuous monitoring of SAP systems with XDR integration.

Cyberservices : XDR platform based on the Google ecosystem.

Hafnova: real-time cyber supervision in critical sectors.

Tirreno: on-prem platform for online fraud detection with user trust scoring.

At a time when cyber-attacks continue to increase in number and complexity, preventive, contextualized and autonomous detection is and will remain key to strengthening operational resilience.

New ground explored: digital sovereignty and secure hardware

Among the notable additions, The Cosmic Dolphins stands out with its sovereign hardware approach:

The Cosmic Dolphins: Swiss smartphones with dual-zone OS (Shark Zone / Dolphin Zone), kill switch, and hardware-first approach to privacy.

Swiss innovation isn’t limited to software: mastery of the physical infrastructure is becoming an issue of trust, sovereignty and differentiation.

Key Figures

Geographical focus: undisputed predominance of Lausanne and Zurich, but other regions are gaining ground

Unsurprisingly, most startups are located around two main technological clusters: Zürich and Lausanne. This confirms an already existing trend since these two cities are hosting Swiss Federal institutes of technology (ETHZ in Zürich, EPFL in Lausanne).

These universities are providing a fertile ground for startups as they offer support in terms of infrastructure but also in terms of collaboration with students and labs. In return, intellectual property is shared between startups and universities. This model is a success for Switzerland as it allows to continuously improve the economy of these regions with a good balance between investment and research.

Nevertheless, other regions such as Geneva and Ticino are showing increasing dynamism, with several new startups emerging in this year’s edition. This points to a gradually diversifying ecosystem, supported by regional initiatives like innovation hubs and dedicated startup incubators.

Methodology

Wavestone’s Swiss Cybersecurity Startups Radar identifies new players in the Swiss cyber innovation ecosystem. Its objective: to provide a global and critical view of an ever-renewing environment.

Startups were selected according to our eligibility criteria:
Head office in Switzerland
Less than 50 employees
Less than 8 years of activity (established as of 2017)
Business model around a specific product (software or hardware)
Startups were identified and evaluated according to the following procedure:
Open Source Intelligence (OSINT) data consolidation
Evaluation in regard to above criteria
Qualitative interviews with the startups

Cet article Cybersecurity Startups Radar: 2025, AI at the service of cybersecurity est apparu en premier sur RiskInsight.

Red Teaming IA : State of play of AI risks in 2025

Basma Benali — Tue, 15 Apr 2025 13:00:00 +0000

Generative AI systems are fallible: in March 2025, a ChatGPT vulnerability was widely exploited to trap its users; a few months earlier, Microsoft’s health chatbot exposed sensitive data; in December, a simple prompt injection allowed the takeover of a user account on the competing service DeepSeek.

Today, the impacts are limited because the latitude given to AI systems is still relatively low. Tomorrow, with the rise of agentic AI, accelerated adoption of generative AI, and the multiplication of use cases, the impacts will grow. Just as the ransomware WannaCry exploited vulnerabilities on a massive scale in 2017, major cyberattacks are likely to target AI systems and could result in injuries or financial bankruptcies.

These risks can be anticipated. One of the most pragmatic ways to do this is to take on the role of a malicious individual and attempt to manipulate an AI system to study its robustness. This approach highlights system vulnerabilities and how to fix them. Specifically for generative AI, this discipline is called AI RedTeaming. In this article, we offer insight into its contours, focusing particularly on field feedback regarding the main vulnerabilities encountered.

To stay aligned with the market practices, this article exclusively focuses on the RedTeaming of generative AI systems.

Back to basics, how does genAI work ?

GenAI relies on components that are often distributed between cloud and on-premise environments. Generally, the more functionalities a generative AI system offers (searching for information, launching actions, executing code, etc.), the more components it includes. From a cybersecurity perspective, this exposes the system to multiple risks :

Diagram of a Generative AI System and Issues Raised by Component

In general, an attacker only has access to a web interface through which they can interact (click, enter text into fields, etc.). From there, they can:

Conduct classic cybersecurity attacks (inserting malicious scripts – XSS, etc.) by exploiting vulnerabilities in the AI system’s components;
Perform a new type of attack by writing in natural language to exploit the functionalities provided by the generative AI system behind the web interface: data exfiltration, executing malicious actions using the privileges of the generative AI system, etc.

Technically, each component is protected by implementing security measures defined by Security Integration Processes within Projects. It is then useful to practically assess the effective level of security through an AI RedTeam audit.

RedTeaming IA, Art of findings AI vulnerabilities

AI RedTeam audits are similar to traditional security audits. However, to address the new challenges of GenAI, they rely on specific methodologies, frameworks, and tools. Indeed, during an AI RedTeam audit, the goal is to bypass the generative AI system by either attacking its components or crafting malicious instructions in natural language. This second type of attack is called prompt injection, the art of formulating malicious queries to an AI system to divert its functionalities.

During an AI RedTeam audit, two types of tests in natural language attacks (specific to AI) are conducted simultaneously:

Manual tests. These allow a reconnaissance phase using libraries of malicious questions consolidated beforehand.
Automated tests. These usually involve a generative AI attacking the target generative AI system by generating a series of malicious prompts and automatically analyzing the coherence of the chatbot’s responses. They help assess the system’s robustness across a wide range of scenarios.

These tests typically identify several vulnerabilities and highlight cybersecurity risks that are often underestimated.

What are the main vulnerabilities we found ?

We have covered three main deployment categories with our clients:

Simple chatbot : these solutions are primarily used for redirecting and sorting user requests;
RAG (Retrieval-Augmented Generation) chatbot : these more sophisticated systems consult internal document databases to enrich their responses;
Agentic chatbot : these advanced solutions can interact with other systems and execute actions.

The consolidation of vulnerabilities identified during our interventions, as well as their relative criticality, allows us to define the following ranking:

Diversion of the model and generation of illegitimate content

This concerns the circumvention of the technical safeguards put in place during the development of the chatbot in order to generate offensive, malicious, or inappropriate content. Thus, the credibility and reputation of the company are at risk of being impacted since it is responsible for the content produced by its chatbot.

It is worth noting that the circumvention of the model’s security mechanisms can lead to a complete unlocking. This is referred to as a jailbreak of the model, which shifts it into an unrestricted mode. In this state, it can produce content outside the framework desired by the company.

Access to the preprompt

The term preprompt refers to the set of instructions that feed the model and shape it for the desired use. All models are instructed not to disclose this preprompt in any form.

An attacker gaining access to this preprompt has their attack facilitated, as it allows them to map the capabilities of the chatbot model. This mapping is particularly useful for complex systems interfaced with APIs or other external systems. Furthermore, access to this preprompt by an attacker enables them to visualize how the filters and limitations of the chatbot have been implemented, which allows them to bypass them more easily.

Web integration and third-party integration

GenAI solutions are often presented to users through a web interface. AI RedTeaming activities regularly highlight classic issues of web applications, particularly the isolation of user sessions or attacks aimed at trapping them. In the case of agentic systems, these vulnerabilities can also affect third-party components interconnected with the GenAI system.

Sensitive data leaks

If the data feeding the internal knowledge base of a RAG chatbot is insufficiently consolidated (selection, management, anonymization, …), the models may inadvertently reveal sensitive or confidential information.

This issue is related to aspects of rights management, data classification, and hardening the data preparation and transit pipelines (MLOps).

Stored injection

In the case of stored injection, the attacker is able to feed the knowledge base of a model by including malicious instructions (via a compromised document). This knowledge base is used for the chatbot’s responses, so any user interacting with the model and requesting the said document will have their session compromised (leak of users’ conversation history data, malicious redirections, participation in a social engineering attack, etc.).

Compromised documents may be particularly difficult to identify, especially in the case of large or poorly managed knowledge bases. This attack is thus persistent and stealthy.

Mention honorable: parasitism and cost explosion

We talk about parasitism when a user is able to unlock the chatbot to fully utilize the model’s capabilities and do so for free. Coupled with a lack of volumetric restrictions, a user can make a prohibitive number of requests, unrelated to the initial use case, and still be charged for them.

In general, some of the mentioned vulnerabilities concern relatively minor risks, whose business impact on information systems (IS) is limited. Nevertheless, with advances in AI technologies, these vulnerabilities take on a different dimension, particularly in the following cases:

Agentic solutions with access to sensitive systems
RAG applications involving confidential data
Systems for which users have control over the knowledge base documents, opening the door to stored injections

The tested GenAI systems are largely unlockable, although the exercise becomes more complex over time. This persistent inability of the models to implement effective restrictions encourages the AI ecosystem to turn to external security components.

What are the new attack surfaces ?

The increasing integration of AI into sensitive sectors (healthcare, finance, defense, …) expands the attack surfaces of critical systems, which reinforces the need for filtering and anonymization of sensitive data. Where AI applications were previously very compartmentalized, agentic AI puts an end to this compartmentalization as it deploys a capacity for interconnection, opening the door to potential threat propagation within information systems.

The decrease in the technical level required to create an AI system, particularly through the use of SaaS platforms and Low/no code services, facilitates its use for both legitimate users and attackers.

Finally, the widespread adoption of “co-pilots” directly on employees’ workstations results in an increasing use of increasingly autonomous components that act in place of and with the privileges of a human, accelerating the emergence of uncontrolled AI perimeters or Shadow IT AI.

Towards increasingly difficult-to-control systems

Although appearing to imitate human intelligence, GenAI models (LLMs, or Large Language Models) have the sole function of mimicking language and often act as highly efficient text auto-completion systems. These systems are not natively trained to reason, and their use encounters a “black box” operation. It is indeed complex to reliably explain their reasoning, which regularly results in hallucinations in their outputs or logical fallacies. In practice, it is also impossible to prove the absence of “backdoors” in these models, further limiting our trust in these systems.

The emergence of agentic AI complicates the situation. By interconnecting systems with opaque functioning, it renders the entire reasoning process generally unverifiable and inexplicable. Cases of models training, auditing, or attacking other models are becoming widespread, leading to a major trust issue when they are integrated into corporate information systems.

What are the perspectives for the future ?

The RedTeaming AI audits conducted on generative AI systems reveal a contrasting reality. On one hand, innovation is rapid, driven by increasingly powerful and integrated use cases. On the other hand, the identified vulnerabilities demonstrate that these systems, often perceived as intelligent, remain largely manipulable, unstable, and poorly explainable.

This observation is part of a broader context of the democratization of AI tools coupled with their increasing autonomy. Agentic AI, in particular, reveals chains of action that are difficult to trace, acting with human privileges. In such a landscape, the risk is no longer solely technical: it also becomes organizational and strategic, involving continuous governance and oversight of its uses.

In the face of these challenges, RedTeaming AI emerges as an essential lever to anticipate possible deviations, adopting the attacker’s perspective to better prevent drifts. It involves testing the limits of a system to design robust, sustainable protection mechanisms that align with new uses. Only by doing so can generative AI continue to evolve within a framework of trust, serving both users and organizations.

Cet article Red Teaming IA : State of play of AI risks in 2025 est apparu en premier sur RiskInsight.

US Executive Order & Betchley Declaration

Amélie Grangien — Fri, 03 May 2024 08:49:27 +0000

In the evolving landscape of AI governance and regulation, recent efforts have shifted from scattered and reactive measures to cohesive policy frameworks that foster innovation while safeguarding against potential misuse.
As AI becomes more integrated into our daily life, both public and private sectors have raised ethical concerns around issues of privacy, bias, accountability, and transparency.

Source: https://ourworldindata.org/artificial-intelligence

Today, as governments actively craft AI guidance and legislation, policymakers face the challenge of delicately balancing the need to foster innovation and ensuring accountability. A regulatory framework that prioritizes innovation but relies too heavily on the private sector’s self-governance could lead to a lack of oversight and accountability. Conversely, while robust safeguards are essential to mitigate potential risks, an overly restrictive approach may stifle technological progress.
This whitepaper will explore the approaches proposed by the governments of the United States and the United Kingdom as they pertain to AI governance across both the public and private sectors.

American Approach to AI Regulation

In October of 2023, the White House published the AI Executive Order. The order specifies key near-term priorities of introducing reporting requirements for AI developers exceeding computing thresholds, launching research initiatives, developing frameworks for responsible AI use, and establishing AI governance within the federal government. Longer-term efforts focus on international cooperation, global standards, and AI safety.
On the side of ensuring accountability, the order calls for the Secretary of Commerce to enforce reporting provisions for companies developing dual-use AI foundation models, organizations acquiring large-scale computing clusters, and Infrastructure as a Service providers enabling foreign entities to conduct certain AI model training. While these criteria will likely exempt most small to medium sized AI companies from immediate regulations, large industry players like Open AI, Anthropic, and Meta could be affected if they surpass the computing threshold established by the order.
On the other side of fostering innovation, further sections of the order reaffirm the US government’s aim to promote AI innovation and competition – supporting R&D initiatives and public-private partnerships, provisioning streamlined visa processes to attract AI talent to the US, prioritizing AI-oriented recruitment within the federal government, clarifying IP issues related to AI, and preventing unlawful collusion.
Overall, the nature of the documents published by the US is mostly non-binding, indicating a strategy of encouraging the private sector to self-regulate and align to common AI best practices. In this approach, the White House has been persistent in its messaging that it is committed to nurturing innovation, research, and leadership in the domain, while also balancing with the need for a secure and responsible AI ecosystem.

The British Approach to AI Regulation

The Bletchley Declaration, agreed upon during the AI Safety Summit 2023 held at Bletchley Park, Buckinghamshire, marks a pioneering international effort towards ensuring the safe and responsible development of AI technologies. This declaration represents a commitment from 29 governments to collaborate on developing AI in a manner that is human-centric, trustworthy, and responsible, with the UK, US, China, and major European member states among the notable signatories. The focus is on “frontier AI,” which refers to highly capable, general-purpose AI models that could pose significant risks, particularly in areas such as cybersecurity and biotechnology.
The declaration emphasizes the need for governments to take proactive measures to ensure the safe development of AI, acknowledging the technology’s pervasive deployment across various facets of daily life including housing, employment, education, and healthcare. It calls for the development of risk-based policies, appropriate evaluation metrics, tools for safety testing, and building relevant public sector capability and scientific research.
In addition to the declaration, a policy paper on AI ‘Safety Testing’ was also signed by ten countries, including the UK and the US, as well as major technology companies. This policy paper outlines a broad framework for testing next-generation AI models by government agencies, promoting international cooperation, and enabling government agencies to develop their own approaches to AI safety regulation.
The key takeaways from the Bletchley Declaration include a clear signal from governments regarding the urgency to address the development of safe AI. However, how these commitments will translate into specific policy proposals and the role of the newly announced AI Safety Institute (AISI) in the UK’s regulatory landscape remain to be seen. The AISI’s mission is to minimize surprise from rapid and unexpected advances in AI, focusing on testing and evaluation of advanced AI systems, foundational AI safety research, and facilitating information exchange.

As they seek to establish themselves as AI leaders in the global community and set the direction for effective policymaking, both the US and the UK are navigating the balance between promoting AI innovation and ensuring ethical governance. While most of the current focus is on proposing guidelines and frameworks for the safe and responsible use of AI, the reference to potential future regulations across both documents should serve as a wake-up call for companies to start aligning their practices with the principles and recommendations outlined.
To stay ahead of the curve, organizations should develop robust methodologies to monitor AI risks effectively. This involves adapting their AI strategy to prioritize risk mitigation, identifying potential harms that may arise from the deployment of AI systems, and preparing for forthcoming regulatory measures by implementing a secure and comprehensive risk management program.
However, the US and UK opportunist approach to AI legislation is not followed by all. China chose a targeted and evolutive approach by writing a law on Generative AI that came into effect in 2023. Finally, in Europe, the AI Act shows that the EU doesn’t want to let AI technologies go out of hand.

Cet article US Executive Order & Betchley Declaration est apparu en premier sur RiskInsight.

Securing AI: The New Cybersecurity Challenges

Gérôme Billois — Wed, 13 Mar 2024 15:08:52 +0000

The use of artificial intelligence systems and Large Language Models (LLMs) has exploded since 2023. Businesses, cybercriminals and individuals alike are beginning to use them regularly. However, like any new technology, AI is not without risks. To illustrate these, we have simulated two realistic attacks in previous articles: Attacking an AI? A real-life example! and Language as a sword: the risk of prompt injection on AI Generative.

This article provides an overview of the threat posed by AI and the main defence mechanisms to democratize their use.

AI introduces new attack techniques, already widely exploited by cybercriminals

As with any new technology, AI introduces new vulnerabilities and risks that need to be addressed in parallel with its adoption. The attack surface is vast: a malicious actor could attack both the model itself (model theft, model reconstruction, diversion from initial use) and its data (extracting training data, modifying behaviour by adding false data, etc.).

Prompt injection is undoubtedly the most talked-about technique. It enables an attacker to perform unwanted actions on the model, such as extracting sensitive data, executing arbitrary code, or generating offensive content.

Given the growing variety of attacks on AI models, we will take a non-exhaustive look at the main categories:

Data theft (impact on confidentiality)

As soon as data is used to train Machine Learning models, it can be (partially) reused to respond to users. A poorly configured model can then be a little too verbose, unintentionally revealing sensitive information. This situation presents a risk of violation of privacy and infringement of intellectual property.

And the risk is all the greater if the models are ‘overfitted’ with specific data. Oracle attacks take place when the model is in production, and the attacker questions the model to exploit its responses. These attacks can take several forms:

Model extraction/theft: an attacker can extract a functional copy of a private model by using it as an oracle. By repeatedly querying the Machine Learning model’s API access, the adversary can collect the model’s responses. These responses will be used as labels to form a separate model that mimics the behaviour and performance of the target model.
Membership inference attacks: this attack aims to check whether a specific piece of data has been used during the training of an AI model. The consequences can be far-reaching, particularly for health data: imagine being able to check whether an individual has cancer or not! This method was used by the New York Times to prove that its articles were used to train ChatGPT[1].

Destabilisation and damage to reputation (impact on integrity)

The performance of a Machine Learning model depends on the reliability and quality of its training data. Poison attacks aim to compromise the training data to affect the model’s performance:

Model skewing: the attack aims to deliberately manipulate a model during training (either during initial training, or after it has been put into production if the model continues to learn) to introduce biases and steer the model’s predictions. As a result, the biased model may favour certain groups or characteristics, or be directed towards malicious predictions.
Backdoors: an attacker can train and distribute a corrupted model containing a backdoor. Such a model functions normally until an input containing a trigger modifies its behaviour. This trigger can be a word, a date or an image. For example, a malware classification system may let malware through if it sees a specific keyword in its name or from a specific date. Malicious code can also be executed[2]!

The attacker can also add carefully selected noise to mislead the prediction of a healthy model. This is known as an adversarial or evasion attack:

Evasion attack (adversarial attack): the aim of this attack is to make the model generate an output not intended by the designer (making a wrong prediction or causing a malfunction in the model). This can be done by slightly modifying the input to avoid being detected as malicious input. For example:
- Ask the model to describe a white image that contains a hidden injection prompt, written white on white in the image.
- Wear a special pair of glasses to avoid being recognised by a facial recognition algorithm[3].
- Add a sticker of some kind to a “Stop” sign so that the model recognises a “45km/h limit” sign[4].

Impact on availability

In addition to data theft and the impact on image, attackers can also hamper the availability of Artificial Intelligence (AI) systems. These tactics are aimed not only at making data unavailable, but also at disrupting the regular operation of systems. One example is the poisoning attack, the impact of which is to make the model unavailable while it is retrained (which also has an economic impact due to the cost of retraining the model). Here is another example of an attack:

Denial of service attack (DDOS) on the model: like all other applications, Machine Learning models are sensitive to denial-of-service attacks that can hamper system availability. The attack can combine a high number of requests, while sending requests that are very heavy to process. In the case of Machine Learning models, the financial consequences are greater because tokens/prompts are very expensive (for example, ChatGPT is not profitable despite its 616 million monthly users).

Two ways of securing your AI projects: adapt your existing cyber controls, and develop specific Machine Learning measures

Just like security projects, a prior risk analysis is necessary to implement the right controls, while finding an acceptable compromise between security and the functioning of the model. To do this, our traditional risk methods need to evolve to include the risks detailed above, which are not well covered by historical methods.

Following these risk analyses, security measures will need to be implemented. Wavestone has identified over 60 different measures. In this second part, we present a small selection of these measures to be implemented according to the criticality of your models.

1. Adapting cyber controls to Machine Learning models

The first line of defence corresponds to the basic application, infrastructure, and organisational measures for cybersecurity. The aim is to adapt requirements that we already know about, which are present in the various security policies, but do not necessarily apply in the same way to AI projects. We need to consider these specificities, which can sometimes be quite subtle.

The most obvious example is the creation of AI pentests. Conventional pentests involve finding a vulnerability to gain access to the information system. However, AI models can be attacked without entering the IS (like evasion and oracle attacks). RedTeaming procedures need to evolve to deal with these particularities while developing detection and incident response mechanisms to cover the new applications of AI.

Another essential example is the isolation of AI environments used throughout the lifecycle of Machine Learning models. This reduces the impact of a compromise by protecting the models, training data, and prediction results.

You also need to assess the regulations and laws with which the Machine Learning application must comply, and adhere to the latest legislation on artificial intelligence (the IA Act in Europe, for example).

And finally, a more than classic measure: awareness and training campaigns. We need to ensure that the stakeholders (project managers, developers, etc.) are trained in the risks of AI systems and that users are made aware of these risks.

2. Specific controls to protect sensitive Machine Learning models

In addition to the standard measures that need to be adapted, specific measures need to be identified and applied.

For your least critical projects, keep things simple and implement the basics

Poison control: to guard against poisoning attacks, you need to detect any “false” data that may have been injected by an attacker. This involves using exploratory statistical analysis to identify poisoned data (analysing the distribution of data and identifying absurd data, for example). This step can be included in the lifecycle of a Machine Learning model to automate downstream actions. However, human verification will always be necessary.

Input control (analysing user input): to counter prompt injection and evasion attacks, user input is analysed and filtered to block all malicious input. We can think of basic rules (blocking requests containing a specific word) as well as more specific statistical rules (format, consistency, semantic coherence, noise, etc.). However, this approach could have a negative impact on model performance, as false positives would be blocked.

For your moderately sensitive projects, aim for a good investment/risk coverage ratio

There is a plethora of measures, and a great deal of literature on the subject. On the other hand, some measures can cover several risks at once. We think it is worth considering them first.

Transform inputs: an input transformation step is added between the user and the model. The aim is twofold:

For example, remove or modify any malicious input by reformulating the input or truncating it. An implementation using encoders is also possible (but will be detailed in the next section).
Another instance will be to reduce the attacker’s visibility to counter oracle attacks (which require precise knowledge of the model’s input and output) by adding random noise or reformulating the prompt.

Depending on the implementation method, impacts on model performance are to be expected.

Supervise AI with AI models: any AI model that learns after it has been put into production must be specifically supervised as part of overall incident detection and response processes. This involves both collecting the appropriate logs to carry out investigations, but also monitoring the statistical deviation of the model to spot any abnormal drift. In other words, it involves assessing changes in the quality of predictions over time. Microsoft’s Tay model launched on Twitter in 2016 is a good example of a model that has drifted.

For your critical projects, go further to cover specific risks

There are measures that we believe are highly effective in covering certain risks. Of course, this involves carrying out a risk analysis beforehand. Here are two examples (among many others):

Randomized Smoothing: a training technique designed to improve the robustness of a model’s predictions. The model is trained twice: once with real training data, then a second time with the same data altered by noise. The aim is to have the same behaviour, whether noise is present in the input. This limits evasion attacks, particularly for classification algorithms.

Learning from contradictory examples: the aim is to teach the model to recognise malicious inputs to make it more robust to adversarial attacks. In practical terms, this means labelling contradictory examples (i.e. a real input that includes a small error/disturbance) as malicious data and adding them during the training phase. By confronting the model with these simulated attacks, it learns to recognise and counter malicious patterns. This is a very effective measure, but it involves a certain cost in terms of resources (longer training phase) and can have an impact on the accuracy of the model.

Versatile guardians – three sentinels of AI security

Three methods stand out for their effectiveness and their ability to mitigate several attack scenarios simultaneously: GAN (Generative Adversarial Network), filters (encoders and auto-encoders that are models of neural networks) and federated learning.

The GAN: the forger and the critic

The GAN, or Generative Adversarial Network, is an AI model training technique that works like a forger and a critic working together. The forger, called the generator, creates “copies of works of art” (such as images). The critic, called the discriminator, evaluates these works to identify the fakes from the real ones and gives advice to the forger on how to improve. The two work in tandem to produce increasingly realistic works until the critic can no longer identify the fakes from the real thing.

A GAN can help reduce the attack surface in two ways:

With the generator (the faker) to prevent sensitive data leaks. A new fictitious training database can be generated, like the original but containing no sensitive or personal data.
The discriminator (the critic) limits evasion or poisoning attacks by identifying malicious data. The discriminator compares a model’s inputs with its training data. If they are too different, then the input is classified as malicious. In practice, it can predict whether an input belongs to the training data by associating a likelihood scope with it.

Auto-encoders: an unsupervised learning algorithm for filtering inputs and outputs

An auto-encoder transforms an input into another dimension, changing its form but not its essence. To take a simplifying analogy, it’s as if the prompt were summarized and rewritten to remove undesirable elements. In practice, the input is compressed by a noise-removing encoder (via a first layer of the neural network), then reconstructed via a decoder (via a second layer). This model has two uses:

If an auto-encoder is positioned upstream of the model, it will have the ability to transform the input before it is processed by the application, removing potential malicious payloads. In this way, it becomes more difficult for an attacker to introduce elements enabling an evasion attack, for example.
We can use this same system downstream of the model to protect against oracle attacks (which aim to extract information about the data or the model by interrogating it). The output will thus be filtered, reducing the verbosity of the model, i.e. reducing the amount of information output by the model.

Federated Learning: strength in numbers

When a model is deployed on several devices, a delocalised learning method such as federated learning can be used. The principle: several models learn locally with their own data and only send their learning back to the central system. This allows several devices to collaborate without sharing their raw data. This technique makes it possible to cover a large number of cyber risks in applications based on artificial intelligence models:

Segmentation of training databases plays a crucial role in limiting the risks of Backdoor and Model Skewing poisoning. The fact that training data is specific to each device makes it extremely difficult for an attacker to inject malicious data in a coordinated way, as he does not have access to the global set of training data. This same division limits the risks of data extraction.
The federated learning process also limits the risks of model extraction. The learning process makes the link between training data and model behaviour extremely complex, as the model does not learn directly. This makes it difficult for an attacker to understand the link between input and output data.

Together, GAN, filters (encoders and auto-encoders) and federated learning form a good risk hedging proposition for Machine Learning projects despite the technicality of their implementation. These versatile guardians demonstrate that innovation and collaboration are the pillars of a robust defence in the dynamic artificial intelligence landscape.

To take this a step further, Wavestone has written a practical guide for ENISA on securing the deployment of machine learning, which lists the various security controls that need to be established.

In a nutshell

Artificial intelligence can be compromised by methods that are not usually encountered in our information systems. There is no such thing as zero risk: every model is vulnerable. To mitigate these new risks, additional defence mechanisms need to be implemented depending on the criticality of the project. A compromise will have to be found between security and model performance.

AI security is a very active field, from Reddit users to advanced research work on model deviation. That’s why it’s important to keep an organisational and technical watch on the subject.

[1] New York Times proved that their articles were in AI training data set

[2] Au moins une centaine de modèles d’IA malveillants seraient hébergés par la plateforme Hugging Face

[3] Sharif, M. et al. (2016). Accessorize to a crime: Real and stealthy attacks on state-of-the-art face recognition. ACM Conference on Computer and Communications Security (CCS)

[4] Eykholt, K. et al. (2018). Robust Physical-World Attacks on Deep Learning Visual Classification. CVPR. https://arxiv.org/pdf/1707.08945.pdf

Cet article Securing AI: The New Cybersecurity Challenges est apparu en premier sur RiskInsight.