Skip to content

2.3 Misuse Risks

In the following sections, we will go through some world-states that hopefully paint a little bit of a clearer picture of risks when it comes to AI. Although the sections have been divided into misuse, misalignment, and systemic, it is important to remember that this is for the sake of explanation. It is highly likely that the future will involve a mix of risks emerging from all of these categories.

Technology increases the harm impact radius . Technology is an amplifier of intentions. As it improves, so does the radius of its effects. According to how powerful a certain technology is, both its beneficial and its harmful effects can affect the world in a larger radius. Think about the harm that a person could do when utilizing other tools throughout history. During the Stone Age, with a rock maybe someone could harm ~5 people, a few hundred years ago with a bomb someone could harm ~100 people. In 1945 with a nuclear weapon, one person could harm ~250,000 people. The thing to notice here is that we are on an exponential trend, where the radius of potential impact from one person using technological tools keeps increasing. If we experience a nuclear winter today, the harm radius would be almost 5 billion people, which is ~60% of humanity. If we assume that transformative AI is a tool that overshadows the power of all others that came before it, then its blast radius could potentially harm 100% of humanity. (Munk Debate, 2023)

Another thing to keep in mind is that the more spread out that such a technology is, the higher the risks of malicious use. From the previous example, we can see that as time progresses, a single person in possession of some technology has been able to cause increasing amounts of harm throughout history. If many people have access to tools that can be both highly beneficial or catastrophically harmful, then it might only take one single person to cause significant devastation to society. So the growing potential for AIs to empower malicious actors may be one of the most severe threats humanity will face in the coming decades.

2.3.1 Bio Risk

When we look at ways AI could enable harm through misuse, one of the most concerning cases involves biology. Just as AI can help scientists develop new medicines and understand diseases, it can also make it easier for bad actors to create biological weapons.

What unique risks could arise from AI-enabled bioweapons? Unlike conventional weapons with localized effects, engineered pathogens can self-replicate and spread globally. The COVID-19 pandemic demonstrated how even relatively mild viruses can cause widespread harm despite safeguards (Pannu et al., 2024). While pandemic-class agents might be strategically useless to nation-states due to their slow spread and indiscriminate lethality, they can still be potentially acquired and deliberately released by terrorists (Esvelt, 2022). The offense-defense balance in biotechnology development compounds these risks - developing a new virus might cost around 100 thousand dollars, while creating a vaccine against it could cost over 1 billion dollars (Mouton et al., 2024).

Enter image alt description
Figure 2.9: Graphic adapted from a report by RAND. It highlights how biotechnology and AI are converging rapidly. (Zakaria, 2024)

What do we know about AI's impact on biological information access? Demonstrations have shown that students with no biology background can use AI chatbots to rapidly gather sensitive information - “within an hour, they identified potential pandemic pathogens, methods to produce them, DNA synthesis firms likely to overlook screening, and detailed protocols”(Soice et al., 2023). However, when compared to the baseline of internet access, the National Security Commission on Emerging Biotechnology concluded in 2024 that LLMs do not meaningfully increase bioweapon risks beyond existing information sources. (Peppin et al., 2024; NSCEB, 2024). But as we saw in the previous chapter, capabilities continue to increase, and with accelerating capabilities of models the situation could change equally rapidly.

How might AI affect access to biological knowledge? Demonstrations have shown that students with no biology background were able to use AI chatbots to rapidly gather sensitive information - "within an hour, they identified potential pandemic pathogens, methods to produce them, DNA synthesis firms likely to overlook screening, and detailed protocols" (Soice et al., 2023). This raised concerns about AI democratizing access to dangerous biological knowledge. However, when compared to baseline internet access it was concluded by the US national security commission on emerging biotechnology that they do not meaningfully increase bioweapon risks beyond existing information sources (Peppin et al., 2024; NSCEB, 2024). We saw in the previous chapter that AI systems are rapidly becoming more capable. Future models could potentially overcome current limitations by providing more actionable guidance and helping users work around safety measures. So increased access to potential bio weapon synthesis knowledge remains a risk deserving of serious consideration.

Enter image alt description
Figure 2.10: Biotechnology risk chain. The risk chain for developing a bioweapon starts with ideating a biological threat, followed by a design-build-test-learn (DBTL) loop. (Li et al., 2024)

How might AI transform biological design and synthesis? The potential for misuse in biological design has already been demonstrated - researchers took an AI model designed for drug discovery and redirected it to generate toxic compounds, producing 40,000 potentially toxic molecules within six hours, some of which were more deadly than known chemical weapons (Urbina et al., 2022). A major limitation is that creating biological weapons still requires extensive practical expertise and resources. Experts estimate that in 2022 about 30,000 individuals worldwide possessed the skills needed to follow even basic virus assembly protocols (Esvelt, 2022). Key barriers include specialized laboratory skills, tacit knowledge, access to controlled materials and equipment, and complex testing requirements (Carter et al., 2023)

Enter image alt description
Figure 2.11: An example of a benchtop DNA synthesis machine. (DnaScript, 2024)

However, broader technological trends could help overcome these barriers. DNA synthesis costs have been halving every 15 months (Carlson, 2009). Automated "cloud laboratories" allow researchers to remotely conduct experiments by sending instructions to robotic systems. Benchtop DNA synthesis machines (at home devices that can print custom DNA sequences) are also becoming more widely available. Combined with increasingly sophisticated AI assistance for experimental design and optimization, these developments could make creating custom biological agents more accessible to people without extensive resources or institutional backing (Carter et al., 2023).

2.3.2 Cyber Risk

What does non-AI enabled cybersecurity look like? Even without AI, global cybersecurity infrastructure shows vulnerabilities. A single software update by crowdstrike caused airlines to stop flights, hospitals to cancel surgeries, and banks to stop processing transactions causing over 5 billion dollars of damage (CrowdStrike, 2024). This wasn't even a cyber attack - it was an accident. In deliberate attacks, we have examples like the colonial pipeline ransomware attack which caused widespread gas shortages (CISA, 2021; Cunha & Estima, 2023), or the Sony Pictures hack through targeted phishing emails by North Korea. (Slattery et al., 2024) These are just a couple of examples amongst many others. It shows how vulnerable our computer systems are, and why we need to think carefully about how AI could make attacks worse.

What are cyber attack overhangs? Beyond accidents and demonstrated attacks, we also face "cyberattack overhangs" - where devastating attacks are possible but haven't occurred due to attacker restraint rather than robust defenses. As an example, Chinese state actors are claimed to have already positioned themselves inside critical U.S. infrastructure systems (CISA, 2024). This type of cyber deterrent positioning can happen between any group of nations. Due to such cyber attack overhangs several actors might have the potential capability to disrupt water controls, energy systems, and ports in different nations. The point we are trying to illustrate is that as far as cyber security is concerned, society is in a pretty precarious state, even before AI comes into the picture.

How does AI augment social engineering? AI enables automated, highly personalized phishing at scale. AI-generated phishing emails achieve higher success rates (65% vs 60% for human-written) while taking 40% less time to create (Slattery et al., 2024). Tools like FraudGPT automate this customization using targets' background, interests, and relationships. Adding to this threat, open source AI voice cloning tools just minutes of audio to create convincing replicas of someone's voice (Qin et al., 2024). A similar situation exists in deepfakes where AI is showing progress in one-shot face swapping and manipulation. If only a single image of two individuals exists on the internet, then they can be a target of face swapping deepfakes (Zhu et al., 2021; Li et al., 2022; Xu et al., 2022) Automated web crawling for open source intelligence (OSINT) to gather photos, audio, interests and information enables AI-assisted password cracking which has shown to significantly more effective than traditional methods while requiring less computational resources (Slattery et al., 2024).

Enter image alt description
Figure 2.12: Example of one shot face swapping. Left: source image that represents the identity; Middle: target image that provides the attributes; Right: the swapped face image. (Zhu et al., 2021)

How does AI enhance vulnerability discovery? AI systems can now scan code and probe systems automatically, finding potential weaknesses much faster than humans. Research shows AI agents can autonomously discover and exploit vulnerabilities without human guidance, successfully hacking 73% of test targets (Fang et al., 2024). These systems can even discover novel attack paths that weren't known beforehand.

How does AI enhance malware development? We can take tools that are designed to write correct code, and simply reward them for writing malware. Tools like WormGPT help attackers generate malicious code and build attack frameworks without requiring deep technical knowledge. Polymorphic AI malware like BlackMamba can also automatically generate variations of malware that preserve functionality while appearing completely different to security tools. Each attack can use unique code, communication patterns, and behaviors - making it much harder for traditional security tools to identify threats. (HYAS, 2023) AI fundamentally changes the cost-benefit calculations for attackers. Research shows autonomous AI agents can now hack websites for about 10 dollars per attempt - roughly 8 times cheaper than using human expertise (Fang et al., 2024). This dramatic reduction in cost enables attacks at unprecedented scale and frequency.

Enter image alt description
Figure 2.13: Stages of a cyberattack. The objective is to design benchmarks and evaluations that assess models ability to aid malicious actors with all four stages of a cyberattack. (Li et al., 2024)

How do AI enabled cyber threats influence infrastructure and systemic risks? Infrastructure attacks that once took years and millions of dollars, like Stuxnet, could become more accessible as AI automates the mapping of industrial networks and identification of critical control points. AI can analyze technical documentation and generate attack plans that previously required teams of experts (Ladish, 2024). AI removes these limits, enabling automated attacks that could target thousands of systems simultaneously and trigger cascading failures across interconnected infrastructure (Newman, 2024).

Enter image alt description
Figure 2.14: Schematic of using autonomous LLM agents to hack websites. (Fang et al., 2024)

How does AI change the offense defence balance in cyber security? AI should theoretically help defenders more than attackers since a perfectly secured system would have no vulnerabilities to exploit. However, real-world security faces practical challenges - many organizations struggle to implement even basic security practices. Attackers only need to find a single weakness, while defenders must protect everything. AI makes finding these weaknesses easier and more automated, potentially shifting the balance toward offense (Slattery et al., 2024). The speed of AI-enabled attacks adds another layer of difficulty. When we combine automated vulnerability discovery, malware generation, and increased ease of access this enables end-to-end automated attacks that previously required teams of skilled humans. AI's ability to execute attacks in minutes rather than weeks creates the potential for "flash attacks" where systems are compromised before human defenders can respond (Fang et al., 2024).

2.3.3 Autonomous Weapons Risk

In the previous sections, we saw how AI amplifies risks in biological and cyber domains by removing human bottlenecks and enabling attacks at unprecedented speed and scale. The same pattern emerges even more dramatically with military systems. Traditional weapons are constrained by their human operators - a person can only control one drone, make decisions at human speed, and may refuse unethical orders. AI removes these human constraints, setting the stage for a fundamental transformation in how wars are fought.

How widespread is military AI deployment today? AI-enabled weapons are already being used in active conflicts, with real-world impacts we can observe. According to reports made to the UN Security Council, autonomous drones were used to track and attack retreating forces in Libya in 2021, marking one of the first documented cases of lethal autonomous weapons (LAWs) making targeting decisions without direct human control (Panel of Experts on Libya, 2021). In Ukraine, both parties have used loitering munitions. Russian KUB-BLA, Lancet-3 and Ukrainian Switchblade, Phoenix Ghost are AI-enabled drones. The Lancet is using an Nvidia computing module for autonomous target tracking. (Bode & Watts, 2023) Israel has conducted AI-guided drone swarm attacks in Gaza, while Turkey's Kargu-2 can find and attack human targets on its own using machine learning, rather than needing constant human guidance. These deployments show how quickly military AI is moving from theoretical possibilities to battlefield realities (Simmons-Edler et al., 2024; Bode & Watts, 2023).

How are military systems becoming more autonomous over time? The evolution of military AI follows a clear progression from basic algorithms to increasingly autonomous systems. Early automated weapons like the U.S. Phalanx close-in weapon system from the 1970s operated under narrow constraints - they could only respond to specific threats in specific ways (Simmons-Edler et al., 2024). Today's systems use machine learning to actively perceive and respond to their environment, as is evident from all the examples we gave in the previous paragraph. This progression from algorithmic to autonomous capabilities mirrors broader trends in AI development, but with big implications for risk, warfare and human lives. (Simmons-Edler et al., 2024).

How do Military Incentives Drive Increasing Autonomy? Several forces push toward greater AI control of weapons. Speed offers decisive advantages in modern warfare - when DARPA tested an AI system against an experienced F-16 pilot in simulated dogfights, the AI won consistently by executing maneuvers too precise and rapid for humans to counter. Cost creates additional pressure - the U.S. military's Replicator program aims to deploy thousands of autonomous drones at a fraction of the cost of traditional aircraft (Simmons-Edler et al., 2024). Perhaps most importantly, military planners worry about enemies jamming communications to remotely operated weapons. This drives development of systems that can continue fighting even when cut off from human control (Bode & Watts, 2023). These incentives mean military AI development increasingly focuses on systems that can operate with minimal human oversight.

Many modern systems are specifically designed to operate in GPS-denied environments where maintaining human control becomes impossible. In Ukraine, military commanders have explicitly called for more autonomous operation to match the speed of modern combat, with one Ukrainian commander noting they 'already conduct fully robotic operations without human intervention' (Bode & Watts, 2023).

There is also a “marketing messaging shift” around autonomous capabilities in real conflicts. After a UN report suggested autonomous targeting by Turkish Kargu-2 drones in Libya, the manufacturer quickly shifted from advertising its 'autonomous attack modes' to emphasizing human control. However, technical analysis reveals many current systems retain latent autonomous targeting capabilities even while being operated with humans-in-the-loop - suggesting a small software update could enable fully autonomous operation (Bode & Watts, 2023).

Enter image alt description
Figure 2.15: Loitering munitions are expendable uncrewed aircraft which can integrate sensor based analysis to hover over, detect, and crash into targets. These systems were developed during the 1980s and early 1990s to conduct Suppression of Enemy Air Defence (SEAD) operations. They “blur the line between drone and missile”. (Bode & Watts, 2023)

How do advances in swarm intelligence amplify these risks? As AI enables better coordination between autonomous systems, military planners are increasingly focused on deploying weapons in interconnected swarms. The U.S. Replicator already has plans to build and deploy thousands of coordinated autonomous drones that can overwhelm defenses through sheer numbers and synchronized actions. (Defense Innovation Unit, 2023) When combined with increasing autonomy, these swarm capabilities mean that future conflicts may involve massive groups of AI systems making coordinated decisions faster than humans can track or control (Simmons-Edler et al., 2024).

What Happens as Humans Lose Meaningful Control? The pressure to match the speed and scale of AI-driven warfare leads to a gradual erosion of human decision-making. Military commanders increasingly rely on AI systems not just for individual weapons, but for broader tactical decisions. In 2023, Palantir demonstrated an AI system that could recommend specific missile deployments and artillery strikes. While presented as advisory tools, these systems create pressure to delegate more control to AI as human commanders struggle to keep pace (Simmons-Edler et al., 2024).

Even when systems nominally keep humans in control, combat conditions can make this control more theoretical than real. Operators often make targeting decisions under intense battlefield stress, with only seconds to verify computer-suggested targets. Studies of similar high-pressure situations show operators tend to uncritically trust machine suggestions rather than exercising genuine oversight. This means that even systems designed for human control may effectively operate autonomously in practice (Bode & Watts, 2023).

The "Lavender" targeting system demonstrates where this leads - humans just set the acceptable thresholds. Lavender uses machine learning to assign residents a numerical score relating to the suspected likelihood that a person is a member of an armed group. Based on reports, Israeli military officers are responsible for setting the threshold beyond which an individual can be marked as a target subject to attack. (Human Rights Watch, 2024; Abraham, 2024). As warfare accelerates beyond human decision speeds, maintaining meaningful human control becomes increasingly difficult.

What happens when automation removes human safeguards? Traditional warfare had built-in human constraints that limited escalation. Soldiers could refuse unethical orders, feel empathy for civilians, or become fatigued - all natural brakes on conflict. AI systems remove these constraints. Recent studies of military AI systems found they consistently recommend more aggressive actions than human strategists, including escalating to nuclear weapons in simulated conflicts (Rivera et al., 2024). The history of nuclear close calls shows the importance of human judgment - in 1983, nuclear conflict was prevented by a single Soviet officer. Stanislav Petrov chose to ignore a computerized warning of incoming U.S. missiles, correctly judging it to be a false alarm. As militaries increasingly rely on AI for early warning and response, we may lose these crucial moments of human judgment that have historically prevented catastrophic escalation (Simmons-Edler et al., 2024).

How Does This Create Dangerous Arms Race Dynamics? The development of autonomous weapons is creating powerful pressure for military competition in ways that threaten both safety and research. When one country develops new AI military capabilities, others feel they must rapidly match them to maintain strategic balance. China and Russia have set 2028-2030 as targets for major military automation, while the U.S. Replicator program aims to build and deploy thousands of autonomous drones by 2025. (Greenwalt, 2023; U.S Defense Innovation Unit, 2023) This competition creates pressure to cut corners on safety testing and oversight. (Simmons-Edler et al., 2024). This mirrors the nuclear arms race during the Cold War, where competition for superiority ultimately increased risks for all parties. Once again, as we mention in many other sections, we see a fear based race dynamic where only the actors willing to compromise and undermine safety stay in the race. (Leahy et al., 2024)

Complete automation leads to loss of human safeguards . Traditional warfare had built-in human constraints that limited escalation. Soldiers could refuse unethical orders, feel empathy for civilians, or become fatigued - all natural brakes on conflict. AI systems remove these constraints. Recent studies of military AI systems found they consistently recommend more aggressive actions than human strategists, including escalating to nuclear weapons in simulated conflicts. When researchers tested AI models in military planning scenarios, the AIs showed concerning tendencies to recommend pre-emptive strikes and rapid escalation, often without clear strategic justification (Rivera et al., 2024). The loss of human judgment becomes especially dangerous when combined with the increasing speed of AI-driven warfare. The history of nuclear close calls shows the importance of human judgment - in 1983, Soviet officer Stanislav Petrov chose to ignore a computerized warning of incoming U.S. missiles, correctly judging it to be a false alarm. As militaries increasingly rely on AI for early warning and response, we may lose these crucial moments of human judgment that have historically prevented catastrophic escalation (Simmons-Edler et al., 2024).

What happens when AI systems interact in conflict? The risks of autonomous weapons become even more concerning when multiple AI systems engage with each other in combat. AI systems can interact in unexpected ways that create feedback loops, similar to how algorithmic trading can cause flash crashes in financial markets. But unlike market crashes that only affect money, autonomous weapons could trigger rapid escalations of violence before humans can intervene. This risk becomes especially severe when AI systems are connected to nuclear arsenals or other weapons of mass destruction. The complexity of these interactions means even well-tested individual systems could produce catastrophic outcomes when deployed together (Simmons-Edler et al., 2024).

Enter image alt description
Figure 2.16: An example from the 2010 stock trading flash crash. Various stocks crashed to as little as 1 cent, and then quickly rebounded within a matter of minutes partly caused by algorithmic trading. (Future of Life Institute, 2024). We can imagine automated retaliation systems that might cause similar incidents, but this time with missiles instead of stocks.

How does this create risks of automated escalation? The combination of increasing autonomy, swarm intelligence, and pressure for speed creates a clear path to potential catastrophe. As weapons become more autonomous, they can act more independently. As they gain swarm capabilities, they can coordinate at massive scale. As warfare accelerates, humans have less ability to intervene. Each of these trends amplifies the others - autonomous swarms can act faster than human-controlled ones, high-speed warfare creates pressure for more autonomy, and larger swarms demand more automation to coordinate effectively. This self-reinforcing cycle pushes toward automated warfare even if no single actor intends that outcome (Simmons-Edler et al., 2024).

This can create unexpected feedback loops, similar to how algorithmic trading can cause flash crashes in financial markets. But unlike market crashes that only affect money, autonomous weapons could trigger rapid escalations of violence before humans can intervene. This risk becomes especially concerning as AI systems begin interfacing with nuclear command and control. The complexity of these interactions means even well-tested individual systems could produce catastrophic outcomes when deployed together (Simmons-Edler et al., 2024). When wars require human soldiers, the human cost creates political barriers to conflict. Studies suggest that countries are more willing to initiate conflicts when they can rely on autonomous systems instead of human troops. Combined with the risks of automated nuclear escalation, this creates multiple paths to catastrophic outcomes that could threaten humanity's long-term future (Simmons-Edler et al., 2024).

2.3.4 Adversarial AI Risk

Adversarial attacks reveal a fundamental vulnerability in machine learning systems - they can be reliably fooled through careful manipulation of their inputs. This manipulation can happen in several ways: during the system's operation (runtime/inference time attacks), during its training (data poisoning), or through pre-planted vulnerabilities (backdoors).

What are runtime adversarial attacks? The simplest way to understand runtime attacks is through computer vision. By adding carefully crafted noise to an image - changes so subtle humans can't notice them - attackers can make an AI confidently misclassify what it sees. A photo of a panda with imperceptible pixel changes causes the AI to classify it as a gibbon with 99.3% confidence, while to humans it still looks exactly like a panda (Goodfellow et al., 2014). These attacks have evolved beyond simple misclassification - attackers can now choose exactly what they want the AI to see.

Enter image alt description
Figure 2.17: Perturbations: Small but intentional changes to data such that the model outputs an incorrect answer with high confidence. (Goodfellow et al., 2014) The image shows how we can fool an image classifier with an adversarial attack (Fast Gradient Sign Method (FGSM) attack). (OpenAI, 2017)

Example: Runtime attacks in the real world . These attacks aren't just theoretical - they work in physical settings too. Think about AI systems controlling cars, robots, or security cameras. Just like adding careful pixel noise to digital images, attackers can modify physical objects to fool AI systems. Researchers showed that putting a few small stickers on a stop sign could trick autonomous vehicles into seeing a speed limit sign instead. The stickers were designed to look like ordinary graffiti but created adversarial patterns that fooled the AI.

Enter image alt description
Figure 2.18: Robust Physical Perturbations (RP2): Small visual stickers placed on physical objects like stop signs can cause image classifiers to misclassify them, even under different viewing conditions. (Eykholt et al., 2018)

Example: Optical Attacks - Runtime attacks using light . You don't even need to physically modify objects anymore - shining specific light patterns works too because it creates those same adversarial patterns through light and shadow. All an attacker needs is line of sight and basic equipment to project these patterns and compromise vision-based AI systems. (Eykholt et al., 2018)

Enter image alt description
Figure 2.19: Optical Perturbations: Small visual stickers placed on physical objects like stop signs can cause image classifiers to misclassify them, even under different viewing conditions. (Gnanasambandam et al, 2021)

Example: Dolphin Attacks - Runtime attack on audio systems . Just as AI systems can be fooled by carefully crafted visual patterns, they're vulnerable to precisely engineered audio patterns too. Remember how small changes in pixels could dramatically change what a vision AI sees? The same principle works in audio - tiny changes in sound waves, carefully designed, can completely change what an audio AI "hears." Researchers found they could control voice assistants like Siri or Alexa using commands encoded in ultrasonic frequencies - sounds that are completely inaudible to humans. Using nothing more than a smartphone and a 3 dollar speaker, attackers could trick these systems into executing commands like "call 911" or "unlock front door" without the victim even knowing. These attacks worked from up to 1.7 meters away - someone just walking past your device could trigger them (Zhang et al., 2017). Just like in the vision examples where self-driving cars could miss stop signs, audio attacks create serious risks - unauthorized purchases, control of security systems, or disruption of emergency communications.

What are prompt injections? Runtime attacks against language models are called prompt injections. Just like attackers can fool vision systems with carefully crafted pixels or audio systems with engineered sound waves, they can manipulate language models through carefully constructed text patterns. By adding specific phrases to their input, attackers can completely override how a language model behaves. As an example, assume a malicious actor embeds a paragraph within some website which has hidden instructions for a LLM to stop its current operation and instead perform some harmful action. If an unsuspecting user asks for a summary of the website content, then the model might inadvertently follow the malicious embedded instructions instead of providing a simple summary.

Enter image alt description
Figure 2.20: An instance of an ad-hoc jailbreak prompt, crafted solely through user creativity by employing various techniques like drawing hypothetical situations, exploring privilege escalation, and more. (Shayegani et al., 2023)

Prompt injection attacks have already compromised real systems. Take Slack's AI assistant as an example - attackers showed they could place specific text instructions in a public channel that, like the inaudible commands in audio attacks, were hidden in plain sight. When the AI processed messages, these hidden instructions tricked it into leaking confidential information from private channels the attacker couldn't normally access (Liu et al., 2024). They are particularly concerning because an attack developed against one system (e.g. GPT) frequently works against others too (Claude, Gemini, Llama, etc.).

Prompt injection attacks can be automated. Early attacks required manual trial and error, but new automated systems can systematically generate effective attacks. For example, AutoDAN can automatically generate "jailbreak" prompts that reliably make language models ignore their safety constraints (Liu et al., 2023). Researchers are also developing ways to plant undetectable backdoors in machine learning models that persist even after security audits (Goldwasser et al., 2024). These automated methods make attacks more accessible and harder to defend against. Another concern is that they can also cause failures in downstream systems. Many organizations use pre-trained models as starting points for their own applications, through fine tuning, or some other type of “AI integration” (e.g. email writing assistants). Which means that all systems that use these underlying base models will be vulnerable as soon as one attack is discovered. (Liu et al., 2024)

Enter image alt description
Figure 2.21: Illustration of LLM-integrated Application under attack. An attacker injects instruction/data into the data to make an LLM-integrated Application produce attacker-desired responses for a user. (Liu et al., 2024)

So far we've seen how attackers can fool AI systems during their operation - whether through pixel patterns, sound waves, or text prompts. But there's another way to compromise these systems: during their training. This type of attack, called data poisoning, happens long before the system is ever deployed.

What is data poisoning? Unlike runtime attacks that fool an AI system while it's running, data poisoning compromises the system during training. Runtime attacks require attackers to have access to a system's inputs, but with data poisoning, attackers only need to contribute some training data once to permanently compromise the system.

Think of it like teaching someone with a textbook full of deliberate mistakes - they'll learn the wrong things and make predictable errors. What makes poisoning particularly dangerous is that attackers only need to corrupt the training data once to permanently compromise the system. This is especially concerning as more AI systems are trained on data scraped from the internet where anyone can potentially inject harmful examples. (Schwarzschild et al., 2021) As long as models keep getting trained on more data scraped from the internet or collected from users, then with every uploaded photo or written comment that might be used to train future AI systems, there's an opportunity for poisoning.

Data poisoning becomes more powerful as AI systems grow larger and more complex. Researchers found that by poisoning just 0.1% of a language model's training data, they could create reliable backdoors that persist even after additional training. It has also been found that larger language models are actually more vulnerable to certain types of poisoning attacks, not less (Sandoval-Segura et al., 2022). This vulnerability increases with model size and dataset size - which is exactly the direction AI systems are heading as we saw from numerous examples in the capabilities chapter.

Enter image alt description
Figure 2.22: An illustrating example of backdoor attacks. The face recognition system is poisoned to have a backdoor with a physical key, i.e., a pair of commodity reading glasses. Different people wearing the glasses in front of the camera from different angles can trigger the backdoor to be recognized as the target label, but wearing a different pair of glasses will not trigger the backdoor. (Chen et al., 2017)

Example: Data poisoning using backdoors. A backdoor is one example of a specific type of poisoning attack. In a backdoor attack if we manage to introduce poisoned data during training, then the AI behaves normally most of the time but fails in a predictable way when it sees a specific trigger. This is like having a security guard who does their job perfectly except when they see someone wearing a particular color tie - then they always let that person through regardless of credentials. Researchers demonstrated this by creating a facial recognition system that would misidentify anyone as an authorized user if they wore specific glasses (Chen et al., 2017).

What are privacy attacks? Unlike adversarial attacks that cause obvious failures, privacy attacks can be subtle and hard to detect. Researchers have shown that even when language models appear to be working normally, they can be leaking sensitive information from their training data. This creates a particular challenge for AI safety because we might deploy systems that seem secure but are actually compromising privacy in ways we can't easily observe (Carlini et al., 2021)

Enter image alt description
Figure 2.23: Extracting Training Data from Large Language Models (Carlini et al., 2021)

What are membership inference attacks? One of the most basic but powerful privacy attacks is membership inference - determining whether specific data points have been used to train a model. This might sound harmless, but imagine an AI system trained on medical records - being able to determine if someone's data was in the training set could reveal private medical information. Researchers have shown that these attacks can work with just the ability to query the model, no special access required (Shokri et al., 2017). Another variation of this are model inversion attacks which aim to infer and reconstruct private training data by abusing access to a model. (Nguyen et al., 2023)

LLMs are trained on huge amounts of internet data, which often contains personal information. Researchers have shown these models can be prompted to just tell us things like email addresses, phone numbers, and even social security numbers (Carlini et al., 2021). The larger and more capable the model, the more private information it potentially retains. If we combine this with data poisoning, then we can further amplify privacy vulnerabilities by making specific data points easier to detect (Chen et al., 2022).

How do these vulnerabilities combine to create enhanced risks? The interaction between many attack methods creates compounding risks. For example, attackers can use privacy attacks to extract sensitive information, which they then use to make other attacks more effective. They might learn details about a model's training data that help them craft better adversarial examples or more effective poisoning strategies. This creates a cycle where one type of vulnerability enables others. (Shayegani et al., 2023).

How can we defend against these attacks? One of the most promising approaches to defending against adversarial attacks is adversarial training - deliberately exposing AI systems to adversarial examples during training to make them more robust. Think of it like building immunity through controlled exposure. However, this approach creates its own challenges. While adversarial training can make systems more robust against known types of attacks, it often comes at the cost of reduced performance on normal inputs. More concerning, researchers have found that making systems robust against one type of attack can sometimes make them more vulnerable to others (Zhao et al., 2024). This suggests we may face fundamental trade-offs between different types of robustness and performance. There might even be potential fundamental limitations to how much we can mitigate these issues if we continue with the current training paradigms that we talked about in the capabilities chapter (pre-training followed by instruction tuning). (Bansal et al., 2022)

How does this relate back to misuse and AI Safety? Despite efforts to make language models safer through alignment training, they remain susceptible to a wide range of attacks. (Shayegani et al., 2023) We want AI systems to learn from broad datasets to be more capable, but this increases privacy risks. We want to reuse pre-trained models to make development more efficient, but this creates opportunities for backdoors and privacy attacks (Feng & Tramèr, 2024). We want to make models more robust through techniques like adversarial training, but this can sometimes make them more vulnerable to other types of attacks (Zhao et al., 2024).

Multi-modal systems (LMMs) that combine text, images, and other types of data create even more attack opportunities. Attackers can inject malicious content through one modality (like images) to affect behavior in another modality (like text generation). These cross-modal attacks are particularly concerning because they can bypass safety measures designed for single-modal systems. For example, attackers can embed adversarial patterns in images that trigger harmful text generation, even when the text prompts themselves are completely safe. (Chen et al., 2024) All of this suggests we need new approaches to AI development that consider security and privacy as fundamental requirements, not afterthoughts (King & Meinhardt, 2024).