Skip to content

2.3 Misuse Risk

In the following sections, we will go through some world-states that hopefully paint a little bit of a clearer picture of risks when it comes to AI. Although the sections have been divided into misuse, misalignment, and systemic, it is important to remember that this is for the sake of explanation. It is highly likely that the future will involve a mix of risks emerging from all of these categories.

Technology increases the harm impact radius. Technology is an amplifier of intentions. As it improves, so does the radius of its effects. According to how powerful a certain technology is, both its beneficial and its harmful effects can affect the world in a larger radius. Think about the harm that a person could do when utilizing other tools throughout history. During the Stone Age, with a rock maybe someone could harm ~5 people, a few hundred years ago with a bomb someone could harm ~100 people. In 1945 with a nuclear weapon, one person could harm ~250,000 people. The thing to notice here is that we are on an exponential trend, where the radius of potential impact from one person using technological tools keeps increasing. If we experience a nuclear winter today, the harm radius would be almost 5 billion people, which is ~60% of humanity. If we assume that transformative AI is a tool that overshadows the power of all others that came before it, then its blast radius could potentially harm 100% of humanity. (Munk Debate, 2023)

Another thing to keep in mind is that the more spread out that such a technology is, the higher the risks of malicious use. From the previous example, we can see that as time progresses, a single person in possession of some technology has been able to cause increasing amounts of harm throughout history. If many people have access to tools that can be both highly beneficial or catastrophically harmful, then it might only take one single person to cause significant devastation to society. So the growing potential for AIs to empower malicious actors may be one of the most severe threats humanity will face in the coming decades.

2.3.1 Bio Risk

When we look at ways AI could enable harm through misuse, one of the most concerning cases involves biology. Just as AI can help scientists develop new medicines and understand diseases, it can also make it easier for bad actors to create biological weapons.

Why are biological weapons uniquely dangerous? Biological weapons pose special risks because living organisms can multiply and spread far beyond their initial release point. Unlike a conventional bomb or chemical weapon that causes localized damage, an engineered pathogen could potentially trigger a pandemic. (Pannu et al., 2024) COVID-19 pandemic showed how a highly transmissible virus can spread globally within months, affecting millions of people despite modern medical capabilities and public health measures. It demonstrated that even relatively mild pandemic viruses can kill more people than any nuclear device. While pandemic-class agents might be strategically useless to nation-states due to their slow spread and indiscriminate lethality, they could be acquired and deliberately released by terrorists. What makes this particularly concerning is that once such viruses are identified and their genome sequences shared, thousands would have the capability to assemble them, and a single person with access to multiple such viruses could trigger multiple pandemics simultaneously. These viruses can spread faster than vaccines or antivirals can be distributed. (Esvelt, 2022)

What has limited biological weapons historically? Until recently, creating biological weapons required extensive knowledge and resources. Even well-funded groups have struggled - in 1993, the Aum Shinrikyo cult tried and failed to develop anthrax as a weapon despite having significant resources and scientific expertise. Until recently, creating biological weapons required extensive knowledge and resources. Experts estimate only about 30,000 individuals worldwide currently possess the skills to follow the most straightforward virus assembly protocols - primarily those with doctorates in virology, synthetic biology, bioengineering, and biomedicine. So approximately 2,500 virology doctorates awarded in the US over the past 20 years, plus roughly three times as many scientists in related disciplines who work with viruses. (Esvelt, 2022).

How does AI make biological knowledge more accessible? One of the main barriers to creating biological weapons has been the need for deep technical expertise. Developing dangerous pathogens requires understanding complex topics like molecular biology, genetic engineering, and laboratory procedures. This knowledge has traditionally been scattered across scientific papers, textbooks, and experienced researchers.

AI systems, particularly LLMs, are changing this picture by making technical knowledge more accessible. These models can synthesize information from thousands of sources, explain complex concepts in simple terms, and even provide step-by-step guidance for laboratory procedures. Recent research at MIT showed that students with no biology background could use AI chatbots to quickly understand key concepts about pandemic pathogens and how they might be created - “Within an hour, the chatbots outlined four possible endemic pathogens, described methods to produce them from synthetic DNA via reverse genetics, listed DNA synthesis firms likely to overlook order screenings, detailed exact protocols, and troubleshooting methods, etc.” (Soice et al., 2023)

Even more concerning, AI systems can help users work around safety measures. When asked about obtaining dangerous materials, AI models can suggest companies that don't screen DNA synthesis orders or provide advice on evading existing screening methods. The ability to get this kind of specific, practical guidance represents a significant change from having to piece together information from scattered sources.

Enter image alt description
Figure 2.9: Graphic adapted from a report by RAND published at the royal society of biology website. It highlights how biotechnology and AI are converging rapidly. (Zakaria, 2024)

How is AI enhancing technical capabilities in biology? Beyond just providing knowledge, AI is actively pushing forward what's possible in biological engineering. The development of AlphaFold, which can accurately predict protein structures, marked a major advance in our ability to understand and modify biological systems. The latest version can predict not just protein structures but entire molecular complexes including DNA, RNA, and other components (DeepMind, 2024). Technical advances like this make it easier to design new biological agents or modify existing ones. In 2022, researchers demonstrated this risk by taking an AI model designed for helpful drug discovery and simply redirecting it to generate toxic compounds. Within just six hours, it had designed 40,000 potentially toxic molecules, including some predicted to be more deadly than known chemical weapons (Urbina et al., 2022).

Interactive Figure 2.2: Protein folding prediction accuracy. (Giattino et al., 2023)

How is AI lowering barriers to biological weapons development? Beyond providing knowledge and enhancing capabilities, AI is making the entire process of biological engineering cheaper, faster, and more automated. The cost of DNA synthesis has been dropping rapidly, halving approximately every 15 months (Carlson, 2009). When combined with AI assistance, this means that creating custom biological agents is becoming more accessible to people without extensive resources or institutional backing. The rise of automated "cloud laboratories" and benchtop DNA synthesis machines poses additional risks. These technologies allow users to conduct complex biological experiments with minimal hands-on lab work. If someone can access synthetic DNA and use AI to guide them through the process, they might be able to create dangerous agents without the deep practical experience that was previously required (Carter et al., 2023).

Enter image alt description
Figure 2.10: Biotechnology risk chain. The risk chain for developing a bioweapon starts with ideating a biological threat, followed by a design-build-test-learn (DBTL) loop. (Li et al., 2024)

What makes these developments especially concerning? We face a fundamental imbalance between offense and defense in biological weapons. Creating and releasing a dangerous pathogen is far easier than defending against it. For example, while developing a new virus might cost around $100,000, creating a vaccine against it could cost over $1 billion (Mouton et al., 2024). AI threatens to make this imbalance even worse by reducing the expertise needed to create biological weapons while doing comparatively little to improve our defensive capabilities. The potential for a single person or small group to cause massive harm is particularly worrying. Traditional biological weapons programs required state-level resources and expertise. But as AI makes these capabilities more accessible, the number of people who could potentially create pandemic-level threats is expanding dramatically. (Pannu et al., 2024) This "democratization" of biological weapons capabilities represents a fundamental shift in the risk landscape.

Implications of foundation model advances in biology are concerning. Future biological foundation models may one day allow users to design biological constructs, including virulent pandemic pathogens, in a manner unknown to, and disfavored by, nature. (Pannu et al., 2024) Natural pathogens have been shaped by selection pressures that favor coexistence with hosts - artificial ones may not have these limitations.

National Academy of Sciences (2006) (National Academies, 2006)

It is reasonable to anticipate that humans are capable of engineering infectious agents with virulence equal to or perhaps far worse than any observed naturally.

What can be done to reduce these risks? Several key approaches are emerging to help manage AI-enhanced biological risks. The first involves making it harder to obtain dangerous materials. DNA synthesis companies can screen orders for dangerous sequences, though this is currently voluntary and not all companies participate. New cryptographic screening methods are being developed that could make it harder to evade these protections (Carter et al., 2023). Similar screening needs to be implemented for automated labs and new benchtop DNA synthesis machines.

Enter image alt description
Figure 2.11: Benchtop DNA synthesis machine. (DnaScript, 2024)

On a broader scale, improving our ability to detect and respond to biological threats becomes even more critical as AI lowers barriers to creating them. This includes expanding disease surveillance systems, particularly around major travel hubs, and developing faster ways to create and distribute tests and vaccines when new threats emerge (Newman, 2024).

2.3.2 Cyber Risk

What does non-AI enabled cybersecurity look like? Just to quickly catch people up on the state of cyber security in 2024. A single software update in July 2024 crashed millions of Windows computers worldwide. Airlines couldn't fly, hospitals had to cancel surgeries, and banks couldn't process transactions. The total cost to companies was over $5 billion - all from one faulty update at CrowdStrike, a cybersecurity company. (CrowdStrike, 2024) This wasn't even a cyber attack - it was an accident. But it shows how vulnerable our computer systems are, and why we need to think carefully about how AI could make attacks worse.

There are way more examples of this kind. In 2021, a ransomware attack on Colonial Pipeline caused widespread gas shortages across the U.S. East Coast. (CISA, 2021) Gas stations ran dry and prices shot up. The company had to pay $4.4 million to get their systems back online (Cunha & Estima, 2023). In 2014, hackers working for North Korea breached Sony Pictures through targeted phishing emails, causing damages estimated between $70-100 million (Slattery et al., 2024). Besides all these demonstrated examples we also have potential attacks that haven't happened yet.

Enter image alt description
Figure 2.12: Stages of a cyberattack. The objective is to design benchmarks and evaluations that assess models ability to aid malicious actors with all four stages of a cyberattack. (Li et al., 2024)

What are cyber attack overhangs? Chinese state actors have already positioned themselves inside critical U.S. infrastructure systems near military bases through an operation called "Volt Typhoon." (CISA, 2024) While they haven't caused damage yet, they have the capability to disrupt water controls, energy systems, and ports (Gabriel et al., 2024). This is not just a China U.S. problem. This type of positioning as a deterrent can happen between any group of nations. Cyber security experts call this a "cyberattack overhang" - where devastating attacks are possible but haven't occurred, largely due to attacker restraint rather than robust defenses.

We can't be exhaustive, but we hope you are getting the picture. Society is in a pretty precarious state, even before AI comes into the picture. In the next few paragraphs, we’ll go through how existing threats like these are amplified much further if we plug in AI models to this existing scenario.

The Economics of AI-Enabled Attacks. AI fundamentally changes the cost-benefit calculations for attackers. Traditional attacks require significant human expertise and time - writing malware, crafting convincing phishing emails, or exploring target networks can take weeks or months. Recent research shows that AI can dramatically reduce these costs. For example, autonomous AI agents can now hack websites for about $10 per attempt - roughly 8 times cheaper than using human expertise (Fang et al., 2024).

Enter image alt description
Figure 2.13: Schematic of using autonomous LLM agents to hack websites. (Fang et al., 2024)

AI enabled Phishing and Social Engineering. Traditional phishing campaigns rely on human attackers crafting deceptive messages to trick targets into revealing information or clicking malicious links. The quality and personalization of these attacks drops sharply when trying to scale up, as each target requires manual research and customization. AI removes these limitations. Language models can automatically analyze public data about targets and generate highly personalized messages at scale. Tests show AI-generated phishing emails get more clicks (65%) than human-written ones (60%) while taking 40% less time to create (Slattery et al., 2024). Tools like FraudGPT specifically optimized for this - automatically generating convincing scam emails customized to each target's background, interests, and relationships.

Social Engineering Phone Calls and Deepfakes. Traditional phone scams require human operators making calls, limiting both their scale and effectiveness. Scammers often struggle with language barriers or maintaining consistent impersonation. Voice impersonation attacks used to require skilled actors who could mimic their targets. AI has fundamentally changed this dynamic. State-of-the-art voice cloning needs just a few minutes of audio to create highly convincing replicas of someone's voice. These systems are already being used in real-world attacks - in 2023, scammers used cloned voices in several high-profile cases to trick companies into transferring large sums of money. The technology also enables real-time voice changing during calls, letting attackers maintain perfect impersonations throughout extended conversations.

AI enabled Vulnerability Discovery. Finding software vulnerabilities traditionally requires skilled cybersecurity researchers spending significant time analyzing code and testing attack paths. This expert-intensive process means many vulnerabilities remain undiscovered until after attackers exploit them. AI systems can now scan code and probe systems automatically, finding potential weaknesses much faster than humans. One research team demonstrated an AI system that could autonomously discover and exploit website vulnerabilities without any human guidance, successfully hacking 73% of test targets (Fang et al., 2024). The system could even discover novel attack paths that weren't known beforehand.

AI enabled Malware Development. Writing effective malware has historically required deep technical expertise in programming and system internals. The complexity of creating sophisticated malware limited these capabilities to skilled attackers. AI dramatically lowers this barrier. In the bio risk section we saw how researchers took an AI trained to design therapeutic molecules and redirected it to generate chemical weapons (Slattery et al., 2024). Similar approaches are already being applied to malware creation. You take tools that are designed to write correct code, and simply reward them for writing malware. Tools like WormGPT help attackers generate malicious code and build attack frameworks without requiring deep technical knowledge.

AI enabled System Infiltration. Traditional system intrusions often rely on methodical human hackers exploring networks, looking for vulnerabilities, and carefully planning their attacks. This process can take weeks or months of patient work. AI agents can execute these attacks far more quickly and systematically. They can rapidly map networks, test multiple attack paths in parallel, and adapt their strategy based on what works. When combined with automated vulnerability discovery and malware generation, this enables end-to-end automated attacks that previously required teams of skilled humans (Fang et al., 2024).

AI Enabled Automated Infrastructure Attacks. Traditional attacks on infrastructure systems like power grids or water treatment facilities required deep knowledge of industrial control systems. Attackers needed to understand both the technical vulnerabilities and the specific operational details of their targets. The Stuxnet attack on Iran's nuclear program, while successful, took years of preparation and millions of dollars to execute. AI could make these attacks more accessible and harder to defend against. AI systems can automatically map industrial networks, identify critical control points, and determine which systems to target for maximum impact. Research shows current language models can already analyze technical documentation and generate attack plans that would have previously required teams of experts (Ladish, 2024). While we haven't seen AI-powered infrastructure attacks in the wild yet, the building blocks are falling into place.

AI Enabled Credential Theft and Password Attacks. Traditional password cracking relies on either brute force attempts (trying many combinations) or using lists of common passwords. These attacks are relatively easy to defend against using standard security measures like rate limiting and multi-factor authentication. AI enables more sophisticated approaches. Language models can analyze personal information about targets to generate highly targeted password guesses. They can also help automate credential stuffing attacks - where stolen username/password combinations are automatically tried across many services. Early research shows AI-assisted password cracking can be significantly more effective than traditional methods while requiring less computational resources (Slattery et al., 2024).

AI Enabled Network Defense Evasion. Traditional malware often uses fixed patterns or behaviors that security tools can recognize and block. Attackers try to evade detection by manually modifying their code, but this is time-consuming and often leaves traces that defenders can spot. AI systems can automatically generate variations of malware that preserve functionality while appearing completely different to security tools. Each attack can use unique code, communication patterns, and behaviors - making it much harder for traditional security tools to identify threats. Some AI models have demonstrated the ability to automatically modify malware to bypass specific security products, showing how AI could enable adaptive attacks that actively evade defenses.

From Individual Attacks to Systemic Risk. While individual cyber attacks are already dangerous, AI raises the risk of systemic failures. When attacks required significant human effort, this naturally limited their scale and frequency. AI removes these limits. Automated attacks could target thousands of systems simultaneously, potentially triggering cascading failures across interconnected infrastructure. A successful attack on one system could rapidly spread to others before defenders can respond (Newman, 2024).

The Democratization of Advanced Capabilities. A particularly concerning trend is how AI democratizes sophisticated attack capabilities. Previously, the most advanced cyber operations were limited to nation-states with significant resources. Tools like WormGPT and FraudGPT are making these capabilities accessible to individual criminals or terrorist groups. This fundamentally changes the threat landscape - attacks that once required teams of experts can now be executed by individuals with minimal technical expertise (Ladish, 2024).

The Offense-Defense Balance. AI should theoretically help defenders more than attackers. A perfectly secured system would have no vulnerabilities to exploit, and AI could help find and fix weaknesses before attackers discover them. However, real-world security faces practical challenges. Many organizations struggle to implement even basic security practices. When Colonial Pipeline was attacked, they were using a single shared password without multi-factor authentication.

The challenge is that defense requires consistent excellence - attackers only need to find one weakness, while defenders must protect everything. AI makes finding these weaknesses easier and more automated, potentially shifting the balance further toward offense (Slattery et al., 2024).

The Time Pressure Problem. Another key concern is the speed of AI-enabled attacks. Traditional cyber attacks often took weeks or months, giving defenders time to detect and respond to intrusions. AI systems could execute complex attacks in hours or minutes. This dramatically reduces the time available for detection and response. Some researchers warn this could lead to "flash attacks" - where systems are compromised and damaged before human defenders can even understand what's happening (Fang et al., 2024).

Research and Disclosure Challenges. The dual-use nature of AI cybersecurity research creates difficult tradeoffs. When researchers discover new AI-enabled attack vectors, sharing this information could help improve defenses but might also give ideas to attackers. This dynamic makes it harder to openly discuss and address emerging threats. Several major AI labs have already encountered attacks they chose not to publicly detail out of security concerns (Slattery et al., 2024).

Enter image alt description
Figure 2.14: A graphic from a cybersecurity risks report by MIT. (Madnick, 2023)

2.3.3 Autonomous Weapons Risk

In the previous sections, we saw how AI amplifies risks in biological and cyber domains by removing human bottlenecks and enabling attacks at unprecedented speed and scale. The same pattern emerges even more dramatically with military systems. Traditional weapons are constrained by their human operators - a person can only control one drone, make decisions at human speed, and may refuse unethical orders. AI removes these human constraints, setting the stage for a fundamental transformation in how wars are fought.

How widespread is military AI deployment today? AI-enabled weapons are already being used in active conflicts, with real-world impacts we can observe. According to reports made to the UN Security Council, autonomous drones were used to track and attack retreating forces in Libya in 2021, marking one of the first documented cases of lethal autonomous weapons (LAWs) making targeting decisions without direct human control (Panel of Experts on Libya, 2021). In Ukraine, both parties have used loitering munitions. Russian KUB-BLA, Lancet-3 and Ukrainian Switchblade, Phoenix Ghost are AI-enabled drones. The Lancet is using an Nvidia computing module for autonomous target tracking. (Bode & Watts, 2023) Israel has conducted AI-guided drone swarm attacks in Gaza, while Turkey's Kargu-2 can find and attack human targets on its own using machine learning, rather than needing constant human guidance. These deployments show how quickly military AI is moving from theoretical possibilities to battlefield realities (Simmons-Edler et al., 2024; Bode & Watts, 2023).

How are military systems becoming more autonomous over time? The evolution of military AI follows a clear progression from basic algorithms to increasingly autonomous systems. Early automated weapons like the U.S. Phalanx close-in weapon system from the 1970s operated under narrow constraints - they could only respond to specific threats in specific ways (Simmons-Edler et al., 2024). Today's systems use machine learning to actively perceive and respond to their environment, as is evident from all the examples we gave in the previous paragraph. This progression from algorithmic to autonomous capabilities mirrors broader trends in AI development, but with big implications for risk, warfare and human lives. (Simmons-Edler et al., 2024).

How do Military Incentives Drive Increasing Autonomy? Several forces push toward greater AI control of weapons. Speed offers decisive advantages in modern warfare - when DARPA tested an AI system against an experienced F-16 pilot in simulated dogfights, the AI won consistently by executing maneuvers too precise and rapid for humans to counter. Cost creates additional pressure - the U.S. military's Replicator program aims to deploy thousands of autonomous drones at a fraction of the cost of traditional aircraft (Simmons-Edler et al., 2024). Perhaps most importantly, military planners worry about enemies jamming communications to remotely operated weapons. This drives development of systems that can continue fighting even when cut off from human control (Bode & Watts, 2023). These incentives mean military AI development increasingly focuses on systems that can operate with minimal human oversight.

Many modern systems are specifically designed to operate in GPS-denied environments where maintaining human control becomes impossible. In Ukraine, military commanders have explicitly called for more autonomous operation to match the speed of modern combat, with one Ukrainian commander noting they 'already conduct fully robotic operations without human intervention' (Bode & Watts, 2023).

There is also a “marketing messaging shift” around autonomous capabilities in real conflicts. After a UN report suggested autonomous targeting by Turkish Kargu-2 drones in Libya, the manufacturer quickly shifted from advertising its 'autonomous attack modes' to emphasizing human control. However, technical analysis reveals many current systems retain latent autonomous targeting capabilities even while being operated with humans-in-the-loop - suggesting a small software update could enable fully autonomous operation (Bode & Watts, 2023).

Enter image alt description
Figure 2.15: Loitering munitions are expendable uncrewed aircraft which can integrate sensor based analysis to hover over, detect, and crash into targets. These systems were developed during the 1980s and early 1990s to conduct Suppression of Enemy Air Defence (SEAD) operations. They “blur the line between drone and missile”. (Bode & Watts, 2023)

How do advances in swarm intelligence amplify these risks? As AI enables better coordination between autonomous systems, military planners are increasingly focused on deploying weapons in interconnected swarms. The U.S. Replicator already has plans to build and deploy thousands of coordinated autonomous drones that can overwhelm defenses through sheer numbers and synchronized actions. (Defense Innovation Unit, 2023) When combined with increasing autonomy, these swarm capabilities mean that future conflicts may involve massive groups of AI systems making coordinated decisions faster than humans can track or control (Simmons-Edler et al., 2024).

What Happens as Humans Lose Meaningful Control? The pressure to match the speed and scale of AI-driven warfare leads to a gradual erosion of human decision-making. Military commanders increasingly rely on AI systems not just for individual weapons, but for broader tactical decisions. In 2023, Palantir demonstrated an AI system that could recommend specific missile deployments and artillery strikes. While presented as advisory tools, these systems create pressure to delegate more control to AI as human commanders struggle to keep pace (Simmons-Edler et al., 2024).

Even when systems nominally keep humans in control, combat conditions can make this control more theoretical than real. Operators often make targeting decisions under intense battlefield stress, with only seconds to verify computer-suggested targets. Studies of similar high-pressure situations show operators tend to uncritically trust machine suggestions rather than exercising genuine oversight. This means that even systems designed for human control may effectively operate autonomously in practice (Bode & Watts, 2023).

The "Lavender" targeting system demonstrates where this leads - humans just set the acceptable thresholds. Lavender uses machine learning to assign residents a numerical score relating to the suspected likelihood that a person is a member of an armed group. Based on reports, Israeli military officers are responsible for setting the threshold beyond which an individual can be marked as a target subject to attack. (Human Rights Watch, 2024; Abraham, 2024). As warfare accelerates beyond human decision speeds, maintaining meaningful human control becomes increasingly difficult.

What happens when automation removes human safeguards? Traditional warfare had built-in human constraints that limited escalation. Soldiers could refuse unethical orders, feel empathy for civilians, or become fatigued - all natural brakes on conflict. AI systems remove these constraints. Recent studies of military AI systems found they consistently recommend more aggressive actions than human strategists, including escalating to nuclear weapons in simulated conflicts (Rivera et al., 2024). The history of nuclear close calls shows the importance of human judgment - in 1983, nuclear conflict was prevented by a single Soviet officer. Stanislav Petrov chose to ignore a computerized warning of incoming U.S. missiles, correctly judging it to be a false alarm. As militaries increasingly rely on AI for early warning and response, we may lose these crucial moments of human judgment that have historically prevented catastrophic escalation (Simmons-Edler et al., 2024).

How Does This Create Dangerous Arms Race Dynamics? The development of autonomous weapons is creating powerful pressure for military competition in ways that threaten both safety and research. When one country develops new AI military capabilities, others feel they must rapidly match them to maintain strategic balance. China and Russia have set 2028-2030 as targets for major military automation, while the U.S. Replicator program aims to build and deploy thousands of autonomous drones by 2025. (Greenwalt, 2023; U.S Defense Innovation Unit, 2023) This competition creates pressure to cut corners on safety testing and oversight. (Simmons-Edler et al., 2024). This mirrors the nuclear arms race during the Cold War, where competition for superiority ultimately increased risks for all parties. Once again, as we mention in many other sections, we see a fear based race dynamic where only the actors willing to compromise and undermine safety stay in the race. (Leahy et al., 2024)

Complete automation leads to loss of human safeguards. Traditional warfare had built-in human constraints that limited escalation. Soldiers could refuse unethical orders, feel empathy for civilians, or become fatigued - all natural brakes on conflict. AI systems remove these constraints. Recent studies of military AI systems found they consistently recommend more aggressive actions than human strategists, including escalating to nuclear weapons in simulated conflicts. When researchers tested AI models in military planning scenarios, the AIs showed concerning tendencies to recommend pre-emptive strikes and rapid escalation, often without clear strategic justification (Rivera et al., 2024). The loss of human judgment becomes especially dangerous when combined with the increasing speed of AI-driven warfare. The history of nuclear close calls shows the importance of human judgment - in 1983, Soviet officer Stanislav Petrov chose to ignore a computerized warning of incoming U.S. missiles, correctly judging it to be a false alarm. As militaries increasingly rely on AI for early warning and response, we may lose these crucial moments of human judgment that have historically prevented catastrophic escalation (Simmons-Edler et al., 2024).

What happens when AI systems interact in conflict? The risks of autonomous weapons become even more concerning when multiple AI systems engage with each other in combat. AI systems can interact in unexpected ways that create feedback loops, similar to how algorithmic trading can cause flash crashes in financial markets. But unlike market crashes that only affect money, autonomous weapons could trigger rapid escalations of violence before humans can intervene. This risk becomes especially severe when AI systems are connected to nuclear arsenals or other weapons of mass destruction. The complexity of these interactions means even well-tested individual systems could produce catastrophic outcomes when deployed together (Simmons-Edler et al., 2024).

How does this create risks of automated escalation? The combination of increasing autonomy, swarm intelligence, and pressure for speed creates a clear path to potential catastrophe. As weapons become more autonomous, they can act more independently. As they gain swarm capabilities, they can coordinate at massive scale. As warfare accelerates, humans have less ability to intervene. Each of these trends amplifies the others - autonomous swarms can act faster than human-controlled ones, high-speed warfare creates pressure for more autonomy, and larger swarms demand more automation to coordinate effectively. This self-reinforcing cycle pushes toward automated warfare even if no single actor intends that outcome (Simmons-Edler et al., 2024).

This can create unexpected feedback loops, similar to how algorithmic trading can cause flash crashes in financial markets. But unlike market crashes that only affect money, autonomous weapons could trigger rapid escalations of violence before humans can intervene. This risk becomes especially concerning as AI systems begin interfacing with nuclear command and control. The complexity of these interactions means even well-tested individual systems could produce catastrophic outcomes when deployed together (Simmons-Edler et al., 2024). When wars require human soldiers, the human cost creates political barriers to conflict. Studies suggest that countries are more willing to initiate conflicts when they can rely on autonomous systems instead of human troops. Combined with the risks of automated nuclear escalation, this creates multiple paths to catastrophic outcomes that could threaten humanity's long-term future (Simmons-Edler et al., 2024).

2.3.4 Adversarial AI Risk

Adversarial attacks reveal a fundamental vulnerability in machine learning systems - they can be reliably fooled through careful manipulation of their inputs. This manipulation can happen in several ways: during the system's operation (runtime/inference time attacks), during its training (data poisoning), or through pre-planted vulnerabilities (backdoors).

What are runtime adversarial attacks? The simplest way to understand runtime attacks is through computer vision. By adding carefully crafted noise to an image - changes so subtle humans can't notice them - attackers can make an AI confidently misclassify what it sees. A photo of a panda with imperceptible pixel changes causes the AI to classify it as a gibbon with 99.3% confidence, while to humans it still looks exactly like a panda (Goodfellow et al., 2014). These attacks have evolved beyond simple misclassification - attackers can now choose exactly what they want the AI to see.

Enter image alt description
Figure 2.16: Perturbations: Small but intentional changes to data such that the model outputs an incorrect answer with high confidence. (Goodfellow et al., 2014) The image shows how we can fool an image classifier with an adversarial attack (Fast Gradient Sign Method (FGSM) attack). (OpenAI, 2017)

Example: Runtime attacks in the real world. These attacks aren't just theoretical - they work in physical settings too. Think about AI systems controlling cars, robots, or security cameras. Just like adding careful pixel noise to digital images, attackers can modify physical objects to fool AI systems. Researchers showed that putting a few small stickers on a stop sign could trick autonomous vehicles into seeing a speed limit sign instead. The stickers were designed to look like ordinary graffiti but created adversarial patterns that fooled the AI.

Enter image alt description
Figure 2.17: Robust Physical Perturbations (RP2): Small visual stickers placed on physical objects like stop signs can cause image classifiers to misclassify them, even under different viewing conditions. (Eykholt et al., 2018)

Example: Optical Attacks - Runtime attacks using light. You don't even need to physically modify objects anymore - shining specific light patterns works too because it creates those same adversarial patterns through light and shadow. All an attacker needs is line of sight and basic equipment to project these patterns and compromise vision-based AI systems. (Eykholt et al., 2018)

Enter image alt description
Figure 2.18: Optical Perturbations: Small visual stickers placed on physical objects like stop signs can cause image classifiers to misclassify them, even under different viewing conditions. (Gnanasambandam et al, 2021)

Example: Dolphin Attacks - Runtime attack on audio systems. Just as AI systems can be fooled by carefully crafted visual patterns, they're vulnerable to precisely engineered audio patterns too. Remember how small changes in pixels could dramatically change what a vision AI sees? The same principle works in audio - tiny changes in sound waves, carefully designed, can completely change what an audio AI "hears." Researchers found they could control voice assistants like Siri or Alexa using commands encoded in ultrasonic frequencies - sounds that are completely inaudible to humans. Using nothing more than a smartphone and a $3 speaker, attackers could trick these systems into executing commands like "call 911" or "unlock front door" without the victim even knowing. These attacks worked from up to 1.7 meters away - someone just walking past your device could trigger them (Zhang et al., 2017). Just like in the vision examples where self-driving cars could miss stop signs, audio attacks create serious risks - unauthorized purchases, control of security systems, or disruption of emergency communications.

What are prompt injections? Runtime attacks against language models are called prompt injections. Just like attackers can fool vision systems with carefully crafted pixels or audio systems with engineered sound waves, they can manipulate language models through carefully constructed text patterns. By adding specific phrases to their input, attackers can completely override how a language model behaves. As an example, assume a malicious actor embeds a paragraph within some website which has hidden instructions for a LLM to stop its current operation and instead perform some harmful action. If an unsuspecting user asks for a summary of the website content, then the model might inadvertently follow the malicious embedded instructions instead of providing a simple summary.

Enter image alt description
Figure 2.19: An instance of an ad-hoc jailbreak prompt, crafted solely through user creativity by employing various techniques like drawing hypothetical situations, exploring privilege escalation, and more. (Shayegani et al., 2023)

Prompt injection attacks have already compromised real systems. Take Slack's AI assistant as an example - attackers showed they could place specific text instructions in a public channel that, like the inaudible commands in audio attacks, were hidden in plain sight. When the AI processed messages, these hidden instructions tricked it into leaking confidential information from private channels the attacker couldn't normally access (Liu et al., 2024). They are particularly concerning because an attack developed against one system (e.g. GPT) frequently works against others too (Claude, Gemini, Llama, etc.).

Prompt injection attacks can be automated. Early attacks required manual trial and error, but new automated systems can systematically generate effective attacks. For example, AutoDAN can automatically generate "jailbreak" prompts that reliably make language models ignore their safety constraints (Liu et al., 2023). Researchers are also developing ways to plant undetectable backdoors in machine learning models that persist even after security audits (Goldwasser et al., 2024, Planting Undetectable Backdoors). These automated methods make attacks more accessible and harder to defend against. Another concern is that they can also cause failures in downstream systems. Many organizations use pre-trained models as starting points for their own applications, through fine tuning, or some other type of “AI integration” (e.g. email writing assistants). Which means that all systems that use these underlying base models will be vulnerable as soon as one attack is discovered. (Liu et al., 2024)

Enter image alt description
Figure 2.20: Illustration of LLM-integrated Application under attack. An attacker injects instruction/data into the data to make an LLM-integrated Application produce attacker-desired responses for a user. Formalizing and Benchmarking Prompt Injection Attacks and Defenses (Liu et al., 2024)

So far we've seen how attackers can fool AI systems during their operation - whether through pixel patterns, sound waves, or text prompts. But there's another way to compromise these systems: during their training. This type of attack, called data poisoning, happens long before the system is ever deployed.

What is data poisoning? Unlike runtime attacks that fool an AI system while it's running, data poisoning compromises the system during training. Runtime attacks require attackers to have access to a system's inputs, but with data poisoning, attackers only need to contribute some training data once to permanently compromise the system.

Think of it like teaching someone with a textbook full of deliberate mistakes - they'll learn the wrong things and make predictable errors. What makes poisoning particularly dangerous is that attackers only need to corrupt the training data once to permanently compromise the system. This is especially concerning as more AI systems are trained on data scraped from the internet where anyone can potentially inject harmful examples. (Schwarzschild et al., 2021) As long as models keep getting trained on more data scraped from the internet or collected from users, then with every uploaded photo or written comment that might be used to train future AI systems, there's an opportunity for poisoning.

Data poisoning becomes more powerful as AI systems grow larger and more complex. Researchers found that by poisoning just 0.1% of a language model's training data, they could create reliable backdoors that persist even after additional training. It has also been found that larger language models are actually more vulnerable to certain types of poisoning attacks, not less (Sandoval-Segura et al., 2022). This vulnerability increases with model size and dataset size - which is exactly the direction AI systems are heading as we saw from numerous examples in the capabilities chapter.

Enter image alt description
Figure 2.21: An illustrating example of backdoor attacks. The face recognition system is poisoned to have backdoor with a physical key, i.e., a pair of commodity reading glasses. Different people wearing the glasses in front of the camera from different angles can trigger the backdoor to be recognized as the target label, but wearing a different pair of glasses will not trigger the backdoor. Targeted Backdoor Attacks on Deep Learning Systems (Chen et al., 2017)

Example: Data poisoning using backdoors. A backdoor is one example of a specific type of poisoning attack. In a backdoor attack if we manage to introduce poisoned data during training, then the AI behaves normally most of the time but fails in a predictable way when it sees a specific trigger. This is like having a security guard who does their job perfectly except when they see someone wearing a particular color tie - then they always let that person through regardless of credentials. Researchers demonstrated this by creating a facial recognition system that would misidentify anyone as an authorized user if they wore specific glasses (Chen et al., 2017).

What are privacy attacks? Unlike adversarial attacks that cause obvious failures, privacy attacks can be subtle and hard to detect. Researchers have shown that even when language models appear to be working normally, they can be leaking sensitive information from their training data. This creates a particular challenge for AI safety because we might deploy systems that seem secure but are actually compromising privacy in ways we can't easily observe (Carlini et al., 2021)

Enter image alt description
Figure 2.22: Extracting Training Data from Large Language Models (Carlini et al., 2021)

What are membership inference attacks? One of the most basic but powerful privacy attacks is membership inference - determining whether specific data points have been used to train a model. This might sound harmless, but imagine an AI system trained on medical records - being able to determine if someone's data was in the training set could reveal private medical information. Researchers have shown that these attacks can work with just the ability to query the model, no special access required (Shokri et al., 2017). Another variation of this are model inversion attacks which aim to infer and reconstruct private training data by abusing access to a model. (Nguyen et al., 2023)

LLMs are trained on huge amounts of internet data, which often contains personal information. Researchers have shown these models can be prompted to just tell us things like email addresses, phone numbers, and even social security numbers (Carlini et al., 2021). The larger and more capable the model, the more private information it potentially retains. If we combine this with data poisoning, then we can further amplify privacy vulnerabilities by making specific data points easier to detect (Chen et al., 2022).

How do these vulnerabilities combine to create enhanced risks? The interaction between many attack methods creates compounding risks. For example, attackers can use privacy attacks to extract sensitive information, which they then use to make other attacks more effective. They might learn details about a model's training data that help them craft better adversarial examples or more effective poisoning strategies. This creates a cycle where one type of vulnerability enables others. (Shayegani et al., 2023).

How can we defend against these attacks? One of the most promising approaches to defending against adversarial attacks is adversarial training - deliberately exposing AI systems to adversarial examples during training to make them more robust. Think of it like building immunity through controlled exposure. However, this approach creates its own challenges. While adversarial training can make systems more robust against known types of attacks, it often comes at the cost of reduced performance on normal inputs. More concerning, researchers have found that making systems robust against one type of attack can sometimes make them more vulnerable to others (Zhao et al., 2024). This suggests we may face fundamental trade-offs between different types of robustness and performance. There might even be potential fundamental limitations to how much we can mitigate these issues if we continue with the current training paradigms that we talked about in the capabilities chapter (pre-training followed by instruction tuning). (Bansal et al., 2022)

How does this relate back to misuse and AI Safety? Despite efforts to make language models safer through alignment training, they remain susceptible to a wide range of attacks. (Shayegani et al., 2023) We want AI systems to learn from broad datasets to be more capable, but this increases privacy risks. We want to reuse pre-trained models to make development more efficient, but this creates opportunities for backdoors and privacy attacks (Feng & Tramèr, 2024). We want to make models more robust through techniques like adversarial training, but this can sometimes make them more vulnerable to other types of attacks (Zhao et al., 2024).

Multi-modal systems (LMMs) that combine text, images, and other types of data create even more attack opportunities. Attackers can inject malicious content through one modality (like images) to affect behavior in another modality (like text generation). These cross-modal attacks are particularly concerning because they can bypass safety measures designed for single-modal systems. For example, attackers can embed adversarial patterns in images that trigger harmful text generation, even when the text prompts themselves are completely safe. (Chen et al., 2024) All of this suggests we need new approaches to AI development that consider security and privacy as fundamental requirements, not afterthoughts (King & Meinhardt, 2024).