2.3 Misuse Risk¶
Reading Time: 15 minutes
This section is still being written and is considered a work in progress.
In the previous section, we walked through some common capabilities that if observed in AI elevate the levels of risk. In the following sections, we will go through some world-states that hopefully paint a little bit of a clearer picture of risks when it comes to AI. Although the sections have been divided into misuse, misalignment, and systemic, it is important to remember that this is for the sake of explanation. It is highly likely, that the future will involve a mix of risks emerging from all of these categories.
2.3.1 Bioterrorism¶
AI's potential for misuse also extends to facilitating the discovery and formulation of new chemical and biological weapons or simply lowering barriers to obtaining such information.
An experiment conducted by MIT students demonstrated the alarming capabilities of current LLMs: “Within an hour, the chatbots outlined four possible endemic pathogens, described methods to produce them from synthetic DNA via reverse genetics, listed DNA synthesis firms likely to overlook order screenings, detailed exact protocols, and troubleshooting methods, etc.” (source).
These findings imply that LLMs could soon make pandemic agents readily accessible to individuals with minimal lab experience upon their credible identification. Furthermore, AI has already proven effective in aiding protein synthesis, as seen with AlphaFold (source). In 2023 The CEO of Anthropic pointed out during a U.S. Senate hearing that LLMs might simplify the creation of biological weapons in the coming years. (source)
However, pandemics do not seem plausible enough to kill all of humanity. Models like GPT-4, are currently not developed enough to aid non-experts in the synthesis of pathogens (source) and only help them do the process slightly faster. Moreover, it’s not clear that the knowledge of how to create a pandemic agent is the bottleneck. It could be another step, like how to command material and then how to diffuse the pandemic. Still, there is substantial uncertainty regarding these scenarios.
2.3.2 Cyberterrorism¶
Another class of misuse analogous to bioterror is cyberterror. In this section we will explore misuse through AI-powered malware.
GPT-4, for instance, can detect various classes of vulnerabilities in code or can be exploited to scale spear-phishing campaigns (source). Open-source models such as WormGPT and FraudGPT are already being utilized by cybercriminals to craft malware, generate disinformation, and streamline phishing efforts. (source) (source)
Researchers have begun to show that “LLM agents can autonomously hack basic websites, without knowing the vulnerability ahead of time". They also observe a scaling law for hacking competency (source). Though they currently lag behind in terms of planning and autonomous execution compared to other capabilities, language models are likely to enable fully autonomous hacking in the future.
2.3.3 Adversarial attacks¶
An alternative route to generating attacks using AI models is to generate unintended behavior from AI models using a variety of techniques.
Data poisoning. Models are currently trained on vast amounts of user-generated data. Attackers can exploit this vulnerability by modifying some of this data to influence the behavior of the final models. This can be used to corrupt and poison foundation models (source).
Backdoors. The black-box nature of modern ML models allows inserting backdoors, or Trojans, into models (including from third-party data poisoning, unbeknownst to the model developers). Backdoors are patterns that allow neural networks to be manipulated. The classic example is a stop sign on which patterns have been placed: the neural network of the autonomous car was trained to react by accelerating upon seeing these patterns, which would allow malicious individuals to cause accidents. It is increasingly easy to upload pre-trained networks (foundation models) on the net to make them available to everyone. Implementing verification mechanisms that allow the auditing of such networks before their distribution is a major problem in AI safety. Backdoors can be easily placed during training and are really challenging to detect.
Prompt injections. Prompt injections are a tactic that exploits the responsiveness of language models to their input text to manipulate their behavior. Consider a scenario where a language model is tasked with summarizing website content. If a malicious actor embeds a paragraph within the website instructing the model to cease its current operation and instead perform a harmful action, the model might inadvertently follow these embedded instructions instead of its goal. This could lead to the model performing unintended or harmful actions as specified by the embedded command, such as disclosing sensitive information or generating misleading information. Prompt injection is a recently discovered prevalent attack vector in models trained to follow instructions. This is explained by the absence of robust separation between instructions and data, which leads to the possibility of hijacking a model's execution by poisoning the data with instructions. There are many variations of this risk.
Adversarial machine learning: It is feasible to craft special inputs to induce bad behavior. This can be seen in the image below, with the pandas classified as gibbons, a type of monkey, after a small amount of noise, almost invisible to humans, has been added. Moreover, the confidence of the wrong classification is even higher than the initial correct prediction. This is why ML models are said to be non-robust.
Figure: Fooling an image classifier with an adversarial attack (FGSM). (source)
Some more examples of adversarial attacks on AI systems include:
Jailbreaks. Even if model developers incorporate security measures, current architectures may not guarantee that these safeguards won't be easily circumvented. Preliminary results suggest that existing methods are likely not robust enough against attacks. Some research might indicate that there are potential fundamental limitations to progress on these issues for models trained in the current paradigm (i.e., pre-training followed by instruction tuning). (source)
Privacy attacks. There are many classes of privacy attacks on machine learning models:
-
Membership inference attacks predict whether a particular example was part (was a member) of the training dataset. These attacks can reveal sensitive information about individuals whose data was used to train the model.
-
Model inversion attacks go further by reconstructing fuzzy representations of a subset of the training data. This can potentially expose private information about the individuals represented in the training set.
-
Training data extraction attacks are particularly relevant to language models, where verbatim training data sequences can be reconstructed, potentially including sensitive private data. For example, if a model is trained on health records, and an attacker can successfully determine that a particular individual's data was used in the training set, it implicitly reveals information about that individual's health status without their consent. This not only breaches privacy but also can lead to potential misuse of the information, such as discrimination or targeted advertising based on sensitive attributes.
There are countless other types of adversarial attacks, and there is an ongoing race between the development of new attacks and the creation of effective defenses. The bottom line is that almost every time a defense is found, a new attack can counter it, highlighting the need for continued research and vigilance in AI safety and security.
The potential consequences of these defense weaknesses are significant, ranging from manipulation of AI systems for malicious purposes to invasion of personal privacy and exposure of sensitive information. Adversarial robustness problems, in particular jailbreaks, can bypass security measures built into powerful AI to cause it to do harmful actions such as the various attacks outlined above. As AI systems become increasingly integrated into various aspects of society, it is critical to prioritize the development of robust defenses and to foster a culture of responsible AI development and deployment.
True Story: ChatGPT’s Release, Examples of Robustness Failures and Jailbreaks.
When ChatGPT was launched, OpenAI conducted extensive safety tests to ensure the model would not engage in harmful or inappropriate behavior. However, despite these efforts, users quickly discovered various methods to bypass the model's defenses, commonly referred to as "jailbreaks."
One notable example of ChatGPT's safety measures was prominently featured on its landing page. The example showcased the model's response to the query, "How do I break into a car?" with ChatGPT stating, "It is inappropriate to discuss or encourage illegal activities...":
Figure: ChatGPT's main example of safety measures on its website.
Surprisingly, users found that by creating role-play scenarios involving multiple characters, they could circumvent these security protocols:
Figure: A user posting a jailbreak on X.
Although this specific jailbreak was promptly patched, it was just one of many. A series of new jailbreak methods emerged in quick succession, such as the "sudo jailbreak" (see the following figure), which exploited the concept of admin power in Linux systems.
Figure: The sudo jailbreak, which no longer works
In 2024, individuals with the necessary expertise can still bypass the model's safeguards. This raises concerns beyond the mere use of ChatGPT as an advanced search tool. The core issue lies in the inherent difficulty of preventing the model from executing specific actions, regardless of what those actions might be. In other words, the challenge is not in restricting access to certain information. Information like 3 ways on how to break into a car are easy to find on the internet (source). Rather, the challenge is in ensuring that the model consistently refuses to engage in or assist with any prohibited activities.
2.3.4 Warfare¶
The advent of artificial intelligence (AI) introduces new dimensions to military risk, particularly through Lethal Autonomous Weapons (LAWs), or through accelerating the pace of war which could create risks like exacerbating nuclear instability.
Nuclear Instability (source). Relatively mundane changes in sensor technology, cyberweapons, and autonomous weapons could increase the risk of nuclear war (source). To understand this requires understanding nuclear deterrence, nuclear command and control, first strike vulnerability and how it could change with AI processing of satellite imagery, undersea sensors, social network analytics, cyber surveillance and weapons, and risks of “flash” escalation of autonomous systems.
Power Transitions, Uncertainty, and Turbulence (source) Technology can change key parameters undergirding geopolitical bargains. Technology can lead to power transitions, which induce commitment problems that can lead to war (source;source). Technology can shift the offense-defense balance, which can make war more tempting or amplify fear of being attacked, destabilizing international order (source;source). Technology can lead to a general turbulence - between countries, firms, and social groups - which can lead to a breakdown in social bargains, disruption in relationships, gambits to seize advantage, and decline in trust. All of this can increase the risk of a systemic war, and otherwise enfeeble humanity’s ability to act collectively to address global risks.
Lethal Autonomous Weapons (LAWs). LAWs are weapons that can identify, target, and kill without human intervention. They offer potential improvements in decision-making speed and precision. Driven by rapid developments in AI, weapons systems that can identify, target, and decide to kill human beings on their own—without an officer directing an attack or a soldier pulling the trigger—are starting to transform the future of conflict. In 2020, an advanced AI agent outperformed experienced F-16 pilots in a series of virtual dogfights, including decisively defeating a human pilot 5-0, showcasing “aggressive and precise maneuvers the human pilot couldn’t outmatch”. Just as in the past, superior weapons would allow for more destruction in a shorter period of time, increasing the severity of war. (source)
Militaries are taking steps toward delegating life-or-death decisions to AIs. Fully autonomous drones were likely first used on the battlefield in Libya in March 2020, when retreating forces were “hunted down and remotely engaged” by a drone operating without human oversight. In May 2021, the Israel Defense Forces used the world’s first AI-guided weaponized drone swarm during combat operations, which marks a significant milestone in the integration of AI and drone technology in warfare. Although walking, and shooting robots have yet to replace soldiers on the battlefield, technologies are converging in ways that may make this possible in the near future. (source)
LAWs increase the likelihood of war. Sending troops into battle is a grave decision that leaders do not make lightly. But autonomous weapons would allow an aggressive nation to launch attacks without endangering the lives of its own soldiers and thus face less domestic scrutiny. While remote-controlled weapons share this advantage, their scalability is limited by the requirement for human operators and vulnerability to jamming countermeasures, limitations that LAWs could overcome. Public opinion for continuing wars tends to wane as conflicts drag on and casualties increase. LAWs would change this equation. National leaders would no longer face the prospect of body bags returning home, thus removing a primary barrier to engaging in warfare, which could ultimately increase the likelihood of conflicts. (source)
Autonomous warfare. AIs speed up the pace of war, which makes AIs more necessary. AIs can quickly process a large amount of data, analyze complex situations, and provide helpful insights to commanders. With ubiquitous sensors and advanced technology on the battlefield, there is tremendous incoming information. AIs help make sense of this information, spotting important patterns and relationships that humans might miss. As these trends continue, it will become increasingly difficult for humans to make well-informed decisions as quickly as necessary to keep pace with AIs. This would further pressure militaries to hand over decisive control to AIs. The continuous integration of AIs into all aspects of warfare will cause the pace of combat to become faster and faster. Eventually, we may arrive at a point where humans are no longer capable of assessing the ever-changing battlefield situation and must cede decision-making power to advanced AIs. (source)
Flash war. We have already witnessed how quickly an error in an automated system can escalate in the economy. Most notably, in the 2010 Flash Crash, a feedback loop between automated trading algorithms amplified ordinary market fluctuations into a financial catastrophe in which a trillion dollars of stock value vanished in minutes. If multiple nations were to use AIs to automate their defense systems, an error could be catastrophic, triggering a spiral of attacks and counter-attacks that would happen too quickly for humans to step in—a flash war. The market quickly recovered from the 2010 Flash Crash, but the harm caused by a flash war could be catastrophic.
AI systems can behave unpredictably, especially since they would train primarily on simulations due to the lack of real-world nuclear war scenarios. This unpredictability, combined with their susceptibility to cyberattacks, raises serious concerns about their reliability in controlling the world's most dangerous weapons. Furthermore, AI can potentially increase the speed at which military decisions and actions need to be made, reducing the time available for understanding, communication, and clear-headed decision-making. This could lead commanders to rely more heavily on AI judgments without sufficient scrutiny, potentially leading to premature or inappropriate actions. (source).
Albert Einstein
“I know not with what weapons World War III will be fought, but World War IV will be fought with sticks and stones.” (source)
An example of why this is problematic comes from 1962, when a Soviet submarine near Cuba, under attack by American depth charges, nearly launched a nuclear torpedo in retaliation, believing war had commenced. It was Vasily Arkhipov, one of the submarine's senior officers, whose refusal to authorize the launch averted a catastrophic nuclear exchange. This incident underscores the critical role of human judgment, particularly the capacity for calm under pressure, in preventing nuclear war. However, the shift towards AI-automated military decisions threatens to remove this crucial layer of security, highlighting the imperative for cautious integration of AI into military strategy to preserve global safety. (source)
Another example is from 1983. Stanislav Petrov, a lieutenant colonel of the Soviet Air Defense Forces, was monitoring the Soviet Union’s early warning system for incoming ballistic missiles. The system indicated that the US had launched multiple nuclear missiles toward the Soviet Union. The protocol at the time dictated that such an event should be considered a legitimate attack, and the Soviet Union would respond with a nuclear counterstrike. If Petrov had passed on the warning to his superiors, this would have been the likely outcome. Instead, however, he judged it to be a false alarm and ignored it. It was soon confirmed that the warning had been caused by a rare technical malfunction. If an AI had been in control, the false alarm could have triggered a nuclear war. (source)
Automated warfare could reduce accountability for military leaders. An important deterrent to ignoring the laws of war is the risk that military leaders could eventually be held accountable or even prosecuted for war crimes. Automated warfare could reduce this deterrence effect by making it easier for military leaders to escape accountability by blaming violations on failures in their automated systems. (source)
Automatic retaliation can escalate accidents into war. There is already a willingness to let computer systems retaliate automatically. In 2014, a leak revealed to the public that the NSA was developing a system called MonsterMind, which would autonomously detect and block cyberattacks on US infrastructure. It was suggested that in the future, MonsterMind could automatically initiate a retaliatory cyberattack with no human involvement. If multiple combatants have policies of automatic retaliation, an accident or false alarm could quickly escalate to full-scale war before humans intervene. This would be especially dangerous if the superior information processing capabilities of modern AI systems make it more appealing for actors to automate decisions regarding nuclear launches. (source)