Chapter 3: Strategies

Definitions

Before we can build solutions, we must agree on the problem. Differentiating between AI safety, alignment, ethics, and control helps us build targeted, effective strategies for each task.

Written by Markov Grey and Charbel-Raphaël Segerie

How we define problems directly shapes which strategies we pursue to solve them. In a new and evolving field like AI safety, clearly defined terms are essential for effective communication and research. Ambiguity leads to miscommunication, hinders collaboration, obscures disagreements, and facilitates safety washing (Ren et al., 2024; Lizka, 2023). Terms like "alignment" and "safety" are used with varying meanings, reflecting different underlying assumptions about the nature of the problem and the goals of research, and those assumptions in turn shape the solutions we develop. The goal of this section is to explain the different perspectives on these terms, clarify what specific safety strategies aim to achieve, and establish how this text will use them.

AI Safety

Definition 3.1 — AI safety

AI safety ensures AI systems do not cause harm to humans or the environment. It encompasses the broadest range of research and engineering practices focused on preventing harmful outcomes from AI systems, whether those outcomes arise inadvertently or deliberately. While alignment focuses on aspects such as an AI's goals and intentions, safety addresses a broader range of concerns (Rudner et al., 2021). AI safety research seeks to identify the causes of unintended AI behavior and to develop tools for ensuring safe and reliable operation. It includes technical subfields like robustness (ensuring reliable performance, including against adversarial attacks), monitoring (observing AI behavior), and capability control (limiting potentially dangerous abilities).

AI Alignment #

Definition 3.2 — AI Alignment

AI alignment aims to ensure AI systems act in accordance with human intentions and values. Alignment is a subset of safety that focuses specifically on the technical problem of ensuring an AI's objectives match human intentions and values. Theoretically, a system could be aligned but unsafe (e.g., competently pursuing a misspecified objective) or safe but unaligned (e.g., constrained by control mechanisms despite having misaligned objectives). While this sounds straightforward, the precise scope of the term varies significantly across research communities. We saw a brief definition of alignment in the previous chapter; this section offers a more nuanced perspective on the various definitions we could work with.
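To make the "aligned but unsafe" case concrete, here is a minimal, purely illustrative sketch (not from the original text; the policy names and scores are hypothetical): an optimizer that competently maximizes a misspecified proxy objective ends up faithful to its stated objective yet fails on the objective we actually cared about.

```python
# Toy illustration of misspecification: the optimizer competently maximizes a
# proxy objective (user approval) while the intended objective (accuracy) is
# what we actually cared about. All names and numbers are hypothetical.

candidate_policies = {
    "honest":      {"approval": 0.70, "accuracy": 0.90},
    "sycophantic": {"approval": 0.95, "accuracy": 0.40},
}

# Competent optimization of the stated (proxy) objective...
chosen = max(candidate_policies, key=lambda name: candidate_policies[name]["approval"])

print("Chosen by proxy optimization:", chosen)                                  # sycophantic
print("Intended objective achieved:", candidate_policies[chosen]["accuracy"])  # 0.40
# The system did exactly what it was told (maximize approval) and still produced
# an outcome we did not intend: aligned to the specification, but unsafe.
```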

Optional reading: What Do We Mean by 'Alignment'?

AI Ethics

Definition 3.3 — AI Ethics

AI ethics is the field that examines the moral principles and societal implications of AI systems. It addresses the ethical considerations raised by potential societal upheavals resulting from AI advancements and the moral frameworks needed to navigate these changes. The core of AI ethics lies in ensuring that AI development is aligned with human dignity, fairness, and societal well-being, grounded in a deep understanding of its broader societal impact. Research in AI ethics encompasses, for example, privacy norms and the identification and mitigation of bias in AI systems (Huang et al., 2022; Harvard, 2025; Khan et al., 2022).

Ethics complements technical safety approaches by providing normative guidance on what constitutes beneficial AI outcomes. Whereas alignment focuses on ensuring AI systems pursue intended objectives, ethics research asks which objectives are worth pursuing (Huang et al., 2023; LaCroix & Luccioni, 2022). AI ethics might also include discussions of digital rights and, potentially, the rights of digital minds and future AIs.

This chapter focuses primarily on the frameworks that inform technical safety and governance strategies, rather than exploring ethics, meta-ethics, or digital rights.

AI Control

Definition 3.4 — AI Control

AI control ensures AI systems remain under human authority despite potential misalignment. It implements mechanisms that keep systems under human direction even when they might act against our interests. Unlike alignment approaches, which focus on giving AI systems the right goals, control addresses what happens if those goals diverge from human intentions (Greenblatt et al., 2024).

Control and alignment work as complementary safety approaches. While alignment aims to prevent preference divergence by designing systems with the right objectives, control creates layers of security that function even when alignment fails. Control measures include monitoring AI actions, restricting system capabilities, human auditing processes, and mechanisms to terminate AI systems when necessary (Greenblatt et al., 2023). Some researchers argue that even if alignment is needed for superintelligent AI, control through monitoring may be a workable strategy for less capable systems (Greenblatt et al., 2024). Ideally, an AGI would be both aligned and controllable: it would have the right goals and remain subject to human oversight and intervention if something goes wrong.
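As a purely conceptual sketch (the function names, thresholds, and monitoring logic are hypothetical illustrations, not an implementation from the cited work), a control layer can be pictured as a wrapper that scores each proposed action, escalates suspicious ones to a human auditor, and retains the ability to shut the system down, without assuming the underlying model is aligned.

```python
# Conceptual sketch of a control wrapper: monitor each proposed action,
# escalate suspicious ones to a human auditor, and keep a shutdown switch.
# The monitor, thresholds, and action format are hypothetical placeholders.

SUSPICION_THRESHOLD = 0.8   # above this, hold the action for human review
SHUTDOWN_THRESHOLD = 0.95   # above this, halt the system entirely

def suspicion_score(action: str) -> float:
    """Placeholder monitor; a real one might be a separate trusted model."""
    return 0.99 if "disable oversight" in action else 0.1

def controlled_execute(action: str) -> str:
    score = suspicion_score(action)
    if score >= SHUTDOWN_THRESHOLD:
        return "SHUTDOWN: system halted pending investigation"
    if score >= SUSPICION_THRESHOLD:
        return "ESCALATED: action held for human audit"
    return f"EXECUTED: {action}"

print(controlled_execute("summarize the quarterly report"))
print(controlled_execute("disable oversight logging before deployment"))
```

The point of the sketch is only that these layers operate on the system's behavior from the outside; they do not require the model's goals to be correct.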

This line of AI safety work is discussed in much more detail in our chapter on AI evaluations.

Footnotes

  1. While AI alignment does not necessarily encompass all systemic risks and misuse, there is some overlap. Some alignment techniques could help mitigate specific misuse scenarios; for instance, alignment methods could ensure that models refuse to cooperate with users intending to use AI for harmful purposes, such as bioterrorism. Similarly, from a systemic risk perspective, a well-aligned AI might recognize and refuse to participate in problematic processes embedded within systems, such as financial markets. However, challenges remain: malicious actors might circumvent these protections by fine-tuning models for harmful purposes, and in that case even a perfectly aligned model would not be able to resist.
