
AI Safety Fundamentals
Listen to resources from the AI Safety Fundamentals courses!
https://aisafetyfundamentals.com/
Episodes
154 episodes
In Search of a Dynamist Vision for Safe Superhuman AI
By Helen Toner. This essay describes AI safety policies that rely on centralised control (surveillance, fewer AI projects, licensing regimes) as "stasist" approaches that sacrifice innovation for stability. Toner argues we need "dynamist" ...
•
16:55

It’s Practically Impossible to Run a Big AI Company Ethically
By Sigal Samuel (Vox Future Perfect). Even "safety-first" AI companies like Anthropic face market pressure that can override ethical commitments. This article demonstrates the constraints facing AI companies, and why voluntary corporate go...
•
17:14

Seeking Stability in the Competition for AI Advantage
By Iskander Rehman, Karl P. Mueller, and Michael J. Mazarr. This RAND article describes some of the international dynamics driving the race to AGI between the US and China, and analyses whether nuclear deterrence logic applies to this race.
•
18:18

Solarpunk: A Vision for a Sustainable Future
By Joshua Krook. What might sustainable human progress look like, beyond pure technological acceleration? This essay provides an alternative vision, based on communities living in greater harmony with each other and with nature, alongside ...
•
13:21

The Gentle Singularity
By Sam Altman. This blog post offers a vivid, optimistic vision of rapid AI progress from the CEO of OpenAI. Altman suggests that the accelerating technological change will feel "impressive but manageable," and that there are serious chall...
•
10:20

Preparing for Launch
By Tim Fist, Tao Burga, and Tim Hwang. The Institute for Progress lays out how the US Government could shape the development of AI towards human flourishing by accelerating beneficial AI applications and defences against societal harms.
•
38:01

AI-Enabled Coups: How a Small Group Could Use AI to Seize Power
By Tom Davidson, Lukas Finnveden and Rose Hadshar. The development of AI that is more broadly capable than humans will create a new and serious threat: AI-enabled coups. An AI-enabled coup could be staged by a very small gr...
•
2:09:31

Progress on Causal Influence Diagrams
By Tom Everitt, Ryan Carey, Lewis Hammond, James Fox, Eric Langlois, and Shane Legg. About 2 years ago, we released the first few papers on understanding agent incentives using causal influence diagrams. This blog post will summarize progr...
•
23:03

Careers in Alignment
Richard Ngo compiles a number of resources for thinking about careers in alignment research.
•
7:51

Cooperation, Conflict, and Transformative Artificial Intelligence: Sections 1 & 2 — Introduction, Strategy and Governance
Transformative artificial intelligence (TAI) may be a key factor in the long-run trajectory of civilization. A growing interdisciplinary community has begun to study how the development of TAI can be made safe and beneficial to sentient life (B...
•
27:32

Logical Induction (Blog Post)
MIRI is releasing a paper introducing a new model of deductively limited reasoning: “Logical induction,” authored by Scott Garrabrant, Tsvi Benson-Tilsen, Andrew Critch, myself, and Jessica Taylor. Readers may wish to start with the abridged ve...
•
11:56

Embedded Agents
Suppose you want to build a robot to achieve some real-world goal for you—a goal that requires the robot to learn for itself and figure out a lot of things that you don’t already know. There’s a complicated engineering problem here. But there’s...
•
17:39

Understanding Intermediate Layers Using Linear Classifier Probes
Abstract: Neural network models have a reputation for being black boxes. We propose to monitor the features at every layer of a model and measure how suitable they are for classification. We use linear classifiers, which we refer t...
•
16:34
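
A minimal sketch of the linear-probe idea this episode describes: fit a simple linear classifier on the activations of each intermediate layer and read its held-out accuracy as a rough measure of how linearly decodable the labels are at that depth. The activations below are random placeholders standing in for a real model's hidden states (collected in practice via forward hooks), so only the procedure is meaningful.

```python
# Hypothetical linear-probe sketch: probe accuracy per "layer" as a proxy
# for how classification-relevant each layer's features are.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_samples, n_layers, width = 2000, 4, 64
labels = rng.integers(0, 2, size=n_samples)

for layer in range(n_layers):
    # Placeholder activations; later "layers" are made more separable here
    # only so the example produces an increasing trend.
    signal = labels[:, None] * (layer + 1) * 0.5
    acts = rng.normal(size=(n_samples, width)) + signal

    X_train, X_test, y_train, y_test = train_test_split(
        acts, labels, test_size=0.25, random_state=0
    )
    probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    print(f"layer {layer}: probe accuracy = {probe.score(X_test, y_test):.2f}")
```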

Feature Visualization
There is a growing sense that neural networks need to be interpretable to humans. The field of neural network interpretability has formed in response to these concerns. As it matures, two major threads of research have begun to coalesce: featur...
•
31:44
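
A rough sketch of the core feature-visualization loop discussed in this episode: start from a noise image and take gradient-ascent steps that increase the activation of one chosen unit. The tiny randomly initialized network below is a placeholder for a trained vision model, so the optimized image is meaningless; it only illustrates the optimization procedure (without the regularizers real feature visualization relies on).

```python
# Hedged sketch of activation maximization (feature visualization).
# The untrained conv net is a stand-in for a real trained model.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(8, 16, kernel_size=3, padding=1), nn.ReLU(),
)
model.eval()

channel = 5                                   # which feature map to maximize
image = torch.randn(1, 3, 64, 64, requires_grad=True)
optimizer = torch.optim.Adam([image], lr=0.05)

for step in range(200):
    optimizer.zero_grad()
    acts = model(image)
    loss = -acts[0, channel].mean()           # ascend on the channel's mean activation
    loss.backward()
    optimizer.step()

print("final mean activation:", -loss.item())
```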

Acquisition of Chess Knowledge in AlphaZero
Abstract: What is learned by sophisticated neural network agents such as AlphaZero? This question is of both scientific and practical interest. If the representations of strong neural networks bear no resemblance to human concepts,...
•
22:21

Takeaways From Our Robust Injury Classifier Project [Redwood Research]
With the benefit of hindsight, we have a better sense of our takeaways from our first adversarial training project (paper). Our original aim was to use adversarial training to make a system that (as far as we could tell) never produced injuriou...
•
12:01
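
The workflow this episode reflects on, reduced to a hedged sketch: repeatedly search for inputs the current classifier gets wrong, add those failures to the training data, and retrain. Everything below (the synthetic data, the random-search "attack") is a placeholder for the real text classifier and the human- and tool-assisted adversarial search used in the project.

```python
# Minimal adversarial-training loop on toy data (not the project's actual setup).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def make_data(n):
    X = rng.normal(size=(n, 10))
    y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)   # toy stand-in label
    return X, y

def find_failures(clf, n_candidates=5000):
    # Stand-in for adversarial search: sample candidates, keep misclassified ones.
    X, y = make_data(n_candidates)
    wrong = clf.predict(X) != y
    return X[wrong], y[wrong]

X_train, y_train = make_data(500)
clf = LogisticRegression().fit(X_train, y_train)

for round_ in range(3):
    X_adv, y_adv = find_failures(clf)
    print(f"round {round_}: found {len(X_adv)} failures")
    X_train = np.vstack([X_train, X_adv])
    y_train = np.concatenate([y_train, y_adv])
    clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
```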

High-Stakes Alignment via Adversarial Training [Redwood Research Report]
(Update: We think the tone of this post was overly positive considering our somewhat weak results. You can read our latest post with more takeaways and followup results here.) This post motivates and summarizes this paper from Redwo...
•
19:15

Introduction to Logical Decision Theory for Computer Scientists
Decision theories differ on exactly how to calculate the expectation--the probability of an outcome, conditional on an action. This foundational difference bubbles up to real-life questions about whether to vote in elections, or accept a ...
•
14:27
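
To make "the expectation, conditional on an action" concrete, here is the standard Newcomb's problem worked out numerically; the payoffs and predictor accuracy are the usual illustrative values, not figures from the episode. Evidential-style conditioning favours one-boxing, while a causal calculation that holds the box contents fixed makes two-boxing dominate.

```python
# Newcomb's problem with the usual illustrative numbers (hypothetical values).
p = 0.99                  # predictor accuracy
BIG, SMALL = 1_000_000, 1_000

# Evidential-style expectation: condition the box contents on the action,
# because the prediction is correlated with what you choose.
edt_one_box = p * BIG
edt_two_box = (1 - p) * (BIG + SMALL) + p * SMALL

# Causal-style expectation: the contents are already fixed, so two-boxing
# adds SMALL on top of whatever is in the opaque box either way.
cdt_gain_from_two_boxing = SMALL

print(f"EDT one-box : {edt_one_box:,.0f}")
print(f"EDT two-box : {edt_two_box:,.0f}")
print(f"CDT: two-boxing is better by {cdt_gain_from_two_boxing:,} regardless of contents")
```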

Debate Update: Obfuscated Arguments Problem
This is an update on the work on AI Safety via Debate that we previously wrote about here. What we did: ...
•
28:30

Robust Feature-Level Adversaries Are Interpretability Tools
Abstract: The literature on adversarial attacks in computer vision typically focuses on pixel-level perturbations. These tend to be very difficult to interpret. Recent work that manipulates the latent representations of image...
•
35:33

AI Safety via Red Teaming Language Models With Language Models
Abstract: Language Models (LMs) often cannot be deployed because of their potential to harm users in ways that are hard to predict in advance. Prior work identifies harmful behaviors before deployment by using human annotator...
•
6:47
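
The paper's basic recipe, a red-team LM generating test prompts and a classifier scoring the target model's replies, reduced to a sketch. All three model calls below are stub functions; in a real pipeline they would be replaced by actual language-model and classifier calls.

```python
# Hedged sketch of "red teaming LMs with LMs"; every model call is a stub.
import random

random.seed(0)

def red_team_lm(n):
    # Stub attacker: would be an LM prompted or trained to produce probing questions.
    return [f"test question #{i}" for i in range(n)]

def target_lm(prompt):
    # Stub target model under evaluation.
    return f"reply to {prompt!r}"

def harm_classifier(reply):
    # Stub harm score in [0, 1]; would be a trained offensive-content classifier.
    return random.random()

flagged = []
for prompt in red_team_lm(100):
    reply = target_lm(prompt)
    if harm_classifier(reply) > 0.95:
        flagged.append((prompt, reply))

print(f"flagged {len(flagged)} prompt/reply pairs for human review")
```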

AI Safety via Debate
Abstract: To make AI systems broadly useful for challenging real-world tasks, we need them to learn complex human goals and preferences. One approach to specifying complex goals asks humans to judge during training which agent beha...
•
39:49
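
A toy rendering of the debate protocol the abstract describes: two agents argue for opposing answers over a fixed number of turns, and a judge picks the more convincing side from the transcript alone. The agent and judge functions are placeholders, not the paper's training setup.

```python
# Toy debate protocol sketch; debaters and judge are placeholders.
def agent_argue(name, answer, transcript):
    # Stub debater: would be a model producing its strongest argument for `answer`.
    return f"{name} argues for {answer!r} (turn {len(transcript) // 2 + 1})"

def judge(transcript):
    # Stub judge: would be a human (or model) reading only the transcript.
    return "A" if len(transcript) % 2 == 0 else "B"

question = "Is the cat in the image?"
answers = {"A": "yes", "B": "no"}
transcript = []

for turn in range(6):
    name = "A" if turn % 2 == 0 else "B"
    transcript.append(agent_argue(name, answers[name], transcript))

winner = judge(transcript)
print("\n".join(transcript))
print(f"judge rules in favour of debater {winner} ({answers[winner]!r})")
```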

Least-To-Most Prompting Enables Complex Reasoning in Large Language Models
Chain-of-thought prompting has demonstrated remarkable performance on various natural language reasoning tasks. However, it tends to perform poorly on tasks that require solving problems harder than the exemplars shown in the prompts. To over...
•
16:08
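
The two-stage procedure the abstract describes, as a sketch: first prompt the model to decompose the problem into easier subproblems, then solve them in order (least to most), feeding each answer back into the next prompt. The `call_lm` function is a stub standing in for a real LLM API call, and the decomposition is hard-coded here for illustration rather than parsed from a model response.

```python
# Least-to-most prompting sketch; `call_lm` is a stand-in for a real model call.
def call_lm(prompt):
    # Stub: a real implementation would send `prompt` to a language model.
    return "<model answer>"

problem = ("Amy climbs to the top of the slide in 4 minutes and slides down in 1. "
           "How many times can she slide in 15 minutes?")

# Stage 1: ask the model to break the problem into easier subproblems.
decomposition_prompt = f"Q: {problem}\nTo solve this, we first need to answer:"
subproblems = [
    "How long does one full trip (climb up plus slide down) take?",
    "How many full trips fit into 15 minutes?",
]  # in practice, parsed from call_lm(decomposition_prompt)

# Stage 2: solve subproblems least-to-most, carrying earlier answers forward.
context = f"Q: {problem}\n"
for sub in subproblems:
    context += f"Subquestion: {sub}\nAnswer: "
    answer = call_lm(context)
    context += f"{answer}\n"

print(context)
```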
