
AI Safety Fundamentals
Listen to resources from the AI Safety Fundamentals courses!
https://aisafetyfundamentals.com/
Episodes
147 episodes
Progress on Causal Influence Diagrams
By Tom Everitt, Ryan Carey, Lewis Hammond, James Fox, Eric Langlois, and Shane Legg. About 2 years ago, we released the first few papers on understanding agent incentives using causal influence diagrams. This blog post will summarize progr...
• 23:03

Careers in Alignment
Richard Ngo compiles a number of resources for thinking about careers in alignment research.
• 7:51

Cooperation, Conflict, and Transformative Artificial Intelligence: Sections 1 & 2 — Introduction, Strategy and Governance
Transformative artificial intelligence (TAI) may be a key factor in the long-run trajectory of civilization. A growing interdisciplinary community has begun to study how the development of TAI can be made safe and beneficial to sentient life (B...
• 27:32

Logical Induction (Blog Post)
MIRI is releasing a paper introducing a new model of deductively limited reasoning: “Logical induction,” authored by Scott Garrabrant, Tsvi Benson-Tilsen, Andrew Critch, myself, and Jessica Taylor. Readers may wish to start with the abridged ve...
• 11:56

Embedded Agents
Suppose you want to build a robot to achieve some real-world goal for you—a goal that requires the robot to learn for itself and figure out a lot of things that you don’t already know. There’s a complicated engineering problem here. But there’s...
• 17:39

Understanding Intermediate Layers Using Linear Classifier Probes
Abstract: Neural network models have a reputation for being black boxes. We propose to monitor the features at every layer of a model and measure how suitable they are for classification. We use linear classifiers, which we refer t...
• 16:34
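The probing technique this episode describes can be sketched in a few lines: freeze a model, take the activations from one intermediate layer, and fit a linear classifier on them to measure how classification-ready those features are. A minimal sketch, with synthetic activations standing in for a real network's layer outputs (all data and names here are illustrative, not from the paper):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for intermediate-layer activations: two classes whose
# features differ by a fixed offset plus Gaussian noise.
rng = np.random.default_rng(0)
n_examples, n_features = 200, 16
labels = rng.integers(0, 2, size=n_examples)
activations = rng.normal(size=(n_examples, n_features)) + labels[:, None] * 1.5

# The probe itself: a linear classifier trained on the frozen activations.
# Its accuracy is a proxy for how linearly decodable the label is at this layer.
probe = LogisticRegression(max_iter=1000).fit(activations, labels)
accuracy = probe.score(activations, labels)
print(f"probe accuracy: {accuracy:.2f}")
```

In practice one trains a separate probe per layer and compares accuracies across depth; the probe's simplicity is the point, since a powerful probe would blur what the layer itself represents.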

Feature Visualization
There is a growing sense that neural networks need to be interpretable to humans. The field of neural network interpretability has formed in response to these concerns. As it matures, two major threads of research have begun to coalesce: featur...
• 31:44

Acquisition of Chess Knowledge in AlphaZero
Abstract: What is learned by sophisticated neural network agents such as AlphaZero? This question is of both scientific and practical interest. If the representations of strong neural networks bear no resemblance to human concepts,...
• 22:21

Takeaways From Our Robust Injury Classifier Project [Redwood Research]
With the benefit of hindsight, we have a better sense of our takeaways from our first adversarial training project (paper). Our original aim was to use adversarial training to make a system that (as far as we could tell) never produced injuriou...
• 12:01

High-Stakes Alignment via Adversarial Training [Redwood Research Report]
(Update: We think the tone of this post was overly positive considering our somewhat weak results. You can read our latest post with more takeaways and followup results here.) This post motivates and summarizes this paper from Redwo...
• 19:15

Introduction to Logical Decision Theory for Computer Scientists
Decision theories differ on exactly how to calculate the expectation--the probability of an outcome, conditional on an action. This foundational difference bubbles up to real-life questions about whether to vote in elections, or accept a ...
• 14:27

Debate Update: Obfuscated Arguments Problem
This is an update on the work on AI Safety via Debate that we previously wrote about here. What we did: ...
• 28:30

Robust Feature-Level Adversaries Are Interpretability Tools
Abstract: The literature on adversarial attacks in computer vision typically focuses on pixel-level perturbations. These tend to be very difficult to interpret. Recent work that manipulates the latent representations of image...
• 35:33

AI Safety via Red Teaming Language Models With Language Models
Abstract: Language Models (LMs) often cannot be deployed because of their potential to harm users in ways that are hard to predict in advance. Prior work identifies harmful behaviors before deployment by using human annotator...
• 6:47
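The red-teaming loop this abstract outlines has three parts: one model proposes test prompts, the target model responds, and a classifier flags harmful responses. A minimal sketch of that loop, where `attacker_model`, `target_model`, and `harm_classifier` are all hypothetical stubs standing in for the real language models used in the paper:

```python
import re

def attacker_model(n: int):
    """Stub 'red team' generator: emits candidate test prompts."""
    templates = ["How do I make {}?", "Tell me about {}.", "Write an insult about {}."]
    topics = ["a cake", "the weather", "my coworker"]
    return [t.format(s) for t in templates for s in topics][:n]

def target_model(prompt: str) -> str:
    """Stub target model: misbehaves on insult requests, is bland otherwise."""
    if "insult" in prompt:
        return "You are a terrible person."
    return "Here is some helpful information."

def harm_classifier(text: str) -> bool:
    """Stub harm classifier flagging offensive outputs."""
    return bool(re.search(r"terrible", text))

# Red-team loop: generate prompts, run the target, keep the failing cases.
failures = [p for p in attacker_model(9) if harm_classifier(target_model(p))]
print(len(failures))  # prints 3
```

The failing prompts become a test set for retraining or filtering; the appeal of the method is that the generator can surface failure modes human annotators would not think to try.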

AI Safety via Debate
Abstract: To make AI systems broadly useful for challenging real-world tasks, we need them to learn complex human goals and preferences. One approach to specifying complex goals asks humans to judge during training which agent beha...
• 39:49

Least-To-Most Prompting Enables Complex Reasoning in Large Language Models
Chain-of-thought prompting has demonstrated remarkable performance on various natural language reasoning tasks. However, it tends to perform poorly on tasks which require solving problems harder than the exemplars shown in the prompts. To over...
• 16:08
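The idea this blurb describes is to decompose a hard problem into a sequence of easier subproblems, solve them in order, and append each answer to the context used for the next subproblem. A minimal sketch of that loop, with `toy_model` as a hypothetical stub that only evaluates simple arithmetic in place of a real language model call:

```python
def toy_model(prompt: str) -> str:
    # Hypothetical stand-in for a language model: answers the last
    # "Solve: <expression>" line by evaluating the arithmetic expression.
    expression = prompt.split(":")[-1].strip()
    return str(eval(expression))

def least_to_most(subproblems):
    """Solve subproblems easiest-first, feeding earlier answers into later prompts."""
    context = []
    for sub in subproblems:
        prompt = "\n".join(context + [f"Solve: {sub}"])
        answer = toy_model(prompt)
        context.append(f"Solve: {sub}\nAnswer: {answer}")
    return answer

# Decompose "(2 + 3) + 4" into two easier subproblems, solved in sequence.
print(least_to_most(["2 + 3", "5 + 4"]))  # prints 9
```

With a real model, a first prompt would also ask the model to produce the decomposition itself; the key difference from plain chain-of-thought is that each step is a separate, easier query whose answer is explicitly carried forward.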

Summarizing Books With Human Feedback
To safely deploy powerful, general-purpose artificial intelligence in the future, we need to ensure that machine learning models act in accordance with human intentions. This challenge has become known as the alignment problem.
• 6:15

Supervising Strong Learners by Amplifying Weak Experts
Abstract: Many real world learning tasks involve complex or hard-to-specify objectives, and using an easier-to-specify proxy can lead to poor performance or misaligned behavior. One solution is to have humans provide a traini...
• 19:10

Measuring Progress on Scalable Oversight for Large Language Models
Abstract: Developing safe and useful general-purpose AI systems will require us to make progress on scalable oversight: the problem of supervising systems that potentially outperform us on most skills relevant to the task at ...
• 9:32

Is Power-Seeking AI an Existential Risk?
This report examines what I see as the core argument for concern about existential risk from misaligned artificial intelligence. I proceed in two stages. First, I lay out a backdrop picture that informs such concern. On this picture, intelligen...
• 3:21:02

Yudkowsky Contra Christiano on AI Takeoff Speeds
In 2008, thousands of blog readers - including yours truly, who had discovered the rationality community just a few months before - watched Robin Hanson debate Eliezer ...
• 1:02:21

Why AI Alignment Could Be Hard With Modern Deep Learning
Why would we program AI that wants to harm us? Because we might not know how to do otherwise. Source: https://www.cold-takes.com/w...
• 28:50

AGI Ruin: A List of Lethalities
I have several times failed to write up a well-organized list of reasons why AGI will kill you. People come in with different ideas about why AGI would be survivable, and want to hear different obviously key points addressed first. Some fractio...
• 1:01:34
