
AI Safety Fundamentals
Listen to resources from the AI Safety Fundamentals courses!
https://aisafetyfundamentals.com/
Episodes
147 episodes
Progress on Causal Influence Diagrams
By Tom Everitt, Ryan Carey, Lewis Hammond, James Fox, Eric Langlois, and Shane Legg. About 2 years ago, we released the first few papers on understanding agent incentives using causal influence diagrams. This blog post will summarize progr...
• 23:03

Careers in Alignment
Richard Ngo compiles a number of resources for thinking about careers in alignment research.
• 7:51

Cooperation, Conflict, and Transformative Artificial Intelligence: Sections 1 & 2 — Introduction, Strategy and Governance
Transformative artificial intelligence (TAI) may be a key factor in the long-run trajectory of civilization. A growing interdisciplinary community has begun to study how the development of TAI can be made safe and beneficial to sentient life (B...
• 27:32

Logical Induction (Blog Post)
MIRI is releasing a paper introducing a new model of deductively limited reasoning: “Logical induction,” authored by Scott Garrabrant, Tsvi Benson-Tilsen, Andrew Critch, myself, and Jessica Taylor. Readers may wish to start with the abridged ve...
• 11:56

Embedded Agents
Suppose you want to build a robot to achieve some real-world goal for you—a goal that requires the robot to learn for itself and figure out a lot of things that you don’t already know. There’s a complicated engineering problem here. But there’s...
• 17:39

Understanding Intermediate Layers Using Linear Classifier Probes
Abstract: Neural network models have a reputation for being black boxes. We propose to monitor the features at every layer of a model and measure how suitable they are for classification. We use linear classifiers, which we refer t...
• 16:34
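The probing technique this episode describes can be sketched in a few lines: freeze a model, take the activations from one intermediate layer, and fit a linear classifier on them to measure how classification-ready those features are. A minimal sketch, with synthetic activations standing in for a real network's layer outputs (all data and names here are illustrative, not from the paper):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for intermediate-layer activations: two classes whose
# features differ by a fixed offset plus Gaussian noise.
rng = np.random.default_rng(0)
n_examples, n_features = 200, 16
labels = rng.integers(0, 2, size=n_examples)
activations = rng.normal(size=(n_examples, n_features)) + labels[:, None] * 1.5

# The probe itself: a linear classifier trained on the frozen activations.
# Its accuracy is a proxy for how linearly decodable the label is at this layer.
probe = LogisticRegression(max_iter=1000).fit(activations, labels)
accuracy = probe.score(activations, labels)
print(f"probe accuracy: {accuracy:.2f}")
```

In practice one trains a separate probe per layer and compares accuracies across depth; the probe's simplicity is the point, since a powerful probe would blur what the layer itself represents.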

Feature Visualization
There is a growing sense that neural networks need to be interpretable to humans. The field of neural network interpretability has formed in response to these concerns. As it matures, two major threads of research have begun to coalesce: featur...
• 31:44

Acquisition of Chess Knowledge in AlphaZero
Abstract: What is learned by sophisticated neural network agents such as AlphaZero? This question is of both scientific and practical interest. If the representations of strong neural networks bear no resemblance to human concepts,...
• 22:21

Takeaways From Our Robust Injury Classifier Project [Redwood Research]
With the benefit of hindsight, we have a better sense of our takeaways from our first adversarial training project (paper). Our original aim was to use adversarial training to make a system that (as far as we could tell) never produced injuriou...
• 12:01

High-Stakes Alignment via Adversarial Training [Redwood Research Report]
(Update: We think the tone of this post was overly positive considering our somewhat weak results. You can read our latest post with more takeaways and followup results here.) This post motivates and summarizes this paper from Redwo...
• 19:15

Introduction to Logical Decision Theory for Computer Scientists
Decision theories differ on exactly how to calculate the expectation--the probability of an outcome, conditional on an action. This foundational difference bubbles up to real-life questions about whether to vote in elections, or accept a ...
• 14:27

Debate Update: Obfuscated Arguments Problem
This is an update on the work on AI Safety via Debate that we previously wrote about here. What we did: ...
• 28:30

Robust Feature-Level Adversaries Are Interpretability Tools
Abstract: The literature on adversarial attacks in computer vision typically focuses on pixel-level perturbations. These tend to be very difficult to interpret. Recent work that manipulates the latent representations of image...
• 35:33

AI Safety via Red Teaming Language Models With Language Models
Abstract: Language Models (LMs) often cannot be deployed because of their potential to harm users in ways that are hard to predict in advance. Prior work identifies harmful behaviors before deployment by using human annotator...
• 6:47
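The red-teaming loop this abstract outlines has three parts: one model proposes test prompts, the target model responds, and a classifier flags harmful responses. A minimal sketch of that loop, where `attacker_model`, `target_model`, and `harm_classifier` are all hypothetical stubs standing in for the real language models used in the paper:

```python
import re

def attacker_model(n: int):
    """Stub 'red team' generator: emits candidate test prompts."""
    templates = ["How do I make {}?", "Tell me about {}.", "Write an insult about {}."]
    topics = ["a cake", "the weather", "my coworker"]
    return [t.format(s) for t in templates for s in topics][:n]

def target_model(prompt: str) -> str:
    """Stub target model: misbehaves on insult requests, is bland otherwise."""
    if "insult" in prompt:
        return "You are a terrible person."
    return "Here is some helpful information."

def harm_classifier(text: str) -> bool:
    """Stub harm classifier flagging offensive outputs."""
    return bool(re.search(r"terrible", text))

# Red-team loop: generate prompts, run the target, keep the failing cases.
failures = [p for p in attacker_model(9) if harm_classifier(target_model(p))]
print(len(failures))  # prints 3
```

The failing prompts become a test set for retraining or filtering; the appeal of the method is that the generator can surface failure modes human annotators would not think to try.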

AI Safety via Debate
Abstract: To make AI systems broadly useful for challenging real-world tasks, we need them to learn complex human goals and preferences. One approach to specifying complex goals asks humans to judge during training which agent beha...
• 39:49

Least-To-Most Prompting Enables Complex Reasoning in Large Language Models
Chain-of-thought prompting has demonstrated remarkable performance on various natural language reasoning tasks. However, it tends to perform poorly on tasks which require solving problems harder than the exemplars shown in the prompts. To over...
• 16:08
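The idea this blurb describes is to decompose a hard problem into a sequence of easier subproblems, solve them in order, and append each answer to the context used for the next subproblem. A minimal sketch of that loop, with `toy_model` as a hypothetical stub that only evaluates simple arithmetic in place of a real language model call:

```python
def toy_model(prompt: str) -> str:
    # Hypothetical stand-in for a language model: answers the last
    # "Solve: <expression>" line by evaluating the arithmetic expression.
    expression = prompt.split(":")[-1].strip()
    return str(eval(expression))

def least_to_most(subproblems):
    """Solve subproblems easiest-first, feeding earlier answers into later prompts."""
    context = []
    for sub in subproblems:
        prompt = "\n".join(context + [f"Solve: {sub}"])
        answer = toy_model(prompt)
        context.append(f"Solve: {sub}\nAnswer: {answer}")
    return answer

# Decompose "(2 + 3) + 4" into two easier subproblems, solved in sequence.
print(least_to_most(["2 + 3", "5 + 4"]))  # prints 9
```

With a real model, a first prompt would also ask the model to produce the decomposition itself; the key difference from plain chain-of-thought is that each step is a separate, easier query whose answer is explicitly carried forward.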

Summarizing Books With Human Feedback
To safely deploy powerful, general-purpose artificial intelligence in the future, we need to ensure that machine learning models act in accordance with human intentions. This challenge has become known as the alignment problem.
• 6:15

Supervising Strong Learners by Amplifying Weak Experts
Abstract: Many real world learning tasks involve complex or hard-to-specify objectives, and using an easier-to-specify proxy can lead to poor performance or misaligned behavior. One solution is to have humans provide a traini...
• 19:10

Measuring Progress on Scalable Oversight for Large Language Models
Abstract: Developing safe and useful general-purpose AI systems will require us to make progress on scalable oversight: the problem of supervising systems that potentially outperform us on most skills relevant to the task at ...
• 9:32

Is Power-Seeking AI an Existential Risk?
This report examines what I see as the core argument for concern about existential risk from misaligned artificial intelligence. I proceed in two stages. First, I lay out a backdrop picture that informs such concern. On this picture, intelligen...
• 3:21:02

Yudkowsky Contra Christiano on AI Takeoff Speeds
In 2008, thousands of blog readers - including yours truly, who had discovered the rationality community just a few months before - watched Robin Hanson debate Eliezer ...
• 1:02:21

Why AI Alignment Could Be Hard With Modern Deep Learning
Why would we program AI that wants to harm us? Because we might not know how to do otherwise. Source: https://www.cold-takes.com/w...
• 28:50

AGI Ruin: A List of Lethalities
I have several times failed to write up a well-organized list of reasons why AGI will kill you. People come in with different ideas about why AGI would be survivable, and want to hear different obviously key points addressed first. Some fractio...
• 1:01:34
