Algorithms for Walking, Running, Swimming, Flying, and Manipulation
© Russ Tedrake, 2024
Note: These are working notes used for a course being taught at MIT. They will be updated throughout the Spring 2024 semester. Lecture videos are available on YouTube.
Imitation learning, also known as "learning from demonstrations" (LfD), is the problem of learning a policy from a collection of demonstrations. For state-based feedback, these demonstrations take the form of a set of state-action sequences, $\left[ \bx[\cdot], \bu[\cdot]\right]$. For the richer class of output feedback, this takes the form of observation-action sequences, $\left[ \by[\cdot], \bu[\cdot]\right]$. Note that we do not require any explicit definition of a cost or reward function; in most cases we assume that the demonstrations are obtained from an optimal or near-optimal policy.
Broadly speaking, most approaches to imitation learning can be categorized as
either behavior cloning (BC) or inverse reinforcement learning (IRL).
Behavior cloning attempts to learn a policy directly from the data using supervised
learning. Inverse RL (aka inverse optimal control) attempts to learn a cost function
from the data, and then uses potentially more traditional optimal control approaches
to synthesize a policy for this cost function, in the hopes of generalizing
significantly beyond the demonstration data. See
I think it's fair to say that today, in 2024, behavior cloning is once again taking the robotics world by storm (especially in manipulation research), and it now seems like the shortest path to building "generalist" robot foundation-model-style policies. We'll devote most of this chapter to it.
Behavior cloning
Famously, Large Language Models (LLMs) are trained with behavior cloning (and
then fine-tuned to make them more aligned with human preferences
Is predicting actions fundamentally different from next-token prediction in
language? There are a few reasons why it might be. Actions are continuous and
high-dimensional, whereas language tokens are discrete. Our control systems get put
into the feedback loop with physics, and have to deal with stochasticity from the
environment that LLMs don't experience. Google DeepMind released the RT series of
"vision-language-action" (VLA) models (RT-1
Diffusion Policy and ALOHA seem to have been the watershed results for dexterous manipulation in robotics. Since then, the internet has become rich with videos of highly dexterous manipulation from all sorts of robots, ranging from very low-cost manipulators up to and including humanoid robots with dexterous hands.
In one sense, it might seem a little disappointing that, after we've spent so much time in these notes exploring the rich mathematical foundations of dynamics and control, behavior cloning from human teleop demonstrations can outperform our best methods for some class of problems (which are arguably more about understanding the world than about dynamics and control). But think of it this way: using supervised learning is an awfully clever way to explore the space of policy parameterizations, and has accelerated us into bigger questions about using cameras in the feedback loop, learning multitask/foundation models, and leveraging structure (such as 3D geometry / objectness) in our representations or not. The success of LLMs (and now multimodal models) is undeniable, and it would be a mistake to ignore the great new possibilities that have opened up. And I do think that there is something fundamental about imitation learning for manipulation: in many cases the "rules of the game" (e.g. the task specification) are not fully described by physics -- I've come to appreciate that humans bring an amazing amount of background knowledge / common sense to bear when they are performing even relatively simple manipulation tasks. I'm very confident that our knowledge of dynamics and control will help us penetrate the new vistas enabled by these high-capacity models.
Let me make one more important point. Sometimes people say that BC is
fundamentally limited because it can never outperform the human demonstrator. But
one of the early results from behavior cloning was that this limit is not
strictly true.
There is also a precedent for combining BC with other methods to improve beyond
the original demonstrations. For DeepMind's AlphaGo
In 2016, a few years after the start of the deep learning revolution,
My own journey with imitation learning started with a project led by Pete
Florence and Lucas Manuelli on using a particular form of self-supervised
learning for 3D geometries to train the policies
Writing feedback controllers which operate directly on the RGB camera images was something entirely new for me. We had been using RGB-D cameras to do some amount of visual state estimation / pose estimation before that, but were leaning pretty heavily on the depth channel and very explicit 3D reasoning/matching. But I remember around the time of our first imitation learning project, I asked the students "if you could only choose one, RGB or depth, which would you choose?" They chose RGB. There are many situations where a task is unclear or even ambiguous when looking only at a depth image. The ambiguities can often be resolved with the addition of RGB, and there are many depth cues in RGB that allow us and our visuomotor policies to be successful for 3D tasks even without an explicit depth sensor. Diffusion Policy and ACT, for instance, achieve their amazing performance by consuming RGB (no depth).
In my experience, control theorists have very satisfying answers to almost any
dynamics and control problem. But (with a few rare exceptions) they didn't do
computer vision. This was something entirely new. The sensor model -- the mapping
from, e.g. the state and parameters of a MultibodyPlant
to the output
image $y$ -- is potentially a full game-engine-quality renderer. Even though there
are lots of projects now on making differentiable renderers, these can only do so
much because the pixelation process is inherently very local/non-smooth. Going
from the image back into a manageable intermediate (latent) state representation,
$z$, started becoming viable with the rise of deep networks for perception.
Data/learning does feel fundamental here -- mapping from RGB into a
meaningful representation for control is more about the statistics of natural
scenes than about the model-based physics of propagating light.
This, for me, was the first main lesson I took from our imitation learning
work: closing feedback loops by directly consuming RGB at control rates is now
possible, and is incredibly powerful for robust performance in visually complex
settings. Imitation learning (and reinforcement learning) have so far enabled this
in a way that model-based control pipelines which require explicit state (or
belief) estimation do not. (Of course, techniques like teacher-student
distillation
One of the famous challenges in imitation learning is the problem of distribution shift. Imagine that you are training a policy for driving a car...
DAGGER
Teacher/Student.
scalable, reliable training, ..., can cope with multimodality, ...
The visuomotor policies that we'll study here should output low-level robot actions -- $\bu$ in the parlance of these notes. These need not be torque commands directly... in fact it's more typical for them to output a slightly higher-level command like joint velocity or end-effector velocity, which gets passed to a low-level controller. Note that there is also a now-large body of literature where people use LLMs or VLMs to determine a sequence of high-level actions, but assume that someone has authored or otherwise obtained a set of "skill libraries" that map the discrete high-level actions to control; while interesting, I would not call those approaches visuomotor policies and will not discuss them here.
In all cases, the input encoders (discussed next) map the recent history of observations into some latent representation, which then eventually gets mapped back into actions via the action decoder. It is quite useful to categorize the different visuomotor policy architectures based on the different choices that they make about the action decoder.
There is a series of work, now commonly referred to as VLA
(vision-language-action) architectures, which leverage the successful transformer architectures from language and vision by discretizing and tokenizing the robot
action space. Early examples include Decision Transformer
These tokenized-action architectures naturally learn a probability over next tokens, which deals very directly with the potential multimodality in the training data. But this comes at the cost of having discretized the action space. I don't worry much about the resolution of this discretization being limiting, but I'm a bit worried that the discretization destroys the natural inductive bias of the continuous space. (For instance, the end-effector position 0.1 is closer to 0.2 than to 0.5, but this information is completely discarded in the discretization.) Other people don't seem as concerned, and perhaps when we eventually have enough data it will all be in the noise.
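To make the concern about inductive bias concrete, here is a minimal sketch of one way an action could be tokenized: uniform binning of a scalar action. The bin count, the range, and the helper names `tokenize` / `detokenize` are illustrative assumptions, not the scheme used by any particular model discussed here.

```python
import numpy as np

def tokenize(u, low, high, num_bins):
    """Map continuous actions in [low, high] to integer tokens in [0, num_bins)."""
    frac = (np.clip(u, low, high) - low) / (high - low)
    return np.minimum((frac * num_bins).astype(int), num_bins - 1)

def detokenize(tokens, low, high, num_bins):
    """Map tokens back to the centers of their bins."""
    return low + (tokens + 0.5) * (high - low) / num_bins

# Hypothetical setup: a scalar end-effector coordinate in [0, 1], 256 bins.
low, high, num_bins = 0.0, 1.0, 256
u = np.array([0.1, 0.2, 0.5])
tokens = tokenize(u, low, high, num_bins)
u_rec = detokenize(tokens, low, high, num_bins)

# Resolution is not the issue: the round-trip error is at most half a bin.
max_err = np.max(np.abs(u - u_rec))

# But as classification targets (one-hot vectors), every pair of distinct
# tokens is equally far apart, so "0.1 is closer to 0.2 than to 0.5" is lost.
one_hot = np.eye(num_bins)[tokens]
d_near = np.linalg.norm(one_hot[0] - one_hot[1])  # tokens for 0.1 vs 0.2
d_far = np.linalg.norm(one_hot[0] - one_hot[2])   # tokens for 0.1 vs 0.5
```

Here `d_near` and `d_far` come out identical, which is the sense in which the metric structure of the continuous space is invisible to a loss defined over token categories.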
Behavior Transformers (BeT)
There were a number of attempts to handle multimodality in a more natively
continuous setting. Implicit BC
Both the ACT paper and the Diffusion Policy paper strongly emphasized another detail about the output encoding: rather than predicting a single (current) action to take, these models predict an entire sequence of future actions, and then operate in a fashion similar to model-predictive control.
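The receding-horizon pattern can be sketched in a few lines. This is only an illustration of the general idea; it does not reproduce the specific execution scheme from either paper. The names `predict_actions`, `step`, and `num_execute` are hypothetical, and the "policy" is a hand-written stand-in for a learned model.

```python
import numpy as np

def run_receding_horizon(predict_actions, step, obs, num_steps, num_execute):
    """Predict a sequence of actions, execute only the first few, then re-predict."""
    executed = []
    while len(executed) < num_steps:
        plan = predict_actions(obs)       # array of shape (H_u, m)
        for u in plan[:num_execute]:
            obs = step(obs, u)            # apply one action, get the next observation
            executed.append(u)
            if len(executed) == num_steps:
                break
    return np.array(executed), obs

# Toy stand-in: scalar integrator x[n+1] = x[n] + u[n], observed directly.
def step(x, u):
    return x + u[0]

# Hypothetical "policy": the open-loop action sequence that the feedback law
# u = -0.5 x would generate over the next H_u = 8 steps.
def predict_actions(x, horizon=8):
    return np.array([[-0.5 * 0.5**k * x] for k in range(horizon)])

x0 = 1.0
actions, x_final = run_receding_horizon(predict_actions, step, x0,
                                        num_steps=10, num_execute=4)
```

In this noise-free toy problem the result is the same for any choice of `num_execute`; the choice starts to matter once disturbances make the later actions in each predicted sequence stale.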
Although researchers are rapidly adopting additional input modalities, by far
the most common input modalities are the robot proprioception (e.g. joint
sensors), which can be passed into the model directly, and image observations
which need to be encoded from raw RGB into some intermediate representation.
Although there is a torrent of literature on this, there are a few choices that
have clearly emerged as the standards: ResNet and ViT. For instance, the original
Diffusion Policy paper used a ResNet-18 (without pretraining) with small
modifications, e.g. to maintain spatial information
Language-conditioned multitask policies...
One particularly successful form of behavior cloning for visuomotor
policies with continuous action spaces is the Diffusion Policy
"Denoising Diffusion" models are an approach to generative AI, made famous by
their ability to generate high-resolution photorealistic images. Inspired by the
"manifold hypothesis" (i.e. the idea that realistic images live on a
low-dimensional manifold in pixel space), the intuition behind denoising diffusion
is that we train a model by adding noise to samples drawn from the data
distribution, then learn to predict the noise from the noisy images, in order to
"denoise" random images back on to the manifold. While image generation made these
models famous, they have proven to be highly capable in generating samples from a
wide variety of high-dimensional continuous distributions, even distributions that
are conditioned on high-dimensional inputs. I recommend this blog post and
Let's consider samples $\bu \in \Re^m$ drawn from a training dataset $\mathcal{D}.$ Diffusion models are trained to estimate a noise vector ${\bf \epsilon} \in \Re^m$ to minimize the loss function $$\ell(\theta) = \mathbb{E}_{\bu, {\bf \epsilon}, \sigma} || {\bf f}_\theta(\bu + \sigma {\bf \epsilon}, \sigma ) - {\bf \epsilon} ||^2,$$ where $\theta$ is the parameter vector, and ${\bf f}_\theta$ is typically some high-capacity neural network. In practice, training is done by randomly sampling $\bu$ from $\mathcal{D}$, ${\bf \epsilon}$ from $\mathcal{N}({\bf 0}_m, {\bf I}_{m \times m})$, and $\sigma$ uniformly from a finite set of positive numbers denoted as $\{\sigma_k\}_{k=0}^K,$ where we have $\sigma_k > \sigma_{k-1}.$
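As a minimal sketch of the sampling procedure just described, the loss can be estimated by Monte Carlo as follows. The function name `diffusion_loss`, the batch size, the noise schedule, and the toy single-point dataset are all assumptions made for illustration; a real implementation would differentiate this quantity with respect to $\theta$ in an autodiff framework.

```python
import numpy as np

def diffusion_loss(denoiser, data, sigmas, rng, batch_size=256):
    """Monte Carlo estimate of E || f(u + sigma * eps, sigma) - eps ||^2."""
    m = data.shape[1]
    u = data[rng.integers(0, len(data), size=batch_size)]  # u sampled from D
    eps = rng.standard_normal((batch_size, m))             # eps ~ N(0, I)
    sigma = rng.choice(sigmas, size=(batch_size, 1))       # uniform over {sigma_k}
    pred = denoiser(u + sigma * eps, sigma)
    return np.mean(np.sum((pred - eps) ** 2, axis=1))

rng = np.random.default_rng(0)
sigmas = np.geomspace(0.01, 10.0, 20)   # illustrative increasing positive schedule

# Toy dataset: every sample is the same point u_star, so the exact denoiser is
# known in closed form: f(u, sigma) = (u - u_star) / sigma.
u_star = np.array([0.3, -1.2])
data = np.tile(u_star, (100, 1))

loss_exact = diffusion_loss(lambda u, s: (u - u_star) / s, data, sigmas, rng)
# A denoiser that always predicts zero pays E||eps||^2, which is m = 2 in expectation.
loss_zero = diffusion_loss(lambda u, s: np.zeros_like(u), data, sigmas, rng)
```

The single-point dataset is chosen only because it makes the optimal denoiser available without training; it gives a quick sanity check on the loss implementation.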
To sample a new output from the model, the denoising diffusion
implicit models (DDIM) sampler
Diffusion models have a slightly convoluted history. The term
"diffusion" came from a paper
It is straightforward to condition the generative model on an exogenous input, by simply adding an additional signal, $\by$, to the denoiser: ${\bf f}_\theta(\bu, \sigma, \by).$
Behavior cloning is perhaps the simplest form of imitation learning -- it simply attempts to learn a policy using supervised learning to match expert demonstrations. While it is tempting to learn deterministic output-feedback policies (maps from history of observations to actions), one quickly finds that human demonstrations are typically not unique. Perhaps this is not surprising, as we know that optimal feedback policies in general are not unique! To address this non-uniqueness / multi-modality in the human demonstrations, it's well understood that behavior cloning benefits from learning a conditional distribution over actions.
Diffusion Policy is the natural application of (conditional) denoising
diffusion models to learning these policies. It was inspired, in
particular, by the modeling choices in Diffuser
Let me be clear, it almost certainly does not make sense to use a diffusion policy to represent a linear (output) feedback control policy. But it can be helpful to understand what the Diffusion Policy looks like in this extremely simplified case.
Consider the case where we have the standard linear-Gaussian dynamical system: \begin{gather*} \bx[n+1] = \bA\bx[n] + \bB\bu[n] + \bw[n], \\ \by[n] = \bC\bx[n] + \bD\bu[n] + \bv[n], \\ \bw[n] \sim \mathcal{N}({\bf 0}, {\bf \Sigma}_w), \quad \bv[n] \sim \mathcal{N}({\bf 0}, {\bf \Sigma}_v). \end{gather*} Imagine that we create a dataset by rolling out trajectory demonstrations using a linear (output) feedback policy -- it could be, for instance, the optimal policy from LQG design. The question is: what (exactly) does the diffusion policy learn?
Let's start with the state-feedback case, where we generate roll-outs using a controller, $\bu = - \bK\bx$, given a Gaussian distribution of initial conditions and Gaussian process noise. In this case, the training loss function reduces to $$\ell(\theta) = \mathbb{E}_{\bx, {\bf \epsilon}, \sigma} || {\bf f}_\theta(-\bK\bx + \sigma {\bf \epsilon}, \sigma, \bx) - {\bf \epsilon} ||^2,$$ where the expectation in $\bx$ is over the stationary distribution of the closed-loop system. In this case, we don't need a neural network; take ${\bf f}_\theta$ to be a simple function. In particular, we can achieve zero loss using the denoiser given by $${\bf f}_\theta(\bu, \sigma, \bx) = \frac{1}{\sigma}\left[\bu + \bK\bx\right].$$ At evaluation time, the sampling iterations, $$\bu_{k-1} = \bu_k + \frac{\sigma_{k-1} - \sigma_k}{\sigma_k}\left[\bu_k + \bK\bx\right],$$ will converge on $\bu_0 = -\bK\bx.$ (Clearly $\bu_k = -\bK\bx$ is a fixed point of the iteration, and the $\frac{\sigma_{k-1} - \sigma_k}{\sigma_k}$ term is like the step-size of gradient descent.)
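A short numerical check of this linear case may be helpful. The sketch below uses an arbitrary random gain $\bK$ and an illustrative geometric noise schedule (both assumptions); it verifies that the closed-form denoiser recovers ${\bf \epsilon}$ exactly, and then runs the sampling iteration. Writing ${\bf e}_k = \bu_k + \bK\bx$, each iteration gives ${\bf e}_{k-1} = \frac{\sigma_{k-1}}{\sigma_k}{\bf e}_k$, so the final error is the initial error scaled by $\sigma_0 / \sigma_K$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 4, 2
K = rng.standard_normal((m, n))   # arbitrary illustrative gain, not from any design
x = rng.standard_normal(n)        # the state we condition on

def denoiser(u, sigma, x):
    """Closed-form denoiser for demonstrations generated by u = -K x."""
    return (u + K @ x) / sigma

# Illustrative schedule: sigma_0 < sigma_1 < ... < sigma_K, all positive.
sigmas = np.geomspace(1e-3, 10.0, 21)

# Zero training loss: the denoiser recovers the injected noise exactly.
eps = rng.standard_normal(m)
sigma = sigmas[5]
residual = denoiser(-K @ x + sigma * eps, sigma, x) - eps

# Sampling: start from noise at the largest sigma and iterate down to sigma_0.
u_init = sigmas[-1] * rng.standard_normal(m)
u = u_init.copy()
for k in range(len(sigmas) - 1, 0, -1):
    u = u + (sigmas[k - 1] - sigmas[k]) * denoiser(u, sigmas[k], x)
u0 = u

# The error telescopes: (u0 + K x) = (sigma_0 / sigma_K) * (u_init + K x).
final_error = u0 + K @ x
predicted_error = (sigmas[0] / sigmas[-1]) * (u_init + K @ x)
```

With this schedule the ratio $\sigma_0/\sigma_K$ is $10^{-4}$, so the sample lands very close to $-\bK\bx$ regardless of the initial noise.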
Returning to output feedback, we know that the optimal linear output-feedback policy from LQG is typically written in a state-space form, e.g.: \begin{gather*} \hat{\bx}[n+1] = \bA\hat{\bx}[n] + \bB\bu[n] + {\bf L}\left(\by[n] - \bC\hat{\bx}[n] - \bD\bu[n]\right), \\ \bu[n] = -\bK\hat{\bx}[n]. \end{gather*} But the Diffusion Policy architecture (with $H_u=1$) is typically formulated as learning a denoiser conditioned on a finite history of actions and observations, \begin{gather*}{\bf f}_\theta(\bu[n], \sigma, \bar{\by}_{H_y}, \bar{\bu}_{H_y}), \\ \bar{\by}_{H_y} = \left[\by[n-1],... ,\by[n-H_y]\right], \\ \bar{\bu}_{H_y} = \left[\bu[n-1],... ,\bu[n-H_y]\right].\end{gather*} So the more direct analogy would be to generating demonstrations from an autoregressive linear policy, e.g. $$\bu[n] = -\bK \begin{bmatrix} \bar{\by}_{H_y} \\ \bar{\bu}_{H_y}\end{bmatrix}.$$ This autoregressive form is commonly used in disturbance-based output feedback, and it may be helpful to think of it e.g. as an "unrolled" (truncated) Kalman filter. Again, we can achieve zero loss with a denoiser of the form $${\bf f}_\theta(\bu, \sigma, \bar{\by}_{H_y}, \bar{\bu}_{H_y}) = \frac{1}{\sigma}\left[\bu + \bK \begin{bmatrix} \bar{\by}_{H_y} \\ \bar{\bu}_{H_y}\end{bmatrix}\right].$$
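To see the "unrolled observer" interpretation concretely, note that the estimator update can be rewritten as $\hat{\bx}[n+1] = {\bf F}\hat{\bx}[n] + {\bf G}\bu[n] + {\bf L}\by[n]$ with ${\bf F} = \bA - {\bf L}\bC$ and ${\bf G} = \bB - {\bf L}\bD$. Unrolling $H_y$ steps gives an autoregressive gain on the history, plus a remainder $-\bK{\bf F}^{H_y}\hat{\bx}[n-H_y]$ that decays when ${\bf F}$ is stable. The sketch below checks this identity numerically. The double-integrator plant and the hand-picked gains `K` and `L` are assumptions for illustration (they are stabilizing but not LQG-optimal), and `K_ar` is a name introduced here for the autoregressive gain.

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[1.0, 0.1], [0.0, 1.0]])   # discrete-time double integrator
B = np.array([[0.005], [0.1]])
C = np.array([[1.0, 0.0]])
D = np.array([[0.0]])
K = np.array([[1.0, 1.5]])               # hand-picked feedback gain
L = np.array([[0.5], [0.5]])             # hand-picked observer gain

# Roll out the observer-based controller on the noisy plant.
N, H = 200, 40
x, xhat = rng.standard_normal(2), np.zeros(2)
xhats, ys, us = [], [], []
for n in range(N):
    u = -K @ xhat
    y = C @ x + D @ u + 0.01 * rng.standard_normal(1)
    xhats.append(xhat); ys.append(y); us.append(u)
    x = A @ x + B @ u + 0.01 * rng.standard_normal(2)
    xhat = A @ xhat + B @ u + L @ (y - C @ xhat - D @ u)

# Unrolled observer: xhat[n] = F^H xhat[n-H] + sum_j F^(j-1) (L y[n-j] + G u[n-j]).
F, G = A - L @ C, B - L @ D
powers = [np.linalg.matrix_power(F, j) for j in range(H)]
K_ar = K @ np.hstack([P @ L for P in powers] + [P @ G for P in powers])

# History ordered as [y[n-1], ..., y[n-H], u[n-1], ..., u[n-H]].
n = N - 1
hist = np.concatenate([ys[n - j] for j in range(1, H + 1)]
                      + [us[n - j] for j in range(1, H + 1)])
u_ar = -K_ar @ hist                       # truncated autoregressive policy
remainder = -K @ np.linalg.matrix_power(F, H) @ xhats[n - H]
```

The state-space action `us[n]` equals `u_ar + remainder` exactly, so the autoregressive policy matches the observer-based one up to a term that shrinks geometrically in the history length.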
This shapes my current mental model for what Diffusion Policy is learning, even in complicated manipulation settings: it may be helpful to think about the history of actions and observations being compressed into a (task-relevant) belief-state representation, like we would have in an (unrolled, truncated) Kalman filter, that is sufficient for predicting actions.
Another important aspect of the Diffusion Policy architecture is that it is predicting not just the instantaneous action, but a sequence of future actions. Even in the linear policy setting, we can see that this puts different pressure on the learned representations...
If control directly from pixels was the first capability unlocked by imitation
learning, I would say that large-scale multitask decision making is the second.
Multitask learning has a long history in the machine learning community
In my view, this has potentially profound implications for how we think about control. Our basic control definitions start with, e.g. we have a state $\bx$, inputs $\bu$, outputs $\by.$ The discussion on output feedback got us thinking a little about state representations for control -- for instance a belief state is a sufficient (but not necessary) state because it is a sufficient statistic of the history of actions and observations. But multitask in the imitation learning setting changes things. In the simple case, we'll say that our inputs $\bu$, and outputs $\by$ are the same across all of the tasks. But it may well be that the underlying state space is not. (I admit that philosophically there is a state of the entire universe which is the same across tasks, but I mean the more tractable representations of state that we've been using through the notes.) What does it mean to learn a state representation for control across tasks where even the dimension/cardinality of the state space can be different? Even our catch-all definition of belief state breaks down in this case.
Are there tractable ways to describe distributions over tasks that are amenable to our strongest theoretical tools, but still relevant for the complexity and diversity of the real world? When we talked about stochastic optimal control, we gave examples where taking an average over many possible rollouts can actually simplify the loss landscape, avoiding some local minima and making optimization easier. Can multitask control formulations have a similar effect?
Going further, how exactly is it that solving/learning one task can potentially help us in solving/learning another? This brings up basic questions about designing a curriculum for our control systems. Is it possible for us to soar to higher and higher heights if we sequence our control problem instances correctly?
When I start using phrases like "learning" and "curriculum", then it becomes very natural to think in terms of our natural intelligence. How did we learn to walk? To play tennis? But let's remember that these analogies only go so far. For me, the GPT series of models are clearly unlike any single natural intelligence; they are more like a collective intelligence of the entire species (though still certainly deficient in some metrics). In the age of foundation models, it may not be the case that every robot needs to learn to use a toaster; the dream of "fleet learning" is that one robot will learn how to use a toaster and then they will all have learned.
This brings up fundamental questions about the learning algorithms, about data efficiency (and privacy). But it also challenges our theories of dynamics and control. For instance, there are open questions about how to balance being a generalist and using only shared data vs being a specialist. Certainly if a particular robot is solving problems in a particular warehouse, then while the statistics of tasks across the world may help form robust representations, this robot can almost certainly perform better if it narrows and specializes the policies (and world models) to exploit the distribution of tasks in the warehouse.
A particular version of this question appears in the context of
"cross-embodiment" data and models. Right now, robot data with action labels is
scarce (compared with online data for text and images). This, in part, has
motivated the use of datasets which combine data from many robots/platforms
The fact that many of these fundamental questions are now being asked makes this a simply amazing time to be a roboticist. However, the pace of new innovations is so fast that often researchers feel pressure to race to publication before having done proper rigorous theoretical or empirical work. We are building tall towers but with somewhat shaky foundations. I firmly believe that the tenets of dynamics and control (amongst other rigorous technical tools) have a lot to contribute to understanding and continuing to push the field forward, and that some of the maturity with which we can understand these simpler (but not very simple!) problems can serve as a model for what we should expect about our understanding of the even more complex ones.