Aviv’s Substack

Jan 6, 2023

I think the opposite - by aligning reviewers with real world progress, we will *reduce noise*. I can accept my paper being rejected from a top tier venue for not showing real world value - this is the case in many other fields. I am much less happy with the current state, where a paper gets killed for not showing results on some reviewer’s favorite-yet-arbitrary game.

Expand full comment

Jan 5, 2023Edited

I disagree with this perspective. I think the claim

> with the current knowledge in the field, we believe there are concrete benefits to reap in deployable RL

is probably false, and at minimum need more justification. My expectation is that 95% of real-world tasks are best solved by paying humans to complete the task, recording their actions, and doing imitation learning on the resulting dataset. I would predict that almost every challenge that is proposed will be solved this way for the forseeable future.

The only reason to study or care about RL is to make progress towards the RL-first dream.

Expand full comment

Reply (2)

> Imitation learning

Definitely, we refer to it in point 2 - “acquiring labelled demonstrations can significantly speed up learning”. This is exactly the point - if you try to solve a real problem, you’ll make the most of existing (and affordable) technology, and only then develop new technology to push the boundary. A lot of current RL research tries to re-solve solved problems using RL.

> 95% of real-world tasks are best solved by paying humans to complete the task

> RL-first

What kind of real-world tasks do you have in mind?

We are definitely not against RL first - it's an exciting direction to study.

Still, there are real problems that can be solved without RL first technology. Some examples, which I'm personally familiar with (but there are many more!), are autonomous driving, logistics, computer networks, and computer security. The point is problems for which *humans are not very good at exemplifying solutions*, and RL can potentially bring much value - outperform humans operators.

Expand full comment

Jan 5, 2023Edited

I see, I think we are in agreement! I think I mostly don't see many examples of

> [real-world] problems for which humans are not very good at exemplifying solutions [but RL could potentially solve]

Of the ones you mentioned, autonomous driving is the only one I am familiar with, and in that domain I think humans *are* very good at exemplifying solutions! Collecting a giant dataset of human driving data and imitating it seems way more plausible to be an acceptable immediate solution to the problem, vs running PPO on a car (which would cause many catastrophic crashes *and* not even do a very good job learning).

The only places I've seen our (very slow, very inefficeint) RL algorithms discover solutions better than the best humans are in closed domains where the full rules can be easily & cheaply encoded into a computer -- chess, Dota, etc. -- such that an enormous number of samples can be generated. I think almost no "real world" problems have this property.

(Some problems can be "hacked into" this setting by constructing a suffienctly robust simulator, training in the simulator, and transferring into the real world; but I don't think this a promising or scalable strategy because it is super bottlenecked by our ability manually engineer simulators. If that's what you have in mind, I think that should be considered a different research area called "simulator design", not really "RL".)

Expand full comment

oh, of course no one will run PPO (or any other onilne learning agent) on the car. And of course, human labelled data is definitely part of any autonomous driving program. But imitation learning is not enough (think safety), and there are many other challenges - how to drive in a way that is comfortable to the passengers, how to deal with noisy or out of distribution scenarios, and how to do all this in an economically feasible way are some of them. RL (in the sense of solving large POMDPs) can definitely help, but the problems above don't fit the simple "RL benchmark" protocol.

Expand full comment

Jan 5, 2023Edited

> RL (in the sense of solving large POMDPs) can definitely help

This is again the part that I don't really buy. I don't see how any current RL algorithm can help on any of these axes. If I were solving them, I would do it purely with imitation/supervision.

For example, if I wanted passengers to be comfortable, I would have them rate each ride and then filter out the ones with low comfort ratings from the dataset, or set up a conditional imitation model which predicts actions conditioned on comfort, and at deployment time condition on high comfort.

Again, I for sure agree that RL *in principle* could help, but I don't see any current RL algorithms being useful at all.

Maybe you could be a bit more concrete about what examples you have in mind? What is an example of a setting, some restrictions, how you would turn it into a MDP/POMDP, what algorithm you would use, etc.

Expand full comment

https://www.youtube.com/watch?v=ODSJsviD_SU&ab_channel=Tesla

Jan 6, 2023

Here's a publically available resource to look at. Tesla's 2022 AI day part on FSD planning goes into quite a bit of detail into their problem setup. You can appreciate from it the challenge is setting up a good "POMDP" for the problem, and using different froms of learning and planning to solve it quickly.

Expand full comment

Abhishek Gupta

The only reason to study or care about RL is to make progress towards the RL-first dream. --> this seems like an unsubstantiated claim no? deployable RL in principle allows for continuously improving systems on deployment, which is certainly likely to improve over just imitation learning. As in, RL systems allow you to start from the performance of IL or w/e and then allow you to overfit to the problem at hand with experience the agent collects itself. Now of course we have to account for the cost of rewards and such, but that considered, it allows for adaptive agents that can autonomously specialize to every deployment environment. This seems like a great property for deployment systems to me.

The IL perspective is not unreasonable but I think this might be ignoring the cost of data collection. IL data is expensive, RL data may be relatively cheaper (in terms of human effort) and so production systems that can get more improvement per unit cost of data collection are likely to be more scalable. I don't know if this means we need to be RL-first or purely AGI driven. Now I think the practicalities of RL at it's current state make IL more likely to succeed, but since we are talking about how to do better *research*, it feels like RL methods have the potential to provide improvement at a cheaper data cost than imitation learning.

Expand full comment

Jan 5, 2023Edited

> deployable RL in principle allows for continuously improving systems on deployment

...precisely *is* the dream! Yes, of course in principle RL is better. But current methods are simply inadequate. In practice, if I were to deploy, say, a robot arm to fold laundry, I would absolutely not want that arm running PPO!

I suppose it's fair to say that we don't know for sure that current methods are terrible, and it might be useful to construct challenges/benchmarks that let us verify that fact. But I don't think pushing on those challenges is a good direction for the field, instead it makes more sense to me to stick with fully toy MDPs and continue to focus research on identifying good algorithms to solve them.

To summarize: in this essay, in particular point 2, it is suggested that we put effort into using our current MDP solvers (which are terrible) to solve real-world problems. I think that's misguided, and we should either put effort into improving MDP solvers (for the RL-first dream!), or solving practical tasks using current techniques (which inevitably means *avoiding* our current, terrible MDP solvers).

Expand full comment

Jonathan Somer