I think the point you bring up about publishing negative results is priceless. That kind of sincerity is needed in general, but for RL it is a necessary condition for climbing the slope of enlightenment. So many efforts result in the agent learning nothing. Not all, but many of those could be interpreted and reported, and that would help all practitioners in the field.
In our case (self-driving engineering systems: https://balazskegl.medium.com/building-autopilots-for-engineering-systems-using-ai-86a4f312c1f2) there are three motivating constraints:
1. Iterated offline (no online access to the systems, but we can learn-and-deploy a small number of times, so it's not pure offline)
2. Micro-data: our systems are physical and don't get faster with time; our main bottleneck is extremely small sample sizes.
3. Safety: we cannot "lose" while learning.
On the other hand, our systems are already nicely "feature-extracted" and the dimensionality is relatively low (hundreds at most), so all the representation learning research in RL (Atari, robots with cameras) is irrelevant for us. This also means that we can concentrate on small benchmarks where we can run a lot of experiments (to experimentally validate design choices, but also to report statistically valid results).
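To make the protocol concrete, here is a minimal sketch of what "iterated offline" means in practice: a handful of learn-and-deploy rounds, with everything in between happening offline. The toy system, the learner, and all names below are hypothetical and only illustrate the loop, not our actual methods.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical, low-dimensional "feature-extracted" system standing in for a
# real physical one; its dynamics and reward are made up for illustration.
def deploy(policy, horizon=50):
    """One real deployment: run the current policy and log transitions."""
    obs, logged = rng.normal(size=4), []
    for _ in range(horizon):
        act = float(np.tanh(policy @ obs))
        nxt = 0.9 * obs + 0.1 * act + 0.01 * rng.normal(size=4)
        logged.append((obs, act, -float(nxt @ nxt), nxt))  # reward: stay near 0
        obs = nxt
    return logged

def fit_policy(dataset):
    """Offline step: fit a crude linear dynamics model to the logged data,
    then pick the policy that does best inside that model (random search)."""
    X = np.array([np.append(o, a) for (o, a, _, _) in dataset])
    Y = np.array([n for (_, _, _, n) in dataset])
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)

    def score(p):  # evaluate a candidate policy only inside the learned model
        obs, total = np.ones(4), 0.0
        for _ in range(50):
            obs = np.append(obs, np.tanh(p @ obs)) @ W
            total -= float(obs @ obs)
        return total

    return max((rng.normal(size=4) for _ in range(200)), key=score)

# Micro-data, iterated offline: only a few learn-and-deploy rounds in total.
dataset, policy = [], rng.normal(size=4)
for _ in range(3):
    dataset += deploy(policy)     # one (safety-vetted) run on the real system
    policy = fit_policy(dataset)  # all learning happens offline in between
```

The safety constraint means each deployed policy is vetted before it touches the real system; the sketch only conveys the data flow.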
So the only remark I have is that sample complexity is important; otherwise I agree with everything.
The main question to me: what would you propose we do? We have a lot of fun internally, learning a ton on CartPole and Acrobot (and holding the SOTA on those), but current reviewing practices are extremely hostile to our niche approach (small dimensions, small systems, small data, but rigorous comparisons). Also, to run proper challenges, algorithms should be evaluated on a simulator, which requires computational power and probably a third party that has no stake in winning (cf. ImageNet). Who would that be?
This is a great, thought-provoking post. I particularly like that you highlighted these two different views of RL research (RL-first vs. deployable RL). It seems to me that more explicit acknowledgment of these two views would help the RL community progress, particularly with respect to how papers are reviewed. Reviewers and authors implicitly take on one of these two views, and when they differ, a paper may be judged unfairly. For instance, RL-first reviewers may prefer tabula rasa learning and look down on the use of domain knowledge, whereas deployable-RL reviewers may want to judge an assumption of domain knowledge based on how reasonable or broadly applicable it is (e.g., assuming an imperfect domain simulator is a mild assumption for many applications). Explicit acknowledgment of which view a paper is written for could make it easier to decide what criteria to apply to it.
Nice. We are working on introducing a novel challenge to the RL community, related to agroecology. Hopefully it will be released this spring :)
A very good piece, thanks for sharing.
I disagree with "Criticize others' research" especially "..ask how a paper gets the field closer to real-world impact. If you are a senior reviewer or an area chair - consider instructing your reviewers to judge papers differently..." This is clearly not a good idea as it makes reviewing process even more noisy, e..g reviewers can simply ask for unrealistic use cases or use this to kill a paper if they don't like the paper.
I think the opposite - by aligning reviewers with real-world progress, we will *reduce noise*. I can accept my paper being rejected from a top-tier venue for not showing real-world value - this is the case in many other fields. I am much less happy with the current state, where a paper gets killed for not showing results on some reviewer’s favorite-yet-arbitrary game.
I disagree with this perspective. I think the claim
> with the current knowledge in the field, we believe there are concrete benefits to reap in deployable RL
is probably false, and at minimum needs more justification. My expectation is that 95% of real-world tasks are best solved by paying humans to complete the task, recording their actions, and doing imitation learning on the resulting dataset. I would predict that almost every challenge that is proposed will be solved this way for the foreseeable future.
The only reason to study or care about RL is to make progress towards the RL-first dream.
> Imitation learning
Definitely, we refer to it in point 2 - “acquiring labelled demonstrations can significantly speed up learning”. This is exactly the point - if you try to solve a real problem, you’ll make the most of existing (and affordable) technology, and only then develop new technology to push the boundary. A lot of current RL research tries to re-solve solved problems using RL.
> 95% of real-world tasks are best solved by paying humans to complete the task
> RL-first
What kind of real-world tasks do you have in mind?
We are definitely not against RL-first - it's an exciting direction to study.
Still, there are real problems that can be solved without RL-first technology. Some examples I'm personally familiar with (but there are many more!) are autonomous driving, logistics, computer networks, and computer security. The point is problems for which *humans are not very good at exemplifying solutions*, and where RL can potentially bring much value by outperforming human operators.
I see - I think we are in agreement! It's just that I don't see many examples of
> [real-world] problems for which humans are not very good at exemplifying solutions [but RL could potentially solve]
Of the ones you mentioned, autonomous driving is the only one I am familiar with, and in that domain I think humans *are* very good at exemplifying solutions! Collecting a giant dataset of human driving data and imitating it seems far more plausible as an acceptable immediate solution to the problem than running PPO on a car (which would cause many catastrophic crashes *and* not even do a very good job of learning).
The only places I've seen our (very slow, very inefficient) RL algorithms discover solutions better than the best humans are closed domains where the full rules can be easily & cheaply encoded into a computer -- chess, Dota, etc. -- such that an enormous number of samples can be generated. I think almost no "real world" problems have this property.
(Some problems can be "hacked into" this setting by constructing a sufficiently robust simulator, training in the simulator, and transferring to the real world; but I don't think this is a promising or scalable strategy, because it is super bottlenecked by our ability to manually engineer simulators. If that's what you have in mind, I think that should be considered a different research area called "simulator design", not really "RL".)
Oh, of course no one will run PPO (or any other online learning agent) on the car. And of course, human-labelled data is definitely part of any autonomous driving program. But imitation learning is not enough (think safety), and there are many other challenges - how to drive in a way that is comfortable for the passengers, how to deal with noisy or out-of-distribution scenarios, and how to do all this in an economically feasible way, to name a few. RL (in the sense of solving large POMDPs) can definitely help, but the problems above don't fit the simple "RL benchmark" protocol.
> RL (in the sense of solving large POMDPs) can definitely help
This is again the part that I don't really buy. I don't see how any current RL algorithm can help on any of these axes. If I were solving them, I would do it purely with imitation/supervision.
For example, if I wanted passengers to be comfortable, I would have them rate each ride and then filter out the ones with low comfort ratings from the dataset, or set up a conditional imitation model which predicts actions conditioned on comfort, and at deployment time condition on high comfort.
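Something like the following sketch is what I have in mind for the second option - a comfort-conditioned imitation policy. All shapes and names are hypothetical, and the rating is assumed to be normalized to [0, 1]:

```python
import torch
import torch.nn as nn

# Hypothetical shapes: 'obs' is whatever the driving stack feeds the planner,
# 'action' is its output, and 'comfort' is the per-ride rating in [0, 1].
class ComfortConditionedPolicy(nn.Module):
    def __init__(self, obs_dim=32, act_dim=2, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + 1, hidden), nn.ReLU(),  # +1 for the rating
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim),
        )

    def forward(self, obs, comfort):
        return self.net(torch.cat([obs, comfort], dim=-1))

policy = ComfortConditionedPolicy()
optimizer = torch.optim.Adam(policy.parameters(), lr=3e-4)

# Training: plain behavioral cloning, except the model also sees the rating.
def bc_step(obs, expert_actions, comfort):
    loss = ((policy(obs, comfort) - expert_actions) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Deployment: same model, with the comfort input pinned to the best rating.
def act(obs):
    with torch.no_grad():
        return policy(obs, torch.ones(*obs.shape[:-1], 1))
```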
Again, I for sure agree that RL *in principle* could help, but I don't see any current RL algorithms being useful at all.
Maybe you could be a bit more concrete about the examples you have in mind? What is an example of a setting, some restrictions, how you would turn it into an MDP/POMDP, what algorithm you would use, etc.?
Here's a publicly available resource to look at. The FSD planning part of Tesla's 2022 AI Day goes into quite a bit of detail about their problem setup. You can appreciate from it that the challenge is setting up a good "POMDP" for the problem, and using different forms of learning and planning to solve it quickly.
https://www.youtube.com/watch?v=ODSJsviD_SU&ab_channel=Tesla
> The only reason to study or care about RL is to make progress towards the RL-first dream.
This seems like an unsubstantiated claim, no? Deployable RL in principle allows for continuously improving systems on deployment, which is likely to improve over just imitation learning. That is, RL systems allow you to start from the performance of IL (or whatever else) and then overfit to the problem at hand with experience the agent collects itself. Of course we have to account for the cost of rewards and such, but with that considered, it allows for adaptive agents that can autonomously specialize to every deployment environment. That seems like a great property for deployed systems to me.
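For concreteness, the pattern described here (start from an imitation-learned policy, then keep improving it from the agent's own deployment experience) could look roughly like the sketch below. Everything is hypothetical: toy sizes, placeholder names, and an advantage-weighted regression style update standing in for whatever RL fine-tuning one actually trusts in production.

```python
import torch
import torch.nn as nn

# Toy sizes, purely illustrative.
policy = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 4))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-4)

# Phase 1: imitation learning on logged human data gives the starting point.
def imitation_step(obs, expert_actions):
    loss = ((policy(obs) - expert_actions) ** 2).mean()
    optimizer.zero_grad(); loss.backward(); optimizer.step()

# Phase 2: after deployment, keep improving from the agent's own experience.
# An advantage-weighted regression style update is used only as a placeholder
# for whatever offline/online RL fine-tuning one trusts enough to run.
def rl_finetune_step(obs, actions, advantages):
    weights = torch.clamp(torch.exp(advantages), max=20.0)  # favor better rollouts
    per_sample = ((policy(obs) - actions) ** 2).mean(dim=-1)
    loss = (weights * per_sample).mean()
    optimizer.zero_grad(); loss.backward(); optimizer.step()
```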
The IL perspective is not unreasonable, but I think it might be ignoring the cost of data collection. IL data is expensive; RL data may be relatively cheaper (in terms of human effort), so production systems that can get more improvement per unit cost of data collection are likely to be more scalable. I don't know whether this means we need to be RL-first or purely AGI-driven. I do think the practicalities of RL in its current state make IL more likely to succeed, but since we are talking about how to do better *research*, it feels like RL methods have the potential to provide improvement at a cheaper data cost than imitation learning.
> deployable RL in principle allows for continuously improving systems on deployment
...precisely *is* the dream! Yes, of course in principle RL is better. But current methods are simply inadequate. In practice, if I were to deploy, say, a robot arm to fold laundry, I would absolutely not want that arm running PPO!
I suppose it's fair to say that we don't know for sure that current methods are terrible, and it might be useful to construct challenges/benchmarks that let us settle that question. But I don't think pushing on those challenges is a good direction for the field; instead, it makes more sense to me to stick with fully toy MDPs and continue to focus research on identifying good algorithms to solve them.
To summarize: in this essay, in particular point 2, it is suggested that we put effort into using our current MDP solvers (which are terrible) to solve real-world problems. I think that's misguided, and we should either put effort into improving MDP solvers (for the RL-first dream!), or solving practical tasks using current techniques (which inevitably means *avoiding* our current, terrible MDP solvers).
Two points come to mind:
1. Can we create a challenge where the evaluated model actually does "something" in real life - an API that allows your model to control some real-life entity? You could grant everyone access to the data from other models' runs, and maintain a queue that people can submit their models to for evaluation (a rough sketch of what such an API could look like follows after this list).
2. I think there should be emphasis on goals for RL that aren't "performance". For instance, a regression model's weights could carry more importance than the model's ability to predict; it could have tremendous value without ever having been deployed. Can RL algorithms provide insight?
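For the first point, here is a purely hypothetical sketch of what the client side of such a challenge API could look like. The service, endpoints, and fields are invented; only the `requests` calls themselves are real:

```python
import requests  # real HTTP library; the service below is entirely hypothetical

BASE = "https://example-rl-challenge.org/api"  # placeholder URL

def submit_policy(archive_path, token):
    """Submit a packaged policy; it joins the evaluation queue and is later
    run against the organizer-controlled real-life system on our behalf."""
    with open(archive_path, "rb") as f:
        resp = requests.post(
            f"{BASE}/submissions",
            files={"policy": f},
            headers={"Authorization": f"Bearer {token}"},
        )
    resp.raise_for_status()
    return resp.json()["submission_id"]

def download_runs(token, since=None):
    """Everyone can download the logged runs of all evaluated models, so the
    community shares one growing real-world dataset."""
    resp = requests.get(
        f"{BASE}/runs",
        params={"since": since},
        headers={"Authorization": f"Bearer {token}"},
    )
    resp.raise_for_status()
    return resp.json()["runs"]
```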