This blog will be updated over the long term and is organized into 5 parts: abstract, introduction, method, simulation, and conclusion. I aim to update it daily: read one paper each day and, while studying its content, excerpt its best paragraphs and file them by category.



1. Evolving reinforcement learning algorithms

We propose a method for meta-learning reinforcement learning algorithms by searching over the space of computational graphs which compute the loss function for a value-based model-free RL agent to optimize. The learned algorithms are domain-agnostic and can generalize to new environments not seen during training. Our method can both learn from scratch and bootstrap off known existing algorithms, like DQN, enabling interpretable modifications which improve performance. Learning from scratch on simple classical control and gridworld tasks, our method rediscovers the temporal-difference (TD) algorithm. Bootstrapped from DQN, we highlight two learned algorithms which obtain good generalization performance over other classical control tasks, gridworld type tasks, and Atari games. The analysis of the learned algorithm behavior shows resemblance to recently proposed RL algorithms that address overestimation in value-based methods.
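For reference, the kind of objective such a computational graph can express is the familiar one-step TD/DQN loss that the search rediscovers. A minimal sketch with made-up toy numbers (not the paper's search code):

```python
def dqn_loss(q_s, q_next, action, reward, gamma=0.99):
    """Squared one-step TD error, the canonical loss a value-based
    search space can express: L = (r + gamma * max_a' Q(s',a') - Q(s,a))^2."""
    td_target = reward + gamma * max(q_next)  # bootstrap from next state
    td_error = td_target - q_s[action]        # gap to current estimate
    return td_error ** 2

# Toy transition: two actions, the agent took action 0 and got reward 1.
loss = dqn_loss(q_s=[0.5, 0.2], q_next=[0.4, 0.9], action=0, reward=1.0)
```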


2. Adaptive Optimal Control for a Class of Nonlinear Systems: The Online Policy Iteration Approach

This paper studies the online adaptive optimal controller design for a class of nonlinear systems through a novel policy iteration (PI) algorithm. By using the technique of neural network linear differential inclusion (LDI) to linearize the nonlinear terms in each iteration, the optimal law for controller design can be solved through the relevant algebraic Riccati equation (ARE) without using the system internal parameters. Based on PI approach, the adaptive optimal control algorithm is developed with the online linearization and the two-step iteration, i.e., policy evaluation and policy improvement. The convergence of the proposed PI algorithm is also proved. Finally, two numerical examples are given to illustrate the effectiveness and applicability of the proposed method.
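The policy evaluation / policy improvement loop the abstract describes can be illustrated on the simplest possible case: a scalar LQR problem where the ARE root can be checked by hand. This is a model-based, Kleinman-style sketch of the two-step iteration, not the paper's LDI-based online algorithm:

```python
def policy_iteration_scalar(a, b, q, r, k0, iters=20):
    """Policy iteration for the scalar LQR problem x' = a*x + b*u with
    cost integral of (q*x^2 + r*u^2). Each cycle performs:
      policy evaluation : solve the Lyapunov equation
                          2*(a - b*k)*p + q + r*k**2 = 0 for p
      policy improvement: k <- b*p / r
    k0 must be stabilizing, i.e. a - b*k0 < 0."""
    k = k0
    for _ in range(iters):
        p = -(q + r * k * k) / (2.0 * (a - b * k))  # policy evaluation
        k = b * p / r                               # policy improvement
    return p, k

# With a = b = q = r = 1, the ARE 2p - p^2 + 1 = 0 has the positive
# root p = 1 + sqrt(2), which the iteration converges to quadratically.
p, k = policy_iteration_scalar(a=1.0, b=1.0, q=1.0, r=1.0, k0=2.0)
```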


3. Safe Optimal Control Under Parametric Uncertainties

We address the issue of safe optimal path planning under parametric uncertainties using a novel regularizer that allows trading off optimality with safety. The proposed regularizer leverages the notion that collisions may be modeled as constraint violations in an optimal control setting in order to produce open-loop trajectories with reduced risk of collisions. The risk of constraint violation is evaluated using a state-dependent relevance function and first-order variations in the constraint function with respect to parametric variations. The approach is generic and can be adapted to any optimal control formulation that deals with constraints under parametric uncertainty. Simulations using a holonomic robot avoiding multiple dynamic obstacles with uncertain velocities are used to demonstrate the effectiveness of the proposed approach. Finally, we introduce the car vs. train problem to emphasize the dependence of the resultant risk aversion behavior on the form of the constraint function used to derive the regularizer.
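To make the regularizer idea concrete, here is a toy version: the first-order variation of a collision constraint with respect to an uncertain parameter, weighted by a state-dependent relevance function. The forms of `g` and `relevance` below are invented for illustration and are not the paper's:

```python
def constraint_sensitivity_risk(g, x, p, relevance, dp=1e-6):
    """Toy sensitivity-based risk term: weight the first-order variation
    of the constraint g(x, p) with respect to the uncertain parameter p
    by a state-dependent relevance function."""
    dg_dp = (g(x, p + dp) - g(x, p - dp)) / (2.0 * dp)  # finite difference
    return relevance(x) * abs(dg_dp)

# Collision constraint for a 1-D obstacle moving at uncertain speed p:
# g > 0 means "safe" at time t = 1, with a margin of 0.5 around the obstacle.
g = lambda x, p: abs(x - p * 1.0) - 0.5      # obstacle position at t = 1 is p*t
relevance = lambda x: 1.0 / (1.0 + abs(x))   # nearby states matter more
risk = constraint_sensitivity_risk(g, x=2.0, p=1.0, relevance=relevance)
```

Adding such a term to the running cost penalizes trajectories whose safety margin is fragile to parameter perturbations, trading a little optimality for robustness.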


4. Learning-Based Model Predictive Control: Toward Safe Learning in Control

Recent successes in the field of machine learning, as well as the availability of increased sensing and computational capabilities in modern control systems, have led to a growing interest in learning and data-driven control techniques. Model predictive control (MPC), as the prime methodology for constrained control, offers a significant opportunity to exploit the abundance of data in a reliable manner, particularly while taking safety constraints into account. This review aims at summarizing and categorizing previous research on learning-based MPC, i.e., the integration or combination of MPC with learning methods, for which we consider three main categories. Most of the research addresses learning for automatic improvement of the prediction model from recorded data. There is, however, also an increasing interest in techniques to infer the parameterization of the MPC controller, i.e., the cost and constraints, that lead to the best closed-loop performance. Finally, we discuss concepts that leverage MPC to augment learning-based controllers with constraint satisfaction properties.



1. Evolving reinforcement learning algorithms

Our learned loss function should generalize across many different environments, instead of being specific to a particular domain. Thus, we design a search language based on genetic programming (Koza, 1993) that can express general symbolic loss functions which can be applied to any environment. Data typing and a generic interface to variables in the MDP allow the learned program to be domain agnostic. This language also supports the use of neural network modules as subcomponents of the program, so that more complex neural network architectures can be realized. Efficiently searching over the space of useful programs is generally difficult. For the outer loop optimization, we use regularized evolution (Real et al., 2019), a recent variant of classic evolutionary algorithms that employ tournament selection (Goldberg & Deb, 1991). This approach can scale with the number of compute nodes and has been shown to work for designing algorithms for supervised learning (Real et al., 2020). We adapt this method to automatically design algorithms for reinforcement learning.
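The outer-loop optimizer is easy to sketch. A minimal regularized evolution loop on a toy fitness function (the bitstring task below is a stand-in, not the paper's loss-graph search space):

```python
import random

def regularized_evolution(fitness, mutate, init, pop_size=20,
                          tournament=5, cycles=200, seed=0):
    """Minimal regularized evolution (Real et al., 2019): a tournament
    picks a parent, its mutated child joins the population, and the
    OLDEST member (not the worst) is removed."""
    rng = random.Random(seed)
    population = [init(rng) for _ in range(pop_size)]  # list acts as a queue
    for _ in range(cycles):
        sample = rng.sample(population, tournament)    # tournament selection
        parent = max(sample, key=fitness)
        population.append(mutate(parent, rng))         # child enters
        population.pop(0)                              # oldest leaves
    return max(population, key=fitness)

# Toy search: maximize the number of 1s in a 12-bit string.
def init(rng):
    return [rng.randint(0, 1) for _ in range(12)]

def mutate(bits, rng):
    child = list(bits)
    child[rng.randrange(len(child))] ^= 1              # flip one random bit
    return child

best = regularized_evolution(sum, mutate, init)
```

Removing the oldest rather than the worst individual is the "regularization": good candidates must keep re-proving themselves through their offspring, which also makes the loop trivially parallelizable across workers.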


2. Adaptive Optimal Control for a Class of Nonlinear Systems: The Online Policy Iteration Approach

To realize the online adaptive algorithm, however, we need a synchronous linearization technique to accompany the PI solution process. As a result, we put forward a new online LDIPI (OLDIPI) algorithm, which can address the adaptive optimal control problem for a class of nonlinear systems. First, the original nonlinear system is approximated by a linear plant model based on the neural network LDI. Then, inspired by the linear PI algorithm, the proposed OLDIPI algorithm can converge to the optimal solution; furthermore, the algorithm can be implemented online in a least-squares sense under a persistent excitation condition. The convergence of the proposed algorithm is also proved, and the corresponding simulation results are given to illustrate its feasibility and applicability.


3. Safe Optimal Control Under Parametric Uncertainties

The rest of the letter is organized as follows. Section II introduces sensitivity functions and the framework of DOC. Section III presents the main idea of the letter, involving the construction of an appropriate regularizer that provides open-loop trajectories with lower chance of constraint violation under parametric uncertainties. In Section IV, we first analyze the proposed approach by applying it on simple path planning problems with one dynamic obstacle, and then present the results obtained from experiments on environments with multiple uncertain dynamic obstacles. Section V concludes the letter.


1. Search on the Replay Buffer: Bridging Planning and Reinforcement Learning

RL methods are typically divided into model-free (Schulman et al., 2015a,b, 2017; Williams, 1992) and model-based (Lillicrap et al., 2015; Watkins and Dayan, 1992) approaches. Model-based approaches all perform some degree of planning, ranging from predicting the value of some state (Mnih et al., 2013; Silver et al., 2016), to obtaining representations by unrolling a learned dynamics model (Racanière et al., 2017), to learning a policy directly on a learned dynamics model (Agrawal et al., 2016; Chua et al., 2018; Finn and Levine, 2017; Kurutach et al., 2018; Nagabandi et al., 2018; Oh et al., 2015; Sutton, 1990). One line of work (Amos et al., 2018; Lee et al., 2018; Srinivas et al., 2018; Tamar et al., 2016) embeds a differentiable planner inside a policy, with the planner learned end-to-end with the rest of the policy. Other work (Lenz et al., 2015; Watter et al., 2015) explicitly learns a representation for use inside a standard planning algorithm. In contrast, SoRB learns to predict the distances between states, which can be viewed as a high-level inverse model. SoRB predicts a scalar (the distance) rather than actions or observations, making the prediction problem substantially easier. By planning over previously visited states, SoRB does not have to cope with infeasible states that can be predicted by forward models in state-space and latent-space.



1. Search on the Replay Buffer: Bridging Planning and Reinforcement Learning

Planning algorithms must be able to (1) sample valid states, (2) estimate the distance between reachable pairs of states, and (3) use a local policy to navigate between nearby states. These requirements are difficult to satisfy in complex tasks with high-dimensional observations, such as images. For example, consider a robot arm stacking blocks using image observations. Sampling states requires generating photo-realistic images, and estimating distances and choosing actions requires reasoning about dozens of interactions between blocks. Our method will obtain distance estimates and a local policy using an RL algorithm. To sample states, we will simply use a replay buffer of previously visited states as a non-parametric generative model.


2. Search on the Replay Buffer: Bridging Planning and Reinforcement Learning

After learning a goal-conditioned Q-function, we perform graph search to find a set of waypoints and use the goal-conditioned policy to reach each. We view the combination of graph search and the underlying goal-conditioned policy as a new SEARCHPOLICY, shown in Algorithm 1. The algorithm starts by using graph search to obtain the shortest path s_w1, s_w2, ... from the current state s to the goal state s_g, planning over the states in our replay buffer B. We then estimate the distance from the current state to the first waypoint, as well as the distance from the current state to the goal. In most cases, we then condition the policy on the first waypoint, s_w1. However, if the goal state is closer than the next waypoint and the goal state is not too far away, then we directly condition the policy on the final goal. If the replay buffer is empty or there is not a path in G to the goal, then Algorithm 1 resorts to standard goal-conditioned RL.
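A compact sketch of the graph-search step, with a hand-written distance function standing in for the learned goal-conditioned distance estimates; the pruning threshold `max_edge` is an assumption of this sketch, reflecting the idea that learned distances are only trusted locally:

```python
import heapq

def search_policy_waypoints(buffer, dist, start, goal, max_edge=4.0):
    """Dijkstra over the replay-buffer states plus start and goal.
    dist(a, b) plays the role of the learned distance estimate; edges
    longer than max_edge are pruned."""
    nodes = [start, goal] + list(buffer)   # index 0 = start, 1 = goal
    cost = {0: 0.0}
    prev = {}
    heap = [(0.0, 0)]
    while heap:
        c, i = heapq.heappop(heap)
        if i == 1:                         # goal node reached
            break
        if c > cost.get(i, float("inf")):
            continue                       # stale heap entry
        for j in range(len(nodes)):
            if j == i:
                continue
            d = dist(nodes[i], nodes[j])
            if d > max_edge or c + d >= cost.get(j, float("inf")):
                continue
            cost[j] = c + d
            prev[j] = i
            heapq.heappush(heap, (c + d, j))
    path, i = [], 1                        # walk back from the goal
    while i != 0:
        path.append(nodes[i])
        i = prev[i]
    return path[::-1]                      # first entry is the first waypoint

# Toy 1-D chain: buffer states bridge a start/goal pair whose direct
# distance exceeds the trusted range of the estimator.
waypoints = search_policy_waypoints(buffer=[3.0, 6.0, 9.0],
                                    dist=lambda a, b: abs(a - b),
                                    start=0.0, goal=12.0)
```

The agent would then hand `waypoints[0]` to the goal-conditioned policy, replanning as it moves.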



1. Search on the Replay Buffer: Bridging Planning and Reinforcement Learning

We start by building intuition for our method by applying it to two simple 2D navigation tasks, shown in Figure 4a. The start and goal state are chosen randomly in free space, and reaching the goal often takes over 100 steps, even for the optimal policy. We used goal-conditioned RL to learn a policy for each environment, and then evaluated this policy on randomly sampled (start, goal) pairs of varying difficulty. To implement SoRB, we used exactly the same policy, both to perform graph search and then to reach each of the planned waypoints. In Figure 4b, we observe that the goal-conditioned policy can reach nearby goals, but fails to generalize to distant goals. In contrast, SoRB successfully reaches goals over 100 steps away, with little drop in success rate. Figure 4c compares rollouts from the goal-conditioned policy and our policy. Note that our policy takes actions that temporarily lead away from the goal so the agent can maneuver through a hallway to eventually reach the goal.


2. Safe Optimal Control Under Parametric Uncertainties

In this section, we apply the proposed approach to simple test examples to analyze the optimal trajectories obtained by penalizing RCS. First, we analyze the claim of penalizing RCS over constraint sensitivity using a 2D path planning problem involving a dynamic obstacle with uncertainty in its speed. Subsequently, the effect of various constraint forms that represent the collision avoidance condition, chosen from a set of valid ones, is studied. We then stress the need to select an appropriate constraint function to construct RCS using the car vs. train problem, and finally, the trade-off studies with multiple obstacles are presented. The videos demonstrating the optimal trajectories for the example problems discussed in this section can be found in the supplementary material.



1. Evolving reinforcement learning algorithms

In this work, we have presented a method for learning reinforcement learning algorithms. We design a general language for representing algorithms which compute the loss function for value-based model-free RL agents to optimize. We highlight two learned algorithms which although relatively simple, can obtain good generalization performance over a wide range of environments. Our analysis of the learned algorithms sheds insight on their benefit as regularization terms which are similar to recently proposed algorithms. Our work is limited to discrete action and value-based RL algorithms that are close to DQN, but could easily be expanded to express more general RL algorithms such as actor-critic or policy gradient methods. How actions are sampled from the policy could also be part of the search space. The set of environments we use for both training and testing could also be expanded to include a more diverse set of problem types. We leave these problems for future work.


2. Search on the Replay Buffer: Bridging Planning and Reinforcement Learning

We presented SoRB, a method that combines planning via graph search and goal-conditioned RL. By exploiting the structure of goal-reaching tasks, we can obtain policies that generalize substantially better than those learned directly from RL. In our experiments, we show that SoRB can solve temporally extended navigation problems, traverse environments with image observations, and generalize to new houses in the SUNCG dataset. Our method relies heavily on goal-conditioned RL, and we expect advances in this area to make our method applicable to even more difficult tasks. While we used a stage-wise procedure, first learning the goal-conditioned policy and then applying graph search, in future work we aim to explore how graph search can improve the goal-conditioned policy itself, perhaps via policy distillation or obtaining better Q-value estimates. In addition, while the planning algorithm we use is simple (namely, Dijkstra), we believe that the key idea of using distance estimates obtained from RL algorithms for planning will open doors to incorporating more sophisticated planning techniques into RL.


3. Safe Optimal Control Under Parametric Uncertainties

A sensitivity function-based regularizer is introduced to obtain conservative solutions that avoid constraint violation under parametric uncertainties in optimal control problems. Using the fact that collision avoidance can be expressed as a state constraint, the approach is applied for path planning problems involving dynamic uncertain obstacles. The proposed regularizer is first analyzed on simple problems to study its characteristics and to identify its limitations. It is observed that the form of the constraint function used to construct the regularizer affects the behavior of the trajectories. The results on environments with as many as ten dynamic obstacles indicate that safety can be enhanced with an acceptable trade-off in optimality.


If you like my blog, please consider making a donation.