
## Abstract

### 1. Evolving reinforcement learning algorithms

We propose a method for meta-learning reinforcement learning algorithms by searching over the space of computational graphs which compute the loss function for a value-based model-free RL agent to optimize. The learned algorithms are domain-agnostic and can generalize to new environments not seen during training. Our method can both learn from scratch and bootstrap off known existing algorithms, like DQN, enabling interpretable modifications which improve performance. Learning from scratch on simple classical control and gridworld tasks, our method rediscovers the temporal-difference (TD) algorithm. Bootstrapped from DQN, we highlight two learned algorithms which obtain good generalization performance on other classical control tasks, gridworld-type tasks, and Atari games. Analysis of the learned algorithms' behavior shows a resemblance to recently proposed RL algorithms that address overestimation in value-based methods.

### 2. Adaptive optimal control for a class of nonlinear systems: The online policy iteration approach

This paper studies online adaptive optimal controller design for a class of nonlinear systems through a novel policy iteration (PI) algorithm. By using neural network linear differential inclusion (LDI) to linearize the nonlinear terms in each iteration, the optimal control law can be obtained by solving the relevant algebraic Riccati equation (ARE) without using the system's internal parameters. Based on the PI approach, the adaptive optimal control algorithm is developed with online linearization and a two-step iteration, i.e., policy evaluation and policy improvement. The convergence of the proposed PI algorithm is also proved. Finally, two numerical examples are given to illustrate the effectiveness and applicability of the proposed method.

### 3. Safe Optimal Control Under Parametric Uncertainties

We address the issue of safe optimal path planning under parametric uncertainties using a novel regularizer that allows trading off optimality with safety. The proposed regularizer leverages the notion that collisions may be modeled as constraint violations in an optimal control setting in order to produce open-loop trajectories with reduced risk of collisions. The risk of constraint violation is evaluated using a state-dependent relevance function and first-order variations in the constraint function with respect to parametric variations. The approach is generic and can be adapted to any optimal control formulation that deals with constraints under parametric uncertainty. Simulations using a holonomic robot avoiding multiple dynamic obstacles with uncertain velocities are used to demonstrate the effectiveness of the proposed approach. Finally, we introduce the car vs. train problem to emphasize the dependence of the resultant risk aversion behavior on the form of the constraint function used to derive the regularizer.
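The first-order risk measure described above can be sketched as follows (the notation is ours and the exact definitions in the letter may differ): with state $x(t)$, parameter vector $p$, and a collision-avoidance constraint $g(x, p) \le 0$, the first-order variation of the constraint under a parametric perturbation $\delta p$ is

$$\delta g \approx \left( \frac{\partial g}{\partial x}\, S(t) + \frac{\partial g}{\partial p} \right) \delta p, \qquad S(t) := \frac{\partial x(t)}{\partial p},$$

where $S(t)$ is the state sensitivity function. A state-dependent relevance function $\rho(x)$ then weights this variation so that it matters most near the constraint boundary, suggesting a regularized objective of the form

$$J = J_0 + \lambda \int_0^T \rho(x(t))\, \left\| \frac{\partial g}{\partial x}\, S(t) + \frac{\partial g}{\partial p} \right\|^2 dt,$$

where $J_0$ is the nominal optimal control cost and $\lambda$ trades off optimality against safety.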

### 4. Learning-based model predictive control: Toward safe learning in control

Recent successes in the field of machine learning, as well as the availability of increased sensing and computational capabilities in modern control systems, have led to a growing interest in learning and data-driven control techniques. Model predictive control (MPC), as the prime methodology for constrained control, offers a significant opportunity to exploit the abundance of data in a reliable manner, particularly while taking safety constraints into account. This review aims at summarizing and categorizing previous research on learning-based MPC, i.e., the integration or combination of MPC with learning methods, for which we consider three main categories. Most of the research addresses learning for automatic improvement of the prediction model from recorded data. There is, however, also an increasing interest in techniques to infer the parameterization of the MPC controller, i.e., the cost and constraints, that lead to the best closed-loop performance. Finally, we discuss concepts that leverage MPC to augment learning-based controllers with constraint satisfaction properties.

## Introduction

### 1. Evolving reinforcement learning algorithms

Our learned loss function should generalize across many different environments, instead of being specific to a particular domain. Thus, we design a search language based on genetic programming (Koza, 1993) that can express general symbolic loss functions which can be applied to any environment. Data typing and a generic interface to variables in the MDP allow the learned program to be domain-agnostic. This language also supports the use of neural network modules as subcomponents of the program, so that more complex neural network architectures can be realized. Efficiently searching over the space of useful programs is generally difficult. For the outer-loop optimization, we use regularized evolution (Real et al., 2019), a recent variant of classic evolutionary algorithms that employs tournament selection (Goldberg & Deb, 1991). This approach can scale with the number of compute nodes and has been shown to work for designing algorithms for supervised learning (Real et al., 2020). We adapt this method to automatically design algorithms for reinforcement learning.
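As a rough illustration of the outer loop (the helper names `init_program`, `mutate`, and `fitness` are ours, not the paper's), regularized evolution keeps a fixed-size population, selects parents by tournament, and evicts the oldest member rather than the worst:

```python
import random

def regularized_evolution(init_program, mutate, fitness,
                          population_size=100, tournament_size=10, cycles=1000):
    """Regularized evolution (Real et al., 2019): tournament selection plus
    age-based removal, so the population behaves like a sliding window."""
    scored = [(p, fitness(p)) for p in (init_program() for _ in range(population_size))]
    for _ in range(cycles):
        # Tournament selection: the best of a random subset becomes the parent.
        parent = max(random.sample(scored, tournament_size), key=lambda ps: ps[1])[0]
        child = mutate(parent)
        scored.append((child, fitness(child)))
        scored.pop(0)  # "Regularization": evict the oldest individual, not the worst.
    return max(scored, key=lambda ps: ps[1])[0]
```

In the paper the individuals are loss-function programs evaluated by training RL agents; here any mutation operator and fitness function can be plugged in. Evicting by age rather than fitness prevents an early lucky individual from dominating the population forever.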

### 2. Adaptive optimal control for a class of nonlinear systems: The online policy iteration approach

To realize the online adaptive algorithm, however, we need a synchronous linearization technique to accompany the PI solution process. As a result, we put forward a new online LDIPI (OLDIPI) algorithm, which addresses the adaptive optimal control problem for a class of nonlinear systems. First, the original nonlinear system is approximated by a linear plant model based on the neural network LDI. Then, inspired by the linear PI algorithm, the proposed OLDIPI algorithm converges to the optimal solution; furthermore, the algorithm can be implemented online in the least-squares sense under a persistent excitation condition. The convergence of the proposed algorithm is also proved, and the corresponding simulation results are given to illustrate its feasibility and applicability.
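For intuition, on a known linear plant the two-step iteration reduces to Kleinman's classical policy iteration for continuous-time LQR, which the OLDIPI algorithm extends with online linearization and model-free least-squares evaluation. This sketch assumes known system matrices, unlike the paper's setting:

```python
import numpy as np

def policy_iteration_lqr(A, B, Q, R, K0, iters=20):
    """Kleinman-style policy iteration. Each step: (1) policy evaluation -
    solve a Lyapunov equation for the cost matrix P of the current gain K;
    (2) policy improvement - update K from P. K0 must be stabilizing."""
    n = A.shape[0]
    K = K0
    for _ in range(iters):
        Ak = A - B @ K                  # closed-loop dynamics under u = -Kx
        M = Q + K.T @ R @ K             # stage cost under the current policy
        # Policy evaluation: solve Ak^T P + P Ak = -M via vectorization.
        L = np.kron(np.eye(n), Ak.T) + np.kron(Ak.T, np.eye(n))
        P = np.linalg.solve(L, -M.flatten(order="F")).reshape(n, n, order="F")
        # Policy improvement: K = R^{-1} B^T P.
        K = np.linalg.solve(R, B.T @ P)
    return P, K
```

On a double integrator with Q = I and R = 1, this iteration converges to the known ARE solution P = [[√3, 1], [1, √3]] with gain K = [1, √3].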

### 3. Safe Optimal Control Under Parametric Uncertainties

The rest of the letter is organized as follows. Section II introduces sensitivity functions and the framework of DOC. Section III presents the main idea of the letter, involving the construction of an appropriate regularizer that provides open-loop trajectories with lower chance of constraint violation under parametric uncertainties. In Section IV, we first analyze the proposed approach by applying it on simple path planning problems with one dynamic obstacle, and then present the results obtained from experiments on environments with multiple uncertain dynamic obstacles. Section V concludes the letter.

### 4. Search on the Replay Buffer: Bridging Planning and Reinforcement Learning

RL methods are typically divided into model-free (Lillicrap et al., 2015; Schulman et al., 2015a,b, 2017; Watkins and Dayan, 1992; Williams, 1992) and model-based approaches. Model-based approaches all perform some degree of planning, whether predicting the value of some state (Mnih et al., 2013; Silver et al., 2016), obtaining representations by unrolling a learned dynamics model (Racaniere et al., 2017), or learning a policy directly on a learned dynamics model (Agrawal et al., 2016; Chua et al., 2018; Finn and Levine, 2017; Kurutach et al., 2018; Nagabandi et al., 2018; Oh et al., 2015; Sutton, 1990). One line of work (Amos et al., 2018; Lee et al., 2018; Srinivas et al., 2018; Tamar et al., 2016) embeds a differentiable planner inside a policy, with the planner learned end-to-end with the rest of the policy. Other work (Lenz et al., 2015; Watter et al., 2015) explicitly learns a representation for use inside a standard planning algorithm. In contrast, SoRB learns to predict the distances between states, which can be viewed as a high-level inverse model. SoRB predicts a scalar (the distance) rather than actions or observations, making the prediction problem substantially easier. By planning over previously visited states, SoRB does not have to cope with the infeasible states that forward models in state space and latent space can predict.


## Method

### 1. Search on the Replay Buffer: Bridging Planning and Reinforcement Learning

Planning algorithms must be able to (1) sample valid states, (2) estimate the distance between reachable pairs of states, and (3) use a local policy to navigate between nearby states. These requirements are difficult to satisfy in complex tasks with high-dimensional observations, such as images. For example, consider a robot arm stacking blocks using image observations. Sampling states requires generating photo-realistic images, and estimating distances and choosing actions require reasoning about dozens of interactions between blocks. Our method will obtain distance estimates and a local policy using an RL algorithm. To sample states, we will simply use a replay buffer of previously visited states as a non-parametric generative model.

### 2. Search on the Replay Buffer: Bridging Planning and Reinforcement Learning

After learning a goal-conditioned Q-function, we perform graph search to find a set of waypoints and use the goal-conditioned policy to reach each one. We view the combination of graph search and the underlying goal-conditioned policy as a new SEARCHPOLICY, shown in Algorithm 1. The algorithm starts by using graph search to obtain the shortest path s_{w1}, s_{w2}, … from the current state s to the goal state s_g, planning over the states in our replay buffer B. We then estimate the distance from the current state to the first waypoint, as well as the distance from the current state to the goal. In most cases, we then condition the policy on the first waypoint, s_{w1}. However, if the goal state is closer than the next waypoint and not too far away, we condition the policy directly on the final goal. If the replay buffer is empty or there is no path in G to the goal, Algorithm 1 falls back to standard goal-conditioned RL.
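The logic above can be sketched as follows (a minimal illustration with our own helper names; in SoRB the distance function comes from the learned goal-conditioned Q-function, and the graph is typically built once rather than per query):

```python
import heapq

def search_policy_waypoint(s, goal, buffer_states, dist, max_dist=4.0):
    """Run Dijkstra over replay-buffer states, with edges weighted by the
    learned distance dist(a, b), and return the state to condition the
    goal-reaching policy on (the first waypoint, or the goal itself)."""
    nodes = [s] + list(buffer_states) + [goal]
    n = len(nodes)
    best, prev = {0: 0.0}, {}
    pq = [(0.0, 0)]
    while pq:
        d, i = heapq.heappop(pq)
        if d > best[i] or i == n - 1:
            continue
        for j in range(n):
            if j == i:
                continue
            w = dist(nodes[i], nodes[j])
            if w > max_dist:
                continue  # prune edges with unreliable long-distance estimates
            if d + w < best.get(j, float("inf")):
                best[j], prev[j] = d + w, i
                heapq.heappush(pq, (d + w, j))
    if n - 1 not in best:
        return goal  # no path found: fall back to plain goal-conditioned RL
    path = [n - 1]
    while path[-1] != 0:
        path.append(prev[path[-1]])
    path.reverse()
    first_waypoint = nodes[path[1]]
    # If the goal is closer than the first waypoint and within reach, go direct.
    if dist(s, goal) <= min(dist(s, first_waypoint), max_dist):
        return goal
    return first_waypoint
```

On a 1D toy problem with buffer states acting as stepping stones, the policy is conditioned on the nearest stone for distant goals and on the goal itself when it is within reach.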

## Simulation

### 1. Search on the Replay Buffer: Bridging Planning and Reinforcement Learning

We start by building intuition for our method by applying it to two simple 2D navigation tasks, shown in Figure 4a. The start and goal states are chosen randomly in free space, and reaching the goal often takes over 100 steps, even for the optimal policy. We used goal-conditioned RL to learn a policy for each environment, and then evaluated this policy on randomly sampled (start, goal) pairs of varying difficulty. To implement SoRB, we used exactly the same policy, both to perform graph search and then to reach each of the planned waypoints. In Figure 4b, we observe that the goal-conditioned policy can reach nearby goals, but fails to generalize to distant goals. In contrast, SoRB successfully reaches goals over 100 steps away, with little drop in success rate. Figure 4c compares rollouts from the goal-conditioned policy and our policy. Note that our policy takes actions that temporarily lead away from the goal so that the agent can maneuver through a hallway and eventually reach it.

### 2. Safe Optimal Control Under Parametric Uncertainties

In this section, we apply the proposed approach to simple test examples to analyze the optimal trajectories obtained by penalizing RCS. First, we analyze the claim of penalizing RCS over constraint sensitivity using a 2D path planning problem involving a dynamic obstacle with uncertainty in its speed. Subsequently, the effect of various constraint forms that represent the collision avoidance condition, chosen from a set of valid ones, is studied. We then stress the need to select an appropriate constraint function for constructing RCS using the car vs. train problem, and finally, trade-off studies with multiple obstacles are presented. Videos demonstrating the optimal trajectories for the example problems discussed in this section can be found in the supplementary material.

## Conclusion

### 1. Evolving reinforcement learning algorithms

In this work, we have presented a method for learning reinforcement learning algorithms. We design a general language for representing algorithms which compute the loss function for value-based model-free RL agents to optimize. We highlight two learned algorithms which, although relatively simple, obtain good generalization performance over a wide range of environments. Our analysis of the learned algorithms sheds light on their benefit as regularization terms, which are similar to those in recently proposed algorithms. Our work is limited to discrete-action, value-based RL algorithms close to DQN, but it could easily be expanded to express more general RL algorithms such as actor-critic or policy gradient methods. How actions are sampled from the policy could also be part of the search space. The set of environments we use for both training and testing could also be expanded to include a more diverse set of problem types. We leave these problems for future work.

### 2. Search on the Replay Buffer: Bridging Planning and Reinforcement Learning

We presented SoRB, a method that combines planning via graph search and goal-conditioned RL. By exploiting the structure of goal-reaching tasks, we can obtain policies that generalize substantially better than those learned directly from RL. In our experiments, we show that SoRB can solve temporally extended navigation problems, traverse environments with image observations, and generalize to new houses in the SUNCG dataset. Our method relies heavily on goal-conditioned RL, and we expect advances in this area to make our method applicable to even more difficult tasks. While we used a stage-wise procedure, first learning the goal-conditioned policy and then applying graph search, in future work we aim to explore how graph search can improve the goal-conditioned policy itself, perhaps via policy distillation or obtaining better Q-value estimates. In addition, while the planning algorithm we use is simple (namely, Dijkstra's algorithm), we believe that the key idea of using distance estimates obtained from RL algorithms for planning will open the door to incorporating more sophisticated planning techniques into RL.

### 3. Safe Optimal Control Under Parametric Uncertainties

A sensitivity function-based regularizer is introduced to obtain conservative solutions that avoid constraint violation under parametric uncertainties in optimal control problems. Using the fact that collision avoidance can be expressed as a state constraint, the approach is applied to path planning problems involving uncertain dynamic obstacles. The proposed regularizer is first analyzed on simple problems to study its characteristics and identify its limitations. It is observed that the form of the constraint function used to construct the regularizer affects the behavior of the trajectories. The results on environments with as many as ten dynamic obstacles indicate that safety can be enhanced with an acceptable trade-off in optimality.

If you like my blog, please consider donating.