【论文阅读】Motion Planning through policy search

阅读原文时间：2022年04月14日阅读：1

想着CSDN还是不适合做论文类的笔记，那里就当做技术/系统笔记区，博客园就专心搞看论文的笔记和一些想法好了，【】以后中框号中间的都算作是自己的内心OS 有时候可能是问题，有时候可能是自问自答，毕竟是笔记嘛心路历程记录；然后可能有很多时候都是中英文夹杂，是因为我觉得有些方法并没有很好地中文翻译的意思(比如configuration space)，再加上英文能更好的搜索。希望大家能接受这种夹杂写法，或者接受不了的话直接关掉这个看原文

前言：这是一篇02年的关于Motion Planning - Policy search 的论文，然后是看一篇文献综述无意想继续了解，文献综述中：There exist basically two types of trajectory-based policies:

1)way-point based polices ->也就是这一篇的内容

2)dynamical system based

论文：那篇文献综述：A Survey on Policy Search Algorithms for Learning Robot Controllers in a Handful of Trials

此篇文章所介绍的：Motion Planning through policy search

1 Introduction

现在的机器人导航系统分为两种等级的控制：

global path planning
local motion control

这种通常都是在低维度空间使用，例如将robot's configuration space离散成x,y点，而在这个情况下的路径我们通常使用Dynamic programming 动态规划算法，因为他能保证在机器人模型和环境的约束下取得最优路径，而且converges very quickly for most environments

其他规划也有：a) potential-field based planners, b) probabilistic roadmaps

虽然到现在看来好像probabilistic roadmaps 更多一点。【DP是一种算法，而其他规划中的PRM才是一种在地图里找路径的方法吧，所以可能还得深入看一下 12,17 reference paper怎么讲的这个事】

虽然我们已经规划出来x,y的waypoint路径点，将其转换为controls for the mobile robot也是一个问题，而一般controls都是五个维度\((x,y,\theta,v_t,v_r)\)，而规划出来的x,y才两个维度，所以就引出了local motion control

但是呢 local controller have all abandoned optimality in the local space，以此来适应动态的障碍物和机器人的动力学约束（比如速度无法瞬间到达一个值）

现有的local controller通常都直接把机器人视为一个bounding circle 这样就不需要考虑\(\theta\)，【这里我存在异议：就算是个圆也有轮子限制了他需要考虑\(\theta吧\) 还是说完全理想化模型不考虑转向问题了】

然后呢这些控制器就使用heuristic function去在global planner里寻找local path的结果：The controller then use a variety of heuristics for finding some path that approximates the trajectory suggested by the global planner

这样的缺点也比较明显 high-level structure of the path may be close to optimal, but controls may be suboptimal on a detailed level，而这里的detailed level在对于车穿过一些狭窄的空间又很重要

generate a complete plan in the full state space of the robot, 首先是在低维度下做出规划，使用DP value function离散化空间， The plan is represented as a sequence of waypoints and a controller that drives through the waypoints in sequence. 也就是一系列的waypoints和一个控制器【谁… 不是这样呀拿waypoints的方法，DWA不算是waypoints那种】

Using the high-dimensional projection of the value function plan as a seed 用价值函数在高维度做映射作为一个seed
perform gradient ascent in the plane space 然后在上面的seed plane space里用梯度下降方法
modifying the waypoints to increase the expected reward of the plan 然后修改waypoints来提高整体plan的期望收益

【所以就是第一点我不是很懂… 等看后面瞅瞅吧感觉好像RL里MDP原型哦】

这篇文章的假设就是：value function生成的初始路径是靠近optimal solution的【这不就是手动假设不陷入局部最优？】 we don't provide any guarantees on the quality of the final plan if the initial value function policy is allowed to be arbitrarily bad.

这里介绍了我们比较熟悉的一个方法Dynamic window method for converting high-level plans into local controls，这个意思也就是直接把high-level的规划走到了local control那块，不像PRM或者本文的走waypoints先离散再得路径最后local control

关于DWA的比较好的博文：DWA局部路径规划算法论文阅读：The Dynamic Window Approach to Collision Avoidance

这里也提到了Applying online search techniques to continuous-state reinforcement learning说这一篇和本文的方法很像，但是他们是从low-dimensional approximate value function to a higher space，而且他们没有search for polices去最小化 期望收益【我觉得这里文章写错了吧应该是maximize】

【很有意思的是，这部分提到的论文都讲到了方法+缺点，但是对于DWA只是一句话介绍了而没有点明缺点 emmm 有意思】

总结来说：控制理论方法最佳的应用还是在无障碍物的情况下，怎样在二维甚至三维中处理障碍物式的规划还是在继续研究的【可能在我们现在看来怎么找到这些障碍物分类看起来比控制更为重要？】

3 Markov Decision Processes

【噢这里看上去… 印证了我说的好像RL的MDP原型，不过这里的公式还不像RL第二版书籍里的那么好看… 就是：02年的时候和21年已经不一样了一点点】

这里我曾看RL书的时候写过马尔科夫笔记：【书籍阅读 Ch3】Reinforcement Learning An Introduction, 2nd Edition

前面主要是MDP的一些公式，比如MDP的描述：\((S,s_0,A,T,R)\)，我们要做的就是maximize reward over the lifetime of the agent

\(S\) 是所有的状态，就是包含这个agent的所有可能的状态
\(s_0\) 是初始状态
\(A\) 是所有可能的动作
\(T\) 是transition就是：\(T(s,a,s')\) 表示在状态\(s\) 下做出了\(a\) 的动作后状态变为\(s'\) 的概率是：\(T(s,a,s')=p(s'|s,a)\) The probability of being in state \(s'\) starting from state \(s\) and taking action \(a\)
\(R\) 表示reward 收益

Bellman equation:

\[V\left(s_{i}\right)=R\left(s_{i}\right)+\gamma \sum_{j=1}^{|S|} V\left(s_{j}\right) \sum_{k=1}^{|\mathcal{A}|} p\left(s_{j} \mid \pi\left(a_{k} \mid s_{i}\right), s_{i}\right)
\]

这里唯一让我比较疑惑的是，这里的policy \(\pi(a_k|s_k)\)是deterministic policy也就是不是概率形式而是只有0,1的这样 -> \(\pi(a_k|s_k)= \{0,1\}\)，但是这样子的设定很少出现在RL里

随后State-value都会随着policy采取的不同而进行改变，最终updating the policy until the value is maximize over state。而这样就需要考虑computational cost

噢【上面的疑惑解决了】，这一段的后面提到了：In order to find a policy for continuous, high-dimensional problem, we need a different, more compact representation of a policy, 和一种评估policy的方式

不过RL书里评估policy的方式通常是 \(argmax(Q(S_t,a))\) q-value的最大，或者是这样的：

\[{argmax}_a \sum_{s',r} p(s',r|s,a)[r+\gamma v_\pi(s')]
\]

这里介绍了，policy search的作用是计算长期的收益【其实好像后面RL管这个长期的叫returns】：policy search operates by evaluating the long term cumulative reward of the current policy

然后现有的不同policy的选择有：Bayes nets[14],neural nets[1] 就是连续化状态特征并返回控制action，但是在这种情况下 mapping from a discrete grid world to these complex representations is not easy

所以呢，本文就将policy表示成a series of waypoints \(w_j\) 然后再连接着controller

waypoint的呢就比较像这下面这个：

来源google pictureL：Optimal Multi-Criteria Waypoint Selection for Autonomous Vehicle Navigation in Structured Environment

那这个controller：emits actions, where each action moves the agent towards the next(closet) waypoint.

这一系列的waypoints可以被视为从起点到终点路径的线性化后的一系列点，当运行的过程中，下一个时刻t的action a是一个关于现在状态s和下一个waypoint的函数:

\[a=\left(\Delta v_{t}, \Delta v_{r}\right)=f\left(s(t), \mathbf{w}_{j}\right)
\]

这个policy还有一个参数 \(d_{approach}\) 来决定controller什么时候换到下一个waypoint上，用公式表示也就是：

\[\mathbf{w}_{j}=\mathbf{w}_{j+1} \text { iff }\left\|s(t)-\mathbf{w}_{j}\right\|<d_{\text {approach }}
\]

而这种公式下的policy对于最后一个final policy就不会是全局最优的了（因为没有下一个点）【所以在这里只要waypoint分的够细就可以减小这种误差？】

如果local controller的模型能准确反应下一个状态的情况，那么policy search对于每一个状态的policy都是很重要的【所以是model-based 就是有环境建模了】

在介绍部分曾经提到过一个guarantee：也就是初始情况不能是随意的否则结果不一定好(we want an estimate of a good path to begin the search)，所以怎样做初始Policy估计对于这一篇的方法也是很重要的：

extracting the set of maximum likelihood states 在低维度的expected轨迹中提取最大可能的状态【这个likelihood是怎么给出的？】
把这些状态映射到高维度空间，得到一系列的waypoints点去反映expected path
prune by merging path neighbors that share the same value function policy

首先是optimal policy \(\pi^*\) 得法：最大化从起点到终点间的expected reward：

\[V_{\pi^{*}}(s)=\max _{\pi \in \pi} V_{\pi}\left(s_{0}\right) =\max _{\pi \in \pi} R\left(s_{0}\right)+\int_{\mid S !} p\left(s(d t) \mid s_{0}, \pi\left(s_{0}\right)\right) V_{\pi}(s(d t)) d s \enspace \text{(4)}
\]

\(\pi(s_0)\) 是状态\(s_0\)下的policy
\(s(dt)\) 是过\(dt\)时间后的状态
积分符号是指 the expectation reward of the next state distribution after time \(d_t\)

这里我们做出最后一个假设：local controller是无偏的，也就是说保持下一个状态分布 centered on the expected trajectory 表示为:\(G(s(t),\theta)\)【这里的centered on 没有很理解是什么意思？】，这里的\(\theta\) 是local controller的一个参数，这样每一个action的结果都将会是\(w_j\)和\(w_{j+1}\)之间的最大的likelihood，也使得计算期望分布变得容易

经过上面的变化我们可以简化公式(4)中的积分项将\(p(s(t)|s_0,\pi(s_0))\) 变为a function centered on the maximum likelihood next state \(G(s|s(t),\theta):S\rarr[0,1]\)，然后以矩阵形式写(4)公式就变为了：

\[V(\pi)=R(s_0)+G(s(dt),\theta) \cdot V(s(t))
\]

\(s(dt)\) 这里的与上一个公式4中有所不同了，这里表示maximum likelihood next state after time \(dt\) under policy \(\pi\)
\(G(s(dt))\) 表示状态空间的分布情况
\(\theta\) 会是local controller里的一个常量

如果再将policy\(\pi(s_i)\) 描述成在欧拉空间的一系列的 n 个waypoints \(\mathbf{w}_1 \dots \mathbf{w}_n\)，公式(4)又可以变为：

\[\begin{aligned}
V\left(\pi_{n}\right) &=\int_{t=0}^{T} G(s(t)) \cdot R(s(t)) d t \\
&=\sum_{j=0}^{n-1} \int_{s\left(\mathbf{w}_{j}\right)}^{s\left(\mathbf{w}_{j+1}\right)} G(s) \cdot R(s) d s \enspace \text{(7)}
\end{aligned}\]

因为我们要最大化公式(7)的值，所以显而易见的方式就是：求导取0找最值，在这里的话因为不止一个变量所以是差分在正梯度里求，那么式子就变为了：

\[\begin{aligned}
\nabla V\left(\pi_{n}\right) &=\frac{\partial}{\partial \mathbf{w}_{1 \ldots n}} \sum_{j=0}^{n-1} \int_{s\left(\mathbf{w}_{j}\right)}^{s\left(\mathbf{w}_{j+1}\right)} G(s) \cdot R(s) d s \\
&=\sum_{j=1}^{n-1} \frac{\partial}{\partial \mathbf{w}_{j}}\left(\int_{s\left(\mathbf{w}_{j}\right)}^{s\left(\mathbf{w}_{j+1}\right)} G(s) \cdot R(s) d s\right)
\end{aligned}\]

注意对于purely deterministic action models（也可以说是完美的控制器）,对于\(G(\cdot)\)提出来不需要积分，simply integrate the reward along the expected path，而这样的处理方式对于local controller的噪音也可以的纳入考虑，越大的噪音就会采取越保守的策略：the larger the noise in the local controller, the more conservative eventual policy

【然后看到这里的伪代码我就疑惑了如果不是PRM的话，那就是… 他们栅格化了地图grid world，然后每一个栅格就是一个单独的state，一个action走到一个对应的栅格，DP的话，定义好了Reward所以可以先来一次所有的state的value价值 emmm 其实如果是按路径长短来看，DP肯定很好了，但是因为这篇论文在开头提及过：需要考虑五个维度也就是速度信息，DP有时候离障碍物太近了所以可能会牺牲速度减慢之类的，但是走了policy search 对每一个waypoint再来一次max 这样的话速度就可以保持的比较好】

最初的policy是由n个waypoints的value function来给出的，但是呢optimal policy可能不止n个waypoints，所以我们需要有introduce waypoints as needed的方式

The number of waypoints needed to represent the policy

这里represent:to form or be something,也就是用来表示policy的waypoints的个数的选择就像在expectation-maximize估计clusters一样，这里我们选用K-means的方法?【看了参考15是K-means 但是我所理解的K-means不是做聚类的嘛… 这个number of waypoints的选择是什么…意思】,插入新的waypoint的情况为：

\[\begin{aligned}
\text { Insert } \mathbf{w}_{j+1}^{\prime} & \text { if } \quad R\left(\mathbf{w}_{j+1}^{\prime}\right)<R\left(\mathbf{w}_{j}\right) \\
& \text { and } \quad R\left(\mathbf{w}_{j+1}^{\prime}\right)<R\left(\mathbf{w}_{j+1}\right)
\end{aligned}\]

Table 1: Algorithm summary 这篇文章的整个算法总结为：

Run the dynamic program and extract a policy from 2 dimensional value function (Section 3.1) 【所以就是栅格化地图然后对于每个x,y点都计算一个value值】
Determine the maximum likelihood trajectory and convert to a set of 5-dimensional waypoints to form tha expected path (Section 3.3)【根据1.选出最好的轨迹然后映射到五维状态的waypoints里组成期望路径】
Policy search: for each waypoint \(w_j\)【policy search指的是：寻找每一个waypoint】

(a) Determine value contribution of trajectory from previous waypoint \(w_{j-1}\) to next waypoint \(w_{j+1}\). (Section 3.4)

(b) Measure gradient at endpoints

(c) Move waypoint \(w\), along gradient until path segment value increases

(d) Repeat for all waypoint until convergence
Add new points (section 3.5)

(a) Find lowest immediate reward along the tory

(b) Insert a new waypoint

(c) Repeat step 3.
Repeat waypoint insertion until convergence

as has been stated already, value function planning in a more exact state space is actually computational intractable.【为啥state都弄好了，value function planning还是会无法求解？】

\(R=(-c \cdot d)\) 当机器人没有到达终点的时候，reward的定义，其中d表示走的距离
\(R=g\) 当机器人到达终点，reward等于一个很大的正数
\(R=(-h \cdot d)\) for attempting to ravel a distance \(d\) when an obstacle is closer than a certain safety range.

定义好这些reward后，我们可以使用它们来决定机器人的姿态、速度和现在的位置：比如在机器人的在这个状态下撞到障碍物的可能性很大时，通过reward对其进行惩罚。随后状态空间由gradient term \(\frac{\partial}{\partial \mathbf{w}_{j}}\)来计算。撞到障碍物的可能性由\(\zeta\left(\mathbf{w}_{j}\right)\)来决定，then we compute the gradient as:

\[\nabla V\left(\mathbf{w}_{j}\right)=\left(\left(\mathbf{w}_{j-1}-\mathbf{w}_{j}\right)+\left(\mathbf{w}_{j+1}-\mathbf{w}_{j}\right)\right)-\left(\zeta\left(\mathbf{w}_{j}\right)-\mathbf{w}_{j}\right)
\]

到这里理论的推导就完全结束了，但是我想问的问题很多…. 主要是没法根据现有的这一套去复现他的操作：

对于waypoint的定义应该就是state的意思了，也就是waypoint是五维的，包含速度信息，那么reward的定义仅根据距离d就定义是否没有考虑速度了？
既然第一步已经栅格化了地图，直接在栅格化的里面把每个栅格看做一个waypoint不行吗？前后有连接关系的那种
关于第一点的疑惑，还是说速度v和距离d是有关系的，也就是距离障碍物的d与速度成反比，但是文章中也没点明这个点

5 Experimental Results

这个具体的就是他们的规划时间虽然增多了，路径长了，但是运行时间短，也就是速度是连续的不需要减速，这幅图可能能体现出一点那个意思：

(a)(b)图以value function确实得出了一条路径，但是要是考虑可行性，例如车的尺寸在这里是54cm，那个走廊的宽度是60cm，这个肯定没有(c)(d)可行性高，或者说风险更小。(c)图是没有加入waypoint 也就是Table 1中第四步，(d)图中with additional waypoints可以看到在走廊的路径更为直线的行驶。【这样子的方式加入是否可以用到lattice形式，使motion planning直接加入到考虑更为平滑？但是这个速度的关系还是没有点明】

6 Conclusion

所以呢，整篇论文提出了：how seeding of the policy search with a solution from an approximate value function can be used to directly perform motion planning in the full state space of the robot.

而正如我前面提到的这里的MDP是fully observable，所以对于动态的障碍物可能无法，算是一个全局的policy search，更像是怎样更高速符合motion planning的方式去运行机器人

手机扫一扫

移动阅读更方便

你可能感兴趣的文章

杰哥教你面试之一百问系列:java中高级多线程concurrent的使用

Vue、SPA实现登陆

【论文阅读】Motion Planning through policy search

1 Introduction

2 Related work

3 Markov Decision Processes

4 Mobile Robot Navigation

5 Experimental Results

6 Conclusion