Acta Automatica Sinica, 2012
Expectation-maximization Policy Search with Parameter-based Exploration
Abstract:
To reduce the large variance of gradient estimation caused by action-based stochastic exploration, an expectation-maximization (EM) policy search reinforcement learning method with parameter-based exploration is proposed. First, a policy is defined by a probability distribution over the parameters of a controller. Second, samples are collected by repeatedly drawing controller parameters directly from this distribution. Because the actions selected within each episode are then deterministic, sampling from the defined policy yields low-variance samples, which in turn reduces the variance of gradient estimation. Finally, based on the collected samples, the policy parameters are updated iteratively by maximizing a lower bound of the expected return. To reduce time consumption and lower the cost of sampling, an importance sampling technique is used to reuse samples collected during the policy update process. Simulation results on two continuous-space control problems show that, compared with several policy search reinforcement learning methods that use action-based stochastic exploration, the proposed method not only obtains a better final policy but also converges faster, and thus achieves better learning performance.
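To make the procedure described above concrete, the following is a minimal sketch of EM-style policy search with parameter-based exploration: a Gaussian distribution over controller parameters plays the role of the policy, each sampled parameter vector is executed deterministically, older samples are reused via importance weights, and the distribution is updated in closed form by return-weighted averaging (maximizing the lower bound of the expected return). The toy environment `episode_return`, the Gaussian form, and all hyper-parameters here are illustrative assumptions, not the paper's exact algorithm or settings.

```python
# Sketch of EM policy search with parameter-based exploration (assumptions noted above).
import numpy as np

rng = np.random.default_rng(0)
dim = 2                                # dimension of the controller parameter vector
theta_star = np.array([1.5, -0.7])     # hypothetical optimal controller parameters

def episode_return(theta):
    """Deterministic, non-negative return of one episode run with controller theta (toy)."""
    return np.exp(-np.sum((theta - theta_star) ** 2))

mu = np.zeros(dim)                     # mean of the parameter distribution (the "policy")
sigma = np.ones(dim)                   # per-dimension standard deviation

history = []                           # (theta, mu_at_sampling, sigma_at_sampling) for sample reuse

def log_gauss(x, m, s):
    """Log-density of a diagonal Gaussian, up to an additive constant."""
    return -0.5 * np.sum(((x - m) / s) ** 2 + 2 * np.log(s), axis=-1)

for it in range(50):
    # 1) Parameter-based exploration: sample controllers, run deterministic episodes.
    thetas = mu + sigma * rng.standard_normal((20, dim))
    for th in thetas:
        history.append((th, mu.copy(), sigma.copy()))
    history = history[-100:]           # bounded pool of reusable samples

    ths = np.array([h[0] for h in history])
    returns = np.array([episode_return(th) for th in ths])

    # 2) Importance sampling: reweight samples drawn under earlier distributions.
    old_logp = np.array([log_gauss(h[0], h[1], h[2]) for h in history])
    new_logp = log_gauss(ths, mu, sigma)
    iw = np.exp(new_logp - old_logp)

    # 3) EM-style update: returns act as (improper) responsibilities, so the
    #    lower bound of the expected return is maximized in closed form.
    w = returns * iw
    w /= w.sum()
    mu = np.sum(w[:, None] * ths, axis=0)
    sigma = np.sqrt(np.sum(w[:, None] * (ths - mu) ** 2, axis=0)) + 1e-6

print("learned mean parameters:", mu)
```

Because the update is a weighted average rather than a gradient step, no learning rate is needed; the variance reduction comes from the fact that each sampled controller acts deterministically within its episode, so all stochasticity lives in the parameter space.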