Learn to Move Through a Combination of Policy Gradient Algorithms: DDPG, D4PG, and TD3

Deep Reinforcement Learning has recently seen progress on continuous control tasks, driven by yearly challenges such as the NeurIPS Competition Track. This work combines complementary characteristics of two current state-of-the-art methods, Twin-Delayed Deep Deterministic Policy Gradient and Distributed Distributional Deep Deterministic Policy Gradient, and applies the combination to the Learn to Move - Walk Around locomotion control challenge, which was part of the NeurIPS 2019 Competition Track. The combined approach showed improved results and achieved 4th place in this competition. This article presents the combination and evaluates its performance.


Introduction
The NeurIPS 2019: Learn to Move - Walk Around challenge [1,2] poses a continuous control task for a physiologically plausible 3D walking agent in the physics-based OpenSim environment [3] that is to be controlled by activating the muscle fibers attached to the agent. The agent is supposed to follow a prescribed 2D velocity vector. The task became incrementally harder compared to the previous NeurIPS 2018: AI for Prosthetics challenge, in which the provided 1D velocity vector always had the same direction and only its magnitude changed.
We solved the task by combining the Twin-Delayed Deep Deterministic Policy Gradient (TD3) [4] and Distributed Distributional Deep Deterministic Policy Gradient (D4PG) [5] algorithms. Both algorithms are extensions of Deep Deterministic Policy Gradient (DDPG) [6] and implement several improvements (see table 1). The combined solution improved on the two algorithms individually and placed fourth out of 310 teams in the competition. In this paper, we evaluate the feasibility and performance of combining these improvements and compare the result to the performance of the two original algorithms in the NeurIPS 2019: Learn to Move - Walk Around challenge. The combined algorithm is tested against its components, TD3 and D4PG, in two experiments. Other top-ranked solutions for this and previous years' challenge variants are described in [7,8,5,2]. Deep Reinforcement Learning methods have been successfully applied in an increasing number of areas, ranging from computer games to robotic control [9,10,11,12,13,14,15].

Methods
Fig. 1. The task of the competition: developing a controller capable of locomotion for the skeleton, which can only be controlled via activation of the muscles of its legs. The figure shows the movement of the agent over a sequence of five time steps. Active muscles are shown in red, inactive muscles in blue.

Combination of algorithms
Our algorithm is based on DDPG and combines all improvements (see table 1 for an overview) introduced by TD3 and D4PG. The implementation of the TD3 and D4PG improvements is mostly straightforward (the contribution of each algorithm can be seen in algorithm 1; red highlights D4PG and green TD3). The improvements themselves do not interfere with each other, except for the clipped double Q-learning of TD3 and the distributional value function of D4PG (in algorithm 1, computing the Q-value with twin critics and choosing their minimum is part of the distributional update steps). The combined algorithm should provide a more stable learning signal while offering the same scalability as D4PG. Overall, we expect an improvement, first over TD3 because the sample efficiency is better, and second over D4PG because additional improvements that stabilize the learning process are deployed.
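The following sketch illustrates how the two sets of improvements can be combined in the critic target: clipped double Q-learning from TD3 applied on top of D4PG's distributional (quantile) critics, n-step returns, and target policy smoothing. It is a minimal illustration and not the authors' implementation; the network objects, tensor shapes, and the choice of keeping the per-sample distribution with the lower mean are assumptions.

```python
import numpy as np

def combined_target(rewards, next_states, done, target_actor, target_critics,
                    gamma=0.99, n_step=5, smoothing_std=0.2, noise_clip=0.5):
    """rewards: (batch, n_step) reward sequence; done: (batch,) terminal flags;
    target_critics: pair of distributional target critics returning quantile atoms."""
    # Target policy smoothing (TD3): perturb the target action with clipped Gaussian noise.
    mu = target_actor(next_states)
    noise = np.clip(np.random.normal(0.0, smoothing_std, size=mu.shape), -noise_clip, noise_clip)
    next_action = np.clip(mu + noise, 0.0, 1.0)            # muscle activations lie in [0, 1]

    # Distributional critics (D4PG): each returns quantile atoms of shape (batch, n_atoms).
    z1 = target_critics[0](next_states, next_action)
    z2 = target_critics[1](next_states, next_action)

    # Clipped double Q (TD3): per sample, keep the target distribution with the lower mean.
    take_first = (z1.mean(axis=1) <= z2.mean(axis=1))[:, None]
    z_min = np.where(take_first, z1, z2)

    # N-step return (D4PG): discounted sum of the n rewards plus the discounted atoms.
    discounts = gamma ** np.arange(n_step)
    n_step_reward = (rewards * discounts).sum(axis=1, keepdims=True)
    return n_step_reward + (1.0 - done)[:, None] * gamma ** n_step * z_min
```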

Comparison of Deterministic Policy Gradient algorithms
In the following, we describe experiments that demonstrate the viability of the combination of deterministic policy gradient algorithms for the NeurIPS 2019: Learn to Move - Walk Around challenge. We compare the combined approach to its constituent parts, TD3 and D4PG. Further, we elaborate on the details of the challenge, especially the reward function that had to be optimized, as its structure is critical for the reinforcement learning problem.

Algorithm 1 (excerpt). Learner:
8: Update critics θ_i ← θ_i + β_t δ_{θ_i}
9: if t mod d then
10: Update φ ← φ + α_t δ_φ and replicate the network weights to the actors
11: Update target networks θ′_i and φ′
12: end if
13: end for
14: return policy parameters φ
Actor:
1: repeat
2: Sample action a = π_θ(x) + N(0, 1)
3: Execute action a, observe reward r and state x′
4: Store (x, a, r, x′)
5: until learner finishes

The same configuration was used for all three algorithms to evaluate their characteristics. Actor and critic neural networks (TD3 and our algorithm operate with a pair of critics) were given three hidden layers with sizes (512, 512, 256) in all three cases. We used Gaussian noise for exploration. Further, for each algorithm we deployed one trainer thread to perform update steps and 22 sampler threads to produce the samples needed for the updates in parallel and store them in a shared replay buffer. Although TD3 originally has no distributed training framework, we extended the algorithm to allow a better evaluation of the other improvements. A sampler is a copy of the policy network that acts on the environment, whereas the trainer contains the algorithm that optimizes the policy and value function and copies the weights of the policy function to the sampler threads every 500 update steps.
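As a rough illustration of this sampler/trainer split, a single sampler thread can be sketched as follows; the environment, policy, buffer, and weight-exchange objects are placeholders, not the classes actually used in this work.

```python
import numpy as np

def sampler_loop(env, policy, shared_buffer, weight_queue, noise_std=0.1):
    """One of the 22 sampler threads: acts with Gaussian exploration noise, stores
    transitions in the shared replay buffer, and pulls fresh policy weights that the
    trainer pushes every 500 update steps."""
    obs = env.reset()
    while not shared_buffer.training_finished:
        if not weight_queue.empty():
            policy.set_weights(weight_queue.get())        # replicate the trainer's weights
        action = policy(obs) + np.random.normal(0.0, noise_std, size=policy.action_dim)
        action = np.clip(action, 0.0, 1.0)                # muscle activations lie in [0, 1]
        next_obs, reward, done, _info = env.step(action)
        shared_buffer.add(obs, action, reward, done, next_obs)
        obs = env.reset() if done else next_obs
```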
The implementation of prioritized experience replay [16] employed by D4PG and our approach uses the absolute TD errors as sample-weighting strategy and produces batches dependent on those weights, favoring more important samples. TD3 was implemented with a regular replay buffer, which is sampled uniformly. For realizing the distributional critic used in D4PG and our algorithm, we used a quantile distribution [17] consisting of 101 atoms. The n-step return horizon was set to five. Delayed updates of the policy networks were executed at a ratio of two critic updates for every actor update. For target policy smoothing we used Gaussian noise.
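The prioritized sampling by absolute TD error can be sketched as a proportional scheme of the following form; the exponent α and the exact buffer implementation are assumptions, not the code used here.

```python
import numpy as np

def sample_prioritized(td_errors, batch_size, alpha=0.6, eps=1e-6):
    """td_errors: absolute TD errors of all stored transitions (one per sample).
    Returns indices of a batch, favoring transitions with larger TD errors."""
    priorities = (np.abs(td_errors) + eps) ** alpha       # small eps keeps every sample reachable
    probabilities = priorities / priorities.sum()
    return np.random.choice(len(td_errors), size=batch_size, p=probabilities)
```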

OpenSim environment
The NeurIPS 2019: Learn to Move - Walk Around challenge poses the task of controlling a physiologically plausible 3D walking agent in the physics-based OpenSim environment [3] solely by activating the muscle fibers attached to the agent. The activation range of each muscle spans the continuous space between 0 and 1. The agent has 22 muscles distributed over its lower body, so the action space has 22 dimensions. The agent is supposed to follow a provided 2D velocity vector field. This vector field V is a 2 x 11 x 11 tensor of 2D velocities in the forward and leftward directions of the agent. It spans an 11 x 11 grid within 5 meters around the agent, with the agent at its center. The distance between neighboring points in the grid amounts to 0.5 meters (as can be seen in figure 2). The vector field is one part of the observation space the agent can access. The second part of the provided observation space is a dictionary of 97 observations for the pelvis state, ground reaction forces, joint angles and velocities, as well as muscle states such as their length. In total, the accessible observation space amounts to 341 dimensions. Our solution took into account only the target velocity at the agent's position, as well as the difference between the target velocity and the actual velocity, resulting in an observation space of 103 dimensions for our agent.
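One possible way to reduce the observations to the compact representation described above is sketched below; the flattening of the body observations and the exact composition of the 103-dimensional vector are assumptions rather than the authors' exact preprocessing.

```python
import numpy as np

def build_observation(v_field, body_obs, current_velocity):
    """v_field: (2, 11, 11) target-velocity field; body_obs: flattened body observations;
    current_velocity: 2D velocity of the pelvis in the same frame as the field."""
    target_velocity = v_field[:, 5, 5]                    # the agent sits at the grid center
    velocity_error = target_velocity - current_velocity   # difference to the actual velocity
    return np.concatenate([target_velocity, velocity_error, np.asarray(body_obs).ravel()])
```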
The environment provided two different reward functions, one for round one and one for round two of the NeurIPS 2019: Learn to Move - Walk Around, on which the agent was optimized. We used the reward function of the second round, as provided by the competition's environment, to conduct the experiments described in section 3. It was not shaped in any form. The environment returned a reward in each step (dense reward). The total reward J(π) is described as a sum of a reward for staying alive and a reward for performing footsteps, where the latter was granted for bridging a minimum distance between ground contacts while traveling in the right direction and using minimal effort in terms of muscle activation. The maximum number of steps per episode was set to 1000 in the first round and 2500 in the second round.
In equation (2), w_step, w_vel and w_effort refer to a constant weight of the stepping reward and to the weights of the effort and velocity costs. The costs and rewards are defined in terms of the simulation time step Δt_i = 0.01 s, the pelvis velocity v_pelvis, the target velocity v_vectorfield, and the muscle activations A_m. A bonus of 500 was given for successfully standing near the target for a certain time period (two to four seconds). This bonus could be achieved twice, as after earning the first bonus a new velocity field was spawned, which the agent also had to solve to successfully end the episode.
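Purely as a schematic illustration of this reward structure (dense alive reward plus a footstep term with velocity and effort costs), one could write something like the following; the actual functional form and weights are defined by the competition environment and are not reproduced here, so all names and values below are illustrative assumptions.

```python
def schematic_reward(alive, made_footstep, footstep_duration, velocity_error,
                     muscle_activations, w_step=1.0, w_vel=1.0, w_effort=1.0):
    """Illustrative only: dense alive reward plus a footstep term with velocity and effort costs."""
    reward = 1.0 if alive else 0.0                         # reward for staying alive in this step
    if made_footstep:
        reward += w_step * footstep_duration               # reward for completing a footstep
        reward -= w_vel * velocity_error                   # cost for deviating from the target velocity
        reward -= w_effort * sum(a ** 2 for a in muscle_activations)  # cost for muscle effort
    return reward
```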

Experiments
In the first experiment, we ran a test agent with no exploration noise every 500 updates of the policy weights and collected the reward of that episode. For each algorithm, we repeated the training process three times with different seeds. The average result of the three runs for each algorithm is plotted in figure 3. One repetition of the training process amounted to 1,000,000 update steps of the trainer thread. As described above, during training the episode length was constrained to 1000 steps per episode, which limits the achievable reward.
For the second experiment, we tested the performance of the trained networks from the first experiment. We ran each of the 9 agents (3 algorithms x 3 seeds) on the same 50 episodes after the completed training process (1,000,000 update steps). The maximum length of an episode was set to 2500 steps. As mentioned above, this was the setting of the second round of the Learn to Move challenge, in which the agent's task was to solve two velocity fields in one episode. This again limited the achievable reward. The networks were compared in terms of reward earned and steps taken.
Fig. 3. Results of experiment 1. In later stages of training the agent was able to achieve a bonus reward of 500 by standing at its target for multiple seconds, resulting in spikes in the later stages of training. However, the averaged episode rewards only amount to about 350 because the bonus occurred at different rates and times across the training runs.

Results of the experiments
The results of experiment 1, depicted in figure 3, show that the combination of algorithms converges faster to a well-performing policy, enabling the agent to achieve the second round's bonus reward of 500. Due to the restricted episode length of 1000 steps during training, the agent was not able to solve whole episodes of the environment (the default for difficulty 2 of the environment is 2500 steps per episode) and could not achieve the second bonus. Further, the averaged reward maxes out at around 350 per episode, which relates to the spikes in figure 3: the spikes only amount to a reward of around 350 because they are averaged over seeds in which the bonus occurred at different rates and times. Moreover, we can observe a smoother trend of the curve until convergence (around 600,000 steps) for the combination of algorithms in comparison to D4PG. In later stages of the training process we also observed that our approach has fewer low-reward outliers than D4PG. TD3 was not able to produce a policy that scored more than a reward of around 50 and therefore did not produce any high-reward outliers by scoring the bonus reward. D4PG was able to score the round-two bonus, but at a later stage of training than our proposed algorithm.
In experiment 2 (see table 2), we observed similar results. The trained policy of our combined approach was able to outperform both TD3 and D4PG. After training, the TD3 algorithm scored worse than D4PG and our approach in terms of average steps and reward. Our proposed combined approach produced a policy that scores about 30% higher than D4PG in terms of average reward and average steps taken per episode. We were also able to decrease the standard deviation of the reward by about 20% and of the steps by about 15%, which implies less proneness to failure modes.

Discussion
We found that combining the algorithms improved the results. TD3 could not solve the task at hand. The learned policy did not exhibit walking behavior, and at evaluation time with a frozen model each episode ended abruptly with the agent falling down. During training, TD3 was also not able to improve its behavior to the point where the standing bonus could be achieved. In general, TD3 fell behind D4PG and our approach. This could be because improvements such as prioritized sampling and n-step returns are helpful for solving the challenge posed by this particular environment. D4PG was able to exhibit better performance than TD3. At evaluation time, a frozen D4PG policy was able to move around and, in some episodes, earn the bonus by standing in the middle of the target for the required amount of time.
The combined approach was able to perform even better than D4PG. It maximized reward faster than D4PG and proved to be more stable, as the training curve of our algorithm has fewer low-reward outliers than that of D4PG. It also scored higher in the second experiment than TD3 and D4PG (see table 2), while having a smaller standard deviation of reward and steps than D4PG. However, it was not able to fully solve an environment of the second round. This might be because we chose to reduce the episode length to 1000 steps during training.
All in all, in the two experiments we could not find any unfavorable repercussions of combining the aforementioned improvements of D4PG and TD3 when comparing the integrated approach to its components.

A.1 Deterministic Policy Gradient Algorithms
Environments with continuous action spaces come closer to reality. Although they are often more difficult to solve, being able to handle this class of problems is necessary for progressing towards algorithms that can be deployed in the real world, e.g., as controllers for robots. In the NeurIPS 2019: Learn to Move - Walk Around challenge, the use of a continuous action space is justified by the goal of imitating a humanoid 3D model more realistically. This humanoid is controlled by activation of the muscles of its legs. The action space consists of activation intervals for the different muscles, from which a value can be sampled. Choosing suitable activation values to produce the desired behaviour is the goal of the environment provided by the challenge. Among the current state-of-the-art algorithms for solving such environments are the model-free off-policy algorithm Deep Deterministic Policy Gradient (DDPG, [6]) as well as its improved versions, Twin-Delayed DDPG (TD3, [4]) and Distributed Distributional DDPG (D4PG, [5]). In the following we describe these algorithms in more detail.

Deep Deterministic Policy Gradient (DDPG)
In this work we use DDPG as the baseline algorithm for solving locomotion reinforcement learning problems. DDPG is an off-policy, model-free algorithm that is able to solve problems in environments with continuous action spaces. It can be seen as a variant of Deep Q-Networks, as it combines Deterministic Policy Gradients (DPG, [19]) with Q-learning and further extensions, namely experience replay and target value and policy networks. DDPG is furthermore an actor-critic algorithm and consists of four neural networks in total: the actor, the critic, and their two target counterparts. The actor is also called the policy function π(s) = a, which computes an action for a given state. The critic, also referred to as the Q-value function, computes the Q-value, a numerical value that represents the discounted future reward of a state-action pair. The critic is also the main objective to be optimized, such that we find the maximal, true Q-value for a given state-action pair. We can derive the optimal Q-value function Q*(s, a) by minimizing the loss between the output of the function approximator and the Bellman equation
Q(s_t, a_t) = E[ r(s_t, a_t) + γ Q′(s_{t+1}, µ′(s_{t+1})) ],
which computes the Q-value for a given time step t. The discount rate γ diminishes the additional reward of steps further in the future, with the effect that immediate reward is preferred over future reward. The value and policy function on the right-hand side of the equation are the target value and policy functions; their role is discussed in the paragraph on target value and policy networks below.
Given the target function and the neural networks as function approximators (actor and critic networks), we can now derive the loss function
L(θ^Q) = E[ (Q(s_t, a_t | θ^Q) − Y_t)^2 ],
where Y_t is the target in a supervised learning sense and is computed by using the Bellman equation as an intermediate optimum:
Y_t = r(s_t, a_t) + γ Q′(s_{t+1}, µ′(s_{t+1} | θ^{µ′}) | θ^{Q′}).
Here θ^µ and θ^Q are the function parameters of the policy µ and the value function Q. In the next sections we discuss how the parameters of the policy are updated through the optimization step of the value function, as well as the components experience replay buffer, exploration noise, and target policy and value networks.
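A compact sketch of one DDPG update step built from the loss L and target Y_t above might look as follows; the network and optimizer objects are assumed to exist, and this is not the exact implementation used in this work.

```python
import torch
import torch.nn.functional as F

def ddpg_update(batch, actor, critic, target_actor, target_critic,
                actor_opt, critic_opt, gamma=0.99):
    obs, action, reward, done, next_obs = batch            # tensors with a leading batch dimension

    # Critic update: regress Q(s_t, a_t) towards the target Y_t from the Bellman equation.
    with torch.no_grad():
        y = reward + gamma * (1.0 - done) * target_critic(next_obs, target_actor(next_obs))
    critic_loss = F.mse_loss(critic(obs, action), y)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor update: deterministic policy gradient, i.e. ascend on Q(s, mu(s)).
    actor_loss = -critic(obs, actor(obs)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
```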
Deterministic policy gradients In an environment that provides a continuous action space we can derive a deterministic policy by using the Deterministic Policy Gradient (DPG). Rather than returning a probability distribution over actions A given a state, a deterministic policy µ(s) = a returns a single action in a deterministic way. The main objective J(θ) in an off-policy actor-critic algorithm, which mainly optimizes the value function, is defined as
J(θ) = ∫_S ρ^µ(s) Q(s, µ_θ(s)) ds = E_{s∼ρ^µ}[ Q(s, µ_θ(s)) ],
where θ are the policy parameters and S is the state space. The discounted state visitation density ρ^µ(s′) is defined as
ρ^µ(s′) = ∫_S Σ_{k=1}^∞ γ^{k−1} ρ_0(s) ρ^µ(s → s′, k) ds,
where ρ^µ(s → s′, k) gives the probability density of reaching state s′ from state s after moving k steps under policy µ, and ρ_0(s) is the initial distribution over states.
We can now compute the gradient of J(θ) using the deterministic policy gradient theorem:
∇_θ J(θ) = E_{s∼ρ^µ}[ ∇_a Q^µ(s, a)|_{a=µ_θ(s)} ∇_θ µ_θ(s) ].
First, the chain rule yields the gradient ∇_a Q^µ(s, a) of Q with respect to the action a. Second, we obtain the gradient ∇_θ µ_θ(s) of the deterministic policy with respect to θ, which optimizes our policy. As an example of how to compute updates, consider DPG in combination with the on-policy actor-critic algorithm SARSA. First, we compute the SARSA TD error
δ_t = r_t + γ Q_w(s_{t+1}, a_{t+1}) − Q_w(s_t, a_t).
The parameter update of the value function is then
w_{t+1} = w_t + α_w δ_t ∇_w Q_w(s_t, a_t),
and the deterministic policy gradient theorem yields the policy parameter update
θ_{t+1} = θ_t + α_θ ∇_θ µ_θ(s_t) ∇_a Q_w(s_t, a_t)|_{a=µ_θ(s_t)}.
One problem of using DPG is exploration, because of the deterministic nature of the policy we optimize. One way to address this is to add noise to the parameter space or the action space, which in this case results in an off-policy, non-deterministic behaviour policy.
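To make the SARSA-style example concrete, the following toy sketch uses a 1-D quadratic critic and a linear policy; the feature choice, step sizes, and critic form are illustrative assumptions only.

```python
import numpy as np

# Critic: Q_w(s, a) = w0*s + w1*a + w2*a^2; policy: mu_theta(s) = theta*s (all scalars).
def q(w, s, a):        return w[0] * s + w[1] * a + w[2] * a * a
def dq_da(w, a):       return w[1] + 2.0 * w[2] * a     # gradient of Q w.r.t. the action
def grad_w_q(s, a):    return np.array([s, a, a * a])   # gradient of Q w.r.t. its parameters
def mu(theta, s):      return theta * s
def dmu_dtheta(s):     return s                         # gradient of the policy w.r.t. theta

def dpg_sarsa_step(s, a, r, s_next, a_next, w, theta,
                   alpha_w=0.01, alpha_theta=0.01, gamma=0.99):
    delta = r + gamma * q(w, s_next, a_next) - q(w, s, a)   # SARSA TD error
    w = w + alpha_w * delta * grad_w_q(s, a)                # value function parameter update
    a_pi = mu(theta, s)
    theta = theta + alpha_theta * dmu_dtheta(s) * dq_da(w, a_pi)  # deterministic policy gradient step
    return w, theta
```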
Exploration noise As mentioned above, DPG updates can inhibit exploration depending on the environment. To ensure exploration in the continuous action space, DDPG uses an exploration policy µ′ in which noise is added to the actions of the policy network µ:
µ′(s_t) = µ(s_t | θ^µ) + N.
N denotes noise sampled from a noise generating process, such as Gaussian noise. The authors of the DDPG paper suggest using the Ornstein-Uhlenbeck process [20] for exploring physical environments, as it allows temporally correlated exploration.
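A minimal sketch of such an Ornstein-Uhlenbeck process is given below; the parameter values are common defaults, not necessarily those used in this work.

```python
import numpy as np

class OUNoise:
    def __init__(self, size, mu=0.0, theta=0.15, sigma=0.2, dt=1e-2):
        self.mu, self.theta, self.sigma, self.dt = mu, theta, sigma, dt
        self.x = np.full(size, mu, dtype=np.float64)

    def sample(self):
        # dx = theta*(mu - x)*dt + sigma*sqrt(dt)*N(0, 1): temporally correlated noise
        dx = self.theta * (self.mu - self.x) * self.dt \
             + self.sigma * np.sqrt(self.dt) * np.random.randn(*self.x.shape)
        self.x = self.x + dx
        return self.x
```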
Target value and policy networks DDPG utilizes frozen copies of the value and policy functions to compute the target Y_t defined above. More specifically, they are used to compute the right-hand side of the Bellman equation, as it was found that the learning process becomes less stable without these copies because the weights change during optimization. The learning process thus consists of the following steps: first, a batch of training data is sampled from the experience buffer. Second, the loss L is computed using the target value and policy networks to generate Y_t. After the update steps of the value and policy networks, the target networks are soft-updated by
θ^{Q′} ← τ θ^Q + (1 − τ) θ^{Q′},   θ^{µ′} ← τ θ^µ + (1 − τ) θ^{µ′},
where θ^Q and θ^µ are the parameters of the value and policy networks, and θ^{Q′} and θ^{µ′} are the parameters of the target value and target policy networks. The constant τ ≪ 1 is a hyperparameter that realizes the soft update by scaling down the update step, so that the parameters of the target networks change more slowly than those of the actor and critic networks.
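The soft update rule can be written, for example for PyTorch modules, as in the following sketch; the value of τ shown is a typical default, not necessarily the one used here.

```python
import torch

def soft_update(target_net: torch.nn.Module, net: torch.nn.Module, tau: float = 0.005):
    with torch.no_grad():
        for target_param, param in zip(target_net.parameters(), net.parameters()):
            # theta_target <- tau * theta + (1 - tau) * theta_target
            target_param.mul_(1.0 - tau).add_(tau * param)
```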
Experience replay buffer An experience sample typically consists of the tuple (s_n, a, r, d, s_{n+1}), where s_n is the current state, s_{n+1} the next state, a the action, r the reward, and d a boolean indicating whether the episode is over. DDPG makes use of an experience replay buffer in which samples generated by the interaction of the policy with the environment are stored and from which batches are sampled to perform updates using the value function, the Bellman equation, and DPG.
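A minimal uniform replay buffer matching this description could look as follows; the capacity is an illustrative value.

```python
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=1_000_000):
        self.buffer = deque(maxlen=capacity)   # oldest samples are discarded once full

    def add(self, state, action, reward, done, next_state):
        self.buffer.append((state, action, reward, done, next_state))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)
```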

A.2 Twin-delayed Deep Deterministic Policy Gradient (TD3)
One common problem of DDPG is the overestimation of the Q-value, which in turn can break the policy. Twin-Delayed Deep Deterministic Policy Gradient diminishes this effect by extending the DDPG algorithm with three additional improvements. The first improvement is the introduction of a second value function network (twin critics) to learn two Q-functions. Second, the policy network is updated less frequently than the value networks. The third extension is target policy smoothing, i.e., adding a small amount of noise to the output of the target policy network. All these extensions provide more stability for approximating the optimal policy.
Clipped double Q-learning Addressing overestimation of the Q-value, i.e., a state-action pair being incorrectly valued too high, the first improvement of TD3 over DDPG is the use of two critics, or value function networks, instead of one (which also means two target critics). The two value functions are optimized towards one shared target, which uses the minimum of the Q-values estimated by both target critics:
Y_t = r_t + γ min_{i=1,2} Q′_i(s_{t+1}, µ′(s_{t+1})).
By always choosing the minimum Q-value, it becomes more difficult for the value functions to develop an overestimation of the Q-value for certain inputs.
Delayed policy network updates Updating the policy network less frequently makes it harder for the value function to converge on a failure mode in which it incorrectly overestimates actions. In a scenario where the value function starts to overestimate the outputs of a poor policy, additional updates of the value network while keeping the policy fixed can overcome the incorrect estimation before the poorly performing policy is reinforced further.
Target policy smoothing The third improvement of TD3 also concerns the target Y_t. The action produced by the target policy network, which is used in the target Q-function, is modified by adding a small amount of noise that is clipped to an interval:
ã = µ′(s_{t+1}) + ε,   ε ∼ clip(N(0, σ), −c, c).
This has the effect of covering a small region around the action in the action space instead of evaluating a single deterministic action. In case the value function produces an incorrectly large Q-value for a certain action, adding a clipped amount of noise to the action acts as a regularizer, as the high-valued action gets smoothed by the noise.
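The clipped double-Q target and target policy smoothing described in this section can be combined into a single target computation, sketched below; tensor names and the clipping range of the actions (here the muscle-activation range [0, 1]) are assumptions.

```python
import torch

def td3_target(reward, done, next_obs, target_actor, target_critic1, target_critic2,
               gamma=0.99, sigma=0.2, noise_clip=0.5):
    with torch.no_grad():
        mu = target_actor(next_obs)
        noise = (torch.randn_like(mu) * sigma).clamp(-noise_clip, noise_clip)  # smoothing noise
        next_action = (mu + noise).clamp(0.0, 1.0)        # keep actions in the valid range
        q1 = target_critic1(next_obs, next_action)
        q2 = target_critic2(next_obs, next_action)
        return reward + gamma * (1.0 - done) * torch.min(q1, q2)  # clipped double Q
```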

A.3 Distributed Distributional Deep Deterministic Policy Gradient (D4PG)
D4PG, similar to TD3, is an extended version of DDPG. It implements four additional improvements, which overall address the stability and scalability of DDPG. The first improvement, a distributional value function, provides a more stable estimation of the Q-value. Second, the process of gathering experiences is distributed over a number of policy networks acting in parallel, which store their experiences in a shared experience replay buffer. The third improvement, prioritized experience replay, weights the produced experiences so that important experiences are sampled more often than others. The last improvement is n-step returns: when computing the TD error, n-step returns allow a more confident estimation of a state-action pair's value by accumulating the reward over n steps into the future.
Multiple samplers To address the sample inefficiency of model-free reinforcement learning, multiple copies of the policy network run in parallel to produce samples and store them in a shared experience buffer. The copies are updated at the same time, and the number of samplers can be chosen as required.
Distributional value function D4PG uses a distributional version of the critic update. This means that the Q-value is modeled as a random variable; the value function maps the input, a state-action pair, to a distribution Z_w parameterized by w. Given Q_w(s, a) = E[Z_w(s, a)], the loss for the distributional value function is obtained by minimizing a distance between two distributions,
L(w) = E[ d(T_{µ_θ} Z_{w′}(s, a), Z_w(s, a)) ],
where T_{µ_θ} is the distributional Bellman operator and Z_{w′} is the target value distribution. As [21] show, this improvement results in a more stable learning signal.
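One possible instantiation of such a distributional critic update is a quantile-regression loss, sketched below; the number of atoms follows the setup described earlier (101), while the Huber threshold and tensor layout are assumptions.

```python
import torch

def quantile_loss(pred_quantiles, target_quantiles, kappa=1.0):
    """pred_quantiles: (batch, n_atoms) quantiles of Z_w(s, a); target_quantiles: (batch, n_atoms)
    quantiles of the (already detached) target distribution T Z_w'(s, a)."""
    n_atoms = pred_quantiles.shape[1]
    taus = (torch.arange(n_atoms, dtype=torch.float32) + 0.5) / n_atoms      # quantile midpoints
    # Pairwise TD errors between every target atom and every predicted atom.
    td = target_quantiles.unsqueeze(1) - pred_quantiles.unsqueeze(2)          # (batch, n_pred, n_tgt)
    huber = torch.where(td.abs() <= kappa, 0.5 * td.pow(2), kappa * (td.abs() - 0.5 * kappa))
    weight = (taus.view(1, -1, 1) - (td.detach() < 0).float()).abs()          # quantile weighting
    return (weight * huber).mean()
```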
N-step returns When constructing the target and doing the forward pass of the value network to compute the loss, this improvement replaces the one-step reward with the sum of rewards over n steps. The target incorporating n-step returns is computed as
Y_t = Σ_{k=0}^{n−1} γ^k r(s_{t+k}, a_{t+k}) + γ^n Q′(s_{t+n}, µ′(s_{t+n})).
This estimates the future reward more accurately.
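A small sketch computing such an n-step target from a stored reward sequence:

```python
def n_step_target(rewards, bootstrap_q, done, gamma=0.99):
    """rewards: list of the n rewards r_t ... r_{t+n-1}; bootstrap_q: Q'(s_{t+n}, mu'(s_{t+n}))."""
    n = len(rewards)
    discounted = sum((gamma ** k) * r for k, r in enumerate(rewards))  # sum of discounted rewards
    return discounted + (0.0 if done else (gamma ** n) * bootstrap_q)  # bootstrap unless terminal
```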
Prioritized experience replay Instead of sampling uniformly from the replay buffer, the samples stored in the prioritized experience replay buffer are weighted with an importance weight and sampled with a non-uniform probability p_i. The weight that adjusts this probability can, for example, be derived from the TD error, with the effect that samples with a high TD error are sampled more often than others.