Reinforcement Learning


The Reinforcement Learning problem is approached by means of an Actor-Critic design. This method splits the agent into a return estimator (Critic) and an action-selection mechanism (Actor). Information about state and reward is provided to the agent by the plant. Since the agent is still viewed as a single unit, both of its parts are embedded in the same class, ActorCritic. This class does not itself implement a method to solve the learning problem but only provides the preliminaries for an algorithm that does, i.e. it defines common members and method interfaces. Furthermore, it binds the Actor-Critic approach to a PuPy.RobotActor, such that any of its descendants can be used within PuPy.

The Actor-Critic implementation is kept general, meaning that it is not limited to a specific learning problem. For this purpose, the template classes Plant and Policy are defined. Using the former, a concrete environment can be implemented by specifying state and reward. The latter class is required to hide the representation of the action from the ActorCritic. Due to the integration in PuPy, motor targets (a low-level representation of an action) have to be generated, but the action representation for the Reinforcement Learning problem may be more abstract. For example, gait parameters could be used as the action; from these, a motor target sequence has to be generated to actually steer the robot.
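As a sketch of how the Plant template is meant to be specialised, the following hypothetical plant uses the robot's position as state and the distance covered per step as reward. The stand-in base class, the method names and the epoch layout (a dict of sensor reading sequences) are assumptions for illustration only; the real HDPy.Plant provides additional machinery, e.g. normalization.

```python
import numpy as np

# Minimal stand-in base class mirroring the Plant template described
# above (assumption for illustration; the real HDPy.Plant offers more).
class Plant:
    def __init__(self, state_space_dim=None, norm=None):
        self._state_space_dim = state_space_dim
        self._norm = norm

    def state_space_dim(self):
        return self._state_space_dim

# Hypothetical plant: the state is the robot's (x, y) position and the
# reward is the distance covered since the last evaluation.
class SpeedPlant(Plant):
    def __init__(self):
        super(SpeedPlant, self).__init__(state_space_dim=2)
        self._last_pos = np.zeros(2)

    def state_input(self, epoch):
        # Return an Nx1 vector, as the template requires (two dimensions).
        return np.atleast_2d([epoch['x'][-1], epoch['y'][-1]]).T

    def reward(self, epoch):
        pos = np.array([epoch['x'][-1], epoch['y'][-1]])
        rew = float(np.linalg.norm(pos - self._last_pos))
        self._last_pos = pos
        return rew
```

A Policy specialisation would follow the same pattern, translating the abstract action (e.g. gait parameters) into a motor target sequence.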

On top of the ActorCritic implementation, this module provides a couple of algorithms that solve a problem stated in terms of Reinforcement Learning. All algorithms follow the same approach, namely (action-dependent) Heuristic Dynamic Programming. The baseline algorithm is implemented in ADHDP. A data-collecting version is provided in CollectingADHDP; as the class is already embedded in the PuPy framework through ActorCritic, the collector works in the same fashion as PuPy.RobotCollector.

If a new specialisation of an ActorCritic is created, typically its ActorCritic._step() method is adapted (this is, for example, the case in ADHDP). If so, the two methods ActorCritic._pre_increment_hook() (before returning) and ActorCritic._next_action_hook() (after computation of the next action) should be called, as other structures may rely on them.
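The call order of the two hooks can be sketched as follows. This is a hypothetical skeleton with the surrounding class machinery (critic training, action optimization) stubbed out; only where the hooks are invoked is the point.

```python
# Hypothetical skeleton of an ActorCritic specialisation, showing where
# the two hooks belong within an overridden _step() method.
class MyActorCritic:
    def _next_action_hook(self, a_next):
        # Postprocess the proposed action (default: pass through).
        return a_next

    def _pre_increment_hook(self, epoch, **kwargs):
        # Invoked with relevant intermediate results (default: no-op).
        pass

    def _step(self, s_curr, s_next, a_curr, reward):
        # ... train the critic and compute the next action here ...
        a_next = a_curr  # placeholder for the algorithm-specific update
        # Call this hook as soon as the next action is determined:
        a_next = self._next_action_hook(a_next)
        # And this one before returning, with intermediate results:
        self._pre_increment_hook(s_next, a_curr=a_curr, a_next=a_next,
                                 reward=reward)
        return a_next
```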

Some variations of the baseline algorithm are implemented as well, in ActionGradient, ActionRecomputation and ActionBruteForce. They fulfil the same purpose but approach it differently (specifically, the actor is implemented differently). The details are given in the documentation of the respective class.


class HDPy.Plant(state_space_dim=None, norm=None)

A template for Actor-Critic plants. The Plant describes the interaction of the Actor-Critic with the environment. Given a robot which follows a certain Policy, the environment generates rewards and robot states.

An additional PuPy.Normalization instance may be supplied in norm for normalizing sensor values.


Reset plant to initial state.


A reward generated by the Plant based on the current sensor readings in epoch. The reward is single-dimensional.

The reward is evaluated in every step. It builds the foundation of the approximated return.


Set the normalization instance to norm.


Return the state-part of the critic input (i.e. the reservoir input).

The state-part is derived from the current robot state and possibly also its action. As return format, an Nx1 numpy vector is expected, i.e. the array must have two dimensions (e.g. via numpy.atleast_2d()).

Although the reservoir input will consist of both the state and the action, this method must only return the state part of it.


Return the dimension of the state space. This value is equal to the size of the vector returned by state_input().

class HDPy.Policy(action_space_dim=None, norm=None)

A template for Actor-Critic policies. The Policy defines how an action is translated into a control (motor) signal. It continuously receives action updates from the Critic, which it has to digest.

An additional PuPy.Normalization instance may be supplied in norm for normalizing sensor values.


Return the dimension of the action space. This value is equal to the size of the vector returned by initial_action().

Policy.get_iterator(time_start_ms, time_end_ms, step_size_ms)

Return an iterator for the motor_target sequence, according to the current action configuration.

The motor_targets glue the Policy and Plant together. Since they are applied in the robot and affect the sensor readouts, they are an “input” to the environment. As the targets are generated as an effect of the action update, they are an output of the policy.


Return the initial action. A valid action must be returned since the ActorCritic relies on the format.

The action has to be a 2-dimensional numpy vector, i.e. an Nx1 array with both dimensions present.


Undo any policy updates.


Set the normalization instance to norm.


Update the Policy according to the current action update action_upd, which was in turn computed by the ActorCritic.

class HDPy.ActorCritic(plant, policy, gamma=1.0, alpha=1.0, init_steps=1, norm=None, momentum=0.0)

Actor-critic design.

The Actor-Critic estimates the return function

\[J_t = \sum\limits_{k=0}^{T} \gamma^k r_{t+k+1}\]

while the return is optimized at the same time. This is done by incrementally updating the estimate for \(J_t\) and choosing the next action by optimizing the return in a single step. See [ESN-ACD] for details.
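For a finite reward sequence, the quantity the critic estimates can be checked directly. This toy sketch (an assumption for illustration, not part of the library) computes the discounted return from the formula above:

```python
# Toy sketch of the discounted return J_t = sum_k gamma^k * r_{t+k+1}
# for a finite reward sequence.
def discounted_return(rewards, gamma):
    return sum(gamma ** k * r for k, r in enumerate(rewards))

j0 = discounted_return([1.0, 1.0, 1.0], 0.5)  # 1 + 0.5 + 0.25 = 1.75
```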

reservoir – A reservoir instance compliant with the interface of ReservoirNode. Specifically, it must provide a reset method, and reset_states must be False. The input dimension must be compliant with the specification of the action.
readout – The reservoir readout function. An instance of PlainRLS is expected. Note that the readout must include a bias. The dimensions of reservoir and readout must match, and the output of the latter must be single-dimensional.
plant – An instance of Plant. The plant defines the interaction with the environment.
policy – An instance of Policy. The policy defines the interaction with the robot’s actuators.
gamma – Choice of gamma in the return function. May be a constant or a function of the time (relative to the episode start).

alpha – Choice of alpha in the action update. May be a constant or a function of the time (relative to the episode start).

The corresponding formula is

\[a_{t+1} = a_{t} + \alpha \frac{\partial J_t}{\partial a_t}\]

See [ESN-ACD] for details.

norm – A PuPy.Normalization for normalization purposes. Note that the parameters for a_curr and a_next should be exchangeable, since it’s really the same kind of ‘sensor’.
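The action update formula can be sketched numerically. The quadratic critic below is a toy stand-in with a known closed-form gradient (an assumption for illustration; in the library the gradient comes from the reservoir/readout critic); repeated updates climb towards its maximum:

```python
# Numeric sketch of the action update a_{t+1} = a_t + alpha * dJ/da_t,
# using a toy critic J(a) = -(a - 1)^2 whose gradient is known.
def action_update(a_curr, alpha, grad_j):
    return a_curr + alpha * grad_j(a_curr)

grad_j = lambda a: -2.0 * (a - 1.0)  # gradient of the toy critic
a = 0.0
for _ in range(50):
    a = action_update(a, alpha=0.1, grad_j=grad_j)
# a approaches the maximizer of J (here 1.0)
```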

Start a new episode of the same experiment. This method can also be used to initialize the ActorCritic, for example when it is loaded from a file.

__call__(epoch, time_start_ms, time_end_ms, step_size_ms)

One round in the actor-critic cycle. The current observations are given in epoch and the timing information in the rest of the parameters. For a detailed description of the parameters, see PuPy.PuppyActor.

This routine computes the reward from the epoch and manages consecutive epochs, then lets _step() compute the next action.

init_episode(epoch, time_start_ms, time_end_ms, step_size_ms)

Define the behaviour during the initial phase, i.e. as long as

num_step <= init_steps

with num_step the episode’s step iterator and init_steps given at construction (default 1). The default is to store the epoch but do nothing else.


The step iterator num_step is incremented before this method is called.

_step(s_curr, s_next, a_curr, reward)

Execute one step of the actor and return the next action.

When overloading this method, it must be ensured that _next_action_hook() is executed as soon as the next action is determined and also _pre_increment_hook() should be called before the method returns (passing relevant intermediate results).

s_curr – Previously observed state. dict, same as epoch of __call__().
s_next – Latest observed state. dict, same as epoch of __call__().
a_curr – Previously executed action. This is the action which led from s_curr to s_next. Type specified through the Policy.
reward – Reward of s_next.
_pre_increment_hook(epoch, **kwargs)

Template method for subclasses.

Before the actor-critic cycle increments, this method is invoked with all relevant locals of the ADHDP.__call__() method.


Postprocessing hook, after the next action a_next was proposed by the algorithm. Must return the possibly altered next action in the same format.


Store the current instance in a file at pth.


If alpha or gamma was set to a user-defined function, make sure it’s picklable. In particular, anonymous functions (lambda) can’t be pickled.

static load(pth)

Load an instance from a file pth.


Set the normalization instance to norm. The normalization is propagated to the plant and policy.


Define a value for alpha. May be either a constant or a function of the time.


Define a value for gamma. May be either a constant or a function of the time.


Define a value for momentum. May be either a constant or a function of the time.

class HDPy.Momentum

Template class for an action momentum.

With a momentum, the next action is computed from the latest one and the proposed action \(a^*\). The momentum controls how much each of the two influences the next action. Generally, a momentum of zero implies strictly following the proposal, while a momentum of one does the opposite. Usually, the (linear) momentum is formulated as

\[a_{t+1} = m a_t + (1-m) a^*\]

The momentum may be time dependent with:

- time0: Episode counter
- time1: Episode’s step counter
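The linear momentum equation above can be sketched in a few lines. This is an illustrative stand-in for ConstMomentum with a constant m, not the library’s implementation:

```python
import numpy as np

# Sketch of the linear momentum blend a_{t+1} = m * a_t + (1 - m) * a*,
# with a constant momentum value m.
def blend(a_curr, a_prop, m):
    a_curr = np.asarray(a_curr, dtype=float)
    a_prop = np.asarray(a_prop, dtype=float)
    return m * a_curr + (1.0 - m) * a_prop

# m = 0.0 follows the proposal strictly; m = 1.0 keeps the old action.
```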

__call__(a_curr, a_prop, time0=None, time1=None)

Return the next action from a current action a_curr, a proposal a_prop at episode time0 in step time1.

class HDPy.ConstMomentum(value)

Bases: HDPy.rl.Momentum

Linear momentum equation, as specified in Momentum with time-constant momentum value (m).

value – Momentum value, in [0, 1].
class HDPy.RadialMomentum(value)

Bases: HDPy.rl.Momentum

Momentum with respect to angular action. The resulting action is the (smaller) intermediate angle of the latest action and proposal (with respect to the momentum). The actions are supposed to be in radians, hence the output is in the range \([0,2\pi]\). The momentum is a time-constant value (m).

value – Momentum value, in [0, 1].
class HDPy.ADHDP(reservoir, readout, *args, **kwargs)

Bases: HDPy.rl.ActorCritic

Action dependent Heuristic Dynamic Programming structure and basic algorithm implementation.

In the _step() method, this class provides the implementation of a baseline algorithm. By default, the behaviour is online, i.e. the critic is trained and the actor is in effect. Note that the actor can be modified through the _next_action_hook() routine.

reservoir – Critic reservoir. Should have been initialized.
readout – Reservoir readout. Usually an online linear regression (RLS) instance, like StabilizedRLS.
_critic_eval(state, action, simulate, action_name='a_curr')

Evaluate the critic at state and action.


Return the critic’s derivative at r_state.

init_episode(epoch, time_start_ms, time_end_ms, step_size_ms)

Initial behaviour (after reset)


Assuming identical initial trajectories, the initial state is the same and thus doesn’t matter. Non-identical initial trajectories will result in non-identical behaviour, so the initial state should differ (initial state with respect to the start of learning). For this reason, the critic is already updated during the initial trajectory.

_step(s_curr, s_next, a_curr, reward)

Execute one step of the actor and return the next action.

This is the baseline ADHDP algorithm. The next action is computed as

\[a_{t+1} = m a_t + (1-m) \left( a_t + \alpha \frac{\partial J(s_t, a_t)}{\partial a} \right)\]

with \(m\) the momentum and \(\alpha\) the step size. The critic is trained on the TD error with discount rate \(\gamma\):

\[err = r + \gamma J(s_{t+1}, a_{t+1}) - J(s_t, a_t)\]
s_next – Latest observed state. dict, same as epoch of __call__().
s_curr – Previously observed state. dict, same as epoch of __call__().
reward – Reward of s_next.
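The TD error above is simple enough to sketch directly. Scalar values stand in for the critic evaluations \(J(s, a)\) that the reservoir and readout produce in the library:

```python
# Sketch of the TD error err = r + gamma * J(s_{t+1}, a_{t+1}) - J(s_t, a_t)
# used to train the critic; scalars stand in for the critic evaluations.
def td_error(reward, gamma, j_next, j_curr):
    return reward + gamma * j_next - j_curr

err = td_error(1.0, 0.9, 2.0, 2.5)  # 1.0 + 1.8 - 2.5 = 0.3
```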
class HDPy.CollectingADHDP(expfile, *args, **kwargs)

Bases: HDPy.hdp.ADHDP

Actor-Critic design with data collector.

A collector (PuPy.PuppyCollector instance) is created for recording sensor data and actor-critic internals together. The file is stored at expfile.

_pre_increment_hook(epoch, **kwargs)

Add the ADHDP internals to the epoch and use the collector to save all the data.

class HDPy.ActionGradient(*args, **kwargs)

Bases: HDPy.hdp.CollectingADHDP

Determine the next action by gradient ascent search. The gradient ascent computes the action which maximizes the predicted return for a fixed state.

Additional keyword arguments:

Stop the gradient ascent if the gradient falls below this threshold.
Maximum number of gradient ascent steps.
class HDPy.ActionRecomputation(expfile, *args, **kwargs)

Bases: HDPy.hdp.CollectingADHDP

Determine the next action the same way as the baseline algorithm for critic training, then recompute it based on the updated critic and the latest state information.

class HDPy.ActionBruteForce(candidates, *args, **kwargs)

Bases: HDPy.hdp.CollectingADHDP

Find the optimal action by computing the expected return at different sampled locations and picking the action which yields the highest one.

candidates – Action samples. Must be a list of valid actions.
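The brute-force selection amounts to evaluating the critic at each candidate and keeping the best one. The toy critic below is a stand-in for the reservoir/readout model (an assumption for illustration):

```python
# Sketch of brute-force action selection: evaluate the critic at each
# candidate action and pick the one with the highest predicted return.
def best_action(candidates, predicted_return):
    returns = [predicted_return(a) for a in candidates]
    return candidates[returns.index(max(returns))]

toy_j = lambda a: -(a - 0.5) ** 2  # toy critic peaked at a = 0.5
choice = best_action([0.0, 0.25, 0.5, 0.75, 1.0], toy_j)  # picks 0.5
```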


