# Plants and Policies

## Introduction

As described in Reinforcement Learning, the learning problem is formulated by specifying a Plant and a Policy. For Puppy and ePuck, some examples have already been implemented.

To create a custom plant, the Plant class has to be subclassed. As the plant models the environment, it has to compute a reward and a state from sensor measurements. Together, these encode the abstract learning target that the problem designer has in mind. The implementation of a plant is straightforward: the two methods Plant.state_input() and Plant.reward() are called whenever the state or reward is requested. They are expected to return an $$N \times 1$$ vector and a scalar, respectively. The state space dimension $$N$$ may be announced through the plant's constructor and later queried by calling Plant.state_space_dim(). If the plant depends on the episode, the reset() method can be implemented as well to reset the instance's internal state. Note that the sensor values are not preprocessed, and specifically not normalized. For this purpose, a normalization instance (PuPy.Normalization) is automatically registered at Plant.normalization. Note that normalization is mandatory for Plant.state_input().
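As a sketch of this interface, the following defines a minimal plant that rewards forward speed. The stand-in Plant base class, the epoch keys ('vx', 'vy'), and the class name SpeedPlant are assumptions for illustration only; the real base class lives in HDPy, and normalization is omitted here for brevity.

```python
import numpy as np

class Plant:
    """Minimal stand-in for HDPy.Plant, so the sketch runs without HDPy."""
    def __init__(self, state_space_dim=None, norm=None):
        self._state_space_dim = state_space_dim
        self.normalization = norm

    def state_space_dim(self):
        return self._state_space_dim

class SpeedPlant(Plant):
    """Hypothetical plant: reward is the speed, state is the 2D velocity."""
    def __init__(self):
        # Announce the state space dimension through the constructor
        super().__init__(state_space_dim=2)

    def state_input(self, state):
        # Return an N x 1 vector; atleast_2d plus transpose yields (2, 1)
        return np.atleast_2d([state['vx'], state['vy']]).T

    def reward(self, epoch):
        # Scalar reward: the current speed
        return float(np.hypot(epoch['vx'], epoch['vy']))

plant = SpeedPlant()
s = plant.state_input({'vx': 3.0, 'vy': 4.0})
print(s.shape)                                # (2, 1)
print(plant.reward({'vx': 3.0, 'vy': 4.0}))   # 5.0
```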

The implementation of a custom policy is analogous to the creation of a new Plant. Here, the Policy class is to be subclassed and some of its methods implemented. As with Plant, the normalization and the action space dimension are automatically registered, in the latter case through the default constructor. Furthermore, the policy is reset at the beginning of a new episode through Policy.reset().

The action itself is completely defined through the methods Policy.initial_action(), Policy.update() and Policy.get_iterator(). The first yields a valid action, used for the initial behaviour (i.e. before the actor is in operation). The other two define the behaviour during the experiment. After an action has been selected, the Policy.update() method is called; it should take note of the new action and update internal structures. As with the state, the action is passed as an $$M \times 1$$ vector. This is followed by a call to Policy.get_iterator(), which in turn produces the sequence of motor targets, as requested by WebotsRobotMixin.
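The three methods can be sketched as follows for a hypothetical one-parameter policy. The stand-in Policy base class, the class name AmplitudePolicy, and the sine-wave motor targets are illustrative assumptions, not part of HDPy.

```python
import numpy as np

class Policy:
    """Minimal stand-in for HDPy.Policy, so the sketch runs without HDPy."""
    def __init__(self, action_space_dim=None, norm=None):
        self._action_space_dim = action_space_dim
        self.normalization = norm

    def action_space_dim(self):
        return self._action_space_dim

class AmplitudePolicy(Policy):
    """Hypothetical policy: the action is a single gait amplitude."""
    def __init__(self):
        super().__init__(action_space_dim=1)
        self.amplitude = 0.5

    def initial_action(self):
        # A valid M x 1 action for use before the actor is in operation
        return np.atleast_2d([self.amplitude])

    def update(self, action_upd):
        # Digest the new action computed by the ActorCritic
        self.amplitude = float(action_upd[0, 0])

    def get_iterator(self, time_start_ms, time_end_ms, step_size_ms):
        # Produce the motor-target sequence for the next epoch
        for t in range(time_start_ms, time_end_ms, step_size_ms):
            yield self.amplitude * np.sin(2.0 * np.pi * t / 1000.0)

    def reset(self):
        self.amplitude = 0.5

pol = AmplitudePolicy()
pol.update(np.atleast_2d([1.0]))
targets = list(pol.get_iterator(0, 100, 25))
print(len(targets))  # 4
```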

## Reference

class HDPy.Plant(state_space_dim=None, norm=None)

A template for Actor-Critic plants. The Plant describes the interaction of the Actor-Critic with the environment. Given a robot which follows a certain Policy, the environment generates rewards and robot states.

An additional instance to PuPy.Normalization may be supplied in norm for normalizing sensor values.

reset()

Reset plant to initial state.

reward(epoch)

A reward generated by the Plant based on the current sensor readings in epoch. The reward is single-dimensional.

The reward is evaluated at every step and forms the basis of the approximated return.

set_normalization(norm)

Set the normalization instance to norm.

state_input(state)

Return the state-part of the critic input (i.e. the reservoir input).

The state part is derived from the current robot state and possibly also its action. The return value is expected to be an $$N \times 1$$ numpy vector, with both dimensions present (e.g. via numpy.atleast_2d()).

Although the reservoir input consists of both the state and the action, this method must return only the state part.

state_space_dim()

Return the dimension of the state space. This value is equal to the size of the vector returned by state_input().
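The expected return format of state_input() can be illustrated with plain numpy; the sensor readings here are arbitrary example values.

```python
import numpy as np

# Build the N x 1 state vector from raw sensor readings, as expected
# by state_input(); atleast_2d plus a transpose turns a flat (3,)
# sequence into a (3, 1) column vector.
readings = [0.1, 0.5, 0.9]
state_vec = np.atleast_2d(readings).T
print(state_vec.shape)  # (3, 1)
```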

class HDPy.Policy(action_space_dim=None, norm=None)

A template for Actor-Critic policies. The Policy defines how an action is translated into a control (motor) signal. It continuously receives action updates from the Critic, which it has to digest.

An additional instance to PuPy.Normalization may be supplied in norm for normalizing sensor values.

action_space_dim()

Return the dimension of the action space. This value is equal to the size of the vector returned by initial_action().

get_iterator(time_start_ms, time_end_ms, step_size_ms)

Return an iterator for the motor_target sequence, according to the current action configuration.

The motor_targets glue the Policy and Plant together. Since they are applied in the robot and affect the sensor readouts, they are an "input" to the environment. As the targets are generated as an effect of the action update, they are an output of the policy.

initial_action()

Return the initial action. A valid action must be returned since the ActorCritic relies on the format.

The action has to be a 2-dimensional numpy vector ($$M \times 1$$), with both dimensions present.

reset()

set_normalization(norm)

Set the normalization instance to norm.

update(action_upd)

Update the Policy according to the current action update action_upd, which was in turn computed by the ActorCritic.

### Puppy Plants

class HDPy.puppy.plant.SpeedReward

A Plant with focus on the speed of the robot.

class HDPy.puppy.plant.LineFollower(origin, direction, reward_noise=0.01)

A Plant which gives negative reward proportional to the distance to a line in the xy plane. The line is described by its origin and the direction.

A Plant which gives negative reward proportional to the distance to point target in the xy plane. If the robot is closer than radius to the target, the reward will be 0.0.

A Plant which gives negative reward proportional to the distance to point target in the xy plane. If the robot is closer than radius to the target, the reward will be 0.0. The state is composed of the distance to predefined landmarks, specified with their coordinates in the xy plane. Gaussian noise will be added to the reward, if reward_noise is positive.

class HDPy.puppy.plant.DiffTargetLocationLandmarks(target, landmarks, reward_noise=0.01, init_distance=100)

A Plant which gives positive reward proportional to the absolute difference (between two episodes) in distance to point target in the xy plane. The state is composed of the distance to predefined landmarks, specified with their coordinates in the xy plane. Gaussian noise will be added to the reward, if reward_noise is positive.

Before the first call, the distance is set to init_distance.

### Puppy Policies

Puppy’s policies are based on manipulating gait parameters. There are four parameters and four legs, hence a total of 16 parameters. The parameters under control may be reduced (one to four) and legs may be grouped (All together, Front/Rear, Left/Right).

The class naming scheme is as follows:

[<Group>][<Param>]{1,4}

with <Group>:

- AA: All together
- FR: Front/Rear
- LR: Left/Right
- II: Individually

and <Param>:

- A: Amplitude
- P: Phase
- F: Frequency
- O: Offset

If several parameters are controlled at once, the <Param> identifiers are concatenated, up to four of them, in the order A, P, F, O.
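The naming scheme can be captured with a small regular expression; the helper function decode() is hypothetical and serves only to make the scheme concrete.

```python
import re

# Hypothetical helper: split a Puppy policy class name such as
# 'IIAPFO' into its leg grouping and the controlled gait parameters.
SCHEME = re.compile(r'^(AA|FR|LR|II)([APFO]{1,4})$')

def decode(name):
    group, params = SCHEME.match(name).groups()
    return group, list(params)

print(decode('IIAPFO'))  # ('II', ['A', 'P', 'F', 'O'])
print(decode('LRP'))     # ('LR', ['P'])
```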

Examples:

class HDPy.puppy.policy.FRA(gait)

Front/Rear; Amplitude

class HDPy.puppy.policy.LRA(gait)

Left/Right; Amplitude

class HDPy.puppy.policy.LRP(gait)

Left/Right; Phase

class HDPy.puppy.policy.IIAPFO(gait)

Individual; Amplitude, Phase, Frequency, Offset

### ePuck Plants

class HDPy.epuck.plant.CollisionAvoidanceFrontal(theta, obs_noise=0.0)

Bases: HDPy.rl.Plant

Plant for ePuck to realize collision avoidance. The state consists of the three frontal infrared sensors. The reward is negative, if one of the three frontal sensors reads a proximity lower than theta. Gaussian noise is added to the reward if obs_noise is positive.
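The threshold logic shared by the collision-avoidance plants can be sketched as below. The function name, the fixed penalty of -1.0, and the noise-free formulation are illustrative assumptions; the actual reward magnitude in HDPy may differ.

```python
def frontal_collision_reward(ir_sensors, theta, penalty=-1.0):
    # Negative reward if any of the considered infrared sensors reads
    # a proximity lower than the threshold theta, otherwise zero.
    return penalty if min(ir_sensors) < theta else 0.0

print(frontal_collision_reward([0.2, 0.8, 0.9], theta=0.5))  # -1.0
print(frontal_collision_reward([0.6, 0.8, 0.9], theta=0.5))  # 0.0
```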

class HDPy.epuck.plant.CollisionAvoidanceSideways(theta, obs_noise=0.0)

Bases: HDPy.rl.Plant

Plant for ePuck to realize collision avoidance. The state consists of the frontal and two sideways infrared sensors. The reward is negative, if one of those sensors reads a proximity lower than theta. Gaussian noise is added to the reward if obs_noise is positive.

class HDPy.epuck.plant.CollisionAvoidanceFull(theta, obs_noise=0.0)

Bases: HDPy.rl.Plant

Plant for ePuck to realize collision avoidance. The state consists of all eight infrared sensors. The reward is negative, if one of the sensors reads a proximity lower than theta. Gaussian noise is added to the reward if obs_noise is positive.

class HDPy.epuck.plant.Attractor(attractor, repeller, scale)

Bases: HDPy.rl.Plant

Plant for ePuck to guide it towards an attractor and away from a repeller. Both points are to be passed as tuples. The state consists of the robot's location. The reward is inversely proportional, with factor scale ($$s$$), to the distances to the attractor ($$\Delta_a$$) and the repeller ($$\Delta_r$$):

$$r = \frac{s}{\Delta_a} - \frac{s}{\Delta_r}$$
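The formula can be evaluated directly from the robot's xy position; the function name attractor_reward is hypothetical and simply restates the equation above.

```python
import math

def attractor_reward(pos, attractor, repeller, scale):
    # r = s/d_a - s/d_r, with Euclidean distances in the xy plane
    d_a = math.dist(pos, attractor)
    d_r = math.dist(pos, repeller)
    return scale / d_a - scale / d_r

# Robot at (1, 0): distance 1 to the attractor, 3 to the repeller
print(attractor_reward((1.0, 0.0), (2.0, 0.0), (-2.0, 0.0), 1.0))
```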

### ePuck Policies

Bases: HDPy.rl.Policy

ePuck policy with the heading as action.

For historical reasons, it is up to the implementation of the robot to interpret the action (i.e. whether it is considered relative or absolute).

Note that since Webots is not used for the ePuck simulation, the action sequence is reduced to a single item and hence not returned as a list. This behaviour works fine with the Robot class.

ePuck policy with the heading as action and random initialization.

The only difference to Heading is that the initial action is not 0.0 but randomly sampled from $$[0, 2\pi]$$.