Heuristic Dynamic Programming in Python

This module combines Reinforcement Learning and Reservoir Computing by means of an Actor-Critic design. In Reinforcement Learning, the learning subject is represented by the agent, while the teacher is denoted as the environment (or plant). At each time step, the agent chooses an action \(a_t\), which leads it from state \(s_t\) to state \(s_{t+1}\). The state information is provided to the agent by the environment, together with a reward \(r_{t+1}\) which indicates how good or bad the new state is considered to be. Note that the reward cannot be used as a learning target, as it is not an error but merely a hint as to whether the agent is heading in the right direction. Instead, the agent's goal is to collect as much reward as possible over time. The Return expresses this by taking future rewards into account:

\[R_t = \sum\limits_{k=0}^T \gamma^k r_{t+k+1}\]

As it may not be meaningful to consider the whole future, the influence of a reward decreases the farther in the future it lies. This is controlled through the discount rate \(\gamma\). Further, experiments are often episodic (meaning that they terminate at some point). This is accounted for by summing up to the episode length \(T\) [RL].
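The formula above can be evaluated directly. The short sketch below (the function name `discounted_return` is chosen for illustration and is not part of the HDPy interface) computes the Return for an episode of length \(T\), given the rewards \(r_{t+1}, \dots, r_{t+T+1}\) and a discount rate \(\gamma\):

```python
def discounted_return(rewards, gamma):
    """Return R_t = sum_{k=0}^{T} gamma^k * r_{t+k+1} for an episode.

    ``rewards`` holds the sequence r_{t+1}, ..., r_{t+T+1};
    ``gamma`` is the discount rate.
    """
    return sum(gamma ** k * r for k, r in enumerate(rewards))

# With gamma = 0.5, three unit rewards contribute 1 + 0.5 + 0.25:
print(discounted_return([1.0, 1.0, 1.0], 0.5))  # 1.75
```

Note how a small \(\gamma\) makes the agent short-sighted (only near rewards matter), while \(\gamma\) close to 1 weighs distant rewards almost as strongly as immediate ones.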

An Actor-Critic design splits the agent into two parts: the Actor decides on the action, for which it is in turn criticised by the Critic. This means that the Critic learns long-term behaviour, i.e. it approximates the Return, while the Actor uses the Critic's approximation to select the action that maximizes the Return in a single step. This module incorporates Reservoir Computing as the Critic's function approximator [ESN-ACD].
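The division of labour can be sketched in a few lines. The sketch below uses a toy linear critic trained by temporal-difference learning purely for illustration (HDPy itself uses an Echo State Network as the approximator, and the class and function names here are hypothetical, not part of the HDPy interface):

```python
class LinearCritic:
    """Toy critic: approximates the Return V(s) with a linear model.

    This stands in for the reservoir-based approximator of the module;
    only the role in the Actor-Critic loop is illustrated here.
    """
    def __init__(self, n_features, lr=0.1, gamma=0.9):
        self.w = [0.0] * n_features
        self.lr, self.gamma = lr, gamma

    def value(self, s):
        # Linear value estimate V(s) = w . s
        return sum(wi * si for wi, si in zip(self.w, s))

    def update(self, s, r, s_next):
        # Temporal-difference error: r + gamma * V(s') - V(s)
        td = r + self.gamma * self.value(s_next) - self.value(s)
        self.w = [wi + self.lr * td * si for wi, si in zip(self.w, s)]
        return td

def greedy_actor(critic, s, actions, transition):
    """Pick the action whose successor state the critic values highest,
    i.e. maximize the approximated Return in a single step."""
    return max(actions, key=lambda a: critic.value(transition(s, a)))
```

In each interaction step the Actor queries the Critic to rank candidate actions, and the Critic refines its Return estimate from the observed reward; neither component needs a model of the environment beyond the sampled transitions.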

This documentation gives an overview of the module's functionality, provides a usage example and lists the interfaces. This order is kept consistent across most pages. The first four pages (Basics) list the basic interfaces and describe the methods which implement Reservoir Computing and Reinforcement Learning. These structures are independent of the experimental platform.

This package was originally implemented for two platforms, the Puppy and ePuck robots. The corresponding (and hence platform-dependent) code is documented in the second section (Platforms).

The third section (Resources) provides further information as well as download and installation resources.

Note that some of the examples write files. The paths are usually hardcoded and assume a Unix-like file tree; as the data is temporary, it is stored in /tmp. When working on other systems, the paths have to be adapted.

Furthermore, thanks to Python's built-in online help, the interface documentation is also available from within the interactive interpreter (e.g. IPython):

>>> import HDPy
>>> help(HDPy)


The examples have been written for Linux and use Unix-style paths, so on other systems they have to be adapted. Also note that some of the paths may require adaptation even on a Linux machine (e.g. the normalization data files).
