:mod:`sampling` --- Sampling Methods
************************************

.. contents::

.. module:: ailib.sampling

Often, one would like to acquire samples that follow a certain probability
distribution. For example, in algorithm testing, training samples are
required but real-world measurements are too rare to be of use, so
artificial samples have to be generated. In almost any case, the random
samples have to fulfill some constraints, expressed through the shape of
the distribution. Thus, the standard scenario is that a probability
distribution is given (i.e. assumed) and one requires samples that, in
large numbers, match the given distribution.

Sampling from known distributions
=================================

In the simplest case, a well-known :ref:`distribution function ` is
provided. A probability distribution can be characterized by the
probability density function :math:`p(x)`

.. math::

   P(a \leq X \leq b) = \int\limits_a^b p(x) dx

and the cumulative distribution function :math:`F(x)`

.. math::

   F(x) := P(X \leq x) = \int\limits_{-\infty}^x p(t) dt

given the :ref:`probability mass function ` :math:`P(X)`.

For sampling, the inverse cumulative distribution function
:math:`F^{-1}(y)` (**i-cdf**) is especially interesting. If a closed-form
representation exists, one can sample from the uniform distribution (which
is usually relatively easy) on the i-cdf's domain :math:`y \in [0,1]` and
then use the i-cdf to retrieve samples distributed according to the target
distribution:

1. Draw :math:`u \sim \mathcal{U}^{[0,1]}`
2. Obtain a sample :math:`s = F^{-1}(u)`

The :mod:`numpy.random` package contains several functions to sample
directly from the most prominent distributions.
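As an illustration of this two-step scheme, consider the exponential
distribution with rate :math:`\lambda`, whose i-cdf has the closed form
:math:`F^{-1}(u) = -\ln(1-u) / \lambda`. The following minimal sketch uses
plain numpy; the function name is illustrative and not part of
:mod:`ailib.sampling`::

   import numpy as np

   def sample_exponential(lam, n, rng=None):
       """Draw n samples from Exp(lam) via the inverse cdf.

       F(x) = 1 - exp(-lam * x), hence F^{-1}(u) = -log(1 - u) / lam.
       """
       rng = np.random.default_rng() if rng is None else rng
       u = rng.uniform(0.0, 1.0, size=n)  # 1. draw u ~ U[0,1]
       return -np.log(1.0 - u) / lam      # 2. obtain s = F^{-1}(u)

   samples = sample_exponential(lam=2.0, n=10000)

Since :math:`1-u` is itself uniformly distributed on :math:`[0,1]`, the
subtraction is often dropped in practice.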
Unfortunately, the inverse cdf does not always exist in closed form, which
makes more complex methods inevitable.

Sampling from arbitrary distributions
=====================================

The target sample distribution may not always be known (e.g. it can be
determined by measurements from a source with unknown distribution), or a
closed-form representation of the i-cdf may not be available. In this
case, a more elaborate mechanism has to be used (which, as always, comes
at the cost of less efficient computation).

Rejection sampling
------------------

This is a relatively simple but often inefficient sampling scheme. A
proposal distribution :math:`q(x)` is required, as well as a scaling
constant :math:`M < \infty`. The proposal distribution can have any shape,
but it must be guaranteed that

.. math::

   M q(x) \geq p(x) \quad \forall x \in \mathcal{X}

with :math:`p(x)` the target distribution. Finding some tuple
:math:`(q(x), M)` that fulfills the above equation is often not very hard.
However, as described below, the larger the scaling, the less efficient
the algorithm becomes, so finding an acceptable tuple may still be
difficult. The algorithm can then be laid out as follows:

1. Iterate until :math:`N` accepted samples have been generated:

   1. Sample from the proposal :math:`x \sim Q`
   2. Sample from the uniform distribution :math:`u \sim \mathcal{U}^{[0,1]}`
   3. Accept the sample :math:`x` if :math:`u < \frac{ p(x) }{ M q(x) }`.
      Reject it otherwise.

As can be seen from the algorithm, a sample has to be obtained from the
distribution :math:`q(x)`, hence a proposal distribution should be chosen
that is relatively easy to sample from. A sample is accepted iff
:math:`u < \frac{ p(x) }{ M q(x) }`, thus the probability that a sample is
accepted is proportional to :math:`\frac{1}{M}`. If the proposal and
target distributions do not match well, the scaling constant has to be
large, which introduces massive performance issues. A worked example is
given in the Examples_ section below.

.. todo: algorithm explanation

.. todo: importance sampling

.. Markov-Chain Monte-Carlo Sampling
   ---------------------------------

.. TODO: Intro to MCMC, proposals (A) of Metropolis & MH

Interfaces
==========

.. todo: Iterator & generic approach

.. autofunction:: rouletteWheelN

.. autofunction:: rouletteWheel

.. autofunction:: rejectionSamplerN

.. autofunction:: rejectionSampler

.. todo:
   .. autofunction:: importanceSampler

.. autofunction:: metropolisHastingsSampler

The acceptance probability of a proposed sample :math:`x^*` is

.. math::

   \min \left( 1, \frac{ p(x^*) q(x_{t-1}, x^*) }{ p(x_{t-1}) q(x^*, x_{t-1}) } \right)

.. autofunction:: metropolisSampler

The Metropolis sampler assumes a symmetric proposal, i.e.

.. math::

   q(x^*, x_t) = q(x_t, x^*)

This reduces the acceptance probability to

.. math::

   \min \left( 1, \frac{p(x^*)}{p(x_{t-1})} \right)

Examples
========
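The following standalone sketches illustrate the samplers described above
using plain numpy. All names (``rejection_sampler``, ``metropolis_sampler``,
the target ``p``) are illustrative and do not refer to the
:mod:`ailib.sampling` interfaces listed above.

A rejection sampler for a bimodal Gaussian mixture, using a uniform
proposal on :math:`[-6, 6]`::

   import numpy as np

   def rejection_sampler(p, q_sample, q_pdf, M, n, rng=None):
       """Draw n samples from the target density p by rejection sampling.

       q_sample(rng) draws one sample from the proposal Q, q_pdf(x)
       evaluates its density, and M must satisfy M * q_pdf(x) >= p(x).
       """
       rng = np.random.default_rng() if rng is None else rng
       samples = []
       while len(samples) < n:
           x = q_sample(rng)              # x ~ Q
           u = rng.uniform(0.0, 1.0)      # u ~ U[0,1]
           if u < p(x) / (M * q_pdf(x)):  # accept with prob. p(x) / (M q(x))
               samples.append(x)
       return np.asarray(samples)

   def p(x):
       """Mixture of two unit-variance Gaussians centered at -2 and +2."""
       return 0.5 * (np.exp(-0.5 * (x - 2.0) ** 2)
                     + np.exp(-0.5 * (x + 2.0) ** 2)) / np.sqrt(2.0 * np.pi)

   q_pdf = lambda x: 1.0 / 12.0           # uniform density on [-6, 6]
   q_sample = lambda rng: rng.uniform(-6.0, 6.0)
   M = 3.0                                # M q(x) = 0.25 >= max p(x) (~0.2)

   samples = rejection_sampler(p, q_sample, q_pdf, M, n=10000)

A random-walk Metropolis sampler for the same target ``p``. The Gaussian
proposal centered at the previous state is symmetric, so the acceptance
probability reduces to :math:`\min(1, p(x^*) / p(x_{t-1}))`::

   def metropolis_sampler(p, x0, n, step=1.0, rng=None):
       """Generate a Markov chain of length n with stationary density p."""
       rng = np.random.default_rng() if rng is None else rng
       chain = np.empty(n)
       x = x0
       for t in range(n):
           x_star = x + rng.normal(0.0, step)  # symmetric proposal
           if rng.uniform() < min(1.0, p(x_star) / p(x)):
               x = x_star                      # accept, else keep x_{t-1}
           chain[t] = x
       return chain

   chain = metropolis_sampler(p, x0=0.0, n=10000)

Note that rejection sampling produces independent samples, whereas the
Metropolis chain yields correlated samples whose distribution only
approaches :math:`p(x)` as the chain grows.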