$$ \newcommand{\Ex}{\mathbb{E}} \newcommand{\Var}{\mathrm{Var}} \newcommand{\Cov}{\mathrm{Cov}} \newcommand{\SampleAvg}{\frac{1}{N({S})} \sum_{s \in {S}}} \newcommand{\indic}{\mathbb{1}} \newcommand{\avg}{\overline} \newcommand{\est}{\hat} \newcommand{\trueval}[1]{#1^{*}} \newcommand{\Gam}[1]{\mathrm{Gamma}#1} $$

$$ \renewcommand{\like}{\cal L} \renewcommand{\loglike}{\ell} \renewcommand{\err}{\cal E} \renewcommand{\dat}{\cal D} \renewcommand{\hyp}{\cal H} \renewcommand{\Ex}[2]{E_{#1}[#2]} \renewcommand{\x}{\mathbf x} \renewcommand{\v}[1]{\mathbf #1} $$

In [46]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from scipy.integrate import quad
from scipy import stats
from scipy.stats import norm, gamma
from scipy.stats import distributions
import seaborn as sns
import time
sns.set_style('whitegrid')
sns.set_context('poster')
plt.rcParams['figure.figsize'] = (12,6)

Introduction

MCMC stands for Markov chain Monte Carlo, where Monte Carlo refers to random number generation and Markov chain refers to a sequence of random variables in which each depends only on the immediately preceding one.

MCMC methods form a special family of sampling algorithms. A sampling algorithm allows us to draw samples from a given probability distribution. By drawing samples and applying the law of large numbers we can compute expectations and integrals, or optimize functions!

MCMC sampling methods are central to Bayesian statistics, since the denominator in Bayes' rule normally involves an integral over many parameters. $$p(\theta \mid \cal{D}) = \frac{p(\cal{D} \mid \theta)\, p(\theta)}{p(\cal{D})}$$

where $p(\cal{D}) = \int p(\cal{D} \mid \theta) p(\theta)\, d\theta$. This is an expensive operation. MCMC methods allow us to approximate this integral, which is why MCMC is central to the rising popularity of Bayesian statistics in the last decade.
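To make the integral concrete, here is a small sketch (my example, not from the original) that approximates the evidence $p(\cal{D})$ for a toy beta-binomial model by plain Monte Carlo: draw $\theta_i$ from the prior and average the likelihoods. The MCMC methods below go further, sampling from the posterior without ever computing $p(\cal{D})$.

In [ ]:
import numpy as np
from scipy import stats
from scipy.special import comb, beta as betafn

# toy model (an assumption for this sketch): theta ~ Beta(2,2),
# data D = 7 heads in 10 flips, likelihood Binomial(10, theta)
thetas = stats.beta(2, 2).rvs(100000)     # draws from the prior
likes = stats.binom.pmf(7, 10, thetas)    # p(D | theta_i)
print("MC estimate of p(D):", likes.mean())
# exact beta-binomial evidence: C(10,7) * B(9,5) / B(2,2)
print("exact p(D):", comb(10, 7) * betafn(9, 5) / betafn(2, 2))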

Before we discuss MCMC, it is instructive to learn a few basic but fundamental sampling methods.

Basic Sampling Methods

Inverse Transform

Given a distribution of x with an invertible CDF F, we can perform sampling using the following algorithm.

  1. Get a uniform sample u from $\mathrm{Uniform}(0,1)$
  2. Solve for x: $x = F^{-1}(u)$
  3. Repeat.

For example, suppose we want to draw samples from the exponential distribution $f(x) = \lambda e^{-\lambda x}$ for $x>0$, whose CDF is $F(x) = 1 - e^{-\lambda x}$. Solving $u = F(x)$ for x, we have $x = (-1/\lambda)\ln(1-u)$.

In [9]:
# drawing from the exponential distribution f(x) with lambda = 1
p = lambda x: np.exp(-x)
CDF = lambda x: 1 - np.exp(-x)
invCDF = lambda r: -np.log(1-r)
xmin = 0
xmax = 6
rmin = CDF(xmin)
rmax = CDF(xmax)
N = 10000
# generate samples in [rmin, rmax], then invert the CDF
# to get samples from our target distribution
R = np.random.uniform(rmin,rmax, N)
X = invCDF(R)
hinfo = np.histogram(X,100)
plt.hist(X, bins=100, label='Samples')

# plot our (normalized) function
x = np.linspace(xmin,xmax,1000)
plt.plot(x, hinfo[0][0]*p(x), color='r', label=u'p(x)')
plt.legend()
Out[9]:
<matplotlib.legend.Legend at 0x10fdc0c18>
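As a quick sanity check (an addition, not in the original): because u was drawn from [CDF(0), CDF(6)], X follows the exponential truncated to [0, 6], whose mean is slightly below $1/\lambda = 1$.

In [ ]:
print("sample mean:", X.mean())
# mean of Exponential(1) truncated to [0, 6]:
print("truncated mean:", (1 - 7*np.exp(-6)) / (1 - np.exp(-6)))   # ~0.985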

Rejection Sampling

This is quite simple. The key idea is to generate samples from a uniform distribution whose support is the rectangle $[x_{min}, x_{max}] \times [0, y_{max}]$, and keep those points that fall below $f(x)$.

The algorithm is as follows.

  1. Draw x uniformly from [xmin, xmax]
  2. Draw y uniformly from [0, ymax]
  3. if y < f(x), accept the sample x
  4. otherwise reject it
  5. repeat

The following code generates samples from the distribution $f(x) = e^{-x}$ for $x \in [0,10]$.

In [3]:
P = lambda x: np.exp(-x)

# domain limits
xmin = 0 # the lower limit of our domain
xmax = 10 # the upper limit of our domain

# range limit (supremum) for y
ymax = 1
#you might have to do an optimization to find this.

N = 10000 # the total of samples we wish to generate
accepted = 0 # the number of accepted samples
samples = np.zeros(N)
count = 0 # the total count of proposals

# generation loop
while (accepted < N):
    
    # pick a uniform number on [xmin, xmax) (e.g. 0...10)
    x = np.random.uniform(xmin, xmax)
    
    # pick a uniform number on [0, ymax)
    y = np.random.uniform(0,ymax)
    
    # Do the accept/reject comparison
    if y < P(x):
        samples[accepted] = x
        accepted += 1
    
    count +=1
    
print("Count",count, "Accepted", accepted)

# get the histogram info
hinfo = np.histogram(samples,30)

# plot the histogram
plt.hist(samples,bins=30, label=u'Samples');

# plot our (normalized) function
xvals=np.linspace(xmin, xmax, 1000)
plt.plot(xvals, hinfo[0][0]*P(xvals), 'r', label=u'P(x)')

# turn on the legend
plt.legend()
Count 98786 Accepted 10000
Out[3]:
<matplotlib.legend.Legend at 0x10fe2a710>

Notice that in the previous example we need about 100K proposals to accept 10K samples. This is a waste of computation: over the range $[4,10]$ the probability density is very low, so almost every proposal there is rejected. This problem is called low acceptance probability.

To remedy the low acceptance probability, we can introduce a proposal density g(x), which should have the following properties:

  • easy to sample from
  • there is a constant M between 1 and $\infty$ such that $Mg(x) > f(x)$ on the entire domain of interest
  • ideally g(x) is somewhat close to f(x), so that you sample more in the high-density regions and less in the low-density regions.

There are two equivalent views of this strategy: Algorithm 1 and Algorithm 2. Algorithm 1 is more direct and easier to understand.

Algorithm 1: Note that in the picture, $g(x)$ is $Q(x)$ and $f(x)$ is $P^*(x)$.

(figure: rejection sampling, with the scaled proposal $CQ(x)$ covering $P^*(x)$)

Algorithm 2:

  1. Draw x from your proposal distribution g(x)
  2. Draw y uniformly from [0,1]
  3. If $y < f(x)/Mg(x)$, accept the sample; otherwise reject it
  4. Repeat

Notice that the two algorithms are equivalent. In Algorithm 2, step 3, $y < f(x)/Mg(x)$ is equivalent to $yMg(x) < f(x)$. Since y is chosen uniformly from $[0,1]$, $yMg(x)$ is uniform on $[0, Mg(x)]$ (or $[0, CQ(x)]$ as in the picture), which is just like Algorithm 1.

In [6]:
p = lambda x: np.exp(-x)  # our distribution
g = lambda x: 1/(x+1)  # our proposal pdf (we're thus choosing M to be 1)
invCDFg = lambda x: np.log(x +1) # generates our proposal using inverse sampling

# domain limits
xmin = 0 # the lower limit of our domain
xmax = 10 # the upper limit of our domain

# range limits for inverse sampling
umin = invCDFg(xmin)
umax = invCDFg(xmax)

N = 10000 # the total of samples we wish to generate
accepted = 0 # the number of accepted samples
samples = np.zeros(N)
count = 0 # the total count of proposals

# generation loop
while (accepted < N):
    
    # Sample from g using inverse sampling
    u = np.random.uniform(umin, umax)
    xproposal = np.exp(u) - 1
    
    # pick a uniform number on [0, 1)
    y = np.random.uniform(0,1)
    
    # Do the accept/reject comparison
    if y < p(xproposal)/g(xproposal):
        samples[accepted] = xproposal
        accepted += 1
    
    count +=1
    
print("Count", count, "Accepted", accepted)

# get the histogram info
hinfo = np.histogram(samples,50)

# plot the histogram
plt.hist(samples,bins=50, label=u'Samples');

# plot our (normalized) function
xvals=np.linspace(xmin, xmax, 1000)
plt.plot(xvals, hinfo[0][0]*p(xvals), 'r', label=u'p(x)')
plt.plot(xvals, hinfo[0][0]*g(xvals), 'k', label=u'g(x)')

# turn on the legend
plt.legend()
Count 23837 Accepted 10000
Out[6]:
<matplotlib.legend.Legend at 0x11031b780>

One clear effect is that we now need only about 24K proposals to get 10K samples, compared with roughly 100K before.

Importance Sampling

This is directly a method for computing integrals, which is what we need for Bayesian analysis.

For example, suppose we have $X \sim f(x)$ and we want to compute $E[h(X)]$ with respect to this density. We have

$$\begin{align} E_f[h] &= \int_{V}f(x)h(x)\,dx\\ &= \int h(x)g(x)\frac{f(x)}{g(x)}\,dx\\ &= E_g\left[h(x)\frac{f(x)}{g(x)}\right] \end{align}$$

Here $g(x)$ is a density that is easy to sample from and is close to the function $f(x)$. Now we can approximate the last expression by Monte Carlo methods:

$$E_f[h] = \lim_{N \rightarrow \infty}\frac{1}{N}\sum_{x_i \sim g(.)}h(x_i)\frac{f(x_i)}{g(x_i)}$$

The fundamental difference compared to rejection sampling is that we use all the samples; each contribution to the expectation is weighted by $w(x_i) = \frac{f(x_i)}{g(x_i)}$.

Example: Calculate $\int_{0}^{\pi} x\sin(x)\,dx$

The integrand has a shape similar to a Gaussian, and therefore we choose a Gaussian as the importance sampling distribution.

In [7]:
from scipy import stats
from scipy.stats import norm

mu = 2;
sig =.7;

f = lambda x: np.sin(x)*x
infun = lambda x: np.sin(x)-x*np.cos(x)
p = lambda x: (1/np.sqrt(2*np.pi*sig**2))*np.exp(-(x-mu)**2/(2.0*sig**2))
normfun = lambda x:  norm.cdf(x-mu, scale=sig)
In [8]:
plt.figure(figsize=(18,8))  # set the figure size


# range of integration
xmax =np.pi 
xmin =0

# Number of draws 
N =1000

# Just want to plot the function
x=np.linspace(xmin, xmax, 1000)
plt.subplot(1,2,1)
plt.plot(x, f(x), 'b', label=u'Original  $x\sin(x)$')
plt.plot( x, p(x), 'r', label=u'Importance Sampling Function: Normal')
plt.plot(x, np.ones(1000)/np.pi,'k')

# drawing from normal distribution
xis = mu + sig*np.random.randn(N,1);
plt.plot(xis, 1/(np.pi*p(xis)),'.', alpha=0.1)

# some plotting miscell
plt.xlim([0, np.pi])
plt.ylim([0,2])
plt.xlabel('x')
plt.legend()
# =============================================
# EXACT SOLUTION 
# =============================================
Iexact = infun(xmax)-infun(xmin)
print("Exact solution is: ", Iexact)
Exact solution is:  3.141592653589793
In [40]:
plt.figure(figsize=(18,8))
# ============================================
# VANILLA MONTE CARLO 
# ============================================
Ivmc = np.zeros(1000)
for k in np.arange(0,1000):
    x = np.random.uniform(low=xmin, high=xmax, size=N)
    Ivmc[k] = (xmax-xmin)*np.mean( f(x))

print("Mean basic MC estimate: ", np.mean(Ivmc))
print("Standard deviation of our estimates: ", np.std(Ivmc))

# ============================================
# IMPORTANCE SAMPLING 
# ============================================
# CHOOSE a Gaussian that is similar to the original function

Iis = np.zeros(1000)
for k in np.arange(0,1000):
    # DRAW FROM THE GAUSSIAN: mean = 2, std = 0.7
    xis = mu + sig*np.random.randn(N,1);
    #hist(x)
    xis = xis[ (xis<xmax) & (xis>xmin)] ;

    # normalization for gaussian from 0..pi
    normal = normfun(np.pi)-normfun(0);


    Iis[k] =np.mean(f(xis)/p(xis))*normal;

print("Mean importance sampling MC estimate: ", np.mean(Iis))
print("Standard deviation of our estimates: ", np.std(Iis))
plt.subplot(1,2,2)
plt.hist(Iis,30, histtype='step', label=u'Importance Sampling');
plt.hist(Ivmc, 30, color='r',histtype='step', label=u'Vanilla MC');
plt.gca().set_xlabel('distribution of estimation of integral')
plt.legend()
Mean basic MC estimate:  3.14263399245922
Standard deviation of our estimates:  0.06016872403795413
Mean importance sampling MC estimate:  3.1415904991728296
Standard deviation of our estimates:  0.015732597782855223
Out[40]:
<matplotlib.legend.Legend at 0x110e87cf8>

Monte Carlo Integration Computation

Suppose we want to compute $I = \int_{a}^{b}f(x)dx$. We can use the Monte Carlo integration method. The idea is to relate $I$ to the expectation of $f(x)$ with respect to the uniform density over $[a,b]$, $U_{ab}(x) = \frac{1}{b-a}$. Specifically, we have:

$$J = \int_a^b f(x)U_{ab}(x)dx = E_{U}f(x)$$

And we have $$I = (b-a) * J$$

Recall that by the law of large numbers we can compute $J$ by repeatedly sampling $X \sim U_{ab}$, computing $f(X)$, and taking the sample mean.

Hence

$$I = V\lim_{n \rightarrow \infty} \frac{1}{N}\sum_{x_i \sim U} f(x_i),$$ where V here is $b-a$

We now use sampling to compute

$$I = \int_{2}^{3}[x^2 + 4x\sin(x)]\,dx$$

In [11]:
def f(x):
    return x**2 + 4*x*np.sin(x)

# closed form of the integral
def intf(x):
    return x**3/3.0+4.0*np.sin(x) - 4.0*x*np.cos(x)   
In [12]:
# limits of integration
a = 2
b = 3

# size of simulation
N = 10000

# select randomly x
X = np.random.uniform(low=a, high=b, size=N)

# compute f(x)
Y = f(X)

# compute integration
V = b-a
print('approximated integration: ', V * np.mean(Y))
print('exact integration: ', intf(b) - intf(a))
approximated integration:  11.811425127333335
exact integration:  11.811358925098283

We want to compute $\int\int f(x,y)dxdy$ where $f(x,y) = x^2 + y^2$ over the region defined by the condition $x^2 + y^2 \leq 1$

In the multidimensional case, $V$ is the volume of the region of integration.

In [14]:
fmd = lambda x,y: x*x + y*y
# use N draws 
N= 8000
X= np.random.uniform(low=-1, high=1, size=N) 
Y= np.random.uniform(low=-1, high=1, size=N) 
Z=fmd(X, Y)   # CALCULATE THE f(x) 

R = X**2 + Y**2
V = np.pi*1.0*1.0
# number of points falling in V
N = np.sum(R<1)

# sum of f(points falling in V)
sumsamples = np.sum(Z[R<1])

print("I=",V*sumsamples/N, "actual", np.pi/2.0) #actual value (change to polar to calculate)
I= 1.5720586332483761 actual 1.5707963267948966

How accurate is the Monte Carlo method in computing an integral?

In [20]:
trials = np.arange(1, 1000)
errors = []
true = intf(b) - intf(a)
errors = [V * np.mean(f(np.random.uniform(low=a, high=b, size=N))) - true for N in trials]
errors_2 = np.abs(errors)  # absolute error |estimate - true|
In [21]:
fig = plt.figure(figsize=(10,8))
ax = fig.subplots(nrows=1, ncols=2)
ax[0].plot(trials, errors_2);
ax[0].set_xlabel('number of samples N')
ax[0].set_ylabel('sqrt(error^2)');
ax[0].plot(trials[10:], 1/np.sqrt(trials[10:]), 'r')
# ax[0].annotate(r'$\frac{1}{\sqrt(N)}$', x=200, y= 0.08)
ax[0].annotate(r'$\frac{1}{\sqrt{N}}$', xy=(200,0.08), xytext=(300, 0.1),
            arrowprops=dict(facecolor='black', shrink=1),
            )

ax[1].hist(errors, bins=100, density=True);
ax[1].set_title('error distribution');
In [22]:
# multiple MC estimations
m=1000
N=10000
Imc=np.zeros(m)


for i in np.arange(m):
    
    X = np.random.uniform(low=a, high=b, size=N) # N values uniformly drawn from a to b 
    Y =f(X)   # CALCULATE THE f(x) 

    Imc[i]= (b-a) * np.sum(Y)/ N;
    
    
plt.hist(Imc, bins=30)
plt.xlabel("error distribution")
print(np.mean(Imc), np.std(Imc))
11.81116928731566 0.0041515968034460445

MCMC Sampling Methods

Markov chain Monte Carlo is a simulation method (Monte Carlo) that generates samples from a distribution $f(x)$ indirectly, using a Markov chain. In this notebook, we will learn about two popular MCMC algorithms: the Metropolis-Hastings algorithm and Gibbs sampling.

In order to understand MCMC methods, we need to understand what a Markov chain (MC) is and what its target distribution is (more on this later). But intuitively, in order to sample from a distribution $p(x)$ that is difficult to sample from directly, we build an MC whose target distribution is $p(x)$. Sampling from the constructed MC, which is easy, is then like sampling from $p(x)$.

Markov Chain

We first discuss Markov chains whose random variables are discrete.

  • A Markov chain is a model that describes a sequence of random variables $X_0, X_1, \ldots, X_n, \ldots$, all drawn from a distribution in which the value of $X_n$ depends only on the immediately preceding variable $X_{n-1}$. In other words, it lies happily between the complete-independence and complete-dependence assumptions.

A Markov chain lives in both space and time. Space here means the domain of $X_t$, and time means the domain of $t$. Both space and time can be discrete or continuous. We will first focus only on discrete domains; moreover, our discrete space will also be finite. That is, $dom(X_t) = \{1,2,\ldots, M\}$.

We now define Markov chains formally:

A sequence of random variables $X_0, X_1, \ldots, X_n, \ldots$ taking values in the state space $\{1,2,\ldots, M\}$ is called a Markov chain if for all $n \geq 0$, $$p(X_{n+1} = j \mid X_n = i, X_{n-1} = i_{n-1}, \ldots, X_0 = i_0) = p(X_{n+1} = j \mid X_{n} = i)$$

The quantity $p(X_{n+1} = j \mid X_{n} = i)$ is called the transition probability from state i to state j.

A time-homogeneous Markov chain is one whose transition probabilities $p(X_{n+1} = j \mid X_{n} = i)$ do not change with time.

The Markov property is the condition in the definition. It says that we need only $X_n$ to predict $X_{n+1}$. If n is the present, times before n are the past, and times after n are the future, the Markov property says that given the present, the past and the future are conditionally independent.

Knowing the dynamics of a Markov chain means knowing its transition probabilities from one state to another. These dynamics are encoded in a matrix, called the transition matrix: the M-by-M matrix $Q = (q_{ij})$ whose (i,j) entry is the probability of going from state i to state j in one step of the chain.

An example of a Markov chain:

(figure: an example Markov chain)

The n-step transition probability from i to j is the probability of being at j exactly n steps after starting at i. We denote this by $q_{ij}^{(n)} = p(X_n = j \mid X_0 = i)$.

Note that $q_{ij}^{(2)} = \sum_{k}q_{ik}q_{kj}$

The right-hand side is the (i,j) entry of $Q^2$, so $Q^2$ gives the two-step transition probabilities. By induction, $Q^n$ gives the n-step transition probabilities. In other words, the ith row of $Q^n$ is the conditional PMF of $X_n$ given $X_0 = i$.

Next consider the marginal distributions $p(X_1), \ldots, p(X_n), \ldots$. To compute them, we need $Q$ and the initial distribution of $X_0$. Given the marginal distribution $p(X_0) = (t_1, \dots, t_M)$, the marginal distribution at any later time can be computed from the transition matrix by averaging over all the states using the law of total probability (LOTP).

Define $\v{t} = (t_1, \dots, t_M)$ by $t_i = p(X_0 = i)$. Then the marginal distribution of $X_n$ is given by the vector $\v{t}Q^n$, viewing $\v{t}$ as a row vector.
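Here is a small sketch (a made-up two-state chain, added for illustration) checking the two-step identity and computing the marginals $\v{t}Q^n$:

In [ ]:
from numpy.linalg import matrix_power

# a made-up 2-state chain for illustration
Qex = np.array([[0.9, 0.1],
                [0.4, 0.6]])
t = np.array([1.0, 0.0])     # start deterministically in state 0

print(matrix_power(Qex, 2))  # two-step transition probabilities, same as Qex @ Qex

# marginal distribution of X_n is t Q^n (t as a row vector)
for n in [1, 5, 50]:
    print(n, t @ matrix_power(Qex, n))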

We now study a number of characteristics of a Markov chain. They are important for understanding the long-run behaviour of the chain.

recurrent versus transient

State i of a Markov chain is recurrent if starting from i, the probability is 1 that the chain will eventually return to i. Otherwise, the state is transient, which means that if the chain starts from i, there is a positive probability of never returning to i.

Recall that the number of failures in repeated Bernoulli trials before the first success follows the Geometric distribution with parameter p: $X \sim Geom(p)$, where $p(X = k) = q^kp$. Note that scipy.stats.geom has a slightly different definition, in which k also counts the success trial.

(figure: recurrent versus transient states)

In [24]:
%matplotlib inline
from scipy.stats import geom
import numpy as np
import matplotlib.pyplot as plt

def mygeom(k, p):
    return (1-p)**k*p
In [25]:
k = np.arange(10) 
plt.plot(k, mygeom(k,1/2), 'o', color='r', alpha=0.3)
plt.plot(k, geom.pmf(k+1, 1/2), 'x', color='b', alpha=0.3)
Out[25]:
[<matplotlib.lines.Line2D at 0x113430e48>]

There is a connection between transient states and the geometric distribution: the number of returns to a transient state is geometric. Let i be a transient state of a Markov chain, and suppose the probability of never returning to i, starting from i, is some $p > 0$. Then, starting from i, the number of times that the chain returns to i before leaving forever is distributed Geom(p).

This means the chain will eventually leave state i forever: the probability of making at least k further returns decays geometrically in k (see the graph above), so with probability 1 the number of returns is finite and the chain eventually leaves i forever.

How do we know if a state is recurrent? We cannot analyze the graph of a Markov chain visually if there are too many states. Irreducibility gives us a workable criterion: a chain is irreducible if it is possible to get from any state to any other state.

Irreducible and reducible chain.

    A Markov chain with transition matrix $Q$ is irreducible if for any two states i and j, it is possible to go from i to j in a finite number of steps (with positive probability). That is, for any states i, j there is some positive integer n such that the (i,j) entry of $Q^n$ is positive. A Markov chain that is not irreducible is called reducible.
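Irreducibility can be checked directly from this definition. For an M-state chain, any reachable state is reachable in at most M steps, so it suffices to test whether $\sum_{n=1}^{M} Q^n$ has all entries positive. A small sketch (my addition, with a made-up reducible chain):

In [ ]:
from numpy.linalg import matrix_power

def is_irreducible(Q):
    # every state reaches every other state with positive probability
    # within at most M steps
    M = Q.shape[0]
    reach = sum(matrix_power(Q, n) for n in range(1, M + 1))
    return bool(np.all(reach > 0))

# a chain made of two disconnected blocks is reducible
blocks = np.array([[0.5, 0.5, 0.0],
                   [0.5, 0.5, 0.0],
                   [0.0, 0.0, 1.0]])
print(is_irreducible(blocks))   # False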

It is easy to prove that irreducibility implies all states are recurrent (recall our state space is finite).

So irreducibility implies all states are recurrent, but the converse is not true: we can have a reducible Markov chain whose states are all recurrent. For example:

(figure: a reducible chain whose states are all recurrent)

period of a state, periodic and aperiodic chain

The period of a state i in a Markov chain is the greatest common divisor (gcd) of the possible numbers of steps it can take to return to i when starting at i. In other words, the period of i is the gcd of the numbers n such that the (i,i) entry of $Q^n$ is positive.

A state is called aperiodic if its period equals 1, and periodic otherwise. The chain itself is called aperiodic if all its states are aperiodic, and periodic otherwise.

(figure: periodic and aperiodic chains)
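From the definition, the period of a state can be computed numerically as the gcd of the return times that have positive probability. A small sketch (my addition), using the 5-state cycle that appears again below:

In [ ]:
from numpy.linalg import matrix_power
from math import gcd
from functools import reduce

def period(Q, i, max_n=50):
    # gcd of the step counts n <= max_n for which (Q^n)[i,i] > 0
    returns = [n for n in range(1, max_n + 1) if matrix_power(Q, n)[i, i] > 0]
    return reduce(gcd, returns) if returns else None

cycle = np.array([[0,1,0,0,0], [0,0,1,0,0], [0,0,0,1,0], [0,0,0,0,1], [1,0,0,0,0]], dtype=float)
print(period(cycle, 0))            # 5: the chain is periodic
lazy = 0.5*cycle + 0.5*np.eye(5)   # self-loops allow a return in one step
print(period(lazy, 0))             # 1: the lazy version is aperiodic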

Stationary distribution

We are interested in the behavior of a Markov chain in the long run. At first, the chain may spend time in transient states. Eventually, though, the chain will spend all its time in recurrent states. But what fraction of time will it spend in each of the recurrent states? This is answered by the stationary distribution of the chain, also known as the steady-state distribution.

We will show that for irreducible and aperiodic Markov chains, the stationary distribution describes the long-run behavior of the chain, regardless of its initial conditions.

Stationary distribution. A row vector $\v{s} = (s_1, \ldots, s_M)$ such that $s_i \geq 0$ and $\sum_is_i = 1$ is a stationary distribution for a Markov chain with transition matrix Q if $$\sum_is_iq_{ij} = s_j$$ for all j, or equivalently, $$\v{s}Q = \v{s}$$

From the definition, we can see that if the distribution of the initial state $X_0$ is $\v{s}$, then $X_1$, $X_2$, etc. will all have distribution $\v{s}$. In other words, if a Markov chain starts in its stationary distribution, it stays there forever.

Existence and uniqueness

For a finite state space, a stationary distribution always exists (this is due to the Perron-Frobenius theorem).

Any irreducible Markov chain has a unique stationary distribution. In this distribution, every state has positive probability.

Convergence

We state the following important theorem without proof.

Convergence to the stationary distribution. Let $X_0, X_1, \ldots$ be a Markov chain with stationary distribution $\v{s}$ and transition matrix $Q$, such that some power $Q^m$ is positive in all entries. (These assumptions are equivalent to assuming that the chain is irreducible and aperiodic.) Then $P(X_n = i) \rightarrow s_i$ as $n \rightarrow \infty$. In terms of the transition matrix, $Q^n$ converges to a matrix in which each row is $\v{s}$. Therefore, after a large number of steps, the probability that the chain is in state i is close to the stationary probability $s_i$, regardless of the chain's initial conditions. Intuitively, the extra condition of aperiodicity is needed in order to rule out chains that just go around in circles, such as the chain in the following example.

In [26]:
from numpy.linalg import matrix_power
Q = np.array([[0,1,0,0,0], [0, 0, 1, 0, 0], [0,0,0,1,0], [0,0,0,0,1], [1,0,0,0,0]])
initial = np.array([1,0,0,0,0])
Q
Out[26]:
array([[0, 1, 0, 0, 0],
       [0, 0, 1, 0, 0],
       [0, 0, 0, 1, 0],
       [0, 0, 0, 0, 1],
       [1, 0, 0, 0, 0]])
In [27]:
for n in range(10,20):
    print('n = ', n) 
    print(np.dot(initial, matrix_power(Q,n)))
    print('-'*30)
n =  10
[1 0 0 0 0]
------------------------------
n =  11
[0 1 0 0 0]
------------------------------
n =  12
[0 0 1 0 0]
------------------------------
n =  13
[0 0 0 1 0]
------------------------------
n =  14
[0 0 0 0 1]
------------------------------
n =  15
[1 0 0 0 0]
------------------------------
n =  16
[0 1 0 0 0]
------------------------------
n =  17
[0 0 1 0 0]
------------------------------
n =  18
[0 0 0 1 0]
------------------------------
n =  19
[0 0 0 0 1]
------------------------------
In [28]:
# transition
import numpy as np
transition_matrix = np.array([[1/3, 2/3], [0.5, 0.5]])
transition_matrix
Out[28]:
array([[0.33333333, 0.66666667],
       [0.5       , 0.5       ]])
In [29]:
# now for higher n. We should see that it converges to the
# stationary distribution
tm_before = transition_matrix
for i in range(5):
    tm_next = np.dot(tm_before, transition_matrix)
    print(tm_next)
    print("-----------------")
    tm_before = tm_next
[[0.44444444 0.55555556]
 [0.41666667 0.58333333]]
-----------------
[[0.42592593 0.57407407]
 [0.43055556 0.56944444]]
-----------------
[[0.42901235 0.57098765]
 [0.42824074 0.57175926]]
-----------------
[[0.42849794 0.57150206]
 [0.42862654 0.57137346]]
-----------------
[[0.42858368 0.57141632]
 [0.42856224 0.57143776]]
-----------------

We can compute the stationary distribution analytically. Assume it is $s = [p, 1-p]$. Then $$sQ = s$$ gives us $$p(1/3) + (1-p)(1/2) = p$$

and thus $p = 3/7$

And we can see that we arrive at this stationary distribution starting from multiple initial distributions.
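We can also get $\v{s}$ numerically, as the left eigenvector of the transition matrix with eigenvalue 1 (a small added check):

In [ ]:
# left eigenvectors of Q are the eigenvectors of Q transpose
evals, evecs = np.linalg.eig(transition_matrix.T)
s = np.real(evecs[:, np.argmin(np.abs(evals - 1))])
s = s / s.sum()   # normalize to a probability vector
print(s)          # [3/7, 4/7] = [0.4286..., 0.5714...]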

reversibility

The stationary distribution is the key to understanding a Markov chain's long-run behavior.

To find the stationary distribution, we need to find an eigenvector of the transition matrix. This involves solving a high-degree polynomial, which can be difficult when the number of states (and hence the size of Q) is large.

Fortunately, there is an important special case in which working with eigenvalue equations for large matrices can be avoided.

Reversibility. Let $Q = (q_{ij})$ be the transition matrix of a Markov chain. Suppose there is $\v{s} = (s_1, \dots, s_M)$ with $s_i \geq 0$, $\sum_i s_i = 1$, such that $$s_iq_{ij} = s_{j}q_{ji}$$

for all states i and j.

This equation is called the reversibility or detailed balance condition, and we say that the chain is reversible with respect to $\v{s}$ if it holds. When it holds, $\v{s}$ is automatically a stationary distribution: summing over i gives $\sum_i s_iq_{ij} = \sum_i s_jq_{ji} = s_j$, so $\v{s}Q = \v{s}$ without solving any eigenvalue equation.
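Here is a quick numerical check (an added sketch with a made-up chain): a symmetric random walk on $\{0,1,2\}$ is reversible with respect to the uniform distribution, and detailed balance indeed gives stationarity.

In [ ]:
# made-up reversible chain: symmetric random walk on {0,1,2},
# staying put with prob 1/2 at the two ends
Qrw = np.array([[0.5, 0.5, 0.0],
                [0.5, 0.0, 0.5],
                [0.0, 0.5, 0.5]])
s = np.array([1/3, 1/3, 1/3])   # uniform; Qrw is symmetric so s_i q_ij = s_j q_ji

D = s[:, None] * Qrw            # matrix of s_i q_ij
print(np.allclose(D, D.T))      # detailed balance holds: True
print(np.allclose(s @ Qrw, s))  # hence s is stationary: True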

In continuous state spaces, the transition matrix becomes an integral kernel K. We will discuss more details when presenting MCMC methods.

MCMC Methods

The transition matrix is only good for the case of a finite state space. For a continuous state space, the transition matrix becomes an integral kernel $K$ and $p(x)$ becomes the corresponding eigenfunction:

$$\int p(x^{(i)})K(x^{(i+1)}| x^{(i)})dx^{(i)} = p(x^{(i+1)})$$

The kernel $K$ is the conditional density of $x^{(i+1)}$ given the value $x^{(i)}$. It is a mathematical representation of a Markov chain algorithm.

Metropolis

  1. Use a (symmetric) proposal distribution to propose a step.
  2. Calculate the pdf at that step, and compare it to the pdf at the previous step.
  3. If the probability increased, we accept; if the probability decreased, we accept some of the time, based on the ratio of the new probability to the old one.
  4. Accumulate samples
In [32]:
def metropolis(p, qdraw, nsamp, xinit):
    samples = np.empty(nsamp)
    x_prev = xinit
    for i in range(nsamp):
        # new proposal x*
        x_star = qdraw(x_prev)
        
        # p(x*)
        p_star = p(x_star)
        p_prev = p(x_prev)
        pdfratio = p_star/p_prev
        if np.random.uniform() < min(1, pdfratio):
            samples[i] = x_star
            x_prev = x_star
        else: # we always get a sample
            samples[i] = x_prev
    return samples
In [34]:
def metropolis2(p, qdraw, nsamp, xinit):
    samples = np.empty(nsamp)
    x_prev = xinit
    for i in range(nsamp):
        # new proposal x*
        x_star = qdraw(x_prev)
        
        # p(x*)
        p_star = p(x_star)
        p_prev = p(x_prev)
        
        if (p_star > p_prev):
            samples[i] = x_star
            x_prev = x_star
        else:
            pdfratio = p_star/p_prev
            if np.random.uniform() < pdfratio:
                samples[i] = x_star
                x_prev = x_star
            else:
                samples[i] = x_prev
    return samples

Now we can sample from a Gaussian distribution:

In [36]:
from scipy.stats import uniform
def propmaker(delta):
    rv = uniform(-delta, 2*delta)
    return rv
uni = propmaker(0.5)
def uniprop(xprev):
    return xprev+uni.rvs()
In [37]:
samps = metropolis(norm.pdf, uniprop, 100000, 0.0)
plt.hist(samps,bins=80, alpha=0.4, label=u'MCMC distribution', density=True); 
#plot the true function
xxx= np.linspace(-5,5,100)
plt.plot(xxx, norm.pdf(xxx), 'r', label=u'True distribution') 
plt.legend()
print("starting point was ", 0)
starting point was  0
In [38]:
# samps = metropolis(norm.pdf, uniprop, 100000, 0.0)
samps = metropolis2(norm.pdf, uniprop, 100000, 100)
/usr/local/lib/python3.6/site-packages/ipykernel_launcher.py:16: RuntimeWarning: invalid value encountered in double_scalars
  app.launch_new_instance()
In [39]:
plt.hist(samps,bins=80, alpha=0.4, label=u'MCMC distribution', density=True); 
#plot the true function
xxx= np.linspace(-5,5,100)
plt.plot(xxx, norm.pdf(xxx), 'r', label=u'True distribution') 
plt.legend()
print("starting point was ", 100)
starting point was  100

We can see that if we start sampling from a very low-probability region ($x_0 = 100$), the result will not be great!
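The warning above comes from the density ratio p(x*)/p(x_prev) when both values underflow to zero far out in the tail (norm.pdf(100) is exactly 0 in floating point), so the chain can get stuck. A standard remedy, sketched here as an addition, is to work with log densities and discard an initial burn-in:

In [ ]:
def metropolis_log(logp, qdraw, nsamp, xinit):
    # same Metropolis scheme as above, but in log space so that far-tail
    # starting points do not produce 0/0 density ratios
    samples = np.empty(nsamp)
    x_prev = xinit
    logp_prev = logp(x_prev)
    for i in range(nsamp):
        x_star = qdraw(x_prev)
        logp_star = logp(x_star)
        # accept with probability min(1, exp(logp_star - logp_prev))
        if np.random.uniform() < np.exp(min(0.0, logp_star - logp_prev)):
            x_prev, logp_prev = x_star, logp_star
        samples[i] = x_prev
    return samples

samps_log = metropolis_log(norm.logpdf, uniprop, 100000, 100.0)
burnin = 20000  # generous burn-in to discard the walk down from x0 = 100
plt.hist(samps_log[burnin:], bins=80, alpha=0.4, density=True, label=u'MCMC (log space, after burn-in)')
xxx = np.linspace(-5,5,100)
plt.plot(xxx, norm.pdf(xxx), 'r', label=u'True distribution')
plt.legend()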

Metropolis-Hastings

The Metropolis-Hastings algorithm is a general recipe that lets us start with any irreducible Markov chain on the state space of interest and then modify it into a new Markov chain that has the desired stationary distribution.

Our goal is to modify a Markov chain P to construct a chain $X_0, X_1, \dots$ with stationary distribution $s$. The pseudocode is as follows, with a small worked sketch after the list.

  1. Start at any state $X_0$ (chosen randomly or deterministically).
  2. If $X_n = i$, propose a new state j using the transition probabilities in the ith row of the original transition matrix P.
  3. Compute the acceptance probability $a_{ij} = \min\left(\frac{s_jp_{ji}}{s_ip_{ij}},1\right)$
  4. Flip a coin that lands Heads with probability $a_{ij}$.
  5. If the coin lands Heads, accept the proposal (i.e., go to j), setting $X_{n+1} = j$. Otherwise, reject the proposal (i.e., stay at i), setting $X_{n+1} = i$.
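To see the discrete recipe in action, here is a small sketch with a made-up target distribution $s$ on five states. The proposal chain P is a simple random walk (clamped at the ends), which is symmetric, so $p_{ij} = p_{ji}$ and the acceptance ratio reduces to $s_j/s_i$.

In [ ]:
# a made-up target distribution on states {0,1,2,3,4}
s = np.array([0.1, 0.2, 0.4, 0.2, 0.1])

# proposal chain P: move left or right with prob 1/2, clamped at the ends
def propose(i):
    return min(max(i + np.random.choice([-1, 1]), 0), 4)

x, counts, nsteps = 0, np.zeros(5), 100000
for _ in range(nsteps):
    j = propose(x)
    # a_ij = min(1, s_j p_ji / (s_i p_ij)) = min(1, s_j / s_i) by symmetry
    if np.random.uniform() < min(1, s[j] / s[x]):
        x = j
    counts[x] += 1

print(counts / nsteps)  # long-run frequencies should be close to s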

The Metropolis-Hastings algorithm can also be applied in a continuous state space, using pdfs instead of PMFs. This is very useful in Bayesian inference, where we often want to study the posterior distribution of an unknown parameter. This posterior distribution may be very complicated to work with analytically and may have an unknown normalizing constant.

The pseudocode for the Metropolis-Hastings algorithm in this setting is as follows.

  1. Initialize $x^{(0)}$
  2. For i = 0 to N-1
    • Sample $u \sim U_{[0,1]}$
    • Sample $x^* \sim q(x^* | x^{(i)})$
    • If $u < A(x^{(i)}, x^*) = \min\left(1, \frac{p(x^*)\,q(x^{(i)} \mid x^{*})}{p(x^{(i)})\,q(x^* \mid x^{(i)})}\right)$
      • $x^{(i+1)} = x^*$
    • Else
      • $x^{(i+1)} = x^{(i)}$
In [40]:
# target distribution
p = lambda x: 0.3*np.exp(-0.2*x**2) + 0.7*np.exp(-0.2*(x-10)**2)
In [41]:
# proposal distribution
from scipy.stats import norm
q = lambda xi: norm(xi, 100)
In [43]:
def metropolis_hastings_sample(p, q, x0 = 0, N=5000):
    x = np.zeros(N)
    x[0] = x0
    for i in range(N-1):
        u = np.random.rand()
        proposal = q(x[i]).rvs()
        acceptance = np.min([1, (p(proposal)*q(proposal).pdf(x[i]))/(p(x[i])*q(x[i]).pdf(proposal))])
        if u < acceptance:
            x[i+1] = proposal
        else:
            x[i+1] = x[i]
    return x
In [44]:
samples = metropolis_hastings_sample(p,q)
In [47]:
normalized_factor = quad(p, -10, 20)
In [48]:
x = np.linspace(-10,20, 1000)
plt.plot(x,p(x)/normalized_factor[0])
plt.hist(samples, bins=40, density=True);

Example: Normal-Normal conjugacy

Let $Y \mid \theta \sim \cal{N}(\theta, \sigma^2)$, where $\sigma^2$ is known but $\theta$ is unknown.

Using the Bayesian framework, we treat $\theta$ as a random variable, with prior given by $\theta \sim \cal{N}(\mu, \tau^2)$ for some known constants $\mu$ and $\tau^2$. That is, we have a two-level model $$\theta \sim \cal{N}(\mu, \tau^2)\\ Y \mid \theta \sim \cal{N}(\theta, \sigma^2)$$

We will use the Metropolis-Hastings algorithm to find the posterior mean and variance of $\theta$ after observing the value of $Y$.

After observing Y = y, we have $f_{\theta \mid y}(\theta \mid y) \propto f_{y\mid \theta}(y \mid \theta)f_{\theta}(\theta) \propto e^{-\frac{1}{2\sigma^2}(y-\theta)^2}e^{-\frac{1}{2\tau^2}(\theta-\mu)^2}$.

By completing the square, we can obtain an explicit formula for the posterior distribution of $\theta$:

$$\theta \mid Y = y \;\sim\; \cal{N}\left(\frac{\tau^2 y + \sigma^2\mu}{\sigma^2+\tau^2},\; \frac{\sigma^2\tau^2}{\sigma^2+\tau^2}\right)$$

Suppose that we didn't know how to complete the square (i.e., we do not have the above formula), or we simply want to verify the formula for specific values of $y, \sigma^2, \mu, \tau^2$. We can do this by simulating from the posterior distribution of $\theta$, using the Metropolis-Hastings algorithm to construct a Markov chain whose stationary distribution is $f_{\theta \mid y}(\theta \mid y)$.

A Metropolis-Hastings algorithm for generating $\theta_0, \theta_1, \ldots$ can be run directly on this unnormalized posterior.
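A minimal sketch (my construction, not a cell from the original): the values $y = 3$, $\sigma = 1$, $\mu = 0$, $\tau = 2$ are illustrative choices, and it reuses the metropolis function and the symmetric uniprop proposal defined earlier.

In [ ]:
# illustrative values (assumptions for this sketch)
y, sigma, mu, tau = 3.0, 1.0, 0.0, 2.0

# unnormalized posterior: likelihood * prior (normalizing constant not needed)
post = lambda th: np.exp(-(y - th)**2/(2*sigma**2)) * np.exp(-(th - mu)**2/(2*tau**2))

thetas = metropolis(post, uniprop, 50000, 0.0)
burn = 10000  # discard an initial burn-in
print("posterior mean ~", thetas[burn:].mean())  # exact: (tau**2*y + sigma**2*mu)/(sigma**2 + tau**2) = 2.4
print("posterior var  ~", thetas[burn:].var())   # exact: sigma**2*tau**2/(sigma**2 + tau**2) = 0.8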

Gibbs Sampling

Gibbs sampling is an MCMC algorithm for obtaining approximate draws from a joint distribution, based on sampling from conditional distributions one at a time: at each stage, one variable is updated (keeping all other variables fixed) by drawing from the conditional distribution of that variable given all the other variables.

This approach is very useful when we have conditional distributions that are pleasant to work with.

There are two major Gibbs samplers:

  1. Systematic scan Gibbs sampler: components (dimensions, variables) are updated one by one, in order.
  2. Random scan Gibbs sampler: a randomly chosen component is updated at each stage.

We illustrate the Gibbs sampler in the context of a bivariate distribution $p(X,Y)$ in which both $X$ and $Y$ are discrete.

Systematic scan Gibbs sampler: Let X and Y be discrete r.v.s with joint PMF $p_{X,Y}(x,y) = p(X=x, Y=y)$. We wish to construct a two-dimensional Markov chain $(X_n, Y_n)$ whose stationary distribution is $p_{X,Y}$. If the current state is $(X_n, Y_n) = (x_n,y_n)$, then:

  1. Draw a value $x_{n+1}$ from the conditional distribution of X given $Y = y_n$ and set $X_{n+1} = x_{n+1}$
  2. Draw a value $y_{n+1}$ from the conditional distribution of Y given $X = x_{n+1}$, and set $Y_{n+1} = y_{n+1}$
  3. Repeating steps 1 and 2 over and over, the stationary distribution of the chain $(X_0, Y_0),(X_1, Y_1), \dots$ is $p_{X,Y}$.

Random scan Gibbs sampler. The difference from the systematic scan Gibbs sampler is that at each stage it picks a uniformly random component and updates it, according to the conditional distribution given the other component (a self-contained sketch follows this list):

  1. Choose which component to update, with equal probabilities.
  2. If the X-component was chosen, draw a value $x_{n+1}$ from the conditional distribution of X given $Y = y_n$ and set $X_{n+1} = x_{n+1}, Y_{n+1} = y_n$. The update is similar if the Y-component is chosen.
  3. Repeating steps 1 and 2 over and over, the stationary distribution of the chain $(X_0, Y_0), (X_1, Y_1), \dots$ is $p_{X,Y}$.
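A self-contained random-scan sketch (my example, not the original's): for a standard bivariate normal with correlation $\rho$, the conditionals are $X \mid Y = y \sim \cal{N}(\rho y, 1-\rho^2)$, and symmetrically for $Y \mid X$.

In [ ]:
rho, N = 0.8, 50000
x, y = 0.0, 0.0
samples = np.zeros((N, 2))
for n in range(N):
    if np.random.uniform() < 0.5:   # pick a component uniformly at random
        x = np.random.normal(rho * y, np.sqrt(1 - rho**2))
    else:
        y = np.random.normal(rho * x, np.sqrt(1 - rho**2))
    samples[n] = x, y

print(np.corrcoef(samples[1000:].T))   # off-diagonal entries should be ~0.8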

Example: Building a Gibbs sampler for the density $$f(x,y) = x^2\exp(-xy^2 - y^2 + 2y - 4x)$$

We need to work out the conditional distributions of $X$ and $Y$.

Holding y constant, we have $f(x \mid y) \propto x^2 e^{-(y^2+4)x}$, that is, $X \mid Y = y \sim \Gam{(3,\ y^2+4)}$ with shape 3 and rate $y^2+4$ (the factor $\exp(-y^2+2y)$ is absorbed into the normalizing constant).

Holding x constant, we have $$f(y\mid x) = \cal{N}\left(\mu = \frac{1}{1+x},\ \sigma^2 = \frac{1}{2(1+x)}\right)$$

In [50]:
func = lambda x,y: x**2*np.exp( -x*y**2 - y**2 + 2*y - 4*x )

numgridpoints=400
x = np.linspace(0,2,numgridpoints)
y = np.linspace(-1,2.5,numgridpoints)
xx,yy = np.meshgrid(x,y)
zz = np.zeros((numgridpoints,numgridpoints))
for i in np.arange(0,numgridpoints):
    for j in np.arange(0,numgridpoints):
        zz[i,j]=func(xx[i,j],yy[i,j])
        
plt.contourf(xx,yy,zz);
In [51]:
def xcond(y):
    return gamma.rvs(3, scale=1/(y*y + 4))

def ycond(x):
    return norm.rvs(1/(1+x), scale=1.0/np.sqrt(2*(x+1)))

def gibbs(xgiveny_sample, ygivenx_sample, N, start = [0,0]):
    x=start[0]
    y=start[1]
    samples=np.zeros((N+1, 2))
    samples[0,0]=x
    samples[0,1]=y
    
    for i in range(1,N,2):
        x=xgiveny_sample(y)
        samples[i,0]=x
        samples[i, 1]=y
        
        y=ygivenx_sample(x)
        samples[i+1,0]=x
        samples[i+1,1]=y
        
    return samples
In [52]:
out=gibbs(xcond, ycond, 100000)
cmap = sns.cubehelix_palette(light=1, as_cmap=True)
plt.hist2d(out[10000:,0],out[10000:,1], density=True, bins=100, cmap=cmap)
plt.contour(xx,yy,zz)
plt.show()
In [53]:
out=gibbs(xcond, ycond, 10000, start = [2,2])
nr_t = 50
plt.contour(xx,yy,zz)
plt.plot(out[:nr_t, 0],out[:nr_t, 1], '.',alpha=0.8)
plt.plot(out[:nr_t, 0],out[:nr_t, 1], c='r', alpha=0.5, lw=1)
Out[53]:
[<matplotlib.lines.Line2D at 0x10cfcf7b8>]
In [54]:
def corrplot(trace, maxlags=50, dimension_label=''):
    plt.acorr(trace-np.mean(trace),  normed=True, maxlags=maxlags);
    plt.xlim([0, maxlags])
    plt.xlabel('lags')
    plt.ylabel('autocorrelation')
    plt.title(dimension_label)
In [55]:
corrplot(out[4000:,0], 100, 'X')
In [56]:
corrplot(out[4000:,1], 100, 'Y')