Image Segmentation w/ EM Algorithm
Overview
It is possible to segment images using a clustering method, where each segment corresponds to the cluster center to which a pixel belongs. Images are represented by their r, g, and b values. In this implementation, image pixels are clustered by applying an Expectation Maximization (EM) algorithm to a mixture of normal distributions model. Then the image is segmented by mapping each pixel to the cluster center with the highest posterior probability for that pixel. The results of segmentation can be displayed easily as simple images, where each pixel's color is replaced with the mean color of its assigned segment. This is the same basic technique used for some image compression, since storage only involves mapping pixels to a limited set of colors (cluster centers).
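As a rough illustration of that display step, here is a minimal Matlab sketch for rendering a segmented image once each pixel has been mapped to a cluster. The variable names (assignments, mu, the file names) are illustrative placeholders, not the names used in the full code below.

% Minimal sketch (illustrative names): render a segmented image by
% replacing each pixel's color with its cluster's mean color.
img = im2double(imread('balloons.jpg'));      % H x W x 3 image, values in [0,1]
[H, W, ~] = size(img);
pixels = reshape(img, H*W, 3);                % one r,g,b row per pixel

% Suppose mu is a k x 3 matrix of cluster mean colors and assignments is
% an (H*W) x 1 vector of cluster indices produced by the EM code below.
segmented = mu(assignments, :);               % look up each pixel's cluster mean
segmented = reshape(segmented, H, W, 3);
imwrite(segmented, 'balloons_segmented.png');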
EM Algorithm (Concept)
A normal distribution is capable of modelling a single blob of data points, but a single normal distribution will not produce a good model if there are multiple blobs of data points. The multivariate normal distribution is a generalization of the one-dimensional (univariate) normal distribution to higher dimensions. It is often used to describe a set of (possibly) correlated components (random variables), each of which clusters around a mean value. With the help of an affine transformation, any multivariate Gaussian can be converted to have zero mean and independent components. Therefore, under the right transformation, any Gaussian can be represented as the product of one-dimensional normal distributions (each with zero mean and unit standard deviation). This provides a way to simulate multivariate normal distributions: simply simulate a standard normal distribution for each component and apply the appropriate transformation afterwards.
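A rough sketch of that simulation idea (not part of the segmentation code itself), using an assumed example mean and covariance:

% Sketch: simulate n samples from N(mu, Sigma) using only standard normals.
n     = 1000;
mu    = [1; 2; 3];                            % desired mean (3-dimensional example)
Sigma = [2 0.5 0; 0.5 1 0.2; 0 0.2 1.5];      % desired covariance (positive definite)

Z = randn(3, n);                              % independent standard normal components
L = chol(Sigma, 'lower');                     % factor Sigma = L * L'
X = L * Z + repmat(mu, 1, n);                 % affine transform: each column ~ N(mu, Sigma)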
Extrapolating this idea to clustering, think of the data as if it were generated in a certain way and then figure out the specifics of that process. Assume that each data item was pulled from one of multiple distinct models. For k clusters, assume there were k distinct models. Assume that a magical process determined which model to use when producing a given data item. We know neither the models nor the origin of each data item (its assigned cluster). This is a very common situation, applicable to many types of problems.
A natural approach to solving this problem is to alternate between estimating the assignments of data items to models and re-estimating the models themselves. For example, K-means iteratively assigns points to clusters and then re-estimates the cluster centers. But what if the models are probabilistic, or there is no clear "distance function"? In that case, an Expectation Maximization (EM) algorithm is the standard approach.
At a high level, the EM algorithm works to average out what we don't know. It alternates between averaging and re-estimating parameters with soft methods. The E step (expectation) involves creating a function for the expectation of the log-likelihood evaluated using the current parameter estimates. The M step (maximization) involves re-computing the parameters to maximize the function created in the E step (the expected log-likelihood). For the next iteration, the E step creates the log-likelihood using the newer parameter estimates and so on. The algorithm repeats until acceptable convergence.
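As a rough outline, the alternation might be sketched in Matlab as follows, assuming for simplicity a unit-covariance Gaussian for each cluster and illustrative variable names; the actual implementation appears in full below.

% Outline of EM for a k-component mixture of Gaussians over pixel colors.
% pixels: N x 3 matrix of r,g,b values; k: number of segments; numIters: iteration count.
N = size(pixels, 1);
pi_hat = ones(1, k) / k;                      % mixture weights
mu_hat = pixels(randperm(N, k), :);           % initialize means to k random pixels

for iter = 1:numIters
    % E step: soft assignment of every pixel to every cluster
    % (the real E step is carried out in the log domain; see below).
    w = zeros(N, k);
    for j = 1:k
        d = pixels - repmat(mu_hat(j, :), N, 1);
        w(:, j) = pi_hat(j) * exp(-0.5 * sum(d.^2, 2));
    end
    w = w ./ repmat(sum(w, 2), 1, k);         % normalize so each row sums to 1

    % M step: re-estimate parameters from the soft assignments.
    for j = 1:k
        mu_hat(j, :) = (w(:, j)' * pixels) / sum(w(:, j));
    end
    pi_hat = sum(w, 1) / N;
end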
Results
Results were generated by segmenting 5 different images into 10, 20, and 50 segments. The resulting pixel-to-cluster mappings were stored as images after each iteration, and the results were compiled into animated GIFs. Additionally, one image (polar lights) was segmented into 20 segments using 6 different sets of randomly initialized values fed into the EM algorithm.
Balloons (10, 20 and 50 segments, 15 iterations)
Mountains (10, 20 and 50 segments, 15 iterations)
Nature (10, 20 and 50 segments, 15 iterations)
Ocean (10, 20 and 50 segments, 15 iterations)
Polar Lights (10, 20 and 50 segments, 15 iterations)
Exploring Variation in Results
To keep everything consistent, the EM algorithm was run for a fixed 30 iterations on the polar lights image using 6 different initial conditions for "pi" and "mu". For each of the 6 runs, the initial values for pi were randomized and then normalized so they summed to 1. The same process was used for each cluster's mu vector. It is easiest to see where the variation comes from by looking at the progression of iterations in the animated GIFs. Although each run converges to a very similar segmentation, there are slight differences. Depending on the random initial values for pi and mu, the first assignments of pixels to segments differ across the 6 trials. Certain colors appear in the early iterations and then morph toward the original image in later iterations. The center of the image shifts to shades of green, the right to blues, and the left to purples. These generalities hold across runs despite the differences in initial conditions, since they reflect the "ground truth" colors of the original image. The variance in the final images reflects the fact that the EM algorithm converges to parameters that achieve a local maximum (not a global maximum) of the likelihood of the data given the model. Intuitively, the process will find different maxima given different initial conditions. If one cluster is assigned a high pi value initially, it will find it relatively easy to attract pixel assignments, while a cluster that starts with a low pi value will struggle to do so.
Polar Lights (20 segments, 30 iterations) -- Randomized Starting Values (6 different cases)
Implementation
All of the code was written from scratch in Matlab. It has been included in its entirety below. I'm sure that further optimizations will be apparent to those experienced with vectorizing code, but this worked well enough for my purposes. A few notes are worth mentioning...
- While conceptually the EM algorithm involves selecting parameters that maximize an entire function, the only values required for successive iterations are the weights and the parameters themselves. Instead of calculating the full equation, the weights are recalculated and the parameters are adjusted each iteration.
- Here the pi's (the probability that a given cluster was chosen to generate a data point) and mu's (the mean value for each model/distribution) are initialized to values that are close to even, plus a small amount of Gaussian random noise. Shown later is a snippet to initialize the pi's and mu's in a uniform random fashion.
- The E step contains terms that may lead to underflow errors because of the multiplication of many tiny numbers. To avoid underflow, the implementation performs the E step in the log domain, where a product of k terms becomes a sum of k terms. Unfortunately, the denominator of one element in the equation contains a sum of products, which cannot be decomposed into a sum of logs of individual terms. Instead, I compute the logarithm of the sum of products directly. One could use a built-in log-sum-exp function to accomplish this, but I did it from scratch. I'll try to elaborate on this below, with an example from the topic model case (see INSERT LINK). The same logic applies to this implementation. Consider the following expression:
\[p\left ( \delta _{ij}=1|\theta ^{(n)},x \right )=\frac{[\prod _{k}p_{j,k}^{x_k}]\pi_{j}}{\sum _{l}[\prod _{k}p_{l,k}^{x_k}]\pi_{l}}\]
To avoid underflow, this will have to be computed in the log domain. But that denominator is a sum of products which doesn't decompose into a sum of logs of individual terms. Instead, compute the logarithm of sums of products. For easy notation, denote:
\[\begin{align*}
Y &= p\left ( \delta _{ij}=1|\theta ^{(n)},x \right )\\
A_{j} &= [\prod _{k}p_{j,k}^{x_k}]\pi_{j}\\
A_{max} &= \max_{j} A_{j}
\end{align*}
\]
Therefore,
\[\log A_{j} = \log \pi_{j} + \sum_{k} x_{k}\log p_{j, k}\\
\log A_{max} = \log\left(\max_{j}A_{j}\right) =\max_{j} \log\left(A_{j}\right)\]
Then, derive and calculate the logarithm of Y as:
\[\begin{align*}
\log Y &= {\log A_{j}} - {\log\left(\sum_{l} A_{l}\right)}\\
&={\log A_{j}} - {\log\left(\sum_{l} e^{\log A_{l}}\right)}\\
&={\log A_{j}} - {\log\left(e^{\log A_{max}}\sum_{l} e^{\log A_{l} - \log A_{max}}\right)}\\
&={\log A_{j}} - {\log e^{\log A_{max}}} - {\log\left(\sum_{l} e^{\log A_{l} - \log A_{max}}\right)}\\
&={\log A_{j}} - {\log A_{max}} - {\log\left(\sum_{l} e^{\log A_{l} - \log A_{max}}\right)}
\end{align*}\]
Then, simply exponentiate that result to yield Y. This means that, on each iteration, the E step follows these steps:
For every data item "i", do:
1. Compute log(Aj) for every j
2. While performing step 1, note the largest value computed (considering each final log(Aj)) and remember it as log(Amax)
3. Because it is identical for all j, compute the following term from the final equation for logY:
\[{\log\left(\sum_{l} e^{\log A_{l} - \log A_{max}}\right)} \]
4. For each cluster j, compute logY
5. Exponentiate this logY, i.e. exp(logY), to yield w(i,j)
The M step will change the values of Aj, so each E step effectively computes new weights corresponding to the assignments of data points to clusters.
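For illustration, a Matlab sketch of that recipe might look like the following. The variable names are illustrative; logA would be an N x k matrix holding log(Aj) for every data item and cluster, computed from the current parameter estimates.

% Log-domain E step, following steps 1-5 above (illustrative names).
% logA(i, j) = log(pi_j) + log-likelihood of data item i under cluster j    (step 1)
logAmax = max(logA, [], 2);                                  % step 2: per-item maximum, N x 1
logSum  = log(sum(exp(logA - repmat(logAmax, 1, k)), 2));    % step 3: shared log-sum term
logY    = logA - repmat(logAmax + logSum, 1, k);             % step 4: logY for every cluster j
w       = exp(logY);                                         % step 5: weights w(i, j); each row sums to 1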
For the random initializations used for the exploration of variation, the following snippets were used.