This post describes some generative machine learning algorithms.

# Gaussian Discriminant Analysis

Gaussian Discriminant Analysis is a **supervised classification algorithm**. In this algorithm we suppose the p(x|y) is multivariate Gaussian with parameter mean \(\overrightarrow{u}\) and covariance \(\Sigma : P(x|y) \sim N(\overrightarrow{u}, \Sigma) \).

If we suppose that y is Bernoulli distributed, then there are two distributions that can be defined, one for p(x|y=0) and one for p(x|y=1). The separator between these two distributions could be used to predict the values of y.

\(P(x|y=0) = \frac{1}{(2\pi)^{\frac{n}{2}}|\Sigma|^{\frac{1}{2}}} exp(-\frac{1}{2}(x – \mu_0)^T\Sigma^{-1}(x – \mu_0))\) \(P(x|y=1) = \frac{1}{(2\pi)^{\frac{n}{2}}|\Sigma|^{\frac{1}{2}}} exp(-\frac{1}{2}(x – \mu_1)^T\Sigma^{-1}(x – \mu_1))\) \(P(y) = \phi^y.(1-\phi)^{1-y}\)We need to find \(\phi, \mu_0, \mu_1, \Sigma\) that maximize the log-likelihood function:

\(l(\phi, \mu_0, \mu_1, \Sigma) = log(\prod_{i=1}^{m} P(y^{(i)}|x^{(i)};\phi, \mu_0, \mu_1, \Sigma))\)Using the Naive Bayes rule, we can transform the equation to:

\(= log(\prod_{i=1}^{m} P(x^{(i)} | y^{(i)};\mu_0, \mu_1, \Sigma) * P(y^{(i)};\phi))) – log(\prod_{i=1}^{m} P(x^{(i)}))\)To maximize l, we need to maximize: \(log(\prod_{i=1}^{m} P(x^{(i)} | y^{(i)};\mu_0, \mu_1, \Sigma) * P(y^{(i)};\phi)))\)

For new data, to predict y, we need to calculate \(P(y=1|x) = \frac{P(x|y=1) * P(y=1)}{P(x|y=0)P(y=0) + P(x|y=1)P(y=1)}\) (P(y=0) and P(y=1) can be easily calculated from training data).

# EM Algorithm

EM algorithm is an **unsupervised clustering algorithm** in which we assign a label to each training example. The clustering applied is soft clustering, which means that for each point we assign a probability for each label.

For a point x, we calculate the probability for each label.

There are two main steps in this algorithm:

**E-Step**: Using the current estimate of parameters, calculate if points are more likely to come from the red distribution or the blue distribution.

**M-Step**: Re-estimate the parameters.

Given a training set \(S = \{x^{(i)}\}_{i=1}^m\). For each \(x^{(i)}\), there is hidden label \(z^{(i)}\) assigned to it.

Z is a multinomial distribution with \(P(z = j)=?_j\).

k the number of distinct values of Z (number of classes).

For each \(x^{(i)}\), we would like to know the value \(z^{(i)}\) that is assigned to it. In other words, find \(l \in [1, k]\) that maximizes the probability \(P(z^{(i)} = l | x^{(i)})\)

Log-likelihood function to maximize: \(l(?) = log(\prod_{i=1}^m P(x^{(i)}; ?)) = \sum_{i=1}^m log(P(x^{(i)}; ?))\)

\( = \sum_{i=1}^m log(\sum_{l=1}^k P(x^{(i)}|z^{(i)} = l; ?) * P(z^{(i)} = l))\) \( = \sum_{i=1}^m log(\sum_{l=1}^k P(x^{(i)}, z^{(i)} = l; ?))\)Let’s define \(Q_i\) as a probability distribution over Z: (\(Q_i(z^{(i)} = l)\) > = 0 and \(\sum_{l=1}^k Q_i(z^{(i)} = l) = 1\))

\(l(?) = \sum_{i=1}^m log(\sum_{l=1}^k Q_i(z^{(i)} = l) \frac{P(x^{(i)}, z^{(i)} = l; ?)}{Q_i(z^{(i)} = l)})\) \( = \sum_{i=1}^m log(E_{z^{(i)} \sim Q_i}[\frac{P(x^{(i)}, z^{(i)}; ?)}{Q_i(z^{(i)})}])\)Log is concave function. Based on Jensen’s inequality log(E[X]) >= E[log(X)].

Then:

\(l(?) >= \sum_{i=1}^m E_{z^{(i)} \sim Q_i}[log(\frac{P(x^{(i)}, z^{(i)}; ?)}{Q_i(z^{(i)})})]\) \( = \sum_{i=1}^m \sum_{l=1}^k Q_i(z^{(i)}=l) log(\frac{P(x^{(i)}, z^{(i)}=l; ?)}{Q_i(z^{(i)}=l)})\)

If we set \(Q_i(z^{(i)}) = P(z^{(i)}| x^{(i)}; ?)\) then: \(\frac{P(x^{(i)}, z^{(i)}; ?)}{Q(z^{(i)})}\) \(= \frac{P(z^{(i)}| x^{(i)}; ?) * P(x^{(i)}; ?)}{P(z^{(i)}| x^{(i)}; ?)}\) \(= P(x^{(i)}; ?)\) = Constant (over z). In that case log(E[X]) = E[log(X)] and for the current estimate of ?, \(l(?) = \sum_{i=1}^m \sum_{l=1}^k Q_i(z^{(i)}=l) log(\frac{P(x^{(i)}, z^{(i)}=l; ?)}{Q_i(z^{(i)}=l)})\) (E-Step)

EM (Expectation-Maximization) algorithm for density estimation:

1-Initiliaze the parameters ?

2-For each i, set \(Q_i(z^{(i)}) = P(z^{(i)}| x^{(i)}; ?)\) (E-Step)

3-Update ? (M-Step)

\(? := arg\ \underset{?}{max} \sum_{i=1}^m \sum_{l=1}^k Q_i(z^{(i)}=l) log(\frac{P(x^{(i)}, z^{(i)}=l; ?)}{Q_i(z^{(i)}=l)})\)**Mixture of Gaussians**

Given a value \(z^{(i)}=j\), If x is distributed normally \(N(?_j, ?_j)\), then:

\(Q_i(z^{(i)} = j) = P(z^{(i)} = j| x^{(i)}; ?)\) \(= \frac{P(x^{(i)} | z^{(i)}=j) * P(z^{(i)} = j)}{\sum_{l=1}^k P(x^{(i)} | z^{(i)}=l) * P(z^{(i)}=l)}\) \(= \frac{\frac{1}{{(2?)}^{\frac{n}{2}} |?_j|^\frac{1}{2}} exp(-\frac{1}{2} (x^{(i)}-?_j)^T {?_j}^{-1} (x^{(i)}-?_j)) * ?_j}{\sum_{l=1}^k \frac{1}{{(2?)}^{\frac{n}{2}} |?_l|^\frac{1}{2}} exp(-\frac{1}{2} (x^{(i)}-?_l)^T {?_l}^{-1} (x^{(i)}-?_l)) * ?_l}\)Using the method of Lagrange Multipliers, we can find \(?_j, ?_j, ?_j\) that maximize: \(\sum_{i=1}^m \sum_{l=1}^k Q_i(z^{(i)}=l) log(\frac{\frac{1}{{(2?)}^{\frac{n}{2}} |?_j|^\frac{1}{2}} exp(-\frac{1}{2} (x^{(i)}-?_j)^T {?_j}^{-1} (x^{(i)}-?_j)) * ?_j}{Q_i(z^{(i)}=l)})\) with the constraint \(\sum_{j=1}^k ?_j = 1\). We also need to set to zero the partial derivatives with respect to \(?_j\) and \(?_j\) and solve the equations to find the new values of these parameters. Repeat the operations until convergence.

More details about the algorithm can be found here: https://www.youtube.com/watch?v=ZZGTuAkF-Hw

# Naive Bayes Classifier

Naive Bayes classifier is a **supervised classification algorithm**. In this algorithm, we try to calculate P(y|x), but this probability can be calculated using the naive bayes rule:

P(y|x) = P(x|y).P(y)/P(x)

If x is the vector \([x_1,x_2,…,x_n]\), then:

\(P(x|y) = P(x_1,x_2,…,x_n|y) = P(x_1|y) P(x_2|y, x_1) ….P(x_n|y, x_{n-1},…,x_1)\) \(P(x) = P(x1,x2,…,x_n)\)If we suppose that \(x_i\) are conditionally independent (Naive Bayes assumption), then:

\(P(x|y) = P(x_1|y) * P(x_2|y) * … * P(x_n|y)\) \(= \prod_{i=1}^n P(x_i|y)\) \(P(x) = \prod_{i=1}^n P(x_i)\)If \(\exists\) i such as \(P(x_i|y)=0\), then P(x|y) = 0. To avoid that case we need to use Laplace Smoothing when calculating probabilities.

We suppose that \(x_i \in \{0,1\}\) (one-hot encoding: [1,0,1,…,0]) and y \(\in \{0,1\}\).

We define \(\phi_{i|y=1}, \phi_{i|y=0}, \phi_y\) such as:

\(P(x_i=1|y=1) = \phi_{i|y=1} \\ P(x_i=0|y=1) = 1 – \phi_{i|y=1}\) \(P(x_i=1|y=0) = \phi_{i|y=0} \\ P(x_i=0|y=0) = 1 – \phi_{i|y=0}\) \(P(y) = \phi_y^y.(1-\phi_y)^{1-y}\)We need to find \(\phi_y, \phi_{i|y=0}, \phi_{i|y=1}\) that maximize the log-likelihood function:

\(l(\phi, \phi_{i|y=0}, \phi_{i|y=1}) = log(\prod_{i=1}^{m} \prod_{j=1}^n P(x_j^{(i)}|y^{(i)}) P(y^{(i)})/P(x^{(i)}))\)Or maximize the following expression:

\(log(\prod_{i=1}^{m} \prod_{j=1}^n P(x_j^{(i)}|y^{(i)}) P(y^{(i)}))\)