# Machine Learning Algorithms – Part 1

This post describes some discriminative machine learning algorithms.

 Distribution of Y given X Algorithm to predict Y Normal distribution Linear regression Bernoulli distribution Logistic regression Multinomial distribution Multinomial logistic regression (Softmax regression) Exponential family distribution Generalized linear regression
 Distribution of X Algorithm to predict Y Multivariate normal distribution Gaussian discriminant analysis or EM Algorithm X Features conditionally independent $$p(x_1, x_2|y)=p(x_1|y) * p(x_2|y)$$ Naive Bayes Algorithm

Other ML algorithms are based on geometry, like the SVM and K-means algorithms.

# Linear Regression

Below a table listing house prices by size.

 x = House Size ($$m^2$$) y = House Price (k\$) 50 99 50 100 50 100 50 101 60 110

For the size 50$$m^2$$, if we suppose that prices are normally distributed around the mean Î¼=100 with a standard deviation Ïƒ, then P(y|x = 50) = $$\frac{1}{\sigma \sqrt{2\pi}} exp(-\frac{1}{2} (\frac{y-Î¼}{\sigma})^{2})$$

We define h(x) as a function that returns the mean of the distribution of y given x (E[y|x]). We will define this function as a linear function. $$E[y|x] = h_{Î¸}(x) = \theta^T x$$

P(y|x = 50; Î¸) = $$\frac{1}{\sigma \sqrt{2\pi}} exp(-\frac{1}{2} (\frac{y-h_{Î¸}(x)}{\sigma})^{2})$$

We need to find Î¸ that maximizes the probability for all values of x. In other words, we need to find Î¸ that maximizes the likelihood function L:

$$L(\theta)=P(\overrightarrow{y}|X;Î¸)=\prod_{i=1}^{n} P(y^{(i)}|x^{(i)};Î¸)$$

Or maximizes the log likelihood function l:

$$l(\theta)=log(L(\theta )) = \sum_{i=1}^{m} log(P(y^{(i)}|x^{(i)};\theta ))$$ $$= \sum_{i=1}^{m} log(\frac{1}{\sigma \sqrt{2\pi}}) -\frac{1}{2} \sum_{i=1}^{n} (\frac{y^{(i)}-h_{Î¸}(x^{(i)})}{\sigma})^{2}$$

To maximize l, we need to minimize J(Î¸) = $$\frac{1}{2} \sum_{i=1}^{m} (y^{(i)}-h_{Î¸}(x^{(i)}))^{2}$$. This function is called the Cost function (or Energy function, or Loss function, or Objective function) of a linear regression model. It’s also called “Least-squares cost function”.

J(Î¸) is convex, to minimize it, we need to solve the equation $$\frac{\partial J(Î¸)}{\partial Î¸} = 0$$. A convex function has no local minimum.

There are many methods to solve this equation:

• Gradient descent (Batch or Stochastic Gradient descent)
• Normal equation
• Newton method
• Matrix differentiation

Gradient descent is the most used Optimizer (also called Learner or Solver) for learning model weights.

$$Î¸_{j} := Î¸_{j} – \alpha \frac{\partial J(Î¸)}{\partial Î¸_{j}} = Î¸_{j} – Î± \frac{\partial \frac{1}{2} \sum_{i=1}^{n} (y^{(i)}-h_{Î¸}(x^{(i)}))^{2}}{\partial Î¸_{j}}$$

Î± is called “Learning rate”

$$Î¸_{j} := Î¸_{j} – Î± \frac{1}{2} \sum_{i=1}^{n} \frac{\partial (y^{(i)}-h_{Î¸}(x^{(i)}))^{2}}{\partial Î¸_{j}}$$

If $$h_{Î¸}(x)$$ is a linear function ($$h_{Î¸} = Î¸^{T}x$$), then :

$$Î¸_{j} := Î¸_{j} – Î± \sum_{i=1}^{n} (h_{Î¸}(x^{(i)}) – y^{(i)}) * x_j^{(i)}$$

Batch size should fit the size of CPU or GPU memories, otherwise learning speed will be extremely slow.

When using Batch gradient descent, the cost function in general decreases without oscillations.

Stochastic (Online) Gradient Descent (SGD) (use one example for each iteration – pass through all data N times (N Epoch))

$$Î¸_{j} := Î¸_{j} – \alpha (h_{Î¸}(x^{(i)}) – y^{(i)}) * x_j^{(i)}$$

This learning rule is called “Least mean squares (LMS)” learning rule. It’s also called Widrow-Hoff learning rule.

Run gradient descent for each mini-batch until we pass through traning set (1 epoch). Repeat the operation many times.

$$Î¸_{j} := Î¸_{j} – \alpha \sum_{i=1}^{20} (h_{Î¸}(x^{(i)}) – y^{(i)}) * x_j^{(i)}$$

Mini-batch size should fit the size of CPU or GPU memories.

When using Batch gradient descent, the cost function decreases quickly but with oscillations.

Learning rate decay

Itâ€™s a technique used to automatically reduce learning rate after each epoch. Decay rate is a hyperparameter.

$$Î± = \frac{1}{1+ decayrate + epochnum} . Î±_0$$

Momentum

Momentum is a method used to accelerate gradient descent. The idea is to add an extra term to the equation to accelerate descent steps.

$$Î¸_{j_{t+1}} := Î¸_{j_t} – Î± \frac{\partial J(Î¸_{j_t})}{\partial Î¸_j} \color{blue} {+ Î» (Î¸_{j_t} – Î¸_{j_{t-1}})}$$

Below another way to write the expression:

$$v(Î¸_{j},t) = Î± . \frac{\partial J(Î¸_j)}{\partial Î¸_j} + Î» . v(Î¸_{j},t-1) \\ Î¸_{j} := Î¸_{j} – \color{blue} {v(Î¸_{j},t)}$$

Nesterov Momentum is a slightly different version of momentum method.

Adam is another method used to accelerate gradient descent.

The problem in this method is that the term grad_squared becomes large after running many gradient descent steps. The term grad_squared is used to accelerate gradient descent when gradients are small, and slow down gradient descent when gradients are large.

RMSprop

RMSprop is an enhanced version of AdaGrad.

The term decay_rate is used to apply exponential smoothing on grad_squared term.

Adam is a combination of Momentum and RMSprop.

Normal equation

To minimize the cost function $$J(Î¸) = \frac{1}{2} \sum_{i=1}^{n} (y^{(i)}-h_{Î¸}(x^{(i)}))^{2} = \frac{1}{2} (XÎ¸ – y)^T(XÎ¸ – y)$$, we need to solve the equation:

$$\frac{\partial J(Î¸)}{\partial Î¸} = 0 \\ \frac{\partial trace(J(Î¸))}{\partial Î¸} = 0 \\ \frac{\partial trace((XÎ¸ – y)^T(XÎ¸ – y))}{\partial Î¸} = 0 \\ \frac{\partial trace(Î¸^TX^TXÎ¸ – Î¸^TX^Ty – y^TXÎ¸ + y^Ty)}{\partial Î¸} = 0$$ $$\frac{\partial trace(Î¸^TX^TXÎ¸) – trace(Î¸^TX^Ty) – trace(y^TXÎ¸) + trace(y^Ty))}{\partial Î¸} = 0 \\ \frac{\partial trace(Î¸^TX^TXÎ¸) – trace(Î¸^TX^Ty) – trace(y^TXÎ¸))}{\partial Î¸} = 0$$ $$\frac{\partial trace(Î¸^TX^TXÎ¸) – trace(y^TXÎ¸) – trace(y^TXÎ¸))}{\partial Î¸} = 0 \\ \frac{\partial trace(Î¸^TX^TXÎ¸) – 2 trace(y^TXÎ¸))}{\partial Î¸} = 0 \\ \frac{\partial trace(Î¸Î¸^TX^TX)}{\partial Î¸} – 2 \frac{\partial trace(Î¸y^TX))}{\partial Î¸} = 0 \\ 2 X^TXÎ¸ – 2 X^Ty = 0 \\ X^TXÎ¸= X^Ty \\ Î¸ = {(X^TX)}^{-1}X^Ty$$

If $$X^TX$$ is singular, then we need to calculate the pseudo inverse instead of the inverse.

Newton method

$$J”(Î¸_{t}) := \frac{J'(Î¸_{t+1}) – J'(Î¸_{t})}{Î¸_{t+1} – Î¸_{t}}$$ $$\rightarrow Î¸_{t+1} := Î¸_{t} – \frac{J'(Î¸_{t})}{J”(Î¸_{t})}$$

Matrix differentiation

To minimize the cost function: $$J(Î¸) = \frac{1}{2} \sum_{i=1}^{n} (y^{(i)}-h_{Î¸}(x^{(i)}))^{2} = \frac{1}{2} (XÎ¸ – y)^T(XÎ¸ – y)$$, we need to solve the equation:

$$\frac{\partial J(Î¸)}{\partial Î¸} = 0 \\ \frac{\partial Î¸^TX^TXÎ¸ – Î¸^TX^Ty – y^TXÎ¸ + y^Ty}{\partial Î¸} = 2X^TXÎ¸ – \frac{\partial Î¸^TX^Ty}{\partial Î¸} – X^Ty = 0$$

$$2X^TXÎ¸ – \frac{\partial y^TXÎ¸}{\partial Î¸} – X^Ty = 2X^TXÎ¸ – 2X^Ty = 0$$ (Note: In matrix differentiation: $$\frac{\partial AÎ¸}{\partial Î¸} = A^T$$ and $$\frac{\partial Î¸^TAÎ¸}{\partial Î¸} = 2A^TÎ¸$$)

we can deduce $$X^TXÎ¸ = X^Ty$$ and $$Î¸ = (X^TX)^{-1}X^Ty$$

# Logistic Regression

Below a table that shows tumor types by size.

 x = Tumor Size (cm) y = Tumor Type (Benign=0, Malignant=1) 1 0 1 0 2 0 2 1 3 1 3 1

Given x, y is distributed according to the Bernoulli distribution with probability of success p = E[y|x].

$$P(y|x;Î¸) = p^y (1-p)^{(1-y)}$$

We define h(x) as a function that returns the expected value (p) of the distribution. We will define this function as:

$$E[y|x] = h_{Î¸}(x) = g(Î¸^T x) = \frac{1}{1+exp(-Î¸^T x)}$$. g is called Sigmoid (or logistic) function.

P(y|x; Î¸) = $$h_{Î¸}(x)^y (1-h_{Î¸}(x))^{(1-y)}$$

We need to find Î¸ that maximizes this probability for all values of x. In other words, we need to find Î¸ that maximizes the likelihood function L:

$$L(Î¸)=P(\overrightarrow{y}|X;Î¸)=\prod_{i=1}^{m} P(y^{(i)}|x^{(i)};Î¸)$$

Or maximize the log likelihood function l:

$$l(Î¸)=log(L(Î¸)) = \sum_{i=1}^{m} log(P(y^{(i)}|x^{(i)};Î¸ ))$$ $$= \sum_{i=1}^{m} y^{(i)} log(h_{Î¸}(x^{(i)}))+ (1-y^{(i)}) log(1-h_{Î¸}(x^{(i)}))$$

Or minimize the $$-l(Î¸) = \sum_{i=1}^{m} -y^{(i)} log(h_{Î¸}(x^{(i)})) – (1-y^{(i)}) log(1-h_{Î¸}(x^{(i)})) = J(Î¸)$$

J(Î¸) is convex, to minimize it, we need to solve the equation $$\frac{\partial J(Î¸)}{\partial Î¸} = 0$$.

There are many methods to solve this equation:

$$Î¸_{j} := Î¸_{j} – Î± \sum_{i=1}^{n} (h_{Î¸}(x^{(i)}) – y^{(i)}) * x_j^{(i)}$$

Logit function (inverse of logistic function)

Logit function is defined as follow: $$logit(p) = log(\frac{p}{1-p})$$

The idea in the use of this function is to transform the interval of p (outcome) from [0,1] to [0, âˆž]. So instead of applying linear regression on p, we will apply it on logit(p).

Once we find Î¸ that maximizes the Likelihood function, we can then estimate logit(p) given a value of x ($$logit(p) = h_{Î¸}(x)$$). p can be then calculated using the following formula:

$$p = \frac{1}{1+exp(-logit(h_{Î¸}(x)))}$$

# Multinomial Logistic Regression (using maximum likelihood estimation)

In multinomial logistic regression (also called Softmax Regression), y could have more than two outcomes {1,2,3,…,k}.

Below a table that shows tumor types by size.

 x = Tumor Size (cm) y = Tumor Type (Type1= 1, Type2= 2, Type3= 3) 1 1 1 1 2 2 2 2 2 3 3 3 3 3

Given x, we can define a multinomial distribution with probabilities of success $$\phi_j = E[y=j|x]$$.

$$P(y=j|x;\Theta) = Ï•_j \\ P(y=k|x;\Theta) = 1 – \sum_{j=1}^{k-1} Ï•_j \\ P(y|x;\Theta) = Ï•_1^{1\{y=1\}} * … * Ï•_{k-1}^{1\{y=k-1\}} * (1 – \sum_{j=1}^{k-1} Ï•_j)^{1\{y=k\}}$$

We define $$\tau(y)$$ as a function that returns a $$R^{k-1}$$ vector with value 1 at the index y: $$\tau(y) = \begin{bmatrix}0\\0\\1\\0\\0\end{bmatrix}$$, when $$y \in \{1,2,…,k-1\}$$, .

and $$\tau(y) = \begin{bmatrix}0\\0\\0\\0\\0\end{bmatrix}$$, when y = k.

We define $$\eta(x)$$ as a $$R^{k-1}$$ vector = $$\begin{bmatrix}log(\phi_1/\phi_k)\\log(\phi_2/\phi_k)\\…\\log(\phi_{k-1}/\phi_k)\end{bmatrix}$$

$$P(y|x;\Theta) = 1 * exp(Î·(x)^T * \tau(y) – (-log(\phi_k)))$$

This form is an exponential family distribution form.

We can invert $$\eta(x)$$ and find that:

$$Ï•_j = Ï•_k * exp(Î·(x)_j)$$ $$= \frac{1}{1 + \frac{1-Ï•_k}{Ï•_k}} * exp(Î·(x)_j)$$ $$=\frac{exp(Î·(x)_j)}{1 + \sum_{c=1}^{k-1} Ï•_c/Ï•_k}$$ $$=\frac{exp(Î·(x)_j)}{1 + \sum_{c=1}^{k-1} exp(Î·(x)_c)}$$

If we define Î·(x) as linear function, $$Î·(x) = Î˜^T x = \begin{bmatrix}Î˜_{1,1} x_1 +… + Î˜_{n,1} x_n \\Î˜_{1,2} x_1 +… + Î˜_{n,2} x_n\\…\\Î˜_{1,k-1} x_1 +… + Î˜_{n,k-1} x_n\end{bmatrix}$$, and Î˜ is a $$R^{n*(k-1)}$$ matrix.

Then: $$Ï•_j = \frac{exp(Î˜_j^T x)}{1 + \sum_{c=1}^{k-1} exp(Î˜_c^T x)}$$

The hypothesis function could be defined as: $$h_Î˜(x) = \begin{bmatrix}\frac{exp(Î˜_1^T x)}{1 + \sum_{c=1}^{k-1} exp(Î˜_c^T x)} \\…\\ \frac{exp(Î˜_{k-1}^T x)}{1 + \sum_{c=1}^{k-1} exp(Î˜_c^T x)} \end{bmatrix}$$

We need to find Î˜ that maximizes the probabilities P(y=j|x;Î˜) for all values of x. In other words, we need to find Î¸ that maximizes the likelihood function L:

$$L(Î¸)=P(\overrightarrow{y}|X;Î¸)=\prod_{i=1}^{m} P(y^{(i)}|x^{(i)};Î¸)$$ $$=\prod_{i=1}^{m} \phi_1^{1\{y^{(i)}=1\}} * … * \phi_{k-1}^{1\{y^{(i)}=k-1\}} * (1 – \sum_{j=1}^{k-1} \phi_j)^{1\{y^{(i)}=k\}}$$ $$=\prod_{i=1}^{m} \prod_{c=1}^{k-1} \phi_c^{1\{y^{(i)}=c\}} * (1 – \sum_{j=1}^{k-1} \phi_j)^{1\{y^{(i)}=k\}}$$

$$=\prod_{i=1}^{m} \prod_{c=1}^{k-1} \phi_c^{1\{y^{(i)}=c\}} * (1 – \sum_{j=1}^{k-1} \phi_j)^{1\{y^{(i)}=k\}}$$ and $$Ï•_j = \frac{exp(Î˜_j^T x)}{1 + \sum_{c=1}^{k-1} exp(Î˜_c^T x)}$$

# Multinomial Logistic Regression (using cross-entropy minimization)

In this section, we will try to minimize the cross-entropy between Y and estimated $$\widehat{Y}$$.

We define $$W \in R^{d*n}$$, $$b \in R^{d}$$ such as $$S(W x + b) = \widehat{Y}$$, S is the Softmax function, k is the number of outputs, and $$x \in R^n$$.

To estimate W and b, we will need to minimize the cross-entropy between the two probability vectors Y and $$\widehat{Y}$$.

The cross-entropy is defined as below:

$$D(\widehat{Y}, Y) = -\sum_{j=1}^d Y_j Log(\widehat{Y_j})$$

Example:

if $$\widehat{y} = \begin{bmatrix}0.7 \\0.1 \\0.2 \end{bmatrix} \& \ y=\begin{bmatrix}1 \\0 \\0 \end{bmatrix}$$ then $$D(\widehat{Y}, Y) = D(S(W x + b), Y) = -1*log(0.7)$$

We need to minimize the entropy for all training examples, therefore we will need to minimize the average cross-entropy of the entire training set.

$$L(W,b) = \frac{1}{m} \sum_{i=1}^m D(S(W x^{(i)} + b), y^{(i)})$$, L is called the loss function.

If we define $$W = \begin{bmatrix} — Î¸_1 — \\ — Î¸_2 — \\ .. \\ — Î¸_d –\end{bmatrix}$$ such as:

$$Î¸_1=\begin{bmatrix}Î¸_{1,0}\\Î¸_{1,1}\\…\\Î¸_{1,n}\end{bmatrix}, Î¸_2=\begin{bmatrix}Î¸_{2,0}\\Î¸_{2,1}\\…\\Î¸_{2,n}\end{bmatrix}, Î¸_d=\begin{bmatrix}Î¸_{d,0}\\Î¸_{d,1}\\…\\Î¸_{d,n}\end{bmatrix}$$

We can then write $$L(W,b) = \frac{1}{m} \sum_{i=1}^m D(S(W x^{(i)} + b), y^{(i)})$$

$$= \frac{1}{m} \sum_{i=1}^m \sum_{j=1}^d 1^{\{y^{(i)}=j\}} log(\frac{exp(Î¸_k^T x^{(i)})}{\sum_{k=1}^d exp(Î¸_k^T x^{(i)})})$$

For d=2 (nbr of class=2), $$= \frac{1}{m} \sum_{i=1}^m 1^{\{y^{(i)}=\begin{bmatrix}1 \\ 0\end{bmatrix}\}} log(\frac{exp(Î¸_1^T x^{(i)})}{exp(Î¸_1^T x^{(i)}) + exp(Î¸_2^T x^{(j)})}) + 1^{\{y^{(i)}=\begin{bmatrix}0 \\ 1\end{bmatrix}\}} log(\frac{exp(Î¸_2^T x^{(i)})}{exp(Î¸_1^T x^{(i)}) + exp(Î¸_2^T x^{(i)})})$$

$$1^{\{y^{(i)}=\begin{bmatrix}1 \\ 0\end{bmatrix}\}}$$ means that the value is 1 if $$y^{(i)}=\begin{bmatrix}1 \\ 0\end{bmatrix}$$ otherwise the value is 0.

To estimate $$Î¸_1,…,Î¸_d$$, we need to calculate the derivative and update $$Î¸_j = Î¸_j – Î± \frac{\partial L}{\partial Î¸_j}$$

# Kernel regression

Kernel regression is a non-linear model. In this model we define the hypothesis as the sum of kernels.

$$\widehat{y}(x) = Ï•(x) * Î¸ = Î¸_0 + \sum_{i=1}^d K(x, Î¼_i, Î») Î¸_i$$ such as:

$$Ï•(x) = [1, K(x, Î¼_1, Î»),…, K(x, Î¼_d, Î»)]$$ and $$Î¸ = [Î¸_0, Î¸_1,…, Î¸_d]$$

For example, we can define the kernel function as : $$K(x, Î¼_i, Î») = exp(-\frac{1}{Î»} ||x-Î¼_i||^2)$$

Usually we select d = number of training examples, and $$Î¼_i = x_i$$

Once the vector Ï•(X) calculated, we can use it as new engineered vector, and then use the normal equation to find Î¸:

$$Î¸ = {(Ï•(X)^TÏ•(X))}^{-1}Ï•(X)^Ty$$

# Bayes Point Machine

The Bayes Point Machine is a Bayesian linear classifier that can be converted to a nonlinear classifier by using feature expansions or kernel methods as the Support Vector Machine (SVM).

More details will be provided.

# Ordinal Regression

Ordinal Regression is used for predicting an ordinal variable. An ordinal variable is a categorical variable for which the possible values are ordered (eg. size: Small, Medium, Large).

More details will be provided.

# Poisson Regression

Poisson regression assumes the output variable Y has a Poisson distribution, and assumes the logarithm of its expected value can be modeled by a linear function.

log(E[Y|X]) = log(Î») = Î¸.x

# Hashing

Feature hashing is used to convert a categorical feature into a small-dimension feature space.

For example a vocabulary of words can be represented in a 5 dimensional space $$\{0,1\}^5$$ (e.g. Apple â†’ [1,0,1,1,0]).

Hashing introduces noise as collisions are permitted (Carâ†’ [0,0,0,1,1], Flightâ†’ [0,0,0,1,1]).

# Embedding

With embedding the representation is more dense (e.g. Car â†’ [0.4,0.9,0.1,0,0.1])

# Bucketing

Bucketing is a technique for decomposing feature values into buckets.

# Crossing

Crossing features is combining multiple features in one feature.

# Architecture

RNNs are generally used for sequence modeling (e.g. language modeling, time series modeling,…).

Unfolding a RNN network (many to many scenario) could be presented as follow:

When training the model, we need to find W that minimizes the sum of losses.

# Multilayer RNN

Multilayer RNN is a neural network with multiple RNN layers.

# Backpropagation through time

To train a RNN, we need to calculate the gradient to update parameters. Instead of doing the derivations for the full network, we will focus only on one hidden unit.

We define $$s_t = g(W.x_t + U.s_{t-1} + b)$$. g is an activation function.

Using the chain rule we can find that:

$$\frac{\partial loss(y_t,?_t)}{\partial W} = \frac{\partial loss(y_t,?_t)}{\partial ?_t}.\frac{\partial ?_t}{\partial s_t}.\frac{\partial s_t}{\partial W} \\ = \frac{\partial loss(y_t,?_t)}{\partial ?_t}.\frac{\partial ?_t}{\partial s_t}.(\sum_{k=t}^0 \frac{\partial s_t}{\partial s_k}.\frac{\partial s_k}{\partial W})$$

The term in red in the equation is a product of Jacobians ($$\frac{\partial s_t}{\partial s_k} = \prod_{i=k+1}^{t} \frac{\partial s_{i}}{\partial s_{i-1}}$$).

$$\frac{\partial loss(y_t,?_t)}{\partial ?_t}.\frac{\partial ?_t}{\partial s_t}.(\sum_{k=t}^0 \color{red} {\frac{\partial s_t}{\partial s_k}}.\frac{\partial s_k}{\partial W})$$

Because the derivatives of activation functions (except ReLU) are less than 1, therefore when evaluating the gradient, the red term tends to converge to 0, and the model becomes more biased and captures less dependencies.

RNN is trained by backpropagation through time. When gradient is passed back through many time steps, it tends to vanish or to explode.

Gradient Clipping is a technique to prevent exploding gradients in very deep networks, typically Recurrent Neural Networks. There are various ways to perform gradient clipping, but the most common one is to normalize the gradients of a parameter vector when its L2 norm exceeds a certain threshold according to : $$gradients := gradients * \frac{threshold}{L2 norm(gradients)}$$.

# Long Short Term Memory (LSTM)

LSTM was introduced to solve the vanishing gradient problem.

In LSTM, there are 4 gates that regulate the state of a RNN model:

Write gate (i?[0,1]) (Input gate): Write to memory cell.

Keep gate (f?[0,1]) (Forget gate): Erase memory cell.

Read gate (o?[0,1]) (Output gate): Read from memory cell.

Gate gate (g?[-1,1]) (Update gate) How much to write to memory cell.

$$c_t$$ is called memory state at time t.

i,f,o,g,h,c are vectors having the same size. W is a matrix with size nxh.

The gate states are calculated using the following formula:

The output $$h_t$$ is calculated using the following formulas:

$$c_t = f.c_{t-1} + i.g \\ h_t = o.tanh(c_t)$$

# Gated Recurrent Unit (GRU)

The Gated Recurrent Unit is a simplified version of an LSTM unit with fewer parameters. Just like an LSTM cell, it uses a gating mechanism to allow RNNs to efficiently learn long-range dependency by preventing the vanishing gradient problem. The GRU consists of a reset and update gates that determine which part of the old memory to keep or update with new values at the current time step.

# Introduction to Quantum Computing

“I think I can safely say that nobody understands quantum mechanics.” ~ Richard Feynman.

# Hilbert space

Hilbert space is generalization of the Euclidean space. In a Hilbert space we can have infinite number of dimensions.

A vector in an infinite-dimensional Hilbert space is represented as an infinite vector: $$(x_1, x_2, ….)$$.

The inner product of two vectors $$?x|y? = \sum_{i=1}^? x_i y_i$$

The norm $$||x|| = ?x|x?^{\frac{1}{2}} = \sqrt{\sum_{i=1}^? x_i^2} < ?$$

Tensor product of two vectors $$\begin{bmatrix} a_1 \\ a_2 \end{bmatrix} ? \begin{bmatrix} b_1 \\ b_2 \end{bmatrix} = \begin{bmatrix} a_1 * b_1 \\ a_1 * b_2 \\ a_2 * b_1 \\ a_2 * b_2 \end{bmatrix}$$

# Qubit

In quantum computing, a qubit or quantum bit is a unit of quantum information. It’s the analog of a bit for quantum computation. A qubit is a two-state quantum-mechanical system. The state 0 could be represented as vector $$\begin{bmatrix} 1 \\ 0 \end{bmatrix} ? |0? ? |??$$ (horizontal polarization) and state 1 could be represented as orthogonal vector $$\begin{bmatrix} 0 \\ 1 \end{bmatrix} ? |1? ? |??$$ (vertical polarization) .

A generic qubit state corresponds to a vector $$\begin{bmatrix} ? \\ ? \end{bmatrix} = ? \begin{bmatrix} 1 \\ 0 \end{bmatrix} + ? \begin{bmatrix} 0 \\ 1 \end{bmatrix} ? ? \ |0? + ? \ |1?$$ in a two-dimensional Hilbert space such as $$|?|^2 + |?|^2 = 1$$ (normalized vector). $$|?|^2, |?|^2$$ are the probabilities for the particle to be in state 0 and 1. ? and ? are complex numbers and they are called amplitudes.

Another way to write the state ? of a qubit is by using cos and sin functions.

|?? = cos(?) |0? + sin(?) |1?

More details about qubit can be found here: https://arxiv.org/pdf/1312.1463.pdf

# Two qubits

A state in a combined two qubits system corresponds to a vector in four-dimensional Hilbert space $$|\gamma? = e \ |00? + f \ |01? + g \ |10?+ h \ |11? = \begin{bmatrix} e \\ f \\ g \\ h \end{bmatrix}$$.

In general, the state vector of two combined systems is the tensor product of the state vector of each system.

Some expressions and their equivalents:

|0??|0??|0? ? |000?

|0??|1??|0? ? |010?

$$(\sqrt{\frac{1}{2}}|0? + \sqrt{\frac{1}{2}}|1?)?(\sqrt{\frac{1}{2}}|0? + \sqrt{\frac{1}{2}}|1?) ? \frac{1}{2}|00? + \frac{1}{2}|01? + \frac{1}{2}|10? + \frac{1}{2}|11?$$

# Bloch sphere

A state ? of a qubit can be also written as: $$|?? = cos(\frac{?}{2})|0? + sin(\frac{?}{2}) exp(i?) |1?$$, and it can be visualized as a vector of length 1 on the Bloch sphere.

The angles ?,? correspond to the polar and azimuthal angles of spherical coordinates (??[0,?], ??[0,2?]).

# Bra-ket notation

|01? ? |0??|1? ? $$\begin{bmatrix} 0 \\ 1 \\ 0 \\ 0 \end{bmatrix}$$ (column vector)

?01| ? ?0|??1| ? $$\begin{bmatrix} 0 \ 1 \ 0 \ 0 \end{bmatrix}$$ (row vector)

# Quantum superposition

An arbitrary state of a qubit could be described as a superposition of 0 and 1 states.

# N qubits

A 3 qubits quantum system can store numbers such as 3 or 7.

|011? ? |3?

|111? ? |7?

But it can also store the two numbers simultaneously using a superposition state on the last qubit.

$$(\sqrt{\frac{1}{2}}|0? + \sqrt{\frac{1}{2}}|1?)?|1??|1? ? \sqrt{\frac{1}{2}} (|011? + |111?)$$

In general, a quantum computer with n qubits can be in an arbitrary superposition of up to $$2^n$$ different states simultaneously.

# Quantum gates

The Hadamard gate acts on a single qubit. It maps the state |0? to a superposition state with equal probabilities on state 0 and 1 ($$\sqrt{\frac{1}{2}}|0? + \sqrt{\frac{1}{2}}|1?$$).

The Hadamard matrix is defined as: $$H = \frac {1}{\sqrt {2}} \begin{bmatrix}1 \ 1 \\ 1 \ -1 \end{bmatrix}$$

$$H * \begin{bmatrix} 1 \\ 0 \end{bmatrix} = \begin{bmatrix} \sqrt{\frac{1}{2}} \\ \sqrt{\frac{1}{2}} \end{bmatrix}$$

Pauli X gate (NOT gate)

The Pauli-X gate acts on a single qubit. It is the quantum equivalent of the NOT gate for classical computers.

The Pauli X matrix is defined as: $$X = \begin{bmatrix}0 \ 1 \\ 1 \ 0 \end{bmatrix}$$

$$X * \begin{bmatrix} 1 \\ 0 \end{bmatrix} = \begin{bmatrix} 0 \\ 1 \end{bmatrix}$$

Pauli Y gate

The Pauli-Y gate acts on a single qubit. It equates to a rotation around the Y-axis of the Bloch sphere by ? radians. It maps |0? to i |1? and |1? to ?i |0?

The Pauli Y matrix is defined as: $$Y = \begin{bmatrix}0 \ -i \\ i \ 0 \end{bmatrix}$$

$$Y * \begin{bmatrix} 1 \\ 0 \end{bmatrix} = \begin{bmatrix} 0 \\ i \end{bmatrix}$$

Pauli Z gate

The Pauli-Z gate acts on a single qubit. It equates to a rotation around the Z-axis of the Bloch sphere by ? radians. It leaves the basis state |0? and maps |1? to ?|1?

The Pauli Z matrix is defined as: $$Z = \begin{bmatrix}1 \ 0 \\ 0 \ -1 \end{bmatrix}$$

$$Z * \begin{bmatrix} 1 \\ 0 \end{bmatrix} = \begin{bmatrix} 1 \\ 0 \end{bmatrix}$$

Other gates are described in this link: https://en.wikipedia.org/wiki/Quantum_gate

# State measurement

State measurement can be done using polarizing filters. Filters transmit or block particles based on angle of polarization.

# IBM Q

IBM Q is a cloud quantum computing platform.

Below a simple experiment where we have two qubits. No transformation is applied on first qubit. A Pauli X transformation is applied on second qubit. As result there is one possible state |10? with probability equals 1.

# Data representation

Given a vector $$x = (x_1, x_2, …. , x_n)$$. The quantum system topology needed to store the vector x is $$log_2(n)$$ qubits.

For example, the quantum system representation of the following vector requires only 3 qubits.

(0, 0.2, 0.2, 0, 0.1, 0.2, 0.1, 0.3) ? 0 |000? + 0.2 |001? + 0.2 |010? + 0 |011? + 0.1 |100? + 0.2 |101? + 0.1 |110? + 0.3 |111?

# Neural Network

A neural network is a non-linear classifier (separator is not a linear function). It can also be used for regression.

A Shallow neural network is a one hidden layer neural network.

A Vanilla neural network is a regular neural network having layers that do not form cycles.

TensorFlow Playground is an interactive web interface for learning neural networks: http://playground.tensorflow.org.

# Computational Graph

Above the computational graph for the function $$f(x) = (x-1)^2$$.

Forward propagation

To minimize the function f, we assign a random value to x (e.g. x = 2), then we evaluate y, z, and f (forward propagation).

Backward propagation

Then we compute the partial derivative of f with respect to x step by step (Backward propagation).

$$\frac{\partial f}{\partial x} = \frac{\partial f}{\partial y}*\frac{\partial y}{\partial x} + \frac{\partial f}{\partial z}*\frac{\partial z}{\partial x} = 2 \\ \frac{\partial f}{\partial y} = z = 1 \\ \frac{\partial f}{\partial z} = y = 1 \\ \frac{\partial y}{\partial x} = \frac{\partial z}{\partial x} = 1$$

Then we update $$x:= x – ?.\frac{\partial f}{\partial x}$$.

We repeat the operation until convergence.

# Activation functions

Activation functions introduce nonlinearity into models. The most used activation functions are:

Sigmoid

$$f(x) = \frac{1}{1+exp(-x)}$$

Sigmoid has a positive and non-zero centred output (sigmoid(0) ? 0.5).

When all activation units are positive, then weight update will be in the same direction (all positive or all negative updates) and that will cause a zigzag path during optimization.

$$z=?w_i.a_i+b \\ \frac{dL}{dw_i}=\frac{dL}{dz}.\frac{dz}{dw_i}=\frac{dL}{dz}.ai$$

If all ai>0, then the gradient will have the same sign as $$\frac{dL}{dz}$$ (all positive or all negative).

TanH

$$f(x) = \frac{2}{1+exp(-2x)} -1$$

When x is large, the derivative of the sigmoid or Tanh function is around zero (vanishing gradient/saturation).

ReLU (Rectified Linear Unit)

f(x) = max(0, x)

Leaky ReLU

f(x) = max(0.01x, x)

Leaky Relu was introduced to fix the “Dying Relu” problem.

$$z=?w_i.a_i+b \\ f=Relu(z) \\ \frac{dL}{dw_i}=\frac{dL}{df}.\frac{df}{dz}.\frac{dz}{dw_i}$$

When z becomes negative, then the derivative of f becomes equal to zero, and the weights stop being updated.

PRelu (Parametric Rectifier)

f(x) = max(?.x, x)

ELU (Exponential Linear Unit)

f(x) = {x if x>0 otherwise ?.(exp(x)-1)}

Other activation functions: Maxout

# Cost function

$$J(?) = \frac{1}{m} \sum_{i=1}^{m} loss(y^{(i)}, f(x^{(i)}; ?))$$

We need to find ? that minimizes the cost function: $$\underset{?}{argmin}\ J(?)$$

# Neural Network Regression

Neural Network regression has no activation function at the output layer.

L1 Loss function

$$loss(y,?) = |y – ?|$$

L2 Loss function

$$loss(y,?) = (y – ?)^2$$

Hinge loss function

Hinge loss function is recommended when there are some outliers in the data.

$$loss(y,?) = max(0, |y-?| – m)$$

# Two-Class Neural Network

Binary Cross Entropy Loss function

$$loss(y,?) = – y.log(?) – (1-y).log(1 – ?)$$

# Multi-Class Neural Network – One-Task

Using Softmax, the output ? is modeled as a probability distribution, therefore we can assign only one label to each example.

Cross Entropy Loss function

$$loss(Y,\widehat{Y}) = -\sum_{j=1}^c Y_{j}.log(\widehat{Y}_{j})$$

Hinge Loss (SVM) function

$$y = \begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix},\ ? = \begin{bmatrix} 2 \\ -5 \\ 3 \end{bmatrix} \\ loss(y,?) = \sum_{c?1} max(0, ?_c – ?_1 + m)$$

For m = 1, the sum will be equal to 2.

# Multi-Class Neural Network – Multi-Task

In this version, we assign multiple labels to each example.

Loss function

$$loss(Y,\widehat{Y}) = \sum_{j=1}^c – Y_j.log(\widehat{Y}_j) – (1-Y_j).log(1 – \widehat{Y}_j)$$

# Regularization

Regularization is a very important technique to prevent overfitting.

Dropout

For each training example, ignore randomly p activation nodes of each hidden layer. p is called dropout rate (p?[0,1]). When testing, scale activations by the dropout rate p.

Inverted Dropout

With inverted dropout, scaling is applied at the training time, but inversely. First, dropout all activations by dropout factor p, and second, scale them by inverse dropout factor 1/p. Nothing needs to be applied at test time.

Data Augmentation

As a regularization technique, we can apply random transformations on input images when training a model.

Early stopping

Stop when error rates decreases on training data while it increases on dev (cross-validation) data.

L1 regularization

$$J(?) = \frac{1}{m} \sum_{i=1}^{m} loss(y^{(i)}, f(x^{(i)}; ?)) \color{blue} { + ? .\sum_{j} |?_j|}$$

? is called regularization parameter

L2 regularization

$$J(?) = \frac{1}{m} \sum_{i=1}^{m} loss(y^{(i)}, f(x^{(i)}; ?)) \color{blue} { + ? .\sum_{j} ?_j^2}$$

Lp regularization

$$J(?) = \frac{1}{m} \sum_{i=1}^{m} loss(y^{(i)}, f(x^{(i)}; ?)) \color{blue} { + ? .\sum_{j} ?_j^p}$$

For example, if the cost function $$J(?)=(?_1 – 1)^2 + (?_2 – 1)^2$$, then the $$L_2$$ regularized cost function is $$J(?)=(?_1 – 1)^2 + (?_2 – 1)^2 + ? (?_1^2 + ?_2^2)$$

If ? is large, then the point that minimizes the regularized J(?) will be around (0,0) –> Underfitting.

If ? ~ 0, then the point that minimizes the regularized J(?) will be around (1,1) –> Overfitting.

Elastic net

Combination of L1 and L2 regularizations.

# Normalization

Gradient descent converges quickly when data is normalized Xi ? [-1,1]. If features have different scales, then the update of parameters will not be in the same scale (zig-zag).

For example, if the activation function g is the sigmoid function, then when W.x+b is large g(W.x+b) is around 1, but the derivative of the sigmoid function is around zero. For this reason the gradient converges slowly when the W.x+b is large.

Below some normalization functions.

ZScore

$$X:= \frac{X – ?}{?}$$

MinMax

$$X:= \frac{X – min}{max-min}$$

Logistic

$$X:= \frac{1}{1+exp(-X)}$$

LogNormal

$$X:= \frac{1}{?\sqrt{2?}} \int_{0}^{X} \frac{exp(\frac{-(ln(t) – ?)^2}{2?^2})}{t} dt$$

Tanh

$$X:= tanh(X)$$

# Weight Initialization

Weight initialization is important because if weights are too big then activations explode. If weights are too small then gradients will be around zero (no learning).

When we normalize input data, we make the mean of the input features equals to zero, and the variance equals to one. To keep the activation units normalized too, we can initialize the weights $$W^{(1)}$$ so $$Var(g(W_{j}^{(1)}.x+b_{j}^{(1)}))$$ is equals to one.

If we suppose that g is Relu and $$W_{i,j}, b_j, x_i$$ are independent, then:

$$Var(g(W_{j}^{(1)}.x+b_{j}^{(1)})) = Var(\sum_{i} W_{i,j}^{(1)}.x_i+b_{j}^{(1)}) =\sum_{i} Var(W_{i,j}^{(1)}.x_i) + 0 \\ = \sum_{i} E(x_i)^2.Var(W_{i,j}^{(1)}) + E(W_{i,j}^{(1)})^2.Var(x_i) + Var(W_{i,j}^{(1)}).Var(x_i) \\ = \sum_{i} E(x_i)^2.Var(W_{i,j}^{(1)}) + E(W_{i,j}^{(1)})^2.Var(x_i) + Var(W_{i,j}^{(1)}).Var(x_i) \\ = \sum_{i} 0 + 0 + Var(W_{i,j}^{(1)}).Var(x_i) = n.Var(W_{i,j}^{(1)}).Var(x_i)$$

Xavier initialization

If we define $$W_{i,j}^{(1)} ? N(0,\frac{1}{\sqrt{n}})$$, then the initial variance of activation units will be one (n is number of input units).

We can apply this rule on all weights of the neural network.

# Batch Normalization

Batch normalization is a technique to provide any layer in a Neural Network with normalized inputs. Batch Normalization has a regularizing effect.

After training, ? will converge to the standard deviation of the mini-batches and ? will converge to the mean. The ?, ? parameters give more flexibility when shifting or scaling is needed.

# Hyperparameters

Neural network hyperparameters are:

• Learning rate (?) (e.g. 0.1, 0.01, 0.001,…)
• Number of hidden units
• Number of layers
• Mini-bach size
• Momentum rate (e.g. 0.9)
• Adam optimization parameters (e.g. ?1=0.9, ?2=0.999, ?=0.00000001)
• Learning rate decay

# Local Minimum

The probability that gradient descent gets stuck in a local minimum in a high dimensional space is extremely low. We could have a saddle point, but it’s rare to have a local minimum.

# Transfer Learning

Transfer Learning consists in the use of parameters of a trained model when training new hidden layers of an extended version of that model.

# Characteristics of Machine Learning Algorithms

Discriminative / Generative algorithms

Many machine learning algorithms are based on probabilistic models. Discriminative algorithms learn the conditional probability distribution of y given x p(y|x). Generative algorithms learn the joint probability distribution p(y,x) = p(y|x) * p(x), and therefore take in consideration the distribution of x.

Parametric & Non-Parametric algorithms

Neural Network, SVM,… are non-parametric algorithms because they have no fixed parameter vector; depending on the richness of the data we can choose the size of the parameter vector that we need (ex. we can add hidden units, change SVM kernels,…)

Capacity

The capacity is controlled by parameters called Hyperparameters (ex. degree p of a polynomial predictor, kernel choice in SVM, number of layers in a neural network).

# Summary of Machine Learning Algorithms

The table below describes briefly each machine learning algorithm.

 Algorithm Description Characteristics Linear regression To use when Y is normally-distributed Discriminative Parametric Logistic regression To use when Y is Bernoulli-distributed Discriminative Parametric Multinomial logistic regression (softmax regression) To use when Y is multinomially-distributed There are two versions of the algorithm, one based on maximum likelihood maximization, and one based on cross-entropy minimization. Discriminative Parametric Gaussian Discriminant Analysis Supervised classification algorithm To use when Y is Bernoulli distributed and the conditional distribution of X given Y is multivariate Gaussian Generative Parametric Naive Bayes Algorithm Supervised classification algorithm To use when Y is Bernoulli, and the conditional distribution of X given Y is Bernoulli and X features are conditionally independent Generative Parametric EM Unsupervised soft-clustering algorithm Generative Parametric Principal Component Analysis Reduce the dimensionality of X. Calculate eigenvectors for $$XX^T$$. Use eigenvectors with higher eigenvalues to transform data $$x^{(i)} := (u_1^T x^{(i)}, u_2^T x^{(i)},…, , u_k^T x^{(i)})$$ Factor Analysis Reduce the dimensionality of X To use when X is Gaussian. Transform X to Z matrix with lower dimensionality (x ? ?+?z). Neural Network Non-linear classifier Discriminative Non-Parametric K-Nearest Neighbor Regression Predict y as the average of values y1,y2,…,yk of k nearest neighbors of x Discriminative Non-Parametric K-Nearest Neighbor Classifier Predict y as the most common class among the k nearest neighbors of x Discriminative Non-Parametric Support Vector Machine Find a separator that maximizes the margin between two classes Discriminative Non-Parametric K-means Unsupervised hard-clustering algorithm Discriminative Non-Parametric

# What’s next!

Without any doubt Machine Learning / Artificial Intelligence will drastically change the world during the next decade. Deep Learning algorithms are reaching unprecedentedly high level of accuracy, but they are slow to train. For this reason Big companies are working on a design of new chips optimized to run Deep Learning algorithms.

# Machine Learning Research Scientists

Below the list of some known Machine Learning research scientists:

 Geoffrey Hinton Former University of Toronto professor. The godfather of neural network research. Andrew Ng Adjunct professor at Stanford University. Founder of Google Brain project. Yann LeCun Director of AI Research at Facebook. Founding father of convolutional neural network. Yoshua Bengio Full professor at Université de Montréal. One of the founding fathers of deep learning.

# Machine Learning Platforms

Microsoft Azure ML: https://studio.azureml.net/

Amazon ML: http://aws.amazon.com/machine-learning/

# High-Performance Hardware for Machine Learning

GPGPU Architecture

General-purpose computing on graphics processing units is the use of a graphics processing unit (GPU) to perform computations in applications traditionally handled by the central processing unit (CPU). There are GPU programming frameworks that allow the use of GPUs for general purpose tasks. The widely used frameworks are Cuda and OpenCL. Cuda is supported only on NVIDIA GPUs. OpenCL is open standard, it’s supported on most recent Intel and AMD GPUs.

FPGA Architecture

FPFAs are hardware implementations of algorithms, and hardware is always faster than software. A use of a multiple-FPGA parallel computing architecture can dramatically improve computing performance.

Tensor Processing Unit

TPUs are application-specific integrated circuits developed specifically for machine learning. They are designed explicitly for a higher volume of reduced precision computation (as little as 8-bit precision).

https://en.wikipedia.org/wiki/Tensor_processing_unit.