Linear Algebra
A matrix is a 2D array of numbers.
A tensor is an n-dimensional array of numbers.
Matrix multiplication is associative but, in general, not commutative.
Not all square matrices have an inverse. A matrix that is not invertible is called singular or degenerate.
A matrix is “singular” if any of the following are true:
Any row or column contains all zeros.
Any two rows or columns are identical.
Any row or column is a linear combination of other rows or columns.
A square matrix A is singular if and only if determinant(A) = 0.
Inner product of two vectors \(\overrightarrow{u}\) and \(\overrightarrow{v}\):
\(\overrightarrow{u} \cdot \overrightarrow{v} = p \, \|\overrightarrow{u}\| = p \sqrt{\sum_{i=1}^{n}u_{i}^{2}} = u^{T} v = \sum_{i=1}^{n}u_{i} v_{i}\)
p is the signed length of the projection of \(\overrightarrow{v}\) onto \(\overrightarrow{u}\), and \(\|\overrightarrow{u}\|\) is the Euclidean norm of \(\overrightarrow{u}\).
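A minimal sketch of the inner product identity, checking that the sum of element-wise products equals the projection length times the norm. The helper names `dot` and `norm` and the sample vectors are illustrative assumptions.

```python
import math

def dot(u, v):
    """Inner product: sum of element-wise products."""
    return sum(ui * vi for ui, vi in zip(u, v))

def norm(u):
    """Euclidean norm of u."""
    return math.sqrt(dot(u, u))

u = [3.0, 4.0]
v = [2.0, 1.0]

# Signed length of the projection of v onto u.
p = dot(u, v) / norm(u)

# Both sides of the identity u . v = p * ||u||.
lhs = dot(u, v)
rhs = p * norm(u)
```

Here `lhs` and `rhs` agree up to floating-point rounding.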
\(A = \begin{bmatrix} a & b & c\\ d & e & f\\ g & h & i \end{bmatrix}\)
Determinant(A) = |A| = a(ei - hf) - b(di - gf) + c(dh - ge)
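A minimal sketch of the 3x3 cofactor expansion, applied to matrices that meet the singularity conditions listed earlier (a zero row, two identical rows, a row that is a linear combination of others). The function name `det3` and the sample matrices are illustrative assumptions.

```python
def det3(m):
    """Determinant of a 3x3 matrix by cofactor expansion along the first row."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - h * f) - b * (d * i - g * f) + c * (d * h - g * e)

identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
zero_row = [[1, 2, 3], [0, 0, 0], [4, 5, 6]]
same_rows = [[1, 2, 3], [1, 2, 3], [4, 5, 6]]
# Third row = first row + second row (a linear combination).
dependent = [[1, 2, 3], [4, 5, 6], [5, 7, 9]]

singular = [det3(m) == 0 for m in (zero_row, same_rows, dependent)]
```

All three matrices have a zero determinant, while the identity matrix has determinant 1.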
If \(f: R^{n*m} \mapsto R\), then:
\(\frac{\partial}{\partial A} f(A) = \begin{bmatrix} \frac{\partial}{\partial a_{11}}f(A) & … & \frac{\partial}{\partial a_{1m}} f(A)\\ … & … & …\\ \frac{\partial}{\partial a_{n1}}f(A) & … & \frac{\partial}{\partial a_{nm}}f(A)\\ \end{bmatrix}\)
Example:
\( f(A) = a_{11} + … + a_{nm} \)
\(\frac{\partial}{\partial A} f(A) = \begin{bmatrix} 1 & … & 1\\ … & … & …\\ 1 & … & 1 \\ \end{bmatrix}\)
If A is a square matrix:
trace(A) = \(\sum_{i=1}^n A_{ii}\)
trace(AB) = trace(BA)
trace(ABC) = trace(CAB) = trace(BCA)
trace(B) = trace(\(B^T\))
trace(a) = a if a is a scalar
\(\frac{\partial}{\partial A} trace(AB) = B^T\)
\(\frac{\partial}{\partial A} trace(ABA^TC) = CAB + C^TAB^T\)
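A minimal sketch that checks a few of the trace identities on small integer matrices. The helpers `matmul`, `trace`, `transpose` and `grad_trace_AB` are illustrative assumptions. Because trace(AB) is linear in A, bumping one entry of A by 1 gives the exact partial derivative.

```python
def matmul(x, y):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]

def trace(x):
    """Sum of the diagonal entries of a square matrix."""
    return sum(x[i][i] for i in range(len(x)))

def transpose(x):
    """Swap rows and columns."""
    return [list(col) for col in zip(*x)]

def grad_trace_AB(A, B):
    """Entry (i, j) is trace((A + E_ij) B) - trace(AB), exact since trace(AB) is linear in A."""
    base = trace(matmul(A, B))
    grad = []
    for i in range(len(A)):
        row = []
        for j in range(len(A[0])):
            bumped = [r[:] for r in A]
            bumped[i][j] += 1
            row.append(trace(matmul(bumped, B)) - base)
        grad.append(row)
    return grad

A = [[1, 2], [3, 4]]
B = [[0, 5], [6, 7]]
C = [[2, 1], [1, 3]]
```

With these values, `trace(matmul(A, B))` and `trace(matmul(B, A))` are both 55, and `grad_trace_AB(A, B)` equals the transpose of B.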
Eigenvector
Given a square matrix A, if a nonzero vector μ satisfies the equation A*μ = λ*μ, then μ is called an eigenvector of A and λ is the corresponding eigenvalue. The principal eigenvector of A is the eigenvector with the largest eigenvalue.
Example:
The normalized eigenvectors for \(\begin{bmatrix}0 & 1 \\1 & 0 \end{bmatrix}\) are \(\begin{bmatrix}\frac{1}{\sqrt{2}} \\ \frac{1}{\sqrt{2}} \end{bmatrix}\) and \(\begin{bmatrix}\frac{1}{\sqrt{2}} \\ -\frac{1}{\sqrt{2}} \end{bmatrix}\), and the corresponding eigenvalues are 1 and -1.
Eigendecomposition
Given a symmetric square matrix A∈\(R^{n*n}\), ∃ Q∈\(R^{n*n}\), Λ∈\(R^{n*n}\) with Λ diagonal, such that \(A=QΛQ^T\).
Q’s columns are the eigenvectors of \(A\)
Λ is the diagonal matrix whose diagonal elements are the eigenvalues
Example:
The eigendecomposition of \(\begin{bmatrix}0 & 1 \\1 & 0 \end{bmatrix}\) is Q=\(\begin{bmatrix}\frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}} \\ \frac{1}{\sqrt{2}} & -\frac{1}{\sqrt{2}}\end{bmatrix}\), Λ=\(\begin{bmatrix}1 & 0 \\ 0 & -1\end{bmatrix}\)
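A minimal sketch that checks an eigendecomposition of the matrix with rows (0, 1) and (1, 0). It assumes eigenvalues 1 and -1 with normalized eigenvectors (1/√2, 1/√2) and (1/√2, -1/√2); the helper names are illustrative.

```python
import math

def matmul(x, y):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]

def transpose(x):
    """Swap rows and columns."""
    return [list(col) for col in zip(*x)]

def is_eigenpair(A, vec, lam, tol=1e-12):
    """True when A * vec equals lam * vec up to the tolerance."""
    Av = [sum(a * x for a, x in zip(row, vec)) for row in A]
    return all(abs(av - lam * x) <= tol for av, x in zip(Av, vec))

s = 1 / math.sqrt(2)
A = [[0, 1], [1, 0]]
Q = [[s, s], [s, -s]]      # columns are the normalized eigenvectors
Lam = [[1, 0], [0, -1]]    # matching eigenvalues on the diagonal

# Q * Lam * Q^T should reproduce A up to rounding.
reconstructed = matmul(matmul(Q, Lam), transpose(Q))
```

`reconstructed` matches A up to floating-point rounding, and each column of Q passes `is_eigenpair` with its eigenvalue.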
Singular Value Decomposition
Given a matrix A∈\(R^{n*m}\), ∃ U∈\(R^{n*m}\), D∈\(R^{m*m}\) with D diagonal, V∈\(R^{m*m}\) such that \(A=UDV^T\).
U’s columns are the eigenvectors of \(AA^T\)
V’s columns are the eigenvectors of \(A^TA\)
Example:
An SVD of \(\begin{bmatrix} 0 & 1 \\1 & 0 \end{bmatrix}\) is U=\(\begin{bmatrix}0 & 1 \\ 1 & 0\end{bmatrix}\), D=\(\begin{bmatrix}1 & 0 \\ 0 & 1\end{bmatrix}\), V=\(\begin{bmatrix}1 & 0 \\ 0 & 1\end{bmatrix}\).
More details about SVD can be found here: http://www.youtube.com/watch?v=9YtmGywfE4
The Moore-Penrose Pseudoinverse
The Moore-Penrose pseudoinverse is a matrix that can act as a partial replacement for the matrix inverse in cases where the inverse does not exist (e.g. non-square matrices).
The pseudoinverse matrix is defined as: \(pinv(A) = \lim_{α \rightarrow 0^+} (A^TA + αI)^{-1} A^T\)
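A minimal sketch of the limit definition, restricted (as an assumption, to avoid a general matrix inverse) to a matrix A with a single column. In that case \(A^TA\) is a scalar and the inverse reduces to a division. The function name `pinv_column` is hypothetical.

```python
def pinv_column(a, alpha):
    """Regularized pseudoinverse (A^T A + alpha I)^-1 A^T for an n x 1 matrix A.

    A is passed as a flat list; A^T A is then the scalar sum of squares.
    """
    gram = sum(x * x for x in a)
    return [x / (gram + alpha) for x in a]

a = [3.0, 4.0]

# A small positive alpha approximates the limit alpha -> 0.
approx = pinv_column(a, 1e-9)

# pinv(A) * A should be close to the 1 x 1 identity.
left_identity = sum(p * x for p, x in zip(approx, a))
```

As alpha shrinks, `approx` approaches a / ‖a‖², here (0.12, 0.16).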
Analysis
0! = 1
exp(1) ≈ 2.718
exp(0) = 1
ln(1) = 0
\(ln(x) = log_e(x) \\ log_b (b^a) = a\)
exp(a + b) = exp(a) * exp(b)
ln(a * b) = ln(a) + ln(b)
\(cos(x)^2 + sin(x)^2 = 1\)
Euler’s formula
exp(iθ) = cos(θ) + i sin(θ)
Complex numbers
Rectangular form
z = a + ib (real part + imaginary part and i an imaginary unit satisfying \(i^2 = −1\)).
Polar form
z = r (cos(θ) + i sin(θ))
Exponential form
z = r.exp(iθ)
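A minimal sketch using Python's standard `cmath` module to move between the three forms and to check Euler's formula at θ = π. The sample value z = 1 + i is an arbitrary choice.

```python
import cmath
import math

z = complex(1, 1)                # rectangular form a + ib
r, theta = cmath.polar(z)        # modulus r and argument theta

# Polar and exponential forms rebuilt from r and theta.
polar_form = r * (math.cos(theta) + 1j * math.sin(theta))
exponential_form = r * cmath.exp(1j * theta)

# Euler's formula at theta = pi gives exp(i*pi) + 1 = 0 (up to rounding).
euler_identity = cmath.exp(1j * math.pi) + 1
```

For this z, r is √2 and θ is π/4, and both rebuilt forms agree with z up to rounding.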
Multivariate equations
The solution set of a system of linear equations with 3 variables is the intersection of the planes defined by each linear equation (hyperplanes when there are more than 3 variables).
Derivatives
\(\frac{\partial f(x)}{\partial x} = \lim_{h \rightarrow 0} \frac{f(x+h) – f(x)}{h}\)
Function → Derivative
x^n → n * x^(n-1)
exp(x) → exp(x)
f o g (x) → g'(x) * f' o g(x)
ln(x) → 1/x
sin(x) → cos(x)
cos(x) → -sin(x)
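A minimal sketch that compares the standard derivative rules against a numerical approximation. It uses a central difference rather than the one-sided limit definition, because it is more accurate for a fixed step h. The names `numeric_derivative`, `table` and the test point `x0` are illustrative assumptions.

```python
import math

def numeric_derivative(f, x, h=1e-6):
    """Central difference approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)

# Pairs of (function, claimed derivative).
table = [
    (lambda x: x ** 3, lambda x: 3 * x ** 2),      # x^n with n = 3
    (math.exp, math.exp),
    (math.log, lambda x: 1 / x),
    (math.sin, math.cos),
    (math.cos, lambda x: -math.sin(x)),
    # Chain rule with f = sin and g(x) = x^2: (f o g)'(x) = g'(x) * f'(g(x)).
    (lambda x: math.sin(x ** 2), lambda x: 2 * x * math.cos(x ** 2)),
]

x0 = 1.3
max_error = max(abs(numeric_derivative(f, x0) - df(x0)) for f, df in table)
```

A small `max_error` indicates that each claimed derivative agrees with the numerical estimate at `x0`.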

Integration by parts
\(\int_{a}^{b} (f(x) g(x))' dx = \int_{a}^{b} f'(x) g(x) dx+ \int_{a}^{b} f(x) g'(x) dx\)
Binomial theorem
\((x + y)^n = \sum_{k=0}^{n} C_n^k x^k y^{n-k}\)
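A minimal sketch that evaluates the right-hand side of the binomial theorem with `math.comb` (Python 3.8+) and compares it with a direct power. The function name `binomial_expansion` is an illustrative assumption.

```python
from math import comb

def binomial_expansion(x, y, n):
    """Sum over k of C(n, k) * x^k * y^(n-k)."""
    return sum(comb(n, k) * x ** k * y ** (n - k) for k in range(n + 1))

expanded = binomial_expansion(2, 3, 5)
direct = (2 + 3) ** 5
```

Both `expanded` and `direct` evaluate to 3125.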
Chain rule
Z = f(x(u,v), y(u,v))
\(\frac{\partial Z}{\partial u} = \frac{\partial Z}{\partial x} * \frac{\partial x}{\partial u} + \frac{\partial Z}{\partial y} * \frac{\partial y}{\partial u}\)
Entropy
Entropy measures the uncertainty associated with a random variable.
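\(H(X) = -\sum_{i=1}^n p(x^{(i)}) log(p(x^{(i)}))\), where the sum runs over the n distinct values \(x^{(i)}\) that X can take.
Example (with log in base 2):
Entropy({1,1,1,1}) = 0
Entropy({0,1,0,1}) = -(½ * log(½) + ½ * log(½)) = 1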
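A minimal sketch of Shannon entropy computed from a list of samples, assuming base-2 logarithms (so the result is in bits) and the empirical frequencies as probabilities. The function name `entropy` is illustrative.

```python
import math
from collections import Counter

def entropy(samples):
    """Shannon entropy (in bits) of the empirical distribution of samples."""
    counts = Counter(samples)
    total = len(samples)
    h = 0.0
    for count in counts.values():
        p = count / total
        h -= p * math.log2(p)
    return h

constant = entropy([1, 1, 1, 1])   # no uncertainty
balanced = entropy([0, 1, 0, 1])   # two equally likely values
```

A constant sample has entropy 0, and two equally likely values give 1 bit.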
Hessian
\(H = \begin{bmatrix}\frac{\partial^2 f(θ)}{\partial θ_1\partial θ_1} & \frac{\partial^2 f(θ)}{\partial θ_1 \partial θ_2} \\ \frac{\partial^2 f(θ)}{\partial θ_2\partial θ_1} & \frac{\partial^2 f(θ)}{\partial θ_2\partial θ_2} \end{bmatrix}\)
Example:
\(f(θ) = θ_1^2 + θ_2^2 \\ H(f) = \begin{bmatrix} 2 & 0 \\ 0 & 2 \end{bmatrix}\)
A twice-differentiable function f(θ) is convex if its Hessian matrix is positive semidefinite for every θ (\(x^T.H(θ).x \geq 0\), for every \(x∈R^2\)).
\(x^T.H(θ).x = \begin{bmatrix} x_1 & x_2 \end{bmatrix} . \begin{bmatrix} 2 & 0 \\ 0 & 2 \end{bmatrix} . \begin{bmatrix} x_1\\ x_2 \end{bmatrix} = 2 x_1^2 + 2 x_2^2 \geq 0\)
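A minimal sketch that approximates a 2x2 Hessian with central differences and tests positive semidefiniteness. The helpers `hessian_2d` and `is_psd_2x2`, the step size and the evaluation point are illustrative assumptions. The test relies on the fact that a symmetric 2x2 matrix is positive semidefinite exactly when both diagonal entries and the determinant are non-negative.

```python
def hessian_2d(f, t1, t2, h=1e-4):
    """Central-difference approximation of the 2x2 Hessian of f at (t1, t2)."""
    d11 = (f(t1 + h, t2) - 2 * f(t1, t2) + f(t1 - h, t2)) / h ** 2
    d22 = (f(t1, t2 + h) - 2 * f(t1, t2) + f(t1, t2 - h)) / h ** 2
    d12 = (f(t1 + h, t2 + h) - f(t1 + h, t2 - h)
           - f(t1 - h, t2 + h) + f(t1 - h, t2 - h)) / (4 * h ** 2)
    return [[d11, d12], [d12, d22]]

def is_psd_2x2(H, tol=1e-6):
    """Symmetric 2x2 PSD test: diagonal entries and determinant are >= 0 (within tol)."""
    (a, b), (_, c) = H
    return a >= -tol and c >= -tol and a * c - b * b >= -tol

def bowl(t1, t2):
    """Convex example: theta_1^2 + theta_2^2."""
    return t1 ** 2 + t2 ** 2

def saddle(t1, t2):
    """Non-convex example: theta_1^2 - theta_2^2."""
    return t1 ** 2 - t2 ** 2

H_bowl = hessian_2d(bowl, 0.5, -1.0)
H_saddle = hessian_2d(saddle, 0.5, -1.0)
```

`H_bowl` is close to the matrix with 2 on the diagonal and passes the test, while `H_saddle` fails it.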
Method of Lagrange Multipliers
To maximize/minimize f(x) subject to the constraints \(h_r(x) = 0\) for r in {1,..,l}, define the Lagrangian: \(L(x, α) = f(x) - \sum_{r=1}^l α_r h_r(x)\) and find x, α by solving the following equations:
\(\frac{\partial L}{\partial x} = 0\)
\(\frac{\partial L}{\partial α_r} = 0\) for all r
\(h_r(x) = 0\) for all r
Calculate the Hessian matrix (f”(x)) to know if the solution is a minimum or maximum.
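A minimal sketch on a toy problem chosen for illustration: minimize x² + y² subject to x + y - 1 = 0. The partial derivatives of the Lagrangian are derived by hand, and the code only checks that the hand-computed candidate makes all of them zero. The function name `grad_L` is hypothetical.

```python
from fractions import Fraction

def grad_L(x, y, alpha):
    """Partial derivatives of L = x^2 + y^2 - alpha * (x + y - 1), derived by hand."""
    dL_dx = 2 * x - alpha
    dL_dy = 2 * y - alpha
    dL_dalpha = -(x + y - 1)
    return dL_dx, dL_dy, dL_dalpha

# Solving 2x = alpha, 2y = alpha, x + y = 1 by hand gives this candidate.
x, y, alpha = Fraction(1, 2), Fraction(1, 2), Fraction(1)

stationary = grad_L(x, y, alpha)
objective = x ** 2 + y ** 2
```

All three partial derivatives vanish at the candidate. The Hessian of x² + y² is 2I, which is positive definite, so the candidate is a minimum with objective value 1/2.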
Method of Lagrange Multipliers with inequality constraints
To minimize f(x) subject to the constraints \(g_i(x) \geq 0\) for i in {1,..,k} and \(h_r(x) = 0\) for r in {1,..,l}, define the Lagrangian: \(L(x, α, β) = f(x) - \sum_{i=1}^k α_i g_i(x) - \sum_{r=1}^l β_r h_r(x)\) and find x, α, β that satisfy the following Karush–Kuhn–Tucker (KKT) conditions:
\(\frac{\partial L}{\partial x} = 0\)
\(\frac{\partial L}{\partial β_r} = 0\) for all r, which is the same as \(h_r(x) = 0\) for all r
\(g_i(x) \geq 0\) for all i
\(α_i * g_i(x) = 0\) for all i (complementary slackness)
\(α_i \geq 0\) for all i
Lagrange strong duality – hard to understand 🙁
Lagrange dual function: \(d(α, β) = \underset{x}{min} L(x, α, β)\), where the minimum is taken over all x, without imposing the constraints.
We define \(d^* = \underset{α \geq 0, β}{max}\ d(α, β)\)
We define \(p^* = \underset{x}{min}\ f(x) \) (x satisfies equality and inequality constraints)
Under certain conditions (Slater conditions: f convex,…), \(p^* = d^*\)
Jensen’s inequality
If f is a convex function and X is a random variable, then f(E[X]) <= E[f(X)].
If f is a concave function and X is a random variable, then f(E[X]) >= E[f(X)].
If f is strictly convex (f''(x) > 0), then f(E[X]) = E[f(X)] holds only if X = E[X] with probability 1 (X is a constant).
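A minimal sketch that checks Jensen's inequality for the convex function f(x) = x² and a random variable uniform over a small set of values. The helper `expectation` and the sample values are illustrative assumptions; exact fractions avoid rounding issues.

```python
from fractions import Fraction

def expectation(values, g=lambda x: x):
    """E[g(X)] for X uniform over the given values."""
    return sum(Fraction(g(v)) for v in values) / len(values)

def square(x):
    """A strictly convex function."""
    return x * x

values = [1, 2, 3, 4]
f_of_mean = square(expectation(values))    # f(E[X])
mean_of_f = expectation(values, square)    # E[f(X)]

# With a constant random variable, both sides coincide.
constant = [3, 3, 3]
f_of_mean_const = square(expectation(constant))
mean_of_f_const = expectation(constant, square)
```

Here f(E[X]) = 25/4 and E[f(X)] = 15/2, so the inequality is strict. For the constant variable both sides equal 9.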
Probability
Below are the main probability theorems.
Law of total probability
If A is an arbitrary event and \(B_1, …, B_n\) are mutually exclusive events such that \(\sum_{i=1}^{n} P(B_{i}) = 1\), then:
\(P(A) = \sum_{i=1}^{n} P(A|B_{i}) P(B_{i}) = \sum_{i=1}^{n} P(A,B_{i})\)
Example:
Suppose that 15% of the population of your country was exposed to a dangerous chemical Z, and that exposure to Z quadruples the risk of lung cancer from .0001 to .0004. What is the probability that you will get lung cancer?
P(cancer) = .15 * .0004 + .85 * .0001 = .000145
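A minimal sketch of the law of total probability applied to the chemical exposure example. The function name `total_probability` is an illustrative assumption.

```python
def total_probability(conditionals, priors):
    """Sum of P(A | B_i) * P(B_i) over a partition B_1, ..., B_n."""
    return sum(c * p for c, p in zip(conditionals, priors))

p_exposed = 0.15
p_cancer_given_exposed = 0.0004
p_cancer_given_unexposed = 0.0001

p_cancer = total_probability(
    [p_cancer_given_exposed, p_cancer_given_unexposed],
    [p_exposed, 1 - p_exposed],
)
```

`p_cancer` comes out at about 0.000145.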
Bayes’ rule
P(A|B) = P(B|A) * P(A) / P(B)
Where A and B are events.
Example:
Suppose that 15% of the population of your country was exposed to a dangerous chemical Z, and that exposure to Z quadruples the risk of lung cancer from .0001 to .0004. If you have lung cancer, what is the probability that you were exposed to Z?
P(Z|Cancer) = P(Cancer|Z) * P(Z) / P(Cancer)
We can calculate P(Cancer) using the law of total probability: P(Cancer) = P(Cancer|Z) * P(Z) + P(Cancer|~Z) * P(~Z)
P(Z|Cancer) = .0004 * 0.15 / (.0004 * 0.15 + .0001 * .85) ≈ 0.41
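A minimal sketch of Bayes' rule on the chemical exposure example, with the denominator computed from the law of total probability. The function name `bayes` is an illustrative assumption.

```python
def bayes(p_b_given_a, p_a, p_b):
    """P(A | B) = P(B | A) * P(A) / P(B)."""
    return p_b_given_a * p_a / p_b

p_z = 0.15
p_cancer_given_z = 0.0004
p_cancer_given_not_z = 0.0001

# Law of total probability for the denominator.
p_cancer = p_cancer_given_z * p_z + p_cancer_given_not_z * (1 - p_z)

p_z_given_cancer = bayes(p_cancer_given_z, p_z, p_cancer)
```

`p_z_given_cancer` is about 0.41, even though only 15% of the population was exposed.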
Chain rule
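\(P(A_1,A_2,…,A_n) = P(A_1) P(A_2|A_1) … P(A_n|A_{n-1},…,A_1)\)
P(Y,X1,X2,X3,X4) = P(Y,X4,X3,X2,X1)
= P(Y|X4,X3,X2,X1) * P(X4|X3,X2,X1) * P(X3|X2,X1) * P(X2|X1) * P(X1)
= P(Y|X4,X3) * P(X4) * P(X3|X2,X1) * P(X2) * P(X1)
The last line holds only under independence assumptions: Y is conditionally independent of X2 and X1 given X4 and X3, X4 is independent of X3, X2 and X1, and X2 is independent of X1.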
The Union Bound
\(P(A_1 \cup A_2 … \cup A_n) \leq P(A_1) + P(A_2) + … + P(A_n) \)
Number of permutations with replacement
Number of permutations with replacement = \({n^r}\), where r is the number of events and n is the number of elements.
Probability of one specific ordered outcome = \(\frac{1}{n^r}\)
Example:
Probability of getting three sixes when rolling a die three times = 1/6 * 1/6 * 1/6
Probability of getting three heads when flipping a coin three times = 1/2 * 1/2 * 1/2
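A minimal sketch that enumerates every ordered outcome of three die rolls with `itertools.product` and counts the one outcome made of three sixes. The variable names are illustrative.

```python
from fractions import Fraction
from itertools import product

faces = range(1, 7)

# Every ordered outcome of 3 rolls: n^r = 6^3 of them.
rolls = list(product(faces, repeat=3))

p_three_sixes = Fraction(rolls.count((6, 6, 6)), len(rolls))
```

There are 216 outcomes, so `p_three_sixes` is 1/216.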
Number of permutations without replacement and with ordering
Number of permutations without replacement = \(\frac{n!}{(n-r)!}\), where r is the number of events and n is the number of elements.
Probability of one specific ordered outcome = \(\frac{(n-r)!}{n!}\)
Example:
Probability of getting 1 red ball and then 1 green ball from an urn that contains 4 balls (1 red, 1 green, 1 black and 1 blue) = 1/4 * 1/3
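A minimal sketch that enumerates ordered draws of 2 balls without replacement using `itertools.permutations`. The ball labels follow the urn example; the variable names are illustrative.

```python
from fractions import Fraction
from itertools import permutations
from math import factorial

balls = ["red", "green", "black", "blue"]

# Every ordered draw of 2 balls without replacement: n! / (n - r)! of them.
draws = list(permutations(balls, 2))

p_red_then_green = Fraction(draws.count(("red", "green")), len(draws))
```

There are 12 ordered draws, so `p_red_then_green` is 1/12, matching 1/4 * 1/3.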
Number of combinations without replacement and without ordering
Number of combinations = \(\frac{n!}{(n-r)! \, r!}\), where r is the number of events and n is the number of elements.
Probability of one specific unordered selection = \(\frac{(n-r)! \, r!}{n!}\)
Example:
Probability of getting 1 red ball and 1 green ball from an urn that contains 4 balls (1 red, 1 green, 1 black and 1 blue) = 1/4 * 1/3 + 1/4 * 1/3
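A minimal sketch that enumerates unordered pairs of balls with `itertools.combinations` and compares the count with `math.comb` (Python 3.8+). The variable names are illustrative.

```python
from fractions import Fraction
from itertools import combinations
from math import comb

balls = ["red", "green", "black", "blue"]

# Every unordered selection of 2 balls: n! / ((n - r)! r!) of them.
pairs = list(combinations(balls, 2))

# Count the selections that contain exactly the red and the green ball.
favourable = sum(1 for pair in pairs if set(pair) == {"red", "green"})
p_red_and_green = Fraction(favourable, len(pairs))
```

There are 6 unordered pairs, so `p_red_and_green` is 1/6, which equals 1/4 * 1/3 + 1/4 * 1/3.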