### Machine Learning Basics

Pedagogical talk, mainly based on
Murphy, Machine Learning - A Probabilistic Perspective (2012)
Theis, Lecture Notes on Statistical Learning, TU Munich (2016)
Goodfellow, Bengio & Courville, Deep Learning (2016)

Nanotemper Technologies · Munich · 10 June 2016

F. Alexander Wolf | falexwolf.de

Institute of Computational Biology

Helmholtz Zentrum München


The Future of Robotics and Artificial Intelligence, Andrew Ng, Stanford University, 2011

Machine learning in robotics, natural language processing, neuroscience research, and computer vision.
Jordan & Mitchell, Science 349, 255 (2015)

### What is Machine Learning?

It's statistics using models with higher complexity.

These might yield higher precision but are less interpretable.

R. Tibshirani, Lecture Notes, Stanford U (2012)

### What does Machine Learning do?

• Estimate a functional relation $$f : \mathcal{X} \rightarrow \mathcal{Y} \qquad X \mapsto Y$$ from data $\mathcal{D} = \{(x_i,y_i)\}_{i=1}^{N}$ (supervised case).
• Estimate similarity in the space $\mathcal{X}$ from data $\mathcal{D} = \{x_i\}_{i=1}^{N}$, $x_i \in \mathcal{X}$ (unsupervised case).

• Estimation based on data is referred to as learning.
• The supervised case also requires learning similarity in $\mathcal{X}$.

### Classification example

Learn function $$f : \mathbb{R}^{28\times 28} \rightarrow \{2,4\}.$$

Examples from the MNIST database.

In which way are samples, e.g. for the label $y=2$, similar to each other?

▷ Strategy: find coordinates = features that reveal the similarity!

▷ Here PCA: diagonalize the Gram matrix $(x_i^\top x_j)_{i,j=1}^N$ of the centered data, which is equivalent to diagonalizing the covariance matrix.
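As a rough illustration of the PCA step, the following sketch projects data onto its leading principal axes via the covariance matrix. The function name `pca` and the synthetic data are assumptions made for this example; the talk itself uses MNIST images.

```python
import numpy as np

def pca(X, n_components=2):
    """Project rows of X (N samples x D features) onto the leading principal axes."""
    Xc = X - X.mean(axis=0)                 # center each feature
    cov = Xc.T @ Xc / (X.shape[0] - 1)      # D x D sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)  # symmetric matrix, eigenvalues ascending
    order = np.argsort(eigvals)[::-1][:n_components]
    components = eigvecs[:, order]          # D x n_components
    return Xc @ components, eigvals[order]

rng = np.random.default_rng(0)
# Synthetic stand-in for image data (an assumption, not MNIST):
# 200 samples in 5 dimensions, most variance along the first coordinate.
X = rng.normal(size=(200, 5)) * np.array([5.0, 1.0, 0.5, 0.5, 0.5])
Z, variances = pca(X, n_components=2)
print(Z.shape, variances)
```

The columns of `Z` are the new coordinates (features); samples that are similar in the original space stay close in these coordinates.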

### A simple model: k Nearest Neighbors

Model function $\hat f$: estimator $\hat y_x$ for $Y$ given $X=x$ $$\hat y_x = \hat f_\mathcal{D}(x) = \mathrm{E}_{p(y\,|\,x,\mathcal{D})}[y] = \frac{1}{k} \sum_{i \in N_k(x,\mathcal{D})} y_i$$

Hastie et al., Elements of Statistical Learning (2009)

Probabilistic model definition reflects uncertainty $$p(y\,|\,x,\mathcal{D}) = \frac{1}{k} \sum_{i \in N_k(x,\mathcal{D})} \mathbb{I}(y_i = y)$$

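• Overfitting and bias-variance tradeoff: the lower the variance, the higher the bias. Small $k$ gives low bias but high variance (overfitting); large $k$ gives the opposite.
• kNN is non-parametric: there are no parameters to learn, predictions are read directly from the stored data $\mathcal{D}$.
• Assumption: Euclidean distance is a good similarity measure for $x$.
• Curse of dimensionality: in high dimensions the nearest neighbors are no longer close, so kNN works poorly.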
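The kNN model can be sketched in a few lines. This is a minimal example, assuming Euclidean distance and a tiny hand-made data set; the function name `knn_predict_proba` is hypothetical.

```python
import numpy as np

def knn_predict_proba(x, X_train, y_train, k=3):
    """Return the class labels and p(y | x, D) estimated from the k nearest neighbors."""
    dists = np.linalg.norm(X_train - x, axis=1)   # Euclidean distance to every sample
    neighbors = np.argsort(dists)[:k]             # indices N_k(x, D)
    labels = np.unique(y_train)
    probs = np.array([(y_train[neighbors] == c).mean() for c in labels])
    return labels, probs

# Toy data with the two labels of the classification example (2 and 4).
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                    [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]])
y_train = np.array([2, 2, 2, 4, 4, 4])

labels, probs = knn_predict_proba(np.array([0.3, 0.3]), X_train, y_train, k=3)
y_hat = labels[np.argmax(probs)]                  # most probable label
print(labels, probs, y_hat)
```

Increasing `k` averages over more neighbors, which smooths the estimate (lower variance) at the cost of locality (higher bias).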

### Another simple model: linear regression

Estimator for $y$ given $x$ $$\hat y_x = \hat f_{\theta}(x) = \mathrm{E}_{p(y\,|\,x,\theta)}[y] = w_0 + x^\top w$$
Probabilistic model definition $$p(y\,|\,x,\theta) = \mathcal{N}(y \,|\, \hat y_x,\sigma^2), \quad \theta = (w_0,w,\sigma)$$
Estimate parameters from data $\mathcal{D}$ $$\theta^* = \text{argmax}_\theta\, p(\theta\,|\,\mathcal{D})$$

• Parametric, parameters $\theta$ have to be learned.
• High bias due to linearity assumption, but works in high dimensions, and is easily interpretable.
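A minimal sketch of fitting this model, assuming synthetic data with known parameters (`w_true`, `w0_true` and `sigma` are made up for the example) and using NumPy's least-squares solver:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 100, 3
# Ground-truth parameters, chosen arbitrarily for this illustration.
w_true, w0_true, sigma = np.array([1.5, -2.0, 0.5]), 0.7, 0.1
X = rng.normal(size=(N, D))
y = w0_true + X @ w_true + sigma * rng.normal(size=N)

A = np.column_stack([np.ones(N), X])            # column of ones carries w_0
theta, *_ = np.linalg.lstsq(A, y, rcond=None)   # least-squares solution
w0_hat, w_hat = theta[0], theta[1:]

residuals = y - A @ theta
sigma_hat = np.sqrt(np.mean(residuals**2))      # maximum-likelihood noise estimate
print(w0_hat, w_hat, sigma_hat)
```

The fitted weights can be read off directly as the effect of each input coordinate, which is what makes the model easy to interpret.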

### Learning parameters

Estimate parameters from data $\mathcal{D}$ $$\theta^* = \text{argmax}_\theta\, p(\theta\,|\,\mathcal{D},\mathrm{model},\mathrm{beliefs}), \qquad \text{Optimization}$$ assuming a model and prior beliefs about parameters. Now
$$p(\theta\,|\,\mathcal{D}) = p(\mathcal{D}\,|\,\theta)\,p(\theta)/p(\mathcal{D}). \qquad\quad \text{Bayes' rule}$$
Evaluate: assume uniform prior $p(\theta)$, iid samples $(y_i, x_i)$, and inputs $x_i$ whose distribution does not depend on $\theta$ $$p(\theta\,|\,\mathcal{D}) \propto p(\mathcal{D}\,|\,\theta) = \prod_{i=1}^N p(y_i, x_i \,|\,\theta) \propto \prod_{i=1}^N p(y_i \,|\, x_i, \theta)$$

Linear regression: $\log p(\theta\,|\,\mathcal{D}) \simeq -\sum_{i=1}^N (y_i - \hat f_\theta(x_i))^2$   ▷ maximizing the posterior is least squares!
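The step from the Gaussian model to least squares can be written out explicitly. A sketch, assuming a uniform prior and a fixed noise variance $\sigma^2$:

\begin{align} \log p(\theta\,|\,\mathcal{D}) & = \sum_{i=1}^N \log \mathcal{N}(y_i \,|\, \hat f_\theta(x_i), \sigma^2) + \mathrm{const}\\ & = -\frac{1}{2\sigma^2} \sum_{i=1}^N \big(y_i - \hat f_\theta(x_i)\big)^2 - \frac{N}{2} \log(2\pi\sigma^2) + \mathrm{const} \end{align}

Only the first term depends on $(w_0, w)$, so maximizing the posterior over the weights is the same as minimizing the sum of squared residuals.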

### Learning parameters: robot example

Example based on S. Thrun, Statistics, Udacity (2012)
One example: Deep Learning

### Deep Learning: Neural Network Model

[Figure: neural network diagram. Source: Wikipedia]

A Neural Network consists of layered linear regressions (one for each neuron) stacked with non-linear activation functions $\phi$.

\begin{align} p(y\,|\,x,\theta) & = \mathcal{N}(y \,|\, v^\top z(x), \sigma^2)\\ z(x) & = (\phi(w_1^\top x), \ldots, \phi(w_H^\top x)) \end{align}

• Deep learning means many layers.
• In each hidden layer, combine the weight vectors $w_i$ into a matrix $\mathbf{W}$.
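The model above amounts to a matrix product followed by an element-wise non-linearity and a linear read-out. A minimal sketch of the forward pass for one hidden layer, assuming $\phi = \tanh$ and random weights (both are choices made only for this example):

```python
import numpy as np

def forward(x, W, v, phi=np.tanh):
    """Mean of p(y | x, theta) for one hidden layer: v^T phi(W x)."""
    z = phi(W @ x)      # hidden activations z(x), one entry per neuron
    return v @ z        # linear read-out, as in linear regression on z

rng = np.random.default_rng(0)
D, H = 4, 8                      # input dimension, number of hidden neurons
W = rng.normal(size=(H, D))      # rows are the weight vectors w_1, ..., w_H
v = rng.normal(size=H)
x = rng.normal(size=D)

y_mean = forward(x, W, v)
print(y_mean)
```

With the identity as activation, the network collapses to a single linear regression with weights $\mathbf{W}^\top v$, which is why the non-linearity is essential.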

### Deep Learning: Idea

Hubel and Wiesel (1959, 1962, 1968): Nobel prize 1981 for work on the mammalian visual system.   ▷ Results on primary visual cortex (V1).

• V1 is arranged in a spatial map mirroring the structure of the image in the retina.
• V1 has simple cells whose activity is a linear function of the image in a small localized receptive field.
• V1 has complex cells whose activity is invariant to small spatial translations.
• Neurons in V1 respond most strongly to very specific, simple patterns of light, such as oriented bars, but hardly respond to any other patterns.

### Deep Learning: Convolution Layer

• discrete convolution of functions $f_t$ and $w_t$, $t\in\{1,2,...,D\}$,
$$\tilde f_t = \sum_\tau w_{t-\tau}\, f_\tau \quad\Leftrightarrow\quad \mathbf{\tilde f} = \mathbf{W} \mathbf{f}, \quad \mathbf{\tilde f},\mathbf{f} \in \mathbb{R}^D$$ where $W_{t\tau} = w_{t-\tau}$, $\mathbf{W} \in \mathbb{R}^{D\times D}$.

▷  Instead of $D^2$, only $D$ independent components.

#### Natural extension: sparsity

• demand: $w_{t-\tau} \stackrel{!}{=} 0$ for $|t-\tau| > d$
[usual property of kernels, e.g. approximately true for the Gaussian $W_{t\tau} = e^{-\frac{(t-\tau)^2}{2d^2}}$]

▷  Instead of $D^2$, at most $2d+1$ nonzero components.  ▷  Statistics ☺!
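To make the matrix view concrete, the sketch below builds $\mathbf{W}$ from a short kernel and checks it against NumPy's own convolution. It assumes zero padding at the borders and a kernel stored for the offsets $-d,\ldots,d$; the helper name `conv_matrix` is hypothetical.

```python
import numpy as np

def conv_matrix(w, D):
    """Build W with W[t, tau] = w[t - tau], where w holds the offsets -d..d."""
    d = (len(w) - 1) // 2
    W = np.zeros((D, D))
    for t in range(D):
        for tau in range(max(0, t - d), min(D, t + d + 1)):
            W[t, tau] = w[(t - tau) + d]   # shift the offset to an array index
    return W

D = 8
w = np.array([1.0, 2.0, 3.0])     # kernel values for offsets -1, 0, +1 (d = 1)
f = np.arange(D, dtype=float)

W = conv_matrix(w, D)
f_tilde = W @ f

# Same result as NumPy's convolution with zero padding at the borders.
print(np.allclose(f_tilde, np.convolve(f, w, mode="same")))
```

Every row of `W` reuses the same few kernel values, so the number of free parameters is set by the kernel length and not by $D^2$.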

### Deep Learning: Convolution Layer

[Figure, two panels, each mapping $\mathbf{f}$ to $\mathbf{\tilde f}$]
• General weight matrix $\mathbf{W}$ (arrows represent arbitrary values): receptive field of $\tilde f_t$ has full range $D$.
• Convolution-type $\mathbf{W}$ (arrows: same values across receptive fields): receptive field of $\tilde f_t$ has local range $2d$.

### Deep Learning: Why is convolution useful?

Consider an example ($d=1$)

$\mathbf{W} = \left(\begin{array}{ccccc} \ddots & -1 & 1 & 0 & \ddots\\ \ddots & 0 & -1 & 1 & \ddots \end{array} \right)$ $\,\Leftrightarrow\,$ $\tilde f_t = f_t - f_{t-1}$,

that is, $\,\,\mathbf{f}$ = [figure: input] $\,\,\mapsto\,\, \mathbf{\tilde f}$ = [figure: filtered output]

▷  Simple edge structures revealed! Just as the simple cells in V1!
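The same effect can be seen on a one-dimensional signal. A small sketch, assuming a made-up signal with one bright bar:

```python
import numpy as np

# A bright bar on a dark background (illustrative signal).
f = np.array([0, 0, 0, 1, 1, 1, 0, 0], dtype=float)

# f_tilde[i] = f[i + 1] - f[i], i.e. the difference f_t - f_{t-1} for t = i + 1.
f_tilde = np.diff(f)

# Positions t where the signal changes: the edges of the bar.
edges = np.flatnonzero(f_tilde) + 1
print(f_tilde, edges)
```

The output is zero wherever the signal is flat and nonzero only at the two edges of the bar, with the sign telling a rising edge from a falling one.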

### Deep Learning: Pooling layers

#### Assumption

• In most cases, classification information does not depend strongly on the location (index $t$) of a pattern. That is, the presence of a pattern is more important than its location.
• In many cases, our only interest is the presence or absence of a pattern.

#### Max-Pooling Layer

• Implements local translational invariance.
• Just like the complex cells in V1.
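A minimal sketch of max-pooling with stride one over the window $[t-d, t+d]$, assuming the window is simply clipped at the borders (the function name `max_pool` is hypothetical):

```python
import numpy as np

def max_pool(f, d=1):
    """pooled[t] = max of f over the window [t - d, t + d], clipped at the borders."""
    D = len(f)
    return np.array([f[max(0, t - d):min(D, t + d + 1)].max() for t in range(D)])

f = np.array([0.0, 0.0, 1.0, 0.0, 0.0, 0.0])   # a single detected pattern
f_shifted = np.roll(f, 1)                       # the same pattern, one position later

pooled = max_pool(f)
pooled_shifted = max_pool(f_shifted)
print(pooled, pooled_shifted)
```

The raw responses `f` and `f_shifted` have no position in common, while the pooled responses overlap: a small shift of the pattern changes the pooled output only slightly.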

### Deep Learning: Convolutional Neural Network

1. Read input $\mathbf{f}$.
2. Convolution stage $\,\mathbf{\tilde f}^{(k)} := \mathbf{W}^{(k)} \mathbf{f},$ where $\mathbf{W}^{(k)}$ is one of $K$ convolution kernels, $k=1,...,K$.
3. Detector stage $\, \tilde f_t^{(k)} := \phi(\tilde f_t^{(k)} + b)$ where $\phi$ is an activation function, $b$ a bias.
4. Pooling stage $\, \tilde f_t^{(k)} := \max_{\tau \in [t-d,t+d]} \tilde f_\tau^{(k)}$
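The four stages can be sketched for a one-dimensional input as follows. This is an illustration under stated assumptions, not a reference implementation: $\phi$ is taken to be a ReLU, borders are zero-padded, and the two hand-picked kernels stand in for kernels that would normally be learned.

```python
import numpy as np

def conv_layer(f, kernels, b=0.0, d_pool=1):
    """Apply convolution, detector (ReLU) and max-pooling stages to a 1D signal f."""
    D = len(f)
    outputs = []
    for w in kernels:                              # one feature map per kernel k
        f_k = np.convolve(f, w, mode="same")       # convolution stage
        f_k = np.maximum(f_k + b, 0.0)             # detector stage, phi = ReLU
        f_k = np.array([f_k[max(0, t - d_pool):min(D, t + d_pool + 1)].max()
                        for t in range(D)])        # pooling stage
        outputs.append(f_k)
    return np.stack(outputs)                       # shape (K, D)

f = np.array([0, 0, 0, 1, 1, 1, 0, 0], dtype=float)   # step 1: read input
# Kernel entries correspond to the offsets t - tau = -1, 0, +1.
kernels = [np.array([0.0, 1.0, -1.0]),    # f_t - f_{t-1}: responds to rising edges
           np.array([0.0, -1.0, 1.0])]    # f_{t-1} - f_t: responds to falling edges
features = conv_layer(f, kernels)
print(features)
```

Each row of `features` is one feature map: the first is active around the rising edge of the bar, the second around the falling edge.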

• The receptive field grows with depth, and increasingly complex features are constructed from layer to layer.

• What can we learn in general from deep learning?
Convolutional networks are so successful because they efficiently encode our (correct) beliefs about the structure of certain data (translation-invariant, simple local features, complex features from simple features).
▷ Understand the similarity structure of your data and use a model that reflects it!

### Summary

• Machine Learning is Statistics with models of higher complexity.
• It's all about similarity in the data space.
• Two simple examples: kNN and linear regression
• Learning is Bayes' rule followed by an optimization.
• Deep Learning: very successful way of understanding and exploiting the similarity structure of e.g. image data.

Thank you!