Machine Learning Basics

Pedagogical talk, mainly based on
Murphy, Machine Learning: A Probabilistic Perspective (2012)
Theis, Lecture Notes on Statistical Learning, TU Munich (2016)
Goodfellow, Bengio & Courville, Deep Learning (2016)

Nanotemper Technologies · Munich · 10 June 2016


F. Alexander Wolf | falexwolf.de

Institute of Computational Biology

Helmholtz Zentrum München


The Future of Robotics and Artificial Intelligence, Andrew Ng, Stanford University, 2011

Machine learning in robotics, natural language processing, neuroscience research, and computer vision.
Jordan & Mitchell, Science 349, 255 (2015)

What is Machine Learning?

It's statistics using models with higher complexity.

These might yield higher precision but are less interpretable.

R. Tibshirani, Lecture Notes, Stanford U (2012)

What does Machine Learning do?

  • Estimate a functional relation $$f : \mathcal{X} \rightarrow \mathcal{Y} \qquad X \mapsto Y$$ from data $\mathcal{D} = \{(x_i,y_i)\}_{i=1}^{N}$ (supervised case).
  • Estimate similarity in the space $\mathcal{X}$ from data $\mathcal{D} = \{x_i\}_{i=1}^{N}$, $x_i \in \mathcal{X}$ (unsupervised case).

Comments

  • Estimation based on data is referred to as learning.
  • The supervised case, too, requires learning similarity in $\mathcal{X}$.

Classification example

Learn function $$f : \mathbb{R}^{28\times 28} \rightarrow \{2,4\}.$$

Examples from the MNIST database.

In what way are the samples for a given label, e.g. $y=2$, similar to each other?

▷ Strategy: find coordinates = features that reveal the similarity!

▷ Here PCA: diagonalize the covariance matrix, or equivalently the Gram matrix $(x_i^\top x_j)_{i,j=1}^N$.
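
A minimal sketch of this PCA step in Python (NumPy), assuming random data as a stand-in for the $N \times 784$ MNIST images; the sample size and the choice of two components are illustrative.

```python
import numpy as np

# Random data as a stand-in for N x (28*28) MNIST images (illustrative assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 28 * 28))

Xc = X - X.mean(axis=0)                 # center the data
cov = Xc.T @ Xc / (len(Xc) - 1)         # 784 x 784 covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)  # eigh: the covariance matrix is symmetric
components = eigvecs[:, ::-1][:, :2]    # two leading principal directions
features = Xc @ components              # new coordinates ("features") revealing similarity
print(features.shape)                   # (1000, 2)
```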

A simple model: k Nearest Neighbors

Model function $\hat f$: estimator $\hat y_x$ for $Y$ given $X=x$ $$ \hat y_x = \hat f_\mathcal{D}(x) = \mathrm{E}_{p(y\,|\,x,\mathcal{D})}[y] = \frac{1}{k} \sum_{i \in N_k(x,\mathcal{D})} y_i $$ The probabilistic model definition reflects uncertainty $$ p(y\,|\,x,\mathcal{D}) = \frac{1}{k} \sum_{i \in N_k(x,\mathcal{D})} \mathbb{I}(y_i = y) $$

Hastie et al., Elements of Statistical Learning (2009)

  • Overfitting and the bias-variance tradeoff: small $k$ means low bias but high variance, large $k$ the reverse.
  • It is non-parametric, so no parameters need to be learned.
  • Assumption: Euclidean distance is a good similarity measure for $x$.
  • Curse of dimensionality: it does not work well in high dimensions.
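
A minimal sketch of the probabilistic kNN model above in Python (NumPy): the estimate of $p(y\,|\,x,\mathcal{D})$ is the fraction of the $k$ nearest neighbours (Euclidean distance) carrying each label. The function name `knn_predict`, the toy two-dimensional data, and `k=5` are illustrative assumptions.

```python
import numpy as np

def knn_predict(x, X_train, y_train, k=5):
    """k-NN estimate of p(y | x, D): the fraction of the k nearest
    neighbours (Euclidean distance) carrying each label."""
    dists = np.linalg.norm(X_train - x, axis=1)   # Euclidean similarity assumption
    neighbours = np.argsort(dists)[:k]            # indices of N_k(x, D)
    labels, counts = np.unique(y_train[neighbours], return_counts=True)
    return dict(zip(labels, counts / k))

# Toy usage with made-up 2-D data and the labels {2, 4} from the MNIST example.
rng = np.random.default_rng(1)
X_train = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(3, 1, (20, 2))])
y_train = np.array([2] * 20 + [4] * 20)
print(knn_predict(np.array([2.5, 2.5]), X_train, y_train, k=5))
```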

Another simple model: linear regression

Estimator for $y$ given $x$ $$ \hat y_x = \hat f_{\theta}(x) = \mathrm{E}_{p(y\,|\,x,\theta)}[y] = w_0 + x^\top w $$ Probabilistic model definition $$ p(y\,|\,x,\theta) = \mathcal{N}(y \,|\, \hat y_x,\sigma^2), \quad \theta = (w_0,w,\sigma) $$ Estimate parameters from data $\mathcal{D}$ $$ \theta^* = \text{argmax}_\theta p(\theta\,|\,\mathcal{D}) $$

  • Parametric, parameters $\theta$ have to be learned.
  • High bias due to linearity assumption, but works in high dimensions, and is easily interpretable.
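
A minimal sketch of this probabilistic model in Python (NumPy); the parameter values $\theta = (w_0, w, \sigma)$ below are made up purely for illustration.

```python
import numpy as np

# Made-up parameters theta = (w0, w, sigma) for illustration.
rng = np.random.default_rng(2)
w0, w, sigma = 1.0, np.array([0.5, -2.0]), 0.3

x = np.array([1.0, 2.0])
y_hat = w0 + x @ w                   # E[y | x, theta]: the point prediction
y_sample = rng.normal(y_hat, sigma)  # one draw from p(y | x, theta) = N(y | y_hat, sigma^2)
print(y_hat, y_sample)
```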

Learning parameters

Estimate parameters from data $\mathcal{D}$ $$ \theta^* = \text{argmax}_\theta\, p(\theta\,|\,\mathcal{D},\mathrm{model},\mathrm{beliefs}) \qquad \text{(Optimization)} $$ assuming a model and prior beliefs about the parameters. Now
$$ p(\theta\,|\,\mathcal{D}) = p(\mathcal{D}\,|\,\theta)\,p(\theta)/p(\mathcal{D}). \qquad \text{(Bayes' rule)} $$
Evaluate: assume a uniform prior $p(\theta)$ and iid samples $(y_i, x_i)$ $$ p(\theta\,|\,\mathcal{D}) \propto p(\mathcal{D}\,|\,\theta) = \prod_{i=1}^N p(y_i, x_i \,|\,\theta) \propto \prod_{i=1}^N p(y_i \,|\, x_i, \theta) $$

Linear regression: $ \log p(\theta\,|\,\mathcal{D}) \simeq -\sum_{i=1}^N (y_i - \hat f_{\theta}(x_i))^2$   ▷ maximizing the posterior amounts to least squares!
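
A minimal sketch of this correspondence in Python (NumPy): with a uniform prior and Gaussian noise, maximizing the posterior is ordinary least squares. The synthetic data and the assumed "true" parameters $(w_0, w) = (1, 2)$ are illustrative.

```python
import numpy as np

# Synthetic data from assumed true parameters (w0, w) = (1, 2) with Gaussian noise.
rng = np.random.default_rng(3)
x = rng.uniform(-1, 1, size=100)
y = 1.0 + 2.0 * x + rng.normal(0, 0.1, size=100)

A = np.column_stack([np.ones_like(x), x])           # design matrix for (w0, w)
theta_star, *_ = np.linalg.lstsq(A, y, rcond=None)  # minimizes the squared error
print(theta_star)                                   # close to [1.0, 2.0]
```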

Learning parameters: robot example

Example based on S. Thrun, Statistics, Udacity (2012)

One example: Deep Learning

Deep Learning: Neural Network Model

Wikipedia

A Neural Network consists of layered linear regressions (one for each neuron) stacked with non-linear activation functions $\phi$.


\begin{align} P(y\,|\,x,\theta) & = \mathcal{N}(y \,|\, v^\top z(x), \sigma^2)\\ z(x) & = (\phi(w_1^\top x), \ldots, \phi(w_H^\top x)) \end{align}
  • Deep learning means many layers.
  • In each hidden layer, the weights $w_i$ are combined into a matrix $\mathbf{W}$.
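
A minimal sketch of a single hidden layer as defined above, in Python (NumPy); the sizes $D$, $H$ and the choice of $\phi = \tanh$ are illustrative assumptions.

```python
import numpy as np

# Illustrative sizes; tanh as the activation phi is an assumption.
rng = np.random.default_rng(4)
D, H = 10, 5                 # input dimension, number of hidden units
W = rng.normal(size=(H, D))  # rows are the weights w_1, ..., w_H
v = rng.normal(size=H)
phi = np.tanh

x = rng.normal(size=D)
z = phi(W @ x)               # hidden representation z(x) = (phi(w_1^T x), ..., phi(w_H^T x))
y_hat = v @ z                # mean of N(y | v^T z(x), sigma^2)
print(y_hat)
```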

Deep Learning: Idea

Hubel and Wiesel (1959, 1962, 1968): Nobel prize 1981 for work on mammalian vision system.   ▷ Results on primary visual cortex (V1).

  • V1 is arranged in a spatial map mirroring the structure of the image in the retina.
  • V1 has simple cells whose activity is a linear function of the image in a small localized receptive field.
  • V1 has complex cells whose activity is invariant to small spatial translations.
  • Neurons in V1 respond most strongly to very specific, simple patterns of light, such as oriented bars, and hardly respond to other patterns.

Deep Learning: Convolution Layer

  • discrete convolution of functions $f_t$ and $w_t$, $t\in\{1,2,...,D\}$,
    $$ \tilde f_t = \sum_\tau w_{t-\tau}\, f_\tau, \quad \mathbf{\tilde f} = \mathbf{W} \mathbf{f}, \quad \mathbf{\tilde f},\mathbf{f} \in \mathbb{R}^D $$ where $W_{t\tau} = w_{t-\tau}$, $\mathbf{W} \in \mathbb{R}^{D\times D}$.

▷  Instead of $D^2$, only $D$ independent components.

Natural extension: sparsity

  • demand: $w_{t-\tau} \stackrel{!}{=} 0$ for $|t-\tau| > d$
    [usual property of kernels: e.g. Gaussian $ W_{t\tau} = e^{-\frac{(t-\tau)^2}{2d^2}}$]

▷  Instead of $D^2$, only $2d{+}1$ independent nonzero components.  ▷  Statistics ☺!
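
A minimal sketch in Python (NumPy) of the banded matrix $W_{t\tau} = w_{t-\tau}$ and its agreement with a library convolution; the smoothing kernel, $D = 8$, and $d = 1$ are illustrative assumptions.

```python
import numpy as np

# Illustrative smoothing kernel with d = 1; D is kept small so W can be inspected.
D, d = 8, 1
w = {-1: 0.25, 0: 0.5, 1: 0.25}      # the independent kernel values w_k, |k| <= d

W = np.zeros((D, D))
for t in range(D):
    for tau in range(D):
        if abs(t - tau) <= d:
            W[t, tau] = w[t - tau]   # W_{t,tau} = w_{t-tau}, zero outside the band

f = np.arange(D, dtype=float)
print(W @ f)                         # matrix form of the convolution
print(np.allclose(W @ f, np.convolve(f, [w[-1], w[0], w[1]], mode='same')))  # True
```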

Deep Learning: Convolution Layer

[Figure: a general weight matrix $\mathbf{W}$ (arrows represent arbitrary values) mapping $\mathbf{f}$ to $\mathbf{\tilde f}$; the receptive field of $\tilde f_t$ is the full range $D$.]
[Figure: a convolution-type $\mathbf{W}$ (arrows: the same values shared across receptive fields) mapping $\mathbf{f}$ to $\mathbf{\tilde f}$; the receptive field of $\tilde f_t$ is a local range $2d$.]

Deep Learning: Why is convolution useful?

Consider an example ($d=1$)

$ \mathbf{W} = \left(\begin{array}{ccccc} \ddots & -1 & 1 & 0 & \ddots\\ \ddots & 0 & -1 & 1 & \ddots \end{array} \right)$ $\,\Leftrightarrow\,$ $\tilde f_t = f_t - f_{t-1}$,

that is, $\mathbf{f} \mapsto \mathbf{\tilde f}$:

[Figure: an example image $\mathbf{f}$ and the filtered image $\mathbf{\tilde f}$, in which only edges remain.]

▷  Simple edge structures revealed! Just as the simple cells in V1!
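
A minimal sketch in Python (NumPy): applying the difference filter $\tilde f_t = f_t - f_{t-1}$ along the rows of a synthetic bright square (an assumed stand-in for a digit image) leaves only its edges.

```python
import numpy as np

# A synthetic bright square on a dark background stands in for a digit image.
f = np.zeros((8, 8))
f[2:6, 2:6] = 1.0

f_tilde = f.copy()
f_tilde[:, 1:] = f[:, 1:] - f[:, :-1]   # tilde f_t = f_t - f_{t-1}, applied along each row
print(f_tilde[3])                       # nonzero (+1 / -1) only at the square's left and right edges
```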

Deep Learning: Pooling layers

Assumption

  • In most cases, classification information does not depend strongly on the location (index $t$) of a pattern. That is, the presence of a pattern is more important than its location.
  • In many cases, our only interest is the presence or absence of a pattern.

Max-Pooling Layer

  • Implements local translational invariance.
  • Just as complex cells in V1 do.
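
A minimal sketch of max-pooling in Python (NumPy): shifting a pattern by one position leaves the pooled activations near the pattern unchanged, illustrating local translational invariance. The helper `max_pool`, the window half-width $d = 1$, and the toy signal are illustrative assumptions.

```python
import numpy as np

def max_pool(f, d=1):
    # pooled_t = max over tau in [t-d, t+d] of f_tau (window clipped at the borders)
    return np.array([f[max(t - d, 0):t + d + 1].max() for t in range(len(f))])

f = np.array([0., 0., 1., 0., 0., 0., 0., 0.])   # a "pattern" at position 2
print(max_pool(f))                # [0. 1. 1. 1. 0. 0. 0. 0.]
print(max_pool(np.roll(f, 1)))    # [0. 0. 1. 1. 1. 0. 0. 0.] -- positions 2 and 3 unchanged
```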

Deep Learning: Convolutional Neural Network

  1. Read input $\mathbf{f}$.
  2. Convolution stage
    $\,\mathbf{\tilde f}^{(k)} := \mathbf{W}^{(k)} \mathbf{f},$
    where $\mathbf{W}^{(k)}$ is one of $K$ convolution kernels, $k=1,...,K$.
  3. Detector stage $\, \tilde f_t^{(k)} := \phi(\tilde f_t^{(k)} + b)$ where $\phi$ is an activation function, $b$ a bias.
  4. Pooling stage $\, \tilde f_t^{(k)} := \max_{\tau \in [t-d,t+d]} \tilde f_\tau^{(k)}$
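
A minimal sketch of these four stages for a 1-D input in Python (NumPy), assuming $K = 2$ kernels, ReLU as $\phi$, and zero bias; the kernels and the toy input are illustrative.

```python
import numpy as np

# Two assumed kernels (K = 2), ReLU as phi, zero bias; the 1-D input is a toy signal.
D, d, b = 8, 1, 0.0
kernels = [np.array([0.0, 1.0, -1.0]),    # difference kernel: responds to edges
           np.array([0.25, 0.5, 0.25])]   # smoothing kernel

f = np.zeros(D)
f[3:6] = 1.0                                                  # 1. read input f
for k, w in enumerate(kernels):
    f_conv = np.convolve(f, w, mode='same')                   # 2. convolution stage
    f_det = np.maximum(f_conv + b, 0.0)                       # 3. detector stage: phi = ReLU
    f_pool = np.array([f_det[max(t - d, 0):t + d + 1].max()   # 4. pooling stage:
                       for t in range(D)])                    #    max over [t-d, t+d]
    print(k, f_pool)
```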

Deep Learning: Comments

  • The receptive field grows from layer to layer, and increasingly complex features are constructed.

  • What can we learn from deep learning in general?
    Convolutional networks are so successful because they efficiently encode our (correct) beliefs about the structure of certain data (translation-invariant, simple local features, complex features from simple features).
    ▷ Understand the similarity structure of your data and use a model that reflects it!

Summary

  • Machine Learning is statistics with models of higher complexity.
  • It is all about similarity in the data space.
  • Two simple examples: kNN and linear regression.
  • Learning is Bayes' rule followed by an optimization.
  • Deep Learning: a very successful way of understanding and exploiting the similarity structure of, e.g., image data.

Thank you!