# Supervised Learning

Useful References:

• Pattern Recognition and Machine Learning (Bishop, Spring 2006)
• An Introduction to Support Vector Machines (Shawe-Taylor, Critsiani, CUP 2000)
• Kernel Methods for Pattern Analysis (Shawe-Taylor, Cristiani, CUP 2004)

A supervised learning problem tries to use a training set of input/output pairs to determine the funcitonal relationship between them. Typicaly the input $x$ is a high dimensional vector, and the output $y$ is a reduced dimension variable like a category. Examples included detecting people in a set of photos. Machine learning gives us a general toolkit for tackling problems that are ill-defined and typically 'easier' for humans to do.

Learning attempts to infer the algorithm from a set of labelled examples. We are generally trying to isolate an underlying algorithm that is:

1. Stable: Not inferring 'chance' correlations
2. Efficient
3. Rboust: It should be able to deal with noise e.g.bad image quality of labelling errors.

Typically we being a learning problem with a large dataset, e.g. labelled images. We have to convert this dataset into a computational representation and then train our algorithm on the training set. Then, given a new set of data, the algorithm should be able to appropriately label it. We will mostly focus on the training step, but it is also important to consider the problem of data collection and representation.

A typical scenario for machine learning is datasets where the dimensionality is very high, potentially even larger than the number of samples we have available. A statistical technique might 'overfit' in these instances as the sample size is sparse.

Notationally, we have access to a training set $\mathcal{S}=\{(x_{1},y_{1}),\cdots,(x_{m},y_{m})\}$, and we want to infer a function $f_{s}:f_{s}\left(x_{i}\right}\approx y_{i}$ for $x_{i}\notin \mathcal{S}$. Typically the data is in $\mathbb{R^{n}}$, or the outputs can be categorical \$y\in\mathbb{Z}_{n}). A learning algorithm then, is a mapping from the training set to the classification function.

## Difficulties in Machine Learning

There are a number of subtleties in machine learning. We need to consider a number of biases that can impact on our inferred classification function.

The key thing is recognising possible assumptions that have gone in to your datasets. For example, you don't know that your data is truly 'i.i.d', and in turn we can't "tell" what features our neural net has learnt to distinguish on; has it, for example, learnt to tell "bright" from "dark" images because the weather was different when collecting your 'positive' and 'negative' sets?

We also have to recognise what assumptions our representation has introduced, how accurate $f_{s}$ is on new data, how we evaluate our performances, how 'complex' the learning task is and in turn how to select between two different learned models $f_{s},g_{s}$.

In turn, we have to consider how new inputs might differ from our training sets, what noise is in our inputs, how deterministic the output is etc. We can also consider the problem of spurious correlations in some dimensions that aren't relevant to our problem, and what prior knowledge we have available.

## Least Squares Regression

This problem is a simple example of a learning algorithm, developed in astronomy and first described by Gauss. Given a data set, how should we predict the output $y$ for a new input $x$?

For example $\mathcal{S}=\{((1,1), 2), ((2,3),7)\}$. We can see that the rule appears to be $y_{i} = 2*x_{i,1}+x_{i,2}$. We can encode this as a system of linear equations, and solve this for the vector $\mathbf{w}$.

We can then solve for $\mathbf{w} by inverting $X$. ### Over and Under-determination If we have more equations than unknowns, the dataset is overdetermined and a linear relationship may not exist. This is called overdetermination. In turn, we may have fewer datapoitns than our set of unknowns, and our dataset is underdetermined. Often, underdetermination is our problem. For example in genomics we have 'fat' data with many more varaibles than datapoints. We will first focus on the overdetermined case however. ### Least-Squares and Overdetermination We want to infer a linear relationship in our overdetermined dataset. We seek a linear predictor $\hat{y}=\mathbf{w}\dot\mathbf{x}$ to minimize the square error such that Given our $m$ inputs, $\mathbf{x}_{i}\in\mathbb{R}^{n}$, we build a matrix X in $\mathbb{R}^{m\times n}$ and a vector $\mathbf{y}\in\mathbb{R}$. Our empirical mean-squares error is then given by We seek the optimal $\mathbf{w}$ to minimize the error, and we can solve for this by taking the derivative of this expression with respect to $\mathbf{w}$. Following the argument leads to the solution [\mathbf{w}=\left(X^{\text{T}}X\right)^{-1}X^{\text{t}}\mathbf{y}. In matlab we can use \texttt{w=X\y}. This can work in the over and under-determined case but it is vital to read the documentation to understand what assumptions go in to handling the underdetermined case. The main advantage is that it avoids the numerical instability incurred in numerical inversion. ## Optimal Supervised Learning We assume that our data is obtained by sampling in independent, identitcally distributed trials from an unknown probability density $P(\mathbf{x},y)$. In this case we see the optimal form of the classifier function that minimises the error =\int (y-f(\mathbf{x}))^{2}dP(\mathbf{x},y) ] and we denote the optimal solution $f^{*}$. We can use this to find the Bayes estimator for regression, Using Bayes' theorem to write $P(\mathbf{x},y)=P(y|\mathbf{x})P(\mathbf{x})$, we can decompose the integral into one over domain and range of the classifier such that , giving $f^{*}=\int_{\mathcal{Y}}ydP(y|\mathcal{x})$, the integral of the probability around a specific $\mathbf{x}$. This can be re-expressed as The main problem, however, is that we need to estimate $P$: we never have access to the full distribtuon! ### Bias and Variance Assume that our classification has a white-noise error term, with 0 mean and finite variance, such that $y=F(x)+\epsilon$. The optimal prediction is thus equal to $F(x)$ with square loss. We want to go further, though, to estimate the expected error by an arbitrary learner $A$ on our dataset $\mathcal{S}$, namely . Using the Lemma we can expand out the error function into three parts 1. \(E \left( \left( y-f(\mathbf{x})right)^{2}\right)$ : The Bayes Error

2. $\left(f^{*}(\mathbf{x})-E(A(\mathbf{x}))\right)^{2}$: The
Bias, or error in $f^{*}$

3. $E\left$\left(A(x)-E\left(A(x)\right)^{2}\right)\right$$ : The variance between training sets


There is a trade of between bias and variance, which can mostly be understood as the tradeoff between over and under-fitting the dataset. We also note that this exact decomposition holds only for the mean-squared error.

We might try to solve our problem of estimating $P$ by minimising the empirical error for our dataset, but the problem is that there always exists a function $f$ that will minimise this error (and, in turn, the bias). This is a problem as we can overfit the data; we can determine a perfect categorizer for the training set, but that function will vary largely depending on what training set we use.

### Hypothesis Space

Instead, we might try to select our function within a prescriped hypothesis space, which likely doesn't contain the global optimum function. We must be careful not to choose to large a space, else we fall back on to the overfitting problem.Thus, we seek to minimize our empirical error with respect to a given hypothesis space. In the case of regression, an example could be the least squares problem where we seek $f^{*}\in\mathcal{H}:=\{\mathbf{w}^{T}\mathbf{x}:\mathbf{w}\in\mathbb^{n}$.

The problem of choosing a hypothesis space is called Model Selection. For example, we might consider a hierarchy of hypothesis spaces $\mathcal{H}_{k}$, which are made up of order $k$polynomials. We can solve for a function that minimises in each dimension $r$. Generally, we'll find that as $r$ increases, the bias gets smaller and smaller. But, if we sample further points outside the training set, the observed error infact starts to grow beyond a certain level of accuracy $r$; this is the bias, variance tradeoff.

### K-fold Cross Validation

Cross validation is the practice of splitting the training set into $K$ susets. We train on some combination of $N subsets, and test on the remaining subsets, and rotate the training and tests sets. The most common example of this is 'leave one out' testing, where the training set is $K-1$ subsets and we test all $K$ permutations. We then select the model with the smallest average empirical error across all test sets.

It is worth noting here that there are a variety of theoretical approaches to model selection, including Bayesian selection based on prior knowledge of the distribution $P$, and 'structural risk minimization'.