# Feature Maps

Feature maps are a technique for 'lifting' linear regression techniques to caputre higher dimensional features of the data sets. These can be either explicit, where we construct a set of basis functions, or implicit, using kernel methods.

## Ridge Regression

An important tool for the study of feature maps is ridge regression, where we add a term to our linear regression that penalises miscategorised data in the training set, but also gives a tradeoff with the norm of the vector $\mathbf{w}$. This gives a new form of the empirical error term

Differentiating and settign equal to 0 gives the so called 'regularized solution'

Expanding out the vector $\mathbf{w}=\sum_{i=1}^{m}\alpha_{i}\mathbf{x}_{i}$ is called the 'primal form of the solution. We can instead define the dual function $f(\mathbf{x}=\sum_{i=1}^{m}\alpha_{i}\mathbf{x}_{i}^{T}\mathbf{x}$. The resulting vector of coefficients is given by $\mathbf{\alpha}=\left(X^{T}X+\lambda\mathbf{I}_{m}\right)^{-1}\mathbf{y}$.

This dual form is computationally helpful for $n>m$, as we no longer have to invert an $n\times n$ matrix.

## Feature Maps

Feature maps can be used, in conjunction with ridge regression, to generalise regression to a space of nonlinear functions. The penalty term in ridge regression maps us to a convex optimisation problem, which is classically solvable.

A feature map $\phi$ is a function $\mathbb{R}^{n}\rightarrow\mathbb{R}^{N}$, made up of $N$ basis funcions $\phi_{i}(\mathbf{x})$. The resulting vector $\phi(\mathbf{x})$ is called the feature vector, and the image of $\mathbb{R}^{n}$ under the feature map is called the feature space.

Non-linear regression then has the 'primal form' This partcular construction is called an explicit feature map as we have written out the feature functions $\phi_{i}$.

### Kernel Methods

The kernel method is effectively the dual-form of nonlinear regression. Instead of needing to specify the $N$ basis functions, we instead define a function called a kernel $K$, which encodes the inner products between the functions. This allows us to dramatically reduce the computational time required when dealing with the feature maps.

A kernal $K$ is define as a function $\mathbb{R}^{n}\times\mathbb{R}^{n}\rightarrow\mathbb{R}^{n}$, which takes as argument two elements of the training set and returns Our hope is that the kernel requires computational time that is independent of the number of basis functions in the feature map, and instead depends only on the size of the training data.