# Dimensionality Reduction

A common task in data analysis is to identify a lower-dimensional manifold in a high-dimensional data space, such that the features of the data are almost entirely captured by the low-dimensional manifold. Formally, we seek a vector $\mathbf{y}_{i}$ for each data element $\mathbf{x}_{i}$ that parametrises the location of $\mathbf{x}_{i}$ on the manifold.

## Linear Algebra Introduction

Our $m$ $n$-dimensional data points can be collected into an $n\times m$ real-valued matrix, $X$, with one data point per column. We will assume for the following that the data has been 'centred', namely that the mean data point $\bar{\mathbf{x}}=\frac{1}{m}\sum_{i}\mathbf{x}_{i}$ is zero.

We define two related symmetric matrices which capture aspects of the data distribution: the scatter matrix, $S=XX^{T}$, and the Gram matrix, $G=X^{T}X$.

Both matrices are real, symmetric and positive semi-definite. This gives us some useful properties if we consider their eigenvalue/eigenvector and singular value decompositions. Recall that for a symmetric matrix the eigenvectors form an orthonormal basis. The symmetric matrix $M$ can then be decomposed as $V\Lambda V^{T}$, where $V$ is the matrix with eigenvectors as each column, and $\Lambda$ the diagonal matrix of eigenvalues.
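As a minimal numpy sketch (the data here is random toy data, purely for illustration), we can centre a data matrix, form $S$ and $G$, and check symmetry and positive semi-definiteness numerically:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 5, 100                             # n-dimensional data, m points as columns
X = rng.normal(size=(n, m))
X = X - X.mean(axis=1, keepdims=True)     # centre: the mean data point is now zero

S = X @ X.T                               # scatter matrix, n x n
G = X.T @ X                               # Gram matrix, m x m

# Both are symmetric and positive semi-definite (up to numerical error):
assert np.allclose(S, S.T) and np.allclose(G, G.T)
assert np.linalg.eigvalsh(S).min() > -1e-10
assert np.linalg.eigvalsh(G).min() > -1e-10
```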

A generalisation of this beyond square matrices is the Singular Value Decomposition (SVD). In this case, we can decompose an $n\times m$ matrix into an orthonormal $n\times n$ matrix, $V$, a basis of the columns, a diagonal $n\times n$ matrix of singular values, $\Sigma$, and an orthonormal basis over the rows, the $m\times n$ matrix $U$ with $U^{T}U=I$:

$$X = V\Sigma U^{T}.$$

Using the singular value decomposition, we can thus write

$$S = XX^{T} = V\Sigma^{2}V^{T}, \qquad G = X^{T}X = U\Sigma^{2}U^{T}.$$

Here, the eigenvectors of $S$ are given by the columns of $V$, the eigenvectors of $G$ by the columns of $U$, and in both cases the eigenvalues are the squared singular values of $X$.
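This correspondence is easy to verify numerically. A small sketch on random toy data, using the document's naming where $V$ spans the columns of $X$ (numpy's `svd` returns these factors as `u, s, vh`):

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 4, 50
X = rng.normal(size=(n, m))
X = X - X.mean(axis=1, keepdims=True)

# Thin SVD: X = V @ diag(sigma) @ U.T  (document's naming)
V, sigma, Ut = np.linalg.svd(X, full_matrices=False)

# Eigenvalues of S = X X^T are the squared singular values,
# with the columns of V as eigenvectors.
S = X @ X.T
evals, evecs = np.linalg.eigh(S)          # eigh returns ascending order
assert np.allclose(np.sort(sigma**2), evals)

# Eigenvectors match up to sign:
for i in range(n):
    v = evecs[:, -(i + 1)]                # i-th largest eigenvalue's eigenvector
    assert np.isclose(abs(v @ V[:, i]), 1.0)
```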

An important property of the SVD is that if we take only the first $k$ singular values and the corresponding left and right singular vectors, we obtain the best rank-$k$ approximation to $X$ (assuming least-squares loss). Namely, with $\tilde{V}=V_{1:n,1:k}\sqrt{\Sigma}_{1:k,1:k}$ and $\tilde{U}=U_{1:m,1:k}\sqrt{\Sigma}_{1:k,1:k}$, we minimise the error

$$\sum_{ij}\left(X_{ij}-(\tilde{V}\tilde{U}^{T})_{ij}\right)^{2}.$$
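A short numpy sketch of this property on a random toy matrix; the squared error of the truncated SVD equals the sum of the discarded squared singular values:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(6, 40))
V, sigma, Ut = np.linalg.svd(X, full_matrices=False)

k = 2
X_k = V[:, :k] @ np.diag(sigma[:k]) @ Ut[:k, :]    # best rank-k approximation

# Equivalent factorised form, splitting sqrt(Sigma) between the two factors:
V_tilde = V[:, :k] * np.sqrt(sigma[:k])
U_tilde = Ut[:k, :].T * np.sqrt(sigma[:k])
assert np.allclose(X_k, V_tilde @ U_tilde.T)

# Squared error = sum of discarded squared singular values:
err = np.sum((X - X_k) ** 2)
assert np.isclose(err, np.sum(sigma[k:] ** 2))
```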

A corollary of this result is that, given an $n\times k$ matrix $P$ with orthonormal columns, the error function

$$\sum_{ij}\left(X_{ij}-(PP^{T}X)_{ij}\right)^{2}$$

is minimised by taking the columns of $P$ to be the first $k$ eigenvectors of $S$ (equivalently, the first $k$ columns of $V$).
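A quick numerical check of this corollary, comparing the projection error of the eigenvector choice against a random orthonormal alternative (toy data, assumed setup):

```python
import numpy as np

rng = np.random.default_rng(6)
n, m, k = 6, 300, 2
X = rng.normal(size=(n, m))
X = X - X.mean(axis=1, keepdims=True)

def proj_error(P, X):
    """Squared reconstruction error of the projection P P^T X."""
    return np.sum((X - P @ P.T @ X) ** 2)

# Candidate minimiser: first k eigenvectors of S, descending eigenvalue
evecs = np.linalg.eigh(X @ X.T)[1][:, ::-1]
P_opt = evecs[:, :k]

# A random orthonormal P does no better:
Q = np.linalg.qr(rng.normal(size=(n, k)))[0]
assert proj_error(P_opt, X) <= proj_error(Q, X) + 1e-9
```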

## Principal Component Analysis

We want to find a matrix $P$ that projects a data element $\mathbf{x}$ on to a manifold vector $\mathbf{y}$. Let's assume a linear approach, namely $\mathbf{y}=P^{T}\mathbf{x}$ for an $n\times k$ real matrix $P$. This linear mapping preserves local and global structure of the data. Our goal is to pick a $P$ that captures most of the variance of the data.

The variance of the data along a unit direction $\mathbf{v}$ is given by $\frac{1}{m}\mathbf{v}^{T}S\mathbf{v}$; when $\mathbf{v}$ is an eigenvector of $S$, this is proportional to the corresponding eigenvalue. Accordingly, we can capture a good percentage of the variance by choosing the first $k$ eigendirections, truncating specifically at a point where we feel we have captured sufficient variance that the remaining eigenvalues we exclude mostly capture the effects of noise. Typically, the eigenspectrum adopts a $\frac{1}{f}$ shape.
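A minimal sketch of this truncation rule on assumed toy data (a low-rank signal plus small noise), choosing the smallest $k$ whose eigendirections capture 95% of the variance; the threshold is illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)
# Toy setup: rank-3 signal plus small isotropic noise
n, m, true_k = 10, 500, 3
signal = rng.normal(size=(n, true_k)) @ rng.normal(size=(true_k, m))
X = signal + 0.05 * rng.normal(size=(n, m))
X = X - X.mean(axis=1, keepdims=True)

evals = np.linalg.eigvalsh(X @ X.T)[::-1]       # descending eigenvalues of S
explained = np.cumsum(evals) / np.sum(evals)    # cumulative fraction of variance
k = int(np.searchsorted(explained, 0.95)) + 1   # smallest k capturing 95%
```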

Our matrix $P$ is then built up with the first $k$ eigenvectors of $S$ as its columns, sorted in order of descending eigenvalue. The manifold coordinates $\mathbf{y}$ are given by $P^{T}\mathbf{x}$, and the hyperplane projection by $P\mathbf{y}=PP^{T}\mathbf{x}$.
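Putting the pieces together, a compact PCA sketch in numpy (the `pca` helper and toy data are illustrative, not a library API):

```python
import numpy as np

def pca(X, k):
    """Project n x m data X (points as columns) onto its first k principal directions."""
    Xc = X - X.mean(axis=1, keepdims=True)       # centre the data
    evals, evecs = np.linalg.eigh(Xc @ Xc.T)     # eigh returns ascending order
    P = evecs[:, ::-1][:, :k]                    # first k eigenvectors, descending eigenvalue
    Y = P.T @ Xc                                 # manifold coordinates, k x m
    return P, Y

rng = np.random.default_rng(4)
X = rng.normal(size=(5, 200))
P, Y = pca(X, 2)
X_proj = P @ Y                                   # hyperplane projection P P^T x

assert P.shape == (5, 2) and Y.shape == (2, 200)
assert np.allclose(P.T @ P, np.eye(2))           # columns are orthonormal
```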

### Variance Partitioning

PCA can be interpreted as splitting the data into an 'in-manifold' signal and 'out-of-manifold' noise. In this view, we assume the noise can be separated into a few dimensions, whereas real measurement noise is likely to be isotropic across all components. There are different extensions of PCA that try to tackle this in different ways.

The first is to include isotropic Gaussian noise in the model, and to fit its scale alongside the principal directions. This is called probabilistic PCA (pPCA).
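In pPCA, the maximum-likelihood noise scale has a simple closed form: the average of the discarded eigenvalues of the sample covariance (Tipping & Bishop). A sketch on assumed toy data with known noise level:

```python
import numpy as np

rng = np.random.default_rng(5)
n, m, k = 8, 1000, 2
W = rng.normal(size=(n, k))                       # rank-k signal directions
X = W @ rng.normal(size=(k, m)) + 0.3 * rng.normal(size=(n, m))
X = X - X.mean(axis=1, keepdims=True)

# ML estimate of the isotropic noise variance in pPCA:
# average of the n - k smallest eigenvalues of the sample covariance S/m.
evals = np.linalg.eigvalsh(X @ X.T / m)[::-1]     # descending
sigma2 = evals[k:].mean()
# sigma2 should be close to the true noise variance 0.3**2 = 0.09
```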

Another alternative, called Factor Analysis, assumes independent Gaussian noise along each measured dimension. This is sensible, for example, when the measured quantities are in different units.