Principal Component Analysis (PCA)

Dimensionality reduction


About

Dimensionality reduction is a technique used in unsupervised learning that allows you to observe the “most important” dimensions of a set of data $X\in\R^{n\times D}$. More specifically, we want to transform a set of $n$ data points that each have $D$ features into a set of $n$ data points that each have $d$ features, where $d\ll D$. The goal is to reduce the dimension of the data points while preserving as much information from the original data as possible.

This can be helpful for many reasons:

  1. Computational efficiency

  2. Improved learning

    1. Some algorithms’ assumptions are better satisfied in low dimensions. Dimensionality reduction can also act as a regularizer.

  3. Pre-training

    1. We can learn the “good features” that should be used for supervised learning from very large amounts of unlabeled data.

  4. Visualization

    1. It is much easier to plot low dimensional data.

Objective

Let’s assume that each of the $n$ data points $x_i\in\R^D$ lies in a very high-dimensional space (i.e., $D$ is large). We want to find a compressed representation of $x_i$ using some function $f$ such that $f(x_i)\in\R^d$ where $d\ll D$. We also want $f(x_i)$ to retain the “information” in $x_i$.

Consider the estimated inverse of $f$, denoted $\hat{f^{-1}}$, that attempts to reconstruct $x_i$ from $f(x_i)$. Our goal is to choose the functions $f,\hat{f^{-1}}$ such that the following distance is minimized:

$$||x_i-\hat{f^{-1}}(f(x_i))||_2^2$$

This is known as the reconstruction error.
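
To make the objective concrete, here is a minimal sketch in NumPy, assuming a toy encoder `f` that simply keeps the first $d$ coordinates and a decoder `f_inv_hat` that pads the dropped ones with zeros. These are hypothetical choices used only for illustration; this is not yet PCA.

```python
import numpy as np

D, d = 10, 3
rng = np.random.default_rng(0)
x = rng.normal(size=D)

def f(x):
    """Compress R^D -> R^d by keeping only the first d coordinates (toy choice)."""
    return x[:d]

def f_inv_hat(z):
    """Reconstruct R^d -> R^D by padding the dropped coordinates with zeros."""
    return np.concatenate([z, np.zeros(D - d)])

# ||x - f_inv_hat(f(x))||_2^2, the reconstruction error for this x
reconstruction_error = np.sum((x - f_inv_hat(f(x))) ** 2)
print(reconstruction_error)
```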

Principal Component Analysis

Principal component analysis is one way we can do this dimensionality reduction and minimize the reconstruction loss.

Suppose we are given $n$ data points $x_1,...,x_n\in\R^D$, and $d\ll D$ is the target dimensionality we want to reduce to. In PCA, we find a linear transformation $f$ that maps each $x_i$ from $\R^D\rightarrow\R^d$. When choosing $f$, we want to preserve the structure of the data.

Specifically, we find an orthogonal projection using a matrix of orthonormal row vectors $Q\in\R^{d\times D}$, so our reduction function $f$ is $f(x)=Qx$. Since the rows of $Q$ are orthonormal, they form an orthonormal basis for a $d$-dimensional subspace of $\R^D$.

We can therefore define the estimated reconstruction (inverse) function $\hat{f^{-1}}$ directly using $Q^T$, so we have $\hat{f^{-1}}(f(x))=Q^Tf(x)=Q^TQx$. Note that this is an “estimated inverse” function since $\hat{f^{-1}}(f(x))=Q^TQx$ doesn’t always give back exactly $x$. Instead, it gives back an estimated “reconstruction” of $x$.
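
As a quick sketch of this projection/reconstruction pair, the snippet below builds an arbitrary $Q$ with orthonormal rows from a random matrix via a QR decomposition (a hypothetical stand-in; PCA chooses $Q$ more carefully below) and checks that $QQ^T=I_d$ while $Q^TQx$ is only an approximation of $x$.

```python
import numpy as np

D, d = 10, 3
rng = np.random.default_rng(0)
# Orthogonal DxD matrix from QR, then keep the first d rows -> d x D with orthonormal rows
Q = np.linalg.qr(rng.normal(size=(D, D)))[0][:d]

x = rng.normal(size=D)
z = Q @ x          # f(x) = Qx, the compressed representation in R^d
x_hat = Q.T @ z    # f_inv_hat(f(x)) = Q^T Q x, the reconstruction in R^D

print(np.allclose(Q @ Q.T, np.eye(d)))  # True: QQ^T = I_d
print(np.allclose(x_hat, x))            # False in general: only a reconstruction
```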

💡

Note: for PCA to work correctly, we assume that the mean of our data is 0 and that it has unit variance. We can achieve this by subtracting the mean and dividing by the standard deviation across each dimension of $X$. The intuition is that we want to make sure no dimension is more “important” than another simply because of its scale.
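
A minimal sketch of this standardization step on a hypothetical data matrix `X`:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(100, 10))   # n = 100 points, D = 10 features

# Subtract the per-dimension mean and divide by the per-dimension standard deviation
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

print(np.allclose(X_std.mean(axis=0), 0.0))  # every dimension now has mean 0
print(np.allclose(X_std.std(axis=0), 1.0))   # and standard deviation 1
```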

PCA as Minimizing the Reconstruction Loss

Recall that the reconstruction is given by $Q^TQx_i$, where $Q\in\R^{d\times D}$ is composed of orthonormal row vectors $\{q_1,...,q_d\}$ and $x_i\in\R^D$. Since $Q$ is composed of orthonormal rows, we know that $QQ^T=I_d$ (the identity matrix of dimension $d$).

The reconstruction error for the $i$th point is then:

$$||x_i-Q^TQx_i||^2_2$$

We can rewrite $Q^TQx_i$ as $\sum_{j=1}^d(x_i^Tq_j)q_j$ and rewrite the above as:

$$||\sum_{j=1}^d(x_i^Tq_j)q_j-x_i||_2^2$$
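
As a quick numerical sanity check of the identity $Q^TQx_i=\sum_{j=1}^d(x_i^Tq_j)q_j$, using a hypothetical $Q$ with orthonormal rows:

```python
import numpy as np

D, d = 8, 3
rng = np.random.default_rng(0)
Q = np.linalg.qr(rng.normal(size=(D, D)))[0][:d]   # rows are q_1, ..., q_d
x_i = rng.normal(size=D)

lhs = Q.T @ (Q @ x_i)                              # Q^T Q x_i
rhs = sum((x_i @ q_j) * q_j for q_j in Q)          # sum_j (x_i^T q_j) q_j

print(np.allclose(lhs, rhs))  # True
```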

Until now we have considered $d\ll D$. But what if $d=D$? Then for some matrix $Q^*$ with orthonormal rows, $Q^*\in\R^{d\times D}\implies Q^*\in\R^{D\times D}$. Since $Q^*$ is square and composed of orthonormal vectors, $Q^*$ is now an orthogonal matrix and ${Q^*}^TQ^*=I_D$ (the identity matrix of dimension $D$). Therefore we know:

$${Q^*}^TQ^*x_i=\sum_{j=1}^D(x_i^Tq_j)q_j=x_i$$

for any complete orthonormal basis $\{q_1,...,q_D\}$ of $\R^D$. In particular, we can extend the $d$ orthonormal rows of $Q$ to such a basis, so that $q_1,...,q_d$ are exactly the rows of $Q$.

So we have $x_i=\sum_{j=1}^D(x_i^Tq_j)q_j$. Plugging this back into the reconstruction error for $x_i$, we get:

$$||x_i-Q^TQx_i||^2_2=||\sum^D_{j=1}(x_i^Tq_j)q_j-\sum_{j=1}^d(x_i^Tq_j)q_j||^2_2=||\sum^D_{j=d+1}(x_i^Tq_j)q_j||^2_2$$

Since we are trying to minimize this error averaged over all data points, we can formally write the objective as:

$$\argmin_{q_{d+1},...,q_D,\ ||q_j||=1}\frac{1}{n}\sum_{i=1}^n ||\sum^D_{j=d+1}(x_i^Tq_j)q_j||^2_2$$

Using some algebraic manipulation, we can simplify this optimization to:

$$\argmin\frac1n\sum_{i=1}^n\sum_{j=d+1}^D(x_i^Tq_j)^2=\argmin\sum_{j=d+1}^Dq_j^T\frac{X^TX}nq_j$$
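
One way to fill in that manipulation: by the orthonormality of the $q_j$, the squared norm of the residual splits into a sum of squared coefficients, and stacking the $x_i^T$ as the rows of $X$ turns the average over the data into a quadratic form:

$$||\sum_{j=d+1}^D(x_i^Tq_j)q_j||_2^2=\sum_{j=d+1}^D(x_i^Tq_j)^2,\qquad \frac1n\sum_{i=1}^n(x_i^Tq_j)^2=\frac1n||Xq_j||_2^2=q_j^T\frac{X^TX}nq_j$$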

Minimizing this is equivalent to maximizing over the first $d$ vectors, because $\sum_{j=1}^Dq_j^T\frac{X^TX}nq_j$ is fixed (it equals the trace of $\frac{X^TX}n$) for any orthonormal basis:

$$\argmax_{q_1,...,q_d,\ ||q_j||=1}\sum_{j=1}^dq_j^T\frac{X^TX}nq_j$$

So how do we solve this? Recall that $\frac{X^TX}n\in\R^{D\times D}$ and $(\frac{X^TX}n)q_i$ results in a vector in $\R^D$.

To maximize $q_i^T\frac{X^TX}nq_i=q_i\cdot(\frac{X^TX}nq_i)$, we know $(\frac{X^TX}n)q_i$ must point in the direction of $q_i$. This is because, for an arbitrary vector $v$ of fixed length, the dot product $q_i\cdot v$ is maximized when $v=kq_i$ for some scalar $k>0$, by properties of the dot product. Therefore, we want $q_i$ to be a vector such that $(\frac{X^TX}n)q_i=kq_i$ for some scalar $k$.

In order for $(\frac{X^TX}n)q_i=kq_i$ to hold, $q_i$ must be an eigenvector of $\frac{X^TX}n$. Therefore, we will get $\frac{X^TX}nq_i=\lambda q_i$, where $\lambda$ is an eigenvalue of $\frac{X^TX}n$ and $q_i$ is an eigenvector of $\frac{X^TX}n$.

Now, which eigenvector/eigenvalue pair do we choose?

Well, since $(\frac{X^TX}n)q_i=\lambda q_i$, we have $q_i^T(\frac{X^TX}n)q_i=q_i^T\lambda q_i=\lambda||q_i||^2_2$. Since we are maximizing over the space of $q_i$ where $||q_i||^2_2=1$, we know that $\lambda||q_i||^2_2=\lambda\cdot1=\lambda$.
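
The snippet below sketches this on hypothetical random data: the quadratic form evaluated at the eigenvector with the largest eigenvalue equals that eigenvalue, and it is at least as large as the value at any randomly drawn unit vector.

```python
import numpy as np

rng = np.random.default_rng(0)
n, D = 200, 5
X = rng.normal(size=(n, D))
S = X.T @ X / n

eigvals, eigvecs = np.linalg.eigh(S)   # eigenvalues in ascending order
q_top = eigvecs[:, -1]                 # unit eigenvector of the largest eigenvalue

best = q_top @ S @ q_top               # q^T S q at the top eigenvector
random_vals = []
for _ in range(1000):
    v = rng.normal(size=D)
    v /= np.linalg.norm(v)             # restrict to unit vectors
    random_vals.append(v @ S @ v)

print(np.isclose(best, eigvals[-1]))   # True: q^T S q = lambda when ||q|| = 1
print(best >= max(random_vals))        # True: no random unit vector does better
```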

Now the choice of eigenvector/eigenvalue pairs is easy: we simply choose the $d$ largest eigenvalues $(\lambda)$ and their corresponding eigenvectors! Note that the largest eigenvalues of a matrix are known as the principal eigenvalues.

This means that if we take $q_1,...,q_d$ as the $d$ eigenvectors corresponding to the $d$ largest eigenvalues of $\frac{X^TX}n$, we can construct a matrix

$$Q=\begin{bmatrix}q_1\\\vdots\\q_d\end{bmatrix}\in\R^{d\times D}$$

and use $Q$ to project $X\in\R^{n\times D}\rightarrow X'\in\R^{n\times d}$ using the transformation $X'=XQ^T$.
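
A sketch of this construction on hypothetical (already standardized) data, using NumPy's eigen-decomposition routine (how these eigenvectors are computed is discussed in the next section):

```python
import numpy as np

rng = np.random.default_rng(0)
n, D, d = 500, 20, 3
X = rng.normal(size=(n, D))

S = X.T @ X / n
eigvals, eigvecs = np.linalg.eigh(S)     # ascending eigenvalues; columns are eigenvectors
Q = eigvecs[:, ::-1][:, :d].T            # d x D: top-d eigenvectors as rows

X_reduced = X @ Q.T                      # X' = X Q^T, shape n x d
print(X_reduced.shape)                   # (500, 3)
```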

Computing the eigenvalues

Now one question remains: how do we compute the eigenvalues/eigenvectors of $\frac{X^TX}n$?

Well, we can simply compute the eigen-decomposition (recall the procedure of solving the characteristic equation $\det(\hat{\Sigma}-\lambda I)=0$).
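
As a small illustration on a hypothetical $2\times2$ matrix:

$$\hat{\Sigma}=\begin{bmatrix}2&1\\1&2\end{bmatrix},\qquad \det(\hat{\Sigma}-\lambda I)=(2-\lambda)^2-1=\lambda^2-4\lambda+3=0\implies\lambda=3,\ 1$$

with corresponding unit eigenvectors $\frac{1}{\sqrt2}(1,1)^T$ and $\frac{1}{\sqrt2}(1,-1)^T$.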

While this works, it is a bit slow. Computing $X^TX$ for the covariance matrix takes $O(nD^2)$ time, followed by $O(D^3)$ time to compute the eigen-decomposition.

A faster method is to use the singular value decomposition (SVD) of $X$. This is given by:

$$X=U\Sigma V^T$$

I won’t get into the details of this here, but just know that computing only the top $d$ components this way runs in a faster $O(nDd)$ time.
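
A sketch of the SVD route on hypothetical standardized data: the rows of $V^T$ are eigenvectors of $X^TX$, so the top $d$ of them give the same projection as the eigen-decomposition route, and the squared singular values recover the eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(0)
n, D, d = 500, 20, 3
X = rng.normal(size=(n, D))

U, s, Vt = np.linalg.svd(X, full_matrices=False)   # X = U Sigma V^T, s is descending
Q = Vt[:d]                                         # d x D: principal directions as rows
X_reduced = X @ Q.T                                # n x d

# Eigenvalues of X^T X / n are the squared singular values divided by n
eigvals = np.sort(np.linalg.eigvalsh(X.T @ X / n))[::-1]
print(np.allclose(s[:d] ** 2 / n, eigvals[:d]))    # True
```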

Algorithm

Formally, our algorithm to reduce $n$ points of $D$-dimensional data to $n$ points of $d$-dimensional data ($d\ll D$) is:

  1. Let $X\in\R^{n\times D}$ be our dataset.

  2. Normalize $X$ so that each dimension has mean 0 and standard deviation 1.

  3. Compute the matrix $\frac{X^TX}{n-1}\in\R^{D\times D}$.

    1. Conventionally $\frac{X^TX}{n-1}$ is used instead of $\frac{X^TX}n$, as this comes from the maximizing-variance formulation of PCA. However, both are equivalent for our purposes, as this only rescales the eigenvalues; the eigenvectors are unchanged.

  4. Find the $d$ principal eigenvectors (the eigenvectors corresponding to the largest eigenvalues), denoted $\{q_1,...,q_d\}$, of $\frac{X^TX}{n-1}$ using either the eigen-decomposition or the SVD.

  5. Create a matrix from the eigenvectors: $Q=\begin{bmatrix}q_1\\\vdots\\q_d\end{bmatrix}\in\R^{d\times D}$.

  6. Create the dimensionality-reduced dataset as $X'=XQ^T$ (a code sketch of these steps follows below).
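
Putting the six steps together, here is a minimal end-to-end sketch on a hypothetical dataset; an illustration of the procedure rather than a reference implementation.

```python
import numpy as np

def pca(X, d):
    """Reduce X (n x D) to an n x d representation using the top-d principal eigenvectors."""
    n, D = X.shape
    # Step 2: standardize each dimension to mean 0 and standard deviation 1
    X = (X - X.mean(axis=0)) / X.std(axis=0)
    # Step 3: the (sample) covariance matrix, D x D
    S = X.T @ X / (n - 1)
    # Step 4: eigen-decomposition; eigh returns eigenvalues in ascending order
    eigvals, eigvecs = np.linalg.eigh(S)
    # Step 5: Q whose rows are the d eigenvectors with the largest eigenvalues
    Q = eigvecs[:, ::-1][:, :d].T        # d x D
    # Step 6: project to the dimensionality-reduced dataset X' = X Q^T
    return X @ Q.T                        # n x d

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))
print(pca(X, d=5).shape)                  # (200, 5)
```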
Thanks for reading! If you have any questions or notice something wrong, please email me at varchi [at] seas [dot] upenn [dot] edu.