Principal Component Analysis (PCA)

Dimensionality reduction


About

Dimensionality reduction is a technique used in unsupervised learning that allows you to observe the “most important” dimensions of a set of data $X\in\R^{n\times D}$. More specifically, we want to transform a set of $n$ data points that each have $D$ features into a set of $n$ data points that each have $d$ features, where $d\ll D$. The goal is to reduce the dimension of the data points while preserving as much information from the original data as possible.

This can be helpful for many reasons:

  1. Computational efficiency

  2. Improved learning

    1. Some algorithms’ assumptions are better satisfied in low dimensions. Dimensionality reduction can also act as a regularizer.

  3. Pre-training

    1. We can learn the “good features” that should be used for supervised learning from very large amounts of unlabeled data.

  4. Visualization

    1. It is much easier to plot low dimensional data.

Objective

Let’s assume that each of the $n$ data points $x_i\in\R^D$ lies in a very high-dimensional space (i.e., $D$ is large). We want to find a compressed representation of $x_i$ using some function $f$ such that $f(x_i)\in\R^d$ where $d\ll D$. We also want $f(x_i)$ to retain the “information” in $x_i$.

Consider the estimated inverse of $f$, denoted $\hat{f^{-1}}$, that attempts to reconstruct $x_i$ from $f(x_i)$. Our goal is to choose the functions $f,\hat{f^{-1}}$ such that the following distance is minimized:

$$||x_i-\hat{f^{-1}}(f(x_i))||_2^2$$

This is known as the reconstruction error.
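
To make the objective concrete, here is a minimal sketch in NumPy, assuming a toy encoder `f` that simply keeps the first $d$ coordinates and a decoder `f_inv_hat` that pads the dropped ones with zeros. These are hypothetical choices used only for illustration; this is not yet PCA.

```python
import numpy as np

D, d = 10, 3
rng = np.random.default_rng(0)
x = rng.normal(size=D)

def f(x):
    """Compress R^D -> R^d by keeping only the first d coordinates (toy choice)."""
    return x[:d]

def f_inv_hat(z):
    """Reconstruct R^d -> R^D by padding the dropped coordinates with zeros."""
    return np.concatenate([z, np.zeros(D - d)])

# ||x - f_inv_hat(f(x))||_2^2, the reconstruction error for this x
reconstruction_error = np.sum((x - f_inv_hat(f(x))) ** 2)
print(reconstruction_error)
```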

Principal Component Analysis

Principal component analysis is one way we can do this dimensionality reduction and minimize the reconstruction loss.

Suppose we are given $n$ data points $x_1,...,x_n\in\R^D$, and $d\ll D$ is the target dimensionality we want to reduce to. In PCA, we find a linear transformation $f$ that maps each $x_i$ from $\R^D\rightarrow\R^d$. When choosing $f$, we want to preserve the structure of the data.

Specifically, we find an orthogonal projection using a matrix of orthonormal row vectors $Q\in\R^{d\times D}$, so our reduction function $f$ is $f(x)=Qx$. Since the rows of $Q$ are orthonormal, they form an orthonormal basis for a $d$-dimensional subspace of $\R^D$.

We can therefore define the estimated reconstruction (inverse) function $\hat{f^{-1}}$ directly using $Q^T$, so we have $\hat{f^{-1}}(f(x))=Q^Tf(x)=Q^TQx$. Note that this is an “estimated inverse” function since $\hat{f^{-1}}(f(x))=Q^TQx$ doesn’t always give back exactly $x$. Instead, it gives back an estimated “reconstruction” of $x$.
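
As a quick sketch of this projection/reconstruction pair, the snippet below builds an arbitrary $Q$ with orthonormal rows from a random matrix via a QR decomposition (a hypothetical stand-in; PCA chooses $Q$ more carefully below) and checks that $QQ^T=I_d$ while $Q^TQx$ is only an approximation of $x$.

```python
import numpy as np

D, d = 10, 3
rng = np.random.default_rng(0)
# Orthogonal DxD matrix from QR, then keep the first d rows -> d x D with orthonormal rows
Q = np.linalg.qr(rng.normal(size=(D, D)))[0][:d]

x = rng.normal(size=D)
z = Q @ x          # f(x) = Qx, the compressed representation in R^d
x_hat = Q.T @ z    # f_inv_hat(f(x)) = Q^T Q x, the reconstruction in R^D

print(np.allclose(Q @ Q.T, np.eye(d)))  # True: QQ^T = I_d
print(np.allclose(x_hat, x))            # False in general: only a reconstruction
```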

💡

Note: for PCA to work correctly, we assume that the mean of our data is 0 and that it has unit variance. We can achieve this by subtracting the mean and dividing by the standard deviation across each dimension of $X$. The intuition is that we want to make sure no dimension is more “important” than another simply because of its scale.
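
A minimal sketch of this standardization step on a hypothetical data matrix `X`:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(100, 10))   # n = 100 points, D = 10 features

# Subtract the per-dimension mean and divide by the per-dimension standard deviation
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

print(np.allclose(X_std.mean(axis=0), 0.0))  # every dimension now has mean 0
print(np.allclose(X_std.std(axis=0), 1.0))   # and standard deviation 1
```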

PCA as Minimizing the Reconstruction Loss

Recall that the reconstruction is given by $Q^TQx_i$, where $Q\in\R^{d\times D}$ is composed of orthonormal row vectors $\{q_1,...,q_d\}$ and $x_i\in\R^D$. Since $Q$ is composed of orthonormal rows, we know that $QQ^T=I_d$ (the identity matrix of dimension $d$).

The reconstruction error for the $i$th point is then:

$$||x_i-Q^TQx_i||^2_2$$

We can rewrite $Q^TQx_i$ as $\sum_{j=1}^d(x_i^Tq_j)q_j$ and rewrite the above as:

$$||\sum_{j=1}^d(x_i^Tq_j)q_j-x_i||_2^2$$
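
As a quick numerical sanity check of the identity $Q^TQx_i=\sum_{j=1}^d(x_i^Tq_j)q_j$, using a hypothetical $Q$ with orthonormal rows:

```python
import numpy as np

D, d = 8, 3
rng = np.random.default_rng(0)
Q = np.linalg.qr(rng.normal(size=(D, D)))[0][:d]   # rows are q_1, ..., q_d
x_i = rng.normal(size=D)

lhs = Q.T @ (Q @ x_i)                              # Q^T Q x_i
rhs = sum((x_i @ q_j) * q_j for q_j in Q)          # sum_j (x_i^T q_j) q_j

print(np.allclose(lhs, rhs))  # True
```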

Until now we have considered $d\ll D$. But what if $d=D$? Then for some matrix $Q^*$ with orthonormal rows, $Q^*\in\R^{d\times D}\implies Q^*\in\R^{D\times D}$. Since $Q^*$ is square and composed of orthonormal vectors, $Q^*$ is now an orthogonal matrix and ${Q^*}^TQ^*=I_D$ (the identity matrix of dimension $D$). Therefore we know:

$${Q^*}^TQ^*x_i=\sum_{j=1}^D(x_i^Tq_j)q_j=x_i$$

for any complete orthonormal basis $\{q_1,...,q_D\}$ of $\R^D$. In particular, we can extend the $d$ orthonormal rows of $Q$ to such a basis, so that $q_1,...,q_d$ are exactly the rows of $Q$.

So we have $x_i=\sum_{j=1}^D(x_i^Tq_j)q_j$. Plugging this back into the reconstruction error for $x_i$, we get:

$$||x_i-Q^TQx_i||^2_2=||\sum^D_{j=1}(x_i^Tq_j)q_j-\sum_{j=1}^d(x_i^Tq_j)q_j||^2_2=||\sum^D_{j=d+1}(x_i^Tq_j)q_j||^2_2$$

Since we are trying to minimize this error averaged over all data points, we can formally write the objective as:

$$\argmin_{q_{d+1},...,q_D,\ ||q_j||=1}\frac{1}{n}\sum_{i=1}^n ||\sum^D_{j=d+1}(x_i^Tq_j)q_j||^2_2$$

Using some algebraic manipulation, we can simplify this optimization to:

$$\argmin\frac1n\sum_{i=1}^n\sum_{j=d+1}^D(x_i^Tq_j)^2=\argmin\sum_{j=d+1}^Dq_j^T\frac{X^TX}nq_j$$
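
One way to fill in that manipulation: by the orthonormality of the $q_j$, the squared norm of the residual splits into a sum of squared coefficients, and stacking the $x_i^T$ as the rows of $X$ turns the average over the data into a quadratic form:

$$||\sum_{j=d+1}^D(x_i^Tq_j)q_j||_2^2=\sum_{j=d+1}^D(x_i^Tq_j)^2,\qquad \frac1n\sum_{i=1}^n(x_i^Tq_j)^2=\frac1n||Xq_j||_2^2=q_j^T\frac{X^TX}nq_j$$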

Minimizing this is equivalent to maximizing over the first $d$ vectors, because $\sum_{j=1}^Dq_j^T\frac{X^TX}nq_j$ is fixed (it equals the trace of $\frac{X^TX}n$) for any orthonormal basis:

$$\argmax_{q_1,...,q_d,\ ||q_j||=1}\sum_{j=1}^dq_j^T\frac{X^TX}nq_j$$

So how do we solve this? Recall that $\frac{X^TX}n\in\R^{D\times D}$ and $(\frac{X^TX}n)q_i$ results in a vector in $\R^D$.

To maximize $q_i^T\frac{X^TX}nq_i=q_i\cdot(\frac{X^TX}nq_i)$, we know $(\frac{X^TX}n)q_i$ must point in the direction of $q_i$. This is because, for an arbitrary vector $v$ of fixed length, the dot product $q_i\cdot v$ is maximized when $v=kq_i$ for some scalar $k>0$, by properties of the dot product. Therefore, we want $q_i$ to be a vector such that $(\frac{X^TX}n)q_i=kq_i$ for some scalar $k$.

In order for $(\frac{X^TX}n)q_i=kq_i$ to hold, $q_i$ must be an eigenvector of $\frac{X^TX}n$. Therefore, we will get $\frac{X^TX}nq_i=\lambda q_i$, where $\lambda$ is an eigenvalue of $\frac{X^TX}n$ and $q_i$ is an eigenvector of $\frac{X^TX}n$.

Now, which eigenvector/eigenvalue pair do we choose?

Well, since $(\frac{X^TX}n)q_i=\lambda q_i$, we have $q_i^T(\frac{X^TX}n)q_i=q_i^T\lambda q_i=\lambda||q_i||^2_2$. Since we are maximizing over the space of $q_i$ where $||q_i||^2_2=1$, we know that $\lambda||q_i||^2_2=\lambda\cdot1=\lambda$.
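
The snippet below sketches this on hypothetical random data: the quadratic form evaluated at the eigenvector with the largest eigenvalue equals that eigenvalue, and it is at least as large as the value at any randomly drawn unit vector.

```python
import numpy as np

rng = np.random.default_rng(0)
n, D = 200, 5
X = rng.normal(size=(n, D))
S = X.T @ X / n

eigvals, eigvecs = np.linalg.eigh(S)   # eigenvalues in ascending order
q_top = eigvecs[:, -1]                 # unit eigenvector of the largest eigenvalue

best = q_top @ S @ q_top               # q^T S q at the top eigenvector
random_vals = []
for _ in range(1000):
    v = rng.normal(size=D)
    v /= np.linalg.norm(v)             # restrict to unit vectors
    random_vals.append(v @ S @ v)

print(np.isclose(best, eigvals[-1]))   # True: q^T S q = lambda when ||q|| = 1
print(best >= max(random_vals))        # True: no random unit vector does better
```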

Now the choice of eigenvector/eigenvalue pairs is easy: we simply choose the $d$ largest eigenvalues $(\lambda)$ and their corresponding eigenvectors! Note that the largest eigenvalues of a matrix are known as the principal eigenvalues.

This means that if we take $q_1,...,q_d$ as the $d$ eigenvectors corresponding to the $d$ largest eigenvalues of $\frac{X^TX}n$, we can construct a matrix

$$Q=\begin{bmatrix}q_1\\\vdots\\q_d\end{bmatrix}\in\R^{d\times D}$$

and use $Q$ to project $X\in\R^{n\times D}\rightarrow X'\in\R^{n\times d}$ using the transformation $X'=XQ^T$.
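
A sketch of this construction on hypothetical (already standardized) data, using NumPy's eigen-decomposition routine (how these eigenvectors are computed is discussed in the next section):

```python
import numpy as np

rng = np.random.default_rng(0)
n, D, d = 500, 20, 3
X = rng.normal(size=(n, D))

S = X.T @ X / n
eigvals, eigvecs = np.linalg.eigh(S)     # ascending eigenvalues; columns are eigenvectors
Q = eigvecs[:, ::-1][:, :d].T            # d x D: top-d eigenvectors as rows

X_reduced = X @ Q.T                      # X' = X Q^T, shape n x d
print(X_reduced.shape)                   # (500, 3)
```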

Computing the eigenvalues

Now one question remains: how do we compute the eigenvalues/eigenvectors of $\frac{X^TX}n$?

Well, we can simply compute the eigen-decomposition (recall the procedure of solving the characteristic equation $\det(\hat{\Sigma}-\lambda I)=0$).
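
As a small illustration on a hypothetical $2\times2$ matrix:

$$\hat{\Sigma}=\begin{bmatrix}2&1\\1&2\end{bmatrix},\qquad \det(\hat{\Sigma}-\lambda I)=(2-\lambda)^2-1=\lambda^2-4\lambda+3=0\implies\lambda=3,\ 1$$

with corresponding unit eigenvectors $\frac{1}{\sqrt2}(1,1)^T$ and $\frac{1}{\sqrt2}(1,-1)^T$.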

While this works, it is a bit slow. Computing $X^TX$ for the covariance matrix takes $O(nD^2)$ time, followed by $O(D^3)$ time to compute the eigen-decomposition.

A faster method is to use the singular value decomposition (SVD) of $X$. This is given by:

$$X=U\Sigma V^T$$

I won’t get into the details of this here, but just know that computing only the top $d$ components this way runs in a faster $O(nDd)$ time.
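
A sketch of the SVD route on hypothetical standardized data: the rows of $V^T$ are eigenvectors of $X^TX$, so the top $d$ of them give the same projection as the eigen-decomposition route, and the squared singular values recover the eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(0)
n, D, d = 500, 20, 3
X = rng.normal(size=(n, D))

U, s, Vt = np.linalg.svd(X, full_matrices=False)   # X = U Sigma V^T, s is descending
Q = Vt[:d]                                         # d x D: principal directions as rows
X_reduced = X @ Q.T                                # n x d

# Eigenvalues of X^T X / n are the squared singular values divided by n
eigvals = np.sort(np.linalg.eigvalsh(X.T @ X / n))[::-1]
print(np.allclose(s[:d] ** 2 / n, eigvals[:d]))    # True
```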

Algorithm

Formally, our algorithm to reduce $n$ points of $D$-dimensional data to $n$ points of $d$-dimensional data ($d\ll D$) is:

  1. Let $X\in\R^{n\times D}$ be our dataset.

  2. Normalize $X$ so that each dimension has mean 0 and standard deviation 1.

  3. Compute the matrix $\frac{X^TX}{n-1}\in\R^{D\times D}$.

    1. Conventionally $\frac{X^TX}{n-1}$ is used instead of $\frac{X^TX}n$, as this comes from the maximizing-variance formulation of PCA. However, both are equivalent for our purposes, as this only rescales the eigenvalues; the eigenvectors are unchanged.

  4. Find the $d$ principal eigenvectors (the eigenvectors corresponding to the largest eigenvalues), denoted $\{q_1,...,q_d\}$, of $\frac{X^TX}{n-1}$ using either the eigen-decomposition or the SVD.

  5. Create a matrix from the eigenvectors: $Q=\begin{bmatrix}q_1\\\vdots\\q_d\end{bmatrix}\in\R^{d\times D}$.

  6. Create the dimensionality-reduced dataset as $X'=XQ^T$ (a code sketch of these steps follows below).
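
Putting the six steps together, here is a minimal end-to-end sketch on a hypothetical dataset; an illustration of the procedure rather than a reference implementation.

```python
import numpy as np

def pca(X, d):
    """Reduce X (n x D) to an n x d representation using the top-d principal eigenvectors."""
    n, D = X.shape
    # Step 2: standardize each dimension to mean 0 and standard deviation 1
    X = (X - X.mean(axis=0)) / X.std(axis=0)
    # Step 3: the (sample) covariance matrix, D x D
    S = X.T @ X / (n - 1)
    # Step 4: eigen-decomposition; eigh returns eigenvalues in ascending order
    eigvals, eigvecs = np.linalg.eigh(S)
    # Step 5: Q whose rows are the d eigenvectors with the largest eigenvalues
    Q = eigvecs[:, ::-1][:, :d].T        # d x D
    # Step 6: project to the dimensionality-reduced dataset X' = X Q^T
    return X @ Q.T                        # n x d

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))
print(pca(X, d=5).shape)                  # (200, 5)
```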
Thanks for reading! If you have any questions or notice something wrong, please email me at varchi [at] seas [dot] upenn [dot] edu.