Intro

One of the goals of principal component analysis (PCA) is to reduce the original data set into a smaller set of uncorrelated linear combinations of our independent variables. PCA is particularly useful in data sets where multicollinearity exists in a multiple linear regression setting. Addressing multicollinearity helps improve the reliability of the regression model, making it a more valuable tool for understanding relationships between variables and making predictions (Alin 2010).

In this article, we shall derive the principal components mathematically and demonstrate this method with the wine data set from UCI’s machine learning lab (Aeberhard and Forina 1991).

Derivation

Let \(X = \left[ X_1, X_2, \dots, X_p \right]'\) have covariance matrix \(\Sigma\), with eigenvalues \(\lambda_1 \geq \lambda_2 \geq \dots \geq \lambda_p \geq 0\) and orthonormal eigenvectors \(e_1, \dots, e_p\). Consider the linear combinations

\[ Y_1 = a_1'X = a_{11}X_1 + a_{12}X_2 + \dots + a_{1p}X_p \]

\[ Y_2 = a_2'X = a_{21}X_1 + a_{22}X_2 + \dots + a_{2p}X_p \]

\[ \vdots \]

\[ Y_p = a_p'X = a_{p1}X_1 + a_{p2}X_2 + \dots + a_{pp}X_p \]

Then, using properties of the variance operator we obtain

\[ \text{Var}(Y_i) = a_i'\Sigma a_i, \quad i = 1,2,\dots,p \]

\[ \text{Cov}(Y_i,Y_k) = a_i'\Sigma a_k, \quad i,k = 1,2,\dots,p \]
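These identities are easy to verify numerically. Below is a minimal sketch with simulated data; the covariance matrix Sigma, the coefficient vectors a1 and a2, and the sample size are arbitrary choices for illustration, and MASS::mvrnorm (from the MASS package bundled with R) is assumed to be available for sampling.

# Sketch: numerically check Var(a'X) = a' Sigma a and Cov(a1'X, a2'X) = a1' Sigma a2
set.seed(1)
Sigma <- matrix(c(4, 2, 1,
                  2, 3, 1,
                  1, 1, 2), nrow = 3)            # an arbitrary covariance matrix
X <- MASS::mvrnorm(n = 1e5, mu = rep(0, 3), Sigma = Sigma)
a1 <- c(1, -1, 2); a2 <- c(0, 1, 1)              # arbitrary coefficient vectors
c(empirical = var(X %*% a1), theoretical = t(a1) %*% Sigma %*% a1)
c(empirical = cov(X %*% a1, X %*% a2), theoretical = t(a1) %*% Sigma %*% a2)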

The principal components are those uncorrelated linear combinations \(Y_1, Y_2, \dots, Y_p\) whose variances are as large as possible. The first principal component is the linear combination that maximizes \(\text{Var}(Y_1) = a_1'\Sigma a_1\), and so forth for the remaining principal components. We shall restrict attention to coefficient vectors of unit length in order to avoid indeterminacy with scaling since \(a_i\) could be scaled by any number.

As such, we shall maximize the following form:

\[ \underset{a_i \neq 0}{\max}\frac{a_i'\Sigma a_i}{a_i'a_i} \]

The maximization of quadratic forms lemma states: let \(A_{p \times p}\) be a symmetric, positive definite matrix with eigenvalues \(\lambda_1 \geq \lambda_2 \geq \dots \geq \lambda_p > 0\) and orthonormal eigenvectors \(e_1, \dots, e_p\). Then

\[ \underset{x\neq 0}{\max} \ \frac{x'Ax}{x'x} = \lambda_1 \ \text{(attained at } x = e_1\text{)} \]

and

\[ \underset{x\neq 0}{\min} \ \frac{x'Ax}{x'x} = \lambda_p \ \text{(attained at } x = e_p\text{)} \]

and

\[ \underset{x \perp e_1, e_2, \dots, e_k}{\max} \ \frac{x'Ax}{x'x} = \lambda_{k+1} \ \text{(attained at } x = e_{k+1}\text{)} \ \text{for} \ k=1,\dots,p-1 \]

Proof:

Since \(A\) is symmetric, the spectral theorem lets us decompose it into a form whose entries are the eigenvalues and eigenvectors. Specifically,

\[ A = P \Lambda P' \]

where \(P\) is the square \(p \times p\) matrix whose \(i\)th column is the eigenvector \(e_i\) of \(A\), and \(\Lambda\) is the diagonal matrix whose elements are the corresponding eigenvalues.
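As a quick numerical illustration, the sketch below applies eigen() to an arbitrary symmetric, positive definite matrix (not any particular \(\Sigma\)) and confirms that \(P \Lambda P'\) reproduces the matrix and that \(P'P = I\).

# Sketch: verify the spectral decomposition A = P Lambda P' for an arbitrary
# symmetric, positive definite matrix
A <- matrix(c(5, 2, 1,
              2, 4, 1,
              1, 1, 3), nrow = 3)
decomp <- eigen(A)
P <- decomp$vectors                  # columns are the orthonormal eigenvectors e_i
Lambda <- diag(decomp$values)        # diagonal matrix of eigenvalues
all.equal(A, P %*% Lambda %*% t(P))  # TRUE up to numerical precision
all.equal(t(P) %*% P, diag(3))       # P'P = I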

Let \(y=P'x\), so that \(x=Py\). Then

\[ \frac{x'Ax}{x'x} = \frac{x'P \Lambda P'x}{x'x} = \frac{y'P'P \Lambda P'Py}{y'P'Py} = \frac{y' \Lambda y}{y'y} \]

since \(P'P=I\). We first show that the maximum of this ratio, the variance of the first principal component, is \(\lambda_1\):

\[ \frac{y' \Lambda y}{y'y} = \frac{\sum_{i=1}^p \lambda_iy_i^2}{\sum_{i=1}^p y_i^2} \leq \frac{\sum_{i=1}^p \lambda_1y_i^2}{\sum_{i=1}^p y_i^2} = \lambda_1 \]

because \(\lambda_1 \geq \lambda_2 \geq \dots \geq \lambda_p > 0\). The maximum is attained at \(x=e_1\), since

\[ \frac{x'Ax}{x'x} = \frac{e_1'Ae_1}{e_1'e_1} = \frac{e_1'\lambda_1e_1}{e_1'e_1} = \lambda_1 \]

by the definition of eigenvalues and eigenvectors, \(Ae_i = \lambda_ie_i\). If we let \(A = \Sigma\) and \(a_1 = e_1\), and recall that our orthonormal eigenvectors satisfy \(e_1'e_1 = 1\), we get the following result

\[ \underset{a_1'a_1 = 1}{\max} \ a_1'\Sigma a_1 = \underset{a_1 \neq 0}{\max} \ \frac{a_1'\Sigma a_1}{a_1'a_1} = \lambda_1 \]

So the maximum variance is \(\lambda_1\), achieved with coefficients given by the eigenvector \(e_1\). The proof for the minimum follows the same pattern, with \(\lambda_1\) and \(e_1\) replaced by \(\lambda_p\) and \(e_p\).
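The result is also easy to check numerically: random vectors never produce a Rayleigh quotient above \(\lambda_1\), and the bound is attained at \(e_1\). A minimal sketch, reusing the same arbitrary matrix as in the previous chunk:

# Sketch: the Rayleigh quotient x'Ax / x'x of a random vector never exceeds lambda_1,
# and equals lambda_1 at x = e_1
set.seed(2)
A <- matrix(c(5, 2, 1,
              2, 4, 1,
              1, 1, 3), nrow = 3)
decomp <- eigen(A)
rayleigh <- function(x) drop(t(x) %*% A %*% x / (t(x) %*% x))
ratios <- replicate(1e4, rayleigh(rnorm(3)))
c(largest_random = max(ratios), lambda_1 = decomp$values[1])
rayleigh(decomp$vectors[, 1])        # the maximum is attained exactly at e_1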

Now let \(x \perp e_1, e_2, \dots, e_k\), so that \(e_i'x = 0\) for \(i=1,\dots,k\). Using \(x=Py\) we get

\[ 0 = e_i'x = e_i'Py = e_i' \left[ e_1 \, | \, e_2 \, | \dots | \, e_p \right] y = [0,\dots,0,1,0,\dots,0]\,[y_1,\dots,y_p]' = y_i \]

where the \(1\) sits in the \(i\)th position.

That is, \(y_i = 0\) for \(i=1,\dots,k\). So

\[ \frac{x'Ax}{x'x} = \frac{\sum_{i=1}^{p}\lambda_iy_i^2}{\sum_{i=1}^{p}y_i^2} = \frac{\sum_{i=k+1}^{p}\lambda_iy_i^2}{\sum_{i=k+1}^{p}y_i^2} \leq \frac{\sum_{i=k+1}^{p}\lambda_{k+1}y_i^2}{\sum_{i=k+1}^{p}y_i^2} = \lambda_{k+1} \]

Equality (the maximum) is achieved when \(x = e_{k+1}\), which also ensures the components are uncorrelated, since \(\text{Cov}(Y_{k+1},Y_j) = e_{k+1}'\Sigma e_j = \lambda_j e_{k+1}'e_j = 0\) for all \(j \leq k\). \(\square\)

So, the \(i\)th principal component is \(Y_i=e_i'X\) with \(\text{Var}(Y_i)=\lambda_i\).

As a result we have

\[ \sum_{i=1}^{p}\text{Var}(Y_i) = \sum_{i=1}^{p}\text{Var}(X_i) \]

Proof:

\[ \sum_{i=1}^{p}\text{Var}(X_i) = \text{tr}(\Sigma) = \text{tr}(P\Lambda P') = \text{tr}(P'P\Lambda) = \text{tr}(\Lambda) = \sum_{i=1}^{p}\lambda_i = \sum_{i=1}^{p}\text{Var}(Y_i) \]

by the cyclic property of the trace. Therefore, the proportion of total variance explained by the \(i\)th principal component is

\[ \frac{\lambda_i}{\sum_{i=1}^{p}\lambda_i} \]
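A quick numerical check of this identity, again treating the arbitrary matrix from the earlier sketches as a covariance matrix:

# Sketch: total variance is preserved -- the sum of the variances (trace) equals
# the sum of the eigenvalues
Sigma <- matrix(c(5, 2, 1,
                  2, 4, 1,
                  1, 1, 3), nrow = 3)
lambda <- eigen(Sigma)$values
c(sum_var = sum(diag(Sigma)), sum_lambda = sum(lambda))
round(lambda / sum(lambda), 3)       # proportion of total variance per component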

If, for large \(p\), most (for instance, 80 to 90%) of the total population variance can be attributed to the first one, two, or three components, then these components can “replace” the original \(p\) variables without much loss of information (Johnson and Wichern 2007).

A useful visual aid for determining the appropriate number of principal components is a scree plot, which shows the eigenvalues against the component number; the point at which a bend, or elbow, appears indicates that the remaining principal components contribute little additional variability.

Note: the previous derivations were carried out assuming the population variance-covariance matrix \(\Sigma\) is known, which may not always be the case. The derivations can still be completed using the sample variance-covariance matrix \(S\) in place of \(\Sigma\), with the same results.
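To illustrate the point, the sketch below (again with an arbitrary \(\Sigma\), and again assuming MASS::mvrnorm is available) shows that the eigenvalues of \(S\) approach those of \(\Sigma\) as the sample size grows.

# Sketch: the eigenvalues of the sample covariance matrix S approximate those of Sigma
set.seed(3)
Sigma <- matrix(c(5, 2, 1,
                  2, 4, 1,
                  1, 1, 3), nrow = 3)
X <- MASS::mvrnorm(n = 5000, mu = rep(0, 3), Sigma = Sigma)
S <- cov(X)                          # sample variance-covariance matrix
rbind(population = eigen(Sigma)$values,
      sample     = eigen(S)$values)  # rows are close for large n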

Example of PCA

library(ggplot2)
library(reshape2)
library(knitr)

# The UCI wine.data file ships without a header row
wine_data <- read.csv("wine.data", header = FALSE)
colnames(wine_data) <- c("Class", "Alcohol", "Malic", "Ash", "Alcalinity", "Magnesium", "Phenols", "Flavanoids", "Nonflavanoid", "Proanthocyanins", "ColorIntensity", "Hue", "OD280_OD315", "Proline")

# Drop the class label and standardize the 13 predictors
wine_data <- scale(wine_data[,-1])
kable(wine_data[1:10,])
Alcohol Malic Ash Alcalinity Magnesium Phenols Flavanoids Nonflavanoid Proanthocyanins ColorIntensity Hue OD280_OD315 Proline
0.2551008 -0.5002053 -0.8221529 -2.4930372 0.0290976 0.5710456 0.7375437 -0.8208101 -0.5370519 -0.2903066 0.4059482 1.1284966 0.9683055
0.2056453 0.0179690 1.1045562 -0.2748590 0.0996492 0.8104843 1.2181890 -0.4999191 2.1399040 0.2689663 0.3186634 0.8023031 1.3970348
1.7016732 -0.3483266 0.4865552 -0.8144158 0.9462687 2.4865554 1.4685250 -0.9812556 1.0376281 1.1810114 -0.4232572 1.1994082 2.3338876
0.3045563 0.2234520 1.8316163 0.4445501 1.2990268 0.8104843 0.6674496 0.2220856 0.4077561 -0.3161192 0.3623058 0.4619272 -0.0320627
1.4914875 -0.5180734 0.3047901 -1.2940219 0.8757170 1.5607256 1.3683906 -0.1790282 0.6702028 0.7292910 0.4059482 0.3484686 2.2386144
1.7264010 -0.4197989 0.3047901 -1.4738742 -0.2531089 0.3316069 0.4972211 -0.4999191 0.6876992 0.0839760 0.2750210 1.3837785 1.7304908
1.3183934 -0.1696458 0.8864382 -0.5746128 1.5106816 0.4912327 0.4872077 -0.4196964 -0.5895412 -0.0020660 0.4495905 1.3837785 1.7463697
2.2704111 -0.6252819 -0.7130939 -1.6537265 -0.1825573 0.8104843 0.9578395 -0.5801419 0.6876992 0.0624655 0.5368753 0.3484686 0.9524266
1.0711160 -0.8843690 -0.3495639 -1.0542189 -0.1120057 1.0978108 1.1280680 -1.1417010 0.4602454 0.9314896 0.2313786 1.3412315 0.9524266
1.3678488 -0.1607118 -0.2405049 -0.4547113 0.3818557 1.0499230 1.2982965 -1.1417010 1.3875569 0.2990810 1.2787959 0.8023031 2.4291607

The data set contains \(13\) chemical measurements that help identify the type of wine, labeled Class (1-3).

correlation_matrix <- cor(wine_data)
ggplot(data = melt(correlation_matrix), aes(x=Var1, y=Var2, fill=value)) + 
  geom_tile() + 
  scale_fill_gradient2(low = "blue", mid = "white", high = "red", midpoint = 0) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Correlation Heatmap")

The high correlation among some of the predictors suggests that multicollinearity exists.
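One way to quantify this, as a rough diagnostic rather than a formal test, is the condition number of the correlation matrix, the ratio of its largest to smallest eigenvalue; large values are a common informal indicator of multicollinearity. The sketch below assumes the correlation_matrix object computed above.

# Rough check: condition number of the correlation matrix
# (ratio of its largest to smallest eigenvalue; large values suggest multicollinearity)
eig_cor <- eigen(correlation_matrix)$values
max(eig_cor) / min(eig_cor)
kappa(correlation_matrix, exact = TRUE)   # same quantity via base R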

eigenvalues <- eigen(cov(wine_data))$values
df <- data.frame(PC = 1:length(eigenvalues), Eigenvalue = eigenvalues)
ggplot(df, aes(x = PC, y = Eigenvalue)) +
  geom_point() +
  geom_line() +
  labs(x = "Principal Component", y = "Eigenvalue") +
  ggtitle("Scree Plot")

(round(eigenvalues, 2))
##  [1] 4.68 2.50 1.45 0.92 0.86 0.64 0.55 0.35 0.29 0.25 0.23 0.17 0.10
(round((sum(eigenvalues[1:4]) / sum(eigenvalues))*100,2))
## [1] 73.51
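As an optional cross-check, R's built-in prcomp() should reproduce these component variances (wine_data is already centered and scaled, so no further scaling is needed), and a cumulative sum makes the variance-explained figure explicit.

# Cross-check with prcomp(); the component variances should match the eigenvalues above
pca <- prcomp(wine_data)
rbind(eigen_based  = eigenvalues,
      prcomp_based = pca$sdev^2)
round(cumsum(eigenvalues) / sum(eigenvalues) * 100, 2)   # cumulative % of variance explained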

The first \(4\) principal components account for \(73.51\%\) of the total variation, which is consistent with the scree plot, where the elbow appears around the fourth component.

eigenvectors <- as.data.frame(eigen(cov(wine_data))$vectors)
eigenvectors <- eigenvectors[,1:4]

rownames(eigenvectors) <- c("Alcohol", "Malic", "Ash", "Alcalinity", "Magnesium", "Phenols", "Flavanoids", "Nonflavanoid", "Proanthocyanins", "ColorIntensity", "Hue", "OD280_OD315", "Proline")
colnames(eigenvectors) <- c("e1", "e2", "e3", "e4")
kable(eigenvectors)
e1 e2 e3 e4
Alcohol -0.1378881 -0.4858346 -0.2087749 0.0011407
Malic 0.2463811 -0.2215748 0.0901933 -0.5331364
Ash 0.0043183 -0.3152819 0.6237430 0.2053483
Alcalinity 0.2373796 0.0121435 0.6137987 -0.0572236
Magnesium -0.1350017 -0.3002883 0.1357316 0.3916939
Phenols -0.3958694 -0.0705490 0.1446205 -0.2025993
Flavanoids -0.4243942 -0.0017321 0.1493175 -0.1555752
Nonflavanoid 0.2991357 -0.0246692 0.1691363 0.1753302
Proanthocyanins -0.3128032 -0.0414456 0.1506018 -0.3917438
ColorIntensity 0.0932856 -0.5280188 -0.1360832 -0.0723711
Hue -0.2995654 0.2740507 0.0825512 0.4194314
OD280_OD315 -0.3772025 0.1654491 0.1668111 -0.1904646
Proline -0.2842810 -0.3695384 -0.1280343 0.2236218

Recall our principal components are of the form:

\[ Y_i = e_i'X \]

Essentially, we have reduced the original data set from \(13\) predictors to a new set of \(4\) uncorrelated linear combinations of the original predictors. These principal components could be used as the predictors in a multiple linear regression setting for predicting the class variable.
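As a final sketch of that idea, we can compute the scores for the first four components, confirm that they are uncorrelated with variances equal to the eigenvalues, and pass them to lm(). The regression below is purely illustrative; the class label is categorical, so a classification model would normally be preferred, and the sketch assumes the wine.data file can be re-read to recover the Class column that was dropped when the predictors were scaled.

# Sketch: principal component scores Y = e'X for the first four components
scores <- wine_data %*% as.matrix(eigenvectors)   # one row of scores per observation
colnames(scores) <- c("PC1", "PC2", "PC3", "PC4")
round(cor(scores), 3)                             # pairwise correlations are (numerically) zero
round(apply(scores, 2, var), 2)                   # variances match the first four eigenvalues

# Illustrative regression on the class label
wine_raw <- read.csv("wine.data", header = FALSE) # re-read to recover the dropped Class column
class_label <- wine_raw[, 1]
fit <- lm(class_label ~ scores)                   # principal component regression sketch
summary(fit)$r.squared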

Aeberhard, Stefan, and M. Forina. 1991. “Wine.” UCI Machine Learning Repository.
Alin, Aylin. 2010. “Multicollinearity.” Wiley Interdisciplinary Reviews: Computational Statistics 2 (3): 370–74. https://doi.org/10.1002/wics.84.
Johnson, Richard A., and Dean W. Wichern. 2007. Applied Multivariate Statistical Analysis. 6th ed. Upper Saddle River, N.J: Pearson Prentice Hall.