PRINCIPAL COMPONENT ANALYSIS in simple words (with code)

Mukut Chakraborty
3 min read · May 31, 2021

In today’s world, data are often found in their dirtiest form; if we analyze them as they are, we may lose many important insights or even be led to the wrong conclusions. One such situation is “high-dimensional data”.

For instance, consider the illustration below:

[Image: data points spread across several dimensions. Source: https://python-bloggers.com/2019/04/high-dimensional-data-breaking-the-curse-of-dimensionality-with-python/]

Here, if we look at the data from one dimension’s point of view, we see heavily spread-out data points; when we move to another dimension’s point of view, the spread looks different from the initial view.

The dimension we look from therefore affects what we see in the data. From any single dimension, the variation we are missing acts merely as noise, the rest being the signal. Thus, one of the goals of PCA is to retain as much of the signal as possible while reducing the noise.

Moreover, we often face dependent features that do not provide any significant amount of new information, creating redundancy. Keeping them means carrying additional dimensions, which leads to the situation known as the “Curse of Dimensionality”.
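As a quick, made-up illustration (hypothetical variables, not from the dataset used later), a feature that is just a rescaling of another adds a dimension without adding information:

import numpy as np

rng = np.random.default_rng(0)
height_cm = rng.normal(170, 10, size=100)  # one informative feature
height_in = height_cm / 2.54               # a dependent, redundant feature

# The two columns are perfectly correlated: an extra dimension, no new information
print(np.corrcoef(height_cm, height_in)[0, 1])  # ~1.0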

How does PCA deal with these issues?

PCA addresses these issues by capturing the covariance information of the predictors (features) in order to find the directions of maximum variance. Covariance is a measure of the directional relationship between two random variables.
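As a minimal sketch (again a made-up two-variable example), covariance can be computed with NumPy; a positive value means the two variables tend to move in the same direction:

import numpy as np

a = np.array([2.0, 4.0, 6.0, 8.0])
b = np.array([1.0, 3.0, 2.0, 5.0])

# np.cov returns the 2x2 covariance matrix; the off-diagonal entry is Cov(a, b)
print(np.cov(a, b))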

ALGORITHM-

1. Standardize the independent variables, i.e., scale each feature to zero mean and unit variance.
#Importing the required libraries
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns

# to suppress the warnings:
from warnings import filterwarnings
filterwarnings('ignore')
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from mpl_toolkits.mplot3d import Axes3D
from sklearn import datasets

# Model-building and evaluation utilities (used later when training a model)
from sklearn.metrics import confusion_matrix
from sklearn import tree
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
# loading the iris dataset
iris = datasets.load_iris()
X = iris.data
# Standardize: scale each feature to zero mean and unit variance
X_std = StandardScaler().fit_transform(X)
X_std

2. Calculate the covariance (or correlation) matrix. Since the data have already been standardized, the covariance matrix here is the same as the correlation matrix.

# np.cov expects variables in rows, hence the transpose of X_std
cov_matrix = np.cov(X_std.T)
print(cov_matrix.shape)
print('Covariance Matrix: \n', cov_matrix)

3. Perform the eigendecomposition of the covariance matrix.

# Eigenvectors give the principal directions; eigenvalues give the variance along them
eig_vals, eig_vecs = np.linalg.eig(cov_matrix)
print(len(eig_vals))
print(eig_vecs.shape)

4. Sort the eigenvectors in decreasing order of their eigenvalues.

# Pair each eigenvalue with its corresponding eigenvector (the columns of eig_vecs)
eigen_pairs = [(np.abs(eig_vals[i]), eig_vecs[:, i]) for i in range(len(eig_vals))]
# Sort by eigenvalue in descending order (sort on the value only, not the vector)
eigen_pairs_sorted = sorted(eigen_pairs, key=lambda pair: pair[0], reverse=True)
eig_vals_sorted = [pair[0] for pair in eigen_pairs_sorted]
eig_vecs_sorted = [pair[1] for pair in eigen_pairs_sorted]
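
# (Optional check, not part of the original four steps) See how much variance each
# component explains and project the standardized data onto the top components.
# This sketch assumes eig_vals_sorted, eig_vecs_sorted and X_std from above.
explained_variance_ratio = np.array(eig_vals_sorted) / np.sum(eig_vals_sorted)
print('Explained variance ratio:', explained_variance_ratio)

# Stack the top 2 eigenvectors column-wise into a 4x2 projection matrix
W = np.column_stack((eig_vecs_sorted[0], eig_vecs_sorted[1]))

# Project the standardized data onto the first two principal components
X_projected = X_std.dot(W)
print(X_projected.shape)  # (150, 2)
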
# Plotting: first the raw data in two feature dimensions, then the first 3 principal components

iris = datasets.load_iris()
X = iris.data[:, :2] # we only take the first two features.
y = iris.target

## Get the min and max of the two dimensions and extend the margins by .5 on both
## sides so the data points sit away from the plot edges
x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5

## plot frame size
plt.figure(2, figsize=(8, 6))
plt.clf()

# Plot the training points (scatter plot, all rows, first and second column only)
plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Set1, edgecolor='k')
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')


## set the axis limits and hide the tick marks
plt.xlim(x_min, x_max)
plt.ylim(y_min, y_max)
plt.xticks(())
plt.yticks(())

# To get a better understanding of the interaction of the dimensions,
# plot the first three PCA dimensions
fig = plt.figure(1, figsize=(8, 6))
ax = fig.add_subplot(projection='3d')
ax.view_init(elev=-150, azim=110)
X_reduced = PCA(n_components=3).fit_transform(iris.data)
ax.scatter(X_reduced[:, 0], X_reduced[:, 1], X_reduced[:, 2], c=y,
           cmap=plt.cm.Set1, edgecolor='k', s=40)
ax.set_title("First three PCA directions")
ax.set_xlabel("1st eigenvector")
ax.set_ylabel("2nd eigenvector")
ax.set_zlabel("3rd eigenvector")
ax.xaxis.set_ticklabels([])
ax.yaxis.set_ticklabels([])
ax.zaxis.set_ticklabels([])

plt.show()

PCA is most often used as a dimensionality-reduction technique in facial recognition, computer vision, image compression and similar applications; a closely related method is linear discriminant analysis (LDA). From here, we would move on to training a model on the reduced data.
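As a rough, hypothetical sketch of what that model-training step might look like (not a tuned model, just to close the loop), one could fit a simple decision tree on the PCA-reduced features, reusing X_reduced and y from the plotting code and the scikit-learn utilities imported at the top:

# Hypothetical example: split the PCA-reduced iris data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X_reduced, y, test_size=0.3, random_state=42)

# Fit a simple decision tree classifier on the principal components
clf = tree.DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)

# Evaluate on the held-out test set
y_pred = clf.predict(X_test)
print('Accuracy:', accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))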

I hope this helped you get a grasp, to some extent, of one of the most important preprocessing steps in today’s data world. Thank you for your patience!


Mukut Chakraborty

Data Science enthusiast, pursuing a Master’s in Computer Science (specialisation in Data Analytics) from IIITM-Kerala; sportsperson.