Principal component analysis (PCA)
Date: April 6th 2016
Last updated: April 6th 2016
The data for this example is extracted from a MySQL database (see analysis/analysis/multiple regression) and arranged in a pandas DataFrame. Some exploratory analysis has already been done by looking at histograms (see analysis/analysis/group of histograms).
This example assumes a pandas DataFrame called "df_mysql" already exists.
Split data
# check columns
df_mysql.columns
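# Create data for PCA (.ix was removed in pandas 1.0; .iloc selects by position)
x = df_mysql.iloc[:, 2:].values  # features: columns 2 onward
y = df_mysql.iloc[:, 1].values   # labels: column 1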
x (length, width, thickness, volume)
x
#array([[ 5.83333333, 18.375 , 2.1875 , 24.69 ],
# [ 6. , 18.625 , 2.3125 , 27.17 ],
# [ 6.16666667, 18.75 , 2.375 , 28.25 ],
# ...,
# [ 6.66666667, 22. , 2.75 , 46.5 ],
# [ 6.83333333, 22.25 , 2.875 , 50.3 ],
# [ 7. , 22.25 , 2.875 , 51.6 ]])
y (Labels)
y
#array(['HP', 'HP', 'HP', 'HP', 'HP', 'HP', 'HP', 'HP', 'HP', 'HP', 'HP',
# <snipped>
# 'FH', 'FH', 'FH', 'FH', 'FH', 'FH', 'FH', 'FH', 'FH', 'FH', 'SW',
# 'SW', 'SW', 'SW', 'SW', 'SW', 'SW', 'AR', 'AR', 'AR', 'AR', 'AR',
# 'AR', 'FH', 'FH', 'FH', 'AR', 'AR', 'AR', 'AR', 'AR', 'AR', 'AR',
# 'SW', 'SW', 'SW', 'SW', 'SW', 'SW', 'SW', 'SW', 'SW', 'SW', 'SW',
# <snipped>
# 'FI', 'FI', 'FI', 'FI', 'FI', 'FI', 'FI', 'FI', 'FI', 'FI'], dtype=object)
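The split above can be tried without the database. The following is a minimal sketch of position-based selection with `.iloc`, assuming a small made-up frame named `df_demo`. The name, the values, and the layout (an id in column 0, the label in column 1, the measurements from column 2 onward) are hypothetical stand-ins for `df_mysql`, not the real data.

```python
import pandas as pd

# Hypothetical stand-in for df_mysql: column 0 is an id, column 1 the label,
# columns 2 onward the measurements. All values are made up for illustration.
df_demo = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "label": ["HP", "HP", "FH", "SW"],
    "length": [5.8, 6.0, 6.7, 7.0],
    "width": [18.4, 18.6, 22.0, 22.3],
    "thickness": [2.2, 2.3, 2.8, 2.9],
    "volume": [24.7, 27.2, 46.5, 51.6],
})

# Every row, columns 2 to the end: the feature matrix
x_demo = df_demo.iloc[:, 2:].values
# Every row, column 1 only: the labels
y_demo = df_demo.iloc[:, 1].values

print(x_demo.shape)
print(y_demo)
```

Selecting by position keeps the split independent of the column names, at the cost of silently picking the wrong columns if the frame's column order changes.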
PCA (The shortcut version)
from matplotlib import pyplot as plt
import numpy as np
import math
from sklearn.decomposition import PCA as sklearnPCA
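# Keep the first two principal components (PCA centers the data but does not rescale it)
sklearn_pca = sklearnPCA(n_components=2)
# Fit on x and return the projected data: one row per sample, one column per component
Y_sklearn = sklearn_pca.fit_transform(x)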
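For intuition about what the shortcut computes, here is a minimal NumPy sketch of the same idea: center the columns, take a singular value decomposition, and project onto the first two directions. The array `x_demo` and its values are made up for illustration, and every variable name is hypothetical. This is a sketch of the general technique, not scikit-learn's actual implementation, and the signs of the components may differ from scikit-learn's output.

```python
import numpy as np

# Made-up measurements (length, width, thickness, volume), for illustration only
x_demo = np.array([
    [5.8, 18.4, 2.2, 24.7],
    [6.0, 18.6, 2.3, 27.2],
    [6.2, 18.8, 2.4, 28.3],
    [6.7, 22.0, 2.8, 46.5],
    [6.8, 22.3, 2.9, 50.3],
    [7.0, 22.3, 2.9, 51.6],
])

# 1. Center each column on its mean (no rescaling)
x_centered = x_demo - x_demo.mean(axis=0)

# 2. SVD of the centered data; the rows of vt are the principal directions
u, s, vt = np.linalg.svd(x_centered, full_matrices=False)

# 3. Keep the first two directions and project the data onto them
components = vt[:2]
scores = x_centered @ components.T

# Share of the total variance captured by each kept component
explained_variance = s**2 / (x_demo.shape[0] - 1)
explained_ratio = explained_variance[:2] / explained_variance.sum()

print(scores.shape)
print(explained_ratio)
```

On a fitted scikit-learn model, the attribute `explained_variance_ratio_` reports the same kind of per-component share. Because the columns are not rescaled, the column with the largest spread (volume, in this made-up data) tends to dominate the first component.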
Basic plot
# Note: on matplotlib 3.6 and later this style is named 'seaborn-v0_8-whitegrid'
with plt.style.context('seaborn-whitegrid'):
    plt.figure(figsize=(6, 4))
    # One scatter call per label so that each group gets its own color and legend entry
    for lab, col in zip(('HP', 'FH', 'AR', 'SW', 'FI'),
                        ('blue', 'red', 'green', 'grey', 'yellow')):
        plt.scatter(Y_sklearn[y == lab, 0],
                    Y_sklearn[y == lab, 1],
                    label=lab,
                    c=col)
    plt.xlabel('Principal Component 1')
    plt.ylabel('Principal Component 2')
    plt.legend(loc='lower right')
    plt.tight_layout()
    plt.show()