## Principal component analysis (PCA)

Date: April 6th 2016
Last updated: April 6th 2016

The data for this example is extracted from a MySQL database (see analysis/analysis/multiple regression) and then arranged in a pandas DataFrame. Some exploratory analysis has already been done by looking at histograms (see analysis/analysis/group of histograms).

This example assumes a pandas DataFrame called `df_mysql` already exists.

Split data

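``````
# check columns
df_mysql.columns

# Create data for pca
# (.ix was removed in pandas 1.0; .iloc selects by integer position)
x = df_mysql.iloc[:, 2:].values  # features: every column from the third onward
y = df_mysql.iloc[:, 1].values   # labels: the second column
``````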
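Because `df_mysql` comes from a database that is not included here, the split can be tried on a small stand-in frame. This is only a sketch: the frame `df_demo`, its column names, and its values are made up for illustration, and it assumes the same layout as above (labels in the second column, measurements from the third column onward).

```python
import pandas as pd

# Hypothetical stand-in for df_mysql; names and values are invented.
# Column 0 is an id, column 1 the label, columns 2+ the measurements.
df_demo = pd.DataFrame({
    "id": [1, 2, 3],
    "label": ["HP", "FH", "SW"],
    "length": [5.8, 6.0, 6.2],
    "width": [18.4, 18.6, 18.8],
    "thickness": [2.2, 2.3, 2.4],
    "volume": [24.7, 27.2, 28.3],
})

# Positional slicing, as in the split above
x_demo = df_demo.iloc[:, 2:].values  # feature matrix, one row per sample
y_demo = df_demo.iloc[:, 1].values   # label vector, one entry per sample

print(x_demo.shape, y_demo.shape)
```

The feature matrix has one column per measurement and the label vector has one entry per row, which is the shape the PCA step expects.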

x (length, width, thickness, volume)

``````
x
#array([[  5.83333333,  18.375     ,   2.1875    ,  24.69      ],
#       [  6.        ,  18.625     ,   2.3125    ,  27.17      ],
#       [  6.16666667,  18.75      ,   2.375     ,  28.25      ],
#       ...,
#       [  6.66666667,  22.        ,   2.75      ,  46.5       ],
#       [  6.83333333,  22.25      ,   2.875     ,  50.3       ],
#       [  7.        ,  22.25      ,   2.875     ,  51.6       ]])
``````

y (Labels)

``````
y
#array(['HP', 'HP', 'HP', 'HP', 'HP', 'HP', 'HP', 'HP', 'HP', 'HP', 'HP',
#        <snipped>
#       'FH', 'FH', 'FH', 'FH', 'FH', 'FH', 'FH', 'FH', 'FH', 'FH', 'SW',
#       'SW', 'SW', 'SW', 'SW', 'SW', 'SW', 'AR', 'AR', 'AR', 'AR', 'AR',
#       'AR', 'FH', 'FH', 'FH', 'AR', 'AR', 'AR', 'AR', 'AR', 'AR', 'AR',
#       'SW', 'SW', 'SW', 'SW', 'SW', 'SW', 'SW', 'SW', 'SW', 'SW', 'SW',
#        <snipped>
#       'FI', 'FI', 'FI', 'FI', 'FI', 'FI', 'FI', 'FI', 'FI', 'FI'], dtype=object)
``````

PCA (The shortcut version)

``````
from matplotlib import pyplot as plt
import numpy as np
import math

from sklearn.decomposition import PCA as sklearnPCA

# Keep the first two principal components
sklearn_pca = sklearnPCA(n_components=2)
# Fit on x and project each row onto those two components
Y_sklearn = sklearn_pca.fit_transform(x)
``````
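To make clearer what the shortcut is doing, here is a rough sketch of the longer route on synthetic data: center the features, eigen-decompose the covariance matrix, and project onto the two eigenvectors with the largest eigenvalues. The array `x_demo` is made-up random data standing in for `x`, and the variable names are illustrative. The projections should match scikit-learn's up to the sign of each component, which is arbitrary.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for x: 100 samples, 4 features on different scales.
rng = np.random.default_rng(0)
x_demo = rng.normal(size=(100, 4)) * np.array([0.5, 2.0, 0.3, 10.0])

# Shortcut: scikit-learn
pca = PCA(n_components=2)
scores_sklearn = pca.fit_transform(x_demo)

# Longer route: center, eigen-decompose the covariance matrix,
# keep the two eigenvectors with the largest eigenvalues, project.
x_centered = x_demo - x_demo.mean(axis=0)
cov = np.cov(x_centered, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]        # indices sorted largest first
top2 = eigvecs[:, order[:2]]
scores_manual = x_centered @ top2

# Component signs are arbitrary, so compare absolute values.
same = np.allclose(np.abs(scores_manual), np.abs(scores_sklearn))

# Fraction of the total variance carried by each kept component
ratio_manual = eigvals[order[:2]] / eigvals.sum()

print(same)
print(ratio_manual, pca.explained_variance_ratio_)
```

On the real data, `sklearn_pca.explained_variance_ratio_` gives the same kind of summary: how much of the total variance the two plotted components retain.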

Basic plot

``````
# On matplotlib 3.6+, this style is named 'seaborn-v0_8-whitegrid'
with plt.style.context('seaborn-whitegrid'):
    plt.figure(figsize=(6, 4))
    # One scatter call per label so each group gets its own color and legend entry
    for lab, col in zip(('HP', 'FH', 'AR', 'SW', 'FI'),
                        ('blue', 'red', 'green', 'grey', 'yellow')):
        plt.scatter(Y_sklearn[y == lab, 0],
                    Y_sklearn[y == lab, 1],
                    label=lab,
                    c=col)
    plt.xlabel('Principal Component 1')
    plt.ylabel('Principal Component 2')
    plt.legend(loc='lower right')
    plt.tight_layout()
    plt.show()
``````
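The four measurements are on different scales (volume runs into the tens while thickness stays near 2 to 3), so an unscaled PCA may be driven mostly by volume. Whether to standardize depends on the question being asked. As a hedged sketch, assuming standardization is appropriate, the features can be scaled to zero mean and unit variance before the projection. The data below is synthetic and the names are illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for x: the last feature has by far the largest spread.
rng = np.random.default_rng(0)
x_demo = rng.normal(size=(100, 4)) * np.array([0.5, 2.0, 0.3, 10.0])

# Unscaled PCA: find which feature has the largest weight in component 1
unscaled = PCA(n_components=2).fit(x_demo)
dominant = int(np.argmax(np.abs(unscaled.components_[0])))

# Scaled PCA: standardize each feature, then project onto two components
scaled_pca = make_pipeline(StandardScaler(), PCA(n_components=2))
scores_scaled = scaled_pca.fit_transform(x_demo)

print(dominant)             # index of the feature dominating the unscaled PCA
print(scores_scaled.shape)
```

The resulting `scores_scaled` array can be plotted in the same way as `Y_sklearn` above.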