Principal component analysis (PCA)

Date: April 6th 2016
Last updated: April 6th 2016

The data for this example are extracted from a MySQL database (see analysis/analysis/multiple regression) and then arranged in a pandas DataFrame. Some exploratory analysis has already been done by looking at histograms (see analysis/analysis/group of histograms).

This example assumes that a pandas DataFrame called "df_mysql" already exists.
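If the database is not at hand, a stand-in dataframe is enough to follow along. The sketch below is only an assumption about the layout: the code further down treats column 1 as the label and columns 2 onward as the four measurements, so the stand-in mirrors that. The column names, the "id" column, the random values and the volume formula are all hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the MySQL extract. Names and values are made up;
# only the layout (one leading column, a label column, four numeric columns)
# mirrors what the rest of this page expects.
rng = np.random.default_rng(0)
labels = np.repeat(['HP', 'FH', 'AR', 'SW', 'FI'], 20)
length = rng.uniform(5.5, 7.5, size=labels.size)
width = rng.uniform(18.0, 23.0, size=labels.size)
thickness = rng.uniform(2.0, 3.0, size=labels.size)

df_mysql = pd.DataFrame({
    'id': np.arange(labels.size),                  # assumed leading column
    'label': labels,                               # column 1: class labels
    'length': length,
    'width': width,
    'thickness': thickness,
    'volume': length * width * thickness / 10.0,   # arbitrary illustrative formula
})

print(df_mysql.head())
```

With this in place, the remaining snippets run unchanged, although the resulting plot will of course not resemble one made from the real measurements.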

Split data

# check columns
df_mysql.columns

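# Create data for PCA
# (.ix was removed in pandas 1.0; .iloc selects by integer position)
x = df_mysql.iloc[:, 2:].values   # feature columns
y = df_mysql.iloc[:, 1].values    # label column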

x (length, width, thickness, volume)

x
#array([[  5.83333333,  18.375     ,   2.1875    ,  24.69      ],
#       [  6.        ,  18.625     ,   2.3125    ,  27.17      ],
#       [  6.16666667,  18.75      ,   2.375     ,  28.25      ],
#       ..., 
#       [  6.66666667,  22.        ,   2.75      ,  46.5       ],
#       [  6.83333333,  22.25      ,   2.875     ,  50.3       ],
#       [  7.        ,  22.25      ,   2.875     ,  51.6       ]])

y (Labels)

y
#array(['HP', 'HP', 'HP', 'HP', 'HP', 'HP', 'HP', 'HP', 'HP', 'HP', 'HP',
#        <snipped>
#       'FH', 'FH', 'FH', 'FH', 'FH', 'FH', 'FH', 'FH', 'FH', 'FH', 'SW',
#       'SW', 'SW', 'SW', 'SW', 'SW', 'SW', 'AR', 'AR', 'AR', 'AR', 'AR',
#       'AR', 'FH', 'FH', 'FH', 'AR', 'AR', 'AR', 'AR', 'AR', 'AR', 'AR',
#       'SW', 'SW', 'SW', 'SW', 'SW', 'SW', 'SW', 'SW', 'SW', 'SW', 'SW',
#        <snipped>
#       'FI', 'FI', 'FI', 'FI', 'FI', 'FI', 'FI', 'FI', 'FI', 'FI'], dtype=object)

PCA (The shortcut version)

from matplotlib import pyplot as plt
import numpy as np
import math

# Fit PCA and project x onto the first two principal components
from sklearn.decomposition import PCA as sklearnPCA
sklearn_pca = sklearnPCA(n_components=2)
Y_sklearn = sklearn_pca.fit_transform(x)   # shape: (n_samples, 2)
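To see roughly what the shortcut does, the same projection can be sketched by hand with NumPy: center each column, take the singular value decomposition, and project onto the first two principal axes. This is a minimal sketch on synthetic data (x_demo is a made-up stand-in, not the real measurements), and the resulting scores should agree with scikit-learn's only up to the sign of each component.

```python
import numpy as np

# Synthetic stand-in for x: 100 rows, 4 features on different scales.
rng = np.random.default_rng(0)
x_demo = rng.normal(size=(100, 4)) * np.array([0.5, 1.5, 0.3, 8.0])

# 1. Center each column (PCA works on deviations from the column means).
x_centered = x_demo - x_demo.mean(axis=0)

# 2. SVD of the centered data; the rows of vt are the principal axes,
#    ordered by decreasing singular value.
u, s, vt = np.linalg.svd(x_centered, full_matrices=False)

# 3. Project the centered data onto the first two principal axes.
scores = x_centered @ vt[:2].T

# 4. Share of the total variance carried by each component.
explained_variance = s ** 2 / (x_demo.shape[0] - 1)
explained_ratio = explained_variance / explained_variance.sum()

print(scores.shape)
print(explained_ratio[:2])
```

The explained-variance ratio is a quick check on how much information the two-dimensional plot keeps; on a fitted scikit-learn PCA object the same quantity is available as the explained_variance_ratio_ attribute.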

Basic plot

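# matplotlib 3.6 renamed this style to 'seaborn-v0_8-whitegrid';
# use that name if the one below is not found
with plt.style.context('seaborn-whitegrid'):
    plt.figure(figsize=(6, 4))
    # Plot each label as its own colored series
    for lab, col in zip(('HP', 'FH', 'AR', 'SW', 'FI'),
                        ('blue', 'red', 'green', 'grey', 'yellow')):
        plt.scatter(Y_sklearn[y == lab, 0],
                    Y_sklearn[y == lab, 1],
                    label=lab,
                    c=col)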
    plt.xlabel('Principal Component 1')
    plt.ylabel('Principal Component 2')
    plt.legend(loc='lower right')
    plt.tight_layout()
    plt.show()
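One thing to keep in mind when reading the plot: scikit-learn's PCA centers the data but does not rescale it, so a column with a large spread (volume, in the sample rows shown earlier) can dominate the first component. A possible variant, sketched here on synthetic data, standardizes each column first. The x_demo array and the names scaled_pca and Y_scaled are illustrative assumptions, and whether scaling is appropriate depends on the question being asked of the data.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for x: the last column has a much larger spread.
rng = np.random.default_rng(0)
x_demo = rng.normal(size=(100, 4)) * np.array([0.5, 1.5, 0.3, 8.0])

# Standardize every column to zero mean and unit variance, then run PCA.
scaled_pca = make_pipeline(StandardScaler(), PCA(n_components=2))
Y_scaled = scaled_pca.fit_transform(x_demo)

# Compare how much variance the first two components explain in each case.
raw_ratio = PCA(n_components=2).fit(x_demo).explained_variance_ratio_
scaled_ratio = scaled_pca.named_steps['pca'].explained_variance_ratio_

print('unscaled:', raw_ratio)
print('scaled:  ', scaled_ratio)
```

Y_scaled can be passed to the plotting code above in place of Y_sklearn to compare the two views side by side.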
