Multiple regression

Date: April 3rd 2016
Last updated: April 3rd 2016

Continuous variables

import statsmodels.formula.api as smf

lm = smf.ols(forumla='boardvolume ~ boardlength\
+ boardwidth + boardvolume', data=df).fit()

lm.params
lm.summary()
lm.rsquared

Add a categorical variable

# encode df.boardperformance as a numeric via pd.Factor
df['boardperformance_ord'] =\
pd.Categorical(df.boardperformance).labels

# add the dummy variable
lm = smf.ols(formula="chd ~ boardvolume ~ \
boardlength + boardwidth + boardvolume +\
boardperformance_ord", data=df).fit()

#OR 
# USE STATSMODELS.FORMULA TO CREATE A 
# DUMMY VARIABLE
lm = smf.ols(formula="chd ~ boardvolume ~ \
boardlength + boardwidth + boardvolume +\
C(boardperformance)", data=df).fit()

Example
Most surfboards these days are given four dimensions (length, width, thickness and volume). Volume is a recent addition and many old boards only have the three variables written on the boards stringer.

The best way to calculate volume of a surfboard is to fill a bath tub to the brim, chuck in the board and catch the displaced water. I certainly don't have a tub big enough to do this so I'll use multiple regression instead.

Based on new boards with a known volume I am going to use the model coefficients from multiple regression to calculate an estimate of volume. As a side note, I recently damaged my board and took it in to my local surf shop for repairs. The store owner estimated the board to be around 28 litres (4 litres less than I should be riding for my height and weight).

This dataset contains n=100 surfboards from four categories (all-rounders, small wave boards, friendly high performance and high performance boards). All data were taken from www.surfshopaustralia.com.au. My board fits in with the shape and size of high performance boards so I will use the dummy variable corresponding to this category. The objective is to find the volume of my board.

import pandas as pd
import statsmodels.formula.api as smf
import MySQLdb
from fractions import Fraction

# Connect to MySQL database
# see entry: getting started with MySQL
db = MySQLdb.connect(host='localhost',
                     user='duck',
                     passwd='password',
                     db='surfdiary')

# Get data from MySQL as Pandas DataFrame                     
df_mysql = pd.read_sql("""SELECT boardperformance, 
                       boardlength, boardwidth, 
                       boardthickness, boardvolume 
                       FROM surferprofile_board;""", 
                       con = db)

# Functions to convert strings to float
# see entry: convert string fraction to decimal
def convert_feet_to_float(value):
    """ 
    see entry: convert string fraction to decimal 
    """
    mylist = value.split()
    my_feet = float(mylist[0])
    my_inches = float(int(mylist[1]) / 12)
    myvalue = my_feet + my_inches
    return myvalue

def convert_my_string_to_float(value):
    """ 
    see entry: convert string fraction to decimal 
    """
    mylist = value.split()
    if len(mylist) > 1:
        my_whole_number = float(mylist[0])
        my_fraction = float(Fraction(mylist[1]))
        myvalue = my_whole_number + my_fraction
        return myvalue
    else:
        my_whole_number = float(mylist[0])
        return my_whole_number

# modify strings from several columns
# see entry: pandas: modify values in a column
df_mysql['boardlength'] = df_mysql['boardlength'].\
apply(convert_feet_to_float)
df_mysql['boardwidth'] = df_mysql['boardwidth'].\
apply(convert_my_string_to_float)
df_mysql['boardthickness'] = df_mysql['boardthickness'].\
apply(convert_my_string_to_float)

# Calculate multiple regression
lm = smf.ols(formula="boardvolume ~ \
boardlength + boardwidth + boardthickness +\
C(boardperformance)", data=df_mysql).fit()

model parameters

lm.params
#Intercept                   -74.679006
#C(boardperformance)[T.FH]    -1.134759
#C(boardperformance)[T.HP]    -1.822232
#C(boardperformance)[T.SW]    -0.141984
#boardlength                   6.284218
#boardwidth                    2.877587
#boardthickness                4.978518
#dtype: float64

lm.rsquared
#0.97505449910508424

lm.summary()
#<class 'statsmodels.iolib.summary.Summary'>
"""
                            OLS Regression Results                            
==============================================================================
Dep. Variable:            boardvolume   R-squared:                       0.975
Model:                            OLS   Adj. R-squared:                  0.973
Method:                 Least Squares   F-statistic:                     605.9
Date:                Sun, 03 Apr 2016   Prob (F-statistic):           3.16e-72
Time:                        14:44:38   Log-Likelihood:                -112.62
No. Observations:                 100   AIC:                             239.2
Df Residuals:                      93   BIC:                             257.5
Df Model:                           6                                         
Covariance Type:            nonrobust                                         
=============================================================================================
                                coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------------------
Intercept                   -74.6790      3.159    -23.637      0.000     -80.953     -68.405
C(boardperformance)[T.FH]    -1.1348      0.277     -4.090      0.000      -1.686      -0.584
C(boardperformance)[T.HP]    -1.8222      0.412     -4.426      0.000      -2.640      -1.005
C(boardperformance)[T.SW]    -0.1420      0.302     -0.471      0.639      -0.741       0.457
boardlength                   6.2842      0.450     13.974      0.000       5.391       7.177
boardwidth                    2.8776      0.258     11.173      0.000       2.366       3.389
boardthickness                4.9785      0.926      5.376      0.000       3.139       6.818
==============================================================================
Omnibus:                        5.377   Durbin-Watson:                   0.764
Prob(Omnibus):                  0.068   Jarque-Bera (JB):                6.636
Skew:                           0.225   Prob(JB):                       0.0362
Kurtosis:                       4.179   Cond. No.                         862.
==============================================================================
"""

Evaluation

#My board dimensions
#Length: 6'4" (converted to 6.3)
#width: 18 1/2" (converted to 18.5)
#thickness: 2 3/8 (converted to 2.375)

# My board fits High performance category
# so I am adding T.HP (-1.8222) dummy variable
# (Currently the model is using the 
#all rounders board type T.AR)

my_board_volume = -74.679 + -1.8222 
                  + 6.2842*6.3 
                  + 2.877587*18.5
                  + 4.978518*2.375
my_board_volume
#28.148599750000002

# The volume of my board is approximately 28 litres

# NOTE
# However the 95% confidence interval is , large
# (95% CI = 1.59 - 54.694 litres)

"""
TESTS
# Driver: Lost Surfboards
5'11, 18 3/4, 2 1/4 = 26.3 lt (est = 25.83, dif = 0.47)
6'3, 19, 2 1/2 = 30.52 lt (est = 29.864, diff = 0.624)
6'6, 19 1/2, 2 5/8 = 34.06 lt (est = 34.156, diff = -0.095)

# Varmit: Simon Anderson 
(wider nose than many other HiPerformance boards)
6'3, 20 1/4, 2 9/16 = 34.79 lt (est = 33.804, diff = 0.986)
6'6, 21, 2 3/4 = 40.85 lt (est = 39.095, diff = 1.755)

# Head Shifter: Superbrand
6'2, 18 3/4, 2 3/8 = 28.4 lt (est = 28.240, diff = 0.16)
"""

Useful resources

results matching ""

    No results matching ""