## Multiple regression

Date: April 3rd 2016
Last updated: April 3rd 2016

Continuous variables

``````import statsmodels.formula.api as smf

lm = smf.ols(forumla='boardvolume ~ boardlength\
+ boardwidth + boardvolume', data=df).fit()

lm.params
lm.summary()
lm.rsquared
``````

Add a categorical variable

``````# encode df.boardperformance as a numeric via pd.Factor
df['boardperformance_ord'] =\
pd.Categorical(df.boardperformance).labels

# add the dummy variable
lm = smf.ols(formula="chd ~ boardvolume ~ \
boardlength + boardwidth + boardvolume +\
boardperformance_ord", data=df).fit()

#OR
# USE STATSMODELS.FORMULA TO CREATE A
# DUMMY VARIABLE
lm = smf.ols(formula="chd ~ boardvolume ~ \
boardlength + boardwidth + boardvolume +\
C(boardperformance)", data=df).fit()
``````

Example
Most surfboards these days are given four dimensions (length, width, thickness and volume). Volume is a recent addition and many old boards only have the three variables written on the boards stringer.

The best way to calculate volume of a surfboard is to fill a bath tub to the brim, chuck in the board and catch the displaced water. I certainly don't have a tub big enough to do this so I'll use multiple regression instead.

Based on new boards with a known volume I am going to use the model coefficients from multiple regression to calculate an estimate of volume. As a side note, I recently damaged my board and took it in to my local surf shop for repairs. The store owner estimated the board to be around 28 litres (4 litres less than I should be riding for my height and weight).

This dataset contains n=100 surfboards from four categories (all-rounders, small wave boards, friendly high performance and high performance boards). All data were taken from www.surfshopaustralia.com.au. My board fits in with the shape and size of high performance boards so I will use the dummy variable corresponding to this category. The objective is to find the volume of my board.

``````import pandas as pd
import statsmodels.formula.api as smf
import MySQLdb
from fractions import Fraction

# Connect to MySQL database
# see entry: getting started with MySQL
db = MySQLdb.connect(host='localhost',
user='duck',
db='surfdiary')

# Get data from MySQL as Pandas DataFrame
df_mysql = pd.read_sql("""SELECT boardperformance,
boardlength, boardwidth,
boardthickness, boardvolume
FROM surferprofile_board;""",
con = db)

# Functions to convert strings to float
# see entry: convert string fraction to decimal
def convert_feet_to_float(value):
"""
see entry: convert string fraction to decimal
"""
mylist = value.split()
my_feet = float(mylist[0])
my_inches = float(int(mylist[1]) / 12)
myvalue = my_feet + my_inches
return myvalue

def convert_my_string_to_float(value):
"""
see entry: convert string fraction to decimal
"""
mylist = value.split()
if len(mylist) > 1:
my_whole_number = float(mylist[0])
my_fraction = float(Fraction(mylist[1]))
myvalue = my_whole_number + my_fraction
return myvalue
else:
my_whole_number = float(mylist[0])
return my_whole_number

# modify strings from several columns
# see entry: pandas: modify values in a column
df_mysql['boardlength'] = df_mysql['boardlength'].\
apply(convert_feet_to_float)
df_mysql['boardwidth'] = df_mysql['boardwidth'].\
apply(convert_my_string_to_float)
df_mysql['boardthickness'] = df_mysql['boardthickness'].\
apply(convert_my_string_to_float)

# Calculate multiple regression
lm = smf.ols(formula="boardvolume ~ \
boardlength + boardwidth + boardthickness +\
C(boardperformance)", data=df_mysql).fit()
``````

model parameters

``````lm.params
#Intercept                   -74.679006
#C(boardperformance)[T.FH]    -1.134759
#C(boardperformance)[T.HP]    -1.822232
#C(boardperformance)[T.SW]    -0.141984
#boardlength                   6.284218
#boardwidth                    2.877587
#boardthickness                4.978518
#dtype: float64

lm.rsquared
#0.97505449910508424

lm.summary()
#<class 'statsmodels.iolib.summary.Summary'>
"""
OLS Regression Results
==============================================================================
Dep. Variable:            boardvolume   R-squared:                       0.975
Model:                            OLS   Adj. R-squared:                  0.973
Method:                 Least Squares   F-statistic:                     605.9
Date:                Sun, 03 Apr 2016   Prob (F-statistic):           3.16e-72
Time:                        14:44:38   Log-Likelihood:                -112.62
No. Observations:                 100   AIC:                             239.2
Df Residuals:                      93   BIC:                             257.5
Df Model:                           6
Covariance Type:            nonrobust
=============================================================================================
coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------------------
Intercept                   -74.6790      3.159    -23.637      0.000     -80.953     -68.405
C(boardperformance)[T.FH]    -1.1348      0.277     -4.090      0.000      -1.686      -0.584
C(boardperformance)[T.HP]    -1.8222      0.412     -4.426      0.000      -2.640      -1.005
C(boardperformance)[T.SW]    -0.1420      0.302     -0.471      0.639      -0.741       0.457
boardlength                   6.2842      0.450     13.974      0.000       5.391       7.177
boardwidth                    2.8776      0.258     11.173      0.000       2.366       3.389
boardthickness                4.9785      0.926      5.376      0.000       3.139       6.818
==============================================================================
Omnibus:                        5.377   Durbin-Watson:                   0.764
Prob(Omnibus):                  0.068   Jarque-Bera (JB):                6.636
Skew:                           0.225   Prob(JB):                       0.0362
Kurtosis:                       4.179   Cond. No.                         862.
==============================================================================
"""
``````

Evaluation

``````#My board dimensions
#Length: 6'4" (converted to 6.3)
#width: 18 1/2" (converted to 18.5)
#thickness: 2 3/8 (converted to 2.375)

# My board fits High performance category
# so I am adding T.HP (-1.8222) dummy variable
# (Currently the model is using the
#all rounders board type T.AR)

my_board_volume = -74.679 + -1.8222
+ 6.2842*6.3
+ 2.877587*18.5
+ 4.978518*2.375
my_board_volume
#28.148599750000002

# The volume of my board is approximately 28 litres

# NOTE
# However the 95% confidence interval is , large
# (95% CI = 1.59 - 54.694 litres)

"""
TESTS
# Driver: Lost Surfboards
5'11, 18 3/4, 2 1/4 = 26.3 lt (est = 25.83, dif = 0.47)
6'3, 19, 2 1/2 = 30.52 lt (est = 29.864, diff = 0.624)
6'6, 19 1/2, 2 5/8 = 34.06 lt (est = 34.156, diff = -0.095)

# Varmit: Simon Anderson
(wider nose than many other HiPerformance boards)
6'3, 20 1/4, 2 9/16 = 34.79 lt (est = 33.804, diff = 0.986)
6'6, 21, 2 3/4 = 40.85 lt (est = 39.095, diff = 1.755)

# Head Shifter: Superbrand
6'2, 18 3/4, 2 3/8 = 28.4 lt (est = 28.240, diff = 0.16)
"""
``````

Useful resources