Multiple regression
Date: April 3rd 2016
Last updated: April 3rd 2016
Continuous variables
import statsmodels.formula.api as smf
lm = smf.ols(forumla='boardvolume ~ boardlength\
+ boardwidth + boardvolume', data=df).fit()
lm.params
lm.summary()
lm.rsquared
Add a categorical variable
# encode df.boardperformance as a numeric via pd.Factor
df['boardperformance_ord'] =\
pd.Categorical(df.boardperformance).labels
# add the dummy variable
lm = smf.ols(formula="chd ~ boardvolume ~ \
boardlength + boardwidth + boardvolume +\
boardperformance_ord", data=df).fit()
#OR
# USE STATSMODELS.FORMULA TO CREATE A
# DUMMY VARIABLE
lm = smf.ols(formula="chd ~ boardvolume ~ \
boardlength + boardwidth + boardvolume +\
C(boardperformance)", data=df).fit()
Example
Most surfboards these days are given four dimensions (length, width, thickness and volume). Volume is a recent addition and many old boards only have the three variables written on the boards stringer.
The best way to calculate volume of a surfboard is to fill a bath tub to the brim, chuck in the board and catch the displaced water. I certainly don't have a tub big enough to do this so I'll use multiple regression instead.
Based on new boards with a known volume I am going to use the model coefficients from multiple regression to calculate an estimate of volume. As a side note, I recently damaged my board and took it in to my local surf shop for repairs. The store owner estimated the board to be around 28 litres (4 litres less than I should be riding for my height and weight).
This dataset contains n=100 surfboards from four categories (all-rounders, small wave boards, friendly high performance and high performance boards). All data were taken from www.surfshopaustralia.com.au. My board fits in with the shape and size of high performance boards so I will use the dummy variable corresponding to this category. The objective is to find the volume of my board.
import pandas as pd
import statsmodels.formula.api as smf
import MySQLdb
from fractions import Fraction
# Connect to MySQL database
# see entry: getting started with MySQL
db = MySQLdb.connect(host='localhost',
user='duck',
passwd='password',
db='surfdiary')
# Get data from MySQL as Pandas DataFrame
df_mysql = pd.read_sql("""SELECT boardperformance,
boardlength, boardwidth,
boardthickness, boardvolume
FROM surferprofile_board;""",
con = db)
# Functions to convert strings to float
# see entry: convert string fraction to decimal
def convert_feet_to_float(value):
"""
see entry: convert string fraction to decimal
"""
mylist = value.split()
my_feet = float(mylist[0])
my_inches = float(int(mylist[1]) / 12)
myvalue = my_feet + my_inches
return myvalue
def convert_my_string_to_float(value):
"""
see entry: convert string fraction to decimal
"""
mylist = value.split()
if len(mylist) > 1:
my_whole_number = float(mylist[0])
my_fraction = float(Fraction(mylist[1]))
myvalue = my_whole_number + my_fraction
return myvalue
else:
my_whole_number = float(mylist[0])
return my_whole_number
# modify strings from several columns
# see entry: pandas: modify values in a column
df_mysql['boardlength'] = df_mysql['boardlength'].\
apply(convert_feet_to_float)
df_mysql['boardwidth'] = df_mysql['boardwidth'].\
apply(convert_my_string_to_float)
df_mysql['boardthickness'] = df_mysql['boardthickness'].\
apply(convert_my_string_to_float)
# Calculate multiple regression
lm = smf.ols(formula="boardvolume ~ \
boardlength + boardwidth + boardthickness +\
C(boardperformance)", data=df_mysql).fit()
model parameters
lm.params
#Intercept -74.679006
#C(boardperformance)[T.FH] -1.134759
#C(boardperformance)[T.HP] -1.822232
#C(boardperformance)[T.SW] -0.141984
#boardlength 6.284218
#boardwidth 2.877587
#boardthickness 4.978518
#dtype: float64
lm.rsquared
#0.97505449910508424
lm.summary()
#<class 'statsmodels.iolib.summary.Summary'>
"""
OLS Regression Results
==============================================================================
Dep. Variable: boardvolume R-squared: 0.975
Model: OLS Adj. R-squared: 0.973
Method: Least Squares F-statistic: 605.9
Date: Sun, 03 Apr 2016 Prob (F-statistic): 3.16e-72
Time: 14:44:38 Log-Likelihood: -112.62
No. Observations: 100 AIC: 239.2
Df Residuals: 93 BIC: 257.5
Df Model: 6
Covariance Type: nonrobust
=============================================================================================
coef std err t P>|t| [0.025 0.975]
---------------------------------------------------------------------------------------------
Intercept -74.6790 3.159 -23.637 0.000 -80.953 -68.405
C(boardperformance)[T.FH] -1.1348 0.277 -4.090 0.000 -1.686 -0.584
C(boardperformance)[T.HP] -1.8222 0.412 -4.426 0.000 -2.640 -1.005
C(boardperformance)[T.SW] -0.1420 0.302 -0.471 0.639 -0.741 0.457
boardlength 6.2842 0.450 13.974 0.000 5.391 7.177
boardwidth 2.8776 0.258 11.173 0.000 2.366 3.389
boardthickness 4.9785 0.926 5.376 0.000 3.139 6.818
==============================================================================
Omnibus: 5.377 Durbin-Watson: 0.764
Prob(Omnibus): 0.068 Jarque-Bera (JB): 6.636
Skew: 0.225 Prob(JB): 0.0362
Kurtosis: 4.179 Cond. No. 862.
==============================================================================
"""
Evaluation
#My board dimensions
#Length: 6'4" (converted to 6.3)
#width: 18 1/2" (converted to 18.5)
#thickness: 2 3/8 (converted to 2.375)
# My board fits High performance category
# so I am adding T.HP (-1.8222) dummy variable
# (Currently the model is using the
#all rounders board type T.AR)
my_board_volume = -74.679 + -1.8222
+ 6.2842*6.3
+ 2.877587*18.5
+ 4.978518*2.375
my_board_volume
#28.148599750000002
# The volume of my board is approximately 28 litres
# NOTE
# However the 95% confidence interval is , large
# (95% CI = 1.59 - 54.694 litres)
"""
TESTS
# Driver: Lost Surfboards
5'11, 18 3/4, 2 1/4 = 26.3 lt (est = 25.83, dif = 0.47)
6'3, 19, 2 1/2 = 30.52 lt (est = 29.864, diff = 0.624)
6'6, 19 1/2, 2 5/8 = 34.06 lt (est = 34.156, diff = -0.095)
# Varmit: Simon Anderson
(wider nose than many other HiPerformance boards)
6'3, 20 1/4, 2 9/16 = 34.79 lt (est = 33.804, diff = 0.986)
6'6, 21, 2 3/4 = 40.85 lt (est = 39.095, diff = 1.755)
# Head Shifter: Superbrand
6'2, 18 3/4, 2 3/8 = 28.4 lt (est = 28.240, diff = 0.16)
"""
Useful resources