Decision key
1 response variable and (at most) 1 independent variable
Date: June 30th 2016
Last updated: July 5th 2016
This is a key for choosing a method of analysis for data sets with at most two variables: one dependent/response variable and, optionally, one independent variable of nominal, ordinal or interval/ratio type.
DATA TYPES
Ordinal (ranked order - e.g. a Likert item such as 1: agree, 2: neutral, 3: disagree)
Nominal (unordered categories - e.g. male/female; called dichotomous when there are only two)
Interval (continuous; ordered with equal distances between points but no true zero - e.g. temperature in Celsius)
Ratio (continuous; ordered with equal distances between points and a true zero - e.g. height, weight)
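Example data types in pandas
These data types can be made explicit in pandas, which helps later when summarising or choosing a test. A minimal sketch (the category labels are illustrative):

```python
import pandas as pd

# Nominal: unordered categories
sex = pd.Series(['male', 'female', 'male'], dtype='category')

# Ordinal: ordered categories (a Likert item), low to high
likert = pd.Categorical(
    ['agree', 'disagree', 'neutral'],
    categories=['disagree', 'neutral', 'agree'],
    ordered=True,
)

sex.cat.ordered  # False (nominal)
likert.ordered   # True (ordinal)
likert.codes     # integer ranks [2, 0, 1], usable by rank-based tests
```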
KEY
[1]
Do you have two variables or fewer (in long format, at most two columns)?
yes --> go to 2
no --> This key won't help you.
[2]
How many variables and levels are in your data set?
1 variable / 1 level --> go to 3
2 variables / 1 or more levels --> go to 7
[3]
What data type do you have... ordinal, nominal, interval, ratio?
ordinal --> go to 4 (where an observation follows a ranked scale, e.g. 1: agree, 2: neutral, 3: disagree)
nominal --> go to 5 (where an observation is one of k categorical choices, e.g. male/female --> check if the data is already cross tabulated and contains count data)
interval/ratio --> go to 6 (where an observation is a measurement, e.g. 85, 84.2, 86.8, 90.2)
[4]
TEST using the Wilcoxon signed-rank test. First check that you have more than 20 samples (scipy uses a normal approximation).
yes --> scipy.stats.wilcoxon (normal approximation)
no --> use an exact alternative (e.g. the sign test; the normal approximation is unreliable for N <= 20)
Example
"""
scipy.stats.wilcoxon(x, y=None, zero_method='wilcox', correction=False)
y is optional, uses normal approximation (use N>20)
"""
# Data
# 1: strongly disagree
# 2: disagree
# 3: neutral
# 4: agree
# 5: strongly agree
# Question
# Course A improved my knowledge of X
# library
import numpy as np
import scipy.stats as ss
import statistics as st
# Data (20 Likert responses)
df=np.array([1,2,3,4,5,1,2,3,4,5,1,2,3,4,5,1,2,3,4,5])
# Summarise
ss.mode(df) # 1 (every value ties with count 4; scipy returns the smallest)
st.mean(df) # 3.0
st.median(df) # 3.0
min(df) # 1
max(df) # 5
# Test (using normal approximation)
# H0: responses are symmetric about 3 (neutral)
ss.wilcoxon(df-3)
# Output
# ss.wilcoxon(df-3)
# WilcoxonResult(statistic=68.0, pvalue=1.0)
# ss.wilcoxon(df-5)
# WilcoxonResult(statistic=0.0, pvalue=0.0004002)
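Example sign test (N <= 20)
For the small-sample branch above, one simple exact alternative is the sign test on the differences: under H0 the median difference is zero, so positive and negative differences are equally likely. A sketch using scipy.stats.binomtest (named binom_test in older scipy releases); the data here are illustrative:

```python
import numpy as np
from scipy.stats import binomtest

# Small sample: 8 Likert responses; H0: the median response is 3 (neutral)
df = np.array([4, 5, 3, 4, 4, 2, 5, 4])
diffs = df - 3
diffs = diffs[diffs != 0]        # drop ties with the hypothesised median
n_pos = int(np.sum(diffs > 0))   # 6 positive differences out of 7

# Under H0, a positive difference has probability 0.5
result = binomtest(n_pos, n=len(diffs), p=0.5)
result.pvalue                    # exact two-sided p-value: 0.125
```

The sign test discards the magnitude of the differences, so it has less power than the signed-rank test, but its p-value is exact at any sample size.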
[5]
Can you pass these assumptions?
- One categorical variable with K groups
- Independent (obs are not related)
- Obs are mutually exclusive (each obs falls in only one category)
- The expected count in each group is at least five
yes --> TEST: Chi-Squared Goodness-of-fit Test
no --> consider combining sparse categories, or use an exact multinomial test
Example
import pandas as pd
import scipy.stats as ss
# Data (nominal)
"""
partyA
partyB
partyC
partyD
"""
# Question
# Is there a preference for one party?
# Responses (use pandas Series - it has an easy count method)
df=pd.Series([
'partyD', 'partyD', 'partyD', 'partyA',
'partyA', 'partyC', 'partyC', 'partyC', 'partyA',
'partyD', 'partyB', 'partyC', 'partyA', 'partyC',
'partyA', 'partyA', 'partyB', 'partyB', 'partyA',
'partyA', 'partyA', 'partyC', 'partyA', 'partyD',
'partyB','partyD', 'partyB', 'partyA', 'partyA',
'partyA','partyA'
])
# summarise party preferences
counts = df.value_counts()
counts
#partyA 14
#partyD 6
#partyC 6
#partyB 5
#dtype: int64
# Run chi square test
ss.chisquare(counts)
#Power_divergenceResult(statistic=6.806451612903226, pvalue=0.078329476194404501)
# set expected values (f_exp order must match counts: partyA, partyD, partyC, partyB)
# (i.e. was the election outcome different to the pre-election poll?)
ss.chisquare(counts, f_exp=[5,9,9,8])
#Power_divergenceResult(statistic=19.324999999999999, pvalue=0.00023419202567505573)
[6]
Do you have a normal distribution (or more than 30 samples)?
yes --> TEST: One sample T-test
no --> go to 4
Example (yes)
import numpy as np
import scipy.stats as ss
import statistics as st
# data
df=np.array([2.2,2.5,3.3,4.2,4.5,4.7,4.8,4.5,5.5,6.5,6,5,5.1,5.9,6.5,6.1,6.3,5.3,5.7,5.5,5.8,5.9,6.1,6.4,6.6,6.7,6.9,7.0,7.0,7.2,8.5,9.2,10.0])
# Summarise
len(df) # 33
ss.mode(df) # 4.5
st.mean(df) # 5.8606
st.median(df) # 5.90
min(df) # 2.2
max(df) # 10.0
# test normal distribution (Anderson Darling)
import statsmodels.api as sm
sm.stats.normal_ad(df)
#(0.52594355974574114, 0.16704288032057493)
# single sample t-test
# H0: the mean of df is 6.8 (rejected below)
t_statistic, p_value = ss.ttest_1samp(df, 6.8)
t_statistic
#-3.301025997929885
p_value
#0.0023728378433198723
# another test
# (fail to reject H0: cannot reject that df has a mean of 5.8)
ss.ttest_1samp(df, 5.8)
#Ttest_1sampResult(statistic=0.21296941922127938, pvalue=0.83270177497674891)
[7]
What type of data do you have as a dependent variable?
nominal --> go to 8
ordinal/interval/ratio --> go to 9
[8]
Is the independent variable nominal/ordinal?
yes --> Chi square test
no --> reconsider your variables:
- nominal DV with an interval/ratio IV: convert the nominal DV to a dummy variable --> go to 12
- continuous DV grouped by a nominal IV --> go to 14 or 15
Example Chi Square Test
"""
H0: exam outcome (pass/fail) is independent of sex.
"""
# Data (Exam outcome grouped by males and females)
sex=pd.Series(['male','male','male','male','male','female','female','female','female','female','female'])
outcome=pd.Series(['fail','fail','fail','fail','pass','fail','pass','pass','pass','pass','pass'])
# cross tabulate
ct = pd.crosstab(sex, outcome)
ct
#col_0 fail pass
#row_0
#female 1 5
#male 4 1
# Run
ss.chi2_contingency(ct)
#(2.2274999999999996, 0.13557305375093759, 1,
# array([[ 2.72727273, 3.27272727],
# [ 2.27272727, 2.72727273]])
"""
Do not reject the null hypothesis. There is no statistically significant association between gender and exam outcome (Chi2=2.228, df=1, p=0.1356).
"""
[9]
Does the IV have one level?
yes, my data set has 2 columns with interval/ratio/ordinal data types --> go to 10
no, I have a factor with 2 levels --> go to 14
no, I have 3 or more levels --> go to 15
[10]
Are the 2 samples correlated or paired?
yes --> go to 11
no, both samples are independent --> go to 14
[11]
Are you testing a correlation (e.g. as x increases, so too does y)?
yes --> go to 12
no, I am testing the mean difference --> go to 13
[12]
Are both samples normally distributed and are both samples of interval/ratio data type?
yes --> Simple Linear Regression
no, the IV is ordinal --> Spearman Rank Order correlation
Example normality and linear regression
import numpy as np
import scipy.stats as ss
# Data (of equal length)
df1=np.array([1.1,1.2,3.7,4.1,4.3,5.1,5.3,5.5,5.8, 6.0,6.2,7.5,9.0])
df2=np.array([2.2,2.1,2.9,4.1,4.6,4.1,5.9,5.7,6.5, 7.0,10.0,12.0,12.2])
# COMPARE DISTRIBUTIONS (two-sample Kolmogorov-Smirnov test)
# Note: ks_2samp tests whether the two samples come from the same
# distribution; it is not a normality test. To test normality of each
# sample, use e.g. ss.shapiro(df1).
ss.ks_2samp(df1, df2)
#Ks_2sampResult(statistic=0.23076923076923073, pvalue=0.82810472891460341)
"""we do not reject the null hypothesis that the 2
samples come from the same continuous distribution.
Move on to the linear regression."""
# LINEAR REGRESSION
ss.linregress(df1,df2)
#LinregressResult(slope=1.406281327494328, intercept=-0.90977154012557371, rvalue=0.90729902346002733, pvalue=1.8533741181296456e-05, stderr=0.1965065233422113)
Example Spearman Rank Order
"""
H0: The two sets of data are not correlated.
"""
# Data (mid term exam scores vs a 7 point Likert item measuring enjoyment of the exam: is there a correlation between exam enjoyment and grade?)
df1=np.array([45,50,55,56,58,59,60,66,67,74,75])
df2=np.array([1,1,2,3,3,3,4,5,5,7,7])
# Run
ss.spearmanr(df1, df2)
#SpearmanrResult(correlation=0.98396230526469775, pvalue=4.792371830033223e-08)
"""
Reject the null hypothesis in favour of the alternative hypothesis; the mid term exam scores are
correlated with final exam scores (rho=0.984,p<0.001).
"""
[13]
Can you pass the following assumptions?
- the variables are continuous (interval or ratio)
- samples are paired
- no significant outliers
- the distribution of the differences is approximately normal
yes --> Paired Sample t-test
no --> Wilcoxon signed-rank test (on the paired samples)
Example Paired Sample t-test
import scipy.stats as ss
import numpy as np
# PAIRED SAMPLE
# also called dependent t-test
"""
H0: the mean difference between paired observations is zero.
"""
# Test (reusing df1, df2 from the regression example in [12])
ss.ttest_rel(df1, df2)
#Ttest_relResult(statistic=-2.3753474894213884, pvalue=0.035058538954830479)
"""
we reject the null hypothesis in favour of the
alternative hypothesis; the mean difference between
paired obs is different to zero (t=-2.375, p=0.035).
"""
[14]
Are both samples normally distributed?
yes --> TEST: two sample t-test
no --> TEST: Mann-Whitney U test
Example T-Test
import scipy.stats as ss
import numpy as np
# 2 SAMPLE T-TEST (2 tailed t-test)
"""
H0: mean of df1 is greater or less than mean of df2
ss.ttest_ind(df1, df2)
"""
# Test
#Ttest_indResult(statistic=-0.98900296548064237, pvalue=0.33252850150447189)
"""
we cannot reject the null hypothesis. The sample mean
of df1 is not significantly different to the sample
mean of df2.
"""
Example Mann-Whitney U test
#-------------------
# Mann-Whitney U test
# Other names of the same test:
# 1. Mann–Whitney–Wilcoxon
# 2. Wilcoxon rank-sum test
# 3. Wilcoxon–Mann–Whitney test
# 4. Wilcoxon two-sample test
# Requires N > 20
"""
H0: The 2 samples come from the same distribution
"""
# Data (scores)
df1=np.array([56,57,55,56,58,59,60,63,63,64,65,78,79,50,55,57,68,76,65,66,67,70,71,67,68,69,55,58,59,60,61])
df2=np.array([75,76,75,78,71,72,73,76,65,54,57,60,62,69,71,72,45,49,88,87,78,69,79,80,81,84,81,79,76,78,79])
# Run
ss.mannwhitneyu(df1, df2)
#MannwhitneyuResult(statistic=208.5, pvalue=0.00013057671142387172)
"""
Reject the null hypothesis in favour of an
alternative exlanation; the ranked scores of df1 is
significantly different to the ranked scores of df2
(mw=208.5, p<0.001).
"""
[15]
Can you pass the following assumptions?
- Observations are independent between groups
- There are no outliers
- Data are normally distributed
- Groups have equal variances
yes --> TEST: One-way ANOVA
no --> TEST: Kruskal-Wallis test
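Example One-way ANOVA and Kruskal-Wallis
Neither test in [15] has an example above, so here is a sketch with three illustrative groups using scipy's f_oneway and kruskal:

```python
import numpy as np
import scipy.stats as ss

# Exam scores under three teaching methods (one nominal IV with 3 levels)
g1 = np.array([66, 68, 70, 72, 74, 75, 77, 79])
g2 = np.array([60, 62, 63, 65, 66, 68, 70, 71])
g3 = np.array([55, 57, 58, 60, 61, 63, 64, 66])

# One-way ANOVA - H0: all group means are equal (assumptions met)
anova = ss.f_oneway(g1, g2, g3)

# Kruskal-Wallis - H0: all groups come from the same distribution
# (the nonparametric fallback when assumptions fail)
kw = ss.kruskal(g1, g2, g3)
```

If the omnibus test rejects H0, follow up with pairwise comparisons (e.g. Tukey HSD after ANOVA, or pairwise Mann-Whitney with a multiple-comparison correction after Kruskal-Wallis).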