1 response variable and (at most) 1 independent variable
Date: June 30th 2016
Last updated: July 5th 2016
This is a key for choosing a method of analysis for data sets with one or two variables (1 dependent/response variable and, at most, 1 independent variable of nominal, ordinal or interval/ratio type).
Ordinal (ranked order, e.g. a Likert item: 1: agree, 2: neutral, 3: disagree)
Nominal (unordered categories, e.g. dichotomous male/female)
Interval (continuous; order and equal distances between points, but no true zero)
Ratio (scale data; order and distance between points, plus a true zero)
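A quick sketch of how these four types might be stored in pandas (the column names and values are made-up examples, not data from this post):

import pandas as pd

df = pd.DataFrame({
    # nominal: categories with no order
    'sex': pd.Categorical(['male', 'female', 'female']),
    # ordinal: ordered categories (a Likert item)
    'agreement': pd.Categorical(['agree', 'neutral', 'disagree'],
                                categories=['disagree', 'neutral', 'agree'],
                                ordered=True),
    # interval: ordered, equal distances, no true zero (temperature in C)
    'temperature_c': [21.5, 19.0, 23.2],
    # ratio: ordered, equal distances, true zero (height in cm)
    'height_cm': [172.0, 158.5, 181.3],
})

df.dtypes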
Do you have two variables or fewer? (When the data is in long format you have at most two columns.)
yes --> go to 2
no --> This key won't help you.
How many variables and levels are in your data set?
1 variable / 1 level --> go to 3
2 variables / 1, 2 or 3 levels --> go to 7
What data type do you have: ordinal, nominal, interval or ratio?
ordinal --> go to 4 (where an observation follows a ranked scale, e.g. 1: agree, 2: neutral, 3: disagree)
nominal --> go to 5 (where an observation is one of k categorical choices, e.g. male/female --> check if the data is already cross tabulated and contains count data)
interval/ratio --> go to 6 (where an observation is a measurement, e.g. 85, 84.2, 86.8, 90.2)
TEST: Wilcoxon signed-rank test. First check: do you have more than 20 samples? (scipy uses a normal approximation)
yes --> scipy.stats.wilcoxon (normal approximation)
no --> alternative?? (see the note after the example below)
""" scipy.stats.wilcoxon(x, y=None, zero_method='wilcox', correction=False) y is optional, uses normal approximation (use N>20) """ # Data # 1: strongly disagree # 2: disagree # 3: neutral # 4: agree # 5: strongly agree # Question # Course A improved my knowledge of X # library import scipy.stats as ss import statistics as st # Summarise ss.mode(df) # 5 st.mean(df) #4.2727 st.median(df) # 4.5 min(df) # 3 max(df) # 5 # Test (using normal approximation) df=np.array([1,2,3,4,5,1,2,3,4,5,1,2,3,4,5,1,2,3,4,5]) ss.wilcoxon(df-3) # Output # ss.wilcoxon(df-3) # WilcoxonResult(statistic=68.0, pvalue=1.0) # ss.wilcoxon(df-5) # WilcoxonResult(statistic=0.0, pvalue=0.0004002)
Can you pass these assumptions?
- One categorical variable with K groups
- Independent (obs are not related)
- Obs are mutually exclusive (obs fall in only one category)
- Each group has at least five observations
yes --> TEST: Chi-Squared Goodness-of-fit Test
no --> go to ...?
import pandas as pd
import scipy.stats as ss

# Data (nominal): partyA, partyB, partyC, partyD

# Question
# Is there a preference for one party?

# Responses (use a pandas Series - it has an easy count method)
df = pd.Series([
    'partyD', 'partyD', 'partyD', 'partyA', 'partyA', 'partyC',
    'partyC', 'partyC', 'partyA', 'partyD', 'partyB', 'partyC',
    'partyA', 'partyC', 'partyA', 'partyA', 'partyB', 'partyB',
    'partyA', 'partyA', 'partyA', 'partyC', 'partyA', 'partyD',
    'partyB', 'partyD', 'partyB', 'partyA', 'partyA', 'partyA',
    'partyA'
])

# Summarise party preferences
counts = df.value_counts()
counts
# partyA    14
# partyD     6
# partyC     6
# partyB     5
# dtype: int64

# Run the chi-square test (equal expected frequencies by default)
ss.chisquare(counts)
# Power_divergenceResult(statistic=6.806451612903226, pvalue=0.078329476194404501)

# Set expected values
# (i.e. was the election outcome different to the pre-election poll?)
ss.chisquare(counts, f_exp=[5, 9, 9, 8])
# Power_divergenceResult(statistic=19.324999999999999, pvalue=0.00023419202567505573)
Do you have a normal distribution (or more than 30 samples)?
yes --> TEST: One sample T-test
no --> go to 4
import numpy as np
import scipy.stats as ss
import statistics as st

# Data
df = np.array([2.2,2.5,3.3,4.2,4.5,4.7,4.8,4.5,5.5,6.5,6,5,5.1,5.9,
               6.5,6.1,6.3,5.3,5.7,5.5,5.8,5.9,6.1,6.4,6.6,6.7,6.9,
               7.0,7.0,7.2,8.5,9.2,10.0])

# Summarise
len(df)        # 33
ss.mode(df)    # 4.5
st.mean(df)    # 5.8606
st.median(df)  # 5.9
min(df)        # 2.2
max(df)        # 10.0

# Test for a normal distribution (Anderson-Darling)
import statsmodels.api as sm
sm.stats.normal_ad(df)
# (0.52594355974574114, 0.16704288032057493)
# p > 0.05, so normality is not rejected

# Single sample t-test
# (reject H0: the mean of df is 6.8)
t_statistic, p_value = ss.ttest_1samp(df, 6.8)
t_statistic  # -3.301025997929885
p_value      # 0.0023728378433198723

# Another test
# (fail to reject H0: cannot reject df as having a mean of 5.8)
ss.ttest_1samp(df, 5.8)
# Ttest_1sampResult(statistic=0.21296941922127938, pvalue=0.83270177497674891)
What type of data do you have as a dependent variable?
nominal --> go to 8
ordinal/interval/ratio --> go to 9
Is the independent variable nominal/ordinal?
yes --> TEST: Chi-square test of independence
no --> reconsider your dependent variable
- convert a nominal DV to a dummy variable --> go to 12 (see the dummy-variable sketch after the chi-square example below)
- continuous DV by nominal IV --> go to 14 or 15
Example Chi Square Test
""" H0: males and females are associated with pass and fail of an exam """ # Data (Exam outcome grouped by males and females) sex=pd.Series(['male','male','male','male','male','female','female','female','female','female','female']) outcome=pd.Series(['fail','fail','fail','fail','pass','fail','pass','pass','pass','pass','pass']) # cross tabulate ct = pd.crosstab(sex, outcome) ct #col_0 fail pass #row_0 #female 1 5 #male 4 1 # Run ss.chi2_contingency(ct) #(2.2274999999999996, 0.13557305375093759, 1, # array([[ 2.72727273, 3.27272727], # [ 2.27272727, 2.72727273]]) """ Do not reject the null hypothesis. There is no statistically significant association between gender and exam outcome (Chi2=2.228, df=1, p=0.1356). """
Does the IV have one level?
yes, my data set has 2 columns with interval/ratio/ordinal data types --> go to 10
no, I have a factor with 2 levels --> go to 14
no, I have 3 or more levels --> go to 15
Are the 2 samples correlated or paired?
yes --> go to 11
no, both samples are independent --> go to 14
Are you testing a correlation (e.g. as x increases, so too does y)?
yes --> go to 12
no, I am testing the mean difference --> go to 13
Are both samples normally distributed and are both samples of interval/ratio data type?
yes --> Simple Linear Regression
no, the IV is ordinal --> Spearman Rank Order correlation
Example normality and linear regression
import numpy as np
import scipy.stats as ss

# Data (of equal length)
df1 = np.array([1.1,1.2,3.7,4.1,4.3,5.1,5.3,5.5,5.8,6.0,6.2,7.5,9.0])
df2 = np.array([2.2,2.1,2.9,4.1,4.6,4.1,5.9,5.7,6.5,7.0,10.0,12.0,12.2])

# Compare the two distributions (two-sample Kolmogorov-Smirnov test;
# note this compares the samples to each other, it is not a normality
# test as such)
ss.ks_2samp(df1, df2)
# Ks_2sampResult(statistic=0.23076923076923073, pvalue=0.82810472891460341)
"""
We do not reject the null hypothesis that the two samples come from the
same continuous distribution. Move on to the linear regression.
"""

# Linear regression
ss.linregress(df1, df2)
# LinregressResult(slope=1.406281327494328, intercept=-0.90977154012557371,
#                  rvalue=0.90729902346002733, pvalue=1.8533741181296456e-05,
#                  stderr=0.1965065233422113)
Example Spearman Rank Order
""" H0: The two sets of data are not correlated. """ # Data (mid term exam scores vs 7 point likert item measuring enjoyment of exam: is there a correlation between exam enjoyment and grade?) df1=np.array([45,50,55,56,58,59,60,66,67,74,75]) df2=np.array([1,1,2,3,3,3,4,5,5,7,7]) # Run ss.spearmanr(df1, df2) #SpearmanrResult(correlation=0.98396230526469775, pvalue=4.792371830033223e-08) """ Reject the null hypothesis in favour of the alternative hypothesis; the mid term exam scores are correlated with final exam scores (rho=0.984,p<0.001). """
Can you pass the following assumptions?
- the variables are continuous (interval or ratio)
- samples are paired
- no significant outliers
- distribution of differences are approximately normally distributed
yes --> Paired Sample t-test
no --> TEST: Wilcoxon signed-rank test (the paired nonparametric alternative; go to 4, and see the sketch after the example below)
Example Paired Sample t-test
import numpy as np
import scipy.stats as ss

# PAIRED SAMPLE T-TEST
# also called the dependent t-test
"""
H0: the mean difference between paired observations is zero.
"""

# Test (using df1 and df2 from the linear regression example above)
ss.ttest_rel(df1, df2)
# Ttest_relResult(statistic=-2.3753474894213884, pvalue=0.035058538954830479)

"""
We reject the null hypothesis in favour of the alternative hypothesis;
the mean difference between paired observations is different to zero
(t=-2.375, p=0.035).
"""
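If the paired t-test assumptions fail, the signed-rank test from step 4 is the paired nonparametric counterpart; a minimal sketch using the same df1 and df2:

# Nonparametric paired alternative: Wilcoxon signed-rank test
# H0: the differences df1 - df2 are symmetric about zero
ss.wilcoxon(df1, df2)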
Are both samples normally distributed?
yes --> TEST: two sample t-test
no --> TEST: Mann-Whitney U test
import numpy as np
import scipy.stats as ss

# 2 SAMPLE T-TEST (two-tailed)
"""
H0: the mean of df1 is equal to the mean of df2.
"""

# Test (again using df1 and df2 from the linear regression example above)
ss.ttest_ind(df1, df2)
# Ttest_indResult(statistic=-0.98900296548064237, pvalue=0.33252850150447189)

"""
We cannot reject the null hypothesis. The sample mean of df1 is not
significantly different to the sample mean of df2.
"""
Example Mann-Whitney U test
#-------------------
# Mann-Whitney U test
# Other names for the same test:
# 1. Mann–Whitney–Wilcoxon
# 2. Wilcoxon rank-sum test
# 3. Wilcoxon–Mann–Whitney test
# 4. Wilcoxon two-sample test
# Requires N > 20
"""
H0: the two samples come from the same distribution.
"""
import numpy as np
import scipy.stats as ss

# Data (scores)
df1 = np.array([56,57,55,56,58,59,60,63,63,64,65,78,79,50,55,57,68,
                76,65,66,67,70,71,67,68,69,55,58,59,60,61])
df2 = np.array([75,76,75,78,71,72,73,76,65,54,57,60,62,69,71,72,45,
                49,88,87,78,69,79,80,81,84,81,79,76,78,79])

# Run
ss.mannwhitneyu(df1, df2)
# MannwhitneyuResult(statistic=208.5, pvalue=0.00013057671142387172)

"""
Reject the null hypothesis in favour of an alternative explanation;
the ranked scores of df1 are significantly different to the ranked
scores of df2 (mw=208.5, p<0.001).
"""
Can you pass the following assumptions?
- Observations are independent between groups
- There are no outliers
- Data are normally distributed
- Groups have equal variances
yes --> TEST: One-way ANOVA
no --> TEST: Kruskal-Wallis test
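Neither test has a worked example above, so here is a minimal sketch of both on made-up scores for three groups:

import numpy as np
import scipy.stats as ss

# Hypothetical scores for three independent groups
g1 = np.array([5.1, 4.8, 6.2, 5.5, 5.9, 6.1])
g2 = np.array([6.5, 7.1, 6.8, 7.4, 6.9, 7.2])
g3 = np.array([5.0, 5.4, 4.9, 5.6, 5.2, 5.8])

# One-way ANOVA
# H0: all group means are equal
ss.f_oneway(g1, g2, g3)

# Kruskal-Wallis H test (the rank-based alternative)
# H0: all groups come from the same distribution
ss.kruskal(g1, g2, g3)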