Decision key

1 response variable and (at most) 1 independent variable

Date: June 30th 2016
Last updated: July 5th 2016

This is a key to determine the method of analysis for data sets that have one or two variables (1 dependent/response and 1 independent variable that is of nominal, ordinal or interval/ratio type).


DATA TYPES

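Ordinal (ranked order, e.g. a Likert item such as 1: agree, 2: neutral, 3: disagree)
Nominal (unordered categories, e.g. dichotomous male/female)
Interval (continuous; order and equal distance between points, but no true zero, e.g. temperature in Celsius)
Ratio (scale data; order, equal distance between points and a true zero, e.g. weight)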


KEY

[1]
Do you have two variables or fewer (when the data is in long format you have at most two columns)?

yes --> go to 2
no --> This key won't help you.
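
Counting columns as in [1] only works once the data is in long format. A minimal sketch, assuming pandas is available; the table, the column names (course_a, course_b, group, score) and the values are hypothetical and only illustrate the reshaping:

```python
import pandas as pd

# Hypothetical wide table: one column per group (values are illustrative only)
wide = pd.DataFrame({
    "course_a": [4, 5, 3],
    "course_b": [2, 3, 3],
})

# Long format: one column for the independent variable (group)
# and one column for the dependent variable (score)
long_df = pd.melt(wide, var_name="group", value_name="score")

n_columns = long_df.shape[1]
print(long_df)
print("columns in long format:", n_columns)
```

With two columns in long format the key applies; a single column of measurements corresponds to the "1 variable" branch in [2].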

[2]
How many variables and levels are in your data set?

1 variable / 1 level --> go to 3
2 variables / 1, 2, or 3 or more levels --> go to 7

[3]
What data type do you have... ordinal, nominal, interval, ratio?

ordinal --> go to 4 (where an observation follows a ranked scale, e.g. 1: agree, 2: neutral, 3: disagree)

nominal --> go to 5 (where an observation is one of k categorical choices, e.g. male/female --> check if the data is already cross tabulated and contains count data)

interval/ratio --> go to 6 (where an observation is a measurement, e.g. 85, 84.2, 86.8, 90.2)

[4]
TEST: Wilcoxon signed-rank test. First check: do you have more than 20 observations (scipy uses a normal approximation)?

yes --> scipy.stats.wilcoxon (normal approximation)
no --> alternative??

Example

"""
scipy.stats.wilcoxon(x, y=None, zero_method='wilcox', correction=False)
y is optional, uses normal approximation (use N>20)
"""
# Data
# 1: strongly disagree
# 2: disagree
# 3: neutral
# 4: agree
# 5: strongly agree

# Question 
# Course A improved my knowledge of X

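# library
import numpy as np
import scipy.stats as ss

# Responses (defined before they are summarised)
df=np.array([1,2,3,4,5,1,2,3,4,5,1,2,3,4,5,1,2,3,4,5])

# Summarise
ss.mode(df) # mode is 1 (every value occurs 4 times; the smallest is reported)
np.mean(df) # 3.0
np.median(df) # 3.0
min(df) # 1
max(df) # 5

# Test (using normal approximation)
# H0: the responses are symmetric about 3 (neutral)
ss.wilcoxon(df-3)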

# Output
# ss.wilcoxon(df-3)
# WilcoxonResult(statistic=68.0, pvalue=1.0)

# ss.wilcoxon(df-5)
# WilcoxonResult(statistic=0.0, pvalue=0.0004002)
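
The statistic reported above can be reproduced by hand, which shows what the test measures. This is a sketch of the usual textbook procedure (drop zero differences, rank the absolute differences with average ranks for ties, take the smaller of the two rank sums), not scipy's own implementation; the function name signed_rank_statistic is made up for this illustration.

```python
def signed_rank_statistic(values, hypothesised_median):
    """Return (statistic, w_plus, w_minus) for a one-sample signed-rank test."""
    # Differences from the hypothesised median; zero differences are dropped
    diffs = [v - hypothesised_median for v in values if v != hypothesised_median]
    abs_sorted = sorted(abs(d) for d in diffs)

    # Average rank for each distinct absolute difference (handles ties)
    rank_of = {}
    i = 0
    while i < len(abs_sorted):
        j = i
        while j < len(abs_sorted) and abs_sorted[j] == abs_sorted[i]:
            j += 1
        rank_of[abs_sorted[i]] = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        i = j

    w_plus = sum(rank_of[abs(d)] for d in diffs if d > 0)
    w_minus = sum(rank_of[abs(d)] for d in diffs if d < 0)
    return min(w_plus, w_minus), w_plus, w_minus


responses = [1, 2, 3, 4, 5] * 4  # same 20 responses as in the example above

stat_vs_3 = signed_rank_statistic(responses, 3)
stat_vs_5 = signed_rank_statistic(responses, 5)
print(stat_vs_3)  # rank sums are equal, so there is no evidence against a median of 3
print(stat_vs_5)  # all differences are negative, so the statistic is 0
```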

[5]
Can you pass these assumptions?

  1. One categorical variable with K groups
  2. Independent (obs are not related)
  3. Obs are mutually exclusive (obs fall in only one category)
  4. Each group has at least five observations

yes --> TEST: Chi-Squared Goodness-of-fit Test
no --> go to ...?

Example

import pandas as pd
import scipy.stats as ss

# Data (nominal)
"""
partyA 
partyB 
partyC 
partyD 
"""

# Question
# Is there a preference for one party?

# Responses (use pandas Series - it has an easy count method)
df=pd.Series([
'partyD', 'partyD', 'partyD', 'partyA',
'partyA', 'partyC', 'partyC', 'partyC', 'partyA',
'partyD', 'partyB', 'partyC', 'partyA', 'partyC',
'partyA', 'partyA', 'partyB', 'partyB', 'partyA',
'partyA', 'partyA', 'partyC', 'partyA', 'partyD',
'partyB','partyD', 'partyB', 'partyA', 'partyA',
'partyA','partyA'
])

# summarise party preferences
counts = df.value_counts()
counts
#partyA    14
#partyD     6
#partyC     6
#partyB     5
#dtype: int64

# Run chi square test
ss.chisquare(counts)
#Power_divergenceResult(statistic=6.806451612903226, pvalue=0.078329476194404501)

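# set expected values, in the same order as counts (partyA, partyD, partyC, partyB);
# they must sum to the same total as the observed counts
# (i.e. was the election outcome different to the pre-election poll?)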
ss.chisquare(counts, f_exp=[5,9,9,8])
#Power_divergenceResult(statistic=19.324999999999999, pvalue=0.00023419202567505573)
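
Both statistics above are the sum of (observed - expected)^2 / expected over the categories. A small sketch using only the standard library, with the counts taken from the example; the helper name chi_square_statistic is hypothetical:

```python
def chi_square_statistic(observed, expected=None):
    """Sum of (O - E)^2 / E; equal expected counts are used if none are given."""
    if expected is None:
        expected = [sum(observed) / len(observed)] * len(observed)
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


observed = [14, 6, 6, 5]  # partyA, partyD, partyC, partyB

# No preference: every party is expected to receive 31 / 4 = 7.75 responses
equal_stat = chi_square_statistic(observed)

# Expected counts from the pre-election poll used in the example
poll_stat = chi_square_statistic(observed, [5, 9, 9, 8])

print(round(equal_stat, 4), round(poll_stat, 4))
```

The expected count under "no preference" (7.75) is also the number to check against assumption 4.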

[6]
Do you have a normal distribution (or more than 30 samples)?

yes --> TEST: One sample T-test
no --> go to 4

Example (yes)

import numpy as np
import scipy.stats as ss
import statistics as st

# data
df=np.array([2.2,2.5,3.3,4.2,4.5,4.7,4.8,4.5,5.5,6.5,6,5,5.1,5.9,6.5,6.1,6.3,5.3,5.7,5.5,5.8,5.9,6.1,6.4,6.6,6.7,6.9,7.0,7.0,7.2,8.5,9.2,10.0])

# Summarise
len(df) # 33
ss.mode(df) # 4.5
st.mean(df) # 5.8606
st.median(df) # 5.90
min(df) # 2.2
max(df) # 10

# test normal distribution (Anderson Darling)
import statsmodels.api as sm
sm.stats.normal_ad(df)
#(0.52594355974574114, 0.16704288032057493)

# single sample t-test 
# (reject H0: the mean of df is 6.8)
t_statistic, p_value = ss.ttest_1samp(df, 6.8)
t_statistic
#-3.301025997929885
p_value
#0.0023728378433198723

# another test
# (do not reject H0: the data are consistent with a mean of 5.8)
ss.ttest_1samp(df, 5.8)
#Ttest_1sampResult(statistic=0.21296941922127938, pvalue=0.83270177497674891)
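
The t statistic is the distance between the sample mean and the hypothesised mean, measured in standard errors. A standard-library sketch of that arithmetic on the same data (the helper name one_sample_t is made up here; the p-value still needs the t distribution, e.g. from scipy):

```python
import math
import statistics as st

data = [2.2, 2.5, 3.3, 4.2, 4.5, 4.7, 4.8, 4.5, 5.5, 6.5, 6, 5, 5.1, 5.9,
        6.5, 6.1, 6.3, 5.3, 5.7, 5.5, 5.8, 5.9, 6.1, 6.4, 6.6, 6.7, 6.9,
        7.0, 7.0, 7.2, 8.5, 9.2, 10.0]


def one_sample_t(values, mu):
    """t = (sample mean - mu) / (sample standard deviation / sqrt(n))."""
    standard_error = st.stdev(values) / math.sqrt(len(values))
    return (st.mean(values) - mu) / standard_error


t_68 = one_sample_t(data, 6.8)  # far from zero: the mean differs from 6.8
t_58 = one_sample_t(data, 5.8)  # close to zero: consistent with a mean of 5.8
print(round(t_68, 3), round(t_58, 3))
```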

[7]
What type of data do you have as a dependent variable?

nominal --> go to 8
ordinal/interval/ratio --> go to 9

[8]
Is the independent variable nominal/ordinal?

yes --> Chi square test
no --> reconsider your dependent variable

  1. convert nominal DV to dummy variable --> go to 12
  2. continuous DV by nominal IV --> go to 14 or 15

Example Chi Square Test

"""
H0: sex (male/female) and exam outcome (pass/fail) are independent, i.e. not associated
"""
# Data (Exam outcome grouped by males and females)
sex=pd.Series(['male','male','male','male','male','female','female','female','female','female','female'])
outcome=pd.Series(['fail','fail','fail','fail','pass','fail','pass','pass','pass','pass','pass'])

# cross tabulate
ct = pd.crosstab(sex, outcome)
ct
#col_0   fail  pass
#row_0             
#female     1     5
#male       4     1

# Run
ss.chi2_contingency(ct)
#(2.2274999999999996, 0.13557305375093759, 1, 
# array([[ 2.72727273,  3.27272727],
#       [ 2.27272727,  2.72727273]])

"""
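Do not reject the null hypothesis. There is no statistically significant association between sex and exam outcome (Chi2=2.228, df=1, p=0.1356). Note that every expected count is below five, so the chi-square approximation is unreliable for a table this small.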
"""
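
The expected counts in the output come from (row total x column total) / n, and the statistic for a 2x2 table includes Yates' continuity correction, which chi2_contingency applies by default when there is one degree of freedom. A sketch of that arithmetic with the standard library only (variable names are illustrative):

```python
# Observed counts from the cross tabulation above
table = [[1, 5],   # female: fail, pass
         [4, 1]]   # male:   fail, pass

row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]
n = sum(row_totals)

# Expected count under independence: row total * column total / n
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Chi-square with Yates' correction: shrink each |O - E| by 0.5 before squaring
chi2 = sum(
    (abs(o - e) - 0.5) ** 2 / e
    for obs_row, exp_row in zip(table, expected)
    for o, e in zip(obs_row, exp_row)
)

print(expected)
print(round(chi2, 4))
```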

[9]
Does the IV have one level?

yes, my data set has 2 columns with interval/ratio/ordinal data types --> go to 10
no, I have a factor with 2 levels --> go to 14
no, I have 3 or more levels --> go to 15

[10]
Are the 2 samples correlated or paired?

yes --> go to 11
no, both samples are independent --> go to 14

[11]
Are you testing a correlation (e.g. as x increases so too does y)?

yes --> go to 12
no, I am testing the mean difference --> go to 13

[12]
Are both samples normally distributed and are both samples of interval/ratio data type?

yes --> Simple Linear Regression
no, the IV is ordinal --> Spearman Rank Order correlation

Example normality and linear regression

import numpy as np
import scipy.stats as ss

# Data (of equal length)
df1=np.array([1.1,1.2,3.7,4.1,4.3,5.1,5.3,5.5,5.8, 6.0,6.2,7.5,9.0])
df2=np.array([2.2,2.1,2.9,4.1,4.6,4.1,5.9,5.7,6.5, 7.0,10.0,12.0,12.2])

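# COMPARE DISTRIBUTIONS (two-sample Kolmogorov-Smirnov test)
ss.ks_2samp(df1, df2)
#Ks_2sampResult(statistic=0.23076923076923073, pvalue=0.82810472891460341)

"""we do not reject the null hypothesis that the 2
samples come from the same continuous distribution.
Note that ks_2samp compares the samples with each other;
it does not test normality. To check normality, test each
sample on its own, e.g. ss.shapiro(df1) and ss.shapiro(df2).
Move on to the regression."""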


# LINEAR REGRESSION
ss.linregress(df1,df2)
#LinregressResult(slope=1.406281327494328, intercept=-0.90977154012557371, rvalue=0.90729902346002733, pvalue=1.8533741181296456e-05, stderr=0.1965065233422113)
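
The slope, intercept and r value can be recovered from sums of squares, which makes the link between regression and correlation explicit. A standard-library sketch on the same data (least-squares formulas; variable names are illustrative):

```python
import math
import statistics as st

x = [1.1, 1.2, 3.7, 4.1, 4.3, 5.1, 5.3, 5.5, 5.8, 6.0, 6.2, 7.5, 9.0]
y = [2.2, 2.1, 2.9, 4.1, 4.6, 4.1, 5.9, 5.7, 6.5, 7.0, 10.0, 12.0, 12.2]

mean_x, mean_y = st.mean(x), st.mean(y)

# Sums of squares and cross-products around the means
s_xx = sum((a - mean_x) ** 2 for a in x)
s_yy = sum((b - mean_y) ** 2 for b in y)
s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))

slope = s_xy / s_xx                      # change in y per unit change in x
intercept = mean_y - slope * mean_x      # the fitted line passes through the means
r_value = s_xy / math.sqrt(s_xx * s_yy)  # Pearson correlation coefficient

print(round(slope, 4), round(intercept, 4), round(r_value, 4))
```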

Example Spearman Rank Order

"""
H0: The two sets of data are not correlated.
"""
# Data (mid term exam scores vs 7 point likert item measuring enjoyment of exam: is there a correlation between exam enjoyment and grade?)
df1=np.array([45,50,55,56,58,59,60,66,67,74,75])
df2=np.array([1,1,2,3,3,3,4,5,5,7,7])

# Run
ss.spearmanr(df1, df2)
#SpearmanrResult(correlation=0.98396230526469775, pvalue=4.792371830033223e-08)

"""
Reject the null hypothesis in favour of the alternative hypothesis; the mid term exam scores are
correlated with enjoyment of the exam (rho=0.984, p<0.001).
"""

[13]
Can you pass the following assumptions?

  1. the variables are continuous (interval or ratio)
  2. samples are paired
  3. no significant outliers
  4. the distribution of the differences is approximately normal

yes --> Paired Sample t-test
no --> Wilcoxon signed-rank test on the paired differences (see 4)

Example Paired Sample t-test

import scipy.stats as ss
import numpy as np

# PAIRED SAMPLE
# also called dependent t-test
""" 
H0: the mean difference between paired observations is zero.
"""

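# Data (the same paired arrays as in the linear regression example in [12])
df1=np.array([1.1,1.2,3.7,4.1,4.3,5.1,5.3,5.5,5.8, 6.0,6.2,7.5,9.0])
df2=np.array([2.2,2.1,2.9,4.1,4.6,4.1,5.9,5.7,6.5, 7.0,10.0,12.0,12.2])

# Test
ss.ttest_rel(df1, df2)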
#Ttest_relResult(statistic=-2.3753474894213884, pvalue=0.035058538954830479)

"""
we reject the null hypothesis in favour of the
alternative hypothesis; the mean difference between
paired obs is different to zero (t=-2.375, p=0.035). 
"""
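
A paired t-test is a one-sample t-test on the pairwise differences against a mean of zero. The sketch below shows that arithmetic with the standard library, assuming the two 13-value arrays from the linear regression example in [12] (they give the same t as reported above):

```python
import math
import statistics as st

before = [1.1, 1.2, 3.7, 4.1, 4.3, 5.1, 5.3, 5.5, 5.8, 6.0, 6.2, 7.5, 9.0]
after = [2.2, 2.1, 2.9, 4.1, 4.6, 4.1, 5.9, 5.7, 6.5, 7.0, 10.0, 12.0, 12.2]

# One difference per pair; the pairing is what distinguishes this
# from the independent two-sample test in [14]
differences = [a - b for a, b in zip(before, after)]

mean_diff = st.mean(differences)
standard_error = st.stdev(differences) / math.sqrt(len(differences))
t_paired = mean_diff / standard_error

print(round(mean_diff, 4), round(t_paired, 3))
```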

[14]
Are both samples normally distributed?

yes --> TEST: two sample t-test
no --> TEST: Mann-Whitney U test

Example T-Test

import scipy.stats as ss
import numpy as np

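# 2 SAMPLE T-TEST (2 tailed t-test)
"""
H0: the mean of df1 is equal to the mean of df2
H1: the mean of df1 is greater or less than the mean of df2
"""

# Test (df1 and df2 are the arrays from the linear regression example in [12])
ss.ttest_ind(df1, df2)
#Ttest_indResult(statistic=-0.98900296548064237, pvalue=0.33252850150447189)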

"""
we cannot reject the null hypothesis. The sample mean
of df1 is not significantly different to the sample
mean of df2.  
"""

Example Mann-Whitney U test

#-------------------
# Mann-Whitney U test

# Other names of the same test:
# 1. Mann–Whitney–Wilcoxon 
# 2. Wilcoxon rank-sum test
# 3. Wilcoxon–Mann–Whitney test
# 4. Wilcoxon two-sample test
# The normal approximation is recommended only for N > 20

"""
H0: The 2 samples come from the same distribution
"""
# Data (scores)
df1=np.array([56,57,55,56,58,59,60,63,63,64,65,78,79,50,55,57,68,76,65,66,67,70,71,67,68,69,55,58,59,60,61])
df2=np.array([75,76,75,78,71,72,73,76,65,54,57,60,62,69,71,72,45,49,88,87,78,69,79,80,81,84,81,79,76,78,79])

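# Run
# (output from the SciPy version in use when this was written; newer
# versions default to a two-sided test, so the p-value may differ)
ss.mannwhitneyu(df1, df2)
#MannwhitneyuResult(statistic=208.5, pvalue=0.00013057671142387172)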

"""
Reject the null hypothesis in favour of the
alternative hypothesis; the ranked scores of df1 are
significantly different to the ranked scores of df2
(U=208.5, p<0.001).
"""
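
The U statistic counts, over all pairs of one score from each sample, how often the first sample's score is the larger one (ties count as half). A brute-force sketch with the standard library; the function name u_statistic is made up for illustration and this is not how scipy computes it internally:

```python
scores_1 = [56, 57, 55, 56, 58, 59, 60, 63, 63, 64, 65, 78, 79, 50, 55, 57,
            68, 76, 65, 66, 67, 70, 71, 67, 68, 69, 55, 58, 59, 60, 61]
scores_2 = [75, 76, 75, 78, 71, 72, 73, 76, 65, 54, 57, 60, 62, 69, 71, 72,
            45, 49, 88, 87, 78, 69, 79, 80, 81, 84, 81, 79, 76, 78, 79]


def u_statistic(x, y):
    """Number of (x, y) pairs where x beats y, counting ties as 0.5."""
    wins = sum(1 for a in x for b in y if a > b)
    ties = sum(1 for a in x for b in y if a == b)
    return wins + 0.5 * ties


u1 = u_statistic(scores_1, scores_2)
u2 = u_statistic(scores_2, scores_1)

# The two statistics always add up to the number of pairs
print(u1, u2, len(scores_1) * len(scores_2))
```

Here u1 is much smaller than half the number of pairs, which is why the test reports a difference between the samples.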

[15]
Can you pass the following assumptions?

  1. Observations are independent between groups
  2. There are no outliers
  3. Data are normally distributed
  4. Groups have equal variances

yes --> TEST: One-way ANOVA
no --> TEST: Kruskal-Wallis test
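
Example One-way ANOVA and Kruskal-Wallis test

A minimal sketch of both branches of [15], assuming scipy is installed. The three groups and their values are hypothetical, invented only to show the calls; no output is quoted because none was computed for this key.

```python
import scipy.stats as ss

# Hypothetical scores for three independent groups (illustrative values only)
group_a = [5.1, 5.5, 6.0, 6.2, 5.8, 6.1]
group_b = [6.4, 6.9, 7.1, 6.6, 7.3, 6.8]
group_c = [5.9, 6.3, 6.1, 6.7, 6.0, 6.5]

# Assumption 4 (equal variances): Levene's test, H0: the variances are equal
levene_result = ss.levene(group_a, group_b, group_c)

# One-way ANOVA, H0: all group means are equal
anova_result = ss.f_oneway(group_a, group_b, group_c)

# Kruskal-Wallis test, H0: all groups come from the same distribution
kruskal_result = ss.kruskal(group_a, group_b, group_c)

print(levene_result.pvalue, anova_result.pvalue, kruskal_result.pvalue)
```

If the assumptions in [15] hold, read the ANOVA result; otherwise read the Kruskal-Wallis result.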
