1 response variable and (at most) 1 independent variable
Date: June 30th 2016
Last updated: July 5th 2016
This is a key for choosing a method of analysis for data sets with one or two variables (1 dependent/response variable and, at most, 1 independent variable of nominal, ordinal or interval/ratio type).
Ordinal (ranked order, e.g. a Likert item: 1: agree, 2: neutral, 3: disagree)
Nominal (unordered categories, e.g. dichotomous male/female)
Interval (continuous; order and equal distances between points, but no true zero)
Ratio (scale data; order and distance between points, plus a true zero)
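A quick sketch of how these four types might be stored in pandas (the column names and values are made-up examples, not data from this post):

import pandas as pd

df = pd.DataFrame({
    # nominal: categories with no order
    'sex': pd.Categorical(['male', 'female', 'female']),
    # ordinal: ordered categories (a Likert item)
    'agreement': pd.Categorical(['agree', 'neutral', 'disagree'],
                                categories=['disagree', 'neutral', 'agree'],
                                ordered=True),
    # interval: ordered, equal distances, no true zero (temperature in C)
    'temperature_c': [21.5, 19.0, 23.2],
    # ratio: ordered, equal distances, true zero (height in cm)
    'height_cm': [172.0, 158.5, 181.3],
})

df.dtypes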
Do you have two variables or fewer? (When the data is in long format you have at most two columns.)
yes --> go to 2
no --> This key won't help you.
How many variables and levels are in your data set?
1 variable / 1 level --> go to 3
2 variables / 1, 2 or 3 levels --> go to 7
What data type do you have: ordinal, nominal, interval or ratio?
ordinal --> go to 4 (where an observation follows a ranked scale, e.g. 1: agree, 2: neutral, 3: disagree)
nominal --> go to 5 (where an observation is one of k categorical choices, e.g. male/female --> check if the data is already cross tabulated and contains count data)
interval/ratio --> go to 6 (where an observation is a measurement, e.g. 85, 84.2, 86.8, 90.2)
TEST: Wilcoxon signed-rank test. First check: do you have more than 20 samples? (scipy uses a normal approximation)
yes --> scipy.stats.wilcoxon (normal approximation)
no --> alternative?? (see the note after the example below)
""" scipy.stats.wilcoxon(x, y=None, zero_method='wilcox', correction=False) y is optional, uses normal approximation (use N>20) """ # Data # 1: strongly disagree # 2: disagree # 3: neutral # 4: agree # 5: strongly agree # Question # Course A improved my knowledge of X # library import scipy.stats as ss import statistics as st # Summarise ss.mode(df) # 5 st.mean(df) #4.2727 st.median(df) # 4.5 min(df) # 3 max(df) # 5 # Test (using normal approximation) df=np.array([1,2,3,4,5,1,2,3,4,5,1,2,3,4,5,1,2,3,4,5]) ss.wilcoxon(df-3) # Output # ss.wilcoxon(df-3) # WilcoxonResult(statistic=68.0, pvalue=1.0) # ss.wilcoxon(df-5) # WilcoxonResult(statistic=0.0, pvalue=0.0004002)
Can you pass these assumptions?
- One categorical variable with K groups
- Independent (obs are not related)
- Obs are mutually exclusive (obs fall in only one category)
- Each group has at least five observations
yes --> TEST: Chi-Squared Goodness-of-fit Test
no --> go to ...?
import pandas as pd
import scipy.stats as ss

# Data (nominal): partyA, partyB, partyC, partyD

# Question
# Is there a preference for one party?

# Responses (use a pandas Series - it has an easy count method)
df = pd.Series([
    'partyD', 'partyD', 'partyD', 'partyA', 'partyA', 'partyC',
    'partyC', 'partyC', 'partyA', 'partyD', 'partyB', 'partyC',
    'partyA', 'partyC', 'partyA', 'partyA', 'partyB', 'partyB',
    'partyA', 'partyA', 'partyA', 'partyC', 'partyA', 'partyD',
    'partyB', 'partyD', 'partyB', 'partyA', 'partyA', 'partyA',
    'partyA'
])

# Summarise party preferences
counts = df.value_counts()
counts
# partyA    14
# partyD     6
# partyC     6
# partyB     5
# dtype: int64

# Run the chi-square test (equal expected frequencies by default)
ss.chisquare(counts)
# Power_divergenceResult(statistic=6.806451612903226, pvalue=0.078329476194404501)

# Set expected values
# (i.e. was the election outcome different to the pre-election poll?)
ss.chisquare(counts, f_exp=[5, 9, 9, 8])
# Power_divergenceResult(statistic=19.324999999999999, pvalue=0.00023419202567505573)
Do you have a normal distribution (or more than 30 samples)?
yes --> TEST: One sample T-test
no --> go to 4
import numpy as np
import scipy.stats as ss
import statistics as st

# Data
df = np.array([2.2,2.5,3.3,4.2,4.5,4.7,4.8,4.5,5.5,6.5,6,5,5.1,5.9,
               6.5,6.1,6.3,5.3,5.7,5.5,5.8,5.9,6.1,6.4,6.6,6.7,6.9,
               7.0,7.0,7.2,8.5,9.2,10.0])

# Summarise
len(df)        # 33
ss.mode(df)    # 4.5
st.mean(df)    # 5.8606
st.median(df)  # 5.9
min(df)        # 2.2
max(df)        # 10.0

# Test for a normal distribution (Anderson-Darling)
import statsmodels.api as sm
sm.stats.normal_ad(df)
# (0.52594355974574114, 0.16704288032057493)
# p > 0.05, so normality is not rejected

# Single sample t-test
# (reject H0: the mean of df is 6.8)
t_statistic, p_value = ss.ttest_1samp(df, 6.8)
t_statistic  # -3.301025997929885
p_value      # 0.0023728378433198723

# Another test
# (fail to reject H0: cannot reject df as having a mean of 5.8)
ss.ttest_1samp(df, 5.8)
# Ttest_1sampResult(statistic=0.21296941922127938, pvalue=0.83270177497674891)
What type of data do you have as a dependent variable?
nominal --> go to 8
ordinal/interval/ratio --> go to 9
Is the independent variable nominal/ordinal?
yes --> TEST: Chi-square test of independence
no --> reconsider your dependent variable
- convert a nominal DV to a dummy variable --> go to 12 (see the dummy-variable sketch after the chi-square example below)
- continuous DV by nominal IV --> go to 14 or 15
Example Chi Square Test
""" H0: males and females are associated with pass and fail of an exam """ # Data (Exam outcome grouped by males and females) sex=pd.Series(['male','male','male','male','male','female','female','female','female','female','female']) outcome=pd.Series(['fail','fail','fail','fail','pass','fail','pass','pass','pass','pass','pass']) # cross tabulate ct = pd.crosstab(sex, outcome) ct #col_0 fail pass #row_0 #female 1 5 #male 4 1 # Run ss.chi2_contingency(ct) #(2.2274999999999996, 0.13557305375093759, 1, # array([[ 2.72727273, 3.27272727], # [ 2.27272727, 2.72727273]]) """ Do not reject the null hypothesis. There is no statistically significant association between gender and exam outcome (Chi2=2.228, df=1, p=0.1356). """
Does the IV have one level?
yes, my data set has 2 columns with interval/ratio/ordinal data types --> go to 10
no, I have a factor with 2 levels --> go to 14
no, I have 3 or more levels --> go to 15
Are the 2 samples correlated or paired?
yes --> go to 11
no, both samples are independent --> go to 14
Are you testing a correlation (e.g. as x increases, so too does y)?
yes --> go to 12
no, I am testing the mean difference --> go to 13
Are both samples normally distributed and are both samples of interval/ratio data type?
yes --> Simple Linear Regression
no, the IV is ordinal --> Spearman Rank Order correlation
Example normality and linear regression
import numpy as np
import scipy.stats as ss

# Data (of equal length)
df1 = np.array([1.1,1.2,3.7,4.1,4.3,5.1,5.3,5.5,5.8,6.0,6.2,7.5,9.0])
df2 = np.array([2.2,2.1,2.9,4.1,4.6,4.1,5.9,5.7,6.5,7.0,10.0,12.0,12.2])

# Compare the two distributions (two-sample Kolmogorov-Smirnov test;
# note this compares the samples to each other, it is not a normality
# test as such)
ss.ks_2samp(df1, df2)
# Ks_2sampResult(statistic=0.23076923076923073, pvalue=0.82810472891460341)
"""
We do not reject the null hypothesis that the two samples come from the
same continuous distribution. Move on to the linear regression.
"""

# Linear regression
ss.linregress(df1, df2)
# LinregressResult(slope=1.406281327494328, intercept=-0.90977154012557371,
#                  rvalue=0.90729902346002733, pvalue=1.8533741181296456e-05,
#                  stderr=0.1965065233422113)
Example Spearman Rank Order
""" H0: The two sets of data are not correlated. """ # Data (mid term exam scores vs 7 point likert item measuring enjoyment of exam: is there a correlation between exam enjoyment and grade?) df1=np.array([45,50,55,56,58,59,60,66,67,74,75]) df2=np.array([1,1,2,3,3,3,4,5,5,7,7]) # Run ss.spearmanr(df1, df2) #SpearmanrResult(correlation=0.98396230526469775, pvalue=4.792371830033223e-08) """ Reject the null hypothesis in favour of the alternative hypothesis; the mid term exam scores are correlated with final exam scores (rho=0.984,p<0.001). """
Can you pass the following assumptions?
- the variables are continuous (interval or ratio)
- samples are paired
- no significant outliers
- distribution of differences are approximately normally distributed
yes --> Paired Sample t-test
no --> TEST: Wilcoxon signed-rank test (the paired nonparametric alternative; go to 4, and see the sketch after the example below)
Example Paired Sample t-test
import numpy as np
import scipy.stats as ss

# PAIRED SAMPLE T-TEST
# also called the dependent t-test
"""
H0: the mean difference between paired observations is zero.
"""

# Test (using df1 and df2 from the linear regression example above)
ss.ttest_rel(df1, df2)
# Ttest_relResult(statistic=-2.3753474894213884, pvalue=0.035058538954830479)

"""
We reject the null hypothesis in favour of the alternative hypothesis;
the mean difference between paired observations is different to zero
(t=-2.375, p=0.035).
"""
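If the paired t-test assumptions fail, the signed-rank test from step 4 is the paired nonparametric counterpart; a minimal sketch using the same df1 and df2:

# Nonparametric paired alternative: Wilcoxon signed-rank test
# H0: the differences df1 - df2 are symmetric about zero
ss.wilcoxon(df1, df2)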
Are both samples normally distributed?
yes --> TEST: two sample t-test
no --> TEST: Mann-Whitney U test
import numpy as np
import scipy.stats as ss

# 2 SAMPLE T-TEST (two-tailed)
"""
H0: the mean of df1 is equal to the mean of df2.
"""

# Test (again using df1 and df2 from the linear regression example above)
ss.ttest_ind(df1, df2)
# Ttest_indResult(statistic=-0.98900296548064237, pvalue=0.33252850150447189)

"""
We cannot reject the null hypothesis. The sample mean of df1 is not
significantly different to the sample mean of df2.
"""
Example Mann-Whitney U test
#-------------------
# Mann-Whitney U test
# Other names for the same test:
# 1. Mann–Whitney–Wilcoxon
# 2. Wilcoxon rank-sum test
# 3. Wilcoxon–Mann–Whitney test
# 4. Wilcoxon two-sample test
# Requires N > 20
"""
H0: the two samples come from the same distribution.
"""
import numpy as np
import scipy.stats as ss

# Data (scores)
df1 = np.array([56,57,55,56,58,59,60,63,63,64,65,78,79,50,55,57,68,
                76,65,66,67,70,71,67,68,69,55,58,59,60,61])
df2 = np.array([75,76,75,78,71,72,73,76,65,54,57,60,62,69,71,72,45,
                49,88,87,78,69,79,80,81,84,81,79,76,78,79])

# Run
ss.mannwhitneyu(df1, df2)
# MannwhitneyuResult(statistic=208.5, pvalue=0.00013057671142387172)

"""
Reject the null hypothesis in favour of an alternative explanation;
the ranked scores of df1 are significantly different to the ranked
scores of df2 (mw=208.5, p<0.001).
"""
Can you pass the following assumptions?
- Observations are independent between groups
- There are no outliers
- Data are normally distributed
- Groups have equal variances
yes --> TEST: One-way ANOVA
no --> TEST: Kruskal-Wallis test
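Neither test has a worked example above, so here is a minimal sketch of both on made-up scores for three groups:

import numpy as np
import scipy.stats as ss

# Hypothetical scores for three independent groups
g1 = np.array([5.1, 4.8, 6.2, 5.5, 5.9, 6.1])
g2 = np.array([6.5, 7.1, 6.8, 7.4, 6.9, 7.2])
g3 = np.array([5.0, 5.4, 4.9, 5.6, 5.2, 5.8])

# One-way ANOVA
# H0: all group means are equal
ss.f_oneway(g1, g2, g3)

# Kruskal-Wallis H test (the rank-based alternative)
# H0: all groups come from the same distribution
ss.kruskal(g1, g2, g3)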