Chapter 5 TESTING NORMALITY

Required Packages

library(tidyverse)    # Loads several very helpful 'tidy' packages
library(haven)        # Read in SPSS datasets
library(psych)        # Descriptive statistics, including describe()

Example: Cancer Experiment

The Cancer dataset was introduced in chapter 3.

5.1 Skewness & Kurtosis

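The psych::describe() function reports skewness and kurtosis alongside other summary statistics for each variable. The kurtosis it reports is excess kurtosis, so values near zero for both skew and kurtosis are consistent with a normal distribution. Note that the se column is the standard error of the mean (sd divided by the square root of n), not a standard error for skewness or kurtosis.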

cancer_clean %>% 
  dplyr::select(age, totalcw4) %>% 
  psych::describe()
         vars  n  mean    sd median trimmed   mad min max range  skew
age         1 25 59.64 12.93     60   59.95 11.86  27  86    59 -0.31
totalcw4    2 25 10.36  3.47     10   10.19  2.97   6  17    11  0.49
         kurtosis   se
age         -0.01 2.59
totalcw4    -1.00 0.69
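
To make the skew and kurtosis columns concrete, here is a minimal base R sketch of the moment-based definitions. The helper names central_moment(), moment_skewness(), and moment_excess_kurtosis() are hypothetical (they are not psych functions), and the data are made up. psych::describe() may apply a slightly different small-sample convention, so its values can differ a little from these on small samples.

```r
# Hypothetical helpers for illustration; these are not psych functions.

# k-th central moment: the average of (x - mean)^k, ignoring missing values
central_moment <- function(x, k) {
  x <- x[!is.na(x)]
  mean((x - mean(x))^k)
}

# Skewness: third central moment scaled by variance^(3/2)
moment_skewness <- function(x) {
  central_moment(x, 3) / central_moment(x, 2)^(3 / 2)
}

# Excess kurtosis: fourth central moment scaled by variance^2, minus 3
# (subtracting 3 makes a normal distribution score 0)
moment_excess_kurtosis <- function(x) {
  central_moment(x, 4) / central_moment(x, 2)^2 - 3
}

symmetric_x    <- c(1, 2, 3, 4, 5)     # made-up data, perfectly symmetric
right_skewed_x <- c(1, 1, 1, 2, 10)    # made-up data with a long right tail

moment_skewness(symmetric_x)           # 0: no asymmetry
moment_skewness(right_skewed_x)        # positive: long right tail
moment_excess_kurtosis(symmetric_x)    # negative: lighter tails than a normal curve
```

A negative skew, as for age above, indicates a longer left tail; a positive skew, as for totalcw4, indicates a longer right tail.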

5.2 Shapiro-Wilk’s Test

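The shapiro.test() function tests the null hypothesis that a sample comes from a normally distributed population. It is best suited to small and moderate samples (R's implementation accepts between 3 and 5000 non-missing values). The function expects a single variable as a vector, so extract the column with dplyr::pull() first.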

If \(p\text{-value} > \alpha\), then the sample does NOT provide evidence that the population is non-normally distributed. This is a failure to reject normality, not proof of it.

cancer_clean %>% 
  dplyr::pull(age) %>% 
  shapiro.test()

    Shapiro-Wilk normality test

data:  .
W = 0.98317, p-value = 0.9399

If \(p\text{-value} < \alpha\), then the sample DOES provide evidence that the population is non-normally distributed.

cancer_clean %>% 
  dplyr::pull(totalcw4) %>% 
  shapiro.test()

    Shapiro-Wilk normality test

data:  .
W = 0.9131, p-value = 0.03575
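
The decision rule can also be applied in code, since shapiro.test() returns a list whose statistic and p.value elements can be extracted. The sketch below is only an illustration: the significance level of 0.05, the seed, and the simulated data are all assumptions, and the data are not from the Cancer dataset.

```r
alpha <- 0.05                      # assumed significance level

set.seed(2024)                     # arbitrary seed so the simulation is repeatable
simulated_scores <- rnorm(25, mean = 60, sd = 13)   # made-up data

sw_result <- shapiro.test(simulated_scores)

sw_result$statistic                # the W statistic
sw_result$p.value                  # the p-value

# Compare the p-value to alpha and report the conclusion
if (sw_result$p.value < alpha) {
  message("Evidence of non-normality at alpha = ", alpha)
} else {
  message("No evidence of non-normality at alpha = ", alpha)
}
```

With the output shown above and the same assumed \(\alpha = 0.05\), age (p-value = 0.9399) falls in the second branch and totalcw4 (p-value = 0.03575) falls in the first.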

5.3 Histogram

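Histograms provide a visual way to judge whether data are approximately normally distributed. Look for a symmetric ‘bell’ shape with a single peak. With a small sample the shape can change noticeably with the choice of binwidth, so it is worth trying more than one.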

cancer_clean %>% 
  ggplot(aes(age)) +
  geom_histogram(binwidth = 5)

cancer_clean %>% 
  ggplot(aes(totalcw4)) +
  geom_histogram(binwidth = 1)
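
Judging a ‘bell’ shape by eye is easier with a reference curve. One possible approach, sketched here with simulated data rather than the Cancer dataset, is to draw the histogram on the density scale and overlay a normal curve that uses the sample mean and standard deviation. The object names, seed, and color are arbitrary choices, and after_stat() assumes ggplot2 version 3.3.0 or later.

```r
library(ggplot2)

set.seed(2024)                                                   # arbitrary seed
sim_data <- data.frame(score = rnorm(200, mean = 60, sd = 13))   # made-up data

hist_with_curve <- ggplot(sim_data, aes(x = score)) +
  # Put the histogram on the density scale so it is comparable to the curve
  geom_histogram(aes(y = after_stat(density)), binwidth = 5) +
  # Normal density using the sample mean and standard deviation
  stat_function(
    fun   = dnorm,
    args  = list(mean = mean(sim_data$score), sd = sd(sim_data$score)),
    color = "red"
  )

hist_with_curve
```

Bars that follow the curve closely suggest approximate normality; bars piled up on one side of the curve suggest skew.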

5.4 Q-Q Plot

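Quantile-quantile plots also help visually determine if data are approximately normally distributed. Look for the points to fall along a straight line; systematic curvature, or points that stray from the line at the ends, suggests non-normality. The line sits at \(45^{\circ}\) only when the sample is standardized, because geom_qq() plots the raw sample values against standard normal quantiles.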

cancer_clean %>% 
  ggplot(aes(sample = age)) +
  geom_qq()

cancer_clean %>% 
  ggplot(aes(sample = totalcw4)) +
  geom_qq()
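
A reference line makes departures from straightness easier to see. As a minimal sketch with simulated data (not the Cancer dataset; the seed and object names are arbitrary), geom_qq_line() can be added to the plot. By default it draws a line through the first and third quartiles of the sample.

```r
library(ggplot2)

set.seed(2024)                                                  # arbitrary seed
sim_data <- data.frame(score = rnorm(50, mean = 60, sd = 13))   # made-up data

qq_with_line <- ggplot(sim_data, aes(sample = score)) +
  geom_qq() +         # sample quantiles against theoretical normal quantiles
  geom_qq_line()      # reference line through the first and third quartiles

qq_with_line
```

The same two layers can be added to the age and totalcw4 plots above by appending geom_qq_line() after geom_qq().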