Chapter 4 STANDARDIZING SCORES

Chapter Links

Assignment Links

Required Packages

library(tidyverse)    # Loads several very helpful 'tidy' packages
library(haven)        # Read in SPSS datasets
library(furniture)    # Nice tables (by our own Tyson Barrett)

Example: Cancer Experiment

The Cancer dataset was introduced in chapter 3.

4.1 Standardize Variables - Manually

You can manually create a stadradized version of the age variable.

First, you must find the mean and standard deviation of the age variable.

cancer_clean %>%
  furniture::table1(age)

-----------------------
     Mean/Count (SD/%)
     n = 25           
 age                  
     59.6 (12.9)      
-----------------------

Second, write an equation to do the calculation.

cancer_clean %>%
  dplyr::mutate(agez = (age - 59.6) / 12.9) %>% 
  dplyr::select(id, trt, age, agez)
# A tibble: 25 x 4
   id    trt       age    agez
   <fct> <fct>   <dbl>   <dbl>
 1 1     Placebo  52.0 -0.589 
 2 5     Placebo  77.0  1.35  
 3 6     Placebo  60.0  0.0310
 4 9     Placebo  61.0  0.109 
 5 11    Placebo  59.0 -0.0465
 6 15    Placebo  69.0  0.729 
 7 21    Placebo  67.0  0.574 
 8 26    Placebo  56.0 -0.279 
 9 31    Placebo  61.0  0.109 
10 35    Placebo  51.0 -0.667 
# ... with 15 more rows

4.2 Standardize Variables - with the scale() funciton

A quicker way is to use a funciton. Notice the differences due to rounding.

cancer_new <- cancer_clean %>%
  dplyr::mutate(agez = (age - 59.6) / 12.9) %>% 
  dplyr::mutate(ageZ = scale(age))%>%
  dplyr::select(id, trt, age, agez, ageZ)

cancer_new
# A tibble: 25 x 5
   id    trt       age    agez    ageZ
   <fct> <fct>   <dbl>   <dbl>   <dbl>
 1 1     Placebo  52.0 -0.589  -0.591 
 2 5     Placebo  77.0  1.35    1.34  
 3 6     Placebo  60.0  0.0310  0.0278
 4 9     Placebo  61.0  0.109   0.105 
 5 11    Placebo  59.0 -0.0465 -0.0495
 6 15    Placebo  69.0  0.729   0.724 
 7 21    Placebo  67.0  0.574   0.569 
 8 26    Placebo  56.0 -0.279  -0.281 
 9 31    Placebo  61.0  0.109   0.105 
10 35    Placebo  51.0 -0.667  -0.668 
# ... with 15 more rows

You can check that the new variable does in deed have mean of zero and spread of one.

cancer_new %>% 
  furniture::table1(age, agez, ageZ,
                    digits = 8)

--------------------------------
      Mean/Count (SD/%)        
      n = 25                   
 age                           
      59.64000000 (12.93213053)
 agez                          
      0.00310078 (1.00249074)  
 ageZ                          
      -0.00000000 (1.00000000) 
--------------------------------

Both the mean and the standard deviation are different.

cancer_new %>% 
  tidyr::gather(key = "variable",
                value = "value",
                age, ageZ) %>% 
  ggplot(aes(value)) +
  geom_histogram(bins = 8) +
  facet_grid(. ~ variable)

However, if you let the scale of the x-axis change, you see the shape of the two variables is identical.

cancer_new %>% 
  tidyr::gather(key = "variable",
                value = "value",
                age, ageZ) %>% 
  ggplot(aes(value)) +
  geom_histogram(bins = 8) +
  facet_grid(. ~ variable, scale = "free_x")