Chapter 1 DATA PREPARTION

Chapter Links

Chapter 1 Slide Show

Assignment Links

1.1 Preparing the Environment

1.1.1 The `library()` Function: Load a Package

You will need TWO packages:

tidyverse Easily Install and Load the ‘Tidy universe’ of packages (Wickham 2017)
readxl Read Excel Files (Wickham and Bryan 2017)
furniture Nice Tables and Row-wise functions (by Tyson) (Barrett, Brignone, and Laxman 2018)

Make sure the packages are installed (Package tab)

The function library() checks the package out, or makes it active.

library(tidyverse)    # Loads several very helpful 'tidy' packages
library(readxl)       # Read in Excel datasets
library(furniture)    # Nice tables (by our own Tyson Barrett)

1.1.2 The `readxl::read_excel()` Function: Read in Excel Data Files

Make sure the dataset is saved in the same folder as this file

Make sure the that folder is the working directory

Now we are ready to open the data with the read_excel() function from the readxl package
- readxl::read_excel() the double colon specifies the package::function()
- the only thing required inside the () is the quoted name of the Excel file
- Make sure it is stored in the .Rmd’s folder
- Make sure to include the file’s extension (.xls)

NOTE: a tibble is basically just a “table” of data, the way the tidy-verse represents data sets.

read_excel("Ihno_dataset.xls")

1.2 Opperators and Helpful Functions

1.2.1 The Assignment Opperator `<-`: Save things to a name

the <- combination of symbols makes assignments
tells R to store the dataset as the name it points to
this lets us use the dataset later on in another step

NOTE: no output is produced when you make an assignment.

data <- read_excel("Ihno_dataset.xls")

Print out the dataset by just typing and running the name you assigned it

data

# A tibble: 100 x 18
   Sub_num Gender Major Reason Exp_cond Coffee Num_cups Phobia Prevmath
     <dbl>  <dbl> <dbl>  <dbl>    <dbl>  <dbl>    <dbl>  <dbl>    <dbl>
 1    1.00   1.00  1.00   3.00     1.00   1.00     0      1.00     3.00
 2    2.00   1.00  1.00   2.00     1.00   0        0      1.00     4.00
 3    3.00   1.00  1.00   1.00     1.00   0        0      4.00     1.00
 4    4.00   1.00  1.00   1.00     1.00   0        0      4.00     0   
 5    5.00   1.00  1.00   1.00     1.00   0        1.00  10.0      1.00
 6    6.00   1.00  1.00   1.00     2.00   1.00     1.00   4.00     1.00
 7    7.00   1.00  1.00   1.00     2.00   0        0      4.00     2.00
 8    8.00   1.00  1.00   3.00     2.00   1.00     2.00   4.00     1.00
 9    9.00   1.00  1.00   1.00     2.00   0        0      4.00     1.00
10   10.0    1.00  1.00   1.00     2.00   1.00     2.00   5.00     0   
# ... with 90 more rows, and 9 more variables: Mathquiz <dbl>, Statquiz
#   <dbl>, Exp_sqz <dbl>, Hr_base <dbl>, Hr_pre <dbl>, Hr_post <dbl>,
#   Anx_base <dbl>, Anx_pre <dbl>, Anx_post <dbl>

NOTE: The pound or hashtag symbol at the front of a line within an R code chunk designates what follows as a comment and does not try to run the code.

#data

1.2.2 The Pipe `%>%` Opperator: Link Steps Togehter

This special set of symbols (no spaces included) signals R to feed what precedes it into what follows it. Its a simple idea that makes code writing in R much easier (Wickham et al. 2017).

data <- read_excel("Ihno_dataset.xls") %>% 
  dplyr::rename_all(tolower)              # convert variable names to lower case

data

1.2.3 The `head()` Function: Print the First Few Rows

See the first few rows of a dataset by using the head() function. Since it is part for base R, I never include the package, but if we did, it would be utils::head(). The default is to print the first SIX rows.

head(data)

# A tibble: 6 x 18
  sub_num gender major reason exp_cond coffee num_cups phobia prevmath
    <dbl>  <dbl> <dbl>  <dbl>    <dbl>  <dbl>    <dbl>  <dbl>    <dbl>
1    1.00   1.00  1.00   3.00     1.00   1.00     0      1.00     3.00
2    2.00   1.00  1.00   2.00     1.00   0        0      1.00     4.00
3    3.00   1.00  1.00   1.00     1.00   0        0      4.00     1.00
4    4.00   1.00  1.00   1.00     1.00   0        0      4.00     0   
5    5.00   1.00  1.00   1.00     1.00   0        1.00  10.0      1.00
6    6.00   1.00  1.00   1.00     2.00   1.00     1.00   4.00     1.00
# ... with 9 more variables: mathquiz <dbl>, statquiz <dbl>, exp_sqz
#   <dbl>, hr_base <dbl>, hr_pre <dbl>, hr_post <dbl>, anx_base <dbl>,
#   anx_pre <dbl>, anx_post <dbl>

Inside the head() function, you can change the default of n = 6 rows. You can learn about this and other options in the Help tab search for ‘head’.

utils::head(data, n = 3)

# A tibble: 3 x 18
  sub_num gender major reason exp_cond coffee num_cups phobia prevmath
    <dbl>  <dbl> <dbl>  <dbl>    <dbl>  <dbl>    <dbl>  <dbl>    <dbl>
1    1.00   1.00  1.00   3.00     1.00   1.00        0   1.00     3.00
2    2.00   1.00  1.00   2.00     1.00   0           0   1.00     4.00
3    3.00   1.00  1.00   1.00     1.00   0           0   4.00     1.00
# ... with 9 more variables: mathquiz <dbl>, statquiz <dbl>, exp_sqz
#   <dbl>, hr_base <dbl>, hr_pre <dbl>, hr_post <dbl>, anx_base <dbl>,
#   anx_pre <dbl>, anx_post <dbl>

head(data, n = 11)

# A tibble: 11 x 18
   sub_num gender major reason exp_cond coffee num_cups phobia prevmath
     <dbl>  <dbl> <dbl>  <dbl>    <dbl>  <dbl>    <dbl>  <dbl>    <dbl>
 1    1.00   1.00  1.00   3.00     1.00   1.00     0      1.00     3.00
 2    2.00   1.00  1.00   2.00     1.00   0        0      1.00     4.00
 3    3.00   1.00  1.00   1.00     1.00   0        0      4.00     1.00
 4    4.00   1.00  1.00   1.00     1.00   0        0      4.00     0   
 5    5.00   1.00  1.00   1.00     1.00   0        1.00  10.0      1.00
 6    6.00   1.00  1.00   1.00     2.00   1.00     1.00   4.00     1.00
 7    7.00   1.00  1.00   1.00     2.00   0        0      4.00     2.00
 8    8.00   1.00  1.00   3.00     2.00   1.00     2.00   4.00     1.00
 9    9.00   1.00  1.00   1.00     2.00   0        0      4.00     1.00
10   10.0    1.00  1.00   1.00     2.00   1.00     2.00   5.00     0   
11   11.0    1.00  1.00   1.00     2.00   0        1.00   5.00     1.00
# ... with 9 more variables: mathquiz <dbl>, statquiz <dbl>, exp_sqz
#   <dbl>, hr_base <dbl>, hr_pre <dbl>, hr_post <dbl>, anx_base <dbl>,
#   anx_pre <dbl>, anx_post <dbl>

1.2.4 The `names()` Function: List the Variable Names

Another helpful function is names() which lists out the names of the variables. This is nice to use to copy-paste later on, since…in R code chunks:

Spelling matters
Capitalization matters
Spacing does NOT matter: one space is the same as 100 spaces
Line enters are ignored

names(data)

 [1] "sub_num"  "gender"   "major"    "reason"   "exp_cond" "coffee"  
 [7] "num_cups" "phobia"   "prevmath" "mathquiz" "statquiz" "exp_sqz" 
[13] "hr_base"  "hr_pre"   "hr_post"  "anx_base" "anx_pre"  "anx_post"

1.2.5 The `dim()` Function: List the Dimentions

See how many rows (observation) and columns (variables)

dim(data)

[1] 100  18

1.2.6 The `tibble::glimpse()` Function: Gives an Overview of Variables

This is a handy function that gives (Muller and Wickham 2017):

Dimensions (observations and variables)
Names of variables
Each variables type, which could be…
dbl = numeric: double precision floating point numbers
fct = factor: categorical, either nominal or ordinal
chr = character: text
Lists the first few entries

tibble::glimpse(data)

Observations: 100
Variables: 18
$ sub_num  <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16...
$ gender   <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
$ major    <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
$ reason   <dbl> 3, 2, 1, 1, 1, 1, 1, 3, 1, 1, 1, 1, 1, 3, 1, 1, 1, 1,...
$ exp_cond <dbl> 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3, 4, 4, 4, 4,...
$ coffee   <dbl> 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1,...
$ num_cups <dbl> 0, 0, 0, 0, 1, 1, 0, 2, 0, 2, 1, 0, 1, 2, 3, 0, 0, 3,...
$ phobia   <dbl> 1, 1, 4, 4, 10, 4, 4, 4, 4, 5, 5, 4, 7, 4, 3, 8, 4, 5...
$ prevmath <dbl> 3, 4, 1, 0, 1, 1, 2, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1,...
$ mathquiz <dbl> 43, 49, 26, 29, 31, 20, 13, 23, 38, NA, 29, 32, 18, N...
$ statquiz <dbl> 6, 9, 8, 7, 6, 7, 3, 7, 8, 7, 8, 8, 1, 5, 8, 3, 8, 7,...
$ exp_sqz  <dbl> 7, 11, 8, 8, 6, 6, 4, 7, 7, 6, 10, 7, 3, 4, 6, 1, 7, ...
$ hr_base  <dbl> 71, 73, 69, 72, 71, 70, 71, 77, 73, 78, 74, 73, 73, 7...
$ hr_pre   <dbl> 68, 75, 76, 73, 83, 71, 70, 87, 72, 76, 72, 74, 76, 8...
$ hr_post  <dbl> 65, 68, 72, 78, 74, 76, 66, 84, 67, 74, 73, 74, 78, 7...
$ anx_base <dbl> 17, 17, 19, 19, 26, 12, 12, 17, 20, 20, 21, 32, 19, 1...
$ anx_pre  <dbl> 22, 19, 14, 13, 30, 15, 16, 19, 14, 24, 25, 35, 23, 2...
$ anx_post <dbl> 20, 16, 15, 16, 25, 19, 17, 22, 17, 19, 22, 33, 20, 2...

1.2.7 The `dplyr::select()` Function: Specify VARIABLES to include/keep

This function chooses which variables to include, excluding all others not given between the ().

data %>% 
  dplyr::select(sub_num, gender, major, reason, exp_cond, coffee)

# A tibble: 100 x 6
   sub_num gender major reason exp_cond coffee
     <dbl>  <dbl> <dbl>  <dbl>    <dbl>  <dbl>
 1    1.00   1.00  1.00   3.00     1.00   1.00
 2    2.00   1.00  1.00   2.00     1.00   0   
 3    3.00   1.00  1.00   1.00     1.00   0   
 4    4.00   1.00  1.00   1.00     1.00   0   
 5    5.00   1.00  1.00   1.00     1.00   0   
 6    6.00   1.00  1.00   1.00     2.00   1.00
 7    7.00   1.00  1.00   1.00     2.00   0   
 8    8.00   1.00  1.00   3.00     2.00   1.00
 9    9.00   1.00  1.00   1.00     2.00   0   
10   10.0    1.00  1.00   1.00     2.00   1.00
# ... with 90 more rows

1.2.8 The `dplyr::filter()` Function: Specify OBSERVATIONS to include/keep

This function chooses which observations to include, excluding all others not given between the ().

data %>% 
  dplyr::filter(gender == 1)

# A tibble: 57 x 18
   sub_num gender major reason exp_cond coffee num_cups phobia prevmath
     <dbl>  <dbl> <dbl>  <dbl>    <dbl>  <dbl>    <dbl>  <dbl>    <dbl>
 1    1.00   1.00  1.00   3.00     1.00   1.00     0      1.00     3.00
 2    2.00   1.00  1.00   2.00     1.00   0        0      1.00     4.00
 3    3.00   1.00  1.00   1.00     1.00   0        0      4.00     1.00
 4    4.00   1.00  1.00   1.00     1.00   0        0      4.00     0   
 5    5.00   1.00  1.00   1.00     1.00   0        1.00  10.0      1.00
 6    6.00   1.00  1.00   1.00     2.00   1.00     1.00   4.00     1.00
 7    7.00   1.00  1.00   1.00     2.00   0        0      4.00     2.00
 8    8.00   1.00  1.00   3.00     2.00   1.00     2.00   4.00     1.00
 9    9.00   1.00  1.00   1.00     2.00   0        0      4.00     1.00
10   10.0    1.00  1.00   1.00     2.00   1.00     2.00   5.00     0   
# ... with 47 more rows, and 9 more variables: mathquiz <dbl>, statquiz
#   <dbl>, exp_sqz <dbl>, hr_base <dbl>, hr_pre <dbl>, hr_post <dbl>,
#   anx_base <dbl>, anx_pre <dbl>, anx_post <dbl>

You can combine steps to multiple things. The steps are completed in the order we read: top to bottom, left to right.

data %>% 
  dplyr::select(sub_num, gender, major, reason, exp_cond, coffee) %>% 
  dplyr::filter(gender == 1)

# A tibble: 57 x 6
   sub_num gender major reason exp_cond coffee
     <dbl>  <dbl> <dbl>  <dbl>    <dbl>  <dbl>
 1    1.00   1.00  1.00   3.00     1.00   1.00
 2    2.00   1.00  1.00   2.00     1.00   0   
 3    3.00   1.00  1.00   1.00     1.00   0   
 4    4.00   1.00  1.00   1.00     1.00   0   
 5    5.00   1.00  1.00   1.00     1.00   0   
 6    6.00   1.00  1.00   1.00     2.00   1.00
 7    7.00   1.00  1.00   1.00     2.00   0   
 8    8.00   1.00  1.00   3.00     2.00   1.00
 9    9.00   1.00  1.00   1.00     2.00   0   
10   10.0    1.00  1.00   1.00     2.00   1.00
# ... with 47 more rows

1.3 Data Wrangling

1.3.1 The `dplyr::mutate()` Function: Create a New Variable

Just like radiation may cause a fish to grow an additional eye (a mutation), the mutate() function grows a new variable.

data %>% 
  dplyr::mutate(test = 1) %>% 
  dplyr::select(sub_num, gender, mathquiz, statquiz, test) %>% 
  head()

# A tibble: 6 x 5
  sub_num gender mathquiz statquiz  test
    <dbl>  <dbl>    <dbl>    <dbl> <dbl>
1    1.00   1.00     43.0     6.00  1.00
2    2.00   1.00     49.0     9.00  1.00
3    3.00   1.00     26.0     8.00  1.00
4    4.00   1.00     29.0     7.00  1.00
5    5.00   1.00     31.0     6.00  1.00
6    6.00   1.00     20.0     7.00  1.00

1.3.2 The `factor()` Function: Define Categorical Variables

We will be providing this function with three pieces of information:

The name of an existing variable
A concatinated set of numerical levels
A concatinated set of textual labels

You must include the SAME number of levels and labels. The ORDER in the sets designates how the labels will be applied to the levels.

Here is how it looks to create ONE new factor. Notice I added the letter F to designate that this new variable is a factor.

data %>% 
  dplyr::mutate(genderF = factor(gender,
                                 levels = c(1, 2),
                                 labels = c("Female", "Male")))

# A tibble: 100 x 19
   sub_num gender major reason exp_cond coffee num_cups phobia prevmath
     <dbl>  <dbl> <dbl>  <dbl>    <dbl>  <dbl>    <dbl>  <dbl>    <dbl>
 1    1.00   1.00  1.00   3.00     1.00   1.00     0      1.00     3.00
 2    2.00   1.00  1.00   2.00     1.00   0        0      1.00     4.00
 3    3.00   1.00  1.00   1.00     1.00   0        0      4.00     1.00
 4    4.00   1.00  1.00   1.00     1.00   0        0      4.00     0   
 5    5.00   1.00  1.00   1.00     1.00   0        1.00  10.0      1.00
 6    6.00   1.00  1.00   1.00     2.00   1.00     1.00   4.00     1.00
 7    7.00   1.00  1.00   1.00     2.00   0        0      4.00     2.00
 8    8.00   1.00  1.00   3.00     2.00   1.00     2.00   4.00     1.00
 9    9.00   1.00  1.00   1.00     2.00   0        0      4.00     1.00
10   10.0    1.00  1.00   1.00     2.00   1.00     2.00   5.00     0   
# ... with 90 more rows, and 10 more variables: mathquiz <dbl>, statquiz
#   <dbl>, exp_sqz <dbl>, hr_base <dbl>, hr_pre <dbl>, hr_post <dbl>,
#   anx_base <dbl>, anx_pre <dbl>, anx_post <dbl>, genderF <fct>

Notice that the dataset the is printed includes our new variable at the END.

You can use the pipe to chain several mutate steps together. I have also assigned the resulting dataset with the five new factor variables at the end to a new name, dataF. Since this code chunk includes a single assignment, there is no output created.

Remember to include a PIPE between all your steps, but not at the end!

dataF <- data %>% 
  dplyr::mutate(genderF = factor(gender, 
                                 levels = c(1, 2),
                                 labels = c("Female", 
                                            "Male"))) %>% 
  dplyr::mutate(majorF = factor(major, 
                                levels = c(1, 2, 3, 4,5),
                                labels = c("Psychology",
                                           "Premed",
                                           "Biology",
                                           "Sociology",
                                           "Economics"))) %>% 
  dplyr::mutate(reasonF = factor(reason,
                                 levels = c(1, 2, 3),
                                 labels = c("Program requirement",
                                            "Personal interest",
                                            "Advisor recommendation"))) %>% 
  dplyr::mutate(exp_condF = factor(exp_cond,
                                   levels = c(1, 2, 3, 4),
                                   labels = c("Easy",
                                              "Moderate",
                                              "Difficult",
                                              "Impossible"))) %>% 
  dplyr::mutate(coffeeF = factor(coffee,
                                 levels = c(0, 1),
                                 labels = c("Not a regular coffee drinker",
                                            "Regularly drinks coffee")))

See how the new variables are added at the end.

tibble::glimpse(data)

Observations: 100
Variables: 18
$ sub_num  <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16...
$ gender   <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
$ major    <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
$ reason   <dbl> 3, 2, 1, 1, 1, 1, 1, 3, 1, 1, 1, 1, 1, 3, 1, 1, 1, 1,...
$ exp_cond <dbl> 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3, 4, 4, 4, 4,...
$ coffee   <dbl> 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1,...
$ num_cups <dbl> 0, 0, 0, 0, 1, 1, 0, 2, 0, 2, 1, 0, 1, 2, 3, 0, 0, 3,...
$ phobia   <dbl> 1, 1, 4, 4, 10, 4, 4, 4, 4, 5, 5, 4, 7, 4, 3, 8, 4, 5...
$ prevmath <dbl> 3, 4, 1, 0, 1, 1, 2, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1,...
$ mathquiz <dbl> 43, 49, 26, 29, 31, 20, 13, 23, 38, NA, 29, 32, 18, N...
$ statquiz <dbl> 6, 9, 8, 7, 6, 7, 3, 7, 8, 7, 8, 8, 1, 5, 8, 3, 8, 7,...
$ exp_sqz  <dbl> 7, 11, 8, 8, 6, 6, 4, 7, 7, 6, 10, 7, 3, 4, 6, 1, 7, ...
$ hr_base  <dbl> 71, 73, 69, 72, 71, 70, 71, 77, 73, 78, 74, 73, 73, 7...
$ hr_pre   <dbl> 68, 75, 76, 73, 83, 71, 70, 87, 72, 76, 72, 74, 76, 8...
$ hr_post  <dbl> 65, 68, 72, 78, 74, 76, 66, 84, 67, 74, 73, 74, 78, 7...
$ anx_base <dbl> 17, 17, 19, 19, 26, 12, 12, 17, 20, 20, 21, 32, 19, 1...
$ anx_pre  <dbl> 22, 19, 14, 13, 30, 15, 16, 19, 14, 24, 25, 35, 23, 2...
$ anx_post <dbl> 20, 16, 15, 16, 25, 19, 17, 22, 17, 19, 22, 33, 20, 2...

tibble::glimpse(dataF)

Observations: 100
Variables: 23
$ sub_num   <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1...
$ gender    <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
$ major     <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
$ reason    <dbl> 3, 2, 1, 1, 1, 1, 1, 3, 1, 1, 1, 1, 1, 3, 1, 1, 1, 1...
$ exp_cond  <dbl> 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3, 4, 4, 4, 4...
$ coffee    <dbl> 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1...
$ num_cups  <dbl> 0, 0, 0, 0, 1, 1, 0, 2, 0, 2, 1, 0, 1, 2, 3, 0, 0, 3...
$ phobia    <dbl> 1, 1, 4, 4, 10, 4, 4, 4, 4, 5, 5, 4, 7, 4, 3, 8, 4, ...
$ prevmath  <dbl> 3, 4, 1, 0, 1, 1, 2, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1...
$ mathquiz  <dbl> 43, 49, 26, 29, 31, 20, 13, 23, 38, NA, 29, 32, 18, ...
$ statquiz  <dbl> 6, 9, 8, 7, 6, 7, 3, 7, 8, 7, 8, 8, 1, 5, 8, 3, 8, 7...
$ exp_sqz   <dbl> 7, 11, 8, 8, 6, 6, 4, 7, 7, 6, 10, 7, 3, 4, 6, 1, 7,...
$ hr_base   <dbl> 71, 73, 69, 72, 71, 70, 71, 77, 73, 78, 74, 73, 73, ...
$ hr_pre    <dbl> 68, 75, 76, 73, 83, 71, 70, 87, 72, 76, 72, 74, 76, ...
$ hr_post   <dbl> 65, 68, 72, 78, 74, 76, 66, 84, 67, 74, 73, 74, 78, ...
$ anx_base  <dbl> 17, 17, 19, 19, 26, 12, 12, 17, 20, 20, 21, 32, 19, ...
$ anx_pre   <dbl> 22, 19, 14, 13, 30, 15, 16, 19, 14, 24, 25, 35, 23, ...
$ anx_post  <dbl> 20, 16, 15, 16, 25, 19, 17, 22, 17, 19, 22, 33, 20, ...
$ genderF   <fct> Female, Female, Female, Female, Female, Female, Fema...
$ majorF    <fct> Psychology, Psychology, Psychology, Psychology, Psyc...
$ reasonF   <fct> Advisor recommendation, Personal interest, Program r...
$ exp_condF <fct> Easy, Easy, Easy, Easy, Easy, Moderate, Moderate, Mo...
$ coffeeF   <fct> Regularly drinks coffee, Not a regular coffee drinke...

The following portion works through the assignment for Unit 0

1.3.3 Question 2: Create a new variable = `mathquiz` + 50

data2 <- dataF %>% 
  dplyr::mutate(mathquiz_p50 = mathquiz + 50)

data2 %>% 
  dplyr::select(sub_num, mathquiz, mathquiz_p50)

# A tibble: 100 x 3
   sub_num mathquiz mathquiz_p50
     <dbl>    <dbl>        <dbl>
 1    1.00     43.0         93.0
 2    2.00     49.0         99.0
 3    3.00     26.0         76.0
 4    4.00     29.0         79.0
 5    5.00     31.0         81.0
 6    6.00     20.0         70.0
 7    7.00     13.0         63.0
 8    8.00     23.0         73.0
 9    9.00     38.0         88.0
10   10.0      NA           NA  
# ... with 90 more rows

1.3.4 Question 3: Create a new variable = `Hr_base` / 60

data3 <- data2 %>% 
  dplyr::mutate(hr_base_bps = hr_base / 60)

data3 %>% 
  dplyr::select(sub_num, hr_base, hr_base_bps)

# A tibble: 100 x 3
   sub_num hr_base hr_base_bps
     <dbl>   <dbl>       <dbl>
 1    1.00    71.0        1.18
 2    2.00    73.0        1.22
 3    3.00    69.0        1.15
 4    4.00    72.0        1.20
 5    5.00    71.0        1.18
 6    6.00    70.0        1.17
 7    7.00    71.0        1.18
 8    8.00    77.0        1.28
 9    9.00    73.0        1.22
10   10.0     78.0        1.30
# ... with 90 more rows

1.3.5 Question 4a: Create a new variable = `Statquiz` + 2, then * 10

data4a <- data3 %>% 
  dplyr::mutate(statquiz_4a = (statquiz + 2) * 10 )

1.3.6 Question 4b: Create a new variable = `Statquiz` * 10, then + 2

data4b <- data4a %>% 
  dplyr::mutate(statquiz_4b = (statquiz * 10) + 2 )

data4b %>% 
  dplyr::select(sub_num, statquiz, statquiz_4a, statquiz_4b)

# A tibble: 100 x 4
   sub_num statquiz statquiz_4a statquiz_4b
     <dbl>    <dbl>       <dbl>       <dbl>
 1    1.00     6.00        80.0        62.0
 2    2.00     9.00       110          92.0
 3    3.00     8.00       100          82.0
 4    4.00     7.00        90.0        72.0
 5    5.00     6.00        80.0        62.0
 6    6.00     7.00        90.0        72.0
 7    7.00     3.00        50.0        32.0
 8    8.00     7.00        90.0        72.0
 9    9.00     8.00       100          82.0
10   10.0      7.00        90.0        72.0
# ... with 90 more rows

1.3.7 Question 5a: Create a new variable = `sum` of the 3 anxiety measures

Here are three ways you may try to find the sum. The middle way does not perform the action we want. The first way works fine, unless there is some missing data.

data5a <- data4b %>% 
  dplyr::mutate(anx_plus    =         anx_base + anx_pre + anx_post)  %>%  # works, missing??
  dplyr::mutate(anx_sum     =     sum(anx_base,  anx_pre,  anx_post)) %>%  # does NOT work
  dplyr::mutate(anx_rowsums = rowsums(anx_base,  anx_pre,  anx_post))      # best way

data5a %>% 
  dplyr::select(sub_num, 
                anx_base, anx_pre, anx_post, 
                anx_plus, anx_sum, anx_rowsums)

# A tibble: 100 x 7
   sub_num anx_base anx_pre anx_post anx_plus anx_sum anx_rowsums
     <dbl>    <dbl>   <dbl>    <dbl>    <dbl>   <dbl>       <dbl>
 1    1.00     17.0    22.0     20.0     59.0    5741        59.0
 2    2.00     17.0    19.0     16.0     52.0    5741        52.0
 3    3.00     19.0    14.0     15.0     48.0    5741        48.0
 4    4.00     19.0    13.0     16.0     48.0    5741        48.0
 5    5.00     26.0    30.0     25.0     81.0    5741        81.0
 6    6.00     12.0    15.0     19.0     46.0    5741        46.0
 7    7.00     12.0    16.0     17.0     45.0    5741        45.0
 8    8.00     17.0    19.0     22.0     58.0    5741        58.0
 9    9.00     20.0    14.0     17.0     51.0    5741        51.0
10   10.0      20.0    24.0     19.0     63.0    5741        63.0
# ... with 90 more rows

1.3.8 Question 5b: Create a new variable = `average` of the 3 heart rates

data5b <- data5a %>% 
  dplyr::mutate(hr_avg      =         (hr_base + hr_pre + hr_post)/3) %>%  # works,no missings
  dplyr::mutate(hr_rowmeans = rowmeans(hr_base,  hr_pre,  hr_post)) # always works

data5b %>% 
  dplyr::select(sub_num, 
                hr_base, hr_pre, hr_post, 
                hr_avg,  hr_rowmeans)

# A tibble: 100 x 6
   sub_num hr_base hr_pre hr_post hr_avg hr_rowmeans
     <dbl>   <dbl>  <dbl>   <dbl>  <dbl>       <dbl>
 1    1.00    71.0   68.0    65.0   68.0        68.0
 2    2.00    73.0   75.0    68.0   72.0        72.0
 3    3.00    69.0   76.0    72.0   72.3        72.3
 4    4.00    72.0   73.0    78.0   74.3        74.3
 5    5.00    71.0   83.0    74.0   76.0        76.0
 6    6.00    70.0   71.0    76.0   72.3        72.3
 7    7.00    71.0   70.0    66.0   69.0        69.0
 8    8.00    77.0   87.0    84.0   82.7        82.7
 9    9.00    73.0   72.0    67.0   70.7        70.7
10   10.0     78.0   76.0    74.0   76.0        76.0
# ... with 90 more rows

1.3.9 Question 6: Create a new variable = `Statquiz` minus `Exp_sqz`

data6 <- data5b %>% 
  dplyr::mutate(statDiff = statquiz - exp_sqz)

data6 %>% 
  dplyr::select(sub_num, exp_cond, exp_condF,
                statquiz, exp_sqz, statDiff)

# A tibble: 100 x 6
   sub_num exp_cond exp_condF statquiz exp_sqz statDiff
     <dbl>    <dbl> <fct>        <dbl>   <dbl>    <dbl>
 1    1.00     1.00 Easy          6.00    7.00    -1.00
 2    2.00     1.00 Easy          9.00   11.0     -2.00
 3    3.00     1.00 Easy          8.00    8.00     0   
 4    4.00     1.00 Easy          7.00    8.00    -1.00
 5    5.00     1.00 Easy          6.00    6.00     0   
 6    6.00     2.00 Moderate      7.00    6.00     1.00
 7    7.00     2.00 Moderate      3.00    4.00    -1.00
 8    8.00     2.00 Moderate      7.00    7.00     0   
 9    9.00     2.00 Moderate      8.00    7.00     1.00
10   10.0      2.00 Moderate      7.00    6.00     1.00
# ... with 90 more rows

1.3.10 Putting it all together

data_clean <- read_excel("Ihno_dataset.xls") %>% 
  dplyr::rename_all(tolower) %>% 
  dplyr::mutate(genderF = factor(gender, 
                                 levels = c(1, 2),
                                 labels = c("Female", 
                                            "Male"))) %>% 
  dplyr::mutate(majorF = factor(major, 
                                levels = c(1, 2, 3, 4,5),
                                labels = c("Psychology",
                                           "Premed",
                                           "Biology",
                                           "Sociology",
                                           "Economics"))) %>% 
  dplyr::mutate(reasonF = factor(reason,
                                 levels = c(1, 2, 3),
                                 labels = c("Program requirement",
                                            "Personal interest",
                                            "Advisor recommendation"))) %>% 
  dplyr::mutate(exp_condF = factor(exp_cond,
                                   levels = c(1, 2, 3, 4),
                                   labels = c("Easy",
                                              "Moderate",
                                              "Difficult",
                                              "Impossible"))) %>% 
  dplyr::mutate(coffeeF = factor(coffee,
                                 levels = c(0, 1),
                                 labels = c("Not a regular coffee drinker",
                                            "Regularly drinks coffee")))  %>% 
  dplyr::mutate(mathquiz_p50 = mathquiz + 50) %>% 
  dplyr::mutate(hr_base_bps  = hr_base / 60) %>% 
  dplyr::mutate(statquiz_4a  = (statquiz + 2) * 10 ) %>% 
  dplyr::mutate(statquiz_4b  = (statquiz * 10) + 2 ) %>% 
  dplyr::mutate(anx_sum      = rowsums(anx_base, anx_pre, anx_post)) %>% 
  dplyr::mutate(hr_mean      = rowmeans(hr_base + hr_pre + hr_post)) %>% 
  dplyr::mutate(statDiff     = statquiz - exp_sqz)

tibble::glimpse(data_clean)

Chapter 1 DATA PREPARTION

1.1 Preparing the Environment

1.1.1 The library() Function: Load a Package

1.1.2 The readxl::read_excel() Function: Read in Excel Data Files

1.2 Opperators and Helpful Functions

1.2.1 The Assignment Opperator <-: Save things to a name

1.2.2 The Pipe %>% Opperator: Link Steps Togehter

1.2.3 The head() Function: Print the First Few Rows

1.2.4 The names() Function: List the Variable Names

1.2.5 The dim() Function: List the Dimentions

1.2.6 The tibble::glimpse() Function: Gives an Overview of Variables

1.2.7 The dplyr::select() Function: Specify VARIABLES to include/keep

1.2.8 The dplyr::filter() Function: Specify OBSERVATIONS to include/keep