15 Ex: Logistic - volunteering (Hoffman)

Compiled: April 22, 2025

15.1 PREPARATION

15.1.1 Load Packages

# library(remotes)
# remotes::install_github("sarbearschwartz/apaSupp")  # updated: 4/12/25
# remotes::install_github("ddsjoberg/gtsummary")


library(tidyverse)
library(haven)        
library(naniar)
library(apaSupp)
library(performance) 
library(interactions)

15.1.2 Load Data

This dataset comes from John Hoffman’s textbook: Regression Models for Categorical, Count, and Related Variables: An Applied Approach (2004) Amazon link, 2014 edition

Chapter 3: Logistic and Probit Regression Models

Dataset: The following example uses the SPSS data set gss.sav. The dependent variable of interest is labeled volrelig.

“The variable labeled volrelig, which indicates whether or not a respondent volunteered for a religious organization in the previous year is coded 0 = no, 1 = yes. A hypothesis we wish to explore is that females are more likely than males to volunteer for religious organizations. Hence, in this data set, we code gender as 0 = male and 1 = female. In order to preclude the possibility that age and education explain the proposed association between gender and volrelig, we include these variables in the model after transforming them into z-scores. An advantage of this transformation is that it becomes a simple exercise to compute odds or probabilities for males and females at the mean of age and education, because these variables have now been transformed to have a mean of zero.

df_spss <- haven::read_spss("https://raw.githubusercontent.com/CEHS-research/data/master/Hoffmann_datasets/gss.sav") %>% 
  haven::as_factor() %>% 
  haven::zap_label() %>%    # remove SPSS junk
  haven::zap_formats() %>%  # remove SPSS junk
  haven::zap_widths()       # remove SPSS junk

str(df_spss)

tibble [2,903 × 20] (S3: tbl_df/tbl/data.frame)
 $ id      : num [1:2903] 402 1473 1909 334 1751 ...
 $ marital : Factor w/ 5 levels "married","widowed",..: 3 2 2 2 1 3 5 1 5 2 ...
 $ divorce : Factor w/ 2 levels "yes","no": 1 2 2 1 2 1 2 1 2 2 ...
 $ childs  : Factor w/ 9 levels "0","1","2","3",..: 3 1 8 3 3 1 3 4 1 3 ...
 $ age     : num [1:2903] 54 24 75 41 37 40 36 33 18 35 ...
 $ income  : num [1:2903] 10 2 NA NA 12 NA 9 NA NA 6 ...
 $ polviews: Factor w/ 7 levels "extreme liberal",..: 4 5 7 4 7 7 NA 4 4 4 ...
 $ fund    : Factor w/ 3 levels "fundamentalist",..: NA NA NA NA NA NA NA NA NA NA ...
 $ attend  : Factor w/ 9 levels "never","less than once a year",..: NA NA NA NA NA NA 7 4 3 4 ...
 $ spanking: Factor w/ 4 levels "strongly agree",..: NA NA NA NA NA NA NA NA NA NA ...
 $ totrelig: num [1:2903] NA NA NA NA NA NA NA 1000 NA NA ...
 $ sei     : num [1:2903] 38.9 29 29.1 29 38.1 ...
 $ pasei   : num [1:2903] NA 48.6 22.5 26.7 38.1 ...
 $ volteer : num [1:2903] 0 0 0 1 1 0 0 0 1 0 ...
 $ female  : Factor w/ 2 levels "male","female": 2 1 2 2 1 2 2 2 2 1 ...
 $ nonwhite: Factor w/ 2 levels "white","non-white": 2 1 2 1 1 1 2 1 2 1 ...
 $ prayer  : Factor w/ 6 levels "never","less than once a week",..: 5 4 5 4 4 4 5 4 4 4 ...
 $ educate : num [1:2903] 12 17 8 12 12 NA 15 12 11 14 ...
 $ volrelig: Factor w/ 2 levels "no","yes": 1 1 1 2 2 1 1 1 1 1 ...
 $ polview1: Factor w/ 3 levels "liberal","moderate",..: 2 3 3 2 3 3 NA 2 2 2 ...

15.1.3 Wrangle Data

df_gss <- df_spss %>% 
  dplyr::mutate(volrelig = volrelig %>% 
                  forcats::fct_recode("Yes" = "yes",
                                      "No" = "no")) %>% 
  dplyr::mutate(female = female %>% 
                  forcats::fct_recode("Male" = "male",
                                      "Female" = "female"))

str(df_gss)

tibble [2,903 × 20] (S3: tbl_df/tbl/data.frame)
 $ id      : num [1:2903] 402 1473 1909 334 1751 ...
 $ marital : Factor w/ 5 levels "married","widowed",..: 3 2 2 2 1 3 5 1 5 2 ...
 $ divorce : Factor w/ 2 levels "yes","no": 1 2 2 1 2 1 2 1 2 2 ...
 $ childs  : Factor w/ 9 levels "0","1","2","3",..: 3 1 8 3 3 1 3 4 1 3 ...
 $ age     : num [1:2903] 54 24 75 41 37 40 36 33 18 35 ...
 $ income  : num [1:2903] 10 2 NA NA 12 NA 9 NA NA 6 ...
 $ polviews: Factor w/ 7 levels "extreme liberal",..: 4 5 7 4 7 7 NA 4 4 4 ...
 $ fund    : Factor w/ 3 levels "fundamentalist",..: NA NA NA NA NA NA NA NA NA NA ...
 $ attend  : Factor w/ 9 levels "never","less than once a year",..: NA NA NA NA NA NA 7 4 3 4 ...
 $ spanking: Factor w/ 4 levels "strongly agree",..: NA NA NA NA NA NA NA NA NA NA ...
 $ totrelig: num [1:2903] NA NA NA NA NA NA NA 1000 NA NA ...
 $ sei     : num [1:2903] 38.9 29 29.1 29 38.1 ...
 $ pasei   : num [1:2903] NA 48.6 22.5 26.7 38.1 ...
 $ volteer : num [1:2903] 0 0 0 1 1 0 0 0 1 0 ...
 $ female  : Factor w/ 2 levels "Male","Female": 2 1 2 2 1 2 2 2 2 1 ...
 $ nonwhite: Factor w/ 2 levels "white","non-white": 2 1 2 1 1 1 2 1 2 1 ...
 $ prayer  : Factor w/ 6 levels "never","less than once a week",..: 5 4 5 4 4 4 5 4 4 4 ...
 $ educate : num [1:2903] 12 17 8 12 12 NA 15 12 11 14 ...
 $ volrelig: Factor w/ 2 levels "No","Yes": 1 1 1 2 2 1 1 1 1 1 ...
 $ polview1: Factor w/ 3 levels "liberal","moderate",..: 2 3 3 2 3 3 NA 2 2 2 ...

df_gss %>% 
  dplyr::select("Volunteered" = volrelig, 
                "Sex" = female, 
                "Age" = age, 
                "Education" = educate) %>% 
  psych::headTail() %>% 
  flextable::flextable() %>% 
  apaSupp::theme_apa(caption = "Partial Printout of the Dataset",
                     d = 0) %>% 
  flextable::align(part = "all", j = 1:2, align = "left") %>% 
  flextable::align(part = "all", j = 3.:4, align = "right") %>% 
  flextable::colformat_num(na_str = "-")

**Table 15.1**
Partial Printout of the Dataset
Volunteered	Sex	Age	Education
No	Female	54	12
No	Male	24	17
No	Female	75	8
Yes	Female	41	12
		...	...
No	Female	32	16
No	Female	43	14
No	Male	29	16
No	Male	23	16

15.2 EXPLORATORY DATA ANALYSIS

15.2.1 Missing Data

df_gss %>% 
  dplyr::select("Volunteered" = volrelig, 
                "Sex" = female, 
                "Age" = age, 
                "Education" = educate) %>% 
  naniar::miss_var_summary() %>%
  dplyr::select(Variable = variable,
                n = n_miss) %>% 
  flextable::flextable() %>% 
  apaSupp::theme_apa(caption = "Missing Data by Variable")

**Table 15.2**
Missing Data by Variable
Variable	n
Education	9
Volunteered	0
Sex	0
Age	0

15.2.2 Summary

df_gss %>% 
  dplyr::select("Volunteered" = volrelig, 
                "Sex" = female, 
                "Age" = age, 
                "Education" = educate) %>%
  apaSupp::tab_freq(caption = "Summary of Categorical Variables")

**Table 15.3**
Summary of Categorical Variables
		Statistic (N=2,903)
Volunteered
No		2,697 (92.9%)
Yes		206 (7.1%)
Sex
Male		1,285 (44.3%)
Female		1,618 (55.7%)

df_gss %>% 
  dplyr::select("Volunteered" = volrelig, 
                "Sex" = female, 
                "Age" = age, 
                "Education" = educate) %>%
  apaSupp::tab_desc(caption = "Summary of Continuous Variables")

**Table 15.4**
Summary of Continuous Variables
	NA	M	SD	min	Q1	Mdn	Q3	max
Age	0	44.76	16.85	18.00	32.00	42.00	55.00	89.00
Education	9	13.36	2.93	0.00	12.00	13.00	16.00	20.00
Note. N = 2903. NA = not available or missing; Mdn = median; Q1 = 25th percentile; Q3 = 75th percentile.

15.2.3 Visualization

df_gss %>% 
  apaSupp::spicy_histos(var = age,
                        lab = "Age, years")

df_gss %>% 
  ggplot(aes(x = educate,
             y = volrelig)) +
  geom_count() +
  theme_bw() +
  labs(x = "Education in Years",
       y = "Observed Volunteering for a\nReligious Organization\nin the Previous Year")

15.3 LOGISTIC REGRESSION

15.3.1 Compelete Subset

df_gss_model <- df_gss %>% 
  dplyr::filter(complete.cases(volrelig, female, age, educate))

tibble::glimpse(df_gss_model)

Rows: 2,894
Columns: 20
$ id       <dbl> 402, 1473, 1909, 334, 1751, 292, 2817, 2810, 2232, 2174, 2644…
$ marital  <fct> divorced, widowed, widowed, widowed, married, never married, …
$ divorce  <fct> yes, no, no, yes, no, no, yes, no, no, no, no, no, no, yes, y…
$ childs   <fct> 2, 0, 7, 2, 2, 2, 3, 0, 2, 2, 5, 0, 2, 1, 0, 1, 0, 0, 0, 0, 1…
$ age      <dbl> 54, 24, 75, 41, 37, 36, 33, 18, 35, 35, 34, 40, 37, 41, 61, 2…
$ income   <dbl> 10, 2, NA, NA, 12, 9, NA, NA, 6, 12, 11, 12, 10, 12, 12, NA, …
$ polviews <fct> middle of the road, slight conservative, extreme conservative…
$ fund     <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ attend   <fct> NA, NA, NA, NA, NA, nearly every week, several times a year, …
$ spanking <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ totrelig <dbl> NA, NA, NA, NA, NA, NA, 1000, NA, NA, NA, NA, NA, NA, NA, NA,…
$ sei      <dbl> 38.9, 29.0, 29.1, 29.0, 38.1, 38.4, 31.3, NA, 39.0, 29.5, 50.…
$ pasei    <dbl> NA, 48.6, 22.5, 26.7, 38.1, NA, NA, NA, 50.7, 78.5, NA, 73.6,…
$ volteer  <dbl> 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0…
$ female   <fct> Female, Male, Female, Female, Male, Female, Female, Female, M…
$ nonwhite <fct> non-white, white, non-white, white, white, non-white, white, …
$ prayer   <fct> daily, several times a week, daily, several times a week, sev…
$ educate  <dbl> 12, 17, 8, 12, 12, 15, 12, 11, 14, 14, 12, 20, 12, 15, 20, 11…
$ volrelig <fct> No, No, No, Yes, Yes, No, No, No, No, No, No, No, No, No, No,…
$ polview1 <fct> moderate, conservative, conservative, moderate, conservative,…

15.3.2 Fit the Model

fit_glm_1 <- glm(volrelig ~ female + age + educate,
                 data = df_gss_model,
                 family = binomial(link = "logit"))

15.3.3 Parameter Table

apaSupp::tab_glm(fit_glm_1)

**Table 15.5**
Parameter Estimates for Generalized Linear Regression
	Odds Ratio		Logit Scale
Variable	OR	95% CI	b	(SE)	Wald	LRT	VIF
female						.017*	1.01
Male	—	—	—	—
Female	1.43	[1.06, 1.92]	.4	(0.15)	.018*
age	1.01	[1.00, 1.02]	0.01	(0.00)	.055	.057	1.02
educate	1.13	[1.08, 1.19]	0.12	(0.03)	< .001***	< .001***	1.02
(Intercept)			-4.83	(0.45)	< .001***
pseudo-R²	.010
Note. N = 2894. CI = confidence interval; VIF = variance inflation factor. Significance denotes Wald t-tests for individual parameter estimates, as well as Likelihood Ratio Tests (LRT) for single-predictor deletion. Coefficient of deterination displays Tjur's pseudo-R².
* p < .05. p < .01. * p < .001.

apaSupp::tab_glm(fit_glm_1,
                 var_labels = c(female = "Sex",
                                age = "Age, yrs",
                                educate = "Education, yrs"),
                 caption = "Parameter Estimates for Multivariate Logistic Regression for Vollunteering  for Religious Organization in the Previous Year",
                 p_note = "apa13",
                 lrt = FALSE,
                 pr2 = "both") %>% 
  flextable::width(j = 1, width = 1.25) %>% 
  flextable::bold(i = c(4, 6))

**Table 15.6**
Parameter Estimates for Multivariate Logistic Regression for Vollunteering for Religious Organization in the Previous Year
	Odds Ratio		Logit Scale
Variable	OR	95% CI	b	(SE)	p	VIF
Sex						1.01
Male	—	—	—	—
Female	1.43	[1.06, 1.92]	.4	(0.15)	.018*
Age, yrs	1.01	[1.00, 1.02]	0.01	(0.00)	.055	1.02
Education, yrs	1.13	[1.08, 1.19]	0.12	(0.03)	< .001***	1.02
(Intercept)			-4.83	(0.45)	< .001***
pseudo-R²
	Tjur	.010
	McFadden	.020
Note. N = 2894. CI = confidence interval; VIF = variance inflation factor. Significance denotes Wald t-tests for parameter estimates. Coefficient of deterination included for both Tjur and McFadden's pseudo-R².
* p < .05. *** p < .001.

15.4 Interpretation

15.4.1 Probe

Logit Scale ranges from negatie infinity to positive infinity…hard to interpret

fit_glm_1 %>% 
  emmeans::emmeans(~ female)

 female emmean     SE  df asymp.LCL asymp.UCL
 Male    -2.83 0.1210 Inf     -3.07     -2.59
 Female  -2.48 0.0939 Inf     -2.66     -2.29

Results are given on the logit (not the response) scale. 
Confidence level used: 0.95

Response Scale (aka. Predicted Probabilities)

fit_glm_1 %>% 
  emmeans::emmeans(~ female,
                   type = "response")

 female   prob      SE  df asymp.LCL asymp.UCL
 Male   0.0556 0.00638 Inf    0.0444    0.0695
 Female 0.0774 0.00671 Inf    0.0653    0.0917

Confidence level used: 0.95 
Intervals are back-transformed from the logit scale

Controlling for age and education, i.e. at the mean level of education and age…

the probability of volunteering among MALES is .0556 or a5.6% chance
the probability of volunteering among FEMALES is .0774 or a 7.7% chance

Use these probabilities to compute the odds ratio for gender (OR for sex).

(.0774/(1 - .0774))/(.0556/(1 - .0556))

[1] 1.42498

Note that these odds and probabilities are similar. This often occurs when we are dealing with probabilities that are relatively close to zero; in other words, it is a common occurrence for rare events. To see this, simply compute a cross-tabulation of volrelig and gender and compare the odds and probabilities. Then try it out for any rare event you may wish to simulate

15.4.2 Plot

fit_glm_1 %>% 
  emmeans::emmeans(~ female | educate,
                   at = list(educate = c(25, 50, 75)),
                   type = "response")

educate = 25:
 female  prob     SE  df asymp.LCL asymp.UCL
 Male   0.195 0.0455 Inf     0.121     0.300
 Female 0.257 0.0557 Inf     0.163     0.380

educate = 50:
 female  prob     SE  df asymp.LCL asymp.UCL
 Male   0.835 0.1240 Inf     0.465     0.967
 Female 0.879 0.0971 Inf     0.548     0.977

educate = 75:
 female  prob     SE  df asymp.LCL asymp.UCL
 Male   0.991 0.0141 Inf     0.842     1.000
 Female 0.993 0.0100 Inf     0.881     1.000

Confidence level used: 0.95 
Intervals are back-transformed from the logit scale

interactions::interact_plot(model = fit_glm_1,
                            pred = educate,
                            modx = female,
                            legend.main = "Sex",
                            interval = TRUE,
                            int.width = .685) +
  theme_bw() +
  geom_vline(xintercept = 12) +
  labs(x = "Education, years",
       y = "Predicted Probability of Volunteering for a\nReligious Organization in the Previous Year") +
  scale_y_continuous(labels = scales::percent_format()) +
  theme(legend.position = "inside",
        legend.position.inside = c(0, 1),
        legend.justification = c(-.1, 1.1),
        legend.background = element_rect(color = "black"),
        legend.key.width = unit(1.5, "cm"))