17 GEE, Count Outcome: Epilepsy

17.1 PREPARATION

17.1.1 Load Packages

install.packages("devtools")
devtools::install_github("sarbearschwartz/apaSupp") # updated 10/14/2025

Load/activate these packages

library(tidyverse)    
library(flextable)
library(apaSupp)      # Not on CRAN, on GitHub (see above)
library(psych) 
library(lme4)         
library(lmerTest)
library(corrplot)      
library(geepack)       
library(MuMIn)  
library(interactions)
library(performance)      

library(HSAUR)        # package with the dataset

17.1.2 Background

This dataset was used as an example in Chapter 11 of “A Handbook of Statistical Analysis using R” by Brian S. Everitt and Torsten Hothorn. The authors include this data set in their HSAUR package on CRAN.

The Background

In this clinical trial, 59 patients suffering from epilepsy were randomized to groups receiving either the anti-epileptic drug “Progabide”” or a placebo in addition to standard chemotherapy. The numbers of seizures suffered in each of four, two-week periods were recorded for each patient along with a baseline seizure count for the 8 weeks prior to being randomized to treatment and age.

The Research Question

The main question of interest is whether taking progabide reduced the number of epileptic seizures compared with placebo.

The Data

  • Indicators

    • subject the patient ID, a factor with levels 1 to 59
    • period treatment period, an ordered factor with levels 1 to 4
  • Outcome or dependent variable

    +seizure.rate the number of seizures (2-weeks)

  • Main predictor or independent variable of interest

    • treatment the treatment group, a factor with levels placebo and Progabide
  • Time-invariant Covariates

    • age the age of the patient
    • base the number of seizures before the trial (8 weeks)

17.1.3 Read in the data

data("epilepsy", package = "HSAUR")

Problem: The outcome (seizure.rate) were counts over a TWO-week period and we would like to interpret the results in terms of effects on the WEEKLY rate.

  • If we divide the values by TWO to get weekly rates, the outcome might be a decimal number
  • The Poisson distribution may only be used for whole numbers (not deciamls)

Solution: We need to include an offset term in the model that indicates the LOG DURATION of each period.

  • Every observation period is exactly 2 weeks in this experiment
  • Create a variable in the original dataset that is equal to the LOG DURATION (per = log(2))
  • To the formula for the glm() or gee(), add: + offset(per)

17.1.4 Long Format

df_long <- epilepsy %>% 
  dplyr::select(subject, age, treatment, base, 
                period, seizure.rate) %>% 
  dplyr::mutate(per = log(2)) %>%                  # new variable to use with the offset
  dplyr::mutate(base_wk = base/8)
str(df_long)
'data.frame':   236 obs. of  8 variables:
 $ subject     : Factor w/ 59 levels "1","2","3","4",..: 1 1 1 1 2 2 2 2 3 3 ...
 $ age         : int  31 31 31 31 30 30 30 30 25 25 ...
 $ treatment   : Factor w/ 2 levels "placebo","Progabide": 1 1 1 1 1 1 1 1 1 1 ...
 $ base        : int  11 11 11 11 11 11 11 11 6 6 ...
 $ period      : Ord.factor w/ 4 levels "1"<"2"<"3"<"4": 1 2 3 4 1 2 3 4 1 2 ...
 $ seizure.rate: int  5 3 3 3 3 5 3 3 2 4 ...
 $ per         : num  0.693 0.693 0.693 0.693 0.693 ...
 $ base_wk     : num  1.38 1.38 1.38 1.38 1.38 ...
df_long %>% 
  psych::headTail(, top = 10, bottom = 6) %>% 
  flextable::flextable() %>% 
  apaSupp::theme_apa(caption = "Long Form of the Data")
Table 17.1
Long Form of the Data

subject

age

treatment

base

period

seizure.rate

per

base_wk

1

31

placebo

11

1

5

0.69

1.38

1

31

placebo

11

2

3

0.69

1.38

1

31

placebo

11

3

3

0.69

1.38

1

31

placebo

11

4

3

0.69

1.38

2

30

placebo

11

1

3

0.69

1.38

2

30

placebo

11

2

5

0.69

1.38

2

30

placebo

11

3

3

0.69

1.38

2

30

placebo

11

4

3

0.69

1.38

3

25

placebo

6

1

2

0.69

0.75

3

25

placebo

6

2

4

0.69

0.75

...

...

...

...

...

58

36

Progabide

13

3

0

0.69

1.62

58

36

Progabide

13

4

0

0.69

1.62

59

37

Progabide

12

1

1

0.69

1.5

59

37

Progabide

12

2

4

0.69

1.5

59

37

Progabide

12

3

3

0.69

1.5

59

37

Progabide

12

4

2

0.69

1.5

17.1.5 Wide Format

df_wide <- df_long %>% 
  tidyr::spread(key = period,
                value = seizure.rate,
                sep = "_") %>% 
  dplyr::arrange(subject)
str(df_wide)
'data.frame':   59 obs. of  10 variables:
 $ subject  : Factor w/ 59 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ age      : int  31 30 25 36 22 29 31 42 37 28 ...
 $ treatment: Factor w/ 2 levels "placebo","Progabide": 1 1 1 1 1 1 1 1 1 1 ...
 $ base     : int  11 11 6 8 66 27 12 52 23 10 ...
 $ per      : num  0.693 0.693 0.693 0.693 0.693 ...
 $ base_wk  : num  1.38 1.38 0.75 1 8.25 ...
 $ period_1 : int  5 3 2 4 7 5 6 40 5 14 ...
 $ period_2 : int  3 5 4 4 18 2 4 20 6 13 ...
 $ period_3 : int  3 3 0 1 9 8 0 23 6 6 ...
 $ period_4 : int  3 3 5 4 21 7 2 12 5 0 ...
df_wide %>% 
  psych::headTail(, top = 10, bottom = 6) %>% 
  flextable::flextable() %>% 
  apaSupp::theme_apa(caption = "Wide Form of the Data")
Table 17.2
Wide Form of the Data

subject

age

treatment

base

per

base_wk

period_1

period_2

period_3

period_4

1

31

placebo

11

0.69

1.38

5

3

3

3

2

30

placebo

11

0.69

1.38

3

5

3

3

3

25

placebo

6

0.69

0.75

2

4

0

5

4

36

placebo

8

0.69

1

4

4

1

4

5

22

placebo

66

0.69

8.25

7

18

9

21

6

29

placebo

27

0.69

3.38

5

2

8

7

7

31

placebo

12

0.69

1.5

6

4

0

2

8

42

placebo

52

0.69

6.5

40

20

23

12

9

37

placebo

23

0.69

2.88

5

6

6

5

10

28

placebo

10

0.69

1.25

14

13

6

0

...

...

...

...

...

...

...

...

54

41

Progabide

24

0.69

3

6

3

4

0

55

32

Progabide

16

0.69

2

3

5

4

3

56

26

Progabide

22

0.69

2.75

1

23

19

8

57

21

Progabide

25

0.69

3.12

2

3

0

1

58

36

Progabide

13

0.69

1.62

0

0

0

0

59

37

Progabide

12

0.69

1.5

1

4

3

2

17.2 EXPLORATORY DATA ANALYSIS

17.2.1 Summarize

17.2.1.1 Demographics and Baseline

df_wide %>% 
  dplyr::select(treatment,
                "Baseline Age" = age, 
                "Seizures, 8 wk total" = base, 
                "Seizures, weekly" = base_wk) %>% 
  apaSupp::tab1(split = "treatment")
Table 17.3
Summary of Variables

Total
N = 59

placebo
n = 28 (47.5%)

Progabide
n = 31 (52.5%)

p-value

Baseline Age

28.34 (6.30)

29.00 (6.00)

27.74 (6.60)

.446

Seizures, 8 wk total

31.22 (26.88)

30.79 (26.10)

31.61 (27.98)

.907

Seizures, weekly

3.90 (3.36)

3.85 (3.26)

3.95 (3.50)

.907

Note. Continuous variables are summarized with means (SD) and significant group differences assessed via independent t-tests. Categorical variables are summarized with counts (%) and significant group differences assessed via Chi-squared tests for independence.

* p < .05. ** p < .01. *** p < .001.

17.2.1.2 Outcome Across Time

Note: The Poisson distribution specifies that the MEAN = VARIANCE

In this dataset, the variance is much larger than the mean, at all time points for both groups. This is evidence of overdispersion and suggest the scale parameter should be greater than one.

df_long %>% 
  dplyr::group_by(treatment, period) %>% 
  dplyr::summarise(N = n(),
                   M = mean(seizure.rate),
                   VAR = var(seizure.rate),
                   SD = sd(seizure.rate)) %>% 
  flextable::as_grouped_data(groups = "treatment") %>% 
  flextable::flextable() %>% 
  apaSupp::theme_apa(caption = "Check for OVERDISPERSION (mean = variance)")
Table 17.4
Check for OVERDISPERSION (mean = variance)

treatment

period

N

M

VAR

SD

placebo

1

28

9.36

102.76

10.14

2

28

8.29

66.66

8.16

3

28

8.79

215.29

14.67

4

28

7.96

58.18

7.63

Progabide

1

31

8.58

332.72

18.24

2

31

8.42

140.65

11.86

3

31

8.13

193.05

13.89

4

31

6.71

126.88

11.26

17.2.1.3 Correlation Across Time

Raw Scale

df_long %>% 
  dplyr::select(subject, period, seizure.rate ) %>% 
  tidyr::spread(key = period,
                value = seizure.rate ) %>% 
  dplyr::select(-subject) %>% 
  cor() %>% 
  corrplot::corrplot.mixed()

Log Scale

df_long %>% 
  dplyr::mutate(rate_wk = log(seizure.rate + 1)) %>% 
  dplyr::select(subject, period, rate_wk) %>% 
  tidyr::spread(key = period,
                value = rate_wk) %>% 
  dplyr::select(-subject) %>% 
  cor() %>% 
  corrplot::corrplot.mixed()

17.2.2 Visualize

17.2.2.1 Outcome on the Raw Scale

There appear to be quite a few extreme values or outliers, particularly for the Progabide group during period one.

Since the outcome is truely a COUNT, we will model it with a Poisson distribution combined with a LOG link.

df_long %>% 
  ggplot(aes(x = period,
             y = seizure.rate)) +
  geom_boxplot() +
  theme_bw() +
  facet_grid(.~ treatment)

To investigate possible outliers, we should transform the outcome with the log function first.

Note: Since some participants reported no seizures during a two week period and the log(0) is unndefinded, we must add some amount to the values before transforming. Here we have chosen to add the value of \(1\).

df_long %>% 
  ggplot(aes(x = period,
             y = log(seizure.rate + 1))) +
  geom_boxplot() +
  
  theme_bw() +
  facet_grid(.~ treatment)

df_long %>% 
  ggplot(aes(x = period,
             y = log(seizure.rate + 1))) +
  geom_line(aes(group = subject)) +
  theme_bw() +
  facet_grid(.~ treatment)

df_long %>% 
  ggplot(aes(x = period,
             y = log(seizure.rate + 1))) +
  geom_smooth(aes(group = subject),
              method = "lm",
              se = FALSE) +
  geom_smooth(aes(group = 1),
              color = "red",
              size = 1.5,
              method = "lm",
              se = FALSE) +
  theme_bw() +
  facet_grid(.~ treatment)

17.3 Poisson Regression (GLM)

Note: THIS IS NEVER APPROPRIATE TO CONDUCT A GLM ON REPEATED MEASURES. THIS IS DONE FOR ILLUSTRATION PURPOSES ONLY!

17.3.1 Fit the model

fit_glm <- glm(seizure.rate ~ base + age + treatment + offset(per),
               data = df_long,
               family = poisson(link = "log"))

summary(fit_glm)

Call:
glm(formula = seizure.rate ~ base + age + treatment + offset(per), 
    family = poisson(link = "log"), data = df_long)

Coefficients:
                    Estimate Std. Error z value Pr(>|z|)    
(Intercept)        -0.130616   0.135619   -0.96   0.3355    
base                0.022652   0.000509   44.48  < 2e-16 ***
age                 0.022740   0.004024    5.65  1.6e-08 ***
treatmentProgabide -0.152701   0.047805   -3.19   0.0014 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 2521.75  on 235  degrees of freedom
Residual deviance:  958.46  on 232  degrees of freedom
AIC: 1732

Number of Fisher Scoring iterations: 5

17.3.2 Tabulate the Model Parameters

apaSupp::tab_glm(fit_glm,
                caption = "Poisson Generalized Linear Regression (GLM) - WRONG!")
Table 17.5
Poisson Generalized Linear Regression (GLM) - WRONG!

Incident Rate Ratio

Log Scale

Variable

IRR

95% CI

b

(SE)

Wald

LRT

VIF

(Intercept)

0.88

[0.67, 1.14]

-0.13

(0.14)

.335

base

1.02

[1.02, 1.02]

0.02

(0.00)

< .001***

< .001***

1.17

age

1.02

[1.01, 1.03]

0.02

(0.00)

< .001***

< .001***

1.25

treatment

.001**

1.11

placebo

Progabide

0.86

[0.78, 0.94]

< .001

(0.05)

.001**

pseudo-R²

.999

Note. N = 236. CI = confidence interval; VIF = variance inflation factor. Significance denotes Wald t-tests for individual parameter estimates, as well as Likelihood Ratio Tests (LRT) for single-predictor deletion. Coefficient of determination displays Nagelkerke's pseudo-R².

* p < .05. ** p < .01. *** p < .001.

17.4 Generalized Estimating Equations (GEE)

17.4.1 Match Poisson Regresssion (GLM)

  • correlation structure: independence
  • scale parameter = \(1\)
fit_gee_ind_s1 <- geepack::geeglm(seizure.rate ~ base + age + treatment + offset(per),
                                  data = df_long,
                                  family = poisson(link = "log"),
                                  id = subject,
                                  waves = period,
                                  corstr = "independence",
                                  scale.fix = TRUE)

summary(fit_gee_ind_s1)

Call:
geepack::geeglm(formula = seizure.rate ~ base + age + treatment + 
    offset(per), family = poisson(link = "log"), data = df_long, 
    id = subject, waves = period, corstr = "independence", scale.fix = TRUE)

 Coefficients:
                   Estimate  Std.err   Wald Pr(>|W|)    
(Intercept)        -0.13062  0.36515   0.13     0.72    
base                0.02265  0.00124 336.05   <2e-16 ***
age                 0.02274  0.01158   3.86     0.05 *  
treatmentProgabide -0.15270  0.17111   0.80     0.37    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Correlation structure = independence 
Scale is fixed.

Number of clusters:   59  Maximum cluster size: 4 

17.4.2 Change Correlation Sturucture

  • correlation structure: exchangeable
  • scale parameter = \(1\)
fit_gee_exc_s1 <- geepack::geeglm(seizure.rate ~ base + age + treatment + offset(per),
                           data = df_long,
                           family = poisson(link = "log"),
                           id = subject,
                                  waves = period,
                           corstr = "exchangeable",
                           scale.fix = TRUE)

summary(fit_gee_exc_s1)

Call:
geepack::geeglm(formula = seizure.rate ~ base + age + treatment + 
    offset(per), family = poisson(link = "log"), data = df_long, 
    id = subject, waves = period, corstr = "exchangeable", scale.fix = TRUE)

 Coefficients:
                   Estimate  Std.err   Wald Pr(>|W|)    
(Intercept)        -0.13062  0.36515   0.13     0.72    
base                0.02265  0.00124 336.05   <2e-16 ***
age                 0.02274  0.01158   3.86     0.05 *  
treatmentProgabide -0.15270  0.17111   0.80     0.37    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Correlation structure = exchangeable 
Scale is fixed.

  Link = identity 

Estimated Correlation Parameters:
      Estimate Std.err
alpha    0.141  0.0415
Number of clusters:   59  Maximum cluster size: 4 

17.4.2.1 Interpretation

“For this example, the estimates of standard errors under the independence are about half of the corresponding robust estimates, and the the situation improves only a little when an exchangeable structure is fitted. Using naive standard errors leads, in particular, to a highly significant treatment effect which disappears when the robust estimates are used.”

17.4.3 Estimate the Additional Scale Parameter

“The problem with the GEE approach here, using either the independence or exchangeable correlation structure lies in constraining the scale parameter to be one. For these data there is overdispersion (variance is larger than the mean value) which has to be accommodated by allowing this parameter to be freely estimated.”

  • correlation structure: exchangeable
  • scale parameter = freely estimated
fit_gee_exc_sf <- geepack::geeglm(seizure.rate ~ base + age + treatment + offset(per),
                           data = df_long,
                           family = poisson(link = "log"),
                           id = subject,
                                  waves = period,
                           corstr = "exchangeable",
                           scale.fix = FALSE)

summary(fit_gee_exc_sf)

Call:
geepack::geeglm(formula = seizure.rate ~ base + age + treatment + 
    offset(per), family = poisson(link = "log"), data = df_long, 
    id = subject, waves = period, corstr = "exchangeable", scale.fix = FALSE)

 Coefficients:
                   Estimate  Std.err   Wald Pr(>|W|)    
(Intercept)        -0.13062  0.36515   0.13     0.72    
base                0.02265  0.00124 336.05   <2e-16 ***
age                 0.02274  0.01158   3.86     0.05 *  
treatmentProgabide -0.15270  0.17111   0.80     0.37    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Correlation structure = exchangeable 
Estimated Scale Parameters:

            Estimate Std.err
(Intercept)        5    1.55
  Link = identity 

Estimated Correlation Parameters:
      Estimate Std.err
alpha    0.397   0.067
Number of clusters:   59  Maximum cluster size: 4 

“When this is done (scale parameter is freely estimated) it gives the last set of results shown above. The estimate of \(\phi\) is 5.”

17.4.4 Compare Models

apaSupp::tab_gees(list("Independence" = fit_gee_ind_s1,
                       "Exchg, fixed" = fit_gee_exc_s1,
                       "Exchg, freed" = fit_gee_exc_sf),
                  narrow = TRUE,
                  caption = "Estimates on Odds-Ratio Scale")
Table 17.6
Estimates on Odds-Ratio Scale

Independence

Exchg, fixed

Exchg, freed

IRR

95% CI

IRR

95% CI

IRR

95% CI

(Intercept)

0.88

[0.43, 1.80]

0.88

[0.43, 1.80]

0.88

[0.43, 1.80]

base

1.02

[1.02, 1.03]***

1.02

[1.02, 1.03]***

1.02

[1.02, 1.03]***

age

1.02

[1.00, 1.05]*

1.02

[1.00, 1.05]*

1.02

[1.00, 1.05]*

treatment

placebo

Progabide

0.86

[0.61, 1.20]

0.86

[0.61, 1.20]

0.86

[0.61, 1.20]

Note.

* p < .05. ** p < .01. *** p < .001.

17.5 Conculsion

The Research Question

The main question of interest is whether taking progabide reduced the number of epileptic seizures compared with placebo.

The Conclusion

There is no evidence that Progabide effects the weekly rate of epileptic seizures differently than placebo.