glm
objectsThe enrichwith
R package provides the enrich
method to enrich list-like R
objects with new, relevant components. The resulting objects preserve
their class, so all methods associated with them still apply.
This vignette is a demo of the available enrichment options for
glm
objects.
The following data set is provided in McCullagh and Nelder (1989, sec. 8.4) and
consists of observations on n = 18 mean clotting times of blood
in seconds (time
) for each combination of nine percentage
concentrations of normal plasma (conc
) and two lots of
clotting agent (lot
).
clotting <- data.frame(conc = c(5,10,15,20,30,40,60,80,100,
5,10,15,20,30,40,60,80,100),
time = c(118, 58, 42, 35, 27, 25, 21, 19, 18,
69, 35, 26, 21, 18, 16, 13, 12, 12),
lot = factor(c(rep(1, 9), rep(2, 9))))
McCullagh and Nelder (1989, sec. 8.4)
fitted a series of nested generalized linear models assuming that the
times are realisations of independent Gamma random
variables whose mean varies appropriately with concentration and lot. In
particular, McCullagh and Nelder (1989)
linked the inverse mean of the gamma random variables to the linear
predictors ~ 1
, ~ log(conc)
,
~ log(conc) + lot
, ~ log(conc) * lot
and
carried out an analysis of deviance to conclude that the
~ log(u) * lot
provides the best explanation of clotting
times. The really close fit of that model to the data can be seen in the
figure below.
library("ggplot2")
clottingML <- glm(time ~ log(conc) * lot, family = Gamma, data = clotting)
alpha <- 0.01
pr_out <- predict(clottingML, type = "response", se.fit = TRUE)
new_data <- clotting
new_data$time <- pr_out$fit
new_data$type <- "fitted"
clotting$type <- "observed"
all_data <- rbind(clotting, new_data)
new_data <- within(new_data, {
low <- pr_out$fit - qnorm(1 - alpha/2) * pr_out$se.fit
upp <- pr_out$fit + qnorm(1 - alpha/2) * pr_out$se.fit
})
ggplot(all_data) +
geom_point(aes(conc, time, col = type), alpha = 0.8) +
geom_segment(data = new_data, aes(x = conc, y = low, xend = conc, yend = upp, col = type)) +
facet_grid(. ~ lot) +
theme_bw() +
theme(legend.position = "top")
The score function is the gradient of the log-likelihood and is a key object for likelihood inference.
The enrichwith R package provides methods for the
enrichment of glm
objects with the corresponding score
function. This can either be done by enriching the glm
object with auxiliary_functions
and then extracting the
score function
library("enrichwith")
enriched_clottingML <- enrich(clottingML, with = "auxiliary functions")
scores_clottingML <- enriched_clottingML$auxiliary_functions$score
or directly using the get_score_function
convenience
method
By definition, the score function has to have zero components when evaluated at the maximum likelihood estimates (only numerically zero here).
## (Intercept) log(conc) lot2 log(conc):lot2 dispersion
## 1.227769e-05 2.253154e-05 6.112868e-07 1.548001e-06 1.701344e-09
## attr(,"coefficients")
## (Intercept) log(conc) lot2 log(conc):lot2
## -0.016554382 0.015343115 -0.007354088 0.008256099
## attr(,"dispersion")
## dispersion
## 0.001632971
Another key quantity in likelihood inference is the expected information.
The auxiliary_functions
enrichment option of the
enrich
method enriches a glm
object with a
function for the evaluation of the expected information.
and can also be computed directly using the
get_information_function
method
One of the uses of the expected information is the calculation of
standard errors for the model parameters. The
stats::summary.glm
function already does that for
glm
objects, estimating the dispersion parameter (if any)
using Pearson residuals.
Duly, info_clottingML
returns (numerically) the same
standard errors as the summary
method does.
summary_std_errors <- coef(summary_clottingML)[, "Std. Error"]
einfo <- info_clottingML(dispersion = summary_clottingML$dispersion)
all.equal(sqrt(diag(solve(einfo)))[1:4], summary_std_errors, tolerance = 1e-05)
## [1] TRUE
Another estimate of the standard errors results by the observed information, which is the negative Hessian matrix of the log-likelihood.
At least at the time of writting the current vignette, there appears
to be no general implementation of the observed information for
glm
objects. I guess the reason for that is the dependence
of the observed information on higher-order derivatives of the inverse
link and variance functions, which are not readily available in base
R.
enrichwith provides options for the enrichment of
link-glm
and family
objects with such
derivatives (see ?enrich.link-glm
and
?enrich.family
for details), and, based on those allows the
enrichment of glm
objects with a function to compute the
observed information.
The observed and the expected information for the regression
parameters coincide for GLMs with canonical link, like
clottingML
oinfo <- info_clottingML(dispersion = summary_clottingML$dispersion, type = "observed")
all.equal(oinfo[1:4, 1:4], einfo[1:4, 1:4])
## [1] TRUE
which is not generally true for a glm
with a
non-canonical link, as seen below.
clottingML_log <- update(clottingML, family = Gamma("log"))
summary_clottingML_log <- summary(clottingML_log)
info_clottingML_log <- get_information_function(clottingML_log)
einfo_log <- info_clottingML_log(dispersion = summary_clottingML_log$dispersion, type = "expected")
oinfo_log <- info_clottingML_log(dispersion = summary_clottingML_log$dispersion, type = "observed")
round(einfo_log, 3)
## (Intercept) log(conc) lot2 log(conc):lot2 dispersion
## (Intercept) 757.804 2508.115 378.902 1254.058 0.00
## log(conc) 2508.115 8971.520 1254.058 4485.760 0.00
## lot2 378.902 1254.058 378.902 1254.058 0.00
## log(conc):lot2 1254.058 4485.760 1254.058 4485.760 0.00
## dispersion 0.000 0.000 0.000 0.000 16078.16
## attr(,"coefficients")
## (Intercept) log(conc) lot2 log(conc):lot2
## 5.50322528 -0.60191618 -0.58447142 0.03448168
## attr(,"dispersion")
## [1] 0.02375284
## (Intercept) log(conc) lot2 log(conc):lot2 dispersion
## (Intercept) 757.804 2508.114 378.902 1254.057 0.000
## log(conc) 2508.114 9056.792 1254.057 4527.583 -0.041
## lot2 378.902 1254.057 378.902 1254.057 0.000
## log(conc):lot2 1254.057 4527.583 1254.057 4527.583 -0.018
## dispersion 0.000 -0.041 0.000 -0.018 7610.128
## attr(,"coefficients")
## (Intercept) log(conc) lot2 log(conc):lot2
## 5.50322528 -0.60191618 -0.58447142 0.03448168
## attr(,"dispersion")
## [1] 0.02375284
We can now use scores_clottingML
and
info_clottingML
to carry out a score test to
compare nested models.
If lψ(ψ, λ) is the gradient of the log-likelihood with respect to ψ evaluated at ψ and λ, iψψ(ψ, λ) is the (ψ, ψ) block of the inverse of the expected information matrix, and λ̂ψ is the maximum likelihood estimator of λ for fixed ψ, then, assuming that the model is adequate lψ(ψ, λ̂ψ)⊤iψψ(ψ, λ̂ψ)lψ(ψ, λ̂ψ) has an asymptotic χdim(ψ)2 distribution.
The following code chunk computes the maximum likelihood estimates
when the parameters for lot
and log(conc):lot
are fixed to zero (λ̂ψ
for ψ = 0 above), along with
the score and expected information matrix evaluated at them ($l(,_) and
i(ψ, λ̂ψ),
above).
clottingML_nested <- update(clottingML, . ~ log(conc))
enriched_clottingML_nested <- enrich(clottingML_nested, with = "mle of dispersion")
coef_full <- coef(clottingML)
coef_hypothesis <- structure(rep(0, length(coef_full)), names = names(coef_full))
coef_hypothesis_short <- coef(enriched_clottingML_nested, model = "mean")
coef_hypothesis[names(coef_hypothesis_short)] <- coef_hypothesis_short
disp_hypothesis <- coef(enriched_clottingML_nested, model = "dispersion")
scores <- scores_clottingML(coef_hypothesis, disp_hypothesis)
info <- info_clottingML(coef_hypothesis, disp_hypothesis)
The object enriched_clottingML_nested
inherits from
enriched_glm
, glm
and lm
, and, as
illustrated above, enrichwith provides a corresponding
coef
method to extract the estimates for the mean
regression parameters and/or the estimate for the dispersion
parameter.
The score statiistic is then
## [1] 17.15989
which gives a p-value of
## [1] 0.000187835
For comparison, the Wald statistic for the same hypothesis is
## [,1]
## [1,] 19.18963
and the log-likelihood ratio statistic is
## dispersion
## 17.6435
which is close to the score statistic
glm
objects at parameter valuesenrichwith also provides the
get_simulate_function
method for glm
or
lm
objects. The get_simulate_function
computes
a function to simulate response vectors at /arbitrary/ values of the
model parameters, which can be useful when setting up simulation
experiments and for various inferential procedures (e.g. indirect
inference).
For example, the following code chunk simulates three response
vectors at the maximum likelihood estimates of the parameters for
clottingML
.
## sim_1 sim_2 sim_3
## 1 119.99301 117.71976 124.77001
## 2 55.81195 53.03677 51.33030
## 3 37.29072 37.29519 39.83985
## 4 34.15268 32.39287 33.38394
## 5 30.02089 26.76743 27.99428
## 6 25.41892 24.63002 25.23102
## 7 20.50630 21.33982 20.69055
## 8 18.36304 19.61270 18.50062
## 9 19.39328 19.08656 17.67781
## 10 72.03656 72.99039 71.62121
## 11 33.36883 33.57407 33.34055
## 12 25.09187 24.91749 24.47527
## 13 20.87817 21.60344 23.25175
## 14 18.66966 16.86550 16.78601
## 15 16.35583 15.69062 16.01810
## 16 13.71021 13.65021 13.99123
## 17 12.32866 13.18895 12.59467
## 18 11.68437 11.25791 12.23057
The result is the same to what the simulate
method
returns
## sim_1 sim_2 sim_3
## 1 119.99301 117.71976 124.77001
## 2 55.81195 53.03677 51.33030
## 3 37.29072 37.29519 39.83985
## 4 34.15268 32.39287 33.38394
## 5 30.02089 26.76743 27.99428
## 6 25.41892 24.63002 25.23102
## 7 20.50630 21.33982 20.69055
## 8 18.36304 19.61270 18.50062
## 9 19.39328 19.08656 17.67781
## 10 72.03656 72.99039 71.62121
## 11 33.36883 33.57407 33.34055
## 12 25.09187 24.91749 24.47527
## 13 20.87817 21.60344 23.25175
## 14 18.66966 16.86550 16.78601
## 15 16.35583 15.69062 16.01810
## 16 13.71021 13.65021 13.99123
## 17 12.32866 13.18895 12.59467
## 18 11.68437 11.25791 12.23057
but simulate_clottingML
can also be used to simulate at
any given parameter value
coefficients <- c(0, 0.01, 0, 0.01)
dispersion <- 0.001
samples <- simulate_clottingML(coefficients = coefficients, dispersion = dispersion, nsim = 500000, seed = 123)
The empirical means and variances based on samples
agree
with the exact means and variances at coefficients
and
dispersion
means <- 1/(model.matrix(clottingML) %*% coefficients)
variances <- dispersion * means^2
max(abs(rowMeans(samples) - means))
## [1] 0.004616079
## [1] 0.00792772
enrichwith also provides the
get_dmodel_function
, get_pmodel_function
and
get_qmodel_function
methods, which can be used to evaluate
densities or probability mass functions, distribution functions, and
quantile functions, respectively at arbitrary parameter values and data
settings.
For example, the following code chunk computes the density at the
observations in the clotting
data set under then maximum
likelihood fit
## 1 2 3 4 5 6 7
## 0.05114789 0.01729857 0.11264667 0.21780853 0.23240296 0.39469095 0.36529389
## 8 9 10 11 12 13 14
## 0.33731787 0.44320377 0.11009363 0.08136892 0.23566642 0.42776887 0.51483860
## 15 16 17 18
## 0.59720662 0.29329145 0.42214857 0.75181329
## attr(,"coefficients")
## (Intercept) log(conc) lot2 log(conc):lot2
## -0.016554382 0.015343115 -0.007354088 0.008256099
## attr(,"dispersion")
## dispersion
## 0.001632971
We can also compute the densities at user-supplied parameter values
## 1 2 3 4 5 6
## 1.504831e-05 6.298754e-04 2.784455e-03 5.394485e-03 1.427986e-02 1.392561e-02
## 7 8 9 10 11 12
## 2.241253e-02 2.775377e-02 2.904231e-02 1.573806e-02 3.429823e-02 4.497386e-02
## 13 14 15 16 17 18
## 5.143328e-02 6.281910e-02 7.022318e-02 7.721225e-02 8.508033e-02 9.434798e-02
## attr(,"coefficients")
## [1] -0.01 0.02 -0.01 0.00
## attr(,"dispersion")
## [1] 0.1
or even at different data points
new_data <- data.frame(conc = 5:10, time = 50:45, lot = factor(c(1, 1, 1, 2, 2, 2)))
cML_dmodel(data = new_data, coefficients = c(-0.01, 0.02, -0.01, 0), dispersion = 0.1)
## 1 2 3 4 5 6
## 0.02366298 0.01889204 0.01428215 0.02659079 0.02591742 0.02433105
## attr(,"coefficients")
## [1] -0.01 0.02 -0.01 0.00
## attr(,"dispersion")
## [1] 0.1
We can also compute cumulative probabilities and quantile functions. So
cML_qmodel <- get_qmodel_function(clottingML)
cML_pmodel <- get_pmodel_function(clottingML)
probs <- cML_pmodel(data = new_data, coefficients = c(-0.01, 0.02, -0.01, 0), dispersion = 0.1)
cML_qmodel(probs, data = new_data, coefficients = c(-0.01, 0.02, -0.01, 0), dispersion = 0.1)
## 1 2 3 4 5 6
## 50 49 48 47 46 45
## attr(,"coefficients")
## [1] -0.01 0.02 -0.01 0.00
## attr(,"dispersion")
## [1] 0.1
returns new_data$time
.
For example the fitted densities when (conc, lot)
is
c(15, 1)
, c(15, 2)
, c(40, 1)
or
c(40, 2)
are
new_data <- expand.grid(conc = c(15, 40),
lot = factor(1:2),
time = seq(0, 50, length = 100))
new_data$density <- cML_dmodel(new_data)
ggplot(data = new_data) +
geom_line(aes(time, density)) + facet_grid(conc ~ lot) +
theme_bw() +
geom_vline(data = clotting[c(3, 6, 12, 15), ], aes(xintercept = time), lty = 2)
The dashed vertical lines are the observed values of time for the
(conc, lot)
combinations in the figure.
enriched_glm
enrichwith provides a wrapper to the
glm
interface that results directly in
enriched_glm
objects that carry all the components
described above. For example,
enriched_clottingML <- enriched_glm(time ~ log(conc) * lot, family = Gamma, data = clotting)
names(enriched_clottingML$auxiliary_functions)
## [1] "score" "information" "bias" "simulate" "dmodel"
## [6] "pmodel" "qmodel"
## (Intercept) log(conc) lot2 log(conc):lot2 dispersion
## -5.642188e-15 -1.821113e-14 -2.523155e-15 -8.732598e-15 -5.815853e-13
## attr(,"coefficients")
## (Intercept) log(conc) lot2 log(conc):lot2
## 133.11331 -28.03263 -55.03734 11.89549
## attr(,"dispersion")
## dispersion
## 136.1231
## (Intercept) log(conc) lot2 log(conc):lot2 dispersion
## (Intercept) 0.13223326 0.4376542 0.06611663 0.2188271 0.0000000000
## log(conc) 0.43765424 1.5654878 0.21882712 0.7827439 0.0000000000
## lot2 0.06611663 0.2188271 0.06611663 0.2188271 0.0000000000
## log(conc):lot2 0.21882712 0.7827439 0.21882712 0.7827439 0.0000000000
## dispersion 0.00000000 0.0000000 0.00000000 0.0000000 0.0004857121
## attr(,"coefficients")
## (Intercept) log(conc) lot2 log(conc):lot2
## 133.11331 -28.03263 -55.03734 11.89549
## attr(,"dispersion")
## dispersion
## 136.1231
## (Intercept) log(conc) lot2 log(conc):lot2
## (Intercept) 1.322333e-01 4.376542e-01 6.611663e-02 2.188271e-01
## log(conc) 4.376542e-01 1.565488e+00 2.188271e-01 7.827439e-01
## lot2 6.611663e-02 2.188271e-01 6.611663e-02 2.188271e-01
## log(conc):lot2 2.188271e-01 7.827439e-01 2.188271e-01 7.827439e-01
## dispersion -4.141430e-17 -1.333502e-16 -1.840636e-17 -6.375119e-17
## dispersion
## (Intercept) -4.141430e-17
## log(conc) -1.333502e-16
## lot2 -1.840636e-17
## log(conc):lot2 -6.375119e-17
## dispersion 4.857121e-04
## attr(,"coefficients")
## (Intercept) log(conc) lot2 log(conc):lot2
## 133.11331 -28.03263 -55.03734 11.89549
## attr(,"dispersion")
## dispersion
## 136.1231
## (Intercept) log(conc) lot2 log(conc):lot2 dispersion
## 0.00000 0.00000 0.00000 0.00000 -30.24958
## attr(,"coefficients")
## (Intercept) log(conc) lot2 log(conc):lot2
## 133.11331 -28.03263 -55.03734 11.89549
## attr(,"dispersion")
## dispersion
## 136.1231
## dispersion
## 136.1231
The vignette bias
of enrichwith describes how to use the
auxiliary_functions
to reduce the bias of the maximum
likelihood estimator for generalized linear models.