--- title: "Detecting separation and infinite estimates in log binomial regression" author: "Florian Schwendinger" date: "`r Sys.Date()`" output: rmarkdown::html_vignette: number_sections: true bibliography: detectseparation.bib link-citations: yes vignette: > %\VignetteIndexEntry{Detecting separation and infinite estimates in log binomial regression} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.width = 6, fig.height = 6 ) ``` # Introduction In contrast to logistic regression in log-binomial regression, separation is not the only factor which decides if the maximum likelihood estimate (MLE) has infinite components. As we will see in the example below, for the log-binomial regression model (LBRM) there exist data configurations which are separated but the MLE still exists. The MLE of the LBRM can be obtained by solving the following optimization problem. $$ \begin{equation} \begin{array}{rl} \underset{\beta}{\textrm{maximize}} & \ell(\beta) = \displaystyle\sum_{i = 1}^n y_i ~ X_{i*} \beta + (1 - y_i) ~ \log(1 - \exp(X_{i*} \beta)) \ \ \ \ \textrm{s.t.} & X \beta \leq 0. \end{array} \end{equation} $$ From the optimization problem we can already guess that the different behavior with regard to the existence of the MLE is caused by the linear constraint $X \beta \leq 0$. Let $X^0$ be the submatrix of $X$ obtained by keeping only the rows $I^0 = \{i|y_i = 0\}$ and $X^1$ the submatrix obtained by keeping only the rows $I^1 = \{i|y_i = 1\}$. @schwendinger+gruen+hornik:2021 pointed out that the finiteness of the MLE can be checked by solving the following linear optimization problem. $$ \begin{equation} \begin{array}{rll} \underset{\beta}{\text{maximize}}~~ & - \sum_{i \in I^0} X_{i*} \beta \\ \text{subject to}~~ & X^0 \beta \leq 0 \\ & X^1 \beta = 0. \end{array} \end{equation} $$ The MLE has only finite components if the solution of this linear program is a zero vector. If the MLE contains infinite components, the linear programming problem is unbounded. The function `detect_infinite_estimates()` from the **detectseparation** implements the LP problem described above and can therefore be used to detect infinite components in the MLE of the LBRM. ```{r, echo = TRUE, eval = TRUE} library("detectseparation") ``` --- # Example To show the different effect of separation on the logistic regression model compared to the LBRM consider the following data. ```{r, tiny_example} data <- data.frame(a = c(1, 0, 3, 2, 3, 4), b = c(2, 1, 1, 4, 6, 8), y = c(0, 0, 0, 1, 1, 1)) ``` ## Detect separation Clearly the data is separated, which can be verified by using the `detect_separation` method (a detailed explanation of the output can be found in [Section 3](#output_details)). ```{r, tiny_example_sep} glm(y ~ a + b, data = data, family = binomial("logit"), method = "detect_separation") ``` Since separation is a property of the data, checking for separation gives the same result for the logistic regression and the LBRM. ```{r, tiny_example_sep_lbrm} glm(y ~ a + b, data = data, family = binomial("log"), method = "detect_separation") ``` ## Detect infinite estimates For logistic regression separation is necessary and sufficient that the MLE contains infinite components. 
## Detect infinite estimates

For logistic regression, separation is necessary and sufficient for the MLE to contain infinite components.

```{r, tiny_example_inf_est}
glm(y ~ a + b, data = data, family = binomial("logit"),
    method = "detect_infinite_estimates")
```

However, due to the linear constraint of the LBRM, there exist data configurations for which the MLE does exist (i.e., has only finite components), despite the fact that the data is separated.

```{r}
glm(y ~ a + b, data = data, family = binomial("log"),
    method = "detect_infinite_estimates")
```

## Fitting the LBRM

Using `glm` to fit this model, we get the following error message.

```{r, glm_fit}
fit <- try(glm(y ~ a + b, data = data, family = binomial("log")))
```

The error message means that we should provide starting values. A simple but reliable approach is to use $(-1, 0, ..., 0)$ as starting value.

Since in this example one of the constraints $X_{i*} \beta \leq 0$ is binding by design, the iteratively re-weighted least squares (IRLS) method used by the `glm` function has convergence problems. The `glm` function informs us about these problems by issuing several warnings. The warnings

```
#> Warning: step size truncated: out of bounds
#> Warning: glm.fit: algorithm stopped at boundary value
```

tell us that IRLS is not the best option for optimization problems with binding constraints. The warning

```
#> Warning: glm.fit: algorithm did not converge
```

tells us that the algorithm did not converge. In most cases this just means that the default value for the maximum number of iterations should be increased.

```{r}
args(glm.control)
```

The default value for the maximum number of iterations is quite low (`maxit = 25`). While `maxit = 25` is typically high enough for unconstrained optimization problems (almost all models supported by `glm`), it is often too low for the LBRM. For most data sets, setting `maxit = 10000` when estimating LBRMs is sufficient.

```{r, tiny_example_inf_esti_log_separation}
formula <- y ~ a + b
start <- c(-1, double(ncol(model.matrix(formula, data = data)) - 1L))
ctrl <- glm.control(epsilon = 1e-8, maxit = 10000, trace = FALSE)
suppressWarnings(
  fit <- glm(formula, data = data, family = binomial("log"),
             start = start, control = ctrl)
)
summary(fit)
```

We can verify that one of the constraints $X_{i*} \beta \leq 0$ is binding (i.e., $X_{i*} \beta = 0$ for at least one $i$) by multiplying the model matrix with the coefficients.

```{r}
print(mm <- drop(model.matrix(formula, data) %*% coef(fit)))
abs(mm) < 1e-6
```
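As a cross-check of the fit above, the constrained maximum likelihood problem from the introduction can also be solved directly with a general-purpose optimizer. The sketch below uses `constrOptim()` from base R, which enforces the linear constraints via an adaptive logarithmic barrier; it is our own illustration and not part of **detectseparation**, and since the barrier method only approximates solutions on the boundary, the result should agree with the `glm` fit up to numerical tolerance.

```{r, constroptim_check}
## Minimal sketch: maximize the LBRM log-likelihood subject to X beta <= 0.
## constrOptim() expects constraints of the form ui %*% beta - ci >= 0,
## so we pass ui = -X and ci = 0; the start from above is strictly feasible.
X <- model.matrix(formula, data = data)
y <- data$y
negloglik <- function(beta) {
  eta <- drop(X %*% beta)
  ## negative log-likelihood; log1p(-exp(eta)) is a stable log(1 - exp(eta))
  -sum(y * eta + (1 - y) * log1p(-exp(eta)))
}
opt <- constrOptim(theta = start, f = negloglik, grad = NULL,
                   ui = -X, ci = rep(0, nrow(X)))
## The barrier solution should approximately reproduce the glm fit.
cbind(glm = coef(fit), constrOptim = opt$par)
```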
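For contrast with the finite boundary solution above, recall from the detection results that the logistic MLE for these data has infinite components. Numerically this shows up when refitting the logistic model with a generous iteration limit: `glm` stops with very large coefficients and warns that fitted probabilities numerically 0 or 1 occurred.

```{r, logit_divergence}
## With separated data the logistic MLE does not exist; IRLS drifts towards
## infinity and the reported coefficients are very large in absolute value.
suppressWarnings(
  fit_logit <- glm(y ~ a + b, data = data, family = binomial("logit"),
                   control = glm.control(maxit = 1000))
)
coef(fit_logit)
```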
---

# Details on the output{#output_details}

## Detect separation

```{r, explain_output_detect_separation}
glm(y ~ a + b, data = data, family = binomial("logit"),
    method = "detect_separation")
```

The output above provides several pieces of information:

1. The first line shows that the linear optimization solver **lpsolve** was used to check for separation.
2. The line `Separation: TRUE` indicates that the data is separated.
3. The coefficients at `Inf` and `-Inf` indicate that the underlying optimization problem is unbounded and therefore the MLE does not exist for the logistic regression model.

## Detect infinite estimates

```{r, explain_output_detect_infinite_estimates}
glm(y ~ a + b, data = data, family = binomial("log"),
    method = "detect_infinite_estimates")
```

The output above provides several pieces of information:

1. The first line shows that the linear optimization solver **lpsolve** was used to check for infinite estimates.
2. The line `Infinite estimates: FALSE` indicates that the MLE has only finite components (i.e., the MLE exists).
3. The coefficients are all `0`, which again indicates that the MLE has only finite components and that the underlying optimization problem is therefore bounded.

---

# Choosing starting values

For the LBRM the `glm` function often requires the user to specify starting values. Valid starting values have to reside in the interior of the feasible region ($X \beta < 0$). Several methods have been suggested for finding valid starting values.

## Simple approach

If an intercept is present in the model, `(-1, 0, ..., 0)` will always provide valid starting values.

```{r}
find_start_simple <- function(formula, data) {
  c(-1, double(ncol(model.matrix(formula, data = data)) - 1L))
}
find_start_simple(formula, data)
max(model.matrix(formula, data = data) %*% find_start_simple(formula, data))
```

## Hot start via Poisson model

@andrade+andrade2018 suggest a hot start method which uses the modified estimation result of a Poisson model with log link as starting values. Again, this method only works if an intercept is present.

```{r}
find_start_poisson <- function(formula, data, delta = 1) {
  b0 <- coef(glm(formula, data, family = poisson(link = "log")))
  mX <- -model.matrix(formula, data = data)[, -1L, drop = FALSE]
  b0[1] <- min(mX %*% b0[-1]) - delta
  b0
}
find_start_poisson(formula, data)
max(model.matrix(formula, data = data) %*% find_start_poisson(formula, data))
```

## Recommendation

One can also solve an LP to find valid starting values, or think of other strategies. However, for the benchmark examples reported in @schwendinger+gruen+hornik:2021 we found no conclusive evidence that one of these initialization methods outperforms the others. Therefore, my personal favorite is the simple approach `(-1, 0, ..., 0)`.

---

# References