Logistic Regression, Average Marginal Effects, and the Linear Probability Model - Part IV: How AMEs are affected by the distribution of omitted variables
In the previous post, we saw that an average marginal effect (AME) of an independent variable in a logistic regression model reflects not only the influence of the variable in focus (i.e., the variable for which the AME is computed), but also the impact of variables omitted from the model. A perhaps desirable consequence of this is that AMEs change little if an omitted variable is added to the model (provided it is uncorrelated with the variables already present in the model). However, we also saw that the mean and standard deviation of the variable in focus affect its AME.
Since AMEs are affected by the strength of the influence of an omitted variable, one may ask whether the distribution of an omitted variable affects the AME. This is explored in the following simulation studies.
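To fix notation: the AME considered in this series is the logit coefficient scaled by the sample average of the derivative factor, i.e., for a fitted model with predicted probabilities \(\hat p_i\),

\[ \mathrm{AME}_1 = \hat\beta_1 \cdot \frac{1}{n}\sum_{i=1}^{n}\hat p_i\,(1-\hat p_i), \]

which is exactly the quantity computed by the AME() helper defined below.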
The function defined in the following provides the core replications of the
simulation study. In this function, two normally distributed random variables x1 and
x2 are created with expected values mu.x1 and mu.x2. Additionally, a binary
dependent variable is created that follows a logistic regression model with
intercept a and coefficients b1 and b2 for the predictor variables
x1 and x2. The function returns the coefficient from a logistic
regression, the coefficient from a linear regression, and the AME from the logistic regression, where x1 is
included in the regression model and x2 is omitted.
fun <- function(a=0, b1=1, b2=1, mu.x1=0, mu.x2=0, n=5000) {
    x1 <- rnorm(n=n, mean=mu.x1)
    x2 <- rnorm(n=n, mean=mu.x2)
    p <- logistic(a + b1*x1 + b2*x2)
    y <- rbinom(n=n, size=1, prob=p)
    glm <- glm(y ~ x1, family=binomial, maxit=6)
    lm <- lm(y ~ x1)
    c(
        b_glm = coef(glm)[-1],
        AME_glm = AME(glm),
        b_lm = coef(lm)[-1]
    )
}
Before the simulations are run, a few preparations are made.
library(memisc)
source("simsim.R")
logistic <- function(x) 1/(1+exp(-x))
AME <- function(mod) {
    p <- predict(mod, type="response")
    cf <- coef(mod)
    cf <- cf[names(cf) != "(Intercept)"]
    unname(mean(p*(1-p), na.rm=TRUE)) * cf
}
library(ggplot2)
theme_set(theme_bw())
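As a quick sanity check of these helpers, the following self-contained sketch re-creates a simplified, single-predictor version of the data-generating step in fun (it inlines the computation rather than calling the functions above):

```r
# Self-contained sanity check: in a single-predictor logistic model the AME
# equals the coefficient scaled by the average of p*(1-p), so its magnitude
# is always smaller than that of the coefficient itself.
set.seed(42)
logistic <- function(x) 1/(1 + exp(-x))
x <- rnorm(5000)
y <- rbinom(5000, size = 1, prob = logistic(x))   # true coefficient is 1
mod <- glm(y ~ x, family = binomial)
p <- predict(mod, type = "response")
ame <- unname(mean(p * (1 - p)) * coef(mod)["x"])
ame   # roughly 0.2, clearly smaller than the logit coefficient of about 1
```

Since p(1-p) can never exceed 0.25, the AME is at most a quarter of the logit coefficient in absolute value.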
A first simulation study: How the mean of an omitted variable affects the AME
The following simulation run varies the true coefficient of the omitted variable as well as its expected value.
simres <- simpar(fun,
conditions=list(
b2=c(0,0.1,.3,0.5,1,1.5,2,2.5,3),
mu.x2=c(-2,-1.5,-1,-.5,0,.5,1,1.5,2)
),
nsim=100)
Maximum number of cores available is 20.
Using 18 cores for 18 threads/jobs ...
simres.df <- as.data.frame(simres)
The following code creates a diagram that shows the distribution of the estimates of the
coefficient of x1 in the logistic regression.
The resulting diagram shows, as already seen earlier, that the estimate is biased if the omitted variable has an influence on the response. The bias increases with the strength of the influence of the omitted variable. If this influence is strong, the expected value of the omitted variable also has a slight influence on this bias, with greater values leading to smaller bias.
ggplot(simres.df) +
geom_boxplot(aes(mu.x2,b_glm.x1,group=mu.x2)) +
facet_wrap(~b2, labeller = label_both)
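A plausible explanation for this pattern (stated here as a heuristic, not derived in this post) is the attenuation that results from marginalizing a logistic model over an omitted, uncorrelated predictor. Using the standard normal-logistic approximation, averaging over \(x_2 \sim N(\mu_2, 1)\) gives approximately

\[ \Pr(y=1 \mid x_1) \approx \operatorname{logistic}\!\left(\frac{a + b_2\mu_2 + b_1 x_1}{\sqrt{1 + c^2 b_2^2}}\right), \qquad c = \frac{16\sqrt{3}}{15\pi} \approx 0.588, \]

so the coefficient of x1 recovered from the misspecified model is attenuated by roughly the factor \(1/\sqrt{1+c^2 b_2^2}\), which depends on b2 but, to this order of approximation, not on mu.x2; the mean of the omitted variable enters only through the intercept, which is consistent with its comparatively slight influence on the bias seen above.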
The following code creates a diagram that shows the distribution of the AME of x1 from the logistic regression.
The diagram indicates that the AME of x1 becomes smaller as the influence of x2 (expressed by the
coefficient b2) grows. (This was already observed in a previous post.)
Yet in addition, the AME of x1 is also affected by the mean of x2: the further the mean of x2 is from zero, the further the predicted probabilities are pushed towards zero or one, which shrinks the factor p(1-p) and with it the AME.
ggplot(simres.df) +
geom_boxplot(aes(mu.x2,AME_glm.x1,group=mu.x2)) +
facet_wrap(~b2, labeller = label_both)
A second simulation study: Comparing the effect of the mean of a predictor variable and an omitted variable on the AME
The second simulation study varies the expected values of both a predictor variable present in a
model (represented by x1) and an omitted variable (represented by x2) and examines their
effect on the AME values.
simres1 <- simpar(fun,
conditions=list(
mu.x1=c(-2,-1.5,-1,-.5,0,.5,1,1.5,2),
mu.x2=c(-2,-1.5,-1,-.5,0,.5,1,1.5,2)
),
nsim=100)
Maximum number of cores available is 20.
Using 18 cores for 18 threads/jobs ...
simres1.df <- as.data.frame(simres1)
The next two diagrams show how the values of the AMEs vary with the expected values
of both x1 and x2. The first diagram has the expected values of x2 on the
horizontal axis, while the expected values of x1 define the panels; the second
diagram has the expected values of x1 on the horizontal axis, with
the expected values of x2 defining the panels.
ggplot(simres1.df) +
geom_boxplot(aes(mu.x2,AME_glm.x1,group=mu.x2)) +
facet_wrap(~mu.x1, labeller = label_both)
ggplot(simres1.df) +
geom_boxplot(aes(mu.x1,AME_glm.x1,group=mu.x1)) +
facet_wrap(~mu.x2, labeller = label_both)
The two diagrams are very similar to each other, suggesting that the two variables, the included and the omitted one, are exchangeable in terms of their impact on the AME.
The third diagram obtained from this simulation study shows that logistic regression coefficients are also affected by the means of included and omitted variables. However, this influence appears much weaker than in the case of average marginal effects.
ggplot(simres1.df) +
geom_boxplot(aes(mu.x2,b_glm.x1,group=mu.x2)) +
facet_wrap(~mu.x1, labeller = label_both)
Conclusion
The simulation study described above casts further doubt on the comparability of average marginal effects (AMEs) across samples from different (sub-)populations. That the mean of a predictor variable included in the model affects the values of the AME was already demonstrated in the previous post. If there were no omitted variables, one could at least anticipate whether AMEs from different samples are incomparable by comparing the distributions of the variable for which the AME is computed. In practice, however, it can seldom be ruled out completely that relevant variables are omitted from a logistic regression model. This means that it cannot always be decided whether differences between samples in terms of the distribution of observed variables lead to an incomparability of AMEs, because samples may (also) differ in the distribution of relevant variables that are not included in the logistic regression model.