🍓 课程推荐:2024 空间计量专题
主讲老师:范巧 (兰州大学)
课程时间:2024 年 10 月 2-4 日 (三天)
课程咨询:王老师 18903405450(微信)
作者: 余坚 (贵州财经大学)
邮箱: yujiangeren@163.com
Source: Xiang Zhou, 2021, Semiparametric estimation for causal mediation analysis with multiple causally ordered mediators. -Link-,-Data-
温馨提示: 文中链接在微信中无法生效。请点击底部「阅读原文」。或直接长按/扫描如下二维码,直达原文:
目录
1. 引言
2. 数据背景
3. R 语言实操
3.1 前期工作
3.2 估计过程
3.3 输出与绘制估计结果
4. 估计结果解读
5. 结语
6. 参考资料
7. 相关推文
1. 引言
在有序因果中介分析的半参数估计理论篇 [ 中介效应:有序因果中介分析的半参数估计A-理论 ] 中,我们已经对 个因果有序中介设定下,路径特定效应 (path-specific effects, PSEs) 的估计进行了理论介绍。本推文将在此基础上,复现作者原文中的实证结果,在 R 语言软件中演示具体操作,并对相应结果进行解读。
2. 数据背景
作者使用的数据来源于美国 1997 年全国青年纵向调查 (National Longitudinal Survey of Youth 1997),数据集中共有 2969 人,在 1997 年他们的年龄为 15 岁至 17 岁,并在 20 岁前完成了高中学业。相关研究表明,大学入学率对美国的政治参与 (选举中是否投票) 有实质性的积极影响,但这种因果关系背后的机制仍不清楚。大学入学率对政治参与的影响可能通过公民和政治兴趣的发展、经济状况的提高、或社会和职业网络等其他途径发挥作用。为了检验这些直接和间接的影响,我们考虑一个因果结构,令 表示大学入学率, 表示政治参与 , 和 表示两个因果有序中介,分别反映了 (a) 经济状况, (b) 公民和政治兴趣。
在该数据集中,处理 是一个二元变量,用来衡量个体在 20 岁之前是否上过 2 年或 4 年的大学。结果 是一个二元变量,用来衡量个体是否在 2010 年大选中投票。经济状况 由受访者 2006 年至 2009 年的平均年收入度量。公民和政治兴趣 由一组变量来度量,反映受访者在 2007 至 2010 年间对政府和公共事务的兴趣,以及参与志愿服务、捐赠、社区团体活动的情况。
预处理协变量 包括性别、种族、民族、1997 年的年龄、父母教育、父母收入、父母资产、父亲形象、是否与亲生父母都住在一起、军队职业能力测试分数、高中 GPA、物质使用指数、青少年犯罪指数,受访者在 18 岁之前是否有孩子,受访者的同龄人对大学的期望,以及一些学校水平的特征。
我们在上一篇推文中已经介绍过反映因果路径 ,,和 的 PSEs。为了便于说明,现在我们主要关注 的累积路径特定效应 (cumulative path-sprecific effects, cPSEs):
其中,第一个部分是大学入学率的 NDE,第二个和第三个部分反映了分别通过公民/政治兴趣中介和经济状况中介的处理效应的大小。
3. R 语言实操
3.1 前期工作
首先,我们进行一系列前期工作,包括载入所需的包,设定画图的主题,以及导入数据。
rm(list = ls())
library(haven)
library(survey)
library(gbm)
library(ranger)
library(rlang)
library(glmnet)
library(rsample)
library(caret)
library(tidyverse)
library(mice)
library(SuperLearner)
source("zmisc.R")
set.seed(02138)
# set my ggplot theme
mytheme <- theme_minimal(base_size = 16) +
theme(legend.position = "bottom",
plot.title = element_text(hjust = 0.5),
plot.caption = element_text(color = "grey30"))
theme_set(mytheme)
# load data
nlsy97_ed3 <- readRDS("nlsy97_ed3.RDS")
并且,我们使用多重插补法 (Multiple Imputation) 补齐缺失的协变量。
##########################################################
# Multiple Imputation for All Missing Covariates
##########################################################
# number of imputed datasets
I <- 10
nlsy97_ed3_mi <- nlsy97_ed3 %>%
# remove redundant variables for imputation
select(-id, -weight) %>%
mice(m = I, maxit = 10, defaultMethod = "rf")
nlsy97_ed3_full <- map(1:I, ~ complete(nlsy97_ed3_mi, .x)) %>%
map(., ~ mutate(.x,
id = nlsy97_ed3$id,
weight = nlsy97_ed3$weight,
logearn2007 = log(pmax(total_earnings_adj_2007, 1000)),
logearn2008 = log(pmax(total_earnings_adj_2008, 1000)),
logearn2009 = log(pmax(total_earnings_adj_2009, 1000)),
logearn2010 = log(pmax(total_earnings_adj_2010, 1000)),
))
saveRDS(nlsy97_ed3_full, file = "nlsy97_ed3_full.RDS")
3.2 估计过程
首先,我们载入所需的包并对一些变量进行预处理。
rm(list = ls())
library(survey)
library(gbm)
library(ranger)
library(glmnet)
library(rsample)
library(caret)
library(rlang)
library(tidyverse)
library(Hmisc)
library(SuperLearner)
source("zmisc.R")
set.seed(02138)
nlsy97_full <- readRDS("nlsy97_ed3_full.RDS") %>%
map(., ~ mutate(.x,
aveearn = rowMeans(select(.x, num_range("total_earnings_adj_", 2007:2010)), na.rm = TRUE),
logearn = log(pmax(aveearn, 1000)),
interest = 5 - rowMeans(select(.x, interest2008, interest2010), na.rm = TRUE)))
n <- nrow(nlsy97_full[[1]])
接着,我们开始设定变量与公式,此处 a
为处理组的平均处理效应 (Average Treatment Effects on Treated, ATT);y
为结果变量,即是否在 2010 年选举中投票;第一个中介 m1
为经济状况;第二个中介 m2
为一组多元变量,包括受访者在 2007 至 2010 年间对政府和公共事务的兴趣,以及参与志愿服务、捐赠、社区团体活动的情况;x
为一组控制变量。我们分别对无中介、仅有中介 m1
、中介 m1
和 m2
三种情况下的处理变量和结果变量设定计算公式。
##########################################################
# Formulas for treatment and outcome models
##########################################################
# continuous variables
nlsy97_full[[1]] %>%
map_lgl( ~ length(unique(.x)) > 2) %>%
which()
a <- "att"
y <- "vote2010"
m1 <- c("logearn")
m2 <- c("volunteer2007", "community2007", "donate2007", "interest")
x <- exprs(age97, female, black, hispan, parinc, paredu, parast, livebio, father,
afqt3, gpa, children18, substind, delinqind, rural97, south97,
frcoll_75, frcoll_90, schstolen, schthreat, schfight) %>%
map_chr(as_string)
a0_form <- as.formula(paste(a, " ~ ", paste(x, collapse= "+")))
a1_form <- as.formula(paste(a, " ~ ", paste(c(x, m1), collapse= "+")))
a2_form <- as.formula(paste(a, " ~ ", paste(c(x, m1, m2), collapse= "+")))
y0_form <- as.formula(paste(y, " ~ ", paste(c(x, a), collapse= "+")))
y1_form <- as.formula(paste(y, " ~ ", paste(c(x, a, m1), collapse= "+")))
y2_form <- as.formula(paste(y, " ~ ", paste(c(x, a, m1, m2), collapse= "+")))
最后,则是对原文中的理论估计模型进行编程并得到估计结果。作者使用了两种方法,一是结合结果模型与迭代回归输入,二是使用目标极大似然估计方法。注意以下命令全部运行完约需两个小时。
##########################################################
# Main analyses
##########################################################
estimands <- expand.grid(c(0, 1), c(0, 1), c(0, 1)) %>%
`colnames<-`(c("a1", "a2", "a3"))
I <- length(nlsy97_full)
S <- nrow(estimands)
K <- 5
out <- vector(mode = "list", I)
for(i in 1:I){
cat("imputed sample ", i, "\n")
df <- nlsy97_full[[i]]
df_p0 <- model.matrix(a0_form, data = df)[, -1] %>% as_tibble()
df_p1 <- model.matrix(a1_form, data = df)[, -1] %>% as_tibble()
df_p2 <- model.matrix(a2_form, data = df)[, -1] %>% as_tibble()
df_mu0 <- model.matrix(y0_form, data = df)[, -1] %>% as_tibble()
df_mu1 <- model.matrix(y1_form, data = df)[, -1] %>% as_tibble()
df_mu2 <- model.matrix(y2_form, data = df)[, -1] %>% as_tibble()
df_mu2n <- model.matrix(y2_form, data = mutate(df, att = 0))[, -1] %>% as_tibble()
df_mu1n <- model.matrix(y1_form, data = mutate(df, att = 0))[, -1] %>% as_tibble()
df_mu0n <- model.matrix(y0_form, data = mutate(df, att = 0))[, -1] %>% as_tibble()
df_mu2y <- model.matrix(y2_form, data = mutate(df, att = 1))[, -1] %>% as_tibble()
df_mu1y <- model.matrix(y1_form, data = mutate(df, att = 1))[, -1] %>% as_tibble()
df_mu0y <- model.matrix(y0_form, data = mutate(df, att = 1))[, -1] %>% as_tibble()
# create cross-fitting split
cf_fold <- createFolds(df$vote2010, K)
main_list <- vector(mode = "list", K)
for(k in 1:K){
cat(" cross-fitting fold ", k, "\n")
#################################################
# Design matrices for different models
#################################################
# auxiliary and main data
aux <- df[-cf_fold[[k]], ]
main <- df[cf_fold[[k]], ]
aux_p0 <- df_p0[-cf_fold[[k]], ]
aux_p1 <- df_p1[-cf_fold[[k]], ]
aux_p2 <- df_p2[-cf_fold[[k]], ]
main_p0 <- df_p0[cf_fold[[k]], ]
main_p1 <- df_p1[cf_fold[[k]], ]
main_p2 <- df_p2[cf_fold[[k]], ]
aux_mu0 <- df_mu0[-cf_fold[[k]], ]
aux_mu1 <- df_mu1[-cf_fold[[k]], ]
aux_mu2 <- df_mu2[-cf_fold[[k]], ]
main_mu0 <- df_mu0[cf_fold[[k]], ]
main_mu1 <- df_mu1[cf_fold[[k]], ]
main_mu2 <- df_mu2[cf_fold[[k]], ]
#################################################
# Treatment Models
#################################################
p0_sl <- SuperLearner(
Y = aux$att,
X = aux_p0,
newX = df_p0,
family = binomial(),
obsWeights = aux$weight,
SL.library = c("SL.mean", "SL.glmnet", "SL.ranger"),
control = list(saveFitLibrary = TRUE, trimLogit = 0.001),
cvControl = list(V = 5L, stratifyCV = TRUE, shuffle = TRUE, validRows = NULL)
)
p1_sl <- SuperLearner(
Y = aux$att,
X = aux_p1,
newX = df_p1,
family = binomial(),
obsWeights = aux$weight,
SL.library = c("SL.mean", "SL.glmnet", "SL.ranger"),
control = list(saveFitLibrary = TRUE, trimLogit = 0.001),
cvControl = list(V = 5L, stratifyCV = TRUE, shuffle = TRUE, validRows = NULL)
)
p2_sl <- SuperLearner(
Y = aux$att,
X = aux_p2,
newX = df_p2,
family = binomial(),
obsWeights = aux$weight,
SL.library = c("SL.mean", "SL.glmnet", "SL.ranger"),
control = list(saveFitLibrary = TRUE, trimLogit = 0.001),
cvControl = list(V = 5L, stratifyCV = TRUE, shuffle = TRUE, validRows = NULL)
)
df <- df %>% mutate(
p0_fit = p0_sl$SL.predict,
p1_fit = p1_sl$SL.predict,
p2_fit = p2_sl$SL.predict,
w0_n = I(att == 0)/(1 - p0_fit),
w0_y = I(att == 1)/p0_fit,
w1_nn = I(att == 0)/(1 - p0_fit) * 1,
w1_ny = I(att == 1)/(1 - p0_fit) * (1 - p1_fit)/p1_fit,
w1_yn = I(att == 0)/p0_fit * p1_fit/(1 - p1_fit),
w1_yy = I(att == 1)/p0_fit * 1,
w2_000 = I(att == 0)/(1 - p0_fit) * 1 * 1,
w2_001 = I(att == 1)/(1 - p0_fit) * 1 * (1 - p2_fit)/p2_fit,
w2_010 = I(att == 0)/(1 - p0_fit) * (1 - p1_fit)/p1_fit * p2_fit/(1 - p2_fit),
w2_011 = I(att == 1)/(1 - p0_fit) * (1 - p1_fit)/p1_fit * 1,
w2_100 = I(att == 0)/p0_fit * p1_fit/(1 - p1_fit) * 1,
w2_101 = I(att == 1)/p0_fit * p1_fit/(1 - p1_fit) * (1 - p2_fit)/p2_fit,
w2_110 = I(att == 0)/p0_fit * 1 * p2_fit/(1 - p2_fit),
w2_111 = I(att == 1)/p0_fit * 1 * 1,
)
#################################################
# Outcome models
#################################################
mu2_sl <- SuperLearner(
Y = aux$vote2010,
X = aux_mu2,
family = binomial(),
obsWeights = aux$weight,
SL.library = c("SL.mean", "SL.glmnet", "SL.ranger"),
control = list(saveFitLibrary = TRUE),
cvControl = list(V = 5L, shuffle = TRUE, validRows = NULL)
)
df$mu2_fit_a3n <- predict.SuperLearner(mu2_sl, newdata = df_mu2n)$pred
df$mu2_fit_a3y <- predict.SuperLearner(mu2_sl, newdata = df_mu2y)$pred
mu1_sl_a3n <- SuperLearner(
Y = df$mu2_fit_a3n[-cf_fold[[k]]],
X = aux_mu1,
family = gaussian(),
obsWeights = aux$weight,
SL.library = c("SL.mean", "SL.glmnet", "SL.ranger"),
control = list(saveFitLibrary = TRUE),
cvControl = list(V = 5L, shuffle = TRUE, validRows = NULL)
)
mu1_sl_a3y <- SuperLearner(
Y = df$mu2_fit_a3y[-cf_fold[[k]]],
X = aux_mu1,
family = gaussian(),
obsWeights = aux$weight,
SL.library = c("SL.mean", "SL.glmnet", "SL.ranger"),
control = list(saveFitLibrary = TRUE),
cvControl = list(V = 5L, shuffle = TRUE, validRows = NULL)
)
df$mu1_fit_a3n_a2n <- predict.SuperLearner(mu1_sl_a3n, newdata = df_mu1n)$pred
df$mu1_fit_a3n_a2y <- predict.SuperLearner(mu1_sl_a3n, newdata = df_mu1y)$pred
df$mu1_fit_a3y_a2n <- predict.SuperLearner(mu1_sl_a3y, newdata = df_mu1n)$pred
df$mu1_fit_a3y_a2y <- predict.SuperLearner(mu1_sl_a3y, newdata = df_mu1y)$pred
mu0_sl_a3n_a2n <- SuperLearner(
Y = df$mu1_fit_a3n_a2n[-cf_fold[[k]]],
X = aux_mu0,
family = gaussian(),
obsWeights = aux$weight,
SL.library = c("SL.mean", "SL.glmnet", "SL.ranger"),
control = list(saveFitLibrary = TRUE),
cvControl = list(V = 5L, shuffle = TRUE, validRows = NULL)
)
mu0_sl_a3n_a2y <- SuperLearner(
Y = df$mu1_fit_a3n_a2y[-cf_fold[[k]]],
X = aux_mu0,
family = gaussian(),
obsWeights = aux$weight,
SL.library = c("SL.mean", "SL.glmnet", "SL.ranger"),
control = list(saveFitLibrary = TRUE),
cvControl = list(V = 5L, shuffle = TRUE, validRows = NULL)
)
mu0_sl_a3y_a2n <- SuperLearner(
Y = df$mu1_fit_a3y_a2n[-cf_fold[[k]]],
X = aux_mu0,
family = gaussian(),
obsWeights = aux$weight,
SL.library = c("SL.mean", "SL.glmnet", "SL.ranger"),
control = list(saveFitLibrary = TRUE),
cvControl = list(V = 5L, shuffle = TRUE, validRows = NULL)
)
mu0_sl_a3y_a2y <- SuperLearner(
Y = df$mu1_fit_a3y_a2y[-cf_fold[[k]]],
X = aux_mu0,
family = gaussian(),
obsWeights = aux$weight,
SL.library = c("SL.mean", "SL.glmnet", "SL.ranger"),
control = list(saveFitLibrary = TRUE),
cvControl = list(V = 5L, shuffle = TRUE, validRows = NULL)
)
df$mu0_fit_a3n_a2n_a1n <- predict.SuperLearner(mu0_sl_a3n_a2n, newdata = df_mu0n)$pred
df$mu0_fit_a3n_a2n_a1y <- predict.SuperLearner(mu0_sl_a3n_a2n, newdata = df_mu0y)$pred
df$mu0_fit_a3n_a2y_a1n <- predict.SuperLearner(mu0_sl_a3n_a2y, newdata = df_mu0n)$pred
df$mu0_fit_a3n_a2y_a1y <- predict.SuperLearner(mu0_sl_a3n_a2y, newdata = df_mu0y)$pred
df$mu0_fit_a3y_a2n_a1n <- predict.SuperLearner(mu0_sl_a3y_a2n, newdata = df_mu0n)$pred
df$mu0_fit_a3y_a2n_a1y <- predict.SuperLearner(mu0_sl_a3y_a2n, newdata = df_mu0y)$pred
df$mu0_fit_a3y_a2y_a1n <- predict.SuperLearner(mu0_sl_a3y_a2y, newdata = df_mu0n)$pred
df$mu0_fit_a3y_a2y_a1y <- predict.SuperLearner(mu0_sl_a3y_a2y, newdata = df_mu0y)$pred
#################################################
# Targeted MLE
#################################################
# targeted mu2
mu2_tmle_000 <- glm(vote2010 ~ 0 + offset(I(qlogis(mu2_fit_a3n))) + w2_000, family = binomial(), weights = weight, data = df[cf_fold[[k]], ])
mu2_tmle_001 <- glm(vote2010 ~ 0 + offset(I(qlogis(mu2_fit_a3y))) + w2_001, family = binomial(), weights = weight, data = df[cf_fold[[k]], ])
mu2_tmle_010 <- glm(vote2010 ~ 0 + offset(I(qlogis(mu2_fit_a3n))) + w2_010, family = binomial(), weights = weight, data = df[cf_fold[[k]], ])
mu2_tmle_011 <- glm(vote2010 ~ 0 + offset(I(qlogis(mu2_fit_a3y))) + w2_011, family = binomial(), weights = weight, data = df[cf_fold[[k]], ])
mu2_tmle_100 <- glm(vote2010 ~ 0 + offset(I(qlogis(mu2_fit_a3n))) + w2_100, family = binomial(), weights = weight, data = df[cf_fold[[k]], ])
mu2_tmle_101 <- glm(vote2010 ~ 0 + offset(I(qlogis(mu2_fit_a3y))) + w2_101, family = binomial(), weights = weight, data = df[cf_fold[[k]], ])
mu2_tmle_110 <- glm(vote2010 ~ 0 + offset(I(qlogis(mu2_fit_a3n))) + w2_110, family = binomial(), weights = weight, data = df[cf_fold[[k]], ])
mu2_tmle_111 <- glm(vote2010 ~ 0 + offset(I(qlogis(mu2_fit_a3y))) + w2_111, family = binomial(), weights = weight, data = df[cf_fold[[k]], ])
df$mu2b_fit_000 <- predict(mu2_tmle_000, newdata = df, type = "response")
df$mu2b_fit_001 <- predict(mu2_tmle_001, newdata = df, type = "response")
df$mu2b_fit_010 <- predict(mu2_tmle_010, newdata = df, type = "response")
df$mu2b_fit_011 <- predict(mu2_tmle_011, newdata = df, type = "response")
df$mu2b_fit_100 <- predict(mu2_tmle_100, newdata = df, type = "response")
df$mu2b_fit_101 <- predict(mu2_tmle_101, newdata = df, type = "response")
df$mu2b_fit_110 <- predict(mu2_tmle_110, newdata = df, type = "response")
df$mu2b_fit_111 <- predict(mu2_tmle_111, newdata = df, type = "response")
# refit mu1
mu1b_sl_000 <- SuperLearner(
Y = df$mu2b_fit_000[-cf_fold[[k]]],
X = aux_mu1,
family = gaussian(),
obsWeights = aux$weight,
SL.library = c("SL.mean", "SL.glmnet", "SL.ranger"),
control = list(saveFitLibrary = TRUE),
cvControl = list(V = 5L, shuffle = TRUE, validRows = NULL)
)
mu1b_sl_001 <- SuperLearner(
Y = df$mu2b_fit_001[-cf_fold[[k]]],
X = aux_mu1,
family = gaussian(),
obsWeights = aux$weight,
SL.library = c("SL.mean", "SL.glmnet", "SL.ranger"),
control = list(saveFitLibrary = TRUE),
cvControl = list(V = 5L, shuffle = TRUE, validRows = NULL)
)
mu1b_sl_010 <- SuperLearner(
Y = df$mu2b_fit_010[-cf_fold[[k]]],
X = aux_mu1,
family = gaussian(),
obsWeights = aux$weight,
SL.library = c("SL.mean", "SL.glmnet", "SL.ranger"),
control = list(saveFitLibrary = TRUE),
cvControl = list(V = 5L, shuffle = TRUE, validRows = NULL)
)
mu1b_sl_011 <- SuperLearner(
Y = df$mu2b_fit_011[-cf_fold[[k]]],
X = aux_mu1,
family = gaussian(),
obsWeights = aux$weight,
SL.library = c("SL.mean", "SL.glmnet", "SL.ranger"),
control = list(saveFitLibrary = TRUE),
cvControl = list(V = 5L, shuffle = TRUE, validRows = NULL)
)
mu1b_sl_100 <- SuperLearner(
Y = df$mu2b_fit_100[-cf_fold[[k]]],
X = aux_mu1,
family = gaussian(),
obsWeights = aux$weight,
SL.library = c("SL.mean", "SL.glmnet", "SL.ranger"),
control = list(saveFitLibrary = TRUE),
cvControl = list(V = 5L, shuffle = TRUE, validRows = NULL)
)
mu1b_sl_101 <- SuperLearner(
Y = df$mu2b_fit_101[-cf_fold[[k]]],
X = aux_mu1,
family = gaussian(),
obsWeights = aux$weight,
SL.library = c("SL.mean", "SL.glmnet", "SL.ranger"),
control = list(saveFitLibrary = TRUE),
cvControl = list(V = 5L, shuffle = TRUE, validRows = NULL)
)
mu1b_sl_110 <- SuperLearner(
Y = df$mu2b_fit_110[-cf_fold[[k]]],
X = aux_mu1,
family = gaussian(),
obsWeights = aux$weight,
SL.library = c("SL.mean", "SL.glmnet", "SL.ranger"),
control = list(saveFitLibrary = TRUE),
cvControl = list(V = 5L, shuffle = TRUE, validRows = NULL)
)
mu1b_sl_111 <- SuperLearner(
Y = df$mu2b_fit_111[-cf_fold[[k]]],
X = aux_mu1,
family = gaussian(),
obsWeights = aux$weight,
SL.library = c("SL.mean", "SL.glmnet", "SL.ranger"),
control = list(saveFitLibrary = TRUE),
cvControl = list(V = 5L, shuffle = TRUE, validRows = NULL)
)
df$mu1_fit_000 <- predict.SuperLearner(mu1b_sl_000, newdata = df_mu1n)$pred
df$mu1_fit_001 <- predict.SuperLearner(mu1b_sl_001, newdata = df_mu1n)$pred
df$mu1_fit_010 <- predict.SuperLearner(mu1b_sl_010, newdata = df_mu1y)$pred
df$mu1_fit_011 <- predict.SuperLearner(mu1b_sl_011, newdata = df_mu1y)$pred
df$mu1_fit_100 <- predict.SuperLearner(mu1b_sl_100, newdata = df_mu1n)$pred
df$mu1_fit_101 <- predict.SuperLearner(mu1b_sl_101, newdata = df_mu1n)$pred
df$mu1_fit_110 <- predict.SuperLearner(mu1b_sl_110, newdata = df_mu1y)$pred
df$mu1_fit_111 <- predict.SuperLearner(mu1b_sl_111, newdata = df_mu1y)$pred
# targeted mu1
mu1_tmle_000 <- lm(mu2b_fit_000 ~ 0 + offset(mu1_fit_000) + w1_nn, weights = weight, data = df[cf_fold[[k]], ])
mu1_tmle_001 <- lm(mu2b_fit_001 ~ 0 + offset(mu1_fit_001) + w1_nn, weights = weight, data = df[cf_fold[[k]], ])
mu1_tmle_010 <- lm(mu2b_fit_010 ~ 0 + offset(mu1_fit_010) + w1_ny, weights = weight, data = df[cf_fold[[k]], ])
mu1_tmle_011 <- lm(mu2b_fit_011 ~ 0 + offset(mu1_fit_011) + w1_ny, weights = weight, data = df[cf_fold[[k]], ])
mu1_tmle_100 <- lm(mu2b_fit_100 ~ 0 + offset(mu1_fit_100) + w1_yn, weights = weight, data = df[cf_fold[[k]], ])
mu1_tmle_101 <- lm(mu2b_fit_101 ~ 0 + offset(mu1_fit_101) + w1_yn, weights = weight, data = df[cf_fold[[k]], ])
mu1_tmle_110 <- lm(mu2b_fit_110 ~ 0 + offset(mu1_fit_110) + w1_yy, weights = weight, data = df[cf_fold[[k]], ])
mu1_tmle_111 <- lm(mu2b_fit_111 ~ 0 + offset(mu1_fit_111) + w1_yy, weights = weight, data = df[cf_fold[[k]], ])
df$mu1b_fit_000 <- predict(mu1_tmle_000, newdata = df)
df$mu1b_fit_001 <- predict(mu1_tmle_001, newdata = df)
df$mu1b_fit_010 <- predict(mu1_tmle_010, newdata = df)
df$mu1b_fit_011 <- predict(mu1_tmle_011, newdata = df)
df$mu1b_fit_100 <- predict(mu1_tmle_100, newdata = df)
df$mu1b_fit_101 <- predict(mu1_tmle_101, newdata = df)
df$mu1b_fit_110 <- predict(mu1_tmle_110, newdata = df)
df$mu1b_fit_111 <- predict(mu1_tmle_111, newdata = df)
# refit mu0
mu0b_sl_000 <- SuperLearner(
Y = df$mu1b_fit_000[-cf_fold[[k]]],
X = aux_mu0,
family = gaussian(),
obsWeights = aux$weight,
SL.library = c("SL.mean", "SL.glmnet", "SL.ranger"),
control = list(saveFitLibrary = TRUE),
cvControl = list(V = 5L, shuffle = TRUE, validRows = NULL)
)
mu0b_sl_001 <- SuperLearner(
Y = df$mu1b_fit_001[-cf_fold[[k]]],
X = aux_mu0,
family = gaussian(),
obsWeights = aux$weight,
SL.library = c("SL.mean", "SL.glmnet", "SL.ranger"),
control = list(saveFitLibrary = TRUE),
cvControl = list(V = 5L, shuffle = TRUE, validRows = NULL)
)
mu0b_sl_010 <- SuperLearner(
Y = df$mu1b_fit_010[-cf_fold[[k]]],
X = aux_mu0,
family = gaussian(),
obsWeights = aux$weight,
SL.library = c("SL.mean", "SL.glmnet", "SL.ranger"),
control = list(saveFitLibrary = TRUE),
cvControl = list(V = 5L, shuffle = TRUE, validRows = NULL)
)
mu0b_sl_011 <- SuperLearner(
Y = df$mu1b_fit_011[-cf_fold[[k]]],
X = aux_mu0,
family = gaussian(),
obsWeights = aux$weight,
SL.library = c("SL.mean", "SL.glmnet", "SL.ranger"),
control = list(saveFitLibrary = TRUE),
cvControl = list(V = 5L, shuffle = TRUE, validRows = NULL)
)
mu0b_sl_100 <- SuperLearner(
Y = df$mu1b_fit_100[-cf_fold[[k]]],
X = aux_mu0,
family = gaussian(),
obsWeights = aux$weight,
SL.library = c("SL.mean", "SL.glmnet", "SL.ranger"),
control = list(saveFitLibrary = TRUE),
cvControl = list(V = 5L, shuffle = TRUE, validRows = NULL)
)
mu0b_sl_101 <- SuperLearner(
Y = df$mu1b_fit_101[-cf_fold[[k]]],
X = aux_mu0,
family = gaussian(),
obsWeights = aux$weight,
SL.library = c("SL.mean", "SL.glmnet", "SL.ranger"),
control = list(saveFitLibrary = TRUE),
cvControl = list(V = 5L, shuffle = TRUE, validRows = NULL)
)
mu0b_sl_110 <- SuperLearner(
Y = df$mu1b_fit_110[-cf_fold[[k]]],
X = aux_mu0,
family = gaussian(),
obsWeights = aux$weight,
SL.library = c("SL.mean", "SL.glmnet", "SL.ranger"),
control = list(saveFitLibrary = TRUE),
cvControl = list(V = 5L, shuffle = TRUE, validRows = NULL)
)
mu0b_sl_111 <- SuperLearner(
Y = df$mu1b_fit_111[-cf_fold[[k]]],
X = aux_mu0,
family = gaussian(),
obsWeights = aux$weight,
SL.library = c("SL.mean", "SL.glmnet", "SL.ranger"),
control = list(saveFitLibrary = TRUE),
cvControl = list(V = 5L, shuffle = TRUE, validRows = NULL)
)
df$mu0_fit_000 <- predict.SuperLearner(mu0b_sl_000, newdata = df_mu0n)$pred
df$mu0_fit_001 <- predict.SuperLearner(mu0b_sl_001, newdata = df_mu0n)$pred
df$mu0_fit_010 <- predict.SuperLearner(mu0b_sl_010, newdata = df_mu0y)$pred
df$mu0_fit_011 <- predict.SuperLearner(mu0b_sl_011, newdata = df_mu0y)$pred
df$mu0_fit_100 <- predict.SuperLearner(mu0b_sl_100, newdata = df_mu0n)$pred
df$mu0_fit_101 <- predict.SuperLearner(mu0b_sl_101, newdata = df_mu0n)$pred
df$mu0_fit_110 <- predict.SuperLearner(mu0b_sl_110, newdata = df_mu0y)$pred
df$mu0_fit_111 <- predict.SuperLearner(mu0b_sl_111, newdata = df_mu0y)$pred
# targeted mu0
mu0_tmle_000 <- lm(mu1b_fit_000 ~ 0 + offset(mu0_fit_000) + w0_n, weights = weight, data = df[cf_fold[[k]], ])
mu0_tmle_001 <- lm(mu1b_fit_001 ~ 0 + offset(mu0_fit_001) + w0_n, weights = weight, data = df[cf_fold[[k]], ])
mu0_tmle_010 <- lm(mu1b_fit_010 ~ 0 + offset(mu0_fit_010) + w0_n, weights = weight, data = df[cf_fold[[k]], ])
mu0_tmle_011 <- lm(mu1b_fit_011 ~ 0 + offset(mu0_fit_011) + w0_n, weights = weight, data = df[cf_fold[[k]], ])
mu0_tmle_100 <- lm(mu1b_fit_100 ~ 0 + offset(mu0_fit_100) + w0_y, weights = weight, data = df[cf_fold[[k]], ])
mu0_tmle_101 <- lm(mu1b_fit_101 ~ 0 + offset(mu0_fit_101) + w0_y, weights = weight, data = df[cf_fold[[k]], ])
mu0_tmle_110 <- lm(mu1b_fit_110 ~ 0 + offset(mu0_fit_110) + w0_y, weights = weight, data = df[cf_fold[[k]], ])
mu0_tmle_111 <- lm(mu1b_fit_111 ~ 0 + offset(mu0_fit_111) + w0_y, weights = weight, data = df[cf_fold[[k]], ])
df$mu0b_fit_000 <- predict(mu0_tmle_000, newdata = df)
df$mu0b_fit_001 <- predict(mu0_tmle_001, newdata = df)
df$mu0b_fit_010 <- predict(mu0_tmle_010, newdata = df)
df$mu0b_fit_011 <- predict(mu0_tmle_011, newdata = df)
df$mu0b_fit_100 <- predict(mu0_tmle_100, newdata = df)
df$mu0b_fit_101 <- predict(mu0_tmle_101, newdata = df)
df$mu0b_fit_110 <- predict(mu0_tmle_110, newdata = df)
df$mu0b_fit_111 <- predict(mu0_tmle_111, newdata = df)
main_list[[k]] <- df[cf_fold[[k]], ]
}
main_df <- reduce(main_list, bind_rows)
for (s in 1:S){
a1 <- estimands$a1[[s]]
a2 <- estimands$a2[[s]]
a3 <- estimands$a3[[s]]
main_df <- main_df %>%
mutate(
wt0_deno = a1 * p0_fit + (1 - a1) * (1 - p0_fit),
wt1_nume = a1 * p1_fit + (1 - a1) * (1 - p1_fit),
wt1_deno = a2 * p1_fit + (1 - a2) * (1 - p1_fit),
wt2_nume = a2 * p2_fit + (1 - a2) * (1 - p2_fit),
wt2_deno = a3 * p2_fit + (1 - a3) * (1 - p2_fit),
!!sym(paste0("w0_", a1, a2, a3)) := 1/trim(wt0_deno),
!!sym(paste0("w1_", a1, a2, a3)) := !!sym(paste0("w0_", a1, a2, a3)) * wt1_nume/trim(wt1_deno),
!!sym(paste0("w2_", a1, a2, a3)) := !!sym(paste0("w1_", a1, a2, a3)) * wt2_nume/trim(wt2_deno),
!!sym(paste0("mu2fit_", a1, a2, a3)) := a3 * mu2_fit_a3y + (1 - a3) * mu2_fit_a3n,
!!sym(paste0("mu1fit_", a1, a2, a3)) := a3 * a2 * mu1_fit_a3y_a2y + a3 * (1 - a2) * mu1_fit_a3y_a2n +
(1 - a3) * a2 * mu1_fit_a3n_a2y + (1 - a3) * (1 - a2) * mu1_fit_a3n_a2n,
!!sym(paste0("mu0fit_", a1, a2, a3)) := a3 * a2 * a1 * mu0_fit_a3y_a2y_a1y + a3 * a2 * (1 - a1) * mu0_fit_a3y_a2y_a1n +
a3 * (1 - a2) * a1 * mu0_fit_a3y_a2n_a1y + a3 * (1 - a2) * (1 - a1) * mu0_fit_a3y_a2n_a1n +
(1 - a3) * a2 * a1 * mu0_fit_a3n_a2y_a1y + (1 - a3) * a2 * (1 - a1) * mu0_fit_a3n_a2y_a1n +
(1 - a3) * (1 - a2) * a1 * mu0_fit_a3n_a2n_a1y + (1 - a3) * (1 - a2) * (1 - a1) * mu0_fit_a3n_a2n_a1n,
!!sym(paste0("www_", a1, a2, a3)) := as.double(att==a3) * !!sym(paste0("w2_", a1, a2, a3)) * vote2010,
!!sym(paste0("iii_", a1, a2, a3)) := !!sym(paste0("mu0fit_", a1, a2, a3)),
!!sym(paste0("eif_", a1, a2, a3)) := as.double(att==a3) * !!sym(paste0("w2_", a1, a2, a3)) *
(vote2010 - !!sym(paste0("mu2fit_", a1, a2, a3))) +
as.double(att==a2) * !!sym(paste0("w1_", a1, a2, a3)) * (!!sym(paste0("mu2fit_", a1, a2, a3)) -
!!sym(paste0("mu1fit_", a1, a2, a3))) +
as.double(att==a1) * !!sym(paste0("w0_", a1, a2, a3)) * (!!sym(paste0("mu1fit_", a1, a2, a3)) -
!!sym(paste0("mu0fit_", a1, a2, a3))) +
!!sym(paste0("mu0fit_", a1, a2, a3)),
!!sym(paste0("tmle_", a1, a2, a3)) := as.double(att==a3) * !!sym(paste0("w2_", a1, a2, a3)) *
(vote2010 - !!sym(paste0("mu2b_fit_", a1, a2, a3))) +
as.double(att==a2) * !!sym(paste0("w1_", a1, a2, a3)) * (!!sym(paste0("mu2b_fit_", a1, a2, a3)) -
!!sym(paste0("mu1b_fit_", a1, a2, a3))) +
as.double(att==a1) * !!sym(paste0("w0_", a1, a2, a3)) * (!!sym(paste0("mu1b_fit_", a1, a2, a3)) -
!!sym(paste0("mu0b_fit_", a1, a2, a3))) +
!!sym(paste0("mu0b_fit_", a1, a2, a3)),
!!sym(paste0("tmle2_", a1, a2, a3)) := !!sym(paste0("mu0b_fit_", a1, a2, a3))
)
}
out[[i]] <- main_df
}
save.image(file = "05_vote_main.RData")
3.3 输出与绘制估计结果
我们同样载入所需包、导入数据、设定绘图主题后,根据之前的估计结果计算得到 ATE 与其分解所得的 PSEs 的估计值与标准误,并通过 ggplot
绘制图形。
rm(list = ls())
library(survey)
library(gbm)
library(ranger)
library(glmnet)
library(rsample)
library(caret)
library(rlang)
library(tidyverse)
library(Hmisc)
library(SuperLearner)
library(scales)
source("zmisc.R")
set.seed(02138)
load("05_vote_main.RData")
# set my ggplot theme
mytheme <- theme_minimal(base_size = 18) +
theme(legend.position = "bottom",
plot.title = element_text(hjust = 0.5),
plot.caption = element_text(color = "grey30"))
theme_set(mytheme)
out_df <- out %>%
imap( ~ mutate(.x, imp = .y)) %>%
reduce(bind_rows) %>%
mutate(eif_type1_ate = eif_111 - eif_000,
eif_type2_ate = eif_111 - eif_000,
eif_type3_ate = eif_111 - eif_000,
eif_type1_pse3 = eif_001 - eif_000,
eif_type1_pse2 = eif_011 - eif_001,
eif_type1_pse1 = eif_111 - eif_011,
eif_type2_pse1 = eif_100 - eif_000,
eif_type2_pse2 = eif_110 - eif_100,
eif_type2_pse3 = eif_111 - eif_110,
eif_type3_pse3 = eif_001 - eif_000,
eif_type3_pse1 = eif_101 - eif_001,
eif_type3_pse2 = eif_111 - eif_101,
tmle_type1_ate = tmle_111 - tmle_000,
tmle_type2_ate = tmle_111 - tmle_000,
tmle_type3_ate = tmle_111 - tmle_000,
tmle_type1_pse3 = tmle_001 - tmle_000,
tmle_type1_pse2 = tmle_011 - tmle_001,
tmle_type1_pse1 = tmle_111 - tmle_011,
tmle_type2_pse1 = tmle_100 - tmle_000,
tmle_type2_pse2 = tmle_110 - tmle_100,
tmle_type2_pse3 = tmle_111 - tmle_110,
tmle_type3_pse3 = tmle_001 - tmle_000,
tmle_type3_pse1 = tmle_101 - tmle_001,
tmle_type3_pse2 = tmle_111 - tmle_101) %>%
group_by(imp) %>%
summarise_at(vars(contains("type")), list(est = ~ wtd.mean(.x, weight),
se = ~ sqrt(wtd.var(.x, weight)/length(.x)))) %>%
ungroup() %>%
pivot_longer(-imp) %>%
separate(name, into = c("estimator", "type", "estimand", "measure")) %>%
pivot_wider(names_from = measure, values_from = value) %>%
group_by(estimator, type, estimand) %>%
summarise(within_var = mean(se^2),
between_var = var(est),
total_var = within_var + (1 + 1/I)*between_var,
est = mean(est),
se = sqrt(total_var)) %>%
ungroup() %>%
filter(type == "type1") %>%
mutate(estimator = factor(estimator,
levels = c("eif", "tmle"),
labels = c(expression(paste("Estimating Equation (", hat(psi)[np]^eif2, ")")),
expression(paste("TMLE (", hat(psi)[tmle]^eif2, ")")))),
estimand = factor(estimand,
levels = rev(c("ate", "pse3", "pse2", "pse1")),
labels = rev(c(expression(paste("Total Effect (", psi[`111`]-psi[`000`], ")")),
expression(paste("Direct Effect (", psi[`001`]-psi[`000`], ")")),
expression(paste("via Civic/Political Interest (", psi[`011`]-psi[`001`], ")")),
expression(paste("via Economic Status (", psi[`111`]-psi[`011`], ")"))))))
ggplot(out_df, aes(x = estimand, y = est, shape = estimator)) +
geom_pointrange(aes(ymin = est - 1.96 * se, ymax = est + 1.96 * se),
position = position_dodge(width = - 0.5), size = 1) +
geom_hline(yintercept = 0, linetype = 2) +
scale_shape("", labels = parse_format()) +
scale_color_discrete("", labels = parse_format()) +
scale_x_discrete("", labels = parse_format()) +
scale_y_continuous("Effects of College Attendance on Voting") +
coord_flip()
out_df[, c("est", "se")] %>% round(3)
4. 估计结果解读
R 的最终估计结果如下所示:
est | se |
---|---|
0.153 | 0.023 |
0.008 | 0.005 |
0.045 | 0.008 |
0.1 | 0.021 |
0.157 | 0.022 |
0.002 | 0.005 |
0.051 | 0.008 |
0.103 | 0.021 |
我们将其与原文表格 (如下所示,括号中为标准误) 进行对比,可见结果是一致的。
估计方程 () | TMLE () | |
---|---|---|
平均总效应 | 0.152 (0.022) | 0.156 (0.023) |
经济状况路径 ( ) | 0.007 (0.005) | 0.002 (0.005) |
公民/政治兴趣路径 ( ) | 0.042 (0.008) | 0.049 (0.008) |
直接效应 ( ) | 0.103 (0.021) | 0.105 (0.021) |
此外,我们还通过 R 将以上结果可视化:
在理论篇中,我们知道本文的多重稳健估计量可以通过两种方法来估计,一是调整结果模型的估计方程,使包含逆概率权重的项等于零,只留下一个位于估计对象参数位置的迭代回归输入估计量,即表中的 。二是使用目标极大似然估计方法,通过两步拟合每个结果模型,即表中的 。从上表可知,这两个估计量均得到了类似的总效应和 PSEs 的估计值。大学入学率对选举中投票的效应有很大一部分是 “直接的”,也就是说,既不是通过提高经济状况路径发挥作用,也不是通过公民和政治兴趣路径发挥作用。
5. 结语
本推文对 个因果有序中介设定下 ATE 与其各类分解——自然直接效应、自然间接效应\总间接效应、自然路径特定效应、累积路径特定效应——的建模与估计实操部分进行了展示。在后续拓展中,感兴趣的读者可以对此操作进行复现,在政策评估等实证研究中进行运用。
6. 参考资料
Pearl, J. (2009) Causality: models, reasoning, and inference. Cambridge: Cambridge University Press. Robins, J.M. (2003) Semantics of causal dag models and the identification of direct and indirect effects. Highly Structured Stochastic Systems, 70–81. Tchetgen Tchetgen, E.J. & Shpitser, I. (2012) Semiparametric theory for causal mediation analysis: efficiency bounds, multiple robustness, and sensitivity analysis. Annals of Statistics, 40, 1816. -PDF- Avin, C., Shpitser, I. & Pearl, J. (2005) Identifiability of path-specific effects. In Proceedings of the 19th International Joint Conference on Artificial Intelligence, Morgan Kaufmann Publishers Inc, pp. 357–363. Zhou, X., & Yamamoto, T. (2023). Tracing causal paths from experimental and observational data. The Journal of Politics, 85(1), 250-265. -PDF- Vansteelandt, S., Bekaert, M. & Lange, T. (2012) Imputation strategies for the estimation of natural direct and indirect effects. Epidemiologic Methods, 1, 131–158. -PDF- VanderWeele, T.J., Vansteelandt, S. & Robins, J.M. (2014) Effect decomposition in the presence of an exposure-induced mediator-outcome confounder. Epidemiology, 25, 300–306. -PDF-
7. 相关推文
Note:产生如下推文列表的 Stata 命令为:
lianxh
安装最新版lianxh
命令:
ssc install lianxh, replace
专题:回归分析 中介效应分析:三段式中介效应模型真的适用于经济学研究吗? 专题:IV-GMM Stata:基于IV的因果中介分析-ivmediate 专题:内生性-因果推断 Stata:一般化的因果中介分析 专题:交乘项-调节-中介 Stata:中介效应分析新命令-sgmediation2 medsem-中介效应:基于结构方程模型SEM的中介效应分析-T331 Stata:因果中介分析大比拼-T323 med4way:中介效应和交互效应分析 Stata:调节中介效应检验.md Stata+R:一文读懂中介效应
🍓 课程推荐:2024 空间计量专题
主讲老师:范巧 (兰州大学)
课程时间:2024 年 10 月 2-4 日 (三天)
课程咨询:王老师 18903405450(微信)
🍉 扫码加入连享会微信群,提问交流更方便