写在前面
最近写报告需要用到偏最小二乘法的结构方程模型,正好顺便更新一下。不过在做之前我们需要明确一下基于plspm包的SEM的优缺点:
1. plspm的优点有哪些?
答:数据不符合正态分布时、自变量存在多重共线性时、样本量较小时。均可以使用,也就是说,是数据就可以用,什么条件都可以不在乎。
2. plspm的缺点有哪些?
答:我们做piecewiseSEM的时候我们的P值必须大于0.05才算检验通过,而基于plspm的时候,我们只有一个GOF值来判断模型结果,一般大于0.33就可以。问题也就是出在了这里,plspm本质上是讨论多维变量间的关系(就是类似于我们的复合变量结构方程模型),所以如果我们像平时一样构建单个指标间的关系的时候,GOF值是没法计算的。此外,plspm也不支持变量之间的相互影响。不过cSEM包可以克服这些困难,我们后边陆续更新。
3. 我为什么更新这一期?
答:我发现现在网上大多数代码需要手动书写01矩阵来实现指标间相互关系的确定,这简直麻烦到爆炸。对于习惯使用了简单语言的piecewiseSEM的我们,我根据piecewise的语法规则将这一步自动化了,就是我们依旧按照piecewise的语法书写指标间的关系,然后代码自动生成这个01矩阵带入计算,这样就不费脑子了,又能省好多脑细胞。
https://doi.org/10.1016/j.soilbio.2023.109010
我为什么举这个例子呢?是因为如果非要使用plspm探讨单个因子的关系又没办法提供GOF值的时候,是不是可以和这篇GCB一样,干脆不提这一茬就好
准备数据集
数据如下,小编已上传,文末下载。
开始分析
我们先读取数据,然后开始指定变量间的关系
PS:这里的指定变量关系就和我们构建复合变量的结构方程模型一样,是将读取进来的数据中的指标赋值给我们规定的指标,例如我们的土壤指标由BD SM构成。
定义潜变量列表就是将构建好的变量按顺序列出,然后我们使用公式表达这些关系,参照的概念模型如下
# 加载必要的R包
library(plspm)
# 读取数据
data <- read.csv(file.choose(), header = TRUE)
# 指定变量关系
comdata <- list(
soil = c("BD", "SM"),
plant=c("TN","TP","TOC"),
AK = "AK",
AN = "AN",
AP = "AP",
TRAD ="TRAD"
)
# 定义潜变量列表(包含模型中所有潜变量)
latent_vars <- c("soil", "plant", "AK", "AN", "AP", "TRAD")
# 将路径关系用公式列表表示
model_formulas <- list(
TRAD = ~ soil + plant + AK + AN + AP,
AK = ~ soil + plant,
AN = ~ soil + plant,
AP = ~ soil + plant
)
然后我们自动生成01矩阵
# 创建一个全0矩阵,用于存储路径关系
dat_path <- matrix(0, nrow = length(latent_vars), ncol = length(latent_vars),
dimnames = list(latent_vars, latent_vars))
# 根据公式自动填充路径矩阵
for (end_var in names(model_formulas)) {
# 提取公式右侧的自变量
rhs_vars <- all.vars(model_formulas[[end_var]])
# 将对应位置标记为1
rhs_vars] <- 1
}
# 确保路径矩阵是下三角矩阵
dat_path <- dat_path * lower.tri(dat_path)
# 输出路径矩阵
print(dat_path)
生成的矩阵如下。这一步以前都是手动的,但是如果关系一复杂,谁还分得指标间的关系,所以我对这一步代码甚是满意。
print(dat_path)
soil plant AK AN AP TRAD
soil 0 0 0 0 0 0
plant 0 0 0 0 0 0
AK 1 1 0 0 0 0
AN 1 1 0 0 0 0
AP 1 1 0 0 0 0
TRAD 1 1 1 1 1 0
然后我们运行plspm即可,这里的6是什么意思,就是变量一共有6个的意思,测量模式默认即可
dat_modes <- c("A", 6)
dat_pls <- plspm(data, dat_path, comdata, modes = dat_modes)
summary(dat_pls)
我们查看模型结果
summary(dat_pls)
PARTIAL LEAST SQUARES PATH MODELING (PLS-PM)
MODEL SPECIFICATION
1 Number of Cases 54
2 Latent Variables 6
3 Manifest Variables 9
4 Scale of Data Standardized Data
5 Non-Metric PLS FALSE
6 Weighting Scheme centroid
7 Tolerance Crit 1e-06
8 Max Num Iters 100
9 Convergence Iters 3
10 Bootstrapping FALSE
11 Bootstrap samples NULL
BLOCKS DEFINITION
Block Type Size Mode
1 soil Exogenous 2 A
2 plant Exogenous 3 A
3 AK Endogenous 1 A
4 AN Endogenous 1 A
5 AP Endogenous 1 A
6 TRAD Endogenous 1 A
BLOCKS UNIDIMENSIONALITY
Mode MVs C.alpha DG.rho eig.1st eig.2nd
soil A 2 0.000 1.04e-29 1.07 0.933
plant A 3 0.236 5.56e-01 1.41 1.013
AK A 1 1.000 1.00e+00 1.00 0.000
AN A 1 1.000 1.00e+00 1.00 0.000
AP A 1 1.000 1.00e+00 1.00 0.000
TRAD A 1 1.000 1.00e+00 1.00 0.000
OUTER MODEL
weight loading communality redundancy
soil
1 BD 0.7741 0.728 0.5301 0.000
1 SM 0.6870 0.635 0.4035 0.000
plant
2 TN 0.8293 0.817 0.6671 0.000
2 TP 0.5656 0.591 0.3495 0.000
2 TOC -0.0623 0.188 0.0352 0.000
AK
3 AK 1.0000 1.000 1.0000 0.282
AN
4 AN 1.0000 1.000 1.0000 0.527
AP
5 AP 1.0000 1.000 1.0000 0.396
TRAD
6 TRAD 1.0000 1.000 1.0000 0.661
CROSSLOADINGS
soil plant AK AN AP TRAD
soil
1 BD 0.7281 0.31412 -0.303 0.557 0.537 -0.150
1 SM 0.6352 0.00428 -0.376 0.254 0.137 -0.605
plant
2 TN 0.2802 0.81677 0.169 0.494 0.304 -0.120
2 TP 0.0301 0.59115 -0.113 0.238 0.396 -0.220
2 TOC 0.0523 0.18755 0.111 0.104 -0.115 0.182
AK
3 AK -0.4930 0.06927 1.000 -0.384 -0.169 0.702
AN
4 AN 0.6058 0.53743 -0.384 1.000 0.555 -0.637
AP
5 AP 0.5093 0.48352 -0.169 0.555 1.000 -0.407
TRAD
6 TRAD -0.5322 -0.23591 0.702 -0.637 -0.407 1.000
INNER MODEL
$AK
Estimate Std. Error t value Pr(>|t|)
Intercept 4.27e-16 0.119 3.60e-15 1.00e+00
soil -5.43e-01 0.122 -4.43e+00 4.93e-05
plant 2.03e-01 0.122 1.66e+00 1.04e-01
$AN
Estimate Std. Error t value Pr(>|t|)
Intercept -9.87e-16 0.0963 -1.03e-14 1.00e+00
soil 5.04e-01 0.0993 5.08e+00 5.50e-06
plant 4.13e-01 0.0993 4.16e+00 1.21e-04
$AP
Estimate Std. Error t value Pr(>|t|)
Intercept -9.38e-16 0.109 -8.62e-15 1.000000
soil 4.15e-01 0.112 3.70e+00 0.000529
plant 3.81e-01 0.112 3.40e+00 0.001333
$TRAD
Estimate Std. Error t value Pr(>|t|)
Intercept 3.97e-16 0.0841 4.72e-15 1.00e+00
soil 2.54e-02 0.1189 2.13e-01 8.32e-01
plant -3.67e-02 0.1107 -3.31e-01 7.42e-01
AK 5.62e-01 0.1036 5.42e+00 1.90e-06
AN -3.57e-01 0.1300 -2.74e+00 8.51e-03
AP -1.09e-01 0.1103 -9.89e-01 3.28e-01
CORRELATIONS BETWEEN LVs
soil plant AK AN AP TRAD
soil 1.000 0.2461 -0.4930 0.606 0.509 -0.532
plant 0.246 1.0000 0.0693 0.537 0.483 -0.236
AK -0.493 0.0693 1.0000 -0.384 -0.169 0.702
AN 0.606 0.5374 -0.3837 1.000 0.555 -0.637
AP 0.509 0.4835 -0.1686 0.555 1.000 -0.406
TRAD -0.532 -0.2359 0.7018 -0.637 -0.406 1.000
SUMMARY INNER MODEL
Type R2 Block_Communality Mean_Redundancy AVE
soil Exogenous 0.000 0.467 0.000 0.467
plant Exogenous 0.000 0.351 0.000 0.351
AK Endogenous 0.282 1.000 0.282 1.000
AN Endogenous 0.527 1.000 0.527 1.000
AP Endogenous 0.396 1.000 0.396 1.000
TRAD Endogenous 0.661 1.000 0.661 1.000
0.4304
TOTAL EFFECTS
relationships direct indirect total
1 soil -> plant 0.0000 0.0000 0.000
2 soil -> AK -0.5430 0.0000 -0.543
3 soil -> AN 0.5040 0.0000 0.504
4 soil -> AP 0.4154 0.0000 0.415
5 soil -> TRAD 0.0254 -0.5300 -0.505
6 plant -> AK 0.2029 0.0000 0.203
7 plant -> AN 0.4134 0.0000 0.413
8 plant -> AP 0.3813 0.0000 0.381
9 plant -> TRAD -0.0367 -0.0751 -0.112
10 AK -> AN 0.0000 0.0000 0.000
11 AK -> AP 0.0000 0.0000 0.000
12 AK -> TRAD 0.5617 0.0000 0.562
13 AN -> AP 0.0000 0.0000 0.000
14 AN -> TRAD -0.3566 0.0000 -0.357
15 AP -> TRAD -0.1091 0.0000 -0.109
这个模型结果可谓是又长又复杂,我们其实只提取关键信息即可。但是我看网上的代码也是比较繁琐,所以自己写了个函数清洗了一下,直接筛选出我们需要的关键信息即可
#查看因果关系的路径图,详情 ?innerplot
innerplot(dat_pls, colpos = 'red', colneg = 'blue', show.values = TRUE, lcol = 'gray', box.lwd = 0)
# 提取路径系数
inner_model <- dat_pls$inner_model
# 提取路径系数,去掉 Intercept 信息
path_coefficients <- bind_rows(lapply(names(inner_model), function(name) {
data.frame(
Predictor = rownames(inner_model[[name]]),
Response = name,
Estimate = inner_model[[name]][, "Estimate"],
StdError = inner_model[[name]][, "Std. Error"],
tValue = inner_model[[name]][, "t value"],
pValue = inner_model[[name]][, "Pr(>|t|)"]
)
})) %>%
filter(Predictor != "Intercept") # 去掉 Intercept 信息
# 提取 R² 值
r_squared <- as.data.frame(dat_pls$inner_summary) %>%
select(Type, R2) %>%
filter(Type == "Endogenous") %>%
mutate(Response = rownames(.)) %>%
select(Response, R2)
# 将 R² 值与路径系数表合并
path_coefficients <- path_coefficients %>%
left_join(r_squared, by = "Response") %>%
arrange(Response, Predictor) %>%
mutate(Significance = ifelse(pValue < 0.001, "***",
ifelse(pValue < 0.01, "**",
ifelse(pValue < 0.05, "*", "ns"))))
# 打印整理后的路径系数和 R² 值表
print(path_coefficients)
我们提取的结果如下,这样就和piecewise的结果一样了,然后我们根据这个结果重新绘图美化即可
清洗后的绘图所需的数据
> print(path_coefficients)
Predictor Response Estimate StdError tValue pValue R2 Significance
1 plant AK 0.20288525 0.12243738 1.6570532 1.036488e-01 0.2817613 ns
2 soil AK -0.54296973 0.12243738 -4.4346730 4.928334e-05 0.2817613 ***
3 plant AN 0.41339579 0.09930778 4.1627734 1.210835e-04 0.5274940 ***
4 soil AN 0.50403181 0.09930778 5.0754513 5.498105e-06 0.5274940 ***
5 plant AP 0.38128650 0.11228529 3.3956942 1.333346e-03 0.3959311 **
6 soil AP 0.41544149 0.11228529 3.6998746 5.291138e-04 0.3959311 ***
7 AK TRAD 0.56165887 0.10361608 5.4205765 1.898703e-06 0.6608346 ***
8 AN TRAD -0.35661098 0.12996193 -2.7439650 8.511693e-03 0.6608346 **
9 AP TRAD -0.10908939 0.11031664 -0.9888752 3.276830e-01 0.6608346 ns
10 plant TRAD -0.03665453 0.11071724 -0.3310643 7.420369e-01 0.6608346 ns
11 soil TRAD 0.02536749 0.11888597 0.2133766 8.319369e-01 0.6608346 ns
最后,我们要查看模型的拟合优度以及直接、间接、总效应
#goodness-of-fit 值可以帮助评估模型优度
dat_pls$gof
#查看变量间的直接或间接影响状态
dat_pls$effects
模型拟合优度为0.43,是大于0.33的,所以还不错
dat_pls$gof
0.4303953
#查看变量间的直接或间接影响状态
dat_pls$effects
relationships direct indirect total
1 soil -> plant 0.00000000 0.00000000 0.0000000
2 soil -> AK -0.54296973 0.00000000 -0.5429697
3 soil -> AN 0.50403181 0.00000000 0.5040318
4 soil -> AP 0.41544149 0.00000000 0.4154415
5 soil -> TRAD 0.02536749 -0.53002730 -0.5046598
6 plant -> AK 0.20288525 0.00000000 0.2028853
7 plant -> AN 0.41339579 0.00000000 0.4133958
8 plant -> AP 0.38128650 0.00000000 0.3812865
9 plant -> TRAD -0.03665453 -0.07506349 -0.1117180
10 AK -> AN 0.00000000 0.00000000 0.0000000
11 AK -> AP 0.00000000 0.00000000 0.0000000
12 AK -> TRAD 0.56165887 0.00000000 0.5616589
13 AN -> AP 0.00000000 0.00000000 0.0000000
14 AN -> TRAD -0.35661098 0.00000000 -0.3566110
15 AP -> TRAD -0.10908939 0.00000000 -0.1090894
既然直接、间接、总效应都有了,那就顺便绘个图,绘图时记得修改第11行的目标因子以适配自己的数据
# 加载必要的包
library(tidyr)
library(dplyr)
library(ggplot2)
# 示例数据
dat_pls_effects <-dat_pls$effects
# 筛选出对 TRAD 的影响
trad_effects <- dat_pls_effects %>%
filter(grepl("-> TRAD$", relationships))
# 去掉 "-> TRAD" 只保留指标名称
trad_effects <- trad_effects %>%
mutate(relationships = gsub(" -> TRAD", "", relationships))
# 转换为长格式
trad_effects_long <- trad_effects %>%
pivot_longer(
cols = c(direct, indirect, total),
names_to = "effect_type",
values_to = "effect_size"
)
# 自定义 X 轴指标显示顺序(按需调整顺序)
custom_order <- c("soil", "plant", "AK", "AN", "AP") # 自定义顺序
trad_effects_long <- trad_effects_long %>%
mutate(relationships = factor(relationships, levels = custom_order))
# 绘制柱状图
ggplot(trad_effects_long, aes(x = relationships, y = effect_size, fill = effect_type)) +
geom_bar(stat = "identity", position = "dodge") +
labs(
x = "Predictor",
y = "Effect Size (Standardized)",
title = "Effects on TRAD: Direct, Indirect, and Total"
) +
scale_fill_manual(
values = c(
"direct" = "#008CB7",
"indirect" = "#A5C98C",
"total" = "#FFBD00"
),
labels = c(
"direct" = "Direct Effect",
"indirect" = "Indirect Effect",
"total" = "Total Effect"
)
) +
theme_bw() +
theme(
axis.text.x = element_text(angle = 0, hjust = 0.5),
legend.title = element_blank()
) +
geom_hline(yintercept = 0, linetype = "solid", color = "black", size = 0.5) + # 添加水平线
ylim(-1, 1) # 设置 Y 轴范围
完整版代码如下
# 加载必要的R包
library(plspm)
library(dplyr)
# 读取数据
data <- read.csv(file.choose(), header = TRUE)
# 指定变量关系
comdata <- list(
soil = c("BD", "SM"),
plant=c("TN","TP","TOC"),
AK = "AK",
AN = "AN",
AP = "AP",
TRAD ="TRAD"
)
# 定义潜变量列表(包含模型中所有潜变量)
latent_vars <- c("soil", "plant", "AK", "AN", "AP", "TRAD")
# 将路径关系用公式列表表示
model_formulas <- list(
TRAD = ~ soil + plant + AK + AN + AP,
AK = ~ soil + plant,
AN = ~ soil + plant,
AP = ~ soil + plant
)
# 创建一个全0矩阵,用于存储路径关系
dat_path <- matrix(0, nrow = length(latent_vars), ncol = length(latent_vars),
dimnames = list(latent_vars, latent_vars))
# 根据公式自动填充路径矩阵
for (end_var in names(model_formulas)) {
# 提取公式右侧的自变量
rhs_vars <- all.vars(model_formulas[[end_var]])
# 将对应位置标记为1
dat_path[end_var, rhs_vars] <- 1
}
# 确保路径矩阵是下三角矩阵
dat_path <- dat_path * lower.tri(dat_path)
# 输出路径矩阵
print(dat_path)
# 定义测量模式("A"或"P")
dat_modes <- c("A", 6)
# 运行PLS-PM模型
dat_pls <- plspm(data, dat_path, comdata, modes = dat_modes)
# 查看PLS-PM模型结果
summary(dat_pls)
#查看因果关系的路径图,详情 ?innerplot
innerplot(dat_pls, colpos = 'red', colneg = 'blue', show.values = TRUE, lcol = 'gray', box.lwd = 0)
# 提取路径系数
inner_model <- dat_pls$inner_model
# 提取路径系数,去掉 Intercept 信息
path_coefficients <- bind_rows(lapply(names(inner_model), function(name) {
data.frame(
Predictor = rownames(inner_model[[name]]),
Response = name,
Estimate = inner_model[[name]][, "Estimate"],
StdError = inner_model[[name]][, "Std. Error"],
tValue = inner_model[[name]][, "t value"],
pValue = inner_model[[name]][, "Pr(>|t|)"]
)
})) %>%
filter(Predictor != "Intercept") # 去掉 Intercept 信息
# 提取 R² 值
r_squared <- as.data.frame(dat_pls$inner_summary) %>%
select(Type, R2) %>%
filter(Type == "Endogenous") %>%
mutate(Response = rownames(.)) %>%
select(Response, R2)
# 将 R² 值与路径系数表合并
path_coefficients <- path_coefficients %>%
left_join(r_squared, by = "Response") %>%
arrange(Response, Predictor) %>%
mutate(Significance = ifelse(pValue < 0.001, "***",
ifelse(pValue < 0.01, "**",
ifelse(pValue < 0.05, "*", "ns"))))
# 打印整理后的路径系数和 R² 值表
print(path_coefficients)
#goodness-of-fit 值可以帮助评估模型优度
dat_pls$gof
#查看变量间的直接或间接影响状态
dat_pls$effects
# 加载必要的包
library(tidyr)
library(dplyr)
library(ggplot2)
# 示例数据
dat_pls_effects <-dat_pls$effects
# 筛选出对 TRAD 的影响
trad_effects <- dat_pls_effects %>%
filter(grepl("-> TRAD$", relationships))
# 去掉 "-> TRAD" 只保留指标名称
trad_effects <- trad_effects %>%
mutate(relationships = gsub(" -> TRAD", "", relationships))
# 转换为长格式
trad_effects_long <- trad_effects %>%
pivot_longer(
cols = c(direct, indirect, total),
names_to = "effect_type",
values_to = "effect_size"
)
# 自定义 X 轴指标显示顺序(按需调整顺序)
custom_order <- c("soil", "plant", "AK", "AN", "AP") # 自定义顺序
trad_effects_long <- trad_effects_long %>%
mutate(relationships = factor(relationships, levels = custom_order))
# 绘制柱状图
ggplot(trad_effects_long, aes(x = relationships, y = effect_size, fill = effect_type)) +
geom_bar(stat = "identity", position = "dodge") +
labs(
x = "Predictor",
y = "Effect Size (Standardized)",
title = "Effects on TRAD: Direct, Indirect, and Total"
) +
scale_fill_manual(
values = c(
"direct" = "#008CB7",
"indirect" = "#A5C98C",
"total" = "#FFBD00"
),
labels = c(
"direct" = "Direct Effect",
"indirect" = "Indirect Effect",
"total" = "Total Effect"
)
) +
theme_bw() +
theme(
axis.text.x = element_text(angle = 0, hjust = 0.5),
legend.title = element_blank()
) +
geom_hline(yintercept = 0, linetype = "solid", color = "black", size = 0.5) + # 添加水平线
ylim(-1, 1) # 设置 Y 轴范围
脚本与示例数据下载
下载的时候顺便点个赞呗~
关注公众号“生态R学社”回复“plspm”即可下载完整代码与示例数据