基于plspm包的偏最小二乘结构方程模型

学术 2024-12-23 18:01 云南

写在前面

最近写报告需要用到偏最小二乘法的结构方程模型，正好顺便更新一下。不过在做之前我们需要明确一下基于plspm包的SEM的优缺点：

1. plspm的优点有哪些？

答：数据不符合正态分布时、自变量存在多重共线性时、样本量较小时。均可以使用，也就是说，是数据就可以用，什么条件都可以不在乎。

2. plspm的缺点有哪些？

答：我们做piecewiseSEM的时候我们的P值必须大于0.05才算检验通过，而基于plspm的时候，我们只有一个GOF值来判断模型结果，一般大于0.33就可以。问题也就是出在了这里，plspm本质上是讨论多维变量间的关系（就是类似于我们的复合变量结构方程模型），所以如果我们像平时一样构建单个指标间的关系的时候，GOF值是没法计算的。此外，plspm也不支持变量之间的相互影响。不过cSEM包可以克服这些困难，我们后边陆续更新。

3. 我为什么更新这一期？

答：我发现现在网上大多数代码需要手动书写01矩阵来实现指标间相互关系的确定，这简直麻烦到爆炸。对于习惯使用了简单语言的piecewiseSEM的我们，我根据piecewise的语法规则将这一步自动化了，就是我们依旧按照piecewise的语法书写指标间的关系，然后代码自动生成这个01矩阵带入计算，这样就不费脑子了，又能省好多脑细胞。

下边这个图是提供了GOF值的例子：

https://doi.org/10.1016/j.soilbio.2024.109455

下边这个图是没有GOF值的例子：

https://doi.org/10.1016/j.soilbio.2023.109010

我为什么举这个例子呢？是因为如果非要使用plspm探讨单个因子的关系又没办法提供GOF值的时候，是不是可以和这篇GCB一样，干脆不提这一茬就好

准备数据集

数据如下，小编已上传，文末下载。

开始分析

我们先读取数据，然后开始指定变量间的关系

PS：这里的指定变量关系就和我们构建复合变量的结构方程模型一样，是将读取进来的数据中的指标赋值给我们规定的指标，例如我们的土壤指标由BD SM构成。

定义潜变量列表就是将构建好的变量按顺序列出，然后我们使用公式表达这些关系，参照的概念模型如下

# 加载必要的R包library(plspm)
# 读取数据data <- read.csv(file.choose(), header = TRUE)
# 指定变量关系comdata <- list(  soil = c("BD", "SM"),  plant=c("TN","TP","TOC"),  AK = "AK",   AN = "AN",   AP = "AP",  TRAD ="TRAD")
# 定义潜变量列表（包含模型中所有潜变量）latent_vars <- c("soil", "plant", "AK", "AN", "AP", "TRAD")
# 将路径关系用公式列表表示model_formulas <- list(  TRAD = ~ soil + plant + AK + AN + AP,  AK   = ~ soil + plant,  AN   = ~ soil + plant,  AP   = ~ soil + plant)

然后我们自动生成01矩阵

# 创建一个全0矩阵，用于存储路径关系dat_path <- matrix(0, nrow = length(latent_vars), ncol = length(latent_vars),                   dimnames = list(latent_vars, latent_vars))
# 根据公式自动填充路径矩阵for (end_var in names(model_formulas)) {  # 提取公式右侧的自变量  rhs_vars <- all.vars(model_formulas[[end_var]])  # 将对应位置标记为1  dat_path[end_var, rhs_vars] <- 1}
# 确保路径矩阵是下三角矩阵dat_path <- dat_path * lower.tri(dat_path)
# 输出路径矩阵print(dat_path)

生成的矩阵如下。这一步以前都是手动的，但是如果关系一复杂，谁还分得指标间的关系，所以我对这一步代码甚是满意。

> print(dat_path)      soil plant AK AN AP TRADsoil     0     0  0  0  0    0plant    0     0  0  0  0    0AK       1     1  0  0  0    0AN       1     1  0  0  0    0AP       1     1  0  0  0    0TRAD     1     1  1  1  1    0

然后我们运行plspm即可，这里的6是什么意思，就是变量一共有6个的意思，测量模式默认即可

# 定义测量模式（"A"或"P"）dat_modes <- c("A", 6)
# 运行PLS-PM模型dat_pls <- plspm(data, dat_path, comdata, modes = dat_modes)
# 查看PLS-PM模型结果summary(dat_pls)

我们查看模型结果

> summary(dat_pls)PARTIAL LEAST SQUARES PATH MODELING (PLS-PM) 
---------------------------------------------------------- MODEL SPECIFICATION 1   Number of Cases      54 2   Latent Variables     6 3   Manifest Variables   9 4   Scale of Data        Standardized Data 5   Non-Metric PLS       FALSE 6   Weighting Scheme     centroid 7   Tolerance Crit       1e-06 8   Max Num Iters        100 9   Convergence Iters    3 10  Bootstrapping        FALSE 11  Bootstrap samples    NULL 
---------------------------------------------------------- BLOCKS DEFINITION     Block         Type   Size   Mode1    soil    Exogenous      2      A2   plant    Exogenous      3      A3      AK   Endogenous      1      A4      AN   Endogenous      1      A5      AP   Endogenous      1      A6    TRAD   Endogenous      1      A
---------------------------------------------------------- BLOCKS UNIDIMENSIONALITY        Mode  MVs  C.alpha    DG.rho  eig.1st  eig.2ndsoil      A    2    0.000  1.04e-29     1.07    0.933plant     A    3    0.236  5.56e-01     1.41    1.013AK        A    1    1.000  1.00e+00     1.00    0.000AN        A    1    1.000  1.00e+00     1.00    0.000AP        A    1    1.000  1.00e+00     1.00    0.000TRAD      A    1    1.000  1.00e+00     1.00    0.000
---------------------------------------------------------- OUTER MODEL            weight  loading  communality  redundancysoil                                                 1 BD     0.7741    0.728       0.5301       0.000  1 SM     0.6870    0.635       0.4035       0.000plant                                                2 TN     0.8293    0.817       0.6671       0.000  2 TP     0.5656    0.591       0.3495       0.000  2 TOC   -0.0623    0.188       0.0352       0.000AK                                                   3 AK     1.0000    1.000       1.0000       0.282AN                                                   4 AN     1.0000    1.000       1.0000       0.527AP                                                   5 AP     1.0000    1.000       1.0000       0.396TRAD                                                 6 TRAD   1.0000    1.000       1.0000       0.661
---------------------------------------------------------- CROSSLOADINGS              soil     plant      AK      AN      AP    TRADsoil                                                         1 BD     0.7281   0.31412  -0.303   0.557   0.537  -0.150  1 SM     0.6352   0.00428  -0.376   0.254   0.137  -0.605plant                                                        2 TN     0.2802   0.81677   0.169   0.494   0.304  -0.120  2 TP     0.0301   0.59115  -0.113   0.238   0.396  -0.220  2 TOC    0.0523   0.18755   0.111   0.104  -0.115   0.182AK                                                           3 AK    -0.4930   0.06927   1.000  -0.384  -0.169   0.702AN                                                           4 AN     0.6058   0.53743  -0.384   1.000   0.555  -0.637AP                                                           5 AP     0.5093   0.48352  -0.169   0.555   1.000  -0.407TRAD                                                         6 TRAD  -0.5322  -0.23591   0.702  -0.637  -0.407   1.000
---------------------------------------------------------- INNER MODEL $AK             Estimate   Std. Error     t value   Pr(>|t|)Intercept    4.27e-16        0.119    3.60e-15   1.00e+00soil        -5.43e-01        0.122   -4.43e+00   4.93e-05plant        2.03e-01        0.122    1.66e+00   1.04e-01
$AN             Estimate   Std. Error     t value   Pr(>|t|)Intercept   -9.87e-16       0.0963   -1.03e-14   1.00e+00soil         5.04e-01       0.0993    5.08e+00   5.50e-06plant        4.13e-01       0.0993    4.16e+00   1.21e-04
$AP             Estimate   Std. Error     t value   Pr(>|t|)Intercept   -9.38e-16        0.109   -8.62e-15   1.000000soil         4.15e-01        0.112    3.70e+00   0.000529plant        3.81e-01        0.112    3.40e+00   0.001333
$TRAD             Estimate   Std. Error     t value   Pr(>|t|)Intercept    3.97e-16       0.0841    4.72e-15   1.00e+00soil         2.54e-02       0.1189    2.13e-01   8.32e-01plant       -3.67e-02       0.1107   -3.31e-01   7.42e-01AK           5.62e-01       0.1036    5.42e+00   1.90e-06AN          -3.57e-01       0.1300   -2.74e+00   8.51e-03AP          -1.09e-01       0.1103   -9.89e-01   3.28e-01
---------------------------------------------------------- CORRELATIONS BETWEEN LVs          soil    plant       AK      AN      AP    TRADsoil    1.000   0.2461  -0.4930   0.606   0.509  -0.532plant   0.246   1.0000   0.0693   0.537   0.483  -0.236AK     -0.493   0.0693   1.0000  -0.384  -0.169   0.702AN      0.606   0.5374  -0.3837   1.000   0.555  -0.637AP      0.509   0.4835  -0.1686   0.555   1.000  -0.406TRAD   -0.532  -0.2359   0.7018  -0.637  -0.406   1.000
---------------------------------------------------------- SUMMARY INNER MODEL              Type     R2  Block_Communality  Mean_Redundancy    AVEsoil    Exogenous  0.000              0.467            0.000  0.467plant   Exogenous  0.000              0.351            0.000  0.351AK     Endogenous  0.282              1.000            0.282  1.000AN     Endogenous  0.527              1.000            0.527  1.000AP     Endogenous  0.396              1.000            0.396  1.000TRAD   Endogenous  0.661              1.000            0.661  1.000
---------------------------------------------------------- GOODNESS-OF-FIT [1]  0.4304
---------------------------------------------------------- TOTAL EFFECTS     relationships   direct  indirect   total1   soil -> plant   0.0000    0.0000   0.0002      soil -> AK  -0.5430    0.0000  -0.5433      soil -> AN   0.5040    0.0000   0.5044      soil -> AP   0.4154    0.0000   0.4155    soil -> TRAD   0.0254   -0.5300  -0.5056     plant -> AK   0.2029    0.0000   0.2037     plant -> AN   0.4134    0.0000   0.4138     plant -> AP   0.3813    0.0000   0.3819   plant -> TRAD  -0.0367   -0.0751  -0.11210       AK -> AN   0.0000    0.0000   0.00011       AK -> AP   0.0000    0.0000   0.00012     AK -> TRAD   0.5617    0.0000   0.56213       AN -> AP   0.0000    0.0000   0.00014     AN -> TRAD  -0.3566    0.0000  -0.35715     AP -> TRAD  -0.1091    0.0000  -0.109>

这个模型结果可谓是又长又复杂，我们其实只提取关键信息即可。但是我看网上的代码也是比较繁琐，所以自己写了个函数清洗了一下，直接筛选出我们需要的关键信息即可

#查看因果关系的路径图，详情 ?innerplotinnerplot(dat_pls, colpos = 'red', colneg = 'blue', show.values = TRUE, lcol = 'gray', box.lwd = 0)

# 提取路径系数inner_model <- dat_pls$inner_model
# 提取路径系数，去掉 Intercept 信息path_coefficients <- bind_rows(lapply(names(inner_model), function(name) {  data.frame(    Predictor = rownames(inner_model[[name]]),    Response = name,    Estimate = inner_model[[name]][, "Estimate"],    StdError = inner_model[[name]][, "Std. Error"],    tValue = inner_model[[name]][, "t value"],    pValue = inner_model[[name]][, "Pr(>|t|)"]  )})) %>%  filter(Predictor != "Intercept")  # 去掉 Intercept 信息
# 提取 R² 值r_squared <- as.data.frame(dat_pls$inner_summary) %>%  select(Type, R2) %>%  filter(Type == "Endogenous") %>%  mutate(Response = rownames(.)) %>%  select(Response, R2)
# 将 R² 值与路径系数表合并path_coefficients <- path_coefficients %>%  left_join(r_squared, by = "Response") %>%  arrange(Response, Predictor) %>%  mutate(Significance = ifelse(pValue < 0.001, "***",                               ifelse(pValue < 0.01, "**",                                      ifelse(pValue < 0.05, "*", "ns"))))
# 打印整理后的路径系数和 R² 值表print(path_coefficients)

我们提取的结果如下，这样就和piecewise的结果一样了，然后我们根据这个结果重新绘图美化即可

清洗后的绘图所需的数据

> print(path_coefficients)   Predictor Response    Estimate   StdError     tValue       pValue        R2 Significance1      plant       AK  0.20288525 0.12243738  1.6570532 1.036488e-01 0.2817613           ns2       soil       AK -0.54296973 0.12243738 -4.4346730 4.928334e-05 0.2817613          ***3      plant       AN  0.41339579 0.09930778  4.1627734 1.210835e-04 0.5274940          ***4       soil       AN  0.50403181 0.09930778  5.0754513 5.498105e-06 0.5274940          ***5      plant       AP  0.38128650 0.11228529  3.3956942 1.333346e-03 0.3959311           **6       soil       AP  0.41544149 0.11228529  3.6998746 5.291138e-04 0.3959311          ***7         AK     TRAD  0.56165887 0.10361608  5.4205765 1.898703e-06 0.6608346          ***8         AN     TRAD -0.35661098 0.12996193 -2.7439650 8.511693e-03 0.6608346           **9         AP     TRAD -0.10908939 0.11031664 -0.9888752 3.276830e-01 0.6608346           ns10     plant     TRAD -0.03665453 0.11071724 -0.3310643 7.420369e-01 0.6608346           ns11      soil     TRAD  0.02536749 0.11888597  0.2133766 8.319369e-01 0.6608346           ns

最后，我们要查看模型的拟合优度以及直接、间接、总效应

#goodness-of-fit 值可以帮助评估模型优度dat_pls$gof
#查看变量间的直接或间接影响状态dat_pls$effects

模型拟合优度为0.43，是大于0.33的，所以还不错

> dat_pls$gof[1] 0.4303953> > #查看变量间的直接或间接影响状态> dat_pls$effects   relationships      direct    indirect      total1  soil -> plant  0.00000000  0.00000000  0.00000002     soil -> AK -0.54296973  0.00000000 -0.54296973     soil -> AN  0.50403181  0.00000000  0.50403184     soil -> AP  0.41544149  0.00000000  0.41544155   soil -> TRAD  0.02536749 -0.53002730 -0.50465986    plant -> AK  0.20288525  0.00000000  0.20288537    plant -> AN  0.41339579  0.00000000  0.41339588    plant -> AP  0.38128650  0.00000000  0.38128659  plant -> TRAD -0.03665453 -0.07506349 -0.111718010      AK -> AN  0.00000000  0.00000000  0.000000011      AK -> AP  0.00000000  0.00000000  0.000000012    AK -> TRAD  0.56165887  0.00000000  0.561658913      AN -> AP  0.00000000  0.00000000  0.000000014    AN -> TRAD -0.35661098  0.00000000 -0.356611015    AP -> TRAD -0.10908939  0.00000000 -0.1090894

既然直接、间接、总效应都有了，那就顺便绘个图，绘图时记得修改第11行的目标因子以适配自己的数据

# 加载必要的包library(tidyr)library(dplyr)library(ggplot2)
# 示例数据dat_pls_effects <-dat_pls$effects
# 筛选出对 TRAD 的影响trad_effects <- dat_pls_effects %>%  filter(grepl("-> TRAD$", relationships))
# 去掉 "-> TRAD" 只保留指标名称trad_effects <- trad_effects %>%  mutate(relationships = gsub(" -> TRAD", "", relationships))
# 转换为长格式trad_effects_long <- trad_effects %>%  pivot_longer(    cols = c(direct, indirect, total),    names_to = "effect_type",    values_to = "effect_size"  )
# 自定义 X 轴指标显示顺序（按需调整顺序）custom_order <- c("soil", "plant", "AK", "AN", "AP")  # 自定义顺序
trad_effects_long <- trad_effects_long %>%  mutate(relationships = factor(relationships, levels = custom_order))
# 绘制柱状图ggplot(trad_effects_long, aes(x = relationships, y = effect_size, fill = effect_type)) +  geom_bar(stat = "identity", position = "dodge") +  labs(    x = "Predictor",    y = "Effect Size (Standardized)",    title = "Effects on TRAD: Direct, Indirect, and Total"  ) +  scale_fill_manual(    values = c(      "direct" = "#008CB7",      "indirect" = "#A5C98C",      "total" = "#FFBD00"    ),    labels = c(      "direct" = "Direct Effect",      "indirect" = "Indirect Effect",      "total" = "Total Effect"    )  ) +  theme_bw() +  theme(    axis.text.x = element_text(angle = 0, hjust = 0.5),    legend.title = element_blank()  ) +  geom_hline(yintercept = 0, linetype = "solid", color = "black", size = 0.5) +  # 添加水平线  ylim(-1, 1)  # 设置 Y 轴范围

完整版代码如下

# 加载必要的R包library(plspm)library(dplyr)
# 读取数据data <- read.csv(file.choose(), header = TRUE)
# 指定变量关系comdata <- list(  soil = c("BD", "SM"),  plant=c("TN","TP","TOC"),  AK = "AK",   AN = "AN",   AP = "AP",  TRAD ="TRAD")
# 定义潜变量列表（包含模型中所有潜变量）latent_vars <- c("soil", "plant", "AK", "AN", "AP", "TRAD")
# 将路径关系用公式列表表示model_formulas <- list(  TRAD = ~ soil + plant + AK + AN + AP,  AK   = ~ soil + plant,  AN   = ~ soil + plant,  AP   = ~ soil + plant)
# 创建一个全0矩阵，用于存储路径关系dat_path <- matrix(0, nrow = length(latent_vars), ncol = length(latent_vars),                   dimnames = list(latent_vars, latent_vars))
# 根据公式自动填充路径矩阵for (end_var in names(model_formulas)) {  # 提取公式右侧的自变量  rhs_vars <- all.vars(model_formulas[[end_var]])  # 将对应位置标记为1  dat_path[end_var, rhs_vars] <- 1}
# 确保路径矩阵是下三角矩阵dat_path <- dat_path * lower.tri(dat_path)
# 输出路径矩阵print(dat_path)
# 定义测量模式（"A"或"P"）dat_modes <- c("A", 6)
# 运行PLS-PM模型dat_pls <- plspm(data, dat_path, comdata, modes = dat_modes)
# 查看PLS-PM模型结果summary(dat_pls)
#查看因果关系的路径图，详情 ?innerplotinnerplot(dat_pls, colpos = 'red', colneg = 'blue', show.values = TRUE, lcol = 'gray', box.lwd = 0)
# 提取路径系数inner_model <- dat_pls$inner_model
# 提取路径系数，去掉 Intercept 信息path_coefficients <- bind_rows(lapply(names(inner_model), function(name) {  data.frame(    Predictor = rownames(inner_model[[name]]),    Response = name,    Estimate = inner_model[[name]][, "Estimate"],    StdError = inner_model[[name]][, "Std. Error"],    tValue = inner_model[[name]][, "t value"],    pValue = inner_model[[name]][, "Pr(>|t|)"]  )})) %>%  filter(Predictor != "Intercept")  # 去掉 Intercept 信息
# 提取 R² 值r_squared <- as.data.frame(dat_pls$inner_summary) %>%  select(Type, R2) %>%  filter(Type == "Endogenous") %>%  mutate(Response = rownames(.)) %>%  select(Response, R2)
# 将 R² 值与路径系数表合并path_coefficients <- path_coefficients %>%  left_join(r_squared, by = "Response") %>%  arrange(Response, Predictor) %>%  mutate(Significance = ifelse(pValue < 0.001, "***",                               ifelse(pValue < 0.01, "**",                                      ifelse(pValue < 0.05, "*", "ns"))))
# 打印整理后的路径系数和 R² 值表print(path_coefficients)

#goodness-of-fit 值可以帮助评估模型优度dat_pls$gof
#查看变量间的直接或间接影响状态dat_pls$effects

# 加载必要的包library(tidyr)library(dplyr)library(ggplot2)
# 示例数据dat_pls_effects <-dat_pls$effects
# 筛选出对 TRAD 的影响trad_effects <- dat_pls_effects %>%  filter(grepl("-> TRAD$", relationships))
# 去掉 "-> TRAD" 只保留指标名称trad_effects <- trad_effects %>%  mutate(relationships = gsub(" -> TRAD", "", relationships))
# 转换为长格式trad_effects_long <- trad_effects %>%  pivot_longer(    cols = c(direct, indirect, total),    names_to = "effect_type",    values_to = "effect_size"  )
# 自定义 X 轴指标显示顺序（按需调整顺序）custom_order <- c("soil", "plant", "AK", "AN", "AP")  # 自定义顺序
trad_effects_long <- trad_effects_long %>%  mutate(relationships = factor(relationships, levels = custom_order))
# 绘制柱状图ggplot(trad_effects_long, aes(x = relationships, y = effect_size, fill = effect_type)) +  geom_bar(stat = "identity", position = "dodge") +  labs(    x = "Predictor",    y = "Effect Size (Standardized)",    title = "Effects on TRAD: Direct, Indirect, and Total"  ) +  scale_fill_manual(    values = c(      "direct" = "#008CB7",      "indirect" = "#A5C98C",      "total" = "#FFBD00"    ),    labels = c(      "direct" = "Direct Effect",      "indirect" = "Indirect Effect",      "total" = "Total Effect"    )  ) +  theme_bw() +  theme(    axis.text.x = element_text(angle = 0, hjust = 0.5),    legend.title = element_blank()  ) +  geom_hline(yintercept = 0, linetype = "solid", color = "black", size = 0.5) +  # 添加水平线  ylim(-1, 1)  # 设置 Y 轴范围

脚本与示例数据下载

下载的时候顺便点个赞呗~

关注公众号“生态R学社”回复“plspm”即可下载完整代码与示例数据

走天涯徐小洋地理数据科学

一个爱生活的地理土博，分享GIS、遥感、空间分析、R语言、景观生态等地理数据科学实操教程、经典文献、数据资源