一、写在前面
最近有粉丝提问,收到了如下的审稿人意见:
findMarker
通过Wilcox
获得的差异基因虽然考虑到了不同组别细胞数量的不同,但是未能考虑到每组样本数量的不同。因此作者希望纳入样本水平的pseudo-bulk
分析能够有助于确认两种条件下的差异基因。pseudo-bulk
计算差异基因,岂不是无法考虑到不同组别中样本数量的差异?另外这有些吹毛求疵,在Seurat V5
之前,作者甚至没有在包里集成pseudo-bulk
的函数与算法(当然也可以自己提取矩阵计算)。难道能说作者发表的这几篇Cell和Nature给大家推荐的流程不好吗:Hao, Hao, et al., Cell 2021 [Seurat v4]
Stuart, Butler, et al., Cell 2019 [Seurat v3]
Butler, et al., Nat Biotechnol 2018 [Seurat v2] Satija, Farrell, et al., Nat Biotechnol 2015 [Seurat v1]
scRNA-seq
比Bulk RNA-Seq
更加的稀疏,将前者模拟为后者参与差异计算,其实也没那么科学。当然,审稿人的观点也不是全无道理,若能够通过不同的算法得到相同的差异基因结果,的确有较高的说服力。二、pseudo-bulk
差异分析走起
测试文件可以自行下载:
链接:https://pan.baidu.com/s/12dEGTJy4DnQ7gH2mbxCf-A?pwd=7qfm
提取码:7qfm
2.1 数据载入
# 加载R包
library(Seurat)
## 载入需要的程序包:SeuratObject
## 载入需要的程序包:sp
##
## 载入程序包:'SeuratObject'
## The following objects are masked from 'package:base':
##
## intersect, t
# 读取数据:
scRNA <- readRDS('test_data/T1D_scRNA.rds')
# 这个数据包含24个样本:
unique(scRNA$sample)
## [1] "D_503" "H_120" "H_630" "H_3060" "D_609" "H_727" "H_4579" "D_504"
## [9] "H_3128" "H_7108" "D_502" "D_497" "D_506" "H_409" "H_6625" "D_610"
## [17] "D_501" "D_500" "H_4119" "H_1334" "D_498" "H_2928" "D_644" "D_505"
# 包含两个组别的数据:
DimPlot(scRNA,split.by = 'Group')
2.2 差异计算
(1) pseudo-bulk差异计算
### 生成拟bulk 数据 ###
bulk <- AggregateExpression(scRNA, return.seurat = T, slot = "counts", assays = "RNA",
group.by = c("cell_type", "sample", "Group")# 分别填写细胞类型、样本变量、分组变量的slot名称
)
## Names of identity class contain underscores ('_'), replacing with dashes ('-')
## Centering and scaling data matrix
##
## This message is displayed once every 8 hours.
# 生成的是一个新的Seurat对象
bulk
## An object of class Seurat
## 41056 features across 345 samples within 1 assay
## Active assay: RNA (41056 features, 0 variable features)
## 3 layers present: counts, data, scale.data
我们可以像普通scRNA-seq
的Seurat
对象一样,利用FindMarkers()
进行差异分析,我们这里用celltype10
做演示。
# 取出celltype10对应的对象:
ct10.bulk <- subset(bulk, cell_type == "celltype10")
# 改变默认分类变量:
Idents(ct10.bulk) <- "Group"
# 下面的额计算依赖DESeq2,做过Bulk RNA-Seq的同学都知道:
if(!require(DESeq2))BiocManager::install('DESeq2')
## 载入需要的程序包:DESeq2
## 载入需要的程序包:S4Vectors
## 载入需要的程序包:stats4
## 载入需要的程序包:BiocGenerics
##
## 载入程序包:'BiocGenerics'
## The following object is masked from 'package:SeuratObject':
##
## intersect
## The following objects are masked from 'package:stats':
##
## IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
##
## anyDuplicated, aperm, append, as.data.frame, basename, cbind,
## colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find,
## get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply,
## match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
## Position, rank, rbind, Reduce, rownames, sapply, setdiff, table,
## tapply, union, unique, unsplit, which.max, which.min
##
## 载入程序包:'S4Vectors'
## The following object is masked from 'package:utils':
##
## findMatches
## The following objects are masked from 'package:base':
##
## expand.grid, I, unname
## 载入需要的程序包:IRanges
##
## 载入程序包:'IRanges'
## The following object is masked from 'package:sp':
##
## %over%
## The following object is masked from 'package:grDevices':
##
## windows
## 载入需要的程序包:GenomicRanges
## 载入需要的程序包:GenomeInfoDb
## 载入需要的程序包:SummarizedExperiment
## 载入需要的程序包:MatrixGenerics
## 载入需要的程序包:matrixStats
##
## 载入程序包:'MatrixGenerics'
## The following objects are masked from 'package:matrixStats':
##
## colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse,
## colCounts, colCummaxs, colCummins, colCumprods, colCumsums,
## colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs,
## colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats,
## colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds,
## colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,
## colWeightedMeans, colWeightedMedians, colWeightedSds,
## colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,
## rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,
## rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps,
## rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,
## rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,
## rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,
## rowWeightedMads, rowWeightedMeans, rowWeightedMedians,
## rowWeightedSds, rowWeightedVars
## 载入需要的程序包:Biobase
## Welcome to Bioconductor
##
## Vignettes contain introductory material; view with
## 'browseVignettes()'. To cite Bioconductor, see
## 'citation("Biobase")', and for packages 'citation("pkgname")'.
##
## 载入程序包:'Biobase'
## The following object is masked from 'package:MatrixGenerics':
##
## rowMedians
## The following objects are masked from 'package:matrixStats':
##
## anyMissing, rowMedians
##
## 载入程序包:'SummarizedExperiment'
## The following object is masked from 'package:Seurat':
##
## Assays
## The following object is masked from 'package:SeuratObject':
##
## Assays
# 差异计算:
bulk_deg <- FindMarkers(ct10.bulk, ident.1 = "D", ident.2 = "H", # 这样算出来的Fold Change就是D/H
slot = "counts", test.use = "DESeq2",# 这里可以选择其它算法
verbose = F# 关闭进度提示
)
## converting counts to integer mode
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
head(bulk_deg)# 看一下差异列表
## p_val avg_log2FC pct.1 pct.2 p_val_adj
## ENSG00000047346 1.900955e-05 -1.1201537 0.917 1.000 0.780456
## ENSG00000168685 7.606182e-05 -0.5817112 1.000 1.000 1.000000
## ENSG00000131759 9.696591e-05 -2.1668759 0.667 1.000 1.000000
## ENSG00000166750 2.136829e-04 -1.5473545 0.750 0.917 1.000000
## ENSG00000163947 3.127113e-04 -0.9136378 0.917 1.000 1.000000
## ENSG00000239713 4.090397e-04 -2.2337008 0.250 0.917 1.000000
如何写循环计算所有细胞类型的差异基因,就留在这里当习题啦。
(2)细胞水平的差异计算
# 整理分组变量:
scRNA$CT_Group <- paste(scRNA$cell_type,scRNA$Group,sep = '_')
# 查看新的分组变量:
unique(scRNA$CT_Group)
## [1] "celltype12_D" "celltype3_H" "celltype13_H" "celltype6_H" "celltype0_D"
## [6] "celltype12_H" "celltype4_H" "celltype11_D" "celltype14_H" "celltype9_H"
## [11] "celltype11_H" "celltype2_H" "celltype0_H" "celltype7_H" "celltype14_D"
## [16] "celltype1_D" "celltype4_D" "celltype1_H" "celltype8_H" "celltype3_D"
## [21] "celltype13_D" "celltype8_D" "celltype7_D" "celltype5_H" "celltype6_D"
## [26] "celltype15_H" "celltype2_D" "celltype5_D" "celltype10_H" "celltype9_D"
## [31] "celltype10_D" "celltype15_D"
# 差异计算:
cell_deg <- FindMarkers(scRNA,ident.1 = 'celltype10_D',ident.2 = 'celltype10_H' ,group.by = 'CT_Group')# 同样得到的是celltype10在D组 vs H组的结果
## For a (much!) faster implementation of the Wilcoxon Rank Sum Test,
## (default method for FindMarkers) please install the presto package
## --------------------------------------------
## install.packages('devtools')
## devtools::install_github('immunogenomics/presto')
## --------------------------------------------
## After installation of presto, Seurat will automatically use the more
## efficient implementation (no further action necessary).
## This message will be shown once per session
(3)两种算法的对比
library(dplyr)
##
## 载入程序包:'dplyr'
## The following object is masked from 'package:Biobase':
##
## combine
## The following object is masked from 'package:matrixStats':
##
## count
## The following objects are masked from 'package:GenomicRanges':
##
## intersect, setdiff, union
## The following object is masked from 'package:GenomeInfoDb':
##
## intersect
## The following objects are masked from 'package:IRanges':
##
## collapse, desc, intersect, setdiff, slice, union
## The following objects are masked from 'package:S4Vectors':
##
## first, intersect, rename, setdiff, setequal, union
## The following objects are masked from 'package:BiocGenerics':
##
## combine, intersect, setdiff, union
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# 先看一下两种算法的显著差异基因数量
bulk_sig <- filter(bulk_deg,p_val < 0.05)
nrow(bulk_sig)
## [1] 308
cell_sig <- filter(cell_deg,p_val < 0.05)
nrow(cell_sig)
## [1] 1494
可以看出,pseudo-bulk得到的差异基因数量要少很多,画一个韦恩图看看二者交集
if(!require(VennDiagram))install.packages("VennDiagram")
## 载入需要的程序包:VennDiagram
## 载入需要的程序包:grid
## 载入需要的程序包:futile.logger
venn.plot <- venn.diagram(
x = list(Bulk = rownames(bulk_sig), Cell = rownames(cell_sig)
),
category.names = c("Bulk DEG", "Single-Cell DEG"),
filename = NULL,
output = TRUE,
main = "Venn Diagram of Significant Genes"
)
grid.draw(venn.plot)
可以看出包含关系还是挺明显的,那我们再用交集基因的avg_log2FC
做一个线性回归看看两次差异分析的相关性如何:
# 获得两次差异分析共同出现的基因:
inter_gene <- intersect(rownames(bulk_sig),rownames(cell_sig))
# 取出avg_log2FC整理为数据框
data4plot <- data.frame(Bulk = bulk_sig[inter_gene,'avg_log2FC'],
Cell = cell_sig[inter_gene,'avg_log2FC'] )
# 线性回归分析:
lm.model <- lm(Bulk ~ Cell,data = data4plot)
summary(lm.model)#看一下统计学参数
##
## Call:
## lm(formula = Bulk ~ Cell, data = data4plot)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.01613 -0.24964 -0.04723 0.17148 2.19351
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.17249 0.02757 -6.257 1.58e-09 ***
## Cell 0.70081 0.01959 35.767 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4107 on 263 degrees of freedom
## Multiple R-squared: 0.8295, Adjusted R-squared: 0.8288
## F-statistic: 1279 on 1 and 263 DF, p-value: < 2.2e-16
mypara <- coefficients(lm.model)#得到截距和斜率
a <- mypara[2]#斜率
b <- mypara[1]#截距
a <- round(a,2)#取两位有效数字
b <- round(b,2)
library(ggplot2)
library(ggpubr)
##
## 载入程序包:'ggpubr'
## The following object is masked from 'package:VennDiagram':
##
## rotate
# 来个散点图吧~
lmplot <- ggplot( data4plot,aes(x=Bulk, y=Cell))+
geom_point(color="black")+
stat_smooth(method="lm",se=TRUE)+stat_cor(data=data4plot, method = "pearson")+#加上置信区间、R值、P值
ggtitle(label = paste(": y = ", a, " * x + ", b, sep = ""))+geom_rug()+#加上线性回归方程
labs(x='Bulk DEG',
y= 'single-cell DEG')
lmplot
## `geom_smooth()` using formula = 'y ~ x'
R=0.91,那么R^2就是0.83,可以看出二者的相关性还是不错的,就看能不能过审稿人这关啦。
大家有什么新的想法,欢迎在评论区留言~
环境信息
sessionInfo()
## R version 4.4.1 (2024-06-14 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 22631)
##
## Matrix products: default
##
##
## locale:
## [1] LC_COLLATE=Chinese (Simplified)_China.utf8
## [2] LC_CTYPE=Chinese (Simplified)_China.utf8
## [3] LC_MONETARY=Chinese (Simplified)_China.utf8
## [4] LC_NUMERIC=C
## [5] LC_TIME=Chinese (Simplified)_China.utf8
##
## time zone: Asia/Shanghai
## tzcode source: internal
##
## attached base packages:
## [1] grid stats4 stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] ggpubr_0.6.0 ggplot2_3.5.1
## [3] VennDiagram_1.7.3 futile.logger_1.4.3
## [5] dplyr_1.1.4 DESeq2_1.44.0
## [7] SummarizedExperiment_1.34.0 Biobase_2.64.0
## [9] MatrixGenerics_1.16.0 matrixStats_1.4.1
## [11] GenomicRanges_1.56.1 GenomeInfoDb_1.40.1
## [13] IRanges_2.38.1 S4Vectors_0.42.1
## [15] BiocGenerics_0.50.0 Seurat_5.1.0
## [17] SeuratObject_5.0.2 sp_2.1-4
##
## loaded via a namespace (and not attached):
## [1] RcppAnnoy_0.0.22 splines_4.4.1 later_1.3.2
## [4] tibble_3.2.1 polyclip_1.10-7 fastDummies_1.7.4
## [7] lifecycle_1.0.4 rstatix_0.7.2 globals_0.16.3
## [10] lattice_0.22-6 MASS_7.3-60.2 backports_1.5.0
## [13] magrittr_2.0.3 plotly_4.10.4 sass_0.4.9
## [16] rmarkdown_2.28 jquerylib_0.1.4 yaml_2.3.10
## [19] httpuv_1.6.15 sctransform_0.4.1 spam_2.10-0
## [22] spatstat.sparse_3.1-0 reticulate_1.39.0 cowplot_1.1.3
## [25] pbapply_1.7-2 RColorBrewer_1.1-3 abind_1.4-5
## [28] zlibbioc_1.50.0 Rtsne_0.17 purrr_1.0.2
## [31] GenomeInfoDbData_1.2.12 ggrepel_0.9.6 irlba_2.3.5.1
## [34] listenv_0.9.1 spatstat.utils_3.1-0 openintro_2.5.0
## [37] airports_0.1.0 goftest_1.2-3 RSpectra_0.16-2
## [40] spatstat.random_3.3-1 fitdistrplus_1.2-1 parallelly_1.38.0
## [43] leiden_0.4.3.1 codetools_0.2-20 DelayedArray_0.30.1
## [46] tidyselect_1.2.1 UCSC.utils_1.0.0 farver_2.1.2
## [49] spatstat.explore_3.3-2 jsonlite_1.8.8 progressr_0.14.0
## [52] ggridges_0.5.6 survival_3.6-4 tools_4.4.1
## [55] ica_1.0-3 Rcpp_1.0.13 glue_1.7.0
## [58] gridExtra_2.3 SparseArray_1.4.8 mgcv_1.9-1
## [61] xfun_0.47 withr_3.0.1 formatR_1.14
## [64] fastmap_1.2.0 fansi_1.0.6 digest_0.6.37
## [67] R6_2.5.1 mime_0.12 colorspace_2.1-1
## [70] scattermore_1.2 tensor_1.5 spatstat.data_3.1-2
## [73] utf8_1.2.4 tidyr_1.3.1 generics_0.1.3
## [76] data.table_1.16.0 usdata_0.3.1 httr_1.4.7
## [79] htmlwidgets_1.6.4 S4Arrays_1.4.1 uwot_0.2.2
## [82] pkgconfig_2.0.3 gtable_0.3.5 lmtest_0.9-40
## [85] XVector_0.44.0 htmltools_0.5.8.1 carData_3.0-5
## [88] dotCall64_1.1-1 scales_1.3.0 png_0.1-8
## [91] spatstat.univar_3.0-1 knitr_1.48 lambda.r_1.2.4
## [94] rstudioapi_0.16.0 tzdb_0.4.0 reshape2_1.4.4
## [97] nlme_3.1-164 cachem_1.1.0 zoo_1.8-12
## [100] stringr_1.5.1 KernSmooth_2.23-24 parallel_4.4.1
## [103] miniUI_0.1.1.1 pillar_1.9.0 vctrs_0.6.5
## [106] RANN_2.6.2 promises_1.3.0 car_3.1-2
## [109] xtable_1.8-4 cluster_2.1.6 evaluate_0.24.0
## [112] readr_2.1.5 cli_3.6.3 locfit_1.5-9.10
## [115] compiler_4.4.1 futile.options_1.0.1 rlang_1.1.4
## [118] crayon_1.5.3 future.apply_1.11.2 ggsignif_0.6.4
## [121] labeling_0.4.3 plyr_1.8.9 stringi_1.8.4
## [124] viridisLite_0.4.2 deldir_2.0-4 BiocParallel_1.38.0
## [127] munsell_0.5.1 lazyeval_0.2.2 spatstat.geom_3.3-2
## [130] Matrix_1.7-0 RcppHNSW_0.6.0 hms_1.1.3
## [133] patchwork_1.2.0 future_1.34.0 shiny_1.9.1
## [136] highr_0.11 ROCR_1.0-11 broom_1.0.6
## [139] igraph_2.0.3 bslib_0.8.0 cherryblossom_0.1.0
欢迎致谢
如果以上内容对你有帮助,欢迎在文章的Acknowledgement中加上这一段,联系客服微信可以发放奖励:
Since Biomamba and his wechat public account team produce bioinformatics tutorials and share code with annotation, we thank Biomamba for their guidance in bioinformatics and data analysis for the current study.
欢迎在发文/毕业时向我们分享你的喜悦~
已致谢文章:
13分+文章利用scRNA-Seq揭示地铁细颗粒物引起肺部炎症的分子机制
IF14.3| scRNA-seq+脂质组多组学分析揭示宫内生长受限导致肝损伤的性别差异
鼻咽癌的Bulk RNA-Seq与scRNA-Seq联合分析
除了铁死亡,还有铜死亡?!
银屑病和脂肪肝病中共同病理和免疫特征
认真学完下面的内容,3~10分的SCI理应是囊中之物(当然也不要真的去水文章)
单细胞系列教程目录
如何联系我们