一边学习,一边总结,一边分享!
由于微信改版,一直有同学反映。存在长时间接收不到公众号的推文。那么请跟随以下步骤,将小杜的生信筆記设置为星标,不错过每一条推文教程。
欢迎关注《小杜的生信笔记》!!
如何加入社群
小杜的生信笔记
,仅有微信社群。
1. 微信群:付费社群。添加小杜好友,加友请知:加友须知!!,加入社群请查看小杜生笔记付费加友入群声明。
2. 小杜个人微信:若你有好的教程或想法,可添加小杜个人微信。值得注意的是,小杜个人微信并不支持免费咨询长时间咨询,但支持小问题2-3个免费咨询。
小杜微信:
知识星球:
2022年教程总汇
2023年教程总汇
引言
WGNCA教程我们自发布以来,一直在做更新,前面也更新几个版本。教程是出于不停的更新的,更新的原因:1、前面版本存在一定的bug;2、前期版本在运行时某些参数未修改时会报错;3、对小白不是很友好。
基于以上的原因,我们再次做相关代码修改,以及进行详细的注释。
注:
若是前面购买过WGNCA教程一、教程二、教程四、教程五的同学,可以直接在前面的教程中留言,同样可以免费获得最新的WGNCA教程六的代码(直接在前期付费教程中留言即可,注:由于人工回复,回复时间可能会有延迟)。
此外,此代码基于自己的理解,可能会存在错误的地方,若是大家发现,请大家提示,及修改。
前期教程存在问题
表达量表头(A1,A1)需要按照我们代码内容修改
代码过滤数据需要自己进行替换 注释不够详细
此教程所有结果图形【样例】
<<<>>>
WGNCA Code
##'@#WGCNA分析教程(建议在在服务器中运行)
##'
##'@WGCNA教五【更新版本】
##'@更新时间:2024.09.19
##'@注意:我们的教程也许只是WGCNA分析中一种方法,你也可以查询其他的方法进行分析
##'@此外,WGNCA教程中几个参数需要结合自己的需求进行修改
##'@
##'@若数据量较大,不建议在本地运行,建议在服务器中运行
> sessionInfo()
R version 4.4.0 (2024-04-24 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 11 x64 (build 22631)
Matrix products: default
locale:
[1] LC_COLLATE=Chinese (Simplified)_China.utf8
[2] LC_CTYPE=Chinese (Simplified)_China.utf8
[3] LC_MONETARY=Chinese (Simplified)_China.utf8
[4] LC_NUMERIC=C
[5] LC_TIME=Chinese (Simplified)_China.utf8
time zone: Asia/Shanghai
tzcode source: internal
attached base packages:
[1] stats graphics grDevices utils datasets methods
[7] base
other attached packages:
[1] gplots_3.1.3.1 netET_0.1.0
[3] igraph_2.0.3 linkET_0.0.7.4
[5] tidygraph_1.3.1 ggraph_2.2.1
[7] ggplot2_3.5.1 WGCNA_1.72-5
[9] fastcluster_1.2.6 dynamicTreeCut_1.63-1
loaded via a namespace (and not attached):
[1] bitops_1.0-8 DBI_1.2.3
[3] gridExtra_2.3 permute_0.9-7
[5] rlang_1.1.4 magrittr_2.0.3
[7] clue_0.3-65 GetoptLong_1.0.5
[9] matrixStats_1.3.0 compiler_4.4.0
[11] RSQLite_2.3.7 mgcv_1.9-1
[13] systemfonts_1.1.0 png_0.1-8
[15] vctrs_0.6.5 reshape2_1.4.4
[17] stringr_1.5.1 pkgconfig_2.0.3
[19] shape_1.4.6.1 crayon_1.5.3
[21] fastmap_1.2.0 backports_1.5.0
[23] XVector_0.44.0 labeling_0.4.3
[25] caTools_1.18.2 utf8_1.2.4
[27] rmarkdown_2.28 UCSC.utils_1.0.0
[29] preprocessCore_1.66.0 ragg_1.3.2
[31] purrr_1.0.2 bit_4.0.5
[33] xfun_0.47 zlibbioc_1.50.0
[35] cachem_1.1.0 GenomeInfoDb_1.40.1
[37] jsonlite_1.8.8 blob_1.2.4
[39] tweenr_2.0.3 parallel_4.4.0
[41] cluster_2.1.6 R6_2.5.1
[43] stringi_1.8.4 RColorBrewer_1.1-3
[45] rpart_4.1.23 Rcpp_1.0.13
[47] iterators_1.0.14 knitr_1.48
[49] base64enc_0.1-3 IRanges_2.38.1
[51] Matrix_1.7-0 splines_4.4.0
[53] nnet_7.3-19 tidyselect_1.2.1
[55] rstudioapi_0.16.0 viridis_0.6.5
[57] vegan_2.6-6.1 doParallel_1.0.17
[59] codetools_0.2-20 lattice_0.22-6
[61] tibble_3.2.1 plyr_1.8.9
[63] Biobase_2.64.0 withr_3.0.1
[65] KEGGREST_1.44.1 evaluate_0.24.0
[67] foreign_0.8-87 survival_3.7-0
[69] polyclip_1.10-7 circlize_0.4.16
[71] Biostrings_2.72.1 pillar_1.9.0
[73] KernSmooth_2.23-24 checkmate_2.3.2
[75] foreach_1.5.2 stats4_4.4.0
[77] generics_0.1.3 S4Vectors_0.42.1
[79] munsell_0.5.1 scales_1.3.0
[81] gtools_3.9.5 glue_1.6.2
[83] Hmisc_5.1-3 tools_4.4.0
[85] data.table_1.15.4 ggvenn_0.1.10
[87] graphlayouts_1.1.1 cowplot_1.1.3
[89] grid_4.4.0 impute_1.78.0
[91] tidyr_1.3.1 AnnotationDbi_1.66.0
[93] colorspace_2.1-1 nlme_3.1-166
[95] GenomeInfoDbData_1.2.12 ggforce_0.4.2
[97] htmlTable_2.4.3 Formula_1.2-5
[99] cli_3.6.3 textshaping_0.4.0
[101] fansi_1.0.6 viridisLite_0.4.2
[103] ComplexHeatmap_2.20.0 dplyr_1.1.4
[105] gtable_0.3.5 digest_0.6.37
[107] BiocGenerics_0.50.0 ggrepel_0.9.5
[109] farver_2.1.2 rjson_0.2.22
[111] htmlwidgets_1.6.4 memoise_2.0.1
[113] htmltools_0.5.8.1 lifecycle_1.0.4
[115] httr_1.4.7 GlobalOptions_0.1.2
[117] GO.db_3.19.1 MASS_7.3-61
[119] bit64_4.0.5
1. 安装及加载所需的R包
# if (!requireNamespace("BiocManager", quietly = TRUE))
# install.packages("BiocManager")
# if (!require('devtools')) install.packages('devtools');
# if (!require("WGCNA")) BiocManager::install("WGCNA");
# if (!require("igraph")) install.packages("igraph");
# if (!require("tidygraph")) install.packages("tidygraph");
# if (!require("ggraph")) install.packages("ggraph");
# if (!require("linkET")) devtools::install_github("Hy4m/linkET", force = TRUE);
# if (!require("netET")) devtools::install_github("Hy4m/netET");
# Bioconductor_packages <- c("WGCNA","igraph","tidygraph","ggraph")
#
# for (pkg in Bioconductor_packages){
# if(!require(pkg, character.only = T)){
# BiocManager::install(pkg, ask = F, update = F)
# require(pkg, character.only = T)
# }
# }
# ##'@
# devtools::install_github("Hy4m/linkET", force = TRUE)
# devtools::install_github("Hy4m/netET")
# install.packages("igraph")
##'@加载R包
library(WGCNA)
library(ggraph)
library(tidygraph)
library(linkET)
library(igraph)
library(netET)
2. 导入数据
WGCNA.fpkm = read.csv("ExpData_WGCNA.csv", header = T,
check.names = F)
WGCNA.fpkm[1:5,1:5]
# 查看数据的维度和列名
dim(WGCNA.fpkm)
names(WGCNA.fpkm)
> WGCNA.fpkm[1:5,1:5]
sample sample01 sample02 sample03 sample04
1 gene001 8.242857 8.256276 8.360306 8.582075
2 gene002 8.089509 8.099918 8.204312 7.790141
3 gene003 7.909983 7.944611 8.204516 8.177993
4 gene004 7.928443 7.925406 8.118065 8.170789
5 gene005 8.164323 8.341932 8.284205 8.097484
3. 数据处理
#'@转换数据
row_names = WGCNA.fpkm[,1]
# 转置数据集,移除第一列(基因ID)
datExpr1 = as.data.frame(t(WGCNA.fpkm[,-1]))
# 将第一列的内容作为转置后数据框的列名
names(datExpr1) = row_names
# 为转置后的数据框设置行名(原数据框的列名,去掉第一列)
rownames(datExpr1) = colnames(WGCNA.fpkm)[-1]
# 查看最终的 datExpr0 数据框的维度和前几行
dim(datExpr1)
datExpr1[1:5,1:5]
#########check missing value and filter ##########
#datExpr0
##check missing value
gsg = goodSamplesGenes(datExpr1, verbose = 3)
gsg$allOK
if (!gsg$allOK)
{
# Optionally, print the gene and sample names that were removed:
if (sum(!gsg$goodGenes)>0)
printFlush(paste("Removing genes:", paste(names(datExpr1)[!gsg$goodGenes], collapse = ", ")))
if (sum(!gsg$goodSamples)>0)
printFlush(paste("Removing samples:", paste(rownames(datExpr1)[!gsg$goodSamples], collapse = ", ")))
# Remove the offending genes and samples from the data:
datExpr = datExpr1[gsg$goodSamples, gsg$goodGenes]
}
dim(datExpr1)
4. 过滤低表达量基因【optional】
#############################'@optional[可选步骤]
######'@过滤低表达量数据
######'
meanFPKM= 2 ####'@-过滤标准,可以修改
n=nrow(datExpr1)
datExpr1[n+1,]=apply(datExpr1[c(1:nrow(datExpr1)),],2,mean)
datExpr2 <- datExpr1[1:n,datExpr1[n+1,] > meanFPKM]
# for meanFpkm in row n+1 and it must be above what you set--select meanFpkm>opt$meanFpkm(by rp)
filtered_fpkm=t(datExpr2)
filtered_fpkm=data.frame(rownames(filtered_fpkm),filtered_fpkm)
names(filtered_fpkm)[1]="gene"
head(filtered_fpkm)
dim(filtered_fpkm)
write.table(filtered_fpkm, file="WGCNA.过滤后数据.txt",,
row.names=F, col.names=T,quote=FALSE,sep="\t")
5. 提取过滤后的数据集
##'@若是你在进行此步处理,请将过滤后的数据集进行赋值
##'
rownames(filtered_fpkm) <- filtered_fpkm$gene
filtered_fpkm <- filtered_fpkm[, -1] # 移除gene列
filtered_fpkm[1:5,1:12]
# 转置数据集
filtered_fpkm2 <- as.data.frame(t(filtered_fpkm))
# 检查结果
filtered_fpkm2[1:5,1:5]
##'@赋值
datExpr0 <- t(filtered_fpkm)
6. 若是不做过滤处理步骤4
和步骤5
的操作请忽略,直接进行赋值
#'@不过滤,直接使用原数据进行##
##'@若是未进行过滤,请忽略以上操作
# datExpr0 <- datExpr1
7. 样本聚类分析
sampleTree = hclust(dist(datExpr0), method = "average")
pdf(file = "1.sampleClustering.pdf", width = 6, height = 4)
par(cex = 0.6)
par(mar = c(0,6,6,0))
plot(sampleTree, main = "Sample clustering to detect outliers", sub="", xlab="", cex.lab = 2,
cex.axis = 1.5, cex.main = 2)
### Plot a line to show the cut
#abline(h = 180, col = "red")##剪切高度不确定,故无红线
dev.off()
8. 去除离散样本
注意:若是样本量很小,可以不去除,影响不大。
###'@去除离群值
pdf("2_sample clutering_delete_outliers.pdf", width = 6, height = )
par(cex = 0.6)
par(mar = c(0,6,6,0))
cutHeight = 20 ###'@cut高度
plot(sampleTree, main = "Sample clustering to detect outliers", sub="", xlab="", cex.lab = 2,
cex.axis = 1.5, cex.main = 2) +
abline(h = cutHeight, col = "red")
dev.off()
##'@"cutHeight"为过滤参数,与上述图保持一致
clust = cutreeStatic(sampleTree, cutHeight = cutHeight, minSize = 10)
keepSamples = (clust == 1)
datExpr = datExpr0[keepSamples, ]
nGenes = ncol(datExpr)
nSamples = nrow(datExpr)
dim(datExpr)
head(datExpr)
datExpr[1:11,1:5]
9. 注意事项
注意:若不做样本处理,可忽略步骤8
的操作,直接进行步骤10
的操作赋值。
10. 数据集赋值
###----------------------------------------------------------------------------
###'@若是不过滤数据,直接赋值数据,做后续的分析
# datExpr <- datExpr0
11. 加载表型数据集
##'@导入csv格式
traitData = read.csv("TraitData.csv",row.names=1)
# ##'@导入txt格式
# traitData = read.table("TraitData.txt",row.names=1,header=T,comment.char = "",check.names=F)
head(traitData)
allTraits = traitData
dim(allTraits)
names(allTraits)
# 形成一个类似于表达数据的数据框架
fpkmSamples = rownames(datExpr)
traitSamples =rownames(allTraits)
traitRows = match(fpkmSamples, traitSamples)
datTraits = allTraits[traitRows,]
rownames(datTraits)
# Re-cluster samples
sampleTree2 = hclust(dist(datExpr), method = "average")
#
traitColors = numbers2colors(datTraits, signed = FALSE)
12. 表达量数据集与表型数据集进行匹配
pdf(file="3_Sample_dendrogram_and_trait_heatmap.pdf",width=8 ,height= 6)
plotDendroAndColors(sampleTree2, traitColors,
groupLabels = names(datTraits),
main = "Sample dendrogram and trait heatmap",cex.colorLabels = 1.5, cex.dendroLabels = 1, cex.rowText = 2)
dev.off()