近年来,国内的科研水平提升地非常快,特别是硕士和博士的科研训练,基本上在国内培养出的比较优秀的硕士生做实验的小时数和经验值基本顶得上国外好多博士的工作量,在生物学领域尤其明显。00后学生也变得越来越自信,这是一件好事。
今天让我们专注一些套路化的数据分析,随着高通量组学技术的快速发展,组学数据在生命科学领域扮演着越来越重要的角色。无论是基因组、转录组、蛋白质组,还是代谢组,它们为研究生物系统的复杂性提供了宝贵的资源。高质量的组学数据分析,不仅能挖掘潜在的生物学意义,更是学术论文中令人信服的重要实验依据。
我们今天将以蛋白质组学为例,介绍从原始数据到最终结果的分析基本套路,并配以代码演示。
数据分析的基本流程
蛋白质组学的数据分析通常包括以下几个步骤:
1. 数据预处理
目标:清洗和标准化原始数据,为后续分析打好基础。
内容:质谱数据的峰提取、基线校正、归一化,以及低丰度蛋白质过滤。
示例代码:
# 加载必要的库
library(ggplot2)
library(dplyr)
# 模拟数据
set.seed(42)
data <- data.frame(
Sample1 = rnorm(100, mean = 10, sd = 2),
Sample2 = rnorm(100, mean = 12, sd = 2.5),
Sample3 = rnorm(100, mean = 8, sd = 1.5)
)
# 数据标准化
data_normalized <- scale(data)
# 高质量可视化
ggplot() +
geom_histogram(data = data, aes(x = Sample1), fill = "#1f77b4", alpha = 0.7, bins = 20) +
geom_density(data = data_normalized, aes(x = Sample1), fill = "#ff7f0e", alpha = 0.7) +
labs(title = "Data Normalization", x = "Expression Level", y = "Density") +
theme_minimal() +
theme(legend.position = "none")
2. 数据降维与探索
目标:通过降维方法初步探索样本之间的关系。
内容:主成分分析(PCA)、t-SNE或UMAP降维。
示例代码:
# 加载必要的库
library(ggplot2)
library(Rtsne)
# PCA 分析
pca_result <- prcomp(data_normalized)
# t-SNE 分析
tsne_result <- Rtsne(data_normalized, dims = 2, pca = TRUE)
# 高质量可视化
par(mfrow = c(1, 2), mar = c(4, 4, 2, 1))
# PCA 图
plot(pca_result$x[, 1:2], col = "#1f77b4", pch = 16, xlab = "PC1", ylab = "PC2", main = "PCA Plot")
# t-SNE 图
plot(tsne_result$Y[, 1:2], col = "#ff7f0e", pch = 16, xlab = "Dimension 1", ylab = "Dimension 2", main = "t-SNE Plot")
3. 差异表达分析
目标:识别不同条件下显著差异表达的蛋白质。
内容:统计检验(如t检验)、火山图、热图。
示例代码:
# 模拟数据集
set.seed(42)
data <- data.frame(
Gene = paste0("Gene_", 1:100),
log2FC = rnorm(100, mean = 0, sd = 2), # 模拟 Log2 Fold Change
p_value = runif(100, min = 0.0001, max = 1) # 模拟 P 值
)
# 添加显著性标签
data$significant <- (data$p_value < 0.05) & (abs(data$log2FC) > 1)
# 绘制火山图
ggplot(data, aes(x = log2FC, y = -log10(p_value))) +
geom_point(aes(color = significant), alpha = 0.6) +
geom_vline(xintercept = c(-1, 1), linetype = "dashed", color = "green") +
geom_hline(yintercept = -log10(0.05), linetype = "dashed", color = "blue") +
labs(title = "Volcano Plot", x = "Log2 Fold Change", y = "-Log10 P-value") +
theme_minimal()
4. 功能注释与网络分析
目标:挖掘蛋白质的功能和相互作用。
内容:GO和KEGG功能注释,PPI网络分析。
4.1. GO和KEGG功能注释
# 如果没有安装相应的包,请先安装
# 加载相关包
library(clusterProfiler)
library(org.Hs.eg.db)
library(igraph)
library(enrichplot)
# 示例基因列表(Entrez Gene ID)
gene_list <- c("1", "2", "3", "4", "5", "6", "7")
# GO富集分析
go_results <- enrichGO(gene = gene_list,
OrgDb = org.Hs.eg.db,
ont = "BP", # 可以选择 "BP"(生物过程)、"MF"(分子功能)、"CC"(细胞组分)
pAdjustMethod = "BH",
qvalueCutoff = 0.05)
# 查看GO富集结果
head(go_results)
## ID Description GeneRatio
## GO:0001867 GO:0001867 complement activation, lectin pathway 1/2
## GO:0001553 GO:0001553 luteinization 1/2
## GO:0045916 GO:0045916 negative regulation of complement activation 1/2
## GO:0002921 GO:0002921 negative regulation of humoral immune response 1/2
## GO:0030449 GO:0030449 regulation of complement activation 1/2
## GO:0034695 GO:0034695 response to prostaglandin E 1/2
## BgRatio RichFactor FoldEnrichment zScore pvalue p.adjust
## GO:0001867 11/18986 0.09090909 863.0000 29.35210 0.001158443 0.02075132
## GO:0001553 12/18986 0.08333333 791.0833 28.10027 0.001263723 0.02075132
## GO:0045916 14/18986 0.07142857 678.0714 26.01168 0.001474266 0.02075132
## GO:0002921 18/18986 0.05555556 527.3889 22.93288 0.001895285 0.02075132
## GO:0030449 23/18986 0.04347826 412.7391 20.27958 0.002421434 0.02075132
## GO:0034695 29/18986 0.03448276 327.3448 18.05167 0.003052630 0.02075132
## qvalue geneID Count
## GO:0001867 0.0004748586 2 1
## GO:0001553 0.0004748586 2 1
## GO:0045916 0.0004748586 2 1
## GO:0002921 0.0004748586 2 1
## GO:0030449 0.0004748586 2 1
## GO:0034695 0.0004748586 2 1
# 可视化GO富集结果
dotplot(go_results)
# KEGG富集分析
kegg_results <- enrichKEGG(gene = gene_list,
organism = "hsa", # 人类,其他物种可以修改
pAdjustMethod = "BH",
qvalueCutoff = 0.05)
# 查看KEGG富集结果
head(kegg_results)
## category subcategory ID
## hsa04610 Organismal Systems Immune system hsa04610
## Description GeneRatio BgRatio RichFactor
## hsa04610 Complement and coagulation cascades 1/1 88/8538 0.01136364
## FoldEnrichment zScore pvalue p.adjust qvalue geneID Count
## hsa04610 97.02273 9.799119 0.01030686 0.01030686 NA 2 1
# 可视化KEGG富集结果
dotplot(kegg_results)
4.2. 网络分析
# 安装并加载所需的R包
# install.packages("igraph")
# install.packages("ggplot2")
library(igraph)
library(ggplot2)
# 步骤 1:创建模拟PPI数据集
ppi_data <- data.frame(
Protein1 = c("P1", "P2", "P3", "P4", "P5", "P6", "P7", "P8"),
Protein2 = c("P2", "P3", "P4", "P5", "P6", "P7", "P8", "P1")
)
# 查看数据
head(ppi_data)
## Protein1 Protein2
## 1 P1 P2
## 2 P2 P3
## 3 P3 P4
## 4 P4 P5
## 5 P5 P6
## 6 P6 P7
# 步骤 2:构建PPI网络
ppi_network <- graph_from_data_frame(ppi_data, directed = FALSE)
# 查看网络基本信息
summary(ppi_network)
## IGRAPH 64f9f8d UN-- 8 8 --
## + attr: name (v/c)
# 查看网络的节点和边
V(ppi_network) # 节点
## + 8/8 vertices, named, from 64f9f8d:
## [1] P1 P2 P3 P4 P5 P6 P7 P8
E(ppi_network) # 边
## + 8/8 edges from 64f9f8d (vertex names):
## [1] P1--P2 P2--P3 P3--P4 P4--P5 P5--P6 P6--P7 P7--P8 P1--P8
# 步骤 3:可视化PPI网络
plot(ppi_network,
vertex.size = 30, # 节点大小
vertex.color = "skyblue", # 节点颜色
vertex.label.cex = 1.2, # 节点标签大小
vertex.label.color = "black", # 节点标签颜色
edge.width = 2, # 边宽度
edge.color = "gray", # 边颜色
main = "Protein-Protein Interaction Network") # 标题
# 步骤 4:计算PPI网络的中心性指标
# 计算度中心性(每个节点连接的边数)
degree_centrality <- degree(ppi_network)
print(degree_centrality)
## P1 P2 P3 P4 P5 P6 P7 P8
## 2 2 2 2 2 2 2 2
# 计算接近中心性
closeness_centrality <- closeness(ppi_network)
print(closeness_centrality)
## P1 P2 P3 P4 P5 P6 P7 P8
## 0.0625 0.0625 0.0625 0.0625 0.0625 0.0625 0.0625 0.0625
# 计算介数中心性
betweenness_centrality <- betweenness(ppi_network)
print(betweenness_centrality)
## P1 P2 P3 P4 P5 P6 P7 P8
## 4.5 4.5 4.5 4.5 4.5 4.5 4.5 4.5
# 步骤 5:识别网络中的社群(模块)
community_detection <- cluster_louvain(ppi_network)
# 查看社区分配结果
print(community_detection)
## IGRAPH clustering multi level, groups: 3, mod: 0.28
## + groups:
## $`1`
## [1] "P1" "P7" "P8"
##
## $`2`
## [1] "P2" "P3" "P4"
##
## $`3`
## [1] "P5" "P6"
##
# 可视化网络,按社区着色
plot(community_detection, ppi_network,
vertex.size = 30,
vertex.label.cex = 1.2,
vertex.color = membership(community_detection),
edge.width = 2,
edge.color = "gray",
main = "PPI Network with Communities")
# 步骤 6:基于节点属性进行网络着色
# 添加节点属性(模拟)
V(ppi_network)$importance <- degree_centrality # 将度中心性作为节点的重要性
# 可视化网络,按节点重要性着色
plot(ppi_network,
vertex.size = 30,
vertex.label.cex = 1.2,
vertex.color = heat.colors(100)[as.numeric(cut(V(ppi_network)$importance, breaks = 100))],
edge.width = 2,
edge.color = "gray",
main = "PPI Network with Node Importance")
# 步骤 7:进行PPI网络的拓扑分析
# 计算网络的平均度
avg_degree <- mean(degree(ppi_network))
print(paste("Average Degree:", avg_degree))
## [1] "Average Degree: 2"
# 计算网络的密度
network_density <- edge_density(ppi_network)
print(paste("Network Density:", network_density))
## [1] "Network Density: 0.285714285714286"
# 计算网络的直径
network_diameter <- diameter(ppi_network)
print(paste("Network Diameter:", network_diameter))
## [1] "Network Diameter: 4"
小结
数据清洗、降维探索、差异分析、功能注释和预测建模这一套流程完全掌握,你就基本入门了组学数据的分析套路。希望本文的示例代码能够为初学者提供信心,快快大胆尝试吧!
感谢关注,你的支持是我不懈的动力!