今天来复现一篇 2024 年 6 月份发表在 顶刊 science 杂志上的文献《Defining the KRAS- and ERK-dependent transcriptome in KRAS-mutant cancers》中两个差异分组结果比较的散点图,图片以及含义如下:
Fig. 2. Mutant KRAS is largely dependent on ERK for PDAC proliferation in vitro:图片中的蓝色为在两组差异分析中都下调的基因,红色为都上调的基因,橙灰色为上下调方向不一致的基因。黑色为不显著基因。
1、数据背景介绍
主要为两组差异分析,介绍如下。
1.1 PDAC细胞系KRAS siRNA RNA测序
为了确定 KRAS 依赖性转录组,作者对一组八个人类 KRAS 突变型 胰腺导管腺癌PDAC的细胞系在经过 24 小时 KRAS 小干扰 RNA(siRNA)处理后的基因转录变化(图 S1A 和 B)进行了 RNA 测序(RNA-seq)。数据可在这里下载:PRJNA980201 https://www.ebi.ac.uk/ena/browser/view/PRJNA980201?show=related-records。
Note:
NS缩写:control non-specific ("NS") siRNA。
KRAS siRNA是一种小干扰RNA(siRNA),通过特异性靶向KRAS基因的mRNA,从而抑制其表达,达到治疗KRAS突变肿瘤的目的。
差异分析结果如下:The 677 KRAS-dependent (UP) and 1051 KRAS-inhibited (DN) genes (log2FC > 0.5, adj. p < 0.05) are indicated by the light blue– and light red–shaded rectangles underlaid on plot, respectively。
差异结果在表格S1:Data S1. Differential expression statistics for genes in KRAS versus NS siRNA treated PDAC cells, using RNA-sequencing。
1.2 PDAC细胞系的ERKi RNA测序
为了评估 KRAS 突变型 胰腺导管腺癌PDAC中的 ERK 依赖性转录组,作者对RNA-seq 测序数据集(PRJEB25806)进行分析,该数据集包含一组经 ERK1/2 选择性抑制剂 SCH772984 处理 1、4、12 或 24 小时(ERKi)的 KRAS 突变型 PDAC 细胞系。
差异分析结果如下:
24 小时 vs 对照组:鉴定出包括 2054 个 ERK 依赖性(上调,UP)和 2519 个 ERK 抑制性(下调,DN)基因(log2 倍变化 > 0.5,校正后 p 值 < 0.01)(图 2E 和数据 S6)。
差异结果在表格S6:Data S6. Differential expression statistics for genes in ERKi treated PDAC cell lines at various time points, using RNA-sequencing.
2、两组差异比较散点图
为了评估 ERK 在 KRAS 调控的基因转录中的作用,作者将 24 小时 ERKi vs 对照组 和 KRAS siRNA 处理与对照组 相比的前 200 个差异表达的上调/下调(UP/DN)基因进行了比较(图 2F 中的彩色点)。在大约 84% 的情况下,两种处理下的基因表达水平变化方向相同(图 2F),就是我们今天复现的图F。
数据背景都搞清楚了,现在开始复现吧!
2.1 第一个差异结果读取:KRAS vs NS siRNA(log2FC)
差异筛选阈值:677 个 KRAS 依赖性(上调,UP)和 1051 个 KRAS 抑制性(下调,DN)基因(log2FC > 0.5,校正后 p 值 < 0.05)分别用浅蓝色和浅红色阴影矩形在图中表示。
rm(list=ls())
# 加载R包
library(ggplot2)
library(tibble)
library(ggrepel)
library(tidyverse)
library(dplyr)
library(patchwork)
##### 01、加载数据
# 加载:KRAS vs NS siRNA(log2FC)
group1 <- read.csv("./data/science.adk0775_data_s1.csv" )
head(group1)
# 提取FDR值和FC值
group1 <- group1[,c("external_gene_name","logFC","FDR" )]
group1 <- group1[group1$external_gene_name!="", ]
group1 <- distinct(group1,external_gene_name,.keep_all = T)
rownames(group1) <- group1$external_gene_name
head(group1)
# external_gene_name logFC FDR
# KDM1A KDM1A 0.7018303 0.000582458
# DVL2 DVL2 0.4406432 0.001620279
# DHX33 DHX33 -0.3528938 0.001198934
# MSL3 MSL3 -1.0778392 0.000171252
# CREBBP CREBBP 0.1041763 0.113563770
# KMT2E KMT2E 0.3245412 0.000775427
# 增加一列上下调
group1$g1 <- "normal"
group1$g1[group1$logFC >0.5 & group1$FDR < 0.05 ] <- "up"
group1$g1[group1$logFC < -0.5 & group1$FDR < 0.05 ] <- "down"
table(group1$g1)
# down normal up
# 675 12876 1043
2.2 第二个差异结果读取:ERKi 24h vs Control(log2FC)
差异筛选阈值:大多数这些变化是在 24 小时时观察到的,包括 2054 个 ERK 依赖性(上调,UP)和 2519 个 ERK 抑制性(下调,DN)基因(log2FC > 0.5,校正后 p 值 < 0.01)(图 2E 和数据 S6)。
## 加载:ERKi 24h vs Control(log2FC)
group2 <- read.csv("./data/science.adk0775_data_s6.csv" )
head(group2)
group2 <- group2[,c("external_gene_name","ERKi_24_hr.logFC","ERKi_24_hr.FDR" )]
group2 <- group2[group2$external_gene_name!="", ]
group2 <- distinct(group2,external_gene_name,.keep_all = T)
rownames(group2) <- group2$external_gene_name
head(group2)
# external_gene_name ERKi_24_hr.logFC ERKi_24_hr.FDR
# TSPAN6 TSPAN6 0.6386822 1.165894e-05
# DPM1 DPM1 -0.2505255 2.509240e-03
# SCYL3 SCYL3 0.2558298 2.613104e-03
# C1orf112 C1orf112 -0.9348071 3.492023e-05
# CFH CFH 1.0302213 6.134556e-03
# FUCA2 FUCA2 -0.2011928 2.993087e-02
# 增加一列上下调
group2$g2 <- "normal"
group2$g2[group2$ERKi_24_hr.logFC >0.5 & group2$ERKi_24_hr.FDR < 0.01 ] <- "up"
group2$g2[group2$ERKi_24_hr.logFC < -0.5 & group2$ERKi_24_hr.FDR < 0.01 ] <- "down"
table(group2$g2)
# down normal up
# 2031 8626 2491
2.3 数据合并并筛选top的基因
图中分别挑选了上下调一致的按照logFC选取的top 200 基因进行放大和上色:
红色点,两组中均上调 深蓝色点,两组中均下调 咖色点,两组中上下调情况不一致
以及top20的基因进行基因symbol显示。
# 合并
data.all <- merge(group1,group2,by.x ="external_gene_name",by.y = "external_gene_name" )
head(data.all)
##### 筛选top200的基因
## up
group1_gene_up200 <- data.all %>%
filter(FDR < 0.01) %>%
arrange(desc(logFC)) %>%
head(100) %>% pull(external_gene_name)
group2_gene_up200 <- data.all %>%
filter(ERKi_24_hr.FDR < 0.01) %>%
arrange(desc(ERKi_24_hr.logFC)) %>%
head(100) %>%
pull(external_gene_name)
## down
group1_gene_down200 <- data.all %>%
filter(FDR < 0.01) %>%
arrange(logFC) %>%
head(100) %>%
pull(external_gene_name)
group2_gene_down200 <- data.all %>%
filter(ERKi_24_hr.FDR < 0.01) %>%
arrange(ERKi_24_hr.logFC) %>%
head(100) %>%
pull(external_gene_name)
## 两组top200基因合并
down200_genes_all <- unique(c(group1_gene_down200, group2_gene_down200))
up200_genes_all <- unique(c(group1_gene_up200,group2_gene_up200))
##### 点的颜色确定
# 红色点,两组中均上调:
# 深蓝色点,两组中均下调:
# 咖色点,两组中上下调情况不一致
data.all$zz <- paste0(data.all$g1, data.all$g2)
table(data.all$zz)
data.all$zz[grepl("normal",data.all$zz)] <- "normal"
data.all$zz[data.all$zz=="upup"] <- "up"
data.all$zz[data.all$zz=="downdown"] <- "down"
data.all$zz[data.all$zz=="downup" | data.all$zz=="updown"] <- "change"
table(data.all$zz)
# change down normal up
# 74 335 12118 453
## 筛选40个top基因展示
# Genes with the strongest log2FC values (n = 40) are labeled.
down_label <- data.all[data.all$external_gene_name %in% down200_genes_all,]
down_label_data1 <- down_label[order(down_label$logFC), "external_gene_name"][1:10]
down_label_data2 <- down_label[order(down_label$ERKi_24_hr.logFC), "external_gene_name"][1:10]
down_label_gene <- unique(c(down_label_data1, down_label_data2))
down_label_gene
up_label <- data.all[data.all$external_gene_name %in% up200_genes_all,]
up_label_data1 <- up_label[order(up_label$logFC, decreasing = T), "external_gene_name"][1:10]
up_label_data2 <- up_label[order(up_label$ERKi_24_hr.logFC, decreasing = T), "external_gene_name"][1:10]
up_label_gene <- unique(c(up_label_data1, up_label_data2))
up_label_gene
2.4 绘图
万事具备:
###### 画图展示
head(data.all)
data_normal <- data.all[data.all$zz %in% "normal", ]
data_sig <- data.all[!(data.all$zz %in% "normal"), ]
data_label <- data.all[data.all$external_gene_name %in% c(down_label_gene, up_label_gene), ]
p1 <- ggplot(data.all, aes(x=logFC, y=ERKi_24_hr.logFC))+
geom_vline(xintercept = 0, color = "grey60", linewidth = 0.6) +
geom_hline(yintercept = 0, color = "grey60", linewidth = 0.6) +
# 不显著的基因为黑色
geom_point(data = data_normal, shape = 21, color = "black",
alpha = 0.1, size = 0.2, stroke = 1)
# 显著的基因:添加的颜色
my_col <- c("up" = "#cb181d", "down" ="#2927c4", "change" = "#c6a58b")
p2 <- p1 +
geom_point(data = data_sig,
aes(x=logFC, y=ERKi_24_hr.logFC, fill = zz,
colour = zz, size = -log10(FDR),
alpha = -log10(ERKi_24_hr.FDR)),
shape = 21, stroke = 0.7) +
scale_size_continuous(range = c(1.5, 4.5)) +
scale_alpha_continuous(range = c(0.2, 0.85)) +
scale_fill_manual(values = my_col ) +
scale_color_manual(values = my_col )
p2
# 添加label
p3 <- p2 +
geom_text_repel(data= data_label,
aes(x = logFC, y =ERKi_24_hr.logFC,label = external_gene_name),
size = 5, color = "black", fontface = 'italic', max.overlaps = 40) +
xlim(c(-3, 3)) +
xlab("KRAS vs NS siRNA (log2FC)") +
ylab("ERKi 24h vs Control (log2FC)") +
theme_bw(base_size = 15) +
theme(legend.position = "none")
p3
结果如下:
学徒作业
以上复现可以从 GSE20970, GSE37182, GSE44861 and GSE64392 其实可以挑选两个做这个类似的图看看效果。