在模仿中精进数据可视化_使用circlize绘制单一染色体的circos圈图
❝
在模仿中精进数据可视化
该系列推文中,我们将从各大顶级学术期刊的Figure
入手,
解读文章的绘图思路,
模仿文章的作图风格,
构建适宜的绘图数据,
并且将代码应用到自己的实际论文中。
绘图缘由:小伙伴们总会展示出一些非常好看且精美的图片。我大概率会去学习和复现一下。其实每个人的时间和精力都非常有限和异常宝贵的。之所以我会去做,主要有以下原因:
图片非常好看,我自己看着也手痒痒 图片我自己在Paper也用的上,储备着留着用 保持了持续学习的状态
今天继续使用circlize
进行基因组数据的可视化,这一期推文属于是一个小小的编程实操。主要的目的如下:
统计人类第一号染色体的碱基 AGCTN
的分布情况使用circlize进行绘图
编写Python程序,解析人类第一号染色体,并且对ATCTN的个数进行统计
❝
stat_AGCT.py
import argparse
def read_fa(file):
fa_seq = {}
with open(file, "r") as fr:
for line in fr:
line_tmp = line.strip()
if line_tmp.startswith(">"):
chr_id = line_tmp.split()[0]
fa_seq[chr_id] = ""
else:
fa_seq[chr_id] += line_tmp
return fa_seq
def stat_fa(file, window_size, outfile):
# read fa
fa_dict = read_fa(file)
# output
output_file = open(outfile, "w")
output_file.write("Chr\tStart\tEnd\tA\tG\tC\tT\tN\n")
# for loop chr
for k in fa_dict.keys():
k2 = k.replace(">","")
for i in range(0, len(fa_dict[k]), int(window_size)):
seq_sub = fa_dict[k][i:i+int(window_size)]
A_number = seq_sub.count("A")
G_number = seq_sub.count("G")
C_number = seq_sub.count("C")
T_number = seq_sub.count("T")
N_number = seq_sub.count("N")
output_file.write(f"Chr{k2}\t{i}\t{i+int(window_size)}\t{A_number}\t{G_number}\t{C_number}\t{T_number}\t{N_number}\n")
if __name__ == "__main__":
# create ArgumentParser
parser = argparse.ArgumentParser(description="Stat the number of AGCTN base in Chrosome sequence")
# add argument
parser.add_argument("--arg1", help="Input Chromsome sequence")
parser.add_argument("--arg2", help="Window Size")
parser.add_argument("--arg3", help="Output file")
# parse argument
args = parser.parse_args()
# run
stat_fa(args.arg1, args.arg2, args.arg3)
❝调用
Python
程序
python stat_AGCT.py --arg1 example.fa --arg2 1000 --arg3 example_stat.out
获取到的结果如下
然后我们回到R
上使用circlize
进行可视化
直接上代码:
加载R
包
rm(list = ls())
####----load R Package----####
library(tidyverse)
library(circlize)
library(ComplexHeatmap)
library(readxl)
加载数据
set.seed(20241201)
####----load Data----####
Chr_info <- read_xlsx(path = "Input/Chr1_Info.xlsx")
Chr1_Gene <- read_xlsx(path = "Input/Chr1_Gene.xlsx")
Chr1_stat <- read_delim(file = "Input/example_stat.out", col_names = T, delim = "\t")
绘图
####----Plot----####
pdf(file = "./Output/plot.pdf",
height = 10,
width = 10)
# chrosome
circos.par("start.degree" = 90,
"track.margin" = c(0, 0.2),
"cell.padding" = c(0, 0.1, 0, 0),
gap.degree = 0)
circos.genomicInitialize(Chr_info,
plotType = c("axis", "labels"),
axis.labels.cex = 0.8*par("cex"),
labels.cex = 1*par("cex"),
track.height = 0.05, #设置的轨道的宽度
major.by = 50000
)
# chrosome fill
circos.genomicTrackPlotRegion(
Chr_info,
track.height = 0.02,
stack = TRUE,
bg.border = NA,
track.margin = c(0, 0),
panel.fun = function(region, value, ...) {
circos.genomicRect(region, value, col = "#fa9fb5", border = "black", ...)
} )
# A
circos.genomicTrack(
Chr1_stat %>% dplyr::select(1,2,3,4),
track.height = 0.08,
bg.col = NA,
bg.border = NA,
track.margin = c(0, 0.04),
panel.fun = function(region, value, ...){
circos.genomicLines(region, value, col="#238b45", lwd=1,...)
}
)
# T
circos.genomicTrack(
Chr1_stat %>% dplyr::select(1,2,3,7),
track.height = 0.08,
bg.col = NA,
bg.border = NA,
track.margin = c(0, 0),
panel.fun = function(region, value, ...){
circos.genomicLines(region, value, col="#d7301f", lwd=1,...)
}
)
# C
circos.genomicTrack(
Chr1_stat %>% dplyr::select(1,2,3,5),
track.height = 0.08,
bg.col = NA,
bg.border = NA,
track.margin = c(0, 0),
panel.fun = function(region, value, ...){
circos.genomicLines(region, value, col="#2171b5", lwd=1,...)
}
)
# G
circos.genomicTrack(
Chr1_stat %>% dplyr::select(1,2,3,6),
track.height = 0.08,
bg.col = NA,
bg.border = NA,
track.margin = c(0, 0),
panel.fun = function(region, value, ...){
circos.genomicLines(region, value, col="#e7298a", lwd=1,...)
}
)
# N
circos.genomicTrack(
Chr1_stat %>% dplyr::select(1,2,3,8),
track.height = 0.05,
bg.col = NA,
bg.border = NA,
track.margin = c(0, 0.02),
panel.fun = function(region, value, ...){
circos.genomicLines(region, value, col="#525252", lwd=1,...)
}
)
# add arrow
circos.genomicTrack(
Chr1_Gene,
track.height = 0.1,
bg.col = NA,
bg.border = NA,
track.margin = c(0, 0.02),
ylim = c(0, 1),
panel.fun = function(region, value, ...){
all_tx = unique(value$Gene)
for (i in seq_along(all_tx)) {
l = value$Gene == all_tx[i]
# for each Gene
current_tx_start = min(region[l, 1])
current_tx_end = max(region[l, 2])
current_tx_mean = mean(c(min(region[l, 1]), max(region[l, 2])))
circos.lines(c(1,599940),
c(0.5,0.5),
col = "#525252",
lwd = 0.15,
lty = 2)
circos.arrow(current_tx_start,
current_tx_end,
col = "#807dba",
arrow.position = "end",
width = 0.35,
arrow.head.length = mm_x(3),
arrow.head.width = 0.35*1,
border = "black")
circos.text(current_tx_mean, 0.85, labels = all_tx[i], cex = 0.85)
}
}
)
# add legend
legend <- Legend(
at = c(1, 2, 3, 4, 5),
labels = c("A", "T", "C", "G", "N"),
labels_gp = gpar(fontsize = 8),
grid_height = unit(0.5, "cm"),
grid_width = unit(0.5, "cm"),
type = c("lines", "lines"),
pch = NA,
legend_gp = gpar(col = c("#238b45", "#d7301f", "#2171b5", "#e7298a", "#525252"), lwd = 1),
# title = "Kind",
title_gp = gpar(fontsize = 9))
pushViewport(viewport(x = 0.5, y = 0.5))
grid.draw(legend)
y_coord <- 0.5
upViewport()
circos.clear()
dev.off()
版本信息
R version 4.3.0 (2023-04-21)
Platform: x86_64-apple-darwin20 (64-bit)
Running under: macOS 15.1.1
Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRlapack.dylib; LAPACK version 3.11.0
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
time zone: Asia/Shanghai
tzcode source: internal
attached base packages:
[1] grid stats4 stats graphics grDevices utils datasets methods base
other attached packages:
[1] ComplexHeatmap_2.18.0 circlize_0.4.15 lubridate_1.9.3 forcats_1.0.0
[5] stringr_1.5.1 dplyr_1.1.4 purrr_1.0.2 readr_2.1.5
[9] tidyr_1.3.1 tibble_3.2.1 ggplot2_3.5.1 tidyverse_2.0.0
[13] readxl_1.4.3 Biostrings_2.70.1 GenomeInfoDb_1.38.1 XVector_0.42.0
[17] IRanges_2.36.0 S4Vectors_0.40.2 BiocGenerics_0.48.1
loaded via a namespace (and not attached):
[1] gtable_0.3.5 shape_1.4.6 rjson_0.2.21 GlobalOptions_0.1.2
[5] tzdb_0.4.0 vctrs_0.6.5 tools_4.3.0 bitops_1.0-7
[9] generics_0.1.3 parallel_4.3.0 fansi_1.0.6 cluster_2.1.6
[13] pkgconfig_2.0.3 RColorBrewer_1.1-3 lifecycle_1.0.4 GenomeInfoDbData_1.2.11
[17] compiler_4.3.0 munsell_0.5.1 codetools_0.2-19 clue_0.3-65
[21] RCurl_1.98-1.13 pillar_1.9.0 crayon_1.5.2 iterators_1.0.14
[25] foreach_1.5.2 tidyselect_1.2.1 digest_0.6.37 stringi_1.8.3
[29] colorspace_2.1-1 cli_3.6.3 magrittr_2.0.3 utf8_1.2.4
[33] withr_3.0.1 scales_1.3.0 bit64_4.0.5 writexl_1.4.2
[37] timechange_0.2.0 matrixStats_1.1.0 bit_4.0.5 cellranger_1.1.0
[41] png_0.1-8 GetoptLong_1.0.5 hms_1.1.3 doParallel_1.0.17
[45] rlang_1.1.4 glue_1.8.0 vroom_1.6.4 rstudioapi_0.15.0
[49] R6_2.5.1 zlibbioc_1.48.0
历史绘图合集
公众号推文一览
进化树合集
环状图
散点图
基因家族合集
换一个排布方式:
首先查看基础版热图:
然后再看进阶版热图:
基因组共线性
WGCNA ggplot2版本
其他科研绘图
合作、联系和交流
有很多小伙伴在后台私信作者,非常抱歉,我经常看不到导致错过,请添加下面的微信联系作者,一起交流数据分析和可视化。