浅学R-ape包 | 序列及进化树操作

文摘科学 2022-07-23 21:42 英国

Analyses of Phylogenetics and Evolution(ape)包含了很多序列和进化树的操作，比如序列和树的读入、输出、可视化、遗传距离的计算、比较等等。本文粗略学习了一下ape包中常用的功能，主要涉及到到序列和进化树两个方面。

对序列的操作

序列的读入

ape提供了三种直接读入本地序列文件的函数read.dna(), read.FASTA()和read.fastq()，根据名字便可以判断它们读入的文件格式。或者直接使用read.dna()，然后在参数中说明文件格式。比如，我们一个含有三个比对序列的fasta文件：

library(ape)
cat(">No303",
"NTTCGAACTTCCCACCCACTACTAAAAAGCTTCAATCACTGT",
">No304",
"NTTCCGCGTTCCCACCCACTTATTAAAAGCTTCAATCGCTGT",
">No305",
"ATTCCGCGTTTCCACCCACTTATTCCCCGCTTCAATCGCTGT",
">No306",
"TTTACGTATTTCCACCCACGGTTAGACCGCTTCAAGCGCTTT",
">No307",
"TTTACGTATTTCCACCCACGGTTAGTCCGCTTCAAGCGCTTT",
file = "exdna.fas", sep = "\n")
ex.dna4 <- read.dna("exdna.fas", format = "fasta")
ex.dna4

## 5 DNA sequences in binary format stored in a matrix.
## 
## All sequences of same length: 42 
## 
## Labels:
## No303
## No304
## No305
## No306
## No307
## 
## Base composition:
##     a     c     g     t 
## 0.207 0.341 0.130 0.322 
## (Total: 210 bases)

checkAlignment(ex.dna4)

## 
## Number of sequences: 5 
## Number of sites: 42 
## 
## No gap in alignment.
## 
## Number of segregating sites (including gaps): 18
## Number of sites with at least one substitution: 18
## Number of sites with 1, 2, 3 or 4 observed bases:
##  1  2  3  4 
## 24 12  6  0

上面使用checkAlignment()显示了一些序列的基本信息了：各位点的碱基情况，香农指数等

此外，还可以通过read.GenBank()函数从GenBank中直接读取序列。

分离位点位置

seg.sites(ex.dna4)

##  [1]  1  4  5  6  7  8 11 20 21 22 24 25 26 27 28 36 38 41

可以直接看到该比对序列的多态性位点所在的位置。

序列翻译

trans()函数可以直接将序列翻译成对应的氨基酸，使用alview可以查看翻译的序列

ex.dna4.trans <- trans(ex.dna4, codonstart = 1)
alview(ex.dna4.trans)

##       00000000011111
##       12345678901234
## No303 XRTSHPLLKSFNHC
## No304 .PR....I....R.
## No305 IPRF...IPR..R.
## No306 FTYF..R.DR.KRF
## No307 FTYF..R.VR.KRF

计算遗传距离

根据fasta中序列的比对结果，可以计算出各个序列的遗传距离。计算遗传距离的方法有很多，此处包括："raw", "N", "TS", "TV", "JC69", "K80" (the default), "F81", "K81", "F84", "BH87", "T92", "TN93", "GG95", "logdet", "paralin", "indel", "indelblock"。其中K80是默认的方法，也即最常用的Kimura's 2-parameters distance，它在计算遗传距离的时候考虑到了碱基替换和颠换之间的概率差异。

ex.dna4.distance <- dist.dna(ex.dna4, model = "K80")
ex.dna4.distance

##            No303      No304      No305      No306
## No304 0.22636149                                 
## No305 0.41461053 0.13337813                      
## No306 0.50343936 0.41227858 0.33520324           
## No307 0.55182895 0.45630133 0.33254166 0.02484891

根据遗传距离建树

下面分别使用两种不同的方法建树：

par(mfrow=c(2,1), mar=c(1,1,1,1))
tree1 <- fastme.bal(ex.dna4.distance, nni = TRUE)
plot(tree1, main=' minimum evolution estimation')
tree2 <- nj(ex.dna4.distance)
plot(tree2, main='neighbor-joining tree estimation')

计算dN/dS值

将序列两两比对计算dN/dS的值

dnds(ex.dna4, details = FALSE)

##            No303      No304      No305      No306
## No304  0.3325324                                 
## No305  0.6078244  0.9749547                      
## No306  2.3641390 29.6330288  3.6822329           
## No307  1.4326689  2.0789517  3.8109801        Inf

序列可视化

par(mfrow=c(2,2))
image(ex.dna4)
image(ex.dna4, "n", "blue") #只显示缺失值为蓝色
image(ex.dna4, c("g", "c"), "green") # 显示G+C为绿色
image(ex.dna4, "g", "yellow", "black") # 显示G为换色，背景为黑色

当然也可以通过for循环分别显示四种碱基

#分别显示四种碱基
par(mfrow = c(2, 2), mar=c(3,3,3,3))
### barcoding style:
for (x in c("a", "g", "c", "t"))
    image(ex.dna4, x, "black", cex.lab = 0.5, cex.axis = 0.7)

对树的操作

树的读入和可视化

ape包提供了多种树的读入方式，包括read.caic(), read.tree(), read.nexus()。

下面我们以文本的方式读入果蝇的进化树

dro_tree <- "(D_IMMIGRANS:0.3994590763,(D_SERRATA:0.2038336229,(D_SUZUKII:0.0880712765,(D_SIMULANS:0.0192802251,D_MELANOGASTER:0.0212721592)n9:0.1162298771)n8:0.1015050941)n7:0.2940361197)n6;"
tree3 <- read.tree(text = dro_tree)
par(mfrow=c(1,1), mar=c(2,2,3,2))
plot(tree3)
edgelabels(round(tree3$edge.length,3), adj=c(0.5,-0.3), font=1) # 添加枝长
nodelabels(tree3$node.label) # 添加节点名称
axisPhylo() # 添加坐标尺

如果要输出保存一个树，可以write.tree()函数直接将树保存到文件中。

树信息的显示

coalescent.intervals(tree3)

## $lineages
## [1] 5 4 3 2
## 
## $interval.length
## [1] 0.02127216 0.11622988 0.10150509 0.29403612
## 
## $interval.count
## [1] 4
## 
## $total.depth
## [1] 0.5330433
## 
## attr(,"class")
## [1] "coalescentIntervals"

我们还可以改变树的形状

par(mfrow=c(1,1), mar=c(1,1,1,1), oma=c(1,1,0,1))
plot(tree3, "c", FALSE, direction = "u")
edgelabels(round(tree3$edge.length,3), adj=c(0.5,-0.3), font=1)
nodelabels(tree3$node.label)
axisPhylo(2, las = 1)

节点标注

tree4_text <- "((Pan_paniscus,Pan_troglodytes),((Homo_sapiens,Homo_erectus),Homo_abilis));"
tree4 <- read.tree(text = tree4_text)
tree4.1 <- makeNodeLabel(tree4, prefix = 'Node') # 默认是前缀加数字的方式对节点命名。
plot(tree4.1, show.node.label = TRUE)

我们还可以自定义节点名称，但是这个自定义的方式有些晦涩。它是通过树末端名称匹配搜索，然后将最近共同祖先的节点重新命名。如下，所有含有'Pan‘的最近共同祖先被名为’Pan',所有含有‘Homo'的最近共同祖先节点被命名为’Homo'。


tree4.2 <- makeNodeLabel(tree4, method="user", nodeList = list(Pan = "Pan", Homo = "Homo"))
plot(tree4.2, show.node.label = TRUE)

tree4.3 <- makeNodeLabel(tree4, method="user", nodeList = list(Pan = "Pan", Homo = "Homo",Hominid = c("Pan","Homo")))
plot(tree4.3, show.node.label = TRUE)

添加支系关系

通过edges()函数可以在不同节点之间添加线段或者箭头。比如在群体遗传中表示各个支系之间的迁徙，或者基因流。

set.seed(2)
tree5 <- rcoal(6) # 随机产生6个末端的树
plot(tree5, "c")
edges(10, 9, col = "red", lty = 2)
edges(1, 2, lwd = 2, type = "h", arrows = 1, col = "green")

树的修剪

比如，有一个上百序列的进化树，但是你不想要这么多序列，想删除一些序列，然后保留剩下的树。这是可以使用keep.tip()或者drop.tip()来选择保留后者删除哪些树。

par(mfrow=c(2,2), mar=c(2,2,2,2))
plot(tree3)
tree3.1 <- keep.tip(tree3, tip=c('D_MELANOGASTER','D_SUZUKII','D_IMMIGRANS'))
plot(tree3.1)

tree3.2 <- drop.tip(tree3, tip=c('D_SERRATA'))
plot(tree3.2)

左上为原始树，余下两个为修剪之后的树。

修剪之后的树，仍然保留了树枝的长度！

上述两个函数还有个interactive=参数，你可以通过鼠标点击来选择要保留或者删除的树枝。

当然，对树的修剪还可以对整个树枝进行操作，比如要提取树上的某一个大树枝,可以根据该树枝的节点进行如下操作

par(mfrow=c(2,1), mar=c(1,1,1,1))

plot(tree3)
nodelabels(tree3$node.label)
tree3.3 <- extract.clade(tree3, node=8) # 提取节点为8的树枝
plot(tree3.3)
nodelabels(tree3.3$node.label)

树的合并

树不仅可以进行修剪，还能将不同的树合并（或者叫嫁接？）。比如，我们可以通过bind.ree()直接将两个树在根部合并。

layout(matrix(c(1, 2, 3, 3), 2, 2))
par(mar = c(2, 2, 2, 2))
x <- rtree(4, tip.label = LETTERS[1:4]) # 随机产生一个树
y <- rtree(4, tip.label = LETTERS[5:8]) # 随机产生一个树
x <- makeNodeLabel(x, prefix = "x_")
y <- makeNodeLabel(y, prefix = "y_")
x$root.edge <- y$root.edge <- .2

z <- bind.tree(x, y, po=.2)

plot(y, show.node.label = TRUE, font = 1, root.edge = TRUE)
title("y")
plot(x, show.node.label = TRUE, font = 1, root.edge = TRUE)
title("x")
plot(z, show.node.label = TRUE, font = 1, root.edge = TRUE)
title("z <- bind.tree(x, y, po=.2)")

或者我们也可以指定合并的位置。如下，我们把y树合并到x树的第二个节点，并且指定枝长为0.1

layout(matrix(c(1, 2, 3, 3), 2, 2))
par(mar = c(2, 2, 2, 2))
x$edge.length[x$edge[, 2] == 2] <- 0.2

z <- bind.tree(x, y, 2, .1)
plot(y, show.node.label = TRUE, font = 1, root.edge = TRUE)
title("y")
plot(x, show.node.label = TRUE, font = 1, root.edge = TRUE)
title("x")
plot(z, show.node.label = TRUE, font = 1, root.edge = TRUE)
title("z <- bind.tree(x, y, where=2, position=0.1)")

节点路径和最近共同祖先

我们可以通过nodepath()函数来寻找从一个节点到另个节点所经历的路径：

nodepath(tree3, from=1, to=2)

## [1] 1 6 7 2

从第一个节点到第二个节点，要经过第六、七个节点。

我们还可以显示最有末端节点的最近共同最先的节点。

mrca(tree3) # most recent common acestor

##                D_IMMIGRANS D_SERRATA D_SUZUKII D_SIMULANS D_MELANOGASTER
## D_IMMIGRANS              1         6         6          6              6
## D_SERRATA                6         2         7          7              7
## D_SUZUKII                6         7         3          8              8
## D_SIMULANS               6         7         8          4              9
## D_MELANOGASTER           6         7         8          9              5

树枝的密度

opar <- par(mfrow = c(2, 1), mar=c(3,2,2,2))
plot(tree3)
axisPhylo()
ltt.plot(tree3)
title("Lineages Through Time")

也可以直接使用ltt.coplot(tree3)来显示树枝的密度。

关联树

将两个不同树的相同末端节点连接起来：

par(mfrow=c(1,1))
tr1 <- rtree(40)
tr2 <- rtree(20)

#creation of the association matrix:
association <- cbind(tr2$tip.label, tr2$tip.label)

cophyloplot(tr1, tr2, assoc = association,
            length.line=4, space=28, gap=3, col='orange', lwd=2)

树的放大显示

如果我们有一个很茂密的树，但是只想突出显示其中的一个树枝，该怎么办？这时可以使用zoom()函数，选择要显示的末端节点。如下，将左侧茂密树枝上标红的部分放大显示。


data(chiroptera)
# zoom(chiroptera, 1:20, subtree = TRUE)
par(mfrow=c(1,1), mar=c(1,1,1,1), oma=c(0,0,0,0))
zoom(chiroptera, grep("Plecotus", chiroptera$tip.label))
# zoom(chiroptera, list(grep("Plecotus", chiroptera$tip.label),
#                       grep("Pteropus", chiroptera$tip.label)))

[左图对应树枝有标红，此处图太小，显示不明显]

树枝截断显示

如果一个树的某个树枝特别长，那么整个树显示效果会很差，如下左图。这时，我们可以使用plotBreakLongEdges()，选择要截断显示的内部节点的数量，如下右图

tr <- rtree(10)
tr$edge.length[c(1, 18)] <- 100
op <- par(mfcol = 1:2, mar=c(2,2,2,2))
plot(tr); axisPhylo()
plotBreakLongEdges(tr, 2); axisPhylo()

给树枝添加点缀

下面的图使用phydataplot()，ring()函数，在树的外侧添加一些其他和各个树枝相关的信息。

对于phydataplot()，这些信息包括“bars”, “segments”, “image”, “arrows”, “boxplot”, “dotchart”, “mosaic” ；

对于ring()，包括“ring”, “segments”, “arrows” 。

自行鉴赏吧。。。

## demonstrates matching with names:
tr <- rcoal(n <- 10)
x <- 1:n
names(x) <- tr$tip.label
plot(tr, x.lim = 11)
phydataplot(x, tr)
## shuffle x but matching names with tip labels reorders them:
phydataplot(sample(x), tr, "s", lwd = 3, lty = 3)

## adapts to the tree:
plot(tr, "f", x.l = c(-11, 11), y.l = c(-11, 11))
phydataplot(x, tr, "s")

## leave more space with x.lim to show a barplot and a dotchart:
plot(tr, x.lim = 22)
phydataplot(x, tr, col = "yellow")
phydataplot(x, tr, "d", offset = 13)

ts <- rcoal(N <- 100)
X <- rTraitCont(ts) # names are set
dd <- dist(X)
op <- par(mar = rep(0, 4))
plot(ts, x.lim = 10, cex = 0.4, font = 1)
phydataplot(as.matrix(dd), ts, "i", offset = 0.2)

par(xpd = TRUE, mar = op$mar)
co <- c("blue", "red"); l <- c(-2, 2)
X <- X + abs(min(X)) # move scale so X >= 0
plot(ts, "f", show.tip.label = FALSE, x.lim = l, y.lim = l, open.angle = 30)
phydataplot(X, ts, "s", col = co, offset = 0.05)
ring(X, ts, "ring", col = co, offset = max(X) + 0.1) # the same info as a ring

## if x is matrix:
tx <- rcoal(m <- 20)
X <- runif(m, 0, 0.5); Y <- runif(m, 0, 0.5)
X <- cbind(X, Y, 1 - X - Y)
rownames(X) <- tx$tip.label
plot(tx, x.lim = 4)
co <- rgb(diag(3))
phydataplot(X, tx, col = co, offset=0.2)

Z <- matrix(rnorm(m * 5), m)
rownames(Z) <- rownames(X)
plot(tx, x.lim = 5)
phydataplot(Z, tx, "bo", scaling = .5, offset = 0.5,
            boxfill = c("gold", "skyblue"))


## use type = "mosaic" on a 30x5 matrix:
tr <- rtree(n <- 30)
p <- 5
x <- matrix(sample(3, size = n*p, replace = TRUE), n, p)
dimnames(x) <- list(paste0("t", 1:n), LETTERS[1:p])
plot(tr, x.lim = 35, align.tip = TRUE, adj = 1)
phydataplot(x, tr, "m", 2)

## change the aspect:
plot(tr, x.lim = 35, align.tip = TRUE, adj = 1)
phydataplot(x, tr, "m", 2, width = 2, border = "white", lwd = 3, legend = "side")


x[] <- 1:(n*p)
plot(tr, x.lim = 35, align.tip = TRUE, adj = 1)
phydataplot(x, tr, "m", 2, width = 1.5, continuous = TRUE, legend = "side",
            funcol = colorRampPalette(c("white", "darkgreen")))

本文缘起于一个果蝇的进化树，有200多个物种，但是只想要其中二十余种果蝇的进化关系，想对该进化树进行修剪，遂找到了ape包，发现其中很多有用的功能。。。

【感谢阅读】

资料来源：R-ape package: https://cran.r-project.org/web/packages/ape/index.html

http://mp.weixin.qq.com/s?__biz=MzUxMDYwNTQ4Nw==&mid=2247485985&idx=1&sn=05ccb201b2d5ce20b2f3c393afbe4e3e

AI写代码的DNA

我的群体进化遗传学学习笔记~~~ 学习|交流|进步