bwa
之后,在2018年推出了又一力作minimap2
,其主要功能就是将测序得到的DNA或者RNA序列快速比对到参考基因组上。bwa主要适用于illumina等短读长序列的比对,而minimap2
则是专门针对PacBio或Oxford Nanopore等三代测序长读长数据开发的软件。实际上,minimap2
可应用于多种比对场景,包括对二代数据进行比对,其输出结果格式包括paf
和sam
格式,默认为sam
格式。
在本文中,不对minimap2
的算法原理进行解析,仅以“基因组组装之间的比对”为例,展示其易用性和paf文件的可视化。
软件安装与测试
conda install -c bioconda minimap2 # conda安装是方法之一
minimap2 -h # 调出该软件的帮助信息
Usage: minimap2 [options] <target.fa>|<target.idx> [query.fa] [...]
Options:
Indexing:
-H use homopolymer-compressed k-mer (preferrable for PacBio)
-k INT k-mer size (no larger than 28) [15]
-w INT minimizer window size [10]
-I NUM split index for every ~NUM input bases [8G]
-d FILE dump index to FILE []
Mapping:
-f FLOAT filter out top FLOAT fraction of repetitive minimizers [0.0002]
-g NUM stop chain enlongation if there are no minimizers in INT-bp [5000]
-G NUM max intron length (effective with -xsplice; changing -r) [200k]
-F NUM max fragment length (effective with -xsr or in the fragment mode) [800]
-r NUM[,NUM] chaining/alignment bandwidth and long-join bandwidth [500,20000]
-n INT minimal number of minimizers on a chain [3]
-m INT minimal chaining score (matching bases minus log gap penalty) [40]
-X skip self and dual mappings (for the all-vs-all mode)
-p FLOAT min secondary-to-primary score ratio [0.8]
-N INT retain at most INT secondary alignments [5]
Alignment:
-A INT matching score [2]
-B INT mismatch penalty (larger value for lower divergence) [4]
-O INT[,INT] gap open penalty [4,24]
-E INT[,INT] gap extension penalty; a k-long gap costs min{O1+k*E1,O2+k*E2} [2,1]
-z INT[,INT] Z-drop score and inversion Z-drop score [400,200]
-s INT minimal peak DP alignment score [80]
-u CHAR how to find GT-AG. f:transcript strand, b:both strands, n:don't match GT-AG [n]
-J INT splice mode. 0: original minimap2 model; 1: miniprot model [1]
Input/Output:
-a output in the SAM format (PAF by default)
-o FILE output alignments to FILE [stdout]
-L write CIGAR with >65535 ops at the CG tag
-R STR SAM read group line in a format like '@RG\tID:foo\tSM:bar' []
-c output CIGAR in PAF
--cs[=STR] output the cs tag; STR is 'short' (if absent) or 'long' [none]
--ds output the ds tag, which is an extension to cs
--MD output the MD tag
--eqx write =/X CIGAR operators
-Y use soft clipping for supplementary alignments
-t INT number of threads [3]
-K NUM minibatch size for mapping [500M]
--version show version number
Preset:
-x STR preset (always applied before other options; see minimap2.1 for details) []
- lr:hq - accurate long reads (error rate <1%) against a reference genome
- splice/splice:hq - spliced alignment for long reads/accurate long reads
- asm5/asm10/asm20 - asm-to-ref mapping, for ~0.1/1/5% sequence divergence
- sr - short reads against a reference
- map-pb/map-hifi/map-ont/map-iclr - CLR/HiFi/Nanopore/ICLR vs reference mapping
- ava-pb/ava-ont - PacBio CLR/Nanopore read overlap
See 'man ./minimap2.1' for detailed description of these and other advanced command-line options.
运行程序
minimap2 -t 20 -c ref.fa query.fa > alignment.c.paf # 基因组不大的话,不用建立索引,可直接比对
# -t 线程数,可根据电脑配置进行调整
# -c output CIGAR in PAF
运行过程
[M::mm_idx_gen::7.366*1.69] collected minimizers
[M::mm_idx_gen::7.967*2.76] sorted minimizers
[M::main::7.967*2.76] loaded/built the index for 140 target sequence(s)
[M::mm_mapopt_update::8.825*2.59] mid_occ = 242
[M::mm_idx_stat] kmer size: 15; skip: 10; is_hpc: 0; #seq: 140
[M::mm_idx_stat::9.341*2.50] distinct minimizers: 29349971 (75.58% are singletons); average occurrences: 1.800; average spacing: 5.340; total length: 282161110
[M::worker_pipeline::267.293*7.42] mapped 25 sequences
[M::main] Version: 2.28-r1209
[M::main] CMD: minimap2 -t 20 -c ref.fa query.fa > alignment.c.paf
[M::main] Real time: 267.437 sec; CPU: 1983.772 sec; Peak RSS: 19.358 GB
接下来,让我们用TBtools中的PAF Viz
插件,对上面产生的“alignment.c.paf”文件进行快速的可视化。
PAF Viz插件的界面如下: