TBtools|对minimap2生成的paf文件进行可视化

文摘   科学   2024-09-22 08:45   广东  
作为著名的生信大佬,李恒教授在开发了比对工具bwa之后,在2018年推出了又一力作minimap2,其主要功能就是将测序得到的DNA或者RNA序列快速比对到参考基因组上。

bwa主要适用于illumina等短读长序列的比对,而minimap2则是专门针对PacBio或Oxford Nanopore等三代测序长读长数据开发的软件。实际上,minimap2 可应用于多种比对场景,包括对二代数据进行比对,其输出结果格式包括pafsam格式,默认为sam格式。

在本文中,不对minimap2的算法原理进行解析,仅以“基因组组装之间的比对”为例,展示其易用性和paf文件的可视化。

软件安装与测试

conda install -c bioconda minimap2 # conda安装是方法之一minimap2 -h # 调出该软件的帮助信息
Usage: minimap2 [options] <target.fa>|<target.idx> [query.fa] [...]Options:Indexing:      -H           use homopolymer-compressed k-mer (preferrable for PacBio)      -k INT       k-mer size (no larger than 28) [15]      -w INT       minimizer window size [10]      -I NUM       split index for every ~NUM input bases [8G]      -d FILE      dump index to FILE []Mapping:      -f FLOAT     filter out top FLOAT fraction of repetitive minimizers [0.0002]      -g NUM       stop chain enlongation if there are no minimizers in INT-bp [5000]      -G NUM       max intron length (effective with -xsplice; changing -r) [200k]      -F NUM       max fragment length (effective with -xsr or in the fragment mode) [800]      -r NUM[,NUM] chaining/alignment bandwidth and long-join bandwidth [500,20000]      -n INT       minimal number of minimizers on a chain [3]      -m INT       minimal chaining score (matching bases minus log gap penalty) [40]      -X           skip self and dual mappings (for the all-vs-all mode)      -p FLOAT     min secondary-to-primary score ratio [0.8]      -N INT       retain at most INT secondary alignments [5]Alignment:      -A INT       matching score [2]      -B INT       mismatch penalty (larger value for lower divergence) [4]      -O INT[,INT] gap open penalty [4,24]      -E INT[,INT] gap extension penalty; a k-long gap costs min{O1+k*E1,O2+k*E2} [2,1]      -z INT[,INT] Z-drop score and inversion Z-drop score [400,200]      -s INT       minimal peak DP alignment score [80]      -u CHAR      how to find GT-AG. f:transcript strand, b:both strands, n:don't match GT-AG [n]      -J INT       splice mode. 0: original minimap2 model; 1: miniprot model [1]Input/Output:      -a           output in the SAM format (PAF by default)      -o FILE      output alignments to FILE [stdout]      -L           write CIGAR with >65535 ops at the CG tag      -R STR       SAM read group line in a format like '@RG\tID:foo\tSM:bar' []      -c           output CIGAR in PAF      --cs[=STR]   output the cs tag; STR is 'short' (if absent) or 'long' [none]      --ds         output the ds tag, which is an extension to cs      --MD         output the MD tag      --eqx        write =/X CIGAR operators      -Y           use soft clipping for supplementary alignments      -t INT       number of threads [3]      -K NUM       minibatch size for mapping [500M]      --version    show version numberPreset:      -x STR       preset (always applied before other options; see minimap2.1 for details) []                   - lr:hq - accurate long reads (error rate <1%) against a reference genome                   - splice/splice:hq - spliced alignment for long reads/accurate long reads                   - asm5/asm10/asm20 - asm-to-ref mapping, for ~0.1/1/5% sequence divergence                   - sr - short reads against a reference                   - map-pb/map-hifi/map-ont/map-iclr - CLR/HiFi/Nanopore/ICLR vs reference mapping                   - ava-pb/ava-ont - PacBio CLR/Nanopore read overlap
See 'man ./minimap2.1' for detailed description of these and other advanced command-line options.

运行程序

minimap2 -t 20 -c ref.fa query.fa > alignment.c.paf # 基因组不大的话,不用建立索引,可直接比对
# -t 线程数,可根据电脑配置进行调整# -c output CIGAR in PAF

运行过程

[M::mm_idx_gen::7.366*1.69] collected minimizers    [M::mm_idx_gen::7.967*2.76] sorted minimizers    [M::main::7.967*2.76] loaded/built the index for 140 target sequence(s)    [M::mm_mapopt_update::8.825*2.59] mid_occ = 242    [M::mm_idx_stat] kmer size: 15; skip: 10; is_hpc: 0; #seq: 140    [M::mm_idx_stat::9.341*2.50] distinct minimizers: 29349971 (75.58% are singletons); average occurrences: 1.800; average spacing: 5.340; total length: 282161110    [M::worker_pipeline::267.293*7.42] mapped 25 sequences    [M::main] Version: 2.28-r1209    [M::main] CMD: minimap2 -t 20 -c ref.fa query.fa > alignment.c.paf    [M::main] Real time: 267.437 sec; CPU: 1983.772 sec; Peak RSS: 19.358 GB

接下来,让我们用TBtools中的PAF Viz插件,对上面产生的“alignment.c.paf”文件进行快速的可视化。

打开TBtools之后,点击Graphics选项卡,选择Comparative Genomics里的PAF Viz。

PAF Viz插件的界面如下:

然后,选择或拖入自己的paf文件,并根据自己的需要,调整显示参数。

最后,点击“Save Graph”,设置输出格式和分辨率,可保存该比对图像。

植信矿工
专注于分享植物方向的最新学术成果、前沿知识和技术进步,以及实践优化过的生信软件、脚本和流程。
 最新文章