SeqKit2 | An ultrafast, all-purpose sequence processing toolkit (using reverse complement as an example)

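Digest   Science   2024-10-31 15:04   Guangdong
On April 5, 2024, iMeta published a short communication on a bioinformatics tool, "SeqKit2: A Swiss army knife for sequence and alignment processing". The paper releases SeqKit2, an upgraded version of the widely used sequence analysis tool SeqKit, with more features, higher performance, and support for more compression formats.
SeqKit2 extends SeqKit's functionality: the number of subcommands grows from 19 to 38, and three more compressed file formats are supported. SeqKit2 also improves the user experience with new features such as autocompletion, progress bars, and enhanced error handling.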

Installation and testing

conda install -c bioconda seqkit # the official site provides 5 other installation methods
seqkit -h # show the program's help message
SeqKit -- a cross-platform and ultrafast toolkit for FASTA/Q file manipulation
Version: 2.8.2
Author: Wei Shen <shenwei356@gmail.com>
Documents  : http://bioinf.shenwei.me/seqkit
Source code: https://github.com/shenwei356/seqkit
Please cite: 1. https://doi.org/10.1002/imt2.191
             2. https://doi.org/10.1371/journal.pone.0163962

Seqkit utilizes the pgzip (https://github.com/klauspost/pgzip) package to read and write gzip file, and the outputted gzip file would be slightly larger than files generated by GNU gzip.
Seqkit writes gzip files very fast, much faster than the multi-threaded pigz, therefore there's no need to pipe the result to gzip/pigz.
Seqkit also supports reading and writing xz (.xz) and zstd (.zst) formats since v2.2.0. Bzip2 format is supported since v2.4.0.

Compression level:
  format   range   default  comment
  gzip     1-9     5        https://github.com/klauspost/pgzip sets 5 as the default value.
  xz       NA      NA       https://github.com/ulikunitz/xz does not support.
  zstd     1-4     2        roughly equals to zstd 1, 3, 7, 11, respectively.
  bzip     1-9     6        https://github.com/dsnet/compress

Usage:
  seqkit [command]

Commands for Basic Operation:
  faidx           create the FASTA index file and extract subsequences
  scat            real time recursive concatenation and streaming of fastx files
  seq             transform sequences (extract ID, filter by length, remove gaps, reverse complement...)
  sliding         extract subsequences in sliding windows
  stats           simple statistics of FASTA/Q files
  subseq          get subsequences by region/gtf/bed, including flanking sequences
  translate       translate DNA/RNA to protein sequence (supporting ambiguous bases)
  watch           monitoring and online histograms of sequence features

Commands for Format Conversion:
  convert         convert FASTQ quality encoding between Sanger, Solexa and Illumina
  fa2fq           retrieve corresponding FASTQ records by a FASTA file
  fq2fa           convert FASTQ to FASTA
  fx2tab          convert FASTA/Q to tabular format (and length, GC content, average quality...)
  tab2fx          convert tabular format to FASTA/Q format

Commands for Searching:
  amplicon        extract amplicon (or specific region around it) via primer(s)
  fish            look for short sequences in larger sequences using local alignment
  grep            search sequences by ID/name/sequence/sequence motifs, mismatch allowed
  locate          locate subsequences/motifs, mismatch allowed

Commands for Set Operation:
  common          find common/shared sequences of multiple files by id/name/sequence
  duplicate       duplicate sequences N times
  head            print first N FASTA/Q records
  head-genome     print sequences of the first genome with common prefixes in name
  pair            match up paired-end reads from two fastq files
  range           print FASTA/Q records in a range (start:end)
  rmdup           remove duplicated sequences by ID/name/sequence
  sample          sample sequences by number or proportion
  split           split sequences into files by id/seq region/size/parts (mainly for FASTA)
  split2          split sequences into files by size/parts (FASTA, PE/SE FASTQ)

Commands for Edit:
  concat          concatenate sequences with the same ID from multiple files
  mutate          edit sequence (point mutation, insertion, deletion)
  rename          rename duplicated IDs
  replace         replace name/sequence by regular expression
  restart         reset start position for circular genome
  sana            sanitize broken single line FASTQ files

Commands for Ordering:
  shuffle         shuffle sequences
  sort            sort sequences by id/name/sequence/length

Commands for BAM Processing:
  bam             monitoring and online histograms of BAM record features

Commands for Miscellaneous:
  merge-slides    merge sliding windows generated from seqkit sliding
  sum             compute message digest for all sequences in FASTA/Q files

Additional Commands:
  genautocomplete generate shell autocompletion script (bash|zsh|fish|powershell)
  version         print version information and check for update

Flags:
      --alphabet-guess-seq-length int   length of sequence prefix of the first FASTA record based on which seqkit guesses the sequence type (0 for whole seq) (default 10000)
      --compress-level int              compression level for gzip, zstd, xz and bzip2. type "seqkit -h" for the range and default value for each format (default -1)
  -h, --help                            help for seqkit
      --id-ncbi                         FASTA head is NCBI-style, e.g. >gi|110645304|ref|NC_002516.2| Pseud...
      --id-regexp string                regular expression for parsing ID (default "^(\\S+)\\s?")
  -X, --infile-list string              file of input files list (one file per line), if given, they are appended to files from cli arguments
  -w, --line-width int                  line width when outputting FASTA format (0 for no wrap) (default 60)
  -o, --out-file string                 out file ("-" for stdout, suffix .gz for gzipped out) (default "-")
      --quiet                           be quiet and do not show extra information
  -t, --seq-type string                 sequence type (dna|rna|protein|unlimit|auto) (for auto, it automatically detect by the first sequence) (default "auto")
  -j, --threads int                     number of CPUs. can also set with environment variable SEQKIT_THREADS) (default 4)

Use "seqkit [command] --help" for more information about a command.
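As a small illustration of the global flags in the help text above, the sketch below writes gzip-compressed output with -o and sets the CPU count with -j. The file names and the one-record FASTA are made up for the demonstration, and the seqkit calls are skipped if seqkit is not on the PATH.

```shell
# Create a throwaway working directory and a tiny made-up FASTA file.
work_dir=$(mktemp -d)
printf '>chr_demo\nACGTACGTAC\n' > "$work_dir/in.fa"

if command -v seqkit >/dev/null 2>&1; then
    # A .gz suffix on -o requests gzip-compressed output; -j sets the number of CPUs.
    seqkit seq "$work_dir/in.fa" -o "$work_dir/out.fa.gz" -j 2
    # seqkit reads gzip input directly, so stats can summarize the compressed file.
    seqkit stats "$work_dir/out.fa.gz"
else
    echo "seqkit not found; install it first" >&2
fi
```

Because seqkit compresses on its own, there should be no need to pipe the result through gzip or pigz, as the help text notes.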

Running the program

seqkit seq -h # show the help message for the seq command
transform sequences (extract ID, filter by length, remove gaps, reverse complement...)

Usage:
  seqkit seq [flags]

Flags:
  -k, --color                     colorize sequences - to be piped into "less -R"
  -p, --complement                complement sequence, flag '-v' is recommended to switch on
      --dna2rna                   DNA to RNA
  -G, --gap-letters string        gap letters to be removed with -g/--remove-gaps (default "- \t.")
  -h, --help                      help for seq
  -l, --lower-case                print sequences in lower case
  -M, --max-len int               only print sequences shorter than or equal to the maximum length (-1 for no limit) (default -1)
  -R, --max-qual float            only print sequences with average quality less than this limit (-1 for no limit) (default -1)
  -m, --min-len int               only print sequences longer than or equal to the minimum length (-1 for no limit) (default -1)
  -Q, --min-qual float            only print sequences with average quality greater or equal than this limit (-1 for no limit) (default -1)
  -n, --name                      only print names/sequence headers
  -i, --only-id                   print IDs instead of full headers
  -q, --qual                      only print qualities
  -b, --qual-ascii-base int       ASCII BASE, 33 for Phred+33 (default 33)
  -g, --remove-gaps               remove gaps letters set by -G/--gap-letters, e.g., spaces, tabs, and dashes (gaps "-" in aligned sequences)
  -r, --reverse                   reverse sequence
      --rna2dna                   RNA to DNA
  -s, --seq                       only print sequences
  -u, --upper-case                print sequences in upper case
  -v, --validate-seq              validate bases according to the alphabet

Global Flags:
      --alphabet-guess-seq-length int   length of sequence prefix of the first FASTA record based on which seqkit guesses the sequence type (0 for whole seq) (default 10000)
      --compress-level int              compression level for gzip, zstd, xz and bzip2. type "seqkit -h" for the range and default value for each format (default -1)
      --id-ncbi                         FASTA head is NCBI-style, e.g. >gi|110645304|ref|NC_002516.2| Pseud...
      --id-regexp string                regular expression for parsing ID (default "^(\\S+)\\s?")
  -X, --infile-list string              file of input files list (one file per line), if given, they are appended to files from cli arguments
  -w, --line-width int                  line width when outputting FASTA format (0 for no wrap) (default 60)
  -o, --out-file string                 out file ("-" for stdout, suffix .gz for gzipped out) (default "-")
      --quiet                           be quiet and do not show extra information
  -t, --seq-type string                 sequence type (dna|rna|protein|unlimit|auto) (for auto, it automatically detect by the first sequence) (default "auto")
  -j, --threads int                     number of CPUs. can also set with environment variable SEQKIT_THREADS) (default 4)

Take the reverse complement of a chromosome sequence:

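seqkit seq a.fa -p -r -w 0 -v > b.fa # -w 0 turns off line wrapping (default width 60) and -v validates the bases; both are optional, and the sequence type is detected automatically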
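To make clear what -r together with -p computes, here is a minimal sketch that reverse-complements a single DNA string using only rev and tr. The revcomp function is a hypothetical helper for illustration, not part of SeqKit. It handles only A/C/G/T (upper or lower case) and leaves any other character, such as N, unchanged.

```shell
# Hypothetical helper (not part of SeqKit): reverse-complement one DNA string.
# rev reverses the characters; tr swaps each base for its complement.
revcomp() {
    printf '%s\n' "$1" | rev | tr 'ACGTacgt' 'TGCAtgca'
}

revcomp "AACGT"   # prints ACGTT
```

For real FASTA files, seqkit seq remains the better choice, since it processes whole records and can validate the alphabet with -v.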

The other commands in the toolkit are used the same way. First run "seqkit [command] -h" to see how a command is used, then pick the flags you need and combine them.
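As a sketch of combining flags, the example below uses only flags listed in the seq help text above (-m, -u, -w, -n, -i) on a tiny FASTA file. The file name and sequences are made up for the demonstration, and the seqkit calls are skipped if seqkit is not on the PATH.

```shell
# Build a tiny two-record FASTA file for the demonstration (made-up sequences).
demo_dir=$(mktemp -d)
printf '>seq1 short\nacgt\n>seq2 longer\nacgtacgtac\n' > "$demo_dir/demo.fa"

if command -v seqkit >/dev/null 2>&1; then
    # Keep records of at least 5 bases (-m 5), print them in upper case (-u), unwrapped (-w 0).
    seqkit seq "$demo_dir/demo.fa" -m 5 -u -w 0
    # Print only the record IDs: -n selects the headers, -i trims them to the ID.
    seqkit seq "$demo_dir/demo.fa" -n -i
else
    echo "seqkit not found; install it first" >&2
fi
```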

植信矿工
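Focused on sharing the latest academic results, frontier knowledge, and technical advances in plant science, along with bioinformatics software, scripts, and pipelines that have been optimized in practice.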