Software installation and testing
conda install -c bioconda seqkit # the official documentation lists five other installation methods
seqkit -h # show the program's help message
SeqKit -- a cross-platform and ultrafast toolkit for FASTA/Q file manipulation
Version: 2.8.2
Author: Wei Shen <shenwei356@gmail.com>
Documents : http://bioinf.shenwei.me/seqkit
Source code: https://github.com/shenwei356/seqkit
Please cite:
1. https://doi.org/10.1002/imt2.191
2. https://doi.org/10.1371/journal.pone.0163962
Seqkit utlizies the pgzip (https://github.com/klauspost/pgzip) package to
read and write gzip file, and the outputted gzip file would be slighty
larger than files generated by GNU gzip.
Seqkit writes gzip files very fast, much faster than the multi-threaded pigz,
therefore there's no need to pipe the result to gzip/pigz.
Seqkit also supports reading and writing xz (.xz) and zstd (.zst) formats since v2.2.0.
Bzip2 format is supported since v2.4.0.
Compression level:
format   range   default   comment
gzip     1-9     5         https://github.com/klauspost/pgzip sets 5 as the default value.
xz       NA      NA        https://github.com/ulikunitz/xz does not support.
zstd     1-4     2         roughly equals to zstd 1, 3, 7, 11, respectively.
bzip     1-9     6         https://github.com/dsnet/compress
Usage:
seqkit [command]
Commands for Basic Operation:
faidx create the FASTA index file and extract subsequences
scat real time recursive concatenation and streaming of fastx files
seq transform sequences (extract ID, filter by length, remove gaps, reverse complement...)
sliding extract subsequences in sliding windows
stats simple statistics of FASTA/Q files
subseq get subsequences by region/gtf/bed, including flanking sequences
translate translate DNA/RNA to protein sequence (supporting ambiguous bases)
watch monitoring and online histograms of sequence features
Commands for Format Conversion:
convert convert FASTQ quality encoding between Sanger, Solexa and Illumina
fa2fq retrieve corresponding FASTQ records by a FASTA file
fq2fa convert FASTQ to FASTA
fx2tab convert FASTA/Q to tabular format (and length, GC content, average quality...)
tab2fx convert tabular format to FASTA/Q format
Commands for Searching:
amplicon extract amplicon (or specific region around it) via primer(s)
fish look for short sequences in larger sequences using local alignment
grep search sequences by ID/name/sequence/sequence motifs, mismatch allowed
locate locate subsequences/motifs, mismatch allowed
Commands for Set Operation:
common find common/shared sequences of multiple files by id/name/sequence
duplicate duplicate sequences N times
head print first N FASTA/Q records
head-genome print sequences of the first genome with common prefixes in name
pair match up paired-end reads from two fastq files
range print FASTA/Q records in a range (start:end)
rmdup remove duplicated sequences by ID/name/sequence
sample sample sequences by number or proportion
split split sequences into files by id/seq region/size/parts (mainly for FASTA)
split2 split sequences into files by size/parts (FASTA, PE/SE FASTQ)
Commands for Edit:
concat concatenate sequences with the same ID from multiple files
mutate edit sequence (point mutation, insertion, deletion)
rename rename duplicated IDs
replace replace name/sequence by regular expression
restart reset start position for circular genome
sana sanitize broken single line FASTQ files
Commands for Ordering:
shuffle shuffle sequences
sort sort sequences by id/name/sequence/length
Commands for BAM Processing:
bam monitoring and online histograms of BAM record features
Commands for Miscellaneous:
merge-slides merge sliding windows generated from seqkit sliding
sum compute message digest for all sequences in FASTA/Q files
Additional Commands:
genautocomplete generate shell autocompletion script (bash|zsh|fish|powershell)
version print version information and check for update
Flags:
--alphabet-guess-seq-length int length of sequence prefix of the first FASTA record based on
which seqkit guesses the sequence type (0 for whole seq)
(default 10000)
--compress-level int compression level for gzip, zstd, xz and bzip2. type "seqkit -h"
for the range and default value for each format (default -1)
-h, --help help for seqkit
--id-ncbi FASTA head is NCBI-style, e.g. >gi|110645304|ref|NC_002516.2|
Pseud...
--id-regexp string regular expression for parsing ID (default "^(\\S+)\\s?")
-X, --infile-list string file of input files list (one file per line), if given, they are
appended to files from cli arguments
-w, --line-width int line width when outputting FASTA format (0 for no wrap) (default 60)
-o, --out-file string out file ("-" for stdout, suffix .gz for gzipped out) (default "-")
--quiet be quiet and do not show extra information
-t, --seq-type string sequence type (dna|rna|protein|unlimit|auto) (for auto, it
automatically detect by the first sequence) (default "auto")
-j, --threads int number of CPUs. can also set with environment variable
SEQKIT_THREADS) (default 4)
Use "seqkit [command] --help" for more information about a command.
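The help text above says that seqkit compresses its own output when the output name ends in .gz, and that the thread count can be set with -j or with the SEQKIT_THREADS environment variable. The following is a minimal sketch of those global flags, assuming seqkit is on the PATH. The file names toy.fa and toy.fa.gz and the sequences in them are made up for illustration, and the seqkit calls are skipped if the program is not found.

```shell
#!/bin/sh
# Toy input (hypothetical data, only for illustration).
printf '>seq1 demo record\nACGTACGTAC\n>seq2\nGGGCCC\n' > toy.fa

if command -v seqkit >/dev/null 2>&1; then
    # Print the installed version ("version" is listed in the help above).
    seqkit version

    # Write gzip-compressed output directly: the .gz suffix on -o turns on
    # compression, so no pipe to gzip/pigz is needed. -j sets the thread count.
    seqkit seq toy.fa -j 2 -o toy.fa.gz

    # The same thread setting through the environment variable named in the help.
    SEQKIT_THREADS=2 seqkit stats toy.fa.gz
else
    echo "seqkit not found on PATH; skipping the seqkit calls" >&2
fi
```

On a small file like this the thread count makes no visible difference; it is shown only to illustrate where the flag goes.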
Running the program
seqkit seq -h # show the help message for the seq command
transform sequences (extract ID, filter by length, remove gaps, reverse complement...)
Usage:
seqkit seq [flags]
Flags:
-k, --color colorize sequences - to be piped into "less -R"
-p, --complement complement sequence, flag '-v' is recommended to switch on
--dna2rna DNA to RNA
-G, --gap-letters string gap letters to be removed with -g/--remove-gaps (default "- \t.")
-h, --help help for seq
-l, --lower-case print sequences in lower case
-M, --max-len int only print sequences shorter than or equal to the maximum length (-1 for
no limit) (default -1)
-R, --max-qual float only print sequences with average quality less than this limit (-1 for no
limit) (default -1)
-m, --min-len int only print sequences longer than or equal to the minimum length (-1 for no
limit) (default -1)
-Q, --min-qual float only print sequences with average quality greater or equal than this limit
(-1 for no limit) (default -1)
-n, --name only print names/sequence headers
-i, --only-id print IDs instead of full headers
-q, --qual only print qualities
-b, --qual-ascii-base int ASCII BASE, 33 for Phred+33 (default 33)
-g, --remove-gaps remove gaps letters set by -G/--gap-letters, e.g., spaces, tabs, and
dashes (gaps "-" in aligned sequences)
-r, --reverse reverse sequence
--rna2dna RNA to DNA
-s, --seq only print sequences
-u, --upper-case print sequences in upper case
-v, --validate-seq validate bases according to the alphabet
Global Flags:
--alphabet-guess-seq-length int length of sequence prefix of the first FASTA record based on
which seqkit guesses the sequence type (0 for whole seq)
(default 10000)
--compress-level int compression level for gzip, zstd, xz and bzip2. type "seqkit -h"
for the range and default value for each format (default -1)
--id-ncbi FASTA head is NCBI-style, e.g. >gi|110645304|ref|NC_002516.2|
Pseud...
--id-regexp string regular expression for parsing ID (default "^(\\S+)\\s?")
-X, --infile-list string file of input files list (one file per line), if given, they are
appended to files from cli arguments
-w, --line-width int line width when outputting FASTA format (0 for no wrap) (default 60)
-o, --out-file string out file ("-" for stdout, suffix .gz for gzipped out) (default "-")
--quiet be quiet and do not show extra information
-t, --seq-type string sequence type (dna|rna|protein|unlimit|auto) (for auto, it
automatically detect by the first sequence) (default "auto")
-j, --threads int number of CPUs. can also set with environment variable
SEQKIT_THREADS) (default 4)
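Get the reverse complement of the chromosome sequences:
seqkit seq a.fa -p -r -w 0 -v > b.fa # -w 0 (no line wrapping) and -v (validate bases) are optional; seqkit detects the sequence type automatically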
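To see what -r and -p do, it can help to run the same flags on a tiny file and compare the result with a cross-check built from standard tools. This is only a sketch: a_demo.fa, b_demo.fa, and expected_rc.txt are hypothetical file names, the sequence is toy data, and the rev/tr shortcut is not how seqkit works internally. It only holds for unwrapped, upper-case sequences without ambiguity codes.

```shell
#!/bin/sh
# Toy single-record FASTA (hypothetical data; the article's a.fa is not used here).
printf '>chr_demo\nAAACCCGT\n' > a_demo.fa

# Cross-check with standard tools: reverse the sequence line,
# then swap A<->T and C<->G to complement it.
grep -v '^>' a_demo.fa | rev | tr 'ACGT' 'TGCA' > expected_rc.txt
cat expected_rc.txt   # ACGGGTTT

if command -v seqkit >/dev/null 2>&1; then
    # Same flags as the command above: -r reverse, -p complement,
    # -w 0 no line wrapping, -v validate bases.
    seqkit seq a_demo.fa -r -p -w 0 -v > b_demo.fa
    # Compare the sequence line of the seqkit result with the cross-check.
    grep -v '^>' b_demo.fa | diff - expected_rc.txt && echo "seqkit output matches"
else
    echo "seqkit not found on PATH; skipping the seqkit call" >&2
fi
```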
The other commands in this toolkit are used the same way. First run "seqkit [command] -h" to see a command's usage, then pick the flags you need and combine them.
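As an illustration of that workflow, the sketch below combines several flags taken from the seq help shown earlier (-m, -n, -i, -u, -w). The file mix.fa and its records are hypothetical, and the awk line is only an independent cross-check that assumes one sequence line per record. The seqkit calls are skipped if the program is not installed.

```shell
#!/bin/sh
# Toy input with three records of different lengths (hypothetical data).
printf '>s1 first\nACGT\n>s2 second\nacgtacgtac\n>s3 third\nGGGCCCAA\n' > mix.fa

# Cross-check with awk: IDs (first word of the header, without ">")
# of records whose sequence is at least 8 letters long.
awk '/^>/ { id = substr($1, 2); next } length($0) >= 8 { print id }' mix.fa > long_ids.txt
cat long_ids.txt   # prints s2 and s3, one per line

if command -v seqkit >/dev/null 2>&1; then
    # Step 1: read the help of the command you are interested in.
    seqkit seq -h | head -n 5
    # Step 2: combine flags from that help. -m 8 keeps sequences of length >= 8,
    # -n prints only headers, -i reduces each header to its ID.
    seqkit seq -m 8 -n -i mix.fa
    # Another combination: keep the long sequences, upper-case them, no wrapping.
    seqkit seq -m 8 -u -w 0 mix.fa
else
    echo "seqkit not found on PATH; skipping the seqkit calls" >&2
fi
```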