SeqKit2 | An ultrafast, all-purpose sequence processing toolkit (using reverse complement as the example)

Digest / Science  2024-10-31 15:04  Guangdong

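On April 5, 2024, iMeta published a short communication about a bioinformatics tool, "SeqKit2: A Swiss army knife for sequence and alignment processing". The paper releases SeqKit2, the upgraded version of the widely used sequence analysis toolkit SeqKit, with more features, higher performance, and support for more compression formats.

SeqKit2 extends SeqKit: the number of subcommands grew from 19 to 38, and three more compressed file formats are supported. SeqKit2 also improves the user experience with new features such as shell autocompletion, progress bars, and enhanced error handling.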

Installation and testing

conda install -c bioconda seqkit  # the official documentation lists five other installation methods
seqkit -h  # show the program's help message

SeqKit -- a cross-platform and ultrafast toolkit for FASTA/Q file manipulation

Version: 2.8.2

Author: Wei Shen <shenwei356@gmail.com>

Documents  : http://bioinf.shenwei.me/seqkit
Source code: https://github.com/shenwei356/seqkit
Please cite:
  1. https://doi.org/10.1002/imt2.191
  2. https://doi.org/10.1371/journal.pone.0163962

Seqkit utlizies the pgzip (https://github.com/klauspost/pgzip) package to
read and write gzip file, and the outputted gzip file would be slighty
larger than files generated by GNU gzip.

Seqkit writes gzip files very fast, much faster than the multi-threaded pigz,
therefore there's no need to pipe the result to gzip/pigz.

Seqkit also supports reading and writing xz (.xz) and zstd (.zst) formats since v2.2.0.
Bzip2 format is supported since v2.4.0.

Compression level:
  format   range   default  comment
  gzip     1-9     5        https://github.com/klauspost/pgzip sets 5 as the default value.
  xz       NA      NA       https://github.com/ulikunitz/xz does not support.
  zstd     1-4     2        roughly equals to zstd 1, 3, 7, 11, respectively.
  bzip     1-9     6        https://github.com/dsnet/compress

Usage:
  seqkit [command]

Commands for Basic Operation:
  faidx           create the FASTA index file and extract subsequences
  scat            real time recursive concatenation and streaming of fastx files
  seq             transform sequences (extract ID, filter by length, remove gaps, reverse complement...)
  sliding         extract subsequences in sliding windows
  stats           simple statistics of FASTA/Q files
  subseq          get subsequences by region/gtf/bed, including flanking sequences
  translate       translate DNA/RNA to protein sequence (supporting ambiguous bases)
  watch           monitoring and online histograms of sequence features

Commands for Format Conversion:
  convert         convert FASTQ quality encoding between Sanger, Solexa and Illumina
  fa2fq           retrieve corresponding FASTQ records by a FASTA file
  fq2fa           convert FASTQ to FASTA
  fx2tab          convert FASTA/Q to tabular format (and length, GC content, average quality...)
  tab2fx          convert tabular format to FASTA/Q format

Commands for Searching:
  amplicon        extract amplicon (or specific region around it) via primer(s)
  fish            look for short sequences in larger sequences using local alignment
  grep            search sequences by ID/name/sequence/sequence motifs, mismatch allowed
  locate          locate subsequences/motifs, mismatch allowed

Commands for Set Operation:
  common          find common/shared sequences of multiple files by id/name/sequence
  duplicate       duplicate sequences N times
  head            print first N FASTA/Q records
  head-genome     print sequences of the first genome with common prefixes in name
  pair            match up paired-end reads from two fastq files
  range           print FASTA/Q records in a range (start:end)
  rmdup           remove duplicated sequences by ID/name/sequence
  sample          sample sequences by number or proportion
  split           split sequences into files by id/seq region/size/parts (mainly for FASTA)
  split2          split sequences into files by size/parts (FASTA, PE/SE FASTQ)

Commands for Edit:
  concat          concatenate sequences with the same ID from multiple files
  mutate          edit sequence (point mutation, insertion, deletion)
  rename          rename duplicated IDs
  replace         replace name/sequence by regular expression
  restart         reset start position for circular genome
  sana            sanitize broken single line FASTQ files

Commands for Ordering:
  shuffle         shuffle sequences
  sort            sort sequences by id/name/sequence/length

Commands for BAM Processing:
  bam             monitoring and online histograms of BAM record features

Commands for Miscellaneous:
  merge-slides    merge sliding windows generated from seqkit sliding
  sum             compute message digest for all sequences in FASTA/Q files

Additional Commands:
  genautocomplete generate shell autocompletion script (bash|zsh|fish|powershell)
  version         print version information and check for update

Flags:
      --alphabet-guess-seq-length int   length of sequence prefix of the first FASTA record based on
                                        which seqkit guesses the sequence type (0 for whole seq)
                                        (default 10000)
      --compress-level int              compression level for gzip, zstd, xz and bzip2. type "seqkit -h"
                                        for the range and default value for each format (default -1)
  -h, --help                            help for seqkit
      --id-ncbi                         FASTA head is NCBI-style, e.g. >gi|110645304|ref|NC_002516.2|
                                        Pseud...
      --id-regexp string                regular expression for parsing ID (default "^(\\S+)\\s?")
  -X, --infile-list string              file of input files list (one file per line), if given, they are
                                        appended to files from cli arguments
  -w, --line-width int                  line width when outputting FASTA format (0 for no wrap) (default 60)
  -o, --out-file string                 out file ("-" for stdout, suffix .gz for gzipped out) (default "-")
      --quiet                           be quiet and do not show extra information
  -t, --seq-type string                 sequence type (dna|rna|protein|unlimit|auto) (for auto, it
                                        automatically detect by the first sequence) (default "auto")
  -j, --threads int                     number of CPUs. can also set with environment variable
                                        SEQKIT_THREADS) (default 4)

Use "seqkit [command] --help" for more information about a command.

Running the program

seqkit seq -h  # show the help message for the seq command

transform sequences (extract ID, filter by length, remove gaps, reverse complement...)

Usage:
  seqkit seq [flags]

Flags:
  -k, --color                 colorize sequences - to be piped into "less -R"
  -p, --complement            complement sequence, flag '-v' is recommended to switch on
      --dna2rna               DNA to RNA
  -G, --gap-letters string    gap letters to be removed with -g/--remove-gaps (default "- \t.")
  -h, --help                  help for seq
  -l, --lower-case            print sequences in lower case
  -M, --max-len int           only print sequences shorter than or equal to the maximum length (-1 for
                              no limit) (default -1)
  -R, --max-qual float        only print sequences with average quality less than this limit (-1 for no
                              limit) (default -1)
  -m, --min-len int           only print sequences longer than or equal to the minimum length (-1 for no
                              limit) (default -1)
  -Q, --min-qual float        only print sequences with average quality greater or equal than this limit
                              (-1 for no limit) (default -1)
  -n, --name                  only print names/sequence headers
  -i, --only-id               print IDs instead of full headers
  -q, --qual                  only print qualities
  -b, --qual-ascii-base int   ASCII BASE, 33 for Phred+33 (default 33)
  -g, --remove-gaps           remove gaps letters set by -G/--gap-letters, e.g., spaces, tabs, and
                              dashes (gaps "-" in aligned sequences)
  -r, --reverse               reverse sequence
      --rna2dna               RNA to DNA
  -s, --seq                   only print sequences
  -u, --upper-case            print sequences in upper case
  -v, --validate-seq          validate bases according to the alphabet

Global Flags:
      --alphabet-guess-seq-length int   length of sequence prefix of the first FASTA record based on
                                        which seqkit guesses the sequence type (0 for whole seq)
                                        (default 10000)
      --compress-level int              compression level for gzip, zstd, xz and bzip2. type "seqkit -h"
                                        for the range and default value for each format (default -1)
      --id-ncbi                         FASTA head is NCBI-style, e.g. >gi|110645304|ref|NC_002516.2|
                                        Pseud...
      --id-regexp string                regular expression for parsing ID (default "^(\\S+)\\s?")
  -X, --infile-list string              file of input files list (one file per line), if given, they are
                                        appended to files from cli arguments
  -w, --line-width int                  line width when outputting FASTA format (0 for no wrap) (default 60)
  -o, --out-file string                 out file ("-" for stdout, suffix .gz for gzipped out) (default "-")
      --quiet                           be quiet and do not show extra information
  -t, --seq-type string                 sequence type (dna|rna|protein|unlimit|auto) (for auto, it
                                        automatically detect by the first sequence) (default "auto")
  -j, --threads int                     number of CPUs. can also set with environment variable
                                        SEQKIT_THREADS) (default 4)

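Reverse-complementing chromosome sequences:

seqkit seq a.fa -p -r -w 0 -v > b.fa  # -w 0 (no line wrapping) and -v (validate bases) are optional; the sequence type is detected automatically by default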
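To make clear what the -r and -p flags do together, here is a minimal sketch that reverse-complements a single DNA string with only the standard rev and tr utilities. It is not how SeqKit works internally. The function name revcomp is hypothetical, and the sketch handles only A/C/G/T in either case, so it is best used as a quick sanity check on a short sequence.

```shell
#!/bin/sh
# revcomp (hypothetical helper): reverse-complement one DNA string.
# rev reverses the characters; tr swaps A<->T and C<->G in both cases.
# Any other letter (for example N) passes through unchanged.
revcomp() {
    printf '%s\n' "$1" | rev | tr 'ACGTacgt' 'TGCAtgca'
}

revcomp AACGTN   # prints NACGTT
```

Comparing this output with the first few bases of a record in b.fa is a cheap way to confirm that both flags were applied.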

The other commands in the toolkit are used the same way. First run "seqkit [command] -h" to see a command's usage, then pick the flags you need and combine them.
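As a hedged illustration of combining flags, the sketch below uses only options listed in the "seqkit seq -h" output above (-m, -u, -w, -n, -i and the global -o). The file names demo.fa and filtered.fa and the two toy records are made up for the example, and the seqkit calls are skipped when the program is not on PATH.

```shell
#!/bin/sh
# Write a tiny demo FASTA file (hypothetical names and sequences).
printf '>seq1 demo\nacgtacgtac\n>seq2 demo\nacg\n' > demo.fa

if command -v seqkit >/dev/null 2>&1; then
    # -m 5: keep records of at least 5 bases
    # -u:   print sequences in upper case
    # -w 0: do not wrap sequence lines
    # -o:   write to a file instead of stdout
    seqkit seq -m 5 -u -w 0 -o filtered.fa demo.fa
    # -n -i: print only the IDs of the remaining records
    seqkit seq -n -i filtered.fa
else
    echo "seqkit not found on PATH; skipping the seqkit steps" >&2
fi
```

With these inputs, only seq1 meets the minimum length of 5, so it should be the only record left in filtered.fa.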

植信矿工

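Dedicated to sharing the latest research, frontier knowledge, and technical advances in plant science, along with field-tested and optimized bioinformatics software, scripts, and pipelines.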
