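The choice of reference database directly affects the accuracy, sensitivity, and specificity of taxonomic annotation. A comprehensive reference database contains a large amount of microbial genome information and offers broader species coverage, but it also demands more computing resources, so metagenomic analyses have to trade database choice against available compute. When Kraken2 is used for metagenomic analysis, the confidence score is a key parameter: it sets the minimum proportion of matching k-mers required before a taxonomic label is assigned to a sequence. However, there is currently no clear guidance on which reference database and which confidence score to use for taxonomic annotation with Kraken2, and most studies simply keep the defaults. Researchers also tend to overlook how the choice of reference database and the confidence setting affect the accuracy of species classification and abundance estimates.

Figure 1. Effects of different databases and confidence scores on classification precision, recall, and F1 score

The study found that with a smaller reference database, the proportion of sequences that can be classified drops significantly as the confidence score increases, whereas with a larger database the classification rate is much less sensitive to the confidence score. With a larger reference database, raising the confidence score lowers the probability of false-positive species and significantly improves classification precision and F1 score, while recall is largely unaffected. However, raising the confidence score also markedly changes the estimated relative abundances of species: the higher the confidence score, the further the annotated species abundances deviate from the true abundances. The authors therefore recommend using a more comprehensive reference database (the standard, nt, or GTDB databases) together with a moderate confidence score (0.2-0.4) to improve classification accuracy and sensitivity when annotating species with Kraken2.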
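To make the role of the confidence score concrete, the sketch below imitates the thresholding rule as described in the Kraken2 manual: a label's score is the fraction of a read's k-mers that map inside the clade rooted at that label, and a label whose score falls below the threshold is replaced by its parent until the threshold is met or the read is left unclassified. This is a simplified illustration, not Kraken2's implementation. The toy taxonomy `PARENT`, the k-mer counts in `HITS`, and the function names are all hypothetical.

```python
from typing import Dict, Optional

# Hypothetical toy taxonomy: child -> parent (None marks the root).
PARENT: Dict[str, Optional[str]] = {
    "root": None,
    "Bacteria": "root",
    "Escherichia": "Bacteria",
    "E. coli": "Escherichia",
    "Salmonella": "Bacteria",
}


def in_clade(taxon: str, ancestor: str) -> bool:
    """Return True if `taxon` is `ancestor` or one of its descendants."""
    node: Optional[str] = taxon
    while node is not None:
        if node == ancestor:
            return True
        node = PARENT[node]
    return False


def clade_fraction(kmer_hits: Dict[str, int], taxon: str, total_kmers: int) -> float:
    """Fraction of the read's k-mers that map inside the clade rooted at `taxon`."""
    inside = sum(n for t, n in kmer_hits.items() if in_clade(t, taxon))
    return inside / total_kmers


def apply_confidence(
    initial_label: str,
    kmer_hits: Dict[str, int],
    total_kmers: int,
    threshold: float,
) -> Optional[str]:
    """Climb from the initial label toward the root until the clade
    fraction reaches `threshold`; return None if no rank qualifies."""
    node: Optional[str] = initial_label
    while node is not None:
        if clade_fraction(kmer_hits, node, total_kmers) >= threshold:
            return node
        node = PARENT[node]
    return None  # read stays unclassified


# Hypothetical read: 40 k-mers in total, 20 of which hit the database.
HITS = {"E. coli": 6, "Escherichia": 4, "Salmonella": 2, "Bacteria": 8}

if __name__ == "__main__":
    for threshold in (0.0, 0.2, 0.4, 0.6):
        label = apply_confidence("E. coli", HITS, 40, threshold)
        print(threshold, label or "unclassified")
```

In this toy read the label moves from species to genus to domain and finally to unclassified as the threshold rises. That matches the pattern reported above: when few k-mers match (as with a small database), a stricter confidence score leaves more reads unclassified. In Kraken2 itself the threshold is set with the `--confidence` option.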
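Yunlong Liu, a PhD candidate at the Institute of Feed Research, Chinese Academy of Agricultural Sciences, is the first author of the paper. Research Professor Yan Tu and Associate Research Professor Tao Ma are the co-corresponding authors, and researcher Morteza Ghaffari provided guidance on the work. The study was supported by the Science and Technology Innovation Program of the Chinese Academy of Agricultural Sciences and the Fundamental Research Funds for Central Non-profit Scientific Institutions.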
Liu, Y., Ghaffari, M.H., Ma, T. et al. Impact of database choice and confidence score on the performance of taxonomic classification using Kraken2. aBIOTECH (2024). https://doi.org/10.1007/s42994-024-00178-0