A quick look at recent bioRxiv bioinformatics papers worth reading (2024, Season 1)

Academic   2024-10-29 06:57   Beijing

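Over the past few months your editor has again been too busy to keep this preprint roundup up to date. Faced with dozens of bookmarked papers and news items that had piled up, I had to make some painful cuts and settled on 10 papers and 2 news items to share. As usual, let's start with the recent welcome changes in the preprint world.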

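First, arXiv, bioRxiv's sibling (or rather its big brother), began generating HTML versions on December 18, 2023 for new papers submitted in TeX/LaTeX (those submitted after December 1, 2023 for which the HTML conversion succeeds). Note that the HTML version supplements the PDF and does not replace it. According to arXiv, HTML is offered to help researchers with disabilities access paper content: assistive technologies such as screen readers handle HTML better, which helps researchers who are blind, have low vision, or have reading difficulties. arXiv describes this as an important step toward making science accessible, with the eventual goal of adding HTML versions for all existing papers. This seemingly simple change was only possible through close collaboration among several parties, including the LaTeX Project and the LaTeXML team at NIST.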

Second, bioRxiv has rolled out an AI-assisted reading feature, available by clicking the gear icon in the interface.

Taking the first preprint below as an example, readers can switch between two modes, general and expert, and get a different AI-assisted summary in each.

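This feature comes from a collaboration between bioRxiv and the company Science Cast. It aims to bridge the gap between scientific information and the general public (and it also helps your editor write these posts), and it may to some extent change how scientific papers are understood by different audiences. Note that the short AI-generated text on bioRxiv is not based on the abstract alone but is produced after "reading" the full text. To keep up with changing user needs, Science Cast may also produce audio content, making scientific knowledge easier to access than before.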

Without further ado, here are some interesting preprints from the recent past (roughly the last five months).

 

1. [Debate] Researchers at the University of Tübingen, Germany: is the use of t-SNE and UMAP in single-cell sequencing data analysis justified?

The art of seeing the elephant in the room: 2D embeddings of single-cell data do make sense

A recent paper in PLOS Computational Biology (Chari and Pachter, 2023) claimed that t-SNE and UMAP embeddings of single-cell datasets fail to capture true biological structure. The authors argued that such embeddings are as arbitrary and as misleading as forcing the data into an elephant shape. Here we show that this conclusion was based on inadequate and limited metrics of embedding quality. More appropriate metrics quantifying neighborhood and class preservation reveal the elephant in the room: while t-SNE and UMAP embeddings of single-cell data do not represent high-dimensional distances, they can nevertheless provide biologically relevant information.
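To make "neighborhood preservation" concrete, here is a minimal sketch of one common metric of this kind: the fraction of each point's k nearest neighbors in the original space that are still among its k nearest neighbors in the 2D embedding. This is an illustration only, not necessarily the exact metric used in the preprint; the function names `knn_sets` and `knn_preservation`, the choice of Euclidean distance, and the toy data are all assumptions made for this example.

```python
import math
import random


def knn_sets(points, k):
    """For each point, return the set of indices of its k nearest neighbors (self excluded)."""
    neighbors = []
    for i, p in enumerate(points):
        # Euclidean distance to every other point, paired with that point's index.
        dists = [(math.dist(p, q), j) for j, q in enumerate(points) if j != i]
        dists.sort()
        neighbors.append({j for _, j in dists[:k]})
    return neighbors


def knn_preservation(high_dim, embedding, k=10):
    """Mean fraction of high-dimensional k nearest neighbors that remain neighbors in the embedding."""
    high_nn = knn_sets(high_dim, k)
    low_nn = knn_sets(embedding, k)
    shared = sum(len(h & l) for h, l in zip(high_nn, low_nn))
    return shared / (k * len(high_dim))


if __name__ == "__main__":
    random.seed(0)
    # Toy "expression matrix": 100 cells, 20 features (hypothetical data).
    cells = [[random.gauss(0, 1) for _ in range(20)] for _ in range(100)]
    # Two stand-in "embeddings": the first two features, and pure noise.
    first_two = [c[:2] for c in cells]
    noise = [[random.gauss(0, 1), random.gauss(0, 1)] for _ in cells]
    print("first two features:", knn_preservation(cells, first_two))
    print("random 2D points:  ", knn_preservation(cells, noise))
```

A score of 1.0 means every local neighborhood survives the embedding, while an arbitrary layout scores close to chance. Under this kind of metric an embedding can score well even though it distorts large high-dimensional distances, which is the distinction the abstract draws.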

[Note] The big names have already been "arguing" about this paper on social media with great gusto. Who is right?

   

2. [Structure] Researchers at Masaryk University, Czech Republic, release AlphaFind: a fast structure-based search engine covering all protein structures in AlphaFold DB

AlphaFind: Discover structure similarity across the entire known proteome

AlphaFind is a web-based search engine that provides fast structure-based retrieval in the entire set of AlphaFold DB structures. Unlike other protein processing tools, AlphaFind is focused entirely on tertiary structure, automatically extracting the main 3D features of each protein chain and using a machine learning model to find the most similar structures. This indexing approach and the 3D feature extraction method used by AlphaFind have both demonstrated remarkable scalability to large datasets as well as to large protein structures. The web application itself has been designed with a focus on clarity and ease of use. The searcher accepts any valid Uniprot ID, PDB ID or gene symbol as input, and returns a set of similar protein chains from AlphaFold DB, including various similarity metrics between the query and each of the retrieved results. In addition to the main search functionality, the application provides 3D visualizations of protein structure superpositions in order to allow researchers to instantly analyze the structural similarity of the retrieved results. The AlphaFind web application is available online for free and without any registration at https://alphafind.fi.muni.cz.    

         

 

3. [Automation] Data Citation Explorer: an automated service for linking genomic data to the literature

Identifying genomic data use with the Data Citation Explorer

Increases in sequencing capacity, combined with rapid accumulation of publications and associated data resources, have increased the complexity of maintaining associations between literature and genomic data. As the volume of literature and data have exceeded the capacity of manual curation, automated approaches to maintaining and confirming associations among these resources have become necessary. Here we present the Data Citation Explorer (DCE), which discovers literature incorporating genomic data whether or not provenance was clearly indicated. This service provides advantages over manual curation methods including consistent resource coverage, metadata enrichment, documentation of new use cases, and identification of conflicting metadata. The service reduces labor costs associated with manual review, improves the quality of genome metadata maintained by the U.S. Department of Energy Joint Genome Institute (JGI), and increases the number of known publications that incorporate its data products. The DCE facilitates an understanding of JGI impact, improves credit attribution for data generators, and can encourage data sharing by allowing scientists to see how reuse amplifies the impact of their original studies.             

 

4. [Folds] Nobel laureate Doudna: discovering new protein folds through viromics

Birth of new protein folds and functions in the virome

Rapid virus evolution generates proteins essential to infectivity and replication but with unknown function due to extreme sequence divergence [1]. Using a database of 67,715 newly predicted protein structures from 4,463 eukaryotic viral species, we found that 62% of viral proteins are evolutionarily young and lack homologs in the Alphafold database [2,3]. Among the 38% of more ancient viral proteins, many have non-viral structural homologs that revealed surprising similarities between human pathogens and their eukaryotic hosts. Structural comparisons suggested putative functions for >25% of unannotated viral proteins, including those with roles in the evasion of innate immunity. In particular, RNA ligase T- (ligT) like phosphodiesterases were found to resemble phage-encoded proteins that hydrolyze the host immune-activating cyclic dinucleotides 3’3’ and 2’3’ cyclic G-A monophosphate (cGAMP). Experimental analysis showed that ligT homologs encoded by avian poxviruses likewise hydrolyze 2’3’ cGAMP, showing that ligT-mediated targeting of cGAMP is an evolutionarily conserved mechanism of immune evasion present in both bacteriophage and eukaryotic viruses. Together, the viral protein structural database and analytics presented here afford new opportunities to identify mechanisms of virus-host interactions that are common across the virome.

         

 

5. [Heng Li] A new tool for gene graph analysis of pangenomes

Exploring gene content with pangenome gene graphs

Motivation: The gene content regulates the biology of an organism. It varies between species and between individuals of the same species. Although tools have been developed to identify gene content changes in bacterial genomes, none is applicable to collections of large eukaryotic genomes such as the human pangenome. Results: We developed pangene, a computational tool to identify gene orientation, gene order and gene copy-number changes in a collection of genomes. Pangene aligns a set of input protein sequences to the genomes, resolves redundancies between protein sequences and constructs a gene graph with each genome represented as a walk in the graph. It additionally finds subgraphs that encode gene content changes. Applied to the human pangenome, pangene identifies known gene-level variations and reveals complex haplotypes that are not well studied before. Pangene also works with high-quality bacterial pangenome and reports similar numbers of core and accessory genes in comparison to existing tools. Availability and implementation: Source code at this https URL; pre-built pangene graphs can be downloaded from this https URL and visualized at this https URL. (Your editor will not expand the URLs here; interested readers can find them in the original preprint.)
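The idea of "a gene graph with each genome represented as a walk" can be illustrated with a toy sketch. This is not pangene's implementation or data format: the gene names, haplotype names, and helper functions below are hypothetical, and the graph is simplified to a plain directed graph over (gene, strand) nodes, ignoring protein alignment, redundancy resolution, and the fact that a walk read in reverse orientation describes the same path.

```python
from collections import defaultdict


def build_gene_graph(genomes):
    """Build a directed graph whose nodes are (gene, strand) pairs.

    Each genome is an ordered list of (gene, strand) tuples, i.e. a walk;
    every pair of consecutive genes in a walk contributes one edge.
    """
    edges = defaultdict(set)
    for walk in genomes.values():
        for left, right in zip(walk, walk[1:]):
            edges[left].add(right)
    return edges


def copy_numbers(genomes):
    """Count how many times each gene occurs in each genome, ignoring strand."""
    counts = {}
    for name, walk in genomes.items():
        per_genome = defaultdict(int)
        for gene, _strand in walk:
            per_genome[gene] += 1
        counts[name] = dict(per_genome)
    return counts


def branching_nodes(edges):
    """Nodes followed by more than one distinct successor: candidate sites of gene order or orientation change."""
    return {node: sorted(nexts) for node, nexts in edges.items() if len(nexts) > 1}


if __name__ == "__main__":
    # Hypothetical haplotypes: hapB inverts the g2-g3 block, hapC duplicates g2.
    genomes = {
        "hapA": [("g1", "+"), ("g2", "+"), ("g3", "+"), ("g4", "+")],
        "hapB": [("g1", "+"), ("g3", "-"), ("g2", "-"), ("g4", "+")],
        "hapC": [("g1", "+"), ("g2", "+"), ("g2", "+"), ("g3", "+"), ("g4", "+")],
    }
    graph = build_gene_graph(genomes)
    print(branching_nodes(graph))
    print(copy_numbers(genomes)["hapC"])
```

In this toy example the graph branches after `g1` (the inversion in `hapB`) and after `g2` (the duplication in `hapC`), which loosely mirrors how subgraphs of a gene graph can flag orientation, order, and copy-number changes.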

         

 

6. [100K] University of Guelph, Canada: barcoding 100,000 specimens in a single nanopore run

Barcode 100K Specimens: In a Single Nanopore Run

It is a global priority to better manage the biosphere, but action needs to be informed by monitoring shifts in the abundance and distribution of species across the domains of life. The acquisition of such information is currently constrained by the limited knowledge of biodiversity. Among the 20 million or more species of eukaryotes, just a tenth have scientific names. DNA barcoding can speed the registration of unknown animal species, the most diverse kingdom of eukaryotes, as the BIN system automates their recognition. However, inexpensive analytical protocols are critical as the census of all animal species will require processing a billion or more specimens. Barcoding involves DNA extraction followed by PCR and sequencing with the last step dominating costs until 2017. By recovering barcodes from highly multiplexed samples, the Sequel platforms from Pacific BioSciences slashed costs by 90%, but these instruments are only deployed in core facilities because of their expense. Sequencers from Oxford Nanopore Technologies provide an escape from high capital and service costs, but their low sequence fidelity has, until now, kept analytical cost above Sequel. However, the improved performance of its latest flow cells (R10.4.1) might erase this differential. This study demonstrates that a regular MinION flow cell can characterize an amplicon pool derived from 100,000 specimens while a Flongle flow cell can process one derived from several thousand. At $0.01 per specimen, DNA sequencing is now the least expensive step in the barcode workflow. By coupling simplified protocols for DNA extraction with ultra-low volume PCRs, it will be possible to move from specimen to DNA barcode for $0.10, a price point that will enable the census of all species within two decades.             
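The scale implied by these figures can be checked with a few lines of arithmetic. The sketch below takes the abstract's numbers at face value and adds two simplifying assumptions: the census is exactly one billion specimens (the abstract says "a billion or more"), and every MinION flow cell handles exactly 100,000 specimens. Costs are kept in integer cents to avoid floating-point rounding.

```python
SPECIMENS = 1_000_000_000          # assumed census size: the abstract's lower bound
SEQUENCING_CENTS = 1               # $0.01 per specimen for sequencing
FULL_WORKFLOW_CENTS = 10           # $0.10 per specimen, specimen to barcode
SPECIMENS_PER_FLOW_CELL = 100_000  # assumed capacity of one MinION flow cell


def total_usd(specimens: int, cents_per_specimen: int) -> int:
    """Total cost in whole US dollars for a per-specimen cost given in cents."""
    return specimens * cents_per_specimen // 100


def flow_cells_needed(specimens: int, per_flow_cell: int) -> int:
    """Number of flow cells needed, rounding up to a whole flow cell."""
    return -(-specimens // per_flow_cell)


if __name__ == "__main__":
    print(f"sequencing only: ${total_usd(SPECIMENS, SEQUENCING_CENTS):,}")     # $10,000,000
    print(f"full workflow:   ${total_usd(SPECIMENS, FULL_WORKFLOW_CENTS):,}")  # $100,000,000
    print(f"flow cells:      {flow_cells_needed(SPECIMENS, SPECIMENS_PER_FLOW_CELL):,}")  # 10,000
```

Under these assumptions sequencing accounts for one tenth of the projected per-specimen cost, consistent with the abstract's statement that it is now the least expensive step.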

 

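7. [Apes] How phenotypic differences between closely related species relate to variation in 3D genome structure: chimpanzees and bonobos as a case study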

Sequence-based machine learning reveals 3D genome differences between bonobos and chimpanzees

Phenotypic divergence between closely related species, including bonobos and chimpanzees (genus Pan), is largely driven by variation in gene regulation. The 3D structure of the genome mediates gene expression; however, genome folding differences in Pan are not well understood. Here, we apply machine learning to predict genome-wide 3D genome contact maps from DNA sequence for 56 bonobos and chimpanzees, encompassing all five extant lineages. We use a pairwise approach to estimate 3D divergence between individuals from the resulting contact maps in 4,420 1 Mb genomic windows. While most pairs were similar, 17% were predicted to be substantially divergent in genome folding. The most dissimilar maps were largely driven by single individuals with rare variants that produce unique 3D genome folding in a region. We also identified 89 genomic windows where bonobo and chimpanzee contact maps substantially diverged, including several windows harboring genes associated with traits implicated in Pan phenotypic divergence. We used in silico mutagenesis to identify 51 3D-modifying variants in these bonobo-chimpanzee divergent windows, finding that 34 or 66.67% induce genome folding changes via CTCF binding motif disruption. Our results reveal 3D genome variation at the population-level and identify genomic regions where changes in 3D folding may contribute to phenotypic differences in our closest living relatives.
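The "pairwise approach" can be pictured as scoring how different two contact maps for the same genomic window are. The abstract does not say which score is used, so the sketch below assumes a simple one for illustration: 1 minus the Pearson correlation of the upper-triangle contact values. The function names and the tiny 4x4 maps are hypothetical; predicting the maps from DNA sequence is outside the scope of this example.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def upper_triangle(contact_map):
    """Flatten the strict upper triangle of a square, symmetric contact map."""
    n = len(contact_map)
    return [contact_map[i][j] for i in range(n) for j in range(i + 1, n)]


def divergence(map_a, map_b):
    """Assumed divergence score: 1 - Pearson correlation of upper-triangle contacts."""
    return 1.0 - pearson(upper_triangle(map_a), upper_triangle(map_b))


if __name__ == "__main__":
    # Hypothetical contact maps for one window in two individuals.
    individual_1 = [[1.0, 0.8, 0.3, 0.1],
                    [0.8, 1.0, 0.7, 0.2],
                    [0.3, 0.7, 1.0, 0.9],
                    [0.1, 0.2, 0.9, 1.0]]
    individual_2 = [[1.0, 0.8, 0.3, 0.6],  # extra long-range contact between bins 0 and 3
                    [0.8, 1.0, 0.7, 0.2],
                    [0.3, 0.7, 1.0, 0.9],
                    [0.6, 0.2, 0.9, 1.0]]
    print("same individual:", divergence(individual_1, individual_1))
    print("different pair: ", divergence(individual_1, individual_2))
    # Sanity check of the reported proportion: 34 of 51 variants.
    print(f"{100 * 34 / 51:.2f}%")  # 66.67%
```

Repeating such a comparison for every pair of individuals in every window would give a table of divergence scores from which outlier individuals and species-divergent windows could be picked out.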

         

 

8. [Enhancers] Stanford University: an ENCODE resource for building human enhancer-gene regulatory networks

An encyclopedia of enhancer-gene regulatory interactions in the human genome

Identifying transcriptional enhancers and their target genes is essential for understanding gene regulation and the impact of human genetic variation on disease [1–6]. Here we create and evaluate a resource of >13 million enhancer-gene regulatory interactions across 352 cell types and tissues, by integrating predictive models, measurements of chromatin state and 3D contacts, and large-scale genetic perturbations generated by the ENCODE Consortium [7]. We first create a systematic benchmarking pipeline to compare predictive models, assembling a dataset of 10,411 element-gene pairs measured in CRISPR perturbation experiments, >30,000 fine-mapped eQTLs, and 569 fine-mapped GWAS variants linked to a likely causal gene. Using this framework, we develop a new predictive model, ENCODE-rE2G, that achieves state-of-the-art performance across multiple prediction tasks, demonstrating a strategy involving iterative perturbations and supervised machine learning to build increasingly accurate predictive models of enhancer regulation. Using the ENCODE-rE2G model, we build an encyclopedia of enhancer-gene regulatory interactions in the human genome, which reveals global properties of enhancer networks, identifies differences in the functions of genes that have more or less complex regulatory landscapes, and improves analyses to link noncoding variants to target genes and cell types for common, complex diseases. By interpreting the model, we find evidence that, beyond enhancer activity and 3D enhancer-promoter contacts, additional features guide enhancer-promoter communication including promoter class and enhancer-enhancer synergy. Altogether, these genome-wide maps of enhancer-gene regulatory interactions, benchmarking software, predictive models, and insights about enhancer function provide a valuable resource for future studies of gene regulation and human genetics.

 

9. [Comparison] Introducing and applying CGV, NCBI's comparative genome visualization tool for eukaryotes

Interactive visualization of whole eukaryote genome alignments using NCBI’s Comparative Genome Viewer (CGV)

We report a new visualization tool for analysis of whole genome assembly-assembly alignments, the Comparative Genome Viewer (CGV) (https://ncbi.nlm.nih.gov/genome/cgv/). CGV visualizes pairwise same-species and cross-species alignments provided by NCBI using assembly alignment algorithms developed by us and others. Researchers can examine the alignments between the two assemblies using two alternate views: a chromosome ideogram-based view or a 2D genome dotplot. Whole genome alignment views expose large structural differences spanning chromosomes, such as inversions or translocations. Users can also navigate to regions of interest, where they can detect and analyze smaller-scale deletions and rearrangements within specific chromosome or gene regions. RefSeq or user-provided gene annotation is displayed in the ideogram view where available. CGV currently provides approximately 700 alignments from over 300 animal, plant, and fungal species. CGV and related NCBI viewers are undergoing active development to further meet needs of the research community in comparative genome visualization.    

         

 

10. [Africa] Cross-regional collaboration and innovation: exploring a new model for biodiversity genomics research in Africa

Establish grassroots genomics and bioinformatics programs to train 400 Africans yearly

In 2022, around 54% of African students were denied student visas to study in the United States (US), compared to 36% of Asian students and 9% of European students, despite African immigrants in the US often being more highly educated than the US native-born population. This issue cannot be attributed solely to the dichotomy between the Global North and South in visa regimes, but it is also evident among African nations across regional economic blocs. The African BioGenome Project (AfricaBP) Open Institute for Genomics and Bioinformatics, which aims to overcome barriers to capacity building through its distributed African regional workshops, prioritizes grassroots knowledge exchange and innovation in biodiversity genomics and bioinformatics. In 2023, we orchestrated the implementation of 27 capacity building workshops on biodiversity genomics and bioinformatics, covering 10 African countries across 5 African geographical regions. The AfricaBP Open Institute regional workshops raised awareness of biodiversity genomics and bioinformatics among 3788 registered participants and trained 408 African scientists in hands-on molecular biology, genomics, and bioinformatics techniques. Here, we discuss the implementation of transformative strategies by deploying the AfricaBP Open Institute multi-country, multi-institution, and multi-partner hybrid regional workshop model, including the proposed creation of an African digital database containing sequence information relating to biodiversity and agriculture.    


Shengxinren (生信人)
Learning bioinformatics together and exploring the mysteries of life together.