singler in Python: a single-cell annotation tool

Tech 2024-11-27 13:26 Guangdong

Today is day 1023 of 生信星球 keeping you company

You can find plenty of code and tutorials online for SingleR in R, but far fewer for the Python version, so congratulations on finding me.

1. Reading the files

The input data is the standard set of three 10X files

import singlecellexperiment as sce
import scanpy as sc
import os
print(os.listdir("01_data"))

['barcodes.tsv', 'genes.tsv', 'matrix.mtx']

Read them with read_10x_mtx

adata = sc.read_10x_mtx("01_data/")
print(adata.shape)

(2700, 32738)

2. Quality control

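sc.pp.filter_cells(adata,min_genes=200) # drop cells with fewer than 200 detected genes
sc.pp.filter_genes(adata,min_cells=3) # drop genes detected in fewer than 3 cells
adata.var['mt']=adata.var_names.str.startswith('MT-') # flag mitochondrial genes
sc.pp.calculate_qc_metrics(adata,qc_vars=['mt'],log1p=False,percent_top=None,inplace=True)
sc.pl.violin(adata,["n_genes_by_counts", "total_counts", "pct_counts_mt"],jitter=0.4, multi_panel=True)

# keep cells with 200 < detected genes < 2500 and less than 20% mitochondrial counts
adata=adata[adata.obs.n_genes_by_counts>200]
adata=adata[adata.obs.n_genes_by_counts<2500]
adata=adata[adata.obs.pct_counts_mt<20].copy() # copy so that later steps work on a real object rather than a view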

print(adata.shape)

(2693, 13714)
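
To make the filter above concrete, here is a minimal pure-Python sketch of what pct_counts_mt measures and how the three thresholds combine. The gene names, the counts and the helper names pct_counts_mt and passes_qc are made up for illustration; this is not scanpy's implementation.

```python
# Toy counts for one cell: gene name -> UMI count (made-up numbers).
cell_counts = {"MT-CO1": 30, "MT-ND1": 20, "CD3E": 100, "LYZ": 50}


def pct_counts_mt(counts, prefix="MT-"):
    """Percentage of a cell's total counts that come from genes starting with `prefix`."""
    total = sum(counts.values())
    mt = sum(v for gene, v in counts.items() if gene.startswith(prefix))
    return 100.0 * mt / total if total else 0.0


def passes_qc(counts, min_genes=200, max_genes=2500, max_pct_mt=20):
    """Same three conditions as the filter above, applied to a single cell."""
    n_genes = sum(1 for v in counts.values() if v > 0)  # detected genes
    return min_genes < n_genes < max_genes and pct_counts_mt(counts) < max_pct_mt


print(pct_counts_mt(cell_counts))  # 50 of 200 counts are mitochondrial
print(passes_qc(cell_counts))      # only 4 genes detected, so the cell is dropped
```

The toy cell fails on two counts: it has far too few detected genes and a quarter of its counts are mitochondrial.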

3. Dimensionality reduction and clustering

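sc.pp.normalize_total(adata,target_sum=1e4) # scale each cell to 10,000 total counts
sc.pp.log1p(adata) # natural log of (1 + x)
adata.raw=adata # keep the log-normalized matrix with all genes; it is used for annotation in section 4

sc.pp.highly_variable_genes(adata,n_top_genes=2000)
sc.pp.scale(adata) # z-score each gene
sc.pp.pca(adata)
sc.pp.neighbors(adata,n_pcs=15)
sc.tl.leiden(adata,flavor="igraph",n_iterations=2,resolution=0.5)
sc.tl.umap(adata)
sc.pl.umap(adata,color='leiden')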
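
As a rough illustration of the first two calls in this section, the sketch below applies the same arithmetic to a single toy cell: rescale the counts to a fixed total, then take log(1 + x). The function name normalize_total_log1p and the counts are made up; the real functions operate on the whole AnnData object.

```python
import math


def normalize_total_log1p(cell_counts, target_sum=1e4):
    """Scale one cell's counts so they sum to target_sum, then apply log(1 + x)."""
    total = sum(cell_counts)
    return [math.log1p(c / total * target_sum) for c in cell_counts]


cell = [0, 1, 3, 6]  # made-up counts for four genes in one cell (total = 10)
lognorm = normalize_total_log1p(cell)

# Undo the log to check that the rescaled counts add up to target_sum.
back = [math.expm1(v) for v in lognorm]
print(lognorm)
print(sum(back))
```

Zero counts stay at zero, and the order of the genes within the cell is unchanged.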

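4. Automatic annotation with singler

There is very little material on singler and its documentation is quite terse, so when I got to this step I asked the package author two questions:

1. How do I annotate by cluster?

The author replied that scranpy's aggregate_across_cells function can be used to aggregate by cluster: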

Q: In the R package singleR, I am able to utilize the cluster parameter; however, it appears that this parameter does not exist in the Python version of singler. Did I miss anything?
A: scranpy has an aggregate_across_cells() function that you can use to get the aggregated matrix that can be used in classify_single_reference(). That should be the same as what SingleR::SingleR() does under the hood.
I suppose we could add this argument, but to be honest, the only reason that cluster= still exists in SingleR() is for back-compatibility purposes. It's easy enough to do the aggregation outside of the function and I don't want to add more responsibilities to the singler package.

2. Should I use raw counts, log-normalized data, or scaled data?

The author replied that any of them works:

Q: Thank you. I've been learning singler recently. According to the quick start guide on the pip website, the test_data parameter seems to require the original count data:
data = sce.read_tenx_h5("pbmc4k-tenx.h5", realize_assays=True)
mat = data.assay("counts")
However, the R version of SingleR typically uses log-normalized data. The documentation also mentions, "or if you are coming from scverse ecosystem, i.e. AnnData, simply read the object as SingleCellExperiment and extract the matrix and the features.", but data processed with Scanpy could be extracted as scaled data. Could you provide advice on which matrix I should use, or if either would be suitable?
A: For the test dataset, it doesn't matter. Only the ranks of the values are used by SingleR itself, so it will give the same results for any monotonic transformation within each cell.
IIRC the only place where the log/normalization-status makes a difference is in SingleR::plotMarkerHeatmap() (R package only, not in the Python package yet) which computes log-fold changes in the test dataset to prioritize the markers to be visualized in the heatmap. This is for diagnostic purposes only.
Of course, the reference dataset should always be some kind of log-normalized value, as log-fold changes are computed via the difference of means, e.g., with getClassicMarkers().
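
The rank argument in the answer above can be checked on a toy matrix. The pure-Python sketch below compares within-cell gene ranks before and after two transformations: log-normalization, which is monotonic within each cell, and per-gene z-scoring, which is what a scaling step such as sc.pp.scale does and which is not monotonic within a cell. The matrix and all helper names are made up, and the population standard deviation is used for simplicity; this only illustrates the rank idea and is not singler's code.

```python
import math
from statistics import mean, pstdev


def ranks(values):
    """Rank of each value (0 = smallest); ties keep their input order."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0] * len(values)
    for rank, i in enumerate(order):
        out[i] = rank
    return out


def column(mat, j):
    """Values of cell j across all genes."""
    return [row[j] for row in mat]


def lognorm_cell(cell, target_sum=1e4):
    """Library-size normalization followed by log(1 + x), for one cell."""
    total = sum(cell)
    return [math.log1p(c / total * target_sum) for c in cell]


def scale_genes(mat):
    """Z-score each gene (row) across cells."""
    out = []
    for row in mat:
        mu, sd = mean(row), pstdev(row)
        out.append([(x - mu) / sd for x in row])
    return out


# Toy matrix: rows are genes, columns are cells (made-up counts).
counts = [
    [10, 0, 4],  # gene A
    [5, 2, 8],   # gene B
    [1, 7, 6],   # gene C
]
n_cells = len(counts[0])
scaled = scale_genes(counts)

same_after_lognorm = [
    ranks(column(counts, j)) == ranks(lognorm_cell(column(counts, j)))
    for j in range(n_cells)
]
same_after_scaling = [
    ranks(column(counts, j)) == ranks(column(scaled, j))
    for j in range(n_cells)
]
print(same_after_lognorm)
print(same_after_scaling)
```

In this toy example the ranks survive log-normalization in every cell, while scaling reorders the genes in the second cell, because each gene is shifted and stretched by its own mean and standard deviation.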

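In practice, which matrix you use does still make some difference to the results. Here we stick with the log-normalized data (the others would work too).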

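mat = adata.raw.X.T # expression matrix, transposed to genes x cells (AnnData stores cells x genes)
features = list(adata.raw.var.index) # row names of the matrix: the genes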

import scranpy
m2 = scranpy.aggregate_across_cells(mat,adata.obs['leiden']) # aggregate the single-cell matrix by cluster
m2

SummarizedExperiment(number_of_rows=13714, number_of_columns=8, assays=['sums', 'detected'], row_data=BiocFrame(data={}, number_of_rows=13714, column_names=[]), column_data=BiocFrame(data={'factor_1': StringList(data=['0', '2', '3', '4', '1', '5', '6', '7']), 'counts': array([452, 350, 226, 252, 713, 226, 450,  24], dtype=int32)}, number_of_rows=8, column_names=['factor_1', 'counts']), column_names=['0', '2', '3', '4', '1', '5', '6', '7'])
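
To show what a per-cluster aggregation like the one above produces, here is a toy pure-Python version that sums the cells of each cluster for every gene and counts the cells per cluster. The function aggregate_by_cluster, the matrix and the first-appearance column order are assumptions made for this sketch, not scranpy's implementation.

```python
def aggregate_by_cluster(mat, clusters):
    """Sum the columns (cells) of a genes x cells matrix within each cluster.

    Returns (cluster_names, sums, n_cells), where sums is genes x clusters.
    """
    names = list(dict.fromkeys(clusters))  # clusters in order of first appearance
    col_of = {name: k for k, name in enumerate(names)}
    sums = [[0] * len(names) for _ in mat]
    n_cells = [0] * len(names)
    for j, cl in enumerate(clusters):
        k = col_of[cl]
        n_cells[k] += 1
        for g, row in enumerate(mat):
            sums[g][k] += row[j]
    return names, sums, n_cells


# Toy data: 2 genes x 4 cells, with one cluster label per cell.
mat = [
    [1, 0, 2, 3],
    [0, 5, 1, 1],
]
clusters = ["0", "1", "0", "1"]

names, sums, n_cells = aggregate_by_cluster(mat, clusters)
print(names)    # ['0', '1']
print(sums)     # [[3, 3], [1, 6]]
print(n_cells)  # [2, 2]
```

The result has one column per cluster instead of one per cell, which is why the annotation step later returns one label per cluster.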

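See which references are available

import celldex
refs = celldex.list_references() # This call can also fail because of network problems. It is optional: it only shows which reference names and versions can be used below.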
print(refs[["name", "version"]])

                        name     version
0                       dice  2024-02-26
1           blueprint_encode  2024-02-26
2                     immgen  2024-02-26
3               mouse_rnaseq  2024-02-26
4                       hpca  2024-02-26
5  novershtern_hematopoietic  2024-02-26
6              monaco_immune  2024-02-26

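celldex downloads the reference data, and network problems often make the download fail and break the run. You can save the reference to a local file so that it is only downloaded on the first run. Note that if you switch to a different reference, two places need to change: fr and the fetch_reference call.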

import os
import pickle

fr = "ref_blueprint_encode_data.pkl" 
if not os.path.exists(fr):
    ref_data = celldex.fetch_reference("blueprint_encode", "2024-02-26", realize_assays=True)
    with open(fr, 'wb') as file:
        pickle.dump(ref_data, file)
else:
    with open(fr, 'rb') as file:
        ref_data = pickle.load(file)
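
One way to avoid having to change two places is to derive the cache file name from the reference name and version. The sketch below is a hypothetical helper, cached_fetch, that is not part of celldex; it takes the download function as an argument, and the demo uses a stand-in fetcher so that it runs without network access.

```python
import os
import pickle
import tempfile


def cached_fetch(name, version, fetch, cache_dir="."):
    """Load a pickled reference if it exists, otherwise call fetch(name, version) and cache it."""
    path = os.path.join(cache_dir, f"ref_{name}_{version}.pkl")
    if os.path.exists(path):
        with open(path, "rb") as fh:
            return pickle.load(fh)
    obj = fetch(name, version)
    with open(path, "wb") as fh:
        pickle.dump(obj, fh)
    return obj


# Demo with a stand-in fetcher that records how often it is called.
calls = []


def fake_fetch(name, version):
    calls.append((name, version))
    return {"name": name, "version": version}


with tempfile.TemporaryDirectory() as tmp:
    first = cached_fetch("blueprint_encode", "2024-02-26", fake_fetch, cache_dir=tmp)
    second = cached_fetch("blueprint_encode", "2024-02-26", fake_fetch, cache_dir=tmp)

print(len(calls))  # the fetcher ran only once; the second call read the cache
```

With celldex, the fetch argument could be something like lambda n, v: celldex.fetch_reference(n, v, realize_assays=True), so that switching references only means changing the name passed to cached_fetch.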

Run the annotation

import singler
results = singler.annotate_single(
    test_data = m2,
    test_features = features,
    ref_data = ref_data,
    ref_labels = "label.main"
)

Add the annotation results to the AnnData object and plot them

dd = dict(zip(list(m2.column_data.row_names), results['best']))
dd

{'0': 'CD8+ T-cells',
 '2': 'B-cells',
 '3': 'Monocytes',
 '4': 'NK cells',
 '1': 'CD4+ T-cells',
 '5': 'CD8+ T-cells',
 '6': 'Monocytes',
 '7': 'Monocytes'}

adata.obs['singler']=adata.obs['leiden'].map(dd)

sc.pl.umap(adata,color = 'singler')
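
Since several clusters receive the same label, it can help to check how many cells end up under each label. The sketch below uses the cluster sizes from the 'counts' column of the aggregation output and the cluster-to-label dictionary shown above; the variable names are made up for this example.

```python
from collections import Counter

# Cells per cluster, copied from the 'counts' column of the aggregation output.
cluster_sizes = {"0": 452, "2": 350, "3": 226, "4": 252,
                 "1": 713, "5": 226, "6": 450, "7": 24}

# Cluster -> label, copied from the dictionary printed above.
cluster_labels = {"0": "CD8+ T-cells", "2": "B-cells", "3": "Monocytes",
                  "4": "NK cells", "1": "CD4+ T-cells", "5": "CD8+ T-cells",
                  "6": "Monocytes", "7": "Monocytes"}

# Clusters without a label would become missing values after .map(), so check first.
unlabelled = sorted(set(cluster_sizes) - set(cluster_labels))

cells_per_label = Counter()
for cluster, n in cluster_sizes.items():
    cells_per_label[cluster_labels[cluster]] += n

print(unlabelled)
print(cells_per_label.most_common())
```

The totals add up to the 2693 cells that passed quality control, with the eight clusters collapsing into five labels.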

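Automatic annotation is not guaranteed to be fully accurate, and you will find that the result changes if you switch to a different reference. If something looks wrong, check it against background knowledge (for example, marker genes).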
