今天是生信星球陪你的第1023天
公众号里的文章大多数需要编程基础,如果因为代码看不懂,而跟不上正文的节奏,可以来找我学习,相当于给自己一个新手保护期。我的课程都是循环开课,点进去咨询微信↓
生信分析直播课程(每月初开一期,春节休一个月)
生信新手保护学习小组(每月两期)
单细胞陪伴学习小组(每月两期)
1.文件读取
import singlecellexperiment as sce
import scanpy as sc
import os
print(os.listdir("01_data"))
['barcodes.tsv', 'genes.tsv', 'matrix.mtx']
adata = sc.read_10x_mtx("01_data/")
print(adata.shape)
(2700, 32738)
2. 质控
sc.pp.filter_cells(adata,min_genes=200)
sc.pp.filter_genes(adata,min_cells=3)
adata.var['mt']=adata.var_names.str.startswith('MT-')
sc.pp.calculate_qc_metrics(adata,qc_vars=['mt'],log1p=False,percent_top=None,inplace=True)
sc.pl.violin(adata,["n_genes_by_counts", "total_counts", "pct_counts_mt"],jitter=0.4, multi_panel=True)
adata=adata[adata.obs.n_genes_by_counts>200]
adata=adata[adata.obs.n_genes_by_counts<2500]
adata=adata[adata.obs.pct_counts_mt<20]
print(adata.shape)
(2693, 13714)
3.降维聚类分群
sc.pp.normalize_total(adata,target_sum=1e4)
sc.pp.log1p(adata)
adata.raw=adata
sc.pp.highly_variable_genes(adata,n_top_genes=2000)
sc.pp.scale(adata)
sc.pp.pca(adata)
sc.pp.neighbors(adata,n_pcs=15)
sc.tl.leiden(adata,flavor="igraph",n_iterations=2,resolution=0.5)
sc.tl.umap(adata)
sc.pl.umap(adata,color='leiden')
4.singler自动注释
Q: In the R package singleR, I am able to utilize the cluster parameter; however, it appears that this parameter does not exist in the Python version of singler.Did I miss anything? A: scranpy has an aggregate_across_cells() function that you can use to get the aggregated matrix that can be used in classify_single_reference(). That should be the same as what SingleR::SingleR() does under the hood. I suppose we could add this argument, but to be honest, the only reason that cluster= still exists in SingleR() is for back-compatibility purposes. It's easy enough to do the aggregation outside of the function and I don't want to add more responsibilities to the singler package.
Q: Thank you. I've been learning singler recently. According to the quick start guide on the pip website,the test_data
parameter seems to require the original count data:data = sce.read_tenx_h5("pbmc4k-tenx.h5", realize_assays=True)
mat = data.assay("counts")However, the R version of SingleR typically uses log-normalized data. The documentation also mentions,”or if you are coming from scverse ecosystem, i.e. AnnData, simply read the object as SingleCellExperiment and extract the matrix and the features.“,but data processed with Scanpy could be extracted as scaled data. Could you provide advice on which matrix I should use, or if either would be suitable? A: For the test dataset, it doesn't matter. Only the ranks of the values are used by SingleR itself, so it will give the same results for any monotonic transformation within each cell. IIRC the only place where the log/normalization-status makes a difference is in SingleR::plotMarkerHeatmap()
(R package only, not in the Python package yet) which computes log-fold changes in the test dataset to prioritize the markers to be visualized in the heatmap. This is for diagnostic purposes only.Of course, the reference dataset should always be some kind of log-normalized value, as log-fold changes are computed via the difference of means, e.g., with getClassicMarkers()
.
mat = adata.raw.X.T # 矩阵
features = list(adata.raw.var.index) #矩阵的行名-基因
import scranpy
m2 = scranpy.aggregate_across_cells(mat,adata.obs['leiden']) #按照聚类结果整合单细胞矩阵
m2
SummarizedExperiment(number_of_rows=13714, number_of_columns=8, assays=['sums', 'detected'], row_data=BiocFrame(data={}, number_of_rows=13714, column_names=[]), column_data=BiocFrame(data={'factor_1': StringList(data=['0', '2', '3', '4', '1', '5', '6', '7']), 'counts': array([452, 350, 226, 252, 713, 226, 450, 24], dtype=int32)}, number_of_rows=8, column_names=['factor_1', 'counts']), column_names=['0', '2', '3', '4', '1', '5', '6', '7'])
import celldex
refs = celldex.list_references() #这句也有可能因为网络问题而报错,不过可以不运行,只是为了知道下面可以写什么注释和什么版本。
print(refs[["name", "version"]])
name version
0 dice 2024-02-26
1 blueprint_encode 2024-02-26
2 immgen 2024-02-26
3 mouse_rnaseq 2024-02-26
4 hpca 2024-02-26
5 novershtern_hematopoietic 2024-02-26
6 monaco_immune 2024-02-26
import os
import pickle
fr = "ref_blueprint_encode_data.pkl"
if not os.path.exists(fr):
ref_data = celldex.fetch_reference("blueprint_encode", "2024-02-26", realize_assays=True)
with open(fr, 'wb') as file:
pickle.dump(ref_data, file)
else:
with open(fr, 'rb') as file:
ref_data = pickle.load(file)
import singler
results = singler.annotate_single(
test_data = m2,
test_features = features,
ref_data = ref_data,
ref_labels = "label.main"
)
dd = dict(zip(list(m2.column_data.row_names), results['best']))
dd
{'0': 'CD8+ T-cells',
'2': 'B-cells',
'3': 'Monocytes',
'4': 'NK cells',
'1': 'CD4+ T-cells',
'5': 'CD8+ T-cells',
'6': 'Monocytes',
'7': 'Monocytes'}
adata.obs['singler']=adata.obs['leiden'].map(dd)
sc.pl.umap(adata,color = 'singler')
生信新手保护学习小组,适用于任何方向打基础。本周五(24.11.29)开始,学费50,7天,要求每天有2小时用于学习,具体时间自由安排,详细图文教程+打卡+课程答疑。
12月生信入门和数据挖掘线上直播课,12月2号开始,生信入门班内容是R语言+GEO+linux+转录组上下游分析,4周*5天,学费3699。数据挖掘班内容是R语言+GEO+TCGA+转录组下游分析+机器学习+单细胞数据挖掘和文章复现,3周*5天,学费2899。12月的还没发,内容和11月的基本相同,细节一直在改进→生信入门&数据挖掘线上直播课11月班
以上课程都是0基础友好型,选择困难症找我帮你选,欢迎联系我咨询和报名👇
python+单细胞的学习小组和直播课也即将上线了,我们马上开启内测,暂时只收老学员,(因为我怕翻车啊)。请老学员们留意群通知! 正式课程(对所有人开放)将于春节前后上线!