复现/实战 | 基于python和R的单细胞数据整合

学术   2024-10-09 00:06   四川  
基于对图谱类⽂章的整理,我们发现⼤量单细胞数据的整合主要是基于 Python 的 BBKNN ⽅法以及基于 R的 harmony ⽅法和 RPCA ⽅法,下⾯本⽂依次对三种⽅法进⾏代码实践
1 基于 python 的 BBKNN:
其实,python 的 scanpy 流程(https://scanpy.readthedocs.io/en/stable/index.html)被设计出来就是⽤来整合百万级单细胞数据的
其中,BBKNN 去批次的 Documentation 链接如下:https://scanpy.readthedocs.io/en/stable/tutorials/basics/integrating-data-using-ingest.html。本⽂重点在于整合,因此数据预处理过程不在此赘述,分析的起点就是⼀个 merge 了多个数据集的经过normalize 的百万细胞级 adata 对象,adata.obs 中包含批次变量(样本标识 orig.ident 或者数据集标识batch),去批次前再次进⾏必要的数据预处理(这⼀步还有个主要⽬的是通过减少纳⼊分析的基因数量减轻运算负担):scale,top2000高变基因,PCA,neighbor,UMA
sc.pp.scale(adata, max_value=10)sc.pp.highly_variable_genes(adata, min_mean=0.0125, max_mean=3, min_disp=0.5, n_top_genes = 2000)adata= adata[:, adata.var.highly_variable]
sc.pp.pca(adata)sc.pp.neighbors(adata)sc.tl.umap(adata)
这一步完成后得到adata对象如下:
随后是BBKNN步骤:
sc.external.pp.bbknn(adata_concat, batch_key="batch") #这里是去除数据集间的批次,也可以去除样本间的批次,可以都做比较一下结果sc.tl.umap(adata_concat)sc.pl.umap(    adata_concat, color=["batch"], palette=sc.pl.palettes.vega_20_scanpy)
去批次结果如下,结果还是相当不错的!
2 基于R的整合与去批次
其实在python中得到百万级adata对象前预处理过程的计算压力还是很大的,如果相同的过程再在R中过一遍那必定会是一场硬仗,所以这里就参考了单细胞天地中关于adata对象和seurat对象间转换的文章:https://mp.weixin.qq.com/s/vJrfJE5Eh4tUwMlHlLvIKA,大致策略是将adata对象中的稀疏矩阵、基因名信息以及包含分组的细胞信息提取出来,分别保存成.mtx和csv,再在R中按照标准10X文件读入,转化代码如下:在python中将adata对象保存成10X标准文件
import scanpy as scimport scipy.sparse as sparseimport scipy.io as siocellinfo = adata_concat.obsgeneinfo = adata_concat.varmtx = adata_concat.layers['counts'].Tcellinfo.to_csv("cellinfo.csv")geneinfo.to_csv("geneinfo.csv")sparse_matrix = sparse.csr_matrix(mtx) # 将稀疏矩阵转换为 CSR 格式sio.mmwrite("sparse_matrix.mtx", sparse_matrix)
在R中生成seurat对象:基于10X文件生成seurat对象
cellinfo <- read.csv("cellinfo.csv",header = T)geneinfo <- read.csv("geneinfo.csv",header = T)mtx <- readMM("sparse_matrix.mtx")
rownames(mtx) <- geneinfo[,1]colnames(mtx)= cellinfo[,1]
rownames(geneinfo) <- geneinfo[,1]geneinfo <- geneinfo[,-1]
#准备metadataphe=as.data.frame(cellinfo)rownames(phe)=phe[,1]phe <- phe[,-1]identical(rownames(phe),colnames(mtx)) # TRUEsce.all=CreateSeuratObject(counts = mtx, meta.data = phe) # harmonysce.list <- SplitObject(sce.all, split.by = "batch") #RPCA
2.1 Harmony
cellinfo <- read.csv("cellinfo.csv",header = T)geneinfo <- read.csv("geneinfo.csv",header = T)mtx <- readMM("sparse_matrix.mtx")
rownames(mtx) <- geneinfo[,1]colnames(mtx)= cellinfo[,1]
rownames(geneinfo) <- geneinfo[,1]geneinfo <- geneinfo[,-1]
#准备metadataphe=as.data.frame(cellinfo)rownames(phe)=phe[,1]phe <- phe[,-1]identical(rownames(phe),colnames(mtx)) # TRUEsce.all=CreateSeuratObject(counts = mtx, meta.data = phe) # harmonysce.list <- SplitObject(sce.all, split.by = "batch") #RPCA
2.2 RPCA
sce.list <- lapply(X = sce.list, FUN = function(x) {  x <- NormalizeData(x)  x <- ScaleData(x, features = rownames(x), verbose = FALSE)  x <- RunPCA(x, features = rownames(x), verbose = FALSE)})
glioma.anchors <- FindIntegrationAnchors(object.list = sce.list, anchor.features = rownames(geneinfo), reduction = "rpca")
# this command creates an 'integrated' data assayglioma.combined <- IntegrateData(anchorset = glioma.anchors)
# specify that we will perform downstream analysis on the corrected data note that the# original unmodified data still resides in the 'RNA' assayDefaultAssay(glioma.combined) <- "integrated"
# Run the standard workflow for visualization and clusteringglioma.combined <- ScaleData(glioma.combined, verbose = FALSE)glioma.combined <- RunPCA(glioma.combined, npcs = 30, verbose = FALSE)glioma.combined <- RunUMAP(glioma.combined, reduction = "pca", dims = 1:30)glioma.combined <- FindNeighbors(glioma.combined, reduction = "pca", dims = 1:30)glioma.combined <- FindClusters(glioma.combined, resolution = 0.5)
DimPlot(glioma.combined, reduction = "umap",group.by = "batch")

除了本文提到的三种最常用的方法,整合策略还包括但不限于基于python的ingest、Scanorama,基于R的CCA、FastMNN、scVI,张泽民院士团队构建T细胞图谱时用到的metacell等等。另外,为了对不同整合方法进行benchmark,有学者开发了基于R的iLISI算法(https://github.com/carmonalab/scIntegrationMetrics),暂时不明确是否可以将python和R的结果放在一起作比较,有尝试过的同道欢迎交流!

芒果师兄聊生信
1.生信技能和基因编辑。2.论文发表和基金写作。3. 健康管理和医学科研资讯。4.幸福之路,读书,音乐和娱乐。
 最新文章