基于对图谱类⽂章的整理,我们发现⼤量单细胞数据的整合主要是基于 Python 的 BBKNN ⽅法以及基于 R的 harmony ⽅法和 RPCA ⽅法,下⾯本⽂依次对三种⽅法进⾏代码实践其实,python 的 scanpy 流程(https://scanpy.readthedocs.io/en/stable/index.html)被设计出来就是⽤来整合百万级单细胞数据的其中,BBKNN 去批次的 Documentation 链接如下:https://scanpy.readthedocs.io/en/stable/tutorials/basics/integrating-data-using-ingest.html。本⽂重点在于整合,因此数据预处理过程不在此赘述,分析的起点就是⼀个 merge 了多个数据集的经过normalize 的百万细胞级 adata 对象,adata.obs 中包含批次变量(样本标识 orig.ident 或者数据集标识batch),去批次前再次进⾏必要的数据预处理(这⼀步还有个主要⽬的是通过减少纳⼊分析的基因数量减轻运算负担):scale,top2000高变基因,PCA,neighbor,UMAsc.pp.scale(adata, max_value=10)
sc.pp.highly_variable_genes(adata, min_mean=0.0125, max_mean=3, min_disp=0.5, n_top_genes = 2000)
adata= adata[:, adata.var.highly_variable]
sc.pp.pca(adata)
sc.pp.neighbors(adata)
sc.tl.umap(adata)
sc.external.pp.bbknn(adata_concat, batch_key="batch")
其实在python中得到百万级adata对象前预处理过程的计算压力还是很大的,如果相同的过程再在R中过一遍那必定会是一场硬仗,所以这里就参考了单细胞天地中关于adata对象和seurat对象间转换的文章:https://mp.weixin.qq.com/s/vJrfJE5Eh4tUwMlHlLvIKA,大致策略是将adata对象中的稀疏矩阵、基因名信息以及包含分组的细胞信息提取出来,分别保存成.mtx和csv,再在R中按照标准10X文件读入,转化代码如下:在python中将adata对象保存成10X标准文件import scanpy as scimport scipy.sparse as sparseimport scipy.io as siocellinfo = adata_concat.obsgeneinfo = adata_concat.varmtx = adata_concat.layers['counts'].Tcellinfo.to_csv("cellinfo.csv")geneinfo.to_csv("geneinfo.csv")sparse_matrix = sparse.csr_matrix(mtx)
在R中生成seurat对象:基于10X文件生成seurat对象cellinfo <- read.csv("cellinfo.csv",header = T)
geneinfo <- read.csv("geneinfo.csv",header = T)
mtx <- readMM("sparse_matrix.mtx")
rownames(mtx) <- geneinfo[,1]
colnames(mtx)= cellinfo[,1]
rownames(geneinfo) <- geneinfo[,1]
geneinfo <- geneinfo[,-1]
#准备metadata
phe=as.data.frame(cellinfo)
rownames(phe)=phe[,1]
phe <- phe[,-1]
identical(rownames(phe),colnames(mtx))
# TRUE
sce.all=CreateSeuratObject(counts = mtx,
meta.data = phe) # harmony
sce.list <- SplitObject(sce.all, split.by = "batch") #RPCA
cellinfo <- read.csv("cellinfo.csv",header = T)
geneinfo <- read.csv("geneinfo.csv",header = T)
mtx <- readMM("sparse_matrix.mtx")
rownames(mtx) <- geneinfo[,1]
colnames(mtx)= cellinfo[,1]
rownames(geneinfo) <- geneinfo[,1]
geneinfo <- geneinfo[,-1]
#准备metadata
phe=as.data.frame(cellinfo)
rownames(phe)=phe[,1]
phe <- phe[,-1]
identical(rownames(phe),colnames(mtx))
# TRUE
sce.all=CreateSeuratObject(counts = mtx,
meta.data = phe) # harmony
sce.list <- SplitObject(sce.all, split.by = "batch") #RPCA
sce.list <- lapply(X = sce.list, FUN = function(x) {
x <- NormalizeData(x)
x <- ScaleData(x, features = rownames(x), verbose = FALSE)
x <- RunPCA(x, features = rownames(x), verbose = FALSE)
})
glioma.anchors <- FindIntegrationAnchors(object.list = sce.list,
anchor.features = rownames(geneinfo),
reduction = "rpca")
glioma.combined <- IntegrateData(anchorset = glioma.anchors)
DefaultAssay(glioma.combined) <- "integrated"
glioma.combined <- ScaleData(glioma.combined, verbose = FALSE)
glioma.combined <- RunPCA(glioma.combined, npcs = 30, verbose = FALSE)
glioma.combined <- RunUMAP(glioma.combined, reduction = "pca", dims = 1:30)
glioma.combined <- FindNeighbors(glioma.combined, reduction = "pca", dims = 1:30)
glioma.combined <- FindClusters(glioma.combined, resolution = 0.5)
DimPlot(glioma.combined, reduction = "umap",group.by = "batch")
除了本文提到的三种最常用的方法,整合策略还包括但不限于基于python的ingest、Scanorama,基于R的CCA、FastMNN、scVI,张泽民院士团队构建T细胞图谱时用到的metacell等等。另外,为了对不同整合方法进行benchmark,有学者开发了基于R的iLISI算法(https://github.com/carmonalab/scIntegrationMetrics),暂时不明确是否可以将python和R的结果放在一起作比较,有尝试过的同道欢迎交流!