python 单细胞scanpy流程

科技 2024-10-13 16:11 广东

今天是生信星球陪你的第1005天

公众号里的文章大多数需要编程基础，如果因为代码看不懂，而跟不上正文的节奏，可以来找我学习，相当于给自己一个新手保护期。我的课程都是循环开课，点进去咨询微信↓
生信分析直播课程(每月初开一期，春节休一个月)
生信新手保护学习小组（每月两期）
单细胞陪伴学习小组（每月两期)

这篇推文耗时甚久，主要是学习和跑通官网代码，其次是加了一些自己的细微调整，比如整理marker基因表格，还有可视化的调整等。

与Seurat的流程很像，但不完全一样。scanpy耗费的计算资源更少，可视化的图也很漂亮，且python的机器学习和深度学习做的比R语言更好，我一个R语言讲师被吸引过来了。

1.数据和包准备
2.读取数据
3.质控
4.找高变基因(HVG)
5.标准化和降维
6.marker基因及其可视化
7.细胞类型注释

1.数据和包准备

著名的pbmc3k数据：
https://cf.10xgenomics.com/samples/cell/pbmc3k/pbmc3k_filtered_gene_bc_matrices.tar.gz

代码参考自scanpy的标准流程：
https://scanpy.readthedocs.io/en/stable/tutorials/basics/clustering-2017.html
https://scanpy.readthedocs.io/en/stable/tutorials/basics/clustering.html#manual-cell-type-annotation

import pandas as pd
import scanpy as sc
import numpy as np
sc.settings.verbosity = 3  # verbosity: errors (0), warnings (1), info (2), hints (3)
sc.logging.print_header()
sc.settings.set_figure_params(dpi=80, facecolor="white")
results_file = "write/pbmc3k.h5ad"  # the file that will store the analysis results

scanpy==1.10.3 anndata==0.10.9 umap==0.5.6 numpy==1.26.4 scipy==1.11.4 pandas==2.0.3 scikit-learn==1.5.2 statsmodels==0.14.4 igraph==0.11.6 louvain==0.8.2 pynndescent==0.5.13

2.读取数据

10X的输入数据是固定的三个文件，在工作目录下新建01_data/，把三个文件放进去。

adata = sc.read_10x_mtx(
    "01_data",  # the directory with the `.mtx` file
    var_names="gene_symbols",  # use gene symbols for the variable names (variables-axis index)
    cache=True,  # write a cache file for faster subsequent reading
)
adata.var_names_make_unique()  
# this is unnecessary if using `var_names='gene_ids'` in `sc.read_10x_mtx`
adata

... reading from cache file cache/01_data-matrix.h5ad

AnnData object with n_obs × n_vars = 2700 × 32738
    var: 'gene_ids'

#表达矩阵里的数值范围
np.min(adata.X),np.max(adata.X)

(0.0, 419.0)

#基本过滤
#过滤前
adata.X.shape

(2700, 32738)

sc.pp.filter_cells(adata, min_genes=200)
sc.pp.filter_genes(adata, min_cells=3)

filtered out 19024 genes that are detected in less than 3 cells

#过滤后
adata.X.shape #可以看到过滤前后的细胞数量和基因数量

(2700, 13714)

查看感兴趣的基因的表达矩阵
稀疏矩阵不支持直接查看，只能是转换成矩阵或者数据框才能查看。转换成矩阵就丢失了行名列名，转换成数据框更好。

adata[0:6, ['CD3D','TCL1A','MS4A1']].X.toarray()

array([[ 4.,  0.,  0.],
       [ 0.,  0.,  6.],
       [10.,  0.,  0.],
       [ 0.,  0.,  0.],
       [ 0.,  0.,  0.],
       [ 1.,  0.,  0.]], dtype=float32)

adata[0:6, ['CD3D','TCL1A','MS4A1']].to_df()

3.质控

这里是对细胞进行的质控，指标是：

线粒体基因含量不能过高；

n_genes_by_counts 不能过高或过低

为什么？ n_genes_by_counts是每个细胞中检测到的基因数量。total_counts是细胞内检测到的分子总数。n_genes_by_counts过低，表示该细胞可能已死/将死或是空液滴。太高的total_counts和/或n_genes_by_counts表明“细胞”实际上可以是两个或多个细胞。结合线粒体基因count数除去异常值，即可除去大多数双峰/死细胞/空液滴，因此它们过滤是常见的预处理步骤。参考自：https://www.biostars.org/p/407036/

3.1 查看三个指标

在anndata对象中，细胞的注释是在adata.var里，基因的注释是在adata.obs里。

adata.var.head()

adata.obs.head()

此时的表格里仅有基本过滤时计算的指标n_genes和n_cells，还没有我们需要的过滤指标（"n_genes_by_counts", "total_counts", "pct_counts_mt"）。需要用calculate_qc_metrics来计算。

#质控
# annotate the group of mitochondrial genes as "mt"
adata.var["mt"] = adata.var_names.str.startswith("MT-")
sc.pp.calculate_qc_metrics(
    adata, qc_vars=["mt"], percent_top=None, log1p=False, inplace=True
)
adata.obs.head()

可以看到，表格里有n_genes和n_genes_by_counts，好奇二者之间的关联，统计了一下这两列数值相等的个数，发现更多的是不相等。

(adata.obs.n_genes == adata.obs.n_genes_by_counts).value_counts()

False    1924
True      776
Name: count, dtype: int64

搜索得知，这里的n_genes是刚开始基本过滤时计算的，n_genes_by_counts是calculate_qc_metrics计算的，二者有一点区别，后续质控用的是n_genes_by_counts。
有点麻烦，如果是R语言的seurat就只会计算一个n_Feature，对应这里的n_genes_by_counts。所以我们就忽略n_genes吧。

https://discourse.scverse.org/t/difference-between-n-genes-and-n-genes-by-counts/632

sc.pl.violin(
    adata,
    ["n_genes_by_counts", "total_counts", "pct_counts_mt"],
    jitter=0.4,
   # **{"color": "#f8766d"},  # 使用额外的参数来设置颜色
    multi_panel=True
)

根据这个三个图，确定了这个数据的过滤标准：

nFeature_RNA在200~2500之间；线粒体基因占比在5%以下。

3.2 三个指标之间的相关性

sc.pl.scatter(adata, x="total_counts", y="pct_counts_mt")

sc.pl.scatter(adata, x="total_counts", y="n_genes_by_counts")

3.3 过滤

adata = adata[adata.obs.n_genes_by_counts < 2500, :]
adata = adata[adata.obs.pct_counts_mt < 5, :].copy()

链式赋值：如果你在一段代码中连续对数据进行多次筛选或其他操作，可以在最后一步使用 .copy() 来确保最终结果是一个独立的副本。这样可以避免在中间步骤中不小心修改原始数据。

adata.shape

(2638, 13714)

过滤之后少了一些基因。

4.找高变基因(HVG)

首先将数据矩阵归一化（校正文库大小）为每个单元格10000条reads，以使细胞之间的counts具有可比性。

sc.pp.normalize_total(adata, target_sum=1e4)

normalizing counts per cell
    finished (0:00:00)

sc.pp.log1p(adata) #对数据进行log

查看感兴趣的基因的数值大小，可见与先前的count不同

adata[0:6, ['ACTB','PPBP','MS4A1']].to_df()

sc.pp.highly_variable_genes(adata, min_mean=0.0125, max_mean=3, min_disp=0.5)

extracting highly variable genes
    finished (0:00:00)
--> added
    'highly_variable', boolean vector (adata.var)
    'means', float vector (adata.var)
    'dispersions', float vector (adata.var)
    'dispersions_norm', float vector (adata.var)

adata.var.head()

离散度是基因表达方差与基因表达平均水平的比值，用于评估基因表达的变异程度。

sc.pl.highly_variable_genes(adata)
#sc.pl.highly_variable_genes(adata, log=True)

查看前10个高变化基因(因为scanpy和seurat的高变化基因默认参数不同，所以找出的基因也不相同)

#adata.var.loc[adata.var.highly_variable == True]
he =  adata.var.sort_values(by='dispersions_norm', ascending=False)
print(he[he.highly_variable == True].index.tolist()[0:10])

['DOK3', 'ARVCF', 'YPEL2', 'UBE2D4', 'FAM210B', 'CTB-113I20.2', 'GBGT1', 'LRRIQ3', 'MTIF2', 'TTC8']

adata.raw = adata

存储一下adata的原始数据

adata = adata[:, adata.var.highly_variable] #非必须

5.标准化和降维

5.1 线性降维PCA

sc.pp.regress_out(adata, ["total_counts", "pct_counts_mt"]) #回归
sc.pp.scale(adata, max_value=10) # scale

regressing out ['total_counts', 'pct_counts_mt']
    sparse input is densified and may lead to high memory use
    finished (0:00:22)

sc.tl.pca(adata, svd_solver="arpack")

computing PCA
    with n_comps=50
    finished (0:00:01)

sc.pl.pca(adata)

sc.pl.pca_variance_ratio(adata, log=True)

adata

AnnData object with n_obs × n_vars = 2638 × 13714
    obs: 'n_genes', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt'
    var: 'gene_ids', 'n_cells', 'mt', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'mean', 'std'
    uns: 'log1p', 'hvg', 'pca'
    obsm: 'X_pca'
    varm: 'PCs'

sc.pp.neighbors(adata, n_neighbors=10, n_pcs=40)

computing neighbors
    using 'X_pca' with n_pcs = 40
    finished: added to `.uns['neighbors']`
    `.obsp['distances']`, distances for each pair of neighbors
    `.obsp['connectivities']`, weighted adjacency matrix (0:00:00)

5.2 非线性降维UMAP

sc.tl.umap(adata)

computing UMAP
    finished: added
    'X_umap', UMAP coordinates (adata.obsm)
    'umap', UMAP parameters (adata.uns) (0:00:02)

leiden是聚类，没有设置分辨率参数，默认值是1

# Using the igraph implementation and a fixed number of iterations can be significantly faster, especially for larger datasets
#sc.tl.leiden(adata)
sc.tl.leiden(adata,flavor="igraph",n_iterations=2)
sc.pl.umap(adata, color=["leiden"],
           legend_loc="on data",
           size=5)

running Leiden clustering
    finished: found 8 clusters and added
    'leiden', the cluster labels (adata.obs, categorical) (0:00:00)

6.marker基因及其可视化

啥叫marker基因呢。和差异基因里面的上调基因有点类似，某个基因在某一簇细胞里表达量都很高，在其他簇表达量很低，那么这个基因就是这簇细胞的象征。

6.1 计算全部cluster的maker基因

# Obtain cluster-specific differentially expressed genes
sc.tl.rank_genes_groups(adata, groupby="leiden", method="wilcoxon",pts=True)

sc.pl.rank_genes_groups_dotplot(
    adata, groupby="leiden", standard_scale="var", n_genes=5
)

ranking genes
    finished: added to `.uns['rank_genes_groups']`
    'names', sorted np.recarray to be indexed by group ids
    'scores', sorted np.recarray to be indexed by group ids
    'logfoldchanges', sorted np.recarray to be indexed by group ids
    'pvals', sorted np.recarray to be indexed by group ids
    'pvals_adj', sorted np.recarray to be indexed by group ids (0:00:02)
WARNING: dendrogram data not found (using key=dendrogram_leiden). Running `sc.tl.dendrogram` with default parameters. For fine tuning it is recommended to run `sc.tl.dendrogram` independently.
    using 'X_pca' with n_pcs = 50
Storing dendrogram info using `.uns['dendrogram_leiden']`

marker基因表格

pd.DataFrame(adata.uns["rank_genes_groups"]["names"]).head(10)

如果我们想要查看计算结果，像FindAllMarkers那样的数据框，就需要自己整理一番，这段是我自己加的，好费劲哦。

result = adata.uns["rank_genes_groups"]
result.keys() #看看结果都包含哪些部分

dict_keys(['params', 'pts', 'pts_rest', 'names', 'scores', 'pvals', 'pvals_adj', 'logfoldchanges'])

取出需要的信息，来做后面的整理。注意：观察发现pts和pts_rest的行名顺序与其他列不同，所以要分开整理，避免出错。

keys_to_get = ("pvals","logfoldchanges","pvals_adj","names")
subset = {k: pd.DataFrame(result[k]) for k in keys_to_get if k in result}

{key: value.shape for key, value in subset.items()}

{'pvals': (13714, 8),
 'logfoldchanges': (13714, 8),
 'pvals_adj': (13714, 8),
 'names': (13714, 8)}

{key: value.iloc[1:4,1:4,] for key, value in subset.items()}

{'pvals':                1              2             3
 1  1.679730e-170  5.878996e-100  1.513503e-88
 2  6.935111e-167   2.002411e-98  1.405777e-84
 3  2.569135e-154   3.853307e-97  7.510027e-84,
 'logfoldchanges':           1         2         3
 1  7.749746  4.805129  7.957393
 2  4.887034  4.863548  7.636600
 3  5.518004  3.643959  4.836388,
 'pvals_adj':                1             2             3
 1  1.151791e-166  4.031227e-96  1.037809e-84
 2  3.170270e-163  9.153686e-95  6.426275e-81
 3  8.808280e-151  1.321106e-93  2.574813e-80,
 'names':          1       2     3
 1    CD79A  FCER1G  GNLY
 2  HLA-DRA    AIF1  GZMB
 3    CD79B   COTL1  CTSW}

官网给的代码↓，整理出的格式是每簇占4列，且每簇里的基因排列顺序都不同，不方便进行筛选，所以要继续整理成规范的数据框

groups = result["names"].dtype.names
pbmc_markers = pd.DataFrame(
    {
        group + "_" + key: result[key][group]
        for group in groups
        for key in [ "pvals","logfoldchanges","pvals_adj","names"]
    }
)

pbmc_markers.shape
pbmc_markers.to_csv('pbmc_markers.csv', index=False)

pbmc_markers.head(5)

我自己加的调整格式的代码如下：

import pandas as pd
import numpy as np
df = pbmc_markers
n = 4
split_df = []
for start_col in range(0, df.shape[1], n):
    end_col = min(start_col + n, df.shape[1])
    a = df.iloc[:, start_col:end_col]
    a['cluster'] = a.columns.str.split('_').str[0].unique().tolist()* a.shape[0]
    a.columns =   [col.split("_", 1)[1] for col in a.columns[:n]] + a.columns[n:].tolist()
    split_df.append(a)
new_df = pd.concat(split_df, ignore_index=True)
new_df.head()

然后再将每个基因的细胞表达比例加上去

keys_to_get = ('pts', 'pts_rest')
subset = {k: pd.DataFrame(result[k]) for k in keys_to_get if k in result}

subset.keys()

dict_keys(['pts', 'pts_rest'])

pt1 = pd.melt(subset['pts'].reset_index().rename(columns={'index': 'names'}), 
                  id_vars='names', var_name='cluster', value_name='pts')
pt1.head()

pt1.names.equals(pt2.names)

True

pt1["pts_rest"] = pt2.pts_rest
pt1.head()

new_df = pd.merge(new_df, pt1, on=['names', 'cluster'])new_df.head()


new_df = new_df.loc[((new_df.pts>0.25)|(new_df.pts_rest>0.25))&(new_df.logfoldchanges>0.25) ,]
new_df.shape

(4285, 7)

6.2 marker基因可视化

df = pd.read_table("markers.txt",header=None)
df

df.columns = ['Cell_Type', 'Marker']

# 创建字典，其中每个细胞类型对应一个标记基因列表
markers = df.groupby('Cell_Type')['Marker'].apply(list).to_dict()

markers
sc.pl.dotplot(adata, markers, groupby='leiden',figsize=(15,4),var_group_rotation=0)

sc.pl.violin(adata,['PPBP','MS4A1','CD8A'],'leiden')

sc.pl.stacked_violin(adata, markers, groupby = "leiden",figsize= (16,7),var_group_rotation=0)

sc.pl.heatmap(adata, markers, groupby='leiden', var_group_rotation=True,figsize=(12,6))

sc.pl.matrixplot(adata, markers, groupby='leiden',figsize=(12,2),var_group_rotation= 0)

sc.pl.tracksplot(adata,markers,'leiden',figsize=(15,4))

7.细胞类型注释

new_cluster_names = {
    '0': "CD4 T",
    '1': "B",
    '2': "FCGR3A+ Mono",
    '3': "NK",
    '4': "CD8 T",
    '5': "CD14+ Mono",
    '6': "DC",
    '7': "Platelet"
}
#adata.rename_categories("louvain", new_cluster_names)
adata.obs['leiden'] = adata.obs['leiden'].map(new_cluster_names)
adata.obs['leiden']

AAACATACAACCAC-1           CD4 T
AAACATTGAGCTAC-1               B
AAACATTGATCAGC-1           CD4 T
AAACCGTGCTTCCG-1    FCGR3A+ Mono
AAACCGTGTATGCG-1              NK
                        ...     
TTTCGAACTCTCAT-1      CD14+ Mono
TTTCTACTGAGGCA-1               B
TTTCTACTTCCTCG-1               B
TTTGCATGAGAGGC-1               B
TTTGCATGCCTCAC-1           CD4 T
Name: leiden, Length: 2638, dtype: category
Categories (8, object): ['CD4 T', 'B', 'FCGR3A+ Mono', 'NK', 'CD8 T', 'CD14+ Mono', 'DC', 'Platelet']

import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(7, 7))

# 使用 scanpy.pl.umap() 绘制 UMAP 图，并将图形绘制到指定的 Axes 对象上
sc.pl.umap(adata, color='leiden', legend_loc='on data',
           frameon=False, legend_fontsize=10, legend_fontoutline=2,
           add_outline=True, palette="Set1", ax=ax)
# 显示图表
plt.show()

import pandas as pd
umap_coords = adata.obsm['X_umap']
labels = adata.obs['leiden']
plt.figure(figsize=(5,5))
unique_labels = labels.unique()
colors = plt.cm.get_cmap('Set1', len(unique_labels))  
for i, label in enumerate(unique_labels):
    plt.scatter(umap_coords[labels == label, 0], umap_coords[labels == label, 1], 
                color=colors(i), alpha=0.5)
for label in unique_labels:
    label_coords = umap_coords[labels == label]
    center_x = label_coords[:, 0].mean()
    center_y = label_coords[:, 1].mean()
    plt.text(center_x, center_y, label, fontsize=12, ha='center', va='center')
plt.grid(False)
plt.title('UMAP Plot with Cluster Labels', fontsize=14)
plt.xlabel('UMAP 1', fontsize=12)
plt.ylabel('UMAP 2', fontsize=12)
plt.show()

!mkdir write
adata.write(results_file)

http://mp.weixin.qq.com/s?__biz=MzU4NjU4ODQ2MQ==&mid=2247496267&idx=1&sn=fe8126619796acfc99d5a50c36695f59

生信星球

一个零基础学生信的平台-- 原创结构化图文/教程，精选阶段性资料，带你少走弯路早入门，收获成就感，早成生信小能手~