今天是生信星球陪你的第1005天
公众号里的文章大多数需要编程基础,如果因为代码看不懂,而跟不上正文的节奏,可以来找我学习,相当于给自己一个新手保护期。我的课程都是循环开课,点进去咨询微信↓
生信分析直播课程(每月初开一期,春节休一个月)
生信新手保护学习小组(每月两期)
单细胞陪伴学习小组(每月两期)
这篇推文耗时甚久,主要是学习和跑通官网代码,其次是加了一些自己的细微调整,比如整理marker基因表格,还有可视化的调整等。
与Seurat的流程很像,但不完全一样。scanpy耗费的计算资源更少,可视化的图也很漂亮,且python的机器学习和深度学习做的比R语言更好,我一个R语言讲师被吸引过来了。
1.数据和包准备
2.读取数据
3.质控
4.找高变基因(HVG)
5.标准化和降维
6.marker基因及其可视化
7.细胞类型注释
1.数据和包准备
https://cf.10xgenomics.com/samples/cell/pbmc3k/pbmc3k_filtered_gene_bc_matrices.tar.gz
https://scanpy.readthedocs.io/en/stable/tutorials/basics/clustering-2017.html
https://scanpy.readthedocs.io/en/stable/tutorials/basics/clustering.html#manual-cell-type-annotation
import pandas as pd
import scanpy as sc
import numpy as np
sc.settings.verbosity = 3 # verbosity: errors (0), warnings (1), info (2), hints (3)
sc.logging.print_header()
sc.settings.set_figure_params(dpi=80, facecolor="white")
results_file = "write/pbmc3k.h5ad" # the file that will store the analysis results
scanpy==1.10.3 anndata==0.10.9 umap==0.5.6 numpy==1.26.4 scipy==1.11.4 pandas==2.0.3 scikit-learn==1.5.2 statsmodels==0.14.4 igraph==0.11.6 louvain==0.8.2 pynndescent==0.5.13
2.读取数据
adata = sc.read_10x_mtx(
"01_data", # the directory with the `.mtx` file
var_names="gene_symbols", # use gene symbols for the variable names (variables-axis index)
cache=True, # write a cache file for faster subsequent reading
)
adata.var_names_make_unique()
# this is unnecessary if using `var_names='gene_ids'` in `sc.read_10x_mtx`
adata
... reading from cache file cache/01_data-matrix.h5ad
AnnData object with n_obs × n_vars = 2700 × 32738
var: 'gene_ids'
#表达矩阵里的数值范围
np.min(adata.X),np.max(adata.X)
(0.0, 419.0)
#基本过滤
#过滤前
adata.X.shape
(2700, 32738)
sc.pp.filter_cells(adata, min_genes=200)
sc.pp.filter_genes(adata, min_cells=3)
filtered out 19024 genes that are detected in less than 3 cells
#过滤后
adata.X.shape #可以看到过滤前后的细胞数量和基因数量
(2700, 13714)
稀疏矩阵不支持直接查看,只能是转换成矩阵或者数据框才能查看。转换成矩阵就丢失了行名列名,转换成数据框更好。
adata[0:6, ['CD3D','TCL1A','MS4A1']].X.toarray()
array([[ 4., 0., 0.],
[ 0., 0., 6.],
[10., 0., 0.],
[ 0., 0., 0.],
[ 0., 0., 0.],
[ 1., 0., 0.]], dtype=float32)
adata[0:6, ['CD3D','TCL1A','MS4A1']].to_df()
3.质控
为什么? n_genes_by_counts是每个细胞中检测到的基因数量。total_counts是细胞内检测到的分子总数。n_genes_by_counts过低,表示该细胞可能已死/将死或是空液滴。太高的total_counts和/或n_genes_by_counts表明“细胞”实际上可以是两个或多个细胞。结合线粒体基因count数除去异常值,即可除去大多数双峰/死细胞/空液滴,因此它们过滤是常见的预处理步骤。 参考自:https://www.biostars.org/p/407036/
3.1 查看三个指标
adata.var.head()
adata.obs.head()
#质控
# annotate the group of mitochondrial genes as "mt"
adata.var["mt"] = adata.var_names.str.startswith("MT-")
sc.pp.calculate_qc_metrics(
adata, qc_vars=["mt"], percent_top=None, log1p=False, inplace=True
)
adata.obs.head()
可以看到,表格里有n_genes和n_genes_by_counts,好奇二者之间的关联,统计了一下这两列数值相等的个数,发现更多的是不相等。
(adata.obs.n_genes == adata.obs.n_genes_by_counts).value_counts()
False 1924
True 776
Name: count, dtype: int64
有点麻烦,如果是R语言的seurat就只会计算一个n_Feature,对应这里的n_genes_by_counts。所以我们就忽略n_genes吧。
https://discourse.scverse.org/t/difference-between-n-genes-and-n-genes-by-counts/632
sc.pl.violin(
adata,
["n_genes_by_counts", "total_counts", "pct_counts_mt"],
jitter=0.4,
# **{"color": "#f8766d"}, # 使用额外的参数来设置颜色
multi_panel=True
)
3.2 三个指标之间的相关性
sc.pl.scatter(adata, x="total_counts", y="pct_counts_mt")
sc.pl.scatter(adata, x="total_counts", y="n_genes_by_counts")
3.3 过滤
adata = adata[adata.obs.n_genes_by_counts < 2500, :]
adata = adata[adata.obs.pct_counts_mt < 5, :].copy()
链式赋值:如果你在一段代码中连续对数据进行多次筛选或其他操作,可以在最后一步使用 .copy() 来确保最终结果是一个独立的副本。这样可以避免在中间步骤中不小心修改原始数据。
adata.shape
(2638, 13714)
4.找高变基因(HVG)
sc.pp.normalize_total(adata, target_sum=1e4)
normalizing counts per cell
finished (0:00:00)
sc.pp.log1p(adata) #对数据进行log
adata[0:6, ['ACTB','PPBP','MS4A1']].to_df()
sc.pp.highly_variable_genes(adata, min_mean=0.0125, max_mean=3, min_disp=0.5)
extracting highly variable genes
finished (0:00:00)
--> added
'highly_variable', boolean vector (adata.var)
'means', float vector (adata.var)
'dispersions', float vector (adata.var)
'dispersions_norm', float vector (adata.var)
adata.var.head()
sc.pl.highly_variable_genes(adata)
#sc.pl.highly_variable_genes(adata, log=True)
#adata.var.loc[adata.var.highly_variable == True]
he = adata.var.sort_values(by='dispersions_norm', ascending=False)
print(he[he.highly_variable == True].index.tolist()[0:10])
['DOK3', 'ARVCF', 'YPEL2', 'UBE2D4', 'FAM210B', 'CTB-113I20.2', 'GBGT1', 'LRRIQ3', 'MTIF2', 'TTC8']
adata.raw = adata
adata = adata[:, adata.var.highly_variable] #非必须
5.标准化和降维
5.1 线性降维PCA
sc.pp.regress_out(adata, ["total_counts", "pct_counts_mt"]) #回归
sc.pp.scale(adata, max_value=10) # scale
regressing out ['total_counts', 'pct_counts_mt']
sparse input is densified and may lead to high memory use
finished (0:00:22)
sc.tl.pca(adata, svd_solver="arpack")
computing PCA
with n_comps=50
finished (0:00:01)
sc.pl.pca(adata)
sc.pl.pca_variance_ratio(adata, log=True)
adata
AnnData object with n_obs × n_vars = 2638 × 13714
obs: 'n_genes', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt'
var: 'gene_ids', 'n_cells', 'mt', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'mean', 'std'
uns: 'log1p', 'hvg', 'pca'
obsm: 'X_pca'
varm: 'PCs'
sc.pp.neighbors(adata, n_neighbors=10, n_pcs=40)
computing neighbors
using 'X_pca' with n_pcs = 40
finished: added to `.uns['neighbors']`
`.obsp['distances']`, distances for each pair of neighbors
`.obsp['connectivities']`, weighted adjacency matrix (0:00:00)
5.2 非线性降维UMAP
sc.tl.umap(adata)
computing UMAP
finished: added
'X_umap', UMAP coordinates (adata.obsm)
'umap', UMAP parameters (adata.uns) (0:00:02)
# Using the igraph implementation and a fixed number of iterations can be significantly faster, especially for larger datasets
#sc.tl.leiden(adata)
sc.tl.leiden(adata,flavor="igraph",n_iterations=2)
sc.pl.umap(adata, color=["leiden"],
legend_loc="on data",
size=5)
running Leiden clustering
finished: found 8 clusters and added
'leiden', the cluster labels (adata.obs, categorical) (0:00:00)
6.marker基因及其可视化
6.1 计算全部cluster的maker基因
# Obtain cluster-specific differentially expressed genes
sc.tl.rank_genes_groups(adata, groupby="leiden", method="wilcoxon",pts=True)
sc.pl.rank_genes_groups_dotplot(
adata, groupby="leiden", standard_scale="var", n_genes=5
)
ranking genes
finished: added to `.uns['rank_genes_groups']`
'names', sorted np.recarray to be indexed by group ids
'scores', sorted np.recarray to be indexed by group ids
'logfoldchanges', sorted np.recarray to be indexed by group ids
'pvals', sorted np.recarray to be indexed by group ids
'pvals_adj', sorted np.recarray to be indexed by group ids (0:00:02)
WARNING: dendrogram data not found (using key=dendrogram_leiden). Running `sc.tl.dendrogram` with default parameters. For fine tuning it is recommended to run `sc.tl.dendrogram` independently.
using 'X_pca' with n_pcs = 50
Storing dendrogram info using `.uns['dendrogram_leiden']`
pd.DataFrame(adata.uns["rank_genes_groups"]["names"]).head(10)
result = adata.uns["rank_genes_groups"]
result.keys() #看看结果都包含哪些部分
dict_keys(['params', 'pts', 'pts_rest', 'names', 'scores', 'pvals', 'pvals_adj', 'logfoldchanges'])
keys_to_get = ("pvals","logfoldchanges","pvals_adj","names")
subset = {k: pd.DataFrame(result[k]) for k in keys_to_get if k in result}
{key: value.shape for key, value in subset.items()}
{'pvals': (13714, 8),
'logfoldchanges': (13714, 8),
'pvals_adj': (13714, 8),
'names': (13714, 8)}
{key: value.iloc[1:4,1:4,] for key, value in subset.items()}
{'pvals': 1 2 3
1 1.679730e-170 5.878996e-100 1.513503e-88
2 6.935111e-167 2.002411e-98 1.405777e-84
3 2.569135e-154 3.853307e-97 7.510027e-84,
'logfoldchanges': 1 2 3
1 7.749746 4.805129 7.957393
2 4.887034 4.863548 7.636600
3 5.518004 3.643959 4.836388,
'pvals_adj': 1 2 3
1 1.151791e-166 4.031227e-96 1.037809e-84
2 3.170270e-163 9.153686e-95 6.426275e-81
3 8.808280e-151 1.321106e-93 2.574813e-80,
'names': 1 2 3
1 CD79A FCER1G GNLY
2 HLA-DRA AIF1 GZMB
3 CD79B COTL1 CTSW}
groups = result["names"].dtype.names
pbmc_markers = pd.DataFrame(
{
group + "_" + key: result[key][group]
for group in groups
for key in [ "pvals","logfoldchanges","pvals_adj","names"]
}
)
pbmc_markers.shape
pbmc_markers.to_csv('pbmc_markers.csv', index=False)
pbmc_markers.head(5)
import pandas as pd
import numpy as np
df = pbmc_markers
n = 4
split_df = []
for start_col in range(0, df.shape[1], n):
end_col = min(start_col + n, df.shape[1])
a = df.iloc[:, start_col:end_col]
a['cluster'] = a.columns.str.split('_').str[0].unique().tolist()* a.shape[0]
a.columns = [col.split("_", 1)[1] for col in a.columns[:n]] + a.columns[n:].tolist()
split_df.append(a)
new_df = pd.concat(split_df, ignore_index=True)
new_df.head()
keys_to_get = ('pts', 'pts_rest')
subset = {k: pd.DataFrame(result[k]) for k in keys_to_get if k in result}
subset.keys()
dict_keys(['pts', 'pts_rest'])
pt1 = pd.melt(subset['pts'].reset_index().rename(columns={'index': 'names'}),
id_vars='names', var_name='cluster', value_name='pts')
pt1.head()
pt1.names.equals(pt2.names)
True
pt1["pts_rest"] = pt2.pts_rest
pt1.head()
new_df = pd.merge(new_df, pt1, on=['names', 'cluster'])new_df.head()
new_df = new_df.loc[((new_df.pts>0.25)|(new_df.pts_rest>0.25))&(new_df.logfoldchanges>0.25) ,]
new_df.shape
(4285, 7)
6.2 marker基因可视化
df = pd.read_table("markers.txt",header=None)
df
df.columns = ['Cell_Type', 'Marker']
# 创建字典,其中每个细胞类型对应一个标记基因列表
markers = df.groupby('Cell_Type')['Marker'].apply(list).to_dict()
markers
sc.pl.dotplot(adata, markers, groupby='leiden',figsize=(15,4),var_group_rotation=0)
sc.pl.violin(adata,['PPBP','MS4A1','CD8A'],'leiden')
sc.pl.stacked_violin(adata, markers, groupby = "leiden",figsize= (16,7),var_group_rotation=0)
sc.pl.heatmap(adata, markers, groupby='leiden', var_group_rotation=True,figsize=(12,6))
sc.pl.matrixplot(adata, markers, groupby='leiden',figsize=(12,2),var_group_rotation= 0)
sc.pl.tracksplot(adata,markers,'leiden',figsize=(15,4))
7.细胞类型注释
new_cluster_names = {
'0': "CD4 T",
'1': "B",
'2': "FCGR3A+ Mono",
'3': "NK",
'4': "CD8 T",
'5': "CD14+ Mono",
'6': "DC",
'7': "Platelet"
}
#adata.rename_categories("louvain", new_cluster_names)
adata.obs['leiden'] = adata.obs['leiden'].map(new_cluster_names)
adata.obs['leiden']
AAACATACAACCAC-1 CD4 T
AAACATTGAGCTAC-1 B
AAACATTGATCAGC-1 CD4 T
AAACCGTGCTTCCG-1 FCGR3A+ Mono
AAACCGTGTATGCG-1 NK
...
TTTCGAACTCTCAT-1 CD14+ Mono
TTTCTACTGAGGCA-1 B
TTTCTACTTCCTCG-1 B
TTTGCATGAGAGGC-1 B
TTTGCATGCCTCAC-1 CD4 T
Name: leiden, Length: 2638, dtype: category
Categories (8, object): ['CD4 T', 'B', 'FCGR3A+ Mono', 'NK', 'CD8 T', 'CD14+ Mono', 'DC', 'Platelet']
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(7, 7))
# 使用 scanpy.pl.umap() 绘制 UMAP 图,并将图形绘制到指定的 Axes 对象上
sc.pl.umap(adata, color='leiden', legend_loc='on data',
frameon=False, legend_fontsize=10, legend_fontoutline=2,
add_outline=True, palette="Set1", ax=ax)
# 显示图表
plt.show()
import pandas as pd
umap_coords = adata.obsm['X_umap']
labels = adata.obs['leiden']
plt.figure(figsize=(5,5))
unique_labels = labels.unique()
colors = plt.cm.get_cmap('Set1', len(unique_labels))
for i, label in enumerate(unique_labels):
plt.scatter(umap_coords[labels == label, 0], umap_coords[labels == label, 1],
color=colors(i), alpha=0.5)
for label in unique_labels:
label_coords = umap_coords[labels == label]
center_x = label_coords[:, 0].mean()
center_y = label_coords[:, 1].mean()
plt.text(center_x, center_y, label, fontsize=12, ha='center', va='center')
plt.grid(False)
plt.title('UMAP Plot with Cluster Labels', fontsize=14)
plt.xlabel('UMAP 1', fontsize=12)
plt.ylabel('UMAP 2', fontsize=12)
plt.show()
!mkdir write
adata.write(results_file)