肺腺癌单细胞数据集GSE189357复现(一)：数据下载整理、降维聚类与分群

学术 2024-10-07 19:00 广东

分享是一种态度

前言

Hello小伙伴们大家好，我是生信技能树的小学徒”我才不吃蛋黄“。继胃癌单细胞复现系列完结之后，经过了短暂的休整，我又回来了。之前每一期的推文，我都有认真准备。推文发出之后，也仔细阅读了每一条留言。非常感谢大家的鼓励。

接下来的一段时间里，我将再次开启一个新的学徒分享系列，给大家系统整理肺腺癌单细胞测序的代码。

文章的具体内容，二由老师已在生信菜鸟团公众号发布，请大家移步阅读：单细胞测序+空间转录组描绘从癌前病变到浸润性肺腺癌的动态演变。

本系列包括但不限于以下内容：数据下载与读取；质控和去批次；降维聚类；分群注释；差异分析；富集分析；亚群细分；inferCNV;拟时序分析；细胞通讯；SCENIC转录因子分析等。

为了加深对代码的理解，除了阅读推文，可以加入单细胞月更群，也可以配合观看生信技能树B站系列教学视频。

如果你是的单细胞新手，推荐报名参加生信马拉松课程（生信入门&数据挖掘线上直播课10月班），让你快速掌握生信数据挖掘技能。

本系列推文预计12篇左右，欢迎大家持续追更，关注公众号，点赞+评论+收藏+在看，您的鼓励将是我更新的动力，同时欢迎大家批评指正。

数据处理流程

首先清除系统环境变量，加载R包：

rm(list=ls())
options(stringsAsFactors = F) 
getwd()
setwd("../GSE189357-LUAD-scRNA-ST")
source('scRNA_scripts/lib.R')
library(Seurat)
library(ggplot2)
library(clustree)
library(cowplot)
library(data.table)
library(dplyr)

当我们在建立数据框的时候，R语言将会默认把字符型（character）当成因子（factor）。stringsAsFactors = F意味着，“在读入数据时，遇到字符串之后，不将其转换为factors，仍然保留为字符串格式”。

在R语言中，stringsAsFactors是一个全局选项，它控制着当读取数据时，字符串类型的列是否自动转换为因子(factors)类型。因子类型在R中是一种特殊的数据类型，用于表示分类数据。

在R的早期版本中，默认情况下，读取数据时字符串会被转换为因子。但是，因子类型有其特定的属性和行为，这在某些情况下可能不是用户所期望的。例如，因子类型会将数据视为分类变量，并且会存储数据的唯一值作为内部整数编码，这可能会增加内存使用量，并且在进行某些类型的数据分析时可能会引起混淆。

在单细胞数据分析中，数据通常包含大量的基因表达信息，这些信息是以字符串形式存在的。在进行数据分析之前，设置stringsAsFactors = F的目的通常是为了：

避免不必要的内存消耗：将字符串转换为因子类型会增加内存的使用，因为因子需要存储一个内部的整数编码和对应的唯一值列表。在处理大规模的单细胞数据集时，这可能会成为一个问题。
保持数据的一致性：在数据分析过程中，保持数据类型的一致性是很重要的。如果某些操作期望数据是字符串类型，而数据被自动转换为因子，这可能会导致错误或者不一致的结果。
提高代码的可读性和可维护性：明确地设置stringsAsFactors = F可以让代码的读者知道数据不会被自动转换为因子，这有助于理解代码的意图和行为。
避免因子类型带来的潜在问题：因子类型在某些统计分析中可能会引起问题，比如在进行机器学习或者数据挖掘时，因子类型的数据可能需要额外的处理才能被正确使用。

因此，在读取单细胞数据之前设置stringsAsFactors = F是一种良好的编程实践，它有助于确保数据以正确的形式被处理，并且可以减少后续分析中可能出现的问题。

step1:导入数据

数据存储在“GSE189357_RAW”文件夹中：

dir='GSE189357_RAW/' 
fs=list.files('GSE189357_RAW/','^GSM')
fs
library(tidyverse)
samples=str_split(fs,'_',simplify = T)[,1]

处理数据，将原始文件分别整理为barcodes.tsv.gz，features.tsv.gz和matrix.mtx.gz到各自的文件夹。批量将文件名改为 Read10X() 函数能够识别的名字：

if(F){
lapply(unique(samples),function(x){
  # x = unique(samples)[1]
  y=fs[grepl(x,fs)]
  folder=paste0("GSE189357_RAW/", paste(str_split(y[1],'_',simplify = T)[,1:2], collapse = "_"))
  dir.create(folder,recursive = T)
  #为每个样本创建子文件夹
  file.rename(paste0("GSE189357_RAW/",y[1]),file.path(folder,"barcodes.tsv.gz"))
  #重命名文件，并移动到相应的子文件夹里
  file.rename(paste0("GSE189357_RAW/",y[2]),file.path(folder,"features.tsv.gz"))
  file.rename(paste0("GSE189357_RAW/",y[3]),file.path(folder,"matrix.mtx.gz"))
})
}

自行创建函数function，内置Read10X()函数和CreateSeuratObject()函数，批量读取文件原始文件创建Seurat对象：

dir='GSE189357_RAW/'
samples=list.files( dir )
samples 
sceList = lapply(samples,function(pro){ 
  # pro=samples[1] 
  print(pro)  
  tmp = Read10X(file.path(dir,pro )) 
  if(length(tmp)==2){
    ct = tmp[[1]] 
  }else{ct = tmp}
  sce =CreateSeuratObject(counts =  ct ,
                          project =  pro  ,
                          min.cells = 5,
                          min.features = 300 )
  return(sce)
})

这里我们得到的sceList实际上包含了9个样本的Seurat对象，接着，我们需要使用Seurat包的merge函数，将九个Seurat合并成一个Seurat对象：

do.call(rbind,lapply(sceList, dim))
sce.all=merge(x=sceList[[1]],
              y=sceList[ -1 ],
              add.cell.ids = samples  ) 
names(sce.all@assays$RNA@layers)
#> [1] "counts.GSM5699777_TD1" "counts.GSM5699778_TD2" "counts.GSM5699779_TD3"
#> [4] "counts.GSM5699780_TD4" "counts.GSM5699781_TD5" "counts.GSM5699782_TD6"
#> [7] "counts.GSM5699783_TD7" "counts.GSM5699784_TD8" "counts.GSM5699785_TD9"

此时，虽然我们已经完成了Seurat对象的创建，但是可以看到，九个样本仍然有9个layers。如果不进一步处理，后续在提取counts时数据不完整，分析会一直出错。因此我们需要使用JoinLayers函数对layers进行合并。

sce.all[["RNA"]]$counts 
# Alternate accessor function with the same result
LayerData(sce.all, assay = "RNA", layer = "counts")

看看合并前后的sce变化：

sce.all
sce.all <- JoinLayers(sce.all)
sce.all
dim(sce.all[["RNA"]]$counts )

在合并Seurat和layers后，终于得到了一个完整的Seurat对象”sce.all“。我们可以查看”sce.all“内部的一些信息，以此来检查数据是否完整。

as.data.frame(sce.all@assays$RNA$counts[1:10, 1:2])
head(sce.all@meta.data, 10)
table(sce.all$orig.ident) 
length(sce.all$orig.ident)
# fivenum(sce.all$nFeature_RNA)
# table(sce.all$nFeature_RNA>800) 
# sce.all=sce.all[,sce.all$nFeature_RNA>800]
# sce.all

在成功构建Seurat对象”sce.all“后，我们还需要给样本添加meta.data分组信息，以便后续做不同分组之间的对比以及提取亚组后进行进一步分析。首先，我们查看现有的meta.data信息有哪些：

library(stringr)
phe = sce.all@meta.data
table(phe$orig.ident)
phe$patient = str_split(phe$orig.ident,'[_]',simplify = T)[,2]
table(phe$patient)

按照患者来源，肺腺癌（LUAD）从原位腺癌（AIS）进展到微浸润性腺癌（MIA）和随后的浸润性腺癌（IAC）

phe$sample = phe$patient
table(phe$sample)
phe$sample = gsub("TD5|TD7|TD8", "AIS", phe$sample) 
phe$sample = gsub("TD3|TD4|TD6", "MIA", phe$sample) 
phe$sample = gsub("TD1|TD2|TD9", "IAC", phe$sample) 
table(phe$sample)
sce.all@meta.data = phe

step2: QC质控

质控（quality control, QC）的目的是为了去除质量较差细胞，低质量的细胞会形成独特的亚群，使分群结果变得复杂；在主成分分析过程中，前几个主要成分将捕获质量差异而不是生物学差异，从而降低降维效果。因此，低质量的细胞可能会导致下游分析出现误导性结果。为了避免上述情况的发生，我们需要在下游分析开始前排除掉这些低质量细胞。QC主要的指标有nCount_RNA(每个细胞的UMI数目)和nFeature_RNA（每个细胞中检测到的基因数量）以及"percent_mito"（表示细胞中线粒体基因的比例）这三个指标。此外，还可以纳入”percent_ribo“（核糖体基因比例）和”percent_hb“（红血细胞基因比例）两个指标。细胞的UMI分子数和基因数反映细胞的质量，数量太低可能是细胞碎片，太高则可能是两个或多个细胞结团的情况；线粒体基因高表达的细胞，可能是处于凋亡状态或者裂解状态的细胞；核糖体基因高表达的细胞，细胞内出现RNA降解时；血红蛋白基因高表达的细胞通常为红细胞，而红细胞本身是不含有细胞核的，且携带的基因少，其携带的基因与疾病以及生物发育等过程没有太大的联系，所以可以直接剔除掉。在这里，我们要利用PercentageFeatureSet函数分别计算每个细胞的"percent_mito"，”percent_ribo“和”percent_hb“，具体可见scripts代码

sp='human'
dir.create("./1-QC")
setwd("./1-QC")
# 如果过滤的太狠，就需要去修改这个过滤代码
source('../scRNA_scripts/qc.R')
sce.all.filt = basic_qc(sce.all)
print(dim(sce.all))
print(dim(sce.all.filt))
##细胞减少了一点
setwd('../')
getwd()
fivenum(sce.all.filt$percent_ribo)
table(sce.all.filt$nFeature_RNA> 5)

qc函数如下：

basic_qc <- function(input_sce){
  #计算线粒体基因比例
  mito_genes=rownames(input_sce)[grep("^MT-", rownames(input_sce),ignore.case = T)] 
  print(mito_genes) #可能是13个线粒体基因
  #input_sce=PercentageFeatureSet(input_sce, "^MT-", col.name = "percent_mito")
  input_sce=PercentageFeatureSet(input_sce, features = mito_genes, col.name = "percent_mito")
  fivenum(input_sce@meta.data$percent_mito)
  
  #计算核糖体基因比例
  ribo_genes=rownames(input_sce)[grep("^Rp[sl]", rownames(input_sce),ignore.case = T)]
  print(ribo_genes)
  input_sce=PercentageFeatureSet(input_sce,  features = ribo_genes, col.name = "percent_ribo")
  fivenum(input_sce@meta.data$percent_ribo)
  
  #计算红血细胞基因比例
  Hb_genes=rownames(input_sce)[grep("^Hb[^(p)]", rownames(input_sce),ignore.case = T)]
  print(Hb_genes)
  input_sce=PercentageFeatureSet(input_sce,  features = Hb_genes,col.name = "percent_hb")
  fivenum(input_sce@meta.data$percent_hb)
  
  #可视化细胞的上述比例情况
  feats <- c("nFeature_RNA", "nCount_RNA", "percent_mito",
             "percent_ribo", "percent_hb")
  feats <- c("nFeature_RNA", "nCount_RNA")
  p1=VlnPlot(input_sce, group.by = "orig.ident", features = feats, pt.size = 0, ncol = 2) + 
    NoLegend()
  p1 
  w=length(unique(input_sce$orig.ident))/3+5;w
  ggsave(filename="Vlnplot1.pdf",plot=p1,width = w,height = 5)
  
  feats <- c("percent_mito", "percent_ribo", "percent_hb")
  p2=VlnPlot(input_sce, group.by = "orig.ident", features = feats, pt.size = 0, ncol = 3, same.y.lims=T) + 
    scale_y_continuous(breaks=seq(0, 100, 5)) +
    NoLegend()
  p2 
  w=length(unique(input_sce$orig.ident))/2+5;w
  ggsave(filename="Vlnplot2.pdf",plot=p2,width = w,height = 5)
  
  p3=FeatureScatter(input_sce, "nCount_RNA", "nFeature_RNA", group.by = "orig.ident", pt.size = 0.5)
  ggsave(filename="Scatterplot.pdf",plot=p3)
  
  #根据上述指标，过滤低质量细胞/基因
  #过滤指标1:最少表达基因数的细胞&最少表达细胞数的基因
  # 一般来说，在CreateSeuratObject的时候已经是进行了这个过滤操作
  # 如果后期看到了自己的单细胞降维聚类分群结果很诡异，就可以回过头来看质量控制环节
  # 先走默认流程即可
  if(F){
    selected_c <- WhichCells(input_sce, expression = nFeature_RNA > 500)
    selected_f <- rownames(input_sce)[Matrix::rowSums(input_sce@assays$RNA$counts > 0 ) > 3]
    input_sce.filt <- subset(input_sce, features = selected_f, cells = selected_c)
    dim(input_sce) 
    dim(input_sce.filt) 
  }
  
  input_sce.filt =  input_sce
  
  # par(mar = c(4, 8, 2, 1))
  # 这里的C 这个矩阵，有一点大，可以考虑随抽样 
  C=subset(input_sce.filt,downsample=100)@assays$RNA$counts
  dim(C)
  C=Matrix::t(Matrix::t(C)/Matrix::colSums(C)) * 100

  most_expressed <- order(apply(C, 1, median), decreasing = T)[50:1]
  
  pdf("TOP50_most_expressed_gene.pdf",width=14)
  boxplot(as.matrix(Matrix::t(C[most_expressed, ])),
          cex = 0.1, las = 1, 
          xlab = "% total count per cell", 
          col = (scales::hue_pal())(50)[50:1], 
          horizontal = TRUE)
  dev.off()
  rm(C)
  
  #过滤指标2:线粒体/核糖体基因比例(根据上面的violin图)
  selected_mito <- WhichCells(input_sce.filt, expression = percent_mito < 25)
  selected_ribo <- WhichCells(input_sce.filt, expression = percent_ribo > 3)
  selected_hb <- WhichCells(input_sce.filt, expression = percent_hb < 1 )
  length(selected_hb)
  length(selected_ribo)
  length(selected_mito)
  
  input_sce.filt <- subset(input_sce.filt, cells = selected_mito)
  #input_sce.filt <- subset(input_sce.filt, cells = selected_ribo)
  input_sce.filt <- subset(input_sce.filt, cells = selected_hb)
  dim(input_sce.filt)
  
  table(input_sce.filt$orig.ident) 
  
  #可视化过滤后的情况
  feats <- c("nFeature_RNA", "nCount_RNA")
  p1_filtered=VlnPlot(input_sce.filt, group.by = "orig.ident", features = feats, pt.size = 0, ncol = 2) + 
    NoLegend()
  w=length(unique(input_sce.filt$orig.ident))/3+5;w 
  ggsave(filename="Vlnplot1_filtered.pdf",plot=p1_filtered,width = w,height = 5)
  
  feats <- c("percent_mito", "percent_ribo", "percent_hb")
  p2_filtered=VlnPlot(input_sce.filt, group.by = "orig.ident", features = feats, pt.size = 0, ncol = 3) + 
    NoLegend()
  w=length(unique(input_sce.filt$orig.ident))/2+5;w 
  ggsave(filename="Vlnplot2_filtered.pdf",plot=p2_filtered,width = w,height = 5) 
  return(input_sce.filt) 
  
}

可视化

step3: harmony整合多个单细胞样品

细胞筛选之后，我们需要进行harmony整合。第一期我们在创建总的Seurat对象时，使用了merge函数对多个Seurat进行了简单的合并。merge只是按照行和列进行了合并，并未对数据进行其他处理。

在拿到下游单细胞矩阵前，样本经历了多个实验环节的处理，时间、处理人员、试剂以及技术平台等变量都会成为混杂因素。以上因素混合到一起，就会导致数据产生批次效应（batch effect）。为了尽可能避免混杂因素，我们可以严格把控测序的技术流程，同时也需要在下游分析中进行事后补救（算法去批次）。目前单细胞测序常用的去批次算法有Harmony，CCA，RPCA,FastMNN,scVI等。在这里，我们使用Harmony进行演示。

set.seed(10086)
table(sce.all.filt$orig.ident)
if(T){
  dir.create("2-harmony")
  getwd()
  setwd("2-harmony")
  source('../scRNA_scripts/harmony.R')
  # 默认 ScaleData 没有添加"nCount_RNA", "nFeature_RNA"
  # 默认的
  sce.all.int = run_harmony(sce.all.filt)
  setwd('../')
}

harmony函数如下：

run_harmony <- function(input_sce){
  
  print(dim(input_sce))
  input_sce <- NormalizeData(input_sce, 
                             normalization.method = "LogNormalize",
                             scale.factor = 1e4) 
  input_sce <- FindVariableFeatures(input_sce)
  input_sce <- ScaleData(input_sce)
  input_sce <- RunPCA(input_sce, features = VariableFeatures(object = input_sce))
  seuratObj <- RunHarmony(input_sce, "orig.ident")
  names(seuratObj@reductions)
  seuratObj <- RunUMAP(seuratObj,  dims = 1:15, 
                       reduction = "harmony")
  
  # p = DimPlot(seuratObj,reduction = "umap",label=T ) 
  # ggsave(filename='umap-by-orig.ident-after-harmony',plot = p)
  
  input_sce=seuratObj
  input_sce <- FindNeighbors(input_sce, reduction = "harmony",
                             dims = 1:15) 
  input_sce.all=input_sce
  
  #设置不同的分辨率，观察分群效果(选择哪一个？)
  for (res in c(0.01, 0.05, 0.1, 0.2, 0.3, 0.5,0.8,1)) {
    input_sce.all=FindClusters(input_sce.all, #graph.name = "CCA_snn", 
                               resolution = res, algorithm = 1)
  }
  colnames(input_sce.all@meta.data)
  apply(input_sce.all@meta.data[,grep("RNA_snn",colnames(input_sce.all@meta.data))],2,table)
  
  p1_dim=plot_grid(ncol = 3, DimPlot(input_sce.all, reduction = "umap", group.by = "RNA_snn_res.0.01") + 
                     ggtitle("louvain_0.01"), DimPlot(input_sce.all, reduction = "umap", group.by = "RNA_snn_res.0.1") + 
                     ggtitle("louvain_0.1"), DimPlot(input_sce.all, reduction = "umap", group.by = "RNA_snn_res.0.2") + 
                     ggtitle("louvain_0.2"))
  ggsave(plot=p1_dim, filename="Dimplot_diff_resolution_low.pdf",width = 14)
  
  p1_dim=plot_grid(ncol = 3, DimPlot(input_sce.all, reduction = "umap", group.by = "RNA_snn_res.0.8") + 
                     ggtitle("louvain_0.8"), DimPlot(input_sce.all, reduction = "umap", group.by = "RNA_snn_res.1") + 
                     ggtitle("louvain_1"), DimPlot(input_sce.all, reduction = "umap", group.by = "RNA_snn_res.0.3") + 
                     ggtitle("louvain_0.3"))
  ggsave(plot=p1_dim, filename="Dimplot_diff_resolution_high.pdf",width = 18)
  
  p2_tree=clustree(input_sce.all@meta.data, prefix = "RNA_snn_res.")
  ggsave(plot=p2_tree, filename="Tree_diff_resolution.pdf")
  table(input_sce.all@active.ident) 
  saveRDS(input_sce.all, "sce.all_int.rds")
  return(input_sce.all)
  
}

可视化

结语

本期我们整理了肺腺癌单细胞数据集GSE189357个的10X格式的单细胞测序数据，在批量读取了10X文件后，进行了合并并成功构建Seurat对象，在此基础上将患者的临床信息添加到meta.data分组信息中。在此基础上走Seurat V5标准流程，对Seurat对象进行QC质控、并利用harmony整合去批次、并按标准流程进行降维聚类分群。下一期，我们将进行单细胞测序数据分析最重要（个人认为）的一步：细胞注释。下一期咱们不见不散！

往期回顾

不能要求五年前的单细胞文章按照现在的知识体系做分析

为什么执着于复现出来张泽民大佬的单细胞分群呢

廿一｜生物信息“农转非”

VlnPlot结果及常用参数浅析

胃癌单细胞数据集GSE163558复现(十二)：细胞通讯

如果你对单细胞转录组研究感兴趣，但又不知道如何入门，也许你可以关注一下下面的课程

看完记得顺手点个“在看”哦！

生物 | 单细胞 | 转录组丨资料

每天都精彩

长按扫码可关注

http://mp.weixin.qq.com/s?__biz=MzI1Njk4ODE0MQ==&mid=2247525652&idx=1&sn=dfb8088be3ff41de0359c1ba6617da0a

单细胞天地

对应生信技能树论坛›研究热点›单细胞测序版块，力求全方位收集整理分享单细胞测序数据的应用，涵盖多种组学，多种疾病，发育机理，药物开发等等