ATAC-seq 数据分析实操教程

2024-12-25 21:21 湖北

引言

❝
ATAC-seq( assay for transposase-accessible chromatin using sequencing ) 具体是什么我就不再叙述了,大家可以看看网上的介绍,我们这次来下载文献里的数据来实操熟悉一下具体流程。

不同 SEQ 的区别:

ATAC-seq:

流程分析步骤:

数据下载

GSE242671 6 个样本,CTL 对照组和 siNCoR 处理组:

看看数据处理的方式,作为参考:

我们复制 PRJNA1014001 去 SRA Explorer 下载就行了:

#!/usr/bin/env bash
ascp -QT -l 300m -P33001 -i $HOME/.aspera/connect/etc/asperaweb_id_dsa.openssh era-fasp@fasp.sra.ebi.ac.uk:vol1/fastq/SRR261/035/SRR26124135/SRR26124135_1.fastq.gz . && mv SRR26124135_1.fastq.gz SRR26124135_GSM7766625_CTL_1_Homo_sapiens_ATAC-seq_1.fastq.gz
ascp -QT -l 300m -P33001 -i $HOME/.aspera/connect/etc/asperaweb_id_dsa.openssh era-fasp@fasp.sra.ebi.ac.uk:vol1/fastq/SRR261/035/SRR26124135/SRR26124135_2.fastq.gz . && mv SRR26124135_2.fastq.gz SRR26124135_GSM7766625_CTL_1_Homo_sapiens_ATAC-seq_2.fastq.gz
ascp -QT -l 300m -P33001 -i $HOME/.aspera/connect/etc/asperaweb_id_dsa.openssh era-fasp@fasp.sra.ebi.ac.uk:vol1/fastq/SRR261/032/SRR26124132/SRR26124132_1.fastq.gz . && mv SRR26124132_1.fastq.gz SRR26124132_GSM7766628_siNCoR_1_Homo_sapiens_ATAC-seq_1.fastq.gz
ascp -QT -l 300m -P33001 -i $HOME/.aspera/connect/etc/asperaweb_id_dsa.openssh era-fasp@fasp.sra.ebi.ac.uk:vol1/fastq/SRR261/032/SRR26124132/SRR26124132_2.fastq.gz . && mv SRR26124132_2.fastq.gz SRR26124132_GSM7766628_siNCoR_1_Homo_sapiens_ATAC-seq_2.fastq.gz
ascp -QT -l 300m -P33001 -i $HOME/.aspera/connect/etc/asperaweb_id_dsa.openssh era-fasp@fasp.sra.ebi.ac.uk:vol1/fastq/SRR261/033/SRR26124133/SRR26124133_1.fastq.gz . && mv SRR26124133_1.fastq.gz SRR26124133_GSM7766627_CTL_3_Homo_sapiens_ATAC-seq_1.fastq.gz
ascp -QT -l 300m -P33001 -i $HOME/.aspera/connect/etc/asperaweb_id_dsa.openssh era-fasp@fasp.sra.ebi.ac.uk:vol1/fastq/SRR261/033/SRR26124133/SRR26124133_2.fastq.gz . && mv SRR26124133_2.fastq.gz SRR26124133_GSM7766627_CTL_3_Homo_sapiens_ATAC-seq_2.fastq.gz
ascp -QT -l 300m -P33001 -i $HOME/.aspera/connect/etc/asperaweb_id_dsa.openssh era-fasp@fasp.sra.ebi.ac.uk:vol1/fastq/SRR261/030/SRR26124130/SRR26124130_1.fastq.gz . && mv SRR26124130_1.fastq.gz SRR26124130_GSM7766630_siNCoR_3_Homo_sapiens_ATAC-seq_1.fastq.gz
ascp -QT -l 300m -P33001 -i $HOME/.aspera/connect/etc/asperaweb_id_dsa.openssh era-fasp@fasp.sra.ebi.ac.uk:vol1/fastq/SRR261/030/SRR26124130/SRR26124130_2.fastq.gz . && mv SRR26124130_2.fastq.gz SRR26124130_GSM7766630_siNCoR_3_Homo_sapiens_ATAC-seq_2.fastq.gz
ascp -QT -l 300m -P33001 -i $HOME/.aspera/connect/etc/asperaweb_id_dsa.openssh era-fasp@fasp.sra.ebi.ac.uk:vol1/fastq/SRR261/031/SRR26124131/SRR26124131_1.fastq.gz . && mv SRR26124131_1.fastq.gz SRR26124131_GSM7766629_siNCoR_2_Homo_sapiens_ATAC-seq_1.fastq.gz
ascp -QT -l 300m -P33001 -i $HOME/.aspera/connect/etc/asperaweb_id_dsa.openssh era-fasp@fasp.sra.ebi.ac.uk:vol1/fastq/SRR261/031/SRR26124131/SRR26124131_2.fastq.gz . && mv SRR26124131_2.fastq.gz SRR26124131_GSM7766629_siNCoR_2_Homo_sapiens_ATAC-seq_2.fastq.gz
ascp -QT -l 300m -P33001 -i $HOME/.aspera/connect/etc/asperaweb_id_dsa.openssh era-fasp@fasp.sra.ebi.ac.uk:vol1/fastq/SRR261/034/SRR26124134/SRR26124134_1.fastq.gz . && mv SRR26124134_1.fastq.gz SRR26124134_GSM7766626_CTL_2_Homo_sapiens_ATAC-seq_1.fastq.gz
ascp -QT -l 300m -P33001 -i $HOME/.aspera/connect/etc/asperaweb_id_dsa.openssh era-fasp@fasp.sra.ebi.ac.uk:vol1/fastq/SRR261/034/SRR26124134/SRR26124134_2.fastq.gz . && mv SRR26124134_2.fastq.gz SRR26124134_GSM7766626_CTL_2_Homo_sapiens_ATAC-seq_2.fastq.gz

为了方便,我自己重新命名了一下:

质控

用 fastqc 和 multiqc 软件,看一下序列有没有接头:

没有就直接进行比对。

比对及数据处理

❝
ATAC-seq 需要对比对后的文件进行过滤,包括 去除低质量序列, 去除线粒体序列, 去除 PCR 重复, 去除 blacklist 区域的序列, 对序列位置进行矫正,正链+4,负链-5。最终得到 shifted.bam 文件。

blacklist 文件下载:

https://github.com/Boyle-Lab/Blacklist/tree/master/lists

$ wget https://github.com/Boyle-Lab/Blacklist/blob/master/lists/hg19-blacklist.v2.bed.gz
$ gunzip hg19-blacklist.v2.bed.gz

写一个脚本,一起完成这些事情:

#!/bin/bash

for i in CTL_1 CTL_2 CTL_3 siNCoR_1 siNCoR_2 siNCoR_3
do
 bowtie2 -p 15 -x ~/index-data/hg19_bw2/hg19 \
  --very-sensitive --no-mixed --no-discordant -X 2000 \
  -1 ../0.raw-data/${i}_1.fastq.gz \
  -2 ../0.raw-data/${i}_2.fastq.gz \
  | samtools view -bS | samtools sort -@ 6 -o ${i}.sorted.bam


 # remove mitochondrial reads and low-quality alignments
 samtools view -h -q 30 ${i}.sorted.bam \
  | grep -v chrM \
  | samtools sort -o ${i}.rmchrM.bam


 # remove duplicates
 sambamba markdup --remove-duplicates --nthreads 10 ${i}.rmchrM.bam ${i}.rmDup.bam

 # remove ENCODE blacklist regions
 bedtools intersect -nonamecheck -v -abam ${i}.rmDup.bam \
      -b hg19-blacklist.v2.bed > ${i}.tmp.bam

 # sort and index the bam file
 samtools sort -@ 10  -o ${i}.bf.bam ${i}.tmp.bam
 samtools index -@ 10 ${i}.bf.bam
 rm ${i}.tmp.bam

 # shift read coordinates
 alignmentSieve --numberOfProcessors 15 \
         --ATACshift \
         --blackListFileName hg19-blacklist.v2.bed \
         --bam ${i}.bf.bam -o ${i}.shifted.bam

  # sort and index
  samtools sort -@ 10 -o ${i}.shift.sorted.bam ${i}.shifted.bam
 samtools index -@ 10 ${i}.shift.sorted.bam

done

bam 转 bigwig/bedgraph 可视化

❝
对上面最终得到的 bam 文件转换为 bigwig/bedgraph 可以在 igv 里面查看。此外还可以在全基因水平上查看在 TSS/TES 出的富集情况。

#!/bin/bash

for i in CTL_1 CTL_2 CTL_3 siNCoR_1 siNCoR_2 siNCoR_3
do
 # bam2bigwig
 bamCoverage -p 15 --outFileFormat bigwig \
      --binSize 50 --normalizeUsing BPM \
      -b ../2.map-data/${i}.shift.sorted.bam \
      -o ${i}.bw

 #bam2bedgraph
 bamCoverage -p 15 --outFileFormat bedgraph \
                    --binSize 50 --normalizeUsing BPM \
                    -b ../2.map-data/${i}.shift.sorted.bam \
                    -o ${i}.bdg
done

❝
对于多个生物学重复样本,我们可以合并然后出图,这样可以不用展示很多图,尤其是样本数量多有生物学重复的情况,如果不多也可以都画出来。这里我们使用 macs3 cmbreps 来用均值合并。

# merge replicate of bedgraph files
$ macs3 cmbreps -i CTL_1.bdg CTL_2.bdg CTL_3.bdg -o CTL_cb.bedGraph --method mean
$ macs3 cmbreps -i siNCoR_1.bdg siNCoR_2.bdg siNCoR_3.bdg -o siNCoR_cb.bedGraph --method mean

# bedgraph to bigwig
$ conda install ucsc-bedgraphtobigwig
$ wget https://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/hg19.chrom.sizes

$ bedGraphToBigWig CTL_cb.bedGraph hg19.chrom.sizes  CTL_cb.bigwig
$ bedGraphToBigWig siNCoR_cb.bedGraph hg19.chrom.sizes  siNCoR_cb.bigwig

下载注释文件:

$ wget https://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/genes/hg19.ncbiRefSeq.gtf.gz
$ gunzip hg19.ncbiRefSeq.gtf.gz

绘图:

# =============================================================================================
# heatmap plot
# =============================================================================================
computeMatrix scale-regions -R ~/index-data/hg19.ncbiRefSeq.gtf \
                            -b 3000 -a 3000 \
                            --regionBodyLength 5000 \
                            --missingDataAsZero \
                            -S CTL_cb.bigwig siNCoR_cb.bigwig \
                            --skipZeros -p 15 \
                            --metagene \
                            -o matrix_sr.gz

computeMatrix reference-point --referencePoint TSS \
                              -R ~/index-data/hg19.ncbiRefSeq.gtf \
                              -b 3000 -a 3000 \
                              --missingDataAsZero \
                              -S CTL_cb.bigwig siNCoR_cb.bigwig \
                              --skipZeros -p 15 \
                              --metagene \
                              -o matrix_rp.gz

plotHeatmap -m matrix_sr.gz --colorList 'white,#0066CC' --heatmapHeight 12 --heatmapWidth 6 -o sr.pdf
plotHeatmap -m matrix_rp.gz --colorList 'white,#0066CC' --heatmapHeight 12 --heatmapWidth 6 -o rp.pdf

TSS-TES 分布:

TSS 分布:

ATAC 质控

❝
片段长度分布是有特点的,一般 100bp 以内位 NFR 无核小体区, 一个核小体(200bp),两个核小体(400bp), 三个核小体(600bp)区域等等。我们用 ATACseqQC 来质控。

setwd("~/JunJun/8.atacSeq_proj/4.atacQC-data/")

# BiocManager::install("ATACseqQC")
library(ATACseqQC)
library(ggplot2)
library(dplyr)

# bamfile files
bamfile <- list.files("../2.map-data",pattern = "shift.sorted.bam$",full.names = T)
bamlabel <- sapply(strsplit(bamfile,split = "../2.map-data/|.shift.sorted.bam"),"[",2)

# Estimate the library complexity
estimateLibComplexity(readsDupFreq(bamfile[1]))

################################################################################
# generate fragement size distribution
################################################################################
x = 1
lapply(seq_along(bamfile),function(x){
  fragSize <- fragSizeDist(bamFiles = bamfile[x], bamFiles.labels = bamlabel[x])

  # replot
  df <- data.frame(fragSize)
  colnames(df) <- c("x","y")
  df$sample <- bamlabel[x]
  df$y <- df$y/sum(df$y)

  return(df)
}) %>% do.call("rbind",.) -> fz_data

# plot 6x10
ggplot(data = fz_data,aes(x = as.numeric(x),y = y*10^3)) +
  geom_line() +
  theme_bw() +
  theme(panel.grid = element_blank(),
        strip.text = element_text(face = "bold.italic")) +
  facet_wrap(~sample,ncol = 3,scales = "free") +
  scale_x_continuous(limits = c(0,1000),breaks = c(0,200,400,600,800,1000)) +
  xlab("Fragment length (bp)") +
  ylab("Normalized read density(10^-3)")

PeakCalling

❝
一般用 macs 的比较多,但对于具体参数不同文章也不一样, 如果你使用 nomodel --shift -100 --extsize 200 参数就是寻找 Tn5 酶的切割位点, 对于双端数据,我们不用就行。在 macs3.0 版本中,作者将另外一款专门针对 ATAC-seq 的 callpeak 软件 HMMRATAC 也纳入进来了。

我们可以都试一下(macs3 hmmratac 慢很多):

#!/bin/bash

for i in CTL_1 CTL_2 CTL_3 siNCoR_1 siNCoR_2 siNCoR_3
do
  # using macs3 to call peaks
  macs3 callpeak -f BAMPE -t ../2.map-data/${i}.shift.sorted.bam \
                 -g hs -n ${i} -B -q 0.01 --keep-dup all

  # using hmmratac to call peaks
  macs3 hmmratac -i ../2.map-data/${i}.shift.sorted.bam \
                    -n ${i} --format BAMPE \
                    --outdir hmmratac_peaks
done

差异 Peaks 分析

❝
差异 peak 的分析软件有多种, 基于 peak set-based 和 sliding window, 一种最原始的方法就是对 peak 区域进行计数获得 count,然后使用差异软件进行差异分析。我们使用: 原始计数+DESeq2, DiffBind 和 DiffChIPL 来看看结果。

一些差异 peak 分析软件:

DiffChIPL 是一款 2022 年发表在 Bioinformatics 的差异分析软件:

原始计数+DESeq2

❝
首先我们需要对生物学重复间的 peak 取交集,然后合并不同分组的 peaks,然后进行计数获得 count 矩阵,最后用 DESeq2 差异分析。

# control peaks replicate overlap
bedtools intersect -a CTL_1_peaks.narrowPeak -b CTL_2_peaks.narrowPeak | bedtools intersect -a stdin -b CTL_3_peaks.narrowPeak | awk '{print $1"\t"$2"\t"$3}' > CTL_overlap.bed

# treat peaks replicate overlap
bedtools intersect -a siNCoR_1_peaks.narrowPeak -b siNCoR_2_peaks.narrowPeak | bedtools intersect -a stdin -b siNCoR_3_peaks.narrowPeak | awk '{print $1"\t"$2"\t"$3}' > siNCoR_overlap.bed

# merge control and treat peaks
cat CTL_overlap.bed siNCoR_overlap.bed | sort -k1,1 -k2,2n > CTL_siNCoR.bed
bedtools merge -i CTL_siNCoR.bed > CTL_siNCoR.merge.bed

计数:

# make count matrix
bedtools multicov -bams ../2.map-data/*.shift.sorted.bam -bed CTL_siNCoR.merge.bed > peak_count_matrix.txt

差异分析:

setwd("~/JunJun/8.atacSeq_proj/5.peak-data/")

library(DESeq2)
library(dplyr)

# load data
raw_count <- read.table('peak_count_matrix.txt')
raw_count$id <- paste(raw_count$V1,raw_count$V2,raw_count$V3,sep = "|")
raw_count <- raw_count %>% dplyr::select(-V1,-V2,-V3) %>%
  tibble::column_to_rownames(var = "id")

sample <- list.files("../2.map-data/",pattern = "*.shift.sorted.bam$")
sample_name <- sapply(strsplit(sample,split = ".shift.sorted.bam"),"[",1)
colnames(raw_count) <- sample_name

# save folder
dir.create('1.Diff-data/')

# meta data
coldata <- data.frame(condition = factor(rep(c('control', 'treat'), each = 3),
                                         levels = c('control', 'treat')))
# DESeqDataSet
dds <- DESeqDataSetFromMatrix(countData = raw_count,
                              colData = coldata,
                              design= ~condition)

# diff
dds <- DESeq(dds)

norm <- counts(dds,normalized = T)

# get results
res <- results(dds, contrast = c('condition', 'treat', 'control'))
res <- cbind(norm,res)

# filter NA
res_all <- as.data.frame(res) %>% filter(log2FoldChange != 'NA')

# add annotation
res_all$id <- rownames(res_all)
res_all <- res_all %>% separate_wider_delim(id,delim = "|",names = c("chr","start","end"))

# add sig type
res_all_anno <- res_all |>
  mutate(type = case_when(log2FoldChange >= 1 & pvalue < 0.05 ~ "sigUp",
                          log2FoldChange <= -1 & pvalue < 0.05 ~ "sigDown",
                          .default = "nonSig"))

# check
table(res_all_anno$type)
# nonSig sigDown   sigUp
# 58446    5962    3449

# export all diff
write.csv(res_all_anno,file = paste('1.Diff-data/peak_diffAll.csv',sep = '') ,row.names = F)

colSums(raw_count)
#   CTL_1    CTL_2    CTL_3 siNCoR_1 siNCoR_2 siNCoR_3
# 4115852  3178135  3737711 16961289 23533595 35302961

❝
思考: 这里有个问题, 我们对于一个组有明显富集的 peak 和一个比较弱的富集的 peak 进行计数,那么它们总的 reads 数量肯定相差很大,而 DESeq2 需要对每个样本的文库大小进行标准化,那么富集多的那组很多 peak 经过标准化则会变成下调的相反结果,事实上我看了很多确实是这样的,所以并不推荐这种做法去找差异的 peak。

DiffBind 差异分析

# BiocManager::install("DiffBind")
library(DiffBind)

# SampleID Tissue Factor Condition Treatment Replicate bamReads ControlID bamControl Peaks PeakCaller
meta <- data.frame(SampleID = paste0(rep(c("CTL","siNCoR"),each = 3),c(1:3)),
                   Tissue = NA,
                   Factor = "atac",
                   Condition = rep(c("control","treat"),each = 3),
                   Treatment = rep(c("control","treat"),each = 3),
                   Replicate = rep(c(1,2,3),2),
                   bamReads = list.files("../2.map-data",pattern = ".shift.sorted.bam$",full.names = T),
                   ControlID = NA,
                   bamControl = NA,
                   Peaks = list.files(".",pattern = ".narrowPeak",full.names = T),
                   PeakCaller = "narrow")

tamoxifen <-dba(sampleSheet = meta)

plot(tamoxifen)

tamoxifen <- dba.count(tamoxifen,summits = 100)

dba.plotPCA(tamoxifen,attributes = DBA_CONDITION,label = DBA_ID)
# =======================================================
# show libsizes
info <- dba.show(tamoxifen)
libsizes <- cbind(LibReads=info$Reads, FRiP=info$FRiP,
                  PeakReads=round(info$Reads * info$FRiP))
rownames(libsizes) <- info$ID
libsizes
#         LibReads FRiP PeakReads
# CTL1    17653370 0.09   1588803
# CTL2    28665764 0.05   1433288
# CTL3    28460519 0.05   1423026
# siNCoR1 35846252 0.18   6452325
# siNCoR2 32999878 0.27   8909967
# siNCoR3 42458342 0.32  13586670

tamoxifen <- dba.normalize(tamoxifen)
tamoxifen <- dba.contrast(tamoxifen,contrast = c("Condition","treat","control"))
dba.show(DBA = tamoxifen,bContrasts = T)

# test <-dba.contrast(tamoxifen,reorderMeta = list(Condition = "control"))
# dba.show(DBA = test,bContrasts = T)

tamoxifen <- dba.analyze(tamoxifen)
diffPeaks <- dba.report(tamoxifen,th = 1) %>%
  data.frame() %>%
  mutate(type = case_when(Fold >= 1 & FDR < 0.05 ~ "sigUp",
                          Fold <= -1 & FDR < 0.05 ~ "sigDown",
                          .default = "nonSig"))

# check
table(diffPeaks$type)
# nonSig sigDown   sigUp
# 12928    3555   75505

# export all diff
write.csv(diffPeaks,file = paste('1.Diff-data/diffbind_peak_diffAll.csv',sep = '') ,row.names = F)

可以看到显著上调的 peak 比下调的多了一个数量级,也和我们的 TSS heatmap profile 比较像。

样本相关性:

PCA:

注释一下差异的 peak:

# ==============================================================================
# annotate for peaks
# ==============================================================================
library(ChIPseeker)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
library(org.Hs.eg.db)

sigdiff <- diffPeaks %>% dplyr::filter(type == "sigUp") %>%
  GRanges()

peakAnno <- annotatePeak(sigdiff, tssRegion = c(-3000, 3000),
                         TxDb = TxDb.Hsapiens.UCSC.hg19.knownGene)

peakAnno_df <- data.frame(peakAnno)

# id transformation
ids <- mapIds(org.Hs.eg.db, keys = peakAnno_df$geneId, keytype = "ENTREZID", column = "SYMBOL")
peakAnno_df$GeneName <- ids

# output
write.csv(peakAnno_df, '1.Diff-data/diffbind_sigPeak_anno.csv', row.names = FALSE)

# plot
plotAnnoBar(peakAnno)

输出把上下调和不变的 peak:

# ==============================================================================
# output peaks with types
# ==============================================================================
diffPeaks <- read.csv("1.Diff-data/diffbind_peak_diffAll.csv")

types <- c("sigUp","sigDown","nonSig")

# x = 2
lapply(seq_along(type),function(x){
  tmp <- diffPeaks %>% dplyr::filter(type == types[x]) %>%
    dplyr::select(seqnames,start,end)

  write.table(tmp,file = paste0(types[x],".bed"),
              quote = F,col.names = F,row.names = F,sep = "\t")
})

把上下调和不变的 peak 画个 profile 图看看:

# plot diff peaks region signal
computeMatrix reference-point --referencePoint TSS \
                              -R ../5.peak-data/sigUp.bed ../5.peak-data/sigDown.bed ../5.peak-data/nonSig.bed \
                              -b 3000 -a 3000 \
                              --missingDataAsZero \
                              -S CTL_cb.bigwig siNCoR_cb.bigwig \
                              --skipZeros -p 15 \
                              -o matrix_diff.gz

plotHeatmap -m matrix_diff.gz --colorList 'white,#0066CC' --heatmapHeight 12 --heatmapWidth 6 -o diffPeaks.pdf

DiffChIPL 差异分析

setwd("~/JunJun/8.atacSeq_proj/5.peak-data/")

# install DiffChIPL
# pacman::p_install_gh("yancychy/DiffChIPL")
library(DiffChIPL)


# SampleID Tissue Factor Condition Treatment Replicate bamReads ControlID bamControl Peaks PeakCaller
meta <- data.frame(SampleID = paste0(rep(c("CTL","siNCoR"),each = 3),c(1:3)),
                   Tissue = NA,
                   Factor = "atac",
                   Condition = rep(c("control","treat"),each = 3),
                   Treatment = rep(c("control","treat"),each = 3),
                   Replicate = rep(c(1,2,3),2),
                   bamReads = list.files("../2.map-data",pattern = ".shift.sorted.bam$",full.names = T),
                   ControlID = NA,
                   bamControl = NA,
                   Peaks = list.files(".",pattern = ".narrowPeak",full.names = T),
                   PeakCaller = "narrow")

# output
write.csv(meta,file = "DiffChIPL_meta.csv",row.names = F)

# ==============================================================================
# 1_Get read count
# ==============================================================================

countL = getReadCount(inputF = "DiffChIPL_meta.csv",
                      overlap = FALSE,
                      fragment = 0,
                      removeBackground = TRUE)

# ==============================================================================
# 2_Make the design matrix and normalization
# ==============================================================================
group= c(0,0,0,1,1,1)
ctrName = "control"
treatName = "treat"
groupName = rep(c(ctrName,treatName),each = 3)
design0 <- cbind(rep(1, 6), group)
colnames(design0) <- c(ctrName, treatName)
design0

#      control treat
# [1,]       1     0
# [2,]       1     0
# [3,]       1     0
# [4,]       1     1
# [5,]       1     1
# [6,]       1     1

countAll <- countL$countAll

for(i in 1:ncol(countAll)){
  id = which(countAll[,i] < 1)
  countAll[id,i] = 0
}

cpmD = cpmNorm(countAll, libsize = countL$fd$lsIP)

# write.csv(cpmD,file = "cpmD.csv",row.names = F)

# ==============================================================================
# 3_Do differential analysis with DiffChIPL
# ==============================================================================
resA = DiffChIPL(cpmD, design0, group0 = group)
# fitRlimm3 = resA$fitDiffL

rtRlimm3 <- resA$resDE %>%
  mutate(type = case_when(logFC >= 1 & adj.P.Val < 0.05 ~ "sigUp",
                          logFC <= -1 & adj.P.Val < 0.05 ~ "sigDown",
                          .default = "nonSig"))

# check
head(rtRlimm3,3)
#                         logFC   AveExpr         t      P.Value   adj.P.Val            B   type
# chr1_752456_752814  0.9113821 0.6859282  1.796055 0.0724863331 0.146245882 -4.907984089 nonSig
# chr1_756889_757238 -0.7992602 0.9601723 -1.547512 0.1217404297 0.219235256 -5.313989648 nonSig
# chr1_762153_763224  1.1434761 2.5288585  3.644235 0.0002682122 0.001128709  0.004337593  sigUp

table(rtRlimm3$type)
# nonSig sigDown   sigUp
#  93102     753   50758

rtRlimm3$id <- rownames(rtRlimm3)

# export all diff
write.csv(rtRlimm3,file = paste('1.Diff-data/DiffChIPL_peak_diffAll.csv',sep = '') ,row.names = F)

这里我们看看 DiffChIPL 和 DiffBind 找到的差异 peaks 的重叠性怎么样:

# ==============================================================================
# overlap with diffbind results
# ==============================================================================
sig_peak <- rtRlimm3 %>% dplyr::filter(type != "nonSig") %>%
  separate_wider_delim(id,delim = "_",names = c("chr","start","end"),too_many = "merge") %>%
  dplyr::select(chr,start,end) %>% head(n = 51468) %>% GRanges()

diffPeaks <- read.csv("1.Diff-data/diffbind_peak_diffAll.csv") %>%
  dplyr::filter(type != "nonSig") %>% GRanges()

# overlap
ChIPpeakAnno::makeVennDiagram(Peaks = list(sig_peak,diffPeaks),
                              NameOfPeaks = c("DiffChIPL","DiffBind"),
                              fill = c("#D55E00", "#0072B2"))

可以看到大部分是一致的,但是 DiffBind 数量更多一些,可能是默认调用的 DESeq2 来寻找差异的, dba.analyze 的 method 可以换成 edgeR 都试试看。

footprinting 分析

❝
转录因子结合在 DNA 上会阻碍 Tn5 酶的切割,就会形成下面这样的小区域,成为足迹(footprint),因此我们可以查看转录因子在全基因组上的结合状态。我们使用 rbt hint 来进行转录因子分析。

文献:

安装:

# installing RGT
$ pip install cython numpy scipy
$ pip install RGT

配置基因组文件:

# install all the necessary human genome (hg19) data sets
$ cd ~/rgtdata
$ python setupGenomicData.py --hg19

我们获得一个总的 peaks 文件和 bam 文件,来处理生物学重复样本:

# using macs3 to call peaks
for i in CTL siNCoR
do
  macs3 callpeak -f BAMPE \
                 -t ../2.map-data/${i}_1.shift.sorted.bam \
                 ../2.map-data/${i}_2.shift.sorted.bam \
                 ../2.map-data/${i}_3.shift.sorted.bam \
                 -g hs -n ${i} -q 0.01 --keep-dup all
done

# merge replicates bam files
samtools merge -@ 12 -o CTL_cb.bam \
                ../2.map-data/CTL_1.shift.sorted.bam \
                ../2.map-data/CTL_2.shift.sorted.bam \
                ../2.map-data/CTL_3.shift.sorted.bam

samtools merge -@ 12 -o siNCoR_cb.bam \
                ../2.map-data/siNCoR_1.shift.sorted.bam \
                ../2.map-data/siNCoR_2.shift.sorted.bam \
                ../2.map-data/siNCoR_3.shift.sorted.bam

footprinting analysis:

# call footprints
rgt-hint footprinting -h

for i in CTL siNCoR
do
  rgt-hint footprinting --atac-seq --paired-end \
                        --organism=hg19 \
                        --output-location=./ --output-prefix=${i} \
                        ${i}_cb.bam \
                        ${i}_peaks.narrowPeak
done

# finding motifs overlapping with predicted footprints
rgt-motifanalysis matching --help

rgt-motifanalysis matching --organism=hg19 --input-files CTL.bed siNCoR.bed

然后进行差异转录因子分析:

# generate average ATAC-seq profiles around binding sites of particular TF
rgt-hint differential -h

rgt-hint differential --organism=hg19 \
                      --bc --nc 12 \
                      --mpbs-files=./match/CTL_mpbs.bed,./match/siNCoR_mpbs.bed \
                      --reads-files=CTL_cb.bam,siNCoR_cb.bam \
                      --conditions=CTL,siNCoR \
                      --output-location=./CTL_siNCoR

差异转录因子散点图:

我们选择上面两个查看在全基因组的结合情况, 可以看到一个结合增强,一个减弱:

结尾

❝
路漫漫其修远兮,吾将上下而求索。

欢迎加入生信交流群。加我微信我也拉你进 微信群聊 老俊俊生信交流群 (微信交流群需收取 20 元入群费用,一旦交费,拒不退还!(防止骗子和便于管理)) 。QQ 群可免费加入, 记得进群按格式修改备注哦。

老俊俊微信：

知识星球：

往期回顾目录

❝
ATAC-seq 分析方法一览locuszoomr 可视化 GWAS 结果2024 了,听说你还在用 dplyr 处理数据?计算 motif 到 peak center 的距离写个包 metaplot 绘制 m6A peak 分布当 ggSCvis 遇到 ggmagnify在模仿中精进数据可视化_使用ggcirclize绘制circos圈图单细胞二维密度图一文拿捏!coord_polar2 补充极坐标图形短板ggplot 绘制热图标注特定名称

AIBioPioneer

分享人工智能，生物信息与植物生理的点点滴滴