引言
❝ATAC-seq( assay for transposase-accessible chromatin using sequencing ) 具体是什么我就不再叙述了,大家可以看看网上的介绍,我们这次来下载文献里的数据来实操熟悉一下具体流程。
不同 SEQ 的区别:
ATAC-seq:
流程分析步骤:
数据下载
GSE242671 6 个样本,CTL 对照组和 siNCoR 处理组:
看看数据处理的方式,作为参考:
我们复制 PRJNA1014001 去 SRA Explorer 下载就行了:
#!/usr/bin/env bash
ascp -QT -l 300m -P33001 -i $HOME/.aspera/connect/etc/asperaweb_id_dsa.openssh era-fasp@fasp.sra.ebi.ac.uk:vol1/fastq/SRR261/035/SRR26124135/SRR26124135_1.fastq.gz . && mv SRR26124135_1.fastq.gz SRR26124135_GSM7766625_CTL_1_Homo_sapiens_ATAC-seq_1.fastq.gz
ascp -QT -l 300m -P33001 -i $HOME/.aspera/connect/etc/asperaweb_id_dsa.openssh era-fasp@fasp.sra.ebi.ac.uk:vol1/fastq/SRR261/035/SRR26124135/SRR26124135_2.fastq.gz . && mv SRR26124135_2.fastq.gz SRR26124135_GSM7766625_CTL_1_Homo_sapiens_ATAC-seq_2.fastq.gz
ascp -QT -l 300m -P33001 -i $HOME/.aspera/connect/etc/asperaweb_id_dsa.openssh era-fasp@fasp.sra.ebi.ac.uk:vol1/fastq/SRR261/032/SRR26124132/SRR26124132_1.fastq.gz . && mv SRR26124132_1.fastq.gz SRR26124132_GSM7766628_siNCoR_1_Homo_sapiens_ATAC-seq_1.fastq.gz
ascp -QT -l 300m -P33001 -i $HOME/.aspera/connect/etc/asperaweb_id_dsa.openssh era-fasp@fasp.sra.ebi.ac.uk:vol1/fastq/SRR261/032/SRR26124132/SRR26124132_2.fastq.gz . && mv SRR26124132_2.fastq.gz SRR26124132_GSM7766628_siNCoR_1_Homo_sapiens_ATAC-seq_2.fastq.gz
ascp -QT -l 300m -P33001 -i $HOME/.aspera/connect/etc/asperaweb_id_dsa.openssh era-fasp@fasp.sra.ebi.ac.uk:vol1/fastq/SRR261/033/SRR26124133/SRR26124133_1.fastq.gz . && mv SRR26124133_1.fastq.gz SRR26124133_GSM7766627_CTL_3_Homo_sapiens_ATAC-seq_1.fastq.gz
ascp -QT -l 300m -P33001 -i $HOME/.aspera/connect/etc/asperaweb_id_dsa.openssh era-fasp@fasp.sra.ebi.ac.uk:vol1/fastq/SRR261/033/SRR26124133/SRR26124133_2.fastq.gz . && mv SRR26124133_2.fastq.gz SRR26124133_GSM7766627_CTL_3_Homo_sapiens_ATAC-seq_2.fastq.gz
ascp -QT -l 300m -P33001 -i $HOME/.aspera/connect/etc/asperaweb_id_dsa.openssh era-fasp@fasp.sra.ebi.ac.uk:vol1/fastq/SRR261/030/SRR26124130/SRR26124130_1.fastq.gz . && mv SRR26124130_1.fastq.gz SRR26124130_GSM7766630_siNCoR_3_Homo_sapiens_ATAC-seq_1.fastq.gz
ascp -QT -l 300m -P33001 -i $HOME/.aspera/connect/etc/asperaweb_id_dsa.openssh era-fasp@fasp.sra.ebi.ac.uk:vol1/fastq/SRR261/030/SRR26124130/SRR26124130_2.fastq.gz . && mv SRR26124130_2.fastq.gz SRR26124130_GSM7766630_siNCoR_3_Homo_sapiens_ATAC-seq_2.fastq.gz
ascp -QT -l 300m -P33001 -i $HOME/.aspera/connect/etc/asperaweb_id_dsa.openssh era-fasp@fasp.sra.ebi.ac.uk:vol1/fastq/SRR261/031/SRR26124131/SRR26124131_1.fastq.gz . && mv SRR26124131_1.fastq.gz SRR26124131_GSM7766629_siNCoR_2_Homo_sapiens_ATAC-seq_1.fastq.gz
ascp -QT -l 300m -P33001 -i $HOME/.aspera/connect/etc/asperaweb_id_dsa.openssh era-fasp@fasp.sra.ebi.ac.uk:vol1/fastq/SRR261/031/SRR26124131/SRR26124131_2.fastq.gz . && mv SRR26124131_2.fastq.gz SRR26124131_GSM7766629_siNCoR_2_Homo_sapiens_ATAC-seq_2.fastq.gz
ascp -QT -l 300m -P33001 -i $HOME/.aspera/connect/etc/asperaweb_id_dsa.openssh era-fasp@fasp.sra.ebi.ac.uk:vol1/fastq/SRR261/034/SRR26124134/SRR26124134_1.fastq.gz . && mv SRR26124134_1.fastq.gz SRR26124134_GSM7766626_CTL_2_Homo_sapiens_ATAC-seq_1.fastq.gz
ascp -QT -l 300m -P33001 -i $HOME/.aspera/connect/etc/asperaweb_id_dsa.openssh era-fasp@fasp.sra.ebi.ac.uk:vol1/fastq/SRR261/034/SRR26124134/SRR26124134_2.fastq.gz . && mv SRR26124134_2.fastq.gz SRR26124134_GSM7766626_CTL_2_Homo_sapiens_ATAC-seq_2.fastq.gz
为了方便,我自己重新命名了一下:
质控
用 fastqc 和 multiqc 软件,看一下序列有没有接头:
没有就直接进行比对。
比对及数据处理
❝ATAC-seq 需要对比对后的文件进行过滤,包括 去除低质量序列, 去除线粒体序列, 去除 PCR 重复, 去除 blacklist 区域的序列, 对序列位置进行矫正,正链+4,负链-5。最终得到 shifted.bam 文件。
blacklist 文件下载:
https://github.com/Boyle-Lab/Blacklist/tree/master/lists
$ wget https://github.com/Boyle-Lab/Blacklist/blob/master/lists/hg19-blacklist.v2.bed.gz
$ gunzip hg19-blacklist.v2.bed.gz
写一个脚本,一起完成这些事情:
#!/bin/bash
for i in CTL_1 CTL_2 CTL_3 siNCoR_1 siNCoR_2 siNCoR_3
do
bowtie2 -p 15 -x ~/index-data/hg19_bw2/hg19 \
--very-sensitive --no-mixed --no-discordant -X 2000 \
-1 ../0.raw-data/${i}_1.fastq.gz \
-2 ../0.raw-data/${i}_2.fastq.gz \
| samtools view -bS | samtools sort -@ 6 -o ${i}.sorted.bam
# remove mitochondrial reads and low-quality alignments
samtools view -h -q 30 ${i}.sorted.bam \
| grep -v chrM \
| samtools sort -o ${i}.rmchrM.bam
# remove duplicates
sambamba markdup --remove-duplicates --nthreads 10 ${i}.rmchrM.bam ${i}.rmDup.bam
# remove ENCODE blacklist regions
bedtools intersect -nonamecheck -v -abam ${i}.rmDup.bam \
-b hg19-blacklist.v2.bed > ${i}.tmp.bam
# sort and index the bam file
samtools sort -@ 10 -o ${i}.bf.bam ${i}.tmp.bam
samtools index -@ 10 ${i}.bf.bam
rm ${i}.tmp.bam
# shift read coordinates
alignmentSieve --numberOfProcessors 15 \
--ATACshift \
--blackListFileName hg19-blacklist.v2.bed \
--bam ${i}.bf.bam -o ${i}.shifted.bam
# sort and index
samtools sort -@ 10 -o ${i}.shift.sorted.bam ${i}.shifted.bam
samtools index -@ 10 ${i}.shift.sorted.bam
done
bam 转 bigwig/bedgraph 可视化
❝对上面最终得到的 bam 文件转换为 bigwig/bedgraph 可以在 igv 里面查看。此外还可以在全基因水平上查看在 TSS/TES 出的富集情况。
#!/bin/bash
for i in CTL_1 CTL_2 CTL_3 siNCoR_1 siNCoR_2 siNCoR_3
do
# bam2bigwig
bamCoverage -p 15 --outFileFormat bigwig \
--binSize 50 --normalizeUsing BPM \
-b ../2.map-data/${i}.shift.sorted.bam \
-o ${i}.bw
#bam2bedgraph
bamCoverage -p 15 --outFileFormat bedgraph \
--binSize 50 --normalizeUsing BPM \
-b ../2.map-data/${i}.shift.sorted.bam \
-o ${i}.bdg
done
❝对于多个生物学重复样本,我们可以合并然后出图,这样可以不用展示很多图,尤其是样本数量多有生物学重复的情况,如果不多也可以都画出来。这里我们使用 macs3 cmbreps 来用均值合并。
# merge replicate of bedgraph files
$ macs3 cmbreps -i CTL_1.bdg CTL_2.bdg CTL_3.bdg -o CTL_cb.bedGraph --method mean
$ macs3 cmbreps -i siNCoR_1.bdg siNCoR_2.bdg siNCoR_3.bdg -o siNCoR_cb.bedGraph --method mean
# bedgraph to bigwig
$ conda install ucsc-bedgraphtobigwig
$ wget https://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/hg19.chrom.sizes
$ bedGraphToBigWig CTL_cb.bedGraph hg19.chrom.sizes CTL_cb.bigwig
$ bedGraphToBigWig siNCoR_cb.bedGraph hg19.chrom.sizes siNCoR_cb.bigwig
下载注释文件:
$ wget https://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/genes/hg19.ncbiRefSeq.gtf.gz
$ gunzip hg19.ncbiRefSeq.gtf.gz
绘图:
# =============================================================================================
# heatmap plot
# =============================================================================================
computeMatrix scale-regions -R ~/index-data/hg19.ncbiRefSeq.gtf \
-b 3000 -a 3000 \
--regionBodyLength 5000 \
--missingDataAsZero \
-S CTL_cb.bigwig siNCoR_cb.bigwig \
--skipZeros -p 15 \
--metagene \
-o matrix_sr.gz
computeMatrix reference-point --referencePoint TSS \
-R ~/index-data/hg19.ncbiRefSeq.gtf \
-b 3000 -a 3000 \
--missingDataAsZero \
-S CTL_cb.bigwig siNCoR_cb.bigwig \
--skipZeros -p 15 \
--metagene \
-o matrix_rp.gz
plotHeatmap -m matrix_sr.gz --colorList 'white,#0066CC' --heatmapHeight 12 --heatmapWidth 6 -o sr.pdf
plotHeatmap -m matrix_rp.gz --colorList 'white,#0066CC' --heatmapHeight 12 --heatmapWidth 6 -o rp.pdf
TSS-TES 分布:
TSS 分布:
ATAC 质控
❝片段长度分布是有特点的,一般 100bp 以内位 NFR 无核小体区, 一个核小体(200bp),两个核小体(400bp), 三个核小体(600bp)区域等等。我们用 ATACseqQC 来质控。
setwd("~/JunJun/8.atacSeq_proj/4.atacQC-data/")
# BiocManager::install("ATACseqQC")
library(ATACseqQC)
library(ggplot2)
library(dplyr)
# bamfile files
bamfile <- list.files("../2.map-data",pattern = "shift.sorted.bam$",full.names = T)
bamlabel <- sapply(strsplit(bamfile,split = "../2.map-data/|.shift.sorted.bam"),"[",2)
# Estimate the library complexity
estimateLibComplexity(readsDupFreq(bamfile[1]))
################################################################################
# generate fragement size distribution
################################################################################
x = 1
lapply(seq_along(bamfile),function(x){
fragSize <- fragSizeDist(bamFiles = bamfile[x], bamFiles.labels = bamlabel[x])
# replot
df <- data.frame(fragSize)
colnames(df) <- c("x","y")
df$sample <- bamlabel[x]
df$y <- df$y/sum(df$y)
return(df)
}) %>% do.call("rbind",.) -> fz_data
# plot 6x10
ggplot(data = fz_data,aes(x = as.numeric(x),y = y*10^3)) +
geom_line() +
theme_bw() +
theme(panel.grid = element_blank(),
strip.text = element_text(face = "bold.italic")) +
facet_wrap(~sample,ncol = 3,scales = "free") +
scale_x_continuous(limits = c(0,1000),breaks = c(0,200,400,600,800,1000)) +
xlab("Fragment length (bp)") +
ylab("Normalized read density(10^-3)")
PeakCalling
❝一般用 macs 的比较多,但对于具体参数不同文章也不一样, 如果你使用 nomodel --shift -100 --extsize 200 参数就是寻找 Tn5 酶的切割位点, 对于双端数据,我们不用就行。在 macs3.0 版本中,作者将另外一款专门针对 ATAC-seq 的 callpeak 软件 HMMRATAC 也纳入进来了。
我们可以都试一下(macs3 hmmratac 慢很多):
#!/bin/bash
for i in CTL_1 CTL_2 CTL_3 siNCoR_1 siNCoR_2 siNCoR_3
do
# using macs3 to call peaks
macs3 callpeak -f BAMPE -t ../2.map-data/${i}.shift.sorted.bam \
-g hs -n ${i} -B -q 0.01 --keep-dup all
# using hmmratac to call peaks
macs3 hmmratac -i ../2.map-data/${i}.shift.sorted.bam \
-n ${i} --format BAMPE \
--outdir hmmratac_peaks
done
差异 Peaks 分析
❝差异 peak 的分析软件有多种, 基于 peak set-based 和 sliding window, 一种最原始的方法就是对 peak 区域进行计数获得 count,然后使用差异软件进行差异分析。我们使用: 原始计数+DESeq2, DiffBind 和 DiffChIPL 来看看结果。
一些差异 peak 分析软件:
DiffChIPL 是一款 2022 年发表在 Bioinformatics 的差异分析软件:
原始计数+DESeq2
❝首先我们需要对生物学重复间的 peak 取交集,然后合并不同分组的 peaks,然后进行计数获得 count 矩阵,最后用 DESeq2 差异分析。
# control peaks replicate overlap
bedtools intersect -a CTL_1_peaks.narrowPeak -b CTL_2_peaks.narrowPeak | bedtools intersect -a stdin -b CTL_3_peaks.narrowPeak | awk '{print $1"\t"$2"\t"$3}' > CTL_overlap.bed
# treat peaks replicate overlap
bedtools intersect -a siNCoR_1_peaks.narrowPeak -b siNCoR_2_peaks.narrowPeak | bedtools intersect -a stdin -b siNCoR_3_peaks.narrowPeak | awk '{print $1"\t"$2"\t"$3}' > siNCoR_overlap.bed
# merge control and treat peaks
cat CTL_overlap.bed siNCoR_overlap.bed | sort -k1,1 -k2,2n > CTL_siNCoR.bed
bedtools merge -i CTL_siNCoR.bed > CTL_siNCoR.merge.bed
计数:
# make count matrix
bedtools multicov -bams ../2.map-data/*.shift.sorted.bam -bed CTL_siNCoR.merge.bed > peak_count_matrix.txt
差异分析:
setwd("~/JunJun/8.atacSeq_proj/5.peak-data/")
library(DESeq2)
library(dplyr)
# load data
raw_count <- read.table('peak_count_matrix.txt')
raw_count$id <- paste(raw_count$V1,raw_count$V2,raw_count$V3,sep = "|")
raw_count <- raw_count %>% dplyr::select(-V1,-V2,-V3) %>%
tibble::column_to_rownames(var = "id")
sample <- list.files("../2.map-data/",pattern = "*.shift.sorted.bam$")
sample_name <- sapply(strsplit(sample,split = ".shift.sorted.bam"),"[",1)
colnames(raw_count) <- sample_name
# save folder
dir.create('1.Diff-data/')
# meta data
coldata <- data.frame(condition = factor(rep(c('control', 'treat'), each = 3),
levels = c('control', 'treat')))
# DESeqDataSet
dds <- DESeqDataSetFromMatrix(countData = raw_count,
colData = coldata,
design= ~condition)
# diff
dds <- DESeq(dds)
norm <- counts(dds,normalized = T)
# get results
res <- results(dds, contrast = c('condition', 'treat', 'control'))
res <- cbind(norm,res)
# filter NA
res_all <- as.data.frame(res) %>% filter(log2FoldChange != 'NA')
# add annotation
res_all$id <- rownames(res_all)
res_all <- res_all %>% separate_wider_delim(id,delim = "|",names = c("chr","start","end"))
# add sig type
res_all_anno <- res_all |>
mutate(type = case_when(log2FoldChange >= 1 & pvalue < 0.05 ~ "sigUp",
log2FoldChange <= -1 & pvalue < 0.05 ~ "sigDown",
.default = "nonSig"))
# check
table(res_all_anno$type)
# nonSig sigDown sigUp
# 58446 5962 3449
# export all diff
write.csv(res_all_anno,file = paste('1.Diff-data/peak_diffAll.csv',sep = '') ,row.names = F)
colSums(raw_count)
# CTL_1 CTL_2 CTL_3 siNCoR_1 siNCoR_2 siNCoR_3
# 4115852 3178135 3737711 16961289 23533595 35302961
❝思考: 这里有个问题, 我们对于一个组有明显富集的 peak 和一个比较弱的富集的 peak 进行计数,那么它们总的 reads 数量肯定相差很大,而 DESeq2 需要对每个样本的文库大小进行标准化,那么富集多的那组很多 peak 经过标准化则会变成下调的相反结果,事实上我看了很多确实是这样的,所以并不推荐这种做法去找差异的 peak。
DiffBind 差异分析
# BiocManager::install("DiffBind")
library(DiffBind)
# SampleID Tissue Factor Condition Treatment Replicate bamReads ControlID bamControl Peaks PeakCaller
meta <- data.frame(SampleID = paste0(rep(c("CTL","siNCoR"),each = 3),c(1:3)),
Tissue = NA,
Factor = "atac",
Condition = rep(c("control","treat"),each = 3),
Treatment = rep(c("control","treat"),each = 3),
Replicate = rep(c(1,2,3),2),
bamReads = list.files("../2.map-data",pattern = ".shift.sorted.bam$",full.names = T),
ControlID = NA,
bamControl = NA,
Peaks = list.files(".",pattern = ".narrowPeak",full.names = T),
PeakCaller = "narrow")
tamoxifen <-dba(sampleSheet = meta)
plot(tamoxifen)
tamoxifen <- dba.count(tamoxifen,summits = 100)
dba.plotPCA(tamoxifen,attributes = DBA_CONDITION,label = DBA_ID)
# =======================================================
# show libsizes
info <- dba.show(tamoxifen)
libsizes <- cbind(LibReads=info$Reads, FRiP=info$FRiP,
PeakReads=round(info$Reads * info$FRiP))
rownames(libsizes) <- info$ID
libsizes
# LibReads FRiP PeakReads
# CTL1 17653370 0.09 1588803
# CTL2 28665764 0.05 1433288
# CTL3 28460519 0.05 1423026
# siNCoR1 35846252 0.18 6452325
# siNCoR2 32999878 0.27 8909967
# siNCoR3 42458342 0.32 13586670
tamoxifen <- dba.normalize(tamoxifen)
tamoxifen <- dba.contrast(tamoxifen,contrast = c("Condition","treat","control"))
dba.show(DBA = tamoxifen,bContrasts = T)
# test <-dba.contrast(tamoxifen,reorderMeta = list(Condition = "control"))
# dba.show(DBA = test,bContrasts = T)
tamoxifen <- dba.analyze(tamoxifen)
diffPeaks <- dba.report(tamoxifen,th = 1) %>%
data.frame() %>%
mutate(type = case_when(Fold >= 1 & FDR < 0.05 ~ "sigUp",
Fold <= -1 & FDR < 0.05 ~ "sigDown",
.default = "nonSig"))
# check
table(diffPeaks$type)
# nonSig sigDown sigUp
# 12928 3555 75505
# export all diff
write.csv(diffPeaks,file = paste('1.Diff-data/diffbind_peak_diffAll.csv',sep = '') ,row.names = F)
可以看到显著上调的 peak 比下调的多了一个数量级,也和我们的 TSS heatmap profile 比较像。
样本相关性:
PCA:
注释一下差异的 peak:
# ==============================================================================
# annotate for peaks
# ==============================================================================
library(ChIPseeker)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
library(org.Hs.eg.db)
sigdiff <- diffPeaks %>% dplyr::filter(type == "sigUp") %>%
GRanges()
peakAnno <- annotatePeak(sigdiff, tssRegion = c(-3000, 3000),
TxDb = TxDb.Hsapiens.UCSC.hg19.knownGene)
peakAnno_df <- data.frame(peakAnno)
# id transformation
ids <- mapIds(org.Hs.eg.db, keys = peakAnno_df$geneId, keytype = "ENTREZID", column = "SYMBOL")
peakAnno_df$GeneName <- ids
# output
write.csv(peakAnno_df, '1.Diff-data/diffbind_sigPeak_anno.csv', row.names = FALSE)
# plot
plotAnnoBar(peakAnno)
输出把上下调和不变的 peak:
# ==============================================================================
# output peaks with types
# ==============================================================================
diffPeaks <- read.csv("1.Diff-data/diffbind_peak_diffAll.csv")
types <- c("sigUp","sigDown","nonSig")
# x = 2
lapply(seq_along(type),function(x){
tmp <- diffPeaks %>% dplyr::filter(type == types[x]) %>%
dplyr::select(seqnames,start,end)
write.table(tmp,file = paste0(types[x],".bed"),
quote = F,col.names = F,row.names = F,sep = "\t")
})
把上下调和不变的 peak 画个 profile 图看看:
# plot diff peaks region signal
computeMatrix reference-point --referencePoint TSS \
-R ../5.peak-data/sigUp.bed ../5.peak-data/sigDown.bed ../5.peak-data/nonSig.bed \
-b 3000 -a 3000 \
--missingDataAsZero \
-S CTL_cb.bigwig siNCoR_cb.bigwig \
--skipZeros -p 15 \
-o matrix_diff.gz
plotHeatmap -m matrix_diff.gz --colorList 'white,#0066CC' --heatmapHeight 12 --heatmapWidth 6 -o diffPeaks.pdf
DiffChIPL 差异分析
setwd("~/JunJun/8.atacSeq_proj/5.peak-data/")
# install DiffChIPL
# pacman::p_install_gh("yancychy/DiffChIPL")
library(DiffChIPL)
# SampleID Tissue Factor Condition Treatment Replicate bamReads ControlID bamControl Peaks PeakCaller
meta <- data.frame(SampleID = paste0(rep(c("CTL","siNCoR"),each = 3),c(1:3)),
Tissue = NA,
Factor = "atac",
Condition = rep(c("control","treat"),each = 3),
Treatment = rep(c("control","treat"),each = 3),
Replicate = rep(c(1,2,3),2),
bamReads = list.files("../2.map-data",pattern = ".shift.sorted.bam$",full.names = T),
ControlID = NA,
bamControl = NA,
Peaks = list.files(".",pattern = ".narrowPeak",full.names = T),
PeakCaller = "narrow")
# output
write.csv(meta,file = "DiffChIPL_meta.csv",row.names = F)
# ==============================================================================
# 1_Get read count
# ==============================================================================
countL = getReadCount(inputF = "DiffChIPL_meta.csv",
overlap = FALSE,
fragment = 0,
removeBackground = TRUE)
# ==============================================================================
# 2_Make the design matrix and normalization
# ==============================================================================
group= c(0,0,0,1,1,1)
ctrName = "control"
treatName = "treat"
groupName = rep(c(ctrName,treatName),each = 3)
design0 <- cbind(rep(1, 6), group)
colnames(design0) <- c(ctrName, treatName)
design0
# control treat
# [1,] 1 0
# [2,] 1 0
# [3,] 1 0
# [4,] 1 1
# [5,] 1 1
# [6,] 1 1
countAll <- countL$countAll
for(i in 1:ncol(countAll)){
id = which(countAll[,i] < 1)
countAll[id,i] = 0
}
cpmD = cpmNorm(countAll, libsize = countL$fd$lsIP)
# write.csv(cpmD,file = "cpmD.csv",row.names = F)
# ==============================================================================
# 3_Do differential analysis with DiffChIPL
# ==============================================================================
resA = DiffChIPL(cpmD, design0, group0 = group)
# fitRlimm3 = resA$fitDiffL
rtRlimm3 <- resA$resDE %>%
mutate(type = case_when(logFC >= 1 & adj.P.Val < 0.05 ~ "sigUp",
logFC <= -1 & adj.P.Val < 0.05 ~ "sigDown",
.default = "nonSig"))
# check
head(rtRlimm3,3)
# logFC AveExpr t P.Value adj.P.Val B type
# chr1_752456_752814 0.9113821 0.6859282 1.796055 0.0724863331 0.146245882 -4.907984089 nonSig
# chr1_756889_757238 -0.7992602 0.9601723 -1.547512 0.1217404297 0.219235256 -5.313989648 nonSig
# chr1_762153_763224 1.1434761 2.5288585 3.644235 0.0002682122 0.001128709 0.004337593 sigUp
table(rtRlimm3$type)
# nonSig sigDown sigUp
# 93102 753 50758
rtRlimm3$id <- rownames(rtRlimm3)
# export all diff
write.csv(rtRlimm3,file = paste('1.Diff-data/DiffChIPL_peak_diffAll.csv',sep = '') ,row.names = F)
这里我们看看 DiffChIPL 和 DiffBind 找到的差异 peaks 的重叠性怎么样:
# ==============================================================================
# overlap with diffbind results
# ==============================================================================
sig_peak <- rtRlimm3 %>% dplyr::filter(type != "nonSig") %>%
separate_wider_delim(id,delim = "_",names = c("chr","start","end"),too_many = "merge") %>%
dplyr::select(chr,start,end) %>% head(n = 51468) %>% GRanges()
diffPeaks <- read.csv("1.Diff-data/diffbind_peak_diffAll.csv") %>%
dplyr::filter(type != "nonSig") %>% GRanges()
# overlap
ChIPpeakAnno::makeVennDiagram(Peaks = list(sig_peak,diffPeaks),
NameOfPeaks = c("DiffChIPL","DiffBind"),
fill = c("#D55E00", "#0072B2"))
可以看到大部分是一致的,但是 DiffBind 数量更多一些,可能是默认调用的 DESeq2 来寻找差异的, dba.analyze 的 method 可以换成 edgeR 都试试看。
footprinting 分析
❝转录因子结合在 DNA 上会阻碍 Tn5 酶的切割,就会形成下面这样的小区域,成为足迹(footprint),因此我们可以查看转录因子在全基因组上的结合状态。我们使用 rbt hint 来进行转录因子分析。
文献:
安装:
# installing RGT
$ pip install cython numpy scipy
$ pip install RGT
配置基因组文件:
# install all the necessary human genome (hg19) data sets
$ cd ~/rgtdata
$ python setupGenomicData.py --hg19
我们获得一个总的 peaks 文件和 bam 文件,来处理生物学重复样本:
# using macs3 to call peaks
for i in CTL siNCoR
do
macs3 callpeak -f BAMPE \
-t ../2.map-data/${i}_1.shift.sorted.bam \
../2.map-data/${i}_2.shift.sorted.bam \
../2.map-data/${i}_3.shift.sorted.bam \
-g hs -n ${i} -q 0.01 --keep-dup all
done
# merge replicates bam files
samtools merge -@ 12 -o CTL_cb.bam \
../2.map-data/CTL_1.shift.sorted.bam \
../2.map-data/CTL_2.shift.sorted.bam \
../2.map-data/CTL_3.shift.sorted.bam
samtools merge -@ 12 -o siNCoR_cb.bam \
../2.map-data/siNCoR_1.shift.sorted.bam \
../2.map-data/siNCoR_2.shift.sorted.bam \
../2.map-data/siNCoR_3.shift.sorted.bam
footprinting analysis:
# call footprints
rgt-hint footprinting -h
for i in CTL siNCoR
do
rgt-hint footprinting --atac-seq --paired-end \
--organism=hg19 \
--output-location=./ --output-prefix=${i} \
${i}_cb.bam \
${i}_peaks.narrowPeak
done
# finding motifs overlapping with predicted footprints
rgt-motifanalysis matching --help
rgt-motifanalysis matching --organism=hg19 --input-files CTL.bed siNCoR.bed
然后进行差异转录因子分析:
# generate average ATAC-seq profiles around binding sites of particular TF
rgt-hint differential -h
rgt-hint differential --organism=hg19 \
--bc --nc 12 \
--mpbs-files=./match/CTL_mpbs.bed,./match/siNCoR_mpbs.bed \
--reads-files=CTL_cb.bam,siNCoR_cb.bam \
--conditions=CTL,siNCoR \
--output-location=./CTL_siNCoR
差异转录因子散点图:
我们选择上面两个查看在全基因组的结合情况, 可以看到一个结合增强,一个减弱:
结尾
❝路漫漫其修远兮,吾将上下而求索。
欢迎加入生信交流群。加我微信我也拉你进 微信群聊 老俊俊生信交流群 (微信交流群需收取 20 元入群费用,一旦交费,拒不退还!(防止骗子和便于管理)) 。QQ 群可免费加入, 记得进群按格式修改备注哦。
老俊俊微信:
知识星球:
❝ATAC-seq 分析方法一览 locuszoomr 可视化 GWAS 结果 2024 了,听说你还在用 dplyr 处理数据? 计算 motif 到 peak center 的距离 写个包 metaplot 绘制 m6A peak 分布 当 ggSCvis 遇到 ggmagnify 在模仿中精进数据可视化_使用ggcirclize绘制circos圈图 单细胞二维密度图一文拿捏! coord_polar2 补充极坐标图形短板 ggplot 绘制热图标注特定名称