写在前面的话
接触了表观更加发现生物学是如此有趣!!记录一下学习表观遗传学的一些重点知识,包括分析流程和数据质控细节。
转录调控与相关测定技术
引自:Gasperini, M., Tome, J.M. & Shendure, J. Towards a comprehensive catalogue of validated and target-linked human enhancers. Nat Rev Genet 21, 292–310 (2020).
染⾊质可及性技术:MNase-seq,DNase-seq and ATAC-seq
TF结合位点的鉴定与histone mark的鉴定:TF Chip-seq,Histone modification Chip-seq到CUT&RUN,CUT&Tag and in situ ChIP
三维基因组测定技术:3C,4C,Hi-C mRNA定量技术:PRO-seq,RNA-seq
TF相关Chip-seq分析流程
试验过程
引自:Park, P. ChIP–seq: advantages and challenges of a maturing technology. Nat Rev Genet 10, 669–680 (2009).
转录因子的分类统计
引自:Lambert SA, Jolma A, Campitelli LF, Das PK, Yin Y, Albu M, Chen X, Taipale J, Hughes TR, Weirauch MT. The Human Transcription Factors. Cell. 2018 Feb 8;172(4):650-665.
转录因子的定义:(1)具有DNA结合能力(2)能够调控基因转录
分析流程
单样本双端为例
# QC -cutadapt
cutadapt -j 20 --times 1 -e 0.1 -O 3 --quality-cutoff 25 -m 36 -a AGATCGGAAGAGC -A AGATCGGAAGAGC -o {output.R1}.fq.gz {output.R2}.fq.gz {input.R1}.fq.gz {input.R2}.fq.gz > {log} 2>&1
# QC -trimgalore
trim_galore -q 20 --length 36 --max_n 3 --stringency 3 --fastqc --paired -o {output.R1}.fq.gz {output.R2}.fq.gz {input.R1}.fq.gz {input.R2}.fq.gz> {log} 2>&1
# mapping
构建hg38参考基因组索引文件后比对
bowtie2-build -t 6 ref_hg38.fa ref_hg38.fa > bt2_build.log 2>&1 &
bowtie2 -x ref_hg38.fa -p 20 -1 {input}.fq.gz -2 {input}.fq.gz -S {output}.sam > {log} 2>&1
# 转化bam文件
samtools view -t 20 -h -f bam -S -o {output}.bam {input}.sam
# bam文件排序
samtools sort -t 20 -m 50G -o {output}.bam --tmpdir={input}.bam
# Bam文件去重
java -Xms40g -Xmx40g -XX:ParallelGCThreads=20 -jar picard.jar MarkDuplicates I={input}.bam O={output}.bam M={output}.matrix ASO=coordinate REMOVE_DUPLICATES=true 2>{log}
# Bam文件质控
samtools view -t 20 -h -f bam -F "mapping_quality >= 20" -o {output}.bam {input}.bam
# 去除比对上线粒体的reads
samtools view -b -h {input}.bam ch1 chr2 chr3 chr4 chr5 chr6 ch7 chr8 chr9 chr10 chr11 chr12 chr13 ch14 chr15 chr16 chr17 chr18 chr19 chr20 chr21 chr22 >{output}.bam
# peak calling
macs2 callpeak -c {input.Input}.bam -t {input.Treat}.bam -f BAM -g hs --outdir {dir} -n {dir} --call-summits -m 5 50 -q 0.005 > {log} 2>&1
# peak annotation
annotatePeaks.pl {input}.bed hg38 -annStats ann_stats.log > {output}.tsv
# motif finding
bedtools getfasta {input}.bed {output}.fa
meme {input}.fa -oc {output} -p {} -dna -maxw 20
streme -oc {output} -p {input}.bed --dna --maxw 30 --nmotifs 15
# TOMTOM搜库
https://meme-suite.org/meme/tools/tomtom
多样本分析思路
# merge peak
cat *.narrowPeak | sort -k1,1 -k2,2n > all.merge.peak
# 去除Y染色体和线粒体
bedtools merge -i all.merge.peak | grep -v chrY -v chrM > all.RmchrY.peak
# 根据基因组排序bam
samtools faidx ref_hg38.fa ref_hg38.fa.fai
bedtools sort -i all.RmchrY.peak -g ref_hg38.fa.fai > all.RmchrY.sort.peak
# 计算每个peak的reads count
bedtools coverage -a all.RmchrY.sort.peak -b {input}.bam -sorted -g ref_hg38.fa.fai -counts > {input}.count_table
# 统计测序depth
samtools idxstats {input}.bam > {input}.tsv
# R语言计算cpm、rpkm以FC鉴定差异峰
# 利用DESeq2鉴定差异峰
Histone相关Chip-seq分析流程
Histone与转录调控
引自:Zhao Z, Shilatifard A. Epigenetic modifications of histones in cancer. Genome Biol. 2019 Nov 20;20(1):245.
Histone的writer,eraser和reader
引自:Zhao S, Allis CD, Wang GG. The language of chromatin modification in human cancers. Nat Rev Cancer. 2021 Jul;21(7):413-430.
分析流程
引自:https://www.encodeproject.org/chip-seq/histone-encode4/
#宽富集峰的鉴定
macs2 callpeak -c {input.Input}.bam -t {input.Treat}.bam -f BAM -g hs --min-length 500 --broad --outdir {dir} -n {dir} --call-summits --cutoff-analysis --keep-dup all -m 5 50 -q 0.005 > {log} 2>&1
Histone和TF的chip-seq结果图
引自:Park, P. ChIP–seq: advantages and challenges of a maturing technology. Nat Rev Genet 10, 669–680 (2009).
chromHMM的使用
Histone markers定义各element
引自:Zhou VW, Goren A, Bernstein BE. Charting histone modifications and the functional organization of mammalian genomes. Nat Rev Genet. 2011 Jan;12(1):7-18.Hsitone markers区分染色体状态
引自:Ernst, J., Kheradpour, P., Mikkelsen, T. et al. Mapping and analysis of chromatin state dynamics in nine human cell types. Nature 473, 43–49 (2011).
# 制作mark_table.tsv说明文件
第一列:参考基因组
第二列:组蛋白marker
第三列:组蛋白的bam文件
第四列:input的bam文件
# 制作hg38染色体长度文件
samtools view -H {input}.bam | grep LN --color=no | cut -f 2,3 | sed 's/SN://g' | sed 's/LN://g' > hg38_chr_length.txt
# chromHMM bin count
java -Xms80g -Xmx80g -XX:ParallelGCThreads=20 \
-jar ChromHMM.jar BinarizeBam \
-b 200 -f 2 -t {output} \
./hg38_chr_length.txt \
./histone_bam \
mark_table.tsv \
./out_count_info
# chromHMM LearnModel
java -Xms80g -Xmx80g -XX:ParallelGCThreads=20 \
-jar ChromHMM.jar LearnModel \
-b 200 -l hg38_chr_length.txt -p 20 \
out_count_info \
{output} \
15 \
hg38
CUT&RUN、CUT&Tag
技术原理
引自:Kaya-Okur, H.S., Janssens, D.H., Henikoff, J.G. et al. Efficient low-cost chromatin profiling with CUT&Tag. Nat Protoc 15, 3264–3283 (2020).
CUT&RUN、CUT&Tag是chip-seq的升级版本,优势在于低细胞起始量
对于Histone markers及强结合的TF,CUT&RUN、CUT&Tag的结果相对较好,但对于弱结合的蛋白,效果存疑。
分析流程
质控标准
1.能否通过fragment length计算出核小体的周期性重复
2.大肠杆菌参考基因组的比对率
3.final reads占原始reads的比例
4.与已有的chip-seq数据比较,peak是否一致
# QC -cutadapt
cutadapt -j 20 --times 1 -e 0.1 -O 3 --quality-cutoff 25 -m 22 -a AGATCGGAAGAGC -A AGATCGGAAGAGC -o {output.R1}.fq.gz {output.R2}.fq.gz {input.R1}.fq.gz {input.R2}.fq.gz > {log} 2>&1
# QC -trimgalore
trim_galore -q 20 --length 22 --max_n 3 --stringency 3 --fastqc --paired -o {output.R1}.fq.gz {output.R2}.fq.gz {input.R1}.fq.gz {input.R2}.fq.gz> {log} 2>&1
# mapping
## 大肠杆菌基因组 https://www.ncbi.nlm.nih.gov/nuccore/U00096.3
bowtie2 -p 12 -x Ecoli_K12.fa -1 {input.R1} -2 {input.R2} -S {output}.sam --very-sensitive --no-discordant --minins 10 --maxins 1000 --no-unal --un-conc-gz {output} > {log} 2>&1
## 人基因组
bowtie2 -p 12 -x ref_hg38.fa -1 {input.R1} -2 {input.R2} -S {output}.sam --very-sensitive --no-discordant --minins 10 --maxins 1000 --no-unal > {log} 2>&1
# align stats
samtools stats -@ 5 {input}.sam > {output}.mapping_stats
# Bam转化
samtools view -h -b -@ 10 -f 3 -F 12 -F 256 {input}.sam > {output}.bam
# Bam排序
samtools sort -t 12 -m 50G -o {output}.bam --tmpdir={input}.bam
# 去重
java -Xms40g -Xmx40g -XX:ParallelGCThreads=12 -jar picard.jar MarkDuplicates I={input}.bam O={output}.bam M={output}.matrix ASO=coordinate REMOVE_DUPLICATES=true 2>{log}
# Bam过滤
samtools view -t 12 -h -f bam -F "mapping_quality >= 20" -o {output}.bam {input}.bam
# align stats
samtools stats -@ 5 {input}.sam > {output}.sam
# 转化bigwig
bamCoverage --bam {input}.bam -o {output}.bigwig -of bigwig --scaleFactor 1 --binSize 1 -p 12 --normalizeUsing RPKM
# fragment length
java -Xms20g -Xmx20g -XX:ParallelGCThreads=12 -jar picard.jar CollectInsertSizeMetrics I={input}.bam O={output}.txt H={output}.pdf METRIC_ACCUMULATION_LEVEL=ALL_READS > {log} 2>&1
# peak calling
macs2 callpeak -c {input.Input}.bam -t {input.Treat}.bam -f BAM -g hs --keep-dup all -p 0.00005 --outdir {output} -n {output}
ATAC-seq
核糖体和染色质可及性
引自:Klemm, S.L., Shipony, Z. & Greenleaf, W.J. Chromatin accessibility and the regulatory epigenome. Nat Rev Genet 20, 207–220 (2019).
染色质可及性检测技术
引自:Tsompana, M., Buck, M.J. Chromatin accessibility: a window into the genome. Epigenetics & Chromatin 7, 33 (2014).
ATAC-seq技术原理
引自:Grandi, F.C., Modi, H., Kampman, L. et al. Chromatin accessibility profiling by ATAC-seq. Nat Protoc 17, 1518–1552 (2022).
ATAC-seq分析流程
数据质控指标
#1.线粒体基因组的占比
samtools idxstats {input}.bam > {input}.tsv
#2.fragment length的分布情况
java -Xms20g -Xmx20g -XX:ParallelGCThreads=10 \
-jar picard.jar CollectInsertSizeMetrics \
I={input}.bam \
O={output}.txt \
H=K562-ATACSeq-rep1.ENCFF534DCE.InsertSize.pdf \
METRIC_ACCUMULATION_LEVEL=ALL_READS
#3.peak的数目,peak中reads的占比
wc -l {input}.bam
## 计算peak中的数据
bedtools coverage \
-a {input}.narrowPeak \
-b {bam}.bam -sorted \
-g hg38_chr_length.txt \
-counts > {output}.PeakRegion.count_table
## total reads
samtools idxstats {input}.bam > {input}.tsv
#4.TSS附近的富集情况
TSS±2kb(平均覆盖depth)\全基因组覆盖depth
## make TSS region
UCSC table browser下载
R
library(tidyverse)
ucsc_gene_df <- read_tsv(file = "./UCSC_RefSeq_hg38.tsv")
## make max region gene
ucsc_gene_df.merge <- filter(
ucsc_gene_df,
grepl(pattern = "NM", x = name)
) %>% group_by(
chrom,name2
) %>% summarise(
chrom,start = min(txStart),end = max(txEnd),name2,score,strand
) %>% distinct () %>% select(
chrom,start,end,name2,score,strand
) %>% arrange(
chrom,start,end
)
write_tsv(ucsc_gene_df.merge, file = "./UCSC_RefSeq_hg38.NM.merge_TSS_TES.bed", col_names = F)
# make TSS±2kb region
gene_df.TSS <- mutate(
gene_df,
tss = case_when(
strand == "+" ~ start,
strand == "-" ~ end
),
tss_up = ifelse(tss - 2000 > 1, tss - 2000, 1),
tss_down = tss + 2000
) %>% select(
chrom,tss_up,tss_down,gene_id,tss,strand
)
# select main chrom
chr_list = c(sprintf("chr%s", 1:22), "chrX")
gene_df.TSS.filter <- filter(
gene_df.TSS,chrom %in% chr_list)
write_tsv(gene_df.TSS.filter, file = "UCSC_RefSeq_hg38.NM.merge_TSS_extend2kb.bed", col_names = F)
# 计算TSS extend 2kbp 中的数据
bedtools sort -i UCSC_RefSeq_hg38.NM.merge_TSS_extend2kb.bed -g hg38_chr_length.txt > region/UCSC_RefSeq_hg38.NM.merge_TSS_extend2kb.sort.bed
bedtools coverage \
-a UCSC_RefSeq_hg38.NM.merge_TSS_extend2kb.sort.bed \
-b {input}.bam -sorted \
-g hg38_chr_length.txt \
-counts > {input}.TSS_2Kbp.count_table
tss_coverage = sum(tss_2kb_count_df$region_count) * 150 / (4000 * nrow(tss_2kb_count_df))
genome_coverage = sum(total_count_df.filter$mapped_count) * 150 / 2.7e9
tss_enrichment_score = tss_coverage / genome_coverage
单样本双端为例
# QC -cutadapt
cutadapt -j 20 --times 1 -e 0.1 -O 3 --quality-cutoff 25 -m 55 -a AGATCGGAAGAGC -A AGATCGGAAGAGC -o {output.R1}.fq.gz {output.R2}.fq.gz {input.R1}.fq.gz {input.R2}.fq.gz > {log} 2>&1
# QC -trimgalore
trim_galore -q 20 --length 55 --max_n 3 --stringency 3 --fastqc --paired -o {output.R1}.fq.gz {output.R2}.fq.gz {input.R1}.fq.gz {input.R2}.fq.gz> {log} 2>&1
# mapping
构建hg38参考基因组索引文件后比对
bowtie2-build -t 6 ref_hg38.fa ref_hg38.fa > bt2_build.log 2>&1 &
# mapping
bowtie2 -p 20 -x ref_hg38.fa -1 {input.R1} -2 {input.R2} -S {output}.sam --no-unal --maxins 2000 > {log} 2>&1
# bam转换和过滤
samtools view -h -b -@ 10 -f 3 -F 12 -F 256 {input}.sam > {output}.bam
# bam排序
samtools sort -t 20 -m 50G -o {output}.bam --tmpdir= {input}.bam
# 去重
java -Xms40g -Xmx40g -XX:ParallelGCThreads=20 -jar picard.jar MarkDuplicates I={input}.bam O={output}.bam M={output}.matrix ASO=coordinate REMOVE_DUPLICATES=true 2>{log}
# bam文件质控
samtools view -t 20 -h -f bam -F "mapping_quality >= 20" -o {output}.bam {input}.bam
# NFR区域选取
samtools view -@ 5 -h {input}.bam | awk 'substr($0,1,1)=="@" || ($9 < 120 && $9 > 0) || ($9 > -120 && $9 < 0)' | samtools view -@ -hb > {output}.bam
# 核小体占位区域选取
samtools view -@ 5 -h {input}.bam | awk 'substr($0,1,1)=="@" || ($9>= 120) || ($9<=-120)' | samtools view -@ -hb > {output}.bam
# Bam文件建立index
samtools index -@ 5 {input}.bam {output}.bam
# 转化bigwig
bamCoverage --bam {input}.bam -o {output}.bigwig -of bigwig --scaleFactor 1 --binSize 1 -p 20 --normalizeUsing RPKM
# peak calling
macs2 callpeak -t {input}.bam \
-f BAMPE -g hs --keep-dup all -B --SPMR \
--nomodel --shift -75 --extsize 150 \
--outdir {output} -n {output}
# extend信号转换bigwig
## UCSC下载转换工具 http://hgdownload.soe.ucsc.edu/downloads.html#source_download
bedGraphToBigWig {input}.bdg .hg38_chr_length.txt {output}.bigwig
Deeptools的使用
区域选取
# 基因区 UCSC 上述步骤制备UCSC_RefSeq_hg38.NM.merge_TSS_TES.bed
# 某些peak区域 ENCODE下载例如CTCF peak(ENCFF285QVL)
# 基因组随机区域 bedtools 生成
## 剔除blacklist区域:http://github.com/Boyle-Lab/Blacklist
bedtools shuffle -i {input}.bed -excl blacklist.v2.bed -noOverlapping -g ref_hg38.fa.fai
# bam to bigwig
bamCoverage --bam {input}.bam -o {output}.bigwig -of bigwig --scaleFactor 1 --binSize 10 -p 10 --normalizeUsing RPKM
# deeptools at TSS\TES-reference-point
computeMatrix reference-point -S {input}.bigwig
-R UCSC_RefSeq_hg38.NM.merge_TSS_TES.bed \
--referencePoint TSS或TES \
--beforeRegionStartLength 5000 \
--afterRegionStartLength 5000 \
--binSize 50 \
--numberOfProcessors 24 \
-o {output}.mat.gz \
--samplesLabel {input}
# deeptools at TSS\TES-scale-regions
computeMatrix scale-regions -S {input}.bigwig
-R UCSC_RefSeq_hg38.NM.merge_TSS_TES.bed \
--referencePoint TSS或TES \
--beforeRegionStartLength 5000 \
--afterRegionStartLength 5000 \
--binSize 50 \
--numberOfProcessors 24 \
-o {output}.mat.gz \
--samplesLabel {input}
# plot-reference-point
plotHeatmap -m {input}.mat.gz \
--colorMap Purples \
-out {output}.heatmap.pdf
# plot-scale-regions
reference-pointscale-regions引自:https://deeptools.readthedocs.io/en/develop/content/example_gallery.html