高通量测序数据分析流程:①测序数据下载(SRA Toolkit);②测序数据质控与过滤(fastp);③序列比对(SAMtools、HISAT2);④序列组装(StringTie、TACO);⑤表达定量和差异表达分析(Salmon、DESeq2)及富集分析(clusterProfiler)等。SRA(Sequence Read Archive)数据库存储了二代测序的原始数据,也存储raw reads在参考基因的比对信息。根据数据产生特点,将SRA数据分为四类:Studies--研究课题;Experiments--实验设计;Runs--测序结果集;Samples--样品信息。SRA中数据结构的层次关系为:Studies->Experiments->Samples->Runs。
1. SRA toolkit软件下载
## 下载地址
https://github.com/ncbi/sra-tools/wiki/01.-Downloading-SRA-Toolkit
## 下载安装 SRA toolkit
wget "https://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/3.1.1/sratoolkit.3.1.1-ubuntu64.tar.gz"
## 提取压缩文件
tar -zxvf sratoolkit.3.1.1-ubuntu64.tar.gz # 解压缩
vdb-config --interactive # 配置(按X退出后即可正常使用)
## 检查配置
prefetch -V
用SRAtoolkit下载并处理NCBI数据。将 .sra文件转换为 .fstaq.gz文件。
# 下载
prefetch SRR6283792 -O . #-O . 指定到当前路径,否则默认路径难找
# 解压
fastq-dump SRR6283792.sra #解压后从sra文件变为fastq文件
# 双端测序要加–split-files,否则两端的数据不会分开,难以被其他软件读取
# 建议加上–gzip,将解压后的数据用gzip压缩,避免占用过多空间
2.测序数据质控与过滤:fastp
fastq格式是一种基于文本的存储生物序列和对应碱基(或氨基酸)质量的文件格式。最初由桑格研究所(Wellcome Trust Sanger Institute)开发,现在是存储高通量测序数据的标准。
第1行主要储存序列测序时的坐标信息等。
第2行是测序得到的序列信息,一般用ATCGN来表示。
第3行以“+”开始,可以储存一些附加信息,一般是空的。
第4行储存质量信息,与第2行的碱基序列一一对应。每个符号对应的ASCII值成为phred值,可以简单理解为对应位置碱基的质量,越大说明测序质量越好。
# 进行fastqc, 每个*.fq.gz文件都会得到一个网页文件,可直接通过浏览器查看测序的质量
nohup fastqc -t 6 -o ./ SRR*.fastq.gz >qc.log &
#安装 multiqc
conda install multiqc
# fastqc后的zip数据进行multiqc,将多个质控报告合并为一个,便于查看
nohup multiqc ./*.zip -o ./ > ./multiqc.log &
## 数据过滤的目的:去除低质量的数据和测序接头等
conda install trim-galore
$ trim_galore –help
$ trim_galore –help | more
$ trim_galore --illumina --fastqc SRR6283792.fastq.gz
>>> Now performing quality (cutoff '-q 20') and adapter trimming in a single pass for the adapter sequence: 'AGATCGGAAGAGC' from file SRR6283792.fastq.gz <<<
10000000 sequences processed
20000000 sequences processed
30000000 sequences processed
40000000 sequences processed
50000000 sequences processed
60000000 sequences processed
70000000 sequences processed
80000000 sequences processed
90000000 sequences processed
This is cutadapt 4.9 with Python 3.12.2
Command line parameters: -j 1 -e 0.1 -q 20 -O 1 -a AGATCGGAAGAGC SRR6283792.fastq.gz
Processing single-end reads on 1 core ...
Finished in 2139.308 s (21.405 µs/read; 2.80 M reads/minute).
=== Summary ===
Total reads processed: 99,943,320
Reads with adapters: 24,370,603 (24.4%)
Reads written (passing filters): 99,943,320 (100.0%)
Total basepairs processed: 11,991,108,334 bp
Quality-trimmed: 311,435,040 bp (2.6%)
Total written (filtered): 11,642,048,903 bp (97.1%)
=== Adapter 1 ===
Sequence: AGATCGGAAGAGC; Type: regular 3'; Length: 13; Trimmed: 24370603 times
Minimum overlap: 1
No. of allowed errors:
1-9 bp: 0; 10-13 bp: 1
Bases preceding removed adapters:
A: 19.4%
C: 37.8%
G: 25.6%
T: 16.8%
none/other: 0.3%
Overview of removed sequences
length count expect max.err error counts
1 14970037 24985830.0 0 14970037
2 6674781 6246457.5 0 6674781
3 2426999 1561614.4 0 2426999
4 125056 390403.6 0 125056
5 139063 97600.9 0 139063
6 13074 24400.2 0 13074
7 4523 6100.1 0 4523
8 184 1525.0 0 184
9 1727 381.3 0 21 1706
10 3619 95.3 1 181 3438
11 2073 23.8 1 1 2072
12 856 6.0 1 4 852
13 146 1.5 1 0 146
14 60 1.5 1 0 60
15 123 1.5 1 0 123
16 68 1.5 1 0 68
17 59 1.5 1 0 59
18 61 1.5 1 0 61
19 116 1.5 1 0 116
20 73 1.5 1 0 73
21 110 1.5 1 0 110
22 95 1.5 1 1 94
23 69 1.5 1 0 69
24 143 1.5 1 0 143
25 85 1.5 1 0 85
26 81 1.5 1 0 81
27 109 1.5 1 1 108
28 39 1.5 1 0 39
29 77 1.5 1 0 77
30 55 1.5 1 0 55
31 43 1.5 1 0 43
32 36 1.5 1 0 36
33 76 1.5 1 0 76
34 73 1.5 1 0 73
35 57 1.5 1 0 57
36 56 1.5 1 0 56
37 38 1.5 1 0 38
38 71 1.5 1 0 71
39 64 1.5 1 0 64
40 63 1.5 1 0 63
41 34 1.5 1 0 34
42 64 1.5 1 0 64
43 94 1.5 1 0 94
44 49 1.5 1 0 49
45 73 1.5 1 0 73
46 64 1.5 1 1 63
47 59 1.5 1 0 59
48 87 1.5 1 0 87
49 49 1.5 1 0 49
50 67 1.5 1 1 66
51 207 1.5 1 1 206
52 73 1.5 1 0 73
53 51 1.5 1 0 51
54 67 1.5 1 0 67
55 75 1.5 1 0 75
56 62 1.5 1 0 62
57 68 1.5 1 0 68
58 60 1.5 1 0 60
59 69 1.5 1 0 69
60 57 1.5 1 0 57
61 89 1.5 1 0 89
62 77 1.5 1 0 77
63 113 1.5 1 0 113
64 96 1.5 1 0 96
65 78 1.5 1 0 78
66 38 1.5 1 0 38
67 51 1.5 1 0 51
68 103 1.5 1 0 103
69 120 1.5 1 0 120
70 123 1.5 1 0 123
71 89 1.5 1 1 88
72 86 1.5 1 0 86
73 114 1.5 1 0 114
74 99 1.5 1 0 99
75 87 1.5 1 0 87
76 110 1.5 1 0 110
77 51 1.5 1 0 51
78 48 1.5 1 0 48
79 47 1.5 1 0 47
80 58 1.5 1 0 58
81 74 1.5 1 0 74
82 52 1.5 1 1 51
83 57 1.5 1 0 57
84 68 1.5 1 0 68
85 46 1.5 1 0 46
86 43 1.5 1 0 43
87 67 1.5 1 0 67
88 51 1.5 1 0 51
89 59 1.5 1 0 59
90 57 1.5 1 0 57
91 53 1.5 1 0 53
92 105 1.5 1 0 105
93 73 1.5 1 0 73
94 54 1.5 1 0 54
95 55 1.5 1 0 55
96 63 1.5 1 0 63
97 50 1.5 1 0 50
98 49 1.5 1 0 49
99 46 1.5 1 0 46
100 40 1.5 1 0 40
101 41 1.5 1 0 41
102 50 1.5 1 0 50
103 112 1.5 1 0 112
104 56 1.5 1 0 56
105 110 1.5 1 0 110
106 71 1.5 1 0 71
107 121 1.5 1 0 121
108 55 1.5 1 0 55
109 76 1.5 1 0 76
110 39 1.5 1 0 39
111 45 1.5 1 0 45
112 26 1.5 1 0 26
113 37 1.5 1 0 37
114 50 1.5 1 0 50
115 36 1.5 1 0 36
116 24 1.5 1 0 24
117 27 1.5 1 0 27
118 38 1.5 1 0 38
119 31 1.5 1 0 31
120 30 1.5 1 0 30
121 26 1.5 1 0 26
122 24 1.5 1 0 24
123 33 1.5 1 0 33
124 29 1.5 1 0 29
125 31 1.5 1 0 31
126 30 1.5 1 0 30
127 27 1.5 1 0 27
128 28 1.5 1 0 28
129 29 1.5 1 0 29
130 28 1.5 1 0 28
131 26 1.5 1 0 26
132 39 1.5 1 0 39
133 29 1.5 1 0 29
134 15 1.5 1 0 15
135 31 1.5 1 0 31
136 28 1.5 1 0 28
137 22 1.5 1 0 22
138 25 1.5 1 0 25
139 27 1.5 1 0 27
140 29 1.5 1 0 29
141 25 1.5 1 0 25
142 17 1.5 1 0 17
143 18 1.5 1 0 18
144 18 1.5 1 0 18
145 17 1.5 1 0 17
146 15 1.5 1 0 15
147 12 1.5 1 0 12
148 12 1.5 1 0 12
149 6 1.5 1 0 6
150 12 1.5 1 0 12
151 14 1.5 1 0 14
152 19 1.5 1 0 19
153 8 1.5 1 0 8
154 10 1.5 1 0 10
155 22 1.5 1 0 22
156 14 1.5 1 0 14
157 21 1.5 1 0 21
158 22 1.5 1 0 22
159 12 1.5 1 0 12
160 7 1.5 1 0 7
161 8 1.5 1 0 8
162 6 1.5 1 0 6
163 11 1.5 1 0 11
164 9 1.5 1 0 9
165 11 1.5 1 0 11
166 9 1.5 1 0 9
167 5 1.5 1 0 5
168 8 1.5 1 0 8
169 5 1.5 1 0 5
170 5 1.5 1 0 5
171 7 1.5 1 0 7
172 6 1.5 1 0 6
173 8 1.5 1 0 8
174 8 1.5 1 0 8
175 5 1.5 1 0 5
176 6 1.5 1 0 6
177 5 1.5 1 0 5
178 7 1.5 1 0 7
179 4 1.5 1 0 4
180 2 1.5 1 0 2
181 2 1.5 1 0 2
182 4 1.5 1 0 4
183 3 1.5 1 0 3
184 4 1.5 1 0 4
185 3 1.5 1 0 3
186 7 1.5 1 0 7
187 5 1.5 1 0 5
188 6 1.5 1 0 6
189 2 1.5 1 0 2
190 6 1.5 1 0 6
191 3 1.5 1 0 3
192 5 1.5 1 0 5
193 2 1.5 1 0 2
194 4 1.5 1 0 4
195 2 1.5 1 0 2
197 2 1.5 1 0 2
198 3 1.5 1 0 3
199 2 1.5 1 0 2
200 4 1.5 1 0 4
201 4 1.5 1 0 4
202 5 1.5 1 1 4
203 3 1.5 1 0 3
204 4 1.5 1 0 4
205 1 1.5 1 0 1
207 2 1.5 1 0 2
208 4 1.5 1 0 4
210 3 1.5 1 0 3
211 1 1.5 1 0 1
212 1 1.5 1 0 1
213 1 1.5 1 0 1
214 2 1.5 1 0 2
215 2 1.5 1 0 2
216 1 1.5 1 0 1
217 2 1.5 1 0 2
218 1 1.5 1 0 1
220 2 1.5 1 0 2
221 1 1.5 1 0 1
222 2 1.5 1 0 2
224 1 1.5 1 0 1
229 1 1.5 1 0 1
230 1 1.5 1 0 1
231 3 1.5 1 0 3
233 1 1.5 1 0 1
234 1 1.5 1 0 1
236 1 1.5 1 0 1
238 1 1.5 1 0 1
239 3 1.5 1 0 3
240 1 1.5 1 0 1
243 1 1.5 1 0 1
249 1 1.5 1 0 1
254 1 1.5 1 0 1
257 1 1.5 1 0 1
268 1 1.5 1 0 1
RUN STATISTICS FOR INPUT FILE: SRR6283792.fastq.gz
=============================================
99943320 sequences processed in total
Sequences removed because they became shorter than the length cutoff of 20 bp: 1560266 (1.6%)
>>> Now running FastQC on the data <<<
application/gzip
Started analysis of SRR6283792_trimmed.fq.gz
Approx 5% complete for SRR6283792_trimmed.fq.gz
Approx 10% complete for SRR6283792_trimmed.fq.gz
Approx 15% complete for SRR6283792_trimmed.fq.gz
Approx 20% complete for SRR6283792_trimmed.fq.gz
Approx 25% complete for SRR6283792_trimmed.fq.gz
Approx 30% complete for SRR6283792_trimmed.fq.gz
Approx 35% complete for SRR6283792_trimmed.fq.gz
Approx 40% complete for SRR6283792_trimmed.fq.gz
Approx 45% complete for SRR6283792_trimmed.fq.gz
Approx 50% complete for SRR6283792_trimmed.fq.gz
Approx 55% complete for SRR6283792_trimmed.fq.gz
Approx 60% complete for SRR6283792_trimmed.fq.gz
Approx 65% complete for SRR6283792_trimmed.fq.gz
Approx 70% complete for SRR6283792_trimmed.fq.gz
Approx 75% complete for SRR6283792_trimmed.fq.gz
Approx 80% complete for SRR6283792_trimmed.fq.gz
Approx 85% complete for SRR6283792_trimmed.fq.gz
Approx 90% complete for SRR6283792_trimmed.fq.gz
Approx 95% complete for SRR6283792_trimmed.fq.gz
Analysis complete for SRR6283792_trimmed.fq.gz
Hisat2是一款短序列比对的工具,主要用于转录组数据的比对,是Hisat工具的升级版。Hisat2优化了索引建立的策略,采用了新的比对策略,使其与Bowtie/TopHat2等软件相比具有更高的敏感性和更快的运算速度。Hisat2支持剪切位点的识别和转录本的重构:利用已知或发现的剪切位点信息进行剪切比对,提高比对率和准确性;结合StringTie等软件进行转录本的重构和定量,提供更全面和精确的转录组信息。
## 数据⽐对到参考基因组
conda install -c bioconda hisat2
## 载基因组文件
wget -c https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.40_GRCh38.p14/GCF_000001405.40_GRCh38.p14_genomic.fna.gz
##下载基因组注释文件
wget -c https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.40_GRCh38.p14/GCF_000001405.40_GRCh38.p14_genomic.gff.gz
##解压
gunzip GCF_000001405.40_GRCh38.p14_genomic.fna.gz
gunzip GCF_000001405.40_GRCh38.p14_genomic.gff.gz
## 建立索引
$ mkdir -p reference/index
$ cd reference/index
$ wget ftp://ftp.ccb.jhu.edu/pub/infphilo/hisat2/data/hg38.tar.gz
$ wget ftp://ftp.ccb.jhu.edu/pub/infphilo/hisat2/data/mm10.tar.gz
# 解压得到两个目录,hg38和mm10
$ tar -zxvf *.tar.gz
# 删除压缩包
$ rm -rf *.tar.gz
参考资料:
https://www.codenong.com/cs105958520/#google_vignette https://www.jianshu.com/p/479c7b576e6f