一套完整的samll RNA上游分析流程 (三)

文摘   2024-10-22 07:30   云南  

一边学习,一边总结,一边分享!

由于微信改版,一直有同学反映。存在长时间接收不到公众号的推文。那么请跟随以下步骤,将小杜的生信筆記设置为星标,不错过每一条推文教程。

欢迎关注《小杜的生信笔记》!!

如何加入社群

小杜的生信笔记仅有微信社群

1. 微信群:付费社群。添加小杜好友,加友请知:加友须知!!,加入社群请查看小杜生笔记付费加友入群声明

入群声明

2. 小杜个人微信:若你有好的教程或想法,可添加小杜个人微信。值得注意的是,小杜个人微信并不支持免费咨询长时间咨询,但支持小问题2-3个免费咨询。

小杜微信:

知识星球:

2022年教程总汇

https://mp.weixin.qq.com/s/Lnl258WhbK2a8pRZFuIyVg

2023年教程总汇

https://mp.weixin.qq.com/s/wCTswNP8iHMNvu5GQauHdg


6. 比对到本物种mRNA.fa

# 构建index 
bowtie-build mRNA.fa mRNA

## 比对本物种mRNA序列
bowtie -f -v 0 -p 1 -a --best --strata \
../S0.ref_prepare/03.genome/mRNA \ # bowtie index
P11.mapped.fa > P11.exon.bwt 2>P11.exon.log

7. 比对到本物种intron序列,用TBtools插件生成gff文件

参考文章:https://mp.weixin.qq.com/s/KQBe4IlMpD2qQqJqwIp8aQ

7.1 获得本物种intron序列

  1. 下载getIntronGXF脚本
https://github.com/nongxinshengxin/getIntronGXF

2. 使用TBtools进行操作

安装安装成功!后面的操作,直接拖拖拽拽就可以。https://mp.weixin.qq.com/s/KQBe4IlMpD2qQqJqwIp8aQ

# 需要先将 gff 1-based 格式转化为 bed 0-based格式
# 用bedtools,提取bed中的序列
awk '{print $1"\t"$4-1"\t"$5}' intron.gtf > intron.bed
bedtools getfasta -fi genome.fa -fo intron.fa -bed intron.bed -name

7.2  进行比对

# 构建index
bowtie-build intorn.fa intorn
## 比对本物种intron序列
bowtie -f -v 0 -p 1 -a --best --strata \
../S0.ref_prepare/03.genome/intron \ # bowtie index
P11.mapped.fa > P11.intron.bwt 2>P11.intron.log

8. 进行最后的注释分类

## 对small rna进行分类注释
## known mirna > ncrna > repeat > exon > intron
perl ../script/srna_anno.pl \
-fa P11.mapped.fa \ # 用于分类的reads文件
-rfam ../S0.ref_prepare/02.Rfam/family.txt1 \ #rfam家族信息文件
-mirna P11.miRNA.bwt \ # 已知miRNA比对结果
-ncrna P11.ncRNA.bwt \ # 与其他ncRNA比对结果
-intron P11.intron.bwt \ # 与intron比对结果
-repeat P11.repeat.bwt \ # 与repeat比对结果
-exon P11.exon.bwt \ # 与mRNA比对结果
-outpre P11.out # 输出结果前缀

srna_anno.pl脚本

#!/usr/bin/env perl

use strict ;

use Getopt::Long;
use Bio::SeqIO;

my ($fa, $rfam, $mirna, $ncrna, $exon, $intron, $repeat, $outpre);
GetOptions(
"fa=s" => \$fa,
"rfam=s" => \$rfam,
"mirna=s" => \$mirna,
"ncrna=s" => \$ncrna,
"exon=s" => \$exon,
"intron=s" => \$intron,
"repeat=s" => \$repeat,
"outpre=s" => \$outpre
);

my $usage="usage:
perl $0 -fa srna.fa -rfam family.txt -mirna mirna.bwt -ncrna rfam.bwt -exon exon.bwt -intron intron.bwt -repeat repeat.bwt -outpre out.stat\n

[options]
-fa srna.fa :small rna in fasta format;
-rfam family.txt : rfam family file, two col;
-mirna mirna.bwt : mirna alignment by bowtie;
-ncrna rfam.bwt : Rfam(rm mirna) alignment by bowtie;
-exon exon.bwt or none : exon aligment by bowtie ;
-intron intron.bwt or none : intron alignment by bowtie;
-repeat repeat.bwt or none : repeat aligment by bowtie;
-outpre outprefix : prefix for output
"
;

#print "$fa, $rfam, $mirna, $ncrna, $exon, $intron, $repeat, $outpre\n";
unless ( $fa && $rfam && $mirna && $ncrna && $exon && $intron && $repeat && $outpre ){
die $usage;
}

my %mirna = &gene_bwt($mirna);
my %exon = $exon eq "none" ? () : &gene_bwt($exon);
my %intron = $intron eq "none" ? () : &gene_bwt($intron);
my %repeat = $repeat eq "none" ? () : &gene_bwt($repeat);
my %ncrna = &ncrna_bwt($rfam, $ncrna);


## annotation order: mirna > ncrna > repeat > exon > intron
## ncrna order: rRNA > tRNA > snRNA > snoRNA > other_ncRNA


my $read_ann = "$outpre.read_anno.txt";
my $stat_ann = "$outpre.read_anno.stat";

my $total = 0 ;
my $uniq = 0 ;


open ANN, ">$read_ann" or die $!;
print ANN "#name\tsequence\tnumber\tlength\tannotation\n";
my %ann;
#my %ann_keep;
my $fasta = Bio::SeqIO->new(-file => $fa, -format => 'fasta');
while(my $obj = $fasta->next_seq()){
my $id = $obj->id;
my $seq = $obj->seq;
my $len = $obj->length;
my $nu = $1 if ($id =~ /^\S+_x(\d+)/);
my $ann = "unknown";
if(exists $mirna{$id}){
$ann = "miRNA";
}
elsif(exists $ncrna{$id}){
$ann = $ncrna{$id};
}
elsif(exists $repeat{$id}){
$ann = "repeat";
}
elsif(exists $exon{$id}){
my $strand = "-" ;
if(exists $exon{$id}{"+"}){
$strand = "+"
}
if($strand eq "+"){
$ann = "exon_sense";
}
else{
$ann = "exon_antisense"
}
}
elsif(exists $intron{$id}){
my $strand = "-" ;
if(exists $intron{$id}{"+"}){
$strand = "+";
}

if($strand eq "+"){
$ann = "intron_sense";
}
else{
$ann = "intron_antisense"
}

}
$ann{$ann}{"total"} += $nu;
$ann{$ann}{"uniq"} ++;


#$ann_keep{$ann} = 1;

$total += $nu;
$uniq ++;
print ANN "$id\t$seq\t$nu\t$len\t$ann\n";
}
close ANN;


open OUT, ">$stat_ann" or die $!;
my @ann_type = ("miRNA",
"rRNA", "tRNA", "snRNA", "snoRNA" , "other_ncRNA" ,
"repeat",
"exon_sense", "exon_antisense",
"intron_sense", "intron_antisense",
"unknown"
);

print OUT "ann_type\ttotal_reads\ttotal_percent\tuniq_reads\tuniq_percent\n";
foreach my $ann_type ( @ann_type ){
my $tnum = 0;
my $unum = 0;

next if ($exon eq "none" && ($ann_type eq "exon_sense" || $ann_type eq "exon_antisense"));
next if ($intron eq "none" && ( $ann_type eq "intron_sense" || $ann_type eq "intron_antisense"));
next if ($repeat eq "none" && $ann_type eq "repeat");

if( exists $ann{$ann_type}){
$tnum = $ann{$ann_type}{"total"};
$unum = $ann{$ann_type}{"uniq"};
}
my $tpercent = $tnum/$total;
my $upercent = $unum/$uniq;
printf OUT "$ann_type\t$tnum\t%.6f\t$unum\t%.6f\n",$tpercent,$upercent;

}
close OUT;

sub gene_bwt{
my $bwt = shift;
my %out ;
open IN, $bwt or die $!;
while(<IN>){
chomp;
# P11_2460498_x5615 + ccr-mir-26a 16 TTCAAGTAATCCAGGATAGGCT IIIIIIIIIIIIIIIIIIIIII 0
my @t = split /\t/, $_;
$out{$t[0]}{$t[1]} = 1;
}
close IN;
return %out;
}

sub ncrna_bwt{
my $fam = shift;
my $bwt = shift;

my %fam;
open FM, $fam or die $!;
#Cis-reg;
#Cis-reg;
#Cis-reg; frameshift_element;
#Cis-reg; IRES;
#Cis-reg; leader;
#Cis-reg; riboswitch;
#Cis-reg; thermoregulator;
#Cis-reg; thermoregulator;
#Gene;
#Gene; antisense;
#Gene; antisense;
#Gene; antitoxin;
#Gene; CRISPR;
#Gene; lncRNA;
#Gene; ribozyme;
#Gene; rRNA;
#Gene; snRNA;
#Gene; snRNA; snoRNA; CD-box;
#Gene; snRNA; snoRNA; HACA-box;
#Gene; snRNA; snoRNA; scaRNA;
#Gene; snRNA; splicing;
#Gene; sRNA;
#Gene; tRNA;
#Intron;

while(<FM>){
chomp;
my ($rf,$name, $type) = split /\t/,$_;
if ($type =~ /rRNA;/){
$fam{$rf} = "rRNA";
}
elsif($type =~ /tRNA;/){
$fam{$rf} = "tRNA";
}
elsif($type =~ /snoRNA;/ ){
$fam{$rf} = "snoRNA";
}
elsif($type =~ /snRNA;/){
$fam{$rf} = "snRNA";
}
else{
$fam{$rf} = "other_ncRNA";
}
#print "$rf\t$fam{$rf}\n";
}
close FM;


my %ty_order = { "rRNA" => 1, "tRNA" => 2, "snRNA" => 3, "snoRNA" => 4, "orther_ncRNA" => 5};
my %out;
open BWT, $bwt or die $!;
while(<BWT>){
chomp;
#P11_1019552_x109273 + RF00005:CU928174.1/736663-736781 37 GAGTACAACTCTTAGCGG
my @t = split /\t/, $_;
my $id = $t[0];
my @f = split /:/, $t[2];
my $ty = $fam{$f[0]};
if( !exists $out{$id} ){
$out{$id} = $ty;
}
else{
my $old = $out{$id};
if( $ty_order{$ty} < $ty_order{$old}){
$out{$id} = $ty;
}
}
}
close BWT;
return %out;
}

若我们的教程对你有所帮助,请点赞+收藏+转发,大家的支持是我们更新的动力!!

往期部分文章

1. 最全WGCNA教程(替换数据即可出全部结果与图形)

推荐大家购买最新的教程,若是已经购买以前WGNCA教程的同学,可以在对应教程留言,即可获得最新的教程。(注:此教程也仅基于自己理解,不仅局限于此,难免有不恰当地方,请结合自己需求,进行改动。)


2. 精美图形绘制教程

3. 转录组分析教程

4. 转录组下游分析

小杜的生信筆記 ,主要发表或收录生物信息学教程,以及基于R分析和可视化(包括数据分析,图形绘制等);分享感兴趣的文献和学习资料!!


小杜的生信筆記
小杜的生信筆記,生信小白,初来乍到请多指教。 主要学习分享,转录组数据分析,基于R语言数据分析和绘制图片等,以及相关文献的分享。
 最新文章