DaPars: Dynamic analysis of Alternative PolyAdenylation from RNA-seq

Introduction

The dynamic usage of the 3’ untranslated region (3’UTR) resulting from alternative polyadenylation (APA) is emerging as a pervasive mechanism for regulating mRNA diversity, stability and translation. Although RNA-seq provides whole-transcriptome information and many tools for analyzing gene/isoform expression are available, very few tools focus on the analysis of the 3’UTR from standard RNA-seq. DaPars is the first de novo tool that directly infers dynamic APA usage by comparing standard RNA-seq data. Given the annotated gene model, DaPars can infer de novo proximal APA sites as well as the long and short 3’UTR expression levels. Finally, it identifies the dynamic APA usage between two conditions.

Installation

Prerequisites: BEDTools, Python 3, NumPy, SciPy.

Install DaPars:

tar zxf DaPars-VERSION.tar.gz

cd DaPars-VERSION

Input format

DaPars requires the following two file formats as input:

  • The BED file is a tab-separated, 12-column, plain-text file that represents the gene model. The gene model can be downloaded from UCSC. Please refer to the file hg19_refseq_whole_gene.bed in the DaPars_Test_Dataset.

  • BedGraph files store the read alignment results and can be generated from the BAM file produced by an RNA-seq alignment tool such as TopHat. One way is to use BEDTools with the following command:

genomeCoverageBed -bg -ibam sample_sorted.bam -g hg19_chr_size.txt -split > sample.bedgraph
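Before running DaPars, it can help to check that the two input formats above are well formed. The following is a minimal sketch, not part of DaPars: the function names check_bed12_line and parse_bedgraph_line are hypothetical, and the strand check assumes the gene model is stranded ('+' or '-').

```python
def check_bed12_line(line):
    """Return the fields of a BED12 line, or raise ValueError if it is malformed."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 12:
        raise ValueError(f"expected 12 tab-separated columns, got {len(fields)}")
    start, end = int(fields[1]), int(fields[2])
    if start >= end:
        raise ValueError("chromStart must be smaller than chromEnd")
    # Assumption: gene models are stranded, so '.' is rejected here.
    if fields[5] not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")
    return fields


def parse_bedgraph_line(line):
    """Parse one BedGraph data line into (chrom, start, end, coverage)."""
    fields = line.split()
    if len(fields) != 4:
        raise ValueError(f"expected 4 columns, got {len(fields)}")
    chrom, start, end, value = fields
    return chrom, int(start), int(end), float(value)
```

Applying check_bed12_line to every line of the gene model and parse_bedgraph_line to the first few lines of each BedGraph file catches common problems such as space-separated BED files or truncated coverage files.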

Usage Information

Step 1: Generate region annotation: python DaPars_Extract_Anno.py -b gene.bed -s symbol_map.txt -o extracted_3UTR.bed

DaPars will use the extracted distal polyadenylation sites to infer the proximal polyadenylation sites based on the alignment wiggle files of the two samples. The output of this step is used in the next step.

Options:

-h, --help

Show this help message and exit.

-b GENE_BED_FILE, --bed=GENE_BED_FILE

The gene model in BED format. The BED file can be downloaded from UCSC.

-s Gene_Symbol_FILE, --gene_symbol_map=Gene_Symbol_FILE

The mapping of transcripts to gene symbols, which can be extracted from UCSC Tables. For the expected format, please refer to the example file hg19_4_19_2012_Refseq_id_from_UCSC.txt in the DaPars_Test_Dataset.

-o OUTPUT_FILE, --out-prefix=OUTPUT_FILE

The output file containing the extracted annotation regions. This file is given as the value of “Annotated_3UTR” in the configure file described below.

For example:

python DaPars_Extract_Anno.py -b hg19_refseq_whole_gene.bed -s hg19_4_19_2012_Refseq_id_from_UCSC.txt -o hg19_refseq_extracted_3UTR.bed
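When Step 1 is called from a larger Python workflow, the command above can be assembled as an argument list and run with subprocess. This is only a sketch under stated assumptions: the helper names build_extract_anno_cmd and run_extract_anno are hypothetical, and the script is assumed to be in the current working directory. Only the -b, -s and -o options shown above are used.

```python
import subprocess
import sys


def build_extract_anno_cmd(gene_bed, symbol_map, output_bed,
                           script="DaPars_Extract_Anno.py"):
    """Assemble the Step 1 command as an argument list (no shell quoting needed)."""
    return [sys.executable, script,
            "-b", gene_bed, "-s", symbol_map, "-o", output_bed]


def run_extract_anno(gene_bed, symbol_map, output_bed):
    """Run Step 1; raises CalledProcessError if the script exits with an error."""
    cmd = build_extract_anno_cmd(gene_bed, symbol_map, output_bed)
    subprocess.run(cmd, check=True)
    return output_bed
```

For example, run_extract_anno("hg19_refseq_whole_gene.bed", "hg19_4_19_2012_Refseq_id_from_UCSC.txt", "hg19_refseq_extracted_3UTR.bed") would reproduce the command line shown above, using the same Python interpreter that runs the workflow.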

Step 2: Run the main function to get the final result: python DaPars_main.py configure_file

Run this script to get the final result. The configure file, which stores all the parameters, is the only argument to DaPars_main.py.

The format of the configure file is:

#The following file is the result of step 1.

Annotated_3UTR=hg19_refseq_extracted_3UTR.bed

#A comma-separated list of BedGraph files of samples from condition 1

Group1_Tophat_aligned_Wig=Condition_A_chrX.wig
#Group1_Tophat_aligned_Wig=Condition_A_chrX_r1.wig,Condition_A_chrX_r2.wig if multiple files in one group

#A comma-separated list of BedGraph files of samples from condition 2

Group2_Tophat_aligned_Wig=Condition_B_chrX.wig

Output_directory=DaPars_Test_data/

Output_result_file=DaPars_Test_data

#Minimum number of samples in each condition that must pass the coverage threshold
Num_least_in_group1=1

Num_least_in_group2=1

Coverage_cutoff=30

#Cutoff for FDR of P-values from Fisher exact test.

FDR_cutoff=0.05


PDUI_cutoff=0.5

Fold_change_cutoff=0.59

Output format

An example of the output is shown in the figure _static/Result_example.jpg.

Release history

DaPars v1.0.0

  • Updated to Python 3.

  • Removed rpy2; the statistical test is now done in Python.

  • Fixed some minor bugs.

DaPars v0.9.1

  • Fixed some minor bugs.

  • Improved documentation.

DaPars v0.9.0

  • DaPars is released.

Contact