Tutorial of GRADitude using Singularity

This tutorial will guide you through a test case to show how to perform a GRAD-seq analysis from the pre-processing steps to the downstream analysis that will be done using GRADitude. We recommend to execute this analysis in a server or in a virtual machine. The pre-processing and mapping steps require a lot of computational time and power.

In this tutorial we will try to show how to use all GRADitude's subcommands in order to have an overview of all possible complexes.

In order to perform all these steps we need to have many tools and dependencies in your pc. To make all simpler and more reproducible we suggest following the example provided below. All these steps can be then easily adjusted to analyze different GRAD-seq experiments.

Docker containers and singularity

Another way to distribute packages is using Docker. Docker is a service that deliver software in so called containers. All containers have their own libraries and packages.

A Docker container has inside all dependencies and applications. This increase the reproducibility of a specific analysis. In this example all the tools required for the analysis are inside the image that will be generated.

In order to avoid root permission when running Docker, Singularity can be used. For this reason, the example analysis provided, is done using singularity.

You can download singularity in here

Once you have singularity in your computer you can run inside your project directory this command:

singularity build grad-seq-analysis.sif docker://silviadigiorgio/grad-seq-analysis-ecoli

This will create an image that can be used to perform the analysis. Before starting the analysis please create the folders input and output in the main directory.

This can be also done via shell using the bash command mkdir.

mkdir input output

We recommend to create subfolders inside the input main folder

cd input &&
mkdir data_folder raw_reads

At the end you will have inside the analysis folder the created image and the two folders

.
├── grad-seq-analysis.sif
├── input
│   ├── data_folder
│   └── raw_reads
└── output

If you are not familiar with Bioinformatics please be aware that some commands we will execute they will take some time. So we recommend to run everything in a server and with tmux

To perform our test case, the raw data have to be retrieved from NCBI.

In order to perform this step the user needs to use the fastq-dump function provided by the SRA-toolkit that can be downloaded here

We suggest to download the raw data inside the input folder it has been previously created


cd input/raw_reads &&
    fastq-dump --bzip2 SRR12067299
    fastq-dump --bzip2 SRR12067300
    fastq-dump --bzip2 SRR12067301
    fastq-dump --bzip2 SRR12067302
    fastq-dump --bzip2 SRR12067303
    fastq-dump --bzip2 SRR12067304
    fastq-dump --bzip2 SRR12067305
    fastq-dump --bzip2 SRR12067306
    fastq-dump --bzip2 SRR12067307
    fastq-dump --bzip2 SRR12067308
    fastq-dump --bzip2 SRR12067309
    fastq-dump --bzip2 SRR12067310
    fastq-dump --bzip2 SRR12067311
    fastq-dump --bzip2 SRR12067312
    fastq-dump --bzip2 SRR12067313
    fastq-dump --bzip2 SRR12067314
    fastq-dump --bzip2 SRR12067315
    fastq-dump --bzip2 SRR12067316
    fastq-dump --bzip2 SRR12067317
    fastq-dump --bzip2 SRR12067318 
    fastq-dump --bzip2 SRR12067319 
    fastq-dump --bzip2 SRR12067320

or to automate the process you can run the same command using a for loop

for i in            \
    SRR12067299     \
    SRR12067300     \
    SRR12067301     \
    SRR12067302     \
    SRR12067303     \
    SRR12067304     \
    SRR12067305     \
    SRR12067306     \
    SRR12067307     \
    SRR12067308     \
    SRR12067309     \
    SRR12067310     \
    SRR12067311     \
    SRR12067312     \
    SRR12067313     \
    SRR12067314     \
    SRR12067315     \
    SRR12067316     \
    SRR12067317     \
    SRR12067318     \
    SRR12067319     \
    SRR12067320     \
; do 
    fastq-dump --bzip2 $i; 
  done

After downloading the data I recommend to change the name of the files according the specific fraction like suggested here:

cd input/raw_reads &&
mv SRR12067299.fastq.bz2 Grad-00L.fastq.bz2 &&
mv SRR12067300.fastq.bz2 Grad-01.fastq.bz2 &&
mv SRR12067301.fastq.bz2 Grad-02.fastq.bz2 &&  
mv SRR12067302.fastq.bz2 Grad-03.fastq.bz2 &&   
mv SRR12067303.fastq.bz2 Grad-04.fastq.bz2 &&   
mv SRR12067304.fastq.bz2 Grad-05.fastq.bz2 &&   
mv SRR12067305.fastq.bz2 Grad-06.fastq.bz2 &&   
mv SRR12067306.fastq.bz2 Grad-07.fastq.bz2 &&    
mv SRR12067307.fastq.bz2 Grad-08.fastq.bz2 &&   
mv SRR12067308.fastq.bz2 Grad-09.fastq.bz2 &&   
mv SRR12067309.fastq.bz2 Grad-10.fastq.bz2 &&   
mv SRR12067310.fastq.bz2 Grad-11.fastq.bz2 &&   
mv SRR12067311.fastq.bz2 Grad-12.fastq.bz2 &&    
mv SRR12067312.fastq.bz2 Grad-13.fastq.bz2 &&    
mv SRR12067313.fastq.bz2 Grad-14.fastq.bz2 &&   
mv SRR12067314.fastq.bz2 Grad-15.fastq.bz2 &&    
mv SRR12067315.fastq.bz2 Grad-16.fastq.bz2 &&   
mv SRR12067316.fastq.bz2 Grad-17.fastq.bz2 &&   
mv SRR12067317.fastq.bz2 Grad-18.fastq.bz2 &&    
mv SRR12067318.fastq.bz2 Grad-19.fastq.bz2 &&    
mv SRR12067319.fastq.bz2 Grad-20.fastq.bz2 &&   
mv SRR12067320.fastq.bz2 Grad-21P.fastq.bz2    

You can download the reference genome that will be used later on from the NCBI using this command


cd input/data_folder &&
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/005/845/GCF_000005845.2_ASM584v2/GCF_000005845.2_ASM584v2_genomic.fna.gz &&
gzip -d GCF_000005845.2_ASM584v2_genomic.fna.gz && cd ../..

You can download the ERCC fasta file that will be necessary for the normalization


cd input/data_folder &&
wget https://assets.thermofisher.com/TFS-Assets/LSG/manuals/ERCC92.zip &&
unzip ERCC92.zip && rm ERCC92.zip && rm ERCC92.gtf && cd ../..

The annotation file can be download from tutorial_data folder in github (Höer et al., 2020, NAR)


cd input/data_folder &&
wget https://github.com/foerstner-lab/GRADitude/raw/main/tutorial_data/NC_000913.3_no_duplicates.gff && cd ../..

Once all the raw reads are downloaded. We can start with removing the adapter using cutadapt.

Before doing that we create the READemption folder where we will store the output that comes out from executing cutdadapt


singularity exec grad-seq-analysis-ecoli.sif reademption create -f READemption_analysis

If the commands works from your shell it should appear the READemption logo with no errors


   ___  _______   ___                 __  _
  / _ \/ __/ _ | / _ \___ __ _  ___  / /_(_)__  ___
 / , _/ _// __ |/ // / -_)  ' \/ _ \/ __/ / _ \/ _ \
/_/|_/___/_/ |_/____/\__/_/_/_/ .__/\__/_/\___/_//_/
                             / /
====================================================
========================================
=======================
==============

Once we create the READemption folder we can execute cutadapt.

ls input/raw_reads | parallel -k -j 10 \
singularity exec grad-seq-analysis-ecoli.sif cutadapt \
-q 20 \
-m 1 \
-a "AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC" \
-o READemption_analysis/input/reads/'$(basename {})'.bz2 \
input/raw_reads/{} \
> output/cutadapt_stats.txt

After executing cutadapt the trimmed output are already saved in the READemption/input/reads folder as requested by READemption subcommand.

Now we have to copy the reference genome and the annotation in READemption/input/reference_sequences and READemption/input/annotations respectively You can do it manually or you can run the following code:

cp -r input/data_folder/NC_000913.3_no_duplicates.gff READemption_analysis/input/annotations/

cp -r input/data_folder/GCF_000005845.2_ASM584v2_genomic.fna READemption_analysis/input/reference_sequences/

cp -r input/data_folder/ERCC92.fa READemption_analysis/input/reference_sequences/

Now we can finally start with the alignment. The -p parameter can be increase or decreased based on the CPUs that are available


singularity exec grad-seq-analysis-ecoli.sif reademption align \
--project_path READemption_analysis \
-p 4 \
-a 95 \
-l 20 \
--fastq \
--progress

After the alignment is finished the user can continue running the coverage subcommand


singularity exec grad-seq-analysis-ecoli.sif reademption coverage \
-p 4 \
--project_path READemption_analysis

Now the user can run the gene quantification. This subcommand will have as output one of the tables necessary for our further analysis with GRADitude. With the parameter --features you can specify which features you want to quantify. For our analysis we decided to use CDS, rRNA, tRNA and ncRNA but many more features can be added, depending on the annotation file.

    singularity exec grad-seq-analysis-ecoli.sif reademption gene_quanti \
        -p 4 \
        --features CDS,rRNA,tRNA,ncRNA \
        -l \
        --project_path READemption_analysis

After running the gene quantification, the 2 tables necessary for the downstream analysis are found in:

READemption_analysis/output/align/reports_and_stats ---> read_alignment_stats.csv
&
READemption_analysis/output/gene_quanti/gene_quanti_combined --> gene_wise_quantifications_combined.csv

Let's have a look at the folder structure we created right now:


├── grad-seq-analysis-ecoli.sif
├── input
└── output
└── READemption_analysis
    ├── input
    └── output


Now the user can create a folder GRADitude using the following subcommand


mkdir GRADitude && mkdir GRADitude/input GRADitude/output

and then your folder structure will be like the one showed below where there is an input and output folder inside the GRADitude folder created

├── GRADitude
├── input
    └── output
├── grad-seq-analysis-ecoli.sif
├── input
└── output
└── READemption_analysis
    ├── input
    └── output

We now start the analysis copying all the necessary tables inside GRADitude

  cp -r READemption_analysis/output/align/reports_and_stats/read_alignment_stats.csv ./GRADitude/input/ &&
  cp -r READemption_analysis/output/gene_quanti/gene_quanti_combined/gene_wise_quantifications_combined.csv ./GRADitude/input/