πŸ§ͺ GRADitude Tutorial β€” Complete Analysis Example (No Docker/Singularity)

This tutorial will guide you through a complete GRAD-seq analysis using published data and the tool GRADitude.

We will use Escherichia coli GRAD-seq data from HΓΆer et al., 2020 (NAR).
Gradient fractions come from Experiment SRX8595091.


πŸ›  Environment Setup (Conda-based)

We use Conda for reproducibility and ease of setup. This will install READemption, cutadapt, sra-tools, and other required tools.

Step 1: Configure Conda Channels (only once)

conda config --add channels conda-forge
conda config --add channels bioconda

Step 2: Create and Activate Environment

conda create -n gradseq_env python=3.10 -y
conda activate gradseq_env

If you see: Your shell has not been properly configured to use 'conda activate'
run conda init and restart your shell, or source ~/.bashrc (Linux bash) / source ~/.zshrc (macOS zsh).

Step 3: Install required tools

conda install cutadapt sra-tools parallel wget unzip pigz bzip2 coreutils -y
conda install -c bioconda -c conda-forge -c till_sauerwein reademption -y

SSL note: If fastq-dump fails with certificate errors, check version:


If < 2.11, update with:

conda install -c bioconda -c conda-forge sra-tools=3.0.10 -y

** if fastq-dump --version still gives you a previous version then please install using it this:

wget https://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/3.0.10/sratoolkit.3.0.10-mac64.tar.gz
tar -xzf sratoolkit.3.0.10-mac64.tar.gz
export PATH="$PWD/sratoolkit.3.0.10-mac64/bin:$PATH"
fastq-dump --version

πŸ“‚ Set Up Directory Structure

mkdir -p GRADseq_analysis/input/{raw_reads,data_folder} GRADseq_analysis/output
cd GRADseq_analysis

Folder structure after setup:

GRADseq_analysis/
β”œβ”€β”€ input/
β”‚   β”œβ”€β”€ raw_reads/
β”‚   └── data_folder/
└── output/

⬇ Download raw sequencing data from NCBI SRA

We download the 22 gradient fractions from experiment SRX8595091 (HΓΆer et al., 2020).
Each SRR accession corresponds to one gradient fraction.

cd input/raw_reads

for i in SRR12067299 SRR12067300 SRR12067301          SRR12067302 SRR12067303 SRR12067304          SRR12067305 SRR12067306 SRR12067307          SRR12067308 SRR12067309 SRR12067310          SRR12067311 SRR12067312 SRR12067313          SRR12067314 SRR12067315 SRR12067316          SRR12067317 SRR12067318 SRR12067319          SRR12067320
do
    fastq-dump --bzip2 "$i"
done

cd ../..

Rename files for clarity:

cd input/raw_reads &&
mv SRR12067299.fastq.bz2 Grad-00L.fastq.bz2 &&
mv SRR12067300.fastq.bz2 Grad-01.fastq.bz2 &&
mv SRR12067301.fastq.bz2 Grad-02.fastq.bz2 &&
mv SRR12067302.fastq.bz2 Grad-03.fastq.bz2 &&
mv SRR12067303.fastq.bz2 Grad-04.fastq.bz2 &&
mv SRR12067304.fastq.bz2 Grad-05.fastq.bz2 &&
mv SRR12067305.fastq.bz2 Grad-06.fastq.bz2 &&
mv SRR12067306.fastq.bz2 Grad-07.fastq.bz2 &&
mv SRR12067307.fastq.bz2 Grad-08.fastq.bz2 &&
mv SRR12067308.fastq.bz2 Grad-09.fastq.bz2 &&
mv SRR12067309.fastq.bz2 Grad-10.fastq.bz2 &&
mv SRR12067310.fastq.bz2 Grad-11.fastq.bz2 &&
mv SRR12067311.fastq.bz2 Grad-12.fastq.bz2 &&
mv SRR12067312.fastq.bz2 Grad-13.fastq.bz2 &&
mv SRR12067313.fastq.bz2 Grad-14.fastq.bz2 &&
mv SRR12067314.fastq.bz2 Grad-15.fastq.bz2 &&
mv SRR12067315.fastq.bz2 Grad-16.fastq.bz2 &&
mv SRR12067316.fastq.bz2 Grad-17.fastq.bz2 &&
mv SRR12067317.fastq.bz2 Grad-18.fastq.bz2 &&
mv SRR12067318.fastq.bz2 Grad-19.fastq.bz2 &&
mv SRR12067319.fastq.bz2 Grad-20.fastq.bz2 &&
mv SRR12067320.fastq.bz2 Grad-21P.fastq.bz2
cd ../../

πŸ“š Download Reference Files

cd input/data_folder

# Reference genome
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/005/845/GCF_000005845.2_ASM584v2/GCF_000005845.2_ASM584v2_genomic.fna.gz
gzip -d GCF_000005845.2_ASM584v2_genomic.fna.gz

# ERCC spike-ins
wget https://assets.thermofisher.com/TFS-Assets/LSG/manuals/ERCC92.zip
unzip ERCC92.zip && rm ERCC92.zip ERCC92.gtf

# Annotation file
wget https://github.com/foerstner-lab/GRADitude/raw/main/tutorial_data/NC_000913.3_no_duplicates.gff

cd ../..

βœ‚οΈ Trim Adapters with Cutadapt

reademption create -f READemption_analysis -s 'ecoli=Escherichia coli K-12 MG1655'

ls input/raw_reads | parallel -k -j 10   "cutadapt -q 20 -m 1 -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC   -o READemption_analysis/input/reads/{/.}.fastq.bz2 input/raw_reads/{}"   > output/cutadapt_stats.txt

πŸ“ Prepare Reference Files for READemption

cp input/data_folder/NC_000913.3_no_duplicates.gff READemption_analysis/input/ecoli_annotations/
cp input/data_folder/GCF_000005845.2_ASM584v2_genomic.fna READemption_analysis/input/ecoli_reference_sequences/
cp input/data_folder/ERCC92.fa READemption_analysis/input/ecoli_reference_sequences/

🎯 Align Reads

reademption align --project_path READemption_analysis \
-p 5 \
-a 95 \
-l 20 \
--fastq \
--progress


πŸ“Š Generate Coverage

reademption coverage --project_path READemption_analysis -p 4

πŸ“ˆ Gene Quantification

reademption gene_quanti --project_path READemption_analysis --features CDS,rRNA,tRNA,ncRNA -p 4 -l

Output files required for GRADitude are located at:

READemption_analysis/output/align/reports_and_stats/read_alignment_stats.csv
READemption_analysis/output/gene_quanti/gene_quanti_combined/gene_wise_quantifications_combined.csv

πŸ“ Prepare Input for GRADitude

mkdir -p GRADitude/input GRADitude/output

cp READemption_analysis/output/align/reports_and_stats/read_alignment_stats.csv GRADitude/input/
cp READemption_analysis/output/ecoli_gene_quanti_combined/gene_wise_quantifications_combined.csv GRADitude/input/

Folder structure now looks like:

GRADseq_analysis/
β”œβ”€β”€ input/
β”‚   β”œβ”€β”€ raw_reads/
β”‚   └── data_folder/
β”œβ”€β”€ output/
β”œβ”€β”€ READemption_analysis/
β”‚   β”œβ”€β”€ input/
β”‚   └── output/
└── GRADitude/
    β”œβ”€β”€ input/
    └── output/

βœ… Ready for GRADitude!

You have successfully:

  • Preprocessed the raw reads
  • Performed read alignment
  • Generated coverage files
  • Quantified genes

You're now ready to continue the analysis using GRADitude!

Grad-seq Data Preprocessing & Formatting

Once the raw quantification tables are generated, the data must be formatted to be compatible with GRADitude's statistical modules. This phase involves extracting gene identifiers, removing non-relevant fractions (e.g., Total Lysate), and filtering out low-expression genes to ensure robust downstream analysis.

Attribute Extraction & Merging

The first step is to extract specific identifier columns (such as Gene ID and Common Name) from the combined quantification file.

Extract Attributes

Raw quantification tables (like those from READemption) often store gene metadata in a single, semi-structured column called Attributes (e.g., ID=gene01;Name=dnaA;Type=CDS). To perform analysis, these values must be extracted into their own separate columns.

Use the extract_gene_columns command to parse the GFF-style Attributes column and extract specific metadata keys (such as gene ID or Name) into their own dedicated columns.


graditude extract_gene_columns \
    -f input/gene_wise_quantifications_combined.csv \
    -names "gene" "Name" \
    -o input/gene_wise_extended.csv

Consolidate Gene Identifiers

After extracting attributes, gene identifiers may be scattered across different columns depending on the annotation completeness (e.g., some entries use "Parent", others "name" or "ID"). This subcommand consolidates these columns into a single, definitive 'Gene' column to ensure every feature is uniquely identified for downstream analysis.

Command:

graditude merge_attributes \
    -f input/gene_wise_extended.csv \
    -o input/gene_wise_extended_merged.csv

Data Cleaning: Removing Unwanted Columns

After consolidating the identifiers, the dataset often still includes experimental controlsβ€”such as the Lysate (or Input)β€”and redundant metadata columns. While the lysate is essential for quality control, it represents the cellular state prior to fractionation and does not belong to the sedimentation profile. Including it can skew downstream statistical analyses (like normalization and clustering).

The drop_column removes the specified columns to produce a clean numerical matrix for analysis.

graditude drop_column \
    -f input/gene_wise_extended_merged.csv \
    -c "Grad-00L.fastq" "Name" "gene"\
    -o input/gene_wise_quanti.csv

To ensure accurate downstream normalization, the table containing the ERCC spike-in read counts must undergo the same cleaning process as the gene quantification table. This involves removing the Lysate fraction, so that the control data perfectly matches the dimensions of the gradient data.

graditude drop_column \
    -f input/read_alignment_stats.csv \
    -c "Grad-00L.fastq" "Species" \
    -o input/read_alignment_stats_no_lysate.csv

Reorder Columns

After the previous processing steps (merging attributes and dropping controls), the definitive Gene identifier column is typically located at the very end of the table. To prepare the matrix for downstream analysis where the row identifier is expected to be the first column needs to be executed.

The move_columns shifts the specified number of columns from the end of the table to the beginning

graditude move_columns \
    -f input/gene_wise_quanti.csv \
    -n 1 \
    -o input/feature_count_table.csv

Filter Low-Abundance Genes in the gene quantification table

Before running statistical analyses (like clustering or correlation), it is good practice to remove genes that have very low expression levels across the entire gradient. These "noisy" low-count features can destabilize normalization and clustering algorithms. This command filters the quantification table by summing the reads for each gene across the specified fractions and keeping only those that meet a minimum threshold.

The min_row_sum is used to filter out genes that have a total count less than the specified value.

graditude min_row_sum \
    -f input/feature_count_table.csv \
    -fc 11 \
    -fe 32 \
    -m 100 \
    -o input/gene_wise_100.csv

Filter Low-Abundance Genes in the ERCC table

This subcommand filters the ERCC quantification table based on a minimum read count threshold. It calculates the sum of reads for each ERCC transcript across all gradient fractions and discards those that do not meet the specified threshold. This ensures that only robustly detected spike-ins are used for normalization and regression analysis

The min_row_sum_ercc is used to filter out ERCC transcripts that have a total count less than the specified value.```

graditude min_row_sum_ercc \
    -r input/read_alignment_stats_no_lysate.csv \
    -m 100 \
    -fr input/read_alignment_stats_100.csv

Outlier Detection

After filtering for read depth, we must verify that the ERCC spike-ins behave as expected. Theoretically, there should be a linear relationship between the log10(Read Counts) and the log10(Concentration) of the spike-ins.

This command performs a Robust Regression (RANSAC) on each gradient fraction to identify ERCCs that deviate significantly from this linear relationship. By removing these "bad" controls (outliers), we create a high-quality standard for the final normalization.

The robust_regression command is used to perform a robust regression analysis on the ERCC spike-in data. It identifies outliers that deviate significantly from the expected linear relationship between log10(Read Counts) and log10(Concentration) of the spike-ins. By removing these outliers, we ensure that the normalization process is based on a high-quality standard.

graditude robust_regression \
    -r input/read_alignment_stats_100.csv \
    -c ../input/data_folder/cms_095046.txt \
    -n 20 \
    -nc 18 \
    -mix 4 \
    -o input/read_alignment_stats_correlated_ERCC.csv

Normalization

After removing the outlier spike-ins, we proceed to normalize the gene count data. Since sequencing depth can vary between different gradient fractions due to technical reasons (e.g., library preparation efficiency), raw read counts are not directly comparable.

The normalize subcommand uses the filtered, high-quality ERCC spike-ins to calculate Size Factors for each gradient fraction. The method used is the "Median of Ratios" (similar to DESeq2), which is robust against extreme values. The command then divides the raw counts of the target genes by these specific size factors to produce a normalized expression table that allows for accurate comparison across the gradient.

graditude normalize \
    -f input/gene_wise_100.csv \
    -fc 11 \
    -fe 32 \
    -r input/read_alignment_stats_correlated_ERCC.csv \
    -rc 1 \
    -re 22 \
    -o input/Norm_100.csv \
    -s input/size_factor_100.csv

Data Scaling

Once the data has been normalized for sequencing depth, it is often useful to scale the expression profiles to facilitate visualization (e.g., heatmaps) and pattern recognition. Scaling focuses on the shape of the distribution along the gradient rather than the absolute abundance.

The scaling command transforms the normalized counts row-by-row. In this analysis, we use the to_max method, which divides the counts of each gene by its maximum expression value across the gradient. This transformation sets the peak expression of every gene to 1.0, allowing for a direct comparison of sedimentation profiles between genes with very different expression levels.

graditude scaling \
    -f input/Norm_100.csv \
    -fc 11 \
    -fe 32 \
    -sm to_max \
    -o input/scaled_100_to_max.csv

Clustering Analysis

To identify groups of genes with similar sedimentation profiles, we performed clustering analysis on the scaled expression data. Genes falling into the same cluster share similar sedimentation properties, suggesting they may be part of the same macromolecular complexes (e.g., ribosomal subunits) or share similar functional characteristics.

The clustering command applies unsupervised learning algorithms to partition the genes. In this workflow, we utilized the K-Means algorithm (-cm 'k-means'). The number of clusters was set to 6 (-nc 6) based on prior knowledge of the expected sedimentation patterns (e.g., free mRNA, monosomes, polysomes). The input data was already scaled to the maximum value, so no further internal normalization was applied (-sm no_normalization)

graditude clustering \
    -f input/scaled_100_to_max.csv \
    -fc 11 \
    -fe 32 \
    -nc 6 \
    -p 1 \
    -cm 'k-means' \
    -sm no_normalization \
    -o output/k-means_100_max_6_cl.csv

Visualization (t-SNE)

To intuitively visualize the relationships between thousands of genes simultaneously, we employed t-Distributed Stochastic Neighbor Embedding (t-SNE). While clustering groups genes into discrete sets, t-SNE projects the high-dimensional data (expression across 20+ gradient fractions) into a 2D space. In this map, genes with similar sedimentation profiles appear close to each other, while dissimilar genes are far apart.

The t_sne command generates interactive HTML plots (viewable in a web browser) where each dot represents a gene. We generate three views:

  • Colored by Cluster: Validates the K-Means results (clusters should appear as distinct islands).
  • Colored by RNA Class: Highlights the global distribution of RNA types (e.g., rRNA vs mRNA).
  • Colored by Custom Lists: Specifically highlights proteins or genes of interest (e.g., RBPs) to see where they sediment relative to the RNA clusters.
graditude t_sne \
    -f output/k-means_100_max_6_cl.csv \
    -fc 11 \
    -fe 32 \
    -pp 30 \
    -list ../input/data_folder/2019-11-22_predicted_RBPs.txt \
    -names RBPs \
    -o1 output/t-SNE_cl_6cl_30pp.html \
    -o2 output/t-SNE_fe_6cl_30pp.html \
    -o3 output/t-SNE_RBPs_6cl_30pp.html