RNAGym

Large-scale benchmarks for RNA fitness and structure prediction

RNA molecules play critical roles in biology, from encoding genetic information to catalyzing reactions and regulating gene expression. Despite growing interest in RNA-based therapeutics and the success of deep learning in protein structure prediction, RNA modeling remains comparatively underexplored. RNAGym provides standardized, large-scale benchmarks for rigorously evaluating models on RNA fitness and structure prediction.

Benchmark Tasks

RNAGym focuses on three core prediction tasks:

  • Fitness prediction: Predicting the functional effects of RNA mutations using 70 deep mutational scanning (DMS) assays spanning diverse RNA families
  • Secondary structure prediction: Predicting base-pairing patterns from sequence using 901k chemical reactivity profiles from standardized experiments
  • Tertiary structure prediction: Predicting 3D atomic coordinates using 215 high-resolution structures from the RNA Puzzles and CASP-RNA blind prediction challenges
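To make the fitness task concrete, the sketch below scores a model's predictions against the measured values from a single DMS assay. It assumes a rank-based metric (Spearman correlation), which is a common choice for DMS benchmarks; this page does not state which metric RNAGym uses. The function names and the toy numbers are hypothetical and are not taken from the RNAGym codebase or data.

```python
from typing import Sequence


def average_ranks(values: Sequence[float]) -> list[float]:
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of values tied with the one at position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(predicted: Sequence[float], measured: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation computed on average ranks."""
    if len(predicted) != len(measured) or len(predicted) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rp, rm = average_ranks(predicted), average_ranks(measured)
    mean_p = sum(rp) / len(rp)
    mean_m = sum(rm) / len(rm)
    cov = sum((a - mean_p) * (b - mean_m) for a, b in zip(rp, rm))
    var_p = sum((a - mean_p) ** 2 for a in rp)
    var_m = sum((b - mean_m) ** 2 for b in rm)
    if var_p == 0 or var_m == 0:
        return float("nan")  # correlation is undefined for constant input
    return cov / (var_p * var_m) ** 0.5


if __name__ == "__main__":
    # Hypothetical toy data: model scores and assay measurements for 5 mutants.
    model_scores = [-2.1, -0.3, 0.4, -1.5, 0.9]
    dms_fitness = [0.10, 0.55, 0.80, 0.60, 0.95]
    print(f"Spearman correlation: {spearman(model_scores, dms_fitness):.3f}")
```

A rank-based score suits this setting because model outputs (for example, log-likelihood ratios) and assay readouts are on different scales, so only the ordering of mutants is directly comparable. Under this assumption, a benchmark-level number would be obtained by computing the score per assay and then aggregating across assays.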

Key Findings

  • RNA language models show promise for fitness prediction but remain far from solving the task
  • Secondary structure prediction methods generalize poorly to new RNA families, with performance dropping significantly on sequences distant from training data
  • Tertiary structure methods achieve moderate local accuracy but struggle with global RNA fold prediction
  • Current RNA foundation models underperform relative to protein foundation models at comparable scales

Resources

We release all datasets, evaluation code, and baseline implementations to facilitate reproducible research. The benchmark is designed to be extensible, allowing researchers to easily add new datasets and models.

Website · GitHub

Collaborators

This work was conducted with researchers from Harvard Medical School, Stanford University, MIT, and Genentech, in collaboration with Debora Marks' lab.

Publications

RNAGym: Large-scale Benchmarks for RNA Fitness and Structure Prediction · bioRxiv 2024