Automating systematic reviews with large language models

Systematic reviews are the foundation of evidence-based medicine, but they typically take over 16 months and cost upwards of $100,000 to complete. We developed otto-SR, an end-to-end agentic workflow using large language models to automate the systematic review process from initial search to analysis.

Figure: the otto-SR workflow, comparing the human systematic review process with the automated LLM workflow.

Performance

otto-SR outperformed traditional dual human workflows in both screening and data extraction:

Screening: 96.7% sensitivity, 97.9% specificity (vs. human: 81.7% sensitivity, 98.1% specificity)
Data extraction: 93.1% accuracy (vs. human: 79.7% accuracy)
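For readers less familiar with the screening metrics, the sketch below shows how sensitivity and specificity follow from paired include/exclude decisions. It is a minimal illustration, not the study's evaluation code: the function name `sensitivity_specificity` and the toy labels are assumptions made up for this example.

```python
from typing import Sequence, Tuple


def sensitivity_specificity(
    truth: Sequence[bool], predicted: Sequence[bool]
) -> Tuple[float, float]:
    """Return (sensitivity, specificity) for include/exclude decisions.

    True means "include the citation", False means "exclude it".
    """
    if len(truth) != len(predicted):
        raise ValueError("truth and predicted must have the same length")

    pairs = list(zip(truth, predicted))
    tp = sum(1 for t, p in pairs if t and p)          # eligible, included
    fn = sum(1 for t, p in pairs if t and not p)      # eligible, missed
    tn = sum(1 for t, p in pairs if not t and not p)  # ineligible, excluded
    fp = sum(1 for t, p in pairs if not t and p)      # ineligible, included

    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one eligible and one ineligible item")

    return tp / (tp + fn), tn / (tn + fp)


# Toy labels (hypothetical): 4 eligible citations, 6 ineligible ones.
truth = [True] * 4 + [False] * 6
predicted = [True, True, True, False] + [False] * 5 + [True]

sens, spec = sensitivity_specificity(truth, predicted)
print(f"sensitivity={sens:.3f}, specificity={spec:.3f}")
```

In screening, sensitivity matters most because a missed eligible study cannot be recovered later, whereas a false inclusion is caught at full-text review.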

The system uses GPT-4.1 for screening and o3-mini-high for data extraction, targeting tasks that consume the majority of human researcher time.
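The two-stage split can be pictured as a pipeline with one pluggable model per stage. The sketch below is only a structural illustration under stated assumptions: `Citation`, `run_review`, and the keyword-based stand-ins are hypothetical names invented here, no real model API is called, and the actual otto-SR implementation may be organized differently.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Citation:
    id: str
    abstract: str


# Each stage is a plain callable, so a different model can back each one
# (hypothetical interface; in practice these would wrap LLM calls).
Screener = Callable[[Citation, str], bool]
Extractor = Callable[[Citation, List[str]], Dict[str, str]]


def run_review(
    citations: List[Citation],
    criteria: str,
    fields: List[str],
    screen: Screener,
    extract: Extractor,
) -> Dict[str, Dict[str, str]]:
    """Screen every citation, then extract data only from included ones."""
    included = [c for c in citations if screen(c, criteria)]
    return {c.id: extract(c, fields) for c in included}


# Stand-in stage functions so the sketch runs without any model access.
def keyword_screen(citation: Citation, criteria: str) -> bool:
    return criteria.lower() in citation.abstract.lower()


def stub_extract(citation: Citation, fields: List[str]) -> Dict[str, str]:
    return {field: "not reported" for field in fields}


citations = [
    Citation("c1", "A randomized trial of drug X in adults."),
    Citation("c2", "A narrative commentary on drug X."),
]
results = run_review(
    citations, "randomized trial", ["sample_size"], keyword_screen, stub_extract
)
print(results)
```

Keeping the stages behind separate callables makes it straightforward to evaluate screening and extraction independently, which is how the performance figures above are reported.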

Cochrane Reproducibility Study

We reproduced and updated an entire issue of Cochrane reviews (n=12) in two days, work that represents approximately 12 work-years by conventional methods.

otto-SR correctly identified all 64 included studies across the 12 reviews
otto-SR incorrectly excluded a median of 0 studies (IQR 0 to 0.25)
otto-SR found a median of 2.0 additional eligible studies (IQR 1 to 6.5) that the original authors likely missed
Meta-analyses revealed newly statistically significant findings in 2 reviews
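A fractional quartile such as 0.25 can look odd for integer study counts; it arises from interpolation between neighboring values. The sketch below illustrates this with made-up counts for 12 reviews. These numbers are hypothetical and are not the study's data, and the quartile convention (Python's "inclusive" method) is an assumption, since the source does not state which one was used.

```python
from statistics import median, quantiles
from typing import List, Tuple

# Hypothetical per-review counts of wrongly excluded studies (NOT study data).
wrongly_excluded = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 2]


def median_and_iqr(counts: List[int]) -> Tuple[float, float, float]:
    """Return (median, Q1, Q3) using linear interpolation between ranks."""
    q1, _, q3 = quantiles(counts, n=4, method="inclusive")
    return median(counts), q1, q3


print(median_and_iqr(wrongly_excluded))  # (0.0, 0.0, 0.25)
```

Other quartile conventions can give slightly different bounds for the same data, so reported IQRs are only comparable when the method matches.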

Collaborators

This work was conducted with researchers from Harvard Medical School, University of Toronto, MIT, University of Calgary, Cochrane France, Vector Institute, and others. Key advisors include George Church and Isabelle Boutron.

Publications

Automation of Systematic Reviews with Large Language Models · medRxiv 2025

Development of Prompt Templates for Large Language Model–Driven Screening in Systematic Reviews · Annals of Internal Medicine 2025

Press

Featured in STAT News and Nature News. Cited in J.P. Morgan's Eye on the Market.