LLMs for automating systematic reviews
Automating systematic reviews with large language models
Systematic reviews are the foundation of evidence-based medicine, but they typically take over 16 months and cost upwards of $100,000 to complete. We developed otto-SR, an end-to-end agentic workflow that uses large language models to automate the systematic review process, from initial search to analysis.

Performance
otto-SR outperformed traditional dual human workflows in both screening and data extraction:
- Screening: 96.7% sensitivity, 97.9% specificity (vs. human: 81.7% sensitivity, 98.1% specificity)
- Data extraction: 93.1% accuracy (vs. human: 79.7% accuracy)
The system uses GPT-4.1 for screening and o3-mini-high for data extraction, targeting tasks that consume the majority of human researcher time.
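For readers less familiar with the screening metrics above, here is a minimal sketch of how sensitivity and specificity can be computed from include/exclude decisions compared against a reference standard. The function name `screening_metrics` and the toy labels are illustrative assumptions; they are not otto-SR's evaluation code or data.

```python
from typing import Dict, Sequence


def screening_metrics(reference: Sequence[bool], predicted: Sequence[bool]) -> Dict[str, float]:
    """Compare screening decisions with a reference standard.

    True means "include the citation", False means "exclude it".
    Hypothetical helper for illustration only.
    """
    if len(reference) != len(predicted):
        raise ValueError("reference and predicted must have the same length")

    pairs = list(zip(reference, predicted))
    tp = sum(1 for r, p in pairs if r and p)          # eligible, included
    fn = sum(1 for r, p in pairs if r and not p)      # eligible, wrongly excluded
    tn = sum(1 for r, p in pairs if not r and not p)  # ineligible, excluded
    fp = sum(1 for r, p in pairs if not r and p)      # ineligible, wrongly included

    # Sensitivity: share of truly eligible citations that were included.
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    # Specificity: share of truly ineligible citations that were excluded.
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return {"sensitivity": sensitivity, "specificity": specificity}


# Toy labels, made up purely to exercise the function.
reference = [True, True, True, True, False, False, False, False, False, False]
predicted = [True, True, True, False, False, False, False, False, False, True]
metrics = screening_metrics(reference, predicted)
print(metrics)
```

In screening, sensitivity is usually the metric to watch most closely, because a wrongly excluded study is lost to the review, while a wrongly included one can still be removed at a later stage.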
Cochrane Reproducibility Study
We reproduced and updated an entire issue of Cochrane reviews (n=12) in two days, work that represents approximately 12 work-years using conventional methods.
- otto-SR correctly identified all 64 included studies across the 12 reviews
- Incorrectly excluded a median of 0 studies (IQR 0 to 0.25)
- Found a median of 2.0 additional eligible studies (IQR 1 to 6.5) likely missed by the original authors
- Meta-analyses revealed newly statistically significant findings in 2 reviews
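The per-review figures above are reported as a median with an interquartile range (IQR). As a rough sketch of that kind of summary, the snippet below uses Python's standard `statistics` module. The helper name `median_iqr` and the counts are hypothetical, and the quartile convention (`method="inclusive"`) is an assumption: different conventions give slightly different IQR bounds, and the source does not state which one was used.

```python
import statistics
from typing import Sequence, Tuple


def median_iqr(values: Sequence[float]) -> Tuple[float, float, float]:
    """Return (median, first quartile, third quartile) for per-review counts.

    Hypothetical helper; quartiles use the "inclusive" convention,
    which treats the data as the whole population of interest.
    """
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return statistics.median(values), q1, q3


# Made-up counts of additional studies per review, for illustration only.
additional_studies = [1, 2, 2, 3, 5, 8]
summary = median_iqr(additional_studies)
print("median=%s, IQR %s to %s" % summary)
```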
Collaborators
This work was conducted with researchers from Harvard Medical School, University of Toronto, MIT, University of Calgary, Cochrane France, Vector Institute, and others. Key advisors include George Church and Isabelle Boutron.
Publications
Automation of Systematic Reviews with Large Language Models · medRxiv 2025
Development of Prompt Templates for Large Language Model–Driven Screening in Systematic Reviews · Annals of Internal Medicine 2025
Press
Featured in STAT News and Nature News. Cited in J.P. Morgan's Eye on the Market.