Background: Literature screening is a critical component of evidence synthesis, but it typically requires substantial time and human resources. Artificial intelligence (AI) has shown promise in this field, yet the accuracy and effectiveness of AI tools for literature screening remain uncertain. This study aims to evaluate the performance of several existing AI-powered automated tools for literature screening.

Methods: This diagnostic accuracy study used a cohort to evaluate the performance of five AI tools (ChatGPT 4.0, Claude 3.5, Gemini 1.5, DeepSeek-V3, and RobotSearch) in literature screening. We selected a random sample of 1,000 publications from a well-established literature cohort, with 500 in the randomized controlled trials (RCTs) group and 500 in the others group. Diagnostic accuracy was measured using several metrics, including the false negative fraction (FNF), the false positive fraction (FPF), the time used for screening, and the redundancy number needed to screen.

Results: We reported the FNF for the RCTs group and the FPF for the others group. In the RCTs group, RobotSearch had the lowest FNF at 6.4% (95% CI: 4.6% to 8.9%), whereas Gemini had the highest at 13.0% (95% CI: 10.3% to 16.3%). In the others group, the FPF of the four large language models ranged from 2.8% (95% CI: 1.7% to 4.7%) to 3.8% (95% CI: 2.4% to 5.9%), and every value in that range was significantly lower than RobotSearch's FPF of 22.2% (95% CI: 18.8% to 26.1%). In terms of screening efficiency, the mean time used for screening per article was 1.3 s for ChatGPT, 6.0 s for Claude, 1.2 s for Gemini, and 2.6 s for DeepSeek.

Conclusions: The AI tools assessed in this study performed well in literature screening, but they are not yet suitable as standalone solutions. These tools can serve as effective auxiliary aids, and a hybrid approach that integrates human expertise with AI may enhance both the efficiency and accuracy of the literature screening process.
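To make the reported fractions concrete, here is a minimal Python sketch of how an FNF and a 95% confidence interval can be computed from raw counts. It rests on two assumptions that the abstract does not state: that the 6.4% FNF corresponds to 32 missed RCTs out of 500 (since 32/500 = 6.4%), and that a Wilson score interval is an acceptable interval method. The function names `false_negative_fraction` and `wilson_interval` are illustrative only, and the study may have used a different procedure.

```python
import math


def false_negative_fraction(missed, total):
    """Share of true RCTs that the tool failed to flag as RCTs."""
    if total <= 0 or not 0 <= missed <= total:
        raise ValueError("need 0 <= missed <= total and total > 0")
    return missed / total


def wilson_interval(events, n, z=1.96):
    """Wilson score interval for a binomial proportion events/n.

    z=1.96 gives an approximate 95% interval. Using the Wilson method
    here is an assumption; the abstract does not name its CI method.
    """
    if n <= 0 or not 0 <= events <= n:
        raise ValueError("need 0 <= events <= n and n > 0")
    p = events / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# Assumed counts: 32 missed RCTs out of 500, inferred from the reported 6.4%.
missed, total = 32, 500
fnf = false_negative_fraction(missed, total)
low, high = wilson_interval(missed, total)
print(f"FNF = {fnf:.1%}, 95% CI {low:.1%} to {high:.1%}")
```

Under these assumptions the interval for 32 of 500 comes out at roughly 4.6% to 8.9%, in line with the figure reported for RobotSearch. For some of the other reported intervals this method can differ in the last digit, which suggests the authors may have used a different interval method or unrounded inputs.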