RL impact on LLM reasoning capacity questioned in new study

A new study from Tsinghua University challenges prevailing assumptions about how reinforcement learning (RL) enhances large language models’ reasoning abilities. The research suggests that rather than developing new reasoning capabilities, RL primarily amplifies existing reasoning pathways by increasing their sampling frequency, potentially at the cost of reasoning diversity. This finding has significant implications for AI development strategies and raises questions about the most effective approaches for improving AI reasoning capabilities beyond superficial performance metrics.

The big picture: Researchers found that models fine-tuned with reinforcement learning with verifiable rewards (RLVR) initially appear to reason better but actually narrow the model’s reasoning pathways.

  • When comparing base models to their RL-tuned counterparts using pass@k evaluation metrics, RL-tuned models show better performance at low k values but worse performance at higher values.
  • This unexpected pattern suggests RL doesn’t create new capabilities but rather amplifies certain reasoning pathways while potentially eliminating others.

Key details: The study found that RL-tuned models outperform base models when given a single attempt (pass@1) but underperform when allowed multiple attempts (pass@256), especially on math benchmarks.

  • The crossover point where base models begin outperforming RL-tuned models can occur with as few as four attempts (k=4) on some benchmarks.
  • This performance pattern indicates that RL “narrows the reasoning boundary” by increasing the probability of previously successful reasoning paths at the cost of diversity.
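The pass@k numbers behind this comparison are typically computed with the standard unbiased combinatorial estimator: sample n completions per problem, count the c correct ones, and estimate the chance that at least one of k draws succeeds. The study's exact evaluation harness isn't reproduced here; this is a minimal sketch of that estimator.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n sampled completions, c of them correct.

    pass@k = 1 - C(n - c, k) / C(n, k)
    (the probability that a random subset of k samples contains
    at least one correct completion).
    """
    if n - c < k:
        # Fewer incorrect samples than k: every k-subset contains a success.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With this estimator, a base model that solves a problem on only 2 of 16 diverse samples still approaches pass@16 near 1.0, while an RL-tuned model concentrated on one (wrong) path stays at 0 regardless of k, which is the crossover pattern the study describes.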

Behind the numbers: RL-tuned models’ generations receive low perplexity scores when evaluated by the base model, suggesting that RL-tuned outputs remain within the distribution of responses the base model could already generate.

  • This finding reinforces the theory that RL primarily redistributes probability mass toward certain reasoning paths rather than creating new reasoning capabilities.
  • Sanity checks revealed that for most questions, base models can generate at least one correct reasoning trace, confirming they possess the capability being amplified by RL.
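The perplexity check above amounts to scoring the RL model's output token-by-token under the base model and exponentiating the average negative log-likelihood; a low value means the base model already assigns high probability to that text. A minimal sketch, assuming per-token log-probabilities from the base model have already been collected (the function name and inputs are illustrative, not the paper's code):

```python
import math

def perplexity(token_logprobs: list[float]) -> float:
    """Perplexity of a sequence given its per-token log-probabilities
    (natural log) under some scoring model:
    exp(-average log-likelihood per token).
    """
    assert token_logprobs, "need at least one token"
    avg_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_nll)

# Illustrative comparison: an RL-tuned model's output that the base
# model scores confidently (log-probs near 0) yields low perplexity,
# consistent with it lying inside the base model's distribution.
rl_output_logprobs = [-0.2, -0.1, -0.3, -0.15]     # hypothetical values
off_distribution_logprobs = [-3.1, -2.8, -4.0, -3.5]  # hypothetical values
assert perplexity(rl_output_logprobs) < perplexity(off_distribution_logprobs)
```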

Counterpoints: The researchers acknowledge several limitations that might affect the generalizability of their findings.

  • RL can still enable emergent capabilities on long-horizon tasks not explored in this study.
  • Testing was limited to specific domains (math, code generation, and mathematical visual reasoning) and models up to 32B parameters.
  • Effects might differ for larger models or different types of reasoning tasks.

Why this matters: The study suggests that current RL approaches may not be optimal for developing more capable reasoning in AI systems, despite improving surface-level metrics.

  • These findings connect to previous work on “mode collapse” in RLHF, suggesting a broader pattern in how reinforcement learning affects language models.
  • In contrast, distillation from stronger teacher models was found to “expand the reasoning boundary,” indicating alternative approaches may better develop genuine reasoning capabilities.

Reading between the lines: The research challenges the narrative that reinforcement learning necessarily improves a model’s fundamental capabilities, suggesting AI developers may need to rethink evaluation metrics and development strategies.

  • By relying on pass@1 metrics alone, researchers might overestimate the improvements from RL while missing its potential narrowing effect on reasoning diversity.
  • The findings argue for evaluation that weighs both single-attempt reliability and the breadth of reasoning paths a model can reach across many attempts.
Tsinghua paper: Does RL Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?
