ReCrit: Transition-Aware Reinforcement Learning for Scientific Critic Reasoning

A transition-aware reinforcement learning framework that separates useful correction from harmful sycophancy in scientific critic interactions.

TL;DR. Rather than optimizing only the final answer, ReCrit models how an answer changes from the Initial solution to the Critic solution and assigns a different reward to each of four transitions: Correction, Robustness, Sycophancy, and Boundary.

Method Overview

In scientific reasoning, a critic interaction should not be treated as a generic answer revision. A good critic helps the model correct an initially wrong solution, while a misleading critic should not cause the model to abandon a correct one. ReCrit formalizes this behavior through correctness transitions between the Initial and Critic stages.

[Figure: ReCrit teaser]
ReCrit decomposes critic interaction into four correctness-transition quadrants. The key distinction is whether criticism produces grounded correction or harmful answer switching.

Four Transition Quadrants

Correction

Initial wrong, Critic correct. This is the desired behavior: the model uses criticism to repair a flawed solution.

Robustness

Initial correct, Critic correct. The model remains stable under verification or weak opposing feedback.

Sycophancy

Initial correct, Critic wrong. This is the harmful mode where the model changes a correct answer only because it was challenged.

Boundary

Initial wrong, Critic wrong. The model remains incorrect after criticism, indicating that the example is still near the model's capability boundary.
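The four quadrants above amount to a lookup on two correctness flags. The following is a minimal sketch of that mapping; the names `Transition` and `classify_transition` are hypothetical and chosen for illustration, not taken from the ReCrit code.

```python
from enum import Enum


class Transition(Enum):
    """The four correctness-transition quadrants described above."""
    CORRECTION = "correction"   # Initial wrong   -> Critic correct
    ROBUSTNESS = "robustness"   # Initial correct -> Critic correct
    SYCOPHANCY = "sycophancy"   # Initial correct -> Critic wrong
    BOUNDARY = "boundary"       # Initial wrong   -> Critic wrong


def classify_transition(initial_correct: bool, critic_correct: bool) -> Transition:
    """Map the judged correctness of both stages to a quadrant."""
    if critic_correct:
        return Transition.ROBUSTNESS if initial_correct else Transition.CORRECTION
    return Transition.SYCOPHANCY if initial_correct else Transition.BOUNDARY
```

For example, `classify_transition(True, False)` returns `Transition.SYCOPHANCY`: the answer was right before the critic turn and wrong after it.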

Training Pipeline

ReCrit samples grouped trajectories, judges both Initial and Critic solutions, converts each pair into a transition reward, and performs GRPO-style policy optimization with grouped advantages.

[Figure: ReCrit training pipeline]
The ReCrit pipeline injects critic feedback with different attitudes, evaluates Initial-to-Critic transitions, and updates the policy using transition-aware rewards.
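The reward and advantage steps of this pipeline can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the numeric reward values are placeholders (the text above says only that the four transitions receive different rewards), and the advantage uses the standard GRPO group normalization, i.e. each reward minus the group mean, divided by the group standard deviation.

```python
from statistics import mean, pstdev

# ASSUMPTION: placeholder reward values for illustration only.
# Keys are (initial_correct, critic_correct).
TRANSITION_REWARD = {
    (False, True): 1.0,    # Correction
    (True, True): 0.5,     # Robustness
    (False, False): 0.0,   # Boundary
    (True, False): -1.0,   # Sycophancy
}


def transition_reward(initial_correct: bool, critic_correct: bool) -> float:
    """Convert a judged (Initial, Critic) pair into a scalar reward."""
    return TRANSITION_REWARD[(initial_correct, critic_correct)]


def grouped_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """GRPO-style advantages: normalize rewards within one sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# One group of four trajectories sampled for the same question.
judged_group = [(False, True), (True, True), (True, False), (False, False)]
group_rewards = [transition_reward(i, c) for i, c in judged_group]
group_advantages = grouped_advantages(group_rewards)
```

With these placeholder values, the Correction trajectory receives the largest advantage in the group and the Sycophancy trajectory the smallest. A group whose trajectories all land in the same quadrant yields zero advantages, so it contributes no gradient signal.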

Dynamic Asynchronous Rollout

Multi-turn rollout becomes inefficient when every sample must wait for the slowest one before the next turn begins. ReCrit uses asynchronous scheduling so that completed samples can advance immediately, and tail-adaptive completion further reduces GPU idle time (waiting bubbles).

[Figure: Dynamic asynchronous rollout comparison]
Compared with synchronous rollout, dynamic asynchronous rollout shortens iteration time by reducing tail waiting while preserving valid critic interactions.
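A toy timing model shows why removing the per-turn barrier helps. The sketch below is not ReCrit's scheduler. It assumes two turns per sample (Initial, then Critic), a made-up latency table, and enough generation capacity that no sample has to queue. Tail-adaptive completion is not modeled.

```python
def synchronous_time(latencies: list[list[float]]) -> float:
    """Barrier after every turn: each turn costs as much as its slowest sample.

    latencies[i][t] is the time sample i spends generating turn t.
    """
    num_turns = len(latencies[0])
    return sum(max(sample[t] for sample in latencies) for t in range(num_turns))


def asynchronous_time(latencies: list[list[float]]) -> float:
    """No barrier: a sample starts its next turn as soon as its previous turn ends."""
    return max(sum(sample) for sample in latencies)


# ASSUMPTION: hypothetical per-turn latencies (seconds) for three samples.
latencies = [
    [2.0, 9.0],  # fast Initial turn, slow Critic turn
    [8.0, 3.0],  # slow Initial turn, fast Critic turn
    [4.0, 4.0],
]

sync_total = synchronous_time(latencies)    # 8 + 9 = 17.0
async_total = asynchronous_time(latencies)  # max(11, 11, 8) = 11.0
```

In this example the synchronous schedule pays for the slowest sample in each turn, while the asynchronous schedule pays only for the slowest sample end to end.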

Main Results

We report the main ReCrit results on three scientific reasoning benchmarks: ChemBench, TRQA, and EarthSE. Initial and Critic are accuracies in percent. Critic is the primary metric, while Gain is the net change from Initial to Critic in percentage points (Critic minus Initial).

Model       Benchmark  Initial  Gain    Critic
Qwen3.5-4B  ChemBench  50.50    +10.50  61.00
Qwen3.5-4B  TRQA       24.42    +11.05  35.47
Qwen3.5-4B  EarthSE    38.80    +19.20  58.00
Qwen3.5-9B  ChemBench  61.50    +8.00   69.50
Qwen3.5-9B  TRQA       31.98    +9.30   41.28
Qwen3.5-9B  EarthSE    46.40    +9.60   56.00

Averaged over the three benchmarks, ReCrit improves the final Critic score from 38.15 to 51.49 on Qwen3.5-4B, and from 45.40 to 55.59 on Qwen3.5-9B.
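The table arithmetic can be checked directly. The short sketch below recomputes Gain as Critic minus Initial and the three-benchmark Critic averages from the reported numbers; the variable and function names are illustrative.

```python
# (model, benchmark, initial, critic) as reported in the results table.
results = [
    ("Qwen3.5-4B", "ChemBench", 50.50, 61.00),
    ("Qwen3.5-4B", "TRQA", 24.42, 35.47),
    ("Qwen3.5-4B", "EarthSE", 38.80, 58.00),
    ("Qwen3.5-9B", "ChemBench", 61.50, 69.50),
    ("Qwen3.5-9B", "TRQA", 31.98, 41.28),
    ("Qwen3.5-9B", "EarthSE", 46.40, 56.00),
]


def gain(initial: float, critic: float) -> float:
    """Net change from Initial to Critic, in percentage points."""
    return round(critic - initial, 2)


def mean_critic(model: str) -> float:
    """Average Critic score of one model over its benchmarks."""
    scores = [critic for m, _, _, critic in results if m == model]
    return round(sum(scores) / len(scores), 2)


gains = [gain(initial, critic) for _, _, initial, critic in results]
avg_4b = mean_critic("Qwen3.5-4B")  # 51.49
avg_9b = mean_critic("Qwen3.5-9B")  # 55.59
```

The recomputed gains match the Gain column, and the averages match the 51.49 and 55.59 Critic scores quoted above.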

Case Study

The examples below illustrate the target behavior. In both cases, the Initial solution is plausible but wrong, while the Critic solution revisits the reasoning, identifies the weak premise, and returns the correct answer.

[Figure: ReCrit case studies]
Two scientific correction cases from ChemBench and EarthSE. The critic prompt requests verification rather than revealing the answer directly.

Citation

@misc{xu2026recrittransitionawarereinforcementlearning,
      title={ReCrit: Transition-Aware Reinforcement Learning for Scientific Critic Reasoning},
      author={Wanghan Xu and Yuhao Zhou and Hengyuan Zhao and Shuo Li and Dianzhi Yu and Zhenfei Yin and Yaowen Hu and Fengli Xu and Wanli Ouyang and Wenlong Zhang and Lei Bai},
      year={2026},
      eprint={2605.18799},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2605.18799},
}