A Reinforcement Learning-Enhanced Framework for Reasoning-Oriented Multi-Hop Retrieval
Recent advances in large language models (LLMs) and dense retrievers have driven significant progress in retrieval-augmented generation (RAG). However, existing approaches still struggle with complex, reasoning-oriented multi-hop retrieval tasks in three ways:
1) Ineffective reasoning-oriented planning: Prior methods struggle to generate robust multi-step plans for complex queries, as rule-based decomposers perform poorly on out-of-template questions.
2) Suboptimal reasoning-driven retrieval: Related methods employ limited query reformulation, leading to iterative retrieval loops that often fail to locate golden documents.
3) Insufficient reasoning-guided filtering: Prevailing methods lack the fine-grained reasoning to effectively filter salient information from noisy results, hindering utilization of retrieved knowledge.
Fundamentally, these limitations all stem from the weak coupling between retrieval and reasoning in current RAG architectures. We introduce the Orchestrated Planner-Executor Reasoning Architecture (OPERA), a novel reasoning-driven retrieval framework. OPERA's Goal Planning Module (GPM) decomposes questions into sub-goals, which are executed by a Reason-Execute Module (REM) with specialized components for precise reasoning and effective retrieval. To train OPERA, we propose Multi-Agents Progressive Group Relative Policy Optimization (MAPGRPO), a novel variant of GRPO. Experiments on complex multi-hop benchmarks show OPERA's superior performance, validating both the MAPGRPO method and OPERA's design.
OPERA's MAPGRPO training framework and performance comparison
Plan Agent: Decomposes complex queries into executable sub-goals with placeholder dependencies for strategic multi-hop reasoning.
Analysis-Answer Agent: Performs information sufficiency assessment and precise answer extraction from retrieved documents.
Rewrite Agent: Reformulates queries adaptively when information is insufficient.
OPERA also records complete execution traces with action rationales, which improves interpretability and eases debugging.
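The interplay between the agents above can be sketched in a few lines of Python, using the GitHub acquisition question that serves as this page's example query. This is a minimal illustration under stated assumptions, not the released implementation: the "[step N]" placeholder syntax, the `SubGoal` class, the `execute` loop, and the toy lookup table standing in for retrieval are all hypothetical. The GDP value is deliberately left as a stub rather than filled in.

```python
from __future__ import annotations

import re
from dataclasses import dataclass

# Hypothetical placeholder syntax: "[step N]" stands for the answer to sub-goal N.
PLACEHOLDER = re.compile(r"\[step (\d+)\]")


@dataclass
class SubGoal:
    step: int
    question: str  # may reference earlier answers through placeholders


def resolve(question: str, answers: dict[int, str]) -> str:
    """Fill placeholder slots with the answers of earlier sub-goals."""
    return PLACEHOLDER.sub(lambda m: answers[int(m.group(1))], question)


def execute(plan, analyze_answer, rewrite, max_rewrites=2):
    """Run sub-goals in order; rewrite the query when information is insufficient."""
    answers, trace = {}, []
    for goal in plan:
        query = resolve(goal.question, answers)
        answer = analyze_answer(query)  # None means "information insufficient"
        rewrites = 0
        while answer is None and rewrites < max_rewrites:
            trace.append((goal.step, "rewrite", query, None))
            query = rewrite(query)
            answer = analyze_answer(query)
            rewrites += 1
        if answer is None:
            raise LookupError(f"sub-goal {goal.step} could not be answered")
        answers[goal.step] = answer
        trace.append((goal.step, "answer", query, answer))
    return answers, trace


# Toy stand-in for retrieval plus answer extraction (assumption: a lookup table).
TOY_KB = {
    "Which company acquired GitHub?": "Microsoft",
    "Microsoft headquarters country": "the United States",
    "What is the GDP per capita of the United States?": "<value read from the retrieved document>",
}


def toy_analyze_answer(query):
    return TOY_KB.get(query)


def toy_rewrite(query):
    # Toy reformulation rule: turn the headquarters question into a keyword query.
    m = re.fullmatch(r"In which country is the headquarters of (.+) located\?", query)
    return f"{m.group(1)} headquarters country" if m else query


plan = [
    SubGoal(1, "Which company acquired GitHub?"),
    SubGoal(2, "In which country is the headquarters of [step 1] located?"),
    SubGoal(3, "What is the GDP per capita of [step 2]?"),
]

answers, trace = execute(plan, toy_analyze_answer, toy_rewrite)
for entry in trace:
    print(entry)
```

In this toy run the second sub-goal fails on its first phrasing, is rewritten once, and then succeeds; the printed trace records each action in order, which mirrors the role of the execution trace described above.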
"What is the GDP per capita of the country where the headquarters of the company that acquired GitHub is located?"
MAPGRPO trains the agents progressively, using a dataset pre-scored by DeepSeek R1, high-score sample selection, and a role-specific evaluation method for each specialized agent.
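Two ingredients mentioned here lend themselves to a short sketch: the group-relative advantage that GRPO-style methods compute from a group of sampled completions, and the filtering of a pre-scored dataset by score. The code below is a minimal illustration, not MAPGRPO itself. The function names, the use of the population standard deviation, the `eps` constant, and the 0.7 threshold are assumptions made for the example.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the mean and spread of its own group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # assumption: population standard deviation
    return [(r - mean) / (std + eps) for r in rewards]


def select_high_score(samples, threshold):
    """Keep only pre-scored samples whose score meets the threshold."""
    return [s for s in samples if s["score"] >= threshold]


# One prompt, four sampled completions, each scored by a reward function.
group_rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(group_rewards)

# Hypothetical pre-scored samples; the threshold value is illustrative only.
pre_scored = [
    {"id": "a", "score": 0.9},
    {"id": "b", "score": 0.4},
    {"id": "c", "score": 0.75},
]
kept = select_high_score(pre_scored, threshold=0.7)
```

Completions that beat their group's average receive a positive advantage and the rest a negative one, so no separate value network is needed to provide a baseline. When every completion in a group earns the same reward, all advantages are zero and that group contributes no learning signal.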
Performance on three multi-hop reasoning benchmarks. All methods use Qwen2.5-7B as the backbone.
Numbers in parentheses show the improvement over the best baseline (underlined values).
OPERA shows larger improvements on more challenging datasets: a 63.4% relative improvement on MuSiQue versus 25.4% on HotpotQA, suggesting that the approach is particularly effective for complex multi-hop reasoning.
While BGM achieves only 19.6% EM on MuSiQue, OPERA reaches 39.7% EM, indicating that the specialized agent architecture provides benefits beyond RL optimization alone.
OPERA achieves state-of-the-art results across different reasoning patterns (comparison, entity traversal, and compositional reasoning), showing broad applicability.
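For readers checking the arithmetic, the sketch below applies the usual relative-improvement formula, (new - baseline) / baseline, to the two EM scores quoted above for BGM and OPERA. Note that this comparison is against BGM specifically; the 63.4% figure is stated relative to the best baseline, whose score is not given in this text, so it is not recomputed here. The function name is chosen for illustration.

```python
def relative_improvement(new, baseline):
    """Relative improvement of `new` over `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0


bgm_em, opera_em = 19.6, 39.7  # EM scores quoted above

absolute_gain = opera_em - bgm_em
relative_gain = relative_improvement(opera_em, bgm_em)

print(f"absolute: {absolute_gain:.1f} points, relative: {relative_gain:.1f}%")
# prints: absolute: 20.1 points, relative: 102.6%
```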
Multi-agent systems have made significant contributions across various domains, and researchers in both academia and industry are actively exploring their potential. This work introduces novel approaches to multi-agent collaboration and retrieval-augmented question answering.
If you find our reward design, architectural patterns, or training methodologies useful for your research, we kindly ask you to cite our work.
GitHub Repository: https://github.com/Ameame1/OPERA
@article{opera2025,
title={OPERA: Orchestrated Planner-Executor Reasoning Architecture
for Reasoning-Centric Retrieval},
author={Anonymous},
journal={arXiv preprint arXiv:2508.16438},
year={2025},
url={https://arxiv.org/pdf/2508.16438}
}
Our training framework is built upon TRL (Transformer Reinforcement Learning), Hugging Face's library for training transformer language models with reinforcement learning. MAPGRPO extends GRPO (Group Relative Policy Optimization), introduced by DeepSeek, with progressive multi-agent training and role-specific reward functions.
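As an illustration of what a role-specific reward function might look like, the sketch below scores an answer-producing agent by normalized exact match. It is written in the shape TRL's GRPOTrainer expects for custom reward functions: a callable that receives the sampled completions (plus dataset columns as keyword arguments) and returns one float per completion. The choice of exact match, the `answer` column name, and the normalization rules are assumptions for this example; the rewards actually used by MAPGRPO are not specified in this text.

```python
import re
import string


def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match_reward(completions, answer, **kwargs):
    """Return 1.0 where a completion matches its gold answer, else 0.0.

    `answer` is assumed to be a dataset column holding one gold string
    per completion; plain-string (non-conversational) completions are assumed.
    """
    return [
        1.0 if normalize(c) == normalize(a) else 0.0
        for c, a in zip(completions, answer)
    ]


rewards = exact_match_reward(
    ["The United States.", "Canada"],
    answer=["united states", "united states"],
)
print(rewards)  # [1.0, 0.0]
```

A function of this shape could be passed to the trainer through its `reward_funcs` argument; a planning or rewriting agent would need a different scoring rule, which is the point of making rewards role-specific.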