OPERA

Orchestrated Planner-Executor Reasoning Architecture

A Reinforcement Learning-Enhanced Framework for Reasoning-Oriented Multi-Hop Retrieval

AAAI 2026 Main Track

Abstract

Recent advances in large language models (LLMs) and dense retrievers have driven significant progress in retrieval-augmented generation (RAG). However, existing approaches still face three challenges in complex, reasoning-oriented multi-hop retrieval tasks:

1) Ineffective reasoning-oriented planning: Prior methods struggle to generate robust multi-step plans for complex queries, as rule-based decomposers perform poorly on out-of-template questions.

2) Suboptimal reasoning-driven retrieval: Related methods employ limited query reformulation, leading to iterative retrieval loops that often fail to locate golden documents.

3) Insufficient reasoning-guided filtering: Prevailing methods lack the fine-grained reasoning to effectively filter salient information from noisy results, hindering utilization of retrieved knowledge.

Fundamentally, these limitations all stem from the weak coupling between retrieval and reasoning in current RAG architectures. We introduce the Orchestrated Planner-Executor Reasoning Architecture (OPERA), a novel reasoning-driven retrieval framework. OPERA's Goal Planning Module (GPM) decomposes questions into sub-goals, which are executed by a Reason-Execute Module (REM) with specialized components for precise reasoning and effective retrieval. To train OPERA, we propose Multi-Agents Progressive Group Relative Policy Optimization (MAPGRPO), a novel variant of GRPO. Experiments on complex multi-hop benchmarks show OPERA's superior performance, validating both the MAPGRPO method and OPERA's design.

OPERA Overview

OPERA's MAPGRPO training framework and performance comparison

Architecture

Figure: OPERA architecture.

Goal Planning Module (GPM)

Plan Agent: Decomposes complex queries into executable sub-goals with placeholder dependencies for strategic multi-hop reasoning.
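
A minimal sketch of what a plan with placeholder dependencies could look like. The `[answer from step N]` placeholder syntax, the `SubGoal` class, and the hand-written three-hop plan are illustrative assumptions; the source does not specify the format the Plan Agent actually emits.

```python
import re
from dataclasses import dataclass

# Hypothetical placeholder syntax; the real plan format is not given in the source.
PLACEHOLDER = re.compile(r"\[answer from step (\d+)\]")


@dataclass
class SubGoal:
    step: int  # 1-based position in the plan
    text: str  # sub-question, possibly containing placeholders

    def dependencies(self) -> list[int]:
        """Step numbers whose answers this sub-goal needs."""
        return [int(n) for n in PLACEHOLDER.findall(self.text)]

    def fill(self, answers: dict[int, str]) -> str:
        """Substitute answers from earlier steps into the placeholders."""
        return PLACEHOLDER.sub(lambda m: answers[int(m.group(1))], self.text)


def is_executable(plan: list[SubGoal]) -> bool:
    """A plan can run in order if every dependency points to an earlier step."""
    return all(dep < goal.step for goal in plan for dep in goal.dependencies())


# Hand-written decomposition of a three-hop question (not output from OPERA).
plan = [
    SubGoal(1, "Which company acquired GitHub?"),
    SubGoal(2, "In which country is the headquarters of [answer from step 1] located?"),
    SubGoal(3, "What is the GDP per capita of [answer from step 2]?"),
]
```

Under these assumptions, `plan[1].fill({1: "Microsoft"})` turns the second sub-goal into a concrete retrieval question, and `is_executable(plan)` rejects plans whose placeholders refer to later steps.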

Reason-Execute Module (REM)

Analysis-Answer Agent: Performs information sufficiency assessment and precise answer extraction from retrieved documents.

Rewrite Agent: Reformulates queries adaptively when information is insufficient.
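
The interplay between the two REM agents can be sketched as a retrieve, analyze, rewrite loop. This is a sketch under stated assumptions, not the authors' implementation: `reason_execute`, the `max_rewrites` budget, and the toy retriever, analyzer, and rewriter below are all hypothetical stand-ins for the trained agents and the dense retriever.

```python
from typing import Callable, Optional

Retriever = Callable[[str], list[str]]
Analyzer = Callable[[str, list[str]], Optional[str]]  # answer, or None if insufficient
Rewriter = Callable[[str, list[str]], str]


def reason_execute(sub_goal: str, retrieve: Retriever, analyze: Analyzer,
                   rewrite: Rewriter, max_rewrites: int = 2):
    """Return (answer, rewrites_used); answer is None if the budget runs out."""
    query = sub_goal
    for attempt in range(max_rewrites + 1):
        docs = retrieve(query)
        answer = analyze(sub_goal, docs)  # sufficiency check plus extraction
        if answer is not None:
            return answer, attempt
        query = rewrite(query, docs)      # reformulate and try again
    return None, max_rewrites


# Toy stand-ins (assumptions for illustration only).
CORPUS = {"github acquisition": "Microsoft acquired GitHub in 2018."}


def toy_retrieve(query: str) -> list[str]:
    q = query.lower()
    return [doc for key, doc in CORPUS.items() if all(w in q for w in key.split())]


def toy_analyze(sub_goal: str, docs: list[str]) -> Optional[str]:
    for doc in docs:
        if "acquired GitHub" in doc:
            return doc.split(" acquired")[0]
    return None


def toy_rewrite(query: str, docs: list[str]) -> str:
    return query + " acquisition"


answer, rewrites_used = reason_execute(
    "Which company bought GitHub?", toy_retrieve, toy_analyze, toy_rewrite
)
```

In this toy run the first retrieval returns nothing, the rewriter adds a keyword, and the second retrieval yields a document from which the answer is extracted.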

Trajectory Memory Component

Records complete execution traces with action rationales for enhanced interpretability and debugging.
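
One way such a trace could be stored is sketched below. The record fields (`agent`, `action`, `rationale`, `output`) and the class names are assumptions chosen for illustration; the source only states that complete execution traces and action rationales are recorded.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class TraceStep:
    agent: str      # e.g. "plan", "analysis-answer", "rewrite"
    action: str     # what the agent did
    rationale: str  # why it did it
    output: str     # what it produced


@dataclass
class TrajectoryMemory:
    question: str
    steps: list[TraceStep] = field(default_factory=list)

    def record(self, agent: str, action: str, rationale: str, output: str) -> None:
        """Append one step to the execution trace."""
        self.steps.append(TraceStep(agent, action, rationale, output))

    def by_agent(self, agent: str) -> list[TraceStep]:
        """Filter the trace, e.g. to inspect every query rewrite."""
        return [s for s in self.steps if s.agent == agent]

    def to_json(self) -> str:
        """Serialize the whole trajectory for logging or debugging."""
        return json.dumps(asdict(self), indent=2)


memory = TrajectoryMemory("Which company acquired GitHub?")
memory.record("plan", "decompose", "single-hop question, one sub-goal", "1 sub-goal")
memory.record("rewrite", "reformulate", "first retrieval returned no documents",
              "Which company acquired GitHub? acquisition")
```

Keeping the rationale next to each action is what makes a failed run debuggable: `memory.by_agent("rewrite")` shows every reformulation together with the reason it was triggered.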

Interactive Demonstration


Complex Query:

"What is the GDP per capita of the country where the headquarters of the company that acquired GitHub is located?"

🎯 MAPGRPO Training Mechanism

MAPGRPO trains the agents progressively, using a dataset pre-scored by DeepSeek R1, high-score sample selection, and a role-specific evaluation method for each specialized agent.
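
As background for the GRPO part of MAPGRPO, here is a minimal sketch of the group-relative advantage used by standard GRPO, where each sampled response's reward is normalized by the mean and standard deviation of its own group, together with a hypothetical filter for the high-score sample selection step. The function names, the `score` field, the threshold, and the choice of population standard deviation are assumptions; the progressive and role-specific parts of MAPGRPO are not shown.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalize each reward against its own sampling group (standard GRPO)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; implementations vary
    return [(r - mu) / (sigma + eps) for r in rewards]


def select_high_score(samples: list[dict], threshold: float) -> list[dict]:
    """Keep pre-scored samples at or above a threshold (hypothetical rule)."""
    return [s for s in samples if s["score"] >= threshold]


# Four responses sampled for one prompt, scored by some reward function.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# Hypothetical pre-scored samples and threshold.
pool = [{"id": "a", "score": 0.9}, {"id": "b", "score": 0.4}]
kept = select_high_score(pool, threshold=0.8)
```

Because advantages are computed relative to the group, responses that beat their siblings get a positive signal and the rest a negative one, without a separate value model.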


Experimental Results

Performance on three multi-hop reasoning benchmarks. All methods use Qwen2.5-7B as backbone.

Method                     HotpotQA                    2WikiMultiHopQA             Musique
                           EM (%)        F1 (%)        EM (%)        F1 (%)        EM (%)        F1 (%)
Qwen2.5-7B (No Retrieval)  18.5          26.8          16.2          23.7          4.1           9.1
Single-Step RAG            31.5          44.2          25.9          37.6          14.1          18.4
IRCoT                      42.7          54.8          43.3          56.2*         18.8          23.9
OPERA (CoT)                44.9          58.5*         42.3          50.7          21.2          32.1
Adaptive-RAG               45.7*         56.9          30.1          39.3          24.3*         35.7*
BGM                        41.5          53.8          44.3*         55.8          19.6          26.8
OPERA (MAPGRPO)            57.3 (+11.6)  69.5 (+11.0)  60.2 (+15.9)  72.7 (+16.5)  39.7 (+15.4)  58.0 (+22.3)

Numbers in parentheses show the absolute improvement, in percentage points, over the best baseline in each column (marked with *; underlined in the original table).
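
For readers unfamiliar with the two metrics in the table, the following sketch shows how exact match (EM) and token-level F1 are commonly computed for question answering. The normalization steps (lowercasing, stripping punctuation and English articles) follow the widely used SQuAD-style convention and are an assumption here; the source does not state which evaluation script was used.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))


def f1_score(prediction: str, gold: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

EM rewards only exact answers, while F1 gives partial credit: a prediction of "Redmond, Washington" against a gold answer of "Redmond" scores 0 on EM but two thirds on F1.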

25.4%: relative EM improvement over the best baseline on HotpotQA
63.4%: relative EM improvement over the best baseline on Musique
39.7%: EM score on the challenging Musique dataset
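
The headline percentages can be reproduced from the EM columns of the results table. The sketch below uses only values from that table; the variable and function names are illustrative.

```python
# EM (%) of the best baseline per dataset, taken from the results table.
best_baseline_em = {"HotpotQA": 45.7, "2WikiMultiHopQA": 44.3, "Musique": 24.3}
# EM (%) of OPERA (MAPGRPO), taken from the same table.
opera_em = {"HotpotQA": 57.3, "2WikiMultiHopQA": 60.2, "Musique": 39.7}


def improvements(opera: float, baseline: float) -> tuple[float, float]:
    """Return (absolute gain in points, relative gain in percent)."""
    absolute = round(opera - baseline, 1)
    relative = round(100 * (opera - baseline) / baseline, 1)
    return absolute, relative


summary = {
    name: improvements(opera_em[name], best_baseline_em[name])
    for name in opera_em
}
```

For HotpotQA this gives an absolute gain of 11.6 points and a relative gain of 25.4%; for Musique, 15.4 points and 63.4%. The same calculation for 2WikiMultiHopQA gives 15.9 points and 35.9%.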

Key Insights

🎯 Performance Scales with Difficulty

OPERA's gains grow with dataset difficulty: the relative EM improvement over the best baseline is 63.4% on Musique versus 25.4% on HotpotQA, suggesting that our approach is particularly effective for complex multi-hop reasoning.

🚀 Outperforms RL Baselines

While the RL baseline BGM achieves only 19.6% EM on Musique, OPERA reaches 39.7% EM, indicating that a specialized agent architecture provides benefits beyond RL optimization alone.

📊 Consistent Improvements

OPERA achieves the best results among the compared methods across different reasoning patterns (comparison, entity traversal, and compositional reasoning), showing broad applicability.

📝 Citation

🎯 Important Note

Multi-agent systems are being actively explored by researchers in both academia and industry. This work introduces new approaches to multi-agent collaboration and retrieval-augmented question answering, including:

  • A hierarchical three-agent architecture with systematic planning-execution decoupling
  • Multi-Agent Progressive Group Relative Policy Optimization (MAPGRPO) training framework
  • Role-specific reward functions for reinforcement learning in RAG systems

If you find our reward design, architectural patterns, or training methodologies useful for your research, we kindly ask you to cite our work.

BibTeX Citation

GitHub Repository: https://github.com/Ameame1/OPERA

@article{opera2025,
  title={OPERA: Orchestrated Planner-Executor Reasoning Architecture
         for Reasoning-Centric Retrieval},
  author={Anonymous},
  journal={arXiv preprint arXiv:2508.16438},
  year={2025},
  url={https://arxiv.org/pdf/2508.16438}
}

🙏 Acknowledgements

Our training framework is built on TRL (Transformer Reinforcement Learning), Hugging Face's library for training transformer language models with reinforcement learning. MAPGRPO is a variant of GRPO (Group Relative Policy Optimization), introduced by DeepSeek, that adds progressive multi-agent training and role-specific reward functions.