A Reinforcement Learning-Enhanced Framework for Reasoning-Oriented Multi-Hop Retrieval
Recent advances in large language models (LLMs) and dense retrievers have driven significant progress in retrieval-augmented generation (RAG). However, existing approaches still face three key challenges in complex reasoning-oriented multi-hop retrieval tasks:
1) Ineffective reasoning-oriented planning: Prior methods struggle to generate robust multi-step plans for complex queries, as rule-based decomposers perform poorly on out-of-template questions.
2) Suboptimal reasoning-driven retrieval: Related methods employ limited query reformulation, leading to iterative retrieval loops that often fail to locate golden documents.
3) Insufficient reasoning-guided filtering: Prevailing methods lack the fine-grained reasoning to effectively filter salient information from noisy results, hindering utilization of retrieved knowledge.
Fundamentally, these limitations all stem from the weak coupling between retrieval and reasoning in current RAG architectures. We introduce the Orchestrated Planner-Executor Reasoning Architecture (OPERA), a novel reasoning-driven retrieval framework. OPERA's Goal Planning Module (GPM) decomposes questions into sub-goals, which are executed by a Reason-Execute Module (REM) with specialized components for precise reasoning and effective retrieval. To train OPERA, we propose Multi-Agents Progressive Group Relative Policy Optimization (MAPGRPO), a novel variant of GRPO. Experiments on complex multi-hop benchmarks show OPERA's superior performance, validating both the MAPGRPO method and OPERA's design.
OPERA's MAPGRPO training framework and performance comparison
Plan Agent: Decomposes complex queries into executable sub-goals with placeholder dependencies for strategic multi-hop reasoning.
Analysis-Answer Agent: Performs information sufficiency assessment and precise answer extraction from retrieved documents.
Rewrite Agent: Reformulates queries adaptively when information is insufficient.
Records complete execution traces with action rationales for enhanced interpretability and debugging.
"What is the GDP per capita of the country where the headquarters of the company that acquired GitHub is located?"
Progressive multi-agent training with a pre-scored dataset from DeepSeek R1, high-score sample selection, and role-specific evaluation methods for each specialized agent.
Our novel training algorithm enables fine-grained, role-specific credit assignment through:
# Stage 1: Plan Agent Training
for epoch in range(E1):
    for query in dataset:
        # Sample a group of candidate plans from the current policy
        candidates = generate_candidates(plan_agent, query)
        best = select_best_prescored(query)      # pre-scored high-score expert sample
        candidates.append(best)
        rewards = compute_plan_rewards(candidates)   # logic, structure, execution
        update_plan_agent(plan_agent, candidates, rewards)

# Stage 2: Analysis-Answer Agent Training
for epoch in range(E2):
    for context in exec_dataset:
        candidates = generate_candidates(analysis_agent, context)
        rewards = compute_analysis_rewards(candidates)   # sufficiency, accuracy, format
        update_analysis_agent(analysis_agent, candidates, rewards)

# Stage 3: Rewrite Agent Training
for epoch in range(E3):
    for query in rewrite_dataset:
        candidates = generate_candidates(rewrite_agent, query)
        rewards = compute_rewrite_rewards(candidates)
        update_rewrite_agent(rewrite_agent, candidates, rewards)
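At every stage, the group-relative part of MAPGRPO follows the usual GRPO recipe: a group of candidates is generated for the same input, and each candidate's reward is normalized against the group's statistics to obtain its advantage. Below is a minimal sketch of that normalization, assuming the standard mean/std form (the exact variant used in MAPGRPO may differ).

from typing import List

def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    # Advantage of each candidate = its reward centered and scaled by the
    # statistics of its own sampling group (standard GRPO-style estimate;
    # treating this as MAPGRPO's exact formula is an assumption).
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Example: a group of 8 candidate plans scored by the role-specific reward
advantages = group_relative_advantages([0.9, 0.4, 0.7, 0.2, 0.8, 0.3, 0.5, 0.6])

Injecting the pre-scored expert sample into each group (Stage 1 above) guarantees at least one high-reward candidate per group, which is how the framework addresses reward sparsity.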
Performance on three multi-hop reasoning benchmarks (all methods use Qwen2.5-7B as the backbone). Numbers in parentheses show the improvement over the best baseline (underlined values).
OPERA shows larger improvements on more challenging datasets—63.4% relative improvement on MuSiQue versus 25.4% on HotpotQA, suggesting our approach excels at complex multi-hop reasoning.
While BGM achieves only 19.6% EM on MuSiQue, OPERA reaches 39.7% EM, demonstrating that the specialized agent architecture provides benefits beyond RL optimization alone.
OPERA achieves state-of-the-art results across different reasoning patterns—comparison, entity traversal, and compositional reasoning—showing broad applicability.
from src.core.orchestrator import OPERAOrchestrator
from src.training.reward_functions import (
    PlanRewardFunction,
    AnalysisRewardFunction,
    RewriteRewardFunction
)
# PlanAgent, AnalysisAnswerAgent, and RewriteAgent are assumed to be
# importable from the repository's agent module (exact path not shown here).

# Initialize OPERA with custom reward functions
orchestrator = OPERAOrchestrator(
    plan_agent=PlanAgent(
        model="Qwen2.5-7B-Instruct",
        reward_fn=PlanRewardFunction(
            λ1=0.4,  # Logic weight
            λ2=0.2,  # Structure weight
            λ3=0.4   # Execution weight
        )
    ),
    analysis_agent=AnalysisAnswerAgent(
        model="Qwen2.5-7B-Instruct",
        reward_fn=AnalysisRewardFunction(
            α=0.3,  # Sufficiency weight
            β=0.5,  # Accuracy weight
            γ=0.2   # Format weight
        )
    ),
    rewrite_agent=RewriteAgent(
        model="Qwen2.5-3B-Instruct",
        reward_fn=RewriteRewardFunction(
            ω1=0.8,  # Retrieval weight
            ω2=0.2   # Format weight
        )
    )
)

# Process a complex multi-hop question
question = "What is the GDP per capita of the country where the headquarters of the company that acquired GitHub is located?"
result = orchestrator.process_question(question)
from src.training.mapgrpo_trainer import MAPGRPOTrainer

# Initialize MAPGRPO with high-score sample selection
trainer = MAPGRPOTrainer(
    group_size=8,
    kl_coeff=0.05,
    use_high_score_selection=True,
    prescored_dataset="data/expert_demonstrations.json"
)

# Stage 1: Train Plan Agent
trainer.train_plan_agent(
    epochs=10,
    reward_components=['logic', 'structure', 'execution']
)

# Stage 2: Train Analysis-Answer Agent
trainer.train_analysis_agent(
    epochs=10,
    reward_components=['sufficiency', 'accuracy', 'format']
)

# Stage 3: Train Rewrite Agent
trainer.train_rewrite_agent(
    epochs=5,
    reward_components=['retrieval', 'quality']
)
Specialized agents for planning, analysis-answering, and query rewriting
Role-specific reward functions for fine-grained credit assignment
Dynamic dependency tracking with [entity from step X] references (see the resolution sketch after this list)
Pre-scored expert samples to address reward sparsity
Complete execution traces for interpretability
Sequential optimization with distribution adaptation
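The dependency-tracking feature above can be pictured as a simple placeholder substitution at execution time. The regex, function name, and data shapes below are illustrative assumptions; only the [entity from step X] placeholder syntax is taken from the feature description.

import re
from typing import Dict

# Hypothetical resolver: replace "[entity from step X]" with the answer
# produced at step X before the sub-goal is sent to retrieval.
PLACEHOLDER = re.compile(r"\[entity from step (\d+)\]")

def resolve_placeholders(sub_goal: str, answers: Dict[int, str]) -> str:
    return PLACEHOLDER.sub(lambda m: answers[int(m.group(1))], sub_goal)

answers = {1: "Microsoft"}
print(resolve_placeholders("Where is the headquarters of [entity from step 1]?", answers))
# -> Where is the headquarters of Microsoft?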
Preprint - Under Review
Recent advances in large language models (LLMs) and dense retrievers have driven significant progress in retrieval-augmented generation (RAG). However, existing approaches face significant challenges in complex reasoning-oriented multi-hop retrieval tasks. We introduce OPERA, a novel reasoning-driven retrieval framework that systematically decouples high-level strategic planning from low-level tactical execution through specialized agents. To train OPERA, we propose Multi-Agents Progressive Group Relative Policy Optimization (MAPGRPO), a novel variant of GRPO. Experiments on complex multi-hop benchmarks show OPERA's superior performance, validating both the MAPGRPO method and OPERA's design.
@article{opera2025,
title={OPERA: A Reinforcement Learning-Enhanced Orchestrated
Planner-Executor Architecture for Reasoning-Oriented
Multi-Hop Retrieval},
author={Anonymous Authors},
journal={arXiv preprint arXiv:2508.16438},
year={2025},
url={https://arxiv.org/abs/2508.16438}
}
Author information is anonymized for double-blind review