OPERA

Orchestrated Planner-Executor Reasoning Architecture

A Reinforcement Learning-Enhanced Framework for Reasoning-Oriented Multi-Hop Retrieval

arXiv preprint: arXiv:2508.16438

Abstract

Recent advances in large language models (LLMs) and dense retrievers have driven significant progress in retrieval-augmented generation (RAG). However, existing approaches face significant challenges in complex reasoning-oriented multi-hop retrieval tasks:

1) Ineffective reasoning-oriented planning: Prior methods struggle to generate robust multi-step plans for complex queries, as rule-based decomposers perform poorly on out-of-template questions.

2) Suboptimal reasoning-driven retrieval: Related methods employ limited query reformulation, leading to iterative retrieval loops that often fail to locate golden documents.

3) Insufficient reasoning-guided filtering: Prevailing methods lack the fine-grained reasoning to effectively filter salient information from noisy results, hindering utilization of retrieved knowledge.

Fundamentally, these limitations all stem from the weak coupling between retrieval and reasoning in current RAG architectures. We introduce the Orchestrated Planner-Executor Reasoning Architecture (OPERA), a novel reasoning-driven retrieval framework. OPERA's Goal Planning Module (GPM) decomposes questions into sub-goals, which are executed by a Reason-Execute Module (REM) with specialized components for precise reasoning and effective retrieval. To train OPERA, we propose Multi-Agents Progressive Group Relative Policy Optimization (MAPGRPO), a novel variant of GRPO. Experiments on complex multi-hop benchmarks show OPERA's superior performance, validating both the MAPGRPO method and OPERA's design.

Code Available on GitHub
OPERA Overview

OPERA's MAPGRPO training framework and performance comparison

Architecture

OPERA Architecture

Goal Planning Module (GPM)

Plan Agent: Decomposes complex queries into executable sub-goals with placeholder dependencies for strategic multi-hop reasoning.
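
As an illustration of the placeholder mechanism, a decomposed plan for the demo query shown later on this page could look like the following (the dictionary schema is hypothetical, not OPERA's actual output format):

# Hypothetical plan produced by the Plan Agent; later sub-goals reference
# earlier answers through "[entity from step X]" placeholders.
plan = [
    {"step": 1, "sub_goal": "Which company acquired GitHub?"},
    {"step": 2, "sub_goal": "Where is the headquarters of [entity from step 1]?"},
    {"step": 3, "sub_goal": "Which country is [entity from step 2] located in?"},
    {"step": 4, "sub_goal": "What is the GDP per capita of [entity from step 3]?"},
]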

Reason-Execute Module (REM)

Analysis-Answer Agent: Performs information sufficiency assessment and precise answer extraction from retrieved documents.

Rewrite Agent: Reformulates queries adaptively when information is insufficient.

Trajectory Memory Component

Records complete execution traces with action rationales for enhanced interpretability and debugging.
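
Taken together, a single REM step can be pictured as the loop below; all class interfaces and function names are illustrative sketches, not the repository's actual API:

def execute_sub_goal(sub_goal, retriever, analysis_agent, rewrite_agent,
                     trajectory, max_rounds=3):
    """Illustrative REM loop: retrieve, assess sufficiency, answer or rewrite."""
    query = sub_goal
    for round_idx in range(max_rounds):
        docs = retriever.search(query)                    # retrieve candidate documents
        verdict = analysis_agent.analyze(sub_goal, docs)  # sufficiency check + extraction
        trajectory.append({                               # record the trace for interpretability
            "round": round_idx,
            "query": query,
            "sufficient": verdict.sufficient,
            "rationale": verdict.rationale,
        })
        if verdict.sufficient:
            return verdict.answer                         # precise answer extracted
        query = rewrite_agent.rewrite(sub_goal, docs)     # reformulate the query and retry
    return None                                           # unresolved after max_rounds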

Interactive Demonstration

Interactive Query Decomposition Demo

Complex Query:

"What is the GDP per capita of the country where the headquarters of the company that acquired GitHub is located?"

🎯 MAPGRPO Training Mechanism

Progressive multi-agent training with a pre-scored dataset from DeepSeek R1, high-score sample selection, and role-specific evaluation methods for each specialized agent.
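
Conceptually, high-score sample selection mixes the best pre-scored demonstration into every sampled candidate group before rewards are computed; a minimal sketch, assuming the pre-scored data maps each query to (candidate, score) pairs:

def build_candidate_group(plan_agent, query, prescored, group_size=8):
    """Sample group_size - 1 candidates from the current policy and inject the
    highest-scoring pre-scored demonstration for this query (illustrative only)."""
    candidates = [plan_agent.generate(query) for _ in range(group_size - 1)]
    best_plan, _ = max(prescored[query], key=lambda pair: pair[1])
    candidates.append(best_plan)
    return candidates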


MAPGRPO Training Algorithm

Multi-Agents Progressive Group Relative Policy Optimization

Our novel training algorithm enables fine-grained, role-specific credit assignment through:

  • Progressive training of specialized agents
  • Group-based advantage computation (see the sketch below this list)
  • Execution-based plan evaluation
  • Agent-specific reward functions
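
For reference, the group-based advantage step follows the usual GRPO recipe of normalizing each candidate's reward against its group statistics; a minimal sketch, not the exact implementation:

import statistics

def group_advantages(rewards, eps=1e-8):
    """GRPO-style normalization: each candidate's advantage is its reward
    centered by the group mean and scaled by the group standard deviation."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Example: advantages for four candidate plans scored by the plan reward function
print(group_advantages([0.9, 0.4, 0.7, 0.2]))  # ≈ [1.30, -0.56, 0.56, -1.30]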

Algorithm 1: MAPGRPO Training


# Stage 1: Plan Agent Training
for epoch in range(E1):
    for query in plan_dataset:
        # Sample a candidate group and inject the best pre-scored expert plan
        candidates = generate_candidates(plan_agent, query)
        candidates.append(select_best_prescored(query))
        rewards = compute_plan_rewards(candidates)        # logic, structure, execution
        update_plan_agent(candidates, rewards)            # group-relative policy update

# Stage 2: Analysis-Answer Agent Training
for epoch in range(E2):
    for context in exec_dataset:
        candidates = generate_candidates(analysis_agent, context)
        rewards = compute_analysis_rewards(candidates)    # sufficiency, accuracy, format
        update_analysis_agent(candidates, rewards)

# Stage 3: Rewrite Agent Training
for epoch in range(E3):
    for query in rewrite_dataset:
        candidates = generate_candidates(rewrite_agent, query)
        rewards = compute_rewrite_rewards(candidates)     # retrieval, format
        update_rewrite_agent(candidates, rewards)
                    

Experimental Results

Performance on three multi-hop reasoning benchmarks. All methods use Qwen2.5-7B as backbone.

Method                       HotpotQA             2WikiMultiHopQA      MuSiQue
                             EM (%)   F1 (%)      EM (%)   F1 (%)      EM (%)   F1 (%)
Qwen2.5-7B (No Retrieval)    18.5     26.8        16.2     23.7        4.1      9.1
Single-Step RAG              31.5     44.2        25.9     37.6        14.1     18.4
IRCoT                        42.7     54.8        43.3     56.2        18.8     23.9
OPERA (CoT)                  44.9     58.5        42.3     50.7        21.2     26.4
Adaptive-RAG                 45.7     56.9        30.1     39.3        24.3     35.7
BGM                          41.5     53.8        44.3     55.8        19.6     26.8
OPERA (MAPGRPO)              57.3 (+11.6)  69.5 (+11.0)   60.2 (+15.9)  72.7 (+16.5)   39.7 (+15.4)  51.9 (+16.2)

Numbers in parentheses show the absolute improvement over the best baseline for each metric.

  • 25.4% relative improvement on HotpotQA
  • 63.4% relative improvement on MuSiQue
  • 39.7% EM on the challenging MuSiQue dataset

Key Insights

🎯 Performance Scales with Difficulty

OPERA shows larger improvements on more challenging datasets—63.4% relative improvement on MuSiQue versus 25.4% on HotpotQA, suggesting our approach excels at complex multi-hop reasoning.

🚀 Outperforms RL Baselines

While BGM achieves only 19.6% EM on MuSiQue, OPERA reaches 39.7% EM, demonstrating that specialized agent architecture provides benefits beyond RL optimization alone.

📊 Consistent Improvements

OPERA achieves state-of-the-art results across different reasoning patterns—comparison, entity traversal, and compositional reasoning—showing broad applicability.

Code Implementation

Implementation Example


from src.core.orchestrator import OPERAOrchestrator
# Agent classes are assumed to be importable from the package as well
# (the exact module path may differ in the repository):
from src.core.agents import PlanAgent, AnalysisAnswerAgent, RewriteAgent
from src.training.reward_functions import (
    PlanRewardFunction,
    AnalysisRewardFunction,
    RewriteRewardFunction
)

# Initialize OPERA with custom reward functions
orchestrator = OPERAOrchestrator(
    plan_agent=PlanAgent(
        model="Qwen2.5-7B-Instruct",
        reward_fn=PlanRewardFunction(
            λ1=0.4,  # Logic weight
            λ2=0.2,  # Structure weight  
            λ3=0.4   # Execution weight
        )
    ),
    analysis_agent=AnalysisAnswerAgent(
        model="Qwen2.5-7B-Instruct",
        reward_fn=AnalysisRewardFunction(
            α=0.3,   # Sufficiency weight
            β=0.5,   # Accuracy weight
            γ=0.2    # Format weight
        )
    ),
    rewrite_agent=RewriteAgent(
        model="Qwen2.5-3B-Instruct",
        reward_fn=RewriteRewardFunction(
            ω1=0.8,  # Retrieval weight
            ω2=0.2   # Format weight
        )
    )
)

# Process complex multi-hop question
question = "What is the GDP per capita of the country where the headquarters of the company that acquired GitHub is located?"
result = orchestrator.process_question(question)
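
Assuming the reward functions combine their components as weighted sums (the weights above suggest this, but the exact formulas live in src.training.reward_functions), the per-agent rewards would reduce to:

def plan_reward(logic, structure, execution, λ1=0.4, λ2=0.2, λ3=0.4):
    # Weighted sum of the Plan Agent's reward components (assumed form)
    return λ1 * logic + λ2 * structure + λ3 * execution

def analysis_reward(sufficiency, accuracy, fmt, α=0.3, β=0.5, γ=0.2):
    # Weighted sum of the Analysis-Answer Agent's reward components (assumed form)
    return α * sufficiency + β * accuracy + γ * fmt

def rewrite_reward(retrieval, fmt, ω1=0.8, ω2=0.2):
    # Weighted sum of the Rewrite Agent's reward components (assumed form)
    return ω1 * retrieval + ω2 * fmt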
                    

MAPGRPO Training Framework


from src.training.mapgrpo_trainer import MAPGRPOTrainer

# Initialize MAPGRPO with high-score sample selection
trainer = MAPGRPOTrainer(
    group_size=8,
    kl_coeff=0.05,
    use_high_score_selection=True,
    prescored_dataset="data/expert_demonstrations.json"
)

# Stage 1: Train Plan Agent
trainer.train_plan_agent(
    epochs=10,
    reward_components=['logic', 'structure', 'execution']
)

# Stage 2: Train Analysis-Answer Agent  
trainer.train_analysis_agent(
    epochs=10,
    reward_components=['sufficiency', 'accuracy', 'format']
)

# Stage 3: Train Rewrite Agent
trainer.train_rewrite_agent(
    epochs=5,
    reward_components=['retrieval', 'quality']
)
                    

Core Implementation Features

Three-Agent Architecture

Specialized agents for planning, analysis-answering, and query rewriting

Multi-Dimensional Rewards

Role-specific reward functions for fine-grained credit assignment

Placeholder Mechanism

Dynamic dependency tracking with [entity from step X] references (see the sketch after this feature list)

High-Score Selection

Pre-scored expert samples to address reward sparsity

Trajectory Memory

Complete execution traces for interpretability

Progressive Training

Sequential optimization with distribution adaptation
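
The placeholder mechanism above can be pictured as a simple substitution pass before each sub-goal is executed; a small self-contained sketch with hypothetical names:

import re

def resolve_placeholders(sub_goal, resolved):
    """Replace "[entity from step X]" references with answers from earlier steps.
    `resolved` maps step numbers to the entities extracted so far."""
    def substitute(match):
        return resolved[int(match.group(1))]
    return re.sub(r"\[entity from step (\d+)\]", substitute, sub_goal)

# Example: step 1 resolved "Which company acquired GitHub?" to an acquirer entity
print(resolve_placeholders("Where is the headquarters of [entity from step 1]?",
                           {1: "Microsoft"}))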

Paper

Paper Preview

OPERA: A Reinforcement Learning-Enhanced Orchestrated Planner-Executor Architecture for Reasoning-Oriented Multi-Hop Retrieval

Preprint - Under Review

Abstract

Recent advances in large language models (LLMs) and dense retrievers have driven significant progress in retrieval-augmented generation (RAG). However, existing approaches face significant challenges in complex reasoning-oriented multi-hop retrieval tasks. We introduce OPERA, a novel reasoning-driven retrieval framework that systematically decouples high-level strategic planning from low-level tactical execution through specialized agents. To train OPERA, we propose Multi-Agents Progressive Group Relative Policy Optimization (MAPGRPO), a novel variant of GRPO. Experiments on complex multi-hop benchmarks show OPERA's superior performance, validating both the MAPGRPO method and OPERA's design.

Citation


@article{opera2025,
  title={OPERA: A Reinforcement Learning-Enhanced Orchestrated 
         Planner-Executor Architecture for Reasoning-Oriented 
         Multi-Hop Retrieval},
  author={Anonymous Authors},
  journal={arXiv preprint arXiv:2508.16438},
  year={2025},
  url={https://arxiv.org/abs/2508.16438}
}
                

Research Team

Anonymous Author 1

Institution Name

Anonymous Author 2

Institution Name

Anonymous Author 3

Institution Name

Author information is anonymized for double-blind review