OPERA Training Mechanism

Progressive Multi-Agent Training with Pre-scored Dataset and Role-Specific Evaluation

🚀 DeepSeek R1 API

Generates and Scores Training Candidates

🏆

High-Score Samples

Score > 0.85

15%

Gold-standard plans with perfect decomposition

🥈

Medium-Score Samples

0.5 ≤ Score ≤ 0.85

70%

Valid plans with minor issues

📊

Low-Score Samples

Score < 0.5

15%

Flawed plans that serve as contrastive examples
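The three score tiers above can be sketched as a simple bucketing function. This is a minimal illustration, assuming scores lie in [0, 1]; the function names `score_tier` and `tier_shares` are hypothetical and not part of OPERA. The 15% / 70% / 15% split is a property of the dataset described above, which the helper can only check, not enforce.

```python
from collections import Counter


def score_tier(score: float) -> str:
    """Map a pre-computed score to a tier using the thresholds stated above.

    Boundaries follow the source: > 0.85 is high, 0.5 to 0.85 is medium,
    < 0.5 is low.
    """
    if score > 0.85:
        return "high"
    if score >= 0.5:
        return "medium"
    return "low"


def tier_shares(scores):
    """Return the fraction of samples that falls into each tier."""
    counts = Counter(score_tier(s) for s in scores)
    total = len(scores)
    return {tier: counts.get(tier, 0) / total for tier in ("high", "medium", "low")}
```

Calling `tier_shares` on the full list of scores gives the observed proportions, which can then be compared against the intended 15% / 70% / 15% mix.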

High-Score Sample Selection Strategy

G1 Generated

G2 Generated

G3 Generated

G4 Generated

⭐ High-Score

Group size = 5: four policy-generated samples (G1 to G4) plus one high-score sample from 𝒟_scored
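The group composition can be sketched as follows. This is a hedged illustration only: the record layout (dicts with `plan` and `score` keys), the uniform random choice among high-score samples, and the name `build_group` are assumptions, since the source does not say how the injected sample is picked.

```python
import random


def build_group(policy_samples, scored_pool, high_threshold=0.85, rng=None):
    """Combine four policy-generated plans with one high-score plan.

    policy_samples: list of four plans sampled from the current policy.
    scored_pool:    list of {"plan": ..., "score": float} records standing in
                    for the pre-scored dataset (assumed layout).
    """
    rng = rng or random.Random()
    if len(policy_samples) != 4:
        raise ValueError("expected exactly four policy-generated samples")
    # Keep only samples above the high-score threshold (Score > 0.85).
    high = [item for item in scored_pool if item["score"] > high_threshold]
    if not high:
        raise ValueError("no high-score sample available for injection")
    injected = rng.choice(high)  # assumption: uniform choice among high scorers
    return list(policy_samples) + [injected["plan"]]
```

A group built this way always contains at least one strong reference plan, so the within-group comparison has a positive example even when all four policy samples are weak.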

Progressive MAPGRPO Training Pipeline

Plan Agent Training

Strategic Decomposer

📋

Offline Pre-scored Evaluation

Uses the pre-scored dataset 𝒟_scored, whose samples were already evaluated by end-to-end execution

r_plan = λ₁·f_logic + λ₂·f_struct + λ₃·f_exec

GRPO with High-Score Injection

Learns from pre-evaluated decompositions
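A minimal sketch of the plan reward and of the group-relative advantage that GRPO derives from it. The weight values are placeholders (the source gives no values for λ₁, λ₂, λ₃), each component is assumed to lie in [0, 1], and normalizing by the population standard deviation is one common choice rather than a confirmed detail of MAPGRPO.

```python
from statistics import mean, pstdev


def plan_reward(f_logic, f_struct, f_exec, weights=(0.3, 0.3, 0.4)):
    """r_plan = λ1·f_logic + λ2·f_struct + λ3·f_exec (weights are placeholders)."""
    l1, l2, l3 = weights
    return l1 * f_logic + l2 * f_struct + l3 * f_exec


def group_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group: (r - mean) / (std + eps).

    eps avoids division by zero when every sample in the group has the
    same reward, in which case all advantages are zero.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

With a group of five rewards, the injected high-score sample typically receives the largest advantage, which pulls the policy toward plans of that quality.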

Plan Agent Frozen ❄️

Analysis-Answer Training

Information Extractor

🔍

Execution-Based Scoring

Evaluates answer correctness and evidence quality using ground truth

r_ana = α·𝕀[φ=φ*] + β·EM(a,a*) + γ·f_format

Standard GRPO

Trains on plans from frozen Plan Agent
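The analysis reward can be sketched as below. Several details are assumptions: the source does not define φ here, so it is treated as any label that is compared to a reference φ* by equality; the exact-match normalization (lowercasing and whitespace collapsing) is a simplification; and the values of α, β, γ are placeholders.

```python
def exact_match(pred: str, gold: str) -> float:
    """EM(a, a*): 1.0 if the normalized strings are identical, else 0.0."""
    def normalize(text: str) -> str:
        # Assumed normalization: lowercase and collapse whitespace.
        return " ".join(text.lower().split())
    return 1.0 if normalize(pred) == normalize(gold) else 0.0


def analysis_reward(phi, phi_star, answer, gold_answer, f_format,
                    alpha=0.3, beta=0.5, gamma=0.2):
    """r_ana = α·𝕀[φ=φ*] + β·EM(a, a*) + γ·f_format (weights are placeholders)."""
    indicator = 1.0 if phi == phi_star else 0.0
    return alpha * indicator + beta * exact_match(answer, gold_answer) + gamma * f_format
```

Because the indicator and EM terms are binary, the reward takes only a few distinct values per sample; the format term is the only component that can vary continuously.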

Plan Agent Frozen ❄️

Analysis Agent Frozen ❄️

Rewrite Agent Training

Query Optimizer

🎯

Retriever as Judge

BGE-M3 retriever evaluates query quality through NDCG@k scores

r_rew = ω₁·√NDCG@k + ω₂·f_format

Retrieval-Guided GRPO

Optimizes for retrieval effectiveness
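A sketch of the rewrite reward, assuming relevance labels are already available for the documents the retriever returned, in ranked order. The retriever call itself is omitted. Linear gain, computing the ideal ranking from the retrieved list only, and the values of ω₁ and ω₂ are all assumptions made for illustration.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k results (linear gain)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k):
    """NDCG@k, with the ideal ranking taken from the same list of labels."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


def rewrite_reward(relevances, k, f_format, w1=0.8, w2=0.2):
    """r_rew = ω1·√NDCG@k + ω2·f_format (weights are placeholders)."""
    return w1 * math.sqrt(ndcg_at_k(relevances, k)) + w2 * f_format
```

Since NDCG@k lies in [0, 1], the square root raises low and mid-range values, so a rewrite that brings even one relevant document into the top k earns a noticeably larger reward than one that retrieves nothing relevant.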
