# 🎭 OPERA Training Mechanism

*Progressive Multi-Agent Training with a Pre-scored Dataset and Role-Specific Evaluation*

πŸš€ DeepSeek R1 API
Generates and Scores Training Candidates
πŸ†
High-Score Samples
Score > 0.85
15%
Golden standards with perfect decomposition
πŸ₯ˆ
Medium-Score Samples
Score 0.5 - 0.85
70%
Valid plans with minor issues
πŸ“Š
Low-Score Samples
Score < 0.5
15%
Flawed plans for contrast learning
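As a concrete illustration, here is a minimal sketch of the tiering, assuming each scored sample is a dict with a `score` field; the helper name `partition_scored` is hypothetical, and only the 0.85 / 0.5 thresholds come from the scheme above.

```python
# Minimal sketch of the three-tier split; sample structure is an assumption.

def partition_scored(samples):
    """Split pre-scored samples into high / medium / low tiers."""
    tiers = {"high": [], "medium": [], "low": []}
    for s in samples:
        if s["score"] > 0.85:          # golden standards
            tiers["high"].append(s)
        elif s["score"] >= 0.5:        # valid plans with minor issues
            tiers["medium"].append(s)
        else:                          # flawed plans for contrast learning
            tiers["low"].append(s)
    return tiers

scored = [
    {"plan": "decompose into 3 sub-queries", "score": 0.91},
    {"plan": "single-hop lookup", "score": 0.62},
    {"plan": "circular decomposition", "score": 0.31},
]
print({tier: len(items) for tier, items in partition_scored(scored).items()})
# -> {'high': 1, 'medium': 1, 'low': 1}
```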
## High-Score Sample Selection Strategy

Each GRPO group has size 5: four policy-generated candidates (G1–G4) plus one high-score sample injected from 𝒟scored (⭐).
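A minimal sketch of group construction under this strategy; `policy_generate` (samples n candidates from the current policy) and `high_tier` (the >0.85 slice of 𝒟scored for this prompt) are hypothetical names.

```python
import random

def build_group(prompt, policy_generate, high_tier, group_size=5):
    candidates = policy_generate(prompt, n=group_size - 1)   # G1..G4
    injected = random.choice(high_tier)                      # ⭐ high-score sample
    return candidates + [injected]

# Toy usage with stand-in generators:
group = build_group(
    "Who directed the film that won Best Picture in 1998?",
    policy_generate=lambda p, n: [f"candidate {i}" for i in range(n)],
    high_tier=["golden decomposition"],
)
print(group)  # ['candidate 0', ..., 'golden decomposition']
```

Presumably, injecting one known-good sample into each group anchors the within-group comparison, so GRPO's relative advantages consistently point the policy toward the golden decomposition.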

## Progressive MAPGRPO Training Pipeline

The three agents are trained in sequence; after each stage, the newly trained agent is frozen before the next stage begins.
### Stage 1: Plan Agent Training (Strategic Decomposer 📋)

- **Evaluation:** offline, pre-scored. Uses the pre-scored dataset 𝒟scored, whose samples were already evaluated by end-to-end execution.
- **Reward:** r_plan = λ₁·f_logic + λ₂·f_struct + λ₃·f_exec
- **Method:** GRPO with high-score injection; the agent learns from pre-evaluated decompositions (reward sketch below).
- ❄️ The Plan Agent is frozen once this stage completes.
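A sketch of the Stage 1 reward under assumed λ weights; the scorer callables are placeholders, since this section does not define f_logic, f_struct, or f_exec concretely.

```python
# Illustrative computation of r_plan = λ₁·f_logic + λ₂·f_struct + λ₃·f_exec.

def plan_reward(plan, f_logic, f_struct, f_exec, lambdas=(0.4, 0.3, 0.3)):
    l1, l2, l3 = lambdas  # assumed weights, summing to 1
    return l1 * f_logic(plan) + l2 * f_struct(plan) + l3 * f_exec(plan)

# Example with toy scorers, each returning a value in [0, 1]:
r = plan_reward(
    plan="step1 -> step2 -> step3",
    f_logic=lambda p: 0.9,    # logical coherence of the decomposition
    f_struct=lambda p: 1.0,   # structural well-formedness
    f_exec=lambda p: 0.8,     # end-to-end execution success
)
print(round(r, 3))  # 0.9*0.4 + 1.0*0.3 + 0.8*0.3 = 0.9
```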
### Stage 2: Analysis-Answer Training (Information Extractor 🔍)

- **Evaluation:** execution-based scoring; answer correctness and evidence quality are judged against ground truth.
- **Reward:** r_ana = α·𝕀[φ = φ*] + β·EM(a, a*) + γ·f_format
- **Method:** standard GRPO, trained on plans produced by the frozen Plan Agent (reward sketch below).
- ❄️ Both the Plan Agent and the Analysis Agent are frozen once this stage completes.
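A sketch of the Stage 2 reward, reading φ as the agent's predicted analysis outcome and φ* as the reference; α, β, γ and the exact-match normalization are assumptions.

```python
# Illustrative computation of r_ana = α·𝕀[φ = φ*] + β·EM(a, a*) + γ·f_format.

def exact_match(pred, gold):
    """EM(a, a*): 1.0 iff the normalized answers are identical."""
    return float(pred.strip().lower() == gold.strip().lower())

def analysis_reward(phi, phi_star, answer, answer_star, f_format,
                    alpha=0.4, beta=0.4, gamma=0.2):  # assumed weights
    indicator = float(phi == phi_star)   # 𝕀[φ = φ*]
    return (alpha * indicator
            + beta * exact_match(answer, answer_star)
            + gamma * f_format)

print(analysis_reward(
    phi="sufficient_evidence", phi_star="sufficient_evidence",
    answer="Paris", answer_star="paris", f_format=1.0,
))  # 0.4 + 0.4 + 0.2 = 1.0
```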
### Stage 3: Rewrite Agent Training (Query Optimizer 🎯)

- **Evaluation:** retriever as judge; a BGE-M3 retriever scores query quality via NDCG@k.
- **Reward:** r_rew = ω₁·√(NDCG@k) + ω₂·f_format
- **Method:** retrieval-guided GRPO, optimizing directly for retrieval effectiveness (reward sketch below).
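Finally, a sketch of the Stage 3 reward with a standard NDCG@k implementation; the ω weights and k are assumptions, and in the real pipeline the relevance labels would come from BGE-M3 retrieval checked against gold evidence.

```python
import math

# Illustrative computation of r_rew = ω₁·√(NDCG@k) + ω₂·f_format.

def dcg(relevances):
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))

def ndcg_at_k(relevances, k):
    ideal = dcg(sorted(relevances, reverse=True)[:k])
    return dcg(relevances[:k]) / ideal if ideal > 0 else 0.0

def rewrite_reward(relevances, f_format, k=10, w1=0.8, w2=0.2):
    return w1 * math.sqrt(ndcg_at_k(relevances, k)) + w2 * f_format

# Gold document retrieved at rank 2 out of 10:
print(round(rewrite_reward([0, 1, 0, 0, 0, 0, 0, 0, 0, 0], f_format=1.0), 3))
# ≈ 0.835
```

The square root over NDCG@k presumably flattens the reward curve, so even partial retrieval improvements earn a meaningful reward signal rather than being crushed near zero.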