🎭 OPERA Training Mechanism

Progressive Multi-Agent Training with Pre-scored Dataset and Role-Specific Evaluation

🚀 DeepSeek R1: Offline Gold Generation
Multiple candidate decompositions → best one selected as gold
🥇
Gold (c_best)
Best-of-N from R1
Reference Pool
Offline reference decompositions. For each question, R1 produces multiple candidates; the one with the highest end-to-end execution score is kept as c_best. Together, these scored references form the pre-scored dataset 𝒟_scored.
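The best-of-N selection described above can be sketched as follows. This is a minimal illustration, not the actual OPERA code: the `Candidate` type, the `select_gold` and `build_reference_pool` names, and the assumption that execution scores are already computed are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    decomposition: str  # one plan decomposition produced offline
    exec_score: float   # end-to-end execution score (assumed precomputed)


def select_gold(candidates: list[Candidate]) -> Candidate:
    """Return c_best: the candidate with the highest execution score."""
    if not candidates:
        raise ValueError("need at least one candidate per question")
    return max(candidates, key=lambda c: c.exec_score)


def build_reference_pool(pool: dict[str, list[Candidate]]) -> dict[str, Candidate]:
    """Map each question to its gold reference (hypothetical layout of the pool)."""
    return {question: select_gold(cands) for question, cands in pool.items()}
```

Ties are resolved in favor of the earliest candidate, because `max` returns the first maximal element it encounters.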
🥈
Silver (online)
On-policy rollouts
Per-batch
Produced by the model currently being trained. At each step the policy generates G−1 candidates that form the bulk of the GRPO group.

Gold injection is reserved for the Plan Agent only. Plan decomposition is open-ended and lacks a unique ground-truth answer, so a periodic anchor stabilizes training. The Analysis-Answer and Rewrite agents already have direct supervision (EM against ground-truth answers, NDCG@k against golden documents) and therefore train with pure on-policy MAPGRPO.

Periodic Gold-Anchor Injection (Plan Agent only)
[Diagram: 🥈 Silver, 🥈 Silver, 🥈 Silver, ⋯ (on-policy) + 🥇 Gold (c_best)]

At a refresh point, each GRPO group consists of multiple on-policy silver rollouts plus one offline gold reference.
The gold sample is not injected every step; it is refreshed at specific reference-refresh points during Plan-Agent training to re-anchor the group baseline. Between refresh points the policy continues with pure on-policy rollouts. Policy-gradient updates are applied only to silver candidates; the gold sample serves as a baseline reference for advantage normalization.
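One way to read the anchoring rule is sketched below, under stated assumptions. The gold reward enters the group mean and standard deviation only at refresh steps, and advantages are returned for silver rollouts only, so the gold sample never receives a gradient. The function names, the `refresh_steps` set, and the mean/std normalization (the usual GRPO form) are illustrative choices, not confirmed details of MAPGRPO.

```python
import statistics
from typing import Optional


def group_advantages(silver_rewards: list[float],
                     gold_reward: Optional[float] = None,
                     eps: float = 1e-8) -> list[float]:
    """Group-normalized advantages for the silver rollouts only.

    When gold_reward is given, it contributes to the baseline statistics
    but gets no advantage of its own (no policy-gradient update).
    """
    baseline = list(silver_rewards)
    if gold_reward is not None:
        baseline.append(gold_reward)
    mean = statistics.fmean(baseline)
    std = statistics.pstdev(baseline)
    return [(r - mean) / (std + eps) for r in silver_rewards]


def advantages_for_step(step: int,
                        silver_rewards: list[float],
                        gold_reward: float,
                        refresh_steps: set[int]) -> list[float]:
    """Inject the gold anchor only at the (hypothetical) refresh steps."""
    anchor = gold_reward if step in refresh_steps else None
    return group_advantages(silver_rewards, anchor)
```

With a high-scoring gold sample in the baseline, the group mean rises, so silver rollouts must score closer to the gold reference to earn a positive advantage.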

Progressive MAPGRPO Training Pipeline
1
Plan Agent Training
Strategic Decomposer
📋
Offline Pre-scored Evaluation
Uses the pre-scored dataset 𝒟_scored, whose samples have already been evaluated by end-to-end execution
$r_{\mathrm{plan}} = \lambda_1 \cdot f_{\mathrm{logic}} + \lambda_2 \cdot f_{\mathrm{struct}} + \lambda_3 \cdot f_{\mathrm{exec}}$
MAPGRPO + Periodic Gold Anchor
Plan decomposition has no unique ground truth, so c_best is reintroduced at refresh points to stabilize the group baseline.
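The plan reward is a weighted sum of three component scores. The sketch below assumes each component is already a number (for example, looked up from the pre-scored dataset); the `PlanWeights` type is hypothetical and the weight values are left to the caller, since none are stated here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlanWeights:
    lambda_1: float  # weight on the logic score
    lambda_2: float  # weight on the structure score
    lambda_3: float  # weight on the execution score


def plan_reward(f_logic: float, f_struct: float, f_exec: float,
                w: PlanWeights) -> float:
    """r_plan = lambda_1 * f_logic + lambda_2 * f_struct + lambda_3 * f_exec."""
    return w.lambda_1 * f_logic + w.lambda_2 * f_struct + w.lambda_3 * f_exec
```

For instance, with illustrative weights (0.5, 0.25, 0.25) and component scores (1.0, 0.0, 0.5), the reward is 0.5 + 0 + 0.125 = 0.625.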
Plan Agent Frozen ❄️
2
Analysis-Answer Training
Information Extractor
πŸ”
Execution-Based Scoring
Evaluates answer correctness and evidence quality using ground truth
$r_{\mathrm{ana}} = \alpha \cdot \mathbb{I}[\phi = \phi^*] + \beta \cdot \mathrm{EM}(a, a^*) + \gamma \cdot f_{\mathrm{format}}$
Pure On-Policy MAPGRPO
EM against ground-truth answers provides direct supervision, so no gold injection is needed.
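A minimal sketch of the analysis reward follows. Two assumptions are made: exact match uses the common SQuAD-style normalization (lowercasing, stripping punctuation and English articles), which may differ from the normalization actually used, and the label φ is treated as an opaque value compared for equality. The weights and the format score are supplied by the caller.

```python
import re
import string


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace (assumed scheme)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(pred: str, gold: str) -> float:
    """EM(a, a*): 1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(pred) == normalize(gold))


def analysis_reward(phi, phi_star, answer: str, answer_star: str,
                    f_format: float,
                    alpha: float, beta: float, gamma: float) -> float:
    """r_ana = alpha * I[phi = phi*] + beta * EM(a, a*) + gamma * f_format."""
    indicator = 1.0 if phi == phi_star else 0.0
    return alpha * indicator + beta * exact_match(answer, answer_star) + gamma * f_format
```

Because every term is computed against ground truth, the reward is defined for any rollout without needing a reference sample in the group.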
Plan Agent Frozen ❄️
Analysis Agent Frozen ❄️
3
Rewrite Agent Training
Query Optimizer
🎯
Retriever as Judge
BGE-M3 retriever evaluates query quality through NDCG@k scores
$r_{\mathrm{rew}} = \omega_1 \cdot \sqrt{\mathrm{NDCG@}k} + \omega_2 \cdot f_{\mathrm{format}}$
Pure On-Policy MAPGRPO
NDCG@k against golden documents provides direct retrieval supervision, so no gold injection is needed.
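The rewrite reward can be sketched as below, assuming binary relevance (a retrieved document counts only if it is in the golden set) and a ranked list of unique document IDs returned by the retriever. The retriever call itself is out of scope; the function names and weights are illustrative.

```python
import math


def dcg_at_k(gains: list[float], k: int) -> float:
    """Discounted cumulative gain over the top-k positions (log2 discount)."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains[:k]))


def ndcg_at_k(retrieved_ids: list[str], golden_ids: set[str], k: int) -> float:
    """NDCG@k with binary relevance; returns 0.0 if there are no golden documents."""
    gains = [1.0 if doc_id in golden_ids else 0.0 for doc_id in retrieved_ids]
    ideal = [1.0] * min(len(golden_ids), k)
    idcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / idcg if idcg > 0 else 0.0


def rewrite_reward(retrieved_ids: list[str], golden_ids: set[str], k: int,
                   f_format: float, omega_1: float, omega_2: float) -> float:
    """r_rew = omega_1 * sqrt(NDCG@k) + omega_2 * f_format."""
    return omega_1 * math.sqrt(ndcg_at_k(retrieved_ids, golden_ids, k)) + omega_2 * f_format
```

Since NDCG@k lies in [0, 1], the square root raises low scores more than high ones, so partially successful rewrites still receive a noticeable reward.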