OPERA Training Mechanism
Progressive Multi-Agent Training with Pre-scored Dataset and Role-Specific Evaluation
DeepSeek R1: Offline Gold Generation
Multiple candidate decompositions → best one selected as gold
🥇 Gold (c_best)
Best-of-N from R1
Reference Pool
Offline reference decompositions. For each question, R1 produces multiple candidates; the one with the highest end-to-end execution score is kept as c_best. Together these form the pre-scored dataset 𝒟_scored.
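The best-of-N selection described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `Candidate` record, the `select_gold` helper, and the example scores are all hypothetical, and the end-to-end execution score is assumed to be precomputed.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    question_id: str
    decomposition: list[str]  # ordered sub-questions proposed by the generator
    exec_score: float         # end-to-end execution score (assumed precomputed)


def select_gold(candidates: list[Candidate]) -> dict[str, Candidate]:
    """Keep the highest-scoring candidate per question as c_best."""
    gold: dict[str, Candidate] = {}
    for cand in candidates:
        best = gold.get(cand.question_id)
        if best is None or cand.exec_score > best.exec_score:
            gold[cand.question_id] = cand
    return gold


# Made-up pool: two candidates for q1, one for q2.
pool = [
    Candidate("q1", ["Who directed X?", "Where was that person born?"], 0.9),
    Candidate("q1", ["Where was the director of X born?"], 0.4),
    Candidate("q2", ["What is Y?"], 0.7),
]
gold = select_gold(pool)
```

Ties are resolved in favor of the first candidate seen; the source does not say how ties are handled.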
🥈 Silver (online)
On-policy rollouts
Per-batch
Produced by the model currently being trained. At each step the policy generates G−1 candidates that form the bulk of the GRPO group.
Gold injection is reserved for the Plan Agent only. Plan decomposition is open-ended and lacks a unique ground-truth answer, so a periodic anchor stabilizes training. The Analysis-Answer and Rewrite agents already have direct supervision (EM against ground-truth answers, NDCG@k against golden documents) and therefore train with pure on-policy MAPGRPO.
Periodic Gold-Anchor Injection (Plan Agent only)
[🥈 Silver, 🥈 Silver, 🥈 Silver, ⋯ on-policy] + [🥇 Gold c_best]
Each GRPO group is composed of multiple on-policy silver rollouts plus one offline gold reference.
The gold sample is not injected every step; it is refreshed at specific reference-refresh points during Plan-Agent training to re-anchor the group baseline. Between refresh points the policy continues with pure on-policy rollouts. Policy-gradient updates are applied only to silver candidates; the gold sample serves as a baseline reference for advantage normalization.
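A minimal sketch of how a gold anchor could enter the group baseline without receiving a gradient. It assumes the usual GRPO normalization, (reward − group mean) / group standard deviation; the function name `group_advantages`, the `eps` constant, and the reward values are illustrative assumptions, not taken from the source.

```python
from statistics import mean, pstdev


def group_advantages(silver_rewards, gold_reward=None, eps=1e-8):
    """Return one normalized advantage per silver rollout.

    On a refresh step, gold_reward joins the group statistics but gets
    no advantage of its own, so it would contribute no policy gradient.
    """
    group = list(silver_rewards)
    if gold_reward is not None:
        group.append(gold_reward)  # gold shifts the baseline only
    mu = mean(group)
    sigma = pstdev(group)
    return [(r - mu) / (sigma + eps) for r in silver_rewards]


silver = [0.2, 0.4, 0.6]  # rewards of on-policy rollouts (made-up values)
plain = group_advantages(silver)                      # ordinary step
anchored = group_advantages(silver, gold_reward=1.0)  # refresh step
```

With a high-scoring gold sample in the group, the mean rises, so the silver advantages no longer sum to zero and become negative on net. Rollouts are then judged against the stronger reference rather than only against each other.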
Progressive MAPGRPO Training Pipeline
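Training order: Plan Agent → Analysis-Answer Agent → Rewrite Agent; agents trained in earlier stages are frozen.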
Offline Pre-scored Evaluation
Uses the pre-scored dataset 𝒟_scored, whose samples have already been evaluated by end-to-end execution
$r_{\text{plan}} = \lambda_1 \cdot f_{\text{logic}} + \lambda_2 \cdot f_{\text{struct}} + \lambda_3 \cdot f_{\text{exec}}$
MAPGRPO + Periodic Gold Anchor
Plan decomposition has no unique ground truth, so c_best is reintroduced at refresh points to stabilize the group baseline.
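As a worked example, the plan reward is a weighted sum of three component scores. The sketch below assumes each component lies in [0, 1]. The weight values are arbitrary placeholders chosen for illustration; the source does not state the actual weights.

```python
def plan_reward(f_logic, f_struct, f_exec, weights=(0.3, 0.2, 0.5)):
    """Weighted sum of logic, structure, and execution scores.

    The default weights are placeholders, not values from the source.
    """
    lam1, lam2, lam3 = weights
    return lam1 * f_logic + lam2 * f_struct + lam3 * f_exec


# A plan that is logically sound, half well-formed, and fails execution.
r = plan_reward(f_logic=1.0, f_struct=0.5, f_exec=0.0)
```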
Plan Agent: Frozen
Execution-Based Scoring
Evaluates answer correctness and evidence quality using ground truth
$r_{\text{ana}} = \alpha \cdot \mathbb{1}[\sigma = \sigma^*] + \beta \cdot \mathrm{EM}(a, a^*) + \gamma \cdot f_{\text{format}}$
Pure On-Policy MAPGRPO
EM against ground-truth answers provides direct supervision, so no gold injection is needed.
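A sketch of how such an analysis reward might be computed. Exact match is implemented with the common QA-style normalization (lowercasing, removing punctuation and articles), which is an assumption; the source does not specify it. The indicator term is modeled as an equality check between a predicted and a reference discrete judgment (`pred_flag`, `gold_flag`, both hypothetical names), and the weight values are placeholders.

```python
import re
import string


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse spaces."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(pred: str, gold: str) -> float:
    return float(normalize(pred) == normalize(gold))


def analysis_reward(pred_flag, gold_flag, pred_answer, gold_answer, format_ok,
                    alpha=0.3, beta=0.6, gamma=0.1):
    """Indicator term + exact match + format score (placeholder weights)."""
    indicator = float(pred_flag == gold_flag)
    return (alpha * indicator
            + beta * exact_match(pred_answer, gold_answer)
            + gamma * float(format_ok))


em = exact_match("The Eiffel Tower.", "eiffel tower")
r_full = analysis_reward(True, True, "Paris", "paris", format_ok=True)
r_wrong = analysis_reward(True, True, "Lyon", "paris", format_ok=True)
```

Because every term is computed directly against ground truth, the reward needs no reference decomposition, which is consistent with training this stage purely on-policy.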
Plan Agent: Frozen
Analysis Agent: Frozen
The BGE-M3 retriever evaluates query quality through NDCG@k scores
$r_{\text{rew}} = \omega_1 \cdot \mathrm{NDCG@}k + \omega_2 \cdot f_{\text{format}}$
Pure On-Policy MAPGRPO
NDCG@k against golden documents provides direct retrieval supervision, so no gold injection is needed.
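For reference, NDCG@k with binary relevance can be computed as below. The sketch assumes the retriever returns a ranked list of unique document IDs for the rewritten query and that golden documents are given as a set of IDs. The retriever call itself is omitted, and the function names and weights are illustrative assumptions.

```python
import math


def ndcg_at_k(ranked_doc_ids, golden_doc_ids, k):
    """Binary-relevance NDCG@k: DCG of the ranking over the ideal DCG."""
    golden = set(golden_doc_ids)
    dcg = sum(1.0 / math.log2(rank + 2)
              for rank, doc_id in enumerate(ranked_doc_ids[:k])
              if doc_id in golden)
    ideal_hits = min(len(golden), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


def rewrite_reward(ranked_doc_ids, golden_doc_ids, format_ok,
                   k=5, w1=0.8, w2=0.2):
    """Weighted NDCG@k plus format score (placeholder weights)."""
    return w1 * ndcg_at_k(ranked_doc_ids, golden_doc_ids, k) + w2 * float(format_ok)


top_hit = ndcg_at_k(["d1", "d2", "d3"], ["d1"], k=3)   # golden doc ranked first
second_hit = ndcg_at_k(["d2", "d1"], ["d1"], k=2)      # golden doc ranked second
```

A query that moves the golden document from rank 2 to rank 1 raises NDCG@k, so the reward responds directly to retrieval quality.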