OPERA Training Mechanism
Progressive Multi-Agent Training with a Pre-scored Dataset and Role-Specific Evaluation
DeepSeek R1 API
Generates and scores training candidates, producing the pre-scored dataset 𝒟_scored. The scored samples fall into three tiers:
- High-score samples (score > 0.85, 15%): golden standards with perfect decomposition.
- Medium-score samples (score 0.5–0.85, 70%): valid plans with minor issues.
- Low-score samples (score < 0.5, 15%): flawed plans for contrastive learning.
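To make the split concrete, here is a minimal Python sketch of the partition. The `ScoredPlan` record and function name are hypothetical; only the thresholds and tier proportions come from the text above.

```python
from dataclasses import dataclass

@dataclass
class ScoredPlan:
    question: str
    plan: str
    score: float  # end-to-end execution score in [0, 1]

def partition_by_score(samples: list[ScoredPlan]) -> dict[str, list[ScoredPlan]]:
    """Split D_scored into the three tiers described above."""
    tiers: dict[str, list[ScoredPlan]] = {"high": [], "medium": [], "low": []}
    for s in samples:
        if s.score > 0.85:
            tiers["high"].append(s)    # golden standards (~15%)
        elif s.score >= 0.5:
            tiers["medium"].append(s)  # valid plans with minor issues (~70%)
        else:
            tiers["low"].append(s)     # flawed plans for contrastive learning (~15%)
    return tiers
```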
High-Score Sample Selection Strategy
Each training group mixes policy rollouts with one injected golden sample. Group size = 5: four policy-generated candidates (G1–G4) plus one high-score sample drawn from 𝒟_scored.
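A sketch of this group assembly; `build_grpo_group` and its argument shapes are hypothetical, while the group size and composition are the ones stated above.

```python
import random

def build_grpo_group(policy_candidates: list[str],
                     high_tier: list[str],
                     group_size: int = 5) -> list[str]:
    """Assemble one GRPO group: (group_size - 1) policy rollouts (G1..G4)
    plus one high-score sample injected from D_scored."""
    rollouts = policy_candidates[: group_size - 1]
    assert len(rollouts) == group_size - 1, "need enough policy rollouts"
    golden = random.choice(high_tier)  # injected high-score sample
    return rollouts + [golden]
```

Within GRPO's group-relative advantage, the injected sample presumably serves as a reliable high-reward anchor even when every policy rollout in the group is weak.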
Progressive MAPGRPO Training Pipeline
The three agents are trained in sequence; each agent is frozen ❄️ once its stage completes, so later stages build on fixed earlier ones.
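One way the schedule could be wired up, as a sketch only: it assumes PyTorch-style agent modules and a caller-supplied `train_stage` routine, neither of which is specified by the source.

```python
from typing import Callable
import torch.nn as nn

def freeze(agent: nn.Module) -> None:
    """Stop gradients through a finished agent and put it in eval mode."""
    for p in agent.parameters():
        p.requires_grad = False
    agent.eval()

def progressive_mapgrpo(plan_agent: nn.Module,
                        analysis_agent: nn.Module,
                        rewrite_agent: nn.Module,
                        train_stage: Callable[[nn.Module, str], None]) -> None:
    """Train one agent per stage; earlier agents stay frozen."""
    train_stage(plan_agent, "grpo_with_high_score_injection")
    freeze(plan_agent)        # Plan Agent frozen after Stage 1
    train_stage(analysis_agent, "standard_grpo")
    freeze(analysis_agent)    # Analysis Agent frozen after Stage 2
    train_stage(rewrite_agent, "retrieval_guided_grpo")
```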
Stage 1: Plan Agent
Offline pre-scored evaluation: uses the pre-scored dataset 𝒟_scored, whose samples were already evaluated by end-to-end execution.
Reward: r_plan = λ₁·f_logic + λ₂·f_struct + λ₃·f_exec
Training: GRPO with high-score injection, learning from pre-evaluated decompositions.
After this stage the Plan Agent is frozen ❄️.
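Read as code, the plan reward is a weighted sum of the three component scores. The λ values below are placeholders, since the source does not give them.

```python
def plan_reward(f_logic: float, f_struct: float, f_exec: float,
                lam: tuple[float, float, float] = (0.4, 0.3, 0.3)) -> float:
    """r_plan = λ1·f_logic + λ2·f_struct + λ3·f_exec (λ values assumed)."""
    l1, l2, l3 = lam
    return l1 * f_logic + l2 * f_struct + l3 * f_exec
```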
Stage 2: Analysis Agent
Execution-based scoring: evaluates answer correctness and evidence quality against ground truth.
Reward: r_ana = α·𝕀[τ = τ*] + β·EM(a, a*) + γ·f_format
Training: standard GRPO on plans produced by the frozen Plan Agent.
After this stage the Analysis Agent is frozen ❄️ as well.
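A sketch of the analysis reward. It reads the indicator 𝕀[τ = τ*] as "the predicted target matches its reference" (the source does not define τ here), and the α, β, γ weights are placeholders.

```python
def exact_match(answer: str, gold: str) -> float:
    """EM(a, a*): 1.0 iff the normalized strings agree."""
    return float(answer.strip().lower() == gold.strip().lower())

def analysis_reward(pred_tau: str, gold_tau: str,
                    answer: str, gold_answer: str,
                    format_ok: bool,
                    alpha: float = 0.4, beta: float = 0.4,
                    gamma: float = 0.2) -> float:
    """r_ana = α·1[τ = τ*] + β·EM(a, a*) + γ·f_format."""
    indicator = float(pred_tau == gold_tau)  # 1[τ = τ*]; τ's exact meaning is assumed
    return (alpha * indicator
            + beta * exact_match(answer, gold_answer)
            + gamma * float(format_ok))
```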
Stage 3: Rewrite Agent
Retrieval-based scoring: the BGE-M3 retriever evaluates rewritten-query quality through NDCG@k scores.
Reward: r_rew = ω₁·ΔNDCG@k + ω₂·f_format
Training: retrieval-guided GRPO, optimizing queries for retrieval effectiveness.
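Since ΔNDCG@k is the gain of the rewritten query over the original under the retriever, the reward can be sketched as below. `ndcg_at_k` is a standard textbook implementation, and the ω weights are placeholders not given in the source.

```python
import math

def ndcg_at_k(relevances: list[float], k: int) -> float:
    """NDCG@k for a ranked list of graded relevance scores."""
    def dcg(rels: list[float]) -> float:
        return sum(r / math.log2(i + 2) for i, r in enumerate(rels[:k]))
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

def rewrite_reward(rels_original: list[float], rels_rewritten: list[float],
                   format_ok: bool, k: int = 10,
                   w1: float = 0.8, w2: float = 0.2) -> float:
    """r_rew = ω1·ΔNDCG@k + ω2·f_format, where ΔNDCG@k is the NDCG gain
    of the rewritten query over the original (weights assumed)."""
    delta = ndcg_at_k(rels_rewritten, k) - ndcg_at_k(rels_original, k)
    return w1 * delta + w2 * float(format_ok)
```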