Fast track to the DeepSeek-R1 technical report
Summary of DeepSeek-R1 Training Approach
Core Methodology
- Two-Stage Model Development:
DeepSeek-R1-Zero:
- Trained purely via reinforcement learning (RL) without supervised fine-tuning (SFT).
- Uses GRPO (Group Relative Policy Optimization) to reduce computational costs by eliminating the need for a critic model.
- Relies on rule-based rewards:
- Accuracy rewards: Verify correctness (e.g., math answers via deterministic rules).
- Format rewards: Enforce structured outputs (e.g., <think> tags for the reasoning process and <answer> tags for the final result); see the reward/GRPO sketch below.
- Demonstrates self-evolution: Automatically develops advanced reasoning behaviors (e.g., reflection, long chain-of-thought) through RL.
- Achieves 71.0% pass@1 on AIME 2024, rivaling OpenAI-o1-0912.
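The reward and advantage computation described above can be illustrated with a short sketch. This is not the paper's code: the regular expressions, the exact-match accuracy check, and summing the two rewards are illustrative assumptions; only the ideas (format enforced via <think>/<answer> tags, rule-based accuracy checks, and GRPO's critic-free group-normalized advantages) come from the report.

```python
import re
from statistics import mean, pstdev

def format_reward(output: str) -> float:
    """1.0 if reasoning sits inside <think>...</think> followed by <answer>...</answer>."""
    pattern = r"<think>.*?</think>\s*<answer>.*?</answer>"
    return 1.0 if re.search(pattern, output, re.DOTALL) else 0.0

def accuracy_reward(output: str, reference: str) -> float:
    """1.0 if the content of the <answer> block matches the reference (toy exact match)."""
    m = re.search(r"<answer>(.*?)</answer>", output, re.DOTALL)
    return 1.0 if m and m.group(1).strip() == reference.strip() else 0.0

def grpo_advantages(rewards: list[float]) -> list[float]:
    """Group-relative advantages: normalize each sampled output's reward by the group
    mean and standard deviation, which removes the need for a learned critic model."""
    mu, sigma = mean(rewards), pstdev(rewards) or 1.0
    return [(r - mu) / sigma for r in rewards]

# Toy group of outputs sampled for one prompt whose reference answer is "42".
group = [
    "<think>21 * 2 = 42</think><answer>42</answer>",
    "<think>I will guess.</think><answer>41</answer>",
    "missing tags, answer 42",
]
rewards = [format_reward(o) + accuracy_reward(o, "42") for o in group]
print(rewards)                   # [2.0, 1.0, 0.0]
print(grpo_advantages(rewards))  # roughly [1.22, 0.0, -1.22]
```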
DeepSeek-R1:
- Builds on R1-Zero by adding cold-start data and multi-stage training:
- Cold-Start SFT: Fine-tunes the base model (DeepSeek-V3) with thousands of high-quality, human-readable reasoning examples.
- Reasoning-Oriented RL: Applies RL to refine reasoning capabilities while introducing a language consistency reward to mitigate mixed-language outputs (see the sketch after this list).
- Rejection Sampling & SFT: Generates diverse SFT data (reasoning and non-reasoning tasks) from RL checkpoints via rejection sampling, then retrains the model (see the sketch after this list).
- Final RL Alignment: Aligns the model with human preferences (helpfulness, harmlessness) across all scenarios.
- Matches OpenAI-o1-1217 on reasoning benchmarks (e.g., 97.3% pass@1 on MATH-500).
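The language consistency reward mentioned above is described in the report as the proportion of target-language words in the chain-of-thought, summed with the task reward during reasoning-oriented RL. The sketch below uses a crude ASCII check as a stand-in for real language identification (an assumption, not the paper's implementation).

```python
def language_consistency_reward(cot: str) -> float:
    """Rough proxy for language consistency when the target language is English:
    the fraction of whitespace-separated tokens that are pure ASCII. A real
    implementation would run a proper language identifier over the chain-of-thought."""
    tokens = cot.split()
    if not tokens:
        return 0.0
    return sum(t.isascii() for t in tokens) / len(tokens)

# A chain-of-thought that drifts into another language is penalized:
print(language_consistency_reward("First compute 3 * 7, 然后 add 1, giving 22."))  # 0.9
```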
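Rejection sampling for the second SFT round can be pictured as: sample several completions per prompt from the RL checkpoint, keep only those a checker accepts (correct and readable), and reuse the survivors as SFT data. The generate and accept callables below are placeholders for the RL checkpoint and the filtering rules, not the paper's implementation.

```python
from typing import Callable, Iterable

def rejection_sample_sft_data(
    prompts: Iterable[str],
    generate: Callable[[str, int], list[str]],  # RL checkpoint: (prompt, n) -> n candidate responses
    accept: Callable[[str, str], bool],         # correctness/readability check on (prompt, response)
    n_candidates: int = 8,
    keep_per_prompt: int = 1,
) -> list[tuple[str, str]]:
    """Keep only accepted (prompt, response) pairs to form the next SFT dataset."""
    dataset: list[tuple[str, str]] = []
    for prompt in prompts:
        accepted = [r for r in generate(prompt, n_candidates) if accept(prompt, r)]
        dataset.extend((prompt, r) for r in accepted[:keep_per_prompt])
    return dataset
```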
Key Innovations
Pure RL for Reasoning:
- Proves reasoning capabilities can emerge without SFT, relying solely on RL incentives.
- Enables autonomous discovery of strategies like self-verification and multi-step reasoning.
Cold-Start Data Design:
- Addresses readability and language mixing by structuring outputs with |special_token| tags and readable summaries (see the template sketch below).
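A minimal sketch of that cold-start template, assuming the literal delimiter |special_token| separates the reasoning process from the trailing summary (the exact delimiter used in training is an internal detail, so treat it as a placeholder):

```python
DELIM = "|special_token|"

def format_cold_start_example(reasoning: str, summary: str) -> str:
    """Compose a cold-start training target: delimited reasoning followed by a readable summary."""
    return f"{DELIM}{reasoning}{DELIM}{summary}"

def split_cold_start_example(text: str) -> tuple[str, str]:
    """Recover (reasoning, summary) from a formatted example."""
    _, reasoning, summary = text.split(DELIM, maxsplit=2)
    return reasoning.strip(), summary.strip()

example = format_cold_start_example(
    reasoning="Let x be the unknown quantity... therefore x = 4.",
    summary="The answer is 4.",
)
print(split_cold_start_example(example))
```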
Distillation to Smaller Models:
- Distills knowledge from DeepSeek-R1 into 1.5B–70B parameter models (Qwen/Llama-based) via supervised fine-tuning on R1-generated samples (see the sketch after this list).
- Achieves competitive performance (e.g., 72.6% pass@1 on AIME 2024 for 32B model) without RL.
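Distillation here is plain supervised fine-tuning of a smaller model on samples generated by DeepSeek-R1 (roughly 800k per the report), with no RL stage. The loop below is a schematic sketch using Hugging Face transformers; the model name, the single toy sample, and the hyperparameters are placeholders, not the paper's setup.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

student_name = "gpt2"  # stand-in for a Qwen/Llama student checkpoint
tok = AutoTokenizer.from_pretrained(student_name)
student = AutoModelForCausalLM.from_pretrained(student_name)
optimizer = torch.optim.AdamW(student.parameters(), lr=1e-5)

# In practice: ~800k reasoning and non-reasoning samples curated with DeepSeek-R1.
teacher_samples = [
    ("Question: what is 2 + 2?\n", "<think>2 + 2 = 4</think><answer>4</answer>"),
]

student.train()
for prompt, response in teacher_samples:
    batch = tok(prompt + response, return_tensors="pt")
    # Standard causal-LM loss on the concatenated text; a real SFT pipeline would
    # usually mask the prompt tokens (label -100) so only the response is learned.
    out = student(**batch, labels=batch["input_ids"])
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```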
Performance Highlights
- DeepSeek-R1:
- Codeforces: Outperforms 96.3% of human competitors (2,029 Elo rating).
- MMLU: 90.8% accuracy, surpassing GPT-4o and Claude-3.5.
- SWE-bench: Resolves 49.2% of software engineering tasks.
- Distilled Models:
- 7B model surpasses GPT-4o on math tasks (55.5% pass@1 on AIME 2024).
- 32B model outperforms QwQ-32B-Preview by 22.6 percentage points on AIME 2024.
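For reference, the pass@1 numbers quoted throughout are computed by sampling k responses per problem (the report uses a non-zero temperature) and averaging per-response correctness; the exact-match checker below is a placeholder.

```python
from typing import Callable, Sequence

def pass_at_1(samples_per_problem: Sequence[Sequence[str]],
              is_correct: Callable[[int, str], bool]) -> float:
    """pass@1 = mean over problems of the fraction of sampled responses that are correct."""
    per_problem = [
        sum(is_correct(i, s) for s in samples) / len(samples)
        for i, samples in enumerate(samples_per_problem)
    ]
    return sum(per_problem) / len(per_problem)

# Toy usage: 2 problems, k = 4 samples each, graded by exact match.
references = ["4", "9"]
samples = [["4", "4", "5", "4"], ["9", "8", "9", "9"]]
print(pass_at_1(samples, lambda i, s: s == references[i]))  # 0.75
```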
Challenges & Solutions
- Readability/Language Mixing: Addressed via cold-start data and language consistency rewards.
- Reward Hacking: Mitigated by using rule-based rewards instead of a neural reward model.
- Unsuccessful Attempts:
- Process Reward Models (PRM) were prone to reward hacking and costly annotation; Monte Carlo Tree Search (MCTS) struggled to scale due to the exponentially large token-level search space.
Conclusion
DeepSeek-R1 advances LLM reasoning through RL-driven self-evolution and iterative alignment, while distillation democratizes high-performance reasoning for smaller models. The approach emphasizes minimal supervised data and structured reward design, setting a new benchmark for open-source reasoning models.