Fast Track to the DeepSeek-R1 Technical Report

Summary of DeepSeek-R1 Training Approach

Core Methodology

  1. Two-Stage Model Development:
    • DeepSeek-R1-Zero:

      • Trained purely via reinforcement learning (RL) without supervised fine-tuning (SFT).
      • Uses GRPO (Group Relative Policy Optimization) to reduce computational costs by eliminating the need for a critic model (see the sketch after this list).
      • Relies on rule-based rewards:
        • Accuracy rewards: Verify correctness (e.g., math answers via deterministic rules).
        • Format rewards: Enforce structured outputs (e.g., <think> for reasoning, <answer> for final results).
      • Demonstrates self-evolution: Automatically develops advanced reasoning behaviors (e.g., reflection, long chain-of-thought) through RL.
      • Achieves 71.0% pass@1 on AIME 2024, rivaling OpenAI-o1-0912.
    • DeepSeek-R1:

      • Builds on R1-Zero by adding cold-start data and multi-stage training:
        1. Cold-Start SFT: Fine-tunes the DeepSeek-V3-Base model with thousands of high-quality, human-readable reasoning examples.
        2. Reasoning-Oriented RL: Applies RL to refine reasoning capabilities while introducing a language consistency reward to mitigate mixed-language outputs.
        3. Rejection Sampling & SFT: Generates diverse SFT data (reasoning and non-reasoning tasks) from RL checkpoints, then retrains the model.
        4. Final RL Alignment: Aligns the model with human preferences (helpfulness, harmlessness) across all scenarios.
      • Matches OpenAI-o1-1217 on reasoning benchmarks (e.g., 97.3% pass@1 on MATH-500).
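
As noted in the GRPO item above, here is a minimal sketch of how rule-based rewards and group-relative advantages fit together; the function names, the exact-match answer check, and the equal reward weighting are illustrative assumptions, not code from the report:

```python
import re
from statistics import mean, pstdev

def format_reward(output: str) -> float:
    """Rule-based format check: reasoning inside <think> tags, answer inside <answer> tags."""
    ok = re.search(r"<think>.*</think>\s*<answer>.*</answer>", output, re.DOTALL)
    return 1.0 if ok else 0.0

def accuracy_reward(output: str, gold_answer: str) -> float:
    """Rule-based accuracy check: exact string match on the extracted final answer."""
    m = re.search(r"<answer>(.*?)</answer>", output, re.DOTALL)
    return 1.0 if m and m.group(1).strip() == gold_answer.strip() else 0.0

def grpo_advantages(outputs: list[str], gold_answer: str) -> list[float]:
    """Score a group of samples for one prompt with rule-based rewards, then
    normalize rewards within the group; this replaces a learned critic."""
    rewards = [accuracy_reward(o, gold_answer) + format_reward(o) for o in outputs]
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + 1e-6) for r in rewards]
```

Each normalized advantage then weights the clipped policy-ratio term for its sample, so no separate value network is trained.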

Key Innovations

  1. Pure RL for Reasoning:

    • Shows that reasoning capabilities can emerge without any SFT, relying solely on RL incentives.
    • Enables autonomous discovery of strategies like self-verification and multi-step reasoning.
  2. Cold-Start Data Design:

    • Addresses readability and language mixing by wrapping the reasoning process in |special_token| delimiters and appending a concise summary (illustrated in the sketch after this list).
  3. Distillation to Smaller Models:

    • Distills knowledge from DeepSeek-R1 into 1.5B–70B parameter models (Qwen/Llama-based) via SFT on ~800K R1-curated samples.
    • Achieves competitive performance (e.g., 72.6% pass@1 on AIME 2024 for 32B model) without RL.
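
The cold-start pattern referenced above places the reasoning process between |special_token| delimiters and appends a summary; a rough illustration, in which the template constant and the example strings are invented for clarity:

```python
# Illustrative cold-start SFT target: a long chain of thought wrapped in
# |special_token| delimiters, followed by a short, human-readable summary.
COLD_START_PATTERN = "|special_token|{reasoning_process}|special_token|{summary}"

example_target = COLD_START_PATTERN.format(
    reasoning_process="Let the integers be n and n + 2. Their product is 143, "
                      "so n^2 + 2n - 143 = 0, giving n = 11 ...",
    summary="The two consecutive odd integers are 11 and 13.",
)
```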

Performance Highlights

  • DeepSeek-R1:
    • Codeforces: Outperforms 96.3% of human competitors (rating 2,029).
    • MMLU: 90.8% accuracy, surpassing GPT-4o and Claude-3.5-Sonnet.
    • SWE-bench Verified: Resolves 49.2% of issues.
  • Distilled Models:
    • 7B model surpasses GPT-4o on math tasks (55.5% pass@1 on AIME 2024).
    • 32B model outperforms QwQ-32B-Preview by 22.6 percentage points on AIME 2024 (72.6% vs. 50.0%).

Challenges & Solutions

  • Readability/Language Mixing: Addressed via cold-start data and a language consistency reward during RL (see the sketch after this list).
  • Reward Hacking: Avoided by using rule-based rewards instead of neural reward models.
  • Unsuccessful Attempts:
    • Process Reward Models (PRM) suffered from hard-to-define fine-grained steps and reward hacking; Monte Carlo Tree Search (MCTS) struggled with the exponentially larger search space of token generation and with training a sufficiently fine-grained value model.
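
The language consistency reward mentioned above is described in the report as the proportion of target-language words in the chain of thought; a crude sketch, in which the whitespace split and ASCII test are simplifying stand-ins for a real tokenizer and language detector:

```python
def language_consistency_reward(cot: str, target_lang: str = "en") -> float:
    """Fraction of chain-of-thought words judged to be in the target language."""
    words = cot.split()
    if not words:
        return 0.0
    def is_target(word: str) -> bool:
        # Naive heuristic: treat ASCII-only words as English, everything else as non-English.
        return word.isascii() if target_lang == "en" else not word.isascii()
    return sum(is_target(w) for w in words) / len(words)
```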

Conclusion

DeepSeek-R1 advances LLM reasoning through RL-driven self-evolution and iterative alignment, while distillation democratizes high-performance reasoning for smaller models. The approach emphasizes minimal supervised data and structured reward design, setting a new benchmark for open-source reasoning models.