Fast Track to the DeepSeek-R1 Technical Report

Summary of DeepSeek-R1 Training Approach

Core Methodology

  1. Two-Stage Model Development:
    • DeepSeek-R1-Zero:

      • Trained purely via reinforcement learning (RL) without supervised fine-tuning (SFT).
      • Uses GRPO (Group Relative Policy Optimization) to reduce computational costs by eliminating the need for a critic model (see the sketch after this list).
      • Relies on rule-based rewards:
        • Accuracy rewards: Verify correctness (e.g., math answers via deterministic rules).
        • Format rewards: Enforce structured outputs (e.g., <think> for reasoning, <answer> for final results).
      • Demonstrates self-evolution: Automatically develops advanced reasoning behaviors (e.g., reflection, long chain-of-thought) through RL.
      • Achieves 71.0% pass@1 on AIME 2024, rivaling OpenAI-o1-0912.
    • DeepSeek-R1:

      • Builds on R1-Zero by adding cold-start data and multi-stage training:
        1. Cold-Start SFT: Fine-tunes the DeepSeek-V3-Base model with thousands of high-quality, human-readable reasoning examples.
        2. Reasoning-Oriented RL: Applies RL to refine reasoning capabilities while introducing a language consistency reward to mitigate mixed-language outputs.
        3. Rejection Sampling & SFT: Generates diverse SFT data (reasoning and non-reasoning tasks) from RL checkpoints, then retrains the model.
        4. Final RL Alignment: Aligns the model with human preferences (helpfulness, harmlessness) across all scenarios.
      • Matches OpenAI-o1-1217 on reasoning benchmarks (e.g., 97.3% pass@1 on MATH-500).
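
As noted in the GRPO item above, here is a minimal sketch of how rule-based rewards and group-relative advantages fit together; the function names, the exact-match answer check, and the equal reward weighting are illustrative assumptions, not code from the report:

```python
import re
from statistics import mean, pstdev

def format_reward(output: str) -> float:
    """Rule-based format check: reasoning inside <think> tags, answer inside <answer> tags."""
    ok = re.search(r"<think>.*</think>\s*<answer>.*</answer>", output, re.DOTALL)
    return 1.0 if ok else 0.0

def accuracy_reward(output: str, gold_answer: str) -> float:
    """Rule-based accuracy check: exact string match on the extracted final answer."""
    m = re.search(r"<answer>(.*?)</answer>", output, re.DOTALL)
    return 1.0 if m and m.group(1).strip() == gold_answer.strip() else 0.0

def grpo_advantages(outputs: list[str], gold_answer: str) -> list[float]:
    """Score a group of samples for one prompt with rule-based rewards, then
    normalize rewards within the group; this replaces a learned critic."""
    rewards = [accuracy_reward(o, gold_answer) + format_reward(o) for o in outputs]
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + 1e-6) for r in rewards]
```

Each normalized advantage then weights the clipped policy-ratio term for its sample, so no separate value network is trained.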

Key Innovations

  1. Pure RL for Reasoning:

    • Shows that reasoning capabilities can emerge without any SFT, relying solely on RL incentives.
    • Enables autonomous discovery of strategies like self-verification and multi-step reasoning.
  2. Cold-Start Data Design:

    • Addresses readability and language mixing by wrapping the reasoning process in |special_token| delimiters and appending a concise summary (illustrated in the sketch after this list).
  3. Distillation to Smaller Models:

    • Distills knowledge from DeepSeek-R1 into 1.5B–70B parameter models (Qwen/Llama-based) via SFT on ~800K R1-curated samples.
    • Achieves competitive performance (e.g., 72.6% pass@1 on AIME 2024 for 32B model) without RL.
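
The cold-start pattern referenced above places the reasoning process between |special_token| delimiters and appends a summary; a rough illustration, in which the template constant and the example strings are invented for clarity:

```python
# Illustrative cold-start SFT target: a long chain of thought wrapped in
# |special_token| delimiters, followed by a short, human-readable summary.
COLD_START_PATTERN = "|special_token|{reasoning_process}|special_token|{summary}"

example_target = COLD_START_PATTERN.format(
    reasoning_process="Let the integers be n and n + 2. Their product is 143, "
                      "so n^2 + 2n - 143 = 0, giving n = 11 ...",
    summary="The two consecutive odd integers are 11 and 13.",
)
```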

Performance Highlights

  • DeepSeek-R1:
    • Codeforces: Outperforms 96.3% of human competitors (rating 2,029).
    • MMLU: 90.8% accuracy, surpassing GPT-4o and Claude-3.5-Sonnet.
    • SWE-bench Verified: Resolves 49.2% of issues.
  • Distilled Models:
    • 7B model surpasses GPT-4o on math tasks (55.5% pass@1 on AIME 2024).
    • 32B model outperforms QwQ-32B-Preview by 22.6 percentage points on AIME 2024 (72.6% vs. 50.0%).

Challenges & Solutions

  • Readability/Language Mixing: Addressed via cold-start data and a language consistency reward during RL (see the sketch after this list).
  • Reward Hacking: Avoided by using rule-based rewards instead of neural reward models.
  • Unsuccessful Attempts:
    • Process Reward Models (PRM) suffered from hard-to-define fine-grained steps and reward hacking; Monte Carlo Tree Search (MCTS) struggled with the exponentially larger search space of token generation and with training a sufficiently fine-grained value model.
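
The language consistency reward mentioned above is described in the report as the proportion of target-language words in the chain of thought; a crude sketch, in which the whitespace split and ASCII test are simplifying stand-ins for a real tokenizer and language detector:

```python
def language_consistency_reward(cot: str, target_lang: str = "en") -> float:
    """Fraction of chain-of-thought words judged to be in the target language."""
    words = cot.split()
    if not words:
        return 0.0
    def is_target(word: str) -> bool:
        # Naive heuristic: treat ASCII-only words as English, everything else as non-English.
        return word.isascii() if target_lang == "en" else not word.isascii()
    return sum(is_target(w) for w in words) / len(words)
```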

Conclusion

DeepSeek-R1 advances LLM reasoning through RL-driven self-evolution and iterative alignment, while distillation democratizes high-performance reasoning for smaller models. The approach emphasizes minimal supervised data and structured reward design, setting a new benchmark for open-source reasoning models.