Reading List

Readings

Papers, technical reports, and links — curated and tagged.

Reading

pre-training, scaling, test-time compute

Training Compute-Optimal Large Language Models

Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Noland, Katie Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karen Simonyan, Erich Elsen, Jack W. Rae, Oriol Vinyals, Laurent Sifre

DeepMind

Readings

Training Compute-Optimal Large Language Models

Outrageously Large Neural Networks: The Sparsely-Gated Mixtures-of-Experts Layer

Scaling Laws for Neural Language Models

NVIDIA Nemotron-3 Nano Technical Report

DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models

MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies

Reasoning with Sampling

Rewarding the Unlikely: Lifting GRPO Beyond Distribution Sharpening

Generative Language Modeling for Automated Theorem Proving

Fara-7B: An Efficient Agentic Model for Computer Use

Gated Delta Networks: Improving Mamba2 with Delta Rule

Feedback Descent

Video PreTraining (VPT)

ConceptAttention: Diffusion Transformers Learn Highly Interpretable Features

The Principles of Diffusion Models

Rethinking Thinking Tokens: LLMs as Improvement Operators

Learning to Reason without External Rewards

https://arxiv.org/pdf/2512.13898