Reading List

Readings

Papers, technical reports, and links — curated and tagged.

Reading
pre-training · scaling · test-time compute

Training Compute-Optimal Large Language Models

Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Noland, Katie Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karen Simonyan, Erich Elsen, Jack W. Rae, Oriol Vinyals, Laurent Sifre

DeepMind

Reading
MoE · architecture · Neural Networks

Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, Jeff Dean

Google Brain · Jagiellonian University

Reading
pre-training · scaling · optimization · Neural Networks

Scaling Laws for Neural Language Models

Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeff Wu, Dario Amodei

OpenAI

Reading
architecture · pre-training · distillation · deployment · efficiency

NVIDIA Nemotron-3 Nano Technical Report

NVIDIA

Reading
MoE · pre-training · scaling · architecture

DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models

DeepSeek-AI

Reading
pre-training · scaling · efficiency

MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies

Tsinghua University · Modelbest Inc.

Reading
reasoning · sampling · RL · test-time compute

Reasoning with Sampling

Aayush Karan, Yilun Du

Harvard University

Reading
RL · reasoning · eval

Rewarding the Unlikely: Lifting GRPO Beyond Distribution Sharpening

Andre He, Daniel Fried, Sean Welleck

Reading
reasoning · theorem proving · pre-training · architecture

Generative Language Modeling for Automated Theorem Proving

Stanislas Polu, Ilya Sutskever

Reading
agents · efficiency

Fara-7B: An Efficient Agentic Model for Computer Use

Microsoft Research

Reading
architecture

Gated Delta Networks: Improving Mamba2 with Delta Rule

Songlin Yang, Jan Kautz, Ali Hatamizadeh

Reading
optimization

Feedback Descent

Stanford AI

Reading
agents · RL · pre-training

Video PreTraining (VPT)

OpenAI

Reading
diffusion · interpretability · vision

ConceptAttention: Diffusion Transformers Learn Highly Interpretable Features

Alec Helbling, Tuna Han Salih Meral, Ben Hoover, Pinar Yanardag, Duen Horng (Polo) Chau

Georgia Tech · Virginia Tech · IBM Research

Reading
diffusion · survey

The Principles of Diffusion Models

Chieh-Hsin Lai, Yang Song, Dongjun Kim, Yuki Mitsufuji, Stefano Ermon

Reading
reasoning · test-time compute

Rethinking Thinking Tokens: LLMs as Improvement Operators

Lovish Madaan, Aniket Didolkar, Suchin Gururangan, John Quan, Ruan Silva, Ruslan Salakhutdinov, Manzil Zaheer, Sanjeev Arora, Anirudh Goyal

Meta Superintelligence Labs · University College London · Mila · Anthropic · Princeton University

Reading
RL · reasoning

Learning to Reason without External Rewards

Xuandong Zhao, Zhewei Kang, Aosong Feng, Sergey Levine, Dawn Song

Reading

https://arxiv.org/pdf/2512.13898