MLSys 2025 - Conference Takeaways
A. Talks
1. Soumith Chintala (PyTorch)
Key Insights:
- North star has been how to break for-loops into parallelism
- Types of innovation in AI:
- Objective function
- Better priors
- Better labels (better data does not imply better labels)
Torch Optimizations:
- Faster: e.g. flex attention, torch.compile
- Precision flexibility: e.g. torchao (quantization)
- Better reliability: e.g. torchft, distributed checkpointing
Extreme PyTorch (Llama 3 paper):
- 24k GPUs
- Maximize effective training time
- Meta has large excel sheets of how to shard model
Resources:
- Best PyTorch features: torchtitan
- Community: GPU Mode Discord
Future Directions:
- ⭐️ Writing kernels with LLMs
2. LLMArena
Key Insights:
- Benchmarks should evolve with models
- Static ones are unreliable due to contamination
- Data freshness as a metric
- Categorizing prompts into task categories
- Prompt specific leaderboard (P2L)
3. Scaling Laws - Beidi Chen
Key Insights:
- Stage 3 was scaling context length; Stage 4 is test-time compute
- How fast are we solving the whole problem?
Notable Ideas:
- Multiverse 1K: using map-reduce for decoding
- Oversampling effective examples (DAPO): Paper
4. Tim Dettmers
Key Insights:
- Notion of a good paper: Does this paper share its assumptions?
- Being Impactful: Pick hard problems
- Obsess over problems and not solutions or a particular kind of solution
- Research Trends: Cyclical nature (e.g., quantization had hit a wall a few years ago, revival with LLMs, and now is again about to hit a wall)
B. Useful Papers
Training Optimization
PIPEFILL: Using GPUs During Bubbles in Pipeline-Parallel LLM Training
- A fill job is any independent workload temporarily executed during the idle bubbles in pipeline-parallel training to utilize wasted GPU time
- PIPEFILL anticipates bubbles via a Pipeline Bubble Instruction, combined with profiling during the initial training iterations
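To see why bubbles matter, here is a minimal sketch of the well-known idle-time fraction of a GPipe-style pipeline schedule (this formula is from the pipeline-parallelism literature, not from the PIPEFILL paper itself); it is this idle fraction that fill jobs reclaim:

```python
def bubble_fraction(stages: int, microbatches: int) -> float:
    """Idle-time fraction of a GPipe-style pipeline schedule:
    each of the `stages` devices is idle for (stages - 1) slots out of
    (microbatches + stages - 1) total slots per iteration."""
    return (stages - 1) / (microbatches + stages - 1)

# With 8 pipeline stages and 32 microbatches, ~18% of GPU time is bubbles
print(round(bubble_fraction(8, 32), 3))
```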
Key Parallelism Techniques:
| Technique | Purpose |
|---|---|
| ZeRO (1/2/3) | Memory optimization |
| FSDP | Fully Sharded Data Parallel: shards parameters, gradients, and optimizer states across data-parallel workers |
| Pipeline Parallelism | Partitions a model across devices when it is too big to fit even on a single node |
RADIUS: Range-Based Gradient Sparsity for Large Foundation Model Pre-training
- Training large foundation models is extremely compute- and communication-heavy, especially when using data parallelism
- Naive solutions like top-k gradient sparsity don’t scale and break convergence
- After an initial phase, reuse the same top-k indices for T steps
- Accumulate dropped gradients into a residual buffer
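A tiny sketch of the two mechanisms above (frozen top-k indices plus a residual buffer), using plain Python lists; the function names and the flat-list gradient are illustrative, not from the paper:

```python
import heapq

def topk_indices(grad, k):
    """Indices of the k largest-magnitude gradient entries."""
    return set(heapq.nlargest(k, range(len(grad)), key=lambda i: abs(grad[i])))

def radius_step(grad, residual, indices):
    """One communication step in the RADIUS style (sketch):
    send only entries at the frozen top-k `indices`; accumulate
    everything else into the local `residual` buffer so dropped
    gradient mass is not lost."""
    sent = {}
    for i, g in enumerate(grad):
        total = g + residual[i]
        if i in indices:
            sent[i] = total
            residual[i] = 0.0
        else:
            residual[i] = total
    return sent

grad = [0.9, -0.05, 0.4, 0.02]
residual = [0.0] * 4
idx = topk_indices(grad, 2)            # refreshed only every T steps
sent = radius_step(grad, residual, idx)
```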
APOLLO: SGD-Like Memory, AdamW-Level Performance
- APOLLO (Approximated Gradient Scaling for Memory Efficient LLM Optimization)
- Uses a low-rank auxiliary optimizer state, 3× training throughput vs. AdamW
- Related: GaLore
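A rough sketch of the memory trick: keep Adam-style moments only on a low-rank projection of the gradient and use them to rescale the full gradient. The Gaussian sketch, the single scalar scale, and all names here are simplifying assumptions, not APOLLO's actual construction:

```python
import math
import random

random.seed(0)
d, r = 8, 2
# Fixed random projection to a rank-r auxiliary space (assumption:
# a plain Gaussian sketch stands in for the paper's projection).
P = [[random.gauss(0, 1 / math.sqrt(r)) for _ in range(r)] for _ in range(d)]

m = [0.0] * r   # first moment, kept only in the tiny r-dim space
v = [0.0] * r   # second moment
beta1, beta2, eps = 0.9, 0.999, 1e-8

def scaled_update(grad):
    """Run Adam-style moment updates on a low-rank projection of the
    gradient, then use them only to rescale the full-rank gradient,
    so optimizer memory is O(r) instead of O(d)."""
    g_lo = [sum(P[i][j] * grad[i] for i in range(d)) for j in range(r)]
    for j in range(r):
        m[j] = beta1 * m[j] + (1 - beta1) * g_lo[j]
        v[j] = beta2 * v[j] + (1 - beta2) * g_lo[j] ** 2
    # one scalar scale derived from the low-rank state (simplified)
    scale = sum(abs(m[j]) / (math.sqrt(v[j]) + eps) for j in range(r)) / r
    return [scale * g for g in grad]
```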
REAL: Efficient RLHF Training of Large Language Models with Parameter Reallocation
- A task = a function call like generation, inference, or training, executed by one of the models (actor, critic, reward)
- Parameter Reallocation — dynamically redistribute model parameters across the cluster per task
- Up to 3.58x speedup vs baselines like DeepSpeed-Chat, OpenRLHF, and NeMo-Aligner
Mini-Sequence Transformer: Optimizing Intermediate Memory for Long Sequences Training
- Partitions input sequences and iteratively processes mini-sequences
- Reduces intermediate memory usage with activation recomputation
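The partitioning idea in miniature, with a toy elementwise block standing in for the memory-hungry MLP/LM-head (the chunking loop is the point; the block itself is a placeholder):

```python
def block(x_chunk):
    """Stand-in for a memory-hungry layer whose intermediate
    activation is several times the input size."""
    return [2 * v + 1 for v in x_chunk]

def mini_sequence_forward(x, chunk=4):
    """Process the sequence in mini-sequences so the intermediate
    activation only ever covers `chunk` tokens at a time."""
    out = []
    for start in range(0, len(x), chunk):
        out.extend(block(x[start:start + chunk]))
    return out

x = list(range(10))
assert mini_sequence_forward(x) == block(x)  # same output, lower peak memory
```

This only works directly for token-wise layers; attention needs extra care, which is where the paper's recomputation scheme comes in.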
Inference Optimization
Enabling Unstructured Sparse Acceleration on Structured Hardware
- Express model weights/activations (online) as sum of sparse matrices, similar to Taylor series expansion
- Shows significant latency improvements
- The paper doesn’t clearly report accuracy/perplexity metrics, but the idea is cool
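The decomposition idea in a toy form: split a weight vector into several sparse terms (stored as index-to-value maps) whose sum reconstructs the original exactly. The magnitude-banded greedy split below is my own illustrative choice, not the paper's algorithm:

```python
def split_into_sparse_terms(w, num_terms=3):
    """Peel entries off in descending magnitude into successive sparse
    'terms' so that w == sum of the terms; each term holds only a
    fraction of the entries, stored as an index -> value dict."""
    order = sorted(range(len(w)), key=lambda i: -abs(w[i]))
    per_term = (len(w) + num_terms - 1) // num_terms
    return [{i: w[i] for i in order[t * per_term:(t + 1) * per_term]}
            for t in range(num_terms)]

w = [3.0, -0.1, 2.5, 0.0, -4.0, 0.2]
terms = split_into_sparse_terms(w)
recon = [0.0] * len(w)
for term in terms:
    for i, v in term.items():
        recon[i] += v
assert recon == w   # exact reconstruction from sparse pieces
```

Each sparse term can then be dispatched to hardware that only accelerates structured patterns, which is the point of the paper.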
MILO: Efficient Quantized MoE Inference with Mixture of Low-Rank Compensators
- Quantization helps reduce memory for MoE models, but going below 4 bits causes non-trivial accuracy drops
- Milo: Quantize-then-Compensate with Low-Rank Matrices
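The quantize-then-compensate pattern in miniature: quantize, compute the error matrix, then fit a low-rank corrector to that error. A rank-1 power iteration stands in for MILO's learned low-rank compensators, and the crude uniform quantizer is likewise an assumption:

```python
import math

def quantize(W, step=0.5):
    """Crude uniform quantizer standing in for sub-4-bit quantization."""
    return [[round(v / step) * step for v in row] for row in W]

def rank1_compensator(E, iters=20):
    """Alternating power iteration giving u, v with u v^T ~= E: the
    rank-1 analogue of learning low-rank compensator matrices A @ B."""
    n, m = len(E), len(E[0])
    v = [1.0] * m
    u = [0.0] * n
    for _ in range(iters):
        u = [sum(E[i][j] * v[j] for j in range(m)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in u)) or 1.0
        u = [x / norm for x in u]
        v = [sum(E[i][j] * u[i] for i in range(n)) for j in range(m)]
    return u, v

W = [[0.9, 0.3], [-0.7, 0.1]]
Wq = quantize(W)
E = [[W[i][j] - Wq[i][j] for j in range(2)] for i in range(2)]
u, v = rank1_compensator(E)
# Corrected weights Wq + u v^T recover much of the quantization error
corrected = [[Wq[i][j] + u[i] * v[j] for j in range(2)] for i in range(2)]
```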
Long Context - Offloading KV Cache
- Offloads the KV cache to CPU RAM
Cluster Management
Rubick: Exploiting Job Reconfigurability for Deep Learning Cluster Scheduling
- GitHub
- Which job to allocate resources to based on marginal benefit of overall cluster saturation
- Treats jobs as white boxes
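A greedy sketch of allocating by marginal benefit: repeatedly hand one GPU to the job whose (profiled) throughput curve gains the most from it. The curves and names below are hypothetical; Rubick builds such performance models from job profiles:

```python
def allocate_gpus(jobs, free_gpus):
    """Greedy marginal-benefit scheduling sketch.
    `jobs` maps name -> throughput curve, where curve[g] is the job's
    throughput when given g GPUs (hypothetical profiled numbers)."""
    alloc = {name: 0 for name in jobs}
    for _ in range(free_gpus):
        def gain(name):
            g, curve = alloc[name], jobs[name]
            return curve[g + 1] - curve[g] if g + 1 < len(curve) else 0.0
        best = max(jobs, key=gain)
        if gain(best) <= 0:
            break                      # no job benefits from another GPU
        alloc[best] += 1
    return alloc

curves = {
    "bert":  [0, 10, 14, 16, 17],   # strong diminishing returns
    "llama": [0, 30, 55, 70, 75],
}
print(allocate_gpus(curves, 4))     # → {'bert': 1, 'llama': 3}
```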
Applied ML
KNOW WHERE YOU’RE UNCERTAIN WHEN PLANNING WITH MULTIMODAL FOUNDATION MODELS
- Use state-transition systems and temporal logic specifications like:
  - Always (pedestrian → wait) — “Always wait if there’s a pedestrian”
  - Never (red light → move forward) — “Never move forward at a red light”
- Build reward models that score a trajectory based on how many constraints it violates
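A minimal sketch of such a reward model, recasting the rules as (condition → required action) pairs and scoring a trajectory by its violation count; the rule encoding and all names are illustrative, not the paper's formalism:

```python
def violations(trajectory, constraints):
    """Count how many (condition -> required_action) rules each step of
    the trajectory breaks.  A step is (set_of_state_facts, action)."""
    count = 0
    for state, action in trajectory:
        for condition, required in constraints:
            if condition in state and action != required:
                count += 1
    return count

# Hypothetical driving rules in the spirit of the talk's examples
rules = [("pedestrian", "wait"), ("red_light", "stop")]
traj = [({"pedestrian"}, "wait"),
        ({"red_light"}, "move_forward"),
        (set(), "move_forward")]
reward = -violations(traj, rules)   # → -1 (the trajectory ran a red light)
```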
AIOPSLAB: A Holistic Framework to Evaluate AI Agents for Enabling Autonomous Clouds
- Agents are plentiful for Dev but not for Ops, because Ops tasks/benchmarks are missing, even though 60% of deployment failures come from Ops problems
- Provide SRE dataset and interfaces for agents to interact with the cloud
- Discussion
AI METROPOLIS: Scaling Large Language Model-Based Multi-Agent Simulation with Out-of-Order Execution
- Baseline multi-agent simulations synchronize globally: all agents must finish a step before any moves to the next
- Introduces out-of-order execution
- Maximum velocity: the maximum distance over which an agent can influence others in one simulation step
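The dependency check behind out-of-order execution can be sketched like this: an agent may advance past the global barrier only if no agent that is behind in simulated time could still reach it within the lag. Grid coordinates, Manhattan distance, and the tuple layout are my simplifying assumptions:

```python
def can_advance(agent, others, max_velocity):
    """Out-of-order execution check (sketch): `agent` may run its next
    step early if no agent behind in simulated time could travel far
    enough to influence it.  Agents are (x, y, time) tuples."""
    x, y, t = agent
    for ox, oy, ot in others:
        if ot < t:                               # other agent lags in time
            lag = t - ot
            dist = abs(x - ox) + abs(y - oy)     # Manhattan grid distance
            if dist <= lag * max_velocity:       # it could reach us in time
                return False
    return True

a = (0, 0, 5)
others = [(10, 0, 4), (1, 1, 3)]
print(can_advance(a, others, max_velocity=1))   # blocked by the nearby agent
```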
Tools
LUMOS: Efficient Performance Modeling and Estimation for Large-Scale LLM Training
- Profiling tool for LLMs (code not found)
- Related profiling tools: dpro
C. Resources
Books & Guides:
Tools:
Communities: