MLSys 2025 - Conference Takeaways

MLSys 2025 Conference Website

A. Talks

1. Soumith Chintala (PyTorch)

Watch Talk

Key Insights:

  • The north star has been figuring out how to break for-loops into parallel execution
  • Types of innovation in AI:
    • Objective function
    • Better priors
    • Better labels (better data does not imply better labels)

Torch Optimizations:

  • Faster: e.g. FlexAttention, torch.compile (see the sketch after this list)
  • Precision flexibility: e.g. torchao (quantization)
  • Better reliability: e.g. torchft, distributed checkpointing
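
A minimal sketch (mine, not from the talk) of the torch.compile speed path: compiling a small model fuses kernels and cuts Python overhead. The model and shapes here are arbitrary placeholders.

```python
# Hedged sketch: torch.compile on a toy MLP; model and sizes are arbitrary.
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096),
    torch.nn.GELU(),
    torch.nn.Linear(4096, 1024),
)

compiled = torch.compile(model)  # traces and JIT-compiles the forward pass

x = torch.randn(8, 1024)
out = compiled(x)  # first call triggers compilation; later calls reuse the graph
```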

Extreme PyTorch (Llama 3 paper):

  • 24k GPUs
  • Maximize effective training time
  • Meta maintains large Excel sheets describing how to shard the model

Resources:

Future Directions:

2. LLMArena

LLMArena Website

Key Insights:

3. Scaling Laws - Beidi Chen

Beidi Chen’s Google Scholar

Key Insights:

  • Stage 3 was scaling context length; Stage 4 is test-time compute
    • How fast are we solving the whole problem?

Notable Ideas:

  • Multiverse 1K: using map-reduce for decoding (see the sketch after this list)
  • Oversampling effective examples (DAPO): Paper
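
A hedged sketch of map-reduce-style decoding as I understood the idea (not the Multiverse implementation): split a question into independent subtasks, decode them in parallel ("map"), then merge the partial answers ("reduce"). `generate` is a hypothetical stand-in for a real LLM call.

```python
# Hedged sketch of map-reduce decoding; `generate` is a placeholder, not a real API.
from concurrent.futures import ThreadPoolExecutor

def generate(prompt: str) -> str:
    # Stand-in for an actual LLM call (API or local model).
    return f"answer({prompt})"

def map_reduce_decode(question: str, subtasks: list[str]) -> str:
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(generate, subtasks))  # "map": decode subtasks in parallel
    merged = "; ".join(partials)                       # "reduce": combine partial answers
    return generate(f"Combine for '{question}': {merged}")

print(map_reduce_decode("total cost", ["price of item A", "price of item B"]))
```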

4. Tim Dettmers

Key Insights:

  • Notion of a good paper: Does this paper share its assumptions?
  • Being Impactful: Pick hard problems
    • Obsess over problems and not solutions or a particular kind of solution
  • Research Trends: Cyclical nature (e.g., quantization hit a wall a few years ago, was revived by LLMs, and is now about to hit a wall again)

B. Useful Papers

Training Optimization

PIPEFILL: Using GPUs During Bubbles in Pipeline-Parallel LLM Training

  • A fill job is any independent workload temporarily executed during the idle bubbles of pipeline-parallel training to utilize otherwise wasted GPU time
  • PIPEFILL anticipates bubbles via Pipeline Bubble Instructions, combined with profiling during the initial training iterations (toy illustration below)
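
A toy illustration (my own construction, not PipeFill's actual scheduler): a pipeline stage alternates compute and idle "bubble" slots, and independent fill jobs are dispatched only into slots predicted to be idle. The job names are hypothetical.

```python
# Toy model of filling pipeline bubbles with independent jobs.
from collections import deque

schedule = ["fwd", "bubble", "bwd", "bubble", "fwd"]  # one stage's predicted timeline
fill_jobs = deque(["preprocess_shard", "eval_checkpoint", "compress_logs"])  # hypothetical jobs

for t, slot in enumerate(schedule):
    if slot == "bubble" and fill_jobs:
        print(f"t={t}: bubble -> running fill job '{fill_jobs.popleft()}'")
    else:
        print(f"t={t}: pipeline work ({slot})")
```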

Key Parallelism Techniques:

Technique               Purpose
ZeRO (1/2/3)            Memory optimization
FSDP                    Fully Sharded Data Parallel
Pipeline Parallelism    Model is too big to fit even on a single node, so it is partitioned across devices

RADIUS: Range-Based Gradient Sparsity for Large Foundation Model Pre-training

  • Training large foundation models is extremely compute- and communication-heavy, especially when using data parallelism
  • Naive solutions like top-k gradient sparsity don’t scale and break convergence
  • After an initial phase, reuse the same top-k indices for T steps
  • Accumulate dropped gradients into a residual buffer (see the sketch below)
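
A minimal sketch of my reading of the bullets above, not RADIUS's actual range-based scheme: select top-k gradient coordinates once, reuse those indices on later steps, and feed the dropped coordinates back through a residual buffer.

```python
# Hedged sketch: top-k index reuse + residual accumulation (not RADIUS itself).
import torch

def sparsify(grad, k, indices=None):
    flat = grad.flatten()
    if indices is None:  # refresh indices (RADIUS-style: only once every T steps)
        indices = flat.abs().topk(k).indices
    sparse = torch.zeros_like(flat)
    sparse[indices] = flat[indices]
    residual = flat - sparse  # dropped gradient mass, fed back next step
    return sparse.view_as(grad), residual.view_as(grad), indices

grad = torch.randn(4, 4)
residual = torch.zeros_like(grad)
indices = None
for step in range(5):
    g_eff = grad + residual  # add back previously dropped gradients
    g_sparse, residual, indices = sparsify(g_eff, k=4, indices=indices)
    # ... all-reduce g_sparse and apply the optimizer update here ...
```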

APOLLO: SGD-Like Memory, AdamW-Level Performance

  • APOLLO (Approximated Gradient Scaling for Memory-Efficient LLM Optimization)
  • Uses a low-rank auxiliary optimizer state; reports 3× training throughput vs. AdamW (sketch below)
  • Related: GaLore
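
A hedged sketch of the general low-rank optimizer-state idea (closer in spirit to GaLore than to APOLLO's exact scaling rule): keep Adam moments for a rank-r projection of the gradient instead of the full matrix. Shapes and hyperparameters are arbitrary; bias correction is omitted for brevity.

```python
# Hedged sketch: Adam-style moments kept in a low-rank gradient subspace.
import torch

rows, cols, r = 256, 128, 8
W = torch.randn(rows, cols)
P = torch.linalg.qr(torch.randn(rows, r)).Q  # fixed orthonormal projection (rows x r)
m1 = torch.zeros(r, cols)                    # low-rank first moment
m2 = torch.zeros(r, cols)                    # low-rank second moment
beta1, beta2, lr, eps = 0.9, 0.999, 1e-3, 1e-8

for step in range(10):
    G = torch.randn(rows, cols)        # stand-in for a real gradient
    Gr = P.T @ G                       # project: store (r x cols), not (rows x cols)
    m1 = beta1 * m1 + (1 - beta1) * Gr
    m2 = beta2 * m2 + (1 - beta2) * Gr**2
    update_r = m1 / (m2.sqrt() + eps)  # Adam-style step in the low-rank space
    W -= lr * (P @ update_r)           # project the update back to full size
```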

REAL: Efficient RLHF Training of Large Language Models with Parameter Reallocation

  • A task = a function call like generation, inference, or training, executed by one of the models (actor, critic, reward)
  • Parameter Reallocation — dynamically redistribute model parameters across the cluster per task
  • Up to 3.58x speedup vs baselines like DeepSpeed-Chat, OpenRLHF, and NeMo-Aligner

Mini-Sequence Transformer: Optimizing Intermediate Memory for Long Sequences Training

  • Partitions input sequences and iteratively processes mini-sequences
  • Reduces intermediate memory usage, combined with activation recomputation (see the sketch below)
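
A minimal sketch of the mini-sequence idea, under the assumption that the memory peak comes from token-independent blocks like the MLP: processing the sequence in chunks keeps only one chunk's intermediate activations live at a time, without changing the result.

```python
# Hedged sketch: chunked ("mini-sequence") forward through a token-wise MLP.
import torch

mlp = torch.nn.Sequential(
    torch.nn.Linear(512, 2048), torch.nn.GELU(), torch.nn.Linear(2048, 512)
)

def mini_sequence_forward(x, chunk_size=128):
    # x: (batch, seq_len, hidden). The MLP treats tokens independently,
    # so splitting the sequence dimension is exact, not an approximation.
    outs = [mlp(chunk) for chunk in x.split(chunk_size, dim=1)]
    return torch.cat(outs, dim=1)

x = torch.randn(2, 1024, 512)
assert torch.allclose(mini_sequence_forward(x), mlp(x), atol=1e-5)
```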

Inference Optimization

Enabling Unstructured Sparse Acceleration on Structured Hardware

  • Express model weights/activations (online) as a sum of sparse matrices, similar to a Taylor series expansion
  • Shows significant latency improvements
  • The authors don’t clearly specify accuracy/perplexity metrics, but the idea is cool

MILO: Efficient Quantized MoE Inference with Mixture of Low-Rank Compensators

  • Quantization reduces memory for MoE models, but below 4 bits it causes non-trivial accuracy drops
  • MiLo: quantize-then-compensate with low-rank matrices (see the sketch below)
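
A hedged sketch of the quantize-then-compensate idea (my construction, not MiLo's algorithm): quantize W, then fit a low-rank term to the quantization residual so that W ≈ Q(W) + A @ B. `fake_quantize` is a simple symmetric quantizer used only for illustration.

```python
# Hedged sketch: low-rank compensation of quantization error.
import torch

def fake_quantize(W, bits=3):
    # Simple symmetric round-to-nearest quantizer (illustrative only).
    scale = W.abs().max() / (2 ** (bits - 1) - 1)
    return (W / scale).round().clamp(-(2 ** (bits - 1)), 2 ** (bits - 1) - 1) * scale

W = torch.randn(256, 256)
Wq = fake_quantize(W)
U, S, Vh = torch.linalg.svd(W - Wq)  # SVD of the quantization residual
r = 16
A = U[:, :r] * S[:r]                 # rank-r compensator, left factor
B = Vh[:r, :]                        # rank-r compensator, right factor
print(f"residual norm: {(W - Wq).norm():.2f} -> {(W - (Wq + A @ B)).norm():.2f}")
```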

Long Context - Offloading KV Cache

  • Offloads the KV cache to CPU RAM to fit long contexts (see the sketch below)
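
A minimal sketch of one assumed offloading scheme (not a specific paper's system): keep each layer's KV tensors in CPU RAM and copy a layer's cache to the accelerator only when that layer runs; pinned host memory speeds up the transfers.

```python
# Hedged sketch: per-layer KV-cache offloading to CPU RAM.
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
num_layers, heads, seq, dim = 4, 8, 4096, 64
pin = torch.cuda.is_available()  # pinned memory enables fast async H2D copies

cpu_cache = [
    (torch.zeros(heads, seq, dim, pin_memory=pin),
     torch.zeros(heads, seq, dim, pin_memory=pin))
    for _ in range(num_layers)
]

for layer in range(num_layers):
    k_cpu, v_cpu = cpu_cache[layer]
    k = k_cpu.to(device, non_blocking=True)  # fetch this layer's keys on demand
    v = v_cpu.to(device, non_blocking=True)  # fetch this layer's values on demand
    # ... run attention for `layer` with k, v; write new entries back to CPU ...
```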

Cluster Management

Rubick: Exploiting Job Reconfigurability for Deep Learning Cluster Scheduling

  • GitHub
  • Decides which job to allocate resources to based on the marginal benefit to overall cluster saturation
  • Treats jobs as white boxes

Applied ML

KNOW WHERE YOU’RE UNCERTAIN WHEN PLANNING WITH MULTIMODAL FOUNDATION MODELS

  • Use state-transition systems and temporal logic specifications like:
    • □ (pedestrian → wait) — “Always wait if there’s a pedestrian”
    • □ (red light → ¬ move forward) — “Never move forward at a red light”
  • Build reward models that score a trajectory by how many constraints it violates (see the sketch after this list)
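
A minimal sketch of violation-based scoring under an assumed encoding (each step is a dict of boolean propositions; not the paper's representation): count how often a trajectory breaks the two specifications above.

```python
# Hedged sketch: score a trajectory by counting temporal-logic violations.
def violations(trajectory):
    count = 0
    for step in trajectory:  # step: dict of boolean propositions
        if step["pedestrian"] and not step["wait"]:     # breaks □(pedestrian → wait)
            count += 1
        if step["red_light"] and step["move_forward"]:  # breaks □(red light → ¬ move forward)
            count += 1
    return count

traj = [
    {"pedestrian": True, "wait": True, "red_light": False, "move_forward": False},
    {"pedestrian": False, "wait": False, "red_light": True, "move_forward": True},
]
print(-violations(traj))  # reward = -1: one red-light violation
```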

AIOPSLAB: A Holistic Framework to Evaluate AI Agents for Enabling Autonomous Clouds

  • Agents are plentiful for Dev but not for Ops, since Ops tasks/benchmarks are missing, yet ~60% of deployment failures come from Ops problems
  • Provide SRE dataset and interfaces for agents to interact with the cloud
  • Discussion

AI METROPOLIS: Scaling Large Language Model-Based Multi-Agent Simulation with Out-of-Order Execution

  • Multi-agent simulations are conventionally lock-stepped: all agents must finish a step before any moves to the next
  • Introduces out-of-order execution so independent agents can run ahead (toy sketch below)
  • Maximum velocity: the maximum distance an agent can influence within one simulation step
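
A toy sketch of the dependency test as I understood it (my construction, not the paper's algorithm): an agent may run ahead of slower agents only if none of them, given the maximum per-step influence distance, could reach it before catching up in simulation time.

```python
# Hedged sketch: out-of-order stepping gated by a max-velocity dependency check.
MAX_VELOCITY = 1.0  # max distance an agent can influence in one simulation step

agents = [
    {"name": "A", "pos": 0.0, "step": 3},
    {"name": "B", "pos": 10.0, "step": 1},
]

def can_advance(agent, others):
    for other in others:
        if other is agent:
            continue
        lag = agent["step"] - other["step"]
        # A slower agent within lag * MAX_VELOCITY could still affect this one.
        if lag > 0 and abs(agent["pos"] - other["pos"]) <= lag * MAX_VELOCITY:
            return False
    return True

for a in agents:
    print(a["name"], "advances" if can_advance(a, agents) else "waits")
```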

Tools

LUMOS: Efficient Performance Modeling and Estimation for Large-Scale LLM Training

  • Performance modeling and estimation tool for LLM training (code not found)
  • Related profiling tools: dpro

C. Resources

Books & Guides:

Tools:

Communities: