MLSys 2025 - Conference Takeaways
A. Talks
1. Soumith Chintala (PyTorch)
Key Insights:
- North star has been how to break for-loops into parallelism
- Types of innovation in AI:
- Objective function
- Better priors
- Better labels (better data does not imply better labels)
Torch Optimizations:
- Faster: e.g. flex attention, torch.compile
- Precision flexibility: e.g. torchao (quantization)
- Better reliability: e.g. torchft, distributed checkpointing
Extreme PyTorch (Llama 3 paper):
- 24k GPUs
- Maximize effective training time
- Meta has large excel sheets of how to shard model
Resources:
- Best PyTorch features: torchtitan
- Community: GPU Mode Discord
Future Directions:
- ⭐️ Writing kernels with LLMs
2. LLMArena
Key Insights:
- Benchmarks should evolve with models
- Static ones are unreliable due to contamination
- Data freshness as a metric
- Categorizing prompts into task categories
- Prompt specific leaderboard (P2L)
3. Scaling Laws - Beidi Chen
Key Insights:
- Stage 3 was scaling context length; Stage 4 is test-time compute
- How fast are we solving the whole problem?
Notable Ideas:
- Multiverse 1K: using map-reduce for decoding
- Oversampling effective examples (DAPO): Paper
4. Tim Dettmers
Key Insights:
- Notion of a good paper: Does this paper share its assumptions?
- Being Impactful: Pick hard problems
- Obsess over problems and not solutions or a particular kind of solution
- Research Trends: Cyclical nature (e.g., quantization had hit a wall a few years ago, revival with LLMs, and now is again about to hit a wall)
B. Useful Papers
Training Optimization
PIPEFILL: Using GPUs During Bubbles in Pipeline-Parallel LLM Training
- A fill job is any independent workload temporarily executed during the idle bubbles in pipeline-parallel training to utilize wasted GPU time
- PIPEFILL anticipates bubbles via a Pipeline Bubble Instruction, combined with profiling during the initial training iterations
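To see why bubbles matter, here is a minimal sketch of the well-known idle-time fraction of a GPipe-style pipeline schedule (this formula is from the pipeline-parallelism literature, not from the PIPEFILL paper itself); it is this idle fraction that fill jobs reclaim:

```python
def bubble_fraction(stages: int, microbatches: int) -> float:
    """Idle-time fraction of a GPipe-style pipeline schedule:
    each of the `stages` devices is idle for (stages - 1) slots out of
    (microbatches + stages - 1) total slots per iteration."""
    return (stages - 1) / (microbatches + stages - 1)

# With 8 pipeline stages and 32 microbatches, ~18% of GPU time is bubbles
print(round(bubble_fraction(8, 32), 3))
```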
Key Parallelism Techniques:
| Technique | Purpose |
|---|---|
| ZeRO (1/2/3) | Memory optimization |
| FSDP | Fully Sharded Data Parallel: shards parameters, gradients, and optimizer states across data-parallel workers |
| Pipeline Parallelism | Partitions a model across devices when it is too big to fit even on a single node |
RADIUS: Range-Based Gradient Sparsity for Large Foundation Model Pre-training
- Training large foundation models is extremely compute- and communication-heavy, especially when using data parallelism
- Naive solutions like top-k gradient sparsity don’t scale and break convergence
- After an initial phase, reuse the same top-k indices for T steps
- Accumulate dropped gradients into a residual buffer
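A tiny sketch of the two mechanisms above (frozen top-k indices plus a residual buffer), using plain Python lists; the function names and the flat-list gradient are illustrative, not from the paper:

```python
import heapq

def topk_indices(grad, k):
    """Indices of the k largest-magnitude gradient entries."""
    return set(heapq.nlargest(k, range(len(grad)), key=lambda i: abs(grad[i])))

def radius_step(grad, residual, indices):
    """One communication step in the RADIUS style (sketch):
    send only entries at the frozen top-k `indices`; accumulate
    everything else into the local `residual` buffer so dropped
    gradient mass is not lost."""
    sent = {}
    for i, g in enumerate(grad):
        total = g + residual[i]
        if i in indices:
            sent[i] = total
            residual[i] = 0.0
        else:
            residual[i] = total
    return sent

grad = [0.9, -0.05, 0.4, 0.02]
residual = [0.0] * 4
idx = topk_indices(grad, 2)            # refreshed only every T steps
sent = radius_step(grad, residual, idx)
```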
APOLLO: SGD-Like Memory, AdamW-Level Performance
- APOLLO (Approximated Gradient Scaling for Memory Efficient LLM Optimization)
- Uses a low-rank auxiliary optimizer state, 3× training throughput vs. AdamW
- Related: GaLore
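A rough sketch of the memory trick: keep Adam-style moments only on a low-rank projection of the gradient and use them to rescale the full gradient. The Gaussian sketch, the single scalar scale, and all names here are simplifying assumptions, not APOLLO's actual construction:

```python
import math
import random

random.seed(0)
d, r = 8, 2
# Fixed random projection to a rank-r auxiliary space (assumption:
# a plain Gaussian sketch stands in for the paper's projection).
P = [[random.gauss(0, 1 / math.sqrt(r)) for _ in range(r)] for _ in range(d)]

m = [0.0] * r   # first moment, kept only in the tiny r-dim space
v = [0.0] * r   # second moment
beta1, beta2, eps = 0.9, 0.999, 1e-8

def scaled_update(grad):
    """Run Adam-style moment updates on a low-rank projection of the
    gradient, then use them only to rescale the full-rank gradient,
    so optimizer memory is O(r) instead of O(d)."""
    g_lo = [sum(P[i][j] * grad[i] for i in range(d)) for j in range(r)]
    for j in range(r):
        m[j] = beta1 * m[j] + (1 - beta1) * g_lo[j]
        v[j] = beta2 * v[j] + (1 - beta2) * g_lo[j] ** 2
    # one scalar scale derived from the low-rank state (simplified)
    scale = sum(abs(m[j]) / (math.sqrt(v[j]) + eps) for j in range(r)) / r
    return [scale * g for g in grad]
```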
REAL: Efficient RLHF Training of Large Language Models with Parameter Reallocation
- A task = a function call like generation, inference, or training, executed by one of the models (actor, critic, reward)
- Parameter Reallocation — dynamically redistribute model parameters across the cluster per task
- Up to 3.58x speedup vs baselines like DeepSpeed-Chat, OpenRLHF, and NeMo-Aligner
Mini-Sequence Transformer: Optimizing Intermediate Memory for Long Sequences Training
- Partitions input sequences and iteratively processes mini-sequences
- Reduces intermediate memory usage with activation recomputation
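The partitioning idea in miniature, with a toy elementwise block standing in for the memory-hungry MLP/LM-head (the chunking loop is the point; the block itself is a placeholder):

```python
def block(x_chunk):
    """Stand-in for a memory-hungry layer whose intermediate
    activation is several times the input size."""
    return [2 * v + 1 for v in x_chunk]

def mini_sequence_forward(x, chunk=4):
    """Process the sequence in mini-sequences so the intermediate
    activation only ever covers `chunk` tokens at a time."""
    out = []
    for start in range(0, len(x), chunk):
        out.extend(block(x[start:start + chunk]))
    return out

x = list(range(10))
assert mini_sequence_forward(x) == block(x)  # same output, lower peak memory
```

This only works directly for token-wise layers; attention needs extra care, which is where the paper's recomputation scheme comes in.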
Inference Optimization
Enabling Unstructured Sparse Acceleration on Structured Hardware
- Express model weights/activations (online) as sum of sparse matrices, similar to Taylor series expansion
- Shows significant latency improvements
- The paper doesn’t clearly report accuracy/perplexity metrics, but the idea is cool
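The decomposition idea in a toy form: split a weight vector into several sparse terms (stored as index-to-value maps) whose sum reconstructs the original exactly. The magnitude-banded greedy split below is my own illustrative choice, not the paper's algorithm:

```python
def split_into_sparse_terms(w, num_terms=3):
    """Peel entries off in descending magnitude into successive sparse
    'terms' so that w == sum of the terms; each term holds only a
    fraction of the entries, stored as an index -> value dict."""
    order = sorted(range(len(w)), key=lambda i: -abs(w[i]))
    per_term = (len(w) + num_terms - 1) // num_terms
    return [{i: w[i] for i in order[t * per_term:(t + 1) * per_term]}
            for t in range(num_terms)]

w = [3.0, -0.1, 2.5, 0.0, -4.0, 0.2]
terms = split_into_sparse_terms(w)
recon = [0.0] * len(w)
for term in terms:
    for i, v in term.items():
        recon[i] += v
assert recon == w   # exact reconstruction from sparse pieces
```

Each sparse term can then be dispatched to hardware that only accelerates structured patterns, which is the point of the paper.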
MILO: Efficient Quantized MoE Inference with Mixture of Low-Rank Compensators
- Quantization helps reduce memory for MoE models, but going below 4 bits causes non-trivial accuracy drops
- Milo: Quantize-then-Compensate with Low-Rank Matrices
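The quantize-then-compensate pattern in miniature: quantize, compute the error matrix, then fit a low-rank corrector to that error. A rank-1 power iteration stands in for MILO's learned low-rank compensators, and the crude uniform quantizer is likewise an assumption:

```python
import math

def quantize(W, step=0.5):
    """Crude uniform quantizer standing in for sub-4-bit quantization."""
    return [[round(v / step) * step for v in row] for row in W]

def rank1_compensator(E, iters=20):
    """Alternating power iteration giving u, v with u v^T ~= E: the
    rank-1 analogue of learning low-rank compensator matrices A @ B."""
    n, m = len(E), len(E[0])
    v = [1.0] * m
    u = [0.0] * n
    for _ in range(iters):
        u = [sum(E[i][j] * v[j] for j in range(m)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in u)) or 1.0
        u = [x / norm for x in u]
        v = [sum(E[i][j] * u[i] for i in range(n)) for j in range(m)]
    return u, v

W = [[0.9, 0.3], [-0.7, 0.1]]
Wq = quantize(W)
E = [[W[i][j] - Wq[i][j] for j in range(2)] for i in range(2)]
u, v = rank1_compensator(E)
# Corrected weights Wq + u v^T recover much of the quantization error
corrected = [[Wq[i][j] + u[i] * v[j] for j in range(2)] for i in range(2)]
```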
Long Context - Offloading KV Cache
- Offloads the KV cache to CPU RAM
Cluster Management
Rubick: Exploiting Job Reconfigurability for Deep Learning Cluster Scheduling
- GitHub
- Which job to allocate resources to based on marginal benefit of overall cluster saturation
- Treats jobs as white boxes
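A greedy sketch of allocating by marginal benefit: repeatedly hand one GPU to the job whose (profiled) throughput curve gains the most from it. The curves and names below are hypothetical; Rubick builds such performance models from job profiles:

```python
def allocate_gpus(jobs, free_gpus):
    """Greedy marginal-benefit scheduling sketch.
    `jobs` maps name -> throughput curve, where curve[g] is the job's
    throughput when given g GPUs (hypothetical profiled numbers)."""
    alloc = {name: 0 for name in jobs}
    for _ in range(free_gpus):
        def gain(name):
            g, curve = alloc[name], jobs[name]
            return curve[g + 1] - curve[g] if g + 1 < len(curve) else 0.0
        best = max(jobs, key=gain)
        if gain(best) <= 0:
            break                      # no job benefits from another GPU
        alloc[best] += 1
    return alloc

curves = {
    "bert":  [0, 10, 14, 16, 17],   # strong diminishing returns
    "llama": [0, 30, 55, 70, 75],
}
print(allocate_gpus(curves, 4))     # → {'bert': 1, 'llama': 3}
```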
Applied ML
KNOW WHERE YOU’RE UNCERTAIN WHEN PLANNING WITH MULTIMODAL FOUNDATION MODELS
- Use state-transition systems and temporal logic specifications like:
  - Always (pedestrian → wait) — “Always wait if there’s a pedestrian”
  - Never (red light → move forward) — “Never move forward at a red light”
- Build reward models that score a trajectory based on how many constraints it violates
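A minimal sketch of such a reward model, recasting the rules as (condition → required action) pairs and scoring a trajectory by its violation count; the rule encoding and all names are illustrative, not the paper's formalism:

```python
def violations(trajectory, constraints):
    """Count how many (condition -> required_action) rules each step of
    the trajectory breaks.  A step is (set_of_state_facts, action)."""
    count = 0
    for state, action in trajectory:
        for condition, required in constraints:
            if condition in state and action != required:
                count += 1
    return count

# Hypothetical driving rules in the spirit of the talk's examples
rules = [("pedestrian", "wait"), ("red_light", "stop")]
traj = [({"pedestrian"}, "wait"),
        ({"red_light"}, "move_forward"),
        (set(), "move_forward")]
reward = -violations(traj, rules)   # → -1 (the trajectory ran a red light)
```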
AIOPSLAB: A Holistic Framework to Evaluate AI Agents for Enabling Autonomous Clouds
- Agents are plentiful for Dev but not for Ops, because Ops tasks/benchmarks are missing, even though 60% of deployment failures come from Ops problems
- Provide SRE dataset and interfaces for agents to interact with the cloud
- Discussion
AI METROPOLIS: Scaling Large Language Model-Based Multi-Agent Simulation with Out-of-Order Execution
- Baseline multi-agent simulations synchronize globally: all agents must finish a step before any moves to the next
- Introduces out-of-order execution
- Maximum velocity: the maximum distance over which an agent can influence others in one simulation step
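The dependency check behind out-of-order execution can be sketched like this: an agent may advance past the global barrier only if no agent that is behind in simulated time could still reach it within the lag. Grid coordinates, Manhattan distance, and the tuple layout are my simplifying assumptions:

```python
def can_advance(agent, others, max_velocity):
    """Out-of-order execution check (sketch): `agent` may run its next
    step early if no agent behind in simulated time could travel far
    enough to influence it.  Agents are (x, y, time) tuples."""
    x, y, t = agent
    for ox, oy, ot in others:
        if ot < t:                               # other agent lags in time
            lag = t - ot
            dist = abs(x - ox) + abs(y - oy)     # Manhattan grid distance
            if dist <= lag * max_velocity:       # it could reach us in time
                return False
    return True

a = (0, 0, 5)
others = [(10, 0, 4), (1, 1, 3)]
print(can_advance(a, others, max_velocity=1))   # blocked by the nearby agent
```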
Tools
LUMOS: Efficient Performance Modeling and Estimation for Large-Scale LLM Training
- Profiling tool for LLMs (code not found)
- Related profiling tools: dpro
C. Resources
Books & Guides:
Tools:
Communities: