Paper Digest

2024

Temperature in LLMs: From Softmax to Autoregressive Decoding

3 minute read

Temperature controls LLM output randomness by scaling logits before softmax — high temperature (>1) flattens the distribution for more creative/random outputs, while low temperature (<1) sharpens it for more deterministic, predictable responses. As temperature approaches zero, softmax behaves like argmax — the highest logit token gets nearly all probability mass, making the model always select the most likely next token according to its training.
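
As a quick illustration, here is a minimal NumPy sketch of temperature-scaled softmax sampling over a toy logits vector (the values and function name are made up for illustration):

```python
import numpy as np

def sample_with_temperature(logits, temperature, rng):
    """Scale logits by 1/temperature, softmax, then sample a token index."""
    z = np.asarray(logits, dtype=np.float64) / temperature
    z -= z.max()                          # subtract max for numerical stability
    probs = np.exp(z) / np.exp(z).sum()
    return rng.choice(len(probs), p=probs), probs

rng = np.random.default_rng(0)
logits = [2.0, 1.0, 0.1]                  # hypothetical next-token logits
for T in (0.1, 1.0, 2.0):
    _, p = sample_with_temperature(logits, T, rng)
    print(T, p.round(3))                  # T < 1 sharpens toward argmax, T > 1 flattens toward uniform
```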

Improving MoE Inference with Variable-Sized Batched GEMM Kernels

13 minute read

Mixture-of-Experts (MoE) models improve the efficiency and scalability of large-scale transformers by enabling sparse computation: data is dynamically routed to a subset of experts. That sparsity, however, complicates efficient training and inference in batched settings. Our project addresses this by developing a variable-sized batched General Matrix-Matrix Multiplication (GEMM) kernel built on NVIDIA cuBLAS, offering an alternative kernel optimization for MoE to the block-sparse kernel proposed by MegaBlocks. The approach handles the dynamic, imbalanced computation inherent in MoE architectures and improves training efficiency: by leveraging cuBLAS for variable-sized batched GEMM, we enhance the performance of sparse training in MoEs and provide a robust solution to the limitations of current frameworks.
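
To see why a fixed-shape batched GEMM is an awkward fit, here is a small NumPy sketch (shapes, names, and top-1 routing are illustrative; this is not the cuBLAS kernel itself) of how routing produces per-expert matrix multiplies with different row counts:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, n_experts, n_tokens = 64, 256, 4, 32

tokens = rng.standard_normal((n_tokens, d_model))
expert_w = [rng.standard_normal((d_model, d_ff)) for _ in range(n_experts)]  # one weight matrix per expert
assignment = rng.integers(0, n_experts, size=n_tokens)                       # router output (top-1)

# Each expert receives a different number of tokens, so the per-expert GEMMs have
# different shapes: this is the variable-sized batched GEMM the kernel must handle.
outputs = np.empty((n_tokens, d_ff))
for e in range(n_experts):
    idx = np.where(assignment == e)[0]
    if idx.size:
        outputs[idx] = tokens[idx] @ expert_w[e]   # (len(idx), d_model) x (d_model, d_ff)
```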

DINOv2: Learning Robust Visual Features without Supervision

3 minute read

DINOv2 is a self-supervised foundation model for computer vision that trains a 1B parameter ViT on 142M curated images using dual objectives: an image-level cross-entropy loss between a student (trained on crops) and an EMA-updated teacher (trained on full images), plus a patch-level masked prediction task with decoupled weights. Key innovations include scalable dataset curation — using clustering-based deduplication and similarity-based augmentation with FAISS indexing instead of manual annotation — achieving 2x faster training and 3x less memory than iBOT while learning transferable features at both image and patch levels.
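
A rough PyTorch sketch of the two image-level ingredients, the EMA teacher update and the student-teacher cross-entropy; the momentum and temperature values are illustrative, and the centering/normalization tricks of the full recipe are omitted:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def ema_update(teacher, student, momentum=0.994):
    """Teacher weights track an exponential moving average of the student's (momentum is illustrative)."""
    for pt, ps in zip(teacher.parameters(), student.parameters()):
        pt.mul_(momentum).add_(ps, alpha=1.0 - momentum)

def image_level_loss(student_logits, teacher_logits, student_temp=0.1, teacher_temp=0.04):
    """Cross-entropy between the sharpened teacher distribution and the student's prediction."""
    t = F.softmax(teacher_logits / teacher_temp, dim=-1).detach()
    log_s = F.log_softmax(student_logits / student_temp, dim=-1)
    return -(t * log_s).sum(dim=-1).mean()
```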

FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning

3 minute read

FlashAttention optimizes transformer attention by tiling computations and minimizing HBM reads/writes using fast on-chip SRAM, while FlashAttention-2 adds further gains through better handling of causal masking, parallelization across the sequence length, and smarter partitioning of work between warps, achieving up to 10x memory savings and a 2x speedup.
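
For intuition only, here is a NumPy sketch of block-wise attention with a running (online) softmax; it captures the tiling idea but leaves out the fused CUDA kernel, causal masking, multiple heads, and the backward pass:

```python
import numpy as np

def tiled_attention(Q, K, V, block=64):
    """Single-head attention computed block-by-block over K/V with running softmax
    statistics, so the full N x N score matrix is never materialized."""
    n, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros_like(Q, dtype=np.float64)
    row_max = np.full(n, -np.inf)          # running max of scores per query row
    row_sum = np.zeros(n)                  # running softmax denominator per row
    for start in range(0, K.shape[0], block):
        Kb, Vb = K[start:start + block], V[start:start + block]
        scores = (Q @ Kb.T) * scale                    # partial scores for this key block
        new_max = np.maximum(row_max, scores.max(axis=1))
        correction = np.exp(row_max - new_max)         # rescale previously accumulated values
        p = np.exp(scores - new_max[:, None])
        row_sum = row_sum * correction + p.sum(axis=1)
        out = out * correction[:, None] + p @ Vb
        row_max = new_max
    return out / row_sum[:, None]

# Sanity check against naive attention that materializes the full score matrix
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((256, 64)) for _ in range(3))
s = (Q @ K.T) / np.sqrt(64)
ref = (np.exp(s - s.max(axis=1, keepdims=True)) /
       np.exp(s - s.max(axis=1, keepdims=True)).sum(axis=1, keepdims=True)) @ V
assert np.allclose(tiled_attention(Q, K, V), ref)
```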

Visual Instruction Tuning (LLaVA)

3 minute read

Instruction tuning has been shown to improve zero-shot capabilities in LLMs. This work extends the idea to multimodal models: the authors use GPT-4 to augment existing visual question-answering prompts into instruction-following data and train an image-conditioned language generation model. The architecture consists of a pre-trained large language model (LLaMA) and a vision encoder (CLIP).
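
A toy PyTorch sketch of the connector idea with illustrative dimensions: frozen CLIP patch features are projected into the language model's embedding space and prepended to the text token embeddings (LLaVA v1 uses a single linear projection for this step).

```python
import torch
import torch.nn as nn

class VisionLanguageConnector(nn.Module):
    """Maps CLIP patch features into the LLM embedding space (dimensions are illustrative)."""
    def __init__(self, vision_dim=1024, llm_dim=4096):
        super().__init__()
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, clip_patch_features, text_token_embeds):
        visual_tokens = self.proj(clip_patch_features)                 # (B, n_patches, llm_dim)
        return torch.cat([visual_tokens, text_token_embeds], dim=1)    # sequence fed to the LLM
```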

Segment Anything

4 minute read

SAM (Segment Anything Model) is a foundation model for promptable image segmentation — using a ViT encoder pretrained with MAE, a prompt encoder for points/boxes/masks, and an ambiguity-aware decoder that outputs multiple candidate masks, trained on 1.1B masks for zero-shot transfer.
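
Assuming the official segment_anything package and a downloaded ViT-H checkpoint, prompting with a single foreground point looks roughly like this (the checkpoint path, image, and click coordinates are placeholders):

```python
import numpy as np
from segment_anything import SamPredictor, sam_model_registry

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")   # placeholder checkpoint path
predictor = SamPredictor(sam)

image = np.zeros((480, 640, 3), dtype=np.uint8)   # stand-in for an HxWx3 RGB image
predictor.set_image(image)                        # runs the heavy ViT encoder once per image

masks, scores, low_res_logits = predictor.predict(
    point_coords=np.array([[320, 240]]),   # one (x, y) click
    point_labels=np.array([1]),            # 1 = foreground, 0 = background
    multimask_output=True,                 # ambiguity-aware: returns multiple candidate masks
)
```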

2023

Flamingo: a Visual Language Model for Few-Shot Learning

5 minute read

The paper’s primary contribution is leveraging a pre-trained language model to build a multimodal model, which the authors term a ‘visual language model’. It brings two main innovations. First, it handles interleaved images and text of variable length, which matters because that is how most online data is structured, so far more of it is available for training. Second, the goal is a model that can tackle a new task with just a few examples or instructions; previous methods fell short in such low-data scenarios, making this approach pioneering in its focus on few-shot learning.

2020

ZeRO: Memory Optimizations Toward Training Trillion Parameter Models

3 minute read

Training large models requires far more memory than just storing weights — a 1.5B parameter model needs ~3GB for weights but ~24GB total due to Adam optimizer states (momentum, variance) and gradients. ZeRO (Zero Redundancy Optimizer) solves this by partitioning optimizer states, gradients, and parameters across GPUs — eliminating redundant storage while maintaining data parallelism, unlike traditional approaches that either replicate everything (DP) or suffer from low utilization (MP/PP). The three ZeRO stages progressively reduce memory: ZeRO-1 partitions optimizer states, ZeRO-2 adds gradient partitioning, and ZeRO-3 partitions parameters too, with communication overhead comparable to standard data parallelism.
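
The numbers above follow from the paper's accounting (2 bytes/parameter for fp16 weights, 2 for fp16 gradients, and K = 12 for Adam's fp32 states under mixed precision); a small sketch of the per-GPU arithmetic for the three stages:

```python
def zero_memory_gb(params_billion, n_gpus, k=12):
    """Per-GPU memory in GB (1e9 bytes), following the ZeRO paper's mixed-precision
    Adam model: 2 bytes/param fp16 weights + 2 bytes/param fp16 gradients +
    k = 12 bytes/param optimizer states (fp32 weights, momentum, variance)."""
    psi = params_billion                                # parameter count in billions, so bytes/1e9 = GB
    baseline = (2 + 2 + k) * psi                        # plain data parallelism: everything replicated
    stage1 = (2 + 2) * psi + k * psi / n_gpus           # ZeRO-1: shard optimizer states
    stage2 = 2 * psi + (2 + k) * psi / n_gpus           # ZeRO-2: also shard gradients
    stage3 = (2 + 2 + k) * psi / n_gpus                 # ZeRO-3: also shard parameters
    return baseline, stage1, stage2, stage3

print(zero_memory_gb(1.5, 64))   # baseline is ~24 GB for a 1.5B-parameter model, matching the summary above
```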