A list of all the posts and pages found on the site. For you robots out there, an XML version is available for digesting as well.
About me
Learning visual concepts from natural language supervision for zero-shot transfer
AI-powered characters that simulate believable human behavior using LLMs
Overview of GPT-3 and its impact on large language models
Meta’s open-source LLM with dual reward models for safety and helpfulness
Key insights and learnings from MLSys 2025
Key insights and learnings from NeurIPS 2025
This is a page not in the main menu
Exploring chunking, retrieval strategies, and evaluation metrics for RAG systems
An analysis of streaming popularity, lyrical emotion, and social media sentiment
Published:
Temperature controls LLM output randomness by scaling logits before softmax — high temperature (>1) flattens the distribution for more creative/random outputs, while low temperature (<1) sharpens it for more deterministic, predictable responses. As temperature approaches zero, softmax behaves like argmax — the highest logit token gets nearly all probability mass, making the model always select the most likely next token according to its training.
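The logit-scaling described above is a one-liner; a minimal sketch (example logits are illustrative):

```python
import numpy as np

def softmax_with_temperature(logits, temperature=1.0):
    """Scale logits by 1/T before softmax; T < 1 sharpens the
    distribution toward the argmax, T > 1 flattens it toward uniform."""
    scaled = np.asarray(logits, dtype=float) / temperature
    scaled -= scaled.max()  # subtract max for numerical stability
    exp = np.exp(scaled)
    return exp / exp.sum()

logits = [2.0, 1.0, 0.5]
print(softmax_with_temperature(logits, 1.0))   # moderate preference for token 0
print(softmax_with_temperature(logits, 0.1))   # near one-hot at the argmax
print(softmax_with_temperature(logits, 10.0))  # near uniform
```

As temperature shrinks, the probability of the highest-logit token approaches 1, which is the argmax behavior the summary describes.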
Published:
Mixture-of-Experts (MoE) models enhance the efficiency and scalability of large-scale transformers by enabling sparse computation: tokens are dynamically routed to a subset of experts. However, sparsity complicates efficient training and inference in batched settings. Our project addresses this challenge by developing a variable-sized batched General Matrix-Matrix Multiplication (GEMM) kernel using NVIDIA cuBLAS, offering an alternative kernel optimization for MoE to the block-sparse kernels proposed by MegaBlocks. By handling the dynamic and imbalanced computation inherent in MoE architectures, the approach improves training efficiency, enhances the performance of sparse training, and provides a robust alternative to the limitations of current frameworks.
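To see why variable-sized batched GEMM is the right primitive here, consider top-1 routing: each expert receives a different number of tokens per batch, so the per-expert matmuls have mismatched batch dimensions. A minimal NumPy sketch of the computation pattern (dimensions and the Python loop are illustrative; the kernel's job is to fuse these uneven matmuls into one launch):

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_model, d_ff, n_experts = 64, 16, 32, 4

x = rng.standard_normal((n_tokens, d_model))
experts = [rng.standard_normal((d_model, d_ff)) for _ in range(n_experts)]
assignment = rng.integers(0, n_experts, n_tokens)  # top-1 router decisions

# Each expert sees a different, data-dependent number of tokens, so the
# per-expert matmuls below have variable batch sizes -- exactly the case
# a variable-sized batched GEMM kernel handles in a single launch.
out = np.empty((n_tokens, d_ff))
for e in range(n_experts):
    idx = np.where(assignment == e)[0]
    out[idx] = x[idx] @ experts[e]
```

Padding every expert's batch to the maximum wastes FLOPs when routing is imbalanced, which is what motivates a kernel that accepts per-group sizes directly.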
Published:
DINOv2 is a self-supervised foundation model for computer vision that trains a 1B parameter ViT on 142M curated images using dual objectives: an image-level cross-entropy loss between a student (trained on crops) and an EMA-updated teacher (trained on full images), plus a patch-level masked prediction task with decoupled weights. Key innovations include scalable dataset curation — using clustering-based deduplication and similarity-based augmentation with FAISS indexing instead of manual annotation — achieving 2x faster training and 3x less memory than iBOT while learning transferable features at both image and patch levels.
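The student/teacher setup above hinges on the teacher never being trained by gradient descent: it is an exponential-moving-average (EMA) copy of the student. A minimal sketch of that update (the momentum value and flat dict of parameters are illustrative, not DINOv2's actual schedule or module structure):

```python
def ema_update(teacher, student, momentum=0.994):
    """Move each teacher parameter a small step toward the student's.
    With momentum close to 1, the teacher is a slowly varying average
    of past student weights, which stabilizes the distillation targets."""
    return {k: momentum * teacher[k] + (1 - momentum) * student[k]
            for k in teacher}

teacher = {"w": 1.0}
student = {"w": 0.0}
teacher = ema_update(teacher, student)
print(teacher["w"])  # 0.994
```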
Published:
FlashAttention optimizes transformer attention by tiling computations and minimizing HBM reads/writes using fast on-chip SRAM, while FlashAttention-2 adds further gains through causal masking, sequence-length parallelization, and smarter warp partitioning — achieving up to 10x memory and 2x speed improvements.
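The tiling described above works because softmax can be computed online, one key/value tile at a time, without ever materializing the full score matrix. A single-query NumPy sketch of that online-softmax accumulation (tile size and shapes are illustrative; the real kernel operates on blocks of queries in SRAM):

```python
import numpy as np

def tiled_attention_1q(q, K, V, block=4):
    """Online-softmax accumulation over key/value tiles for one query:
    maintain a running max, running denominator, and rescaled output."""
    m = -np.inf                      # running max of scores seen so far
    l = 0.0                          # running softmax denominator
    acc = np.zeros(V.shape[1])       # running weighted sum of values
    for start in range(0, K.shape[0], block):
        s = K[start:start + block] @ q       # scores for this tile
        m_new = max(m, s.max())
        scale = np.exp(m - m_new)            # rescale old accumulators
        p = np.exp(s - m_new)                # unnormalized tile weights
        l = l * scale + p.sum()
        acc = acc * scale + p @ V[start:start + block]
        m = m_new
    return acc / l

rng = np.random.default_rng(0)
q = rng.standard_normal(8)
K = rng.standard_normal((16, 8))
V = rng.standard_normal((16, 8))
s = K @ q
ref = (np.exp(s - s.max()) / np.exp(s - s.max()).sum()) @ V
assert np.allclose(tiled_attention_1q(q, K, V), ref)
```

Because only one tile of scores lives in memory at a time, memory traffic scales with sequence length rather than its square.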
Published:
Instruction tuning has been shown to improve zero-shot capabilities in LLMs. This work extends the idea to multi-modal models: the authors augment existing visual question-answering prompts using GPT-4 and train an image-conditioned language generation model. The architecture consists of a pre-trained large language model (LLaMA) and a vision encoder (CLIP).
Published:
SAM (Segment Anything Model) is a foundation model for promptable image segmentation — using a ViT encoder pretrained with MAE, a prompt encoder for points/boxes/masks, and an ambiguity-aware decoder that outputs multiple candidate masks, trained on 1.1B masks for zero-shot transfer.
Published:
The paper’s primary contribution is leveraging a pre-trained language model to build a multimodal model, which the authors term a ‘visual language model’. It introduces two main innovations. First, it handles interleaved images and texts of variable lengths, which matters because interleaving is the structure of most data available online. Second, the authors aim for a model that can tackle a new task with just a few examples or instructions; previous methods fell short in such low-data scenarios, making this approach pioneering in its focus on few-shot learning.
Published:
Training large models requires far more memory than just storing weights — a 1.5B parameter model needs ~3GB for weights but ~24GB total due to Adam optimizer states (momentum, variance) and gradients. ZeRO (Zero Redundancy Optimizer) solves this by partitioning optimizer states, gradients, and parameters across GPUs — eliminating redundant storage while maintaining data parallelism, unlike traditional approaches that either replicate everything (DP) or suffer from low utilization (MP/PP). The three ZeRO stages progressively reduce memory: ZeRO-1 partitions optimizer states, ZeRO-2 adds gradient partitioning, and ZeRO-3 partitions parameters too, with communication overhead comparable to standard data parallelism.
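The 3GB-vs-24GB gap above follows from mixed-precision Adam's per-parameter footprint: 2 bytes of fp16 weights, 2 bytes of fp16 gradients, and 12 bytes of fp32 optimizer state (master weights, momentum, variance), or 16 bytes per parameter in total. A small calculator for the per-GPU cost under each ZeRO stage (a simplified model that ignores activations and buffers):

```python
def zero_memory_gb(params_billions, n_gpus, stage):
    """Approximate per-GPU model-state memory (GB) for mixed-precision
    Adam. Per parameter: 2 B fp16 weights + 2 B fp16 grads + 12 B fp32
    optimizer states. ZeRO-1/2/3 partition states/grads/weights across GPUs."""
    P = params_billions * 1e9
    weights, grads, opt = 2 * P, 2 * P, 12 * P
    if stage >= 1:
        opt /= n_gpus       # ZeRO-1: partition optimizer states
    if stage >= 2:
        grads /= n_gpus     # ZeRO-2: also partition gradients
    if stage >= 3:
        weights /= n_gpus   # ZeRO-3: also partition parameters
    return (weights + grads + opt) / 1e9

print(zero_memory_gb(1.5, 1, 0))  # ~24 GB: the single-GPU figure above
print(zero_memory_gb(1.5, 8, 2))  # ~5.6 GB per GPU with ZeRO-2 on 8 GPUs
```

At stage 0 the fp16 weights alone are 2 bytes x 1.5B = 3GB, matching the summary, while the full 16 bytes per parameter gives the ~24GB total.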
Published in Journal 1, 2009
This paper is about the number 1. The number 2 is left for future work.
Recommended citation: Your Name, You. (2009). "Paper Title Number 1." Journal 1. 1(1). http://academicpages.github.io/files/paper1.pdf
Published in Journal 1, 2010
This paper is about the number 2. The number 3 is left for future work.
Recommended citation: Your Name, You. (2010). "Paper Title Number 2." Journal 1. 1(2). http://academicpages.github.io/files/paper2.pdf
Published in Journal 1, 2015
This paper is about the number 3. The number 4 is left for future work.
Recommended citation: Your Name, You. (2015). "Paper Title Number 3." Journal 1. 1(3). http://academicpages.github.io/files/paper3.pdf
Published:
This is a description of your talk, which is a markdown file that can be all markdown-ified like any other post. Yay markdown!
Published:
This is a description of your conference proceedings talk, note the different field in type. You can put anything in this field.
Undergraduate course, University 1, Department, 2014
This is a description of a teaching experience. You can use markdown like any other post.
Workshop, University 1, Department, 2015
This is a description of a teaching experience. You can use markdown like any other post.