Regression
Predicting continuous values from input features.
Clustering
Grouping similar data points without labels (e.g., k-means, fuzzy clustering)
Principal Component Analysis
Dimensionality reduction via orthogonal transformation
Decision Trees
Hierarchical decision rules for classification
Random Forests
Ensemble bagging of decision trees
Mean Squared Error
L2 loss function for regression tasks
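A minimal NumPy sketch of the L2 loss (illustrative only; the function name mse is my own):
```python
import numpy as np

def mse(y_true, y_pred):
    """L2 loss: mean of squared residuals between targets and predictions."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return np.mean((y_true - y_pred) ** 2)

print(mse([3.0, 5.0], [2.5, 5.5]))  # 0.25
```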
Cross Entropy Loss
Measuring the divergence between probability distributions, commonly used as the training loss for classifiers and LLMs
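A hedged sketch of discrete cross-entropy, assuming a one-hot target p and a predicted probability vector q:
```python
import numpy as np

def cross_entropy(p, q, eps=1e-12):
    """H(p, q) = -sum_i p_i * log q_i between target p and prediction q."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return -np.sum(p * np.log(q + eps))  # eps guards against log(0)

print(cross_entropy([0, 1, 0], [0.1, 0.7, 0.2]))  # -log(0.7) ~ 0.357
```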
Feature Normalization
Scaling inputs to stabilize training
One-hot Encoding
Representing categorical variables as binary vectors
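A small sketch, assuming integer class labels in [0, num_classes):
```python
import numpy as np

def one_hot(labels, num_classes):
    """Map integer class labels to binary indicator vectors."""
    out = np.zeros((len(labels), num_classes), dtype=np.float32)
    out[np.arange(len(labels)), labels] = 1.0
    return out

print(one_hot([0, 2, 1], num_classes=3))
# [[1. 0. 0.]
#  [0. 0. 1.]
#  [0. 1. 0.]]
```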
Supervised Learning
Learning mapping from labeled data
Unsupervised Learning
Finding patterns in unlabeled data
Feed Forward Neural Networks
Type of NN in which information flows in a single direction, with inputs multiplied by weights layer by layer to produce outputs
Convolutional Neural Networks
Type of NN that processes images as spatial grids with convolutional filters, enabling spatial understanding
Activation Functions
Introducing non-linearity to neural networks
N-gram Models
Statistical language model that predicts the probability of a word (or symbol) based on the preceding n-1 words
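A toy bigram (n = 2) sketch using maximum-likelihood counts; the helper name bigram_probs is illustrative:
```python
from collections import Counter

def bigram_probs(tokens):
    """Estimate P(w_t | w_{t-1}) from raw counts."""
    unigrams = Counter(tokens[:-1])
    bigrams = Counter(zip(tokens[:-1], tokens[1:]))
    return {(a, b): c / unigrams[a] for (a, b), c in bigrams.items()}

tokens = "the cat sat on the mat".split()
print(bigram_probs(tokens)[("the", "cat")])  # 0.5: 'the' is followed by 'cat' in 1 of its 2 occurrences
```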
Recurrent Neural Networks
Processing sequential data with internal state
Optimizers
Algorithms that update model weights to reduce the loss and improve accuracy (e.g., SGD, Adam)
Regularization
Techniques to prevent overfitting (L1, L2, Dropout)
Learning Rate Schedulers
Adjusting learning rate during training
Gradient Clipping
Limiting gradient magnitude to prevent explosions
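A sketch of clipping by global norm, written in plain NumPy rather than any particular framework's API:
```python
import numpy as np

def clip_by_global_norm(grads, max_norm=1.0):
    """Rescale all gradients jointly if their combined L2 norm exceeds max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    scale = min(1.0, max_norm / (total_norm + 1e-6))
    return [g * scale for g in grads]

grads = [np.array([3.0, 4.0])]       # global norm 5
print(clip_by_global_norm(grads))    # ~[0.6, 0.8], norm clipped to ~1
```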
Class Imbalance Handling
Techniques like SMOTE or weighted loss
Forward Noising & Denoising
Corrupting images with gradual noising steps and restoring them, teaching diffusion models how to generate images
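A sketch of the closed-form DDPM forward (noising) step under a linear beta schedule; the variable names are my own:
```python
import numpy as np

def forward_noise(x0, t, alpha_bar):
    """x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps, with eps ~ N(0, I)."""
    eps = np.random.randn(*x0.shape)
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return xt, eps  # the denoiser is trained to predict eps from (xt, t)

betas = np.linspace(1e-4, 0.02, 1000)   # linear noise schedule
alpha_bar = np.cumprod(1.0 - betas)
x0 = np.random.randn(8, 8)              # toy "image"
xt, eps = forward_noise(x0, t=500, alpha_bar=alpha_bar)
```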
DDPM Formulation
Denoising Diffusion Probabilistic Models: models that learn to reverse a gradual noise-corruption process
Stochastic Gradient Descent
Iterative optimization using mini-batches
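A minimal mini-batch SGD loop for linear regression with MSE loss (toy data, illustrative only):
```python
import numpy as np

def sgd_step(w, X_batch, y_batch, lr=0.1):
    """One mini-batch update: w <- w - lr * grad of MSE w.r.t. w."""
    grad = 2.0 * X_batch.T @ (X_batch @ w - y_batch) / len(y_batch)
    return w - lr * grad

rng = np.random.default_rng(0)
X, y = rng.normal(size=(128, 3)), rng.normal(size=128)
w = np.zeros(3)
for i in range(0, len(X), 32):          # sweep over mini-batches
    w = sgd_step(w, X[i:i + 32], y[i:i + 32])
```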
The No Free Lunch Theorem
No single algorithm works best for all problems. An algorithm's success is tied to the problem's specifics: averaged across all possible problems, one that excels on some datasets must perform correspondingly worse on others
Empirical Risk Minimization
Minimizing error on the training set
Vapnik-Chervonenkis Dimension
Measuring the capacity of a classification algorithm
Rademacher Complexity
Measuring the richness of a class of functions
Double Descent
Model's error rate on the test set initially decreases with the number of parameters, then peaks, then decreases again
Jacobian & Hessian Matrices
First and second-order partial derivatives
Gradient Noise Scale
Predicts the largest useful batch size before gradient noise erodes data efficiency
Vanishing & Exploding Gradients
Instability in deep-network backpropagation where gradient magnitudes shrink toward zero or grow uncontrollably
Activation Saturation Effects
Neurons getting stuck at asymptotic values
Convexity & Smoothness
Loss-landscape properties that guarantee a global minimum (minimal loss) is reachable
Stability of Tikhonov Regularization
L2 regularization for ill-posed problems (problems where a small change in input data causes a massive change in the output)
Polyak–Łojasiewicz (PL) Condition
Gradient dominance for faster convergence (ensures global linear convergence without requiring objective function to be convex)
Saddle Points vs Local Minima
Why saddle points, not local minima, are the real problem in high dimensions
EMA of Weights
Exponential Moving Average (technique where a mirror set of model parameters is maintained by keeping a running average of the training weights) for stable inference
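A sketch of the EMA update rule applied after each training step; a list-of-arrays parameter representation is assumed for simplicity:
```python
def ema_update(ema_params, model_params, decay=0.999):
    """Shadow weights: ema <- decay * ema + (1 - decay) * current training weights."""
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema_params, model_params)]

# after every optimizer step: ema_params = ema_update(ema_params, model_params)
# at inference time, load ema_params instead of the raw training weights
```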
Mixed Precision Diffusion Training
FP16/FP32 hybrid for VRAM efficiency and faster inference in Diffusion Transformers
Noise Level Conditioning
Feeding the current noise magnitude into the neural network, to guide how much denoising is needed at each step
Lottery Ticket Hypothesis
Proposes that large, randomly initialized neural networks contain small subnetworks, called "winning tickets," that can achieve the same accuracy as the full network if trained in isolation with their original initializations
Fisher Information Matrix
Used to calculate the covariance matrices associated with maximum-likelihood estimates.
Neural Tangent Kernel
Infinite-width networks behave like linear models
Universal Approximation Theorem
NNs can approximate any continuous function
Mean-field theory of Neural Networks
Statistical physics approach to large networks (Law of large numbers)
Memorization–Generalization Paradox
Deep models memorize noise yet generalize well
Latent Space Scaling
The autoencoder's latent encodings are multiplied by this factor before being fed into the U-Net
Generalization at Higher Dimensions
Large nets memorize, yet still generalize in higher dimensions
Attention as Kernel Mechanism
Smoothing via similarity kernels
Dot Product Attention Geometry
Cosine similarity in high-dimensional space
Self Attention
Elements of a sequence attending to one another within the same sequence
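A single-head scaled dot-product self-attention sketch in NumPy, assuming projection matrices Wq, Wk, Wv are given:
```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """X has shape (seq_len, d_model); every token attends to every token."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # pairwise similarities, scaled
    return softmax(scores) @ V                # convex mix of value vectors

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 16))                  # 5 tokens, d_model = 16
Wq, Wk, Wv = (rng.normal(size=(16, 16)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)           # shape (5, 16)
```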
Cross Attention
Attending to context from an encoder or another modality (e.g., image to text, text to image)
Attention as LoRA
Brings negative attention to self-attention modules and learns low-rank attention weights directly, capturing the characteristics of downstream tasks
Encoder-Decoder Architecture
Performs sequence-to-sequence tasks, using an encoder to read the input sequence and a decoder to generate the output
Residuals as Gradient Highways
Skip connections mitigate the vanishing-gradient problem by giving gradients a direct path through the network
Positional Encoding
Technique that adds information about each token's position in the sequence to the input embeddings, injecting order into permutation-invariant attention
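A sketch of the fixed sinusoidal encoding from the original Transformer paper (even dimensions sine, odd dimensions cosine):
```python
import numpy as np

def sinusoidal_pe(seq_len, d_model):
    """Fixed sin/cos positional encodings, added to the token embeddings."""
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d_model // 2)[None, :]
    angles = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

embeddings = np.random.randn(128, 512)
embeddings = embeddings + sinusoidal_pe(128, 512)   # inject order information
```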
Learned Positional Embeddings
Position information is encoded as a trainable parameter (embedding vector) rather than a fixed, predefined function
Large Language Models
Transformers scaled up and trained on massive text corpora
Transformer Expressivity
Are Transformers really universal approximators?
Universal Approximation of Sequences
Transformers as universal sequence approximators
Inductive Bias of Self-Attention
Modeling relationships between tokens regardless of their distance in the sequence
Reparameterization Equivalence
Different architectures yielding same function space
Probability Flow ODE
Deterministic sampling in diffusion models
RoPE
Encodes the absolute position with a rotation matrix and provides relative position dependency in self-attention formulation
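A sketch of RoPE using the interleaved-pair convention; production implementations differ in layout and caching, so treat this as illustrative:
```python
import numpy as np

def rope(x, base=10000.0):
    """Rotate consecutive feature pairs of x (seq_len, d) by position-dependent angles."""
    seq_len, d = x.shape
    pos = np.arange(seq_len)[:, None]
    freqs = base ** (-np.arange(0, d, 2) / d)   # one frequency per feature pair
    theta = pos * freqs                         # (seq_len, d/2) rotation angles
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * np.cos(theta) - x2 * np.sin(theta)
    out[:, 1::2] = x1 * np.sin(theta) + x2 * np.cos(theta)
    return out

q = rope(np.random.randn(10, 64))   # applied to queries and keys before attention
```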
Byte Pair Tokenizer
Hybrid subword tokenization method that iteratively merges the most frequent pairs of adjacent characters or bytes into new, larger tokens
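A toy sketch of the BPE merge loop on a character-level corpus; real tokenizers add byte-level fallback, special tokens, and stored merge tables:
```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across a corpus of tokenized words."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1]); i += 2
            else:
                out.append(symbols[i]); i += 1
        merged[tuple(out)] = freq
    return merged

# toy corpus: word (pre-split into characters) -> frequency
words = {tuple("lower"): 2, tuple("lowest"): 1, tuple("newer"): 3}
for _ in range(3):                           # learn 3 merges
    words = merge_pair(words, most_frequent_pair(words))
print(list(words))
```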
Transformer Grokking
Delayed generalization after overfitting
Instruction Level Parallelism
Hardware execution of multiple instructions
Length Generalization
Extrapolating beyond training context window
QKV Projection Fusion
Merging Q,K,V matrix multiplications for speed
Flash Attention
IO-aware exact attention algorithm that uses tiling to reduce the number of memory reads/writes between HBM and GPU on-chip SRAM.
KV Cache Layout
Memory organization for fast decoding
Kernel Fusion Techniques
Combining GPU kernels to reduce overhead
Kernel Tiling Strategies
Optimizing data movement to shared memory
Parallelism Strategies
Data, Tensor, Pipeline, and Sequence parallelism for MultiGPU setups
CUDA Graph Captures
Reducing CPU launch overhead for GPU kernels
Block-sparse Attention
Skipping computation on empty attention blocks
PTX Control Optimization
Low-level assembly tuning for GPUs
Paged Attention
OS-style virtual memory for KV cache
Hardware Specific Languages
Generalized tiled programming models for more efficient AI kernel programming
Flash Multi-head FNNs
I/O-aware fused kernel computing outputs online in SRAM akin to FlashAttention, and a design using dynamically weighted parallel sub-networks to maintain a balanced ratio between intermediate and head dimensions
Gated Attention Mechanism to escape Attention Sinks
Applying a head-specific sigmoid gate after the scaled dot-product attention consistently improves performance
GRPO for Math Performance in Dense LLMs
How does GRPO plus an MoE auxiliary loss enable math solving in LLMs?
Why Cosine Schedule Works Better
Smooth decay matches loss landscape better
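A sketch of a warmup-plus-cosine learning-rate schedule; the specific lr_max, lr_min, and warmup values are placeholders:
```python
import math

def cosine_lr(step, total_steps, lr_max=3e-4, lr_min=3e-5, warmup=100):
    """Linear warmup, then cosine decay from lr_max down to lr_min."""
    if step < warmup:
        return lr_max * step / warmup
    progress = (step - warmup) / max(1, total_steps - warmup)
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))

print(cosine_lr(0, 1000), cosine_lr(100, 1000), cosine_lr(1000, 1000))
# 0.0 (start of warmup), 3e-04 (peak), 3e-05 (floor)
```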
Fused Timestep Embedding Kernels
Optimizing diffusion noise injection
Style Alignment via Shared Attention
Enables style alignment across generated images by sharing (leaking) attention values
Optimal Scaling Laws
Chinchilla: trade-off between params and data
Diffusion Language Models
Generating text via continuous diffusion
Test Time Scaling
Trading compute for accuracy during inference
Free Transformers
Extension of the decoder Transformer that conditions its generative process on random latent variables, learned without supervision via a variational procedure
Mixture of Experts
Sparse activation of model sub-components
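A toy top-k routed MoE forward pass; the experts are simple linear maps here, and names like moe_forward are my own:
```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def moe_forward(x, router_w, experts, k=2):
    """Route each token to its top-k experts and mix their outputs by gate weight."""
    logits = x @ router_w                         # (tokens, n_experts)
    topk = np.argsort(logits, axis=-1)[:, -k:]    # indices of the k best experts per token
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        gates = softmax(logits[t, topk[t]])       # renormalize over the selected k
        for g, e in zip(gates, topk[t]):
            out[t] += g * experts[e](x[t])        # sparse: only k experts run per token
    return out

rng = np.random.default_rng(0)
d, n_experts = 8, 4
experts = [lambda v, W=rng.normal(size=(d, d)): v @ W for _ in range(n_experts)]
x = rng.normal(size=(3, d))
y = moe_forward(x, rng.normal(size=(d, n_experts)), experts, k=2)
```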
RLVR & RLHF
Reinforcement Learning with Verifiable Rewards / from Human Feedback
Sparse Expert Load Balancing
Load-balancing loss that preserves token-wise relational structure, encouraging consistent expert choices for similar inputs during training
State Space Models
Mamba/S4: Linear time sequence modeling
Selective Scan Kernels
The parallel prefix sum algorithm for SSMs
Native Sparse Attention
Learning sparsity patterns directly for efficient long context modelling
Compute-Optimal Context Length
Balancing sequence length with model width
FP16 vs BF16 in RL Stability
Why FP16 is more stable than BF16 for training LLMs with RL
How LLMs are Injective & Invertible?
Examines whether non-linear activations and normalization, which are inherently non-injective, let different inputs map to the same output and prevent exact recovery of the input from a model's representations
Attention Sinks from Graph Perspective
Token nodes acting as information absorbers
Reasoning Stability in Short CoT
Why short CoT ensures stable reasoning in complex processes
Squeezed Diffusion Models
Quantum squeezed states redistribute uncertainty according to the Heisenberg uncertainty principle; here noise is scaled anisotropically along the principal component of the training distribution
Terminal Velocity Matching
Generalization of flow matching that enables high-fidelity one- and few-step generative modeling
Why mask diffusion does not work
Why mask diffusion struggles to achieve parallel generation and bidirectional attention
How much do LLMs memorize
Estimating how much a model “knows” about a datapoint
Normalization Free Transformers
Transformers without normalization can achieve the same or better performance using Dynamic Tanh (DyT)
ARC is a Vision Problem
Achieves higher accuracy in ARC by framing it as an image-to-image translation problem
How LLMs Use Their Depth?
Explains how LLMs internally structure their computations to make predictions
Why INT8 in SageAttention is Better?
8-Bit Attention for Plug-and-play Inference Acceleration
Sinkhorn-Normalized Quantization in LLMs
Uses a fast Sinkhorn–Knopp-style algorithm that finds scales to normalize per-row and per-column variances, thereby minimizing a novel per-matrix proxy target for quantization
Large Concept Models
Assumes that a concept corresponds to a sentence and operates in an existing sentence-embedding space
Rectified Flow Transformers
Uses rectified flow (connecting noise to data using straight-line trajectories) with a Transformer-based architecture
Not all bits are equal
Why models with an effective size below 8-bit 4B parameters achieve better accuracy by allocating memory to more weights rather than to longer generations, while larger models achieve better accuracy by allocating memory to longer generations
MxFP4 vs NVFP4 Training
Micro-exponent formats for extreme quantization and stability in LLM pre-training
1-bit Transformer Scaling
BitNet and the era of ternary weights
Mixture Block Attention
Applies the principles of MoE to the attention mechanism to transition between full and sparse attention
Why Transformers are Bad at Math?
Why the model converges to a local optimum that lacks the long-range dependencies required for multiplication
Kimi Linear Attention & Hardware aware chunking
Expressive linear attention module that extends Gated DeltaNet with a finer-grained gating mechanism
Gradient Low Rank Projection Optimizers
Projecting gradients into lower rank to save memory
Distributed LLM Training with DiLoCo
Using DiLoCo to train LLMs across poorly connected distributed devices
Why Looped Transformers are good at algorithms?
The question says it all
Positional Integrity Encoding for rapid KV cache edit
Rapid KV cache editing technique for Large code LLMs
The Continual Learning Problem
Investigate whether sparse parameter updates can enable learning without catastrophic forgetting
Spherical Equivariant Graph Transformers
3D molecule modeling with symmetry preservation
Learned Score Field Geometry
Diffusion models of data in general non-Euclidean geometries
Schrödinger Bridge Interpretation
Explains the issues with DSB in complex data generation
How diffusion models memorize?
The heading says it all
Bias from Finite Timesteps
Observes that maximum likelihood training consistently improves the likelihood of score-based diffusion models across multiple datasets and architectures
Semantic Manifolds in Diffusion Trajectories
How Riemannian geometry maps between the latent space and intermediate feature maps, revealing semantic axes and curved manifold structure in diffusion trajectories
Why Semantics Appear Late?
Shows why semantic information transfer peaks at intermediate timesteps and vanishes near both the beginning and end of the process.
Bad data lead to Good Models
Explores the possibility that pre-training on more toxic data can lead to better control in post-training, ultimately decreasing a model's output toxicity
Topological Deep Learning
Deep learning to handle complex, non-Euclidean data structures
Infini-gram
Engine that efficiently processes n-gram queries with unbounded n over massive trillion-token corpora
XLA Compiler Techniques
Accelerated Linear Algebra compiler techniques for speeding up ML models on GPUs and other accelerators
Manifold Learning
Explains the set of methods to find the low dimensional structure of data
Universal Weight Subspace Hypothesis
Do all models converge to the same low-dimensional weight subspace?
Holographic Transformers
Encoding sequences in complex associative memory using neuro-symbolic techniques
GSPO for RL Training in MoEs
Stable RL training algorithms to train MoEs
Sequence Objective as First Order Approximation
Explains under what conditions the true sequence-level reward can be optimized via a surrogate token-level objective in policy gradient methods
LLMs Can Get Brain Rot
Continual exposure to junk web text induces lasting cognitive decline in large language models
Representation Geometry Manifolds
Treats the data space of diffusion models as a Riemannian manifold with a score-derived metric
Alignment as an Optimization Artifact
Is the language objective just a local minimum?
Causal Emergence in Representations
Shows how neural representations can align with high-level causal variables through causal abstraction experiments
Warp Divergence from Timesteps
GPU thread inefficiency in conditional generation
Conditioning as Geometry Deformation
Conditioning is done via projective/geometric transformations of the points and features
Riemannian Optimization for Variables on Curved Spaces
Gradient descent on non-Euclidean manifolds