The ML Trench

by Beens & Nami

Base: Fundamentals
Level 2: Dynamics
Level 3: Deep Nets
Level 4: Transformers
Level 5: Systems
Level 6: Scaling
Level 7: Frontier
Level 8: The Void
Depth scale: 0m – 16800m
Regression

Regression

Predicting continuous values from input features.

fundamentals Read Paper ↗
Clustering Mechanism

Clustering Mechanism

Grouping similar data points without labels (e.g., k-means, fuzzy clustering)

unsupervised Read Paper ↗
Principal Component Analysis

Principal Component Analysis

Dimensionality reduction via orthogonal transformation

feature-engineering Read Paper ↗
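
As a concrete illustration, a minimal NumPy sketch of PCA via SVD on mean-centered data (illustrative only, not from the linked paper; k is the number of components kept):

import numpy as np

def pca(X, k):
    # Center the data so the principal axes pass through the origin.
    Xc = X - X.mean(axis=0)
    # SVD of the centered data: rows of Vt are the orthogonal principal directions.
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:k]                 # (k, n_features)
    projected = Xc @ components.T       # (n_samples, k) reduced representation
    explained_var = (S[:k] ** 2) / (len(X) - 1)
    return projected, components, explained_var

Z, W, var = pca(np.random.randn(200, 10), k=2)
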
Decision Trees

Decision Trees

Hierarchical decision rules for classification

classical-ml Read Paper ↗
Random Forests

Random Forests

Ensemble bagging of decision trees

classical-ml Read Paper ↗
Mean Squared Error

Mean Squared Error

L2 loss function for regression tasks

optimization Read Paper ↗
Cross Entropy Loss

Cross Entropy Loss

Measuring the divergence between probability distributions; commonly used as the training objective for dense LLMs

optimization Read Paper ↗
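
A hedged NumPy sketch of softmax cross-entropy between logits and integer class labels (the usual next-token loss in language-model training; names are illustrative):

import numpy as np

def cross_entropy(logits, targets):
    # logits: (batch, num_classes); targets: (batch,) integer class indices
    logits = logits - logits.max(axis=1, keepdims=True)               # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

loss = cross_entropy(np.random.randn(4, 10), np.array([1, 3, 0, 7]))
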
Feature Normalization

Feature Normalization

Scaling inputs to stabilize training

preprocessing Read Paper ↗
One-hot Encoding

One-hot Encoding

Representing categorical variables as binary vectors

preprocessing Read Paper ↗
Supervised Learning

Supervised Learning

Learning a mapping from labeled data

Unsupervised Learning

Unsupervised Learning

Finding patterns in unlabeled data

Feed Forward Neural Networks

Feed Forward Neural Networks

A network in which information flows in a single direction: inputs are multiplied by weights layer by layer to produce outputs

architecture Read Paper ↗
Convolutional Neural Networks

Convolutional Neural Networks

A network that processes images with filters applied over local grids, enabling spatial understanding

architecture Read Paper ↗
Activation Functions

Activation Functions

Introducing non-linearity to neural networks

fundamentals Read Paper ↗
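
Common activations as NumPy one-liners (a sketch; each bends the otherwise linear mapping of a layer):

import numpy as np

relu    = lambda x: np.maximum(0.0, x)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
tanh    = np.tanh
gelu    = lambda x: 0.5 * x * (1.0 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))  # tanh approximation
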
N-gram Models

N-gram Models

Statistical language model that predicts the probability of a word (or symbol) based on the preceding n-1 words
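
A toy bigram (n = 2) model, counting adjacent word pairs and normalizing to estimate P(next word | previous word); purely illustrative:

from collections import Counter, defaultdict

def train_bigram(tokens):
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    # Normalize counts into conditional probabilities P(next | prev).
    return {p: {w: c / sum(ctr.values()) for w, c in ctr.items()} for p, ctr in counts.items()}

model = train_bigram("the cat sat on the mat".split())
print(model["the"])   # {'cat': 0.5, 'mat': 0.5}
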

Recurrent Neural Networks

Recurrent Neural Networks

Processing sequential data with internal state

architecture Read Paper ↗
Optimizers

Optimizers

Updates model weights to reduce error and improve accuracy (e.g., SGD, Adam)

optimization Read Paper ↗
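
A minimal sketch of the two update rules named above, vanilla SGD and Adam, over NumPy arrays (hyperparameters are illustrative defaults, not prescriptions):

import numpy as np

def sgd_step(w, grad, lr=1e-2):
    return w - lr * grad

def adam_step(w, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    # t starts at 1; m and v are running moment estimates of the gradient.
    m = b1 * m + (1 - b1) * grad              # first moment (mean)
    v = b2 * v + (1 - b2) * grad**2           # second moment (uncentered variance)
    m_hat = m / (1 - b1**t)                   # bias correction
    v_hat = v / (1 - b2**t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v
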
Regularization

Regularization

Techniques to prevent overfitting (L1, L2, Dropout)

Learning Rate Schedulers

Learning Rate Schedulers

Adjusting learning rate during training

Gradient Clipping

Gradient Clipping

Limiting gradient magnitude to prevent explosions
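
A sketch of clipping by global norm (the same idea implemented by torch.nn.utils.clip_grad_norm_), written here in plain NumPy:

import numpy as np

def clip_by_global_norm(grads, max_norm=1.0):
    # grads: list of NumPy arrays; rescale all of them if their joint norm is too large.
    total_norm = np.sqrt(sum(float((g ** 2).sum()) for g in grads))
    scale = min(1.0, max_norm / (total_norm + 1e-6))
    return [g * scale for g in grads]
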

Class Imbalance Handling

Class Imbalance Handling

Techniques like SMOTE or weighted loss

preprocessing Read Paper ↗
Forward Noising & Denoising

Forward Noising & Denoising

Corrupting images with gradual noising steps and restoring them, teaching diffusion models how to generate images

generative Read Paper ↗
DDPM Formulation

DDPM Formulation

Denoising Diffusion Probabilistic Models: models that learn to reverse a gradual corruption (noising) process

generative Read Paper ↗
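
A hedged sketch of the closed-form DDPM forward (noising) step, q(x_t | x_0) = N(sqrt(abar_t) x_0, (1 - abar_t) I), with a simple linear beta schedule (values illustrative):

import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)               # noise schedule
alpha_bar = np.cumprod(1.0 - betas)              # abar_t = product of (1 - beta_s)

def q_sample(x0, t, rng=np.random.default_rng()):
    # Jump straight from clean data x0 to its noised version x_t using the closed form.
    noise = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * noise, noise

x_t, eps = q_sample(np.random.randn(3, 32, 32), t=500)
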
Stochastic Gradient Descent

Stochastic Gradient Descent

Iterative optimization using mini-batches

optimization Read Paper ↗
The No Free Lunch Theorem

The No Free Lunch Theorem

No single algorithm works best for all problems: averaged across all possible problems every algorithm performs the same, so one that excels on some datasets must underperform on others

Empirical Risk Minimization

Empirical Risk Minimization

Minimizing error on the training set

Vapnik-Chervonenkis Dimension

Vapnik-Chervonenkis Dimension

Measuring the capacity of a classification algorithm

Rademacher Complexity

Rademacher Complexity

Measuring the richness of a class of functions

Double Descent

Double Descent

A model's test error initially decreases with the number of parameters, then peaks around the interpolation threshold, then decreases again

Jacobian & Hessian Matrices

Jacobian & Hessian Matrices

First and second-order partial derivatives

Gradient Noise Scale

Gradient Noise Scale

Predicts the critical batch size beyond which larger batches stop improving data efficiency

Vanishing & Exploding Gradients

Vanishing & Exploding Gradients

Instability in deep-network backpropagation where gradient magnitudes shrink toward zero or grow without bound

Activation Saturation Effects

Activation Saturation Effects

Neurons getting stuck at asymptotic values

Convexity & Smoothness

Convexity & Smoothness

Properties ensuring that a global minimum (minimal loss) is reachable by gradient methods

Stability of Tikhonov Regularization

Stability of Tikhonov Regularization

L2 regularization for ill-posed problems (problems where a small change in input data causes a massive change in the output)

Polyak–Łojasiewicz (PL) Condition

Polyak–Łojasiewicz (PL) Condition

Gradient dominance for faster convergence (ensures global linear convergence without requiring objective function to be convex)

Saddle Points vs Local Minima

Saddle Points vs Local Minima

Why are saddle points, rather than local minima, the real problem in high dimensions?

EMA of Weights

EMA of Weights

Exponential Moving Average (technique where a mirror set of model parameters is maintained by keeping a running average of the training weights) for stable inference
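
A minimal EMA-of-weights sketch: a shadow copy is nudged toward the live training weights after every step and used at inference (names illustrative):

def ema_update(shadow, weights, decay=0.999):
    # shadow, weights: dicts of name -> array; the shadow trails the training weights.
    for name, w in weights.items():
        shadow[name] = decay * shadow[name] + (1.0 - decay) * w
    return shadow
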

Mixed Precision Diffusion Training

Mixed Precision Diffusion Training

FP16/FP32 hybrid for VRAM efficiency and faster inference in Diffusion Transformers

Noise Level Conditioning

Noise Level Conditioning

Feeding the current noise magnitude into the neural network, to guide how much denoising is needed at each step

generative Read Paper ↗
Lottery Ticket Hypothesis

Lottery Ticket Hypothesis

Proposes that large, randomly initialized neural networks contain small subnetworks, called "winning tickets," that can achieve the same accuracy as the full network if trained in isolation with their original initializations

Fisher Information Matrix

Fisher Information Matrix

Used to calculate the covariance matrices associated with maximum-likelihood estimates.

Neural Tangent Kernel

Neural Tangent Kernel

Infinite-width networks behave like linear models

Universal Approximation Theorem

Universal Approximation Theorem

NNs can approximate any continuous function

Mean-field theory of Neural Networks

Mean-field theory of Neural Networks

Statistical physics approach to large networks (Law of large numbers)

Memorization–Generalization Paradox

Memorization–Generalization Paradox

Deep models memorize noise yet generalize well

Latent Space Scaling

Latent Space Scaling

The autoencoder's latent encodings are scaled by this factor before being fed into the U-Net

Generalization at Higher Dimensions

Generalization at Higher Dimensions

Large nets memorize, but still generalize at higher dimensions

Attention as Kernel Mechanism

Attention as Kernel Mechanism

Smoothing via similarity kernels

Dot Product Attention Geometry

Dot Product Attention Geometry

Cosine similarity in high-dimensional space

Self Attention

Self Attention

Sequence elements attending to themselves
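
A single-head scaled dot-product self-attention sketch in NumPy; the Wq, Wk, Wv matrices stand in for learned projection weights:

import numpy as np

def self_attention(x, Wq, Wk, Wv):
    # x: (seq_len, d_model); each token attends to every token of the same sequence.
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])                     # (seq_len, seq_len)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)              # softmax over keys
    return weights @ V

d = 16
x = np.random.randn(8, d)
out = self_attention(x, *[np.random.randn(d, d) for _ in range(3)])
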

Cross Attention

Cross Attention

Attending to context from encoder or other modality (such as image to text, text to image etc...)

Attention as LoRA

Attention as LoRA

Brings negative attention to self-attention modules and learns low-rank attention weights directly, capturing the characteristics of downstream tasks

optimization Read Paper ↗
Encoder-Decoder Architecture

Encoder-Decoder Architecture

Performs sequence-to-sequence tasks, using an encoder to read the input sequence and a decoder to generate the output

architecture Read Paper ↗
Residuals as Gradient Highways

Residuals as Gradient Highways

Skip connections prevent the vanishing-gradient problem

architecture Read Paper ↗
Positional Encoding

Positional Encoding

Technique that adds information about each token's position in the sequence to the input embeddings, injecting order into permutation-invariant attention
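
The standard sinusoidal positional-encoding table from "Attention Is All You Need", as a NumPy sketch; the result is added to the token embeddings:

import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    pos = np.arange(seq_len)[:, None]                   # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]                 # (1, d_model/2)
    angles = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                         # even dimensions
    pe[:, 1::2] = np.cos(angles)                         # odd dimensions
    return pe

embeddings = np.random.randn(128, 512) + sinusoidal_positional_encoding(128, 512)
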

Learned Positional Embeddings

Learned Positional Embeddings

Position information is encoded as a trainable parameter (embedding vector) rather than a fixed, predefined function

Large Language Models

Large Language Models

Transformers scaled up and trained on massive corpora

Transformer Expressivity

Transformer Expressivity

Are Transformers really universal approximators?

Universal Approximation of Sequences

Universal Approximation of Sequences

Transformers as universal sequence approximators

Inductive Bias of Self-Attention

Inductive Bias of Self-Attention

Relationships between tokens regardless of distance

Reparameterization Equivalence

Reparameterization Equivalence

Different architectures yielding same function space

Probability Flow ODE

Probability Flow ODE

Deterministic sampling in diffusion models

generative Read Paper ↗
RoPE

RoPE

Encodes the absolute position with a rotation matrix and provides relative position dependency in self-attention formulation
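
A hedged NumPy sketch of RoPE applied to a single query/key vector: consecutive dimension pairs are rotated by a position-dependent angle, so dot products depend only on relative position (this uses one common pairing convention; implementations differ):

import numpy as np

def rope(x, pos, base=10000.0):
    # x: (..., d) with d even; pos: integer position of this token.
    d = x.shape[-1]
    theta = base ** (-np.arange(0, d, 2) / d)        # per-pair rotation frequencies
    angles = pos * theta
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin             # 2-D rotation of each (x1, x2) pair
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

q_rot = rope(np.random.randn(64), pos=7)
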

Byte Pair Tokenizer

Byte Pair Tokenizer

Hybrid subword tokenization method that iteratively merges the most frequent pairs of adjacent characters or bytes into new, larger tokens
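
A toy BPE training loop (illustrative, not the exact byte-level variant used by GPT-2): repeatedly merge the most frequent adjacent symbol pair:

from collections import Counter

def merge_pair(word, a, b):
    out, i = [], 0
    while i < len(word):
        if i + 1 < len(word) and word[i] == a and word[i + 1] == b:
            out.append(a + b); i += 2
        else:
            out.append(word[i]); i += 1
    return out

def bpe_train(words, num_merges=10):
    corpus = [list(w) for w in words]        # start from character-level symbols
    merges = []
    for _ in range(num_merges):
        pairs = Counter(p for w in corpus for p in zip(w, w[1:]))
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]
        merges.append((a, b))
        corpus = [merge_pair(w, a, b) for w in corpus]
    return merges

print(bpe_train(["lower", "lowest", "low", "low"], num_merges=3))
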

Transformer Grokking

Transformer Grokking

Delayed generalization after overfitting

Instruction Level Parallelism

Instruction Level Parallelism

Hardware execution of multiple instructions

Length Generalization

Length Generalization

Extrapolating beyond training context window

QKV Projection Fusion

QKV Projection Fusion

Merging Q,K,V matrix multiplications for speed

Flash Attention

Flash Attention

IO-aware exact attention algorithm that uses tiling to reduce the number of memory reads/writes between HBM and GPU on-chip SRAM.

KV Cache Layout

KV Cache Layout

Memory organization for fast decoding

Kernel Fusion Techniques

Kernel Fusion Techniques

Combining GPU kernels to reduce overhead

Kernel Tiling Strategies

Kernel Tiling Strategies

Optimizing data movement to shared memory

Parallelism Strategies

Parallelism Strategies

Data, Tensor, Pipeline, and Sequence parallelism for MultiGPU setups

CUDA Graph Captures

CUDA Graph Captures

Reducing CPU launch overhead for GPU kernels

Block-sparse Attention

Block-sparse Attention

Skipping computation on empty attention blocks

efficiency Read Paper ↗
PTX Control Optimization

PTX Control Optimization

Low-level assembly tuning for GPUs

Paged Attention

Paged Attention

OS-style virtual memory for KV cache

Hardware Specific Languages

Hardware Specific Languages

Generalized tiled programming model for more efficient AI kernel programming

Flash Multi-head FNNs

Flash Multi-head FNNs

I/O-aware fused kernel computing outputs online in SRAM akin to FlashAttention, and a design using dynamically weighted parallel sub-networks to maintain a balanced ratio between intermediate and head dimensions

Gated Attention Mechanism to escape Attention Sinks

Gated Attention Mechanism to escape Attention Sinks

Applying a head-specific sigmoid gate after scaled dot-product attention consistently improves performance

GRPO for Math Performance in Dense LLMs

GRPO for Math Performance in Dense LLMs

How do GRPO and an MoE auxiliary loss enable math solving in LLMs?

Why Cosine Schedule Works Better

Why Cosine Schedule Works Better

Smooth decay matches loss landscape better
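
A sketch of the cosine learning-rate schedule with linear warmup discussed here (function name and defaults are illustrative):

import math

def cosine_lr(step, total_steps, base_lr=3e-4, warmup_steps=1000, min_lr=3e-5):
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)                     # linear warmup
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))
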

Fused Timestep Embedding Kernels

Fused Timestep Embedding Kernels

Optimizing diffusion noise injection

Style Alignment via SharedAttention

Style Alignment via SharedAttention

Enables style alignment by leaking attention values between generated images

generative Read Paper ↗
Optimal Scaling Laws

Optimal Scaling Laws

Chinchilla: trade-off between params and data

Diffusion Language Models

Diffusion Language Models

Generating text via continuous diffusion

generative Read Paper ↗
Test Time Scaling

Test Time Scaling

Trading compute for accuracy during inference

Free Transformers

Free Transformers

Extension of the decoder Transformer that conditions its generative process on random latent variables, which are learned without supervision using a variational procedure

architecture Read Paper ↗
Mixture of Experts

Mixture of Experts

Sparse activation of model sub-components

architecture Read Paper ↗
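
A hedged sketch of top-k expert routing: a router scores experts per token, only the k best experts run, and their outputs are combined with renormalized router weights (all names and shapes are illustrative):

import numpy as np

def moe_layer(x, router_W, experts, k=2):
    # x: (tokens, d); router_W: (d, num_experts); experts: list of callables (d,) -> (d,)
    logits = x @ router_W                                       # (tokens, num_experts)
    probs = np.exp(logits - logits.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        top = np.argsort(probs[t])[-k:]                         # indices of the k best experts
        gate = probs[t, top] / probs[t, top].sum()              # renormalize over chosen experts
        out[t] = sum(g * experts[e](x[t]) for g, e in zip(gate, top))
    return out

d, n_exp = 8, 4
experts = [(lambda W: (lambda v: v @ W))(np.random.randn(d, d)) for _ in range(n_exp)]
y = moe_layer(np.random.randn(5, d), np.random.randn(d, n_exp), experts)
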
RLVR & RLHF

RLVR & RLHF

Reinforcement Learning with Verifiable Rewards / from Human Feedback

Sparse Expert Load Balancing

Sparse Expert Load Balancing

Load-balancing loss that preserves token-wise relational structure, encouraging consistent expert choices for similar inputs during training

State Space Models

State Space Models

Mamba/S4: Linear time sequence modeling

architecture Read Paper ↗
Selective Scan Kernels

Selective Scan Kernels

The parallel prefix sum algorithm for SSMs

Native Sparse Attention

Native Sparse Attention

Learning sparsity patterns directly for efficient long context modelling

Compute-Optimal Context Length

Compute-Optimal Context Length

Balancing sequence length with model width

FP16 vs BF16 in RL Stability

FP16 vs BF16 in RL Stability

Why FP16 is more stable than BF16 for training LLMs with RL

How LLMs are Injective & Invertible?

How LLMs are Injective & Invertible?

Non-linear activations and normalization are inherently non-injective, suggesting that different inputs could map to the same output and prevent exact recovery of the input from a model's representations

Attention Sinks from Graph Perspective

Attention Sinks from Graph Perspective

Token nodes acting as information absorbers

Reasoning Stability in Short CoT

Reasoning Stability in Short CoT

Why does short CoT ensure stable reasoning in complex processes?

Squeezed Diffusion Models

Squeezed Diffusion Models

Quantum squeezed states redistribute uncertainty according to the Heisenberg uncertainty principle, scaling noise anisotropically along the principal component of the training distribution

generative Read Paper ↗
Terminal Velocity Matching

Terminal Velocity Matching

Generalization of flow matching that enables high-fidelity one- and few-step generative modeling

generative Read Paper ↗
Why mask diffusion does not work

Why mask diffusion does not work

Why does masked diffusion face difficulties in achieving parallel generation and bidirectional attention?

generative Read Paper ↗
How much do LLMs memorize

How much do LLMs memorize

Estimating how much a model “knows” about a datapoint

generative Read Paper ↗
Normalization Free Transformers

Normalization Free Transformers

Transformers without normalization can achieve the same or better performance using Dynamic Tanh (DyT)

architecture Read Paper ↗
ARC is a Vision Problem

ARC is a Vision Problem

Achieves higher accuracy in ARC by framing it as an image-to-image translation problem

How LLMs Use Their Depth?

How LLMs Use Their Depth?

Explains how LLMs internally structure their computations to make predictions

Why INT8 in SageAttention is Better?

Why INT8 in SageAttention is Better?

8-Bit Attention for Plug-and-play Inference Acceleration

Sinkhorn-Normalized Quantization in LLMs

Sinkhorn-Normalized Quantization in LLMs

Uses a fast Sinkhorn–Knopp-style algorithm that finds scales normalizing per-row and per-column variances, thereby minimizing a novel per-matrix proxy target for quantization

Large Concept Models

Large Concept Models

Assumes that a concept corresponds to a sentence, and uses an existing sentence embedding space

Rectified Flow Transformers

Rectified Flow Transformers

Uses rectified flow (connecting noise to data using straight-line trajectories) with a Transformer-based architecture

generative Read Paper ↗
Not all bits are equal

Not all bits are equal

Why models with an effective size below that of an 8-bit, 4B-parameter model achieve better accuracy by allocating memory to more weights rather than longer generation, while larger models achieve better accuracy by allocating memory to longer generations

generative Read Paper ↗
MxFP4 vs NVFP4 Training

MxFP4 vs NVFP4 Training

Micro-exponent formats for extreme quantization and stability in LLM pre-training

1-bit Transformer Scaling

1-bit Transformer Scaling

BitNet and the era of ternary weights

optimization Read Paper ↗
Mixture Block Attention

Mixture Block Attention

Applies the principles of MoE to the attention mechanism to transition between full and sparse attention

Why Transformers are Bad at Math?

Why Transformers are Bad at Math?

Why does the model converge to a local optimum that lacks the long-range dependencies required for multiplication?

weaknesses Read Paper ↗
Kimi Linear Attention & Hardware aware chunking

Kimi Linear Attention & Hardware aware chunking

Expressive linear attention module that extends Gated DeltaNet with a finer-grained gating mechanism

Gradient Low Rank Projection Optimizers

Gradient Low Rank Projection Optimizers

Projecting gradients into lower rank to save memory

optimization Read Paper ↗
Distributed LLM Training with DiLoCo

Distributed LLM Training with DiLoCo

Using DiLoCo to train LLMs across distributed, poorly connected devices

distributed Read Paper ↗
Why Looped Transformers are good at algorithms?

Why Looped Transformers are good at algorithms?

The question says it all

architecture Read Paper ↗
Positional Integrity Encoding for rapid KV cache edit

Positional Integrity Encoding for rapid KV cache edit

Rapid KV cache editing technique for Large code LLMs

The Continual Learning Problem

The Continual Learning Problem

Investigates whether sparse parameter updates can enable learning without catastrophic forgetting

Spherical Equivariant Graph Transformers

Spherical Equivariant Graph Transformers

3D molecule modeling with symmetry preservation

geometric-dl Read Paper ↗
Learned Score Field Geometry

Learned Score Field Geometry

Diffusion models of data in general non-Euclidean geometries

Schrödinger Bridge Interpretation

Schrödinger Bridge Interpretation

Explains the issues with DSB in complex data generation

How diffusion models memorize?

How diffusion models memorize?

The heading says it all

Bias from Finite Timesteps

Bias from Finite Timesteps

Observes that maximum likelihood training consistently improves the likelihood of score-based diffusion models across multiple datasets and architectures

Semantic Manifolds in Diffusion Trajectories

Semantic Manifolds in Diffusion Trajectories

How Riemannian geometry maps between the latent space and intermediate feature maps, revealing semantic axes and curved manifold structure in diffusion trajectories

Why Semantics Appear Late?

Why Semantics Appear Late?

Shows why semantic information transfer peaks at intermediate timesteps and vanishes near both the beginning and end of the process.

Bad data lead to Good Models

Bad data lead to Good Models

Explores the possibility that pre-training on more toxic data can lead to better control in post-training, ultimately decreasing a model's output toxicity

Topological Deep Learning

Topological Deep Learning

Deep learning to handle complex, non-Euclidean data structures

Infini-gram

Infini-gram

Engine that efficiently processes n-gram queries with unbounded n over massive trillion-token corpora

XLA Compiler Techniques

XLA Compiler Techniques

Accelerated Linear Algebra: compiling ML computation graphs into optimized kernels for GPUs and other accelerators

Manifold Learning

Manifold Learning

Explains the set of methods to find the low dimensional structure of data

Universal Weight Subspace Hypothesis

Universal Weight Subspace Hypothesis

Do all models converge to the same low-dimensional weight subspace?

Holographic Transformers

Holographic Transformers

Encoding sequences in complex associative memory using neuro-symbolic techniques

architecture Read Paper ↗
GSPO for RL Training in MoEs

GSPO for RL Training in MoEs

Stable RL training algorithms to train MoEs

Sequence Objective as First Order Approximation

Sequence Objective as First Order Approximation

Explains under what conditions the true sequence-level reward can be optimized via a surrogate token-level objective in policy gradient methods

LLMs Can Get Brain Rot

LLMs Can Get Brain Rot

Continual exposure to junk web text induces lasting cognitive decline in large language models

Representation Geometry Manifolds

Representation Geometry Manifolds

Treats the data space of diffusion models as a Riemannian manifold with a score-derived metric

Alignment as an Optimization Artifact

Alignment as an Optimization Artifact

Is the language-modeling objective just a local minimum?

Causal Emergence in Representations

Causal Emergence in Representations

Shows how neural representations can align with high-level causal variables through causal abstraction experiments

Warp Divergence from Timesteps

Warp Divergence from Timesteps

GPU thread inefficiency in conditional generation

Conditioning as Geometry Deformation

Conditioning as Geometry Deformation

Conditioning is done via projective/geometric transformations of the points and features

Riemann Optimization for variables on Curved Spaces

Riemann Optimization for variables on Curved Spaces

Gradient descent on non-Euclidean manifolds