The ML Trench

by Beens & Nami

Base: Fundamentals
Level 2: Dynamics
Level 3: Deep Nets
Level 4: Transformers
Level 5: Systems
Level 6: Scaling
Level 7: Frontier
Level 8: The Void
Depth scale: 0m – 16800m
Regression

Regression

Predicting continuous values from input features.

fundamentals Read Paper ↗
Clustering Mechanism

Clustering Mechanism

Grouping similar data points without labels (e.g., k-means, fuzzy clustering)

unsupervised Read Paper ↗
Principal Component Analysis

Principal Component Analysis

Dimensionality reduction via orthogonal transformation

feature-engineering Read Paper ↗
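
As a concrete illustration, a minimal NumPy sketch of PCA via SVD on mean-centered data (illustrative only, not from the linked paper; k is the number of components kept):

import numpy as np

def pca(X, k):
    # Center the data so the principal axes pass through the origin.
    Xc = X - X.mean(axis=0)
    # SVD of the centered data: rows of Vt are the orthogonal principal directions.
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:k]                 # (k, n_features)
    projected = Xc @ components.T       # (n_samples, k) reduced representation
    explained_var = (S[:k] ** 2) / (len(X) - 1)
    return projected, components, explained_var

Z, W, var = pca(np.random.randn(200, 10), k=2)
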
Decision Trees

Decision Trees

Hierarchical decision rules for classification

classical-ml Read Paper ↗
Random Forests

Random Forests

Ensemble bagging of decision trees

classical-ml Read Paper ↗
Mean Squared Error

Mean Squared Error

L2 loss function for regression tasks

optimization Read Paper ↗
Cross Entropy Loss

Cross Entropy Loss

Measuring the divergence between probability distributions; commonly used as the training objective for dense LLMs

optimization Read Paper ↗
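
A hedged NumPy sketch of softmax cross-entropy between logits and integer class labels (the usual next-token loss in language-model training; names are illustrative):

import numpy as np

def cross_entropy(logits, targets):
    # logits: (batch, num_classes); targets: (batch,) integer class indices
    logits = logits - logits.max(axis=1, keepdims=True)               # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

loss = cross_entropy(np.random.randn(4, 10), np.array([1, 3, 0, 7]))
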
Feature Normalization

Feature Normalization

Scaling inputs to stabilize training

preprocessing Read Paper ↗
One-hot Encoding

One-hot Encoding

Representing categorical variables as binary vectors

preprocessing Read Paper ↗
Supervised Learning

Supervised Learning

Learning a mapping from labeled data

Unsupervised Learning

Unsupervised Learning

Finding patterns in unlabeled data

Feed Forward Neural Networks

Feed Forward Neural Networks

A network in which information flows in a single direction: inputs are multiplied by weights layer by layer to produce outputs

architecture Read Paper ↗
Convolutional Neural Networks

Convolutional Neural Networks

A network that processes images with filters applied over local grids, enabling spatial understanding

architecture Read Paper ↗
Activation Functions

Activation Functions

Introducing non-linearity to neural networks

fundamentals Read Paper ↗
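
Common activations as NumPy one-liners (a sketch; each bends the otherwise linear mapping of a layer):

import numpy as np

relu    = lambda x: np.maximum(0.0, x)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
tanh    = np.tanh
gelu    = lambda x: 0.5 * x * (1.0 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))  # tanh approximation
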
N-gram Models

N-gram Models

Statistical language model that predicts the probability of a word (or symbol) based on the preceding n-1 words
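
A toy bigram (n = 2) model, counting adjacent word pairs and normalizing to estimate P(next word | previous word); purely illustrative:

from collections import Counter, defaultdict

def train_bigram(tokens):
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    # Normalize counts into conditional probabilities P(next | prev).
    return {p: {w: c / sum(ctr.values()) for w, c in ctr.items()} for p, ctr in counts.items()}

model = train_bigram("the cat sat on the mat".split())
print(model["the"])   # {'cat': 0.5, 'mat': 0.5}
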

Recurrent Neural Networks

Recurrent Neural Networks

Processing sequential data with internal state

architecture Read Paper ↗
Optimizers

Optimizers

Updates model weights to reduce error and improve accuracy (e.g., SGD, Adam)

optimization Read Paper ↗
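
A minimal sketch of the two update rules named above, vanilla SGD and Adam, over NumPy arrays (hyperparameters are illustrative defaults, not prescriptions):

import numpy as np

def sgd_step(w, grad, lr=1e-2):
    return w - lr * grad

def adam_step(w, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    # t starts at 1; m and v are running moment estimates of the gradient.
    m = b1 * m + (1 - b1) * grad              # first moment (mean)
    v = b2 * v + (1 - b2) * grad**2           # second moment (uncentered variance)
    m_hat = m / (1 - b1**t)                   # bias correction
    v_hat = v / (1 - b2**t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v
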
Regularization

Regularization

Techniques to prevent overfitting (L1, L2, Dropout)

Learning Rate Schedulers

Learning Rate Schedulers

Adjusting learning rate during training

Gradient Clipping

Gradient Clipping

Limiting gradient magnitude to prevent explosions
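
A sketch of clipping by global norm (the same idea implemented by torch.nn.utils.clip_grad_norm_), written here in plain NumPy:

import numpy as np

def clip_by_global_norm(grads, max_norm=1.0):
    # grads: list of NumPy arrays; rescale all of them if their joint norm is too large.
    total_norm = np.sqrt(sum(float((g ** 2).sum()) for g in grads))
    scale = min(1.0, max_norm / (total_norm + 1e-6))
    return [g * scale for g in grads]
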

Class Imbalance Handling

Class Imbalance Handling

Techniques like SMOTE or weighted loss

preprocessing Read Paper ↗
Forward Noising & Denoising

Forward Noising & Denoising

Corrupting images with gradual noising steps and restoring them, teaching diffusion models how to generate images

generative Read Paper ↗
DDPM Formulation

DDPM Formulation

Denoising Diffusion Probabilistic Models: models that learn to reverse a gradual corruption (noising) process

generative Read Paper ↗
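
A hedged sketch of the closed-form DDPM forward (noising) step, q(x_t | x_0) = N(sqrt(abar_t) x_0, (1 - abar_t) I), with a simple linear beta schedule (values illustrative):

import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)               # noise schedule
alpha_bar = np.cumprod(1.0 - betas)              # abar_t = product of (1 - beta_s)

def q_sample(x0, t, rng=np.random.default_rng()):
    # Jump straight from clean data x0 to its noised version x_t using the closed form.
    noise = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * noise, noise

x_t, eps = q_sample(np.random.randn(3, 32, 32), t=500)
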
Stochastic Gradient Descent

Stochastic Gradient Descent

Iterative optimization using mini-batches

optimization Read Paper ↗
The No Free Lunch Theorem

The No Free Lunch Theorem

No single algorithm works best for all problems: averaged across all possible problems every algorithm performs the same, so one that excels on some datasets must underperform on others

Empirical Risk Minimization

Empirical Risk Minimization

Minimizing error on the training set

Vapnik-Chervonenkis Dimension

Vapnik-Chervonenkis Dimension

Measuring the capacity of a classification algorithm

Rademacher Complexity

Rademacher Complexity

Measuring the richness of a class of functions

Double Descent

Double Descent

A model's test error initially decreases with the number of parameters, then peaks around the interpolation threshold, then decreases again

Jacobian & Hessian Matrices

Jacobian & Hessian Matrices

First and second-order partial derivatives

Gradient Noise Scale

Gradient Noise Scale

Predicts the critical batch size beyond which larger batches stop improving data efficiency

Vanishing & Exploding Gradients

Vanishing & Exploding Gradients

Instability in deep-network backpropagation where gradient magnitudes shrink toward zero or grow without bound

Activation Saturation Effects

Activation Saturation Effects

Neurons getting stuck at asymptotic values

Convexity & Smoothness

Convexity & Smoothness

Properties ensuring that a global minimum (minimal loss) is reachable by gradient methods

Stability of Tikhonov Regularization

Stability of Tikhonov Regularization

L2 regularization for ill-posed problems (problems where a small change in input data causes a massive change in the output)

Polyak–Łojasiewicz (PL) Condition

Polyak–Łojasiewicz (PL) Condition

Gradient dominance for faster convergence (ensures global linear convergence without requiring objective function to be convex)

Saddle Points vs Local Minima

Saddle Points vs Local Minima

Why are saddle points, rather than local minima, the real problem in high dimensions?

EMA of Weights

EMA of Weights

Exponential Moving Average (technique where a mirror set of model parameters is maintained by keeping a running average of the training weights) for stable inference
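
A minimal EMA-of-weights sketch: a shadow copy is nudged toward the live training weights after every step and used at inference (names illustrative):

def ema_update(shadow, weights, decay=0.999):
    # shadow, weights: dicts of name -> array; the shadow trails the training weights.
    for name, w in weights.items():
        shadow[name] = decay * shadow[name] + (1.0 - decay) * w
    return shadow
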

Mixed Precision Diffusion Training

Mixed Precision Diffusion Training

FP16/FP32 hybrid for VRAM efficiency and faster inference in Diffusion Transformers

Noise Level Conditioning

Noise Level Conditioning

Feeding the current noise magnitude into the neural network, to guide how much denoising is needed at each step

generative Read Paper ↗
Lottery Ticket Hypothesis

Lottery Ticket Hypothesis

Proposes that large, randomly initialized neural networks contain small subnetworks, called "winning tickets," that can achieve the same accuracy as the full network if trained in isolation with their original initializations

Fisher Information Matrix

Fisher Information Matrix

Used to calculate the covariance matrices associated with maximum-likelihood estimates.

Neural Tangent Kernel

Neural Tangent Kernel

Infinite-width networks behave like linear models

Universal Approximation Theorem

Universal Approximation Theorem

NNs can approximate any continuous function

Mean-field theory of Neural Networks

Mean-field theory of Neural Networks

Statistical physics approach to large networks (Law of large numbers)

Memorization–Generalization Paradox

Memorization–Generalization Paradox

Deep models memorize noise yet generalize well

Latent Space Scaling

Latent Space Scaling

The autoencoder's latent encodings are scaled by this factor before being fed into the U-Net

Generalization at Higher Dimensions

Generalization at Higher Dimensions

Large nets memorize, but still generalize at higher dimensions

Attention as Kernel Mechanism

Attention as Kernel Mechanism

Smoothing via similarity kernels

Dot Product Attention Geometry

Dot Product Attention Geometry

Cosine similarity in high-dimensional space

Self Attention

Self Attention

Sequence elements attending to themselves
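
A single-head scaled dot-product self-attention sketch in NumPy; the Wq, Wk, Wv matrices stand in for learned projection weights:

import numpy as np

def self_attention(x, Wq, Wk, Wv):
    # x: (seq_len, d_model); each token attends to every token of the same sequence.
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])                     # (seq_len, seq_len)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)              # softmax over keys
    return weights @ V

d = 16
x = np.random.randn(8, d)
out = self_attention(x, *[np.random.randn(d, d) for _ in range(3)])
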

Cross Attention

Cross Attention

Attending to context from encoder or other modality (such as image to text, text to image etc...)

Attention as LoRA

Attention as LoRA

Brings negative attention to self-attention modules and learns low-rank attention weights directly, capturing the characteristics of downstream tasks

optimization Read Paper ↗
Encoder-Decoder Architecture

Encoder-Decoder Architecture

Performs sequence-to-sequence tasks, using an encoder to read the input sequence and a decoder to generate the output

architecture Read Paper ↗
Residuals as Gradient Highways

Residuals as Gradient Highways

Skip connections prevent the vanishing-gradient problem

architecture Read Paper ↗
Positional Encoding

Positional Encoding

Technique that adds information about each token's position in the sequence to the input embeddings, injecting order into permutation-invariant attention
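
The standard sinusoidal positional-encoding table from "Attention Is All You Need", as a NumPy sketch; the result is added to the token embeddings:

import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    pos = np.arange(seq_len)[:, None]                   # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]                 # (1, d_model/2)
    angles = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                         # even dimensions
    pe[:, 1::2] = np.cos(angles)                         # odd dimensions
    return pe

embeddings = np.random.randn(128, 512) + sinusoidal_positional_encoding(128, 512)
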

Learned Positional Embeddings

Learned Positional Embeddings

Position information is encoded as a trainable parameter (embedding vector) rather than a fixed, predefined function

Large Language Models

Large Language Models

Transformers scaled up and trained on massive corpora

Transformer Expressivity

Transformer Expressivity

Are Transformers really universal approximators?

Universal Approximation of Sequences

Universal Approximation of Sequences

Transformers as universal sequence approximators

Inductive Bias of Self-Attention

Inductive Bias of Self-Attention

Relationships between tokens regardless of distance

Reparameterization Equivalence

Reparameterization Equivalence

Different architectures yielding same function space

Probability Flow ODE

Probability Flow ODE

Deterministic sampling in diffusion models

generative Read Paper ↗
RoPE

RoPE

Encodes the absolute position with a rotation matrix and provides relative position dependency in self-attention formulation
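
A hedged NumPy sketch of RoPE applied to a single query/key vector: consecutive dimension pairs are rotated by a position-dependent angle, so dot products depend only on relative position (this uses one common pairing convention; implementations differ):

import numpy as np

def rope(x, pos, base=10000.0):
    # x: (..., d) with d even; pos: integer position of this token.
    d = x.shape[-1]
    theta = base ** (-np.arange(0, d, 2) / d)        # per-pair rotation frequencies
    angles = pos * theta
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin             # 2-D rotation of each (x1, x2) pair
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

q_rot = rope(np.random.randn(64), pos=7)
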

Byte Pair Tokenizer

Byte Pair Tokenizer

Hybrid subword tokenization method that iteratively merges the most frequent pairs of adjacent characters or bytes into new, larger tokens
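
A toy BPE training loop (illustrative, not the exact byte-level variant used by GPT-2): repeatedly merge the most frequent adjacent symbol pair:

from collections import Counter

def merge_pair(word, a, b):
    out, i = [], 0
    while i < len(word):
        if i + 1 < len(word) and word[i] == a and word[i + 1] == b:
            out.append(a + b); i += 2
        else:
            out.append(word[i]); i += 1
    return out

def bpe_train(words, num_merges=10):
    corpus = [list(w) for w in words]        # start from character-level symbols
    merges = []
    for _ in range(num_merges):
        pairs = Counter(p for w in corpus for p in zip(w, w[1:]))
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]
        merges.append((a, b))
        corpus = [merge_pair(w, a, b) for w in corpus]
    return merges

print(bpe_train(["lower", "lowest", "low", "low"], num_merges=3))
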

Transformer Grokking

Transformer Grokking

Delayed generalization after overfitting

Instruction Level Parallelism

Instruction Level Parallelism

Hardware execution of multiple instructions

Length Generalization

Length Generalization

Extrapolating beyond training context window

QKV Projection Fusion

QKV Projection Fusion

Merging Q,K,V matrix multiplications for speed

Flash Attention

Flash Attention

IO-aware exact attention algorithm that uses tiling to reduce the number of memory reads/writes between HBM and GPU on-chip SRAM.

KV Cache Layout

KV Cache Layout

Memory organization for fast decoding

Kernel Fusion Techniques

Kernel Fusion Techniques

Combining GPU kernels to reduce overhead

Kernel Tiling Strategies

Kernel Tiling Strategies

Optimizing data movement to shared memory

Parallelism Strategies

Parallelism Strategies

Data, Tensor, Pipeline, and Sequence parallelism for MultiGPU setups

CUDA Graph Captures

CUDA Graph Captures

Reducing CPU launch overhead for GPU kernels

Block-sparse Attention

Block-sparse Attention

Skipping computation on empty attention blocks

efficiency Read Paper ↗
PTX Control Optimization

PTX Control Optimization

Low-level assembly tuning for GPUs

Paged Attention

Paged Attention

OS-style virtual memory for KV cache

Hardware Specific Languages

Hardware Specific Languages

Generalized tiled programming model for more efficient AI kernel programming

Flash Multi-head FNNs

Flash Multi-head FNNs

I/O-aware fused kernel computing outputs online in SRAM akin to FlashAttention, and a design using dynamically weighted parallel sub-networks to maintain a balanced ratio between intermediate and head dimensions

Gated Attention Mechanism to escape Attention Sinks

Gated Attention Mechanism to escape Attention Sinks

Applying a head-specific sigmoid gate after scaled dot-product attention consistently improves performance

GRPO for Math Performance in Dense LLMs

GRPO for Math Performance in Dense LLMs

How do GRPO and an MoE auxiliary loss enable math solving in LLMs?

Why Cosine Schedule Works Better

Why Cosine Schedule Works Better

Smooth decay matches loss landscape better
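
A sketch of the cosine learning-rate schedule with linear warmup discussed here (function name and defaults are illustrative):

import math

def cosine_lr(step, total_steps, base_lr=3e-4, warmup_steps=1000, min_lr=3e-5):
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)                     # linear warmup
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))
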

Fused Timestep Embedding Kernels

Fused Timestep Embedding Kernels

Optimizing diffusion noise injection

Style Alignment via SharedAttention

Style Alignment via SharedAttention

Enables style alignment by leaking attention values between generated images

generative Read Paper ↗
Optimal Scaling Laws

Optimal Scaling Laws

Chinchilla: trade-off between params and data

Diffusion Language Models

Diffusion Language Models

Generating text via continuous diffusion

generative Read Paper ↗
Test Time Scaling

Test Time Scaling

Trading compute for accuracy during inference

Free Transformers

Free Transformers

Extension of the decoder Transformer that conditions its generative process on random latent variables, which are learned without supervision using a variational procedure

architecture Read Paper ↗
Mixture of Experts

Mixture of Experts

Sparse activation of model sub-components

architecture Read Paper ↗
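
A hedged sketch of top-k expert routing: a router scores experts per token, only the k best experts run, and their outputs are combined with renormalized router weights (all names and shapes are illustrative):

import numpy as np

def moe_layer(x, router_W, experts, k=2):
    # x: (tokens, d); router_W: (d, num_experts); experts: list of callables (d,) -> (d,)
    logits = x @ router_W                                       # (tokens, num_experts)
    probs = np.exp(logits - logits.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        top = np.argsort(probs[t])[-k:]                         # indices of the k best experts
        gate = probs[t, top] / probs[t, top].sum()              # renormalize over chosen experts
        out[t] = sum(g * experts[e](x[t]) for g, e in zip(gate, top))
    return out

d, n_exp = 8, 4
experts = [(lambda W: (lambda v: v @ W))(np.random.randn(d, d)) for _ in range(n_exp)]
y = moe_layer(np.random.randn(5, d), np.random.randn(d, n_exp), experts)
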
RLVR & RLHF

RLVR & RLHF

Reinforcement Learning with Verifiable Rewards / from Human Feedback

Sparse Expert Load Balancing

Sparse Expert Load Balancing

Load-balancing loss that preserves token-wise relational structure, encouraging consistent expert choices for similar inputs during training

State Space Models

State Space Models

Mamba/S4: Linear time sequence modeling

architecture Read Paper ↗
Selective Scan Kernels

Selective Scan Kernels

The parallel prefix sum algorithm for SSMs

Native Sparse Attention

Native Sparse Attention

Learning sparsity patterns directly for efficient long context modelling

Compute-Optimal Context Length

Compute-Optimal Context Length

Balancing sequence length with model width

FP16 vs BF16 in RL Stability

FP16 vs BF16 in RL Stability

Why FP16 is more stable than BF16 for training LLMs with RL

How LLMs are Injective & Invertible?

How LLMs are Injective & Invertible?

Non-linear activations and normalization are inherently non-injective, suggesting that different inputs could map to the same output and prevent exact recovery of the input from a model's representations

Attention Sinks from Graph Perspective

Attention Sinks from Graph Perspective

Token nodes acting as information absorbers

Reasoning Stability in Short CoT

Reasoning Stability in Short CoT

Why does short CoT ensure stable reasoning in complex processes?

Squeezed Diffusion Models

Squeezed Diffusion Models

Quantum squeezed states redistribute uncertainty according to the Heisenberg uncertainty principle, scaling noise anisotropically along the principal component of the training distribution

generative Read Paper ↗
Terminal Velocity Matching

Terminal Velocity Matching

Generalization of flow matching that enables high-fidelity one- and few-step generative modeling

generative Read Paper ↗
Why mask diffusion does not work

Why mask diffusion does not work

Why does masked diffusion face difficulties in achieving parallel generation and bidirectional attention?

generative Read Paper ↗
How much do LLMs memorize

How much do LLMs memorize

Estimating how much a model “knows” about a datapoint

generative Read Paper ↗
Normalization Free Transformers

Normalization Free Transformers

Transformers without normalization can achieve the same or better performance using Dynamic Tanh (DyT)

architecture Read Paper ↗
ARC is a Vision Problem

ARC is a Vision Problem

Achieves higher accuracy in ARC by framing it as an image-to-image translation problem

How LLMs Use Their Depth?

How LLMs Use Their Depth?

Explains how LLMs internally structure their computations to make predictions

Why INT8 in SageAttention is Better?

Why INT8 in SageAttention is Better?

8-Bit Attention for Plug-and-play Inference Acceleration

Sinkhorn-Normalized Quantization in LLMs

Sinkhorn-Normalized Quantization in LLMs

Uses a fast Sinkhorn–Knopp-style algorithm that finds scales normalizing per-row and per-column variances, thereby minimizing a novel per-matrix proxy target for quantization

Large Concept Models

Large Concept Models

Assumes that a concept corresponds to a sentence, and uses an existing sentence embedding space

Rectified Flow Transformers

Rectified Flow Transformers

Uses rectified flow (connecting noise to data using straight-line trajectories) with a Transformer-based architecture

generative Read Paper ↗
Not all bits are equal

Not all bits are equal

Why models with an effective size below that of an 8-bit, 4B-parameter model achieve better accuracy by allocating memory to more weights rather than longer generation, while larger models achieve better accuracy by allocating memory to longer generations

generative Read Paper ↗
MxFP4 vs NVFP4 Training

MxFP4 vs NVFP4 Training

Micro-exponent formats for extreme quantization and stability in LLM pre-training

1-bit Transformer Scaling

1-bit Transformer Scaling

BitNet and the era of ternary weights

optimization Read Paper ↗
Mixture Block Attention

Mixture Block Attention

Applies the principles of MoE to the attention mechanism to transition between full and sparse attention

Why Transformers are Bad at Math?

Why Transformers are Bad at Math?

Why does the model converge to a local optimum that lacks the long-range dependencies required for multiplication?

weaknesses Read Paper ↗
Kimi Linear Attention & Hardware aware chunking

Kimi Linear Attention & Hardware aware chunking

Expressive linear attention module that extends Gated DeltaNet with a finer-grained gating mechanism

Gradient Low Rank Projection Optimizers

Gradient Low Rank Projection Optimizers

Projecting gradients into lower rank to save memory

optimization Read Paper ↗
Distributed LLM Training with DiLoCo

Distributed LLM Training with DiLoCo

Using DiLoCo to train LLMs across distributed, poorly connected devices

distributed Read Paper ↗
Why Looped Transformers are good at algorithms?

Why Looped Transformers are good at algorithms?

The question says it all

architecture Read Paper ↗
Positional Integrity Encoding for rapid KV cache edit

Positional Integrity Encoding for rapid KV cache edit

Rapid KV cache editing technique for Large code LLMs

The Continual Learning Problem

The Continual Learning Problem

Investigates whether sparse parameter updates can enable learning without catastrophic forgetting

Spherical Equivariant Graph Transformers

Spherical Equivariant Graph Transformers

3D molecule modeling with symmetry preservation

geometric-dl Read Paper ↗
Learned Score Field Geometry

Learned Score Field Geometry

Diffusion models of data in general non-Euclidean geometries

Schrödinger Bridge Interpretation

Schrödinger Bridge Interpretation

Explains the issues with DSB in complex data generation

How diffusion models memorize?

How diffusion models memorize?

The heading says it all

Bias from Finite Timesteps

Bias from Finite Timesteps

Observes that maximum likelihood training consistently improves the likelihood of score-based diffusion models across multiple datasets and architectures

Semantic Manifolds in Diffusion Trajectories

Semantic Manifolds in Diffusion Trajectories

How Riemannian geometry maps between the latent space and intermediate feature maps, revealing semantic axes and curved manifold structure in diffusion trajectories

Why Semantics Appear Late?

Why Semantics Appear Late?

Shows why semantic information transfer peaks at intermediate timesteps and vanishes near both the beginning and end of the process.

Bad data lead to Good Models

Bad data lead to Good Models

Explores the possibility that pre-training on more toxic data can lead to better control in post-training, ultimately decreasing a model's output toxicity

Topological Deep Learning

Topological Deep Learning

Deep learning to handle complex, non-Euclidean data structures

Infini-gram

Infini-gram

Engine that efficiently processes n-gram queries with unbounded n over massive trillion-token corpora

XLA Compiler Techniques

XLA Compiler Techniques

Accelerated Linear Algebra: compiling ML computation graphs into optimized kernels for GPUs and other accelerators

Manifold Learning

Manifold Learning

Explains the set of methods to find the low dimensional structure of data

Universal Weight Subspace Hypothesis

Universal Weight Subspace Hypothesis

Do all models converge to the same low-dimensional weight subspace?

Holographic Transformers

Holographic Transformers

Encoding sequences in complex associative memory using neuro-symbolic techniques

architecture Read Paper ↗
GSPO for RL Training in MoEs

GSPO for RL Training in MoEs

Stable RL training algorithms to train MoEs

Sequence Objective as First Order Approximation

Sequence Objective as First Order Approximation

Explains under what conditions the true sequence-level reward can be optimized via a surrogate token-level objective in policy gradient methods

LLMs Can Get Brain Rot

LLMs Can Get Brain Rot

Continual exposure to junk web text induces lasting cognitive decline in large language models

Representation Geometry Manifolds

Representation Geometry Manifolds

Treats the data space of diffusion models as a Riemannian manifold with a score-derived metric

Alignment as an Optimization Artifact

Alignment as an Optimization Artifact

Is the language-modeling objective just a local minimum?

Causal Emergence in Representations

Causal Emergence in Representations

Shows how neural representations can align with high-level causal variables through causal abstraction experiments

Warp Divergence from Timesteps

Warp Divergence from Timesteps

GPU thread inefficiency in conditional generation

Conditioning as Geometry Deformation

Conditioning as Geometry Deformation

Conditioning is done via projective/geometric transformations of the points and features

Riemann Optimization for variables on Curved Spaces

Riemann Optimization for variables on Curved Spaces

Gradient descent on non-Euclidean manifolds