https://blog.eleuther.ai/transformer-math/
Speed Estimation:
| Accelerator | Peak BF16 FLOP/s | VRAM |
|---|---|---|
| B200 | 2250 TFLOPS | 180 GB |
| H100 | 900 TFLOPS | 80 GB |
| A100 | 300 TFLOPS | 80 GB |
| TPU-v3-8 | 500 TFLOPS | 128 GB |
| A6000 | 150 TFLOPS | 48 GB |
| RTX 4090 | 300 TFLOPS | 24 GB |
What is a FLOP?
Multiplying an M×N matrix by an N×P matrix takes 2MNP FLOPs (half multiplications, half additions).
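A minimal sketch of the 2MNP count, plus the ideal (100%-utilization) time for one matmul on an A100 using the 300 TFLOPS figure from the table above:

```python
def matmul_flops(m: int, n: int, p: int) -> int:
    # An (M x N) @ (N x P) matmul does M*P dot products of length N:
    # N multiplications and N additions each -> 2*M*N*P FLOPs total.
    return 2 * m * n * p

# Example: a 4096 x 4096 square matmul.
flops = matmul_flops(4096, 4096, 4096)

# Ideal time on an A100 at 300 TFLOPS (peak from the table; real kernels
# achieve only a fraction of this).
A100_PEAK = 300e12
ideal_seconds = flops / A100_PEAK
print(f"{flops:.3e} FLOPs, ideal A100 time: {ideal_seconds * 1e6:.1f} us")
```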
Model Training FLOPS
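The referenced blog's standard estimate is C ≈ 6ND total training FLOPs for N parameters and D tokens (2ND for the forward pass, 4ND for the backward pass). A minimal sketch with illustrative numbers:

```python
def training_flops(n_params: float, n_tokens: float) -> float:
    # C ~= 6 * N * D: 2ND for the forward pass, 4ND for the backward pass.
    return 6.0 * n_params * n_tokens

# Example (illustrative): a 7e9-parameter model trained on 2e12 tokens.
c = training_flops(7e9, 2e12)
print(f"{c:.2e} total training FLOPs")
```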
MFU (Model FLOPs Utilization): achieved model FLOP/s divided by the hardware's peak FLOP/s.
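A sketch of the MFU calculation, combining the 6N FLOPs-per-token estimate with the H100 row from the table above (the throughput number is a hypothetical measurement, not from the source):

```python
def mfu(n_params: float, tokens_per_sec: float, n_gpus: int, peak_flops: float) -> float:
    # Achieved model FLOP/s (~6*N FLOPs per token) over aggregate peak FLOP/s.
    achieved = 6.0 * n_params * tokens_per_sec
    return achieved / (n_gpus * peak_flops)

# Hypothetical run: 7e9 params at 60,000 tokens/s on 8 H100s (900 TFLOPS each).
print(f"MFU: {mfu(7e9, 60_000, 8, 900e12):.1%}")  # -> MFU: 35.0%
```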
Model Weights: N billion parameters → 2N GB in bf16 (4N GB in fp32)
Optimizer States: Adam keeps fp32 momentum and variance, plus fp32 master weights in mixed precision → 12N GB
Activations: scale with batch size × sequence length × hidden size; reduced by activation checkpointing
Gradients: 2N GB in bf16 (4N GB in fp32)
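A minimal sketch of the static (pre-activation) training memory, assuming bf16 weights/gradients with mixed-precision Adam, using the per-parameter byte counts above:

```python
def train_memory_gb(n_params_billion: float) -> dict:
    n = n_params_billion  # billions of params -> GB per (byte/param)
    return {
        "weights_bf16": 2 * n,         # 2 bytes/param
        "gradients_bf16": 2 * n,       # 2 bytes/param
        "adam_states_fp32": 8 * n,     # momentum + variance, 4 bytes each
        "master_weights_fp32": 4 * n,  # fp32 copy for mixed precision
    }

mem = train_memory_gb(7)   # a 7B-parameter model
total = sum(mem.values())  # 112 GB before activations
print(f"{total} GB static memory")
```

Even before activations, a 7B model's training state overflows a single 80 GB H100, which is what motivates the parallelism schemes below.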
Parallelism
Tensor-Parallel #TP

Column-Splitting Tensor Parallel: each weight matrix is split column-wise across #TP accelerators, so each accelerator holds only 1/#TP of the model weights.
Data-Parallel #DP
Context-Parallel #CP
Pipeline-Parallel #PP
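A toy illustration of column-splitting tensor parallelism (no real communication; the shards simulate #TP ranks): split W's columns, multiply locally on each "rank", then concatenate the partial outputs.

```python
import numpy as np

TP = 4
rng = np.random.default_rng(0)
x = rng.standard_normal((8, 64))      # activations: (batch, d_in)
w = rng.standard_normal((64, 128))    # weights: (d_in, d_out)

shards = np.split(w, TP, axis=1)      # each rank holds 1/TP of the columns
partials = [x @ s for s in shards]    # local matmul on each rank
y = np.concatenate(partials, axis=1)  # all-gather along the output dim

assert np.allclose(y, x @ w)          # matches the unsharded matmul
```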

Distributed Optimizer: shard the optimizer states across the #DP ranks (ZeRO-1-style) instead of replicating them on every rank.
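A sketch of the per-rank saving from sharding the Adam states over #DP ranks, reusing the byte counts from the memory section (assumes bf16 weights/gradients with mixed-precision Adam; weights and gradients stay replicated as in ZeRO-1):

```python
def per_rank_memory_gb(n_params_billion: float, dp: int) -> float:
    n = n_params_billion
    replicated = 2 * n + 2 * n      # bf16 weights + bf16 gradients on every rank
    sharded = (8 * n + 4 * n) / dp  # Adam states + fp32 master weights, split over #DP
    return replicated + sharded

print(per_rank_memory_gb(7, dp=1))  # 112.0 GB: everything replicated
print(per_rank_memory_gb(7, dp=8))  # 38.5 GB per rank with sharded states
```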