https://blog.eleuther.ai/transformer-math/
Speed Estimation:
| Accelerator | Peak BF16 FLOP/s | VRAM |
|---|---|---|
| B200 | 2250 TFLOPS | 180 GB |
| H100 | 900 TFLOPS | 80 GB |
| A100 | 300 TFLOPS | 80 GB |
| TPU-v3-8 | 500 TFLOPS | 128 GB |
| A6000 | 150 TFLOPS | 48 GB |
| RTX 4090 | 300 TFLOPS | 24 GB |
What is a FLOP?
Multiplying an M×N matrix by an N×P matrix takes 2MNP FLOPs (half multiplications, half additions).
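A minimal sketch of the 2MNP count, plus the ideal (100%-utilization) time for one matmul on an A100 using the 300 TFLOPS figure from the table above:

```python
def matmul_flops(m: int, n: int, p: int) -> int:
    # An (M x N) @ (N x P) matmul does M*P dot products of length N:
    # N multiplications and N additions each -> 2*M*N*P FLOPs total.
    return 2 * m * n * p

# Example: a 4096 x 4096 square matmul.
flops = matmul_flops(4096, 4096, 4096)

# Ideal time on an A100 at 300 TFLOPS (peak from the table; real kernels
# achieve only a fraction of this).
A100_PEAK = 300e12
ideal_seconds = flops / A100_PEAK
print(f"{flops:.3e} FLOPs, ideal A100 time: {ideal_seconds * 1e6:.1f} us")
```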
Model Training FLOPS
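The referenced blog's standard estimate is C ≈ 6ND total training FLOPs for N parameters and D tokens (2ND for the forward pass, 4ND for the backward pass). A minimal sketch with illustrative numbers:

```python
def training_flops(n_params: float, n_tokens: float) -> float:
    # C ~= 6 * N * D: 2ND for the forward pass, 4ND for the backward pass.
    return 6.0 * n_params * n_tokens

# Example (illustrative): a 7e9-parameter model trained on 2e12 tokens.
c = training_flops(7e9, 2e12)
print(f"{c:.2e} total training FLOPs")
```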
MFU (Model FLOPs Utilization): achieved model FLOP/s divided by the hardware's peak FLOP/s.
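A sketch of the MFU calculation, combining the 6N FLOPs-per-token estimate with the H100 row from the table above (the throughput number is a hypothetical measurement, not from the source):

```python
def mfu(n_params: float, tokens_per_sec: float, n_gpus: int, peak_flops: float) -> float:
    # Achieved model FLOP/s (~6*N FLOPs per token) over aggregate peak FLOP/s.
    achieved = 6.0 * n_params * tokens_per_sec
    return achieved / (n_gpus * peak_flops)

# Hypothetical run: 7e9 params at 60,000 tokens/s on 8 H100s (900 TFLOPS each).
print(f"MFU: {mfu(7e9, 60_000, 8, 900e12):.1%}")  # -> MFU: 35.0%
```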
Model Weights: N billion parameters → 2N GB in bf16 (4N GB in fp32)
Optimizer States: Adam keeps fp32 momentum and variance, plus fp32 master weights in mixed precision → 12N GB
Activations: scale with batch size × sequence length × hidden size; reduced by activation checkpointing
Gradients: 2N GB in bf16 (4N GB in fp32)
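A minimal sketch of the static (pre-activation) training memory, assuming bf16 weights/gradients with mixed-precision Adam, using the per-parameter byte counts above:

```python
def train_memory_gb(n_params_billion: float) -> dict:
    n = n_params_billion  # billions of params -> GB per (byte/param)
    return {
        "weights_bf16": 2 * n,         # 2 bytes/param
        "gradients_bf16": 2 * n,       # 2 bytes/param
        "adam_states_fp32": 8 * n,     # momentum + variance, 4 bytes each
        "master_weights_fp32": 4 * n,  # fp32 copy for mixed precision
    }

mem = train_memory_gb(7)   # a 7B-parameter model
total = sum(mem.values())  # 112 GB before activations
print(f"{total} GB static memory")
```

Even before activations, a 7B model's training state overflows a single 80 GB H100, which is what motivates the parallelism schemes below.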
Parallelism
Tensor-Parallel #TP

Column-Splitting Tensor Parallel: each weight matrix is split column-wise across #TP accelerators, so each accelerator holds only 1/#TP of the model weights.
Data-Parallel #DP
Context-Parallel #CP
Pipeline-Parallel #PP
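A toy illustration of column-splitting tensor parallelism (no real communication; the shards simulate #TP ranks): split W's columns, multiply locally on each "rank", then concatenate the partial outputs.

```python
import numpy as np

TP = 4
rng = np.random.default_rng(0)
x = rng.standard_normal((8, 64))      # activations: (batch, d_in)
w = rng.standard_normal((64, 128))    # weights: (d_in, d_out)

shards = np.split(w, TP, axis=1)      # each rank holds 1/TP of the columns
partials = [x @ s for s in shards]    # local matmul on each rank
y = np.concatenate(partials, axis=1)  # all-gather along the output dim

assert np.allclose(y, x @ w)          # matches the unsharded matmul
```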

Distributed Optimizer: shard the optimizer states across the #DP ranks (ZeRO-1-style) instead of replicating them on every rank.
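A sketch of the per-rank saving from sharding the Adam states over #DP ranks, reusing the byte counts from the memory section (assumes bf16 weights/gradients with mixed-precision Adam; weights and gradients stay replicated as in ZeRO-1):

```python
def per_rank_memory_gb(n_params_billion: float, dp: int) -> float:
    n = n_params_billion
    replicated = 2 * n + 2 * n      # bf16 weights + bf16 gradients on every rank
    sharded = (8 * n + 4 * n) / dp  # Adam states + fp32 master weights, split over #DP
    return replicated + sharded

print(per_rank_memory_gb(7, dp=1))  # 112.0 GB: everything replicated
print(per_rank_memory_gb(7, dp=8))  # 38.5 GB per rank with sharded states
```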