ADR-0010: Gate 3 Split into Design Verification and Performance Verification¶
Status¶
Accepted
Context¶
Gate criterion 3 specified "vmap x jit with batch size 1 to 1000 scaling nearly linearly on GPU." This implicitly combined two independent concerns:
- Design verification: Does
jax.custom_vjp(Method C) compose correctly withvmap? - Performance verification: Does throughput scale linearly with batch size on GPU?
The concept proof was conducted on CPU (Apple Silicon). The custom_vjp + vmap integration works correctly (shape tests pass, gradients are accurate across batches), but throughput scaling shows typical CPU memory bandwidth saturation starting at batch size 64.
CPU scaling results:
| batch | throughput | incremental scaling |
|---|---|---|
| 1 | 119k/s | baseline |
| 4 | 508k/s | 4.3x |
| 16 | 1.9M/s | 3.8x |
| 64 | 4.4M/s | 2.3x |
| 256 | 6.9M/s | 1.5x |
| 1024 | 12.0M/s | 1.7x |
Total T(1024)/T(1) = 100x (9.8% efficiency), well below the 70% target. This is a hardware limitation, not a design limitation.
Decision¶
Split Gate 3 into two sub-criteria:
- Gate 3a (design verification): custom_vjp + vmap integration works correctly. PASS.
- Gate 3b (performance verification): GPU throughput scaling meets 70% efficiency. DEFERRED to the 3D extension.
Consequences¶
- Concept proof gate evaluation proceeds with Gate 3a PASS
- The 3D extension will include GPU benchmark as part of the primitive test suite
- CPU flattening data serves as a baseline for GPU comparison
- The 70% efficiency threshold remains unchanged for the GPU evaluation