Show HN: Run 30B model in 4GB Active Memory https://ift.tt/DPTyaA5

We have built fused operator kernels for structured contextual sparsity that avoid loading and computing the feed-forward weights whose activations the nonlinearity would zero out anyway. The result: we are seeing 5x faster MLP layer performance in transformers and roughly 50% less memory consumption, by skipping the sleeping neurons on every token prediction. In Llama 3.2, the feed-forward layers account for about 30% of total weights and forward-pass computation, which translates into a 1.6-1.8x increase in throughput.

Sparse LLaMA 3.2 3B vs. LLaMA 3.2 3B (HuggingFace implementation):

- Time to First Token (TTFT): 1.51× faster (1.209s → 0.803s)
- Output generation speed: 1.79× faster (0.7 → 1.2 tokens/sec)
- Total throughput: 1.78× faster (0.7 → 1.3 tokens/sec)
- Memory usage: 26.4% reduction (6.125 GB → 4.15 GB)

The operator kernels, together with differential weight caching, are open sourced at github/sparse_transformers. Let's get LLMs sprinting!

https://ift.tt/rF7zt3W
June 6, 2025 at 12:13AM
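To make the idea concrete, below is a minimal PyTorch sketch of contextual sparsity in a SwiGLU-style MLP block. It is not the repo's fused kernel or its differential weight cache; the low-rank `predictor`, the `top_fraction` parameter, and the single-token decode shape are illustrative assumptions. A real implementation would fuse the index gather with the matrix multiplies so the skipped weights are never loaded at all.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ContextuallySparseMLP(nn.Module):
    """Sketch: only touch the FFN neurons predicted to survive the activation."""

    def __init__(self, d_model: int, d_ff: int, top_fraction: float = 0.2, rank: int = 32):
        super().__init__()
        self.gate_proj = nn.Linear(d_model, d_ff, bias=False)
        self.up_proj = nn.Linear(d_model, d_ff, bias=False)
        self.down_proj = nn.Linear(d_ff, d_model, bias=False)
        # Cheap low-rank predictor that scores which FFN neurons will fire
        # for the current token (hypothetical; the repo may predict differently).
        self.predictor = nn.Sequential(
            nn.Linear(d_model, rank, bias=False),
            nn.Linear(rank, d_ff, bias=False),
        )
        self.top_fraction = top_fraction

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (d_model,) -- a single-token decode step, for simplicity.
        k = max(1, int(self.top_fraction * self.gate_proj.out_features))
        scores = self.predictor(x)                 # estimated per-neuron activation
        active = torch.topk(scores, k).indices     # neurons predicted to be non-zero

        # Gather only the active rows/columns. A fused kernel would stream these
        # directly into the GEMM instead of materializing indexed copies.
        w_gate = self.gate_proj.weight[active]     # (k, d_model)
        w_up = self.up_proj.weight[active]         # (k, d_model)
        w_down = self.down_proj.weight[:, active]  # (d_model, k)

        hidden = F.silu(w_gate @ x) * (w_up @ x)   # (k,) instead of (d_ff,)
        return w_down @ hidden                     # (d_model,)


if __name__ == "__main__":
    mlp = ContextuallySparseMLP(d_model=2048, d_ff=8192)
    y = mlp(torch.randn(2048))
    print(y.shape)  # torch.Size([2048])
```

In eager PyTorch the gather itself costs memory traffic, so the win comes from fusing the selection with the two skinny GEMMs and from reusing the gathered weights when the active set barely changes between tokens, which is presumably what the differential weight caching mentioned above targets.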
