Hacker News
- What happens if we remove 50 percent of Llama? https://neuralmagic.com/blog/24-sparse-llama-smaller-models-for-efficient-gpu-inference/ 132 comments
Linked pages
- Fast Llama 2 on CPUs With Sparse Fine-Tuning and DeepSparse - Neural Magic https://neuralmagic.com/blog/fast-llama-2-on-cpus-with-sparse-fine-tuning-and-deepsparse/ 26 comments
- SparseGPT: Remove 100B Parameters For Free - Neural Magic https://neuralmagic.com/blog/sparsegpt-remove-100-billion-parameters-for-free/ 7 comments
- GitHub - vllm-project/vllm: A high-throughput and memory-efficient inference and serving engine for LLMs https://github.com/vllm-project/vllm 0 comments
- [2310.06927] Sparse Fine-tuning for Inference Acceleration of Large Language Models https://arxiv.org/abs/2310.06927 0 comments
- GitHub - EleutherAI/lm-evaluation-harness: A framework for few-shot evaluation of autoregressive language models. https://github.com/EleutherAI/lm-evaluation-harness 0 comments