Hacker News
- What happens if we remove 50 percent of Llama? https://neuralmagic.com/blog/24-sparse-llama-smaller-models-for-efficient-gpu-inference/ 132 comments
Linked pages
- Fast Llama 2 on CPUs With Sparse Fine-Tuning and DeepSparse - Neural Magic https://neuralmagic.com/blog/fast-llama-2-on-cpus-with-sparse-fine-tuning-and-deepsparse/ 26 comments
- SparseGPT: Remove 100B Parameters For Free - Neural Magic https://neuralmagic.com/blog/sparsegpt-remove-100-billion-parameters-for-free/ 7 comments
- GitHub - vllm-project/vllm: A high-throughput and memory-efficient inference and serving engine for LLMs https://github.com/vllm-project/vllm 0 comments
- [2310.06927] Sparse Fine-tuning for Inference Acceleration of Large Language Models https://arxiv.org/abs/2310.06927 0 comments
- GitHub - EleutherAI/lm-evaluation-harness: A framework for few-shot evaluation of autoregressive language models. https://github.com/EleutherAI/lm-evaluation-harness 0 comments