Hacker News
- Activation-Aware Weight Quantization for LLM Compression Outperforms GPTQ https://arxiv.org/abs/2306.00978 2 comments
Linking pages
- HQQ quantization https://mobiusml.github.io/hqq_blog/ 2 comments
- Gemlite: Towards Building Custom Low-Bit Fused CUDA Kernels https://mobiusml.github.io/gemlite_blogpost/ 2 comments
- GitHub - mit-han-lab/llm-awq: AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration https://github.com/mit-han-lab/llm-awq 0 comments
- GitHub - vllm-project/vllm: A high-throughput and memory-efficient inference and serving engine for LLMs https://github.com/vllm-project/vllm 0 comments
- GitHub - RUCAIBox/LLMSurvey: The official GitHub page for the survey paper "A Survey of Large Language Models". https://github.com/RUCAIBox/LLMSurvey 0 comments
- GitHub - horseee/Awesome-Efficient-LLM: A curated list for Efficient Large Language Models https://github.com/horseee/Awesome-Efficient-LLM 0 comments
- GitHub - AIoT-MLSys-Lab/Efficient-LLMs-Survey: Efficient Large Language Models: A Survey https://github.com/AIoT-MLSys-Lab/Efficient-LLMs-Survey 0 comments
- The Path to Achieve Ultra-Low Inference Latency With LLaMA 65B on PyTorch/XLA | PyTorch https://pytorch.org/blog/path-achieve-low-inference-latency/ 0 comments
- Welcome to vLLM! — vLLM https://docs.vllm.ai/en/latest/ 0 comments
- LLMs for your iPhone: Whole-Tensor 4 Bit Quantization https://stephenpanaro.com/blog/llm-quantization-for-iphone 0 comments
- Selecting GPUs for LLM serving on GKE | Google Cloud Blog https://cloud.google.com/blog/products/ai-machine-learning/selecting-gpus-for-llm-serving-on-gke/ 0 comments
- GitHub - NexaAI/Awesome-LLMs-on-device: Awesome LLMs on Device: A Comprehensive Survey https://github.com/NexaAI/Awesome-LLMs-on-device 0 comments