100k H100 Clusters: Power, Network Topology, Ethernet vs InfiniBand, Reliability, Failures, Checkpointing - discu.eu

Reddit

100k H100 Clusters: Power, Network Topology, Ethernet vs InfiniBand, Reliability, Failures, Checkpointing https://www.semianalysis.com/p/100000-h100-clusters-power-network 2 comments 18/6/2024 hardware

Linking pages

$2 H100s: How the GPU Bubble Burst - by Eugene Cheah https://www.latent.space/p/gpu-bubble 289 comments
Calculating the Cost of a Google Deepmind Paper - 152334H https://152334h.github.io/blog/scaling-exponents/ 164 comments
The Future of Compute: NVIDIA's Crown is Slipping https://mohitdagarwal.substack.com/p/from-dominance-to-dilemma-nvidia 120 comments
Multi-Datacenter Training: OpenAI's Ambitious Plan To Beat Google's Infrastructure https://www.semianalysis.com/p/multi-datacenter-training-openais 1 comment
GitHub - AmberLJC/LLMSys-PaperList: Large Language Model (LLM) Systems Paper List https://github.com/AmberLJC/LLMSys-PaperList/ 1 comment
GB200 Hardware Architecture - Component Supply Chain & BOM https://www.semianalysis.com/p/gb200-hardware-architecture-and-component 0 comments
Datacenter Anatomy Part 1: Electrical Systems https://www.semianalysis.com/p/datacenter-anatomy-part-1-electrical 0 comments
Meta’s Next Llama AI Models Are Training on a GPU Cluster ‘Bigger Than Anything’ Else | WIRED https://www.wired.com/story/meta-llama-ai-gpu-training/ 0 comments
