Most guides on this topic stop at "NVLink is fast." That is not enough. If you are provisioning, configuring, or operating a Blackwell-based GPU server, whether it is an HGX B200, a GB200 NVL72, or a multi-rack GB300 cluster, you need to understand the full picture: how the hardware topology actually works, how to verify the NVLink fabric is healthy, how to configure NCCL and IMEX correctly, and how to avoid the operational pitfalls that burned early adopters in 2025.
This guide gives you all of that in one place, written with the level of technical depth that real infrastructure engineers need.
Quick Summary / Key Takeaways
- NVLink 5.0 on Blackwell delivers 1.8 TB/s of bidirectional bandwidth per GPU across 18 links: double the 900 GB/s offered by Hopper's NVLink 4.0 and over 14× the bandwidth of PCIe Gen 5 x16.
- The GB200 NVL72 rack connects 72 Blackwell GPUs through 9 NVLink Switch trays in a single flat NVLink domain, creating what NVIDIA accurately describes as "one massive GPU" with 130 TB/s aggregate fabric bandwidth.
- NCCL 2.25.2+ is required for Multi-Node NVLink (MNNVL) support; the NVIDIA IMEX service must be running and correctly configured on all nodes in the domain before cross-node NVLink memory access works.
- Crossing an NVLink domain boundary (placing nodes from different racks in the same job without proper topology-aware scheduling) causes a severe bandwidth cliff from ~800+ GB/s to roughly 100–200 GB/s, making Slurm's topology/block plugin and per-job IMEX isolation non-negotiable on production clusters.
Part 1: Understanding the NVLink 5.0 Architecture on Blackwell
Before touching a single configuration file, you need a firm mental model of how NVLink actually works inside a Blackwell GPU server. Skipping this step is the single biggest mistake engineers make when they inherit or provision these systems.
1.1 What NVLink 5.0 Actually Is
NVLink is NVIDIA's proprietary high-speed GPU-to-GPU interconnect. The fifth generation, introduced with the Blackwell architecture, provides 100 GB/s of bidirectional bandwidth per link (50 GB/s in each direction), and each Blackwell GPU supports 18 simultaneous NVLink connections. The result: 1.8 TB/s of total bidirectional bandwidth per GPU.
To put that number in context:
| Interconnect | Per-GPU bandwidth (bidirectional) |
|---|---|
| PCIe Gen 5 x16 | ~128 GB/s |
| NVLink 4.0 (Hopper H100) | 900 GB/s |
| NVLink 5.0 (Blackwell B200/GB200) | 1,800 GB/s |
This is not a minor generational bump. It fundamentally changes the economics of tensor parallelism and large-model inference by relieving the memory-copy bottlenecks that made multi-GPU coordination painful on previous hardware.
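The ratios quoted above follow directly from the per-link figures. The sketch below simply redoes that arithmetic; the constant names are illustrative, and every value is the bidirectional figure given in the text.

```python
# All figures are bidirectional GB/s, taken from the text above.
LINKS_PER_GPU = 18
GBPS_PER_LINK = 100      # NVLink 5.0, per link
NVLINK4_GBPS = 900       # Hopper H100, per GPU
PCIE5_X16_GBPS = 128     # approximate


def nvlink5_per_gpu_gbps() -> int:
    """Per-GPU NVLink 5.0 bandwidth: links times per-link bandwidth."""
    return LINKS_PER_GPU * GBPS_PER_LINK


def speedup(new_gbps: float, old_gbps: float) -> float:
    """How many times faster the new interconnect is than the old one."""
    return new_gbps / old_gbps


if __name__ == "__main__":
    total = nvlink5_per_gpu_gbps()
    print(f"NVLink 5.0 per GPU: {total} GB/s")
    print(f"vs NVLink 4.0:      {speedup(total, NVLINK4_GBPS):.2f}x")
    print(f"vs PCIe Gen 5 x16:  {speedup(total, PCIE5_X16_GBPS):.2f}x")
```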
1.2 NVLink-C2C: The CPU-GPU Coherent Link
On the GB200 Grace Blackwell Superchip, there is a second, distinct interconnect you must understand: NVLink-C2C (Chip-to-Chip). This connects the Grace ARM CPU directly to the B200 GPU at 900 GB/s with full memory coherence, replacing the PCIe bus that ties x86 CPUs to GPUs in traditional server architectures.
The practical outcome: GPU code can directly address CPU memory (and vice versa) without explicit copy operations. This is qualitatively different from how HGX B200 systems work, where the x86 host CPU still communicates via PCIe Gen 6.
Know your platform:
- HGX B200 (8 GPUs, x86 CPU, PCIe Gen 6): NVLink 5.0 within the 8-GPU board; host to GPU via PCIe.
- GB200 NVL72 (72 GPUs, 36 Grace ARM CPUs): NVLink 5.0 across the full rack; CPU to GPU via NVLink-C2C.
1.3 The NVLink Switch: How the GB200 NVL72 Fabric Works
In the GB200 NVL72, there is no direct GPU-to-GPU copper trace between compute nodes. Every GPU connects to the NVLink fabric exclusively through NVSwitch ASICs housed in dedicated switch trays.
The physical layout:
- 18 compute trays, each containing 4 GPUs (2 Superchips per tray)
- 9 NVLink Switch trays, each housing 2 NVSwitch 4.0 ASICs and offering 144 NVLink ports at 100 GB/s per port
Each B200 GPU connects to all 9 NVSwitch boards via 2 NVLink links per board, creating a fully non-blocking crossbar topology. Any GPU can communicate with any other GPU at full 1,800 GB/s simultaneously without any other GPU's traffic reducing available bandwidth.
This flat, one-hop topology means that within a single NVL72 rack, there is no hierarchy. GPU 0 and GPU 71 are topologically identical peers.
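A quick way to sanity-check the layout described above is to confirm that the GPU-side link count matches the switch-side port count. The following is a minimal sketch using only the figures stated in this section; the function names are made up for illustration.

```python
# Rack figures as stated in this section.
GPUS = 72
LINKS_PER_GPU = 18
SWITCH_TRAYS = 9
PORTS_PER_TRAY = 144
GBPS_PER_LINK = 100  # bidirectional


def links_per_gpu_per_tray() -> int:
    """Each GPU spreads its links evenly across the switch trays."""
    return LINKS_PER_GPU // SWITCH_TRAYS


def gpu_side_links() -> int:
    """Total NVLink links leaving all GPUs in the rack."""
    return GPUS * LINKS_PER_GPU


def switch_side_ports() -> int:
    """Total NVLink ports offered by all switch trays."""
    return SWITCH_TRAYS * PORTS_PER_TRAY


def aggregate_tbps() -> float:
    """Aggregate fabric bandwidth in TB/s (rounds to the quoted 130 TB/s)."""
    return gpu_side_links() * GBPS_PER_LINK / 1000


if __name__ == "__main__":
    print("links per GPU per tray:", links_per_gpu_per_tray())
    print("GPU-side links:", gpu_side_links())
    print("switch-side ports:", switch_side_ports())
    print("aggregate TB/s:", aggregate_tbps())
```

Every GPU link lands on exactly one switch port, which is consistent with the non-blocking, single-hop description.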
1.4 Scaling Beyond One Rack: The 576-GPU Topology
For clusters that span multiple racks, NVIDIA supports a two-tier fat-tree topology connecting up to 576 Blackwell GPUs in a single non-blocking NVLink domain. This uses:
- 288 Tier-1 NVSwitch ASICs (in compute racks, 1U switch trays)
- 144 Tier-2 NVSwitch ASICs (in dedicated NVLink switch trays)
Operating at this scale requires careful fabric planning and the full NVIDIA Mission Control management stack. For most engineering teams, the single-rack NVL72 domain is where day-to-day operational focus belongs.
Part 2: Pre-Deployment Checklist
Getting the hardware right before the software configuration saves hours of debugging. Work through these checks before powering on any compute trays.
Step 1: Verify Your Power and Cooling Infrastructure
A GB200 NVL72 rack draws up to 120 kW under full load. This is not a number to work around; it is a hard prerequisite.
- Confirm your facility power delivery supports the rack's 48V DC bus and peak current draw.
- Liquid cooling is mandatory. Each B200 GPU draws up to roughly 1,000 W, which makes air cooling impractical at this rack-level power density.
- Verify that all liquid cooling manifolds are properly connected and that leak detection sensors are reporting correct values via BMC before proceeding.
Operational warning from 2025 deployments: Early GB200 racks suffered coolant quick-disconnect leaks and NVLink initialization glitches. Inspect all coolant connections physically before first power-on, and verify that BMC leak detector voltage values are not clamped (a known issue; see the NVIDIA DGX GB200 NVL72 release notes).
Step 2: Bring Up NVLink Switches Before Compute Trays
This is the most commonly skipped step, and skipping it forces a full restart of all compute nodes.
Per NVIDIA's official bring-up guide: bring up the NVLink Switch trays first so the NVLink domain is already configured and operational when compute trays power on. If compute trays initialize before the switch fabric is ready, all nodes must be restarted to establish proper NVLink communication.
# Verify NVLink Switch fabric status via Redfish
curl -sk -u admin:password \
https://<nvswitch-bmc-ip>/redfish/v1/Systems/1/FabricAdapters/GPU_SXM_1/Ports \
| python3 -m json.tool | grep -E '"LinkState"|"LinkStatus"'
Expected output for a healthy switch port:
"LinkState": "Enabled",
"LinkStatus": "LinkUp"
Step 3: Confirm Firmware Versions Match the SBOM
Before running any workloads, ensure all firmware components match the Software Bill of Materials (SBOM) provided with your system.
# Check NVSwitch OS (NVOS) version on each switch tray
nv show system version
# Check GPU driver version from a compute node
nvidia-smi --query-gpu=driver_version --format=csv,noheader
For B200 systems, ensure NVSwitch firmware is at version 35.2014.4770 or later. Earlier versions have a confirmed bug where the default NVSwitch power profile causes significantly degraded NCCL ALL-REDUCE performance.
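Version strings like these should be compared numerically, field by field, rather than as plain text. A minimal sketch, assuming the firmware version is reported as dot-separated integers in the same shape as the value quoted above; the helper names are hypothetical.

```python
MINIMUM_NVSWITCH_FW = "35.2014.4770"  # minimum version quoted in the text


def parse_version(text: str) -> tuple[int, ...]:
    """Turn '35.2014.4770' into (35, 2014, 4770) for numeric comparison."""
    return tuple(int(part) for part in text.strip().split("."))


def meets_minimum(installed: str, minimum: str = MINIMUM_NVSWITCH_FW) -> bool:
    """True when the installed version is at or above the minimum."""
    return parse_version(installed) >= parse_version(minimum)


if __name__ == "__main__":
    for candidate in ("35.2014.4769", "35.2014.4770", "35.2014.10000"):
        print(candidate, "ok" if meets_minimum(candidate) else "too old")
```

Tuple comparison avoids the classic string-comparison trap where "10000" would sort before "4770".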
Part 3: Verifying NVLink Health on Blackwell
Once the rack is powered on and the operating system is up on your compute nodes, your first operational task is confirming the NVLink fabric is fully healthy. Do not skip this step before running any production workload.
Step 4: Check NVLink Status with nvidia-smi
# Check NVLink status for all links on GPU 0
nvidia-smi nvlink -i 0 -s
# Check NVLink counters (errors, replay counts) for GPU 0
nvidia-smi nvlink -i 0 -e
# Check NVLink capabilities for GPU 0
nvidia-smi nvlink -i 0 -c
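A healthy output for -s on a Blackwell GPU in an NVL72 will show all 18 links as active. The exact formatting varies with driver version (some drivers print the per-link speed instead of a status word), but the shape is: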
GPU 00000000:01:00.0
Link 0: Active
Link 1: Active
Link 2: Active
...
Link 17: Active
Any link showing Inactive or Disabled requires investigation before you run multi-GPU workloads.
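On a rack with 72 GPUs it is easier to check link status with a script than by eye. The sketch below parses text in the "Link N: status" shape shown above and reports links that look down. The set of "down" markers is an assumption, and the parser deliberately accepts any other status text (a status word or a speed value) as up.

```python
import re

# Matches lines such as "Link 3: Active" or "Link 3: <inactive>".
LINK_LINE = re.compile(r"^\s*Link\s+(\d+):\s*(.+?)\s*$")
DOWN_MARKERS = ("inactive", "disabled")  # assumed markers for a down link
EXPECTED_LINKS = 18  # per Blackwell GPU, from the text


def parse_links(output: str) -> dict[int, str]:
    """Map link number -> reported status text."""
    links = {}
    for line in output.splitlines():
        match = LINK_LINE.match(line)
        if match:
            links[int(match.group(1))] = match.group(2)
    return links


def down_links(output: str) -> list[int]:
    """Return link numbers whose status text marks them as down."""
    return sorted(
        number
        for number, status in parse_links(output).items()
        if any(marker in status.lower() for marker in DOWN_MARKERS)
    )


def gpu_is_healthy(output: str) -> bool:
    """Healthy means all expected links are present and none is down."""
    return len(parse_links(output)) == EXPECTED_LINKS and not down_links(output)
```

In practice the `output` string would come from capturing `nvidia-smi nvlink -i <gpu_id> -s` for each GPU.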
Step 5: Verify NVLink Fabric Status
On GB200 NVL72 systems, the fabric manager exposes a higher level fabric status. After GPU persistence mode is enabled:
# Enable persistence mode (required for stable fabric status reporting)
nvidia-smi -pm 1
# Check fabric manager status
nvidia-smi --query-gpu=index,name,fabric.state,fabric.status \
--format=csv,noheader
# Wait ~10 seconds after enabling persistence before checking fabric status
# Insufficient Resources reported immediately after reset is a known transient issue
Expected healthy output:
0, NVIDIA B200, Completed, Success
1, NVIDIA B200, Completed, Success
...
71, NVIDIA B200, Completed, Success
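The same check can be scripted. This is a minimal sketch that assumes the four-column CSV shape shown above (index, name, fabric.state, fabric.status) and returns every GPU that is not in the Completed, Success state; the function name is illustrative.

```python
import csv
import io

HEALTHY = ("Completed", "Success")


def unhealthy_gpus(csv_text: str) -> dict[str, tuple[str, str]]:
    """Map GPU index -> (state, status) for every GPU that is not healthy."""
    bad = {}
    reader = csv.reader(io.StringIO(csv_text), skipinitialspace=True)
    for row in reader:
        if len(row) != 4:
            continue  # skip blank or malformed lines
        index, _name, state, status = (cell.strip() for cell in row)
        if (state, status) != HEALTHY:
            bad[index] = (state, status)
    return bad
```

An empty result means every reported GPU completed fabric registration successfully.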
If fabric.status shows Insufficient Resources persistently (not just immediately after a reset), reset the affected GPUs and check again:
nvidia-smi --gpu-reset -i <gpu_id>
If the reset fails with Xid errors like GSP Timeout, a full rack-level power cycle is required. This is a known issue on DGX GB200 NVL72 and is documented in NVIDIA's official release notes.
Step 6: Check NVLink Topology
This command displays the NVLink peer connectivity matrix and confirms that the expected all-to-all topology is intact:
nvidia-smi topo -m
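On an NVL72 compute node, every GPU pair in the matrix should show NV18 (18 NVLink connections via NVSwitch). The matrix covers only the GPUs visible to the local node. If any pair shows NODE, PHB, or SYS instead, the NVLink fabric is not complete for that pair.
# Flag any local GPU pair that is not connected over NVLink
nvidia-smi topo -m | python3 -c "
import re, sys
# Keep only the matrix rows, which start with a GPU label such as GPU0
rows = [line.split() for line in sys.stdin if re.match(r'GPU\d+\s', line)]
n = len(rows)  # GPU columns come first, so the GPU block is n x n
for row in rows:
    peers = row[1:1 + n]
    bad = [p for p in peers if p != 'X' and not p.startswith('NV')]
    print(row[0], 'ok' if not bad else 'non-NVLink peers: ' + ','.join(bad))
"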
Part 4: Configuring IMEX for Multi-Node NVLink (MNNVL)
This is the most technically nuanced part of operating a Blackwell NVL72 cluster, and the area where the most operational errors occur. NVIDIA's Internode Memory Exchange (IMEX) service is what enables GPUs across different compute trays within the same NVLink domain to directly access each other's memory over NVLink.
Without IMEX, cross-node communication falls back to InfiniBand or Ethernet, dropping bandwidth by roughly 4–8× within the same rack.
Step 7: Configure the IMEX Nodes Configuration File
On every compute node in the NVLink domain, create the IMEX nodes configuration file listing the IP addresses of all nodes that will participate:
# Create the IMEX configuration directory if it does not exist
sudo mkdir -p /etc/nvidia-imex
# Create the nodes config file
# List the management IP of every compute node in the NVLink domain
sudo tee /etc/nvidia-imex/nodes_config.cfg > /dev/null << 'EOF'
# Node IPs for this NVLink domain
192.168.1.10
192.168.1.11
192.168.1.12
192.168.1.13
# ... add all node IPs in the domain
EOF
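Hand-editing 18 addresses on 18 nodes invites typos and duplicates. One option is to render the file from a single list and distribute the result; keeping the file identical on every node is usually the simplest approach. The sketch below is a hypothetical helper, not part of IMEX, that validates the addresses and produces the file body in the format shown above.

```python
import ipaddress

HEADER = "# Node IPs for this NVLink domain"


def render_nodes_config(ips: list[str]) -> str:
    """Validate the node IPs and return the nodes_config.cfg contents."""
    seen = set()
    lines = [HEADER]
    for raw in ips:
        # ip_address() raises ValueError on anything that is not a valid IP
        ip = str(ipaddress.ip_address(raw.strip()))
        if ip in seen:
            raise ValueError(f"duplicate node IP: {ip}")
        seen.add(ip)
        lines.append(ip)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Example addresses from the text; replace with your own node list.
    print(render_nodes_config(["192.168.1.10", "192.168.1.11"]), end="")
```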
Step 8: Enable and Start the IMEX Service
# Enable the IMEX service to start at boot
sudo systemctl enable nvidia-imex
# Start the IMEX service
sudo systemctl start nvidia-imex
# Verify it started correctly
sudo systemctl status nvidia-imex
# Check the IMEX log for successful initialization
sudo tail -n 50 /var/log/nvidia-imex.log
A successful IMEX initialization log will show:
[INFO] IMEX version 570.124.06 is running
[INFO] Identified this node as ID 0, using bind IP of '192.168.1.10'
[INFO] NvGpu Library version matched with GPU Driver version
Best practice: Run IMEX per job rather than globally. Per-job IMEX isolation means a job run by one user cannot read GPU memory mapped by a different user's concurrent job. This is a security requirement, not merely a performance suggestion.
Part 5: Configuring Slurm for Topology-Aware Scheduling
On a bare-metal or dedicated GPU server cluster, the Slurm workload manager must understand the NVLink domain topology to avoid placing a single job across domain boundaries. Crossing a boundary within the same cluster (for example, assigning nodes from two different NVL72 racks to the same job without correct topology awareness) drops bandwidth from ~800 GB/s (intra-rack NVLink) to 100–200 GB/s (inter-rack InfiniBand). This is the bandwidth cliff that derails production AI training jobs.
Step 9: Configure the Slurm Block Topology Plugin
The topology/block plugin, introduced in Slurm 24.05, treats NVLink domains as rigid scheduling blocks. Add the following to /etc/slurm/slurm.conf:
# Enable block topology for NVLink domain awareness
TopologyPlugin=topology/block
# The block size is not set here; it is declared with BlockSizes in topology.conf
Then define the blocks in /etc/slurm/topology.conf, one block per NVLink domain:
# For GB200 NVL72: 18 nodes per rack = 18 nodes per NVLink domain block
BlockName=rack01 Nodes=node[001-018]
BlockName=rack02 Nodes=node[019-036]
# Add additional racks as needed
# Base block size used for planning: one full rack
BlockSizes=18
Reload the Slurm configuration:
sudo scontrol reconfigure
Step 10: Enable the IMEX Slurm Plugin
For per-job IMEX isolation, add the IMEX switch plugin to your slurm.conf:
# Add to /etc/slurm/slurm.conf
SwitchType=switch/nvidia_imex
This plugin (introduced in Slurm 24.05) manages IMEX channel allocation per-job, providing driver-level isolation and preventing accidental cross-job NVLink memory access.
Step 11: Submit a Topology-Aware Job
When submitting jobs that require the full NVLink fabric of one or more racks:
# Submit a job constrained to a single NVLink domain (one NVL72 rack)
sbatch --nodes=18 --ntasks-per-node=4 \
--constraint=rack01 \
--segment=18 \
my_training_job.sh
The --segment=18 argument asks the scheduler to place all 18 nodes as one segment inside a single NVLink block. A smaller segment size (for example --segment=1) allows each segment to land in a different block, which can silently degrade performance. Note that --constraint=rack01 only works if rack01 is also defined as a node feature on those nodes.
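To make the placement rule concrete, the sketch below checks whether a job's node list stays inside one NVLink block. The block map mirrors the two example racks used in this section; the function names are hypothetical and this is not a Slurm API.

```python
# Example block map: two NVL72 racks of 18 nodes each, as in the text.
BLOCKS = {
    "rack01": [f"node{i:03d}" for i in range(1, 19)],
    "rack02": [f"node{i:03d}" for i in range(19, 37)],
}


def blocks_used(job_nodes: list[str], blocks: dict = BLOCKS) -> set[str]:
    """Return the set of NVLink blocks touched by a job's nodes."""
    owner = {node: name for name, nodes in blocks.items() for node in nodes}
    unknown = [node for node in job_nodes if node not in owner]
    if unknown:
        raise ValueError(f"nodes not in any block: {unknown}")
    return {owner[node] for node in job_nodes}


def crosses_domain(job_nodes: list[str], blocks: dict = BLOCKS) -> bool:
    """True when the job spans more than one NVLink domain."""
    return len(blocks_used(job_nodes, blocks)) > 1


if __name__ == "__main__":
    print(crosses_domain(["node001", "node018"]))  # same rack
    print(crosses_domain(["node018", "node019"]))  # straddles two racks
```

A check like this can run in a job prolog against the allocated node list to catch a cross-domain allocation before the training run starts.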
Part 6: Validating NVLink Performance with NCCL
After hardware verification, IMEX configuration, and Slurm topology setup, validate that you are actually achieving the expected NVLink bandwidth before running production workloads. This step is not optional.
Step 12: Install and Run NCCL Tests
# Clone and build nccl-tests
git clone https://github.com/NVIDIA/nccl-tests.git
cd nccl-tests
make MPI=1 MPI_HOME=/usr/local/mpi CUDA_HOME=/usr/local/cuda
# Run all-reduce test across all GPUs in the NVLink domain
# For a full NVL72 rack (72 GPUs across 18 nodes):
mpirun -np 72 -N 4 \
--hostfile /etc/slurm/hostfile_rack01 \
./build/all_reduce_perf \
-b 1G -e 8G -f 2 -g 1
Illustrative results for intra-rack MNNVL (NVL72, 72 GPUs); exact figures depend on message size, NCCL version, and algorithm selection:
# out-of-place in-place
# size count type redop root time algbw busbw #wrong time algbw busbw #wrong
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
1073741824 268435456 float sum -1 1245.3 861.7 849.8 0 1241.1 865.1 852.7 0
An average bus bandwidth of ≥800 GB/s confirms that MNNVL is operational. If you see 100–200 GB/s, IMEX is not correctly configured or nodes are crossing domain boundaries.
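For automated acceptance testing, the result rows can be parsed and compared against the thresholds above. This is a sketch that assumes the 13-column nccl-tests row layout shown in the header lines; the 800 and 200 GB/s cut-offs come from the text, while the labels and the in-between "investigate" band are assumptions made for illustration.

```python
def parse_result_row(line: str) -> dict:
    """Extract size and bus bandwidth from one nccl-tests result row."""
    fields = line.split()
    if len(fields) != 13:
        raise ValueError(f"expected 13 columns, got {len(fields)}")
    return {
        "size_bytes": int(fields[0]),
        "out_of_place_busbw": float(fields[7]),   # GB/s
        "in_place_busbw": float(fields[11]),      # GB/s
    }


def classify(busbw_gbps: float) -> str:
    """Label a bus bandwidth figure using the thresholds from the text."""
    if busbw_gbps >= 800:
        return "healthy"
    if busbw_gbps <= 200:
        return "likely-fallback"  # IMEX misconfigured or job spans domains
    return "investigate"


if __name__ == "__main__":
    row = ("1073741824 268435456 float sum -1 "
           "1245.3 861.7 849.8 0 1241.1 865.1 852.7 0")
    result = parse_result_row(row)
    print(result["out_of_place_busbw"], classify(result["out_of_place_busbw"]))
```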
Step 13: Verify NCCL Is Using MNNVL
Set the NCCL debug level to confirm MNNVL detection:
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT
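# Run a quick 2-rank test, one rank per node, so the traffic actually crosses nodes
# (MNNVL is enabled by default since NCCL 2.25.2)
mpirun -np 2 -N 1 --hostfile /etc/slurm/hostfile_rack01 \
./build/all_reduce_perf -b 1G -e 1G -f 2 -g 1
Look for a line mentioning MNNVL in the output to confirm it is active. The exact wording varies between NCCL versions; for example:
NCCL INFO MNNVL enabled: Using NVLink domain for inter-node communication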
If you need to explicitly disable MNNVL for debugging or comparison:
export NCCL_MNNVL_ENABLE=0
Important: NCCL 2.25.2 or later is required for MNNVL support on GB200 systems. On earlier NCCL versions, inter-node communication will silently fall back to InfiniBand or Ethernet regardless of IMEX status.
Part 7: Choosing the Right Blackwell NVLink Configuration for Your Workload
Not every workload benefits equally from NVLink 5.0, and not every team needs an NVL72. Here is a clear decision framework.
HGX B200 (8-GPU, NVLink-only within node)
Best for:
- LLM inference serving at moderate scale (models up to ~70B parameters with standard precision)
- Research workloads where a single 8-GPU node provides sufficient parallelism
- Teams that need Blackwell GPU server performance without the infrastructure demands of liquid cooling and 120 kW racks
NVLink profile: 8-GPU all-to-all NVLink domain within the server. No NVSwitch across multiple nodes.
GB200 NVL72 (72-GPU rack, full MNNVL)
Best for:
- Training or fine-tuning large MoE models (hundreds of billions to trillions of parameters)
- Real-time inference on trillion-parameter LLMs where latency is mission-critical
- Workloads that need up to roughly 13 TB of GPU memory addressable within one NVLink domain (the NVL72's total HBM capacity)
NVLink profile: 72-GPU flat NVLink domain, 130 TB/s aggregate, fully non-blocking.
Multi-Rack NVLink Domain (up to 576 GPUs)
Best for:
- Foundation model pre-training at scale
- Research requiring the equivalent of a single massive GPU memory space across hundreds of B200 chips
NVLink profile: Two-tier fat-tree topology, requires dedicated NVSwitch tray infrastructure between racks.
Part 8: Operating on Dedicated GPU Infrastructure
For teams who need Blackwell NVLink performance without the capital expense of purchasing NVL72 racks outright, dedicated GPU servers on a specialized platform give you bare metal access without the overhead of public cloud GPU instances.
GPUYard is a GPU dedicated server platform that provides bare metal access to professional-grade GPU servers, including Blackwell architecture hardware. For teams validating NVLink configurations, running NCCL benchmarks, or building distributed training infrastructure, bare metal dedicated servers give you the direct hardware access you need: no hypervisor layer, no shared tenancy, no artificial bandwidth throttling on the GPU interconnect fabric.
When evaluating any dedicated GPU server provider for Blackwell NVLink workloads, confirm:
- Direct NVLink topology access (not virtualized)
- Support for nvidia-smi fabric management commands
- The ability to configure IMEX and Slurm topology plugins at the OS level
- Liquid cooling infrastructure for high-density Blackwell configurations
Part 9: Troubleshooting Common NVLink Issues on Blackwell
These are the real-world issues engineers actually encounter, not the theoretical ones that appear in generic guides.
Issue: fabric.status Reports "Insufficient Resources"
Cause: Often a transient state immediately after GPU reset or system startup. The fabric manager needs ~10 seconds after nvidia-persistenced is running.
Fix:
# Wait 10–15 seconds after enabling persistence mode
nvidia-smi -pm 1
sleep 15
nvidia-smi --query-gpu=index,fabric.state,fabric.status --format=csv,noheader
If still showing after 30 seconds, reset the affected GPU:
nvidia-smi --gpu-reset -i <gpu_id>
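The wait-then-recheck pattern above can be wrapped in a small polling loop so that a transient state is not mistaken for a persistent one. This is a minimal sketch: `query` is a hypothetical callable you supply (for example, one that shells out to the nvidia-smi query above and returns one (fabric.state, fabric.status) pair per GPU), and the 30-second timeout mirrors the guidance in the text.

```python
import time
from typing import Callable

HEALTHY = ("Completed", "Success")


def wait_for_fabric(
    query: Callable[[], list[tuple[str, str]]],
    timeout_s: float = 30.0,
    interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until every GPU reports a healthy fabric state or time runs out."""
    deadline = time.monotonic() + timeout_s
    while True:
        results = query()
        # An empty result is treated as "not ready", never as healthy
        if results and all(pair == HEALTHY for pair in results):
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)
```

A False return after the timeout is the point at which the GPU reset described above becomes the next step.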
Issue: NCCL ALL-REDUCE Performance Significantly Below 800 GB/s
Most common causes:
- NCCL version below 2.25.2 (no MNNVL support)
- IMEX not running or misconfigured
- Nodes allocated across NVLink domain boundaries
- NVSwitch firmware bug (fixed in NVSwitch version 35.2014.4770)
Diagnostic commands:
# Check NCCL version
python3 -c "import torch; print(torch.cuda.nccl.version())"
# Verify IMEX service status on all nodes
pdsh -w node[001-018] systemctl status nvidia-imex | grep Active
# Confirm Slurm block topology is in effect
scontrol show topology
Issue: NVLink GSP Timeout / Xid 145 After Partial Switch Tray Reboot
Cause: If a subset of NVSwitch trays is rebooted while an NVLink SHARP workload is actively running, GPU Xid 145 errors can appear and require a full rack power cycle to recover.
Fix: Full AC or DC power cycle of the rack. There is no software workaround for this specific condition. Future NVOS releases from NVIDIA are expected to address this.
Issue: NCCL Performance Slow After Long Uptime (60+ Days)
Cause: After extended uptime (~60+ days), a known issue in DGX B200 firmware can cause NVLink task scheduling behavior that leads to GPU driver hangs.
Fix: Apply the latest DGX B200 firmware update, which includes the fix. If already impacted, a system reboot resolves the immediate symptom.
Frequently Asked Questions (FAQ)
- Q: What is NVLink 5.0 bandwidth on Blackwell GPU servers?
A: NVLink 5.0 on Blackwell (B200/GB200) provides 1,800 GB/s (1.8 TB/s) bidirectional bandwidth per GPU via 18 links operating at 100 GB/s each. This is 2× the 900 GB/s of NVLink 4.0 on Hopper (H100) and over 14× the bandwidth of PCIe Gen 5.
- Q: How many GPUs can be connected with NVLink on Blackwell?
A: A single NVLink 5.0 domain can connect up to 576 Blackwell GPUs in a non-blocking compute fabric using a two-tier fat-tree topology. A single GB200 NVL72 rack connects 72 GPUs in a flat single-hop domain with 130 TB/s aggregate bandwidth.
- Q: What is IMEX and why is it required for Blackwell NVLink?
A: IMEX (Internode Memory Exchange Service) is an NVIDIA service that enables GPUs in different compute nodes within the same NVLink domain to directly access each other's HBM memory over NVLink. Without IMEX running and correctly configured on all nodes, inter-node communication falls back to InfiniBand or Ethernet, reducing bandwidth by 4–8×. It is a required component of Multi-Node NVLink systems such as the GB200 NVL72.
- Q: What version of NCCL is required for Multi-Node NVLink (MNNVL) on GB200?
A: NCCL 2.25.2 or later is required for MNNVL support on GB200 systems. On earlier versions, NCCL will silently fall back to the available network (InfiniBand or Ethernet) without using the NVLink fabric for inter-node traffic.
- Q: What is the difference between GB200 NVL72 and HGX B200 for NVLink workloads?
A: The HGX B200 connects 8 B200 GPUs via NVLink within a single server node; communication between multiple HGX B200 servers uses InfiniBand or Ethernet. The GB200 NVL72 extends NVLink across an entire rack of 72 GPUs through NVSwitch ASICs, creating a single unified NVLink domain with full 1.8 TB/s GPU-to-GPU bandwidth between any pair of GPUs in the rack without traversing any external network fabric.
- Q: What bandwidth should I expect from NCCL ALL-REDUCE on an NVL72 with MNNVL enabled?
A: With MNNVL correctly configured (IMEX running, NCCL ≥ 2.25.2, nodes within the same NVLink domain), expect average bus bandwidth of ≥800 GB/s for large all-reduce operations across all 72 GPUs. If you see 100–200 GB/s, the likely cause is either IMEX misconfiguration or nodes placed across rack boundaries.
- Q: Does NVLink 5.0 on Blackwell support memory coherency between GPU and CPU?
A: On the GB200 Grace Blackwell Superchip, yes. The NVLink-C2C (Chip-to-Chip) interconnect between the Grace ARM CPU and B200 GPU provides 900 GB/s of coherent bandwidth, allowing GPU kernels to address CPU memory and vice versa without explicit copy operations. On the HGX B200 (x86-based), the CPU-to-GPU interface uses PCIe Gen 6 and does not have this coherency property.
- Q: How do I confirm that my Blackwell GPU's NVLink fabric is healthy before running workloads?
A: Run three checks in sequence: (1) nvidia-smi nvlink -i <gpu_id> -s, where all 18 links must show Active; (2) nvidia-smi --query-gpu=index,fabric.state,fabric.status --format=csv, where all GPUs must show Completed, Success; (3) nvidia-smi topo -m, where all GPU pairs must show NV18 (NVLink via NVSwitch with 18 links).
- Q: What is NVLink-C2C vs standard NVLink?
A: Standard NVLink connects GPU-to-GPU (or GPU-to-NVSwitch). NVLink-C2C is a variant that connects a CPU to a GPU on the same physical package, as in the GB200 Grace Blackwell Superchip. NVLink-C2C provides 900 GB/s of coherent CPU-GPU bandwidth and enables unified memory addressing across both chips. It is not a separate product; it is the specific NVLink implementation used for intra-Superchip CPU-GPU communication.
Conclusion
NVLink 5.0 on Blackwell GPU servers is not simply an incremental upgrade to GPU interconnect; it represents a structural change in how large-scale AI workloads are architected and executed. A 72-GPU GB200 NVL72 rack, operating as a single coherent memory space with 130 TB/s of aggregate fabric bandwidth, makes previously impractical model configurations routine.
But the technology only delivers on its promise when the full software stack is correctly configured. Getting IMEX running, verifying NCCL MNNVL is active, and teaching your workload scheduler to respect NVLink domain boundaries are not optional optimizations. They are the difference between a system that performs at specification and one that silently underperforms at a fraction of its capability.
Work through the steps in this guide in sequence. Verify each layer before moving to the next. And when you hit the issues described in the troubleshooting section, and you will, you will know exactly what to check.
Ready to Deploy on Blackwell GPU Servers?
If you need dedicated, bare metal GPU server access to implement the configurations in this guide, from NVLink fabric verification to full MNNVL workload deployment, GPUYard provides professional GPU dedicated servers built for exactly this kind of demanding infrastructure work.
No shared tenancy. No hypervisor overhead on your GPU interconnect. Direct hardware access with the flexibility to configure your NVLink fabric, IMEX service, and Slurm topology exactly as your workload requires.