Routine GPU checks: steps for testing GPUs with gpu-burn (GPUBURN)
gpu-burn is an essential tool in our baseline server checks.
=========================================
http://wili.cc/blog/gpu-burn.html
https://github.com/wilicc/gpu-burn
=========================================
1. Download the software on Linux
wget -O gpu-burn-master.zip https://codeload.github.com/wilicc/gpu-burn/zip/master
Easy docker build and run
git clone https://github.com/wilicc/gpu-burn
cd gpu-burn
docker build -t gpu_burn .
docker run --rm --gpus all gpu_burn
2. Unzip the archive
unzip gpu-burn-master.zip
3. Enter the directory and build (make sure the CUDA environment variables are set up correctly; nvcc -V should print the version)
cd gpu-burn-master
make
4. After a successful build, a file named gpu_burn is generated in the current directory
gpu_burn
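5. Run with the defaults, which burns all GPUs. The argument after the space is the run time in seconds; we usually use 100 for a quick test and 500 for a stability test
[root@localhost gpu-burn-master]# ./gpu_burn 100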
GPU 0: Tesla V100 (UUID: GPU-6250466c-35ed-c279-fc0b-3b9b613a586f)
GPU 1: Tesla V100 (UUID: GPU-0a4a2b9c-d32c-1ba2-42a0-151ed9907d57)
GPU 2: Tesla V100 (UUID: GPU-f6cf184f-9173-1edd-648f-71e841afe152)
GPU 3: Tesla V100 (UUID: GPU-044f96e6-cc66-cc93-6283-07b829216f91)
Initialized device 2 with 11178 MB of memory (10993 MB available, using 9894 MB of it), using FLOATS
Initialized device 1 with 11178 MB of memory (10993 MB available, using 9894 MB of it), using FLOATS
Initialized device 3 with 11178 MB of memory (10993 MB available, using 9894 MB of it), using FLOATS
Initialized device 0 with 11178 MB of memory (10993 MB available, using 9894 MB of it), using FLOATS
6. You can run on specific cards only, for example cards 0 and 1
export CUDA_VISIBLE_DEVICES=0,1
./gpu_burn 100
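One way to check the step 5 output automatically is to confirm that every expected GPU printed an "Initialized device" line. The sketch below assumes the run was saved to a log file (burn.log is a hypothetical name) and that the line format matches the sample output above. missing_devices is an illustrative helper, not part of gpu-burn.

```shell
# Hypothetical helper: list GPU indices that never printed an
# "Initialized device N ..." line. Reads a gpu_burn log on stdin.
# $1 = number of GPUs expected in the machine.
missing_devices() {
    awk -v expected="$1" '
        # Field 3 is the device index, e.g. "Initialized device 2 with ..."
        /^Initialized device [0-9]+ / { seen[$3] = 1 }
        END {
            for (i = 0; i < expected; i++)
                if (!(i in seen)) print i
        }
    '
}
```

For example, after ./gpu_burn 100 | tee burn.log, running missing_devices 4 < burn.log prints nothing when all four cards initialized, and prints the index of any card that did not.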
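How to find a faulty card
1. Run dmesg -l err to pick out the Bus-Id of the card reporting errors
2. Use the Bus-Id to find the matching GPU index and exclude that card when running the test. For example, on a machine with 8 cards where device 2 is faulty, set the variable like this:
export CUDA_VISIBLE_DEVICES=0,1,3,4,5,6,7 # 2 is left out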
./gpu_burn 100
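3. After the run finishes, shut the machine down and find the card that has not warmed up (it sat idle during the test); that is the faulty card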
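Steps 1 and 2 above can be sketched as two small shell helpers. This is a minimal sketch under stated assumptions: the index-to-Bus-Id table comes from nvidia-smi --query-gpu=index,pci.bus_id --format=csv,noheader, and CUDA enumerates the cards in the same PCI order (setting CUDA_DEVICE_ORDER=PCI_BUS_ID is the usual way to get that). The function names and the Bus-Id fragment 3b:00 are hypothetical.

```shell
# Hypothetical helper: print the GPU index whose Bus-Id contains the given
# fragment (case-insensitive). Reads "index, pci.bus_id" lines on stdin.
gpu_index_for_bus_id() {
    awk -F', *' -v want="$1" 'index(tolower($2), tolower(want)) { print $1 }'
}

# Hypothetical helper: build a CUDA_VISIBLE_DEVICES value that lists
# indices 0..TOTAL-1 except the faulty one.
# $1 = total number of GPUs, $2 = index to leave out.
visible_devices_excluding() {
    total=$1
    bad=$2
    list=""
    i=0
    while [ "$i" -lt "$total" ]; do
        if [ "$i" -ne "$bad" ]; then
            list="${list:+$list,}$i"   # append with a comma unless list is empty
        fi
        i=$((i + 1))
    done
    printf '%s\n' "$list"
}

# Example wiring (needs an NVIDIA driver; not run here):
#   bad=$(nvidia-smi --query-gpu=index,pci.bus_id --format=csv,noheader \
#         | gpu_index_for_bus_id 3b:00)
#   export CUDA_VISIBLE_DEVICES=$(visible_devices_excluding 8 "$bad")
#   ./gpu_burn 100
```

The match is a case-insensitive substring test because dmesg and nvidia-smi may print the same Bus-Id with different letter case and a different number of leading zeros.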
==============================================================
Building
To build GPU Burn:
make
To remove artifacts built by GPU Burn:
make clean
GPU Burn builds with a default Compute Capability of 5.0. To override this with a different value:
make COMPUTE=
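As a hedged illustration of picking a COMPUTE value: recent nvidia-smi versions can report a card's compute capability through the compute_cap query field, printed in dotted form such as 7.0 (the value for a Tesla V100). The sketch below strips the dot to get the two-digit form. Whether the Makefile expects 70 or 7.0 is an assumption to check against the Makefile in your checkout, and compute_to_make_arg is a hypothetical helper name.

```shell
# Hypothetical helper: turn a dotted compute capability ("7.0") into the
# dot-free form ("70"). Assumption: the Makefile accepts the dot-free form.
compute_to_make_arg() {
    printf '%s\n' "$1" | tr -d '.'
}

# Example wiring (needs an NVIDIA driver; not run here):
#   cap=$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader | head -n 1)
#   make COMPUTE="$(compute_to_make_arg "$cap")"
```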
CFLAGS can be added when invoking make to add to the default list of compiler flags:
make CFLAGS=-Wall
LDFLAGS can be added when invoking make to add to the default list of linker flags:
make LDFLAGS=-lmylib
NVCCFLAGS can be added when invoking make to add to the default list of nvcc flags:
make NVCCFLAGS=-ccbin
CUDAPATH can be added to point to a non standard install or specific version of the cuda toolkit (default is /usr/local/cuda):
make CUDAPATH=/usr/local/cuda-
CCPATH can be specified to point to a specific gcc (default is /usr/bin):
make CCPATH=/usr/local/bin
CUDA_VERSION and IMAGE_DISTRO can be used to override the base images used when building the Docker image target, while IMAGE_NAME can be set to change the resulting image tag:
make IMAGE_NAME=myregistry.private.com/gpu-burn CUDA_VERSION=12.0.1 IMAGE_DISTRO=ubuntu22.04 image
Usage
GPU Burn
Usage: gpu_burn [OPTIONS] [TIME]

-m X    Use X MB of memory
-m N%   Use N% of the available GPU memory
-d      Use doubles
-tc     Try to use Tensor cores (if available)
-l      List all GPUs in the system
-i N    Execute only on GPU N
-h      Show this help message

Example:
gpu_burn -d 3600
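The -i option listed above gives another way to isolate a suspect card: burn each GPU on its own and see which run fails. The sketch below only prints the commands, so they can be reviewed before being piped to sh. per_gpu_commands, the GPU count, and the ./gpu_burn path are illustrative assumptions.

```shell
# Hypothetical helper: print one gpu_burn command per GPU index.
# $1 = number of GPUs, $2 = run time in seconds for each card.
per_gpu_commands() {
    count=$1
    seconds=$2
    i=0
    while [ "$i" -lt "$count" ]; do
        printf './gpu_burn -i %s %s\n' "$i" "$seconds"
        i=$((i + 1))
    done
}
```

From the build directory, per_gpu_commands 8 100 | sh would then test the eight cards one after another, 100 seconds each.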