
Repost: RTX3090 TensorFlow, NAMD and HPCG Performance on Linux (Preliminary)

2020-10-17 00:37:00 NJTST 193




RTX3090 TensorFlow, NAMD and HPCG Performance on Linux (Preliminary)

Written on September 24, 2020 by Dr Donald Kinghorn
Table of Contents
  1. Introduction

  2. Test system

  3. Results

  4. Performance Charts

  5. Conclusions

Introduction

The second new NVIDIA RTX30 series card, the GeForce RTX3090, has been released.

The RTX3090 is loaded with 24GB of memory making it a good replacement for the RTX Titan... at significantly less cost! The performance for Machine Learning and Molecular Dynamics on the RTX3090 is quite good, as expected.

This post is a follow-on to last week's post on the RTX3080:

RTX3080 TensorFlow and NAMD Performance on Linux (Preliminary)

Testing with the RTX3090 went smoother than with the RTX3080, which had been uncomfortably rushed and problematic.

I was able to use my favorite container platform, NVIDIA Enroot. This is a wonderful user-space tool for running Docker (and other) containers in a user-owned "sandbox" environment. Last week I had some difficulties that were related to an incomplete installation of the driver components. Expect to see a series of posts soon introducing and describing usage of Enroot!

The HPCG (High Performance Conjugate Gradient) benchmark was added for this testing.

The RTX3090 showed the same failures as the RTX3080:

  • TensorFlow 2 failed to run properly with a fatal error in BLAS calls

  • My usual LSTM benchmark failed with mysterious memory allocation errors

  • The ptxas assembler failed to run. This left PTX compilation to the driver, which caused slow startup times for TensorFlow (a few minutes). See the output below:

2020-09-22 11:42:03.984823: W tensorflow/stream_executor/cuda/redzone_allocator.cc:312]
Internal: ptxas exited with non-zero error code 65280, output: ptxas fatal : Value 'sm_86' is not defined for option 'gpu-name'

Relying on driver to perform ptx compilation. This message will be only logged once.

The "sm_86" in this message refers to the compute capability, 8.6, of the GA102 chip. The Ampere GA100 chip has compute capability 8.0, i.e. sm_80.
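The naming convention is easy to see in a few lines of Python. This is only an illustrative sketch: the helper names `sm_name` and `parse_sm` are made up for this example and are not part of CUDA or TensorFlow, and the parser assumes a single-digit minor version.

```python
import re
from typing import Tuple


def sm_name(major: int, minor: int) -> str:
    """Format a compute capability such as (8, 6) as 'sm_86'."""
    return f"sm_{major}{minor}"


def parse_sm(name: str) -> Tuple[int, int]:
    """Split 'sm_86' into (8, 6), assuming a single-digit minor version."""
    match = re.fullmatch(r"sm_(\d+)(\d)", name)
    if match is None:
        raise ValueError(f"not an sm_XY architecture name: {name!r}")
    return int(match.group(1)), int(match.group(2))


# Compute capabilities of the two Ampere chips mentioned in this post.
CHIPS = {"GA100": (8, 0), "GA102": (8, 6)}

for chip, (major, minor) in CHIPS.items():
    print(f"{chip}: compute capability {major}.{minor} -> {sm_name(major, minor)}")
```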

I used containers from NVIDIA NGC for TensorFlow 1.15, NAMD 2.13 and CUDA for HPCG. All of these applications were built with CUDA 11.

The current CUDA 11.0 does not have full support for the GA102 chips (sm_86) used in the RTX3090 and RTX3080.

The results in this post are not optimal for RTX30 series. These are preliminary results that will likely improve with an update to CUDA and the driver.


Test system

Hardware

  • INTEL Xeon 3265W: 24-cores (4.4/3.4 GHz)

  • Motherboard: Asus PRO WS C621-64L SAGE/10G (INTEL C621-64L EATX)

  • Memory: 6x REG ECC DDR4-2933 32GB (192GB total)

  • NVIDIA RTX3090, RTX3080, RTX TITAN and RTX2080Ti

Software

  • Ubuntu 20.04 Linux

  • Enroot 3.3.1

  • NVIDIA Driver Version: 455.23.04

  • nvidia-container-toolkit 1.3.0-1

  • NVIDIA NGC containers:

      • nvcr.io/nvidia/tensorflow:20.08-tf1-py3

      • nvcr.io/hpc/namd:2.13-singlenode

      • nvcr.io/nvidia/cuda:11.0-runtime-ubuntu20.04 (with the addition of OpenMPI 4 for HPCG)

Test Jobs

  • TensorFlow-1.15: ResNet50 v1, fp32 and fp16

  • NAMD-2.13: apoa1, stmv

  • HPCG (High Performance Conjugate Gradient): "HPCG 3.1 Binary for NVIDIA GPUs Including Ampere based on CUDA 11"

Example Command Lines

  • docker run --gpus all --rm -it -v $HOME:/projects nvcr.io/nvidia/tensorflow:20.08-tf1-py3

  • docker run --gpus all --rm -it -v $HOME:/projects nvcr.io/hpc/namd:2.13-singlenode

  • python nvidia-examples/cnn/resnet.py --layers=50 --batch_size=96 --precision=fp32

  • python nvidia-examples/cnn/resnet.py --layers=50 --batch_size=192 --precision=fp16

  • namd2 +p24 +setcpuaffinity +idlepoll +devices 0 apoa1.namd

  • OMP_NUM_THREADS=24 ./xhpcg-3.1_cuda-11_ompi-4.0_sm_60_sm70_sm80

Note: I listed the docker command lines above for reference. I actually ran the containers with Enroot.

Job run info

  • The batch size used for TensorFlow 1.15 ResNet50 v1 was 96 at fp32 and 192 at fp16 for all GPUs except for the RTX3090 which used 192 for both fp32 and fp16 (using batch_size 384 gave worse results!)

  • The HPCG benchmark used defaults with the problem dimensions 256x256x256

HPCG output for RTX3090,

1x1x1 process grid
256x256x256 local domain
SpMV  = 132.1 GF ( 832.1 GB/s Effective)  132.1 GF_per ( 832.1 GB/s Effective)
SymGS = 162.5 GF (1254.3 GB/s Effective)  162.5 GF_per (1254.3 GB/s Effective)
total = 153.8 GF (1166.5 GB/s Effective)  153.8 GF_per (1166.5 GB/s Effective)
final = 145.9 GF (1106.4 GB/s Effective)  145.9 GF_per (1106.4 GB/s Effective)
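If you want to collect these numbers from several runs, the summary lines are simple to parse. The following is a minimal sketch, assuming the output keeps the "name = N GF (M GB/s Effective)" shape shown above; `parse_hpcg_summary` is a hypothetical helper written for this post, not something shipped with the HPCG binary.

```python
import re

# Matches summary lines such as "final = 145.9 GF (1106.4 GB/s Effective) ..."
# and tolerates missing or extra spaces around "=" and inside the parentheses.
LINE_RE = re.compile(r"^\s*(\w+)\s*=\s*([\d.]+)\s*GF\s*\(\s*([\d.]+)\s*GB/s")


def parse_hpcg_summary(text):
    """Return {kernel_name: (gflops, gb_per_s)} for each summary line found."""
    results = {}
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match:
            results[match.group(1)] = (float(match.group(2)), float(match.group(3)))
    return results


# Sample text copied from the RTX3090 run in this post.
SAMPLE = """\
1x1x1 process grid
256x256x256 local domain
SpMV=132.1 GF ( 832.1 GB/s Effective)132.1 GF_per ( 832.1 GB/s Effective)
SymGS =162.5 GF (1254.3 GB/s Effective)162.5 GF_per (1254.3 GB/s Effective)
total =153.8 GF (1166.5 GB/s Effective)153.8 GF_per (1166.5 GB/s Effective)
final =145.9 GF (1106.4 GB/s Effective)145.9 GF_per (1106.4 GB/s Effective)
"""

summary = parse_hpcg_summary(SAMPLE)
gflops, bandwidth = summary["final"]
print(f"final: {gflops} GFLOPS at {bandwidth} GB/s effective")
```

The "final" line is the figure reported as the HPCG result in the table below.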


Results

These results were run on the system, software and GPUs listed above.

Benchmark Job                                 RTX3090          RTX3080          RTX Titan        RTX 2080Ti
TensorFlow 1.15, ResNet50 FP32 (images/sec)   561              462              373              343
TensorFlow 1.15, ResNet50 FP16 (images/sec)   1163             1023             1082             932
NAMD 2.13, Apoa1, day/ns (ns/day)             0.0264 (37.9)    0.0285 (35.1)    0.0306 (32.7)    0.0315 (31.7)
NAMD 2.13, STMV, day/ns (ns/day)              0.3398 (2.94)    0.3400 (2.94)    0.3496 (2.86)    0.3528 (2.83)
HPCG Benchmark 3.1 (GFLOPS)                   145.9            119.3            Not run          93.4

Note that the results using TensorFlow 1.15 are much improved for the older RTX20 series GPUs compared to past testing that I have done using earlier versions of the NGC TensorFlow 1.13 container. This is especially true for the fp16 results. I feel there is a possibility of significantly better results for RTX30 after they have become fully supported.
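A word on the NAMD numbers in the results: NAMD reports day/ns (wall-clock days per simulated nanosecond, lower is better), and the ns/day figures in parentheses are just the reciprocal. A short sketch of that conversion, with the day/ns values copied from the results above (the function name `ns_per_day` is only for illustration):

```python
def ns_per_day(days_per_ns: float) -> float:
    """Convert NAMD's day/ns figure into the more intuitive ns/day."""
    return 1.0 / days_per_ns


# day/ns values copied from the results in this post.
NAMD_DAYS_PER_NS = {
    "apoa1": {"RTX3090": 0.0264, "RTX3080": 0.0285, "RTX Titan": 0.0306, "RTX 2080Ti": 0.0315},
    "stmv": {"RTX3090": 0.3398, "RTX3080": 0.3400, "RTX Titan": 0.3496, "RTX 2080Ti": 0.3528},
}

for job, per_gpu in NAMD_DAYS_PER_NS.items():
    for gpu, days in per_gpu.items():
        print(f"{job:6s} {gpu:11s} {days:.4f} day/ns = {ns_per_day(days):.2f} ns/day")
```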

Performance Charts

Results from past GPU testing are not included since they are not strictly comparable because of improvements in CUDA and TensorFlow.


TensorFlow 1.15 (CUDA11) ResNet50 benchmark. NGC container nvcr.io/nvidia/tensorflow:20.08-tf1-py3

[Chart: ResNet50 FP32 results, images/sec]

The FP32 results show a good performance increase for the RTX30 GPUs, and I expect performance to improve when they are more fully supported.

[Chart: ResNet50 FP16 results, images/sec]

I feel that the FP16 results should be much higher for the RTX30 GPUs, since this should be a strong point. I expect improvement with a CUDA update. The surprising result was how much better the RTX20 GPUs performed with CUDA 11 and TensorFlow 1.15. My older results with CUDA 10 and TensorFlow 1.13 were 653 img/s for the RTX Titan and 532 img/s for the 2080Ti!
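To put rough numbers on these comparisons, here is a small sketch that computes speedup ratios from the images/sec figures reported in this post. The helper names `speedup` and `relative_to` are just for this example; the ratios say nothing beyond the measurements themselves.

```python
# images/sec from the results in this post (TensorFlow 1.15, CUDA 11).
RESNET50 = {
    "fp32": {"RTX3090": 561, "RTX3080": 462, "RTX Titan": 373, "RTX 2080Ti": 343},
    "fp16": {"RTX3090": 1163, "RTX3080": 1023, "RTX Titan": 1082, "RTX 2080Ti": 932},
}

# Older fp16 results quoted above (TensorFlow 1.13, CUDA 10).
OLD_FP16 = {"RTX Titan": 653, "RTX 2080Ti": 532}


def speedup(new: float, old: float) -> float:
    """Throughput ratio; values above 1.0 mean `new` is faster."""
    return new / old


def relative_to(results: dict, baseline: str) -> dict:
    """Express each GPU's throughput as a multiple of the baseline GPU."""
    return {gpu: speedup(value, results[baseline]) for gpu, value in results.items()}


for precision, results in RESNET50.items():
    for gpu, ratio in relative_to(results, "RTX 2080Ti").items():
        print(f"{precision} {gpu:11s} {ratio:.2f}x the RTX 2080Ti")

for gpu, old in OLD_FP16.items():
    new = RESNET50["fp16"][gpu]
    print(f"fp16 {gpu:11s} {speedup(new, old):.2f}x its older TF 1.13 / CUDA 10 result")
```

Comparing against the RTX 2080Ti is an arbitrary choice of baseline; any other card in the dictionary works the same way.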


NAMD 2.13 (CUDA11) apoa1 and stmv benchmarks. NGC container nvcr.io/hpc/namd:2.13-singlenode

[Chart: NAMD apoa1 and stmv results]


These molecular dynamics simulation tests with NAMD are almost surely CPU bound. There needs to be a balance between CPU and GPU. These GPUs are so high performance that even the excellent 24-core Xeon 3265W is probably not enough. I will do more testing at a later time using AMD Threadripper platforms.


HPCG 3.1 (xhpcg-3.1_cuda-11_ompi-4.0_sm_60_sm70_sm80) nvcr.io/nvidia/cuda:11.0-runtime-ubuntu20.04 (with the addition of OpenMPI 4)

[Chart: HPCG 3.1 results, GFLOPS]

I did not have the HPCG benchmark set up when I had access to the RTX Titan. HPCG is an interesting benchmark as it is significantly memory bound. The high performance memory on the GPUs has a large performance impact. The Xeon 3265W yields 14.8 GFLOPS. The RTX3090, at 145.9 GFLOPS, is nearly 10 times that performance!

Conclusions

The new RTX30 series GPUs look to be quite worthy successors to the already excellent RTX20 series GPUs. I am also expecting that the compute performance exhibited in this post will improve significantly after the new GPUs are fully supported with a CUDA and driver update.

I can tell you that some of the nice features on the Ampere Tesla GPUs are not available on the GeForce RTX30 series. There is no MIG (Multi-instance GPU) support, and the double precision floating point performance is very poor compared to the Tesla A100 (I compiled and ran nbody as a quick check). However, for the many applications where fp32 and fp16 are appropriate, these new GeForce RTX30 GPUs look like they will make for very good and cost effective compute accelerators.

Happy computing! --dbk @dbkinghorn

