How to set up and optimize CUDA and TensorFlow on Ubuntu 20.04 (2022)
Introduction
Companies usually run Machine Learning applications on cloud platforms such as Microsoft Azure and Google Cloud Platform, mostly because their services are readily available, easy to use, built for a wide range of problems, and offer low-code solutions for daily ML operations.
Although most companies can deliver value quickly and inexpensively with these tools, some businesses rely on ML as their core product or strategy, and they often invest a lot of money and time in their own AI infrastructure and code.
If you work as a Data Scientist for one of those, you may have needed to set up an Ubuntu server with the CUDA/cuDNN and TensorFlow stack to serve training and prediction jobs and pipelines.
One of the biggest burdens when dealing with AI is running out of memory while training heavy models (e.g. neural networks for image and video detection/segmentation), so optimizing GPU usage is necessary to sustain the whole training process without crashing.
In this article, I will cover the steps for a correct CUDA/cuDNN installation and its optimization, organized as follows:
- Discovering and installing NVIDIA drivers.
- Matching the NVIDIA driver with CUDA and TensorFlow versions.
- Installing the CUDA toolkit and cuDNN.
- CUDA environment variables.
- Installing TensorFlow.
- Optimizing your GPU processing.
1️⃣ Discovering and installing NVIDIA drivers
NVIDIA GPUs are the best supported in terms of machine learning libraries and integration with common frameworks, such as PyTorch or TensorFlow. The NVIDIA CUDA toolkit includes GPU-accelerated libraries, a C and C++ compiler and runtime, and optimization and debugging tools. It enables you to get started right away without worrying about building custom integrations. (If you don't have an NVIDIA GPU on your machine, this tutorial is not for you.)
First, detect the model of your NVIDIA graphics card and the recommended driver by running the following command. Your output and recommended driver will most likely be different:
$ ubuntu-drivers devices
This prints output similar to the following:
== /sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0 ==
modalias : pci:v000010DEd00001C03sv00001043sd000085ABbc03sc00i00
vendor : NVIDIA Corporation
model : GP106 [GeForce GTX 1060 6GB]
driver : nvidia-driver-390 - distro non-free
driver : nvidia-driver-435 - distro non-free
driver : nvidia-driver-440 - distro non-free recommended
driver : xserver-xorg-video-nouveau - distro free builtin
From the above output we can conclude that the current system has an NVIDIA GeForce GTX 1060 6GB graphics card installed and that the recommended driver is nvidia-driver-440.
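If you want to script this step, the recommended driver can be picked out of that output. The following is a minimal sketch, assuming the output keeps the layout shown above; `recommended_driver` and `SAMPLE` are hypothetical names used for illustration and are not part of ubuntu-drivers.

```python
import re

# Sample lines copied from the output shown above.
SAMPLE = """\
vendor : NVIDIA Corporation
model : GP106 [GeForce GTX 1060 6GB]
driver : nvidia-driver-390 - distro non-free
driver : nvidia-driver-435 - distro non-free
driver : nvidia-driver-440 - distro non-free recommended
driver : xserver-xorg-video-nouveau - distro free builtin
"""

def recommended_driver(text):
    """Return the package name on the line flagged 'recommended', or None."""
    for line in text.splitlines():
        match = re.match(r"\s*driver\s*:\s*(\S+)\s+-\s+.*\brecommended\b", line)
        if match:
            return match.group(1)
    return None

print(recommended_driver(SAMPLE))
```

In practice the text would come from running `ubuntu-drivers devices` and capturing its standard output.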
If you agree with the recommendation, feel free to use the ubuntu-drivers command again to install all recommended drivers:
$ sudo ubuntu-drivers autoinstall
Alternatively, install the desired driver selectively using the apt command. For example:
$ sudo apt install nvidia-driver-440
Once the installation is concluded, reboot your system and you are done.
$ sudo reboot
After the reboot, check that the NVIDIA driver can see your GPU:
$ nvidia-smi
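The same check can be done from Python, which is handy inside a setup script. This is a minimal sketch that relies on nvidia-smi's `--query-gpu` and `--format=csv,noheader` options; the helper names `parse_gpu_rows` and `query_gpus` are hypothetical, and the function simply returns an empty list when nvidia-smi is not installed.

```python
import shutil
import subprocess

def parse_gpu_rows(csv_text):
    """Turn 'name, driver_version' CSV rows into a list of (name, driver) tuples."""
    rows = []
    for line in csv_text.strip().splitlines():
        # Split on the last comma so a comma inside the GPU name is preserved.
        name, driver = (field.strip() for field in line.rsplit(",", 1))
        rows.append((name, driver))
    return rows

def query_gpus():
    """Return [(name, driver_version), ...], or [] if nvidia-smi is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,driver_version", "--format=csv,noheader"],
        capture_output=True, text=True, check=False,
    )
    return parse_gpu_rows(result.stdout) if result.returncode == 0 else []

print(query_gpus())
```

An empty list after the reboot suggests the driver did not load, which is worth fixing before moving on to CUDA.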
2️⃣ Matching the NVIDIA driver with CUDA and TensorFlow versions
One of my nightmares when I first set up an Ubuntu server with the CUDA stack was accidentally installing the wrong CUDA toolkit version for my NVIDIA GPU. It took me a long time to discover what went wrong because the requirement wasn't stated explicitly anywhere I looked. To avoid this kind of error, first check the version compatibility between CUDA, cuDNN, TensorFlow and your NVIDIA driver.
Refer to this link from the official NVIDIA documentation, which shows a support matrix of the supported versions of the OS, NVIDIA CUDA, the CUDA driver, and the hardware for NVIDIA cuDNN 8.3.1. If you want earlier releases, refer to this link.
In my case, since I installed the r440 NVIDIA driver for this tutorial, my CUDA toolkit version will be 10.2.
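A compatibility check like this one is easy to get wrong by comparing version strings as text, so here is a minimal sketch that compares them numerically. The `MIN_DRIVER` table is an assumption for illustration: it holds only the CUDA 10.2 pairing used in this article, with a minimum driver value that should be confirmed against NVIDIA's release notes before you rely on it. The helper names are hypothetical.

```python
def version_tuple(version):
    """Convert '440.33.01' into (440, 33, 1) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))

# Illustrative minimum Linux driver per CUDA release. This is an assumption
# to verify against NVIDIA's release notes, not an authoritative table.
MIN_DRIVER = {"10.2": "440.33"}

def driver_supports(cuda_version, driver_version, table=MIN_DRIVER):
    """True if driver_version meets the minimum listed for cuda_version."""
    return version_tuple(driver_version) >= version_tuple(table[cuda_version])

print(driver_supports("10.2", "440.64"))
```

Extending the table with the rows from the support matrix for your own releases turns this into a quick pre-install sanity check.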
3️⃣ Installing the CUDA toolkit and cuDNN
CUDA toolkit
CUDA is a parallel computing platform and programming model invented by NVIDIA. It enables dramatic increases in computing performance by harnessing the power of the graphics processing unit (GPU) and is a prerequisite for running TensorFlow applications.
After checking which version you need, you can install the CUDA toolkit by following the instructions at this link.
In my case, I ran the following commands, one at a time:
$ wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-ubuntu2004.pin
$ sudo mv cuda-ubuntu2004.pin /etc/apt/preferences.d/cuda-repository-pin-600
$ sudo apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/7fa2af80.pub
$ sudo add-apt-repository "deb https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/ /"
$ sudo apt-get update
$ sudo apt-get -y install cuda
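After the install finishes, it is worth confirming which toolkit release actually landed on the machine. The sketch below parses the output of `nvcc --version`, assuming the usual "release X.Y" wording in that output and the default /usr/local/cuda location; `parse_nvcc_release` and `installed_cuda_release` are hypothetical helper names.

```python
import os
import re
import shutil
import subprocess

def parse_nvcc_release(output):
    """Extract the 'X.Y' release from `nvcc --version` output, or None."""
    match = re.search(r"release\s+(\d+\.\d+)", output)
    return match.group(1) if match else None

def installed_cuda_release(default_nvcc="/usr/local/cuda/bin/nvcc"):
    """Return the toolkit release reported by nvcc, or None if nvcc is missing."""
    # nvcc may not be on PATH yet, so fall back to the default install path.
    nvcc = shutil.which("nvcc") or default_nvcc
    if not os.path.exists(nvcc):
        return None
    result = subprocess.run([nvcc, "--version"], capture_output=True, text=True)
    return parse_nvcc_release(result.stdout)

print(installed_cuda_release())
```

If the reported release differs from the one you matched against your driver, fix that before installing cuDNN.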
cuDNN
The NVIDIA CUDA® Deep Neural Network library (cuDNN) is a GPU-accelerated library of primitives for deep neural networks. cuDNN provides highly tuned implementations for standard routines such as forward and backward convolution, pooling, normalization, and activation layers. It is also a prerequisite for running TensorFlow applications.
Its installation guide can be found at this link. Almost all the prerequisites were installed in the previous steps (NVIDIA drivers and CUDA toolkit); the only one left is zlib:
$ sudo apt-get install zlib1g
Now, in order to download cuDNN, ensure you are registered for the NVIDIA Developer Program. The download is a .tar.xz archive that has to be extracted and installed.
Go to the archive's location and execute the following to extract it:
$ tar -xvf cudnn-linux-x86_64-8.x.x.x_cudaX.Y-archive.tar.xz
You'll need to replace X.Y and 8.x.x.x with your specific CUDA and cuDNN versions and package date.
For example, when installing cuDNN v8.3.1 for CUDA 11.5, it will look like this:
$ tar -xvf cudnn-linux-x86_64-8.3.1.22_cuda11.5-archive.tar.xz
Then, from the folder where the files were extracted, copy them into the CUDA toolkit directory and give read permissions:
$ sudo cp cudnn-*-archive/include/cudnn*.h /usr/local/cuda/include
$ sudo cp -P cudnn-*-archive/lib/libcudnn* /usr/local/cuda/lib64
$ sudo chmod a+r /usr/local/cuda/include/cudnn*.h /usr/local/cuda/lib64/libcudnn*
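To confirm the copy worked, you can read the version macros from the installed header. This is a minimal sketch, assuming a cuDNN 8 layout in which `cudnn_version.h` defines `CUDNN_MAJOR`, `CUDNN_MINOR` and `CUDNN_PATCHLEVEL`, and assuming the /usr/local/cuda/include destination used above; the helper names are hypothetical.

```python
import re
from pathlib import Path

def parse_cudnn_version(header_text):
    """Read CUDNN_MAJOR/MINOR/PATCHLEVEL macros and return 'major.minor.patch'."""
    parts = []
    for macro in ("CUDNN_MAJOR", "CUDNN_MINOR", "CUDNN_PATCHLEVEL"):
        match = re.search(rf"^\s*#define\s+{macro}\s+(\d+)", header_text, re.MULTILINE)
        if match is None:
            return None
        parts.append(match.group(1))
    return ".".join(parts)

def installed_cudnn_version(header="/usr/local/cuda/include/cudnn_version.h"):
    """Return the installed cuDNN version, or None if the header is absent."""
    path = Path(header)
    return parse_cudnn_version(path.read_text()) if path.is_file() else None

print(installed_cudnn_version())
```

A result of None means the header is not where the copy commands should have put it.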
4️⃣ CUDA environment variables
With your CUDA/cuDNN stack correctly installed, you now need to export the environment variables that let an ML application locate the installation.
I suggest integrating these exports into your Python code or .sh script so the environment variables are set up automatically every time you run it.
In my applications, I always define a function that exports all the variables needed, called exports():
import os

def exports():
    # Set CUDA and CUPTI paths. os.environ does not expand '$PATH',
    # so each path is prepended to the variable's current value.
    cuda = '/usr/local/cuda'
    def prepend(name, *paths):
        current = os.environ.get(name)
        os.environ[name] = os.pathsep.join([*paths, current] if current else paths)
    os.environ['CUDA_HOME'] = cuda
    prepend('PATH', f'{cuda}/bin')
    prepend('CPATH', f'{cuda}/include')
    prepend('LIBRARY_PATH', f'{cuda}/lib64')
    prepend('LD_LIBRARY_PATH', f'{cuda}/lib64', f'{cuda}/extras/CUPTI/lib64')
Above, we export every environment variable needed for locating your CUDA stack. CUPTI (CUDA Profiling Tools Interface) enables the creation of profiling and tracing tools that target CUDA applications and already comes along with the toolkit.
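One caveat worth knowing: to my understanding, the dynamic loader reads LD_LIBRARY_PATH when a process starts, so changing it from inside a running Python interpreter mainly affects child processes. A minimal sketch of that pattern follows, assuming the default /usr/local/cuda location; `cuda_env` is a hypothetical helper, and the one-line child command stands in for a real training script.

```python
import os
import subprocess
import sys

def cuda_env(cuda_home="/usr/local/cuda"):
    """Return a copy of the current environment with the CUDA paths prepended."""
    env = os.environ.copy()
    env["CUDA_HOME"] = cuda_home
    additions = {
        "PATH": [f"{cuda_home}/bin"],
        "LD_LIBRARY_PATH": [f"{cuda_home}/lib64", f"{cuda_home}/extras/CUPTI/lib64"],
    }
    for name, paths in additions.items():
        current = env.get(name)
        env[name] = os.pathsep.join(paths + ([current] if current else []))
    return env

# Launch a child interpreter with the prepared environment. A real run would
# start the training script here instead of this one-line check.
check = subprocess.run(
    [sys.executable, "-c", "import os; print(os.environ['CUDA_HOME'])"],
    env=cuda_env(), capture_output=True, text=True,
)
print(check.stdout.strip())
```

Exporting the same variables in the .sh script that launches Python achieves the same effect without a subprocess.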
5️⃣ Installing TensorFlow
System install
You can install TF directly on your machine, although the recommended approach is to install it in a virtual environment inside the repository you're working with, which avoids dependency hell. The system install can be done via the pip package manager:
$ pip3 install --user --upgrade tensorflow # install in $HOME
Then, verify the installation by executing a Python one-liner that runs a simple TensorFlow operation:
$ python3 -c "import tensorflow as tf; print(tf.reduce_sum(tf.random.normal([1000, 1000])))"
Virtual Environment (recommended)
Just execute the following inside your virtual environment:
$ pip install --upgrade tensorflow
Then, verify it by executing the same TF operation used for the system install:
$ python -c "import tensorflow as tf;print(tf.reduce_sum(tf.random.normal([1000, 1000])))"
If you don't know how to create a Python virtual environment, follow this tutorial.
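The one-liner above only proves that TensorFlow imports and runs; it does not show whether the GPU is being used. The following minimal sketch reports the CUDA and cuDNN versions TensorFlow was built against, using `tf.sysconfig.get_build_info()`, together with the GPUs it can see. The helper name `tensorflow_gpu_report` is hypothetical, and the function returns None when TensorFlow is not installed in the current environment.

```python
import importlib.util

def tensorflow_gpu_report():
    """Return TensorFlow's build-time CUDA/cuDNN versions and visible GPUs.

    Returns None when TensorFlow is not installed in the current environment.
    """
    if importlib.util.find_spec("tensorflow") is None:
        return None
    import tensorflow as tf
    build = tf.sysconfig.get_build_info()
    return {
        "tensorflow": tf.__version__,
        # These keys may be absent on CPU-only builds, hence .get().
        "cuda_build": build.get("cuda_version"),
        "cudnn_build": build.get("cudnn_version"),
        "gpus": [gpu.name for gpu in tf.config.list_physical_devices("GPU")],
    }

print(tensorflow_gpu_report())
```

An empty "gpus" list usually points back to a version mismatch or to library paths that TensorFlow cannot find.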
6️⃣ Optimizing your GPU processing
Now that everything is ready to go, you may want to optimize your stack.
To do that, we will tune how TensorFlow uses CUDA and cuDNN so that operations run faster. This is particularly useful when dealing with heavy neural networks and can be decisive in keeping your training applications from crashing.
This is done by exporting environment variables that tell CUDA and TensorFlow how to behave.
1. CUDA_CACHE_DISABLE
Disables caching (when set to 1) or enables caching (when set to 0) for just-in-time-compilation. When disabled, no binary code is added to or retrieved from the cache. Use:
os.environ['CUDA_CACHE_DISABLE'] = '0'
2. TF_FORCE_GPU_ALLOW_GROWTH
In some cases it is desirable for the process to only allocate a subset of the available memory, or to only grow the memory usage as is needed by the process. TensorFlow provides two methods to control this; this environment variable corresponds to the first one, memory growth.
It makes TensorFlow attempt to allocate only as much GPU memory as the runtime allocations need: it starts out allocating very little memory, and as the program runs and more GPU memory is needed, the GPU memory region is extended for the TensorFlow process. Memory is not released, since releasing it can lead to memory fragmentation.
os.environ['TF_FORCE_GPU_ALLOW_GROWTH'] = 'true'
3. TF_CPP_MIN_LOG_LEVEL
This variable controls how much of TF's warnings and logging is printed; the value 3 used here suppresses INFO, WARNING, and ERROR messages.
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'
# Legend:
# 0 = all messages are logged (default behavior)
# 1 = INFO messages are not printed
# 2 = INFO and WARNING messages are not printed
# 3 = INFO, WARNING, and ERROR messages are not printed
4. TF_GPU_THREAD_MODE
This ensures that GPU kernels are launched from their own dedicated threads and don't get queued behind tf.data work, and it prevents CPU-side threads from interfering with the GPU activity.
os.environ['TF_GPU_THREAD_MODE'] = 'gpu_private'
5. TF_USE_CUDNN_BATCHNORM_SPATIAL_PERSISTENT
When input tensors are very small, duration does not change with input size. This is due to tensors being small enough that memory bandwidth isn’t fully utilized. However, for larger inputs, the duration increases close to linearly with size; it will take twice as long to move twice as many input and output values.
When inputs are small enough, cuDNN can use a better single-pass algorithm (persistent batch normalization). Here, inputs are read once into on-chip GPU memory, and then both statistics computation and normalization are performed from there, without any additional data reads. Fewer data reads result in reduced traffic to off-chip memory, which, for constant bandwidth, means the duration is reduced. In other words, spatial persistent batch normalization is faster than its non-persistent variant.
os.environ['TF_USE_CUDNN_BATCHNORM_SPATIAL_PERSISTENT'] = '1'
6. TF_ENABLE_WINOGRAD_NONFUSED
This variable enables the use of the non-fused Winograd convolution algorithm, in which each step of the algorithm is performed by a separate kernel call. The first two kernels transform the inputs and the filters; once that is done, the multiplication is computed on the transformed data, and a final transformation produces the outputs.
os.environ['TF_ENABLE_WINOGRAD_NONFUSED'] = '1'
7. TF_AUTOTUNE_THRESHOLD
This variable improves the stability of the auto-tuning process used to select the fastest convolution algorithms. Setting it to a higher value improves stability, but requires a larger number of trial steps at the beginning of training before the best algorithms are found.
os.environ['TF_AUTOTUNE_THRESHOLD'] = '1'
8. TF_ENABLE_CUBLAS_TENSOR_OP_MATH_FP32
This variable enables or disables Tensor Core math for float32 matrix multiplication operations in TensorFlow.
os.environ['TF_ENABLE_CUBLAS_TENSOR_OP_MATH_FP32'] = '1'
9. TF_ENABLE_CUDNN_TENSOR_OP_MATH_FP32
This variable enables or disables Tensor Core math for float32 convolution operations in TensorFlow.
os.environ['TF_ENABLE_CUDNN_TENSOR_OP_MATH_FP32'] = '1'
10. TF_ENABLE_CUDNN_RNN_TENSOR_OP_MATH_FP32
This variable enables or disables Tensor Core math for float32 cuDNN RNN operations in TensorFlow. Tensor Core math for float32 operations is disabled by default, but can be enabled by setting this variable to 1.
os.environ['TF_ENABLE_CUDNN_RNN_TENSOR_OP_MATH_FP32'] = '1'
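The ten variables above can be gathered into a single helper, in the same spirit as exports(). This is a minimal sketch using the values from the list; `GPU_TUNING` and `apply_gpu_tuning` are hypothetical names, and as far as I know the variables should be set before TensorFlow is imported so that they are picked up.

```python
import os

# Values taken from the list above; adjust or drop entries per workload.
GPU_TUNING = {
    "CUDA_CACHE_DISABLE": "0",
    "TF_FORCE_GPU_ALLOW_GROWTH": "true",
    "TF_CPP_MIN_LOG_LEVEL": "3",
    "TF_GPU_THREAD_MODE": "gpu_private",
    "TF_USE_CUDNN_BATCHNORM_SPATIAL_PERSISTENT": "1",
    "TF_ENABLE_WINOGRAD_NONFUSED": "1",
    "TF_AUTOTUNE_THRESHOLD": "1",
    "TF_ENABLE_CUBLAS_TENSOR_OP_MATH_FP32": "1",
    "TF_ENABLE_CUDNN_TENSOR_OP_MATH_FP32": "1",
    "TF_ENABLE_CUDNN_RNN_TENSOR_OP_MATH_FP32": "1",
}

def apply_gpu_tuning(overrides=None):
    """Export the tuning variables; call this before importing TensorFlow."""
    settings = {**GPU_TUNING, **(overrides or {})}
    os.environ.update(settings)
    return settings

# Example: keep warnings and errors visible while debugging.
applied = apply_gpu_tuning({"TF_CPP_MIN_LOG_LEVEL": "1"})
print(applied["TF_CPP_MIN_LOG_LEVEL"])
```

Keeping the values in one dictionary makes it easy to switch individual settings off when profiling shows they do not help a given model.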
Additional environment variables can be exported for specific neural network operations. This is an ongoing research topic, and the native TensorFlow profiler, a tool that tracks hardware resource utilization, can help identify what is needed to further optimize your applications.
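Environment variables are not the only way to apply these settings. As one example, the memory-growth behavior of TF_FORCE_GPU_ALLOW_GROWTH can also be requested from code through `tf.config.experimental.set_memory_growth`. The sketch below assumes it runs before any GPU has been initialized; `enable_memory_growth` is a hypothetical helper name, and it returns None when TensorFlow is not installed.

```python
import importlib.util

def enable_memory_growth():
    """Turn on memory growth for every visible GPU; return how many were set.

    Returns None when TensorFlow is not installed. This must run before any
    GPU has been initialized, otherwise TensorFlow raises a RuntimeError.
    """
    if importlib.util.find_spec("tensorflow") is None:
        return None
    import tensorflow as tf
    gpus = tf.config.list_physical_devices("GPU")
    for gpu in gpus:
        tf.config.experimental.set_memory_growth(gpu, True)
    return len(gpus)

print(enable_memory_growth())
```

Setting it in code keeps the choice next to the model definition, while the environment variable is easier to toggle per run.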
Summary
In this article, you learned:
- How to check for a compatible GPU and its drivers.
- How to match the version of your NVIDIA driver with your CUDA/cuDNN and TensorFlow stacks.
- How to install all the prerequisites for using TensorFlow and its deep learning libraries.
- How to install TensorFlow via the pip package manager.
- How to export the environment variables needed for a working CUDA stack.
- How to export environment variables that can optimize your machine learning operations by making convolution and matrix algorithms work faster.