BlueEye

How to configure a server for machine learning?

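We need an x86/AMD64 computer with a CUDA-capable NVIDIA GPU. Here I have described how to build something inexpensive at home.
Ubuntu will be our operating system. It is the simplest choice, even for Windows enthusiasts.
We will install the following software: Ubuntu 20.04 LTS Server, the NVIDIA CUDA drivers, the NVIDIA CUDA Profiling Tools Interface and the NVIDIA CUDA 11.1 Toolkit.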

Ubuntu 20.04 LTS Server

Download and install Ubuntu 20.04 LTS Server from https://ubuntu.com/download/server
During the installation, we must select Install OpenSSH server. Other parameters can be left at their default values.
Let's check the server's IP address in the Ubuntu terminal:
ip addr show
We should see a result similar to this one (screenshot: example results of ip addr show).
Now we can connect to the server remotely over the SSH protocol. If your workstation runs Windows, you can use the PuTTY client. If you use Linux, you probably already have an SSH client installed and can connect to the server directly (in this example I assume that the server has the IP address 192.168.1.31).
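If you only need the interface name and the current IPv4 address, the one-line mode of ip (ip -o -4 addr show) is easier to filter than the full output. The sketch below assumes the usual one-line layout, where the second field is the interface name and the fourth is the address with its prefix; list_ipv4 is a hypothetical helper name, not part of any tool.

```shell
#!/usr/bin/env bash
# list_ipv4 (hypothetical helper): reads the one-line output of
# `ip -o -4 addr show` on stdin and prints "interface address/prefix"
# for every interface except loopback.
list_ipv4() {
  # One-line layout: $1 = index ("2:"), $2 = interface, $3 = "inet", $4 = address/prefix.
  awk '$3 == "inet" && $2 != "lo" { print $2, $4 }'
}
```

Usage: ip -o -4 addr show | list_ipv4. On the example server this should print a line like enp4s0 192.168.1.31/24; the first column is the interface name you will need for the netplan configuration.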
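Working with a server that has a dynamic IP address (DHCP) can be problematic, so we will set a static IP address. Edit the netplan configuration file with sudo privileges (the file name in /etc/netplan/ may differ on your system):
sudo vi /etc/netplan/50-cloud-init.yaml
and insert the following configuration: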

    network:
        ethernets:
            enp4s0:
                dhcp4: false
                addresses: [192.168.1.100/24]
                gateway4: 192.168.1.1
                nameservers:
                    addresses: [8.8.8.8,8.8.4.4,192.168.1.1]
        version: 2

Replace the Ethernet interface name enp4s0 with the one from your system (you can read it from the ip addr show output). Your other network settings (address, gateway, DNS servers) may also differ; I have given an example of a typical home network configuration.
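Before applying the file, it is worth checking that gateway4 lies in the same subnet as the static address: with a /24 prefix, 192.168.1.100/24 and 192.168.1.1 must share the first three octets. A minimal bash sketch of that check; ip_to_int and same_subnet are hypothetical helper names.

```shell
#!/usr/bin/env bash
# ip_to_int (hypothetical helper): converts a dotted IPv4 address to an integer.
ip_to_int() {
  local IFS=.
  # Split the address on dots into $1..$4.
  set -- $1
  echo "$(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))"
}

# same_subnet (hypothetical helper): succeeds if the gateway lies in the
# subnet of an address given in CIDR form, e.g. 192.168.1.100/24.
same_subnet() {
  local cidr=$1 gateway=$2
  local addr=${cidr%/*} prefix=${cidr#*/}
  # Netmask: the top $prefix bits set, the remaining bits cleared.
  local mask=$(( (0xFFFFFFFF << (32 - prefix)) & 0xFFFFFFFF ))
  local a g
  a=$(ip_to_int "$addr")
  g=$(ip_to_int "$gateway")
  # Same network part means the gateway is reachable without routing.
  [ $(( a & mask )) -eq $(( g & mask )) ]
}
```

Usage: same_subnet 192.168.1.100/24 192.168.1.1 && echo OK. When you are connected over SSH, sudo netplan try is a safer alternative to netplan apply: it applies the configuration and rolls it back automatically unless you confirm it.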
Apply the configuration:
sudo netplan apply
Now reconnect over SSH to the new address 192.168.1.100.
Update the Ubuntu packages:
sudo apt update
sudo apt upgrade -y
Turn off sleep, hibernation and similar power-saving functions on the server:
sudo systemctl mask sleep.target suspend.target hibernate.target hybrid-sleep.target
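To confirm that the targets are really masked, systemctl is-enabled should report masked for each of them. A small sketch of that check; sleep_targets_masked is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# sleep_targets_masked (hypothetical helper): succeeds only if every
# sleep-related systemd target is masked; otherwise names the first one that is not.
sleep_targets_masked() {
  local target state
  for target in sleep.target suspend.target hibernate.target hybrid-sleep.target; do
    # `systemctl is-enabled` prints the unit file state, e.g. "masked" or "static".
    state=$(systemctl is-enabled "$target" 2>/dev/null)
    if [ "$state" != "masked" ]; then
      echo "$target is ${state:-unknown}" >&2
      return 1
    fi
  done
  return 0
}
```

Usage: sleep_targets_masked && echo "sleep functions are disabled".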

NVIDIA CUDA Drivers

Select your platform on https://developer.nvidia.com/cuda-downloads, then download and install the NVIDIA drivers.
In my example (CUDA 11.1, Ubuntu 20.04, x86_64, local deb installer) I ran:
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-ubuntu2004.pin
sudo mv cuda-ubuntu2004.pin /etc/apt/preferences.d/cuda-repository-pin-600
wget https://developer.download.nvidia.com/compute/cuda/11.1.0/local_installers/cuda-repo-ubuntu2004-11-1-local_11.1.0-455.23.05-1_amd64.deb
sudo dpkg -i cuda-repo-ubuntu2004-11-1-local_11.1.0-455.23.05-1_amd64.deb
sudo apt-key add /var/cuda-repo-ubuntu2004-11-1-local/7fa2af80.pub
sudo apt-get update
sudo apt-get -y install cuda
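After the cuda package is installed, the server usually needs a reboot (sudo reboot) so that the new driver is loaded. NVIDIA's post-installation steps also suggest adding the CUDA bin directory to PATH, because the nvcc from this package lives under /usr/local/cuda/bin. A sketch that does this only once, assuming the default /usr/local/cuda path and ~/.bashrc as the startup file; add_cuda_to_path is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# add_cuda_to_path (hypothetical helper): appends a PATH export for the CUDA
# bin directory to a shell startup file, but only if it is not there yet.
# $1 = startup file (e.g. ~/.bashrc), $2 = CUDA install dir (default /usr/local/cuda)
add_cuda_to_path() {
  local rc_file=$1 cuda_dir=${2:-/usr/local/cuda}
  # \$ keeps ${PATH...} literal so it is expanded when the startup file runs.
  local line="export PATH=${cuda_dir}/bin\${PATH:+:\${PATH}}"
  # -x matches whole lines, -F treats the pattern as a fixed string.
  if ! grep -qxF "$line" "$rc_file" 2>/dev/null; then
    printf '%s\n' "$line" >> "$rc_file"
  fi
}
```

Usage: add_cuda_to_path ~/.bashrc, then open a new shell or run source ~/.bashrc.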

NVIDIA CUDA Profiling Tools Interface

Run the command:
sudo apt install -y libcupti-dev

NVIDIA CUDA 11.1 Toolkit

Run the command:
sudo apt install nvidia-cuda-toolkit
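and test it:
nvcc -V
Keep in mind that nvidia-cuda-toolkit comes from the Ubuntu archive, not from the NVIDIA repository, so on Ubuntu 20.04 it provides CUDA 10.1 and nvcc -V may report that release instead of 11.1. The cuda package installed in the NVIDIA CUDA Drivers section already contains the CUDA 11.1 toolkit under /usr/local/cuda-11.1 (its nvcc is /usr/local/cuda/bin/nvcc).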
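To check which CUDA release nvcc reports, the release number can be taken from the line Cuda compilation tools, release X.Y, VX.Y.Z printed by nvcc -V. A sketch assuming that output format; nvcc_release and expect_cuda_release are hypothetical helper names.

```shell
#!/usr/bin/env bash
# nvcc_release (hypothetical helper): reads `nvcc -V` output on stdin and
# prints the release number, e.g. "11.1", from the line
# "Cuda compilation tools, release X.Y, VX.Y.Z".
nvcc_release() {
  sed -n 's/.*release \([0-9][0-9.]*\),.*/\1/p'
}

# expect_cuda_release (hypothetical helper): succeeds if the nvcc found first
# in PATH reports the expected release, e.g. expect_cuda_release 11.1
expect_cuda_release() {
  local expected=$1 actual
  actual=$(nvcc -V 2>/dev/null | nvcc_release)
  if [ "$actual" != "$expected" ]; then
    echo "nvcc reports release '${actual:-none}', expected $expected" >&2
    return 1
  fi
}
```

Usage: expect_cuda_release 11.1 checks the nvcc in PATH, while /usr/local/cuda/bin/nvcc -V | nvcc_release checks the toolkit from the cuda package directly.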

Command hints

Command: what we want to check
sudo lshw -C display: list all display devices detected by Ubuntu; devices that do not support CUDA are listed as well.
lspci | grep -i nvidia: list all NVIDIA PCI devices.
nvidia-smi: validate the NVIDIA GPU driver installation and list all CUDA devices supported by the installed driver. It also shows power, temperature, memory consumption, running processes, software versions, etc.
nvcc -V: validate the CUDA Toolkit installation.
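nvidia-smi also has a query mode (--query-gpu with --format=csv) that is convenient in scripts. The sketch below prints one line per GPU; index, name, memory.total and temperature.gpu are standard nvidia-smi query fields (nvidia-smi --help-query-gpu lists them), while gpu_summary is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# gpu_summary (hypothetical helper): prints one line per GPU using
# nvidia-smi's query mode (CSV output without header and units).
gpu_summary() {
  local out
  if ! command -v nvidia-smi >/dev/null 2>&1; then
    echo "nvidia-smi not found; is the NVIDIA driver installed?" >&2
    return 1
  fi
  # A non-zero exit status usually means nvidia-smi cannot talk to the driver.
  if ! out=$(nvidia-smi --query-gpu=index,name,memory.total,temperature.gpu \
                        --format=csv,noheader,nounits); then
    echo "nvidia-smi failed: $out" >&2
    return 1
  fi
  # Fields are separated by ", "; memory.total is in MiB, temperature.gpu in C.
  printf '%s\n' "$out" |
    awk -F', ' '{ printf "GPU %s: %s, %s MiB, %s C\n", $1, $2, $3, $4 }'
}
```

Usage: gpu_summary prints lines of the form GPU 0: name, memory MiB, temperature C, or an error if the driver is not working.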

Popular problems

Error message and reason:

nvidia-smi response: NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA drivers is installed and running.
Reason: the NVIDIA CUDA drivers aren't installed, although the NVIDIA tools are. You must install the NVIDIA CUDA drivers or reinstall the NVIDIA CUDA Toolkit.

NVIDIA driver installation: ERROR: The Nouveau kernel driver is currently in use by your system. This driver is incompatible with the NVIDIA drivers, and must be disabled before proceeding. Please consult the NVIDIA driver README and your Linux distribution's documentation for details on how to correctly disable the Nouveau kernel driver.
Reason: the Nouveau kernel driver is active. You must disable it; look at the NVIDIA GPU drivers section.

NVIDIA driver installation: ERROR: Unable to find the development tool `cc` in your path; please make sure that you have the package 'gcc' installed. If gcc is installed on your system, then please check that `cc` is in your PATH.
Reason: the gcc package is not installed. Run sudo apt install gcc

NVIDIA driver installation: ERROR: Unable to find the development tool `make` in your path; please make sure that you have the package 'make' installed. If make is installed on your system, then please check that `make` is in your PATH.
Reason: the make package is not installed. Run sudo apt install make
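Several of these problems can be detected before running the driver installer. A sketch, assuming the tools are checked by the command names from the error messages (cc and make) and that nouveau appears in /proc/modules while it is loaded; missing_tools, nouveau_loaded and disable_nouveau are hypothetical helper names. disable_nouveau writes the blacklist file described in NVIDIA's CUDA installation guide for Ubuntu and needs a reboot afterwards.

```shell
#!/usr/bin/env bash
# missing_tools (hypothetical helper): prints every listed command that is
# not found in PATH; succeeds only if nothing is missing.
missing_tools() {
  local tool status=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "$tool"
      status=1
    fi
  done
  return "$status"
}

# nouveau_loaded (hypothetical helper): succeeds if the nouveau module is
# listed in the kernel module table ($1, default /proc/modules).
nouveau_loaded() {
  grep -q '^nouveau ' "${1:-/proc/modules}"
}

# disable_nouveau (hypothetical helper, needs sudo and a reboot afterwards):
# blacklists nouveau and rebuilds the initramfs so it is not loaded at boot.
disable_nouveau() {
  printf 'blacklist nouveau\noptions nouveau modeset=0\n' |
    sudo tee /etc/modprobe.d/blacklist-nouveau.conf >/dev/null
  sudo update-initramfs -u
}
```

Usage: missing_tools cc make prints the commands that still need to be installed (sudo apt install gcc make), and nouveau_loaded && disable_nouveau && sudo reboot disables nouveau only when it is actually loaded.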

Links