Multi GPU Neural Network Training on Blueshark Supercomputer (HPC)

Category: Tutorials · Dec 3rd, 2018

In this tutorial we will set up an environment to train neural networks with TensorFlow on multiple GPUs on the Blueshark supercomputer (HPC), which uses the Slurm Workload Manager:

1- SSH to Blueshark:

 
ssh username@blueshark.fit.edu

2- Install Miniconda and the necessary packages; in this tutorial we will use Miniconda.

Download Miniconda for Python 2.7 or Python 3:

wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
wget https://repo.anaconda.com/miniconda/Miniconda2-latest-Linux-x86_64.sh

Execute the installer script with:

[yourname@blueshark ~]$ bash Miniconda2-latest-Linux-x86_64.sh # for python2
[yourname@blueshark ~]$ bash Miniconda3-latest-Linux-x86_64.sh # for python3

Create a virtual environment and install the necessary packages for TensorFlow; we will call our environment "myenv":
[yourname@blueshark ~]$ conda create -n myenv cudatoolkit=9.0 cudnn=7.1.2 tensorflow-gpu=1.10.0 Pillow numpy protobuf matplotlib
You can open a terminal on one of the GPU compute nodes and check the driver and the number of GPUs:
[yourname@blueshark ~]$ srun -N 1 -p gpu --pty bash

Now you should be able to see which node is assigned to you:

 [yourname@node68 ~]$ 

In this case I am on GPU node node68.
We can check the Nvidia driver and the name and number of GPUs in the system by running:

 [yourname@node68 ~]$ nvidia-smi 

You should get something like:

[naghli2014@node68 ~]$ nvidia-smi
Sun Dec  2 18:24:46 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 387.26                 Driver Version: 387.26                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K40m          Off  | 00000000:03:00.0 Off |                    0 |
| N/A   21C    P8    20W / 235W |      0MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K40m          Off  | 00000000:04:00.0 Off |                    0 |
| N/A   23C    P8    20W / 235W |      0MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla K40m          Off  | 00000000:82:00.0 Off |                    0 |
| N/A   25C    P8    20W / 235W |      0MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla K40m          Off  | 00000000:83:00.0 Off |                    0 |
| N/A   24C    P8    20W / 235W |      0MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

This output confirms that the node has 4 Tesla K40m GPUs with Nvidia driver version
387.26, and that currently no processes are running on the GPUs.

3- Test the TensorFlow installation and whether it can use multiple GPUs:

Activate your TensorFlow environment:

 [yourname@blueshark ~]$ conda activate myenv 

Your terminal prompt should now look like this:

(myenv) [yourname@blueshark ~]$ 

Create a Python file named test.py and copy this code into it:



import tensorflow as tf
import os

# Make all four GPUs on the node visible to TensorFlow.
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3"

from tensorflow.python.client import device_lib

def get_available_gpus():
    # List every device TensorFlow can see and keep only the GPUs.
    local_device_protos = device_lib.list_local_devices()
    return [x.name for x in local_device_protos if x.device_type == 'GPU']

print(get_available_gpus())

os.environ["CUDA_VISIBLE_DEVICES"]="0,1,2,3"

defines which GPUs in the node are visible to TensorFlow. You can list fewer GPU indices to restrict TensorFlow to a subset, or more if the node has additional GPUs. You should include this line in your own training code as well!
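To see what that environment variable does, here is a simplified pure-Python model of its behavior (no TensorFlow needed; the `visible_gpus` helper is mine, for illustration only): the runtime exposes only the listed physical GPUs, renumbered as logical devices starting from 0.

```python
import os

def visible_gpus(num_physical_gpus):
    """Return the physical GPU indices the CUDA runtime would expose,
    in the order they get renumbered as logical devices 0, 1, 2, ...
    (Simplified model; assumes all listed indices are valid.)"""
    value = os.environ.get("CUDA_VISIBLE_DEVICES")
    if value is None:
        # Unset: all physical GPUs are visible in their native order.
        return list(range(num_physical_gpus))
    ids = [int(v) for v in value.split(",") if v.strip() != ""]
    return [i for i in ids if 0 <= i < num_physical_gpus]

# On a 4-GPU node, exposing only physical GPUs 1 and 3:
os.environ["CUDA_VISIBLE_DEVICES"] = "1,3"
print(visible_gpus(4))  # [1, 3] -> they become logical devices 0 and 1
```

So code that asks for `/device:GPU:0` would actually run on physical GPU 1 in this example.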

 [yourname@blueshark ~]$ python test.py 

Your test.py output should look like this:

Adding visible gpu devices: 0, 1, 2, 3
Device interconnect StreamExecutor with strength 1 edge matrix:
Created TensorFlow device (/device:GPU:0 with 10747 MB memory) ; physical GPU (device: 0, name: Tesla K40m, pci bus id: 0000:03:00.0, compute capability: 3.5)
Created TensorFlow device (/device:GPU:1 with 10747 MB memory) ; physical GPU (device: 1, name: Tesla K40m, pci bus id: 0000:04:00.0, compute capability: 3.5)
Created TensorFlow device (/device:GPU:2 with 10747 MB memory) ; physical GPU (device: 2, name: Tesla K40m, pci bus id: 0000:82:00.0, compute capability: 3.5)
Created TensorFlow device (/device:GPU:3 with 10747 MB memory) ; physical GPU (device: 3, name: Tesla K40m, pci bus id: 0000:83:00.0, compute capability: 3.5)
['/device:GPU:0', '/device:GPU:1', '/device:GPU:2', '/device:GPU:3']
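With all four GPUs visible, the usual way to use them is data parallelism: each training step's batch is split into equal shards, one per GPU, each GPU computes gradients on its shard, and the gradients are averaged. As a minimal sketch of just the sharding step (pure Python; the `shard_batch` helper is my own illustration, not a TensorFlow API — in TensorFlow you would place each shard's ops under a `tf.device` scope):

```python
def shard_batch(batch, num_gpus):
    """Split a batch into num_gpus near-equal shards, one per GPU.

    The first (len(batch) % num_gpus) shards get one extra example,
    so shard sizes differ by at most one.
    """
    base, extra = divmod(len(batch), num_gpus)
    shards, start = [], 0
    for gpu in range(num_gpus):
        size = base + (1 if gpu < extra else 0)
        shards.append(batch[start:start + size])
        start += size
    return shards

# A batch of 10 examples split across the node's 4 K40m GPUs:
for gpu, shard in enumerate(shard_batch(list(range(10)), 4)):
    print("/device:GPU:%d gets %s" % (gpu, shard))
```

Because each GPU sees only 1/4 of the batch, you can usually increase the global batch size roughly fourfold without running out of GPU memory.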

4- Run your training Python code:

Assuming your training script is named training.py, you can run it on a GPU node with:

 srun -n 1 --mem=40G -p gpu --qos=gpu --gres=gpu:4 python training.py 

If you want to run your training on a reserved node, so that no other user can use that node while you are training your network, use:

 srun -n 1 --exclusive --mem=40G -p gpu --qos=gpu --gres=gpu:4 python training.py 

With this method, if you disconnect from the server, your code execution also ends. If you would rather submit your training as a batch job, you can use an sbatch script:
create a file named myjob.sh, copy these lines into it, and save it:

#!/bin/bash
#SBATCH --job-name facenet_training 
#SBATCH --nodes 1
#SBATCH --mem=40GB
#SBATCH --mail-user=youremailaddress@my.fit.edu
#SBATCH --mail-type=END,FAIL
#SBATCH --partition=gpu
#SBATCH --gres=gpu:4
#SBATCH --exclusive
#SBATCH --error=logs/errors.%J.err
#SBATCH --output=logs/output.%J.out
python training.py
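If you run many experiments, it can be convenient to generate such scripts from Python instead of editing them by hand. A minimal sketch that mirrors the script above (the `make_sbatch_script` helper and its parameters are my own, for illustration; the generated text could be written to myjob.sh and submitted with sbatch):

```python
def make_sbatch_script(job_name, script, mem_gb=40, gpus=4, mail=None):
    """Build an sbatch script string matching the layout used above."""
    lines = [
        "#!/bin/bash",
        "#SBATCH --job-name %s" % job_name,
        "#SBATCH --nodes 1",
        "#SBATCH --mem=%dGB" % mem_gb,
        "#SBATCH --partition=gpu",
        "#SBATCH --gres=gpu:%d" % gpus,
        "#SBATCH --exclusive",
        # %J is expanded by Slurm to the job ID, not by Python.
        "#SBATCH --error=logs/errors.%J.err",
        "#SBATCH --output=logs/output.%J.out",
    ]
    if mail:
        lines.insert(2, "#SBATCH --mail-user=%s" % mail)
        lines.insert(3, "#SBATCH --mail-type=END,FAIL")
    lines.append("python %s" % script)
    return "\n".join(lines) + "\n"

print(make_sbatch_script("facenet_training", "training.py"))
```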

Now you can run your job by executing:

 (myenv) [yourname@blueshark ~]$ sbatch myjob.sh 

Your output should look like this:

 (myenv) [yourname@blueshark ~]$ sbatch myjob.sh
Submitted batch job 148411

You will also receive an email when your job fails or completes.
For more details you can check your logs directory (make sure you have created the logs directory before running sbatch, since the script writes its error and output files there).

 

Finally, you can run this command to see your job in the job queue:


[yourname@blueshark ~]$ squeue
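If you prefer to check the queue from Python, you can parse squeue's default whitespace-separated table. A sketch (the `parse_squeue` helper and the sample output are my own illustration; in practice `squeue -u $USER` narrows the list to your jobs):

```python
def parse_squeue(output):
    """Parse squeue's default table into a list of dicts, one per job."""
    lines = output.strip().splitlines()
    header = lines[0].split()
    # The last column, NODELIST(REASON), may contain spaces, so cap
    # the number of splits at the number of header fields minus one.
    return [dict(zip(header, line.split(None, len(header) - 1)))
            for line in lines[1:]]

# Illustrative sample in squeue's default column layout:
sample = """\
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
148411 gpu facenet_ yournam R 1:23 1 node68
"""
for job in parse_squeue(sample):
    print(job["JOBID"], job["ST"], job["NODELIST(REASON)"])
```

A state (`ST`) of `R` means the job is running; `PD` means it is still pending in the queue.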