Short summary:
sudo swapoff /swapfile
sudo dd if=/dev/zero of=/swapfile bs=1M count=65536 oflag=append conv=notrunc
sudo mkswap /swapfile
sudo swapon /swapfile
sudo update-alternatives --install /usr/bin/python python /usr/bin/python3 1
sudo apt update
sudo apt install python3-dev python3-pip
sudo apt install python3-testresources
pip install -U --user pip numpy==1.19.5 wheel
pip install -U --user keras_preprocessing --no-deps
sudo apt install git
git clone https://github.com/tensorflow/tensorflow.git
cd tensorflow
git checkout r2.5
sudo apt install npm
sudo npm install -g @bazel/bazelisk
./configure
bazel build --config=opt //tensorflow/tools/pip_package:build_pip_package
./bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg
pip install /tmp/tensorflow_pkg/tensorflow-2.5.0-cp38-cp38-linux_x86_64.whl
sudo apt-get install openssh-server
sudo systemctl enable ssh
sudo systemctl start ssh
ssh-keygen
ssh-copy-id vasily@SERVER2
ssh SERVER2
ssh-keygen
ssh-copy-id vasily@SERVER1
sudo snap install cmake --classic
sudo apt install openmpi-bin
mpirun -H SERVER1:1,SERVER2:1 hostname
git clone --recursive https://github.com/uber/horovod.git
cd horovod
python setup.py clean
python setup.py bdist_wheel
HOROVOD_WITH_TENSORFLOW=1 pip install ./dist/horovod-0.22.1-cp38-cp38-linux_x86_64.whl[tensorflow,keras]
mpirun -H SERVER1:3,SERVER2:3 python3 /home/vasily/horovod/examples/tensorflow2/tensorflow2_keras_mnist.py
In my previous post I mentioned that building TensorFlow from source is rather an activity for masochists (pardon, passionate DevOps).
However, sometimes it is impossible to get maximum pleasure (pardon, performance) without masochism (pardon, DevOps activity). Moreover, if you experiment with outdated but still functioning hardware (which can be bought cheaply on eBay), there is no alternative, since legacy hardware might not support the AVX instruction set that the official TensorFlow binaries are compiled with. I, for one, bought 18 units of Dell™ OptiPlex™ 780 a couple of years ago. At that time I tried to build a MAAS, did not succeed at the first attempt and then had to relocate... The Dell computers were stored in a moist cellar for two years and still function (ironically, my new Dell laptop recently went kaputt, but that is another story).
So, having replaced the old Pentium Duo CPUs with quad-core ones (which cost virtually nothing) and added some more RAM (which costs a bit but is definitely worth it), I built a BeerWulf cluster with 72 CPU cores.
So let us have a look at the build process in more detail.
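If you are not sure whether your own legacy box is affected, a quick look at /proc/cpuinfo tells you which of the AVX-family flags the CPU advertises. The snippet below is just a sketch; cpu_has_flag is a helper name I made up, and the optional second argument exists only so that the check can be tried against a saved cpuinfo file.

```shell
# Report whether the CPU advertises a given instruction-set flag.
# Usage: cpu_has_flag FLAG [CPUINFO_FILE]   (hypothetical helper)
cpu_has_flag() {
    flag="$1"
    cpuinfo="${2:-/proc/cpuinfo}"
    # Take the first "flags" line and look for the flag as a whole word,
    # so that "avx" does not accidentally match "avx2".
    grep -m1 '^flags' "$cpuinfo" | grep -qw -- "$flag"
}

for f in sse4_1 avx avx2; do
    if cpu_has_flag "$f"; then
        echo "$f: supported"
    else
        echo "$f: NOT supported"
    fi
done
```

If avx shows up as NOT supported, the prebuilt TensorFlow wheels will most likely refuse to run and a build from source is the way to go.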
As the 0-th step I install Ubuntu 20.04.2 LTS: minimal installation, with the options to install third-party drivers and to download updates enabled (do let Ubuntu install these updates and reboot after the first run).
Further, since I have only 0.5 liter (pardon, 8GB) of RAM on the BeerWulf nodes, we need to add some swap (by default it is set to just 2GB).
sudo swapoff /swapfile
sudo dd if=/dev/zero of=/swapfile bs=1M count=65536 oflag=append conv=notrunc
sudo mkswap /swapfile
sudo swapon /swapfile
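Where does count=65536 come from? With bs=1M every block is 1 MiB, so dd appends 65536 MiB = 64 GiB to the existing swap file. A tiny sketch of that arithmetic, in case you want a different size (swap_blocks_for_gib is my own helper name, not a standard tool):

```shell
# How many 1 MiB blocks must dd append to grow the swap file by N GiB?
# (hypothetical helper, assumes bs=1M as in the dd command above)
swap_blocks_for_gib() {
    echo $(( $1 * 1024 ))
}

swap_blocks_for_gib 64   # prints 65536

# After swapon, the new size can be checked with:
#   swapon --show
#   free -h
```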
Next we add an alias for the Python 3 path (not a must, but better to do it): Ubuntu's default is /usr/bin/python3, which is correctly recognized by TensorFlow's configure script. However, many 3rd-party examples expect /usr/bin/python, so set it!
sudo update-alternatives --install /usr/bin/python python /usr/bin/python3 1
Further we need basic Python bricks, so execute
sudo apt update
sudo apt install python3-dev python3-pip
sudo apt install python3-testresources
pip install -U --user pip numpy==1.19.5 wheel
pip install -U --user keras_preprocessing --no-deps
NB! The official TensorFlow installation guide does not specify the numpy version. In my case numpy 1.21 got installed, whereas TensorFlow 2.5 requires numpy 1.19.x.
Then we just do the common git stuff (note that since we chose the minimal Ubuntu installation, we need to install the git client first)
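Since a wrong numpy only bites you many hours into the build, it may be worth checking the installed version before starting. A minimal sketch (version_in_series is a made-up helper; it only compares the version prefix and knows nothing about TensorFlow's real dependency rules):

```shell
# Succeed if VERSION belongs to the given MAJOR.MINOR series.
# Usage: version_in_series VERSION SERIES   (hypothetical helper)
version_in_series() {
    case "$1" in
        "$2"|"$2".*) return 0 ;;
        *) return 1 ;;
    esac
}

# Ask Python for the numpy version; fall back to "none" if it cannot be imported.
installed="$(python3 -c 'import numpy; print(numpy.__version__)' 2>/dev/null || echo none)"

if version_in_series "$installed" 1.19; then
    echo "numpy $installed looks fine for TensorFlow 2.5"
else
    echo "numpy $installed found, but TensorFlow 2.5 expects 1.19.x"
fi
```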
sudo apt install git
git clone https://github.com/tensorflow/tensorflow.git
cd tensorflow
git checkout r2.5
Further we need to install bazel, which is Google's build management system (like make or maven but likely better... if you manage to figure out how to install it... I finally did)
sudo apt install npm
sudo npm install -g @bazel/bazelisk
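A remark on why bazelisk and not bazel itself: bazelisk is only a launcher that downloads the Bazel version the source tree asks for. As far as I understand, it takes that version from a .bazelversion file in the repository root. The sketch below just prints that file (pinned_bazel_version is my own helper name):

```shell
# Print the Bazel version a source tree pins via its .bazelversion file.
# Usage: pinned_bazel_version [REPO_DIR]   (hypothetical helper)
pinned_bazel_version() {
    repo="${1:-.}"
    if [ -f "$repo/.bazelversion" ]; then
        head -n1 "$repo/.bazelversion"
    else
        echo "no .bazelversion in $repo" >&2
        return 1
    fi
}

# Example, run from the directory that contains the tensorflow checkout:
#   pinned_bazel_version tensorflow
#   bazel --version
```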
After installing bazel, run the configuration script (always choose the defaults and answer N)
./configure
vasily@OptiPlex01:~/tensorflow$ ./configure
You have bazel 3.7.2 installed.
Please specify the location of python. [Default is /usr/bin/python3]:
Found possible Python library paths:
/usr/lib/python3/dist-packages
/usr/local/lib/python3.8/dist-packages
Please input the desired Python library path to use. Default is [/usr/lib/python3/dist-packages]
Do you wish to build TensorFlow with ROCm support? [y/N]: n
No ROCm support will be enabled for TensorFlow.
Do you wish to build TensorFlow with CUDA support? [y/N]: n
No CUDA support will be enabled for TensorFlow.
Do you wish to download a fresh release of clang? (Experimental) [y/N]: n
Clang will not be downloaded.
Please specify optimization flags to use during compilation when bazel option "--config=opt" is specified [Default is -Wno-sign-compare]:
Would you like to interactively configure ./WORKSPACE for Android builds? [y/N]: n
Not configuring the WORKSPACE for Android builds.
Preconfigured Bazel build configs. You can use any of the below by adding "--config=<>" to your build command. See .bazelrc for more details.
--config=mkl # Build with MKL support.
--config=mkl_aarch64 # Build with oneDNN and Compute Library for the Arm Architecture (ACL).
--config=monolithic # Config for mostly static monolithic build.
--config=numa # Build with NUMA support.
--config=dynamic_kernels # (Experimental) Build kernels into separate shared objects.
--config=v2 # Build TensorFlow 2.x instead of 1.x.
Preconfigured Bazel build configs to DISABLE default on features:
--config=noaws # Disable AWS S3 filesystem support.
--config=nogcp # Disable GCP support.
--config=nohdfs # Disable HDFS support.
--config=nonccl # Disable NVIDIA NCCL support.
Configuration finished
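With 18 nodes, answering the same questions again and again gets boring. To my knowledge the configure script first looks for its answers in environment variables, so the dialogue above can probably be pre-answered as sketched below. I did not verify this on r2.5: the variable names are taken from my reading of configure.py, so treat them as an assumption, and any question without a matching variable will still be asked interactively.

```shell
# Pre-answer the configure questions via environment variables
# (names assumed from configure.py, not verified on r2.5).
export PYTHON_BIN_PATH="$(command -v python3)"
export USE_DEFAULT_PYTHON_LIB_PATH=1        # accept the first library path found
export TF_NEED_ROCM=0                       # "n" to ROCm
export TF_NEED_CUDA=0                       # "n" to CUDA
export TF_DOWNLOAD_CLANG=0                  # "n" to a fresh clang
export CC_OPT_FLAGS="-Wno-sign-compare"     # the default shown in the dialogue
export TF_SET_ANDROID_WORKSPACE=0           # "n" to Android

# Only run the script when we are actually inside the tensorflow checkout.
if [ -x ./configure ]; then
    ./configure
else
    echo "no ./configure here; cd into the tensorflow directory first"
fi
```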
Then execute
bazel build --config=opt //tensorflow/tools/pip_package:build_pip_package
... and go to sleep: on a legacy machine the "nightly build" is to be understood pretty literally; for me it took more than 12 hours.
As you wake up, build the pip package and install it
./bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg
pip install /tmp/tensorflow_pkg/tensorflow-2.5.0-cp38-cp38-linux_x86_64.whl
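The wheel file name depends on the TensorFlow and Python versions (cp38 stands for CPython 3.8), so on another setup it will differ. Instead of typing it by hand one can pick up whatever wheel the build produced and then do a quick import as a smoke test. A sketch under these assumptions (newest_wheel is a made-up helper):

```shell
# Print the most recently modified wheel in a directory.
# Usage: newest_wheel [DIR]   (hypothetical helper, default /tmp/tensorflow_pkg)
newest_wheel() {
    dir="${1:-/tmp/tensorflow_pkg}"
    newest=""
    for f in "$dir"/*.whl; do
        [ -e "$f" ] || continue          # the glob matched nothing
        if [ -z "$newest" ] || [ "$f" -nt "$newest" ]; then
            newest="$f"
        fi
    done
    echo "$newest"
}

wheel="$(newest_wheel)"
if [ -n "$wheel" ]; then
    pip install "$wheel"
    # Smoke test: the import fails immediately if the build is unusable on this CPU.
    python3 -c 'import tensorflow as tf; print(tf.__version__)'
else
    echo "no wheel found; did build_pip_package finish?"
fi
```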
So the tensorflow part is done; now let us proceed to horovod. But first an aside: for a quick start I prefer conda solutions, since they (if not broken) can be set up in two minutes. IBM does provide a good tensorflow channel, but their horovod channel is herovod (Russian native speakers will understand; for all others: just do not waste your time on it).
Contrary to Google with their bazel, Uber relies on the good old cmake, which we need to install first.
sudo snap install cmake --classic
Then we need to set up openmpi. Explaining what it is lies beyond the scope of these notes, but openmpi needs to be able to connect to the cluster nodes via SSH without asking for a password, so
sudo apt-get install openssh-server
sudo systemctl enable ssh
sudo systemctl start ssh
ssh-keygen
ssh-copy-id vasily@SERVER2
ssh SERVER2
ssh-keygen
ssh-copy-id vasily@SERVER1
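Before involving MPI it is worth checking that the key-based login really works without a prompt, because mpirun tends to just hang otherwise. With BatchMode=yes ssh fails instead of asking for a password. A small sketch (the function names and the SSH_BIN override are my own invention; the override only exists so that the check can be dry-run):

```shell
# Succeed only if HOST accepts a login without prompting for a password.
# SSH_BIN is a hypothetical hook to substitute the ssh binary for a dry run.
can_ssh_without_password() {
    "${SSH_BIN:-ssh}" -o BatchMode=yes -o ConnectTimeout=5 "$1" true
}

# Check all given hosts; return non-zero if at least one of them fails.
check_nodes() {
    failed=0
    for host in "$@"; do
        if can_ssh_without_password "$host" 2>/dev/null; then
            echo "$host: ok"
        else
            echo "$host: FAILED"
            failed=1
        fi
    done
    return "$failed"
}

# Example: check_nodes SERVER1 SERVER2
```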
After you have set up the SSH service, install openmpi and test it
sudo apt install openmpi-bin
mpirun -H SERVER1:1,SERVER2:1 hostname
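The argument of -H is a comma-separated list of host:slots pairs, where slots is the number of processes to start on that host. With many nodes it is more convenient to generate the list than to type it. A sketch (mpi_host_list is my own helper; the host names are just examples following the naming of my nodes):

```shell
# Build an mpirun -H argument: SLOTS processes on each of the given hosts.
# Usage: mpi_host_list SLOTS HOST...   (hypothetical helper)
mpi_host_list() {
    slots="$1"
    shift
    list=""
    for host in "$@"; do
        # Prepend a comma only if the list already has an entry.
        list="${list:+$list,}$host:$slots"
    done
    echo "$list"
}

mpi_host_list 4 OptiPlex01 OptiPlex02   # prints OptiPlex01:4,OptiPlex02:4

# Example:
#   mpirun -H "$(mpi_host_list 4 OptiPlex01 OptiPlex02)" hostname
```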
Finally build horovod, install it
git clone --recursive https://github.com/uber/horovod.git
cd horovod
python setup.py clean
python setup.py bdist_wheel
HOROVOD_WITH_TENSORFLOW=1 pip install ./dist/horovod-0.22.1-cp38-cp38-linux_x86_64.whl[tensorflow,keras]
and run the test case
mpirun -H SERVER1:3,SERVER2:3 python3 /home/vasily/horovod/examples/tensorflow2/tensorflow2_keras_mnist.py
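If the test case complains about missing TensorFlow or MPI support, horovodrun --check-build lists which frameworks and controllers the wheel was actually built with. The wrapper below is only a sketch (horovod_build_report is a made-up name); I assume here that pip put horovodrun into ~/.local/bin, which is the case for user installs.

```shell
# Show what the installed horovod was built with, if horovodrun can be found.
# (hypothetical helper)
horovod_build_report() {
    if command -v horovodrun >/dev/null 2>&1; then
        horovodrun --check-build
    else
        echo "horovodrun is not on PATH; is ~/.local/bin included?"
        return 1
    fi
}

horovod_build_report || true
```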
Note that the official horovod documentation is partially obsolete, so you had better look directly at horovod's GitHub.
REFERENCES:
https://askubuntu.com/questions/927854/how-do-i-increase-the-size-of-swapfile-without-removing-it-in-the-terminal
https://stackoverflow.com/questions/41986507/unable-to-set-default-python-version-to-python3-in-ubuntu
https://docs.bazel.build/versions/main/install-bazelisk.html
https://serverfault.com/questions/241588/how-to-automate-ssh-login-with-password
https://www.tensorflow.org/install/source
Hope these notes helped you ... and will not become obsolete too fast.
In the next blog post I will tell you how I cheaply bought not-so-legacy hardware on eBay and what performance I achieved for my models with a Xeon and a Tesla GPU. Stay tuned!