I had a hard time getting TensorFlow with GPU support and OpenAI Gym working side by side on an AWS EC2 instance, and it seems I’m in good company. For some time I used NVIDIA-Docker for this, but as much as I love Docker, depending on special access to the (NVIDIA) GPU drivers took away some of the biggest advantages of using Docker, at least for my use cases. Running OpenAI Gym in a normal container, exposing a port to the outside and running agents, neural nets etc. elsewhere seems like a really promising approach, and I’m looking forward to it being ready.
There are good explanations on how to get TensorFlow with CUDA going, and those were pretty helpful to me. However, I suppose they were mostly concerned with supervised learning. If you want to run certain OpenAI Gym environments headless on a server, you have to provide an X server to them, even when you don’t want to render a video. You can use a virtual framebuffer like xvfb for this, and it works fine. But I could never get it to work with GLX support, and other solutions like X-Dummy failed as well. The problem is that there’s no way to keep NVIDIA from installing its OpenGL libs when you use packages from a repo (which most of the tutorials do, because it’s way more convenient). Finally, this comment by pemami4911 in a GitHub issue pointed me in the right direction. Since many people seem to run into the same problems, maybe the following will spare you some trouble.
I’m using Python 3 because it really annoys me that everyone in the deep learning community still uses Python 2.7. I’m also using Ubuntu 16.04 LTS (Xenial Xerus) because it has been out for ages (there is even an official Canonical AMI on AWS), and when spinning up a single new instance to compute just a few things, I don’t see why I would use Ubuntu 14.04 just because 16.04 is not among the three AMIs AWS offers me first. But it should be no trouble to adapt this to other needs.
OT: I also recommend using the AWS CLI together with the oh-my-zsh AWS plugin, because zsh and especially oh-my-zsh are awesome.
Okay, let’s begin:
Go to https://cloud-images.ubuntu.com/locator/ec2/ and look for the Ubuntu 16.04 LTS (Xenial Xerus) hvm:ebs-ssd AMI from Canonical. For eu-central-1 it’s ami-8504fdea. Spin up a new EC2 GPU instance (g2.2xlarge will do) using that AMI (use a spot instance if you are as broke as me). I recommend a root volume of at least 20 GB. SSH into your instance; the username is ubuntu.
Let’s get some basics:
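A minimal sketch of this step, assuming the stock package names in the Ubuntu 16.04 archive. The function name `install_basics` is only for illustration, and nothing runs until you call it:

```shell
# Basic tooling: OpenJDK 8 (for Bazel), git, SWIG and the Python 3 build dependencies.
# The package selection is an assumption, not a verified copy of the original list.
install_basics() {
    sudo apt-get update
    sudo apt-get install -y openjdk-8-jdk git swig \
        python3-dev python3-pip python3-numpy python3-wheel
}

# On the instance, run: install_basics
```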
The Java stuff is for Bazel, Google’s build tool, which we will use later for compiling TensorFlow. We will use OpenJDK, but if you have to, you can also use the proprietary one from Oracle.
Also some kernel sources, compilers and other stuff:
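Something along these lines should cover it. The exact package set is an assumption: the NVIDIA installer needs a compiler toolchain and headers matching the running kernel, and `install_build_tools` is a made-up name:

```shell
# Compiler toolchain plus kernel headers and sources, so the NVIDIA runfile
# can build its kernel module against the running kernel.
install_build_tools() {
    sudo apt-get install -y build-essential \
        "linux-headers-$(uname -r)" linux-source
}

# On the instance, run: install_build_tools
```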
While we’re at it, let’s install Bazel right now:
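A sketch based on Bazel’s apt repository instructions for Ubuntu. The repository line and key URL are taken from those instructions as I remember them, so double-check them against the current Bazel docs. Also keep in mind that an older TensorFlow checkout may require an older Bazel release than the one apt installs:

```shell
# Add Bazel's apt repository and signing key, then install Bazel.
install_bazel() {
    echo "deb [arch=amd64] http://storage.googleapis.com/bazel-apt stable jdk1.8" \
        | sudo tee /etc/apt/sources.list.d/bazel.list
    curl -fsSL https://bazel.build/bazel-release.pub.gpg | sudo apt-key add -
    sudo apt-get update
    sudo apt-get install -y bazel
}

# On the instance, run: install_bazel
```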
Now curl or wget the right NVIDIA driver. For the GRID K520 GPU, 367.57 should be the right choice (maybe Linus would call it the least wrong choice at most).
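For example, with wget. The URL below follows the layout of NVIDIA’s public download server and is an assumption; if it does not resolve, look the driver up on NVIDIA’s driver download page instead:

```shell
DRIVER_VERSION=367.57
DRIVER_RUN="NVIDIA-Linux-x86_64-${DRIVER_VERSION}.run"
# Assumed URL pattern, not verified against the original post.
DRIVER_URL="http://us.download.nvidia.com/XFree86/Linux-x86_64/${DRIVER_VERSION}/${DRIVER_RUN}"

download_nvidia_driver() {
    wget "$DRIVER_URL"
}

# On the instance, run: download_nvidia_driver
```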
The NVIDIA driver will clash with the nouveau driver, so deactivate it:
Insert the following lines and save:
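The usual blacklist entries look like the sketch below. The file name `/etc/modprobe.d/blacklist-nouveau.conf` is an assumption (any `.conf` file in that directory is read), and the helper `nouveau_blacklist` only prints the lines so you can paste them into your editor or pipe them into place:

```shell
# Print the modprobe configuration that keeps nouveau from loading.
nouveau_blacklist() {
    cat <<'EOF'
blacklist nouveau
blacklist lbm-nouveau
options nouveau modeset=0
alias nouveau off
alias lbm-nouveau off
EOF
}

# On the instance, instead of editing by hand:
#   nouveau_blacklist | sudo tee /etc/modprobe.d/blacklist-nouveau.conf
```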
Update the initramfs (basically the functionality to mount your real rootfs, which has been moved out of the kernel) and reboot:
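On Ubuntu this boils down to two commands (wrapped in an illustratively named function so nothing reboots by accident):

```shell
rebuild_initramfs_and_reboot() {
    # Regenerate the existing initramfs so the nouveau blacklist takes effect at boot.
    sudo update-initramfs -u
    sudo reboot
}

# On the instance, run: rebuild_initramfs_and_reboot
```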
Make the NVIDIA driver runfile executable, install the driver, and reboot one more time, just to be sure.
IMPORTANT: In my experience xvfb will only work if you use the --no-opengl-files option!
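A sketch of the driver installation, assuming the runfile sits in the current directory under the name shown (adjust `DRIVER_RUN` to whatever you downloaded):

```shell
DRIVER_RUN="NVIDIA-Linux-x86_64-367.57.run"   # assumed file name

install_nvidia_driver() {
    chmod +x "$DRIVER_RUN"
    # --no-opengl-files keeps the installer from putting NVIDIA's OpenGL files
    # on the system, which is what lets xvfb work afterwards.
    sudo "./$DRIVER_RUN" --no-opengl-files
    sudo reboot
}

# On the instance, run: install_nvidia_driver
```

After the reboot, `nvidia-smi` should list the GPU if the driver loaded correctly.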
Now wget the CUDA 8.0 toolkit from NVIDIA:
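A hedged example: both the exact 8.0 build (8.0.44 here) and the URL are assumptions, so take the real link from NVIDIA’s CUDA download page (choose the runfile installer for Linux x86_64) if this one has moved:

```shell
CUDA_RUN="cuda_8.0.44_linux-run"   # assumed build of CUDA 8.0
# Assumed download location, not verified against the original post.
CUDA_URL="https://developer.nvidia.com/compute/cuda/8.0/prod/local_installers/${CUDA_RUN}"

download_cuda() {
    wget "$CUDA_URL"
}

# On the instance, run: download_cuda
```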
While it’s downloading, register at NVIDIA, download the cuDNN 5 runtime library to your local machine, and SCP it to the remote instance:
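Run on your local machine, this could look as follows. The archive name is the one NVIDIA uses for cuDNN 5.1 built against CUDA 8.0 and may differ for your download; the key file and host are placeholders you pass in:

```shell
# Usage: upload_cudnn <path-to-ssh-key.pem> <instance-hostname-or-ip>
upload_cudnn() {
    scp -i "$1" cudnn-8.0-linux-x64-v5.1.tgz "ubuntu@$2:~/"
}
```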
Now make the runfile executable and install CUDA, but don’t install the driver. The --override option also helps to prevent some annoying errors.
IMPORTANT: Be sure to use the --no-opengl-libs option!
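A sketch of the installer call, assuming the runfile name below. The installer is interactive; answer "no" when it asks whether to install the bundled graphics driver:

```shell
CUDA_RUN="cuda_8.0.44_linux-run"   # assumed file name, use the one you downloaded

install_cuda() {
    chmod +x "$CUDA_RUN"
    # --override skips the compiler version check,
    # --no-opengl-libs keeps the installer from placing OpenGL libraries.
    sudo "./$CUDA_RUN" --override --no-opengl-libs
}

# On the instance, run: install_cuda
```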
Now open your .bashrc
and add the following lines:
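The lines typically look like this, assuming CUDA went into its default prefix `/usr/local/cuda-8.0`:

```shell
# CUDA 8.0 default install prefix (an assumption; adjust if you chose another path)
export CUDA_HOME=/usr/local/cuda-8.0
export PATH="$CUDA_HOME/bin:$PATH"
# Prepend the CUDA libraries without leaving a stray ':' when the variable was empty
export LD_LIBRARY_PATH="$CUDA_HOME/lib64${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
```

Run `source ~/.bashrc` (or log in again) to pick up the new variables in the current session.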
Once the SCP transfer is complete, extract the cuDNN archive to the right locations:
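A sketch following NVIDIA’s usual cuDNN instructions, assuming the archive name and the CUDA prefix shown here:

```shell
CUDA_PREFIX=/usr/local/cuda-8.0   # assumed CUDA install location

install_cudnn() {
    tar -xzf cudnn-8.0-linux-x64-v5.1.tgz            # unpacks into ./cuda
    sudo cp cuda/include/cudnn.h "$CUDA_PREFIX/include/"
    sudo cp -P cuda/lib64/libcudnn* "$CUDA_PREFIX/lib64/"   # -P preserves the symlinks
    sudo chmod a+r "$CUDA_PREFIX/include/cudnn.h" "$CUDA_PREFIX"/lib64/libcudnn*
}

# On the instance, run: install_cudnn
```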
Reboot one more time:
Now git clone TensorFlow and start the configuration:
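Roughly as follows. Which branch or tag to check out is left open here, since it depends on the TensorFlow release you want to pair with CUDA 8.0 and cuDNN 5:

```shell
fetch_and_configure_tensorflow() {
    git clone https://github.com/tensorflow/tensorflow
    cd tensorflow || return 1
    # Interactive: asks for the Python binary, CUDA/cuDNN versions and similar settings.
    ./configure
}

# On the instance, run: fetch_and_configure_tensorflow
# (this leaves your shell inside the tensorflow checkout, ready for the build)
```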
I used /usr/bin/python3.5 for the Python binary and /usr/local/lib/python3.5/dist-packages for the path, CUDA SDK 8.0 and cuDNN 5.1.5, and compiled without cloud support and OpenCL, but with GPU support of course. The compute capability of the instance’s GPU is 3.0.
Okay, now we compile everything:
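The build command from TensorFlow’s build-from-source instructions of that era, to be run from the root of the checkout after `./configure`:

```shell
build_tensorflow() {
    # -c opt: optimized build; --config=cuda: enable GPU support
    bazel build -c opt --config=cuda //tensorflow/tools/pip_package:build_pip_package
}

# Inside the tensorflow checkout, run: build_tensorflow
```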
This will take some time. You could watch this video while you’re waiting:
Build the wheel:
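The script that Bazel just built does this; the output directory `/tmp/tensorflow_pkg` is an arbitrary choice:

```shell
WHEEL_DIR=/tmp/tensorflow_pkg   # any writable directory works

build_wheel() {
    # Run from the root of the tensorflow checkout; writes tensorflow-*.whl to WHEEL_DIR
    bazel-bin/tensorflow/tools/pip_package/build_pip_package "$WHEEL_DIR"
}

# Inside the tensorflow checkout, run: build_wheel
```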
If you want, you can use a virtual environment:
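A sketch with Python’s built-in `venv` module. The environment location `~/tf-env` is a made-up name, and the wheel path assumes the wheel was written to `/tmp/tensorflow_pkg`:

```shell
install_wheel_in_venv() {
    sudo apt-get install -y python3-venv        # venv support is a separate package on Ubuntu
    python3 -m venv "$HOME/tf-env"
    . "$HOME/tf-env/bin/activate"               # stays active in the calling shell
    pip install --upgrade pip
    pip install /tmp/tensorflow_pkg/tensorflow-*.whl
}

# On the instance, run: install_wheel_in_venv
```

Without a virtual environment, `sudo pip3 install /tmp/tensorflow_pkg/tensorflow-*.whl` installs the wheel system-wide instead.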
Installing OpenAI Gym is pretty straightforward, because the people at OpenAI and all the other contributors have done an amazing job (: Sometimes there are problems with Box2D, though. If you want to be sure, follow the instructions below. We already installed SWIG (version 3.x is needed before compiling Box2D). Git clone pybox2d:
Build and install it:
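Clone, build and install could be sketched as follows, using the standard `setup.py` steps from the pybox2d repository:

```shell
install_pybox2d() {
    git clone https://github.com/pybox2d/pybox2d
    (
        # Subshell, so the caller's working directory is left unchanged
        cd pybox2d || exit 1
        python3 setup.py build
        # Drop sudo when installing into an active virtual environment
        sudo python3 setup.py install
    )
}

# On the instance, run: install_pybox2d
```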
Installing OpenAI Gym should now be no trouble at all. We already installed most of the dependencies, but I copy-pasted everything from their GitHub instructions just to be sure:
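A sketch based on the dependency list in the Gym README of that time, with the `python-` packages swapped for their `python3-` counterparts (that swap is an assumption on my side):

```shell
install_gym() {
    sudo apt-get install -y python3-numpy python3-dev cmake zlib1g-dev libjpeg-dev \
        xvfb libav-tools xorg-dev python3-opengl libboost-all-dev libsdl2-dev swig
    git clone https://github.com/openai/gym
    (
        # Subshell, so the caller's working directory is left unchanged
        cd gym || exit 1
        pip3 install -e '.[all]'   # editable install with all environment extras
    )
}

# On the instance, run: install_gym
```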
If you had problems running OpenAI Gym headless with xvfb as the X server, it should now work if you follow Trevor Blackwell’s explanation in this post (the GLX option is active by default):
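The gist is to wrap your command in `xvfb-run` and give the virtual screen a 24-bit depth. The screen size below is the one I remember from that post, and `my_agent.py` is a placeholder for your own script:

```shell
# Usage: run_headless python3 my_agent.py
run_headless() {
    # -s passes the quoted arguments on to the Xvfb server
    xvfb-run -s "-screen 0 1400x900x24" "$@"
}
```

Calling `run_headless bash` instead gives you a whole shell in which every program sees the virtual display.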
Have fun (:
If you run into any trouble, just let me know. I’m always happy to help.