Overview

Programming a graphics processing unit (GPU) can seem like a distant world to a Java developer, which is understandable: most Java use cases do not map naturally onto GPUs, and GPU programming is one of the few areas of the technology space where Java is not dominant. Yet Java is heavily used in enterprise software, application development, machine learning, data science, and the financial sector, and many of these applications demand heavy computation and a lot of processing power. This is where the concept of the general-purpose GPU (GPGPU) comes in.

In this article, we discuss how Java developers can benefit from GPGPU by automatically running Java on the GPU. We first give an overview of the TornadoVM project and its architecture, and then walk through the different parts of TornadoVM with a worked example.

TornadoVM

One very common tool for GPGPU is OpenCL. OpenCL provides a base implementation in C, which is technically accessible from Java via the Java Native Interface (JNI) or Java Native Access (JNA), but that route involves too much low-level work for most developers. Java bindings such as JavaCL exist, but many of them are no longer functional or not up to date. This is where TornadoVM comes to the rescue.

TornadoVM is a Java programming and execution framework for offloading and running JVM applications on GPUs. It extends the Graal JIT compiler with a new backend that generates OpenCL from Java. Applications written for TornadoVM are single-source: the same code expresses both the host code and the accelerated code. TornadoVM can also perform live task migration across different computing devices.

Why do we need a Java GPU framework like TornadoVM?

No single computer architecture is well suited to executing every type of workload efficiently. This leads to heterogeneous hardware that mixes different computing elements: multi-core CPUs, Field Programmable Gate Arrays (FPGAs), and GPUs. This diversity can be very efficient, but it also requires programs written specifically, and efficiently, for each of these devices.

CUDA and OpenCL are two prime examples of heterogeneous programming models. However, they are not designed to work well with Java, and their APIs expose various low-level features that make them difficult for inexperienced developers to use.

Here, TornadoVM proves to be a suitable alternative to low-level parallel programming languages for heterogeneous computing. It is a parallel programming framework for Java that can transparently and dynamically offload Java bytecode to OpenCL and execute the generated code on a GPU. TornadoVM also integrates an optimizing runtime that reuses device buffers and minimizes data transfers between devices.

Working with TornadoVM

Let’s get into the details with a code demonstration. The following code programs and runs matrix multiplication with TornadoVM on a dedicated or integrated GPU. Matrix multiplication is a good starting example because it is very simple.


The following code snippet shows the matrix multiplication programmed in Java:

class Calculate {
    public static void matrixMult(final float[] A, final float[] B, final float[] C, final int size) {
        for (int i = 0; i < size; i++) {
            for (int j = 0; j < size; j++) {
                float sum = 0;
                for (int k = 0; k < size; k++) {
                    sum += A[(i * size) + k] * B[(k * size) + j];
                }
                C[(i * size) + j] = sum;
            }
        }
    }
}
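Before offloading anything, it helps to sanity-check the kernel on the plain JVM. The following is a minimal, self-contained sketch with hypothetical 2x2 inputs (matrices are stored in row-major order, as the indexing above assumes):

```java
import java.util.Arrays;

public class MatrixMultDemo {

    // Sequential matrix multiplication, as in Calculate.matrixMult above,
    // reproduced here so the demo is self-contained.
    static void matrixMult(final float[] A, final float[] B, final float[] C, final int size) {
        for (int i = 0; i < size; i++) {
            for (int j = 0; j < size; j++) {
                float sum = 0;
                for (int k = 0; k < size; k++) {
                    sum += A[(i * size) + k] * B[(k * size) + j];
                }
                C[(i * size) + j] = sum;
            }
        }
    }

    public static void main(String[] args) {
        // Hypothetical 2x2 inputs: A = [[1,2],[3,4]], B = [[5,6],[7,8]]
        float[] a = {1, 2, 3, 4};
        float[] b = {5, 6, 7, 8};
        float[] c = new float[4];
        matrixMult(a, b, c, 2);
        System.out.println(Arrays.toString(c)); // [19.0, 22.0, 43.0, 50.0]
    }
}
```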

To accelerate this code with TornadoVM on a GPU, we first have to annotate the loops that can be parallelized. Here, we can fully parallelize the two outermost loops, as there are no dependencies between their iterations. The code is annotated with TornadoVM's @Parallel annotation.


See the following code:

class Calculate {
    public static void matrixMult(final float[] A, final float[] B, final float[] C, final int size) {
        for (@Parallel int i = 0; i < size; i++) {
            for (@Parallel int j = 0; j < size; j++) {
                float sum = 0;
                for (int k = 0; k < size; k++) {
                    sum += A[(i * size) + k] * B[(k * size) + j];
                }
                C[(i * size) + j] = sum;
            }
        }
    }
}

The @Parallel annotation acts as a hint to the TornadoVM JIT compiler, which transforms Java bytecode into OpenCL. The compiler does not force parallelization. Instead, it checks whether the annotated loops can be parallelized and, if so, replaces them with the equivalent parallel indexing in OpenCL (get_global_id(dimension)). If the loops have dependencies and cannot be parallelized, TornadoVM simply executes the sequential code.

Java developers also need to identify which Java methods to accelerate on the GPU. For that purpose, TornadoVM exposes a lightweight task-based API that defines the list of methods to be accelerated. Developers can then group these tasks into a task schedule.


The following code snippet shows a task schedule created for the matrix multiplication:

TaskSchedule t = new TaskSchedule("s1")
        .task("t1", Calculate::matrixMult, A, B, result, size)
        .streamOut(result);

A task-schedule object (t) is created. Its constructor takes a name of your choice for the schedule; this name is later used to change the device on which all its tasks are executed. We then define a set of tasks. To keep things simple, there is only one task for now.


The parameters of the task are:

  • the name (here, “t1”);
  • a reference to the method to be accelerated (in this case, the method matrixMult of the class Calculate);
  • the remaining parameters, which correspond to the actual parameters of the method.
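Putting the pieces together, a complete host program might look like the following sketch. The import package names correspond to the TornadoVM API of this era and are an assumption on my part (the API has evolved in later releases); calling execute() on the schedule triggers JIT compilation to OpenCL and runs the task on the selected device:

```java
// Sketch only: package names below are assumed from the TornadoVM 0.x API
// and may differ in newer releases.
import uk.ac.manchester.tornado.api.TaskSchedule;
import uk.ac.manchester.tornado.api.annotations.Parallel;
import java.util.Arrays;

public class Calculate {

    public static void matrixMult(final float[] A, final float[] B, final float[] C, final int size) {
        for (@Parallel int i = 0; i < size; i++) {
            for (@Parallel int j = 0; j < size; j++) {
                float sum = 0;
                for (int k = 0; k < size; k++) {
                    sum += A[(i * size) + k] * B[(k * size) + j];
                }
                C[(i * size) + j] = sum;
            }
        }
    }

    public static void main(String[] args) {
        final int size = 256;
        float[] A = new float[size * size];
        float[] B = new float[size * size];
        float[] result = new float[size * size];
        Arrays.fill(A, 2.0f);   // arbitrary demo data
        Arrays.fill(B, 1.5f);

        TaskSchedule t = new TaskSchedule("s1")
                .task("t1", Calculate::matrixMult, A, B, result, size)
                .streamOut(result);  // copy the result buffer back to the host
        t.execute();                 // compile to OpenCL and run on the device
    }
}
```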

Accelerating Java code on GPUs with TornadoVM

Now we will run this code on TornadoVM, using GraalVM 19.3.0 as the JDK. To execute this code, it is assumed that OpenCL is already installed.

$ mkdir -p TornadoVM
$ cd TornadoVM
$ wget https://github.com/graalvm/graalvm-ce-builds/releases/download/vm-19.3.0/graalvm-ce-java11-linux-amd64-19.3.0.tar.gz
$ tar -xf graalvm-ce-java11-linux-amd64-19.3.0.tar.gz
$ export JAVA_HOME=$PWD/graalvm-ce-java11-19.3.0
$ git clone --depth 1 https://github.com/beehive-lab/TornadoVM
$ cd TornadoVM
$ export PATH=$PWD/bin/bin:$PATH
$ export TORNADO_SDK=$PWD/bin/sdk
$ export CMAKE_ROOT=<SET YOUR PATH TO CMAKE ROOT>
$ make graal-jdk-11
$ export TORNADO_ROOT=$PWD


Next, download the example repository and build it:

$ git clone https://github.com/jjfumero/qconlondon2020-tornadovm
$ cd qconlondon2020-tornadovm/
$ export JAVA_HOME=/path/to/graalvm-ce-java11-19.3.0
$ export PATH="${PATH}:${TORNADO_ROOT}/bin/bin/"  ## Defined previously
$ export TORNADO_SDK=${TORNADO_ROOT}/bin/sdk
$ export CLASSPATH=target/tornado-1.0-SNAPSHOT.jar
$ mvn clean install


Now we have everything ready to execute the matrix multiplication example. Start by checking which devices are available:

$ tornado --devices
Number of Tornado drivers: 1
Total number of devices: 2

Tornado device=0:0
          Intel(R) OpenCL -- Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz
                   Global Memory Size: 31.0 GB
                   Local Memory Size: 32.0 KB
                   Workgroup Dimensions: 3
                   Max WorkGroup Configuration: [8192, 8192, 8192]
                   Device OpenCL C version: OpenCL C 1.2

Tornado device=0:1
          Intel(R) OpenCL HD Graphics -- Intel(R) Gen9 HD Graphics NEO
                   Global Memory Size: 24.8 GB
                   Local Memory Size: 64.0 KB
                   Workgroup Dimensions: 3
                   Max WorkGroup Configuration: [256, 256, 256]
                   Device OpenCL C version: OpenCL C 2.0

In this case, I have two devices available on my system: an Intel multi-core CPU and an integrated Intel HD Graphics GPU. TornadoVM selects device 0 (the first device) by default, but users can change the device by associating tasks with specific devices.
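As an illustration, a task can be pinned to a device at launch time with a JVM property of the form <schedule>.<task>.device (the exact flag shown here is an assumption based on the schedule and task names used above). For example, to run task t1 of schedule s1 on the integrated GPU (device 0:1):

$ tornado -Ds1.t1.device=0:1 qconlondon.MatrixMultiplication 256 tornado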


Following is a run with the default configuration:

$ tornado qconlondon.MatrixMultiplication 256 tornado


This program executes the matrix multiplication method 100 times and reports the total time of each iteration:

$ tornado qconlondon.MatrixMultiplication 256 tornado
Computing MxM of 256
Total time: 77568790 (ns), 0.0776 (s)
Total time: 3133182 (ns), 0.0031 (s)
Total time: 3126146 (ns), 0.0031 (s)

We have focused on a simple matrix multiplication example to demonstrate how the different parts of the TornadoVM runtime and JIT compiler work. However, with TornadoVM you can program more than a single task, and use more than simple data types. In general, TornadoVM is well suited for accelerating workloads that follow SIMD (Single Instruction, Multiple Data) patterns, and for pipelined Java applications.
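For example, a task schedule can chain several tasks that run back-to-back on the device, reusing buffers in between so intermediate data never travels back to the host. A hypothetical sketch (matrixAdd is a placeholder method, not defined in this article):

```java
// Hypothetical two-stage pipeline: temp = A x B, then result = temp + C.
// matrixAdd is assumed to be written in the same style as matrixMult.
TaskSchedule t = new TaskSchedule("s2")
        .task("t1", Calculate::matrixMult, A, B, temp, size)
        .task("t2", Calculate::matrixAdd, temp, C, result, size)
        .streamOut(result);   // only the final result is copied back
t.execute();
```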


Conclusion

TornadoVM is a plugin for OpenJDK and GraalVM that allows Java developers to run Java programs on GPUs and other heterogeneous hardware. This article explored the functionality of TornadoVM through an example, showing how Java code can be parallelized and executed on a GPU. It only scratches the surface of how Java can be accelerated using GPUs, and TornadoVM is one way to do it.

Author

Shaharyar Lalani is a developer with a strong interest in business analysis, project management, and UX design. He writes and teaches extensively on themes current in the world of web and app development, especially in Java technology.
