Getting started with CUDA

The following instructions are for Unix and OS X platforms. Windows support is coming soon.

To use the GPU, Delite requires CUDA 3.2 or later. Delite also requires the CUDA libraries Thrust and cuBLAS, which are included automatically with the CUDA toolkit download as of CUDA 4.0.

Once CUDA is installed, you must tell Delite how to find the installation through Delite's configuration files. From the Delite home directory, navigate to config/delite and create CUDA.xml with a text editor. An example template is available in this directory that you can copy and then modify. You must now add three pieces of information:


  • Within the <compiler> tag, add the absolute path to nvcc. For example:
     <compiler> /usr/local/cuda/bin/nvcc </compiler>
  • Within the <arch> tag, specify the compute capability of the installed GPU (e.g., "2.0" for Fermi devices or "1.x" for Tesla devices).

  • Within the <headers> tag, add the path to "jni.h" of your current JDK distribution as shown below:
     <path> ${JAVA_HOME}/include </path> 
    If using an Oracle JVM, add the additional path to "jni_md.h" (/include/linux for Linux and /include/darwin for OS X):
     <path> ${JAVA_HOME}/include/linux </path> 
    These are required because Delite links the generated CUDA code to the generated Scala code via JNI.
  • Almost done. Now open cuBLAS.xml and set the <compiler> tag to the path to nvcc and the <arch> tag to your GPU's compute capability, just as you did in CUDA.xml.
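
Taken together, the three CUDA.xml entries described above might look like the fragment below. This is only a sketch: it assumes nvcc lives in /usr/local/cuda/bin, a Fermi device, and an Oracle JVM on Linux. The enclosing elements and the exact way the <arch> value is written are not shown here and should be taken from the example template in config/delite.

```xml
<!-- Fragment only: keep whatever enclosing elements the shipped template uses. -->
<compiler> /usr/local/cuda/bin/nvcc </compiler>
<!-- Assumes a Fermi device; substitute your GPU's compute capability. -->
<arch> 2.0 </arch>
<headers>
  <path> ${JAVA_HOME}/include </path>
  <!-- Assumes an Oracle JVM on Linux; use ${JAVA_HOME}/include/darwin on OS X. -->
  <path> ${JAVA_HOME}/include/linux </path>
</headers>
```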

Congratulations! You're now ready to run OptiML applications on the GPU. GPU execution can be enabled or disabled for each application. To enable CUDA code generation in the compiler, use the "--cuda" flag when calling delitec:

     delitec AppRunner --cuda 

With this flag, the compiler generates both Scala and CUDA versions of the application kernels. You can still choose to run the application on the CPU(s) only or on the CPU(s) and the GPU. To use the GPU when running the application, add the "--cuda 1" flag when calling delite:

     delite AppRunner [app_params] --cuda 1 


Getting started with BLAS

OptiML can optionally use BLAS libraries to perform certain linear algebra operations on the CPU. Providing OptiML with a BLAS implementation is highly recommended, because certain dense matrix operations (e.g., matrix multiplication) run significantly faster with BLAS than with OptiML's default implementation. You should be able to use any implementation of BLAS that adheres to the CBLAS specification (e.g., MKL, GotoBLAS).

Once BLAS is installed on your machine, navigate from the Delite home directory to config/delite and create BLAS.xml in a text editor. A few example templates are available in this directory that you can copy and then modify. You must now add a few pieces of information so that Delite can link to BLAS when compiling:


  • Within the <compiler> tag, add the absolute path to the C compiler installed on your machine (e.g., gcc, icc).

  • Within the <headers> tag, add the path to "jni.h" of your current JDK distribution as shown below:
     <path> ${JAVA_HOME}/include </path> 
    If using an Oracle JVM, add the additional path to "jni_md.h" (/include/linux for Linux and /include/darwin for OS X):
     <path> ${JAVA_HOME}/include/linux </path> 
  • Similarly, add the absolute path(s) to all include directories for BLAS.

  • Finally, still within <headers>, add the name of the header file that contains the CBLAS interface within the <include> tag. This is typically "cblas.h" (or "mkl.h" if using MKL).

  • Within the <libs> tag, add all paths to BLAS library directories within <path> tags.

  • Finally, add all the necessary BLAS libraries within <library> tags (e.g., -lcblas). Delite can only be linked with shared object libraries (the file names should end in ".so" rather than ".a").

Congratulations! Your OptiML applications will now be accelerated with BLAS. To confirm that the OptiML compiler uses BLAS libraries when possible, check that the option "no_blas=False" appears in the options echoed when running delitec.
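
As an illustration of the BLAS.xml steps above, the entries for a generic CBLAS installation might be laid out roughly as follows. The compiler location and the /opt/blas directories are hypothetical placeholders, and an Oracle JVM on Linux is assumed. The enclosing elements are omitted and should be taken from one of the example templates in config/delite.

```xml
<!-- Fragment only: keep whatever enclosing elements the shipped templates use. -->
<compiler> /usr/bin/gcc </compiler>            <!-- hypothetical compiler location -->
<headers>
  <path> ${JAVA_HOME}/include </path>
  <path> ${JAVA_HOME}/include/linux </path>    <!-- assumes an Oracle JVM on Linux -->
  <path> /opt/blas/include </path>             <!-- hypothetical BLAS include directory -->
  <include> cblas.h </include>                 <!-- "mkl.h" if using MKL -->
</headers>
<libs>
  <path> /opt/blas/lib </path>                 <!-- hypothetical BLAS library directory -->
  <library> -lcblas </library>
</libs>
```

A different BLAS implementation will generally need different include directories, library directories, and <library> entries, so treat every concrete path here as something to replace.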