Deep Learning with Delite and OptiML

Introduction

The deep learning examples are the fastest way to become familiar with deep learning in OptiML. This page provides a user reference guide.

XML File Format
Dataset Layout
Network Generation
Running the Network

XML File Format

See cnn_example.xml and rnn_example.xml (in the NeuralNetwork/ directory) for examples of all the XML tags and their supported attributes.

Networks are specified in XML files. Each XML file contains a net tag, with a number of layer tags nested inside it. The XML file is used as input to a generator script (either generate_cnn.py, for convolutional neural networks, or generate_rnn.py, for recurrent neural networks). The generator script generates a Scala (OptiML) file implementing the neural network. This file calls functions implemented in NetLib.scala. The generator script also creates parameter files for the network, so that settings such as the learning rate or momentum can be modified per layer.

net tag

The net tag has the following attributes:

name -- The name of the network
Legal Values: Any string, no spaces or punctuation allowed except _
Required: Yes
Description: When the network is generated, a subdirectory with this name will be created in the NeuralNetwork/ directory. That subdirectory will contain a Scala (OptiML) source file with the same name, which is also the name of the OptiMLApplication implemented in that file.
dataset_path -- The path to the dataset
Legal Values: Either an absolute path or a path relative to published/OptiML directory (where the app is run from) that contains the dataset files.
Required: Yes
Description: See Dataset Layout for the contents of the dataset directory.
blas -- Uses BLAS (or cuBLAS on the GPU) GEMM instead of OptiML parallel dot products to do matrix multiplication.
Legal Values: "1" uses BLAS library calls for every GEMM, "0" (default) does not.
Required: No (default: "0")

Convolutional neural networks (i.e. XML files read by generate_cnn.py) additionally have the following attributes:

img_size -- Input image size
Legal Values: Square images specified in the following format: "32x32"
Required: Yes
colormap -- Whether input images are RGB or Grayscale
Legal Values: "RGB" or "Grayscale"
Required: Only if images are RGB (default: "Grayscale")
num_input_channels -- An alternative to colormap. Specifies the number of channels of an input image (RGB = "3", Grayscale = "1", etc.)
Legal Values: Any integer string (e.g. "4")
Required: No (default: "1")
Description: Use num_input_channels to specify the number of input channels in each row of your input data. E.g. a grayscale image has 1 and an RGB image has 3. "colormap" is a shorthand for num_input_channels, i.e. colormap="Grayscale" is analogous to num_input_channels="1" and colormap="RGB" is analogous to num_input_channels="3". If num_input_channels is specified, colormap is not needed. If both are specified, colormap is ignored. If neither is specified, the default is Grayscale (1 channel).
lr_cmd_line_arg -- This specifies whether the learning rate should be an input through the command line ("1") or a separate parameter file ("0").
Legal Values: "0" (learning rate through parameter files) or "1" (learning rate as input argument)
Required: No (default: "0")
Description: If specified, the network can be trained by running, for example:
delite CNNExampleCompiler 0.01
or with CUDA,
delite CNNExampleCompiler --cuda 1 0.01
This is useful if you want to quickly or programmatically set all layers to have the same learning rate (rather than modifying the parameter file for each layer)
input_size -- This is an alternative to the img_size attribute. Use this attribute when the inputs to the network are not images, but arbitrary data vectors.
Legal Values: Any integer string (e.g. "400")
Required: Either this or img_size is required
Description: While generate_cnn.py is meant to generate convolutional networks that take 2D images as input (i.e. networks that do 2D convolution and pooling), it can also generate networks which contain only fully-connected / softmax layers. These have no 2D spatial interpretation and just treat the input as a 1D vector. That means that generate_cnn.py can generate networks for any type of input data.
If the generated network contains convolution or pooling layers, i.e. layers with 2D computations, then img_size must be specified and the size must be square. If input_size is specified instead, it will result in an error during network generation.
But if the network only contains fully-connected and softmax layers (1D computations), then either img_size or input_size can be used to specify the size of the input vector. E.g. for networks with no convolution/pooling, all these are equivalent:

img_size="20x20"
img_size="40x10"
img_size="400x1"
img_size="1x400"
input_size="400"

Here each input example is a vector with 400 elements, which could be a 20x20 image or any length-400 vector of arbitrary data. Note that if the network contains any convolution or pooling layers, then only the first form (img_size="20x20") is valid, because the image must be square.
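The precedence and validity rules above for colormap, num_input_channels, img_size and input_size can be summarized in a short Python sketch. This is only an illustration of the rules as described on this page, not the code used by generate_cnn.py; the function names and the has_2d_layers flag are hypothetical.

```python
def resolve_num_channels(attrs):
    """Return the number of input channels implied by a net tag's attributes."""
    if "num_input_channels" in attrs:
        # num_input_channels wins; colormap is ignored if both are given.
        return int(attrs["num_input_channels"])
    colormap = attrs.get("colormap", "Grayscale")  # default is Grayscale
    if colormap not in ("RGB", "Grayscale"):
        raise ValueError("colormap must be 'RGB' or 'Grayscale'")
    return 3 if colormap == "RGB" else 1


def resolve_input_size(attrs, has_2d_layers):
    """Return the size given by img_size or input_size (channels not included).

    has_2d_layers is a hypothetical flag: True if the network contains any
    convolution or pooling layers.
    """
    if "img_size" in attrs:
        height, width = (int(v) for v in attrs["img_size"].split("x"))
        if has_2d_layers and height != width:
            raise ValueError("img_size must be square for convolution/pooling")
        return height * width
    if has_2d_layers:
        raise ValueError("img_size is required for convolution/pooling layers")
    if "input_size" in attrs:
        return int(attrs["input_size"])
    raise ValueError("either img_size or input_size is required")
```

For a network with only fully-connected and softmax layers, resolve_input_size returns 400 for img_size="40x10" and for input_size="400" alike, which matches the equivalence listed above.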

Recurrent neural networks (i.e. XML files read by generate_rnn.py) additionally have the following attributes:

samples_per_window -- The number of samples per time window
Legal Values: Any integer string (e.g. "25")
Required: Yes
Description: For example, if the input data is speech sampled at 1 kHz and each time step (interval) is 10 ms, set the value to "10", because each 10 ms window contains 10 input samples.
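The arithmetic in this example can be written out as follows. This is a small sketch only; the function name is hypothetical, and it assumes the window must contain a whole number of samples.

```python
def samples_per_window(sample_rate_hz, window_ms):
    """Number of samples in one time window, using integer arithmetic."""
    total = sample_rate_hz * window_ms  # samples per 1000 windows
    if total % 1000 != 0:
        # Assumption: a window must hold a whole number of samples.
        raise ValueError("window does not contain a whole number of samples")
    return total // 1000


# 1 kHz speech with a 10 ms time step gives samples_per_window="10".
print(samples_per_window(1000, 10))
```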

layer tag

The net tag contains a number of layer tags, which define the architecture of the network. For convolutional networks, only a single "stack" of layers is supported, not arbitrary directed acyclic connectivity. Similarly, for recurrent networks, only a single "stack" of layers is specified in the XML file, with implied connectivity between each hidden layer at time t and its corresponding hidden layer at time t+1. The order in which layer tags appear specifies their order in the network: the first layer listed takes input from the dataset, and the final layer listed is the output layer.

The following layer types are supported (see cnn_example.xml for an example containing all of the layers and their usage)

CONVOLUTION -- valid in XML files read by generate_cnn.py only (not valid in recurrent networks)
MAX_POOL -- valid in XML files read by generate_cnn.py only (not valid in recurrent networks)
FULLY_CONNECTED
SOFTMAX -- valid as the final layer only

Mandatory attributes

All layer tags have the following mandatory attributes:

name -- currently only used for comments in the code
type -- one of the supported layer types above (e.g. CONVOLUTION, in all capitals)

All layer tags except for MAX_POOL layers also have the following mandatory attribute:

num_hidden -- For convolutional layers, this is the number of output feature maps of that layer. For softmax layers, this is the number of outputs (classes). For fully-connected layers, this is the number of hidden units.

Max-pool layers require the pool_size attribute, which is an integer, e.g. "2". This specifies the pooling size (2x2 in this example). Overlapping pooling is currently not supported.

Convolutional layers require the kernel_size attribute, which is an odd integer, e.g. "5". This specifies that convolution kernels (receptive fields) of size 5x5 be used. A stride of 1 is used. Other strides are currently not supported. Furthermore, convolutions do not change feature map sizes, i.e. padding is added automatically to maintain feature map size. Variable padding is also currently not supported.
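To see how these two attributes affect feature map sizes, the following sketch tracks the spatial size through a convolution and a pooling layer. It uses the standard padding formula for an odd kernel with stride 1; the function names are hypothetical, and the requirement that the size be divisible by pool_size is an assumption (this page does not say how other sizes are handled).

```python
def same_padding(kernel_size):
    """Padding per side that keeps the size unchanged (odd kernel, stride 1)."""
    if kernel_size % 2 == 0:
        raise ValueError("kernel_size must be odd")
    return (kernel_size - 1) // 2


def conv_output_size(size, kernel_size):
    """Output size of a stride-1 convolution with automatic padding."""
    pad = same_padding(kernel_size)
    return size + 2 * pad - kernel_size + 1  # always equals size


def pool_output_size(size, pool_size):
    """Output size of non-overlapping pooling."""
    if size % pool_size != 0:
        # Assumption: the input size is a multiple of pool_size.
        raise ValueError("size must be divisible by pool_size")
    return size // pool_size


size = 32
size = conv_output_size(size, 5)   # still 32
size = pool_output_size(size, 2)   # 16
print(size)
```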

Optional attributes

Fully-connected layers optionally support the dropout attribute, which is a number between 0 and 1, e.g. "0.5". This specifies the dropout probability during training of each hidden unit activation (default: "0").

Convolution and fully-connected layers have the following optional attribute:

activation -- (optional, default: "ReLU") The type of output activation unit. Options are LINEAR (no activation), LOGISTIC (sigmoid between 0 and 1), and ReLU (Rectified linear).
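The layer rules above can be illustrated with a small checker. The XML string below is a hypothetical network assembled from the tags and attributes described on this page (the name and dataset_path values are placeholders); cnn_example.xml remains the authoritative example. The checker is a sketch and is not part of generate_cnn.py.

```python
import xml.etree.ElementTree as ET

# Hypothetical network description; values are placeholders.
EXAMPLE_XML = """
<net name="sketch_net" dataset_path="path/to/dataset" img_size="32x32" colormap="RGB">
  <layer name="conv1" type="CONVOLUTION" num_hidden="16" kernel_size="5" activation="ReLU"/>
  <layer name="pool1" type="MAX_POOL" pool_size="2"/>
  <layer name="fc1" type="FULLY_CONNECTED" num_hidden="128" dropout="0.5"/>
  <layer name="out" type="SOFTMAX" num_hidden="10"/>
</net>
"""

LAYER_TYPES = {"CONVOLUTION", "MAX_POOL", "FULLY_CONNECTED", "SOFTMAX"}
ACTIVATIONS = {"LINEAR", "LOGISTIC", "ReLU"}


def check_layers(xml_text):
    """Return a list of rule violations; an empty list means none were found."""
    net = ET.fromstring(xml_text.strip())
    layers = net.findall("layer")
    problems = []
    for i, layer in enumerate(layers):
        a = layer.attrib
        label = a.get("name", f"layer {i}")
        ltype = a.get("type")
        if "name" not in a:
            problems.append(f"{label}: missing name")
        if ltype not in LAYER_TYPES:
            problems.append(f"{label}: unsupported type {ltype!r}")
            continue
        if ltype == "SOFTMAX" and i != len(layers) - 1:
            problems.append(f"{label}: SOFTMAX is only valid as the final layer")
        if ltype == "MAX_POOL":
            if "pool_size" not in a:
                problems.append(f"{label}: MAX_POOL requires pool_size")
        elif "num_hidden" not in a:
            problems.append(f"{label}: missing num_hidden")
        if ltype == "CONVOLUTION":
            k = a.get("kernel_size")
            if k is None or int(k) % 2 == 0:
                problems.append(f"{label}: kernel_size must be an odd integer")
        if "dropout" in a:
            if ltype != "FULLY_CONNECTED" or not 0.0 <= float(a["dropout"]) <= 1.0:
                problems.append(f"{label}: invalid dropout")
        if "activation" in a:
            if ltype not in ("CONVOLUTION", "FULLY_CONNECTED") or a["activation"] not in ACTIVATIONS:
                problems.append(f"{label}: invalid activation")
    return problems


print(check_layers(EXAMPLE_XML))
```

Running a check like this before generation can catch mistakes such as a SOFTMAX layer that is not last, although the generator script may report such errors itself.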

Dataset Layout

See the examples/ directory for dataset examples. The dataset is specified by six files: train_data.txt, train_labels.txt, val_data.txt, val_labels.txt, test_data.txt and test_labels.txt (test_data and test_labels are optional). The *_data files specify a matrix and the *_labels files a vector. The data matrix is in tsv format, with one example per row, i.e. each line is a row of the matrix, with elements in the row separated by tabs. The labels vector has 1 entry per line.
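As a sketch of this file layout, the following Python function writes one split of a dataset in the format described above. The function name is hypothetical, and the exact number formatting that the generated application accepts is not specified here, so values are written with Python's default str conversion.

```python
import os


def write_split(directory, split, data, labels):
    """Write <split>_data.txt (tab-separated rows) and <split>_labels.txt.

    split is "train", "val" or "test"; data is a list of equal-length rows;
    labels has one entry per row of data.
    """
    if len(data) != len(labels):
        raise ValueError("data and labels must have the same number of examples")
    if len({len(row) for row in data}) > 1:
        raise ValueError("every row of the data matrix must have the same length")
    os.makedirs(directory, exist_ok=True)
    with open(os.path.join(directory, f"{split}_data.txt"), "w") as f:
        for row in data:
            # One example per line, elements separated by tabs.
            f.write("\t".join(str(x) for x in row) + "\n")
    with open(os.path.join(directory, f"{split}_labels.txt"), "w") as f:
        for label in labels:
            # One label per line.
            f.write(f"{label}\n")
```

For example, write_split("my_dataset", "train", [[0.1, 0.2], [0.3, 0.4]], [0, 1]) would create train_data.txt with two tab-separated rows and train_labels.txt with two lines.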

For image data (convolutional networks), where each example is an image (or a number of images, e.g. 3 RGB colormaps), concatenate the rows of each image to form a single vector. For example, for a dataset containing 32x32 RGB images, each row of the data files will have length 32x32x3 = 3072. For a training set of 10,000 images, the data matrix will therefore have 10,000 rows and 3072 columns. Columns 0-1023 correspond to the 32x32 red colormap, the next 1024 columns to the green colormap, and the final 1024 to the blue colormap (the RGB ordering can be changed as long as each colormap is specified by 1024 consecutive columns). Within each group of 1024 columns, the first 32 columns represent the first row of that colormap, the next 32 the second row, etc.
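The column ordering described above (colormap by colormap, then row by row) can be sketched in Python as follows. The function names are hypothetical; images are represented as plain nested lists to keep the example self-contained.

```python
def flatten_image(channels):
    """Flatten a list of 2D colormaps into one data-matrix row.

    channels is a list such as [red, green, blue], where each colormap is a
    list of image rows. All of one colormap is written before the next.
    """
    flat = []
    for channel in channels:
        for image_row in channel:
            flat.extend(image_row)
    return flat


def column_index(channel, row, col, height=32, width=32):
    """0-based column in the flattened vector for a given pixel."""
    return channel * height * width + row * width + col


# A 32x32 RGB image flattens to 3072 columns.
image = [[[0] * 32 for _ in range(32)] for _ in range(3)]
print(len(flatten_image(image)))
# The green colormap (channel 1) starts at column 1024.
print(column_index(1, 0, 0))
```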

For time series data (recurrent networks), all the samples in an example are concatenated into a single row. For example, for speech data, if each audio sample lasts 10 seconds with a sample rate of 1 kHz, the row length will be 10,000. The time step (interval size) is then specified in the XML file: e.g. if there are 20 samples per time window, then this row of size 10,000 corresponds to 500 time windows. Finally, because input data is specified as a single matrix, each row must have the same length. This means that each example must have the same total number of samples (each row must have the same number of columns). In cases where this is unrealistic (e.g. in speech, not all utterances have the same duration), all examples should be padded to the same length.
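The window count and padding described above can be sketched as follows. The function names are hypothetical, and padding at the end of each row with zeros is an assumption; this page does not say which padding value or position to use.

```python
def num_time_windows(row_length, samples_per_window):
    """Number of time windows in one row of the data matrix."""
    if row_length % samples_per_window != 0:
        # Assumption: the row length is a multiple of samples_per_window.
        raise ValueError("row length must be a multiple of samples_per_window")
    return row_length // samples_per_window


def pad_rows(rows, pad_value=0.0):
    """Pad every row at the end so all rows match the longest one."""
    target = max(len(row) for row in rows)
    return [list(row) + [pad_value] * (target - len(row)) for row in rows]


# 10 seconds at 1 kHz, 20 samples per window.
print(num_time_windows(10000, 20))
```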

The sections above describe how to handle inputs which are 2-dimensional (e.g. images, for which convolutional networks are used) or time-series (for which recurrent networks are used). For data which is neither 2-dimensional nor time-series, a convolutional network should still be used (i.e. pass the XML file to generate_cnn.py), but one that contains no convolution/pooling layers, only fully-connected and softmax layers. See the input_size attribute in the XML File Format section for more details.

Network Generation

The examples page describes how to generate networks once the XML file and dataset have been created. The XML file is used as input to a generator script (either generate_cnn.py, for convolutional neural networks, or generate_rnn.py, for recurrent neural networks). The generator script generates a Scala (OptiML) file implementing the neural network. This file calls functions implemented in NetLib.scala. The generator script also creates parameter files for the network, so that settings such as the learning rate or momentum can be modified per layer.

Note that the generated files are placed in a new subdirectory of the NeuralNetwork/ apps directory. In the future this directory may be an input argument during generation, but currently all generated files are placed in this subdirectory which is relative to published/OptiML/. Specifically, relative to published/OptiML/, the generated files are placed in: apps/src/NeuralNetwork/name.

Network generation does the following:

Create a subdirectory in NeuralNetwork/ for this new network, with the name given in the XML file
Create a .scala (OptiML) source file for this network
Create a global_params.txt file, described below
Create a layer_*_params.txt file for each layer in the network, described below
Create a checkpoints subdirectory to store weight checkpoints during training
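After generation, a quick way to confirm that the steps above completed is to look for the expected files. The sketch below is not part of the generator scripts; the function name is hypothetical, and it assumes the source file is named after the network with a .scala extension and that the checkpoints subdirectory is called checkpoints.

```python
import glob
import os


def missing_generated_files(net_dir):
    """Return the expected generated items that are absent from net_dir.

    net_dir is the network's subdirectory, e.g. apps/src/NeuralNetwork/name
    relative to published/OptiML/.
    """
    name = os.path.basename(os.path.normpath(net_dir))
    missing = []
    # Assumption: the OptiML source file is <name>.scala.
    if not os.path.isfile(os.path.join(net_dir, name + ".scala")):
        missing.append(name + ".scala")
    if not os.path.isfile(os.path.join(net_dir, "global_params.txt")):
        missing.append("global_params.txt")
    # At least one per-layer parameter file should exist.
    if not glob.glob(os.path.join(net_dir, "layer_*_params.txt")):
        missing.append("layer_*_params.txt")
    if not os.path.isdir(os.path.join(net_dir, "checkpoints")):
        missing.append("checkpoints/")
    return missing
```

An empty list means every item named in the steps above was found.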

The generated OptiML source file contains two modes, either training or testing. The mode is specified in the global parameters file. The parameter files specify how to run the network, including hyper-parameters such as learning rate and momentum as well as the number of training epochs and how often to create a "checkpoint" by saving the weights during training. The network application reads these files to determine what to do. The generated OptiML source file also contains gradient checking code, which is commented out by default but can be uncommented for debugging if experimenting with new algorithms.

Note that information describing the network architecture (such as the number of layers) was all specified before code generation in the XML file describing the network. Information describing how to run the network (such as the number of epochs to run) is described in these automatically generated parameter files, which are read every time the network is run and can be changed without having to recompile or regenerate any code.

Running the Network

The examples page describes how to run networks. Note that the generated code contains hard-coded directory paths pointing to the dataset as well as to saved training weights, layer parameters, etc. For this reason, it is important that the networks be run from the published/OptiML/ directory. In the future the hard-coded directory for the parameter files may instead be an input argument during generation, but currently all generated files are placed in a subdirectory relative to published/OptiML/, specifically apps/src/NeuralNetwork/name, and so the call to delite must be made from the published/OptiML/ directory.

This section describes the settings in the automatically generated parameter files, which describe how to run the network.

global_params.txt

The generated file global_params.txt specifies the number of epochs to run and the mini-batch size during training. It also specifies how often during training to check the current network performance. This can be done on the training or validation set. For example, every 10 epochs you may instruct the network to run on the training and validation sets. This will determine the error and cross entropy on each dataset. Or, if you only want to run the check on the validation set every 10 epochs, you can omit the training set from the check by setting its frequency to "0 epochs" (and similarly you can omit the validation set). You can also specify the mini-batch size during these checks in the global_params.txt file.
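The check schedule described above can be sketched as a small function. This is an illustration only: the function names are hypothetical, and it assumes that epochs are counted from 1 and that a check runs whenever the epoch number is a multiple of the configured frequency.

```python
def should_run_check(epoch, every_n_epochs):
    """True if a periodic check is due; a frequency of 0 disables the check."""
    return every_n_epochs > 0 and epoch % every_n_epochs == 0


def checks_for_epoch(epoch, train_every, val_every):
    """Names of the datasets to evaluate at this epoch."""
    due = []
    if should_run_check(epoch, train_every):
        due.append("training")
    if should_run_check(epoch, val_every):
        due.append("validation")
    return due


# Validation every 10 epochs, training set checks disabled.
print(checks_for_epoch(10, 0, 10))
```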

Just as you can run checks on the training/validation sets every few epochs, you can also choose to save the network weights to the checkpoints/ directory every few epochs, and also keep a copy of the most recent weights that the network can read in. The latest weights are stored in the same directory as the parameter files, and all previous checkpoints (including a copy of the latest weights) are stored in the checkpoints directory. I.e. every time a checkpoint epoch is reached, two things will be done:

A unique checkpoint will be created for these weights in the checkpoints directory
The latest weights will be updated in the parameters directory

If you want to stop training and then restart where you left off, set the "Read model from file" and "Read momentum from file" parameters to "1" in global_params.txt. If these are set, then instead of initializing the weights from random values, the network will read in the weight and momentum files stored in the root directory for the network (the directory where the parameter files are). For example if the network contains only a single layer, it will read in the files w0.txt and b0.txt (layer 0 weights and biases), as well as dw0.txt and db0.txt (layer 0 momentum for weights and biases, if the "Read momentum from file" option is set). In order for this to work, at least one checkpoint must have already been saved. Then you can stop training the network at any time and restart, and it will pick up from the previous checkpoint. Alternatively, previous checkpoints can be copied into the network root directory (e.g. overwriting w0.txt and b0.txt in the example above), in which case the network will start from that checkpoint.
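Before restarting, it can help to confirm that the files the network will read are actually present. The sketch below generalizes the single-layer file names above (w0.txt, b0.txt, dw0.txt, db0.txt) to layer index i; that numbering for networks with more than one layer is an assumption, and the function names are hypothetical.

```python
import os


def restart_files(num_layers, read_momentum):
    """File names expected in the network root directory for a restart."""
    files = []
    for i in range(num_layers):
        # Assumption: layer i uses the same naming as layer 0 in the text.
        files += [f"w{i}.txt", f"b{i}.txt"]
        if read_momentum:
            files += [f"dw{i}.txt", f"db{i}.txt"]
    return files


def missing_restart_files(net_dir, num_layers, read_momentum):
    """Expected restart files that do not exist in net_dir."""
    return [name for name in restart_files(num_layers, read_momentum)
            if not os.path.isfile(os.path.join(net_dir, name))]


print(restart_files(1, True))
```

If missing_restart_files returns a non-empty list, no checkpoint has been saved yet (or the files were moved), and the restart settings in global_params.txt should not be enabled.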

global_params.txt also lets you select whether you want to train the network (Test mode = "0") or test on the test set (Test mode = "1").

Automatic Learning Rate Reduction

If you specify in global_params.txt that the validation set should be run periodically, the network will automatically record the validation error over time in a log file (the same is true for the training error). If the validation error ever increases, the network will automatically reduce the learning rates by a factor of 10 and continue training. Recall that initial learning rates for each layer can be set in the layer parameter files, or all at once for all layers by specifying the net attribute lr_cmd_line_arg.
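The reduction rule can be sketched as follows. This is not the code from NetLib.scala; the function name is hypothetical, and it assumes that "increases" means the current validation error is higher than the one recorded at the previous check.

```python
def update_learning_rates(learning_rates, previous_val_error, current_val_error,
                          factor=10.0):
    """Divide every layer's learning rate by factor if validation error rose."""
    if previous_val_error is not None and current_val_error > previous_val_error:
        return [lr / factor for lr in learning_rates]
    return list(learning_rates)


rates = [1.0, 0.5]                                 # one learning rate per layer
rates = update_learning_rates(rates, 0.20, 0.18)   # error fell: unchanged
rates = update_learning_rates(rates, 0.18, 0.25)   # error rose: divided by 10
print(rates)
```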

layer_*_params.txt

Parameter files are also generated for each layer. These can be used to modify the learning rate, momentum and L2 regularization, as well as the ranges of the initial weights and biases. Weights are initialized randomly from a normal distribution with mean 0 and variance 1. The random weights from this distribution are then multiplied by your setting of the "initial weights" parameter. Similarly, the biases are initialized to the constant 1 and then multiplied by the value of the "initial biases" parameter (except for softmax layers, whose biases are always initialized to 0).
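The initialization scheme described above can be sketched in plain Python. This is an illustration, not the OptiML implementation; the function name, the argument names and the orientation of the weight matrix (inputs by outputs) are assumptions.

```python
import random


def init_layer(num_inputs, num_outputs, initial_weights, initial_biases,
               is_softmax=False, seed=None):
    """Return (weights, biases) initialized as described above."""
    rng = random.Random(seed)
    # Standard normal samples (mean 0, variance 1) scaled by "initial weights".
    weights = [[rng.gauss(0.0, 1.0) * initial_weights for _ in range(num_outputs)]
               for _ in range(num_inputs)]
    # Biases start at 1 and are scaled by "initial biases"; softmax uses 0.
    bias_value = 0.0 if is_softmax else 1.0 * initial_biases
    biases = [bias_value] * num_outputs
    return weights, biases


weights, biases = init_layer(4, 3, initial_weights=0.01, initial_biases=0.1, seed=0)
print(biases)
```

With initial_weights=0.01, the weights have a standard deviation of 0.01, so the "initial weights" parameter effectively sets the scale of the starting weights.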