Deep Learning with Delite and OptiML

Introduction

The deep learning examples are the fastest way to become familiar with deep learning in OptiML. This page provides a user reference guide.

Contents

  1. XML File Format
  2. Dataset Layout
  3. Network Generation
  4. Running the Network

XML File Format

See cnn_example.xml and rnn_example.xml (in the NeuralNetwork/ directory) for examples of all the XML tags and their supported attributes.

Networks are specified in XML files. Each XML file contains a net tag, as well as a number of layer tags within this net tag. The XML file is used as input to a generator script (either generate_cnn.py, for convolutional neural networks, or generate_rnn.py, for recurrent neural networks). The generator script generates a Scala (OptiML) file implementing the neural network. This file calls functions implemented in NetLib.scala. The generator script also creates parameter files for the network, allowing things like the learning rate or momentum to be modified per-layer.

net tag

The net tag has the following attributes:

Convolutional neural networks (i.e. XML files read by generate_cnn.py) additionally have the following attributes:

Recurrent neural networks (i.e. XML files read by generate_rnn.py) additionally have the following attributes:

layer tag

The net tag contains a number of layer tags, which define the architecture of the network. For convolutional networks, only a single "stack" of layers is supported, not arbitrary directed acyclic connectivity. Similarly, for recurrent networks only a single "stack" of layers is specified in the XML file, with implied connections between each hidden layer at time t and its corresponding hidden layer at time t+1. The order in which layer tags appear specifies their order in the network, from top to bottom: the first layer listed takes input from the dataset, and the final layer listed is the output layer.

The following layer types are supported (see cnn_example.xml for an example containing all of the layers and their usage):

Mandatory attributes

All layer tags have the following mandatory attributes:

All layer tags except for MAX_POOL layers also have the following mandatory attribute:

Max-pool layers require the pool_size attribute, which is an integer, e.g. "2". This specifies the pooling size (2x2 in this example). Overlapping pooling is currently not supported.

Convolutional layers require the kernel_size attribute, which is an odd integer, e.g. "5". This specifies that convolution kernels (receptive fields) of size 5x5 are used. A stride of 1 is used; other strides are currently not supported. Furthermore, convolutions do not change feature map sizes, i.e. padding is added automatically to maintain the feature map size. Variable padding is also currently not supported.
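
The size arithmetic implied by these two rules can be sketched as follows. This is an illustration of the arithmetic only (padding of (kernel_size - 1) / 2 with stride 1, non-overlapping pooling), not the NetLib.scala implementation:

    # Sketch of the feature map size rules described above (illustration only).
    def conv_output_size(input_size, kernel_size):
        # Stride is fixed at 1 and padding of (kernel_size - 1) / 2 is added
        # automatically, so the feature map size is unchanged.
        assert kernel_size % 2 == 1, "kernel_size must be odd"
        pad = (kernel_size - 1) // 2
        return input_size + 2 * pad - kernel_size + 1   # == input_size

    def pool_output_size(input_size, pool_size):
        # Non-overlapping pooling: each dimension shrinks by a factor of pool_size.
        return input_size // pool_size

    print(conv_output_size(32, 5))   # 32 -> 32 (padding preserves size)
    print(pool_output_size(32, 2))   # 32 -> 16 (2x2 pooling)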

Optional attributes

Fully-connected layers optionally support the dropout attribute, which is a number between 0 and 1, e.g. "0.5". This specifies the probability with which each hidden unit activation is dropped during training (default: "0").
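
Conceptually, the dropout attribute behaves like the sketch below: each activation is zeroed with probability p during training only. Whether the generated code also rescales the surviving activations (inverted dropout) is not specified here; this is an illustration of the concept, not the NetLib.scala implementation:

    # Conceptual sketch of dropout with probability p (illustration only).
    import numpy as np

    def dropout(activations, p, training=True):
        if not training or p == 0.0:
            return activations
        mask = (np.random.rand(*activations.shape) >= p).astype(activations.dtype)
        return activations * mask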

Convolution and fully-connected layers have the following optional attribute:

Dataset Layout

See the examples/ directory for dataset examples. The dataset is specified by six files: train_data.txt, train_labels.txt, val_data.txt, val_labels.txt, test_data.txt and test_labels.txt (test_data and test_labels are optional). The *_data files each specify a matrix and the *_labels files a vector. The data matrix is in tab-separated (TSV) format, with one example per row, i.e. each line is a row of the matrix, with elements in the row separated by tabs. The labels vector has one entry per line.
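
As a sketch, such a dataset split could be written from Python as follows. The file names match the list above; the helper function and the random data are purely illustrative:

    # Hypothetical helper that writes a data matrix and label vector in the
    # layout described above: one example per row, tab-separated values, and
    # one label per line.
    import numpy as np

    def write_split(prefix, data, labels):
        np.savetxt(prefix + "_data.txt", data, delimiter="\t")
        np.savetxt(prefix + "_labels.txt", labels, fmt="%d")

    train_data = np.random.rand(10000, 3072)        # 10,000 examples, 3072 features
    train_labels = np.random.randint(0, 10, 10000)  # one label per example
    write_split("train", train_data, train_labels)
    # Repeat for "val" and (optionally) "test".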

For image data (convolutional networks), where each example is an image (or a number of images, e.g. 3 RGB colormaps), concatenate each row of the images to form a single vector. For example, for a dataset containing 32x32 RGB images, each row of the data files will have length (32x32x3 =) 3072. For a training set of 10,000 images, the data matrix will therefore have 10,000 rows and 3072 columns. Columns 0-1023 correspond to the 32x32 red colormap, the next 1024 elements to the green colormap, and the final 1024 to the blue colormap (the RGB ordering can be changed as long as each colormap is specified by 1024 consecutive columns). Within this group of 1024 columns, the first 32 columns represent row 1 of that colormap, the next 32 row 2, etc.
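
In other words, the column index of a given pixel follows the indexing sketched below. The NumPy flattening example assumes the image is stored height x width x channels before conversion; that storage layout is an assumption for illustration:

    # Column index of pixel (row, col) in channel c (0=R, 1=G, 2=B) for a
    # 32x32 RGB image flattened as described above.
    import numpy as np

    def column_index(c, row, col, height=32, width=32):
        return c * height * width + row * width + col

    print(column_index(0, 0, 0))    # 0    : first pixel of the red colormap
    print(column_index(1, 0, 0))    # 1024 : first pixel of the green colormap
    print(column_index(2, 31, 31))  # 3071 : last pixel of the blue colormap

    # Flattening a height x width x channels image into one data row:
    img = np.random.rand(32, 32, 3)
    data_row = img.transpose(2, 0, 1).reshape(-1)   # length 3072, channel-major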

For time series data (recurrent networks), all the samples in an example are concatenated into a single row. For example, for speech data, if each audio sample lasts 10 seconds with a sample rate of 1kHz, the row length will be 10,000. The time step (interval size) is then specified in the XML file: e.g. if there are 20 samples per time window, then this row of size 10,000 corresponds to 500 time windows. Finally, because the input data is specified as a single matrix, each row must have the same length. This means that each example must have the same number of total samples (each row must have the same number of columns). In cases where this is unrealistic (e.g. in speech not all utterances have the same duration), all examples should be padded to have the same length.
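
A sketch of this preprocessing is shown below. The helper name and the choice of zero as the padding value are assumptions for illustration:

    # Sketch: pad variable-length utterances to a common length and compute
    # the number of time windows, as described above.
    import numpy as np

    def pad_examples(examples, pad_value=0.0):
        max_len = max(len(x) for x in examples)
        return np.array([np.pad(x, (0, max_len - len(x)), constant_values=pad_value)
                         for x in examples])

    utterances = [np.random.rand(10000), np.random.rand(8000)]  # 10 s and 8 s at 1 kHz
    data = pad_examples(utterances)            # shape (2, 10000)

    samples_per_window = 20                    # the time step set in the XML file
    num_windows = data.shape[1] // samples_per_window
    print(num_windows)                         # 500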

The networks above describe how to handle inputs which are 2-dimensional (e.g. images, for which convolutional networks are used) or time-series (for which recurrent networks are used). For data which is neither 2-dimensional nor time-series, a convolutional network should still be used (i.e. generate the XML file with generate_cnn.py), but one that contains no convolution/pooling, only fully-connected layers and softmax. See the input_size attribute in the XML section for more details.

Network Generation

The examples page describes how to generate networks once the XML file and dataset have been created. As described above, the XML file is passed to the appropriate generator script (generate_cnn.py for convolutional networks, generate_rnn.py for recurrent networks), which produces a Scala (OptiML) file implementing the network (calling functions implemented in NetLib.scala), along with the network's parameter files so that settings such as the learning rate or momentum can be modified per layer.

Note that the generated files are placed in a new subdirectory of the NeuralNetwork/ apps directory. In the future this output directory may become an input argument during generation, but currently all generated files are placed in apps/src/NeuralNetwork/name, relative to published/OptiML/.

Network generation does the following:

The generated OptiML source file supports two modes, training and testing; the mode is specified in the global parameters file. The parameter files specify how to run the network, including hyper-parameters such as the learning rate and momentum, as well as the number of training epochs and how often to create a "checkpoint" by saving the weights during training. The network application reads these files to determine what to do. The generated OptiML source file also contains gradient checking code, which is commented out by default but can be uncommented for debugging when experimenting with new algorithms.

Note that information describing the network architecture (such as the number of layers) is specified before code generation, in the XML file describing the network. Information describing how to run the network (such as the number of epochs) lives in these automatically generated parameter files, which are read every time the network is run and can be changed without recompiling or regenerating any code.

Running the Network

The examples page describes how to run networks. Note that the generated code contains hard-coded directory paths pointing to the dataset as well as the saved training weights, layer parameters, etc. For this reason, it is important that networks be run from the published/OptiML/ directory. In the future the hard-coded directory to the parameter files may instead become an input argument during generation, but currently all generated files are placed in apps/src/NeuralNetwork/name, relative to published/OptiML/, and so the call to delite must be made from the published/OptiML/ directory.

This section describes the settings in the automatically generated parameter files, which describe how to run the network.

global_params.txt

The generated file global_params.txt specifies the number of epochs to run and the mini-batch size during training. It also specifies how often during training to check the current network performance. This can be done on the training or validation set. For example, every 10 epochs you may instruct the network to run on the training and validation sets. This will determine the error and cross entropy on each dataset. Or, if you only want to run the check on the validation set every 10 epochs, you can omit the training set from the check by setting its frequency to "0 epochs" (and similarly you can omit the validation set). You can also specify the mini-batch size during these checks in the global_params.txt file.

Just as you can run checks on the training/validation sets every few epochs, you can also choose to save the network weights to the checkpoints/ directory every few epochs, and also keep a copy of the most recent weights that the network can read in. The latest weights are stored in the same directory as the parameter files, and all previous checkpoints (including a copy of the latest weights) are stored in the checkpoints directory. I.e. every time a checkpoint epoch is reached, two things are done:

  1. The current weights are saved as a new checkpoint in the checkpoints/ directory.
  2. The copy of the latest weights in the parameter file directory (the network root) is overwritten, so the network can read it back in.

If you want to stop training and then restart where you left off, set the "Read model from file" and "Read momentum from file" parameters to "1" in global_params.txt. If these are set, then instead of initializing the weights from random values, the network will read in the weight and momentum files stored in the root directory for the network (the directory where the parameter files are). For example if the network contains only a single layer, it will read in the files w0.txt and b0.txt (layer 0 weights and biases), as well as dw0.txt and db0.txt (layer 0 momentum for weights and biases, if the "Read momentum from file" option is set). In order for this to work, at least one checkpoint must have already been saved. Then you can stop training the network at any time and restart, and it will pick up from the previous checkpoint. Alternatively, previous checkpoints can be copied into the network root directory (e.g. overwriting w0.txt and b0.txt in the example above), in which case the network will start from that checkpoint.
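
For example, restoring a previously saved checkpoint for a single-layer network could look like the sketch below. Only w0.txt, b0.txt, dw0.txt and db0.txt in the network root directory are documented above; the file names under checkpoints/ used here are hypothetical:

    # Sketch: copy a saved checkpoint into the network root directory so that
    # training resumes from it (requires "Read model from file" = 1, and
    # "Read momentum from file" = 1 if the dw/db files are also copied).
    import shutil

    net_root = "apps/src/NeuralNetwork/name"           # parameter file directory
    checkpoint_prefix = net_root + "/checkpoints/epoch_50_"  # hypothetical naming

    for f in ["w0.txt", "b0.txt", "dw0.txt", "db0.txt"]:
        shutil.copy(checkpoint_prefix + f, net_root + "/" + f)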

global_params.txt also lets you select whether you want to train the network (Test mode = "0") or test on the test set (Test mode = "1").

Automatic Learning Rate Reduction

If you specify in global_params.txt that the validation set should be run periodically, the network will automatically record the validation error over time in a log file (the same is true for the training error). If the validation error ever increases, the network automatically reduces the learning rates by a factor of 10 and continues training (recall that the initial learning rate for each layer can be set in the layer parameter files, or for all layers at once by specifying the net attribute lr_cmd_line_arg).
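
The reduction rule amounts to the logic sketched below. This illustrates the behavior described above, not the generated OptiML code:

    # Sketch of the automatic learning rate reduction: whenever the validation
    # error increases relative to the previous check, every layer's learning
    # rate is divided by 10 and training continues.
    def maybe_reduce_learning_rates(val_errors, learning_rates):
        if len(val_errors) >= 2 and val_errors[-1] > val_errors[-2]:
            return [lr / 10.0 for lr in learning_rates]
        return learning_rates

    lrs = [0.01, 0.01, 0.001]
    lrs = maybe_reduce_learning_rates([0.12, 0.15], lrs)  # error went up -> reduce
    print(lrs)                                            # [0.001, 0.001, 0.0001]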

layer_*_params.txt

Parameter files are also generated for each layer. These can be used to modify the learning rate, momentum, and L2 regularization, as well as the ranges of the initial weights and biases. Weights are initialized randomly from a normal distribution with mean 0 and variance 1; the random weights drawn from this distribution are then multiplied by your setting of the "initial weights" parameter. Similarly, the biases are initialized to the constant 1 and then multiplied by the value of the "initial biases" parameter (except for softmax layers, whose biases are always initialized to 0).
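
In other words, the initialization corresponds to the NumPy sketch below. The weight matrix orientation (fan_out x fan_in) is an assumption for illustration; this is not the OptiML code itself:

    # Sketch of the initialization scheme described above.
    import numpy as np

    def init_layer(fan_in, fan_out, initial_weights, initial_biases, is_softmax=False):
        # Weights: standard normal (mean 0, variance 1), scaled by "initial weights".
        W = np.random.randn(fan_out, fan_in) * initial_weights
        # Biases: constant 1 scaled by "initial biases"; softmax layers use 0.
        b = np.zeros(fan_out) if is_softmax else np.ones(fan_out) * initial_biases
        return W, b

    W, b = init_layer(784, 100, initial_weights=0.01, initial_biases=0.1)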