FAQ
- I get errors like "error: don't know how to mirror StringToInt(Sym(674))". What do I do?
- How can I profile my application?
- I would like to use a Scala library within my OptiML code. Is this possible?
- Can I access the indices of an element while iterating through a vector or matrix?
- Is there a performance advantage to using map instead of a for loop?
- What flags do I need to pass to delitec (and/or delite) in order to have it execute my code with multiple threads?
- I have some constant/global values defined in my application and am getting "Couldn't find following op" errors when running delite. What's wrong?
- How can I tell if GPU code is being generated for my application?
I get errors like "error: don't know how to mirror StringToInt(Sym(674))". What do I do?
This annoying error reflects an incompleteness in the implementation of some IR nodes, either in OptiML, Delite, or LMS. Due to the sheer volume of IR nodes and our limited time, we have not yet implemented the mirroring functionality for all nodes. Mirroring is used to copy IR nodes during some optimizations (especially during transformations). The simplest way to avoid this error when testing for correctness is to pass the --nf flag to delitec and to not enable any additional optimizations (no -O flag). Please file a ticket on GitHub with your error log and we'll fix the errors as soon as possible. They are not difficult to address, just time-consuming to cover fully.
How can I profile my application?
There are multiple options, depending on the granularity at which you wish to profile.
One option is to use tic() and toc() calls, which are similar to their MATLAB counterparts. tic()/toc() calls should be paired, and OptiML will print the elapsed time between them when the program runs. If you want to make nested tic()/toc() calls, you can name them:
tic()
while (cond) {  // cond: any loop condition
  tic("while")
  // ... loop body ...
  toc("while")
}
toc()
You can also specify dependencies that must run before a tic or a toc, to prevent the compiler's code motion optimizations from reordering them:
val z = lotsOfWork()
tic("x", z)
val x = lotsOfMoreWork()
toc("x", x)
OptiML also has a time() function similar to Python's time(). It also accepts dependencies, and you can take the difference of two time() calls to compute the elapsed time in seconds:
val st = time()
val x = lotsOfWork()
val en = time(x)
println("elapsed: " + (en - st))
If you wish to go beyond application-level timing, Delite comes with an in-development profiler that measures op execution time across multiple resources and produces a simple HTML visualization that can help identify bottlenecks in your code. Please see the debugging page for more information and instructions.
Finally, another option is to use standard Java profiling tools, such as hprof, when running delite to execute your application. These tools provide very low-level (from OptiML's perspective) profile information, i.e. hotspots in generated code and DSL methods. Nonetheless, they can be helpful in discovering whether a particular call or set of instructions dominates execution time. If you encounter a hotspot using hprof that you can't trace back to your application code, please drop us an email.
I would like to use a Scala library within my OptiML code. Is this possible?
Unfortunately, the short answer is no. You cannot arbitrarily mingle unstaged library code with OptiML code, since OptiML uses staged values (wrapped in the Rep[T] type constructor), which represent future computations that OptiML will generate code to execute. In contrast, library code is normal Scala that is evaluated during staging, while OptiML is building its IR and generating code. Thus, in general, you will get type errors related to mixing Rep[T] with normal types, in particular when trying to pass the result of an OptiML computation to a library.
However, there are two sensible ways to use normal Scala code with OptiML. The first is to use the Scala code to do stage-time preprocessing (using only the pure Scala code and no OptiML). The pre-processed results will be evaluated immediately and then passed to the subsequent OptiML code as constant values. Therefore, any pre-processing stage using this partial evaluation technique should not be performance-critical. The second approach is to use OptiML and Delite's support for scopes, coarse-grained execution blocks that are compiled, staged, and executed independently. This feature is experimental and still under development, but it allows users to pass values into and out of staged blocks from surrounding ordinary Scala (or other DSL) code. For more information, please see our publications or shoot us an email.
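As a rough sketch of the first approach, a plain Scala helper runs once during staging, and its result is lifted into the OptiML program as a constant (the helper name and file format here are hypothetical, and DenseVector.rand is used as a representative OptiML constructor):
// Ordinary Scala: executed at staging time, not staged.
// parseDimension is a hypothetical helper, not part of OptiML.
def parseDimension(path: String): Int =
  scala.io.Source.fromFile(path).getLines().next().trim.toInt

def main() = {
  val n = parseDimension("config.txt") // plain Int, computed during staging
  val v = DenseVector.rand(n)          // OptiML sees n only as a constant
  println(v.length)
}
Note that if config.txt changes, the application must be re-staged, since n was baked into the generated code as a constant.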
Can I access the indices of an element while iterating through a vector or matrix?
OptiML doesn't in general support accessing the index of an element. There are usually ways to avoid needing the index while iterating over elements by using a different operation (a map, a map over the indices, a zip of the collection and its indices), but how elegant these are varies; it tends to depend on the actual operation you need to perform. As a last resort, you can always use a simple sequential while loop over an index variable, but this should be avoided in performance-critical sections if at all possible.
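For example, a construction over an index range often substitutes for explicit index access (a sketch; treat the exact constructor names as illustrative of OptiML's index-vector syntax):
val v = DenseVector.rand(100)

// Build a new vector from the index range 0 until v.length,
// reading both the element and its index without a while loop.
val weighted = (0::v.length) { i => v(i) * i }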
Is there a performance advantage to using map instead of a for loop?
In general, map should be preferred over for, since for loops have side effects by definition, and this can preclude some optimization opportunities.
However, for loops in OptiML are always parallel and we will also try to generate CUDA code for them, so there is no categorical performance hit to using them. Because they mix side effects and parallelism, though, users must be more careful with for loops (in the future, we plan to enhance the compiler to reject the most common problematic cases automatically, but currently there is no safety net). For example, both of the following cases are broken:
for (i <- indices) {
  out(i) = foo    // fine
  out(i+1) = bar  // race
}

var j = 0
for (i <- indices) {
  j += i          // race
  out(i) = j
}
The issue with both of these examples is that they write to shared objects inside parallel sections, which can result in multiple threads writing to the same memory location simultaneously (a typical race condition). Thus, the body of a for loop must follow certain rules (all writes must be disjoint) to execute correctly. These parallel loops are probably the least 'implicitly parallel' part of OptiML, and may disappear in a future version. For now, though, they allow careful users to obtain the best performance in certain situations.
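When the access pattern allows it, the same computation can often be expressed with only disjoint writes, or as a side-effect-free construction (a sketch reusing the placeholder foo from the first example above):
// Safe variant: each iteration writes only its own element,
// so all writes are disjoint.
for (i <- indices) {
  out(i) = foo
}

// Or, with no side effects at all: build the result directly
// from the index range (n is the desired output length).
val out2 = (0::n) { i => foo }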
What flags do I need to pass to delitec (and/or delite) in order to have it execute my code with multiple threads?
Both delitec and delite accept the --help command line argument, which lists all of the available options. To execute with multiple threads, simply pass -t <number of threads> to delite.
I have some constant/global values defined in my application and am getting "Couldn't find following op" errors when running delite. What's wrong?
All variables and values need to be declared/defined within the dynamic scope of the application's main method, or they won't be staged properly. See this example from the OptiML docs for a concrete description of how to organize OptiML code.
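As a rough sketch of the required structure (the application names here are illustrative; the runner/trait skeleton follows the standard OptiML application pattern):
object MyAppRunner extends OptiMLApplicationRunner with MyApp

trait MyApp extends OptiMLApplication {
  // val k = 10  // BAD: evaluated outside main's dynamic scope, so never staged
  def main() = {
    val k = 10   // GOOD: defined inside main, staged correctly
    val v = DenseVector.rand(k)
    println(v.sum)
  }
}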
How can I tell if GPU code is being generated for my application?
Make sure that you run delitec and delite with the --gpu option, and then look inside the generated code folder to check that CUDA kernels were generated (typically at $DELITE_HOME/generated/cuda/kernels). If there are no kernels, first make sure that you are properly following the instructions for CUDA in the getting started guide. If everything looks good, make sure there are no GPU generation warnings when running delitec, like:
** GPU Warning [unknown file] ** Code has nested memory allocations (not supported by current Delite GPU code generator). Try manually unrolling the outer loop. Stack Trace: <unknown>
These warnings mean that a CUDA kernel could not be generated for the reason listed, and that manual changes to the application code are required to make it more amenable to GPU code generation. Finally, if there are no warnings but still no GPU code generated, it could mean that no op in the application could be generated as a CUDA kernel. Please send us an email if you think this shouldn't be the case; we are happy to help you figure out what's going wrong.