A comparison of deep learning frameworks

A comparison of Theano with other deep learning frameworks, highlighting a series of design choices in no particular order.

Optimizations

Computation graph/IR

Most deep learning libraries rely on the ability to construct a computation graph, which can be considered the intermediate representation (IR) of our program. The lazy construction of a graph allows for optimization (Theano, CGT), scheduling (MXNet), and/or automatic differentiation (Torch, MXNet). The question is what information should be contained in this graph.
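As a rough illustration of lazy graph construction (a hypothetical mini-IR, not any particular framework's API), operators on symbolic nodes can build the graph without computing anything, and the structure is only walked, optimized, or differentiated when a result is actually requested::

    class Node:
        """A node in a hypothetical computation graph (IR)."""
        def __init__(self, op, inputs):
            self.op, self.inputs = op, inputs

        def __add__(self, other):
            return Node('add', [self, as_node(other)])

        def __mul__(self, other):
            return Node('mul', [self, as_node(other)])

    class Constant(Node):
        def __init__(self, value):
            super().__init__('const', [])
            self.value = value

    def as_node(value):
        return value if isinstance(value, Node) else Constant(value)

    def evaluate(node, env):
        """Interpret the graph; an optimizer or scheduler could rewrite it first."""
        if node.op == 'const':
            return node.value
        if node.op == 'var':
            return env[node]
        left, right = (evaluate(i, env) for i in node.inputs)
        return left + right if node.op == 'add' else left * right

    x = Node('var', [])         # building y = x * 2 + 1 constructs IR, computes nothing
    y = x * 2 + 1
    print(evaluate(y, {x: 3}))  # 7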

Shapes
The shape of a tensor can be defined when constructing the graph. Requiring this reduces flexibility significantly (for example, minibatches can have different shapes). Alternatively, shape information can be left out of the graph entirely, which is currently the case in Theano (except for some specialized operators like convolutions). [*] The middle ground would be to make the specification of shape information optional. This can be useful for optimization purposes: if some tensors are determined to be of fixed size, they could already be allocated at compilation time (trading memory efficiency for performance) [1]; a sketch of this middle ground follows below.
[*]Theano's documentation discusses "shape promotion" <http://deeplearning.net/software/theano/optimizations.html#term-shape-promotion>, but I can find no code for the mentioned optimizations, local_shape_lift_*.
[1]MXNet documentation <https://mxnet.readthedocs.org/en/latest/developer-guide/operator.html#operator-property>
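A sketch of what the optional middle ground might look like (hypothetical Var and preallocate names, purely for illustration): shape is optional metadata on a variable, and the compiler pre-allocates storage only for variables whose shape is fully known::

    import numpy as np

    class Var:
        def __init__(self, name, shape=None):
            # shape is optional: None, or a tuple that may contain None entries
            self.name, self.shape = name, shape

    def preallocate(variables):
        """Allocate buffers at compile time for variables with fully known shapes."""
        storage = {}
        for v in variables:
            if v.shape is not None and all(d is not None for d in v.shape):
                storage[v.name] = np.empty(v.shape, dtype='float32')
        return storage

    W = Var('W', shape=(128, 256))      # fixed size: can be allocated up front
    x = Var('x', shape=(None, 128))     # minibatch dimension left unspecified
    print(sorted(preallocate([W, x])))  # ['W']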
Memory allocation
The allocation of memory can be made part of the IR. That is, we can represent y = x + 1 as Var(x), Constant(1) -> [Add] -> y, or as const_storage = Allocate((1,)); y_storage = Allocate(x.shape); const_storage, x_storage, y_storage -> [Add] -> y_storage. It is unclear what the advantages of either approach are.
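The two styles might be written down roughly as follows (a hypothetical Op record, purely for illustration)::

    from dataclasses import dataclass, field

    @dataclass
    class Op:
        name: str
        inputs: list = field(default_factory=list)
        output: str = ''

    # Style 1: the IR only describes values; storage is implicit.
    graph_a = [Op('add', inputs=['x', 'const_1'], output='y')]

    # Style 2: allocation is itself part of the IR, and operators
    # write into explicitly represented storage buffers.
    graph_b = [
        Op('allocate', inputs=['(1,)'], output='const_storage'),
        Op('allocate', inputs=['shape(x)'], output='y_storage'),
        Op('add', inputs=['const_storage', 'x_storage'], output='y_storage'),
    ]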
Loops and branches
Loops and branches can be supported in different ways. Theano's approach seems to be to enforce acyclicity of the computation graph by treating loops as single nodes (the scan op), which itself contains another computation graph. Branches are handled by the task scheduler (confusingly called virtual machines in Theano). Most other frameworks don't have any support for loops and branches. A more principled way of dealing with loops would be to allow loops in the computation graph itself. In order to maintain strict static single assignment (SSA) form [2], this requires the implementation of Phi functions (or something similar [3]). Keeping the graph flat could make loops more scalable, e.g. by allowing nested loops; a sketch of the two views follows after these footnotes.
[2]Wikipedia: Static single assignment
[3]WebKit code
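A sketch of the two views (hypothetical; not Theano's actual scan implementation): a loop either appears as a single node that owns an inner subgraph, keeping the outer graph acyclic, or it is flattened into the outer graph, which then needs explicit merge points (Phi functions) to stay in SSA form::

    class LoopNode:
        """A scan-style loop: one graph node wrapping an inner computation graph."""
        def __init__(self, inner_graph, n_steps):
            self.inner_graph = inner_graph   # itself a full computation graph
            self.n_steps = n_steps

        def run(self, state):
            for _ in range(self.n_steps):
                state = self.inner_graph(state)   # the outer graph stays acyclic
            return state

    # Inner graph: one step of the recurrence state -> state + 1
    loop = LoopNode(inner_graph=lambda s: s + 1, n_steps=5)
    print(loop.run(0))   # 5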
Device independence
Theano tries to support a variety of backends, which share the same IR; the IR is transformed into a device-specific representation during the optimization phase. Device-independent optimization is separated from the device-specific work by setting optimization "levels". Programmatically, this could be separated more fully: in the spirit of compilers, we would have an IR optimizer and then an entirely separate device-specific optimizer and/or task scheduler (sketched below).
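In compiler terms, the pipeline might be organized roughly as follows (hypothetical pass names; the placeholder passes simply return the graph unchanged)::

    def compile_graph(graph, device):
        """Sketch of splitting device-independent and device-specific stages."""
        graph = device_independent_passes(graph)   # e.g. algebraic simplification, CSE
        if device == 'gpu':
            graph = lower_to_gpu(graph)            # device-specific lowering
        else:
            graph = lower_to_cpu(graph)
        return schedule(graph)                     # task scheduling / code generation

    # Identity placeholders so the sketch runs; real passes would rewrite the IR.
    device_independent_passes = lower_to_gpu = lower_to_cpu = schedule = lambda g: g
    print(compile_graph(['add', 'mul'], device='gpu'))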

GPU/CPU transparency

The interaction between CPU and GPU computation can be done in a variety of ways.

Fully transparent
Theano tries to make the interaction fully transparent by offloading to the GPU when possible, but also transparently moving data to the host and back again if needed, i.e. when a certain operator is only implemented on the CPU. The advantage is that the user can still make use of GPU acceleration, even when a particular op isn't implemented on the GPU. The downside is that a lot of subpar performance is due to Theano transferring data to the host without the user being aware of it (e.g. because they used advanced indexing). This means users still need to write "GPU-runnable" Theano code.
Explicit transfers
Torch takes a more explicit approach: GPU-enabled operators only accept GPU arrays, and vice versa. Moving tensors to and from the GPU simply requires calling cuda() or double() on the tensor [†]. The advantage is that the user is fully aware of which device the computation is being performed on. The downside is that scripts need to enable GPU usage by explicitly transferring tensors to the GPU; a sketch of this style follows after this list.
Exclusive mode
A final option is to disallow transfers between the CPU and GPU within a single computation graph. The disadvantage is that if a user really needs to perform part of the computation on the CPU, they will need to construct and compile separate computation graphs. The advantage is that the user is fully aware of any transfer. Internally, this restriction could simplify the code.
[†]Note that Torch does not have a graph compiler or task scheduler, which means that the user can interrupt the computations with calls to cuda() or double() in an imperative style whenever they like. It is more difficult to imagine what this kind of explicit CPU-GPU transfer would look like in the case of a full computation graph approach.
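A rough sketch of the explicit style, in Python with a hypothetical tensor class (Torch's actual cuda() and double() methods live in Lua, and a real implementation would copy data between devices)::

    import numpy as np

    class Tensor:
        """Hypothetical tensor that records which device its data lives on."""
        def __init__(self, data, device='cpu'):
            self.data, self.device = np.asarray(data), device

        def cuda(self):
            return Tensor(self.data, device='gpu')   # a real library would copy to the GPU

        def cpu(self):
            return Tensor(self.data, device='cpu')

        def __add__(self, other):
            # Operators refuse mixed-device inputs instead of transferring silently.
            if self.device != other.device:
                raise TypeError('both tensors must live on the same device')
            return Tensor(self.data + other.data, self.device)

    a = Tensor([1.0, 2.0]).cuda()
    b = Tensor([3.0, 4.0]).cuda()   # the user transfers explicitly...
    c = a + b                       # ...so every transfer is visible in the script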

Differentiation

Symbolic: Theano, CGT; Automatic: Torch, MXNet

Symbolic and automatic differentiation are often confused or used interchangeably, although their implementations are significantly different.

Reverse automatic differentiation is a generalization of backpropagation [4]. When a gradient needs to be calculated, the input is forward propagated through the computation graph, with each node remembering its input. The graph is then traversed in reverse, calling the "backward" method of each node with the gradient (with respect to the node's outputs) as its argument.

[4]Automatic differentiation in machine learning: a survey
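A minimal sketch of this tape-style scheme for a single multiplication node (hypothetical, ignoring broadcasting and everything else)::

    class Mul:
        """Each node remembers its inputs during the forward pass."""
        def forward(self, x, y):
            self.x, self.y = x, y
            return x * y

        def backward(self, grad_output):
            # Gradients w.r.t. each input, given the gradient w.r.t. the output.
            return grad_output * self.y, grad_output * self.x

    node = Mul()
    out = node.forward(3.0, 4.0)   # forward pass: remembers 3.0 and 4.0
    grads = node.backward(1.0)     # reverse pass: (d out/dx, d out/dy) = (4.0, 3.0)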

On the other hand, symbolic differentiation relies on each operator having its gradient defined in terms of the same symbolic operators that constructed the graph. When the gradient of a parameter is requested, the nodes are traversed in reverse, each returning a series of symbolic operators that extend the computation graph. The result is an extended computation graph that defines a path from input to gradient.
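By contrast, a symbolic gradient is itself new graph, built out of the same symbolic operators (again a hypothetical mini-language)::

    class Sym:
        def __init__(self, op, args=()):
            self.op, self.args = op, args

        def __add__(self, other):
            return Sym('add', (self, other))

        def __mul__(self, other):
            return Sym('mul', (self, other))

    def grad(expr, wrt):
        """Return a *symbolic* expression for d(expr)/d(wrt), extending the graph."""
        if expr is wrt:
            return Sym('one')
        if expr.op == 'add':
            a, b = expr.args
            return grad(a, wrt) + grad(b, wrt)
        if expr.op == 'mul':
            a, b = expr.args
            return grad(a, wrt) * b + a * grad(b, wrt)   # product rule, as more graph
        return Sym('zero')

    x = Sym('var_x')
    y = x * x + x
    dy_dx = grad(y, x)   # a new symbolic graph; nothing has been evaluated yet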

Graph size
With automatic differentiation, the same graph is traversed twice (forward and backward), roughly halving the size of the graph that needs to be optimized and compiled.
Debugging
Automatic differentiation is easier to debug, since each operator that could raise an error was at one point called by the user. With symbolic differentiation, new operators might appear as part of the gradient calculation which the user may find hard to place.
Optimization
Symbolic differentiation requires thorough optimization of the graph in order to e.g. remove common subexpressions, whereas automatic differentiation requires no optimization whatsoever. However, symbolic differentiation might allow for more optimization, e.g. operators that cannot be fused in the forward pass could be fused in the backward pass.
Memory
Automatic differentiation requires each operator to remember its input (at least, a naive implementation does [5]). Symbolic differentiation with proper memory optimization routines could perhaps be more memory efficient.
[5]The Data-Flow Equations of Checkpointing in Reverse Automatic Differentiation
Higher order derivatives
Symbolic differentiation comes with the advantage that only the forward R-op needs to be implemented. With automatic differentiation, we need to implement both the forward and the backward R-op [6].
[6]torch/nn#399
Flexibility
Symbolic differentiation seems to be more flexible when considering cases in which we have multiple cost functions, want to have a symbolic representation of the gradient to manipulate further, etc. Automatic differentiation can support all these cases, but requires the framework to be extended in order to provide symbolic expressions of the gradients that can be manipulated further.

Foreign function interface

Theano: C extensions, CGT: Cython, scikit-cuda: ctypes, cuda4py: CFFI

There are a variety of methods to interface Python with C, each with pros and cons.

C extension (Python C API)
Theano's current approach. Small overhead and well-supported, but introduces quite some complexity, e.g. the user is responsible for reference counting of Python objects.
ctypes
Part of the standard library, and the approach used by e.g. scikit-cuda. Very simple, well-supported, and requires only Python code. However, it can incur significant overhead (anywhere from 50% [7] to 250% [8] compared to the C API); a minimal ctypes example follows after this list.
[7]http://www.dalkescientific.com/writings/NBN/c_extensions.html
[8]http://stackoverflow.com/a/8069179
CFFI
PyPy's preferred way of interfacing with C, inspired by LuaJIT's FFI (which is one of the main reasons Torch was written in Lua). It is quite verbose, but relatively fast and pretty simple. Used by the cuda4py library [9].
[9]cuda4py
Cython
Popular framework that compiles an extended (annotated) version of Python to C extensions, which can be imported as modules. Used by CGT [10].
[10]cycgt.pyx
Numba
Although Numba does not provide a C interface directly, its JIT compiler supports compiling Python functions with both ctypes [11] and CFFI calls [12], potentially speeding them up.
[11]Numba ctypes
[12]Numba cffi
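As a concrete example of the ctypes route (standard library only; finding the C math library this way assumes a typical Unix system)::

    import ctypes
    import ctypes.util

    # Load the C math library and declare the signature of cos().
    libm = ctypes.CDLL(ctypes.util.find_library('m'))
    libm.cos.argtypes = [ctypes.c_double]
    libm.cos.restype = ctypes.c_double

    print(libm.cos(0.0))   # 1.0 -- every call crosses the FFI boundary, hence the overhead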

Dynamic compilation / metaprogramming

✓ Theano ✓ CGT ✗ Torch ✗ MXNet

Libraries can either dynamically compile their operators (e.g. Theano), or they can simply call pre-compiled operators (e.g. Torch).

Performance
Dynamic compilation of kernels and C functions allows them to be completely specialized, skipping unnecessary input checks and using less general but more efficient implementations. Frameworks with hand-tuned kernels like cuDNN and cuBLAS reduce the need for dynamic compilation, since they are often more efficient than custom-written ones.
Compilation time
Having dynamically compiled operators introduces the need to wait for functions to compile before they can be evaluated. This can be mitigated by having non-compiled versions of the operators (e.g. some Theano operators have NumPy implementations), but note that this introduces the burden of maintaining multiple implementations of each operator and requires the implementations to behave identically.
Kernel fusing
Having dynamically compiled kernels allows e.g. element-wise operations to be fused into a single kernel. For GPU programming this can be beneficial, since launching the kernel and transferring data from global to shared memory can be a significant overhead; a sketch follows at the end of this section.
Foreign language interface
If multiple nodes in the computation graph can be compiled together, it can limit the number of function calls to be made from the host language (e.g. Python for Theano, Lua for Torch). For languages like Python, with a relatively slow FFI and significant function call overhead, this could make a measurable difference.
Robustness
Dynamically compiling operators introduces a compilation step in the pipeline, which can be a source of errors e.g. (library) paths must be set correctly, memory issues can occur during compilation, etc. Dynamically compiled operators are harder to debug, because reproducing the error might require reproducing the exact conditions under which the operator was compiled.
Shared libraries
CGT compiles shared libraries which are dynamically linked at runtime, whereas Theano compiles a Python C-API extension (requiring Python and NumPy headers). The latter can be significantly slower.
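To make the fusing idea concrete, here is a sketch that generates specialized C source for a chain of element-wise operations (string templating only; actually compiling and loading the result is omitted)::

    def fuse_elementwise(ops):
        """Generate C source for one loop that applies all ops in sequence."""
        expr = 'x[i]'
        for op in ops:                      # e.g. ['tanh', 'exp'] -> exp(tanh(x[i]))
            expr = '{}({})'.format(op, expr)
        return ('void fused(const double *x, double *out, int n) {{\n'
                '    for (int i = 0; i < n; ++i) out[i] = {};\n'
                '}}\n').format(expr)

    # One kernel and one pass over memory, instead of one launch per operator.
    print(fuse_elementwise(['tanh', 'exp']))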