A comparison of deep learning frameworks

A comparison of Theano with other deep learning frameworks.

Computation graph/IR

Most deep learning libraries rely on the ability to construct a computation graph, which can be considered the intermediate representation (IR) of our program. The lazy construction of a graph allows for optimization (Theano, CGT), scheduling (MXNet), and/or automatic differentiation (Torch, MXNet). The question is what information should be contained in this graph.
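
As a rough, hypothetical sketch (not any particular framework's API), lazy graph construction might look like the following: building an expression only records nodes and their inputs, so the resulting graph can be optimized, scheduled, or differentiated before anything is evaluated.

    # Hypothetical classes for lazy graph construction; no framework's real API.
    class Node:
        def __init__(self, inputs=()):
            self.inputs = list(inputs)
        def __add__(self, other):
            return Add(self, other)

    class Variable(Node):
        def __init__(self, name):
            super().__init__()
            self.name = name

    class Constant(Node):
        def __init__(self, value):
            super().__init__()
            self.value = value

    class Add(Node):
        def __init__(self, x, y):
            super().__init__((x, y))

    x = Variable('x')
    y = x + Constant(1)     # no computation happens here, only graph construction
    assert isinstance(y, Add) and y.inputs[0] is x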

Shapes
The shape of a tensor can be defined when constructing the graph. Requiring this reduces flexibility significantly (for example, minibatches can have different shapes). Alternatively, shape information can be left out of the graph entirely, which is currently the case in Theano (except for some specialized operators like convolutions). The middle ground would be to make the specification of shape information optional. This can be useful for optimization purposes: if some tensors are determined to be of fixed size, they can already be allocated at compilation time (trading memory efficiency for performance) [1].
[1] MXNet documentation <https://mxnet.readthedocs.org/en/latest/developer-guide/operator.html#operator-property>
Memory allocation
The allocation of memory can be made part of the IR. That is, we can represent y = x + 1 as Var(x), Constant(1) -> [Add] -> y, or as const_storage = Allocate((1,)); y_storage = Allocate(x.shape); const_storage, x_storage, y_storage -> [Add] -> y_storage (see the sketch after this list). It is unclear what the advantages of either approach are.
Loops and branches
Loops and branches can be supported in different ways. Theano's approach seems to be to enforce acyclicity of the computation graph by treating each loop as a single node (the scan node), which itself contains another computation DAG. Branches are handled by the task scheduler (confusingly called VMs in Theano). Most other frameworks don't have any support for loops and branches. A more principled way of dealing with loops would be to allow cycles in the computation graph. In order to maintain strict static single assignment (SSA) form [2], this requires the implementation of Phi functions (or something similar [3]).
[2] Wikipedia: Static single assignment
[3] WebKit code
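
As a concrete (and purely hypothetical) sketch of the second IR style from the memory allocation item above, allocation is itself an operation in the graph, and Add writes into storage that was allocated explicitly:

    import numpy as np

    # Illustrative classes only; not an existing framework's API.
    class Allocate:
        def __init__(self, shape):
            self.shape = shape
        def run(self):
            return np.empty(self.shape)

    class Add:
        """Adds two buffers element-wise, writing into a pre-allocated output."""
        def run(self, a, b, out):
            np.add(a, b, out=out)
            return out

    # y = x + 1, with all storage explicit in the IR:
    x_storage = np.array([2.0, 3.0])
    const_storage = Allocate((1,)).run()
    const_storage[:] = 1.0
    y_storage = Allocate(x_storage.shape).run()
    Add().run(const_storage, x_storage, y_storage)   # broadcasts the constant
    print(y_storage)                                 # [3. 4.]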

Differentiation

Symbolic: Theano, CGT; Automatic: Torch, MXNet

Symbolic and automatic differentiation are often confused or used interchangeably, although their implementations are significantly different.

Reverse automatic differentiation is a generalization of backpropagation [4]. When a gradient needs to be calculated, the input is forward-propagated through the computation graph, with each node remembering its input. The graph is then traversed in reverse, calling the "backward" method of each node with the gradient (with respect to the node's outputs) as its argument.

[4] Automatic differentiation in machine learning: a survey
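
A minimal, self-contained sketch of this tape-based scheme (the class and method names are illustrative, not those of Torch or MXNet):

    import math

    class Mul:
        def forward(self, x, y):
            self.x, self.y = x, y      # remember inputs for the backward pass
            return x * y
        def backward(self, grad_out):
            return grad_out * self.y, grad_out * self.x

    class Exp:
        def forward(self, x):
            self.out = math.exp(x)
            return self.out
        def backward(self, grad_out):
            return (grad_out * self.out,)

    # Forward pass for f(a, b) = exp(a * b), recording the "tape" as we go.
    mul, exp = Mul(), Exp()
    a, b = 2.0, 3.0
    z = exp.forward(mul.forward(a, b))

    # Backward pass: traverse the tape in reverse, propagating gradients.
    (dz,) = exp.backward(1.0)
    da, db = mul.backward(dz)
    print(da, db)   # df/da = b * exp(a * b), df/db = a * exp(a * b)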

On the other hand, symbolic differentiation relies on each operator having its gradient defined in terms of the same symbolic operators that constructed the graph. When the gradient of a parameter is requested, the nodes are traversed in reverse, each returning a series of symbolic operators that extend the computation graph. The result is an extended computation graph that defines a path from input to gradient.
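
A minimal sketch of this graph-extending approach (again with hypothetical classes): asking for a gradient returns new symbolic nodes built from the same operators, rather than numbers.

    class Var:
        def __init__(self, name): self.name = name
        def __repr__(self): return self.name

    class Mul:
        def __init__(self, x, y): self.x, self.y = x, y
        def __repr__(self): return f"({self.x} * {self.y})"
        def grad(self, wrt, grad_out):
            # d(x*y)/dx = y and d(x*y)/dy = x, returned as new graph nodes
            if wrt is self.x:
                return Mul(grad_out, self.y)
            if wrt is self.y:
                return Mul(grad_out, self.x)

    x, y = Var('x'), Var('y')
    z = Mul(x, y)
    dz_dx = z.grad(x, Var('1'))   # a new symbolic expression: (1 * y)
    print(dz_dx)                  # the extended graph now also defines the gradient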

Graph size
With automatic differentiation the same graph is traversed twice (once forward, once backward), so the graph that needs to be optimized and compiled is roughly half the size of the extended graph produced by symbolic differentiation.
Debugging
Automatic differentiation is easier to debug, since each operator that could raise an error was at some point called by the user. With symbolic differentiation, new operators may appear as part of the gradient computation, and these can be hard for the user to place.
Optimization
Symbolic differentiation requires thorough optimization of the graph in order to e.g. remove common subexpressions, whereas automatic differentiation requires no optimization whatsoever. However, symbolic differentiation might allow for more optimization, e.g. operators that cannot be fused in the forward pass could be fused in the backward pass.
Memory
Automatic differentiation requires each operator to remember its input (at least, a naive implementation does [5]). Symbolic differentiation with proper memory optimization routines could perhaps be more memory efficient.
[5] The Data-Flow Equations of Checkpointing in Reverse Automatic Differentiation
Higher order derivatives
Symbolic differentiation comes with the advantage that only the forward R-op needs to be implemented. With automatic differentiation, we need to implement both the forward and the backward R-op [6].
[6] torch/nn#399
Flexibility
Symbolic differentiation seems more flexible in cases where we have multiple cost functions, want a symbolic representation of the gradient to manipulate further, etc. Automatic differentiation can support all of these cases, but requires the framework to be extended in order to provide symbolic expressions of the gradients for further manipulation (see the Theano sketch below).
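
For example, in Theano the gradient is itself a symbolic expression, so it can be inspected, manipulated, or differentiated again. A small sketch using Theano's public API (exact output formatting may differ):

    import theano
    import theano.tensor as T

    x = T.scalar('x')
    cost = x ** 3
    g = theano.grad(cost, x)          # g is itself a symbolic graph (3 * x**2)
    gg = theano.grad(g, x)            # ... so it can be differentiated again (6 * x)
    f = theano.function([x], [g, gg])
    print(f(2.0))                     # roughly [12.0, 12.0]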

Foreign function interface

Theano: C extensions, CGT: Cython, scikit-cuda: ctypes, cuda4py: CFFI

There are a variety of methods to interface Python with C, each with pros and cons.

C extension (Python C API)
Theano's current approach. Small overhead and well supported, but it introduces quite some complexity: for example, the user is responsible for reference counting of Python objects.
ctypes
Part of the standard library, and the approach used by e.g. scikit-cuda. Very simple, well supported, and requires only Python code (a small example follows this list). However, it can incur significant overhead (anywhere from 50% [7] to 250% [8] compared to the C API).
[7] http://www.dalkescientific.com/writings/NBN/c_extensions.html
[8] http://stackoverflow.com/a/8069179
CFFI
PyPy's preferred way of interfacing with C, inspired by LuaJIT's FFI (which is one of the main reasons Torch was written in Lua). It is quite verbose, but relatively fast and fairly simple. Used by the cuda4py library [9].
[9] cuda4py
Cython
A popular framework that compiles an extended (annotated) version of Python to C extensions which can be imported as modules. Used by CGT [10].
[10] cycgt.pyx
Numba
Although Numba does not provide a C interface directly, its JIT compiler supports compiling Python functions containing both ctypes [11] and CFFI calls [12], potentially speeding them up.
[11] Numba ctypes
[12] Numba cffi
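
A small ctypes example for reference, assuming a Unix-like system where the C math library can be located; note that the C signature has to be declared by hand:

    import ctypes
    import ctypes.util

    # Load the system C math library and call cos() from pure Python.
    libm = ctypes.CDLL(ctypes.util.find_library('m'))
    libm.cos.argtypes = [ctypes.c_double]
    libm.cos.restype = ctypes.c_double

    print(libm.cos(0.0))   # 1.0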

Dynamic compilation / metaprogramming

✓ Theano ✓ CGT ✗ Torch ✗ MXNet

Libraries can either dynamically compile their operators (e.g. Theano), or they can simply call pre-compiled operators (e.g. Torch).

Performance
Dynamic compilation of kernels and C functions allows them to be completely specialized, skipping unnecessary input checks and using less general but more efficient implementations. Frameworks with hand-tuned kernels like cuDNN and cuBLAS reduce the need for dynamic compilation, since these are often more efficient than custom-written ones.
Compilation time
Dynamically compiled operators introduce the need to wait for functions to compile before they can be evaluated. This can be mitigated by having non-compiled versions of the operators (e.g. some Theano operators have NumPy implementations), but note that this introduces the burden of maintaining multiple implementations of each operator, and requires the implementations to behave identically.
Kernel fusing
Dynamically compiled kernels allow e.g. element-wise operations to be fused into a single kernel (see the sketch after this list). For GPU programming this can be beneficial, since launching a kernel and transferring data from global to shared memory can be a significant overhead.
Foreign language interface
If multiple nodes in the computation graph can be compiled together, this limits the number of function calls made from the host language (e.g. Python for Theano, Lua for Torch). For languages like Python, with a relatively slow FFI and significant function call overhead, this can make a measurable difference.
Robustness
Dynamically compiling operators introduces a compilation step in the pipeline, which can be a source of errors e.g. (library) paths must be set correctly, memory issues can occur during compilation, etc. Dynamically compiled operators are harder to debug, because reproducing the error might require reproducing the exact conditions under which the operator was compiled.
Shared libraries
CGT compiles shared libraries which are dynamically linked at runtime, whereas Theano compiles a Python C-API extension (requiring Python and NumPy headers). The latter can be significantly slower.
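
To illustrate kernel fusing via dynamic code generation (a toy sketch, not Theano's or CGT's actual code generator), several element-wise operations can be emitted as a single C loop instead of separate passes over the data:

    # Generate specialized C source for a chain of element-wise operations.
    # Nothing is compiled here; the string would be handed to a C compiler
    # or NVRTC in a real system.
    def fuse_elementwise(statements):
        body = '\n        '.join(statements)
        return f"""
    void fused_kernel(const float *x, float *out, int n) {{
        for (int i = 0; i < n; ++i) {{
            float tmp = x[i];
            {body}
            out[i] = tmp;
        }}
    }}"""

    # exp(tanh(x) + 1), fused into one loop / one kernel launch.
    source = fuse_elementwise([
        'tmp = tanhf(tmp);',
        'tmp = tmp + 1.0f;',
        'tmp = expf(tmp);',
    ])
    print(source)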

Comments

@botev commented on Jun 9, 2016:

Hi, since you clearly have a great understanding of and interest in this topic, I was wondering: would you be interested in implementing these ideas in a framework and taking them further?

@yuanbyu commented on Dec 16, 2016:

I'd like to draw your attention to TensorFlow's symbolic differentiation, which supports nested loops and conditionals.

https://github.com/tensorflow/tensorflow
