A comparison of Theano with other deep learning frameworks.
Symbolic: Theano, CGT; Automatic: Torch, MXNet
Symbolic and automatic differentiation are often confused or used interchangeably, although their implementations are significantly different.
Reverse automatic differentiation is a generalization of backpropagation. When a gradient needs to be calculated, the input is forward propagated through the computation graph, with each node remembering its input. The graph is then traversed in reverse, calling the "backward" method of each node with the gradient (with respect to the node's outputs) as its argument. It is computationally efficient, but not necessarily memory-efficient, since each node has to store its inputs. See `Automatic differentiation in machine learning: a survey`_ for details.
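A minimal sketch of the idea in plain Python (the class and method names are illustrative only, not the API of any of these frameworks):

.. code-block:: python

    class Multiply(object):
        """Toy node for reverse automatic differentiation."""
        def forward(self, a, b):
            # Forward pass: the node remembers its inputs.
            self.a, self.b = a, b
            return a * b

        def backward(self, grad_output):
            # Backward pass: receives the gradient w.r.t. the output and
            # returns numeric gradients w.r.t. the inputs.
            return grad_output * self.b, grad_output * self.a

    node = Multiply()
    y = node.forward(3.0, 4.0)           # forward propagation
    grad_a, grad_b = node.backward(1.0)  # reverse traversal, seeded with dL/dy = 1
    print(y, grad_a, grad_b)             # 12.0 4.0 3.0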
On the other hand, symbolic differentiation relies on each operator having its gradient defined in terms of the same symbolic operators that constructed the graph. When the gradient of a parameter is requested, the nodes are traversed in reverse, each returning a series of symbolic operators that extend the computation graph. The result is an extended computation graph that defines a path from input to gradient.
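By contrast, a toy symbolic ``grad`` returns new symbolic nodes built from the same operators, so the graph itself is extended (again purely illustrative):

.. code-block:: python

    class Var(object):
        """Toy symbolic expression used to illustrate symbolic differentiation."""
        def __init__(self, name, parents=(), op=None):
            self.name, self.parents, self.op = name, parents, op

        def __add__(self, other):
            return Var("(%s + %s)" % (self.name, other.name), (self, other), "add")

        def __mul__(self, other):
            return Var("(%s * %s)" % (self.name, other.name), (self, other), "mul")

        def grad(self, wrt):
            # Returns a new *symbolic* expression instead of a number,
            # extending the computation graph.
            if self is wrt:
                return Var("1")
            if self.op == "add":
                a, b = self.parents
                return a.grad(wrt) + b.grad(wrt)
            if self.op == "mul":
                a, b = self.parents
                return a.grad(wrt) * b + a * b.grad(wrt)  # product rule, symbolically
            return Var("0")

    a, b = Var("a"), Var("b")
    y = a * b
    print(y.grad(a).name)  # ((1 * b) + (a * 0)) -- an extended graph, not a value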
- Graph size
- With automatic differentiation, the same graph is traversed twice (forward and backward), so the graph that needs to be optimized and compiled is roughly half the size of the symbolically extended graph.
- Debugging
- Automatic differentiation is easier to debug, since every operator that could raise an error was at some point called by the user. With symbolic differentiation, new operators can appear as part of the gradient computation that the user never wrote, making errors harder to trace back to the original code.
- Optimization
- Symbolic differentiation allows for better optimization, e.g. operators that cannot be fused in the forward pass could be fused in the backward pass.
- Higher order derivatives
- Symbolic differentiation comes with the advantage that only the forward R-op needs to be implemented. With automatic differentiation, we need to implement both the forward and the backward R-op.
- Flexibility
- Symbolic differentiation seems more flexible for cases in which there are multiple cost functions, or where a symbolic representation of the gradient is needed so it can be manipulated further, etc. (see the sketch below).
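In Theano, for instance, the symbolic gradient is itself an ordinary expression, so it can be manipulated further or differentiated again; a rough sketch:

.. code-block:: python

    import numpy as np
    import theano
    import theano.tensor as T

    x = T.vector('x')
    cost = T.sum(x ** 2)
    g = T.grad(cost, x)        # symbolic gradient: just another expression
    penalty = T.sum(g ** 2)    # manipulate it further, e.g. a gradient-norm penalty
    gg = T.grad(T.sum(g), x)   # differentiate the gradient again (higher-order derivatives)

    f = theano.function([x], [g, penalty, gg])
    print(f(np.ones(3, dtype=theano.config.floatX)))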
Theano: C extensions, CGT: Cython, scikit-cuda: ctypes, cuda4py: CFFI
There are a variety of methods to interface Python with C, each with pros and cons.
- C extension (Python C API)
- Theano's current approach. Small overhead and well-supported, but introduces considerable complexity, e.g. the user is responsible for the reference counting of Python objects.
- ctypes
- Part of the standard library, and the approach used by e.g. scikit-cuda. Very simple, well-supported, and requires only Python code. However, it can incur significant call overhead (~3x compared to the C API, as a ballpark figure); see the sketch after this list.
- CFFI
- PyPy's preferred way of interfacing with C, inspired by LuaJIT's FFI (which is one of the main reasons Torch was written in Lua). It is quite verbose, but relatively fast and pretty simple.
- Cython
- Popular framework that compiles an extended (annotated) version of Python to C extensions which can be imported as modules. Used by CGT.
- Numba
- Although Numba does not provide a C-interface directly, its JIT compiler supports compiling Python functions with both ctypes and CFFI calls, potentially speeding them up.
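As a point of comparison, this is roughly what calling ``cos`` from the system math library looks like with ctypes and with CFFI in ABI mode (library loading is platform-dependent; the ``dlopen(None)`` shortcut below works on most POSIX systems but not on Windows):

.. code-block:: python

    import ctypes
    import ctypes.util
    from cffi import FFI

    # ctypes: load the shared library, then declare the signature by hand.
    libm = ctypes.CDLL(ctypes.util.find_library("m"))
    libm.cos.restype = ctypes.c_double
    libm.cos.argtypes = [ctypes.c_double]
    print(libm.cos(0.0))  # 1.0

    # CFFI (ABI mode): declare the C prototype, then load the symbols.
    ffi = FFI()
    ffi.cdef("double cos(double x);")
    libc = ffi.dlopen(None)  # symbols visible in the current process (POSIX)
    print(libc.cos(0.0))     # 1.0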
Dynamic compilation: ✓ Theano, ✓ CGT; ✗ Torch, ✗ MXNet
Libraries can either dynamically compile their operators (e.g. Theano), or they can simply call pre-compiled operators (e.g. Torch).
- Performance
- Dynamic compilation of kernels and C functions allows them to be completely specialized: unnecessary input checks can be skipped and less general but more efficient implementations can be used. Libraries of hand-tuned kernels such as cuDNN and cuBLAS reduce the need for dynamic compilation, since they are often more efficient than custom-written kernels.
- Compilation time
- Dynamically compiled operators introduce the need to wait for functions to compile before they can be evaluated. This can be mitigated by having non-compiled versions of the operators (e.g. some Theano operators have NumPy implementations), but note that this introduces the burden of maintaining multiple implementations of each operator and requires them to behave identically.
- Kernel fusing
- Dynamically compiled kernels allow e.g. element-wise operations to be fused into a single kernel (see the sketch after this list). For GPU programming this can be beneficial, since launching kernels and transferring data from global to shared memory can be a significant overhead.
- Foreign language interface
- If multiple nodes in the computation graph can be compiled together, this limits the number of function calls that have to be made from the host language (e.g. Python for Theano, Lua for Torch). For languages like Python, with a relatively slow FFI and significant function call overhead, this can make a measurable difference.
- Robustness
- Dynamically compiling operators introduces a compilation step into the pipeline, which can be a source of errors, e.g. (library) paths must be set correctly, memory issues can occur during compilation, etc. Dynamically compiled operators are also harder to debug, because reproducing an error might require reproducing the exact conditions under which the operator was compiled.
- Shared libraries
- CGT compiles shared libraries which are dynamically linked at runtime, whereas Theano compiles a Python C-API extension (requiring the Python and NumPy headers). The latter can be significantly slower.
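To illustrate the fusion point above: Theano's optimizer merges chains of element-wise operations into a single composite Elemwise node that is compiled as one C function (or one GPU kernel); a rough sketch that inspects the compiled graph:

.. code-block:: python

    import theano
    import theano.tensor as T

    x = T.vector('x')
    y = T.exp(x) * 2 + 1   # three element-wise operations
    f = theano.function([x], y)

    # The optimized graph typically contains a single fused node of the form
    # Elemwise{Composite{...}} instead of separate Exp, Mul and Add nodes.
    theano.printing.debugprint(f)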