Welcome
Theano is a Python library that allows you to define, optimize, and evaluate mathematical expressions involving multidimensional arrays efficiently. Theano features:
 tight integration with NumPy – Use numpy.ndarray in Theano-compiled functions.
 transparent use of a GPU – Perform data-intensive computations much faster than on a CPU.
 efficient symbolic differentiation – Theano does your derivatives for functions with one or many inputs.
 speed and stability optimizations – Get the right answer for log(1+x) even when x is really tiny.
 dynamic C code generation – Evaluate expressions faster.
 extensive unit-testing and self-verification – Detect and diagnose many types of errors.
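The log(1+x) point is easy to see in plain Python: in double precision, 1 + x rounds to exactly 1.0 for tiny x, so a naive evaluation returns 0. A numerically stable formulation (sketched here with the standard library's math.log1p; Theano applies an equivalent rewrite to the symbolic graph) recovers the right answer.

```python
import math

x = 1e-20
naive = math.log(1 + x)   # 1 + 1e-20 rounds to exactly 1.0, so this is 0.0
stable = math.log1p(x)    # computes log(1+x) without forming 1+x first

print(naive, stable)
```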
Theano has been powering large-scale, computationally intensive scientific investigations since 2007. But it is also approachable enough to be used in the classroom (University of Montreal’s deep learning/machine learning classes).
News
 2017/08/09: Release of Theano 0.10.0beta1, with many improvements and bug fixes; a release candidate is coming.
 Removed support for the old (device=gpu) backend. Use the new backend (device=cuda) for GPU computing. See Converting to the new gpu back end (gpuarray) for help with conversion.
 2017/03/20: Release of Theano 0.9.0. Everybody is encouraged to update.
 2017/03/13: Release of Theano 0.9.0rc4, with crash fixes and bug fixes.
 2017/03/06: Release of Theano 0.9.0rc3, with crash fixes, bug fixes and improvements.
 2017/02/27: Release of Theano 0.9.0rc2, with crash fixes, bug fixes and improvements.
 2017/02/20: Release of Theano 0.9.0rc1, with many improvements and bug fixes; the final release is coming.
 2017/01/24: Release of Theano 0.9.0beta1, with many improvements and bug fixes; a release candidate is coming.
 2016/05/09: New technical report on Theano: Theano: A Python framework for fast computation of mathematical expressions. This is the new preferred reference.
 2016/04/21: Release of Theano 0.8.2, adding support for CuDNN v5.
 2016/03/29: Release of Theano 0.8.1, fixing a compilation issue on MacOS X with XCode 7.3.
 2016/03/21: Release of Theano 0.8. Everybody is encouraged to update.
 Multi-GPU.
 We added support for CNMeM to speed up the GPU memory allocation.
 Theano 0.7 was released 26th March 2015. Everybody is encouraged to update.
 We support cuDNN if it is installed by the user.
 Open Machine Learning Workshop 2014 presentation.
 Colin Raffel tutorial on Theano.
 Ian Goodfellow did a 12h class with exercises on Theano.
 New technical report on Theano: Theano: new features and speed improvements.
 HPCS 2011 Tutorial. We included a few fixes discovered while doing the Tutorial.
You can watch a quick (20 minute) introduction to Theano given as a talk at SciPy 2010 via streaming (or downloaded) video:
Transparent GPU Computing With Theano. James Bergstra, SciPy 2010, June 30, 2010.
Download
Theano is now available on PyPI, and can be installed via easy_install Theano, pip install Theano, or by downloading and unpacking the tarball and typing python setup.py install.
Those interested in bleeding-edge features should obtain the latest development version, available via:
git clone git://github.com/Theano/Theano.git
You can then place the checkout directory on your $PYTHONPATH or use python setup.py develop to install a .pth file into your site-packages directory, so that when you pull updates via Git, they will be automatically reflected in the “installed” version. For more information about installation and configuration, see installing Theano.
Citing Theano
If you use Theano for academic research, you are highly encouraged (though not required) to cite the following, most recent paper:
 Theano Development Team. “Theano: A Python framework for fast computation of mathematical expressions”. (short BibTeX, full BibTeX)
Theano is primarily developed by academics, and so citations matter a lot to us. As an added benefit, you increase Theano’s exposure and potential user (and developer) base, which is to the benefit of all users of Theano. Thanks in advance!
See our Theano Citation Policy for details.
Documentation
Roughly in order of what you’ll want to check out:
 Installing Theano – How to install Theano.
 Theano at a Glance – What is Theano?
 Tutorial – Learn the basics.
 Troubleshooting – Tips and tricks for common debugging.
 API Documentation – Theano’s functionality, module by module.
 Frequently Asked Questions – A set of commonly asked questions.
 Optimizations – Guide to Theano’s graph optimizations.
 Extending Theano – Learn to add a Type, Op, or graph optimization.
 Developer Start Guide – How to contribute code to Theano.
 Internal Documentation – How to maintain Theano and more...
 Release – How our release should work.
 Acknowledgements – What we took from other projects.
 Related Projects – links to other projects that implement new functionality on top of Theano
You can download the latest PDF documentation, rather than reading it online.
Check out how Theano can be used for Machine Learning: Deep Learning Tutorials.
Theano was featured at SciPy 2010.
Community
“Thank YOU for correcting it so quickly. I wish all packages I worked with would have such an active maintenance – this is as good as it gets :)”
(theano-users, Aug 2, 2010)
 Register to theano-announce if you want to be kept informed on important changes on Theano (low volume).
 Register and post to theano-users if you want to talk to all Theano users.
 Register and post to theano-dev if you want to talk to the developers.
 Register to theano-github if you want to receive an email for all changes to the GitHub repository.
 Register to theano-buildbot if you want to receive our daily buildbot email.
 Ask/view questions/answers at StackOverflow
 We use GitHub tickets to keep track of issues (however, some old tickets can still be found on Assembla).
 Come visit us in Montreal! Most developers are students in the LISA group at the University of Montreal.
Help!
How to Seek Help
The appropriate venue for seeking help depends on the kind of question you have.
 How do I? – theano-users mailing list or StackOverflow
 I got this error, why? – theano-users mailing list or StackOverflow (please include the full error message, even if it’s long)
 I got this error and I’m sure it’s a bug – GitHub ticket
 I have an idea/request – post the suggestion to theano-dev or, even better, implement the idea and submit a GitHub pull request!
 Why do you? – theano-users mailing list (not appropriate for StackOverflow)
 When will you? – theano-dev mailing list (not appropriate for StackOverflow)
Please do take some time to search for similar questions that were asked and answered in the past. If you find something similar that doesn’t fully answer your question, it can be helpful to say something like “I found X but it doesn’t address facet Y” and link to the previous discussion.
When asking questions on StackOverflow, please use the theano tag, so your question can be found, and follow StackOverflow’s guidance on asking questions. Consider also using the python and numpy tags, especially if you are unsure which library your problem relates to.
It’s often helpful to include the following details with your question:
 If you have an error, the full error message, even if it’s long
 Which versions of Python and Theano you’re using
 Whether you’re using a CPU or GPU device
 Details of your Theano configuration settings (you can print this in Python via print(theano.config))
Spending the time to create a minimal, specific example of a problem is likely to get you an answer quicker than posting something hastily that has too much irrelevant detail or is too vague. A minimal example may take you a bit more time to create, but the first response is more likely to be the answer you need, rather than a frustrated request for clarification.
How to provide help
If you see a question on the theano-users mailing list, or on StackOverflow, that you feel reasonably confident you know the answer to, please do support the community by helping others.
We were all newbies to Theano once and, as the community expands, there is a constant stream of new Theano users looking for help. Perhaps you asked a question when you were first starting out? Now you can pay it forward by helping others. It’s also a good way to reinforce your own Theano knowledge.
Often it’s easiest to answer a question directly but sometimes it may be better to refer people to a good answer that was provided in the past. Pointing people to relevant sections in the documentation, or to a Theano tutorial, can also be helpful.
When answering questions please be nice (as always!) and, on StackOverflow, follow their guidance for answering questions.
Release Notes
Theano 0.10.0beta1 (9th of August, 2017)
This release contains a lot of bug fixes, improvements and new features to prepare the upcoming release candidate.
We recommend that every developer updates to this version.
 Highlights:
 Moved Python 3.* minimum supported version from 3.3 to 3.4
 Replaced deprecated package nose-parameterized with up-to-date package parameterized for Theano requirements
 Theano now internally uses sha256 instead of md5 to work on systems that forbid md5 for security reasons
 Removed old GPU backend theano.sandbox.cuda. New backend theano.gpuarray is now the official GPU backend
 Support more debuggers for PdbBreakpoint
 Scan improvements
 Speed up Theano scan compilation and gradient computation
 Added meaningful message when missing inputs to scan
 Speed up graph toposort algorithm
 Faster C compilation by massively using a new interface for op params
 Faster optimization step
 Documentation updated and more complete
 Many bug fixes, crash fixes and warning improvements
A total of 65 people contributed to this release since 0.9.0, see list below.
 Interface changes:
 Merged duplicated diagonal functions into two ops: ExtractDiag (extract a diagonal to a vector) and AllocDiag (set a vector as a diagonal of an empty array)
 Renamed MultinomialWOReplacementFromUniform to ChoiceFromUniform
 Removed or deprecated Theano flags: cublas.lib, cuda.enabled, enable_initial_driver_test, gpuarray.sync, home, lib.cnmem, nvcc.* flags, pycuda.init
 Changed grad() method to L_op() in ops that need the outputs to compute the gradient
 Convolution updates:
 Extended Theano flag dnn.enabled with new option no_check to help speed up cuDNN importation
 Implemented separable convolutions
 Implemented grouped convolutions
 GPU:
 Prevent GPU initialization when not required
 Added disk caching option for kernels
 Added method my_theano_function.sync_shared() to help synchronize GPU Theano functions
 Added useful stats for GPU in profile mode
 Added Cholesky op based on cusolver backend
 Added GPU ops based on magma library: SVD, matrix inverse, QR, Cholesky and eigh
 Added GpuCublasTriangularSolve
 Added atomic addition and exchange for long long values in GpuAdvancedIncSubtensor1_dev20
 Support log gamma function for all non-complex types
 Support GPU SoftMax in both OpenCL and CUDA
 Support offset parameter k for GpuEye
 CrossentropyCategorical1Hot and its gradient are now lifted to GPU
 Better cuDNN support
 Official support for v5.* and v6.*
 Better support and loading on Windows and Mac
 Support cuDNN v6 dilated convolutions
 Support cuDNN v6 reductions
 Added new Theano flags cuda.include_path, dnn.base_path and dnn.bin_path to help configure Theano when CUDA and cuDNN cannot be found automatically.
 Updated float16 support
 Added documentation for GPU float16 ops
 Support float16 for GpuGemmBatch
 Started to use float32 precision for computations that don’t support float16 on GPU
 New features:
 Added a wrapper for Baidu’s CTC cost and gradient functions
 Added scalar and elemwise CPU ops for modified Bessel functions of order 0 and 1 from scipy.special.
 Added Scaled Exponential Linear Unit (SELU) activation
 Added sigmoid_binary_crossentropy function
 Added trigamma function
 Added modes half and full for Images2Neibs ops
 Implemented gradient for AbstractBatchNormTrainGrad
 Implemented gradient for matrix pseudo-inverse op
 Added new prop replace for ChoiceFromUniform op
 Added new prop on_error for CPU Cholesky op
 Added new Theano flag deterministic to help control how Theano optimizes certain ops that have deterministic versions. Currently used for subtensor ops only.
 Added new Theano flag cycle_detection to speed up the optimization step by reducing time spent in in-place optimizations
 Added new Theano flag check_stack_trace to help check the stack trace during the optimization process
 Added new Theano flag cmodule.debug to allow a debug mode for Theano C code. Currently used for cuDNN convolutions only.
 Others:
 Added deprecation warning for the softmax and logsoftmax vector case
 Added a warning to announce that a C++ compiler will become mandatory in the next Theano release, 0.11
 Other more detailed changes:
 Removed useless warning when profile is manually disabled
 Added tests for abstract conv
 Added options for disconnected_outputs to Rop
 Removed theano/compat/six.py
 Removed COp.get_op_params()
 Support list of strings for Op.c_support_code(), to help avoid duplicating support code
 Macro names provided for array properties are now standardized in both CPU and GPU C code
 Started to move C code files into a separate folder c_code in every Theano module
in every Theano module  Many improvements for Travis CI tests (with better splitting for faster testing)
 Many improvements for Jenkins CI tests: daily testings on Mac and Windows in addition to Linux
 Committers since 0.9.0:
 Frederic Bastien
 Arnaud Bergeron
 amrithasuresh
 João Victor Tozatti Risso
 Steven Bocco
 Pascal Lamblin
 Mohammed Affan
 Reyhane Askari
 Alexander Matyasko
 Simon Lefrancois
 Shawn Tan
 Thomas George
 Faruk Ahmed
 Zhouhan LIN
 Aleksandar Botev
 jhelie
 xiaoqie
 Tegan Maharaj
 Matt Graham
 Cesar Laurent
 Gabe Schwartz
 Juan Camilo Gamboa Higuera
 AndroidCloud
 Saizheng Zhang
 vipulraheja
 Florian Bordes
 Sina Honari
 Vikram
 erakra
 Chiheb Trabelsi
 Shubh Vachher
 Daren Eiri
 Gijs van Tulder
 Laurent Dinh
 Mohamed Ishmael Diwan Belghazi
 mila
 Jeff Donahue
 Ramana Subramanyam
 Bogdan Budescu
 Ghislain Antony Vaillant
 Jan Schlüter
 Xavier Bouthillier
 fo40225
 Aarni Koskela
 Adam Becker
 Adam Geitgey
 Adrian Keet
 Adrian Seyboldt
 Andrei Costinescu
 Anmol Sahoo
 Chong Wu
 Holger Kohr
 Jayanth Koushik
 Jenkins
 Lilian Besson
 Lv Tao
 Michael Manukyan
 Murugesh Marvel
 NALEPA
 Ubuntu
 Zotov Yuriy
 dareneiri
 lrast
 morrme
 yikang
Theano at a Glance
Theano is a Python library that lets you define, optimize, and evaluate mathematical expressions, especially ones with multidimensional arrays (numpy.ndarray). Using Theano it is possible to attain speeds rivaling handcrafted C implementations for problems involving large amounts of data. It can also surpass C on a CPU by many orders of magnitude by taking advantage of recent GPUs.
Theano combines aspects of a computer algebra system (CAS) with aspects of an optimizing compiler. It can also generate customized C code for many mathematical operations. This combination of CAS with optimizing compilation is particularly useful for tasks in which complicated mathematical expressions are evaluated repeatedly and evaluation speed is critical. For situations where many different expressions are each evaluated once Theano can minimize the amount of compilation/analysis overhead, but still provide symbolic features such as automatic differentiation.
Theano’s compiler applies many optimizations of varying complexity to these symbolic expressions. These optimizations include, but are not limited to:
 use of GPU for computations
 constant folding
 merging of similar subgraphs, to avoid redundant calculation
 arithmetic simplification (e.g. x*y/x -> y, --x -> x)
 inserting efficient BLAS operations (e.g. GEMM) in a variety of contexts
 using memory aliasing to avoid calculation
 using in-place operations wherever it does not interfere with aliasing
 loop fusion for elementwise sub-expressions
 improvements to numerical stability (e.g. computing log(1+x) accurately for tiny x)
 for a complete list, see Optimizations
Theano was written at the LISA lab to support rapid development of efficient machine learning algorithms. Theano is named after the Greek mathematician, who may have been Pythagoras’ wife. Theano is released under a BSD license (link).
Sneak peek
Here is an example of how to use Theano. It doesn’t show off many of Theano’s features, but it illustrates concretely what Theano is.
import theano
from theano import tensor
# declare two symbolic floating-point scalars
a = tensor.dscalar()
b = tensor.dscalar()
# create a simple expression
c = a + b
# convert the expression into a callable object that takes (a,b)
# values as input and computes a value for c
f = theano.function([a,b], c)
# bind 1.5 to 'a', 2.5 to 'b', and evaluate 'c'
assert 4.0 == f(1.5, 2.5)
Theano is not a programming language in the normal sense because you write a program in Python that builds expressions for Theano. Still it is like a programming language in the sense that you have to
 declare variables (a, b) and give their types
 build expressions for how to put those variables together
 compile expression graphs to functions in order to use them for computation.
It is good to think of theano.function as the interface to a compiler which builds a callable object from a purely symbolic graph. One of Theano’s most important features is that theano.function can optimize a graph and even compile some or all of it into native machine instructions.
What does it do that they don’t?
Theano is a Python library and optimizing compiler for manipulating and evaluating expressions, especially matrix-valued ones. Manipulation of matrices is typically done using the numpy package, so what does Theano do that Python and numpy do not?
 execution speed optimizations: Theano can use g++ or nvcc to compile parts of your expression graph into CPU or GPU instructions, which run much faster than pure Python.
 symbolic differentiation: Theano can automatically build symbolic graphs for computing gradients.
 stability optimizations: Theano can recognize [some] numerically unstable expressions and compute them with more stable algorithms.
The closest Python package to Theano is sympy. Theano focuses more on tensor expressions than Sympy, and has more machinery for compilation. Sympy has more sophisticated algebra rules and can handle a wider variety of mathematical operations (such as series, limits, and integrals).
If numpy is to be compared to MATLAB and sympy to Mathematica, Theano is a sort of hybrid of the two which tries to combine the best of both worlds.
Getting started
 Installing Theano
 Instructions to download and install Theano on your system.
 Tutorial
 Getting started with Theano’s basic features. Go here if you are new!
 API Documentation
 Details of what Theano provides. It is recommended to go through the Tutorial first though.
A PDF version of the online documentation may be found here.
Theano Vision
This is the vision we have for Theano. This is to give people an idea of what to expect in the future of Theano, but we can’t promise to implement all of it. This should also help you understand where Theano fits in relation to other computational tools.
 Support tensor and sparse operations
 Support linear algebra operations
 Graph Transformations
 Differentiation/higher order differentiation
 ‘R’ and ‘L’ differential operators
 Speed/memory optimizations
 Numerical stability optimizations
 Can use many compiled languages and instruction sets: C/C++, CUDA, OpenCL, PTX, CAL, AVX, ...
 Lazy evaluation
 Loop
 Parallel execution (SIMD, multi-core, multi-node on cluster, multi-node distributed)
 Support all NumPy/basic SciPy functionality
 Easy wrapping of library functions in Theano
Note: There is no short-term plan to support multi-node computation.
Theano Vision State
Here is the state of that vision as of August 9th, 2017 (after Theano 0.10.0beta1):
 We support tensors using the numpy.ndarray object and we support many operations on them.
 We support sparse types by using the scipy.{csc,csr,bsr}_matrix object and support some operations on them.
 We have implemented/wrapped more advanced linear algebra operations. Still more are possible.
 We have basic support for the creation of new operations from graphs at runtime. It supports gradient overload well for every input, as well as inlining at the start of compilation. The case where the operation is not inlined is not as well covered.
 We have many graph transformations that cover the 4 categories listed above.
 We can improve the graph transformation with better storage optimization
and instruction selection.
 Similar to auto-tuning during the optimization phase, but not limited to a single op.
 Example of use: Determine if we should move computation to the GPU or not depending on the input size.
 We support Python 2 and Python 3.
 We have a new CUDA backend for tensors, with support for many dtypes.
 Loops work, but not all related optimizations are currently done.
 The cvm linker allows lazy evaluation. It is the current default linker.
 How to have DebugMode check it? Right now, DebugMode checks the computation nonlazily.
 SIMD parallelism on the CPU comes from the compiler.
 Multi-core parallelism support is limited. If the external BLAS implementation supports it, many dot operations are parallelized via gemm, gemv and ger. Also, element-wise operations are supported. See Multi cores support in Theano.
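The thread count of an OpenMP-enabled BLAS is typically controlled through the standard OMP_NUM_THREADS environment variable; whether it has any effect depends on how your BLAS was built. A minimal sketch (the inline python -c just echoes the setting back; substitute your own script):

```shell
# Limit BLAS/OpenMP to 4 threads for this one invocation.
OMP_NUM_THREADS=4 python -c "import os; print(os.environ['OMP_NUM_THREADS'])"
```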
 No multi-node support.
 Most, but not all NumPy functions/aliases are implemented.
 Wrapping an existing Python function is easy and documented.
 We know how to separate the shared variable memory storage location from its object type (tensor, sparse, dtype, broadcast flags), but we need to do it.
Contact us
Discussion about Theano takes place in the theano-dev and theano-users mailing lists. People interested in the development of Theano should check the former, while the latter is reserved for issues that concern the end users.
Questions, comments, praise, criticism as well as bug reports should be submitted to these mailing lists.
We welcome all kinds of contributions. If you have any questions regarding how to extend Theano, please feel free to ask on the theano-dev mailing list.
Requirements
Note
We only support the installation of the requirements through conda.
 Python == 2.7* or ( >= 3.4 and < 3.6 )
 The development package (python-dev or python-devel on most Linux distributions) is recommended (see just below). Python 2.4 was supported up to and including the release 0.6. Python 2.6 was supported up to and including the release 0.8.2. Python 3.3 was supported up to and including release 0.9.
 NumPy >= 1.9.1 <= 1.12
 Earlier versions could work, but we don’t test them.
 SciPy >= 0.14 < 0.17.1
 Only currently required for sparse matrix and special functions support, but highly recommended. SciPy >=0.8 could work, but earlier versions have known bugs with sparse matrices.
 BLAS installation (with Level 3 functionality)
 Recommended: MKL, which is free through Conda with the mkl-service package. Alternatively, we suggest installing OpenBLAS, with the development headers (-dev, -devel, depending on your Linux distribution).
Optional requirements
 g++ (Linux and Windows), clang (OS X)
 Highly recommended. Theano can fall back on a NumPy-based Python execution model, but a C compiler allows for vastly faster execution.
 nose >= 1.3.0
 Recommended, to run Theano’s test suite.
 Sphinx >= 0.5.1, pygments
 For building the documentation. LaTeX and dvipng are also necessary for math to show up as images.
 pydot-ng
 To handle large pictures for gif/images.
 NVIDIA CUDA drivers and SDK
 Highly recommended. Required for GPU code generation/execution on NVIDIA GPUs. See instructions below.
 libgpuarray
 Required for GPU/CPU code generation on CUDA and OpenCL devices (see: GpuArray Backend).
 pycuda and skcuda
 Required for some extra operations on the GPU like fft and solvers. We use them to wrap cufft and cusolver. Quick install: pip install pycuda scikit-cuda. For CUDA 8, the dev version of skcuda (will be released as 0.5.2) is needed for cusolver: pip install pycuda; pip install git+https://github.com/lebedov/scikit-cuda.git#egg=scikit-cuda
Requirements installation through Conda (recommended)
Install Miniconda
Follow this link to install Miniconda.
Note
If you want fast compiled code (recommended), make sure you have g++ (Windows/Linux) or Clang (OS X) installed.
Install requirements and optional packages
conda install numpy scipy mkl <nose> <sphinx> <pydot-ng>
 Arguments between <...> are optional.
Package parameterized is also optional but may be required for unit testing. It is available via pip.
pip install parameterized
Install and configure the GPU drivers (recommended)
Warning
OpenCL support is still minimal for now.
Install CUDA drivers
 Follow this link to install the CUDA driver and the CUDA Toolkit.
 You must reboot the computer after the driver installation.
 Test that it was loaded correctly after the reboot by executing the command nvidia-smi from the command line.
Note
Sanity check: The bin subfolder should contain an nvcc program. This folder is called the CUDA root directory.
 Fix ‘lib’ path
 Add the CUDA ‘lib’ subdirectory (and/or ‘lib64’ subdirectory if you have a 64-bit OS) to your $LD_LIBRARY_PATH environment variable. Example: /usr/local/cuda/lib64
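For example, on a 64-bit Linux system with CUDA in the default /usr/local/cuda location (an assumption; adjust the path to your install), the step above amounts to:

```shell
# Make the CUDA runtime libraries visible to the dynamic linker.
# Put this in your ~/.bashrc so it survives new shells.
export LD_LIBRARY_PATH=/usr/local/cuda/lib64${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}
```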
Installing Theano
Supported platforms:
Ubuntu Installation Instructions
Warning
If you want to install the bleeding-edge or development version of Theano from GitHub, please make sure you are reading the latest version of this page.
Requirements
Note
We only support the installation of the requirements through conda.
 Python == 2.7* or ( >= 3.4 and < 3.6 )
 The development package (python-dev or python-devel on most Linux distributions) is recommended (see just below). Python 2.4 was supported up to and including the release 0.6. Python 2.6 was supported up to and including the release 0.8.2. Python 3.3 was supported up to and including release 0.9.
 NumPy >= 1.9.1 <= 1.12
 Earlier versions could work, but we don’t test them.
 SciPy >= 0.14 < 0.17.1
 Only currently required for sparse matrix and special functions support, but highly recommended. SciPy >=0.8 could work, but earlier versions have known bugs with sparse matrices.
 BLAS installation (with Level 3 functionality)
 Recommended: MKL, which is free through Conda with the mkl-service package. Alternatively, we suggest installing OpenBLAS, with the development headers (-dev, -devel, depending on your Linux distribution).
Optional requirements
 python-dev, g++ >= 4.2
 Highly recommended. Theano can fall back on a NumPy-based Python execution model, but a C compiler allows for vastly faster execution.
 nose >= 1.3.0
 Recommended, to run Theano’s test suite.
 Sphinx >= 0.5.1, pygments
 For building the documentation. LaTeX and dvipng are also necessary for math to show up as images.
 pydot-ng
 To handle large pictures for gif/images.
 NVIDIA CUDA drivers and SDK
 Highly recommended. Required for GPU code generation/execution on NVIDIA GPUs. See instructions below.
 libgpuarray
 Required for GPU/CPU code generation on CUDA and OpenCL devices (see: GpuArray Backend).
 pycuda and skcuda
 Required for some extra operations on the GPU like fft and solvers. We use them to wrap cufft and cusolver. Quick install: pip install pycuda scikit-cuda. For CUDA 8, the dev version of skcuda (will be released as 0.5.2) is needed for cusolver: pip install pycuda; pip install git+https://github.com/lebedov/scikit-cuda.git#egg=scikit-cuda
Requirements installation through Conda (recommended)
Follow this link to install Miniconda.
Note
If you want fast compiled code (recommended), make sure you have g++ installed.
conda install numpy scipy mkl <nose> <sphinx> <pydot-ng>
 Arguments between <...> are optional.
Package parameterized is also optional but may be required for unit testing. It is available via pip.
pip install parameterized
Install and configure the GPU drivers (recommended)
Warning
OpenCL support is still minimal for now.
Install CUDA drivers
 Follow this link to install the CUDA driver and the CUDA Toolkit.
 You must reboot the computer after the driver installation.
 Test that it was loaded correctly after the reboot by executing the command nvidia-smi from the command line.
Note
Sanity check: The bin subfolder should contain an nvcc program. This folder is called the CUDA root directory.
 Fix ‘lib’ path
 Add the CUDA ‘lib’ subdirectory (and/or ‘lib64’ subdirectory if you have a 64-bit OS) to your $LD_LIBRARY_PATH environment variable. Example: /usr/local/cuda/lib64
Installation
Stable Installation
conda
If you use conda, you can directly install both theano and pygpu. Libgpuarray will be automatically installed as a dependency.
conda install theano pygpu
Warning
The latest conda packages for theano (0.9) and pygpu (0.6*) currently do not support the Python 3.4 branch.
pip
If you use pip, you have to install Theano and libgpuarray separately.
Install the latest stable version of Theano with:
<sudo> pip install <--user> Theano[test, doc]
 Any argument between <...> is optional.
 Use sudo for a root installation.
 Use --user for a user installation without admin rights. It will install Theano in your local site-packages.
 [test] will install the requirements for testing.
 [doc] will install the requirements in order to generate the documentation.
If you encounter any trouble, head to the Troubleshooting page.
The latest stable version of Theano is 0.9.0 (tagged with rel-0.9.0).
For the stable version of Theano you need a specific version of libgpuarray, which has been tagged v0.6.9.
Download it with:
git clone https://github.com/Theano/libgpuarray.git
cd libgpuarray
git checkout tags/v0.6.5 -b v0.6.9
and then follow the step-by-step instructions.
Bleeding-Edge Installation (recommended)
Install the latest, bleedingedge, development version of Theano with:
<sudo> pip install <--user> <--no-deps> git+https://github.com/Theano/Theano.git#egg=Theano
 Any argument between <...> is optional.
 Use sudo for a root installation.
 Use --user for a user installation without admin rights. It will install Theano in your local site-packages.
 Use --no-deps when you don’t want the dependencies of Theano to be installed through pip. This is important when they have already been installed as system packages.
If you encounter any trouble, head to the Troubleshooting page.
Install the latest development version of libgpuarray following the step-by-step instructions.
Developer Installation
Install the developer version of Theano with:
git clone git://github.com/Theano/Theano.git
cd Theano
<sudo> pip install <--user> <--no-deps> -e .
 Any argument between <...> is optional.
 Use sudo for a root installation.
 Use --user for a user installation without admin rights. It will install Theano in your local site-packages.
 Use --no-deps when you don’t want the dependencies of Theano to be installed through pip. This is important when they have already been installed as system packages.
 -e makes your installation editable, i.e., it links it to your source directory.
If you encounter any trouble, head to the Troubleshooting page.
Install the latest development version of libgpuarray following the step-by-step instructions.
Prerequisites through System Packages (not recommended)
If you want to acquire the requirements through your system packages and install them system wide follow these instructions:
For Ubuntu 16.04 with CUDA 7.5:
sudo apt-get install python-numpy python-scipy python-dev python-pip python-nose g++ libopenblas-dev git graphviz
sudo pip install Theano
# CUDA 7.5 does not support the default g++ version. Install a supported version and make it the default.
sudo aptget install g++4.9
sudo updatealternatives install /usr/bin/gcc gcc /usr/bin/gcc4.9 20
sudo updatealternatives install /usr/bin/gcc gcc /usr/bin/gcc5 10
sudo updatealternatives install /usr/bin/g++ g++ /usr/bin/g++4.9 20
sudo updatealternatives install /usr/bin/g++ g++ /usr/bin/g++5 10
sudo updatealternatives install /usr/bin/cc cc /usr/bin/gcc 30
sudo updatealternatives set cc /usr/bin/gcc
sudo updatealternatives install /usr/bin/c++ c++ /usr/bin/g++ 30
sudo updatealternatives set c++ /usr/bin/g++
For Ubuntu 11.10 through 14.04:
sudo aptget install pythonnumpy pythonscipy pythondev pythonpip pythonnose g++ libopenblasdev git
On 14.04, this will install Python 2 by default. If you want to use Python 3:
sudo aptget install python3numpy python3scipy python3dev python3pip python3nose g++ libopenblasdev git
sudo pip3 install Theano
For Ubuntu 11.04:
sudo aptget install pythonnumpy pythonscipy pythondev pythonpip pythonnose g++ git libatlas3gfbase libatlasdev
Manual OpenBLAS installation (deprecated)
The OpenBLAS included in some older Ubuntu versions is limited to 2 threads. Ubuntu 14.04 does not have this limit. If you want to use more cores at the same time, you will need to compile it yourself. Here is some code that will help you.
# Remove OpenBLAS if you installed it.
sudo apt-get remove libopenblas-base
# Download the development version of OpenBLAS.
git clone git://github.com/xianyi/OpenBLAS
cd OpenBLAS
make FC=gfortran
sudo make PREFIX=/usr/local/ install
# Tell Theano to use OpenBLAS.
# This works only for the current user.
# Each Theano user on that computer should run that line.
echo -e "\n[blas]\nldflags = -lopenblas\n" >> ~/.theanorc
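The file appended to by the echo line above is a standard INI-style configuration file. As a quick sanity check (a minimal sketch using Python 3's standard configparser module; the section and option names mirror the snippet above), you can confirm that the [blas] section parses as Theano will read it:

```python
import configparser

# The configuration appended to ~/.theanorc by the echo command above.
theanorc_text = "[blas]\nldflags = -lopenblas\n"

config = configparser.ConfigParser()
config.read_string(theanorc_text)

# Theano looks up this value to know how to link against BLAS.
print(config.get("blas", "ldflags"))  # prints: -lopenblas
```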
Mac OS Installation Instructions
Warning
If you want to install the bleeding-edge or development version of Theano from GitHub, please make sure you are reading the latest version of this page.
There are various ways to install Theano dependencies on a Mac. Here we describe the process in detail with Anaconda, Homebrew or MacPorts, but if you did it differently and it worked, please let us know the details on the theano-users mailing list, so that we can add alternative instructions here.
Requirements
Note
We only support the installation of the requirements through conda.
 Python == 2.7* or ( >= 3.4 and < 3.6 )
 The conda distribution is highly recommended. Python 2.4 was supported up to and including release 0.6. Python 2.6 was supported up to and including release 0.8.2. Python 3.3 was supported up to and including release 0.9.
 NumPy >= 1.9.1 <= 1.12
 Earlier versions could work, but we don't test them.
 SciPy >= 0.14 < 0.17.1
 Only currently required for sparse matrix and special functions support, but highly recommended. SciPy >= 0.8 could work, but earlier versions have known bugs with sparse matrices.
 BLAS installation (with Level 3 functionality)
 Recommended: MKL, which is free through Conda with the mkl-service package. Alternatively, we suggest installing OpenBLAS, with the development headers (-dev, -devel, depending on your Linux distribution).
Optional requirements
 clang (the system version)
 Highly recommended. Theano can fall back on a NumPy-based Python execution model, but a C compiler allows for vastly faster execution.
 nose >= 1.3.0
 Recommended, to run Theano's test-suite.
 Sphinx >= 0.5.1, pygments
 For building the documentation. LaTeX and dvipng are also necessary for math to show up as images.
 pydot-ng
 To handle large pictures for gif/images.
 NVIDIA CUDA drivers and SDK
 Highly recommended. Required for GPU code generation/execution on NVIDIA GPUs. See instructions below.
 libgpuarray
 Required for GPU/CPU code generation on CUDA and OpenCL devices (see: GpuArray Backend).
 pycuda and skcuda
 Required for some extra operations on the GPU like fft and solvers. We use them to wrap cufft and cusolver. Quick install: pip install pycuda scikit-cuda. For CUDA 8, the development version of skcuda (to be released as 0.5.2) is needed for cusolver: pip install pycuda; pip install git+https://github.com/lebedov/scikit-cuda.git#egg=scikit-cuda
Requirements installation through Conda (recommended)
Follow this link to install Miniconda.
Note
If you want fast compiled code (recommended), make sure you have Clang installed.
conda install numpy scipy mkl <nose> <sphinx> <pydot-ng>
 Arguments between <...> are optional.
The package parameterized is also optional but may be required for unit testing. It is available via pip:
pip install parameterized
Install and configure the GPU drivers (recommended)
Warning
OpenCL support is still minimal for now.
Install CUDA drivers
 Follow this link to install the CUDA driver and the CUDA Toolkit.
 You must reboot the computer after the driver installation.
 Test that it was loaded correctly after the reboot by executing the command nvidia-smi from the command line.
Note
Sanity check: The bin subfolder should contain an nvcc program. This folder is called the CUDA root directory.
 Fix 'lib' path
 Add the CUDA 'lib' subdirectory (and/or 'lib64' subdirectory if you have a 64-bit OS) to your $LD_LIBRARY_PATH environment variable. Example: /usr/local/cuda/lib64
Attention
For MacOS you should be able to follow the above instructions to set up CUDA, but be aware of the following caveats:
 If you want to compile the CUDA SDK code, you may need to temporarily revert back to Apple's gcc (sudo port select gcc) as their Makefiles are not compatible with MacPort's gcc.
 If CUDA seems unable to find a CUDA-capable GPU, you may need to manually toggle your GPU on, which can be done with gfxCardStatus.
Attention
Theano officially supports only clang on OS X. This can be installed by getting XCode from the App Store and running it once to install the command-line tools.
Installation
Stable Installation
conda
If you use conda, you can directly install both theano and pygpu. Libgpuarray will be automatically installed as a dependency.
conda install theano pygpu
Warning
The latest conda packages for theano (0.9) and pygpu (0.6.*) currently don't support the Python 3.4 branch.
pip
If you use pip, you have to install Theano and libgpuarray separately.
Install the latest stable version of Theano with:
<sudo> pip install <--user> Theano[test, doc]
 Any argument between <...> is optional.
 Use sudo for a root installation.
 Use --user for a user installation without admin rights. It will install Theano in your local site-packages.
 [test] will install the requirements for testing.
 [doc] will install the requirements to generate the documentation.
If you encounter any trouble, head to the Troubleshooting page.
The latest stable version of Theano is 0.9.0 (tagged with rel-0.9.0).
For the stable version of Theano you need a specific version of libgpuarray, which has been tagged v0.6.9.
Download it with:
git clone https://github.com/Theano/libgpuarray.git
cd libgpuarray
git checkout tags/v0.6.9 -b v0.6.9
and then follow the Step-by-step instructions.
Bleeding-Edge Installation (recommended)
Install the latest, bleeding-edge development version of Theano with:
<sudo> pip install <--user> <--no-deps> git+https://github.com/Theano/Theano.git#egg=Theano
 Any argument between <...> is optional.
 Use sudo for a root installation.
 Use --user for a user installation without admin rights. It will install Theano in your local site-packages.
 Use --no-deps when you don't want the dependencies of Theano to be installed through pip. This is important when they have already been installed as system packages.
If you encounter any trouble, head to the Troubleshooting page.
Install the latest development version of libgpuarray following the Step-by-step instructions.
Developer Installation
Install the developer version of Theano with:
git clone git://github.com/Theano/Theano.git
cd Theano
<sudo> pip install <--user> <--no-deps> -e .
 Any argument between <...> is optional.
 Use sudo for a root installation.
 Use --user for a user installation without admin rights. It will install Theano in your local site-packages.
 Use --no-deps when you don't want the dependencies of Theano to be installed through pip. This is important when they have already been installed as system packages.
 -e makes your installation editable, i.e., it links it to your source directory.
If you encounter any trouble, head to the Troubleshooting page.
Install the latest development version of libgpuarray following the Step-by-step instructions.
Requirements through Homebrew (not recommended)
Install python with homebrew:
$ brew install python # or python3 if you prefer
This will install pip. Then use pip to install numpy and scipy:
$ pip install numpy scipy
If you want to use openblas instead of Accelerate, you have to install numpy and scipy with homebrew:
$ brew tap homebrew/python
$ brew install numpy --with-openblas
$ brew install scipy --with-openblas
Requirements through MacPorts (not recommended)
Using MacPorts to install all required Theano dependencies is easy, but be aware that it will take a long time (a few hours) to build and install everything.
MacPorts requires installing XCode first (which can be found in the Mac App Store), if you do not have it already. If you can't install it from the App Store, look in your MacOS X installation DVD for an old version. Then update your Mac to update XCode.
Download and install MacPorts, then ensure its package list is up-to-date with sudo port selfupdate. Then, in order to install one or more of the required libraries, use port install, e.g. as follows:
$ sudo port install py27-numpy +atlas py27-scipy +atlas py27-pip
This will install all the required Theano dependencies. gcc will be automatically installed (since it is a SciPy dependency), but be aware that it takes a long time to compile (hours)! Having NumPy and SciPy linked with ATLAS (an optimized BLAS implementation) is not mandatory, but recommended if you care about performance.
You might have some different versions of gcc, SciPy, NumPy and Python installed on your system, perhaps via Xcode. It is a good idea to use either the MacPorts version of everything or some other set of compatible versions (e.g. provided by Xcode or Fink). The advantages of MacPorts are the transparency with which everything can be installed and the fact that packages are updated quite frequently. The following steps describe how to make sure you are using the MacPorts version of these packages.
In order to use the MacPorts version of Python, you will probably need to explicitly select it with sudo port select python python27. The reason this is necessary is that you may have an Apple-provided Python (via, for example, an Xcode installation). After performing this step, you should check that the symbolic link provided by which python points to the MacPorts python. For instance, on MacOS X Lion with MacPorts 2.0.3, the output of which python is /opt/local/bin/python and this symbolic link points to /opt/local/bin/python2.7. When executing sudo port select python python27-apple (which you should not do), the link points to /usr/bin/python2.7.
Similarly, make sure that you are using the MacPorts-provided gcc: use sudo port select gcc to see which gcc installs you have on the system. Then execute, for instance, sudo port select gcc mp-gcc44 to create a symlink that points to the correct (MacPorts) gcc (version 4.4 in this case).
At this point, if you have not done so already, it may be a good idea to close and restart your terminal, to make sure all configuration changes are properly taken into account.
Afterwards, please check that the scipy module that is imported in Python is the right one (and is a recent one). For instance, import scipy followed by print scipy.__version__ and print scipy.__path__ should result in a version number of at least 0.7.0 and a path that starts with /opt/local (the path where MacPorts installs its packages). If this is not the case, then you might have some old installation of scipy in your PYTHONPATH, so you should edit PYTHONPATH accordingly.
Please follow the same procedure with numpy.
This is covered in the MacPorts installation process, but make sure that your PATH environment variable contains /opt/local/bin and /opt/local/sbin before any other paths (to ensure that the Python and gcc binaries that you installed with MacPorts are visible first).
MacPorts does not automatically create nosetests and pip symlinks pointing to the MacPorts version, so you can add them yourself with:
$ sudo ln -s /opt/local/bin/nosetests-2.7 /opt/local/bin/nosetests
$ sudo ln -s /opt/local/bin/pip-2.7 /opt/local/bin/pip
Windows Installation Instructions
Warning
If you want to install the bleeding-edge or development version of Theano from GitHub, please make sure you are reading the latest version of this page.
Requirements
Note
We only support the installation of the requirements through conda.
 Python == 2.7* or ( >= 3.4 and < 3.6 )
 The conda distribution is highly recommended. Python 2.4 was supported up to and including release 0.6. Python 2.6 was supported up to and including release 0.8.2. Python 3.3 was supported up to and including release 0.9.
 NumPy >= 1.9.1 <= 1.12
 Earlier versions could work, but we don't test them.
 SciPy >= 0.14 < 0.17.1
 Only currently required for sparse matrix and special functions support, but highly recommended. SciPy >= 0.8 could work, but earlier versions have known bugs with sparse matrices.
 BLAS installation (with Level 3 functionality)
 Recommended: MKL, which is free through Conda with the mkl-service package. Alternatively, we suggest installing OpenBLAS, with the development headers (-dev, -devel, depending on your Linux distribution).
Optional requirements
 GCC compiler with g++ (version >= 4.2.*), and Python development files
 Highly recommended. Theano can fall back on a NumPy-based Python execution model, but a C compiler allows for vastly faster execution.
 nose >= 1.3.0
 Recommended, to run Theano's test-suite.
 Sphinx >= 0.5.1, pygments
 For building the documentation. LaTeX and dvipng are also necessary for math to show up as images.
 pydot-ng
 To handle large pictures for gif/images.
 NVIDIA CUDA drivers and SDK
 Highly recommended. Required for GPU code generation/execution on NVIDIA GPUs. See instructions below.
 libgpuarray
 Required for GPU/CPU code generation on CUDA and OpenCL devices (see: GpuArray Backend).
 pycuda and skcuda
 Required for some extra operations on the GPU like fft and solvers. We use them to wrap cufft and cusolver. Quick install: pip install pycuda scikit-cuda. For CUDA 8, the development version of skcuda (to be released as 0.5.2) is needed for cusolver: pip install pycuda; pip install git+https://github.com/lebedov/scikit-cuda.git#egg=scikit-cuda
Requirements installation through Conda (recommended)
Follow this link to install Miniconda.
Note
If you want fast compiled code (recommended), make sure you have g++ installed.
conda install numpy scipy mkl-service libpython <m2w64-toolchain> <nose> <sphinx> <pydot-ng> <git>
Note
 Arguments between <...> are optional.
 The m2w64-toolchain package provides a fully-compatible version of GCC and is thus highly recommended.
 The git package installs git source control through conda, which is required for the development versions of Theano and libgpuarray.
The package parameterized is also optional but may be required for unit testing. It is available via pip:
pip install parameterized
Installation
Stable Installation
conda
If you use conda, you can directly install both theano and pygpu. Libgpuarray will be automatically installed as a dependency.
conda install theano pygpu
Warning
The latest conda packages for theano (0.9) and pygpu (0.6.*) currently don't support the Python 3.4 branch.
pip
If you use pip, you have to install Theano and libgpuarray separately.
Install the latest stable version of Theano with:
<sudo> pip install <--user> Theano[test, doc]
 Any argument between <...> is optional.
 Use sudo for a root installation.
 Use --user for a user installation without admin rights. It will install Theano in your local site-packages.
 [test] will install the requirements for testing.
 [doc] will install the requirements to generate the documentation.
If you encounter any trouble, head to the Troubleshooting page.
The latest stable version of Theano is 0.9.0 (tagged with rel-0.9.0).
For the stable version of Theano you need a specific version of libgpuarray, which has been tagged v0.6.9.
Download it with:
git clone https://github.com/Theano/libgpuarray.git
cd libgpuarray
git checkout tags/v0.6.9 -b v0.6.9
and then follow the Step-by-step instructions.
Bleeding-Edge Installation (recommended)
Install the latest, bleeding-edge development version of Theano with:
<sudo> pip install <--user> <--no-deps> git+https://github.com/Theano/Theano.git#egg=Theano
 Any argument between <...> is optional.
 Use sudo for a root installation.
 Use --user for a user installation without admin rights. It will install Theano in your local site-packages.
 Use --no-deps when you don't want the dependencies of Theano to be installed through pip. This is important when they have already been installed as system packages.
If you encounter any trouble, head to the Troubleshooting page.
Install the latest development version of libgpuarray following the Step-by-step instructions.
Developer Installation
Install the developer version of Theano with:
git clone git://github.com/Theano/Theano.git
cd Theano
<sudo> pip install <--user> <--no-deps> -e .
 Any argument between <...> is optional.
 Use sudo for a root installation.
 Use --user for a user installation without admin rights. It will install Theano in your local site-packages.
 Use --no-deps when you don't want the dependencies of Theano to be installed through pip. This is important when they have already been installed as system packages.
 -e makes your installation editable, i.e., it links it to your source directory.
If you encounter any trouble, head to the Troubleshooting page.
Install the latest development version of libgpuarray following the Step-by-step instructions.
Instructions for other Python distributions (not recommended)
If you plan to use Theano with other Python distributions, these are generic guidelines to get a working environment:
 Look for the mandatory requirements in the package manager's repositories of your distribution. Many distributions come with a pip package manager which uses the PyPI repository. The required modules are Python (of course), NumPy, SciPy and a BLAS implementation (MKL or OpenBLAS). Use the versions recommended at the top of this documentation.
 If the package manager provides a GCC compiler with the recommended version (see at top), install it. If not, you could use the TDM GCC build, which is provided for both 32- and 64-bit platforms. A few caveats to watch for during installation:
  Install to a directory without spaces (we have placed it in C:\SciSoft\TDM-GCC-64).
  If you don't want to clutter your system PATH, uncheck the add to path option.
  Enable OpenMP support by checking the openmp support option.
 Install CUDA with the same instructions as above.
 Install the latest development version of libgpuarray following the Step-by-step instructions.
CentOS 6 Installation Instructions
Warning
If you want to install the bleeding-edge or development version of Theano from GitHub, please make sure you are reading the latest version of this page.
Requirements
Note
We only support the installation of the requirements through conda.
 Python == 2.7* or ( >= 3.4 and < 3.6 )
 The development package (python-dev or python-devel on most Linux distributions) is recommended (see just below). Python 2.4 was supported up to and including release 0.6. Python 2.6 was supported up to and including release 0.8.2. Python 3.3 was supported up to and including release 0.9.
 NumPy >= 1.9.1 <= 1.12
 Earlier versions could work, but we don't test them.
 SciPy >= 0.14 < 0.17.1
 Only currently required for sparse matrix and special functions support, but highly recommended. SciPy >= 0.8 could work, but earlier versions have known bugs with sparse matrices.
 BLAS installation (with Level 3 functionality)
 Recommended: MKL, which is free through Conda with the mkl-service package. Alternatively, we suggest installing OpenBLAS, with the development headers (-dev, -devel, depending on your Linux distribution).
Optional requirements
pythondev
,g++
>= 4.2 Highly recommended. Theano can fall back on a NumPybased Python execution model, but a C compiler allows for vastly faster execution.
 nose >= 1.3.0
 Recommended, to run Theano’s testsuite.
 Sphinx >= 0.5.1, pygments
 For building the documentation. LaTeX and dvipng are also necessary for math to show up as images.
 pydotng
 To handle large picture for gif/images.
 NVIDIA CUDA drivers and SDK
 Highly recommended Required for GPU code generation/execution on NVIDIA gpus. See instruction below.
 libgpuarray
 Required for GPU/CPU code generation on CUDA and OpenCL devices (see: GpuArray Backend).
 pycuda and skcuda
 Required for some extra operations on the GPU like fft and solvers. We use them to wrap cufft and cusolver. Quick install
pip install pycuda scikitcuda
. For cuda 8, the dev version of skcuda (will be released as 0.5.2) is needed for cusolver:pip install pycuda; pip install git+https://github.com/lebedov/scikitcuda.git#egg=scikitcuda
.
Requirements installation through Conda (recommended)
Follow this link to install Miniconda.
Note
If you want fast compiled code (recommended), make sure you have g++ installed.
conda install numpy scipy mkl <nose> <sphinx> <pydot-ng>
 Arguments between <...> are optional.
The package parameterized is also optional but may be required for unit testing. It is available via pip:
pip install parameterized
Install and configure the GPU drivers (recommended)
Warning
OpenCL support is still minimal for now.
Install CUDA drivers
 Follow this link to install the CUDA driver and the CUDA Toolkit.
 You must reboot the computer after the driver installation.
 Test that it was loaded correctly after the reboot by executing the command nvidia-smi from the command line.
Note
Sanity check: The bin subfolder should contain an nvcc program. This folder is called the CUDA root directory.
 Fix 'lib' path
 Add the CUDA 'lib' subdirectory (and/or 'lib64' subdirectory if you have a 64-bit OS) to your $LD_LIBRARY_PATH environment variable. Example: /usr/local/cuda/lib64
Installation
Stable Installation
conda
If you use conda, you can directly install both theano and pygpu. Libgpuarray will be automatically installed as a dependency.
conda install theano pygpu
Warning
The latest conda packages for theano (0.9) and pygpu (0.6.*) currently don't support the Python 3.4 branch.
pip
If you use pip, you have to install Theano and libgpuarray separately.
Install the latest stable version of Theano with:
<sudo> pip install <--user> Theano[test, doc]
 Any argument between <...> is optional.
 Use sudo for a root installation.
 Use --user for a user installation without admin rights. It will install Theano in your local site-packages.
 [test] will install the requirements for testing.
 [doc] will install the requirements to generate the documentation.
If you encounter any trouble, head to the Troubleshooting page.
The latest stable version of Theano is 0.9.0 (tagged with rel-0.9.0).
For the stable version of Theano you need a specific version of libgpuarray, which has been tagged v0.6.9.
Download it with:
git clone https://github.com/Theano/libgpuarray.git
cd libgpuarray
git checkout tags/v0.6.9 -b v0.6.9
and then follow the Step-by-step instructions.
Bleeding-Edge Installation (recommended)
Install the latest, bleeding-edge development version of Theano with:
<sudo> pip install <--user> <--no-deps> git+https://github.com/Theano/Theano.git#egg=Theano
 Any argument between <...> is optional.
 Use sudo for a root installation.
 Use --user for a user installation without admin rights. It will install Theano in your local site-packages.
 Use --no-deps when you don't want the dependencies of Theano to be installed through pip. This is important when they have already been installed as system packages.
If you encounter any trouble, head to the Troubleshooting page.
Install the latest development version of libgpuarray following the Step-by-step instructions.
Developer Installation
Install the developer version of Theano with:
git clone git://github.com/Theano/Theano.git
cd Theano
<sudo> pip install <--user> <--no-deps> -e .
 Any argument between <...> is optional.
 Use sudo for a root installation.
 Use --user for a user installation without admin rights. It will install Theano in your local site-packages.
 Use --no-deps when you don't want the dependencies of Theano to be installed through pip. This is important when they have already been installed as system packages.
 -e makes your installation editable, i.e., it links it to your source directory.
If you encounter any trouble, head to the Troubleshooting page.
Install the latest development version of libgpuarray following the Step-by-step instructions.
Requirements through System Packages (not recommended)
sudo yum install python-devel python-nose python-setuptools gcc gcc-gfortran gcc-c++ blas-devel lapack-devel atlas-devel
sudo easy_install pip
Other Platform-specific Installations
Warning
These instructions are not kept up to date.
NVIDIA Jetson TX1 embedded platform
sudo apt-get install python-numpy python-scipy python-dev python-pip python-nose g++ libblas-dev git
pip install --upgrade --no-deps git+git://github.com/Theano/Theano.git --user # Need Theano 0.8 or more recent
Gentoo
Brian Vandenberg emailed installation instructions for Gentoo, focusing on how to install the appropriate dependencies.
Nicolas Pinto provides ebuild scripts.
AWS Marketplace with Bitfusion AMI
AWS EC2 AMI pre-installed with Nvidia drivers, CUDA, cuDNN, Theano, Keras, Lasagne, Python 2, Python 3, PyCuda, Scikit-Learn, Pandas, Enum34, iPython, and Jupyter. Note that, as always, there is no charge for Theano and other open software; however, there is a charge for AWS hosting + Bitfusion.
Launch an instance from the AWS Marketplace.
Docker
Builds of Theano are available as Docker images: Theano Docker (CPU) or Theano Docker (CUDA). These are updated on a weekly basis with bleeding-edge builds of Theano. Examples of running bash in a Docker container are as follows:
sudo docker run -it kaixhin/theano
sudo nvidia-docker run -it kaixhin/cuda-theano:7.0
For a guide to Docker, see the official docs. CUDA support requires NVIDIA Docker. For more details on how to use the Theano Docker images, consult the source project.
Once your setup is complete, and if you installed the GPU libraries, head to Testing Theano with GPU to find out how to verify that everything is working properly.
To update your current installation see Updating Theano.
Updating Theano
Follow one of these three sections depending on how you installed Theano.
You should update frequently; bugs are fixed on a very regular basis, and features are added even more frequently!
Stable Installation
The following command will update only Theano:
<sudo> pip install <--user> <--no-deps> theano
 Use sudo for a root installation.
 Use --user for a user installation without admin rights. It will install Theano in your local site-packages.
 Use --no-deps when you don't want the dependencies of Theano to be installed through pip. This is important when they have already been installed as system packages.
Warning
If you installed NumPy/SciPy with yum/apt-get, updating NumPy/SciPy with pip/easy_install is not always a good idea. This can make Theano crash due to problems with BLAS. The versions of NumPy/SciPy in the distribution are sometimes linked against faster versions of BLAS. Installing NumPy/SciPy with yum/apt-get/pip/easy_install won't install the development package needed to recompile it with the fast version. To fix a possible crash, you can clear the Theano cache like this:
theano-cache clear
Bleeding-Edge Installation
The following command will update your bleeding-edge version of Theano:
<sudo> pip install <--user> <--no-deps> git+https://github.com/Theano/Theano.git#egg=Theano
 Use sudo for a root installation.
 Use --user for a user installation without admin rights. It will install Theano in your local site-packages.
 Use --no-deps when you don't want the dependencies of Theano to be installed through pip. This is important when they have already been installed as system packages.
Warning
If you installed NumPy/SciPy with yum/apt-get, updating NumPy/SciPy with pip/easy_install is not always a good idea. This can make Theano crash due to problems with BLAS. The versions of NumPy/SciPy in the distribution are sometimes linked against faster versions of BLAS. Installing NumPy/SciPy with yum/apt-get/pip/easy_install won't install the development package needed to recompile it with the fast version. To fix a possible crash, you can clear the Theano cache like this:
theano-cache clear
Developer Installation
To update your library to the latest revision, change directory (cd) to your Theano folder and execute the following command:
Warning
The following assumes you have knowledge of git and know how to do a rebase.
git pull --rebase
Tutorial
Let us start an interactive session (e.g. with python or ipython) and import Theano.
>>> from theano import *
Several of the symbols you will need to use are in the tensor subpackage of Theano. Let us import that subpackage under a handy name like T (the tutorials will frequently use this convention).
>>> import theano.tensor as T
If that succeeded you are ready for the tutorial, otherwise check your installation (see Installing Theano).
Throughout the tutorial, bear in mind that there is a Glossary as well as index and modules links in the upperright corner of each page to help you out.
Prerequisites
Python tutorial
In this documentation, we suppose that the reader knows Python. Here is a small list of Python tutorials/exercises if you need to learn it or only need a refresher:
 Python Challenge
 Dive into Python
 Google Python Class
 Enthought Python course (free for academics)
We have a tutorial on how Python manages its memory.
NumPy refresher
 Here are some quick guides to NumPy:
Matrix conventions for machine learning
Rows are horizontal and columns are vertical. Every row is an example. Therefore, inputs[10,5] is a matrix of 10 examples where each example has dimension 5. If this were the input of a neural network, then the weights from the input to the first hidden layer would represent a matrix of size (5, #hid).
Consider this array:
>>> numpy.asarray([[1., 2], [3, 4], [5, 6]])
array([[ 1., 2.],
[ 3., 4.],
[ 5., 6.]])
>>> numpy.asarray([[1., 2], [3, 4], [5, 6]]).shape
(3, 2)
This is a 3x2 matrix, i.e. there are 3 rows and 2 columns.
To access the entry in the 3rd row (row #2) and the 1st column (column #0):
>>> numpy.asarray([[1., 2], [3, 4], [5, 6]])[2, 0]
5.0
To remember this, keep in mind that we read lefttoright, toptobottom, so each thing that is contiguous is a row. That is, there are 3 rows and 2 columns.
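To tie the convention together: inputs of shape (10, 5) multiplied by an input-to-hidden weight matrix of shape (5, #hid) give one row of hidden activations per example. A small NumPy sketch (the choice of #hid = 3 here is just for illustration):

```python
import numpy

inputs = numpy.ones((10, 5))   # 10 examples, each of dimension 5
weights = numpy.ones((5, 3))   # weights from input to hidden layer, #hid = 3

hidden = inputs.dot(weights)   # one row of hidden activations per example
print(hidden.shape)            # (10, 3)
print(hidden[0, 0])            # 5.0 (dot product of five ones with five ones)
```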
Broadcasting
NumPy does broadcasting of arrays of different shapes during arithmetic operations. What this means in general is that the smaller array (or scalar) is broadcast across the larger array so that they have compatible shapes. The example below shows an instance of broadcasting:
>>> a = numpy.asarray([1.0, 2.0, 3.0])
>>> b = 2.0
>>> a * b
array([ 2., 4., 6.])
The smaller array b (actually a scalar here, which works like a 0-d array) in this case is broadcast to the same size as a during the multiplication. This trick is often useful in simplifying how expressions are written. More detail about broadcasting can be found in the numpy user guide.
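Broadcasting is not limited to the array-and-scalar case: arrays of different ranks are aligned on their trailing dimensions. For instance, adding a length-2 vector to a (3, 2) matrix adds the vector to every row:

```python
import numpy

a = numpy.asarray([[1., 2], [3, 4], [5, 6]])  # shape (3, 2)
b = numpy.asarray([10., 100.])                # shape (2,)

# b is broadcast across each of the 3 rows of a.
c = a + b
print(c.shape)    # (3, 2)
print(c[2, 1])    # 106.0, i.e. 6 + 100
```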
Basics
Baby Steps - Algebra
Adding two Scalars
To get us started with Theano and get a feel of what we’re working with, let’s make a simple function: add two numbers together. Here is how you do it:
>>> import numpy
>>> import theano.tensor as T
>>> from theano import function
>>> x = T.dscalar('x')
>>> y = T.dscalar('y')
>>> z = x + y
>>> f = function([x, y], z)
And now that we’ve created our function we can use it:
>>> f(2, 3)
array(5.0)
>>> numpy.allclose(f(16.3, 12.1), 28.4)
True
Let’s break this down into several steps. The first step is to define two symbols (Variables) representing the quantities that you want to add. Note that from now on, we will use the term Variable to mean “symbol” (in other words, x, y, z are all Variable objects). The output of the function f is a numpy.ndarray with zero dimensions.
If you are following along and typing into an interpreter, you may have noticed that there was a slight delay in executing the function instruction. Behind the scenes, f was being compiled into C code.
Step 1
>>> x = T.dscalar('x')
>>> y = T.dscalar('y')
In Theano, all symbols must be typed. In particular, T.dscalar is the type we assign to “0-dimensional arrays (scalars) of doubles (d)”. It is a Theano Type.
dscalar is not a class. Therefore, neither x nor y are actually instances of dscalar. They are instances of TensorVariable. x and y are, however, assigned the Theano Type dscalar in their type field, as you can see here:
>>> type(x)
<class 'theano.tensor.var.TensorVariable'>
>>> x.type
TensorType(float64, scalar)
>>> T.dscalar
TensorType(float64, scalar)
>>> x.type is T.dscalar
True
By calling T.dscalar with a string argument, you create a Variable representing a floating-point scalar quantity with the given name. If you provide no argument, the symbol will be unnamed. Names are not required, but they can help debugging.
More will be said in a moment regarding Theano’s inner structure. You could also learn more by looking into Graph Structures.
Step 2
The second step is to combine x and y into their sum z:
>>> z = x + y
z is yet another Variable which represents the addition of x and y. You can use the pp function to pretty-print out the computation associated to z.
>>> from theano import pp
>>> print(pp(z))
(x + y)
Step 3
The last step is to create a function taking x and y as inputs and giving z as output:
>>> f = function([x, y], z)
The first argument to function is a list of Variables that will be provided as inputs to the function. The second argument is a single Variable or a list of Variables. For either case, the second argument is what we want to see as output when we apply the function. f may then be used like a normal Python function.
Note
As a shortcut, you can skip step 3, and just use a variable’s eval method. The eval() method is not as flexible as function() but it can do everything we’ve covered in the tutorial so far. It has the added benefit of not requiring you to import function(). Here is how eval() works:
>>> import numpy
>>> import theano.tensor as T
>>> x = T.dscalar('x')
>>> y = T.dscalar('y')
>>> z = x + y
>>> numpy.allclose(z.eval({x : 16.3, y : 12.1}), 28.4)
True
We passed eval() a dictionary mapping symbolic Theano variables to the values to substitute for them, and it returned the numerical value of the expression.
eval() will be slow the first time you call it on a variable: it needs to call function() to compile the expression behind the scenes. Subsequent calls to eval() on that same variable will be fast, because the variable caches the compiled function.
Adding two Matrices
You might already have guessed how to do this. Indeed, the only change from the previous example is that you need to instantiate x and y using the matrix Types:
>>> x = T.dmatrix('x')
>>> y = T.dmatrix('y')
>>> z = x + y
>>> f = function([x, y], z)
dmatrix is the Type for matrices of doubles. Then we can use our new function on 2-D arrays:
>>> f([[1, 2], [3, 4]], [[10, 20], [30, 40]])
array([[ 11., 22.],
[ 33., 44.]])
The variable is a NumPy array. We can also use NumPy arrays directly as inputs:
>>> import numpy
>>> f(numpy.array([[1, 2], [3, 4]]), numpy.array([[10, 20], [30, 40]]))
array([[ 11., 22.],
[ 33., 44.]])
It is possible to add scalars to matrices, vectors to matrices, scalars to vectors, etc. The behavior of these operations is defined by broadcasting.
The following types are available:
- byte: bscalar, bvector, bmatrix, brow, bcol, btensor3, btensor4, btensor5
- 16-bit integers: wscalar, wvector, wmatrix, wrow, wcol, wtensor3, wtensor4, wtensor5
- 32-bit integers: iscalar, ivector, imatrix, irow, icol, itensor3, itensor4, itensor5
- 64-bit integers: lscalar, lvector, lmatrix, lrow, lcol, ltensor3, ltensor4, ltensor5
- float: fscalar, fvector, fmatrix, frow, fcol, ftensor3, ftensor4, ftensor5
- double: dscalar, dvector, dmatrix, drow, dcol, dtensor3, dtensor4, dtensor5
- complex: cscalar, cvector, cmatrix, crow, ccol, ctensor3, ctensor4, ctensor5
The previous list is not exhaustive and a guide to all types compatible with NumPy arrays may be found here: tensor creation.
Note
You, the user (not the system architecture) have to choose whether your program will use 32- or 64-bit integers (the i prefix vs. the l prefix) and floats (the f prefix vs. the d prefix).
More Examples
At this point it would be wise to begin familiarizing yourself more systematically with Theano’s fundamental objects and operations by browsing this section of the library: Basic Tensor Functionality.
As the tutorial unfolds, you should also gradually acquaint yourself with the other relevant areas of the library and with the relevant subjects of the documentation entrance page.
Logistic Function
Here’s another straightforward example, though a bit more elaborate than adding two numbers together. Let’s say that you want to compute the logistic curve, which is given by:
s(x) = 1 / (1 + exp(-x))
You want to compute the function elementwise on matrices of doubles, which means that you want to apply this function to each individual element of the matrix.
Well, what you do is this:
>>> import theano
>>> import theano.tensor as T
>>> x = T.dmatrix('x')
>>> s = 1 / (1 + T.exp(-x))
>>> logistic = theano.function([x], s)
>>> logistic([[0, 1], [1, 2]])
array([[ 0.5 , 0.73105858],
[ 0.26894142, 0.11920292]])
The reason logistic is performed elementwise is that all of its operations (division, addition, and exponentiation) are themselves elementwise operations.
It is also the case that:
s(x) = (1 + tanh(x / 2)) / 2
We can verify that this alternate form produces the same values:
>>> s2 = (1 + T.tanh(x / 2)) / 2
>>> logistic2 = theano.function([x], s2)
>>> logistic2([[0, 1], [1, 2]])
array([[ 0.5 , 0.73105858],
[ 0.26894142, 0.11920292]])
Computing More than one Thing at the Same Time
Theano supports functions with multiple outputs. For example, we can compute the elementwise difference, absolute difference, and squared difference between two matrices a and b at the same time:
>>> a, b = T.dmatrices('a', 'b')
>>> diff = a - b
>>> abs_diff = abs(diff)
>>> diff_squared = diff**2
>>> f = theano.function([a, b], [diff, abs_diff, diff_squared])
Note
dmatrices produces as many outputs as names that you provide. It is a shortcut for allocating symbolic variables that we will often use in the tutorials.
When we use the function f, it returns the three variables (the printing was reformatted for readability):
>>> f([[1, 1], [1, 1]], [[0, 1], [2, 3]])
[array([[ 1.,  0.],
       [-1., -2.]]), array([[ 1.,  0.],
       [ 1.,  2.]]), array([[ 1.,  0.],
       [ 1.,  4.]])]
Setting a Default Value for an Argument
Let’s say you want to define a function that adds two numbers, except that if you only provide one number, the other input is assumed to be one. You can do it like this:
>>> from theano import In
>>> from theano import function
>>> x, y = T.dscalars('x', 'y')
>>> z = x + y
>>> f = function([x, In(y, value=1)], z)
>>> f(33)
array(34.0)
>>> f(33, 2)
array(35.0)
This makes use of the In class, which allows you to specify properties of your function’s parameters in greater detail. Here we give a default value of 1 for y by creating an In instance with its value field set to 1.
Inputs with default values must follow inputs without default values (like Python’s functions). There can be multiple inputs with default values. These parameters can be set positionally or by name, as in standard Python:
>>> x, y, w = T.dscalars('x', 'y', 'w')
>>> z = (x + y) * w
>>> f = function([x, In(y, value=1), In(w, value=2, name='w_by_name')], z)
>>> f(33)
array(68.0)
>>> f(33, 2)
array(70.0)
>>> f(33, 0, 1)
array(33.0)
>>> f(33, w_by_name=1)
array(34.0)
>>> f(33, w_by_name=1, y=0)
array(33.0)
Note
In does not know the name of the local variables y and w that are passed as arguments. The symbolic variable objects have name attributes (set by dscalars in the example above) and these are the names of the keyword parameters in the functions that we build. This is the mechanism at work in In(y, value=1). In the case of In(w, value=2, name='w_by_name'), we override the symbolic variable’s name attribute with a name to be used for this function.
You may like to see Function in the library for more detail.
Copying functions
Theano functions can be copied, which can be useful for creating similar functions but with different shared variables or updates. This is done using the copy() method of function objects. The optimized graph of the original function is copied, so compilation only needs to be performed once.
Let’s start from the accumulator defined above:
>>> import theano
>>> import theano.tensor as T
>>> state = theano.shared(0)
>>> inc = T.iscalar('inc')
>>> accumulator = theano.function([inc], state, updates=[(state, state+inc)])
We can use it to increment the state as usual:
>>> accumulator(10)
array(0)
>>> print(state.get_value())
10
We can use copy() to create a similar accumulator, but with its own internal state, using the swap parameter, which is a dictionary of shared variables to exchange:
>>> new_state = theano.shared(0)
>>> new_accumulator = accumulator.copy(swap={state:new_state})
>>> new_accumulator(100)
[array(0)]
>>> print(new_state.get_value())
100
The state of the first function is left untouched:
>>> print(state.get_value())
10
We now create a copy with updates removed using the delete_updates parameter, which is set to False by default:
>>> null_accumulator = accumulator.copy(delete_updates=True)
As expected, the shared state is no longer updated:
>>> null_accumulator(9000)
[array(10)]
>>> print(state.get_value())
10
Using Random Numbers
Because in Theano you first express everything symbolically and afterwards compile this expression to get functions, using pseudorandom numbers is not as straightforward as it is in NumPy, though also not too complicated.
The way to think about putting randomness into Theano’s computations is to put random variables in your graph. Theano will allocate a NumPy RandomStream object (a random number generator) for each such variable, and draw from it as necessary. We will call this sort of sequence of random numbers a random stream. Random streams are at their core shared variables, so the observations on shared variables hold here as well. Theano’s random objects are defined and implemented in RandomStreams and, at a lower level, in RandomStreamsBase.
Here’s a brief example. The setup code is:
from theano.tensor.shared_randomstreams import RandomStreams
from theano import function
srng = RandomStreams(seed=234)
rv_u = srng.uniform((2,2))
rv_n = srng.normal((2,2))
f = function([], rv_u)
g = function([], rv_n, no_default_updates=True) #Not updating rv_n.rng
nearly_zeros = function([], rv_u + rv_u - 2 * rv_u)
Here, ‘rv_u’ represents a random stream of 2x2 matrices of draws from a uniform distribution. Likewise, ‘rv_n’ represents a random stream of 2x2 matrices of draws from a normal distribution. The distributions that are implemented are defined in RandomStreams and, at a lower level, in raw_random. They only work on the CPU. See Other Implementations for a GPU version.
Now let’s use these objects. If we call f(), we get random uniform numbers. The internal state of the random number generator is automatically updated, so we get different random numbers every time.
>>> f_val0 = f()
>>> f_val1 = f() #different numbers from f_val0
When we add the extra argument no_default_updates=True to function (as in g), then the random number generator state is not affected by calling the returned function. So, for example, calling g multiple times will return the same numbers.
>>> g_val0 = g() # different numbers from f_val0 and f_val1
>>> g_val1 = g() # same numbers as g_val0!
An important remark is that a random variable is drawn at most once during any single function execution. So the nearly_zeros function is guaranteed to return approximately 0 (except for rounding error) even though the rv_u random variable appears three times in the output expression.
>>> nearly_zeros = function([], rv_u + rv_u - 2 * rv_u)
Random variables can be seeded individually or collectively. You can seed just one random variable by seeding or assigning to the .rng attribute, using .rng.set_value().
>>> rng_val = rv_u.rng.get_value(borrow=True) # Get the rng for rv_u
>>> rng_val.seed(89234) # seeds the generator
>>> rv_u.rng.set_value(rng_val, borrow=True) # Assign back seeded rng
You can also seed all of the random variables allocated by a RandomStreams object by that object’s seed method. This seed will be used to seed a temporary random number generator, which will in turn generate seeds for each of the random variables.
>>> srng.seed(902340) # seeds rv_u and rv_n with different seeds each
As usual for shared variables, the random number generators used for random variables are common between functions. So our nearly_zeros function will update the state of the generators used in function f above.
For example:
>>> state_after_v0 = rv_u.rng.get_value().get_state()
>>> nearly_zeros() # this affects rv_u's generator
array([[ 0., 0.],
[ 0., 0.]])
>>> v1 = f()
>>> rng = rv_u.rng.get_value(borrow=True)
>>> rng.set_state(state_after_v0)
>>> rv_u.rng.set_value(rng, borrow=True)
>>> v2 = f() # v2 != v1
>>> v3 = f() # v3 == v1
In some use cases, a user might want to transfer the “state” of all random number generators associated with a given theano graph (e.g. g1, with compiled function f1 below) to a second graph (e.g. g2, with function f2). This might arise for example if you are trying to initialize the state of a model from the parameters of a pickled version of a previous model. For theano.tensor.shared_randomstreams.RandomStreams and theano.sandbox.rng_mrg.MRG_RandomStreams this can be achieved by copying elements of the state_updates parameter.
Each time a random variable is drawn from a RandomStreams object, a tuple is added to the state_updates list. The first element is a shared variable, which represents the state of the random number generator associated with this particular variable, while the second represents the theano graph corresponding to the random number generation process (i.e. RandomFunction{uniform}.0).
An example of how “random states” can be transferred from one theano function to another is shown below.
>>> from __future__ import print_function
>>> import theano
>>> import numpy
>>> import theano.tensor as T
>>> from theano.sandbox.rng_mrg import MRG_RandomStreams
>>> from theano.tensor.shared_randomstreams import RandomStreams
>>> class Graph():
...     def __init__(self, seed=123):
...         self.rng = RandomStreams(seed)
...         self.y = self.rng.uniform(size=(1,))
>>> g1 = Graph(seed=123)
>>> f1 = theano.function([], g1.y)
>>> g2 = Graph(seed=987)
>>> f2 = theano.function([], g2.y)
>>> # By default, the two functions are out of sync.
>>> f1()
array([ 0.72803009])
>>> f2()
array([ 0.55056769])
>>> def copy_random_state(g1, g2):
...     if isinstance(g1.rng, MRG_RandomStreams):
...         g2.rng.rstate = g1.rng.rstate
...     for (su1, su2) in zip(g1.rng.state_updates, g2.rng.state_updates):
...         su2[0].set_value(su1[0].get_value())
>>> # We now copy the state of the theano random number generators.
>>> copy_random_state(g1, g2)
>>> f1()
array([ 0.59044123])
>>> f2()
array([ 0.59044123])
There are other distributions implemented.
A Real Example: Logistic Regression
The preceding elements are featured in this more realistic example. It will be used repeatedly.
import numpy
import theano
import theano.tensor as T
rng = numpy.random

N = 400                                   # training sample size
feats = 784                               # number of input variables

# generate a dataset: D = (input_values, target_class)
D = (rng.randn(N, feats), rng.randint(size=N, low=0, high=2))
training_steps = 10000

# Declare Theano symbolic variables
x = T.dmatrix("x")
y = T.dvector("y")

# initialize the weight vector w randomly
#
# this and the following bias variable b
# are shared so they keep their values
# between training iterations (updates)
w = theano.shared(rng.randn(feats), name="w")

# initialize the bias term
b = theano.shared(0., name="b")

print("Initial model:")
print(w.get_value())
print(b.get_value())

# Construct Theano expression graph
p_1 = 1 / (1 + T.exp(-T.dot(x, w) - b))          # Probability that target = 1
prediction = p_1 > 0.5                           # The prediction thresholded
xent = -y * T.log(p_1) - (1-y) * T.log(1-p_1)    # Cross-entropy loss function
cost = xent.mean() + 0.01 * (w ** 2).sum()       # The cost to minimize
gw, gb = T.grad(cost, [w, b])                    # Compute the gradient of the cost
                                                 # w.r.t weight vector w and
                                                 # bias term b
                                                 # (we shall return to this in a
                                                 # following section of this tutorial)

# Compile
train = theano.function(
          inputs=[x, y],
          outputs=[prediction, xent],
          updates=((w, w - 0.1 * gw), (b, b - 0.1 * gb)))
predict = theano.function(inputs=[x], outputs=prediction)

# Train
for i in range(training_steps):
    pred, err = train(D[0], D[1])

print("Final model:")
print(w.get_value())
print(b.get_value())
print("target values for D:")
print(D[1])
print("prediction on D:")
print(predict(D[0]))
Derivatives in Theano
Computing Gradients
Now let’s use Theano for a slightly more sophisticated task: create a function which computes the derivative of some expression y with respect to its parameter x. To do this we will use the macro T.grad. For instance, we can compute the gradient of x ** 2 with respect to x. Note that: d(x ** 2) / dx = 2 * x.
Here is the code to compute this gradient:
>>> import numpy
>>> import theano
>>> import theano.tensor as T
>>> from theano import pp
>>> x = T.dscalar('x')
>>> y = x ** 2
>>> gy = T.grad(y, x)
>>> pp(gy) # print out the gradient prior to optimization
'((fill((x ** TensorConstant{2}), TensorConstant{1.0}) * TensorConstant{2}) * (x ** (TensorConstant{2} - TensorConstant{1})))'
>>> f = theano.function([x], gy)
>>> f(4)
array(8.0)
>>> numpy.allclose(f(94.2), 188.4)
True
In this example, we can see from pp(gy) that we are computing the correct symbolic gradient. fill((x ** 2), 1.0) means to make a matrix of the same shape as x ** 2 and fill it with 1.0.
Note
The optimizer simplifies the symbolic gradient expression. You can see this by digging inside the internal properties of the compiled function.
pp(f.maker.fgraph.outputs[0])
'(2.0 * x)'
After optimization there is only one Apply node left in the graph, which doubles the input.
We can also compute the gradient of complex expressions such as the logistic function defined above. It turns out that the derivative of the logistic is: ds(x) / dx = s(x) * (1 - s(x)).
>>> x = T.dmatrix('x')
>>> s = T.sum(1 / (1 + T.exp(-x)))
>>> gs = T.grad(s, x)
>>> dlogistic = theano.function([x], gs)
>>> dlogistic([[0, 1], [1, 2]])
array([[ 0.25 , 0.19661193],
[ 0.19661193, 0.10499359]])
In general, for any scalar expression s, T.grad(s, w) provides the Theano expression for computing the gradient ds/dw. In this way Theano can be used for doing efficient symbolic differentiation (as the expression returned by T.grad will be optimized during compilation), even for functions with many inputs. (See automatic differentiation for a description of symbolic differentiation.)
Note
The second argument of T.grad can be a list, in which case the output is also a list. The order in both lists is important: element i of the output list is the gradient of the first argument of T.grad with respect to the i-th element of the list given as second argument.
The first argument of T.grad has to be a scalar (a tensor of size 1). For more information on the semantics of the arguments of T.grad and details about the implementation, see this section of the library.
Additional information on the inner workings of differentiation may also be found in the more advanced tutorial Extending Theano.
Computing the Jacobian
In Theano’s parlance, the term Jacobian designates the tensor comprising the first partial derivatives of the output of a function with respect to its inputs. (This is a generalization of the so-called Jacobian matrix in Mathematics.) Theano implements the theano.gradient.jacobian() macro that does all that is needed to compute the Jacobian. The following text explains how to do it manually.
In order to manually compute the Jacobian of some function y with respect to some parameter x we need to use scan. What we do is to loop over the entries in y and compute the gradient of y[i] with respect to x.
Note
scan is a generic op in Theano that allows writing all kinds of recurrent equations in a symbolic manner. While creating symbolic loops (and optimizing them for performance) is a hard task, efforts are being made to improve the performance of scan. We shall return to scan later in this tutorial.
>>> import theano
>>> import theano.tensor as T
>>> x = T.dvector('x')
>>> y = x ** 2
>>> J, updates = theano.scan(lambda i, y, x : T.grad(y[i], x), sequences=T.arange(y.shape[0]), non_sequences=[y, x])
>>> f = theano.function([x], J, updates=updates)
>>> f([4, 4])
array([[ 8., 0.],
[ 0., 8.]])
What we do in this code is to generate a sequence of ints from 0 to y.shape[0] using T.arange. Then we loop through this sequence, and at each step, we compute the gradient of element y[i] with respect to x. scan automatically concatenates all these rows, generating a matrix which corresponds to the Jacobian.
Note
There are some pitfalls to be aware of regarding T.grad. One of them is that you cannot rewrite the above expression of the Jacobian as theano.scan(lambda y_i, x: T.grad(y_i, x), sequences=y, non_sequences=x), even though from the documentation of scan this seems possible. The reason is that y_i will not be a function of x anymore, while y[i] still is.
Computing the Hessian
In Theano, the term Hessian has the usual mathematical meaning: it is the matrix comprising the second order partial derivatives of a function with scalar output and vector input. Theano implements the theano.gradient.hessian() macro that does all that is needed to compute the Hessian. The following text explains how to do it manually.
You can compute the Hessian manually similarly to the Jacobian. The only difference is that now, instead of computing the Jacobian of some expression y, we compute the Jacobian of T.grad(cost, x), where cost is some scalar.
>>> x = T.dvector('x')
>>> y = x ** 2
>>> cost = y.sum()
>>> gy = T.grad(cost, x)
>>> H, updates = theano.scan(lambda i, gy,x : T.grad(gy[i], x), sequences=T.arange(gy.shape[0]), non_sequences=[gy, x])
>>> f = theano.function([x], H, updates=updates)
>>> f([4, 4])
array([[ 2., 0.],
[ 0., 2.]])
Jacobian times a Vector
Sometimes we can express the algorithm in terms of Jacobians times vectors, or vectors times Jacobians. Compared to evaluating the Jacobian and then doing the product, there are methods that compute the desired results while avoiding actual evaluation of the Jacobian. This can bring about significant performance gains. A description of one such algorithm can be found here:
 Barak A. Pearlmutter, “Fast Exact Multiplication by the Hessian”, Neural Computation, 1994
While in principle we would want Theano to identify these patterns automatically for us, in practice, implementing such optimizations in a generic manner is extremely difficult. Therefore, we provide special functions dedicated to these tasks.
The R operator is built to evaluate the product between a Jacobian and a vector, namely (dy/dx) * v. The formulation can be extended even for x being a matrix, or a tensor in general, in which case the Jacobian also becomes a tensor and the product becomes some kind of tensor product. Because in practice we end up needing to compute such expressions in terms of weight matrices, Theano supports this more generic form of the operation. In order to evaluate the R-operation of expression y, with respect to x, multiplying the Jacobian by v, you need to do something similar to this:
>>> W = T.dmatrix('W')
>>> V = T.dmatrix('V')
>>> x = T.dvector('x')
>>> y = T.dot(x, W)
>>> JV = T.Rop(y, W, V)
>>> f = theano.function([W, V, x], JV)
>>> f([[1, 1], [1, 1]], [[2, 2], [2, 2]], [0,1])
array([ 2., 2.])
A list of Ops that implement Rop is available in the library.
Similarly to the R-operator, the L-operator computes a row vector times the Jacobian. The mathematical formula would be v * (dy/dx). The L-operator is also supported for generic tensors (not only for vectors). Similarly, it can be implemented as follows:
>>> W = T.dmatrix('W')
>>> v = T.dvector('v')
>>> x = T.dvector('x')
>>> y = T.dot(x, W)
>>> VJ = T.Lop(y, W, v)
>>> f = theano.function([v,x], VJ)
>>> f([2, 2], [0, 1])
array([[ 0., 0.],
[ 2., 2.]])
Note
v, the point of evaluation, differs between the L-operator and the R-operator. For the L-operator, the point of evaluation needs to have the same shape as the output, whereas for the R-operator this point should have the same shape as the input parameter. Furthermore, the results of these two operations differ. The result of the L-operator is of the same shape as the input parameter, while the result of the R-operator has a shape similar to that of the output.
Hessian times a Vector
If you need to compute the Hessian times a vector, you can make use of the above-defined operators to do it more efficiently than actually computing the exact Hessian and then performing the product. Due to the symmetry of the Hessian matrix, you have two options that will give you the same result, though these options might exhibit differing performance. Hence, we suggest profiling the methods before using either one of the two:
>>> x = T.dvector('x')
>>> v = T.dvector('v')
>>> y = T.sum(x ** 2)
>>> gy = T.grad(y, x)
>>> vH = T.grad(T.sum(gy * v), x)
>>> f = theano.function([x, v], vH)
>>> f([4, 4], [2, 2])
array([ 4., 4.])
or, making use of the Roperator:
>>> x = T.dvector('x')
>>> v = T.dvector('v')
>>> y = T.sum(x ** 2)
>>> gy = T.grad(y, x)
>>> Hv = T.Rop(gy, x, v)
>>> f = theano.function([x, v], Hv)
>>> f([4, 4], [2, 2])
array([ 4., 4.])
Final Pointers
- The grad function works symbolically: it receives and returns Theano variables.
- grad can be compared to a macro since it can be applied repeatedly.
- Only scalar costs can be directly handled by grad. Arrays are handled through repeated applications.
- Built-in functions allow computing vector times Jacobian and vector times Hessian efficiently.
- Work is in progress on the optimizations required to compute efficiently the full Jacobian and the Hessian matrix, as well as the Jacobian times vector.
Conditions
IfElse vs Switch
- Both ops build a condition over symbolic variables.
- IfElse takes a boolean condition and two variables as inputs.
- Switch takes a tensor as condition and two variables as inputs. switch is an elementwise operation and is thus more general than ifelse.
- Whereas switch evaluates both output variables, ifelse is lazy and only evaluates one variable with respect to the condition.
Example
from theano import tensor as T
from theano.ifelse import ifelse
import theano, time, numpy

a, b = T.scalars('a', 'b')
x, y = T.matrices('x', 'y')

z_switch = T.switch(T.lt(a, b), T.mean(x), T.mean(y))
z_lazy = ifelse(T.lt(a, b), T.mean(x), T.mean(y))

f_switch = theano.function([a, b, x, y], z_switch,
                           mode=theano.Mode(linker='vm'))
f_lazyifelse = theano.function([a, b, x, y], z_lazy,
                               mode=theano.Mode(linker='vm'))

val1 = 0.
val2 = 1.
big_mat1 = numpy.ones((10000, 1000))
big_mat2 = numpy.ones((10000, 1000))

n_times = 10

tic = time.clock()
for i in range(n_times):
    f_switch(val1, val2, big_mat1, big_mat2)
print('time spent evaluating both values %f sec' % (time.clock() - tic))

tic = time.clock()
for i in range(n_times):
    f_lazyifelse(val1, val2, big_mat1, big_mat2)
print('time spent evaluating one value %f sec' % (time.clock() - tic))
In this example, the IfElse op spends less time (about half as much) than Switch since it computes only one variable out of the two.
$ python ifelse_switch.py
time spent evaluating both values 0.6700 sec
time spent evaluating one value 0.3500 sec
Unless linker='vm' or linker='cvm' are used, ifelse will compute both variables and take the same computation time as switch. Although the linker is not currently set by default to cvm, it will be in the near future.
There is no automatic optimization replacing a switch with a broadcasted scalar by an ifelse, as this is not always faster. See this ticket.
Note
If you use test values, then all branches of the IfElse will be computed. This is normal, as using test_value means everything will be computed when we build it, due to Python’s greedy evaluation and the semantics of test values. As we build both branches, they will be executed for test values. This doesn’t cause any changes during the execution of the compiled Theano function.
Loop
Scan
- A general form of recurrence, which can be used for looping.
- Reduction and map (loop over the leading dimensions) are special cases of scan.
- You scan a function along some input sequence, producing an output at each time-step.
- The function can see the previous K time-steps of your function.
- sum() could be computed by scanning the z + x(i) function over a list, given an initial state of z=0.
- Often a for loop can be expressed as a scan() operation, and scan is the closest that Theano comes to looping.
- Advantages of using scan over for loops:
  - The number of iterations can be part of the symbolic graph.
  - Minimizes GPU transfers (if a GPU is involved).
  - Computes gradients through sequential steps.
  - Slightly faster than using a for loop in Python with a compiled Theano function.
  - Can lower the overall memory usage by detecting the actual amount of memory needed.
The full documentation can be found in the library: Scan.
A good ipython notebook with explanation and more examples.
Scan Example: Computing tanh(x(t).dot(W) + b) elementwise
import theano
import theano.tensor as T
import numpy as np
# defining the tensor variables
X = T.matrix("X")
W = T.matrix("W")
b_sym = T.vector("b_sym")
results, updates = theano.scan(lambda v: T.tanh(T.dot(v, W) + b_sym), sequences=X)
compute_elementwise = theano.function(inputs=[X, W, b_sym], outputs=results)
# test values
x = np.eye(2, dtype=theano.config.floatX)
w = np.ones((2, 2), dtype=theano.config.floatX)
b = np.ones((2), dtype=theano.config.floatX)
b[1] = 2
print(compute_elementwise(x, w, b))
# comparison with numpy
print(np.tanh(x.dot(w) + b))
[[ 0.96402758 0.99505475]
[ 0.96402758 0.99505475]]
[[ 0.96402758 0.99505475]
[ 0.96402758 0.99505475]]
Scan Example: Computing the sequence x(t) = tanh(x(t - 1).dot(W) + y(t).dot(U) + p(T - t).dot(V))
import theano
import theano.tensor as T
import numpy as np
# define tensor variables
X = T.vector("X")
W = T.matrix("W")
b_sym = T.vector("b_sym")
U = T.matrix("U")
Y = T.matrix("Y")
V = T.matrix("V")
P = T.matrix("P")
results, updates = theano.scan(lambda y, p, x_tm1: T.tanh(T.dot(x_tm1, W) + T.dot(y, U) + T.dot(p, V)),
                               sequences=[Y, P[::-1]], outputs_info=[X])
compute_seq = theano.function(inputs=[X, W, Y, U, P, V], outputs=results)
# test values
x = np.zeros((2), dtype=theano.config.floatX)
x[1] = 1
w = np.ones((2, 2), dtype=theano.config.floatX)
y = np.ones((5, 2), dtype=theano.config.floatX)
y[0, :] = -3
u = np.ones((2, 2), dtype=theano.config.floatX)
p = np.ones((5, 2), dtype=theano.config.floatX)
p[0, :] = 3
v = np.ones((2, 2), dtype=theano.config.floatX)
print(compute_seq(x, w, y, u, p, v))
# comparison with numpy
x_res = np.zeros((5, 2), dtype=theano.config.floatX)
x_res[0] = np.tanh(x.dot(w) + y[0].dot(u) + p[4].dot(v))
for i in range(1, 5):
    x_res[i] = np.tanh(x_res[i - 1].dot(w) + y[i].dot(u) + p[4 - i].dot(v))
print(x_res)
[[-0.99505475 -0.99505475]
 [ 0.96471973  0.96471973]
 [ 0.99998585  0.99998585]
 [ 0.99998771  0.99998771]
 [ 1.          1.        ]]
[[-0.99505475 -0.99505475]
 [ 0.96471973  0.96471973]
 [ 0.99998585  0.99998585]
 [ 0.99998771  0.99998771]
 [ 1.          1.        ]]
Scan Example: Computing norms of lines of X
import theano
import theano.tensor as T
import numpy as np
# define tensor variable
X = T.matrix("X")
results, updates = theano.scan(lambda x_i: T.sqrt((x_i ** 2).sum()), sequences=[X])
compute_norm_lines = theano.function(inputs=[X], outputs=results)
# test value
x = np.diag(np.arange(1, 6, dtype=theano.config.floatX), 1)
print(compute_norm_lines(x))
# comparison with numpy
print(np.sqrt((x ** 2).sum(1)))
[ 1. 2. 3. 4. 5. 0.]
[ 1. 2. 3. 4. 5. 0.]
Scan Example: Computing norms of columns of X
import theano
import theano.tensor as T
import numpy as np
# define tensor variable
X = T.matrix("X")
results, updates = theano.scan(lambda x_i: T.sqrt((x_i ** 2).sum()), sequences=[X.T])
compute_norm_cols = theano.function(inputs=[X], outputs=results)
# test value
x = np.diag(np.arange(1, 6, dtype=theano.config.floatX), 1)
print(compute_norm_cols(x))
# comparison with numpy
print(np.sqrt((x ** 2).sum(0)))
[ 0. 1. 2. 3. 4. 5.]
[ 0. 1. 2. 3. 4. 5.]
Scan Example: Computing trace of X
import theano
import theano.tensor as T
import numpy as np
floatX = "float32"
# define tensor variable
X = T.matrix("X")
results, updates = theano.scan(lambda i, j, t_f: T.cast(X[i, j] + t_f, floatX),
                               sequences=[T.arange(X.shape[0]), T.arange(X.shape[1])],
                               outputs_info=np.asarray(0., dtype=floatX))
result = results[-1]
compute_trace = theano.function(inputs=[X], outputs=result)
# test value
x = np.eye(5, dtype=theano.config.floatX)
x[0] = np.arange(5, dtype=theano.config.floatX)
print(compute_trace(x))
# comparison with numpy
print(np.diagonal(x).sum())
4.0
4.0
Scan Example: Computing the sequence x(t) = x(t - 2).dot(U) + x(t - 1).dot(V) + tanh(x(t - 1).dot(W) + b)
import theano
import theano.tensor as T
import numpy as np
# define tensor variables
X = T.matrix("X")
W = T.matrix("W")
b_sym = T.vector("b_sym")
U = T.matrix("U")
V = T.matrix("V")
n_sym = T.iscalar("n_sym")
results, updates = theano.scan(lambda x_tm2, x_tm1: T.dot(x_tm2, U) + T.dot(x_tm1, V) + T.tanh(T.dot(x_tm1, W) + b_sym),
                               n_steps=n_sym, outputs_info=[dict(initial=X, taps=[-2, -1])])
compute_seq2 = theano.function(inputs=[X, U, V, W, b_sym, n_sym], outputs=results)
# test values
x = np.zeros((2, 2), dtype=theano.config.floatX)  # the initial value must be able to return x[-2]
x[1, 1] = 1
w = 0.5 * np.ones((2, 2), dtype=theano.config.floatX)
u = 0.5 * (np.ones((2, 2), dtype=theano.config.floatX) - np.eye(2, dtype=theano.config.floatX))
v = 0.5 * np.ones((2, 2), dtype=theano.config.floatX)
n = 10
b = np.ones((2), dtype=theano.config.floatX)
print(compute_seq2(x, u, v, w, b, n))
# comparison with numpy
x_res = np.zeros((10, 2))
x_res[0] = x[0].dot(u) + x[1].dot(v) + np.tanh(x[1].dot(w) + b)
x_res[1] = x[1].dot(u) + x_res[0].dot(v) + np.tanh(x_res[0].dot(w) + b)
x_res[2] = x_res[0].dot(u) + x_res[1].dot(v) + np.tanh(x_res[1].dot(w) + b)
for i in range(2, 10):
    x_res[i] = (x_res[i - 2].dot(u) + x_res[i - 1].dot(v) +
                np.tanh(x_res[i - 1].dot(w) + b))
print(x_res)
[[ 1.40514825 1.40514825]
[ 2.88898899 2.38898899]
[ 4.34018291 4.34018291]
[ 6.53463142 6.78463142]
[ 9.82972243 9.82972243]
[ 14.22203814 14.09703814]
[ 20.07439936 20.07439936]
[ 28.12291843 28.18541843]
[ 39.1913681 39.1913681 ]
[ 54.28407732 54.25282732]]
[[ 1.40514825 1.40514825]
[ 2.88898899 2.38898899]
[ 4.34018291 4.34018291]
[ 6.53463142 6.78463142]
[ 9.82972243 9.82972243]
[ 14.22203814 14.09703814]
[ 20.07439936 20.07439936]
[ 28.12291843 28.18541843]
[ 39.1913681 39.1913681 ]
[ 54.28407732 54.25282732]]
Scan Example: Computing the Jacobian of y = tanh(v.dot(A)) wrt v
import theano
import theano.tensor as T
import numpy as np
# define tensor variables
v = T.vector()
A = T.matrix()
y = T.tanh(T.dot(v, A))
results, updates = theano.scan(lambda i: T.grad(y[i], v), sequences=[T.arange(y.shape[0])])
compute_jac_t = theano.function([A, v], results, allow_input_downcast=True) # shape (d_out, d_in)
# test values
x = np.eye(5, dtype=theano.config.floatX)[0]
w = np.eye(5, 3, dtype=theano.config.floatX)
w[2] = np.ones((3), dtype=theano.config.floatX)
print(compute_jac_t(w, x))
# compare with numpy
print(((1 - np.tanh(x.dot(w)) ** 2) * w).T)
[[ 0.41997434 0. 0.41997434 0. 0. ]
[ 0. 1. 1. 0. 0. ]
[ 0. 0. 1. 0. 0. ]]
[[ 0.41997434 0. 0.41997434 0. 0. ]
[ 0. 1. 1. 0. 0. ]
[ 0. 0. 1. 0. 0. ]]
Note that we need to iterate over the indices of y and not over the elements of y. The reason is that scan creates a placeholder variable for its internal function, and this placeholder variable does not have the same dependencies as the variables that will replace it.
Scan Example: Accumulating the number of loop iterations during a scan
import theano
import theano.tensor as T
import numpy as np
# define shared variables
k = theano.shared(0)
n_sym = T.iscalar("n_sym")
results, updates = theano.scan(lambda: {k: (k + 1)}, n_steps=n_sym)
accumulator = theano.function([n_sym], [], updates=updates, allow_input_downcast=True)
print(k.get_value())  # 0
accumulator(5)
print(k.get_value())  # 5
Scan Example: Computing tanh(v.dot(W) + b) * d where d is binomial
import theano
import theano.tensor as T
import numpy as np
# define tensor variables
X = T.matrix("X")
W = T.matrix("W")
b_sym = T.vector("b_sym")
# define shared random stream
trng = T.shared_randomstreams.RandomStreams(1234)
d = trng.binomial(size=W[1].shape)
results, updates = theano.scan(lambda v: T.tanh(T.dot(v, W) + b_sym) * d, sequences=X)
compute_with_bnoise = theano.function(inputs=[X, W, b_sym], outputs=results,
                                      updates=updates, allow_input_downcast=True)
x = np.eye(10, 2, dtype=theano.config.floatX)
w = np.ones((2, 2), dtype=theano.config.floatX)
b = np.ones((2), dtype=theano.config.floatX)
print(compute_with_bnoise(x, w, b))
[[ 0.96402758 0. ]
[ 0. 0.96402758]
[ 0. 0. ]
[ 0.76159416 0.76159416]
[ 0.76159416 0. ]
[ 0. 0.76159416]
[ 0. 0.76159416]
[ 0. 0.76159416]
[ 0. 0. ]
[ 0.76159416 0.76159416]]
Note that if you want to use a random variable d that will not be updated through scan loops, you should pass this variable as a non_sequences argument.
Scan Example: Computing pow(A, k)
import theano
import theano.tensor as T
theano.config.warn.subtensor_merge_bug = False
k = T.iscalar("k")
A = T.vector("A")
def inner_fct(prior_result, B):
    return prior_result * B
# Symbolic description of the result
result, updates = theano.scan(fn=inner_fct,
                              outputs_info=T.ones_like(A),
                              non_sequences=A, n_steps=k)
# Scan has provided us with A ** 1 through A ** k. Keep only the last
# value. Scan notices this and does not waste memory saving them.
final_result = result[-1]
power = theano.function(inputs=[A, k], outputs=final_result,
                        updates=updates)
print(power(range(10), 2))
[ 0. 1. 4. 9. 16. 25. 36. 49. 64. 81.]
Scan Example: Calculating a Polynomial
import numpy
import theano
import theano.tensor as T
theano.config.warn.subtensor_merge_bug = False
coefficients = theano.tensor.vector("coefficients")
x = T.scalar("x")
max_coefficients_supported = 10000
# Generate the components of the polynomial
full_range = theano.tensor.arange(max_coefficients_supported)
components, updates = theano.scan(fn=lambda coeff, power, free_var:
                                  coeff * (free_var ** power),
                                  outputs_info=None,
                                  sequences=[coefficients, full_range],
                                  non_sequences=x)
polynomial = components.sum()
calculate_polynomial = theano.function(inputs=[coefficients, x],
                                       outputs=polynomial)
test_coeff = numpy.asarray([1, 0, 2], dtype=numpy.float32)
print(calculate_polynomial(test_coeff, 3))
19.0
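As a sanity check, the same polynomial can be evaluated with a plain Python loop over the coefficients (a sketch using NumPy only, mirroring the computation scan performs):

```python
import numpy as np

coefficients = np.asarray([1, 0, 2], dtype=np.float32)
x = 3.0
# Same computation scan performs: sum over coeff * x**power
value = sum(c * x ** p for p, c in enumerate(coefficients))
print(value)  # 19.0
```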
How Shape Information is Handled by TheanoÂ¶
It is not possible to strictly enforce the shape of a Theano variable when building a graph since the particular value provided at runtime for a parameter of a Theano function may condition the shape of the Theano variables in its graph.
Currently, information regarding shape is used in two ways in Theano:
To generate faster C code for the 2d convolution on the CPU and the GPU, when the exact output shape is known in advance.
To remove computations in the graph when we only want to know the shape, but not the actual value of a variable. This is done with the Op.infer_shape method.
Example:
>>> import theano
>>> x = theano.tensor.matrix('x')
>>> f = theano.function([x], (x ** 2).shape)
>>> theano.printing.debugprint(f)
MakeVector{dtype='int64'} [id A] ''   2
 |Shape_i{0} [id B] ''   1
 | |x [id C]
 |Shape_i{1} [id D] ''   0
 | |x [id C]
The output of this compiled function does not contain any multiplication or power. Theano has removed them to compute directly the shape of the output.
Shape Inference ProblemÂ¶
Theano propagates information about shape in the graph. Sometimes this can lead to errors. Consider this example:
>>> import numpy
>>> import theano
>>> x = theano.tensor.matrix('x')
>>> y = theano.tensor.matrix('y')
>>> z = theano.tensor.join(0, x, y)
>>> xv = numpy.random.rand(5, 4)
>>> yv = numpy.random.rand(3, 3)
>>> f = theano.function([x, y], z.shape)
>>> theano.printing.debugprint(f)
MakeVector{dtype='int64'} [id A] ''   4
 |Elemwise{Add}[(0, 0)] [id B] ''   3
 | |Shape_i{0} [id C] ''   2
 | | |x [id D]
 | |Shape_i{0} [id E] ''   1
 | | |y [id F]
 |Shape_i{1} [id G] ''   0
 | |x [id D]
>>> f(xv, yv)  # DOES NOT RAISE AN ERROR AS IT SHOULD.
array([8, 4])
>>> f = theano.function([x, y], z)  # Do not take the shape.
>>> theano.printing.debugprint(f)
Join [id A] ''   0
 |TensorConstant{0} [id B]
 |x [id C]
 |y [id D]
>>> f(xv, yv)
Traceback (most recent call last):
...
ValueError: ...
As you can see, when asking only for the shape of some computation (join in the example), an inferred shape is computed directly, without executing the computation itself (there is no join in the first debugprint output). This makes the computation of the shape faster, but it can also hide errors. In this example, the computation of the shape of the output of join is done only based on the first input Theano variable, which leads to an error. This might happen with other ops such as elemwise and dot, for example. Indeed, to perform some optimizations (for speed or stability, for instance), Theano assumes that the computation is correct and consistent in the first place, as it does here.
You can detect those problems by running the code without this optimization, using the Theano flag optimizer_excluding=local_shape_to_shape_i. You can also obtain the same effect by running in the mode FAST_COMPILE (which will not apply this optimization, nor most other optimizations) or DebugMode (which will test before and after all optimizations, but is much slower).
Specifying Exact ShapeÂ¶
Currently, specifying a shape is not as easy and flexible as we wish and we plan some upgrade. Here is the current state of what can be done:
 You can pass the shape info directly to the ConvOp created when calling conv2d. You simply set the parameters image_shape and filter_shape inside the call. They must be tuples of 4 elements. For example:
theano.tensor.nnet.conv2d(..., image_shape=(7, 3, 5, 5), filter_shape=(2, 3, 4, 4))
 You can use the SpecifyShape op to add shape information anywhere in the graph. This allows Theano to perform some optimizations. In the following example, this makes it possible to precompute the Theano function to a constant.
>>> import theano
>>> x = theano.tensor.matrix()
>>> x_specify_shape = theano.tensor.specify_shape(x, (2, 2))
>>> f = theano.function([x], (x_specify_shape ** 2).shape)
>>> theano.printing.debugprint(f)
DeepCopyOp [id A] ''   0
 |TensorConstant{(2,) of 2} [id B]
Future PlansÂ¶
The parameter “constant shape” will be added to theano.shared(). This is probably the most frequent occurrence with shared variables. It will make the code simpler and will make it possible to check that the shape does not change when updating the shared variable.
BroadcastingÂ¶
Broadcasting is a mechanism which allows tensors with different numbers of dimensions to be added or multiplied together by (virtually) replicating the smaller tensor along the dimensions that it is lacking.
Broadcasting is the mechanism by which a scalar may be added to a matrix, a vector to a matrix or a scalar to a vector.
[Figure: Broadcasting a row matrix. T and F respectively stand for True and False and indicate along which dimensions we allow broadcasting.]
If the second argument were a vector, its shape would be (2,) and its broadcastable pattern (False,). They would be automatically expanded to the left to match the dimensions of the matrix (adding 1 to the shape and True to the pattern), resulting in (1, 2) and (True, False). It would then behave just like the example above.
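This left-padding rule is the same one NumPy applies at runtime; a minimal NumPy sketch of the (2,) vector case described above:

```python
import numpy as np

m = np.ones((3, 2))
v = np.array([10.0, 20.0])  # shape (2,), pattern (False,)
# v is implicitly treated as shape (1, 2) and replicated along axis 0
assert (m + v == m + v.reshape(1, 2)).all()
print((m + v).shape)  # (3, 2)
```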
Unlike numpy which does broadcasting dynamically, Theano needs to know, for any operation which supports broadcasting, which dimensions will need to be broadcasted. When applicable, this information is given in the Type of a Variable.
The following code illustrates how rows and columns are broadcasted in order to perform an addition operation with a matrix:
>>> r = T.row()
>>> r.broadcastable
(True, False)
>>> mtr = T.matrix()
>>> mtr.broadcastable
(False, False)
>>> f_row = theano.function([r, mtr], [r + mtr])
>>> R = np.arange(3).reshape(1, 3)
>>> R
array([[0, 1, 2]])
>>> M = np.arange(9).reshape(3, 3)
>>> M
array([[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]])
>>> f_row(R, M)
[array([[  0.,   2.,   4.],
        [  3.,   5.,   7.],
        [  6.,   8.,  10.]])]
>>> c = T.col()
>>> c.broadcastable
(False, True)
>>> f_col = theano.function([c, mtr], [c + mtr])
>>> C = np.arange(3).reshape(3, 1)
>>> C
array([[0],
       [1],
       [2]])
>>> M = np.arange(9).reshape(3, 3)
>>> f_col(C, M)
[array([[  0.,   1.,   2.],
        [  4.,   5.,   6.],
        [  8.,   9.,  10.]])]
In these examples, we can see that both the row vector and the column vector are broadcasted in order to be added to the matrix.
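The results above can be cross-checked against NumPy, which applies the same broadcasting rules:

```python
import numpy as np

R = np.arange(3).reshape(1, 3)  # row
C = np.arange(3).reshape(3, 1)  # column
M = np.arange(9).reshape(3, 3)
# The row is replicated down the rows, the column across the columns
assert ((R + M)[0] == np.array([0, 2, 4])).all()
assert ((C + M)[:, 0] == np.array([0, 4, 8])).all()
```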
AdvancedÂ¶
SparseÂ¶
In general, sparse matrices provide the same functionality as regular matrices. The difference lies in the way the elements of sparse matrices are represented and stored in memory. Only the nonzero elements of the latter are stored. This has some potential advantages: first, this may obviously lead to reduced memory usage and, second, clever storage methods may lead to reduced computation time through the use of sparse specific algorithms. We usually refer to the generically stored matrices as dense matrices.
Theano’s sparse package provides efficient algorithms, but its use is not recommended in all cases or for all matrices. As an obvious example, consider the case where the sparsity proportion is very low. The sparsity proportion refers to the ratio of the number of zero elements to the number of all elements in a matrix. A low sparsity proportion may result in the use of more space in memory since not only the actual data is stored, but also the position of nearly every element of the matrix. This would also require more computation time whereas a dense matrix representation along with regular optimized algorithms might do a better job. Other examples may be found at the nexus of the specific purpose and structure of the matrices. More documentation may be found in the SciPy Sparse Reference.
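To make the memory trade-off concrete, here is a quick comparison using SciPy (whose formats Theano's sparse package is based on); the shape and sparsity level are arbitrary choices for illustration:

```python
import numpy as np
import scipy.sparse as sp

dense = np.zeros((1000, 1000), dtype=np.float32)
dense[::100, ::100] = 1.0  # only 100 non-zero elements out of 10**6
m = sp.csc_matrix(dense)

# The sparse representation stores values plus their positions
sparse_bytes = m.data.nbytes + m.indices.nbytes + m.indptr.nbytes
print(dense.nbytes, sparse_bytes)  # 4000000 vs a few kilobytes
assert sparse_bytes < dense.nbytes // 100
```

With a high sparsity proportion the saving is dramatic; as the text notes, a nearly dense matrix stored this way would instead use more memory than the plain array.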
Since sparse matrices are not stored in contiguous arrays, there are several
ways to represent them in memory. This is usually designated by the so-called format
of the matrix. Since Theano’s sparse matrix package is based on the SciPy
sparse package, complete information about sparse matrices can be found
in the SciPy documentation. Like SciPy, Theano does not implement sparse formats for
arrays with a number of dimensions different from two.
So far, Theano implements two formats of sparse matrix: csc and csr. They are almost identical except that csc is based on the columns of the matrix and csr is based on its rows. They both have the same purpose: to provide for the use of efficient algorithms performing linear algebra operations. A disadvantage is that they fail to give an efficient way to modify the sparsity structure of the underlying matrix, i.e. adding new elements. This means that if you are planning to add new elements in a sparse matrix very often in your computational graph, a tensor variable could be a better choice.
More documentation may be found in the Sparse Library Reference.
Before going further, here are the import
statements that are assumed for the rest of the
tutorial:
>>> import theano
>>> import numpy as np
>>> import scipy.sparse as sp
>>> from theano import sparse
Compressed Sparse FormatÂ¶
Theano supports two compressed sparse formats, csc and csr, respectively based on columns and rows. They both have the same attributes: data, indices, indptr and shape.
 The data attribute is a one-dimensional ndarray which contains all the non-zero elements of the sparse matrix.
 The indices and indptr attributes are used to store the position of the data in the sparse matrix.
 The shape attribute is exactly the same as the shape attribute of a dense (i.e. generic) matrix. It can be explicitly specified at the creation of a sparse matrix if it cannot be inferred from the first three attributes.
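Since Theano's sparse matrices follow the SciPy layout, these attributes can be inspected directly on a SciPy matrix:

```python
import numpy as np
import scipy.sparse as sp

a = sp.csc_matrix(np.array([[0, 1, 0],
                            [2, 0, 3]]))
assert list(a.data) == [2, 1, 3]       # non-zero values, column by column
assert list(a.indices) == [1, 0, 1]    # row index of each stored value
assert list(a.indptr) == [0, 1, 2, 3]  # indptr[j]:indptr[j+1] slices column j
assert a.shape == (2, 3)
```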
In the end, the format does not affect the length of the data and indices attributes. They are both completely fixed by the number of elements you want to store. The only thing that changes with the format is indptr. In csc format, the matrix is compressed along columns, so a lower number of columns will result in less memory use. On the other hand, with the csr format, the matrix is compressed along the rows; with a matrix that has a lower number of rows, the csr format is a better choice. So here is the rule:
Note
If shape[0] > shape[1], use csc format. Otherwise, use csr.
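The rule can be verified with SciPy: data and indices have one entry per stored element in either format, while indptr grows with the compressed dimension:

```python
import numpy as np
import scipy.sparse as sp

tall = np.array([[1, 0], [0, 2], [3, 0], [0, 4]])  # shape (4, 2)
csc = sp.csc_matrix(tall)
csr = sp.csr_matrix(tall)

assert len(csc.data) == len(csr.data) == 4   # identical in both formats
assert len(csc.indptr) == tall.shape[1] + 1  # 3: one entry per column, plus one
assert len(csr.indptr) == tall.shape[0] + 1  # 5: one entry per row, plus one
# shape[0] > shape[1], so csc stores the smaller indptr, as the rule says
```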
Sometimes, since the sparse module is young, ops do not exist for both formats. So here is what may be the most relevant rule:
Note
Use the format compatible with the ops in your computation graph.
The documentation about the ops and their supported format may be found in the Sparse Library Reference.
Handling Sparse in TheanoÂ¶
Most of the ops in Theano depend on the format of the sparse matrix. That is why there are two kinds of constructors of sparse variables: csc_matrix and csr_matrix. These can be called with the usual name and dtype parameters, but no broadcastable flags are allowed. This is forbidden since the sparse package, like the SciPy sparse module, does not provide any way to handle a number of dimensions different from two. The set of all accepted dtypes for the sparse matrices can be found in sparse.all_dtypes.
>>> sparse.all_dtypes
set(['int8', 'int16', 'int32', 'int64', 'uint8', 'uint16', 'uint32', 'uint64',
'float32', 'float64', 'complex64', 'complex128'])
To move back and forth from a dense matrix to a sparse matrix representation, Theano
provides the dense_from_sparse
, csr_from_dense
and
csc_from_dense
functions. No additional detail must be provided. Here is
an example that performs a full cycle from sparse to sparse:
>>> x = sparse.csc_matrix(name='x', dtype='float32')
>>> y = sparse.dense_from_sparse(x)
>>> z = sparse.csc_from_dense(y)
Although sparse variables do not allow direct access to their properties, this can be accomplished using the csm_properties function. This will return a tuple of one-dimensional tensor variables that represent the internal characteristics of the sparse matrix.
In order to reconstruct a sparse matrix from some properties, the functions CSC
and CSR
can be used. This will create the sparse matrix in the desired
format. As an example, the following code reconstructs a csc
matrix into
a csr
one.
>>> x = sparse.csc_matrix(name='x', dtype='int64')
>>> data, indices, indptr, shape = sparse.csm_properties(x)
>>> y = sparse.CSR(data, indices, indptr, shape)
>>> f = theano.function([x], y)
>>> a = sp.csc_matrix(np.asarray([[0, 1, 1], [0, 0, 0], [1, 0, 0]]))
>>> print(a.toarray())
[[0 1 1]
[0 0 0]
[1 0 0]]
>>> print(f(a).toarray())
[[0 0 1]
[1 0 0]
[1 0 0]]
The last example shows that one format can be obtained from transposition of
the other. Indeed, when calling the transpose
function,
the sparse characteristics of the resulting matrix cannot be the same as the one
provided as input.
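The same effect can be reproduced with SciPy alone: feeding the csc properties into a csr constructor yields the transpose, a sketch mirroring the Theano example above:

```python
import numpy as np
import scipy.sparse as sp

a = sp.csc_matrix(np.asarray([[0, 1, 1], [0, 0, 0], [1, 0, 0]]))
# Reinterpret the csc properties (data, indices, indptr) as csr:
b = sp.csr_matrix((a.data, a.indices, a.indptr), shape=a.shape)
assert (b.toarray() == a.toarray().T).all()
```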
Several ops are set to make use of the very peculiar structure of the sparse matrices. These ops are said to be structured and simply do not perform any computations on the zero elements of the sparse matrix. They can be thought of as being applied only to the data attribute of the latter. Note that these structured ops provide a structured gradient. More explanation below.
>>> x = sparse.csc_matrix(name='x', dtype='float32')
>>> y = sparse.structured_add(x, 2)
>>> f = theano.function([x], y)
>>> a = sp.csc_matrix(np.asarray([[0, 0, 1], [0, 2, 1], [3, 0, 0]], dtype='float32'))
>>> print(a.toarray())
[[ 0. 0. 1.]
[ 0. 2. 1.]
[ 3. 0. 0.]]
>>> print(f(a).toarray())
[[ 0. 0. 1.]
[ 0. 0. 3.]
[ 5. 0. 0.]]
The gradients of the ops in the sparse module can also be structured. Some ops provide a flag to indicate if the gradient is to be structured or not. The documentation can be used to determine if the gradient of an op is regular or structured or if its implementation can be modified. Similarly to structured ops, when a structured gradient is calculated, the computation is done only for the nonzero elements of the sparse matrix.
More documentation regarding the gradients of specific ops can be found in the Sparse Library Reference.
Using the GPUÂ¶
For an introductory discussion of Graphical Processing Units (GPU) and their use for intensive parallel computation purposes, see GPGPU.
One of Theano’s design goals is to specify computations at an abstract level, so that the internal function compiler has a lot of flexibility about how to carry out those computations. One of the ways we take advantage of this flexibility is in carrying out calculations on a graphics card.
Using the GPU in Theano is as simple as setting the device configuration flag to device=cuda. You can optionally target a specific GPU by specifying its number, as in e.g. device=cuda2. It is also encouraged to set the floating point precision to float32 when working on the GPU, as that is usually much faster. For example: THEANO_FLAGS='device=cuda,floatX=float32'. You can also set these options in the .theanorc file’s [global] section:
[global]
device = cuda
floatX = float32
Note
 If your computer has multiple GPUs and you use device=cuda, the driver selects the one to use (usually cuda0).
 You can use the program nvidia-smi to change this policy.
 By default, when device indicates preference for GPU computations, Theano will fall back to the CPU if there is a problem with the GPU. You can use the flag force_device=True to instead raise an error when Theano cannot use the GPU.
GpuArray BackendÂ¶
If you have not done so already, you will need to install libgpuarray as well as at least one computing toolkit (CUDA or OpenCL). Detailed instructions to accomplish that are provided at libgpuarray.
To install Nvidia’s GPUprogramming toolchain (CUDA) and configure Theano to use it, see the installation instructions for Linux, MacOS and Windows.
While all types of devices are supported if using OpenCL, for the remainder of this section, whatever compute device you are using will be referred to as GPU.
Note
The GpuArray backend uses config.gpuarray.preallocate for GPU memory allocation.
Warning
The backend was designed to support OpenCL, however current support is incomplete. A lot of very useful ops still do not support it because they were ported from the old backend with minimal change.
To see if your GPU is being used, cut and paste the following program into a file and run it.
Use the Theano flag device=cuda
to require the use of the GPU. Use the flag
device=cuda{0,1,...}
to specify which GPU to use.
from theano import function, config, shared, tensor
import numpy
import time
vlen = 10 * 30 * 768 # 10 x #cores x # threads per core
iters = 1000
rng = numpy.random.RandomState(22)
x = shared(numpy.asarray(rng.rand(vlen), config.floatX))
f = function([], tensor.exp(x))
print(f.maker.fgraph.toposort())
t0 = time.time()
for i in range(iters):
    r = f()
t1 = time.time()
print("Looping %d times took %f seconds" % (iters, t1 - t0))
print("Result is %s" % (r,))
if numpy.any([isinstance(x.op, tensor.Elemwise) and
              ('Gpu' not in type(x.op).__name__)
              for x in f.maker.fgraph.toposort()]):
    print('Used the cpu')
else:
    print('Used the gpu')
The program just computes exp()
of a bunch of random numbers. Note
that we use the theano.shared()
function to make sure that the
input x is stored on the GPU.
$ THEANO_FLAGS=device=cpu python gpu_tutorial1.py
[Elemwise{exp,no_inplace}(<TensorType(float64, vector)>)]
Looping 1000 times took 2.271284 seconds
Result is [ 1.23178032 1.61879341 1.52278065 ..., 2.20771815 2.29967753
1.62323285]
Used the cpu
$ THEANO_FLAGS=device=cuda0 python gpu_tutorial1.py
Using cuDNN version 5105 on context None
Mapped name None to device cuda0: GeForce GTX 750 Ti (0000:07:00.0)
[GpuElemwise{exp,no_inplace}(<GpuArrayType<None>(float64, (False,))>), HostFromGpu(gpuarray)(GpuElemwise{exp,no_inplace}.0)]
Looping 1000 times took 1.697514 seconds
Result is [ 1.23178032 1.61879341 1.52278065 ..., 2.20771815 2.29967753
1.62323285]
Used the gpu
By default functions that execute on the GPU still return a standard
numpy ndarray. A transfer operation is inserted just before the
results are returned to ensure a consistent interface with CPU code.
This allows changing the device some code runs on by only replacing
the value of the device
flag without touching the code.
If you don’t mind a loss of flexibility, you can ask theano to return the GPU object directly. The following code is modified to do just that.
from theano import function, config, shared, tensor
import numpy
import time
vlen = 10 * 30 * 768 # 10 x #cores x # threads per core
iters = 1000
rng = numpy.random.RandomState(22)
x = shared(numpy.asarray(rng.rand(vlen), config.floatX))
f = function([], tensor.exp(x).transfer(None))
print(f.maker.fgraph.toposort())
t0 = time.time()
for i in range(iters):
    r = f()
t1 = time.time()
print("Looping %d times took %f seconds" % (iters, t1 - t0))
print("Result is %s" % (numpy.asarray(r),))
if numpy.any([isinstance(x.op, tensor.Elemwise) and
              ('Gpu' not in type(x.op).__name__)
              for x in f.maker.fgraph.toposort()]):
    print('Used the cpu')
else:
    print('Used the gpu')
Here tensor.exp(x).transfer(None)
means “copy exp(x)
to the GPU”,
with None
the default GPU context when not explicitly given.
For information on how to set GPU contexts, see Using multiple GPUs.
The output is
$ THEANO_FLAGS=device=cuda0 python gpu_tutorial2.py
Using cuDNN version 5105 on context None
Mapped name None to device cuda0: GeForce GTX 750 Ti (0000:07:00.0)
[GpuElemwise{exp,no_inplace}(<GpuArrayType<None>(float64, (False,))>)]
Looping 1000 times took 0.040277 seconds
Result is [ 1.23178032 1.61879341 1.52278065 ..., 2.20771815 2.29967753
1.62323285]
Used the gpu
While the time per call appears to be much lower than in the two previous invocations (and should indeed be lower, since we avoid a transfer), the massive speedup we obtained is in part due to the asynchronous nature of execution on GPUs, meaning that the work isn’t completed yet, just ‘launched’. We’ll talk about that later.
The object returned is a GpuArray from pygpu. It mostly acts as a
numpy ndarray with some exceptions due to its data being on the GPU.
You can copy it to the host and convert it to a regular ndarray by
using usual numpy casting such as numpy.asarray()
.
For even more speed, you can play with the borrow
flag. See
Borrowing when Constructing Function Objects.
The performance characteristics will of course vary from device to device, and also as we refine our implementation:
 In general, matrix multiplication, convolution, and large element-wise operations can be accelerated a lot (5-50x) when arguments are large enough to keep 30 processors busy.
 Indexing, dimensionshuffling and constanttime reshaping will be equally fast on GPU as on CPU.
 Summation over rows/columns of tensors can be a little slower on the GPU than on the CPU.
 Copying of large quantities of data to and from a device is relatively slow, and often cancels most of the advantage of one or two accelerated functions on that data. Getting GPU performance largely hinges on making data transfer to the device pay off.
The backend supports all regular theano data types (float32, float64, int, ...), however GPU support varies and some units can’t deal with double (float64) or small (less than 32 bits like int16) data types. You will get an error at compile time or runtime if this is the case.
By default, all inputs will get transferred to the GPU. You can prevent an input from getting transferred by setting its tag.target attribute to ‘cpu’.
Complex support is untested and most likely completely broken.
 Consider adding floatX=float32 (or the type you are using) to your .theanorc file if you plan to do a lot of GPU work.
 The GPU backend supports float64 variables, but they are still slower to compute than float32. The more float32, the better GPU performance you will get.
 Prefer constructors like matrix, vector and scalar (which follow the type set in floatX) to dmatrix, dvector and dscalar. The latter enforce double precision (float64 on most machines), which slows down GPU computations on current hardware.
 Minimize transfers to the GPU device by using shared variables to store frequently-accessed data (see shared()). When using the GPU, tensor shared variables are stored on the GPU by default to eliminate transfer time for GPU ops using those variables.
 If you aren’t happy with the performance you see, try running your script with the profile=True flag. This should print some timing information at program termination. Is time being used sensibly? If an op or Apply is taking more time than its share, then if you know something about GPU programming, have a look at how it’s implemented in theano.gpuarray. Check the line similar to Spent Xs(X%) in cpu op, Xs(X%) in gpu op and Xs(X%) in transfer op. This can tell you if not enough of your graph is on the GPU or if there is too much memory transfer.
 To investigate whether all the ops in the computational graph are running on the GPU, you can debug or check your code by providing a value to the assert_no_cpu_op flag: warn for a warning, raise for raising an error, or pdb for setting a breakpoint in the computational graph if there is a CPU op.
By default, all operations on the GPU are run asynchronously. This means that they are only scheduled to run and the function returns. This is made somewhat transparently by the underlying libgpuarray.
A forced synchronization point is introduced when doing memory transfers between device and host.
It is possible to force synchronization for a particular GpuArray by
calling its sync()
method. This is useful to get accurate timings
when doing benchmarks.
Consider again the logistic regression:
import numpy
import theano
import theano.tensor as T
rng = numpy.random
N = 400
feats = 784
D = (rng.randn(N, feats).astype(theano.config.floatX),
     rng.randint(size=N, low=0, high=2).astype(theano.config.floatX))
training_steps = 10000
# Declare Theano symbolic variables
x = T.matrix("x")
y = T.vector("y")
w = theano.shared(rng.randn(feats).astype(theano.config.floatX), name="w")
b = theano.shared(numpy.asarray(0., dtype=theano.config.floatX), name="b")
x.tag.test_value = D[0]
y.tag.test_value = D[1]
# Construct Theano expression graph
p_1 = 1 / (1 + T.exp(-T.dot(x, w) - b))            # Probability of having a one
prediction = p_1 > 0.5                             # The prediction that is done: 0 or 1
xent = -y * T.log(p_1) - (1 - y) * T.log(1 - p_1)  # Cross-entropy
cost = xent.mean() + 0.01 * (w ** 2).sum()         # The cost to optimize
gw, gb = T.grad(cost, [w, b])
# Compile expressions to functions
train = theano.function(
    inputs=[x, y],
    outputs=[prediction, xent],
    updates=[(w, w - 0.01 * gw), (b, b - 0.01 * gb)],
    name="train")
predict = theano.function(inputs=[x], outputs=prediction,
                          name="predict")
if any([x.op.__class__.__name__ in ['Gemv', 'CGemv', 'Gemm', 'CGemm'] for x in
train.maker.fgraph.toposort()]):
print('Used the cpu')
elif any([x.op.__class__.__name__ in ['GpuGemm', 'GpuGemv'] for x in
train.maker.fgraph.toposort()]):
print('Used the gpu')
else:
print('ERROR, not able to tell if theano used the cpu or the gpu')
print(train.maker.fgraph.toposort())
for i in range(training_steps):
pred, err = train(D[0], D[1])
print("target values for D")
print(D[1])
print("prediction on D")
print(predict(D[0]))
print("floatX=", theano.config.floatX)
print("device=", theano.config.device)
Modify and execute this example to run on GPU with floatX=float32, and time it using the command line time python file.py. (Of course, you may use some of your answer to the exercise in section Configuration Settings and Compiling Mode.)
Is there an increase in speed from CPU to GPU? Where does it come from? (Use the profile=True flag.)
What can be done to further increase the speed of the GPU version? Put your ideas to test.
Software for Directly Programming a GPUÂ¶
Leaving aside Theano, which is a meta-programmer, there are:
CUDA: GPU programming API by NVIDIA, based on an extension to C (CUDA C)
 Vendor-specific
 Numeric libraries (BLAS, RNG, FFT) are maturing.
OpenCL: multi-vendor version of CUDA
 More general, standardized.
 Fewer libraries, less widespread.
PyCUDA: Python bindings to the CUDA driver interface, which allow access to Nvidia’s CUDA parallel computation API from Python
Convenience:
Makes it easy to do GPU metaprogramming from within Python.
Abstractions to compile low-level CUDA code from Python (pycuda.driver.SourceModule).
GPU memory buffer (pycuda.gpuarray.GPUArray).
Helpful documentation.
Completeness: Binding to all of CUDA’s driver API.
Automatic error checking: All CUDA errors are automatically translated into Python exceptions.
Speed: PyCUDA’s base layer is written in C++.
Good memory management of GPU objects:
Object cleanup tied to lifetime of objects (RAII, ‘Resource Acquisition Is Initialization’).
Makes it much easier to write correct, leak-free and crash-free code.
PyCUDA knows about dependencies (e.g. it won’t detach from a context before all memory allocated in it is also freed).
(This is adapted from PyCUDA’s documentation and Andreas Kloeckner’s website on PyCUDA.)
PyOpenCL: PyCUDA for OpenCL
Learning to Program with PyCUDAÂ¶
If you are already proficient in the C programming language, you can easily leverage your knowledge by first learning to program a GPU with the CUDA extension to C (CUDA C) and then using PyCUDA to access the CUDA API through a Python wrapper.
The following resources will assist you in this learning process:
 CUDA API and CUDA C: Introductory
 CUDA API and CUDA C: Advanced
 MIT IAP2009 CUDA (full coverage: lectures, leading Kirk-Hwu textbook, examples, additional resources)
 Course U. of Illinois (full lectures, Kirk-Hwu textbook)
 NVIDIA’s knowledge base (extensive coverage, levels from introductory to advanced)
 practical issues (on the relationship between grids, blocks and threads; see also linked and related issues on the same page)
 CUDA optimization
 PyCUDA: Introductory
 PYCUDA: Advanced
The following examples give a foretaste of programming a GPU with PyCUDA. Once you feel competent enough, you may try your hand at the corresponding exercises.
Example: PyCUDA
# (from PyCUDA's documentation)
import pycuda.autoinit
import pycuda.driver as drv
import numpy

from pycuda.compiler import SourceModule
mod = SourceModule("""
__global__ void multiply_them(float *dest, float *a, float *b)
{
  const int i = threadIdx.x;
  dest[i] = a[i] * b[i];
}
""")

multiply_them = mod.get_function("multiply_them")

a = numpy.random.randn(400).astype(numpy.float32)
b = numpy.random.randn(400).astype(numpy.float32)
dest = numpy.zeros_like(a)

multiply_them(
    drv.Out(dest), drv.In(a), drv.In(b),
    block=(400, 1, 1), grid=(1, 1))

assert numpy.allclose(dest, a * b)
print(dest)
Run the preceding example.
Modify and execute to work for a matrix of shape (20, 10).
Example: Theano + PyCUDA
import numpy, theano
import theano.misc.pycuda_init
from pycuda.compiler import SourceModule
import theano.sandbox.cuda as cuda


class PyCUDADoubleOp(theano.Op):

    __props__ = ()

    def make_node(self, inp):
        inp = cuda.basic_ops.gpu_contiguous(
            cuda.basic_ops.as_cuda_ndarray_variable(inp))
        assert inp.dtype == "float32"
        return theano.Apply(self, [inp], [inp.type()])

    def make_thunk(self, node, storage_map, _, _2, impl):
        mod = SourceModule("""
    __global__ void my_fct(float * i0, float * o0, int size) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < size) {
            o0[i] = i0[i] * 2;
        }
    }""")
        pycuda_fct = mod.get_function("my_fct")
        inputs = [storage_map[v] for v in node.inputs]
        outputs = [storage_map[v] for v in node.outputs]

        def thunk():
            z = outputs[0]
            if z[0] is None or z[0].shape != inputs[0][0].shape:
                z[0] = cuda.CudaNdarray.zeros(inputs[0][0].shape)
            grid = (int(numpy.ceil(inputs[0][0].size / 512.)), 1)
            pycuda_fct(inputs[0][0], z[0], numpy.intc(inputs[0][0].size),
                       block=(512, 1, 1), grid=grid)
        return thunk
Use this code to test it:
>>> x = theano.tensor.fmatrix()
>>> f = theano.function([x], PyCUDADoubleOp()(x))
>>> xv = numpy.ones((4, 5), dtype="float32")
>>> assert numpy.allclose(f(xv), xv*2)
>>> print(numpy.asarray(f(xv)))
Run the preceding example.
Modify and execute to multiply two matrices: x * y.
Modify and execute to return two outputs: x + y and x - y.
(Notice that Theano’s current elemwise fusion optimization is only applicable to computations involving a single output. Hence, to gain efficiency over the basic solution that is asked here, the two operations would have to be jointly optimized explicitly in the code.)
Modify and execute to support stride (i.e. to avoid constraining the input to be Ccontiguous).
NoteÂ¶
 See Other Implementations to know how to handle random numbers on the GPU.
 The mode FAST_COMPILE disables C code, so it also disables the GPU. You can use the Theano flag optimizer=fast_compile to speed up compilation and keep the GPU.
Using multiple GPUsÂ¶
Theano has a feature to allow the use of multiple GPUs at the same time in one function. The multiple-GPU feature requires the use of the GpuArray backend, so make sure that works correctly.
In order to keep a reasonably high level of abstraction, you do not refer to device names directly for multiple-GPU use. You instead refer to what we call context names. These are then mapped to a device using the Theano configuration. This allows portability of models between machines.
Warning
The code is rather new and is still considered experimental at this point. It has been tested and seems to perform correctly in all cases observed, but make sure to doublecheck your results before publishing a paper or anything of the sort.
Note
For dataparallelism, you probably are better using platoon.
Defining the context mapÂ¶
The mapping from context names to devices is done through the config.contexts option. The format looks like this:
dev0->cuda0;dev1->cuda1
Let’s break it down. First there is a list of mappings. Each of these mappings is separated by a semicolon ‘;’. There can be any number of such mappings, but in the example above we have two of them: dev0->cuda0 and dev1->cuda1.
The mappings themselves are composed of a context name followed by the two characters ‘->’ and the device name. The context name is a simple string which does not have any special meaning for Theano. For parsing reasons, the context name cannot contain the sequence ‘->’ or ‘;’. To avoid confusion, context names that begin with ‘cuda’ or ‘opencl’ are disallowed. The device name is a device in the form that gpuarray expects, like ‘cuda0’ or ‘opencl0:0’.
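For illustration, the format described above can be parsed with a few lines of Python. Note that parse_contexts is a hypothetical helper written for this sketch, not a Theano API:

```python
def parse_contexts(flag):
    """Parse a contexts flag like 'dev0->cuda0;dev1->cuda1' into a dict.

    Hypothetical helper illustrating the flag format, not a Theano function.
    """
    mapping = {}
    for entry in flag.split(';'):
        # each mapping is 'name->device'
        name, device = entry.split('->')
        # context names starting with 'cuda' or 'opencl' are disallowed
        assert not name.startswith(('cuda', 'opencl'))
        mapping[name] = device
    return mapping

print(parse_contexts('dev0->cuda0;dev1->cuda1'))
```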
Note
Since there are a bunch of shell special characters in the syntax, defining this on the command line will require proper quoting, like this:
$ THEANO_FLAGS="contexts=dev0->cuda0"
When you define a context map, if config.print_active_device is True (the default), Theano will print the mappings as they are defined. This will look like this:
$ THEANO_FLAGS="contexts=dev0->cuda0;dev1->cuda1" python -c 'import theano'
Mapped name dev0 to device cuda0: GeForce GTX TITAN X (0000:09:00.0)
Mapped name dev1 to device cuda1: GeForce GTX TITAN X (0000:06:00.0)
If you don’t have enough GPUs for a certain model, you can assign the same device to more than one name. You can also assign extra names that a model doesn’t need to some other devices. However, a proliferation of names is not always a good idea, since Theano often assumes that different context names will be on different devices and will optimize accordingly. So you may get faster performance for a single name and a single device.
Note
It is often the case that multi-GPU operation requires or assumes that all the GPUs involved are equivalent. This is not the case for this implementation. Since the user has the task of distributing jobs across the different devices, a model can be built on the assumption that one of the GPUs is slower or has less memory.
A simple graph on two GPUsÂ¶
The following simple program works on two GPUs. It builds a function which performs two dot products on two different GPUs.
import numpy
import theano

v01 = theano.shared(numpy.random.random((1024, 1024)).astype('float32'),
                    target='dev0')
v02 = theano.shared(numpy.random.random((1024, 1024)).astype('float32'),
                    target='dev0')
v11 = theano.shared(numpy.random.random((1024, 1024)).astype('float32'),
                    target='dev1')
v12 = theano.shared(numpy.random.random((1024, 1024)).astype('float32'),
                    target='dev1')

f = theano.function([], [theano.tensor.dot(v01, v02),
                         theano.tensor.dot(v11, v12)])
f()
This model requires a context map with assignments for ‘dev0’ and ‘dev1’. It should run twice as fast when the devices are different.
Explicit transfers of dataÂ¶
Since operations themselves cannot work on more than one device, they will pick a device to work on based on their inputs and automatically insert transfers for any input which is not on the right device.
However, you may want some explicit control over where and how these transfers are done at some points. This is done by using the new transfer() method that is present on variables. It works for moving data between GPUs and also between the host and the GPUs. Here is an example.
import theano
v = theano.tensor.fmatrix()
# Move to the device associated with 'gpudev'
gv = v.transfer('gpudev')
# Move back to the cpu
cv = gv.transfer('cpu')
Of course you can mix transfers and operations in any order you choose. However, you should try to minimize transfer operations because they introduce overhead that may reduce performance.
Convolution arithmetic tutorialÂ¶
Note
This tutorial is adapted from an existing convolution arithmetic guide [1], with an added emphasis on Theano’s interface.
Also, note that the signal processing community has a different nomenclature and a well established literature on the topic, but for this tutorial we will stick to the terms used in the machine learning community. For a signal processing point of view on the subject, see for instance Winograd, Shmuel. Arithmetic complexity of computations. Vol. 33. Siam, 1980.
About this tutorialÂ¶
Learning to use convolutional neural networks (CNNs) for the first time is generally an intimidating experience. A convolutional layer’s output shape is affected by the shape of its input as well as the choice of kernel shape, zero padding and strides, and the relationship between these properties is not trivial to infer. This contrasts with fullyconnected layers, whose output size is independent of the input size. Additionally, socalled transposed convolutional layers (also known as fractionally strided convolutional layers, or – wrongly – as deconvolutions) have been employed in more and more work as of late, and their relationship with convolutional layers has been explained with various degrees of clarity.
The relationship between a convolution operation’s input shape, kernel size, stride, padding and its output shape can be confusing at times.
The tutorial’s objective is threefold:
 Explain the relationship between convolutional layers and transposed convolutional layers.
 Provide an intuitive understanding of the relationship between input shape, kernel shape, zero padding, strides and output shape in convolutional and transposed convolutional layers.
 Clarify Theano’s API on convolutions.
Refresher: discrete convolutionsÂ¶
The bread and butter of neural networks is affine transformations: a vector is received as input and is multiplied with a matrix to produce an output (to which a bias vector is usually added before passing the result through a nonlinearity). This is applicable to any type of input, be it an image, a sound clip or an unordered collection of features: whatever their dimensionality, their representation can always be flattened into a vector before the transformation.
Images, sound clips and many other similar kinds of data have an intrinsic structure. More formally, they share these important properties:
 They are stored as multidimensional arrays.
 They feature one or more axes for which ordering matters (e.g., width and height axes for an image, time axis for a sound clip).
 One axis, called the channel axis, is used to access different views of the data (e.g., the red, green and blue channels of a color image, or the left and right channels of a stereo audio track).
These properties are not exploited when an affine transformation is applied; in fact, all the axes are treated in the same way and the topological information is not taken into account. Still, taking advantage of the implicit structure of the data may prove very handy in solving some tasks, like computer vision and speech recognition, and in these cases it would be best to preserve it. This is where discrete convolutions come into play.
A discrete convolution is a linear transformation that preserves this notion of ordering. It is sparse (only a few input units contribute to a given output unit) and reuses parameters (the same weights are applied to multiple locations in the input).
Here is an example of a discrete convolution:
The light blue grid is called the input feature map. A kernel (shaded area) slides across the input feature map. At each location, the product between each element of the kernel and the input element it overlaps is computed and the results are summed up to obtain the output at the current location. The final output of this procedure is a matrix called the output feature map (in green).
This procedure can be repeated using different kernels to form as many output feature maps (a.k.a. output channels) as desired. Note also that to keep the drawing simple a single input feature map is being represented, but it is not uncommon to have multiple feature maps stacked one onto another (an example of this is what was referred to earlier as channels for images and sound clips).
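The sliding-window procedure just described can be sketched in a few lines of NumPy. This is a minimal illustration using a hypothetical helper conv2d_single, not Theano's implementation; following the machine-learning convention, the kernel is applied without flipping:

```python
import numpy

def conv2d_single(input_map, kernel):
    """'Valid' 2D convolution (machine-learning convention, i.e. the kernel
    is not flipped) of one input feature map with one kernel."""
    i1, i2 = input_map.shape
    k1, k2 = kernel.shape
    output = numpy.empty((i1 - k1 + 1, i2 - k2 + 1))
    for r in range(output.shape[0]):
        for c in range(output.shape[1]):
            # elementwise product of the kernel with the patch it overlaps,
            # summed to give a single output unit
            output[r, c] = (input_map[r:r + k1, c:c + k2] * kernel).sum()
    return output

x = numpy.arange(25, dtype='float32').reshape(5, 5)   # 5x5 input feature map
w = numpy.ones((3, 3), dtype='float32')               # 3x3 kernel
y = conv2d_single(x, w)                               # 3x3 output feature map
```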
Note
While there is a distinction between convolution and crosscorrelation from a signal processing perspective, the two become interchangeable when the kernel is learned. For the sake of simplicity and to stay consistent with most of the machine learning literature, the term convolution will be used in this tutorial.
If there are multiple input and output feature maps, the collection of kernels forms a 4D array (output_channels, input_channels, filter_rows, filter_columns). For each output channel, each input channel is convolved with a distinct part of the kernel and the resulting set of feature maps is summed elementwise to produce the corresponding output feature map. The result of this procedure is a set of output feature maps, one for each output channel, that is the output of the convolution.
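The multi-channel procedure can be sketched in NumPy as follows (conv2d_multichannel is a hypothetical helper for illustration; real implementations are far more efficient):

```python
import numpy

def conv2d_multichannel(inputs, kernels):
    """inputs: (input_channels, i1, i2); kernels: (output_channels,
    input_channels, k1, k2). 'Valid' convolution, summing over input channels."""
    c_in, i1, i2 = inputs.shape
    c_out, c_in2, k1, k2 = kernels.shape
    assert c_in == c_in2
    out = numpy.zeros((c_out, i1 - k1 + 1, i2 - k2 + 1))
    for o in range(c_out):
        for c in range(c_in):
            # each input channel is convolved with its own 2D slice of the
            # 4D kernel array, and the results are summed elementwise
            for r in range(out.shape[1]):
                for q in range(out.shape[2]):
                    out[o, r, q] += (inputs[c, r:r + k1, q:q + k2]
                                     * kernels[o, c]).sum()
    return out

x = numpy.random.rand(3, 5, 5)      # 3 input channels
w = numpy.random.rand(2, 3, 3, 3)   # 2 output channels, 3 input channels
y = conv2d_multichannel(x, w)       # shape (2, 3, 3)
```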
The convolution depicted above is an instance of a 2D convolution, but it can be generalized to N-D convolutions. For instance, in a 3D convolution, the kernel would be a cuboid and would slide across the height, width and depth of the input feature map.
The collection of kernels defining a discrete convolution has a shape corresponding to some permutation of (n, m, k1, ..., kN), where n is the number of output feature maps, m is the number of input feature maps, and kj is the kernel size along axis j.
The following properties affect the output size oj of a convolutional layer along axis j:
 ij: input size along axis j,
 kj: kernel size along axis j,
 sj: stride (distance between two consecutive positions of the kernel) along axis j,
 pj: zero padding (number of zeros concatenated at the beginning and at the end of an axis) along axis j.
For instance, here is a 3 x 3 kernel applied to a 5 x 5 input padded with a 1 x 1 border of zeros using 2 x 2 strides:
The analysis of the relationship between convolutional layer properties is eased by the fact that they don’t interact across axes, i.e., the choice of kernel size, stride and zero padding along axis j only affects the output size of axis j. Because of that, this section will focus on the following simplified setting:
 2D discrete convolutions (N = 2),
 square inputs (i1 = i2 = i),
 square kernel size (k1 = k2 = k),
 same strides along both axes (s1 = s2 = s),
 same zero padding along both axes (p1 = p2 = p).
This facilitates the analysis and the visualization, but keep in mind that the results outlined here also generalize to the ND and nonsquare cases.
Theano terminologyÂ¶
Theano has its own terminology, which differs slightly from the convolution arithmetic guide’s. Here’s a simple conversion table for the two:
Theano       | Convolution arithmetic
filters      | 4D collection of kernels
input_shape  | (batch size (b), input channels (c), input rows (i1), input columns (i2))
filter_shape | (output channels (c1), input channels (c2), filter rows (k1), filter columns (k2))
border_mode  | 'valid', 'half', 'full' or (p1, p2)
subsample    | (s1, s2)
For instance, the convolution shown above would correspond to the following Theano call:
output = theano.tensor.nnet.conv2d(
    input, filters, input_shape=(1, 1, 5, 5), filter_shape=(1, 1, 3, 3),
    border_mode=(1, 1), subsample=(2, 2))
Convolution arithmeticÂ¶
The simplest case to analyze is when the kernel just slides across every position of the input (i.e., s = 1 and p = 0). Here is an example for i = 4 and k = 3:
One way of defining the output size in this case is by the number of possible placements of the kernel on the input. Let’s consider the width axis: the kernel starts on the leftmost part of the input feature map and slides by steps of one until it touches the right side of the input. The size of the output will be equal to the number of steps made, plus one, accounting for the initial position of the kernel. The same logic applies for the height axis.
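The counting argument above can be checked directly in a few lines of Python (a small sketch using a hypothetical helper named placements):

```python
def placements(i, k):
    """Number of valid left-edge positions of a kernel of size k along an
    axis of size i, sliding by steps of one (hypothetical helper)."""
    return sum(1 for start in range(i) if start + k <= i)

# The count always equals (i - k) + 1: the number of steps made, plus one
# for the initial position of the kernel.
for i in range(1, 10):
    for k in range(1, i + 1):
        assert placements(i, k) == (i - k) + 1
```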
More formally, the following relationship can be inferred:
Relationship 1
For any i and k, and for s = 1 and p = 0,
o = (i - k) + 1.
This translates to the following Theano code:
output = theano.tensor.nnet.conv2d(
    input, filters, input_shape=(b, c2, i1, i2), filter_shape=(c1, c2, k1, k2),
    border_mode=(0, 0), subsample=(1, 1))
# output.shape[2] == (i1 - k1) + 1
# output.shape[3] == (i2 - k2) + 1
To factor in zero padding (i.e., only restricting to s = 1), let’s consider its effect on the effective input size: padding with p zeros changes the effective input size from i to i + 2p. In the general case, Relationship 1 can then be used to infer the following relationship:
Relationship 2
For any i, k and p, and for s = 1,
o = (i - k) + 2p + 1.
This translates to the following Theano code:
output = theano.tensor.nnet.conv2d(
    input, filters, input_shape=(b, c2, i1, i2), filter_shape=(c1, c2, k1, k2),
    border_mode=(p1, p2), subsample=(1, 1))
# output.shape[2] == (i1 - k1) + 2 * p1 + 1
# output.shape[3] == (i2 - k2) + 2 * p2 + 1
Here is an example for i = 5, k = 4 and p = 2:
In practice, two specific instances of zero padding are used quite extensively because of their respective properties. Let’s discuss them in more detail.
Having the output size be the same as the input size (i.e., o = i) can be a desirable property:
Relationship 3
For any i and for odd k (k = 2n + 1, n a natural number), s = 1 and p = floor(k/2) = n,
o = i.
This translates to the following Theano code:
output = theano.tensor.nnet.conv2d(
    input, filters, input_shape=(b, c2, i1, i2), filter_shape=(c1, c2, k1, k2),
    border_mode='half', subsample=(1, 1))
# output.shape[2] == i1
# output.shape[3] == i2
This is sometimes referred to as half (or same) padding. Here is an example for i = 5, k = 3 and (therefore) p = 1:
Note that half padding also works for even-valued k and for s > 1, but in that case the property that the output size is the same as the input size is lost. Some frameworks also implement the same convolution slightly differently (e.g., in Keras).
While convolving a kernel generally decreases the output size with respect to the input size, sometimes the opposite is required. This can be achieved with proper zero padding:
Relationship 4
For any i and k, and for p = k - 1 and s = 1,
o = i + (k - 1).
This translates to the following Theano code:
output = theano.tensor.nnet.conv2d(
    input, filters, input_shape=(b, c2, i1, i2), filter_shape=(c1, c2, k1, k2),
    border_mode='full', subsample=(1, 1))
# output.shape[2] == i1 + (k1 - 1)
# output.shape[3] == i2 + (k2 - 1)
This is sometimes referred to as full padding, because in this setting every possible partial or complete superimposition of the kernel on the input feature map is taken into account. Here is an example for i = 5, k = 3 and (therefore) p = 2:
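The full and valid output sizes can be checked in one dimension with numpy.convolve, whose mode='full' computes every partial or complete superimposition of the kernel on the input:

```python
import numpy

i, k = 5, 3
a = numpy.random.rand(i)
kern = numpy.random.rand(k)

full = numpy.convolve(a, kern, mode='full')    # o = i + (k - 1)
valid = numpy.convolve(a, kern, mode='valid')  # o = (i - k) + 1

assert full.shape[0] == i + (k - 1)
assert valid.shape[0] == (i - k) + 1
```

(numpy.convolve flips the kernel, as in the signal-processing convention, but the output-size relationships are the same either way.)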
All relationships derived so far only apply for unit-strided convolutions. Incorporating non-unitary strides requires another inference leap. To facilitate the analysis, let’s momentarily ignore zero padding (i.e., s > 1 and p = 0). Here is an example for i = 5, k = 3 and s = 2:
Once again, the output size can be defined in terms of the number of possible placements of the kernel on the input. Let’s consider the width axis: the kernel starts as usual on the leftmost part of the input, but this time it slides by steps of size s until it touches the right side of the input. The size of the output is again equal to the number of steps made, plus one, accounting for the initial position of the kernel. The same logic applies for the height axis.
From this, the following relationship can be inferred:
Relationship 5
For any i, k and s, and for p = 0,
o = floor((i - k) / s) + 1.
This translates to the following Theano code:
output = theano.tensor.nnet.conv2d(
    input, filters, input_shape=(b, c2, i1, i2), filter_shape=(c1, c2, k1, k2),
    border_mode=(0, 0), subsample=(s1, s2))
# output.shape[2] == (i1 - k1) // s1 + 1
# output.shape[3] == (i2 - k2) // s2 + 1
The floor function accounts for the fact that sometimes the last possible step does not coincide with the kernel reaching the end of the input, i.e., some input units are left out.
The most general case (convolving over a zero padded input using non-unit strides) can be derived by applying Relationship 5 on an effective input of size i + 2p, in analogy to what was done for Relationship 2:
Relationship 6
For any i, k, p and s,
o = floor((i + 2p - k) / s) + 1.
This translates to the following Theano code:
output = theano.tensor.nnet.conv2d(
    input, filters, input_shape=(b, c2, i1, i2), filter_shape=(c1, c2, k1, k2),
    border_mode=(p1, p2), subsample=(s1, s2))
# output.shape[2] == (i1 - k1 + 2 * p1) // s1 + 1
# output.shape[3] == (i2 - k2 + 2 * p2) // s2 + 1
As before, the floor function means that in some cases a convolution will produce the same output size for multiple input sizes. More specifically, if i + 2p - k is a multiple of s, then any input size j = i + a, with a in {0, ..., s - 1}, will produce the same output size. Note that this ambiguity applies only for s > 1.
Here is an example for i = 5, k = 3, s = 2 and p = 1:
Here is an example for i = 6, k = 3, s = 2 and p = 1:
Interestingly, despite having different input sizes these convolutions share the same output size. While this doesn’t affect the analysis for convolutions, this will complicate the analysis in the case of transposed convolutions.
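This can be verified with a small helper implementing Relationship 6 (conv_output_size is a hypothetical name, not a Theano function):

```python
def conv_output_size(i, k, s, p):
    """Convolution output size along one axis: floor((i + 2p - k) / s) + 1."""
    return (i + 2 * p - k) // s + 1

# k = 3, s = 2, p = 1: inputs of size 5 and 6 give the same output size,
# because i + 2p - k differs by less than the stride.
assert conv_output_size(5, 3, 2, 1) == 3
assert conv_output_size(6, 3, 2, 1) == 3

# With s = 1 the ambiguity disappears:
assert conv_output_size(5, 3, 1, 1) != conv_output_size(6, 3, 1, 1)
```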
Transposed convolution arithmeticÂ¶
The need for transposed convolutions generally arises from the desire to use a transformation going in the opposite direction of a normal convolution, i.e., from something that has the shape of the output of some convolution to something that has the shape of its input while maintaining a connectivity pattern that is compatible with said convolution. For instance, one might use such a transformation as the decoding layer of a convolutional autoencoder or to project feature maps to a higherdimensional space.
Once again, the convolutional case is considerably more complex than the fully-connected case, which only requires using a weight matrix whose shape has been transposed. However, since every convolution boils down to an efficient implementation of a matrix operation, the insights gained from the fully-connected case are useful in solving the convolutional case.
Like for convolution arithmetic, the discussion of transposed convolution arithmetic is simplified by the fact that transposed convolution properties don’t interact across axes.
This section will focus on the following setting:
 2D transposed convolutions (N = 2),
 square inputs (i1 = i2 = i),
 square kernel size (k1 = k2 = k),
 same strides along both axes (s1 = s2 = s),
 same zero padding along both axes (p1 = p2 = p).
Once again, the results outlined generalize to the ND and nonsquare cases.
Take for example the convolution presented in the No zero padding, unit strides subsection:
If the input and output were to be unrolled into vectors from left to right, top to bottom, the convolution could be represented as a sparse matrix C where the non-zero elements are the elements w_{i,j} of the kernel (with i and j being the row and column of the kernel respectively):
(Note: the matrix has been transposed for formatting purposes.) This linear operation takes the input matrix flattened as a 16-dimensional vector and produces a 4-dimensional vector that is later reshaped as the output matrix.
Using this representation, the backward pass is easily obtained by transposing C; in other words, the error is backpropagated by multiplying the loss with C^T. This operation takes a 4-dimensional vector as input and produces a 16-dimensional vector as output, and its connectivity pattern is compatible with C by construction.
Notably, the kernel w defines both the matrices C and C^T used for the forward and backward passes.
Let’s now consider what would be required to go the other way around, i.e., map from a 4-dimensional space to a 16-dimensional space, while keeping the connectivity pattern of the convolution depicted above. This operation is known as a transposed convolution.
Transposed convolutions – also called fractionally strided convolutions – work by swapping the forward and backward passes of a convolution. One way to put it is to note that the kernel defines a convolution, but whether it’s a direct convolution or a transposed convolution is determined by how the forward and backward passes are computed.
For instance, the kernel w defines a convolution whose forward and backward passes are computed by multiplying with C and C^T respectively, but it also defines a transposed convolution whose forward and backward passes are computed by multiplying with C^T and (C^T)^T = C respectively.
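The matrix view can be made concrete with a small NumPy sketch (conv_matrix is a hypothetical helper; the kernel is applied without flipping, as in the rest of this tutorial):

```python
import numpy

def conv_matrix(kernel, i):
    """Unroll a 'valid' 2D convolution of a (k x k) kernel over an (i x i)
    input into a sparse matrix C of shape (o*o, i*i), with o = i - k + 1."""
    k = kernel.shape[0]
    o = i - k + 1
    C = numpy.zeros((o * o, i * i))
    for r in range(o):
        for c in range(o):
            for u in range(k):
                for v in range(k):
                    # output unit (r, c) reads input unit (r + u, c + v)
                    C[r * o + c, (r + u) * i + (c + v)] = kernel[u, v]
    return C

kernel = numpy.arange(9.).reshape(3, 3)
C = conv_matrix(kernel, 4)    # 3x3 kernel over a 4x4 input -> 2x2 output
x = numpy.random.rand(4, 4)
y = C.dot(x.flatten())        # forward pass: 16-d vector -> 4-d vector
g = C.T.dot(y)                # transposed pass: 4-d vector -> 16-d vector
assert y.shape == (4,) and g.shape == (16,)
```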
Note
The transposed convolution operation can be thought of as the gradient of some convolution with respect to its input, which is usually how transposed convolutions are implemented in practice.
Finally note that it is always possible to implement a transposed convolution with a direct convolution. The disadvantage is that it usually involves adding many columns and rows of zeros to the input, resulting in a much less efficient implementation.
Building on what has been introduced so far, this section will proceed somewhat backwards with respect to the convolution arithmetic section, deriving the properties of each transposed convolution by referring to the direct convolution with which it shares the kernel, and defining the equivalent direct convolution.
The simplest way to think about a transposed convolution is by computing the output shape of the direct convolution for a given input shape first, and then inverting the input and output shapes for the transposed convolution.
Let’s consider the convolution of a 3 x 3 kernel on a 4 x 4 input with unitary stride and no padding (i.e., i = 4, k = 3, s = 1 and p = 0). As depicted in the convolution below, this produces a 2 x 2 output:
The transpose of this convolution will then have an output of shape 4 x 4 when applied on a 2 x 2 input.
Another way to obtain the result of a transposed convolution is to apply an equivalent – but much less efficient – direct convolution. The example described so far could be tackled by convolving a 3 x 3 kernel over a 2 x 2 input padded with a 2 x 2 border of zeros using unit strides (i.e., i' = 2, k' = k, s' = 1 and p' = 2), as shown here:
Notably, the kernel’s and stride’s sizes remain the same, but the input of the equivalent (direct) convolution is now zero padded.
Note
Although equivalent to applying the transposed matrix, this visualization adds a lot of zero multiplications in the form of zero padding. This is done here for illustration purposes, but it is inefficient, and software implementations will normally not perform the useless zero multiplications.
One way to understand the logic behind zero padding is to consider the connectivity pattern of the transposed convolution and use it to guide the design of the equivalent convolution. For example, the top left pixel of the input of the direct convolution only contributes to the top left pixel of the output, the top right pixel is only connected to the top right output pixel, and so on.
To maintain the same connectivity pattern in the equivalent convolution it is necessary to zero pad the input in such a way that the first (topleft) application of the kernel only touches the topleft pixel, i.e., the padding has to be equal to the size of the kernel minus one.
Proceeding in the same fashion it is possible to determine similar observations for the other elements of the image, giving rise to the following relationship:
Relationship 7
A convolution described by s = 1, p = 0 and k has an associated transposed convolution described by k' = k, s' = s and p' = k - 1, and its output size is
o' = i' + (k - 1).
In other words,
input = theano.tensor.nnet.abstract_conv.conv2d_grad_wrt_inputs(
    output, filters, filter_shape=(c1, c2, k1, k2), border_mode=(0, 0),
    subsample=(1, 1))
# input.shape[2] == output.shape[2] + (k1 - 1)
# input.shape[3] == output.shape[3] + (k2 - 1)
Interestingly, this corresponds to a fully padded convolution with unit strides.
Knowing that the transpose of a non-padded convolution is equivalent to convolving a zero padded input, it would be reasonable to suppose that the transpose of a zero padded convolution is equivalent to convolving an input padded with fewer zeros.
It is indeed the case, as shown here for i = 5, k = 4 and p = 2:
Formally, the following relationship applies for zero padded convolutions:
Relationship 8
A convolution described by s = 1, k and p has an associated transposed convolution described by k' = k, s' = s and p' = k - p - 1, and its output size is
o' = i' + (k - 1) - 2p.
In other words,
input = theano.tensor.nnet.abstract_conv.conv2d_grad_wrt_inputs(
    output, filters, filter_shape=(c1, c2, k1, k2), border_mode=(p1, p2),
    subsample=(1, 1))
# input.shape[2] == output.shape[2] + (k1 - 1) - 2 * p1
# input.shape[3] == output.shape[3] + (k2 - 1) - 2 * p2
By applying the same inductive reasoning as before, it is reasonable to expect that the equivalent convolution of the transpose of a half padded convolution is itself a half padded convolution, given that the output size of a half padded convolution is the same as its input size. Thus the following relation applies:
Relationship 9
A convolution described by k = 2n + 1 (n a natural number), s = 1 and p = floor(k/2) = n has an associated transposed convolution described by k' = k, s' = s and p' = p, and its output size is
o' = i'.
In other words,
input = theano.tensor.nnet.abstract_conv.conv2d_grad_wrt_inputs(
    output, filters, filter_shape=(c1, c2, k1, k2), border_mode='half',
    subsample=(1, 1))
# input.shape[2] == output.shape[2]
# input.shape[3] == output.shape[3]
Here is an example for i = 5, k = 3 and (therefore) p = 1:
Knowing that the equivalent convolution of the transpose of a nonpadded convolution involves full padding, it is unsurprising that the equivalent of the transpose of a fully padded convolution is a nonpadded convolution:
Relationship 10
A convolution described by s = 1, k and p = k - 1 has an associated transposed convolution described by k' = k, s' = s and p' = 0, and its output size is
o' = i' - (k - 1).
In other words,
input = theano.tensor.nnet.abstract_conv.conv2d_grad_wrt_inputs(
    output, filters, filter_shape=(c1, c2, k1, k2), border_mode='full',
    subsample=(1, 1))
# input.shape[2] == output.shape[2] - (k1 - 1)
# input.shape[3] == output.shape[3] - (k2 - 1)
The accompanying figure shows an example of the transpose of a fully padded convolution.
Using the same kind of inductive logic as for zero padded convolutions, one might expect that the transpose of a convolution with s > 1 involves an equivalent convolution with s < 1. As will be explained, this is a valid intuition, which is why transposed convolutions are sometimes called fractionally strided convolutions.
The accompanying figure shows an example of the transpose of a strided convolution.
This should help understand what fractional strides involve: zeros are inserted between input units, which makes the kernel move around at a slower pace than with unit strides.
Note
Doing so is inefficient and real-world implementations avoid useless multiplications by zero, but conceptually it is how the transpose of a strided convolution can be thought of.
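The zero insertion can be made concrete with a small plain-Python sketch (not Theano code) that stretches a 1-D input the way a fractionally strided convolution does:

```python
def stretch(x, s):
    # Insert s - 1 zeros between consecutive units of a 1-D input,
    # as in the "fractionally strided" view of transposed convolution.
    out = []
    for idx, v in enumerate(x):
        out.append(v)
        if idx != len(x) - 1:
            out.extend([0] * (s - 1))
    return out

print(stretch([1, 2, 3], 2))  # [1, 0, 2, 0, 3]
```

The stretched input has size s * (n - 1) + 1 for an input of n units, which is the quantity written below as the size of the stretched input.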
For the moment, it will be assumed that the convolution is non-padded (p = 0) and that its input size i is such that i - k is a multiple of s. In that case, the following relationship holds:
Relationship 11
A convolution described by k, s and p = 0, whose input size i is such that i - k is a multiple of s, has an associated transposed convolution described by i~', k' = k, s' = 1 and p' = k - 1, where i~' is the size of the stretched input obtained by adding s - 1 zeros between each input unit, and its output size is

o' = s (o - 1) + k
In other words,
input = theano.tensor.nnet.abstract_conv.conv2d_grad_wrt_inputs(
    output, filters, filter_shape=(c1, c2, k1, k2), border_mode=(0, 0),
    subsample=(s1, s2))
# input.shape[2] == s1 * (output.shape[2] - 1) + k1
# input.shape[3] == s2 * (output.shape[3] - 1) + k2
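The stretched-input picture can be verified numerically with plain Python (illustrative helpers, not part of Theano): stretching the o input units and then applying a fully padded, unit-stride convolution reproduces the relationship above.

```python
def conv_out_size(i, k, p, s=1):
    # o = (i + 2*p - k) // s + 1
    return (i + 2 * p - k) // s + 1

# Relationship 11, checked for o = 2, k = 3, s = 2:
o, k, s = 2, 3, 2
i_stretched = s * (o - 1) + 1                     # s - 1 zeros between each of the o units
o_prime = conv_out_size(i_stretched, k, p=k - 1)  # full padding, unit stride
assert o_prime == s * (o - 1) + k                 # matches the relationship
```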
When the convolution's input size i is such that i + 2p - k is a multiple of s, the analysis can be extended to the zero padded case by combining Relationship 8 and Relationship 11:
Relationship 12
A convolution described by k, s and p, whose input size i is such that i + 2p - k is a multiple of s, has an associated transposed convolution described by i~', k' = k, s' = 1 and p' = k - p - 1, where i~' is the size of the stretched input obtained by adding s - 1 zeros between each input unit, and its output size is

o' = s (o - 1) + k - 2p
In other words,
o_prime1 = s1 * (output.shape[2] - 1) + k1 - 2 * p1
o_prime2 = s2 * (output.shape[3] - 1) + k2 - 2 * p2
input = theano.tensor.nnet.abstract_conv.conv2d_grad_wrt_inputs(
    output, filters, input_shape=(b, c1, o_prime1, o_prime2),
    filter_shape=(c1, c2, k1, k2), border_mode=(p1, p2),
    subsample=(s1, s2))
The accompanying figure shows an example combining zero padding and strides.
The constraint on the size of the input can be relaxed by introducing another parameter a in {0, ..., s - 1} that allows to distinguish between the s different convolution input sizes that all lead to the same output size:
Relationship 13
A convolution described by k, s and p has an associated transposed convolution described by a, i~', k' = k, s' = 1 and p' = k - p - 1, where i~' is the size of the stretched input obtained by adding s - 1 zeros between each input unit, and a = (i + 2p - k) mod s represents the number of zeros added to the top and right edges of the input, and its output size is

o' = s (o - 1) + a + k - 2p
In other words,
o_prime1 = s1 * (output.shape[2] - 1) + a1 + k1 - 2 * p1
o_prime2 = s2 * (output.shape[3] - 1) + a2 + k2 - 2 * p2
input = theano.tensor.nnet.abstract_conv.conv2d_grad_wrt_inputs(
    output, filters, input_shape=(b, c1, o_prime1, o_prime2),
    filter_shape=(c1, c2, k1, k2), border_mode=(p1, p2),
    subsample=(s1, s2))
The accompanying figure shows an example in which the extra parameter a is nonzero.
Miscellaneous convolutions
Those familiar with the deep learning literature may have noticed the term “dilated convolutions” (or “atrous convolutions”, from the French expression convolutions à trous) appear in recent papers. Here we attempt to provide an intuitive understanding of dilated convolutions. For a more in-depth description and to understand in what contexts they are applied, see Chen et al. (2014) [2]; Yu and Koltun (2015) [3].
Dilated convolutions “inflate” the kernel by inserting spaces between the kernel elements. The dilation rate is controlled by an additional hyperparameter d. Implementations may vary, but there are usually d - 1 spaces inserted between kernel elements, such that d = 1 corresponds to a regular convolution.
To understand the relationship tying the dilation rate d and the output size o, it is useful to think of the impact of d on the effective kernel size. A kernel of size k dilated by a factor d has an effective size of

k + (k - 1)(d - 1)
This can be combined with Relationship 6 to form the following relationship for dilated convolutions:
Relationship 14
For any i, k, p and s, and for a dilation rate d,

o = floor((i + 2p - k - (k - 1)(d - 1)) / s) + 1
This translates to the following Theano code using the filter_dilation
parameter:
output = theano.tensor.nnet.conv2d(
    input, filters, input_shape=(b, c2, i1, i2), filter_shape=(c1, c2, k1, k2),
    border_mode=(p1, p2), subsample=(s1, s2), filter_dilation=(d1, d2))
# output.shape[2] == (i1 + 2 * p1 - k1 - (k1 - 1) * (d1 - 1)) // s1 + 1
# output.shape[3] == (i2 + 2 * p2 - k2 - (k2 - 1) * (d2 - 1)) // s2 + 1
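The effective-kernel-size view gives a compact plain-Python check of Relationship 14 (the helper name is ours, not a Theano function):

```python
def dilated_conv_out_size(i, k, p, s, d):
    # Effective kernel size: k + (k - 1)*(d - 1)
    k_eff = k + (k - 1) * (d - 1)
    return (i + 2 * p - k_eff) // s + 1

# A 3-tap kernel dilated by d = 2 behaves like a 5-tap kernel:
assert dilated_conv_out_size(7, 3, 0, 1, 2) == 3
assert dilated_conv_out_size(7, 5, 0, 1, 1) == 3  # same as an undilated size-5 kernel
```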
The accompanying figure shows an example of a dilated convolution.
[1] Dumoulin, Vincent, and Visin, Francesco. “A guide to convolution arithmetic for deep learning”. arXiv preprint arXiv:1603.07285 (2016).
[2] Chen, Liang-Chieh, Papandreou, George, Kokkinos, Iasonas, Murphy, Kevin and Yuille, Alan L. “Semantic image segmentation with deep convolutional nets and fully connected CRFs”. arXiv preprint arXiv:1412.7062 (2014).
[3] Yu, Fisher and Koltun, Vladlen. “Multi-scale context aggregation by dilated convolutions”. arXiv preprint arXiv:1511.07122 (2015).
Quick reference
Convolution relationship
A convolution specified by
input size i,
kernel size k,
stride s,
padding size p,
has an output size given by

o = floor((i + 2p - k) / s) + 1
In Theano, this translates to
output = theano.tensor.nnet.conv2d(
    input, filters, input_shape=(b, c2, i1, i2), filter_shape=(c1, c2, k1, k2),
    border_mode=(p1, p2), subsample=(s1, s2))
# output.shape[2] == (i1 + 2 * p1 - k1) // s1 + 1
# output.shape[3] == (i2 + 2 * p2 - k2) // s2 + 1
Transposed convolution relationship
A transposed convolution specified by
input size i',
kernel size k,
stride s,
padding size p,
has an output size given by

o' = s (i' - 1) + a + k - 2p

where a in {0, ..., s - 1} is a user-specified quantity used to distinguish between the s different possible output sizes.
Unless s = 1, Theano requires that a is implicitly passed via an input_shape argument. For instance, if i' = 3, k = 4, s = 2, p = 0 and a = 1, then o' = 2 * (3 - 1) + 1 + 4 = 9 and the Theano code would look like
input = theano.tensor.nnet.abstract_conv.conv2d_grad_wrt_inputs(
    output, filters, input_shape=(9, 9), filter_shape=(c1, c2, 4, 4),
    border_mode='valid', subsample=(2, 2))
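The output-size formula can be checked in plain Python (an illustrative helper, not a Theano function). With k = 4, s = 2, p = 0 as in the code above, an input of size 3 with a = 1 yields the output size 9 passed via input_shape:

```python
def transposed_conv_out_size(i, k, s, p, a=0):
    # o' = s * (i' - 1) + a + k - 2*p, with 0 <= a < s distinguishing
    # the s possible output sizes for the same input size i'.
    return s * (i - 1) + a + k - 2 * p

assert transposed_conv_out_size(3, 4, 2, 0, a=0) == 8
assert transposed_conv_out_size(3, 4, 2, 0, a=1) == 9  # matches input_shape=(9, 9)
```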
Advanced configuration and debugging
Configuration Settings and Compiling Modes
Configuration
The config module contains several attributes that modify Theano's behavior. Many of these attributes are examined during the import of the theano module and several are assumed to be read-only. As a rule, the attributes in the config module should not be modified inside user code.
Theano's code comes with default values for these attributes, but you can override them from your .theanorc file, and override those values in turn through the THEANO_FLAGS environment variable.
The order of precedence is:
1. an assignment to theano.config.<property>
2. an assignment in THEANO_FLAGS
3. an assignment in the .theanorc file (or the file indicated in THEANORC)
You can display the current/effective configuration at any time by printing theano.config. For example, to see a list of all active configuration variables, type this from the command line:
python -c 'import theano; print(theano.config)' | less
For more detail, see Configuration in the library.
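For example, settings can be supplied for a single run through THEANO_FLAGS, a comma-separated list of flag=value pairs (the script name below is a placeholder):

```
# Run one script with float32 and the fast_compile optimizer;
# nothing is changed for other runs or other users.
THEANO_FLAGS='floatX=float32,optimizer=fast_compile' python my_script.py
```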
Exercise
Consider the logistic regression:
import numpy
import theano
import theano.tensor as T

rng = numpy.random

N = 400
feats = 784
D = (rng.randn(N, feats).astype(theano.config.floatX),
     rng.randint(size=N, low=0, high=2).astype(theano.config.floatX))
training_steps = 10000

# Declare Theano symbolic variables
x = T.matrix("x")
y = T.vector("y")
w = theano.shared(rng.randn(feats).astype(theano.config.floatX), name="w")
b = theano.shared(numpy.asarray(0., dtype=theano.config.floatX), name="b")
x.tag.test_value = D[0]
y.tag.test_value = D[1]

# Construct Theano expression graph
p_1 = 1 / (1 + T.exp(-T.dot(x, w) - b))  # Probability of having a one
prediction = p_1 > 0.5  # The prediction that is done: 0 or 1
xent = -y * T.log(p_1) - (1 - y) * T.log(1 - p_1)  # Cross-entropy
cost = xent.mean() + 0.01 * (w ** 2).sum()  # The cost to optimize
gw, gb = T.grad(cost, [w, b])

# Compile expressions to functions
train = theano.function(
    inputs=[x, y],
    outputs=[prediction, xent],
    updates=[(w, w - 0.01 * gw), (b, b - 0.01 * gb)],
    name="train")
predict = theano.function(inputs=[x], outputs=prediction,
                          name="predict")

if any([x.op.__class__.__name__ in ['Gemv', 'CGemv', 'Gemm', 'CGemm'] for x in
        train.maker.fgraph.toposort()]):
    print('Used the cpu')
elif any([x.op.__class__.__name__ in ['GpuGemm', 'GpuGemv'] for x in
          train.maker.fgraph.toposort()]):
    print('Used the gpu')
else:
    print('ERROR, not able to tell if theano used the cpu or the gpu')
    print(train.maker.fgraph.toposort())

for i in range(training_steps):
    pred, err = train(D[0], D[1])

print("target values for D")
print(D[1])
print("prediction on D")
print(predict(D[0]))
Modify and execute this example to run on CPU (the default) with floatX=float32, and time the execution using the command line time python file.py. Save your code, as it will be useful later on.
Note
Apply the Theano flag floatX=float32 (through theano.config.floatX) in your code.
Cast inputs before storing them into a shared variable.
Circumvent the automatic cast of int32 with float32 to float64:
  Insert manual casts in your code or use [u]int{8,16}.
  Insert a manual cast around the mean operator (this involves a division by the length, which is an int64).
  Note that a new casting mechanism is being developed.
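The upcast that the note warns about can be reproduced in plain NumPy, whose promotion rules are similar to Theano's:

```python
import numpy as np

a = np.ones(3, dtype=np.float32)
b = np.arange(3, dtype=np.int64)

# Mixing float32 with int64 silently upcasts the result to float64...
assert (a + b).dtype == np.float64

# ...while an explicit cast keeps the computation in float32.
assert (a + b.astype(np.float32)).dtype == np.float32
```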
Mode
Every time theano.function
is called,
the symbolic relationships between the input and output Theano variables
are optimized and compiled. The way this compilation occurs
is controlled by the value of the mode
parameter.
Theano defines the following modes by name:
'FAST_COMPILE': Apply just a few graph optimizations and only use Python implementations. The GPU is disabled.
'FAST_RUN': Apply all optimizations and use C implementations where possible.
'DebugMode': Verify the correctness of all optimizations, and compare C and Python implementations. This mode can take much longer than the other modes, but can identify several kinds of problems.
'NanGuardMode': Same optimizations as FAST_RUN, but check whether a node generates NaNs.
The default mode is typically FAST_RUN, but it can be controlled via the configuration variable config.mode, which can be overridden by passing the mode keyword argument to theano.function.
short name    Full constructor                                          What does it do?
FAST_COMPILE  compile.mode.Mode(linker='py', optimizer='fast_compile')  Python implementations only, quick and cheap graph transformations
FAST_RUN      compile.mode.Mode(linker='cvm', optimizer='fast_run')     C implementations where available, all available graph transformations
DebugMode     compile.debugmode.DebugMode()                             Both implementations where available, all available graph transformations
Note
For debugging purpose, there also exists a MonitorMode
(which has no
short name). It can be used to step through the execution of a function:
see the debugging FAQ for details.
Linkers
A mode is composed of 2 things: an optimizer and a linker. Some modes,
like NanGuardMode
and DebugMode
, add logic around the
optimizer and linker. DebugMode
uses its own linker.
You can select which linker to use with the Theano flag config.linker
.
Here is a table to compare the different linkers.
linker        gc [1]  Raise error by op  Overhead   Definition
cvm           yes     yes                "++"       As c|py, but the runtime algo to execute the code is in C
cvm_nogc      no      yes                "+"        As cvm, but without gc
c|py [2]      yes     yes                "+++"      Try C code. If none exists for an op, use Python
c|py_nogc     no      yes                "++"       As c|py, but without gc
c             no      yes                "+"        Use only C code (if none available for an op, raise an error)
py            yes     yes                "+++"      Use only Python code
NanGuardMode  yes     yes                "++++"     Check if nodes generate NaN
DebugMode     no      yes                VERY HIGH  Make many checks on what Theano computes
[1] Garbage collection of intermediate results during computation. Otherwise, the memory space used by the ops is kept between Theano function calls, in order not to reallocate memory and to lower the overhead (make it faster...).
[2] Default.
For more detail, see Mode in the library.
Optimizers
Theano allows compilation with a number of predefined optimizers. An optimizer consists of a particular set of optimizations that speed up the execution of Theano programs.
The optimizers Theano provides are summarized below to indicate the trade-offs one might make between compilation time and execution time.
These optimizers can be enabled globally with the Theano flag optimizer=name, or per call to Theano functions with theano.function(..., mode=theano.Mode(optimizer="name")).
optimizer          Compile time  Execution time  Description
None               "++++++"      "+"             Applies none of Theano's opts
o1 (fast_compile)  "+++++"       "++"            Applies only basic opts
o2                 "++++"        "+++"           Applies few basic opts and some that compile fast
o3                 "+++"         "++++"          Applies all opts except ones that compile slower
o4 (fast_run)      "++"          "+++++"         Applies all opts
unsafe             "+"           "++++++"        Applies all opts, and removes safety checks
stabilize          "+++++"       "++"            Only applies stability opts
For a detailed list of the specific optimizations applied for each of these optimizers, see Optimizations. Also, see Unsafe optimization.
Using DebugMode
While normally you should use the FAST_RUN or FAST_COMPILE modes, it is useful at first (especially when you are defining new kinds of expressions or new optimizations) to run your code using DebugMode (available via mode='DebugMode'). DebugMode is designed to run several self-checks and assertions that can help diagnose possible programming errors leading to incorrect output. Note that DebugMode is much slower than FAST_RUN or FAST_COMPILE, so use it only during development (not when you launch 1000 processes on a cluster!).
DebugMode is used as follows:
x = T.dvector('x')
f = theano.function([x], 10 * x, mode='DebugMode')
f([5])
f([0])
f([7])
If any problem is detected, DebugMode will raise an exception according to what went wrong, either at call time (f([5])) or compile time (f = theano.function([x], 10 * x, mode='DebugMode')). These exceptions should not be ignored; talk to your local Theano guru or email the users list if you cannot make the exception go away.
Some kinds of errors can only be detected for certain input value combinations. In the example above, there is no way to guarantee that a future call to, say, f([-1]), won't cause a problem. DebugMode is not a silver bullet.
If you instantiate DebugMode using the constructor (see DebugMode) rather than the keyword DebugMode, you can configure its behaviour via constructor arguments. The keyword version of DebugMode (which you get by using mode='DebugMode') is quite strict.
For more detail, see DebugMode in the library.
Printing/Drawing Theano graphs
Theano provides the functions theano.printing.pprint()
and
theano.printing.debugprint()
to print a graph to the terminal before or
after compilation. pprint()
is more compact and math-like,
debugprint()
is more verbose. Theano also provides pydotprint()
that creates an image of the function. You can read about them in
printing – Graph Printing and Symbolic Print Statement.
Note
When printing Theano functions, they can sometimes be hard to
read. To help with this, you can disable some Theano optimizations
by using the Theano flag:
optimizer_excluding=fusion:inplace
. Do not use this during
real job execution, as this will make the graph slower and use more
memory.
Consider again the logistic regression example:
>>> import numpy
>>> import theano
>>> import theano.tensor as T
>>> rng = numpy.random
>>> # Training data
>>> N = 400
>>> feats = 784
>>> D = (rng.randn(N, feats).astype(theano.config.floatX), rng.randint(size=N,low=0, high=2).astype(theano.config.floatX))
>>> training_steps = 10000
>>> # Declare Theano symbolic variables
>>> x = T.matrix("x")
>>> y = T.vector("y")
>>> w = theano.shared(rng.randn(feats).astype(theano.config.floatX), name="w")
>>> b = theano.shared(numpy.asarray(0., dtype=theano.config.floatX), name="b")
>>> x.tag.test_value = D[0]
>>> y.tag.test_value = D[1]
>>> # Construct Theano expression graph
>>> p_1 = 1 / (1 + T.exp(-T.dot(x, w) - b)) # Probability of having a one
>>> prediction = p_1 > 0.5 # The prediction that is done: 0 or 1
>>> # Compute gradients
>>> xent = -y * T.log(p_1) - (1 - y) * T.log(1 - p_1) # Cross-entropy
>>> cost = xent.mean() + 0.01*(w**2).sum() # The cost to optimize
>>> gw,gb = T.grad(cost, [w,b])
>>> # Training and prediction function
>>> train = theano.function(inputs=[x, y], outputs=[prediction, xent], updates=[[w, w - 0.01 * gw], [b, b - 0.01 * gb]], name="train")
>>> predict = theano.function(inputs=[x], outputs=prediction, name = "predict")
Pretty Printing
>>> theano.printing.pprint(prediction)
'gt((TensorConstant{1} / (TensorConstant{1} + exp(((-(x \\dot w)) - b)))),
TensorConstant{0.5})'
Debug Print
The pre-compilation graph:
>>> theano.printing.debugprint(prediction)
Elemwise{gt,no_inplace} [id A] ''
Elemwise{true_div,no_inplace} [id B] ''
 InplaceDimShuffle{x} [id C] ''
  TensorConstant{1} [id D]
 Elemwise{add,no_inplace} [id E] ''
 InplaceDimShuffle{x} [id F] ''
  TensorConstant{1} [id D]
 Elemwise{exp,no_inplace} [id G] ''
 Elemwise{sub,no_inplace} [id H] ''
 Elemwise{neg,no_inplace} [id I] ''
  dot [id J] ''
  x [id K]
  w [id L]
 InplaceDimShuffle{x} [id M] ''
 b [id N]
InplaceDimShuffle{x} [id O] ''
TensorConstant{0.5} [id P]
The post-compilation graph:
>>> theano.printing.debugprint(predict)
Elemwise{Composite{GT(scalar_sigmoid((((i0)  i1))), i2)}} [id A] '' 4
...Gemv{inplace} [id B] '' 3
 AllocEmpty{dtype='float64'} [id C] '' 2
  Shape_i{0} [id D] '' 1
  x [id E]
 TensorConstant{1.0} [id F]
 x [id E]
 w [id G]
 TensorConstant{0.0} [id H]
InplaceDimShuffle{x} [id I] '' 0
 b [id J]
TensorConstant{(1,) of 0.5} [id K]
Picture Printing of Graphs
The pre-compilation graph:
>>> theano.printing.pydotprint(prediction, outfile="pics/logreg_pydotprint_prediction.png", var_with_name_simple=True)
The output file is available at pics/logreg_pydotprint_prediction.png
The post-compilation graph:
>>> theano.printing.pydotprint(predict, outfile="pics/logreg_pydotprint_predict.png", var_with_name_simple=True)
The output file is available at pics/logreg_pydotprint_predict.png
The optimized training graph:
>>> theano.printing.pydotprint(train, outfile="pics/logreg_pydotprint_train.png", var_with_name_simple=True)
The output file is available at pics/logreg_pydotprint_train.png
Interactive Graph Visualization
The new d3viz
module complements theano.printing.pydotprint()
to
visualize complex graph structures. Instead of creating a static image, it generates an HTML file, which allows you to dynamically inspect graph structures in a web browser. Features include zooming, drag-and-drop, editing node labels, and coloring nodes by their compute time.
See the d3viz documentation for more details.
Debugging Theano: FAQ and Troubleshooting
There are many kinds of bugs that might come up in a computer program. This page is structured as a FAQ. It provides recipes to tackle common problems, and introduces some of the tools that we use to find problems in our own Theano code, and even (it happens) in Theano’s internals, in Using DebugMode.
Isolating the Problem/Testing the Theano Compiler
You can run your Theano function in DebugMode. This tests the Theano optimizations and helps to find where NaN, inf and other problems come from.
Interpreting Error Messages
Even in its default configuration, Theano tries to display useful error messages. Consider the following faulty code.
import numpy as np
import theano
import theano.tensor as T
x = T.vector()
y = T.vector()
z = x + x
z = z + y
f = theano.function([x, y], z)
f(np.ones((2,)), np.ones((3,)))
Running the code above we see:
Traceback (most recent call last):
...
ValueError: Input dimension mismatch. (input[0].shape[0] = 3, input[1].shape[0] = 2)
Apply node that caused the error: Elemwise{add,no_inplace}(<TensorType(float64, vector)>, <TensorType(float64, vector)>, <TensorType(float64, vector)>)
Inputs types: [TensorType(float64, vector), TensorType(float64, vector), TensorType(float64, vector)]
Inputs shapes: [(3,), (2,), (2,)]
Inputs strides: [(8,), (8,), (8,)]
Inputs scalar values: ['not scalar', 'not scalar', 'not scalar']
HINT: Rerunning with most Theano optimization disabled could give you a backtraces when this node was created. This can be done with by setting the Theano flags 'optimizer=fast_compile'. If that does not work, Theano optimization can be disabled with 'optimizer=None'.
HINT: Use the Theano flag 'exception_verbosity=high' for a debugprint of this apply node.
Arguably the most useful information is approximately halfway through the error message, where the kind of error is displayed along with its cause (ValueError: Input dimension mismatch. (input[0].shape[0] = 3, input[1].shape[0] = 2)). Below it, some other information is given, such as the apply node that caused the error, as well as the input types, shapes, strides and scalar values.
The two hints can also be helpful when debugging. Using the theano flag
optimizer=fast_compile
or optimizer=None
can often tell you
the faulty line, while exception_verbosity=high
will display a
debugprint of the apply node. Using these hints, the end of the error
message becomes:
Backtrace when the node is created:
File "test0.py", line 8, in <module>
z = z + y
Debugprint of the apply node:
Elemwise{add,no_inplace} [id A] <TensorType(float64, vector)> ''
Elemwise{add,no_inplace} [id B] <TensorType(float64, vector)> ''
 <TensorType(float64, vector)> [id C] <TensorType(float64, vector)>
 <TensorType(float64, vector)> [id C] <TensorType(float64, vector)>
<TensorType(float64, vector)> [id D] <TensorType(float64, vector)>
We can here see that the error can be traced back to the line z = z + y
.
For this example, using optimizer=fast_compile
worked. If it did not,
you could set optimizer=None
or use test values.
Using Test Values
As of v.0.4.0, Theano has a new mechanism by which graphs are executed
on-the-fly, before a theano.function
is ever compiled. Since optimizations
haven’t been applied at this stage, it is easier for the user to locate the
source of some bug. This functionality is enabled through the config flag
theano.config.compute_test_value
. Its use is best shown through the
following example. Here, we use exception_verbosity=high
and
optimizer=fast_compile
, which would not tell you the line at fault.
optimizer=None
would and it could therefore be used instead of test values.
import numpy
import theano
import theano.tensor as T
# compute_test_value is 'off' by default, meaning this feature is inactive
theano.config.compute_test_value = 'off' # Use 'warn' to activate this feature
# configure shared variables
W1val = numpy.random.rand(2, 10, 10).astype(theano.config.floatX)
W1 = theano.shared(W1val, 'W1')
W2val = numpy.random.rand(15, 20).astype(theano.config.floatX)
W2 = theano.shared(W2val, 'W2')
# input which will be of shape (5,10)
x = T.matrix('x')
# provide Theano with a default testvalue
#x.tag.test_value = numpy.random.rand(5, 10)
# transform the shared variable in some way. Theano does not
# know off hand that the matrix func_of_W1 has shape (20, 10)
func_of_W1 = W1.dimshuffle(2, 0, 1).flatten(2).T
# source of error: dot product of 5x10 with 20x10
h1 = T.dot(x, func_of_W1)
# do more stuff
h2 = T.dot(h1, W2.T)
# compile and call the actual function
f = theano.function([x], h2)
f(numpy.random.rand(5, 10))
Running the above code generates the following error message:
Traceback (most recent call last):
File "test1.py", line 31, in <module>
f(numpy.random.rand(5, 10))
File "PATH_TO_THEANO/theano/compile/function_module.py", line 605, in __call__
self.fn.thunks[self.fn.position_of_error])
File "PATH_TO_THEANO/theano/compile/function_module.py", line 595, in __call__
outputs = self.fn()
ValueError: Shape mismatch: x has 10 cols (and 5 rows) but y has 20 rows (and 10 cols)
Apply node that caused the error: Dot22(x, DimShuffle{1,0}.0)
Inputs types: [TensorType(float64, matrix), TensorType(float64, matrix)]
Inputs shapes: [(5, 10), (20, 10)]
Inputs strides: [(80, 8), (8, 160)]
Inputs scalar values: ['not scalar', 'not scalar']
Debugprint of the apply node:
Dot22 [id A] <TensorType(float64, matrix)> ''
x [id B] <TensorType(float64, matrix)>
DimShuffle{1,0} [id C] <TensorType(float64, matrix)> ''
Flatten{2} [id D] <TensorType(float64, matrix)> ''
DimShuffle{2,0,1} [id E] <TensorType(float64, 3D)> ''
W1 [id F] <TensorType(float64, 3D)>
HINT: Rerunning with most Theano optimization disabled could give you a backtraces when this node was created. This can be done with by setting the Theano flags 'optimizer=fast_compile'. If that does not work, Theano optimization can be disabled with 'optimizer=None'.
If the above is not informative enough, by instrumenting the code ever so slightly, we can get Theano to reveal the exact source of the error.
# enable on-the-fly graph computations
theano.config.compute_test_value = 'warn'
...
# input which will be of shape (5, 10)
x = T.matrix('x')
# provide Theano with a default testvalue
x.tag.test_value = numpy.random.rand(5, 10)
In the above, we are tagging the symbolic matrix x with a special test
value. This allows Theano to evaluate symbolic expressions on-the-fly (by
calling the perform
method of each op), as they are being defined. Sources
of error can thus be identified with much more precision and much earlier in
the compilation pipeline. For example, running the above code yields the
following error message, which properly identifies line 24 as the culprit.
Traceback (most recent call last):
File "test2.py", line 24, in <module>
h1 = T.dot(x, func_of_W1)
File "PATH_TO_THEANO/theano/tensor/basic.py", line 4734, in dot
return _dot(a, b)
File "PATH_TO_THEANO/theano/gof/op.py", line 545, in __call__
required = thunk()
File "PATH_TO_THEANO/theano/gof/op.py", line 752, in rval
r = p(n, [x[0] for x in i], o)
File "PATH_TO_THEANO/theano/tensor/basic.py", line 4554, in perform
z[0] = numpy.asarray(numpy.dot(x, y))
ValueError: matrices are not aligned
The compute_test_value
mechanism works as follows:
Theano constants and shared variables are used as is; no need to instrument them.
A Theano variable (e.g. dmatrix, vector, etc.) should be given a special test value through the attribute tag.test_value.
Theano automatically instruments intermediate results. As such, any quantity derived from x will be given a tag.test_value automatically.
compute_test_value can take the following values:
off: Default behavior. This debugging mechanism is inactive.
raise: Compute test values on the fly. Any variable for which a test value is required, but not provided by the user, is treated as an error; an exception is raised accordingly.
warn: Idem, but a warning is issued instead of an exception.
ignore: Silently ignore the computation of intermediate test values, if a variable is missing a test value.
Note
This feature is currently incompatible with Scan
and also with ops
which do not implement a perform
method.
It is also possible to override a variable's __repr__ method to have it return tag.test_value:
x = T.scalar('x')
# Assigning test value
x.tag.test_value = 42
# Enable test value printing
theano.config.print_test_value = True
print(x.__repr__())
# Disable test value printing
theano.config.print_test_value = False
print(x.__repr__())
Running the code above returns the following output:
x
array(42.0)
x
“How do I Print an Intermediate Value in a Function?”
Theano provides a ‘Print’ op to do this.
import numpy
import theano
x = theano.tensor.dvector('x')
x_printed = theano.printing.Print('this is a very important value')(x)
f = theano.function([x], x * 5)
f_with_print = theano.function([x], x_printed * 5)
#this runs the graph without any printing
assert numpy.all( f([1, 2, 3]) == [5, 10, 15])
#this runs the graph with the message, and value printed
assert numpy.all( f_with_print([1, 2, 3]) == [5, 10, 15])
this is a very important value __str__ = [ 1. 2. 3.]
Since Theano runs your program in a topological order, you won’t have precise
control over the order in which multiple Print()
ops are evaluated. For a more
precise inspection of what’s being computed where, when, and how, see the discussion
“How do I Step through a Compiled Function?”.
Warning
Using the Print Theano Op can prevent some Theano optimizations from being applied. This can also happen with stability optimizations. So if you use Print and encounter NaNs, try removing the Print ops to see whether they are the cause.
“How do I Print a Graph?” (before or after compilation)
Theano provides two functions (theano.pp()
and
theano.printing.debugprint()
) to print a graph to the terminal before or after
compilation. These two functions print expression graphs in different ways:
pp()
is more compact and mathlike, debugprint()
is more verbose.
Theano also provides theano.printing.pydotprint()
that creates a png image of the function.
You can read about them in printing – Graph Printing and Symbolic Print Statement.
“The Function I Compiled is Too Slow, what’s up?”
First, make sure you’re running in FAST_RUN
mode. Even though
FAST_RUN
is the default mode, insist by passing mode='FAST_RUN'
to theano.function
(or theano.make
) or by setting config.mode
to FAST_RUN
.
Second, try the Theano profiling. This will tell you which
Apply
nodes, and which ops are eating up your CPU cycles.
Tips:
Use the flag floatX=float32 to request type float32 instead of float64, and use the Theano constructors matrix(), vector(), ... instead of dmatrix(), dvector(), ..., since the former use the configured floatX type (float32 here) while the latter always use float64.
Check in the profile output that there is no Dot op in the post-compilation graph while you are multiplying two matrices of the same type. Dot should be optimized to dot22 when the inputs are matrices of the same type. This can still happen when using floatX=float32 if one of the inputs of the graph is of type float64.
“Why does my GPU function seem to be slow?”
If, when you compile a Theano function, you do not get the speedup you expect over the CPU performance of the same code, it is often because some Ops are running on the CPU instead of the GPU. In that case, you can use assert_no_cpu_op to check whether there is a CPU Op in your computational graph. assert_no_cpu_op can take one of the following three options:
warn: Raise a warning
pdb: Stop with a pdb in the computational graph during the compilation
raise: Raise an error, if there is a CPU Op in the computational graph.
It is possible to use this mode by providing the flag in THEANO_FLAGS, such as:
THEANO_FLAGS="floatX=float32,device=gpu,assert_no_cpu_op='raise'" python test.py
But note that this check will not catch all the CPU Ops; it might miss some.
“How do I Step through a Compiled Function?”
You can use MonitorMode
to inspect the inputs and outputs of each
node being executed when the function is called. The code snippet below
shows how to print all inputs and outputs:
from __future__ import print_function
import theano

def inspect_inputs(i, node, fn):
    print(i, node, "input(s) value(s):", [input[0] for input in fn.inputs],
          end='')

def inspect_outputs(i, node, fn):
    print(" output(s) value(s):", [output[0] for output in fn.outputs])

x = theano.tensor.dscalar('x')
f = theano.function([x], [5 * x],
                    mode=theano.compile.MonitorMode(
                        pre_func=inspect_inputs,
                        post_func=inspect_outputs))
f(3)
0 Elemwise{mul,no_inplace}(TensorConstant{5.0}, x) input(s) value(s): [array(5.0), array(3.0)] output(s) value(s): [array(15.0)]
When using these inspect_inputs and inspect_outputs functions with MonitorMode, you should see (potentially a lot of) printed output. Every Apply node will be printed out, along with its position in the graph, the arguments to the function's perform or c_code, and the output it computed. Admittedly, this may be a huge amount of output to read through if you are using big tensors... but you can add logic that would, for instance, print something out only if a certain kind of op were used, at a certain program position, or only if a particular value showed up in one of the inputs or outputs.
A typical example is to detect when NaN values are added into computations, which
can be achieved as follows:
import numpy
import theano


# This is the current suggested detect_nan implementation to
# show you how it works. That way, you can modify it for your
# needs. If you want exactly this method, you can use
# ``theano.compile.monitormode.detect_nan``, which will always
# contain the current suggested version.
def detect_nan(i, node, fn):
    for output in fn.outputs:
        if (not isinstance(output[0], numpy.random.RandomState) and
                numpy.isnan(output[0]).any()):
            print('*** NaN detected ***')
            theano.printing.debugprint(node)
            print('Inputs : %s' % [input[0] for input in fn.inputs])
            print('Outputs: %s' % [output[0] for output in fn.outputs])
            break


x = theano.tensor.dscalar('x')
f = theano.function([x], [theano.tensor.log(x) * x],
                    mode=theano.compile.MonitorMode(
                        post_func=detect_nan))
f(0)  # log(0) * 0 = -inf * 0 = NaN
*** NaN detected ***
Elemwise{Composite{(log(i0) * i0)}} [id A] ''
x [id B]
Inputs : [array(0.0)]
Outputs: [array(nan)]
To help understand what is happening in your graph, you can
disable the local_elemwise_fusion
and all inplace
optimizations. The first is a speed optimization that merges elemwise
operations together. This makes it harder to know which particular
elemwise causes the problem. The second optimization makes some ops’
outputs overwrite their inputs. So, if an op creates a bad output, you
will not be able to see the input that was overwritten in the post_func
function. To disable those optimizations (with a Theano version after
0.6rc3), define the MonitorMode like this:
mode = theano.compile.MonitorMode(post_func=detect_nan).excluding(
    'local_elemwise_fusion', 'inplace')
f = theano.function([x], [theano.tensor.log(x) * x],
                    mode=mode)
Note
The Theano flags optimizer_including, optimizer_excluding and optimizer_requiring aren’t used by MonitorMode; they are used only by the default mode. You can’t use the default mode with MonitorMode, as you need to define what you monitor.
To be sure all inputs of the node are available during the call to post_func, you must also disable the garbage collector. Otherwise, the execution of the node can garbage collect inputs that aren’t needed anymore by the Theano function. This can be done with the Theano flag allow_gc=False.
How to Use pdb
In the majority of cases, you won’t be executing from the interactive shell but from a set of Python scripts. In such cases, the use of the Python debugger can come in handy, especially as your models become more complex. Intermediate results don’t necessarily have a clear name and you can get exceptions which are hard to decipher, due to the “compiled” nature of the functions.
Consider this example script (“ex.py”):
import theano
import numpy
import theano.tensor as T
a = T.dmatrix('a')
b = T.dmatrix('b')
f = theano.function([a, b], [a * b])
# matrices chosen so dimensions are unsuitable for multiplication
mat1 = numpy.arange(12).reshape((3, 4))
mat2 = numpy.arange(25).reshape((5, 5))
f(mat1, mat2)
This example is actually simple enough to debug by inspection, but it serves illustrative purposes. As the matrices can’t be multiplied elementwise (unsuitable shapes), we get the following exception:
File "ex.py", line 14, in <module>
f(mat1, mat2)
File "/u/username/Theano/theano/compile/function_module.py", line 451, in __call__
File "/u/username/Theano/theano/gof/link.py", line 271, in streamline_default_f
File "/u/username/Theano/theano/gof/link.py", line 267, in streamline_default_f
File "/u/username/Theano/theano/gof/cc.py", line 1049, in execute ValueError: ('Input dimension mismatch. (input[0].shape[0] = 3, input[1].shape[0] = 5)', Elemwise{mul,no_inplace}(a, b), Elemwise{mul,no_inplace}(a, b))
The call stack contains some useful information to trace back the source of the error. There’s the script where the compiled function was called – but if you’re using (improperly parameterized) prebuilt modules, the error might originate from ops in these modules, not this script. The last line tells us about the op that caused the exception. In this case it’s a “mul” involving variables with names “a” and “b”. But suppose we instead had an intermediate result to which we hadn’t given a name.
After learning a few things about the graph structure in Theano, we can use the Python debugger to explore the graph, and then we can get runtime information about the error. Matrix dimensions, especially, are useful to pinpoint the source of the error. In the printout, there are also 2 of the 4 dimensions of the matrices involved, but for the sake of example say we’d need the other dimensions to pinpoint the error. First, we relaunch with the debugger module and run the program with “c”:
python -m pdb ex.py
> /u/username/experiments/doctmp1/ex.py(1)<module>()
> import theano
(Pdb) c
Then we get back the above error printout, but the interpreter breaks in that state. Useful commands here are
 “up” and “down” (to move up and down the call stack),
 “l” (to print code around the line in the current stack position),
 “p variable_name” (to print the string representation of ‘variable_name’),
 “p dir(object_name)”, using the Python dir() function to print the list of an object’s members
Here, for example, I do “up”, and a simple “l” shows me there’s a local variable “node”. This is the “node” from the computation graph, so by following the “node.inputs”, “node.owner” and “node.outputs” links I can explore around the graph.
That graph is purely symbolic (no data, just symbols to manipulate it abstractly). To get information about the actual parameters, you explore the “thunk” objects, which bind the storage for the inputs (and outputs) with the function itself (a “thunk” is a concept related to closures). Here, to get the current node’s first input’s shape, you’d therefore do “p thunk.inputs[0][0].shape”, which prints out “(3, 4)”.
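The storage-binding idea behind thunks can be sketched in plain Python. This is a toy illustration only; the cell-of-cells layout mimics Theano's storage convention but is not its actual API:

```python
# A "thunk" is a zero-argument callable bound to storage cells for its
# inputs and outputs. Each value lives in a one-element list (a cell),
# so the thunk and its caller share the same container.
input_storage = [[None]]            # one cell per input
output_storage = [[None]]           # one cell per output

def thunk():
    # Reads its input and writes its output through the shared cells.
    output_storage[0][0] = input_storage[0][0] * 2

input_storage[0][0] = 21            # fill the input cell
thunk()                             # run the bound computation
print(output_storage[0][0])         # 42
```

This is why, in the debugger, inspecting `thunk.inputs[0][0]` gives you the actual data currently bound to the node's first input.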
Dumping a Function to help debug
If you are reading this, there is a high chance that you emailed our mailing list and we asked you to read this section. This section explains how to dump all the parameters passed to theano.function(). This is useful to help us reproduce a problem during compilation, and it doesn’t require you to make a self-contained example.
For this to work, we need to be able to import the code for all Ops in the graph. So if you created your own Op, we will need its code; otherwise, we won’t be able to unpickle it. We already have all the Ops from Theano and Pylearn2.
# Replace this line:
theano.function(...)
# with
theano.function_dump(filename, ...)
# Where filename is a string to a file that we will write to.
Then send us filename.
Breakpoint during Theano function execution
You can set a breakpoint during the execution of a Theano function with PdbBreakpoint. PdbBreakpoint automatically detects available debuggers and uses the first one available, in the following order: pudb, ipdb, or pdb.
Dealing with NaNs
Having a model yield NaNs or Infs is quite common if some components of your model are not set properly. NaNs are hard to deal with: sometimes they are caused by a bug or error in the code, sometimes by the numerical stability of your computational environment (library versions, etc.), and sometimes they relate to your algorithm. Here we try to outline common issues which cause a model to yield NaNs, as well as provide some nails and hammers to diagnose them.
Check Hyperparameters and Weight Initialization
Most frequently, the cause is that some of the hyperparameters, especially learning rates, are set incorrectly. A high learning rate can blow up your whole model into NaN outputs even within one epoch of training. So the first and easiest solution is to try lowering it. Keep halving your learning rate until you start to get reasonable output values.
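The halving strategy can be sketched with plain numpy on a toy quadratic loss. The helper name and the blow-up threshold here are illustrative, not part of Theano:

```python
import numpy as np

def trains_stably(lr, steps=50):
    """Run a few gradient steps on the toy loss f(w) = w**2 and
    report whether the weights stay small and finite."""
    w = np.float64(1.0)
    for _ in range(steps):
        w = w - lr * 2.0 * w          # gradient of w**2 is 2*w
        if not np.isfinite(w) or abs(w) > 1e6:
            return False              # diverged: learning rate too high
    return True

lr = 10.0
while not trains_stably(lr):
    lr /= 2.0                         # keep halving until training is stable
print("first stable learning rate:", lr)   # 0.625 for this toy loss
```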
Other hyperparameters may also play a role. For example, do your training algorithms involve regularization terms? If so, are their corresponding penalties set reasonably? Search a wider hyperparameter space with a few (one or two) training epochs each to see if the NaNs disappear.
Some models can be very sensitive to the initialization of weight vectors. If those weights are not initialized in a proper range, then it is not surprising that the model ends up yielding NaNs.
Run in NanGuardMode, DebugMode, or MonitorMode
If adjusting hyperparameters doesn’t work for you, you can still get help from Theano’s NanGuardMode. Change the mode of your Theano function to NanGuardMode and run it again. NanGuardMode will monitor all input/output variables in each node and raise an error if NaNs are detected. For how to use NanGuardMode, please refer to nanguardmode. Using optimizer_including=alloc_empty_to_zeros with NanGuardMode can be helpful to detect NaNs; for more information please refer to NaN Introduced by AllocEmpty.
DebugMode can also help. Run your code in DebugMode with the flag mode=DebugMode,DebugMode.check_py=False. This will give you a clue about which op is causing the problem, and then you can inspect that op in more detail. For details on using DebugMode, please refer to debugmode.
Theano’s MonitorMode provides another helping hand. It can be used to step through the execution of a function. You can inspect the inputs and outputs of each node being executed when the function is called. For how to use that, please check “How do I Step through a Compiled Function?”.
Numerical Stability
After you have located the op which causes the problem, it may turn out that the NaNs yielded by that op are related to numerical issues. For example, an expression involving log(p(x)) may result in NaNs for those nodes which have learned to yield a low probability p(x) for some input x (log(0) is -inf, which easily propagates into NaNs).
CUDA Specific Option
The Theano flag nvcc.fastmath=True can generate NaNs. Don’t set this flag while debugging NaNs.
NaN Introduced by AllocEmpty
AllocEmpty is used by many operations, such as Scan, to allocate memory without properly clearing it. The reason is that the allocated memory will subsequently be overwritten. However, this can sometimes introduce NaNs depending on the operation and what was previously stored in the memory it is working on. For instance, trying to zero out memory using a multiplication before applying an operation can cause NaNs if NaNs are already present in the memory, since 0 * NaN => NaN.
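The 0 * NaN pitfall is easy to demonstrate with numpy; this is a plain illustration, independent of Theano:

```python
import numpy as np

# Memory returned by an "empty" allocation may contain anything,
# including NaNs left over from a previous computation.
buf = np.empty(3)
buf[:] = np.nan        # simulate stale NaNs in uninitialized memory

# Multiplying by zero does NOT clear NaNs: 0 * NaN is still NaN...
print(buf * 0.0)       # [nan nan nan]

# ...whereas overwriting with zeros (what Alloc{0} does) is safe.
buf[:] = 0.0
print(buf)             # [0. 0. 0.]
```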
Using optimizer_including=alloc_empty_to_zeros replaces AllocEmpty by Alloc{0}, which is helpful to diagnose where NaNs come from. Please note that when running in NanGuardMode, this optimizer is not included by default. Therefore, it might be helpful to use them both together.
Profiling Theano functions
Note
This method replaces the old ProfileMode. Do not use ProfileMode anymore.
Besides checking for errors, another important task is to profile your code in terms of speed and/or memory usage.
You can profile your functions using either of the following two options:

- Use the Theano flag config.profile to enable profiling.
  - To enable the memory profiler, use the Theano flag config.profile_memory in addition to config.profile.
  - Moreover, to enable profiling of the Theano optimization phase, use the Theano flag config.profile_optimizer in addition to config.profile.
  - You can also use the Theano flags profiling.n_apply, profiling.n_ops and profiling.min_memory_size to modify the quantity of information printed.
- Pass the argument profile=True to the function theano.function, and then call f.profile.summary() for a single function.
  - Use this option when you want to profile not all the functions but one or more specific function(s).
  - You can also combine the profiles of many functions.
The profiler will output one profile per Theano function, plus a profile that is the sum of the printed profiles. Each profile contains four sections: global info, class info, Ops info and Apply node info.
In the global section, the “Message” is the name of the Theano
function. theano.function() has an optional parameter name
that
defaults to None. Change it to something else to help you profile many
Theano functions. In that section, we also see the number of times the
function was called (1) and the total time spent in all those
calls. The time spent in Function.fn.__call__ and in thunks is useful
to understand Theano overhead.
Also, we see the time spent in the two parts of the compilation process: optimization (modify the graph to make it more stable/faster) and the linking (compile c code and make the Python callable returned by function).
The class, Ops and Apply node sections present the same information: information about the Apply nodes that ran. The Ops section takes the information from the Apply section and merges Apply nodes that have exactly the same op. If two Apply nodes in the graph have two Ops that compare equal, they will be merged. Some Ops, like Elemwise, will not compare equal if their parameters differ (the scalar being executed). So the class section will merge more Apply nodes than the Ops section.
Note that the profile also shows which Ops were running a C implementation.
Developers wishing to optimize the performance of their graph should focus on the worst offending Ops and Apply nodes, either by optimizing an implementation, providing a missing C implementation, or by writing a graph optimization that eliminates the offending Op altogether. You should strongly consider emailing one of our lists about your issue before spending too much time on this.
Here is an example output when we disable some Theano optimizations to give you a better idea of the difference between sections. With all optimizations enabled, there would be only one op left in the graph.
To run the example:
THEANO_FLAGS=optimizer_excluding=fusion:inplace,profile=True python doc/tutorial/profiling_example.py
The output:
Function profiling
==================
  Message: None
  Time in 1 calls to Function.__call__: 5.698204e-05s
  Time in Function.fn.__call__: 1.192093e-05s (20.921%)
  Time in thunks: 6.198883e-06s (10.879%)
  Total compile time: 3.642474e+00s
    Theano Optimizer time: 7.326508e-02s
       Theano validate time: 3.712177e-04s
    Theano Linker time (includes C, CUDA code generation/compiling): 9.584920e-01s

Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
  100.0%   100.0%       0.000s       2.07e-06s     C        3        3   <class 'theano.tensor.elemwise.Elemwise'>
   ... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)

Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
   65.4%    65.4%       0.000s       2.03e-06s     C        2        2   Elemwise{add,no_inplace}
   34.6%   100.0%       0.000s       2.15e-06s     C        1        1   Elemwise{mul,no_inplace}
   ... (remaining 0 Ops account for 0.00%(0.00s) of the runtime)

Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name>
   50.0%    50.0%       0.000s       3.10e-06s      1     0   Elemwise{add,no_inplace}(x, y)
   34.6%    84.6%       0.000s       2.15e-06s      1     2   Elemwise{mul,no_inplace}(TensorConstant{(1,) of 2.0}, Elemwise{add,no_inplace}.0)
   15.4%   100.0%       0.000s       9.54e-07s      1     1   Elemwise{add,no_inplace}(Elemwise{add,no_inplace}.0, z)
   ... (remaining 0 Apply instances account for 0.00%(0.00s) of the runtime)
Further reading
Graph Structures
Debugging or profiling code written in Theano is not that simple if you do not know what goes on under the hood. This chapter is meant to introduce you to a required minimum of the inner workings of Theano.
The first step in writing Theano code is to write down all mathematical relations using symbolic placeholders (variables). When writing down these expressions you use operations like +, -, **, sum(), tanh(). All these are represented internally as ops. An op represents a certain computation on some type of inputs, producing some type of output. You can see it as a function definition in most programming languages.
Theano represents symbolic mathematical computations as graphs. These graphs are composed of interconnected Apply, Variable and Op nodes. An Apply node represents the application of an op to some variables. It is important to draw the distinction between the definition of a computation, represented by an op, and its application to some actual data, which is represented by the Apply node. Furthermore, data types are represented by Type instances. Here is a piece of code and a diagram showing the structure built by that piece of code. This should help you understand how these pieces fit together:
Code
import theano.tensor as T
x = T.dmatrix('x')
y = T.dmatrix('y')
z = x + y
Diagram
Arrows represent references to the Python objects pointed at. The blue box is an Apply node. Red boxes are Variable nodes. Green circles are Ops. Purple boxes are Types.
When we create Variables and then Apply
Ops to them to make more Variables, we build a
bipartite, directed, acyclic graph. Variables point to the Apply nodes
representing the function application producing them via their
owner
field. These Apply nodes point in turn to their input and
output Variables via their inputs
and outputs
fields.
(Apply instances also contain a list of references to their outputs
, but
those pointers don’t count in this graph.)
The owner field of both x and y points to None because they are not the result of another computation. If one of them were the result of another computation, its owner field would point to another blue box like z’s does, and so on. Note that the Apply instance’s outputs points to z, and z.owner points back to the Apply instance.
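The owner/inputs/outputs links described above can be mimicked with a toy sketch. These minimal classes are illustrative only, not Theano's real Variable and Apply:

```python
# A minimal sketch of the bipartite graph: Variables point at the
# Apply node that produced them (owner), and Apply nodes point at
# their input and output Variables.
class Variable:
    def __init__(self, name=None):
        self.name = name
        self.owner = None       # Apply node producing this Variable, or None

class Apply:
    def __init__(self, op, inputs):
        self.op = op
        self.inputs = inputs
        out = Variable()
        out.owner = self        # the output points back at its Apply node
        self.outputs = [out]

def add(a, b):
    return Apply('add', [a, b]).outputs[0]

x, y = Variable('x'), Variable('y')
z = add(x, y)
assert x.owner is None and y.owner is None   # graph inputs have no owner
assert z.owner.op == 'add'                   # z was produced by an 'add'
assert z.owner.inputs == [x, y]
assert z.owner.outputs[0] is z               # the back-pointer round trip
```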
Traversing the graph
The graph can be traversed starting from outputs (the result of some computation) down to its inputs using the owner field. Take for example the following code:
>>> import theano
>>> x = theano.tensor.dmatrix('x')
>>> y = x * 2.
If you enter type(y.owner)
you get <class 'theano.gof.graph.Apply'>
,
which is the apply node that connects the op and the inputs to get this
output. You can now print the name of the op that is applied to get
y:
>>> y.owner.op.name
'Elemwise{mul,no_inplace}'
Hence, an elementwise multiplication is used to compute y. This multiplication is done between the inputs:
>>> len(y.owner.inputs)
2
>>> y.owner.inputs[0]
x
>>> y.owner.inputs[1]
InplaceDimShuffle{x,x}.0
Note that the second input is not 2 as we might have expected. This is because 2 was first broadcast to a matrix of the same shape as x. This is done by using the op DimShuffle:
>>> type(y.owner.inputs[1])
<class 'theano.tensor.var.TensorVariable'>
>>> type(y.owner.inputs[1].owner)
<class 'theano.gof.graph.Apply'>
>>> y.owner.inputs[1].owner.op
<theano.tensor.elemwise.DimShuffle object at 0x106fcaf10>
>>> y.owner.inputs[1].owner.inputs
[TensorConstant{2.0}]
Starting from this graph structure it is easier to understand how automatic differentiation proceeds and how the symbolic relations can be optimized for performance or stability.
Graph Structures
The following section outlines each type of structure that may be used in a Theano-built computation graph. The following structures are explained: Apply, Constant, Op, Variable and Type.
An Apply node is a type of internal node used to represent a
computation graph in Theano. Unlike
Variable nodes, Apply nodes are usually not
manipulated directly by the end user. They may be accessed via
a Variable’s owner
field.
An Apply node is typically an instance of the Apply
class. It represents the application
of an Op on one or more inputs, where each input is a
Variable. By convention, each Op is responsible for
knowing how to build an Apply node from a list of
inputs. Therefore, an Apply node may be obtained from an Op
and a list of inputs by calling Op.make_node(*inputs)
.
Comparing with the Python language, an Apply node is Theano’s version of a function call whereas an Op is Theano’s version of a function definition.
An Apply instance has three important fields:
 op
 An Op that determines the function/transformation being applied here.
 inputs
 A list of Variables that represent the arguments of the function.
 outputs
 A list of Variables that represent the return values of the function.
An Apply instance can be created by calling gof.Apply(op, inputs, outputs)
.
An Op in Theano defines a certain computation on some types of inputs, producing some types of outputs. It is equivalent to a function definition in most programming languages. From a list of input Variables and an Op, you can build an Apply node representing the application of the Op to the inputs.
It is important to understand the distinction between an Op (the
definition of a function) and an Apply node (the application of a
function). If you were to interpret the Python language using Theano’s
structures, code going like def f(x): ...
would produce an Op for
f
whereas code like a = f(x)
or g(f(4), 5)
would produce an
Apply node involving the f
Op.
A Type in Theano represents a set of constraints on potential
data objects. These constraints allow Theano to tailor C code to handle
them and to statically optimize the computation graph. For instance,
the irow type in the theano.tensor
package
gives the following constraints on the data the Variables of type irow
may contain:
- Must be an instance of numpy.ndarray: isinstance(x, numpy.ndarray)
- Must be an array of 32-bit integers: str(x.dtype) == 'int32'
- Must have a shape of 1xN: len(x.shape) == 2 and x.shape[0] == 1
Knowing these restrictions, Theano may generate C code for addition, etc. that declares the right data types and that contains the right number of loops over the dimensions.
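As a plain-Python illustration, the three irow constraints above amount to a predicate like the following (is_irow is a hypothetical helper, not part of Theano):

```python
import numpy as np

# The three irow constraints from the list above, as one predicate:
def is_irow(x):
    return (isinstance(x, np.ndarray)          # must be an ndarray
            and str(x.dtype) == 'int32'        # of 32-bit integers
            and len(x.shape) == 2 and x.shape[0] == 1)  # shape 1xN

print(is_irow(np.zeros((1, 4), dtype='int32')))    # True: 1xN int32
print(is_irow(np.zeros((2, 4), dtype='int32')))    # False: wrong shape
print(is_irow(np.zeros((1, 4), dtype='float64')))  # False: wrong dtype
```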
Note that a Theano Type is not equivalent to a Python type or class. Indeed, in Theano, irow and dmatrix both use numpy.ndarray as the underlying type for doing computations and storing data, yet they are different Theano Types. The constraints set by dmatrix are:
- Must be an instance of numpy.ndarray: isinstance(x, numpy.ndarray)
- Must be an array of 64-bit floating point numbers: str(x.dtype) == 'float64'
- Must have a shape of MxN, no restriction on M or N: len(x.shape) == 2
These restrictions are different from those of irow
which are listed above.
There are cases in which a Type can fully correspond to a Python type,
such as the double
Type we will define here, which corresponds to
Python’s float
. But, it’s good to know that this is not necessarily
the case. Unless specified otherwise, when we say “Type” we mean a
Theano Type.
A Variable is the main data structure you work with when using Theano. The symbolic inputs that you operate on are Variables and what you get from applying various Ops to these inputs are also Variables. For example, when I type
>>> import theano
>>> x = theano.tensor.ivector()
>>> y = -x
x and y are both Variables, i.e. instances of the Variable class. The Type of both x and y is theano.tensor.ivector.
Unlike x
, y
is a Variable produced by a computation (in this
case, it is the negation of x
). y
is the Variable corresponding to
the output of the computation, while x
is the Variable
corresponding to its input. The computation itself is represented by
another type of node, an Apply node, and may be accessed
through y.owner
.
More specifically, a Variable is a basic structure in Theano that
represents a datum at a certain point in computation. It is typically
an instance of the class Variable
or
one of its subclasses.
A Variable r contains four important fields:
- type: a Type defining the kind of value this Variable can hold in computation.
- owner: either None or an Apply node of which the Variable is an output.
- index: the integer such that owner.outputs[index] is r (ignored if owner is None).
- name: a string to use in pretty-printing and debugging.
Variable has one special subclass: Constant.
A Constant is a Variable with one extra field, data (only settable once). When used in a computation graph as the input of an Op application, it is assumed that said input will always take the value contained in the constant’s data field. Furthermore, it is assumed that the Op will not under any circumstances modify the input. This means that a constant is eligible to participate in numerous optimizations: constant inlining in C code, constant folding, etc.
A constant does not need to be specified in a function’s list of inputs. In fact, doing so will raise an exception.
Graph Structures Extension
When we start the compilation of a Theano function, we compute some extra information. This section describes a portion of the information that is made available. Not everything is described, so email theano-dev if you need something that is missing.
The graph gets cloned at the start of compilation, so modifications done during compilation won’t affect the user graph.
Each variable receives a new field called clients. It is a list with references to every place in the graph where this variable is used. If its length is 0, it means the variable isn’t used. Each place where it is used is described by a tuple of 2 elements. There are two types of pairs:
 The first element is an Apply node.
 The first element is the string “output”. It means the function outputs this variable.
In both types of pairs, the second element of the tuple is an index,
such that: var.clients[*][0].inputs[index]
or
fgraph.outputs[index]
is that variable.
>>> import theano
>>> v = theano.tensor.vector()
>>> f = theano.function([v], (v+1).sum())
>>> theano.printing.debugprint(f)
Sum{acc_dtype=float64} [id A] ''   1
 |Elemwise{add,no_inplace} [id B] ''   0
   |TensorConstant{(1,) of 1.0} [id C]
   |<TensorType(float64, vector)> [id D]
>>> # Sorted list of all nodes in the compiled graph.
>>> topo = f.maker.fgraph.toposort()
>>> topo[0].outputs[0].clients
[(Sum{acc_dtype=float64}(Elemwise{add,no_inplace}.0), 0)]
>>> topo[1].outputs[0].clients
[('output', 0)]
>>> # An internal variable
>>> var = topo[0].outputs[0]
>>> client = var.clients[0]
>>> client
(Sum{acc_dtype=float64}(Elemwise{add,no_inplace}.0), 0)
>>> type(client[0])
<class 'theano.gof.graph.Apply'>
>>> assert client[0].inputs[client[1]] is var
>>> # An output of the graph
>>> var = topo[1].outputs[0]
>>> client = var.clients[0]
>>> client
('output', 0)
>>> assert f.maker.fgraph.outputs[client[1]] is var
Automatic Differentiation
Having the graph structure, computing automatic differentiation is
simple. The only thing tensor.grad()
has to do is to traverse the
graph from the outputs back towards the inputs through all apply
nodes (apply nodes are those that define which computations the
graph does). For each such apply node, its op defines
how to compute the gradient of the node’s outputs with respect to its
inputs. Note that if an op does not provide this information,
it is assumed that the gradient is not defined.
Using the
chain rule
these gradients can be composed in order to obtain the expression of the
gradient of the graph’s output with respect to the graph’s inputs.
A following section of this tutorial will examine the topic of differentiation in greater detail.
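The traversal described above can be sketched with toy classes. These are illustrative stand-ins only (Var, Mul and the grad helper are hypothetical, not Theano's tensor.grad):

```python
# Walk from the output back through owner links, asking each op for
# its local gradient and combining contributions with the chain rule.
class Var:
    def __init__(self, value, owner=None):
        self.value, self.owner = value, owner

class Mul:                          # one toy op: scalar multiply
    def __init__(self, a, b):
        self.inputs = (a, b)
        self.output = Var(a.value * b.value, owner=self)
    def grad(self, g_out):          # local gradients of a*b: (g*b, g*a)
        a, b = self.inputs
        return [g_out * b.value, g_out * a.value]

def grad(output, wrt):
    """Backwards traversal: accumulate gradients from output to wrt."""
    grads = {output: 1.0}           # d(output)/d(output) = 1
    stack = [output]
    while stack:
        var = stack.pop()
        if var.owner is None:       # reached a graph input
            continue
        for inp, g in zip(var.owner.inputs, var.owner.grad(grads[var])):
            grads[inp] = grads.get(inp, 0.0) + g   # sum over all paths
            stack.append(inp)
    return grads.get(wrt, 0.0)

x = Var(3.0)
y = Mul(x, x).output                # y = x * x
print(grad(y, x))                   # d(x*x)/dx at x=3 -> 6.0
```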
Optimizations
When compiling a Theano function, what you give to the
theano.function
is actually a graph
(starting from the output variables you can traverse the graph up to
the input variables). While this graph structure shows how to compute
the output from the input, it also offers the possibility to improve the
way this computation is carried out. The way optimizations work in
Theano is by identifying and replacing certain patterns in the graph
with other specialized patterns that produce the same results but are either
faster or more stable. Optimizations can also detect
identical subgraphs and ensure that the same values are not computed
twice or reformulate parts of the graph to a GPU specific version.
For example, one (simple) optimization that Theano uses is to replace the pattern x*y/y by x.
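As a toy illustration of pattern-based rewriting (not Theano's actual optimizer machinery), a graph encoded as nested tuples can be simplified like this:

```python
# Expressions are nested tuples like ('div', ('mul', 'x', 'y'), 'y').
# The rewrite looks for (a * b) / b and replaces it with a.
def simplify(expr):
    if not isinstance(expr, tuple):
        return expr                         # a leaf variable
    op, *args = expr
    args = [simplify(a) for a in args]      # rewrite subgraphs first
    if (op == 'div' and isinstance(args[0], tuple)
            and args[0][0] == 'mul' and args[0][2] == args[1]):
        return args[0][1]                   # (a * b) / b  ->  a
    return (op, *args)

print(simplify(('div', ('mul', 'x', 'y'), 'y')))   # 'x'
```

Real graph optimizations work on Apply/Variable graphs rather than tuples, but the match-and-replace idea is the same.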
Further information regarding the optimization process and the specific optimizations that are applicable is respectively available in the library and on the entrance page of the documentation.
Example
Symbolic programming involves a change of paradigm: it will become clearer as we apply it. Consider the following example of optimization:
>>> import theano
>>> a = theano.tensor.vector("a") # declare symbolic variable
>>> b = a + a ** 10 # build symbolic expression
>>> f = theano.function([a], b) # compile function
>>> print(f([0, 1, 2])) # prints `array([0,2,1026])`
[ 0. 2. 1026.]
>>> theano.printing.pydotprint(b, outfile="./pics/symbolic_graph_unopt.png", var_with_name_simple=True)
The output file is available at ./pics/symbolic_graph_unopt.png
>>> theano.printing.pydotprint(f, outfile="./pics/symbolic_graph_opt.png", var_with_name_simple=True)
The output file is available at ./pics/symbolic_graph_opt.png
We used theano.printing.pydotprint()
to visualize the optimized graph
(right), which is much more compact than the unoptimized graph (left).
[Figure: unoptimized graph (left) vs. optimized graph (right)]
Loading and Saving
Python’s standard way of saving class instances and reloading them
is the pickle mechanism. Many Theano objects can be serialized (and
deserialized) by pickle
, however, a limitation of pickle
is that
it does not save the code or data of a class along with the instance of
the class being serialized. As a result, reloading objects created by a
previous version of a class can be really problematic.
Thus, you will want to consider different mechanisms depending on the amount of time you anticipate between saving and reloading. For short-term storage (such as temp files and network transfers), pickling of the Theano objects or classes is possible. For longer-term storage (such as saving models from an experiment) you should not rely on pickled Theano objects; we recommend loading and saving the underlying shared objects as you would in the course of any other Python program.
The Basics of Pickling
The two modules pickle
and cPickle
have the same functionalities, but
cPickle
, coded in C, is much faster.
>>> from six.moves import cPickle
You can serialize (or save, or pickle) objects to a file with
cPickle.dump
:
>>> f = open('obj.save', 'wb')
>>> cPickle.dump(my_obj, f, protocol=cPickle.HIGHEST_PROTOCOL)
>>> f.close()
Note
If you want your saved object to be stored efficiently, don’t forget
to use cPickle.HIGHEST_PROTOCOL
. The resulting file can be
dozens of times smaller than with the default protocol.
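You can check the size difference yourself. Here Python 3's plain pickle module stands in for cPickle, and the exact ratio depends on the data being serialized:

```python
import pickle

# Compare the oldest (default-in-cPickle-days) protocol with the
# highest one on a modestly sized object:
obj = [i / 3.0 for i in range(10000)]
big = len(pickle.dumps(obj, protocol=0))                      # text-based
small = len(pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL))  # binary
print(small < big)   # True: the binary protocol is far more compact
```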
Note
Opening your file in binary mode ('b'
) is required for portability
(especially between Unix and Windows).
To deserialize (or load, or unpickle) a pickled file, use
cPickle.load
:
>>> f = open('obj.save', 'rb')
>>> loaded_obj = cPickle.load(f)
>>> f.close()
You can pickle several objects into the same file, and load them all (in the same order):
>>> f = open('objects.save', 'wb')
>>> for obj in [obj1, obj2, obj3]:
...     cPickle.dump(obj, f, protocol=cPickle.HIGHEST_PROTOCOL)
>>> f.close()
Then:
>>> f = open('objects.save', 'rb')
>>> loaded_objects = []
>>> for i in range(3):
...     loaded_objects.append(cPickle.load(f))
>>> f.close()
For more details about pickle’s usage, see Python documentation.
Short-Term Serialization
If you are confident that the class instance you are serializing will be deserialized by a compatible version of the code, pickling the whole model is an adequate solution. It would be the case, for instance, if you are saving models and reloading them during the same execution of your program, or if the class you’re saving has been really stable for a while.
You can control what pickle will save from your object, by defining a __getstate__ method, and similarly __setstate__.
This will be especially useful if, for instance, your model class contains a link to the data set currently in use, that you probably don’t want to pickle along every instance of your model.
For instance, you can define functions along the lines of:
def __getstate__(self):
    state = dict(self.__dict__)
    del state['training_set']
    return state

def __setstate__(self, d):
    self.__dict__.update(d)
    self.training_set = cPickle.load(open(self.training_set_file, 'rb'))
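A self-contained sketch of this pattern (the class and attribute names are illustrative, and here the data set is simply dropped rather than reloaded from disk):

```python
import pickle

# Hypothetical model class: the (potentially huge) training set is
# removed from the state before pickling, and restored separately
# after unpickling (here just reset to None for brevity).
class Model(object):
    def __init__(self, weights, training_set):
        self.weights = weights
        self.training_set = training_set

    def __getstate__(self):
        state = dict(self.__dict__)
        del state['training_set']   # don't serialize the data set
        return state

    def __setstate__(self, d):
        self.__dict__.update(d)
        self.training_set = None    # in real code, reload it from disk here

m = Model(weights=[0.1, 0.2], training_set=list(range(10 ** 6)))
m2 = pickle.loads(pickle.dumps(m))
print(m2.weights, m2.training_set)
```

Only the weights travel through the pickle; the million-element training set is not serialized.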
Robust Serialization
This type of serialization uses some helper functions particular to Theano. It serializes the object using Python’s pickling protocol, but any ndarray or CudaNdarray objects contained within the object are saved separately as NPY files. These NPY files and the pickled file are all saved together in a single ZIP file.
The main advantage of this approach is that you don’t even need Theano installed in order to look at the values of shared variables that you pickled. You can just load the parameters manually with numpy.
import numpy
numpy.load('model.zip')
This approach could be beneficial if you are sharing your model with people who might not have Theano installed, who are using a different Python version, or if you are planning to save your model for a long time (in which case version mismatches might make it difficult to unpickle objects).
See theano.misc.pkl_utils.dump() and theano.misc.pkl_utils.load().
Long-Term Serialization
If the implementation of the class you want to save is quite unstable, for instance if functions are created or removed or class members are renamed, you should save and load only the immutable (and necessary) part of your class.
You can do that by defining __getstate__ and __setstate__ functions as above, perhaps listing the attributes you want to save rather than the ones you don’t.
For instance, if the only parameters you want to save are a weight matrix W and a bias b, you can define:
def __getstate__(self):
    return (self.W, self.b)

def __setstate__(self, state):
    W, b = state
    self.W = W
    self.b = b
If at some point in time W is renamed to weights and b to bias, the older pickled files will still be usable if you update these functions to reflect the change in name:
def __getstate__(self):
    return (self.weights, self.bias)

def __setstate__(self, state):
    W, b = state
    self.weights = W
    self.bias = b
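A minimal sketch of this tuple-based state (class and attribute names are illustrative): because __getstate__ returns a plain tuple, the pickled file never mentions the attribute names, so a rename only requires updating the two methods, not the saved files.

```python
import pickle

# Hypothetical class using the renamed attributes: the pickled state is
# just a (W, b) tuple, so files written before the rename load fine.
class Layer(object):
    def __init__(self, weights, bias):
        self.weights = weights
        self.bias = bias

    def __getstate__(self):
        return (self.weights, self.bias)

    def __setstate__(self, state):
        W, b = state
        self.weights = W
        self.bias = b

layer = Layer([1.0, 2.0], 0.5)
restored = pickle.loads(pickle.dumps(layer))
print(restored.weights, restored.bias)
```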
For more information on advanced use of pickle and its internals, see Python’s pickle documentation.
Understanding Memory Aliasing for Speed and Correctness
The aggressive reuse of memory is one of the ways through which Theano makes code fast, and it is important for the correctness and speed of your program that you understand how Theano might alias buffers.
This section describes the principles underlying Theano’s memory handling, and explains when you might want to alter the default behaviour of some functions and methods for faster performance.
The Memory Model: Two Spaces
There are some simple principles that guide Theano’s handling of memory. The main idea is that there is a pool of memory managed by Theano, and Theano tracks changes to values in that pool.
 Theano manages its own memory space, which typically does not overlap with the memory of normal Python variables that non-Theano code creates.
 Theano functions only modify buffers that are in Theano’s memory space.
 Theano’s memory space includes the buffers allocated to store shared variables and the temporaries used to evaluate functions.
 Physically, Theano’s memory space may be spread across the host, one or more GPU devices, and in the future may even include objects on a remote machine.
 The memory allocated for a shared variable buffer is unique: it is never aliased to another shared variable.
 Theano’s managed memory is constant while Theano functions are not running and Theano’s library code is not running.
 The default behaviour of a function is to return user-space values for outputs, and to expect user-space values for inputs.
The distinction between Theano-managed memory and user-managed memory can be broken down by some Theano functions (e.g. shared, get_value and the constructors for In and Out) by using a borrow=True flag. This can make those methods faster (by avoiding copy operations) at the expense of risking subtle bugs in the overall program (by aliasing memory).
The rest of this section is aimed at helping you to understand when it is safe to use the borrow=True argument and reap the benefits of faster code.
Borrowing when Constructing Function Objects
A borrow argument can also be provided to the In and Out objects that control how theano.function handles its arguments and return values.
import theano, theano.tensor
x = theano.tensor.matrix()
y = 2 * x
f = theano.function([theano.In(x, borrow=True)], theano.Out(y, borrow=True))
Borrowing an input means that Theano will treat the argument you provide as if it were part of Theano’s pool of temporaries. Consequently, your input may be reused as a buffer (and overwritten!) during the computation of other variables in the course of evaluating that function (e.g. f).
Borrowing an output means that Theano will not insist on allocating a fresh output buffer every time you call the function. It will possibly reuse the same one as on a previous call, and overwrite the old content. Consequently, it may overwrite old return values through side effects. Those return values may also be overwritten in the course of evaluating another compiled function (for example, the output may be aliased to a shared variable). So be careful to use a borrowed return value right away, before calling any more Theano functions.
The default is of course to not borrow internal results.
It is also possible to pass a return_internal_type=True flag to the Out variable, which has the same interpretation as the return_internal_type flag to the shared variable’s get_value function. Unlike get_value(), the combination of return_internal_type=True and borrow=True arguments to Out() is not guaranteed to avoid copying an output value. They are just hints that give more flexibility to the compilation and optimization of the graph.
Take-home message:
 When an input x to a function is not needed after the function returns and you would like to make it available to Theano as additional workspace, consider marking it with In(x, borrow=True). It may make the function faster and reduce its memory requirement.
 When a return value y is large (in terms of memory footprint), and you only need to read from it once, right away when it’s returned, consider marking it with Out(y, borrow=True).
Python Memory Management
One of the major challenges in writing (somewhat) large-scale Python programs is keeping memory usage at a minimum. However, managing memory in Python is easy, if you just don’t care. Python allocates memory transparently, manages objects using a reference-count system, and frees memory when an object’s reference count falls to zero. In theory, it’s swell. In practice, you need to know a few things about Python memory management to get a memory-efficient program running. One of the things you should know, or at least get a good feel for, is the sizes of basic Python objects. Another thing is how Python manages its memory internally.
So let us begin with the size of basic objects. In Python, there are not many primitive data types: there are ints, longs (an unlimited-precision version of ints), floats (which are doubles), tuples, strings, lists, dictionaries, and classes.
Basic Objects
What is the size of int? A programmer with a C or C++ background will probably guess that the size of a machine-specific int is something like 32 bits, maybe 64, and that therefore it occupies at most 8 bytes. But is that so in Python?
Let us first write a function that shows the sizes of objects (recursively if necessary):
import sys

def show_sizeof(x, level=0):
    print "\t" * level, x.__class__, sys.getsizeof(x), x
    if hasattr(x, '__iter__'):
        if hasattr(x, 'items'):
            for xx in x.items():
                show_sizeof(xx, level + 1)
        else:
            for xx in x:
                show_sizeof(xx, level + 1)
We can now use the function to inspect the sizes of the different basic data types:
show_sizeof(None)
show_sizeof(3)
show_sizeof(2**63)
show_sizeof(102947298469128649161972364837164)
show_sizeof(918659326943756134897561304875610348756384756193485761304875613948576297485698417)
If you have a 32-bit Python 2.7.x, you’ll see:
8 None
12 3
22 9223372036854775808
28 102947298469128649161972364837164
48 918659326943756134897561304875610348756384756193485761304875613948576297485698417
and if you have a 64-bit Python 2.7.x, you’ll see:
16 None
24 3
36 9223372036854775808
40 102947298469128649161972364837164
60 918659326943756134897561304875610348756384756193485761304875613948576297485698417
Let us focus on the 64-bit version (mainly because that is what we need most often). None takes 16 bytes. int takes 24 bytes, three times as much memory as a C int64_t, despite being some kind of “machine-friendly” integer. Long integers (unbounded precision), used to represent integers larger than 2^63 - 1, have a minimum size of 36 bytes. The size then grows linearly in the logarithm of the integer represented.
Python’s floats are implementation-specific but seem to be C doubles. However, they do not take up only 8 bytes:
show_sizeof(3.14159265358979323846264338327950288)
outputs
16 3.14159265359
on a 32-bit platform and
24 3.14159265359
on a 64-bit platform. That is, again, three times the size a C programmer would expect. Now, what about strings?
show_sizeof("")
show_sizeof("My hovercraft is full of eels")
outputs, on a 32-bit platform:
21
50 My hovercraft is full of eels
and, on a 64-bit platform:
37
66 My hovercraft is full of eels
An empty string costs 37 bytes in a 64-bit environment! The memory used by a string then grows linearly with the length of the (useful) string.
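Under Python 3 the exact figures differ from the Python 2 numbers quoted above, but the pattern is the same. A quick check (the values are platform- and version-dependent):

```python
import sys

# Every Python object carries a fixed header overhead that is much
# larger than its C-level payload; exact numbers vary by build.
print(sys.getsizeof(None))      # small, but more than a bare pointer
print(sys.getsizeof(3))         # far more than the 4-8 bytes of a C int
print(sys.getsizeof(""))        # dozens of bytes for an empty string
print(sys.getsizeof(2 ** 100))  # grows with the number of digits
```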
* * *
Other commonly used structures, namely tuples, lists, and dictionaries, are worthwhile to examine. Lists (which are implemented as array lists, not as linked lists, with everything that entails) are arrays of references to Python objects, allowing them to be heterogeneous. Let us look at our sizes:
show_sizeof([])
show_sizeof([4, "toaster", 230.1])
outputs
32 []
44 [4, 'toaster', 230.1]
on a 32-bit platform and
72 []
96 [4, 'toaster', 230.1]
on a 64-bit platform. An empty list eats up 72 bytes. The size of an empty, 64-bit C++ std::list() is only 16 bytes, 4-5 times less. What about tuples? And dictionaries?
show_sizeof({})
show_sizeof({'a':213, 'b':2131})
outputs, on a 32-bit box:
136 {}
136 {'a': 213, 'b': 2131}
32 ('a', 213)
22 a
12 213
32 ('b', 2131)
22 b
12 2131
and
280 {}
280 {'a': 213, 'b': 2131}
72 ('a', 213)
38 a
24 213
72 ('b', 2131)
38 b
24 2131
for a 64-bit box.
This last example is particularly interesting because it “doesn’t add up.” If we look at individual key/value pairs, they take 72 bytes (while their components take 38 + 24 = 62 bytes, leaving 10 bytes for the pair itself), but the dictionary takes 280 bytes (rather than a strict minimum of 144 = 72 × 2 bytes). The dictionary is supposed to be an efficient data structure for search, and the two likely implementations will use more space than strictly necessary. If it’s some kind of tree, then we should pay the cost of internal nodes that contain a key and two pointers to child nodes; if it’s a hash table, then we must have some room with free entries to ensure good performance.
The (somewhat) equivalent std::map C++ structure takes 48 bytes when created (that is, empty). An empty C++ string takes 8 bytes (allocated size then grows linearly with the size of the string). An integer takes 4 bytes (32 bits).
* * *
Why does all this matter? It seems that whether an empty string takes 8 bytes or 37 doesn’t change anything much. That’s true, until you need to scale. Then, you need to be really careful about how many objects you create to limit the quantity of memory your program uses. It is a problem in real-life applications. However, to devise a really good strategy about memory management, we must not only consider the sizes of objects, but how many are created and in which order. It turns out to be very important for Python programs. One key element to understand is how Python allocates its memory internally, which we will discuss next.
Internal Memory Management
To speed up memory allocation (and reuse), Python uses a number of free lists for small objects. Each list contains objects of similar size: there is a list for objects 1 to 8 bytes in size, one for 9 to 16, etc. When a small object needs to be created, either we reuse a free block in the list, or we allocate a new one.
There are some internal details on how Python manages those lists as blocks, pools, and “arenas”: a number of blocks form a pool, pools are gathered into arenas, etc., but they’re not very relevant to the point we want to make (if you really want to know, read Evan Jones’ ideas on how to improve Python’s memory allocation). The important point is that those lists never shrink.
Indeed: if an item (of size x) is deallocated (freed by lack of reference), its location is not returned to Python’s global memory pool (and even less to the system), but merely marked as free and added to the free list of items of size x. The dead object’s location will be reused if another object of compatible size is needed. If there are no dead objects available, new ones are allocated.
Since small-object memory is never freed, the inescapable conclusion is that, like goldfish, these small-object lists only keep growing, never shrinking, and that the memory footprint of your application is dominated by the largest number of small objects allocated at any given point.
* * *
Therefore, one should work hard to allocate only the number of small objects necessary for one task, favoring (otherwise un-Pythonesque) loops where only a small number of elements are created/processed rather than (more Pythonesque) patterns where lists are created using list comprehension syntax and then processed.
While the second pattern is more à la Python, it is rather the worst case: you end up creating lots of small objects that will populate the small-object lists, and even once the list is dead, the dead objects (now all in the free lists) will still occupy a lot of memory.
* * *
The fact that the free lists grow does not seem like much of a problem, because the memory they contain is still accessible to the Python program. But from the OS’s perspective, your program’s size is the total (maximum) memory allocated to Python. Since Python returns memory to the OS only on Windows (and only for the heap that allocates objects other than small objects), if you run on Linux, you will only ever see the total memory used by your program increase.
* * *
Let us prove my point using memory_profiler, a Python add-on module (which depends on the python-psutil package) by Fabian Pedregosa (see the module’s GitHub page). This add-on provides the decorator @profile that allows one to monitor the memory usage of one specific function. It is extremely simple to use. Let us consider the following program:
import copy
import memory_profiler

@profile
def function():
    x = list(range(1000000))  # allocate a big list
    y = copy.deepcopy(x)
    del x
    return y

if __name__ == "__main__":
    function()
invoking
python -m memory_profiler memory-profile-me.py
prints, on a 64-bit computer:
Filename: memory-profile-me.py
Line # Mem usage Increment Line Contents
================================================
4 @profile
5 9.11 MB 0.00 MB def function():
6 40.05 MB 30.94 MB x = list(range(1000000)) # allocate a big list
7 89.73 MB 49.68 MB y = copy.deepcopy(x)
8 82.10 MB 7.63 MB del x
9 82.10 MB 0.00 MB return y
This program creates a list of n = 1,000,000 ints (n × 24 bytes = ~23 MB) and an additional list of references (n × 8 bytes = ~7.6 MB), which amounts to a total memory usage of ~31 MB. copy.deepcopy copies both lists, which allocates again ~50 MB (I am not sure where the additional overhead of 50 MB - 31 MB = 19 MB comes from). The interesting part is del x: it deletes x, but the memory usage only decreases by 7.63 MB! This is because del only deletes the reference list, not the actual integer values, which remain on the heap and cause a memory overhead of ~23 MB.
This example allocates in total ~73 MB, which is more than twice the amount of memory needed to store a single list of ~31 MB. You can see that memory can increase surprisingly if you are not careful!
Note that you might get different results on a different platform or with a different Python version.
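If memory_profiler is not available, the standard library’s tracemalloc module (Python 3.4+) gives a similar, coarser picture; a sketch (the exact numbers vary by platform and version):

```python
import tracemalloc

# Track Python-level allocations around the same big-list experiment.
tracemalloc.start()

x = list(range(1000000))  # allocate a big list
current, peak = tracemalloc.get_traced_memory()

del x                     # in Python 3 this releases the ints too
after, _ = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(peak, after)
```

Note that tracemalloc only counts allocations made while tracing is active, so it measures the Python objects themselves rather than the process size the OS reports.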
Pickle
On a related note: is pickle wasteful?
Pickle is the standard way of (de)serializing Python objects to file. What is its memory footprint? Does it create extra copies of the data or is it rather smart about it? Consider this short example:
import memory_profiler
import pickle
import random

def random_string():
    return "".join([chr(64 + random.randint(0, 25)) for _ in xrange(20)])

@profile
def create_file():
    x = [(random.random(),
          random_string(),
          random.randint(0, 2 ** 64))
         for _ in xrange(1000000)]
    pickle.dump(x, open('machin.pkl', 'w'))

@profile
def load_file():
    y = pickle.load(open('machin.pkl', 'r'))
    return y

if __name__ == "__main__":
    create_file()
    #load_file()
With one invocation we profile the creation of the pickled data, and with the other we re-read it (commenting out the function not being called). Using memory_profiler, the creation uses a lot of memory:
Filename: test-pickle.py
Line # Mem usage Increment Line Contents
================================================
8 @profile
9 9.18 MB 0.00 MB def create_file():
10 9.33 MB 0.15 MB x=[ (random.random(),
11 random_string(),
12 random.randint(0,2**64))
13 246.11 MB 236.77 MB for _ in xrange(1000000) ]
14
15 481.64 MB 235.54 MB pickle.dump(x,open('machin.pkl','w'))
and rereading a bit less:
Filename: test-pickle.py
Line # Mem usage Increment Line Contents
================================================
18 @profile
19 9.18 MB 0.00 MB def load_file():
20 311.02 MB 301.83 MB y=pickle.load(open('machin.pkl','r'))
21 311.02 MB 0.00 MB return y
So somehow, pickling is very bad for memory consumption. The initial list takes up more or less 230 MB, but pickling it creates an extra 230-something MB worth of memory allocation.
Unpickling, on the other hand, seems fairly efficient. It does use more memory than the original list (300 MB instead of 230-something) but it does not double the quantity of allocated memory.
Overall, then, (un)pickling should be avoided for memory-sensitive applications. What are the alternatives? Pickling preserves all the structure of a data structure, so you can recover it exactly from the pickled file at a later time. However, that might not always be needed. If the file is to contain a list as in the example above, then maybe a simple flat, text-based file format is in order. Let us see what it gives.
A naïve implementation would give:
import memory_profiler
import random
import pickle

def random_string():
    return "".join([chr(64 + random.randint(0, 25)) for _ in xrange(20)])

@profile
def create_file():
    x = [(random.random(),
          random_string(),
          random.randint(0, 2 ** 64))
         for _ in xrange(1000000)]
    f = open('machin.flat', 'w')
    for xx in x:
        print >>f, xx
    f.close()

@profile
def load_file():
    y = []
    f = open('machin.flat', 'r')
    for line in f:
        y.append(eval(line))
    f.close()
    return y

if __name__ == "__main__":
    create_file()
    #load_file()
Creating the file:
Filename: test-flat.py
Line # Mem usage Increment Line Contents
================================================
8 @profile
9 9.19 MB 0.00 MB def create_file():
10 9.34 MB 0.15 MB x=[ (random.random(),
11 random_string(),
12 random.randint(0, 2**64))
13 246.09 MB 236.75 MB for _ in xrange(1000000) ]
14
15 246.09 MB 0.00 MB f=open('machin.flat', 'w')
16 308.27 MB 62.18 MB for xx in x:
17 print >>f, xx
and reading the file back:
Filename: test-flat.py
Line # Mem usage Increment Line Contents
================================================
20 @profile
21 9.19 MB 0.00 MB def load_file():
22 9.34 MB 0.15 MB y=[]
23 9.34 MB 0.00 MB f=open('machin.flat', 'r')
24 300.99 MB 291.66 MB for line in f:
25 300.99 MB 0.00 MB y.append(eval(line))
26 301.00 MB 0.00 MB return y
Memory consumption on writing is now much better. It still creates a lot of temporary small objects (60 MB worth), but it is not doubling memory usage. Reading is comparable (using only marginally less memory).
This particular example is trivial, but it generalizes to strategies where you don’t load the whole data set first and then process it, but rather read a few items, process them, and reuse the allocated memory. When loading data into a NumPy array, for example, one could first create the NumPy array, then read the file line by line to fill it: this allocates one copy of the whole data. Using pickle, you would allocate the whole data (at least) twice: once by pickle, and once through NumPy.
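A minimal sketch of that strategy, using an in-memory file as a stand-in for a real one: the NumPy array is preallocated, then filled line by line, so only one full copy of the data is ever held.

```python
import io

import numpy

# Stand-in for a real text file with one number per line.
f = io.StringIO("1.5\n2.5\n3.5\n")

# Preallocate the destination array (the row count is assumed known,
# e.g. from a header or a first pass over the file).
n_rows = 3
data = numpy.empty(n_rows)

# Fill it line by line: no intermediate list of a million Python
# objects is ever built.
for i, line in enumerate(f):
    data[i] = float(line)

print(data)
```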
Or better yet: use NumPy (or PyTables) arrays. But that is a different topic. In the meantime, you can have a look at Loading and Saving, another tutorial in the Theano/doc/tutorial directory.
* * *
Python’s design goals are radically different from, say, C’s design goals. While the latter is designed to give you good control over what you’re doing at the expense of more complex and explicit programming, the former is designed to let you code rapidly while hiding most (if not all) of the underlying implementation details. While this sounds nice, in a production environment ignoring the implementation inefficiencies of a language can bite you hard, sometimes when it’s too late. I think that having a good feel for how inefficient Python is with memory management (by design!) will play an important role in whether or not your code meets production requirements, scales well, or, on the contrary, becomes a burning hell of memory.
Multi-core support in Theano
Convolution and Pooling
Since Theano 0.9dev2, convolution and pooling are parallelized on CPU.
BLAS operation
BLAS is an interface for some mathematical operations between two vectors, a vector and a matrix, or two matrices (e.g. the dot product between vector/matrix and matrix/matrix). Many different implementations of that interface exist, and some of them are parallelized.
Theano tries to use that interface as frequently as possible for performance reasons. So if Theano links to a parallel implementation, those operations will run in parallel in Theano.
The most frequent way to control the number of threads used is via the OMP_NUM_THREADS environment variable. Set it to the number of threads you want to use before starting the Python process. Some BLAS implementations support other environment variables.
To test whether your BLAS supports OpenMP/multiple cores, you can use the theano/misc/check_blas.py script from the command line like this:
OMP_NUM_THREADS=1 python theano/misc/check_blas.py -q
OMP_NUM_THREADS=2 python theano/misc/check_blas.py -q
Parallel element-wise ops with OpenMP
Because element-wise ops work on every tensor entry independently, they can be easily parallelized using OpenMP.
To use OpenMP you must set the openmp flag to True.
You can use the flag openmp_elemwise_minsize to set the minimum tensor size for which the operation is parallelized, because for short tensors using OpenMP can slow down the operation. The default value is 200000.
For simple (fast) operations you can obtain a speedup with very large tensors while for more complex operations you can obtain a good speedup also for smaller tensors.
There is a script elemwise_openmp_speedup.py in theano/misc/ which you can use to tune the value of openmp_elemwise_minsize for your machine. The script runs two elemwise operations (a fast one and a slow one) for a vector of size openmp_elemwise_minsize, with and without OpenMP, and shows the time difference between the cases.
The only way to control the number of threads used is via the OMP_NUM_THREADS environment variable. Set it to the number of threads you want to use before starting the Python process. You can test this with this command:
OMP_NUM_THREADS=2 python theano/misc/elemwise_openmp_speedup.py
#The output
Fast op time without openmp 0.000533s with openmp 0.000474s speedup 1.12
Slow op time without openmp 0.002987s with openmp 0.001553s speedup 1.92
Frequently Asked Questions
How to update a subset of weights?
If you want to update only a subset of a weight matrix (such as some rows or some columns) that are used in the forward propagation of each iteration, then the cost function should be defined in a way that it only depends on the subset of weights that are used in that iteration.
For example, if you want to learn a lookup table, e.g. used for word embeddings, where each row is a vector of weights representing the embedding that the model has learned for a word, then in each iteration the only rows that should get updated are those containing embeddings used during the forward propagation. Here is how the Theano function should be written.
Define a shared variable for the lookup table:
lookup_table = theano.shared(matrix_ndarray)
Get a subset of the table (some rows or some columns) by passing an integer vector of indices corresponding to those rows or columns:
subset = lookup_table[vector_of_indices]
From now on, use only subset. Do not call lookup_table[vector_of_indices] again, as this causes problems with grad: it will create new variables.
Define the cost, which depends only on subset and not on the entire lookup_table:
cost = something that depends on subset
g = theano.grad(cost, subset)
There are two ways to update the parameters: use either inc_subtensor or set_subtensor. It is recommended to use inc_subtensor. Some Theano optimizations do the conversion between the two functions, but not in all cases.
updates = inc_subtensor(subset, g*lr)
OR
updates = set_subtensor(subset, subset + g*lr)
Currently we only cover this case, not the case where you use inc_subtensor or set_subtensor with other types of indexing.
Define the Theano function:
f = theano.function(..., updates=[(lookup_table, updates)])
Note that you can compute the gradient of the cost function w.r.t. the entire lookup_table, and the gradient will have nonzero rows only for the rows that were selected during forward propagation. If you use gradient descent to update the parameters, there are no issues except for unnecessary computation, e.g. you will update the lookup table parameters with many zero-gradient rows. However, if you want to use a different optimization method like rmsprop or Hessian-free optimization, then there will be issues. In rmsprop, you keep an exponentially decaying squared gradient, by whose square root you divide the current gradient to rescale the update step component-wise. If the gradient of the lookup table row which corresponds to a rare word is very often zero, the squared-gradient history will tend to zero for that row, because the history of that row decays towards zero. Using Hessian-free optimization, you will get many zero rows and columns; even one of them would make it non-invertible. In general, it is better to compute the gradient only w.r.t. those lookup table rows or columns which are actually used during the forward propagation.
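The update rule itself can be sketched in plain NumPy (Theano is not required to see the idea): numpy.add.at plays the role of inc_subtensor, and all names below are illustrative.

```python
import numpy

# Hypothetical lookup table: 5 embeddings of size 3.
lookup_table = numpy.zeros((5, 3))

# Only rows 0 and 2 were used in the forward pass.
vector_of_indices = numpy.array([0, 2])

# Stand-in gradient w.r.t. the selected subset, with learning rate lr.
g = numpy.ones((2, 3))
lr = 0.1

# NumPy analogue of: updates = inc_subtensor(subset, g * lr).
# numpy.add.at accumulates correctly even with repeated indices,
# which is the same property that makes inc_subtensor preferable.
numpy.add.at(lookup_table, vector_of_indices, g * lr)

print(lookup_table)
```

Rows 1, 3 and 4 are left untouched, exactly like the unused embedding rows in the Theano version.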
Extending Theano
This advanced tutorial is for users who want to extend Theano with new Types, new Operations (Ops), and new graph optimizations. This first page of the tutorial mainly focuses on the Python implementation of an Op and then proposes an overview of the most important methods that define an op. The second page of the tutorial (Extending Theano with a C Op) then provides information on the C implementation of an Op. The rest of the tutorial goes more in depth on advanced topics related to Ops, such as how to write efficient code for an Op and how to write an optimization to speed up the execution of an Op.
Along the way, this tutorial also introduces many aspects of how Theano works, so it is also good for you if you are interested in getting more under the hood with Theano itself.
Note
Before tackling this more advanced presentation, it is highly recommended to read the introductory Tutorial, especially the sections that introduce the Theano Graphs, as providing a novel Theano op requires a basic understanding of the Theano Graphs.
See also the Developer Start Guide for information regarding the versioning framework, namely about git and GitHub, regarding the development workflow and how to make a quality contribution.
Creating a new Op: Python implementation
So suppose you have looked through the library documentation and you don’t see a function that does what you want.
If you can implement something in terms of existing Ops, you should do that. Odds are your function that uses existing Theano expressions is short, has no bugs, and potentially profits from optimizations that have already been implemented.
However, if you cannot implement an Op in terms of existing Ops, you have to write a new one. Don’t worry, Theano was designed to make it easy to add new Ops, Types, and Optimizations.
As an illustration, this tutorial shows how to write a simple Python-based op which performs operations on the type double. It also shows how to implement tests that ensure the proper working of an op.
Note
This is an introductory tutorial and as such it does not cover how to make an op that returns a view or modifies the values in its inputs. Thus, all ops created with the instructions described here MUST return newly allocated memory or reuse the memory provided in the output_storage parameter of the perform() function. See Views and inplace operations for an explanation of how to do this.
If your op returns a view or changes the value of its inputs without doing as prescribed in that page, Theano will run, but will return correct results for some graphs and wrong results for others.
It is recommended that you run your tests in DebugMode (Theano flag mode=DebugMode), since it verifies that your op behaves correctly in this regard.
Theano Graphs refresher
Theano represents symbolic mathematical computations as graphs. Those graphs are bipartite (graphs with two types of nodes): they are composed of interconnected Apply and Variable nodes. Variable nodes represent data in the graph, either inputs, outputs, or intermediary values. As such, the inputs and outputs of a graph are lists of Theano Variable nodes. Apply nodes perform computation on these variables to produce new variables. Each Apply node has a link to an instance of Op, which describes the computation to perform. This tutorial details how to write such an Op instance. Please refer to Graph Structures for a more detailed explanation of the graph structure.
Op’s basic methods
An op is any Python object which inherits from gof.Op. This section provides an overview of the basic methods you typically have to implement to make a new op. It does not provide extensive coverage of all the possibilities you may encounter or need. For that, refer to Op’s contract.
import theano

class MyOp(theano.Op):
    # Properties attribute
    __props__ = ()

    # itypes and otypes attributes are
    # compulsory if make_node method is not defined.
    # They're the type of input and output respectively.
    itypes = None
    otypes = None

    # Compulsory if itypes and otypes are not defined
    def make_node(self, *inputs):
        pass

    # Python implementation:
    def perform(self, node, inputs_storage, output_storage):
        pass

    # Other type of implementation
    # C implementation: [see theano web site for other functions]
    def c_code(self, node, inputs, outputs, sub):
        pass

    # Other implementations:
    def make_thunk(self, node, storage_map, _, _2, impl=None):
        pass

    # optional:
    check_input = True

    def __init__(self, *args):
        pass

    def grad(self, inputs, g):
        pass

    def R_op(self, inputs, eval_points):
        pass

    def infer_shape(self, node, input_shapes):
        pass
An op has to implement some methods defined in the interface of gof.Op. More specifically, it is mandatory for an op to define either the method make_node() or the attributes itypes and otypes, and one of the implementation methods: perform(), Op.c_code() or make_thunk().
The make_node() method creates an Apply node representing the application of the op on the inputs provided. This method is responsible for three things:
 It first checks that the input Variables’ types are compatible with the current op. If the op cannot be applied to the provided input types, it must raise an exception (such as TypeError).
 It operates on the Variables found in *inputs in Theano’s symbolic language to infer the type of the symbolic output Variables. It creates output Variables of a suitable symbolic Type to serve as the outputs of this op’s application.
 It creates an Apply instance with the input and output Variables, and returns the Apply instance.
The perform() method defines the Python implementation of an op. It takes several arguments:

- node is a reference to an Apply node which was previously obtained via the Op’s make_node() method. It is typically not used in simple ops, but it contains symbolic information that could be required by complex ops.
- inputs is a list of references to data which can be operated on using non-symbolic statements (i.e., statements in Python or NumPy).
- output_storage is a list of storage cells where the output is to be stored. There is one storage cell for each output of the op. The data put in output_storage must match the type of the symbolic output. It is forbidden to change the length of the list(s) contained in output_storage. A function Mode may allow output_storage elements to persist between evaluations, or it may reset output_storage cells to hold a value of None. It can also preallocate some memory for the op to use; this feature can allow perform to reuse memory between calls, for example. If there is something preallocated in output_storage, it will be of the correct dtype, but can have the wrong shape and any stride pattern.
The output of the perform() method must be determined by its inputs. That is to say, when applied to identical inputs, the method must return the same outputs.
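The output_storage calling convention above can be sketched in plain Python, without Theano (the helper name perform_double below is hypothetical):

```python
# A minimal sketch of the output_storage convention, without Theano:
# each output cell is a one-element list that the implementation writes into.
def perform_double(inputs, output_storage):
    x = inputs[0]
    output_storage[0][0] = x * 2   # store the single output in cell 0

storage = [[None]]                 # one cell per output
perform_double([3], storage)
assert storage[0][0] == 6
```

Writing into `output_storage[i][0]` rather than returning a value is what lets the caller own (and possibly reuse) the storage cells between evaluations.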
gof.Op allows some other ways to define the op’s implementation. For instance, it is possible to define Op.c_code() to provide a C implementation of the op. Please refer to the tutorial Extending Theano with a C Op for a description of Op.c_code() and the other related C methods. Note that an op can provide both Python and C implementations.
The make_thunk() method is another alternative to perform(). It returns a thunk. A thunk is defined as a zero-argument function which encapsulates the computation to be performed by an op on the arguments of its corresponding node. It takes several parameters:

- node is the Apply instance for which a thunk is requested.
- storage_map is a dict of lists which maps variables to one-element lists holding the variable’s current value. The one-element list acts as a pointer to the value and allows sharing that “pointer” with other nodes and instances.
- compute_map is also a dict of lists. It maps variables to one-element lists holding booleans. If the value is 0, the variable has not been computed and the value should not be considered valid. If the value is 1, the variable has been computed and the value is valid. If the value is 2, the variable has been garbage-collected and is no longer valid, but shouldn’t be required anymore for this call. The returned function must ensure that it sets the computed variables as computed in the compute_map.
- impl allows selecting among multiple implementations. It should have a default value of None.
make_thunk() is useful if you want to generate code and compile it yourself.

If make_thunk() is defined by an op, it will be used by Theano to obtain the op’s implementation; perform() and Op.c_code() will then be ignored.

If make_node() is not defined, the itypes and otypes attributes are used by the Op’s default make_node() method to implement the functionality of the make_node() method mentioned above.
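The one-element-list storage and compute cells used by thunks can be sketched in plain Python (no Theano involved; make_double_thunk is a hypothetical name):

```python
# A plain-Python sketch of the thunk convention described above (not Theano's
# implementation). Storage cells are one-element lists acting as pointers,
# and a compute cell marks whether the output has been computed.
def make_double_thunk(in_cell, out_cell, out_computed):
    def thunk():
        out_cell[0] = in_cell[0] * 2   # do the computation
        out_computed[0] = 1            # mark the output as computed
    return thunk

x_cell, y_cell, y_done = [5], [None], [0]
thunk = make_double_thunk(x_cell, y_cell, y_done)
thunk()
assert y_cell[0] == 10 and y_done[0] == 1
```

Because the cells are shared, other nodes holding the same cell see the new value immediately; that is the point of the one-element-list "pointer".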
Op’s auxiliary methods
There are other methods that can be optionally defined by the op:
- The __str__() method provides a meaningful string representation of your op.
- __eq__() and __hash__() define, respectively, equality between two ops and the hash of an op instance. They will be used by the optimization phase to merge nodes that are doing equivalent computations (same inputs, same operation). Two ops that are equal according to __eq__() should return the same output when they are applied on the same inputs.
- The __props__ attribute lists the properties that influence how the computation is performed (usually these are the ones you set in __init__()). It must be a tuple. If you don’t have any properties, then you should set this attribute to the empty tuple ().
__props__ enables the automatic generation of appropriate __eq__() and __hash__() methods. Given the __eq__() automatically generated from __props__, two ops will be equal if they have the same values for all the properties listed in __props__. Given the __hash__() automatically generated from __props__, two ops will have the same hash if they have the same values for all the properties listed in __props__. __props__ will also generate a suitable __str__() for your op. This requires a development version after September 1st, 2014, or version 0.7.
- The infer_shape() method allows inferring the shape of the op’s output variables, without actually computing the outputs. It takes as input node, a reference to the op’s Apply node, and a list of Theano symbolic Variables (i0_shape, i1_shape, ...) which are the shapes of the op’s input Variables. infer_shape() returns a list where each element is a tuple representing the shape of one output. This can be helpful if one only needs the shape of the output instead of the actual outputs, which can be useful, for instance, for optimization procedures.
- The grad() method is required if you want to differentiate some cost whose expression includes your op. The gradient may be specified symbolically in this method. It takes two arguments, inputs and output_gradients, which are both lists of symbolic Theano Variables, and those must be operated on using Theano’s symbolic language. The grad method must return a list containing one Variable for each input. Each returned Variable represents the gradient with respect to that input, computed based on the symbolic gradients with respect to each output. If the output is not differentiable with respect to an input, then this method should return a variable of type NullType for that input. Likewise, if you have not implemented the grad computation for some input, you may return a variable of type NullType for that input. Please refer to grad() for a more detailed view.
- The R_op() method is needed if you want theano.tensor.Rop to work with your op. This function implements the application of the R-operator on the function represented by your op. Suppose the op represents a function f with input x; applying the R-operator means computing the Jacobian of f and right-multiplying it by v, the evaluation point, namely: (∂f(x)/∂x) · v.
- The optional boolean check_input attribute is used to specify if you want the types used in your op to check their inputs in their c_code. It can be used to speed up compilation, reduce overhead (particularly for scalars) and reduce the number of generated C files.
Example: Op definition
import theano

# Using make_node

class DoubleOp1(theano.Op):
    __props__ = ()

    def make_node(self, x):
        x = theano.tensor.as_tensor_variable(x)
        # Note: using x_.type() is dangerous, as it copies x's broadcasting
        # behaviour
        return theano.Apply(self, [x], [x.type()])

    def perform(self, node, inputs, output_storage):
        x = inputs[0]
        z = output_storage[0]
        z[0] = x * 2

    def infer_shape(self, node, i0_shapes):
        return i0_shapes

    def grad(self, inputs, output_grads):
        return [output_grads[0] * 2]

    def R_op(self, inputs, eval_points):
        # R_op can receive None as eval_points.
        # That means there is no differentiable path through that input.
        # If this implies that you cannot compute some outputs,
        # return None for those.
        if eval_points[0] is None:
            return eval_points
        return self.grad(inputs, eval_points)

doubleOp1 = DoubleOp1()

# Using itypes and otypes

class DoubleOp2(theano.Op):
    __props__ = ()

    itypes = [theano.tensor.dmatrix]
    otypes = [theano.tensor.dmatrix]

    def perform(self, node, inputs, output_storage):
        x = inputs[0]
        z = output_storage[0]
        z[0] = x * 2

    def infer_shape(self, node, i0_shapes):
        return i0_shapes

    def grad(self, inputs, output_grads):
        return [output_grads[0] * 2]

    def R_op(self, inputs, eval_points):
        # R_op can receive None as eval_points.
        # That means there is no differentiable path through that input.
        # If this implies that you cannot compute some outputs,
        # return None for those.
        if eval_points[0] is None:
            return eval_points
        return self.grad(inputs, eval_points)

doubleOp2 = DoubleOp2()
At a high level, the code fragment declares a class (e.g., DoubleOp1) and then creates one instance of it (e.g., doubleOp1). We often gloss over this distinction, but will be precise here: doubleOp1 (the instance) is an Op, not DoubleOp1 (the class, which is a subclass of theano.Op). You can call doubleOp1(tensor.vector()) on a Variable to build an expression, and in the expression there will be a .op attribute that refers to doubleOp1.
The make_node method creates a node to be included in the expression graph. It runs when we apply our Op (doubleOp1) to the Variable (x), as in doubleOp1(tensor.vector()).
When an Op has multiple inputs, their order in the inputs argument to Apply
is important: Theano will call make_node(*inputs)
to copy the graph,
so it is important not to change the semantics of the expression by changing
the argument order.
All the inputs and outputs arguments to Apply must be Variables. A common and easy way to ensure inputs are Variables is to run them through as_tensor_variable. This function leaves TensorType variables alone, raises an error for non-TensorType variables, and copies any numpy.ndarray into the storage for a TensorType Constant. The make_node method dictates the appropriate Type for all output variables.
The perform method implements the Op’s mathematical logic in Python. The inputs (here x) are passed by value, but a single output is returned indirectly as the first element of a single-element list. If doubleOp1 had a second output, it would be stored in output_storage[1][0].
In some execution modes, the output storage might contain the return value of a previous call. That old value can be reused to avoid memory reallocation, but it must not influence the semantics of the Op output.
You can try the new Op as follows:
import theano
x = theano.tensor.matrix()
f = theano.function([x], DoubleOp1()(x))
import numpy
inp = numpy.random.rand(5, 4)
out = f(inp)
assert numpy.allclose(inp * 2, out)
print(inp)
print(out)
[[ 0.08257206 0.34308357 0.5288043 0.06582951]
[ 0.65977826 0.10040307 0.5402353 0.55472296]
[ 0.82358552 0.29502171 0.97387481 0.0080757 ]
[ 0.77327215 0.65401857 0.76562992 0.94145702]
[ 0.8452076 0.30500101 0.88430501 0.95818655]]
[[ 0.16514411 0.68616713 1.0576086 0.13165902]
[ 1.31955651 0.20080613 1.08047061 1.10944593]
[ 1.64717104 0.59004341 1.94774962 0.0161514 ]
[ 1.5465443 1.30803715 1.53125983 1.88291403]
[ 1.6904152 0.61000201 1.76861002 1.9163731 ]]
import theano
x = theano.tensor.matrix()
f = theano.function([x], DoubleOp2()(x))
import numpy
inp = numpy.random.rand(5, 4)
out = f(inp)
assert numpy.allclose(inp * 2, out)
print(inp)
print(out)
[[ 0.02443785 0.67833979 0.91954769 0.95444365]
[ 0.60853382 0.7770539 0.78163219 0.92838837]
[ 0.04427765 0.37895602 0.23155797 0.4934699 ]
[ 0.20551517 0.7419955 0.34500905 0.49347629]
[ 0.24082769 0.49321452 0.24566545 0.15351132]]
[[ 0.04887571 1.35667957 1.83909538 1.90888731]
[ 1.21706764 1.55410779 1.56326439 1.85677674]
[ 0.08855531 0.75791203 0.46311594 0.9869398 ]
[ 0.41103034 1.48399101 0.69001811 0.98695258]
[ 0.48165539 0.98642904 0.4913309 0.30702264]]
Example: __props__ definition

We can modify the previous piece of code in order to demonstrate the usage of the __props__ attribute.

We create an Op that takes a variable x and returns a*x+b. We want to say that two such ops are equal when their values of a and b are equal.
import theano

class AXPBOp(theano.Op):
    """
    This creates an Op that takes x to a*x+b.
    """
    __props__ = ("a", "b")

    def __init__(self, a, b):
        self.a = a
        self.b = b
        super(AXPBOp, self).__init__()

    def make_node(self, x):
        x = theano.tensor.as_tensor_variable(x)
        return theano.Apply(self, [x], [x.type()])

    def perform(self, node, inputs, output_storage):
        x = inputs[0]
        z = output_storage[0]
        z[0] = self.a * x + self.b

    def infer_shape(self, node, i0_shapes):
        return i0_shapes

    def grad(self, inputs, output_grads):
        # d(a*x + b)/dx = a
        return [self.a * output_grads[0]]
The use of __props__ saves the user the trouble of implementing __eq__() and __hash__() manually. It also generates a default __str__() method that prints the attribute names and their values.
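A rough sketch of what the automatically generated methods amount to (illustration only; this is not Theano's actual implementation) is to compare the tuple of listed property values:

```python
# Hypothetical illustration of __props__-based equality and hashing:
# two instances compare equal when their listed properties match.
class PropsEq:
    __props__ = ("a", "b")

    def __init__(self, a, b):
        self.a, self.b = a, b

    def _props(self):
        return tuple(getattr(self, p) for p in self.__props__)

    def __eq__(self, other):
        return type(self) == type(other) and self._props() == other._props()

    def __hash__(self):
        return hash((type(self), self._props()))

assert PropsEq(4, 5) == PropsEq(4, 5)
assert PropsEq(4, 5) != PropsEq(2, 3)
assert len({PropsEq(4, 5), PropsEq(4, 5)}) == 1   # equal instances hash alike
```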
We can test this by running the following segment:
mult4plus5op = AXPBOp(4, 5)
another_mult4plus5op = AXPBOp(4, 5)
mult2plus3op = AXPBOp(2, 3)
assert mult4plus5op == another_mult4plus5op
assert mult4plus5op != mult2plus3op
x = theano.tensor.matrix()
f = theano.function([x], mult4plus5op(x))
g = theano.function([x], mult2plus3op(x))
import numpy
inp = numpy.random.rand(5, 4).astype(numpy.float32)
assert numpy.allclose(4 * inp + 5, f(inp))
assert numpy.allclose(2 * inp + 3, g(inp))
How To Test it

Theano has some functionalities to simplify testing. These help test the infer_shape, grad and R_op methods. Put the following code in a file and execute it with the theano-nose program.
Basic Tests

Basic tests are done by you, simply by using the op and checking that it returns the right answer. If you detect an error, you must raise an exception. You can use the assert keyword to automatically raise an AssertionError.
import numpy
import theano
from theano.tests import unittest_tools as utt
from theano import config

class test_Double(utt.InferShapeTester):
    def setUp(self):
        super(test_Double, self).setUp()
        self.op_class = DoubleOp
        self.op = DoubleOp()

    def test_basic(self):
        x = theano.tensor.matrix()
        f = theano.function([x], self.op(x))
        inp = numpy.asarray(numpy.random.rand(5, 4), dtype=config.floatX)
        out = f(inp)
        # Compare the result computed to the expected value.
        utt.assert_allclose(inp * 2, out)
We call utt.assert_allclose(expected_value, value) to compare NumPy ndarrays. On failure, it raises an error with more information. Also, the default tolerance can be changed with the Theano flag config.tensor.cmp_sloppy, which takes values 0, 1 and 2. The default value (0) performs the strictest comparison; 1 and 2 make less strict comparisons.
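The effect of stricter and looser tolerances can be illustrated directly with numpy.allclose, which underlies this kind of comparison:

```python
import numpy

# A small perturbation is rejected by a strict absolute tolerance
# but accepted by a sloppier one.
a = numpy.array([1.0, 2.0])
b = a + 1e-6                                         # small perturbation
assert not numpy.allclose(a, b, rtol=0, atol=1e-8)   # strict: fails
assert numpy.allclose(a, b, rtol=0, atol=1e-4)       # sloppy: passes
```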
Testing the infer_shape
When a class inherits from the InferShapeTester
class, it gets the
self._compile_and_check
method that tests the op’s infer_shape
method. It tests that the op gets optimized out of the graph if only
the shape of the output is needed and not the output
itself. Additionally, it checks that the optimized graph computes
the correct shape, by comparing it to the actual shape of the computed
output.
self._compile_and_check
compiles a Theano function. It takes as
parameters the lists of input and output Theano variables, as would be
provided to theano.function
, and a list of real values to pass to the
compiled function. It also takes the op class as a parameter
in order to verify that no instance of it appears in the shapeoptimized graph.
If there is an error, the function raises an exception. If you want to
see it fail, you can implement an incorrect infer_shape
.
When testing with input values with shapes that take the same value
over different dimensions (for instance, a square matrix, or a tensor3
with shape (n, n, n), or (m, n, m)), it is not possible to detect if
the output shape was computed correctly, or if some shapes with the
same value have been mixed up. For instance, if the infer_shape uses
the width of a matrix instead of its height, then testing with only
square matrices will not detect the problem. This is why the
self._compile_and_check
method prints a warning in such a case. If
your op works only with such matrices, you can disable the warning with the
warn=False
parameter.
from theano.tests import unittest_tools as utt
from theano import config

class test_Double(utt.InferShapeTester):
    # [...] as previous tests.

    def test_infer_shape(self):
        x = theano.tensor.matrix()
        self._compile_and_check(
            [x],            # theano.function inputs
            [self.op(x)],   # theano.function outputs
            # Always use a non-square matrix!
            # inputs data
            [numpy.asarray(numpy.random.rand(5, 4),
                           dtype=config.floatX)],
            # Op that should be removed from the graph.
            self.op_class)
Testing the gradient
The function verify_grad verifies the gradient of an op or Theano graph. It compares the analytic (symbolically computed) gradient and the numeric gradient (computed through the Finite Difference Method).
If there is an error, the function raises an exception. If you want to see it fail, you can implement an incorrect gradient (for instance, by removing the multiplication by 2).
def test_grad(self):
    theano.tests.unittest_tools.verify_grad(self.op,
                                            [numpy.random.rand(5, 7, 2)])
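The idea behind verify_grad can be sketched in pure NumPy (this is an illustration of the finite-difference check, not Theano's implementation):

```python
import numpy

# Estimate the gradient of a scalar-valued function by central finite
# differences and compare it to the analytic gradient.
def fd_grad(f, x, eps=1e-6):
    g = numpy.zeros_like(x)
    for i in range(x.size):
        d = numpy.zeros_like(x)
        d.flat[i] = eps
        g.flat[i] = (f(x + d) - f(x - d)) / (2 * eps)
    return g

f = lambda x: (x * 2).sum()            # DoubleOp followed by a sum
x = numpy.random.rand(5)
analytic = numpy.full_like(x, 2.0)     # d(sum(2*x))/dx = 2 everywhere
assert numpy.allclose(fd_grad(f, x), analytic, atol=1e-4)
```

If the analytic gradient were wrong (say, missing the factor 2), the two estimates would disagree and the check would fail, which is exactly how verify_grad catches incorrect grad() implementations.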
Testing the Rop

The class RopLop_checker defines the functions RopLop_checker.check_mat_rop_lop(), RopLop_checker.check_rop_lop() and RopLop_checker.check_nondiff_rop(). These allow testing the implementation of the R_op method of a particular op.

For instance, to verify the R_op method of the DoubleOp, you can use this:
import numpy
import theano.tests
from theano.tests.test_rop import RopLop_checker

class test_DoubleRop(RopLop_checker):
    def setUp(self):
        super(test_DoubleRop, self).setUp()

    def test_double_rop(self):
        self.check_rop_lop(DoubleRop()(self.x), self.in_shape)
Running Your Tests

To perform your tests, you may select any one of the following three methods:

The method of choice to conduct tests is to run the file theano-nose. In a regular Theano installation, the latter will be on the operating system’s path and directly accessible from any folder. Otherwise, it can be accessed in the Theano/bin folder. The following command lines may be used for the corresponding purposes:

- theano-nose theano: Run every test found in Theano’s path.
- theano-nose folder_name: Run every test found in the folder folder_name.
- theano-nose test_file.py: Run every test found in the file test_file.py.
The following are particularly useful for development purposes since they call for particular classes or even for particular tests:

- theano-nose test_file.py:test_DoubleRop: Run every test found inside the class test_DoubleRop.
- theano-nose test_file.py:test_DoubleRop.test_double_op: Run only the test test_double_op in the class test_DoubleRop.
Help with the use and functionalities of theano-nose may be obtained by running it with the command-line parameter --help (-h).

The command nosetests can also be used. Although it lacks the useful functionalities that theano-nose provides, nosetests can be called similarly to theano-nose from any folder in Python’s path like so: nosetests [suffix similar to the above]. More documentation on nosetests is available here: nosetests.
One may also add a block of code similar to the following at the end of the file containing a specific test of interest and run the file. In this example, the test test_double_rop in the class test_DoubleRop would be performed.

if __name__ == '__main__':
    t = test_DoubleRop("test_double_rop")
    t.setUp()
    t.test_double_rop()
We recommend that when you execute a file, you run all the tests in that file. This can be done by adding this at the end of your test files:

import unittest

if __name__ == '__main__':
    unittest.main()
Exercise

- Run the code of the DoubleOp example above.
- Modify and execute to compute: x * y.
- Modify and execute the example to return two outputs: x + y and x - y.
- You can omit the Rop functions. Try to implement the testing apparatus described above.

(Notice that Theano’s current elemwise fusion optimization is only applicable to computations involving a single output. Hence, to gain efficiency over the basic solution that is asked here, the two operations would have to be jointly optimized explicitly in the code.)
Making test errors more reproducible is a good practice. To make your tests more reproducible, you need a way to get the same random numbers. You can do this by seeding NumPy’s random number generator.
For convenience, the classes InferShapeTester and RopLop_checker already do this for you. If you implement your own setUp function, don’t forget to call the parent setUp function.
For more details see Using Random Values in Test Cases.
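Seeding the generator, as suggested above, makes the drawn values repeatable from one run to the next:

```python
import numpy

# Same seed, same "random" numbers: this is what makes
# randomized test data reproducible.
numpy.random.seed(42)
a = numpy.random.rand(3)
numpy.random.seed(42)
b = numpy.random.rand(3)
assert numpy.allclose(a, b)
```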
as_op

as_op is a Python decorator that converts a Python function into a basic Theano op that will call the supplied function during execution. This isn’t the recommended way to build an op, but it allows for a quick implementation. It takes an optional infer_shape() parameter that must have this signature:
def infer_shape(node, input_shapes):
    # ...
    return output_shapes

input_shapes and output_shapes are lists of tuples that represent the shapes of the corresponding inputs/outputs.
Note
Not providing the infer_shape method prevents shape-related optimizations from working with this op. For example, your_op(inputs, ...).shape will require the op to be executed just to get the shape.
Note
As no grad is defined, this means you won’t be able to differentiate paths that include this op.
Note
It converts the Python function to a callable object that takes as inputs Theano variables that were declared.
Note
The Python function wrapped by the as_op decorator needs to return a new data allocation; no views or in-place modification of the inputs.
as_op Example

import theano
import numpy
from theano import function
from theano.compile.ops import as_op

def infer_shape_numpy_dot(node, input_shapes):
    ashp, bshp = input_shapes
    return [ashp[:-1] + bshp[-1:]]

@as_op(itypes=[theano.tensor.fmatrix, theano.tensor.fmatrix],
       otypes=[theano.tensor.fmatrix], infer_shape=infer_shape_numpy_dot)
def numpy_dot(a, b):
    return numpy.dot(a, b)
You can try it as follows:
x = theano.tensor.fmatrix()
y = theano.tensor.fmatrix()
f = function([x, y], numpy_dot(x, y))
inp1 = numpy.random.rand(5, 4).astype('float32')
inp2 = numpy.random.rand(4, 7).astype('float32')
out = f(inp1, inp2)
Exercise

- Run the code of the numpy_dot example above.
- Modify and execute to compute: numpy.add and numpy.subtract.
- Modify and execute the example to return two outputs: x + y and x - y.
Documentation and Coding Style
Please always respect the Requirements for Quality Contributions or your contribution will not be accepted.
NanGuardMode and AllocEmpty

NanGuardMode helps users find where in the graph NaNs appear. But sometimes, we want some variables not to be checked. For example, in the old GPU backend, we used a float32 CudaNdarray to store the MRG random number generator state (which holds integers). So if NanGuardMode checked it, it would generate false positives. Another case is related to [Gpu]AllocEmpty or some computation on it (like that done by Scan).
You can tell NanGuardMode not to check a variable with: variable.tag.nan_guard_mode_check = False. Also, this tag automatically follows that variable during optimization. This means that if you tag a variable that gets replaced by an in-place version, the replacement will keep that tag.
Final Note
A more extensive discussion of this section’s content may be found in the advanced tutorial Extending Theano.
The section Other ops includes more instructions for specific cases.
Extending Theano with a C Op

This tutorial covers how to extend Theano with an op that offers a C implementation. It does not cover ops that run on a GPU, but it does introduce many elements and concepts which are relevant for GPU ops. This tutorial is aimed at individuals who already know how to extend Theano by adding a new op with a Python implementation (see tutorial Creating a new Op: Python implementation) and will only cover the additional knowledge required to also produce ops with C implementations.

Providing a Theano op with a C implementation requires interacting with Python’s C-API and NumPy’s C-API. Thus, the first step of this tutorial is to introduce both and highlight the features most relevant to the task of implementing a C op. This tutorial then introduces the most important methods that the op needs to implement in order to provide a usable C implementation. Finally, it shows how to combine these elements to write a simple C op for performing the simple task of multiplying every element in a vector by a scalar.
Python C-API

Python provides a C-API to allow the manipulation of Python objects from C code. In this API, all variables that represent Python objects are of type PyObject *. All objects have a pointer to their type object and a reference count field (that is shared with the Python side). Most Python methods have an equivalent C function that can be called on the PyObject * pointer.
As such, manipulating a PyObject instance is often straightforward but it is important to properly manage its reference count. Failing to do so can lead to undesired behavior in the C code.
Reference counting

Reference counting is a mechanism for keeping track, for each object, of the number of references to it held by other entities. This mechanism is often used for garbage collection because it makes it easy to see if an object is still being used by other entities. When the reference count for an object drops to 0, the object is not used by anyone any longer and can be safely deleted.
PyObjects implement reference counting and the Python C-API defines a number of macros to help manage those reference counts. The definition of these macros can be found here: Python C-API Reference Counting. Listed below are the two macros most often used in Theano C ops.

void Py_XINCREF(PyObject *o)
    Increments the reference count of object o. Without effect if the object is NULL.

void Py_XDECREF(PyObject *o)
    Decrements the reference count of object o. If the reference count reaches 0, it will trigger a call of the object’s deallocation function. Without effect if the object is NULL.

The general principle, in the reference counting paradigm, is that the owner of a reference to an object is responsible for disposing of it properly. This can be done by decrementing the reference count once the reference is no longer used, or by transferring ownership: passing the reference on to a new owner which then becomes responsible for it.

Some functions return “borrowed references”; this means that they return a reference to an object without transferring ownership of the reference to the caller of the function. If you call a function which returns a borrowed reference, you do not have the burden of properly disposing of that reference, and you should not call Py_XDECREF() on it.
Correctly managing the reference counts is important as failing to do so can lead to issues ranging from memory leaks to segmentation faults.
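The same reference counts can be observed from the Python side with sys.getrefcount, which is a convenient way to build intuition for the C-API macros (note that getrefcount reports one extra count for its own argument; this behaviour is specific to CPython):

```python
import sys

# Taking and dropping a reference changes the count by exactly one.
obj = object()
base = sys.getrefcount(obj)
ref2 = obj                         # a new reference increments the count
assert sys.getrefcount(obj) == base + 1
del ref2                           # dropping it restores the count
assert sys.getrefcount(obj) == base
```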
NumPy C-API

The NumPy library provides a C-API to allow users to create, access and manipulate NumPy arrays from within their own C routines. NumPy’s ndarrays are used extensively inside Theano, and so extending Theano with a C op will require interaction with the NumPy C-API.

This section covers the API elements that are often required to write code for a Theano C op. The full documentation for the API can be found here: NumPy C-API.
NumPy data types

To allow portability between platforms, the NumPy C-API defines its own data types which should be used whenever you are manipulating a NumPy array’s internal data. The data types most commonly used to implement C ops are the following: npy_int{8,16,32,64}, npy_uint{8,16,32,64} and npy_float{32,64}.
You should use these data types when manipulating a NumPy array’s internal data instead of C primitives because the size of the memory representation of C primitives can vary between platforms. For instance, a C long can be represented in memory with 4 bytes, but it can also be represented with 8. On the other hand, the in-memory size of NumPy data types remains constant across platforms. Using them will make your code simpler and more portable. The full list of defined data types can be found here: NumPy C-API data types.
NumPy ndarrays

In the NumPy C-API, NumPy arrays are represented as instances of the PyArrayObject class, which is a descendant of the PyObject class. This means that, as for any other Python object that you manipulate from C code, you need to appropriately manage the reference counts of PyArrayObject instances.

Unlike a standard multidimensional C array, a NumPy array’s internal data representation does not have to occupy a contiguous region in memory. In fact, it can be C-contiguous, F-contiguous or non-contiguous. C-contiguous means that the data is not only contiguous in memory, but also that it is organized such that the index of the last dimension changes the fastest. If the following array
x = [[1, 2, 3],
[4, 5, 6]]
is C-contiguous, it means that, in memory, the six values contained in the array x are stored in the order [1, 2, 3, 4, 5, 6] (the first value is x[0,0], the second value is x[0,1], the third value is x[0,2], the fourth value is x[1,0], etc.). F-contiguous (or Fortran-contiguous) also means that the data is contiguous, but that it is organized such that the index of the last dimension changes the slowest. If the array x is F-contiguous, it means that, in memory, the values appear in the order [1, 4, 2, 5, 3, 6] (the first value is x[0,0], the second value is x[1,0], the third value is x[0,1], etc.).
Finally, the internal data can be non-contiguous. In this case, it occupies a non-contiguous region in memory but it is still stored in an organized fashion: the distance between the element x[i,j] and the element x[i+1,j] of the array is constant over all valid values of i and j, just as the distance between the element x[i,j] and the element x[i,j+1] of the array is constant over all valid values of i and j. This distance between consecutive elements of an array over a given dimension is called the stride of that dimension.
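Strides can likewise be inspected from Python via ndarray.strides, which, like PyArray_STRIDES described below, is expressed in bytes:

```python
import numpy

# For a C-contiguous (2, 3) float64 array, moving to the next row skips
# 3 elements of 8 bytes each; moving to the next column skips 8 bytes.
x = numpy.zeros((2, 3), dtype=numpy.float64)
assert x.strides == (24, 8)

y = x[:, ::2]                  # slicing can produce a non-contiguous view
assert y.strides == (24, 16)   # every other column: 16 bytes apart
assert not y.flags['C_CONTIGUOUS']
```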
Accessing NumPy ndarrays’ data and properties

The following macros serve to access various attributes of NumPy ndarrays.

void* PyArray_DATA(PyArrayObject* arr)
    Returns a pointer to the first element of the array’s data. The returned pointer must be cast to a pointer of the proper NumPy C-API data type before use.

int PyArray_NDIM(PyArrayObject* arr)
    Returns the number of dimensions in the array pointed to by arr.

npy_intp* PyArray_DIMS(PyArrayObject* arr)
    Returns a pointer to the first element of arr’s internal array describing its dimensions. This internal array contains as many elements as the array arr has dimensions. The macro PyArray_SHAPE() is a synonym of PyArray_DIMS(): it has the same effect and is used in an identical way.

npy_intp* PyArray_STRIDES(PyArrayObject* arr)
    Returns a pointer to the first element of arr’s internal array describing the stride of each of its dimensions. This array has as many elements as the number of dimensions in arr. In this array, the strides are expressed in number of bytes.

PyArray_Descr* PyArray_DESCR(PyArrayObject* arr)
    Returns a reference to the object representing the dtype of the array. The macro PyArray_DTYPE() is a synonym of PyArray_DESCR(): it has the same effect and is used in an identical way. Note: this is a borrowed reference, so you do not need to decrement its reference count once you are done with it.

int PyArray_TYPE(PyArrayObject* arr)
    Returns the typenumber of the elements of the array. Like the dtype, the typenumber is a descriptor of the type of the data in the array. However, the two are not synonyms and, as such, cannot be used in place of each other.

npy_intp PyArray_SIZE(PyArrayObject* arr)
    Returns the total number of elements in the array.

bool PyArray_CHKFLAGS(PyArrayObject* arr, flags)
    Returns true if the array has the specified flags. The variable flags should either be a NumPy array flag or an integer obtained by applying bitwise OR to an ensemble of flags. The flags that can be used with this macro are: NPY_ARRAY_C_CONTIGUOUS, NPY_ARRAY_F_CONTIGUOUS, NPY_ARRAY_OWNDATA, NPY_ARRAY_ALIGNED, NPY_ARRAY_WRITEABLE, NPY_ARRAY_UPDATEIFCOPY.
Creating NumPy ndarraysÂ¶
The following functions allow the creation and copy of NumPy arrays :

PyObject* PyArray_EMPTY(int nd, npy_intp* dims, typenum dtype, int fortran)
Constructs a new ndarray with the number of dimensions specified by nd, shape specified by dims and data type specified by dtype. If fortran is equal to 0, the data is organized in a C-contiguous layout, otherwise it is organized in an F-contiguous layout. The array elements are not initialized in any way.
The function PyArray_Empty() performs the same function as the macro PyArray_EMPTY() but the data type is given as a pointer to a PyArray_Descr object instead of a typenum.

PyObject* PyArray_ZEROS(int nd, npy_intp* dims, typenum dtype, int fortran)
Constructs a new ndarray with the number of dimensions specified by nd, shape specified by dims and data type specified by dtype. If fortran is equal to 0, the data is organized in a C-contiguous layout, otherwise it is organized in an F-contiguous layout. Every element in the array is initialized to 0.
The function PyArray_Zeros() performs the same function as the macro PyArray_ZEROS() but the data type is given as a pointer to a PyArray_Descr object instead of a typenum.

PyArrayObject* PyArray_GETCONTIGUOUS(PyObject* op)
Returns a C-contiguous and well-behaved copy of the array op. If op is already C-contiguous and well-behaved, this function simply returns a new reference to op.
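These creation functions have direct numpy-level analogues, which can help build intuition for the fortran flag and for PyArray_GETCONTIGUOUS (a sketch; the C macros themselves operate on PyArrayObject pointers, not Python objects):

```python
import numpy as np

# PyArray_EMPTY(nd, dims, typenum, 0) ~ an uninitialized C-contiguous array
a = np.empty((2, 3), dtype=np.float64, order='C')
assert a.flags['C_CONTIGUOUS']

# fortran != 0 ~ an F-contiguous layout; PyArray_ZEROS also zero-fills
b = np.zeros((2, 3), dtype=np.float64, order='F')
assert b.flags['F_CONTIGUOUS']
assert b.sum() == 0.0

# PyArray_GETCONTIGUOUS ~ np.ascontiguousarray: yields a C-contiguous
# version of the array, copying only when necessary
c = np.ascontiguousarray(b)
assert c.flags['C_CONTIGUOUS']
```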
Methods the C Op needs to define
There is a key difference between an op defining a Python implementation for its computation and defining a C implementation. In the case of a Python implementation, the op defines a function perform() which executes the required Python code to realize the op. In the case of a C implementation, however, the op does not define a function that will execute the C code; it instead defines functions that will return the C code to the caller.
This is because calling C code from Python code comes with a significant overhead. If every op was responsible for executing its own C code, every time a Theano function was called, this overhead would occur as many times as the number of ops with C implementations in the function’s computational graph.
To maximize performance, Theano instead requires the C ops to simply return the code needed for their execution and takes upon itself the task of organizing, linking and compiling the code from the various ops. Through this, Theano is able to minimize the number of times C code is called from Python code.
The following is a very simple example to illustrate how it’s possible to obtain performance gains with this process. Suppose you need to execute, from Python code, 10 different ops, each one having a C implementation. If each op was responsible for executing its own C code, the overhead of calling C code from Python code would occur 10 times. Consider now the case where the ops instead return the C code for their execution. You could get the C code from each op and then define your own C module that would call the C code from each op in succession. In this case, the overhead would only occur once; when calling your custom module itself.
Moreover, the fact that Theano itself takes care of compiling the C code, instead of the individual ops, allows Theano to easily cache the compiled C code. This allows for faster compilation times.
See Implementing the arithmetic Ops in C for the full documentation of the various methods of the class Op that are related to the C implementation. Of particular interest are:
 The methods Op.c_libraries() and Op.c_lib_dirs() to allow your op to use external libraries.
 The method Op.c_code_cleanup() to specify how the op should clean up what it has allocated during its execution.
 The methods Op.c_init_code() and Op.c_init_code_apply() to specify code that should be executed once when the module is initialized, before anything else is executed.
 The methods Op.c_compile_args() and Op.c_no_compile_args() to specify requirements regarding how the op’s C code should be compiled.
This section describes the methods Op.c_code(), Op.c_support_code(), Op.c_support_code_apply() and Op.c_code_cache_version() because they are the ones that are most commonly used.

c_code(node, name, input_names, output_names, sub)
This method returns a string containing the C code to perform the computation required by this op.
The node argument is an Apply node representing an application of the current Op on a list of inputs, producing a list of outputs.
input_names is a sequence of strings which contains as many strings as the op has inputs. Each string contains the name of the C variable to which the corresponding input has been assigned. For example, the name of the C variable representing the first input of the op is given by input_names[0]. You should therefore use this name in your C code to interact with that variable.
output_names is used identically to input_names, but for the op’s outputs.
Finally, sub is a dictionary of extra parameters to the c_code method. Among other things, it contains sub['fail'] which is a string of C code that you should include in your C code (after ensuring that a Python exception is set) if it needs to raise an exception. Ex:

c_code = """
    PyErr_Format(PyExc_ValueError, "X does not have the right value");
    %(fail)s;
""" % {'fail' : sub['fail']}

to raise a ValueError Python exception with the specified message. The function PyErr_Format() supports string formatting so it is possible to tailor the error message to the specifics of the error that occurred. If PyErr_Format() is called with more than two arguments, the subsequent arguments are used to format the error message with the same behavior as the function PyString_FromFormat(). The % characters in the format string need to be escaped since the C code itself is defined in a string which undergoes string formatting.

c_code = """
    PyErr_Format(PyExc_ValueError,
                 "X==%%i but it should be greater than 0", X);
    %(fail)s;
""" % {'fail' : sub['fail']}

Note: Your C code should not return the output of the computation but rather put the results in the C variables whose names are contained in output_names.
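The template expansion above is ordinary Python %-formatting, which is why literal % characters in the C format string must be doubled. A minimal sketch of the mechanics, using a stand-in for the sub dictionary (the C fragment is never compiled here):

```python
# 'sub' stands in for the dictionary Theano passes to c_code;
# 'goto fail_label;' is a hypothetical fail snippet for illustration.
sub = {'fail': 'goto fail_label;'}

template = """
PyErr_Format(PyExc_ValueError,
             "X==%%i but it should be greater than 0", X);
%(fail)s
"""

expanded = template % {'fail': sub['fail']}

# %(fail)s was substituted, and '%%' collapsed to a single '%'
assert 'goto fail_label;' in expanded
assert '"X==%i but it should be greater than 0"' in expanded
```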

c_support_code()
Returns a string or a list of strings containing some support C code for this op. This code will be included at the global scope level and can be used to define functions and structs that will be used by every apply of this op.

c_support_code_apply(node, name)
Returns a string containing some support C code for this op. This code will be included at the global scope level and can be used to define functions and structs that will be used by this op. The difference between this method and c_support_code() is that the C code specified in c_support_code_apply() should be specific to each apply of the Op, while c_support_code() is for support code that is not specific to each apply.
Both c_support_code() and c_support_code_apply() are necessary because a Theano op can be used more than once in a given Theano function. For example, an op that adds two matrices could be used at some point in the Theano function to add matrices of integers and, at another point, to add matrices of doubles. Because the dtype of the inputs and outputs can change between different applies of the op, any support code that relies on a certain dtype is specific to a given apply of the op and should therefore be defined in c_support_code_apply().

c_code_cache_version()
Returns a tuple of integers representing the version of the C code in this op. Ex: (1, 4, 0) for version 1.4.0.
This tuple is used by Theano to cache the compiled C code for this op. As such, the return value MUST BE CHANGED every time the C code is altered or else Theano will disregard the change in the code and simply load a previous version of the op from the cache. If you want to avoid caching of the C code of this op, return an empty tuple or do not implement this method.
Note: Theano can handle tuples of any hashable objects as return values for this function but, for greater readability and easier management, this function should return a tuple of integers as previously described.
Important restrictions when implementing an Op
There are some important restrictions to remember when implementing an Op.
Unless your Op correctly defines a view_map attribute, perform and c_code must not produce outputs whose memory is aliased to any input (technically, if changing the output could change the input object in some sense, they are aliased).
Unless your Op correctly defines a destroy_map attribute, perform and c_code must not modify any of the inputs.
TODO: EXPLAIN DESTROYMAP and VIEWMAP BETTER AND GIVE EXAMPLE.
When developing an Op, you should run computations in DebugMode, by using argument mode='DebugMode' to theano.function. DebugMode is slow, but it can catch many common violations of the Op contract.
TODO: Like what? How? Talk about Python vs. C too.
DebugMode is no silver bullet though. For example, if you modify an Op self.* during any of make_node, perform, or c_code, you are probably doing something wrong but DebugMode will not detect this.
TODO: jpt: I don’t understand the following sentence.
Ops and Types should usually be considered immutable – you should definitely not make a change that would have an impact on __eq__, __hash__, or the mathematical value that would be computed by perform or c_code.
Simple C Op example
In this section, we put together the concepts that were covered in this tutorial to generate an op which multiplies every element in a vector by a scalar and returns the resulting vector. This is intended to be a simple example so the methods c_support_code() and c_support_code_apply() are not used because they are not required.
In the C code below notice how the reference count on the output variable is managed. Also take note of how the new variables required for the op’s computation are declared in a new scope to avoid cross-initialization errors.
Also, in the C code, it is very important to properly validate the inputs and outputs storage. Theano guarantees that the inputs exist and have the right number of dimensions but it does not guarantee their exact shape. For instance, if an op computes the sum of two vectors, it needs to validate that its two inputs have the same shape. In our case, we do not need to validate the exact shapes of the inputs because we don’t need them to match in any way.
For the outputs, things are a little bit more subtle. Theano does not guarantee that they have been allocated but it does guarantee that, if they have been allocated, they have the right number of dimensions. Again, Theano offers no guarantee on the exact shapes. This means that, in our example, we need to validate that the output storage has been allocated and has the same shape as our vector input. If that is not the case, we allocate a new output storage with the right shape and number of dimensions.
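This validate-or-allocate pattern for output storage can be sketched at the Python level (a hypothetical helper for illustration; the real logic lives in the C code below and must also manage reference counts):

```python
import numpy as np

def ensure_output_storage(out, x):
    """Reuse the provided storage only if it exists and matches x's
    shape; otherwise allocate fresh storage (contents uninitialized),
    mirroring the PyArray_EMPTY branch of the C code."""
    if out is None or out.shape != x.shape:
        out = np.empty(x.shape, dtype=x.dtype)
    return out

x = np.arange(5, dtype=np.float64)

# No pre-existing storage: a new array is allocated
z = ensure_output_storage(None, x)
assert z.shape == (5,)

# Matching storage is reused as-is
z2 = ensure_output_storage(z, x)
assert z2 is z

# Mismatched storage is replaced
z3 = ensure_output_storage(np.empty(3), x)
assert z3.shape == (5,)
```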
import numpy
import theano
from theano import gof
import theano.tensor as T


class VectorTimesScalar(gof.Op):
    __props__ = ()

    def make_node(self, x, y):
        # Validate the inputs' type
        if x.type.ndim != 1:
            raise TypeError('x must be a 1d vector')
        if y.type.ndim != 0:
            raise TypeError('y must be a scalar')

        # Create an output variable of the same type as x
        output_var = x.type()

        return gof.Apply(self, [x, y], [output_var])

    def c_code_cache_version(self):
        return (1, 0)

    def c_code(self, node, name, inp, out, sub):
        x, y = inp
        z, = out

        # Extract the dtypes of the inputs and outputs storage to
        # be able to declare pointers for those dtypes in the C
        # code.
        dtype_x = node.inputs[0].dtype
        dtype_y = node.inputs[1].dtype
        dtype_z = node.outputs[0].dtype

        itemsize_x = numpy.dtype(dtype_x).itemsize
        itemsize_z = numpy.dtype(dtype_z).itemsize

        fail = sub['fail']

        c_code = """
        // Validate that the output storage exists and has the same
        // dimension as x.
        if (NULL == %(z)s ||
            PyArray_DIMS(%(x)s)[0] != PyArray_DIMS(%(z)s)[0])
        {
            /* Reference received to invalid output variable.
            Decrease received reference's ref count and allocate new
            output variable */
            Py_XDECREF(%(z)s);
            %(z)s = (PyArrayObject*)PyArray_EMPTY(1,
                                                  PyArray_DIMS(%(x)s),
                                                  PyArray_TYPE(%(x)s),
                                                  0);

            if (!%(z)s) {
                %(fail)s;
            }
        }

        // Perform the vector multiplication by a scalar
        {
            /* The declaration of the following variables is done in a
            new scope to prevent cross initialization errors */
            npy_%(dtype_x)s* x_data_ptr =
                            (npy_%(dtype_x)s*)PyArray_DATA(%(x)s);
            npy_%(dtype_z)s* z_data_ptr =
                            (npy_%(dtype_z)s*)PyArray_DATA(%(z)s);
            npy_%(dtype_y)s y_value =
                            ((npy_%(dtype_y)s*)PyArray_DATA(%(y)s))[0];
            int x_stride = PyArray_STRIDES(%(x)s)[0] / %(itemsize_x)s;
            int z_stride = PyArray_STRIDES(%(z)s)[0] / %(itemsize_z)s;
            int x_dim = PyArray_DIMS(%(x)s)[0];

            for (int i = 0; i < x_dim; i++)
            {
                z_data_ptr[i * z_stride] = (x_data_ptr[i * x_stride] *
                                            y_value);
            }
        }
        """

        return c_code % locals()
The c_code method accepts variable names as arguments (name, inp, out, sub) and returns a C code fragment that computes the expression output. In case of error, the %(fail)s statement cleans up and returns properly.
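The C code divides PyArray_STRIDES (bytes) by the itemsize to get an element stride. The same relation can be checked from Python, and the strided loop then matches plain numpy arithmetic (a sanity-check sketch, not part of the op):

```python
import numpy as np

base = np.arange(10, dtype=np.float64)
x = base[::2]            # non-contiguous view: element stride of 2

itemsize_x = x.itemsize
x_stride = x.strides[0] // itemsize_x
assert x_stride == 2

# The loop `z[i] = x_data_ptr[i * x_stride] * y_value` over the
# underlying buffer computes exactly x * y_value:
y_value = 3.0
x_data_ptr = base        # the buffer the view is built on
z = np.array([x_data_ptr[i * x_stride] * y_value for i in range(len(x))])
assert np.array_equal(z, x * y_value)
```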
More complex C Op example
This section introduces a new example, slightly more complex than the previous one, with an op to perform an elementwise multiplication between the elements of two vectors. This new example differs from the previous one in its use of the methods c_support_code() and c_support_code_apply() (it does not need to use them but it does so to explain their use) and its capacity to support inputs of different dtypes.
Recall that the method c_support_code() is meant to produce code that will be used for every apply of the op. This means that the C code in this method must be valid in every setting your op supports. If the op is meant to support inputs of various dtypes, the C code in this method should be generic enough to work with every supported dtype. If the op operates on inputs that can be vectors or matrices, the C code in this method should be able to accommodate both kinds of inputs.
In our example, the method c_support_code() is used to declare a C function to validate that two vectors have the same shape. Because our op only supports vectors as inputs, this function is allowed to rely on its inputs being vectors. However, our op should support multiple dtypes so this function cannot rely on a specific dtype in its inputs.
The method c_support_code_apply(), on the other hand, is allowed to depend on the inputs to the op because it is apply-specific. Therefore, we use it to define a function to perform the multiplication between two vectors.
Variables or functions defined in the method c_support_code_apply() will be included at the global scope for every apply of the Op. Because of this, the names of those variables and functions should include the name of the op, like in the example. Otherwise, using the op twice in the same graph will give rise to conflicts as some elements will be declared more than once.
The last interesting difference occurs in the c_code() method. Because the dtype of the output is variable and not guaranteed to be the same as any of the inputs (because of the upcast in the method make_node()), the typenum of the output has to be obtained in the Python code and then included in the C code.
class VectorTimesVector(gof.Op):
    __props__ = ()

    def make_node(self, x, y):
        # Validate the inputs' type
        if x.type.ndim != 1:
            raise TypeError('x must be a 1d vector')
        if y.type.ndim != 1:
            raise TypeError('y must be a 1d vector')

        # Create an output variable of the same type as x
        output_var = theano.tensor.TensorType(
                        dtype=theano.scalar.upcast(x.dtype, y.dtype),
                        broadcastable=[False])()

        return gof.Apply(self, [x, y], [output_var])

    def c_code_cache_version(self):
        return (1, 0, 2)

    def c_support_code(self):
        c_support_code = """
        bool vector_same_shape(PyArrayObject* arr1,
                               PyArrayObject* arr2)
        {
            return (PyArray_DIMS(arr1)[0] == PyArray_DIMS(arr2)[0]);
        }
        """

        return c_support_code

    def c_support_code_apply(self, node, name):
        dtype_x = node.inputs[0].dtype
        dtype_y = node.inputs[1].dtype
        dtype_z = node.outputs[0].dtype

        c_support_code = """
        void vector_elemwise_mult_%(name)s(npy_%(dtype_x)s* x_ptr,
            int x_str, npy_%(dtype_y)s* y_ptr, int y_str,
            npy_%(dtype_z)s* z_ptr, int z_str, int nbElements)
        {
            for (int i = 0; i < nbElements; i++){
                z_ptr[i * z_str] = x_ptr[i * x_str] * y_ptr[i * y_str];
            }
        }
        """

        return c_support_code % locals()

    def c_code(self, node, name, inp, out, sub):
        x, y = inp
        z, = out

        dtype_x = node.inputs[0].dtype
        dtype_y = node.inputs[1].dtype
        dtype_z = node.outputs[0].dtype

        itemsize_x = numpy.dtype(dtype_x).itemsize
        itemsize_y = numpy.dtype(dtype_y).itemsize
        itemsize_z = numpy.dtype(dtype_z).itemsize

        typenum_z = numpy.dtype(dtype_z).num

        fail = sub['fail']

        c_code = """
        // Validate that the inputs have the same shape
        if ( !vector_same_shape(%(x)s, %(y)s))
        {
            PyErr_Format(PyExc_ValueError, "Shape mismatch : "
                         "x.shape[0] and y.shape[0] should match but "
                         "x.shape[0] == %%i and y.shape[0] == %%i",
                         PyArray_DIMS(%(x)s)[0], PyArray_DIMS(%(y)s)[0]);
            %(fail)s;
        }

        // Validate that the output storage exists and has the same
        // dimension as x.
        if (NULL == %(z)s || !(vector_same_shape(%(x)s, %(z)s)))
        {
            /* Reference received to invalid output variable.
            Decrease received reference's ref count and allocate new
            output variable */
            Py_XDECREF(%(z)s);
            %(z)s = (PyArrayObject*)PyArray_EMPTY(1,
                                                  PyArray_DIMS(%(x)s),
                                                  %(typenum_z)s,
                                                  0);

            if (!%(z)s) {
                %(fail)s;
            }
        }

        // Perform the vector elemwise multiplication
        vector_elemwise_mult_%(name)s(
                            (npy_%(dtype_x)s*)PyArray_DATA(%(x)s),
                            PyArray_STRIDES(%(x)s)[0] / %(itemsize_x)s,
                            (npy_%(dtype_y)s*)PyArray_DATA(%(y)s),
                            PyArray_STRIDES(%(y)s)[0] / %(itemsize_y)s,
                            (npy_%(dtype_z)s*)PyArray_DATA(%(z)s),
                            PyArray_STRIDES(%(z)s)[0] / %(itemsize_z)s,
                            PyArray_DIMS(%(x)s)[0]);
        """

        return c_code % locals()
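The upcast in make_node and the typenum passed to the C code can be previewed with numpy alone; numpy's promote_types follows standard upcast rules similar to theano.scalar.upcast, and dtype.num gives the typenum the C code receives as typenum_z (a numpy-only sketch):

```python
import numpy as np

# int32 combined with float32 upcasts to float64, since float32
# cannot represent every int32 exactly
dtype_z = np.promote_types('int32', 'float32')
assert dtype_z == np.dtype('float64')

# dtype.num is the typenum handed to PyArray_EMPTY in the C code;
# it identifies the dtype unambiguously (round trip check)
typenum_z = np.dtype(dtype_z).num
assert np.empty(0, dtype=dtype_z).dtype.num == typenum_z
```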
Alternate way of defining C Ops
The two previous examples have covered the standard way of implementing C Ops in Theano by inheriting from the class Op. This process is mostly simple but it still involves defining many methods as well as mixing, in the same file, both Python and C code which tends to make the result less readable.
To help with this, Theano defines a class, COp, from which new C ops can inherit. The class COp aims to simplify the process of implementing C ops by doing the following:
 It allows you to define the C implementation of your op in a distinct C code file. This makes it easier to keep your Python and C code readable and well indented.
 It can automatically handle all the methods that return C code, in addition to Op.c_code_cache_version(), based on the provided external C implementation.
To illustrate how much simpler the class COp makes the process of defining a new op with a C implementation, let’s revisit the second example of this tutorial, the VectorTimesVector op. In that example, we implemented an op to perform the task of elementwise vector-vector multiplication. The two following blocks of code illustrate what the op would look like if it was implemented using the COp class.
The new op is defined inside a Python file with the following code:
import theano
from theano import gof


class VectorTimesVector(gof.COp):
    __props__ = ()

    func_file = "./vectorTimesVector.c"
    func_name = "APPLY_SPECIFIC(vector_times_vector)"

    def __init__(self):
        super(VectorTimesVector, self).__init__(self.func_file,
                                                self.func_name)

    def make_node(self, x, y):
        # Validate the inputs' type
        if x.type.ndim != 1:
            raise TypeError('x must be a 1d vector')
        if y.type.ndim != 1:
            raise TypeError('y must be a 1d vector')

        # Create an output variable of the same type as x
        output_var = theano.tensor.TensorType(
                        dtype=theano.scalar.upcast(x.dtype, y.dtype),
                        broadcastable=[False])()

        return gof.Apply(self, [x, y], [output_var])
And the following is the C implementation of the op, defined in an external C file named vectorTimesVector.c:
#section support_code

// Support code function
bool vector_same_shape(PyArrayObject* arr1, PyArrayObject* arr2)
{
    return (PyArray_DIMS(arr1)[0] == PyArray_DIMS(arr2)[0]);
}


#section support_code_apply

// Apply-specific support function
void APPLY_SPECIFIC(vector_elemwise_mult)(
        DTYPE_INPUT_0* x_ptr, int x_str,
        DTYPE_INPUT_1* y_ptr, int y_str,
        DTYPE_OUTPUT_0* z_ptr, int z_str, int nbElements)
{
    for (int i = 0; i < nbElements; i++){
        z_ptr[i * z_str] = x_ptr[i * x_str] * y_ptr[i * y_str];
    }
}

// Apply-specific main function
int APPLY_SPECIFIC(vector_times_vector)(PyArrayObject* input0,
                                        PyArrayObject* input1,
                                        PyArrayObject** output0)
{
    // Validate that the inputs have the same shape
    if ( !vector_same_shape(input0, input1))
    {
        PyErr_Format(PyExc_ValueError, "Shape mismatch : "
                     "input0.shape[0] and input1.shape[0] should "
                     "match but x.shape[0] == %i and "
                     "y.shape[0] == %i",
                     PyArray_DIMS(input0)[0], PyArray_DIMS(input1)[0]);
        return 1;
    }

    // Validate that the output storage exists and has the same
    // dimension as x.
    if (NULL == *output0 || !(vector_same_shape(input0, *output0)))
    {
        /* Reference received to invalid output variable.
        Decrease received reference's ref count and allocate new
        output variable */
        Py_XDECREF(*output0);
        *output0 = (PyArrayObject*)PyArray_EMPTY(1,
                                                 PyArray_DIMS(input0),
                                                 TYPENUM_OUTPUT_0,
                                                 0);

        if (!*output0) {
            PyErr_Format(PyExc_ValueError,
                         "Could not allocate output storage");
            return 1;
        }
    }

    // Perform the actual vector-vector multiplication
    APPLY_SPECIFIC(vector_elemwise_mult)(
                        (DTYPE_INPUT_0*)PyArray_DATA(input0),
                        PyArray_STRIDES(input0)[0] / ITEMSIZE_INPUT_0,
                        (DTYPE_INPUT_1*)PyArray_DATA(input1),
                        PyArray_STRIDES(input1)[0] / ITEMSIZE_INPUT_1,
                        (DTYPE_OUTPUT_0*)PyArray_DATA(*output0),
                        PyArray_STRIDES(*output0)[0] / ITEMSIZE_OUTPUT_0,
                        PyArray_DIMS(input0)[0]);

    return 0;
}
As you can see from this example, the Python and C implementations are nicely decoupled which makes them much more readable than when they were intertwined in the same file and the C code contained string formatting markers.
Now that we have motivated the COp class, we can have a more precise look at what it does for us. For this, we go through the various elements that make up this new version of the VectorTimesVector op:
 Parent class: instead of inheriting from the class Op, VectorTimesVector inherits from the class COp.
 Constructor: in our new op, the __init__() method has an important use; to inform the constructor of the COp class of the location, on the filesystem, of the C implementation of this op. To do this, it gives a list of file paths containing the C code for this op. To auto-generate the c_code method with a function call you can specify the function name as the second parameter. The paths should be given as a relative path from the folder where the descendant of the COp class is defined.
 make_node(): the make_node() method is absolutely identical to the one in our old example. Using the COp class doesn’t change anything here.
 External C code: the external C code implements the various functions associated with the op. Writing this C code involves a few subtleties which deserve their own respective sections.
Main function
If you pass a function name to the __init__() method of the COp class, it must respect the following constraints:
 It must return an int. The value of that int indicates whether the op could perform its task or not. A value of 0 indicates success while any non-zero value will interrupt the execution of the Theano function. When returning non-zero the function must set a Python exception indicating the details of the problem.
 It must receive one argument for each input to the op followed by one pointer to an argument for each output of the op. The types of the arguments depend on the Types (that is, Theano Types) of your inputs and outputs.
 You can specify the number of inputs and outputs for your op by setting the _cop_num_inputs and _cop_num_outputs attributes on your op. The main function will always be called with that number of arguments, using NULL to fill in for missing values at the end. This can be used if your op has a variable number of inputs or outputs, but with a fixed maximum.
For example, the main C function of an op that takes two TensorTypes (which have PyArrayObject* as their C type) as inputs and returns both their sum and the difference between them would have four parameters (two for the op’s inputs and two for its outputs) and its signature would look something like this:
int sumAndDiffOfScalars(PyArrayObject* in0, PyArrayObject* in1,
                        PyArrayObject** out0, PyArrayObject** out1)
Macros
For certain section tags, your C code can benefit from a number of pre-defined macros. These section tags have no macros: init_code, support_code. All other tags will have the support macros discussed below.
 APPLY_SPECIFIC(str) will automatically append a name unique to the Apply node that applies the Op at the end of the provided str. The use of this macro is discussed further below.
For every input which has a dtype attribute (this means Tensors, and equivalent types on GPU), the following macros will be defined unless your Op class has an Op.check_input attribute defined to False. In these descriptions ‘i’ refers to the position (indexed from 0) in the input array.
 DTYPE_INPUT_{i}: NumPy dtype of the data in the array. This is the variable type corresponding to the NumPy dtype, not the string representation of the NumPy dtype. For instance, if the op’s first input is a float32 ndarray, then the macro DTYPE_INPUT_0 corresponds to npy_float32 and can directly be used to declare a new variable of the same dtype as the data in the array:
DTYPE_INPUT_0 myVar = someValue;
 TYPENUM_INPUT_{i}: Typenum of the data in the array.
 ITEMSIZE_INPUT_{i}: Size, in bytes, of the elements in the array.
In the same way, the macros DTYPE_OUTPUT_{i}, ITEMSIZE_OUTPUT_{i} and TYPENUM_OUTPUT_{i} are defined for every output ‘i’ of the op.
In addition to these macros, the init_code_struct, code, and code_cleanup section tags also have the following macros:
 FAIL: Code to insert at error points. A Python exception should be set prior to this code. An invocation looks like this:
if (error) {
    // Set python exception
    FAIL
}
You can add a semicolon after the macro if it makes your editor happy.
 PARAMS: Name of the params variable for this node. (Only for Ops which have params, which is discussed elsewhere.)
Finally the tags code and code_cleanup have macros to pass the input and output names. These are named INPUT_{i} and OUTPUT_{i} where i is the 0-based index position in the input and output arrays respectively.
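The quantities behind the DTYPE/TYPENUM/ITEMSIZE macros can be computed from the numpy dtype alone. A sketch for a hypothetical op whose first input is a float32 vector (the macro names are compile-time C identifiers; here we only compute the values they would carry):

```python
import numpy as np

dtype_input_0 = np.dtype('float32')

# DTYPE_INPUT_0 would expand to the C type npy_float32
assert dtype_input_0.name == 'float32'

# TYPENUM_INPUT_0 -> the dtype's typenum
typenum_input_0 = dtype_input_0.num
assert typenum_input_0 == np.zeros(1, dtype='float32').dtype.num

# ITEMSIZE_INPUT_0 -> size of one element in bytes
assert dtype_input_0.itemsize == 4
```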
Support code
Certain sections are limited in what you can place in them due to semantic and syntactic restrictions of the C++ language. Most of these restrictions apply to the tags that end in _struct.
When we defined the VectorTimesVector op without using the COp class, we had to make a distinction between two types of support_code: the support code that was apply-specific and the support code that wasn’t. The apply-specific code was defined in the c_support_code_apply() method and the elements defined in that code (global variables and functions) had to include the name of the Apply node in their own names to avoid conflicts between the different versions of the apply-specific code. The code that wasn’t apply-specific was simply defined in the c_support_code() method.
To make identifiers that include the Apply node name, use the APPLY_SPECIFIC(str) macro. In the above example, this macro is used when defining the functions vector_elemwise_mult() and vector_times_vector() as well as when calling function vector_elemwise_mult() from inside vector_times_vector().
When using the COp class, we still have to make the distinction between C code for each of the methods of a C class. These sections of code are separated by #section <tag> markers. The tag determines the name of the method this C code applies to with the rule that <tag> applies to c_<tag>. Unknown tags are an error and will be reported. Duplicate tags will be merged together in the order they appear in the C files.
The rules for knowing where a piece of code should be put can sometimes be tricky. The key thing to remember is that things that can be shared between instances of the op should be apply-agnostic and go into a section which does not end in _apply or _struct. The distinction between _apply and _struct mostly hinges on how you want to manage the lifetime of the object. Note that to use an apply-specific object, you have to be in an apply-specific section, so some portions of the code that might seem apply-agnostic may still be apply-specific because of the data they use (this does not include arguments).
In the above example, the function vector_same_shape() is apply-agnostic because it uses none of the macros defined by the class COp and it doesn’t rely on any apply-specific code. The function vector_elemwise_mult() is apply-specific because it uses the macros defined by COp. Finally, the function vector_times_vector() is apply-specific because it uses those same macros and also because it calls vector_elemwise_mult() which is an apply-specific function.
Using GDB to debug Op’s C code
When debugging C code, it can be useful to use GDB for code compiled by Theano. For this, you must disable compiler optimizations with the Theano flag cmodule.remove_gxx_opt=True. For the GPU, you must also add the flag nvcc.flags=-g (it slows down computation on the GPU, but debugging information is enabled by default on the CPU).
Then you must start Python inside GDB and from it start your Python process (e.g. theano-nose):
$ gdb python
(gdb) r bin/theano-nose theano/
Final Note
This tutorial focuses on providing C implementations to ops that manipulate Theano tensors. For more information about other Theano types, you can refer to the section Alternate Theano Types.
Writing an Op to work on an ndarray in C
This section walks through a non-trivial example Op that does something pretty weird and unrealistic, that is hard to express with existing Ops. (Technically, we could use Scan to implement the Op we’re about to describe, but we ignore that possibility for the sake of example.)
The following code works, but important error-checking has been omitted for clarity. For example, when you write C code that assumes memory is contiguous, you should check the strides and alignment.
import theano
from theano import tensor


class Fibby(theano.Op):
    """
    An arbitrarily generalized Fibonacci sequence
    """
    __props__ = ()

    def make_node(self, x):
        x_ = tensor.as_tensor_variable(x)
        assert x_.ndim == 1
        return theano.Apply(self,
                            inputs=[x_],
                            outputs=[x_.type()])
        # using x_.type() is dangerous, it copies x's broadcasting
        # behaviour

    def perform(self, node, inputs, output_storage):
        x, = inputs
        y = output_storage[0][0] = x.copy()
        for i in range(2, len(x)):
            y[i] = y[i - 1] * y[i - 2] + x[i]

    def c_code(self, node, name, inames, onames, sub):
        x, = inames
        y, = onames
        fail = sub['fail']
        return """
Py_XDECREF(%(y)s);
%(y)s = (PyArrayObject*)PyArray_FromArray(
            %(x)s, 0, NPY_ARRAY_ENSURECOPY);
if (!%(y)s)
    %(fail)s;
{//New scope needed to make compilation work
    dtype_%(y)s * y = (dtype_%(y)s*)PyArray_DATA(%(y)s);
    dtype_%(x)s * x = (dtype_%(x)s*)PyArray_DATA(%(x)s);
    for (int i = 2; i < PyArray_DIMS(%(x)s)[0]; ++i)
        y[i] = y[i-1]*y[i-2] + x[i];
}
""" % locals()

    def c_code_cache_version(self):
        return (1,)

fibby = Fibby()
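The recurrence implemented by both Fibby.perform and its C code can be re-stated in plain numpy, which is handy for checking either implementation against a reference (a sketch independent of Theano):

```python
import numpy as np

def fibby_reference(x):
    # Same recurrence as Fibby: y[i] = y[i-1] * y[i-2] + x[i],
    # starting from a copy of x
    y = x.copy()
    for i in range(2, len(x)):
        y[i] = y[i - 1] * y[i - 2] + x[i]
    return y

x = np.ones(5)
# y[2] = 1*1 + 1 = 2 ; y[3] = 2*1 + 1 = 3 ; y[4] = 3*2 + 1 = 7
assert np.array_equal(fibby_reference(x), [1., 1., 2., 3., 7.])
```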
In the first two lines of the C function, we make y point to a new array with the correct size for the output. This is essentially simulating the line y = x.copy().
The variables %(x)s and %(y)s are set up by the TensorType to be PyArrayObject pointers. TensorType also sets up dtype_%(x)s to be a typedef to the C type for x.
Py_XDECREF(%(y)s);
%(y)s = (PyArrayObject*)PyArray_FromArray(
            %(x)s, 0, NPY_ARRAY_ENSURECOPY);
The first line reduces the reference count of the data that y originally pointed to. The second line allocates the new data and makes y point to it.
In C code for a Theano op, numpy arrays are represented as PyArrayObject C structs. This is part of the numpy/scipy C API documented at http://docs.scipy.org/doc/numpy/reference/c-api.types-and-structures.html
TODO: NEEDS MORE EXPLANATION.
Writing an Optimization
fibby of a vector of zeros is another vector of zeros of the same size. Theano does not attempt to infer this from the code provided via Fibby.perform or Fibby.c_code. However, we can write an optimization that makes use of this observation. This sort of local substitution of special cases is common, and there is a stage of optimization (specialization) devoted to such optimizations. The following optimization (fibby_of_zero) tests whether the input is guaranteed to be all zero, and if so it returns the input itself as a replacement for the old output.
TODO: talk about OPTIMIZATION STAGES
import numpy
import theano
from theano.tensor.opt import get_scalar_constant_value, NotScalarConstantError

# Remove any fibby(zeros(...))
@theano.tensor.opt.register_specialize
@theano.gof.local_optimizer([fibby])
def fibby_of_zero(node):
    if node.op == fibby:
        x = node.inputs[0]
        try:
            if numpy.all(0 == get_scalar_constant_value(x)):
                return [x]
        except NotScalarConstantError:
            pass
The register_specialize decorator is what activates our optimization, and tells Theano to use it in the specialization stage.
The local_optimizer decorator builds a class instance around our global function. The [fibby] argument is a hint that our optimizer works on nodes whose .op attribute equals fibby.
The function here (fibby_of_zero) expects an Apply instance as an argument for parameter node. It tests using the function get_scalar_constant_value, which determines if a Variable (x) is guaranteed to be a constant, and if so, what constant.
Test the optimization
Here is some code to test that the optimization is applied only when needed.
import numpy
import theano.tensor as T
from theano import function
from theano import tensor
# Test it does not apply when not needed
x = T.dvector()
f = function([x], fibby(x))
# We call the function to make sure it runs.
# If you run in DebugMode, it will compare the C and Python outputs.
f(numpy.random.rand(5))
topo = f.maker.fgraph.toposort()
assert len(topo) == 1
assert isinstance(topo[0].op, Fibby)
# Test that the optimization gets applied.
f_zero = function([], fibby(T.zeros([5])))
# If you run in DebugMode, it will compare the output before
# and after the optimization.
f_zero()
# Check that the optimization removes the Fibby Op.
# For security, the Theano memory interface ensures that the output
# of the function is always memory not aliased to the input.
# That is why there is a DeepCopyOp op.
topo = f_zero.maker.fgraph.toposort()
assert len(topo) == 1
assert isinstance(topo[0].op, theano.compile.ops.DeepCopyOp)
Overview of the compilation pipeline
The purpose of this page is to explain each step of defining and compiling a Theano function.
Definition of the computation graph
By creating Theano Variables using theano.tensor.lscalar or theano.tensor.dmatrix, or by using Theano functions such as theano.tensor.sin or theano.tensor.log, the user builds a computation graph. The structure of that graph and details about its components can be found in the Graph Structures article.
Compilation of the computation graph
Once the user has built a computation graph, she can use theano.function in order to make one or more functions that operate on real data. function takes a list of input Variables as well as a list of output Variables that define a precise subgraph corresponding to the function(s) we want to define, compiles that subgraph, and produces a callable.
Here is an overview of the various steps that are done with the computation graph in the compilation phase:
Step 1 - Create a FunctionGraph
The subgraph given by the end user is wrapped in a structure called FunctionGraph. That structure defines several hooks on adding and removing (pruning) nodes as well as on modifying links between nodes (for example, modifying an input of an Apply node) (see the article about fg – Graph Container [doc TODO] for more information).
FunctionGraph provides a method to change the input of an Apply node from one Variable to another and a more high-level method to replace a Variable with another. This is the structure that Optimizers work on.
Some relevant Features are typically added to the FunctionGraph, namely to prevent any optimization from operating inplace on inputs declared as immutable.
Step 2 - Execute main Optimizer
Once the FunctionGraph is made, an optimizer is produced by the mode passed to function (the Mode basically has two important fields, linker and optimizer). That optimizer is applied on the FunctionGraph using its optimize() method.
The optimizer is typically obtained through optdb.
Step 3 - Execute linker to obtain a thunk
Once the computation graph is optimized, the linker is extracted from the Mode. It is then called with the FunctionGraph as argument to produce a thunk, which is a function with no arguments that returns nothing. Along with the thunk, one list of input containers (a theano.gof.Container is a sort of object that wraps another and does type casting) and one list of output containers are produced, corresponding to the input and output Variables as well as the updates defined for the inputs when applicable. To perform the computations, the inputs must be placed in the input containers, the thunk must be called, and the outputs must be retrieved from the output containers where the thunk put them.
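The calling protocol can be sketched in plain Python. This is a toy illustration of the container/thunk convention, not Theano's actual classes:

```python
# Toy illustration of the input-container / thunk / output-container protocol.
# Storage cells are one-element lists, so they can be shared by reference.
input_storage = [[None]]     # one container per input Variable
output_storage = [[None]]    # one container per output Variable

def thunk():
    # Reads inputs from the containers, writes outputs back; returns nothing.
    x = input_storage[0][0]
    output_storage[0][0] = x * 2

input_storage[0][0] = 21     # place the input in its container
thunk()                      # run the computation
print(output_storage[0][0])  # retrieve the output -> 42
```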
Typically, the linker calls the toposort method in order to obtain
a linear sequence of operations to perform. How they are linked
together depends on the Linker used. The CLinker produces a single
block of C code for the whole computation, whereas the OpWiseCLinker
produces one thunk for each individual operation and calls them in
sequence.
The linker is where some options take effect: the strict flag of an input makes the associated input container do type checking. The borrow flag of an output, if False, adds the output to a no_recycling list, meaning that when the thunk is called the output containers will be cleared (if they stayed there, as would be the case if borrow was True, the thunk would be allowed to reuse, or "recycle", the storage).
Note
Compiled libraries are stored within a specific compilation directory, which by default is set to $HOME/.theano/compiledir_xxx, where xxx identifies the platform (under Windows the default location is instead $LOCALAPPDATA\Theano\compiledir_xxx). It may be manually set to a different location either by setting config.compiledir or config.base_compiledir, either within your Python script or by using one of the configuration mechanisms described in config.
The compile cache is based upon the C++ code of the graph to be compiled. So, if you change compilation configuration variables, such as config.blas.ldflags, you will need to manually remove your compile cache, using Theano/bin/theano-cache clear.
Theano also implements a lock mechanism that prevents multiple compilations within the same compilation directory (to avoid crashes with parallel execution of some scripts). This mechanism is currently enabled by default, but if it causes any problem it may be disabled using the function theano.gof.compilelock.set_lock_status(..).
Step 4 - Wrap the thunk in a pretty package
The thunk returned by the linker along with the input and output containers is unwieldy. function hides that complexity away so that it can be used like a normal function with arguments and return values.
Theano vs. C
We describe some of the patterns in Theano, and present their closest analogue in a statically typed language such as C:
Theano           | C
-----------------|------------------------------------------------------------
Apply            | function application / function call
Variable         | local function data / variable
Shared Variable  | global function data / variable
Op               | operations carried out in computation / function definition
Type             | data types
For example:
int d = 0;

int main(int a) {
    int b = 3;
    int c = f(b);
    d = b + c;
    return g(a, c);
}
Based on this code snippet, we can relate f and g to Ops; a, b and c to Variables; d to a Shared Variable; and g(a, c), f(b) and d = b + c (taken as meaning the action of computing f, g or + on their respective inputs) to Applies. Lastly, int could be interpreted as the Theano Type of the Variables a, b, c and d.
Making the double type
Type's contract
In Theano's framework, a Type (gof.type.Type) is any object which defines the following methods. To obtain the default methods described below, the Type should be an instance of Type or should be an instance of a subclass of Type. If you will write all methods yourself, you need not use an instance of Type.
Methods with default arguments must be defined with the same signature, i.e. the same default argument names and values. If you wish to add extra arguments to any of these methods, these extra arguments must have default values.

class PureType
filter(value, strict=False, allow_downcast=None)
This casts a value to match the Type and returns the cast value. If value is incompatible with the Type, the method must raise an exception. If strict is True, filter must return a reference to value (i.e. casting prohibited). If strict is False, then casting may happen, but downcasting should only be used in two situations:
- if allow_downcast is True
- if allow_downcast is None and the default behavior for this type allows downcasting for the given value (this behavior is type-dependent, you may decide what your own type does by default)
We need to define filter with three arguments. The second argument must be called strict (Theano often calls it by keyword) and must have a default value of False. The third argument must be called allow_downcast and must have a default value of None.

filter_inplace(value, storage, strict=False, allow_downcast=None)
If filter_inplace is defined, it will be called instead of filter(). This is to allow reusing the old allocated memory. As of this writing, this is used only when we transfer new data to a shared variable on the gpu.
storage will be the old value, i.e. the old numpy array, CudaNdarray, ...

is_valid_value(value)
Returns True iff the value is compatible with the Type. If filter(value, strict=True) does not raise an exception, the value is compatible with the Type.
Default: True iff filter(value, strict=True) does not raise an exception.

values_eq(a, b)
Returns True iff a and b are equal.
Default: a == b

values_eq_approx(a, b)
Returns True iff a and b are approximately equal, for a definition of "approximately" which varies from Type to Type.
Default: values_eq(a, b)

make_variable(name=None)
Makes a Variable of this Type with the specified name, if name is not None. If name is None, then the Variable does not have a name. The Variable will have its type field set to the Type object.
Default: there is a generic definition of this in Type. The Variable's type will be the object that defines this method (in other words, self).

__call__(name=None)
Syntactic shortcut to make_variable.
Default: make_variable

__eq__(other)
Used to compare Type instances themselves.
Default: object.__eq__

__hash__()
Types should not be mutable, so it should be OK to define a hash function. Typically this function should hash all of the terms involved in __eq__.
Default: id(self)
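A common pattern (and what __props__ generates automatically, as described later) is to derive both methods from the same tuple of attributes. A sketch with a hypothetical dtype attribute:

```python
class MyType:
    # Hypothetical Type parametrized by a dtype string (illustration only).
    def __init__(self, dtype):
        self.dtype = dtype

    def __eq__(self, other):
        # Equal iff same class and same defining attributes.
        return type(self) is type(other) and self.dtype == other.dtype

    def __hash__(self):
        # Hash the same terms that __eq__ compares.
        return hash((type(self), self.dtype))

print(MyType('float64') == MyType('float64'))  # -> True
print(MyType('int8') == MyType('float64'))     # -> False
```

Equal instances now also hash equally, which is exactly the invariant the contract asks for.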

get_shape_info(obj)
Optional. Only needed to profile the memory of this Type of object.
Return the information needed to compute the memory size of obj. The memory size is only the data, so this excludes the container. For an ndarray, this is the data, but not the ndarray object and other data structures such as shape and strides.
get_shape_info() and get_size() work in tandem for the memory profiler. get_shape_info() is called during the execution of the function, so it is better that it is not too slow. get_size() will be called on the output of this function when printing the memory profile.
Parameters: obj – The object that this Type represents during execution
Returns: Python object that self.get_size() understands

get_size(shape_info)
Number of bytes taken by the object represented by shape_info.
Optional. Only needed to profile the memory of this Type of object.
Parameters: shape_info – the output of the call to get_shape_info()
Returns: the number of bytes taken by the object described by shape_info.

clone(dtype=None, broadcastable=None)
Optional, for TensorType-alikes.
Return a copy of the type with a possibly changed value for dtype and broadcastable (if they aren't None).
Parameters:
- dtype – New dtype for the copy.
- broadcastable – New broadcastable tuple for the copy.
may_share_memory(a, b)
Optional to run, but mandatory for DebugMode. Return True if the Python objects a and b could share memory. Return False otherwise. It is used to debug when Ops did not declare memory aliasing between variables. Can be a static method. It is highly recommended to use and is mandatory for Type in Theano as our buildbot runs in DebugMode.

For each method, the default is what Type defines for you. So, if you create an instance of Type or an instance of a subclass of Type, you must define filter. You might want to override values_eq_approx, as well as values_eq. The other defaults generally need not be overridden.
For more details you can go see the documentation for Type.
Additional definitions
For certain mechanisms, you can register functions and other such things to plug your type into Theano's mechanisms. These are optional, but will allow people to use your type with familiar interfaces.
To plug in additional options for the transfer target, define a function which takes a theano variable and a target argument and returns either a new transferred variable (which can be the same as the input if no transfer is necessary) or returns None if the transfer can't be done.
Then register that function by calling register_transfer() with it as argument.
Defining double
We are going to base Type double on Python's float. We must define filter and shall override values_eq_approx.
filter
# Note that we shadow Python's function ``filter`` with this
# definition.
def filter(x, strict=False, allow_downcast=None):
    if strict:
        if isinstance(x, float):
            return x
        else:
            raise TypeError('Expected a float!')
    elif allow_downcast:
        return float(x)
    else:  # Covers both the False and None cases.
        x_float = float(x)
        if x_float == x:
            return x_float
        else:
            raise TypeError('The double type cannot accurately represent '
                            'value %s (of type %s): you must explicitly '
                            'allow downcasting if you want to do this.'
                            % (x, type(x)))
If strict is True we need to return x. If strict is True and x is not a float (for example, x could easily be an int) then it is incompatible with our Type and we must raise an exception.
If strict is False then we are allowed to cast x to a float, so if x is an int we will return an equivalent float. However, if this cast triggers a precision loss (x != float(x)) and allow_downcast is not True, then we also raise an exception. Note that here we decided that the default behavior of our type (when allow_downcast is set to None) would be the same as when allow_downcast is False, i.e. no precision loss is allowed.
values_eq_approx

def values_eq_approx(x, y, tolerance=1e-4):
    return abs(x - y) / (abs(x) + abs(y)) < tolerance
The second method we define is values_eq_approx. This method allows approximate comparison between two values respecting our Type's constraints. It might happen that an optimization changes the computation graph in such a way that it produces slightly different variables, for example because of numerical instability like rounding errors at the end of the mantissa. For instance, a + a + a + a + a + a might not actually produce the exact same output as 6 * a (try with a=0.1), but with values_eq_approx we do not necessarily mind.
We added an extra tolerance argument here. Since this argument is not part of the API, it must have a default value, which we chose to be 1e-4.
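A quick check of the a=0.1 example, using the same relative comparison:

```python
def values_eq_approx(x, y, tolerance=1e-4):
    return abs(x - y) / (abs(x) + abs(y)) < tolerance

a = 0.1
s = a + a + a + a + a + a   # repeated addition
p = 6 * a                   # single multiplication
# s and p may differ in the last bits of the mantissa, but they are
# approximately equal by a wide margin under the 1e-4 tolerance.
print(values_eq_approx(s, p))  # -> True
```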
Note
values_eq is never actually used by Theano, but it might be used internally in the future. Equality testing in DebugMode is done using values_eq_approx.
Putting them together
What we want is an object that respects the aforementioned
contract. Recall that Type defines default implementations for all
required methods of the interface, except filter
. One way to make
the Type is to instantiate a plain Type and set the needed fields:
from theano import gof
double = gof.Type()
double.filter = filter
double.values_eq_approx = values_eq_approx
Another way to make this Type is to make a subclass of gof.Type and define filter and values_eq_approx in the subclass:
from theano import gof

class Double(gof.Type):

    def filter(self, x, strict=False, allow_downcast=None):
        # See code above.
        ...

    def values_eq_approx(self, x, y, tolerance=1e-4):
        # See code above.
        ...

double = Double()
double is then an instance of Type Double, which in turn is a subclass of Type.
There is a small issue with defining double this way. All instances of Double are technically the same Type. However, different Double Type instances do not compare the same:
>>> double1 = Double()
>>> double2 = Double()
>>> double1 == double2
False
Theano compares Types using == to see if they are the same. This happens in DebugMode. Also, Ops can (and should) ensure that their inputs have the expected Type by checking something like if x.type == lvector.
There are several ways to make sure that equality testing works properly:

1. Define Double.__eq__ so that instances of type Double are equal. For example:

   def __eq__(self, other):
       return type(self) is Double and type(other) is Double

2. Override Double.__new__ to always return the same instance.
3. Hide the Double class and only advertise a single instance of it.

Here we will prefer the final option, because it is the simplest. Ops in the Theano code often define the __eq__ method though.
Untangling some concepts
Initially, confusion is common on what an instance of Type is versus a subclass of Type or an instance of Variable. Some of this confusion is syntactic. A Type is any object which has fields corresponding to the functions defined above. The Type class provides sensible defaults for all of them except filter, so when defining new Types it is natural to subclass Type. Therefore, we often end up with Type subclasses and it can be confusing what these represent semantically. Here is an attempt to clear up the confusion:
- An instance of Type (or an instance of a subclass) is a set of constraints on real data. It is akin to a primitive type or class in C. It is a static annotation.
- An instance of Variable symbolizes data nodes in a data flow graph. If you were to parse the C expression int x;, int would be a Type instance and x would be a Variable instance of that Type instance. If you were to parse the C expression c = a + b;, a, b and c would all be Variable instances.
- A subclass of Type is a way of implementing a set of Type instances that share structural similarities. In the double example that we are doing, there is actually only one Type in that set, therefore the subclass does not represent anything that one of its instances does not. In this case it is a singleton, a set with one element. However, the TensorType class in Theano (which is a subclass of Type) represents a set of types of tensors parametrized by their data type or number of dimensions. We could say that subclassing Type builds a hierarchy of Types which is based upon structural similarity rather than compatibility.
Final version
from theano import gof

class Double(gof.Type):

    def filter(self, x, strict=False, allow_downcast=None):
        if strict:
            if isinstance(x, float):
                return x
            else:
                raise TypeError('Expected a float!')
        elif allow_downcast:
            return float(x)
        else:  # Covers both the False and None cases.
            x_float = float(x)
            if x_float == x:
                return x_float
            else:
                raise TypeError('The double type cannot accurately represent '
                                'value %s (of type %s): you must explicitly '
                                'allow downcasting if you want to do this.'
                                % (x, type(x)))

    def values_eq_approx(self, x, y, tolerance=1e-4):
        return abs(x - y) / (abs(x) + abs(y)) < tolerance

    def __str__(self):
        return "double"

double = Double()
We add one utility method, __str__. That way, when we print double, it will print out something intelligible.
Making arithmetic Ops on double
Now that we have a double type, we have yet to use it to perform computations. We'll start by defining multiplication.
Op's contract
An Op is any object which inherits from gof.Op. It has to define the following methods.

make_node(*inputs)
This method is responsible for creating output Variables of a suitable symbolic Type to serve as the outputs of this Op's application. The Variables found in *inputs must be operated on using Theano's symbolic language to compute the symbolic output Variables. This method should put these outputs into an Apply instance, and return the Apply instance.
This method creates an Apply node representing the application of the Op on the inputs provided. If the Op cannot be applied to these inputs, it must raise an appropriate exception.
The inputs of the Apply instance returned by this call must be ordered correctly: a subsequent self.make_node(*apply.inputs) must produce something equivalent to the first apply.

perform(node, inputs, output_storage)
This method computes the function associated to this Op. node is an Apply node created by the Op's make_node method. inputs is a list of references to data to operate on using non-symbolic statements (i.e., statements in Python, Numpy). output_storage is a list of storage cells where the variables of the computation must be put.
More specifically:
- node: This is a reference to an Apply node which was previously obtained via the Op's make_node method. It is typically not used in simple Ops, but it contains symbolic information that could be required for complex Ops.
- inputs: This is a list of data from which the values stored in output_storage are to be computed using non-symbolic language.
- output_storage: This is a list of storage cells where the output is to be stored. A storage cell is a one-element list. It is forbidden to change the length of the list(s) contained in output_storage. There is one storage cell for each output of the Op.
The data put in output_storage must match the type of the symbolic output. This is a situation where the node argument can come in handy.
A function Mode may allow output_storage elements to persist between evaluations, or it may reset output_storage cells to hold a value of None. It can also pre-allocate some memory for the Op to use. This feature can allow perform to reuse memory between calls, for example. If there is something preallocated in the output_storage, it will be of the correct dtype, but can have the wrong shape and have any stride pattern.
This method must be determined by the inputs. That is to say, if it is evaluated once on inputs A and returned B, then if ever inputs C, equal to A, are presented again, then outputs equal to B must be returned again.
You must be careful about aliasing outputs to inputs, and making modifications to any of the inputs. See Views and inplace operations before writing a perform implementation that does either of these things.
Instead of (or in addition to) perform(), you can also provide a C implementation of the computation. For more details, refer to the documentation for Op.

__eq__(other)
other is also an Op.
Returning True here is a promise to the optimization system that the other Op will produce exactly the same graph effects (from perform) as this one, given identical inputs. This means it will produce the same output values, it will destroy the same inputs (same destroy_map), and will alias outputs to the same inputs (same view_map). For more details, see Views and inplace operations.
Note
If you set __props__, this will be automatically generated.

__hash__()
If two Op instances compare equal, then they must return the same hash value.
Equally important, this hash value must not change during the lifetime of self. Op instances should be immutable in this sense.
Note
If you set __props__, this will be automatically generated.
Optional methods or attributes

__props__
Default: Undefined
Must be a tuple. Lists the names of the attributes which influence the computation performed. This will also enable the automatic generation of appropriate __eq__, __hash__ and __str__ methods. Should be set to () if you have no attributes that are relevant to the computation, to generate the methods.
New in version 0.7.

default_output
Default: None
If this member variable is an integer, then the default implementation of __call__ will return node.outputs[self.default_output], where node was returned by make_node. Otherwise, the entire list of outputs will be returned, unless it is of length 1, where the single element will be returned by itself.

make_thunk(node, storage_map, compute_map, no_recycling, impl=None)
This function must return a thunk, that is, a zero-arguments function that encapsulates the computation to be performed by this op on the arguments of the node.
Parameters:
- node – Apply instance. The node for which a thunk is requested.
- storage_map – dict of lists. This maps variables to one-element lists holding the variable's current value. The one-element list acts as pointer to the value and allows sharing that "pointer" with other nodes and instances.
- compute_map – dict of lists. This maps variables to one-element lists holding booleans. If the value is 0 then the variable has not been computed and the value should not be considered valid. If the value is 1 the variable has been computed and the value is valid. If the value is 2 the variable has been garbage-collected and is no longer valid, but shouldn't be required anymore for this call.
- no_recycling – WRITEME
- impl – None, 'c' or 'py'. Which implementation to use.
The returned function must ensure that it sets the computed variables as computed in the compute_map.
Defining this function removes the requirement for perform() or C code, as you will define the thunk for the computation yourself.

__call__(*inputs, **kwargs)
By default this is a convenience function which calls make_node() with the supplied arguments and returns the result indexed by default_output. This can be overridden by subclasses to do anything else, but must return either a theano Variable or a list of Variables.
If you feel the need to override __call__ to change the graph based on the arguments, you should instead create a function that will use your Op and build the graphs that you want, and call that instead of the Op instance directly.

infer_shape(node, shapes)
This function is needed for shape optimization. shapes is a list with one tuple for each input of the Apply node (which corresponds to the inputs of the op). Each tuple contains as many elements as the number of dimensions of the corresponding input. The value of each element is the shape (number of items) along the corresponding dimension of that specific input.
While this might sound complicated, it is nothing more than the shape of each input as symbolic variables (one per dimension).
The function should return a list with one tuple for each output. Each tuple should contain the corresponding output's computed shape.
Implementing this method will allow Theano to compute the output's shape without computing the output itself, potentially sparing you a costly recomputation.
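For an elementwise-style Op with one input and one output, the method reduces to passing the input shape through. A sketch, with plain tuples standing in for the symbolic shapes:

```python
def infer_shape(node, shapes):
    # One input, one output, elementwise: the output shape is the
    # input shape, unchanged. `node` is unused in this simple case.
    (input_shape,) = shapes
    return [input_shape]

print(infer_shape(None, [(3, 4)]))  # -> [(3, 4)]
```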

flops(inputs, outputs)
It is only used to have more information printed by the memory profiler. It makes it print the mega flops and giga flops per second for each apply node. It takes as inputs two lists: one for the inputs and one for the outputs. They contain tuples that are the shapes of the corresponding inputs/outputs.

__str__()
This allows you to specify a more informative string representation of your Op. If an Op has parameters, it is highly recommended to have the __str__ method include the name of the op and the Op's parameters' values.
Note
If you set __props__, this will be automatically generated. You can still override it for custom output.

do_constant_folding(node)