Welcome¶
Theano is a Python library that allows you to define, optimize, and evaluate mathematical expressions involving multidimensional arrays efficiently. Theano features:
 tight integration with NumPy – Use numpy.ndarray in Theano-compiled functions.
 transparent use of a GPU – Perform data-intensive calculations up to 140x faster than with a CPU (float32 only).
 efficient symbolic differentiation – Theano does your derivatives for functions with one or many inputs.
 speed and stability optimizations – Get the right answer for log(1+x) even when x is really tiny.
 dynamic C code generation – Evaluate expressions faster.
 extensive unit-testing and self-verification – Detect and diagnose many types of errors.
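The log(1+x) point above can be reproduced directly in plain NumPy (this sketch uses NumPy rather than Theano, to show the underlying floating-point issue that the stability optimization addresses):

```python
import numpy as np

# For x this small, 1.0 + x rounds to exactly 1.0 in float64, so the
# naive formula returns log(1.0) == 0.0 and the answer is lost.
x = 1e-20
naive = np.log(1.0 + x)   # 0.0: catastrophic loss of precision
stable = np.log1p(x)      # 1e-20: the kind of rewrite Theano applies
```

Theano performs this rewrite automatically on the symbolic graph, so you can write the naive formula and still get the stable result.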
Theano has been powering large-scale computationally intensive scientific investigations since 2007. But it is also approachable enough to be used in the classroom (University of Montreal’s deep learning/machine learning classes).
News¶
 2016/05/09: New technical report on Theano: Theano: A Python framework for fast computation of mathematical expressions. This is the new preferred reference.
 2016/04/21: Release of Theano 0.8.2, adding support for cuDNN v5.
 2016/03/29: Release of Theano 0.8.1, fixing a compilation issue on MacOS X with XCode 7.3.
 2016/03/21: Release of Theano 0.8. Everybody is encouraged to update.
 Multi-GPU.
 We added support for CNMeM to speed up GPU memory allocation.
 Theano 0.7 was released 26th March 2015. Everybody is encouraged to update.
 We support cuDNN if it is installed by the user.
 Open Machine Learning Workshop 2014 presentation.
 Colin Raffel tutorial on Theano.
 Ian Goodfellow did a 12h class with exercises on Theano.
 New technical report on Theano: Theano: new features and speed improvements.
 HPCS 2011 Tutorial. We included a few fixes discovered while doing the Tutorial.
You can watch a quick (20 minute) introduction to Theano given as a talk at SciPy 2010 via streaming (or downloaded) video:
Transparent GPU Computing With Theano. James Bergstra, SciPy 2010, June 30, 2010.
Download¶
Theano is now available on PyPI, and can be installed via easy_install Theano, pip install Theano, or by downloading and unpacking the tarball and typing python setup.py install.
Those interested in bleeding-edge features should obtain the latest development version, available via:
git clone git://github.com/Theano/Theano.git
You can then place the checkout directory on your $PYTHONPATH or use python setup.py develop to install a .pth file into your site-packages directory, so that when you pull updates via Git, they will be automatically reflected in the “installed” version. For more information about installation and configuration, see installing Theano.
Citing Theano¶
If you use Theano for academic research, you are highly encouraged (though not required) to cite the following, most recent paper:
 Theano Development Team. “Theano: A Python framework for fast computation of mathematical expressions”.
(short BibTeX, full BibTeX)
Theano is primarily developed by academics, and so citations matter a lot to us. As an added benefit, you increase Theano’s exposure and potential user (and developer) base, which is to the benefit of all users of Theano. Thanks in advance!
See our Theano Citation Policy for details.
Documentation¶
Roughly in order of what you’ll want to check out:
 Installing Theano – How to install Theano.
 Theano at a Glance – What is Theano?
 Tutorial – Learn the basics.
 API Documentation – Theano’s functionality, module by module.
 Frequently Asked Questions – A set of commonly asked questions.
 Optimizations – Guide to Theano’s graph optimizations.
 Extending Theano – Learn to add a Type, Op, or graph optimization.
 Developer Start Guide – How to contribute code to Theano.
 Theano Design and Implementation Documentation – Primarily of interest to developers of Theano
 Internal Documentation – How to maintain Theano and more...
 Release – How our release should work.
 Acknowledgements – What we took from other projects.
 Related Projects – link to other projects that implement new functionalities on top of Theano
You can download the latest PDF documentation, rather than reading it online.
Check out how Theano can be used for Machine Learning: Deep Learning Tutorials.
Theano was featured at SciPy 2010.
Community¶
“Thank YOU for correcting it so quickly. I wish all packages I worked with would have such an active maintenance – this is as good as it gets :)”
(theano-users, Aug 2, 2010)
 Register to theano-announce if you want to be kept informed of important changes to Theano (low volume).
 Register and post to theano-users if you want to talk to all Theano users.
 Register and post to theano-dev if you want to talk to the developers.
 Register to theano-github if you want to receive an email for all changes to the GitHub repository.
 Register to theano-buildbot if you want to receive our daily buildbot email.
 Ask/view questions/answers at StackOverflow
 We use GitHub tickets to keep track of issues (however, some old tickets can still be found on Assembla).
 Come visit us in Montreal! Most developers are students in the LISA group at the University of Montreal.
Help!¶
How to Seek Help¶
The appropriate venue for seeking help depends on the kind of question you have.
 How do I? – theano-users mailing list or StackOverflow
 I got this error, why? – theano-users mailing list or StackOverflow (please include the full error message, even if it’s long)
 I got this error and I’m sure it’s a bug – GitHub ticket
 I have an idea/request – post the suggestion to theano-dev or, even better, implement the idea and submit a GitHub pull request!
 Why do you? – theano-users mailing list (not appropriate for StackOverflow)
 When will you? – theano-dev mailing list (not appropriate for StackOverflow)
Please do take some time to search for similar questions that were asked and answered in the past. If you find something similar that doesn’t fully answer your question, it can be helpful to say something like “I found X but it doesn’t address facet Y” and link to the previous discussion.
When asking questions on StackOverflow, please use the theano tag, so your question can be found, and follow StackOverflow’s guidance on asking questions. Consider also using the python and numpy tags, especially if you are unsure which library your problem relates to.
It’s often helpful to include the following details with your question:
 If you have an error, the full error message, even if it’s long
 Which versions of Python and Theano you’re using
 Whether you’re using a CPU or GPU device
 Details of your Theano configuration settings (you can print this in Python via print theano.config)
Spending the time to create a minimal, specific example of a problem is likely to get you to an answer quicker than posting something quickly that has too much irrelevant detail or is too vague. A minimal example may take you a bit more time to create, but the first response is more likely to be the answer you need, rather than a frustrated request for clarification.
How to provide help¶
If you see a question on the theano-users mailing list, or on StackOverflow, that you feel reasonably confident you know an answer to, please do support the community by helping others.
We were all newbies to Theano once and, as the community expands, there is a constant stream of new Theano users looking for help. Perhaps you asked a question when you were first starting out? Now you can pay it forward by helping others. It’s also a good way to reinforce your own Theano knowledge.
Often it’s easiest to answer a question directly but sometimes it may be better to refer people to a good answer that was provided in the past. Pointing people to relevant sections in the documentation, or to a Theano tutorial, can also be helpful.
When answering questions please be nice (as always!) and, on StackOverflow, follow their guidance for answering questions.
Release Notes¶
Theano 0.8.2 (21st of April, 2016)¶
This is a point release with only support for cuDNN v5 convolution and minor fixes.
Highlights:
 cuDNN v5 convolution support (cuDNN v3 isn’t supported anymore)
 A few crash fixes
Theano 0.8.1 (29th of March, 2016)¶
This is a point release without any new feature.
It fixes compilation issues on MacOS X with the command line tools for XCode 7.3, which was released shortly after Theano 0.8.0.
Theano 0.8 (21st of March, 2016)¶
We recommend that everybody update to this version.
 Highlights:
 Python 2 and 3 support with the same code base
 Faster optimization
 Integration of cuDNN for better GPU performance
 Many Scan improvements (execution speed up, ...)
 optimizer=fast_compile moves computation to the GPU.
 Better convolution on CPU and GPU. (CorrMM, cuDNN, 3d conv, more parameters)
 Interactive visualization of graphs with d3viz
 cnmem (better memory management on GPU)
 BreakpointOp
 Multi-GPU for data parallelism via Platoon (https://github.com/mila-udem/platoon/)
 More pooling parameters supported
 Bilinear interpolation of images
 New GPU backend:
 Float16 new backend (needs CUDA 7.5)
 Multi dtypes
 Multi-GPU support in the same process
A total of 141 people contributed to this release, see the list at the bottom.
 Installation:
 Better BLAS detection
 Fixes for more recent software and OS versions
 Support Anaconda on Windows
 Bug fixes:
 GpuJoin now supports negative axis
 Fix GpuCumsum for negative axis
 Interface Deprecation (a warning is printed):
 Deprecate Param class, use In instead
 Interface Changes:
 Rename DownsampleFactorMax to Pool.
 tensor.stack now uses the same interface as numpy.stack
 optimizer=fast_compile moves computation to the GPU
 Raise the user stack trace more frequently.
 Change dev version numbering to follow PEP 440
 New Interface (reuses existing functionality):
 theano.tensor.nnet.relu
 theano.tensor.nnet.elu
 BatchNormalization.
 MaxAndArgmax supports axis=None
 Add theano.tensor.compress (equivalent of numpy.compress)
 theano.tensor.signal.downsamples.max_pool_2d_same_size
 COp
 __props__
 New features
 tensor.unique
 map_variables
 erfcx
 mgrid, ogrid
 allclose
 BreakpointOp
 Make bincount work on GPU
 SolveOp on GPU
 Optional optimization remove_all_assert
 AllocEmpty
 LogSoftmax, for stability optimization when the cross-entropy optimization does not apply.
 theano.tensor.repeat works on GPU
 BatchedDot on the GPU and faster on the CPU.
 Faster batched_tensordot and make it work on GPU.
 SoftmaxGrad grad
 3d conv via CorrMM on the GPU
 CPU max pooling supports padding and strides != window size
 theano.function() now accepts a dict for the outputs. When doing this, the function will return a dict. Helpful to keep track of which output is what.
 Warn for unknown or misspelled theano config variables
 theano.tensor.tile update (accept symbolic reps, work on GPU)
 scan now has a strict flag. If set to True, this makes scan building faster and could make execution faster.
 theano.tensor.signal.conv2d(2d,2d) outputs a 2d answer
 More convolution parameter supported
 Bilinear interpolation of images
 Speedups:
 Faster SetSubtensor on the GPU.
 Support more reduction patterns on the GPU.
 More graph optimization
 Faster graph optimization
 GpuCrossentropySoftmaxArgmax1HotWithBias
 Crash/no return fixes:
 Fix crash in the assert op grad
 Fix curand crash on Mac
 Multiple scan crash fixes
 Finished updating all Op.grad() implementations to the new interface
 Others:
 Support ARM processor.
 Better tests
 Code clean up.
 Doc updates
 doctest and sphinx test in travis
 More tests tagged as slow
 Better same_shape implementation
 More ops with C code to lower overhead
 Custom pickler for SharedVariable theano.misc.pkl_utils.{dump,load}
 function_dump to help us reproduce user error during compilation
 assert_no_cpu_op
 pep8, flake8
 Better error messages
 On non-default modes, reduce the number of allocations when allow_gc=False
 Better lock
 Committers for this dev version only:
 Frederic Bastien
 Arnaud Bergeron
 Pierre Luc Carrier
 Iban Harlouchet
 Pascal Lamblin
 Chienli Ma
 Tim Cooijmans
 Nicolas Ballas
 Amjad Almahairi
 David Warde-Farley
 Christof Angermueller
 Ziye Fan
 Caglar
 Sina Honari
 Roy Xue
 hantek
 Mohammad Pezeshki
 Melanie Ducoffe
 Alexandre de Brebisson
 Harm de Vries
 Samira Shabanian
 Alex Lamb
 Ramana.S
 Francesco Visin
 Saizheng Zhang
 Ying Zhang
 Jan Schlüter
 Xavier Bouthillier
 Bart van Merrienboer
 Cesar Laurent
 Iulian Vlad Serban
 Li Yao
 Sigurd Spieckermann
 Dmitrii Serdiuk
 Kelvin Xu
 Sebastien Jean
 Thomas Mesnard
 SeonWook Park
 Vincent Michalski
 Dustin Webb
 Mikhail Korobov
 Orhan Firat
 Olivier Mastropietro
 Daniel Renshaw
 Julien Rebetez
 Peng Liu
 Sean Lee
 TimSalimans
 Andre Holzner
 Gijs van Tulder
 Guillaume Alain
 Julien Demouth
 Markus Beissinger
 Mehdi Mirza
 Moslem Kazemi
 Saxenauts
 Søren Kaae Sønderby
 sentient07
 Anatoly Belikov
 Diogo Moitinho de Almeida
 Jakub Sygnowski
 Kashif Rasul
 Laurent Dinh
 Rémy Léone
 Taesup (TS) Kim
 gw0 [http://gw.tnode.com/]
 mronian
 vesis84
 Benni
 Chiheb Trabelsi
 JesseLivezey
 Marius Killinger
 Matt Graham
 Matthew Willson
 Piotr Frankowski
 Stefan Krastanov
 vdumoulin
 Adithya Ganesh
 Anish Shah
 Balázs Hidasi
 Colin Raffel
 Cory Lorenz
 Doug
 Jesse Livezey
 John Salvatier
 John Zedlewski
 Jonathan Ho
 Kaixhin
 LiangChi Hsieh
 Lucas Beyer
 Luke Metz
 MarcAlexandre Cote
 Martin Arjovsky
 Matthias Kümmerer
 Sirisha Rambhatla
 briancheung
 cailw
 ivdorelian
 janmatthis
 jojolalpin
 joncrall
 peterjsadowski
 scottsievert
 Étienne Simon
 Flaxman
 AlOa
 Albert Zeyer
 Andrea
 Andy Jiang
 Balázs
 Ben Poole
 Brian Cheung
 Christophe Van Gysel
 Claude Coulombe
 Clay McLeod
 Dario Garcia
 Jakob Lombacher
 Joao Felipe Santos
 John Arevalo
 Jonas Degrave
 Martin Thoma
 Mathieu Germain
 Matthew Koichi Grimes
 Michael Eickenberg
 Michael Opitz
 Paul Hollensen
 Prayag Verma
 Saatvik Shah
 Sergei Lebedev
 Vik Kamath
 Wei Ouyang
 Wojciech Głogowski
 YiLin Juang
 Yurii Shevchuk
 Zach Dwiel
 dan
 eulerreich
 jotterbach
 rolf
 theaverageguy
 wuaalb
Theano at a Glance¶
Theano is a Python library that lets you define, optimize, and evaluate mathematical expressions, especially ones with multidimensional arrays (numpy.ndarray). Using Theano it is possible to attain speeds rivaling hand-crafted C implementations for problems involving large amounts of data. It can also surpass C on a CPU by many orders of magnitude by taking advantage of recent GPUs.
Theano combines aspects of a computer algebra system (CAS) with aspects of an optimizing compiler. It can also generate customized C code for many mathematical operations. This combination of CAS with optimizing compilation is particularly useful for tasks in which complicated mathematical expressions are evaluated repeatedly and evaluation speed is critical. For situations where many different expressions are each evaluated once Theano can minimize the amount of compilation/analysis overhead, but still provide symbolic features such as automatic differentiation.
Theano’s compiler applies many optimizations of varying complexity to these symbolic expressions. These optimizations include, but are not limited to:
 use of GPU for computations
 constant folding
 merging of similar subgraphs, to avoid redundant calculation
 arithmetic simplification (e.g. x*y/x -> y, --x -> x)
 inserting efficient BLAS operations (e.g. GEMM) in a variety of contexts
 using memory aliasing to avoid calculation
 using in-place operations wherever it does not interfere with aliasing
 loop fusion for elementwise sub-expressions
 improvements to numerical stability (e.g. computing log(1+x) accurately for tiny x)
 for a complete list, see Optimizations
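As a toy illustration of what a rewrite such as x*y/x -> y looks like, here is a minimal simplification pass over a tuple-based expression tree (only a sketch of the idea; Theano's real optimizer works on typed graphs with a much larger rule set):

```python
def simplify(expr):
    """Recursively apply the rewrite (x * y) / x -> y."""
    if isinstance(expr, tuple):
        op, *args = expr
        args = [simplify(a) for a in args]
        if (op == "div" and isinstance(args[0], tuple)
                and args[0][0] == "mul" and args[0][1] == args[1]):
            return args[0][2]  # ("mul", x, y) / x  ->  y
        return (op, *args)
    return expr  # a leaf variable such as "x"

assert simplify(("div", ("mul", "x", "y"), "x")) == "y"
```

Note that this rewrite is only valid when x is nonzero; a real optimizer has to track such side conditions.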
Theano was written at the LISA lab to support rapid development of efficient machine learning algorithms. Theano is named after the Greek mathematician, who may have been Pythagoras’ wife. Theano is released under a BSD license (link).
Sneak peek¶
Here is an example of how to use Theano. It doesn’t show off many of Theano’s features, but it illustrates concretely what Theano is.
import theano
from theano import tensor
# declare two symbolic floating-point scalars
a = tensor.dscalar()
b = tensor.dscalar()
# create a simple expression
c = a + b
# convert the expression into a callable object that takes (a,b)
# values as input and computes a value for c
f = theano.function([a,b], c)
# bind 1.5 to 'a', 2.5 to 'b', and evaluate 'c'
assert 4.0 == f(1.5, 2.5)
Theano is not a programming language in the normal sense because you write a program in Python that builds expressions for Theano. Still it is like a programming language in the sense that you have to
 declare variables (a, b) and give their types
 build expressions for how to put those variables together
 compile expression graphs to functions in order to use them for computation.
It is good to think of theano.function
as the interface to a
compiler which builds a callable object from a purely symbolic graph.
One of Theano’s most important features is that theano.function
can optimize a graph and even compile some or all of it into native
machine instructions.
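The compile step can be pictured with a toy model (a deliberately simplified sketch, not Theano's actual machinery; real compilation involves graph optimization and C code generation):

```python
class Var:
    """A symbolic input variable."""
    def __init__(self, name):
        self.name = name

class Add:
    """A symbolic addition node in the expression graph."""
    def __init__(self, left, right):
        self.left, self.right = left, right

def compile_graph(inputs, output):
    """Turn a symbolic graph into a plain Python callable."""
    def evaluate(node, env):
        if isinstance(node, Var):
            return env[node.name]
        return evaluate(node.left, env) + evaluate(node.right, env)
    names = [v.name for v in inputs]
    return lambda *args: evaluate(output, dict(zip(names, args)))

a, b = Var("a"), Var("b")
f = compile_graph([a, b], Add(a, b))  # analogous to theano.function([a, b], c)
assert f(1.5, 2.5) == 4.0
```

The point of the analogy: the graph exists independently of any particular inputs, and "compiling" it produces a reusable callable, which is where Theano gets the opportunity to optimize.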
What does it do that they don’t?¶
Theano is a Python library and optimizing compiler for manipulating and evaluating expressions, especially matrix-valued ones. Manipulation of matrices is typically done using the numpy package, so what does Theano do that Python and numpy do not?
 execution speed optimizations: Theano can use g++ or nvcc to compile parts of your expression graph into CPU or GPU instructions, which run much faster than pure Python.
 symbolic differentiation: Theano can automatically build symbolic graphs for computing gradients.
 stability optimizations: Theano can recognize [some] numerically unstable expressions and compute them with more stable algorithms.
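The symbolic-differentiation point can be illustrated with a minimal forward-mode sketch using dual numbers (a generic illustration of automatic differentiation, not Theano's approach; Theano builds gradient graphs symbolically):

```python
class Dual:
    """A value paired with its derivative with respect to the input."""
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot
    def __add__(self, other):
        return Dual(self.val + other.val, self.dot + other.dot)
    def __mul__(self, other):
        # Product rule: (uv)' = u'v + uv'
        return Dual(self.val * other.val,
                    self.dot * other.val + self.val * other.dot)

def grad(f, x):
    # Seed the input with derivative 1.0 and read off df/dx.
    return f(Dual(x, 1.0)).dot

# d/dx (x*x + x) = 2x + 1, which is 7 at x = 3
assert grad(lambda x: x * x + x, 3.0) == 7.0
```

A symbolic system like Theano goes further: the gradient is itself an expression graph, so it can be optimized and compiled just like the original function.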
The closest Python package to Theano is SymPy. Theano focuses more on tensor expressions than SymPy, and has more machinery for compilation. SymPy has more sophisticated algebra rules and can handle a wider variety of mathematical operations (such as series, limits, and integrals).
If NumPy is to be compared to MATLAB and SymPy to Mathematica, Theano is a sort of hybrid of the two which tries to combine the best of both worlds.
Getting started¶
 Installing Theano
 Instructions to download and install Theano on your system.
 Tutorial
 Getting started with Theano’s basic features. Go here if you are new!
 API Documentation
 Details of what Theano provides. It is recommended to go through the Tutorial first though.
A PDF version of the online documentation may be found here.
Theano Vision¶
This is the vision we have for Theano. It gives people an idea of what to expect in the future of Theano, but we can’t promise to implement all of it. It should also help you to understand where Theano fits in relation to other computational tools.
Support tensor and sparse operations
Support linear algebra operations
 Graph Transformations
 Differentiation/higher order differentiation
 ‘R’ and ‘L’ differential operators
 Speed/memory optimizations
 Numerical stability optimizations
Can use many compiled languages, instruction sets: C/C++, CUDA, OpenCL, PTX, CAL, AVX, ...
Lazy evaluation
Loop
Parallel execution (SIMD, multi-core, multi-node on cluster, multi-node distributed)
Support all NumPy/basic SciPy functionality
Easy wrapping of library functions in Theano
Note: There is no short term plan to support multi-node computation.
Theano Vision State¶
Here is the state of that vision as of December 3rd, 2013 (after Theano release 0.6):
 We support tensors using the numpy.ndarray object and we support many operations on them.
 We support sparse types by using the scipy.{csc,csr,bsr}_matrix object and support some operations on them.
 We have started implementing/wrapping more advanced linear algebra operations.
 We have many graph transformations that cover the 4 categories listed above.
 We can improve the graph transformation with better storage optimization
and instruction selection.
 Similar to autotuning during the optimization phase, but this doesn’t apply to only 1 op.
 Example of use: Determine if we should move computation to the GPU or not depending on the input size.
 Possible implementation note: allow Theano Variable in the fgraph to have more than 1 owner.
 We support Python 2 and Python 3.
 We have a CUDA backend for tensors of type float32 only.
 Efforts have begun towards a generic GPU ndarray (GPU tensor) (started in the
libgpuarray project)
 Move GPU backend outside of Theano.
 Will provide better support for GPU on Windows and support an OpenCL backend on CPU.
 Loops work, but not all related optimizations are currently done.
 The cvm linker allows lazy evaluation. It is the current default linker.
 How to have DebugMode check it? Right now, DebugMode checks the computation nonlazily.
 SIMD parallelism on the CPU comes from the compiler.
 Multi-core parallelism support is limited. If the external BLAS implementation supports it, many dot products are parallelized via gemm, gemv and ger. Also, elementwise operations are supported. See Multi cores support in Theano.
 No multinode support.
 Most, but not all, NumPy functions/aliases are implemented: https://github.com/Theano/Theano/issues/1080
 Wrapping an existing Python function is easy and documented.
 We know how to separate the shared variable memory storage location from its object type (tensor, sparse, dtype, broadcast flags), but we need to do it.
Contact us¶
Discussion about Theano takes place in the theano-dev and theano-users mailing lists. People interested in development of Theano should check the former, while the latter is reserved for issues that concern the end users.
Questions, comments, praise, criticism as well as bug reports should be submitted to these mailing lists.
We welcome all kinds of contributions. If you have any questions regarding how to extend Theano, please feel free to ask on the theanodev mailing list.
Installing Theano¶
Warning
If you want to install the bleeding-edge or development version of Theano from GitHub, please make sure you are reading the latest version of this page.
Requirements¶
In order to use Theano, the following libraries and software will need to be installed (MacOS and Windows users should refer to platformspecific instructions below for detailed installation steps):
 Linux, Mac OS X or Windows operating system
 We develop mainly on 64-bit Linux machines; other architectures are not well-tested.
 Python 2 >= 2.6 or Python 3 >= 3.3
 The development package (python-dev or python-devel on most Linux distributions) is recommended (see just below). Python 2.4 was supported up to and including the release 0.6. Python 3 is supported past the 3.3 release.
 g++ (Linux and Windows), clang (macOS), python-dev (all platforms)
 Not technically required but highly recommended, in order to compile generated C code. Theano can fall back on a NumPy-based Python execution model, but a C compiler allows for vastly faster execution. g++ >= 4.2 is required (for OpenMP, which is currently always used); a more recent version is recommended!
 NumPy >= 1.7.1
 Earlier versions could work, but we don’t test them.
 SciPy >= 0.11
 Only currently required for sparse matrix and special functions support, but highly recommended. SciPy >=0.8 could work, but earlier versions have known bugs with sparse matrices.
 A BLAS installation (with Level 3 functionality)
 Including the development headers (-dev, -devel, depending on your Linux distribution). Mac OS X comes with the Accelerate framework built in, and various options exist for Windows (see below).
The following libraries and software are optional:
 nose >= 1.3.0 and nose-parameterized >= 0.5.0
 Recommended, to run Theano’s test suite.
 Sphinx >= 0.5.1, pygments
 For building the documentation. LaTeX and dvipng are also necessary for math to show up as images.
 Git
 To download bleedingedge versions of Theano.
 graphviz and either pydot-ng or pydot
 To be able to make pictures of Theano computation graphs. pydot-ng is a pydot-compatible replacement that supports newer Python versions.
 NVIDIA CUDA drivers and SDK
 Required for GPU code generation/execution on NVIDIA GPUs
 libgpuarray
 Required for GPU/CPU code generation on CUDA and OpenCL devices (see: GpuArray Backend).
 Note: OpenCL support is still minimal for now.
Linux¶
CentOS 6¶
Easy Installation of an optimized Theano on CentOS 6 provides instructions on how to install Theano on CentOS 6, written by the Theano developers. It covers how to install Theano (for CPU-based computation only) with the distribution-packaged ATLAS, a free fast implementation of BLAS.
Ubuntu¶
Easy Installation of an Optimized Theano on Current Ubuntu provides instructions on how to install Theano on Ubuntu. It covers how to install Theano with the distribution-packaged OpenBLAS or ATLAS. Both are free, fast implementations of BLAS.
Alternative installation on Gentoo¶
Brian Vandenberg emailed installation instructions on Gentoo, focusing on how to install the appropriate dependencies.
Nicolas Pinto provides ebuild scripts.
Alternative installation on Mandriva 2010.2¶
A contributor made an RPM package of Theano 0.3.1 for Mandriva 2010.2.
AWS Marketplace with Bitfusion AMI¶
AWS EC2 AMI pre-installed with Nvidia drivers, CUDA, cuDNN, Theano, Keras, Lasagne, Python 2, Python 3, PyCuda, Scikit-Learn, Pandas, Enum34, iPython, and Jupyter. Note: as always, there is no charge for Theano and other open software; however, there is a charge for AWS hosting + Bitfusion.
Launch an instance from the AWS Marketplace.
Docker¶
Builds of Theano are available as Docker images: Theano Docker (CPU) or Theano Docker (CUDA). These are updated on a weekly basis with bleeding-edge builds of Theano. Examples of running bash in a Docker container are as follows:
sudo docker run -it kaixhin/theano
sudo nvidia-docker run -it kaixhin/cuda-theano:7.0
For a guide to Docker, see the official docs. CUDA support requires NVIDIA Docker. For more details on how to use the Theano Docker images, consult the source project.
Basic user install instructions¶
The easiest way to obtain the released version of Theano is from PyPI using pip (a replacement for easy_install provided by setuptools/distribute) by typing
pip install Theano
This should work under Python 2 or Python 3. To test, run
nosetests theano
You may need to add sudo before the pip command to install into your system’s site-packages directory. If you do not have administrator access to your machine, you can install Theano locally (to ~/.local) using
pip install Theano --user
Alternatively you can use virtualenv to create an isolated site-packages directory; see the virtualenv documentation for details.
Note
Theano can be installed with easy_install, however we recommend pip. pip offers many benefits over easy_install such as more intelligent dependency management, better error messages and a pip uninstall command for easily removing packages.
If you do not have pip installed but do have easy_install, you can get pip by simply typing easy_install pip.
Updating Theano¶
The following command will update only Theano:
sudo pip install --upgrade --no-deps theano
The following command will update Theano and NumPy/SciPy (see warning below):
sudo pip install --upgrade theano
If you installed NumPy/SciPy with yum/apt-get, updating NumPy/SciPy with pip/easy_install is not always a good idea. This can make Theano crash due to problems with BLAS (but see below). The versions of NumPy/SciPy in the distribution are sometimes linked against faster versions of BLAS. Installing NumPy/SciPy with yum/apt-get/pip/easy_install won’t install the development package needed to recompile it with the fast version. This means that if you don’t install the development packages manually, when you recompile the updated NumPy/SciPy, it will compile with the slower version. This results in a slower Theano as well. To fix the crash, you can clear the Theano cache like this:
theano-cache clear
Bleedingedge install instructions¶
Master Tests Status:
If you are a developer of Theano, then check out the Developer Start Guide.
If you want the bleeding-edge without developing the code, you can use pip for this with the command line below. Note that it will also try to install Theano’s dependencies (like NumPy and SciPy), but not upgrade them. If you wish to upgrade them, remove the --no-deps switch, but see the earlier warning about upgrading NumPy/SciPy before doing this.
pip install --upgrade --no-deps git+git://github.com/Theano/Theano.git
or (if you want to install it for the current user only):
pip install --upgrade --no-deps git+git://github.com/Theano/Theano.git --user
The following are general instructions that will set you up with the bleeding-edge version of Theano and allow you to hack it. First, get the code using Git:
git clone git://github.com/Theano/Theano.git
From here, the easiest way to get started is (this requires setuptools or distribute to be installed):
cd Theano
python setup.py develop
This will install a .pth file in your site-packages directory that tells Python where to look for your Theano installation (i.e. in the directory you just checked out of GitHub). Using develop mode is preferable to install as any modifications you make in the checkout directory (or changes you pull with Git) will be automatically reflected in the “installed” version without re-running python setup.py install.
If you do not have permission to modify your site-packages directory you can specify an alternative installation prefix using
python setup.py develop --prefix=~/.local
A common choice is ~/.local, which is automatically searched for Python >= 2.6; for earlier Python versions and other installation prefixes, the prefix specified must contain lib/pythonA.B/site-packages, where A.B is e.g. 2.5, and this site-packages directory must be listed in PYTHONPATH.
An alternative, perhaps simpler way of creating and using an isolated
site-packages
is to use virtualenv; see the virtualenv documentation
for details. If you find yourself using virtualenv frequently you may find the
virtualenvwrapper package useful for switching between them.
Configuring PYTHONPATH¶
If import theano does not work in Python, you may need to modify the environment variable PYTHONPATH accordingly.
In bash, you may do this:
export PYTHONPATH=<new location to add>:$PYTHONPATH
In csh:
setenv PYTHONPATH <new location to add>:$PYTHONPATH
To make this change stick you will usually need to add the above command to
your shell’s startup script, i.e. ~/.bashrc
or ~/.cshrc
.
Consult your shell’s documentation for details.
Updating¶
To update your library to the latest revision, change directory (cd) to your Theano folder and execute the following command:
git pull
You should update frequently; bugs are fixed on a very regular basis.
Specific git commit¶
You can install a specific git commit by using the bleeding-edge instructions and adding @COMMIT_ID to the pip command, like:
pip install --upgrade --no-deps git+git://github.com/Theano/Theano.git@07e9332a0932e90c47ed2a70fc3c7f8a55d2aa23
Testing your installation¶
Once you have installed Theano, you should run the test suite. At a Python (or IPython) interpreter,
import theano
theano.test()
You can also run them in-place from the Git checkout directory by typing
theano-nose
You should be able to execute it if you followed the instructions above. If theano-nose is not found by your shell, you will need to add Theano/bin to your PATH environment variable.
Note
In Theano versions <= 0.5, theano-nose was not included. If you are working with such a version, you can call nosetests instead of theano-nose. In that case, some tests will fail by raising the KnownFailureTest exception, and will be considered errors, but they are nothing to worry about.
Note
The tests should be run with the configuration option device set to cpu (default). If you need to change this value, you can do that by setting the THEANO_FLAGS environment variable, by prefixing the theano-nose command with THEANO_FLAGS=device=cpu.
If you have a GPU, it will automatically be used to run GPU-related tests. If you want GPU-related tests to run on a specific GPU device, and not the default one, you should use init_gpu_device. For instance: THEANO_FLAGS=device=cpu,init_gpu_device=gpu1.
See config – Theano Configuration for more information on how to change these configuration options.
All tests should pass (skipped tests and known failures are normal). If
some test fails on your machine, you are encouraged to tell us what went
wrong on the theano-users@googlegroups.com
mailing list.
Troubleshooting: Make sure you have a BLAS library¶
There are many ways to configure BLAS for Theano. This is done with the Theano
flags blas.ldflags
(config – Theano Configuration). The default is to use the BLAS
installation information in NumPy, accessible via
numpy.distutils.__config__.show()
. You can tell Theano to use a different
version of BLAS, in case you did not compile NumPy with a fast BLAS or if NumPy
was compiled with a static library of BLAS (the latter is not supported in
Theano).
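You can inspect the BLAS information that NumPy itself records; a small sketch (the module housing this information has moved between NumPy versions, so both locations are tried here — this is a NumPy detail, not part of Theano):

```python
# Inspect the BLAS that NumPy was built against.
try:
    import numpy.distutils.__config__ as blas_info  # older NumPy versions
except ImportError:
    import numpy.__config__ as blas_info            # newer NumPy versions
blas_info.show()  # prints the linked BLAS/LAPACK libraries and flags
```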
The short way to configure the Theano flags blas.ldflags
is by setting the
environment variable THEANO_FLAGS
to blas.ldflags=XXX
(in bash
export THEANO_FLAGS=blas.ldflags=XXX
)
The ${HOME}/.theanorc
file is the simplest way to set a relatively
permanent option like this one. Add a [blas]
section with an ldflags
entry like this:
# other stuff can go here
[blas]
ldflags = -lf77blas -latlas -lgfortran # put your flags here
# other stuff can go here
For more information on the formatting of ~/.theanorc
and the
configuration options that you can put there, see config – Theano Configuration.
Here are some different ways to configure BLAS:
0) Do nothing and use the default config, which is to link against the same BLAS against which NumPy was built. This does not work if NumPy was compiled with a static library of BLAS (e.g. ATLAS is compiled by default only as a static library).
1) Disable the usage of BLAS and fall back on NumPy for dot products. To do
this, set the value of blas.ldflags
as the empty string (e.g. export
THEANO_FLAGS=blas.ldflags=
). Depending on the kind of matrix operations your
Theano code performs, this might slow some things down (vs. linking with BLAS
directly).
2) You can install the default (reference) version of BLAS if the NumPy version
(against which Theano links) does not work. If you have root or sudo access in
Fedora you can do sudo yum install blas blas-devel
. Under Ubuntu/Debian
sudo apt-get install libblas-dev
. Then use the Theano flags
blas.ldflags=-lblas
. Note that the default version of BLAS is not optimized.
Using an optimized version can give up to 10x speedups in the BLAS functions
that we use.
3) Install the ATLAS library. ATLAS is an open source optimized version of
BLAS. You can install a precompiled version on most OSes, but if you’re willing
to invest the time, you can compile it to have a faster version (we have seen
speedups of up to 3x, especially on more recent computers, against the
precompiled one). On Fedora, sudo yum install atlas-devel
. Under Ubuntu,
sudo apt-get install libatlas-base-dev libatlas-base
or
libatlas3gf-sse2
if your CPU supports SSE2 instructions. Then set the
Theano flags blas.ldflags
to -lf77blas -latlas -lgfortran
. Note that
these flags are sometimes OS-dependent.
4) Use a faster version like MKL, GOTO, ... You are on your own to install it.
See the doc of that software and set the Theano flags blas.ldflags
correctly (for example, for MKL this might be -lmkl -lguide -lpthread
or
-lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -lguide -liomp5 -lmkl_mc
-lpthread
).
Note
Make sure your BLAS libraries are available as dynamically loadable libraries. ATLAS is often installed only as a static library. Theano is not able to use this static library. Your ATLAS installation might need to be modified to provide dynamically loadable libraries. (On Linux this typically means a library whose name ends with .so. On Windows this will be a .dll, and on OSX it might be either a .dylib or a .so.)
This might be just a problem with the way Theano passes compilation arguments to g++, but the problem is not fixed yet.
Note
If you have problems linking with MKL, Intel Line Advisor and the MKL User Guide can help you find the correct flags to use.
Using the GPU¶
The first thing you’ll need for Theano to use your GPU is Nvidia’s
GPU-programming toolchain. You should install at least the CUDA driver and the CUDA Toolkit, as
described here. The CUDA
Toolkit installs a folder on your computer with subfolders bin, lib,
include, and some more too. (Sanity check: The bin subfolder should contain an nvcc
program which is the compiler for GPU code.) This folder is called the cuda
root directory.
You must also add the ‘lib’ subdirectory (and/or ‘lib64’ subdirectory if you have a 64-bit Linux
computer) to your $LD_LIBRARY_PATH
environment variable.
You must then tell Theano where the CUDA root folder is, and there are three ways to do it. Any one of them is enough.
 Define a $CUDA_ROOT environment variable to equal the cuda root directory, as in
CUDA_ROOT=/path/to/cuda/root
, or  add a
cuda.root
flag to THEANO_FLAGS
, as in THEANO_FLAGS='cuda.root=/path/to/cuda/root'
, or  add a [cuda] section to your .theanorc file containing the option
root = /path/to/cuda/root
.
Note
On Debian, you can ask the software package manager to install it
for you. We have a user report that this works for Debian Wheezy
(7.0). When you install it this way, you won’t always have the
latest version, but we were told that it gets updated
regularly. One big advantage is that it will be updated
automatically. You can try the sudo apt-get install
nvidia-cuda-toolkit
command to install it.
Once that is done, the only thing left is to change the device
option to name the GPU device in your
computer, and set the default floating point computations to float32.
For example: THEANO_FLAGS='cuda.root=/path/to/cuda/root,device=gpu,floatX=float32'
.
You can also set these options in the .theanorc file’s [global]
section:
[global]
device = gpu
floatX = float32
Note that:
 If your computer has multiple GPUs and you use ‘device=gpu’, the driver selects the one to use (usually gpu0).
 You can use the program nvidia-smi to change this policy.
 You can choose one specific GPU by specifying ‘device=gpuX’, with X the corresponding GPU index (0, 1, 2, ...)
 By default, when
device
indicates preference for GPU computations, Theano will fall back to the CPU if there is a problem with the GPU. You can use the flag ‘force_device=True’ to instead raise an error when Theano cannot use the GPU.
Once your setup is complete, head to Using the GPU to find how to verify everything is working properly.
Mac OS¶
There are various ways to install Theano dependencies on a Mac. Here we describe the process in detail with Canopy, Anaconda, Homebrew or MacPorts, but if you did it differently and it worked, please let us know the details on the theano-users mailing list, so that we can add alternate instructions here.
In academia: Enthought Canopy¶
If you are working in academia, the easiest way to install most of the dependencies is to install Canopy. If you are affiliated with a university (as student or employee), you can download the installer for free.
The Canopy installation includes in particular Python (and the development headers), NumPy, SciPy, nose, sphinx, pip, pydot (but not Graphviz, which is necessary for it to work) and the MKL implementation of BLAS.
To install the latest Theano release execute this in a terminal:
$ pip install Theano
If you want the bleeding edge version execute this command instead:
$ pip install --upgrade --no-deps git+git://github.com/Theano/Theano.git
See the section install_bleeding_edge for more information on the bleeding edge version.
Then you must install the compiler. See Installing the compiler below.
Note
If you use version 0.6 or later of Theano, we try to automatically link with the Canopy BLAS version. Due to Mac OS peculiarities, this requires user intervention. We detect whether the manipulation was done, and give an error message explaining what to do in case it hasn’t been.
Anaconda¶
An easy way to install most of the dependencies is to install
Anaconda. There is a free
version available to everybody. If you install their MKL
Optimizations
product (free for academics, ~$30 otherwise), Theano
will also be optimized as we will reuse the faster BLAS version
automatically.
The Anaconda installation includes in particular Python (and the development headers), NumPy, SciPy, nose, sphinx, pip, and an acceptable BLAS version.
After installing Anaconda, in a terminal execute this command to install the latest Theano release:
$ pip install Theano
To install the missing Theano optional dependency (pydot):
$ conda install pydot-ng
If you want the bleeding edge version instead execute this command:
$ pip install --upgrade --no-deps git+git://github.com/Theano/Theano.git
See the section install_bleeding_edge for more information on the bleeding edge version.
Then you must install the compiler. See Installing the compiler below.
Note
If you use version 0.6 or later of Theano, we try to automatically link with the python library. Due to Mac OS peculiarities, this requires user intervention. We detect whether the modification was done and, if not, explain how to do it.
Installing the compiler¶
Theano officially supports only clang on OS X. This can be installed by getting XCode from the App Store and running it once to install the command-line tools.
If you still want to use g++ you can do so by setting its full path in the Theano config flag gxx. Note that any bug reports on Mac using g++ will be ignored unless they can be reproduced with clang.
Homebrew¶
Install python with homebrew:
$ brew install python # or python3 if you prefer
This will install pip. Then use pip to install numpy, scipy:
$ pip install numpy scipy
If you want to use OpenBLAS instead of Accelerate, you have to install numpy and scipy with Homebrew:
$ brew tap homebrew/python
$ brew install numpy --with-openblas
$ brew install scipy --with-openblas
Then install theano as usual:
$ pip install Theano --user
Or for the bleedingedge version:
$ pip install --upgrade --no-deps git+git://github.com/Theano/Theano.git
MacPorts¶
Using MacPorts to install all required Theano dependencies is easy, but be aware that it will take a long time (a few hours) to build and install everything.
MacPorts requires installing XCode first (which can be found in the Mac App Store), if you do not have it already. If you can’t install it from the App Store, look in your MacOS X installation DVD for an old version. Then update your Mac to update XCode.
Download and install MacPorts, then ensure its package list is uptodate with
sudo port selfupdate
. Then, in order to install one or more of the required libraries, use
port install
, e.g. as follows:
$ sudo port install py27-numpy +atlas py27-scipy +atlas py27-pip
This will install all the required Theano dependencies. gcc will be automatically installed (since it is a SciPy dependency), but be aware that it takes a long time to compile (hours)! Having NumPy and SciPy linked with ATLAS (an optimized BLAS implementation) is not mandatory, but recommended if you care about performance.
You might have some different versions of gcc, SciPy, NumPy, Python installed on your system, perhaps via Xcode. It is a good idea to use either the MacPorts version of everything or some other set of compatible versions (e.g. provided by Xcode or Fink). The advantages of MacPorts are the transparency with which everything can be installed and the fact that packages are updated quite frequently. The following steps describe how to make sure you are using the MacPorts version of these packages.
In order to use the MacPorts version of Python, you will probably need to explicitly select it with
sudo port select python python27
. The reason this is necessary is because you may have an Apple-provided Python (via, for example, an Xcode installation). After performing this step, you should check that the symbolic link provided by which python
points to the MacPorts python. For instance, on MacOS X Lion with MacPorts 2.0.3, the output of which python
is /opt/local/bin/python
and this symbolic link points to /opt/local/bin/python2.7
. When executing sudo port select python python27-apple
(which you should not do), the link points to /usr/bin/python2.7
. Similarly, make sure that you are using the MacPorts-provided gcc: use
sudo port select --list gcc
to see which gcc installs you have on the system. Then execute for instance sudo port select --set gcc mp-gcc44
to create a symlink that points to the correct (MacPorts) gcc (version 4.4 in this case). At this point, if you have not done so already, it may be a good idea to close and restart your terminal, to make sure all configuration changes are properly taken into account.
Afterwards, please check that the
scipy
module that is imported in Python is the right one (and is a recent one). For instance, import scipy
followed by print(scipy.__version__)
and print(scipy.__path__)
should result in a version number of at least 0.7.0 and a path that starts with /opt/local
(the path where MacPorts installs its packages). If this is not the case, then you might have some old installation of scipy
in your PYTHONPATH
so you should edit PYTHONPATH
accordingly. Please follow the same procedure with
numpy
. This is covered in the MacPorts installation process, but make sure that your
PATH
environment variable contains /opt/local/bin
and /opt/local/sbin
before any other paths (to ensure that the Python and gcc binaries that you installed with MacPorts are visible first). MacPorts does not automatically create
nosetests
and pip
symlinks pointing to the MacPorts version, so you can add them yourself with
$ sudo ln -s /opt/local/bin/nosetests-2.7 /opt/local/bin/nosetests
$ sudo ln -s /opt/local/bin/pip-2.7 /opt/local/bin/pip
At this point you are ready to install Theano with
$ sudo pip install Theano
And if you are in no hurry, you can run its testsuite with
$ python -c "import theano; theano.test()"
Using the GPU¶
You should be able to follow the Linux instructions to setup CUDA, but be aware of the following caveats:
 If you want to compile the CUDA SDK code, you may need to temporarily revert back to Apple’s gcc (
sudo port select gcc
) as their Makefiles are not compatible with MacPorts’ gcc. If CUDA seems unable to find a CUDA-capable GPU, you may need to manually toggle your GPU on, which can be done with gfxCardStatus.
Once your setup is complete, head to Using the GPU to find how to verify everything is working properly.
Troubleshooting MacOS issues¶
Although the above steps should be enough, running Theano on a Mac may sometimes cause unexpected crashes, typically due to multiple versions of Python or other system libraries. If you encounter such problems, you may try the following.
You can ensure MacPorts shared libraries are given priority at runtime with
export LD_LIBRARY_PATH=/opt/local/lib:$LD_LIBRARY_PATH
. In order to do the same at compile time, you can add to your ~/.theanorc
:
[gcc]
cxxflags = -L/opt/local/lib
An obscure
Bus error
can sometimes be caused when linking Theano-generated object files against the framework
library in Leopard. For this reason, we have disabled linking with framework Python
, since on most configurations this solves the Bus error
problem. If this default configuration causes problems with your Python/Theano installation and you think that linking with framework Python
might help, then either set the THEANO_FLAGS
environment variable with THEANO_FLAGS=cmodule.mac_framework_link
or edit your ~/.theanorc
to contain
[cmodule]
mac_framework_link = True
More generally, to investigate library issues, you can use the
otool -L
command on .so
files found under your ~/.theano
directory. This will list shared library dependencies, and may help identify incompatibilities.
Please inform us if you have trouble installing and running Theano on your Mac.
We would be especially interested in dependencies that we missed listing,
alternate installation steps, GPU instructions, as well as tests that fail on
your platform (use the theano-users@googlegroups.com
mailing list, but
note that you must first register to it, by going to theano-users).
Windows¶
Installation of Theano on Windows provides step-by-step instructions on how to install Theano on 32- or 64-bit Windows systems, using freely available tools and compilers.
Editing code in Visual Studio¶
You will find a Visual Studio solution file (Theano.sln
) in the root of
the Theano repository. Note that this project file may not be kept up-to-date
and is not officially supported by the core Theano developers: it is provided
for convenience only.
Also, be aware that it will not make Theano use Visual Studio to compile C
files: it is only meant to provide an easy way to edit Theano code within
the Visual Studio editor.
Windows Installation References¶
 http://stackoverflow.com/questions/9047072/windowspythonversionandvcredistributableversion
 http://stackoverflow.com/questions/1865069/howtocompilea64bitapplicationusingvisualc2010express
 http://blog.victorjabur.com/2011/06/05/compilingpython27modulesonwindows32and64usingmsvc2008express/
 http://stackoverflow.com/questions/126279/c99stdinthheaderandmsvisualstudio
 http://stackoverflow.com/questions/11182765/howcanibuildmycextensionswithmingww64inpython
 https://mail.python.org/pipermail/pythonannouncelist/2014September/010457.html
Generating the documentation¶
You can read the latest HTML documentation here. You can download the latest PDF documentation here.
We recommend you look at the documentation on the website, since it will be more current than the documentation included with the package.
If you really wish to build the documentation yourself, you will need sphinx, as described above. Issue the following command:
python ./doc/scripts/docgen.py
Documentation is built into html/
.
The PDF of the documentation is html/theano.pdf
.
Tutorial¶
Let us start an interactive session (e.g. with python
or ipython
) and import Theano.
>>> from theano import *
Several of the symbols you will need to use are in the tensor
subpackage
of Theano. Let us import that subpackage under a handy name like
T
(the tutorials will frequently use this convention).
>>> import theano.tensor as T
If that succeeded you are ready for the tutorial, otherwise check your installation (see Installing Theano).
Throughout the tutorial, bear in mind that there is a Glossary as well as index and modules links in the upperright corner of each page to help you out.
Prerequisites¶
Python tutorial¶
In this documentation, we suppose that the reader knows Python. Here is a small list of Python tutorials/exercises if you need to learn it or only need a refresher:
 Python Challenge
 Dive into Python
 Google Python Class
 Enthought Python course (free for academics)
We have a tutorial on how Python manages its memory.
NumPy refresher¶
 Here are some quick guides to NumPy:
Matrix conventions for machine learning¶
Rows are horizontal and columns are vertical. Every row is an example. Therefore, inputs[10,5] is a matrix of 10 examples where each example has dimension 5. If this were the input of a neural network then the weights from the input to the first hidden layer would represent a matrix of size (5, #hid).
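This convention can be sketched in plain NumPy (the hidden-layer size below is a made-up illustration value, not from the text):

```python
import numpy

# inputs[10, 5]: 10 examples, each of dimension 5.
inputs = numpy.ones((10, 5))
n_hid = 3                            # hypothetical hidden-layer size
weights = numpy.zeros((5, n_hid))    # shape (input dimension, #hid)
hidden = inputs.dot(weights)         # one row of activations per example
assert hidden.shape == (10, n_hid)
```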
Consider this array:
>>> numpy.asarray([[1., 2], [3, 4], [5, 6]])
array([[ 1.,  2.],
       [ 3.,  4.],
       [ 5.,  6.]])
>>> numpy.asarray([[1., 2], [3, 4], [5, 6]]).shape
(3, 2)
This is a 3x2 matrix, i.e. there are 3 rows and 2 columns.
To access the entry in the 3rd row (row #2) and the 1st column (column #0):
>>> numpy.asarray([[1., 2], [3, 4], [5, 6]])[2, 0]
5.0
To remember this, keep in mind that we read left-to-right, top-to-bottom, so each thing that is contiguous is a row. That is, there are 3 rows and 2 columns.
Broadcasting¶
NumPy does broadcasting of arrays of different shapes during arithmetic operations. What this means in general is that the smaller array (or scalar) is broadcast across the larger array so that they have compatible shapes. The example below shows an instance of broadcasting:
>>> a = numpy.asarray([1.0, 2.0, 3.0])
>>> b = 2.0
>>> a * b
array([ 2.,  4.,  6.])
The smaller array b
(actually a scalar here, which works like a 0-d array) in this case is broadcasted to the same size
as a
during the multiplication. This trick is often useful in
simplifying how expressions are written. More detail about broadcasting
can be found in the numpy user guide.
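Broadcasting also applies between arrays of different rank; a brief NumPy sketch (the values are illustrative only):

```python
import numpy

# A 1-d array is broadcast across each row of a 2-d array when the
# trailing dimensions match.
a = numpy.asarray([[1.0, 2.0, 3.0],
                   [4.0, 5.0, 6.0]])   # shape (2, 3)
b = numpy.asarray([10.0, 20.0, 30.0])  # shape (3,)
c = a + b                              # b is stretched to shape (2, 3)
assert c.tolist() == [[11.0, 22.0, 33.0], [14.0, 25.0, 36.0]]
```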
Basics¶
Baby Steps  Algebra¶
Adding two Scalars¶
To get us started with Theano and get a feel of what we’re working with, let’s make a simple function: add two numbers together. Here is how you do it:
>>> import numpy
>>> import theano.tensor as T
>>> from theano import function
>>> x = T.dscalar('x')
>>> y = T.dscalar('y')
>>> z = x + y
>>> f = function([x, y], z)
And now that we’ve created our function we can use it:
>>> f(2, 3)
array(5.0)
>>> numpy.allclose(f(16.3, 12.1), 28.4)
True
Let’s break this down into several steps. The first step is to define
two symbols (Variables) representing the quantities that you want
to add. Note that from now on, we will use the term
Variable to mean “symbol” (in other words,
x, y, z are all Variable objects). The output of the function
f is a numpy.ndarray
with zero dimensions.
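The behaviour of a zero-dimensional ndarray can be seen directly in NumPy, without Theano:

```python
import numpy

# A 0-dimensional ndarray, like the one returned by f(2, 3) above,
# has an empty shape but still converts cleanly to a Python number.
r = numpy.array(5.0)
assert r.ndim == 0 and r.shape == ()
assert float(r) == 5.0
```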
If you are following along and typing into an interpreter, you may have
noticed that there was a slight delay in executing the function
instruction. Behind the scenes, f was being compiled into C code.
Step 1
>>> x = T.dscalar('x')
>>> y = T.dscalar('y')
In Theano, all symbols must be typed. In particular, T.dscalar
is the type we assign to “0-dimensional arrays (scalars) of doubles
(d)”. It is a Theano Type.
dscalar
is not a class. Therefore, neither x nor y
are actually instances of dscalar
. They are instances of
TensorVariable
. x and y
are, however, assigned the Theano Type dscalar
in their type
field, as you can see here:
>>> type(x)
<class 'theano.tensor.var.TensorVariable'>
>>> x.type
TensorType(float64, scalar)
>>> T.dscalar
TensorType(float64, scalar)
>>> x.type is T.dscalar
True
By calling T.dscalar
with a string argument, you create a
Variable representing a floatingpoint scalar quantity with the
given name. If you provide no argument, the symbol will be unnamed. Names
are not required, but they can help debugging.
More will be said in a moment regarding Theano’s inner structure. You could also learn more by looking into Graph Structures.
Step 2
The second step is to combine x and y into their sum z:
>>> z = x + y
z is yet another Variable which represents the addition of x and y. You can use the pp function to pretty-print the computation associated to z.
>>> from theano import pp
>>> print(pp(z))
(x + y)
Step 3
The last step is to create a function taking x and y as inputs and giving z as output:
>>> f = function([x, y], z)
The first argument to function
is a list of Variables
that will be provided as inputs to the function. The second argument
is a single Variable or a list of Variables. For either case, the second
argument is what we want to see as output when we apply the function. f may
then be used like a normal Python function.
Note
As a shortcut, you can skip step 3, and just use a variable’s
eval
method.
The eval()
method is not as flexible
as function()
but it can do everything we’ve covered in
the tutorial so far. It has the added benefit of not requiring
you to import function()
. Here is how eval()
works:
>>> import numpy
>>> import theano.tensor as T
>>> x = T.dscalar('x')
>>> y = T.dscalar('y')
>>> z = x + y
>>> numpy.allclose(z.eval({x : 16.3, y : 12.1}), 28.4)
True
We passed eval()
a dictionary mapping symbolic Theano
variables to the values to substitute for them, and it returned
the numerical value of the expression.
eval()
will be slow the first time you call it on a variable –
it needs to call function()
to compile the expression behind
the scenes. Subsequent calls to eval()
on that same variable
will be fast, because the variable caches the compiled function.
Adding two Matrices¶
You might already have guessed how to do this. Indeed, the only change from the previous example is that you need to instantiate x and y using the matrix Types:
>>> x = T.dmatrix('x')
>>> y = T.dmatrix('y')
>>> z = x + y
>>> f = function([x, y], z)
dmatrix
is the Type for matrices of doubles. Then we can use
our new function on 2D arrays:
>>> f([[1, 2], [3, 4]], [[10, 20], [30, 40]])
array([[ 11.,  22.],
       [ 33.,  44.]])
The variable is a NumPy array. We can also use NumPy arrays directly as inputs:
>>> import numpy
>>> f(numpy.array([[1, 2], [3, 4]]), numpy.array([[10, 20], [30, 40]]))
array([[ 11.,  22.],
       [ 33.,  44.]])
It is possible to add scalars to matrices, vectors to matrices, scalars to vectors, etc. The behavior of these operations is defined by broadcasting.
The following types are available:
 byte:
bscalar, bvector, bmatrix, brow, bcol, btensor3, btensor4
 16bit integers:
wscalar, wvector, wmatrix, wrow, wcol, wtensor3, wtensor4
 32bit integers:
iscalar, ivector, imatrix, irow, icol, itensor3, itensor4
 64bit integers:
lscalar, lvector, lmatrix, lrow, lcol, ltensor3, ltensor4
 float:
fscalar, fvector, fmatrix, frow, fcol, ftensor3, ftensor4
 double:
dscalar, dvector, dmatrix, drow, dcol, dtensor3, dtensor4
 complex:
cscalar, cvector, cmatrix, crow, ccol, ctensor3, ctensor4
The previous list is not exhaustive and a guide to all types compatible with NumPy arrays may be found here: tensor creation.
Note
You, the user—not the system architecture—have to choose whether your
program will use 32 or 64bit integers (i
prefix vs. the l
prefix)
and floats (f
prefix vs. the d
prefix).
More Examples¶
At this point it would be wise to begin familiarizing yourself more systematically with Theano’s fundamental objects and operations by browsing this section of the library: Basic Tensor Functionality.
As the tutorial unfolds, you should also gradually acquaint yourself with the other relevant areas of the library and with the relevant subjects of the documentation entrance page.
Logistic Function¶
Here’s another straightforward example, though a bit more elaborate than adding two numbers together. Let’s say that you want to compute the logistic curve, which is given by:
s(x) = 1 / (1 + e^(-x))
You want to compute the function elementwise on matrices of doubles, which means that you want to apply this function to each individual element of the matrix.
Well, what you do is this:
>>> import theano
>>> import theano.tensor as T
>>> x = T.dmatrix('x')
>>> s = 1 / (1 + T.exp(-x))
>>> logistic = theano.function([x], s)
>>> logistic([[0, 1], [-1, -2]])
array([[ 0.5       ,  0.73105858],
       [ 0.26894142,  0.11920292]])
The reason logistic is performed elementwise is because all of its operations—division, addition, exponentiation, and negation—are themselves elementwise operations.
It is also the case that:
s(x) = (1 + tanh(x/2)) / 2
We can verify that this alternate form produces the same values:
>>> s2 = (1 + T.tanh(x / 2)) / 2
>>> logistic2 = theano.function([x], s2)
>>> logistic2([[0, 1], [-1, -2]])
array([[ 0.5       ,  0.73105858],
       [ 0.26894142,  0.11920292]])
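You can check the equivalence of the two formulations in plain NumPy, mirroring the Theano expressions without compiling anything:

```python
import numpy

# Verify numerically that 1/(1+e^(-x)) == (1 + tanh(x/2))/2.
x = numpy.asarray([[0.0, 1.0], [-1.0, -2.0]])
s1 = 1 / (1 + numpy.exp(-x))
s2 = (1 + numpy.tanh(x / 2)) / 2
assert numpy.allclose(s1, s2)
assert abs(s1[0, 0] - 0.5) < 1e-12  # logistic(0) == 0.5
```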
Computing More than one Thing at the Same Time¶
Theano supports functions with multiple outputs. For example, we can compute the elementwise difference, absolute difference, and squared difference between two matrices a and b at the same time:
>>> a, b = T.dmatrices('a', 'b')
>>> diff = a - b
>>> abs_diff = abs(diff)
>>> diff_squared = diff**2
>>> f = theano.function([a, b], [diff, abs_diff, diff_squared])
Note
dmatrices produces as many outputs as names that you provide. It is a shortcut for allocating symbolic variables that we will often use in the tutorials.
When we use the function f, it returns the three variables (the printing was reformatted for readability):
>>> f([[1, 1], [1, 1]], [[0, 1], [2, 3]])
[array([[ 1.,  0.],
       [-1., -2.]]), array([[ 1.,  0.],
       [ 1.,  2.]]), array([[ 1.,  0.],
       [ 1.,  4.]])]
Setting a Default Value for an Argument¶
Let’s say you want to define a function that adds two numbers, except that if you only provide one number, the other input is assumed to be one. You can do it like this:
>>> from theano import In
>>> from theano import function
>>> x, y = T.dscalars('x', 'y')
>>> z = x + y
>>> f = function([x, In(y, value=1)], z)
>>> f(33)
array(34.0)
>>> f(33, 2)
array(35.0)
This makes use of the In class which allows
you to specify properties of your function’s parameters with greater detail. Here we
give a default value of 1 for y by creating an In
instance with
its value
field set to 1.
Inputs with default values must follow inputs without default values (like Python’s functions). There can be multiple inputs with default values. These parameters can be set positionally or by name, as in standard Python:
>>> x, y, w = T.dscalars('x', 'y', 'w')
>>> z = (x + y) * w
>>> f = function([x, In(y, value=1), In(w, value=2, name='w_by_name')], z)
>>> f(33)
array(68.0)
>>> f(33, 2)
array(70.0)
>>> f(33, 0, 1)
array(33.0)
>>> f(33, w_by_name=1)
array(34.0)
>>> f(33, w_by_name=1, y=0)
array(33.0)
Note
In
does not know the name of the local variables y and w
that are passed as arguments. The symbolic variable objects have name
attributes (set by dscalars
in the example above) and these are the
names of the keyword parameters in the functions that we build. This is
the mechanism at work in In(y, value=1)
. In the case of In(w,
value=2, name='w_by_name')
, we override the symbolic variable’s name
attribute with a name to be used for this function.
You may like to see Function in the library for more detail.
Copying functions¶
Theano functions can be copied, which can be useful for creating similar
functions but with different shared variables or updates. This is done using
the copy()
method of function
objects. The optimized graph of the original function is copied,
so compilation only needs to be performed once.
Let’s start from the accumulator defined above. Let’s add the on_unused_input='ignore'
parameter in case we don’t want to use both of our current arguments in a future copy of the function (this isn’t necessary on versions > 0.8.2):
>>> import theano
>>> import theano.tensor as T
>>> state = theano.shared(0)
>>> inc = T.iscalar('inc')
>>> accumulator = theano.function([inc], state, updates=[(state, state+inc)], on_unused_input='ignore')
We can use it to increment the state as usual:
>>> accumulator(10)
array(0)
>>> print(state.get_value())
10
We can use copy()
to create a similar accumulator but with its own internal state
using the swap
parameter, which is a dictionary of shared variables to exchange:
>>> new_state = theano.shared(0)
>>> new_accumulator = accumulator.copy(swap={state:new_state})
>>> new_accumulator(100)
[array(0)]
>>> print(new_state.get_value())
100
The state of the first function is left untouched:
>>> print(state.get_value())
10
We now create a copy with updates removed using the delete_updates
parameter, which is set to False
by default. Notice our new copy doesn’t actually use the inc
argument after removing the updates
parameter:
>>> null_accumulator = accumulator.copy(delete_updates=True)
As expected, the shared state is no longer updated:
>>> null_accumulator(9000)
[array(10)]
>>> print(state.get_value())
10
Using Random Numbers¶
Because in Theano you first express everything symbolically and afterwards compile this expression to get functions, using pseudorandom numbers is not as straightforward as it is in NumPy, though also not too complicated.
The way to think about putting randomness into Theano’s computations is to put random variables in your graph. Theano will allocate a NumPy RandomStream object (a random number generator) for each such variable, and draw from it as necessary. We will call this sort of sequence of random numbers a random stream. Random streams are at their core shared variables, so the earlier observations about shared variables hold here as well. Theano’s random objects are defined and implemented in RandomStreams and, at a lower level, in RandomStreamsBase.
Here’s a brief example. The setup code is:
from theano.tensor.shared_randomstreams import RandomStreams
from theano import function
srng = RandomStreams(seed=234)
rv_u = srng.uniform((2,2))
rv_n = srng.normal((2,2))
f = function([], rv_u)
g = function([], rv_n, no_default_updates=True) #Not updating rv_n.rng
nearly_zeros = function([], rv_u + rv_u - 2 * rv_u)
Here, ‘rv_u’ represents a random stream of 2x2 matrices of draws from a uniform distribution. Likewise, ‘rv_n’ represents a random stream of 2x2 matrices of draws from a normal distribution. The distributions that are implemented are defined in RandomStreams and, at a lower level, in raw_random. They only work on the CPU. See Other Implementations for a GPU version.
Now let’s use these objects. If we call f(), we get random uniform numbers. The internal state of the random number generator is automatically updated, so we get different random numbers every time.
>>> f_val0 = f()
>>> f_val1 = f() #different numbers from f_val0
When we add the extra argument no_default_updates=True to function (as in g), then the random number generator state is not affected by calling the returned function. So, for example, calling g multiple times will return the same numbers.
>>> g_val0 = g() # different numbers from f_val0 and f_val1
>>> g_val1 = g() # same numbers as g_val0!
An important remark is that a random variable is drawn at most once during any single function execution. So the nearly_zeros function is guaranteed to return approximately 0 (except for rounding error) even though the rv_u random variable appears three times in the output expression.
>>> nearly_zeros = function([], rv_u + rv_u - 2 * rv_u)
Random variables can be seeded individually or collectively.
You can seed just one random variable by seeding or assigning to the .rng attribute, using .rng.set_value().
>>> rng_val = rv_u.rng.get_value(borrow=True) # Get the rng for rv_u
>>> rng_val.seed(89234) # seeds the generator
>>> rv_u.rng.set_value(rng_val, borrow=True) # Assign back seeded rng
You can also seed all of the random variables allocated by a RandomStreams object by calling that object’s seed method. This seed will be used to seed a temporary random number generator, which will in turn generate seeds for each of the random variables.
>>> srng.seed(902340) # seeds rv_u and rv_n with different seeds each
As usual for shared variables, the random number generators used for random variables are common between functions. So our nearly_zeros function will update the state of the generators used in function f above.
For example:
>>> state_after_v0 = rv_u.rng.get_value().get_state()
>>> nearly_zeros() # this affects rv_u's generator
array([[ 0., 0.],
[ 0., 0.]])
>>> v1 = f()
>>> rng = rv_u.rng.get_value(borrow=True)
>>> rng.set_state(state_after_v0)
>>> rv_u.rng.set_value(rng, borrow=True)
>>> v2 = f() # v2 != v1
>>> v3 = f() # v3 == v1
In some use cases, a user might want to transfer the “state” of all random number generators associated with a given theano graph (e.g. g1, with compiled function f1 below) to a second graph (e.g. g2, with function f2). This might arise, for example, if you are trying to initialize the state of a model from the parameters of a pickled version of a previous model. For theano.tensor.shared_randomstreams.RandomStreams and theano.sandbox.rng_mrg.MRG_RandomStreams, this can be achieved by copying elements of the state_updates parameter.
Each time a random variable is drawn from a RandomStreams object, a tuple is added to the state_updates list. The first element is a shared variable, which represents the state of the random number generator associated with this particular variable, while the second represents the theano graph corresponding to the random number generation process (i.e. RandomFunction{uniform}.0).
An example of how “random states” can be transferred from one theano function to another is shown below.
>>> from __future__ import print_function
>>> import theano
>>> import numpy
>>> import theano.tensor as T
>>> from theano.sandbox.rng_mrg import MRG_RandomStreams
>>> from theano.tensor.shared_randomstreams import RandomStreams
>>> class Graph():
...     def __init__(self, seed=123):
...         self.rng = RandomStreams(seed)
...         self.y = self.rng.uniform(size=(1,))
>>> g1 = Graph(seed=123)
>>> f1 = theano.function([], g1.y)
>>> g2 = Graph(seed=987)
>>> f2 = theano.function([], g2.y)
>>> # By default, the two functions are out of sync.
>>> f1()
array([ 0.72803009])
>>> f2()
array([ 0.55056769])
>>> def copy_random_state(g1, g2):
...     if isinstance(g1.rng, MRG_RandomStreams):
...         g2.rng.rstate = g1.rng.rstate
...     for (su1, su2) in zip(g1.rng.state_updates, g2.rng.state_updates):
...         su2[0].set_value(su1[0].get_value())
>>> # We now copy the state of the theano random number generators.
>>> copy_random_state(g1, g2)
>>> f1()
array([ 0.59044123])
>>> f2()
array([ 0.59044123])
Other distributions are also implemented. There are two other implementations, based on MRG31k3p and CURAND. RandomStreams only works on the CPU, MRG31k3p works on both the CPU and GPU, and CURAND only works on the GPU.
Note
To use the MRG version easily, you can just change the import to:
from theano.sandbox.rng_mrg import MRG_RandomStreams as RandomStreams
A Real Example: Logistic Regression¶
The preceding elements are featured in this more realistic example. It will be used repeatedly.
import numpy
import theano
import theano.tensor as T
rng = numpy.random
N = 400 # training sample size
feats = 784 # number of input variables
# generate a dataset: D = (input_values, target_class)
D = (rng.randn(N, feats), rng.randint(size=N, low=0, high=2))
training_steps = 10000
# Declare Theano symbolic variables
x = T.dmatrix("x")
y = T.dvector("y")
# initialize the weight vector w randomly
#
# this and the following bias variable b
# are shared so they keep their values
# between training iterations (updates)
w = theano.shared(rng.randn(feats), name="w")
# initialize the bias term
b = theano.shared(0., name="b")
print("Initial model:")
print(w.get_value())
print(b.get_value())
# Construct Theano expression graph
p_1 = 1 / (1 + T.exp(-T.dot(x, w) - b))            # Probability that target = 1
prediction = p_1 > 0.5                             # The prediction thresholded
xent = -y * T.log(p_1) - (1 - y) * T.log(1 - p_1)  # Cross-entropy loss function
cost = xent.mean() + 0.01 * (w ** 2).sum()         # The cost to minimize
gw, gb = T.grad(cost, [w, b]) # Compute the gradient of the cost
# w.r.t weight vector w and
# bias term b
# (we shall return to this in a
# following section of this tutorial)
# Compile
train = theano.function(
          inputs=[x, y],
          outputs=[prediction, xent],
          updates=((w, w - 0.1 * gw), (b, b - 0.1 * gb)))
predict = theano.function(inputs=[x], outputs=prediction)
# Train
for i in range(training_steps):
    pred, err = train(D[0], D[1])
print("Final model:")
print(w.get_value())
print(b.get_value())
print("target values for D:")
print(D[1])
print("prediction on D:")
print(predict(D[0]))
Derivatives in Theano¶
Computing Gradients¶
Now let’s use Theano for a slightly more sophisticated task: create a function which computes the derivative of some expression y with respect to its parameter x. To do this we will use the macro T.grad. For instance, we can compute the gradient of x ** 2 with respect to x. Note that: d(x ** 2) / dx = 2 * x.
Here is the code to compute this gradient:
>>> import numpy
>>> import theano
>>> import theano.tensor as T
>>> from theano import pp
>>> x = T.dscalar('x')
>>> y = x ** 2
>>> gy = T.grad(y, x)
>>> pp(gy) # print out the gradient prior to optimization
'((fill((x ** TensorConstant{2}), TensorConstant{1.0}) * TensorConstant{2}) * (x ** (TensorConstant{2} - TensorConstant{1})))'
>>> f = theano.function([x], gy)
>>> f(4)
array(8.0)
>>> numpy.allclose(f(94.2), 188.4)
True
In this example, we can see from pp(gy)
that we are computing
the correct symbolic gradient.
fill((x ** 2), 1.0)
means to make a matrix of the same shape as
x ** 2 and fill it with 1.0.
Note
The optimizer simplifies the symbolic gradient expression. You can see this by digging inside the internal properties of the compiled function.
pp(f.maker.fgraph.outputs[0])
'(2.0 * x)'
After optimization there is only one Apply node left in the graph, which doubles the input.
We can also compute the gradient of complex expressions such as the logistic function defined above. It turns out that the derivative of the logistic is: ds(x) / dx = s(x) * (1 - s(x)).
>>> x = T.dmatrix('x')
>>> s = T.sum(1 / (1 + T.exp(-x)))
>>> gs = T.grad(s, x)
>>> dlogistic = theano.function([x], gs)
>>> dlogistic([[0, 1], [1, 2]])
array([[ 0.25 , 0.19661193],
[ 0.19661193, 0.10499359]])
In general, for any scalar expression s, T.grad(s, w) provides the Theano expression for computing the gradient ds/dw. In this way Theano can be used for doing efficient symbolic differentiation (as the expression returned by T.grad will be optimized during compilation), even for functions with many inputs. (See automatic differentiation for a description of symbolic differentiation.)
Note
The second argument of T.grad can be a list, in which case the output is also a list. The order in both lists is important: element i of the output list is the gradient of the first argument of T.grad with respect to the i-th element of the list given as second argument. The first argument of T.grad has to be a scalar (a tensor of size 1). For more information on the semantics of the arguments of T.grad and details about the implementation, see this section of the library.
Additional information on the inner workings of differentiation may also be found in the more advanced tutorial Extending Theano.
Computing the Jacobian¶
In Theano’s parlance, the term Jacobian designates the tensor comprising the first partial derivatives of the output of a function with respect to its inputs. (This is a generalization of the so-called Jacobian matrix in mathematics.) Theano implements the theano.gradient.jacobian() macro that does all that is needed to compute the Jacobian. The following text explains how to do it manually.
In order to manually compute the Jacobian of some function y with
respect to some parameter x we need to use scan
. What we
do is to loop over the entries in y and compute the gradient of
y[i] with respect to x.
Note
scan is a generic op in Theano that allows writing all kinds of recurrent equations in a symbolic manner. While creating symbolic loops (and optimizing them for performance) is a hard task, efforts are being made to improve the performance of scan. We shall return to scan later in this tutorial.
>>> import theano
>>> import theano.tensor as T
>>> x = T.dvector('x')
>>> y = x ** 2
>>> J, updates = theano.scan(lambda i, y,x : T.grad(y[i], x), sequences=T.arange(y.shape[0]), non_sequences=[y,x])
>>> f = theano.function([x], J, updates=updates)
>>> f([4, 4])
array([[ 8., 0.],
[ 0., 8.]])
What we do in this code is to generate a sequence of ints from 0 to
y.shape[0]
using T.arange
. Then we loop through this sequence, and
at each step, we compute the gradient of element y[i] with respect to
x. scan
automatically concatenates all these rows, generating a
matrix which corresponds to the Jacobian.
Note
There are some pitfalls to be aware of regarding T.grad. One of them is that you cannot rewrite the above expression of the Jacobian as theano.scan(lambda y_i, x: T.grad(y_i, x), sequences=y, non_sequences=x), even though from the documentation of scan this seems possible. The reason is that y_i will not be a function of x anymore, while y[i] still is.
Computing the Hessian¶
In Theano, the term Hessian has the usual mathematical meaning: it is the matrix comprising the second-order partial derivatives of a function with scalar output and vector input. Theano implements the theano.gradient.hessian() macro that does all that is needed to compute the Hessian. The following text explains how to do it manually.
You can compute the Hessian manually similarly to the Jacobian. The only
difference is that now, instead of computing the Jacobian of some expression
y, we compute the Jacobian of T.grad(cost,x)
, where cost is some
scalar.
>>> x = T.dvector('x')
>>> y = x ** 2
>>> cost = y.sum()
>>> gy = T.grad(cost, x)
>>> H, updates = theano.scan(lambda i, gy,x : T.grad(gy[i], x), sequences=T.arange(gy.shape[0]), non_sequences=[gy, x])
>>> f = theano.function([x], H, updates=updates)
>>> f([4, 4])
array([[ 2., 0.],
[ 0., 2.]])
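The same result can be obtained with the theano.gradient.hessian() macro mentioned above. A minimal sketch:

```python
import numpy
import theano
import theano.tensor as T
from theano.gradient import hessian

x = T.dvector('x')
cost = (x ** 2).sum()        # a scalar cost, as hessian() requires
H = hessian(cost, x)         # symbolic Hessian; the scan is built internally
f = theano.function([x], H)

print(f([4, 4]))             # 2 * identity, matching the manual version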
Jacobian times a Vector¶
Sometimes we can express the algorithm in terms of Jacobians times vectors, or vectors times Jacobians. Compared to evaluating the Jacobian and then doing the product, there are methods that compute the desired results while avoiding actual evaluation of the Jacobian. This can bring about significant performance gains. A description of one such algorithm can be found here:
 Barak A. Pearlmutter, “Fast Exact Multiplication by the Hessian”, Neural Computation, 1994
While in principle we would want Theano to identify these patterns automatically for us, in practice, implementing such optimizations in a generic manner is extremely difficult. Therefore, we provide special functions dedicated to these tasks.
The R operator is built to evaluate the product between a Jacobian and a vector, namely (dy/dx) * v. The formulation can be extended even for x being a matrix, or a tensor in general, in which case the Jacobian also becomes a tensor and the product becomes some kind of tensor product. Because in practice we end up needing to compute such expressions in terms of weight matrices, Theano supports this more generic form of the operation. In order to evaluate the R-operation of expression y, with respect to x, multiplying the Jacobian with v, you need to do something similar to this:
>>> W = T.dmatrix('W')
>>> V = T.dmatrix('V')
>>> x = T.dvector('x')
>>> y = T.dot(x, W)
>>> JV = T.Rop(y, W, V)
>>> f = theano.function([W, V, x], JV)
>>> f([[1, 1], [1, 1]], [[2, 2], [2, 2]], [0,1])
array([ 2., 2.])
See the List of Ops that implement Rop.
By analogy with the R-operator, the L-operator computes a row vector times the Jacobian. The mathematical formula would be v * (dy/dx). The L-operator is also supported for generic tensors (not only for vectors). Similarly, it can be implemented as follows:
>>> W = T.dmatrix('W')
>>> v = T.dvector('v')
>>> x = T.dvector('x')
>>> y = T.dot(x, W)
>>> VJ = T.Lop(y, W, v)
>>> f = theano.function([v,x], VJ)
>>> f([2, 2], [0, 1])
array([[ 0., 0.],
[ 2., 2.]])
Note
v, the point of evaluation, differs between the L-operator and the R-operator. For the L-operator, the point of evaluation needs to have the same shape as the output, whereas for the R-operator this point should have the same shape as the input parameter. Furthermore, the results of these two operations differ. The result of the L-operator is of the same shape as the input parameter, while the result of the R-operator has a shape similar to that of the output.
Hessian times a Vector¶
If you need to compute the Hessian times a vector, you can make use of the abovedefined operators to do it more efficiently than actually computing the exact Hessian and then performing the product. Due to the symmetry of the Hessian matrix, you have two options that will give you the same result, though these options might exhibit differing performances. Hence, we suggest profiling the methods before using either one of the two:
>>> x = T.dvector('x')
>>> v = T.dvector('v')
>>> y = T.sum(x ** 2)
>>> gy = T.grad(y, x)
>>> vH = T.grad(T.sum(gy * v), x)
>>> f = theano.function([x, v], vH)
>>> f([4, 4], [2, 2])
array([ 4., 4.])
or, making use of the Roperator:
>>> x = T.dvector('x')
>>> v = T.dvector('v')
>>> y = T.sum(x ** 2)
>>> gy = T.grad(y, x)
>>> Hv = T.Rop(gy, x, v)
>>> f = theano.function([x, v], Hv)
>>> f([4, 4], [2, 2])
array([ 4., 4.])
Final Pointers¶
 The grad function works symbolically: it receives and returns Theano variables. grad can be compared to a macro, since it can be applied repeatedly. Only scalar costs can be directly handled by grad; arrays are handled through repeated applications.
 Built-in functions allow computing vector times Jacobian and vector times Hessian products efficiently.
 Work is in progress on the optimizations required to compute efficiently the full Jacobian and the Hessian matrix, as well as the Jacobian times vector.
Conditions¶
IfElse vs Switch¶
 Both ops build a condition over symbolic variables.
 IfElse takes a boolean condition and two variables as inputs.
 Switch takes a tensor as condition and two variables as inputs. switch is an elementwise operation and is thus more general than ifelse.
 Whereas switch evaluates both output variables, ifelse is lazy and only evaluates one variable with respect to the condition.
Example
from theano import tensor as T
from theano.ifelse import ifelse
import theano, time, numpy
a,b = T.scalars('a', 'b')
x,y = T.matrices('x', 'y')
z_switch = T.switch(T.lt(a, b), T.mean(x), T.mean(y))
z_lazy = ifelse(T.lt(a, b), T.mean(x), T.mean(y))
f_switch = theano.function([a, b, x, y], z_switch,
                           mode=theano.Mode(linker='vm'))
f_lazyifelse = theano.function([a, b, x, y], z_lazy,
                               mode=theano.Mode(linker='vm'))
val1 = 0.
val2 = 1.
big_mat1 = numpy.ones((10000, 1000))
big_mat2 = numpy.ones((10000, 1000))
n_times = 10
tic = time.clock()
for i in range(n_times):
    f_switch(val1, val2, big_mat1, big_mat2)
print('time spent evaluating both values %f sec' % (time.clock() - tic))
tic = time.clock()
for i in range(n_times):
    f_lazyifelse(val1, val2, big_mat1, big_mat2)
print('time spent evaluating one value %f sec' % (time.clock() - tic))
In this example, the IfElse
op spends less time (about half as much) than Switch
since it computes only one variable out of the two.
$ python ifelse_switch.py
time spent evaluating both values 0.6700 sec
time spent evaluating one value 0.3500 sec
Unless linker='vm' or linker='cvm' is used, ifelse will compute both variables and take the same computation time as switch. Although the linker is not currently set by default to cvm, it will be in the near future.
There is no automatic optimization replacing a switch
with a
broadcasted scalar to an ifelse
, as this is not always faster. See
this ticket.
Note
If you use test values, then all branches of the IfElse will be computed. This is normal, as using test_value means everything will be computed when we build the graph, due to Python’s greedy evaluation and the semantics of test values. As we build both branches, they will be executed for test values. This doesn’t cause any changes during the execution of the compiled Theano function.
Loop¶
Scan¶
 A general form of recurrence, which can be used for looping.
 Reduction and map (loop over the leading dimensions) are special cases of scan.
 You scan a function along some input sequence, producing an output at each timestep.
 The function can see the previous K timesteps of your function.
 sum() could be computed by scanning the z + x(i) function over a list, given an initial state of z=0.
 Often a for loop can be expressed as a scan() operation, and scan is the closest that Theano comes to looping.
 Advantages of using scan over for loops:
 The number of iterations can be part of the symbolic graph.
 It minimizes GPU transfers (if a GPU is involved).
 It computes gradients through sequential steps.
 It is slightly faster than using a for loop in Python with a compiled Theano function.
 It can lower the overall memory usage by detecting the actual amount of memory needed.
The full documentation can be found in the library: Scan.
Scan Example: Computing tanh(x(t).dot(W) + b) elementwise
import theano
import theano.tensor as T
import numpy as np
# defining the tensor variables
X = T.matrix("X")
W = T.matrix("W")
b_sym = T.vector("b_sym")
results, updates = theano.scan(lambda v: T.tanh(T.dot(v, W) + b_sym), sequences=X)
compute_elementwise = theano.function(inputs=[X, W, b_sym], outputs=results)
# test values
x = np.eye(2, dtype=theano.config.floatX)
w = np.ones((2, 2), dtype=theano.config.floatX)
b = np.ones((2), dtype=theano.config.floatX)
b[1] = 2
print(compute_elementwise(x, w, b))
# comparison with numpy
print(np.tanh(x.dot(w) + b))
[[ 0.96402758 0.99505475]
[ 0.96402758 0.99505475]]
[[ 0.96402758 0.99505475]
[ 0.96402758 0.99505475]]
Scan Example: Computing the sequence x(t) = tanh(x(t - 1).dot(W) + y(t).dot(U) + p(T - t).dot(V))
import theano
import theano.tensor as T
import numpy as np
# define tensor variables
X = T.vector("X")
W = T.matrix("W")
b_sym = T.vector("b_sym")
U = T.matrix("U")
Y = T.matrix("Y")
V = T.matrix("V")
P = T.matrix("P")
results, updates = theano.scan(lambda y, p, x_tm1: T.tanh(T.dot(x_tm1, W) + T.dot(y, U) + T.dot(p, V)),
                               sequences=[Y, P[::-1]], outputs_info=[X])
compute_seq = theano.function(inputs=[X, W, Y, U, P, V], outputs=results)
# test values
x = np.zeros((2), dtype=theano.config.floatX)
x[1] = 1
w = np.ones((2, 2), dtype=theano.config.floatX)
y = np.ones((5, 2), dtype=theano.config.floatX)
y[0, :] = -3
u = np.ones((2, 2), dtype=theano.config.floatX)
p = np.ones((5, 2), dtype=theano.config.floatX)
p[0, :] = 3
v = np.ones((2, 2), dtype=theano.config.floatX)
print(compute_seq(x, w, y, u, p, v))
# comparison with numpy
x_res = np.zeros((5, 2), dtype=theano.config.floatX)
x_res[0] = np.tanh(x.dot(w) + y[0].dot(u) + p[4].dot(v))
for i in range(1, 5):
    x_res[i] = np.tanh(x_res[i - 1].dot(w) + y[i].dot(u) + p[4 - i].dot(v))
print(x_res)
[[-0.99505475 -0.99505475]
[ 0.96471973 0.96471973]
[ 0.99998585 0.99998585]
[ 0.99998771 0.99998771]
[ 1. 1. ]]
[[-0.99505475 -0.99505475]
[ 0.96471973 0.96471973]
[ 0.99998585 0.99998585]
[ 0.99998771 0.99998771]
[ 1. 1. ]]
Scan Example: Computing norms of lines of X
import theano
import theano.tensor as T
import numpy as np
# define tensor variable
X = T.matrix("X")
results, updates = theano.scan(lambda x_i: T.sqrt((x_i ** 2).sum()), sequences=[X])
compute_norm_lines = theano.function(inputs=[X], outputs=results)
# test value
x = np.diag(np.arange(1, 6, dtype=theano.config.floatX), 1)
print(compute_norm_lines(x))
# comparison with numpy
print(np.sqrt((x ** 2).sum(1)))
[ 1. 2. 3. 4. 5. 0.]
[ 1. 2. 3. 4. 5. 0.]
Scan Example: Computing norms of columns of X
import theano
import theano.tensor as T
import numpy as np
# define tensor variable
X = T.matrix("X")
results, updates = theano.scan(lambda x_i: T.sqrt((x_i ** 2).sum()), sequences=[X.T])
compute_norm_cols = theano.function(inputs=[X], outputs=results)
# test value
x = np.diag(np.arange(1, 6, dtype=theano.config.floatX), 1)
print(compute_norm_cols(x))
# comparison with numpy
print(np.sqrt((x ** 2).sum(0)))
[ 0. 1. 2. 3. 4. 5.]
[ 0. 1. 2. 3. 4. 5.]
Scan Example: Computing trace of X
import theano
import theano.tensor as T
import numpy as np
floatX = "float32"
# define tensor variable
X = T.matrix("X")
results, updates = theano.scan(lambda i, j, t_f: T.cast(X[i, j] + t_f, floatX),
                               sequences=[T.arange(X.shape[0]), T.arange(X.shape[1])],
                               outputs_info=np.asarray(0., dtype=floatX))
result = results[-1]
compute_trace = theano.function(inputs=[X], outputs=result)
# test value
x = np.eye(5, dtype=theano.config.floatX)
x[0] = np.arange(5, dtype=theano.config.floatX)
print(compute_trace(x))
# comparison with numpy
print(np.diagonal(x).sum())
4.0
4.0
Scan Example: Computing the sequence x(t) = x(t - 2).dot(U) + x(t - 1).dot(V) + tanh(x(t - 1).dot(W) + b)
import theano
import theano.tensor as T
import numpy as np
# define tensor variables
X = T.matrix("X")
W = T.matrix("W")
b_sym = T.vector("b_sym")
U = T.matrix("U")
V = T.matrix("V")
n_sym = T.iscalar("n_sym")
results, updates = theano.scan(lambda x_tm2, x_tm1: T.dot(x_tm2, U) + T.dot(x_tm1, V) + T.tanh(T.dot(x_tm1, W) + b_sym),
                               n_steps=n_sym, outputs_info=[dict(initial=X, taps=[-2, -1])])
compute_seq2 = theano.function(inputs=[X, U, V, W, b_sym, n_sym], outputs=results)
# test values
x = np.zeros((2, 2), dtype=theano.config.floatX) # the initial value must be able to return x[-2]
x[1, 1] = 1
w = 0.5 * np.ones((2, 2), dtype=theano.config.floatX)
u = 0.5 * (np.ones((2, 2), dtype=theano.config.floatX)  np.eye(2, dtype=theano.config.floatX))
v = 0.5 * np.ones((2, 2), dtype=theano.config.floatX)
n = 10
b = np.ones((2), dtype=theano.config.floatX)
print(compute_seq2(x, u, v, w, b, n))
# comparison with numpy
x_res = np.zeros((10, 2))
x_res[0] = x[0].dot(u) + x[1].dot(v) + np.tanh(x[1].dot(w) + b)
x_res[1] = x[1].dot(u) + x_res[0].dot(v) + np.tanh(x_res[0].dot(w) + b)
x_res[2] = x_res[0].dot(u) + x_res[1].dot(v) + np.tanh(x_res[1].dot(w) + b)
for i in range(2, 10):
    x_res[i] = (x_res[i - 2].dot(u) + x_res[i - 1].dot(v) +
                np.tanh(x_res[i - 1].dot(w) + b))
print(x_res)
[[ 1.40514825 1.40514825]
[ 2.88898899 2.38898899]
[ 4.34018291 4.34018291]
[ 6.53463142 6.78463142]
[ 9.82972243 9.82972243]
[ 14.22203814 14.09703814]
[ 20.07439936 20.07439936]
[ 28.12291843 28.18541843]
[ 39.1913681 39.1913681 ]
[ 54.28407732 54.25282732]]
[[ 1.40514825 1.40514825]
[ 2.88898899 2.38898899]
[ 4.34018291 4.34018291]
[ 6.53463142 6.78463142]
[ 9.82972243 9.82972243]
[ 14.22203814 14.09703814]
[ 20.07439936 20.07439936]
[ 28.12291843 28.18541843]
[ 39.1913681 39.1913681 ]
[ 54.28407732 54.25282732]]
Scan Example: Computing the Jacobian of y = tanh(v.dot(A)) wrt v
import theano
import theano.tensor as T
import numpy as np
# define tensor variables
v = T.vector()
A = T.matrix()
y = T.tanh(T.dot(v, A))
results, updates = theano.scan(lambda i: T.grad(y[i], v), sequences=[T.arange(y.shape[0])])
compute_jac_t = theano.function([A, v], results, allow_input_downcast=True) # shape (d_out, d_in)
# test values
x = np.eye(5, dtype=theano.config.floatX)[0]
w = np.eye(5, 3, dtype=theano.config.floatX)
w[2] = np.ones((3), dtype=theano.config.floatX)
print(compute_jac_t(w, x))
# compare with numpy
print(((1 - np.tanh(x.dot(w)) ** 2) * w).T)
[[ 0.41997434 0. 0.41997434 0. 0. ]
[ 0. 1. 1. 0. 0. ]
[ 0. 0. 1. 0. 0. ]]
[[ 0.41997434 0. 0.41997434 0. 0. ]
[ 0. 1. 1. 0. 0. ]
[ 0. 0. 1. 0. 0. ]]
Note that we need to iterate over the indices of y and not over the elements of y. The reason is that scan creates a placeholder variable for its internal function, and this placeholder variable does not have the same dependencies as the variables that will replace it.
Scan Example: Accumulating the number of loop iterations during a scan
import theano
import theano.tensor as T
import numpy as np
# define shared variables
k = theano.shared(0)
n_sym = T.iscalar("n_sym")
results, updates = theano.scan(lambda: {k: (k + 1)}, n_steps=n_sym)
accumulator = theano.function([n_sym], [], updates=updates, allow_input_downcast=True)
k.get_value()
accumulator(5)
k.get_value()
Scan Example: Computing tanh(v.dot(W) + b) * d where d is binomial
import theano
import theano.tensor as T
import numpy as np
# define tensor variables
X = T.matrix("X")
W = T.matrix("W")
b_sym = T.vector("b_sym")
# define shared random stream
trng = T.shared_randomstreams.RandomStreams(1234)
d = trng.binomial(size=W[1].shape)
results, updates = theano.scan(lambda v: T.tanh(T.dot(v, W) + b_sym) * d, sequences=X)
compute_with_bnoise = theano.function(inputs=[X, W, b_sym], outputs=results,
                                      updates=updates, allow_input_downcast=True)
x = np.eye(10, 2, dtype=theano.config.floatX)
w = np.ones((2, 2), dtype=theano.config.floatX)
b = np.ones((2), dtype=theano.config.floatX)
print(compute_with_bnoise(x, w, b))
[[ 0.96402758 0. ]
[ 0. 0.96402758]
[ 0. 0. ]
[ 0.76159416 0.76159416]
[ 0.76159416 0. ]
[ 0. 0.76159416]
[ 0. 0.76159416]
[ 0. 0.76159416]
[ 0. 0. ]
[ 0.76159416 0.76159416]]
Note that if you want to use a random variable d that will not be updated through scan loops, you should pass this variable as a non_sequences argument.
Scan Example: Computing pow(A, k)
import theano
import theano.tensor as T
theano.config.warn.subtensor_merge_bug = False
k = T.iscalar("k")
A = T.vector("A")
def inner_fct(prior_result, B):
    return prior_result * B
# Symbolic description of the result
result, updates = theano.scan(fn=inner_fct,
                              outputs_info=T.ones_like(A),
                              non_sequences=A, n_steps=k)
# Scan has provided us with A ** 1 through A ** k. Keep only the last
# value. Scan notices this and does not waste memory saving them.
final_result = result[-1]
power = theano.function(inputs=[A, k], outputs=final_result,
updates=updates)
print(power(range(10), 2))
[ 0. 1. 4. 9. 16. 25. 36. 49. 64. 81.]
Scan Example: Calculating a Polynomial
import numpy
import theano
import theano.tensor as T
theano.config.warn.subtensor_merge_bug = False
coefficients = theano.tensor.vector("coefficients")
x = T.scalar("x")
max_coefficients_supported = 10000
# Generate the components of the polynomial
full_range = theano.tensor.arange(max_coefficients_supported)
components, updates = theano.scan(fn=lambda coeff, power, free_var:
                                  coeff * (free_var ** power),
                                  outputs_info=None,
                                  sequences=[coefficients, full_range],
                                  non_sequences=x)
polynomial = components.sum()
calculate_polynomial = theano.function(inputs=[coefficients, x],
outputs=polynomial)
test_coeff = numpy.asarray([1, 0, 2], dtype=numpy.float32)
print(calculate_polynomial(test_coeff, 3))
19.0
How Shape Information is Handled by Theano¶
It is not possible to strictly enforce the shape of a Theano variable when building a graph, since the particular value provided at runtime for a parameter of a Theano function may determine the shapes of the Theano variables in its graph.
Currently, information regarding shape is used in two ways in Theano:
To generate faster C code for the 2d convolution on the CPU and the GPU, when the exact output shape is known in advance.
To remove computations in the graph when we only want to know the shape, but not the actual value of a variable. This is done with the Op.infer_shape method.
Example:
>>> import theano
>>> x = theano.tensor.matrix('x')
>>> f = theano.function([x], (x ** 2).shape)
>>> theano.printing.debugprint(f)
MakeVector{dtype='int64'} [id A] ''   2
 |Shape_i{0} [id B] ''   1
 | |x [id C]
 |Shape_i{1} [id D] ''   0
 | |x [id C]
The output of this compiled function does not contain any multiplication or power. Theano has removed them to compute directly the shape of the output.
Shape Inference Problem¶
Theano propagates information about shape in the graph. Sometimes this can lead to errors. Consider this example:
>>> import numpy
>>> import theano
>>> x = theano.tensor.matrix('x')
>>> y = theano.tensor.matrix('y')
>>> z = theano.tensor.join(0, x, y)
>>> xv = numpy.random.rand(5, 4)
>>> yv = numpy.random.rand(3, 3)
>>> f = theano.function([x, y], z.shape)
>>> theano.printing.debugprint(f)
MakeVector{dtype='int64'} [id A] ''   4
 |Elemwise{Add}[(0, 0)] [id B] ''   3
 | |Shape_i{0} [id C] ''   1
 | | |x [id D]
 | |Shape_i{0} [id E] ''   2
 | | |y [id F]
 |Shape_i{1} [id G] ''   0
 | |x [id D]
>>> f(xv, yv) # DOES NOT RAISE AN ERROR AS IT SHOULD.
array([8, 4])
>>> f = theano.function([x, y], z)  # Do not take the shape.
>>> theano.printing.debugprint(f)
Join [id A] ''   0
 |TensorConstant{0} [id B]
 |x [id C]
 |y [id D]
>>> f(xv, yv)
Traceback (most recent call last):
...
ValueError: ...
As you can see, when asking only for the shape of some computation (join
in the
example), an inferred shape is computed directly, without executing
the computation itself (there is no join
in the first output or debugprint).
This makes the computation of the shape faster, but it can also hide errors. In
this example, the computation of the shape of the output of join
is done only
based on the first input Theano variable, which leads to an error.
This might happen with other ops such as elemwise
and dot
, for example.
Indeed, to perform some optimizations (for speed or stability, for instance),
Theano assumes that the computation is correct and consistent
in the first place, as it does here.
You can detect those problems by running the code without this
optimization, using the Theano flag
optimizer_excluding=local_shape_to_shape_i
. You can also obtain the
same effect by running in the modes FAST_COMPILE
(it will not apply this
optimization, nor most other optimizations) or DebugMode
(it will test
before and after all optimizations (much slower)).
Specifying Exact Shape¶
Currently, specifying a shape is not as easy and flexible as we wish, and we plan some upgrades. Here is the current state of what can be done:
 You can pass the shape info directly to the ConvOp created when calling conv2d. You simply set the parameters image_shape and filter_shape inside the call. They must be tuples of 4 elements. For example:
theano.tensor.nnet.conv2d(..., image_shape=(7, 3, 5, 5), filter_shape=(2, 3, 4, 4))
 You can use the SpecifyShape op to add shape information anywhere in the graph. This allows Theano to perform some additional optimizations. In the following example, it makes it possible to precompute the Theano function to a constant.
>>> import theano
>>> x = theano.tensor.matrix()
>>> x_specify_shape = theano.tensor.specify_shape(x, (2, 2))
>>> f = theano.function([x], (x_specify_shape ** 2).shape)
>>> theano.printing.debugprint(f)
DeepCopyOp [id A] ''   0
 |TensorConstant{(2,) of 2} [id B]
Future Plans¶
The parameter “constant shape” will be added to theano.shared(). This is probably the most frequent use case with shared variables. It will make the code simpler and will make it possible to check that the shape does not change when updating the shared variable.
Advanced¶
Sparse¶
In general, sparse matrices provide the same functionality as regular matrices. The difference lies in the way the elements of sparse matrices are represented and stored in memory: only the nonzero elements are stored. This has some potential advantages: first, this may obviously lead to reduced memory usage and, second, clever storage methods may lead to reduced computation time through the use of sparse-specific algorithms. We usually refer to generically stored matrices as dense matrices.
Theano’s sparse package provides efficient algorithms, but its use is not recommended in all cases or for all matrices. As an obvious example, consider the case where the sparsity proportion is very low. The sparsity proportion refers to the ratio of the number of zero elements to the number of all elements in a matrix. A low sparsity proportion may result in the use of more space in memory since not only the actual data is stored, but also the position of nearly every element of the matrix. This would also require more computation time whereas a dense matrix representation along with regular optimized algorithms might do a better job. Other examples may be found at the nexus of the specific purpose and structure of the matrices. More documentation may be found in the SciPy Sparse Reference.
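The trade-off described above can be illustrated with plain SciPy, on which Theano’s sparse types are based (a small sketch; the 100x100 matrix and its 99% sparsity proportion are arbitrary choices for illustration):

```python
import numpy as np
import scipy.sparse as sp

# A 100x100 matrix with only 100 nonzero elements.
dense = np.zeros((100, 100), dtype='float64')
dense[::10, ::10] = 1.0

# Sparsity proportion: ratio of zero elements to all elements.
sparsity = 1.0 - np.count_nonzero(dense) / dense.size
print(sparsity)  # 0.99

# The csr form stores data + indices + indptr instead of every element,
# so at this sparsity it uses far less memory than the dense form.
s = sp.csr_matrix(dense)
sparse_bytes = s.data.nbytes + s.indices.nbytes + s.indptr.nbytes
print(sparse_bytes < dense.nbytes)  # True
```

At a low sparsity proportion the comparison flips: the index arrays cost more than the zeros they avoid storing.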
Since sparse matrices are not stored in contiguous arrays, there are several ways to represent them in memory. This is usually designated by the so-called format of the matrix. Since Theano’s sparse matrix package is based on the SciPy sparse package, complete information about sparse matrices can be found in the SciPy documentation. Like SciPy, Theano does not implement sparse formats for arrays with a number of dimensions different from two.
So far, Theano implements two formats of sparse matrix: csc and csr. These are almost identical, except that csc is based on the columns of the matrix and csr is based on its rows. They both have the same purpose: to allow the use of efficient algorithms for linear algebra operations. A disadvantage is that they fail to give an efficient way to modify the sparsity structure of the underlying matrix, i.e. to add new elements. This means that, if you are planning to add new elements to a sparse matrix very often in your computational graph, a tensor variable may be a better choice.
More documentation may be found in the Sparse Library Reference.
Before going further, here are the import statements that are assumed for the rest of the tutorial:
>>> import theano
>>> import numpy as np
>>> import scipy.sparse as sp
>>> from theano import sparse
Compressed Sparse Format¶
Theano supports two compressed sparse formats: csc and csr, respectively based on columns and rows. They both have the same attributes: data, indices, indptr and shape.
 The data attribute is a one-dimensional ndarray which contains all the nonzero elements of the sparse matrix.
 The indices and indptr attributes are used to store the position of the data in the sparse matrix.
 The shape attribute is exactly the same as the shape attribute of a dense (i.e. generic) matrix. It can be explicitly specified at the creation of a sparse matrix if it cannot be inferred from the first three attributes.
In the end, the format does not affect the length of the data and indices attributes. They are both completely determined by the number of elements you want to store. The only thing that changes with the format is indptr. In csc format, the matrix is compressed along columns, so a lower number of columns will result in less memory use. On the other hand, with the csr format, the matrix is compressed along the rows, so for a matrix with a lower number of rows, csr format is a better choice. So here is the rule:
Note
If shape[0] > shape[1], use csc
format. Otherwise, use csr
.
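The rule can be checked directly with SciPy, which Theano’s sparse types wrap (a small sketch; the 4x2 example matrix is an arbitrary choice):

```python
import numpy as np
import scipy.sparse as sp

# A "tall" 4x2 matrix (shape[0] > shape[1]) with three nonzero elements.
dense = np.array([[1, 0],
                  [0, 2],
                  [0, 0],
                  [3, 0]])

csc = sp.csc_matrix(dense)
csr = sp.csr_matrix(dense)

# data and indices have the same length in both formats:
# one entry per stored nonzero element.
print(len(csc.data), len(csr.data))      # 3 3

# Only indptr differs: one entry per column (+1) in csc, one per
# row (+1) in csr, so the tall matrix gets the shorter indptr in csc.
print(len(csc.indptr), len(csr.indptr))  # 3 5
```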
Sometimes, since the sparse module is young, some ops do not exist for both formats. So here is what may be the most relevant rule:
Note
Use the format compatible with the ops in your computation graph.
The documentation about the ops and their supported format may be found in the Sparse Library Reference.
Handling Sparse in Theano¶
Most of the ops in Theano depend on the format of the sparse matrix. That is why there are two kinds of constructors of sparse variables: csc_matrix and csr_matrix. These can be called with the usual name and dtype parameters, but no broadcastable flags are allowed. This is forbidden since the sparse package, like the SciPy sparse module, does not provide any way to handle a number of dimensions different from two. The set of all accepted dtype values for sparse matrices can be found in sparse.all_dtypes.
>>> sparse.all_dtypes
set(['int8', 'int16', 'int32', 'int64', 'uint8', 'uint16', 'uint32', 'uint64',
'float32', 'float64', 'complex64', 'complex128'])
To move back and forth between a dense matrix and a sparse matrix representation, Theano provides the dense_from_sparse, csr_from_dense and csc_from_dense functions. No additional detail needs to be provided. Here is an example that performs a full cycle from sparse to sparse:
>>> x = sparse.csc_matrix(name='x', dtype='float32')
>>> y = sparse.dense_from_sparse(x)
>>> z = sparse.csc_from_dense(y)
Although sparse variables do not allow direct access to their properties, this can be accomplished using the csm_properties function. This will return a tuple of one-dimensional tensor variables that represents the internal characteristics of the sparse matrix.
In order to reconstruct a sparse matrix from some properties, the functions CSC and CSR can be used. This will create the sparse matrix in the desired format. As an example, the following code reconstructs a csc matrix into a csr one.
>>> x = sparse.csc_matrix(name='x', dtype='int64')
>>> data, indices, indptr, shape = sparse.csm_properties(x)
>>> y = sparse.CSR(data, indices, indptr, shape)
>>> f = theano.function([x], y)
>>> a = sp.csc_matrix(np.asarray([[0, 1, 1], [0, 0, 0], [1, 0, 0]]))
>>> print(a.toarray())
[[0 1 1]
[0 0 0]
[1 0 0]]
>>> print(f(a).toarray())
[[0 0 1]
[1 0 0]
[1 0 0]]
The last example shows that one format can be obtained by transposition of the other. Indeed, when calling the transpose function, the sparse characteristics of the resulting matrix cannot be the same as those of the input.
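The reinterpretation above can be reproduced in pure SciPy, which makes the transposition effect explicit (a sketch using the same matrix as the example above):

```python
import numpy as np
import scipy.sparse as sp

a = sp.csc_matrix(np.asarray([[0, 1, 1], [0, 0, 0], [1, 0, 0]]))

# Reinterpret the csc internals (data, indices, indptr) as a csr matrix:
# column-wise storage read row-wise gives the transpose.
b = sp.csr_matrix((a.data, a.indices, a.indptr), shape=a.shape)
print((b.toarray() == a.toarray().T).all())  # True
```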
Several ops are set to make use of the very peculiar structure of sparse matrices. These ops are said to be structured and simply do not perform any computation on the zero elements of the sparse matrix. They can be thought of as being applied only to the data attribute of the matrix. Note that these structured ops provide a structured gradient. More explanation below.
>>> x = sparse.csc_matrix(name='x', dtype='float32')
>>> y = sparse.structured_add(x, 2)
>>> f = theano.function([x], y)
>>> a = sp.csc_matrix(np.asarray([[0, 0, 1], [0, 2, 1], [3, 0, 0]], dtype='float32'))
>>> print(a.toarray())
[[ 0.  0.  1.]
 [ 0.  2.  1.]
 [ 3.  0.  0.]]
>>> print(f(a).toarray())
[[ 0.  0.  3.]
 [ 0.  4.  3.]
 [ 5.  0.  0.]]
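The same behavior can be mimicked in plain SciPy, which makes it clear that the op only touches the stored data attribute (a sketch of the semantics, not of how Theano implements it):

```python
import numpy as np
import scipy.sparse as sp

a = sp.csc_matrix(np.asarray([[0, 0, 1], [0, 2, 1], [3, 0, 0]],
                             dtype='float32'))

# "Structured" add: operate on the stored nonzero elements only,
# leaving the zeros (and the sparsity structure) untouched.
b = a.copy()
b.data = b.data + 2
print(b.toarray())
```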
The gradients of the ops in the sparse module can also be structured. Some ops provide a flag to indicate if the gradient is to be structured or not. The documentation can be used to determine if the gradient of an op is regular or structured or if its implementation can be modified. Similarly to structured ops, when a structured gradient is calculated, the computation is done only for the nonzero elements of the sparse matrix.
More documentation regarding the gradients of specific ops can be found in the Sparse Library Reference.
Using the GPU¶
For an introductory discussion of Graphical Processing Units (GPU) and their use for intensive parallel computation purposes, see GPGPU.
One of Theano’s design goals is to specify computations at an abstract level, so that the internal function compiler has a lot of flexibility about how to carry out those computations. One of the ways we take advantage of this flexibility is in carrying out calculations on a graphics card.
There are currently two ways to use a GPU: one that only supports NVIDIA cards (CUDA backend) and another, in development, that should support any OpenCL device as well as NVIDIA cards (GpuArray backend).
CUDA backend¶
If you have not done so already, you will need to install Nvidia’s GPU-programming toolchain (CUDA) and configure Theano to use it. We provide installation instructions for Linux, MacOS and Windows.
To see if your GPU is being used, cut and paste the following program into a file and run it.
from theano import function, config, shared, sandbox
import theano.tensor as T
import numpy
import time

vlen = 10 * 30 * 768  # 10 x #cores x # threads per core
iters = 1000

rng = numpy.random.RandomState(22)
x = shared(numpy.asarray(rng.rand(vlen), config.floatX))
f = function([], T.exp(x))
print(f.maker.fgraph.toposort())
t0 = time.time()
for i in range(iters):
    r = f()
t1 = time.time()
print("Looping %d times took %f seconds" % (iters, t1 - t0))
print("Result is %s" % (r,))
if numpy.any([isinstance(x.op, T.Elemwise) for x in f.maker.fgraph.toposort()]):
    print('Used the cpu')
else:
    print('Used the gpu')
The program just computes the exp() of a bunch of random numbers. Note that we use the shared function to make sure that the input x is stored on the graphics device.
If I run this program (in check1.py) with device=cpu, my computer takes a little over 3 seconds, whereas on the GPU it takes just over 0.64 seconds. The GPU will not always produce the exact same floating-point numbers as the CPU. As a benchmark, a loop that calls numpy.exp(x.get_value()) takes about 46 seconds.
$ THEANO_FLAGS=mode=FAST_RUN,device=cpu,floatX=float32 python check1.py
[Elemwise{exp,no_inplace}(<TensorType(float32, vector)>)]
Looping 1000 times took 3.06635117531 seconds
Result is [ 1.23178029 1.61879337 1.52278066 ..., 2.20771813 2.29967761
1.62323284]
Used the cpu
$ THEANO_FLAGS=mode=FAST_RUN,device=gpu,floatX=float32 python check1.py
Using gpu device 0: GeForce GTX 580
[GpuElemwise{exp,no_inplace}(<CudaNdarrayType(float32, vector)>), HostFromGpu(GpuElemwise{exp,no_inplace}.0)]
Looping 1000 times took 0.638810873032 seconds
Result is [ 1.23178029 1.61879349 1.52278066 ..., 2.20771813 2.29967761
1.62323296]
Used the gpu
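The numpy baseline mentioned above can be reproduced with a plain loop (a rough sketch; the absolute timing will of course vary by machine):

```python
import numpy
import time

vlen = 10 * 30 * 768
iters = 1000

rng = numpy.random.RandomState(22)
x = rng.rand(vlen).astype('float32')

t0 = time.time()
for i in range(iters):
    r = numpy.exp(x)
t1 = time.time()
print("Looping %d times took %f seconds" % (iters, t1 - t0))
```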
Note that GPU operations in Theano require for now floatX to be float32 (see also below).
The speedup is not greater in the preceding example because the function is returning its result as a NumPy ndarray, which has already been copied from the device to the host for your convenience. This is what makes it so easy to swap in device=gpu, but if you don’t mind less portability, you might gain a bigger speedup by changing the graph to express a computation with a GPU-stored result. The gpu_from_host op means “copy the input from the host to the GPU” and it is optimized away after T.exp(x) is replaced by a GPU version of exp().
from theano import function, config, shared, sandbox
import theano.sandbox.cuda.basic_ops
import theano.tensor as T
import numpy
import time

vlen = 10 * 30 * 768  # 10 x #cores x # threads per core
iters = 1000

rng = numpy.random.RandomState(22)
x = shared(numpy.asarray(rng.rand(vlen), 'float32'))
f = function([], sandbox.cuda.basic_ops.gpu_from_host(T.exp(x)))
print(f.maker.fgraph.toposort())
t0 = time.time()
for i in range(iters):
    r = f()
t1 = time.time()
print("Looping %d times took %f seconds" % (iters, t1 - t0))
print("Result is %s" % (r,))
print("Numpy result is %s" % (numpy.asarray(r),))
if numpy.any([isinstance(x.op, T.Elemwise) for x in f.maker.fgraph.toposort()]):
    print('Used the cpu')
else:
    print('Used the gpu')
The output from this program is
$ THEANO_FLAGS=mode=FAST_RUN,device=gpu,floatX=float32 python check2.py
Using gpu device 0: GeForce GTX 580
[GpuElemwise{exp,no_inplace}(<CudaNdarrayType(float32, vector)>)]
Looping 1000 times took 0.34898686409 seconds
Result is <CudaNdarray object at 0x6a7a5f0>
Numpy result is [ 1.23178029 1.61879349 1.52278066 ..., 2.20771813 2.29967761
1.62323296]
Used the gpu
Here we’ve shaved off about 50% of the run-time by simply not copying the resulting array back to the host. The object returned by each function call is now not a NumPy array but a “CudaNdarray”, which can be converted to a NumPy ndarray by the normal NumPy casting mechanism using something like numpy.asarray().
For even more speed you can play with the borrow
flag. See
Borrowing when Constructing Function Objects.
The performance characteristics will change as we continue to optimize our implementations, and vary from device to device, but to give a rough idea of what to expect right now:
 Only computations with float32 datatype can be accelerated. Better support for float64 is expected in upcoming hardware but float64 computations are still relatively slow (Jan 2010).
 Matrix multiplication, convolution, and large element-wise operations can be accelerated a lot (5-50x) when arguments are large enough to keep 30 processors busy.
 Indexing, dimension-shuffling and constant-time reshaping will be equally fast on the GPU as on the CPU.
 Summation over rows/columns of tensors can be a little slower on the GPU than on the CPU.
 Copying of large quantities of data to and from a device is relatively slow, and often cancels most of the advantage of one or two accelerated functions on that data. Getting GPU performance largely hinges on making data transfer to the device pay off.
 Consider adding floatX=float32 to your .theanorc file if you plan to do a lot of GPU work.
 Use the Theano flag allow_gc=False. See GPU Async capabilities.
 Prefer constructors like matrix, vector and scalar to dmatrix, dvector and dscalar, because the former will give you float32 variables when floatX=float32.
 Ensure that your output variables have a float32 dtype and not float64. The more float32 variables there are in your graph, the more work the GPU can do for you.
 Minimize transfers to the GPU device by using shared float32 variables to store frequently-accessed data (see shared()). When using the GPU, float32 tensor shared variables are stored on the GPU by default, which eliminates transfer time for GPU ops using those variables.
 If you aren’t happy with the performance you see, try running your script with the profile=True flag. This should print some timing information at program termination. Is time being used sensibly? If an op or Apply is taking more time than its share, then if you know something about GPU programming, have a look at how it’s implemented in theano.sandbox.cuda. Check the line similar to Spent Xs(X%) in cpu op, Xs(X%) in gpu op and Xs(X%) in transfer op. This can tell you if not enough of your graph is on the GPU or if there is too much memory transfer.
 Use nvcc options. nvcc supports options to speed up some computations: -ftz=true flushes denormal values to zero, and the --prec-div=false and --prec-sqrt=false options speed up division and square root by being less precise. You can enable all of them with the nvcc.flags=--use_fast_math Theano flag, or enable them individually as in this example: nvcc.flags=-ftz=true --prec-div=false.
 To investigate whether all the Ops in the computational graph are running on the GPU, you can debug or check your code by providing a value to the assert_no_cpu_op flag: warn for a warning, raise for raising an error, or pdb for putting a breakpoint in the computational graph if there is a CPU Op.
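The float32 advice above matters because float64 spreads through a graph by ordinary type-promotion rules; a small numpy sketch of the effect (Theano tensors promote similarly, and on the old CUDA backend any float64 intermediate falls back to the CPU):

```python
import numpy as np

x = np.ones(3, dtype='float32')

# Multiplying by a plain Python float keeps float32...
print((x * 2.0).dtype)  # float32

# ...but any float64 array silently upcasts the whole result.
y = np.ones(3, dtype='float64')
print((x * y).dtype)    # float64
```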
Since Theano 0.6, we have used the asynchronous capability of GPUs. This makes us faster, but with the possibility that some errors are raised later than the point at which they actually occur. This can cause difficulties when profiling Theano apply nodes. There is an NVIDIA driver feature to help with these issues. If you set the environment variable CUDA_LAUNCH_BLOCKING=1, then all kernel calls will be automatically synchronized. This reduces performance but provides good profiling and appropriately placed error messages.
This feature interacts with Theano’s garbage collection of intermediate results. To get the most out of this feature, you need to disable the gc, as it inserts synchronization points in the graph. Set the Theano flag allow_gc=False to get even faster speed! Note that this will increase memory usage.
GpuArray Backend¶
If you have not done so already, you will need to install libgpuarray as well as at least one computing toolkit. Instructions for doing so are provided at libgpuarray.
While all types of devices are supported if using OpenCL, for the remainder of this section, whatever compute device you are using will be referred to as GPU.
Warning
While it is fully our intention to support OpenCL, as of May 2014 this support is still in its infancy. A lot of very useful ops still do not support it because they were ported from the old backend with minimal change.
To see if your GPU is being used, cut and paste the following program into a file and run it.
from theano import function, config, shared, tensor, sandbox
import numpy
import time

vlen = 10 * 30 * 768  # 10 x #cores x # threads per core
iters = 1000

rng = numpy.random.RandomState(22)
x = shared(numpy.asarray(rng.rand(vlen), config.floatX))
f = function([], tensor.exp(x))
print(f.maker.fgraph.toposort())
t0 = time.time()
for i in range(iters):
    r = f()
t1 = time.time()
print("Looping %d times took %f seconds" % (iters, t1 - t0))
print("Result is %s" % (r,))
if numpy.any([isinstance(x.op, tensor.Elemwise) and
              ('Gpu' not in type(x.op).__name__)
              for x in f.maker.fgraph.toposort()]):
    print('Used the cpu')
else:
    print('Used the gpu')
The program just computes exp() of a bunch of random numbers. Note that we use the theano.shared() function to make sure that the input x is stored on the GPU.
$ THEANO_FLAGS=device=cpu python check1.py
[Elemwise{exp,no_inplace}(<TensorType(float64, vector)>)]
Looping 1000 times took 2.6071999073 seconds
Result is [ 1.23178032 1.61879341 1.52278065 ..., 2.20771815 2.29967753
1.62323285]
Used the cpu
$ THEANO_FLAGS=device=cuda0 python check1.py
Using device cuda0: GeForce GTX 275
[GpuElemwise{exp,no_inplace}(<GpuArray<float64>>), HostFromGpu(gpuarray)(GpuElemwise{exp,no_inplace}.0)]
Looping 1000 times took 2.28562092781 seconds
Result is [ 1.23178032 1.61879341 1.52278065 ..., 2.20771815 2.29967753
1.62323285]
Used the gpu
By default, functions that execute on the GPU still return a standard numpy ndarray. A transfer operation is inserted just before the results are returned to ensure a consistent interface with CPU code. This allows changing the device some code runs on by only replacing the value of the device flag, without touching the code.
If you don’t mind a loss of flexibility, you can ask Theano to return the GPU object directly. The following code is modified to do just that.
from theano import function, config, shared, tensor, sandbox
import numpy
import time

vlen = 10 * 30 * 768  # 10 x #cores x # threads per core
iters = 1000

rng = numpy.random.RandomState(22)
x = shared(numpy.asarray(rng.rand(vlen), config.floatX))
f = function([], sandbox.gpuarray.basic_ops.gpu_from_host(tensor.exp(x)))
print(f.maker.fgraph.toposort())
t0 = time.time()
for i in range(iters):
    r = f()
t1 = time.time()
print("Looping %d times took %f seconds" % (iters, t1 - t0))
print("Result is %s" % (numpy.asarray(r),))
if numpy.any([isinstance(x.op, tensor.Elemwise) and
              ('Gpu' not in type(x.op).__name__)
              for x in f.maker.fgraph.toposort()]):
    print('Used the cpu')
else:
    print('Used the gpu')
Here the theano.sandbox.gpuarray.basic.gpu_from_host() call means “copy input to the GPU”. However, during the optimization phase, since the result will already be on the GPU, it will be removed. It is used here to tell Theano that we want the result on the GPU.
The output is
$ THEANO_FLAGS=device=cuda0 python check2.py
Using device cuda0: GeForce GTX 275
[GpuElemwise{exp,no_inplace}(<GpuArray<float64>>)]
Looping 1000 times took 0.455810785294 seconds
Result is [ 1.23178032 1.61879341 1.52278065 ..., 2.20771815 2.29967753
1.62323285]
Used the gpu
While the time per call appears to be much lower than in the two previous invocations (and should indeed be lower, since we avoid a transfer), the massive speedup we obtained is in part due to the asynchronous nature of execution on GPUs, meaning that the work isn’t completed yet, just ‘launched’. We’ll talk about that later.
The object returned is a GpuArray from pygpu. It mostly acts as a numpy ndarray, with some exceptions due to its data being on the GPU. You can copy it to the host and convert it to a regular ndarray using the usual numpy casting, such as numpy.asarray().
For even more speed, you can play with the borrow
flag. See
Borrowing when Constructing Function Objects.
The performance characteristics will of course vary from device to device, and also as we refine our implementation.
This backend supports all regular Theano data types (float32, float64, int, ...), however GPU support varies and some units can’t deal with double (float64) or small (less than 32 bits, like int16) data types. You will get an error at compile time or runtime if this is the case.
By default all inputs will get transferred to GPU. You can prevent an input from getting transferred by setting its tag.target attribute to ‘cpu’.
Complex support is untested and most likely completely broken.
In general, large operations like matrix multiplication, or element-wise operations with large inputs, will be significantly faster.
By default, all operations on the GPU are run asynchronously. This means that they are only scheduled to run and the function returns. This is made somewhat transparent by the underlying libgpuarray.
A forced synchronization point is introduced when doing memory transfers between device and host.
It is possible to force synchronization for a particular GpuArray by
calling its sync()
method. This is useful to get accurate timings
when doing benchmarks.
Software for Directly Programming a GPU¶
Leaving aside Theano which is a metaprogrammer, there are:
CUDA: GPU programming API by NVIDIA, based on an extension to C (CUDA C)
 Vendorspecific
 Numeric libraries (BLAS, RNG, FFT) are maturing.
OpenCL: multivendor version of CUDA
 More general, standardized.
 Fewer libraries, less widespread.
PyCUDA: Python bindings to the CUDA driver interface, allowing access to Nvidia’s CUDA parallel computation API from Python
Convenience:
Makes it easy to do GPU metaprogramming from within Python.
Abstractions to compile low-level CUDA code from Python (pycuda.driver.SourceModule).
GPU memory buffer (pycuda.gpuarray.GPUArray).
Helpful documentation.
Completeness: Binding to all of CUDA’s driver API.
Automatic error checking: All CUDA errors are automatically translated into Python exceptions.
Speed: PyCUDA’s base layer is written in C++.
Good memory management of GPU objects:
Object cleanup tied to lifetime of objects (RAII, ‘Resource Acquisition Is Initialization’).
Makes it much easier to write correct, leak and crashfree code.
PyCUDA knows about dependencies (e.g. it won’t detach from a context before all memory allocated in it is also freed).
(This is adapted from PyCUDA’s documentation and Andreas Kloeckner’s website on PyCUDA.)
PyOpenCL: PyCUDA for OpenCL
Learning to Program with PyCUDA¶
If you already enjoy a good proficiency with the C programming language, you may easily leverage your knowledge by learning, first, to program a GPU with the CUDA extension to C (CUDA C) and, second, to use PyCUDA to access the CUDA API with a Python wrapper.
The following resources will assist you in this learning process:
 CUDA API and CUDA C: Introductory
 CUDA API and CUDA C: Advanced
 MIT IAP2009 CUDA (full coverage: lectures, leading Kirk-Hwu textbook, examples, additional resources)
 Course U. of Illinois (full lectures, Kirk-Hwu textbook)
 NVIDIA’s knowledge base (extensive coverage, levels from introductory to advanced)
 practical issues (on the relationship between grids, blocks and threads; see also linked and related issues on same page)
 CUDA optimisation
 PyCUDA: Introductory
 PYCUDA: Advanced
The following examples give a foretaste of programming a GPU with PyCUDA. Once you feel competent enough, you may try yourself on the corresponding exercises.
Example: PyCUDA
# (from PyCUDA's documentation)
import pycuda.autoinit
import pycuda.driver as drv
import numpy

from pycuda.compiler import SourceModule
mod = SourceModule("""
__global__ void multiply_them(float *dest, float *a, float *b)
{
  const int i = threadIdx.x;
  dest[i] = a[i] * b[i];
}
""")

multiply_them = mod.get_function("multiply_them")

a = numpy.random.randn(400).astype(numpy.float32)
b = numpy.random.randn(400).astype(numpy.float32)

dest = numpy.zeros_like(a)
multiply_them(
        drv.Out(dest), drv.In(a), drv.In(b),
        block=(400, 1, 1), grid=(1, 1))

assert numpy.allclose(dest, a * b)
print(dest)
Run the preceding example.
Modify and execute to work for a matrix of shape (20, 10).
Example: Theano + PyCUDA
import numpy, theano
import theano.misc.pycuda_init
from pycuda.compiler import SourceModule
import theano.sandbox.cuda as cuda


class PyCUDADoubleOp(theano.Op):
    __props__ = ()

    def make_node(self, inp):
        inp = cuda.basic_ops.gpu_contiguous(
            cuda.basic_ops.as_cuda_ndarray_variable(inp))
        assert inp.dtype == "float32"
        return theano.Apply(self, [inp], [inp.type()])

    def make_thunk(self, node, storage_map, _, _2):
        mod = SourceModule("""
    __global__ void my_fct(float * i0, float * o0, int size) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < size) {
            o0[i] = i0[i] * 2;
        }
    }""")
        pycuda_fct = mod.get_function("my_fct")
        inputs = [storage_map[v] for v in node.inputs]
        outputs = [storage_map[v] for v in node.outputs]

        def thunk():
            z = outputs[0]
            if z[0] is None or z[0].shape != inputs[0][0].shape:
                z[0] = cuda.CudaNdarray.zeros(inputs[0][0].shape)
            grid = (int(numpy.ceil(inputs[0][0].size / 512.)), 1)
            pycuda_fct(inputs[0][0], z[0], numpy.intc(inputs[0][0].size),
                       block=(512, 1, 1), grid=grid)
        return thunk
Use this code to test it:
>>> x = theano.tensor.fmatrix()
>>> f = theano.function([x], PyCUDADoubleOp()(x))
>>> xv = numpy.ones((4, 5), dtype="float32")
>>> assert numpy.allclose(f(xv), xv*2)
>>> print(numpy.asarray(f(xv)))
Run the preceding example.
Modify and execute to multiply two matrices: x * y.
Modify and execute to return two outputs: x + y and x  y.
(Notice that Theano’s current elemwise fusion optimization is only applicable to computations involving a single output. Hence, to gain efficiency over the basic solution that is asked here, the two operations would have to be jointly optimized explicitly in the code.)
Modify and execute to support stride (i.e. to avoid constraining the input to be Ccontiguous).
Note¶
 See Other Implementations to know how to handle random numbers on the GPU.
 The mode FAST_COMPILE disables C code, so it also disables the GPU. You can use the Theano flag optimizer='fast_compile' to speed up compilation and keep the GPU.
Using multiple GPUs¶
Theano has a feature to allow the use of multiple GPUs at the same time in one function. The multiple-GPU feature requires the use of the GpuArray backend, so make sure that works correctly.
In order to keep a reasonably high level of abstraction, you do not refer to device names directly for multiple-GPU use. You instead refer to what we call context names. These are then mapped to a device using the Theano configuration. This allows portability of models between machines.
Warning
The code is rather new and is still considered experimental at this point. It has been tested and seems to perform correctly in all cases observed, but make sure to doublecheck your results before publishing a paper or anything of the sort.
Note
 For data-parallelism, you are probably better off using platoon.
Defining the context map¶
The mapping from context names to devices is done through the config.contexts option. The format looks like this:
dev0->cuda0;dev1->cuda1
Let’s break it down. First there is a list of mappings. Each of these mappings is separated by a semicolon ‘;’. There can be any number of such mappings, but in the example above we have two of them: dev0->cuda0 and dev1->cuda1.
The mappings themselves are composed of a context name followed by the two characters ‘->’ and the device name. The context name is a simple string which does not have any special meaning for Theano. For parsing reasons, the context name cannot contain the sequence ‘->’ or ‘;’. To avoid confusion, context names that begin with ‘cuda’ or ‘opencl’ are disallowed. The device name is a device in the form that gpuarray expects, like ‘cuda0’ or ‘opencl0:0’.
Note
Since there are a bunch of shell special characters in the syntax, defining this on the commandline will require proper quoting, like this:
$ THEANO_FLAGS="contexts=dev0->cuda0"
When you define a context map, if config.print_active_device is True (the default), Theano will print the mappings as they are defined. This will look like this:
$ THEANO_FLAGS="contexts=dev0->cuda0;dev1->cuda1" python -c 'import theano'
Mapped name dev0 to device cuda0: GeForce GTX TITAN X
Mapped name dev1 to device cuda1: GeForce GTX TITAN X
If you don’t have enough GPUs for a certain model, you can assign the same device to more than one name. You can also assign extra names that a model doesn’t need to some other devices. However, a proliferation of names is not always a good idea, since Theano often assumes that different context names will be on different devices and will optimize accordingly. So you may get faster performance with a single name and a single device.
Note
It is often the case that multi-GPU operation requires or assumes that all the GPUs involved are equivalent. This is not the case for this implementation. Since the user has the task of distributing the jobs across the different devices, a model can be built on the assumption that one of the GPUs is slower or has less memory.
A simple graph on two GPUs¶
The following simple program works on two GPUs. It builds a function which performs two dot products on two different GPUs.
import numpy
import theano

v01 = theano.shared(numpy.random.random((1024, 1024)).astype('float32'),
                    target='dev0')
v02 = theano.shared(numpy.random.random((1024, 1024)).astype('float32'),
                    target='dev0')
v11 = theano.shared(numpy.random.random((1024, 1024)).astype('float32'),
                    target='dev1')
v12 = theano.shared(numpy.random.random((1024, 1024)).astype('float32'),
                    target='dev1')

f = theano.function([], [theano.tensor.dot(v01, v02),
                         theano.tensor.dot(v11, v12)])
f()
This model requires a context map with assignments for ‘dev0’ and ‘dev1’. It should run twice as fast when the devices are different.
Explicit transfers of data¶
Since operations themselves cannot work on more than one device, they will pick a device to work on based on their inputs and automatically insert transfers for any input which is not on the right device.
However, you may want some explicit control over where and how these transfers are done at some points. This is done by using the new transfer() method that is present on variables. It works for moving data between GPUs and also between the host and the GPUs. Here is an example.
import theano
v = theano.tensor.fmatrix()
# Move to the device associated with 'gpudev'
gv = v.transfer('gpudev')
# Move back to the cpu
cv = gv.transfer('cpu')
Of course you can mix transfers and operations in any order you choose. However, you should try to minimize transfer operations because they introduce overhead and may reduce performance.
Advanced configuration and debugging¶
Configuration Settings and Compiling Modes¶
Configuration¶
The config module contains several attributes that modify Theano’s behavior. Many of these attributes are examined during the import of the theano module and several are assumed to be read-only.
As a rule, the attributes in the config module should not be modified inside user code.
Theano’s code comes with default values for these attributes, but you can override them from your .theanorc file, and override those values in turn with the THEANO_FLAGS environment variable.
The order of precedence is:
 an assignment to theano.config.<property>
 an assignment in THEANO_FLAGS
 an assignment in the .theanorc file (or the file indicated in THEANORC)
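For instance, a .theanorc is a plain INI file whose sections and options mirror the flag names accepted by THEANO_FLAGS. A minimal sketch (the particular values are illustrative, not recommendations):

```ini
# ~/.theanorc -- the lowest-precedence layer of the three above
[global]
floatX = float32
device = cpu

[blas]
ldflags = -lblas
```

Any of these options can then be overridden for a single run, e.g. THEANO_FLAGS='floatX=float64' python my_script.py, and again from within the program by assigning to theano.config.floatX.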
You can display the current/effective configuration at any time by printing theano.config. For example, to see a list of all active configuration variables, type this from the commandline:
python -c 'import theano; print(theano.config)' | less
For more detail, see Configuration in the library.
Exercise¶
Consider the logistic regression:
import numpy
import theano
import theano.tensor as T

rng = numpy.random

N = 400
feats = 784
D = (rng.randn(N, feats).astype(theano.config.floatX),
     rng.randint(size=N, low=0, high=2).astype(theano.config.floatX))
training_steps = 10000

# Declare Theano symbolic variables
x = T.matrix("x")
y = T.vector("y")
w = theano.shared(rng.randn(feats).astype(theano.config.floatX), name="w")
b = theano.shared(numpy.asarray(0., dtype=theano.config.floatX), name="b")
x.tag.test_value = D[0]
y.tag.test_value = D[1]

# Construct Theano expression graph
p_1 = 1 / (1 + T.exp(-T.dot(x, w) - b))  # Probability of having a one
prediction = p_1 > 0.5                   # The prediction that is done: 0 or 1
xent = -y * T.log(p_1) - (1 - y) * T.log(1 - p_1)  # Cross-entropy
cost = xent.mean() + 0.01 * (w ** 2).sum()         # The cost to optimize
gw, gb = T.grad(cost, [w, b])

# Compile expressions to functions
train = theano.function(
            inputs=[x, y],
            outputs=[prediction, xent],
            updates=[(w, w - 0.01 * gw), (b, b - 0.01 * gb)],
            name="train")
predict = theano.function(inputs=[x], outputs=prediction,
                          name="predict")

if any([n.op.__class__.__name__ in ['Gemv', 'CGemv', 'Gemm', 'CGemm'] for n in
        train.maker.fgraph.toposort()]):
    print('Used the cpu')
elif any([n.op.__class__.__name__ in ['GpuGemm', 'GpuGemv'] for n in
          train.maker.fgraph.toposort()]):
    print('Used the gpu')
else:
    print('ERROR, not able to tell if theano used the cpu or the gpu')
    print(train.maker.fgraph.toposort())

for i in range(training_steps):
    pred, err = train(D[0], D[1])

print("target values for D")
print(D[1])
print("prediction on D")
print(predict(D[0]))
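For intuition, here is a plain NumPy sketch of the same training loop, with the gradients written out by hand (the same quantities T.grad derives symbolically). The smaller toy dimensions and the scaled-down initial weights are illustrative choices to keep the float32 arithmetic well-behaved:

```python
import numpy as np

rng = np.random.RandomState(0)
N, feats = 40, 20          # toy sizes (the exercise above uses 400 x 784)
x = rng.randn(N, feats).astype('float32')
y = rng.randint(size=N, low=0, high=2).astype('float32')

w = (0.01 * rng.randn(feats)).astype('float32')   # small init keeps exp() tame
b = np.float32(0.0)

def step(w, b, lr=0.1):
    p_1 = 1.0 / (1.0 + np.exp(-x.dot(w) - b))     # sigmoid, as p_1 above
    xent = -y * np.log(p_1) - (1 - y) * np.log(1 - p_1)
    cost = xent.mean() + 0.01 * (w ** 2).sum()
    # Hand-derived gradients of `cost` with respect to w and b
    delta = (p_1 - y) / N
    gw = x.T.dot(delta) + 2 * 0.01 * w
    gb = delta.sum()
    return w - lr * gw, b - lr * gb, cost

costs = []
for _ in range(100):
    w, b, cost = step(w, b)
    costs.append(cost)

print(costs[0], costs[-1])   # the cost decreases over training
```

The compiled Theano train function computes the same updates, but with the graph optimized and (where possible) run as C code.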
Modify and execute this example to run on CPU (the default) with floatX=float32 and time the execution using the command line time python file.py. Save your code as it will be useful later on.
Note
 Apply the Theano flag floatX=float32 (through theano.config.floatX) in your code.
 Cast inputs before storing them into a shared variable.
 Circumvent the automatic cast of int32 with float32 to float64:
  Insert manual casts in your code or use [u]int{8,16}.
  Insert a manual cast around the mean operator (this involves a division by the length, which is an int64).
  Note that a new casting mechanism is being developed.
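The upcasting rule mentioned above can be reproduced with plain NumPy dtypes, whose promotion table matches the behavior described here:

```python
import numpy as np

i32 = np.arange(4, dtype='int32')
f32 = np.ones(4, dtype='float32')

# int32 combined with float32 is silently promoted to float64,
# which defeats a floatX=float32 setup.
print((i32 * f32).dtype)                      # float64

# [u]int{8,16} fit inside float32, so no upcast happens.
print((i32.astype('int16') * f32).dtype)      # float32

# Or insert a manual cast after the operation.
print(((i32 * f32).astype('float32')).dtype)  # float32
```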
Mode¶
Every time theano.function is called, the symbolic relationships between the input and output Theano variables are optimized and compiled. The way this compilation occurs is controlled by the value of the mode parameter.
Theano defines the following modes by name:
 'FAST_COMPILE': Apply just a few graph optimizations and only use Python implementations. The GPU is disabled.
 'FAST_RUN': Apply all optimizations and use C implementations where possible.
 'DebugMode': Verify the correctness of all optimizations, and compare C and Python implementations. This mode can take much longer than the other modes, but can identify several kinds of problems.
 'NanGuardMode': Same optimization as FAST_RUN, but check if a node generates NaNs.
The default mode is typically FAST_RUN, but it can be controlled via the configuration variable config.mode, which can be overridden by passing the mode keyword argument to theano.function.
short name | Full constructor | What does it do?
FAST_COMPILE | compile.mode.Mode(linker='py', optimizer='fast_compile') | Python implementations only, quick and cheap graph transformations
FAST_RUN | compile.mode.Mode(linker='cvm', optimizer='fast_run') | C implementations where available, all available graph transformations.
DebugMode | compile.debugmode.DebugMode() | Both implementations where available, all available graph transformations.
Note
For debugging purposes, there also exists a MonitorMode (which has no short name). It can be used to step through the execution of a function: see the debugging FAQ for details.
Linkers¶
A mode is composed of 2 things: an optimizer and a linker. Some modes, like NanGuardMode and DebugMode, add logic around the optimizer and linker. NanGuardMode and DebugMode use their own linker.
You can select which linker to use with the Theano flag config.linker. Here is a table comparing the different linkers.
linker | gc [1] | Raise error by op | Overhead | Definition
cvm | yes | yes | “++” | As c|py, but the runtime algo to execute the code is in C
cvm_nogc | no | yes | “+” | As cvm, but without gc
c|py [2] | yes | yes | “+++” | Try C code. If none exists for an op, use Python
c|py_nogc | no | yes | “++” | As c|py, but without gc
c | no | yes | “+” | Use only C code (if none available for an op, raise an error)
py | yes | yes | “+++” | Use only Python code
NanGuardMode | no | no | “++++” | Check if nodes generate NaN
DebugMode | no | yes | VERY HIGH | Make many checks on what Theano computes
[1] Garbage collection of intermediate results during computation. Otherwise, the memory space used by the ops is kept between Theano function calls, in order not to reallocate memory and lower the overhead (make it faster...).
[2] Default
For more detail, see Mode in the library.
Using DebugMode¶
While normally you should use the FAST_RUN or FAST_COMPILE modes, it is useful at first (especially when you are defining new kinds of expressions or new optimizations) to run your code using DebugMode (available via mode='DebugMode'). DebugMode is designed to run several self-checks and assertions that can help diagnose possible programming errors leading to incorrect output. Note that DebugMode is much slower than FAST_RUN or FAST_COMPILE, so use it only during development (not when you launch 1000 processes on a cluster!).
DebugMode is used as follows:
x = T.dvector('x')
f = theano.function([x], 10 * x, mode='DebugMode')
f([5])
f([0])
f([7])
If any problem is detected, DebugMode will raise an exception according to what went wrong, either at call time (f([5])) or compile time (f = theano.function([x], 10 * x, mode='DebugMode')). These exceptions should not be ignored; talk to your local Theano guru or email the users list if you cannot make the exception go away.
Some kinds of errors can only be detected for certain input value combinations. In the example above, there is no way to guarantee that a future call to, say, f([-1]), won’t cause a problem. DebugMode is not a silver bullet.
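The cross-checking DebugMode performs can be sketched in a few lines of plain Python: run a reference implementation and an "optimized" one side by side, and fail loudly on any disagreement. The function names below are invented for this sketch:

```python
import numpy as np

def op_reference(x):
    # Slow but obviously-correct pure-Python implementation.
    return [10 * v for v in x]

def op_optimized(x):
    # The fast implementation (in Theano's case, generated C code).
    return (np.asarray(x) * 10).tolist()

def checked_call(x):
    # DebugMode-style check: both implementations must agree.
    ref = op_reference(x)
    fast = op_optimized(x)
    if not np.allclose(ref, fast):
        raise AssertionError("implementations disagree: %r vs %r" % (ref, fast))
    return fast

print(checked_call([5.0]))  # [50.0]
```

DebugMode does this for every op in the graph (plus many other consistency checks), which is why it is so much slower than the other modes.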
If you instantiate DebugMode using the constructor (see DebugMode) rather than the keyword DebugMode, you can configure its behaviour via constructor arguments. The keyword version of DebugMode (which you get by using mode='DebugMode') is quite strict.
For more detail, see DebugMode in the library.
ProfileMode¶
Note
ProfileMode is deprecated. Use config.profile instead.
Printing/Drawing Theano graphs¶
Theano provides the functions theano.printing.pprint() and theano.printing.debugprint() to print a graph to the terminal before or after compilation. pprint() is more compact and math-like, debugprint() is more verbose. Theano also provides pydotprint() that creates an image of the function. You can read about them in printing – Graph Printing and Symbolic Print Statement.
Note
When printing Theano functions, they can sometimes be hard to read. To help with this, you can disable some Theano optimizations using the Theano flag optimizer_excluding=fusion:inplace. Do not use this during real job execution, as it will make the graph slower and use more memory.
Consider again the logistic regression example:
>>> import numpy
>>> import theano
>>> import theano.tensor as T
>>> rng = numpy.random
>>> # Training data
>>> N = 400
>>> feats = 784
>>> D = (rng.randn(N, feats).astype(theano.config.floatX), rng.randint(size=N,low=0, high=2).astype(theano.config.floatX))
>>> training_steps = 10000
>>> # Declare Theano symbolic variables
>>> x = T.matrix("x")
>>> y = T.vector("y")
>>> w = theano.shared(rng.randn(feats).astype(theano.config.floatX), name="w")
>>> b = theano.shared(numpy.asarray(0., dtype=theano.config.floatX), name="b")
>>> x.tag.test_value = D[0]
>>> y.tag.test_value = D[1]
>>> # Construct Theano expression graph
>>> p_1 = 1 / (1 + T.exp(-T.dot(x, w) - b)) # Probability of having a one
>>> prediction = p_1 > 0.5 # The prediction that is done: 0 or 1
>>> # Compute gradients
>>> xent = -y * T.log(p_1) - (1 - y) * T.log(1 - p_1) # Cross-entropy
>>> cost = xent.mean() + 0.01*(w**2).sum() # The cost to optimize
>>> gw,gb = T.grad(cost, [w,b])
>>> # Training and prediction function
>>> train = theano.function(inputs=[x, y], outputs=[prediction, xent], updates=[[w, w - 0.01 * gw], [b, b - 0.01 * gb]], name="train")
>>> predict = theano.function(inputs=[x], outputs=prediction, name = "predict")
Pretty Printing¶
>>> theano.printing.pprint(prediction)
'gt((TensorConstant{1} / (TensorConstant{1} + exp(((-(x \\dot w)) - b)))),
TensorConstant{0.5})'
Debug Print¶
The precompilation graph:
>>> theano.printing.debugprint(prediction)
Elemwise{gt,no_inplace} [id A] ''
 |Elemwise{true_div,no_inplace} [id B] ''
 | |DimShuffle{x} [id C] ''
 | | |TensorConstant{1} [id D]
 | |Elemwise{add,no_inplace} [id E] ''
 | | |DimShuffle{x} [id F] ''
 | | | |TensorConstant{1} [id D]
 | | |Elemwise{exp,no_inplace} [id G] ''
 | | | |Elemwise{sub,no_inplace} [id H] ''
 | | | | |Elemwise{neg,no_inplace} [id I] ''
 | | | | | |dot [id J] ''
 | | | | | | |x [id K]
 | | | | | | |w [id L]
 | | | | |DimShuffle{x} [id M] ''
 | | | | | |b [id N]
 |DimShuffle{x} [id O] ''
 | |TensorConstant{0.5} [id P]
The postcompilation graph:
>>> theano.printing.debugprint(predict)
Elemwise{Composite{GT(scalar_sigmoid((-((-i0) - i1))), i2)}} [id A] ''   4
 |CGemv{inplace} [id B] ''   3
 | |AllocEmpty{dtype='float64'} [id C] ''   2
 | | |Shape_i{0} [id D] ''   1
 | | | |x [id E]
 | |TensorConstant{1.0} [id F]
 | |x [id E]
 | |w [id G]
 | |TensorConstant{0.0} [id H]
 |InplaceDimShuffle{x} [id I] ''   0
 | |b [id J]
 |TensorConstant{(1,) of 0.5} [id K]
Picture Printing of Graphs¶
The precompilation graph:
>>> theano.printing.pydotprint(prediction, outfile="pics/logreg_pydotprint_prediction.png", var_with_name_simple=True)
The output file is available at pics/logreg_pydotprint_prediction.png
The postcompilation graph:
>>> theano.printing.pydotprint(predict, outfile="pics/logreg_pydotprint_predict.png", var_with_name_simple=True)
The output file is available at pics/logreg_pydotprint_predict.png
The optimized training graph:
>>> theano.printing.pydotprint(train, outfile="pics/logreg_pydotprint_train.png", var_with_name_simple=True)
The output file is available at pics/logreg_pydotprint_train.png
Interactive Graph Visualization¶
The new d3viz module complements theano.printing.pydotprint() to visualize complex graph structures. Instead of creating a static image, it generates an HTML file, which allows you to dynamically inspect graph structures in a web browser. Features include zooming, drag-and-drop, editing node labels, or coloring nodes by their compute time. See d3viz for details.
Debugging Theano: FAQ and Troubleshooting¶
There are many kinds of bugs that might come up in a computer program. This page is structured as a FAQ. It provides recipes to tackle common problems, and introduces some of the tools that we use to find problems in our own Theano code, and even (it happens) in Theano’s internals, in Using DebugMode.
Isolating the Problem/Testing Theano Compiler¶
You can run your Theano function in a DebugMode. This tests the Theano optimizations and helps to find where NaN, inf and other problems come from.
Interpreting Error Messages¶
Even in its default configuration, Theano tries to display useful error messages. Consider the following faulty code.
import numpy as np
import theano
import theano.tensor as T
x = T.vector()
y = T.vector()
z = x + x
z = z + y
f = theano.function([x, y], z)
f(np.ones((2,)), np.ones((3,)))
Running the code above we see:
Traceback (most recent call last):
...
ValueError: Input dimension mismatch. (input[0].shape[0] = 3, input[1].shape[0] = 2)
Apply node that caused the error: Elemwise{add,no_inplace}(<TensorType(float64, vector)>, <TensorType(float64, vector)>, <TensorType(float64, vector)>)
Inputs types: [TensorType(float64, vector), TensorType(float64, vector), TensorType(float64, vector)]
Inputs shapes: [(3,), (2,), (2,)]
Inputs strides: [(8,), (8,), (8,)]
Inputs scalar values: ['not scalar', 'not scalar', 'not scalar']
HINT: Rerunning with most Theano optimization disabled could give you a backtraces when this node was created. This can be done with by setting the Theano flags 'optimizer=fast_compile'. If that does not work, Theano optimization can be disabled with 'optimizer=None'.
HINT: Use the Theano flag 'exception_verbosity=high' for a debugprint of this apply node.
Arguably the most useful information is approximately halfway through the error message, where the kind of error is displayed along with its cause (ValueError: Input dimension mismatch. (input[0].shape[0] = 3, input[1].shape[0] = 2)). Below it, some other information is given, such as the apply node that caused the error, as well as the input types, shapes, strides and scalar values.
The two hints can also be helpful when debugging. Using the Theano flag optimizer=fast_compile or optimizer=None can often tell you the faulty line, while exception_verbosity=high will display a debugprint of the apply node. Using these hints, the end of the error message becomes:
Backtrace when the node is created:
File "test0.py", line 8, in <module>
z = z + y
Debugprint of the apply node:
Elemwise{add,no_inplace} [id A] <TensorType(float64, vector)> ''
 |Elemwise{add,no_inplace} [id B] <TensorType(float64, vector)> ''
 | |<TensorType(float64, vector)> [id C] <TensorType(float64, vector)>
 | |<TensorType(float64, vector)> [id C] <TensorType(float64, vector)>
 |<TensorType(float64, vector)> [id D] <TensorType(float64, vector)>
We can see here that the error can be traced back to the line z = z + y. For this example, using optimizer=fast_compile worked. If it did not, you could set optimizer=None or use test values.
Using Test Values¶
As of v0.4.0, Theano has a new mechanism by which graphs are executed on-the-fly, before a theano.function is ever compiled. Since optimizations haven’t been applied at this stage, it is easier for the user to locate the source of a bug. This functionality is enabled through the config flag theano.config.compute_test_value. Its use is best shown through the following example. Here, we use exception_verbosity=high and optimizer=fast_compile, which would not tell you the line at fault. optimizer=None would, and it could therefore be used instead of test values.
import numpy
import theano
import theano.tensor as T
# compute_test_value is 'off' by default, meaning this feature is inactive
theano.config.compute_test_value = 'off' # Use 'warn' to activate this feature
# configure shared variables
W1val = numpy.random.rand(2, 10, 10).astype(theano.config.floatX)
W1 = theano.shared(W1val, 'W1')
W2val = numpy.random.rand(15, 20).astype(theano.config.floatX)
W2 = theano.shared(W2val, 'W2')
# input which will be of shape (5,10)
x = T.matrix('x')
# provide Theano with a default testvalue
#x.tag.test_value = numpy.random.rand(5, 10)
# transform the shared variable in some way. Theano does not
# know off hand that the matrix func_of_W1 has shape (20, 10)
func_of_W1 = W1.dimshuffle(2, 0, 1).flatten(2).T
# source of error: dot product of 5x10 with 20x10
h1 = T.dot(x, func_of_W1)
# do more stuff
h2 = T.dot(h1, W2.T)
# compile and call the actual function
f = theano.function([x], h2)
f(numpy.random.rand(5, 10))
Running the above code generates the following error message:
Traceback (most recent call last):
File "test1.py", line 31, in <module>
f(numpy.random.rand(5, 10))
File "PATH_TO_THEANO/theano/compile/function_module.py", line 605, in __call__
self.fn.thunks[self.fn.position_of_error])
File "PATH_TO_THEANO/theano/compile/function_module.py", line 595, in __call__
outputs = self.fn()
ValueError: Shape mismatch: x has 10 cols (and 5 rows) but y has 20 rows (and 10 cols)
Apply node that caused the error: Dot22(x, DimShuffle{1,0}.0)
Inputs types: [TensorType(float64, matrix), TensorType(float64, matrix)]
Inputs shapes: [(5, 10), (20, 10)]
Inputs strides: [(80, 8), (8, 160)]
Inputs scalar values: ['not scalar', 'not scalar']
Debugprint of the apply node:
Dot22 [id A] <TensorType(float64, matrix)> ''
 |x [id B] <TensorType(float64, matrix)>
 |DimShuffle{1,0} [id C] <TensorType(float64, matrix)> ''
 | |Flatten{2} [id D] <TensorType(float64, matrix)> ''
 | | |DimShuffle{2,0,1} [id E] <TensorType(float64, 3D)> ''
 | | | |W1 [id F] <TensorType(float64, 3D)>
HINT: Rerunning with most Theano optimization disabled could give you a backtraces when this node was created. This can be done with by setting the Theano flags 'optimizer=fast_compile'. If that does not work, Theano optimization can be disabled with 'optimizer=None'.
If the above is not informative enough, by instrumenting the code ever so slightly, we can get Theano to reveal the exact source of the error.
# enable onthefly graph computations
theano.config.compute_test_value = 'warn'
...
# input which will be of shape (5, 10)
x = T.matrix('x')
# provide Theano with a default testvalue
x.tag.test_value = numpy.random.rand(5, 10)
In the above, we are tagging the symbolic matrix x with a special test value. This allows Theano to evaluate symbolic expressions on-the-fly (by calling the perform method of each op), as they are being defined. Sources of error can thus be identified with much more precision and much earlier in the compilation pipeline. For example, running the above code yields the following error message, which properly identifies line 24 as the culprit.
Traceback (most recent call last):
File "test2.py", line 24, in <module>
h1 = T.dot(x, func_of_W1)
File "PATH_TO_THEANO/theano/tensor/basic.py", line 4734, in dot
return _dot(a, b)
File "PATH_TO_THEANO/theano/gof/op.py", line 545, in __call__
required = thunk()
File "PATH_TO_THEANO/theano/gof/op.py", line 752, in rval
r = p(n, [x[0] for x in i], o)
File "PATH_TO_THEANO/theano/tensor/basic.py", line 4554, in perform
z[0] = numpy.asarray(numpy.dot(x, y))
ValueError: matrices are not aligned
The compute_test_value mechanism works as follows:
 Theano constants and shared variables are used as is. No need to instrument them.
 A Theano variable (i.e. dmatrix, vector, etc.) should be given a special test value through the attribute tag.test_value.
 Theano automatically instruments intermediate results. As such, any quantity derived from x will be given a tag.test_value automatically.
compute_test_value can take the following values:
 off: Default behavior. This debugging mechanism is inactive.
 raise: Compute test values on the fly. Any variable for which a test value is required, but not provided by the user, is treated as an error. An exception is raised accordingly.
 warn: Idem, but a warning is issued instead of an Exception.
 ignore: Silently ignore the computation of intermediate test values, if a variable is missing a test value.
Note
This feature is currently incompatible with Scan and also with ops which do not implement a perform method.
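The idea behind compute_test_value can be sketched without Theano: carry a concrete sample array along with every symbolic node and evaluate each op on it eagerly, so shape errors surface at graph-construction time rather than at call time. The Var class and dot function below are invented for this sketch:

```python
import numpy as np

class Var(object):
    """A 'symbolic' variable carrying a concrete test value."""
    def __init__(self, test_value):
        self.test_value = np.asarray(test_value)

def dot(a, b):
    # Evaluating eagerly on the test values makes a shape
    # mismatch raise here, at definition time.
    return Var(a.test_value.dot(b.test_value))

x = Var(np.random.rand(5, 10))
w = Var(np.random.rand(20, 10))      # wrong orientation, as in the example above

try:
    h1 = dot(x, w)                   # (5, 10) . (20, 10) -> shapes not aligned
    ok = True
except ValueError as e:
    ok = False
    print("caught at definition time:", e)
```

Theano does the same bookkeeping through tag.test_value, calling each op's perform method on the tagged values as the graph is built.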
“How do I Print an Intermediate Value in a Function?”¶
Theano provides a ‘Print’ op to do this.
import numpy
import theano
x = theano.tensor.dvector('x')
x_printed = theano.printing.Print('this is a very important value')(x)
f = theano.function([x], x * 5)
f_with_print = theano.function([x], x_printed * 5)
#this runs the graph without any printing
assert numpy.all( f([1, 2, 3]) == [5, 10, 15])
#this runs the graph with the message, and value printed
assert numpy.all( f_with_print([1, 2, 3]) == [5, 10, 15])
this is a very important value __str__ = [ 1. 2. 3.]
Since Theano runs your program in a topological order, you won’t have precise control over the order in which multiple Print() ops are evaluated. For a more precise inspection of what’s being computed where, when, and how, see the discussion “How do I Step through a Compiled Function?”.
Warning
Using this Print Theano Op can prevent some Theano optimizations from being applied. This can also happen with stability optimizations. So if you use Print and see NaNs, try removing the Print ops to check whether they are the cause.
“How do I Print a Graph?” (before or after compilation)¶
Theano provides two functions (theano.pp() and theano.printing.debugprint()) to print a graph to the terminal before or after compilation. These two functions print expression graphs in different ways: pp() is more compact and math-like, debugprint() is more verbose. Theano also provides theano.printing.pydotprint() that creates a png image of the function. You can read about them in printing – Graph Printing and Symbolic Print Statement.
“The Function I Compiled is Too Slow, what’s up?”¶
First, make sure you’re running in FAST_RUN mode. Even though FAST_RUN is the default mode, insist by passing mode='FAST_RUN' to theano.function (or theano.make) or by setting config.mode to FAST_RUN.
Second, try the Theano ProfileMode. This will tell you which Apply nodes, and which ops, are eating up your CPU cycles.
Tips:
 Use the flag floatX=float32 to require type float32 instead of float64; use the Theano constructors matrix(), vector(), ... instead of dmatrix(), dvector(), ..., since the former use your configured floatX (float32 here) while the latter are fixed to float64.
 Check in the profile that there is no Dot op in the post-compilation graph while you are multiplying two matrices of the same type. Dot should be optimized to dot22 when the inputs are matrices of the same type. This can still happen when using floatX=float32 if one of the inputs of the graph is of type float64.
“Why does my GPU function seem to be slow?”¶
When you compile a Theano function and do not get the speedup you expect over the CPU performance of the same code, it is often because some ops are running on the CPU instead of the GPU. If that is the case, you can use assert_no_cpu_op to check whether there is a CPU op in your computational graph. assert_no_cpu_op can take one of three options:
 warn: Raise a warning.
 pdb: Stop with pdb in the computational graph during the compilation.
 raise: Raise an error if there is a CPU op in the computational graph.
It is possible to use this mode by providing the flag in THEANO_FLAGS, such as:
THEANO_FLAGS="floatX=float32,device=gpu,assert_no_cpu_op='raise'" python test.py
But note that this check will not catch all CPU ops; it might miss some.
“How do I Step through a Compiled Function?”¶
You can use MonitorMode to inspect the inputs and outputs of each node being executed when the function is called. The code snippet below shows how to print all inputs and outputs:
from __future__ import print_function
import theano

def inspect_inputs(i, node, fn):
    print(i, node, "input(s) value(s):", [input[0] for input in fn.inputs],
          end='')

def inspect_outputs(i, node, fn):
    print(" output(s) value(s):", [output[0] for output in fn.outputs])

x = theano.tensor.dscalar('x')
f = theano.function([x], [5 * x],
                    mode=theano.compile.MonitorMode(
                        pre_func=inspect_inputs,
                        post_func=inspect_outputs))
f(3)
0 Elemwise{mul,no_inplace}(TensorConstant{5.0}, x) input(s) value(s): [array(5.0), array(3.0)] output(s) value(s): [array(15.0)]
When using these inspect_inputs and inspect_outputs functions with MonitorMode, you should see [potentially a lot of] printed output. Every Apply node will be printed out, along with its position in the graph, the arguments to the functions perform or c_code and the output it computed.
Admittedly, this may be a huge amount of output to read through if you are using big tensors... but you can choose to add logic that would, for instance, print something out only if a certain kind of op were used, at a certain program position, or only if a particular value showed up in one of the inputs or outputs. A typical example is to detect when NaN values are added into computations, which can be achieved as follows:
import numpy
import theano

# This is the current suggested detect_nan implementation to
# show you how it works. That way, you can modify it for your
# needs. If you want exactly this method, you can use
# ``theano.compile.monitormode.detect_nan`` that will always
# contain the current suggested version.
def detect_nan(i, node, fn):
    for output in fn.outputs:
        if (not isinstance(output[0], numpy.random.RandomState) and
                numpy.isnan(output[0]).any()):
            print('*** NaN detected ***')
            theano.printing.debugprint(node)
            print('Inputs : %s' % [input[0] for input in fn.inputs])
            print('Outputs: %s' % [output[0] for output in fn.outputs])
            break

x = theano.tensor.dscalar('x')
f = theano.function([x], [theano.tensor.log(x) * x],
                    mode=theano.compile.MonitorMode(
                        post_func=detect_nan))
f(0)  # log(0) * 0 = -inf * 0 = NaN
*** NaN detected ***
Elemwise{Composite{(log(i0) * i0)}} [id A] ''
 |x [id B]
Inputs : [array(0.0)]
Outputs: [array(nan)]
To help understand what is happening in your graph, you can disable the local_elemwise_fusion and all inplace optimizations. The first is a speed optimization that merges elemwise operations together. This makes it harder to know which particular elemwise causes the problem. The second optimization makes some ops’ outputs overwrite their inputs. So, if an op creates a bad output, you will not be able to see the input that was overwritten in the post_func function. To disable those optimizations (with a Theano version after 0.6rc3), define the MonitorMode like this:
mode = theano.compile.MonitorMode(post_func=detect_nan).excluding(
    'local_elemwise_fusion', 'inplace')
f = theano.function([x], [theano.tensor.log(x) * x],
                    mode=mode)
Note
The Theano flags optimizer_including, optimizer_excluding and optimizer_requiring aren’t used by MonitorMode; they are used only by the default mode. You can’t use the default mode with MonitorMode, as you need to define what you monitor.
To be sure all inputs of the node are available during the call to post_func, you must also disable the garbage collector. Otherwise, the execution of the node can garbage collect inputs that aren’t needed anymore by the Theano function. This can be done with the Theano flag allow_gc=False.
How to Use pdb¶
In the majority of cases, you won’t be executing from the interactive shell but from a set of Python scripts. In such cases, the use of the Python debugger can come in handy, especially as your models become more complex. Intermediate results don’t necessarily have a clear name and you can get exceptions which are hard to decipher, due to the “compiled” nature of the functions.
Consider this example script (“ex.py”):
import theano
import numpy
import theano.tensor as T
a = T.dmatrix('a')
b = T.dmatrix('b')
f = theano.function([a, b], [a * b])
# matrices chosen so dimensions are unsuitable for multiplication
mat1 = numpy.arange(12).reshape((3, 4))
mat2 = numpy.arange(25).reshape((5, 5))
f(mat1, mat2)
This is actually so simple the debugging could be done easily, but it’s for illustrative purposes. As the matrices can’t be multiplied elementwise (unsuitable shapes), we get the following exception:
File "ex.py", line 14, in <module>
f(mat1, mat2)
File "/u/username/Theano/theano/compile/function_module.py", line 451, in __call__
File "/u/username/Theano/theano/gof/link.py", line 271, in streamline_default_f
File "/u/username/Theano/theano/gof/link.py", line 267, in streamline_default_f
File "/u/username/Theano/theano/gof/cc.py", line 1049, in execute ValueError: ('Input dimension mismatch. (input[0].shape[0] = 3, input[1].shape[0] = 5)', Elemwise{mul,no_inplace}(a, b), Elemwise{mul,no_inplace}(a, b))
The call stack contains some useful information to trace back the source of the error. There’s the script where the compiled function was called – but if you’re using (improperly parameterized) prebuilt modules, the error might originate from ops in these modules, not this script. The last line tells us about the op that caused the exception. In this case it’s a “mul” involving variables with names “a” and “b”. But suppose we instead had an intermediate result to which we hadn’t given a name.
After learning a few things about the graph structure in Theano, we can use the Python debugger to explore the graph, and then we can get runtime information about the error. Matrix dimensions, especially, are useful to pinpoint the source of the error. In the printout, there are also 2 of the 4 dimensions of the matrices involved, but for the sake of example say we’d need the other dimensions to pinpoint the error. First, we relaunch with the debugger module and run the program with “c”:
python -m pdb ex.py
> /u/username/experiments/doctmp1/ex.py(1)<module>()
> import theano
(Pdb) c
Then we get back the above error printout, but the interpreter breaks in that state. Useful commands here are
 “up” and “down” (to move up and down the call stack),
 “l” (to print code around the line in the current stack position),
 “p variable_name” (to print the string representation of ‘variable_name’),
 “p dir(object_name)”, using the Python dir() function to print the list of an object’s members
Here, for example, I do “up”, and a simple “l” shows me there’s a local variable “node”. This is the “node” from the computation graph, so by following the “node.inputs”, “node.owner” and “node.outputs” links I can explore around the graph.
That graph is purely symbolic (no data, just symbols to manipulate it abstractly). To get information about the actual parameters, you explore the “thunk” objects, which bind the storage for the inputs (and outputs) with the function itself (a “thunk” is a concept related to closures). Here, to get the current node’s first input’s shape, you’d therefore do “p thunk.inputs[0][0].shape”, which prints out “(3, 4)”.
Dumping a Function to help debug¶
If you are reading this, there is a high chance that you emailed our mailing list and we asked you to read this section. This section explains how to dump all the parameters passed to theano.function(). This is useful to help us reproduce a problem during compilation and it doesn’t require you to make a self-contained example.
For this to work, we need to be able to import the code for all Ops in the graph. So if you create your own Op, we will need its code. Otherwise, we won’t be able to unpickle it. We already have all the Ops from Theano and Pylearn2.
# Replace this line:
theano.function(...)
# with
theano.function_dump(filename, ...)
# Where filename is a string to a file that we will write to.
Then send us filename.

class theano.tests.breakpoint.PdbBreakpoint(name)[source]¶
This is an identity-like op with the side effect of enforcing a conditional breakpoint, inside a Theano function, based on a symbolic scalar condition.
Parameters: name (String) – name of the conditional breakpoint. To be printed when the breakpoint is activated.
Note: WARNING. At least one of the outputs of the op must be used, otherwise the op will be removed from the Theano graph due to its outputs being unused.
Note: WARNING. Employing the function inside a Theano graph can prevent Theano from applying certain optimizations to improve performance, reduce memory consumption and/or reduce numerical instability.
Detailed explanation: As of 2014-12-01 the PdbBreakpoint op is not known by any optimization. Setting a PdbBreakpoint op in the middle of a pattern that is usually optimized out will block the optimization.
Example:
import theano
import theano.tensor as T
from theano.tests.breakpoint import PdbBreakpoint

input = T.fvector()
target = T.fvector()

# Mean squared error between input and target
mse = (input - target) ** 2

# Conditional breakpoint to be activated if the total MSE is higher
# than 100. The breakpoint will monitor the inputs, targets as well
# as the individual error values
breakpointOp = PdbBreakpoint("MSE too high")
condition = T.gt(mse.sum(), 100)
mse, monitored_input, monitored_target = breakpointOp(condition, mse,
                                                      input, target)

# Compile the theano function
fct = theano.function([input, target], mse)

# Use the function
print fct([10, 0], [10, 5]) # Will NOT activate the breakpoint
print fct([0, 0], [10, 5])  # Will activate the breakpoint
Dealing with NaNs¶
A model yielding NaNs or Infs is quite common if some components of the model are not set properly. NaNs are hard to deal with: sometimes they are caused by a bug or error in the code, sometimes by the numerical stability of your computational environment (library versions, etc.), and sometimes they relate to your algorithm itself. Here we outline common issues that cause a model to yield NaNs, and provide tools to diagnose them.
Check Hyperparameters and Weight Initialization¶
Most frequently, the cause is that some of the hyperparameters, especially learning rates, are set incorrectly. A high learning rate can blow up your whole model into NaN outputs even within one epoch of training. So the first and easiest solution is to try lowering it. Keep halving your learning rate until you start to get reasonable output values.
Other hyperparameters may also play a role. For example, do your training algorithms involve regularization terms? If so, are their corresponding penalties set reasonably? Search a wider hyperparameter space with a few (one or two) training epochs each to see if the NaNs disappear.
Some models can be very sensitive to the initialization of weight vectors. If those weights are not initialized in a proper range, then it is not surprising that the model ends up yielding NaNs.
Run in NanGuardMode, DebugMode, or MonitorMode¶
If adjusting hyperparameters doesn't work for you, you can still get help from
Theano's NanGuardMode. Change the mode of your Theano function to NanGuardMode
and run it again. NanGuardMode will monitor all input/output variables in
each node, and raise an error if NaNs are detected. For how to use
NanGuardMode, please refer to nanguardmode.
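What NanGuardMode does can be sketched in a few lines of plain Python (a conceptual analogue, not the Theano API): wrap each node's computation and raise as soon as a NaN appears, so you learn which step produced it.

```python
import math

# Conceptual analogue of NanGuardMode, not Theano code: check a node's
# output for NaN and raise an error naming the offending node.
def nan_guard(name, fn, *args):
    out = fn(*args)
    if isinstance(out, float) and math.isnan(out):
        raise AssertionError("NaN detected in node %r" % name)
    return out

assert nan_guard("add", lambda a, b: a + b, 1.0, 2.0) == 3.0

caught = None
try:
    # inf - inf is NaN, so the guard fires and names the node.
    nan_guard("sub", lambda a, b: a - b, float("inf"), float("inf"))
except AssertionError as err:
    caught = str(err)
```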
DebugMode can also help. Run your code in DebugMode with the flag
mode=DebugMode,DebugMode.check_py=False
. This will give you a clue about which
op is causing the problem, and then you can inspect that op in more detail. For
details on using DebugMode, please refer to debugmode.
Theano’s MonitorMode provides another helping hand. It can be used to step through the execution of a function. You can inspect the inputs and outputs of each node being executed when the function is called. For how to use that, please check “How do I Step through a Compiled Function?”.
Numerical Stability¶
After you have located the op which causes the problem, it may turn out that the NaNs yielded by that op are related to numerical issues. For example, computing log(p(x)) may result in NaNs for nodes that have learned to yield a low probability p(x) for some input x.
Cuda Specific Option¶
The Theano flag nvcc.fastmath=True can generate NaNs. Don't set
this flag while debugging NaNs.
Profiling Theano function¶
Note
This method replaces the old ProfileMode. Do not use ProfileMode anymore.
Besides checking for errors, another important task is to profile your code in terms of speed and/or memory usage.
You can profile your functions using either of the following two options:
 Use the Theano flag config.profile to enable profiling.
   To enable the memory profiler, use the Theano flag config.profile_memory in addition to config.profile.
   Moreover, to enable profiling of the Theano optimization phase, use the Theano flag config.profile_optimizer in addition to config.profile.
   You can also use the Theano flags profiling.n_apply, profiling.n_ops and profiling.min_memory_size to modify the quantity of information printed.
 Pass the argument profile=True to theano.function, and then call f.profile.print_summary() for a single function.
   Use this option when you want to profile not all the functions but one or more specific function(s).
   You can also combine the profiles of many functions.
The profiler will output one profile per Theano function, plus a profile that is the sum of the printed profiles. Each profile contains 4 sections: global info, class info, Ops info and Apply node info.
In the global section, the “Message” is the name of the Theano
function. theano.function() has an optional parameter name
that
defaults to None. Change it to something else to help you profile many
Theano functions. In that section, we also see the number of times the
function was called (1) and the total time spent in all those
calls. The time spent in Function.fn.__call__ and in thunks is useful
to understand Theano overhead.
Also, we see the time spent in the two parts of the compilation process: optimization (modify the graph to make it more stable/faster) and the linking (compile c code and make the Python callable returned by function).
The class, Ops and Apply nodes sections present the same information: information about the Apply nodes that ran. The Ops section takes the information from the Apply section and merges the Apply nodes that have exactly the same op. If two Apply nodes in the graph have two Ops that compare equal, they will be merged. Some Ops, like Elemwise, will not compare equal if their parameters differ (the scalar operation being executed). So the class section will merge more Apply nodes than the Ops section.
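The merging of Apply rows into per-Op rows can be sketched with a toy aggregator (plain Python, not the Theano profiler): each "Apply" call is timed individually, then rows that share the same op name are summed.

```python
import time
from collections import defaultdict

# Toy aggregator, not Theano's profiler: time each "Apply" call and
# merge the timings of Applies that share the same op name.
def profile_applies(applies):
    stats = defaultdict(lambda: [0, 0.0])   # op -> [#call, total time]
    for op_name, fn in applies:
        t0 = time.perf_counter()
        fn()
        stats[op_name][0] += 1
        stats[op_name][1] += time.perf_counter() - t0
    return stats

# Three Apply nodes but two distinct ops -> two rows in the Ops section.
stats = profile_applies([("add", lambda: 1 + 2),
                         ("add", lambda: 3 + 4),
                         ("mul", lambda: 2 * 3)])
```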
Note that the profile also shows which Ops were running a C implementation.
Developers wishing to optimize the performance of their graph should focus on the worst offending Ops and Apply nodes – either by optimizing an implementation, providing a missing C implementation, or by writing a graph optimization that eliminates the offending Op altogether. You should strongly consider emailing one of our lists about your issue before spending too much time on this.
Here is an example output when we disable some Theano optimizations to give you a better idea of the difference between sections. With all optimizations enabled, there would be only one op left in the graph.
Note
To profile the peak memory usage on the GPU you need to do:
* In the file theano/sandbox/cuda/cuda_ndarray.cu, set the macro COMPUTE_GPU_MEM_USED to 1.
* Then call theano.sandbox.cuda.theano_allocated(). It returns a tuple of two ints: the first is the current GPU memory allocated by Theano; the second is the peak GPU memory that Theano has allocated.
Do not leave this enabled all the time, as it slows down memory allocation and freeing. Since that also slows down the computation, it will distort speed profiling, so don't use both at the same time.
To run the example:
THEANO_FLAGS=optimizer_excluding=fusion:inplace,profile=True python doc/tutorial/profiling_example.py
The output:
Function profiling
==================
Message: None
Time in 1 calls to Function.__call__: 5.698204e-05s
Time in Function.fn.__call__: 1.192093e-05s (20.921%)
Time in thunks: 6.198883e-06s (10.879%)
Total compile time: 3.642474e+00s
Theano Optimizer time: 7.326508e-02s
Theano validate time: 3.712177e-04s
Theano Linker time (includes C, CUDA code generation/compiling): 9.584920e-01s

Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
 100.0%  100.0%  0.000s  2.07e-06s  C  3  3  <class 'theano.tensor.elemwise.Elemwise'>
 ... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)

Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
  65.4%   65.4%  0.000s  2.03e-06s  C  2  2  Elemwise{add,no_inplace}
  34.6%  100.0%  0.000s  2.15e-06s  C  1  1  Elemwise{mul,no_inplace}
 ... (remaining 0 Ops account for 0.00%(0.00s) of the runtime)

Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name>
  50.0%   50.0%  0.000s  3.10e-06s  1  0  Elemwise{add,no_inplace}(x, y)
  34.6%   84.6%  0.000s  2.15e-06s  1  2  Elemwise{mul,no_inplace}(TensorConstant{(1,) of 2.0}, Elemwise{add,no_inplace}.0)
  15.4%  100.0%  0.000s  9.54e-07s  1  1  Elemwise{add,no_inplace}(Elemwise{add,no_inplace}.0, z)
 ... (remaining 0 Apply instances account for 0.00%(0.00s) of the runtime)
Further readings¶
Graph Structures¶
Debugging or profiling code written in Theano is not that simple if you do not know what goes on under the hood. This chapter is meant to introduce you to a required minimum of the inner workings of Theano.
The first step in writing Theano code is to write down all mathematical relations using symbolic placeholders (variables). When writing down these expressions you use operations like +, -, **, sum(), tanh(). All these are represented internally as ops.
An op represents a certain computation on some type of inputs
producing some type of output. You can see it as a function definition
in most programming languages.
Theano represents symbolic mathematical computations as graphs. These graphs are composed of interconnected Apply, Variable and Op nodes. An Apply node represents the application of an op to some variables. It is important to draw the distinction between the definition of a computation represented by an op and its application to some actual data, which is represented by the Apply node. Furthermore, data types are represented by Type instances. Here is a piece of code and a diagram showing the structure built by that piece of code. This should help you understand how these pieces fit together:
Code
import theano.tensor as T
x = T.dmatrix('x')
y = T.dmatrix('y')
z = x + y
Diagram
Arrows represent references to the Python objects pointed at. The blue box is an Apply node. Red boxes are Variable nodes. Green circles are Ops. Purple boxes are Types.
When we create Variables and then Apply Ops to them to make more Variables, we build a bipartite, directed, acyclic graph. Variables point to the Apply nodes representing the function application producing them via their owner field. These Apply nodes point in turn to their input and output Variables via their inputs and outputs fields. (Apply instances also contain a list of references to their outputs, but those pointers don't count in this graph.)
The owner field of both x and y point to None because they are not the result of another computation. If one of them were the result of another computation, its owner field would point to another blue box like z does, and so on. Note that the Apply instance's outputs points to z, and z.owner points back to the Apply instance.
Traversing the graph¶
The graph can be traversed starting from outputs (the result of some computation) down to its inputs using the owner field. Take for example the following code:
>>> import theano
>>> x = theano.tensor.dmatrix('x')
>>> y = x * 2.
If you enter type(y.owner) you get <class 'theano.gof.graph.Apply'>, which is the apply node that connects the op and the inputs to get this output. You can now print the name of the op that is applied to get y:
>>> y.owner.op.name
'Elemwise{mul,no_inplace}'
Hence, an elementwise multiplication is used to compute y. This multiplication is done between the inputs:
>>> len(y.owner.inputs)
2
>>> y.owner.inputs[0]
x
>>> y.owner.inputs[1]
DimShuffle{x,x}.0
Note that the second input is not 2 as we would have expected. This is because 2 was first broadcast to a matrix of the same shape as x. This is done by using the op DimShuffle:
>>> type(y.owner.inputs[1])
<class 'theano.tensor.var.TensorVariable'>
>>> type(y.owner.inputs[1].owner)
<class 'theano.gof.graph.Apply'>
>>> y.owner.inputs[1].owner.op
<theano.tensor.elemwise.DimShuffle object at 0x106fcaf10>
>>> y.owner.inputs[1].owner.inputs
[TensorConstant{2.0}]
Starting from this graph structure it is easier to understand how automatic differentiation proceeds and how the symbolic relations can be optimized for performance or stability.
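The owner/inputs traversal above can be mimicked on plain Python objects (a toy mirror of the structure, not Theano's real classes):

```python
# Toy mirror of Theano's graph structure, not the real classes:
# Variables know their `owner` Apply node, and Apply nodes know their
# `op`, `inputs` and `outputs`.
class Apply(object):
    def __init__(self, op, inputs):
        self.op, self.inputs = op, inputs
        self.outputs = [Variable(owner=self, index=0)]

class Variable(object):
    def __init__(self, name=None, owner=None, index=None):
        self.name, self.owner, self.index = name, owner, index

def ancestors(var, acc=None):
    """Walk from an output back toward the inputs through `owner`."""
    acc = [] if acc is None else acc
    acc.append(var)
    if var.owner is not None:
        for inp in var.owner.inputs:
            ancestors(inp, acc)
    return acc

x = Variable(name="x")
two = Variable(name="two")
y = Apply("mul", [x, two]).outputs[0]     # schematically, y = x * 2

assert y.owner.op == "mul"
assert y.owner.outputs[y.index] is y      # the index invariant
```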
Graph Structures¶
The following section outlines each type of structure that may be used in a Theano-built computation graph. The following structures are explained: Apply, Constant, Op, Variable and Type.
An Apply node is a type of internal node used to represent a computation graph in Theano. Unlike Variable nodes, Apply nodes are usually not manipulated directly by the end user. They may be accessed via a Variable's owner field.
An Apply node is typically an instance of the Apply class. It represents the application of an Op on one or more inputs, where each input is a Variable. By convention, each Op is responsible for knowing how to build an Apply node from a list of inputs. Therefore, an Apply node may be obtained from an Op and a list of inputs by calling Op.make_node(*inputs).
Comparing with the Python language, an Apply node is Theano’s version of a function call whereas an Op is Theano’s version of a function definition.
An Apply instance has three important fields:
 op
 An Op that determines the function/transformation being applied here.
 inputs
 A list of Variables that represent the arguments of the function.
 outputs
 A list of Variables that represent the return values of the function.
An Apply instance can be created by calling gof.Apply(op, inputs, outputs)
.
An Op in Theano defines a certain computation on some types of inputs, producing some types of outputs. It is equivalent to a function definition in most programming languages. From a list of input Variables and an Op, you can build an Apply node representing the application of the Op to the inputs.
It is important to understand the distinction between an Op (the definition of a function) and an Apply node (the application of a function). If you were to interpret the Python language using Theano's structures, code like def f(x): ... would produce an Op for f, whereas code like a = f(x) or g(f(4), 5) would produce an Apply node involving the f Op.
A Type in Theano represents a set of constraints on potential data objects. These constraints allow Theano to tailor C code to handle them and to statically optimize the computation graph. For instance, the irow type in the theano.tensor package gives the following constraints on the data the Variables of type irow may contain:
 Must be an instance of numpy.ndarray: isinstance(x, numpy.ndarray)
 Must be an array of 32-bit integers: str(x.dtype) == 'int32'
 Must have a shape of 1xN: len(x.shape) == 2 and x.shape[0] == 1
Knowing these restrictions, Theano may generate C code for addition, etc. that declares the right data types and that contains the right number of loops over the dimensions.
Note that a Theano Type is not equivalent to a Python type or class. Indeed, in Theano, irow and dmatrix both use numpy.ndarray as the underlying type for doing computations and storing data, yet they are different Theano Types. The constraints set by dmatrix are:
 Must be an instance of numpy.ndarray: isinstance(x, numpy.ndarray)
 Must be an array of 64-bit floating point numbers: str(x.dtype) == 'float64'
 Must have a shape of MxN, no restriction on M or N: len(x.shape) == 2
These restrictions are different from those of irow
which are listed above.
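The two constraint lists can be written out as plain predicates (a sketch using NumPy directly, not Theano's Type machinery):

```python
import numpy

# Plain predicates mirroring the irow and dmatrix constraint lists
# above (a sketch, not Theano's Type classes).
def satisfies_irow(x):
    return (isinstance(x, numpy.ndarray)
            and str(x.dtype) == 'int32'
            and len(x.shape) == 2 and x.shape[0] == 1)

def satisfies_dmatrix(x):
    return (isinstance(x, numpy.ndarray)
            and str(x.dtype) == 'float64'
            and len(x.shape) == 2)

row = numpy.zeros((1, 4), dtype='int32')      # a 1xN int32 array
mat = numpy.zeros((3, 4), dtype='float64')    # an MxN float64 array
```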
There are cases in which a Type can fully correspond to a Python type, such as the double Type we will define here, which corresponds to Python's float. But, it's good to know that this is not necessarily the case. Unless specified otherwise, when we say "Type" we mean a Theano Type.
A Variable is the main data structure you work with when using Theano. The symbolic inputs that you operate on are Variables and what you get from applying various Ops to these inputs are also Variables. For example, when I type
>>> import theano
>>> x = theano.tensor.ivector()
>>> y = -x
x and y are both Variables, i.e. instances of the Variable class. The Type of both x and y is theano.tensor.ivector.
Unlike x, y is a Variable produced by a computation (in this case, it is the negation of x). y is the Variable corresponding to the output of the computation, while x is the Variable corresponding to its input. The computation itself is represented by another type of node, an Apply node, and may be accessed through y.owner.
More specifically, a Variable is a basic structure in Theano that represents a datum at a certain point in computation. It is typically an instance of the class Variable or one of its subclasses.
A Variable r contains four important fields:
 type
 a Type defining the kind of value this Variable can hold in computation.
 owner
 this is either None or an Apply node of which the Variable is an output.
 index
 the integer such that owner.outputs[index] is r (ignored if owner is None)
 name
 a string to use in pretty-printing and debugging.
Variable has one special subclass: Constant.
A Constant is a Variable with one extra field, data (only settable once). When used in a computation graph as the input of an Op application, it is assumed that said input will always take the value contained in the constant’s data field. Furthermore, it is assumed that the Op will not under any circumstances modify the input. This means that a constant is eligible to participate in numerous optimizations: constant inlining in C code, constant folding, etc.
A constant does not need to be specified in a function's list of inputs. In fact, doing so will raise an exception.
Graph Structures Extension¶
When we start the compilation of a Theano function, we compute some extra information. This section describes a portion of the information that is made available. Not everything is described, so email theano-dev if you need something that is missing.
The graph gets cloned at the start of compilation, so modifications done during compilation won’t affect the user graph.
Each variable receives a new field called clients. It is a list with references to every place in the graph where this variable is used. If its length is 0, it means the variable isn’t used. Each place where it is used is described by a tuple of 2 elements. There are two types of pairs:
 The first element is an Apply node.
 The first element is the string “output”. It means the function outputs this variable.
In both types of pairs, the second element of the tuple is an index, such that var.clients[*][0].inputs[index] or fgraph.outputs[index] is that variable.
>>> import theano
>>> v = theano.tensor.vector()
>>> f = theano.function([v], (v+1).sum())
>>> theano.printing.debugprint(f)
Sum{acc_dtype=float64} [id A] ''   1
 |Elemwise{add,no_inplace} [id B] ''   0
   |TensorConstant{(1,) of 1.0} [id C]
   |<TensorType(float64, vector)> [id D]
>>> # Sorted list of all nodes in the compiled graph.
>>> topo = f.maker.fgraph.toposort()
>>> topo[0].outputs[0].clients
[(Sum{acc_dtype=float64}(Elemwise{add,no_inplace}.0), 0)]
>>> topo[1].outputs[0].clients
[('output', 0)]
>>> # An internal variable
>>> var = topo[0].outputs[0]
>>> client = var.clients[0]
>>> client
(Sum{acc_dtype=float64}(Elemwise{add,no_inplace}.0), 0)
>>> type(client[0])
<class 'theano.gof.graph.Apply'>
>>> assert client[0].inputs[client[1]] is var
>>> # An output of the graph
>>> var = topo[1].outputs[0]
>>> client = var.clients[0]
>>> client
('output', 0)
>>> assert f.maker.fgraph.outputs[client[1]] is var
Automatic Differentiation¶
Having the graph structure, computing automatic differentiation is simple. The only thing tensor.grad() has to do is to traverse the graph from the outputs back towards the inputs through all apply nodes (apply nodes are those that define which computations the graph does). For each such apply node, its op defines how to compute the gradient of the node's outputs with respect to its inputs. Note that if an op does not provide this information, it is assumed that the gradient is not defined. Using the chain rule, these gradients can be composed in order to obtain the expression of the gradient of the graph's output with respect to the graph's inputs.
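The traversal can be sketched on a toy graph (plain Python dictionaries, not tensor.grad() itself): each op contributes its local gradient, and the chain rule multiplies them along the walk from output to input.

```python
# Toy reverse traversal, not Theano's implementation. Variables are
# dicts with a 'value' and an 'owner' (op name plus input variables);
# leaves have owner None, exactly like the graphs described above.
def grad(var, wrt, upstream=1.0):
    """Accumulate d(var)/d(wrt) by walking `owner` links."""
    if var is wrt:
        return upstream
    if var['owner'] is None:
        return 0.0
    op, inputs = var['owner']
    if op == 'add':                        # d(a+b)/da = d(a+b)/db = 1
        return sum(grad(i, wrt, upstream) for i in inputs)
    if op == 'mul':                        # d(a*b)/da = b, d(a*b)/db = a
        a, b = inputs
        return (grad(a, wrt, upstream * b['value'])
                + grad(b, wrt, upstream * a['value']))
    # An op without gradient info: the gradient is not defined.
    raise ValueError("no gradient defined for op %r" % op)

x = {'owner': None, 'value': 3.0}
y = {'owner': None, 'value': 4.0}
z = {'owner': ('mul', (x, y)), 'value': 12.0}   # z = x * y
w = {'owner': ('add', (z, x)), 'value': 15.0}   # w = z + x
```

Here dw/dx composes the mul and add rules: dw/dx = y + 1 = 5.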
A following section of this tutorial will examine the topic of differentiation in greater detail.
Optimizations¶
When compiling a Theano function, what you give to the
theano.function
is actually a graph
(starting from the output variables you can traverse the graph up to
the input variables). While this graph structure shows how to compute
the output from the input, it also offers the possibility to improve the
way this computation is carried out. The way optimizations work in
Theano is by identifying and replacing certain patterns in the graph
with other specialized patterns that produce the same results but are either
faster or more stable. Optimizations can also detect
identical subgraphs and ensure that the same values are not computed
twice or reformulate parts of the graph to a GPU specific version.
For example, one (simple) optimization that Theano uses is to replace the pattern x*y/y by x.
Further information regarding the optimization process and the specific optimizations that are applicable is respectively available in the library and on the entrance page of the documentation.
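Pattern replacement can be sketched on a toy expression tree (plain tuples, not Theano's optimizer); here an illustrative rewrite removes additions of zero.

```python
# Toy pattern-replacement pass, not Theano's optimizer: expressions
# are nested tuples like ('add', a, b); the rewrite turns x + 0 into x.
def simplify(node):
    if isinstance(node, tuple) and node[0] == 'add':
        _, a, b = node
        a, b = simplify(a), simplify(b)   # simplify children first
        if b == 0:
            return a          # x + 0  ->  x
        if a == 0:
            return b          # 0 + x  ->  x
        return ('add', a, b)
    return node

expr = ('add', ('add', 'x', 0), 0)
# simplify(expr) collapses the whole tree down to 'x'.
```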
Example
Symbolic programming involves a change of paradigm: it will become clearer as we apply it. Consider the following example of optimization:
>>> import theano
>>> a = theano.tensor.vector("a") # declare symbolic variable
>>> b = a + a ** 10 # build symbolic expression
>>> f = theano.function([a], b) # compile function
>>> print(f([0, 1, 2])) # prints `array([0,2,1026])`
[ 0. 2. 1026.]
>>> theano.printing.pydotprint(b, outfile="./pics/symbolic_graph_unopt.png", var_with_name_simple=True)
The output file is available at ./pics/symbolic_graph_unopt.png
>>> theano.printing.pydotprint(f, outfile="./pics/symbolic_graph_opt.png", var_with_name_simple=True)
The output file is available at ./pics/symbolic_graph_opt.png
We used theano.printing.pydotprint()
to visualize the optimized graph
(right), which is much more compact than the unoptimized graph (left).
(Figure: unoptimized graph on the left, optimized graph on the right.)
Loading and Saving¶
Python's standard way of saving class instances and reloading them is the pickle mechanism. Many Theano objects can be serialized (and deserialized) by pickle; however, a limitation of pickle is that it does not save the code or data of a class along with the instance of the class being serialized. As a result, reloading objects created by a previous version of a class can be really problematic.
Thus, you will want to consider different mechanisms depending on the amount of time you anticipate between saving and reloading. For short-term storage (such as temp files and network transfers), pickling of the Theano objects or classes is possible. For longer-term storage (such as saving models from an experiment) you should not rely on pickled Theano objects; we recommend loading and saving the underlying shared variables as you would in the course of any other Python program.
The Basics of Pickling¶
The two modules pickle and cPickle have the same functionalities, but cPickle, coded in C, is much faster.
>>> from six.moves import cPickle
You can serialize (or save, or pickle) objects to a file with cPickle.dump:
>>> f = open('obj.save', 'wb')
>>> cPickle.dump(my_obj, f, protocol=cPickle.HIGHEST_PROTOCOL)
>>> f.close()
Note
If you want your saved object to be stored efficiently, don't forget to use cPickle.HIGHEST_PROTOCOL. The resulting file can be dozens of times smaller than with the default protocol.
Note
Opening your file in binary mode ('b') is required for portability (especially between Unix and Windows).
To deserialize (or load, or unpickle) a pickled file, use cPickle.load:
>>> f = open('obj.save', 'rb')
>>> loaded_obj = cPickle.load(f)
>>> f.close()
You can pickle several objects into the same file, and load them all (in the same order):
>>> f = open('objects.save', 'wb')
>>> for obj in [obj1, obj2, obj3]:
... cPickle.dump(obj, f, protocol=cPickle.HIGHEST_PROTOCOL)
>>> f.close()
Then:
>>> f = open('objects.save', 'rb')
>>> loaded_objects = []
>>> for i in range(3):
... loaded_objects.append(cPickle.load(f))
>>> f.close()
For more details about pickle’s usage, see Python documentation.
Short-Term Serialization¶
If you are confident that the class instance you are serializing will be deserialized by a compatible version of the code, pickling the whole model is an adequate solution. It would be the case, for instance, if you are saving models and reloading them during the same execution of your program, or if the class you’re saving has been really stable for a while.
You can control what pickle will save from your object, by defining a __getstate__ method, and similarly __setstate__.
This will be especially useful if, for instance, your model class contains a link to the data set currently in use, that you probably don’t want to pickle along every instance of your model.
For instance, you can define functions along the lines of:
def __getstate__(self):
    state = dict(self.__dict__)
    del state['training_set']
    return state

def __setstate__(self, d):
    self.__dict__.update(d)
    self.training_set = cPickle.load(open(self.training_set_file, 'rb'))
Robust Serialization¶
This type of serialization uses some helper functions particular to Theano. It serializes the object using Python's pickling protocol, but any ndarray or CudaNdarray objects contained within the object are saved separately as NPY files. These NPY files and the pickled file are all saved together in a single ZIP file.
The main advantage of this approach is that you don’t even need Theano installed in order to look at the values of shared variables that you pickled. You can just load the parameters manually with numpy.
import numpy
numpy.load('model.zip')
This approach could be beneficial if you are sharing your model with people who might not have Theano installed, who are using a different Python version, or if you are planning to save your model for a long time (in which case version mismatches might make it difficult to unpickle objects).
See theano.misc.pkl_utils.dump()
and theano.misc.pkl_utils.load()
.
Long-Term Serialization¶
If the implementation of the class you want to save is quite unstable, for instance if functions are created or removed or class members are renamed, you should save and load only the immutable (and necessary) parts of your class.
You can do that by defining __getstate__ and __setstate__ functions as above, but defining the attributes you want to save rather than the ones you don't.
For instance, if the only parameters you want to save are a weight matrix W and a bias b, you can define:
def __getstate__(self):
    return (self.W, self.b)

def __setstate__(self, state):
    W, b = state
    self.W = W
    self.b = b
If at some point in time W is renamed to weights and b to bias, the older pickled files will still be usable, if you update these functions to reflect the change in name:
def __getstate__(self):
    return (self.weights, self.bias)

def __setstate__(self, state):
    W, b = state
    self.weights = W
    self.bias = b
For more information on advanced use of pickle and its internals, see Python's pickle documentation.
PyCUDA/CUDAMat/Gnumpy compatibility¶
PyCUDA¶
Currently, PyCUDA and Theano have different objects to store GPU data. The two implementations do not support the same set of features. Theano’s implementation is called CudaNdarray and supports strides. It also only supports the float32 dtype. PyCUDA’s implementation is called GPUArray and doesn’t support strides. However, it can deal with all NumPy and CUDA dtypes.
We are currently working on a common base object for both that will also mimic NumPy. Until this is ready, here is some information on how to use both objects in the same script.
You can use the theano.misc.pycuda_utils module to convert GPUArrays to and from CudaNdarrays. The functions to_cudandarray(x, copyif=False) and to_gpuarray(x) return a new object that occupies the same memory space as the original; otherwise they raise a ValueError. Because GPUArrays don't support strides, if the CudaNdarray is strided, we could copy it to have a non-strided copy. The resulting GPUArray won't share the same memory region. If you want this behavior, set copyif=True in to_gpuarray.
You can use PyCUDA to compile CUDA functions that work directly on CudaNdarrays. Here is an example from the file theano/misc/tests/test_pycuda_theano_simple.py:
import sys
import numpy
import theano
import theano.sandbox.cuda as cuda_ndarray
import theano.misc.pycuda_init
import pycuda
import pycuda.driver as drv
import pycuda.gpuarray
def test_pycuda_theano():
    """Simple example with pycuda function and Theano CudaNdarray object."""
    from pycuda.compiler import SourceModule
    mod = SourceModule("""
    __global__ void multiply_them(float *dest, float *a, float *b)
    {
        const int i = threadIdx.x;
        dest[i] = a[i] * b[i];
    }
    """)
    multiply_them = mod.get_function("multiply_them")

    a = numpy.random.randn(100).astype(numpy.float32)
    b = numpy.random.randn(100).astype(numpy.float32)

    # Test with Theano object
    ga = cuda_ndarray.CudaNdarray(a)
    gb = cuda_ndarray.CudaNdarray(b)
    dest = cuda_ndarray.CudaNdarray.zeros(a.shape)
    multiply_them(dest, ga, gb,
                  block=(400, 1, 1), grid=(1, 1))
    assert (numpy.asarray(dest) == a * b).all()
You can use a GPU function compiled with PyCUDA in a Theano op:
import numpy, theano
import theano.misc.pycuda_init
from pycuda.compiler import SourceModule
import theano.sandbox.cuda as cuda
class PyCUDADoubleOp(theano.Op):
    __props__ = ()

    def make_node(self, inp):
        inp = cuda.basic_ops.gpu_contiguous(
            cuda.basic_ops.as_cuda_ndarray_variable(inp))
        assert inp.dtype == "float32"
        return theano.Apply(self, [inp], [inp.type()])

    def make_thunk(self, node, storage_map, _, _2):
        mod = SourceModule("""
        __global__ void my_fct(float * i0, float * o0, int size) {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i < size) {
                o0[i] = i0[i] * 2;
            }
        }""")
        pycuda_fct = mod.get_function("my_fct")
        inputs = [storage_map[v] for v in node.inputs]
        outputs = [storage_map[v] for v in node.outputs]

        def thunk():
            z = outputs[0]
            if z[0] is None or z[0].shape != inputs[0][0].shape:
                z[0] = cuda.CudaNdarray.zeros(inputs[0][0].shape)
            grid = (int(numpy.ceil(inputs[0][0].size / 512.)), 1)
            pycuda_fct(inputs[0][0], z[0], numpy.intc(inputs[0][0].size),
                       block=(512, 1, 1), grid=grid)
        thunk.lazy = False
        return thunk
CUDAMat¶
There are functions for conversion between CUDAMat objects and Theano's CudaNdarray objects. They obey the same principles as Theano's PyCUDA functions and can be found in theano.misc.cudamat_utils.py.
WARNING: There is a peculiar problem associated with stride/shape with those converters. In order to work, the test needs a transpose and reshape...
Gnumpy¶
There are conversion functions between Gnumpy garray objects and Theano CudaNdarray objects. They are also similar to Theano's PyCUDA functions and can be found in theano.misc.gnumpy_utils.py.
Understanding Memory Aliasing for Speed and Correctness¶
The aggressive reuse of memory is one of the ways through which Theano makes code fast, and it is important for the correctness and speed of your program that you understand how Theano might alias buffers.
This section describes the principles based on which Theano handles memory, and explains when you might want to alter the default behaviour of some functions and methods for faster performance.
The Memory Model: Two Spaces¶
There are some simple principles that guide Theano’s handling of memory. The main idea is that there is a pool of memory managed by Theano, and Theano tracks changes to values in that pool.
 Theano manages its own memory space, which typically does not overlap with the memory of normal Python variables that non-Theano code creates.
 Theano functions only modify buffers that are in Theano’s memory space.
 Theano’s memory space includes the buffers allocated to store shared variables and the temporaries used to evaluate functions.
 Physically, Theano’s memory space may be spread across the host, one or more GPU devices, and in the future may even include objects on a remote machine.
 The memory allocated for a shared variable buffer is unique: it is never aliased to another shared variable.
 Theano’s managed memory is constant while Theano functions are not running and Theano’s library code is not running.
 The default behaviour of a function is to return user-space values for outputs, and to expect user-space values for inputs.
The distinction between Theano-managed memory and user-managed memory can be broken down by some Theano functions (e.g. shared, get_value and the constructors for In and Out) by using a borrow=True flag. This can make those methods faster (by avoiding copy operations) at the expense of risking subtle bugs in the overall program (by aliasing memory).
The rest of this section is aimed at helping you to understand when it is safe to use the borrow=True argument and reap the benefits of faster code.
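The risk that borrow=True introduces is ordinary memory aliasing. As a plain NumPy sketch (no Theano involved), two names bound to the same buffer see each other’s writes, which is exactly the kind of subtle bug that aliased buffers can produce:

```python
import numpy as np

a = np.zeros(4)
b = a               # 'b' aliases the same buffer as 'a': no copy is made
b[0] = 99.0

# A write through one name is visible through the other.
assert a[0] == 99.0

c = a.copy()        # an explicit copy breaks the aliasing
c[1] = 7.0
assert a[1] == 0.0  # 'a' is unaffected
```

Borrowing trades away this isolation (the implicit copy) in exchange for speed.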
Borrowing when Constructing Function Objects¶
A borrow argument can also be provided to the In and Out objects that control how theano.function handles its arguments and return values.
import theano, theano.tensor
x = theano.tensor.matrix()
y = 2 * x
f = theano.function([theano.In(x, borrow=True)], theano.Out(y, borrow=True))
Borrowing an input means that Theano will treat the argument you provide as if it were part of Theano’s pool of temporaries. Consequently, your input may be reused as a buffer (and overwritten!) during the computation of other variables in the course of evaluating that function (e.g. f).
Borrowing an output means that Theano will not insist on allocating a fresh output buffer every time you call the function. It will possibly reuse the same one as on a previous call, and overwrite the old content. Consequently, it may overwrite old return values through side-effect. Those return values may also be overwritten in the course of evaluating another compiled function (for example, the output may be aliased to a shared variable). So be careful to use a borrowed return value right away before calling any more Theano functions.
The default is of course to not borrow internal results.
It is also possible to pass a return_internal_type=True flag to the Out variable, which has the same interpretation as the return_internal_type flag to the shared variable’s get_value function. Unlike get_value(), the combination of return_internal_type=True and borrow=True arguments to Out() is not guaranteed to avoid copying an output value. They are just hints that give more flexibility to the compilation and optimization of the graph.
For GPU graphs, this borrowing can have a major speed impact. See the following code:
from theano import function, config, shared, sandbox, tensor, Out
import numpy
import time

vlen = 10 * 30 * 768  # 10 x # cores x # threads per core
iters = 1000

rng = numpy.random.RandomState(22)
x = shared(numpy.asarray(rng.rand(vlen), config.floatX))
f1 = function([], sandbox.cuda.basic_ops.gpu_from_host(tensor.exp(x)))
f2 = function([],
              Out(sandbox.cuda.basic_ops.gpu_from_host(tensor.exp(x)),
                  borrow=True))
t0 = time.time()
for i in range(iters):
    r = f1()
t1 = time.time()
no_borrow = t1 - t0
t0 = time.time()
for i in range(iters):
    r = f2()
t1 = time.time()
print(
    "Looping %s times took %s seconds without borrow "
    "and %s seconds with borrow" % (iters, no_borrow, (t1 - t0))
)
if numpy.any([isinstance(x.op, tensor.Elemwise) and
              ('Gpu' not in type(x.op).__name__)
              for x in f1.maker.fgraph.toposort()]):
    print('Used the cpu')
else:
    print('Used the gpu')
Which produces this output:
$ THEANO_FLAGS=device=gpu0,floatX=float32 python test1.py
Using gpu device 0: GeForce GTX 275
Looping 1000 times took 0.368273973465 seconds without borrow and 0.0240728855133 seconds with borrow.
Used the gpu
Take home message:
When an input x to a function is not needed after the function returns and you would like to make it available to Theano as additional workspace, then consider marking it with In(x, borrow=True). It may make the function faster and reduce its memory requirement. When a return value y is large (in terms of memory footprint), and you only need to read from it once, right away when it’s returned, then consider marking it with Out(y, borrow=True).
Python Memory Management¶
One of the major challenges in writing (somewhat) large-scale Python programs is to keep memory usage at a minimum. However, managing memory in Python is easy—if you just don’t care. Python allocates memory transparently, manages objects using a reference count system, and frees memory when an object’s reference count falls to zero. In theory, it’s swell. In practice, you need to know a few things about Python memory management to get a memory-efficient program running. One of the things you should know, or at least get a good feel about, is the sizes of basic Python objects. Another thing is how Python manages its memory internally.
So let us begin with the size of basic objects. In Python, there are not a lot of primitive data types: there are ints, longs (an unlimited-precision version of ints), floats (which are doubles), tuples, strings, lists, dictionaries, and classes.
Basic Objects¶
What is the size of int? A programmer with a C or C++ background will probably guess that the size of a machine-specific int is something like 32 bits, maybe 64; and that therefore it occupies at most 8 bytes. But is that so in Python?
Let us first write a function that shows the sizes of objects (recursively if necessary):
import sys

def show_sizeof(x, level=0):
    print "\t" * level, x.__class__, sys.getsizeof(x), x
    if hasattr(x, '__iter__'):
        if hasattr(x, 'items'):
            for xx in x.items():
                show_sizeof(xx, level + 1)
        else:
            for xx in x:
                show_sizeof(xx, level + 1)
We can now use the function to inspect the sizes of the different basic data types:
show_sizeof(None)
show_sizeof(3)
show_sizeof(2**63)
show_sizeof(102947298469128649161972364837164)
show_sizeof(918659326943756134897561304875610348756384756193485761304875613948576297485698417)
If you have a 32-bit Python 2.7.x, you’ll see:
8 None
12 3
22 9223372036854775808
28 102947298469128649161972364837164
48 918659326943756134897561304875610348756384756193485761304875613948576297485698417
and if you have a 64-bit Python 2.7.x, you’ll see:
16 None
24 3
36 9223372036854775808
40 102947298469128649161972364837164
60 918659326943756134897561304875610348756384756193485761304875613948576297485698417
Let us focus on the 64-bit version (mainly because that’s what we need the most often in our case). None takes 16 bytes. int takes 24 bytes, three times as much memory as a C int64_t, despite being some kind of “machine-friendly” integer. Long integers (unbounded precision), used to represent integers larger than 2^63 - 1, have a minimum size of 36 bytes. Then it grows linearly in the logarithm of the integer represented.
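The exact byte counts above are specific to CPython 2.7; on modern CPython 3 they differ again, but the pattern (a fixed per-object header plus growth with the magnitude of the integer) is easy to check:

```python
import sys

# Exact sizes vary across CPython versions and platforms, so we only
# assert the trend, not specific numbers.
small = sys.getsizeof(3)        # far more than a bare 4- or 8-byte C int
big = sys.getsizeof(10 ** 100)  # unbounded ints grow with their magnitude

assert small >= 8
assert big > small
```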
Python’s floats are implementation-specific but seem to be C doubles. However, they do not eat up only 8 bytes:
show_sizeof(3.14159265358979323846264338327950288)
Outputs
16 3.14159265359
on a 32-bit platform and
24 3.14159265359
on a 64-bit platform. That is, again, three times the size a C programmer would expect. Now, what about strings?
show_sizeof("")
show_sizeof("My hovercraft is full of eels")
outputs, on a 32-bit platform:
21
50 My hovercraft is full of eels
and, on a 64-bit platform:
37
66 My hovercraft is full of eels
An empty string costs 37 bytes in a 64-bit environment! Memory used by a string then grows linearly with the length of the (useful) string.
* * *
Other commonly used structures, such as tuples, lists, and dictionaries, are worthwhile to examine. Lists (which are implemented as array lists, not as linked lists, with everything that entails) are arrays of references to Python objects, allowing them to be heterogeneous. Let us look at our sizes:
show_sizeof([])
show_sizeof([4, "toaster", 230.1])
outputs
32 []
44 [4, 'toaster', 230.1]
on a 32-bit platform and
72 []
96 [4, 'toaster', 230.1]
on a 64-bit platform. An empty list eats up 72 bytes. The size of an empty, 64-bit C++ std::list() is only 16 bytes, 4-5 times less. What about tuples? And dictionaries?
show_sizeof({})
show_sizeof({'a':213, 'b':2131})
outputs, on a 32-bit box,
136 {}
136 {'a': 213, 'b': 2131}
    32 ('a', 213)
        22 a
        12 213
    32 ('b', 2131)
        22 b
        12 2131
and
280 {}
280 {'a': 213, 'b': 2131}
    72 ('a', 213)
        38 a
        24 213
    72 ('b', 2131)
        38 b
        24 2131
for a 64-bit box.
This last example is particularly interesting because it “doesn’t add up.” If we look at individual key/value pairs, they take 72 bytes (while their components take 38+24=62 bytes, leaving 10 bytes for the pair itself), but the dictionary takes 280 bytes (rather than a strict minimum of 144=72×2 bytes). The dictionary is supposed to be an efficient data structure for search, and the two likely implementations will use more space than strictly necessary. If it’s some kind of tree, then we should pay the cost of internal nodes that contain a key and two pointers to child nodes; if it’s a hash table, then we must have some room with free entries to ensure good performance.
The (somewhat) equivalent std::map C++ structure takes 48 bytes when created (that is, empty). An empty C++ string takes 8 bytes (then the allocated size grows linearly with the size of the string). An integer takes 4 bytes (32 bits).
* * *
Why does all this matter? It seems that whether an empty string takes 8 bytes or 37 doesn’t change anything much. That’s true, until you need to scale. Then, you need to be really careful about how many objects you create to limit the quantity of memory your program uses. It is a problem in real-life applications. However, to devise a really good strategy about memory management, we must not only consider the sizes of objects, but also how many of them are created and in which order. It turns out to be very important for Python programs. One key element to understand is how Python allocates its memory internally, which we will discuss next.
Internal Memory Management¶
To speed up memory allocation (and reuse), Python uses a number of lists for small objects. Each list contains objects of similar size: there is a list for objects 1 to 8 bytes in size, one for 9 to 16, etc. When a small object needs to be created, either we reuse a free block in the list, or we allocate a new one.
There are some internal details on how Python manages those lists into blocks, pools, and “arenas”: a number of blocks form a pool, pools are gathered into arenas, etc., but they’re not very relevant to the point we want to make (if you really want to know, read Evan Jones’ ideas on how to improve Python’s memory allocation). The important point is that those lists never shrink.
Indeed: if an item (of size x) is deallocated (freed by lack of reference) its location is not returned to Python’s global memory pool (and even less to the system), but merely marked as free and added to the free list of items of size x. The dead object’s location will be reused if another object of compatible size is needed. If there are no dead objects available, new ones are created.
If small-object memory is never freed, then the inescapable conclusion is that, like goldfish, these small-object lists only keep growing, never shrinking, and that the memory footprint of your application is dominated by the largest number of small objects allocated at any given point.
* * *
Therefore, one should work hard to allocate only the number of small objects necessary for one task, favoring (otherwise un-Pythonesque) loops where only a small number of elements are created/processed rather than (more Pythonesque) patterns where lists are created using list comprehension syntax and then processed.
While the second pattern is more à la Python, it is rather the worst case: you end up creating lots of small objects that will populate the small-object lists, and even once the list is dead, the dead objects (now all in the free lists) will still occupy a lot of memory.
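One concrete way to follow this advice is to prefer generator expressions over list comprehensions when the intermediate list is only consumed once: the generator yields items one at a time instead of materializing a million small objects at once.

```python
n = 1000000

# Materializes n int objects plus a list of n references up front:
total_list = sum([i * i for i in range(n)])

# Produces the same result one item at a time; peak small-object count
# stays low because no intermediate list is ever built:
total_gen = sum(i * i for i in range(n))

assert total_list == total_gen
```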
* * *
The fact that the free lists grow does not seem like much of a problem, because the memory they contain is still accessible to the Python program. But from the OS’s perspective, your program’s size is the total (maximum) memory allocated to Python. Since Python returns memory to the OS only on Windows (and only for the heap that allocates objects other than small objects), if you run on Linux you will only see the total memory used by your program increase.
* * *
Let us prove my point using memory_profiler, a Python add-on module (which depends on the python-psutil package) by Fabian Pedregosa (see the module’s github page). This add-on provides the decorator @profile that allows one to monitor the memory usage of one specific function. It is extremely simple to use. Let us consider the following program:
import copy
import memory_profiler

@profile
def function():
    x = list(range(1000000))  # allocate a big list
    y = copy.deepcopy(x)
    del x
    return y

if __name__ == "__main__":
    function()
invoking
python -m memory_profiler memory-profile-me.py
prints, on a 64-bit computer:
Filename: memory-profile-me.py

Line #    Mem usage    Increment   Line Contents
================================================
     4                             @profile
     5      9.11 MB      0.00 MB   def function():
     6     40.05 MB     30.94 MB       x = list(range(1000000)) # allocate a big list
     7     89.73 MB     49.68 MB       y = copy.deepcopy(x)
     8     82.10 MB     -7.63 MB       del x
     9     82.10 MB      0.00 MB       return y
This program creates a list of n = 1,000,000 ints (n x 24 bytes = ~23 MB) and an additional list of references (n x 8 bytes = ~7.6 MB), which amounts to a total memory usage of ~31 MB. copy.deepcopy copies both lists, which allocates again ~50 MB (I am not sure where the additional overhead of 50 MB - 31 MB = 19 MB comes from). The interesting part is del x: it deletes x, but the memory usage only decreases by 7.63 MB! This is because del only deletes the reference list, not the actual integer values, which remain on the heap and cause a memory overhead of ~23 MB.
This example allocates in total ~73 MB, which is more than twice the amount of memory needed to store a single list of ~31 MB. You can see that memory can increase surprisingly if you are not careful!
Note that you might get different results on a different platform or with a different Python version.
Pickle¶
On a related note: is pickle wasteful?
Pickle is the standard way of (de)serializing Python objects to file. What is its memory footprint? Does it create extra copies of the data or is it rather smart about it? Consider this short example:
import memory_profiler
import pickle
import random

def random_string():
    return "".join([chr(64 + random.randint(0, 25)) for _ in xrange(20)])

@profile
def create_file():
    x = [(random.random(),
          random_string(),
          random.randint(0, 2 ** 64))
         for _ in xrange(1000000)]

    pickle.dump(x, open('machin.pkl', 'w'))

@profile
def load_file():
    y = pickle.load(open('machin.pkl', 'r'))
    return y

if __name__ == "__main__":
    create_file()
    #load_file()
With one invocation we profile the creation of the pickled data, and with another we re-read it (commenting out the function not to be called). Using memory_profiler, the creation uses a lot of memory:
Filename: test-pickle.py

Line #    Mem usage    Increment   Line Contents
================================================
     8                             @profile
     9      9.18 MB      0.00 MB   def create_file():
    10      9.33 MB      0.15 MB       x = [(random.random(),
    11                                        random_string(),
    12                                        random.randint(0, 2 ** 64))
    13    246.11 MB    236.77 MB       for _ in xrange(1000000)]
    14
    15    481.64 MB    235.54 MB       pickle.dump(x, open('machin.pkl', 'w'))
and re-reading a bit less:
Filename: test-pickle.py

Line #    Mem usage    Increment   Line Contents
================================================
    18                             @profile
    19      9.18 MB      0.00 MB   def load_file():
    20    311.02 MB    301.83 MB       y = pickle.load(open('machin.pkl', 'r'))
    21    311.02 MB      0.00 MB       return y
So, somehow, pickling is very bad for memory consumption. The initial list takes up more or less 230 MB, but pickling it creates an extra 230-something MB worth of memory allocation.
Unpickling, on the other hand, seems fairly efficient. It does create more memory than the original list (300 MB instead of 230-something) but it does not double the quantity of allocated memory.
Overall, then, (un)pickling should be avoided for memory-sensitive applications. What are the alternatives? Pickling preserves all the structure of a data structure, so you can recover it exactly from the pickled file at a later time. However, that might not always be needed. If the file is to contain a list as in the example above, then maybe a simple flat, text-based file format is in order. Let us see what it gives.
A naïve implementation would give:
import memory_profiler
import random
import pickle

def random_string():
    return "".join([chr(64 + random.randint(0, 25)) for _ in xrange(20)])

@profile
def create_file():
    x = [(random.random(),
          random_string(),
          random.randint(0, 2 ** 64))
         for _ in xrange(1000000)]

    f = open('machin.flat', 'w')
    for xx in x:
        print >>f, xx
    f.close()

@profile
def load_file():
    y = []
    f = open('machin.flat', 'r')
    for line in f:
        y.append(eval(line))
    f.close()
    return y

if __name__ == "__main__":
    create_file()
    #load_file()
Creating the file:
Filename: test-flat.py

Line #    Mem usage    Increment   Line Contents
================================================
     8                             @profile
     9      9.19 MB      0.00 MB   def create_file():
    10      9.34 MB      0.15 MB       x = [(random.random(),
    11                                        random_string(),
    12                                        random.randint(0, 2**64))
    13    246.09 MB    236.75 MB       for _ in xrange(1000000)]
    14
    15    246.09 MB      0.00 MB       f = open('machin.flat', 'w')
    16    308.27 MB     62.18 MB       for xx in x:
    17                                     print >>f, xx
and reading the file back:
Filename: test-flat.py

Line #    Mem usage    Increment   Line Contents
================================================
    20                             @profile
    21      9.19 MB      0.00 MB   def load_file():
    22      9.34 MB      0.15 MB       y = []
    23      9.34 MB      0.00 MB       f = open('machin.flat', 'r')
    24    300.99 MB    291.66 MB       for line in f:
    25    300.99 MB      0.00 MB           y.append(eval(line))
    26    301.00 MB      0.00 MB       return y
Memory consumption on writing is now much better. It still creates a lot of temporary small objects (about 60 MB’s worth), but it is not doubling memory usage. Reading is comparable (using only marginally less memory).
This particular example is trivial but it generalizes to strategies where you don’t load the whole thing first then process it but rather read a few items, process them, and reuse the allocated memory. Loading data to a Numpy array, for example, one could first create the Numpy array, then read the file line by line to fill the array: this allocates one copy of the whole data. Using pickle, you would allocate the whole data (at least) twice: once by pickle, and once through Numpy.
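A sketch of that strategy, using an in-memory file for illustration (the one-float-per-line format is made up for the example): allocate the NumPy array once, then fill it while streaming the lines, so no second full-size copy of the data is ever created.

```python
import io
import numpy as np

# Stand-in for a real file: one float per line (hypothetical format).
f = io.StringIO("0.5\n1.5\n2.5\n3.5\n4.5\n")

n = 5
arr = np.empty(n)              # single allocation for the whole data
for i, line in enumerate(f):
    arr[i] = float(line)       # parse and store in place, line by line

assert arr[0] == 0.5 and arr[4] == 4.5
```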
Or even better yet: use Numpy (or PyTables) arrays. But that is a different topic. In the meantime, you can have a look at the loading and saving tutorial in the Theano/doc/tutorial directory.
* * *
Python’s design goals are radically different from, say, C’s design goals. While the latter is designed to give you good control over what you are doing at the expense of more complex and explicit programming, the former is designed to let you code rapidly while hiding most (if not all) of the underlying implementation details. While this sounds nice, in a production environment ignoring the implementation inefficiencies of a language can bite you hard, and sometimes when it is too late. I think that having a good feel for how inefficient Python is with memory management (by design!) will play an important role in whether your code meets production requirements, scales well, or, on the contrary, becomes a burning hell of memory.
Multi-core support in Theano¶
BLAS operation¶
BLAS is an interface for some mathematical operations between two vectors, a vector and a matrix, or two matrices (e.g. the dot product between vector/matrix and matrix/matrix). Many different implementations of that interface exist and some of them are parallelized.
Theano tries to use that interface as frequently as possible for performance reasons. So if Theano links to a parallel implementation, those operations will run in parallel in Theano.
The most frequent way to control the number of threads used is via the OMP_NUM_THREADS environment variable. Set it to the number of threads you want to use before starting the Python process. Some BLAS implementations support other environment variables.
To test if your BLAS supports OpenMP/multiple cores, you can use the theano/misc/check_blas.py script from the command line like this:
OMP_NUM_THREADS=1 python theano/misc/check_blas.py -q
OMP_NUM_THREADS=2 python theano/misc/check_blas.py -q
Parallel element-wise ops with OpenMP¶
Because element-wise ops work on every tensor entry independently, they can be easily parallelized using OpenMP.
To use OpenMP you must set the openmp flag to True.
You can use the flag openmp_elemwise_minsize to set the minimum tensor size for which the operation is parallelized, because for short tensors using OpenMP can slow down the operation. The default value is 200000.
For simple (fast) operations you can obtain a speedup with very large tensors while for more complex operations you can obtain a good speedup also for smaller tensors.
There is a script elemwise_openmp_speedup.py in theano/misc/ which you can use to tune the value of openmp_elemwise_minsize for your machine. The script runs two elemwise operations (a fast one and a slow one) for a vector of size openmp_elemwise_minsize, with and without OpenMP, and shows the time difference between the cases.
The only way to control the number of threads used is via the OMP_NUM_THREADS environment variable. Set it to the number of threads you want to use before starting the Python process. You can test this with this command:
OMP_NUM_THREADS=2 python theano/misc/elemwise_openmp_speedup.py
#The output
Fast op time without openmp 0.000533s with openmp 0.000474s speedup 1.12
Slow op time without openmp 0.002987s with openmp 0.001553s speedup 1.92
Frequently Asked Questions¶
How to update a subset of weights?¶
If you want to update only a subset of a weight matrix (such as some rows or some columns) that is used in the forward propagation of each iteration, then the cost function should be defined in a way that it only depends on the subset of weights that are used in that iteration.
For example, if you want to learn a lookup table, e.g. one used for word embeddings, where each row is a vector of weights representing the embedding that the model has learned for a word, then in each iteration the only rows that should get updated are those containing embeddings used during the forward propagation. Here is how the Theano function should be written.
Defining a shared variable for the lookup table
lookup_table = theano.shared(matrix_ndarray)
Getting a subset of the table (some rows or some columns) by passing an integer vector of indices corresponding to those rows or columns.
subset = lookup_table[vector_of_indices]
From now on, use only subset. Do not call lookup_table[vector_of_indices] again, as this will create new variables and cause problems with grad.
Defining cost which depends only on subset and not the entire lookup_table
cost = something that depends on subset
g = theano.grad(cost, subset)
There are two ways of updating the parameters: using either inc_subtensor or set_subtensor. It is recommended to use inc_subtensor. Some Theano optimizations do the conversion between the two functions, but not in all cases.
updates = inc_subtensor(subset, g*lr)
OR
updates = set_subtensor(subset, subset + g*lr)
Currently we only cover this case here, not the use of inc_subtensor or set_subtensor with other types of indexing.
Defining the theano function
f = theano.function(..., updates=[(lookup_table, updates)])
Note that you can compute the gradient of the cost function w.r.t. the entire lookup_table, and the gradient will have nonzero rows only for the rows that were selected during forward propagation. If you use gradient descent to update the parameters, there are no issues except for unnecessary computation, e.g. you will update the lookup table parameters with many zero gradient rows. However, if you want to use a different optimization method like rmsprop or Hessian-Free optimization, then there will be issues. In rmsprop, you keep an exponentially decaying squared gradient by whose square root you divide the current gradient to rescale the update step component-wise. If the gradient of the lookup table row which corresponds to a rare word is very often zero, the squared gradient history will tend to zero for that row because the history of that row decays towards zero. Using Hessian-Free, you will get many zero rows and columns. Even one of them would make it non-invertible. In general, it would be better to compute the gradient only w.r.t. those lookup table rows or columns which are actually used during the forward propagation.
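The accumulate-on-duplicate-indices semantics that make inc_subtensor preferable can be illustrated with a NumPy analogy (np.add.at is NumPy's unbuffered indexed add; the table and gradient values below are made up for the example):

```python
import numpy as np

lookup_table = np.zeros((4, 2))
vector_of_indices = np.array([0, 2, 0])   # row 0 is used twice this batch
g = np.ones((3, 2))                       # pretend gradient for each used row
lr = 0.1

# inc_subtensor-style update: contributions to duplicated rows accumulate,
# because np.add.at applies every (index, value) pair without buffering.
np.add.at(lookup_table, vector_of_indices, -lr * g)

assert np.allclose(lookup_table[0], [-0.2, -0.2])  # two contributions summed
assert np.allclose(lookup_table[2], [-0.1, -0.1])
assert np.allclose(lookup_table[1], 0.0)           # unused rows untouched
```

A plain fancy-indexed assignment (the set_subtensor analogue, `lookup_table[idx] = ...`) would instead keep only the last write for a duplicated row, which is why the accumulating form is the safer default for gradients.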
Extending Theano¶
This advanced tutorial is for users who want to extend Theano with new Types, new Operations (Ops), and new graph optimizations. This first page of the tutorial mainly focuses on the Python implementation of an Op and then proposes an overview of the most important methods that define an op. The second page of the tutorial (Extending Theano with a C Op) then provides information on the C implementation of an Op. The rest of the tutorial goes more in depth on advanced topics related to Ops, such as how to write efficient code for an Op and how to write an optimization to speed up the execution of an Op.
Along the way, this tutorial also introduces many aspects of how Theano works, so it is also good for you if you are interested in getting more under the hood with Theano itself.
Note
Before tackling this more advanced presentation, it is highly recommended to read the introductory Tutorial, especially the sections that introduce the Theano Graphs, as providing a novel Theano op requires a basic understanding of the Theano Graphs.
See also the Developer Start Guide for information about the versioning framework (namely git and GitHub), the development workflow, and how to make a quality contribution.
Creating a new Op: Python implementation¶
So suppose you have looked through the library documentation and you don’t see a function that does what you want.
If you can implement something in terms of existing Ops, you should do that. Odds are your function that uses existing Theano expressions is short, has no bugs, and potentially profits from optimizations that have already been implemented.
However, if you cannot implement an Op in terms of existing Ops, you have to write a new one. Don’t worry, Theano was designed to make it easy to add new Ops, Types, and Optimizations.
As an illustration, this tutorial shows how to write a simple Python-based op which performs operations on the Type double.
Note
This is an introductory tutorial and as such it does not cover how to make an op that returns a view or modifies the values in its inputs. Thus, all ops created with the instructions described here MUST return newly allocated memory or reuse the memory provided in the output_storage parameter of the perform() function. See Views and inplace operations for an explanation of how to do this.
If your op returns a view or changes the values of its inputs without doing as prescribed in that page, Theano will run, but will return correct results for some graphs and wrong results for others.
It is recommended that you run your tests in DebugMode (Theano flag mode=DebugMode) since it verifies that your op behaves correctly in this regard.
Theano Graphs refresher¶
Theano represents symbolic mathematical computations as graphs. Those graphs are bipartite graphs (graphs with two types of nodes): they are composed of interconnected Apply and Variable nodes. Variable nodes represent data in the graph, either inputs, outputs or intermediary values. As such, inputs and outputs of a graph are lists of Theano Variable nodes. Apply nodes perform computation on these variables to produce new variables. Each Apply node has a link to an instance of Op which describes the computation to perform. This tutorial details how to write such an Op instance. Please refer to Graph Structures for a more detailed explanation of the graph structure.
Op’s basic methods¶
An op is any Python object which inherits from gof.Op. This section provides an overview of the basic methods you typically have to implement to make a new op. It does not provide extensive coverage of all the possibilities you may encounter or need. For that, refer to Op’s contract.
import theano

class MyOp(theano.Op):
    # Properties attribute
    __props__ = ()

    # itypes and otypes attributes are
    # compulsory if make_node method is not defined.
    # They're the type of input and output respectively
    itypes = None
    otypes = None

    # Compulsory if itypes and otypes are not defined
    def make_node(self, *inputs):
        pass

    # Python implementation:
    def perform(self, node, inputs_storage, output_storage):
        pass

    # Other type of implementation
    # C implementation: [see theano web site for other functions]
    def c_code(self, node, inputs, outputs, sub):
        pass

    # Other implementations (pycuda, ...):
    def make_thunk(self, node, storage_map, _, _2):
        pass

    # optional:
    check_input = True

    def __init__(self, *args):
        pass

    def grad(self, inputs, g):
        pass

    def R_op(self, inputs, eval_points):
        pass

    def infer_shape(self, node, input_shapes):
        pass
An op has to implement some methods defined in the interface of gof.Op. More specifically, it is mandatory for an op to define either the method make_node() or the itypes and otypes attributes, together with one of the implementation methods: perform(), Op.c_code() or make_thunk().
The make_node() method creates an Apply node representing the application of the op on the inputs provided. This method is responsible for three things:
 it first checks that the input Variables’ types are compatible with the current op. If the op cannot be applied on the provided input types, it must raise an exception (such as TypeError).
 it operates on the Variables found in *inputs in Theano’s symbolic language to infer the type of the symbolic output Variables. It creates output Variables of a suitable symbolic Type to serve as the outputs of this op’s application.
 it creates an Apply instance with the input and output Variables, and returns the Apply instance.
The perform() method defines the Python implementation of an op. It takes several arguments:
 node is a reference to an Apply node which was previously obtained via the Op‘s make_node() method. It is typically not used in simple ops, but it contains symbolic information that could be required by complex ops.
 inputs is a list of references to data which can be operated on using non-symbolic statements (i.e., statements in Python or Numpy).
 output_storage is a list of storage cells where the output is to be stored. There is one storage cell for each output of the op. The data put in output_storage must match the type of the symbolic output. It is forbidden to change the length of the list(s) contained in output_storage. A function Mode may allow output_storage elements to persist between evaluations, or it may reset output_storage cells to hold a value of None. It can also preallocate some memory for the op to use. This feature can allow perform to reuse memory between calls, for example. If there is something preallocated in output_storage, it will be of the right dtype, but can have the wrong shape and any stride pattern.
The output of the perform() method must be determined by the inputs; that is to say, when applied to identical inputs the method must return the same outputs.
gof.Op allows other ways to define the op implementation. For instance, it is possible to define Op.c_code() to provide a C implementation of the op. Please refer to the tutorial Extending Theano with a C Op for a description of Op.c_code() and the other related c_ methods. Note that an op can provide both Python and C implementations.
The make_thunk() method is another alternative to perform(). It returns a thunk. A thunk is defined as a zero-argument function which encapsulates the computation to be performed by an op on the arguments of its corresponding node. It takes several parameters:
- node is the Apply instance for which a thunk is requested.
- storage_map is a dict of lists which maps variables to one-element lists holding the variable's current value. The one-element list acts as a pointer to the value and allows sharing that "pointer" with other nodes and instances.
- compute_map is also a dict of lists. It maps variables to one-element lists holding flags. If the value is 0, the variable has not been computed and the value should not be considered valid. If the value is 1, the variable has been computed and the value is valid. If the value is 2, the variable has been garbage-collected and is no longer valid, but shouldn't be required anymore for this call. The returned function must ensure that it marks the computed variables as computed in the compute_map.
make_thunk() is useful if you want to generate code and compile it yourself. For example, this allows you to use PyCUDA to compile GPU code.

If make_thunk() is defined by an op, it will be used by Theano to obtain the op's implementation; perform() and Op.c_code() will then be ignored.

If make_node() is not defined, the itypes and otypes attributes are used by the Op's default make_node() method to implement the functionality of the make_node() method described above.
Op’s auxiliary methods¶
There are other methods that can be optionally defined by the op:
- The __str__() method provides a meaningful string representation of your op.
- __eq__() and __hash__() define, respectively, equality between two ops and the hash of an op instance. They will be used by the optimization phase to merge nodes that are doing equivalent computations (same inputs, same operation). Two ops that are equal according to __eq__() should return the same output when they are applied on the same inputs.
- The __props__ attribute lists the properties that influence how the computation is performed (usually these are the ones you set in __init__()). It must be a tuple. If you don't have any properties, then you should set this attribute to the empty tuple ().
__props__ enables the automatic generation of appropriate __eq__() and __hash__() methods. Given the __eq__() automatically generated from __props__, two ops will be equal if they have the same values for all the properties listed in __props__. Given the __hash__() automatically generated from __props__, two ops will have the same hash if they have the same values for all the properties listed in __props__. __props__ will also generate a suitable __str__() for your op. This requires a development version after September 1st, 2014, or version 0.7.
The infer_shape() method allows an op to infer the shape of its output variables without actually computing the outputs. It takes as input node, a reference to the op's Apply node, and a list of Theano symbolic Variables (i0_shape, i1_shape, ...) which are the shapes of the op's input Variables. infer_shape() returns a list where each element is a tuple representing the shape of one output. This can be helpful if one only needs the shape of the output instead of the actual outputs, which is useful, for instance, for optimization procedures.
The grad() method is required if you want to differentiate some cost whose expression includes your op. The gradient may be specified symbolically in this method. It takes two arguments, inputs and output_gradients, which are both lists of symbolic Theano Variables, and these must be operated on using Theano's symbolic language. The grad method must return a list containing one Variable for each input. Each returned Variable represents the gradient with respect to that input, computed based on the symbolic gradients with respect to each output. If the output is not differentiable with respect to an input, then this method should return a variable of type NullType for that input. Likewise, if you have not implemented the grad computation for some input, you may return a variable of type NullType for that input. Please refer to grad() for a more detailed view.
The R_op() method is needed if you want theano.tensor.Rop to work with your op. This function implements the application of the R-operator on the function represented by your op. Suppose the function is f, with input x; applying the R-operator means computing the Jacobian of f and right-multiplying it by v, the evaluation point, namely: (∂f(x)/∂x) · v.

The optional boolean check_input attribute is used to specify whether you want the types used in your op to check their inputs in their c_code. It can be used to speed up compilation, reduce overhead (particularly for scalars) and reduce the number of generated C files.
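Numerically, the R-operator is easy to check for a doubling op with NumPy alone: for f(x) = 2x the Jacobian is 2I, so J·v = 2v. This is a finite-difference sketch, not Theano code:

```python
import numpy

def f(x):
    return 2 * x  # the function implemented by a "double" op

x = numpy.random.rand(3)  # point at which the Jacobian is taken
v = numpy.random.rand(3)  # evaluation point of the R-operator
eps = 1e-6
# Directional derivative (J v) approximated by finite differences.
jv = (f(x + eps * v) - f(x)) / eps
assert numpy.allclose(jv, 2 * v, atol=1e-5)
```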
Example: Op definition¶
import theano

# Using make_node
class DoubleOp1(theano.Op):
    __props__ = ()

    def make_node(self, x):
        x = theano.tensor.as_tensor_variable(x)
        # Note: using x_.type() is dangerous, as it copies x's broadcasting
        # behaviour
        return theano.Apply(self, [x], [x.type()])

    def perform(self, node, inputs, output_storage):
        x = inputs[0]
        z = output_storage[0]
        z[0] = x * 2

    def infer_shape(self, node, i0_shapes):
        return i0_shapes

    def grad(self, inputs, output_grads):
        return [output_grads[0] * 2]

    def R_op(self, inputs, eval_points):
        # R_op can receive None as eval_points.
        # That means there is no differentiable path through that input.
        # If this implies that you cannot compute some outputs,
        # return None for those.
        if eval_points[0] is None:
            return eval_points
        return self.grad(inputs, eval_points)

doubleOp1 = DoubleOp1()
# Using itypes and otypes
class DoubleOp2(theano.Op):
    __props__ = ()

    itypes = [theano.tensor.dmatrix]
    otypes = [theano.tensor.dmatrix]

    def perform(self, node, inputs, output_storage):
        x = inputs[0]
        z = output_storage[0]
        z[0] = x * 2

    def infer_shape(self, node, i0_shapes):
        return i0_shapes

    def grad(self, inputs, output_grads):
        return [output_grads[0] * 2]

    def R_op(self, inputs, eval_points):
        # R_op can receive None as eval_points.
        # That means there is no differentiable path through that input.
        # If this implies that you cannot compute some outputs,
        # return None for those.
        if eval_points[0] is None:
            return eval_points
        return self.grad(inputs, eval_points)

doubleOp2 = DoubleOp2()
At a high level, the code fragment declares a class (e.g., DoubleOp1) and then creates one instance of it (e.g., doubleOp1). We often gloss over this distinction, but will be precise here: doubleOp1 (the instance) is an Op, not DoubleOp1 (the class, which is a subclass of theano.Op). You can call doubleOp1(tensor.vector()) on a Variable to build an expression, and in the expression there will be a .op attribute that refers to doubleOp1.
The make_node method creates a node to be included in the expression graph. It runs when we apply our Op (doubleOp1) to the Variable (x), as in doubleOp1(tensor.vector()).
When an Op has multiple inputs, their order in the inputs argument to Apply
is important: Theano will call make_node(*inputs)
to copy the graph,
so it is important not to change the semantics of the expression by changing
the argument order.
All the inputs and outputs arguments to Apply must be Variables. A common and easy way to ensure inputs are variables is to run them through as_tensor_variable. This function leaves TensorType variables alone, raises an error for non-TensorType variables, and copies any numpy.ndarray into the storage for a TensorType Constant. The make_node method dictates the appropriate Type for all output variables.
The perform method implements the Op's mathematical logic in Python. The inputs (here x) are passed by value, but a single output is returned indirectly as the first element of a single-element list. If doubleOp1 had a second output, it would be stored in output_storage[1][0].
In some execution modes, the output storage might contain the return value of a previous call. That old value can be reused to avoid memory reallocation, but it must not influence the semantics of the Op output.
You can try the new Op as follows:
import theano
x = theano.tensor.matrix()
f = theano.function([x], DoubleOp1()(x))
import numpy
inp = numpy.random.rand(5, 4)
out = f(inp)
assert numpy.allclose(inp * 2, out)
print(inp)
print(out)
[[ 0.08257206 0.34308357 0.5288043 0.06582951]
[ 0.65977826 0.10040307 0.5402353 0.55472296]
[ 0.82358552 0.29502171 0.97387481 0.0080757 ]
[ 0.77327215 0.65401857 0.76562992 0.94145702]
[ 0.8452076 0.30500101 0.88430501 0.95818655]]
[[ 0.16514411 0.68616713 1.0576086 0.13165902]
[ 1.31955651 0.20080613 1.08047061 1.10944593]
[ 1.64717104 0.59004341 1.94774962 0.0161514 ]
[ 1.5465443 1.30803715 1.53125983 1.88291403]
[ 1.6904152 0.61000201 1.76861002 1.9163731 ]]
import theano
x = theano.tensor.matrix()
f = theano.function([x], DoubleOp2()(x))
import numpy
inp = numpy.random.rand(5, 4)
out = f(inp)
assert numpy.allclose(inp * 2, out)
print(inp)
print(out)
[[ 0.02443785 0.67833979 0.91954769 0.95444365]
[ 0.60853382 0.7770539 0.78163219 0.92838837]
[ 0.04427765 0.37895602 0.23155797 0.4934699 ]
[ 0.20551517 0.7419955 0.34500905 0.49347629]
[ 0.24082769 0.49321452 0.24566545 0.15351132]]
[[ 0.04887571 1.35667957 1.83909538 1.90888731]
[ 1.21706764 1.55410779 1.56326439 1.85677674]
[ 0.08855531 0.75791203 0.46311594 0.9869398 ]
[ 0.41103034 1.48399101 0.69001811 0.98695258]
[ 0.48165539 0.98642904 0.4913309 0.30702264]]
Example: __props__ definition¶
We can modify the previous piece of code in order to demonstrate
the usage of the __props__
attribute.
We create an Op that takes a variable x
and returns a*x+b
.
We want to say that two such ops are equal when their values of a
and b
are equal.
import theano

class AXPBOp(theano.Op):
    """
    This creates an Op that takes x to a*x+b.
    """
    __props__ = ("a", "b")

    def __init__(self, a, b):
        self.a = a
        self.b = b
        super(AXPBOp, self).__init__()

    def make_node(self, x):
        # check that the theano version has support for __props__.
        assert hasattr(self, '_props'), ("Your version of theano is too "
                                         "old to support __props__.")
        x = theano.tensor.as_tensor_variable(x)
        return theano.Apply(self, [x], [x.type()])

    def perform(self, node, inputs, output_storage):
        x = inputs[0]
        z = output_storage[0]
        z[0] = self.a * x + self.b

    def infer_shape(self, node, i0_shapes):
        return i0_shapes

    def grad(self, inputs, output_grads):
        # d(a*x + b)/dx == a, so the gradient is a times the output gradient.
        return [self.a * output_grads[0]]
The use of __props__
saves
the user the trouble of implementing __eq__()
and __hash__()
manually. It also generates a default __str__()
method that prints the
attribute names and their values.
We can test this by running the following segment:
mult4plus5op = AXPBOp(4, 5)
another_mult4plus5op = AXPBOp(4, 5)
mult2plus3op = AXPBOp(2, 3)
assert mult4plus5op == another_mult4plus5op
assert mult4plus5op != mult2plus3op
x = theano.tensor.matrix()
f = theano.function([x], mult4plus5op(x))
g = theano.function([x], mult2plus3op(x))
import numpy
inp = numpy.random.rand(5, 4).astype(numpy.float32)
assert numpy.allclose(4 * inp + 5, f(inp))
assert numpy.allclose(2 * inp + 3, g(inp))
How To Test it¶
Theano has some functionalities to simplify testing. These help test the infer_shape, grad and R_op methods. Put the following code in a file and execute it with the theano-nose program.
Basic Tests¶
You perform basic tests simply by using the op and checking that it returns the right answer. If you detect an error, you must raise an exception. You can use the assert keyword to automatically raise an AssertionError.
import numpy
import theano
from theano.tests import unittest_tools as utt
from theano import config

class test_Double(utt.InferShapeTester):
    def setUp(self):
        super(test_Double, self).setUp()
        self.op_class = DoubleOp
        self.op = DoubleOp()

    def test_basic(self):
        x = theano.tensor.matrix()
        f = theano.function([x], self.op(x))
        inp = numpy.asarray(numpy.random.rand(5, 4), dtype=config.floatX)
        out = f(inp)
        # Compare the result computed to the expected value.
        utt.assert_allclose(inp * 2, out)
We call utt.assert_allclose(expected_value, value) to compare NumPy ndarrays. This raises an error with a more informative message when the values differ. Also, the default tolerance can be changed with the Theano flag config.tensor.cmp_sloppy, which takes the values 0, 1 and 2. The default value (0) performs the strictest comparison; 1 and 2 make less strict comparisons.
Testing the infer_shape¶
When a class inherits from the InferShapeTester
class, it gets the
self._compile_and_check
method that tests the op’s infer_shape
method. It tests that the op gets optimized out of the graph if only
the shape of the output is needed and not the output
itself. Additionally, it checks that the optimized graph computes
the correct shape, by comparing it to the actual shape of the computed
output.
self._compile_and_check
compiles a Theano function. It takes as
parameters the lists of input and output Theano variables, as would be
provided to theano.function
, and a list of real values to pass to the
compiled function. It also takes the op class as a parameter
in order to verify that no instance of it appears in the shapeoptimized graph.
If there is an error, the function raises an exception. If you want to
see it fail, you can implement an incorrect infer_shape
.
When testing with input values with shapes that take the same value
over different dimensions (for instance, a square matrix, or a tensor3
with shape (n, n, n), or (m, n, m)), it is not possible to detect if
the output shape was computed correctly, or if some shapes with the
same value have been mixed up. For instance, if the infer_shape uses
the width of a matrix instead of its height, then testing with only
square matrices will not detect the problem. This is why the
self._compile_and_check
method prints a warning in such a case. If
your op works only with such matrices, you can disable the warning with the
warn=False
parameter.
from theano.tests import unittest_tools as utt
from theano import config

class test_Double(utt.InferShapeTester):
    # [...] as previous tests.

    def test_infer_shape(self):
        x = theano.tensor.matrix()
        self._compile_and_check(
            [x],  # theano.function inputs
            [self.op(x)],  # theano.function outputs
            # Always use non-square matrices!
            # input data
            [numpy.asarray(numpy.random.rand(5, 4), dtype=config.floatX)],
            # Op that should be removed from the graph.
            self.op_class)
Testing the gradient¶
The function verify_grad verifies the gradient of an op or Theano graph. It compares the analytic (symbolically computed) gradient and the numeric gradient (computed through the Finite Difference Method).
If there is an error, the function raises an exception. If you want to see it fail, you can implement an incorrect gradient (for instance, by removing the multiplication by 2).
def test_grad(self):
    theano.tests.unittest_tools.verify_grad(self.op,
                                            [numpy.random.rand(5, 7, 2)])
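The comparison verify_grad performs can be sketched without Theano: compute a central finite-difference estimate of the gradient and compare it with the analytic one. The helper name numeric_grad below is illustrative:

```python
import numpy

def numeric_grad(f, x, eps=1e-6):
    # Central finite-difference estimate of df/dx, one element at a time.
    g = numpy.zeros_like(x)
    for i in range(x.size):
        d = numpy.zeros_like(x)
        d.flat[i] = eps
        g.flat[i] = (f(x + d) - f(x - d)) / (2 * eps)
    return g

cost = lambda x: (x * 2).sum()     # scalar cost built on the "double" op
x = numpy.random.rand(5, 7)
analytic = 2 * numpy.ones_like(x)  # d(sum(2x))/dx = 2 everywhere
assert numpy.allclose(numeric_grad(cost, x), analytic, atol=1e-4)
```

An op whose grad() disagreed with this numeric estimate (say, because the factor 2 was dropped) would fail the analogous check inside verify_grad.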
Testing the Rop¶
The class RopLop_checker defines the functions RopLop_checker.check_mat_rop_lop(), RopLop_checker.check_rop_lop() and RopLop_checker.check_nondiff_rop(). These allow testing the implementation of the R_op method of a particular op.
For instance, to verify the Rop method of the DoubleOp, you can use this:
import numpy
import theano.tests
from theano.tests.test_rop import RopLop_checker

class test_DoubleRop(RopLop_checker):
    def setUp(self):
        super(test_DoubleRop, self).setUp()

    def test_double_rop(self):
        self.check_rop_lop(DoubleRop()(self.x), self.in_shape)
Testing GPU Ops¶
Ops to be executed on the GPU should inherit from the
theano.sandbox.cuda.GpuOp
and not theano.Op
. This allows
Theano to distinguish them. Currently, we use this to test if the
NVIDIA driver works correctly with our sum reduction code on the GPU.
Running Your Tests¶
To perform your tests, you may select either one of the three following methods:
The method of choice to conduct tests is to run the file theano-nose. In a regular Theano installation, the latter will be on the operating system's path and directly accessible from any folder. Otherwise, it can be accessed in the Theano/bin folder. The following command lines may be used for the corresponding purposes:

- theano-nose theano: Run every test found in Theano's path.
- theano-nose folder_name: Run every test found in the folder folder_name.
- theano-nose test_file.py: Run every test found in the file test_file.py.
The following are particularly useful for development purposes since they call for particular classes or even for particular tests:

- theano-nose test_file.py:test_DoubleRop: Run every test found inside the class test_DoubleRop.
- theano-nose test_file.py:test_DoubleRop.test_double_op: Run only the test test_double_op in the class test_DoubleRop.
Help with the use and functionalities of theano-nose may be obtained by running it with the command-line parameter --help (-h).
The command nosetests can also be used. Although it lacks the useful functionalities that theano-nose provides, nosetests can be called similarly to theano-nose from any folder in Python's path, like so: nosetests [suffix similar to the above].
More documentation on nosetests
is available here:
nosetests.
One may also add a block of code similar to the following at the end of the file containing a specific test of interest and run the file. In this example, the test test_DoubleRop in the class test_double_op would be performed.
if __name__ == '__main__':
    t = test_DoubleRop("test_double_rop")
    t.setUp()
    t.test_double_rop()
We recommend that when we execute a file, we run all tests in that file. This can be done by adding this at the end of your test files:
if __name__ == '__main__':
    unittest.main()
Exercise¶
Run the code of the DoubleOp example above.
Modify and execute to compute: x * y.
Modify and execute the example to return two outputs: x + y and x - y.
You can omit the R_op functions. Try to implement the testing apparatus described above.
(Notice that Theano's current elemwise fusion optimization is only applicable to computations involving a single output. Hence, to gain efficiency over the basic solution that is asked here, the two operations would have to be jointly optimized explicitly in the code.)
Making test errors more reproducible is a good practice. To make your tests more reproducible, you need a way to get the same random numbers. You can do this by seeding NumPy's random number generator.
For convenience, the classes InferShapeTester and RopLop_checker
already do this for you. If you implement your own setUp
function,
don’t forget to call the parent setUp
function.
For more details see Using Random Values in Test Cases.
as_op¶
as_op is a Python decorator that converts a Python function into a basic Theano op that will call the supplied function during execution. This isn't the recommended way to build an op, but it allows for a quick implementation.
It takes an optional infer_shape()
parameter that must have this
signature:
def infer_shape(node, input_shapes):
    # ...
    return output_shapes

`input_shapes` and `output_shapes` are lists of tuples that represent the shape of the corresponding inputs/outputs.
Note
Not providing the infer_shape method prevents shaperelated optimizations from working with this op. For example your_op(inputs, ...).shape will need the op to be executed just to get the shape.
Note
As no grad is defined, this means you won’t be able to differentiate paths that include this op.
Note
It converts the Python function to a callable object that takes as inputs Theano variables that were declared.
Note
The Python function wrapped by the as_op decorator needs to return a new data allocation; no views or in-place modifications of the inputs.
as_op Example¶
import theano
import numpy
from theano import function
from theano.compile.ops import as_op

def infer_shape_numpy_dot(node, input_shapes):
    ashp, bshp = input_shapes
    return [ashp[:1] + bshp[1:]]

@as_op(itypes=[theano.tensor.fmatrix, theano.tensor.fmatrix],
       otypes=[theano.tensor.fmatrix], infer_shape=infer_shape_numpy_dot)
def numpy_dot(a, b):
    return numpy.dot(a, b)
You can try it as follows:
x = theano.tensor.fmatrix()
y = theano.tensor.fmatrix()
f = function([x, y], numpy_dot(x, y))
inp1 = numpy.random.rand(5, 4).astype('float32')
inp2 = numpy.random.rand(4, 7).astype('float32')
out = f(inp1, inp2)
Exercise¶
Run the code of the numpy_dot example above.
Modify and execute to compute: numpy.add and numpy.subtract.
Modify and execute the example to return two outputs: x + y and x - y.
Documentation and Coding Style¶
Please always respect the Requirements for Quality Contributions or your contribution will not be accepted.
NanGuardMode and AllocEmpty¶
NanGuardMode helps users find where in the graph NaNs appear. But sometimes, we want some variables not to be checked. For example, in the old GPU back-end, we used a float32 CudaNdarray to store the MRG random number generator state (they are integers), so if NanGuardMode checked it, it would generate false positives. Another case is related to [Gpu]AllocEmpty or some computation on it (like done by Scan).

You can tell NanGuardMode not to check a variable with variable.tag.nan_guard_mode_check. This tag automatically follows that variable during optimization. This means that if you tag a variable that gets replaced by an in-place version, the replacement will keep that tag.
Final Note¶
A more extensive discussion of this section’s content may be found in the advanced tutorial Extending Theano.
The section Other ops includes more instructions for the following specific cases:
Extending Theano with a C Op¶
This tutorial covers how to extend Theano with an op that offers a C implementation. It does not cover ops that run on a GPU, but it does introduce many elements and concepts which are relevant for GPU ops. This tutorial is aimed at individuals who already know how to extend Theano (see tutorial Creating a new Op: Python implementation) by adding a new op with a Python implementation, and will only cover the additional knowledge required to also produce ops with C implementations.
Providing a Theano op with a C implementation requires interacting with Python's C-API and NumPy's C-API. Thus, the first step of this tutorial is to introduce both and highlight the features most relevant to the task of implementing a C op. This tutorial then introduces the most important methods that the op needs to implement in order to provide a usable C implementation. Finally, it shows how to combine these elements to write a simple C op for performing the simple task of multiplying every element in a vector by a scalar.
Python C-API¶
Python provides a C-API to allow the manipulation of Python objects from C code. In this API, all variables that represent Python objects are of type PyObject *. All objects have a pointer to their type object and a reference count field (that is shared with the Python side). Most Python methods have an equivalent C function that can be called on the PyObject * pointer.
As such, manipulating a PyObject instance is often straightforward, but it is important to properly manage its reference count. Failing to do so can lead to undesired behavior in the C code.
Reference counting¶
Reference counting is a mechanism for keeping track of the number of references to an object held by other entities. This mechanism is often used for garbage collection because it makes it easy to see whether an object is still being used by other entities. When the reference count for an object drops to 0, it means the object is not used by anyone any longer and can be safely deleted.
PyObjects implement reference counting and the Python C-API defines a number of macros to help manage those reference counts. The definition of these macros can be found here: Python C-API Reference Counting. Listed below are the two macros most often used in Theano C ops.

void Py_XINCREF(PyObject *o)
Increments the reference count of object o. Without effect if the object is NULL.

void Py_XDECREF(PyObject *o)
Decrements the reference count of object o. If the reference count reaches 0, it will trigger a call of the object's deallocation function. Without effect if the object is NULL.
The general principle, in the reference counting paradigm, is that the owner of a reference to an object is responsible for disposing of it properly. This can be done by decrementing the reference count once the reference is no longer used, or by transferring ownership: passing the reference on to a new owner which becomes responsible for it.
Some functions return "borrowed references"; this means that they return a reference to an object without transferring ownership of the reference to the caller of the function. This means that if you call a function which returns a borrowed reference, you do not have the burden of properly disposing of that reference. You should not call Py_XDECREF() on a borrowed reference.
Correctly managing the reference counts is important as failing to do so can lead to issues ranging from memory leaks to segmentation faults.
NumPy C-API¶
The NumPy library provides a C-API to allow users to create, access and manipulate NumPy arrays from within their own C routines. NumPy's ndarrays are used extensively inside Theano, and so extending Theano with a C op will require interaction with the NumPy C-API.
This section covers the elements of the API that are often required to write code for a Theano C op. The full documentation for the API can be found here: NumPy C-API.
NumPy data types¶
To allow portability between platforms, the NumPy C-API defines its own data types which should be used whenever you are manipulating a NumPy array's internal data. The data types most commonly used to implement C ops are the following: npy_int{8,16,32,64}, npy_uint{8,16,32,64} and npy_float{32,64}.
You should use these data types when manipulating a NumPy array's internal data instead of C primitives, because the size of the memory representation of C primitives can vary between platforms. For instance, a C long can be represented in memory with 4 bytes, but it can also be represented with 8. On the other hand, the in-memory size of NumPy data types remains constant across platforms. Using them will make your code simpler and more portable.
The full list of defined data types can be found here: NumPy C-API data types.
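From Python, the fixed widths of these NumPy types can be verified directly:

```python
import numpy

# The npy_* C types correspond to fixed-width dtypes whose in-memory size
# is the same on every platform, unlike C primitives such as long.
assert numpy.dtype(numpy.int32).itemsize == 4
assert numpy.dtype(numpy.int64).itemsize == 8
assert numpy.dtype(numpy.float32).itemsize == 4
assert numpy.dtype(numpy.float64).itemsize == 8
```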
NumPy ndarrays¶
In the NumPy C-API, NumPy arrays are represented as instances of the PyArrayObject class, which is a descendant of the PyObject class. This means that, as for any other Python object that you manipulate from C code, you need to appropriately manage the reference counts of PyArrayObject instances.
Unlike a standard multidimensional C array, a NumPy array's internal data representation does not have to occupy a contiguous region in memory. In fact, it can be C-contiguous, F-contiguous or non-contiguous. C-contiguous means that the data is not only contiguous in memory but also that it is organized such that the index of the last dimension changes the fastest. If the following array
x = [[1, 2, 3],
     [4, 5, 6]]
is C-contiguous, it means that, in memory, the six values contained in the array x are stored in the order [1, 2, 3, 4, 5, 6] (the first value is x[0,0], the second value is x[0,1], the third value is x[0,2], the fourth value is x[1,0], etc.). F-contiguous (or Fortran-contiguous) also means that the data is contiguous, but that it is organized such that the index of the last dimension changes the slowest. If the array x is F-contiguous, it means that, in memory, the values appear in the order [1, 4, 2, 5, 3, 6] (the first value is x[0,0], the second value is x[1,0], the third value is x[0,1], etc.).
Finally, the internal data can be non-contiguous. In this case, it occupies a non-contiguous region in memory but it is still stored in an organized fashion: the distance between the element x[i,j] and the element x[i+1,j] of the array is constant over all valid values of i and j, just as the distance between the element x[i,j] and the element x[i,j+1] of the array is constant over all valid values of i and j. This distance between consecutive elements of an array over a given dimension is called the stride of that dimension.
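These layouts and strides can be inspected from Python with NumPy; a quick illustration (the byte strides below assume 8-byte float64 elements):

```python
import numpy

x = numpy.array([[1., 2., 3.],
                 [4., 5., 6.]])  # C-contiguous by default
xf = numpy.asfortranarray(x)     # same values, F-contiguous layout

# Strides are in bytes: with float64 elements, the C layout steps 24 bytes
# between rows and 8 between columns; the F layout is the other way around.
assert x.strides == (24, 8)
assert xf.strides == (8, 16)
assert x.flags['C_CONTIGUOUS'] and xf.flags['F_CONTIGUOUS']

# Memory order of the six values matches the description above.
assert list(x.ravel(order='K')) == [1., 2., 3., 4., 5., 6.]
assert list(xf.ravel(order='K')) == [1., 4., 2., 5., 3., 6.]
```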
Accessing NumPy ndarrays’ data and properties¶
The following macros serve to access various attributes of NumPy ndarrays.

void* PyArray_DATA(PyArrayObject* arr)
Returns a pointer to the first element of the array's data. The returned pointer must be cast to a pointer of the proper NumPy C-API data type before use.

int PyArray_NDIM(PyArrayObject* arr)
Returns the number of dimensions in the array pointed to by arr.

npy_intp* PyArray_DIMS(PyArrayObject* arr)
Returns a pointer to the first element of arr's internal array describing its dimensions. This internal array contains as many elements as the array arr has dimensions.
The macro PyArray_SHAPE() is a synonym of PyArray_DIMS(): it has the same effect and is used in an identical way.

npy_intp* PyArray_STRIDES(PyArrayObject* arr)
Returns a pointer to the first element of arr's internal array describing the stride for each of its dimensions. This array has as many elements as the number of dimensions in arr. In this array, the strides are expressed in number of bytes.

PyArray_Descr* PyArray_DESCR(PyArrayObject* arr)
Returns a reference to the object representing the dtype of the array.
The macro PyArray_DTYPE() is a synonym of PyArray_DESCR(): it has the same effect and is used in an identical way.
Note: this is a borrowed reference, so you do not need to decrement its reference count once you are done with it.

int PyArray_TYPE(PyArrayObject* arr)
Returns the typenumber for the elements of the array. Like the dtype, the typenumber is a descriptor for the type of the data in the array. However, the two are not synonyms and, as such, cannot be used in place of one another.

npy_intp PyArray_SIZE(PyArrayObject* arr)
Returns the total number of elements in the array.

bool PyArray_CHKFLAGS(PyArrayObject* arr, flags)
Returns true if the array has the specified flags. The variable flags should either be a NumPy array flag or an integer obtained by applying bitwise OR to an ensemble of flags.
The flags that can be used with this macro are: NPY_ARRAY_C_CONTIGUOUS, NPY_ARRAY_F_CONTIGUOUS, NPY_ARRAY_OWNDATA, NPY_ARRAY_ALIGNED, NPY_ARRAY_WRITEABLE, NPY_ARRAY_UPDATEIFCOPY.
Creating NumPy ndarrays¶
The following functions allow the creation and copying of NumPy arrays:

PyObject* PyArray_EMPTY(int nd, npy_intp* dims, typenum dtype, int fortran)
Constructs a new ndarray with the number of dimensions specified by nd, shape specified by dims and data type specified by dtype. If fortran is equal to 0, the data is organized in a C-contiguous layout, otherwise it is organized in an F-contiguous layout. The array elements are not initialized in any way.
The function PyArray_Empty() performs the same function as the macro PyArray_EMPTY(), but the data type is given as a pointer to a PyArray_Descr object instead of a typenum.

PyObject* PyArray_ZEROS(int nd, npy_intp* dims, typenum dtype, int fortran)
Constructs a new ndarray with the number of dimensions specified by nd, shape specified by dims and data type specified by dtype. If fortran is equal to 0, the data is organized in a C-contiguous layout, otherwise it is organized in an F-contiguous layout. Every element in the array is initialized to 0.
The function PyArray_Zeros() performs the same function as the macro PyArray_ZEROS(), but the data type is given as a pointer to a PyArray_Descr object instead of a typenum.

PyArrayObject* PyArray_GETCONTIGUOUS(PyObject* op)
Returns a C-contiguous and well-behaved copy of the array op. If op is already C-contiguous and well-behaved, this function simply returns a new reference to op.
Methods the C Op needs to define¶
There is a key difference between an op defining a Python implementation for
its computation and defining a C implementation. In the case of a Python
implementation, the op defines a function perform()
which executes the
required Python code to realize the op. In the case of a C implementation,
however, the op does not define a function that will execute the C code; it
instead defines functions that will return the C code to the caller.
This is because calling C code from Python code comes with a significant overhead. If every op was responsible for executing its own C code, every time a Theano function was called, this overhead would occur as many times as the number of ops with C implementations in the function’s computational graph.
To maximize performance, Theano instead requires the C ops to simply return the code needed for their execution and takes upon itself the task of organizing, linking and compiling the code from the various ops. Through this, Theano is able to minimize the number of times C code is called from Python code.
The following is a very simple example to illustrate how it’s possible to obtain performance gains with this process. Suppose you need to execute, from Python code, 10 different ops, each one having a C implementation. If each op was responsible for executing its own C code, the overhead of calling C code from Python code would occur 10 times. Consider now the case where the ops instead return the C code for their execution. You could get the C code from each op and then define your own C module that would call the C code from each op in succession. In this case, the overhead would only occur once; when calling your custom module itself.
Moreover, the fact that Theano itself takes care of compiling the C code, instead of the individual ops, allows Theano to easily cache the compiled C code. This allows for faster compilation times.
See Implementing the arithmetic Ops in C for the full documentation of the various methods of the class Op that are related to the C implementation. Of particular interest are:
 The methods Op.c_libraries() and Op.c_lib_dirs() to allow your op to use external libraries.
 The method Op.c_code_cleanup() to specify how the op should clean up what it has allocated during its execution.
 The methods Op.c_init_code() and Op.c_init_code_apply() to specify code that should be executed once when the module is initialized, before anything else is executed.
 The methods Op.c_compile_args() and Op.c_no_compile_args() to specify requirements regarding how the op's C code should be compiled.
This section describes the methods Op.c_code(), Op.c_support_code(), Op.c_support_code_apply() and Op.c_code_cache_version() because they are the ones that are most commonly used.

c_code(node, name, input_names, output_names, sub)¶
This method returns a string containing the C code to perform the computation required by this op.
The node argument is an Apply node representing an application of the current Op on a list of inputs, producing a list of outputs. input_names is a sequence of strings which contains as many strings as the op has inputs. Each string contains the name of the C variable to which the corresponding input has been assigned. For example, the name of the C variable representing the first input of the op is given by input_names[0]. You should therefore use this name in your C code to interact with that variable. output_names is used identically to input_names, but for the op's outputs.
Finally, sub is a dictionary of extra parameters to the c_code method. Among other things, it contains sub['fail'] which is a string of C code that you should include in your C code (after ensuring that a Python exception is set) if it needs to raise an exception. Ex:

c_code = """
    PyErr_Format(PyExc_ValueError, "X does not have the right value");
    %(fail)s;
""" % {'fail': sub['fail']}
to raise a ValueError Python exception with the specified message. The function PyErr_Format() supports string formatting so it is possible to tailor the error message to the specifics of the error that occurred. If PyErr_Format() is called with more than two arguments, the subsequent arguments are used to format the error message with the same behavior as the function PyString_FromFormat(). The % characters in the format string need to be escaped since the C code itself is defined in a string which undergoes string formatting.

c_code = """
    PyErr_Format(PyExc_ValueError,
                 "X == %%i but it should be greater than 0", X);
    %(fail)s;
""" % {'fail': sub['fail']}
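Because the whole C string goes through Python's %-formatting, a doubled %% survives as a single % for PyErr_Format to consume. The snippet below demonstrates the escaping; 'goto fail_label;' is a made-up stand-in for the code Theano actually puts in sub['fail']:

```python
# Hypothetical stand-in for sub['fail'], only to make the substitution visible.
sub = {'fail': 'goto fail_label;'}

c_code = """
PyErr_Format(PyExc_ValueError,
             "X == %%i but it should be greater than 0", X);
%(fail)s;
""" % {'fail': sub['fail']}

print(c_code)
```

After formatting, the string contains a single %i and the failure code, ready to be compiled as C.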
Note: Your C code should not return the output of the computation but rather store the results in the C variables whose names are contained in output_names.

c_support_code()¶
Returns a string containing some support C code for this op. This code will be included at the global scope level and can be used to define functions and structs that will be used by every apply of this op.

c_support_code_apply(node, name)¶
Returns a string containing some support C code for this op. This code will be included at the global scope level and can be used to define functions and structs that will be used by this op. The difference between this method and c_support_code() is that the C code specified in c_support_code_apply() should be specific to each apply of the op, while c_support_code() is for support code that is not specific to each apply.
Both c_support_code() and c_support_code_apply() are necessary because a Theano op can be used more than once in a given Theano function. For example, an op that adds two matrices could be used at some point in the Theano function to add matrices of integers and, at another point, to add matrices of doubles. Because the dtype of the inputs and outputs can change between different applies of the op, any support code that relies on a certain dtype is specific to a given apply of the op and should therefore be defined in c_support_code_apply().

c_code_cache_version()¶
Returns a tuple of integers representing the version of the C code in this op. Ex: (1, 4, 0) for version 1.4.0.
This tuple is used by Theano to cache the compiled C code for this op. As such, the return value MUST BE CHANGED every time the C code is altered or else Theano will disregard the change in the code and simply load a previous version of the op from the cache. If you want to avoid caching of the C code of this op, return an empty tuple or do not implement this method.
Note: Theano can handle tuples of any hashable objects as return values for this function but, for greater readability and easier management, this function should return a tuple of integers as previously described.
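In practice the method is typically a one-liner that you bump by hand whenever the C code changes. A minimal sketch (the op class is hypothetical):

```python
class MyOp:  # hypothetical op, for illustration only
    def c_code_cache_version(self):
        # Bump this whenever the C code changes, or Theano will keep
        # loading the stale compiled module from its cache.
        return (1, 4, 0)

print(MyOp().c_code_cache_version())  # (1, 4, 0)
```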
Important restrictions when implementing an Op¶
There are some important restrictions to remember when implementing an Op.
Unless your Op correctly defines a view_map
attribute, the perform
and c_code
must not
produce outputs whose memory is aliased to any input (technically, if changing the
output could change the input object in some sense, they are aliased).
Unless your Op correctly defines a destroy_map
attribute, perform
and c_code
must
not modify any of the inputs.
TODO: EXPLAIN DESTROYMAP and VIEWMAP BETTER AND GIVE EXAMPLE.
When developing an Op, you should run computations in DebugMode, by using
argument mode='DebugMode'
to theano.function
. DebugMode is
slow, but it can catch many common violations of the Op contract.
TODO: Like what? How? Talk about Python vs. C too.
DebugMode is no silver bullet though.
For example, if you modify an Op's self.* attributes during any of
make_node, perform, or c_code, you are probably doing something
wrong but DebugMode will not detect this.
TODO: jpt: I don’t understand the following sentence.
Ops and Types should usually be considered immutable – you should
definitely not make a change that would have an impact on __eq__
,
__hash__
, or the mathematical value that would be computed by perform
or c_code
.
Simple C Op example¶
In this section, we put together the concepts that were covered in this
tutorial to generate an op which multiplies every element in a vector
by a scalar and returns the resulting vector. This is intended to be a simple
example so the methods c_support_code()
and c_support_code_apply()
are
not used because they are not required.
In the C code below, notice how the reference count on the output variable is managed. Also take note of how the new variables required for the op's computation are declared in a new scope to avoid cross-initialization errors.
Also, in the C code, it is very important to properly validate the inputs and outputs storage. Theano guarantees that the inputs exist and have the right number of dimensions but it does not guarantee their exact shape. For instance, if an op computes the sum of two vectors, it needs to validate that its two inputs have the same shape. In our case, we do not need to validate the exact shapes of the inputs because we do not require them to match in any way.
For the outputs, things are a little bit more subtle. Theano does not guarantee that they have been allocated but it does guarantee that, if they have been allocated, they have the right number of dimensions. Again, Theano offers no guarantee on the exact shapes. This means that, in our example, we need to validate that the output storage has been allocated and has the same shape as our vector input. If this is not the case, we allocate a new output storage with the right shape and number of dimensions.
import numpy
import theano
from theano import gof
import theano.tensor as T


class VectorTimesScalar(gof.Op):
    __props__ = ()

    def make_node(self, x, y):
        # Validate the inputs' type
        if x.type.ndim != 1:
            raise TypeError('x must be a 1d vector')
        if y.type.ndim != 0:
            raise TypeError('y must be a scalar')

        # Create an output variable of the same type as x
        output_var = x.type()

        return gof.Apply(self, [x, y], [output_var])

    def c_code_cache_version(self):
        return (1, 0)

    def c_code(self, node, name, inp, out, sub):
        x, y = inp
        z, = out

        # Extract the dtypes of the inputs and outputs storage to
        # be able to declare pointers for those dtypes in the C
        # code.
        dtype_x = node.inputs[0].dtype
        dtype_y = node.inputs[1].dtype
        dtype_z = node.outputs[0].dtype

        itemsize_x = numpy.dtype(dtype_x).itemsize
        itemsize_z = numpy.dtype(dtype_z).itemsize

        fail = sub['fail']

        c_code = """
        // Validate that the output storage exists and has the same
        // dimension as x.
        if (NULL == %(z)s ||
            PyArray_DIMS(%(x)s)[0] != PyArray_DIMS(%(z)s)[0])
        {
            /* Reference received to invalid output variable.
            Decrease received reference's ref count and allocate new
            output variable */
            Py_XDECREF(%(z)s);
            %(z)s = (PyArrayObject*)PyArray_EMPTY(1,
                                                  PyArray_DIMS(%(x)s),
                                                  PyArray_TYPE(%(x)s),
                                                  0);

            if (!%(z)s) {
                %(fail)s;
            }
        }

        // Perform the vector multiplication by a scalar
        {
            /* The declaration of the following variables is done in a
            new scope to prevent cross-initialization errors */
            npy_%(dtype_x)s* x_data_ptr =
                            (npy_%(dtype_x)s*)PyArray_DATA(%(x)s);
            npy_%(dtype_z)s* z_data_ptr =
                            (npy_%(dtype_z)s*)PyArray_DATA(%(z)s);
            npy_%(dtype_y)s y_value =
                            ((npy_%(dtype_y)s*)PyArray_DATA(%(y)s))[0];
            int x_stride = PyArray_STRIDES(%(x)s)[0] / %(itemsize_x)s;
            int z_stride = PyArray_STRIDES(%(z)s)[0] / %(itemsize_z)s;
            int x_dim = PyArray_DIMS(%(x)s)[0];

            for(int i=0; i < x_dim; i++)
            {
                z_data_ptr[i * z_stride] = (x_data_ptr[i * x_stride] *
                                            y_value);
            }
        }
        """

        return c_code % locals()
The c_code
method accepts variable names as arguments (name
, inp
,
out
, sub
) and returns a C code fragment that computes the expression
output. In case of error, the %(fail)s
statement cleans up and returns
properly.
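The substitution step at the end of c_code() is ordinary Python %-formatting. The snippet below shows it on a one-line excerpt of the template; 'V3' is a made-up variable name standing in for the one Theano would generate:

```python
# 'V3' is a hypothetical C variable name; Theano generates the real ones.
template = "int x_stride = PyArray_STRIDES(%(x)s)[0] / %(itemsize_x)s;"

x = 'V3'
itemsize_x = 8  # e.g. a float64 input

print(template % locals())
# int x_stride = PyArray_STRIDES(V3)[0] / 8;
```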
More complex C Op example¶
This section introduces a new example, slightly more complex than the previous
one, with an op to perform an elementwise multiplication between the elements
of two vectors. This new example differs from the previous one in its use
of the methods c_support_code()
and c_support_code_apply()
(it does
not need to use them but it does so to explain their use) and its capacity
to support inputs of different dtypes.
Recall that the method c_support_code() is meant to produce code that will
be used for every apply of the op. This means that the C code in this
method must be valid in every setting your op supports. If the op is meant
to support inputs of various dtypes, the C code in this method should be
generic enough to work with every supported dtype. If the op operates on
inputs that can be vectors or matrices, the C code in this method should
be able to accommodate both kinds of inputs.
In our example, the method c_support_code()
is used to declare a C
function to validate that two vectors have the same shape. Because our
op only supports vectors as inputs, this function is allowed to rely
on its inputs being vectors. However, our op should support multiple
dtypes so this function cannot rely on a specific dtype in its inputs.
The method c_support_code_apply(), on the other hand, is allowed
to depend on the inputs to the op because it is apply-specific. Therefore, we
use it to define a function to perform the multiplication between two vectors.
Variables or functions defined in the method c_support_code_apply() will
be included at the global scope for every apply of the Op. Because of this,
the names of those variables and functions should include the name of the op,
like in the example. Otherwise, using the op twice in the same graph will give
rise to conflicts as some elements will be declared more than once.
The last interesting difference occurs in the c_code()
method. Because the
dtype of the output is variable and not guaranteed to be the same as any of
the inputs (because of the upcast in the method make_node()
), the typenum
of the output has to be obtained in the Python code and then included in the
C code.
class VectorTimesVector(gof.Op):
    __props__ = ()

    def make_node(self, x, y):
        # Validate the inputs' type
        if x.type.ndim != 1:
            raise TypeError('x must be a 1d vector')
        if y.type.ndim != 1:
            raise TypeError('y must be a 1d vector')

        # Create an output variable of the same type as x
        output_var = theano.tensor.TensorType(
                        dtype=theano.scalar.upcast(x.dtype, y.dtype),
                        broadcastable=[False])()

        return gof.Apply(self, [x, y], [output_var])

    def c_code_cache_version(self):
        return (1, 0, 2)

    def c_support_code(self):
        c_support_code = """
        bool vector_same_shape(PyArrayObject* arr1,
                               PyArrayObject* arr2)
        {
            return (PyArray_DIMS(arr1)[0] == PyArray_DIMS(arr2)[0]);
        }
        """

        return c_support_code

    def c_support_code_apply(self, node, name):
        dtype_x = node.inputs[0].dtype
        dtype_y = node.inputs[1].dtype
        dtype_z = node.outputs[0].dtype

        c_support_code = """
        void vector_elemwise_mult_%(name)s(npy_%(dtype_x)s* x_ptr,
            int x_str, npy_%(dtype_y)s* y_ptr, int y_str,
            npy_%(dtype_z)s* z_ptr, int z_str, int nbElements)
        {
            for (int i=0; i < nbElements; i++){
                z_ptr[i * z_str] = x_ptr[i * x_str] * y_ptr[i * y_str];
            }
        }
        """

        return c_support_code % locals()

    def c_code(self, node, name, inp, out, sub):
        x, y = inp
        z, = out

        dtype_x = node.inputs[0].dtype
        dtype_y = node.inputs[1].dtype
        dtype_z = node.outputs[0].dtype

        itemsize_x = numpy.dtype(dtype_x).itemsize
        itemsize_y = numpy.dtype(dtype_y).itemsize
        itemsize_z = numpy.dtype(dtype_z).itemsize

        typenum_z = numpy.dtype(dtype_z).num

        fail = sub['fail']

        c_code = """
        // Validate that the inputs have the same shape
        if ( !vector_same_shape(%(x)s, %(y)s))
        {
            PyErr_Format(PyExc_ValueError, "Shape mismatch : "
                         "x.shape[0] and y.shape[0] should match but "
                         "x.shape[0] == %%i and y.shape[0] == %%i",
                         PyArray_DIMS(%(x)s)[0], PyArray_DIMS(%(y)s)[0]);
            %(fail)s;
        }

        // Validate that the output storage exists and has the same
        // dimension as x.
        if (NULL == %(z)s || !(vector_same_shape(%(x)s, %(z)s)))
        {
            /* Reference received to invalid output variable.
            Decrease received reference's ref count and allocate new
            output variable */
            Py_XDECREF(%(z)s);
            %(z)s = (PyArrayObject*)PyArray_EMPTY(1,
                                                  PyArray_DIMS(%(x)s),
                                                  %(typenum_z)s,
                                                  0);

            if (!%(z)s) {
                %(fail)s;
            }
        }

        // Perform the vector elemwise multiplication
        vector_elemwise_mult_%(name)s(
                                (npy_%(dtype_x)s*)PyArray_DATA(%(x)s),
                                PyArray_STRIDES(%(x)s)[0] / %(itemsize_x)s,
                                (npy_%(dtype_y)s*)PyArray_DATA(%(y)s),
                                PyArray_STRIDES(%(y)s)[0] / %(itemsize_y)s,
                                (npy_%(dtype_z)s*)PyArray_DATA(%(z)s),
                                PyArray_STRIDES(%(z)s)[0] / %(itemsize_z)s,
                                PyArray_DIMS(%(x)s)[0]);
        """

        return c_code % locals()
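The output dtype chosen in make_node() comes from theano.scalar.upcast, whose result here matches NumPy's type promotion rules, and the typenum passed into the C code is just numpy.dtype(...).num. Both can be sanity-checked with plain NumPy (the input dtypes below are chosen purely for illustration):

```python
import numpy

# NumPy's promotion mirrors the upcast used for VectorTimesVector's output.
dtype_z = numpy.promote_types('int32', 'float32')
print(dtype_z)                   # float64
print(numpy.dtype(dtype_z).num)  # the typenum handed to the C code
```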
Alternate way of defining C Ops¶
The two previous examples have covered the standard way of implementing C Ops
in Theano by inheriting from the class Op
. This process is mostly
simple but it still involves defining many methods as well as mixing, in the
same file, both Python and C code which tends to make the result less
readable.
To help with this, Theano defines a class, COp
, from which new C ops
can inherit. The class COp
aims to simplify the process of implementing
C ops by doing the following :
 It allows you to define the C implementation of your op in a distinct C code file. This makes it easier to keep your Python and C code readable and well indented.
 It can automatically handle all the methods that return C code, in addition to Op.c_code_cache_version(), based on the provided external C implementation.
To illustrate how much simpler the class COp
makes the process of defining
a new op with a C implementation, let’s revisit the second example of this
tutorial, the VectorTimesVector
op. In that example, we implemented an op
to perform the task of elementwise vector-vector multiplication. The two
following blocks of code illustrate what the op would look like if it was
implemented using the COp
class.
The new op is defined inside a Python file with the following code:

import theano
from theano import gof


class VectorTimesVector(gof.COp):
    __props__ = ()

    func_file = "./vectorTimesVector.c"
    func_name = "APPLY_SPECIFIC(vector_times_vector)"

    def __init__(self):
        super(VectorTimesVector, self).__init__(self.func_file,
                                                self.func_name)

    def make_node(self, x, y):
        # Validate the inputs' type
        if x.type.ndim != 1:
            raise TypeError('x must be a 1d vector')
        if y.type.ndim != 1:
            raise TypeError('y must be a 1d vector')

        # Create an output variable of the same type as x
        output_var = theano.tensor.TensorType(
                        dtype=theano.scalar.upcast(x.dtype, y.dtype),
                        broadcastable=[False])()

        return gof.Apply(self, [x, y], [output_var])
And the following is the C implementation of the op, defined in an external C file named vectorTimesVector.c:

#section support_code

// Support code function
bool vector_same_shape(PyArrayObject* arr1, PyArrayObject* arr2)
{
    return (PyArray_DIMS(arr1)[0] == PyArray_DIMS(arr2)[0]);
}


#section support_code_apply

// Apply-specific support function
void APPLY_SPECIFIC(vector_elemwise_mult)(
        DTYPE_INPUT_0* x_ptr, int x_str,
        DTYPE_INPUT_1* y_ptr, int y_str,
        DTYPE_OUTPUT_0* z_ptr, int z_str, int nbElements)
{
    for (int i=0; i < nbElements; i++){
        z_ptr[i * z_str] = x_ptr[i * x_str] * y_ptr[i * y_str];
    }
}

// Apply-specific main function
int APPLY_SPECIFIC(vector_times_vector)(PyArrayObject* input0,
                                        PyArrayObject* input1,
                                        PyArrayObject** output0)
{
    // Validate that the inputs have the same shape
    if ( !vector_same_shape(input0, input1))
    {
        PyErr_Format(PyExc_ValueError, "Shape mismatch : "
                     "input0.shape[0] and input1.shape[0] should "
                     "match but input0.shape[0] == %i and "
                     "input1.shape[0] == %i",
                     PyArray_DIMS(input0)[0], PyArray_DIMS(input1)[0]);
        return 1;
    }

    // Validate that the output storage exists and has the same
    // dimension as x.
    if (NULL == *output0 || !(vector_same_shape(input0, *output0)))
    {
        /* Reference received to invalid output variable.
        Decrease received reference's ref count and allocate new
        output variable */
        Py_XDECREF(*output0);
        *output0 = (PyArrayObject*)PyArray_EMPTY(1,
                                                 PyArray_DIMS(input0),
                                                 TYPENUM_OUTPUT_0,
                                                 0);
        if (!*output0) {
            PyErr_Format(PyExc_ValueError,
                         "Could not allocate output storage");
            return 1;
        }
    }

    // Perform the actual vector-vector multiplication
    APPLY_SPECIFIC(vector_elemwise_mult)(
                            (DTYPE_INPUT_0*)PyArray_DATA(input0),
                            PyArray_STRIDES(input0)[0] / ITEMSIZE_INPUT_0,
                            (DTYPE_INPUT_1*)PyArray_DATA(input1),
                            PyArray_STRIDES(input1)[0] / ITEMSIZE_INPUT_1,
                            (DTYPE_OUTPUT_0*)PyArray_DATA(*output0),
                            PyArray_STRIDES(*output0)[0] / ITEMSIZE_OUTPUT_0,
                            PyArray_DIMS(input0)[0]);

    return 0;
}
As you can see from this example, the Python and C implementations are nicely decoupled which makes them much more readable than when they were intertwined in the same file and the C code contained string formatting markers.
Now that we have motivated the COp class, we can have a more precise look at what it does for us. For this, we go through the various elements that make up this new version of the VectorTimesVector op :
 Parent class: instead of inheriting from the class Op, VectorTimesVector inherits from the class COp.
 Constructor: in our new op, the __init__() method has an important use: to inform the constructor of the COp class of the location, on the filesystem, of the C implementation of this op. To do this, it gives a list of file paths containing the C code for this op. To auto-generate the c_code method with a function call, you can specify the function name as the second parameter. The paths should be given as relative paths from the folder where the descendant of the COp class is defined.
 make_node(): the make_node() method is absolutely identical to the one in our old example. Using the COp class doesn't change anything here.
 External C code: the external C code implements the various functions associated with the op. Writing this C code involves a few subtleties which deserve their own respective sections.
Main function¶
If you pass a function name to the __init__()
method of the
COp
class, it must respect the following constraints:
 It must return an int. The value of that int indicates whether the op could perform its task or not. A value of 0 indicates success while any non-zero value will interrupt the execution of the Theano function. When returning non-zero, the function must set a Python exception indicating the details of the problem.
 It must receive one argument for each input to the op followed by one pointer argument for each output of the op. The types of the arguments depend on the Types (that is, Theano Types) of your inputs and outputs.
For example, the main C function of an op that takes two TensorTypes
(which have PyArrayObject * as their C type) as inputs and returns
both their sum and the difference between them would have four
parameters (two for the op's inputs and two for its outputs) and its
signature would look something like this:

int sumAndDiffOfScalars(PyArrayObject* in0, PyArrayObject* in1,
                        PyArrayObject** out0, PyArrayObject** out1)
Macros¶
For certain section tags, your C code can benefit from a number of
predefined macros. These section tags have no macros: init_code
,
support_code
. All other tags will have the support macros
discussed below.
APPLY_SPECIFIC(str), which will automatically append a name unique to the Apply node that applies the Op at the end of the provided str. The use of this macro is discussed further below.
For every input which has a dtype attribute (this means
Tensors, and equivalent types on GPU), the following macros will be
defined unless your Op class has an Op.check_input attribute
defined to False. In these descriptions, 'i' refers to the position
(indexed from 0) in the input array.
 DTYPE_INPUT_{i}: NumPy dtype of the data in the array. This is the variable type corresponding to the NumPy dtype, not the string representation of the NumPy dtype. For instance, if the op's first input is a float32 ndarray, then the macro DTYPE_INPUT_0 corresponds to npy_float32 and can directly be used to declare a new variable of the same dtype as the data in the array: DTYPE_INPUT_0 myVar = someValue;
 TYPENUM_INPUT_{i}: Typenum of the data in the array.
 ITEMSIZE_INPUT_{i}: Size, in bytes, of the elements in the array.
In the same way, the macros DTYPE_OUTPUT_{i}
,
ITEMSIZE_OUTPUT_{i}
and TYPENUM_OUTPUT_{i}
are defined for
every output ‘i’ of the op.
In addition to these macros, the init_code_struct
, code
, and
code_cleanup
section tags also have the following macros:
FAIL: Code to insert at error points. A Python exception should be set prior to this code. An invocation looks like this:

if (error) {
    // Set python exception
    FAIL
}

You can add a semicolon after the macro if it makes your editor happy.
PARAMS: Name of the params variable for this node. (Only for Ops which have params, which is discussed elsewhere.)
Finally, the tags code and code_cleanup have macros to
pass the input and output names. These are named INPUT_{i} and
OUTPUT_{i} where i is the 0-based index position in the input
and output arrays respectively.
Support code¶
Certain section are limited in what you can place in them due to
semantic and syntactic restrictions of the C++ language. Most of
these restrictions apply to the tags that end in _struct
.
When we defined the VectorTimesVector op without using the COp
class, we had to make a distinction between two types of support_code:
the support code that was apply-specific and the support code that
wasn't. The apply-specific code was defined in the
c_support_code_apply() method and the elements defined in that
code (global variables and functions) had to include the name of the
Apply node in their own names to avoid conflicts between the different
versions of the apply-specific code. The code that wasn't
apply-specific was simply defined in the c_support_code() method.
To make identifiers that include the Apply node name, use the
APPLY_SPECIFIC(str) macro. In the above example, this macro is
used when defining the functions vector_elemwise_mult() and
vector_times_vector() as well as when calling the function
vector_elemwise_mult() from inside vector_times_vector().
When using the COp class, we still have to make the distinction
between the C code for each of the methods of a C op. These sections of
code are separated by #section <tag> markers. The tag determines
the name of the method this C code applies to, with the rule that
<tag> applies to c_<tag>. Unknown tags are an error and will be
reported. Duplicate tags will be merged together in the order they
appear in the C files.
The rules for knowing where a piece of code should be put can
sometimes be tricky. The key thing to remember is that things that can
be shared between instances of the op should be apply-agnostic and go
into a section which does not end in _apply or _struct. The
distinction between _apply and _struct mostly hinges on how you
want to manage the lifetime of the object. Note that to use an
apply-specific object, you have to be in an apply-specific section, so
some portions of the code that might seem apply-agnostic may still be
apply-specific because of the data they use (this does not include
arguments).
In the above example, the function vector_same_shape() is
apply-agnostic because it uses none of the macros defined by the class
COp and it doesn't rely on any apply-specific code. The function
vector_elemwise_mult() is apply-specific because it uses the
macros defined by COp. Finally, the function
vector_times_vector() is apply-specific because it uses those same
macros and also because it calls vector_elemwise_mult(), which is
an apply-specific function.
Using GDB to debug Op’s C code¶
When debugging C code, it can be useful to use GDB on the code compiled by Theano.
For this, you must compile without optimization by setting the Theano flag cmodule.remove_gxx_opt=True. For the GPU, you must also add the flag nvcc.flags=-g (it slows down computation on the GPU, but debugging symbols are enabled by default on the CPU).
Then you must start Python inside GDB and from it start your Python process (e.g. theano-nose):

$ gdb python
(gdb) r bin/theano-nose theano/
Final Note¶
This tutorial focuses on providing C implementations to ops that manipulate Theano tensors. For more information about other Theano types, you can refer to the section Alternate Theano Types.
Writing an Op to work on an ndarray in C¶
This section walks through a nontrivial example Op that does something pretty
weird and unrealistic, that is hard to express with existing Ops.
(Technically, we could use Scan
to implement the Op we’re about to describe,
but we ignore that possibility for the sake of example.)
The following code works, but important errorchecking has been omitted for clarity. For example, when you write C code that assumes memory is contiguous, you should check the strides and alignment.
import theano
from theano import tensor


class Fibby(theano.Op):
    """
    An arbitrarily generalized Fibbonacci sequence
    """
    __props__ = ()

    def make_node(self, x):
        x_ = tensor.as_tensor_variable(x)
        assert x_.ndim == 1
        return theano.Apply(self,
            inputs=[x_],
            outputs=[x_.type()])
        # using x_.type() is dangerous, it copies x's broadcasting
        # behaviour

    def perform(self, node, inputs, output_storage):
        x, = inputs
        y = output_storage[0][0] = x.copy()
        for i in range(2, len(x)):
            y[i] = y[i - 1] * y[i - 2] + x[i]

    def c_code(self, node, name, inames, onames, sub):
        x, = inames
        y, = onames
        fail = sub['fail']
        return """
            Py_XDECREF(%(y)s);
            %(y)s = (PyArrayObject*)PyArray_FromArray(
                        %(x)s, 0, NPY_ARRAY_ENSURECOPY);
            if (!%(y)s)
                %(fail)s;
            {//New scope needed to make compilation work
                dtype_%(y)s * y = (dtype_%(y)s*)PyArray_DATA(%(y)s);
                dtype_%(x)s * x = (dtype_%(x)s*)PyArray_DATA(%(x)s);
                for (int i = 2; i < PyArray_DIMS(%(x)s)[0]; ++i)
                    y[i] = y[i-1]*y[i-2] + x[i];
            }
            """ % locals()

    def c_code_cache_version(self):
        return (1,)

fibby = Fibby()
In the first two lines of the C function, we make y point to a new array with
the correct size for the output. This is essentially simulating the line
y = x.copy()
.
The variables %(x)s
and %(y)s
are set up by the TensorType to be PyArrayObject
pointers.
TensorType also sets up dtype_%(x)s to be a typedef to the C type for x.
Py_XDECREF(%(y)s);
%(y)s = (PyArrayObject*)PyArray_FromArray(
%(x)s, 0, NPY_ARRAY_ENSURECOPY);
The first line reduces the reference count of the data that y originally pointed to. The second line allocates the new data and makes y point to it.
In C code for a Theano op, NumPy arrays are represented as PyArrayObject C
structs. This is part of the NumPy/SciPy C API documented at
http://docs.scipy.org/doc/numpy/reference/c-api.types-and-structures.html
TODO: NEEDS MORE EXPLANATION.
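As a sanity check, the recurrence computed by both perform() and the C code above can be mirrored in pure NumPy:

```python
import numpy as np

def fibby_reference(x):
    # Mirrors Fibby.perform(): y[i] = y[i-1] * y[i-2] + x[i]
    y = x.copy()
    for i in range(2, len(x)):
        y[i] = y[i - 1] * y[i - 2] + x[i]
    return y

print(fibby_reference(np.array([1., 1., 0., 0., 0.])))  # [1. 1. 1. 1. 1.]
```

In DebugMode, Theano would itself compare the Python and C implementations against each other in the same spirit.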
Writing an Optimization¶
fibby
of a vector of zeros is another vector of zeros of
the same size.
Theano does not attempt to infer this from the code provided via Fibby.perform
or Fibby.c_code
.
However, we can write an optimization that makes use of this observation.
This sort of local substitution of special cases is common,
and there is a stage of optimization (specialization) devoted to such optimizations.
The following optimization (fibby_of_zero
) tests whether the input is
guaranteed to be all zero, and if so it returns the input itself as a replacement
for the old output.
TODO: talk about OPTIMIZATION STAGES
from theano.tensor.opt import get_scalar_constant_value, NotScalarConstantError

# Remove any fibby(zeros(...))
@theano.tensor.opt.register_specialize
@theano.gof.local_optimizer([fibby])
def fibby_of_zero(node):
    if node.op == fibby:
        x = node.inputs[0]
        try:
            if numpy.all(0 == get_scalar_constant_value(x)):
                return [x]
        except NotScalarConstantError:
            pass
The register_specialize
decorator is what activates our optimization, and
tells Theano to use it in the specialization stage.
The local_optimizer
decorator builds a class instance around our global
function. The [fibby]
argument is a hint that our optimizer works on nodes
whose .op
attribute equals fibby
.
The function here (fibby_of_zero
) expects an Apply
instance as an
argument for parameter node
. It tests using
function get_scalar_constant_value
, which determines if a
Variable (x
) is guaranteed to be a constant, and if so, what constant.
Test the optimization¶
Here is some code to test that the optimization is applied only when needed.
import numpy
import theano.tensor as T
from theano import function
from theano import tensor

# Test it does not apply when not needed.
x = T.dvector()
f = function([x], fibby(x))

# We call the function to make sure it runs.
# If you run in DebugMode, it will compare the C and Python outputs.
f(numpy.random.rand(5))
topo = f.maker.fgraph.toposort()
assert len(topo) == 1
assert isinstance(topo[0].op, Fibby)

# Test that the optimization gets applied.
f_zero = function([], fibby(T.zeros([5])))

# If you run in DebugMode, it will compare the output before
# and after the optimization.
f_zero()

# Check that the optimization removes the Fibby Op.
# For security, the Theano memory interface ensures that the output
# of the function is always memory not aliased to the input.
# That is why there is a DeepCopyOp op.
topo = f_zero.maker.fgraph.toposort()
assert len(topo) == 1
assert isinstance(topo[0].op, theano.compile.ops.DeepCopyOp)
Overview of the compilation pipeline¶
The purpose of this page is to explain each step of defining and compiling a Theano function.
Definition of the computation graph¶
By creating Theano Variables using theano.tensor.lscalar or
theano.tensor.dmatrix, or by using Theano functions such as
theano.tensor.sin or theano.tensor.log, the user builds a computation
graph. The structure of that graph and details about its components can
be found in the Graph Structures article.
Compilation of the computation graph¶
Once the user has built a computation graph, she can use
theano.function in order to make one or more functions that
operate on real data. function takes a list of input Variables as well
as a list of output Variables that define a precise subgraph
corresponding to the function(s) we want to define, compiles that
subgraph, and produces a callable.
Here is an overview of the various steps that are done with the computation graph in the compilation phase:
Step 1  Create a FunctionGraph¶
The subgraph given by the end user is wrapped in a structure called FunctionGraph. That structure defines several hooks on adding and removing (pruning) nodes as well as on modifying links between nodes (for example, modifying an input of an Apply node) (see the article about fg – Graph Container [doc TODO] for more information).
FunctionGraph provides a method to change the input of an Apply node from one Variable to another and a more highlevel method to replace a Variable with another. This is the structure that Optimizers work on.
Some relevant Features are typically added to the FunctionGraph, namely to prevent any optimization from operating inplace on inputs declared as immutable.
Step 2  Execute main Optimizer¶
Once the FunctionGraph is made, an optimizer is produced by
the mode passed to function (the Mode basically has two
important fields, linker and optimizer). That optimizer is
applied to the FunctionGraph using its optimize() method.
The optimizer is typically obtained through optdb.
Step 3  Execute linker to obtain a thunk¶
Once the computation graph is optimized, the linker is
extracted from the Mode. It is then called with the FunctionGraph as
argument to produce a thunk, which is a function with no arguments that
returns nothing. Along with the thunk, one list of input containers (a
theano.gof.Container is a sort of object that wraps another and does
type casting) and one list of output containers are produced,
corresponding to the input and output Variables as well as the updates
defined for the inputs when applicable. To perform the computations,
the inputs must be placed in the input containers, the thunk must be
called, and the outputs must be retrieved from the output containers
where the thunk put them.
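The container/thunk calling convention described above can be sketched in plain Python. This is a toy stand-in (plain one-element lists instead of theano.gof.Container objects, and a hand-written thunk), not Theano's actual implementation:

```python
# Containers are one-element lists acting as shared "pointers" to values.
input_containers = [[None]]    # one container per input Variable
output_containers = [[None]]   # one container per output Variable

def thunk():
    # Takes no arguments and returns nothing: it reads its inputs from
    # the input containers and writes results into the output containers.
    x = input_containers[0][0]
    output_containers[0][0] = x * 2

# To evaluate: place the inputs, call the thunk, retrieve the outputs.
input_containers[0][0] = 21
thunk()
result = output_containers[0][0]
```

Because every node shares the same container objects, one node's output container can serve directly as the next node's input container, which is how the OpWiseCLinker chains per-node thunks together.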
Typically, the linker calls the toposort
method in order to obtain
a linear sequence of operations to perform. How they are linked
together depends on the Linker used. The CLinker produces a single
block of C code for the whole computation, whereas the OpWiseCLinker
produces one thunk for each individual operation and calls them in
sequence.
The linker is where some options take effect: the strict flag of
an input makes the associated input container do type checking. The
borrow flag of an output, if False, adds the output to a
no_recycling list, meaning that when the thunk is called the
output containers will be cleared (if they stayed there, as would be the
case if borrow was True, the thunk would be allowed to reuse (or
“recycle”) the storage).
Note
Compiled libraries are stored within a specific compilation directory,
which by default is set to $HOME/.theano/compiledir_xxx, where
xxx identifies the platform (under Windows the default location
is instead $LOCALAPPDATA\Theano\compiledir_xxx). It may be manually set
to a different location either by setting config.compiledir or
config.base_compiledir, either within your Python script or by
using one of the configuration mechanisms described in config.
The compile cache is based upon the C++ code of the graph to be compiled.
So, if you change compilation configuration variables, such as
config.blas.ldflags, you will need to manually remove your compile cache,
using Theano/bin/theano-cache clear
Theano also implements a lock mechanism that prevents
multiple compilations within the same compilation directory (to avoid
crashes with parallel execution of some scripts). This mechanism is
currently enabled by default, but if it causes any problem it may be
disabled using the function theano.gof.compilelock.set_lock_status(..).
Step 4  Wrap the thunk in a pretty package¶
The thunk returned by the linker along with input and output
containers is unwieldy. function
hides that complexity away so
that it can be used like a normal function with arguments and return
values.
Theano vs. C¶
We describe some of the patterns in Theano, and present their closest analogue in a statically typed language such as C:
Theano           | C
-----------------|-------------------------------------------------------------
Apply            | function application / function call
Variable         | local function data / variable
Shared Variable  | global function data / variable
Op               | operations carried out in computation / function definition
Type             | data types
For example:
int d = 0;

int main(int a) {
    int b = 3;
    int c = f(b);
    d = b + c;
    return g(a, c);
}
Based on this code snippet, we can relate f and g to Ops; a, b and c
to Variables; d to a Shared Variable; and g(a, c), f(b) and d = b + c
(taken as meaning the action of computing f, g or + on their respective
inputs) to Applies. Lastly, int could be interpreted as the Theano Type
of the Variables a, b, c and d.
Making the double type¶
Type’s contract¶
In Theano’s framework, a Type (gof.type.Type) is any object which
defines the following methods. To obtain the default methods described
below, the Type should be an instance of Type or should be an instance
of a subclass of Type. If you will write all methods yourself, you need
not use an instance of Type.
Methods with default arguments must be defined with the same signature, i.e. the same default argument names and values. If you wish to add extra arguments to any of these methods, these extra arguments must have default values.

class PureType¶

filter(value, strict=False, allow_downcast=None)¶
This casts a value to match the Type and returns the cast value. If value is incompatible with the Type, the method must raise an exception. If strict is True, filter must return a reference to value (i.e. casting is prohibited). If strict is False, then casting may happen, but downcasting should only be used in two situations:
- if allow_downcast is True
- if allow_downcast is None and the default behavior for this type allows downcasting for the given value (this behavior is type-dependent; you may decide what your own type does by default)
We need to define filter with three arguments. The second argument must be called strict (Theano often calls it by keyword) and must have a default value of False. The third argument must be called allow_downcast and must have a default value of None.

filter_inplace(value, storage, strict=False, allow_downcast=None)¶
If filter_inplace is defined, it will be called instead of filter(). This is to allow reusing the old allocated memory. As of this writing, this is used only when we transfer new data to a shared variable on the GPU.
storage will be the old value, i.e. the old numpy array, CudaNdarray, ...

is_valid_value(value)¶
Returns True iff the value is compatible with the Type.
Default: True iff filter(value, strict=True) does not raise an exception.

values_eq(a, b)¶
Returns True iff a and b are equal.
Default: a == b

values_eq_approx(a, b)¶
Returns True iff a and b are approximately equal, for a definition of “approximately” which varies from Type to Type.
Default: values_eq(a, b)

make_variable(name=None)¶
Makes a Variable of this Type with the specified name, if name is not None. If name is None, then the Variable does not have a name. The Variable will have its type field set to the Type object.
Default: there is a generic definition of this in Type. The Variable’s type will be the object that defines this method (in other words, self).

__call__(name=None)¶
Syntactic shortcut to make_variable.
Default: make_variable

__eq__(other)¶
Used to compare Type instances themselves.
Default: object.__eq__

__hash__()¶
Types should not be mutable, so it should be OK to define a hash function. Typically this function should hash all of the terms involved in __eq__.
Default: id(self)

get_shape_info(obj)¶
Optional. Only needed to profile the memory of this Type of object.
Return the information needed to compute the memory size of obj.
The memory size is only the data, so this excludes the container. For an ndarray, this is the data, but not the ndarray object and other data structures such as shape and strides.
get_shape_info() and get_size() work in tandem for the memory profiler. get_shape_info() is called during the execution of the function, so it is better that it is not too slow. get_size() will be called on the output of this function when printing the memory profile.
Parameters: obj – The object that this Type represents during execution.
Returns: Python object that self.get_size() understands.

get_size(shape_info)¶
Optional. Only needed to profile the memory of this Type of object.
Number of bytes taken by the object represented by shape_info.
Parameters: shape_info – the output of the call to get_shape_info().
Returns: the number of bytes taken by the object described by shape_info.

clone(dtype=None, broadcastable=None)¶
Optional, for TensorType-alikes.
Return a copy of the type with a possibly changed value for dtype and broadcastable (if they aren’t None).
Parameters:
 dtype – New dtype for the copy.
 broadcastable – New broadcastable tuple for the copy.

may_share_memory(a, b)¶
Optional to run, but mandatory for DebugMode. Return True if the Python objects a and b could share memory. Return False otherwise. It is used to debug when Ops did not declare memory aliasing between variables. Can be a static method. It is highly recommended to use it, and it is mandatory for Types in Theano, as our buildbot runs in DebugMode.

For each method, the default is what Type defines for you. So, if you
create an instance of Type or an instance of a subclass of Type, you
must define filter. You might want to override values_eq_approx, as
well as values_eq. The other defaults generally need not be overridden.
For more details you can go see the documentation for Type.
Additional definitions¶
For certain mechanisms, you can register functions and other such things to plug your type into Theano’s mechanisms. These are optional, but they will allow people to use your type with familiar interfaces.
To plug in additional options for the transfer target, define a function which takes a Theano variable and a target argument and returns either a new transferred variable (which can be the same as the input if no transfer is necessary) or None if the transfer can’t be done.
Then register that function by calling register_transfer() with it as argument.
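The expected shape of such a transfer function can be sketched as follows. The target name 'double_device' and the trivial body are hypothetical; only the return-the-variable-or-None convention comes from the contract above:

```python
def double_transfer(var, target):
    # Hypothetical transfer hook for our double type.
    # Contract: return a transferred variable (possibly the input
    # itself if no transfer is necessary), or None if this function
    # cannot perform the requested transfer.
    if target != 'double_device':   # made-up target name
        return None
    return var   # nothing to move for a plain Python float

transferred = double_transfer(1.5, 'double_device')
unsupported = double_transfer(1.5, 'some_other_target')
```

It would then be registered with register_transfer(double_transfer), after which users could request that target through the usual transfer interfaces.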
Defining double¶
We are going to base Type double on Python’s float. We
must define filter and shall override values_eq_approx.
filter
# Note that we shadow Python's function ``filter`` with this
# definition.
def filter(x, strict=False, allow_downcast=None):
    if strict:
        if isinstance(x, float):
            return x
        else:
            raise TypeError('Expected a float!')
    elif allow_downcast:
        return float(x)
    else:   # Covers both the False and None cases.
        x_float = float(x)
        if x_float == x:
            return x_float
        else:
            raise TypeError('The double type cannot accurately represent '
                            'value %s (of type %s): you must explicitly '
                            'allow downcasting if you want to do this.'
                            % (x, type(x)))
If strict is True we need to return x. If strict is True and x is not a
float (for example, x could easily be an int) then it is
incompatible with our Type and we must raise an exception.
If strict is False then we are allowed to cast x to a float,
so if x is an int we will return an equivalent float.
However, if this cast triggers a precision loss (x != float(x)) and
allow_downcast is not True, then we also raise an exception.
Note that here we decided that the default behavior of our type
(when allow_downcast is set to None) would be the same as
when allow_downcast is False, i.e. no precision loss is allowed.
values_eq_approx

def values_eq_approx(x, y, tolerance=1e-4):
    return abs(x - y) / (abs(x) + abs(y)) < tolerance
The second method we define is values_eq_approx. This method
allows approximate comparison between two values respecting our Type’s
constraints. It might happen that an optimization changes the computation
graph in such a way that it produces slightly different variables, for
example because of numerical instability like rounding errors at the
end of the mantissa. For instance, a + a + a + a + a + a might not
actually produce the exact same output as 6 * a (try with a = 0.1),
but with values_eq_approx we do not necessarily mind.
We added an extra tolerance argument here. Since this argument is
not part of the API, it must have a default value, which we
chose to be 1e-4.
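For example (the definition is repeated so the snippet stands alone), the two ways of computing 6 * a compare equal under this method even if the floats differ in their last bits:

```python
def values_eq_approx(x, y, tolerance=1e-4):
    # Relative difference, so the comparison scales with the magnitudes.
    return abs(x - y) / (abs(x) + abs(y)) < tolerance

a = 0.1
summed = a + a + a + a + a + a   # may differ from 6 * a in the last bits
scaled = 6 * a

print(values_eq_approx(summed, scaled))   # True
print(values_eq_approx(0.6, 0.7))         # False: relative error is ~0.077
```

Note that this particular formula divides by abs(x) + abs(y), so it would fail if both values were exactly zero; the version above keeps the document's original definition.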
Note
values_eq is never actually used by Theano, but it might be used
internally in the future. Equality testing in
DebugMode is done using values_eq_approx.
Putting them together¶
What we want is an object that respects the aforementioned
contract. Recall that Type defines default implementations for all
required methods of the interface, except filter. One way to make
the Type is to instantiate a plain Type and set the needed fields:
from theano import gof
double = gof.Type()
double.filter = filter
double.values_eq_approx = values_eq_approx
Another way to make this Type is to make a subclass of gof.Type
and define filter and values_eq_approx in the subclass:
from theano import gof

class Double(gof.Type):

    def filter(self, x, strict=False, allow_downcast=None):
        # See code above.
        ...

    def values_eq_approx(self, x, y, tolerance=1e-4):
        # See code above.
        ...

double = Double()
double is then an instance of Type Double, which in turn is a
subclass of Type.
There is a small issue with defining double this way. All
instances of Double are technically the same Type. However, different
Double Type instances do not compare the same:
>>> double1 = Double()
>>> double2 = Double()
>>> double1 == double2
False
Theano compares Types using == to see if they are the same.
This happens in DebugMode. Also, Ops can (and should) ensure that their inputs
have the expected Type by checking something like if x.type == lvector.
There are several ways to make sure that equality testing works properly:
1. Define Double.__eq__ so that instances of type Double are equal. For example:
   def __eq__(self, other):
       return type(self) is Double and type(other) is Double
2. Override Double.__new__ to always return the same instance.
3. Hide the Double class and only advertise a single instance of it.
Here we will prefer the final option, because it is the simplest.
Ops in the Theano code often define the __eq__ method though.
Untangling some concepts¶
Initially, confusion is common on what an instance of Type is versus
a subclass of Type or an instance of Variable. Some of this confusion is
syntactic. A Type is any object which has fields corresponding to the
functions defined above. The Type class provides sensible defaults for
all of them except filter, so when defining new Types it is natural
to subclass Type. Therefore, we often end up with Type subclasses and
it can be confusing what these represent semantically. Here is an
attempt to clear up the confusion:
 An instance of Type (or an instance of a subclass) is a set of constraints on real data. It is akin to a primitive type or class in C. It is a static annotation.
 An instance of Variable symbolizes data nodes in a data flow graph. If you were to parse the C expression int x;, int would be a Type instance and x would be a Variable instance of that Type instance. If you were to parse the C expression c = a + b;, then a, b and c would all be Variable instances.
 A subclass of Type is a way of implementing a set of Type instances that share structural similarities. In the double example that we are doing, there is actually only one Type in that set, therefore the subclass does not represent anything that one of its instances does not. In this case it is a singleton, a set with one element. However, the TensorType class in Theano (which is a subclass of Type) represents a set of types of tensors parametrized by their data type or number of dimensions. We could say that subclassing Type builds a hierarchy of Types which is based upon structural similarity rather than compatibility.
Final version¶
from theano import gof

class Double(gof.Type):

    def filter(self, x, strict=False, allow_downcast=None):
        if strict:
            if isinstance(x, float):
                return x
            else:
                raise TypeError('Expected a float!')
        elif allow_downcast:
            return float(x)
        else:   # Covers both the False and None cases.
            x_float = float(x)
            if x_float == x:
                return x_float
            else:
                raise TypeError('The double type cannot accurately represent '
                                'value %s (of type %s): you must explicitly '
                                'allow downcasting if you want to do this.'
                                % (x, type(x)))

    def values_eq_approx(self, x, y, tolerance=1e-4):
        return abs(x - y) / (abs(x) + abs(y)) < tolerance

    def __str__(self):
        return "double"

double = Double()
We add one utility function, __str__. That way, when we print
double, it will print out something intelligible.
Making arithmetic Ops on double¶
Now that we have a double
type, we have yet to use it to perform
computations. We’ll start by defining multiplication.
Op’s contract¶
An Op is any object which inherits from gof.Op
. It has to
define the following methods.

make_node(*inputs)¶
This method is responsible for creating output Variables of a suitable symbolic Type to serve as the outputs of this Op’s application. The Variables found in *inputs must be operated on using Theano’s symbolic language to compute the symbolic output Variables. This method should put these outputs into an Apply instance, and return the Apply instance.
This method creates an Apply node representing the application of the Op on the inputs provided. If the Op cannot be applied to these inputs, it must raise an appropriate exception.
The inputs of the Apply instance returned by this call must be ordered correctly: a subsequent self.make_node(*apply.inputs) must produce something equivalent to the first apply.

perform
(node, inputs, output_storage)¶ This method computes the function associated to this Op.
node
is an Apply node created by the Op’smake_node
method.inputs
is a list of references to data to operate on using nonsymbolic statements, (i.e., statements in Python, Numpy).output_storage
is a list of storage cells where the variables of the computation must be put.More specifically:
node
: This is a reference to an Apply node which was previously obtained via theOp
‘smake_node
method. It is typically not used in simple Ops, but it contains symbolic information that could be required for complex Ops.inputs
: This is a list of data from which the values stored inoutput_storage
are to be computed using nonsymbolic language.output_storage
: This is a list of storage cells where the output is to be stored. A storage cell is a oneelement list. It is forbidden to change the length of the list(s) contained inoutput_storage
. There is one storage cell for each output of the Op.The data put in
output_storage
must match the type of the symbolic output. This is a situation where thenode
argument can come in handy.A function Mode may allow
output_storage
elements to persist between evaluations, or it may resetoutput_storage
cells to hold a value ofNone
. It can also preallocate some memory for the Op to use. This feature can allowperform
to reuse memory between calls, for example. If there is something preallocated in theoutput_storage
, it will be of the good dtype, but can have the wrong shape and have any stride pattern.
This method must be determined by the inputs. That is to say, if it is evaluated once on inputs A and returned B, then if ever inputs C, equal to A, are presented again, then outputs equal to B must be returned again.
You must be careful about aliasing outputs to inputs, and making modifications to any of the inputs. See Views and inplace operations before writing a
perform
implementation that does either of these things.
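As a minimal sketch of this calling convention, here is a hypothetical perform for an elementwise multiplication (a free function standing in for the Op method, with the runtime call simulated by hand):

```python
def perform(node, inputs, output_storage):
    # inputs: a list of plain data values, one per input Variable.
    # output_storage: one one-element list ("storage cell") per output;
    # we write results into cell[0], never changing the cell itself.
    x, y = inputs
    output_storage[0][0] = x * y

# Simulating how the runtime would call it:
storage = [[None]]               # one cell for the single output
perform(None, [3.0, 4.0], storage)
result = storage[0][0]
```

Note that the function writes into the existing cell rather than rebinding output_storage, which is exactly why the cells can be shared with the containers the linker produced.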
Instead of (or in addition to) perform(), you can also provide a
C implementation. For more details, refer to the
documentation for Op.

__eq__(other)¶
other is also an Op.
Returning True here is a promise to the optimization system that the other Op will produce exactly the same graph effects (from perform) as this one, given identical inputs. This means it will produce the same output values, it will destroy the same inputs (same destroy_map), and will alias outputs to the same inputs (same view_map). For more details, see Views and inplace operations.
Note
If you set __props__, this will be automatically generated.

__hash__
()¶ If two Op instances compare equal, then they must return the same hash value.
Equally important, this hash value must not change during the lifetime of self. Op instances should be immutable in this sense.
Note
If you set __props__, this will be automatically generated.
Optional methods or attributes¶

__props__¶
Default: Undefined
Must be a tuple. Lists the names of the attributes which influence the computation performed. This will also enable the automatic generation of appropriate __eq__, __hash__ and __str__ methods. Should be set to () if you have no attributes that are relevant to the computation, so that these methods are still generated.
New in version 0.7.

default_output¶
Default: None
If this member variable is an integer, then the default implementation of __call__ will return node.outputs[self.default_output], where node was returned by make_node. Otherwise, the entire list of outputs will be returned, unless it is of length 1, where the single element will be returned by itself.

make_thunk
(node, storage_map, compute_map, no_recycling)¶ This function must return a thunk, that is a zeroarguments function that encapsulates the computation to be performed by this op on the arguments of the node.
Parameters:  node – Apply instance The node for which a thunk is requested.
 storage_map – dict of lists This maps variables to a oneelement lists holding the variable’s current value. The oneelement list acts as pointer to the value and allows sharing that “pointer” with other nodes and instances.
 compute_map – dict of lists This maps variables to oneelement lists holding booleans. If the value is 0 then the variable has not been computed and the value should not be considered valid. If the value is 1 the variable has been computed and the value is valid. If the value is 2 the variable has been garbagecollected and is no longer valid, but shouldn’t be required anymore for this call.
 no_recycling – WRITEME WRITEME
The returned function must ensure that it sets the computed variables as computed in the compute_map.
Defining this function removes the requirement for
perform()
or C code, as you will define the thunk for the computation yourself.

__call__
(*inputs, **kwargs)¶ By default this is a convenience function which calls
make_node()
with the supplied arguments and returns the result indexed by default_output. This can be overridden by subclasses to do anything else, but must return either a theano Variable or a list of Variables.If you feel the need to override __call__ to change the graph based on the arguments, you should instead create a function that will use your Op and build the graphs that you want and call that instead of the Op instance directly.

infer_shape(node, shapes)¶
This function is needed for shape optimization. shapes is a list with one tuple for each input of the Apply node (which corresponds to the inputs of the op). Each tuple contains as many elements as the number of dimensions of the corresponding input. The value of each element is the shape (number of items) along the corresponding dimension of that specific input.
While this might sound complicated, it is nothing more than the shape of each input as symbolic variables (one per dimension).
The function should return a list with one tuple for each output. Each tuple should contain the corresponding output’s computed shape.
Implementing this method will allow Theano to compute the output’s shape without computing the output itself, potentially sparing you a costly recomputation.
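For an elementwise Op (such as a multiplication of two same-shaped inputs), a sketch of infer_shape simply passes the first input's shape through; plain tuples stand in for the symbolic shape variables here:

```python
def infer_shape(node, shapes):
    # shapes holds one tuple of (symbolic) dimension sizes per input.
    # An elementwise op's single output has the same shape as its
    # first input, so return that tuple, wrapped in a one-element
    # list (one entry per output).
    return [shapes[0]]

# With two 3x4 inputs, the single output is also 3x4:
out_shapes = infer_shape(None, [(3, 4), (3, 4)])
```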

flops
(inputs, outputs)¶ It is only used to have more information printed by the memory profiler. It makes it print the mega flops and giga flops per second for each apply node. It takes as inputs two lists: one for the inputs and one for the outputs. They contain tuples that are the shapes of the corresponding inputs/outputs.

__str__()¶
This allows you to specify a more informative string representation of your Op. If an Op has parameters, it is highly recommended to have the __str__ method include the name of the op and the Op’s parameters’ values.
Note
If you set __props__, this will be automatically generated. You can still override it for custom output.

do_constant_folding
(node)¶ Default: Return True
By default when optimizations are enabled, we remove during function compilation Apply nodes whose inputs are all constants. We replace the Apply node with a Theano constant variable. This way, the Apply node is not executed at each function call. If you want to force the execution of an op during the function call, make do_constant_folding return False.
As done in the Alloc op, you can return False only in some cases by analyzing the graph from the node parameter.

debug_perform
(node, inputs, output_storage)¶ Undefined by default.
If you define this function then it will be used instead of C code or perform() to do the computation while debugging (currently DebugMode, but others may also use it in the future). It has the same signature and contract as
perform()
.This enables ops that cause trouble with DebugMode with their normal behaviour to adopt a different one when run under that mode. If your op doesn’t have any problems, don’t implement this.
If you want your op to work with gradient.grad() you also need to implement the functions described below.
Gradient¶
These are the function required to work with gradient.grad().

grad(inputs, output_gradients)¶
If the Op being defined is differentiable, its gradient may be specified symbolically in this method. Both inputs and output_gradients are lists of symbolic Theano Variables and those must be operated on using Theano’s symbolic language. The grad method must return a list containing one Variable for each input. Each returned Variable represents the gradient with respect to that input computed based on the symbolic gradients with respect to each output.
If the output is not differentiable with respect to an input then this method should be defined to return a variable of type NullType for that input. Likewise, if you have not implemented the grad computation for some input, you may return a variable of type NullType for that input. theano.gradient contains convenience methods that can construct the variable for you: theano.gradient.grad_undefined() and theano.gradient.grad_not_implemented(), respectively.
If an element of output_gradient is of type theano.gradient.DisconnectedType, it means that the cost is not a function of this output. If any of the op’s inputs participate in the computation of only disconnected outputs, then Op.grad should return DisconnectedType variables for those inputs.
If the grad method is not defined, then Theano assumes it has been forgotten. Symbolic differentiation will fail on a graph that includes this Op.
It must be understood that the Op’s grad method is not meant to return the gradient of the Op’s output. theano.tensor.grad computes gradients; Op.grad is a helper function that computes terms that appear in gradients.
If an Op has a single vectorvalued output y and a single vectorvalued input x, then the grad method will be passed x and a second vector z. Define J to be the Jacobian of y with respect to x. The Op’s grad method should return dot(J.T,z). When theano.tensor.grad calls the grad method, it will set z to be the gradient of the cost C with respect to y. If this op is the only op that acts on x, then dot(J.T,z) is the gradient of C with respect to x. If there are other ops that act on x, theano.tensor.grad will have to add up the terms of x’s gradient contributed by the other op’s grad method.
In practice, an op’s input and output are rarely implemented as single vectors. Even if an op’s output consists of a list containing a scalar, a sparse matrix, and a 4D tensor, you can think of these objects as being formed by rearranging a vector. Likewise for the input. In this view, the values computed by the grad method still represent a Jacobianvector product.
In practice, it is probably not a good idea to explicitly construct the Jacobian, which might be very large and very sparse. However, the returned value should be equal to the Jacobianvector product.
So long as you implement this product correctly, you need not understand what theano.tensor.grad is doing, but for the curious the mathematical justification is as follows:
In essence, the grad method must simply implement through symbolic Variables and operations the chain rule of differential calculus. The chain rule is the mathematical procedure that allows one to calculate the total derivative of the final scalar symbolic Variable C with respect to a primitive symbolic Variable x found in the list
inputs
. The grad method does this usingoutput_gradients
which provides the total derivative of C with respect to a symbolic Variable that is returned by the Op (this is provided inoutput_gradients
), as well as the knowledge of the total derivative of the latter with respect to the primitive Variable (this has to be computed).In mathematics, the total derivative of a scalar variable (C) with respect to a vector of scalar variables (x), i.e. the gradient, is customarily represented as the row vector of the partial derivatives, whereas the total derivative of a vector of scalar variables (f) with respect to another (x), is customarily represented by the matrix of the partial derivatives, i.e.the jacobian matrix. In this convenient setting, the chain rule instructs that the gradient of the final scalar variable C with respect to the primitive scalar variables in x through those in f is simply given by the matrix product: .
Here, the chain rule must be implemented in a similar but slightly more complex setting: Theano provides in the list output_gradients one gradient for each of the Variables returned by the Op. Where f is one such particular Variable, the corresponding gradient found in output_gradients and representing dC/df is provided with a shape similar to f and thus not necessarily as a row vector of scalars. Furthermore, for each Variable x of the Op’s list of input variables inputs, the returned gradient representing dC/dx must have a shape similar to that of Variable x.
If the output list of the op is [f_1, ..., f_n], then the list output_gradients is [grad_{f_1}(C), ..., grad_{f_n}(C)]. If inputs consists of the list [x_1, ..., x_m], then Op.grad should return the list [grad_{x_1}(C), ..., grad_{x_m}(C)], where (grad_{y}(Z))_i = dZ/dy_i (and i can stand for multiple dimensions).
In other words, grad() does not return df_j/dx_i, but instead the appropriate dot product specified by the chain rule: dC/dx_i = (dC/df_j) * (df_j/dx_i). Both the partial differentiation and the multiplication have to be performed by grad().
Theano currently imposes the following constraints on the values returned by the grad method:
 They must be Variable instances.
 When they are types that have dtypes, they must never have an integer dtype.
The output gradients passed to Op.grad will also obey these constraints.
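As a concrete illustration of the chain-rule product, here is a NumPy sketch (not Theano code; square_grad is a hypothetical name) of what the grad method of an elementwise square Op would compute:

```python
import numpy as np

# Hypothetical sketch: what grad() computes for an elementwise Op f(x) = x**2.
# output_gradients holds dC/df (shaped like f's output); grad() must return
# the chain-rule product dC/dx = dC/df * df/dx, shaped like the input x.
def square_grad(x, output_gradients):
    (g_f,) = output_gradients      # dC/df, same shape as f(x)
    return [g_f * 2.0 * x]         # dC/dx = dC/df * 2x, same shape as x

x = np.array([1.0, 2.0, 3.0])
g_f = np.ones_like(x)              # pretend C = sum(f), so dC/df is all ones
(g_x,) = square_grad(x, [g_f])     # g_x == [2., 4., 6.]
```

Note that the returned gradient has the shape of the input, not the shape of a Jacobian: the dot product with output_gradients has already been performed.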
Integers are a tricky subject. Integers are the main reason for having DisconnectedType, NullType or a zero gradient. When you have an integer as an argument to your grad method, recall the definition of a derivative to help you decide what value to return:

f'(x) = lim_{epsilon -> 0} (f(x + epsilon) - f(x)) / epsilon.
Suppose your function f has an integer-valued output. For most functions you're likely to implement in Theano, this means your gradient should be zero, because f(x + epsilon) = f(x) for almost all x. (The only other option is that the gradient could be undefined, if your function is discontinuous everywhere, like the rational indicator function.)
Suppose your function f has an integer-valued input. This is a little trickier, because you need to think about what you mean mathematically when you make a variable integer-valued in Theano. Most of the time in machine learning we mean "f is a function of a real-valued x, but we are only going to pass in integer values of x". In this case, f(x + epsilon) exists, so the gradient through f should be the same whether x is an integer or a floating point variable. Sometimes what we mean is "f is a function of an integer-valued x, and f is only defined where x is an integer." Since f(x + epsilon) doesn't exist, the gradient is undefined. Finally, many times in Theano, integer-valued inputs don't actually affect the elements of the output, only its shape.
If your function f has both an integer-valued input and an integer-valued output, then both rules have to be combined:
 If f is defined at (x+epsilon), then the input gradient is defined. Since f(x+epsilon) would be equal to f(x) almost everywhere, the gradient should be 0 (first rule).
 If f is only defined where x is an integer, then the gradient is undefined, regardless of what the gradient with respect to the output is.
Examples:
 f(x,y) = dot product between x and y. x and y are integers.
Since the output is also an integer, f is a step function. Its gradient is zero almost everywhere, so Op.grad should return zeros in the shape of x and y.
 f(x,y) = dot product between x and y. x is floating point and y is an integer.
In this case the output is floating point. It doesn’t matter that y is an integer. We consider f to still be defined at f(x,y+epsilon). The gradient is exactly the same as if y were floating point.
 f(x,y) = argmax of x along axis y.
The gradient with respect to y is undefined, because f(x, y) is not defined for floating point y. How could you take an argmax along a fractional axis? The gradient with respect to x is 0, because f(x + epsilon, y) = f(x, y) almost everywhere.
 f(x,y) = a vector with y elements, each of which taking on the value x
The grad method should return DisconnectedType()() for y, because the elements of f don’t depend on y. Only the shape of f depends on y. You probably also want to implement a connection_pattern method to encode this.
 f(x) = int(x) converts float x into an int. g(y) = float(y) converts an integer y into a float.
If the final cost C = 0.5 * g(y) = 0.5 g(f(x)), then the gradient with respect to y will be 0.5, even if y is an integer. However, the gradient with respect to x will be 0, because the output of f is integervalued.
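The "integer-valued output" rule can be checked numerically with plain NumPy: a step function such as floor is flat almost everywhere, so its finite-difference gradient is zero.

```python
import numpy as np

# f(x) = floor(x) has an integer-valued output: f(x + epsilon) == f(x) for
# almost every x, so the derivative is 0 almost everywhere (and undefined
# exactly at the integers).
x, eps = 2.3, 1e-6
num_grad = (np.floor(x + eps) - np.floor(x)) / eps   # 0.0 at non-integer x
```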

connection_pattern(node):
Sometimes needed for proper operation of gradient.grad(). Returns a list of lists of bools. Op.connection_pattern(node)[input_idx][output_idx] is true if the elements of inputs[input_idx] have an effect on the elements of outputs[output_idx]. The node parameter is needed to determine the number of inputs; some ops, such as Subtensor, take a variable number of inputs.

If no connection_pattern is specified, gradient.grad will assume that all inputs have some elements connected to some elements of all outputs.
This method conveys two pieces of information that are otherwise not part of the theano graph:
 Which of the op’s inputs are truly ancestors of each of the op’s outputs. Suppose an op has two inputs, x and y, and outputs f(x) and g(y). y is not really an ancestor of f, but it appears to be so in the theano graph.
 Whether the actual elements of each input/output are relevant to a computation. For example, the shape op does not read its input’s elements, only its shape metadata. d shape(x) / dx should thus raise a disconnected input exception (if these exceptions are enabled). As another example, the elements of the Alloc op’s outputs are not affected by the shape arguments to the Alloc op.
Failing to implement this function for an op that needs it can result in two types of incorrect behavior:
 gradient.grad erroneously raising a TypeError reporting that a gradient is undefined.
 gradient.grad failing to raise a ValueError reporting that an input is disconnected.
Even if connection_pattern is not implemented correctly, if gradient.grad returns an expression, that expression will be numerically correct.
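To make the contract concrete, here is a minimal sketch (plain Python, Theano Op boilerplate omitted; AllocLikeOp is a hypothetical name) of a connection_pattern for an Op whose second input only determines the output's shape:

```python
# Hypothetical Alloc-like Op: output 0 is a vector of `shape` copies of
# `value`. The output's ELEMENTS depend on input 0 (value), but only the
# output's SHAPE depends on input 1 (shape), so input 1 is disconnected
# in the element-wise sense.
class AllocLikeOp(object):
    def connection_pattern(self, node):
        # pattern[input_idx][output_idx]
        return [[True],    # input 0 (value) affects output 0's elements
                [False]]   # input 1 (shape) never affects output elements

pattern = AllocLikeOp().connection_pattern(None)   # node is unused in this sketch
```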

R_op(inputs, eval_points)¶
Optional, to work with gradient.R_op().

This function implements the application of the R-operator on the function represented by your op. Let us assume that function is f, with input x; applying the R-operator means computing the Jacobian of f and right-multiplying it by v, the evaluation point, namely: (∂f/∂x) · v.

inputs are the symbolic variables corresponding to the value of the input where you want to evaluate the Jacobian, and eval_points are the symbolic variables corresponding to the value you want to right-multiply the Jacobian with.

The same conventions as for the grad method hold. If your op is not differentiable, you can return None. Note that in contrast to grad(), for R_op() you need to return the same number of outputs as there are outputs of the op. You can think of it in the following terms. You have all your inputs concatenated into a single vector x. You do the same with the evaluation points (which are as many as the inputs and of the same shape) and obtain another vector v. For each output, you reshape it into a vector, compute the Jacobian of that vector with respect to x and multiply it by v. As a last step you reshape each of these vectors (they have the same shapes as the outputs) back to their corresponding output shapes and return them as the output of the R_op() method.
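The Jacobian-times-vector product can be checked with NumPy for an elementwise square, whose Jacobian is diag(2x); square_R_op below is a hypothetical name, not a Theano API:

```python
import numpy as np

# R-operator sketch for elementwise f(x) = x**2: the Jacobian is diag(2x),
# so J . v = 2 * x * v, elementwise.
def square_R_op(inputs, eval_points):
    (x,), (v,) = inputs, eval_points
    return [2.0 * x * v]

x = np.array([1.0, 2.0])
v = np.array([0.5, -1.0])
(jv,) = square_R_op([x], [v])      # jv == [1., -4.]

# A directional finite difference agrees with the Jacobian-vector product:
eps = 1e-6
approx = ((x + eps * v) ** 2 - x ** 2) / eps
```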
Defining an Op: mul¶
We’ll define multiplication as a binary operation, even though a multiplication Op could take an arbitrary number of arguments.
First, we’ll instantiate a mul
Op:
from theano import gof
mul = gof.Op()
make_node
This function must take as many arguments as the operation we are
defining is supposed to take as inputs—in this example that would be
two. This function ensures that both inputs have the double
type.
Since multiplying two doubles yields a double, this function makes an
Apply node with an output Variable of type double
.
def make_node(x, y):
    if x.type != double or y.type != double:
        raise TypeError('mul only works on doubles')
    return gof.Apply(mul, [x, y], [double()])
mul.make_node = make_node
The first two lines make sure that both inputs are Variables of the
double
type that we created in the previous section. We would not
want to multiply two arbitrary types; it would not make much sense
(and we'd be screwed when we implement this in C!).
The last line is the meat of the definition. There we create an Apply
node representing the application of Op mul
to inputs x
and
y
, giving a Variable instance of type double
as the output.
Note
Theano relies on the fact that if you call the make_node
method
of Apply’s first argument on the inputs passed as the Apply’s
second argument, the call will not fail and the returned Apply
instance will be equivalent. This is how graphs are copied.
perform
This code actually computes the function.
In our example, the data in inputs
will be instances of Python’s
builtin type float
because this is the type that double.filter()
will always return, per our own definition. output_storage
will
contain a single storage cell for the multiplication’s variable.
def perform(node, inputs, output_storage):
    x, y = inputs[0], inputs[1]
    z = output_storage[0]
    z[0] = x * y
mul.perform = perform
Here, z
is a list of one element. By default, z == [None]
.
Note
It is possible that z
does not contain None
. If it contains
anything else, Theano guarantees that whatever it contains is what
perform
put there the last time it was called with this
particular storage. Furthermore, Theano gives you permission to do
whatever you want with z
‘s contents, chiefly reusing it or the
memory allocated for it. More information can be found in the
Op documentation.
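For Ops whose outputs are ndarrays, a perform that takes advantage of this permission might look like the following sketch (hypothetical; the double example above cannot reuse storage, since Python floats are immutable):

```python
import numpy as np

# Hypothetical sketch: a perform() for an ndarray-based multiplication Op
# that reuses the storage cell between calls instead of allocating a fresh
# output array every time.
def perform(node, inputs, output_storage):
    x, y = inputs
    z = output_storage[0]                  # a one-element list, [None] by default
    if z[0] is None or z[0].shape != x.shape:
        z[0] = np.empty_like(x)            # allocate only on first use or shape change
    np.multiply(x, y, out=z[0])            # write into the (possibly reused) buffer

cell = [None]
perform(None, [np.array([2.0, 3.0]), np.array([4.0, 5.0])], [cell])
# cell[0] now holds [8., 15.]; a second call with the same shape reuses the buffer
```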
Warning
We gave z
the Theano type double
in make_node
, which means
that a Python float
must be put there. You should not put, say, an
int
in z[0]
because Theano assumes Ops handle typing properly.
Trying out our new Op¶
In the following code, we use our new Op:
>>> import theano
>>> x, y = double('x'), double('y')
>>> z = mul(x, y)
>>> f = theano.function([x, y], z)
>>> f(5, 6)
30.0
>>> f(5.6, 6.7)
37.519999999999996
Note that there is an implicit call to
double.filter()
on each argument, so if we give integers as inputs
they are magically cast to the right type. Now, what if we try this?
>>> x = double('x')
>>> z = mul(x, 2)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/u/breuleuo/hg/theano/theano/gof/op.py", line 207, in __call__
File "<stdin>", line 2, in make_node
AttributeError: 'int' object has no attribute 'type'
Automatic Constant Wrapping¶
Well, OK. We’d like our Op to be a bit more flexible. This can be done
by modifying make_node
to accept Python int
or float
as
x
and/or y
:
def make_node(x, y):
    if isinstance(x, (int, float)):
        x = gof.Constant(double, x)
    if isinstance(y, (int, float)):
        y = gof.Constant(double, y)
    if x.type != double or y.type != double:
        raise TypeError('mul only works on doubles')
    return gof.Apply(mul, [x, y], [double()])
mul.make_node = make_node
Whenever we pass a Python int or float instead of a Variable as x or y, make_node will convert it to a Constant for us. gof.Constant is a Variable whose value we know statically.
>>> import numpy
>>> x = double('x')
>>> z = mul(x, 2)
>>> f = theano.function([x], z)
>>> f(10)
20.0
>>> numpy.allclose(f(3.4), 6.8)
True
Now the code works the way we want it to.
Note
Most Theano Ops follow this convention of upcasting literal
make_node arguments to Constants.
This makes typing expressions more natural. If you do
not want a constant somewhere in your graph, you have to pass a Variable
(like double('x')
here).
Final version¶
The above example is pedagogical. When you define other basic arithmetic
operations add
, sub
and div
, code for make_node
can be
shared between these Ops. Here is a revised implementation of these four
arithmetic operators:
from theano import gof

class BinaryDoubleOp(gof.Op):
    __props__ = ("name", "fn")

    def __init__(self, name, fn):
        self.name = name
        self.fn = fn

    def make_node(self, x, y):
        if isinstance(x, (int, float)):
            x = gof.Constant(double, x)
        if isinstance(y, (int, float)):
            y = gof.Constant(double, y)
        if x.type != double or y.type != double:
            raise TypeError('%s only works on doubles' % self.name)
        return gof.Apply(self, [x, y], [double()])

    def perform(self, node, inp, out):
        x, y = inp
        z, = out
        z[0] = self.fn(x, y)

    def __str__(self):
        return self.name

add = BinaryDoubleOp(name='add',
                     fn=lambda x, y: x + y)
sub = BinaryDoubleOp(name='sub',
                     fn=lambda x, y: x - y)
mul = BinaryDoubleOp(name='mul',
                     fn=lambda x, y: x * y)
div = BinaryDoubleOp(name='div',
                     fn=lambda x, y: x / y)
Instead of working directly on an instance of Op, we create a subclass of
Op that we can parametrize. All the operations we define are binary. They
all work on two inputs with type double
. They all return a single
Variable of type double
. Therefore, make_node
does the same thing
for all these operations, except for the Op reference self
passed
as first argument to Apply. We define perform
using the function
fn
passed in the constructor.
This design is a flexible way to define basic operations without duplicating code. The same way a Type subclass represents a set of structurally similar types (see previous section), an Op subclass represents a set of structurally similar operations: operations that have the same input/output types, operations that only differ in one small detail, etc. If you see common patterns in several Ops that you want to define, it can be a good idea to abstract out what you can. Remember that an Op is just an object which satisfies the contract described above on this page and that you should use all the tools at your disposal to create these objects as efficiently as possible.
Exercise: Make a generic DoubleOp, where the number of arguments can also be given as a parameter.
Views and in-place operations¶
Theano allows the definition of Ops which return a view on one
of their inputs or operate in-place on one or several
inputs. This allows more efficient operations on NumPy's ndarray
data type than would be possible otherwise.
However, in order to work correctly, these Ops need to
implement an additional interface.
Theano recognizes views and in-place operations specially. It ensures that they are used in a consistent manner and that operations will be carried out in a compatible order.
An unfortunate fact is that it is impossible to return a view on an
input with the double type or to operate in-place on it (Python
floats are immutable). Therefore, we can't make examples of these
concepts out of what we've just built. Nonetheless, we will present
the concepts:
Views¶
A “view” on an object x
is an object y
which shares memory
with x
in some way. In other words, changing x
might also
change y
and vice versa. For example, imagine a vector
structure
which contains two fields: an integer length and a pointer to a memory
buffer. Suppose we have:
x = vector {length: 256,
address: 0xDEADBEEF}
y = vector {length: 224,
address: 0xDEADBEEF + 0x10}
z = vector {length: 256,
address: 0xCAFEBABE}
So x uses the memory range 0xDEADBEEF - 0xDEADBFEF, y the
range 0xDEADBEFF - 0xDEADBFDF, and z the range 0xCAFEBABE -
0xCAFEBBBE. Since the ranges for x and y overlap, y is
considered to be a view of x and vice versa.
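The same overlap situation can be reproduced with NumPy, where a slice is a view sharing memory with the array it was taken from:

```python
import numpy as np

# x plays the role of the 256-byte buffer from the example above.
x = np.arange(256, dtype=np.uint8)
y = x[16:240]        # overlapping storage: y is a view of x
z = x.copy()         # separate buffer: z is NOT a view

# np.shares_memory(x, y) is True, np.shares_memory(x, z) is False.
y[0] = 99            # writing through the view changes x as well: x[16] is now 99
```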
Suppose you had an Op which took x
as input and returned
y
. You would need to tell Theano that y
is a view of x
. For this
purpose, you would set the view_map
field as follows:
myop.view_map = {0: [0]}
What this means is that the first output (position 0) is a view of the first input (position 0). Even though the interface allows a list of inputs that are viewed by a given output, this feature is currently unsupported. Here are more examples:
myop.view_map = {0: [0]} # first output is a view of first input
myop.view_map = {0: [1]} # first output is a view of second input
myop.view_map = {1: [0]} # second output is a view of first input
myop.view_map = {0: [0], # first output is a view of first input
                 1: [1]} # *AND* second output is a view of second input
myop.view_map = {0: [0], # first output is a view of first input
                 1: [0]} # *AND* second output is *ALSO* a view of first input
myop.view_map = {0: [0, 1]} # THIS IS NOT SUPPORTED YET! Only put a single input number in the list!
In-place operations¶
An in-place operation is one that modifies one or more of its
inputs. For example, the expression x += y where x and y
are numpy.ndarray instances would normally represent an in-place
operation on x.
Note
In-place operations in Theano still work in a functional setting: they need to return the modified input. Symbolically, Theano requires one Variable standing for the input before being modified and another Variable representing the input after being modified. Therefore, code using in-place operations would look like this:
from theano.tensor import dscalars, log
from theano.tensor.inplace import add_inplace

x, y = dscalars('x', 'y')
r1 = log(x)
# r2 is x AFTER the add_inplace -- x still represents the value before adding y
r2 = add_inplace(x, y)
# r3 is log(x) using the x from BEFORE the add_inplace
# r3 is the SAME as r1, even if we wrote this line after the add_inplace line
# Theano is actually going to compute r3 BEFORE r2
r3 = log(x)
# this is log(x) using the x from AFTER the add_inplace (so it's like log(x + y))
r4 = log(r2)
Needless to say, this goes for user-defined in-place operations as
well: the modified input must figure in the list of outputs you
give to Apply in the definition of make_node.
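The ordering constraint the Note describes can be reproduced with NumPy's eager in-place arithmetic, where the programmer, rather than Theano, must compute the logarithm before the in-place add:

```python
import numpy as np

x = np.array([1.0, 2.0])
y = np.array([3.0, 4.0])
r1 = np.log(x)     # must run BEFORE the in-place add, or the old x is lost
x += y             # in-place: x now holds old_x + y
r4 = np.log(x)     # log of the value AFTER the add, i.e. log(old_x + y)
```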
Also, for technical reasons, but also because they are slightly confusing to use, as evidenced by the previous code, Theano does not allow the end user to use in-place operations by default. However, it does allow optimizations to substitute them in during a later phase. Therefore, typically, if you define an in-place operation, you will define a pure equivalent and an optimization which substitutes one for the other. Theano will automatically verify if it is possible to do so and will refuse the substitution if it introduces inconsistencies.
Take the previous definitions of x, y and z and suppose an Op which
adds one to every byte of its input. If we give x as an input to
that Op, it can either allocate a new buffer of the same size as x
(that could be z) and set that new buffer's bytes to the result of
the addition. That would be a normal, pure Op. Alternatively,
it could add one to each byte in the buffer x, therefore
changing it. That would be an in-place Op.
Theano needs to be notified of this fact. The syntax is similar to
that of view_map
:
myop.destroy_map = {0: [0]}
What this means is that the first output (position 0) operates in-place on the first input (position 0).
myop.destroy_map = {0: [0]} # first output operates in-place on first input
myop.destroy_map = {0: [1]} # first output operates in-place on second input
myop.destroy_map = {1: [0]} # second output operates in-place on first input
myop.destroy_map = {0: [0], # first output operates in-place on first input
                    1: [1]} # *AND* second output operates in-place on second input
myop.destroy_map = {0: [0], # first output operates in-place on first input
                    1: [0]} # *AND* second output *ALSO* operates in-place on first input
myop.destroy_map = {0: [0, 1]} # first output operates in-place on both the first and second input
                               # unlike for views, the previous line is legal and supported
Destructive Operations¶
While some operations will operate in-place on their inputs, some might simply destroy or corrupt them. For example, an Op could do temporary calculations right in its inputs. If that is the case, Theano also needs to be notified. The way to notify Theano is to assume that some output operated in-place on whatever inputs are changed or corrupted by the Op (even if the output does not technically reuse any of the inputs' memory). From there, go to the previous section.
Warning
Failure to correctly mark down views and in-place operations using
view_map and destroy_map can lead to nasty bugs. In the
absence of this information, Theano might assume that it is safe to
execute an in-place operation on some inputs before doing other
calculations on the previous values of the inputs. For example,
in the code y = log(x); x2 = add_inplace(x, z), it is
imperative to do the logarithm before the addition (because after
the addition, the original x that we wanted to take the logarithm
of is gone). If Theano does not know that add_inplace changes
the value of x, it might invert the order and that will
certainly lead to erroneous computations.

You can often identify an incorrect view_map or destroy_map
by using DebugMode. Be sure to use DebugMode when developing
a new Op that uses view_map and/or destroy_map.
In-place optimization and DebugMode¶
It is recommended that, during graph construction, no Op be in-place; an optimization then replaces Ops with their in-place versions. Currently DebugMode checks all optimizations that were tried, even if they got rejected. One reason an in-place optimization can get rejected is when another Op is already being applied in-place on the same input. Another reason to reject an in-place optimization is if it would introduce a cycle into the graph.
The problem with DebugMode is that it will trigger a useless error when
checking a rejected in-place optimization, since the rejected version would lead to wrong results.
In order to be able to use DebugMode in more situations, your in-place
optimization can pre-check whether it will get rejected by using the
theano.gof.destroyhandler.fast_inplace_check() function, which will tell you
which Ops can be performed in-place. You may then skip the optimization if it is
incompatible with this check. Note however that this check does not cover all
cases where an optimization may be rejected (it will not detect cycles).
Implementing some specific Ops¶
This page is a guide on the implementation of some specific types of Ops, and points to some examples of such implementations.
For the random number generating Ops, it explains different possible implementation strategies.
Scalar/Elemwise/Reduction Ops¶
Implementing a Theano scalar Op allows that scalar operation to be reused by our elemwise operations on tensors. If the scalar operation has C code, the elemwise implementation will automatically have C code too. This will enable the fusion of elemwise operations using your new scalar operation. It can also reuse the GPU elemwise code. It is similar for reduction operations.
For examples of how to add new scalar operations, you can have a look at these two pull requests, which add the GammaLn, Psi, and Gamma scalar Ops.
Be careful about some possible problems in the definition of the
grad
method, and about dependencies that may not be available. In
particular, see the following fixes:
Fix to grad() methods
and impl() methods related to SciPy.
SciPy Ops¶
We can wrap SciPy functions in Theano. But SciPy is an optional dependency. Here is some code that allows the Op to be optional:
try:
    import scipy.linalg
    imported_scipy = True
except ImportError:
    # some ops (e.g. Cholesky, Solve, A_Xinv_b) won't work
    imported_scipy = False

class SomeOp(Op):
    ...
    def make_node(self, x):
        assert imported_scipy, (
            "SciPy not available. SciPy is needed for the SomeOp op.")
        ...

from nose.plugins.skip import SkipTest

class test_SomeOp(utt.InferShapeTester):
    ...
    def test_infer_shape(self):
        if not imported_scipy:
            raise SkipTest("SciPy needed for the SomeOp op.")
        ...
Sparse Ops¶
There are a few differences to keep in mind if you want to make an op
that uses sparse inputs or outputs, rather than the
usual dense tensors. In particular, in the
make_node()
function, you have to call
theano.sparse.as_sparse_variable(x)
on sparse input variables,
instead of as_tensor_variable(x)
.
Another difference is that you need to use SparseVariable
and
SparseType
instead of TensorVariable
and TensorType
.
Do not forget that we support only sparse matrices (so only 2 dimensions)
and that (like in SciPy) they do not support broadcasting operations by default
(although a few Ops do it when called manually). Also, we support only two
formats for the sparse type: csr and csc. So in make_node(),
you can create output variables like this:
out_format = inputs[0].format # or 'csr' or 'csc' if the output format is fixed
SparseType(dtype=inputs[0].dtype, format=out_format).make_variable()
See the sparse theano.sparse.basic.Cast
op code
for a good example of a sparse op with Python code.
Note
From the definition of CSR and CSC formats, CSR column indices are
not necessarily sorted. Likewise for CSC row indices. Use
EnsureSortedIndices
if your code does not
support it.
Also, there can be explicit zeros in your inputs. Use
Remove0
or remove0
to
make sure they aren’t present in your input if you don’t support
that.
To remove explicit zeros and make sure indices are sorted, use
clean
.
Sparse Gradient¶
There are 2 types of gradients for sparse
operations: normal
gradient and structured
gradient. Please document what your op
implements in its docstring. It is important that the user knows it, and
it is not always easy to infer from the code. Also make clear which
inputs/outputs are sparse and which ones are dense.
Sparse C code¶
Theano does not have a native C code interface for sparse matrices. The
reason is simple: we use the SciPy sparse matrix objects and they don’t
have a C object. So we use a simple trick: a sparse matrix is made of
4 fields that are NumPy vector arrays: data
, indices
, indptr
and shape
. So to make
an op with C code that has sparse variables as inputs, we actually make an op
that takes as input the needed fields of those sparse variables.
You can extract the 4 fields with
theano.sparse.basic.csm_properties()
. You can use
theano.sparse.basic.csm_data()
,
theano.sparse.basic.csm_indices()
,
theano.sparse.basic.csm_indptr()
and
theano.sparse.basic.csm_shape()
to extract the individual
fields.
You can look at the AddSD sparse op for an example with C code. It implements the addition of a sparse matrix with a dense matrix.
Sparse Tests¶
You can reuse the test system for tensor variables. To generate the
needed sparse variable and data, you can use
theano.sparse.tests.test_basic.sparse_random_inputs()
. It takes
many parameters, including parameters for the format (csr or csc), the shape, the
dtype, whether to have explicit 0 and whether to have unsorted indices.
Random distribution¶
We have 3 base random number generators: one that wraps NumPy's random generator, one that implements MRG31k3p, and one that wraps CURAND.
The fastest, but least developed, is CURAND. It works only on CUDA-enabled GPUs. It does not work on the CPU and it has fewer random distributions implemented.
The recommended generator, and the second fastest, is MRG. It works on the GPU and CPU and has more distributions implemented.
The slowest is our wrapper on NumPy's random generator.
We explain and provide advice on 3 possible implementations of new distributions here:
 Extend our wrapper around NumPy random functions. See this PR as an example.
 Extend the MRG implementation by reusing existing Theano Ops. Look into the theano/sandbox/rng_mrg.py file and grep for all code about binomial(). This distribution uses the output of the uniform distribution and converts it to a binomial distribution with existing Theano operations. The tests go in theano/sandbox/test_rng_mrg.py.
 Extend the MRG implementation with a new Op that takes a uniform sample as input. Look in the theano/sandbox/{rng_mrg,multinomial}.py files and their tests in theano/sandbox/test_multinomal.py. This is recommended when current Theano ops aren't well suited to transform the uniform samples into the target distribution. This can happen in particular if there is a loop or a complicated condition.
Note
In all cases, you must reuse the same interface as NumPy for compatibility.
OpenMP Ops¶
To allow a consistent interface for Ops that support OpenMP, we have some helper code. Using it also makes it possible to enable/disable OpenMP globally or per Op, for fine-grained control.
Your Op needs to inherit from theano.gof.OpenMPOp
. If it overrides
the __init__()
method, it must have an openmp=None
parameter
and must call super(MyOpClass, self).__init__(openmp=openmp)
.
The OpenMPOp
class also implements c_compile_args
and
make_thunk
. This makes it add the correct g++ flags to compile with
OpenMP. It also disables OpenMP and prints a warning if the version of
g++ does not support it.
The Theano flag openmp
is currently False by default as we do not
have code that gets sped up with it. The only current implementation
is ConvOp. It speeds up some cases, but slows down others. That is why
we disable it by default. But we have all the code to have it enabled
by default if there is more than 1 core and the environment
variable OMP_NUM_THREADS is not 1. This allows Theano to respect the
current convention.
Numba Ops¶
Want C speed without writing C code for your new Op? You can use Numba to generate the C code for you! Here is an example Op doing that.
Alternate Theano Types¶
Most ops in Theano are used to manipulate tensors. However, Theano also supports many other variable types. The supported types are listed below, along with pointers to the relevant documentation.
 TensorType: Theano type that represents a multidimensional array containing elements that all have the same type. Variables of this Theano type are represented in C as objects of class PyArrayObject.
 TypedList: Theano type that represents a typed list (a list where every element in the list has the same Theano type). Variables of this Theano type are represented in C as objects of class PyListObject.
 Scalar: Theano type that represents a C primitive type. The C type associated with this Theano type is the represented C primitive itself.
 SparseType: Theano type used to represent sparse tensors. There is no equivalent C type for this Theano Type, but you can split a sparse variable into its parts as TensorVariables. Those can then be used as inputs to an op with C code.
 Generic: Theano type that represents a simple Python Object. Variables of this Theano type are represented in C as objects of class PyObject.
 CDataType: Theano type that represents a C data type. The C type associated with this Theano type depends on the data being represented.
Implementing double in C¶
The previous two sections described how to define a double Type and arithmetic operations on that Type, but all of them were implemented in pure Python. In this section we will see how to define the double type in such a way that it can be used by operations implemented in C (which we will define in the section after that).
How does it work?¶
In order to be Ccompatible, a Type must provide a C interface to the Python data that satisfy the constraints it puts forward. In other words, it must define C code that can convert a Python reference into some type suitable for manipulation in C and it must define C code that can convert some C structure in which the C implementation of an operation stores its variables into a reference to an object that can be used from Python and is a valid value for the Type.
For example, in the current example, we have a Type which represents a
Python float. First, we will choose a corresponding C type. The
natural choice would be the primitive double
type. Then, we need
to write code that will take a PyObject*
, check that it is a
Python float
and extract its value as a double
. Finally, we
need to write code that will take a C double
and will build a
PyObject*
of Python type float
that we can work with from
Python. We will be using CPython and thus special care must be given
to making sure reference counts are updated properly!
The C code we will write makes use of CPython’s C API which you can find here.
What needs to be defined¶
In order to be Ccompatible, a Type must define several additional
methods, which all start with the c_
prefix. The complete list can
be found in the documentation for gof.type.Type
. Here, we’ll focus on
the most important ones:

class CLinkerType

c_declare(name, sub, check_input=True)
    This must return C code which declares variables. These variables will be available to operations defined in C. You may also write typedefs.

c_init(name, sub)
    This must return C code which initializes the variables declared in c_declare. Either this or c_extract will be called.

c_extract(name, sub, check_input=True)
    This must return C code which takes a reference to a Python object and initializes the variables declared in c_declare to match the Python object's data. Either this or c_init will be called.

c_sync(name, sub)
    When the computations are done, transfer the variables from the C structure we put them in to the destination Python object. This will only be called for the outputs.

c_cleanup(name, sub)
    When we are done using the data, clean up whatever we allocated and decrease the appropriate reference counts.

c_headers([c_compiler])
c_libraries([c_compiler])
c_header_dirs([c_compiler])
c_lib_dirs([c_compiler])
    Allow you to specify headers, libraries and the associated directories.
    These methods have two versions, one with a c_compiler argument and one without. The version with c_compiler is tried first and, if it doesn't work, the one without is used.
    The c_compiler argument is the C compiler that will be used to compile the C code for the node that uses this type.

c_compile_args([c_compiler])
c_no_compile_args([c_compiler])
    Allow you to specify special compiler arguments to add or exclude.
    These methods have two versions, one with a c_compiler argument and one without. The version with c_compiler is tried first and, if it doesn't work, the one without is used.
    The c_compiler argument is the C compiler that will be used to compile the C code for the node that uses this type.

c_init_code()
    Allows you to specify code that will be executed once when the module is initialized, before anything else is executed. For instance, if a type depends on NumPy's C API, then 'import_array();' has to be among the snippets returned by c_init_code().

c_compiler()
    Allows you to specify a special compiler. This will force this compiler for the current compilation block (a particular op or the full graph). This is used for the GPU code.

c_code_cache_version()
    Should return a tuple of hashable objects, such as integers. This specifies the version of the code and is used to cache the compiled code. You MUST change the returned tuple for each change in the code. If you don't want the compiled code to be cached, return an empty tuple or don't implement this method.
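As a sketch, a Type whose generated C snippets are at their first stable revision might implement this method as follows (the version tuple here is just an illustrative choice, not a prescribed value):

```python
# Hypothetical example: version the generated C code so that Theano's
# compilation cache is invalidated whenever any C snippet changes.
def c_code_cache_version():
    # Bump this tuple (e.g. to (1, 1)) on every change to the C code.
    return (1, 0)
```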

Each of these functions takes two arguments, name and sub, which must be used to parameterize the C code they return. name is a string chosen by the compiler to represent a Variable of the Type in such a way that there are no name conflicts between different pieces of data. Therefore, all variables declared in c_declare should have a name which includes name. Furthermore, the name of the variable containing a pointer to the Python object associated with the Variable is py_<name>.
sub, on the other hand, is a dictionary containing bits of C code suitable for use in certain situations. For instance, sub['fail'] contains code that should be inserted wherever an error is identified.

c_declare and c_extract also accept a third optional argument, check_input. If you want your type to validate its inputs, it must only do so when check_input is True.
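To make the roles of name and sub concrete, here is a hypothetical code-generating function (the mangled name "V0" and the goto-based failure code below are made-up illustrations of what the compiler might pass in):

```python
# Hypothetical illustration of parameterizing generated C code with
# `name` and `sub`, in the same style as the methods defined below.
def c_check_positive(name, sub):
    # `%(name)s` becomes the compiler-chosen mangled variable name;
    # `%(fail)s` becomes the failure code Theano provides for this node.
    return """
    if (%(name)s < 0.0) {
        PyErr_SetString(PyExc_ValueError, "expected a non-negative value");
        %(fail)s
    }
    """ % dict(name=name, fail=sub['fail'])

# Example substitution with made-up compiler-chosen values:
code = c_check_positive("V0", {'fail': 'goto __label_1;'})
```

Notice that the function never hard-codes a variable name: everything is routed through the name and sub parameters, which is what keeps independently generated snippets from colliding.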
The example code below should help you understand how everything plays out:
Warning
If some error condition occurs and you want to fail and/or raise an Exception, you must use the fail code contained in sub['fail'] (there is an example in the definition of c_extract below). You must NOT use the return statement anywhere, ever, nor break outside of your own loops, nor goto to strange places or anything like that. Failure to comply with this restriction could lead to erratic behavior, segfaults and/or memory leaks, because Theano defines its own cleanup system and assumes that you are not meddling with it. Furthermore, advanced operations or types might do code transformations on your code, such as inserting it in a loop; in that case they can call your code-generating methods with custom failure code that takes into account what they are doing!
Defining the methods¶
c_declare
def c_declare(name, sub):
    return """
    double %(name)s;
    """ % dict(name = name)
double.c_declare = c_declare
Very straightforward. All we need to do is write C code to declare a double. That double will be named whatever is passed to our function in the name argument. That will usually be some mangled name like "V0", "V2" or "V92", depending on how many nodes there are in the computation graph and what rank the current node has. This function will be called for all Variables whose type is double.
You can declare as many variables as you want there, and you can also write typedefs. Make sure that the name of each variable contains the name argument in order to avoid name collisions (collisions will happen if you don't parameterize the variable names as indicated here). Also note that you cannot declare a variable called py_<name> or storage_<name>, because Theano already defines them.
What you declare there is basically the C interface you are giving to your Type. If you wish people to develop operations that make use of it, it’s best to publish it somewhere.
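For instance, a Type that needs more than one C variable could declare them all in c_declare, suffixing each with name; the following sketch is hypothetical (the _tmp scratch variable is invented for illustration), but it shows the naming discipline in action:

```python
# Hypothetical c_declare that declares two variables, both mangled with
# `name` so that different Variables in the graph cannot collide.
def c_declare_with_scratch(name, sub):
    return """
    double %(name)s;
    double %(name)s_tmp;  /* scratch space, for illustration only */
    """ % dict(name=name)
```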
c_init
def c_init(name, sub):
    return """
    %(name)s = 0.0;
    """ % dict(name = name)
double.c_init = c_init
This function has to initialize the double we declared previously to a suitable value. This is useful if we want to avoid dealing with garbage values, especially if our data type is a pointer. This is not going to be called for all Variables with the double type. Indeed, if a Variable is an input that we pass from Python, we will want to extract that input from a Python object, so it is the c_extract method that will be called instead of c_init. You can therefore not assume, when writing c_extract, that the initialization has been done (in fact you can assume that it hasn't been done).

c_init will typically be called on output Variables, but in general you should only assume that either c_init or c_extract has been called, without knowing for sure which of the two.
c_extract
def c_extract(name, sub):
    return """
    if (!PyFloat_Check(py_%(name)s)) {
        PyErr_SetString(PyExc_TypeError, "expected a float");
        %(fail)s
    }
    %(name)s = PyFloat_AsDouble(py_%(name)s);
    """ % dict(name = name, fail = sub['fail'])
double.c_extract = c_extract
This method is slightly more sophisticated. What happens here is that we have a reference to a Python object which Theano has placed in py_%(name)s, where %(name)s must be substituted for the name given in the inputs. This special variable is declared by Theano as PyObject* py_%(name)s, where PyObject* is a pointer to a Python object as defined by CPython's C API. This is the reference that corresponds, on the Python side of things, to a Variable with the double type. It is what the end user will give and what he or she expects to get back.
In this example, the user will give a Python float. The first thing we should do is verify that what we got is indeed a Python float. The PyFloat_Check function is provided by CPython's C API and does this for us. If the check fails, we set an exception and then insert the code for failure. The code for failure is in sub["fail"] and it basically does a goto to cleanup code.
If the check passes, then we convert the Python float into a double using the PyFloat_AsDouble function (yet again provided by CPython's C API) and we put it in the double variable that we declared previously.
c_sync
def c_sync(name, sub):
    return """
    Py_XDECREF(py_%(name)s);
    py_%(name)s = PyFloat_FromDouble(%(name)s);
    if (!py_%(name)s) {
        printf("