PyMVPA Extravaganza – Fall 2009

This development workshop took place at Dartmouth College, Nov 30 – Dec 4, 2009.

Goals

The primary purpose of this first PyMVPA workshop was to gather all people involved in (or related to) the development of PyMVPA. Participants introduced their projects and discussed their integration into, or interoperation with the PyMVPA main line.

In addition, we discussed changes scheduled for the upcoming 0.5 release of PyMVPA that address shortcomings of the original design and add features that have been identified as missing over the past two years. These include:

  • More flexible data storage: A new dataset implementation.
  • Better integration of PyMVPA and MDP.
  • Establishing an optimization framework within PyMVPA.
  • Various performance improvements, e.g. kernel caching, parallelization, and the potential of CUDA.

Participants

  • Satrajit Ghosh, MIT, USA (for the kick-off talks)
  • Scott Gorlin, MIT, USA
  • Valentin Haenel, BCCN, Germany
  • Yaroslav O. Halchenko, Dartmouth College, USA
  • Michael Hanke, Dartmouth College, USA
  • Emanuele Olivetti, Fondazione Bruno Kessler, Italy
  • Per B. Sederberg, Princeton University, USA (virtual)
  • Tiziano Zito, BCCN, Germany

Kick-off Talks

The workshop started on Monday Nov 30th at 9:30am with a series of talks covering the various aspects of the workshop (abstracts below).

PyMVPA: Where we are now, and where we are going

Yaroslav O. Halchenko, Michael Hanke

This talk will give a brief summary of the original concept we had in mind when designing PyMVPA two years ago, and how the project has evolved since then. We will touch upon several issues we had to face concerning development, quality assurance, and deployment. While the latest PyMVPA release offers a wide array of tools and algorithms, we have also identified a number of problems that limit further integration of novel techniques into the framework. The talk will conclude with an outline of how we believe these issues can be resolved, and introduce a number of improvements that will become available with the next milestone release: PyMVPA 0.5.

MDP inside out

Tiziano Zito

MDP is a Python collection of machine learning algorithms and a framework for implementing new algorithms and combining them into data processing workflows. MDP has been designed around two main ideas: expose a simple API, to allow scientific users to use it as a standalone library, and organize the internal structure of the objects to encourage developers to extend it and embed it in other libraries such as PyMVPA. In my talk, I will use MDP as a starting point to hash over some basic principles of scientific software design. I will discuss the criteria that inform the design of MDP and their specific implementation, and examine their advantages, limitations and possible alternatives. I will conclude with a summary of the current status and future plans for MDP development.

Nipype - A Python framework for neuroimaging

Satrajit Ghosh

Nipype is a project under the umbrella of Nipy, an effort to develop open-source, community-developed neuroimaging tools in Python. The goals of Nipype are two-fold: 1) to provide a uniform interface to existing neuroimaging software packages; and 2) to provide a pipelined environment for efficient batch-processing that can tie together different neuroimaging data analysis algorithms. By exposing a consistent interface to the external packages, researchers are able to explore a wide range of imaging algorithms, configure the analysis pipeline that best fits their data and research objectives, and perform their analysis in a highly structured environment. The Nipype framework is accessible to the wide range of programming expertise often found in neuroimaging, allowing for both easy-to-use high-level scripting and low-level algorithm development for unlimited customization.

Profiling PyMVPA

Valentin Haenel

In this talk I will present the work we did to compare the PyMVPA and Matlab implementations of the searchlight algorithm. This will include a description of how we iteratively discovered various bottlenecks and the steps taken to eliminate these. In particular, I will first present modifications of the source code and then show the resulting change in profiler output. I may conclude with some ideas for future work and some additional remarks about optimization in general.
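The profile-then-optimize loop described here can be reproduced with Python's standard-library profiler. A minimal sketch (run_searchlight is a hypothetical stand-in for the actual analysis function under investigation):

    import cProfile
    import pstats

    def run_searchlight():
        # stand-in for the analysis function being profiled
        return sum(i * i for i in range(1000000))

    profiler = cProfile.Profile()
    profiler.enable()
    run_searchlight()
    profiler.disable()

    # Show the ten most expensive calls by cumulative time: these are the
    # candidate bottlenecks to attack first.
    pstats.Stats(profiler).sort_stats('cumulative').print_stats(10)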

Supervised Tract Segmentation

Emanuele Olivetti

Automatic segmentation of tractography data into pathways/tracts is a problem traditionally addressed by means of unsupervised techniques, i.e., clustering streamlines. The core of this work is to adopt instead a supervised approach, learning from the segmentation made by an expert neuroanatomist in order to predict tracts in new brains.

In this talk a novel set of supervised approaches to the tract segmentation problem will be illustrated. The proposed solutions are based on machine learning topics like “supervised clustering”, “learning with similarity functions” and “transduction”. These solutions make it possible to exploit both diffusion and functional MRI data, to avoid co-registration between different subjects, and to predict tracts in hemispheres other than those of the training examples. Preliminary results support these claims.

An intended goal of this talk is to open a discussion on how to map the building blocks of the proposed methods into the PyMVPA framework, in order to support tractography data analysis natively and, more generally, to provide novel machine learning approaches to the users.

Caching kernels

Scott Gorlin

A major bottleneck in a standard classification analysis is the calculation of dot products between vectors in high-dimensional space. This is especially time consuming when there are few samples but many dimensions, as is the case with fMRI data. In fact, many common analysis techniques such as cross-validation, bootstrapping, and model selection require that the kernel be recalculated for each permutation, even if that exact calculation has been done before. This presentation analyzes the problem inherent in a high-level library such as PyMVPA and illustrates one example of how to cache and reuse kernels, greatly simplifying the underlying computations and accelerating the implementation of many analysis techniques by several orders of magnitude.

Workshop Results

The workshop was a huge success. We worked on further integrating PyMVPA with other Python-based software packages, both to make use of them inside PyMVPA and to better expose PyMVPA’s functionality to other packages. The kick-off talks were followed by four days of intensive coding. During these days we were able to integrate virtually all outstanding patches that had been offered over the last year but could not be merged earlier due to required changes in the codebase. Moreover, we were able to continue the transition towards the new dataset implementation that had been started prior to the workshop. A significant number of additional unit tests have been ported to the new code – as usual, identifying and fixing a number of bugs along the way. Below is a list of the projects we worked on during the workshop.

[Image: extravaganza-dc09.jpg]

Workshop participants (from left to right and top to bottom): Emanuele Olivetti, Scott Gorlin, Michael Hanke, Tiziano Zito, Yaroslav O. Halchenko, Valentin Haenel

Grand Kernel Unification

Scott Gorlin, Yaroslav O. Halchenko, and Emanuele Olivetti

Many core MVPA algorithms rely on expensive kernel computations. However, most of these algorithms have their own naming standards and backend implementations which are not interchangeable, meaning that new advances in kernel logic or software implementations are not generally beneficial to PyMVPA as a whole. To solve this, we have implemented a new class hierarchy which not only specifies a standard kernel interface, but also allows the automatic translation of kernels from one software backend to another.

Specifically, it is now possible to specify new kernel classes in pure Python (or via any method that can expose a NumPy array, such as PyCUDA or custom C) and automatically convert between these and Shogun kernels. This has the immediate advantage of allowing custom kernels for any Shogun-based classifier (e.g., SVM), using Shogun kernels for fast computation in any other solver (e.g., GPR), or the automatic exchange of kernels for any implementation in the future.
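As an illustration of the idea only – these are hypothetical class names, not the actual PyMVPA 0.5 hierarchy – a minimal backend-agnostic kernel interface could look like this, with a NumPy array serving as the common exchange format:

    import numpy as np

    class Kernel(object):
        """Abstract kernel: compute a Gram matrix and expose it as NumPy."""
        def compute(self, ds1, ds2=None):
            raise NotImplementedError

        def as_np(self):
            # the NumPy array is the common currency between backends
            return self._k

    class LinearNumpyKernel(Kernel):
        """Pure Python/NumPy backend: K[i, j] = <ds1[i], ds2[j]>."""
        def compute(self, ds1, ds2=None):
            ds2 = ds1 if ds2 is None else ds2
            self._k = np.dot(ds1, ds2.T)
            return self

    # Any kernel exposing as_np() could, in principle, be handed to another
    # backend, e.g. wrapped into a Shogun custom kernel for SVM training.
    K = LinearNumpyKernel().compute(np.random.rand(10, 50)).as_np()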

Cached Kernel Optimization

Scott Gorlin, Yaroslav O. Halchenko

The main benefit of the kernel unification work is that new kernel classes can be specified in pure Python. One new optimized kernel class we have implemented is CachedKernel, which can automatically cache and reuse kernel matrices from any other NumpyKernel (or any kernel that is convertible to NumPy, such as a Shogun kernel). This class will prove extremely useful for techniques such as cross-validation and bootstrapping, where the kernel matrix is normally recalculated from scratch on every use – e.g., every time SVM.train(...) is called. Caching the kernel avoids these expensive computations and can speed up this type of analysis by several orders of magnitude.
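A minimal sketch of the underlying idea (generic code, not the actual CachedKernel API): compute the full Gram matrix once, then serve every training/testing fold by indexing into it:

    import numpy as np

    def linear_kernel(a, b):
        # K[i, j] = <a[i], b[j]>
        return np.dot(a, b.T)

    class SimpleCachedKernel(object):
        """Compute the full Gram matrix once; serve sub-kernels by indexing."""
        def __init__(self, kernel_func):
            self.kernel_func = kernel_func
            self._cache = None

        def cache(self, data):
            # the expensive part, done exactly once
            self._cache = self.kernel_func(data, data)

        def compute(self, idx1, idx2):
            # cheap look-up instead of recomputing the dot products
            return self._cache[np.ix_(idx1, idx2)]

    data = np.random.rand(100, 5000)     # few samples, many features (fMRI-like)
    ck = SimpleCachedKernel(linear_kernel)
    ck.cache(data)
    train, test = np.arange(80), np.arange(80, 100)
    K_train = ck.compute(train, train)   # reused for every fold/permutation
    K_test = ck.compute(test, train)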

Flexible, straightforward adaptor for arbitrary MDP nodes and flows

Michael Hanke, Tiziano Zito

Although PyMVPA previously used MDP to provide a subset of its functionality through Mappers, this was limited to single nodes (e.g., PCA, ICA) and was not meant to be extended by users (except by subclassing and writing a new node wrapper by hand). Now PyMVPA includes flexible adaptors for arbitrary MDP nodes, or whole MDP flows. Besides incremental training, these adaptors offer access to the full functionality of the underlying node or flow. Straightforward (single line of code) wrapping allows MDP to be blended seamlessly into PyMVPA.

The benefits are two-fold: PyMVPA users now have access to the full functionality of MDP without having to develop custom mappers. This includes algorithms such as PCA, ICA, factor analysis, discriminant analysis, slow feature analysis, restricted Boltzmann machines, and many more. MDP users can now use PyMVPA to perform convenient cross-validation of classification procedures with arbitrary mixes of PyMVPA classifiers and measures, and MDP nodes and flows.
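The sketch below shows plain MDP usage (real MDP API) next to the kind of one-line wrapping described above; the PyMVPA wrapper calls appear as comments with illustrative names only – consult the PyMVPA documentation for the exact adaptor API:

    import numpy as np
    import mdp

    data = np.random.rand(200, 30)

    # Standalone MDP: train a PCA node (possibly incrementally, by calling
    # train() on several data chunks) and project the data.
    pca = mdp.nodes.PCANode(output_dim=5)
    pca.train(data)
    projected = pca.execute(data)

    # The same node -- or a whole flow -- handed to a PyMVPA adaptor in a
    # single line (names illustrative, not verified against the release):
    #
    #   mapper = MDPNodeMapper(mdp.nodes.PCANode(output_dim=5))
    #   flow_mapper = MDPFlowMapper(mdp.Flow([mdp.nodes.PCANode(output_dim=10),
    #                                         mdp.nodes.FastICANode()]))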

Non-matrix Dataset and prototype mapper for tractography data (and more!)

Emanuele Olivetti, Michael Hanke

The vast majority of algorithms available (and desirable) in PyMVPA require data in a 2D matrix format. For this reason, until now, PyMVPA accepted only 2D matrices as samples in a Dataset. However, this sometimes causes problems, for example with tractography data, which consists of a set of streamlines, where each streamline is a polyline made of a varying number of points. In PyMVPA terms this means that the number of features in the corresponding dataset of streamlines would differ across instances.

The purpose of a set of patches made during the workshop is twofold: first, to allow PyMVPA to also accept row-wise iterable collections as a Dataset, independently of the content of each row; and second, to provide a mapper that transforms any such Dataset into a 2D matrix Dataset. The mapper is prototype-based, which means that each instance within the Dataset (e.g., each streamline) is mapped into a fixed-size M-dimensional vector. The M values are computed by evaluating a user-specified similarity (or kernel, or distance) function between that instance and a given set of M other instances (e.g., M other streamlines) called prototypes. An example application is supervised tract segmentation from tractography data, which can now be mapped into a standard binary classification problem over the usual 2D matrix class-labeled dataset. This approach to addressing varying feature-space sizes is flexible and not limited to the tractography domain.
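A minimal NumPy sketch of the prototype mapping (all names are hypothetical, and the streamline dissimilarity shown is just one illustrative choice among many):

    import numpy as np

    def mean_closest_distance(s1, s2):
        # one possible streamline dissimilarity: for every point in s1, the
        # distance to its closest point in s2, averaged over s1
        d = np.sqrt(((s1[:, None, :] - s2[None, :, :]) ** 2).sum(axis=2))
        return d.min(axis=1).mean()

    def prototype_map(instances, prototypes, similarity):
        """Map variable-length instances into an (n_instances x M) matrix."""
        return np.array([[similarity(inst, proto) for proto in prototypes]
                         for inst in instances])

    # Streamlines: polylines with a varying number of 3D points.
    streamlines = [np.random.rand(np.random.randint(20, 60), 3)
                   for _ in range(50)]
    prototypes = streamlines[:5]    # M = 5 prototypes
    X = prototype_map(streamlines, prototypes, mean_closest_distance)
    print(X.shape)                  # (50, 5): a regular 2D samples matrix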

Optimization and Generalization of Searchlight-analyses

Valentin Haenel, Michael Hanke

The searchlight analysis code has been ported to the new dataset/mapper framework, taking into account the results of the profiling analysis done by Valentin Haenel during the last year. The new code avoids significant look-up penalties of the previous implementation. Moreover, it has been generalized to support arbitrary look-up algorithms (e.g., kd-tree based) and is no longer limited to sphere-based spatial searchlights.
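To illustrate the kind of look-up such a generalized searchlight can use (the SciPy calls are real; the searchlight wiring itself is omitted), a kd-tree query for spherical neighborhoods might look like this:

    import numpy as np
    from scipy.spatial import cKDTree

    coords = np.random.rand(1000, 3) * 50   # e.g., voxel coordinates in mm
    tree = cKDTree(coords)

    radius = 5.0
    # Indices of all features within 'radius' of each of the first ten
    # centers: one neighborhood per candidate searchlight sphere, obtained
    # from the tree rather than by brute-force distance computation.
    neighborhoods = [tree.query_ball_point(c, r=radius) for c in coords[:10]]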

Acknowledgements

We are grateful to Prof. James Haxby for sponsoring this workshop and hosting it in his lab.