LeCun et al. (1999): The MNIST Dataset Of Handwritten Digits (Images)

The MNIST dataset of handwritten digits, available from this page, has a training set of 60,000 examples, and a test set of 10,000 examples. It is a subset of a larger set available from NIST. The digits have been size-normalized and centered in a fixed-size image. It is a good database for people who want to try learning techniques and pattern recognition methods on real-world data while spending minimal efforts on preprocessing and formatting.

See http://yann.lecun.com/exdb/mnist for more information.

Note

The version that is offered here is identical to the four files distributed there, but has been converted into a single HDF5 file than can easily be read by PyMVPA.

Terms Of Use

Yann LeCun (Courant Institute, NYU) and Corinna Cortes (Google Labs, New York) hold the copyright of MNIST dataset, which is a derivative work from original NIST datasets. MNIST dataset is made available under the terms of the Creative Commons Attribution-Share Alike 3.0 license.

Download

A single hdf5 file containing entire MNIST dataset is available from

Requirements

  • HDF5 access facility.
  • PyMVPA 0.5 (or later) provides the h5load() function (utilizes H5PY package).

Instructions

>>> from mvpa2.suite import *
>>> filepath = os.path.join(pymvpa_datadbroot, 'mnist', "mnist.hdf5")
>>> datasets = h5load(filepath)
>>> train = datasets['train']
>>> test = datasets['test']
>>> print train
<Dataset: 60000x784@uint8, <sa: labels>>
>>> print test
<Dataset: 10000x784@uint8, <sa: labels>>
>>> # assign a mapper able to recreate 28x28 pixel image arrays
>>> test.a.mapper = FlattenMapper(shape=(28, 28))
>>> test.mapper.reverse(test).shape
(10000, 28, 28)

References

LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86, 2278–2324.