mvpa2.misc.surfing.queryengine.AttrDataset
-
class mvpa2.misc.surfing.queryengine.AttrDataset(samples, sa=None, fa=None, a=None)
Generic storage class for datasets with multiple attributes.
A dataset consists of four pieces. The core is a two-dimensional array that has variables (so-called features) in its columns and the associated observations (so-called samples) in the rows. In addition, a dataset may have any number of attributes for features and samples. Unsurprisingly, these are called ‘feature attributes’ and ‘sample attributes’. Each attribute is a vector of any datatype that contains one value per item (feature or sample). Both types of attributes are organized in their respective collections, accessible via the sa (sample attribute) and fa (feature attribute) attributes. Finally, a dataset itself may have any number of additional attributes (e.g. a mapper) that are stored in their own collection, accessible via the a attribute (see examples below).
Notes
Any dataset might have a mapper attached that is stored as a dataset attribute called mapper.
Examples
The simplest way to create a dataset is from a 2D array.
>>> import numpy as np
>>> from mvpa2.datasets import *
>>> samples = np.arange(12).reshape((4,3))
>>> ds = AttrDataset(samples)
>>> ds.nsamples
4
>>> ds.nfeatures
3
>>> ds.samples
array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11]])
The above dataset can only be used for unsupervised machine-learning algorithms, since it doesn’t have any targets associated with its samples. However, creating a labeled dataset is equally simple.
>>> ds_labeled = dataset_wizard(samples, targets=range(4))
Both the labeled and the unlabeled dataset share the same samples array. No copying is performed.
>>> ds.samples is ds_labeled.samples
True
If the data should not be shared, the samples array has to be copied beforehand.
The targets are available from the samples attributes collection, but also via the convenience property targets.
>>> ds_labeled.sa.targets is ds_labeled.targets
True
If desired, it is possible to add an arbitrary number of additional attributes. Regardless of their original sequence type, they will be converted into an array.
>>> ds_labeled.sa['lovesme'] = [0,0,1,0]
>>> ds_labeled.sa.lovesme
array([0, 0, 1, 0])
An alternative method to create datasets with arbitrary attributes is to provide the attribute collections to the constructor itself – which would also test for an appropriate size of the given attributes:
>>> fancyds = AttrDataset(samples, sa={'targets': range(4),
...                                    'lovesme': [0,0,1,0]})
>>> fancyds.sa.lovesme
array([0, 0, 1, 0])
Exactly the same logic applies to feature attributes as well.
Datasets can be sliced (selecting a subset of samples and/or features) similar to arrays. Selection is possible using boolean selection masks, index sequences or slicing arguments. The following calls for samples selection all result in the same dataset:
>>> sel1 = ds[np.array([False, True, True, False])]
>>> sel2 = ds[[1,2]]
>>> sel3 = ds[1:3]
>>> np.all(sel1.samples == sel2.samples)
True
>>> np.all(sel2.samples == sel3.samples)
True
During selection, data is only copied if necessary. If the slicing syntax is used, the resulting dataset will share the samples with the original dataset (here and below we compare .base against both ds.samples and its .base for compatibility with NumPy < 1.7).
>>> sel1.samples.base in (ds.samples.base, ds.samples)
False
>>> sel2.samples.base in (ds.samples.base, ds.samples)
False
>>> sel3.samples.base in (ds.samples.base, ds.samples)
True
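The copy-versus-view behaviour described above comes from NumPy itself, so it can be checked on a bare array without a dataset. The following is a minimal sketch on a plain ndarray (not on an AttrDataset), using np.shares_memory instead of the .base comparison:

```python
import numpy as np

samples = np.arange(12).reshape((4, 3))

by_mask = samples[np.array([False, True, True, False])]  # boolean mask: copy
by_index = samples[[1, 2]]                               # index list: copy
by_slice = samples[1:3]                                  # slice: view

# All three select the same rows ...
same_content = (np.array_equal(by_mask, by_index)
                and np.array_equal(by_index, by_slice))

# ... but only the slice shares memory with the original array.
shares = [np.shares_memory(samples, sel)
          for sel in (by_mask, by_index, by_slice)]
print(same_content, shares)
```

Writing into by_slice would therefore also change samples, while writing into the other two selections would not.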
For feature selection the syntax is very similar; features are just represented on the second axis of the samples array. Plain feature selection is achieved by keeping all samples and selecting a subset of features (all syntax variants for samples selection are also supported for feature selection).
>>> fsel = ds[:, 1:3]
>>> fsel.samples
array([[ 1,  2],
       [ 4,  5],
       [ 7,  8],
       [10, 11]])
It is also possible to simultaneously select a subset of samples and features. Using the slicing syntax, no copying will be performed.
>>> fsel = ds[:3, 1:3]
>>> fsel.samples
array([[1, 2],
       [4, 5],
       [7, 8]])
>>> fsel.samples.base in (ds.samples.base, ds.samples)
True
Please note that simultaneous selection of samples and features is not always congruent to array slicing.
>>> ds[[0,1,2], [1,2]].samples
array([[1, 2],
       [4, 5],
       [7, 8]])
Whereas the call ‘ds.samples[[0,1,2], [1,2]]’ would not be possible. In AttrDatasets, selection of samples and features is always applied individually and independently to each axis.
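To see why the plain array call fails, compare per-axis selection with NumPy's paired fancy indexing. This sketch works on a bare ndarray only; np.ix_ is used here as the closest NumPy equivalent of independent per-axis selection, which is an illustration and not a claim about how AttrDataset implements it:

```python
import numpy as np

samples = np.arange(12).reshape((4, 3))
rows, cols = [0, 1, 2], [1, 2]

# Per-axis selection: pick rows first, then columns.
per_axis = samples[rows][:, cols]
# np.ix_ builds the same open-mesh selection in one indexing step.
open_mesh = samples[np.ix_(rows, cols)]

# Plain NumPy fancy indexing pairs the index lists element by element,
# so lists of length 3 and 2 cannot be broadcast together.
try:
    samples[rows, cols]
    paired_ok = True
except IndexError:
    paired_ok = False

print(per_axis.tolist(), paired_ok)
```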
A Dataset might have an arbitrary number of attributes for samples, features, or the dataset as a whole. However, only the data samples themselves are required.
Parameters: samples : ndarray
Data samples. This has to be a two-dimensional (samples x features) array. If the samples are not in that format, please consider one of the AttrDataset.from_* classmethods.
sa : SampleAttributesCollection
Samples attributes collection.
fa : FeatureAttributesCollection
Features attributes collection.
a : DatasetAttributesCollection
Dataset attributes collection.
Methods
-
aggregate_features(dataset, fx=<function mean>)
Apply a function to each row of the samples matrix of a dataset.
The functor given as fx has to honour an axis keyword argument in the way that NumPy uses it (e.g. numpy.mean, numpy.var).
Returns: a new `Dataset` object with the aggregated feature(s).
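The axis requirement can be illustrated on a bare array. In this sketch, aggregate_rows is a hypothetical stand-in, and storing the result as a single column is an assumption about the output layout rather than documented behaviour:

```python
import numpy as np

samples = np.arange(12).reshape((4, 3))

def aggregate_rows(samples, fx=np.mean):
    """Hypothetical helper: collapse each row to one value via fx(..., axis=1)."""
    # fx must accept an `axis` keyword, as np.mean and np.var do.
    return np.asarray(fx(samples, axis=1)).reshape((-1, 1))

row_means = aggregate_rows(samples)
row_vars = aggregate_rows(samples, fx=np.var)
print(row_means.ravel(), row_vars.ravel())
```

A function without an axis keyword (for example the built-in sum) would fail at the fx call, which is why the requirement is spelled out.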
-
append(other)
This method should not be used and will be removed in the future.
-
coarsen_chunks(source, nchunks=4)
Change chunking of the dataset.
Group chunks together to match the desired number of chunks. This makes sense if there was originally no strong grouping into chunks, or if each sample was independent and thus belonged to its own chunk.
Parameters: source : Dataset or list of chunk ids
Dataset or list of chunk ids to operate on. If a Dataset is given, its chunks get modified.
nchunks : int
Desired number of chunks.
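The idea of coarsening can be sketched on a plain list of chunk ids. The coarsen function below is hypothetical, and its contiguous grouping is an assumption made for illustration; the library may distribute chunks differently, for example to balance group sizes:

```python
def coarsen(chunks, nchunks=4):
    """Hypothetical helper: relabel chunk ids so that nchunks groups remain."""
    unique = sorted(set(chunks))
    if len(unique) < nchunks:
        raise ValueError("cannot coarsen %d chunks into %d"
                         % (len(unique), nchunks))
    # Spread the original chunk ids over nchunks contiguous groups.
    mapping = {old: (i * nchunks) // len(unique)
               for i, old in enumerate(unique)}
    return [mapping[c] for c in chunks]

chunks = list(range(8))              # every sample in its own chunk
coarse = coarsen(chunks, nchunks=4)
print(coarse)
```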
-
copy(deep=True, sa=None, fa=None, a=None, memo=None)
Create a copy of a dataset.
By default this returns a deep copy of the dataset, hence no data is shared between the original dataset and its copy.
Parameters: deep : boolean, optional
If False, a shallow copy of the dataset is returned instead. The copy contains only views of the samples, sample attributes and feature attributes, as well as shallow copies of all dataset attributes.
sa : list or None
List of attributes in the sample attributes collection to include in the copy of the dataset. If None, all attributes are considered. If an empty list is given, all attributes are stripped from the copy.
fa : list or None
List of attributes in the feature attributes collection to include in the copy of the dataset. If None, all attributes are considered. If an empty list is given, all attributes are stripped from the copy.
a : list or None
List of attributes in the dataset attributes collection to include in the copy of the dataset. If None, all attributes are considered. If an empty list is given, all attributes are stripped from the copy.
memo : dict
Developers only: This argument is only useful if copy() is called inside the __deepcopy__() method and refers to the dict-argument memo in the Python documentation.
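The practical difference between a view-based shallow copy and a deep copy shows up as soon as the original is modified. This is a sketch on a plain NumPy array, used here as an analogy for the samples array and not as a call to the dataset's own copy() method:

```python
import copy
import numpy as np

samples = np.arange(6).reshape((2, 3))

shallow = samples.view()        # new array object, same underlying buffer
deep = copy.deepcopy(samples)   # independent buffer

samples[0, 0] = 99              # modify the original in place

print(shallow[0, 0], deep[0, 0])
```

The view follows the change while the deep copy keeps the old value, which is the trade-off behind the deep flag: views are cheap but coupled to the original.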
-
classmethod from_hdf5(source, name=None)
Load a Dataset from an HDF5 file.
Parameters: source : string or h5py.highlevel.File
Filename or HDF5’s File to load the dataset from.
name : string, optional
If the file contains multiple entries at the first level, name (if provided) specifies the group to be loaded as the AttrDataset.
Returns: AttrDataset
Raises: ValueError
-
get_nsamples_per_attr(dataset, attr)
Returns the number of samples per unique value of a sample attribute.
Parameters: attr : str
Name of the sample attribute
Returns: dict with the number of samples (value) per unique attribute (key).
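The shape of the returned dictionary can be reproduced with the standard library alone. The targets list below is made-up example data standing in for a sample attribute:

```python
from collections import Counter

# Hypothetical sample attribute values, one per sample.
targets = ['face', 'house', 'face', 'face', 'house', 'rest']

# Key: unique attribute value, value: number of samples carrying it.
nsamples_per_target = dict(Counter(targets))
print(nsamples_per_target)
```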
-
get_samples_by_attr(dataset, attr, values, sort=True)
Return indices of samples given a list of attribute values.
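Since the entry gives no parameter descriptions, the following sketch shows one plausible reading on a plain list. The helper samples_by_attr is hypothetical, and the meaning given to sort (ascending sample order versus grouped by requested value) is an assumption:

```python
def samples_by_attr(attr_values, values, sort=True):
    """Hypothetical helper: indices of samples whose attribute is in `values`."""
    if sort:
        # Indices in ascending sample order.
        return [i for i, v in enumerate(attr_values) if v in values]
    # Otherwise grouped in the order the requested values were given.
    return [i for wanted in values
            for i, v in enumerate(attr_values) if v == wanted]

targets = ['face', 'house', 'face', 'rest', 'house']
in_order = samples_by_attr(targets, ['house', 'face'])
grouped = samples_by_attr(targets, ['house', 'face'], sort=False)
print(in_order, grouped)
```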
-
get_samples_per_chunk_target(dataset, targets_attr='targets', chunks_attr='chunks')
Returns an array with the number of samples per target in each chunk.
Array shape is (chunks x targets).
Parameters: dataset : Dataset
Source dataset.
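The (chunks x targets) layout can be sketched with plain lists. The targets and chunks values are made-up example data, and ordering rows and columns by sorted unique value is an assumption:

```python
targets = ['a', 'b', 'a', 'b', 'a', 'a']
chunks = [0, 0, 1, 1, 1, 2]

unique_targets = sorted(set(targets))
unique_chunks = sorted(set(chunks))

# counts[i][j]: number of samples in chunk i that carry target j.
counts = [[0] * len(unique_targets) for _ in unique_chunks]
for t, c in zip(targets, chunks):
    counts[unique_chunks.index(c)][unique_targets.index(t)] += 1

print(counts)
```

A zero entry, as in the last row here, flags a chunk that lacks one of the targets.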
-
init_origids(which, attr='origids', mode='new')
Initialize the dataset’s ‘origids’ attribute.
The purpose of origids is to allow tracking the identity of a feature or a sample through the lifetime of a dataset (i.e. subsequent feature selections).
Calling this method will overwrite any potentially existing IDs (of the XXX)
Parameters: which : {‘features’, ‘samples’, ‘both’}
An attribute is generated for each feature, sample, or both that represents a unique ID. This ID incorporates the dataset instance ID and should allow merging multiple datasets without causing multiple identical IDs in the resulting dataset.
attr : str
Name of the attribute to store the generated IDs in. By convention this should be ‘origids’ (the default), but might be changed for specific purposes.
mode : {‘existing’, ‘new’, ‘raise’}, optional
Action if attr is already present in the collection. With ‘new’ (the default), new IDs are generated and replace existing values if such are present. With ‘existing’, existing content is not altered. With ‘raise’, a RuntimeError is raised.
Raises: RuntimeError
If mode == ‘raise’ and attr is already defined.
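The three mode values can be mimicked on an ordinary dict. Everything in this sketch is hypothetical: init_ids is not a library function, and the ID format (object id plus running index) is only a stand-in for the "dataset instance ID" mentioned above:

```python
def init_ids(collection, n, attr='origids', mode='new'):
    """Hypothetical helper mimicking the documented `mode` switch on a dict."""
    if mode not in ('existing', 'new', 'raise'):
        raise ValueError("unknown mode %r" % (mode,))
    if attr in collection:
        if mode == 'existing':
            return collection          # keep the stored ids untouched
        if mode == 'raise':
            raise RuntimeError("attribute %r already defined" % (attr,))
    # mode == 'new' (or attr missing): generate ids tied to this object.
    collection[attr] = ['%d-%d' % (id(collection), i) for i in range(n)]
    return collection

sa = {}
init_ids(sa, 3)
first = list(sa['origids'])

init_ids(sa, 3, mode='existing')
kept = sa['origids'] == first

try:
    init_ids(sa, 3, mode='raise')
    raised = False
except RuntimeError:
    raised = True
```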
-
nfeatures
-
nsamples
len(object) -> integer
Return the number of items of a sequence or collection.
-
random_samples(dataset, npertarget, targets_attr='targets')
Create a dataset with a random subset of samples.
Parameters: dataset : Dataset
npertarget : int or list
If an int is given, the specified number of samples is randomly chosen from the group of samples sharing a unique target value. Total number of selected samples: npertarget x len(uniquetargets). If a list is given with a length matching the number of unique target values, it specifies the number of samples chosen for each particular unique target.
targets_attr : str, optional
Returns: Dataset
A dataset instance for the chosen samples. All feature attributes and dataset attributes share their data with the source dataset.
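The int and list forms of npertarget can be sketched with the standard library. The helper random_sample_indices is hypothetical, returns indices instead of a dataset, and assumes that list entries follow the sorted order of the unique targets:

```python
import random

def random_sample_indices(targets, npertarget, rng=random):
    """Hypothetical helper: pick npertarget sample indices per unique target."""
    unique_targets = sorted(set(targets))
    if isinstance(npertarget, int):
        # Same number of samples for every unique target.
        npertarget = [npertarget] * len(unique_targets)
    selected = []
    for target, n in zip(unique_targets, npertarget):
        candidates = [i for i, t in enumerate(targets) if t == target]
        selected.extend(rng.sample(candidates, n))
    return sorted(selected)

targets = ['a', 'a', 'a', 'b', 'b', 'b']
picked = random_sample_indices(targets, 2, rng=random.Random(0))
per_target = random_sample_indices(targets, [1, 3], rng=random.Random(0))
```

With an int of 2 and two unique targets, four samples are selected in total, matching the npertarget x len(uniquetargets) rule.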
-
remove_invariant_features(dataset)
Returns a new dataset with all invariant features removed.
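On a bare array, dropping invariant features amounts to a boolean column mask. This sketch assumes that "invariant" means zero standard deviation across samples, which is one reasonable reading and not confirmed by the entry itself:

```python
import numpy as np

samples = np.array([[1.0, 5.0, 2.0],
                    [1.0, 6.0, 2.0],
                    [1.0, 7.0, 3.0]])

# A feature (column) is kept only if its values vary across samples.
varies = samples.std(axis=0) != 0
reduced = samples[:, varies]
print(varies, reduced.shape)
```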
-
remove_nonfinite_features(dataset)
Returns a new dataset with all non-finite (NaN, Inf) features removed.
Removes all features for which not all values are finite.
Parameters: dataset : Dataset
Input dataset
Returns: finite_dataset : Dataset
Dataset based on data from the input, but only the features for which all samples are finite are kept.
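The selection rule (keep a feature only if every sample is finite) can be written as a column mask on a bare array. This is a sketch of the rule as stated, applied to made-up data, not the library's code:

```python
import numpy as np

samples = np.array([[1.0, np.nan, 3.0],
                    [4.0, 5.0, np.inf],
                    [7.0, 8.0, 9.0]])

# True for columns in which every value is finite (neither NaN nor Inf).
keep = np.isfinite(samples).all(axis=0)
finite = samples[:, keep]
print(keep, finite.ravel())
```

A single NaN or Inf in a column is enough to drop that feature entirely.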
-
save(dataset, destination, name=None, compression=None)
Save Dataset into an HDF5 file.
Parameters: dataset : Dataset
destination : h5py.highlevel.File or str
name : str, optional
compression : None or int or {‘gzip’, ‘szip’, ‘lzf’}, optional
Level of compression for gzip, or another compression strategy.
-
shape
-
summary(dataset, stats=True, lstats='auto', sstats='auto', idhash=False, targets_attr='targets', chunks_attr='chunks', maxc=30, maxt=20)
String summary over the object.
Parameters: stats : bool
Include some basic statistics (mean, std, var) over dataset samples.
lstats : ‘auto’ or bool
Include statistics on chunks/targets. If ‘auto’, they are included only if both targets_attr and chunks_attr are present.
sstats : ‘auto’ or bool
Sequence (order) statistics. If ‘auto’, they are included only if targets_attr is present.
idhash : bool
Include idhash value for dataset and samples.
targets_attr : str, optional
Name of the sample attribute holding the targets.
chunks_attr : str, optional
Name of the sample attribute holding the chunks, i.e. independent groups of samples.
maxt : int
Maximal number of targets when providing details on targets/chunks.
maxc : int
Maximal number of chunks when providing details on targets/chunks.
-
summary_targets(dataset, targets_attr='targets', chunks_attr='chunks', maxc=30, maxt=20)
Provide summary statistics over the targets and chunks.
Parameters: dataset : Dataset
Dataset to operate on.
targets_attr : str, optional
Name of the sample attribute holding the targets.
chunks_attr : str, optional
Name of the sample attribute holding the chunks, i.e. independent groups of samples.
maxc : int
Maximal number of chunks when providing details.
maxt : int
Maximal number of targets when providing details.
-