# Data transformations



## Get, split, and label

For most data source creation we need functions to get a list of items,
split them into train/valid sets, and label them. fastai provides
functions to make each of these steps easy (especially when combined
with `fastai.data.block`).

### Get

First we’ll look at functions that *get* a list of items (generally file
names).

We’ll use *tiny MNIST* (a subset of MNIST with just two classes, `7`s
and `3`s) for our examples/tests throughout this page.

``` python
path = untar_data(URLs.MNIST_TINY)
(path/'train').ls()
```

    (#2) [Path('/Users/jhoward/.fastai/data/mnist_tiny/train/7'),Path('/Users/jhoward/.fastai/data/mnist_tiny/train/3')]

------------------------------------------------------------------------

<a
href="https://github.com/fastai/fastai/blob/main/fastai/data/transforms.py#L31"
target="_blank" style="float:right; font-size:smaller">source</a>

### get_files

``` python

def get_files(
    path, extensions:NoneType=None, recurse:bool=True, folders:NoneType=None, followlinks:bool=True
):

```

*Get all the files in `path` with optional `extensions`, optionally with
`recurse`, only in `folders`, if specified.*

This is the most general way to grab a bunch of file names from disk. If
you pass `extensions` (including the `.`), the returned file names are
filtered by that list. By default (`recurse=True`) all child folders are
searched recursively; pass `recurse=False` to include only the files
directly in `path`. `folders` is an optional list of directories to
limit the search to.

``` python
t3 = get_files(path/'train'/'3', extensions='.png', recurse=False)
t7 = get_files(path/'train'/'7', extensions='.png', recurse=False)
t  = get_files(path/'train', extensions='.png', recurse=True)
test_eq(len(t), len(t3)+len(t7))
test_eq(len(get_files(path/'train'/'3', extensions='.jpg', recurse=False)),0)
test_eq(len(t), len(get_files(path, extensions='.png', recurse=True, folders='train')))
t
```

    (#709) [Path('/Users/jhoward/.fastai/data/mnist_tiny/train/7/9243.png'),Path('/Users/jhoward/.fastai/data/mnist_tiny/train/7/9519.png'),Path('/Users/jhoward/.fastai/data/mnist_tiny/train/7/7534.png'),Path('/Users/jhoward/.fastai/data/mnist_tiny/train/7/9082.png'),Path('/Users/jhoward/.fastai/data/mnist_tiny/train/7/8377.png'),Path('/Users/jhoward/.fastai/data/mnist_tiny/train/7/994.png'),Path('/Users/jhoward/.fastai/data/mnist_tiny/train/7/8559.png'),Path('/Users/jhoward/.fastai/data/mnist_tiny/train/7/8217.png'),Path('/Users/jhoward/.fastai/data/mnist_tiny/train/7/8571.png'),Path('/Users/jhoward/.fastai/data/mnist_tiny/train/7/8954.png')...]
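To make the roles of `extensions` and `recurse` concrete, here is a minimal standard-library sketch of the same idea. It is not fastai's implementation: `list_files` is a hypothetical helper written for this illustration, and it ignores the `folders` and `followlinks` options.

```python
import os
import tempfile
from pathlib import Path

def list_files(path, extensions=None, recurse=True):
    "Hypothetical stand-in for `get_files`: collect files under `path`, optionally filtered by suffix."
    if isinstance(extensions, str): extensions = [extensions]
    exts = None if extensions is None else {e.lower() for e in extensions}
    found = []
    for root, dirs, files in os.walk(Path(path)):
        for name in files:
            if exts is None or Path(name).suffix.lower() in exts:
                found.append(Path(root)/name)
        if not recurse: break  # the first iteration of os.walk is the top-level directory
    return sorted(found)

# Build a tiny throwaway tree to exercise the helper.
with tempfile.TemporaryDirectory() as d:
    root = Path(d)
    for sub in ('3', '7'): (root/sub).mkdir()
    for rel in ('3/a.png', '7/b.png', '7/notes.txt', 'top.png'): (root/rel).touch()
    n_png_all = len(list_files(root, extensions='.png'))
    n_png_top = len(list_files(root, extensions='.png', recurse=False))
    n_any = len(list_files(root))
```

With `recurse=False` only `top.png` is found, while the recursive search also picks up the two images in the sub-folders.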

It’s often useful to be able to create functions with customized
behavior. `fastai.data` generally uses functions named as CamelCase
verbs ending in `er` to create these functions.
[`FileGetter`](https://docs.fast.ai/data.transforms.html#filegetter) is
a simple example of such a function creator.

------------------------------------------------------------------------

<a
href="https://github.com/fastai/fastai/blob/main/fastai/data/transforms.py#L50"
target="_blank" style="float:right; font-size:smaller">source</a>

### FileGetter

``` python

def FileGetter(
    suf:str='', extensions:NoneType=None, recurse:bool=True, folders:NoneType=None
):

```

*Create
[`get_files`](https://docs.fast.ai/data.transforms.html#get_files)
partial function that searches path suffix `suf`, only in `folders`, if
specified, and passes along args*

``` python
fpng = FileGetter(extensions='.png', recurse=False)
test_eq(len(t7), len(fpng(path/'train'/'7')))
test_eq(len(t), len(fpng(path/'train', recurse=True)))
fpng_r = FileGetter(extensions='.png', recurse=True)
test_eq(len(t), len(fpng_r(path/'train')))
```
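The "function creator" pattern can be sketched without fastai as a closure that stores default options and still lets the caller override them at call time, as `fpng(path/'train', recurse=True)` does in the tests. The names `find_files` and `make_file_getter` are hypothetical and only serve this illustration.

```python
import tempfile
from pathlib import Path

def find_files(path, suffix='', extension=None, recurse=True):
    "Hypothetical helper: list files under `path/suffix`, optionally keeping one extension."
    base = Path(path)/suffix
    candidates = base.rglob('*') if recurse else base.glob('*')
    return sorted(p for p in candidates
                  if p.is_file() and (extension is None or p.suffix == extension))

def make_file_getter(suffix='', extension=None, recurse=True):
    "Return a function of `path` with the options baked in; call-time keywords override them."
    def _getter(path, **overrides):
        opts = dict(suffix=suffix, extension=extension, recurse=recurse)
        opts.update(overrides)
        return find_files(path, **opts)
    return _getter

with tempfile.TemporaryDirectory() as d:
    root = Path(d)
    (root/'train'/'3').mkdir(parents=True)
    (root/'train'/'3'/'1.png').touch()
    (root/'train'/'top.png').touch()
    get_png = make_file_getter(extension='.png', recurse=False)
    n_flat = len(get_png(root/'train'))                # uses the stored recurse=False
    n_deep = len(get_png(root/'train', recurse=True))  # overrides it for this call
```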

------------------------------------------------------------------------

<a
href="https://github.com/fastai/fastai/blob/main/fastai/data/transforms.py#L60"
target="_blank" style="float:right; font-size:smaller">source</a>

### get_image_files

``` python

def get_image_files(
    path, recurse:bool=True, folders:NoneType=None
):

```

*Get image files in `path` recursively, only in `folders`, if
specified.*

This is simply
[`get_files`](https://docs.fast.ai/data.transforms.html#get_files)
called with a list of standard image extensions.

``` python
test_eq(len(t), len(get_image_files(path, recurse=True, folders='train')))
```

------------------------------------------------------------------------

<a
href="https://github.com/fastai/fastai/blob/main/fastai/data/transforms.py#L65"
target="_blank" style="float:right; font-size:smaller">source</a>

### ImageGetter

``` python

def ImageGetter(
    suf:str='', recurse:bool=True, folders:NoneType=None
):

```

*Create
[`get_image_files`](https://docs.fast.ai/data.transforms.html#get_image_files)
partial that searches suffix `suf` and passes along `kwargs`, only in
`folders`, if specified*

Same as
[`FileGetter`](https://docs.fast.ai/data.transforms.html#filegetter),
but for image extensions.

``` python
test_eq(len(get_files(path/'train', extensions='.png', recurse=True, folders='3')),
        len(ImageGetter(   'train',                    recurse=True, folders='3')(path)))
```

------------------------------------------------------------------------

<a
href="https://github.com/fastai/fastai/blob/main/fastai/data/transforms.py#L71"
target="_blank" style="float:right; font-size:smaller">source</a>

### get_text_files

``` python

def get_text_files(
    path, recurse:bool=True, folders:NoneType=None
):

```

*Get text files in `path` recursively, only in `folders`, if specified.*

------------------------------------------------------------------------

<a
href="https://github.com/fastai/fastai/blob/main/fastai/data/transforms.py#L76"
target="_blank" style="float:right; font-size:smaller">source</a>

### ItemGetter

``` python

def ItemGetter(
    i
):

```

*Creates a proper transform that applies `itemgetter(i)` (even on a
tuple)*

``` python
test_eq(ItemGetter(1)((1,2,3)),  2)
test_eq(ItemGetter(1)(L(1,2,3)), 2)
test_eq(ItemGetter(1)([1,2,3]),  2)
test_eq(ItemGetter(1)(np.array([1,2,3])),  2)
```

------------------------------------------------------------------------

<a
href="https://github.com/fastai/fastai/blob/main/fastai/data/transforms.py#L83"
target="_blank" style="float:right; font-size:smaller">source</a>

### AttrGetter

``` python

def AttrGetter(
    nm, default:NoneType=None
):

```

*Creates a proper transform that applies `attrgetter(nm)` (even on a
tuple)*

``` python
test_eq(AttrGetter('shape')(torch.randn([4,5])), [4,5])
test_eq(AttrGetter('shape', [0])([4,5]), [0])
```
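For intuition, the core behaviour of these two getters can be reproduced with the standard library. This sketch leaves out the transform machinery entirely; `make_attr_getter` is a hypothetical helper that adds the `default` fallback, which `operator.attrgetter` does not offer.

```python
from operator import itemgetter

def make_attr_getter(name, default=None):
    "Hypothetical helper: fetch attribute `name`, or return `default` when it is missing."
    return lambda obj: getattr(obj, name, default)

second = itemgetter(1)
second_of_tuple = second((1, 2, 3))
second_of_list = second([1, 2, 3])

real_part = make_attr_getter('real')(3 + 4j)                 # complex numbers have a `.real` attribute
fallback = make_attr_getter('shape', default=[0])([4, 5])    # plain lists have no `.shape`
```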

### Split

The next set of functions is used to *split* data into training and
validation sets. Each splitter returns two lists: the indices (or masks)
of the training items and those of the validation items.

------------------------------------------------------------------------

<a
href="https://github.com/fastai/fastai/blob/main/fastai/data/transforms.py#L90"
target="_blank" style="float:right; font-size:smaller">source</a>

### RandomSplitter

``` python

def RandomSplitter(
    valid_pct:float=0.2, seed:NoneType=None
):

```

*Create function that splits `items` between train/val with `valid_pct`
randomly.*

``` python
def _test_splitter(f, items=None):
    "A basic set of conditions a splitter must pass"
    items = ifnone(items, range_of(30))
    trn,val = f(items)
    assert 0<len(trn)<len(items)
    assert all(o not in val for o in trn)
    test_eq(len(trn), len(items)-len(val))
    # test random seed consistency
    test_eq(f(items)[0], trn)
    return trn, val
```

``` python
_test_splitter(RandomSplitter(seed=42))
```

    ((#24) [10,18,16,23,28,26,20,7,21,22...], (#6) [12,0,6,25,8,15])
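A seeded random split can be sketched in pure Python as follows. This is an assumption-laden analogue, not the library's code: `make_random_splitter` is a hypothetical name, and it uses Python's `random` module, so the indices it produces will differ from the output shown above.

```python
import random

def make_random_splitter(valid_pct=0.2, seed=None):
    "Hypothetical analogue of a random splitter; returns (train_idxs, valid_idxs)."
    def _split(items):
        idxs = list(range(len(items)))
        random.Random(seed).shuffle(idxs)   # a fresh RNG per call keeps a seeded split reproducible
        cut = int(valid_pct * len(items))
        return idxs[cut:], idxs[:cut]
    return _split

split = make_random_splitter(valid_pct=0.2, seed=42)
trn, val = split(list(range(30)))
trn_again, _ = split(list(range(30)))
```

Creating the generator inside `_split` is what makes repeated calls with the same seed return the same split, which is the property `_test_splitter` checks.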

The next splitter uses scikit-learn’s `train_test_split`, which makes it
possible to *split* items in a stratified fashion (uniformly according to
the `labels` distribution).

------------------------------------------------------------------------

<a
href="https://github.com/fastai/fastai/blob/main/fastai/data/transforms.py#L100"
target="_blank" style="float:right; font-size:smaller">source</a>

### TrainTestSplitter

``` python

def TrainTestSplitter(
    test_size:float=0.2, random_state:NoneType=None, stratify:NoneType=None, train_size:NoneType=None,
    shuffle:bool=True
):

```

*Split `items` into random train and test subsets using sklearn
train_test_split utility.*

``` python
src = list(range(30))
labels = [0] * 20 + [1] * 10
test_size = 0.2

f = TrainTestSplitter(test_size=test_size, random_state=42, stratify=labels)
trn,val = _test_splitter(f, items=src)

# test labels distribution consistency
# the validation set should hold a `test_size` fraction of the zeros (indices 0-19) and of the ones (indices 20-29)
test_eq(len([t for t in val if t < 20]) / 20, test_size)
test_eq(len([t for t in val if t >= 20]) / 10, test_size)
```
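To show what "stratified" means here, the sketch below holds out the same fraction of every label by hand. It is a simplified illustration with a hypothetical `stratified_split` helper, not scikit-learn's algorithm, and the exact indices it picks will differ.

```python
import random
from collections import defaultdict

def stratified_split(labels, valid_pct=0.2, seed=None):
    "Hypothetical sketch: hold out `valid_pct` of the indices of each label."
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)
    trn, val = [], []
    for idxs in by_label.values():
        rng.shuffle(idxs)
        cut = round(valid_pct * len(idxs))   # number of held-out items for this label
        val += idxs[:cut]
        trn += idxs[cut:]
    return sorted(trn), sorted(val)

labels = [0] * 20 + [1] * 10
trn, val = stratified_split(labels, valid_pct=0.2, seed=42)
n_zeros = sum(labels[i] == 0 for i in val)
n_ones = sum(labels[i] == 1 for i in val)
```

Both classes contribute 20% of their items to the validation set, so the 2:1 class ratio is preserved in both splits.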

------------------------------------------------------------------------

<a
href="https://github.com/fastai/fastai/blob/main/fastai/data/transforms.py#L109"
target="_blank" style="float:right; font-size:smaller">source</a>

### IndexSplitter

``` python

def IndexSplitter(
    valid_idx
):

```

*Split `items` so that `valid_idx` are in the validation set and the
others in the training set*

``` python
items = 'a,b,c,d,e,f,g,h,i,j'.split(',')  # letters make it obvious that the splitter returns indices, not items
splitter = IndexSplitter([3,7,9])

_test_splitter(splitter, items)
test_eq(splitter(items),[[0,1,2,4,5,6,8],[3,7,9]])
```

------------------------------------------------------------------------

<a
href="https://github.com/fastai/fastai/blob/main/fastai/data/transforms.py#L117"
target="_blank" style="float:right; font-size:smaller">source</a>

### EndSplitter

``` python

def EndSplitter(
    valid_pct:float=0.2, valid_last:bool=True
):

```

*Create function that splits `items` between train/val with `valid_pct`
at the end if `valid_last` else at the start. Useful for ordered data.*

``` python
items = range_of(10)

splitter_last = EndSplitter(valid_last=True)
_test_splitter(splitter_last)
test_eq(splitter_last(items), ([0,1,2,3,4,5,6,7], [8,9]))

splitter_start = EndSplitter(valid_last=False)
_test_splitter(splitter_start)
test_eq(splitter_start(items), ([2,3,4,5,6,7,8,9], [0,1]))
```
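The same behaviour can be written in a few lines of plain Python, which may help when reasoning about ordered data such as time series. `make_end_splitter` is a hypothetical name used for this sketch only.

```python
def make_end_splitter(valid_pct=0.2, valid_last=True):
    "Hypothetical sketch: take a contiguous validation block from one end of ordered items."
    def _split(items):
        n = len(items)
        n_valid = int(valid_pct * n)
        idxs = list(range(n))
        if valid_last:
            return idxs[:n - n_valid], idxs[n - n_valid:]
        return idxs[n_valid:], idxs[:n_valid]
    return _split

last = make_end_splitter(valid_last=True)(range(10))
first = make_end_splitter(valid_last=False)(range(10))
```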

------------------------------------------------------------------------

<a
href="https://github.com/fastai/fastai/blob/main/fastai/data/transforms.py#L132"
target="_blank" style="float:right; font-size:smaller">source</a>

### GrandparentSplitter

``` python

def GrandparentSplitter(
    train_name:str='train', valid_name:str='valid'
):

```

*Split `items` from the grandparent folder names (`train_name` and
`valid_name`).*

``` python
fnames = [path/'train/3/9932.png', path/'valid/7/7189.png',
          path/'valid/7/7320.png', path/'train/7/9833.png',
          path/'train/3/7666.png', path/'valid/3/925.png',
          path/'train/7/724.png', path/'valid/3/93055.png']
splitter = GrandparentSplitter()
```

``` python
_test_splitter(splitter, items=fnames)
test_eq(splitter(fnames),[[0,3,4,6],[1,2,5,7]])
```

``` python
fnames2 = fnames + [path/'test/3/4256.png', path/'test/7/2345.png', path/'valid/7/6467.png']
splitter = GrandparentSplitter(train_name=('train', 'valid'), valid_name='test')
_test_splitter(splitter, items=fnames2)
test_eq(splitter(fnames2),[[0,3,4,6,1,2,5,7,10],[8,9]])
```
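The grouping logic can be sketched with `pathlib` alone. `grandparent_split` is a hypothetical helper; like the second test above, it accepts either a single folder name or several, and it lists the indices for each name in the order the names are given.

```python
from pathlib import Path

def grandparent_split(paths, train_name='train', valid_name='valid'):
    "Hypothetical sketch: assign each path to a split from the name of its grandparent folder."
    def _idxs(names):
        names = [names] if isinstance(names, str) else list(names)
        return [i for name in names
                for i, p in enumerate(paths) if Path(p).parent.parent.name == name]
    return _idxs(train_name), _idxs(valid_name)

paths = ['data/train/3/1.png', 'data/valid/7/2.png', 'data/test/3/3.png', 'data/train/7/4.png']
trn, val = grandparent_split(paths)
trn2, val2 = grandparent_split(paths, train_name=('train', 'valid'), valid_name='test')
```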

------------------------------------------------------------------------

<a
href="https://github.com/fastai/fastai/blob/main/fastai/data/transforms.py#L139"
target="_blank" style="float:right; font-size:smaller">source</a>

### FuncSplitter

``` python

def FuncSplitter(
    func
):

```

*Split `items` by result of `func` (`True` for validation, `False` for
training set).*

``` python
splitter = FuncSplitter(lambda o: Path(o).parent.parent.name == 'valid')
_test_splitter(splitter, fnames)
test_eq(splitter(fnames),[[0,3,4,6],[1,2,5,7]])
```

------------------------------------------------------------------------

<a
href="https://github.com/fastai/fastai/blob/main/fastai/data/transforms.py#L147"
target="_blank" style="float:right; font-size:smaller">source</a>

### MaskSplitter

``` python

def MaskSplitter(
    mask
):

```

*Split `items` depending on the value of `mask`.*

``` python
items = list(range(6))
splitter = MaskSplitter([True,False,False,True,False,True])
_test_splitter(splitter, items)
test_eq(splitter(items),[[1,2,4],[0,3,5]])
```
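Splitting by function and splitting by mask come down to the same step: build a boolean mask, then separate the positions. The sketch below uses hypothetical helper names and plain lists.

```python
def mask_to_splits(mask):
    "Hypothetical sketch: positions where `mask` is true go to validation, the others to training."
    train = [i for i, m in enumerate(mask) if not m]
    valid = [i for i, m in enumerate(mask) if m]
    return train, valid

def func_to_splits(func, items):
    "Evaluate `func` on every item to build the mask, then split on it."
    return mask_to_splits([func(o) for o in items])

by_mask = mask_to_splits([True, False, False, True, False, True])
by_func = func_to_splits(lambda o: o % 2 == 0, [10, 11, 12, 13])
```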

------------------------------------------------------------------------

<a
href="https://github.com/fastai/fastai/blob/main/fastai/data/transforms.py#L153"
target="_blank" style="float:right; font-size:smaller">source</a>

### FileSplitter

``` python

def FileSplitter(
    fname
):

```

*Split `items` by providing file `fname` (contains names of valid items
separated by newline).*

``` python
with tempfile.TemporaryDirectory() as d:
    fname = Path(d)/'valid.txt'
    fname.write_text('\n'.join([Path(fnames[i]).name for i in [1,3,4]]))
    splitter = FileSplitter(fname)
    _test_splitter(splitter, fnames)
    test_eq(splitter(fnames),[[0,2,5,6,7],[1,3,4]])
```
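As a rough illustration of the matching rule, the sketch below compares each item's file name with the names listed in the text. It is a simplified assumption of the behaviour: `split_by_name_list` is hypothetical, and it takes the listing as a string rather than reading it from disk.

```python
from pathlib import Path

def split_by_name_list(paths, valid_names):
    "Hypothetical sketch: a path is a validation item when its file name appears in `valid_names`."
    valid_names = set(valid_names)
    mask = [Path(p).name in valid_names for p in paths]
    train = [i for i, m in enumerate(mask) if not m]
    valid = [i for i, m in enumerate(mask) if m]
    return train, valid

listing = "b.png\nd.png"   # what a newline-separated file such as valid.txt might contain
paths = ['x/a.png', 'x/b.png', 'y/c.png', 'y/d.png']
trn, val = split_by_name_list(paths, listing.splitlines())
```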

------------------------------------------------------------------------

<a
href="https://github.com/fastai/fastai/blob/main/fastai/data/transforms.py#L161"
target="_blank" style="float:right; font-size:smaller">source</a>

### ColSplitter

``` python

def ColSplitter(
    col:str='is_valid', on:NoneType=None
):

```

*Split `items` (supposed to be a dataframe) by value in `col`*

``` python
df = pd.DataFrame({'a': [0,1,2,3,4], 'b': [True,False,True,True,False]})
splits = ColSplitter('b')(df)
test_eq(splits, [[1,4], [0,2,3]])
# Works with strings or index
splits = ColSplitter(1)(df)
test_eq(splits, [[1,4], [0,2,3]])
# does not get confused if the 'is_valid' column holds integers that are meant as yes/no flags
df = pd.DataFrame({'a': [0,1,2,3,4], 'is_valid': [1,0,1,1,0]})
splits_by_int = ColSplitter('is_valid')(df)
test_eq(splits_by_int, [[1,4], [0,2,3]])
# optionally pass a specific value to split on
df = pd.DataFrame({'a': [0,1,2,3,4,5], 'b': [1,2,3,1,2,3]})
splits_on_val = ColSplitter('b', 3)(df)
test_eq(splits_on_val, [[0,1,3,4], [2,5]])
# or multiple values
splits_on_val = ColSplitter('b', [2,3])(df)
test_eq(splits_on_val, [[0,3], [1,2,4,5]])
```
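The three modes tested above (truthy flag, single value, several values) can be sketched over a list of dict rows, which keeps the example free of pandas. `column_split` is a hypothetical helper, not the library function.

```python
def column_split(rows, col='is_valid', on=None):
    "Hypothetical sketch over a list of dict rows instead of a DataFrame."
    if on is None:
        mask = [bool(r[col]) for r in rows]      # truthy values mark validation rows
    elif isinstance(on, (list, tuple, set)):
        mask = [r[col] in on for r in rows]      # membership in several values
    else:
        mask = [r[col] == on for r in rows]      # equality with a single value
    train = [i for i, m in enumerate(mask) if not m]
    valid = [i for i, m in enumerate(mask) if m]
    return train, valid

rows = [{'b': v} for v in [1, 2, 3, 1, 2, 3]]
on_three = column_split(rows, 'b', on=3)
on_two_or_three = column_split(rows, 'b', on=[2, 3])
flags = column_split([{'is_valid': v} for v in [1, 0, 1, 1, 0]])
```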

------------------------------------------------------------------------

<a
href="https://github.com/fastai/fastai/blob/main/fastai/data/transforms.py#L173"
target="_blank" style="float:right; font-size:smaller">source</a>

### RandomSubsetSplitter

``` python

def RandomSubsetSplitter(
    train_sz, valid_sz, seed:NoneType=None
):

```

*Take random subsets of `splits` with `train_sz` and `valid_sz`*

``` python
items = list(range(100))
splitter = RandomSubsetSplitter(0.3, 0.1)
splits = splitter(items)
test_eq(len(splits[0]), 30)
test_eq(len(splits[1]), 10)
```
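Unlike the other splitters, this one does not use every item: in the test above 60 of the 100 items end up in neither set. A pure-Python sketch of that idea, with the hypothetical helper `random_subset_split` and fractional sizes only, could look like this.

```python
import random

def random_subset_split(items, train_sz, valid_sz, seed=None):
    "Hypothetical sketch: draw disjoint random train and validation subsets; the rest is unused."
    n = len(items)
    n_train, n_valid = round(train_sz * n), round(valid_sz * n)
    idxs = list(range(n))
    random.Random(seed).shuffle(idxs)
    return idxs[:n_train], idxs[n_train:n_train + n_valid]

trn, val = random_subset_split(list(range(100)), 0.3, 0.1, seed=0)
```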

### Label

The final set of functions is used to *label* a single item of data.

------------------------------------------------------------------------

<a
href="https://github.com/fastai/fastai/blob/main/fastai/data/transforms.py#L187"
target="_blank" style="float:right; font-size:smaller">source</a>

### parent_label

``` python

def parent_label(
    o
):

```

*Label `item` with the parent folder name.*

Note that
[`parent_label`](https://docs.fast.ai/data.transforms.html#parent_label)
doesn’t have anything to customize, so it doesn’t return a function; you
can just use it directly.

``` python
test_eq(parent_label(fnames[0]), '3')
test_eq(parent_label("fastai_dev/dev/data/mnist_tiny/train/3/9932.png"), '3')
[parent_label(o) for o in fnames]
```

    ['3', '7', '7', '7', '3', '3', '7', '3']

------------------------------------------------------------------------

<a
href="https://github.com/fastai/fastai/blob/main/fastai/data/transforms.py#L192"
target="_blank" style="float:right; font-size:smaller">source</a>

### RegexLabeller

``` python

def RegexLabeller(
    pat, match:bool=False
):

```

*Label `item` with regex `pat`.*

[`RegexLabeller`](https://docs.fast.ai/data.transforms.html#regexlabeller)
is a very flexible function since it handles any regex search of the
stringified item. Pass `match=True` to use `re.match` (i.e. check only
start of string), or `re.search` otherwise (default).

For instance, here’s an example that replicates the previous
[`parent_label`](https://docs.fast.ai/data.transforms.html#parent_label)
results.

``` python
f = RegexLabeller(fr'{posixpath.sep}(\d){posixpath.sep}')
test_eq(f(fnames[0]), '3')
[f(o) for o in fnames]
```

    ['3', '7', '7', '7', '3', '3', '7', '3']

``` python
f = RegexLabeller(fr'{posixpath.sep}(\d){posixpath.sep}')
a1 = Path(fnames[0]).as_posix()
test_eq(f(a1), '3')
[f(o) for o in fnames]
```

    ['3', '7', '7', '7', '3', '3', '7', '3']

``` python
f = RegexLabeller(r'(\d*)', match=True)
test_eq(f(fnames[0].name), '9932')
```
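The difference between the two modes is the standard one between `re.match` and `re.search`. The sketch below reproduces it with a hypothetical `make_regex_labeller` helper; it assumes the label is the first capture group and that a failed lookup should raise an error.

```python
import re

def make_regex_labeller(pat, match=False):
    "Hypothetical sketch: return the first capture group of `pat` found in the stringified item."
    compiled = re.compile(pat)
    finder = compiled.match if match else compiled.search
    def _label(item):
        res = finder(str(item))
        if res is None: raise ValueError(f"Failed to find {pat!r} in {item!r}")
        return res.group(1)
    return _label

folder_digit = make_regex_labeller(r'/(\d)/')('mnist_tiny/train/3/9932.png')
leading_digits = make_regex_labeller(r'(\d+)', match=True)('9932.png')
anywhere = make_regex_labeller(r'(\d+)')('img_9932.png')   # search scans the whole string

try:
    make_regex_labeller(r'(\d+)', match=True)('img_9932.png')  # match only looks at the start
    match_failed = False
except ValueError:
    match_failed = True
```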

------------------------------------------------------------------------

<a
href="https://github.com/fastai/fastai/blob/main/fastai/data/transforms.py#L205"
target="_blank" style="float:right; font-size:smaller">source</a>

### ColReader

``` python

def ColReader(
    cols, pref:str='', suff:str='', label_delim:NoneType=None
):

```

*Read `cols` in `row` with potential `pref` and `suff`*

`cols` can be a list of column names or a list of indices (or a mix of
both). If `label_delim` is passed, the result is split using it.

``` python
df = pd.DataFrame({'a': 'a b c d'.split(), 'b': ['1 2', '0', '', '1 2 3']})
f = ColReader('a', pref='0', suff='1')
test_eq([f(o) for o in df.itertuples()], '0a1 0b1 0c1 0d1'.split())

f = ColReader('b', label_delim=' ')
test_eq([f(o) for o in df.itertuples()], [['1', '2'], ['0'], [], ['1', '2', '3']])

df['a1'] = df['a']
f = ColReader(['a', 'a1'], pref='0', suff='1')
test_eq([f(o) for o in df.itertuples()], [L('0a1', '0a1'), L('0b1', '0b1'), L('0c1', '0c1'), L('0d1', '0d1')])

df = pd.DataFrame({'a': [L(0,1), L(2,3,4), L(5,6,7)]})
f = ColReader('a')
test_eq([f(o) for o in df.itertuples()], [L(0,1), L(2,3,4), L(5,6,7)])

df['name'] = df['a']
f = ColReader('name')
test_eq([f(df.iloc[0,:])], [L(0,1)])

df['mask'] = df['a']
f = ColReader('mask')
test_eq([f(o) for o in df.itertuples()], [L(0,1), L(2,3,4), L(5,6,7)])
test_eq([f(df.iloc[0,:])], [L(0,1)])
```
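A typical use is turning a file-name column into a full path, or a delimited string into a list of labels. The sketch below mimics both on plain dict rows. `make_col_reader` is hypothetical, and unlike the tests above it always stringifies the value when no `label_delim` is given.

```python
def make_col_reader(cols, pref='', suff='', label_delim=None):
    "Hypothetical sketch over dict rows: read one or several columns of a row."
    def _one(row, col):
        value = row[col]
        if label_delim is not None:
            return value.split(label_delim) if value else []   # empty string means no labels
        return f'{pref}{value}{suff}'
    def _read(row):
        if isinstance(cols, (list, tuple)):
            return [_one(row, c) for c in cols]
        return _one(row, cols)
    return _read

row = {'fname': 'img_01', 'tags': 'cat dog'}
as_path = make_col_reader('fname', pref='images/', suff='.png')(row)
as_tags = make_col_reader('tags', label_delim=' ')(row)
no_tags = make_col_reader('tags', label_delim=' ')({'tags': ''})
```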

------------------------------------------------------------------------

<a
href="https://github.com/fastai/fastai/blob/main/fastai/data/transforms.py#L223"
target="_blank" style="float:right; font-size:smaller">source</a>

### CategoryMap

``` python

def CategoryMap(
    col, sort:bool=True, add_na:bool=False, strict:bool=False
):

```

*Collection of categories with the reverse mapping in `o2i`*

``` python
t = CategoryMap([4,2,3,4])
test_eq(t, [2,3,4])
test_eq(t.o2i, {2:0,3:1,4:2})
test_eq(t.map_objs([2,3]), [0,1])
test_eq(t.map_ids([0,1]), [2,3])
test_fail(lambda: t.o2i['unseen label'])
```

``` python
t = CategoryMap([4,2,3,4], add_na=True)
test_eq(t, ['#na#',2,3,4])
test_eq(t.o2i, {'#na#':0,2:1,3:2,4:3})
```

``` python
t = CategoryMap(pd.Series([4,2,3,4]), sort=False)
test_eq(t, [4,2,3])
test_eq(t.o2i, {4:0,2:1,3:2})
```

``` python
col = pd.Series(pd.Categorical(['M','H','L','M'], categories=['H','M','L'], ordered=True))
t = CategoryMap(col)
test_eq(t, ['H','M','L'])
test_eq(t.o2i, {'H':0,'M':1,'L':2})
```

``` python
col = pd.Series(pd.Categorical(['M','H','M'], categories=['H','M','L'], ordered=True))
t = CategoryMap(col, strict=True)
test_eq(t, ['H','M'])
test_eq(t.o2i, {'H':0,'M':1})
```
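Leaving aside the pandas categorical handling, the vocabulary and its reverse mapping can be built in a few lines. This is a sketch under that simplification; `build_category_map` is a hypothetical name.

```python
def build_category_map(values, sort=True, add_na=False):
    "Hypothetical sketch: return (vocab, o2i) for the unique `values`."
    unique = list(dict.fromkeys(values))        # de-duplicate while keeping first-seen order
    if sort: unique = sorted(unique)
    vocab = (['#na#'] if add_na else []) + unique
    o2i = {o: i for i, o in enumerate(vocab)}   # reverse mapping: object -> index
    return vocab, o2i

vocab, o2i = build_category_map([4, 2, 3, 4])
vocab_na, o2i_na = build_category_map([4, 2, 3, 4], add_na=True)
vocab_raw, _ = build_category_map([4, 2, 3, 4], sort=False)
```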

------------------------------------------------------------------------

<a
href="https://github.com/fastai/fastai/blob/main/fastai/data/transforms.py#L249"
target="_blank" style="float:right; font-size:smaller">source</a>

### Categorize

``` python

def Categorize(
    vocab:NoneType=None, sort:bool=True, add_na:bool=False
):

```

*Reversible transform of category string to `vocab` id*

------------------------------------------------------------------------

<a
href="https://github.com/fastai/fastai/blob/main/fastai/data/transforms.py#L268"
target="_blank" style="float:right; font-size:smaller">source</a>

### Category

``` python

def Category(
    *args, **kwargs
):

```

*A `str` subclass used to represent a single category label.*

Since it inherits from `str`, it is constructed the same way as a
built-in string.

``` python
cat = Categorize()
tds = Datasets(['cat', 'dog', 'cat'], tfms=[cat])
test_eq(cat.vocab, ['cat', 'dog'])
test_eq(cat('cat'), 0)
test_eq(cat.decode(1), 'dog')
test_stdout(lambda: show_at(tds,2), 'cat')
test_fail(lambda: cat('bird'))
```

``` python
cat = Categorize(add_na=True)
tds = Datasets(['cat', 'dog', 'cat'], tfms=[cat])
test_eq(cat.vocab, ['#na#', 'cat', 'dog'])
test_eq(cat('cat'), 1)
test_eq(cat.decode(2), 'dog')
test_stdout(lambda: show_at(tds,2), 'cat')
```

``` python
cat = Categorize(vocab=['dog', 'cat'], sort=False, add_na=True)
tds = Datasets(['cat', 'dog', 'cat'], tfms=[cat])
test_eq(cat.vocab, ['#na#', 'dog', 'cat'])
test_eq(cat('dog'), 1)
test_eq(cat.decode(2), 'cat')
test_stdout(lambda: show_at(tds,2), 'cat')
```
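"Reversible" means that every id produced by encoding can be decoded back to its label. The class below is a minimal stand-alone sketch of that contract. `LabelEncoder` is a hypothetical name and does not reproduce how the transform is set up from a `Datasets`.

```python
class LabelEncoder:
    "Hypothetical sketch of a reversible label-to-id mapping."
    def __init__(self, labels, add_na=False):
        self.vocab = (['#na#'] if add_na else []) + sorted(set(labels))
        self.o2i = {o: i for i, o in enumerate(self.vocab)}
    def encode(self, label):
        if label not in self.o2i:
            raise KeyError(f"Label {label!r} is not in the vocabulary")
        return self.o2i[label]
    def decode(self, idx):
        return self.vocab[idx]

enc = LabelEncoder(['cat', 'dog', 'cat'])
cat_id = enc.encode('cat')
round_trip = enc.decode(enc.encode('dog'))
enc_na = LabelEncoder(['cat', 'dog', 'cat'], add_na=True)
```

As in the tests above, adding `#na#` shifts every real label up by one, and encoding an unseen label such as `'bird'` fails.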

------------------------------------------------------------------------

<a
href="https://github.com/fastai/fastai/blob/main/fastai/data/transforms.py#L271"
target="_blank" style="float:right; font-size:smaller">source</a>

### MultiCategorize

``` python

def MultiCategorize(
    vocab:NoneType=None, add_na:bool=False
):

```

*Reversible transform of multi-category strings to `vocab` id*

------------------------------------------------------------------------

<a
href="https://github.com/fastai/fastai/blob/main/fastai/data/transforms.py#L292"
target="_blank" style="float:right; font-size:smaller">source</a>

### MultiCategory

``` python

def MultiCategory(
    items:NoneType=None, *rest, use_list:bool=False, match:NoneType=None
):

```

*Behaves like a list of `items` but can also index with list of indices
or masks*

``` python
cat = MultiCategorize()
tds = Datasets([['b', 'c'], ['a'], ['a', 'c'], []], tfms=[cat])
test_eq(tds[3][0], TensorMultiCategory([]))
test_eq(cat.vocab, ['a', 'b', 'c'])
test_eq(cat(['a', 'c']), tensor([0,2]))
test_eq(cat([]), tensor([]))
test_eq(cat.decode([1]), ['b'])
test_eq(cat.decode([0,2]), ['a', 'c'])
test_stdout(lambda: show_at(tds,2), 'a;c')

# if vocab supplied, ensure it maintains its order (i.e., it doesn't sort)
cat = MultiCategorize(vocab=['z', 'y', 'x'])
test_eq(cat.vocab, ['z','y','x'])

test_fail(lambda: cat('bird'))
```

------------------------------------------------------------------------

<a
href="https://github.com/fastai/fastai/blob/main/fastai/data/transforms.py#L297"
target="_blank" style="float:right; font-size:smaller">source</a>

### OneHotEncode

``` python

def OneHotEncode(
    c:NoneType=None
):

```

*One-hot encodes targets*

Works in conjunction with `MultiCategorize` or on its own if you have
one-hot encoded targets (pass a `vocab` for decoding and
`do_encode=False` in this case)

``` python
_tfm = OneHotEncode(c=3)
test_eq(_tfm([0,2]), tensor([1.,0,1]))
test_eq(_tfm.decode(tensor([0,1,1])), [1,2])
```

``` python
tds = Datasets([['b', 'c'], ['a'], ['a', 'c'], []], [[MultiCategorize(), OneHotEncode()]])
test_eq(tds[1], [tensor([1.,0,0])])
test_eq(tds[3], [tensor([0.,0,0])])
test_eq(tds.decode([tensor([False, True, True])]), [['b','c']])
test_eq(type(tds[1][0]), TensorMultiCategory)
test_stdout(lambda: show_at(tds,2), 'a;c')
```
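The two steps chained in the `Datasets` above (labels to vocabulary indices, then indices to a one-hot vector) can be sketched with plain lists. The helper names are hypothetical, and Python floats stand in for tensors.

```python
def one_hot(idxs, c):
    "Hypothetical sketch: a length-`c` list with 1.0 at every position in `idxs`."
    out = [0.0] * c
    for i in idxs: out[i] = 1.0
    return out

def one_hot_decode(vec):
    "Recover the active positions from a one-hot (or boolean) vector."
    return [i for i, v in enumerate(vec) if v]

vocab = ['a', 'b', 'c']
o2i = {o: i for i, o in enumerate(vocab)}
encoded = one_hot([o2i[o] for o in ['a', 'c']], c=len(vocab))
decoded = [vocab[i] for i in one_hot_decode(encoded)]
empty = one_hot([], c=len(vocab))   # an item with no labels is all zeros
```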

------------------------------------------------------------------------

<a
href="https://github.com/fastai/fastai/blob/main/fastai/data/transforms.py#L310"
target="_blank" style="float:right; font-size:smaller">source</a>

### EncodedMultiCategorize

``` python

def EncodedMultiCategorize(
    vocab
):

```

*Transform of one-hot encoded multi-category that decodes with `vocab`*

``` python
_tfm = EncodedMultiCategorize(vocab=['a', 'b', 'c'])
test_eq(_tfm([1,0,1]), tensor([1., 0., 1.]))
test_eq(type(_tfm([1,0,1])), TensorMultiCategory)
test_eq(_tfm.decode(tensor([False, True, True])), ['b','c'])

_tfm2 = EncodedMultiCategorize(vocab=['c', 'b', 'a'])
test_eq(_tfm2.vocab, ['c', 'b', 'a'])
```

------------------------------------------------------------------------

<a
href="https://github.com/fastai/fastai/blob/main/fastai/data/transforms.py#L320"
target="_blank" style="float:right; font-size:smaller">source</a>

### RegressionSetup

``` python

def RegressionSetup(
    c:NoneType=None
):

```

*Transform that floatifies targets*

``` python
_tfm = RegressionSetup()
dsets = Datasets([0, 1, 2], RegressionSetup)
test_eq(dsets.c, 1)
test_eq_type(dsets[0], (tensor(0.),))

dsets = Datasets([[0, 1, 2], [3,4,5]], RegressionSetup)
test_eq(dsets.c, 3)
test_eq_type(dsets[0], (tensor([0.,1.,2.]),))
```

------------------------------------------------------------------------

<a
href="https://github.com/fastai/fastai/blob/main/fastai/data/transforms.py#L333"
target="_blank" style="float:right; font-size:smaller">source</a>

### get_c

``` python

def get_c(
    dls
):

```

*Get the number of outputs `c` (e.g. the number of classes) from `dls`.*

## End-to-end dataset example with MNIST

Let’s show how to use these functions to load the MNIST dataset into a
[`Datasets`](https://docs.fast.ai/data.core.html#datasets). First we
grab all the images.

``` python
path = untar_data(URLs.MNIST_TINY)
items = get_image_files(path)
```

Then we split between train and validation depending on the folder.

``` python
splitter = GrandparentSplitter()
splits = splitter(items)
train,valid = (items[i] for i in splits)
train[:3],valid[:3]
```

    ((#3) [Path('/Users/jhoward/.fastai/data/mnist_tiny/train/7/9243.png'),Path('/Users/jhoward/.fastai/data/mnist_tiny/train/7/9519.png'),Path('/Users/jhoward/.fastai/data/mnist_tiny/train/7/7534.png')],
     (#3) [Path('/Users/jhoward/.fastai/data/mnist_tiny/valid/7/9294.png'),Path('/Users/jhoward/.fastai/data/mnist_tiny/valid/7/9257.png'),Path('/Users/jhoward/.fastai/data/mnist_tiny/valid/7/8175.png')])

Our inputs are images that we open and convert to tensors; our targets
are categories, labeled according to the parent directory.

``` python
from PIL import Image
```

``` python
def open_img(fn:Path): return Image.open(fn).copy()
def img2tensor(im:Image.Image): return TensorImage(array(im)[None])

tfms = [[open_img, img2tensor],
        [parent_label, Categorize()]]
train_ds = Datasets(train, tfms)
```

``` python
x,y = train_ds[3]
xd,yd = decode_at(train_ds,3)
test_eq(parent_label(train[3]),yd)
test_eq(array(Image.open(train[3])),xd[0].numpy())
```

``` python
ax = show_at(train_ds, 3, cmap="Greys", figsize=(1,1))
```

![](05_data.transforms_files/figure-commonmark/cell-74-output-1.png)

``` python
assert ax.title.get_text() in ('3','7')
test_fig_exists(ax)
```

------------------------------------------------------------------------

<a
href="https://github.com/fastai/fastai/blob/main/fastai/data/transforms.py#L342"
target="_blank" style="float:right; font-size:smaller">source</a>

### ToTensor

``` python

def ToTensor(
    enc:NoneType=None, dec:NoneType=None, split_idx:NoneType=None, order:NoneType=None
):

```

*Convert item to appropriate tensor class*

------------------------------------------------------------------------

<a
href="https://github.com/fastai/fastai/blob/main/fastai/data/transforms.py#L347"
target="_blank" style="float:right; font-size:smaller">source</a>

### IntToFloatTensor

``` python

def IntToFloatTensor(
    div:float=255.0, div_mask:int=1
):

```

*Transform image to float tensor, optionally dividing by 255 (e.g. for
images).*

``` python
t = (TensorImage(tensor(1)),tensor(2).long(),TensorMask(tensor(3)))
tfm = IntToFloatTensor()
ft = tfm(t)
test_eq(ft, [1./255, 2, 3])
test_eq(type(ft[0]), TensorImage)
test_eq(type(ft[2]), TensorMask)
test_eq(ft[0].type(),'torch.FloatTensor')
test_eq(ft[1].type(),'torch.LongTensor')
test_eq(ft[2].type(),'torch.LongTensor')
```

------------------------------------------------------------------------

<a
href="https://github.com/fastai/fastai/blob/main/fastai/data/transforms.py#L356"
target="_blank" style="float:right; font-size:smaller">source</a>

### broadcast_vec

``` python

def broadcast_vec(
    dim, ndim, *t, cuda:bool=True
):

```

*Make a vector broadcastable over `dim` (out of `ndim` total) by
prepending and appending unit axes*
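To see what "prepending and appending unit axes" produces, the sketch below computes the target shape and builds the equivalent nested list by hand. Both helpers are hypothetical and work on Python lists rather than tensors.

```python
def broadcast_shape(dim, ndim, length):
    "Hypothetical helper: shape of a length-`length` vector laid along axis `dim` of an `ndim`-d batch."
    shape = [1] * ndim
    shape[dim] = length
    return tuple(shape)

def reshape_vector(vec, dim, ndim):
    "Wrap `vec` in nested lists so its values lie along axis `dim` and every other axis has size 1."
    def wrap(x, times):
        for _ in range(times): x = [x]
        return x
    inner = [wrap(v, ndim - dim - 1) for v in vec]   # unit axes after `dim`
    return wrap(inner, dim)                          # unit axes before `dim`

shape = broadcast_shape(dim=1, ndim=4, length=3)     # per-channel stats for a (batch, channel, h, w) tensor
nested = reshape_vector([0.5, 0.25], dim=1, ndim=4)
```

A vector of three per-channel values therefore takes shape `(1, 3, 1, 1)`, which broadcasts against a batch of images laid out as batch, channel, height, width.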

------------------------------------------------------------------------

<a
href="https://github.com/fastai/fastai/blob/main/fastai/data/transforms.py#L365"
target="_blank" style="float:right; font-size:smaller">source</a>

### Normalize

``` python

def Normalize(
    mean:NoneType=None, std:NoneType=None, axes:tuple=(0, 2, 3)
):

```

*Normalize/denorm batch of
[`TensorImage`](https://docs.fast.ai/torch_core.html#tensorimage)*

``` python
mean,std = [0.5]*3,[0.5]*3
mean,std = broadcast_vec(1, 4, mean, std)
batch_tfms = [IntToFloatTensor(), Normalize.from_stats(mean,std)]
tdl = TfmdDL(train_ds, after_batch=batch_tfms, bs=4, device=default_device())
```

``` python
x,y  = tdl.one_batch()
xd,yd = tdl.decode((x,y))

assert x.type().endswith('.FloatTensor')
test_eq(xd.type(), 'torch.LongTensor')
test_eq(type(x), TensorImage)
test_eq(type(y), TensorCategory)
assert x.mean()<0.0
assert x.std()>0.3
assert 0<xd.float().mean()/255.<1
assert 0<xd.float().std()/255.<0.7
```
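The assertions above follow from the usual definition of normalization, `(x - mean) / std`, with decoding as its inverse `x * std + mean`. The sketch below assumes that standard definition and works on single pixel values in pure Python; the function names are hypothetical.

```python
def normalize(x, mean, std):
    "Standard normalization, applied element-wise to a list of floats."
    return [(v - mean) / std for v in x]

def denormalize(x, mean, std):
    "Inverse of `normalize`."
    return [v * std + mean for v in x]

pixels = [0, 64, 255]                           # raw 8-bit intensities
scaled = [p / 255.0 for p in pixels]            # the int-to-float step: divide by 255
normed = normalize(scaled, mean=0.5, std=0.5)   # maps [0, 1] onto [-1, 1]
restored = [round(v * 255) for v in denormalize(normed, mean=0.5, std=0.5)]
```

With `mean = std = 0.5` a black pixel becomes `-1.0` and a white pixel `1.0`, so a mostly dark image such as an MNIST digit has a negative mean after normalization.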

``` python
#Just for visuals
from fastai.vision.core import *
```

``` python
tdl.show_batch((x,y))
```

![](05_data.transforms_files/figure-commonmark/cell-84-output-1.png)

![](05_data.transforms_files/figure-commonmark/cell-84-output-2.png)

![](05_data.transforms_files/figure-commonmark/cell-84-output-3.png)

![](05_data.transforms_files/figure-commonmark/cell-84-output-4.png)
