Splitting Data

The LKPY crossfold module provides support for preparing data sets for cross-validation. Crossfold methods are implemented as functions that operate on data frames and return generators of (train, test) pairs (lenskit.crossfold.TTPair objects). The train and test objects in each pair are also data frames, suitable for evaluation or writing out to a file.

Crossfold methods make minimal assumptions about their input data frames, so the frames can contain ratings, purchases, or any other kind of data. They do assume that each row represents a single data point for the purpose of splitting and sampling.

Experiment code should generally use these functions to prepare train-test files for training and evaluating algorithms. For example, the following will perform a user-based 5-fold cross-validation as was the default in the old LensKit:

import pandas as pd
import lenskit.crossfold as xf

# Load the ratings and rename the ID columns to 'user' and 'item'.
ratings = pd.read_csv('ml-20m/ratings.csv')
ratings = ratings.rename(columns={'userId': 'user', 'movieId': 'item'})
# Split users into 5 partitions, sampling 5 test ratings for each test user.
# The ml-20m.exp directory must already exist.
for i, tp in enumerate(xf.partition_users(ratings, 5, xf.SampleN(5))):
    tp.train.to_csv('ml-20m.exp/train-%d.csv' % (i,))
    tp.train.to_parquet('ml-20m.exp/train-%d.parquet' % (i,))
    tp.test.to_csv('ml-20m.exp/test-%d.csv' % (i,))
    tp.test.to_parquet('ml-20m.exp/test-%d.parquet' % (i,))

Row-based splitting

The simplest preparation methods sample or partition the rows in the input frame. A 5-fold partition_rows() split will result in 5 splits, each of which extracts 20% of the rows for testing and leaves 80% for training.

lenskit.crossfold.partition_rows(data, partitions, *, rng_spec=None)

Partition a frame of ratings or other data into train-test partitions. This function does not care what kind of data is in data, so long as it is a Pandas DataFrame (or equivalent).

Parameters

data (pandas.DataFrame) – Ratings or other data you wish to partition.
partitions (int) – The number of partitions to produce.
rng_spec – The random number generator or seed (see lenskit.util.rng()).

Returns

an iterator of train-test pairs

Return type

iterator
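
To make the 80/20 arithmetic concrete, here is a minimal pandas/NumPy sketch of row partitioning. It is an illustration of the idea only, not LensKit's implementation; the function name partition_rows_sketch, the seed argument, and the toy data are assumptions made for this example.

```python
import numpy as np
import pandas as pd


def partition_rows_sketch(data, partitions, seed=42):
    """Yield (train, test) frames; illustrative only, not LensKit's code."""
    rng = np.random.default_rng(seed)
    # Shuffle the row positions, then cut them into nearly equal chunks.
    positions = rng.permutation(len(data))
    for test_pos in np.array_split(positions, partitions):
        mask = np.zeros(len(data), dtype=bool)
        mask[test_pos] = True
        # Each chunk is the test set once; all other rows are training data.
        yield data[~mask], data[mask]


# Hypothetical toy data: 10 ratings.
ratings = pd.DataFrame({
    'user': [1, 1, 2, 2, 3, 3, 4, 4, 5, 5],
    'item': [10, 11, 10, 12, 11, 13, 12, 14, 13, 14],
    'rating': [4.0, 3.5, 5.0, 2.0, 4.5, 3.0, 1.0, 4.0, 2.5, 5.0],
})

folds = list(partition_rows_sketch(ratings, 5))
for train, test in folds:
    print(len(train), len(test))  # 8 2 for every fold
```

Every row lands in exactly one test set, so the five test sets together cover the whole frame.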

lenskit.crossfold.sample_rows(data, partitions, size, disjoint=True, *, rng_spec=None)

Sample a frame of ratings or other data into train-test partitions. This function does not care what kind of data is in data, so long as it is a Pandas DataFrame (or equivalent).

We can loop over a sequence of train-test pairs:

>>> from lenskit import datasets
>>> from lenskit.crossfold import sample_rows
>>> ratings = datasets.MovieLens('data/ml-latest-small').ratings
>>> for train, test in sample_rows(ratings, 5, 1000):
...     print(len(test))
1000
1000
1000
1000
1000

Sometimes for testing, it is useful to just get a single pair:

>>> train, test = sample_rows(ratings, None, 1000)
>>> len(test)
1000
>>> len(test) + len(train) - len(ratings)
0

Parameters

data (pandas.DataFrame) – Data frame containing ratings or other data to partition.
partitions (int or None) – The number of partitions to produce. If None, produce a single train-test pair instead of an iterator or list.
size (int) – The size of each sample.
disjoint (bool) – If True, force samples to be disjoint.
rng_spec – The random number generator or seed (see lenskit.util.rng()).

Returns

An iterator of train-test pairs.

Return type

iterator
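
The effect of the disjoint flag can be illustrated with a short pandas/NumPy sketch. This is a hedged re-creation of the idea, not LensKit's code: the name sample_rows_sketch and the seed argument are assumptions, and raising ValueError when there are too few rows for disjoint samples is a choice made for this example rather than documented library behavior.

```python
import numpy as np
import pandas as pd


def sample_rows_sketch(data, partitions, size, disjoint=True, seed=42):
    """Yield (train, test) frames with `size` test rows each; illustrative only."""
    rng = np.random.default_rng(seed)
    if disjoint:
        if partitions * size > len(data):
            # Assumption of this sketch: refuse rather than fall back.
            raise ValueError('not enough rows for disjoint samples')
        # One shuffle, then consecutive slices: no row is tested twice.
        shuffled = rng.permutation(len(data))
        samples = [shuffled[i * size:(i + 1) * size] for i in range(partitions)]
    else:
        # Independent draws: a row may appear in several test sets.
        samples = [rng.choice(len(data), size, replace=False)
                   for _ in range(partitions)]
    for test_pos in samples:
        mask = np.zeros(len(data), dtype=bool)
        mask[test_pos] = True
        yield data[~mask], data[mask]


# Hypothetical toy data: 100 rows.
data = pd.DataFrame({'user': np.arange(100) % 10, 'item': np.arange(100)})
pairs = list(sample_rows_sketch(data, 3, 20))
```

Unlike a partition, the test sets here need not cover the whole frame: three disjoint samples of 20 rows leave 40 rows that are never tested.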

User-based splitting

It’s often desirable to use users, instead of raw rows, as the basis for splitting data. This allows you to control the experimental conditions on a user-by-user basis, e.g. by making sure each user is tested with the same number of ratings. These methods require that the input data frame have a user column with the user names or identifiers.

The algorithm used by each is as follows:

Sample or partition the set of user IDs into n sets of test users.
For each set of test users, select a set of each test user's rows to be test rows.
Create a training set for each test set, consisting of the non-selected rows from each of that set's test users along with all rows from each non-test user.
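
The three steps above can be sketched in pandas as follows. This is an illustration under stated assumptions, not the library's implementation: partition_users_sketch, last_two, and the seed argument are hypothetical names, and the sketch assumes the input frame has a unique index so that test rows can be dropped by label.

```python
import numpy as np
import pandas as pd


def partition_users_sketch(data, partitions, select_test, seed=42):
    """Yield (train, test) frames split user-by-user; illustrative only."""
    rng = np.random.default_rng(seed)
    # Step 1: partition the user IDs into disjoint sets of test users.
    users = rng.permutation(data['user'].unique())
    for test_users in np.array_split(users, partitions):
        # Step 2: select test rows from each test user's rows.
        candidates = data[data['user'].isin(test_users)]
        test = pd.concat([select_test(udf)
                          for _, udf in candidates.groupby('user')])
        # Step 3: all rows not selected for testing become training data.
        train = data.drop(index=test.index)
        yield train, test


def last_two(udf):
    # Hypothetical selection rule: the final two rows of a user's frame.
    return udf.iloc[-2:]


# Hypothetical toy data: 10 users with 5 rows each.
ratings = pd.DataFrame({
    'user': np.repeat(np.arange(10), 5),
    'item': np.tile(np.arange(5), 10),
})
folds = list(partition_users_sketch(ratings, 5, last_two))
```

Each fold tests 2 of the 10 users with 2 rows apiece, so every test frame has 4 rows and every training frame has the remaining 46, including the unselected rows of the test users.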

lenskit.crossfold.partition_users(data, partitions: int, method: PartitionMethod, *, rng_spec=None)

Partition a frame of ratings or other data into train-test partitions user-by-user. This function does not care what kind of data is in data, so long as it is a Pandas DataFrame (or equivalent) and has a user column.

Parameters

data (pandas.DataFrame) – A data frame containing ratings or other data you wish to partition.
partitions (int) – The number of partitions to produce.
method (PartitionMethod) – The method for selecting test rows for each user.
rng_spec – The random number generator or seed (see lenskit.util.rng()).

Returns: An iterator of train-test pairs.
Return type: iterator

lenskit.crossfold.sample_users(data, partitions: int, size: int, method: PartitionMethod, disjoint=True, *, rng_spec=None)

Create train-test partitions by sampling users. This function does not care what kind of data is in data, so long as it is a Pandas DataFrame (or equivalent) and has a user column.

Parameters

data (pandas.DataFrame) – Data frame containing ratings or other data you wish to partition.
partitions (int) – The number of partitions.
size (int) – The sample size.
method (PartitionMethod) – The method for obtaining user test ratings.
disjoint (bool) – If True, force the user samples to be disjoint.
rng_spec – The random number generator or seed (see lenskit.util.rng()).

Returns

An iterator of train-test pairs (as TTPair objects).

Return type

iterator

Selecting user test rows

These functions each take a method that decides how to select each user’s test rows. The method is a function that takes a data frame (containing just the user’s rows) and returns the test rows. This function is expected to preserve the index of the input data frame (which happens by default with common means of implementing samples).
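
As a small illustration of the index requirement, the sketch below defines a hypothetical selection function, highest_rated, which is not part of LensKit. It assumes a rating column and shows that the returned rows keep their original index labels, so the remaining rows can be recovered by dropping those labels.

```python
import pandas as pd


def highest_rated(udf, n=1):
    """Hypothetical selection rule: a user's n highest-rated rows."""
    # nlargest returns rows of udf with their original index labels intact.
    return udf.nlargest(n, 'rating')


# One user's rows, with a non-default index to make the labels visible.
user_rows = pd.DataFrame(
    {'user': [7, 7, 7], 'item': [10, 11, 12], 'rating': [3.0, 5.0, 4.0]},
    index=[40, 41, 42],
)
test_rows = highest_rated(user_rows)
train_rows = user_rows.drop(index=test_rows.index)
print(list(test_rows.index))   # [41]
print(list(train_rows.index))  # [40, 42]
```

A function that called reset_index() on its result would lose these labels, and the test rows could no longer be matched back to the input frame.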

We provide several partition method factories:

lenskit.crossfold.SampleN(n, rng_spec=None)

Randomly select a fixed number of test rows per user/item.

Parameters

n (int) – The number of test items to select.
rng_spec – The random number generator or seed (see lenskit.util.rng()).

lenskit.crossfold.SampleFrac(frac, rng_spec=None)

Randomly select a fraction of test rows per user/item.

Parameters

frac (float) – The fraction of items to select for testing.
rng_spec – The random number generator or seed (see lenskit.util.rng()).
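
The difference between a fixed count and a fraction shows up when users have different numbers of rows. The following sketch uses hypothetical helpers sample_n and sample_frac written with NumPy. They are not the LensKit classes, and rounding the fractional size to the nearest integer is an assumption of this example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)


def sample_n(udf, n):
    # Pick n distinct row labels at random; assumes the user has at least n rows.
    chosen = rng.choice(udf.index.to_numpy(), size=n, replace=False)
    return udf.loc[chosen]


def sample_frac(udf, frac):
    # Scale the test size with the number of rows the user has.
    return sample_n(udf, int(round(len(udf) * frac)))


# Hypothetical toy data: user 1 has 10 rows, user 2 has 20.
ratings = pd.DataFrame({
    'user': [1] * 10 + [2] * 20,
    'item': list(range(10)) + list(range(20)),
})
counts = {}
for user, udf in ratings.groupby('user'):
    counts[int(user)] = (len(sample_n(udf, 5)), len(sample_frac(udf, 0.2)))
print(counts)  # {1: (5, 2), 2: (5, 4)}
```

With a fixed count both users contribute 5 test rows, while a fraction of 0.2 gives 2 test rows for the user with 10 rows and 4 for the user with 20.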

lenskit.crossfold.LastN(n, col='timestamp')

Select a fixed number of test rows per user/item, based on ordering by a column.

Parameters

n (int) – The number of test items to select.
col (str) – The column to order by (default: timestamp).

lenskit.crossfold.LastFrac(frac, col='timestamp')

Select a fraction of test rows per user/item, based on ordering by a column.

Parameters

frac (float) – The fraction of items to select for testing.
col (str) – The column to order by (default: timestamp).
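
Ordering-based selection can be sketched as a sort followed by taking the tail. The helper last_n below is a hypothetical stand-in for illustration, not the LensKit class; it assumes that larger values in the ordering column mean later events.

```python
import pandas as pd


def last_n(udf, n, col='timestamp'):
    # Sort by the ordering column and keep the n final rows.
    return udf.sort_values(col).iloc[-n:]


# One user's history, deliberately stored out of time order.
history = pd.DataFrame({
    'user': [3, 3, 3, 3],
    'item': [10, 11, 12, 13],
    'timestamp': [400, 100, 300, 200],
})
test = last_n(history, 2)
train = history.drop(index=test.index)
print(test['item'].tolist())   # [12, 10]
print(train['item'].tolist())  # [11, 13]
```

Every training row for the user is older than every test row, which keeps the split from training on events that happened after the ones being predicted.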

Utility Classes

class lenskit.crossfold.PartitionMethod

Bases: ABC

Partition methods select test rows for a user or item. Partition methods are callable; when called with a data frame, they return the test rows.

abstract __call__(udf)

Subset a data frame.

Parameters: udf (pandas.DataFrame) – The input data frame of rows for a user or item.
Returns: The data frame of test rows, a subset of udf.
Return type: pandas.DataFrame

class lenskit.crossfold.TTPair(train, test)

Bases: tuple

Train-test pair (named tuple).

test: Test data for this pair.

train: Train data for this pair.