Splitting Data
The LKPY crossfold module provides support for preparing data sets for
cross-validation. Crossfold methods are implemented as functions that operate
on data frames and return generators of (train, test) pairs
(lenskit.crossfold.TTPair
objects). The train and test objects
in each pair are also data frames, suitable for evaluation or writing out to
a file.
Crossfold methods make minimal assumptions about their input data frames, so the frames can hold ratings, purchases, or any other kind of record. They do assume that each row represents a single data point for the purpose of splitting and sampling.
Experiment code should generally use these functions to prepare train-test files for training and evaluating algorithms. For example, the following will perform a user-based 5-fold cross-validation as was the default in the old LensKit:
import pandas as pd
import lenskit.crossfold as xf
ratings = pd.read_csv('ml-20m/ratings.csv')
ratings = ratings.rename(columns={'userId': 'user', 'movieId': 'item'})
for i, tp in enumerate(xf.partition_users(ratings, 5, xf.SampleN(5))):
    tp.train.to_csv('ml-20m.exp/train-%d.csv' % (i,))
    tp.train.to_parquet('ml-20m.exp/train-%d.parquet' % (i,))
    tp.test.to_csv('ml-20m.exp/test-%d.csv' % (i,))
    tp.test.to_parquet('ml-20m.exp/test-%d.parquet' % (i,))
Row-based splitting
The simplest preparation methods sample or partition the rows in the input frame.
A 5-fold partition_rows() split produces 5 train-test pairs, each of which holds out 20% of the rows for testing and leaves the other 80% for training.
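To make the 20%/80% arithmetic concrete, here is a minimal sketch of row partitioning in plain Python. It is not LensKit's implementation: the helper name partition_rows_sketch is hypothetical, and a list of dicts stands in for the data frame.

```python
import random

def partition_rows_sketch(rows, partitions, seed=None):
    """Yield (train, test) lists; each row lands in exactly one test set."""
    rng = random.Random(seed)
    order = list(range(len(rows)))
    rng.shuffle(order)
    for k in range(partitions):
        # Every `partitions`-th shuffled position goes to fold k's test set.
        test_idx = set(order[k::partitions])
        train = [r for i, r in enumerate(rows) if i not in test_idx]
        test = [r for i, r in enumerate(rows) if i in test_idx]
        yield train, test

# 50 illustrative rows: 10 users with 5 items each.
rows = [{"user": u, "item": i} for u in range(10) for i in range(5)]
folds = list(partition_rows_sketch(rows, 5, seed=42))
for train, test in folds:
    print(len(train), len(test))  # prints "40 10" for each of the 5 folds
```

Because the test sets are slices of one shuffled ordering, they are pairwise disjoint and together cover every row exactly once.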
- lenskit.crossfold.partition_rows(data, partitions, *, rng_spec=None)
Partition a frame of ratings or other data into train-test partitions. This function does not care what kind of data is in data, so long as it is a Pandas DataFrame (or equivalent).
- Parameters
data (pandas.DataFrame) – Ratings or other data you wish to partition.
partitions (int) – The number of partitions to produce.
rng_spec – The random number generator or seed (see
lenskit.util.rng()
).
- Returns
an iterator of train-test pairs
- Return type
iterator
- lenskit.crossfold.sample_rows(data, partitions, size, disjoint=True, *, rng_spec=None)
Sample a frame of ratings into train-test partitions. This function does not care what kind of data is in data, so long as it is a Pandas DataFrame (or equivalent).
We can loop over a sequence of train-test pairs:
>>> from lenskit import datasets
>>> ratings = datasets.MovieLens('data/ml-latest-small').ratings
>>> for train, test in sample_rows(ratings, 5, 1000):
...     print(len(test))
1000
1000
1000
1000
1000
Sometimes for testing, it is useful to just get a single pair:
>>> train, test = sample_rows(ratings, None, 1000)
>>> len(test)
1000
>>> len(test) + len(train) - len(ratings)
0
- Parameters
data (pandas.DataFrame) – Data frame containing ratings or other data to partition.
partitions (int or None) – The number of partitions to produce. If None, produce a single train-test pair instead of an iterator or list.
size (int) – The size of each sample.
disjoint (bool) – If True, force samples to be disjoint.
rng_spec – The random number generator or seed (see lenskit.util.rng()).
- Returns
An iterator of train-test pairs.
- Return type
iterator
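The effect of the disjoint flag can be illustrated with a small plain-Python sketch. This is an assumption about the general technique, not LensKit's code: the name sample_rows_sketch is hypothetical, and raising an error when there are too few rows for disjoint samples is a choice made for this sketch only.

```python
import random

def sample_rows_sketch(rows, partitions, size, disjoint=True, seed=None):
    """Yield (train, test) lists with `size` test rows per pair."""
    rng = random.Random(seed)
    n = len(rows)
    if disjoint:
        if partitions * size > n:
            # Sketch-only behavior; the library may handle this differently.
            raise ValueError("not enough rows for disjoint samples")
        order = list(range(n))
        rng.shuffle(order)
        # Consecutive chunks of one shuffle can never overlap.
        test_sets = [order[k * size:(k + 1) * size] for k in range(partitions)]
    else:
        # Independent draws: a row may show up in several test sets.
        test_sets = [rng.sample(range(n), size) for _ in range(partitions)]
    for test_idx in test_sets:
        chosen = set(test_idx)
        train = [r for i, r in enumerate(rows) if i not in chosen]
        test = [r for i, r in enumerate(rows) if i in chosen]
        yield train, test

sample_data = [{"row": i} for i in range(100)]
disjoint_pairs = list(sample_rows_sketch(sample_data, 3, 10, seed=7))
for train, test in disjoint_pairs:
    print(len(train), len(test))  # prints "90 10" three times
```

Unlike partitioning, sampling does not have to use every row as test data: here only 30 of the 100 rows ever appear in a test set.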
User-based splitting
It’s often desirable to use users, instead of raw rows, as the basis for splitting data. This allows you to control the experimental conditions on a user-by-user basis, e.g. by making sure each user is tested with the same number of ratings. These methods require that the input data frame have a user column with the user names or identifiers.
The algorithm used by each is as follows:
- Sample or partition the set of user IDs into n sets of test users.
- For each set of test users, select a subset of each test user's rows to be test rows.
- Create a training set for each test set, consisting of the non-selected rows from each of that set's test users along with all rows from each non-test user.
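The three steps above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the library's implementation: partition_users_sketch and its n_test parameter are hypothetical names, rows are dicts with a user key, and test rows are chosen by simple random sampling.

```python
import random
from collections import defaultdict

def partition_users_sketch(rows, partitions, n_test, seed=None):
    """Yield (train, test) lists, splitting user by user."""
    rng = random.Random(seed)
    by_user = defaultdict(list)
    for idx, row in enumerate(rows):
        by_user[row["user"]].append(idx)
    users = sorted(by_user)
    rng.shuffle(users)
    for k in range(partitions):
        # Step 1: pick this fold's test users.
        test_users = users[k::partitions]
        # Step 2: select some of each test user's rows as test rows.
        test_idx = set()
        for user in test_users:
            idxs = by_user[user]
            test_idx.update(rng.sample(idxs, min(n_test, len(idxs))))
        # Step 3: everything else is training data, including the test
        # users' remaining rows and all rows of non-test users.
        train = [r for i, r in enumerate(rows) if i not in test_idx]
        test = [r for i, r in enumerate(rows) if i in test_idx]
        yield train, test

# 10 users with 8 rows each; 5 folds means 2 test users per fold.
user_rows = [{"user": u, "item": i} for u in range(10) for i in range(8)]
user_folds = list(partition_users_sketch(user_rows, 5, n_test=3, seed=1))
for train, test in user_folds:
    print(len(train), len(test))  # prints "74 6" for each fold
```

Note that a test user still contributes rows to the training set, which is what lets an algorithm learn something about that user before being evaluated on the held-out rows.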
- lenskit.crossfold.partition_users(data, partitions: int, method: PartitionMethod, *, rng_spec=None)
Partition a frame of ratings or other data into train-test partitions user-by-user. This function does not care what kind of data is in data, so long as it is a Pandas DataFrame (or equivalent) and has a user column.
- Parameters
data (pandas.DataFrame) – a data frame containing ratings or other data you wish to partition.
partitions (int) – the number of partitions to produce
method (PartitionMethod) – The method for selecting test rows for each user.
rng_spec – The random number generator or seed (see
lenskit.util.rng()
).
- Returns
iterator: an iterator of train-test pairs
- lenskit.crossfold.sample_users(data, partitions: int, size: int, method: PartitionMethod, disjoint=True, *, rng_spec=None)
Create train-test partitions by sampling users. This function does not care what kind of data is in data, so long as it is a Pandas DataFrame (or equivalent) and has a user column.
- Parameters
data (pandas.DataFrame) – Data frame containing ratings or other data you wish to partition.
partitions (int) – The number of partitions.
size (int) – The sample size.
method (PartitionMethod) – The method for obtaining user test ratings.
rng_spec – The random number generator or seed (see
lenskit.util.rng()
).
- Returns
An iterator of train-test pairs (as TTPair objects).
- Return type
iterator
Selecting user test rows
These functions each take a method that decides how to select each user’s test rows. The method is a function that takes a data frame (containing just the user’s rows) and returns the test rows. This function is expected to preserve the index of the input data frame (which happens by default with common means of implementing samples).
We provide several partition method factories:
- lenskit.crossfold.SampleN(n, rng_spec=None)
Randomly select a fixed number of test rows per user/item.
- Parameters
n (int) – the number of test items to select
rng_spec – the random number generator or seed
- lenskit.crossfold.SampleFrac(frac, rng_spec=None)
Randomly select a fraction of test rows per user/item.
- Parameters
frac (float) – the fraction of items to select for testing.
- lenskit.crossfold.LastN(n, col='timestamp')
Select a fixed number of test rows per user/item, based on ordering by a column.
- Parameters
n (int) – The number of test items to select.
- lenskit.crossfold.LastFrac(frac, col='timestamp')
Select a fraction of test rows per user/item, based on ordering by a column.
- Parameters
frac (float) – the fraction of items to select for testing.
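As a rough illustration of what these selection strategies do with a single user's rows, here is a plain-Python sketch. The function names are hypothetical stand-ins rather than the LensKit classes, rows are dicts instead of data frame rows, and rounding the fraction to the nearest whole row is an assumption of this sketch.

```python
import random

def sample_n(user_rows, n, rng):
    """Pick n rows at random (or all rows if the user has fewer than n)."""
    return rng.sample(user_rows, min(n, len(user_rows)))

def last_n(user_rows, n, col="timestamp"):
    """Pick the n rows with the largest values in `col`."""
    ordered = sorted(user_rows, key=lambda r: r[col])
    return ordered[-n:] if n > 0 else []

def last_frac(user_rows, frac, col="timestamp"):
    """Pick the most recent fraction of rows (rounding is an assumption)."""
    return last_n(user_rows, round(len(user_rows) * frac), col)

# One user's rows, with timestamps increasing along with the item ID.
one_user = [{"item": i, "timestamp": 100 + i} for i in range(10)]

print([r["item"] for r in last_n(one_user, 3)])       # [7, 8, 9]
print([r["item"] for r in last_frac(one_user, 0.2)])  # [8, 9]
print(len(sample_n(one_user, 5, random.Random(0))))   # 5
```

The time-ordered variants are useful when the evaluation should mimic predicting a user's most recent activity from their earlier history, while the random variants treat all of a user's rows as interchangeable.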
Utility Classes
- class lenskit.crossfold.PartitionMethod
Bases:
ABC
Partition methods select test rows for a user or item. Partition methods are callable; when called with a data frame, they return the test rows.
- abstract __call__(udf)
Subset a data frame.
- Parameters
udf (pandas.DataFrame) – The input data frame of rows for a user or item.
- Returns
The data frame of test rows, a subset of udf.
- Return type
pandas.DataFrame