Splitting Data#
The splitting module splits data sets for offline evaluation using cross-validation and other strategies. The various splitters are implemented as functions that operate on a Dataset and return one or more train-test splits (as TTSplit objects).
Changed in version 2025.1: Data splitting was moved from lenskit.crossfold to the lenskit.splitting module, and functions were renamed and had their interfaces revised.
Experiment code should generally use these functions to prepare train-test files for training and evaluating algorithms. For example, the following will perform a user-based 5-fold cross-validation as was the default in older versions of LensKit:
from lenskit.data import load_movielens
from lenskit.splitting import crossfold_users, SampleN

dataset = load_movielens('data/ml-20m.zip')
# 5 user-based folds, holding out 5 randomly sampled ratings per test user
for i, tp in enumerate(crossfold_users(dataset, 5, SampleN(5))):
    tp.train_df.to_parquet(f'ml-20m.exp/train-{i}.parquet')
    tp.test_df.to_parquet(f'ml-20m.exp/test-{i}.parquet')
Temporal Splitting#
Global temporal splitting partitions data into train/test splits based on a partitioning timestamp. The split_global_time() function takes a data set and a partition timestamp and splits the data. It can also take a list of partition timestamps for multiple splits, e.g. for a train/valid/test split.
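To make the idea of a global cutoff concrete, here is a minimal standard-library sketch. It is not the split_global_time() implementation: the function name split_by_time, the (user, item, timestamp) tuple layout, and the choice to place a record that falls exactly on a cutoff in the later partition are all assumptions made for illustration.

```python
from bisect import bisect_right


def split_by_time(records, cutoffs):
    """Partition (user, item, timestamp) records by global cutoff timestamps.

    Hypothetical helper: with k cutoffs it returns k + 1 lists, ordered
    from the earliest time window to the latest.
    """
    cutoffs = sorted(cutoffs)
    parts = [[] for _ in range(len(cutoffs) + 1)]
    for rec in records:
        # bisect_right counts how many cutoffs are <= the record's timestamp,
        # so a record exactly on a cutoff goes to the later partition
        # (an assumption of this sketch).
        parts[bisect_right(cutoffs, rec[2])].append(rec)
    return parts


records = [
    ("u1", "i1", 10), ("u1", "i2", 20), ("u2", "i1", 30),
    ("u2", "i3", 40), ("u3", "i2", 50),
]
# Two cutoffs give a train/valid/test split.
train, valid, test = split_by_time(records, [30, 45])
```

Passing a single cutoff, e.g. split_by_time(records, [30]), yields a plain train/test pair in the same way.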
Record-based Random Splitting#
The simplest preparation methods sample or partition the records in the input data. A 5-fold crossfold_records() split will result in 5 splits, each of which extracts 20% of the user-item interaction records for testing and leaves 80% for training. There are two record-based random splitting functions:
crossfold_records() partitions ratings or interactions into n equal-sized splits (5 in the example above).
sample_records() produces 1 or more disjoint samples of the ratings for testing.
Note
When a dataset has repeated interactions, these functions operate only on the matrix view of the data (user-item observations are deduplicated). Specifically, they operate on the results of calling interaction_matrix() with format="pandas" and field="all".
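The following sketch illustrates record-based cross-folding on plain Python tuples, including the deduplication step described in the note. It is an illustration of the concept only, not how crossfold_records() is implemented; the function name crossfold_record_sketch, the seed parameter, and the (user, item) tuple layout are hypothetical.

```python
import random


def crossfold_record_sketch(records, n_folds, seed=42):
    """Yield (train, test) pairs over deduplicated (user, item) records."""
    # Deduplicate repeated interactions first, mirroring the "matrix view".
    unique = sorted(set(records))
    rng = random.Random(seed)
    rng.shuffle(unique)
    for k in range(n_folds):
        # Every n_folds-th record, starting at offset k, is test data
        # for fold k, so the test sets are disjoint and cover everything.
        test = unique[k::n_folds]
        held_out = set(test)
        train = [r for r in unique if r not in held_out]
        yield train, test


records = [(f"u{u}", f"i{i}") for u in range(5) for i in range(4)]
records.append(("u0", "i0"))  # a repeated interaction, dropped by dedup
folds = list(crossfold_record_sketch(records, 5))
```

With 20 distinct records and 5 folds, each fold holds out 4 records (20%) and trains on the remaining 16 (80%).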
User-based Splitting#
It’s often desirable to use users, instead of raw rows, as the basis for splitting data. This allows you to control the experimental conditions on a user-by-user basis, e.g. by making sure each user is tested with the same number of ratings. These methods require that the input data frame have a user column with the user names or identifiers.
The algorithm used by each is as follows:
Sample or partition the set of user IDs into n sets of test users.
For each set of test users, select a subset of each user’s rows to be test rows.
Create a training set for each test set consisting of the non-selected rows from each of that set’s test users, along with all rows from each non-test user.
As with record-based splitting, there are both cross-folding (partition all users into disjoint sets) and sampling (compute one or more disjoint sets of test users).
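The user-based procedure described above can be sketched with the standard library as follows. This is a simplified illustration under stated assumptions, not the crossfold_users() implementation: the function name crossfold_users_sketch, its parameters, and the use of unique (user, item) tuples as rows are all hypothetical, and the holdout is hard-coded to a random sample of n_test rows.

```python
import random
from collections import defaultdict


def crossfold_users_sketch(rows, n_folds, n_test, seed=42):
    """Yield (train, test) pairs using user-based cross-folding."""
    rng = random.Random(seed)
    by_user = defaultdict(list)
    for user, item in rows:
        by_user[user].append((user, item))
    users = sorted(by_user)
    rng.shuffle(users)
    for k in range(n_folds):
        # Step 1: partition the users; fold k tests every n_folds-th user.
        test_users = users[k::n_folds]
        test = []
        for u in test_users:
            # Step 2: hold out up to n_test rows for each test user.
            user_rows = by_user[u]
            test.extend(rng.sample(user_rows, min(n_test, len(user_rows))))
        held_out = set(test)
        # Step 3: train on the test users' remaining rows plus all rows
        # from the non-test users.
        train = [r for r in rows if r not in held_out]
        yield train, test


rows = [(f"u{u}", f"i{i}") for u in range(10) for i in range(6)]
splits = list(crossfold_users_sketch(rows, 5, 2))
```

Each of the 5 folds here tests 2 of the 10 users with 2 held-out rows apiece, so every test user is evaluated on the same number of ratings.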
Selecting user holdout rows#
User-based splitting requires a mechanism to split a test user’s interactions into the actual test data and the training or query data for that user. The user-based splitting functions therefore take a holdout method (the method parameter) to do that partitioning. The method is just a callable that takes an item list of the user’s interactions and returns the test interactions.
We provide several holdout implementations, implemented as classes that take the holdout’s configuration (e.g. the number of test ratings per user) and return callable objects to do the holdout:
SampleN: Randomly select a fixed number of test rows per user/item.
SampleFrac: Randomly select a fraction of test rows per user/item.
LastN: Select a fixed number of test rows per user/item, based on ordering by a field.
LastFrac: Select a fraction of test rows per user/item.
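The "configured class that returns a callable" pattern can be sketched as below. These are hypothetical stand-ins written for illustration, not the library's holdout classes: the names SampleNSketch and LastNSketch, the seed and field parameters, and the use of a list of dicts in place of an item list are all assumptions.

```python
import random


class SampleNSketch:
    """Hypothetical random holdout: pick up to n interactions at random."""

    def __init__(self, n, seed=None):
        self.n = n
        self.rng = random.Random(seed)

    def __call__(self, items):
        # Never ask for more rows than the user actually has.
        return self.rng.sample(items, min(self.n, len(items)))


class LastNSketch:
    """Hypothetical ordered holdout: pick the last n interactions by a field."""

    def __init__(self, n, field="timestamp"):
        self.n = n
        self.field = field

    def __call__(self, items):
        ordered = sorted(items, key=lambda row: row[self.field])
        return ordered[-self.n:] if self.n > 0 else []


# One user's interactions, as plain dicts (an assumption of this sketch).
history = [
    {"item": "i1", "timestamp": 30},
    {"item": "i2", "timestamp": 10},
    {"item": "i3", "timestamp": 20},
]
last_two = LastNSketch(2)(history)
random_one = SampleNSketch(1, seed=0)(history)
```

Because the configuration lives in the constructor, the same configured object can be applied to every test user's interactions in turn, which is what lets a splitting function accept it as a single method argument.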