Crossfold preparation¶
The LKPY crossfold module provides support for preparing data sets for cross-validation. Crossfold methods are implemented as functions that operate on data frames and return generators of (train, test) pairs (lenskit.crossfold.TTPair objects). The train and test objects in each pair are also data frames, suitable for evaluation or writing out to a file.
Crossfold methods make minimal assumptions about their input data frames, so the frames can hold ratings, purchases, or any other kind of record. They do assume that each row represents a single data point for the purpose of splitting and sampling.
Experiment code should generally use these functions to prepare train-test files for training and evaluating algorithms. For example, the following will perform a user-based 5-fold cross-validation as was the default in the old LensKit:
import pandas as pd
import lenskit.crossfold as xf

ratings = pd.read_csv('ml-20m/ratings.csv')
# The user-based crossfold functions expect a 'user' column
ratings = ratings.rename(columns={'userId': 'user', 'movieId': 'item'})
# Write each train-test pair in both CSV and Parquet form;
# the ml-20m.exp directory must already exist
for i, tp in enumerate(xf.partition_users(ratings, 5, xf.SampleN(5))):
    tp.train.to_csv('ml-20m.exp/train-%d.csv' % (i,))
    tp.train.to_parquet('ml-20m.exp/train-%d.parquet' % (i,))
    tp.test.to_csv('ml-20m.exp/test-%d.csv' % (i,))
    tp.test.to_parquet('ml-20m.exp/test-%d.parquet' % (i,))
Row-based splitting¶
The simplest preparation methods sample or partition the rows in the input frame. A 5-fold partition_rows() split produces 5 train-test pairs, each of which holds out 20% of the rows for testing and leaves the other 80% for training.
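The following is a minimal sketch of the idea behind a row partition, written in plain pandas and NumPy. It is not LensKit's implementation; the name partition_rows_sketch and the seed parameter are made up for illustration.

```python
import numpy as np
import pandas as pd


def partition_rows_sketch(data, partitions, seed=42):
    """Yield (train, test) frames; every row is a test row in exactly one fold.

    Illustrative only: this is not the LensKit partition_rows function.
    """
    rng = np.random.default_rng(seed)
    # Shuffle the row positions, then cut them into nearly equal chunks.
    positions = rng.permutation(len(data))
    for test_pos in np.array_split(positions, partitions):
        mask = np.zeros(len(data), dtype=bool)
        mask[test_pos] = True
        # Rows in the chunk are held out; all other rows are training data.
        yield data[~mask], data[mask]


# Toy frame: 10 users x 10 items = 100 rows
ratings = pd.DataFrame({
    'user': np.repeat(np.arange(10), 10),
    'item': np.tile(np.arange(10), 10),
    'rating': 3.0,
})
folds = list(partition_rows_sketch(ratings, 5))
for train, test in folds:
    print(len(train), len(test))  # 80 20 for every fold
```

With 100 rows and 5 partitions, each fold tests on 20 rows and trains on the remaining 80, and the five test sets together cover every row once.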
-
lenskit.crossfold.
partition_rows
(data, partitions)¶ Partition a frame of ratings or other data into train-test partitions. This function does not care what kind of data is in data, so long as it is a Pandas DataFrame (or equivalent).
Parameters: - data (pandas.DataFrame or equivalent) – a data frame containing ratings or other data you wish to partition.
- partitions (integer) – the number of partitions to produce
Return type: iterator
Returns: an iterator of train-test pairs
-
lenskit.crossfold.
sample_rows
(data, partitions, size, disjoint=True)¶ Sample a frame of ratings into train-test partitions. This function does not care what kind of data is in data, so long as it is a Pandas DataFrame (or equivalent).
We can loop over a sequence of train-test pairs:
>>> ratings = util.load_ml_ratings()
>>> for train, test in sample_rows(ratings, 5, 1000):
...     print(len(test))
1000
1000
1000
1000
1000
Sometimes for testing, it is useful to just get a single pair:
>>> train, test = sample_rows(ratings, None, 1000)
>>> len(test)
1000
>>> len(test) + len(train) - len(ratings)
0
Parameters: - data (pandas.DataFrame) – Data frame containing ratings or other data to partition.
- partitions (int or None) – The number of partitions to produce. If None, produce a single train-test pair instead of an iterator or list.
- size (int) – The size of each sample.
- disjoint (bool) – If True, force samples to be disjoint.
Returns: An iterator of train-test pairs.
Return type: iterator
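To illustrate what the disjoint flag could mean in practice, here is a rough sketch in pandas and NumPy. It assumes that "disjoint" means no row appears in more than one test sample; the function name sample_rows_sketch and the seed parameter are hypothetical, and this is not LensKit's code.

```python
import numpy as np
import pandas as pd


def sample_rows_sketch(data, partitions, size, disjoint=True, seed=42):
    """Yield (train, test) frames with `size` test rows each (illustrative only)."""
    rng = np.random.default_rng(seed)
    n = len(data)
    if disjoint:
        if partitions * size > n:
            raise ValueError('not enough rows for disjoint samples')
        # One shuffle, then consecutive slices: no row is reused across test sets.
        shuffled = rng.permutation(n)
        test_sets = [shuffled[i * size:(i + 1) * size] for i in range(partitions)]
    else:
        # Independent draws: a row may land in more than one test set.
        test_sets = [rng.choice(n, size=size, replace=False)
                     for _ in range(partitions)]
    for test_pos in test_sets:
        mask = np.zeros(n, dtype=bool)
        mask[test_pos] = True
        yield data[~mask], data[mask]


ratings = pd.DataFrame({
    'user': np.repeat(np.arange(10), 10),
    'item': np.tile(np.arange(10), 10),
    'rating': 3.0,
})
pairs = list(sample_rows_sketch(ratings, 3, 10))
```

Unlike a partition, a sample does not need to cover the whole frame: here only 30 of the 100 rows are ever used for testing.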
User-based splitting¶
It’s often desirable to use users, instead of raw rows, as the basis for splitting data. This allows you to control the experimental conditions on a user-by-user basis, e.g. by making sure each user is tested with the same number of ratings. These methods require that the input data frame have a user column with the user names or identifiers.
The algorithm used by each is as follows:
- Sample or partition the set of user IDs into n sets of test users.
- For each set of test users, select a subset of each user’s rows to be test rows.
- Create a training set for each test set, consisting of the non-selected rows from each of that set’s test users along with all rows from each non-test user.
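The three steps above can be sketched in plain pandas as follows. This is an illustration under stated assumptions rather than LensKit's implementation: the name partition_users_sketch is invented, test rows are chosen by random sampling, the frame has a unique index, and every user has at least n_test rows.

```python
import numpy as np
import pandas as pd


def partition_users_sketch(data, partitions, n_test, seed=42):
    """Yield (train, test) frames split user-by-user (illustrative only)."""
    rng = np.random.default_rng(seed)
    users = rng.permutation(data['user'].unique())
    # Step 1: partition the user IDs into sets of test users.
    for test_users in np.array_split(users, partitions):
        candidates = data[data['user'].isin(test_users)]
        # Step 2: select test rows for each test user (n_test random rows here).
        test = candidates.groupby('user', group_keys=False).sample(
            n=n_test, random_state=seed)
        # Step 3: training data is every row not selected for testing, i.e. the
        # remaining rows of the test users plus all rows of the other users.
        train = data.drop(test.index)
        yield train, test


ratings = pd.DataFrame({
    'user': np.repeat(np.arange(10), 10),
    'item': np.tile(np.arange(10), 10),
    'rating': 3.0,
})
folds = list(partition_users_sketch(ratings, 5, 2))
for train, test in folds:
    print(len(train), len(test))  # 96 4 for every fold
```

Each fold has 2 test users contributing 2 test rows apiece, so every user is tested with the same number of rows, which is the experimental control this style of splitting provides.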
-
lenskit.crossfold.
partition_users
(data, partitions: int, method: lenskit.crossfold.PartitionMethod)¶ Partition a frame of ratings or other data into train-test partitions user-by-user. This function does not care what kind of data is in data, so long as it is a Pandas DataFrame (or equivalent) and has a user column.
Parameters: - data (pandas.DataFrame or equivalent) – a data frame containing ratings or other data you wish to partition.
- partitions (integer) – the number of partitions to produce
- method – The method for selecting test rows for each user.
Return type: iterator
Returns: an iterator of train-test pairs
-
lenskit.crossfold.
sample_users
(data, partitions: int, size: int, method: lenskit.crossfold.PartitionMethod, disjoint=True)¶ Create train-test partitions by sampling users. This function does not care what kind of data is in data, so long as it is a Pandas DataFrame (or equivalent) and has a user column.
Parameters: - data (pandas.DataFrame) – Data frame containing ratings or other data you wish to partition.
- partitions (int) – The number of partitions.
- size (int) – The sample size.
- method (PartitionMethod) – The method for obtaining user test ratings.
Returns: An iterator of train-test pairs (as TTPair objects).
Return type: iterator
Selecting user test rows¶
These functions each take a method that decides how to select each user’s test rows. The method is a function that takes a data frame (containing just the user’s rows) and returns the test rows. This function is expected to preserve the index of the input data frame (which happens by default with common means of implementing samples).
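As an illustration of that contract, the sketch below defines a hypothetical selection function that holds out each user's highest-rated rows. The names top_rated_n and select, and the rating column, are assumptions for the example and not part of LensKit; whether a plain function is accepted wherever a PartitionMethod is annotated has not been verified here.

```python
import pandas as pd


def top_rated_n(n):
    """Hypothetical factory: hold out a user's n highest-rated rows."""
    def select(udf):
        # nlargest returns rows with their original index labels intact,
        # so the caller can tell which rows of the full frame were chosen.
        return udf.nlargest(n, 'rating')
    return select


# Rows for a single user, with non-default index labels
user_rows = pd.DataFrame(
    {'user': 7, 'item': [10, 11, 12, 13], 'rating': [3.0, 5.0, 4.0, 1.0]},
    index=[100, 101, 102, 103],
)
test_rows = top_rated_n(2)(user_rows)
print(test_rows.index.tolist())  # [101, 102]
```

Because the returned frame keeps the labels 101 and 102, the training rows can be recovered by dropping those labels from the original frame.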
We provide several partition method factories:
-
lenskit.crossfold.
SampleN
(n)¶ Randomly select a fixed number of test rows per user/item.
Parameters: n – The number of test items to select.
-
lenskit.crossfold.
SampleFrac
(frac)¶ Randomly select a fraction of test rows per user/item.
Parameters: frac – the fraction of items to select for testing.
-
lenskit.crossfold.
LastN
(n, col='timestamp')¶ Select a fixed number of test rows per user/item, based on ordering by a column.
Parameters: - n – The number of test items to select.
- col – The column to sort by.
-
lenskit.crossfold.
LastFrac
(frac, col='timestamp')¶ Select a fraction of test rows per user/item, based on ordering by a column.
Parameters: - frac – the fraction of items to select for testing.
- col – The column to sort by.
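For intuition, the four selection strategies correspond roughly to the plain pandas operations below, applied to one user's rows. These are assumed approximations, not LensKit's code; in particular, how LastFrac rounds the row count is a guess made for this sketch.

```python
import pandas as pd

# One user's rows, with non-default index labels
udf = pd.DataFrame({
    'item': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'timestamp': [50, 20, 90, 10, 70, 30, 100, 40, 80, 60],
}, index=range(200, 210))

# Like SampleN(3): a fixed number of random rows
sample_n = udf.sample(n=3, random_state=0)

# Like SampleFrac(0.2): a random fraction of the rows
sample_frac = udf.sample(frac=0.2, random_state=0)

# Like LastN(3): the 3 rows with the largest timestamp values
last_n = udf.sort_values('timestamp').iloc[-3:]

# Like LastFrac(0.2): the most recent 20% of rows (rounding is assumed)
last_frac_count = round(len(udf) * 0.2)
last_frac = udf.sort_values('timestamp').iloc[-last_frac_count:]

print(last_n['item'].tolist())     # [9, 3, 7]
print(last_frac['item'].tolist())  # [3, 7]
```

All four results keep the index labels of udf, which is the index-preservation property the selection functions are expected to have.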
Utility Classes¶
-
class
lenskit.crossfold.
PartitionMethod
¶ Partition methods select test rows for a user or item. Partition methods are callable; when called with a data frame, they return the test rows.
-
__call__
(udf)¶ Subset a data frame.
Parameters: udf – The input data frame of rows for a user or item.
Returns: The data frame of test rows, a subset of udf.