lenskit.splitting.sample_records

lenskit.splitting.sample_records(data: Dataset, size: int, *, disjoint: bool = True, test_only: bool = False, rng: RNGInput = None, repeats: None = None) -> TTSplit
lenskit.splitting.sample_records(data: Dataset, size: int, *, repeats: int, disjoint: bool = True, test_only: bool = False, rng: RNGInput = None) -> Iterator[TTSplit]

Sample a frame of ratings into train-test partitions. This function does not care what kind of data is in data, so long as it is a Pandas DataFrame (or equivalent).

We can loop over a sequence of train-test pairs:

>>> from lenskit.data import load_movielens
>>> from lenskit.splitting import sample_records
>>> movielens = load_movielens('data/ml-latest-small')
>>> for split in sample_records(movielens, 1000, repeats=5):
...     print(sum(len(il) for il in split.test.lists()))
1000
1000
1000
1000
1000

Sometimes for testing, it is useful to just get a single pair:

>>> split = sample_records(movielens, 1000)
>>> sum(len(il) for il in split.test.lists())
1000

Parameters:
  • data – The data set to split.

  • size – The size of each test sample.

  • repeats – The number of data splits to produce. If None, produce a single train-test pair instead of an iterator.

  • disjoint – If True, force test samples to be disjoint, so that no record appears in more than one test sample.

  • test_only – If True, returns splits with empty training sets (useful when you just want to save the test data).

  • rng – The random number generator or seed (see Random Seeds).
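
To make the effect of size, repeats, disjoint, and test_only concrete, the following is a minimal standard-library sketch that samples record indices rather than a real data set. It is not LensKit's implementation: the helper name sample_index_splits, its seed argument, and the choice to raise an error when disjoint samples do not fit are all assumptions made for illustration.

```python
import random


def sample_index_splits(n_records, size, repeats, *, disjoint=True,
                        test_only=False, seed=None):
    """Hypothetical helper: yield (train, test) lists of record indices."""
    rng = random.Random(seed)
    if disjoint:
        # Assumption of this sketch: refuse to run if disjoint samples cannot fit.
        if size * repeats > n_records:
            raise ValueError("not enough records for disjoint test samples")
        # Shuffle once, then cut consecutive chunks so no index repeats.
        order = list(range(n_records))
        rng.shuffle(order)
    for i in range(repeats):
        if disjoint:
            test = order[i * size:(i + 1) * size]
        else:
            # Independent draws: different test samples may overlap.
            test = rng.sample(range(n_records), size)
        held_out = set(test)
        # With test_only, skip building the training side entirely.
        train = [] if test_only else [
            j for j in range(n_records) if j not in held_out
        ]
        yield train, test


splits = list(sample_index_splits(100, 10, 5, seed=42))
all_test = [j for _, test in splits for j in test]
test_only_splits = list(sample_index_splits(100, 10, 5, test_only=True, seed=42))
```

In this sketch each test sample holds exactly size indices, the disjoint test samples share no index, and every training list contains the records left out of its own test sample (or nothing when test_only is set).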

Returns:

A train-test pair or iterator of such pairs (depending on repeats).
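
A return type that depends on repeats can be expressed with typing.overload, which is the pattern the two signatures suggest. The sketch below is a stand-in under stated assumptions, not the library's code: sample_ids is a hypothetical function that samples plain integers, and its seed argument is invented for the example.

```python
import random
from typing import Iterator, Optional, overload


@overload
def sample_ids(ids: list[int], size: int, *, repeats: None = None,
               seed: Optional[int] = None) -> list[int]: ...
@overload
def sample_ids(ids: list[int], size: int, *, repeats: int,
               seed: Optional[int] = None) -> Iterator[list[int]]: ...


def sample_ids(ids, size, *, repeats=None, seed=None):
    """Hypothetical stand-in: one sample if repeats is None, else an iterator."""
    rng = random.Random(seed)
    if repeats is None:
        # No repeats requested: return a single sample directly.
        return rng.sample(ids, size)
    # Otherwise return a lazy iterator that produces `repeats` samples.
    return (rng.sample(ids, size) for _ in range(repeats))


single = sample_ids(list(range(20)), 5, seed=1)
many = sample_ids(list(range(20)), 5, repeats=3, seed=1)
many_as_list = list(many)
```

Callers that pass repeats should iterate over the result (or wrap it in list), while callers that omit it can use the single result directly.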