Data Utilities

These are general-purpose data processing utilities.

Building Ratings Matrices

lenskit.data.sparse_ratings(ratings, scipy=False, *, users=None, items=None)

Convert a rating table to a sparse matrix of ratings.

Parameters:
  • ratings (pandas.DataFrame) – a data table of (user, item, rating) triples.

  • scipy (bool or str) – if True or 'csr', return a scipy.sparse.csr_matrix instead of a CSR; if 'coo', return a scipy.sparse.coo_matrix.

  • users (pandas.Index) – an index of user IDs.

  • items (pandas.Index) – an index of item IDs.

Returns:

a named tuple containing the sparse matrix, user index, and item index.

Return type:

RatingMatrix
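
The conversion described above can be pictured with a minimal sketch. It uses only the Python standard library and is not LensKit's implementation; the helper name triples_to_csr and the sample data are hypothetical. It assigns each user and item ID a row or column position, then builds the three arrays of a CSR layout (row pointers, column indices, values).

```python
def triples_to_csr(triples):
    """Build CSR arrays from (user, item, rating) triples (illustrative only)."""
    # Assign each distinct ID a position, in sorted order.
    users = sorted({u for u, _, _ in triples})
    items = sorted({i for _, i, _ in triples})
    user_pos = {u: r for r, u in enumerate(users)}
    item_pos = {i: c for c, i in enumerate(items)}

    # Group the entries by row (user).
    rows = [[] for _ in users]
    for u, i, rating in triples:
        rows[user_pos[u]].append((item_pos[i], rating))

    # Flatten the rows into the CSR arrays; rowptrs[r]:rowptrs[r + 1]
    # delimits the entries that belong to row r.
    rowptrs, colinds, values = [0], [], []
    for row in rows:
        for col, rating in sorted(row):
            colinds.append(col)
            values.append(rating)
        rowptrs.append(len(colinds))
    return (rowptrs, colinds, values), users, items


# Hypothetical ratings: three users, three items, four ratings.
triples = [(10, "a", 4.0), (10, "c", 3.5), (20, "b", 5.0), (30, "a", 2.0)]
(rowptrs, colinds, values), users, items = triples_to_csr(triples)
```

In this sketch the users and items lists play the role of the returned indices: the position of an ID in the list is its row or column number in the matrix.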

class lenskit.data.RatingMatrix(matrix, users, items)

Bases: tuple

A rating matrix with associated indices.

matrix

The rating matrix, with users on rows and items on columns.

Type:

CSR or scipy.sparse.csr_matrix

users

mapping from user IDs to row numbers.

Type:

pandas.Index

items

mapping from item IDs to column numbers.

Type:

pandas.Index

items

Alias for field number 2

matrix

Alias for field number 0

users

Alias for field number 1
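
Because the class is a tuple with named fields, its contents can be read by name, by position, or by unpacking. The sketch below illustrates this with a hypothetical stand-in built from collections.namedtuple; RatingMatrixSketch and its placeholder values are assumptions for illustration, not the LensKit class itself.

```python
from collections import namedtuple

# Hypothetical stand-in with the same field order as the documented class.
RatingMatrixSketch = namedtuple("RatingMatrixSketch", ["matrix", "users", "items"])

# Plain lists are used as placeholders for the matrix and the two indices.
rm = RatingMatrixSketch(
    matrix=[[4.0, 0.0], [0.0, 5.0]],
    users=[10, 20],
    items=["a", "b"],
)

# Access by field name or by field number (matrix=0, users=1, items=2).
same_object = rm.users is rm[1]

# Tuple unpacking follows the same order.
matrix, users, items = rm
```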

Sampling Utilities

The lenskit.data.sampling module provides support functions for various data sampling procedures for use in model training.

lenskit.data.sampling.neg_sample(mat, uv, sample)

Sample negative examples from a user-item matrix. For each user in uv, it uses rejection sampling to draw an item that the user has not rated.

While this is embarrassingly parallel, we do not parallelize it because it is often called from code that is already running in parallel.

This returns both the items and the sample counts for debugging:

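neg_items, counts = neg_sample(matrix, users, sample_unweighted)

Parameters:
  • mat – the user-item matrix.

  • uv – the users for which to sample negative items.

  • sample – the candidate sampling function, such as sample_unweighted() or sample_weighted().

Returns: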

Two arrays:

  1. The sampled negative item IDs.

  2. An array of sample counts, the number of samples required to sample each item. This is useful for diagnosing sample inefficiency.

Return type:

numpy.ndarray, numpy.ndarray
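
The rejection-sampling idea can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions, not LensKit's code: the function name neg_sample_sketch, the dict-of-sets representation of the rated items, and the sampler signature are all hypothetical and differ from the real neg_sample() arguments.

```python
import random


def neg_sample_sketch(rated, users, sample, rng):
    """For each user, draw candidates until one is unrated (illustrative only).

    rated:  dict mapping each user to the set of item positions they rated.
    users:  sequence of users to sample for (repeats are allowed).
    sample: callable taking rng and returning one candidate item position.
    """
    neg_items, counts = [], []
    for u in users:
        tries = 0
        while True:
            tries += 1
            item = sample(rng)
            if item not in rated[u]:
                break  # accept the first candidate the user has not rated
        neg_items.append(item)
        counts.append(tries)
    return neg_items, counts


# Hypothetical data: 5 items; user 0 rated three of them, user 1 rated one.
n_items = 5
rated = {0: {0, 1, 2}, 1: {4}}
rng = random.Random(42)
neg_items, counts = neg_sample_sketch(
    rated, [0, 1, 0], lambda r: r.randrange(n_items), rng
)
```

A high count for a user means many candidates were rejected before an unrated item was found, which is the inefficiency the sample counts are meant to expose. Note that this sketch loops forever for a user who has rated every item.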

lenskit.data.sampling.sample_unweighted(mat)

Candidate sampling function for use with neg_sample(). It samples items uniformly at random.

lenskit.data.sampling.sample_weighted(mat)

Candidate sampling function for use with neg_sample(). It samples items proportionally to their popularity.
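
The difference between the two sampling strategies can be illustrated with a small standard-library sketch. The function names, the colinds list, and the data are hypothetical, and this is not necessarily how LensKit implements its samplers. The sketch relies on one fact: picking a stored rating uniformly at random and returning its item selects each item with probability proportional to its number of ratings.

```python
import random
from collections import Counter

# Column (item) index of each stored rating, one entry per rating.
# Hypothetical data: item 0 has 3 ratings, item 1 has 1, item 2 has 2.
colinds = [0, 0, 0, 1, 2, 2]
n_items = 4  # item 3 exists but has no ratings


def uniform_candidate(rng):
    # Every item is equally likely, regardless of how often it was rated.
    return rng.randrange(n_items)


def popularity_candidate(rng):
    # Choosing a random stored rating weights each item by its rating count.
    return colinds[rng.randrange(len(colinds))]


rng = random.Random(0)
uniform_counts = Counter(uniform_candidate(rng) for _ in range(6000))
popular_counts = Counter(popularity_candidate(rng) for _ in range(6000))
```

In this sketch the popularity-based sampler never proposes item 3, because an item with no ratings has zero weight, while the uniform sampler proposes all four items at similar rates.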