Data Utilities
These are general-purpose data processing utilities.
Building Ratings Matrices
- lenskit.data.sparse_ratings(ratings, scipy=False, *, users=None, items=None)
Convert a rating table to a sparse matrix of ratings.
- Parameters:
ratings (pandas.DataFrame) – a data table of (user, item, rating) triples.
scipy (bool) – if True or 'csr', return a SciPy csr matrix instead of CSR. If 'coo', return a SciPy coo matrix.
users (pandas.Index) – an index of user IDs.
items (pandas.Index) – an index of item IDs.
- Returns:
a named tuple containing the sparse matrix, user index, and item index.
- Return type:
- class lenskit.data.RatingMatrix(matrix, users, items)
Bases:
tuple
A rating matrix with associated indices.
- matrix
The rating matrix, with users on rows and items on columns.
- Type:
CSR or scipy.sparse.csr_matrix
- users
mapping from user IDs to row numbers.
- Type:
- items
mapping from item IDs to column numbers.
- Type:
- items
Alias for field number 2
- matrix
Alias for field number 0
- users
Alias for field number 1
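To make the layout of a user-by-item rating matrix concrete, the following is a minimal sketch that builds compressed sparse row (CSR) arrays from (user, item, rating) triples using only the Python standard library. It is not the library's implementation: `build_csr` and `SketchMatrix` are hypothetical names introduced for illustration, and the sketch assumes each (user, item) pair appears at most once.

```python
from collections import namedtuple

# Hypothetical stand-in for a (matrix, users, items) named tuple.
SketchMatrix = namedtuple("SketchMatrix", ["matrix", "users", "items"])


def build_csr(triples):
    """Build CSR arrays (rowptrs, colinds, values) from (user, item, rating) triples."""
    # Map raw IDs to consecutive row / column numbers (sorted for determinism).
    users = sorted({u for u, _, _ in triples})
    items = sorted({i for _, i, _ in triples})
    row_of = {u: r for r, u in enumerate(users)}
    col_of = {i: c for c, i in enumerate(items)}

    # Sort entries by (row, column) so that each row's entries are contiguous.
    entries = sorted((row_of[u], col_of[i], r) for u, i, r in triples)

    # Count entries per row, then accumulate so that
    # rowptrs[r]:rowptrs[r + 1] delimits the entries of row r.
    rowptrs = [0] * (len(users) + 1)
    for row, _, _ in entries:
        rowptrs[row + 1] += 1
    for r in range(len(users)):
        rowptrs[r + 1] += rowptrs[r]

    colinds = [c for _, c, _ in entries]
    values = [v for _, _, v in entries]
    return SketchMatrix((rowptrs, colinds, values), users, items)


triples = [(10, "a", 4.0), (10, "c", 3.5), (20, "b", 5.0), (30, "a", 2.0)]
rm = build_csr(triples)
rowptrs, colinds, values = rm.matrix
```

The position of an ID in `rm.users` or `rm.items` is its row or column number, which mirrors the role the user and item indices play in the returned tuple.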
Sampling Utilities
The lenskit.data.sampling module provides support functions for various data sampling procedures for use in model training.
- lenskit.data.sampling.neg_sample(mat, uv, sample)
Sample negative examples from a user-item matrix. For each user in uv, it samples an item that they have not rated, using rejection sampling.
While this is embarrassingly parallel, we do not parallelize it because it is often called from code that is already running in parallel.
This returns both the items and the sample counts for debugging:
neg_items, counts = neg_sample(matrix, users, sample_unweighted)
- Parameters:
mat (csr.CSR) – The user-item matrix. Its values are ignored and do not need to be present.
uv (numpy.ndarray) – An array of user IDs.
sample (function) – A sampling function to sample candidate negative items. Should be one of sample_weighted() or sample_unweighted().
- Returns:
Two arrays:
The sampled negative item IDs.
An array of sample counts, the number of samples required to sample each item. This is useful for diagnosing sample inefficiency.
- Return type:
- lenskit.data.sampling.sample_unweighted(mat)
Candidate sampling function for use with neg_sample(). It samples items uniformly at random.
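The rejection-sampling idea described above can be sketched in plain Python. This is an illustration under stated assumptions, not the library's code: `neg_sample_sketch` and `sample_uniform` are hypothetical names, the matrix is represented by bare CSR `rowptrs` and `colinds` lists, and users are given as row numbers.

```python
import random


def sample_uniform(n_items, rng):
    """Hypothetical candidate sampler: pick a column number uniformly at random."""
    return rng.randrange(n_items)


def neg_sample_sketch(rowptrs, colinds, n_items, user_rows, sampler, rng):
    """For each user row, draw candidates until one is not among the user's rated items."""
    neg_items, counts = [], []
    for u in user_rows:
        rated = set(colinds[rowptrs[u]:rowptrs[u + 1]])
        if len(rated) >= n_items:
            # Rejection sampling would never terminate for this user.
            raise ValueError(f"user row {u} has rated every item")
        tries = 0
        while True:
            tries += 1
            cand = sampler(n_items, rng)
            if cand not in rated:
                break
        neg_items.append(cand)
        counts.append(tries)
    return neg_items, counts


# 3 users x 4 items; only the sparsity pattern matters, so no values are stored.
rowptrs = [0, 2, 3, 6]
colinds = [0, 2, 1, 0, 1, 3]
rng = random.Random(42)
user_rows = [0, 1, 2, 2]
neg_items, counts = neg_sample_sketch(
    rowptrs, colinds, 4, user_rows, sample_uniform, rng
)
```

The `counts` list records how many candidates were drawn per user. Large counts indicate users who have rated most of the catalogue, which is where rejection sampling becomes inefficient.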
- lenskit.data.sampling.sample_weighted(mat)
Candidate sampling function for use with neg_sample(). It samples items proportionally to their popularity.
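One simple way to draw items in proportion to popularity, assuming popularity means the number of stored ratings per item, is to pick a stored entry of the matrix uniformly at random and return its column. The sketch below illustrates that idea only; `sample_by_popularity` is a hypothetical name, and this is not claimed to be how sample_weighted() is implemented.

```python
import random
from collections import Counter


def sample_by_popularity(colinds, rng):
    """Hypothetical sampler: pick a random stored entry and return its column number."""
    return colinds[rng.randrange(len(colinds))]


# Column 0 appears in 3 stored entries, column 1 in 2, column 2 in 1;
# column 3 has no ratings and therefore can never be drawn.
colinds = [0, 1, 0, 2, 0, 1]
rng = random.Random(7)
draws = Counter(sample_by_popularity(colinds, rng) for _ in range(6000))
```

Because every stored rating is equally likely to be picked, an item with three times as many ratings is drawn about three times as often, and unrated items are never proposed as candidates.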