Data Management#

LensKit provides a unified data model for recommender systems data along with classes and utility functions for working with it, described in this section of the manual.

Changed in version 2024.1: The new Dataset class replaces the Pandas data frames that were passed to algorithms in the past. It also subsumes the old support for producing sparse matrices from rating frames.

Data Model and Key Concepts#

The LensKit data model consists of users, items, and interactions, with fields providing additional (optional) data about each of these entities. The simplest valid LensKit data set is a list of user and item identifiers indicating which items each user has interacted with. These may be augmented with ratings, timestamps, or any other attributes.

Data can be read from a range of sources, but ultimately resolves to a collection of tables (e.g. Pandas DataFrame) that record user, item, and interaction data.

Identifiers#

Users and items have two identifiers:

The identifier as presented in the original source table(s). It appears in LensKit data frames as user_id and item_id columns. Identifiers can be integers, strings, or byte arrays.
The number assigned by the dataset handling code. This is a 0-based contiguous user or item number that is suitable for indexing into arrays or matrices, a common operation in recommendation models. In data frames, this appears as a user_num or item_num column. It is the only representation supported by NumPy and PyTorch array formats.

User and item numbers are assigned based on sorted identifiers in the initial data source, so reloading the same data set will yield the same numbers. Loading a subset, however, is not guaranteed to result in the same numbers, as the subset may be missing some users or items.

Methods that add users or items assign numbers only to the identifiers that do not yet have them, in sorted order, so existing numbers are unchanged.
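The numbering rules described here can be illustrated with a small plain-Python sketch. It is not LensKit's implementation: the helper names assign_numbers and extend_numbers are hypothetical, and the sketch assumes all identifiers share one sortable type.

```python
def assign_numbers(ids):
    """Map each distinct identifier to a 0-based number in sorted order."""
    return {ident: num for num, ident in enumerate(sorted(set(ids)))}


def extend_numbers(numbers, new_ids):
    """Number unseen identifiers (in sorted order) after the existing ones."""
    extended = dict(numbers)
    unseen = sorted(set(new_ids) - set(numbers))
    for ident in unseen:
        extended[ident] = len(extended)
    return extended


full = assign_numbers(["u30", "u10", "u20", "u10"])
subset = assign_numbers(["u30", "u20"])
grown = extend_numbers(full, ["u05", "u40", "u10"])

print(full)    # {'u10': 0, 'u20': 1, 'u30': 2}
print(subset)  # {'u20': 0, 'u30': 1}
print(grown)   # {'u10': 0, 'u20': 1, 'u30': 2, 'u05': 3, 'u40': 4}
```

In this sketch "u30" is number 2 in the full data but number 1 in the subset, which is why numbers should not be compared across differently loaded data sets.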

Identifiers and numbers can be mapped to each other with the user and item vocabularies (users and items, see the Vocabulary class).

lenskit.data.vocab.EntityId: TypeAlias = int | str | bytes#: Allowable entity identifier types.

Dataset Abstraction#

The LensKit Dataset class is the standard LensKit interface to datasets for training, evaluation, etc. Trainable models and components expect a dataset instance to be passed to fit(). It is an abstract class with implementations covering various scenarios.

class lenskit.data.Dataset#

Bases: ABC

Representation of a data set for LensKit training, evaluation, etc. Data can be accessed in a variety of formats depending on the needs of a component.

Note

Zero-copy conversions are used whenever possible, so client code must not modify returned data in-place.

Todo

Advanced rating situations are not yet supported:

repeated ratings
mixed implicit & explicit feedback
later actions removing earlier ratings

Todo

Support for item and user content or metadata is not yet implemented.

abstract property items: Vocabulary[int | str | bytes]#: The items known by this dataset.

abstract property users: Vocabulary[int | str | bytes]#: The users known by this dataset.

abstract count(what)#

Count entities in the dataset.

Note

The precise counts are subtle in the presence of repeated or superseded interactions. See interaction_count() and rating_count() for details on the "interactions" and "ratings" counts.

Parameters:

what (str) –

The type of entity to count. Commonly-supported ones include:

users
items
interactions
ratings

Return type:

int

property interaction_count: int#: Count the total number of interaction records. Equivalent to count("interactions").

Note

If the interaction records themselves represent counts, such as the number of times a song was played, this returns the number of records, not the total number of plays.
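The difference between the two quantities can be shown with hypothetical play-count records. The data and variable names are illustrative assumptions, not LensKit code.

```python
# Hypothetical play-count records: (user_id, item_id, play_count).
records = [
    ("alice", "song-a", 3),
    ("alice", "song-b", 1),
    ("bob", "song-a", 5),
]

# The number of records, which is what interaction_count is described as returning.
record_count = len(records)

# The total number of plays, which interaction_count does NOT return.
total_plays = sum(count for _, _, count in records)

print(record_count, total_plays)  # 3 9
```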

property rating_count: int#: Count the total number of ratings (excluding superseded ratings). Equivalent to count("ratings").

abstract interaction_log(format, *, fields='all', original_ids=False)#

Get the user-item interactions as a table in the requested format. The table is not in a specified order. Interactions may be repeated (e.g. the same user may listen to a song multiple times). For a non-repeated “ratings matrix” view of the data, see interaction_matrix().

Warning

Client code must not perform in-place modifications on the table returned from this method. Whenever possible, it will be a shallow view on top of the underlying storage, and modifications may corrupt data for other code.
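The hazard can be demonstrated with the standard library alone. The sketch below uses an array and a memoryview as stand-ins for the underlying storage and a zero-copy view of it; it does not use LensKit, and the variable names are assumptions.

```python
from array import array

# Stand-in for a dataset's underlying storage of ratings.
storage = array("d", [4.0, 3.5, 5.0])

# A zero-copy view shares memory with the storage.
view = memoryview(storage)
view[0] = 1.0        # an in-place edit through the view...
print(storage[0])    # ...changes the underlying data: 1.0
view.release()

# Copy first when the values need to be modified.
safe = array("d", storage)
safe[1] = 0.0
print(storage[1])    # 3.5, the original is untouched
```

The same reasoning applies to data frames, arrays, and tensors returned by a dataset: make an explicit copy before modifying anything.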

Parameters:

format (str) –
The desired data format. Currently-supported formats are:
- "pandas" — returns a pandas.DataFrame. The index is not meaningful.
- "numpy" — returns a NumpyUserItemTable.
- "torch" — returns a TorchUserItemTable.
fields (str | list[str] | None) – Which fields to include. If set to "all", will include all available fields in the resulting table; None includes no fields besides the user and item. Commonly-available fields include "rating" and "timestamp". Missing fields will be omitted in the result.
original_ids (bool) – If True, return user and item IDs as represented in the original source data in columns named user_id and item_id, instead of the user and item numbers typically returned. Only applicable to the pandas format. See Identifiers.

Returns:

The user-item interaction log in the specified format.

Return type:

Any
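The effect of original_ids can be pictured as translating the number columns back through the user and item vocabularies. The following is a plain-Python sketch with made-up data, where each vocabulary is a list whose positions are the numbers; it is not how LensKit performs the translation.

```python
# Vocabularies as lists: position is the number, value is the original ID.
user_ids = ["alice", "bob", "carol"]
item_ids = [101, 205]

# An interaction log keyed by (user_num, item_num).
log_nums = [(0, 1), (2, 0), (0, 0)]

# Translate the numbers back to the identifiers from the source data.
log_ids = [(user_ids[u], item_ids[i]) for u, i in log_nums]
print(log_ids)  # [('alice', 205), ('carol', 101), ('alice', 101)]
```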

abstract interaction_matrix(format, *, layout=None, legacy=False, field=None, combine=None, original_ids=False)#

Get the user-item interactions as a “ratings” matrix. Interactions are not repeated. The matrix may be in “coordinate” format, in which case it is comparable to interaction_log() but without repeated interactions, or it may be in a compressed sparse format.
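To make the two layouts concrete, the sketch below converts coordinate (user number, item number) pairs into compressed sparse row form using plain lists. The function coo_to_csr is a hypothetical helper written for illustration; it is not LensKit's CSRStructure or any library routine.

```python
def coo_to_csr(n_rows, rows, cols):
    """Convert coordinate (row, col) pairs to CSR row pointers and column indices."""
    # Count the entries in each row.
    counts = [0] * n_rows
    for r in rows:
        counts[r] += 1

    # Row pointers: entries for row r live in colinds[rowptrs[r]:rowptrs[r + 1]].
    rowptrs = [0] * (n_rows + 1)
    for r in range(n_rows):
        rowptrs[r + 1] = rowptrs[r] + counts[r]

    # Place each column index into the next free slot of its row.
    next_slot = rowptrs[:-1].copy()
    colinds = [0] * len(cols)
    for r, c in zip(rows, cols):
        colinds[next_slot[r]] = c
        next_slot[r] += 1
    return rowptrs, colinds


# Three users, four items; user 1 has no interactions.
user_nums = [0, 2, 0, 2, 2]
item_nums = [1, 0, 3, 2, 3]
rowptrs, colinds = coo_to_csr(3, user_nums, item_nums)

print(rowptrs)  # [0, 2, 2, 5]
print(colinds)  # [1, 3, 0, 2, 3]
```

The compressed form stores one pointer per user instead of one user number per interaction, and it makes the items of a given user a contiguous slice.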

Todo

The combine parameter is currently ignored because repeated interactions are not yet supported.

Warning

Client code must not perform in-place modifications on the matrix returned from this method. Whenever possible, it will be a shallow view on top of the underlying storage, and modifications may corrupt data for other code.

Parameters:

format (str) –
The desired data format. Currently-supported formats are:
- "pandas" — returns a pandas.DataFrame.
- "torch" — returns a sparse torch.Tensor (see torch.sparse).
- "scipy" — returns a sparse array from scipy.sparse.
- "structure" — returns a CSRStructure containing only the user and item numbers in compressed sparse row format.
field (str | None) –
Which field to return in the matrix. Common fields include "rating" and "timestamp".

If unspecified (None), this will yield an implicit-feedback indicator matrix, with 1s for observed items; the "pandas" format will only include user and item columns.

If the rating field is requested but is not defined in the underlying data, then this is equivalent to the indicator matrix (field=None), except that the "pandas" format will include a "rating" column of all 1s.
combine (str | None) –
How to combine multiple observations for a single user-item pair. Available methods are:
- "count" — count the user-item interactions. Only valid when field=None; if the underlying data defines a count field, then this is equivalent to "sum" on that field.
- "sum" — sum the field values.
- "first", "last" — take the first or last value seen (in timestamp order, if timestamps are defined).
layout (str | None) – The layout for a sparse matrix. Can be either csr or coo, or None to use the default for the specified format. CSR is only supported by Torch and SciPy backends.
legacy (bool) – True to return a legacy SciPy sparse matrix instead of sparse array.
original_ids (bool) – True to return user and item IDs instead of numbers in pandas-format matrix.

Return type:

Any
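The combine methods listed above can be sketched in plain Python to show the intended semantics for repeated user-item pairs. This is an illustration under assumptions, not LensKit's implementation: combine_observations is a hypothetical function, the log is a list of tuples, and the sketch does not enforce the rule that "count" is only valid when field=None.

```python
def combine_observations(log, method):
    """Collapse repeated (user, item) observations into one value per pair.

    ``log`` is a list of (user_num, item_num, value, timestamp) tuples.
    """
    # Process in timestamp order so "first" and "last" are well defined.
    ordered = sorted(log, key=lambda rec: rec[3])
    combined = {}
    for user, item, value, _ in ordered:
        key = (user, item)
        if method == "count":
            combined[key] = combined.get(key, 0) + 1
        elif method == "sum":
            combined[key] = combined.get(key, 0) + value
        elif method == "first":
            combined.setdefault(key, value)
        elif method == "last":
            combined[key] = value
        else:
            raise ValueError(f"unknown combine method: {method}")
    return combined


log = [
    (0, 1, 4.0, 100),
    (0, 1, 2.0, 300),
    (1, 1, 5.0, 200),
]

print(combine_observations(log, "count"))  # {(0, 1): 2, (1, 1): 1}
print(combine_observations(log, "sum"))    # {(0, 1): 6.0, (1, 1): 5.0}
print(combine_observations(log, "first"))  # {(0, 1): 4.0, (1, 1): 5.0}
print(combine_observations(log, "last"))   # {(0, 1): 2.0, (1, 1): 5.0}
```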

item_stats()#

Get item statistics.

Returns:

A data frame indexed by item ID with the following columns:

count — the number of interactions recorded for this item.
user_count — the number of distinct users who have interacted with or rated this item.
rating_count — the number of ratings for this item. Only provided if the dataset has explicit ratings; if there are repeated ratings, this does not count superseded ratings.
mean_rating — the mean of the ratings. Only provided if the dataset has explicit ratings.
first_time — the first time the item appears. Only provided if the dataset has timestamps.

The index is the vocabulary, so iloc works with item numbers.

Return type:

pandas.DataFrame
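As an illustration of what these columns mean, the sketch below computes them from a small list of rating records in plain Python. It assumes every record has a rating and a timestamp and that there are no repeated or superseded ratings; sketch_item_stats is a hypothetical helper, not the LensKit method.

```python
from collections import defaultdict


def sketch_item_stats(log):
    """Compute per-item statistics from (user_id, item_id, rating, timestamp) records."""
    grouped = defaultdict(list)
    for user, item, rating, timestamp in log:
        grouped[item].append((user, rating, timestamp))

    stats = {}
    for item, rows in grouped.items():
        ratings = [rating for _, rating, _ in rows]
        stats[item] = {
            "count": len(rows),
            "user_count": len({user for user, _, _ in rows}),
            "rating_count": len(ratings),
            "mean_rating": sum(ratings) / len(ratings),
            "first_time": min(ts for _, _, ts in rows),
        }
    return stats


log = [
    ("u1", "i1", 4.0, 30),
    ("u2", "i1", 2.0, 10),
    ("u1", "i2", 5.0, 20),
]
stats = sketch_item_stats(log)
print(stats["i1"])
# {'count': 2, 'user_count': 2, 'rating_count': 2, 'mean_rating': 3.0, 'first_time': 10}
```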

user_stats()#

Get user statistics.

Returns:

A data frame indexed by user ID with the following columns:

count — the number of interactions recorded for this user.
item_count — the number of distinct items with which this user has interacted.
rating_count — the number of ratings for this user. Only provided if the dataset has explicit ratings; if there are repeated ratings, this does not count superseded ratings.
mean_rating — the mean of the user’s ratings. Only provided if the dataset has explicit ratings.
first_time — the first time the user appears. Only provided if the dataset has timestamps.
last_time — the last time the user appears. Only provided if the dataset has timestamps.

The index is the vocabulary, so iloc works with user numbers.

Return type:

pandas.DataFrame

Creating Datasets#

Several functions create Dataset objects from different input data sources.

Loading Common Datasets#

lenskit.data.load_movielens(path)#

Load a MovieLens dataset. The appropriate MovieLens format is detected based on the file contents.

Parameters: path (str | Path) – The path to the dataset, either as an unpacked directory or a zip file.
Returns: The dataset.
Return type: Dataset

Vocabularies#

LensKit uses vocabularies to record user/item IDs, tags, terms, etc. in a way that facilitates easy mapping to 0-based contiguous indexes for use in matrix and tensor data structures.

class lenskit.data.vocab.Vocabulary(keys=None, name=None)#

Bases: Generic[VT]

Vocabularies of terms, tags, entity IDs, etc. for the LensKit data model.

This class supports bidirectional mappings between key-like data and contiguous nonnegative integer indices. Its key use is to facilitate the user and item ID vocabularies in Dataset, but it can also be used for things like item tags.

It is currently a wrapper around pandas.Index, but supports the ability to add additional vocabulary terms after the vocabulary has been created. New terms do not change the index positions of previously-known identifiers.
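The behavior described here (a bidirectional mapping where new terms never disturb existing numbers) can be sketched with a list and a dictionary. SimpleVocabulary below is a hypothetical stand-in for illustration only; it does not wrap pandas.Index, and its add_terms method and sorted insertion order are assumptions rather than the documented Vocabulary API.

```python
class SimpleVocabulary:
    """Append-only, bidirectional mapping between terms and 0-based numbers."""

    def __init__(self, keys=()):
        self._terms = []      # number -> term
        self._numbers = {}    # term -> number
        self.add_terms(keys)

    @property
    def size(self):
        return len(self._terms)

    def add_terms(self, terms):
        # New terms go at the end, so existing numbers never change.
        for term in sorted(set(terms) - set(self._numbers)):
            self._numbers[term] = len(self._terms)
            self._terms.append(term)

    def number(self, term):
        return self._numbers[term]  # raises KeyError for unknown terms

    def term(self, num):
        if num < 0:
            raise IndexError("negative indexing is not supported")
        return self._terms[num]

    def copy(self):
        # The copy keeps the current mapping and ignores later additions.
        clone = SimpleVocabulary()
        clone._terms = list(self._terms)
        clone._numbers = dict(self._numbers)
        return clone


items = SimpleVocabulary(["i30", "i10"])
snapshot = items.copy()
items.add_terms(["i05", "i10"])

print(items.number("i10"), items.number("i05"))  # 0 2
print(items.term(1))                              # i30
print(snapshot.size, items.size)                  # 2 3
```

Because additions only append, every number recorded in the snapshot still refers to the same term in the grown vocabulary.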

Parameters:

keys (pd.Index | Iterable[VT] | None)
name (str | None)

name: str | None#: The name of the vocabulary (e.g. “user”, “item”).

property index: Index#: The vocabulary as a Pandas index.

property size: int#: Current vocabulary size.

number(term: VT, missing: Literal['error'] = 'error') → int#
number(term: VT, missing: Literal['none'] | None) → int | None: Look up the number for a vocabulary term.

numbers(terms, missing='error')#

Look up the numbers for an array of terms or IDs.

Parameters:

terms (Sequence[VT] | ArrayLike)
missing (Literal['error', 'negative'])

Return type:

ndarray[int, dtype[int32]]

term(num)#

Look up the term with a particular number. Negative indexing is not supported.

Parameters: num (int)
Return type: VT

terms(nums=None)#

Get a list of terms, optionally for an array of term numbers.

Parameters: nums (list[int] | ndarray[Any, dtype[integer]] | Series | None) – The numbers (indices) of the terms to retrieve. If None, returns all terms.
Returns: The terms corresponding to the specified numbers, or the full array of terms (in order) if nums=None.
Return type: ndarray

id(num)#

Alias for term() for greater readability for entity ID vocabularies.

Parameters: num (int)
Return type: VT

ids(nums=None)#

Alias for terms() for greater readability for entity ID vocabularies.

Parameters: nums (list[int] | ndarray[Any, dtype[integer]] | Series | None)
Return type: ndarray

copy()#

Return a (cheap) copy of this vocabulary. It retains the same mapping, but will not be updated if the original vocabulary has new terms added. However, since new terms are always added to the end, it will be compatible with the original vocabulary for all terms recorded at the time of the copy.

This method is useful for saving known vocabularies in model training.

Return type: Vocabulary[VT]

Dataset Implementations#

Matrix Dataset#

The MatrixDataset provides an in-memory dataset implementation backed by a ratings matrix or implicit-feedback matrix.

Lazy Dataset#

The lazy data set takes a function that loads a data set (of any type), and lazily uses that function to load an underlying data set when needed.
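The lazy-loading pattern itself is simple to sketch. The class below is a generic, hypothetical illustration of deferring a load until first use; the names LazyLoader, delegate, and load_interactions are assumptions and do not describe the actual LensKit class.

```python
class LazyLoader:
    """Defer an expensive load until the data is first needed."""

    def __init__(self, loader):
        self._loader = loader
        self._data = None
        self.load_calls = 0

    def delegate(self):
        # Load on first use, then reuse the cached result.
        if self._data is None:
            self._data = self._loader()
            self.load_calls += 1
        return self._data


def load_interactions():
    # Stand-in for an expensive load, such as reading files from disk.
    return [("u1", "i1"), ("u1", "i2"), ("u2", "i1")]


lazy = LazyLoader(load_interactions)
print(lazy.load_calls)        # 0, nothing loaded yet
print(len(lazy.delegate()))   # 3, triggers the load
print(len(lazy.delegate()))   # 3, served from the cache
print(lazy.load_calls)        # 1
```

This keeps construction cheap, which helps when a dataset object is created in one place but may never be used, or is only used in a worker process.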

User-Item Data Tables#

class lenskit.data.tables.NumpyUserItemTable(user_nums, item_nums, ratings=None, timestamps=None)#

Bases: object

Table of user-item interaction data represented as NumPy arrays.

Parameters:

user_nums (ndarray[int, dtype[int32]])
item_nums (ndarray[int, dtype[int32]])
ratings (ndarray[int, dtype[float32]] | None)
timestamps (ndarray[int, dtype[int64]] | None)

user_nums: ndarray[int, dtype[int32]]#: User numbers (0-based contiguous integers, see Identifiers).

item_nums: ndarray[int, dtype[int32]]#: Item numbers (0-based contiguous integers, see Identifiers).

ratings: ndarray[int, dtype[float32]] | None = None#: Ratings for the items.

timestamps: ndarray[int, dtype[int64]] | None = None#: Timestamps for recorded user-item interactions.

class lenskit.data.tables.TorchUserItemTable(user_nums, item_nums, ratings=None, timestamps=None)#

Bases: object

Table of user-item interaction data represented as PyTorch tensors.

Parameters:

user_nums (Tensor)
item_nums (Tensor)
ratings (Tensor | None)
timestamps (Tensor | None)

user_nums: Tensor#: User numbers (0-based contiguous integers, see Identifiers).

item_nums: Tensor#: Item numbers (0-based contiguous integers, see Identifiers).

ratings: Tensor | None = None#: Ratings for the items.

timestamps: Tensor | None = None#: Timestamps for recorded user-item interactions.
