Data Management#
LensKit provides a unified data model for recommender systems data along with classes and utility functions for working with it, described in this section of the manual.
Changed in version 2024.1: The new Dataset class replaces the Pandas data frames that were passed to algorithms in the past. It also subsumes the old support for producing sparse matrices from rating frames.
Data Model and Key Concepts#
The LensKit data model consists of users, items, and interactions, with fields providing additional (optional) data about each of these entities. The simplest valid LensKit data set is a list of user and item identifiers indicating which items each user has interacted with. These may be augmented with ratings, timestamps, or any other attributes.
Data can be read from a range of sources, but ultimately resolves to a collection of tables (e.g. Pandas DataFrame) that record user, item, and interaction data.
Identifiers#
Users and items have two identifiers:
- The identifier as presented in the original source table(s). It appears in LensKit data frames as user_id and item_id columns. Identifiers can be integers, strings, or byte arrays.
- The number assigned by the dataset handling code. This is a 0-based contiguous user or item number that is suitable for indexing into arrays or matrices, a common operation in recommendation models. In data frames, this appears as a user_num or item_num column. It is the only representation supported by NumPy and PyTorch array formats.
User and item numbers are assigned based on sorted identifiers in the initial data source, so reloading the same data set will yield the same numbers. Loading a subset, however, is not guaranteed to result in the same numbers, as the subset may be missing some users or items.
Methods that add additional users or items will assign numbers based on the sorted identifiers that do not yet have numbers.
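The numbering rules above can be illustrated with a small standard-library sketch. This is not LensKit code: the helper names assign_numbers and extend_numbers are hypothetical, and the sketch only mirrors the behavior described in this section (sorted identifiers get contiguous 0-based numbers, and later additions are numbered after the existing ones).

```python
def assign_numbers(identifiers):
    """Map each distinct identifier to a 0-based number in sorted order."""
    return {ident: num for num, ident in enumerate(sorted(set(identifiers)))}


def extend_numbers(numbers, new_identifiers):
    """Number unseen identifiers (in sorted order) after the existing ones."""
    extended = dict(numbers)
    for ident in sorted(set(new_identifiers) - set(numbers)):
        extended[ident] = len(extended)
    return extended


# Loading the full data twice gives the same numbers, because the
# assignment depends only on the sorted set of identifiers.
full = assign_numbers(["u30", "u10", "u20", "u10"])
again = assign_numbers(["u10", "u20", "u30"])

# A subset that lacks "u10" shifts the remaining numbers.
subset = assign_numbers(["u30", "u20"])

# Identifiers added later never renumber the existing ones.
grown = extend_numbers(full, ["u05", "u40"])

print(full)    # {'u10': 0, 'u20': 1, 'u30': 2}
print(subset)  # {'u20': 0, 'u30': 1}
print(grown)   # {'u10': 0, 'u20': 1, 'u30': 2, 'u05': 3, 'u40': 4}
```

Note that "u05" receives number 3 even though it sorts before "u10": stability of existing numbers takes priority over global sort order.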
Identifiers and numbers can be mapped to each other with the user and item vocabularies (users and items, see the Vocabulary class).
Dataset Abstraction#
The LensKit Dataset class is the standard LensKit interface to datasets for training, evaluation, etc. Trainable models and components expect a dataset instance to be passed to fit(). It is an abstract class with implementations covering various scenarios.
- class lenskit.data.Dataset#
Bases: ABC
Representation of a data set for LensKit training, evaluation, etc. Data can be accessed in a variety of formats depending on the needs of a component.
Note
Zero-copy conversions are used whenever possible, so client code must not modify returned data in-place.
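To see why in-place modification of a zero-copy result is hazardous, consider the following standard-library sketch. It does not use LensKit; an array.array stands in for a dataset's internal storage (an assumption made purely for illustration) and a memoryview plays the role of a zero-copy view returned to client code.

```python
from array import array

# Stand-in for storage owned by a dataset (illustrative assumption).
storage = array("d", [4.0, 3.5, 5.0])

# A zero-copy view shares memory with the underlying storage.
view = memoryview(storage)

# Safe pattern: take an explicit copy before changing anything.
safe = array("d", view.tolist())
safe[0] = 0.0
untouched_after_copy = storage[0]   # the original value is preserved

# Unsafe pattern: writing through the view changes the shared storage,
# which every other consumer of the data will then observe.
view[1] = 0.0
corrupted_value = storage[1]

print(untouched_after_copy, corrupted_value)  # 4.0 0.0
```

The same reasoning applies to data frames, arrays, and tensors returned by a dataset: copy first if the data needs to change.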
Todo
Advanced rating situations are not yet supported:
- repeated ratings
- mixed implicit & explicit feedback
- later actions removing earlier ratings
Todo
Support for item and user content or metadata is not yet implemented.
- abstract property items: Vocabulary[int | str | bytes]#
The items known by this dataset.
- abstract property users: Vocabulary[int | str | bytes]#
The users known by this dataset.
- abstract count(what)#
Count entities in the dataset.
Note
The precise counts are subtle in the presence of repeated or superseded interactions. See interaction_count() and rating_count() for details on the "interactions" and "ratings" counts.
- property interaction_count: int#
Count the total number of interaction records. Equivalent to count("interactions").
Note
If the interaction records themselves represent counts, such as the number of times a song was played, this returns the number of records, not the total number of plays.
- property rating_count: int#
Count the total number of ratings (excluding superseded ratings). Equivalent to count("ratings").
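The difference between the two counts can be sketched in plain Python. This is an illustration, not LensKit's implementation: it assumes that a later rating by the same user for the same item supersedes the earlier one, and the sample records are invented for the example.

```python
# (user, item, rating, timestamp) records; sample data for illustration only.
records = [
    ("alice", "song1", 4.0, 100),
    ("alice", "song1", 5.0, 200),  # supersedes alice's earlier rating of song1
    ("alice", "song2", 3.0, 150),
    ("bob", "song1", 2.0, 120),
]

# Interaction count: every record counts, including repeats.
interaction_count = len(records)

# Rating count: keep only the latest rating per (user, item) pair.
latest = {}
for user, item, rating, timestamp in sorted(records, key=lambda r: r[3]):
    latest[(user, item)] = rating
rating_count = len(latest)

# When records carry their own counts (e.g. plays), the interaction count
# is still the number of records, not the sum of the counts.
plays = [("alice", "song1", 3), ("bob", "song1", 2)]
play_records = len(plays)
total_plays = sum(count for _, _, count in plays)

print(interaction_count, rating_count)  # 4 3
print(play_records, total_plays)        # 2 5
```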
- abstract interaction_log(format, *, fields='all', original_ids=False)#
Get the user-item interactions as a table in the requested format. The table is not in a specified order. Interactions may be repeated (e.g. the same user may listen to a song multiple times). For a non-repeated “ratings matrix” view of the data, see interaction_matrix().
Warning
Client code must not perform in-place modifications on the table returned from this method. Whenever possible, it will be a shallow view on top of the underlying storage, and modifications may corrupt data for other code.
- Parameters:
format (str) –
The desired data format. Currently-supported formats are:
- "pandas": returns a pandas.DataFrame. The index is not meaningful.
- "numpy": returns a NumpyUserItemTable.
- "torch": returns a TorchUserItemTable.
fields (str | list[str] | None) – Which fields to include. If set to "all", will include all available fields in the resulting table; None includes no fields besides the user and item. Commonly-available fields include "rating" and "timestamp". Missing fields will be omitted in the result.
original_ids (bool) – If True, return user and item IDs as represented in the original source data in columns named user_id and item_id, instead of the user and item numbers typically returned. Only applicable to the pandas format. See Identifiers.
- Returns:
The user-item interaction log in the specified format.
- Return type:
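The way the fields and original_ids options determine the columns of the resulting table can be sketched as follows. The function select_columns is hypothetical and is not part of LensKit; it only encodes the rules stated in the parameter descriptions, plus one labeled assumption about bare string arguments.

```python
def select_columns(available, fields="all", original_ids=False):
    """Return the column names a log table would carry under these options."""
    if original_ids:
        id_columns = ["user_id", "item_id"]
    else:
        id_columns = ["user_num", "item_num"]

    if fields is None:
        wanted = []                 # only the user and item columns
    elif fields == "all":
        wanted = list(available)    # every field the data defines
    elif isinstance(fields, str):
        wanted = [fields]           # assumption: a bare string names one field
    else:
        wanted = list(fields)

    # Requested fields that the data does not define are omitted.
    return id_columns + [name for name in wanted if name in available]


available = ["rating", "timestamp"]
print(select_columns(available))
# ['user_num', 'item_num', 'rating', 'timestamp']
print(select_columns(["rating"], fields=["rating", "timestamp"]))
# ['user_num', 'item_num', 'rating']
print(select_columns(available, fields=None, original_ids=True))
# ['user_id', 'item_id']
```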
- abstract interaction_matrix(format, *, layout=None, legacy=False, field=None, combine=None, original_ids=False)#
Get the user-item interactions as a “ratings” matrix. Interactions are not repeated. The matrix may be in “coordinate” format, in which case it is comparable to interaction_log() but without repeated interactions, or it may be in a compressed sparse format.
Todo
Aggregate is currently ignored because repeated interactions are not yet supported.
Warning
Client code must not perform in-place modifications on the matrix returned from this method. Whenever possible, it will be a shallow view on top of the underlying storage, and modifications may corrupt data for other code.
- Parameters:
format (str) –
The desired data format. Currently-supported formats are:
- "pandas": returns a pandas.DataFrame.
- "torch": returns a sparse torch.Tensor (see torch.sparse).
- "scipy": returns a sparse array from scipy.sparse.
- "structure": returns a CSRStructure containing only the user and item numbers in compressed sparse row format.
field (str | None) –
Which field to return in the matrix. Common fields include "rating" and "timestamp".
If unspecified (None), this will yield an implicit-feedback indicator matrix, with 1s for observed items; the "pandas" format will only include user and item columns.
If the rating field is requested but is not defined in the underlying data, then this is equivalent to "indicator", except that the "pandas" format will include a "rating" column of all 1s.
combine (str | None) –
How to combine multiple observations for a single user-item pair. Available methods are:
- "count": count the user-item interactions. Only valid when field=None; if the underlying data defines a count field, then this is equivalent to "sum" on that field.
- "sum": sum the field values.
- "first", "last": take the first or last value seen (in timestamp order, if timestamps are defined).
layout (str | None) – The layout for a sparse matrix. Can be either csr or coo, or None to use the default for the specified format. CSR is only supported by Torch and SciPy backends.
legacy (bool) – True to return a legacy SciPy sparse matrix instead of sparse array.
original_ids (bool) – True to return user and item IDs instead of numbers in pandas-format matrix.
- Return type:
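The combine methods described above can be sketched in plain Python. This is an illustration of the documented semantics rather than LensKit's implementation (the Todo above notes that aggregation is currently ignored); the function combine_values and the sample log are hypothetical.

```python
from collections import defaultdict


def combine_values(log, method):
    """Collapse repeated (user, item) observations into one value per pair.

    ``log`` holds (user_num, item_num, timestamp, value) tuples.
    """
    groups = defaultdict(list)
    # Sorting by timestamp makes "first" and "last" follow time order.
    for user, item, _timestamp, value in sorted(log, key=lambda r: r[2]):
        groups[(user, item)].append(value)

    if method == "count":
        # Counts records per pair and ignores the values themselves.
        return {pair: len(values) for pair, values in groups.items()}
    if method == "sum":
        return {pair: sum(values) for pair, values in groups.items()}
    if method == "first":
        return {pair: values[0] for pair, values in groups.items()}
    if method == "last":
        return {pair: values[-1] for pair, values in groups.items()}
    raise ValueError(f"unknown combine method: {method!r}")


log = [
    (0, 0, 10, 4.0),
    (0, 0, 30, 2.0),  # the same user-item pair, observed again later
    (0, 1, 20, 5.0),
    (1, 0, 15, 3.0),
]

print(combine_values(log, "count"))  # {(0, 0): 2, (1, 0): 1, (0, 1): 1}
print(combine_values(log, "last"))   # {(0, 0): 2.0, (1, 0): 3.0, (0, 1): 5.0}
```

Each resulting key is one cell of the matrix in coordinate form: the user number is the row and the item number is the column.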
- item_stats()#
Get item statistics.
- Returns:
A data frame indexed by item ID with the following columns:
- count: the number of interactions recorded for this item.
- user_count: the number of distinct users who have interacted with or rated this item.
- rating_count: the number of ratings for this item. Only provided if the dataset has explicit ratings; if there are repeated ratings, this does not count superseded ratings.
- mean_rating: the mean of the ratings. Only provided if the dataset has explicit ratings.
- first_time: the first time the item appears. Only provided if the dataset has timestamps.
The index is the vocabulary, so iloc works with item numbers.
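As a rough illustration of what these columns mean, the following plain-Python sketch computes the same quantities from a small interaction log. It is not LensKit's implementation: the helper sketch_item_stats and the sample log are hypothetical, and the sketch assumes there are no repeated ratings, so rating_count would equal count and is omitted.

```python
def sketch_item_stats(log):
    """Compute per-item statistics from (user, item, rating, timestamp) rows."""
    raw = {}
    for user, item, rating, timestamp in log:
        entry = raw.setdefault(
            item, {"count": 0, "users": set(), "total": 0.0, "first": timestamp}
        )
        entry["count"] += 1
        entry["users"].add(user)
        entry["total"] += rating
        entry["first"] = min(entry["first"], timestamp)

    return {
        item: {
            "count": entry["count"],
            "user_count": len(entry["users"]),
            "mean_rating": entry["total"] / entry["count"],
            "first_time": entry["first"],
        }
        for item, entry in raw.items()
    }


log = [
    ("u1", "i1", 4.0, 100),
    ("u2", "i1", 2.0, 50),
    ("u1", "i2", 5.0, 70),
]
stats = sketch_item_stats(log)
print(stats["i1"])
# {'count': 2, 'user_count': 2, 'mean_rating': 3.0, 'first_time': 50}
```

User statistics follow the same pattern with the roles of user and item swapped.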
- user_stats()#
Get user statistics.
- Returns:
A data frame indexed by user ID with the following columns:
- count: the number of interactions recorded for this user.
- item_count: the number of distinct items with which this user has interacted.
- rating_count: the number of ratings for this user. Only provided if the dataset has explicit ratings; if there are repeated ratings, this does not count superseded ratings.
- mean_rating: the mean of the user’s ratings. Only provided if the dataset has explicit ratings.
- first_time: the first time the user appears. Only provided if the dataset has timestamps.
- last_time: the last time the user appears. Only provided if the dataset has timestamps.
The index is the vocabulary, so iloc works with user numbers.
Creating Datasets#
Several functions create Dataset objects from different input data sources.
Loading Common Datasets#
Vocabularies#
LensKit uses vocabularies to record user/item IDs, tags, terms, etc. in a way that facilitates easy mapping to 0-based contiguous indexes for use in matrix and tensor data structures.
- class lenskit.data.vocab.Vocabulary(keys=None, name=None)#
Bases: Generic[VT]
Vocabularies of terms, tags, entity IDs, etc. for the LensKit data model.
This class supports bidirectional mappings between key-like data and contiguous nonnegative integer indices. Its key use is to facilitate the user and item ID vocabularies in Dataset, but it can also be used for things like item tags.
It is currently a wrapper around pandas.Index, but supports the ability to add additional vocabulary terms after the vocabulary has been created. New terms do not change the index positions of previously-known identifiers.
- Parameters:
keys (pd.Index | Iterable[VT] | None)
name (str | None)
- number(term: VT, missing: Literal['error'] = 'error') → int#
- number(term: VT, missing: Literal['none'] | None) → int | None
Look up the number for a vocabulary term.
- numbers(terms, missing='error')#
Look up the numbers for an array of terms or IDs.
- term(num)#
Look up the term with a particular number. Negative indexing is not supported.
- Parameters:
num (int)
- Return type:
VT
- terms(nums=None)#
Get a list of terms, optionally for an array of term numbers.
- id(num)#
Alias for term() for greater readability for entity ID vocabularies.
- Parameters:
num (int)
- Return type:
VT
- copy()#
Return a (cheap) copy of this vocabulary. It retains the same mapping, but will not be updated if the original vocabulary has new terms added. However, since new terms are always added to the end, it will be compatible with the original vocabulary for all terms recorded at the time of the copy.
This method is useful for saving known vocabularies in model training.
- Return type:
Vocabulary[VT]
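The behavior documented for this class can be approximated with a short pure-Python sketch. SketchVocabulary is a hypothetical stand-in, not the LensKit class (which wraps pandas.Index), and its add_terms method is an assumed name for the term-adding operation. The sketch shows the bidirectional lookup, the missing-term handling, append-only growth, and the snapshot semantics of copy().

```python
class SketchVocabulary:
    """Toy bidirectional mapping between terms and 0-based numbers."""

    def __init__(self, keys=None):
        self._terms = []   # number -> term
        self._index = {}   # term -> number
        if keys is not None:
            self.add_terms(keys)

    def add_terms(self, terms):
        """Append unseen terms (in sorted order) without renumbering old ones."""
        for term in sorted(set(terms) - set(self._index)):
            self._index[term] = len(self._terms)
            self._terms.append(term)

    def number(self, term, missing="error"):
        """Look up a term's number; unknown terms raise or yield None."""
        if term in self._index:
            return self._index[term]
        if missing == "error":
            raise KeyError(term)
        return None

    def term(self, num):
        """Look up the term for a number; negative indexing is rejected."""
        if num < 0:
            raise IndexError("negative indexing is not supported")
        return self._terms[num]

    def copy(self):
        """Snapshot the current mapping; later additions do not propagate."""
        clone = SketchVocabulary()
        clone._terms = list(self._terms)
        clone._index = dict(self._index)
        return clone


vocab = SketchVocabulary(["b", "a"])
snapshot = vocab.copy()
vocab.add_terms(["aa"])  # appended as number 2, although it sorts before "b"

print(vocab.number("aa"))                    # 2
print(snapshot.number("aa", missing="none")) # None
print(vocab.number("b"), snapshot.number("b"))  # 1 1
```

Because new terms only ever extend the end of the mapping, a saved copy stays consistent with the original for every term it knows, which is what makes copies useful for recording the vocabulary seen at training time.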
Dataset Implementations#
Matrix Dataset#
The MatrixDataset provides an in-memory dataset implementation backed by a ratings matrix or implicit-feedback matrix.
Lazy Dataset#
The lazy data set takes a function that loads a data set (of any type), and lazily uses that function to load an underlying data set when needed.
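The lazy-loading pattern described here can be sketched without LensKit. The class LazyLoader and its delegate method are hypothetical names chosen for this illustration; the sketch only shows the general idea of deferring a loader function until the data is first needed and then reusing the result.

```python
class LazyLoader:
    """Call a loader function on first access and cache what it returns."""

    def __init__(self, loader):
        self._loader = loader
        self._loaded = False
        self._value = None

    def delegate(self):
        """Return the underlying data set, loading it if necessary."""
        if not self._loaded:
            self._value = self._loader()
            self._loaded = True
        return self._value


calls = []


def load_data():
    # Stand-in for an expensive load, e.g. reading files from disk.
    calls.append("load")
    return {"interactions": 3}


lazy = LazyLoader(load_data)
calls_before_access = len(calls)   # nothing has been loaded yet

first = lazy.delegate()
second = lazy.delegate()           # served from the cache

print(calls_before_access, len(calls), first is second)  # 0 1 True
```

This simple version is not thread-safe; concurrent first accesses could invoke the loader more than once.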
User-Item Data Tables#
- class lenskit.data.tables.NumpyUserItemTable(user_nums, item_nums, ratings=None, timestamps=None)#
Bases:
object
Table of user-item interaction data represented as NumPy arrays.
- Parameters:
- user_nums: ndarray[int, dtype[int32]]#
User numbers (0-based contiguous integers, see Identifiers).
- item_nums: ndarray[int, dtype[int32]]#
Item numbers (0-based contiguous integers, see Identifiers).
- class lenskit.data.tables.TorchUserItemTable(user_nums, item_nums, ratings=None, timestamps=None)#
Bases:
object
Table of user-item interaction data represented as PyTorch tensors.
- Parameters:
- user_nums: Tensor#
User numbers (0-based contiguous integers, see Identifiers).
- item_nums: Tensor#
Item numbers (0-based contiguous integers, see Identifiers).