Datasets#
LensKit provides a unified data model for recommender systems data along with classes and utility functions for working with it, described in this section of the manual.
Changed in version 2025.1: The new Dataset class replaces the Pandas data frames that were passed to algorithms in the past. It also subsumes the old support for producing sparse matrices from rating frames.
Getting started with the dataset is fairly straightforward:
>>> from lenskit.data import load_movielens
>>> mlds = load_movielens('data/ml-latest-small')
>>> mlds.item_count
9066
You can then access the data through the various methods of the Dataset class.
For example, to get the ratings as a data frame:
>>> mlds.interaction_matrix('pandas', field='rating')
   user_num  item_num  rating
0         0        30     2.5
1         0       833     3.0
2         0       859     3.0
3         0       906     2.0
4         0       931     4.0
...
[100004 rows x 3 columns]
Or obtain item statistics:
>>> mlds.item_stats()
      count  user_count  rating_count  mean_rating  first_time
item
1       247         247           247     3.872470   828212413
2       107         107           107     3.401869   828213150
3        59          59            59     3.161017   833955544
4        13          13            13     2.384615   834425135
5        56          56            56     3.267857   829491839
...
[9066 rows x 5 columns]
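To make the columns above more concrete, here is a rough pandas approximation of such per-item statistics on a tiny made-up ratings table. The meaning of each column is inferred from its name (for example, first_time as the earliest timestamp), so treat this as an illustrative sketch under that assumption, not as LensKit's implementation of item_stats().

```python
import pandas as pd

# Made-up ratings; the column names here are illustrative.
ratings = pd.DataFrame({
    "item_id": [1, 1, 2, 1],
    "user_id": ["a", "b", "a", "c"],
    "rating": [4.0, 3.0, 5.0, 2.0],
    "timestamp": [100, 90, 120, 110],
})

# One row per item, aggregating over that item's interactions.
stats = ratings.groupby("item_id").agg(
    count=("rating", "size"),            # number of interaction rows
    user_count=("user_id", "nunique"),   # distinct users (assumed meaning)
    rating_count=("rating", "count"),    # non-missing ratings (assumed meaning)
    mean_rating=("rating", "mean"),
    first_time=("timestamp", "min"),     # earliest timestamp (assumed meaning)
)
print(stats)
```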
Data Model and Key Concepts#
The LensKit data model consists of users, items, and interactions, with fields providing additional (optional) data about each of these entities. The simplest valid LensKit data set is a list of user and item identifiers indicating which items each user has interacted with. These may be augmented with ratings, timestamps, or any other attributes.
Data can be read from a range of sources, but ultimately resolves to a collection of tables (e.g. Pandas DataFrame) that record user, item, and interaction data.
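As a minimal sketch of the simplest case described above, the following builds an interaction table with pandas. The user_id and item_id column names follow the identifier columns LensKit uses in data frames; the rating and timestamp columns and all values are made up for illustration, and the step of handing the frame to LensKit is not shown.

```python
import pandas as pd

# Simplest valid data: one row per (user, item) interaction.
interactions = pd.DataFrame({
    "user_id": ["alice", "alice", "bob", "carol"],
    "item_id": [10, 20, 10, 30],
})

# Optional fields augment the same rows (values are illustrative only).
rated = interactions.assign(
    rating=[4.0, 3.5, 5.0, 2.0],
    timestamp=[828212413, 828213150, 833955544, 834425135],
)

# Which items has each user interacted with?
items_by_user = interactions.groupby("user_id")["item_id"].apply(list).to_dict()
print(items_by_user)
```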
Identifiers#
Users and items have two identifiers:
The identifier as presented in the original source table(s). It appears in LensKit data frames as user_id and item_id columns. Identifiers can be integers, strings, or byte arrays, and are represented in LensKit by the ID type.
The number assigned by the dataset handling code. This is a 0-based contiguous user or item number that is suitable for indexing into arrays or matrices, a common operation in recommendation models. In data frames, this appears as a user_num or item_num column. It is the only representation supported by NumPy and PyTorch array formats.
User and item numbers are assigned based on sorted identifiers in the initial data source, so reloading the same data set will yield the same numbers. Loading a subset, however, is not guaranteed to result in the same numbers, as the subset may be missing some users or items.
Methods that add additional users or items will assign numbers based on the sorted identifiers that do not yet have numbers.
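The numbering rules above can be sketched in plain Python. The helpers assign_numbers and add_ids are hypothetical names used only for this illustration, and the choice to append newly added identifiers after the existing numbers is one reading of the rule, not a description of LensKit's internals.

```python
def assign_numbers(ids):
    """Map each distinct identifier to a 0-based number in sorted order."""
    return {term: num for num, term in enumerate(sorted(set(ids)))}


def add_ids(numbers, new_ids):
    """Number unseen identifiers, in sorted order, after the existing ones."""
    extended = dict(numbers)
    for term in sorted(set(new_ids) - set(numbers)):
        extended[term] = len(extended)
    return extended


full = ["u30", "u10", "u20", "u10"]
full_numbers = assign_numbers(full)

# Reloading the same data yields the same numbers.
reloaded_numbers = assign_numbers(full)

# A subset that is missing "u10" shifts the numbers of the remaining users.
subset_numbers = assign_numbers(["u30", "u20"])

# Adding users keeps existing numbers and numbers only the new identifiers.
extended_numbers = add_ids(full_numbers, ["u25", "u05", "u10"])
```

Because numbers depend on which identifiers are present, it is safer to store identifiers, not numbers, when results must be compared across different subsets of a data set.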
Identifiers and numbers can be mapped to each other with the user and item vocabularies (users and items, see the Vocabulary class).
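The two-way mapping can be illustrated with a pandas Index standing in for a vocabulary, where an identifier's position is its number. This is only a sketch of the idea with made-up data; it does not use or describe the methods of LensKit's Vocabulary class.

```python
import numpy as np
import pandas as pd

# Toy vocabularies: the position in each index is the user or item number.
users = pd.Index(["alice", "bob", "carol"], name="user_id")
items = pd.Index([10, 20, 30], name="item_id")

ratings = pd.DataFrame({
    "user_id": ["alice", "alice", "bob", "carol"],
    "item_id": [10, 20, 10, 30],
    "rating": [4.0, 3.5, 5.0, 2.0],
})

# Identifier -> number (get_indexer returns -1 for unknown identifiers).
user_num = users.get_indexer(ratings["user_id"])
item_num = items.get_indexer(ratings["item_id"])

# Numbers index directly into a dense array.
matrix = np.zeros((len(users), len(items)))
matrix[user_num, item_num] = ratings["rating"].to_numpy()

# Number -> identifier.
recovered = users[user_num].tolist()
```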
Dataset Abstraction#
The LensKit Dataset class is the standard LensKit interface to datasets for training, evaluation, etc. Trainable models and components expect a dataset instance to be passed to train().
Datasets provide several views of different aspects of a dataset, documented in more detail in the reference documentation. These include:
Sets of known user and item identifiers, available as Vocabulary objects through the Dataset.users and Dataset.items properties.
Creating Datasets#
Several functions can create a Dataset from different input data sources.
|
Create a dataset from a data frame of ratings or other user-item interactions. |
Loading Common Datasets#
LensKit also provides support for loading several common data sets directly from their source files.
|
Load a MovieLens dataset. |
Dataset Implementations#
Dataset itself is an abstract class that can be extended to provide new data set implementations (e.g. querying a database). LensKit provides a few implementations.
|
Dataset implementation using an in-memory rating or implicit-feedback matrix (with no duplicate interactions). |
|
A data set with an underlying load function, that doesn't call the function until data is actually needed. |
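The deferred-loading idea in the last entry can be sketched generically. LazyData and load_interactions below are hypothetical names for this illustration; the sketch shows the general pattern of postponing a load function until first access and is not the LensKit class itself.

```python
from typing import Callable, Generic, Optional, TypeVar

T = TypeVar("T")


class LazyData(Generic[T]):
    """Hypothetical wrapper that defers a load function until first access."""

    def __init__(self, loader: Callable[[], T]) -> None:
        self._loader = loader
        self._data: Optional[T] = None
        self._loaded = False

    def get(self) -> T:
        # Call the loader only once, the first time data is requested.
        if not self._loaded:
            self._data = self._loader()
            self._loaded = True
        return self._data


calls = []


def load_interactions():
    # Stand-in for an expensive read from disk or a database.
    calls.append("load")
    return [("alice", 10), ("bob", 20)]


lazy = LazyData(load_interactions)
calls_before_access = list(calls)   # nothing has been loaded yet

first = lazy.get()    # triggers the load
second = lazy.get()   # reuses the cached result
```

Deferring the load this way keeps construction cheap, which helps when a data set object is created up front but only some code paths ever touch the data.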