lenskit.data.Dataset

class lenskit.data.Dataset(data)

Bases: object

Representation of a data set for LensKit training, evaluation, etc. Data can be accessed in a variety of formats depending on the needs of a component. See Data Model for details of the LensKit data model.

Dataset objects should not be directly constructed; instead, use a DatasetBuilder, load(), or from_interactions_df().

Note

Zero-copy conversions are used whenever possible, so client code must not modify returned data in-place.
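The hazard can be illustrated without LensKit itself. The following is a minimal sketch using only the Python standard library; `storage` is a hypothetical stand-in for a dataset's underlying buffers, not a LensKit object. It shows that a write through a zero-copy view changes the shared storage, while a copy can be modified freely.

```python
from array import array

# Hypothetical stand-in for a dataset's internal storage (not LensKit code).
storage = array("d", [4.0, 3.5, 5.0])

# A zero-copy view: it shares memory with `storage`.
view = memoryview(storage)

# Safe pattern: copy first, then modify the copy.
safe_copy = view.tolist()
safe_copy[0] = 0.0
unchanged_after_copy = storage[0]  # the storage is untouched

# Unsafe pattern: writing through the view alters the shared storage.
view[1] = -1.0
changed_after_write = storage[1]  # the storage now holds the new value
```

When in doubt, copying the returned data before modifying it avoids the problem at the cost of extra memory.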

Parameters:

data (DataContainer | Callable[[], DataContainer | Dataset]) – The container for this dataset’s data, or a function that will return such a container to create a lazy-loaded dataset.
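The lazy-loading idea behind the callable form of `data` can be sketched as follows. This is an illustration of the general pattern only: `LazyValue` and `load_container` are hypothetical names, and the sketch makes no claim about how Dataset implements deferred loading internally.

```python
from typing import Callable, Generic, TypeVar, Union

T = TypeVar("T")


class LazyValue(Generic[T]):
    """Hypothetical helper: holds a value or a zero-argument factory for it."""

    def __init__(self, data: Union[T, Callable[[], T]]) -> None:
        self._data = data
        # A non-callable argument is treated as already loaded.
        self._loaded = not callable(data)

    def get(self) -> T:
        # Call the factory at most once, on first access.
        if not self._loaded:
            self._data = self._data()
            self._loaded = True
        return self._data


calls = []


def load_container():
    # Stand-in for an expensive load; records each invocation.
    calls.append("loaded")
    return {"interactions": 3}


lazy = LazyValue(load_container)
calls_before = len(calls)  # nothing has been loaded yet
first = lazy.get()         # triggers the load
second = lazy.get()        # reuses the cached result
```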

Stability: Caller

This API is at the caller stability level: breaking changes for code calling this function or class will be reserved for annual major version bumps, but minor versions may introduce changes that break subclasses or reimplementations. See Stability Levels for details.

__init__(data)
Parameters:

data (DataContainer | Callable[[], DataContainer | Dataset])

Methods

__init__(data)

default_interaction_class()

entities(name)

Get the entities of a particular type / class.

interaction_matrix(*, format[, layout, ...])

Get the user-item interactions as a “ratings” matrix from the default interaction class.

interaction_table(*, format[, fields, ...])

Get the user-item interactions as a table in the requested format.

interactions([name])

Get the interaction records of a particular class.

item_stats()

Get item statistics from the default interaction class.

load(path)

Load a dataset in the LensKit native format.

relationships(name)

Get the relationship records of a particular type / class.

save(path)

Save the data set in the LensKit native format.

user_row([user_id, user_num])

Get a user's row from the interaction matrix for the default interaction class, using default coalescing for repeated interactions.

user_stats()

Get user statistics from the default interaction class.

Attributes

interaction_count

Count the total number of interactions of the default class, taking into account any count attribute.

item_count

items

The items known by this dataset.

schema

Get the schema of this dataset.

user_count

users

The users known by this dataset.

classmethod load(path)

Load a dataset in the LensKit native format.

Parameters:

path (str | PathLike[str]) – The path to the dataset to load.

Returns:

The loaded dataset.

Return type:

Dataset

save(path)

Save the data set in the LensKit native format.

Parameters:

path (str | PathLike[str]) – The path in which to save the data set (will be created as a directory).

property schema: DataSchema

Get the schema of this dataset.

property items: Vocabulary

The items known by this dataset.

property users: Vocabulary

The users known by this dataset.

entities(name)

Get the entities of a particular type / class.

Parameters:

name (str)

Return type:

EntitySet

relationships(name)

Get the relationship records of a particular type / class.

Parameters:

name (str)

Return type:

RelationshipSet

interactions(name=None)

Get the interaction records of a particular class. If no class is specified, returns the default interaction class.

Parameters:

name (str | None)

Return type:

RelationshipSet

property interaction_count: int

Count the total number of interactions of the default class, taking into account any count attribute.
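The difference between counting records and counting interactions can be sketched as follows. This is an assumption-laden illustration: the record layout and the `count` key are hypothetical, and `total_interactions` is not a LensKit function.

```python
# Hypothetical interaction records; the "count" key is an assumed attribute name.
records = [
    {"user": 0, "item": 10, "count": 3},
    {"user": 0, "item": 11, "count": 1},
    {"user": 1, "item": 10, "count": 2},
]


def total_interactions(rows):
    # Rows without a count attribute each contribute one interaction.
    return sum(row.get("count", 1) for row in rows)


n_records = len(records)                       # number of stored rows
n_interactions = total_interactions(records)   # rows weighted by their counts
```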

abstract interaction_table(*, format, fields=None, original_ids=False)

Get the user-item interactions as a table in the requested format. The table is not in a specified order. Interactions may be repeated (e.g. the same user may listen to a song multiple times). For a non-repeated “ratings matrix” view of the data, see interaction_matrix().

This is a convenience wrapper on top of interactions() and the methods of RelationshipSet.

Warning

Client code must not perform in-place modifications on the table returned from this method. Whenever possible, it will be a shallow view on top of the underlying storage, and modifications may corrupt data for other code.

Parameters:
  • format (str) –

    The desired data format. Currently-supported formats are:

    • "pandas" — returns a pandas.DataFrame. The index is not meaningful.

    • "arrow" — returns a PyArrow Table. The index is not meaningful.

    • "numpy" — returns a dictionary mapping names to arrays.

  • fields (str | list[str] | None) – Which fields (attributes) to include, or None to include all fields. Commonly-available fields include "rating" and "timestamp".

  • original_ids (bool) – If True, return user and item IDs as represented in the original source data in columns named user_id and item_id, instead of the user and item numbers typically returned.

Returns:

The user-item interaction log in the specified format.

Return type:

Any
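The distinction between original IDs and numbers that original_ids controls can be sketched as follows. `TinyVocabulary` is a hypothetical, simplified stand-in and not LensKit's Vocabulary class; assigning numbers in sorted ID order is an assumption made for illustration.

```python
class TinyVocabulary:
    """Hypothetical mapping between original IDs and contiguous numbers."""

    def __init__(self, ids):
        # Assumption for this sketch: numbers follow sorted ID order.
        self._ids = sorted(set(ids))
        self._numbers = {ident: num for num, ident in enumerate(self._ids)}

    def number(self, ident):
        # Original ID -> contiguous number.
        return self._numbers[ident]

    def ident(self, num):
        # Contiguous number -> original ID.
        return self._ids[num]


users = TinyVocabulary(["u42", "u7", "u42", "u19"])

# Interactions keyed by user number, as in a table without original IDs.
by_number = [(users.number("u7"), 4.0), (users.number("u19"), 2.5)]

# Translating back gives the original-ID view of the same rows.
by_id = [(users.ident(num), rating) for num, rating in by_number]
```

Numbers are convenient as array indices, while original IDs are what appear in the source data.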

abstract interaction_matrix(*, format, layout='csr', field=None, original_ids=False, legacy=False)

Get the user-item interactions as a “ratings” matrix from the default interaction class. Interactions are not repeated, and are coalesced with the default coalescing strategy for each attribute.
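Coalescing can be sketched as collapsing repeated (user, item) pairs into a single entry. The example below assumes a "keep the most recent value" strategy purely for illustration; the default strategy LensKit applies to each attribute is not stated here, and `coalesce_latest` is a hypothetical helper.

```python
# Hypothetical interaction log: (user number, item number, rating, timestamp).
log = [
    (0, 10, 3.0, 100),
    (0, 10, 4.5, 250),  # the same user rates the same item again, later
    (1, 10, 2.0, 120),
    (0, 11, 5.0, 130),
]


def coalesce_latest(rows):
    """Keep one (rating, timestamp) per (user, item): the most recent one."""
    latest = {}
    for user, item, rating, ts in rows:
        key = (user, item)
        if key not in latest or ts > latest[key][1]:
            latest[key] = (rating, ts)
    return latest


matrix_entries = coalesce_latest(log)
```

After coalescing, each (user, item) pair appears at most once, which is what allows the result to be treated as a matrix.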

The matrix may be returned in “coordinate” format, in which case it is comparable to interaction_table() but without repeated interactions, or it may be in a compressed sparse row format.
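The relationship between the two layouts can be sketched in plain Python. This is a generic illustration of coordinate (COO) and compressed sparse row (CSR) storage, not LensKit code; `coo_to_csr` is a hypothetical helper and assumes the input has no repeated (row, column) pairs.

```python
def coo_to_csr(n_rows, rows, cols, vals):
    """Convert coordinate triples to CSR arrays (row pointers, columns, values)."""
    # Sort entry positions by (row, column) so each row's entries are contiguous.
    order = sorted(range(len(rows)), key=lambda k: (rows[k], cols[k]))

    # Count entries per row, then accumulate the counts into row pointers.
    rowptr = [0] * (n_rows + 1)
    for r in rows:
        rowptr[r + 1] += 1
    for i in range(n_rows):
        rowptr[i + 1] += rowptr[i]

    colind = [cols[k] for k in order]
    values = [vals[k] for k in order]
    return rowptr, colind, values


# COO: one (row, column, value) triple per observed interaction.
rows = [1, 0, 0]
cols = [2, 0, 3]
vals = [2.0, 4.5, 5.0]

rowptr, colind, values = coo_to_csr(3, rows, cols, vals)

# The entries of row u live in positions rowptr[u]:rowptr[u + 1].
row0_items = colind[rowptr[0]:rowptr[1]]
```

CSR makes it cheap to look up all of one user's entries, while COO is closer to a table of interaction records.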

This is a convenience wrapper on top of interactions() and the methods of MatrixRelationshipSet.

Warning

Client code must not perform in-place modifications on the matrix returned from this method. Whenever possible, it will be a shallow view on top of the underlying storage, and modifications may corrupt data for other code.

Parameters:
  • format (str) –

    The desired data format. Currently-supported formats are:

  • field (str | None) –

    Which field to return in the matrix. Common fields include "rating" and "timestamp".

    If unspecified (None), this will yield an implicit-feedback indicator matrix, with 1s for observed items, except for the "pandas" format, which will return all attributes. Specify an empty list to return a Pandas data frame with only the user and item attributes.

  • layout (Literal['csr', 'coo']) – The layout for a sparse matrix. Can be either csr or coo, or None to use the default for the specified format. Ignored for the Pandas format.

  • original_ids (bool) – True to return user and item IDs instead of numbers in a pandas-format matrix.

  • legacy (bool)

Return type:

Any

abstract user_row(user_id=None, *, user_num=None)

Get a user’s row from the interaction matrix for the default interaction class, using default coalescing for repeated interactions. Available fields are returned as fields of the item list. If the dataset has ratings, these are provided as a rating field, not as the item scores. The item list is unordered, but items are returned in order by item number.

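Parameters:
  • user_id – The ID of the user whose row to retrieve.

  • user_num – The number of the user whose row to retrieve.
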
Returns:

The user’s interaction matrix row, or None if no user with that ID exists.

Return type:

ItemList | None

item_stats()

Get item statistics from the default interaction class.

Returns:

A data frame indexed by item ID with the interaction statistics. See Interaction Statistics for a description of the columns returned.

The index is the vocabulary, so iloc works with item numbers.

Return type:

DataFrame

user_stats()

Get user statistics from the default interaction class.

Returns:

A data frame indexed by user ID with the interaction statistics. See Interaction Statistics for a description of the columns returned.

The index is the vocabulary, so iloc works with user numbers.

Return type:

DataFrame