lenskit.data.Dataset

class lenskit.data.Dataset

Bases: ABC

Representation of a data set for LensKit training, evaluation, etc. Data can be accessed in a variety of formats depending on the needs of a component.

Note

Zero-copy conversions are used whenever possible, so client code must not modify returned data in-place.
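The hazard can be illustrated without LensKit. The sketch below uses only the Python standard library: array and memoryview stand in for the dataset's underlying storage and for a zero-copy view handed to client code. This is an assumption made for illustration; the objects LensKit actually returns are pandas, NumPy, PyTorch, or SciPy structures. Writing through the view changes the shared storage, while copying first does not.

```python
from array import array

# Stand-in for storage owned by a dataset (hypothetical values).
storage = array("d", [4.0, 3.5, 5.0])

# A zero-copy view: no data is duplicated, so writes reach `storage`.
view = memoryview(storage)
view[0] = 1.0
corrupted = storage[0]  # the shared storage now holds the modified value

# Safe pattern: copy first, then modify the copy.
storage[0] = 4.0  # restore the original value for the demonstration
safe = view.tolist()  # tolist() builds an independent Python list
safe[0] = 1.0
untouched = storage[0]
```

The same pattern applies to the real return types: take an explicit copy before any in-place change.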

Todo

Advanced rating situations are not yet supported:

  • repeated ratings

  • mixed implicit & explicit feedback

  • later actions removing earlier ratings

Todo

Support for item and user content or metadata is not yet implemented.

__init__()

Methods

__init__()

count(what)

Count entities in the dataset.

interaction_log()

Get the user-item interactions as a table in the requested format.

interaction_matrix()

Get the user-item interactions as a “ratings” matrix.

item_stats()

Get item statistics.

user_row()

Get a user's row from the interaction matrix.

user_stats()

Get user statistics.

Attributes

interaction_count

Count the total number of interaction records.

item_count

items

The items known by this dataset.

rating_count

Count the total number of ratings (excluding superseded ratings).

user_count

users

The users known by this dataset.

abstract property items: Vocabulary

The items known by this dataset.

abstract property users: Vocabulary

The users known by this dataset.

abstract count(what)

Count entities in the dataset.

Note

The precise counts are subtle in the presence of repeated or superseded interactions. See the interaction_count and rating_count properties for details on the "interactions" and "ratings" counts.

Parameters:

what (str) –

The type of entity to count. Commonly-supported ones include:

  • users

  • items

  • pairs (observed user-item pairs)

  • interactions

  • ratings

Return type:

int

property interaction_count: int

Count the total number of interaction records. Equivalent to count("interactions").

Note

If the interaction records themselves represent counts, such as the number of times a song was played, this returns the number of records, not the total number of plays.
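The difference between the count types can be seen on a toy log. The following is a plain-Python sketch, not LensKit code: the log layout and its values are hypothetical, and the dictionary keys simply mirror the names accepted by count().

```python
# Hypothetical play log: (user, item, play_count) records. The same
# user-item pair may appear in more than one record.
log = [
    ("u1", "song_a", 3),
    ("u1", "song_a", 2),
    ("u1", "song_b", 1),
    ("u2", "song_a", 4),
]

counts = {
    "users": len({u for u, _, _ in log}),
    "items": len({i for _, i, _ in log}),
    "pairs": len({(u, i) for u, i, _ in log}),  # distinct user-item pairs
    "interactions": len(log),  # number of records, not plays
}

# The total number of plays is a different quantity from the record count.
total_plays = sum(n for _, _, n in log)
```

Here the log has four interaction records covering three distinct pairs, while the plays sum to ten.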

property rating_count: int

Count the total number of ratings (excluding superseded ratings). Equivalent to count("ratings").

abstract interaction_log(format: Literal['pandas'], *, fields: str | list[str] | None = 'all', original_ids: bool = False) -> DataFrame
abstract interaction_log(format: Literal['numpy'], *, fields: str | list[str] | None = 'all') -> NumpyUserItemTable
abstract interaction_log(format: Literal['torch'], *, fields: str | list[str] | None = 'all') -> TorchUserItemTable

Get the user-item interactions as a table in the requested format. The table is not in a specified order. Interactions may be repeated (e.g. the same user may listen to a song multiple times). For a non-repeated “ratings matrix” view of the data, see interaction_matrix().

Warning

Client code must not perform in-place modifications on the table returned from this method. Whenever possible, it will be a shallow view on top of the underlying storage, and modifications may corrupt data for other code.

Parameters:
  • format

    The desired data format. Currently-supported formats are:

    • "pandas" — returns a pandas.DataFrame. The index is not meaningful.

    • "numpy" — returns a NumpyUserItemTable.

    • "torch" — returns a TorchUserItemTable.

  • fields – Which fields to include. If set to "all", will include all available fields in the resulting table; None includes no fields besides the user and item. Commonly-available fields include "rating" and "timestamp". Missing fields will be omitted in the result.

  • original_ids – If True, return user and item IDs as represented in the original source data in columns named user_id and item_id, instead of the user and item numbers typically returned. Only applicable to the pandas format. See Identifiers.

Returns:

The user-item interaction log in the specified format.
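The distinction between original IDs and user and item numbers can be sketched in plain Python. Everything here is hypothetical: the records, the variable names, and the choice of sorted order for assigning numbers are assumptions for illustration and do not describe how LensKit's Vocabulary is implemented.

```python
# Hypothetical source records using the original identifiers.
records = [("alice", "B00X", 4.0), ("bob", "A17", 3.0), ("alice", "A17", 5.0)]

# A toy vocabulary: each distinct ID gets a number in 0..n-1.
# (Sorted order is an assumption made for this sketch.)
user_ids = sorted({u for u, _, _ in records})
item_ids = sorted({i for _, i, _ in records})
user_nums = {uid: n for n, uid in enumerate(user_ids)}
item_nums = {iid: n for n, iid in enumerate(item_ids)}

# Number-based rows (the default style) and ID-based rows
# (the style requested with original_ids=True).
by_number = [(user_nums[u], item_nums[i], r) for u, i, r in records]
by_id = [(user_ids[un], item_ids[inum], r) for un, inum, r in by_number]
```

Numbers are compact array indices suited to matrix code; the original IDs are what appear in the source data.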

abstract interaction_matrix(format: Literal['pandas'], *, layout: Literal['coo'] | None = None, field: str | None = None, combine: Literal['count', 'sum', 'mean', 'first', 'last'] | None = None, original_ids: bool = False) -> DataFrame
abstract interaction_matrix(format: Literal['torch'], *, layout: Literal['csr', 'coo'] | None = None, field: str | None = None, combine: Literal['count', 'sum', 'mean', 'first', 'last'] | None = None) -> Tensor
abstract interaction_matrix(format: Literal['scipy'], *, layout: Literal['coo'], legacy: Literal[True], field: str | None = None, combine: Literal['count', 'sum', 'mean', 'first', 'last'] | None = None) -> coo_matrix
abstract interaction_matrix(format: Literal['scipy'], *, layout: Literal['coo'], legacy: bool = False, field: str | None = None, combine: Literal['count', 'sum', 'mean', 'first', 'last'] | None = None) -> coo_array
abstract interaction_matrix(format: Literal['scipy'], *, layout: Literal['csr'] | None = None, legacy: Literal[True], field: str | None = None, combine: Literal['count', 'sum', 'mean', 'first', 'last'] | None = None) -> csr_matrix
abstract interaction_matrix(format: Literal['scipy'], *, layout: Literal['csr'] | None = None, legacy: bool = False, field: str | None = None, combine: Literal['count', 'sum', 'mean', 'first', 'last'] | None = None) -> csr_array
abstract interaction_matrix(format: Literal['structure'], *, layout: Literal['csr'] | None = None) -> CSRStructure

Get the user-item interactions as a “ratings” matrix. Interactions are not repeated. The matrix may be in “coordinate” format, in which case it is comparable to interaction_log() but without repeated interactions, or it may be in a compressed sparse format.
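To make the two layouts concrete, the sketch below converts a small coordinate (COO) listing into compressed sparse row (CSR) form using plain Python lists. The data and the names rowptrs and colinds are illustrative assumptions; real code would obtain these structures from the matrix object returned in the requested format.

```python
# Observed (user_num, item_num) pairs in coordinate (COO) form,
# already sorted by user number.
n_users = 3
coo_rows = [0, 0, 1, 2, 2, 2]
coo_cols = [1, 3, 0, 0, 2, 3]

# CSR keeps the column indices and replaces the row array with row
# pointers: row u's entries are colinds[rowptrs[u]:rowptrs[u + 1]].
rowptrs = [0] * (n_users + 1)
for r in coo_rows:
    rowptrs[r + 1] += 1  # count the entries in each row
for u in range(n_users):
    rowptrs[u + 1] += rowptrs[u]  # running total gives row offsets
colinds = list(coo_cols)

# All items for user 2, found with two pointer lookups.
user_2_items = colinds[rowptrs[2]:rowptrs[3]]
```

COO is convenient for table-like processing, while CSR makes per-user row access a constant-time slice.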

Todo

The combine parameter is currently ignored because repeated interactions are not yet supported.

Warning

Client code must not perform in-place modifications on the matrix returned from this method. Whenever possible, it will be a shallow view on top of the underlying storage, and modifications may corrupt data for other code.

Parameters:
  • format

    The desired data format. Currently-supported formats are:

    • "pandas" — returns a pandas.DataFrame.

    • "torch" — returns a sparse torch.Tensor (see torch.sparse).

    • "scipy" — returns a sparse array from scipy.sparse.

    • "structure" — returns a CSRStructure containing only the user and item numbers in compressed sparse row format.

  • field

    Which field to return in the matrix. Common fields include "rating" and "timestamp".

    If unspecified (None), this will yield an implicit-feedback indicator matrix, with 1s for observed items; the "pandas" format will only include user and item columns.

    If the rating field is requested but is not defined in the underlying data, then this is equivalent to requesting the indicator matrix (field=None), except that the "pandas" format will include a "rating" column of all 1s.

    The "pandas" format also supports the special field name "all" to return a data frame with all available fields. When field="all", a field named count (if defined) is combined with the sum method, and other fields use last.

  • combine

    How to combine multiple observations for a single user-item pair. Available methods are:

    • "count" — count the user-item interactions. Only valid when field=None; if the underlying data defines a count field, then this is equivalent to "sum" on that field.

    • "sum" — sum the field values.

    • "mean" — take the mean of the field values.

    • "first", "last" — take the first or last value seen (in timestamp order, if timestamps are defined; otherwise, their order in the original input).

  • layout – The layout for a sparse matrix. Can be either csr or coo, or None to use the default for the specified format. CSR is only supported by Torch and SciPy backends.

  • legacy – True to return a legacy SciPy sparse matrix instead of a sparse array.

  • original_ids – True to return user and item IDs instead of numbers in a pandas-format matrix.
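The documented semantics of the combine methods can be sketched in plain Python. This does not call LensKit (where, per the Todo above, the parameter is currently ignored); the log and the combiners mapping are hypothetical and only illustrate what each method would produce for repeated observations of one user-item pair.

```python
from collections import defaultdict

# Hypothetical repeated observations: (user_num, item_num, value, timestamp).
log = [
    (0, 1, 2.0, 30),
    (0, 1, 4.0, 10),
    (0, 1, 3.0, 20),
    (1, 0, 5.0, 15),
]

# Group values per user-item pair in timestamp order, so that
# "first" and "last" refer to the earliest and latest observation.
groups = defaultdict(list)
for user, item, value, _ts in sorted(log, key=lambda rec: rec[3]):
    groups[(user, item)].append(value)

combiners = {
    "count": len,
    "sum": sum,
    "mean": lambda vals: sum(vals) / len(vals),
    "first": lambda vals: vals[0],
    "last": lambda vals: vals[-1],
}
combined = {
    name: {pair: fn(vals) for pair, vals in groups.items()}
    for name, fn in combiners.items()
}
```

For the pair (0, 1), the three observations collapse to a single matrix entry whose value depends on the method chosen.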

abstract user_row(user_id: int | str | bytes | integer[Any] | str_ | bytes_ | object_) -> ItemList | None
abstract user_row(*, user_num: int) -> ItemList

Get a user’s row from the interaction matrix. Available fields are returned as fields. If the dataset has ratings, these are provided as a rating field, not as the item scores. The item list is unordered, but items are returned in order by item number.

Parameters:
  • user_id – The ID of the user to retrieve.

  • user_num – The number of the user to retrieve.

Returns:

The user’s interaction matrix row, or None if no user with that ID exists.
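The lookup behaviour described above (items in item-number order, None for an unknown ID) can be mimicked with a toy function. The names toy_user_row, user_ids, and rows are hypothetical, and the tuple it returns is only a stand-in for an ItemList.

```python
# Hypothetical vocabulary and interaction rows keyed by user number.
user_ids = ["alice", "bob"]
rows = {0: {3: 4.0, 1: 5.0}, 1: {}}  # user_num -> {item_num: rating}


def toy_user_row(user_id):
    """Return (item_nums, ratings) for a user, or None if the ID is unknown."""
    if user_id not in user_ids:
        return None
    row = rows[user_ids.index(user_id)]
    item_nums = sorted(row)  # items come back in item-number order
    return item_nums, [row[i] for i in item_nums]


alice_row = toy_user_row("alice")
bob_row = toy_user_row("bob")      # known user with no interactions
missing = toy_user_row("carol")    # unknown user ID
```

Note the difference between a known user with an empty row and an unknown user, which yields None.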

item_stats()

Get item statistics.

Returns:

A data frame indexed by item ID with the following columns:

  • count — the number of interactions recorded for this item.

  • user_count — the number of distinct users who have interacted with or rated this item.

  • rating_count — the number of ratings for this item. Only provided if the dataset has explicit ratings; if there are repeated ratings, this does not count superseded ratings.

  • mean_rating — the mean of the ratings. Only provided if the dataset has explicit ratings.

  • first_time — the first time the item appears. Only provided if the dataset has timestamps.

The index is the vocabulary, so iloc works with item numbers.
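As a rough illustration of what these columns contain, the sketch below computes them for a toy log in plain Python. The log is hypothetical and contains no repeated ratings, so rating_count equals count here; the result is a dictionary rather than the data frame that item_stats() returns.

```python
from collections import defaultdict

# Hypothetical log: (user, item, rating, timestamp), one rating per pair.
log = [
    ("u1", "i1", 4.0, 100),
    ("u2", "i1", 2.0, 50),
    ("u1", "i2", 5.0, 70),
]

by_item = defaultdict(list)
for user, item, rating, ts in log:
    by_item[item].append((user, rating, ts))

item_stats = {
    item: {
        "count": len(recs),
        "user_count": len({u for u, _, _ in recs}),
        "rating_count": len(recs),  # no superseded ratings in this toy log
        "mean_rating": sum(r for _, r, _ in recs) / len(recs),
        "first_time": min(t for _, _, t in recs),
    }
    for item, recs in by_item.items()
}
```

The user statistics follow the same pattern with the roles of user and item swapped, plus a last_time column.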

user_stats()

Get user statistics.

Returns:

A data frame indexed by user ID with the following columns:

  • count — the number of interactions recorded for this user.

  • item_count — the number of distinct items with which this user has interacted.

  • rating_count — the number of ratings for this user. Only provided if the dataset has explicit ratings; if there are repeated ratings, this does not count superseded ratings.

  • mean_rating — the mean of the user’s ratings. Only provided if the dataset has explicit ratings.

  • first_time — the first time the user appears. Only provided if the dataset has timestamps.

  • last_time — the last time the user appears. Only provided if the dataset has timestamps.

The index is the vocabulary, so iloc works with user numbers.