lenskit.data.Dataset
- class lenskit.data.Dataset
Bases: ABC
Representation of a data set for LensKit training, evaluation, etc. Data can be accessed in a variety of formats depending on the needs of a component.
Note
Zero-copy conversions are used whenever possible, so client code must not modify returned data in-place.
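The risk behind this note can be illustrated without LensKit. The following is a minimal sketch using only the Python standard library; `storage` and `view` are hypothetical stand-ins for a dataset's internal storage and a zero-copy view of it, and are not part of the LensKit API. It shows that a write through a view changes the shared data, while copying first does not.

```python
from array import array

# Hypothetical stand-in for a dataset's internal rating storage.
storage = array("d", [4.0, 3.5, 5.0])

# A zero-copy view: it shares memory with `storage`.
view = memoryview(storage)

# Safe pattern: take an independent copy before modifying anything.
private = view.tolist()
private[0] = 1.0  # only the copy changes

# Unsafe pattern: writing through the view alters the shared storage,
# which every other consumer of the data would then observe.
view[1] = 0.0
```

After this runs, `storage[0]` is still 4.0 but `storage[1]` has been overwritten, which is the kind of corruption the note warns about.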
Todo
Advanced rating situations are not yet supported:
repeated ratings
mixed implicit & explicit feedback
later actions removing earlier ratings
Todo
Support for item and user content or metadata is not yet implemented.
- __init__()
Methods
__init__()
count(what): Count entities in the dataset.
interaction_log(...): Get the user-item interactions as a table in the requested format.
interaction_matrix(...): Get the user-item interactions as a “ratings” matrix.
item_stats(): Get item statistics.
user_row(): Get a user's row from the interaction matrix.
user_stats(): Get user statistics.
Attributes
interaction_count: Count the total number of interaction records.
item_count
items: The items known by this dataset.
rating_count: Count the total number of ratings (excluding superseded ratings).
user_count
users: The users known by this dataset.
- abstract property items: Vocabulary
The items known by this dataset.
- abstract property users: Vocabulary
The users known by this dataset.
- abstract count(what)
Count entities in the dataset.
Note
The precise counts are subtle in the presence of repeated or superseded interactions. See interaction_count and rating_count for details on the "interactions" and "ratings" counts.
- property interaction_count: int
Count the total number of interaction records. Equivalent to count("interactions").
Note
If the interaction records themselves represent counts, such as the number of times a song was played, this returns the number of records, not the total number of plays.
- property rating_count: int
Count the total number of ratings (excluding superseded ratings). Equivalent to count("ratings").
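The distinction between the two counts can be made concrete with a small sketch. It uses plain Python rather than the LensKit API, the sample records are invented for illustration, and it assumes that "superseded" means an earlier rating replaced by a later one for the same user-item pair.

```python
# Hypothetical interaction records: (user, item, rating, timestamp).
records = [
    ("u1", "i1", 3.0, 10),
    ("u1", "i1", 4.5, 20),  # later rating supersedes the record above
    ("u1", "i2", 5.0, 15),
    ("u2", "i1", 2.0, 12),
]

# "interactions": every record counts, including repeats.
interaction_count = len(records)

# "ratings": only the latest rating per (user, item) pair counts.
latest = {}
for user, item, rating, ts in sorted(records, key=lambda r: r[3]):
    latest[(user, item)] = rating
rating_count = len(latest)

# Records that themselves carry counts: (user, song, plays).
play_records = [("u1", "s1", 3), ("u2", "s1", 5)]
record_count = len(play_records)                 # number of records
total_plays = sum(p for _, _, p in play_records)  # not what is counted
```

Here the interaction count is 4 and the rating count is 3, while the play data has 2 records covering 8 plays.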
- abstract interaction_log(format: Literal['pandas'], *, fields: str | list[str] | None = 'all', original_ids: bool = False) -> DataFrame
- abstract interaction_log(format: Literal['numpy'], *, fields: str | list[str] | None = 'all') -> NumpyUserItemTable
- abstract interaction_log(format: Literal['torch'], *, fields: str | list[str] | None = 'all') -> TorchUserItemTable
Get the user-item interactions as a table in the requested format. The row order of the table is unspecified. Interactions may be repeated (e.g. the same user may listen to a song multiple times). For a non-repeated “ratings matrix” view of the data, see interaction_matrix().
Warning
Client code must not perform in-place modifications on the table returned from this method. Whenever possible, it will be a shallow view on top of the underlying storage, and modifications may corrupt data for other code.
- Parameters:
format –
The desired data format. Currently-supported formats are:
"pandas": returns a pandas.DataFrame. The index is not meaningful.
"numpy": returns a NumpyUserItemTable.
"torch": returns a TorchUserItemTable.
fields – Which fields to include. If set to "all", will include all available fields in the resulting table; None includes no fields besides the user and item. Commonly-available fields include "rating" and "timestamp". Missing fields will be omitted in the result.
original_ids – If True, return user and item IDs as represented in the original source data in columns named user_id and item_id, instead of the user and item numbers typically returned. Only applicable to the pandas format. See Identifiers.
- Returns:
The user-item interaction log in the specified format.
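The relationship between original IDs and user/item numbers, and the effect of the fields argument, can be sketched in plain Python. This is an illustration, not the library's implementation: the helpers `build_vocab` and `select_fields` are hypothetical, the column names `user_num` and `item_num` are assumed for the numbered form, and assigning numbers in sorted ID order is also an assumption.

```python
# Hypothetical raw log with original identifiers: (user, item, rating).
raw_log = [("alice", "B00X", 4.0), ("bob", "A11Y", 3.0), ("alice", "A11Y", 5.0)]

def build_vocab(ids):
    """Map each distinct ID to a 0-based number (sorted order is assumed)."""
    return {ident: num for num, ident in enumerate(sorted(set(ids)))}

user_vocab = build_vocab(u for u, _, _ in raw_log)
item_vocab = build_vocab(i for _, i, _ in raw_log)

# Rows keyed by numbers (analogous to original_ids=False).
numbered = [
    {"user_num": user_vocab[u], "item_num": item_vocab[i], "rating": r}
    for u, i, r in raw_log
]

# Rows keyed by original IDs (analogous to original_ids=True).
with_ids = [{"user_id": u, "item_id": i, "rating": r} for u, i, r in raw_log]

def select_fields(row, fields, key_cols):
    """Keep the key columns plus the requested fields; absent fields are skipped."""
    if fields == "all":
        return dict(row)
    if fields is None:
        wanted = set()
    elif isinstance(fields, str):
        wanted = {fields}
    else:
        wanted = set(fields)
    return {k: v for k, v in row.items() if k in key_cols or k in wanted}
```

For example, `select_fields(numbered[0], None, ("user_num", "item_num"))` keeps only the two key columns, and requesting `["rating", "timestamp"]` on this data returns the rating and silently omits the missing timestamp.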
- abstract interaction_matrix(format: Literal['pandas'], *, layout: Literal['coo'] | None = None, field: str | None = None, combine: Literal['count', 'sum', 'mean', 'first', 'last'] | None = None, original_ids: bool = False) -> DataFrame
- abstract interaction_matrix(format: Literal['torch'], *, layout: Literal['csr', 'coo'] | None = None, field: str | None = None, combine: Literal['count', 'sum', 'mean', 'first', 'last'] | None = None) -> Tensor
- abstract interaction_matrix(format: Literal['scipy'], *, layout: Literal['coo'], legacy: Literal[True], field: str | None = None, combine: Literal['count', 'sum', 'mean', 'first', 'last'] | None = None) -> coo_matrix
- abstract interaction_matrix(format: Literal['scipy'], *, layout: Literal['coo'], legacy: bool = False, field: str | None = None, combine: Literal['count', 'sum', 'mean', 'first', 'last'] | None = None) -> coo_array
- abstract interaction_matrix(format: Literal['scipy'], *, layout: Literal['csr'] | None = None, legacy: Literal[True], field: str | None = None, combine: Literal['count', 'sum', 'mean', 'first', 'last'] | None = None) -> csr_matrix
- abstract interaction_matrix(format: Literal['scipy'], *, layout: Literal['csr'] | None = None, legacy: bool = False, field: str | None = None, combine: Literal['count', 'sum', 'mean', 'first', 'last'] | None = None) -> csr_array
- abstract interaction_matrix(format: Literal['structure'], *, layout: Literal['csr'] | None = None) -> CSRStructure
Get the user-item interactions as a “ratings” matrix. Interactions are not repeated. The matrix may be in “coordinate” format, in which case it is comparable to interaction_log() but without repeated interactions, or it may be in a compressed sparse format.
Todo
Aggregation (the combine parameter) is currently ignored because repeated interactions are not yet supported.
Warning
Client code must not perform in-place modifications on the matrix returned from this method. Whenever possible, it will be a shallow view on top of the underlying storage, and modifications may corrupt data for other code.
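The difference between the coordinate and compressed sparse row layouts can be shown with a short, library-independent sketch. The function `coo_to_csr` and the sample data are hypothetical; real code would rely on SciPy or PyTorch for this conversion.

```python
# Coordinate (COO) layout: one (row, col, value) triple per stored entry.
rows = [0, 0, 1, 2, 2, 2]          # user numbers
cols = [1, 3, 0, 0, 2, 3]          # item numbers
vals = [4.0, 3.0, 5.0, 2.0, 1.0, 4.5]
n_rows = 3

def coo_to_csr(rows, cols, vals, n_rows):
    """Convert COO triples to CSR arrays (row pointers, column indices, values)."""
    # Sort entries by row, then column, so each row is contiguous.
    order = sorted(range(len(rows)), key=lambda k: (rows[k], cols[k]))
    # Count entries per row, then accumulate into row start offsets.
    rowptr = [0] * (n_rows + 1)
    for r in rows:
        rowptr[r + 1] += 1
    for r in range(n_rows):
        rowptr[r + 1] += rowptr[r]
    colind = [cols[k] for k in order]
    values = [vals[k] for k in order]
    return rowptr, colind, values

rowptr, colind, values = coo_to_csr(rows, cols, vals, n_rows)

# The items of user 2 are the slice between consecutive row pointers.
user2_items = colind[rowptr[2]:rowptr[3]]
```

In CSR form the row index is stored implicitly through `rowptr`, which is why looking up one user's entries is a simple slice.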
- Parameters:
format –
The desired data format. Currently-supported formats are:
"pandas": returns a pandas.DataFrame.
"torch": returns a sparse torch.Tensor (see torch.sparse).
"scipy": returns a sparse array from scipy.sparse.
"structure": returns a CSRStructure containing only the user and item numbers in compressed sparse row format.
field –
Which field to return in the matrix. Common fields include "rating" and "timestamp".
If unspecified (None), this will yield an implicit-feedback indicator matrix, with 1s for observed items; the "pandas" format will only include user and item columns.
If the rating field is requested but is not defined in the underlying data, then this is equivalent to the indicator matrix (field=None), except that the "pandas" format will include a "rating" column of all 1s.
The "pandas" format also supports the special field name "all" to return a data frame with all available fields. When field="all", a field named count (if defined) is combined with the sum method, and other fields use last.
combine –
How to combine multiple observations for a single user-item pair. Available methods are:
"count": count the user-item interactions. Only valid when field=None; if the underlying data defines a count field, then this is equivalent to "sum" on that field.
"sum": sum the field values.
"first", "last": take the first or last value seen (in timestamp order, if timestamps are defined; otherwise, their order in the original input).
layout – The layout for a sparse matrix. Can be either csr or coo, or None to use the default for the specified format. CSR is only supported by Torch and SciPy backends.
legacy – True to return a legacy SciPy sparse matrix instead of a sparse array.
original_ids – True to return user and item IDs instead of numbers in a pandas-format matrix.
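The combine methods can be illustrated with a small sketch of the described semantics. It is plain Python rather than the library's code, the sample observations are invented, and it shows the documented intent rather than current behavior, since the Todo above notes that aggregation is not yet applied. The "mean" method appears in the signature but is not described in the list, so treating it as an arithmetic mean is an assumption. For simplicity the sketch applies "count" to the same value lists, whereas the documentation restricts it to field=None.

```python
from collections import defaultdict

# Hypothetical observations: (user_num, item_num, value, timestamp).
observations = [
    (0, 1, 2.0, 30),
    (0, 1, 4.0, 10),
    (0, 2, 5.0, 20),
    (0, 1, 3.0, 20),
]

def combine(observations, method):
    """Reduce repeated (user, item) observations to one value per pair."""
    groups = defaultdict(list)
    # Group values per pair in timestamp order so "first"/"last" are well defined.
    for user, item, value, ts in sorted(observations, key=lambda o: o[3]):
        groups[(user, item)].append(value)
    reducers = {
        "count": len,
        "sum": sum,
        "mean": lambda vs: sum(vs) / len(vs),  # assumed arithmetic mean
        "first": lambda vs: vs[0],
        "last": lambda vs: vs[-1],
    }
    reduce_fn = reducers[method]
    return {pair: reduce_fn(vs) for pair, vs in groups.items()}
```

With this data, the pair (0, 1) has three observations, so "first" yields the value with the earliest timestamp (4.0) and "last" the value with the latest (2.0).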
- abstract user_row(user_id: int | str | bytes | integer[Any] | str_ | bytes_ | object_) -> ItemList | None
- abstract user_row(*, user_num: int) -> ItemList
Get a user’s row from the interaction matrix. Available interaction fields are returned as fields of the item list. If the dataset has ratings, these are provided as a rating field, not as the item scores. The item list is unordered, but items are returned in order by item number.
- Parameters:
user_id – The ID of the user to retrieve.
user_num – The number of the user to retrieve.
- Returns:
The user’s interaction matrix row, or None if no user with that ID exists.
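The lookup behavior described above (by ID or by number, with None for an unknown ID) can be sketched over a compressed sparse row layout. Everything here is hypothetical: the storage arrays, the sample data, and the dictionary returned in place of an ItemList are stand-ins, not the LensKit implementation.

```python
# Hypothetical storage: position in `user_ids` is the user number.
user_ids = ["alice", "bob", "carol"]
rowptr = [0, 2, 2, 3]        # row start offsets, one extra entry at the end
item_nums = [0, 3, 1]        # item numbers, sorted within each row
ratings = [4.0, 2.5, 5.0]

def user_row(user_id=None, *, user_num=None):
    """Return one user's row as a dict of fields, or None for an unknown ID."""
    if user_num is None:
        if user_id not in user_ids:
            return None
        user_num = user_ids.index(user_id)
    start, end = rowptr[user_num], rowptr[user_num + 1]
    # Ratings are exposed as a "rating" field alongside the item numbers.
    return {"item_num": item_nums[start:end], "rating": ratings[start:end]}
```

A known user with no interactions (here user number 1) yields an empty row, while an unknown ID such as "dave" yields None.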
- item_stats()
Get item statistics.
- Returns:
A data frame indexed by item ID with the following columns:
count: the number of interactions recorded for this item.
user_count: the number of distinct users who have interacted with or rated this item.
rating_count: the number of ratings for this item. Only provided if the dataset has explicit ratings; if there are repeated ratings, this does not count superseded ratings.
mean_rating: the mean of the ratings. Only provided if the dataset has explicit ratings.
first_time: the first time the item appears. Only provided if the dataset has timestamps.
The index is the vocabulary, so iloc works with item numbers.
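The columns listed above can be computed by hand on a toy log to make their definitions concrete. This sketch is plain Python with invented data; the helper name mirrors the method only for readability. It assumes that a later rating by the same user supersedes an earlier one, and that mean_rating is taken over the non-superseded ratings, which the text does not state explicitly.

```python
# Hypothetical log: (user, item, rating, timestamp).
log = [
    ("u1", "i1", 3.0, 100),
    ("u2", "i1", 2.0, 90),
    ("u1", "i2", 5.0, 110),
    ("u1", "i1", 4.0, 120),  # supersedes u1's earlier rating of i1
]

def item_stats(log):
    """Compute per-item statistics from a list of interaction records."""
    stats = {}
    latest = {}  # (user, item) -> most recent rating
    # Timestamp order makes the first record seen per item its first_time.
    for user, item, rating, ts in sorted(log, key=lambda r: r[3]):
        s = stats.setdefault(item, {"count": 0, "users": set(), "first_time": ts})
        s["count"] += 1
        s["users"].add(user)
        latest[(user, item)] = rating
    result = {}
    for item, s in stats.items():
        current = [r for (_, i), r in latest.items() if i == item]
        result[item] = {
            "count": s["count"],
            "user_count": len(s["users"]),
            "rating_count": len(current),
            "mean_rating": sum(current) / len(current),
            "first_time": s["first_time"],
        }
    return result

stats = item_stats(log)
```

For item "i1" this gives three interactions from two distinct users, two current ratings, and a first appearance at time 90.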
- user_stats()
Get user statistics.
- Returns:
A data frame indexed by user ID with the following columns:
count: the number of interactions recorded for this user.
item_count: the number of distinct items with which this user has interacted.
rating_count: the number of ratings for this user. Only provided if the dataset has explicit ratings; if there are repeated ratings, this does not count superseded ratings.
mean_rating: the mean of the user’s ratings. Only provided if the dataset has explicit ratings.
first_time: the first time the user appears. Only provided if the dataset has timestamps.
last_time: the last time the user appears. Only provided if the dataset has timestamps.
The index is the vocabulary, so iloc works with user numbers.