lenskit.data.Dataset#
- class lenskit.data.Dataset(data)#
Bases: object
Representation of a data set for LensKit training, evaluation, etc. Data can be accessed in a variety of formats depending on the needs of a component. See Data Model for details of the LensKit data model.
Dataset objects should not be directly constructed; instead, use a DatasetBuilder, load(), or from_interactions_df().
Note
Zero-copy conversions are used whenever possible, so client code must not modify returned data in-place.
- Parameters:
data (DataContainer | Callable[[], DataContainer | Dataset]) – The container for this dataset’s data, or a function that will return such a container to create a lazy-loaded dataset.
Stability: Caller
This API is at the caller stability level: breaking changes for code calling this function or class will be reserved for annual major version bumps, but minor versions may introduce changes that break subclasses or reimplementations. See Stability Levels for details.
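The zero-copy note above has a practical consequence: code that needs to change returned data should copy it first. The helper below is a minimal sketch of that pattern. `editable_arrays` is a hypothetical name introduced here, and it relies only on the `interaction_table(format="numpy")` call documented on this page, which returns a dictionary mapping names to arrays.

```python
def editable_arrays(ds, fields=None):
    """Return private copies of the interaction arrays of ``ds``.

    ``ds`` is assumed to be a Dataset, or any object exposing a compatible
    ``interaction_table`` method (an assumption made for this sketch).
    """
    # The returned arrays may be shallow views over the dataset's own
    # storage, so they are treated as read-only here.
    arrays = ds.interaction_table(format="numpy", fields=fields)
    # Copy each array so that later in-place edits cannot reach the dataset.
    return {name: values.copy() for name, values in arrays.items()}
```

Copying costs memory proportional to the size of the interaction log, so it is only worth doing when in-place modification is actually needed.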
Methods
__init__(data)
default_interaction_class()
entities(name) – Get the entities of a particular type / class.
interaction_matrix(*, format[, layout, ...]) – Get the user-item interactions as a “ratings” matrix from the default interaction class.
interaction_table(*, format[, fields, ...]) – Get the user-item interactions as a table in the requested format.
interactions([name]) – Get the interaction records of a particular class.
item_stats() – Get item statistics from the default interaction class.
load(path) – Load a dataset in the LensKit native format.
relationships(name) – Get the relationship records of a particular type / class.
save(path) – Save the data set in the LensKit native format.
user_row([user_id, user_num]) – Get a user's row from the interaction matrix for the default interaction class, using default coalescing for repeated interactions.
user_stats() – Get user statistics from the default interaction class.
Attributes
interaction_count – Count the total number of interactions of the default class, taking into account any count attribute.
item_count
items – The items known by this dataset.
schema – Get the schema of this dataset.
user_count
users – The users known by this dataset.
- classmethod load(path)#
Load a dataset in the LensKit native format.
- save(path)#
Save the data set in the LensKit native format.
- property schema: DataSchema#
Get the schema of this dataset.
- property items: Vocabulary#
The items known by this dataset.
- property users: Vocabulary#
The users known by this dataset.
- entities(name)#
Get the entities of a particular type / class.
- relationships(name)#
Get the relationship records of a particular type / class.
- Parameters:
name (str)
- Return type:
- interactions(name=None)#
Get the interaction records of a particular class. If no class is specified, returns the default interaction class.
- Parameters:
name (str | None)
- Return type:
- property interaction_count: int#
Count the total number of interactions of the default class, taking into account any count attribute.
- abstract interaction_table(*, format, fields=None, original_ids=False)#
Get the user-item interactions as a table in the requested format. The table is not in a specified order. Interactions may be repeated (e.g. the same user may listen to a song multiple times). For a non-repeated “ratings matrix” view of the data, see interaction_matrix().
This is a convenience wrapper on top of interactions() and the methods of RelationshipSet.
Warning
Client code must not perform in-place modifications on the table returned from this method. Whenever possible, it will be a shallow view on top of the underlying storage, and modifications may corrupt data for other code.
- Parameters:
format (str) –
The desired data format. Currently-supported formats are:
"pandas" – returns a pandas.DataFrame. The index is not meaningful.
"arrow" – returns a PyArrow Table. The index is not meaningful.
"numpy" – returns a dictionary mapping names to arrays.
fields (str | list[str] | None) – Which fields (attributes) to include, or None to include all fields. Commonly-available fields include "rating" and "timestamp".
original_ids (bool) – If True, return user and item IDs as represented in the original source data in columns named user_id and item_id, instead of the user and item numbers typically returned.
- Returns:
The user-item interaction log in the specified format.
- Return type:
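The distinction between IDs and numbers that `original_ids` controls can be illustrated without LensKit. The sketch below assumes that numbers are contiguous zero-based positions assigned to the distinct IDs. `TinyVocabulary` is a hypothetical stand-in written for this example, not the library's `Vocabulary` class, and assigning numbers in sorted ID order is an arbitrary choice made here.

```python
class TinyVocabulary:
    """Hypothetical stand-in: maps original IDs to zero-based numbers."""

    def __init__(self, ids):
        # Sort the distinct IDs so the numbering is deterministic.
        self._ids = sorted(set(ids))
        self._numbers = {id_: num for num, id_ in enumerate(self._ids)}

    def number(self, id_):
        """Look up the number assigned to an original ID."""
        return self._numbers[id_]

    def id(self, num):
        """Look up the original ID for a number."""
        return self._ids[num]

    def __len__(self):
        return len(self._ids)


# An interaction log keyed by original IDs, as with original_ids=True.
log = [("alice", "B00X"), ("bob", "A17Q"), ("alice", "A17Q")]
users = TinyVocabulary(u for u, _ in log)
items = TinyVocabulary(i for _, i in log)

# The same log expressed as (user number, item number) pairs.
numbered = [(users.number(u), items.number(i)) for u, i in log]
```

Numbers are compact and can index directly into arrays and matrices, which is why they are the default; IDs are what the original source data uses.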
- abstract interaction_matrix(*, format, layout='csr', field=None, original_ids=False, legacy=False)#
Get the user-item interactions as a “ratings” matrix from the default interaction class. Interactions are not repeated, and are coalesced with the default coalescing strategy for each attribute.
The matrix may be returned in “coordinate” format, in which case it is comparable to interaction_table() but without repeated interactions, or it may be in a compressed sparse row format.
This is a convenience wrapper on top of interactions() and the methods of MatrixRelationshipSet.
Warning
Client code must not perform in-place modifications on the matrix returned from this method. Whenever possible, it will be a shallow view on top of the underlying storage, and modifications may corrupt data for other code.
- Parameters:
format (str) –
The desired data format. Currently-supported formats are:
"pandas" – returns a pandas.DataFrame.
"torch" – returns a sparse torch.Tensor (see torch.sparse).
"scipy" – returns a sparse array from scipy.sparse.
"structure" – returns a CSRStructure containing only the user and item numbers in compressed sparse row format.
field (str | None) –
Which field to return in the matrix. Common fields include "rating" and "timestamp".
If unspecified (None), this will yield an implicit-feedback indicator matrix, with 1s for observed items, except for the "pandas" format, which will return all attributes. Specify an empty list to return a Pandas data frame with only the user and item attributes.
layout (Literal['csr', 'coo']) – The layout for a sparse matrix. Can be either csr or coo, or None to use the default for the specified format. Ignored for the Pandas format.
original_ids (bool) – True to return user and item IDs instead of numbers in a pandas-format matrix.
legacy (bool)
- Return type:
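To make the “coordinate” and compressed sparse row layouts concrete, here is a small pure-Python sketch. It collapses a repeated interaction log into unique user-item pairs and then derives CSR row pointers and column indices from them. The function names are hypothetical, and the coalescing rule used here (keep each pair once) is an assumption for illustration; this is not LensKit's implementation or its default coalescing strategy.

```python
def coalesce_pairs(log):
    """Drop repeated (user_num, item_num) pairs and sort by row, then column."""
    return sorted(set(log))


def coo_to_csr(pairs, n_rows):
    """Convert sorted, unique coordinate pairs into CSR row pointers and column indices."""
    rowptrs = [0] * (n_rows + 1)
    # Count the entries in each row, stored one position to the right.
    for row, _ in pairs:
        rowptrs[row + 1] += 1
    # A running sum turns the per-row counts into row start offsets.
    for row in range(n_rows):
        rowptrs[row + 1] += rowptrs[row]
    colinds = [col for _, col in pairs]
    return rowptrs, colinds


# User 0 interacts with item 2 twice; user 1 has no interactions.
log = [(0, 2), (2, 1), (0, 0), (0, 2), (2, 3)]
pairs = coalesce_pairs(log)
rowptrs, colinds = coo_to_csr(pairs, n_rows=3)
```

In the CSR layout, the items for user `u` are `colinds[rowptrs[u]:rowptrs[u + 1]]`, so a user with no interactions simply yields an empty slice. The coordinate layout keeps the explicit list of pairs instead.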
- abstract user_row(user_id=None, *, user_num=None)#
Get a user’s row from the interaction matrix for the default interaction class, using default coalescing for repeated interactions. Available fields are returned as fields. If the dataset has ratings, these are provided as a rating field, not as the item scores. The item list is unordered, but items are returned in order by item number.
- item_stats()#
Get item statistics from the default interaction class.
- Returns:
A data frame indexed by item ID with the interaction statistics. See Interaction Statistics for a description of the columns returned.
The index is the vocabulary, so iloc works with item numbers.
- Return type:
- user_stats()#
Get user statistics from the default interaction class.
- Returns:
A data frame indexed by user ID with the interaction statistics. See Interaction Statistics for a description of the columns returned.
The index is the vocabulary, so iloc works with user numbers.
- Return type:
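As a sketch of what it means for the index of a statistics frame to be the vocabulary, the pandas example below builds a small stand-in frame by hand. The `count` column and the item IDs are hypothetical; the real columns are described under Interaction Statistics. Label-based lookup with `loc` takes an ID, while position-based lookup with `iloc` takes a number.

```python
import pandas as pd

# Hypothetical statistics frame: the index holds item IDs in vocabulary order,
# so row position 0 corresponds to item number 0, and so on.
stats = pd.DataFrame(
    {"count": [12, 3, 7]},
    index=pd.Index(["A17Q", "B00X", "C42Z"], name="item_id"),
)

# Label-based lookup: select the row by item ID.
by_id = stats.loc["B00X", "count"]

# Position-based lookup: select the row by item number.
by_number = stats["count"].iloc[1]
```

Both lookups reach the same row here because item number 1 is the position of ID `"B00X"` in the index. The same pattern applies to a user statistics frame with user IDs and user numbers.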