lenskit.data.DatasetBuilder

class lenskit.data.DatasetBuilder(name=None)

Bases: object

Construct data sets from data and tables.

Parameters:

name (str | DataContainer | Dataset | None)

__init__(name=None)

Create a new dataset builder.

Parameters:

name (str | DataContainer | Dataset | None) – The dataset name. Can also be a data container or a dataset, which will initialize this builder with its contents to extend or modify.

Methods

__init__([name])

Create a new dataset builder.

add_entities()

Add entities to the data set.

add_entity_class(name)

Add an entity class to the dataset.

add_interactions(cls, data, *[, entities, ...])

Add interaction records to the data set.

add_list_attribute()

Add a list attribute to an entity class.

add_relationship_class(name, entities[, ...])

Add a relationship class to the dataset.

add_relationships(cls, data, *[, entities, ...])

Add relationship records to the data set.

add_scalar_attribute()

Add a scalar attribute to an entity class.

add_vector_attribute(cls, name, entities, ...)

Add a vector attribute to a set of entities.

build()

Build the dataset.

build_container()

Build a data container (backing store for a dataset).

clear_relationships(cls)

Remove all records for a specified relationship class.

entity_classes()

Get the entity classes defined so far.

entity_id_type(name)

Get the PyArrow data type for an entity class's identifiers.

filter_interactions(cls[, min_time, ...])

Filter interactions based on timestamp or to remove particular entities.

record_count(class_name)

Get the number of records for the specified entity or relationship class.

relationship_classes()

Get the relationship classes defined so far.

save(path)

Save the dataset to disk in the LensKit native format.

Attributes

name

Get the dataset name.

schema

The data schema assembled so far.

schema: DataSchema

The data schema assembled so far. Do not modify this schema directly.

property name: str | None

Get the dataset name.

entity_classes()

Get the entity classes defined so far.

Return type:

dict[str, EntitySchema]

relationship_classes()

Get the relationship classes defined so far.

Return type:

dict[str, RelationshipSchema]

record_count(class_name)

Get the number of records for the specified entity or relationship class.

Parameters:

class_name (str)

Return type:

int

entity_id_type(name)

Get the PyArrow data type for an entity class’s identifiers.

Parameters:

name (str)

Return type:

DataType

add_entity_class(name)

Add an entity class to the dataset.

Parameters:

name (str) – The name of the entity class.

Return type:

None

add_relationship_class(name, entities, allow_repeats=True, interaction=False)

Add a relationship class to the dataset. This usually doesn’t need to be called; add_relationships() and add_interactions() will automatically add the relationship class if needed.

Note

The order of entity classes in entities matters, as the relationship matrix logic (lenskit.data.RelationshipSet.matrix()) will default to using the first and last entity classes as the rows and columns of the matrix.

Parameters:
  • name (str) – The name of the relationship class.

  • entities (Sequence[str] | Mapping[str, str | None]) – The entity classes participating in the relationship class.

  • allow_repeats (bool) – Whether repeated records for the same combination of entities are allowed.

  • interaction (bool) – Whether this is an interaction relationship.

Return type:

None

add_entities(cls: str, ids: IDSequence | pa.Array[Any] | pa.ChunkedArray[Any], /, *, duplicates: Literal['update', 'error', 'overwrite'] = 'error') → None
add_entities(cls: str, frame: DataFrame | Table | dict[str, ndarray[Any, dtype[Any]]], /, *, duplicates: Literal['update', 'error', 'overwrite'] = 'error') → None

Add entities to the data set.

Parameters:
  • cls – The name of the entity class (e.g. "item").

  • source

    The input data, as an array or list of entity IDs.

    Note

    Data frame support will be added in a future version.

  • duplicates – How to handle duplicate entity IDs.

add_relationships(cls, data, *, entities=None, missing='error', allow_repeats=True, interaction=False, _warning_parent=0)

Add relationship records to the data set.

This method adds relationship records, provided as a Pandas data frame or an Arrow table, to the data set being built. The relationships can be of a new class (in which case it will be created), or new relationship records for an existing class.

For each entity E participating in the relationship, the table must have a column named E_id storing the entity IDs.

Parameters:
  • cls (str) – The name of the relationship class (e.g. rating, purchase).

  • data (DataFrame | Table | dict[str, ndarray[Any, dtype[Any]]]) – The relationship data.

  • entities (Sequence[str] | Mapping[str, str | None] | None) – The entity classes involved in this relationship class.

  • missing (Literal['insert', 'filter', 'error']) – What to do when relationship records reference nonexistent entities; one of "error", "insert", or "filter".

  • allow_repeats (bool) – Whether repeated relationship records are allowed.

  • interaction (bool | Literal['default']) – Whether this is an interaction relationship or not; can be "default" to indicate this is the default interaction relationship.

  • _warning_parent (int)

Return type:

None

add_interactions(cls, data, *, entities=None, missing='error', allow_repeats=True, default=False)

Add interaction records to the data set.

This method adds new interaction records, provided as a Pandas data frame or an Arrow table, to the data set being built. The interactions can be of a new class (in which case it will be created), or new interactions for an existing class.

For each entity E participating in the interaction, the table must have a column named E_id storing the entity IDs.

Interactions should usually have user as the first entity and item as the last; the default interaction matrix logic uses the first and last entities as the rows and columns, respectively, of the interaction matrix.

Parameters:
  • cls (str) – The name of the interaction class (e.g. rating, purchase).

  • data (DataFrame | Table | dict[str, ndarray[Any, dtype[Any]]]) – The interaction data.

  • entities (Sequence[str] | Mapping[str, str | None] | None) – The entity classes involved in this interaction class.

  • missing (Literal['insert', 'filter', 'error']) – What to do when interactions reference nonexistent entities; one of "error", "insert", or "filter".

  • allow_repeats (bool) – Whether repeated interactions are allowed.

  • default (bool) – If True, set this as the default interaction class (if the dataset has more than one interaction class).

Return type:

None
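The column convention above can be checked before any data reaches the builder. The following is a minimal sketch, not part of LensKit: id_columns and missing_id_columns are hypothetical helpers, and build_rating_dataset is an illustrative wrapper that assumes lenskit and pandas are installed. The comment on missing="insert" reflects a reading of the parameter description above rather than confirmed behaviour.

```python
def id_columns(entities):
    """Return the ID column names expected for the given entity classes."""
    return [f"{entity}_id" for entity in entities]


def missing_id_columns(columns, entities):
    """List the expected ID columns that are absent from `columns`."""
    present = set(columns)
    return [col for col in id_columns(entities) if col not in present]


def build_rating_dataset(ratings):
    """Build a dataset from a ratings table (needs lenskit and pandas)."""
    # Imported lazily so the helpers above work without LensKit installed.
    import pandas as pd
    from lenskit.data import DatasetBuilder

    entities = ["user", "item"]  # user first, item last, as recommended above
    absent = missing_id_columns(ratings.keys(), entities)
    if absent:
        raise ValueError(f"missing ID columns: {absent}")

    builder = DatasetBuilder("demo")  # "demo" is an arbitrary dataset name
    builder.add_interactions(
        "rating",
        pd.DataFrame(ratings),
        entities=entities,
        missing="insert",  # assumed: add entities that are not yet known
        default=True,
    )
    return builder.build()


ratings = {
    "user_id": [1, 1, 2],
    "item_id": [10, 20, 10],
    "rating": [4.0, 3.5, 5.0],
}
print(missing_id_columns(ratings, ["user", "item"]))  # []
```

Calling build_rating_dataset(ratings) should then return a Dataset whose default interaction class is rating, with users as matrix rows and items as columns.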

filter_interactions(cls, min_time=None, max_time=None, remove=None)

Filter interactions based on timestamp or to remove particular entities.

Parameters:
  • cls (str) – The interaction class to filter.

  • min_time (int | float | datetime | None) – The minimum interaction time to keep (inclusive).

  • max_time (int | float | datetime | None) – The maximum interaction time to keep (exclusive).

  • remove (Table | dict[str, numpy.typing.ArrayLike] | DataFrame | None) – Combinations of entity numbers or IDs to remove. The entities are filtered using an anti-join with this table, so providing a single column of entity IDs or numbers will remove all interactions associated with the listed entities.
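The filtering rules described above (an inclusive lower time bound, an exclusive upper bound, and an anti-join against the removal table) can be illustrated in plain Python. This is a conceptual sketch only, not LensKit's implementation: filter_records is a hypothetical function, records are assumed to be dictionaries with a timestamp key, and matching is done on IDs rather than entity numbers.

```python
def filter_records(records, min_time=None, max_time=None, remove=None):
    """Keep records inside [min_time, max_time) that match no row of `remove`."""
    remove = remove or []
    kept = []
    for rec in records:
        t = rec["timestamp"]
        if min_time is not None and t < min_time:
            continue  # too early: min_time is inclusive
        if max_time is not None and t >= max_time:
            continue  # too late: max_time is exclusive
        # Anti-join: drop the record if it agrees with any removal row
        # on every column that the row specifies.
        if any(all(rec[col] == val for col, val in row.items()) for row in remove):
            continue
        kept.append(rec)
    return kept


records = [
    {"user_id": 1, "item_id": 10, "timestamp": 5},
    {"user_id": 1, "item_id": 20, "timestamp": 10},
    {"user_id": 2, "item_id": 10, "timestamp": 15},
    {"user_id": 2, "item_id": 30, "timestamp": 20},
]

windowed = filter_records(records, min_time=10, max_time=20)
print([r["timestamp"] for r in windowed])  # [10, 15]

# A single-column removal row drops every interaction with that item.
without_item_10 = filter_records(records, remove=[{"item_id": 10}])
print([r["timestamp"] for r in without_item_10])  # [10, 20]
```

A removal row with several columns, such as {"user_id": 2, "item_id": 10}, would drop only the records that match that exact combination.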

clear_relationships(cls)

Remove all records for a specified relationship class.

Parameters:

cls (str)

add_scalar_attribute(cls: str, name: str, data: pd.Series[Any] | TableInput, /, *, dictionary: bool = False) → None
add_scalar_attribute(cls: str, name: str, entities: IDSequence | tuple[IDSequence, ...], values: ArrayLike, /, *, dictionary: bool = False) → None

Add a scalar attribute to an entity class.

Parameters:
  • cls – The entity class name.

  • name – The attribute name.

  • entities – The IDs for the entities whose attribute should be set.

  • values – The attribute values.

  • data – A Pandas data frame or Arrow table storing entity IDs and attribute values.

  • dictionary – True to dictionary-encode the attribute values (saves space for string categorical values).
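To show why dictionary encoding saves space for repeated string values, here is a conceptual sketch in plain Python. It is not how LensKit or Arrow implement the encoding, and dictionary_encode is a hypothetical name; it only demonstrates the idea of storing each distinct value once and replacing the column with small integer indices.

```python
def dictionary_encode(values):
    """Split values into a list of distinct values and per-row indices."""
    dictionary = []  # distinct values, in order of first appearance
    index_of = {}    # value -> position in `dictionary`
    indices = []     # one small integer per original value
    for value in values:
        if value not in index_of:
            index_of[value] = len(dictionary)
            dictionary.append(value)
        indices.append(index_of[value])
    return dictionary, indices


genres = ["drama", "comedy", "drama", "drama", "comedy"]
dictionary, indices = dictionary_encode(genres)
print(dictionary)  # ['drama', 'comedy']
print(indices)     # [0, 1, 0, 0, 1]

# The original column can be recovered from the two parts.
decoded = [dictionary[i] for i in indices]
```

The benefit grows with the number of rows per distinct value, which is why the option is aimed at categorical string attributes.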

add_list_attribute(cls: str, name: str, data: pd.Series[Any] | TableInput, /, *, dictionary: bool = False) → None
add_list_attribute(cls: str, name: str, entities: IDSequence | tuple[IDSequence, ...], values: ArrayLike, /, *, dictionary: bool = False) → None

Add a list attribute to an entity class.

Parameters:
  • cls – The entity class name.

  • name – The attribute name.

  • entities – The IDs for the entities whose attribute should be set.

  • values – The attribute values (an array or list of lists).

  • data – A Pandas data frame or Arrow table storing entity IDs and attribute values.

  • dictionary – True to dictionary-encode the attribute values (saves space for string categorical values).

add_vector_attribute(cls, name, entities, values, /, dim_names=None)

Add a vector attribute to a set of entities.

Warning

Dense vector attributes are stored densely, even for entities for which it is not set. High-dimensional vectors can therefore take up a lot of space.

Parameters:
  • cls (str) – The entity class name.

  • name (str) – The attribute name.

  • entities (IDSequence | tuple[IDSequence, ...]) – The entity IDs to which the attribute should be attached.

  • values (pa.Array[Any] | pa.ChunkedArray[Any] | np.ndarray[tuple[int, int], Any] | sparray) – The attribute values, as a fixed-length list array or a two-dimensional NumPy array (for dense vector attributes) or a SciPy sparse array (for sparse vector attributes).

  • dim_names (ArrayLike | pd.Index[Any] | Sequence[Any] | None) – The names for the dimensions of the array.

Return type:

None
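The storage warning above can be made concrete with a rough size estimate. This is a back-of-the-envelope sketch, not a LensKit function: dense_vector_bytes is a hypothetical helper, and the 4-byte element size (32-bit floats) is an assumption, since the actual element type depends on the values supplied. Array overhead is ignored.

```python
def dense_vector_bytes(n_entities, dim, bytes_per_value=4):
    """Approximate storage for a dense vector attribute, ignoring overhead."""
    return n_entities * dim * bytes_per_value


total_items = 1_000_000   # every item in the entity class
items_with_vector = 1_000 # items for which the attribute is actually set
dim = 512

# Dense storage covers every entity in the class, set or not.
stored = dense_vector_bytes(total_items, dim)
# What the set values alone would need.
needed = dense_vector_bytes(items_with_vector, dim)

print(stored)            # 2048000000
print(needed)            # 2048000
print(stored // needed)  # 1000
```

When only a small fraction of entities carry the attribute, or the vectors are mostly zero, supplying a SciPy sparse array as values may be the more economical choice.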

build()

Build the dataset.

Return type:

Dataset

build_container()

Build a data container (backing store for a dataset).

Return type:

DataContainer

save(path)

Save the dataset to disk in the LensKit native format.

Parameters:

path (str | PathLike[str]) – The path where the dataset will be saved (will be created as a directory).