lenskit.data.DatasetBuilder

class lenskit.data.DatasetBuilder(name=None)

Bases: object

Construct data sets from data and tables.

Parameters:

name (str | DataContainer | Dataset | None)

__init__(name=None)

Create a new dataset builder.

Parameters:

name (str | DataContainer | Dataset | None) – The dataset name. Can also be a data container or a dataset, which will initialize this builder with its contents to extend or modify.

Methods

__init__([name])

Create a new dataset builder.

add_entities()

add_entity_class(name)

add_interactions(cls, data, *[, entities, ...])

add_list_attribute()

add_relationship_class(name, entities[, ...])

add_relationships(cls, data, *[, entities, ...])

add_scalar_attribute()

add_vector_attribute(cls, name, entities, ...)

Add a vector attribute to a set of entities.

build()

build_container()

clear_relationships(cls)

entity_classes()

Get the entity classes defined so far.

entity_id_type(name)

Get the PyArrow data type for an entity class's identifiers.

filter_interactions(cls[, min_time, ...])

Filter interactions based on timestamp or to remove particular entities.

record_count(class_name)

relationship_classes()

Get the relationship classes defined so far.

save(path)

Save the dataset to disk in the LensKit native format.

Attributes

name

schema

The data schema assembled so far.

schema: DataSchema

The data schema assembled so far. Do not modify this schema directly.

entity_classes()

Get the entity classes defined so far.

Return type:

dict[str, EntitySchema]

relationship_classes()

Get the relationship classes defined so far.

Return type:

dict[str, RelationshipSchema]

entity_id_type(name)

Get the PyArrow data type for an entity class’s identifiers.

Parameters:

name (str)

Return type:

DataType

filter_interactions(cls, min_time=None, max_time=None, remove=None)

Filter interactions based on timestamp or to remove particular entities.

Parameters:
  • cls (str) – The interaction class to filter.

  • min_time (int | float | datetime | None) – The minimum interaction time to keep (inclusive).

  • max_time (int | float | datetime | None) – The maximum interaction time to keep (exclusive).

  • remove (Table | dict[str, numpy.typing.ArrayLike] | DataFrame | None) – Combinations of entity numbers or IDs to remove. The entities are filtered using an anti-join with this table, so providing a single column of entity IDs or numbers will remove all interactions associated with the listed entities.
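The filtering rules described above (an inclusive lower time bound, an exclusive upper time bound, and an anti-join against the remove table) can be illustrated with a small pure-Python sketch. This is not LensKit's implementation and does not call LensKit; the function filter_rows and the column names user_id, item_id, and timestamp are hypothetical and chosen only for illustration.

```python
def filter_rows(rows, min_time=None, max_time=None, remove=None):
    """Illustrative filter: keep rows with min_time <= timestamp < max_time,
    then drop rows that match any key combination listed in `remove`."""
    kept = []
    for row in rows:
        t = row["timestamp"]
        if min_time is not None and t < min_time:
            continue  # before the window (lower bound is inclusive)
        if max_time is not None and t >= max_time:
            continue  # at or after the upper bound (exclusive)
        kept.append(row)

    if remove:
        # Anti-join: the columns present in `remove` are the join keys.
        keys = list(remove)
        banned = set(zip(*(remove[k] for k in keys)))
        kept = [r for r in kept if tuple(r[k] for k in keys) not in banned]
    return kept


# Hypothetical interaction records.
rows = [
    {"user_id": 1, "item_id": 10, "timestamp": 100},
    {"user_id": 1, "item_id": 11, "timestamp": 200},
    {"user_id": 2, "item_id": 10, "timestamp": 300},
    {"user_id": 3, "item_id": 12, "timestamp": 400},
]

# Half-open time window: 200 is kept, 400 is dropped.
windowed = filter_rows(rows, min_time=200, max_time=400)

# A single key column removes every interaction involving item 10.
without_item = filter_rows(rows, remove={"item_id": [10]})

# Two key columns remove only the exact (user, item) combination.
without_pair = filter_rows(rows, remove={"user_id": [1], "item_id": [10]})
```

The last two calls show why the number of columns in the remove table matters: fewer key columns match more interactions.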

add_vector_attribute(cls, name, entities, values, /, dim_names=None)

Add a vector attribute to a set of entities.

Warning

The vector is stored densely, even for entities for which it is not set. High-dimensional vectors can therefore take up a lot of space.

Parameters:
  • cls (str) – The entity class name.

  • name (str) – The attribute name.

  • entities (IDSequence | tuple[IDSequence, ...]) – The entity IDs to which the attribute should be attached.

  • values (pa.Array[Any] | pa.ChunkedArray[Any] | np.ndarray[tuple[int, int], Any] | sparray) – The attribute values, as a fixed-length list array or a two-dimensional NumPy array.

  • dim_names (ArrayLike | pd.Index[Any] | Sequence[Any] | None) – The names for the dimensions of the array.

Return type:

None
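The warning on add_vector_attribute about dense storage can be made concrete with a rough size estimate. The sketch below is back-of-the-envelope arithmetic only: it assumes 4-byte (float32) elements and ignores any format overhead, neither of which is stated in this reference, and the helper dense_vector_bytes and the entity counts are hypothetical.

```python
def dense_vector_bytes(n_entities, n_dims, itemsize=4):
    """Approximate size of a densely stored vector attribute.

    Assumes one fixed-length vector per entity in the class and
    `itemsize` bytes per element (4 corresponds to float32).
    """
    return n_entities * n_dims * itemsize


# Hypothetical numbers: 1,000,000 entities in the class, but only
# 10,000 of them actually have a 768-dimensional vector set.
n_total = 1_000_000
n_with_vector = 10_000
n_dims = 768

# Dense storage scales with every entity in the class.
dense_bytes = dense_vector_bytes(n_total, n_dims)

# For comparison, the space the set vectors alone would need.
set_only_bytes = dense_vector_bytes(n_with_vector, n_dims)

print(f"dense: {dense_bytes / 1e9:.2f} GB, set only: {set_only_bytes / 1e6:.2f} MB")
```

Under these assumptions the dense layout is 100 times larger than the vectors that were actually set, so it can be worth reducing the dimensionality of sparsely populated attributes before adding them.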

save(path)

Save the dataset to disk in the LensKit native format.

Parameters:

path (str | PathLike[str]) – The path where the dataset will be saved. It is created as a directory.