Implementing Components

LensKit is designed to excel in research and educational applications, which often requires writing your own components: new scoring models, rankers, or other kinds of components. The pipeline design and standard pipelines are intended to make this as easy as possible, so that you can focus on your own logic without writing boilerplate such as looking up user histories or ranking by score. You implement the training and scoring logic, and LensKit does the rest.

Basics

Implementing a component consists of a few steps:

  1. Defining the configuration class.

  2. Defining the component class, with its config attribute declaration.

  3. Defining a __call__ method for the component class that performs the component’s actual computation.

  4. If the component supports training, implementing the Trainable protocol by defining a train() method, or implementing Iterative Training.

A simple example component that computes a linear weighted blend of the scores from two other components could look like this:

# This file is part of LensKit.
# Copyright (C) 2018-2023 Boise State University
# Copyright (C) 2023-2025 Drexel University
# Licensed under the MIT license, see LICENSE.md for details.
# SPDX-License-Identifier: MIT

from pydantic import BaseModel

from lenskit.data import ItemList
from lenskit.pipeline import Component


class LinearBlendConfig(BaseModel):
    "Configuration for :class:`LinearBlendScorer`."

    # define the parameter with a type, default value, and docstring.
    mix_weight: float = 0.5
    r"""
    Linear blending mixture weight :math:`\alpha`.
    """


class LinearBlendScorer(Component[ItemList]):
    r"""
    Score items with a linear blend of two other scores.

    Given a mixture weight :math:`\alpha` and two scores
    :math:`s_i^{\mathrm{left}}` and :math:`s_i^{\mathrm{right}}`, this
    computes :math:`s_i = \alpha s_i^{\mathrm{left}} + (1 - \alpha)
    s_i^{\mathrm{right}}`.  Missing values propagate, so only items
    scored in both inputs have scores in the output.
    """

    # define the configuration attribute, with a docstring to make sure
    # it shows up in component docs.
    config: LinearBlendConfig
    "Configuration parameters for the linear blend."

    # the __call__ method defines the component's operation
    def __call__(self, left: ItemList, right: ItemList) -> ItemList:
        """
        Blend the scores of two item lists.
        """
        ls = left.scores("pandas", index="ids")
        rs = right.scores("pandas", index="ids")
        ls, rs = ls.align(rs)
        alpha = self.config.mix_weight
        combined = ls * alpha + rs * (1 - alpha)
        return ItemList(item_ids=combined.index, scores=combined.values)

This component can be instantiated with its defaults:

>>> LinearBlendScorer()
<LinearBlendScorer {
    "mix_weight": 0.5
}>

You can also instantiate it with its configuration class:

>>> LinearBlendScorer(LinearBlendConfig(mix_weight=0.2))
<LinearBlendScorer {
    "mix_weight": 0.2
}>

Finally, you can directly pass configuration parameters to the component constructor:

>>> LinearBlendScorer(mix_weight=0.7)
<LinearBlendScorer {
    "mix_weight": 0.7
}>
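
The missing-value behavior described in the LinearBlendScorer docstring comes from pandas alignment. The following is a small sketch using plain pandas Series in place of the ItemList scores; the item IDs and score values are made up for illustration:

```python
import pandas as pd

# Two score series that only partly overlap (illustrative values).
ls = pd.Series([1.0, 2.0], index=["a", "b"])
rs = pd.Series([3.0, 4.0], index=["b", "c"])

# align() defaults to an outer join, filling the gaps with NaN.
ls, rs = ls.align(rs)

alpha = 0.5
combined = ls * alpha + rs * (1 - alpha)

# Only "b" appears in both inputs, so only "b" gets a blended score;
# "a" and "c" come out as NaN.
print(combined)
```

Because NaN propagates through arithmetic, no explicit check for missing scores is needed in the component itself.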

Component Configuration

As noted in the pipeline documentation, components are configured with configuration objects. These are JSON-serializable objects defined as Python dataclasses or Pydantic models, and define the different settings or hyperparameters that control the model’s behavior.

The choice of parameters is up to the component author, and each component will have different configuration needs. Some needs are common across many components, though; see Configuration Conventions for common LensKit configuration conventions.
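
To illustrate what "JSON-serializable" means for a dataclass-based configuration, here is a minimal sketch using only the standard library. The BlendConfig class is hypothetical, and the round trip below uses plain json rather than whatever serialization machinery LensKit itself applies:

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class BlendConfig:
    """Hypothetical configuration class, for illustration only."""

    mix_weight: float = 0.5


cfg = BlendConfig(mix_weight=0.2)

# Every field is a plain JSON type, so the config survives a round trip.
text = json.dumps(asdict(cfg))
restored = BlendConfig(**json.loads(text))
```

Keeping configuration fields to simple types (numbers, strings, booleans, lists, and nested configs) is what makes this kind of round trip possible.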

Component Operation

The heart of the component interface is the __call__ method (components are just callable objects). This method takes the component inputs as parameters, and returns the component’s result.

Most components return an ItemList. Scoring components usually have the following signature:

def __call__(self, query: QueryInput, items: ItemList) -> ItemList:
    ...

The query input receives the user ID, history, context, or other query input; the items input receives the list of items to be scored (e.g., the candidate items for recommendation). The scorer then returns a list of scored items.

Most components begin by converting the query to a RecQuery:

def __call__(self, query: QueryInput, items: ItemList) -> ItemList:
    query = RecQuery.create(query)
    ...

It is conventional for scorers to return a copy of the input item list with the scores attached, filling in NaN for items that cannot be scored. After assembling a NumPy array of scores, you can do this with:

return ItemList(items, scores=scores)

Scalars can also be supplied, so if the scorer cannot score any of the items, it can simply return a list with no scores:

return ItemList(items, scores=np.nan)

Components must be able to handle items in the items input that were not seen at training time. If the component has saved the training item vocabulary, the easiest way to do this is to call numbers() with missing="negative", so that unknown items can be identified by their negative numbers:

i_nums = items.numbers(vocabulary=self.items, missing="negative")
scorable_mask = i_nums >= 0
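
The mask can then be used to fill in scores for the known items and leave NaN for the rest. The following is a sketch with plain NumPy arrays; the item_scores array is a hypothetical stand-in for learned model state, and i_nums is written out by hand instead of coming from a real ItemList:

```python
import numpy as np

# Hypothetical learned state: one score per item number in the
# training vocabulary.
item_scores = np.array([0.9, 0.1, 0.5])

# Item numbers for four candidate items; the negative entries stand
# for items that were not in the training vocabulary.
i_nums = np.array([2, -1, 0, -1])
scorable_mask = i_nums >= 0

# Start with NaN everywhere, then fill in the scorable items.
scores = np.full(len(i_nums), np.nan)
scores[scorable_mask] = item_scores[i_nums[scorable_mask]]
```

The resulting scores array has the same length and order as the input items, so it can be attached to the returned item list as described above.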

Component Training

Components that need to train models on training data must implement the Trainable protocol, either directly or through a helper implementation like IterativeTraining. The core of the Trainable protocol is the train() method, which takes a Dataset and TrainingOptions and trains the model.

The details of training will vary significantly from model to model. Typically, though, training proceeds as follows:

  1. Extract, prepare, and preprocess training data as needed for the model.

  2. Compute the model’s parameters, either directly (e.g. BiasScorer) or through an optimization method (e.g. ImplicitMFScorer).

  3. Finalize the model parameters and clean up any temporary data.

Learned model parameters are then stored as attributes on the component class, either directly or in a container object (such as a PyTorch Module).

Note

If the model is already trained and the retrain option is False, then the train method should return without doing any training. IterativeTraining handles this automatically.
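
The overall shape of a train() method, including the retrain check, might look like the following sketch. It is deliberately independent of LensKit: StandInOptions is a hypothetical stand-in for TrainingOptions, the data is a plain list of (user, item, rating) tuples instead of a Dataset, and MeanScorer is a toy model, not a LensKit class.

```python
from dataclasses import dataclass


@dataclass
class StandInOptions:
    """Hypothetical stand-in for LensKit's TrainingOptions."""

    retrain: bool = True


class MeanScorer:
    """Toy model that learns the global mean rating (illustration only)."""

    mean_ = None  # learned parameter, stored as an attribute

    def train(self, data, options=None):
        options = options or StandInOptions()
        # Skip training if a model already exists and retraining is off.
        if self.mean_ is not None and not options.retrain:
            return
        # 1. Extract and prepare the training data.
        ratings = [rating for (_user, _item, rating) in data]
        # 2. Compute the model parameters (directly, in this case).
        mean = sum(ratings) / len(ratings)
        # 3. Finalize: store the learned parameter on the component.
        self.mean_ = mean


model = MeanScorer()
model.train([("u1", "i1", 4.0), ("u1", "i2", 2.0)])
# A second call with retrain=False leaves the trained model untouched.
model.train([("u2", "i1", 5.0)], StandInOptions(retrain=False))
```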

Iterative Training

The lenskit.training.IterativeTraining class provides a standardized interface and training loop support for training models with iterative methods that pass through the training data in multiple epochs. Models that use this support extend IterativeTraining in addition to Component, and implement the training_loop() method instead of train(). Iteratively-trainable components should also have an epochs setting on their configuration class that specifies the number of training epochs to run.

The training_loop() method does three things:

  1. Set up initial data structures, preparation, etc. needed for model training.

  2. Train the model, yielding after each training epoch. It can optionally yield a set of metrics, such as training loss or update magnitudes.

  3. Perform any final steps and training data cleanup.

The model should be usable after each epoch, to support things like measuring performance on validation data.

The training loop itself is represented as a Python iterator, so that a for loop will loop through the training epochs. While the interface definition specifies the Iterator type in order to minimize restrictions on component implementers, we recommend that it actually be a Generator, which allows the caller to request early termination (through the close() method). We also recommend that the training_loop() method only return the generator after initial data preparation is complete, so that setup time is not included in the time taken for the first loop iteration. The easiest way to implement this is to delegate to an inner loop function, written as a Python generator:

def training_loop(self, data: Dataset, options: TrainingOptions):
    # do initial data setup/prep for training
    context = ...
    # pass off to inner generator
    return self._training_loop_impl(context)

def _training_loop_impl(self, context):
    for i in range(self.config.epochs):
        # do the model training
        # compute the metrics
        loss = ...
        try:
            yield {'loss': loss}
        except GeneratorExit:
            # client code has requested early termination
            break

    # any final cleanup steps
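
To see how this pattern behaves from the caller's side, the following self-contained sketch reproduces it with plain functions and no LensKit classes. The function names mirror the example above, but the halving "loss" and the log list are placeholders invented for the demonstration:

```python
def training_loop(epochs, log):
    # Setup runs immediately, before the generator is returned.
    log.append("setup")
    return _training_loop_impl(epochs, log)


def _training_loop_impl(epochs, log):
    loss = 1.0
    for _ in range(epochs):
        loss /= 2  # placeholder for one epoch of real training
        try:
            yield {"loss": loss}
        except GeneratorExit:
            # The caller invoked close() to stop early.
            log.append("stopped early")
            break
    # Cleanup runs on normal completion and on early termination.
    log.append("cleanup")


log = []
loop = training_loop(10, log)
for epoch, metrics in enumerate(loop, start=1):
    if metrics["loss"] < 0.2:
        loop.close()  # request early termination
        break
```

Because the generator catches GeneratorExit and then finishes without yielding again, close() returns normally and the cleanup step still runs.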

Further Reading

See Component Conventions for more conventions for component design and configuration.