Recommendation Pipelines#

Since version 2025.1.1, LensKit uses a flexible “pipeline” abstraction to wire together different components such as candidate selectors, personalized item scorers, and rankers to produce predictions, recommendations, or other recommender system outputs. This is a significant change from the LensKit 0.x design of monolithic and composable components based on the Scikit-Learn API, allowing new recommendation designs to be composed without writing new classes just for the composition. It also makes recommender definition code more explicit by laying out the pipeline instead of burying composition logic in the definitions of different composition classes. The pipeline lives in the lenskit.pipeline module, and the primary entry point is the Pipeline class.

If all you want to do is build a standard top-N recommendation pipeline from an item scorer, see topn_pipeline(). RecPipelineBuilder provides some more flexibility in configuring a recommendation pipeline with a standard design, and you can always fully configure the pipeline yourself for maximum flexibility.

Pipeline components are not limited to looking things up from training data: they can query databases, load files, and perform any other operations. A runtime pipeline can use some components (especially the scorer) trained from training data, and other components that query a database or REST services for things like user history and candidate set lookup.

Acknowledgements

The LensKit pipeline design is heavily inspired by the pipeline abstraction Karl Higley originally created for POPROX (available in the git history), as well as by Haystack.

Constructing Pipelines#

The simplest way to make a pipeline is to construct a topn_pipeline — the following will create a top-N recommendation pipeline using implicit-feedback matrix factorization:

from lenskit.als import ImplicitMFScorer
from lenskit.pipeline import topn_pipeline

als = ImplicitMFScorer(50)
pipe = topn_pipeline(als)

The RecPipelineBuilder class provides a more flexible mechanism to create standard recommendation pipelines; to implement the same pipeline with that class, do:

from lenskit.pipeline import RecPipelineBuilder

als = ImplicitMFScorer(50)
builder = RecPipelineBuilder()
builder.scorer(als)
pipe = builder.build('ALS')

For maximum flexibility, you can directly construct and wire the pipeline yourself; this is described in Standard Pipelines. Pipelines are built with a PipelineBuilder, which sets up the nodes and connections, checks for things like cycles, and instantiates the components to make a pipeline that can be trained and used.

After any of these methods, you can run the pipeline to produce recommendations with:

user_recs = pipe.run('recommender', query=user_id)

This is also exposed with a convenience function:

from lenskit import recommend
user_recs = recommend(pipe, user_id)

Pipeline Model#

A pipeline has a couple of key concepts:

An input is data that needs to be provided to the pipeline when it is run, such as the user to generate recommendations for. Inputs have specified data types, and it is an error to provide an input value of an unexpected type.
A component processes input data and produces an output. It can be either a Python function or object (anything that implements the Component protocol) that takes zero or more inputs as keyword arguments and returns an output. The pipeline will supply these inputs either from pipeline inputs or from the outputs of other components.

These are arranged in a directed acyclic graph, consisting of:

Nodes (represented by Node), which correspond to either inputs or components.
Connections from one node’s input to another node’s data (or to a fixed data value). This is how the pipeline knows which components depend on other components and how to provide each component with the inputs it requires; see Connections for details.

Each node has a name that can be used to look up the node with Pipeline.node() (or PipelineBuilder.node()) and appears in serialization and logging situations. Names must be unique within a pipeline.

Connections#

Components declare their inputs as keyword arguments on their call signatures (either the function call signature, if it is a bare function, or the __call__ method if it is implemented by a class). In a pipeline, these inputs can be connected to a source, which the pipeline will use to obtain a value for that parameter when running the pipeline. Inputs can be connected to the following types:

A Node, in which case the input will be provided from the corresponding pipeline input or component return value. Nodes are returned by create_input() or add_component(), and can be looked up after creation with node().
A Python object, in which case that value will be provided directly to the component input argument.

These input connections are specified via keyword arguments to the PipelineBuilder.add_component() or PipelineBuilder.connect() methods — specify the component’s input name(s) and the node or data to which each input should be wired.

You can also use PipelineBuilder.default_connection() to specify default connections. For example, you can specify a default for inputs named user:

pipe.default_connection('user', user_history)

With this default in place, if a component has an input named user and that input is not explicitly connected to a node, then the user_history node will be used to supply its value. Judicious use of defaults can reduce the amount of code overhead needed to wire common pipelines.
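The lookup rule described above can be illustrated with a small stand-alone sketch. It does not use LensKit: `resolve_inputs` and the plain dictionaries are hypothetical names chosen for illustration, and the real builder resolves defaults when the pipeline is built.

```python
import inspect


def resolve_inputs(func, explicit, defaults):
    """Decide where each keyword input of ``func`` gets its value.

    Hypothetical helper, not part of LensKit: explicit connections win,
    and defaults are consulted only for inputs left unconnected.
    """
    wiring = {}
    for name in inspect.signature(func).parameters:
        if name in explicit:
            wiring[name] = explicit[name]
        elif name in defaults:
            wiring[name] = defaults[name]
        else:
            raise KeyError(f"input {name!r} is not connected")
    return wiring


def score(user, items):
    return [(item, 1.0) for item in items]


# 'user' falls back to the default connection; 'items' is wired explicitly.
wiring = resolve_inputs(score, {"items": "candidates"}, {"user": "user_history"})
```

Here the node names are plain strings only to keep the sketch short; in a real pipeline the sources would be node objects.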

Note

You cannot directly wire an input to another component using only that component’s name; if you only have a name, pass it to PipelineBuilder.node() to obtain the node. This is because it would be impossible to distinguish between a string component name and a string data value.

Building the Pipeline#

Once you have set up the pipeline with the various methods of PipelineBuilder, you can do a couple of things:

Call PipelineBuilder.build to build a usable Pipeline. The pipeline can then be trained, run, etc.
Call PipelineBuilder.build_config to build a PipelineConfig that can be serialized and reloaded from JSON, YAML, or similar formats.

Building a pipeline resolves default connections, instantiates components from their configurations, and checks for cycles.
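As a rough illustration of the cycle check, the sketch below uses the standard library's `graphlib` on a dictionary of dependencies. It is an assumption about how such a check can be done, not LensKit's implementation; `check_acyclic` and the node names are hypothetical.

```python
from graphlib import CycleError, TopologicalSorter


def check_acyclic(dependencies):
    """Return a valid run order, or raise ValueError if there is a cycle.

    ``dependencies`` maps each node name to the names of the nodes it reads from.
    """
    try:
        return list(TopologicalSorter(dependencies).static_order())
    except CycleError as err:
        # the second element of ``args`` holds the nodes forming the cycle
        raise ValueError(f"pipeline has a cycle: {err.args[1]}") from err


good = {
    "history-lookup": {"query"},
    "scorer": {"history-lookup", "items"},
    "ranker": {"scorer"},
}
order = check_acyclic(good)  # every node appears after its dependencies

bad = {"scorer": {"ranker"}, "ranker": {"scorer"}}
cycle_message = None
try:
    check_acyclic(bad)
except ValueError as err:
    cycle_message = str(err)
```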

Execution#

Once configured, a pipeline can be run with Pipeline.run(), or with one of the operation functions (see Invoking Recommenders; these functions call run() under the hood).

The run() method takes two types of inputs:

Positional arguments specifying the node(s) to run and whose results should be returned. This is to allow partial runs of pipelines (e.g. to only score items without ranking them), and to allow multiple return values to be obtained (e.g. initial item scores and final rankings, which may have altered scores).
Keyword arguments specifying the values for the pipeline’s inputs, as defined by calls to PipelineBuilder.create_input().

Pipeline execution logically proceeds in the following steps:

Determine the full list of pipeline components that need to be run in order to run the specified components.
Run those components in order, taking their inputs from pipeline inputs or previous components as specified by the pipeline connections and defaults.
Return the values of the specified components. If a single component is specified, its value is returned directly; if two or more components are specified, their values are returned in a tuple.
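The three steps above can be sketched with a toy executor. This is a simplified model under stated assumptions (components are plain functions, wiring is a dictionary of names), not LensKit's run() implementation; `run`, `components`, and `wiring` are hypothetical.

```python
def run(components, wiring, *targets, **inputs):
    """Toy executor: ``components`` maps names to functions, and ``wiring``
    maps each component name to {parameter name: source node name}."""
    results = dict(inputs)  # pipeline inputs are already-known values

    def value_of(name):
        if name not in results:
            # recursion visits only the upstream nodes this one needs
            kwargs = {param: value_of(src) for param, src in wiring[name].items()}
            results[name] = components[name](**kwargs)
        return results[name]

    values = tuple(value_of(target) for target in targets)
    # a single target is returned directly, several as a tuple
    return values[0] if len(values) == 1 else values


ran = []  # records which components actually executed


def score(query, items):
    ran.append("scorer")
    return {item: query * weight for item, weight in items.items()}


def rank(scores):
    ran.append("ranker")
    return sorted(scores, key=scores.get, reverse=True)


components = {"scorer": score, "ranker": rank}
wiring = {
    "scorer": {"query": "query", "items": "items"},
    "ranker": {"scores": "scorer"},
}

# partial run: only the scorer is requested, so the ranker never executes
only_scores = run(components, wiring, "scorer", query=2, items={"a": 1.0, "b": 3.0})
first_run_log = list(ran)

# two targets: results come back as a tuple, and the scorer runs only once
scores, ranking = run(
    components, wiring, "scorer", "ranker", query=2, items={"a": 1.0, "b": 3.0}
)
```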

Component Names#

As noted above, each component (and pipeline input) has a name that is unique across the pipeline. For consistency and clarity, we recommend naming components with a noun or kebab-case noun phrase that describes the component itself, e.g.:

recommender
reranker
scorer
history-lookup
item-embedder

Component nodes can also have aliases, allowing them to be accessed by more than one name. Use PipelineBuilder.alias() to define these aliases.

Various LensKit facilities recognize several standard component names used by the standard pipeline builders, and we recommend you use them in your own pipelines when applicable:

scorer — compute (usually personalized) scores for items for a given user.
ranker — compute a (ranked) list of recommendations for a user. If you are configuring a pipeline with rerankers whose outputs are also rankings, this name should usually be used for the last such ranker, and downstream components (if any) transform that ranking into another layout; that way the evaluation tools will operate on the last such ranking.
recommender — compute recommendations for a user. This will often be an alias for ranker, as in a top-N recommender, but may return other formats such as grids or unordered slates.
rating-predictor — predict a user’s ratings for the specified items. When present, this may be an alias for scorer, or it may be another component that fills in missing scores with a baseline prediction.

These component names replace the task-specific interfaces in pre-2025 LensKit; a Recommender is now just a pipeline with recommender and/or ranker components.

Pipeline Serialization#

Pipelines are defined by the following:

The components and inputs (nodes)
The component input connections (edges)
The component configurations (see Component)
The components’ learned parameters (see Trainable)

LensKit supports serializing both pipeline descriptions (components, connections, and configurations) and pipeline parameters. There are two ways to save a pipeline or part thereof:

Pickle the entire pipeline. This is easy, and saves everything in the pipeline; it has the usual downsides of pickling (arbitrary code execution, etc.). LensKit uses pickling to share pipelines with worker processes for parallel batch operations.

Note

Pickled pipelines must be unpickled with the same LensKit version — we make no attempt to maintain pickle compatibility.
Save the pipeline configuration (Pipeline.config, using model_dump_json()). This saves the components, their configurations, and their connections, but not any learned parameter data. A new pipeline can be constructed from such a configuration with Pipeline.from_config().

Standard Pipelines#

The standard recommendation pipeline, produced by either of the approaches described above in Constructing Pipelines, looks like this:

---
config:
    fontFamily: '"Source Sans 3", Verdana, Helvetica, Arial, sans-serif'
---
flowchart LR
    subgraph input["Inputs"]
    QUERY[/"Query<br>(<tt>query: QueryInput</tt>)"/]
    ITEMS[/"Candidate Items<br>(<tt>items: ItemList</tt>)"/]
    N[/"List Length<br>(<tt>n: int</tt>)"/]
    class ITEMS optional;
    class N optional;
    end

    subgraph prep["Data Preparation"]
    HLOOK["History Lookup"]
    CSEL["Candidate Selector"]
    CPICK(["Pick Cand. Source"])
    end

    subgraph rank["Scoring and Ranking"]
    SCORE["Scorer"]
    RANK["`Top-*N* Ranker`"]
    class SCORE config;
    end

    RESULT[\"Recommendations<br>(ordered <tt>ItemList</tt>)"\]

    QUERY --> HLOOK
    HLOOK -- RecQuery --> CSEL
    CSEL -.->|"ItemList<br>(if needed)"| CPICK
    ITEMS --> CPICK

    HLOOK -- RecQuery --> SCORE
    CPICK -- ItemList --> SCORE
    SCORE -- ItemList --> RANK
    N --> RANK

    RANK --> RESULT

    classDef optional stroke-dasharray: 5 5;
    classDef config font-weight:bold,stroke-width:4px;

Top-N recommendation pipeline.#

The convenience methods are equivalent to the following pipeline code:

pipe = PipelineBuilder()
# define an input parameter for the user ID (the 'query')
query = pipe.create_input('query', ID)
# allow candidate items to be optionally specified
items = pipe.create_input('items', ItemList, None)
# look up a user's history in the training data
history = pipe.add_component('history-lookup', LookupTrainingHistory, query=query)
# find candidates from the training data
default_candidates = pipe.add_component(
    'candidate-selector',
    UnratedTrainingItemsCandidateSelector,
    query=history,
)
# if the client provided items as a pipeline input, use those; otherwise
# use the candidate selector we just configured.
candidates = pipe.use_first_of('candidates', items, default_candidates)
# score the candidate items using the specified scorer
score = pipe.add_component('scorer', scorer, query=query, items=candidates)
# rank the items by score
recommend = pipe.add_component('ranker', TopNRanker, {'n': 50}, items=score)
pipe.alias('recommender', recommend)
pipe.default_component('recommender')
pipe = pipe.build()
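The candidate-picking step in this code can be pictured with a stand-alone sketch. It assumes "first of" means the first source that yields a non-None value, and that later sources are evaluated only if needed (as the dashed "if needed" edge in the diagram suggests); `first_of` is a hypothetical helper and LensKit's actual node may differ in detail.

```python
def first_of(*sources):
    """Return the first non-None value, calling each source lazily."""
    for source in sources:
        value = source()
        if value is not None:
            return value
    return None


calls = []  # records whether the (possibly expensive) selector ran


def select_candidates():
    calls.append("candidate-selector")
    return ["i1", "i2", "i3"]


# client supplied items: the selector is never invoked
picked = first_of(lambda: ["i9"], select_candidates)
calls_after_first = list(calls)

# no client items: fall back to the selector
fallback = first_of(lambda: None, select_candidates)
```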

If we want to also emit rating predictions, with fallback to a baseline model to predict ratings for items the primary scorer cannot score (e.g. they are not in an item neighborhood), we use the following pipeline (created by RecPipelineBuilder when rating prediction is enabled):

---
config:
    fontFamily: '"Source Sans 3", Verdana, Helvetica, Arial, sans-serif'
---
flowchart LR
    subgraph input["Inputs"]
    QUERY[/"Query<br>(<tt>query: QueryInput</tt>)"/]
    ITEMS[/"Candidate Items<br>(<tt>items: ItemList</tt>)"/]
    N[/"List Length<br>(<tt>n: int</tt>)"/]
    class ITEMS optional;
    class N optional;
    end

    subgraph prep["Data Preparation"]
    HLOOK["History Lookup"]
    CSEL["Candidate Selector"]
    CPICK(["Pick Cand. Source"])
    end

    subgraph rank["Scoring and Ranking"]
    SCORE["Scorer"]
    FALLBACK["Baseline Scorer"]
    FILL["Fallback"]
    RANK["`Top-*N* Ranker`"]
    class SCORE config;
    end

    RESULT[\"Recommendations<br>(ordered <tt>ItemList</tt>)"\]
    PREDS[\"Predictions<br>(<tt>ItemList</tt>)"\]

    QUERY --> HLOOK
    HLOOK -- RecQuery --> CSEL
    ITEMS --> CPICK
    CSEL -.->|"ItemList<br>(if needed)"| CPICK

    HLOOK -- RecQuery --> SCORE
    CPICK -- ItemList --> SCORE
    HLOOK -- RecQuery --> FALLBACK
    CPICK -- ItemList --> FALLBACK
    SCORE -- ItemList --> FILL
    FALLBACK -- ItemList --> FILL
    SCORE -- ItemList --> RANK
    N --> RANK

    RANK --> RESULT
    FILL --> PREDS

    classDef optional stroke-dasharray: 5 5;
    classDef config font-weight:bold,stroke-width:4px;

Pipeline for top-N recommendation and rating prediction, with predictions falling back to a baseline scorer.#
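The fallback step in this diagram can be sketched as a simple fill-in operation. The sketch assumes scores are held in dictionaries and that an item the primary scorer cannot score is marked with NaN; both are assumptions made for illustration, and `fill_missing` is a hypothetical name rather than a LensKit function.

```python
import math


def fill_missing(primary, baseline):
    """Use the primary score when available, else the baseline score."""
    filled = {}
    for item, score in primary.items():
        if score is None or math.isnan(score):
            filled[item] = baseline[item]
        else:
            filled[item] = score
    return filled


primary = {"i1": 4.2, "i2": float("nan"), "i3": 3.1}
baseline = {"i1": 3.5, "i2": 3.7, "i3": 3.4}
predictions = fill_missing(primary, baseline)
```

Note that in the diagram the ranker still reads the primary scorer's output directly; only the prediction output passes through the fallback.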

Component Interface#

Pipeline components are callable objects that can optionally provide configuration, training, and serialization capabilities. In the simplest case, a component that requires no training or configuration can simply be a Python function.

Most components will extend the Component base class to expose configuration capabilities, and implement the lenskit.training.Trainable protocol if they contain a model that needs to be trained.

Components also must be pickleable, as LensKit uses pickling for shared memory parallelism in its batch-inference code.

See Implementing Components for more information on implementing your own components.

Configuring Components#

Unlike components in some other machine learning packages, LensKit components carry their configuration in a separate configuration object that can be serialized to and from JSON-like data structures.

To support configuration, all a component needs to do is (1) extend Component, and (2) declare an instance variable whose type is the configuration object type. This configuration object’s class can be either a Python dataclass (see dataclasses) or a Pydantic model class (see pydantic.BaseModel); in both cases, they are serialized and validated with Pydantic. Component.__init__ will take care of storing the configuration object if one is provided, or instantiating the configuration class with defaults or from keyword arguments. In most cases, you don’t need to define a constructor for a component.

See Configuration Conventions for standard configuration option names.
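The separation of configuration from component can be sketched without LensKit. In the toy below, a standard-library dataclass stands in for the configuration class and a hand-written constructor stands in for what Component.__init__ is described as doing; `ShrinkageConfig` and `ShrinkageScorer` are hypothetical, and no Pydantic validation is performed.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class ShrinkageConfig:
    damping: float = 5.0
    min_count: int = 1


class ShrinkageScorer:
    """Toy component; deliberately not a LensKit ``Component`` subclass."""

    config: ShrinkageConfig

    def __init__(self, config=None, **kwargs):
        # accept a ready-made config object, or build one from keyword arguments
        self.config = config if config is not None else ShrinkageConfig(**kwargs)

    def __call__(self, total, count):
        # damped mean: small samples are pulled toward zero
        if count < self.config.min_count:
            return 0.0
        return total / (count + self.config.damping)


scorer = ShrinkageScorer(damping=10.0)

# the configuration round-trips through JSON-like data
saved = json.dumps(asdict(scorer.config))
restored = ShrinkageScorer(ShrinkageConfig(**json.loads(saved)))
```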

Motivation

We split configuration off into a separate configuration model class, instead of making the options attributes and constructor parameters of the component class itself, for a few reasons:

Pydantic validation ensures that hyperparameters are of correct types (and ranges, if you use more sophisticated Pydantic validations), without needing to write as much manual input validation code in each component.
Declaring parameters as attributes, as keyword parameters to the constructor, and saving them in the attributes is a lot of duplication that increases opportunity for errors.
It’s slightly easier to document configuration parameters, and keep them separate from other potential inputs, when they are in a configuration class.
Using Pydantic models provides consistent serialization of component configurations to and from configuration files.
The base class can provide well-defined and complete string representations for free to all component implementations.

Adding Components to the Pipeline#

You can add components to the pipeline in two ways:

Instantiate the component with its configuration options and pass it to PipelineBuilder.add_component():
```
builder.add_component('component-name', MyComponent(option='value'))
```
When you convert the pipeline into a configuration or clone it, the component will be re-instantiated from its configuration.

Pass the component class and configuration separately to PipelineBuilder.add_component():

builder.add_component('component-name', MyComponent, MyConfig(option='value'))

Alternatively:

builder.add_component('component-name', MyComponent, {'option': 'value'})

When you use the second approach, PipelineBuilder.build() instantiates the component from the provided configuration.
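To make the difference between these forms concrete, here is a toy sketch of how a builder might normalize them into a component instance. `MyComponent` and `MyConfig` are minimal stand-ins for the classes named in the examples above, and `instantiate` is a hypothetical helper, not a LensKit function.

```python
from dataclasses import dataclass


@dataclass
class MyConfig:
    option: str = "default"


class MyComponent:
    def __init__(self, config=None, **kwargs):
        self.config = config if config is not None else MyConfig(**kwargs)

    def __call__(self):
        return self.config.option


def instantiate(component, config=None):
    """Toy version of what a builder might do for each registered component."""
    if not isinstance(component, type):
        return component  # already an instance
    if isinstance(config, dict):
        return component(**config)  # configuration given as a dict
    return component(config)  # configuration object, or None for defaults


a = instantiate(MyComponent(option="value"))
b = instantiate(MyComponent, MyConfig(option="value"))
c = instantiate(MyComponent, {"option": "value"})
```

All three calls end up with an equivalent configuration, which is the property the two approaches are meant to share.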

Modifying Pipelines#

Pipelines, once constructed, are immutable (and modifying the pipeline, its configuration, or its internal data structures is undefined behavior). However, you can create a new pipeline from an existing one with added or changed components. To do this:

Create a builder from the pipeline with Pipeline.modify(), which returns a PipelineBuilder.
Add new components, or replace existing ones with PipelineBuilder.replace_component().
Build the modified pipeline with PipelineBuilder.build().

Pipeline Hooks#

Pipelines support hooks to allow client code to inspect or modify their behavior. Hooks are also used internally to support things like runtime type checking.

Note

As of 2025.3.0 (in progress), only a single hook is supported: component-input run hooks. Future hooks will be added as there is demand. If you want a new hook, file an issue (or send a PR adding it).

Note

As of 2025.3.0 (in progress), only functions can reliably be used as hooks. Support for other callable objects is under consideration but has not yet been implemented or tested.

Installing a hook requires three pieces:

The hook name, which identifies the point in the process to insert the hook.
The hook function, which is called when the pipeline reaches that hook point. Each hook point has an associated protocol defining the call signature for that hook.
The hook priority, which determines the order in which hooks are called. Hooks are run in ascending priority order, and the priority 0 is reserved for LensKit’s internal hooks.

Run Hooks#

Run hooks are called each time the pipeline is run. They can be configured with PipelineBuilder.add_run_hook().

component-input: A component-input hook (see ComponentInputHook) will be called for each component input: a data value being supplied to one of the input parameters of a component. It is called separately for each component input, even if two components have their input wired to the same source node. This facilitates data inspection, type checking, and other checks or analyses that need to run on each edge of the pipeline graph.
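The interplay of hook names, functions, and priorities can be sketched with a toy registry. Everything here is hypothetical: the `HookRegistry` class and the three-argument hook signature are assumptions made for illustration, and the real call signature is the one defined by ComponentInputHook.

```python
from collections import defaultdict


class HookRegistry:
    """Toy registry: hooks for a name run in ascending priority order."""

    def __init__(self):
        self._hooks = defaultdict(list)

    def add(self, name, func, priority=1):
        self._hooks[name].append((priority, func))
        # stable sort keeps registration order among equal priorities
        self._hooks[name].sort(key=lambda entry: entry[0])

    def fire(self, name, *args):
        for _priority, func in self._hooks[name]:
            func(*args)


seen = []  # records the order in which hooks were called


def log_input(component, param, value):
    seen.append(("log", component, param))


def check_not_none(component, param, value):
    seen.append(("check", component, param))
    if value is None:
        raise TypeError(f"{component}.{param} received None")


hooks = HookRegistry()
hooks.add("component-input", log_input, priority=5)
hooks.add("component-input", check_not_none, priority=1)

# fired once per (component, input) pair, even though both inputs
# are wired to the same source value
history = {"user": 42}
for component, param in [("scorer", "query"), ("candidate-selector", "query")]:
    hooks.fire("component-input", component, param, history)
```

The check runs before the logger for each input because its priority value is lower, even though it was registered second.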

POPROX and Other Integrators#

One of LensKit’s design principles is “use the pieces you want”. That extends to the pipeline code — while the pipeline components included with LensKit use LensKit’s data structures like ItemList and RecQuery, the pipeline itself is fully generic. Components can accept and return any types, and the pipeline code makes no assumptions about the kinds of data routed through the pipeline, the structure of the pipeline, or the presence or absence of any particular components. The only aspects of component interface or behavior defined by the pipeline are that:

Pipeline components are callable, and accept their inputs as keyword parameters.
Configurable components extend the Component interface and use Pydantic models to house their configurable options (with its requirements, such as defining a config attribute to store the configuration).
Components can be constructed with either zero arguments or a single configuration model argument.

The exception to this is training support: Pipeline.train() takes a LensKit dataset and trains components implementing the Trainable protocol. But it is entirely possible to handle model training outside of the pipeline and ignore the Pipeline.train() method altogether. You can also use the method with a different input data object; it will fail static typechecking, but Pipeline.train() doesn’t actually care what the type of its first argument is, and will pass it as-is to the component train() methods.

One example of an integrator that uses the pipeline without the rest of LensKit’s data structures is POPROX: the POPROX recommender design uses its own data structures, like a Pydantic-backed ArticleSet, instead of ItemList and friends, and expects components to be pre-trained by other code. It still uses the LensKit pipeline to wire these components together.
