Recommendation Pipelines#
Since version 2025.1 (in progress), LensKit uses a flexible "pipeline" abstraction to wire together components such as candidate selectors, personalized item scorers, and rankers to produce predictions, recommendations, or other recommender system outputs. This is a significant change from the LensKit 0.x design of monolithic and composable components based on the Scikit-Learn API: new recommendation designs can be composed without writing new classes just for the composition, and recommender definition code becomes more explicit by laying out the pipeline instead of burying composition logic in the definitions of different composition classes. The pipeline lives in the `lenskit.pipeline` module, and the primary entry point is the `Pipeline` class.
If all you want to do is build a standard top-N recommendation pipeline from an item scorer, see `topn_pipeline()`. `RecPipelineBuilder` provides more flexibility in configuring a recommendation pipeline with a standard design, and you can always fully configure the pipeline yourself for maximum flexibility.
Todo
Redo some of those types with user & item data, etc.
Todo
Provide utility functions to make more common wiring operations easy so there is middle ground between “give me a standard pipeline” and “make me do everything myself”.
Todo
Rethink the “keyword inputs only” constraint in view of the limitation it places on fallback or other compositional components — it’s hard to specify a component that implements fallback logic for an arbitrary number of inputs.
Pipeline components are not limited to looking things up in training data; they can also query databases, load files, and perform other operations. A runtime pipeline can combine components (especially the scorer) trained from training data with components that query a database or REST services for things like user history and candidate-set lookup.
Acknowledgements#
The LensKit pipeline design is heavily inspired by the pipeline abstraction Karl Higley originally created for POPROX (available in the git history), as well as by Haystack.
Constructing Pipelines#
The simplest way to make a pipeline is to construct one with `topn_pipeline()`; the following will create a top-N recommendation pipeline using implicit-feedback matrix factorization:

```python
from lenskit.als import ImplicitMFScorer
from lenskit.pipeline import topn_pipeline

als = ImplicitMFScorer(50)
pipe = topn_pipeline(als)
```
The `RecPipelineBuilder` class provides a more flexible mechanism for creating standard recommendation pipelines; to implement the same pipeline with that class, do:

```python
from lenskit.als import ImplicitMFScorer
from lenskit.pipeline import RecPipelineBuilder

als = ImplicitMFScorer(50)
builder = RecPipelineBuilder()
builder.scorer(als)
pipe = builder.build('ALS')
```
For maximum flexibility, you can directly construct and wire the pipeline yourself; this is described in Standard Layouts.
After any of these methods, you can run the pipeline to produce recommendations with:
```python
user_recs = pipe.run('recommender', query=user_id)
```
This is also exposed with a convenience function:
```python
from lenskit import recommend

user_recs = recommend(pipe, user_id)
```
Pipeline Model#
A pipeline has two key concepts:

- An *input* is data that must be provided to the pipeline when it is run, such as the user to generate recommendations for. Inputs have specified data types, and it is an error to provide an input value of an unexpected type.
- A *component* processes input data and produces an output. It can be either a Python function or an object (anything that implements the `Component` protocol) that takes its inputs as keyword arguments and returns an output.

These are arranged in a directed acyclic graph, consisting of:

- Nodes (represented by `Node`), which correspond to either inputs or components.
- Connections from one node's input to another node's data (or to a fixed data value). Connections are how the pipeline knows which components depend on other components and how to provide each component with the inputs it requires; see Connections for details.
Each node has a name that can be used to look up the node with `Pipeline.node()` and that appears in serialization and logging situations. Names must be unique within a pipeline.
Connections#
Components declare their inputs as keyword arguments on their call signatures (either the function's call signature, if it is a bare function, or the `__call__` method if it is implemented by a class). In a pipeline, these inputs can be connected to a source, which the pipeline will use to obtain a value for that parameter when running the pipeline. Inputs can be connected to the following types:

- A `Node`, in which case the input will be provided from the corresponding pipeline input or component return value. Nodes are returned by `create_input()` or `add_component()`, and can be looked up after creation with `node()`.
- A Python object, in which case that value will be provided directly to the component's input argument.
These input connections are specified via keyword arguments to the `Pipeline.add_component()` or `Pipeline.connect()` methods: specify the component's input name(s) and the node or data to which each input should be wired.

You can also use `Pipeline.add_default()` to specify default connections. For example, you can specify a default for `user`:
```python
pipe.add_default('user', user_history)
```
With this default in place, if a component has an input named `user` that is not explicitly connected to a node, the `user_history` node will be used to supply its value. Judicious use of defaults can reduce the amount of code needed to wire common pipelines.
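The resolution order can be sketched with a small self-contained function (a toy model of the behavior described above, not LensKit code): an explicit connection always wins, and a default is consulted only when an input was left unconnected.

```python
# Toy sketch of default-connection resolution: explicit wiring takes
# precedence, and defaults fill in any input left unconnected.
# This models the behavior described above; it is not LensKit code.

def resolve_source(input_name, wiring, defaults):
    """Return the node name that should supply the given component input."""
    if input_name in wiring:         # explicitly connected
        return wiring[input_name]
    if input_name in defaults:       # fall back to a pipeline-wide default
        return defaults[input_name]
    raise KeyError(f'input {input_name!r} is not connected')

defaults = {'user': 'user_history'}
wiring = {'items': 'candidate-selector'}

src_items = resolve_source('items', wiring, defaults)  # explicit wins
src_user = resolve_source('user', wiring, defaults)    # default applies
```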
Note
You cannot directly wire an input to another component using only that component's name; if you only have a name, pass it to `node()` to obtain the node. This restriction exists because it would otherwise be impossible to distinguish a string component name from a string data value.
Note
You do not usually need to call `Pipeline.connect()` directly; when possible, provide the wirings when calling `add_component()`.
Execution#
Once configured, a pipeline can be run with `Pipeline.run()`. This method takes two types of inputs:

- Positional arguments specifying the node(s) to run and whose results should be returned. This allows partial runs of pipelines (e.g. to only score items without ranking them), and allows multiple return values to be obtained (e.g. initial item scores and final rankings, which may have altered scores). If no components are specified, it is the same as specifying the last component that was added to the pipeline.
- Keyword arguments specifying the values for the pipeline's inputs, as defined by calls to `create_input()`.
Pipeline execution logically proceeds in the following steps:

1. Determine the full list of pipeline components that need to be run in order to run the specified components.
2. Run those components in order, taking their inputs from pipeline inputs or previous components as specified by the pipeline's connections and defaults.
3. Return the values of the specified components. If a single component is specified, its value is returned directly; if two or more components are specified, their values are returned in a tuple.
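The three steps above can be illustrated with a minimal self-contained sketch (a toy model, not LensKit's implementation): components declare the nodes that feed their keyword inputs, the runner resolves the transitive dependencies of the requested nodes, runs them, and returns either a single value or a tuple.

```python
# Toy illustration of the execution steps above; this is NOT the
# LensKit implementation, just a minimal sketch of the same logic.

def run(components, inputs, *names):
    """components: {name: (func, {param: source_node})}; inputs: {name: value}."""
    cache = dict(inputs)

    def resolve(name):
        # steps 1 and 2: run a node's dependencies, then the node itself
        if name in cache:
            return cache[name]
        func, wiring = components[name]
        kwargs = {param: resolve(src) for param, src in wiring.items()}
        cache[name] = func(**kwargs)
        return cache[name]

    results = tuple(resolve(n) for n in names)
    # step 3: a single requested component's value is returned directly
    return results[0] if len(results) == 1 else results

# hypothetical two-component pipeline: score items, then keep the top 2
components = {
    'scorer': (lambda query, items: {i: i * query for i in items},
               {'query': 'query', 'items': 'items'}),
    'ranker': (lambda scores: sorted(scores, key=scores.get, reverse=True)[:2],
               {'scores': 'scorer'}),
}
top = run(components, {'query': 3, 'items': [1, 2, 3]}, 'ranker')
# requesting two nodes returns both values as a tuple
scores, ranking = run(components, {'query': 3, 'items': [1, 2, 3]}, 'scorer', 'ranker')
```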
Component Names#
As noted above, each component (and pipeline input) has a name that is unique across the pipeline. For consistency and clarity, we recommend naming components with a noun or kebab-case noun phrase that describes the component itself, e.g.:

- recommender
- reranker
- scorer
- history-lookup
- item-embedder
Component nodes can also have aliases, allowing them to be accessed by more than one name. Use `Pipeline.alias()` to define these aliases.
Various LensKit facilities recognize several standard component names used by the standard pipeline builders, and we recommend you use them in your own pipelines when applicable:

- `scorer`: computes (usually personalized) scores for items for a given user.
- `ranker`: computes a (ranked) list of recommendations for a user. If you are configuring a pipeline with rerankers whose outputs are also rankings, this name should usually be given to the last such ranker, with downstream components (if any) transforming that ranking into another layout; that way the evaluation tools will operate on the last ranking.
- `recommender`: computes recommendations for a user. This will often be an alias for `ranker`, as in a top-N recommender, but may return other formats such as grids or unordered slates.
- `rating-predictor`: predicts a user's ratings for the specified items. When present, this may be an alias for `scorer`, or it may be another component that fills in missing scores with a baseline prediction.

These component names replace the task-specific interfaces of pre-2025 LensKit; a `Recommender` is now just a pipeline with `recommender` and/or `ranker` components.
Pipeline Serialization#
Pipelines are defined by the following:

- The components and inputs (nodes)
- The component input connections (edges)
- The component configurations (see `Configurable` and `Component`)
- The components' learned parameters (see `Trainable`)
LensKit supports serializing both pipeline descriptions (components, connections, and configurations) and pipeline parameters. There are a few ways to save a pipeline or parts of it:

- Pickle the entire pipeline. This is easy and saves everything in the pipeline, but it has the usual downsides of pickling (arbitrary code execution, etc.). LensKit uses pickling to share pipelines with worker processes for parallel batch operations.
- Save the pipeline configuration with `Pipeline.save_config()`. This saves the components, their configurations, and their connections, but not any learned parameter data. A new pipeline can be reconstructed from such a configuration with `Pipeline.from_config()`.
- Save the pipeline parameters with `Pipeline.save_params()`. This saves the learned parameters but not the configuration or connections. The parameters can be reloaded into a compatible pipeline with `Pipeline.load_params()`; a compatible pipeline can be created by running the same pipeline setup code or by using a saved pipeline configuration.

These approaches can be mixed and matched: if you pickle an untrained pipeline, you can unpickle it and use `load_params()` to infuse it with parameters. Component implementations need to support the configuration and/or parameter values, as needed, in addition to functioning correctly with pickle (no specific logic is usually needed for this).
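The split between configuration and learned parameters can be sketched with a toy component (illustrative names and formats, not LensKit's actual API): the configuration is small, JSON-serializable data that reconstructs an untrained component, while the learned parameters are saved and restored separately.

```python
import json
import pickle

# Toy sketch of the configuration/parameter split described above;
# the class and method names here are illustrative, not LensKit's API.

class ToyScorer:
    def __init__(self, features):
        self.features = features  # configuration: JSON-serializable settings
        self.weights = None       # learned parameters: filled in by training

    def get_config(self):
        return {'features': self.features}

    @classmethod
    def from_config(cls, cfg):
        return cls(**cfg)

# train (stubbed) and save the configuration without any learned data
scorer = ToyScorer(50)
scorer.weights = [0.1, 0.2, 0.3]
cfg_text = json.dumps(scorer.get_config())

# reconstruct an untrained component from the saved configuration...
clone = ToyScorer.from_config(json.loads(cfg_text))
# ...and infuse it with separately-saved parameters
clone.weights = pickle.loads(pickle.dumps(scorer.weights))
```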
LensKit knows how to safely save the following object types from `Trainable.get_params()`:

- `torch.Tensor` (dense, CSR, and COO tensors)
- `scipy.sparse.csr_array`, `coo_array`, `csc_array`, and the corresponding `*_matrix` versions

Other objects (including Pandas data frames) are serialized by pickling, and the pipeline will emit a warning (or fail, if `allow_pickle=False` is passed to `save_params()`).

Note
The load/save parameter operations are modeled after PyTorch's `state_dict()` and the needs of safetensors.
Standard Layouts#
The standard recommendation pipeline, produced by either of the approaches described above in Constructing Pipelines, is equivalent to the following pipeline code:
```python
pipe = Pipeline()
# define an input parameter for the user ID (the 'query')
query = pipe.create_input('query', ID)
# allow candidate items to be optionally specified
items = pipe.create_input('items', ItemList, None)
# look up a user's history in the training data
history = pipe.add_component('history-lookup', LookupTrainingHistory(), query=query)
# find candidates from the training data
default_candidates = pipe.add_component(
    'candidate-selector',
    UnratedTrainingItemsCandidateSelector(),
    query=history,
)
# if the client provided items as a pipeline input, use those; otherwise
# use the candidate selector we just configured.
candidates = pipe.use_first_of('candidates', items, default_candidates)
# score the candidate items using the specified scorer
score = pipe.add_component('scorer', scorer, query=query, items=candidates)
# rank the items by score
recommend = pipe.add_component('ranker', TopNRanker(50), items=score)
pipe.alias('recommender', recommend)
```
If we also want to emit rating predictions, with fallback to a baseline model to predict ratings for items the primary scorer cannot score (e.g. they are not in an item neighborhood), we use the pipeline that `RecPipelineBuilder` creates when rating prediction is enabled.
Component Interface#
Pipeline components are callable objects that can optionally provide configuration, training, and serialization capabilities. In the simplest case, a component that requires no training or configuration can simply be a Python function; most components will extend the `Component` base class to expose configuration capabilities, and implement the `Trainable` protocol if they contain a model that needs to be trained.

Components must also be picklable, as LensKit uses pickling for shared-memory parallelism in its batch-inference code.
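As a sketch of the simplest case, a component can be a plain function whose keyword arguments name its inputs. The toy scorer below uses hypothetical names and plain Python types (rather than LensKit's data classes such as ItemList) to stay self-contained:

```python
# A toy function component: its keyword arguments ('query' and 'items')
# are the input names that a pipeline would wire connections to. A real
# LensKit component would consume and return LensKit data types; plain
# Python types are used here to keep the sketch self-contained.

def popularity_scorer(*, query: int, items: list[int]) -> dict[int, float]:
    """Score each candidate item; here, a trivial stand-in scoring rule."""
    return {item: float(item + query) for item in items}

# Components are ordinary callables, so they can also be invoked directly:
scores = popularity_scorer(query=10, items=[1, 2, 3])
```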