k-NN Collaborative Filtering

LKPY provides user- and item-based classical k-NN collaborative filtering implementations. These lightly configurable implementations are intended to capture the behavior of the Java-based LensKit implementations, both to provide a good upgrade path and to enable basic experiments out of the box.

These algorithms have two primary modes. With explicit feedback (rating values), you usually want the defaults of weighted-average aggregation and mean-centering normalization. This is the default mode, and can be selected explicitly by passing feedback='explicit' to the class constructor.

With implicit feedback (unary data such as clicks and purchases, typically represented with rating values of 1 for positive items), the usual design is sum aggregation and no centering. This can be selected with feedback='implicit', which also configures the algorithm to ignore rating values (when present) and treat every rating as 1:

from lenskit.algorithms.knn import ItemItem
implicit_knn = ItemItem(20, feedback='implicit')

Mean-centering data in which every value is identical (all 1, for example) leaves nothing but zeros and will typically produce invalid results. ItemItem has diagnostics to warn you about this.
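The problem with centering unary data can be seen with plain arithmetic. The following is a minimal sketch, not LensKit code; mean_center is a hypothetical helper introduced only for illustration.

```python
import math


def mean_center(values):
    """Subtract the vector's mean from every value."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


explicit_ratings = [4.0, 2.0, 5.0, 1.0]  # ordinary rating values, mean 3.0
unary_ratings = [1.0, 1.0, 1.0, 1.0]     # implicit feedback: every entry is 1

centered_explicit = mean_center(explicit_ratings)
centered_unary = mean_center(unary_ratings)

# The centered unary vector has zero length, so a cosine similarity
# computed from it would divide by zero.
unary_norm = math.sqrt(sum(v * v for v in centered_unary))

print(centered_explicit)
print(centered_unary, unary_norm)
```

After centering, the unary vector carries no signal at all, which is why implicit-feedback mode turns centering off.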

The feedback option only sets defaults; the algorithm can be further configured (e.g. to re-enable rating values) with additional parameters to the constructor.
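One way to picture "feedback only sets defaults" is a preset table whose entries are then overridden by explicit keyword arguments. This is a sketch under that assumption; FEEDBACK_DEFAULTS and resolve_options are hypothetical names and do not describe how LensKit implements the option internally.

```python
# Hypothetical preset table mirroring the defaults described in this section.
FEEDBACK_DEFAULTS = {
    "explicit": {"center": True, "aggregate": "weighted-average", "use_ratings": True},
    "implicit": {"center": False, "aggregate": "sum", "use_ratings": False},
}


def resolve_options(feedback="explicit", **overrides):
    """Start from the preset for `feedback`, then apply individual overrides."""
    if feedback not in FEEDBACK_DEFAULTS:
        raise ValueError(f"unknown feedback mode: {feedback!r}")
    unknown = set(overrides) - {"center", "aggregate", "use_ratings"}
    if unknown:
        raise TypeError(f"unexpected options: {sorted(unknown)}")
    options = dict(FEEDBACK_DEFAULTS[feedback])
    options.update(overrides)
    return options


implicit_opts = resolve_options("implicit")
# Implicit-style aggregation, but with rating values re-enabled.
mixed_opts = resolve_options("implicit", use_ratings=True)
print(implicit_opts)
print(mixed_opts)
```

With the real class, the second case corresponds to passing the extra parameter alongside the mode, for example ItemItem(20, feedback='implicit', use_ratings=True).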

Added in version 0.14: The feedback option and the ability to ignore rating values were added in LensKit 0.14. In previous versions, you needed to configure each option individually.

Item-based k-NN

This is LensKit’s item-based k-NN model, based on the description by Deshpande and Karypis [DK04].

class lenskit.algorithms.knn.ItemItem(nnbrs, min_nbrs=1, min_sim=1e-06, save_nbrs=None, feedback='explicit', block_size=250, **kwargs)

Bases: Predictor

Item-item nearest-neighbor collaborative filtering with ratings. This item-item implementation is not terribly configurable; it hard-codes design decisions found to work well in the previous Java-based LensKit code [ELKR11]. This implementation is based on the description of item-based CF by Deshpande and Karypis [DK04], and produces results equivalent to Java LensKit.

The k-NN predictor supports several aggregate functions:

weighted-average

The weighted average of the user’s rating values, using item-item similarities as weights.

sum

The sum of the similarities between the target item and the user’s rated items, regardless of the rating the user gave the items.
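The two aggregates can be written out in a few lines. This is a minimal sketch rather than the library's code: weighted_average and similarity_sum are hypothetical helpers, and the example assumes the similarities are positive (as the min_sim threshold implies) and omits the mean-centering step used in explicit-feedback mode.

```python
def weighted_average(similarities, ratings):
    """Similarity-weighted mean of the user's rating values."""
    total_weight = sum(similarities)
    if total_weight == 0:
        return None  # no usable neighbors
    return sum(s * r for s, r in zip(similarities, ratings)) / total_weight


def similarity_sum(similarities):
    """Sum of similarities; the rating values play no role."""
    return sum(similarities)


# Similarities between the target item and three items the user rated,
# together with the ratings the user gave those items.
sims = [0.5, 0.25, 0.25]
user_ratings = [4.0, 2.0, 3.0]

wa_score = weighted_average(sims, user_ratings)  # (2.0 + 0.5 + 0.75) / 1.0
sum_score = similarity_sum(sims)

print(wa_score, sum_score)
```

The weighted average lands on the rating scale, while the sum is an unbounded relevance score that grows with the number and strength of neighbors, which suits ranking with implicit feedback.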

Parameters:
  • nnbrs (int) – the maximum number of neighbors for scoring each item (None for unlimited)

  • min_nbrs (int) – the minimum number of neighbors for scoring each item

  • min_sim (float) – Minimum similarity threshold for considering a neighbor. Must be positive; values below the smallest 32-bit normal (\(1.175 \times 10^{-38}\)) are clamped to that value.

  • save_nbrs (int | None) – the number of neighbors to save per item in the trained model (None for unlimited)

  • feedback (Literal['explicit', 'implicit']) –

    Control how feedback should be interpreted. Specifies defaults for the other settings, which can be overridden individually; can be one of the following values:

    explicit

    Configure for explicit-feedback mode: use rating values, center ratings, and use the weighted-average aggregate method for prediction. This is the default setting.

    implicit

    Configure for implicit-feedback mode: ignore rating values, do not center ratings, and use the sum aggregate method for prediction.

  • center – whether to normalize (mean-center) rating vectors prior to computing similarities and aggregating user rating values. Defaults to True; turn this off when working with unary data and other data types that don’t respond well to centering.

  • aggregate – the type of aggregation to do. Can be weighted-average (the default) or sum.

  • use_ratings – whether or not to use the rating values. If False, it ignores rating values and considers an implicit feedback signal of 1 for every (user,item) pair present.

  • block_size (int)
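The interaction of nnbrs, min_nbrs, and min_sim can be sketched as a filter, sort, and truncate step. This is an illustration under stated assumptions, not the library's implementation: clamp_min_sim and select_neighbors are hypothetical helpers, the threshold comparison is assumed to be inclusive, and returning None when too few neighbors remain is a choice made for this sketch.

```python
# Smallest positive normal 32-bit float, approximately 1.175e-38.
FLOAT32_SMALLEST_NORMAL = 2.0 ** -126


def clamp_min_sim(min_sim):
    """Raise very small thresholds to the smallest 32-bit normal value."""
    return max(min_sim, FLOAT32_SMALLEST_NORMAL)


def select_neighbors(candidates, nnbrs, min_nbrs=1, min_sim=1e-6):
    """Pick the neighbors used to score one item.

    candidates is a list of (neighbor_id, similarity) pairs.
    """
    threshold = clamp_min_sim(min_sim)
    usable = [(n, s) for n, s in candidates if s >= threshold]
    usable.sort(key=lambda pair: pair[1], reverse=True)
    if nnbrs is not None:  # None means unlimited
        usable = usable[:nnbrs]
    if len(usable) < min_nbrs:
        return None  # too few neighbors to score this item
    return usable


candidates = [("a", 0.9), ("b", 0.0), ("c", 0.4), ("d", 0.7)]
top_two = select_neighbors(candidates, nnbrs=2)
too_few = select_neighbors(candidates, nnbrs=None, min_nbrs=4)
print(top_two, too_few)
```

Here neighbor "b" is dropped by the similarity threshold, so only three candidates remain and a min_nbrs of 4 cannot be satisfied.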

IGNORED_PARAMS = ['feedback']

Names of parameters to ignore in get_params().

EXTRA_PARAMS = ['center', 'aggregate', 'use_ratings']

Names of extra parameters to include in get_params(). Useful when the constructor takes **kwargs.
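To illustrate why such lists are useful, the sketch below shows one plausible way a get_params() method could combine constructor parameters with these two class attributes. SketchAlgorithm is a hypothetical stand-in written for this example; it is an assumption about the pattern, not a copy of LensKit's base class.

```python
import inspect


class SketchAlgorithm:
    """Hypothetical class that takes some options through **kwargs."""

    IGNORED_PARAMS = ["feedback"]
    EXTRA_PARAMS = ["center", "aggregate", "use_ratings"]

    def __init__(self, nnbrs, feedback="explicit", **kwargs):
        implicit = feedback == "implicit"
        self.nnbrs = nnbrs
        self.feedback = feedback
        self.center = kwargs.get("center", not implicit)
        self.aggregate = kwargs.get("aggregate", "sum" if implicit else "weighted-average")
        self.use_ratings = kwargs.get("use_ratings", not implicit)

    def get_params(self):
        signature = inspect.signature(type(self).__init__)
        names = [
            name
            for name, param in signature.parameters.items()
            if name != "self" and param.kind is not inspect.Parameter.VAR_KEYWORD
        ]
        # Drop ignored names, then add the options that arrive via **kwargs.
        names = [n for n in names if n not in self.IGNORED_PARAMS]
        names += list(self.EXTRA_PARAMS)
        return {name: getattr(self, name) for name in names}


params = SketchAlgorithm(20, feedback="implicit").get_params()
clone = SketchAlgorithm(**params)  # rebuild from the resolved settings
print(params)
```

In this sketch the feedback value is redundant once center, aggregate, and use_ratings have been resolved, so leaving it out still lets the object be reconstructed faithfully.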

item_index_: Index

The index of item IDs.

item_means_: Tensor | None

Mean rating for each known item.

item_counts_: Tensor

Number of saved neighbors for each item.

sim_matrix_: Tensor

Similarity matrix (sparse CSR tensor).

user_index_: Index

Index of user IDs.

rating_matrix_: Tensor

Normalized rating matrix to look up user ratings at prediction time.

fit(ratings, **kwargs)

Train a model.

The model-training process depends on save_nbrs and min_sim, but not on other algorithm parameters.

Parameters:

ratings (DataFrame) – (user,item,rating) data for computing item similarities.
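The role of min_sim and save_nbrs during training can be sketched with a small pure-Python model builder. This is an illustration only: it assumes cosine similarity between item rating vectors, uses dictionaries in place of sparse tensors, and the names cosine and train_neighborhoods are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as {user: value}."""
    dot = sum(value * v[user] for user, value in u.items() if user in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def train_neighborhoods(item_vectors, min_sim=1e-6, save_nbrs=None):
    """Keep, for each item, its most similar items at or above min_sim."""
    model = {}
    for item, vec in item_vectors.items():
        sims = []
        for other, other_vec in item_vectors.items():
            if other == item:
                continue
            sim = cosine(vec, other_vec)
            if sim >= min_sim:
                sims.append((other, sim))
        sims.sort(key=lambda pair: pair[1], reverse=True)
        model[item] = sims if save_nbrs is None else sims[:save_nbrs]
    return model


# Unary item vectors: which users interacted with each item.
item_vectors = {
    "i1": {"u1": 1.0, "u2": 1.0},
    "i2": {"u1": 1.0, "u2": 1.0},
    "i3": {"u2": 1.0, "u3": 1.0},
    "i4": {"u4": 1.0},
}
full_model = train_neighborhoods(item_vectors)
small_model = train_neighborhoods(item_vectors, save_nbrs=1)
print([name for name, _ in full_model["i1"]])
print([name for name, _ in small_model["i1"]])
```

Lowering save_nbrs shrinks the stored model at the cost of discarding weaker neighbors that prediction could otherwise have used.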

predict_for_user(user, items, ratings=None)

Compute predictions for a user and items.

Parameters:
  • user – the user ID

  • items (array-like) – the items to predict

  • ratings (pandas.Series) – the user’s ratings (indexed by item id); if provided, they may be used to override or augment the model’s notion of a user’s preferences.

Returns:

scores for the items, indexed by item id.

Return type:

pandas.Series

User-based k-NN

class lenskit.algorithms.knn.UserUser(nnbrs, min_nbrs=1, min_sim=1e-06, feedback='explicit', **kwargs)

Bases: Predictor

User-user nearest-neighbor collaborative filtering with ratings. This user-user implementation is not terribly configurable; it hard-codes design decisions found to work well in the previous Java-based LensKit code.

Parameters:
  • nnbrs (int) – the maximum number of neighbors for scoring each item (None for unlimited)

  • min_nbrs (int) – the minimum number of neighbors for scoring each item

  • min_sim (float) – Minimum similarity threshold for considering a neighbor. Must be positive; values below the smallest 32-bit normal (\(1.175 \times 10^{-38}\)) are clamped to that value.

  • feedback (FeedbackType) –

    Control how feedback should be interpreted. Specifies defaults for the other settings, which can be overridden individually; can be one of the following values:

    explicit

    Configure for explicit-feedback mode: use rating values, center ratings, and use the weighted-average aggregate method for prediction. This is the default setting.

    implicit

    Configure for implicit-feedback mode: ignore rating values, do not center ratings, and use the sum aggregate method for prediction.

  • center – whether to normalize (mean-center) rating vectors. Turn this off when working with unary data and other data types that don’t respond well to centering.

  • aggregate – the type of aggregation to do. Can be weighted-average or sum.

  • use_ratings – whether or not to use rating values; default is True. If False, it ignores rating values and treats every present rating as 1.
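The explicit-feedback configuration described by these parameters can be sketched end to end in plain Python. This is a simplified illustration, not the library's implementation: it assumes cosine similarity over mean-centered user vectors, and the helpers center, cosine, and predict are hypothetical.

```python
import math


def user_mean(ratings):
    return sum(ratings.values()) / len(ratings)


def center(ratings):
    """Mean-center one user's ratings ({item: rating})."""
    mean = user_mean(ratings)
    return {item: r - mean for item, r in ratings.items()}


def cosine(u, v):
    dot = sum(value * v[item] for item, value in u.items() if item in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def predict(target, item, all_ratings, nnbrs, min_nbrs=1, min_sim=1e-6):
    """Weighted average of neighbors' centered ratings, plus the user's mean."""
    target_centered = center(all_ratings[target])
    neighbors = []
    for other, ratings in all_ratings.items():
        if other == target or item not in ratings:
            continue
        other_centered = center(ratings)
        sim = cosine(target_centered, other_centered)
        if sim >= min_sim:
            neighbors.append((sim, other_centered[item]))
    neighbors.sort(key=lambda pair: pair[0], reverse=True)
    neighbors = neighbors[:nnbrs]  # a None limit keeps every neighbor
    if len(neighbors) < min_nbrs:
        return None
    offset = sum(s * r for s, r in neighbors) / sum(s for s, _ in neighbors)
    return user_mean(all_ratings[target]) + offset


all_ratings = {
    "alice": {"a": 5.0, "b": 1.0},
    "bob": {"a": 5.0, "b": 2.0, "c": 5.0},    # similar tastes to alice
    "carol": {"a": 1.0, "b": 5.0, "c": 3.0},  # opposite tastes, filtered out
}
score = predict("alice", "c", all_ratings, nnbrs=20)
unscored = predict("alice", "c", all_ratings, nnbrs=20, min_nbrs=2)
print(score, unscored)
```

Only bob passes the similarity threshold, so alice's score for item "c" is her own mean rating shifted by bob's above-average rating for that item; requiring two neighbors leaves the item unscored.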

IGNORED_PARAMS = ['feedback']

Names of parameters to ignore in get_params().

EXTRA_PARAMS = ['center', 'aggregate', 'use_ratings']

Names of extra parameters to include in get_params(). Useful when the constructor takes **kwargs.

user_index_: pd.Index[Any]

The index of user IDs.

item_index_: pd.Index[Any]

The index of item IDs.

user_means_: torch.Tensor | None

Mean rating for each known user.

user_vectors_: torch.Tensor

Normalized rating matrix (CSR) to find neighbors at prediction time.

user_ratings_: torch.Tensor

Centered but un-normalized rating matrix (COO) to find neighbor ratings.

fit(ratings, **kwargs)

“Train” a user-user CF model. This memorizes the rating data in a format that is usable for future computations.

Parameters:

ratings (pandas.DataFrame) – (user, item, rating) data for collaborative filtering.

Return type:

Self

predict_for_user(user, items, ratings=None)

Compute predictions for a user and items.

Parameters:
  • user – the user ID

  • items (array-like) – the items to predict

  • ratings (pandas.Series) – the user’s ratings (indexed by item id); if provided, will be used to recompute the user’s bias at prediction time.

Returns:

scores for the items, indexed by item id.

Return type:

pandas.Series