Evaluating Rating Predictions#

While rating prediction is no longer a widely-studied recommendation task, LensKit provides support for evaluating rating predictions, both for completeness and to reproduce or compare against historical research.

LensKit provides two prediction accuracy metrics: RMSE() and MAE(). They support both global (micro-averaged) and per-user (macro-averaged) computation.

Changed in version 2025.1: The prediction accuracy metric interface has changed to use item lists.

Calling Metrics#

There are two ways to directly call a prediction accuracy metric, both sketched in the example after this list:

  • Pass two item lists, the first containing predicted ratings (as the list’s scores()) and the second containing ground-truth ratings as a rating field.

  • Pass a single item list with scores and a rating field.
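For example, the two call forms might look like the following (a minimal sketch; it assumes RMSE is importable from lenskit.metrics and that ItemList accepts item_ids, scores, and a rating field as keyword arguments):

from lenskit.data import ItemList
from lenskit.metrics import RMSE

# the prediction list carries scores; the truth list carries a `rating` field
preds = ItemList(item_ids=[10, 20, 30], scores=[3.5, 4.2, 2.8])
truth = ItemList(item_ids=[10, 20, 30], rating=[4.0, 4.5, 3.0])
RMSE(preds, truth)

# single-list form: one item list with both scores and a rating field
both = ItemList(item_ids=[10, 20, 30], scores=[3.5, 4.2, 2.8], rating=[4.0, 4.5, 3.0])
RMSE(both)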

For evaluation, you will usually want to use RunAnalysis, which takes care of calling the prediction metric for you.
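For instance, a bulk prediction-accuracy analysis might look roughly like this (a sketch under assumptions: RunAnalysis, add_metric, measure, and list_summary are used as in the bulk-analysis API, and predictions and test_data are placeholders for your generated predictions and the held-out test ratings):

from lenskit.metrics import MAE, RMSE, RunAnalysis

# register the prediction accuracy metrics to compute
analysis = RunAnalysis()
analysis.add_metric(RMSE())
analysis.add_metric(MAE())

# `predictions` and `test_data` are placeholders for the pipeline's
# per-user prediction outputs and the corresponding test data
results = analysis.measure(predictions, test_data)

# summarize the per-user (macro-averaged) metric values
print(results.list_summary())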

Missing Data#

There are two important missing data cases for evaluating predictions:

  • Missing predictions (the test data has a rating for which the system could not generate a prediction).

  • Missing ratings (the system generated a prediction with no corresponding test rating).

By default, LensKit raises an error in both of these cases, to help you catch bad configurations. We recommend using a fallback predictor, such as setting up FallbackScorer with BiasScorer, when measuring rating predictions, so that all items are scored. The alternative design of ignoring missing predictions means that different scorers may be evaluated on different sets of items, and a scorer can appear to perform exceptionally well simply by only scoring the items it can predict with high confidence.

Note

Both topn_pipeline() and predict_pipeline() default to using a BiasScorer() as a fallback when rating prediction is enabled.
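For example, a pipeline built with rating prediction enabled already includes that fallback (a sketch; BiasScorer here merely stands in for whatever scoring component you are evaluating, and it is assumed that topn_pipeline is importable from lenskit.pipeline and that predicts_ratings is the flag that enables rating prediction):

from lenskit.basic import BiasScorer
from lenskit.pipeline import topn_pipeline

# the scorer is only a stand-in; enabling rating prediction also
# wires in the default BiasScorer fallback described above
pipe = topn_pipeline(BiasScorer(), predicts_ratings=True)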

If you want to skip missing predictions, pass missing_scores="ignore" to the metric function:

RMSE(user_preds, user_ratings, missing_scores="ignore")

The corresponding missing_truth="ignore" option causes the metric to ignore predictions with no corresponding test rating (this case is unlikely to make your evaluation results misleading, but it may indicate a misconfiguration in how you determine the items to score).
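For example, mirroring the call above with the same hypothetical user_preds and user_ratings item lists:

RMSE(user_preds, user_ratings, missing_truth="ignore")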