MovieLens Data

The MovieLens data sets are a widely used collection of movie rating data sets, available from the GroupLens dataset collection. The core of each data set is a matrix of user-provided 5-star ratings of movies, along with movie metadata such as titles and IMDb links. Some data sets also include user demographics, and others include various forms of tag data.

Loading MovieLens Data

The load_movielens() function will load any published MovieLens dataset, constructing a LensKit Dataset with its contents. This dataset can then be split, saved in LensKit native format, used to train models and pipelines, etc.

This function automatically detects which MovieLens dataset is being loaded, and can load it either from the Zip archive published by GroupLens or from a directory where the archive has been unpacked.
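
As a rough illustration of how such detection could work, the sketch below inspects which rating file a Zip archive or an unpacked directory contains. It is a hypothetical example, not LensKit's actual implementation: the helper names `member_names` and `guess_layout` and the returned labels are invented here, and only the file names (`u.data`, `ratings.dat`, `ratings.csv`) come from the published MovieLens archives.

```python
import zipfile
from pathlib import Path, PurePosixPath


def member_names(path):
    """Return the base names of the files in a Zip archive or a directory."""
    path = Path(path)
    if path.is_dir():
        return {p.name for p in path.iterdir() if p.is_file()}
    with zipfile.ZipFile(path) as zf:
        # Archive members are usually nested in a folder such as "ml-1m/",
        # so only the final path component is kept.
        return {
            PurePosixPath(info.filename).name
            for info in zf.infolist()
            if not info.is_dir()
        }


def guess_layout(names):
    """Guess the MovieLens file layout from the files that are present."""
    if "u.data" in names:
        return "ML100K layout"  # tab-separated u.data, u.item, u.user
    if "ratings.dat" in names:
        return "ML1M/ML10M layout"  # '::'-separated .dat files
    if "ratings.csv" in names:
        return "CSV layout"  # ML20M, ML25M, and the "latest" data sets
    raise ValueError("no recognizable MovieLens rating file found")
```

Calling `guess_layout(member_names(path))` then works the same way whether `path` points at an archive or at a directory.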

MovieLens Data Model

The MovieLens loader loads the data into the standard user and item entities, with a rating interaction class storing the user-provided ratings. The items have the following attributes:

title

The movie title.

genres

A list of genres for this movie.

tag_counts

A sparse vector attribute storing the number of times each tag has been applied to this movie. It is a summary of the tags data provided by MovieLens. The tag names themselves are available as the attribute’s names.

tag_genome

A vector attribute storing the relevance values from the tag genome [VSR12], when it is available (ML20M and ML25M).
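
To make the genres and tag_counts attributes more concrete, the following sketch derives both from raw MovieLens-style records using only the Python standard library. It is not LensKit's loader code: the sample rows are made up, and the helper names `split_genres` and `count_tags` are hypothetical. It assumes the pipe-separated genre strings and the (user, movie, tag) tag applications found in the MovieLens CSV files.

```python
from collections import Counter, defaultdict


def split_genres(field):
    """Turn a pipe-separated genre string into a list of genres."""
    # MovieLens CSV files use this placeholder for movies without genres;
    # mapping it to an empty list is a choice made for this sketch.
    if field == "(no genres listed)":
        return []
    return field.split("|")


def count_tags(applications):
    """Count how often each tag was applied to each movie.

    `applications` is an iterable of (user_id, movie_id, tag) tuples.
    """
    counts = defaultdict(Counter)
    for _user, movie, tag in applications:
        counts[movie][tag] += 1
    return counts


# Made-up sample data in the shape of movies.csv / tags.csv rows.
genres = split_genres("Adventure|Animation|Children")
tag_counts = count_tags([
    (10, 1, "pixar"),
    (11, 1, "pixar"),
    (11, 1, "funny"),
    (12, 2, "fantasy"),
])
print(genres)
print(dict(tag_counts[1]))
```

Each movie's counter is sparse in the sense that it stores only the tags actually applied to that movie.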

For most data sets, there are no user attributes; ML100K and ML1M have gender, age, and zip_code attributes. See the MovieLens data documentation for details on these.

Ratings have two attributes: rating and timestamp. The timestamps are parsed into Arrow/NumPy/Pandas timestamps.
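
MovieLens rating files record timestamps as integer seconds since the Unix epoch (UTC). The sketch below parses one `::`-separated rating record of the kind found in ML1M and converts its timestamp to a timezone-aware `datetime` with the standard library. It only illustrates the conversion: LensKit itself produces Arrow/NumPy/Pandas timestamps, and the sample record and the helper name `parse_rating` are assumptions made for this example.

```python
from datetime import datetime, timezone


def parse_rating(line):
    """Parse a 'user::item::rating::timestamp' record into a dict."""
    user, item, rating, ts = line.strip().split("::")
    return {
        "user_id": int(user),
        "item_id": int(item),
        "rating": float(rating),
        # Timestamps are seconds since 1970-01-01 00:00:00 UTC.
        "timestamp": datetime.fromtimestamp(int(ts), tz=timezone.utc),
    }


record = parse_rating("1::1193::5::978300760")
print(record["rating"], record["timestamp"].isoformat())
```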

Todo

A future version of LensKit will likely introduce tags as first-class entities.