Data Set Utilities

The lenskit.datasets module provides utilities for reading a variety of data sets commonly used with LensKit. It does not package or automatically download them; instead, it loads them from a local directory where you have unpacked the data set. Each data set class or function takes a path parameter specifying the location of the data set.

The normal mode of operation for these utilities is to provide a class for the data set; this class then exposes the data set’s data as attributes. These attributes are cached internally, so e.g. accessing MovieLens.ratings twice will only load the data file once.
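The load-once behavior described above can be illustrated with a minimal sketch. It is not LensKit's implementation: TinyDataSet, load_count, and the throwaway ratings.csv are hypothetical names introduced here, and functools.cached_property is only one way to get this effect.

```python
import csv
import tempfile
from functools import cached_property
from pathlib import Path


class TinyDataSet:
    """Hypothetical stand-in for a data set class (not LensKit code)."""

    def __init__(self, path):
        self.path = Path(path)
        self.load_count = 0  # counts how many times the file is actually read

    @cached_property
    def ratings(self):
        # Runs on first access only; the result is then stored on the instance.
        self.load_count += 1
        with open(self.path / "ratings.csv", newline="") as f:
            return list(csv.DictReader(f))


# Build a throwaway data set directory containing one small, made-up file.
tmp_dir = Path(tempfile.mkdtemp())
(tmp_dir / "ratings.csv").write_text("user,item,rating\n1,31,2.5\n1,1029,3.0\n")

ds = TinyDataSet(tmp_dir)
first = ds.ratings
second = ds.ratings
print(ds.load_count)    # 1
print(first is second)  # True
```

The second attribute access returns the same object without touching the file again, which is the behavior the paragraph above describes for attributes such as MovieLens.ratings.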

The loaded data tables use normalized column names that follow LensKit’s general conventions:

  • User ID columns are called user.

  • Item ID columns are called item.

  • Rating columns are called rating.

  • Timestamp columns are called timestamp.

Other column names are unchanged. Data tables that provide information about specific things, such as a table of movie titles, are indexed by the relevant ID (e.g. MovieLens.movies is indexed by item).
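As a rough illustration of these conventions, the sketch below renames the raw MovieLens CSV headers (userId, movieId) to the normalized names and indexes the movie table by item. It is an assumption-laden sketch rather than LensKit's loader: COLUMN_NAMES is a hypothetical helper mapping, and the rating rows are made up for the example.

```python
import io

import pandas as pd

# Toy stand-ins for ratings.csv and movies.csv with the raw MovieLens headers.
# The rating rows are made up; the movie rows match the examples on this page.
raw_ratings = io.StringIO(
    "userId,movieId,rating,timestamp\n"
    "1,1,4.0,1260759144\n"
    "1,2,3.0,1260759179\n"
    "2,1,5.0,1260759182\n"
)
raw_movies = io.StringIO(
    "movieId,title,genres\n"
    "1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy\n"
    "2,Jumanji (1995),Adventure|Children|Fantasy\n"
)

# Hypothetical mapping from raw headers to the normalized names.
COLUMN_NAMES = {"userId": "user", "movieId": "item"}

ratings = pd.read_csv(raw_ratings).rename(columns=COLUMN_NAMES)
movies = pd.read_csv(raw_movies).rename(columns=COLUMN_NAMES).set_index("item")

print(list(ratings.columns))  # ['user', 'item', 'rating', 'timestamp']
print(movies.index.name)      # item

# Because movies is indexed by item, titles can be attached via that column.
titled = ratings.join(movies["title"], on="item")
print(titled["title"].tolist())
```

Indexing the descriptive tables by ID is what makes the final join a one-liner: the rating table keeps item as an ordinary column, and the movie table is looked up by its index.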

MovieLens Data Sets

The GroupLens research group provides several data sets extracted from the MovieLens service [ML]. These can be downloaded from https://grouplens.org/datasets/movielens/.

class lenskit.datasets.MovieLens(path='data/ml-20m')

Code for reading current MovieLens data sets, including ML-20M, ML-Latest, and ML-Latest-Small.

Parameters

path (str or pathlib.Path) – Path to the directory containing the data set.

links

The movie link table, connecting movie IDs to external identifiers. It is indexed by movie ID.

>>> mlsmall = MovieLens('ml-latest-small')
>>> mlsmall.links
         imdbId  tmdbId
item
1        114709     862
2        113497    8844
3        113228   15602
4        114885   31357
5        113041   11862
...
[9125 rows x 2 columns]

movies

The movie table, with titles and genres. It is indexed by movie ID.

>>> mlsmall = MovieLens('ml-latest-small')
>>> mlsmall.movies
                                                    title                                           genres
item
1                                        Toy Story (1995)      Adventure|Animation|Children|Comedy|Fantasy
2                                          Jumanji (1995)                       Adventure|Children|Fantasy
3                                 Grumpier Old Men (1995)                                   Comedy|Romance
4                                Waiting to Exhale (1995)                             Comedy|Drama|Romance
5                      Father of the Bride Part II (1995)                                           Comedy
...
[9125 rows x 2 columns]

ratings

The rating table.

>>> mlsmall = MovieLens('ml-latest-small')
>>> mlsmall.ratings
        user  item  rating   timestamp
0          1    31     2.5  1260759144
1          1  1029     3.0  1260759179
2          1  1061     3.0  1260759182
3          1  1129     2.0  1260759185
4          1  1172     4.0  1260759205
...
[100004 rows x 4 columns]

tag_genome

The tag genome table, recording inferred item-tag relevance scores. This gets returned as a wide Pandas data frame, with rows indexed by item ID.
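To make "wide" concrete, the sketch below reshapes a small long-form table (one row per item-tag pair) into a frame with one row per item and one column per tag. The input layout, the tag names, and the scores are all assumptions made up for illustration; this is not how LensKit necessarily builds the table.

```python
import pandas as pd

# Made-up relevance scores in long form: one row per (item, tag) pair.
long_scores = pd.DataFrame(
    {
        "item": [1, 1, 2, 2],
        "tag": ["tag_a", "tag_b", "tag_a", "tag_b"],
        "relevance": [0.9, 0.1, 0.2, 0.7],
    }
)

# Reshape to wide form: rows indexed by item, one column per tag.
wide = long_scores.pivot(index="item", columns="tag", values="relevance")

print(wide.shape)       # (2, 2)
print(wide.index.name)  # item
```

In the wide layout, the relevance scores for a single movie are one row, so wide.loc[item_id] yields that movie's full tag vector.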

tags

The tag application table, recording user-supplied tags for movies.

>>> mlsmall = MovieLens('ml-latest-small')
>>> mlsmall.tags
      user  ...   timestamp
0       15  ...  1138537770
1       15  ...  1193435061
2       15  ...  1170560997
3       15  ...  1170626366
4       15  ...  1141391765
...
[1296 rows x 4 columns]

class lenskit.datasets.ML100K(path='data/ml-100k')

The MovieLens 100K data set. This older data set is in a different format from the more current data sets loaded by MovieLens.
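The format difference shows up in the files themselves: u.data is tab-separated and has no header row, so column names have to be supplied by the reader. The following is a minimal sketch of reading that layout with pandas, using two rows from the example output on this page; it is an illustration, not the code LensKit runs.

```python
import io

import pandas as pd

# Two lines in the u.data layout: tab-separated, no header row.
raw = io.StringIO("196\t242\t3\t881250949\n186\t302\t3\t891717742\n")

ratings = pd.read_csv(
    raw,
    sep="\t",
    header=None,  # the file carries no column names
    names=["user", "item", "rating", "timestamp"],
)
# Store ratings as floats, matching the 3.0-style values shown on this page.
ratings["rating"] = ratings["rating"].astype("float64")

print(ratings)
```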

available

Query whether the data set exists.

ratings

Return the rating data (from u.data).

>>> ml = ML100K('ml-100k')
>>> ml.ratings              #doctest: +SKIP
       user  item  rating  timestamp
0       196   242     3.0  881250949
1       186   302     3.0  891717742
2        22   377     1.0  878887116
3       244    51     2.0  880606923
4       166   346     1.0  886397596
...
[100000 rows x 4 columns]

users

Return the user data (from u.user).

>>> ml = ML100K('ml-100k')
>>> ml.users                #doctest: +SKIP
      age gender     occupation     zip
user
1      24      M     technician   85711
2      53      F          other   94043
3      23      M         writer   32067
4      24      M     technician   43537
5      33      F          other   15213
...
[943 rows x 4 columns]

class lenskit.datasets.ML1M(path='data/ml-1m')

MovieLens 1M data set.

Note

Some documentation examples use ML-10M100K; that is because this class shares implementation with the 10M data set.

movies

Return the movie data (from movies.dat). Indexed by movie ID.

>>> ml = ML10M()
>>> ml.movies       #doctest: +SKIP
                                                    title                                           genres
item
1                                        Toy Story (1995)      Adventure|Animation|Children|Comedy|Fantasy
2                                          Jumanji (1995)                       Adventure|Children|Fantasy
3                                 Grumpier Old Men (1995)                                   Comedy|Romance
4                                Waiting to Exhale (1995)                             Comedy|Drama|Romance
5                      Father of the Bride Part II (1995)                                           Comedy
...
[10681 rows x 2 columns]

ratings

Return the rating data (from ratings.dat).

>>> ml = ML10M()
>>> ml.ratings      #doctest: +SKIP
           user  item  rating  timestamp
0             1   122     5.0  838985046
1             1   185     5.0  838983525
2             1   231     5.0  838983392
3             1   292     5.0  838983421
4             1   316     5.0  838983392
...
[10000054 rows x 4 columns]

users

Return the user data (from users.dat). Indexed by user ID.

>>> ml = ML1M()
>>> ml.users        #doctest: +SKIP
     gender  age    zip
user
1         F    1  48067
2         M   56  70072
3         M   25  55117
4         M   45  02460
5         M   25  55455
...
[6040 rows x 3 columns]

class lenskit.datasets.ML10M(path='data/ml-10M100K')

MovieLens 10M100K data set.
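The .dat files in this data set separate fields with a double colon and have no header row. As a hedged sketch of what reading such a file involves (not LensKit's own loader), the rows below are taken from the ratings example on this page and parsed with pandas.

```python
import io

import pandas as pd

# Two lines in the ratings.dat layout: fields separated by "::", no header row.
raw = io.StringIO("1::122::5::838985046\n1::185::5::838983525\n")

ratings = pd.read_csv(
    raw,
    sep="::",
    engine="python",  # multi-character separators need the Python parser
    header=None,
    names=["user", "item", "rating", "timestamp"],
)
# Store ratings as floats, matching the 5.0-style values shown on this page.
ratings["rating"] = ratings["rating"].astype("float64")

print(ratings)
```

The Python parsing engine is slower than the default C engine, which is worth keeping in mind for a file with about ten million rows.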

movies

Return the movie data (from movies.dat). Indexed by movie ID.

>>> ml = ML10M()
>>> ml.movies       #doctest: +SKIP
                                                    title                                           genres
item
1                                        Toy Story (1995)      Adventure|Animation|Children|Comedy|Fantasy
2                                          Jumanji (1995)                       Adventure|Children|Fantasy
3                                 Grumpier Old Men (1995)                                   Comedy|Romance
4                                Waiting to Exhale (1995)                             Comedy|Drama|Romance
5                      Father of the Bride Part II (1995)                                           Comedy
...
[10681 rows x 2 columns]

ratings

Return the rating data (from ratings.dat).

>>> ml = ML10M()
>>> ml.ratings      #doctest: +SKIP
           user  item  rating  timestamp
0             1   122     5.0  838985046
1             1   185     5.0  838983525
2             1   231     5.0  838983392
3             1   292     5.0  838983421
4             1   316     5.0  838983392
...
[10000054 rows x 4 columns]

[ML] F. Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens Datasets: History and Context. ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4, Article 19 (December 2015), 19 pages. DOI: http://dx.doi.org/10.1145/2827872