Loading Data#

LensKit can work with any data in a pandas.DataFrame with the expected columns. LensKit algorithms expect a ratings frame to contain the following columns (in any order):

  • user, containing user identifiers. No requirements are placed on user IDs; if an algorithm requires something specific, such as contiguous 0-based identifiers for indexing into an array, it will use a pandas.Index to map them.

  • item, containing item identifiers. The same comments apply as for user.

  • rating, containing user ratings (if available). Implicit-feedback code will not require ratings.
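The identifier mapping mentioned for the user and item columns can be sketched with plain pandas. This is a minimal illustration under assumed data, not LensKit's internal code; the ratings values and variable names are made up.

```python
import pandas as pd

# Hypothetical ratings with arbitrary (non-contiguous, string) user IDs.
ratings = pd.DataFrame({
    'user': ['u42', 'u7', 'u42', 'u99'],
    'item': [10, 10, 35, 20],
    'rating': [4.0, 3.5, 5.0, 2.0],
})

# Build an index of the distinct user IDs, then map each ID to its position.
user_index = pd.Index(ratings['user'].unique(), name='user')
user_pos = user_index.get_indexer(ratings['user'])

print(user_pos)              # 0-based positions usable as array offsets
print(user_index[user_pos])  # positions map back to the original IDs
```

Because the index handles the translation in both directions, the caller's identifiers can stay in whatever form the source data uses.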

‘Rating’ data can contain other columns as well, and is a catch-all for any user-item interaction data. Algorithms will document any non-standard columns they can make use of.
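As a rough sketch of the column layout described above, the following builds a small frame by hand. The values are invented for illustration, and the `required` check is only a suggestion for validating your own data, not something LensKit is known to run.

```python
import pandas as pd

# Made-up interaction data; only the column names follow the convention above.
ratings = pd.DataFrame({
    'item': [31, 1029, 31],
    'user': [1, 1, 2],
    'rating': [2.5, 3.0, 4.0],
    'timestamp': [1260759144, 1260759179, 1260759200],  # extra column
})

# Column order does not matter; check that the expected columns are present.
required = {'user', 'item', 'rating'}
missing = required - set(ratings.columns)
print(missing)

# An implicit-feedback frame can simply omit the rating column.
implicit = ratings[['user', 'item']]
print(list(implicit.columns))
```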

lenskit.algorithms.Recommender.fit() can also accept additional data objects as keyword arguments, and algorithms that wrap other algorithms will pass this data through unchanged. Algorithms ignore extra data objects they receive. This allows you to build algorithms that train on data besides user-item interactions, such as user metadata or item content.
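The pass-through behavior can be pictured with two stand-in classes. Both `TagAware` and `Wrapper` are hypothetical and do not come from LensKit; they only mimic the keyword-argument convention described above, so treat this as a sketch of the pattern rather than the library's code.

```python
import pandas as pd

class TagAware:
    """Hypothetical algorithm that uses an extra `tags` data object if given."""
    def fit(self, ratings, **kwargs):
        self.n_ratings_ = len(ratings)
        tags = kwargs.get('tags')  # any other extra objects are ignored
        self.n_tags_ = 0 if tags is None else len(tags)
        return self

class Wrapper:
    """Hypothetical wrapper that forwards extra data unchanged."""
    def __init__(self, inner):
        self.inner = inner

    def fit(self, ratings, **kwargs):
        self.inner.fit(ratings, **kwargs)
        return self

ratings = pd.DataFrame({'user': [1, 2], 'item': [10, 20], 'rating': [4.0, 3.0]})
tags = pd.DataFrame({'item': [10, 10, 20], 'tag': ['funny', 'classic', 'dark']})

# `unused` is silently ignored; `tags` reaches the inner algorithm.
algo = Wrapper(TagAware()).fit(ratings, tags=tags, unused=object())
print(algo.inner.n_tags_)
```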

Data Loaders#

The lenskit.datasets module provides utilities for reading a variety of commonly-used LensKit data sets. It does not package or automatically download them, but loads them from a local directory where you have unpacked the data set. Each data set class or function takes a path parameter specifying the location of the data set.

The normal mode of operation for these utilities is to provide a class for the data set; this class then exposes the data set’s data as attributes. These attributes are cached internally, so, for example, accessing MovieLens.ratings twice loads the data file only once.
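One way such attribute caching could work is sketched below with `functools.cached_property`. The `TinyDataSet` class, its `loads` counter, and the `ratings.csv` file name are assumptions made for this example; the actual LensKit classes may cache differently.

```python
import tempfile
from functools import cached_property
from pathlib import Path

import pandas as pd

class TinyDataSet:
    """Hypothetical loader; not the LensKit implementation."""
    def __init__(self, path):
        self.path = Path(path)
        self.loads = 0  # counts file reads, for demonstration only

    @cached_property
    def ratings(self):
        self.loads += 1
        return pd.read_csv(self.path / 'ratings.csv')

with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / 'ratings.csv').write_text('user,item,rating\n1,31,2.5\n')
    ds = TinyDataSet(tmp)
    first = ds.ratings
    second = ds.ratings  # served from the cache, no second read
    print(ds.loads)
```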

The loaded tables have column names normalized to fit LensKit’s general conventions:

  • User ID columns are called user.

  • Item ID columns are called item.

  • Rating columns are called rating.

  • Timestamp columns are called timestamp.

Other column names are unchanged. Data tables that provide information about specific things, such as a table of movie titles, are indexed by the relevant ID (e.g. MovieLens.movies is indexed by item).
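The normalization can be approximated by hand as follows. This sketch assumes the raw CSV headers `userId` and `movieId` used by the current MovieLens files, and the inline rows are stand-ins rather than real file contents; it is not the loader's actual code.

```python
import io
import pandas as pd

# Tiny stand-ins for ratings.csv and movies.csv with their raw column names.
raw_ratings = io.StringIO('userId,movieId,rating,timestamp\n1,1,4.0,1260759144\n')
raw_movies = io.StringIO(
    'movieId,title,genres\n'
    '1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy\n'
)

names = {'userId': 'user', 'movieId': 'item'}
ratings = pd.read_csv(raw_ratings).rename(columns=names)
# Descriptive tables are indexed by the ID of the thing they describe.
movies = pd.read_csv(raw_movies).rename(columns=names).set_index('item')

print(list(ratings.columns))
print(movies.loc[1, 'title'])
```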

Data sets supported:

MovieLens Data Sets#

The GroupLens research group provides several data sets extracted from the MovieLens service [HK15]. These can be downloaded from https://grouplens.org/datasets/movielens/.

class lenskit.datasets.MovieLens(path='data/ml-20m')#

Bases: object

Code for reading current MovieLens data sets, including ML-20M, ML-Latest, and ML-Latest-Small.

Parameters:

path (str or pathlib.Path) – Path to the directory containing the data set.

property ratings#

The rating table.

>>> mlsmall = MovieLens('data/ml-latest-small')
>>> mlsmall.ratings
        user  item  rating   timestamp
0          1    31     2.5  1260759144
1          1  1029     3.0  1260759179
2          1  1061     3.0  1260759182
3          1  1129     2.0  1260759185
4          1  1172     4.0  1260759205
...
[100004 rows x 4 columns]
property movies#

The movie table, with titles and genres. It is indexed by movie ID.

>>> mlsmall = MovieLens('data/ml-latest-small')
>>> mlsmall.movies
                                                    title                                           genres
item
1                                        Toy Story (1995)      Adventure|Animation|Children|Comedy|Fantasy
2                                          Jumanji (1995)                       Adventure|Children|Fantasy
3                                 Grumpier Old Men (1995)                                   Comedy|Romance
4                                Waiting to Exhale (1995)                             Comedy|Drama|Romance
5                      Father of the Bride Part II (1995)                                           Comedy
...
[9125 rows x 2 columns]
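Because the movie table is indexed by item, titles can be attached to ratings with a join on the item column. The frames below are small hand-built stand-ins shaped like the two tables, so the example runs without the data set.

```python
import pandas as pd

# Small stand-ins shaped like the rating table and the movie table.
ratings = pd.DataFrame({
    'user': [1, 1, 2],
    'item': [1, 2, 1],
    'rating': [4.0, 3.0, 5.0],
})
movies = pd.DataFrame(
    {'title': ['Toy Story (1995)', 'Jumanji (1995)']},
    index=pd.Index([1, 2], name='item'),
)

# Join the ratings' item column against the movie table's index.
titled = ratings.join(movies['title'], on='item')
print(titled)
```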

property links#

The movie link table, connecting movie IDs to external identifiers. It is indexed by movie ID.

>>> mlsmall = MovieLens('data/ml-latest-small')
>>> mlsmall.links
         imdbId  tmdbId
item
1        114709     862
2        113497    8844
3        113228   15602
4        114885   31357
5        113041   11862
...
[9125 rows x 2 columns]
property tags#

The tag application table, recording user-supplied tags for movies.

>>> mlsmall = MovieLens('data/ml-latest-small')
>>> mlsmall.tags
      user  ...   timestamp
0       15  ...  1138537770
1       15  ...  1193435061
2       15  ...  1170560997
3       15  ...  1170626366
4       15  ...  1141391765
...
[1296 rows x 4 columns]
property tag_genome#

The tag genome table, recording inferred item-tag relevance scores. It is returned as a wide pandas data frame, with rows indexed by item ID and one column per tag.

>>> ml20m = MovieLens('data/ml-20m')
>>> ml20m.tag_genome
tag         007  007 (series)  18th century  ...     wwii   zombie  zombies
item                                         ...
1       0.02500       0.02500       0.05775  ...  0.03625  0.07775  0.02300
2       0.03975       0.04375       0.03775  ...  0.01475  0.09025  0.01875
3       0.04350       0.05475       0.02800  ...  0.01950  0.09700  0.01850
4       0.03725       0.03950       0.03675  ...  0.01525  0.06450  0.01300
5       0.04200       0.05275       0.05925  ...  0.01675  0.10750  0.01825
...
[10381 rows x 1128 columns]
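A wide layout like the one above can be produced from long-format scores with a pivot. The sketch assumes a long table with `item`, `tag`, and `relevance` columns (names chosen for this example) and reuses a few values from the output above; how the loader actually builds the frame is not shown here.

```python
import pandas as pd

# Long-format scores: one row per (item, tag) pair.
scores = pd.DataFrame({
    'item': [1, 1, 2, 2],
    'tag': ['007', 'zombie', '007', 'zombie'],
    'relevance': [0.025, 0.07775, 0.03975, 0.09025],
})

# Pivot to one row per item and one column per tag.
genome = scores.pivot(index='item', columns='tag', values='relevance')
print(genome)
```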
class lenskit.datasets.ML100K(path='data/ml-100k')#

Bases: object

The MovieLens 100K data set. This older data set is in a different format from the more current data sets loaded by MovieLens.

property available#

Query whether the data set exists.

property ratings#

Return the rating data (from u.data).

>>> ml = ML100K()
>>> ml.ratings
       user  item  rating  timestamp
0       196   242     3.0  881250949
1       186   302     3.0  891717742
2        22   377     1.0  878887116
3       244    51     2.0  880606923
4       166   346     1.0  886397596
...
[100000 rows x 4 columns]
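For comparison, u.data is a tab-separated file with no header line, so reading it by hand looks roughly like this. The two inline rows copy the first rows shown above; the explicit float dtype is an assumption made to match the displayed output.

```python
import io
import pandas as pd

# First rows of u.data as shown above, tab-separated with no header line.
raw = io.StringIO('196\t242\t3\t881250949\n186\t302\t3\t891717742\n')

ratings = pd.read_csv(
    raw, sep='\t', header=None,
    names=['user', 'item', 'rating', 'timestamp'],
    dtype={'rating': 'float64'},  # match the float ratings shown above
)
print(ratings)
```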
property users#

Return the user data (from u.user).

>>> ml = ML100K()
>>> ml.users
      age gender     occupation     zip
user
1      24      M     technician   85711
2      53      F          other   94043
3      23      M         writer   32067
4      24      M     technician   43537
5      33      F          other   15213
...
[943 rows x 4 columns]
property movies#

Return the movie data (from u.item).

>>> ml = ML100K()
>>> ml.movies
                                          title      release  ...  War Western
item                                                          ...
1                              Toy Story (1995)  01-Jan-1995  ...    0       0
2                              GoldenEye (1995)  01-Jan-1995  ...    0       0
3                             Four Rooms (1995)  01-Jan-1995  ...    0       0
4                             Get Shorty (1995)  01-Jan-1995  ...    0       0
5                                Copycat (1995)  01-Jan-1995  ...    0       0
...
[1682 rows x 23 columns]
class lenskit.datasets.ML1M(path='data/ml-1m')#

Bases: MLM

MovieLens 1M data set.

Note

Some documentation examples use ML-10M100K; that is because this class shares implementation with the 10M data set.

property users#

Return the user data (from users.dat). Indexed by user ID.

>>> ml = ML1M()
>>> ml.users
     gender  age    zip
user
1         F    1  48067
2         M   56  70072
3         M   25  55117
4         M   45  02460
5         M   25  55455
...
[6040 rows x 3 columns]
property movies#

Return the movie data (from movies.dat). Indexed by movie ID.

>>> ml = ML10M()
>>> ml.movies
                                                    title                                           genres
item
1                                        Toy Story (1995)      Adventure|Animation|Children|Comedy|Fantasy
2                                          Jumanji (1995)                       Adventure|Children|Fantasy
3                                 Grumpier Old Men (1995)                                   Comedy|Romance
4                                Waiting to Exhale (1995)                             Comedy|Drama|Romance
5                      Father of the Bride Part II (1995)                                           Comedy
...
[10681 rows x 2 columns]
property ratings#

Return the rating data (from ratings.dat).

>>> ml = ML10M()
>>> ml.ratings
           user  item  rating  timestamp
0             1   122     5.0  838985046
1             1   185     5.0  838983525
2             1   231     5.0  838983392
3             1   292     5.0  838983421
4             1   316     5.0  838983392
...
[10000054 rows x 4 columns]
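The .dat files in ML-1M and ML-10M separate fields with `::` rather than commas. A hand-rolled read, sketched here on two rows copied from the output above, needs pandas' Python parsing engine because the separator is longer than one character. This is an illustration of the file format, not the loader's own code.

```python
import io
import pandas as pd

# Two rows in the ratings.dat layout: user::item::rating::timestamp
raw = io.StringIO('1::122::5::838985046\n1::185::5::838983525\n')

ratings = pd.read_csv(
    raw, sep='::', engine='python', header=None,
    names=['user', 'item', 'rating', 'timestamp'],
)
# Store ratings as floats, matching the output shown above.
ratings['rating'] = ratings['rating'].astype('float64')
print(ratings)
```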
class lenskit.datasets.ML10M(path='data/ml-10M100K')#

Bases: MLM

MovieLens 10M100K data set.

property movies#

Return the movie data (from movies.dat). Indexed by movie ID.

>>> ml = ML10M()
>>> ml.movies
                                                    title                                           genres
item
1                                        Toy Story (1995)      Adventure|Animation|Children|Comedy|Fantasy
2                                          Jumanji (1995)                       Adventure|Children|Fantasy
3                                 Grumpier Old Men (1995)                                   Comedy|Romance
4                                Waiting to Exhale (1995)                             Comedy|Drama|Romance
5                      Father of the Bride Part II (1995)                                           Comedy
...
[10681 rows x 2 columns]
property ratings#

Return the rating data (from ratings.dat).

>>> ml = ML10M()
>>> ml.ratings
           user  item  rating  timestamp
0             1   122     5.0  838985046
1             1   185     5.0  838983525
2             1   231     5.0  838983392
3             1   292     5.0  838983421
4             1   316     5.0  838983392
...
[10000054 rows x 4 columns]