Loading Data
LensKit can work with any data in a pandas.DataFrame with the expected columns. LensKit algorithms expect a ratings frame to contain the following columns (in any order):
- user, containing user identifiers. No requirements are placed on user IDs; if an algorithm requires something specific, such as contiguous 0-based identifiers for indexing into an array, it will use a pandas.Index to map them.
- item, containing item identifiers. The same comments apply as for user.
- rating, containing user ratings (if available). Implicit-feedback code will not require ratings.
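The identifier mapping mentioned for the user column can be sketched with plain pandas. This is a minimal illustration of the general technique, not LensKit's internal code; the identifiers and variable names are made up for the example.

```python
import pandas as pd

# Arbitrary user identifiers, as they might appear in a ratings frame.
users = pd.Series(["alice", "bob", "alice", "carol"])

# An index of the distinct identifiers; positions are 0-based and contiguous.
user_index = pd.Index(users.unique(), name="user")

# Translate each identifier into its position in the index.
user_pos = user_index.get_indexer(users)

# Positions can be mapped back to the original identifiers.
round_trip = user_index[user_pos]
```

The positions in `user_pos` can then be used to index into an array, while the original frame keeps whatever identifiers it started with.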
‘Rating’ data can contain other columns as well, and is a catch-all for any user-item interaction data. Algorithms will document any non-standard columns they can make use of.
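As a small illustration of the expected layout, the following sketch builds a ratings frame by hand. The values are invented, and the `timestamp` column stands in for any extra interaction data; the column check is an assumption for illustration, not a LensKit function.

```python
import pandas as pd

# Column order does not matter; only the names do.
ratings = pd.DataFrame({
    "item": [31, 1029, 31],
    "user": [1, 1, 2],
    "rating": [2.5, 3.0, 4.0],
    "timestamp": [1260759144, 1260759179, 1260759182],  # extra column
})

# Hypothetical sanity check: user and item are always needed.
missing = {"user", "item"} - set(ratings.columns)

# Implicit-feedback data can leave the rating column out entirely.
implicit = ratings[["user", "item"]]
```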
lenskit.algorithms.Recommender.fit()
can also accept additional data objects
as keyword arguments, and algorithms that wrap other algorithms will pass this data
through unchanged. Algorithms ignore extra data objects they receive. This allows
you to build algorithms that train on data besides user-item interactions, such as
user metadata or item content.
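The pass-through behavior described above can be sketched without LensKit at all. Both classes below are hypothetical stand-ins, and the keyword names `item_content` and `user_metadata` are assumptions chosen for the example; a plain list stands in for the ratings frame.

```python
class ContentAwareScorer:
    """Hypothetical algorithm that understands an `item_content` data object."""

    def fit(self, ratings, *, item_content=None, **kwargs):
        # Keep the data object this algorithm knows about; ignore the rest.
        self.item_content_ = item_content
        self.n_ratings_ = len(ratings)
        return self


class Wrapper:
    """Hypothetical wrapper that forwards extra data objects unchanged."""

    def __init__(self, inner):
        self.inner = inner

    def fit(self, ratings, **kwargs):
        self.inner.fit(ratings, **kwargs)
        return self


ratings = [("u1", "i1", 4.0), ("u2", "i1", 3.0)]
content = {"i1": "a short item description"}

model = Wrapper(ContentAwareScorer())
# user_metadata reaches the inner algorithm too, which silently ignores it.
model.fit(ratings, item_content=content, user_metadata={"u1": {"age": 30}})
```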
Data Loaders
The lenskit.datasets
module provides utilities for reading a variety
of commonly-used LensKit data sets. It does not package or automatically
download them, but loads them from a local directory where you have unpacked
the data set. Each data set class or function takes a path
parameter
specifying the location of the data set.
The normal mode of operation for these utilities is to provide a class for the
data set; this class then exposes the data set’s data as attributes. These
attributes are cached internally, so e.g. accessing MovieLens.ratings
twice will only load the data file once.
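One common way to get this load-once behavior is `functools.cached_property`. The sketch below uses a hypothetical `TinyDataset` class reading from an in-memory string; it shows the pattern only and is not a claim about how LensKit implements its caching.

```python
import io
from functools import cached_property

import pandas as pd


class TinyDataset:
    """Hypothetical data set class, for illustration only."""

    def __init__(self, csv_text):
        self._csv_text = csv_text
        self.load_count = 0

    @cached_property
    def ratings(self):
        # Runs on first access; the result is then stored on the instance.
        self.load_count += 1
        return pd.read_csv(io.StringIO(self._csv_text))


ds = TinyDataset("user,item,rating\n1,31,2.5\n1,1029,3.0\n")
first = ds.ratings
second = ds.ratings  # served from the cache, no second parse
```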
These data files have normalized column names to fit with LensKit’s general conventions:
- User ID columns are called user.
- Item ID columns are called item.
- Rating columns are called rating.
- Timestamp columns are called timestamp.
Other column names are unchanged. Data tables that provide information about
specific things, such as a table of movie titles, are indexed by the relevant
ID (e.g. MovieLens.movies is indexed by item).
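The same normalization can be applied by hand to a file the loaders do not cover. This sketch assumes a MovieLens-style CSV whose ID column is named `movieId`, and reads it from an in-memory string so that it is self-contained.

```python
import io

import pandas as pd

movies_csv = io.StringIO(
    "movieId,title,genres\n"
    "1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy\n"
    "2,Jumanji (1995),Adventure|Children|Fantasy\n"
)

# Rename the ID column to the normalized name, then index the table by it.
movies = (
    pd.read_csv(movies_csv)
    .rename(columns={"movieId": "item"})
    .set_index("item")
)
```

Other columns (`title`, `genres`) keep their original names.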
Data sets supported:
MovieLens Data Sets
The GroupLens research group provides several data sets extracted from the MovieLens service. These can be downloaded from https://grouplens.org/datasets/movielens/.
- class lenskit.datasets.MovieLens(path='data/ml-20m')
Bases:
object
Code for reading current MovieLens data sets, including ML-20M, ML-Latest, and ML-Latest-Small.
- Parameters:
path (str or pathlib.Path) – Path to the directory containing the data set.
- property ratings
The rating table.
>>> mlsmall = MovieLens('data/ml-latest-small')
>>> mlsmall.ratings
        user  item  rating   timestamp
0          1    31     2.5  1260759144
1          1  1029     3.0  1260759179
2          1  1061     3.0  1260759182
3          1  1129     2.0  1260759185
4          1  1172     4.0  1260759205
...
[100004 rows x 4 columns]
- property movies
The movie table, with titles and genres. It is indexed by movie ID.
>>> mlsmall = MovieLens('data/ml-latest-small')
>>> mlsmall.movies
                                   title                                       genres
item
1                       Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy
2                         Jumanji (1995)                   Adventure|Children|Fantasy
3                Grumpier Old Men (1995)                               Comedy|Romance
4               Waiting to Exhale (1995)                         Comedy|Drama|Romance
5     Father of the Bride Part II (1995)                                       Comedy
...
[9125 rows x 2 columns]
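The genres column packs several genres into one pipe-separated string. A minimal sketch of unpacking it with pandas, using a hand-built two-row stand-in for the movie table:

```python
import pandas as pd

# Stand-in for the movie table, indexed by item as in the output above.
movies = pd.DataFrame(
    {
        "title": ["Toy Story (1995)", "Grumpier Old Men (1995)"],
        "genres": [
            "Adventure|Animation|Children|Comedy|Fantasy",
            "Comedy|Romance",
        ],
    },
    index=pd.Index([1, 3], name="item"),
)

# One list of genres per movie.
genre_lists = movies["genres"].str.split("|")

# One row per (item, genre) pair; the item index is repeated.
item_genres = genre_lists.explode()
```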
- property links
The movie link table, connecting movie IDs to external identifiers. It is indexed by movie ID.
>>> mlsmall = MovieLens('data/ml-latest-small')
>>> mlsmall.links
      imdbId  tmdbId
item
1     114709     862
2     113497    8844
3     113228   15602
4     114885   31357
5     113041   11862
...
[9125 rows x 2 columns]
- property tags
The tag application table, recording user-supplied tags for movies.
>>> mlsmall = MovieLens('data/ml-latest-small')
>>> mlsmall.tags
      user  ...   timestamp
0       15  ...  1138537770
1       15  ...  1193435061
2       15  ...  1170560997
3       15  ...  1170626366
4       15  ...  1141391765
...
[1296 rows x 4 columns]
- property tag_genome
The tag genome table, recording inferred item-tag relevance scores. This gets returned as a wide Pandas data frame, with rows indexed by item ID.
>>> ml20m = MovieLens('data/ml-20m')
>>> ml20m.tag_genome
tag       007  007 (series)  18th century  ...     wwii   zombie  zombies
item                                       ...
1     0.02500       0.02500       0.05775  ...  0.03625  0.07775  0.02300
2     0.03975       0.04375       0.03775  ...  0.01475  0.09025  0.01875
3     0.04350       0.05475       0.02800  ...  0.01950  0.09700  0.01850
4     0.03725       0.03950       0.03675  ...  0.01525  0.06450  0.01300
5     0.04200       0.05275       0.05925  ...  0.01675  0.10750  0.01825
...
[10381 rows x 1128 columns]
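A wide frame of this shape can be produced from long-format scores with a pivot. The sketch below assumes the scores start as one row per (item, tag) pair with a `relevance` column; that layout and the column names are assumptions for illustration, and the values are copied from the output above.

```python
import pandas as pd

# Long format: one relevance score per (item, tag) pair.
scores = pd.DataFrame({
    "item": [1, 1, 2, 2],
    "tag": ["007", "zombie", "007", "zombie"],
    "relevance": [0.02500, 0.07775, 0.03975, 0.09025],
})

# Wide format: rows indexed by item, one column per tag.
genome = scores.pivot(index="item", columns="tag", values="relevance")
```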
- class lenskit.datasets.ML100K(path='data/ml-100k')
Bases:
object
The MovieLens 100K data set. This older data set is in a different format from the more current data sets loaded by MovieLens.
- property available
Query whether the data set exists.
- property ratings
Return the rating data (from u.data).
>>> ml = ML100K()
>>> ml.ratings
       user  item  rating  timestamp
0       196   242     3.0  881250949
1       186   302     3.0  891717742
2        22   377     1.0  878887116
3       244    51     2.0  880606923
4       166   346     1.0  886397596
...
[100000 rows x 4 columns]
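For reference, u.data is a tab-separated file with no header row. The following sketch parses a few such lines by hand into the normalized column names; it is an illustration with an in-memory sample, not necessarily how ML100K reads the file.

```python
import io

import pandas as pd

# Three lines in the u.data layout: user, item, rating, timestamp.
sample = io.StringIO(
    "196\t242\t3\t881250949\n"
    "186\t302\t3\t891717742\n"
    "22\t377\t1\t878887116\n"
)

ratings = pd.read_csv(
    sample,
    sep="\t",
    header=None,
    names=["user", "item", "rating", "timestamp"],
)
# Ratings are stored as integers in the file; the loader output shows floats.
ratings["rating"] = ratings["rating"].astype("float64")
```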
- property users
Return the user data (from u.user).
>>> ml = ML100K()
>>> ml.users
      age gender  occupation    zip
user
1      24      M  technician  85711
2      53      F       other  94043
3      23      M      writer  32067
4      24      M  technician  43537
5      33      F       other  15213
...
[943 rows x 4 columns]
- property movies
Return the movie data (from u.item).
>>> ml = ML100K()
>>> ml.movies
                  title      release  ...  War Western
item                                  ...
1      Toy Story (1995)  01-Jan-1995  ...    0       0
2      GoldenEye (1995)  01-Jan-1995  ...    0       0
3     Four Rooms (1995)  01-Jan-1995  ...    0       0
4     Get Shorty (1995)  01-Jan-1995  ...    0       0
5        Copycat (1995)  01-Jan-1995  ...    0       0
...
[1682 rows x 23 columns]
- class lenskit.datasets.ML1M(path='data/ml-1m')
Bases:
MLM
MovieLens 1M data set.
Note
Some documentation examples use ML-10M100K; that is because this class shares implementation with the 10M data set.
- property users
Return the user data (from users.dat). Indexed by user ID.
>>> ml = ML1M()
>>> ml.users
     gender  age    zip
user
1         F    1  48067
2         M   56  70072
3         M   25  55117
4         M   45  02460
5         M   25  55455
...
[6040 rows x 3 columns]
- property movies
Return the movie data (from movies.dat). Indexed by movie ID.
>>> ml = ML10M()
>>> ml.movies
                                   title                                       genres
item
1                       Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy
2                         Jumanji (1995)                   Adventure|Children|Fantasy
3                Grumpier Old Men (1995)                               Comedy|Romance
4               Waiting to Exhale (1995)                         Comedy|Drama|Romance
5     Father of the Bride Part II (1995)                                       Comedy
...
[10681 rows x 2 columns]
- property ratings
Return the rating data (from ratings.dat).
>>> ml = ML10M()
>>> ml.ratings
          user  item  rating  timestamp
0            1   122     5.0  838985046
1            1   185     5.0  838983525
2            1   231     5.0  838983392
3            1   292     5.0  838983421
4            1   316     5.0  838983392
...
[10000054 rows x 4 columns]
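The .dat files in the 1M and 10M data sets separate fields with a double colon and have no header row. A hand-rolled parse of that layout might look like the sketch below, which uses an in-memory sample and makes no claim about the loader's internals.

```python
import io

import pandas as pd

# Two lines in the ratings.dat layout: user::item::rating::timestamp.
sample = io.StringIO(
    "1::122::5::838985046\n"
    "1::185::5::838983525\n"
)

ratings = pd.read_csv(
    sample,
    sep="::",
    engine="python",  # multi-character separators need the Python parser
    header=None,
    names=["user", "item", "rating", "timestamp"],
)
ratings["rating"] = ratings["rating"].astype("float64")
```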
- class lenskit.datasets.ML10M(path='data/ml-10M100K')
Bases:
MLM
MovieLens 10M100K data set.
- property movies
Return the movie data (from movies.dat). Indexed by movie ID.
>>> ml = ML10M()
>>> ml.movies
                                   title                                       genres
item
1                       Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy
2                         Jumanji (1995)                   Adventure|Children|Fantasy
3                Grumpier Old Men (1995)                               Comedy|Romance
4               Waiting to Exhale (1995)                         Comedy|Drama|Romance
5     Father of the Bride Part II (1995)                                       Comedy
...
[10681 rows x 2 columns]
- property ratings
Return the rating data (from ratings.dat).
>>> ml = ML10M()
>>> ml.ratings
          user  item  rating  timestamp
0            1   122     5.0  838985046
1            1   185     5.0  838983525
2            1   231     5.0  838983392
3            1   292     5.0  838983421
4            1   316     5.0  838983392
...
[10000054 rows x 4 columns]