Data Set Utilities¶
The lenskit.datasets module provides utilities for reading a variety of data
sets commonly used with LensKit. It does not package or automatically
download them, but loads them from a local directory where you have unpacked
the data set. Each data set class or function takes a path parameter
specifying the location of the data set.
The normal mode of operation for these utilities is to provide a class for the
data set; this class then exposes the data set’s data as attributes. These
attributes are cached internally, so e.g. accessing MovieLens.ratings
twice will only load the data file once.
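The caching behavior can be illustrated with a minimal sketch. This is not LensKit's implementation: TinyDataSet, its load_count counter, and the ratings.csv file are hypothetical names introduced for illustration, and functools.cached_property is just one way to get load-once attributes.

```python
import csv
import tempfile
from functools import cached_property
from pathlib import Path


class TinyDataSet:
    """Hypothetical stand-in for a data set class; not LensKit's code."""

    def __init__(self, path):
        self.path = Path(path)
        self.load_count = 0  # counts how many times the file is actually read

    @cached_property
    def ratings(self):
        # Runs only on first access; the result is then stored on the instance.
        self.load_count += 1
        with open(self.path / "ratings.csv", newline="") as f:
            return list(csv.DictReader(f))


# Build a throwaway data directory so the sketch is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "ratings.csv").write_text("user,item,rating\n1,31,2.5\n1,1029,3.0\n")
    ds = TinyDataSet(tmp)
    first = ds.ratings
    second = ds.ratings  # served from the cache, no second read
    loads = ds.load_count
```

After both accesses, loads is 1 and first and second refer to the same object.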
These data files have normalized column names to fit with LensKit’s general conventions:
- User ID columns are called user.
- Item ID columns are called item.
- Rating columns are called rating.
- Timestamp columns are called timestamp.
Other column names are unchanged. Data tables that provide information about
specific things, such as a table of movie titles, are indexed by the relevant
ID (e.g. MovieLens.movies is indexed by item).
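The naming and indexing conventions above can be sketched with plain Pandas. This is an illustration rather than LensKit's own loading code: it assumes the raw MovieLens CSV headers are userId and movieId, and the RENAMES mapping is a hypothetical helper.

```python
import pandas as pd

# Small in-memory frames using the assumed raw MovieLens column names.
raw_ratings = pd.DataFrame({
    "userId": [1, 1],
    "movieId": [31, 1029],
    "rating": [2.5, 3.0],
    "timestamp": [1260759144, 1260759179],
})
raw_movies = pd.DataFrame({
    "movieId": [1, 2],
    "title": ["Toy Story (1995)", "Jumanji (1995)"],
    "genres": [
        "Adventure|Animation|Children|Comedy|Fantasy",
        "Adventure|Children|Fantasy",
    ],
})

# Hypothetical mapping from raw names to the normalized names.
RENAMES = {"userId": "user", "movieId": "item"}

# Interaction tables keep a plain row index; only the column names change.
ratings = raw_ratings.rename(columns=RENAMES)

# Tables describing specific things are indexed by the relevant ID.
movies = raw_movies.rename(columns=RENAMES).set_index("item")
```

With this layout a movie can be looked up directly by ID, for example movies.loc[1, "title"].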
MovieLens Data Sets¶
The GroupLens research group provides several data sets extracted from the MovieLens service [ML]. These can be downloaded from https://grouplens.org/datasets/movielens/.
- class lenskit.datasets.MovieLens(path='data/ml-20m')¶
Code for reading current MovieLens data sets, including ML-20M, ML-Latest, and ML-Latest-Small.
- Parameters
path (str or pathlib.Path) – Path to the directory containing the data set.
- links¶
The movie link table, connecting movie IDs to external identifiers. It is indexed by movie ID.
>>> mlsmall = MovieLens('ml-latest-small')
>>> mlsmall.links
      imdbId  tmdbId
item
1     114709     862
2     113497    8844
3     113228   15602
4     114885   31357
5     113041   11862
...
[9125 rows x 2 columns]
- movies¶
The movie table, with titles and genres. It is indexed by movie ID.
>>> mlsmall = MovieLens('ml-latest-small')
>>> mlsmall.movies
                                   title                                       genres
item
1                       Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy
2                         Jumanji (1995)                   Adventure|Children|Fantasy
3                Grumpier Old Men (1995)                               Comedy|Romance
4               Waiting to Exhale (1995)                         Comedy|Drama|Romance
5     Father of the Bride Part II (1995)                                       Comedy
...
[9125 rows x 2 columns]
- ratings¶
The rating table.
>>> mlsmall = MovieLens('ml-latest-small')
>>> mlsmall.ratings
   user  item  rating   timestamp
0     1    31     2.5  1260759144
1     1  1029     3.0  1260759179
2     1  1061     3.0  1260759182
3     1  1129     2.0  1260759185
4     1  1172     4.0  1260759205
...
[100004 rows x 4 columns]
- tag_genome¶
The tag genome table, recording inferred item-tag relevance scores. This gets returned as a wide Pandas data frame, with rows indexed by item ID.
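To make "wide" concrete, here is a minimal sketch of turning long-format relevance scores into a frame with one row per item and one column per tag. The scores and tag names are invented for illustration, the long layout is an assumption about how such scores are stored, and DataFrame.pivot is only one way to do the reshaping; this is not necessarily how tag_genome is built.

```python
import pandas as pd

# Invented relevance scores in long format: one row per (item, tag) pair.
long_scores = pd.DataFrame({
    "item": [1, 1, 2, 2],
    "tag": ["animation", "pixar", "animation", "pixar"],
    "relevance": [0.9, 0.8, 0.1, 0.05],
})

# Reshape so each item is a row and each tag is a column.
wide = long_scores.pivot(index="item", columns="tag", values="relevance")
```

The resulting frame is indexed by item, so wide.loc[1] gives every tag score for item 1.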
- tags¶
The tag application table, recording user-supplied tags for movies.
>>> mlsmall = MovieLens('ml-latest-small')
>>> mlsmall.tags
   user  ...   timestamp
0    15  ...  1138537770
1    15  ...  1193435061
2    15  ...  1170560997
3    15  ...  1170626366
4    15  ...  1141391765
...
[1296 rows x 4 columns]
- class lenskit.datasets.ML100K(path='data/ml-100k')¶
The MovieLens 100K data set. This older data set is in a different format from the more current data sets loaded by MovieLens.
- available¶
Query whether the data set exists.
- ratings¶
Return the rating data (from u.data).
>>> ml = ML100K('ml-100k')
>>> ml.ratings #doctest: +SKIP
   user  item  rating  timestamp
0   196   242     3.0  881250949
1   186   302     3.0  891717742
2    22   377     1.0  878887116
3   244    51     2.0  880606923
4   166   346     1.0  886397596
...
[100000 rows x 4 columns]
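For readers curious about what reading u.data involves, the following sketch parses the first rows shown above with plain Pandas. It assumes the file is tab-separated with no header line and stores ratings as whole numbers; the U_DATA string is a hypothetical stand-in for the file, and this is not the code ML100K itself runs.

```python
import io

import pandas as pd

# Stand-in for the start of u.data (assumed layout: tab-separated, no header).
U_DATA = (
    "196\t242\t3\t881250949\n"
    "186\t302\t3\t891717742\n"
    "22\t377\t1\t878887116\n"
)

ratings = pd.read_csv(
    io.StringIO(U_DATA),
    sep="\t",
    header=None,
    # Apply the normalized column names directly while reading.
    names=["user", "item", "rating", "timestamp"],
    # Store ratings as floats, matching the 3.0-style values shown above.
    dtype={"rating": "float64"},
)
```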
- users¶
Return the user data (from u.user).
>>> ml = ML100K('ml-100k')
>>> ml.users #doctest: +SKIP
      age gender  occupation    zip
user
1      24      M  technician  85711
2      53      F       other  94043
3      23      M      writer  32067
4      24      M  technician  43537
5      33      F       other  15213
...
[943 rows x 4 columns]
- class lenskit.datasets.ML1M(path='data/ml-1m')¶
MovieLens 1M data set.
Note
Some documentation examples use ML-10M100K; that is because this class shares implementation with the 10M data set.
- movies¶
Return the movie data (from movies.dat). Indexed by movie ID.
>>> ml = ML10M()
>>> ml.movies #doctest: +SKIP
                                   title                                       genres
item
1                       Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy
2                         Jumanji (1995)                   Adventure|Children|Fantasy
3                Grumpier Old Men (1995)                               Comedy|Romance
4               Waiting to Exhale (1995)                         Comedy|Drama|Romance
5     Father of the Bride Part II (1995)                                       Comedy
...
[10681 rows x 2 columns]
- ratings¶
Return the rating data (from ratings.dat).
>>> ml = ML10M()
>>> ml.ratings #doctest: +SKIP
   user  item  rating  timestamp
0     1   122     5.0  838985046
1     1   185     5.0  838983525
2     1   231     5.0  838983392
3     1   292     5.0  838983421
4     1   316     5.0  838983392
...
[10000054 rows x 4 columns]
- users¶
Return the user data (from users.dat). Indexed by user ID.
>>> ml = ML1M()
>>> ml.users #doctest: +SKIP
     gender  age    zip
user
1         F    1  48067
2         M   56  70072
3         M   25  55117
4         M   45  02460
5         M   25  55455
...
[6040 rows x 3 columns]
- class lenskit.datasets.ML10M(path='data/ml-10M100K')¶
MovieLens 10M100K data set.
- movies¶
Return the movie data (from movies.dat). Indexed by movie ID.
>>> ml = ML10M()
>>> ml.movies #doctest: +SKIP
                                   title                                       genres
item
1                       Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy
2                         Jumanji (1995)                   Adventure|Children|Fantasy
3                Grumpier Old Men (1995)                               Comedy|Romance
4               Waiting to Exhale (1995)                         Comedy|Drama|Romance
5     Father of the Bride Part II (1995)                                       Comedy
...
[10681 rows x 2 columns]
- ratings¶
Return the rating data (from ratings.dat).
>>> ml = ML10M()
>>> ml.ratings #doctest: +SKIP
   user  item  rating  timestamp
0     1   122     5.0  838985046
1     1   185     5.0  838983525
2     1   231     5.0  838983392
3     1   292     5.0  838983421
4     1   316     5.0  838983392
...
[10000054 rows x 4 columns]
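As a rough illustration of the older .dat layout, the sketch below parses the first rows shown above with plain Pandas. It assumes fields in ratings.dat are separated by a double colon with no header line; the RATINGS_DAT string is a hypothetical stand-in for the file, and this is not the loader that ML10M or ML1M actually uses.

```python
import io

import pandas as pd

# Stand-in for the start of ratings.dat (assumed layout: '::' separated, no header).
RATINGS_DAT = (
    "1::122::5::838985046\n"
    "1::185::5::838983525\n"
    "1::231::5::838983392\n"
)

ratings = pd.read_csv(
    io.StringIO(RATINGS_DAT),
    sep="::",
    engine="python",  # multi-character separators need the Python parser
    header=None,
    names=["user", "item", "rating", "timestamp"],
)

# Store ratings as floats, matching the 5.0-style values shown above.
ratings["rating"] = ratings["rating"].astype("float64")
```

The Python parsing engine is slower than the default C engine, so for a file with about ten million rows a different reading strategy may be preferable.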
[ML] F. Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens Datasets: History and Context. ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4, Article 19 (December 2015), 19 pages. DOI=http://dx.doi.org/10.1145/2827872