Data Set Utilities
The lenskit.datasets module provides utilities for reading a variety of commonly-used LensKit data sets. It does not package or automatically download them, but loads them from a local directory where you have unpacked the data set. Each data set class or function takes a path parameter specifying the location of the data set.
The normal mode of operation for these utilities is to provide a class for the
data set; this class then exposes the data set’s data as attributes. These
attributes are cached internally, so e.g. accessing MovieLens.ratings
twice will only load the data file once.
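The caching behavior described above can be sketched with `functools.cached_property`. The `TinyDataSet` class, its `load_count` counter, and the `ratings.csv` file name are hypothetical and only illustrate the pattern; this is not LensKit's actual implementation.

```python
import csv
import tempfile
from functools import cached_property
from pathlib import Path


class TinyDataSet:
    """Hypothetical stand-in for a data set class (not LensKit code)."""

    def __init__(self, path):
        self.path = Path(path)
        self.load_count = 0  # counts how often the file is actually read

    @cached_property
    def ratings(self):
        # Runs on first access only; the result is then stored on the instance.
        self.load_count += 1
        with open(self.path / "ratings.csv", newline="") as f:
            return list(csv.DictReader(f))


with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "ratings.csv").write_text("user,item,rating\n1,31,2.5\n1,1029,3.0\n")
    data = TinyDataSet(tmp)
    first = data.ratings
    second = data.ratings  # served from the cache, no second read
    loads = data.load_count
```

Here `loads` ends up as 1 even though `ratings` was accessed twice, and both accesses return the same object.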
These data files have normalized column names to fit with LensKit’s general conventions:
- User ID columns are called user.
- Item ID columns are called item.
- Rating columns are called rating.
- Timestamp columns are called timestamp.
Other column names are unchanged. Data tables that provide information about specific things, such as a table of movie titles, are indexed by the relevant ID (e.g. MovieLens.movies is indexed by item).
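As a rough illustration of these conventions, the sketch below renames columns on toy frames and indexes the movie table by item. It assumes the raw files use the column names `userId` and `movieId`, as the current MovieLens CSV files do; the toy rows and the `RENAMES` mapping are made up for the example and are not taken from LensKit.

```python
import pandas as pd

# Toy frames using the raw column names assumed above.
raw_ratings = pd.DataFrame({
    "userId": [1, 1, 2],
    "movieId": [31, 1029, 31],
    "rating": [2.5, 3.0, 4.0],
    "timestamp": [1260759144, 1260759179, 1260759200],
})
raw_movies = pd.DataFrame({
    "movieId": [31, 1029],
    "title": ["Movie A", "Movie B"],  # placeholder titles
})

# Hypothetical mapping from raw names to the normalized names.
RENAMES = {"userId": "user", "movieId": "item"}

ratings = raw_ratings.rename(columns=RENAMES)
movies = raw_movies.rename(columns=RENAMES).set_index("item")

# Because movies is indexed by item, titles attach to ratings with one join.
joined = ratings.join(movies, on="item")
```

Indexing the descriptive tables by ID is what makes the final `join` a one-liner: each rating row picks up the title stored under its `item` value.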
MovieLens Data Sets
The GroupLens research group provides several data sets extracted from the MovieLens service [ML]. These can be downloaded from https://grouplens.org/datasets/movielens/.
-
class lenskit.datasets.MovieLens(path='data/ml-20m')
Code for reading current MovieLens data sets, including ML-20M, ML-Latest, and ML-Latest-Small.
- Parameters
path (str or pathlib.Path) – Path to the directory containing the data set.
-
property links
The movie link table, connecting movie IDs to external identifiers. It is indexed by movie ID.
>>> mlsmall = MovieLens('ml-latest-small')
>>> mlsmall.links
      imdbId  tmdbId
item
1     114709     862
2     113497    8844
3     113228   15602
4     114885   31357
5     113041   11862
...
[9125 rows x 2 columns]
-
property movies
The movie table, with titles and genres. It is indexed by movie ID.
>>> mlsmall = MovieLens('ml-latest-small')
>>> mlsmall.movies
                                   title                                       genres
item
1                       Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy
2                         Jumanji (1995)                   Adventure|Children|Fantasy
3                Grumpier Old Men (1995)                               Comedy|Romance
4               Waiting to Exhale (1995)                         Comedy|Drama|Romance
5     Father of the Bride Part II (1995)                                       Comedy
...
[9125 rows x 2 columns]
-
property ratings
The rating table.
>>> mlsmall = MovieLens('ml-latest-small')
>>> mlsmall.ratings
   user  item  rating   timestamp
0     1    31     2.5  1260759144
1     1  1029     3.0  1260759179
2     1  1061     3.0  1260759182
3     1  1129     2.0  1260759185
4     1  1172     4.0  1260759205
...
[100004 rows x 4 columns]
-
property tag_genome
The tag genome table, recording inferred item-tag relevance scores. This gets returned as a wide Pandas data frame, with rows indexed by item ID.
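To make "wide" concrete, the sketch below pivots a long table of relevance scores into one row per item and one column per tag. It assumes long-format input with the columns `movieId`, `tagId`, and `relevance`, and it labels the columns by tag ID; the scores are invented, and the column labels LensKit actually uses are not shown in this section.

```python
import pandas as pd

# Toy long-format scores: one row per (movie, tag) pair.
scores = pd.DataFrame({
    "movieId": [1, 1, 2, 2],
    "tagId": [1, 2, 1, 2],
    "relevance": [0.03, 0.50, 0.04, 0.25],
})

# Rename to the normalized item column, then spread tags across columns.
wide = (
    scores.rename(columns={"movieId": "item"})
    .pivot(index="item", columns="tagId", values="relevance")
)
```

Each cell of `wide` then holds the relevance of one tag for one item, so `wide.loc[1, 2]` looks up tag 2 for item 1.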
-
property tags
The tag application table, recording user-supplied tags for movies.
>>> mlsmall = MovieLens('ml-latest-small')
>>> mlsmall.tags
   user  ...   timestamp
0    15  ...  1138537770
1    15  ...  1193435061
2    15  ...  1170560997
3    15  ...  1170626366
4    15  ...  1141391765
...
[1296 rows x 4 columns]
-
class lenskit.datasets.ML100K(path='data/ml-100k')
The MovieLens 100K data set. This older data set is in a different format from the more current data sets loaded by MovieLens.
-
property available
Query whether the data set exists.
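This section does not show how the existence check is implemented. One plausible way to do it, sketched here as an assumption, is to test for a known file inside the data directory. The `data_set_available` helper and its `marker` parameter are hypothetical names, not part of LensKit.

```python
import tempfile
from pathlib import Path


def data_set_available(path, marker="u.data"):
    """Hypothetical check: True if the directory contains the marker file."""
    return (Path(path) / marker).is_file()


with tempfile.TemporaryDirectory() as tmp:
    missing = data_set_available(tmp)  # empty directory, nothing unpacked yet
    (Path(tmp) / "u.data").write_text("196\t242\t3\t881250949\n")
    present = data_set_available(tmp)  # marker file now exists
```

A check like this lets calling code skip or report a missing data set before attempting to load any tables.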
-
property movies
Return the movie data (from u.item).
>>> ml = ML100K('ml-100k')
>>> ml.movies
                  title      release  ...  War  Western
item                                  ...
1      Toy Story (1995)  01-Jan-1995  ...    0        0
2      GoldenEye (1995)  01-Jan-1995  ...    0        0
3     Four Rooms (1995)  01-Jan-1995  ...    0        0
4     Get Shorty (1995)  01-Jan-1995  ...    0        0
5        Copycat (1995)  01-Jan-1995  ...    0        0
...
[1682 rows x 23 columns]
-
property ratings
Return the rating data (from u.data).
>>> ml = ML100K('ml-100k')
>>> ml.ratings
   user  item  rating  timestamp
0   196   242     3.0  881250949
1   186   302     3.0  891717742
2    22   377     1.0  878887116
3   244    51     2.0  880606923
4   166   346     1.0  886397596
...
[100000 rows x 4 columns]
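For readers curious what loading u.data involves, here is a minimal sketch. It assumes the file is tab-separated with no header row, holding user, item, rating, and timestamp in that order, which is how the ML-100K distribution ships it. The in-memory sample stands in for the real file, and this is not necessarily how LensKit reads it.

```python
import io

import pandas as pd

# Two sample lines in the assumed u.data layout (tab-separated, no header).
sample = io.StringIO("196\t242\t3\t881250949\n186\t302\t3\t891717742\n")

ratings = pd.read_csv(
    sample,
    sep="\t",
    header=None,
    names=["user", "item", "rating", "timestamp"],
)
# The file stores whole-number ratings; convert to float as in the output above.
ratings["rating"] = ratings["rating"].astype(float)
```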
-
property users
Return the user data (from u.user).
>>> ml = ML100K('ml-100k')
>>> ml.users
      age gender  occupation    zip
user
1      24      M  technician  85711
2      53      F       other  94043
3      23      M      writer  32067
4      24      M  technician  43537
5      33      F       other  15213
...
[943 rows x 4 columns]
-
class lenskit.datasets.ML1M(path='data/ml-1m')
MovieLens 1M data set.
Note
Some documentation examples use ML-10M100K; that is because this class shares implementation with the 10M data set.
-
property movies
Return the movie data (from movies.dat). Indexed by movie ID.
>>> ml = ML10M()
>>> ml.movies
                                   title                                       genres
item
1                       Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy
2                         Jumanji (1995)                   Adventure|Children|Fantasy
3                Grumpier Old Men (1995)                               Comedy|Romance
4               Waiting to Exhale (1995)                         Comedy|Drama|Romance
5     Father of the Bride Part II (1995)                                       Comedy
...
[10681 rows x 2 columns]
-
property ratings
Return the rating data (from ratings.dat).
>>> ml = ML10M()
>>> ml.ratings
   user  item  rating  timestamp
0     1   122     5.0  838985046
1     1   185     5.0  838983525
2     1   231     5.0  838983392
3     1   292     5.0  838983421
4     1   316     5.0  838983392
...
[10000054 rows x 4 columns]
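The .dat files in the 1M and 10M releases separate fields with a double colon, which pandas' default C parser cannot split on. A minimal sketch of reading such a file, assuming no header row and the field order user, item, rating, timestamp, could look like the following. It is an illustration only, not necessarily the approach LensKit takes.

```python
import io

import pandas as pd

# Two sample lines in the assumed ratings.dat layout.
sample = io.StringIO("1::122::5::838985046\n1::185::5::838983525\n")

ratings = pd.read_csv(
    sample,
    sep="::",          # multi-character separator
    engine="python",   # required for separators longer than one character
    header=None,
    names=["user", "item", "rating", "timestamp"],
)
ratings["rating"] = ratings["rating"].astype(float)
```

The Python parsing engine is slower than the default one, which is worth keeping in mind for the ten-million-row file.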
-
property users
Return the user data (from users.dat). Indexed by user ID.
>>> ml = ML1M()
>>> ml.users
     gender  age    zip
user
1         F    1  48067
2         M   56  70072
3         M   25  55117
4         M   45  02460
5         M   25  55455
...
[6040 rows x 3 columns]
-
class lenskit.datasets.ML10M(path='data/ml-10M100K')
MovieLens 10M100K data set.
-
property movies
Return the movie data (from movies.dat). Indexed by movie ID.
>>> ml = ML10M()
>>> ml.movies
                                   title                                       genres
item
1                       Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy
2                         Jumanji (1995)                   Adventure|Children|Fantasy
3                Grumpier Old Men (1995)                               Comedy|Romance
4               Waiting to Exhale (1995)                         Comedy|Drama|Romance
5     Father of the Bride Part II (1995)                                       Comedy
...
[10681 rows x 2 columns]
-
property ratings
Return the rating data (from ratings.dat).
>>> ml = ML10M()
>>> ml.ratings
   user  item  rating  timestamp
0     1   122     5.0  838985046
1     1   185     5.0  838983525
2     1   231     5.0  838983392
3     1   292     5.0  838983421
4     1   316     5.0  838983392
...
[10000054 rows x 4 columns]
[ML] F. Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens Datasets: History and Context. ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4, Article 19 (December 2015), 19 pages. DOI=http://dx.doi.org/10.1145/2827872