Item Lists and Collections
==========================

.. currentmodule:: lenskit.data

Throughout its data handling, components, and evaluation metrics, LensKit uses
an abstraction of an “item list” (:class:`~lenskit.data.ItemList`).  An item
list is a list of items along with additional fields, such as scores and
ratings; it supports both item IDs and numbers, and can record a vocabulary to
convert between them.  It is also backend-agnostic and can present fields
(except for the item ID) as NumPy arrays, Pandas series (optionally indexed by
item ID or number), and Torch tensors.

Item lists are used to represent a user's history, the candidate items for
scoring and/or ranking, recommendation results, test data, etc.

.. _item-list-convert:

Data Conversion
~~~~~~~~~~~~~~~

Item lists support round-tripping with Pandas :py:class:`data frames ` and PyArrow :py:class:`tables `::

    >>> import pandas as pd
    >>> from lenskit.data import ItemList
    >>> il = ItemList(item_ids=['a', 'b'], scores=[1.5, 1.2], rating=[3, 5])
    >>> il.to_df()
      item_id  score  rating
    0       a    1.5       3
    1       b    1.2       5

.. _item-list-collections:

Item List Collections
~~~~~~~~~~~~~~~~~~~~~

On top of the :class:`~lenskit.data.ItemList` we build the idea of an item list
*collection* (:class:`~lenskit.data.ItemListCollection`).  An item list
collection is a list or dictionary of item lists, associated with keys (e.g.
the user ID).

Semantically, an :class:`ItemListCollection` is a *list* (more specifically, a
:class:`~collections.abc.Sequence`) of (key, list) tuples.  It supports the
usual sequence operations: iteration, :func:`len`, and retrieving a list and
its key by position with ``ilc[pos]``.  Lists can also be looked up by key
using :meth:`~ItemListCollection.lookup`.

Keys, Schemas, and Lookup
-------------------------

.. testsetup::

    >>> from lenskit.data import ItemListCollection, ItemList

Item list collections use **keys** following a schema that is set when the item
list collection is created.
A key schema or key type defines one or more *key fields*, optionally with
associated types, that are used to identify and look up item lists in the
collection.

In the simple, common case, item lists are associated with user IDs and the key
has a single field ``user_id``.  These keys are sufficiently common that they
have their own key type, :class:`UserIDKey`.  Other experimental designs or
data sets can use other key schemas.  For example, for session-based
recommendation, you may want a ``session_id`` key field.

Key schemas can be defined in two ways:

- A named tuple type, created with either :class:`typing.NamedTuple` or
  :func:`collections.namedtuple`.
- A list of field names.  When this method is used, LensKit creates a named
  tuple type using :func:`~collections.namedtuple` (caching it, so only a
  single type is created for each schema).

Pass the key schema to the :class:`ItemListCollection` constructor or to other
methods such as :meth:`ItemListCollection.empty` to create a list collection
with the specified schema::

    >>> ilc = ItemListCollection.empty(['user_id'])
    >>> ilc.key_fields
    ('user_id',)
    >>> ilc.key_type

When adding to the collection, you can pass the key fields for the new list as
keyword arguments to the :meth:`~ItemListCollection.add` method::

    >>> ilc.add(ItemList([5, 10]), user_id=42)
    >>> len(ilc)
    1
    >>> ilc[0]
    (UserIDKey(user_id=42), )

This list can also be retrieved by key with
:meth:`~ItemListCollection.lookup`::

    >>> ilc.lookup(user_id=42)

See the :class:`ItemListCollection` documentation for further methods.

Pandas Conversions
------------------

You can create an item list collection from a data frame with
:meth:`~ItemListCollection.from_df`:

.. code:: python

    ilc = ItemListCollection.from_df(df, UserIDKey)

The :meth:`ItemListCollection.to_df` method goes the other way, converting the
collection to a Pandas data frame.

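
As a rough illustration of the ideas above, the following standard-library
sketch builds a cached named tuple key type from a list of field names and
groups flat, one-row-per-item records into (key, list) pairs.  It is not
LensKit's implementation: ``key_type_for`` and ``group_rows`` are hypothetical
helpers, plain dictionaries stand in for data frame rows, and plain lists of
item IDs stand in for item lists.

.. code:: python

    from collections import namedtuple
    from functools import lru_cache


    @lru_cache(maxsize=None)
    def key_type_for(fields):
        """Build a named tuple type for a tuple of key field names.

        Hypothetical helper: the cache ensures that the same schema
        always yields the same type object.
        """
        return namedtuple("Key", fields)


    def group_rows(rows, key_fields):
        """Group flat rows (dicts) into (key, item_ids) pairs.

        Hypothetical helper: assumes every row has an ``item_id`` entry
        plus one entry per key field.
        """
        key_type = key_type_for(tuple(key_fields))
        groups = {}
        for row in rows:
            key = key_type(*(row[f] for f in key_fields))
            groups.setdefault(key, []).append(row["item_id"])
        # Dicts preserve insertion order, so keys appear in first-seen order.
        return list(groups.items())


    rows = [
        {"user_id": 42, "item_id": 5},
        {"user_id": 42, "item_id": 10},
        {"user_id": 7, "item_id": 5},
    ]
    pairs = group_rows(rows, ["user_id"])  # sequence of (key, list) pairs
    lookup = dict(pairs)                   # key-based access to the same lists

In this sketch a key with no rows never appears in the result, so a flat
one-row-per-item layout cannot carry an empty list through the grouping step.
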

Saving and Loading
------------------

To save an item list collection to a disk file or load one back, we recommend
using :meth:`~ItemListCollection.save_parquet` and
:meth:`~ItemListCollection.load_parquet`.  They use a Parquet schema with one
row per list that can correctly save and load empty item lists.

Motivation
~~~~~~~~~~

Given that LensKit for Python's initial design guidelines :cite:p:`lkpy`
emphasize the use of standard data structures, why did we introduce a new
abstraction instead of continuing to use Pandas data frames?  There are several
reasons for this.

* Pandas data frames are not self-documenting; if a component returns a data
  frame, that is not enough information to know what columns to expect (without
  advanced type trickery that stretches or exceeds the limits of Python's
  typing model).  We were duplicating that knowledge across the code base, and
  conveniences such as autocomplete for the available columns were not
  available.  Incorrect columns were also, in Michael's experience, a common
  source of bugs and difficulties.

* Many components were only using Pandas at the interface, and internally were
  converting to sparse matrices, tensors, etc.; by standardizing some of that
  support, and only converting data formats when necessary, we can make data
  conversions more consistent across LensKit and reduce the number of
  conversions and CPU/GPU round-trips when chaining together components using
  the same compute backend.

* The item ID / number logic specifically was duplicated across many modules,
  and was also one of the first things Michael needed to teach when teaching
  RecSys; a standard, documented abstraction that handles that logic makes it
  easier both to write components and to teach recommendation concepts.