lenskit.data.ItemList

class lenskit.data.ItemList(source=None, *, item_ids=None, item_nums=None, vocabulary=None, ordered=None, scores=None, **fields)

Bases: object

Representation of a (usually ordered) list of items, possibly with scores and other associated data; many components take and return item lists. Item lists are to be treated as immutable — create a new list with modified data, do not do in-place modifications of the list itself or the arrays or data frame it returns.

An item list is logically a list of rows, each of which is an item with multiple fields. A designated field, score, is available through the scores() method, and is always single-precision floating-point.

Item lists can be subset as an array (e.g. items[selector]), where integer indices (or arrays thereof), boolean arrays, and slices are allowed as selectors.
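The selector kinds listed above can be illustrated without LensKit itself. The following is a plain-Python sketch, not the library's implementation: select is a hypothetical helper that mimics what items[selector] accepts, and it assumes that a single integer yields a one-row result.

```python
from typing import Sequence, Union

Selector = Union[int, slice, Sequence[int], Sequence[bool]]


def select(rows: list, selector: Selector) -> list:
    """Toy stand-in for ``items[selector]`` (hypothetical, not LensKit code)."""
    if isinstance(selector, slice):
        return rows[selector]
    if isinstance(selector, bool):
        # bool is a subclass of int, so reject it before the integer branch
        raise TypeError("a single boolean is not a valid selector")
    if isinstance(selector, int):
        # assumption: one integer selects a one-row list
        return [rows[selector]]
    selector = list(selector)
    if selector and all(isinstance(s, bool) for s in selector):
        # boolean mask: keep the rows whose flag is True
        if len(selector) != len(rows):
            raise ValueError("boolean mask must match the list length")
        return [row for row, keep in zip(rows, selector) if keep]
    # otherwise treat the selector as an array of integer positions
    return [rows[i] for i in selector]


rows = ["a", "b", "c", "d"]
by_index = select(rows, 2)
by_positions = select(rows, [3, 0])
by_mask = select(rows, [True, False, True, False])
by_slice = select(rows, slice(0, 2))
```

Each call returns a new list and leaves rows unchanged, which matches the immutability rule stated earlier.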

When an item list is pickled, it is pickled compactly but only for CPUs: the vocabulary is dropped (after ensuring both IDs and numbers are computed), and all arrays are pickled as NumPy arrays. This makes item lists compact to serialize and transmit, but does mean that serializing an item list whose scores are still on the GPU will deserialize on the CPU in the receiving process. This is usually not a problem, because item lists are typically used for small lists of items, not large data structures that need to remain in shared memory.
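The "compute both, then drop the vocabulary" step can be sketched with the standard pickle protocol. This is a minimal illustration under stated assumptions, not LensKit's code: TinyItemList is a hypothetical class, and its vocabulary is a plain Python list whose positions serve as item numbers.

```python
import pickle


class TinyItemList:
    """Hypothetical miniature of an item list; not the LensKit class."""

    def __init__(self, ids=None, numbers=None, vocabulary=None):
        self._ids = ids
        self._numbers = numbers
        self.vocabulary = vocabulary  # assumption: a list, index = item number

    def ids(self):
        if self._ids is None:
            self._ids = [self.vocabulary[n] for n in self._numbers]
        return self._ids

    def numbers(self):
        if self._numbers is None:
            self._numbers = [self.vocabulary.index(i) for i in self._ids]
        return self._numbers

    def __getstate__(self):
        # Materialize IDs and numbers first, then leave the vocabulary out.
        return {"_ids": self.ids(), "_numbers": self.numbers(), "vocabulary": None}


vocab = ["apple", "banana", "cherry"]
original = TinyItemList(numbers=[2, 0], vocabulary=vocab)

state = original.__getstate__()  # what pickle would store for the object
restored = pickle.loads(pickle.dumps(state))
```

The restored state still answers both "which IDs" and "which numbers" even though the vocabulary did not travel with it.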

Note

Naming for fields and accessor methods is tricky, because the usual convention for a data frame is to use singular column names (e.g. “item_id”, “score”) instead of plural (“item_ids”, “scores”) — the data frame, like a database table, is a list of instances, and the column names are best interpreted as naming attributes of individual instances.

However, when working with a list of e.g. item IDs, it is more natural — at least to this author — to use plural names: item_ids. Since this class is doing somewhat double-duty, representing a list of items along with associated data, as well as a data frame of columns representing items, the appropriate naming is not entirely clear. The naming convention in this class is therefore as follows:

  • Field names are singular (item_id, score).

  • Named accessor methods are plural (item_ids(), scores()).

  • Both singular and plural forms are accepted for item IDs, numbers, and scores in the keyword arguments. Other field names should be singular.
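The keyword-argument rule above amounts to a small alias table. The sketch below shows one way such normalization could work; normalize_kwargs and ALIASES are hypothetical names, and rejecting a name given in both spellings is an assumption, not documented LensKit behavior.

```python
# Singular spellings accepted as alternates for the plural keyword arguments.
ALIASES = {"item_id": "item_ids", "item_num": "item_nums", "score": "scores"}


def normalize_kwargs(**kwargs):
    """Map singular spellings to the plural keyword names (hypothetical helper)."""
    out = {}
    for name, value in kwargs.items():
        canonical = ALIASES.get(name, name)
        if canonical in out:
            # assumption: giving both spellings at once is treated as an error
            raise TypeError(f"{canonical} given more than once")
        out[canonical] = value
    return out


# "rating" is an ordinary field, so its singular name passes through unchanged.
args = normalize_kwargs(item_id=[10, 20], score=[0.5, 0.1], rating=[4.0, 3.0])
```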

Todo

Right now, selection / subsetting only happens on the CPU, and will move data to the CPU for the subsetting operation. There is no reason, in principle, why we cannot subset on GPU. Future revisions may add support for this.

Parameters:
  • source (ItemList | IDSequence | None) – A source item list. If provided and an ItemList, its fields and data are used to initialize any aspects of the item list that are not provided in the other arguments. Otherwise, it is interpreted as item_ids.

  • item_ids (IDSequence | None) – A list or array of item identifiers. item_id is accepted as an alternate name.

  • item_nums (NDArray[np.int32] | pd.Series[int] | Sequence[int] | ArrayLike | None) – A list or array of item numbers. item_num is accepted as an alternate name.

  • vocabulary (Vocabulary | None) – A vocabulary to translate between item IDs and numbers.

  • ordered (bool | None) – Whether the list has a meaningful order.

  • scores (NDArray[np.generic] | torch.Tensor | ArrayLike | Literal[False] | np.floating | float | None) – An array of scores for the items. Pass the value False to remove the scores when copying from a source list.

  • fields (NDArray[np.generic] | torch.Tensor | ArrayLike | Literal[False]) – Additional fields, such as score or rating. Field names should generally be singular; the named keyword arguments and accessor methods are plural for readability (“get the list of item IDs”). Pass the value False to remove the field when copying from a source list.
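The interaction between source and the other arguments follows a three-way rule per field: not provided, provided, or False. The sketch below spells that rule out in plain Python; resolve_field is a hypothetical helper used only for illustration, and it assumes that "not provided" is represented by None.

```python
def resolve_field(inherited, override=None):
    """Pick the value a new list would use for one field (hypothetical helper).

    inherited: the value carried by the source list, or None if it has none.
    override:  the keyword argument passed when building the new list.
    """
    if override is False:
        return None       # False explicitly removes the field
    if override is None:
        return inherited  # not provided: fall back to the source list
    return override       # provided: replaces whatever the source had


source_scores = [0.9, 0.4]
kept = resolve_field(source_scores)
removed = resolve_field(source_scores, False)
replaced = resolve_field(source_scores, [0.1, 0.2])
```

The identity check against False matters here: a score of 0.0 is falsy but is not False, so it would still count as a provided value.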

__init__(source=None, *, item_ids=None, item_nums=None, vocabulary=None, ordered=None, scores=None, **fields)
Parameters:
  • source (ItemList | IDSequence | None)

  • item_ids (IDSequence | None)

  • item_nums (NDArray[np.int32] | pd.Series[int] | Sequence[int] | ArrayLike | None)

  • vocabulary (Vocabulary | None)

  • ordered (bool | None)

  • scores (NDArray[np.generic] | torch.Tensor | ArrayLike | Literal[False] | np.floating | float | None)

  • fields (NDArray[np.generic] | torch.Tensor | ArrayLike | Literal[False])

Methods

__init__([source, item_ids, item_nums, ...])

arrow_types(*[, ids, numbers])

Get the Arrow data types for this item list.

clone()

Make a shallow copy of the item list.

field()

from_arrow(tbl, *[, vocabulary])

Create an item list from an Arrow table or structured array.

from_df(df, *[, vocabulary, keep_user])

Create an item list from a Pandas data frame.

from_vocabulary(vocab)

ids()

Get the item IDs.

numbers()

Get the item numbers.

ranks()

Get an array of ranks for the items in this list, if it is ordered.

scores()

Get the item scores (if available).

to_arrow()

Convert the item list to an Arrow table.

to_df(*[, ids, numbers])

Convert this item list to a Pandas data frame.

Attributes

ordered

Whether this list has a meaningful order.

vocabulary

Get the item list's vocabulary, if available.

ordered: bool = False

Whether this list has a meaningful order.

classmethod from_df(df, *, vocabulary=None, keep_user=False)

Create an item list from a Pandas data frame. The frame should have item_num and/or item_id columns to identify the items; other columns (e.g. score or rating) are added as fields. If the data frame has user columns (user_id or user_num), those are dropped by default.

Parameters:
  • df (DataFrame) – The data frame to turn into an item list.

  • vocabulary (Vocabulary | None) – The item vocabulary.

  • keep_user (bool) – If True, keeps user ID/number columns instead of dropping them.

Return type:

ItemList

classmethod from_arrow(tbl, *, vocabulary=None)

Create an item list from an Arrow table or structured array. The table should have item_num and/or item_id columns to identify the items; other columns (e.g. score or rating) are added as fields. If the table has user columns (user_id or user_num), those are dropped.

Parameters:
  • tbl (StructArray | ChunkedArray | Table) – The Arrow table or array to convert to an item list.

  • vocabulary (Vocabulary | None) – The item vocabulary.

Return type:

ItemList

clone()

Make a shallow copy of the item list.

Return type:

ItemList

property vocabulary: Vocabulary | None

Get the item list’s vocabulary, if available.

ids()

Get the item IDs.

Returns:

An array of item identifiers.

Raises:

RuntimeError – if the item list was not created with IDs or a Vocabulary.

Return type:

ndarray[tuple[int], dtype[integer[Any] | str_ | bytes_ | object_]]

numbers(format: Literal['numpy'] = 'numpy', *, vocabulary: Vocabulary | None = None, missing: Literal['error', 'negative'] = 'error') → ndarray[Any, dtype[int32]]
numbers(format: Literal['torch'], *, vocabulary: Vocabulary | None = None, missing: Literal['error', 'negative'] = 'error') → Tensor
numbers(format: Literal['arrow'], *, vocabulary: Vocabulary | None = None, missing: Literal['error', 'negative'] = 'error') → pa.Array[pa.Int32Scalar]
numbers(format: LiteralString = 'numpy', *, vocabulary: Vocabulary | None = None, missing: Literal['error', 'negative'] = 'error') → ArrayLike

Get the item numbers.

Parameters:
  • format – The array format to use.

  • vocabulary – An alternate vocabulary for mapping IDs to numbers. If provided, then the item list must have IDs (either stored, or through a vocabulary).

Returns:

An array of item numbers.

Raises:

scores(format: Literal['numpy'] = 'numpy') → ndarray[Any, dtype[float32]] | None
scores(format: Literal['torch']) → Tensor | None
scores(format: Literal['arrow']) → pa.Array[pa.FloatScalar] | None
scores(format: Literal['pandas'], *, index: Literal['ids', 'numbers'] | None = None) → Series | None
scores(format: LiteralString = 'numpy') → ArrayLike | None

Get the item scores (if available).

ranks(format: Literal['numpy'] = 'numpy') → ndarray[Any, dtype[int32]] | None
ranks(format: Literal['torch']) → Tensor | None
ranks(format: Literal['arrow']) → pa.Array[pa.Int32Scalar] | None
ranks(format: LiteralString = 'numpy') → ArrayLike | None

Get an array of ranks for the items in this list, if it is ordered. Unordered lists have no ranks. The ranks are based on the order in the list, not on the score.

Item ranks start with 1, for compatibility with common practice in mathematically defining information retrieval metrics and operations.

Returns:

An array of item ranks, or None if the list is unordered.
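The rank rules above (1-based, taken from list position, absent for unordered lists) fit in a few lines. This is a plain-Python sketch rather than the library's implementation; ranks_for is a hypothetical helper.

```python
def ranks_for(n_items, ordered):
    """Return 1-based ranks for an ordered list, or None if it is unordered."""
    if not ordered:
        return None
    return list(range(1, n_items + 1))


items = ["c", "a", "b"]
scores = [0.2, 0.9, 0.5]  # deliberately unsorted: ranks follow position, not score

ranks = ranks_for(len(items), ordered=True)

# Starting at 1 lets metric formulas use the rank directly,
# e.g. the reciprocal rank of item "a" is 1 / rank.
reciprocal_rank = 1.0 / ranks[items.index("a")]
```

Item "a" has the highest score but sits second in the list, so its rank is 2.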

to_df(*, ids=True, numbers=True)

Convert this item list to a Pandas data frame. It has the following columns:

  • item_id — the item IDs (if available and ids=True)

  • item_num — the item numbers (if available and numbers=True)

  • score — the item scores

  • rank — the item ranks (if the list is ordered)

  • all other defined fields, using their field names

Parameters:
  • ids (bool)

  • numbers (bool)

Return type:

DataFrame
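The conditions on each column can be summarized as a small decision procedure. The sketch below is illustrative only: frame_columns is a hypothetical helper, the column order is taken from the list above, and skipping the score column when the list has no scores is an assumption.

```python
def frame_columns(*, has_ids, has_numbers, has_scores, ordered,
                  fields=(), ids=True, numbers=True):
    """List the columns a to_df-style export would contain (hypothetical helper)."""
    columns = []
    if ids and has_ids:
        columns.append("item_id")
    if numbers and has_numbers:
        columns.append("item_num")
    if has_scores:  # assumption: no score column when the list has no scores
        columns.append("score")
    if ordered:
        columns.append("rank")
    columns.extend(fields)  # all other defined fields, under their own names
    return columns


full = frame_columns(has_ids=True, has_numbers=True, has_scores=True,
                     ordered=True, fields=["rating"])
ids_only = frame_columns(has_ids=True, has_numbers=True, has_scores=False,
                         ordered=False, numbers=False)
```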

to_arrow(*, ids: bool = True, numbers: bool = False, type: Literal['table'] = 'table', columns: dict[str, DataType] | None = None) → Table
to_arrow(*, ids: bool = True, numbers: bool = False, type: Literal['array'], columns: dict[str, DataType] | None = None) → StructArray

Convert the item list to an Arrow table.

arrow_types(*, ids=True, numbers=False)

Get the Arrow data types for this item list.

Parameters:
  • ids (bool)

  • numbers (bool)

Return type:

dict[str, DataType]