Performance Tips
================

LensKit strives to provide pretty good performance (in terms of computation speed),
but sometimes it needs a little nudging.

.. note::

    If you are implementing an algorithm, see the `implementation tips`_ for
    information on good performance.

.. _implementation tips: impl-tips.html

Quick Tips
----------

* Use Conda-based Python, with ``tbb`` installed.
* When using MKL, set the ``MKL_THREADING_LAYER`` environment variable to ``tbb``,
  so both MKL and LensKit will use TBB and can coordinate their thread pools.
* Use ``LK_NUM_PROCS`` if you want to control LensKit's batch prediction and
  recommendation parallelism, and ``NUMBA_NUM_THREADS`` to control its model
  training parallelism.

We generally find the best performance using MKL with TBB throughout the stack on
Intel processors. If both LensKit's Numba-accelerated code and MKL are using TBB,
they will coordinate their thread pools and avoid oversubscribing the CPU.

If you are **not** using MKL (e.g. on Apple Silicon, and possibly on AMD
processors), we recommend controlling your BLAS parallelism. For OpenBLAS, how you
control this depends on how OpenBLAS was built, whether Numba is using OpenMP or
TBB, and whether you are training or evaluating the model.

When LensKit starts (usually at model training time), it will check your runtime
environment and log warning messages if it detects problems. During evaluation, it
also makes a best-effort attempt, through `threadpoolctl`_, to disable nested
parallelism when running a parallel evaluation.

.. _threadpoolctl: https://github.com/joblib/threadpoolctl

Controlling Parallelism
-----------------------

LensKit has two forms of parallelism. Algorithm training processes can be
parallelized through a number of mechanisms:

* Our own parallel code uses Numba, which in turn uses TBB (preferred) or OpenMP.
  The thread count is controlled by ``NUMBA_NUM_THREADS``.
* The BLAS library may parallelize underlying operations using its threading
  library. This is usually OpenMP; MKL also supports TBB, but unlike Numba, it
  defaults to OpenMP even if TBB is available.
* Underlying libraries such as TensorFlow and scikit-learn may provide their own
  parallelism.

The LensKit `batch functions`_ use Python ``multiprocessing``, and their
concurrency level is controlled by the ``LK_NUM_PROCS`` environment variable. The
default number of processes is one-half the number of cores as reported by
:py:func:`multiprocessing.cpu_count`. The batch functions also set the thread
count for some libraries within the worker processes, to prevent over-subscribing
the CPU; right now, the worker will configure Numba and MKL. In the rest of this
section, this is referred to as the ‘inner thread count’.

The thread count logic is controlled by :py:func:`lenskit.util.parallel.proc_count`
and works as follows (a brief example follows the list):

* If ``LK_NUM_PROCS`` is an integer, the batch functions will use the specified
  number of processes, with 1 inner thread.
* If ``LK_NUM_PROCS`` is a comma-separated pair of integers (e.g. ``8,4``), the
  batch functions will use the first number as the process count and the second
  number as the inner thread count. This **overrides** ``NUMBA_NUM_THREADS``,
  unless the specified thread count is larger than ``NUMBA_NUM_THREADS``.
* If ``LK_NUM_PROCS`` is not set, the batch functions use half the number of cores
  as the process count and 2 as the inner thread count (unless
  ``NUMBA_NUM_THREADS`` is set to 1 in the environment).
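For example, the following sketch configures eight worker processes with four
inner threads each. The variable names are the ones described on this page; the
specific values are illustrative, not recommendations. The variables can equally
be set in your shell or job script; the only requirement is that they are set
before NumPy, Numba, and LensKit are imported.

.. code-block:: python

    import os

    # These must be set before NumPy, Numba, or LensKit are imported.
    os.environ["MKL_THREADING_LAYER"] = "tbb"  # let MKL and Numba coordinate via TBB (MKL builds only)
    os.environ["NUMBA_NUM_THREADS"] = "16"     # threads for Numba-accelerated model training
    os.environ["LK_NUM_PROCS"] = "8,4"         # 8 batch worker processes, 4 inner threads each

    import lenskit.batch  # imported after the environment is configured
    # ... build your algorithm and call the batch functions as usual

With these settings, the batch functions run eight worker processes, each limited
to four Numba/MKL threads, while model training outside the batch functions uses
up to 16 Numba threads.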
.. _batch functions: batch.html

Other Notes
-----------

* Batch parallelism **disables** TensorFlow GPUs in the worker processes. This is
  fine, because GPUs are most useful for model training; multiple worker processes
  competing for the GPU causes problems.
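As noted above, LensKit already makes a best-effort attempt, through
`threadpoolctl`_, to limit nested BLAS parallelism during parallel evaluation. If
you need to do this yourself (for example, on an OpenBLAS-backed stack outside the
batch functions), the following sketch shows one way to do it; the limit of one
BLAS thread is an assumption for illustration, not a universal recommendation.

.. code-block:: python

    from threadpoolctl import threadpool_info, threadpool_limits

    # Inspect which BLAS / threading libraries are loaded and their thread counts.
    for pool in threadpool_info():
        print(pool["user_api"], pool.get("internal_api"), pool["num_threads"])

    # Temporarily cap BLAS at a single thread while other parallelism does the work.
    with threadpool_limits(limits=1, user_api="blas"):
        ...  # train or evaluate your model here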