Performance Tips
LensKit strives to provide good computational performance out of the box, but sometimes it needs a little nudging.
Todo: Update for 2025.1.
Note: If you are implementing an algorithm, see the implementation tips for information on good performance.
Quick Tips
- Use Conda-based Python, with `tbb` installed.
- When using MKL, set the `MKL_THREADING_LAYER` environment variable to `tbb`, so both MKL and LensKit will use TBB and can coordinate their thread pools.
- Use `LK_NUM_PROCS` if you want to control LensKit's batch prediction and recommendation parallelism, and `NUMBA_NUM_THREADS` to control its model training parallelism (see the example after this list).
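These variables can be set in the shell before launching Python, or early in a script before LensKit and Numba are first imported. A minimal sketch, with example values only:

```python
import os

# Set these before MKL, Numba, and LensKit are imported so the threading
# layers pick them up.  The specific values here are examples, not defaults.
os.environ["MKL_THREADING_LAYER"] = "tbb"  # MKL and Numba share TBB thread pools
os.environ["LK_NUM_PROCS"] = "8"           # batch prediction/recommendation processes
os.environ["NUMBA_NUM_THREADS"] = "8"      # Numba threads for model training

import lenskit  # noqa: E402 (imported after configuring the environment)
```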
We generally find the best performance using MKL with TBB throughout the stack on Intel processors. If both LensKit's Numba-accelerated code and MKL are using TBB, they will coordinate their thread pools and keep the overall threading level under control.
If you are not using MKL (e.g. on Apple Silicon, and possibly on AMD processors), we recommend controlling your BLAS parallelism. For OpenBLAS, how you control this depends on how OpenBLAS was built, whether Numba is using OpenMP or TBB, and whether you are training or evaluating the model.
When LensKit starts (usually at model training time), it will check your runtime environment and log warning messages if it detects problems. It also makes a best-effort attempt, through threadpoolctl, to disable nested parallelism when running a parallel evaluation.
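LensKit does this internally, but if you are writing your own parallel evaluation loop you can apply the same kind of limit directly with threadpoolctl. This is a sketch, and `run_my_evaluation` is a hypothetical stand-in for your own code:

```python
from threadpoolctl import threadpool_limits

# Cap BLAS threads while the surrounding code is already running in
# parallel worker processes, to avoid nested parallelism.
with threadpool_limits(limits=1, user_api="blas"):
    run_my_evaluation()  # hypothetical: your own evaluation code goes here
```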
Controlling Parallelism
LensKit has two forms of parallelism: parallel model training and parallel batch evaluation. Algorithm training processes can be parallelized through a number of mechanisms:

- Our own parallel code uses Numba, which in turn uses TBB (preferred) or OpenMP. The thread count is controlled by `NUMBA_NUM_THREADS`.
- The BLAS library may parallelize underlying operations using its threading library. This is usually OpenMP; MKL also supports TBB, but unlike Numba, it defaults to OpenMP even if TBB is available.
- Underlying libraries such as TensorFlow and scikit-learn may provide their own parallelism.
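You can inspect which threading layers are active in your environment; Numba reports its configured thread count, and threadpoolctl reports the native thread pools it can see:

```python
import numba
from threadpoolctl import threadpool_info

# Numba's parallel thread count (reflects NUMBA_NUM_THREADS when it is set).
print("Numba threads:", numba.get_num_threads())

# Native thread pools (BLAS, OpenMP, ...) visible to threadpoolctl.
for pool in threadpool_info():
    print(pool["user_api"], pool["internal_api"], "threads:", pool["num_threads"])
```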
The LensKit batch functions use Python `multiprocessing`, and their concurrency level is controlled by the `LK_NUM_PROCS` environment variable. The default number of processes is one-half the number of cores as reported by `multiprocessing.cpu_count()`.
The batch functions also set the thread count for some libraries within the worker processes, to prevent over-subscribing the CPU. Right now, the workers configure Numba and MKL. In the rest of this section, this is referred to as the 'inner thread count'.
The thread count logic is controlled by `lenskit.util.parallel.proc_count()`, and works as follows:
- If `LK_NUM_PROCS` is an integer, the batch functions will use the specified number of processes, with 1 inner thread.
- If `LK_NUM_PROCS` is a comma-separated pair of integers (e.g. `8,4`), the batch functions will use the first number as the process count and the second number as the inner thread count. This overrides `NUMBA_NUM_THREADS`, unless it is larger than `NUMBA_NUM_THREADS`.
- If `LK_NUM_PROCS` is not set, the batch functions use half the number of cores as the process count and 2 as the inner thread count (unless `NUMBA_NUM_THREADS` is set to 1 in the environment).
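The following sketch mirrors the rules above. It is not LensKit's actual `proc_count()` implementation, and the helper name is hypothetical; it just shows how the environment variable is interpreted:

```python
import multiprocessing
import os


def resolve_batch_parallelism():
    """Hypothetical helper mirroring the LK_NUM_PROCS rules described above."""
    spec = os.environ.get("LK_NUM_PROCS")
    numba_threads = os.environ.get("NUMBA_NUM_THREADS")

    if spec is None:
        # default: half the cores, 2 inner threads (1 if NUMBA_NUM_THREADS=1)
        procs = multiprocessing.cpu_count() // 2
        threads = 1 if numba_threads == "1" else 2
    elif "," in spec:
        # explicit pair, e.g. "8,4"; the inner count is capped at NUMBA_NUM_THREADS
        p, t = spec.split(",")
        procs, threads = int(p), int(t)
        if numba_threads is not None:
            threads = min(threads, int(numba_threads))
    else:
        # single integer: that many processes, each single-threaded
        procs, threads = int(spec), 1

    return procs, threads


print(resolve_batch_parallelism())
```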
Other Notes
Batch parallelism disables TensorFlow GPUs in the worker processes. This is fine, because GPUs are most useful for model training; multiple worker processes competing for the GPU causes problems.
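If you need similar behavior in your own worker code, TensorFlow can hide the GPUs from a process; this is one way to do it, not necessarily how LensKit's workers do it internally:

```python
import tensorflow as tf

# Hide GPUs from this process so parallel workers do not contend for the device.
# This must run before TensorFlow touches the GPU (i.e., before any GPU op).
tf.config.set_visible_devices([], "GPU")
```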