Papers using LensKit
This page lists known papers using the Python version of LensKit. If you use LensKit for research, please e-mail Michael Ekstrand <ekstrand@acm.org> with a copy of your paper and bibliographic information so we can add it to this list.

2025 (11)
User and Recommender Behavior Over Time: Contextualizing Activity, Effectiveness, Diversity, and Fairness in Book Recommendation.
Vaez Barenji, S.; Parajuli, S.; and Ekstrand, M. D.
In Adjunct Proceedings of the 33rd ACM Conference on User Modeling, Adaptation and Personalization (UMAP Adjunct '25), pages 280–287, New York, NY, USA, June 2025. Association for Computing Machinery.
doi: 10.1145/3708319.3733710
Data is an essential resource for studying recommender systems. While there has been significant work on improving and evaluating state-of-the-art models and measuring various properties of recommender system outputs, less attention has been given to the data itself, particularly how data has changed over time. Such documentation and analysis provide guidance and context for designing and evaluating recommender systems, particularly for evaluation designs making use of time (e.g., temporal splitting). In this paper, we present a temporal explanatory analysis of the UCSD Book Graph dataset scraped from Goodreads, a social reading and recommendation platform active since 2006. We measure the book interaction data using a set of activity, diversity, and fairness metrics; we then train a set of collaborative filtering algorithms on rolling training windows to observe how the same measures evolve over time in the recommendations. Additionally, we explore whether the introduction of algorithmic recommendations in 2011 was followed by observable changes in user or recommender system behavior.
Circumventing Misinformation Controls: Assessing the Robustness of Intervention Strategies in Recommender Systems.
Pathak, R.; and Spezzano, F.
In Proceedings of the 33rd ACM Conference on User Modeling, Adaptation and Personalization (UMAP '25), pages 279–284, New York, NY, USA, June 2025. Association for Computing Machinery.
doi: 10.1145/3699682.3728350
Recommender systems are essential on social media platforms, shaping the order of information users encounter and facilitating news discovery. However, these systems can inadvertently contribute to the spread of misinformation by reinforcing algorithmic biases, fostering excessive personalization, creating filter bubbles, and amplifying false narratives. Recent studies have demonstrated that intervention strategies, such as Virality Circuit Breakers and accuracy nudges, can effectively mitigate misinformation when implemented on top of recommender systems. Despite this, existing literature has yet to explore the robustness of these interventions against circumvention—where individuals or groups intentionally evade or resist efforts to counter misinformation. This research aims to address this gap, examining how well these interventions hold up in the face of circumvention tactics. Our findings highlight that these intervention strategies are generally robust against misinformation circumvention threats when applied on top of recommender systems.
Privacy Preservation through Practical Machine Unlearning.
Dilworth, R.
February 2025.
arXiv:2502.10635 [cs]
doi: 10.48550/arXiv.2502.10635
Machine Learning models thrive on vast datasets, continuously adapting to provide accurate predictions and recommendations. However, in an era dominated by privacy concerns, Machine Unlearning emerges as a transformative approach, enabling the selective removal of data from trained models. This paper examines methods such as Naive Retraining and Exact Unlearning via the SISA framework, evaluating their Computational Costs, Consistency, and feasibility using the HSpam14 dataset. We explore the potential of integrating unlearning principles into Positive Unlabeled (PU) Learning to address challenges posed by partially labeled datasets. Our findings highlight the promise of unlearning frameworks like DaRE for ensuring privacy compliance while maintaining model performance, albeit with significant computational trade-offs. This study underscores the importance of Machine Unlearning in achieving ethical AI and fostering trust in data-driven systems.
How to Diversify any Personalized Recommender?
Slokom, M.; Daniil, S.; and Hollink, L.
In Hauff, C.; Macdonald, C.; Jannach, D.; Kazai, G.; Nardini, F. M.; Pinelli, F.; Silvestri, F.; and Tonellotto, N., editor(s), Advances in Information Retrieval, pages 307–323, Cham, 2025. Springer Nature Switzerland
doi: 10.1007/978-3-031-88717-8_23
In this paper, we introduce a novel approach to improve the diversity of Top-N recommendations while maintaining accuracy. Our approach employs a user-centric pre-processing strategy aimed at exposing users to a wide array of content categories and topics. We personalize this strategy by selectively adding and removing a percentage of interactions from user profiles. This personalization ensures we remain closely aligned with user preferences while gradually introducing distribution shifts. Our pre-processing technique offers flexibility and can seamlessly integrate into any recommender architecture. We run extensive experiments on two publicly available data sets for news and book recommendations to evaluate our approach. We test various standard and neural network-based recommender system algorithms. Our results show that our approach generates diverse recommendations, ensuring users are exposed to a wider range of items. Furthermore, using pre-processed data for training leads to recommender systems achieving performance levels comparable to, and in some cases, better than those trained on original, unmodified data. Additionally, our approach promotes provider fairness by facilitating exposure to minority categories. (Our GitHub code is available at: https://github.com/SlokomManel/How-to-Diversify-any-Personalized-Recommender-).
Using emotion diversification based on movie reviews to improve the user experience of movie recommender systems.
Lansman, L.
Ph.D. Thesis, 2025.
ISBN: 9798311930970 Pages: 83
Paper: https://www.proquest.com/docview/3196617694
Movies are made with the intention of evoking an emotional response. In recent years, researchers have hypothesized that the emotional response evoked by a movie can be leveraged to augment recommender system algorithms. In this work, we demonstrate that emotion diversification improves the user experience of a movie recommender system. We augmented the 10M MovieLens dataset with values of the eight dimensions of Plutchik’s wheel of emotions by leveraging an emotion analysis method that extracts these eight dimensions from movie reviews on IMDB to form an ’emotional signature’. Based on the finding of Mokryn et al. (October 2020) that showed that a film’s emotional signature reflects the emotions the film elicits in viewers, we used each movie’s emotional signature to diversify the output of our recommender algorithm. We tested this novel emotion diversification method against an existing latent diversification method and a baseline version without diversification in an online user experiment with a custom-built movie recommender system. We also tested two different types of visualization, a graph view against a baseline of a list view, as the graph view would increase user understandability regarding the reason behind the recommended items provided. The results of this study show that the emotion diversification method significantly improves the user experience of the movie recommender system, surpassing both the baseline system and the latent diversification method in terms of perceived taste coverage and system satisfaction without significantly reducing the perceived recommendation quality or increasing the trade-off difficulty. Going beyond the traditional rating and/or interaction data used by traditional recommender systems, our work demonstrates the user experience benefits of extracting emotional data from rich, qualitative user feedback and using it to give users a more emotionally diverse set of recommendations.
DataRec: A Python Library for Standardized and Reproducible Data Management in Recommender Systems.
Mancino, A. C. M.; Bufi, S.; Fazio, A. D.; Ferrara, A.; Malitesta, D.; Pomo, C.; and Noia, T. D.
April 2025.
arXiv:2410.22972 [cs] version: 2
doi: 10.48550/arXiv.2410.22972
Recommender systems have demonstrated significant impact across diverse domains, yet ensuring the reproducibility of experimental findings remains a persistent challenge. A primary obstacle lies in the fragmented and often opaque data management strategies employed during the preprocessing stage, where decisions about dataset selection, filtering, and splitting can substantially influence outcomes. To address these limitations, we introduce DataRec, an open-source Python-based library specifically designed to unify and streamline data handling in recommender system research. By providing reproducible routines for dataset preparation, data versioning, and seamless integration with other frameworks, DataRec promotes methodological standardization, interoperability, and comparability across different experimental setups. Our design is informed by an in-depth review of 55 state-of-the-art recommendation studies ensuring that DataRec adopts best practices while addressing common pitfalls in data management. Ultimately, our contribution facilitates fair benchmarking, enhances reproducibility, and fosters greater trust in experimental results within the broader recommender systems community. The DataRec library, documentation, and examples are freely available at https://github.com/sisinflab/DataRec.
On the challenges of studying bias in Recommender Systems: The effect of data characteristics and algorithm configuration.
Daniil, S.; Slokom, M.; Cuper, M.; Liem, C.; Ossenbruggen, J. v.; and Hollink, L.
Information Retrieval Research, 1(1): 3–27. February 2025.
Number: 1
doi: 10.54195/irrj.19607
Statements on the propagation of bias by recommender systems are often hard to verify or falsify. Research on bias tends to draw from a small pool of publicly available datasets and is therefore bound by their specific properties. Additionally, implementation choices are often not explicitly described or motivated in research, while they may have an effect on bias propagation. In this paper, we explore the challenges of measuring and reporting popularity bias. We showcase the impact of data properties and algorithm configurations on popularity bias by combining real and synthetic data with well known recommender systems frameworks. First, we identify data characteristics that might impact popularity bias, and explore their presence in a set of available online datasets. Accordingly, we generate various datasets that combine these characteristics. Second, we locate algorithm configurations that vary across implementations in literature. We evaluate popularity bias for a number of datasets, three real and five synthetic, and configurations, and offer insights on their joint effect. We find that, depending on the data characteristics, various configurations of the algorithms examined can lead to different conclusions regarding the propagation of popularity bias. These results motivate the need for explicitly addressing algorithmic configuration and data properties when reporting and interpreting bias in recommender systems.
Optimal Dataset Size for Recommender Systems: Evaluating Algorithms' Performance via Downsampling.
Arabzadeh, A.
Master's thesis, University of Siegen, February 2025.
arXiv:2502.08845 [cs]
Paper: http://arxiv.org/abs/2502.08845
The analysis reveals that algorithm performance under different downsampling portions is influenced by factors such as dataset characteristics, algorithm complexity, and the specific downsampling configuration (scenario dependent). In particular, some algorithms, which generally showed lower absolute nDCG@10 scores compared to those that performed better, exhibited lower sensitivity to the amount of training data provided, demonstrating greater potential to achieve optimal efficiency in lower downsampling portions. For instance, on average, these algorithms retained ∼81% of their full-size performance when using only 50% of the training set. In certain configurations of the downsampling method, where the focus was on progressively involving more users while keeping the test set fixed in size, they even demonstrated higher nDCG@10 scores than when using the original full-size dataset. These findings underscore the feasibility of balancing sustainability and effectiveness, providing practical insights for designing energy-efficient recommender systems and advancing sustainable AI practices.
A Comparative Evaluation of Recommender Systems Tools.
Akhadam, A.; Kbibchi, O.; Mekouar, L.; and Iraqi, Y.
IEEE Access, 13: 29493–29522. 2025.
doi: 10.1109/ACCESS.2025.3541014
Due to the vast flow of information on the Internet, easy and effective access to information has become crucial. Recommender systems are important in information filtering, as they significantly impact large-scale internet web services such as YouTube, Netflix, and Amazon. As the demand for personalized recommendations continues to grow, researchers and practitioners alike strive to develop tools specifically designed for this purpose to meet the increasing need. In this work, we address the challenges associated with selecting software frameworks and Machine Learning (ML) algorithms for Recommender Systems (RSs), thus, we offer a detailed comparison of 42 open-source RS software to provide insights into their different features and capabilities. Furthermore, the paper presents a concise overview of various ML algorithms to generate recommendations, reviews the most used performance metrics to evaluate RS, and then compares several ML algorithms provided by four popular recommendation tools: Microsoft Recommenders, Lenskit, Turi Create, and Cornac.
Extending MovieLens-32M to Provide New Evaluation Objectives.
Smucker, M. D.; and Chamani, H.
April 2025.
arXiv:2504.01863 [cs]
doi: 10.48550/arXiv.2504.01863
Offline evaluation of recommender systems has traditionally treated the problem as a machine learning problem. In the classic case of recommending movies, where the user has provided explicit ratings of which movies they like and don't like, each user's ratings are split into test and train sets, and the evaluation task becomes to predict the held out test data using the training data. This machine learning style of evaluation makes the objective to recommend the movies that a user has watched and rated highly, which is not the same task as helping the user find movies that they would enjoy if they watched them. This mismatch in objective between evaluation and task is a compromise to avoid the cost of asking a user to evaluate recommendations by watching each movie. As a resource available for download, we offer an extension to the MovieLens-32M dataset that provides for new evaluation objectives. Our primary objective is to predict the movies that a user would be interested in watching, i.e. predict their watchlist. To construct this extension, we recruited MovieLens users, collected their profiles, made recommendations with a diverse set of algorithms, pooled the recommendations, and had the users assess the pools. Notably, we found that the traditional machine learning style of evaluation ranks the Popular algorithm, which recommends movies based on total number of ratings in the system, in the middle of the twenty-two recommendation runs we used to build the pools. In contrast, when we rank the runs by users' interest in watching movies, we find that recommending popular movies as a recommendation algorithm becomes one of the worst performing runs. It appears that by asking users to assess their personal recommendations, we can alleviate the popularity bias issues created by using information retrieval effectiveness measures for the evaluation of recommender systems.
Recall, Robustness, and Lexicographic Evaluation.
Diaz, F.; Ekstrand, M. D.; and Mitra, B.
ACM Trans. Recomm. Syst. April 2025.
Just Accepted
doi: 10.1145/3728373
Although originally developed to evaluate sets of items, recall is often used to evaluate rankings of items, including those produced by recommender, retrieval, and other machine learning systems. The application of recall without a formal evaluative motivation has led to criticism of recall as a vague or inappropriate measure. In light of this debate, we reflect on the measurement of recall in rankings from a formal perspective. Our analysis is composed of three tenets: recall, robustness, and lexicographic evaluation. First, we formally define ‘recall-orientation’ as the sensitivity of a metric to a user interested in finding every relevant item. Second, we analyze recall-orientation from the perspective of robustness with respect to possible content consumers and providers, connecting recall to recent conversations about fair ranking. Finally, we extend this conceptual and theoretical treatment of recall by developing a practical preference-based evaluation method based on lexicographic comparison. Through extensive empirical analysis across multiple recommendation and retrieval tasks, we establish that our new evaluation method, lexirecall, has convergent validity (i.e., it is correlated with existing recall metrics) and exhibits substantially higher sensitivity in terms of discriminative power and stability in the presence of missing labels. Our conceptual, theoretical, and empirical analysis substantially deepens our understanding of recall and motivates its adoption through connections to robustness and fairness.
2024 (27)
Um estudo sobre bibliotecas para sistemas de recomendação em Python [A study of libraries for recommender systems in Python].
Danesi, L. D. C.
Ph.D. Thesis, Universidade Federal de Santa Maria, December 2024.
Paper: http://repositorio.ufsm.br/handle/1/33964
This paper presents a study on recommendation systems, with an emphasis on the analysis and implementation of algorithms using Python libraries for the Collaborative Filtering approach. Identifying the relevance of personalized recommendations in various applications, this research explores algorithms available for the development of such systems, using libraries as tools that facilitate their implementation. In particular, libraries implemented in the Python programming language are examined in the context of recommendation systems, such as Surprise and LensKit for Python (LKPY), presenting the functioning of their main algorithms, K-Nearest Neighbors (K-NN) and Slope One. Thus, the theoretical analysis of these tools is complemented by practical implementation and application in a real scenario demonstrating the performance and applicability of the libraries.
Recommendations with minimum exposure guarantees: a post-processing framework.
Lopes, R.; Alves, R.; Ledent, A.; Santos, R. L. T.; and Kloft, M.
Expert Systems with Applications, 236: 121164. February 2024.
doi: 10.1016/j.eswa.2023.121164
Relevance-based ranking is a popular ingredient in recommenders, but it frequently struggles to meet fairness criteria because social and cultural norms may favor some item groups over others. For instance, some items might receive lower ratings due to some sort of bias (e.g. gender bias). A fair ranking should balance the exposure of items from advantaged and disadvantaged groups. To this end, we propose a novel post-processing framework to produce fair, exposure-aware recommendations. Our approach is based on an integer linear programming model maximizing the expected utility while satisfying a minimum exposure constraint. The model has fewer variables than previous work and thus can be deployed to larger datasets and allows the organization to define a minimum level of exposure for groups of items. We conduct an extensive empirical evaluation indicating that our new framework can increase the exposure of items from disadvantaged groups at a small cost of recommendation accuracy.
A Test Collection for Offline Evaluation of Recommender Systems.
Chamani, H.
November 2024.
Publisher: University of Waterloo
Paper: https://hdl.handle.net/10012/21175
Recommendation systems have long been evaluated by collecting a large number of individuals' ratings for items, and then dividing these ratings into test and train sets to see how well recommendation algorithms can predict individuals' preferences. A complaint about this approach is that the evaluation measures can only use a small number of known preferences and have no information about the majority of recommended items. Prior research has shown that offline evaluation of recommendation systems using a test/train split methodology may not agree with actual user preferences when all recommended items are judged by the user. To address this issue, we apply traditional information retrieval test collection construction techniques for movie recommendations. An information retrieval test collection is composed of documents, search topics, and relevance judgments that tell us which documents are relevant for each topic. For our test collection, each search topic is an individual who is looking for movies to watch. In other words, while the search topic is always ``Please recommend me movies that I will be interested in watching,'' the context of the search topic changes to be the individual who is requesting the recommendations. When document collections are too large to be completely judged by assessors, the traditional approach is to use pooling. We followed this same approach in the construction of our test collection. For each individual, we used their existing profile of rated movies as input to a wide range of recommendation algorithms to produce recommendations for movies not found in their profile. We then pooled these recommendations separately for each person and asked them to rate the movies. In addition to rating, we also had each individual rate a random sample of movies selected from their ratings profile to measure their consistency in rating. The resulting new test collection consists of 51 individual ratings profiles totaling 123,104 ratings and 31,236 relevance judgments. In this thesis, we detail the creation of the test collection and provide an analysis of the individuals that comprise its search topics, and we analyze the collection's relevance judgments as well as other aspects.
e-Fold Cross-Validation for Recommender-System Evaluation.
Baumgart, M.; Wegmeth, L.; Vente, T.; and Beel, J.
In First International Workshop on Recommender Systems for Sustainability and Social Good (RecSoGood), October 2024.
Paper: https://isg.beel.org/pubs/2024-e-folds-recsys-baumgart.pdf
To combat the rising energy consumption of recommender systems we implement a novel alternative for k-fold cross validation. This alternative, named e-fold cross validation, aims to minimize the number of folds to achieve a reduction in power usage while keeping the reliability and robustness of the test results high. We tested our method on 5 recommender system algorithms across 6 datasets and compared it with 10-fold cross validation. On average e-fold cross validation only needed 41.5% of the energy that 10-fold cross validation would need, while its results only differed by 1.81%. We conclude that e-fold cross validation is a promising approach that has the potential to be an energy efficient but still reliable alternative to k-fold cross validation.
Advancing Misinformation Awareness in Recommender Systems for Social Media Information Integrity.
Pathak, R.
In Proceedings of the 33rd ACM International Conference on Information and Knowledge Management (CIKM '24), pages 5471–5474, New York, NY, USA, October 2024. Association for Computing Machinery.
doi: 10.1145/3627673.3680259
Recommender systems play an essential role in determining the content users encounter on social media platforms and in uncovering relevant news. However, they also present significant risks, such as reinforcing biases, over-personalizing content, fostering filter bubbles, and inadvertently promoting misinformation. The spread of false information is rampant across various online platforms, such as Twitter (now X), Meta, and TikTok, especially noticeable during events like the COVID-19 pandemic and the US Presidential elections. These instances underscore the critical necessity for transparency and regulatory oversight in the development of recommender systems. Given the challenge of balancing free speech with the risks of outright removal of fake news, this paper aims to address the spread of misinformation from algorithmic biases in recommender systems using a social science perspective.
Green Recommender Systems: Optimizing Dataset Size for Energy-Efficient Algorithm Performance.
Arabzadeh, A.; Vente, T.; and Beel, J.
October 2024.
Presented at International Workshop on Recommender Systems for Sustainability and Social Good (RecSoGood)
doi: 10.48550/arXiv.2410.09359
As recommender systems become increasingly prevalent, the environmental impact and energy efficiency of training large-scale models have come under scrutiny. This paper investigates the potential for energy-efficient algorithm performance by optimizing dataset sizes through downsampling techniques in the context of Green Recommender Systems. We conducted experiments on the MovieLens 100K, 1M, 10M, and Amazon Toys and Games datasets, analyzing the performance of various recommender algorithms under different portions of dataset size. Our results indicate that while more training data generally leads to higher algorithm performance, certain algorithms, such as FunkSVD and BiasedMF, particularly with unbalanced and sparse datasets like Amazon Toys and Games, maintain high-quality recommendations with up to a 50% reduction in training data, achieving nDCG@10 scores within approximately 13% of full dataset performance. These findings suggest that strategic dataset reduction can decrease computational and environmental costs without substantially compromising recommendation quality. This study advances sustainable and green recommender systems by providing insights for reducing energy consumption while maintaining effectiveness.
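For readers who want to approximate this kind of downsampling study with LensKit itself, the sketch below trains one algorithm on progressively smaller samples and reports nDCG@10. It assumes the classic LensKit 0.x batch API and a local MovieLens copy at data/ml-latest-small; the sample fractions, latent-factor count, and single hold-out split are illustrative choices, not the authors' exact protocol.

from lenskit.datasets import MovieLens
from lenskit.algorithms import Recommender, als
from lenskit import batch, topn, crossfold as xf

ratings = MovieLens('data/ml-latest-small').ratings   # assumed local dataset path

for frac in (0.25, 0.5, 1.0):                          # illustrative training fractions
    sample = ratings.sample(frac=frac, random_state=42)
    # hold out 20% of each user's ratings as test data
    train, test = next(xf.partition_users(sample, 1, xf.SampleFrac(0.2)))
    algo = Recommender.adapt(als.BiasedMF(50))         # BiasedMF, one of the algorithms studied
    algo.fit(train)
    recs = batch.recommend(algo, test['user'].unique(), 10)
    rla = topn.RecListAnalysis()
    rla.add_metric(topn.ndcg)
    scores = rla.compute(recs, test)
    print(f'{frac:.0%} of ratings: mean nDCG@10 = {scores["ndcg"].mean():.4f}')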
Aprimorando a instalação e a configuração de experimentos do RecSysExp [Improving the installation and configuration of RecSysExp experiments].
Silva, S. C. d.
Bachelor's thesis, Universidade Federal de Ouro Preto, Ouro Preto, BR, 2024.
Accepted: 2024-02-29T14:36:20Z
Paper
link
bibtex
abstract
@techreport{silva_aprimorando_2024, address = {Ouro Preto, BR}, type = {Bachelor {Thesis}}, title = {Aprimorando a instalação e a configuração de experimentos do {RecSysExp}.}, url = {http://www.monografias.ufop.br/handle/35400000/6571}, abstract = {The paper presents significant enhancements to the RecSysExp framework, used for conducting experiments in recommendation systems. These improvements were aimed at enhancing the usability, scalability, and readability of the system. The new functionalities cover three distinct areas: the development of a graphical user interface, the encapsulation of the framework using Docker, and the restructuring of a class for more cohesive integration with datasets, following established design patterns. The primary goal was to enhance the value provided by the framework, aligned with the vision of its creators, aiming at its use as an academic tool in classroom or research environments. The methodological approach adopted employed specific technologies for each addressed context. For the creation of the user interface, React and Next.js frontend frameworks were employed, while Dockerfile and docker-compose were used for the encapsulation of RecSysExp. Finally, the modification of the class responsible for datasets was carried out following the Template Method design pattern. The project successfully achieved all proposed objectives. The implementation of a container structure simplified the installation of the system, while improvements in the visualization of configurations made experiment creation more intuitive. Additionally, the ability to upload files expanded user options. Although the final version of RecSysExp functions similarly to its original iteration, the additions from this work resulted in an enhanced and more user-friendly version. However, it is important to note that configuration through the graphical interface has limitations, as it is only possible to configure algorithms and modules that can be instantiated via configuration files in the framework. Algorithms and modules implemented solely as libraries in other projects cannot be configured via the frontend.}, language = {pt\_BR}, urldate = {2024-10-11}, institution = {Universidade Federal de Ouro Preto}, author = {Silva, San Cunha da}, year = {2024}, note = {Accepted: 2024-02-29T14:36:20Z}, }
The paper presents significant enhancements to the RecSysExp framework, used for conducting experiments in recommendation systems. These improvements were aimed at enhancing the usability, scalability, and readability of the system. The new functionalities cover three distinct areas: the development of a graphical user interface, the encapsulation of the framework using Docker, and the restructuring of a class for more cohesive integration with datasets, following established design patterns. The primary goal was to enhance the value provided by the framework, aligned with the vision of its creators, aiming at its use as an academic tool in classroom or research environments. The methodological approach adopted employed specific technologies for each addressed context. For the creation of the user interface, React and Next.js frontend frameworks were employed, while Dockerfile and docker-compose were used for the encapsulation of RecSysExp. Finally, the modification of the class responsible for datasets was carried out following the Template Method design pattern. The project successfully achieved all proposed objectives. The implementation of a container structure simplified the installation of the system, while improvements in the visualization of configurations made experiment creation more intuitive. Additionally, the ability to upload files expanded user options. Although the final version of RecSysExp functions similarly to its original iteration, the additions from this work resulted in an enhanced and more user-friendly version. However, it is important to note that configuration through the graphical interface has limitations, as it is only possible to configure algorithms and modules that can be instantiated via configuration files in the framework. Algorithms and modules implemented solely as libraries in other projects cannot be configured via the frontend.
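The Template Method restructuring described here can be illustrated with a small, generic Python sketch; the class and method names below are hypothetical and are not taken from RecSysExp.

from abc import ABC, abstractmethod
import pandas as pd

class DatasetLoader(ABC):
    """Template Method: the fixed loading pipeline lives in the base class,
    while dataset-specific steps are deferred to subclasses."""

    def load(self) -> pd.DataFrame:
        raw = self.read_raw()            # varies per dataset
        df = self.to_interactions(raw)   # varies per dataset
        return self.validate(df)         # shared, invariant step

    @abstractmethod
    def read_raw(self): ...

    @abstractmethod
    def to_interactions(self, raw) -> pd.DataFrame: ...

    def validate(self, df: pd.DataFrame) -> pd.DataFrame:
        missing = {'user', 'item', 'rating'} - set(df.columns)
        if missing:
            raise ValueError(f'missing columns: {missing}')
        return df

class CsvRatingsLoader(DatasetLoader):
    def __init__(self, path: str):
        self.path = path

    def read_raw(self) -> pd.DataFrame:
        return pd.read_csv(self.path)

    def to_interactions(self, raw: pd.DataFrame) -> pd.DataFrame:
        return raw.rename(columns={'userId': 'user', 'movieId': 'item'})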
Active learning in recommender systems for predicting vulnerabilities in software.
Stijger, E.
Master's thesis, Utrecht University, Utrecht, NL, 2024.
Accepted: 2024-01-06T00:01:00Z
Paper
link
bibtex
abstract
@mastersthesis{stijger_active_2024, address = {Utrecht, NL}, title = {Active learning in recommender systems for predicting vulnerabilities in software}, copyright = {CC-BY-NC-ND}, url = {https://studenttheses.uu.nl/handle/20.500.12932/45783}, abstract = {Due to a rapid advancement of digital technology and growing reliance on the internet, cybersecurity has become a paramount issue for individuals, organizations, and governments. To address this challenge, penetration testing has emerged as a critical tool to ensure the security of computer systems and networks. The reconnaissance phase of penetration testing plays a crucial role in identifying vulnerabilities in a system by gathering relevant information. Although various tools are available to automate this process, most of them are limited to identifying reported vulnerabilities, and they do not provide suggestions or predictions about vulnerabilities. Therefore, this research aims to investigate the application of recommender systems to predict common vulnerabilities during the reconnaissance phase. The main objective of this research is to investigate how active learning affects the performance of a recommender system to identify vulnerabilities in software products. Item-Based k-NN Collaborative Filtering, a recommender system, can improve the identification of potential vulnerabilities and the effectiveness of penetration testing by analyzing information from similar data points. This research involves a comprehensive data preprocessing phase, which utilizes data from the National Vulnerability Database (NVD). Several recommender systems are built using this data, which enables the prediction of potential vulnerabilities during the reconnaissance phase of penetration testing. The performances of these recommender systems are evaluated, and the topperforming recommender system implements active learning to enhance its performance. The findings of this research demonstrate that Item-Based k-NN Collaborative Filtering outperforms other recommender systems in terms of overall performance when it comes to identifying software vulnerabilities. Furthermore, when compared to Item-Based k-NN Collaborative Filtering prior to active learning or with active learning and a random sampling technique, Item-Based k-NN Collaborative Filtering with active learning incorporating a 4- or 10-batch sampling technique with 20 or 40 items added yields a statistically significant improvement in the precision score. This indicates that a greater proportion of the predicted vulnerabilities are correct. Item-Based k-NN Collaborative Filtering with active learning and a single-batch sampling strategy only results in a statistically significant improvement in precision, compared to Item-Based k-NN Collaborative Filtering prior active learning or with active learning and a random sampling technique, when 20 items are added instead of 40. Furthermore, only Item-Based k-NN Collaborative Filtering with a 10-batch sampling strategy adding 20 items demonstrated a statistically significant improvement in nDCG scores compared to Item-Based k-NN Collaborative Filtering prior to active learning. This implies a more accurate ranking of the vulnerabilities. However, this could potentially be a type I error. From these findings, it can be concluded that introducing active learning in Item-Based k-NN Collaborative Filtering, using the approaches outlined, leads to significant improvement in precision score but not necessarily in nDCG score. 
Considering this conclusion, it is advised to use Item-Based k-NN Collaborative Filtering with active learning to predict vulnerabilities in software products and enhance the reconnaissance phase of penetration testing. This can be achieved by incorporating a single-batch sampling technique with 20 items added or a 4- or 10-batch sampling technique with 20 or 40 added. The insights gained from this research can help individuals, organizations, and governments strengthen their cybersecurity defences and protect against potential cyber threats.}, language = {EN}, urldate = {2024-10-11}, school = {Utrecht University}, author = {Stijger, Elise}, year = {2024}, note = {Accepted: 2024-01-06T00:01:00Z}, }
Due to the rapid advancement of digital technology and growing reliance on the internet, cybersecurity has become a paramount issue for individuals, organizations, and governments. To address this challenge, penetration testing has emerged as a critical tool to ensure the security of computer systems and networks. The reconnaissance phase of penetration testing plays a crucial role in identifying vulnerabilities in a system by gathering relevant information. Although various tools are available to automate this process, most of them are limited to identifying reported vulnerabilities, and they do not provide suggestions or predictions about vulnerabilities. Therefore, this research aims to investigate the application of recommender systems to predict common vulnerabilities during the reconnaissance phase. The main objective of this research is to investigate how active learning affects the performance of a recommender system to identify vulnerabilities in software products. Item-Based k-NN Collaborative Filtering, a recommender system, can improve the identification of potential vulnerabilities and the effectiveness of penetration testing by analyzing information from similar data points. This research involves a comprehensive data preprocessing phase, which utilizes data from the National Vulnerability Database (NVD). Several recommender systems are built using this data, which enables the prediction of potential vulnerabilities during the reconnaissance phase of penetration testing. The performances of these recommender systems are evaluated, and the top-performing recommender system implements active learning to enhance its performance. The findings of this research demonstrate that Item-Based k-NN Collaborative Filtering outperforms other recommender systems in terms of overall performance when it comes to identifying software vulnerabilities. Furthermore, when compared to Item-Based k-NN Collaborative Filtering prior to active learning or with active learning and a random sampling technique, Item-Based k-NN Collaborative Filtering with active learning incorporating a 4- or 10-batch sampling technique with 20 or 40 items added yields a statistically significant improvement in the precision score. This indicates that a greater proportion of the predicted vulnerabilities are correct. Item-Based k-NN Collaborative Filtering with active learning and a single-batch sampling strategy only results in a statistically significant improvement in precision, compared to Item-Based k-NN Collaborative Filtering prior to active learning or with active learning and a random sampling technique, when 20 items are added instead of 40. Furthermore, only Item-Based k-NN Collaborative Filtering with a 10-batch sampling strategy adding 20 items demonstrated a statistically significant improvement in nDCG scores compared to Item-Based k-NN Collaborative Filtering prior to active learning. This implies a more accurate ranking of the vulnerabilities. However, this could potentially be a type I error. From these findings, it can be concluded that introducing active learning in Item-Based k-NN Collaborative Filtering, using the approaches outlined, leads to significant improvement in precision score but not necessarily in nDCG score. Considering this conclusion, it is advised to use Item-Based k-NN Collaborative Filtering with active learning to predict vulnerabilities in software products and enhance the reconnaissance phase of penetration testing. This can be achieved by incorporating a single-batch sampling technique with 20 items added or a 4- or 10-batch sampling technique with 20 or 40 items added. The insights gained from this research can help individuals, organizations, and governments strengthen their cybersecurity defences and protect against potential cyber threats.
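As a rough illustration of the item-based k-NN collaborative filtering at the core of this thesis, the sketch below fits LensKit's classic ItemItem model in its implicit-feedback mode on a toy product/vulnerability table; the NVD-derived data and the active-learning loop are not reproduced here, and the column roles are illustrative assumptions.

import pandas as pd
from lenskit.algorithms import Recommender, item_knn

# toy "product affected by vulnerability" interactions; real data would come from the NVD
interactions = pd.DataFrame({
    'user': [1, 1, 1, 2, 2, 3, 3, 3],   # products play the role of users
    'item': ['CVE-A', 'CVE-B', 'CVE-C', 'CVE-A', 'CVE-C', 'CVE-B', 'CVE-C', 'CVE-D'],
})

algo = Recommender.adapt(item_knn.ItemItem(10, feedback='implicit'))
algo.fit(interactions)
print(algo.recommend(2, 3))   # top-3 predicted vulnerabilities for product 2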
The Potential of AutoML for Recommender Systems.
Vente, T.; and Beel, J.
February 2024.
arXiv:2402.04453 [cs]
Paper
doi
link
bibtex
abstract
@misc{vente_potential_2024, title = {The {Potential} of {AutoML} for {Recommender} {Systems}}, url = {http://arxiv.org/abs/2402.04453}, doi = {10.48550/arXiv.2402.04453}, abstract = {Automated Machine Learning (AutoML) has greatly advanced applications of Machine Learning (ML) including model compression, machine translation, and computer vision. Recommender Systems (RecSys) can be seen as an application of ML. Yet, AutoML has found little attention in the RecSys community; nor has RecSys found notable attention in the AutoML community. Only few and relatively simple Automated Recommender Systems (AutoRecSys) libraries exist that adopt AutoML techniques. However, these libraries are based on student projects and do not offer the features and thorough development of AutoML libraries. We set out to determine how AutoML libraries perform in the scenario of an inexperienced user who wants to implement a recommender system. We compared the predictive performance of 60 AutoML, AutoRecSys, ML, and RecSys algorithms from 15 libraries, including a mean predictor baseline, on 14 explicit feedback RecSys datasets. To simulate the perspective of an inexperienced user, the algorithms were evaluated with default hyperparameters. We found that AutoML and AutoRecSys libraries performed best. AutoML libraries performed best for six of the 14 datasets (43\%), but it was not always the same AutoML library performing best. The single-best library was the AutoRecSys library Auto-Surprise, which performed best on five datasets (36\%). On three datasets (21\%), AutoML libraries performed poorly, and RecSys libraries with default parameters performed best. Although, while obtaining 50\% of all placements in the top five per dataset, RecSys algorithms fall behind AutoML on average. ML algorithms generally performed the worst.}, urldate = {2024-10-11}, publisher = {arXiv}, author = {Vente, Tobias and Beel, Joeran}, month = feb, year = {2024}, note = {arXiv:2402.04453 [cs]}, }
Automated Machine Learning (AutoML) has greatly advanced applications of Machine Learning (ML) including model compression, machine translation, and computer vision. Recommender Systems (RecSys) can be seen as an application of ML. Yet, AutoML has found little attention in the RecSys community; nor has RecSys found notable attention in the AutoML community. Only a few relatively simple Automated Recommender Systems (AutoRecSys) libraries exist that adopt AutoML techniques. However, these libraries are based on student projects and do not offer the features and thorough development of AutoML libraries. We set out to determine how AutoML libraries perform in the scenario of an inexperienced user who wants to implement a recommender system. We compared the predictive performance of 60 AutoML, AutoRecSys, ML, and RecSys algorithms from 15 libraries, including a mean predictor baseline, on 14 explicit feedback RecSys datasets. To simulate the perspective of an inexperienced user, the algorithms were evaluated with default hyperparameters. We found that AutoML and AutoRecSys libraries performed best. AutoML libraries performed best for six of the 14 datasets (43%), but it was not always the same AutoML library performing best. The single-best library was the AutoRecSys library Auto-Surprise, which performed best on five datasets (36%). On three datasets (21%), AutoML libraries performed poorly, and RecSys libraries with default parameters performed best. However, while obtaining 50% of all placements in the top five per dataset, RecSys algorithms fall behind AutoML on average. ML algorithms generally performed the worst.
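The "inexperienced user, default hyperparameters" setting is easy to mimic in LensKit; the sketch below scores a few algorithms with their defaults on one explicit-feedback split and reports RMSE. It assumes the classic 0.x API and a local MovieLens copy; the algorithm list and dataset are illustrative and far smaller than the paper's benchmark.

from lenskit.datasets import MovieLens
from lenskit.algorithms import basic, als, item_knn
from lenskit import batch, crossfold as xf
from lenskit.metrics.predict import rmse

ratings = MovieLens('data/ml-latest-small').ratings
train, test = next(xf.partition_users(ratings, 1, xf.SampleFrac(0.2)))

algorithms = {
    'Bias': basic.Bias(),                 # mean/offset baseline, akin to a mean predictor
    'ItemItem': item_knn.ItemItem(20),    # neighborhood model, default-style settings
    'BiasedMF': als.BiasedMF(50),         # ALS matrix factorization
}

for name, algo in algorithms.items():
    algo.fit(train)
    preds = batch.predict(algo, test)
    print(name, rmse(preds['prediction'], preds['rating'], missing='ignore'))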
On the challenges of studying bias in recommender systems: a UserKNN case study.
Daniil, S.; Slokom, M.; Cuper, M.; Liem, C. C. S.; van Ossenbruggen, J.; and Hollink, L.
September 2024.
Presented at FAccTRec 2024
Paper
doi
link
bibtex
abstract
@misc{daniil_challenges_2024, title = {On the challenges of studying bias in recommender systems: a {UserKNN} case study}, shorttitle = {On the challenges of studying bias in {Recommender} {Systems}}, url = {http://arxiv.org/abs/2409.08046}, doi = {10.48550/arXiv.2409.08046}, abstract = {Statements on the propagation of bias by recommender systems are often hard to verify or falsify. Research on bias tends to draw from a small pool of publicly available datasets and is therefore bound by their specific properties. Additionally, implementation choices are often not explicitly described or motivated in research, while they may have an effect on bias propagation. In this paper, we explore the challenges of measuring and reporting popularity bias. We showcase the impact of data properties and algorithm configurations on popularity bias by combining synthetic data with well known recommender systems frameworks that implement UserKNN. First, we identify data characteristics that might impact popularity bias, based on the functionality of UserKNN. Accordingly, we generate various datasets that combine these characteristics. Second, we locate UserKNN configurations that vary across implementations in literature. We evaluate popularity bias for five synthetic datasets and five UserKNN configurations, and offer insights on their joint effect. We find that, depending on the data characteristics, various UserKNN configurations can lead to different conclusions regarding the propagation of popularity bias. These results motivate the need for explicitly addressing algorithmic configuration and data properties when reporting and interpreting bias in recommender systems.}, urldate = {2024-09-25}, publisher = {arXiv}, author = {Daniil, Savvina and Slokom, Manel and Cuper, Mirjam and Liem, Cynthia C. S. and van Ossenbruggen, Jacco and Hollink, Laura}, month = sep, year = {2024}, note = {Presented at FAccTRec 2024}, }
Statements on the propagation of bias by recommender systems are often hard to verify or falsify. Research on bias tends to draw from a small pool of publicly available datasets and is therefore bound by their specific properties. Additionally, implementation choices are often not explicitly described or motivated in research, while they may have an effect on bias propagation. In this paper, we explore the challenges of measuring and reporting popularity bias. We showcase the impact of data properties and algorithm configurations on popularity bias by combining synthetic data with well known recommender systems frameworks that implement UserKNN. First, we identify data characteristics that might impact popularity bias, based on the functionality of UserKNN. Accordingly, we generate various datasets that combine these characteristics. Second, we locate UserKNN configurations that vary across implementations in literature. We evaluate popularity bias for five synthetic datasets and five UserKNN configurations, and offer insights on their joint effect. We find that, depending on the data characteristics, various UserKNN configurations can lead to different conclusions regarding the propagation of popularity bias. These results motivate the need for explicitly addressing algorithmic configuration and data properties when reporting and interpreting bias in recommender systems.
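One simple way to quantify the popularity bias discussed here, given a table of recommendations and the training interactions, is the mean popularity percentile of recommended items. The pandas sketch below uses that metric as an illustration; it is not necessarily the measure or the UserKNN configuration the authors evaluate.

import pandas as pd

def mean_popularity_percentile(recs: pd.DataFrame, train: pd.DataFrame) -> float:
    """recs and train both have 'user' and 'item' columns. Returns the average
    popularity percentile (0-1) of recommended items; values near 1 mean the
    recommender mostly surfaces the most-interacted-with items."""
    pop_pct = train['item'].value_counts().rank(pct=True)
    return recs['item'].map(pop_pct).fillna(0.0).mean()

# toy check: a recommender that only surfaces the two most popular items
train = pd.DataFrame({'user': [1, 1, 2, 2, 3, 3, 3],
                      'item': ['a', 'b', 'a', 'c', 'a', 'b', 'd']})
recs = pd.DataFrame({'user': [1, 2, 3], 'item': ['a', 'a', 'b']})
print(mean_popularity_percentile(recs, train))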
Recommender systems algorithm selection for ranking prediction on implicit feedback datasets.
Wegmeth, L.; Vente, T.; and Beel, J.
In RecSys '24 Late-Breaking Results, September 2024.
arXiv:2409.05461 [cs]
Paper
doi
link
bibtex
abstract
@inproceedings{wegmeth_recommender_2024, title = {Recommender systems algorithm selection for ranking prediction on implicit feedback datasets}, url = {http://arxiv.org/abs/2409.05461}, doi = {10.1145/3640457.3691718}, abstract = {The recommender systems algorithm selection problem for ranking prediction on implicit feedback datasets is under-explored. Traditional approaches in recommender systems algorithm selection focus predominantly on rating prediction on explicit feedback datasets, leaving a research gap for ranking prediction on implicit feedback datasets. Algorithm selection is a critical challenge for nearly every practitioner in recommender systems. In this work, we take the first steps toward addressing this research gap. We evaluate the NDCG@10 of 24 recommender systems algorithms, each with two hyperparameter configurations, on 72 recommender systems datasets. We train four optimized machine-learning meta-models and one automated machine-learning meta-model with three different settings on the resulting meta-dataset. Our results show that the predictions of all tested meta-models exhibit a median Spearman correlation ranging from 0.857 to 0.918 with the ground truth. We show that the median Spearman correlation between meta-model predictions and the ground truth increases by an average of 0.124 when the meta-model is optimized to predict the ranking of algorithms instead of their performance. Furthermore, in terms of predicting the best algorithm for an unknown dataset, we demonstrate that the best optimized traditional meta-model, e.g., XGBoost, achieves a recall of 48.6\%, outperforming the best tested automated machine learning meta-model, e.g., AutoGluon, which achieves a recall of 47.2\%.}, urldate = {2024-09-25}, booktitle = {{RecSys} '24 {Late}-{Breaking} {Results}}, author = {Wegmeth, Lukas and Vente, Tobias and Beel, Joeran}, month = sep, year = {2024}, note = {arXiv:2409.05461 [cs]}, }
The recommender systems algorithm selection problem for ranking prediction on implicit feedback datasets is under-explored. Traditional approaches in recommender systems algorithm selection focus predominantly on rating prediction on explicit feedback datasets, leaving a research gap for ranking prediction on implicit feedback datasets. Algorithm selection is a critical challenge for nearly every practitioner in recommender systems. In this work, we take the first steps toward addressing this research gap. We evaluate the NDCG@10 of 24 recommender systems algorithms, each with two hyperparameter configurations, on 72 recommender systems datasets. We train four optimized machine-learning meta-models and one automated machine-learning meta-model with three different settings on the resulting meta-dataset. Our results show that the predictions of all tested meta-models exhibit a median Spearman correlation ranging from 0.857 to 0.918 with the ground truth. We show that the median Spearman correlation between meta-model predictions and the ground truth increases by an average of 0.124 when the meta-model is optimized to predict the ranking of algorithms instead of their performance. Furthermore, in terms of predicting the best algorithm for an unknown dataset, we demonstrate that the best optimized traditional meta-model, e.g., XGBoost, achieves a recall of 48.6%, outperforming the best tested automated machine learning meta-model, e.g., AutoGluon, which achieves a recall of 47.2%.
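The meta-learning step, mapping dataset meta-features to expected per-algorithm performance so the best algorithm can be predicted for a new dataset, can be sketched with scikit-learn. The meta-features, targets, and model choice below are illustrative stand-ins rather than the paper's tuned meta-models.

import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# toy meta-dataset: one row per dataset; columns = meta-features such as
# user count, item count, and density; targets = nDCG@10 of each algorithm
n_datasets, n_algorithms = 72, 24
X = rng.random((n_datasets, 3))
y = rng.random((n_datasets, n_algorithms))

# fit on all but the last dataset, then rank algorithms for the held-out one
meta_model = RandomForestRegressor(n_estimators=200, random_state=0)
meta_model.fit(X[:-1], y[:-1])
pred = meta_model.predict(X[-1:])[0]
print('predicted best algorithm index:', int(np.argmax(pred)))
print('predicted top-5 ranking:', np.argsort(-pred)[:5])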
Methodologies to evaluate recommender systems.
Michiels, L.
Ph.D. Thesis, University of Antwerp, Antwerp, 2024.
Paper
doi
link
bibtex
abstract
@phdthesis{michiels_methodologies_2024, address = {Antwerp}, title = {Methodologies to evaluate recommender systems}, url = {https://hdl.handle.net/10067/2080040151162165141}, abstract = {In the current digital landscape, recommender systems play a pivotal role in shaping users' online experiences by providing personalized recommendations for relevant products, news articles, media content, and more. Their pervasive use makes the thorough evaluation of these systems of paramount importance. This dissertation addresses two key challenges in the evaluation of recommender systems. Part II of the dissertation focuses on improving methodologies for offline evaluation. Offline evaluation is a prevalent method for assessing recommendation algorithms in both academia and industry. Despite its widespread use, offline evaluations often suffer from methodological flaws that undermine their validity and real-world impact. This dissertation makes three key contributions to improving the reliability, internal and ecological validity, replicability, reproducibility, and reusability of offline evaluations. First, it presents an extensive review of the current state of practice and knowledge in offline evaluation, proposing a comprehensive set of better practices to address the reliability, replicability, and validity of offline evaluations. Next, it introduces RecPack, an open-source experimentation toolkit designed to facilitate reliable, reproducible, and reusable offline evaluations. Finally, it presents RecPack Tests, a test suite designed to ensure the correctness of recommendation algorithm implementations, thereby enhancing the reliability of offline evaluations. Part III of the dissertation examines the measurement of filter bubbles and serendipity. Both concepts have garnered significant attention due to concerns about the potential negative impacts of recommender systems on users of online platforms. One concern is that personalized content, especially on news and media platforms, may lock users into prior beliefs, contributing to increased polarization in society. Another concern is that exposure only to content previously expressed interest in may lead to boredom and eliminate surprise, preventing users from experiencing serendipity. This research makes three contributions to the study of filter bubbles and serendipity. First, it proposes an operational definition of technological filter bubbles, clarifying the ambiguity surrounding the concept. Second, it introduces a regression model for measuring their presence and strength in news recommendations, providing practitioners with the tools to rigorously study filter bubbles and gather real-world evidence of their (non-)existence. Finally, it proposes a feature repository for serendipity in recommender systems, offering a framework for evaluating how system design can influence users' experiences of serendipity in online information environments. In summary, the findings and tools developed in this dissertation advance the theoretical understanding of recommender system evaluation while offering practical tools for industry practitioners and researchers.}, language = {en}, urldate = {2024-09-25}, school = {University of Antwerp}, author = {Michiels, Lien}, year = {2024}, doi = {10.63028/10067/2080040151162165141}, }
In the current digital landscape, recommender systems play a pivotal role in shaping users' online experiences by providing personalized recommendations for relevant products, news articles, media content, and more. Their pervasive use makes the thorough evaluation of these systems of paramount importance. This dissertation addresses two key challenges in the evaluation of recommender systems. Part II of the dissertation focuses on improving methodologies for offline evaluation. Offline evaluation is a prevalent method for assessing recommendation algorithms in both academia and industry. Despite its widespread use, offline evaluations often suffer from methodological flaws that undermine their validity and real-world impact. This dissertation makes three key contributions to improving the reliability, internal and ecological validity, replicability, reproducibility, and reusability of offline evaluations. First, it presents an extensive review of the current state of practice and knowledge in offline evaluation, proposing a comprehensive set of better practices to address the reliability, replicability, and validity of offline evaluations. Next, it introduces RecPack, an open-source experimentation toolkit designed to facilitate reliable, reproducible, and reusable offline evaluations. Finally, it presents RecPack Tests, a test suite designed to ensure the correctness of recommendation algorithm implementations, thereby enhancing the reliability of offline evaluations. Part III of the dissertation examines the measurement of filter bubbles and serendipity. Both concepts have garnered significant attention due to concerns about the potential negative impacts of recommender systems on users of online platforms. One concern is that personalized content, especially on news and media platforms, may lock users into prior beliefs, contributing to increased polarization in society. Another concern is that exposure only to content previously expressed interest in may lead to boredom and eliminate surprise, preventing users from experiencing serendipity. This research makes three contributions to the study of filter bubbles and serendipity. First, it proposes an operational definition of technological filter bubbles, clarifying the ambiguity surrounding the concept. Second, it introduces a regression model for measuring their presence and strength in news recommendations, providing practitioners with the tools to rigorously study filter bubbles and gather real-world evidence of their (non-)existence. Finally, it proposes a feature repository for serendipity in recommender systems, offering a framework for evaluating how system design can influence users' experiences of serendipity in online information environments. In summary, the findings and tools developed in this dissertation advance the theoretical understanding of recommender system evaluation while offering practical tools for industry practitioners and researchers.
It's not you, it's me: the impact of choice models and ranking strategies on gender imbalance in music recommendation.
Ferraro, A.; Ekstrand, M. D.; and Bauer, C.
In Proceedings of the 18th ACM Conference on Recommender Systems, August 2024. ACM
Paper
doi
link
bibtex
abstract
@inproceedings{ferraro_its_2024, title = {It's not you, it's me: the impact of choice models and ranking strategies on gender imbalance in music recommendation}, shorttitle = {It's not you, it's me}, url = {http://arxiv.org/abs/2409.03781}, doi = {10.1145/3640457.3688163}, abstract = {As recommender systems are prone to various biases, mitigation approaches are needed to ensure that recommendations are fair to various stakeholders. One particular concern in music recommendation is artist gender fairness. Recent work has shown that the gender imbalance in the sector translates to the output of music recommender systems, creating a feedback loop that can reinforce gender biases over time. In this work, we examine that feedback loop to study whether algorithmic strategies or user behavior are a greater contributor to ongoing improvement (or loss) in fairness as models are repeatedly re-trained on new user feedback data. We simulate user interaction and re-training to investigate the effects of ranking strategies and user choice models on gender fairness metrics. We find re-ranking strategies have a greater effect than user choice models on recommendation fairness over time.}, urldate = {2024-09-25}, booktitle = {Proceedings of the 18th {ACM} {Conference} on {Recommender} {Systems}}, publisher = {ACM}, author = {Ferraro, Andres and Ekstrand, Michael D. and Bauer, Christine}, month = aug, year = {2024}, }
As recommender systems are prone to various biases, mitigation approaches are needed to ensure that recommendations are fair to various stakeholders. One particular concern in music recommendation is artist gender fairness. Recent work has shown that the gender imbalance in the sector translates to the output of music recommender systems, creating a feedback loop that can reinforce gender biases over time. In this work, we examine that feedback loop to study whether algorithmic strategies or user behavior are a greater contributor to ongoing improvement (or loss) in fairness as models are repeatedly re-trained on new user feedback data. We simulate user interaction and re-training to investigate the effects of ranking strategies and user choice models on gender fairness metrics. We find re-ranking strategies have a greater effect than user choice models on recommendation fairness over time.
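A greedy group-quota re-ranker of the general kind studied here can be written in a few lines of pandas. The 'protected' label, target share, and selection rule below are illustrative assumptions, not the paper's specific ranking strategies or choice models.

import pandas as pd

def rerank_with_group_quota(recs: pd.DataFrame, target_share: float = 0.5, k: int = 10) -> pd.DataFrame:
    """recs: one user's ranked list with columns ['item', 'score', 'group'].
    Greedily rebuilds the top-k: take the next protected-group item whenever the
    protected share so far falls below target_share, otherwise the best remaining item."""
    ranked = recs.sort_values('score', ascending=False).reset_index(drop=True)
    prot = list(ranked.index[ranked['group'] == 'protected'])
    rest = list(ranked.index[ranked['group'] != 'protected'])
    chosen, n_prot = [], 0
    while len(chosen) < min(k, len(ranked)):
        take_prot = bool(prot) and (n_prot < target_share * (len(chosen) + 1) or not rest)
        idx = prot.pop(0) if take_prot else rest.pop(0)
        n_prot += int(take_prot)
        chosen.append(idx)
    return ranked.loc[chosen]

recs = pd.DataFrame({
    'item': ['a', 'b', 'c', 'd', 'e', 'f'],
    'score': [0.9, 0.8, 0.7, 0.6, 0.5, 0.4],
    'group': ['other', 'other', 'protected', 'other', 'protected', 'other'],
})
print(rerank_with_group_quota(recs, target_share=0.5, k=4))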
Towards optimizing ranking in grid-layout for provider-side fairness.
Raj, A.; and Ekstrand, M. D.
In Proceedings of the 46th European Conference on Information Retrieval, volume 14612, of LNCS, pages 90–105, March 2024. Springer
Paper
doi
link
bibtex
abstract
@inproceedings{raj_towards_2024, series = {{LNCS}}, title = {Towards optimizing ranking in grid-layout for provider-side fairness}, volume = {14612}, copyright = {All rights reserved}, url = {https://md.ekstrandom.net/pubs/ecir-fair-grids}, doi = {10.1007/978-3-031-56069-9_7}, abstract = {Information access systems, such as search engines and recommender systems, order and position results based on their estimated relevance. These results are then evaluated for a range of concerns, including provider-side fairness: whether exposure to users is fairly distributed among items and the people who created them. Several fairness-aware ranking and re-ranking techniques have been proposed to ensure fair exposure for providers, but this work focuses almost exclusively on linear layouts in which items are displayed in single ranked list. Many widely-used systems use other layouts, such as the grid views common in streaming platforms, image search, and other applications. Providing fair exposure to providers in such layouts is not well-studied. We seek to fill this gap by providing a grid-aware re-ranking algorithm to optimize layouts for provider-side fairness by adapting existing re-ranking techniques to grid-aware browsing models, and an analysis of the effect of grid-specific factors such as device size on the resulting fairness optimization.}, language = {en}, urldate = {2024-01-04}, booktitle = {Proceedings of the 46th {European} {Conference} on {Information} {Retrieval}}, publisher = {Springer}, author = {Raj, Amifa and Ekstrand, Michael D.}, month = mar, year = {2024}, pages = {90--105}, }
Information access systems, such as search engines and recommender systems, order and position results based on their estimated relevance. These results are then evaluated for a range of concerns, including provider-side fairness: whether exposure to users is fairly distributed among items and the people who created them. Several fairness-aware ranking and re-ranking techniques have been proposed to ensure fair exposure for providers, but this work focuses almost exclusively on linear layouts in which items are displayed in a single ranked list. Many widely-used systems use other layouts, such as the grid views common in streaming platforms, image search, and other applications. Providing fair exposure to providers in such layouts is not well-studied. We seek to fill this gap by providing a grid-aware re-ranking algorithm to optimize layouts for provider-side fairness by adapting existing re-ranking techniques to grid-aware browsing models, and an analysis of the effect of grid-specific factors such as device size on the resulting fairness optimization.
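The grid-aware browsing models this work adapts can be made concrete with a tiny numpy sketch: exposure decays geometrically down rows and, more slowly, across columns, so narrower devices concentrate exposure on fewer providers. The decay rates and functional form below are illustrative assumptions, not the paper's calibrated model.

import numpy as np

def grid_exposure(n_rows: int, n_cols: int, row_decay: float = 0.6, col_decay: float = 0.85) -> np.ndarray:
    """Probability that a user examines each grid slot, assuming independent
    geometric decay over rows and columns; the top-left slot has exposure 1.0."""
    rows = row_decay ** np.arange(n_rows)[:, None]
    cols = col_decay ** np.arange(n_cols)[None, :]
    return rows * cols

# the same 6 items shown as a 3x2 phone grid vs. a 1x6 TV row receive
# very different exposure patterns, which changes the fairness optimization
print(grid_exposure(3, 2).round(3))
print(grid_exposure(1, 6).round(3))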
Distributionally-informed recommender system evaluation.
Ekstrand, M. D.; Carterette, B.; and Diaz, F.
ACM Transactions on Recommender Systems, 2(1): 6:1–27. March 2024.
Paper
doi
link
bibtex
abstract
@article{ekstrand_distributionally-informed_2024, title = {Distributionally-informed recommender system evaluation}, volume = {2}, copyright = {All rights reserved}, url = {https://dl.acm.org/doi/10.1145/3613455}, doi = {10.1145/3613455}, abstract = {Current practice for evaluating recommender systems typically focuses on point estimates of user-oriented effectiveness metrics or business metrics, sometimes combined with additional metrics for considerations such as diversity and novelty. In this paper, we argue for the need for researchers and practitioners to attend more closely to various distributions that arise from a recommender system (or other information access system) and the sources of uncertainty that lead to these distributions. One immediate implication of our argument is that both researchers and practitioners must report and examine more thoroughly the distribution of utility between and within different stakeholder groups. However, distributions of various forms arise in many more aspects of the recommender systems experimental process, and distributional thinking has substantial ramifications for how we design, evaluate, and present recommender systems evaluation and research results. Leveraging and emphasizing distributions in the evaluation of recommender systems is a necessary step to ensure that the systems provide appropriate and equitably-distributed benefit to the people they affect.}, number = {1}, urldate = {2023-09-07}, journal = {ACM Transactions on Recommender Systems}, author = {Ekstrand, Michael D. and Carterette, Ben and Diaz, Fernando}, month = mar, year = {2024}, keywords = {distributions, evaluation, exposure, statistics}, pages = {6:1--27}, }
Current practice for evaluating recommender systems typically focuses on point estimates of user-oriented effectiveness metrics or business metrics, sometimes combined with additional metrics for considerations such as diversity and novelty. In this paper, we argue for the need for researchers and practitioners to attend more closely to various distributions that arise from a recommender system (or other information access system) and the sources of uncertainty that lead to these distributions. One immediate implication of our argument is that both researchers and practitioners must report and examine more thoroughly the distribution of utility between and within different stakeholder groups. However, distributions of various forms arise in many more aspects of the recommender systems experimental process, and distributional thinking has substantial ramifications for how we design, evaluate, and present recommender systems evaluation and research results. Leveraging and emphasizing distributions in the evaluation of recommender systems is a necessary step to ensure that the systems provide appropriate and equitably-distributed benefit to the people they affect.
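One immediate, practical reading of this argument is to report per-user metric distributions and interval estimates instead of a single mean. A minimal numpy sketch follows; the per-user scores are synthetic placeholders standing in for a real evaluation's output.

import numpy as np

rng = np.random.default_rng(7)
per_user_ndcg = rng.beta(2, 5, size=500)   # stand-in for per-user nDCG@10 scores

# bootstrap the mean and report an interval plus distribution quantiles,
# rather than only the point estimate
boot_means = np.array([
    rng.choice(per_user_ndcg, per_user_ndcg.size, replace=True).mean()
    for _ in range(2000)
])
lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(f'mean nDCG@10 = {per_user_ndcg.mean():.3f} (95% bootstrap CI {lo:.3f}-{hi:.3f})')
print('per-user quartiles:', np.percentile(per_user_ndcg, [25, 50, 75]).round(3))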
From Clicks to Carbon: The Environmental Toll of Recommender Systems.
Vente, T.; Wegmeth, L.; Said, A.; and Beel, J.
In Proceedings of the 18th ACM Conference on Recommender Systems, October 2024. ACM
arXiv:2408.08203 [cs]
Paper
doi
link
bibtex
abstract
@inproceedings{vente_clicks_2024, title = {From {Clicks} to {Carbon}: {The} {Environmental} {Toll} of {Recommender} {Systems}}, shorttitle = {From {Clicks} to {Carbon}}, url = {http://arxiv.org/abs/2408.08203}, doi = {10.1145/3640457.3688074}, abstract = {As global warming soars, evaluating the environmental impact of research is more critical now than ever before. However, we find that few to no recommender systems research papers document their impact on the environment. Consequently, in this paper, we conduct a comprehensive analysis of the environmental impact of recommender system research by reproducing a characteristic recommender systems experimental pipeline. We focus on estimating the carbon footprint of recommender systems research papers, highlighting the evolution of the environmental impact of recommender systems research experiments over time. We thoroughly evaluated all 79 full papers from the ACM RecSys conference in the years 2013 and 2023 to analyze representative experimental pipelines for papers utilizing traditional, so-called good old-fashioned AI algorithms and deep learning algorithms, respectively. We reproduced these representative experimental pipelines, measured electricity consumption using a hardware energy meter, and converted the measured energy consumption into CO2 equivalents to estimate the environmental impact. Our results show that a recommender systems research paper utilizing deep learning algorithms emits approximately 42 times more CO2 equivalents than a paper utilizing traditional algorithms. Furthermore, on average, such a paper produces 3,297 kilograms of CO2 equivalents, which is more than one person produces by flying from New York City to Melbourne or the amount one tree sequesters in 300 years.}, urldate = {2024-08-16}, booktitle = {Proceedings of the 18th {ACM} {Conference} on {Recommender} {Systems}}, publisher = {ACM}, author = {Vente, Tobias and Wegmeth, Lukas and Said, Alan and Beel, Joeran}, month = oct, year = {2024}, note = {arXiv:2408.08203 [cs]}, }
As global warming soars, evaluating the environmental impact of research is more critical now than ever before. However, we find that few to no recommender systems research papers document their impact on the environment. Consequently, in this paper, we conduct a comprehensive analysis of the environmental impact of recommender system research by reproducing a characteristic recommender systems experimental pipeline. We focus on estimating the carbon footprint of recommender systems research papers, highlighting the evolution of the environmental impact of recommender systems research experiments over time. We thoroughly evaluated all 79 full papers from the ACM RecSys conference in the years 2013 and 2023 to analyze representative experimental pipelines for papers utilizing traditional, so-called good old-fashioned AI algorithms and deep learning algorithms, respectively. We reproduced these representative experimental pipelines, measured electricity consumption using a hardware energy meter, and converted the measured energy consumption into CO2 equivalents to estimate the environmental impact. Our results show that a recommender systems research paper utilizing deep learning algorithms emits approximately 42 times more CO2 equivalents than a paper utilizing traditional algorithms. Furthermore, on average, such a paper produces 3,297 kilograms of CO2 equivalents, which is more than one person produces by flying from New York City to Melbourne or the amount one tree sequesters in 300 years.
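The conversion behind estimates like these is straightforward arithmetic: measured energy times a grid carbon-intensity factor. The numbers below are illustrative, not figures from the paper, and real intensities vary widely by grid and year.

# convert measured experiment energy into CO2 equivalents
measured_kwh = 120.0              # total electricity drawn by an experimental pipeline (illustrative)
grid_intensity_kg_per_kwh = 0.4   # illustrative grid carbon intensity, kg CO2e per kWh

co2e_kg = measured_kwh * grid_intensity_kg_per_kwh
print(f'{measured_kwh} kWh -> approximately {co2e_kg:.0f} kg CO2e')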
Large Language Models as Recommender Systems: A Study of Popularity Bias.
Lichtenberg, J. M.; Buchholz, A.; and Schwöbel, P.
June 2024.
arXiv:2406.01285 [cs]
Paper
link
bibtex
abstract
@misc{lichtenberg_large_2024, title = {Large {Language} {Models} as {Recommender} {Systems}: {A} {Study} of {Popularity} {Bias}}, shorttitle = {Large {Language} {Models} as {Recommender} {Systems}}, url = {http://arxiv.org/abs/2406.01285}, abstract = {The issue of popularity bias—where popular items are disproportionately recommended, overshadowing less popular but potentially relevant items—remains a significant challenge in recommender systems. Recent advancements have seen the integration of generalpurpose Large Language Models (LLMs) into the architecture of such systems. This integration raises concerns that it might exacerbate popularity bias, given that the LLM’s training data is likely dominated by popular items. However, it simultaneously presents a novel opportunity to address the bias via prompt tuning. Our study explores this dichotomy, examining whether LLMs contribute to or can alleviate popularity bias in recommender systems. We introduce a principled way to measure popularity bias by discussing existing metrics and proposing a novel metric that fulfills a series of desiderata. Based on our new metric, we compare a simple LLM-based recommender to traditional recommender systems on a movie recommendation task. We find that the LLM recommender exhibits less popularity bias, even without any explicit mitigation.}, language = {en}, urldate = {2024-08-15}, publisher = {arXiv}, author = {Lichtenberg, Jan Malte and Buchholz, Alexander and Schwöbel, Pola}, month = jun, year = {2024}, note = {arXiv:2406.01285 [cs]}, }
The issue of popularity bias—where popular items are disproportionately recommended, overshadowing less popular but potentially relevant items—remains a significant challenge in recommender systems. Recent advancements have seen the integration of general-purpose Large Language Models (LLMs) into the architecture of such systems. This integration raises concerns that it might exacerbate popularity bias, given that the LLM’s training data is likely dominated by popular items. However, it simultaneously presents a novel opportunity to address the bias via prompt tuning. Our study explores this dichotomy, examining whether LLMs contribute to or can alleviate popularity bias in recommender systems. We introduce a principled way to measure popularity bias by discussing existing metrics and proposing a novel metric that fulfills a series of desiderata. Based on our new metric, we compare a simple LLM-based recommender to traditional recommender systems on a movie recommendation task. We find that the LLM recommender exhibits less popularity bias, even without any explicit mitigation.
Towards Purpose-aware Privacy-Preserving Techniques for Predictive Applications.
Slokom, M.
Ph.D. Thesis, TU Delft, 2024.
Paper
link
bibtex
abstract
@phdthesis{slokom_towards_2024, type = {Dissertation}, title = {Towards {Purpose}-aware {Privacy}-{Preserving} {Techniques} for {Predictive} {Applications}}, url = {https://doi.org/10.4233/uuid:4db4a67e-3e4f-4c94-b3e0-1eb8cd1765cb}, abstract = {In the field of machine learning (ML), the goal is to leverage algorithmic models to generate predictions, transforming raw input data into valuable insights. However, the ML pipeline, consisting of input data, models, and output data, is susceptible to various vulnerabilities and attacks. These attacks include re-identification, attribute inference, membership inference, and model inversion attacks, all posing threats to individual privacy. This thesis specifically targets attribute inference attacks, wherein adversaries seek to infer sensitive information about target individuals. The literature on privacy-preserving techniques explores various perturbative approaches, including obfuscation, randomization, and differential privacy, to mitigate privacy attacks. While these methods have shown effectiveness, conventional perturbation based techniques often offer generic protection, lacking the nuance needed to preserve specific utility and accuracy. These conventional techniques are typically purpose unaware, meaning they modify data to protect privacy while maintaining general data usefulness. Recently, there has been a growing interest in purpose-aware techniques.The thesis introduces purpose-aware privacy preservation in the form of a conceptual framework. This approach involves tailoring data modifications to serve specific purposes and implementing changes orthogonal to relevant features. We aim to protect user privacy without compromising utility. We focus on two key applications within the ML spectrum: recommender systems and machine learning classifiers. The objective is to protect these applications against potential privacy attacks, addressing vulnerabilities in both input data and output data (i.e., predictions). We structure the thesis into two parts, each addressing distinct challenges in the ML pipeline. Part 1 tackles attacks on input data, exploring methods to protect sensitive information while maintaining the accuracy of ML models, specifically in recommender systems. Firstly, we explore an attack scenario in which an adversary can acquire the user-item matrix and aims to infer privacy-sensitive information. We assume that the adversary has a gender classifier that is pre-trained on unprotected data. The objective of the adversary is to infer the gender of target individuals. We propose personalized blurring (PerBlur), a personalization-based approach to gender obfuscation that aims to protect user privacy while maintaining the recommendation quality. We demonstrate that recommender system algorithms trained on obfuscated data perform comparably to those trained on the original user-item matrix. Furthermore, our approach not only prevents classifiers from predicting users' gender based on the obfuscated data but also achieves diversity through the recommendation of (non-stereotypical) diverse items. Secondly, we investigate an attack scenario in which an adversary has access to a user-item matrix and aims to exploit the user preference values that it contains. The objective of the adversary is to infer the preferences of individual users. We propose Shuffle-NNN, a data masking-based approach that aims to hide the preferences of users for individual items while maintaining the relative performance of recommendation algorithms. 
We demonstrate that Shuffle-NNN provides evidence of what information should be retained and what can be removed from the user-item matrix. Shuffle-NNN has great potential for data release, such as in data science challenges. Part 2 investigates attacks on output data, focusing on model inversion attacks aimed at predictions from machine learning classifiers and examining potential privacy risks associated with recommender system outputs. Firstly, we explore a scenario where an adversary attempts to infer individuals' sensitive information by querying a machine learning model and receiving output predictions. We investigate various attack models and identify a potential risk of sensitive information leakage when the target model is trained on original data. To mitigate this risk, we propose to replace the original training data with protected data using synthetic training data + privacy-preserving techniques. We show that the target model trained on protected data achieves performance comparable to the target model trained on original data. We demonstrate that by using privacy-preserving techniques on synthetic training data, we observe a small reduction in the success of certain model inversion attacks measured over a group of target individuals. Secondly, we explore an attack scenario in which the adversary seeks to infer users' sensitive information by intercepting recommendations provided by a recommender system to a set of users. Our goal is to gain insight into possible unintended consequences of using user attributes as side information in context-aware recommender systems. We study the extent to which personal attributes of a user can be inferred from a list of recommendations to that user. We find that both standard recommenders and context-aware recommenders leak personal user information into the recommendation lists.We demonstrate that using user attributes in context-aware recommendations yields a small gain in accuracy. However, the benefit of this gain is distributed unevenly among users and it sacrifices coverage and diversity. This leads us to question the actual value of side information and the need to ensure that there are no hidden `side effects'. The final chapter of the thesis summarizes our findings. It provides recommendations for future research directions which we think are promising for further exploring and promoting the use of purpose-aware privacy-preserving data for ML predictions.}, school = {TU Delft}, author = {Slokom, M.}, year = {2024}, }
In the field of machine learning (ML), the goal is to leverage algorithmic models to generate predictions, transforming raw input data into valuable insights. However, the ML pipeline, consisting of input data, models, and output data, is susceptible to various vulnerabilities and attacks. These attacks include re-identification, attribute inference, membership inference, and model inversion attacks, all posing threats to individual privacy. This thesis specifically targets attribute inference attacks, wherein adversaries seek to infer sensitive information about target individuals. The literature on privacy-preserving techniques explores various perturbative approaches, including obfuscation, randomization, and differential privacy, to mitigate privacy attacks. While these methods have shown effectiveness, conventional perturbation-based techniques often offer generic protection, lacking the nuance needed to preserve specific utility and accuracy. These conventional techniques are typically purpose-unaware, meaning they modify data to protect privacy while maintaining general data usefulness. Recently, there has been a growing interest in purpose-aware techniques. The thesis introduces purpose-aware privacy preservation in the form of a conceptual framework. This approach involves tailoring data modifications to serve specific purposes and implementing changes orthogonal to relevant features. We aim to protect user privacy without compromising utility. We focus on two key applications within the ML spectrum: recommender systems and machine learning classifiers. The objective is to protect these applications against potential privacy attacks, addressing vulnerabilities in both input data and output data (i.e., predictions). We structure the thesis into two parts, each addressing distinct challenges in the ML pipeline. Part 1 tackles attacks on input data, exploring methods to protect sensitive information while maintaining the accuracy of ML models, specifically in recommender systems. Firstly, we explore an attack scenario in which an adversary can acquire the user-item matrix and aims to infer privacy-sensitive information. We assume that the adversary has a gender classifier that is pre-trained on unprotected data. The objective of the adversary is to infer the gender of target individuals. We propose personalized blurring (PerBlur), a personalization-based approach to gender obfuscation that aims to protect user privacy while maintaining the recommendation quality. We demonstrate that recommender system algorithms trained on obfuscated data perform comparably to those trained on the original user-item matrix. Furthermore, our approach not only prevents classifiers from predicting users' gender based on the obfuscated data but also achieves diversity through the recommendation of (non-stereotypical) diverse items. Secondly, we investigate an attack scenario in which an adversary has access to a user-item matrix and aims to exploit the user preference values that it contains. The objective of the adversary is to infer the preferences of individual users. We propose Shuffle-NNN, a data masking-based approach that aims to hide the preferences of users for individual items while maintaining the relative performance of recommendation algorithms. We demonstrate that Shuffle-NNN provides evidence of what information should be retained and what can be removed from the user-item matrix. Shuffle-NNN has great potential for data release, such as in data science challenges.
Part 2 investigates attacks on output data, focusing on model inversion attacks aimed at predictions from machine learning classifiers and examining potential privacy risks associated with recommender system outputs. Firstly, we explore a scenario where an adversary attempts to infer individuals' sensitive information by querying a machine learning model and receiving output predictions. We investigate various attack models and identify a potential risk of sensitive information leakage when the target model is trained on original data. To mitigate this risk, we propose to replace the original training data with protected data using synthetic training data + privacy-preserving techniques. We show that the target model trained on protected data achieves performance comparable to the target model trained on original data. We demonstrate that by using privacy-preserving techniques on synthetic training data, we observe a small reduction in the success of certain model inversion attacks measured over a group of target individuals. Secondly, we explore an attack scenario in which the adversary seeks to infer users' sensitive information by intercepting recommendations provided by a recommender system to a set of users. Our goal is to gain insight into possible unintended consequences of using user attributes as side information in context-aware recommender systems. We study the extent to which personal attributes of a user can be inferred from a list of recommendations to that user. We find that both standard recommenders and context-aware recommenders leak personal user information into the recommendation lists. We demonstrate that using user attributes in context-aware recommendations yields a small gain in accuracy. However, the benefit of this gain is distributed unevenly among users and it sacrifices coverage and diversity. This leads us to question the actual value of side information and the need to ensure that there are no hidden 'side effects'. The final chapter of the thesis summarizes our findings. It provides recommendations for future research directions which we think are promising for further exploring and promoting the use of purpose-aware privacy-preserving data for ML predictions.
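The general data-masking idea behind approaches like Shuffle-NNN, hiding which item received which preference value while keeping each user's overall rating profile, can be illustrated with a per-user permutation in pandas. This simplified shuffle is a stand-in for exposition, not the thesis's neighborhood-based procedure.

import numpy as np
import pandas as pd

def shuffle_user_ratings(ratings: pd.DataFrame, seed: int = 0) -> pd.DataFrame:
    """Permute each user's rating values across that user's rated items, so the
    exact (item, value) preferences are hidden but per-user rating distributions
    are preserved."""
    rng = np.random.default_rng(seed)
    out = ratings.copy()
    out['rating'] = out.groupby('user')['rating'].transform(
        lambda s: rng.permutation(s.to_numpy())
    )
    return out

ratings = pd.DataFrame({'user': [1, 1, 1, 2, 2],
                        'item': ['a', 'b', 'c', 'd', 'e'],
                        'rating': [5.0, 3.0, 1.0, 4.0, 2.0]})
print(shuffle_user_ratings(ratings))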
The Impact of Cluster Centroid and Text Review Embeddings on Recommendation Methods.
Dolog, P.; Sadikaj, Y.; Velaj, Y.; Stephan, A.; Roth, B.; and Plant, C.
In Companion Proceedings of the ACM on Web Conference 2024, of WWW '24, pages 589–592, New York, NY, USA, May 2024. Association for Computing Machinery
Paper
doi
link
bibtex
abstract
@inproceedings{dolog_impact_2024, address = {New York, NY, USA}, series = {{WWW} '24}, title = {The {Impact} of {Cluster} {Centroid} and {Text} {Review} {Embeddings} on {Recommendation} {Methods}}, isbn = {9798400701726}, url = {https://dl.acm.org/doi/10.1145/3589335.3651570}, doi = {10.1145/3589335.3651570}, abstract = {Recommendation systems often neglect global patterns that can be provided by clusters of similar items or even additional information such as text. Therefore, we study the impact of integrating clustering embeddings, review embeddings, and their combinations with embeddings obtained by a recommender system. Our work assesses the performance of this approach across various state-of-the-art recommender system algorithms. Our study highlights the improvement of recommendation performance through clustering, particularly evident when combined with review embeddings, and the enhanced performance of neural methods when incorporating review embeddings.}, urldate = {2024-08-15}, booktitle = {Companion {Proceedings} of the {ACM} on {Web} {Conference} 2024}, publisher = {Association for Computing Machinery}, author = {Dolog, Peter and Sadikaj, Ylli and Velaj, Yllka and Stephan, Andreas and Roth, Benjamin and Plant, Claudia}, month = may, year = {2024}, pages = {589--592}, }
Recommendation systems often neglect global patterns that can be provided by clusters of similar items or even additional information such as text. Therefore, we study the impact of integrating clustering embeddings, review embeddings, and their combinations with embeddings obtained by a recommender system. Our work assesses the performance of this approach across various state-of-the-art recommender system algorithms. Our study highlights the improvement of recommendation performance through clustering, particularly evident when combined with review embeddings, and the enhanced performance of neural methods when incorporating review embeddings.
Rethinking Recommender Systems: Cluster-based Algorithm Selection.
Lizenberger, A.; Pfeifer, F.; and Polewka, B.
May 2024.
arXiv:2405.18011 [cs]
Paper
doi
link
bibtex
abstract
@misc{lizenberger_rethinking_2024, title = {Rethinking {Recommender} {Systems}: {Cluster}-based {Algorithm} {Selection}}, shorttitle = {Rethinking {Recommender} {Systems}}, url = {http://arxiv.org/abs/2405.18011}, doi = {10.48550/arXiv.2405.18011}, abstract = {Cluster-based algorithm selection deals with selecting recommendation algorithms on clusters of users to obtain performance gains. No studies have been attempted for many combinations of clustering approaches and recommendation algorithms. We want to show that clustering users prior to algorithm selection increases the performance of recommendation algorithms. Our study covers eight datasets, four clustering approaches, and eight recommendation algorithms. We select the best performing recommendation algorithm for each cluster. Our work shows that cluster-based algorithm selection is an effective technique for optimizing recommendation algorithm performance. For five out of eight datasets, we report an increase in nDCG@10 between 19.28\% (0.032) and 360.38\% (0.191) compared to algorithm selection without prior clustering.}, urldate = {2024-08-15}, publisher = {arXiv}, author = {Lizenberger, Andreas and Pfeifer, Ferdinand and Polewka, Bastian}, month = may, year = {2024}, note = {arXiv:2405.18011 [cs]}, }
Cluster-based algorithm selection deals with selecting recommendation algorithms on clusters of users to obtain performance gains. No studies have been attempted for many combinations of clustering approaches and recommendation algorithms. We want to show that clustering users prior to algorithm selection increases the performance of recommendation algorithms. Our study covers eight datasets, four clustering approaches, and eight recommendation algorithms. We select the best performing recommendation algorithm for each cluster. Our work shows that cluster-based algorithm selection is an effective technique for optimizing recommendation algorithm performance. For five out of eight datasets, we report an increase in nDCG@10 between 19.28% (0.032) and 360.38% (0.191) compared to algorithm selection without prior clustering.
Anonymity-Aware Framework for Designing Recommender Systems.
Honda, M.; and Nishi, H.
IEEJ Transactions on Electrical and Electronic Engineering, 19(9): 1455–1464. 2024.
Paper
doi
link
bibtex
abstract
@article{honda_anonymity-aware_2024, title = {Anonymity-{Aware} {Framework} for {Designing} {Recommender} {Systems}}, volume = {19}, copyright = {© 2024 Institute of Electrical Engineers of Japan and Wiley Periodicals LLC.}, issn = {1931-4981}, url = {https://onlinelibrary.wiley.com/doi/abs/10.1002/tee.24093}, doi = {10.1002/tee.24093}, abstract = {Due to increasing secondary use of data, recommender systems using anonymized data are in demand. However, implementing a recommender system requires complicated data processing and programming, and the relationship between anonymization level and recommendation quality has not been investigated. Therefore, this study proposes a framework that facilitates the development of recommender systems. Additionally, a method is proposed for quantitatively evaluating recommendation quality when the anonymization level is varied. The proposed method promotes data utilization by recommender systems and determination of compensation for providing data based on anonymization level. © 2024 Institute of Electrical Engineers of Japan and Wiley Periodicals LLC.}, language = {en}, number = {9}, urldate = {2024-08-15}, journal = {IEEJ Transactions on Electrical and Electronic Engineering}, author = {Honda, Moena and Nishi, Hiroaki}, year = {2024}, note = {\_eprint: https://onlinelibrary.wiley.com/doi/pdf/10.1002/tee.24093}, pages = {1455--1464}, }
Due to increasing secondary use of data, recommender systems using anonymized data are in demand. However, implementing a recommender system requires complicated data processing and programming, and the relationship between anonymization level and recommendation quality has not been investigated. Therefore, this study proposes a framework that facilitates the development of recommender systems. Additionally, a method is proposed for quantitatively evaluating recommendation quality when the anonymization level is varied. The proposed method promotes data utilization by recommender systems and determination of compensation for providing data based on anonymization level. © 2024 Institute of Electrical Engineers of Japan and Wiley Periodicals LLC.
Analyzing the Interplay between Diversity of News Recommendations and Misinformation Spread in Social Media.
Pathak, R.; and Spezzano, F.
In Adjunct Proceedings of the 32nd ACM Conference on User Modeling, Adaptation and Personalization, of UMAP Adjunct '24, pages 80–85, New York, NY, USA, June 2024. Association for Computing Machinery
Paper
doi
link
bibtex
abstract
@inproceedings{pathak_analyzing_2024, address = {New York, NY, USA}, series = {{UMAP} {Adjunct} '24}, title = {Analyzing the {Interplay} between {Diversity} of {News} {Recommendations} and {Misinformation} {Spread} in {Social} {Media}}, isbn = {9798400704666}, url = {https://dl.acm.org/doi/10.1145/3631700.3664870}, doi = {10.1145/3631700.3664870}, abstract = {Recommender systems play a crucial role in social media platforms, especially in the context of news, by assisting users in discovering relevant news. However, these systems can inadvertently contribute to increased personalization, and the formation of filter bubbles and echo chambers, thereby aiding in the propagation of fake news or misinformation. This study specifically focuses on examining the tradeoffs between the diversity of news recommendations and the dissemination of misinformation on social media. We evaluated classical recommender algorithms on two Twitter (now X) datasets to assess the diversity of top-10 recommendation lists and simulated the propagation of recommended misinformation within the user network to analyze the impact of diversity on misinformation spread. The research findings indicate that an increase in news recommendation diversity indeed contributes to mitigating the propagation of misinformation. Additionally, collaborative and content-based recommender systems provide more diversity in comparison to popularity and network-based systems, resulting in less misinformation propagation. Our study underscores the crucial role of diversity recommendations in mitigating misinformation propagation, offering valuable insights for designing misinformation-aware recommender systems and diversity-based misinformation intervention.}, urldate = {2024-08-15}, booktitle = {Adjunct {Proceedings} of the 32nd {ACM} {Conference} on {User} {Modeling}, {Adaptation} and {Personalization}}, publisher = {Association for Computing Machinery}, author = {Pathak, Royal and Spezzano, Francesca}, month = jun, year = {2024}, pages = {80--85}, }
Recommender systems play a crucial role in social media platforms, especially in the context of news, by assisting users in discovering relevant news. However, these systems can inadvertently contribute to increased personalization, and the formation of filter bubbles and echo chambers, thereby aiding in the propagation of fake news or misinformation. This study specifically focuses on examining the tradeoffs between the diversity of news recommendations and the dissemination of misinformation on social media. We evaluated classical recommender algorithms on two Twitter (now X) datasets to assess the diversity of top-10 recommendation lists and simulated the propagation of recommended misinformation within the user network to analyze the impact of diversity on misinformation spread. The research findings indicate that an increase in news recommendation diversity indeed contributes to mitigating the propagation of misinformation. Additionally, collaborative and content-based recommender systems provide more diversity in comparison to popularity and network-based systems, resulting in less misinformation propagation. Our study underscores the crucial role of diversity recommendations in mitigating misinformation propagation, offering valuable insights for designing misinformation-aware recommender systems and diversity-based misinformation intervention.
Missing Data, Speculative Reading.
Koeser, R. S.; and LeBlanc, Z.
Journal of Cultural Analytics, 9(2). May 2024.
Paper
doi
link
bibtex
abstract
@article{koeser_missing_2024, title = {Missing {Data}, {Speculative} {Reading}}, volume = {9}, url = {https://culturalanalytics.org/article/116926-missing-data-speculative-reading}, doi = {10.22148/001c.116926}, abstract = {In this article we use an approach we term “speculative reading” to explore gaps in Sylvia Beach’s lending library records and the *Shakespeare and Company Project* datasets. We recast the problem of missing data as an opportunity and use a combination of time series forecasting, evolutionary models, and recommendation systems to estimate the extent of missing information and speculatively fill in some gaps. We conclude that the datasets include ninety-three percent of membership activity, ninety-six percent of members, and sixty-four percent to seventy-six percent of the books despite only including twenty-six percent of the borrowing activity. We then treat Ernest Hemingway as a test case for speculative reading: based on Hemingway’s known borrowing and all documented borrowing activity, we generate a list of books he might have borrowed during the years his borrowing is not documented; we then verify and interpret our list against the substantial scholarly record of the books he read and owned.}, language = {en}, number = {2}, urldate = {2024-08-15}, journal = {Journal of Cultural Analytics}, author = {Koeser, Rebecca Sutton and LeBlanc, Zoe}, month = may, year = {2024}, }
In this article we use an approach we term “speculative reading” to explore gaps in Sylvia Beach’s lending library records and the *Shakespeare and Company Project* datasets. We recast the problem of missing data as an opportunity and use a combination of time series forecasting, evolutionary models, and recommendation systems to estimate the extent of missing information and speculatively fill in some gaps. We conclude that the datasets include ninety-three percent of membership activity, ninety-six percent of members, and sixty-four percent to seventy-six percent of the books despite only including twenty-six percent of the borrowing activity. We then treat Ernest Hemingway as a test case for speculative reading: based on Hemingway’s known borrowing and all documented borrowing activity, we generate a list of books he might have borrowed during the years his borrowing is not documented; we then verify and interpret our list against the substantial scholarly record of the books he read and owned.
Evaluating the performance-deviation of itemKNN in RecBole and LensKit.
Schmidt, M.; Nitschke, J.; and Prinz, T.
July 2024.
arXiv:2407.13531 [cs]
Paper
link
bibtex
abstract
@misc{schmidt_evaluating_2024, title = {Evaluating the performance-deviation of {itemKNN} in {RecBole} and {LensKit}}, url = {http://arxiv.org/abs/2407.13531}, abstract = {This study evaluates the performance variations of item-based kNearest Neighbors (ItemKNN) algorithms implemented in the recommender system libraries, RecBole and LensKit. By using four datasets (Anime, Modcloth, ML-100K, and ML-1M), we explore the efficiency, accuracy, and scalability of each library’s implementation of ItemKNN. The study involves replicating and reproducing experiments to ensure the reliability of results. We are using key metrics such as normalized discounted cumulative gain (nDCG), precision, and recall to evaluate performance with our main focus on nDCG. Our initial findings indicate that RecBole is more performant than LensKit on two out of three metrics. It achieved a 18\% higher nDCG, a 14\% higher Precision and a 35\% lower Recall. To ensure a fair comparison, we adjusted LensKit’s nDCG calculation implementation to match RecBole’s approach. After aligning the nDCG calculations implementation, the performance of the two libraries became more comparable. Using implicit feedback, LensKit achieved an nDCG value of 0.2540, whereas RecBole attained a value of 0.2674. Further analysis revealed that the deviations were caused by differences in the implementation of the similarity matrix calculation. Our findings show that RecBole’s implementation outperforms the LensKit algorithm on three out of our four datasets. Following the implementation of a similarity matrix calculation, where only the top K similar items for each item are retained (a method already incorporated in RecBole’s ItemKNN), we observed nearly identical nDCG values across all four of our datasets. For example, Lenskit achieved an nDCG value of 0.2586 for the ML-1M dataset with a random seed set to 42. Similarly, RecBole attained the same nDCG value of 0.2586 under identical conditions. Using the original implementation of LensKit’s ItemKNN, a higher nDCG value was obtained only on the ModCloth data set.}, language = {en}, urldate = {2024-08-15}, publisher = {arXiv}, author = {Schmidt, Michael and Nitschke, Jannik and Prinz, Tim}, month = jul, year = {2024}, note = {arXiv:2407.13531 [cs]}, }
This study evaluates the performance variations of item-based k-Nearest Neighbors (ItemKNN) algorithms implemented in the recommender system libraries RecBole and LensKit. By using four datasets (Anime, Modcloth, ML-100K, and ML-1M), we explore the efficiency, accuracy, and scalability of each library’s implementation of ItemKNN. The study involves replicating and reproducing experiments to ensure the reliability of results. We use key metrics such as normalized discounted cumulative gain (nDCG), precision, and recall to evaluate performance, with our main focus on nDCG. Our initial findings indicate that RecBole is more performant than LensKit on two out of three metrics. It achieved an 18% higher nDCG, a 14% higher precision, and a 35% lower recall. To ensure a fair comparison, we adjusted LensKit’s nDCG calculation implementation to match RecBole’s approach. After aligning the nDCG calculation implementations, the performance of the two libraries became more comparable. Using implicit feedback, LensKit achieved an nDCG value of 0.2540, whereas RecBole attained a value of 0.2674. Further analysis revealed that the deviations were caused by differences in the implementation of the similarity matrix calculation. Our findings show that RecBole’s implementation outperforms the LensKit algorithm on three out of our four datasets. Following the implementation of a similarity matrix calculation, where only the top K similar items for each item are retained (a method already incorporated in RecBole’s ItemKNN), we observed nearly identical nDCG values across all four of our datasets. For example, LensKit achieved an nDCG value of 0.2586 for the ML-1M dataset with a random seed set to 42. Similarly, RecBole attained the same nDCG value of 0.2586 under identical conditions. Using the original implementation of LensKit’s ItemKNN, a higher nDCG value was obtained only on the ModCloth data set.
Multiple testing for IR and recommendation system experiments.
Ihemelandu, N.; and Ekstrand, M. D.
In Proceedings of the 46th European Conference on Information Retrieval, volume 14610, of LNCS, pages 449–457, March 2024. Springer
Paper
doi
link
bibtex
abstract
@inproceedings{ihemelandu_multiple_2024, series = {{LNCS}}, title = {Multiple testing for {IR} and recommendation system experiments}, volume = {14610}, url = {https://md.ekstrandom.net/pubs/ecir-mcp}, doi = {10.1007/978-3-031-56063-7_37}, abstract = {While there has been significant research on statistical techniques for comparing two information retrieval (IR) systems, many IR experiments test more than two systems. This can lead to inflated false discoveries due to the multiple-comparison problem (MCP). A few IR studies have investigated multiple comparison procedures; these studies mostly use TREC data and control the familywise error rate. In this study, we extend their investigation to include recommendation system evaluation data as well as multiple comparison procedures that controls for False Discovery Rate (FDR).}, language = {en}, urldate = {2024-01-04}, booktitle = {Proceedings of the 46th {European} {Conference} on {Information} {Retrieval}}, publisher = {Springer}, author = {Ihemelandu, Ngozi and Ekstrand, Michael D.}, month = mar, year = {2024}, pages = {449--457}, }
While there has been significant research on statistical techniques for comparing two information retrieval (IR) systems, many IR experiments test more than two systems. This can lead to inflated false discoveries due to the multiple-comparison problem (MCP). A few IR studies have investigated multiple comparison procedures; these studies mostly use TREC data and control the familywise error rate. In this study, we extend their investigation to include recommendation system evaluation data as well as multiple comparison procedures that control for the False Discovery Rate (FDR).
Revealing the Hidden Impact of Top-N Metrics on Optimization in Recommender Systems.
Wegmeth, L.; Vente, T.; and Purucker, L.
In Goharian, N.; Tonellotto, N.; He, Y.; Lipani, A.; McDonald, G.; Macdonald, C.; and Ounis, I., editor(s), Advances in Information Retrieval, pages 140–156, 2024. Springer Nature Switzerland
doi
link
bibtex
abstract
@inproceedings{wegmeth_revealing_2024, title = {Revealing the {Hidden} {Impact} of {Top}-{N} {Metrics} on {Optimization} in {Recommender} {Systems}}, isbn = {978-3-031-56027-9}, doi = {10.1007/978-3-031-56027-9_9}, abstract = {The hyperparameters of recommender systems for top-n predictions are typically optimized to enhance the predictive performance of algorithms. Thereby, the optimization algorithm, e.g., grid search or random search, searches for the best hyperparameter configuration according to an optimization-target metric, like nDCG or Precision. In contrast, the optimized algorithm, e.g., Alternating Least Squares Matrix Factorization or Bayesian Personalized Ranking, internally optimizes a different loss function during training, like squared error or cross-entropy. To tackle this discrepancy, recent work focused on generating loss functions better suited for recommender systems. Yet, when evaluating an algorithm using a top-n metric during optimization, another discrepancy between the optimization-target metric and the training loss has so far been ignored. During optimization, the top-n items are selected for computing a top-n metric; ignoring that the top-n items are selected from the recommendations of a model trained with an entirely different loss function. Item recommendations suitable for optimization-target metrics could be outside the top-n recommended items; hiddenly impacting the optimization performance. Therefore, we were motivated to analyze whether the top-n items are optimal for optimization-target top-n metrics. In pursuit of an answer, we exhaustively evaluate the predictive performance of 250 selection strategies besides selecting the top-n. We extensively evaluate each selection strategy over twelve implicit feedback and eight explicit feedback data sets with eleven recommender systems algorithms. Our results show that there exist selection strategies other than top-n that increase predictive performance for various algorithms and recommendation domains. However, the performance of the top \$\${\textbackslash}sim 43{\textbackslash}\%\$\$∼43\%of selection strategies is not significantly different. We discuss the impact of our findings on optimization and re-ranking in recommender systems and feasible solutions. The implementation of our study is publicly available.}, language = {en}, booktitle = {Advances in {Information} {Retrieval}}, publisher = {Springer Nature Switzerland}, author = {Wegmeth, Lukas and Vente, Tobias and Purucker, Lennart}, editor = {Goharian, Nazli and Tonellotto, Nicola and He, Yulan and Lipani, Aldo and McDonald, Graham and Macdonald, Craig and Ounis, Iadh}, year = {2024}, pages = {140--156}, }
The hyperparameters of recommender systems for top-n predictions are typically optimized to enhance the predictive performance of algorithms. Thereby, the optimization algorithm, e.g., grid search or random search, searches for the best hyperparameter configuration according to an optimization-target metric, like nDCG or Precision. In contrast, the optimized algorithm, e.g., Alternating Least Squares Matrix Factorization or Bayesian Personalized Ranking, internally optimizes a different loss function during training, like squared error or cross-entropy. To tackle this discrepancy, recent work focused on generating loss functions better suited for recommender systems. Yet, when evaluating an algorithm using a top-n metric during optimization, another discrepancy between the optimization-target metric and the training loss has so far been ignored. During optimization, the top-n items are selected for computing a top-n metric; ignoring that the top-n items are selected from the recommendations of a model trained with an entirely different loss function. Item recommendations suitable for optimization-target metrics could be outside the top-n recommended items; hiddenly impacting the optimization performance. Therefore, we were motivated to analyze whether the top-n items are optimal for optimization-target top-n metrics. In pursuit of an answer, we exhaustively evaluate the predictive performance of 250 selection strategies besides selecting the top-n. We extensively evaluate each selection strategy over twelve implicit feedback and eight explicit feedback data sets with eleven recommender systems algorithms. Our results show that there exist selection strategies other than top-n that increase predictive performance for various algorithms and recommendation domains. However, the performance of the top ∼43% of selection strategies is not significantly different. We discuss the impact of our findings on optimization and re-ranking in recommender systems and feasible solutions. The implementation of our study is publicly available.
An Empirical Analysis of Intervention Strategies’ Effectiveness for Countering Misinformation Amplification by Recommendation Algorithms.
Pathak, R.; and Spezzano, F.
In Goharian, N.; Tonellotto, N.; He, Y.; Lipani, A.; McDonald, G.; Macdonald, C.; and Ounis, I., editor(s), Advances in Information Retrieval, volume 14611, of LNCS, pages 285–301, 2024. Springer Nature Switzerland
doi
link
bibtex
abstract
@inproceedings{pathak_empirical_2024, series = {{LNCS}}, title = {An {Empirical} {Analysis} of {Intervention} {Strategies}’ {Effectiveness} for {Countering} {Misinformation} {Amplification} by {Recommendation} {Algorithms}}, volume = {14611}, isbn = {978-3-031-56066-8}, doi = {10.1007/978-3-031-56066-8_23}, abstract = {Social network platforms connect people worldwide, facilitating communication, information sharing, and personal/professional networking. They use recommendation algorithms to personalize content and enhance user experiences. However, these algorithms can unintentionally amplify misinformation by prioritizing engagement over accuracy. For instance, recent works suggest that popularity-based and network-based recommendation algorithms contribute the most to misinformation diffusion. In our study, we present an exploration on two Twitter datasets to understand the impact of intervention techniques on combating misinformation amplification initiated by recommendation algorithms. We simulate various scenarios and evaluate the effectiveness of intervention strategies in social sciences such as Virality Circuit Breakers and accuracy nudges. Our findings highlight that these intervention strategies are generally successful when applied on top of collaborative filtering and content-based recommendation algorithms, while having different levels of effectiveness depending on the number of users keen to spread fake news present in the dataset.}, language = {en}, booktitle = {Advances in {Information} {Retrieval}}, publisher = {Springer Nature Switzerland}, author = {Pathak, Royal and Spezzano, Francesca}, editor = {Goharian, Nazli and Tonellotto, Nicola and He, Yulan and Lipani, Aldo and McDonald, Graham and Macdonald, Craig and Ounis, Iadh}, year = {2024}, pages = {285--301}, }
Social network platforms connect people worldwide, facilitating communication, information sharing, and personal/professional networking. They use recommendation algorithms to personalize content and enhance user experiences. However, these algorithms can unintentionally amplify misinformation by prioritizing engagement over accuracy. For instance, recent works suggest that popularity-based and network-based recommendation algorithms contribute the most to misinformation diffusion. In our study, we present an exploration on two Twitter datasets to understand the impact of intervention techniques on combating misinformation amplification initiated by recommendation algorithms. We simulate various scenarios and evaluate the effectiveness of intervention strategies in social sciences such as Virality Circuit Breakers and accuracy nudges. Our findings highlight that these intervention strategies are generally successful when applied on top of collaborative filtering and content-based recommendation algorithms, while having different levels of effectiveness depending on the number of users keen to spread fake news present in the dataset.
2023
(6)
Candidate set sampling for evaluating top-N recommendation.
Ihemelandu, N.; and Ekstrand, M. D.
In Proceedings of the 22nd IEEE/WIC International Conference on Web Intelligence and Intelligent Agent Technology, pages 88–94, October 2023.
arXiv:2309.11723 [cs]
Paper
doi
link
bibtex
abstract
@inproceedings{ihemelandu_candidate_2023, title = {Candidate set sampling for evaluating top-{N} recommendation}, url = {https://doi.org/10.1109/WI-IAT59888.2023.00018}, doi = {10.1109/WI-IAT59888.2023.00018}, abstract = {The strategy for selecting candidate sets -- the set of items that the recommendation system is expected to rank for each user -- is an important decision in carrying out an offline top-\$N\$ recommender system evaluation. The set of candidates is composed of the union of the user's test items and an arbitrary number of non-relevant items that we refer to as decoys. Previous studies have aimed to understand the effect of different candidate set sizes and selection strategies on evaluation. In this paper, we extend this knowledge by studying the specific interaction of candidate set selection strategies with popularity bias, and use simulation to assess whether sampled candidate sets result in metric estimates that are less biased with respect to the true metric values under complete data that is typically unavailable in ordinary experiments.}, urldate = {2023-11-08}, booktitle = {Proceedings of the 22nd {IEEE}/{WIC} international conference on web intelligence and intelligent agent technology}, author = {Ihemelandu, Ngozi and Ekstrand, Michael D.}, month = oct, year = {2023}, note = {arXiv:2309.11723 [cs]}, keywords = {Computer Science - Information Retrieval}, pages = {88--94}, }
The strategy for selecting candidate sets – the set of items that the recommendation system is expected to rank for each user – is an important decision in carrying out an offline top-N recommender system evaluation. The set of candidates is composed of the union of the user's test items and an arbitrary number of non-relevant items that we refer to as decoys. Previous studies have aimed to understand the effect of different candidate set sizes and selection strategies on evaluation. In this paper, we extend this knowledge by studying the specific interaction of candidate set selection strategies with popularity bias, and use simulation to assess whether sampled candidate sets result in metric estimates that are less biased with respect to the true metric values under complete data that is typically unavailable in ordinary experiments.
Modeling uncertainty to improve personalized recommendations via Bayesian deep learning.
Wang, X.; and Kadıoğlu, S.
International Journal of Data Science and Analytics, 16(2): 191–201. August 2023.
Paper
doi
link
bibtex
abstract
@article{wang_modeling_2023, title = {Modeling uncertainty to improve personalized recommendations via {Bayesian} deep learning}, volume = {16}, issn = {2364-4168}, url = {https://doi.org/10.1007/s41060-020-00241-1}, doi = {10.1007/s41060-020-00241-1}, abstract = {Modeling uncertainty has been a major challenge in developing Machine Learning solutions to solve real world problems in various domains. In Recommender Systems, a typical usage of uncertainty is to balance exploration and exploitation, where the uncertainty helps to guide the selection of new options in exploration. Recent advances in combining Bayesian methods with deep learning enable us to express uncertain status in deep learning models. In this paper, we investigate an approach based on Bayesian deep learning to improve personalized recommendations. We first build deep learning architectures to learn useful representation of user and item inputs for predicting their interactions. We then explore multiple embedding components to accommodate different types of user and item inputs. Based on Bayesian deep learning techniques, a key novelty of our approach is to capture the uncertainty associated with the model output and further utilize it to boost exploration in the context of Recommender Systems. We test the proposed approach in both a Collaborative Filtering and a simulated online recommendation setting. Experimental results on publicly available benchmarks demonstrate the benefits of our approach in improving the recommendation performance.}, language = {en}, number = {2}, urldate = {2024-03-17}, journal = {International Journal of Data Science and Analytics}, author = {Wang, Xin and Kadıoğlu, Serdar}, month = aug, year = {2023}, pages = {191--201}, }
Modeling uncertainty has been a major challenge in developing Machine Learning solutions to solve real world problems in various domains. In Recommender Systems, a typical usage of uncertainty is to balance exploration and exploitation, where the uncertainty helps to guide the selection of new options in exploration. Recent advances in combining Bayesian methods with deep learning enable us to express uncertain status in deep learning models. In this paper, we investigate an approach based on Bayesian deep learning to improve personalized recommendations. We first build deep learning architectures to learn useful representation of user and item inputs for predicting their interactions. We then explore multiple embedding components to accommodate different types of user and item inputs. Based on Bayesian deep learning techniques, a key novelty of our approach is to capture the uncertainty associated with the model output and further utilize it to boost exploration in the context of Recommender Systems. We test the proposed approach in both a Collaborative Filtering and a simulated online recommendation setting. Experimental results on publicly available benchmarks demonstrate the benefits of our approach in improving the recommendation performance.
The effect of random seeds for data splitting on recommendation accuracy.
Wegmeth, L.; Vente, T.; Purucker, L.; and Beel, J.
In Perspectives on the Evaluation of Recommender Systems Workshop (PERSPECTIVES 2023), September 2023.
link
bibtex
abstract
@inproceedings{wegmeth_effect_2023, title = {The effect of random seeds for data splitting on recommendation accuracy}, abstract = {The evaluation of recommender system algorithms depends on randomness, e.g., during randomly splitting data into training and testing data. We suspect that failing to account for randomness in this scenario may lead to misrepresenting the predictive accuracy of recommendation algorithms. To understand the community’s view of the importance of randomness, we conducted a paper study on 39 full papers published at the ACM RecSys 2022 conference. We found that the authors of 26 papers used some variation of a holdout split that requires a random seed. However, only five papers explicitly repeated experiments and averaged their results over different random seeds. This potentially problematic research practice motivated us to analyze the effect of data split random seeds on recommendation accuracy. Therefore, we train three common algorithms on nine public data sets with 20 data split random seeds, evaluate them on two ranking metrics with three different ranking cutoff values 𝑘, and compare the results. In the extreme case with 𝑘 = 1, we show that depending on the data split random seed, the accuracy with traditional recommendation algorithms deviates by up to ∼6.3\% from the mean accuracy achieved on the data set. Hence, we show that an algorithm may significantly over- or under-perform when maliciously or negligently selecting a random seed for splitting the data. To showcase a mitigation strategy and better research practice, we compare holdout to cross-validation and show that, again, for 𝑘 = 1, the accuracy of algorithms evaluated with cross-validation deviates only up to ∼2.3\% from the mean accuracy achieved on the data set. Furthermore, we found that the deviation becomes smaller the higher the value of 𝑘 for both holdout and cross-validation.}, language = {en}, booktitle = {Perspectives on the {Evaluation} of {Recommender} {Systems} {Workshop} ({PERSPECTIVES} 2023)}, author = {Wegmeth, Lukas and Vente, Tobias and Purucker, Lennart and Beel, Joeran}, month = sep, year = {2023}, keywords = {to-read}, }
The evaluation of recommender system algorithms depends on randomness, e.g., during randomly splitting data into training and testing data. We suspect that failing to account for randomness in this scenario may lead to misrepresenting the predictive accuracy of recommendation algorithms. To understand the community’s view of the importance of randomness, we conducted a paper study on 39 full papers published at the ACM RecSys 2022 conference. We found that the authors of 26 papers used some variation of a holdout split that requires a random seed. However, only five papers explicitly repeated experiments and averaged their results over different random seeds. This potentially problematic research practice motivated us to analyze the effect of data split random seeds on recommendation accuracy. Therefore, we train three common algorithms on nine public data sets with 20 data split random seeds, evaluate them on two ranking metrics with three different ranking cutoff values 𝑘, and compare the results. In the extreme case with 𝑘 = 1, we show that depending on the data split random seed, the accuracy with traditional recommendation algorithms deviates by up to ∼6.3% from the mean accuracy achieved on the data set. Hence, we show that an algorithm may significantly over- or under-perform when maliciously or negligently selecting a random seed for splitting the data. To showcase a mitigation strategy and better research practice, we compare holdout to cross-validation and show that, again, for 𝑘 = 1, the accuracy of algorithms evaluated with cross-validation deviates only up to ∼2.3% from the mean accuracy achieved on the data set. Furthermore, we found that the deviation becomes smaller the higher the value of 𝑘 for both holdout and cross-validation.
Introducing LensKit-Auto, an experimental automated recommender system (AutoRecSys) toolkit.
Vente, T.; Ekstrand, M.; and Beel, J.
In Proceedings of the 17th ACM Conference on Recommender Systems, of RecSys '23, pages 1212–1216, New York, NY, USA, September 2023. Association for Computing Machinery
Paper
doi
link
bibtex
abstract
@inproceedings{vente_introducing_2023, address = {New York, NY, USA}, series = {{RecSys} '23}, title = {Introducing {LensKit}-{Auto}, an experimental automated recommender system ({AutoRecSys}) toolkit}, isbn = {9798400702419}, url = {https://dl.acm.org/doi/10.1145/3604915.3610656}, doi = {10.1145/3604915.3610656}, abstract = {LensKit is one of the first and most popular Recommender System libraries. While LensKit offers a wide variety of features, it does not include any optimization strategies or guidelines on how to select and tune LensKit algorithms. LensKit developers have to manually include third-party libraries into their experimental setup or implement optimization strategies by hand to optimize hyperparameters. We found that 63.6\% (21 out of 33) of papers using LensKit algorithms for their experiments did not select algorithms or tune hyperparameters. Non-optimized models represent poor baselines and produce less meaningful research results. This demo introduces LensKit-Auto. LensKit-Auto automates the entire Recommender System pipeline and enables LensKit developers to automatically select, optimize, and ensemble LensKit algorithms.}, urldate = {2023-09-18}, booktitle = {Proceedings of the 17th {ACM} {Conference} on {Recommender} {Systems}}, publisher = {Association for Computing Machinery}, author = {Vente, Tobias and Ekstrand, Michael and Beel, Joeran}, month = sep, year = {2023}, keywords = {Algorithm Selection, AutoRecSys, Automated Recommender Systems, CASH, Hyperparameter Optimization, Recommender Systems}, pages = {1212--1216}, }
LensKit is one of the first and most popular Recommender System libraries. While LensKit offers a wide variety of features, it does not include any optimization strategies or guidelines on how to select and tune LensKit algorithms. LensKit developers have to manually include third-party libraries into their experimental setup or implement optimization strategies by hand to optimize hyperparameters. We found that 63.6% (21 out of 33) of papers using LensKit algorithms for their experiments did not select algorithms or tune hyperparameters. Non-optimized models represent poor baselines and produce less meaningful research results. This demo introduces LensKit-Auto. LensKit-Auto automates the entire Recommender System pipeline and enables LensKit developers to automatically select, optimize, and ensemble LensKit algorithms.
Mitigating mainstream bias in recommendation via cost-sensitive learning.
Li, R. Z.; Urbano, J.; and Hanjalic, A.
July 2023.
arXiv:2307.13632 [cs]
Paper
doi
link
bibtex
abstract
@misc{li_mitigating_2023, title = {Mitigating mainstream bias in recommendation via cost-sensitive learning}, url = {http://arxiv.org/abs/2307.13632}, doi = {10.1145/3578337.3605134}, abstract = {Mainstream bias, where some users receive poor recommendations because their preferences are uncommon or simply because they are less active, is an important aspect to consider regarding fairness in recommender systems. Existing methods to mitigate mainstream bias do not explicitly model the importance of these non-mainstream users or, when they do, it is in a way that is not necessarily compatible with the data and recommendation model at hand. In contrast, we use the recommendation utility as a more generic and implicit proxy to quantify mainstreamness, and propose a simple user-weighting approach to incorporate it into the training process while taking the cost of potential recommendation errors into account. We provide extensive experimental results showing that quantifying mainstreamness via utility is better able at identifying non-mainstream users, and that they are indeed better served when training the model in a cost-sensitive way. This is achieved with negligible or no loss in overall recommendation accuracy, meaning that the models learn a better balance across users. In addition, we show that research of this kind, which evaluates recommendation quality at the individual user level, may not be reliable if not using enough interactions when assessing model performance.}, urldate = {2023-07-29}, author = {Li, Roger Zhe and Urbano, Julián and Hanjalic, Alan}, month = jul, year = {2023}, note = {arXiv:2307.13632 [cs]}, keywords = {Computer Science - Information Retrieval}, }
Mainstream bias, where some users receive poor recommendations because their preferences are uncommon or simply because they are less active, is an important aspect to consider regarding fairness in recommender systems. Existing methods to mitigate mainstream bias do not explicitly model the importance of these non-mainstream users or, when they do, it is in a way that is not necessarily compatible with the data and recommendation model at hand. In contrast, we use the recommendation utility as a more generic and implicit proxy to quantify mainstreamness, and propose a simple user-weighting approach to incorporate it into the training process while taking the cost of potential recommendation errors into account. We provide extensive experimental results showing that quantifying mainstreamness via utility is better able at identifying non-mainstream users, and that they are indeed better served when training the model in a cost-sensitive way. This is achieved with negligible or no loss in overall recommendation accuracy, meaning that the models learn a better balance across users. In addition, we show that research of this kind, which evaluates recommendation quality at the individual user level, may not be reliable if not using enough interactions when assessing model performance.
Inference at scale: significance testing for large search and recommendation experiments.
Ihemelandu, N.; and Ekstrand, M. D.
In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, of SIGIR '23, pages 2087–2091, New York, NY, USA, July 2023. Association for Computing Machinery
Paper
doi
link
bibtex
abstract
@inproceedings{ihemelandu_inference_2023, address = {New York, NY, USA}, series = {{SIGIR} '23}, title = {Inference at scale: significance testing for large search and recommendation experiments}, copyright = {All rights reserved}, isbn = {978-1-4503-9408-6}, shorttitle = {Inference at scale}, url = {https://dl.acm.org/doi/10.1145/3539618.3592004}, doi = {10.1145/3539618.3592004}, abstract = {A number of information retrieval studies have been done to assess which statistical techniques are appropriate for comparing systems. However, these studies are focused on TREC-style experiments, which typically have fewer than 100 topics. There is no similar line of work for large search and recommendation experiments; such studies typically have thousands of topics or users and much sparser relevance judgements, so it is not clear if recommendations for analyzing traditional TREC experiments apply to these settings. In this paper, we empirically study the behavior of significance tests with large search and recommendation evaluation data. Our results show that the Wilcoxon and Sign tests show significantly higher Type-1 error rates for large sample sizes than the bootstrap, randomization and t-tests, which were more consistent with the expected error rate. While the statistical tests displayed differences in their power for smaller sample sizes, they showed no difference in their power for large sample sizes. We recommend the sign and Wilcoxon tests should not be used to analyze large scale evaluation results. Our result demonstrate that with Top-N recommendation and large search evaluation data, most tests would have a 100\% chance of finding statistically significant results. Therefore, the effect size should be used to determine practical or scientific significance.}, urldate = {2023-07-23}, booktitle = {Proceedings of the 46th {International} {ACM} {SIGIR} {Conference} on {Research} and {Development} in {Information} {Retrieval}}, publisher = {Association for Computing Machinery}, author = {Ihemelandu, Ngozi and Ekstrand, Michael D.}, month = jul, year = {2023}, keywords = {evaluation, statistical inference}, pages = {2087--2091}, }
A number of information retrieval studies have been done to assess which statistical techniques are appropriate for comparing systems. However, these studies are focused on TREC-style experiments, which typically have fewer than 100 topics. There is no similar line of work for large search and recommendation experiments; such studies typically have thousands of topics or users and much sparser relevance judgements, so it is not clear if recommendations for analyzing traditional TREC experiments apply to these settings. In this paper, we empirically study the behavior of significance tests with large search and recommendation evaluation data. Our results show that the Wilcoxon and Sign tests show significantly higher Type-1 error rates for large sample sizes than the bootstrap, randomization and t-tests, which were more consistent with the expected error rate. While the statistical tests displayed differences in their power for smaller sample sizes, they showed no difference in their power for large sample sizes. We recommend that the sign and Wilcoxon tests not be used to analyze large-scale evaluation results. Our results demonstrate that with Top-N recommendation and large search evaluation data, most tests would have a 100% chance of finding statistically significant results. Therefore, the effect size should be used to determine practical or scientific significance.
2022
(1)
Measuring fairness in ranked results: an analytical and empirical comparison.
Raj, A.; and Ekstrand, M. D.
In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 726–736, July 2022. ACM
Paper
doi
link
bibtex
abstract
@inproceedings{raj_measuring_2022, title = {Measuring fairness in ranked results: an analytical and empirical comparison}, url = {https://md.ekstrandom.net/pubs/fair-ranking}, doi = {10.1145/3477495.3532018}, abstract = {Information access systems, such as search and recommender systems, often use ranked lists to present results believed to be relevant to the user's information need. Evaluating these lists for their fairness along with other traditional metrics provides a more complete understanding of an information access system's behavior beyond accuracy or utility constructs. To measure the (un)fairness of rankings, particularly with respect to the protected group(s) of producers or providers, several metrics have been proposed in the last several years. However, an empirical and comparative analyses of these metrics showing the applicability to specific scenario or real data, conceptual similarities, and differences is still lacking. We aim to bridge the gap between theoretical and practical ap-plication of these metrics. In this paper we describe several fair ranking metrics from the existing literature in a common notation, enabling direct comparison of their approaches and assumptions, and empirically compare them on the same experimental setup and data sets in the context of three information access tasks. We also provide a sensitivity analysis to assess the impact of the design choices and parameter settings that go in to these metrics and point to additional work needed to improve fairness measurement.}, booktitle = {Proceedings of the 45th {International} {ACM} {SIGIR} {Conference} on {Research} and {Development} in {Information} {Retrieval}}, publisher = {ACM}, author = {Raj, Amifa and Ekstrand, Michael D}, month = jul, year = {2022}, pages = {726--736}, }
Information access systems, such as search and recommender systems, often use ranked lists to present results believed to be relevant to the user's information need. Evaluating these lists for their fairness along with other traditional metrics provides a more complete understanding of an information access system's behavior beyond accuracy or utility constructs. To measure the (un)fairness of rankings, particularly with respect to the protected group(s) of producers or providers, several metrics have been proposed in the last several years. However, an empirical and comparative analysis of these metrics, showing their applicability to specific scenarios or real data, their conceptual similarities, and their differences, is still lacking. We aim to bridge the gap between theoretical and practical application of these metrics. In this paper we describe several fair ranking metrics from the existing literature in a common notation, enabling direct comparison of their approaches and assumptions, and empirically compare them on the same experimental setup and data sets in the context of three information access tasks. We also provide a sensitivity analysis to assess the impact of the design choices and parameter settings that go into these metrics and point to additional work needed to improve fairness measurement.
2021
(1)
Exploring author gender in book rating and recommendation.
Ekstrand, M. D.; and Kluver, D.
User Modeling and User-Adapted Interaction, 31(3): 377–420. July 2021.
Paper
doi
link
bibtex
abstract
@article{ekstrand_exploring_2021, title = {Exploring author gender in book rating and recommendation}, volume = {31}, issn = {0924-1868}, url = {https://md.ekstrandom.net/pubs/bag-extended}, doi = {10.1007/s11257-020-09284-2}, abstract = {Collaborative filtering algorithms find useful patterns in rating and consumption data and exploit these patterns to guide users to good items. Many of the patterns in rating datasets reflect important real-world differences between the various users and items in the data; other patterns may be irrelevant or possibly undesirable for social or ethical reasons, particularly if they reflect undesired discrimination, such as discrimination in publishing or purchasing against authors who are women or ethnic minorities. In this work, we examine the response of collaborative filtering recommender algorithms to the distribution of their input data with respect to a dimension of social concern, namely content creator gender. Using publicly-available book ratings data, we measure the distribution of the genders of the authors of books in user rating profiles and recommendation lists produced from this data. We find that common collaborative filtering algorithms differ in the gender distribution of their recommendation lists, and in the relationship of that output distribution to user profile distribution.}, number = {3}, urldate = {2020-06-05}, journal = {User Modeling and User-Adapted Interaction}, author = {Ekstrand, Michael D and Kluver, Daniel}, month = jul, year = {2021}, pages = {377--420}, }
Collaborative filtering algorithms find useful patterns in rating and consumption data and exploit these patterns to guide users to good items. Many of the patterns in rating datasets reflect important real-world differences between the various users and items in the data; other patterns may be irrelevant or possibly undesirable for social or ethical reasons, particularly if they reflect undesired discrimination, such as discrimination in publishing or purchasing against authors who are women or ethnic minorities. In this work, we examine the response of collaborative filtering recommender algorithms to the distribution of their input data with respect to a dimension of social concern, namely content creator gender. Using publicly-available book ratings data, we measure the distribution of the genders of the authors of books in user rating profiles and recommendation lists produced from this data. We find that common collaborative filtering algorithms differ in the gender distribution of their recommendation lists, and in the relationship of that output distribution to user profile distribution.
2020
(4)
Music recommendation using genetic programming.
Vanhaesebroeck, R.
Master's thesis, Ghent University, Belgium, 2020.
Paper
link
bibtex
@mastersthesis{vanhaesebroeck_music_2020, address = {Belgium}, title = {Music recommendation using genetic programming}, url = {https://libstore.ugent.be/fulltxt/RUG01/002/945/760/RUG01-002945760_2021_0001_AC.pdf}, urldate = {2025-05-30}, school = {Ghent University}, author = {Vanhaesebroeck, Robbe}, year = {2020}, }
User-Specific Bicluster-Based Collaborative Filtering.
da Silva, M. M. G.
Master's thesis, Universidade de Lisboa, Portugal, 2020.
ISBN: 9798209925156
Paper
link
bibtex
abstract
@mastersthesis{da_silva_user-specific_2020, address = {Portugal}, title = {User-{Specific} {Bicluster}-{Based} {Collaborative} {Filtering}}, copyright = {Database copyright ProQuest LLC; ProQuest does not claim copyright in the individual underlying works.}, url = {https://www.proquest.com/docview/2652593247/abstract/29EEDE32E67A4219PQ/1}, abstract = {Os sistemas de recomendação são um conjunto de técnicas e software que têm como objetivo sugerir itens a um determinado utilizador. Sugestões essas que têm como objetivo ajudar os utilizadores durante a tomada de decisão. O processo para uma tomada de decisão pode ser difícil, especialmente quando existe um enorme número de opções para escolher. Grandes empresas tiram partido dos sistemas de recomendação para melhorar o seu serviço e aumentar as suas receitas. Um exemplo é a plataforma de streaming Netflix que, utilizando um sistema de recomendação, personaliza os filmes ou séries destacados para cada cliente. As recomendações personalizadas normalmente têm como base os dados que as empresas recolhem dos utilizadores, que vão desde reações explícitas, por exemplo através avaliações do utilizador a produtos, a reações implícitas, examinando a forma como o utilizador interage com o sistema. Uma das abordagens mais populares dos sistemas de recomendação é a filtragem colaborativa. Os métodos baseados em filtragem colaborativa produzem recomendações personalizadas de itens, tendo por base padrões encontrados em dados de uso ou avaliações anteriores. Os modelos de filtragem colaborativa normalmente usam uma simples matriz de dados, conhecida como matriz de interação U-I, que contém as avaliações que os utilizadores deram aos itens do sistema. Explorando os dados da matriz U-I, a filtragem colaborativa assume que, se um determinado utilizador teve as mesmas preferências que outro utilizador no passado, é provável que também venha a ter no futuro. Desta forma, os modelos de filtragem colaborativa têm como objetivo recomendar uma lista de N itens a um utilizador (denominado utilizador ativo), ou prever o rating que esse utilizador iria dar a um item que ainda não avaliou. Na literatura, os métodos de filtragem colaborativa são divididos em duas classes: os baseados em memórias e os baseados em modelos. Os algoritmos baseados em memória, também conhecidos como algoritmos de vizinhança, usam toda a matriz U-I para realizar as tarefas de recomendação. Os dois principais métodos são conhecidos como “User-based” e “Item-based”. O User-based tenta encontrar utilizadores com preferências parecidas ao utilizador a que se pretende fazer recomendações e usa os dados dessa vizinhança de utilizadores similares para fazer as previsões ou recomendações. Por outro lado, os algoritmos Item-based utilizam os itens já avaliados pelo utilizador ativo, calculam a similaridade entre esses itens e o item que se quer avaliar, construindo assim uma vizinhança de itens. A partir dessa vizinhança de itens, prevê-se uma futura avaliação do utilizador a esse mesmo item. Apesar de os algoritmos de vizinhança obterem bom resultados de previsão e recomendação, apresentam duas grandes debilidades que limitam o seu uso em ambientes de recomendação do mundo real. Os dados de recomendação são normalmente de grandes dimensões e esparsos, isto é, com muitos valores em falta. 
Dada a complexidade resultante do facto de terem de comparar todos os utilizadores ou itens entre si, o que se traduz em n 2 comparações, torna-se impraticável o uso de algoritmos deste género em sistemas com grande quantidade de users e itens. Além disso, o facto de haver muitos valores em falta, faz que seja recorrente alguns utilizadores/itens terem pequenas vizinhanças. Para tentar lidar com as fraquezas dos algoritmos baseados em memórias, surgiram os algoritmos baseados em modelos. Estas abordagens utilizam modelos que aprendem com os dados e reconhecem padrões para realizar as tarefas de filtragem colaborativa. Técnicas de redução de dimensionalidade como “Singular Value Decomposition” e “Latent Semantic Analysis” são agora as abordagens standard para reduzir a natureza esparsa da matriz de interação. Existem ainda abordagens baseadas em aprendizagem automática, como redes bayesianas, agrupamento de dados, entre outras. Estes modelos de redução de dimensionalidade, apesar de perderem informação que geralmente resulta em piores resultados em termos de previsão/recomendação, conseguem lidar com o problema da escalabilidade apresentado pelos modelos baseados em memória. Alternate abstract: Collaborative Filtering is one of the most popular and successful approaches for Recommender Systems. However, some challenges limit the effectiveness of Collaborative Filtering approaches when dealing with recommendation data, mainly due to the vast amounts of data and their sparse nature. In order to improve the scalability and performance of Collaborative Filtering approaches, several authors proposed successful approaches combining Collaborative Filtering with clustering techniques. In this work, we study the effectiveness of biclustering, an advanced clustering technique that groups rows and columns simultaneously, in Collaborative Filtering. When applied to the classic U-I interaction matrices, biclustering considers the duality relations between users and items, creating clusters of users who are similar under a particular group of items. We propose USBCF, a novel biclustering-based Collaborative Filtering approach that creates user specific models to improve the scalability of traditional CF approaches. Using a realworld dataset, we conduct a set of experiments to objectively evaluate the performance of the proposed approach, comparing it against baseline and state-of-the-art Collaborative Filtering methods. Our results show that the proposed approach can successfully suppress the main limitation of the previously proposed state-of-the-art biclustering-based Collaborative Filtering (BBCF) since BBCF can only output predictions for a small subset of the system users and item (lack of coverage). Moreover, USBCF produces rating predictions with quality comparable to the state-of-the-art approaches.}, language = {English}, urldate = {2025-05-30}, school = {Universidade de Lisboa (Portugal)}, author = {da Silva, Miguel Miranda Garção}, year = {2020}, note = {ISBN: 9798209925156}, }
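As a rough illustration of the bicluster idea behind USBCF — restricting the rating matrix to a group of users who agree on a group of items and predicting from that submatrix — the Python sketch below is an assumption-laden toy, not the thesis's algorithm: the toy rating matrix, the mean-centred prediction rule, and the name predict_in_bicluster are all invented for illustration.

# Illustrative sketch only -- not the USBCF method from the thesis.
# A bicluster is taken here to be a fixed subset of users and items;
# the prediction rule (user mean + bicluster-local offset) is an assumption.
import numpy as np

ratings = np.array([
    [5.0, 4.0, np.nan, 1.0],
    [4.0, 5.0, np.nan, 2.0],
    [np.nan, 4.0, 5.0, 1.0],
    [1.0, 2.0, 4.0, 5.0],
])  # toy user x item matrix, np.nan = unrated

def predict_in_bicluster(ratings, bc_users, bc_items, user, item):
    """Predict `user`'s rating for `item` using only one bicluster's submatrix."""
    sub = ratings[np.ix_(bc_users, bc_items)]
    user_means = np.nanmean(sub, axis=1)            # each member's mean inside the bicluster
    col = list(bc_items).index(item)
    offset = np.nanmean(sub[:, col] - user_means)   # average deviation for this item
    return np.nanmean(ratings[user]) + offset

# Users 0-2 rate items 0, 1 and 3 similarly: a plausible toy bicluster.
print(predict_in_bicluster(ratings, [0, 1, 2], [0, 1, 3], user=2, item=0))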
Evaluating stochastic rankings with expected exposure.
Diaz, F.; Mitra, B.; Ekstrand, M. D.; Biega, A. J.; and Carterette, B.
In Proceedings of the 29th ACM International Conference on Information and Knowledge Management, of CIKM '20, October 2020. ACM
Paper
doi
link
bibtex
abstract
@inproceedings{diaz_evaluating_2020, series = {{CIKM} '20}, title = {Evaluating stochastic rankings with expected exposure}, url = {http://arxiv.org/abs/2004.13157}, doi = {10.1145/3340531.3411962}, abstract = {We introduce the concept of expected exposure as the average attention ranked items receive from users over repeated samples of the same query. Furthermore, we advocate for the adoption of the principle of equal expected exposure: given a fixed information need, no item receive more or less expected exposure compared to any other item of the same relevance grade. We argue that this principle is desirable for many retrieval objectives and scenarios, including topical diversity and fair ranking. Leveraging user models from existing retrieval metrics, we propose a general evaluation methodology based on expected exposure and draw connections to related metrics in information retrieval evaluation. Importantly, this methodology relaxes classic information retrieval assumptions, allowing a system, in response to a query, to produce a distribution over rankings instead of a single fixed ranking. We study the behavior of the expected exposure metric and stochastic rankers across a variety of information access conditions, including ad hoc retrieval and recommendation. We believe that measuring and optimizing expected exposure metrics using randomization opens a new area for retrieval algorithm development and progress.}, booktitle = {Proceedings of the 29th {ACM} {International} {Conference} on {Information} and {Knowledge} {Management}}, publisher = {ACM}, author = {Diaz, Fernando and Mitra, Bhaskar and Ekstrand, Michael D and Biega, Asia J and Carterette, Ben}, month = oct, year = {2020}, }
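As a rough sketch of the expected-exposure idea — averaging the attention items receive over repeated sampled rankings for the same query — the snippet below is illustrative only: the RBP-style patience parameter, the Gumbel-based Plackett-Luce sampler, and the function names are assumptions, not the paper's exact user model or estimator.

# Illustrative sketch: expected exposure over sampled stochastic rankings.
import numpy as np

rng = np.random.default_rng(42)

def sample_ranking(scores, rng):
    """Sample one ranking from item scores via the Gumbel trick (Plackett-Luce)."""
    return np.argsort(-(scores + rng.gumbel(size=len(scores))))

def expected_exposure(scores, n_samples=1000, patience=0.8):
    """Average position-based exposure each item receives across sampled rankings."""
    exposure = np.zeros(len(scores))
    for _ in range(n_samples):
        for rank, item in enumerate(sample_ranking(scores, rng)):
            exposure[item] += patience ** rank      # simple patience (RBP-like) discount
    return exposure / n_samples

scores = np.array([2.0, 2.0, 1.0, 0.1])
print(expected_exposure(scores))  # the two equally scored items end up with near-equal exposure

A deterministic ranker would instead give one of the tied items all of the top-rank exposure, which is the kind of imbalance the equal-expected-exposure principle is meant to diagnose.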
Comparing fair ranking metrics.
Raj, A.; Wood, C.; Montoly, A.; and Ekstrand, M. D.
arXiv:2009.01311, September 2020.
Paper
link
bibtex
abstract
@inproceedings{raj_comparing_2020, title = {Comparing fair ranking metrics}, url = {http://arxiv.org/abs/2009.01311}, abstract = {Ranking is a fundamental aspect of recommender systems. However, ranked outputs can be susceptible to various biases; some of these may cause disadvantages to members of protected groups. Several metrics have been proposed to quantify the (un)fairness of rankings, but there has not been to date any direct comparison of these metrics. This complicates deciding what fairness metrics are applicable for specific scenarios, and assessing the extent to which metrics agree or disagree. In this paper, we describe several fair ranking metrics in a common notation, enabling direct comparison of their approaches and assumptions, and empirically compare them on the same experimental setup and data set. Our work provides a direct comparative analysis identifying similarities and differences of fair ranking metrics selected for our work.}, author = {Raj, Amifa and Wood, Connor and Montoly, Ananda and Ekstrand, Michael D}, month = sep, year = {2020}, }
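To give a flavour of the kind of quantity such fair ranking metrics capture, the snippet below computes the share of position-discounted exposure that a protected group receives in a single ranking; the log discount and the share definition are assumptions for illustration and do not correspond to any specific metric compared in the paper.

# Illustrative sketch: exposure share of a protected group under a log discount.
import numpy as np

def group_exposure_share(ranking_is_protected):
    """Fraction of discounted exposure going to protected-group items.

    ranking_is_protected: booleans in rank order, True where the item at
    that position belongs to the protected group.
    """
    n = len(ranking_is_protected)
    discounts = 1.0 / np.log2(np.arange(2, n + 2))  # DCG-style position discount
    protected = np.asarray(ranking_is_protected, dtype=bool)
    return discounts[protected].sum() / discounts.sum()

# The same two protected items at the bottom vs. the top of a 6-item ranking:
print(group_exposure_share([False, False, False, False, True, True]))   # ~0.22
print(group_exposure_share([True, True, False, False, False, False]))   # ~0.49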