Multi-Objective Multi-Agent Bandits: From Learning Efficiency to Fairness Optimization
arXiv:2605.06864v1 Announce Type: new Abstract: We study multi-objective multi-agent multi-armed bandits (MO-MA-MAB) under stochastic rewards, where agents observe heterogeneous reward vectors and communicate over time-varying graphs. We formulate this emerging problem setting to address \emph{efficient learning}, measured by Pareto regret, and incorporate \emph{fair learning} as an additional goal, captured via social welfare. To quantify efficiency, we formalize Pareto regret and develop \textsc{Pareto UCB1 Gossip}, whose novel exploration radius explicitly separates statistical uncertainty in Pareto-based inference from consensus error. To express the fairness constraint, we formulate a Nash Social Welfare objective over preference-scalarized rewards and propose \textsc{Simulated NSW UCB Gossip}, which integrates preference-based reward simulation, gossip-based utility estimation, and UCB-style exploration. We prove that \textsc{Pareto UCB1 Gossip} achieves instance-dependent \(\mathcal{O}(\log T)\) regret and an instance-independent rate of \(\mathcal{O}(\sqrt{T})\), while \textsc{Simulated NSW UCB Gossip} achieves an instance-independent regret bound of \(\mathcal{O}(T^{3/4})\). This separation reveals the cost the fairness constraint imposes on our efficiency objective: fairness limits information aggregation and slows convergence. Experiments show that our methods consistently outperform baselines, improving performance by approximately \(100\%\) and \(50\%\) in the efficiency and fairness settings, respectively.
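
To make the first algorithm's structure concrete, the following is a minimal Python sketch of a Pareto-UCB index with a gossip step. The specific form of the exploration radius, the constants `c_stat` and `c_cons`, the consensus-error term \(1/\sqrt{t}\), and the class name are illustrative assumptions; the paper's exact quantities are not given in the abstract.

```python
import numpy as np

def pareto_front(ucbs):
    """Indices of arms whose UCB vectors are not Pareto-dominated.

    ucbs: (K, D) array; arm j dominates arm i if ucbs[j] >= ucbs[i]
    componentwise with at least one strict inequality.
    """
    K = ucbs.shape[0]
    keep = []
    for i in range(K):
        dominated = any(
            np.all(ucbs[j] >= ucbs[i]) and np.any(ucbs[j] > ucbs[i])
            for j in range(K) if j != i
        )
        if not dominated:
            keep.append(i)
    return keep

class ParetoUCBGossipAgent:
    """One agent's local state in a Pareto-UCB-with-gossip scheme (sketch)."""

    def __init__(self, n_arms, n_obj, rng, c_stat=2.0, c_cons=1.0):
        self.means = np.zeros((n_arms, n_obj))   # gossiped mean-reward estimates
        self.counts = np.zeros(n_arms)           # local pull counts
        self.c_stat, self.c_cons = c_stat, c_cons  # assumed radius constants
        self.rng = rng

    def gossip(self, neighbor_means, weights):
        # Weighted averaging with the current neighbors of the time-varying
        # graph; weights[0] is the agent's own weight, the rest its neighbors'.
        stack = np.stack([self.means] + list(neighbor_means))
        self.means = np.tensordot(weights, stack, axes=1)

    def select_arm(self, t):
        # Exploration radius split into a statistical term and a
        # consensus-error term, mirroring the separation the abstract
        # describes (exact form assumed here).
        n = np.maximum(self.counts, 1)
        radius = (np.sqrt(self.c_stat * np.log(max(t, 2)) / n)
                  + self.c_cons / np.sqrt(max(t, 2)))
        ucbs = self.means + radius[:, None]      # optimism in every objective
        return self.rng.choice(pareto_front(ucbs))

    def update(self, arm, reward_vec):
        # Incremental mean update for the pulled arm's reward vector.
        self.counts[arm] += 1
        self.means[arm] += (reward_vec - self.means[arm]) / self.counts[arm]
```

Choosing uniformly among the arms on the optimistic Pareto front is one standard tie-breaking rule for Pareto bandits; the abstract does not specify which rule the authors use.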

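For the fairness objective, a short sketch of the Nash Social Welfare (NSW) criterion over preference-scalarized utilities may help. The geometric-mean form of NSW, the preference scalarization \(w_n^\top r\), and the exploration constant `c` are standard choices assumed for illustration; the abstract's "preference-based reward simulation" step is not reproduced here.

```python
import numpy as np

def nsw(utilities, eps=1e-12):
    """Nash Social Welfare: geometric mean of agents' (positive) utilities."""
    return float(np.exp(np.mean(np.log(np.maximum(utilities, eps)))))

def scalarize(reward_vecs, prefs):
    """Scalarize D-dim rewards by agents' preference vectors.

    reward_vecs: (K, D) mean reward vector per arm.
    prefs:       (N, D) preference weights, one row per agent.
    Returns the (K, N) scalarized utility of each arm for each agent.
    """
    return reward_vecs @ prefs.T

def select_arm_nsw_ucb(util_means, counts, t, c=2.0):
    """Pick the arm maximizing NSW of optimistic utility estimates.

    util_means: (K, N) gossip-averaged utility estimates per arm and agent.
    counts:     (K,) pull counts; c is an assumed exploration constant.
    """
    radius = np.sqrt(c * np.log(max(t, 2)) / np.maximum(counts, 1))
    ucb = util_means + radius[:, None]
    return int(np.argmax([nsw(ucb[k]) for k in range(ucb.shape[0])]))
```

Because NSW multiplies utilities across agents rather than averaging them, a single poorly served agent drags the score down, which is what makes it a fairness objective; it also explains why imposing it can slow information aggregation relative to pure Pareto-regret minimization.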