A03 – Sequential and adaptive learning under dependence and non-standard objective functions

We consider the problem of learning sequentially, adaptively, and on the basis of partial information in an uncertain environment. In this setting the data are not available beforehand in batch form (i.e., as a single large, pre-collected dataset); instead, the learner collects them sequentially and actively. The process is as follows: at each time t, the learner chooses an action and receives a data point that depends on the chosen action. The most basic problem of this form is known as the multi-armed bandit problem [1,2].
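As a concrete illustration of this choose-an-action/observe-a-data-point loop, the following is a minimal sketch of the classical UCB1 strategy on simulated Bernoulli arms. It is illustrative only: the function name and the simulation setup are ours, not part of the project's methods.

```python
import math
import random

def ucb1(arm_means, horizon, seed=0):
    """Run the UCB1 strategy on simulated Bernoulli arms.

    Returns per-arm pull counts and the total collected reward.
    """
    rng = random.Random(seed)
    n_arms = len(arm_means)
    counts = [0] * n_arms   # number of pulls per arm
    sums = [0.0] * n_arms   # cumulative reward per arm
    total = 0.0
    for t in range(1, horizon + 1):
        if t <= n_arms:
            arm = t - 1  # initialisation: pull each arm once
        else:
            # optimism in the face of uncertainty:
            # empirical mean plus an exploration bonus
            arm = max(
                range(n_arms),
                key=lambda a: sums[a] / counts[a]
                + math.sqrt(2.0 * math.log(t) / counts[a]),
            )
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        total += reward
    return counts, total

counts, total = ucb1([0.2, 0.5, 0.8], horizon=5000)
```

After a few thousand rounds the strategy concentrates its pulls on the arm with the highest mean, which is the "rapidly single out the best action" behaviour discussed in the text.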

A popular application that we considered extensively in the first period is that of online recommendation systems [3]. At each time t, the recommendation system (the learner) must recommend an item to a user, i.e., choose an action from the set of available items, so as to fit the user's needs. The recommendation system does not observe the users' preferences directly, but only their feedback on the items it has recommended previously. This raises many interesting problems, for example finding optimal sequential strategies that rapidly single out the most liked items, or efficiently determining the preferences of users.

In many real-life cases, the collected data exhibit dependencies and the objectives of the learner can vary. In this project, we consider a setting where the aim is to design strategies that collect (or sample) the data sequentially and adaptively, so that they can be assimilated by the model as effectively as possible. An important milestone of the first period was to extend the bandit problem to the case where the samples produced by each arm are allowed to depend on each other, via a mixing assumption.
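A simple way to picture an arm with dependent samples is a stationary Gaussian AR(1) chain, which is geometrically mixing and hence a toy special case of the kind of mixing assumption mentioned above. The sketch below is ours (the function name and parameters are illustrative, not from the project): successive rewards are correlated, yet the empirical mean still converges to the arm's true mean, only with inflated variance.

```python
import math
import random

def ar1_rewards(mean, rho, n, seed=0):
    """Simulate n dependent rewards from a single arm as a stationary
    Gaussian AR(1) chain with autocorrelation rho in [0, 1)."""
    rng = random.Random(seed)
    x = rng.gauss(0.0, 1.0)  # start in the stationary distribution (unit variance)
    out = []
    for _ in range(n):
        # stationary update: Var(X_t) remains 1 for every t
        x = rho * x + math.sqrt(1.0 - rho * rho) * rng.gauss(0.0, 1.0)
        out.append(mean + x)
    return out

samples = ar1_rewards(mean=0.7, rho=0.8, n=20000)
est = sum(samples) / len(samples)
# The empirical mean still estimates the true mean 0.7, but its variance is
# inflated by roughly (1 + rho) / (1 - rho) compared to i.i.d. sampling.
```

This variance inflation is precisely why standard i.i.d. bandit analyses break down and concentration results for weakly dependent sums are needed.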

Our aim in the next phase is to generalise the results of the first phase, which target dependent and non-stationary bandit settings, to the case where the set of actions is continuous. Spatial and temporal dependencies and non-stationarities will then interact, giving rise to new problems and challenges.


[1] A. Slivkins. Introduction to Multi-Armed Bandits. 2021. https://arxiv.org/pdf/1904.07272.pdf

[2] N. Cesa-Bianchi and G. Lugosi. Prediction, learning, and games. Cambridge University Press, 2006. doi:10.1017/CBO9780511546921

[3] D. Bouneffouf and I. Rish. A survey on practical applications of multi-armed and contextual bandits. 2019. https://arxiv.org/abs/1904.10040

  • Zadorozhnyi, O., Gaillard, P., Gerchinovitz, S. and Rudi, A. (2021). Online nonparametric regression with Sobolev kernels. arXiv:2102.03594

  • Vernade, C., Carpentier, A., Lattimore, T., Zappella, G., Ermis, B. and Brueckner, M. (2020). Linear Bandits with Stochastic Delayed Feedback. arXiv:1807.02089

  • Lefakis, L., Zadorozhnyi, O. and Blanchard, G. (2019). Efficient Regularized Piecewise-Linear Regression Trees. arXiv:1907.00275

  • Zadorozhnyi, O., Blanchard, G. and Carpentier, A. (2019). Restless dependent bandits with fading memory. arXiv:1906.10454

  • Saggioro, E., de Wiljes, J., Kretschmer, M., and Runge, J. (2020). Reconstructing regime-dependent causal relationships from observational time series. Chaos, 30(11):113115–1–113115–22, doi:10.1063/5.0020538

  • Maneugueu, A., Vernade, C., Carpentier A., and Valko, M. (2020). Stochastic bandits with arm-dependent delays. In Thirty-seventh International Conference on Machine Learning, accepted for publication. arXiv:2006.10459

  • Blanchard, G. and Zadorozhnyi, O. (2019). Concentration of weakly dependent Banach-valued sums and applications to statistical learning methods. Bernoulli, 25(4B), 3421-3458. doi:10.3150/18-BEJ1095 (arXiv:1712.01934)

  • Achddou, J., Lam-Weil, J., Carpentier, A. and Blanchard, G. (2019). A minimax near-optimal algorithm for adaptive rejection sampling. Proceedings of the 30th International Conference on Algorithmic Learning Theory, PMLR 98:94-126, 2019. Open Access

  • Locatelli, A., Carpentier, A., and Valko, M. (2019). Active multiple matrix completion with adaptive confidence sets. Proceedings of Machine Learning Research, PMLR, 89, 1783-1791. Open Access

  • Seznec, J., Locatelli, A., Carpentier, A., Lazaric, A., and Valko, M. (2019). Rotting bandits are no harder than stochastic ones. Proceedings of Machine Learning Research, PMLR 89:2564-2572. Open Access

  • Locatelli, A., and Carpentier, A. (2018). Adaptivity to Smoothness in X-armed bandits. Proceedings of Machine Learning Research, PMLR, 75, 1463-1492. Open Access

  • Blanchard, G., Carpentier, A. and Gutzeit, M. (2018). Minimax Euclidean Separation Rates for Testing Convex Hypotheses in Rd. Electron. J. Statist. 12 (2): 3713-3735. doi:10.1214/18-EJS1472

  • Locatelli, A., Carpentier, A., and Kpotufe, S. (2018). An Adaptive Strategy for Active Learning with Smooth Decision Boundary. Proceedings of Machine Learning Research (ALT), 83, 547-571. Open Access