## LPSM PhD Students' Working Group (Groupe de travail des thésards, GTT)

#### Day, time and place

Mondays at 17:00, Jussieu, Salle Paul Lévy, 16-26 209

#### Contact(s)

gtt [AT] lpsm.paris

#### GTT mailing list

If you would like to receive all information about GTT events, feel free to subscribe to the mailing list. Simply add your name and email at the following address:

### Past sessions

#### Year 2023

Monday 16 January 2023, 17:30, Jussieu, Salle Paul Lévy (209), corridor 16-26 (2nd floor)

**Emilien Bodiot + Ariane Marandon-Carlhian** (LPSM + LPSM) *Discrete Gaussian Markov Processes: Fourier VS Invariant Measure (E. Bodiot) + Machine learning meets false discovery rate (A. Marandon-Carlhian)*

A discrete Gaussian Markov process is a simple statistical physics model. As for most statistical physics problems, one is usually interested in computing the partition function, the free energy, correlation functions, and so on. In the context of discrete Gaussian Markov processes, there are two main methods to perform such computations. The first one, and the most efficient, is to impose periodicity and use the Fourier transform. The second one is to take advantage of the Markov property of the model and look for invariant measures. Both methods solve the same model, but the link between them is not easy to see. The Fourier transform is a global computation that completely hides local Markov properties, whereas the invariant measure is a local computation based on the Markov property of the model. After discussing some details of the two approaches, we will focus on a function that links Fourier with invariant measures in this context.

Ariane MARANDON-CARLHIAN - Machine learning meets false discovery rate.

Novelty/outlier detection is the problem of identifying observations that do not conform to a well-defined notion of normal behavior. More formally, the problem we consider is as follows: we have at hand a sample of observations Y_1, …, Y_n, each with a common unknown distribution P_0, which we call nominal observations, and a sample of observations X_1, …, X_m that may contain both nominals, i.e. observations marginally distributed according to P_0, and novelties otherwise. The aim is to identify the novelties. Specifically, we want to control the false discovery rate (FDR), defined as the proportion of detections that are false (true nominals declared as novelties), at some fixed error margin, say 10%, while maximizing the number of detections under this constraint.

In this setting, Conformal Anomaly Detection (CAD; Bates et al., 2022, AOS) is a breakthrough novelty detection technique that guarantees finite-sample control of the FDR (for any error margin fixed beforehand). In this talk, I will mainly present CAD. Then, I will speak a bit about our contribution: we improve upon CAD by learning from the data at hand; we show in our paper that the FDR control is retained and prove power results.
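As an illustration of the kind of procedure described above, here is a minimal sketch of the generic recipe "conformal p-values followed by the Benjamini-Hochberg step-up procedure". It is not necessarily the exact method presented in the talk. The score convention (a larger score means a more atypical observation), the function names and the toy scores are all assumptions made for this example.

```python
from bisect import bisect_left


def conformal_pvalues(calibration_scores, test_scores):
    """Conformal p-value of each test score against the nominal scores.

    Assumed convention: a larger score means a more atypical observation.
    p_j = (1 + #{i : s(Y_i) >= s(X_j)}) / (n + 1)
    """
    cal = sorted(calibration_scores)
    n = len(cal)
    pvalues = []
    for s in test_scores:
        n_greater_equal = n - bisect_left(cal, s)  # nominal scores >= s
        pvalues.append((1 + n_greater_equal) / (n + 1))
    return pvalues


def benjamini_hochberg(pvalues, alpha):
    """Indices rejected by the Benjamini-Hochberg procedure at level alpha."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda j: pvalues[j])
    n_rejected = 0
    for rank, j in enumerate(order, start=1):
        if pvalues[j] <= alpha * rank / m:
            n_rejected = rank  # keep the largest rank passing its threshold
    return sorted(order[:n_rejected])


# Toy usage with made-up scores (hypothetical data, for illustration only).
nominal_scores = [0.1, 0.4, 0.2, 0.3, 0.5, 0.15, 0.35, 0.25, 0.45]
new_scores = [0.2, 3.0, 0.3, 2.5]
pvals = conformal_pvalues(nominal_scores, new_scores)
detections = benjamini_hochberg(pvals, alpha=0.25)
print(pvals, detections)
```

With only n nominal scores, the smallest attainable p-value is 1/(n + 1), so a small calibration sample limits how many novelties can be detected at a strict error margin.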

Organizers: Alexis Ayme, Loïc Bethencourt, Nicolas Bouchot, Pierre Marion, Miguel Martinez Herrera and Antonio Ocello

Monday 9 January 2023, 17:30, Jussieu, Salle Paul Lévy (209), corridor 16-26 (2nd floor)

**Elisa Ndiaye + Linus Bleistein** (CIRED/CMAP + INRIA) *Quantification of the impact of climate risks on credit risk (E. Ndiaye) + Learning the dynamics of sparsely observed interacting systems for real-time prediction (L. Bleistein)*

Since the industrial revolution, human-made greenhouse gas (GHG) emissions have kept increasing. As the greenhouse effect gets stronger, climate change becomes more and more real, with a surge in physical risk (floods, wildfires, extreme temperatures) threatening human lives and economies. Due to the higher frequency and severity of physical damages, regulators ought to step in and take measures aimed at significantly reducing GHG emissions. Whereas the targets are disclosed, the number of possible pathways is endless; yet they will all require a profound structural change of the economy, also known as the energy transition, which poses risks for the financial world through different channels. These risks, commonly referred to as transition risks, will have an impact on traditional financial risks, in particular credit risk. The addition of this climate dimension creates new modelling challenges in the assessment of forward-looking credit risk. This demands a micro-economic simulation of counterparties' financials, conditional on a macro-economic climate scenario, but also a downscaling of the scenario in order to meet the granularity used, which will generate uncertainty that will also be quantified.

Linus BLEISTEIN - Learning the dynamics of sparsely observed interacting systems for real-time prediction

We address the problem of learning the dynamics of an unknown non-parametric system linking a target and a feature time series. The feature time series is measured on a sparse and irregular grid, while we have access to only a few points of the target time series. Once learned, we can use these dynamics to perform real-time prediction of the target from the previous values of the feature time series. We frame this task as learning the solution map of a controlled differential equation (CDE). By leveraging the rich theory of signatures, we are able to cast this non-linear problem as a high-dimensional linear regression. We provide an oracle bound on the prediction error which exhibits explicit dependencies on the individual-specific sampling schemes. Our theoretical results are illustrated by simulations which show that our method outperforms existing algorithms for recovering the full time series while being computationally cheap. We conclude by demonstrating its potential on real-world healthcare data.

Organizers: Alexis Ayme, Loïc Bethencourt, Nicolas Bouchot, Pierre Marion, Miguel Martinez Herrera and Antonio Ocello

#### Year 2022

Monday 12 December 2022, 17:30, Jussieu, Salle Paul Lévy (209), corridor 16-26 (2nd floor)

**Nicolas Bouchot + Ludovic Arnould** (LPSM + LPSM) *Polymer in a random poor solvent: edges localization (N. Bouchot) + The Future of Neural Network training (L. Arnould)*

I will present a one dimensional model for a polymer in a poor solvent: the random walk (RW) in dimension 1 penalized by its range in a random environment. We consider a random field $\omega$ consisting of i.i.d. variables $\omega_z$ and we give any path of the RW of length $n$ a weight $\exp(-H_n)$, where $H_n$ is the sum of $h - \beta \omega_z$ over all the sites $z$ that were visited by this RW path. This new law for the path of the random walk is called the polymer measure, and is a random measure over the paths of length $n$. The parameters $h, \beta$ are supposed to be positive, meaning that under the polymer measure, the RW tends to fold itself on an optimal segment given by the random field. I will explain how we can get a rather precise result that states the following: the edges of typical paths under the polymer measure are given by $\omega$ through variational problems involving Brownian-related processes.

Ludovic ARNOULD - The Future of Neural Network training.

Neural networks are among the most popular machine learning models currently used and studied. A typical learning procedure involves the minimisation of an empirical risk via Stochastic Gradient Descent in a supervised setting. In other words, given labeled data (X_i, Y_i) drawn from an unknown distribution P_(X,Y), we fit the network f to the data by minimizing a loss on the data (for instance ||f(X_i) - Y_i||^2), which is done by updating the weights of the network w.r.t. their gradients. The true goal is to make good predictions on unseen data, i.e. to minimize the inaccessible risk E[||f(X) - Y||^2] (the empirical risk being an accessible proxy for this quantity).

Even without much refinement, it is possible to obtain in this way a trained NN that perfectly fits the training data (0 training error) while maintaining a good generalization score: the NN still performs well on new unseen data (X',Y'). However, the mechanisms underpinning this good empirical behavior remain poorly understood theoretically. As a consequence, it remains difficult in practice to design NN architectures and/or training procedures that specifically enhance the generalization power. The PAC-Bayes framework provides a powerful tool to analyze the “generalization gap”, the discrepancy between the theoretical risk and its empirical counterpart. In this talk, we will thus study a practical training scheme inspired by the PAC-Bayes theory to improve the generalization capacities of any NN in a supervised setting.

Organizers: Alexis Ayme, Loïc Bethencourt, Nicolas Bouchot, Pierre Marion, Miguel Martinez Herrera and Antonio Ocello

Monday 5 December 2022, 17:30, Jussieu, Salle Paul Lévy (209), corridor 16-26 (2nd floor)

**Lorenzo Croissant + Grégoire Szymanski** (CEREMADE + Polytechnique) *Diffusion limit control of high-frequency pure-jump processes (L. Croissant) + Rough volatility and optimal estimation of the Hurst parameter (G. Szymanski)*

Pure jump processes with a large number of jumps per unit of time are frequently encountered in modern industrial systems (e.g. financial markets, online advertising auctions, server scheduling…). These applications require sequential decision making, which is mathematically formalized as a control problem. Unfortunately, the non-local nature of jumps leads to analytical difficulties and to computations that are infeasible in practice. Leveraging the diffusion limit regime, which avoids these difficulties, we present in this talk a characterization of the convergence to this limit, and some interesting perspectives that it opens.

Grégoire SZYMANSKI - Rough volatility and optimal estimation of the Hurst parameter

The goal of this talk is to present the estimation of the Hurst parameter in rough stochastic volatility models. First, we will recall what rough volatility is and why it is used in finance. In these models, we observe a stochastic diffusion driven by a Brownian motion. The volatility is a hidden fractional Brownian motion. Moreover, the diffusion is only observed at discrete times. We will explain how this problem relates to the somewhat easier problem of observing the underlying fractional Brownian motion at discrete times, polluted by an additive noise, in the same spirit as [3]. Considering the observations $\eta W^H_{i/n} + \varepsilon^n_i$, we build an estimator based on quadratic variations with convergence rate $n^{1/(4H+2)}$, as in [3]. We also prove that this rate is minimax, using an adequate wavelet-based construction of the fractional Brownian motion in the second case, see [4]. Finally, we will discuss how this estimator can be generalised to the multiplicative noise model and to a non-parametric setting as in [1].

References:

[1] Chong, C.; Hoffmann, M.; Liu, Y.; Rosenbaum, M.; Szymanski, G. (2022) - Statistical inference for rough volatility: Central limit theorems - arXiv:2210.01216

[2] Chong, C.; Hoffmann, M.; Liu, Y.; Rosenbaum, M.; Szymanski, G. (2022) - Statistical inference for rough volatility: Minimax theory - arXiv:2210.01214

[3] Fukasawa, M.; Takabatake, T.; Westphal, R. (2019) - Is Volatility Rough? - arXiv:1905.04852

[4] Szymanski, G. (2022) - Optimal estimation of the rough Hurst parameter in additive noise - arXiv:2205.13035

Monday 21 November 2022, 17:30, Jussieu, Salle Paul Lévy (209), corridor 16-26 (2nd floor)

**Antonio Ocello + Yazid Janati El Idrissi** (LPSM + LPSM & Télécom SudParis) *Relaxed formulation for the control of branching diffusions: Existence of an optimal control (A. Ocello) + Iterative schemes for divergence minimization (Y. Janati El Idrissi)*

We study the existence of optimal controls for branching diffusion processes. The problem considered uses rewards that can be nonlinear in the final payoff and linear in the running payout. We give a relaxed formulation, showing its equivalence with the strong problem and proving the existence of optimal controls. Using the dynamic programming principle, we prove that the maximizer can be found in the class of Markovian controls.

Yazid JANATI EL IDRISSI - Iterative schemes for divergence minimization

Traditional Variational Inference proceeds by minimizing the exclusive Kullback-Leibler divergence between the variational approximation and the target distribution. It is now widely admitted that the practical implementation of this procedure yields variational approximations that have lighter tails and fewer modes than the target distribution. In this talk I will review the recent progress made in this subfield by focusing on the methods that aim at optimizing other divergences like the inclusive KL and (Rényi) alpha-divergence. Then, I will introduce a novel iterative scheme that decreases the inclusive KL geometrically fast. Its practical implementation will be discussed and supported by numerical experiments involving the computation of normalizing constants.

Monday 7 November 2022, 17:30, Jussieu, Salle Paul Lévy (209), corridor 16-26 (2nd floor)

**Tristan Pham-Mariotti + Miguel Martinez Herrera** (LPSM) *Introduction to the Probabilistic Method (T. Pham-Mariotti) + Inference and tests for Multivariate Hawkes processes with Inhibition, application to neuroscience (M. Martinez Herrera)*

To show that there exists an element of a set that satisfies certain properties, just pick an element at random in the set and prove that, with positive probability, it meets the requirements. This simple idea, called the Probabilistic Method, turns out to be an extremely powerful tool, as noticed by Paul Erdős in the 1950s. I hope to illustrate through examples in different fields (combinatorics, geometry, etc.) the beauty of this idea, and to convince you to put it in your mathematical toolbox.
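A classical illustration of this idea (not necessarily one of the examples of the talk) is Erdős's lower bound on the diagonal Ramsey numbers R(k, k). Colour each edge of the complete graph on n vertices red or blue uniformly at random: the expected number of monochromatic cliques of size k is C(n, k) · 2^(1 − C(k, 2)). If this expectation is below 1, some colouring has no monochromatic clique of size k, hence R(k, k) > n. The short sketch below computes the largest n certified by this argument; the function name is chosen for this example.

```python
from math import comb


def erdos_ramsey_lower_bound(k):
    """Largest n with C(n, k) * 2**(1 - C(k, 2)) < 1, which implies R(k, k) > n."""
    if k < 3:
        raise ValueError("the counting argument is only informative for k >= 3")
    n = k
    # Exact integer form of the inequality: 2 * C(n, k) < 2**C(k, 2).
    while 2 * comb(n + 1, k) < 2 ** comb(k, 2):
        n += 1
    return n


for k in (3, 4, 5, 10):
    print(k, erdos_ramsey_lower_bound(k))
```

The argument is non-constructive: it proves that a good colouring exists without exhibiting one.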

Miguel MARTINEZ HERRERA - Inference and tests for Multivariate Hawkes processes with Inhibition, application to neuroscience.

The Hawkes process is a past-dependent point process used to model the relationship of event occurrences between different phenomena. Since its appearance, it has been widely used in various fields such as finance, criminology and neuroscience. The Hawkes process was originally introduced to describe excitation interactions, which means that one event increases the chances of another occurring. However, there has been a growing interest in the opposite effect, known as inhibition. In this talk we propose a Maximum Likelihood estimation method for multivariate Hawkes processes with exponential kernel that can handle both exciting and inhibiting interactions. Parametric estimation methods in the literature are mostly adapted to non-negative interactions and most methods proposed for the inhibiting case are restricted to non-parametric frameworks. We show that the proposed estimator performs better for synthetic data than alternative methods. We also illustrate its application to a neuronal activation dataset.

Monday 24 October 2022, 17:30, Jussieu, Salle Paul Lévy (209), corridor 16-26 (2nd floor)

**Yoan Tardy + Clément Mantoux** (LPSM + CMAP (Polytechnique)) *Post explosion model for the supercritical Keller-Segel particle system (Y. Tardy) + Modeling Longitudinal Data with Ruptures: Application to Parkinson's Disease (C. Mantoux)*

The Keller-Segel particle system consists of N Brownian motions in the plane interacting through a Coulombian force \theta/|x|², where x is the distance between two particles. Like its deterministic version, this system is subject to a phase transition: if the intensity of the attraction \theta is less than 2, we expect only reflective collisions between particles, and if the intensity is greater than or equal to 2, we expect sticky collisions to occur, meaning that once a collision occurs, the particles are glued together forever; we call this explosion. Although this behaviour is well understood before the sticky collision (see “Collisions of the supercritical Keller-Segel particle system” by Nicolas Fournier and Y. T.), the particle system is not well defined after that time, and we will propose a post-explosion model following the idea of “Stochastic particle approximation of the Keller-Segel equation and two-dimensional generalization of Bessel processes” by Nicolas Fournier and Benjamin Jourdain.

Modeling Longitudinal Data with Ruptures: Application to Parkinson's Disease

Can we model the progression of neurodegenerative diseases? In longitudinal studies, we follow the development of disease symptoms in a cohort of subjects across time. This development is often decomposed into distinct progression phases, corresponding, e.g., to a degradation or to the impact of a treatment. Understanding the pace of the disease progression and its different phases is a key step in the design of monitoring systems for neurodegenerative diseases. In this talk, we present a statistical model for longitudinal trajectories with ruptures. We show that the most appropriate number of ruptures can be selected robustly in spite of strong noise and a large proportion of missing data. Finally, we use our model on cohorts of subjects affected by Parkinson's disease, and show how it helps to understand the variability in the observed data.

Organizers: Alexis Ayme, Loïc Bethencourt, Nicolas Bouchot, Pierre Marion, Miguel Martinez Herrera and Antonio Ocello

Monday 10 October 2022, 17:30, Jussieu, Salle Paul Lévy (209), corridor 16-26 (2nd floor)

**The GTT team** *GTT start-of-year day*

Monday 4 July 2022, 17:00, Jussieu, Salle Paul Lévy (209), corridor 16-26 (2nd floor)

**Robin Khanfir + David Lee** (LPSM) *Scaling limit of critical branching random walks on a tree (Robin KHANFIR) + The Harnack inequality for the fractional Laplacian (David LEE)*

A branching random walk can be seen as a population which reproduces while exploring an environment in a Markovian way. Here, we restrict ourselves to the case where the genealogy is described by a critical Galton-Watson tree conditioned on having $n$ vertices and where the environment is a $d$-ary tree. Moreover, the transitions are such that any child has probability $1/2$ to be on the vertex just below its parent's position and probability $1/(2d)$ to be on each of the $d$ vertices just above its parent's position. Our object of study is the subtree $R_d(n)$ explored by the population. We show that $R_d(n)$ admits a scaling limit in distribution when $n$ goes to infinity. The limit object is a random compact continuum tree called the Brownian cactus, which was introduced by N. Curien, J-F. Le Gall, and G. Miermont. (Joint work with Thomas Duquesne, Shen Lin, and Niccolò Torri.)

The Harnack inequality for the fractional Laplacian (David LEE)

In this talk, I will present the fractional Laplacian and raise the question of the existence of a Harnack inequality.

Organizers: Ludovic Arnould, Jérôme Carrand, Lucas Ducrot, Robin Khanfir, Ariane Marandon-Carlhian, Antonio Ocello

Monday 20 June 2022, 17:00, Jussieu, Salle Paul Lévy (209), corridor 16-26 (2nd floor)

**Ariane Marandon + Armand Bernou** (LPSM) *False clustering rate control in mixture models (Ariane MARANDON) + Beyond the mean-field limit for the McKean-Vlasov particle system: Uniform in time estimates for the cumulants (Armand BERNOU)*

The clustering task consists in delivering labels to the members of a sample. For most data sets, some individuals are ambiguous and intrinsically difficult to attribute to one or another cluster. However, in practical applications, misclassifying individuals is potentially disastrous. To overcome this difficulty, the idea followed here is to classify only a part of the sample in order to obtain a small misclassification rate. This approach is well known in the supervised setting, and referred to as classification with an abstention option. The purpose of this paper is to revisit this approach in an unsupervised mixture-model framework. The problem is formalized in terms of controlling the false clustering rate (FCR) below a prescribed level α, while maximizing the number of classified items. New procedures are introduced and their behavior is shown to be close to the optimal one by establishing theoretical results and conducting numerical experiments.

Beyond the mean-field limit for the McKean-Vlasov particle system: Uniform in time estimates for the cumulants (Armand BERNOU)

The study of the convergence to the mean-field limits of the empirical distribution of particle systems is a well-traveled topic, with many approaches coming from both analysis and probability. One of the key questions is the one of propagation of chaos: roughly, the goal is to control an error saying “how far from independence” the system is to justify the limit equation. In this talk, in a nice, smooth setting, I will explain how to go further: if we keep the leading term of this error between the particle system and the independent one to get a more precise description of the system, we can control the new (smaller) error. The method can be iterated, in the sense that we can also keep track of its leading term and control the corresponding error and so on. Along the way, I will try to give some insights about our main tools: derivatives with respect to the measure and Glauber calculus. A very important aspect of this analysis is that our estimates for those errors are uniform in time. This is joint (and ongoing) work with Mitia Duerinckx (FNRS).

Organizers: Ludovic Arnould, Jérôme Carrand, Lucas Ducrot, Robin Khanfir, Ariane Marandon-Carlhian, Antonio Ocello

Monday 23 May 2022, 17:00, Jussieu, Salle Paul Lévy (209), corridor 16-26 (2nd floor)

**Pierre Bras + Alexis Ayme** (LPSM) *Stochastic gradient descent and Langevin-simulated annealing algorithms (Pierre BRAS) + Minimax rate of consistency for linear models with missing values (Alexis AYME)*

I will show how minimization problems arising for example in machine learning can be solved using stochastic gradient descent. For non-convex optimization problems, I will introduce variants of this algorithm in order to improve the optimization procedure. Langevin algorithms consist in adding white noise to the gradient descent, hoping to escape local (but not global) minima. In its simulated annealing version, the noise is gradually decreased to zero to make the algorithm asymptotically converge, sharing its heuristic with the original simulated annealing algorithm.
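The idea of gradient descent with gradually decreasing noise can be sketched in a few lines. This is only a minimal one-dimensional illustration under stated assumptions, not the algorithm analysed in the talk: the noise schedule `noise0 / sqrt(log(k + e))`, the double-well objective and all parameter values are choices made for this example.

```python
import math
import random


def langevin_annealing(grad, x0, steps, step_size, noise0, seed=0):
    """Gradient descent perturbed by Gaussian noise whose scale slowly decays to zero.

    Update: x <- x - step_size * grad(x) + sigma_k * sqrt(step_size) * N(0, 1),
    with the assumed schedule sigma_k = noise0 / sqrt(log(k + e)).
    """
    rng = random.Random(seed)
    x = x0
    for k in range(steps):
        sigma = noise0 / math.sqrt(math.log(k + math.e))
        noise = sigma * math.sqrt(step_size) * rng.gauss(0.0, 1.0)
        x = x - step_size * grad(x) + noise
    return x


def double_well_grad(x):
    """Gradient of f(x) = (x**2 - 1)**2 + 0.3 * x (hypothetical test objective)."""
    return 4.0 * x * (x * x - 1.0) + 0.3


# Start in the basin of the local minimum near x = 1; the global one is near x = -1.
# Plain gradient descent (noise0 = 0) stays in the starting basin, while the noisy
# version has a chance to cross the barrier, depending on the seed and the schedule.
print(langevin_annealing(double_well_grad, 1.0, 5000, 0.01, 0.0))
print(langevin_annealing(double_well_grad, 1.0, 5000, 0.01, 1.0, seed=1))
```

Setting `noise0 = 0` recovers plain gradient descent, which makes it easy to compare the two behaviours on the same objective.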

Minimax rate of consistency for linear models with missing values (Alexis AYME)

Missing values arise in most real-world data sets due to the aggregation of multiple sources and intrinsically missing information (sensor failure, unanswered questions in surveys…). In fact, the very nature of missing values usually prevents us from running standard learning algorithms. In this paper, we focus on the extensively studied linear models, but in the presence of missing values, which turns out to be quite a challenging task. Indeed, the Bayes rule can be decomposed as a sum of predictors corresponding to each missing pattern. This eventually requires solving a number of learning tasks that is exponential in the number of input features, which makes predictions impossible for current real-world datasets. First, we propose a rigorous setting to analyze a least-squares-type estimator and establish a bound on the excess risk which increases exponentially in the dimension. Consequently, we leverage the missing data distribution to propose a new algorithm, and derive associated adaptive risk bounds that turn out to be minimax optimal.

Organizers: Ludovic Arnould, Jérôme Carrand, Lucas Ducrot, Robin Khanfir, Ariane Marandon-Carlhian, Antonio Ocello

Monday 9 May 2022, 17:00, Jussieu, Salle Paul Lévy (209), corridor 16-26 (2nd floor)

**Tristan Pham-Mariotti + Nicklas Werge** (LPSM) *Boundary correlations for the Z-invariant Ising Model (Tristan PHAM-MARIOTTI) + Learning from time-dependent streaming data with stochastic algorithms (Nicklas WERGE)*

In statistical mechanics, the aim is to understand the macroscopic properties of a system knowing the microscopic forces between the components. Computations for a large number of particles are in general quite hard, but it turns out that interesting properties can sometimes be expressed in terms of matrices, making the computations feasible. In this talk, I will explain some of these links and mainly focus on the one between boundary correlations in an Ising model and a special set of matrices called the positive orthogonal Grassmannian.

Learning from time-dependent streaming data with stochastic algorithms (Nicklas WERGE)

Many machine learning problems can be written as a minimization problem, a task that stochastic optimization methods can solve. As more and more data becomes available, optimization methods need to address high-dimensional problems with low computational costs; therefore, in recent years, first-order optimization methods have become prevalent in the literature. A common feature of these procedures is slow convergence, which motivates a great interest in accelerating existing algorithms. I will give a brief insight into stochastic optimization for high-dimensional problems in a streaming framework with time-dependent data.

Monday 25 April 2022, 17:30, Jussieu, Salle Paul Lévy (209), corridor 16-26 (2nd floor)

**Apolline Louvet + Ludovic Arnould** (LPSM) *Limit of a Wright-Fisher model with ghosts (Apolline LOUVET) + Generalisation Properties of Interpolating Estimators and Random Forests (Ludovic ARNOULD)*

I will introduce a variant of the Wright-Fisher model featuring frequent local extinction events, which result in strong fluctuations in population sizes. This model is characterized by the use of “ghost individuals” to fill empty areas, in the spirit of the contact process from interacting particle systems theory. I will show that this model can be used to study genetic diversity in populations expanding in a fragmented and disturbed environment, and find a necessary condition on extinction probabilities for an expansion to be possible. This condition is derived from a coupling with an oriented percolation process, to which our variant of the Wright-Fisher model converges in the limiting regime considered here. This work was motivated by an ongoing project with the Muséum National d'Histoire Naturelle, which focuses on understanding to what extent urban tree bases can act as ecological corridors.

Generalisation Properties of Interpolating Estimators and Random Forests (Ludovic ARNOULD)

We will have a glance at the behavior of a few machine learning estimators in the interpolating regime, i.e. when the estimator is learned by perfectly fitting the data (0 train error). In practice, this regime yields very good generalisation performance (good predictions on unseen data) despite the old statistical belief that interpolation should induce over-fitting. This regime has yet to be theoretically explored in order to understand precisely how good generalisation can still occur in the interpolation regime and whether interpolation is a key property to explain the success of these estimators. We will finally focus on the case of Random Forest estimators.

Monday 11 April 2022, 17:30, Jussieu, Salle Paul Lévy (209), corridor 16-26 (2nd floor)

**Antonio Ocello + Victor Roca Lucio** (LPSM) *Stochastic Target Problem (for Branching Diffusions) (Antonio OCELLO) + An algebraic ballad (Victor ROCA LUCIO)*

“This is a story about control. My control”. And more specifically an optimal stochastic target control problem. This problem consists in finding the minimal condition for which a control allows the underlying process to reach a target set at a finite terminal time. We then apply it to branching diffusions, with a special target for each of its branches. We first state a dynamic programming principle for the value function of the stochastic target problem. We then show that the value function can be characterised as the unique viscosity solution to a Hamilton-Jacobi-Bellman variational inequality.

An algebraic ballad (Victor ROCA LUCIO)

This talk is a gentle introduction to the world of algebraic structures and their relationship to probabilities. Via simple examples, we will introduce the notion of an operad and explain how these objects can encode common algebraic structures. Finally, we will show how many Hopf algebras appearing in renormalization can be constructed using operads, after the work of van der Laan and Moerdijk.

Monday 21 March 2022, 17:00, Jussieu, Salle Paul Lévy (209), corridor 16-26 (2nd floor)

**Kimia Nadjahi + Jean-David Jacques** (LPSM) *Fast Approximation of the Sliced-Wasserstein Distance Using Concentration of Random Projections (Kimia NADJAHI) + B-series over the set of compactly supported multi-indices, (S)PDEs, and pre-Lie algebras (Jean-David JACQUES)*

The Sliced-Wasserstein distance (SW) is being increasingly used in machine learning applications as an alternative to the Wasserstein distance and offers significant computational and statistical benefits. Since it is defined as an expectation over random projections, SW is commonly approximated by Monte Carlo. We adopt a new perspective to approximate SW by making use of the concentration of measure phenomenon: under mild assumptions, one-dimensional projections of a high-dimensional random vector are approximately Gaussian. Based on this observation, we develop a simple deterministic approximation for SW. Our method does not require sampling a number of random projections, and is therefore both accurate and easy to use compared to the usual Monte Carlo approximation. We derive non-asymptotic guarantees for our approach, and show that the approximation error goes to zero as the dimension increases, under a weak dependence condition on the data distribution. We validate our theoretical findings on synthetic datasets, and illustrate the proposed approximation on the problem of image generation.
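For reference, the usual Monte Carlo approximation mentioned above can be sketched as follows. This is the baseline estimator, not the deterministic approximation proposed in the talk. The sketch assumes two samples of equal size (so that the one-dimensional Wasserstein distance reduces to matching sorted projections) and estimates the squared SW distance of order 2; the function name and default parameters are chosen for this example.

```python
import math
import random


def sliced_wasserstein2_mc(xs, ys, n_projections=200, seed=0):
    """Monte Carlo estimate of SW_2^2 between two samples of equal size.

    xs, ys: lists of points, each point being a list of d coordinates.
    """
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes samples of equal size")
    rng = random.Random(seed)
    d = len(xs[0])
    total = 0.0
    for _ in range(n_projections):
        # Uniform direction on the unit sphere: normalised Gaussian vector.
        g = [rng.gauss(0.0, 1.0) for _ in range(d)]
        norm = math.sqrt(sum(c * c for c in g))
        theta = [c / norm for c in g]
        # Project both samples on theta and sort the projections.
        px = sorted(sum(t * c for t, c in zip(theta, x)) for x in xs)
        py = sorted(sum(t * c for t, c in zip(theta, y)) for y in ys)
        # Squared 1D Wasserstein-2 distance between the two empirical measures.
        total += sum((a - b) ** 2 for a, b in zip(px, py)) / len(px)
    return total / n_projections


# Toy usage: a Gaussian sample in dimension 3 and a copy shifted by (1, 0, 0).
rng = random.Random(42)
sample = [[rng.gauss(0.0, 1.0) for _ in range(3)] for _ in range(100)]
shifted = [[p[0] + 1.0, p[1], p[2]] for p in sample]
print(sliced_wasserstein2_mc(sample, shifted))
```

For a pure translation by a vector v, each projection contributes (θ·v)², so the estimate fluctuates around |v|²/d; the spread across seeds gives a feel for the Monte Carlo error that a deterministic approximation avoids.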

B-series over the set of compactly supported multi-indices, (S)PDEs, and pre-Lie algebras (Jean-David JACQUES)

In this talk, I will give a short overview of some recent developments in the field of renormalization of SPDEs via regularity structures. To that aim, I will introduce the category of pre-Lie algebras and give a few concrete examples.

Monday 7 March 2022, 17:00, Jussieu, Salle Paul Lévy (209), corridor 16-26 (2nd floor)

**Pierre Le Bris + Marine Demangeot** (LPSM) *Uniform in time propagation of chaos for the generalized Dyson Brownian motion and 1D Riesz gases (Pierre LE BRIS) + Continuous simulation of storm processes (Marine DEMANGEOT)*

We consider the case of a one-dimensional N-particle system in mean-field interaction, with a singular and repulsive interaction, and we wish to understand the limit, as N goes to infinity, of the empirical measure of the system. After an introduction containing the motivation and some “usual” methods used to tackle this sort of problem, we describe a rather short proof that only relies on the well-posedness of the SDE, and that in particular requires no study of the nonlinear limit PDE.

Continuous simulation of storm processes (Marine DEMANGEOT)

Spatial extreme value theory helps model and predict the frequency of extreme events in a spatial context, for instance extreme precipitation, extreme temperatures or high concentrations of pollution in the air. In this presentation, we focus on storm processes, which constitute prototype models for spatial extremes. They are classically simulated on a finite number of points within a given domain. We propose a new algorithm that makes it possible to perform such a task everywhere, not just anywhere, in continuous domains like hyperrectangles or balls, in arbitrary dimension. This consists in generating basic ingredients that can subsequently be used to assign a value at any and every point of the simulation field. Therefore, the resolution of a single simulation can be refined indefinitely; this is particularly appropriate for investigating the geometrical properties of storm processes. Particular attention is paid to efficiency: by introducing and exploiting the notion of domain of influence of each storm, the running time is considerably reduced. Besides, most parts of the algorithm are designed to be parallelizable.

Lundi 21 février 2022, 17 heures, Jussieu salle Paul Lévy (209) couloir 16-26 (2ème étage)

**Matthieu Dolbeault + Camila Fernandez** (LPSM) *Random sampling for weighted least-squares approximation (Matthieu DOLBEAULT) + Time-to-Event Analysis (Camila FERNANDEZ)*

We investigate the problem of approximating a function in L^2 with a polynomial of degree N, using only evaluations at M chosen points, with M of the order of N. A first approach, based on weighted least-squares at i.i.d. random points, provides a near-best approximation thanks to a matrix concentration inequality, but requires M of order N log(N). To reduce the sample size while preserving the quality of approximation, we will need a recent result on sums of rank-one matrices, which answers a conjecture from quantum physics dating back to 1959.
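
As a rough illustration of weighted least-squares at i.i.d. random points, the sketch below fits a Legendre expansion on [-1, 1]. It makes several assumptions that are not taken from the talk: points are drawn from the arcsine (Chebyshev) density as a convenient stand-in for an optimal sampling measure, the weights are the usual importance weights with respect to the uniform measure, and the name `weighted_least_squares_fit` is hypothetical.

```python
import numpy as np
from numpy.polynomial import legendre

def weighted_least_squares_fit(f, n_basis, n_samples, seed=0):
    """Legendre coefficients of a weighted least-squares fit of f on [-1, 1]."""
    rng = np.random.default_rng(seed)
    # Sample from the arcsine density 1 / (pi * sqrt(1 - x^2)) on [-1, 1].
    x = np.cos(np.pi * rng.random(n_samples))
    # Importance weights: uniform density (1/2) over the sampling density.
    w = 0.5 * np.pi * np.sqrt(1.0 - x ** 2)
    # Design matrix in the Legendre basis P_0, ..., P_{n_basis - 1}.
    design = legendre.legvander(x, n_basis - 1)
    sqrt_w = np.sqrt(w)
    # Solve min_c sum_i w_i * (f(x_i) - sum_j c_j P_j(x_i))^2.
    coeffs, *_ = np.linalg.lstsq(design * sqrt_w[:, None], f(x) * sqrt_w, rcond=None)
    return coeffs

# Usage: approximate exp on [-1, 1] with 8 basis functions and 40 samples.
coeffs = weighted_least_squares_fit(np.exp, n_basis=8, n_samples=40)
grid = np.linspace(-1.0, 1.0, 5)
print(np.max(np.abs(legendre.legval(grid, coeffs) - np.exp(grid))))
```

The weights make the discrete least-squares problem an unbiased surrogate for the L^2 projection with respect to the uniform measure, whatever the sampling density.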

Time-to-Event Analysis (Camila FERNANDEZ)

Time-to-event analysis is a branch of statistics that has grown in popularity over the last decades due to its many application fields, such as predictive maintenance, customer churn prediction, population lifetime estimation, etc. In this presentation, we review and compare the performance of widely used prediction methods for time-to-event analysis. These consist of semi-parametric and parametric statistical models and machine learning approaches. The comparison is carried out on three different datasets and using two different scores (the integrated Brier score and the concordance index). Moreover, we show how aggregation processes, which surprisingly have not yet been much studied in time-to-event analysis, can improve the prediction accuracy. Finally, we present simulation results to complete the comparison between the two scores while varying the number of samples and the percentage of censored data to show their impact.
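
To make one of the two scores concrete, here is a minimal sketch of the concordance index in Harrell's form. It assumes right-censored data, counts ties in the risk score as one half, and ignores ties in time; the function name and argument layout are illustrative, not taken from the talk.

```python
def concordance_index(times, events, risk_scores):
    """Harrell's C-index. events[i] is 1 if the event was observed and 0 if
    censored; a higher risk score should mean an earlier event."""
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            # A censored subject cannot be the earlier event of a pair.
            continue
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable

# Usage: the third subject is censored at time 3.
print(concordance_index([1, 2, 3, 4], [1, 1, 0, 1], [0.9, 0.4, 0.5, 0.1]))
```

A value of 1 means the scores rank every comparable pair correctly, and 0.5 corresponds to an uninformative score.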

Lundi 7 février 2022, 17 heures, Jussieu salle Paul Lévy (209) couloir 16-26 (2ème étage)

**Emilien Bodiot + Luca Brusa** (LPSM) *Two-dimensional Gaussian Markov Processes, an operadic approach (Emilien BODIOT) + A tempered Expectation-Maximization algorithm for discrete latent variable models (Luca BRUSA)*

When working with one-dimensional Markov Processes, right and left eigenvectors of the transfer matrix provide a good notion of invariant boundary conditions. Unfortunately, from what we can tell, such a tool does not exist for two-dimensional Markov Processes. Recent work by Damien Simon fills this gap using a higher algebraic approach based on the theory of Operads. In this talk, we will present the algebraic objects that arise when considering this new formalism and how they define invariant boundary conditions. We will discuss the case of Gaussian Markov processes.

A tempered Expectation-Maximization algorithm for discrete latent variable models (Luca BRUSA)

Although maximum likelihood estimation of many discrete latent variable models can be performed using the Expectation-Maximization (EM) algorithm, a well-known drawback of this estimation method is related to the multimodality of the log-likelihood function. The consequence is that the estimation algorithm could converge to one of the local maxima, not corresponding to the global maximum. We propose a Tempered EM (T-EM) algorithm, which is able to explore the parameter space adequately. It consists in rescaling the objective function depending on a parameter known as temperature, which controls the prominence of global and local maxima. By properly tuning the sequence of temperature values, the target function is gradually attracted toward the global maximum, escaping local sub-optimal solutions. We rely on an accurate Monte Carlo simulation study to compare the proposal with the standard EM algorithm, evaluating both the ability to hit the global maximum and the computational time of the proposed algorithm. We conclude that the proposal outperforms the standard EM algorithm, improving the chance to reach the global maximum in the overwhelming majority of considered cases.
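
The idea of tempering can be sketched on a toy model. The code below is one common variant (deterministic-annealing style, where the posterior weights of the E-step are raised to the power 1/T) and is not claimed to be the T-EM algorithm of the talk. It assumes a one-dimensional Gaussian mixture with unit variances; the function name and the temperature schedule are hypothetical.

```python
import numpy as np

def tempered_em(x, mu_init, temperatures):
    """Tempered EM for a 1D Gaussian mixture with unit variances.
    One EM iteration is run per entry of `temperatures`."""
    mu = np.array(mu_init, dtype=float)
    weights = np.full(mu.size, 1.0 / mu.size)
    for temp in temperatures:
        # Log of weight_k * N(x_i | mu_k, 1), up to an additive constant.
        log_joint = np.log(weights) - 0.5 * (x[:, None] - mu[None, :]) ** 2
        # Tempered E-step: divide the log by the temperature, renormalise.
        scaled = log_joint / temp
        scaled -= scaled.max(axis=1, keepdims=True)
        resp = np.exp(scaled)
        resp /= resp.sum(axis=1, keepdims=True)
        # Standard M-step for the means and the mixture weights.
        counts = resp.sum(axis=0)
        mu = (resp * x[:, None]).sum(axis=0) / counts
        weights = counts / x.size
    return mu, weights

# Usage: cool from T = 5 down to T = 1, then finish with plain EM steps.
rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(-3.0, 1.0, 200), rng.normal(3.0, 1.0, 200)])
schedule = np.concatenate([np.linspace(5.0, 1.0, 20), np.ones(50)])
print(tempered_em(data, [0.1, 0.2], schedule))
```

Ending the schedule at temperature 1 matters: the last iterations are then ordinary EM steps, so the output is a stationary point of the original likelihood.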

Lundi 24 janvier 2022, 17 heures, Jussieu salle Paul Lévy (209) couloir 16-26 (2ème étage)

**Nicolas Bouchot + Eddy Ella Mintsa** (LPSM) *Self-attracting polymer in homogeneous and random environment (Nicolas BOUCHOT) + Plug-in classification procedures for diffusion paths (Eddy ELLA MINTSA)*

A polymer can be modelled as a random walk (RW) whose trajectories are weighted by a density e^(H) where H is a Hamiltonian. Studying the partition function (expectation of e^(H) under the RW law) is a way to obtain information about typical configurations for the polymer.

In this talk, I will detail the study of a Hamiltonian that penalises trajectories whose range (set of visited sites) is large (hence self-attracting), leading to a competition between the diffusivity of the RW and this “self-attraction”. I will prove scaling limits for the fluctuations around an optimal range size and for the centre of the range. Afterwards, I will consider the polymer to be in a Gaussian environment and will reward trajectories visiting sites with greater associated Gaussian variables. I will prove scaling limits in probability for the extrema of the range and fluctuations around these optimal positions, holding for almost all realisations of the environment.

Plug-in classification procedures for diffusion paths (Eddy ELLA MINTSA)

Recent advances in modern technology have generated labeled data, recorded at high frequency, that can be modelled as functional data. This work focuses on the multiclass classification problem for functional data modelled by a stochastic differential equation. The drift function depends on the label of the class Y ∈ {1, …, K}, K ∈ N \ {0, 1}. An observation is a solution X = (X_t)_{t ∈ [0,1]} of the following time-homogeneous stochastic differential equation: dX_t = b_Y(X_t) dt + σ(X_t) dW_t, X_0 = 0, t ∈ [0,1], with unknown drift functions (b_k)_{k=1,…,K} and unknown diffusion coefficient σ. Furthermore, we assume that the law p = (p_1, …, p_K) of the label Y is unknown. From a learning sample D_N = (X^i, Y_i)_{i ∈ [1,N]} that consists of N independent copies of the pair (X, Y), our aim is to build an implementable plug-in nonparametric classification procedure and derive upper bounds of its excess risk over Hölder spaces.
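
To fix ideas on the data model, the sketch below simulates labeled paths of such an equation on [0, 1] with an Euler-Maruyama scheme. It only illustrates how a learning sample could be generated and is not the classification procedure of the talk. The function name, the example drifts and the use of labels 0, …, K-1 instead of 1, …, K are assumptions made for the illustration.

```python
import numpy as np

def simulate_labeled_paths(n_paths, n_steps, drifts, sigma, probs, seed=0):
    """Euler-Maruyama simulation of dX_t = b_Y(X_t) dt + sigma(X_t) dW_t on
    [0, 1] with X_0 = 0, where the label Y is drawn from `probs`."""
    rng = np.random.default_rng(seed)
    dt = 1.0 / n_steps
    labels = rng.choice(len(drifts), size=n_paths, p=probs)
    paths = np.zeros((n_paths, n_steps + 1))
    for k in range(n_steps):
        x = paths[:, k]
        drift = np.empty(n_paths)
        # Each path uses the drift function attached to its own label.
        for label, b in enumerate(drifts):
            mask = labels == label
            drift[mask] = b(x[mask])
        noise = rng.standard_normal(n_paths) * np.sqrt(dt)
        paths[:, k + 1] = x + drift * dt + sigma(x) * noise
    return paths, labels

# Usage: two classes with mean-reverting drifts towards +1 and -1.
example_drifts = [lambda x: 1.0 - x, lambda x: -1.0 - x]
paths, labels = simulate_labeled_paths(
    200, 100, example_drifts, lambda x: np.ones_like(x), [0.5, 0.5]
)
print(paths.shape, np.bincount(labels))
```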

Few works have investigated the classification of functional data in the stochastic differential equation framework. In [3], it is done from a parametric point of view with a known diffusion coefficient. Here, we deal with a more challenging problem: the drift function is nonparametric, and the diffusion coefficient and the distribution p = (p_1, …, p_K) are unknown.

Our classification procedure relies on the nonparametric estimation of the functions b_k, k = 1, …, K, and σ^2 by minimizing a least-squares contrast over a spline basis (as in [1] for the estimation of the drift function). We establish the consistency of the resulting empirical classifier as a function of N, the size of the learning sample, and n ∈ N∗, the number of discrete observations for each path. We obtain rates of convergence under mild assumptions. These computations rely on stochastic calculus, in particular on fine estimates of the transition densities. Finally, the obtained empirical classifier is implemented and successfully evaluated on simulated data.

Lundi 10 janvier 2022, 17 heures, Jussieu salle Paul Lévy (209) couloir 16-26 (2ème étage)

**Arthur Blanc-Renaudie + Félix Cheysson** (LPSM) *Stick breaking constructions: study of D-trees and ICRT (Arthur BLANC-RENAUDIE) + Evolution of groups at risk of death from Covid-19 using hospital data (Félix CHEYSSON)*

A tree is a trunk with large branches, small branches, even smaller branches… and (outside of winter) a few leaves. I will present two algorithms, based on this principle, that have theoretical applications to the additive coalescent, Galton–Watson trees, and the configuration model. I will also explain several methods to study those algorithms, which I use to prove scaling limits, bound the heights, prove compactness, and compute some fractal dimensions.

Evolution of groups at risk of death from Covid-19 using hospital data (Félix CHEYSSON)

The Health Data Warehouse (Entrepôt de Santé, EDS) of the AP-HP (Assistance Publique - Hôpitaux de Paris) is collecting and enriching information related to the hospitalisation of patients for whom a positive diagnosis of Covid-19 has been established (via PCR analysis or a lung x-ray). In order to improve patient management, it is essential to identify the risk factors of the disease and to determine whether they evolve with the different waves of the pandemic. In this context, we focus on the estimation of death rates for the groups at higher risk of death from Covid-19, using binary classification trees built from the CART algorithm (Breiman et al., 1984).

To be able to study the temporal evolution of death rates amongst these groups and thus adapt their healthcare, we propose a hypothesis test to compare CART trees and detect changes in the death rates. We show that a bootstrap version of this test has good empirical properties, and illustrate it with numerical experiments and an application to the first wave of the pandemic. Finally, we present some theoretical insight into the distributional properties of the test statistic for our proposed hypothesis test, using results derived from the theory of U-statistics.

#### Année 2021

Mardi 14 décembre 2021, 17 heures, Jussieu salle Paul Lévy (209) couloir 16-26 (2ème étage)

**Thibault Randrianarisoa + Alexandra Lefebvre** (LPSM) *Optional Pólya trees: posterior rates and uncertainty quantification (Thibault Randrianarisoa) + Exact inference in probabilistic graphical models and extensions over polynomials. Application to genetics and segmentation (Alexandra Lefebvre)*

We consider statistical inference in the density estimation model using a tree-based Bayesian approach, with Optional Pólya trees as prior distribution. We derive near-optimal convergence rates for corresponding posterior distributions with respect to the supremum norm. For broad classes of Hölder-smooth densities, we show that the method automatically adapts to the unknown Hölder regularity parameter. We consider the question of uncertainty quantification by providing mathematical guarantees for credible sets from the obtained posterior distributions, leading to near-optimal uncertainty quantification for the density function, as well as related functionals such as the cumulative distribution function. The results are illustrated through a brief simulation study.

Exact inference in probabilistic graphical models and extensions over polynomials. Application to genetics and segmentation (Alexandra Lefebvre)

Probabilistic graphical models play a central role for reasoning in complex systems involving latent variables. They provide a graphical representation of the dependency structure in a joint distribution. In this talk, we start with an introduction to the sum-product algorithm, which exploits the dependency structure to reduce the time complexity of exact inference from exponential in the number of variables to exponential in the treewidth of a triangulation of the graph. Such computations are thus rendered tractable in graphs of reasonable treewidth. We will continue with extensions of the algorithm over polynomial quantities with two illustrations: 1) computing the derivatives of the likelihood in parametric Bayesian networks; 2) relaxing the constraint on the prior distribution of the number of segments in sequence segmentation.
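
The complexity gain of the sum-product algorithm is easiest to see on a chain, the simplest graph with treewidth 1. The sketch below assumes a small hidden Markov model with hypothetical parameters and compares the message-passing recursion with a brute-force sum over all hidden paths; it does not cover the polynomial extensions of the talk.

```python
import itertools
import numpy as np

def chain_likelihood_sum_product(init, trans, emis, obs):
    """Likelihood of `obs` by sum-product on a chain (forward recursion).
    Cost is linear in the chain length."""
    message = init * emis[:, obs[0]]
    for o in obs[1:]:
        # Sum out the previous hidden state, then absorb the new observation.
        message = (message @ trans) * emis[:, o]
    return message.sum()

def chain_likelihood_brute_force(init, trans, emis, obs):
    """Same likelihood by enumerating every hidden path (exponential cost)."""
    total = 0.0
    for states in itertools.product(range(len(init)), repeat=len(obs)):
        p = init[states[0]] * emis[states[0], obs[0]]
        for t in range(1, len(obs)):
            p *= trans[states[t - 1], states[t]] * emis[states[t], obs[t]]
        total += p
    return total

# Usage with two hidden states and two observation symbols.
init = np.array([0.6, 0.4])
trans = np.array([[0.7, 0.3], [0.2, 0.8]])
emis = np.array([[0.9, 0.1], [0.3, 0.7]])
obs = [0, 1, 1, 0, 1]
print(chain_likelihood_sum_product(init, trans, emis, obs))
print(chain_likelihood_brute_force(init, trans, emis, obs))
```

Both functions return the same value; only the first one remains usable when the sequence gets long.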

Mardi 30 novembre 2021, 17 heures, Jussieu salle Paul Lévy (209) couloir 16-26 (2ème étage)

**Iqraa Meah** (LPSM) *Online multiple testing with super-uniformity reward (Iqraa Meah)*

Valid online inference is an important problem in contemporary multiple testing research, to which various solutions have been proposed recently. It is well-known that these methods can suffer from a significant loss of power if the null p-values are conservative. This occurs frequently, for instance whenever discrete tests are performed. To reduce conservatism, we introduce the method of super-uniformity reward (SURE). This approach works by incorporating information about the individual null cumulative distribution functions (or upper bounds of them), which we assume to be available. Our approach yields several new “rewarded” procedures that theoretically control online error criteria based either on the family-wise error rate (FWER) or the marginal false discovery rate (mFDR). We prove that the rewarded procedures uniformly improve upon the non-rewarded ones, and illustrate their performance for simulated and real data.

Lundi 22 novembre 2021, 17 heures, Jussieu salle Paul Lévy (209) couloir 16-26 (2ème étage)

**Antoine Heranval + Lucas Broux** (LPSM) *Application of Generalized Pareto Regression Trees to the cost prediction of floods in France (Antoine HERANVAL) + The Sewing Lemma for all gamma > 0 (Lucas BROUX)*

In this work, we use Generalized Pareto Regression Trees for the cost prediction of floods in France on a real dataset. The aim of this study is to improve the cost prediction of a flood event, shortly after its occurrence, for the entire French market. Indeed, following a natural catastrophe, it can be difficult to evaluate the scale and the cost of an event. To do so, we use GPD Regression Trees to focus on extreme events and to gain further insight into the heterogeneity of the severity of these events. Thanks to a partnership with the French Federation of Insurance (FFA), essentially with one of its dedicated technical bodies, the association of French insurance undertakings for natural risk knowledge and reduction (Mission Risques Naturels, MRN), we have access to a large volume of events. These are all the flood events that have been officially declared natural catastrophes in France over the past 20 years. The cost of these events comes from the claims reported by the insurance companies.

The Sewing Lemma for all gamma > 0 (Lucas BROUX)

Introduced by Massimiliano Gubinelli in 2004, the Sewing Lemma is a fundamental result in the theory of Rough Paths.

In this talk, I will present this lemma in its original form, along with a recently obtained version in the regime 0 < gamma ≤ 1. I will also briefly introduce the theory of Rough Paths and explain how this new version of the Sewing Lemma can provide an elegant solution to the so-called “extension problem” of extending any Hölder path to a Rough Path.

Lundi 15 novembre 2021, 17 heures, Jussieu salle Paul Lévy (209) couloir 16-26 (2ème étage)

**Pierre Marion + Sergi Burniol Clotet** (LPSM) *Framing RNN as a kernel method : A neural ODE approach (Pierre MARION) + Horospheres and horocyclic flows in nonpositive curvature (Sergi BURNIOL CLOTET)*

We study the behavior of a class of neural networks called recurrent neural networks (RNNs). Building on their interpretation as a discretization of a continuous-time neural differential equation, we show, under appropriate conditions, that the solution of an RNN can be viewed as a linear function of a specific feature set of the input, known as the signature. This connection allows us to frame an RNN as a linear method in a suitable Hilbert space. As a consequence, we obtain theoretical guarantees of generalization and stability. Our results are illustrated on simulated datasets.

Horospheres and horocyclic flows in nonpositive curvature (Sergi BURNIOL CLOTET)

I will present geodesic flows on rank 1 nonpositively curved manifolds, which are examples of weakly hyperbolic systems of geometric origin. Then I will explain two results obtained in my thesis which generalize what was known in the strong hyperbolic case. The first is the equidistribution of horospheres under the action of the geodesic flow, and the second is the unique ergodicity of the horocyclic flow of a certain class of surfaces.

Lundi 18 octobre 2021, 17 heures, Jussieu salle Paul Lévy (209) couloir 16-26 (2ème étage)

**Miguel Martinez Herrera + Loïc Bethencourt** (LPSM) *Inference and tests for Multivariate Hawkes processes with Inhibition, application to neuroscience and genomics (Miguel MARTINEZ HERRERA) + Stable limit theorems for additive functionals of one-dimensional diffusion processes (Loïc BETHENCOURT)*

Since its introduction for the study of earthquakes, the Hawkes process has been used to model interactions where the occurrence of an event has a direct impact on the phenomenon itself. The well-studied “excitation” increases the event rate each time we observe a point, whereas the “inhibition”, which is the main interest of this subject, has the opposite effect. The main goal is to obtain a parametric estimation method for multidimensional processes, for which it is important to consider both probabilistic and statistical approaches. Such methods would provide a better understanding of the inner workings of interacting phenomena, such as neural activity networks.

Stable limit theorems for additive functionals of one-dimensional diffusion processes (Loïc BETHENCOURT)

We consider a positive recurrent one-dimensional diffusion process with continuous coefficients and we establish stable central limit theorems for a certain type of additive functionals of this diffusion. In other words, we find some explicit conditions on the additive functional so that its fluctuations behave like some alpha-stable process at large times, for alpha in (0,2].

Lundi 14 juin 2021, 17 heures, Jussieu salle Paul Lévy (209) couloir 16-26 (2ème étage)

**Yoan Tardy + Francesco Bonacina** (LPSM) *Collisions of the supercritical Keller-Segel particle system (Yoan Tardy) + Influenza Decline During COVID-19 Pandemic: a Global Analysis Leveraging Classification and Regression Trees (Francesco Bonacina)*

We study a particle system naturally associated to the 2-dimensional Keller-Segel equation. It consists of N Brownian particles in the plane, interacting through a binary attraction in 1/r, where r stands for the distance between two particles. We will discuss two cases, the subcritical and the supercritical one, which correspond to an attraction factor smaller and greater than 2, respectively. In particular, we will see first that in the subcritical case there are only collisions between pairs of particles, which are not an obstacle to properly defining the solution in the classical sense, and secondly that in the supercritical case there is an explosion due to a collision between several particles, which we will study precisely.

Influenza Decline During COVID-19 Pandemic: a Global Analysis Leveraging Classification and Regression Trees (Francesco Bonacina)

The COVID-19 pandemic has caused a profound shock to the ecology of infectious diseases. In particular, some studies highlighted that the circulation of influenza fell dramatically in specific countries after the emergence of COVID-19. They also pointed out that the phenomenon could be associated with the non-pharmaceutical interventions (NPIs) applied by governments to control the pandemic. Here we address the problem at the global scale by analyzing the FluNet public influenza repository for the periods before (2015-19) and during (2020-21) the COVID-19 pandemic. First, we map the space-time variation of influenza and we find that the percentage of positive tests decreased globally by 98.6%, but with very heterogeneous patterns across countries and seasons. Then, we use Random Forests and Classification And Regression Trees to link the variation of influenza incidence with several covariates such as COVID-19 incidence, strictness of NPIs, change in human mobility, demography, season and geographical region.

Organisateurs : Emilien Bodiot, Lucas Broux, Gloria Buritica, David Lee, Thibault Randrianarisoa, Yoan Tardy

Lundi 31 mai 2021, 17 heures, Jussieu salle Paul Lévy (209) couloir 16-26 (2ème étage)

**Sébastien Farkas + Toyomou Matsuda** (LPSM) *Regression trees algorithms tailored for Generalized Pareto Distribution estimation and application to Cyber risk quantification (Sébastien Farkas) + Parabolic Anderson models with singular potentials (Toyomou Matsuda)*

With the rise of the cyber insurance market, there is a need for better quantification of the economic impact of this risk and its rapid evolution. We particularly focus on severe/extreme claims, by combining Generalized Pareto modeling, justified by Extreme Value Theory, with a regression tree approach. We will introduce the methodology, discuss some hypotheses, comment on simulation results and explore some perspectives.

Parabolic Anderson models with singular potentials (Toyomou Matsuda)

The aim of this talk is twofold. Firstly, we will review some important facts about the parabolic Anderson model (PAM). The PAM is a heat equation with a random potential and it exhibits interesting properties such as intermittency. We will also review Schrödinger operators with random potentials, called Anderson Hamiltonians, focusing on their connections with PAMs.

Secondly, we will discuss PAMs and Anderson Hamiltonians with singular random potentials. In particular, we will discuss my ongoing work with Willem van Zuijlen to construct Anderson Hamiltonians with such singular potentials and to study asymptotics of the corresponding PAMs.

Lundi 17 mai 2021, 17 heures, Jussieu salle Paul Lévy (209) couloir 16-26 (2ème étage)

**Carlo Bellingeri** (TU Berlin) *From Ordinary Differential Equations to Rough Differential Equations (Carlo Bellingeri)*

Starting from deterministic concepts of ordinary differential equations, we will introduce the concepts of rough differential equation “à la Davie” and geometric rough paths. These notions allow us to formulate a stochastic differential equation without having to resort to Itô's calculus and have served as an inspiration for the study of much more sophisticated equations in recent years. In conclusion, we will discuss possible extensions of these concepts to non-commutative probabilities.

Lundi 3 mai 2021, 17 heures, Jussieu salle Paul Lévy (209) couloir 16-26 (2ème étage)

**Henri Elad Altman + Pierre Marion** (Imperial College London / LPSM) *Scaling limits of additive functionals of mixed fractional Brownian motions (Henri Elad Altman) + “Who plays Gandalf in LOTR?” - Natural Language Processing on structured data (Pierre Marion)*

In this talk we will present new methods to analyse the long-time behaviour of additive functionals of stochastic processes. As an application, we obtain scaling limit results for additive functionals of mixed fractional Brownian motions. This is based on joint work with Khoa Lê (TU Berlin).

“Who plays Gandalf in LOTR?” - Natural Language Processing on structured data (Pierre Marion)

Most recent approaches to language tasks (translation, text generation, etc.) model text as a stream of tokens, and learn a meaningful representation of these tokens via mechanisms like attention. However, some tasks require handling structured data, such as graphs or tables. One such example is knowledge graphs, which are a common way to encode real-world facts. The largest public knowledge graph is Wikidata (90M nodes, 1.2B edges). Finding information in these graphs, for instance to answer questions phrased in natural language, is a challenging task, which benefits from having a more structured approach to language modeling (semantic parsing, context embedding). The talk will give an overview of these different notions, and present recent results on the task of question answering grounded in Wikidata.

Lundi 19 avril 2021, 17 heures, Jussieu salle Paul Lévy (209) couloir 16-26 (2ème étage)

**William Da Silva + Aude Sportisse** (LPSM) *Hamburgers, cheeseburgers and Brownian excursions (William Da Silva) + Debiasing Averaged Stochastic Gradient Descent to handle missing values (Aude Sportisse)*

We give an elementary introduction to a bijection, due to Scott Sheffield, between loop-decorated planar maps and hamburger-cheeseburger inventory trajectories. Going through the looking glass, one may use these burger walks to study the convergence of loop-decorated random planar map models (this is the so-called peanosphere topology). In the scaling limit, the hamburger and cheeseburger walks encoding a particular model of random planar maps reveal intriguing connections between planar maps and Brownian excursions, which are reminiscent of the mating-of-trees story for space-filling explorations of Liouville quantum gravity surfaces. If time allows, I will describe a branching process arising when slicing half-planar Brownian excursions at heights.

Debiasing Averaged Stochastic Gradient Descent to handle missing values (Aude Sportisse)

The stochastic gradient algorithm is a key ingredient of many machine learning methods, particularly appropriate for large-scale learning. However, a major caveat of large data is their incompleteness. We propose an averaged stochastic gradient algorithm handling missing values in linear models. This approach has the merit of requiring no modeling of the data distribution and of accounting for heterogeneous missing proportions. In both streaming and finite-sample settings, we prove that this algorithm achieves a convergence rate of O(1/n) at iteration n, the same as without missing values. We show the convergence behavior and the relevance of the algorithm not only on synthetic data but also on real data sets, including those collected from medical registers.

Lundi 22 mars 2021, 17 heures, Jussieu salle Paul Lévy (209) couloir 16-26 (2ème étage)

**Joseph De Vilmarest + Helena Kremp** (LPSM + Freie Universität Berlin) *Adaptive Forecasting by Kalman Filter, Application to Electricity Consumption During Lockdown (Joseph de Vilmarest) + A weak solution concept for singular SDEs (Helena Kremp)*

As electricity storage capacities are still negligible compared to the need, electricity production must always balance consumption; forecasting the load is therefore a crucial task. During the coronavirus crisis, the electricity consumption patterns of most countries changed abruptly and the average load dropped. This highlights the need for adaptive forecasting methods, taking into account the recent observations in an online manner. We consider here generalized additive models, which have demonstrated their efficiency in predicting the electricity load, and we present an adaptive version based on Kalman filtering. We apply the proposed method to the French electricity load and we test it before, during and after the lockdown of March 2020.
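
The adaptive mechanism can be sketched with a Kalman filter tracking the coefficients of a linear model as a random walk. This is a simplified illustration and not the model of the talk: the features here are plain covariates rather than the effects of a generalized additive model, and the function name and the variance parameters `obs_var` and `state_var` are hypothetical tuning choices.

```python
import numpy as np

def kalman_adaptive_regression(features, targets, obs_var=1.0, state_var=0.01):
    """Online linear regression with random-walk coefficients.
    Returns one-step-ahead predictions and the final coefficient estimate."""
    n, d = features.shape
    theta = np.zeros(d)
    cov = np.eye(d)
    predictions = np.empty(n)
    for t in range(n):
        x = features[t]
        # Prediction step: the coefficients follow a random walk, so the
        # estimate is unchanged and only its covariance grows.
        cov = cov + state_var * np.eye(d)
        predictions[t] = x @ theta
        # Update step with the newly observed target.
        innovation_var = x @ cov @ x + obs_var
        gain = cov @ x / innovation_var
        theta = theta + gain * (targets[t] - predictions[t])
        cov = cov - np.outer(gain, x @ cov)
    return predictions, theta

# Usage: the true coefficients switch abruptly halfway through the series.
rng = np.random.default_rng(0)
n = 400
X = np.column_stack([np.ones(n), rng.standard_normal(n)])
true_theta = np.where(np.arange(n)[:, None] < 200, [1.0, 2.0], [-1.0, 0.5])
y = (X * true_theta).sum(axis=1) + 0.1 * rng.standard_normal(n)
preds, theta = kalman_adaptive_regression(X, y, obs_var=0.01, state_var=0.001)
print(theta)
```

A larger `state_var` lets the filter react faster to a break such as a lockdown, at the price of noisier estimates in stable periods.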

A weak solution concept for singular SDEs (Helena Kremp)

Since the works by Delarue, Diel (2016) and Cannizzaro, Chouk (2018) (in the Brownian noise setting), and our previous work (in the Lévy noise case), the existence and uniqueness of solutions to the martingale problem associated to multidimensional SDEs with additive α-stable Lévy noise for α in (1,2] and rough Besov drift of regularity β > (2-2α)/3 have been known. Motivated by the equivalence of probabilistic weak solutions to SDEs with bounded, measurable drift and solutions to the martingale problem, we define a (non-canonical) weak solution concept for singular Lévy diffusions, proving moreover equivalence to martingale solutions in both the Young regime β > (1-α)/2 and the rough regime β > (2-2α)/3. This turns out to be highly non-trivial in the rough case and forces us to define the rough stochastic sewing integrals involved. In particular, we show that the canonical weak solution concept (introduced also by Athreya, Butkovski, Mytnik (2018) in the Young case), which is well-posed in the Young case, yields non-uniqueness of solutions in the rough case.

This is ongoing work together with Nicolas Perkowski.

Lundi 8 mars 2021, 17 heures, Jussieu salle Paul Lévy (209) couloir 16-26 (2ème étage)

**David Lee** (LPSM) *A Functional Calculus via the Extension technique*

In the pioneering work of [Caffarelli and Silvestre, Comm. Par. Diff. Equ. 32 (2007)] the following problem was raised: which type of linear operators can be realized by the Dirichlet-to-Neumann operator for a Dirichlet problem on the half-space? Even though this is an analysis problem, the tools of excursion theory become very useful in this situation.

In this talk, I will present a probabilistic solution to the above problem. This work was done jointly with Daniel Hauer, from the University of Sydney.

Lundi 22 février 2021, 17 heures, Jussieu salle Paul Lévy (209) couloir 16-26 (2ème étage)

**Aaraona Rakotoarivony + Ludovic Arnould** (LPSM) *Modigliani Miller Theorem, a stochastic control approach (Aaraona Rakotoarivony) + Analyzing the tree-layer structure of Deep Forests (Ludovic Arnould)*

In their seminal paper, Modigliani and Miller showed that in a perfect market the value of a company is independent of its capital structure or its dividend policy. In this talk, we are going to investigate how this result is modified when we introduce friction in the capital market. We adopt a stochastic control approach in which the manager of a company is able to act on the cash process through dividend payments or through capital or debt issuance.

Analyzing the tree-layer structure of Deep Forests (Ludovic Arnould)

Random forests on the one hand, and neural networks on the other hand, have met great success in the machine learning community for their predictive performance. Combinations of both have been proposed in the literature, notably leading to the so-called deep forests (DF). We investigate the mechanisms at work in DF and show that the DF architecture can generally be simplified into simpler and computationally efficient shallow forest networks. Despite some instability, the latter may outperform standard predictive tree-based methods. In order to precisely quantify the improvement achieved by these light network configurations over standard tree learners, we theoretically study the performance of a shallow tree network made of two layers, each one composed of a single centered tree. We provide tight theoretical lower and upper bounds on its excess risk. These theoretical results show the interest of tree-network architectures for well-structured data provided that the first layer, acting as a data encoder, is rich enough.

Monday 1 February 2021, 5 pm, Jussieu, Salle Paul Lévy (209), corridor 16-26 (2nd floor)

**Eva Lawrence + Armand Bernou** (Sorbonne Université) *Entropy maximisation problems for multidimensional functional reconstruction (Eva Lawrence) + Asymptotic Behavior of Markov Processes: a Dive into the Subgeometric Case (Armand Bernou)*

In the framework of non-parametric reconstruction problems, we are interested in the reconstruction of a multidimensional function f over a compact set U, where f satisfies a certain number of very general integral constraints. We propose to solve this reconstruction problem by framing it as a γ-entropy maximisation problem under constraints.

That is, we define a convex function γ from R^p to R_+ with some good properties, and we are interested in the maximisation under constraints of the quantity I_γ(f) = - ∫_U γ(f) dP.

We explain that this problem can be linked to another one that deals with signed measures F.

Such problems have been studied in the case of a single function or a single measure reconstruction.

We propose to study the more general case of the reconstruction of a function or a measure with values in R^p.

Asymptotic Behavior of Markov Processes: a Dive into the Subgeometric Case (Armand Bernou)

In this talk, I will quickly introduce the notions required to discuss the stability structure of Markov processes on general state spaces. These generalise the usual ideas of recurrence, positive recurrence and aperiodicity from the countable state space theory. I will then present some methods, based on the existence of Lyapunov functionals for the stochastic generator, which allow one to obtain convergence towards the invariant measure of the Markov process at a subgeometric rate. This case differs from the geometric one in several ways, and if time permits, I will briefly discuss the main differences.

Monday 4 January 2021, 5 pm, Jussieu, Salle Paul Lévy (209), corridor 16-26 (2nd floor)

**Bo Ning** (LPSM) *Spike and slab Bayesian sparse principal component analysis (Bo Ning)*

Sparse principal component analysis (PCA) is a popular tool for dimension reduction of high-dimensional data. Recently, there have been some works on Bayesian sparse PCA. However, those studies are mostly theoretical. There is a lack of efficient computational algorithms that are available for practical use. In this talk, I will propose a new method for Bayesian sparse PCA. In addition to studying the posterior contraction rate of this method, I will introduce a PX-CAVI algorithm based on variational inference. The PX-CAVI algorithm applies the parameter expansion to the principal components to avoid dealing with the orthogonal constraints between eigenvectors directly. This algorithm is fast and accurate. Our simulation studies showed that the PX-CAVI algorithm outperforms the existing penalized methods and an EM algorithm that uses a continuous spike and slab prior. The R code of this algorithm is available online.

#### Year 2020

Monday 14 December 2020, 5 pm, Jussieu, Salle Paul Lévy (209), corridor 16-26 (2nd floor)

**Jean-David Jacques** (LPSM) *Rota-Baxter algebras, Bohnenblust-Spitzer identity and application to probability (Jean-David Jacques)*

In 1955, Frank Spitzer studied some combinatorial relations involving the maximum of truncations of a series of real numbers. This combinatorial study led to a useful relation for characteristic functions of real random walks. Later, Glen Baxter obtained the same relation by introducing an operator over the space of characteristic functions, the so-called “Baxter operator”. Finally, a complete algebraic study by the famous combinatorialist Gian-Carlo Rota led to what we nowadays call “Rota-Baxter algebras”.

Monday 7 December 2020, 5 pm, Jussieu, Salle Paul Lévy (209), corridor 16-26 (2nd floor)

**Joseph De Vilmarest** (LPSM) *Adaptive Forecasting by Kalman Filter, Application to Electricity Consumption During Lockdown (Joseph de Vilmarest)*

As electricity storage capacity is still negligible compared to demand, electricity production must always balance consumption, so forecasting the load is a crucial task. During the coronavirus crisis, the electricity consumption patterns of most countries changed abruptly and the average load dropped. This highlights the need for adaptive forecasting methods that take recent observations into account in an online manner. We consider here generalized additive models, which have demonstrated their efficiency in predicting the electricity load, and we present an adaptive version based on Kalman filtering. We apply the proposed method to the French electricity load and test it before, during and after the lockdown of March 2020.

Monday 30 November 2020, 5 pm, Jussieu, Salle Paul Lévy (209), corridor 16-26 (2nd floor)

**Lucas Iziquel** (LPSM) *Tree-like random metric spaces seen as fixed points of distributional equations (Lucas Iziquel)*

When studying the scaling limits of some models of random trees or graphs, we obtain random compact metric spaces. From the study of the Continuum Random Tree (CRT), a classical example of such random spaces, we will use its self-similarity property - a well-chosen subspace of the CRT has the same distribution as a rescaled copy of the entire CRT itself - to see the CRT as a fixed-point of a particular equation. From there we will introduce a framework to study this kind of distributional fixed-point equations, the existence and uniqueness of the fixed-points, and the possible convergence towards these fixed-points. Moreover, some geometric properties of the fixed-point can be deduced directly from the equation, for instance its almost sure fractal dimensions.

Image: A realization of the 1.07-stable looptree from the website of Igor Kortchemski

Monday 23 November 2020, 5 pm, Jussieu, Salle Paul Lévy (209), corridor 16-26 (2nd floor)

**Vasiliki Velona** (Universitat Pompeu Fabra and Universitat Politècnica de Catalunya / Visiting LPSM) *The broadcasting problem (Vasiliki Velona)*

Consider a large rooted tree, where the root-vertex of the tree has a random bit value assigned. Every other vertex has the same bit value as its parent with probability 1-q and the opposite value with probability q, where q ∈ [0,1]. The broadcasting problem consists in estimating the value of the root bit, upon observing the unlabelled tree graph and the bit values associated with a subset of the vertices. I will discuss such results on various tree classes and, in particular, on random recursive trees created either by the uniform attachment model or by the linear preferential attachment model.
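
The broadcasting model described above can be simulated in a few lines. The sketch below is only an illustration under stated assumptions: it grows a tree by uniform attachment, propagates the root bit with flip probability q, and applies a naive majority vote. The function names and the majority-vote estimator are hypothetical choices made here, not the estimators discussed in the talk.

```python
import random


def broadcast_on_uniform_recursive_tree(n, q, seed=None):
    """Grow a uniform random recursive tree on n vertices and broadcast a bit.

    Vertex 0 is the root and receives a uniformly random bit. Each new vertex
    picks its parent uniformly among the existing vertices (uniform attachment)
    and copies the parent's bit, flipping it with probability q.
    Returns (parents, bits); parents[0] is None.
    """
    rng = random.Random(seed)
    parents = [None]
    bits = [rng.randint(0, 1)]
    for v in range(1, n):
        p = rng.randrange(v)          # uniform choice among vertices 0..v-1
        flip = rng.random() < q       # flip with probability q
        parents.append(p)
        bits.append(bits[p] ^ int(flip))
    return parents, bits


def majority_vote(observed_bits, rng):
    """Naive baseline (an assumption of this sketch): guess the root bit as the
    majority of the observed bits, breaking ties uniformly at random."""
    ones = sum(observed_bits)
    zeros = len(observed_bits) - ones
    if ones == zeros:
        return rng.randint(0, 1)
    return int(ones > zeros)
```

With q = 0 every vertex carries the root bit, so the majority vote is always correct; as q approaches 1/2 the observed bits carry less and less information about the root.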

Monday 16 November 2020, 5 pm, Jussieu, Salle Paul Lévy (209), corridor 16-26 (2nd floor)

**Robin Khanfir** (LPSM) *The range of branching random walks on Galton-Watson trees (Robin Khanfir)*

The scaling limit of critical Galton-Watson trees conditioned to be large is a universal random compact metric space, called the Continuum Random Tree (CRT). For super-critical Galton-Watson trees, finding such a scaling limit is not possible because they are infinite, and so not compact, with positive probability. One can overcome this difficulty by randomly choosing a subtree, of large fixed size, of an infinite Galton-Watson tree. To do so, we consider a random walk on the infinite tree and look at the set of vertices visited by the walk before time n, which we call the range or the trace of the walk. In this case, there is indeed a scaling limit, which consists of several CRTs glued together at their roots. An interesting question then arises: what happens if we replace the random walk by a branching random walk? In other words, we try to index our walk by a critical Galton-Watson tree conditioned to be large instead of by linear time.

Monday 9 November 2020, 5 pm, Jussieu, Salle Paul Lévy (209), corridor 16-26 (2nd floor)

**Linus Bleistein** (ENS Ulm, LPSM) *Wasserstein-GANs and the Signature Transform (Linus Bleistein)*

Wasserstein-GANs (W-GANs), a class of generative adversarial models based on neural networks and the Wasserstein distance, have recently drawn a lot of attention because of their high generative power for images. However, little has been done regarding time series generation. We introduce the signature transform, a non-linear transformation that allows for simple representations of complex, high-dimensional paths, and combine it with W-GANs in order to generate time series.
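
To make the signature transform concrete, here is a minimal sketch that computes the first two levels of the signature of a piecewise-linear path, using Chen's identity to concatenate the linear segments. The function name `signature_depth2` and the truncation at depth 2 are assumptions of this illustration; it is not the implementation used in the talk.

```python
def signature_depth2(path):
    """Signature, truncated at depth 2, of a piecewise-linear path.

    `path` is a sequence of points in R^d (each a sequence of d numbers).
    Level 1 is the total increment; level 2 is the d x d matrix of iterated
    integrals. A single linear segment with increment dx has level 2 equal to
    dx (x) dx / 2, and segments are concatenated with Chen's identity:
        S2(X * Y) = S2(X) + S2(Y) + S1(X) (x) S1(Y).
    """
    d = len(path[0])
    level1 = [0.0] * d
    level2 = [[0.0] * d for _ in range(d)]
    for prev, curr in zip(path, path[1:]):
        dx = [c - p for p, c in zip(prev, curr)]
        # Update level 2 first: it needs level 1 of the path seen so far.
        for i in range(d):
            for j in range(d):
                level2[i][j] += level1[i] * dx[j] + 0.5 * dx[i] * dx[j]
        for i in range(d):
            level1[i] += dx[i]
    return level1, level2
```

For a two-dimensional path, half the difference between the two off-diagonal entries of level 2 is the Lévy area enclosed between the path and its chord, which is one reason these features capture the order in which the coordinates move.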

Monday 26 October 2020, 5 pm, Jussieu, Salle Paul Lévy (209), corridor 16-26 (2nd floor)

**Nicklas Werge** (LPSM) *AdaVol: An Adaptive Recursive Volatility Prediction Method (Nicklas Werge)*

Quasi-Maximum Likelihood (QML) procedures are theoretically appealing and widely used for statistical inference. While there are extensive references on QML estimation in batch settings, the QML estimation in streaming settings has attracted little attention until recently. An investigation of the convergence properties of the QML procedure in a general conditionally heteroscedastic time series model is conducted, and the classical batch optimization routines extended to the framework of streaming and large-scale problems. An adaptive recursive estimation routine for GARCH models named AdaVol is presented. The AdaVol procedure relies on stochastic approximations combined with the technique of Variance Targeting Estimation (VTE). This recursive method has computationally efficient properties, while VTE alleviates some convergence difficulties encountered by the usual QML estimation due to a lack of convexity. Empirical results demonstrate a favorable trade-off between AdaVol’s stability and the ability to adapt to time-varying estimates for real-life data.

Monday 19 October 2020, 5 pm, Jussieu, Salle Paul Lévy (209), corridor 16-26 (2nd floor)

**Isao Sauzedde** (LPSM) *Covariant Young's integration (Isao Sauzedde)*

We will present a new point of view on the Young integral which uses no approximation of the path. It will give us a slight extension of the integral, with the additional property of being stable under smooth deformations of the plane. We will then go from Young to stochastic integration, and finally define the integral of some irregular random 1-forms along Brownian motion. The main role will be played by the winding of curves around points.

Monday 6 July 2020, 5 pm, Jussieu, Salle Paul Lévy (209), corridor 16-26 (2nd floor)

**Gloria Buritica** (LPSM) *Clustering of extreme events (Gloria Buritica)*

The occurrence of an extreme event usually triggers a sequence of extreme events in a short period. In practice, this phenomenon entails very negative consequences when the risk model does not account for the probability of time-clustering of extremes. For example, many floods occur after extreme rainfall has been recorded on consecutive days. Similarly, stock price returns crash for several days before returning to their usual dynamics. In the setting of regularly varying stationary time series, the extremal index is a parameter that, in most cases, fully describes the time-clustering of extremes for univariate time series. The multivariate setting draws on these results and considers time-clusters as data from a stretch of time whose supremum norm exceeds a high threshold. We study a generalisation of this notion considering time-blocks of data with L^p norm above a high threshold. This new definition allows for capturing extreme behavior from more time-blocks than before.

Organizers: F. Bechtold, W. Da Silva, A. Fermanian, S. Has, Y. Yu

Wednesday 1 January 2020, 5 pm, Jussieu, Salle Paul Lévy (209), corridor 16-26 (2nd floor)

**Past speakers** (LPSM) *Past GTT talks, 2020*

KMT coupling for random walk bridges - Xuan Wu (15 June 2020)

Smoothing of Bayesian forest estimators in density estimation - Thibault Randrianarisoa (8 June 2020)

An averaging (path-by-path) approach to regularisation by noise for ODEs - Lucio Galeati (1 June 2020)

High Regularity Invariant Measures in PDEs - Mickaël Latocca (25 May 2020)

Wasserstein Random Forests and Applications in Heterogeneous Treatment Effects - Qiming Du (18 May 2020)

An approach to analyze the tail of the distribution of heterogenous data and application to insurance - Sébastien Farkas (11 May 2020)

Pathwise Regularisation of McKean–Vlasov Equations - Avi Mayorcas (4 May 2020)

On a class of completely random measures (CRMs) and its role in Bayesian analysis - Riccardo Passeggeri (27 April 2020)

An introduction to statistical properties of expanding maps - Malo Jézéquel (20 April 2020)

Introduction on post hoc inference (online) - Marie Perrot-Dockès (6 April 2020)

Learning with signatures (online) - Adeline Fermanian (30 March 2020)

Introduction on post hoc inference (cancelled!) - Marie Perrot-Dockès (9 March 2020)

Towards a better understanding of Wasserstein GANs - Ugo Tanielian (24 February 2020)

Quantum Computing and Applications to Machine Learning - Jonas Landman (10 February 2020)

Polymer Pinning Model - Order of the phase transition in the inhomogeneous model - Alexandre Legrand (3 February 2020)

A kernel-based consensual regression aggregation method - Sothea Has (3 February 2020)

A Bayesian hierarchical model for traffic flow and waiting time prediction in carpooling lines - Panayotis Papoutsis (27 January 2020)

On weighted sampling without replacement - Othmane Safsafi (20 January 2020)

Coupling methods for the convergence rate of Markov processes - Armand Bernou (13 January 2020)

Change-point analysis of copula models - Karen A. Vásquez Vivas (6 January 2020)

Organizers: F. Bechtold, W. Da Silva, A. Fermanian, S. Has, Y. Yu

#### Year 2019

Tuesday 1 January 2019, 5 pm, Jussieu, Salle Paul Lévy (209), corridor 16-26 (2nd floor)

**Past speakers** (LPSM) *Past GTT talks, 2019*

Growth-fragmentation in Planar Brownian excursions - William Da Silva (9 December 2019)

The Calderón Problem through the eyes of a probabilist - David Lee (2 December 2019)

Statistical inference for a partially observed interacting system of Hawkes processes - Chenguang Liu (25 November 2019)

Long time dynamics for interacting oscillators on graphs - Fabio Coppini (18 November 2019)

Heat kernel on the infinite percolation cluster - Chenlin Gu (4 November 2019)

Modified Runge-Kutta methods for pathwise approximations of SDEs - F. Bechtold (28 October 2019)

Informative missing data - Aude Sportisse (21 October 2019)

Random lifts and cutoff phenomenon - Guillaume Conchon-Kerjan (14 October 2019)

Les EDPS semi-linéaires à diffusion non-bornée - Florian Bechtold (24 June 2019)

Belief propagation in Bayesian networks and Markov chains and extensions with polynomials - Alexandra Lefebvre (17 June 2019)

Algèbres de quasi-battage, mouvement Brownien et chemins rugueux - Carlo Bellingeri (3 June 2019)

Chaînes d'oscillateurs et modèles à champ moyen - Alejandro Fernandez Montero (27 May 2019)

Construction séquentielle d'arbre couvrant minimal - Othmane Safsafi (20 May 2019)

A mathematical model on black market - Chenlin Gu (13 May 2019)

Rough paths, signature and statistical learning - Adeline Fermanian (6 May 2019)

Annulation de la constante isopérimétrique ancrée de percolation en p_c - Barbara Dembin (15 April 2019)

On the diffusion of eigenvectors of random matrices - Lucas Benigni (8 April 2019)

Impact of tree choice in metagenomics differential abundance studies - Antoine Bichat (25 March 2019)

Quelques propriétés sur les ICRT - Arthur Blanc-Renaudie (18 March 2019)

Concentration inequalities for functions of independent random variables - Antoine Marchina (11 March 2019)

K-means algorithm with Bregman divergences and constructing predictive models based on this algorithm - Sothea Has (4 March 2019)

Penalized likelihood methods applied to age-period-cohort analysis - Vivien Goepp (25 February 2019)

Optimal control and dynamic programming - Enzo Miller (18 February 2019)

Universality for critical kinetically constrained models - Ivailo Hartarsky (11 February 2019)

An introduction to Extreme Value Theory - Nicolas Meyer (4 February 2019)

Windings of Brownian motion - Isao Sauzedde (27 January 2019)

Modèle de Poland-Scheraga pour la dénaturation de l'ADN - Alexandre Legrand (21 January 2019)

Spectral techniques in matrix completion - Simon Coste (14 January 2019)

Introduction of high-dimensional interpretable machine learning models and their applications - Simon Bussy (7 January 2019)

Organizers: A. Lefebvre, N. Meyer, O. Safsafi, T. Touati

#### Year 2018

Monday 1 January 2018, 5 pm, Jussieu, Salle Paul Lévy (209), corridor 16-26 (2nd floor)

**Past speakers** (LPSM) *Past GTT talks, 2018*

The Hitchhiker’s Guide to the Galaxy of Financial Risk Management: Risk Measures and Procyclicality - Marcel Bräutigam (3 December 2018)

Scaling limits of a random graph: critical configuration model with power-law degrees - Guillaume Conchon-Kerjan (26 November 2018)

Temps d’infection dans le modèle de Duarte : le rôle des barrières d’énergie - Laure Marêché (19 November 2018)

Numerical consistent estimates in the multivariate linear mixed-effects model and application to the malaria infection study - Eric Adjakossa (12 November 2018)

How to free the boundary - Clément Cosco (5 November 2018)

A brief introduction to Sequential Monte Carlo - Qiming Du (22 October 2018)

Records of the Fractional Brownian Motion - Assaf Shapira (15 October 2018)

Local convergence for random permutations: the case of uniform pattern-avoiding permutations - Jacopo Borga (27 June 2018)

Triangle, Étoile, Intégrabilité - Paul Melotti (11 June 2018)

Introduction à la théorie des jeux à champs moyens (ou Mean Field Games - MFG) - Ziad Kobeissi (4 June 2018)

Random obstacle problems, and integration by parts formulae for the laws of Bessel bridges - Henri Elad-Altman (28 May 2018)

Bayesian inference of causality in gene regulatory networks - Flaminia Zane (23 May 2018)

Régularité de l'exposant de Liapounov d'un produit de matrices aléatoires - Benjamin Havret (23 April 2018)

Loi asymptotique de l'estimateur des moindres carrés pour les modèles linéaires avec erreurs dépendantes - Emmanuel Caron (9 April 2018)

Un graphe aléatoire pour modéliser la spéciation - François Bienvenu (26 March 2018)

Théorèmes limites pour des fonctionnelles de clusters d'extrêmes de processus et champs aléatoires faiblement dépendants - José-Gregorio Gomez (20 March 2018)

Contrôle Stochastique et Apprentissage Statistique : exemple de la gestion middle-out d'un portefeuille - Alexis Bismuth (19 March 2018)

Arbres, marches et laminations aléatoires - Paul Thévenin (12 March 2018)

Cutoff of sparse Markov chains - Guillaume Conchon-Kerjan (5 March 2018)

Formule d'Itô avec les structures de régularité - Carlo Bellingeri (26 February 2018)

Théorie de l’estimation par partition dépendante des données et Rules Induction Partitioning Estimator - Vincent Margot (20 February 2018)

Cardinal minimal d'une surface de coupure dans une percolation de premier passage surcritique - Barbara Dembin (19 February 2018)

Peignes et coalescents échangeables - Félix Foutel-Rodier (12 February 2018)

Organizers: C. Cosco, S. Coste, L. Marêché, P. Melotti, N. Meyer, B. Dembin, G. Conchon-Kerjan, F. Coppini, O. Safsafi