Stochastic Models for the Inference of Life Evolution


SMILE is an interdisciplinary research group gathering probabilists, statisticians, bio-informaticians and biologists.
SMILE is affiliated with the Stochastics and Biology group of LPSM (Lab of Probability, Statistics and Modeling) at Sorbonne Université (formerly Université Pierre et Marie Curie, Paris 06).
SMILE is hosted within the CIRB (Center for Interdisciplinary Research in Biology) at Collège de France.
SMILE is supported by Collège de France and CNRS.
See also our homepage at CIRB.


SMILE is hosted at Collège de France in the Latin Quarter of Paris. To reach us, go to 11 place Marcelin Berthelot (stations Luxembourg or Saint-Michel on RER B).
Our working spaces are rooms 107, 121 and 122 on the first floor of building B1 (ask us for the code). Building B1 is directly in front of you as you exit the pass-through hall behind Champollion's statue.


You can reach us by email (amaury.lambert - at - or (smile - at -

Light on



The impact of selection, gene conversion, and biased sampling on the assessment of microbial demography

Recent studies have linked demographic changes and epidemiological patterns in bacterial populations using coalescent-based approaches. We identified 26 studies using skyline plots and found that 21 inferred overall population expansion. This surprising result led us to analyze the impact of natural selection, recombination (gene conversion), and sampling biases on demographic inference using skyline plots and site frequency spectra (SFS). Forward simulations based on biologically relevant parameters from Escherichia coli populations showed that theoretical arguments on the detrimental impact of recombination and especially natural selection on the reconstructed genealogies cannot be ignored in practice. In fact, both processes systematically lead to spurious interpretations of population expansion in skyline plots (and in SFS for selection). Weak purifying selection, and especially positive selection, had important effects on skyline plots, showing patterns akin to those of population expansions. State-of-the-art techniques to remove recombination further amplified these biases. We simulated three common sampling biases in microbiological research: uniform, clustered, and mixed sampling. Alone, or together with recombination and selection, they further misled demographic inferences, producing almost any possible skyline shape or SFS. Interestingly, sampling sub-populations also affected skyline plots and SFS, because the coalescent rates of populations and their sub-populations had different distributions. This study suggests that extreme caution is needed to infer demographic changes solely based on reconstructed genealogies. We suggest that the development of novel sampling strategies and the joint analyses of diverse population genetic methods are strictly necessary to estimate demographic changes in populations where selection, recombination, and biased sampling are present.
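
For readers unfamiliar with the site frequency spectrum mentioned above, the following is a minimal sketch of how an unfolded SFS can be tabulated from a sample. It is an illustration only, not the pipeline used in the study: the function name `site_frequency_spectrum` and the toy 0/1 matrix (rows are sampled haplotypes, columns are sites, 1 marks the derived allele) are assumptions made for this example.

```python
def site_frequency_spectrum(haplotypes):
    """Unfolded SFS of a 0/1 matrix (rows = haplotypes, columns = sites).

    Entry k-1 of the result counts the sites at which the derived
    allele (coded 1) is carried by exactly k of the n sampled haplotypes.
    Sites that are fixed (k = n) or absent (k = 0) are not polymorphic
    in the sample and are skipped.
    """
    n = len(haplotypes)
    sfs = [0] * (n - 1)
    for site in zip(*haplotypes):  # iterate over columns
        k = sum(site)
        if 0 < k < n:
            sfs[k - 1] += 1
    return sfs


# Hypothetical toy sample: 3 haplotypes, 4 sites.
toy_sample = [
    [0, 1, 1, 0],
    [0, 1, 0, 0],
    [1, 1, 0, 0],
]
toy_sfs = site_frequency_spectrum(toy_sample)
```

Under the standard neutral model with constant population size, the expected count of sites with k derived copies is proportional to 1/k, so an excess of low-frequency variants relative to that baseline is what tends to be read as a signal of expansion.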



The genomic view of diversification

Evolutionary relationships between species are traditionally represented in the form of a tree, called the species tree. The reconstruction of the species tree from molecular data is hindered by frequent conflicts between gene genealogies. A standard way of dealing with this issue is to postulate the existence of a unique species tree where disagreements between gene trees are explained by incomplete lineage sorting (ILS) due to random coalescences of gene lineages inside the edges of the species tree. This paradigm, known as the multi-species coalescent (MSC), is constantly violated by the ubiquitous presence of gene flow revealed by empirical studies, leading to topological incongruences of gene trees that cannot be explained by ILS alone. Here we argue that this paradigm should be revised in favor of a vision acknowledging the importance of gene flow and where gene histories shape the species tree rather than the opposite. We propose a new, plastic framework for modeling the joint evolution of gene and species lineages, relaxing the hierarchy between the species tree and gene trees. As an illustration, we implement this framework in a mathematical model called the genomic diversification (GD) model based on coalescent theory, with four parameters tuning replication, genetic differentiation, gene flow and reproductive isolation. We use it to evaluate the amount of gene flow in two empirical datasets. We find that in these datasets, gene tree distributions are better explained by the best-fitting GD model than by the best-fitting MSC model. This work should pave the way for approaches to diversification using the richer signal contained in genomic evolutionary histories rather than in the mere species tree.



The reconstructed tree in the lineage-based model of protracted speciation

A popular line of research in evolutionary biology is the use of time-calibrated phylogenies for the inference of diversification processes. This requires computing the likelihood of a given ultrametric tree as the reconstructed tree produced by a given model of diversification. Etienne and Rosindell (Syst Biol 61(2):204–213, 2012) proposed a lineage-based model of diversification, called protracted speciation, where species remain incipient during a random duration before becoming good species, and showed that this can explain the slowdown in lineage accumulation observed in real phylogenies. However, they were unable to provide a general likelihood formula. Here, we present a likelihood formula for protracted speciation models, where rates at which species turn good or become extinct can depend both on their age and on time. Our only restrictive assumption is that speciation rate does not depend on species status. Our likelihood formula utilizes a new technique, based on the contour of the phylogenetic tree and first developed by Lambert (Ann Probab 38(1):348–395, 2010). We consider the reconstructed trees spanned by all extant species, by all good extant species, or by all representative species, which are either good extant species or incipient species representative of some good extinct species. Specifically, we prove that each of these trees is a coalescent point process, that is, a planar, ultrametric tree where the coalescence times between two consecutive tips are independent, identically distributed random variables. We characterize the common distribution of these coalescence times in some biologically meaningful special cases for which the likelihood reduces to an elegant analytical formula or becomes numerically tractable.
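
The definition of a coalescent point process given above can be made concrete with a short sketch. It is illustrative only and is not the likelihood computation of the paper: the number of tips is fixed in advance, the exponential distribution of node depths is an arbitrary assumption, and the names `sample_node_depths` and `coalescence_time` are hypothetical.

```python
import random


def sample_node_depths(n_tips, draw_depth, rng):
    """Draw the n_tips - 1 i.i.d. node depths separating consecutive tips."""
    return [draw_depth(rng) for _ in range(n_tips - 1)]


def coalescence_time(depths, i, j):
    """Coalescence time between tips i and j (0-based, in planar order).

    Two tips coalesce at the deepest node lying between them, i.e. at the
    maximum of the node depths indexed from min(i, j) to max(i, j) - 1.
    """
    if i == j:
        return 0.0
    lo, hi = sorted((i, j))
    return max(depths[lo:hi])


# Assumption for illustration: node depths are exponential with rate 1.
rng = random.Random(1)
depths = sample_node_depths(6, lambda r: r.expovariate(1.0), rng)
```

Because the distance between any two tips is a maximum over the depths between them, the resulting tree is ultrametric by construction, and the whole tree is determined by the sequence of depths between consecutive tips.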



Exchangeable coalescents, ultrametric spaces, nested interval-partitions: A unifying approach

Kingman's representation theorem (Kingman 1978) states that any exchangeable partition of $\mathbb{N}$ can be represented as a paintbox based on a random mass-partition. Similarly, any exchangeable composition (i.e. ordered partition of $\mathbb{N}$) can be represented as a paintbox based on an interval-partition (Gnedin 1997). Our first main result is that any exchangeable coalescent process (not necessarily Markovian) can be represented as a paintbox based on a random non-decreasing process valued in interval-partitions, called nested interval-partition, generalizing the notion of comb metric space introduced by Lambert & Uribe Bravo (2017) to represent compact ultrametric spaces. As a special case, we show that any $\Lambda$-coalescent can be obtained from a paintbox based on a unique random nested interval-partition called $\Lambda$-comb, which is Markovian with explicit semi-group. This nested interval-partition directly relates to the flow of bridges of Bertoin & Le Gall (2003). We also display a particularly simple description of the so-called evolving coalescent by a comb-valued Markov process. Next, we prove that any measured ultrametric space $U$, under mild measure-theoretic assumptions on $U$, is the leaf set of a tree composed of a separable subtree called the backbone, on which are grafted additional subtrees, which act as star-trees from the standpoint of sampling. Displaying this so-called weak isometry requires us to extend the Gromov-weak topology, which was initially designed for separable metric spaces, to non-separable ultrametric spaces. It allows us to show that for any such ultrametric space $U$, there is a nested interval-partition which is 1) indistinguishable from $U$ in the Gromov-weak topology; 2) weakly isometric to $U$ if $U$ has complete backbone; 3) isometric to $U$ if $U$ is complete and separable.
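
As a reminder of what a paintbox construction looks like in its simplest form, here is a minimal sketch of Kingman's paintbox restricted to the first n integers, for a fixed (non-random) mass-partition. It only illustrates the classical construction, not the nested interval-partitions of the paper; the function name `paintbox_partition` and the example masses are assumptions made for this sketch.

```python
import random
from bisect import bisect_right
from itertools import accumulate


def paintbox_partition(masses, n, rng):
    """Partition {1, ..., n} by a paintbox based on a mass-partition.

    `masses` lists nonnegative block masses with total at most 1; the
    leftover mass 1 - sum(masses) is the dust. Each integer i gets an
    independent uniform variable on [0, 1): integers whose variables fall
    in the same interval share a block, and integers falling in the dust
    form singletons.
    """
    bounds = list(accumulate(masses))  # right endpoints of the intervals
    blocks = {}
    for i in range(1, n + 1):
        u = rng.random()
        k = bisect_right(bounds, u)  # index of the interval containing u
        key = ("block", k) if k < len(masses) else ("dust", i)
        blocks.setdefault(key, []).append(i)
    return sorted(blocks.values())


# Hypothetical mass-partition: two blocks of mass 0.5 and 0.3, dust 0.2.
example = paintbox_partition([0.5, 0.3], 10, random.Random(0))
```

The law of the resulting partition depends on the integers only through independent, identically distributed uniform variables, which is what makes it exchangeable.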

Upcoming seminars


Room schedule of the Collège de France.
Intranet of the Collège de France.