Mathematical Modelling


These courses will take place at SACEMA, Stellenbosch, and are registered as University of Stellenbosch Short Courses. More details, including fees and registration forms, are posted at the SACEMA website.

Introduction to R: Management, Exploration, and Communication of Data

Presenter: Roxanne Beauclair, data analysis consultant with Data Yarn, and SACEMA alumna.

Dates: Five days, 2-6 July 2018.

The course will be built around practical exercises, and will deal with the basics of programming in R, but also how to import and clean data, and how to visualise and to report results. Some specific topics: reshaping/tidying data; merging/joining datasets; handling dates and transforming numeric and categorical variables; summarizing data with plots and tables; and producing reproducible and shareable reports.

Advanced Epidemiological Methods

Presenter: Matthew Fox, Department of Epidemiology and Center for Global Health and Development, Boston University.

Dates: Four days, 30 July-2 August 2018.

This course builds on introductory and intermediate courses to rethink basic concepts: measures of effect, confounding, misclassification and selection bias. The course challenges participants to question the implications of various sources of bias in studies, and to work through novel methods and approaches to these biases; it explores the basic statistics used in epidemiological research and common misunderstandings about what these statistics can tell us. In focusing on the central concepts of validity and precision, the course develops skills that every doctoral level epidemiologist should have – skills that are both practical and marketable.

From phylogenies to phylodynamics

Using DNA to reconstruct the history of species (i.e. their phylogeny) is now commonplace in evolutionary biology. Two principles underpin this: 1. parsimony, i.e. more similar sequences cluster closer together in the phylogenetic tree; and 2. the molecular clock, i.e. the number of differences between two sequences indicates the time since their divergence. In the end, we can display the relatedness between species in so-called phylogenetic trees (diagrams).
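To make the molecular-clock principle concrete, here is a back-of-the-envelope sketch; the sequence length, number of differences and substitution rate below are invented for the example.

```python
def divergence_time(num_differences, seq_length, rate_per_site_per_year):
    """Molecular-clock estimate of the time since two sequences diverged.
    Differences accumulate along both lineages, hence the factor of two.
    This toy version ignores multiple substitutions at the same site."""
    per_site_divergence = num_differences / seq_length
    return per_site_divergence / (2 * rate_per_site_per_year)

# Two 1000-site sequences differing at 20 sites, with a substitution rate
# of 1e-3 per site per year (roughly the order observed for HIV):
print(divergence_time(20, 1000, 1e-3))  # about 10 years since divergence
```

For a rapidly evolving virus this yields resolutions of years or less, while for a slowly evolving virus the same number of differences corresponds to a much deeper divergence.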

With the advent of mass sequencing, these phylogenetic methods have been applied to infections instead of species. By obtaining the genetic sequence of a virus in a set of hosts, it is possible to infer a phylogenetic tree that is related (if imperfectly) to the transmission chain. The underlying assumption is that the virus evolves rapidly enough for mutations to accumulate on the time scale of the epidemic, such that the virus transmitted by one person differs from the virus that caused their infection (1). The evolutionary rate of the virus determines the time scale on which inferences can be made: for human immunodeficiency virus (HIV) or hepatitis C virus (HCV), the resolution can be a matter of days, whereas for more slowly evolving viruses such as human papillomaviruses (HPVs) we need hundreds of years to get enough resolution (2).

In 2004, Grenfell et al. (3) coined the term “phylodynamics” to capture the idea that the way viruses spread leaves footprints in their phylogenies and that by analysing these via phylogenetic methods, we can make inferences regarding key epidemiological parameters (like the basic reproduction number, R0). For several years, phylodynamics used generic population dynamics models, but the last decade has seen the development of new methods to use classical epidemiological models, such as the SIR model, with Susceptible, Infected and Removed hosts (4,5).

Successes and limits of phylodynamics

The Ebola epidemic in West Africa marked a qualitative change in the number of viral infections sequenced, and also in how rapidly this sequencing was done and shared (6). The development of ad hoc models allowed estimation of R0, the number of secondary infections caused by an infected host in a completely susceptible population. Classically, the greater the R0, the more difficult an epidemic is to control. Another strength of phylodynamics is that it helped in understanding the geographical spread of the epidemic (7).

Several limits of phylodynamics also became apparent. Firstly, implementing a detailed epidemiological model is difficult for methods that require the true likelihood of observing the phylogeny for a given set of parameters. Furthermore, computing time typically increases faster than linearly with the number of sequences, so larger phylogenies are more difficult to handle with current approaches. Finally, existing phylogenetic methods tend to use only dated sequences and cannot include other types of data, such as time series of incident cases, which were particularly detailed during the recent Ebola epidemics.

ABC phylodynamics

In a recent study (8), we introduced a new approach that relies on Approximate Bayesian Computation (ABC). Unlike existing methods, it does not require deriving the exact likelihood of observing the phylogeny. In a nutshell, the idea of ABC is to use numerical simulations to generate thousands of phylogenies of infection under a given epidemiological model. We then identify the simulated phylogenies that are most similar to the target phylogeny obtained from the empirical data. Since we know the parameter values used to simulate each phylogeny, we can estimate the parameters most likely to have generated the target phylogeny.

An important difficulty in this approach lies in comparing two phylogenies. This is much like finding a metric that captures the difference between two actual trees: the only practical way is to summarise each tree into quantitative variables (height, diameter, number of branches, number of leaves) that can be compared between trees. We do the same by breaking our phylogenies into dozens of summary statistics, such as branch lengths, the ratio of internal to external branches, the overall balance of the phylogeny, and the timing of events. We then use all 83 of our summary statistics to calculate a Euclidean (straight-line) distance between each simulated phylogeny and the target one.
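As a toy sketch of this distance computation, the four statistic names below are invented stand-ins for the 83 statistics used in the study:

```python
import numpy as np

# Hypothetical summary statistics for a phylogeny; the real analysis uses
# 83 statistics (branch lengths, balance indices, event timings, ...).
STAT_NAMES = ["mean_branch_length", "internal_external_ratio",
              "balance_index", "tree_height"]

def summarise(stats):
    """Collapse a dict of summary statistics into a fixed-order vector."""
    return np.array([stats[name] for name in STAT_NAMES])

def phylo_distance(stats_a, stats_b):
    """Euclidean (straight-line) distance between two phylogenies'
    summary-statistic vectors. In practice each statistic would first be
    standardised so that no single statistic dominates the distance."""
    return float(np.linalg.norm(summarise(stats_a) - summarise(stats_b)))

target = {"mean_branch_length": 1.2, "internal_external_ratio": 0.8,
          "balance_index": 0.5, "tree_height": 10.0}
simulated = {"mean_branch_length": 1.0, "internal_external_ratio": 0.9,
             "balance_index": 0.4, "tree_height": 9.0}
print(phylo_distance(target, simulated))
```

Summarising trees this way loses some information, but it turns the awkward problem of comparing tree shapes into ordinary vector arithmetic.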

Being able to tell how different two phylogenies are allows us to perform the first step of ABC, the “rejection step”: we keep only the phylogenies that are sufficiently close to the target one. Classical ABC algorithms then re-iterate this rejection by re-simulating data using a Markov chain Monte Carlo (MCMC) algorithm. Here, however, we used a more recent method called regression ABC, which uses the phylogenies kept after the rejection step to inform a regression model. If successful, we can then predict model parameters for any given set of summary statistics. As we will see later, regression ABC has several advantages over the canonical ABC approach. Since this method is quite new, we compared two algorithms for the regression step, both of which rely on machine learning: Least Absolute Shrinkage and Selection Operator (LASSO) regression and neural networks – an information processing paradigm inspired by the way the brain processes information.
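A minimal sketch of the rejection and regression steps, using a one-parameter toy simulator in place of phylogeny simulation and ordinary least squares standing in for the LASSO/neural-network regressions of the study:

```python
import numpy as np

rng = np.random.default_rng(42)

def simulate_stats(r0):
    """Toy 'simulator' mapping one parameter (say, R0) to two noisy
    summary statistics. A real application would simulate an entire
    phylogeny here and summarise it into dozens of statistics."""
    return np.array([2.0 * r0, r0 ** 2]) + rng.normal(0.0, 0.1, 2)

true_r0 = 2.5
target_stats = simulate_stats(true_r0)        # plays the role of the data

# 1. Draw parameters from the prior and simulate each one.
r0_draws = rng.uniform(1.0, 4.0, 5000)
sim_stats = np.array([simulate_stats(r) for r in r0_draws])

# 2. Rejection step: keep the 5% of simulations closest to the target.
dists = np.linalg.norm(sim_stats - target_stats, axis=1)
keep = np.argsort(dists)[: len(dists) // 20]

# 3. Regression step: regress the parameter on the summary statistics of
#    the accepted simulations, then predict at the target's statistics.
X = np.column_stack([np.ones(len(keep)), sim_stats[keep]])
coef, *_ = np.linalg.lstsq(X, r0_draws[keep], rcond=None)
estimate = float(coef @ np.concatenate([[1.0], target_stats]))
print(f"estimated R0 = {estimate:.2f} (truth: {true_r0})")
```

The regression step corrects for the fact that accepted simulations are only approximately equal to the target, which is what lets regression ABC tolerate a looser rejection threshold than rejection-only ABC.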

ABC regression is powerful

In general, our cross-validation analysis shows that regression ABC is powerful. By definition, ABC cannot outperform a method based on the exact likelihood. However, for a simple SIR model, we reach an inference precision for R0 and infection duration close to the “gold standard” obtained by maximising the true likelihood function (via the BDSIR model in the software BEAST). Regression ABC even outperforms BDSIR when it comes to estimating the host population size, because the algorithm implemented in BEAST has to make numerical approximations to compute the likelihood function (Figure 1).

Figure 1: Inference power of the regression ABC for three parameters of the SIR model. The first row shows the prior that was explored (in gray), the second row the regression ABC using neural networks (in yellow), the third row the regression ABC using LASSO regression (in red) and the bottom row the results of the maximum likelihood approach implemented in BEAST (in black). The inference power is close to that of BEAST for R0 and infection duration, and somewhat better for the population size (right column). Vertical lines show the target value. This estimation was performed for 100 target phylogenies and each line shows the inferred distribution for the respective target. The large dot shows the median value over all 100 targets. This figure is reproduced from (8).

Regarding the more technical details, we first showed that it is not possible to capture the information contained in the phylogeny with only a few summary statistics, as this information appears to be spread across many of them. However, one difference between LASSO and neural networks was that the former was more efficient at selecting among summary statistics, making it more reliable as a generic regression tool.

To further investigate the power of our model, we focused on the 2014-2016 Ebola epidemics and used the first outbreak phylogeny published by Gire et al. (6) to assess our ability to infer parameters with a more elaborate epidemiological model. To this end, we simulated phylogenies assuming an SEIR model (where E stands for an Exposed state, that is infected but not yet infectious) and then ran our regression ABC to infer R0, duration of latency and duration of infectiousness. Compared to the approach encoded in BEAST (9), we have a lower ability to infer R0, but a greater ability to infer other parameters such as the duration of latency. This is interesting because duration parameters are typically difficult to infer from incidence data and require contact tracing data. However, the small size of the phylogeny (72 leaves) and its low resolution seem to be close to the limit of our method, which is much more powerful for larger phylogenies.

Perspectives on regression ABC

We showed that a “naive” approach, which only requires the ability to simulate phylogenies of infections, can reach a precision close to that obtained by state-of-the-art methods that rely on the true likelihood of observing a phylogeny given some parameter values. We also showed that the accuracy of this method, in terms of minimising the error made when inferring parameters, increases with the size of the phylogeny. From a more technical point of view, regression techniques now allow us to deal efficiently with a large number of summary statistics – a long-standing limitation of ABC approaches, where only a few summary statistics could be used and choosing them was nearly impossible – by either giving different weights to the summary statistics or sorting out the redundant ones. This advance may explain the difference in estimation accuracy between our study and earlier ones (10).

We expect regression ABC to be most appropriate in situations where we have a large phylogeny and where there is enough biological knowledge to simulate a rather detailed individual-based model. Indeed, the number of simulated phylogenies required to perform regression ABC does not increase with phylogeny size (although the time to simulate each phylogeny does increase), whereas maximum likelihood functions tend to take much more time to converge as the phylogeny size and the model complexity increase. Another advantage of the ABC approach, which we are currently investigating, is that it can readily combine different sources of data. In the case of the Ebola epidemics for instance, there was an extremely detailed follow-up of incident cases. By extracting summary statistics of these time series, we can further improve the power of the ABC and hopefully combine the precision of classical epidemiological methods to infer parameters such as R0 with the ability of phylodynamics to access more hidden parameters such as, for Ebola, the fraction of transmission through dead hosts.

Individual-based models in HIV epidemiology

In epidemiology, mathematical models are widely used to simulate the progression, transmission, prevention and treatment of infectious diseases. Most of these models are deterministic compartmental models, simulating population averages of changes in infection status and disease stage over time. However, many infectious diseases, in particular sexually transmitted infections (STIs), are subject to high individual heterogeneity. Unlike compartmental models simulating population averages, individual-based models (IBMs) keep track of the events that happen to each individual separately and can therefore take into account various sources of individual heterogeneity.

The ability to let population-level features of complex systems emerge from processes and events that happen to interacting individuals is arguably the most important quality of IBMs. As the computational expense of IBMs has become less prohibitive thanks to multi-core processors and increased access to high-performance computers, IBMs are increasingly used in infectious disease epidemiology. SimpactCyan is conceived as a multipurpose model-building tool to address research questions in HIV epidemiology at the intersection of network and social epidemiology, computational biology, public health and policy modelling.

SimpactCyan is by no means the first HIV-specific individual-based modelling tool to have been developed. However, with SimpactCyan we aim to overcome some of the limitations of current software for implementing IBMs in HIV epidemiology. While some modelling tools (e.g. STDSIM for simulating transmission of HIV and other STIs (1)) are not open source, other IBMs (e.g. EMOD (2)) are relatively difficult to modify. Another limitation of EMOD is that it can only be used on computers running Windows 10, Windows Server 12, Windows HPC Server 12 or CentOS 7.1. Furthermore, while it has interfaces for Matlab and Python, it does not have an R interface. NetLogo models, on the other hand, are easily modifiable (3), and can be run from within the R environment (4) but are too slow for simulating large populations over the time-scale relevant for HIV epidemiology.

With a few exceptions (e.g. the MicSim package (5)), existing simulators implement IBMs with fixed time-steps: for example, the state of the model population gets updated every week. A continuous-time implementation of IBMs, however, has the advantage of elegantly handling competing risks of multiple events. For instance, an individual may be concurrently at risk of HIV-related mortality and at risk of transmitting the virus to a partner. Evaluating the model in fixed time-steps may lead to a situation where both events are scheduled to take place between now and the next time-step, whereas in reality this is only possible if the transmission event happens first. In a continuous-time model evaluation, we know exactly which of the two events is scheduled first, and the logical consequences for the likelihood of subsequent events are processed along with the execution of that first event.
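The competing-risks logic of a continuous-time simulation can be sketched as follows; the event names and hazard values are invented for the example.

```python
import random

random.seed(1)

def next_event(hazards):
    """Draw a competing waiting time for each possible event from its
    exponential hazard and execute only the earliest one; the others are
    discarded and re-drawn after the state update. A fixed time-step
    scheme cannot order two events that fall inside the same step."""
    times = {name: random.expovariate(rate) for name, rate in hazards.items()}
    event = min(times, key=times.get)
    return event, times[event]

# Hypothetical per-year hazards for one HIV-infected individual.
hazards = {"hiv_death": 0.1, "transmission": 0.5}
counts = {"hiv_death": 0, "transmission": 0}
for _ in range(10_000):
    event, _when = next_event(hazards)
    counts[event] += 1

# With rates 0.1 vs 0.5, transmission should win roughly 5 times in 6.
print(counts)
```

Because only the earliest event fires, the ordering of competing events is always resolved correctly, however close together they fall.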

Furthermore, events happening after short and long time-periods can be included in a single simulation. In contrast, in a discrete-time model, simulating events that occur on vastly different time-scales can be computationally inefficient: frequently occurring events may require a small time-step, possibly leading to rare events being evaluated much more often than necessary. Another limitation of existing implementations of IBMs for dynamic sexual networks is that they require ad-hoc decisions about who “goes out” to find partners, in what order, and who can “be found”. SimpactCyan is a simulator for event-driven IBMs in HIV epidemiology, evaluated in continuous time: the state of the system is updated each time an event happens, and all possible relationships are considered simultaneously instead of sequentially.

The simulation algorithm

Event times, i.e. time points in the simulation at which events are scheduled to take place, are determined using the modified Next Reaction Method (mNRM) (6). The mNRM was originally designed for simulating chemical systems with time-dependent propensities and delays, but in SimpactCyan we use it to simulate how individuals are at risk of events according to hazard functions. The mNRM algorithm makes a core distinction between internal event times and (simulated) real-world event times. The internal event times determine when an event will be triggered according to the event’s internal clock: think of a time bomb that goes off when its internal clock reaches a pre-set time. How fast this clock ticks depends on the hazard function for the event; a low hazard means that the clock ticks slowly. As a result, two events can be scheduled with the same internal clock time, yet their real-world times may differ if their hazard functions differ. Full details of the simulation algorithm and the events that can happen to individuals in models formulated with SimpactCyan can be found at:
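The internal-versus-real-world clock distinction can be sketched for the simplest case of a constant hazard; the hazard values here are invented for the example, and time-dependent hazards would require inverting the integrated hazard instead of a simple division.

```python
import random

random.seed(7)

def real_world_time(internal_time, hazard):
    """Under a constant hazard, the internal clock advances at rate
    `hazard`, so the (simulated) real-world firing time is the internal
    event time divided by the hazard. A time-dependent hazard would
    instead require inverting the integrated hazard."""
    return internal_time / hazard

# Both events get the same internal event time (the "time bomb" setting)...
internal = random.expovariate(1.0)

# ...but a lower hazard means that event's internal clock ticks more
# slowly, so it fires later in (simulated) real-world time.
slow = real_world_time(internal, hazard=0.2)
fast = real_world_time(internal, hazard=2.0)
print(f"internal={internal:.3f}, fast={fast:.3f}, slow={slow:.3f}")
```

The two events share one internal event time, yet their real-world firing times differ by exactly the ratio of their hazards.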

Model applications

To illustrate what kind of analyses could be conducted with SimpactCyan, we present two simple examples of model applications. The first illustrates how SimpactCyan can be used to assess the impact of progressive changes to the ART eligibility criteria in Eswatini (formerly known as Swaziland). The second illustrates the use of SimpactCyan as a data-generating tool for assessing the performance of other modelling frameworks. All code and data files necessary to reproduce the examples are available at

In the MaxART project (7), SimpactCyan is used to estimate the likely impact of Eswatini’s shift towards “Early Access to ART for All” (EAAA) on the incidence of HIV. HIV incidence is the rate at which HIV-uninfected people acquire the infection. Such HIV transmission events are scheduled each time a relationship is formed between an HIV-infected and an HIV-uninfected individual. The hazard for transmission events primarily depends on the viral load of the infected partner but can be defined in such a way to also allow for a hazard-lowering effect of multiple ongoing relationships (so-called coital dilution), as well as a hazard-increasing effect of adolescent age among women.

The viral load of an infected individual varies over time: during the chronic stage of the infection, the viral concentration assumes a set-point viral load. Model parameters determine the factors by which the transmission hazard is multiplied during the initial acute stage and during the early and late AIDS stages. At the time of HIV acquisition, the time until HIV-related death is determined, based on an early paper by Arnaout et al. (8), to allow for a negative correlation between set-point viral load and post-infection survival time.

The set-point viral load value allocated to a newly infected individual is partially determined by that of their infector, i.e. some heritability of set-point viral load is assumed (9). ART initiation affects both the expected time to HIV-related death and the infectiousness of the person on ART. As soon as ART is started, the log10 viral load is assumed to drop by a user-defined fraction, and the updated current viral load is used to recalculate the post-infection survival time. In the simulations whose key output is shown in Figure 1, we assume that upon ART initiation the log10 viral load drops by 70%, effectively rendering the viral load “undetectable” for most ART clients. We further assumed that ART was gradually introduced around the year 2000 and that the CD4 cell count threshold for ART eligibility progressively shifted towards ever more inclusive criteria, alongside a decreasing lag-time between HIV infection and HIV diagnosis. These assumptions hold in both the “Status Quo” scenario and the “EAAA” scenario. In the EAAA scenario, however, an additional policy change is modelled: a policy of immediate access to ART for all people infected with HIV is adopted from October 2016. In the alternative scenario, the CD4 cell count threshold for ART eligibility stays at 500 cells/microliter from mid-2013 onwards. Extra intervention events are added in both scenarios to model a moderate reduction in sexual risk behaviour (rate of partner turnover) that took place in the period 2000-2003.
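A toy illustration of how such a viral-load-dependent transmission hazard might look; the functional form and all parameter values below are invented for the example and are not SimpactCyan’s actual defaults.

```python
def transmission_hazard(log10_vl, stage_multiplier=1.0,
                        base_hazard=0.05, vl_exponent=0.3, ref_log10_vl=4.0):
    """Toy per-relationship transmission hazard that increases with the
    current log10 viral load; `stage_multiplier` plays the role of the
    acute- and AIDS-stage factors described in the text. All values are
    illustrative, not SimpactCyan's actual parameterisation."""
    return (base_hazard
            * 10 ** (vl_exponent * (log10_vl - ref_log10_vl))
            * stage_multiplier)

set_point = 4.5                                    # chronic-stage log10 VL
chronic = transmission_hazard(set_point)
acute = transmission_hazard(set_point, stage_multiplier=10.0)
on_art = transmission_hazard(set_point * (1 - 0.7))  # log10 VL drops 70%
print(f"chronic={chronic:.4f}, acute={acute:.4f}, on ART={on_art:.4f}")
```

The ordering of the three hazards (acute > chronic > on ART) is what drives the incidence reduction when ART eligibility is widened in the EAAA scenario.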

The second use case illustrates how SimpactCyan can be used as a data-generating tool for benchmarking the performance of other modelling frameworks. Phylogenetic models have been used to infer properties of epidemics from reconstructed phylogenetic trees, including time-trends in HIV incidence rates (10, 11) and the age-mixing pattern in HIV transmission clusters (12). As the truth is typically unknown, it is difficult to assess the validity of these novel modelling frameworks or document their sensitivity to breaches in the models’ assumptions. For instance, phylogenetic inference methods typically assume that HIV sequence data are available for most HIV-positive people and that the individuals for whom the viral genome was sequenced, are a random subset of all HIV infected people. However, in many settings, neither of these assumptions is met. In Figure 3 we illustrate the basic idea of SimpactCyan as a data-generating tool for benchmarking.

First, we simulate an emerging HIV epidemic. Panel (a) shows the cumulative HIV transmission network of the epidemic, linking all individuals who got infected with HIV by the end of the simulation. Next, assuming HIV transmission events correspond to branching points in the phylogeny, we convert the transmission network into its corresponding phylogenetic tree. Using the Seq-Gen program (13), we simulate the viral evolution along this phylogeny, assuming a generalised time-reversible substitution model (14) with a gamma-invariable mixture model for rate heterogeneity among sites. In this way, we generate synthetic HIV sequence data. Lastly, we feed these sequences into the FastTree 2 program (15) and the treedater R package (16) to reconstruct the time-resolved phylogenetic tree, shown in panel (b). If all people ever infected are included in the sequence dataset and the same molecular evolution model is used to generate the sequence data and to reconstruct the phylogenetic tree, the timing of the internal nodes in the reconstructed tree should correspond with the timing of the simulated HIV transmission events, as shown in panel (c). Now that we have established the validity of this phylogenetic inference approach under ideal circumstances, we can examine the performance of the inference method under alternative scenarios in which some of the viral sequence data are missing completely at random (MCAR), missing at random (MAR) or not at random (MNAR). Through simulation-based sensitivity analyses, we could quantify how the accuracy of epidemiological characteristics inferred by the phylodynamic method depends on the magnitude of the missing data problem and the strength of the correlations between the probability of sequences being missing and covariates such as age, indicators of sexual risk behaviour or calendar time.
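A minimal sketch of how such missingness mechanisms can be imposed on simulated data; the age covariate, population size and sampling weights are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical ages for 1000 infected individuals in a simulated epidemic.
ages = rng.uniform(15, 60, 1000)

def sampled_indices(ages, frac, mechanism):
    """Pick which individuals' viruses get 'sequenced'.
    MCAR: every individual is equally likely to be sampled.
    MAR:  sampling probability depends on the observed age covariate
          (here, older individuals are more likely to be sequenced)."""
    n = int(frac * len(ages))
    if mechanism == "MCAR":
        p = np.full(len(ages), 1 / len(ages))
    elif mechanism == "MAR":
        p = ages / ages.sum()
    else:
        raise ValueError(mechanism)
    return rng.choice(len(ages), size=n, replace=False, p=p)

mcar = sampled_indices(ages, 0.5, "MCAR")
mar = sampled_indices(ages, 0.5, "MAR")
print(f"MCAR mean age: {ages[mcar].mean():.1f}, "
      f"MAR mean age: {ages[mar].mean():.1f}")
```

Feeding the sequences of each subsample into the phylogenetic pipeline, instead of the full set, would then show how the inferred epidemic history degrades under each missingness mechanism.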


Future directions

Ongoing developments of SimpactCyan include the addition of events for the transmission and treatment of other STIs such as Herpes Simplex Virus 2 (HSV-2) and Hepatitis C Virus (HCV), as well as additional events for parenteral transmission and mother-to-child transmission (MTCT) of HIV and for co-infections, to allow studies of HIV transmission in injecting drug users (IDU) and in children. We also plan to extend the software by enabling explicit modelling of the correlation between sexual risk behaviour and health-seeking behaviour. This is in response to recent evidence suggesting that high sexual risk behaviour is associated with a lower likelihood of being aware of one’s HIV infection, and a lower likelihood of being virally suppressed among people who know they are HIV positive (17).

Conceived as a flexible open-source, open access tool, rather than a proprietary asset, SimpactCyan’s extensions and applications should not solely come from its original developers. Instead, we want to position this simulator as a vehicle for open science in HIV epidemiology. Therefore, others are encouraged to use it for the development of their IBMs, as the starting point for their simulation engine, as a data-generating and benchmarking tool in methodological research, or for educational purposes.