Short item, published on November 30, 2015

Modelling big “N” spatiotemporal data

Public Health data on mortality have been growing increasingly rich as more accurate information on “who”, “where” and “when” becomes available. This information forms hierarchical (multilevel) data structures that are correlated: person-level (“who”) information can be repeated, geo-statistical (“where”) data often show spatial correlation, and temporal (“when”) data can be auto-correlated. Standard statistical techniques usually assume independent observations, so when they are applied to multilevel data structures they often underestimate the standard errors. As a result, statistical significance is overestimated, leading to erroneous results and subsequent inferences (1). This defeats the main goal of epidemiological analysis, which is to correctly identify and quantify any exposures, behaviours and characteristics that may modify a population’s or an individual’s risk, and to use these to implement more appropriate interventions (2).
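
As a minimal sketch of such a structure (the notation here is illustrative, not the article’s exact model), a mortality outcome y_{ijt} for person i in area j at period t could be given a random effect at each level:

    y_{ijt} = \mathbf{x}_{ijt}^{\top}\boldsymbol{\beta} + u_i + v_j + w_t + \varepsilon_{ijt},
    \qquad u_i \sim N(0, \sigma_u^2), \quad \varepsilon_{ijt} \sim N(0, \sigma_\varepsilon^2),

where u_i captures repeated person-level measurements, v_j is a spatially structured effect (for example with conditional autoregressive or Matérn correlation) and w_t a temporally autocorrelated effect (for example AR(1)). Any two observations that share a person, area or period are then correlated, and it is precisely this dependence that an independence-based analysis ignores when it understates the standard errors of \boldsymbol{\beta}.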

In modelling hierarchical data we can take spatial and temporal correlations into account by introducing spatiotemporal random effects into the model. Several other hurdles have to be overcome when modelling hierarchical mortality data, such as: zero inflation, where the outcome has a disproportionately large share of non-occurrences; handling large data structures; repeated measures; and estimating many parameters rapidly and accurately. Bayesian techniques, with the aid of Markov chain Monte Carlo (MCMC) simulation methods, have successfully overcome these hurdles and can fit spatiotemporal random effects of Gaussian fields (GF) for reasonably sized sets of geo-locations (3). However, as the number of geo-locations n increases, MCMC computations involving the dense GF spatial correlation matrix become extremely slow or infeasible, because their cost grows as the cube of n; such data sizes are very common in Big Data Analytics (BDA). This problem is popularly known as the “big n” or “big N” problem (4). Several approaches have been used to resolve the “big N” problem; Banerjee et al. (2003) briefly summarise these: sub-sampling, spectral, lattice, dimension-reduction and coarse–fine coupling methods (5). Generally, these techniques attempt to reduce the dimension of the GF by selecting a “representative” sub-sample, fixing some parameters, or changing the scale from continuous to discrete, with the aim of reducing the computational burden of running thousands of iterations on large datasets.
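
To make the bottleneck concrete, a schematic zero-inflated spatiotemporal count model (again illustrative notation; the article’s exact specification may differ) could take the form

    P\big(y(\mathbf{s}_i, t) = 0\big) = \pi + (1 - \pi)\, e^{-\lambda(\mathbf{s}_i, t)},
    \qquad
    P\big(y(\mathbf{s}_i, t) = k\big) = (1 - \pi)\, \frac{\lambda(\mathbf{s}_i, t)^{k}\, e^{-\lambda(\mathbf{s}_i, t)}}{k!}, \quad k > 0,

    \log \lambda(\mathbf{s}_i, t) = \mathbf{x}(\mathbf{s}_i, t)^{\top}\boldsymbol{\beta} + \xi(\mathbf{s}_i, t),

where \xi is a Gaussian field whose Matérn covariance yields a dense n \times n matrix \boldsymbol{\Sigma} over the n geo-locations. Every MCMC iteration has to factorise (or invert) \boldsymbol{\Sigma}, requiring on the order of n^3 operations and n^2 storage; this cubic scaling is what makes the “big N” problem bite.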

We addressed this problem firstly using techniques proposed by Rue et al. (2005), who replaced the continuous-scale GF with a discrete-scale Gaussian Markov Random Field (GMRF) for the Matérn family of covariance structures (6). Lindgren et al. (2011) detail how the GF and the GMRF are related via Stochastic Partial Differential Equations (SPDE) using basis functions (7). Secondly, we performed inference and prediction using the Integrated Nested Laplace Approximation (INLA), which is well suited to GMRFs, rather than the commonly used MCMC (8). Hence we greatly reduced the computational burden and could run in hours what usually took days, having reduced the order of computational operations for a spatiotemporal model from power 3 to a much lower power. This is a great milestone in handling large data sets in BDA.
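
In outline (following Lindgren et al. (7); the notation here is ours), the SPDE link replaces the dense covariance of the GF by the sparse precision of a GMRF. The field \xi is taken as the stationary solution of

    (\kappa^2 - \Delta)^{\alpha/2}\,\big(\tau\, \xi(\mathbf{s})\big) = \mathcal{W}(\mathbf{s}),

which has a Matérn covariance, and is approximated on a triangulated mesh by

    \xi(\mathbf{s}) \approx \sum_{k=1}^{m} \psi_k(\mathbf{s})\, w_k,
    \qquad \mathbf{w} \sim N(\mathbf{0}, \mathbf{Q}^{-1}),

with piecewise-linear basis functions \psi_k and a sparse precision matrix \mathbf{Q}. Sparse factorisations of \mathbf{Q} cost roughly m^{3/2} operations for a spatial field in two dimensions and m^2 for a space-time field, rather than the n^3 of the dense GF covariance, and INLA computes posterior marginals through nested Laplace approximations instead of MCMC sampling.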

In the article “Bayesian analysis of zero inflated spatio-temporal HIV/TB child mortality data through the INLA and SPDE approaches” (9) we discuss a Bayesian model that can handle large spatiotemporal observational data and produce reliable estimates speedily.