mod

Local law and Tracy–Widom limit for sparse stochastic block models

Jong Yun Hwang, Ji Oon Lee, Wooseok Yang.

Source: Bernoulli, Volume 26, Number 3, 2400--2435.

Abstract:
We consider the spectral properties of sparse stochastic block models, where $N$ vertices are partitioned into $K$ balanced communities. Under an assumption that the intra-community probability and inter-community probability are of similar order, we prove a local semicircle law up to the spectral edges, with an explicit formula on the deterministic shift of the spectral edge. We also prove that the fluctuation of the extremal eigenvalues is given by the GOE Tracy–Widom law after rescaling and centering the entries of sparse stochastic block models. Applying the result to sparse stochastic block models, we rigorously prove that there is a large gap between the outliers and the spectral edge without centering.




mod

A refined Cramér-type moderate deviation for sums of local statistics

Xiao Fang, Li Luo, Qi-Man Shao.

Source: Bernoulli, Volume 26, Number 3, 2319--2352.

Abstract:
We prove a refined Cramér-type moderate deviation result by taking into account of the skewness in normal approximation for sums of local statistics of independent random variables. We apply the main result to $k$-runs, U-statistics and subgraph counts in the Erdős–Rényi random graph. To prove our main result, we develop exponential concentration inequalities and higher-order tail probability expansions via Stein’s method.




mod

A fast algorithm with minimax optimal guarantees for topic models with an unknown number of topics

Xin Bing, Florentina Bunea, Marten Wegkamp.

Source: Bernoulli, Volume 26, Number 3, 1765--1796.

Abstract:
Topic models have become popular for the analysis of data that consists in a collection of n independent multinomial observations, with parameters $N_{i}inmathbb{N}$ and $Pi_{i}in[0,1]^{p}$ for $i=1,ldots,n$. The model links all cell probabilities, collected in a $p imes n$ matrix $Pi$, via the assumption that $Pi$ can be factorized as the product of two nonnegative matrices $Ain[0,1]^{p imes K}$ and $Win[0,1]^{K imes n}$. Topic models have been originally developed in text mining, when one browses through $n$ documents, based on a dictionary of $p$ words, and covering $K$ topics. In this terminology, the matrix $A$ is called the word-topic matrix, and is the main target of estimation. It can be viewed as a matrix of conditional probabilities, and it is uniquely defined, under appropriate separability assumptions, discussed in detail in this work. Notably, the unique $A$ is required to satisfy what is commonly known as the anchor word assumption, under which $A$ has an unknown number of rows respectively proportional to the canonical basis vectors in $mathbb{R}^{K}$. The indices of such rows are referred to as anchor words. Recent computationally feasible algorithms, with theoretical guarantees, utilize constructively this assumption by linking the estimation of the set of anchor words with that of estimating the $K$ vertices of a simplex. This crucial step in the estimation of $A$ requires $K$ to be known, and cannot be easily extended to the more realistic set-up when $K$ is unknown. This work takes a different view on anchor word estimation, and on the estimation of $A$. We propose a new method of estimation in topic models, that is not a variation on the existing simplex finding algorithms, and that estimates $K$ from the observed data. We derive new finite sample minimax lower bounds for the estimation of $A$, as well as new upper bounds for our proposed estimator. We describe the scenarios where our estimator is minimax adaptive. Our finite sample analysis is valid for any $n,N_{i},p$ and $K$, and both $p$ and $K$ are allowed to increase with $n$, a situation not handled well by previous analyses. We complement our theoretical results with a detailed simulation study. We illustrate that the new algorithm is faster and more accurate than the current ones, although we start out with a computational and theoretical disadvantage of not knowing the correct number of topics $K$, while we provide the competing methods with the correct value in our simulations.




mod

Efficient estimation in single index models through smoothing splines

Arun K. Kuchibhotla, Rohit K. Patra.

Source: Bernoulli, Volume 26, Number 2, 1587--1618.

Abstract:
We consider estimation and inference in a single index regression model with an unknown but smooth link function. In contrast to the standard approach of using kernels or regression splines, we use smoothing splines to estimate the smooth link function. We develop a method to compute the penalized least squares estimators (PLSEs) of the parametric and the nonparametric components given independent and identically distributed (i.i.d.) data. We prove the consistency and find the rates of convergence of the estimators. We establish asymptotic normality under mild assumption and prove asymptotic efficiency of the parametric component under homoscedastic errors. A finite sample simulation corroborates our asymptotic theory. We also analyze a car mileage data set and a Ozone concentration data set. The identifiability and existence of the PLSEs are also investigated.




mod

Reliable clustering of Bernoulli mixture models

Amir Najafi, Seyed Abolfazl Motahari, Hamid R. Rabiee.

Source: Bernoulli, Volume 26, Number 2, 1535--1559.

Abstract:
A Bernoulli Mixture Model (BMM) is a finite mixture of random binary vectors with independent dimensions. The problem of clustering BMM data arises in a variety of real-world applications, ranging from population genetics to activity analysis in social networks. In this paper, we analyze the clusterability of BMMs from a theoretical perspective, when the number of clusters is unknown. In particular, we stipulate a set of conditions on the sample complexity and dimension of the model in order to guarantee the Probably Approximately Correct (PAC)-clusterability of a dataset. To the best of our knowledge, these findings are the first non-asymptotic bounds on the sample complexity of learning or clustering BMMs.




mod

The moduli of non-differentiability for Gaussian random fields with stationary increments

Wensheng Wang, Zhonggen Su, Yimin Xiao.

Source: Bernoulli, Volume 26, Number 2, 1410--1430.

Abstract:
We establish the exact moduli of non-differentiability of Gaussian random fields with stationary increments. As an application of the result, we prove that the uniform Hölder condition for the maximum local times of Gaussian random fields with stationary increments obtained in Xiao (1997) is optimal. These results are applicable to fractional Riesz–Bessel processes and stationary Gaussian random fields in the Matérn and Cauchy classes.




mod

A new McKean–Vlasov stochastic interpretation of the parabolic–parabolic Keller–Segel model: The one-dimensional case

Denis Talay, Milica Tomašević.

Source: Bernoulli, Volume 26, Number 2, 1323--1353.

Abstract:
In this paper, we analyze a stochastic interpretation of the one-dimensional parabolic–parabolic Keller–Segel system without cut-off. It involves an original type of McKean–Vlasov interaction kernel. At the particle level, each particle interacts with all the past of each other particle by means of a time integrated functional involving a singular kernel. At the mean-field level studied here, the McKean–Vlasov limit process interacts with all the past time marginals of its probability distribution in a similarly singular way. We prove that the parabolic–parabolic Keller–Segel system in the whole Euclidean space and the corresponding McKean–Vlasov stochastic differential equation are well-posed for any values of the parameters of the model.




mod

Strictly weak consensus in the uniform compass model on $mathbb{Z}$

Nina Gantert, Markus Heydenreich, Timo Hirscher.

Source: Bernoulli, Volume 26, Number 2, 1269--1293.

Abstract:
We investigate a model for opinion dynamics, where individuals (modeled by vertices of a graph) hold certain abstract opinions. As time progresses, neighboring individuals interact with each other, and this interaction results in a realignment of opinions closer towards each other. This mechanism triggers formation of consensus among the individuals. Our main focus is on strong consensus (i.e., global agreement of all individuals) versus weak consensus (i.e., local agreement among neighbors). By extending a known model to a more general opinion space, which lacks a “central” opinion acting as a contraction point, we provide an example of an opinion formation process on the one-dimensional lattice $mathbb{Z}$ with weak consensus but no strong consensus.




mod

Consistent structure estimation of exponential-family random graph models with block structure

Michael Schweinberger.

Source: Bernoulli, Volume 26, Number 2, 1205--1233.

Abstract:
We consider the challenging problem of statistical inference for exponential-family random graph models based on a single observation of a random graph with complex dependence. To facilitate statistical inference, we consider random graphs with additional structure in the form of block structure. We have shown elsewhere that when the block structure is known, it facilitates consistency results for $M$-estimators of canonical and curved exponential-family random graph models with complex dependence, such as transitivity. In practice, the block structure is known in some applications (e.g., multilevel networks), but is unknown in others. When the block structure is unknown, the first and foremost question is whether it can be recovered with high probability based on a single observation of a random graph with complex dependence. The main consistency results of the paper show that it is possible to do so under weak dependence and smoothness conditions. These results confirm that exponential-family random graph models with block structure constitute a promising direction of statistical network analysis.




mod

Distances and large deviations in the spatial preferential attachment model

Christian Hirsch, Christian Mönch.

Source: Bernoulli, Volume 26, Number 2, 927--947.

Abstract:
This paper considers two asymptotic properties of a spatial preferential-attachment model introduced by E. Jacob and P. Mörters (In Algorithms and Models for the Web Graph (2013) 14–25 Springer). First, in a regime of strong linear reinforcement, we show that typical distances are at most of doubly-logarithmic order. Second, we derive a large deviation principle for the empirical neighbourhood structure and express the rate function as solution to an entropy minimisation problem in the space of stationary marked point processes.




mod

Robust estimation of mixing measures in finite mixture models

Nhat Ho, XuanLong Nguyen, Ya’acov Ritov.

Source: Bernoulli, Volume 26, Number 2, 828--857.

Abstract:
In finite mixture models, apart from underlying mixing measure, true kernel density function of each subpopulation in the data is, in many scenarios, unknown. Perhaps the most popular approach is to choose some kernel functions that we empirically believe our data are generated from and use these kernels to fit our models. Nevertheless, as long as the chosen kernel and the true kernel are different, statistical inference of mixing measure under this setting will be highly unstable. To overcome this challenge, we propose flexible and efficient robust estimators of the mixing measure in these models, which are inspired by the idea of minimum Hellinger distance estimator, model selection criteria, and superefficiency phenomenon. We demonstrate that our estimators consistently recover the true number of components and achieve the optimal convergence rates of parameter estimation under both the well- and misspecified kernel settings for any fixed bandwidth. These desirable asymptotic properties are illustrated via careful simulation studies with both synthetic and real data.




mod

Stochastic differential equations with a fractionally filtered delay: A semimartingale model for long-range dependent processes

Richard A. Davis, Mikkel Slot Nielsen, Victor Rohde.

Source: Bernoulli, Volume 26, Number 2, 799--827.

Abstract:
In this paper, we introduce a model, the stochastic fractional delay differential equation (SFDDE), which is based on the linear stochastic delay differential equation and produces stationary processes with hyperbolically decaying autocovariance functions. The model departs from the usual way of incorporating this type of long-range dependence into a short-memory model as it is obtained by applying a fractional filter to the drift term rather than to the noise term. The advantages of this approach are that the corresponding long-range dependent solutions are semimartingales and the local behavior of the sample paths is unaffected by the degree of long memory. We prove existence and uniqueness of solutions to the SFDDEs and study their spectral densities and autocovariance functions. Moreover, we define a subclass of SFDDEs which we study in detail and relate to the well-known fractionally integrated CARMA processes. Finally, we consider the task of simulating from the defining SFDDEs.




mod

Robust modifications of U-statistics and applications to covariance estimation problems

Stanislav Minsker, Xiaohan Wei.

Source: Bernoulli, Volume 26, Number 1, 694--727.

Abstract:
Let $Y$ be a $d$-dimensional random vector with unknown mean $mu $ and covariance matrix $Sigma $. This paper is motivated by the problem of designing an estimator of $Sigma $ that admits exponential deviation bounds in the operator norm under minimal assumptions on the underlying distribution, such as existence of only 4th moments of the coordinates of $Y$. To address this problem, we propose robust modifications of the operator-valued U-statistics, obtain non-asymptotic guarantees for their performance, and demonstrate the implications of these results to the covariance estimation problem under various structural assumptions.




mod

On frequentist coverage errors of Bayesian credible sets in moderately high dimensions

Keisuke Yano, Kengo Kato.

Source: Bernoulli, Volume 26, Number 1, 616--641.

Abstract:
In this paper, we study frequentist coverage errors of Bayesian credible sets for an approximately linear regression model with (moderately) high dimensional regressors, where the dimension of the regressors may increase with but is smaller than the sample size. Specifically, we consider quasi-Bayesian inference on the slope vector under the quasi-likelihood with Gaussian error distribution. Under this setup, we derive finite sample bounds on frequentist coverage errors of Bayesian credible rectangles. Derivation of those bounds builds on a novel Berry–Esseen type bound on quasi-posterior distributions and recent results on high-dimensional CLT on hyperrectangles. We use this general result to quantify coverage errors of Castillo–Nickl and $L^{infty}$-credible bands for Gaussian white noise models, linear inverse problems, and (possibly non-Gaussian) nonparametric regression models. In particular, we show that Bayesian credible bands for those nonparametric models have coverage errors decaying polynomially fast in the sample size, implying advantages of Bayesian credible bands over confidence bands based on extreme value theory.




mod

Consistent semiparametric estimators for recurrent event times models with application to virtual age models

Eric Beutner, Laurent Bordes, Laurent Doyen.

Source: Bernoulli, Volume 26, Number 1, 557--586.

Abstract:
Virtual age models are very useful to analyse recurrent events. Among the strengths of these models is their ability to account for treatment (or intervention) effects after an event occurrence. Despite their flexibility for modeling recurrent events, the number of applications is limited. This seems to be a result of the fact that in the semiparametric setting all the existing results assume the virtual age function that describes the treatment (or intervention) effects to be known. This shortcoming can be overcome by considering semiparametric virtual age models with parametrically specified virtual age functions. Yet, fitting such a model is a difficult task. Indeed, it has recently been shown that for these models the standard profile likelihood method fails to lead to consistent estimators. Here we show that consistent estimators can be constructed by smoothing the profile log-likelihood function appropriately. We show that our general result can be applied to most of the relevant virtual age models of the literature. Our approach shows that empirical process techniques may be a worthwhile alternative to martingale methods for studying asymptotic properties of these inference methods. A simulation study is provided to illustrate our consistency results together with an application to real data.




mod

Joint Modeling of Longitudinal Relational Data and Exogenous Variables

Rajarshi Guhaniyogi, Abel Rodriguez.

Source: Bayesian Analysis, Volume 15, Number 2, 477--503.

Abstract:
This article proposes a framework based on shared, time varying stochastic latent factor models for modeling relational data in which network and node-attributes co-evolve over time. Our proposed framework is flexible enough to handle both categorical and continuous attributes, allows us to estimate the dimension of the latent social space, and automatically yields Bayesian hypothesis tests for the association between network structure and nodal attributes. Additionally, the model is easy to compute and readily yields inference and prediction for missing link between nodes. We employ our model framework to study co-evolution of international relations between 22 countries and the country specific indicators over a period of 11 years.




mod

Bayesian Inference in Nonparanormal Graphical Models

Jami J. Mulgrave, Subhashis Ghosal.

Source: Bayesian Analysis, Volume 15, Number 2, 449--475.

Abstract:
Gaussian graphical models have been used to study intrinsic dependence among several variables, but the Gaussianity assumption may be restrictive in many applications. A nonparanormal graphical model is a semiparametric generalization for continuous variables where it is assumed that the variables follow a Gaussian graphical model only after some unknown smooth monotone transformations on each of them. We consider a Bayesian approach in the nonparanormal graphical model by putting priors on the unknown transformations through a random series based on B-splines where the coefficients are ordered to induce monotonicity. A truncated normal prior leads to partial conjugacy in the model and is useful for posterior simulation using Gibbs sampling. On the underlying precision matrix of the transformed variables, we consider a spike-and-slab prior and use an efficient posterior Gibbs sampling scheme. We use the Bayesian Information Criterion to choose the hyperparameters for the spike-and-slab prior. We present a posterior consistency result on the underlying transformation and the precision matrix. We study the numerical performance of the proposed method through an extensive simulation study and finally apply the proposed method on a real data set.




mod

Additive Multivariate Gaussian Processes for Joint Species Distribution Modeling with Heterogeneous Data

Jarno Vanhatalo, Marcelo Hartmann, Lari Veneranta.

Source: Bayesian Analysis, Volume 15, Number 2, 415--447.

Abstract:
Species distribution models (SDM) are a key tool in ecology, conservation and management of natural resources. Two key components of the state-of-the-art SDMs are the description for species distribution response along environmental covariates and the spatial random effect that captures deviations from the distribution patterns explained by environmental covariates. Joint species distribution models (JSDMs) additionally include interspecific correlations which have been shown to improve their descriptive and predictive performance compared to single species models. However, current JSDMs are restricted to hierarchical generalized linear modeling framework. Their limitation is that parametric models have trouble in explaining changes in abundance due, for example, highly non-linear physical tolerance limits which is particularly important when predicting species distribution in new areas or under scenarios of environmental change. On the other hand, semi-parametric response functions have been shown to improve the predictive performance of SDMs in these tasks in single species models. Here, we propose JSDMs where the responses to environmental covariates are modeled with additive multivariate Gaussian processes coded as linear models of coregionalization. These allow inference for wide range of functional forms and interspecific correlations between the responses. We propose also an efficient approach for inference with Laplace approximation and parameterization of the interspecific covariance matrices on the Euclidean space. We demonstrate the benefits of our model with two small scale examples and one real world case study. We use cross-validation to compare the proposed model to analogous semi-parametric single species models and parametric single and joint species models in interpolation and extrapolation tasks. The proposed model outperforms the alternative models in all cases. We also show that the proposed model can be seen as an extension of the current state-of-the-art JSDMs to semi-parametric models.




mod

Dynamic Quantile Linear Models: A Bayesian Approach

Kelly C. M. Gonçalves, Hélio S. Migon, Leonardo S. Bastos.

Source: Bayesian Analysis, Volume 15, Number 2, 335--362.

Abstract:
The paper introduces a new class of models, named dynamic quantile linear models, which combines dynamic linear models with distribution-free quantile regression producing a robust statistical method. Bayesian estimation for the dynamic quantile linear model is performed using an efficient Markov chain Monte Carlo algorithm. The paper also proposes a fast sequential procedure suited for high-dimensional predictive modeling with massive data, where the generating process is changing over time. The proposed model is evaluated using synthetic and well-known time series data. The model is also applied to predict annual incidence of tuberculosis in the state of Rio de Janeiro and compared with global targets set by the World Health Organization.




mod

Learning Semiparametric Regression with Missing Covariates Using Gaussian Process Models

Abhishek Bishoyi, Xiaojing Wang, Dipak K. Dey.

Source: Bayesian Analysis, Volume 15, Number 1, 215--239.

Abstract:
Missing data often appear as a practical problem while applying classical models in the statistical analysis. In this paper, we consider a semiparametric regression model in the presence of missing covariates for nonparametric components under a Bayesian framework. As it is known that Gaussian processes are a popular tool in nonparametric regression because of their flexibility and the fact that much of the ensuing computation is parametric Gaussian computation. However, in the absence of covariates, the most frequently used covariance functions of a Gaussian process will not be well defined. We propose an imputation method to solve this issue and perform our analysis using Bayesian inference, where we specify the objective priors on the parameters of Gaussian process models. Several simulations are conducted to illustrate effectiveness of our proposed method and further, our method is exemplified via two real datasets, one through Langmuir equation, commonly used in pharmacokinetic models, and another through Auto-mpg data taken from the StatLib library.




mod

Adaptive Bayesian Nonparametric Regression Using a Kernel Mixture of Polynomials with Application to Partial Linear Models

Fangzheng Xie, Yanxun Xu.

Source: Bayesian Analysis, Volume 15, Number 1, 159--186.

Abstract:
We propose a kernel mixture of polynomials prior for Bayesian nonparametric regression. The regression function is modeled by local averages of polynomials with kernel mixture weights. We obtain the minimax-optimal contraction rate of the full posterior distribution up to a logarithmic factor by estimating metric entropies of certain function classes. Under the assumption that the degree of the polynomials is larger than the unknown smoothness level of the true function, the posterior contraction behavior can adapt to this smoothness level provided an upper bound is known. We also provide a frequentist sieve maximum likelihood estimator with a near-optimal convergence rate. We further investigate the application of the kernel mixture of polynomials to partial linear models and obtain both the near-optimal rate of contraction for the nonparametric component and the Bernstein-von Mises limit (i.e., asymptotic normality) of the parametric component. The proposed method is illustrated with numerical examples and shows superior performance in terms of computational efficiency, accuracy, and uncertainty quantification compared to the local polynomial regression, DiceKriging, and the robust Gaussian stochastic process.




mod

Bayesian Design of Experiments for Intractable Likelihood Models Using Coupled Auxiliary Models and Multivariate Emulation

Antony Overstall, James McGree.

Source: Bayesian Analysis, Volume 15, Number 1, 103--131.

Abstract:
A Bayesian design is given by maximising an expected utility over a design space. The utility is chosen to represent the aim of the experiment and its expectation is taken with respect to all unknowns: responses, parameters and/or models. Although straightforward in principle, there are several challenges to finding Bayesian designs in practice. Firstly, the utility and expected utility are rarely available in closed form and require approximation. Secondly, the design space can be of high-dimensionality. In the case of intractable likelihood models, these problems are compounded by the fact that the likelihood function, whose evaluation is required to approximate the expected utility, is not available in closed form. A strategy is proposed to find Bayesian designs for intractable likelihood models. It relies on the development of an automatic, auxiliary modelling approach, using multivariate Gaussian process emulators, to approximate the likelihood function. This is then combined with a copula-based approach to approximate the marginal likelihood (a quantity commonly required to evaluate many utility functions). These approximations are demonstrated on examples of stochastic process models involving experimental aims of both parameter estimation and model comparison.




mod

Scalable Bayesian Inference for the Inverse Temperature of a Hidden Potts Model

Matthew Moores, Geoff Nicholls, Anthony Pettitt, Kerrie Mengersen.

Source: Bayesian Analysis, Volume 15, Number 1, 1--27.

Abstract:
The inverse temperature parameter of the Potts model governs the strength of spatial cohesion and therefore has a major influence over the resulting model fit. A difficulty arises from the dependence of an intractable normalising constant on the value of this parameter and thus there is no closed-form solution for sampling from the posterior distribution directly. There is a variety of computational approaches for sampling from the posterior without evaluating the normalising constant, including the exchange algorithm and approximate Bayesian computation (ABC). A serious drawback of these algorithms is that they do not scale well for models with a large state space, such as images with a million or more pixels. We introduce a parametric surrogate model, which approximates the score function using an integral curve. Our surrogate model incorporates known properties of the likelihood, such as heteroskedasticity and critical temperature. We demonstrate this method using synthetic data as well as remotely-sensed imagery from the Landsat-8 satellite. We achieve up to a hundredfold improvement in the elapsed runtime, compared to the exchange algorithm or ABC. An open-source implementation of our algorithm is available in the R package bayesImageS .




mod

Hierarchical Normalized Completely Random Measures for Robust Graphical Modeling

Andrea Cremaschi, Raffaele Argiento, Katherine Shoemaker, Christine Peterson, Marina Vannucci.

Source: Bayesian Analysis, Volume 14, Number 4, 1271--1301.

Abstract:
Gaussian graphical models are useful tools for exploring network structures in multivariate normal data. In this paper we are interested in situations where data show departures from Gaussianity, therefore requiring alternative modeling distributions. The multivariate $t$ -distribution, obtained by dividing each component of the data vector by a gamma random variable, is a straightforward generalization to accommodate deviations from normality such as heavy tails. Since different groups of variables may be contaminated to a different extent, Finegold and Drton (2014) introduced the Dirichlet $t$ -distribution, where the divisors are clustered using a Dirichlet process. In this work, we consider a more general class of nonparametric distributions as the prior on the divisor terms, namely the class of normalized completely random measures (NormCRMs). To improve the effectiveness of the clustering, we propose modeling the dependence among the divisors through a nonparametric hierarchical structure, which allows for the sharing of parameters across the samples in the data set. This desirable feature enables us to cluster together different components of multivariate data in a parsimonious way. We demonstrate through simulations that this approach provides accurate graphical model inference, and apply it to a case study examining the dependence structure in radiomics data derived from The Cancer Imaging Atlas.




mod

Spatial Disease Mapping Using Directed Acyclic Graph Auto-Regressive (DAGAR) Models

Abhirup Datta, Sudipto Banerjee, James S. Hodges, Leiwen Gao.

Source: Bayesian Analysis, Volume 14, Number 4, 1221--1244.

Abstract:
Hierarchical models for regionally aggregated disease incidence data commonly involve region specific latent random effects that are modeled jointly as having a multivariate Gaussian distribution. The covariance or precision matrix incorporates the spatial dependence between the regions. Common choices for the precision matrix include the widely used ICAR model, which is singular, and its nonsingular extension which lacks interpretability. We propose a new parametric model for the precision matrix based on a directed acyclic graph (DAG) representation of the spatial dependence. Our model guarantees positive definiteness and, hence, in addition to being a valid prior for regional spatially correlated random effects, can also directly model the outcome from dependent data like images and networks. Theoretical results establish a link between the parameters in our model and the variance and covariances of the random effects. Simulation studies demonstrate that the improved interpretability of our model reaps benefits in terms of accurately recovering the latent spatial random effects as well as for inference on the spatial covariance parameters. Under modest spatial correlation, our model far outperforms the CAR models, while the performances are similar when the spatial correlation is strong. We also assess sensitivity to the choice of the ordering in the DAG construction using theoretical and empirical results which testify to the robustness of our model. We also present a large-scale public health application demonstrating the competitive performance of the model.




mod

Estimating the Use of Public Lands: Integrated Modeling of Open Populations with Convolution Likelihood Ecological Abundance Regression

Lutz F. Gruber, Erica F. Stuber, Lyndsie S. Wszola, Joseph J. Fontaine.

Source: Bayesian Analysis, Volume 14, Number 4, 1173--1199.

Abstract:
We present an integrated open population model where the population dynamics are defined by a differential equation, and the related statistical model utilizes a Poisson binomial convolution likelihood. Key advantages of the proposed approach over existing open population models include the flexibility to predict related, but unobserved quantities such as total immigration or emigration over a specified time period, and more computationally efficient posterior simulation by elimination of the need to explicitly simulate latent immigration and emigration. The viability of the proposed method is shown in an in-depth analysis of outdoor recreation participation on public lands, where the surveyed populations changed rapidly and demographic population closure cannot be assumed even within a single day.




mod

Bayes Factors for Partially Observed Stochastic Epidemic Models

Muteb Alharthi, Theodore Kypraios, Philip D. O’Neill.

Source: Bayesian Analysis, Volume 14, Number 3, 927--956.

Abstract:
We consider the problem of model choice for stochastic epidemic models given partial observation of a disease outbreak through time. Our main focus is on the use of Bayes factors. Although Bayes factors have appeared in the epidemic modelling literature before, they can be hard to compute and little attention has been given to fundamental questions concerning their utility. In this paper we derive analytic expressions for Bayes factors given complete observation through time, which suggest practical guidelines for model choice problems. We adapt the power posterior method for computing Bayes factors so as to account for missing data and apply this approach to partially observed epidemics. For comparison, we also explore the use of a deviance information criterion for missing data scenarios. The methods are illustrated via examples involving both simulated and real data.




mod

Probability Based Independence Sampler for Bayesian Quantitative Learning in Graphical Log-Linear Marginal Models

Ioannis Ntzoufras, Claudia Tarantola, Monia Lupparelli.

Source: Bayesian Analysis, Volume 14, Number 3, 797--823.

Abstract:
We introduce a novel Bayesian approach for quantitative learning for graphical log-linear marginal models. These models belong to curved exponential families that are difficult to handle from a Bayesian perspective. The likelihood cannot be analytically expressed as a function of the marginal log-linear interactions, but only in terms of cell counts or probabilities. Posterior distributions cannot be directly obtained, and Markov Chain Monte Carlo (MCMC) methods are needed. Finally, a well-defined model requires parameter values that lead to compatible marginal probabilities. Hence, any MCMC should account for this important restriction. We construct a fully automatic and efficient MCMC strategy for quantitative learning for such models that handles these problems. While the prior is expressed in terms of the marginal log-linear interactions, we build an MCMC algorithm that employs a proposal on the probability parameter space. The corresponding proposal on the marginal log-linear interactions is obtained via parameter transformation. We exploit a conditional conjugate setup to build an efficient proposal on probability parameters. The proposed methodology is illustrated by a simulation study and a real dataset.




mod

Semiparametric Multivariate and Multiple Change-Point Modeling

Stefano Peluso, Siddhartha Chib, Antonietta Mira.

Source: Bayesian Analysis, Volume 14, Number 3, 727--751.

Abstract:
We develop a general Bayesian semiparametric change-point model in which separate groups of structural parameters (for example, location and dispersion parameters) can each follow a separate multiple change-point process, driven by time-dependent transition matrices among the latent regimes. The distribution of the observations within regimes is unknown and given by a Dirichlet process mixture prior. The properties of the proposed model are studied theoretically through the analysis of inter-arrival times and of the number of change-points in a given time interval. The prior-posterior analysis by Markov chain Monte Carlo techniques is developed on a forward-backward algorithm for sampling the various regime indicators. Analysis with simulated data under various scenarios and an application to short-term interest rates are used to show the generality and usefulness of the proposed model.




mod

Model Criticism in Latent Space

Sohan Seth, Iain Murray, Christopher K. I. Williams.

Source: Bayesian Analysis, Volume 14, Number 3, 703--725.

Abstract:
Model criticism is usually carried out by assessing if replicated data generated under the fitted model looks similar to the observed data, see e.g. Gelman, Carlin, Stern, and Rubin (2004, p. 165). This paper presents a method for latent variable models by pulling back the data into the space of latent variables, and carrying out model criticism in that space. Making use of a model's structure enables a more direct assessment of the assumptions made in the prior and likelihood. We demonstrate the method with examples of model criticism in latent space applied to factor analysis, linear dynamical systems and Gaussian processes.




mod

Low Information Omnibus (LIO) Priors for Dirichlet Process Mixture Models

Yushu Shi, Michael Martens, Anjishnu Banerjee, Purushottam Laud.

Source: Bayesian Analysis, Volume 14, Number 3, 677--702.

Abstract:
Dirichlet process mixture (DPM) models provide flexible modeling for distributions of data as an infinite mixture of distributions from a chosen collection. Specifying priors for these models in individual data contexts can be challenging. In this paper, we introduce a scheme which requires the investigator to specify only simple scaling information. This is used to transform the data to a fixed scale on which a low information prior is constructed. Samples from the posterior with the rescaled data are transformed back for inference on the original scale. The low information prior is selected to provide a wide variety of components for the DPM to generate flexible distributions for the data on the fixed scale. The method can be applied to all DPM models with kernel functions closed under a suitable scaling transformation. Construction of the low information prior, however, is kernel dependent. Using DPM-of-Gaussians and DPM-of-Weibulls models as examples, we show that the method provides accurate estimates of a diverse collection of distributions that includes skewed, multimodal, and highly dispersed members. With the recommended priors, repeated data simulations show performance comparable to that of standard empirical estimates. Finally, we show weak convergence of posteriors with the proposed priors for both kernels considered.




mod

Efficient Acquisition Rules for Model-Based Approximate Bayesian Computation

Marko Järvenpää, Michael U. Gutmann, Arijus Pleska, Aki Vehtari, Pekka Marttinen.

Source: Bayesian Analysis, Volume 14, Number 2, 595--622.

Abstract:
Approximate Bayesian computation (ABC) is a method for Bayesian inference when the likelihood is unavailable but simulating from the model is possible. However, many ABC algorithms require a large number of simulations, which can be costly. To reduce the computational cost, Bayesian optimisation (BO) and surrogate models such as Gaussian processes have been proposed. Bayesian optimisation enables one to intelligently decide where to evaluate the model next but common BO strategies are not designed for the goal of estimating the posterior distribution. Our paper addresses this gap in the literature. We propose to compute the uncertainty in the ABC posterior density, which is due to a lack of simulations to estimate this quantity accurately, and define a loss function that measures this uncertainty. We then propose to select the next evaluation location to minimise the expected loss. Experiments show that the proposed method often produces the most accurate approximations as compared to common BO strategies.




mod

Fast Model-Fitting of Bayesian Variable Selection Regression Using the Iterative Complex Factorization Algorithm

Quan Zhou, Yongtao Guan.

Source: Bayesian Analysis, Volume 14, Number 2, 573--594.

Abstract:
Bayesian variable selection regression (BVSR) is able to jointly analyze genome-wide genetic datasets, but the slow computation via Markov chain Monte Carlo (MCMC) hampered its wide-spread usage. Here we present a novel iterative method to solve a special class of linear systems, which can increase the speed of the BVSR model-fitting tenfold. The iterative method hinges on the complex factorization of the sum of two matrices and the solution path resides in the complex domain (instead of the real domain). Compared to the Gauss-Seidel method, the complex factorization converges almost instantaneously and its error is several magnitude smaller than that of the Gauss-Seidel method. More importantly, the error is always within the pre-specified precision while the Gauss-Seidel method is not. For large problems with thousands of covariates, the complex factorization is 10–100 times faster than either the Gauss-Seidel method or the direct method via the Cholesky decomposition. In BVSR, one needs to repetitively solve large penalized regression systems whose design matrices only change slightly between adjacent MCMC steps. This slight change in design matrix enables the adaptation of the iterative complex factorization method. The computational innovation will facilitate the wide-spread use of BVSR in reanalyzing genome-wide association datasets.




mod

A Bayesian Nonparametric Spiked Process Prior for Dynamic Model Selection

Alberto Cassese, Weixuan Zhu, Michele Guindani, Marina Vannucci.

Source: Bayesian Analysis, Volume 14, Number 2, 553--572.

Abstract:
In many applications, investigators monitor processes that vary in space and time, with the goal of identifying temporally persistent and spatially localized departures from a baseline or “normal” behavior. In this manuscript, we consider the monitoring of pneumonia and influenza (P&I) mortality, to detect influenza outbreaks in the continental United States, and propose a Bayesian nonparametric model selection approach to take into account the spatio-temporal dependence of outbreaks. More specifically, we introduce a zero-inflated conditionally identically distributed species sampling prior which allows borrowing information across time and to assign data to clusters associated to either a null or an alternate process. Spatial dependences are accounted for by means of a Markov random field prior, which allows to inform the selection based on inferences conducted at nearby locations. We show how the proposed modeling framework performs in an application to the P&I mortality data and in a simulation study, and compare with common threshold methods for detecting outbreaks over time, with more recent Markov switching based models, and with spike-and-slab Bayesian nonparametric priors that do not take into account spatio-temporal dependence.




mod

Analysis of the Maximal a Posteriori Partition in the Gaussian Dirichlet Process Mixture Model

Łukasz Rajkowski.

Source: Bayesian Analysis, Volume 14, Number 2, 477--494.

Abstract:
Mixture models are a natural choice in many applications, but it can be difficult to place an a priori upper bound on the number of components. To circumvent this, investigators are turning increasingly to Dirichlet process mixture models (DPMMs). It is therefore important to develop an understanding of the strengths and weaknesses of this approach. This work considers the MAP (maximum a posteriori) clustering for the Gaussian DPMM (where the cluster means have Gaussian distribution and, for each cluster, the observations within the cluster have Gaussian distribution). Some desirable properties of the MAP partition are proved: ‘almost disjointness’ of the convex hulls of clusters (they may have at most one point in common) and (with natural assumptions) the comparability of sizes of those clusters that intersect any fixed ball with the number of observations (as the latter goes to infinity). Consequently, the number of such clusters remains bounded. Furthermore, if the data arises from independent identically distributed sampling from a given distribution with bounded support then the asymptotic MAP partition of the observation space maximises a function which has a straightforward expression, which depends only on the within-group covariance parameter. As the operator norm of this covariance parameter decreases, the number of clusters in the MAP partition becomes arbitrarily large, which may lead to the overestimation of the number of mixture components.




mod

Efficient Bayesian Regularization for Graphical Model Selection

Suprateek Kundu, Bani K. Mallick, Veera Baladandayuthapani.

Source: Bayesian Analysis, Volume 14, Number 2, 449--476.

Abstract:
There has been an intense development in the Bayesian graphical model literature over the past decade; however, most of the existing methods are restricted to moderate dimensions. We propose a novel graphical model selection approach for large dimensional settings where the dimension increases with the sample size, by decoupling model fitting and covariance selection. First, a full model based on a complete graph is fit under a novel class of mixtures of inverse–Wishart priors, which induce shrinkage on the precision matrix under an equivalence with Cholesky-based regularization, while enabling conjugate updates. Subsequently, a post-fitting model selection step uses penalized joint credible regions to perform model selection. This allows our methods to be computationally feasible for large dimensional settings using a combination of straightforward Gibbs samplers and efficient post-fitting inferences. Theoretical guarantees in terms of selection consistency are also established. Simulations show that the proposed approach compares favorably with competing methods, both in terms of accuracy metrics and computation times. We apply this approach to a cancer genomics data example.




mod

Variational Message Passing for Elaborate Response Regression Models

M. W. McLean, M. P. Wand.

Source: Bayesian Analysis, Volume 14, Number 2, 371--398.

Abstract:
We build on recent work concerning message passing approaches to approximate fitting and inference for arbitrarily large regression models. The focus is on regression models where the response variable is modeled to have an elaborate distribution, which is loosely defined to mean a distribution that is more complicated than common distributions such as those in the Bernoulli, Poisson and Normal families. Examples of elaborate response families considered here are the Negative Binomial and $t$ families. Variational message passing is more challenging due to some of the conjugate exponential families being non-standard and numerical integration being needed. Nevertheless, a factor graph fragment approach means the requisite calculations only need to be done once for a particular elaborate response distribution family. Computer code can be compartmentalized, including that involving numerical integration. A major finding of this work is that the modularity of variational message passing extends to elaborate response regression models.




mod

Modeling Population Structure Under Hierarchical Dirichlet Processes

Lloyd T. Elliott, Maria De Iorio, Stefano Favaro, Kaustubh Adhikari, Yee Whye Teh.

Source: Bayesian Analysis, Volume 14, Number 2, 313--339.

Abstract:
We propose a Bayesian nonparametric model to infer population admixture, extending the hierarchical Dirichlet process to allow for correlation between loci due to linkage disequilibrium. Given multilocus genotype data from a sample of individuals, the proposed model allows inferring and classifying individuals as unadmixed or admixed, inferring the number of subpopulations ancestral to an admixed population and the population of origin of chromosomal regions. Our model does not assume any specific mutation process, and can be applied to most of the commonly used genetic markers. We present a Markov chain Monte Carlo (MCMC) algorithm to perform posterior inference from the model and we discuss some methods to summarize the MCMC output for the analysis of population admixture. Finally, we demonstrate the performance of the proposed model in a real application, using genetic data from the ectodysplasin-A receptor (EDAR) gene, which is considered to be ancestry-informative due to well-known variations in allele frequency as well as phenotypic effects across ancestry. The structure analysis of this dataset leads to the identification of a rare haplotype in Europeans. We also conduct a simulated experiment and show that our algorithm outperforms parametric methods.




mod

A Tale of Two Parasites: Statistical Modelling to Support Disease Control Programmes in Africa

Peter J. Diggle, Emanuele Giorgi, Julienne Atsame, Sylvie Ntsame Ella, Kisito Ogoussan, Katherine Gass.

Source: Statistical Science, Volume 35, Number 1, 42--50.

Abstract:
Vector-borne diseases have long presented major challenges to the health of rural communities in the wet tropical regions of the world, but especially in sub-Saharan Africa. In this paper, we describe the contribution that statistical modelling has made to the global elimination programme for one vector-borne disease, onchocerciasis. We explain why information on the spatial distribution of a second vector-borne disease, Loa loa, is needed before communities at high risk of onchocerciasis can be treated safely with mass distribution of ivermectin, an antifiarial medication. We show how a model-based geostatistical analysis of Loa loa prevalence survey data can be used to map the predictive probability that each location in the region of interest meets a WHO policy guideline for safe mass distribution of ivermectin and describe two applications: one is to data from Cameroon that assesses prevalence using traditional blood-smear microscopy; the other is to Africa-wide data that uses a low-cost questionnaire-based method. We describe how a recent technological development in image-based microscopy has resulted in a change of emphasis from prevalence alone to the bivariate spatial distribution of prevalence and the intensity of infection among infected individuals. We discuss how statistical modelling of the kind described here can contribute to health policy guidelines and decision-making in two ways. One is to ensure that, in a resource-limited setting, prevalence surveys are designed, and the resulting data analysed, as efficiently as possible. The other is to provide an honest quantification of the uncertainty attached to any binary decision by reporting predictive probabilities that a policy-defined condition for action is or is not met.




mod

Risk Models for Breast Cancer and Their Validation

Adam R. Brentnall, Jack Cuzick.

Source: Statistical Science, Volume 35, Number 1, 14--30.

Abstract:
Strategies to prevent cancer and diagnose it early when it is most treatable are needed to reduce the public health burden from rising disease incidence. Risk assessment is playing an increasingly important role in targeting individuals in need of such interventions. For breast cancer many individual risk factors have been well understood for a long time, but the development of a fully comprehensive risk model has not been straightforward, in part because there have been limited data where joint effects of an extensive set of risk factors may be estimated with precision. In this article we first review the approach taken to develop the IBIS (Tyrer–Cuzick) model, and describe recent updates. We then review and develop methods to assess calibration of models such as this one, where the risk of disease allowing for competing mortality over a long follow-up time or lifetime is estimated. The breast cancer risk model model and calibration assessment methods are demonstrated using a cohort of 132,139 women attending mammography screening in the State of Washington, USA.




mod

Model-Based Approach to the Joint Analysis of Single-Cell Data on Chromatin Accessibility and Gene Expression

Zhixiang Lin, Mahdi Zamanighomi, Timothy Daley, Shining Ma, Wing Hung Wong.

Source: Statistical Science, Volume 35, Number 1, 2--13.

Abstract:
Unsupervised methods, including clustering methods, are essential to the analysis of single-cell genomic data. Model-based clustering methods are under-explored in the area of single-cell genomics, and have the advantage of quantifying the uncertainty of the clustering result. Here we develop a model-based approach for the integrative analysis of single-cell chromatin accessibility and gene expression data. We show that combining these two types of data, we can achieve a better separation of the underlying cell types. An efficient Markov chain Monte Carlo algorithm is also developed.




mod

Gaussianization Machines for Non-Gaussian Function Estimation Models

T. Tony Cai.

Source: Statistical Science, Volume 34, Number 4, 635--656.

Abstract:
A wide range of nonparametric function estimation models have been studied individually in the literature. Among them the homoscedastic nonparametric Gaussian regression is arguably the best known and understood. Inspired by the asymptotic equivalence theory, Brown, Cai and Zhou ( Ann. Statist. 36 (2008) 2055–2084; Ann. Statist. 38 (2010) 2005–2046) and Brown et al. ( Probab. Theory Related Fields 146 (2010) 401–433) developed a unified approach to turn a collection of non-Gaussian function estimation models into a standard Gaussian regression and any good Gaussian nonparametric regression method can then be used. These Gaussianization Machines have two key components, binning and transformation. When combined with BlockJS, a wavelet thresholding procedure for Gaussian regression, the procedures are computationally efficient with strong theoretical guarantees. Technical analysis given in Brown, Cai and Zhou ( Ann. Statist. 36 (2008) 2055–2084; Ann. Statist. 38 (2010) 2005–2046) and Brown et al. ( Probab. Theory Related Fields 146 (2010) 401–433) shows that the estimators attain the optimal rate of convergence adaptively over a large set of Besov spaces and across a collection of non-Gaussian function estimation models, including robust nonparametric regression, density estimation, and nonparametric regression in exponential families. The estimators are also spatially adaptive. The Gaussianization Machines significantly extend the flexibility and scope of the theories and methodologies originally developed for the conventional nonparametric Gaussian regression. This article aims to provide a concise account of the Gaussianization Machines developed in Brown, Cai and Zhou ( Ann. Statist. 36 (2008) 2055–2084; Ann. Statist. 38 (2010) 2005–2046), Brown et al. ( Probab. Theory Related Fields 146 (2010) 401–433).




mod

Models as Approximations—Rejoinder

Andreas Buja, Arun Kumar Kuchibhotla, Richard Berk, Edward George, Eric Tchetgen Tchetgen, Linda Zhao.

Source: Statistical Science, Volume 34, Number 4, 606--620.

Abstract:
We respond to the discussants of our articles emphasizing the importance of inference under misspecification in the context of the reproducibility/replicability crisis. Along the way, we discuss the roles of diagnostics and model building in regression as well as connections between our well-specification framework and semiparametric theory.




mod

Discussion: Models as Approximations

Dalia Ghanem, Todd A. Kuffner.

Source: Statistical Science, Volume 34, Number 4, 604--605.




mod

Comment: Models as (Deliberate) Approximations

David Whitney, Ali Shojaie, Marco Carone.

Source: Statistical Science, Volume 34, Number 4, 591--598.




mod

Comment: Models Are Approximations!

Anthony C. Davison, Erwan Koch, Jonathan Koh.

Source: Statistical Science, Volume 34, Number 4, 584--590.

Abstract:
This discussion focuses on areas of disagreement with the papers, particularly the target of inference and the case for using the robust ‘sandwich’ variance estimator in the presence of moderate mis-specification. We also suggest that existing procedures may be appreciably more powerful for detecting mis-specification than the authors’ RAV statistic, and comment on the use of the pairs bootstrap in balanced situations.




mod

Comment: “Models as Approximations I: Consequences Illustrated with Linear Regression” by A. Buja, R. Berk, L. Brown, E. George, E. Pitkin, L. Zhan and K. Zhang

Roderick J. Little.

Source: Statistical Science, Volume 34, Number 4, 580--583.




mod

Discussion of Models as Approximations I & II

Dag Tjøstheim.

Source: Statistical Science, Volume 34, Number 4, 575--579.




mod

Comment: Models as Approximations

Nikki L. B. Freeman, Xiaotong Jiang, Owen E. Leete, Daniel J. Luckett, Teeranan Pokaprakarn, Michael R. Kosorok.

Source: Statistical Science, Volume 34, Number 4, 572--574.




mod

Comment on Models as Approximations, Parts I and II, by Buja et al.

Jerald F. Lawless.

Source: Statistical Science, Volume 34, Number 4, 569--571.

Abstract:
I comment on the papers Models as Approximations I and II, by A. Buja, R. Berk, L. Brown, E. George, E. Pitkin, M. Traskin, L. Zhao and K. Zhang.