al

A Novel Algorithmic Approach to Bayesian Logic Regression (with Discussion)

Aliaksandr Hubin, Geir Storvik, Florian Frommlet.

Source: Bayesian Analysis, Volume 15, Number 1, 263--333.

Abstract:
Logic regression was developed more than a decade ago as a tool to construct predictors from Boolean combinations of binary covariates. It has been mainly used to model epistatic effects in genetic association studies, which is very appealing due to the intuitive interpretation of logic expressions to describe the interaction between genetic variations. Nevertheless logic regression has (partly due to computational challenges) remained less well known than other approaches to epistatic association mapping. Here we will adapt an advanced evolutionary algorithm called GMJMCMC (Genetically modified Mode Jumping Markov Chain Monte Carlo) to perform Bayesian model selection in the space of logic regression models. After describing the algorithmic details of GMJMCMC we perform a comprehensive simulation study that illustrates its performance given logic regression terms of various complexity. Specifically GMJMCMC is shown to be able to identify three-way and even four-way interactions with relatively large power, a level of complexity which has not been achieved by previous implementations of logic regression. We apply GMJMCMC to reanalyze QTL (quantitative trait locus) mapping data for Recombinant Inbred Lines in Arabidopsis thaliana and from a backcross population in Drosophila where we identify several interesting epistatic effects. The method is implemented in an R package which is available on github.




al

High-Dimensional Posterior Consistency for Hierarchical Non-Local Priors in Regression

Xuan Cao, Kshitij Khare, Malay Ghosh.

Source: Bayesian Analysis, Volume 15, Number 1, 241--262.

Abstract:
The choice of tuning parameters in Bayesian variable selection is a critical problem in modern statistics. In particular, for Bayesian linear regression with non-local priors, the scale parameter in the non-local prior density is an important tuning parameter which reflects the dispersion of the non-local prior density around zero, and implicitly determines the size of the regression coefficients that will be shrunk to zero. Current approaches treat the scale parameter as given, and suggest choices based on prior coverage/asymptotic considerations. In this paper, we consider the fully Bayesian approach introduced in (Wu, 2016) with the pMOM non-local prior and an appropriate Inverse-Gamma prior on the tuning parameter to analyze the underlying theoretical property. Under standard regularity assumptions, we establish strong model selection consistency in a high-dimensional setting, where $p$ is allowed to increase at a polynomial rate with $n$ or even at a sub-exponential rate with $n$ . Through simulation studies, we demonstrate that our model selection procedure can outperform other Bayesian methods which treat the scale parameter as given, and commonly used penalized likelihood methods, in a range of simulation settings.




al

Determinantal Point Process Mixtures Via Spectral Density Approach

Ilaria Bianchini, Alessandra Guglielmi, Fernando A. Quintana.

Source: Bayesian Analysis, Volume 15, Number 1, 187--214.

Abstract:
We consider mixture models where location parameters are a priori encouraged to be well separated. We explore a class of determinantal point process (DPP) mixture models, which provide the desired notion of separation or repulsion. Instead of using the rather restrictive case where analytical results are partially available, we adopt a spectral representation from which approximations to the DPP density functions can be readily computed. For the sake of concreteness the presentation focuses on a power exponential spectral density, but the proposed approach is in fact quite general. We later extend our model to incorporate covariate information in the likelihood and also in the assignment to mixture components, yielding a trade-off between repulsiveness of locations in the mixtures and attraction among subjects with similar covariates. We develop full Bayesian inference, and explore model properties and posterior behavior using several simulation scenarios and data illustrations. Supplementary materials for this article are available online (Bianchini et al., 2019).




al

Adaptive Bayesian Nonparametric Regression Using a Kernel Mixture of Polynomials with Application to Partial Linear Models

Fangzheng Xie, Yanxun Xu.

Source: Bayesian Analysis, Volume 15, Number 1, 159--186.

Abstract:
We propose a kernel mixture of polynomials prior for Bayesian nonparametric regression. The regression function is modeled by local averages of polynomials with kernel mixture weights. We obtain the minimax-optimal contraction rate of the full posterior distribution up to a logarithmic factor by estimating metric entropies of certain function classes. Under the assumption that the degree of the polynomials is larger than the unknown smoothness level of the true function, the posterior contraction behavior can adapt to this smoothness level provided an upper bound is known. We also provide a frequentist sieve maximum likelihood estimator with a near-optimal convergence rate. We further investigate the application of the kernel mixture of polynomials to partial linear models and obtain both the near-optimal rate of contraction for the nonparametric component and the Bernstein-von Mises limit (i.e., asymptotic normality) of the parametric component. The proposed method is illustrated with numerical examples and shows superior performance in terms of computational efficiency, accuracy, and uncertainty quantification compared to the local polynomial regression, DiceKriging, and the robust Gaussian stochastic process.




al

Detecting Structural Changes in Longitudinal Network Data

Jong Hee Park, Yunkyu Sohn.

Source: Bayesian Analysis, Volume 15, Number 1, 133--157.

Abstract:
Dynamic modeling of longitudinal networks has been an increasingly important topic in applied research. While longitudinal network data commonly exhibit dramatic changes in its structures, existing methods have largely focused on modeling smooth topological changes over time. In this paper, we develop a hidden Markov network change-point model (HNC) that combines the multilinear tensor regression model (Hoff, 2011) with a hidden Markov model using Bayesian inference. We model changes in network structure as shifts in discrete states yielding particular sets of network generating parameters. Our simulation results demonstrate that the proposed method correctly detects the number, locations, and types of changes in latent node characteristics. We apply the proposed method to international military alliance networks to find structural changes in the coalition structure among nations.




al

The Bayesian Update: Variational Formulations and Gradient Flows

Nicolas Garcia Trillos, Daniel Sanz-Alonso.

Source: Bayesian Analysis, Volume 15, Number 1, 29--56.

Abstract:
The Bayesian update can be viewed as a variational problem by characterizing the posterior as the minimizer of a functional. The variational viewpoint is far from new and is at the heart of popular methods for posterior approximation. However, some of its consequences seem largely unexplored. We focus on the following one: defining the posterior as the minimizer of a functional gives a natural path towards the posterior by moving in the direction of steepest descent of the functional. This idea is made precise through the theory of gradient flows, allowing to bring new tools to the study of Bayesian models and algorithms. Since the posterior may be characterized as the minimizer of different functionals, several variational formulations may be considered. We study three of them and their three associated gradient flows. We show that, in all cases, the rate of convergence of the flows to the posterior can be bounded by the geodesic convexity of the functional to be minimized. Each gradient flow naturally suggests a nonlinear diffusion with the posterior as invariant distribution. These diffusions may be discretized to build proposals for Markov chain Monte Carlo (MCMC) algorithms. By construction, the diffusions are guaranteed to satisfy a certain optimality condition, and rates of convergence are given by the convexity of the functionals. We use this observation to propose a criterion for the choice of metric in Riemannian MCMC methods.




al

Scalable Bayesian Inference for the Inverse Temperature of a Hidden Potts Model

Matthew Moores, Geoff Nicholls, Anthony Pettitt, Kerrie Mengersen.

Source: Bayesian Analysis, Volume 15, Number 1, 1--27.

Abstract:
The inverse temperature parameter of the Potts model governs the strength of spatial cohesion and therefore has a major influence over the resulting model fit. A difficulty arises from the dependence of an intractable normalising constant on the value of this parameter and thus there is no closed-form solution for sampling from the posterior distribution directly. There is a variety of computational approaches for sampling from the posterior without evaluating the normalising constant, including the exchange algorithm and approximate Bayesian computation (ABC). A serious drawback of these algorithms is that they do not scale well for models with a large state space, such as images with a million or more pixels. We introduce a parametric surrogate model, which approximates the score function using an integral curve. Our surrogate model incorporates known properties of the likelihood, such as heteroskedasticity and critical temperature. We demonstrate this method using synthetic data as well as remotely-sensed imagery from the Landsat-8 satellite. We achieve up to a hundredfold improvement in the elapsed runtime, compared to the exchange algorithm or ABC. An open-source implementation of our algorithm is available in the R package bayesImageS .




al

Hierarchical Normalized Completely Random Measures for Robust Graphical Modeling

Andrea Cremaschi, Raffaele Argiento, Katherine Shoemaker, Christine Peterson, Marina Vannucci.

Source: Bayesian Analysis, Volume 14, Number 4, 1271--1301.

Abstract:
Gaussian graphical models are useful tools for exploring network structures in multivariate normal data. In this paper we are interested in situations where data show departures from Gaussianity, therefore requiring alternative modeling distributions. The multivariate $t$ -distribution, obtained by dividing each component of the data vector by a gamma random variable, is a straightforward generalization to accommodate deviations from normality such as heavy tails. Since different groups of variables may be contaminated to a different extent, Finegold and Drton (2014) introduced the Dirichlet $t$ -distribution, where the divisors are clustered using a Dirichlet process. In this work, we consider a more general class of nonparametric distributions as the prior on the divisor terms, namely the class of normalized completely random measures (NormCRMs). To improve the effectiveness of the clustering, we propose modeling the dependence among the divisors through a nonparametric hierarchical structure, which allows for the sharing of parameters across the samples in the data set. This desirable feature enables us to cluster together different components of multivariate data in a parsimonious way. We demonstrate through simulations that this approach provides accurate graphical model inference, and apply it to a case study examining the dependence structure in radiomics data derived from The Cancer Imaging Atlas.




al

Calibration Procedures for Approximate Bayesian Credible Sets

Jeong Eun Lee, Geoff K. Nicholls, Robin J. Ryder.

Source: Bayesian Analysis, Volume 14, Number 4, 1245--1269.

Abstract:
We develop and apply two calibration procedures for checking the coverage of approximate Bayesian credible sets, including intervals estimated using Monte Carlo methods. The user has an ideal prior and likelihood, but generates a credible set for an approximate posterior based on some approximate prior and likelihood. We estimate the realised posterior coverage achieved by the approximate credible set. This is the coverage of the unknown “true” parameter if the data are a realisation of the user’s ideal observation model conditioned on the parameter, and the parameter is a draw from the user’s ideal prior. In one approach we estimate the posterior coverage at the data by making a semi-parametric logistic regression of binary coverage outcomes on simulated data against summary statistics evaluated on simulated data. In another we use Importance Sampling from the approximate posterior, windowing simulated data to fall close to the observed data. We illustrate our methods on four examples.




al

Spatial Disease Mapping Using Directed Acyclic Graph Auto-Regressive (DAGAR) Models

Abhirup Datta, Sudipto Banerjee, James S. Hodges, Leiwen Gao.

Source: Bayesian Analysis, Volume 14, Number 4, 1221--1244.

Abstract:
Hierarchical models for regionally aggregated disease incidence data commonly involve region specific latent random effects that are modeled jointly as having a multivariate Gaussian distribution. The covariance or precision matrix incorporates the spatial dependence between the regions. Common choices for the precision matrix include the widely used ICAR model, which is singular, and its nonsingular extension which lacks interpretability. We propose a new parametric model for the precision matrix based on a directed acyclic graph (DAG) representation of the spatial dependence. Our model guarantees positive definiteness and, hence, in addition to being a valid prior for regional spatially correlated random effects, can also directly model the outcome from dependent data like images and networks. Theoretical results establish a link between the parameters in our model and the variance and covariances of the random effects. Simulation studies demonstrate that the improved interpretability of our model reaps benefits in terms of accurately recovering the latent spatial random effects as well as for inference on the spatial covariance parameters. Under modest spatial correlation, our model far outperforms the CAR models, while the performances are similar when the spatial correlation is strong. We also assess sensitivity to the choice of the ordering in the DAG construction using theoretical and empirical results which testify to the robustness of our model. We also present a large-scale public health application demonstrating the competitive performance of the model.




al

Estimating the Use of Public Lands: Integrated Modeling of Open Populations with Convolution Likelihood Ecological Abundance Regression

Lutz F. Gruber, Erica F. Stuber, Lyndsie S. Wszola, Joseph J. Fontaine.

Source: Bayesian Analysis, Volume 14, Number 4, 1173--1199.

Abstract:
We present an integrated open population model where the population dynamics are defined by a differential equation, and the related statistical model utilizes a Poisson binomial convolution likelihood. Key advantages of the proposed approach over existing open population models include the flexibility to predict related, but unobserved quantities such as total immigration or emigration over a specified time period, and more computationally efficient posterior simulation by elimination of the need to explicitly simulate latent immigration and emigration. The viability of the proposed method is shown in an in-depth analysis of outdoor recreation participation on public lands, where the surveyed populations changed rapidly and demographic population closure cannot be assumed even within a single day.




al

Bayesian Functional Forecasting with Locally-Autoregressive Dependent Processes

Guillaume Kon Kam King, Antonio Canale, Matteo Ruggiero.

Source: Bayesian Analysis, Volume 14, Number 4, 1121--1141.

Abstract:
Motivated by the problem of forecasting demand and offer curves, we introduce a class of nonparametric dynamic models with locally-autoregressive behaviour, and provide a full inferential strategy for forecasting time series of piecewise-constant non-decreasing functions over arbitrary time horizons. The model is induced by a non Markovian system of interacting particles whose evolution is governed by a resampling step and a drift mechanism. The former is based on a global interaction and accounts for the volatility of the functional time series, while the latter is determined by a neighbourhood-based interaction with the past curves and accounts for local trend behaviours, separating these from pure noise. We discuss the implementation of the model for functional forecasting by combining a population Monte Carlo and a semi-automatic learning approach to approximate Bayesian computation which require limited tuning. We validate the inference method with a simulation study, and carry out predictive inference on a real dataset on the Italian natural gas market.




al

Variance Prior Forms for High-Dimensional Bayesian Variable Selection

Gemma E. Moran, Veronika Ročková, Edward I. George.

Source: Bayesian Analysis, Volume 14, Number 4, 1091--1119.

Abstract:
Consider the problem of high dimensional variable selection for the Gaussian linear model when the unknown error variance is also of interest. In this paper, we show that the use of conjugate shrinkage priors for Bayesian variable selection can have detrimental consequences for such variance estimation. Such priors are often motivated by the invariance argument of Jeffreys (1961). Revisiting this work, however, we highlight a caveat that Jeffreys himself noticed; namely that biased estimators can result from inducing dependence between parameters a priori . In a similar way, we show that conjugate priors for linear regression, which induce prior dependence, can lead to such underestimation in the Bayesian high-dimensional regression setting. Following Jeffreys, we recommend as a remedy to treat regression coefficients and the error variance as independent a priori . Using such an independence prior framework, we extend the Spike-and-Slab Lasso of Ročková and George (2018) to the unknown variance case. This extended procedure outperforms both the fixed variance approach and alternative penalized likelihood methods on simulated data. On the protein activity dataset of Clyde and Parmigiani (1998), the Spike-and-Slab Lasso with unknown variance achieves lower cross-validation error than alternative penalized likelihood methods, demonstrating the gains in predictive accuracy afforded by simultaneous error variance estimation. The unknown variance implementation of the Spike-and-Slab Lasso is provided in the publicly available R package SSLASSO (Ročková and Moran, 2017).




al

Beyond Whittle: Nonparametric Correction of a Parametric Likelihood with a Focus on Bayesian Time Series Analysis

Claudia Kirch, Matthew C. Edwards, Alexander Meier, Renate Meyer.

Source: Bayesian Analysis, Volume 14, Number 4, 1037--1073.

Abstract:
Nonparametric Bayesian inference has seen a rapid growth over the last decade but only few nonparametric Bayesian approaches to time series analysis have been developed. Most existing approaches use Whittle’s likelihood for Bayesian modelling of the spectral density as the main nonparametric characteristic of stationary time series. It is known that the loss of efficiency using Whittle’s likelihood can be substantial. On the other hand, parametric methods are more powerful than nonparametric methods if the observed time series is close to the considered model class but fail if the model is misspecified. Therefore, we suggest a nonparametric correction of a parametric likelihood that takes advantage of the efficiency of parametric models while mitigating sensitivities through a nonparametric amendment. We use a nonparametric Bernstein polynomial prior on the spectral density with weights induced by a Dirichlet process and prove posterior consistency for Gaussian stationary time series. Bayesian posterior computations are implemented via an MH-within-Gibbs sampler and the performance of the nonparametrically corrected likelihood for Gaussian time series is illustrated in a simulation study and in three astronomy applications, including estimating the spectral density of gravitational wave data from the Advanced Laser Interferometer Gravitational-wave Observatory (LIGO).




al

Bayes Factors for Partially Observed Stochastic Epidemic Models

Muteb Alharthi, Theodore Kypraios, Philip D. O’Neill.

Source: Bayesian Analysis, Volume 14, Number 3, 927--956.

Abstract:
We consider the problem of model choice for stochastic epidemic models given partial observation of a disease outbreak through time. Our main focus is on the use of Bayes factors. Although Bayes factors have appeared in the epidemic modelling literature before, they can be hard to compute and little attention has been given to fundamental questions concerning their utility. In this paper we derive analytic expressions for Bayes factors given complete observation through time, which suggest practical guidelines for model choice problems. We adapt the power posterior method for computing Bayes factors so as to account for missing data and apply this approach to partially observed epidemics. For comparison, we also explore the use of a deviance information criterion for missing data scenarios. The methods are illustrated via examples involving both simulated and real data.




al

Jointly Robust Prior for Gaussian Stochastic Process in Emulation, Calibration and Variable Selection

Mengyang Gu.

Source: Bayesian Analysis, Volume 14, Number 3, 877--905.

Abstract:
Gaussian stochastic process (GaSP) has been widely used in two fundamental problems in uncertainty quantification, namely the emulation and calibration of mathematical models. Some objective priors, such as the reference prior, are studied in the context of emulating (approximating) computationally expensive mathematical models. In this work, we introduce a new class of priors, called the jointly robust prior, for both the emulation and calibration. This prior is designed to maintain various advantages from the reference prior. In emulation, the jointly robust prior has an appropriate tail decay rate as the reference prior, and is computationally simpler than the reference prior in parameter estimation. Moreover, the marginal posterior mode estimation with the jointly robust prior can separate the influential and inert inputs in mathematical models, while the reference prior does not have this property. We establish the posterior propriety for a large class of priors in calibration, including the reference prior and jointly robust prior in general scenarios, but the jointly robust prior is preferred because the calibrated mathematical model typically predicts the reality well. The jointly robust prior is used as the default prior in two new R packages, called “RobustGaSP” and “RobustCalibration”, available on CRAN for emulation and calibration, respectively.




al

Bayesian Zero-Inflated Negative Binomial Regression Based on Pólya-Gamma Mixtures

Brian Neelon.

Source: Bayesian Analysis, Volume 14, Number 3, 849--875.

Abstract:
Motivated by a study examining spatiotemporal patterns in inpatient hospitalizations, we propose an efficient Bayesian approach for fitting zero-inflated negative binomial models. To facilitate posterior sampling, we introduce a set of latent variables that are represented as scale mixtures of normals, where the precision terms follow independent Pólya-Gamma distributions. Conditional on the latent variables, inference proceeds via straightforward Gibbs sampling. For fixed-effects models, our approach is comparable to existing methods. However, our model can accommodate more complex data structures, including multivariate and spatiotemporal data, settings in which current approaches often fail due to computational challenges. Using simulation studies, we highlight key features of the method and compare its performance to other estimation procedures. We apply the approach to a spatiotemporal analysis examining the number of annual inpatient admissions among United States veterans with type 2 diabetes.




al

High-Dimensional Confounding Adjustment Using Continuous Spike and Slab Priors

Joseph Antonelli, Giovanni Parmigiani, Francesca Dominici.

Source: Bayesian Analysis, Volume 14, Number 3, 825--848.

Abstract:
In observational studies, estimation of a causal effect of a treatment on an outcome relies on proper adjustment for confounding. If the number of the potential confounders ( $p$ ) is larger than the number of observations ( $n$ ), then direct control for all potential confounders is infeasible. Existing approaches for dimension reduction and penalization are generally aimed at predicting the outcome, and are less suited for estimation of causal effects. Under standard penalization approaches (e.g. Lasso), if a variable $X_{j}$ is strongly associated with the treatment $T$ but weakly with the outcome $Y$ , the coefficient $eta_{j}$ will be shrunk towards zero thus leading to confounding bias. Under the assumption of a linear model for the outcome and sparsity, we propose continuous spike and slab priors on the regression coefficients $eta_{j}$ corresponding to the potential confounders $X_{j}$ . Specifically, we introduce a prior distribution that does not heavily shrink to zero the coefficients ( $eta_{j}$ s) of the $X_{j}$ s that are strongly associated with $T$ but weakly associated with $Y$ . We compare our proposed approach to several state of the art methods proposed in the literature. Our proposed approach has the following features: 1) it reduces confounding bias in high dimensional settings; 2) it shrinks towards zero coefficients of instrumental variables; and 3) it achieves good coverages even in small sample sizes. We apply our approach to the National Health and Nutrition Examination Survey (NHANES) data to estimate the causal effects of persistent pesticide exposure on triglyceride levels.




al

Probability Based Independence Sampler for Bayesian Quantitative Learning in Graphical Log-Linear Marginal Models

Ioannis Ntzoufras, Claudia Tarantola, Monia Lupparelli.

Source: Bayesian Analysis, Volume 14, Number 3, 797--823.

Abstract:
We introduce a novel Bayesian approach for quantitative learning for graphical log-linear marginal models. These models belong to curved exponential families that are difficult to handle from a Bayesian perspective. The likelihood cannot be analytically expressed as a function of the marginal log-linear interactions, but only in terms of cell counts or probabilities. Posterior distributions cannot be directly obtained, and Markov Chain Monte Carlo (MCMC) methods are needed. Finally, a well-defined model requires parameter values that lead to compatible marginal probabilities. Hence, any MCMC should account for this important restriction. We construct a fully automatic and efficient MCMC strategy for quantitative learning for such models that handles these problems. While the prior is expressed in terms of the marginal log-linear interactions, we build an MCMC algorithm that employs a proposal on the probability parameter space. The corresponding proposal on the marginal log-linear interactions is obtained via parameter transformation. We exploit a conditional conjugate setup to build an efficient proposal on probability parameters. The proposed methodology is illustrated by a simulation study and a real dataset.




al

Sequential Monte Carlo Samplers with Independent Markov Chain Monte Carlo Proposals

L. F. South, A. N. Pettitt, C. C. Drovandi.

Source: Bayesian Analysis, Volume 14, Number 3, 773--796.

Abstract:
Sequential Monte Carlo (SMC) methods for sampling from the posterior of static Bayesian models are flexible, parallelisable and capable of handling complex targets. However, it is common practice to adopt a Markov chain Monte Carlo (MCMC) kernel with a multivariate normal random walk (RW) proposal in the move step, which can be both inefficient and detrimental for exploring challenging posterior distributions. We develop new SMC methods with independent proposals which allow recycling of all candidates generated in the SMC process and are embarrassingly parallelisable. A novel evidence estimator that is easily computed from the output of our independent SMC is proposed. Our independent proposals are constructed via flexible copula-type models calibrated with the population of SMC particles. We demonstrate through several examples that more precise estimates of posterior expectations and the marginal likelihood can be obtained using fewer likelihood evaluations than the more standard RW approach.




al

A Bayesian Nonparametric Multiple Testing Procedure for Comparing Several Treatments Against a Control

Luis Gutiérrez, Andrés F. Barrientos, Jorge González, Daniel Taylor-Rodríguez.

Source: Bayesian Analysis, Volume 14, Number 2, 649--675.

Abstract:
We propose a Bayesian nonparametric strategy to test for differences between a control group and several treatment regimes. Most of the existing tests for this type of comparison are based on the differences between location parameters. In contrast, our approach identifies differences across the entire distribution, avoids strong modeling assumptions over the distributions for each treatment, and accounts for multiple testing through the prior distribution on the space of hypotheses. The proposal is compared to other commonly used hypothesis testing procedures under simulated scenarios. Two real applications are also analyzed with the proposed methodology.




al

Alleviating Spatial Confounding for Areal Data Problems by Displacing the Geographical Centroids

Marcos Oliveira Prates, Renato Martins Assunção, Erica Castilho Rodrigues.

Source: Bayesian Analysis, Volume 14, Number 2, 623--647.

Abstract:
Spatial confounding between the spatial random effects and fixed effects covariates has been recently discovered and showed that it may bring misleading interpretation to the model results. Techniques to alleviate this problem are based on decomposing the spatial random effect and fitting a restricted spatial regression. In this paper, we propose a different approach: a transformation of the geographic space to ensure that the unobserved spatial random effect added to the regression is orthogonal to the fixed effects covariates. Our approach, named SPOCK, has the additional benefit of providing a fast and simple computational method to estimate the parameters. Also, it does not constrain the distribution class assumed for the spatial error term. A simulation study and real data analyses are presented to better understand the advantages of the new method in comparison with the existing ones.




al

Fast Model-Fitting of Bayesian Variable Selection Regression Using the Iterative Complex Factorization Algorithm

Quan Zhou, Yongtao Guan.

Source: Bayesian Analysis, Volume 14, Number 2, 573--594.

Abstract:
Bayesian variable selection regression (BVSR) is able to jointly analyze genome-wide genetic datasets, but the slow computation via Markov chain Monte Carlo (MCMC) hampered its wide-spread usage. Here we present a novel iterative method to solve a special class of linear systems, which can increase the speed of the BVSR model-fitting tenfold. The iterative method hinges on the complex factorization of the sum of two matrices and the solution path resides in the complex domain (instead of the real domain). Compared to the Gauss-Seidel method, the complex factorization converges almost instantaneously and its error is several magnitude smaller than that of the Gauss-Seidel method. More importantly, the error is always within the pre-specified precision while the Gauss-Seidel method is not. For large problems with thousands of covariates, the complex factorization is 10–100 times faster than either the Gauss-Seidel method or the direct method via the Cholesky decomposition. In BVSR, one needs to repetitively solve large penalized regression systems whose design matrices only change slightly between adjacent MCMC steps. This slight change in design matrix enables the adaptation of the iterative complex factorization method. The computational innovation will facilitate the wide-spread use of BVSR in reanalyzing genome-wide association datasets.




al

Analysis of the Maximal a Posteriori Partition in the Gaussian Dirichlet Process Mixture Model

Łukasz Rajkowski.

Source: Bayesian Analysis, Volume 14, Number 2, 477--494.

Abstract:
Mixture models are a natural choice in many applications, but it can be difficult to place an a priori upper bound on the number of components. To circumvent this, investigators are turning increasingly to Dirichlet process mixture models (DPMMs). It is therefore important to develop an understanding of the strengths and weaknesses of this approach. This work considers the MAP (maximum a posteriori) clustering for the Gaussian DPMM (where the cluster means have Gaussian distribution and, for each cluster, the observations within the cluster have Gaussian distribution). Some desirable properties of the MAP partition are proved: ‘almost disjointness’ of the convex hulls of clusters (they may have at most one point in common) and (with natural assumptions) the comparability of sizes of those clusters that intersect any fixed ball with the number of observations (as the latter goes to infinity). Consequently, the number of such clusters remains bounded. Furthermore, if the data arises from independent identically distributed sampling from a given distribution with bounded support then the asymptotic MAP partition of the observation space maximises a function which has a straightforward expression, which depends only on the within-group covariance parameter. As the operator norm of this covariance parameter decreases, the number of clusters in the MAP partition becomes arbitrarily large, which may lead to the overestimation of the number of mixture components.




al

Efficient Bayesian Regularization for Graphical Model Selection

Suprateek Kundu, Bani K. Mallick, Veera Baladandayuthapani.

Source: Bayesian Analysis, Volume 14, Number 2, 449--476.

Abstract:
There has been an intense development in the Bayesian graphical model literature over the past decade; however, most of the existing methods are restricted to moderate dimensions. We propose a novel graphical model selection approach for large dimensional settings where the dimension increases with the sample size, by decoupling model fitting and covariance selection. First, a full model based on a complete graph is fit under a novel class of mixtures of inverse–Wishart priors, which induce shrinkage on the precision matrix under an equivalence with Cholesky-based regularization, while enabling conjugate updates. Subsequently, a post-fitting model selection step uses penalized joint credible regions to perform model selection. This allows our methods to be computationally feasible for large dimensional settings using a combination of straightforward Gibbs samplers and efficient post-fitting inferences. Theoretical guarantees in terms of selection consistency are also established. Simulations show that the proposed approach compares favorably with competing methods, both in terms of accuracy metrics and computation times. We apply this approach to a cancer genomics data example.




al

A Bayesian Approach to Statistical Shape Analysis via the Projected Normal Distribution

Luis Gutiérrez, Eduardo Gutiérrez-Peña, Ramsés H. Mena.

Source: Bayesian Analysis, Volume 14, Number 2, 427--447.

Abstract:
This work presents a Bayesian predictive approach to statistical shape analysis. A modeling strategy that starts with a Gaussian distribution on the configuration space, and then removes the effects of location, rotation and scale, is studied. This boils down to an application of the projected normal distribution to model the configurations in the shape space, which together with certain identifiability constraints, facilitates parameter interpretation. Having better control over the parameters allows us to generalize the model to a regression setting where the effect of predictors on shapes can be considered. The methodology is illustrated and tested using both simulated scenarios and a real data set concerning eight anatomical landmarks on a sagittal plane of the corpus callosum in patients with autism and in a group of controls.




al

Control of Type I Error Rates in Bayesian Sequential Designs

Haolun Shi, Guosheng Yin.

Source: Bayesian Analysis, Volume 14, Number 2, 399--425.

Abstract:
Bayesian approaches to phase II clinical trial designs are usually based on the posterior distribution of the parameter of interest and calibration of certain threshold for decision making. If the posterior probability is computed and assessed in a sequential manner, the design may involve the problem of multiplicity, which, however, is often a neglected aspect in Bayesian trial designs. To effectively maintain the overall type I error rate, we propose solutions to the problem of multiplicity for Bayesian sequential designs and, in particular, the determination of the cutoff boundaries for the posterior probabilities. We present both theoretical and numerical methods for finding the optimal posterior probability boundaries with $alpha$ -spending functions that mimic those of the frequentist group sequential designs. The theoretical approach is based on the asymptotic properties of the posterior probability, which establishes a connection between the Bayesian trial design and the frequentist group sequential method. The numerical approach uses a sandwich-type searching algorithm, which immensely reduces the computational burden. We apply least-square fitting to find the $alpha$ -spending function closest to the target. We discuss the application of our method to single-arm and double-arm cases with binary and normal endpoints, respectively, and provide a real trial example for each case.




al

Variational Message Passing for Elaborate Response Regression Models

M. W. McLean, M. P. Wand.

Source: Bayesian Analysis, Volume 14, Number 2, 371--398.

Abstract:
We build on recent work concerning message passing approaches to approximate fitting and inference for arbitrarily large regression models. The focus is on regression models where the response variable is modeled to have an elaborate distribution, which is loosely defined to mean a distribution that is more complicated than common distributions such as those in the Bernoulli, Poisson and Normal families. Examples of elaborate response families considered here are the Negative Binomial and $t$ families. Variational message passing is more challenging due to some of the conjugate exponential families being non-standard and numerical integration being needed. Nevertheless, a factor graph fragment approach means the requisite calculations only need to be done once for a particular elaborate response distribution family. Computer code can be compartmentalized, including that involving numerical integration. A major finding of this work is that the modularity of variational message passing extends to elaborate response regression models.




al

Bayesian Effect Fusion for Categorical Predictors

Daniela Pauger, Helga Wagner.

Source: Bayesian Analysis, Volume 14, Number 2, 341--369.

Abstract:
We propose a Bayesian approach to obtain a sparse representation of the effect of a categorical predictor in regression type models. As this effect is captured by a group of level effects, sparsity cannot only be achieved by excluding single irrelevant level effects or the whole group of effects associated to this predictor but also by fusing levels which have essentially the same effect on the response. To achieve this goal, we propose a prior which allows for almost perfect as well as almost zero dependence between level effects a priori. This prior can alternatively be obtained by specifying spike and slab prior distributions on all effect differences associated to this categorical predictor. We show how restricted fusion can be implemented and develop an efficient MCMC (Markov chain Monte Carlo) method for posterior computation. The performance of the proposed method is investigated on simulated data and we illustrate its application on real data from EU-SILC (European Union Statistics on Income and Living Conditions).




al

Modeling Population Structure Under Hierarchical Dirichlet Processes

Lloyd T. Elliott, Maria De Iorio, Stefano Favaro, Kaustubh Adhikari, Yee Whye Teh.

Source: Bayesian Analysis, Volume 14, Number 2, 313--339.

Abstract:
We propose a Bayesian nonparametric model to infer population admixture, extending the hierarchical Dirichlet process to allow for correlation between loci due to linkage disequilibrium. Given multilocus genotype data from a sample of individuals, the proposed model allows inferring and classifying individuals as unadmixed or admixed, inferring the number of subpopulations ancestral to an admixed population and the population of origin of chromosomal regions. Our model does not assume any specific mutation process, and can be applied to most of the commonly used genetic markers. We present a Markov chain Monte Carlo (MCMC) algorithm to perform posterior inference from the model and we discuss some methods to summarize the MCMC output for the analysis of population admixture. Finally, we demonstrate the performance of the proposed model in a real application, using genetic data from the ectodysplasin-A receptor (EDAR) gene, which is considered to be ancestry-informative due to well-known variations in allele frequency as well as phenotypic effects across ancestry. The structure analysis of this dataset leads to the identification of a rare haplotype in Europeans. We also conduct a simulated experiment and show that our algorithm outperforms parametric methods.




al

Separable covariance arrays via the Tucker product, with applications to multivariate relational data

Peter D. Hoff

Source: Bayesian Anal., Volume 6, Number 2, 179--196.

Abstract:
Modern datasets are often in the form of matrices or arrays, potentially having correlations along each set of data indices. For example, data involving repeated measurements of several variables over time may exhibit temporal correlation as well as correlation among the variables. A possible model for matrix-valued data is the class of matrix normal distributions, which is parametrized by two covariance matrices, one for each index set of the data. In this article we discuss an extension of the matrix normal model to accommodate multidimensional data arrays, or tensors. We show how a particular array-matrix product can be used to generate the class of array normal distributions having separable covariance structure. We derive some properties of these covariance structures and the corresponding array normal distributions, and show how the array-matrix product can be used to define a semi-conjugate prior distribution and calculate the corresponding posterior distribution. We illustrate the methodology in an analysis of multivariate longitudinal network data which take the form of a four-way array.




al

Maximum Independent Component Analysis with Application to EEG Data

Ruosi Guo, Chunming Zhang, Zhengjun Zhang.

Source: Statistical Science, Volume 35, Number 1, 145--157.

Abstract:
In many scientific disciplines, finding hidden influential factors behind observational data is essential but challenging. The majority of existing approaches, such as the independent component analysis (${mathrm{ICA}}$), rely on linear transformation, that is, true signals are linear combinations of hidden components. Motivated from analyzing nonlinear temporal signals in neuroscience, genetics, and finance, this paper proposes the “maximum independent component analysis” (${mathrm{MaxICA}}$), based on max-linear combinations of components. In contrast to existing methods, ${mathrm{MaxICA}}$ benefits from focusing on significant major components while filtering out ignorable components. A major tool for parameter learning of ${mathrm{MaxICA}}$ is an augmented genetic algorithm, consisting of three schemes for the elite weighted sum selection, randomly combined crossover, and dynamic mutation. Extensive empirical evaluations demonstrate the effectiveness of ${mathrm{MaxICA}}$ in either extracting max-linearly combined essential sources in many applications or supplying a better approximation for nonlinearly combined source signals, such as $mathrm{EEG}$ recordings analyzed in this paper.




al

Statistical Inference for the Evolutionary History of Cancer Genomes

Khanh N. Dinh, Roman Jaksik, Marek Kimmel, Amaury Lambert, Simon Tavaré.

Source: Statistical Science, Volume 35, Number 1, 129--144.

Abstract:
Recent years have seen considerable work on inference about cancer evolution from mutations identified in cancer samples. Much of the modeling work has been based on classical models of population genetics, generalized to accommodate time-varying cell population size. Reverse-time, genealogical views of such models, commonly known as coalescents, have been used to infer aspects of the past of growing populations. Another approach is to use branching processes, the simplest scenario being the classical linear birth-death process. Inference from evolutionary models of DNA often exploits summary statistics of the sequence data, a common one being the so-called Site Frequency Spectrum (SFS). In a bulk tumor sequencing experiment, we can estimate for each site at which a novel somatic point mutation has arisen, the proportion of cells that carry that mutation. These numbers are then grouped into collections of sites which have similar mutant fractions. We examine how the SFS based on birth-death processes differs from those based on the coalescent model. This may stem from the different sampling mechanisms in the two approaches. However, we also show that despite this, they are quantitatively comparable for the range of parameters typical for tumor cell populations. We also present a model of tumor evolution with selective sweeps, and demonstrate how it may help in understanding the history of a tumor as well as the influence of data pre-processing. We illustrate the theory with applications to several examples from The Cancer Genome Atlas tumors.




al

Statistical Molecule Counting in Super-Resolution Fluorescence Microscopy: Towards Quantitative Nanoscopy

Thomas Staudt, Timo Aspelmeier, Oskar Laitenberger, Claudia Geisler, Alexander Egner, Axel Munk.

Source: Statistical Science, Volume 35, Number 1, 92--111.

Abstract:
Super-resolution microscopy is rapidly gaining importance as an analytical tool in the life sciences. A compelling feature is the ability to label biological units of interest with fluorescent markers in (living) cells and to observe them with considerably higher resolution than conventional microscopy permits. The images obtained this way, however, lack an absolute intensity scale in terms of numbers of fluorophores observed. In this article, we discuss state of the art methods to count such fluorophores and statistical challenges that come along with it. In particular, we suggest a modeling scheme for time series generated by single-marker-switching (SMS) microscopy that makes it possible to quantify the number of markers in a statistically meaningful manner from the raw data. To this end, we model the entire process of photon generation in the fluorophore, their passage through the microscope, detection and photoelectron amplification in the camera, and extraction of time series from the microscopic images. At the heart of these modeling steps is a careful description of the fluorophore dynamics by a novel hidden Markov model that operates on two timescales (HTMM). Besides the fluorophore number, information about the kinetic transition rates of the fluorophore’s internal states is also inferred during estimation. We comment on computational issues that arise when applying our model to simulated or measured fluorescence traces and illustrate our methodology on simulated data.




al

Statistical Methodology in Single-Molecule Experiments

Chao Du, S. C. Kou.

Source: Statistical Science, Volume 35, Number 1, 75--91.

Abstract:
Toward the last quarter of the 20th century, the emergence of single-molecule experiments enabled scientists to track and study individual molecules’ dynamic properties in real time. Unlike macroscopic systems’ dynamics, those of single molecules can only be properly described by stochastic models even in the absence of external noise. Consequently, statistical methods have played a key role in extracting hidden information about molecular dynamics from data obtained through single-molecule experiments. In this article, we survey the major statistical methodologies used to analyze single-molecule experimental data. Our discussion is organized according to the types of stochastic models used to describe single-molecule systems as well as major experimental data collection techniques. We also highlight challenges and future directions in the application of statistical methodologies to single-molecule experiments.




al

A Tale of Two Parasites: Statistical Modelling to Support Disease Control Programmes in Africa

Peter J. Diggle, Emanuele Giorgi, Julienne Atsame, Sylvie Ntsame Ella, Kisito Ogoussan, Katherine Gass.

Source: Statistical Science, Volume 35, Number 1, 42--50.

Abstract:
Vector-borne diseases have long presented major challenges to the health of rural communities in the wet tropical regions of the world, but especially in sub-Saharan Africa. In this paper, we describe the contribution that statistical modelling has made to the global elimination programme for one vector-borne disease, onchocerciasis. We explain why information on the spatial distribution of a second vector-borne disease, Loa loa, is needed before communities at high risk of onchocerciasis can be treated safely with mass distribution of ivermectin, an antifiarial medication. We show how a model-based geostatistical analysis of Loa loa prevalence survey data can be used to map the predictive probability that each location in the region of interest meets a WHO policy guideline for safe mass distribution of ivermectin and describe two applications: one is to data from Cameroon that assesses prevalence using traditional blood-smear microscopy; the other is to Africa-wide data that uses a low-cost questionnaire-based method. We describe how a recent technological development in image-based microscopy has resulted in a change of emphasis from prevalence alone to the bivariate spatial distribution of prevalence and the intensity of infection among infected individuals. We discuss how statistical modelling of the kind described here can contribute to health policy guidelines and decision-making in two ways. One is to ensure that, in a resource-limited setting, prevalence surveys are designed, and the resulting data analysed, as efficiently as possible. The other is to provide an honest quantification of the uncertainty attached to any binary decision by reporting predictive probabilities that a policy-defined condition for action is or is not met.




al

Some Statistical Issues in Climate Science

Michael L. Stein.

Source: Statistical Science, Volume 35, Number 1, 31--41.

Abstract:
Climate science is a field that is arguably both data-rich and data-poor. Data rich in that huge and quickly increasing amounts of data about the state of the climate are collected every day. Data poor in that important aspects of the climate are still undersampled, such as the deep oceans and some characteristics of the upper atmosphere. Data rich in that modern climate models can produce climatological quantities over long time periods with global coverage, including quantities that are difficult to measure and under conditions for which there is no data presently. Data poor in that the correspondence between climate model output to the actual climate, especially for future climate change due to human activities, is difficult to assess. The scope for fruitful interactions between climate scientists and statisticians is great, but requires serious commitments from researchers in both disciplines to understand the scientific and statistical nuances arising from the complex relationships between the data and the real-world problems. This paper describes a small fraction of some of the intellectual challenges that occur at the interface between climate science and statistics, including inferences for extremes for processes with seasonality and long-term trends, the use of climate model ensembles for studying extremes, the scope for using new data sources for studying space-time characteristics of environmental processes and a discussion of non-Gaussian space-time process models for climate variables. The paper concludes with a call to the statistical community to become more engaged in one of the great scientific and policy issues of our time, anthropogenic climate change and its impacts.




al

Risk Models for Breast Cancer and Their Validation

Adam R. Brentnall, Jack Cuzick.

Source: Statistical Science, Volume 35, Number 1, 14--30.

Abstract:
Strategies to prevent cancer and diagnose it early when it is most treatable are needed to reduce the public health burden from rising disease incidence. Risk assessment is playing an increasingly important role in targeting individuals in need of such interventions. For breast cancer many individual risk factors have been well understood for a long time, but the development of a fully comprehensive risk model has not been straightforward, in part because there have been limited data where joint effects of an extensive set of risk factors may be estimated with precision. In this article we first review the approach taken to develop the IBIS (Tyrer–Cuzick) model, and describe recent updates. We then review and develop methods to assess calibration of models such as this one, where the risk of disease allowing for competing mortality over a long follow-up time or lifetime is estimated. The breast cancer risk model model and calibration assessment methods are demonstrated using a cohort of 132,139 women attending mammography screening in the State of Washington, USA.




al

Model-Based Approach to the Joint Analysis of Single-Cell Data on Chromatin Accessibility and Gene Expression

Zhixiang Lin, Mahdi Zamanighomi, Timothy Daley, Shining Ma, Wing Hung Wong.

Source: Statistical Science, Volume 35, Number 1, 2--13.

Abstract:
Unsupervised methods, including clustering methods, are essential to the analysis of single-cell genomic data. Model-based clustering methods are under-explored in the area of single-cell genomics, and have the advantage of quantifying the uncertainty of the clustering result. Here we develop a model-based approach for the integrative analysis of single-cell chromatin accessibility and gene expression data. We show that combining these two types of data, we can achieve a better separation of the underlying cell types. An efficient Markov chain Monte Carlo algorithm is also developed.




al

Introduction to the Special Issue

Source: Statistical Science, Volume 35, Number 1, 1--1.




al

Statistical Theory Powering Data Science

Junhui Cai, Avishai Mandelbaum, Chaitra H. Nagaraja, Haipeng Shen, Linda Zhao.

Source: Statistical Science, Volume 34, Number 4, 669--691.

Abstract:
Statisticians are finding their place in the emerging field of data science. However, many issues considered “new” in data science have long histories in statistics. Examples of using statistical thinking are illustrated, which range from exploratory data analysis to measuring uncertainty to accommodating nonrandom samples. These examples are then applied to service networks, baseball predictions and official statistics.




al

Comment: Statistical Inference from a Predictive Perspective

Alessandro Rinaldo, Ryan J. Tibshirani, Larry Wasserman.

Source: Statistical Science, Volume 34, Number 4, 599--603.

Abstract:
What is the meaning of a regression parameter? Why is this the de facto standard object of interest for statistical inference? These are delicate issues, especially when the model is misspecified. We argue that focusing on predictive quantities may be a desirable alternative.




al

Comment on Models as Approximations, Parts I and II, by Buja et al.

Jerald F. Lawless.

Source: Statistical Science, Volume 34, Number 4, 569--571.

Abstract:
I comment on the papers Models as Approximations I and II, by A. Buja, R. Berk, L. Brown, E. George, E. Pitkin, M. Traskin, L. Zhao and K. Zhang.




al

Assessing the Causal Effect of Binary Interventions from Observational Panel Data with Few Treated Units

Pantelis Samartsidis, Shaun R. Seaman, Anne M. Presanis, Matthew Hickman, Daniela De Angelis.

Source: Statistical Science, Volume 34, Number 3, 486--503.

Abstract:
Researchers are often challenged with assessing the impact of an intervention on an outcome of interest in situations where the intervention is nonrandomised, the intervention is only applied to one or few units, the intervention is binary, and outcome measurements are available at multiple time points. In this paper, we review existing methods for causal inference in these situations. We detail the assumptions underlying each method, emphasize connections between the different approaches and provide guidelines regarding their practical implementation. Several open problems are identified thus highlighting the need for future research.




al

Conditionally Conjugate Mean-Field Variational Bayes for Logistic Models

Daniele Durante, Tommaso Rigon.

Source: Statistical Science, Volume 34, Number 3, 472--485.

Abstract:
Variational Bayes (VB) is a common strategy for approximate Bayesian inference, but simple methods are only available for specific classes of models including, in particular, representations having conditionally conjugate constructions within an exponential family. Models with logit components are an apparently notable exception to this class, due to the absence of conjugacy among the logistic likelihood and the Gaussian priors for the coefficients in the linear predictor. To facilitate approximate inference within this widely used class of models, Jaakkola and Jordan ( Stat. Comput. 10 (2000) 25–37) proposed a simple variational approach which relies on a family of tangent quadratic lower bounds of the logistic log-likelihood, thus restoring conjugacy between these approximate bounds and the Gaussian priors. This strategy is still implemented successfully, but few attempts have been made to formally understand the reasons underlying its excellent performance. Following a review on VB for logistic models, we cover this gap by providing a formal connection between the above bound and a recent Pólya-gamma data augmentation for logistic regression. Such a result places the computational methods associated with the aforementioned bounds within the framework of variational inference for conditionally conjugate exponential family models, thereby allowing recent advances for this class to be inherited also by the methods relying on Jaakkola and Jordan ( Stat. Comput. 10 (2000) 25–37).




al

ROS Regression: Integrating Regularization with Optimal Scaling Regression

Jacqueline J. Meulman, Anita J. van der Kooij, Kevin L. W. Duisters.

Source: Statistical Science, Volume 34, Number 3, 361--390.

Abstract:
We present a methodology for multiple regression analysis that deals with categorical variables (possibly mixed with continuous ones), in combination with regularization, variable selection and high-dimensional data ($Pgg N$). Regularization and optimal scaling (OS) are two important extensions of ordinary least squares regression (OLS) that will be combined in this paper. There are two data analytic situations for which optimal scaling was developed. One is the analysis of categorical data, and the other the need for transformations because of nonlinear relationships between predictors and outcome. Optimal scaling of categorical data finds quantifications for the categories, both for the predictors and for the outcome variables, that are optimal for the regression model in the sense that they maximize the multiple correlation. When nonlinear relationships exist, nonlinear transformation of predictors and outcome maximize the multiple correlation in the same way. We will consider a variety of transformation types; typically we use step functions for categorical variables, and smooth (spline) functions for continuous variables. Both types of functions can be restricted to be monotonic, preserving the ordinal information in the data. In combination with optimal scaling, three popular regularization methods will be considered: Ridge regression, the Lasso and the Elastic Net. The resulting method will be called ROS Regression (Regularized Optimal Scaling Regression). The OS algorithm provides straightforward and efficient estimation of the regularized regression coefficients, automatically gives the Group Lasso and Blockwise Sparse Regression, and extends them by the possibility to maintain ordinal properties in the data. Extended examples are provided.




al

Two-Sample Instrumental Variable Analyses Using Heterogeneous Samples

Qingyuan Zhao, Jingshu Wang, Wes Spiller, Jack Bowden, Dylan S. Small.

Source: Statistical Science, Volume 34, Number 2, 317--333.

Abstract:
Instrumental variable analysis is a widely used method to estimate causal effects in the presence of unmeasured confounding. When the instruments, exposure and outcome are not measured in the same sample, Angrist and Krueger ( J. Amer. Statist. Assoc. 87 (1992) 328–336) suggested to use two-sample instrumental variable (TSIV) estimators that use sample moments from an instrument-exposure sample and an instrument-outcome sample. However, this method is biased if the two samples are from heterogeneous populations so that the distributions of the instruments are different. In linear structural equation models, we derive a new class of TSIV estimators that are robust to heterogeneous samples under the key assumption that the structural relations in the two samples are the same. The widely used two-sample two-stage least squares estimator belongs to this class. It is generally not asymptotically efficient, although we find that it performs similarly to the optimal TSIV estimator in most practical situations. We then attempt to relax the linearity assumption. We find that, unlike one-sample analyses, the TSIV estimator is not robust to misspecified exposure model. Additionally, to nonparametrically identify the magnitude of the causal effect, the noise in the exposure must have the same distributions in the two samples. However, this assumption is in general untestable because the exposure is not observed in one sample. Nonetheless, we may still identify the sign of the causal effect in the absence of homogeneity of the noise.




al

Producing Official County-Level Agricultural Estimates in the United States: Needs and Challenges

Nathan B. Cruze, Andreea L. Erciulescu, Balgobin Nandram, Wendy J. Barboza, Linda J. Young.

Source: Statistical Science, Volume 34, Number 2, 301--316.

Abstract:
In the United States, county-level estimates of crop yield, production, and acreage published by the United States Department of Agriculture’s National Agricultural Statistics Service (USDA NASS) play an important role in determining the value of payments allotted to farmers and ranchers enrolled in several federal programs. Given the importance of these official county-level crop estimates, NASS continually strives to improve its crops county estimates program in terms of accuracy, reliability and coverage. In 2015, NASS engaged a panel of experts convened under the auspices of the National Academies of Sciences, Engineering, and Medicine Committee on National Statistics (CNSTAT) for guidance on implementing models that may synthesize multiple sources of information into a single estimate, provide defensible measures of uncertainty, and potentially increase the number of publishable county estimates. The final report titled Improving Crop Estimates by Integrating Multiple Data Sources was released in 2017. This paper discusses several needs and requirements for NASS county-level crop estimates that were illuminated during the activities of the CNSTAT panel. A motivating example of planted acreage estimation in Illinois illustrates several challenges that NASS faces as it considers adopting any explicit model for official crops county estimates.




al

Statistical Analysis of Zero-Inflated Nonnegative Continuous Data: A Review

Lei Liu, Ya-Chen Tina Shih, Robert L. Strawderman, Daowen Zhang, Bankole A. Johnson, Haitao Chai.

Source: Statistical Science, Volume 34, Number 2, 253--279.

Abstract:
Zero-inflated nonnegative continuous (or semicontinuous) data arise frequently in biomedical, economical, and ecological studies. Examples include substance abuse, medical costs, medical care utilization, biomarkers (e.g., CD4 cell counts, coronary artery calcium scores), single cell gene expression rates, and (relative) abundance of microbiome. Such data are often characterized by the presence of a large portion of zero values and positive continuous values that are skewed to the right and heteroscedastic. Both of these features suggest that no simple parametric distribution may be suitable for modeling such type of outcomes. In this paper, we review statistical methods for analyzing zero-inflated nonnegative outcome data. We will start with the cross-sectional setting, discussing ways to separate zero and positive values and introducing flexible models to characterize right skewness and heteroscedasticity in the positive values. We will then present models of correlated zero-inflated nonnegative continuous data, using random effects to tackle the correlation on repeated measures from the same subject and that across different parts of the model. We will also discuss expansion to related topics, for example, zero-inflated count and survival data, nonlinear covariate effects, and joint models of longitudinal zero-inflated nonnegative continuous data and survival. Finally, we will present applications to three real datasets (i.e., microbiome, medical costs, and alcohol drinking) to illustrate these methods. Example code will be provided to facilitate applications of these methods.




al

A Kernel Regression Procedure in the 3D Shape Space with an Application to Online Sales of Children’s Wear

Gregorio Quintana-Ortí, Amelia Simó.

Source: Statistical Science, Volume 34, Number 2, 236--252.

Abstract:
This paper is focused on kernel regression when the response variable is the shape of a 3D object represented by a configuration matrix of landmarks. Regression methods on this shape space are not trivial because this space has a complex finite-dimensional Riemannian manifold structure (non-Euclidean). Papers about it are scarce in the literature, the majority of them are restricted to the case of a single explanatory variable, and many of them are based on the approximated tangent space. In this paper, there are several methodological innovations. The first one is the adaptation of the general method for kernel regression analysis in manifold-valued data to the three-dimensional case of Kendall’s shape space. The second one is its generalization to the multivariate case and the addressing of the curse-of-dimensionality problem. Finally, we propose bootstrap confidence intervals for prediction. A simulation study is carried out to check the goodness of the procedure, and a comparison with a current approach is performed. Then, it is applied to a 3D database obtained from an anthropometric survey of the Spanish child population with a potential application to online sales of children’s wear.