pp

Approximate inference for constructing astronomical catalogs from images

Jeffrey Regier, Andrew C. Miller, David Schlegel, Ryan P. Adams, Jon D. McAuliffe, Prabhat.

Source: The Annals of Applied Statistics, Volume 13, Number 3, 1884--1926.

Abstract:
We present a new, fully generative model for constructing astronomical catalogs from optical telescope image sets. Each pixel intensity is treated as a random variable with parameters that depend on the latent properties of stars and galaxies. These latent properties are themselves modeled as random. We compare two procedures for posterior inference. One procedure is based on Markov chain Monte Carlo (MCMC) while the other is based on variational inference (VI). The MCMC procedure excels at quantifying uncertainty, while the VI procedure is 1000 times faster. On a supercomputer, the VI procedure efficiently uses 665,000 CPU cores to construct an astronomical catalog from 50 terabytes of images in 14.6 minutes, demonstrating the scaling characteristics necessary to construct catalogs for upcoming astronomical surveys.
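
As a rough illustration of the kind of generative pixel model described above (not the authors' actual Celeste model; the single-band Gaussian PSF, the background level, and all names below are assumptions made for the sketch), pixel counts can be drawn as Poisson variables whose rates depend on latent source positions and fluxes:

# Toy generative model for telescope image pixels: each source has a latent
# position and flux, and every pixel count is Poisson with a rate equal to the
# sky background plus PSF-smeared source fluxes. Illustrative only.
import numpy as np

rng = np.random.default_rng(0)

H, W = 32, 32                      # image size (pixels)
background = 100.0                 # assumed sky background rate per pixel
psf_sigma = 1.5                    # assumed Gaussian PSF width (pixels)

# Latent source properties (positions and fluxes), themselves drawn at random.
n_sources = 3
positions = rng.uniform(0, [H, W], size=(n_sources, 2))
fluxes = rng.gamma(shape=2.0, scale=500.0, size=n_sources)

yy, xx = np.mgrid[0:H, 0:W]
rate = np.full((H, W), background)
for (cy, cx), f in zip(positions, fluxes):
    psf = np.exp(-((yy - cy) ** 2 + (xx - cx) ** 2) / (2 * psf_sigma ** 2))
    psf /= psf.sum()               # normalise so the PSF conserves flux
    rate += f * psf

image = rng.poisson(rate)          # observed pixel intensities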




pp

Wavelet spectral testing: Application to nonstationary circadian rhythms

Jessica K. Hargreaves, Marina I. Knight, Jon W. Pitchford, Rachael J. Oakenfull, Sangeeta Chawla, Jack Munns, Seth J. Davis.

Source: The Annals of Applied Statistics, Volume 13, Number 3, 1817--1846.

Abstract:
Rhythmic data are ubiquitous in the life sciences. Biologists need reliable statistical tests to identify whether a particular experimental treatment has caused a significant change in a rhythmic signal. When these signals display nonstationary behaviour, as is common in many biological systems, the established methodologies may be misleading. Therefore, there is a real need for new methodology that enables the formal comparison of nonstationary processes. As circadian behaviour is best understood in the spectral domain, here we develop novel hypothesis testing procedures in the (wavelet) spectral domain, embedding replicate information when available. The data are modelled as realisations of locally stationary wavelet processes, allowing us to define and rigorously estimate their evolutionary wavelet spectra. Motivated by three complementary applications in circadian biology, our new methodology allows the identification of three specific types of spectral difference. We demonstrate the advantages of our methodology over alternative approaches, by means of a comprehensive simulation study and real data applications, using both published and newly generated circadian datasets. In contrast to the current standard methodologies, our method successfully identifies differences within the motivating circadian datasets, and facilitates wider ranging analyses of rhythmic biological data in general.




pp

Sequential decision model for inference and prediction on nonuniform hypergraphs with application to knot matching from computational forestry

Seong-Hwan Jun, Samuel W. K. Wong, James V. Zidek, Alexandre Bouchard-Côté.

Source: The Annals of Applied Statistics, Volume 13, Number 3, 1678--1707.

Abstract:
In this paper, we consider the knot-matching problem arising in computational forestry, an important problem that must be solved to advance the state of the art in automatic strength prediction of lumber. We show that this problem can be formulated as a quadripartite matching problem and develop a sequential decision model that admits efficient parameter estimation, along with a sequential Monte Carlo sampler that can be used for rapid sampling of graph matchings. We demonstrate the effectiveness of our methods on 30 manually annotated boards and present findings from various simulation studies that provide further evidence of their efficacy.




pp

Network classification with applications to brain connectomics

Jesús D. Arroyo Relión, Daniel Kessler, Elizaveta Levina, Stephan F. Taylor.

Source: The Annals of Applied Statistics, Volume 13, Number 3, 1648--1677.

Abstract:
While statistical analysis of a single network has received a lot of attention in recent years, with a focus on social networks, analysis of a sample of networks presents its own challenges which require a different set of analytic tools. Here we study the problem of classification of networks with labeled nodes, motivated by applications in neuroimaging. Brain networks are constructed from imaging data to represent functional connectivity between regions of the brain, and previous work has shown the potential of such networks to distinguish between various brain disorders, giving rise to a network classification problem. Existing approaches tend to either treat all edge weights as a long vector, ignoring the network structure, or focus on graph topology as represented by summary measures while ignoring the edge weights. Our goal is to design a classification method that uses both the individual edge information and the network structure of the data in a computationally efficient way, and that can produce a parsimonious and interpretable representation of differences in brain connectivity patterns between classes. We propose a graph classification method that uses edge weights as predictors but incorporates the network nature of the data via penalties that promote sparsity in the number of nodes, in addition to the usual sparsity penalties that encourage selection of edges. We implement the method via efficient convex optimization and provide a detailed analysis of data from two fMRI studies of schizophrenia.




pp

The classification permutation test: A flexible approach to testing for covariate imbalance in observational studies

Johann Gagnon-Bartsch, Yotam Shem-Tov.

Source: The Annals of Applied Statistics, Volume 13, Number 3, 1464--1483.

Abstract:
The gold standard for identifying causal relationships is a randomized controlled experiment. In many applications in the social sciences and medicine, the researcher does not control the assignment mechanism and instead may rely upon natural experiments or matching methods as a substitute to experimental randomization. The standard testable implication of random assignment is covariate balance between the treated and control units. Covariate balance is commonly used to validate the claim of as good as random assignment. We propose a new nonparametric test of covariate balance. Our Classification Permutation Test (CPT) is based on a combination of classification methods (e.g., random forests) with Fisherian permutation inference. We revisit four real data examples and present Monte Carlo power simulations to demonstrate the applicability of the CPT relative to other nonparametric tests of equality of multivariate distributions.
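
A minimal sketch of the idea behind the CPT, assuming a random forest classifier and cross-validated accuracy as the test statistic (the authors' exact choices of classifier and statistic may differ):

# Classification permutation test sketch: use classifier accuracy at separating
# treated from control units as the test statistic, and build its null
# distribution by permuting the treatment labels.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def cpt_pvalue(X, z, n_perm=500, seed=0):
    """X: covariate matrix (n, p); z: binary treatment indicator (n,)."""
    rng = np.random.default_rng(seed)

    def accuracy(labels):
        clf = RandomForestClassifier(n_estimators=200, random_state=0)
        return cross_val_score(clf, X, labels, cv=5, scoring="accuracy").mean()

    observed = accuracy(z)
    null = np.array([accuracy(rng.permutation(z)) for _ in range(n_perm)])
    # One-sided p-value: how often permuted accuracy matches or beats the observed one.
    return (1 + np.sum(null >= observed)) / (1 + n_perm)

Accuracy near chance level under the observed labels is evidence of covariate balance; accuracy well above the permutation distribution suggests imbalance.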




pp

Identifying multiple changes for a functional data sequence with application to freeway traffic segmentation

Jeng-Min Chiou, Yu-Ting Chen, Tailen Hsing.

Source: The Annals of Applied Statistics, Volume 13, Number 3, 1430--1463.

Abstract:
Motivated by the study of road segmentation partitioned by shifts in traffic conditions along a freeway, we introduce a two-stage procedure, Dynamic Segmentation and Backward Elimination (DSBE), for identifying multiple changes in the mean functions for a sequence of functional data. The Dynamic Segmentation procedure searches for all possible changepoints using the derived global optimality criterion coupled with the local strategy of at-most-one-changepoint by dividing the entire sequence into individual subsequences that are recursively adjusted until convergence. Then, the Backward Elimination procedure verifies these changepoints by iteratively testing the unlikely changes to ensure their significance until no more changepoints can be removed. By combining the local strategy with the global optimal changepoint criterion, the DSBE algorithm is conceptually simple and easy to implement and performs better than the binary segmentation-based approach at detecting small multiple changes. The consistency property of the changepoint estimators and the convergence of the algorithm are proved. We apply DSBE to detect changes in traffic streams through real freeway traffic data. The practical performance of DSBE is also investigated through intensive simulation studies for various scenarios.
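
As a building block in the spirit of the local at-most-one-changepoint step described above (not the full DSBE procedure, and with discretised curves standing in for functional data), a single-split scan might look like this:

# Generic at-most-one-changepoint scan for a sequence of discretised curves:
# for each candidate split, measure the sample-size-weighted squared L2
# distance between the segment mean functions.
import numpy as np

def amoc_split(curves):
    """curves: array of shape (n, p), each row a discretised functional datum."""
    n = curves.shape[0]
    best_k, best_stat = None, -np.inf
    for k in range(1, n):                      # split into curves[:k], curves[k:]
        diff = curves[:k].mean(axis=0) - curves[k:].mean(axis=0)
        stat = (k * (n - k) / n) * np.sum(diff ** 2)
        if stat > best_stat:
            best_k, best_stat = k, stat
    return best_k, best_stat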




pp

A hidden Markov model approach to characterizing the photo-switching behavior of fluorophores

Lekha Patel, Nils Gustafsson, Yu Lin, Raimund Ober, Ricardo Henriques, Edward Cohen.

Source: The Annals of Applied Statistics, Volume 13, Number 3, 1397--1429.

Abstract:
Fluorescing molecules (fluorophores) that stochastically switch between photon-emitting and dark states underpin some of the most celebrated advancements in super-resolution microscopy. While this stochastic behavior has been heavily exploited, full characterization of the underlying models can potentially drive forward further imaging methodologies. Under the assumption that fluorophores move between fluorescing and dark states as continuous time Markov processes, the goal is to use a sequence of images to select a model and estimate the transition rates. We use a hidden Markov model to relate the observed discrete time signal to the hidden continuous time process. With imaging involving several repeat exposures of the fluorophore, we show the observed signal depends on both the current and past states of the hidden process, producing emission probabilities that depend on the transition rate parameters to be estimated. To tackle this unusual coupling of the transition and emission probabilities, we conceive transmission (transition-emission) matrices that capture all dependencies of the model. We provide a scheme of computing these matrices and adapt the forward-backward algorithm to compute a likelihood which is readily optimized to provide rate estimates. When confronted with several model proposals, combining this procedure with the Bayesian Information Criterion provides accurate model selection.
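
For background, a generic scaled forward pass that turns transition and emission matrices into a likelihood is sketched below. Note that the paper's transmission matrices couple these two ingredients, because emissions depend on past hidden states; the version here keeps them separate and is only meant to show where a likelihood of this kind comes from. Rate estimates would then be obtained by maximising such a likelihood over the transition-rate parameters with a numerical optimiser.

# Generic forward-algorithm log-likelihood for a discrete hidden Markov model.
# pi: initial state probabilities (K,); P: transition matrix (K, K);
# E: emission probabilities (K, M); obs: sequence of observed symbols.
import numpy as np

def hmm_loglik(obs, pi, P, E):
    alpha = pi * E[:, obs[0]]
    c = alpha.sum()                  # rescale to avoid numerical underflow
    loglik = np.log(c)
    alpha /= c
    for y in obs[1:]:
        alpha = (alpha @ P) * E[:, y]
        c = alpha.sum()
        loglik += np.log(c)
        alpha /= c
    return loglik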




pp

Imputation and post-selection inference in models with missing data: An application to colorectal cancer surveillance guidelines

Lin Liu, Yuqi Qiu, Loki Natarajan, Karen Messer.

Source: The Annals of Applied Statistics, Volume 13, Number 3, 1370--1396.

Abstract:
It is common to encounter missing data among the potential predictor variables in the setting of model selection. For example, in a recent study we attempted to improve the US guidelines for risk stratification after screening colonoscopy (Cancer Causes Control 27 (2016) 1175–1185), with the aim of helping to reduce both overuse and underuse of follow-on surveillance colonoscopy. The goal was to incorporate selected additional informative variables into a neoplasia risk-prediction model, going beyond the three currently established risk factors, using a large dataset pooled from seven different prospective studies in North America. Unfortunately, not all candidate variables were collected in all studies, so that one or more important potential predictors were missing on over half of the subjects. Thus, while variable selection was a main focus of the study, it was necessary to address the substantial amount of missing data. Multiple imputation can effectively address missing data, and there are also good approaches to incorporate the variable selection process into model-based confidence intervals. However, there is no consensus on appropriate methods of inference which address both issues simultaneously. Our goal here is to study the properties of model-based confidence intervals in the setting of imputation for missing data followed by variable selection. We use both simulation and theory to compare three approaches to such post-imputation-selection inference: a multiple-imputation approach based on Rubin’s Rules for variance estimation (Comput. Statist. Data Anal. 71 (2014) 758–770); a single imputation-selection followed by bootstrap percentile confidence intervals; and a new bootstrap model-averaging approach presented here, following Efron (J. Amer. Statist. Assoc. 109 (2014) 991–1007). We investigate relative strengths and weaknesses of each method. The “Rubin’s Rules” multiple imputation estimator can have severe undercoverage, and is not recommended. The imputation-selection estimator with bootstrap percentile confidence intervals works well. The bootstrap-model-averaged estimator, with the “Efron’s Rules” estimated variance, may be preferred if the true effect sizes are moderate. We apply these results to the colorectal neoplasia risk-prediction problem which motivated the present work.
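
For reference, the “Rubin's Rules” combination that the comparison above refers to pools the m per-imputation estimates and inflates the within-imputation variance by the between-imputation spread. A minimal sketch for a single coefficient:

# Rubin's rules for combining estimates from m imputed datasets:
# pooled point estimate, and total variance = within + (1 + 1/m) * between.
import numpy as np

def rubins_rules(estimates, variances):
    """estimates, variances: length-m arrays of per-imputation point
    estimates and their estimated variances (for one coefficient)."""
    estimates = np.asarray(estimates, dtype=float)
    variances = np.asarray(variances, dtype=float)
    m = len(estimates)
    qbar = estimates.mean()                    # pooled estimate
    w = variances.mean()                       # within-imputation variance
    b = estimates.var(ddof=1)                  # between-imputation variance
    t = w + (1 + 1 / m) * b                    # total variance
    return qbar, t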




pp

Directional differentiability for supremum-type functionals: Statistical applications

Javier Cárcamo, Antonio Cuevas, Luis-Alberto Rodríguez.

Source: Bernoulli, Volume 26, Number 3, 2143--2175.

Abstract:
We show that various functionals related to the supremum of a real function defined on an arbitrary set or a measure space are Hadamard directionally differentiable. We specifically consider the supremum norm, the supremum, the infimum, and the amplitude of a function. The (usually non-linear) derivatives of these maps adopt simple expressions under suitable assumptions on the underlying space. As an application, we improve and extend to the multidimensional case the results in Raghavachari (Ann. Statist. 1 (1973) 67–73) regarding the limiting distributions of Kolmogorov–Smirnov type statistics under the alternative hypothesis. Similar results are obtained for analogous statistics associated with copulas. We additionally solve an open problem about the Berk–Jones statistic proposed by Jager and Wellner (In A Festschrift for Herman Rubin (2004) 319–331, IMS). Finally, the asymptotic distribution of maximum mean discrepancies over Donsker classes of functions is derived.
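
To make the kind of non-linear derivative referred to above concrete: for the supremum functional on bounded functions, in the simplest setting where the supremum is attained (for example, continuous $f$ and $h$ on a compact set), the Hadamard directional derivative takes the well-known form

\[
\sigma(f) = \sup_{x \in \mathcal{X}} f(x),
\qquad
\sigma'_f(h) = \lim_{t \downarrow 0} \frac{\sigma(f + t h) - \sigma(f)}{t}
            = \sup_{x \in \arg\max f} h(x),
\]

which is linear in $h$ only when the maximiser is unique. The precise regularity conditions, and the analogous expressions for the infimum, the amplitude, and the supremum norm, are the subject of the paper.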




pp

Noncommutative Lebesgue decomposition and contiguity with applications in quantum statistics

Akio Fujiwara, Koichi Yamagata.

Source: Bernoulli, Volume 26, Number 3, 2105--2142.

Abstract:
We herein develop a theory of contiguity in the quantum domain based upon a novel quantum analogue of the Lebesgue decomposition. The theory thus formulated is pertinent to the weak quantum local asymptotic normality introduced in the previous paper [Yamagata, Fujiwara, and Gill, Ann. Statist. 41 (2013) 2197–2217], yielding substantial enlargement of the scope of quantum statistics.




pp

Functional weak limit theorem for a local empirical process of non-stationary time series and its application

Ulrike Mayer, Henryk Zähle, Zhou Zhou.

Source: Bernoulli, Volume 26, Number 3, 1891--1911.

Abstract:
We derive a functional weak limit theorem for a local empirical process of a wide class of piece-wise locally stationary (PLS) time series. The latter result is applied to derive the asymptotics of weighted empirical quantiles and weighted V-statistics of non-stationary time series. The class of admissible underlying time series is illustrated by means of PLS linear processes and PLS ARCH processes.




pp

Logarithmic Sobolev inequalities for finite spin systems and applications

Holger Sambale, Arthur Sinulis.

Source: Bernoulli, Volume 26, Number 3, 1863--1890.

Abstract:
We derive sufficient conditions for a probability measure on a finite product space (a spin system) to satisfy a (modified) logarithmic Sobolev inequality. We establish these conditions for various examples, such as the (vertex-weighted) exponential random graph model, the random coloring and the hard-core model with fugacity. This leads to two separate branches of applications. The first branch is given by mixing time estimates of the Glauber dynamics. The proofs do not rely on coupling arguments, but instead use functional inequalities. As a byproduct, this also yields exponential decay of the relative entropy along the Glauber semigroup. Secondly, we investigate the concentration of measure phenomenon (particularly of higher order) for these spin systems. We show the effect of better concentration properties by centering not around the mean, but around a stochastic term in the exponential random graph model. From there, one can deduce a central limit theorem for the number of triangles from the CLT of the edge count. In the Erdős–Rényi model the first-order approximation leads to a quantification and a proof of a central limit theorem for subgraph counts.




pp

A Bayesian nonparametric approach to log-concave density estimation

Ester Mariucci, Kolyan Ray, Botond Szabó.

Source: Bernoulli, Volume 26, Number 2, 1070--1097.

Abstract:
The estimation of a log-concave density on $\mathbb{R}$ is a canonical problem in the area of shape-constrained nonparametric inference. We present a Bayesian nonparametric approach to this problem based on an exponentiated Dirichlet process mixture prior and show that the posterior distribution converges to the log-concave truth at the (near-) minimax rate in Hellinger distance. Our proof proceeds by establishing a general contraction result based on the log-concave maximum likelihood estimator that avoids the need for further metric entropy calculations. We further present computationally more feasible approximations and both an empirical and hierarchical Bayes approach. All priors are illustrated numerically via simulations.




pp

Robust modifications of U-statistics and applications to covariance estimation problems

Stanislav Minsker, Xiaohan Wei.

Source: Bernoulli, Volume 26, Number 1, 694--727.

Abstract:
Let $Y$ be a $d$-dimensional random vector with unknown mean $\mu$ and covariance matrix $\Sigma$. This paper is motivated by the problem of designing an estimator of $\Sigma$ that admits exponential deviation bounds in the operator norm under minimal assumptions on the underlying distribution, such as existence of only 4th moments of the coordinates of $Y$. To address this problem, we propose robust modifications of the operator-valued U-statistics, obtain non-asymptotic guarantees for their performance, and demonstrate the implications of these results to the covariance estimation problem under various structural assumptions.
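
As a much cruder cousin of the robust constructions studied in the paper (named plainly: an elementwise median-of-means covariance estimator, not the authors' operator-valued U-statistic modification), the following sketch conveys the flavour of trading a single average for a median of block estimates:

# Elementwise median-of-means covariance: split the rows into blocks, compute a
# sample covariance per block, and take the elementwise median across blocks.
import numpy as np

def median_of_means_cov(Y, n_blocks=10, rng=None):
    """Y: data matrix of shape (n, d), with n much larger than n_blocks."""
    rng = rng or np.random.default_rng()
    idx = rng.permutation(len(Y))
    blocks = np.array_split(idx, n_blocks)
    covs = [np.cov(Y[b], rowvar=False) for b in blocks]
    return np.median(covs, axis=0)

Note that the elementwise median need not be positive semidefinite; the paper's estimators are designed to control the operator norm directly.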




pp

A unified approach to coupling SDEs driven by Lévy noise and some applications

Mingjie Liang, René L. Schilling, Jian Wang.

Source: Bernoulli, Volume 26, Number 1, 664--693.

Abstract:
We present a general method to construct couplings of stochastic differential equations driven by Lévy noise in terms of coupling operators. This approach covers both coupling by reflection and refined basic coupling which are often discussed in the literature. As applications, we prove regularity results for the transition semigroups and obtain successful couplings for the solutions to stochastic differential equations driven by additive Lévy noise.




pp

Normal approximation for sums of weighted $U$-statistics – application to Kolmogorov bounds in random subgraph counting

Nicolas Privault, Grzegorz Serafin.

Source: Bernoulli, Volume 26, Number 1, 587--615.

Abstract:
We derive normal approximation bounds in the Kolmogorov distance for sums of discrete multiple integrals and weighted $U$-statistics made of independent Bernoulli random variables. Such bounds are applied to normal approximation for the renormalized subgraph counts in the Erdős–Rényi random graph. This approach completely solves a long-standing conjecture in the general setting of arbitrary graph counting, while recovering recent results obtained for triangles and improving other bounds in the Wasserstein distance.




pp

Consistent semiparametric estimators for recurrent event times models with application to virtual age models

Eric Beutner, Laurent Bordes, Laurent Doyen.

Source: Bernoulli, Volume 26, Number 1, 557--586.

Abstract:
Virtual age models are very useful to analyse recurrent events. Among the strengths of these models is their ability to account for treatment (or intervention) effects after an event occurrence. Despite their flexibility for modeling recurrent events, the number of applications is limited. This seems to be a result of the fact that in the semiparametric setting all the existing results assume the virtual age function that describes the treatment (or intervention) effects to be known. This shortcoming can be overcome by considering semiparametric virtual age models with parametrically specified virtual age functions. Yet, fitting such a model is a difficult task. Indeed, it has recently been shown that for these models the standard profile likelihood method fails to lead to consistent estimators. Here we show that consistent estimators can be constructed by smoothing the profile log-likelihood function appropriately. We show that our general result can be applied to most of the relevant virtual age models of the literature. Our approach shows that empirical process techniques may be a worthwhile alternative to martingale methods for studying asymptotic properties of these inference methods. A simulation study is provided to illustrate our consistency results together with an application to real data.




pp

High dimensional deformed rectangular matrices with applications in matrix denoising

Xiucai Ding.

Source: Bernoulli, Volume 26, Number 1, 387--417.

Abstract:
We consider the recovery of a low rank $M \times N$ matrix $S$ from its noisy observation $\tilde{S}$ in the high dimensional framework when $M$ is comparable to $N$. We propose two efficient estimators for $S$ under two different regimes. Our analysis relies on the local asymptotics of the eigenstructure of large dimensional rectangular matrices with finite rank perturbation. We derive the convergent limits and rates for the singular values and vectors for such matrices.
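
As a generic point of reference (not a reproduction of the paper's two regime-specific estimators), a rank-r truncated SVD is the simplest estimator of this kind:

# Low-rank denoising by truncated SVD: keep the leading r singular values and
# vectors of the noisy matrix.
import numpy as np

def truncated_svd_denoise(S_noisy, rank):
    U, s, Vt = np.linalg.svd(S_noisy, full_matrices=False)
    return (U[:, :rank] * s[:rank]) @ Vt[:rank]

# Example: S_hat = truncated_svd_denoise(S_noisy, rank=3)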




pp

A new method for obtaining sharp compound Poisson approximation error estimates for sums of locally dependent random variables

Michael V. Boutsikas, Eutichia Vaggelatou.

Source: Bernoulli, Volume 16, Number 2, 301--330.

Abstract:
Let $X_1, X_2, \ldots, X_n$ be a sequence of independent or locally dependent random variables taking values in $\mathbb{Z}_+$. In this paper, we derive sharp bounds, via a new probabilistic method, for the total variation distance between the distribution of the sum $\sum_{i=1}^{n} X_i$ and an appropriate Poisson or compound Poisson distribution. These bounds include a factor which depends on the smoothness of the approximating Poisson or compound Poisson distribution. This “smoothness factor” is of order $O(\sigma^{-2})$, according to a heuristic argument, where $\sigma^{2}$ denotes the variance of the approximating distribution. In this way, we offer sharp error estimates for a large range of values of the parameters. Finally, specific examples concerning appearances of rare runs in sequences of Bernoulli trials are presented by way of illustration.




pp

From Wends we came : the story of Johann and Maria Huppatz & their descendants / compiled by Frank Huppatz and Rone McDonnell.

Huppatz (Family).




pp

What Districts Want From Assessments, as They Grapple With the Coronavirus

EdWeek Market Brief asked district officials in a nationwide survey about their most urgent assessment needs, as they cope with COVID-19 and tentatively plan for reopening schools.

The post What Districts Want From Assessments, as They Grapple With the Coronavirus appeared first on Market Brief.




pp

Item 01: Autograph letter signed, from Hume, Appin, to William E. Riley, concerning an account for money owed by Riley, 4 September 1834




pp

Sydney in 1848 : illustrated by copper-plate engravings of its principal streets, public buildings, churches, chapels, etc. / from drawings by Joseph Fowles.




pp

CNN legal analysts say Barr dropping the Flynn case shows 'the fix was in.' Barr says winners write history.

The Justice Department announced Thursday that it is dropping its criminal case against President Trump's first national security adviser Michael Flynn. Flynn twice admitted in court he lied to the FBI about his conversations with Russia's U.S. ambassador, and then cooperated in Special Counsel Robert Mueller's investigation. It was an unusual move by the Justice Department, and CNN's legal and political analysts smelled a rat. "Attorney General [William] Barr is already being accused of creating a special justice system just for President Trump's friends," and this will only feed that perception, CNN's Jake Tapper suggested. Political correspondent Sara Murray agreed, noting that the prosecutor in the case, Brandon Van Grack, withdrew right before the Justice Department submitted its filing, just like when Barr intervened to request a reduced sentence for Roger Stone. National security correspondent Jim Sciutto laid out several reasons why the substance of Flynn's admitted lie was a big deal, and chief legal analyst Jeffrey Toobin was appalled. "It is one of the most incredible legal documents I have read, and certainly something that I never expected to see from the United States Department of Justice," Toobin said. "The idea that the Justice Department would invent an argument -- an argument that the judge in this case has already rejected -- and say that's a basis for dropping a case where a defendant admitted his guilt shows that this is a case where the fix was in." Barr told CBS News' Catherine Herridge on Thursday that dropping Flynn's case actually "sends the message that there is one standard of justice in this country." Herridge told Barr he would take flak for this, asking: "When history looks back on this decision, how do you think it will be written?" Barr laughed: "Well, history's written by the winners. So it largely depends on who's writing the history."





pp

A New Bayesian Approach to Robustness Against Outliers in Linear Regression

Philippe Gagnon, Alain Desgagné, Mylène Bédard.

Source: Bayesian Analysis, Volume 15, Number 2, 389--414.

Abstract:
Linear regression is ubiquitous in statistical analysis. It is well understood that conflicting sources of information may contaminate the inference when the classical normality of errors is assumed. The contamination caused by the light normal tails follows from an undesirable effect: the posterior concentrates in an area in between the different sources with a large enough scaling to incorporate them all. The theory of conflict resolution in Bayesian statistics (O’Hagan and Pericchi (2012)) recommends addressing this problem by limiting the impact of outliers in order to obtain conclusions consistent with the bulk of the data. In this paper, we propose a model with super heavy-tailed errors to achieve this. We prove that it is wholly robust, meaning that the impact of outliers gradually vanishes as they move further and further away from the general trend. The super heavy-tailed density is similar to the normal outside of the tails, which gives rise to an efficient estimation procedure. In addition, estimates are easily computed. This is highlighted via a detailed user guide, where all steps are explained through a simulated case study. The performance is shown using simulation. All required code is given.




pp

Dynamic Quantile Linear Models: A Bayesian Approach

Kelly C. M. Gonçalves, Hélio S. Migon, Leonardo S. Bastos.

Source: Bayesian Analysis, Volume 15, Number 2, 335--362.

Abstract:
The paper introduces a new class of models, named dynamic quantile linear models, which combines dynamic linear models with distribution-free quantile regression producing a robust statistical method. Bayesian estimation for the dynamic quantile linear model is performed using an efficient Markov chain Monte Carlo algorithm. The paper also proposes a fast sequential procedure suited for high-dimensional predictive modeling with massive data, where the generating process is changing over time. The proposed model is evaluated using synthetic and well-known time series data. The model is also applied to predict annual incidence of tuberculosis in the state of Rio de Janeiro and compared with global targets set by the World Health Organization.
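
For context, Bayesian quantile regression is typically built on the check loss, equivalently an asymmetric Laplace working likelihood; assuming that is the ingredient combined with the dynamic linear model here (the abstract does not spell out the likelihood), the $\tau$-quantile loss and likelihood kernel are

\[
\rho_\tau(u) = u\bigl(\tau - \mathbf{1}\{u < 0\}\bigr),
\qquad
p(y \mid \mu, \sigma) \propto \frac{1}{\sigma}\exp\Bigl\{-\rho_\tau\Bigl(\tfrac{y-\mu}{\sigma}\Bigr)\Bigr\},
\]

so that maximising this likelihood in $\mu$ minimises the check loss at quantile level $\tau$.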




pp

A Novel Algorithmic Approach to Bayesian Logic Regression (with Discussion)

Aliaksandr Hubin, Geir Storvik, Florian Frommlet.

Source: Bayesian Analysis, Volume 15, Number 1, 263--333.

Abstract:
Logic regression was developed more than a decade ago as a tool to construct predictors from Boolean combinations of binary covariates. It has been mainly used to model epistatic effects in genetic association studies, which is very appealing due to the intuitive interpretation of logic expressions to describe the interaction between genetic variations. Nevertheless logic regression has (partly due to computational challenges) remained less well known than other approaches to epistatic association mapping. Here we will adapt an advanced evolutionary algorithm called GMJMCMC (Genetically modified Mode Jumping Markov Chain Monte Carlo) to perform Bayesian model selection in the space of logic regression models. After describing the algorithmic details of GMJMCMC we perform a comprehensive simulation study that illustrates its performance given logic regression terms of various complexity. Specifically GMJMCMC is shown to be able to identify three-way and even four-way interactions with relatively large power, a level of complexity which has not been achieved by previous implementations of logic regression. We apply GMJMCMC to reanalyze QTL (quantitative trait locus) mapping data for Recombinant Inbred Lines in Arabidopsis thaliana and from a backcross population in Drosophila where we identify several interesting epistatic effects. The method is implemented in an R package which is available on github.




pp

Determinantal Point Process Mixtures Via Spectral Density Approach

Ilaria Bianchini, Alessandra Guglielmi, Fernando A. Quintana.

Source: Bayesian Analysis, Volume 15, Number 1, 187--214.

Abstract:
We consider mixture models where location parameters are a priori encouraged to be well separated. We explore a class of determinantal point process (DPP) mixture models, which provide the desired notion of separation or repulsion. Instead of using the rather restrictive case where analytical results are partially available, we adopt a spectral representation from which approximations to the DPP density functions can be readily computed. For the sake of concreteness the presentation focuses on a power exponential spectral density, but the proposed approach is in fact quite general. We later extend our model to incorporate covariate information in the likelihood and also in the assignment to mixture components, yielding a trade-off between repulsiveness of locations in the mixtures and attraction among subjects with similar covariates. We develop full Bayesian inference, and explore model properties and posterior behavior using several simulation scenarios and data illustrations. Supplementary materials for this article are available online (Bianchini et al., 2019).




pp

Adaptive Bayesian Nonparametric Regression Using a Kernel Mixture of Polynomials with Application to Partial Linear Models

Fangzheng Xie, Yanxun Xu.

Source: Bayesian Analysis, Volume 15, Number 1, 159--186.

Abstract:
We propose a kernel mixture of polynomials prior for Bayesian nonparametric regression. The regression function is modeled by local averages of polynomials with kernel mixture weights. We obtain the minimax-optimal contraction rate of the full posterior distribution up to a logarithmic factor by estimating metric entropies of certain function classes. Under the assumption that the degree of the polynomials is larger than the unknown smoothness level of the true function, the posterior contraction behavior can adapt to this smoothness level provided an upper bound is known. We also provide a frequentist sieve maximum likelihood estimator with a near-optimal convergence rate. We further investigate the application of the kernel mixture of polynomials to partial linear models and obtain both the near-optimal rate of contraction for the nonparametric component and the Bernstein-von Mises limit (i.e., asymptotic normality) of the parametric component. The proposed method is illustrated with numerical examples and shows superior performance in terms of computational efficiency, accuracy, and uncertainty quantification compared to the local polynomial regression, DiceKriging, and the robust Gaussian stochastic process.




pp

Calibration Procedures for Approximate Bayesian Credible Sets

Jeong Eun Lee, Geoff K. Nicholls, Robin J. Ryder.

Source: Bayesian Analysis, Volume 14, Number 4, 1245--1269.

Abstract:
We develop and apply two calibration procedures for checking the coverage of approximate Bayesian credible sets, including intervals estimated using Monte Carlo methods. The user has an ideal prior and likelihood, but generates a credible set for an approximate posterior based on some approximate prior and likelihood. We estimate the realised posterior coverage achieved by the approximate credible set. This is the coverage of the unknown “true” parameter if the data are a realisation of the user’s ideal observation model conditioned on the parameter, and the parameter is a draw from the user’s ideal prior. In one approach we estimate the posterior coverage at the data by making a semi-parametric logistic regression of binary coverage outcomes on simulated data against summary statistics evaluated on simulated data. In another we use Importance Sampling from the approximate posterior, windowing simulated data to fall close to the observed data. We illustrate our methods on four examples.
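
A minimal sketch of the first (regression-based) approach described above, with sample_prior, simulate_data, approx_credible_interval and summaries as user-supplied placeholders, and with a plain logistic regression standing in for the semi-parametric one used in the paper:

# Estimate realised posterior coverage of an approximate credible interval:
# simulate (theta, data) pairs from the ideal prior and observation model,
# record whether the approximate interval covers theta, then regress the binary
# coverage outcomes on data summaries and read off the fit at the observed data.
import numpy as np
from sklearn.linear_model import LogisticRegression

def estimate_coverage(sample_prior, simulate_data, approx_credible_interval,
                      summaries, observed_summary, n_sim=2000, seed=0):
    """summaries(data) must return a 1-D feature vector."""
    rng = np.random.default_rng(seed)
    covered, feats = [], []
    for _ in range(n_sim):
        theta = sample_prior(rng)
        data = simulate_data(theta, rng)
        lo, hi = approx_credible_interval(data)
        covered.append(lo <= theta <= hi)
        feats.append(summaries(data))
    model = LogisticRegression().fit(np.asarray(feats), np.asarray(covered))
    # Estimated realised coverage at the observed data's summary statistics.
    return model.predict_proba(np.asarray(observed_summary).reshape(1, -1))[0, 1]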




pp

Spatial Disease Mapping Using Directed Acyclic Graph Auto-Regressive (DAGAR) Models

Abhirup Datta, Sudipto Banerjee, James S. Hodges, Leiwen Gao.

Source: Bayesian Analysis, Volume 14, Number 4, 1221--1244.

Abstract:
Hierarchical models for regionally aggregated disease incidence data commonly involve region specific latent random effects that are modeled jointly as having a multivariate Gaussian distribution. The covariance or precision matrix incorporates the spatial dependence between the regions. Common choices for the precision matrix include the widely used ICAR model, which is singular, and its nonsingular extension which lacks interpretability. We propose a new parametric model for the precision matrix based on a directed acyclic graph (DAG) representation of the spatial dependence. Our model guarantees positive definiteness and, hence, in addition to being a valid prior for regional spatially correlated random effects, can also directly model the outcome from dependent data like images and networks. Theoretical results establish a link between the parameters in our model and the variance and covariances of the random effects. Simulation studies demonstrate that the improved interpretability of our model reaps benefits in terms of accurately recovering the latent spatial random effects as well as for inference on the spatial covariance parameters. Under modest spatial correlation, our model far outperforms the CAR models, while the performances are similar when the spatial correlation is strong. We also assess sensitivity to the choice of the ordering in the DAG construction using theoretical and empirical results which testify to the robustness of our model. We also present a large-scale public health application demonstrating the competitive performance of the model.




pp

Stochastic Approximations to the Pitman–Yor Process

Julyan Arbel, Pierpaolo De Blasi, Igor Prünster.

Source: Bayesian Analysis, Volume 14, Number 3, 753--771.

Abstract:
In this paper we consider approximations to the popular Pitman–Yor process obtained by truncating the stick-breaking representation. The truncation is determined by a random stopping rule that achieves an almost sure control on the approximation error in total variation distance. We derive the asymptotic distribution of the random truncation point as the approximation error $\epsilon$ goes to zero in terms of a polynomially tilted positive stable random variable. The practical usefulness and effectiveness of this theoretical result is demonstrated by devising a sampling algorithm to approximate functionals of the $\epsilon$-version of the Pitman–Yor process.
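
For orientation, the stick-breaking representation being truncated is sketched below; this version uses a simple deterministic rule (stop once the remaining stick mass falls below $\epsilon$) rather than the paper's random stopping rule that controls the total-variation error:

# Stick-breaking representation of a Pitman-Yor process PY(alpha, d, H),
# truncated once the un-broken stick mass falls below eps.
import numpy as np

def pitman_yor_stick_breaking(alpha, d, base_sampler, eps=1e-3, rng=None):
    rng = rng or np.random.default_rng()
    weights, atoms = [], []
    remaining = 1.0
    k = 1
    while remaining > eps:
        v = rng.beta(1 - d, alpha + k * d)   # Pitman-Yor stick-breaking proportion
        weights.append(remaining * v)
        atoms.append(base_sampler(rng))      # draw an atom from the base measure H
        remaining *= (1 - v)
        k += 1
    return np.array(weights), atoms          # weights sum to 1 - remaining

# Example: weights, atoms = pitman_yor_stick_breaking(1.0, 0.5, lambda r: r.normal())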




pp

Efficient Acquisition Rules for Model-Based Approximate Bayesian Computation

Marko Järvenpää, Michael U. Gutmann, Arijus Pleska, Aki Vehtari, Pekka Marttinen.

Source: Bayesian Analysis, Volume 14, Number 2, 595--622.

Abstract:
Approximate Bayesian computation (ABC) is a method for Bayesian inference when the likelihood is unavailable but simulating from the model is possible. However, many ABC algorithms require a large number of simulations, which can be costly. To reduce the computational cost, Bayesian optimisation (BO) and surrogate models such as Gaussian processes have been proposed. Bayesian optimisation enables one to intelligently decide where to evaluate the model next but common BO strategies are not designed for the goal of estimating the posterior distribution. Our paper addresses this gap in the literature. We propose to compute the uncertainty in the ABC posterior density, which is due to a lack of simulations to estimate this quantity accurately, and define a loss function that measures this uncertainty. We then propose to select the next evaluation location to minimise the expected loss. Experiments show that the proposed method often produces the most accurate approximations as compared to common BO strategies.




pp

A Bayesian Approach to Statistical Shape Analysis via the Projected Normal Distribution

Luis Gutiérrez, Eduardo Gutiérrez-Peña, Ramsés H. Mena.

Source: Bayesian Analysis, Volume 14, Number 2, 427--447.

Abstract:
This work presents a Bayesian predictive approach to statistical shape analysis. A modeling strategy that starts with a Gaussian distribution on the configuration space, and then removes the effects of location, rotation and scale, is studied. This boils down to an application of the projected normal distribution to model the configurations in the shape space, which together with certain identifiability constraints, facilitates parameter interpretation. Having better control over the parameters allows us to generalize the model to a regression setting where the effect of predictors on shapes can be considered. The methodology is illustrated and tested using both simulated scenarios and a real data set concerning eight anatomical landmarks on a sagittal plane of the corpus callosum in patients with autism and in a group of controls.




pp

Separable covariance arrays via the Tucker product, with applications to multivariate relational data

Peter D. Hoff.

Source: Bayesian Analysis, Volume 6, Number 2, 179--196.

Abstract:
Modern datasets are often in the form of matrices or arrays, potentially having correlations along each set of data indices. For example, data involving repeated measurements of several variables over time may exhibit temporal correlation as well as correlation among the variables. A possible model for matrix-valued data is the class of matrix normal distributions, which is parametrized by two covariance matrices, one for each index set of the data. In this article we discuss an extension of the matrix normal model to accommodate multidimensional data arrays, or tensors. We show how a particular array-matrix product can be used to generate the class of array normal distributions having separable covariance structure. We derive some properties of these covariance structures and the corresponding array normal distributions, and show how the array-matrix product can be used to define a semi-conjugate prior distribution and calculate the corresponding posterior distribution. We illustrate the methodology in an analysis of multivariate longitudinal network data which take the form of a four-way array.
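
A small sketch of the separable-covariance construction via mode-wise (Tucker) products, here used only to sample an array normal variate; the article's semi-conjugate prior and posterior computations are not reproduced:

# Sample from an array normal with separable covariance by multiplying an
# i.i.d. standard normal array along each mode by a Cholesky factor of that
# mode's covariance matrix.
import numpy as np

def mode_multiply(tensor, matrix, mode):
    """Multiply `tensor` along axis `mode` by `matrix` (square, matching that axis)."""
    out = np.tensordot(matrix, tensor, axes=(1, mode))
    return np.moveaxis(out, 0, mode)

def sample_array_normal(covs, rng=None):
    """covs: list of per-mode covariance matrices, one per array dimension."""
    rng = rng or np.random.default_rng()
    z = rng.standard_normal([c.shape[0] for c in covs])
    for mode, cov in enumerate(covs):
        z = mode_multiply(z, np.linalg.cholesky(cov), mode)
    return z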




pp

Maximum Independent Component Analysis with Application to EEG Data

Ruosi Guo, Chunming Zhang, Zhengjun Zhang.

Source: Statistical Science, Volume 35, Number 1, 145--157.

Abstract:
In many scientific disciplines, finding hidden influential factors behind observational data is essential but challenging. The majority of existing approaches, such as the independent component analysis ($\mathrm{ICA}$), rely on linear transformations; that is, true signals are linear combinations of hidden components. Motivated by the analysis of nonlinear temporal signals in neuroscience, genetics, and finance, this paper proposes the “maximum independent component analysis” ($\mathrm{MaxICA}$), based on max-linear combinations of components. In contrast to existing methods, $\mathrm{MaxICA}$ benefits from focusing on significant major components while filtering out ignorable components. A major tool for parameter learning of $\mathrm{MaxICA}$ is an augmented genetic algorithm, consisting of three schemes for the elite weighted sum selection, randomly combined crossover, and dynamic mutation. Extensive empirical evaluations demonstrate the effectiveness of $\mathrm{MaxICA}$ in either extracting max-linearly combined essential sources in many applications or supplying a better approximation for nonlinearly combined source signals, such as $\mathrm{EEG}$ recordings analyzed in this paper.
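
In symbols (notation mine, based on the abstract), the contrast is between linear and max-linear mixing of hidden sources:

\[
\text{linear ICA:}\quad X_j = \sum_{k} a_{jk} S_k,
\qquad
\text{max-linear:}\quad X_j = \max_{k}\, a_{jk} S_k, \quad a_{jk} \ge 0,
\]

so each observed signal is driven by its dominant weighted component rather than by a sum over all of them.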




pp

A Tale of Two Parasites: Statistical Modelling to Support Disease Control Programmes in Africa

Peter J. Diggle, Emanuele Giorgi, Julienne Atsame, Sylvie Ntsame Ella, Kisito Ogoussan, Katherine Gass.

Source: Statistical Science, Volume 35, Number 1, 42--50.

Abstract:
Vector-borne diseases have long presented major challenges to the health of rural communities in the wet tropical regions of the world, but especially in sub-Saharan Africa. In this paper, we describe the contribution that statistical modelling has made to the global elimination programme for one vector-borne disease, onchocerciasis. We explain why information on the spatial distribution of a second vector-borne disease, Loa loa, is needed before communities at high risk of onchocerciasis can be treated safely with mass distribution of ivermectin, an antifilarial medication. We show how a model-based geostatistical analysis of Loa loa prevalence survey data can be used to map the predictive probability that each location in the region of interest meets a WHO policy guideline for safe mass distribution of ivermectin and describe two applications: one is to data from Cameroon that assesses prevalence using traditional blood-smear microscopy; the other is to Africa-wide data that uses a low-cost questionnaire-based method. We describe how a recent technological development in image-based microscopy has resulted in a change of emphasis from prevalence alone to the bivariate spatial distribution of prevalence and the intensity of infection among infected individuals. We discuss how statistical modelling of the kind described here can contribute to health policy guidelines and decision-making in two ways. One is to ensure that, in a resource-limited setting, prevalence surveys are designed, and the resulting data analysed, as efficiently as possible. The other is to provide an honest quantification of the uncertainty attached to any binary decision by reporting predictive probabilities that a policy-defined condition for action is or is not met.




pp

Model-Based Approach to the Joint Analysis of Single-Cell Data on Chromatin Accessibility and Gene Expression

Zhixiang Lin, Mahdi Zamanighomi, Timothy Daley, Shining Ma, Wing Hung Wong.

Source: Statistical Science, Volume 35, Number 1, 2--13.

Abstract:
Unsupervised methods, including clustering methods, are essential to the analysis of single-cell genomic data. Model-based clustering methods are under-explored in the area of single-cell genomics, and have the advantage of quantifying the uncertainty of the clustering result. Here we develop a model-based approach for the integrative analysis of single-cell chromatin accessibility and gene expression data. We show that, by combining these two types of data, we can achieve a better separation of the underlying cell types. An efficient Markov chain Monte Carlo algorithm is also developed.




pp

Models as Approximations—Rejoinder

Andreas Buja, Arun Kumar Kuchibhotla, Richard Berk, Edward George, Eric Tchetgen Tchetgen, Linda Zhao.

Source: Statistical Science, Volume 34, Number 4, 606--620.

Abstract:
We respond to the discussants of our articles emphasizing the importance of inference under misspecification in the context of the reproducibility/replicability crisis. Along the way, we discuss the roles of diagnostics and model building in regression as well as connections between our well-specification framework and semiparametric theory.




pp

Discussion: Models as Approximations

Dalia Ghanem, Todd A. Kuffner.

Source: Statistical Science, Volume 34, Number 4, 604--605.




pp

Comment: Models as (Deliberate) Approximations

David Whitney, Ali Shojaie, Marco Carone.

Source: Statistical Science, Volume 34, Number 4, 591--598.




pp

Comment: Models Are Approximations!

Anthony C. Davison, Erwan Koch, Jonathan Koh.

Source: Statistical Science, Volume 34, Number 4, 584--590.

Abstract:
This discussion focuses on areas of disagreement with the papers, particularly the target of inference and the case for using the robust ‘sandwich’ variance estimator in the presence of moderate mis-specification. We also suggest that existing procedures may be appreciably more powerful for detecting mis-specification than the authors’ RAV statistic, and comment on the use of the pairs bootstrap in balanced situations.




pp

Comment: “Models as Approximations I: Consequences Illustrated with Linear Regression” by A. Buja, R. Berk, L. Brown, E. George, E. Pitkin, L. Zhao and K. Zhang

Roderick J. Little.

Source: Statistical Science, Volume 34, Number 4, 580--583.




pp

Discussion of Models as Approximations I & II

Dag Tjøstheim.

Source: Statistical Science, Volume 34, Number 4, 575--579.




pp

Comment: Models as Approximations

Nikki L. B. Freeman, Xiaotong Jiang, Owen E. Leete, Daniel J. Luckett, Teeranan Pokaprakarn, Michael R. Kosorok.

Source: Statistical Science, Volume 34, Number 4, 572--574.




pp

Comment on Models as Approximations, Parts I and II, by Buja et al.

Jerald F. Lawless.

Source: Statistical Science, Volume 34, Number 4, 569--571.

Abstract:
I comment on the papers Models as Approximations I and II, by A. Buja, R. Berk, L. Brown, E. George, E. Pitkin, M. Traskin, L. Zhao and K. Zhang.




pp

Discussion of Models as Approximations I & II

Sara van de Geer.

Source: Statistical Science, Volume 34, Number 4, 566--568.

Abstract:
We discuss the papers “Models as Approximations” I & II, by A. Buja, R. Berk, L. Brown, E. George, E. Pitkin, M. Traskin, L. Zhao and K. Zhang (Part I) and A. Buja, L. Brown, A. K. Kuchibhotla, R. Berk, E. George and L. Zhao (Part II). We present a summary with some details for the generalized linear model.




pp

Models as Approximations II: A Model-Free Theory of Parametric Regression

Andreas Buja, Lawrence Brown, Arun Kumar Kuchibhotla, Richard Berk, Edward George, Linda Zhao.

Source: Statistical Science, Volume 34, Number 4, 545--565.

Abstract:
We develop a model-free theory of general types of parametric regression for i.i.d. observations. The theory replaces the parameters of parametric models with statistical functionals, to be called “regression functionals,” defined on large nonparametric classes of joint $x\textrm{-}y$ distributions, without assuming a correct model. Parametric models are reduced to heuristics to suggest plausible objective functions. An example of a regression functional is the vector of slopes of linear equations fitted by OLS to largely arbitrary $x\textrm{-}y$ distributions, without assuming a linear model (see Part I). More generally, regression functionals can be defined by minimizing objective functions, solving estimating equations, or with ad hoc constructions. In this framework, it is possible to achieve the following: (1) define a notion of “well-specification” for regression functionals that replaces the notion of correct specification of models, (2) propose a well-specification diagnostic for regression functionals based on reweighting distributions and data, (3) decompose sampling variability of regression functionals into two sources, one due to the conditional response distribution and another due to the regressor distribution interacting with misspecification, both of order $N^{-1/2}$, (4) exhibit plug-in/sandwich estimators of standard error as limit cases of $x\textrm{-}y$ bootstrap estimators, and (5) provide theoretical heuristics to indicate that $x\textrm{-}y$ bootstrap standard errors may generally be preferred over sandwich estimators.
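
For concreteness, the plug-in sandwich form referred to in item (4), written for OLS and for a general regression functional defined by an estimating equation (standard formulas, stated here only as background to the abstract):

\[
\widehat{\mathrm{avar}}_{\mathrm{sand}}(\hat\beta)
  = \Bigl(\sum_i x_i x_i^\top\Bigr)^{-1}
    \Bigl(\sum_i \hat r_i^{2}\, x_i x_i^\top\Bigr)
    \Bigl(\sum_i x_i x_i^\top\Bigr)^{-1},
\qquad \hat r_i = y_i - x_i^\top \hat\beta,
\]

and, for $\hat\theta$ solving $\sum_i \psi(\theta; x_i, y_i) = 0$,

\[
\mathrm{avar}(\hat\theta) = A^{-1} B A^{-\top},
\qquad
A = \mathbf{E}\bigl[-\partial_\theta \psi\bigr],
\qquad
B = \mathbf{E}\bigl[\psi \psi^\top\bigr].
\]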




pp

Models as Approximations I: Consequences Illustrated with Linear Regression

Andreas Buja, Lawrence Brown, Richard Berk, Edward George, Emil Pitkin, Mikhail Traskin, Kai Zhang, Linda Zhao.

Source: Statistical Science, Volume 34, Number 4, 523--544.

Abstract:
In the early 1980s, Halbert White inaugurated a “model-robust” form of statistical inference based on the “sandwich estimator” of standard error. This estimator is known to be “heteroskedasticity-consistent,” but it is less well known to be “nonlinearity-consistent” as well. Nonlinearity, however, raises fundamental issues because in its presence regressors are not ancillary, hence cannot be treated as fixed. The consequences are deep: (1) population slopes need to be reinterpreted as statistical functionals obtained from OLS fits to largely arbitrary joint $x\textrm{-}y$ distributions; (2) the meaning of slope parameters needs to be rethought; (3) the regressor distribution affects the slope parameters; (4) randomness of the regressors becomes a source of sampling variability in slope estimates of order $1/\sqrt{N}$; (5) inference needs to be based on model-robust standard errors, including sandwich estimators or the $x\textrm{-}y$ bootstrap. In theory, model-robust and model-trusting standard errors can deviate by arbitrary magnitudes either way. In practice, significant deviations between them can be detected with a diagnostic test.
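
A small numerical sketch of the comparison the abstract describes, contrasting model-trusting, sandwich (HC0), and x-y (pairs) bootstrap standard errors for an OLS slope under a deliberately nonlinear truth (all settings here are illustrative assumptions, not the paper's examples):

# Under a nonlinear truth fitted with a linear model, the sandwich and pairs
# bootstrap standard errors agree with each other and differ from the
# model-trusting OLS standard error.
import numpy as np

rng = np.random.default_rng(0)
n = 2000
x = rng.uniform(-1, 3, n)
y = x ** 2 + rng.normal(0, 1, n)            # nonlinear truth, linear fit
X = np.column_stack([np.ones(n), x])

def ols(X, y):
    return np.linalg.solve(X.T @ X, X.T @ y)

beta = ols(X, y)
resid = y - X @ beta
XtX_inv = np.linalg.inv(X.T @ X)

se_model = np.sqrt(np.diag(XtX_inv) * resid.var(ddof=2))         # usual OLS SE
meat = (X * resid[:, None] ** 2).T @ X
se_sandwich = np.sqrt(np.diag(XtX_inv @ meat @ XtX_inv))         # HC0 sandwich SE

boot = np.array([ols(X[idx], y[idx])
                 for idx in rng.integers(0, n, size=(1000, n))]) # pairs bootstrap
se_boot = boot.std(axis=0, ddof=1)

print(se_model[1], se_sandwich[1], se_boot[1])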




pp

A Kernel Regression Procedure in the 3D Shape Space with an Application to Online Sales of Children’s Wear

Gregorio Quintana-Ortí, Amelia Simó.

Source: Statistical Science, Volume 34, Number 2, 236--252.

Abstract:
This paper is focused on kernel regression when the response variable is the shape of a 3D object represented by a configuration matrix of landmarks. Regression methods on this shape space are not trivial because this space has a complex finite-dimensional Riemannian manifold structure (non-Euclidean). Papers about it are scarce in the literature, the majority of them are restricted to the case of a single explanatory variable, and many of them are based on the approximated tangent space. In this paper, there are several methodological innovations. The first one is the adaptation of the general method for kernel regression analysis in manifold-valued data to the three-dimensional case of Kendall’s shape space. The second one is its generalization to the multivariate case and the addressing of the curse-of-dimensionality problem. Finally, we propose bootstrap confidence intervals for prediction. A simulation study is carried out to check the goodness of the procedure, and a comparison with a current approach is performed. Then, it is applied to a 3D database obtained from an anthropometric survey of the Spanish child population with a potential application to online sales of children’s wear.