model

Optimal rates for community estimation in the weighted stochastic block model

Min Xu, Varun Jog, Po-Ling Loh.

Source: The Annals of Statistics, Volume 48, Number 1, 183--204.

Abstract:
Community identification in a network is an important problem in fields such as social science, neuroscience and genetics. Over the past decade, stochastic block models (SBMs) have emerged as a popular statistical framework for this problem. However, SBMs have an important limitation in that they are suited only for networks with unweighted edges; in various scientific applications, disregarding the edge weights may result in a loss of valuable information. We study a weighted generalization of the SBM, in which observations are collected in the form of a weighted adjacency matrix and the weight of each edge is generated independently from an unknown probability density determined by the community membership of its endpoints. We characterize the optimal rate of misclustering error of the weighted SBM in terms of the Rényi divergence of order 1/2 between the weight distributions of within-community and between-community edges, substantially generalizing existing results for unweighted SBMs. Furthermore, we present a computationally tractable algorithm based on discretization that achieves the optimal error rate. Our method is adaptive in the sense that the algorithm, without assuming knowledge of the weight densities, performs as well as the best algorithm that knows the weight densities.
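A minimal numerical sketch of the quantity driving the optimal rate above: the Rényi divergence of order 1/2, $D_{1/2}(p,q) = -2\log\int\sqrt{p(x)q(x)}\,dx$, evaluated here for two illustrative Gaussian edge-weight densities (the densities and the integration range are placeholders, not the paper's examples).

```python
# Hedged sketch: Renyi divergence of order 1/2 between two edge-weight densities,
# computed via the Bhattacharyya coefficient and checked against the closed form
# for equal-variance Gaussians. Illustrative only.
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

def renyi_half(p_pdf, q_pdf, lo=-20.0, hi=20.0):
    """Renyi divergence of order 1/2: -2 * log of the Bhattacharyya coefficient."""
    bc, _ = quad(lambda x: np.sqrt(p_pdf(x) * q_pdf(x)), lo, hi)
    return -2.0 * np.log(bc)

# within-community vs. between-community weight densities (made-up parameters)
p = norm(loc=1.0, scale=1.0).pdf
q = norm(loc=0.0, scale=1.0).pdf

print(renyi_half(p, q))            # numerical value
print((1.0 - 0.0) ** 2 / 4.0)      # closed form (mu1 - mu2)^2 / (4 sigma^2) for equal variances
```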




model

Model assisted variable clustering: Minimax-optimal recovery and algorithms

Florentina Bunea, Christophe Giraud, Xi Luo, Martin Royer, Nicolas Verzelen.

Source: The Annals of Statistics, Volume 48, Number 1, 111--137.

Abstract:
The problem of variable clustering is that of estimating groups of similar components of a $p$-dimensional vector $X=(X_{1},\ldots,X_{p})$ from $n$ independent copies of $X$. There exists a large number of algorithms that return data-dependent groups of variables, but their interpretation is limited to the algorithm that produced them. An alternative is model-based clustering, in which one begins by defining population level clusters relative to a model that embeds notions of similarity. Algorithms tailored to such models yield estimated clusters with a clear statistical interpretation. We take this view here and introduce the class of $G$-block covariance models as a background model for variable clustering. In such models, two variables in a cluster are deemed similar if they have similar associations with all other variables. This can arise, for instance, when groups of variables are noise corrupted versions of the same latent factor. We quantify the difficulty of clustering data generated from a $G$-block covariance model in terms of cluster proximity, measured with respect to two related, but different, cluster separation metrics. We derive minimax cluster separation thresholds, which are the metric values below which no algorithm can recover the model-defined clusters exactly, and show that they are different for the two metrics. We therefore develop two algorithms, COD and PECOK, tailored to $G$-block covariance models, and study their minimax-optimality with respect to each metric. Of independent interest is the fact that the analysis of the PECOK algorithm, which is based on a corrected convex relaxation of the popular $K$-means algorithm, provides the first statistical analysis of such algorithms for variable clustering. Additionally, we compare our methods with another popular clustering method, spectral clustering. Extensive simulation studies, as well as our data analyses, confirm the applicability of our approach.
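A hedged sketch of the latent-factor instance of a $G$-block covariance model mentioned above, in which each coordinate of $X$ is a noise-corrupted copy of its cluster's latent factor, so the population covariance has the block form $\Sigma = ACA^{T} + \Gamma$. Cluster sizes, the latent covariance $C$ and the noise level are made up for illustration.

```python
# Hedged sketch: simulate n copies of X from a latent-factor G-block covariance model.
import numpy as np

rng = np.random.default_rng(0)
K, sizes, n = 3, [4, 3, 5], 200          # K clusters, p = 12 variables, n copies
p = sum(sizes)
labels = np.repeat(np.arange(K), sizes)  # true partition G
A = np.eye(K)[labels]                    # p x K membership matrix

C = np.array([[1.0, 0.3, 0.1],           # K x K latent factor covariance (illustrative)
              [0.3, 1.0, 0.2],
              [0.1, 0.2, 1.0]])
gamma = 0.5                              # idiosyncratic noise variance

Z = rng.multivariate_normal(np.zeros(K), C, size=n)      # latent factors
X = Z @ A.T + np.sqrt(gamma) * rng.standard_normal((n, p))

Sigma = A @ C @ A.T + gamma * np.eye(p)                  # population G-block covariance
print(np.round(Sigma[:5, :5], 2))
print(np.round(np.cov(X, rowvar=False)[:5, :5], 2))      # empirical counterpart
```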




model

Minimax posterior convergence rates and model selection consistency in high-dimensional DAG models based on sparse Cholesky factors

Kyoungjae Lee, Jaeyong Lee, Lizhen Lin.

Source: The Annals of Statistics, Volume 47, Number 6, 3413--3437.

Abstract:
In this paper we study the high-dimensional sparse directed acyclic graph (DAG) models under the empirical sparse Cholesky prior. Among our results, strong model selection consistency or graph selection consistency is obtained under more general conditions than those in the existing literature. Compared to Cao, Khare and Ghosh [Ann. Statist. 47 (2019) 319–348], the required conditions are weakened in terms of the dimensionality, sparsity and lower bound of the nonzero elements in the Cholesky factor. Furthermore, our result does not require the irrepresentable condition, which is necessary for Lasso-type methods. We also derive the posterior convergence rates for precision matrices and Cholesky factors with respect to various matrix norms. The obtained posterior convergence rates are the fastest among those of the existing Bayesian approaches. In particular, we prove that our posterior convergence rates for Cholesky factors are minimax or at least nearly minimax depending on the relative size of true sparseness for the entire dimension. The simulation study confirms that the proposed method outperforms the competing methods.




model

On optimal designs for nonregular models

Yi Lin, Ryan Martin, Min Yang.

Source: The Annals of Statistics, Volume 47, Number 6, 3335--3359.

Abstract:
Classically, Fisher information is the relevant object in defining optimal experimental designs. However, for models that lack certain regularity, the Fisher information does not exist, and hence, there is no notion of design optimality available in the literature. This article seeks to fill the gap by proposing a so-called Hellinger information, which generalizes Fisher information in the sense that the two measures agree in regular problems, but the former also exists for certain types of nonregular problems. We derive a Hellinger information inequality, showing that Hellinger information defines a lower bound on the local minimax risk of estimators. This provides a connection between features of the underlying model—in particular, the design—and the performance of estimators, motivating the use of this new Hellinger information for nonregular optimal design problems. Hellinger optimal designs are derived for several nonregular regression problems, with numerical results empirically demonstrating the efficiency of these designs compared to alternatives.




model

Statistical inference for autoregressive models under heteroscedasticity of unknown form

Ke Zhu.

Source: The Annals of Statistics, Volume 47, Number 6, 3185--3215.

Abstract:
This paper provides an entire inference procedure for the autoregressive model under (conditional) heteroscedasticity of unknown form with a finite variance. We first establish the asymptotic normality of the weighted least absolute deviations estimator (LADE) for the model. Second, we develop the random weighting (RW) method to estimate its asymptotic covariance matrix, leading to the implementation of the Wald test. Third, we construct a portmanteau test for model checking, and use the RW method to obtain its critical values. As a special weighted LADE, the feasible adaptive LADE (ALADE) is proposed and proved to have the same efficiency as its infeasible counterpart. The importance of our entire methodology based on the feasible ALADE is illustrated by simulation results and the real data analysis on three U.S. economic data sets.
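A minimal sketch of the kind of weighted least absolute deviations fit described above, for an AR(1) model with conditionally heteroscedastic errors. The weights below are a simple illustrative down-weighting of large lagged values, not the paper's proposed weights, and the data-generating parameters are made up.

```python
# Hedged sketch: weighted LAD estimation of an AR(1) model,
# minimizing sum_t w_t * |y_t - c - phi * y_{t-1}|.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
n, phi_true = 500, 0.5
y = np.zeros(n)
for t in range(1, n):                       # AR(1) with ARCH-type heteroscedastic noise
    sigma_t = np.sqrt(0.5 + 0.4 * y[t - 1] ** 2)
    y[t] = phi_true * y[t - 1] + sigma_t * rng.standard_normal()

ylag, ycur = y[:-1], y[1:]
w = 1.0 / np.maximum(1.0, np.abs(ylag))     # illustrative weights only

def weighted_lad_loss(theta):
    c, phi = theta
    return np.sum(w * np.abs(ycur - c - phi * ylag))

fit = minimize(weighted_lad_loss, x0=np.array([0.0, 0.0]), method="Nelder-Mead")
print(fit.x)                                # (intercept, AR coefficient) estimates
```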




model

Adaptive estimation of the rank of the coefficient matrix in high-dimensional multivariate response regression models

Xin Bing, Marten H. Wegkamp.

Source: The Annals of Statistics, Volume 47, Number 6, 3157--3184.

Abstract:
We consider the multivariate response regression problem with a regression coefficient matrix of low, unknown rank. In this setting, we analyze a new criterion for selecting the optimal reduced rank. This criterion differs notably from the one proposed in Bunea, She and Wegkamp (Ann. Statist. 39 (2011) 1282–1309) in that it does not require estimation of the unknown variance of the noise, nor does it depend on a delicate choice of a tuning parameter. We develop an iterative, fully data-driven procedure that adapts to the optimal signal-to-noise ratio. This procedure finds the true rank in a few steps with overwhelming probability. At each step, our estimate increases, while at the same time it does not exceed the true rank. Our finite sample results hold for any sample size and any dimension, even when the number of responses and of covariates grow much faster than the number of observations. We perform an extensive simulation study that confirms our theoretical findings. The new method performs better and is more stable than the procedure of Bunea, She and Wegkamp (Ann. Statist. 39 (2011) 1282–1309) in both low- and high-dimensional settings.




model

Additive models with trend filtering

Veeranjaneyulu Sadhanala, Ryan J. Tibshirani.

Source: The Annals of Statistics, Volume 47, Number 6, 3032--3068.

Abstract:
We study additive models built with trend filtering, that is, additive models whose components are each regularized by the (discrete) total variation of their $k$th (discrete) derivative, for a chosen integer $k\geq 0$. This results in $k$th degree piecewise polynomial components (e.g., $k=0$ gives piecewise constant components, $k=1$ gives piecewise linear, $k=2$ gives piecewise quadratic, etc.). Analogous to its advantages in the univariate case, additive trend filtering has favorable theoretical and computational properties, thanks in large part to the localized nature of the (discrete) total variation regularizer that it uses. On the theory side, we derive fast error rates for additive trend filtering estimates, and show these rates are minimax optimal when the underlying function is additive and has component functions whose derivatives are of bounded variation. We also show that these rates are unattainable by additive smoothing splines (and by additive models built from linear smoothers, in general). On the computational side, we use backfitting to leverage fast univariate trend filtering solvers; we also describe a new backfitting algorithm whose iterations can be run in parallel, which (as far as we can tell) is the first of its kind. Lastly, we present a number of experiments to examine the empirical performance of trend filtering.
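A hedged sketch of the univariate building block that backfitting cycles over: trend filtering of order $k$, i.e., minimizing $\tfrac{1}{2}\|y-\theta\|_2^2 + \lambda\|D^{(k+1)}\theta\|_1$ with $D^{(k+1)}$ the $(k+1)$th order discrete difference operator. It is solved here with the generic convex solver cvxpy rather than a specialized trend filtering solver; the penalty level and simulated data are illustrative.

```python
# Hedged sketch: univariate trend filtering (k = 1 gives a piecewise linear fit).
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(2)
n, k, lam = 100, 1, 5.0
x = np.linspace(0, 1, n)
y = np.abs(x - 0.5) + 0.05 * rng.standard_normal(n)    # piecewise linear truth + noise

D = np.diff(np.eye(n), n=k + 1, axis=0)                # (k+1)-th order difference operator
theta = cp.Variable(n)
objective = cp.Minimize(0.5 * cp.sum_squares(y - theta) + lam * cp.norm1(D @ theta))
cp.Problem(objective).solve()

fitted = theta.value                                   # degree-k piecewise polynomial fit
print(fitted[:5])
```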




model

Projected spline estimation of the nonparametric function in high-dimensional partially linear models for massive data

Heng Lian, Kaifeng Zhao, Shaogao Lv.

Source: The Annals of Statistics, Volume 47, Number 5, 2922--2949.

Abstract:
In this paper, we consider the local asymptotics of the nonparametric function in a partially linear model, within the framework of the divide-and-conquer estimation. Unlike the fixed-dimensional setting in which the parametric part does not affect the nonparametric part, the high-dimensional setting makes the issue more complicated. In particular, when a sparsity-inducing penalty such as lasso is used to make the estimation of the linear part feasible, the bias introduced will propagate to the nonparametric part. We propose a novel approach for estimation of the nonparametric function and establish the local asymptotics of the estimator. The result is useful for massive data with possibly different linear coefficients in each subpopulation but common nonparametric function. Some numerical illustrations are also presented.




model

Eigenvalue distributions of variance components estimators in high-dimensional random effects models

Zhou Fan, Iain M. Johnstone.

Source: The Annals of Statistics, Volume 47, Number 5, 2855--2886.

Abstract:
We study the spectra of MANOVA estimators for variance component covariance matrices in multivariate random effects models. When the dimensionality of the observations is large and comparable to the number of realizations of each random effect, we show that the empirical spectra of such estimators are well approximated by deterministic laws. The Stieltjes transforms of these laws are characterized by systems of fixed-point equations, which are numerically solvable by a simple iterative procedure. Our proof uses operator-valued free probability theory, and we establish a general asymptotic freeness result for families of rectangular orthogonally invariant random matrices, which is of independent interest. Our work is motivated in part by the estimation of components of covariance between multiple phenotypic traits in quantitative genetics, and we specialize our results to common experimental designs that arise in this application.




model

Exact lower bounds for the agnostic probably-approximately-correct (PAC) machine learning model

Aryeh Kontorovich, Iosif Pinelis.

Source: The Annals of Statistics, Volume 47, Number 5, 2822--2854.

Abstract:
We provide an exact nonasymptotic lower bound on the minimax expected excess risk (EER) in the agnostic probably-approximately-correct (PAC) machine learning classification model and identify minimax learning algorithms as certain maximally symmetric and minimally randomized “voting” procedures. Based on this result, an exact asymptotic lower bound on the minimax EER is provided. This bound is of the simple form $c_{\infty}/\sqrt{\nu}$ as $\nu\to\infty$, where $c_{\infty}=0.16997\dots$ is a universal constant, $\nu=m/d$, $m$ is the size of the training sample and $d$ is the Vapnik–Chervonenkis dimension of the hypothesis class. It is shown that the differences between these asymptotic and nonasymptotic bounds, as well as the differences between these two bounds and the maximum EER of any learning algorithms that minimize the empirical risk, are asymptotically negligible, and all these differences are due to ties in the mentioned “voting” procedures. A few easy to compute nonasymptotic lower bounds on the minimax EER are also obtained, which are shown to be close to the exact asymptotic lower bound $c_{\infty}/\sqrt{\nu}$ even for rather small values of the ratio $\nu=m/d$. As an application of these results, we substantially improve existing lower bounds on the tail probability of the excess risk. Among the tools used are Bayes estimation and apparently new identities and inequalities for binomial distributions.
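A quick worked example of the asymptotic lower bound named above, $c_{\infty}/\sqrt{\nu}$ with $\nu=m/d$; the training-set sizes $m$ and VC dimensions $d$ below are made up for illustration.

```python
# Hedged sketch: evaluating the asymptotic minimax lower bound on the expected excess risk.
import math

c_inf = 0.16997   # universal constant from the abstract (truncated decimal)

for m, d in [(1000, 10), (10000, 10), (1000, 100)]:
    nu = m / d
    print(f"m={m:6d}  d={d:4d}  nu={nu:8.1f}  lower bound ~ {c_inf / math.sqrt(nu):.4f}")
```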




model

An operator theoretic approach to nonparametric mixture models

Robert A. Vandermeulen, Clayton D. Scott.

Source: The Annals of Statistics, Volume 47, Number 5, 2704--2733.

Abstract:
When estimating finite mixture models, it is common to make assumptions on the mixture components, such as parametric assumptions. In this work, we make no distributional assumptions on the mixture components and instead assume that observations from the mixture model are grouped, such that observations in the same group are known to be drawn from the same mixture component. We precisely characterize the number of observations $n$ per group needed for the mixture model to be identifiable, as a function of the number $m$ of mixture components. In addition to our assumption-free analysis, we also study the settings where the mixture components are either linearly independent or jointly irreducible. Furthermore, our analysis considers two kinds of identifiability, where the mixture model is the simplest one explaining the data, and where it is the only one. As an application of these results, we precisely characterize identifiability of multinomial mixture models. Our analysis relies on an operator-theoretic framework that associates mixture models in the grouped-sample setting with certain infinite-dimensional tensors. Based on this framework, we introduce a general spectral algorithm for recovering the mixture components.




model

Linear hypothesis testing for high dimensional generalized linear models

Chengchun Shi, Rui Song, Zhao Chen, Runze Li.

Source: The Annals of Statistics, Volume 47, Number 5, 2671--2703.

Abstract:
This paper is concerned with testing linear hypotheses in high dimensional generalized linear models. To deal with linear hypotheses, we first propose the constrained partial regularization method and study its statistical properties. We further introduce an algorithm for solving regularization problems with folded-concave penalty functions and linear constraints. To test linear hypotheses, we propose a partial penalized likelihood ratio test, a partial penalized score test and a partial penalized Wald test. We show that the limiting null distributions of these three test statistics are $\chi^{2}$ distributions with the same degrees of freedom, and under local alternatives, they asymptotically follow noncentral $\chi^{2}$ distributions with the same degrees of freedom and noncentrality parameter, provided the number of parameters involved in the test hypothesis grows to $\infty$ at a certain rate. Simulation studies are conducted to examine the finite sample performance of the proposed tests. Empirical analysis of a real data example is used to illustrate the proposed testing procedures.
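For intuition about the $\chi^{2}$ reference distribution, here is a hedged sketch of the classical, low-dimensional, unpenalized analogue: a likelihood ratio test of the linear hypothesis that the last $r$ coefficients of a logistic regression are zero, referred to a $\chi^{2}_{r}$ distribution. This is not the paper's partial penalized procedure, and the simulated data and dimensions are illustrative.

```python
# Hedged sketch: classical LRT for H0: last r coefficients equal zero in a logistic GLM.
import numpy as np
import statsmodels.api as sm
from scipy.stats import chi2

rng = np.random.default_rng(3)
n, p, r = 500, 6, 2
X = rng.standard_normal((n, p))
beta = np.array([1.0, -1.0, 0.5, 0.0, 0.0, 0.0])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(X @ beta))))

full = sm.Logit(y, sm.add_constant(X)).fit(disp=0)
reduced = sm.Logit(y, sm.add_constant(X[:, :p - r])).fit(disp=0)

lr_stat = 2.0 * (full.llf - reduced.llf)
print(lr_stat, chi2.sf(lr_stat, df=r))    # statistic and chi^2_r p-value
```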




model

Property testing in high-dimensional Ising models

Matey Neykov, Han Liu.

Source: The Annals of Statistics, Volume 47, Number 5, 2472--2503.

Abstract:
This paper explores the information-theoretic limitations of graph property testing in zero-field Ising models. Instead of learning the entire graph structure, sometimes testing a basic graph property such as connectivity, cycle presence or maximum clique size is a more relevant and attainable objective. Since property testing is more fundamental than graph recovery, any necessary conditions for property testing imply corresponding conditions for graph recovery, while custom property tests can be statistically and/or computationally more efficient than graph recovery based algorithms. Understanding the statistical complexity of property testing requires the distinction of ferromagnetic (i.e., positive interactions only) and general Ising models. Using combinatorial constructs such as graph packing and strong monotonicity, we characterize how target properties affect the corresponding minimax upper and lower bounds within the realm of ferromagnets. On the other hand, by studying the detection of an antiferromagnetic (i.e., negative interactions only) Curie–Weiss model buried in Rademacher noise, we show that property testing is strictly more challenging over general Ising models. In terms of methodological development, we propose two types of correlation based tests: computationally efficient screening for ferromagnets, and score type tests for general models, including a fast cycle presence test. Our correlation screening tests match the information-theoretic bounds for property testing in ferromagnets in certain regimes.




model

Dynamic network models and graphon estimation

Marianna Pensky.

Source: The Annals of Statistics, Volume 47, Number 4, 2378--2403.

Abstract:
In the present paper, we consider a dynamic stochastic network model. The objective is estimation of the tensor of connection probabilities $\mathbf{\Lambda}$ when it is generated by a Dynamic Stochastic Block Model (DSBM) or a dynamic graphon. In particular, in the context of the DSBM, we derive a penalized least squares estimator $\widehat{\boldsymbol{\Lambda}}$ of $\mathbf{\Lambda}$ and show that $\widehat{\boldsymbol{\Lambda}}$ satisfies an oracle inequality and also attains minimax lower bounds for the risk. We extend those results to estimation of $\mathbf{\Lambda}$ when it is generated by a dynamic graphon function. The estimators constructed in the paper are adaptive to the unknown number of blocks in the context of the DSBM or to the smoothness of the graphon function. The technique relies on the vectorization of the model and leads to much simpler mathematical arguments than the ones used previously in the stationary set up. In addition, all results in the paper are nonasymptotic and allow a variety of extensions.
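A hedged sketch of the object being estimated: the tensor of connection probabilities $\mathbf{\Lambda}$ under a simple DSBM, together with one sampled adjacency matrix per time point. Block memberships, the connectivity matrices and all dimensions below are illustrative, not the paper's setup.

```python
# Hedged sketch: build a DSBM probability tensor Lambda (T x n x n) and sample networks.
import numpy as np

rng = np.random.default_rng(4)
n, K, T = 60, 3, 5
z = rng.integers(0, K, size=n)                      # block memberships (held fixed here)

Lambda = np.empty((T, n, n))
A = np.empty((T, n, n), dtype=int)
for t in range(T):
    B_t = 0.05 + 0.25 * np.eye(K) + 0.02 * t        # slowly drifting K x K connectivity
    Lambda[t] = B_t[z][:, z]                        # node-level connection probabilities
    upper = rng.random((n, n)) < Lambda[t]          # Bernoulli draws
    A[t] = np.triu(upper, 1)
    A[t] = A[t] + A[t].T                            # symmetric adjacency, no self-loops

print(Lambda.shape, A.sum(axis=(1, 2)))             # tensor shape and edge counts per time
```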




model

Bayesian mixed effects models for zero-inflated compositions in microbiome data analysis

Boyu Ren, Sergio Bacallado, Stefano Favaro, Tommi Vatanen, Curtis Huttenhower, Lorenzo Trippa.

Source: The Annals of Applied Statistics, Volume 14, Number 1, 494--517.

Abstract:
Detecting associations between microbial compositions and sample characteristics is one of the most important tasks in microbiome studies. Most of the existing methods apply univariate models to single microbial species separately, with adjustments for multiple hypothesis testing. We propose a Bayesian analysis for a generalized mixed effects linear model tailored to this application. The marginal prior on each microbial composition is a Dirichlet process, and dependence across compositions is induced through a linear combination of individual covariates, such as disease biomarkers or the subject’s age, and latent factors. The latent factors capture residual variability and their dimensionality is learned from the data in a fully Bayesian procedure. The proposed model is tested in data analyses and simulation studies with zero-inflated compositions. In these settings and within each sample, a large proportion of counts per microbial species are equal to zero. In our Bayesian model a priori the probability of compositions with absent microbial species is strictly positive. We propose an efficient algorithm to sample from the posterior and visualizations of model parameters which reveal associations between covariates and microbial compositions. We evaluate the proposed method in simulation studies, and then analyze a microbiome dataset for infants with type 1 diabetes which contains a large proportion of zeros in the sample-specific microbial compositions.




model

A hierarchical dependent Dirichlet process prior for modelling bird migration patterns in the UK

Alex Diana, Eleni Matechou, Jim Griffin, Alison Johnston.

Source: The Annals of Applied Statistics, Volume 14, Number 1, 473--493.

Abstract:
Environmental changes in recent years have been linked to phenological shifts which in turn are linked to the survival of species. The work in this paper is motivated by capture-recapture data on blackcaps collected by the British Trust for Ornithology as part of the Constant Effort Sites monitoring scheme. Blackcaps overwinter abroad and migrate to the UK annually for breeding purposes. We propose a novel Bayesian nonparametric approach for expressing the bivariate density of individual arrival and departure times at different sites across a number of years as a mixture model. The new model combines the ideas of the hierarchical and the dependent Dirichlet process, allowing the estimation of site-specific weights and year-specific mixture locations, which are modelled as functions of environmental covariates using a multivariate extension of the Gaussian process. The proposed modelling framework is extremely general and can be used in any context where multivariate density estimation is performed jointly across different groups and in the presence of a continuous covariate.




model

Estimating causal effects in studies of human brain function: New models, methods and estimands

Michael E. Sobel, Martin A. Lindquist.

Source: The Annals of Applied Statistics, Volume 14, Number 1, 452--472.

Abstract:
Neuroscientists often use functional magnetic resonance imaging (fMRI) to infer effects of treatments on neural activity in brain regions. In a typical fMRI experiment, each subject is observed at several hundred time points. At each point, the blood oxygenation level dependent (BOLD) response is measured at 100,000 or more locations (voxels). Typically, these responses are modeled treating each voxel separately, and no rationale for interpreting associations as effects is given. Building on Sobel and Lindquist ( J. Amer. Statist. Assoc. 109 (2014) 967–976), who used potential outcomes to define unit and average effects at each voxel and time point, we define and estimate both “point” and “cumulated” effects for brain regions. Second, we construct a multisubject, multivoxel, multirun whole brain causal model with explicit parameters for regions. We justify estimation using BOLD responses averaged over voxels within regions, making feasible estimation for all regions simultaneously, thereby also facilitating inferences about association between effects in different regions. We apply the model to a study of pain, finding effects in standard pain regions. We also observe more cerebellar activity than observed in previous studies using prevailing methods.




model

Regression for copula-linked compound distributions with applications in modeling aggregate insurance claims

Peng Shi, Zifeng Zhao.

Source: The Annals of Applied Statistics, Volume 14, Number 1, 357--380.

Abstract:
In actuarial research a task of particular interest and importance is to predict the loss cost for individual risks so that informative decisions are made in various insurance operations such as underwriting, ratemaking and capital management. The loss cost is typically viewed to follow a compound distribution where the summation of the severity variables is stopped by the frequency variable. A challenging issue in modeling such outcomes is to accommodate the potential dependence between the number of claims and the size of each individual claim. In this article we introduce a novel regression framework for compound distributions that uses a copula to accommodate the association between the frequency and the severity variables and, thus, allows for arbitrary dependence between the two components. We further show that the new model is very flexible and is easily modified to account for incomplete data due to censoring or truncation. The flexibility of the proposed model is illustrated using both simulated and real data sets. In the analysis of granular claims data from property insurance, we find a substantive negative relationship between the number and the size of insurance claims. In addition, we demonstrate that ignoring the frequency-severity association could lead to biased decision-making in insurance operations.
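A hedged simulation sketch of the key idea: linking the claim count and the claim size through a copula so that frequency and severity are dependent. The marginals (Poisson and gamma), the Gaussian copula, the negative correlation and all parameter values are illustrative choices, not the specification or estimates from the paper.

```python
# Hedged sketch: dependent frequency and severity via a Gaussian copula, and the
# resulting aggregate loss per policy.
import numpy as np
from scipy.stats import norm, poisson, gamma

rng = np.random.default_rng(5)
n_policies, rho = 10000, -0.4            # illustrative negative frequency-severity association

# Gaussian copula: correlated normals mapped to uniforms
z = rng.multivariate_normal([0.0, 0.0], [[1.0, rho], [rho, 1.0]], size=n_policies)
u_freq, u_sev = norm.cdf(z[:, 0]), norm.cdf(z[:, 1])

N = poisson.ppf(u_freq, mu=2.0).astype(int)          # claim counts
avg_size = gamma.ppf(u_sev, a=2.0, scale=500.0)      # average claim size per policy
loss = N * avg_size                                  # aggregate loss per policy

print(np.corrcoef(N, avg_size)[0, 1])                # induced (negative) dependence
print(loss.mean())
```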




model

Modeling wildfire ignition origins in southern California using linear network point processes

Medha Uppala, Mark S. Handcock.

Source: The Annals of Applied Statistics, Volume 14, Number 1, 339--356.

Abstract:
This paper focuses on spatial and temporal modeling of point processes on linear networks. Point processes on linear networks can simply be defined as point events occurring on or near line segment network structures embedded in a certain space. A separable modeling framework is introduced that posits separate formation and dissolution models of point processes on linear networks over time. While the model was inspired by spider web building activity in brick mortar lines, the focus is on modeling wildfire ignition origins near road networks over a span of 14 years. As most wildfires in California have human-related origins, modeling the origin locations with respect to the road network provides insight into how human, vehicular and structural densities affect ignition occurrence. Model results show that roads that traverse different types of regions such as residential, interface and wildland regions have higher ignition intensities compared to roads that only exist in each of the mentioned region types.




model

Optimal asset allocation with multivariate Bayesian dynamic linear models

Jared D. Fisher, Davide Pettenuzzo, Carlos M. Carvalho.

Source: The Annals of Applied Statistics, Volume 14, Number 1, 299--338.

Abstract:
We introduce a fast, closed-form, simulation-free method to model and forecast multiple asset returns and employ it to investigate the optimal ensemble of features to include when jointly predicting monthly stock and bond excess returns. Our approach builds on the Bayesian dynamic linear models of West and Harrison ( Bayesian Forecasting and Dynamic Models (1997) Springer), and it can objectively determine, through a fully automated procedure, both the optimal set of regressors to include in the predictive system and the degree to which the model coefficients, volatilities and covariances should vary over time. When applied to a portfolio of five stock and bond returns, we find that our method leads to large forecast gains, both in statistical and economic terms. In particular, we find that relative to a standard no-predictability benchmark, the optimal combination of predictors, stochastic volatility and time-varying covariances increases the annualized certainty equivalent returns of a leverage-constrained power utility investor by more than 500 basis points.




model

Feature selection for generalized varying coefficient mixed-effect models with application to obesity GWAS

Wanghuan Chu, Runze Li, Jingyuan Liu, Matthew Reimherr.

Source: The Annals of Applied Statistics, Volume 14, Number 1, 276--298.

Abstract:
Motivated by an empirical analysis of data from a genome-wide association study on obesity, measured by the body mass index (BMI), we propose a two-step gene-detection procedure for generalized varying coefficient mixed-effects models with ultrahigh dimensional covariates. The proposed procedure selects significant single nucleotide polymorphisms (SNPs) impacting the mean BMI trend, some of which have already been biologically proven to be “fat genes.” The method also discovers SNPs that significantly influence the age-dependent variability of BMI. The proposed procedure takes into account individual variations of genetic effects and can also be directly applied to longitudinal data with continuous, binary or count responses. We employ Monte Carlo simulation studies to assess the performance of the proposed method and further carry out causal inference for the selected SNPs.




model

Bayesian factor models for probabilistic cause of death assessment with verbal autopsies

Tsuyoshi Kunihama, Zehang Richard Li, Samuel J. Clark, Tyler H. McCormick.

Source: The Annals of Applied Statistics, Volume 14, Number 1, 241--256.

Abstract:
The distribution of deaths by cause provides crucial information for public health planning, response and evaluation. About 60% of deaths globally are not registered or given a cause, limiting our ability to understand disease epidemiology. Verbal autopsy (VA) surveys are increasingly used in such settings to collect information on the signs, symptoms and medical history of people who have recently died. This article develops a novel Bayesian method for estimation of population distributions of deaths by cause using verbal autopsy data. The proposed approach is based on a multivariate probit model where associations among items in questionnaires are flexibly induced by latent factors. Using the Population Health Metrics Research Consortium labeled data that include both VA and medically certified causes of death, we assess performance of the proposed method. Further, we estimate important questionnaire items that are highly associated with causes of death. This framework provides insights that will simplify future data collection.




model

A hierarchical Bayesian model for predicting ecological interactions using scaled evolutionary relationships

Mohamad Elmasri, Maxwell J. Farrell, T. Jonathan Davies, David A. Stephens.

Source: The Annals of Applied Statistics, Volume 14, Number 1, 221--240.

Abstract:
Identifying undocumented or potential future interactions among species is a challenge facing modern ecologists. Recent link prediction methods rely on trait data; however, large species interaction databases are typically sparse and covariates are limited to only a fraction of species. On the other hand, evolutionary relationships, encoded as phylogenetic trees, can act as proxies for underlying traits and historical patterns of parasite sharing among hosts. We show that, using a network-based conditional model, phylogenetic information provides strong predictive power in a recently published global database of host-parasite interactions. By scaling the phylogeny using an evolutionary model, our method allows for biological interpretation often missing from latent variable models. To further improve on the phylogeny-only model, we combine a hierarchical Bayesian latent score framework for bipartite graphs that accounts for the number of interactions per species with host dependence informed by phylogeny. Combining the two information sources yields significant improvement in predictive accuracy over each of the submodels alone. As many interaction networks are constructed from presence-only data, we extend the model by integrating a correction mechanism for missing interactions which proves valuable in reducing uncertainty in unobserved interactions.




model

Modeling microbial abundances and dysbiosis with beta-binomial regression

Bryan D. Martin, Daniela Witten, Amy D. Willis.

Source: The Annals of Applied Statistics, Volume 14, Number 1, 94--115.

Abstract:
Using a sample from a population to estimate the proportion of the population with a certain category label is a broadly important problem. In the context of microbiome studies, this problem arises when researchers wish to use a sample from a population of microbes to estimate the population proportion of a particular taxon, known as the taxon’s relative abundance. In this paper, we propose a beta-binomial model for this task. Like existing models, our model allows for a taxon’s relative abundance to be associated with covariates of interest. However, unlike existing models, our proposal also allows for the overdispersion in the taxon’s counts to be associated with covariates of interest. We exploit this model in order to propose tests not only for differential relative abundance, but also for differential variability. The latter is particularly valuable in light of speculation that dysbiosis, the perturbation from a normal microbiome that can occur in certain disease conditions, may manifest as a loss of stability, or increase in variability, of the counts associated with each taxon. We demonstrate the performance of our proposed model using a simulation study and an application to soil microbial data.
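A hedged sketch of a beta-binomial log-likelihood in which both the mean (relative abundance) and the overdispersion are tied to covariates through logit links, in the spirit of the model described above. The exact parameterization, the toy data and the optimizer are illustrative, not the paper's implementation.

```python
# Hedged sketch: beta-binomial regression for one taxon with covariate-dependent
# mean and overdispersion.
import numpy as np
from scipy.special import betaln, gammaln
from scipy.optimize import minimize

def log_binom(M, w):
    return gammaln(M + 1) - gammaln(w + 1) - gammaln(M - w + 1)

def negloglik(params, w, M, X):
    k = X.shape[1]
    beta_mu, beta_phi = params[:k], params[k:]
    mu = 1.0 / (1.0 + np.exp(-(X @ beta_mu)))       # relative abundance in (0, 1)
    phi = 1.0 / (1.0 + np.exp(-(X @ beta_phi)))     # overdispersion in (0, 1)
    a = mu * (1.0 - phi) / phi
    b = (1.0 - mu) * (1.0 - phi) / phi
    ll = log_binom(M, w) + betaln(w + a, M - w + b) - betaln(a, b)
    return -np.sum(ll)

rng = np.random.default_rng(6)
n = 200
X = np.column_stack([np.ones(n), rng.integers(0, 2, n)])   # intercept + group indicator
M = rng.integers(1000, 5000, n)                            # sequencing depths
w = rng.binomial(M, 0.02 + 0.01 * X[:, 1])                 # toy counts for one taxon

fit = minimize(negloglik, x0=np.zeros(4), args=(w, M, X), method="Nelder-Mead")
print(fit.x)   # (mean-model coefficients, dispersion-model coefficients)
```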




model

SHOPPER: A probabilistic model of consumer choice with substitutes and complements

Francisco J. R. Ruiz, Susan Athey, David M. Blei.

Source: The Annals of Applied Statistics, Volume 14, Number 1, 1--27.

Abstract:
We develop SHOPPER, a sequential probabilistic model of shopping data. SHOPPER uses interpretable components to model the forces that drive how a customer chooses products; in particular, we designed SHOPPER to capture how items interact with other items. We develop an efficient posterior inference algorithm to estimate these forces from large-scale data, and we analyze a large dataset from a major chain grocery store. We are interested in answering counterfactual queries about changes in prices. We found that SHOPPER provides accurate predictions even under price interventions, and that it helps identify complementary and substitutable pairs of products.




model

Hierarchical infinite factor models for improving the prediction of surgical complications for geriatric patients

Elizabeth Lorenzi, Ricardo Henao, Katherine Heller.

Source: The Annals of Applied Statistics, Volume 13, Number 4, 2637--2661.

Abstract:
Nearly a third of all surgeries performed in the United States occur for patients over the age of 65; these older adults experience a higher rate of postoperative morbidity and mortality. To improve the care for these patients, we aim to identify and characterize high risk geriatric patients to send to a specialized perioperative clinic while leveraging the overall surgical population to improve learning. To this end, we develop a hierarchical infinite latent factor model (HIFM) to appropriately account for the covariance structure across subpopulations in data. We propose a novel Hierarchical Dirichlet Process shrinkage prior on the loadings matrix that flexibly captures the underlying structure of our data while sharing information across subpopulations to improve inference and prediction. The stick-breaking construction of the prior assumes an infinite number of factors and allows for each subpopulation to utilize different subsets of the factor space and select the number of factors needed to best explain the variation. We develop the model into a latent factor regression method that excels at prediction and inference of regression coefficients. Simulations validate this strong performance compared to baseline methods. We apply this work to the problem of predicting surgical complications using electronic health record data for geriatric patients and all surgical patients at Duke University Health System (DUHS). The motivating application demonstrates the improved predictive performance when using HIFM in both area under the ROC curve and area under the PR Curve while providing interpretable coefficients that may lead to actionable interventions.




model

Objective Bayes model selection of Gaussian interventional essential graphs for the identification of signaling pathways

Federico Castelletti, Guido Consonni.

Source: The Annals of Applied Statistics, Volume 13, Number 4, 2289--2311.

Abstract:
A signalling pathway is a sequence of chemical reactions initiated by a stimulus which in turn affects a receptor, and then through some intermediate steps cascades down to the final cell response. Based on the technique of flow cytometry, samples of cell-by-cell measurements are collected under each experimental condition, resulting in a collection of interventional data (assuming no latent variables are involved). Usually several external interventions are applied at different points of the pathway, the ultimate aim being the structural recovery of the underlying signalling network which we model as a causal Directed Acyclic Graph (DAG) using intervention calculus. The advantage of using interventional data, rather than purely observational data, is that identifiability of the true data generating DAG is enhanced. More technically, a Markov equivalence class of DAGs, whose members are statistically indistinguishable based on observational data alone, can be further decomposed, using additional interventional data, into smaller distinct Interventional Markov equivalence classes. We present a Bayesian methodology for structural learning of Interventional Markov equivalence classes based on observational and interventional samples of multivariate Gaussian observations. Our approach is objective, meaning that it is based on default parameter priors requiring no personal elicitation; some flexibility is however allowed through a tuning parameter which regulates sparsity in the prior on model space. Based on an analytical expression for the marginal likelihood of a given Interventional Essential Graph, and a suitable MCMC scheme, our analysis produces an approximate posterior distribution on the space of Interventional Markov equivalence classes, which can be used to provide uncertainty quantification for features of substantive scientific interest, such as the posterior probability of inclusion of selected edges, or paths.




model

Fitting a deeply nested hierarchical model to a large book review dataset using a moment-based estimator

Ningshan Zhang, Kyle Schmaus, Patrick O. Perry.

Source: The Annals of Applied Statistics, Volume 13, Number 4, 2260--2288.

Abstract:
We consider a particular instance of a common problem in recommender systems, using a database of book reviews to inform user-targeted recommendations. In our dataset, books are categorized into genres and subgenres. To exploit this nested taxonomy, we use a hierarchical model that enables information pooling across similar items at many levels within the genre hierarchy. The main challenge in deploying this model is computational. The data sizes are large and fitting the model at scale using off-the-shelf maximum likelihood procedures is prohibitive. To get around this computational bottleneck, we extend a moment-based fitting procedure proposed for fitting single-level hierarchical models to the general case of arbitrarily deep hierarchies. This extension is an order of magnitude faster than standard maximum likelihood procedures. The fitting method can be deployed beyond recommender systems to general contexts with deeply nested hierarchical generalized linear mixed models.




model

Spatial modeling of trends in crime over time in Philadelphia

Cecilia Balocchi, Shane T. Jensen.

Source: The Annals of Applied Statistics, Volume 13, Number 4, 2235--2259.

Abstract:
Understanding the relationship between change in crime over time and the geography of urban areas is an important problem for urban planning. Accurate estimation of changing crime rates throughout a city would aid law enforcement as well as enable studies of the association between crime and the built environment. Bayesian modeling is a promising direction since areal data require principled sharing of information to address spatial autocorrelation between proximal neighborhoods. We develop several Bayesian approaches to spatial sharing of information between neighborhoods while modeling trends in crime counts over time. We apply our methodology to estimate changes in crime throughout Philadelphia over the 2006-15 period while also incorporating spatially-varying economic and demographic predictors. We find that the local shrinkage imposed by a conditional autoregressive model has substantial benefits in terms of out-of-sample predictive accuracy of crime. We also explore the possibility of spatial discontinuities between neighborhoods that could represent natural barriers or aspects of the built environment.




model

Microsimulation model calibration using incremental mixture approximate Bayesian computation

Carolyn M. Rutter, Jonathan Ozik, Maria DeYoreo, Nicholson Collier.

Source: The Annals of Applied Statistics, Volume 13, Number 4, 2189--2212.

Abstract:
Microsimulation models (MSMs) are used to inform policy by predicting population-level outcomes under different scenarios. MSMs simulate individual-level event histories that mark the disease process (such as the development of cancer) and the effect of policy actions (such as screening) on these events. MSMs often have many unknown parameters; calibration is the process of searching the parameter space to select parameters that result in accurate MSM prediction of a wide range of targets. We develop Incremental Mixture Approximate Bayesian Computation (IMABC) for MSM calibration, which results in a simulated sample from the posterior distribution of model parameters given calibration targets. IMABC begins with a rejection-based ABC step, drawing a sample of points from the prior distribution of model parameters and accepting points that result in simulated targets that are near observed targets. Next, the sample is iteratively updated by drawing additional points from a mixture of multivariate normal distributions and accepting points that result in accurate predictions. Posterior estimates are obtained by weighting the final set of accepted points to account for the adaptive sampling scheme. We demonstrate IMABC by calibrating CRC-SPIN 2.0, an updated version of an MSM for colorectal cancer (CRC) that has been used to inform national CRC screening guidelines.
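A hedged toy sketch of the two stages described above: a rejection-ABC step that keeps prior draws whose simulated target is close to the observed target, followed by repeated proposals from normals centered at current accepted points under a shrinking tolerance. The simulator, prior, tolerances and proposal scale are made up, and the final reweighting step that corrects for the adaptive sampling is omitted here.

```python
# Hedged sketch: an IMABC-style calibration loop on a one-parameter toy simulator.
import numpy as np

rng = np.random.default_rng(7)
target_obs = 0.7                                   # observed calibration target

def simulate_target(theta):
    # stand-in for running a microsimulation model at parameter theta
    return 1.0 / (1.0 + np.exp(-theta)) + 0.02 * rng.standard_normal()

# Step 1: rejection ABC from the prior
prior_draws = rng.normal(0.0, 2.0, size=2000)
tol = 0.10
accepted = np.array([t for t in prior_draws if abs(simulate_target(t) - target_obs) < tol])

# Step 2: iterative updates from normals centered at accepted points, tightening tolerance
for tol in [0.05, 0.02]:
    centers = rng.choice(accepted, size=2000)
    proposals = centers + 0.25 * rng.standard_normal(2000)
    keep = np.array([t for t in proposals if abs(simulate_target(t) - target_obs) < tol])
    accepted = np.concatenate([accepted, keep])

print(accepted.mean(), accepted.std(), len(accepted))   # crude (unweighted) posterior summary
```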




model

Prediction of small area quantiles for the conservation effects assessment project using a mixed effects quantile regression model

Emily Berg, Danhyang Lee.

Source: The Annals of Applied Statistics, Volume 13, Number 4, 2158--2188.

Abstract:
Quantiles of the distributions of several measures of erosion are important parameters in the Conservation Effects Assessment Project, a survey intended to quantify soil and nutrient loss on crop fields. Because sample sizes for domains of interest are too small to support reliable direct estimators, model based methods are needed. Quantile regression is appealing for CEAP because finding a single family of parametric models that adequately describes the distributions of all variables is difficult and small area quantiles are parameters of interest. We construct empirical Bayes predictors and bootstrap mean squared error estimators based on the linearly interpolated generalized Pareto distribution (LIGPD). We apply the procedures to predict county-level quantiles for four types of erosion in Wisconsin and validate the procedures through simulation.




model

Joint model of accelerated failure time and mechanistic nonlinear model for censored covariates, with application in HIV/AIDS

Hongbin Zhang, Lang Wu.

Source: The Annals of Applied Statistics, Volume 13, Number 4, 2140--2157.

Abstract:
For a time-to-event outcome with censored time-varying covariates, a joint Cox model with a linear mixed effects model is the standard modeling approach. In some applications such as AIDS studies, mechanistic nonlinear models are available for some covariate process such as viral load during anti-HIV treatments, derived from the underlying data-generation mechanisms and disease progression. Such a mechanistic nonlinear covariate model may provide better-predicted values when the covariates are left censored or mismeasured. When the focus is on the impact of the time-varying covariate process on the survival outcome, an accelerated failure time (AFT) model provides an excellent alternative to the Cox proportional hazards model since an AFT model is formulated to allow the influence of the outcome by the entire covariate process. In this article, we consider a nonlinear mixed effects model for the censored covariates in an AFT model, implemented using a Monte Carlo EM algorithm, under the framework of a joint model for simultaneous inference. We apply the joint model to an HIV/AIDS dataset to gain insights for assessing the association between viral load and immunological restoration during antiretroviral therapy. Simulation is conducted to compare model performance when the covariate model and the survival model are misspecified.




model

Estimating abundance from multiple sampling capture-recapture data via a multi-state multi-period stopover model

Hannah Worthington, Rachel McCrea, Ruth King, Richard Griffiths.

Source: The Annals of Applied Statistics, Volume 13, Number 4, 2043--2064.

Abstract:
Capture-recapture studies often involve collecting data on numerous capture occasions over a relatively short period of time. For many study species this process is repeated, for example, annually, resulting in capture information spanning multiple sampling periods. To account for the different temporal scales, the robust design class of models has traditionally been applied, providing a framework in which to analyse all of the available capture data in a single likelihood expression. However, these models typically require strong constraints, either the assumption of closure within a sampling period (the closed robust design) or conditioning on the number of individuals captured within a sampling period (the open robust design). For real datasets these assumptions may not be appropriate. We develop a general modelling structure that requires neither assumption by explicitly modelling the movement of individuals into the population both within and between the sampling periods, which in turn permits the estimation of abundance within a single consistent framework. The flexibility of the novel model structure is further demonstrated by including the computationally challenging case of multi-state data where there is individual time-varying discrete covariate information. We derive an efficient likelihood expression for the new multi-state multi-period stopover model using the hidden Markov model framework. We demonstrate the significant improvement in parameter estimation using our new modelling approach in terms of both the multi-period and multi-state components through both a simulation study and a real dataset relating to the protected species of great crested newts, Triturus cristatus.




model

A semiparametric modeling approach using Bayesian Additive Regression Trees with an application to evaluate heterogeneous treatment effects

Bret Zeldow, Vincent Lo Re III, Jason Roy.

Source: The Annals of Applied Statistics, Volume 13, Number 3, 1989--2010.

Abstract:
Bayesian Additive Regression Trees (BART) is a flexible machine learning algorithm capable of capturing nonlinearities between an outcome and covariates and interactions among covariates. We extend BART to a semiparametric regression framework in which the conditional expectation of an outcome is a function of treatment, its effect modifiers, and confounders. The confounders are allowed to have unspecified functional form, while treatment and effect modifiers that are directly related to the research question are given a linear form. The result is a Bayesian semiparametric linear regression model where the posterior distribution of the parameters of the linear part can be interpreted as in parametric Bayesian regression. This is useful in situations where a subset of the variables are of substantive interest and the others are nuisance variables that we would like to control for. An example of this occurs in causal modeling with the structural mean model (SMM). Under certain causal assumptions, our method can be used as a Bayesian SMM. Our methods are demonstrated with simulation studies and an application to a dataset involving adults with HIV/Hepatitis C coinfection who newly initiate antiretroviral therapy. The methods are available in an R package called semibart.




model

Bayesian modeling of the structural connectome for studying Alzheimer’s disease

Arkaprava Roy, Subhashis Ghosal, Jeffrey Prescott, Kingshuk Roy Choudhury.

Source: The Annals of Applied Statistics, Volume 13, Number 3, 1791--1816.

Abstract:
We study possible relations between Alzheimer’s disease progression and the structure of the connectome, which is the white matter connecting different regions of the brain. Regression models in covariates including age, gender and disease status for the extent of white matter connecting each pair of regions of the brain are proposed. Subject inhomogeneity is also incorporated in the model through random effects with an unknown distribution. As there is a large number of pairs of regions, we also adopt a dimension reduction technique through graphon (J. Combin. Theory Ser. B 96 (2006) 933–957) functions which reduces the functions of pairs of regions to functions of regions. The connecting graphon functions are considered unknown but the assumed smoothness allows putting priors of low complexity on these functions. We pursue a nonparametric Bayesian approach by assigning a Dirichlet process scale mixture of zero-mean normal prior on the distributions of the random effects and finite random series of tensor products of B-splines priors on the underlying graphon functions. We develop efficient Markov chain Monte Carlo techniques for drawing samples for the posterior distributions using Hamiltonian Monte Carlo (HMC). The proposed Bayesian method overwhelmingly outperforms a competing method based on ANCOVA models in the simulation setup. The proposed Bayesian approach is applied to a dataset of 100 subjects and 83 brain regions and key regions implicated in the changing connectome are identified.




model

Incorporating conditional dependence in latent class models for probabilistic record linkage: Does it matter?

Huiping Xu, Xiaochun Li, Changyu Shen, Siu L. Hui, Shaun Grannis.

Source: The Annals of Applied Statistics, Volume 13, Number 3, 1753--1790.

Abstract:
The conditional independence assumption of the Fellegi and Sunter (FS) model in probabilistic record linkage is often violated when matching real-world data. Ignoring conditional dependence has been shown to seriously bias parameter estimates. However, in record linkage, the ultimate goal is to inform the match status of record pairs and therefore, record linkage algorithms should be evaluated in terms of matching accuracy. In the literature, more flexible models have been proposed to relax the conditional independence assumption, but few studies have assessed whether such accommodations improve matching accuracy. In this paper, we show that incorporating the conditional dependence appropriately yields matching accuracy comparable to or better than that of the FS model, using three real-world data linkage examples. Through a simulation study, we further investigate when conditional dependence models provide improved matching accuracy. Our study shows that the FS model is generally robust to the conditional independence assumption and provides matching accuracy comparable to that of the more complex conditional dependence models. However, when the match prevalence approaches 0% or 100% and conditional dependence exists in the dominating class, it is necessary to address conditional dependence as the FS model produces suboptimal matching accuracy. The need to address conditional dependence becomes less important when highly discriminating fields are used. Our simulation study also shows that conditional dependence models with misspecified dependence structure could produce less accurate record matching than the FS model and therefore we caution against the blind use of conditional dependence models.
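A hedged sketch of the classical Fellegi-Sunter scoring rule under the conditional independence assumption discussed above: a record pair's match weight is the sum of per-field log-likelihood ratios, $\log(m_k/u_k)$ for agreements and $\log((1-m_k)/(1-u_k))$ for disagreements. The m/u probabilities and the toy comparison vectors are illustrative only; the paper's question is precisely when relaxing this independence assumption matters.

```python
# Hedged sketch: Fellegi-Sunter match weights with conditional independence across fields.
import numpy as np

# per-field P(agree | match) and P(agree | non-match), e.g. for
# (last name, first name, birth date, zip code) -- made-up values
m = np.array([0.95, 0.90, 0.85, 0.80])
u = np.array([0.10, 0.15, 0.01, 0.20])

def fs_score(gamma):
    """gamma: 0/1 vector of field agreements for one record pair."""
    gamma = np.asarray(gamma)
    return np.sum(gamma * np.log(m / u) + (1 - gamma) * np.log((1 - m) / (1 - u)))

print(fs_score([1, 1, 1, 1]))   # strong evidence for a match
print(fs_score([1, 0, 0, 1]))   # weaker / negative evidence
```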




model

A hierarchical Bayesian model for single-cell clustering using RNA-sequencing data

Yiyi Liu, Joshua L. Warren, Hongyu Zhao.

Source: The Annals of Applied Statistics, Volume 13, Number 3, 1733--1752.

Abstract:
Understanding the heterogeneity of cells is an important biological question. The development of single-cell RNA-sequencing (scRNA-seq) technology provides high resolution data for such inquiry. A key challenge in scRNA-seq analysis is the high variability of measured RNA expression levels and frequent dropouts (missing values) due to limited input RNA compared to bulk RNA-seq measurement. Existing clustering methods do not perform well for these noisy and zero-inflated scRNA-seq data. In this manuscript we propose a Bayesian hierarchical model, called BasClu, to appropriately characterize important features of scRNA-seq data in order to more accurately cluster cells. We demonstrate the effectiveness of our method with extensive simulation studies and applications to three real scRNA-seq datasets.




model

A Bayesian mark interaction model for analysis of tumor pathology images

Qiwei Li, Xinlei Wang, Faming Liang, Guanghua Xiao.

Source: The Annals of Applied Statistics, Volume 13, Number 3, 1708--1732.

Abstract:
With the advance of imaging technology, digital pathology imaging of tumor tissue slides is becoming a routine clinical procedure for cancer diagnosis. This process produces massive imaging data that capture histological details in high resolution. Recent developments in deep-learning methods have enabled us to identify and classify individual cells from digital pathology images at large scale. Reliable statistical approaches to model the spatial pattern of cells can provide new insight into tumor progression and shed light on the biological mechanisms of cancer. We consider the problem of modeling spatial correlations among three commonly seen cells observed in tumor pathology images. A novel geostatistical marking model with interpretable underlying parameters is proposed in a Bayesian framework. We use auxiliary variable MCMC algorithms to sample from the posterior distribution with an intractable normalizing constant. We demonstrate how this model-based analysis can lead to sharper inferences than ordinary exploratory analyses, by means of application to three benchmark datasets and a case study on the pathology images of $188$ lung cancer patients. The case study shows that the spatial correlation between tumor and stromal cells predicts patient prognosis. This statistical methodology not only presents a new model for characterizing spatial correlations in a multitype spatial point pattern conditioning on the locations of the points, but also provides a new perspective for understanding the role of cell–cell interactions in cancer progression.




model

Sequential decision model for inference and prediction on nonuniform hypergraphs with application to knot matching from computational forestry

Seong-Hwan Jun, Samuel W. K. Wong, James V. Zidek, Alexandre Bouchard-Côté.

Source: The Annals of Applied Statistics, Volume 13, Number 3, 1678--1707.

Abstract:
In this paper, we consider the knot-matching problem arising in computational forestry. The knot-matching problem is an important problem that needs to be solved to advance the state of the art in automatic strength prediction of lumber. We show that this problem can be formulated as a quadripartite matching problem and develop a sequential decision model that admits efficient parameter estimation, along with a sequential Monte Carlo sampler that can be utilized for rapid sampling of graph matchings. We demonstrate the effectiveness of our methods on 30 manually annotated boards and present findings from various simulation studies to provide further evidence supporting the efficacy of our methods.




model

RCRnorm: An integrated system of random-coefficient hierarchical regression models for normalizing NanoString nCounter data

Gaoxiang Jia, Xinlei Wang, Qiwei Li, Wei Lu, Ximing Tang, Ignacio Wistuba, Yang Xie.

Source: The Annals of Applied Statistics, Volume 13, Number 3, 1617--1647.

Abstract:
Formalin-fixed paraffin-embedded (FFPE) samples have great potential for biomarker discovery, retrospective studies and diagnosis or prognosis of diseases. Their application, however, is hindered by the unsatisfactory performance of traditional gene expression profiling techniques on damaged RNAs. The NanoString nCounter platform is well suited for profiling FFPE samples and measures gene expression with high sensitivity, which may greatly facilitate realization of the scientific and clinical value of FFPE samples. However, methodological development for normalization, a critical step when analyzing this type of data, lags far behind. Existing methods designed for the platform use information from the different types of internal controls separately and rely on the overly simplified assumption that expression of housekeeping genes is constant across samples for global scaling. Thus, these methods are not optimized for the nCounter system, not to mention that they were not developed for FFPE samples. We construct an integrated system of random-coefficient hierarchical regression models to capture the main patterns and characteristics observed in NanoString data from FFPE samples and develop a Bayesian approach to estimate parameters and normalize gene expression across samples. Our method, labeled RCRnorm, incorporates information from all aspects of the experimental design and simultaneously removes biases from various sources. It eliminates the unrealistic assumption on housekeeping genes and offers great interpretability. Furthermore, it is applicable to freshly frozen or similar samples, which can generally be viewed as a reduced case of FFPE samples. Simulations and applications show the superior performance of RCRnorm.
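
A deliberately stripped-down version of the calibration idea, for illustration only: regress each sample's log counts of positive controls on their known log concentrations and use the sample-specific line to map gene counts onto the concentration scale. RCRnorm instead pools such coefficients hierarchically and additionally models negative controls, housekeeping genes and other sources of bias; the function and argument names below are hypothetical.

    import numpy as np

    def simple_nanostring_normalize(pos_counts, pos_conc, gene_counts):
        # pos_counts:  (samples, n_pos) raw counts of positive controls
        # pos_conc:    (n_pos,) known spike-in concentrations of those controls
        # gene_counts: (samples, n_genes) raw counts of the genes of interest
        log_pos = np.log10(pos_counts + 1.0)
        design = np.column_stack([np.ones(len(pos_conc)), np.log10(pos_conc)])
        normalized = np.empty(gene_counts.shape, dtype=float)
        for s in range(pos_counts.shape[0]):
            a, b = np.linalg.lstsq(design, log_pos[s], rcond=None)[0]   # sample-specific intercept, slope
            normalized[s] = (np.log10(gene_counts[s] + 1.0) - a) / b    # back to the concentration scale
        return normalized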




model

Modeling seasonality and serial dependence of electricity price curves with warping functional autoregressive dynamics

Ying Chen, J. S. Marron, Jiejie Zhang.

Source: The Annals of Applied Statistics, Volume 13, Number 3, 1590--1616.

Abstract:
Electricity prices are high dimensional, serially dependent and exhibit seasonal variations. We propose a Warping Functional AutoRegressive (WFAR) model that simultaneously accounts for the cross-time dependence and seasonal variations of these high-dimensional data. In particular, electricity price curves are obtained by smoothing over the $24$ discrete hourly prices on each day. In the functional domain, seasonal phase variations are separated from level amplitude changes in a warping process with the Fisher–Rao distance metric, and the aligned (season-adjusted) electricity price curves are modeled in the functional autoregression framework. In a real application, the WFAR model provides superior out-of-sample forecast accuracy in both a normally functioning market, Nord Pool, and an extreme situation, the California market. The forecast performance, as well as the relative accuracy improvement, is stable across different markets and time periods.
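
To make the autoregressive step concrete, the following sketch projects each day's 24 hourly prices onto a small Fourier basis and fits a first-order vector autoregression on the basis coefficients. It deliberately skips the warping / phase-alignment step that distinguishes WFAR, and the simulated prices and basis size are arbitrary.

    import numpy as np

    rng = np.random.default_rng(0)
    days, hours = 200, 24
    t = np.arange(hours) / hours
    prices = 50 + 10 * np.sin(2 * np.pi * t) + rng.normal(scale=2.0, size=(days, hours))

    # Small Fourier basis for the daily curves
    basis = np.column_stack([np.ones(hours),
                             np.sin(2 * np.pi * t), np.cos(2 * np.pi * t),
                             np.sin(4 * np.pi * t), np.cos(4 * np.pi * t)])
    coef = prices @ np.linalg.pinv(basis).T          # days x 5 basis coefficients per curve

    # First-order autoregression on the coefficients: c_t ≈ B c_{t-1}
    B = np.linalg.lstsq(coef[:-1], coef[1:], rcond=None)[0].T
    forecast_curve = basis @ (B @ coef[-1])          # one-day-ahead forecast of the price curve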




model

Network modelling of topological domains using Hi-C data

Y. X. Rachel Wang, Purnamrita Sarkar, Oana Ursu, Anshul Kundaje, Peter J. Bickel.

Source: The Annals of Applied Statistics, Volume 13, Number 3, 1511--1536.

Abstract:
Chromosome conformation capture experiments such as Hi-C are used to map the three-dimensional spatial organization of genomes. One specific feature of the 3D organization is known as topologically associating domains (TADs), which are densely interacting, contiguous chromatin regions playing important roles in regulating gene expression. A few algorithms have been proposed to detect TADs. In particular, the structure of Hi-C data naturally inspires application of community detection methods. However, one of the drawbacks of community detection is that most methods take exchangeability of the nodes in the network for granted; whereas the nodes in this case, that is, the positions on the chromosomes, are not exchangeable. We propose a network model for detecting TADs using Hi-C data that takes into account this nonexchangeability. In addition, our model explicitly makes use of cell-type specific CTCF binding sites as biological covariates and can be used to identify conserved TADs across multiple cell types. The model leads to a likelihood objective that can be efficiently optimized via relaxation. We also prove that when suitably initialized, this model finds the underlying TAD structure with high probability. Using simulated data, we show the advantages of our method and the caveats of popular community detection methods, such as spectral clustering, in this application. Applying our method to real Hi-C data, we demonstrate the domains identified have desirable epigenetic features and compare them across different cell types.




model

A hidden Markov model approach to characterizing the photo-switching behavior of fluorophores

Lekha Patel, Nils Gustafsson, Yu Lin, Raimund Ober, Ricardo Henriques, Edward Cohen.

Source: The Annals of Applied Statistics, Volume 13, Number 3, 1397--1429.

Abstract:
Fluorescing molecules (fluorophores) that stochastically switch between photon-emitting and dark states underpin some of the most celebrated advancements in super-resolution microscopy. While this stochastic behavior has been heavily exploited, full characterization of the underlying models can potentially drive forward further imaging methodologies. Under the assumption that fluorophores move between fluorescing and dark states as continuous time Markov processes, the goal is to use a sequence of images to select a model and estimate the transition rates. We use a hidden Markov model to relate the observed discrete time signal to the hidden continuous time process. With imaging involving several repeat exposures of the fluorophore, we show the observed signal depends on both the current and past states of the hidden process, producing emission probabilities that depend on the transition rate parameters to be estimated. To tackle this unusual coupling of the transition and emission probabilities, we conceive transmission (transition-emission) matrices that capture all dependencies of the model. We provide a scheme of computing these matrices and adapt the forward-backward algorithm to compute a likelihood which is readily optimized to provide rate estimates. When confronted with several model proposals, combining this procedure with the Bayesian Information Criterion provides accurate model selection.
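
As a concrete point of reference, the textbook forward recursion for a discrete-observation HMM (computed in log space for numerical stability) is sketched below. The paper replaces the plain emission matrix used here with transmission matrices that couple transitions and emissions, so this is the standard baseline rather than the authors' adapted algorithm.

    import numpy as np
    from scipy.special import logsumexp

    def hmm_log_likelihood(obs, log_pi, log_A, log_B):
        # obs: sequence of observation indices; log_pi: (K,) initial log-probabilities;
        # log_A: (K, K) log transition matrix; log_B: (K, M) log emission matrix.
        alpha = log_pi + log_B[:, obs[0]]
        for o in obs[1:]:
            alpha = log_B[:, o] + logsumexp(alpha[:, None] + log_A, axis=0)
        return logsumexp(alpha)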




model

Imputation and post-selection inference in models with missing data: An application to colorectal cancer surveillance guidelines

Lin Liu, Yuqi Qiu, Loki Natarajan, Karen Messer.

Source: The Annals of Applied Statistics, Volume 13, Number 3, 1370--1396.

Abstract:
It is common to encounter missing data among the potential predictor variables in the setting of model selection. For example, in a recent study we attempted to improve the US guidelines for risk stratification after screening colonoscopy (Cancer Causes Control 27 (2016) 1175–1185), with the aim of helping to reduce both overuse and underuse of follow-on surveillance colonoscopy. The goal was to incorporate selected additional informative variables into a neoplasia risk-prediction model, going beyond the three currently established risk factors, using a large dataset pooled from seven different prospective studies in North America. Unfortunately, not all candidate variables were collected in all studies, so that one or more important potential predictors were missing for over half of the subjects. Thus, while variable selection was a main focus of the study, it was necessary to address the substantial amount of missing data. Multiple imputation can effectively address missing data, and there are also good approaches to incorporating the variable selection process into model-based confidence intervals. However, there is no consensus on appropriate methods of inference that address both issues simultaneously. Our goal here is to study the properties of model-based confidence intervals in the setting of imputation for missing data followed by variable selection. We use both simulation and theory to compare three approaches to such post-imputation-selection inference: a multiple-imputation approach based on Rubin’s Rules for variance estimation (Comput. Statist. Data Anal. 71 (2014) 758–770); a single imputation-selection followed by bootstrap percentile confidence intervals; and a new bootstrap model-averaging approach presented here, following Efron (J. Amer. Statist. Assoc. 109 (2014) 991–1007). We investigate the relative strengths and weaknesses of each method. The “Rubin’s Rules” multiple imputation estimator can have severe undercoverage and is not recommended. The imputation-selection estimator with bootstrap percentile confidence intervals works well. The bootstrap-model-averaged estimator, with the “Efron’s Rules” estimated variance, may be preferred if the true effect sizes are moderate. We apply these results to the colorectal neoplasia risk-prediction problem which motivated the present work.
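
For orientation, the “Rubin’s Rules” pooling step referred to above combines the m imputation-specific estimates and their variances as in the minimal helper below; the undercoverage discussed in the paper concerns how the resulting interval behaves once variable selection is layered on top, not the combining formula itself.

    import numpy as np

    def rubins_rules(estimates, within_variances):
        # estimates: the m point estimates from the m imputed datasets
        # within_variances: their m estimated sampling variances
        q = np.asarray(estimates, dtype=float)
        u = np.asarray(within_variances, dtype=float)
        m = len(q)
        q_bar = q.mean()                          # pooled point estimate
        b = q.var(ddof=1)                         # between-imputation variance
        total_var = u.mean() + (1 + 1 / m) * b    # within variance plus inflated between variance
        return q_bar, total_var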




model

Introduction to papers on the modeling and analysis of network data—II

Stephen E. Fienberg

Source: Ann. Appl. Stat., Volume 4, Number 2, 533--534.




model

Local law and Tracy–Widom limit for sparse stochastic block models

Jong Yun Hwang, Ji Oon Lee, Wooseok Yang.

Source: Bernoulli, Volume 26, Number 3, 2400--2435.

Abstract:
We consider the spectral properties of sparse stochastic block models, where $N$ vertices are partitioned into $K$ balanced communities. Under an assumption that the intra-community probability and inter-community probability are of similar order, we prove a local semicircle law up to the spectral edges, with an explicit formula on the deterministic shift of the spectral edge. We also prove that the fluctuation of the extremal eigenvalues is given by the GOE Tracy–Widom law after rescaling and centering the entries of sparse stochastic block models. Applying the result to sparse stochastic block models, we rigorously prove that there is a large gap between the outliers and the spectral edge without centering.
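
A quick numerical illustration of the phenomenon (a simulation sketch, not part of the paper's argument): for a balanced two-community SBM whose intra-community probability exceeds the inter-community probability, the two largest adjacency eigenvalues separate visibly from the bulk spectral edge. The sizes and probabilities below are arbitrary.

    import numpy as np

    rng = np.random.default_rng(0)
    N, K = 1000, 2
    p_in, p_out = 0.08, 0.02                        # intra- and inter-community edge probabilities
    labels = np.repeat(np.arange(K), N // K)
    P = np.where(labels[:, None] == labels[None, :], p_in, p_out)
    upper = np.triu(rng.random((N, N)) < P, k=1)    # independent edges above the diagonal
    A = (upper | upper.T).astype(float)             # symmetric adjacency, no self-loops
    eigs = np.linalg.eigvalsh(A)
    print("two outliers:", eigs[-2:], "  bulk edge (3rd largest):", eigs[-3])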




model

A fast algorithm with minimax optimal guarantees for topic models with an unknown number of topics

Xin Bing, Florentina Bunea, Marten Wegkamp.

Source: Bernoulli, Volume 26, Number 3, 1765--1796.

Abstract:
Topic models have become popular for the analysis of data that consist of a collection of $n$ independent multinomial observations, with parameters $N_{i}\in\mathbb{N}$ and $\Pi_{i}\in[0,1]^{p}$ for $i=1,\ldots,n$. The model links all cell probabilities, collected in a $p\times n$ matrix $\Pi$, via the assumption that $\Pi$ can be factorized as the product of two nonnegative matrices $A\in[0,1]^{p\times K}$ and $W\in[0,1]^{K\times n}$. Topic models were originally developed in text mining, where one browses through $n$ documents, based on a dictionary of $p$ words, and covering $K$ topics. In this terminology, the matrix $A$ is called the word-topic matrix, and is the main target of estimation. It can be viewed as a matrix of conditional probabilities, and it is uniquely defined under appropriate separability assumptions, discussed in detail in this work. Notably, the unique $A$ is required to satisfy what is commonly known as the anchor word assumption, under which $A$ has an unknown number of rows respectively proportional to the canonical basis vectors in $\mathbb{R}^{K}$. The indices of such rows are referred to as anchor words. Recent computationally feasible algorithms, with theoretical guarantees, constructively utilize this assumption by linking the estimation of the set of anchor words with that of estimating the $K$ vertices of a simplex. This crucial step in the estimation of $A$ requires $K$ to be known, and cannot be easily extended to the more realistic set-up in which $K$ is unknown. This work takes a different view on anchor word estimation, and on the estimation of $A$. We propose a new method of estimation in topic models that is not a variation on the existing simplex-finding algorithms and that estimates $K$ from the observed data. We derive new finite sample minimax lower bounds for the estimation of $A$, as well as new upper bounds for our proposed estimator. We describe the scenarios where our estimator is minimax adaptive. Our finite sample analysis is valid for any $n$, $N_{i}$, $p$ and $K$, and both $p$ and $K$ are allowed to increase with $n$, a situation not handled well by previous analyses. We complement our theoretical results with a detailed simulation study. We illustrate that the new algorithm is faster and more accurate than the current ones, even though we start out with the computational and theoretical disadvantage of not knowing the correct number of topics $K$, while we provide the competing methods with the correct value in our simulations.
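
As a rough point of comparison only, for a fixed $K$ the factorization $\Pi\approx AW$ can be fitted with an off-the-shelf nonnegative matrix factorization; the sketch below applies scikit-learn's generic NMF to a simulated frequency matrix. It is neither the estimator proposed in the paper nor adaptive to an unknown number of topics.

    import numpy as np
    from sklearn.decomposition import NMF

    rng = np.random.default_rng(0)
    p, n, K = 50, 200, 3
    X = rng.dirichlet(np.ones(p), size=n).T           # p x n matrix of word frequencies
    nmf = NMF(n_components=K, init="nndsvda", max_iter=500, random_state=0)
    A_hat = nmf.fit_transform(X)                      # p x K word-topic loadings
    W_hat = nmf.components_                           # K x n topic-document weights
    A_hat = A_hat / A_hat.sum(axis=0, keepdims=True)  # rescale columns to probability vectors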




model

Efficient estimation in single index models through smoothing splines

Arun K. Kuchibhotla, Rohit K. Patra.

Source: Bernoulli, Volume 26, Number 2, 1587--1618.

Abstract:
We consider estimation and inference in a single index regression model with an unknown but smooth link function. In contrast to the standard approach of using kernels or regression splines, we use smoothing splines to estimate the smooth link function. We develop a method to compute the penalized least squares estimators (PLSEs) of the parametric and the nonparametric components given independent and identically distributed (i.i.d.) data. We prove the consistency and find the rates of convergence of the estimators. We establish asymptotic normality under mild assumptions and prove asymptotic efficiency of the parametric component under homoscedastic errors. A finite sample simulation corroborates our asymptotic theory. We also analyze a car mileage data set and an ozone concentration data set. The identifiability and existence of the PLSEs are also investigated.
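
A minimal profile least-squares sketch of the smoothing-spline idea: for each candidate index vector (normalized to the unit sphere for identifiability), fit a univariate smoothing spline of the response on the projected covariates and minimize the resulting mean squared residual. The smoothing level, optimizer and simulated data below are arbitrary choices, and the sketch omits the penalization and inference developed in the paper.

    import numpy as np
    from scipy.interpolate import UnivariateSpline
    from scipy.optimize import minimize

    rng = np.random.default_rng(0)
    n, d = 400, 3
    X = rng.normal(size=(n, d))
    theta_true = np.array([2.0, 1.0, 0.0]) / np.sqrt(5.0)
    y = np.sin(X @ theta_true) + 0.1 * rng.normal(size=n)

    def profile_rss(t):
        t = t / np.linalg.norm(t)                              # unit-norm index for identifiability
        u = X @ t
        order = np.argsort(u)
        g = UnivariateSpline(u[order], y[order], s=0.01 * n)   # smoothing-spline estimate of the link
        return np.mean((y - g(u)) ** 2)

    res = minimize(profile_rss, x0=np.ones(d), method="Nelder-Mead")
    theta_hat = res.x / np.linalg.norm(res.x)                  # estimated index direction (up to sign)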




model

Reliable clustering of Bernoulli mixture models

Amir Najafi, Seyed Abolfazl Motahari, Hamid R. Rabiee.

Source: Bernoulli, Volume 26, Number 2, 1535--1559.

Abstract:
A Bernoulli Mixture Model (BMM) is a finite mixture of random binary vectors with independent dimensions. The problem of clustering BMM data arises in a variety of real-world applications, ranging from population genetics to activity analysis in social networks. In this paper, we analyze the clusterability of BMMs from a theoretical perspective, when the number of clusters is unknown. In particular, we stipulate a set of conditions on the sample complexity and dimension of the model in order to guarantee the Probably Approximately Correct (PAC)-clusterability of a dataset. To the best of our knowledge, these findings are the first non-asymptotic bounds on the sample complexity of learning or clustering BMMs.
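
For contrast with the theory, a plain EM routine for a Bernoulli mixture with a fixed, known number of clusters is sketched below; it is the standard baseline rather than anything from the paper, which treats the number of clusters as unknown.

    import numpy as np

    def bmm_em(X, K, n_iter=100, seed=0):
        # EM for a mixture of K product-Bernoulli distributions.
        # X: (n, d) binary (0/1) data matrix; returns mixing weights,
        # cluster-wise success probabilities and hard cluster labels.
        rng = np.random.default_rng(seed)
        n, d = X.shape
        pi = np.full(K, 1.0 / K)
        mu = rng.uniform(0.25, 0.75, size=(K, d))
        for _ in range(n_iter):
            # E-step: responsibilities from Bernoulli log-likelihoods
            log_resp = X @ np.log(mu).T + (1 - X) @ np.log(1 - mu).T + np.log(pi)
            log_resp -= log_resp.max(axis=1, keepdims=True)
            resp = np.exp(log_resp)
            resp /= resp.sum(axis=1, keepdims=True)
            # M-step: update weights and success probabilities
            nk = resp.sum(axis=0) + 1e-12
            pi = nk / n
            mu = np.clip((resp.T @ X) / nk[:, None], 1e-6, 1 - 1e-6)
        return pi, mu, resp.argmax(axis=1)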




model

A new McKean–Vlasov stochastic interpretation of the parabolic–parabolic Keller–Segel model: The one-dimensional case

Denis Talay, Milica Tomašević.

Source: Bernoulli, Volume 26, Number 2, 1323--1353.

Abstract:
In this paper, we analyze a stochastic interpretation of the one-dimensional parabolic–parabolic Keller–Segel system without cut-off. It involves an original type of McKean–Vlasov interaction kernel. At the particle level, each particle interacts with all the past of each other particle by means of a time integrated functional involving a singular kernel. At the mean-field level studied here, the McKean–Vlasov limit process interacts with all the past time marginals of its probability distribution in a similarly singular way. We prove that the parabolic–parabolic Keller–Segel system in the whole Euclidean space and the corresponding McKean–Vlasov stochastic differential equation are well-posed for any values of the parameters of the model.