en The two-to-infinity norm and singular subspace geometry with applications to high-dimensional statistics By projecteuclid.org Published On :: Fri, 02 Aug 2019 22:04 EDT Joshua Cape, Minh Tang, Carey E. Priebe. Source: The Annals of Statistics, Volume 47, Number 5, 2405--2439.Abstract: The singular value matrix decomposition plays a ubiquitous role throughout statistics and related fields. Myriad applications including clustering, classification, and dimensionality reduction involve studying and exploiting the geometric structure of singular values and singular vectors. This paper provides a novel collection of technical and theoretical tools for studying the geometry of singular subspaces using the two-to-infinity norm. Motivated by preliminary deterministic Procrustes analysis, we consider a general matrix perturbation setting in which we derive a new Procrustean matrix decomposition. Together with flexible machinery developed for the two-to-infinity norm, this allows us to conduct a refined analysis of the induced perturbation geometry with respect to the underlying singular vectors even in the presence of singular value multiplicity. Our analysis yields singular vector entrywise perturbation bounds for a range of popular matrix noise models, each of which has a meaningful associated statistical inference task. In addition, we demonstrate how the two-to-infinity norm is the preferred norm in certain statistical settings. Specific applications discussed in this paper include covariance estimation, singular subspace recovery, and multiple graph inference. Both our Procrustean matrix decomposition and the technical machinery developed for the two-to-infinity norm may be of independent interest. Full Article
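For readers unfamiliar with the norm in the title, here is a minimal numpy sketch (not the authors' code) of the two quantities the abstract revolves around: the two-to-infinity norm, i.e. the largest Euclidean row norm of a matrix, and an orthogonal Procrustes alignment between an estimated and a reference singular-subspace basis. The toy matrices and the noise level are assumptions made purely for illustration.

```python
import numpy as np

def two_to_infinity_norm(A):
    """||A||_{2 -> infinity}: the maximum Euclidean norm over the rows of A."""
    return np.max(np.linalg.norm(A, axis=1))

def procrustes_align(U_hat, U):
    """Orthogonal matrix W minimizing ||U_hat @ W - U||_F (orthogonal Procrustes)."""
    A, _, Bt = np.linalg.svd(U_hat.T @ U)
    return A @ Bt

# Toy comparison of two singular-subspace bases (hypothetical data).
rng = np.random.default_rng(0)
M = rng.normal(size=(200, 5))
U, _, _ = np.linalg.svd(M, full_matrices=False)
U_hat, _, _ = np.linalg.svd(M + 0.01 * rng.normal(size=M.shape), full_matrices=False)
W = procrustes_align(U_hat, U)
print(two_to_infinity_norm(U_hat @ W - U))  # row-wise (entrywise-flavored) error
print(np.linalg.norm(U_hat @ W - U, 2))     # spectral-norm error, for contrast
```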
en On testing conditional qualitative treatment effects By projecteuclid.org Published On :: Tue, 21 May 2019 04:00 EDT Chengchun Shi, Rui Song, Wenbin Lu. Source: The Annals of Statistics, Volume 47, Number 4, 2348--2377.Abstract: Precision medicine is an emerging medical paradigm that focuses on finding the most effective treatment strategy tailored for individual patients. In the literature, most of the existing works focused on estimating the optimal treatment regime. However, there has been less attention devoted to hypothesis testing regarding the optimal treatment regime. In this paper, we first introduce the notion of conditional qualitative treatment effects (CQTE) of a set of variables given another set of variables and provide a class of equivalent representations for the null hypothesis of no CQTE. The proposed definition of CQTE does not assume any parametric form for the optimal treatment rule and plays an important role in assessing the incremental value of a set of new variables in optimal treatment decision making conditional on an existing set of prescriptive variables. We then propose novel testing procedures for no CQTE based on kernel estimation of the conditional contrast functions. We show that our test statistics have asymptotically correct size and nonnegligible power against some nonstandard local alternatives. The empirical performance of the proposed tests is evaluated by simulations and an application to an AIDS data set. Full Article
en Convergence complexity analysis of Albert and Chib’s algorithm for Bayesian probit regression By projecteuclid.org Published On :: Tue, 21 May 2019 04:00 EDT Qian Qin, James P. Hobert. Source: The Annals of Statistics, Volume 47, Number 4, 2320--2347.Abstract: The use of MCMC algorithms in high dimensional Bayesian problems has become routine. This has spurred so-called convergence complexity analysis, the goal of which is to ascertain how the convergence rate of a Monte Carlo Markov chain scales with sample size, $n$, and/or number of covariates, $p$. This article provides a thorough convergence complexity analysis of Albert and Chib’s [ J. Amer. Statist. Assoc. 88 (1993) 669–679] data augmentation algorithm for the Bayesian probit regression model. The main tools used in this analysis are drift and minorization conditions. The usual pitfalls associated with this type of analysis are avoided by utilizing centered drift functions, which are minimized in high posterior probability regions, and by using a new technique to suppress high-dimensionality in the construction of minorization conditions. The main result is that the geometric convergence rate of the underlying Markov chain is bounded below 1 both as $n\rightarrow\infty$ (with $p$ fixed), and as $p\rightarrow\infty$ (with $n$ fixed). Furthermore, the first computable bounds on the total variation distance to stationarity are byproducts of the asymptotic analysis. Full Article
en Convergence rates of least squares regression estimators with heavy-tailed errors By projecteuclid.org Published On :: Tue, 21 May 2019 04:00 EDT Qiyang Han, Jon A. Wellner. Source: The Annals of Statistics, Volume 47, Number 4, 2286--2319.Abstract: We study the performance of the least squares estimator (LSE) in a general nonparametric regression model, when the errors are independent of the covariates but may only have a $p$th moment ($p\geq 1$). In such a heavy-tailed regression setting, we show that if the model satisfies a standard “entropy condition” with exponent $\alpha\in(0,2)$, then the $L_{2}$ loss of the LSE converges at a rate $\mathcal{O}_{\mathbf{P}}\bigl(n^{-\frac{1}{2+\alpha}}\vee n^{-\frac{1}{2}+\frac{1}{2p}}\bigr)$. Such a rate cannot be improved under the entropy condition alone. This rate quantifies both some positive and negative aspects of the LSE in a heavy-tailed regression setting. On the positive side, as long as the errors have $p\geq 1+2/\alpha$ moments, the $L_{2}$ loss of the LSE converges at the same rate as if the errors are Gaussian. On the negative side, if $p<1+2/\alpha$, there are (many) hard models at any entropy level $\alpha$ for which the $L_{2}$ loss of the LSE converges at a strictly slower rate than other robust estimators. The validity of the above rate relies crucially on the independence of the covariates and the errors. In fact, the $L_{2}$ loss of the LSE can converge arbitrarily slowly when the independence fails. The key technical ingredient is a new multiplier inequality that gives sharp bounds for the “multiplier empirical process” associated with the LSE. We further give an application to the sparse linear regression model with heavy-tailed covariates and errors to demonstrate the scope of this new inequality. Full Article
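As a quick numerical illustration of the stated rate, the sketch below evaluates its two terms, $n^{-1/(2+\alpha)}$ and $n^{-1/2+1/(2p)}$, and reports the slower (larger) one; the chosen $n$, $\alpha$ and $p$ are arbitrary assumptions, not values from the paper.

```python
import numpy as np

def lse_rate(n, alpha, p):
    """Larger (slower) of the two terms in the rate n^{-1/(2+alpha)} v n^{-1/2+1/(2p)}."""
    entropy_term = n ** (-1.0 / (2.0 + alpha))
    moment_term = n ** (-0.5 + 1.0 / (2.0 * p))
    return max(entropy_term, moment_term)

# With alpha = 1, the Gaussian-like rate n^{-1/3} holds once p >= 1 + 2/alpha = 3.
for p in (2, 3, 5):
    print(p, lse_rate(n=10_000, alpha=1.0, p=p))
```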
en On deep learning as a remedy for the curse of dimensionality in nonparametric regression By projecteuclid.org Published On :: Tue, 21 May 2019 04:00 EDT Benedikt Bauer, Michael Kohler. Source: The Annals of Statistics, Volume 47, Number 4, 2261--2285.Abstract: Assuming that a smoothness condition and a suitable restriction on the structure of the regression function hold, it is shown that least squares estimates based on multilayer feedforward neural networks are able to circumvent the curse of dimensionality in nonparametric regression. The proof is based on new approximation results concerning multilayer feedforward neural networks with bounded weights and a bounded number of hidden neurons. The estimates are compared with various other approaches by using simulated data. Full Article
en Negative association, ordering and convergence of resampling methods By projecteuclid.org Published On :: Tue, 21 May 2019 04:00 EDT Mathieu Gerber, Nicolas Chopin, Nick Whiteley. Source: The Annals of Statistics, Volume 47, Number 4, 2236--2260.Abstract: We study convergence and convergence rates for resampling schemes. Our first main result is a general consistency theorem based on the notion of negative association, which is applied to establish the almost sure weak convergence of measures output from Kitagawa’s [ J. Comput. Graph. Statist. 5 (1996) 1–25] stratified resampling method. Carpenter, Clifford and Fearnhead’s [ IEE Proc. Radar Sonar Navig. 146 (1999) 2–7] systematic resampling method is similar in structure but can fail to converge depending on the order of the input samples. We introduce a new resampling algorithm based on a stochastic rounding technique of [In 42nd IEEE Symposium on Foundations of Computer Science (Las Vegas, NV, 2001) (2001) 588–597 IEEE Computer Soc.], which shares some attractive properties of systematic resampling, but which exhibits negative association and, therefore, converges irrespective of the order of the input samples. We confirm a conjecture made by [ J. Comput. Graph. Statist. 5 (1996) 1–25] that ordering input samples by their states in $\mathbb{R}$ yields a faster rate of convergence; we establish that when particles are ordered using the Hilbert curve in $\mathbb{R}^{d}$, the variance of the resampling error is ${\scriptstyle\mathcal{O}}(N^{-(1+1/d)})$ under mild conditions, where $N$ is the number of particles. We use these results to establish asymptotic properties of particle algorithms based on resampling schemes that differ from multinomial resampling. Full Article
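For context, this is the textbook form of the systematic resampling scheme the abstract discusses, written as a short numpy sketch; it is not the authors' new stochastic-rounding algorithm, and the Hilbert-curve ordering of particles is omitted.

```python
import numpy as np

def systematic_resampling(weights, rng):
    """Classic systematic resampling: a single uniform draw spawns n evenly
    spaced thresholds that are matched against the cumulative weights."""
    n = len(weights)
    positions = (rng.uniform() + np.arange(n)) / n
    cumulative = np.cumsum(weights / np.sum(weights))
    return np.searchsorted(cumulative, positions)  # indices of resampled particles

rng = np.random.default_rng(1)
w = rng.exponential(size=1000)            # hypothetical unnormalized particle weights
idx = systematic_resampling(w, rng)
```

The abstract's point is that the convergence of this scheme depends on how the input particles are ordered, which is what motivates the negatively associated alternative proposed in the paper.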
en Generalized cluster trees and singular measures By projecteuclid.org Published On :: Tue, 21 May 2019 04:00 EDT Yen-Chi Chen. Source: The Annals of Statistics, Volume 47, Number 4, 2174--2203.Abstract: In this paper we study the $\alpha$-cluster tree ($\alpha$-tree) under both singular and nonsingular measures. The $\alpha$-tree uses probability contents within a set created by the ordering of points to construct a cluster tree so that it is well defined even for singular measures. We first derive the convergence rate for a density level set around critical points, which leads to the convergence rate for estimating an $\alpha$-tree under nonsingular measures. For singular measures, we study how the kernel density estimator (KDE) behaves and prove that the KDE is not uniformly consistent but pointwise consistent after rescaling. We further prove that the estimated $\alpha$-tree fails to converge in the $L_{\infty}$ metric but is still consistent under the integrated distance. We also observe a new type of critical points, the dimensional critical points (DCPs), of a singular measure. DCPs are points that contribute to cluster tree topology but cannot be defined using density gradient. Building on the analysis of the KDE and DCPs, we prove the topological consistency of an estimated $\alpha$-tree. Full Article
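A rough sketch of the objects underlying a density cluster tree, assuming a plain Gaussian KDE: the estimated density and one of its upper level sets. The $\alpha$-tree itself orders sets by probability content and handles singular measures, neither of which this toy example attempts.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(2)
# Hypothetical bimodal sample; each mode corresponds to a branch of the cluster tree.
x = np.concatenate([rng.normal(-2, 0.5, 300), rng.normal(2, 0.5, 300)])
kde = gaussian_kde(x)
dens = kde(x)
lam = np.quantile(dens, 0.5)             # an arbitrary density level
upper_level_set = x[dens >= lam]         # sample points in {x : p_hat(x) >= lam}
print(upper_level_set.size)
```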
en Bayes and empirical-Bayes multiplicity adjustment in the variable-selection problem By projecteuclid.org Published On :: Thu, 05 Aug 2010 15:41 EDT James G. Scott, James O. Berger. Source: Ann. Statist., Volume 38, Number 5, 2587--2619.Abstract: This paper studies the multiplicity-correction effect of standard Bayesian variable-selection priors in linear regression. Our first goal is to clarify when, and how, multiplicity correction happens automatically in Bayesian analysis, and to distinguish this correction from the Bayesian Ockham’s-razor effect. Our second goal is to contrast empirical-Bayes and fully Bayesian approaches to variable selection through examples, theoretical results and simulations. Considerable differences between the two approaches are found. In particular, we prove a theorem that characterizes a surprising asymptotic discrepancy between fully Bayes and empirical Bayes. This discrepancy arises from a different source than the failure to account for hyperparameter uncertainty in the empirical-Bayes estimate. Indeed, even at the extreme, when the empirical-Bayes estimate converges asymptotically to the true variable-inclusion probability, the potential for a serious difference remains. Full Article
en componentization By looselycoupled.com Published On :: 2004-09-28T15:00:00-00:00 Breaking down into interchangeable pieces. For many years, software innovators have been trying to make software more like computer hardware, which is assembled from cheap, mass-produced components that connect together using standard interfaces. Component-based development (CBD) uses this approach to assemble software from reusable components within frameworks such as CORBA, Sun's Enterprise Java Beans (EJBs) and Microsoft COM. Today's service oriented architectures, based on web services, go a step further by encapsulating components in a standards-based service interface, which allows components to be reused outside their native framework. Componentization is not limited to software; through the use of subcontracting and outsourcing, it can also apply to business organizations and processes. Full Article
en endpoint By looselycoupled.com Published On :: 2004-11-01T19:00:00-00:00 Where a service connects to the network. In a service oriented architecture, any single network interaction involves two endpoints: one to provide a service, and the other to consume it. In web services, an endpoint is specified by a URI. Full Article
en object-oriented By looselycoupled.com Published On :: 2005-05-17T14:00:00-00:00 (OO) Structured around functional units. Object-oriented programming languages such as C++, SmallTalk and Java are designed to build software made up of objects: discrete bundles of functionality that can act on data only in certain pre-defined ways. This modular building-block approach makes complex software development tasks more flexible and easier to manage within a given programming environment. The emergence of object-oriented programming was a stepping stone to the development of componentization and subsequently of service-oriented architectures. Full Article
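As a concrete illustration of "discrete bundles of functionality that can act on data only in certain pre-defined ways", here is a small class sketch; Python is used for consistency with the other code examples in this digest, even though the entry itself names C++, Smalltalk and Java.

```python
class BankAccount:
    """An object: data (the balance) bundled with the only operations allowed on it."""

    def __init__(self, balance=0):
        self._balance = balance          # internal state, accessed only via methods

    def deposit(self, amount):
        if amount <= 0:
            raise ValueError("deposit must be positive")
        self._balance += amount

    def withdraw(self, amount):
        if amount > self._balance:
            raise ValueError("insufficient funds")
        self._balance -= amount

    @property
    def balance(self):
        return self._balance

account = BankAccount(100)
account.deposit(50)
account.withdraw(30)
print(account.balance)   # 120
```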
en Correction: Sensitivity analysis for an unobserved moderator in RCT-to-target-population generalization of treatment effects By projecteuclid.org Published On :: Wed, 15 Apr 2020 22:05 EDT Trang Quynh Nguyen, Elizabeth A. Stuart. Source: The Annals of Applied Statistics, Volume 14, Number 1, 518--520. Full Article
en A hierarchical dependent Dirichlet process prior for modelling bird migration patterns in the UK By projecteuclid.org Published On :: Wed, 15 Apr 2020 22:05 EDT Alex Diana, Eleni Matechou, Jim Griffin, Alison Johnston. Source: The Annals of Applied Statistics, Volume 14, Number 1, 473--493.Abstract: Environmental changes in recent years have been linked to phenological shifts which in turn are linked to the survival of species. The work in this paper is motivated by capture-recapture data on blackcaps collected by the British Trust for Ornithology as part of the Constant Effort Sites monitoring scheme. Blackcaps overwinter abroad and migrate to the UK annually for breeding purposes. We propose a novel Bayesian nonparametric approach for expressing the bivariate density of individual arrival and departure times at different sites across a number of years as a mixture model. The new model combines the ideas of the hierarchical and the dependent Dirichlet process, allowing the estimation of site-specific weights and year-specific mixture locations, which are modelled as functions of environmental covariates using a multivariate extension of the Gaussian process. The proposed modelling framework is extremely general and can be used in any context where multivariate density estimation is performed jointly across different groups and in the presence of a continuous covariate. Full Article
en A comparison of principal component methods between multiple phenotype regression and multiple SNP regression in genetic association studies By projecteuclid.org Published On :: Wed, 15 Apr 2020 22:05 EDT Zhonghua Liu, Ian Barnett, Xihong Lin. Source: The Annals of Applied Statistics, Volume 14, Number 1, 433--451.Abstract: Principal component analysis (PCA) is a popular method for dimension reduction in unsupervised multivariate analysis. However, existing ad hoc uses of PCA in both multivariate regression (multiple outcomes) and multiple regression (multiple predictors) lack theoretical justification. The differences in the statistical properties of PCAs in these two regression settings are not well understood. In this paper we provide theoretical results on the power of PCA in genetic association testings in both multiple phenotype and SNP-set settings. The multiple phenotype setting refers to the case when one is interested in studying the association between a single SNP and multiple phenotypes as outcomes. The SNP-set setting refers to the case when one is interested in studying the association between multiple SNPs in a SNP set and a single phenotype as the outcome. We demonstrate analytically that the properties of the PC-based analysis in these two regression settings are substantially different. We show that the lower order PCs, that is, PCs with large eigenvalues, are generally preferred and lead to a higher power in the SNP-set setting, while the higher-order PCs, that is, PCs with small eigenvalues, are generally preferred in the multiple phenotype setting. We also investigate the power of three other popular statistical methods, the Wald test, the variance component test and the minimum $p$-value test, in both multiple phenotype and SNP-set settings. We use theoretical power, simulation studies, and two real data analyses to validate our findings. Full Article
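The sketch below only illustrates the distinction the abstract draws, under assumed toy data: extracting the leading (large-eigenvalue) PCs of a SNP set versus the trailing (small-eigenvalue) PCs of a phenotype matrix. It is not the paper's power analysis or its association tests.

```python
import numpy as np

def principal_components(X, k, largest=True):
    """Scores on the top-k (largest=True) or bottom-k principal components of X."""
    Xc = X - X.mean(axis=0)
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    order = np.argsort(s)[::-1] if largest else np.argsort(s)
    return Xc @ Vt[order[:k]].T

rng = np.random.default_rng(3)
genotypes = rng.integers(0, 3, size=(500, 50)).astype(float)  # hypothetical SNP set (0/1/2 allele counts)
phenotypes = rng.normal(size=(500, 10))                       # hypothetical multiple phenotypes
snp_pcs = principal_components(genotypes, k=2, largest=True)      # SNP-set setting: lower-order PCs preferred
pheno_pcs = principal_components(phenotypes, k=2, largest=False)  # multiple-phenotype setting: higher-order PCs preferred
```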
en Measuring human activity spaces from GPS data with density ranking and summary curves By projecteuclid.org Published On :: Wed, 15 Apr 2020 22:05 EDT Yen-Chi Chen, Adrian Dobra. Source: The Annals of Applied Statistics, Volume 14, Number 1, 409--432.Abstract: Activity spaces are fundamental to the assessment of individuals’ dynamic exposure to social and environmental risk factors associated with multiple spatial contexts that are visited during activities of daily living. In this paper we survey existing approaches for measuring the geometry, size and structure of activity spaces, based on GPS data, and explain their limitations. We propose addressing these shortcomings through a nonparametric approach called density ranking and also through three summary curves: the mass-volume curve, the Betti number curve and the persistence curve. We introduce a novel mixture model for human activity spaces and study its asymptotic properties. We prove that the kernel density estimator, which, at the present time, is one of the most widespread methods for measuring activity spaces, is not a stable estimator of their structure. We illustrate the practical value of our methods with a simulation study and with a recently collected GPS dataset that comprises the locations visited by 10 individuals over a six-month period. Full Article
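One common way to turn a KDE into density-ranking values (the rank of a location is the fraction of observations with no higher estimated density) can be sketched as follows; the GPS coordinates are simulated placeholders, and the exact construction in the paper may differ.

```python
import numpy as np
from scipy.stats import gaussian_kde

def density_ranking(points, grid):
    """Fraction of observations whose estimated density does not exceed the
    density at each grid location; values lie in [0, 1]."""
    kde = gaussian_kde(points)
    dens_obs = kde(points)
    dens_grid = kde(grid)
    return np.mean(dens_obs[None, :] <= dens_grid[:, None], axis=1)

rng = np.random.default_rng(4)
gps = rng.normal(size=(2, 500))    # hypothetical standardized 2-D GPS locations (columns are points)
grid = rng.normal(size=(2, 100))   # evaluation locations
alpha_rank = density_ranking(gps, grid)
```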
en Estimating and forecasting the smoking-attributable mortality fraction for both genders jointly in over 60 countries By projecteuclid.org Published On :: Wed, 15 Apr 2020 22:05 EDT Yicheng Li, Adrian E. Raftery. Source: The Annals of Applied Statistics, Volume 14, Number 1, 381--408.Abstract: Smoking is one of the leading preventable threats to human health and a major risk factor for lung cancer, upper aerodigestive cancer and chronic obstructive pulmonary disease. Estimating and forecasting the smoking attributable fraction (SAF) of mortality can yield insights into smoking epidemics and also provide a basis for more accurate mortality and life expectancy projection. Peto et al. ( Lancet 339 (1992) 1268–1278) proposed a method to estimate the SAF using the lung cancer mortality rate as an indicator of exposure to smoking in the population of interest. Here, we use the same method to estimate the all-age SAF (ASAF) for both genders for over 60 countries. We document a strong and cross-nationally consistent pattern of the evolution of the SAF over time. We use this as the basis for a new Bayesian hierarchical model to project future male and female ASAF from over 60 countries simultaneously. This gives forecasts as well as predictive distributions that can be used to find uncertainty intervals for any quantity of interest. We assess the model using out-of-sample predictive validation and find that it provides good forecasts and well-calibrated forecast intervals, comparing favorably with other methods. Full Article
en Feature selection for generalized varying coefficient mixed-effect models with application to obesity GWAS By projecteuclid.org Published On :: Wed, 15 Apr 2020 22:05 EDT Wanghuan Chu, Runze Li, Jingyuan Liu, Matthew Reimherr. Source: The Annals of Applied Statistics, Volume 14, Number 1, 276--298.Abstract: Motivated by an empirical analysis of data from a genome-wide association study on obesity, measured by the body mass index (BMI), we propose a two-step gene-detection procedure for generalized varying coefficient mixed-effects models with ultrahigh dimensional covariates. The proposed procedure selects significant single nucleotide polymorphisms (SNPs) impacting the mean BMI trend, some of which have already been biologically proven to be “fat genes.” The method also discovers SNPs that significantly influence the age-dependent variability of BMI. The proposed procedure takes into account individual variations of genetic effects and can also be directly applied to longitudinal data with continuous, binary or count responses. We employ Monte Carlo simulation studies to assess the performance of the proposed method and further carry out causal inference for the selected SNPs. Full Article
en Estimating the health effects of environmental mixtures using Bayesian semiparametric regression and sparsity inducing priors By projecteuclid.org Published On :: Wed, 15 Apr 2020 22:05 EDT Joseph Antonelli, Maitreyi Mazumdar, David Bellinger, David Christiani, Robert Wright, Brent Coull. Source: The Annals of Applied Statistics, Volume 14, Number 1, 257--275.Abstract: Humans are routinely exposed to mixtures of chemical and other environmental factors, making the quantification of health effects associated with environmental mixtures a critical goal for establishing environmental policy sufficiently protective of human health. The quantification of the effects of exposure to an environmental mixture poses several statistical challenges. It is often the case that exposure to multiple pollutants interact with each other to affect an outcome. Further, the exposure-response relationship between an outcome and some exposures, such as some metals, can exhibit complex, nonlinear forms, since some exposures can be beneficial and detrimental at different ranges of exposure. To estimate the health effects of complex mixtures, we propose a flexible Bayesian approach that allows exposures to interact with each other and have nonlinear relationships with the outcome. We induce sparsity using multivariate spike and slab priors to determine which exposures are associated with the outcome and which exposures interact with each other. The proposed approach is interpretable, as we can use the posterior probabilities of inclusion into the model to identify pollutants that interact with each other. We utilize our approach to study the impact of exposure to metals on child neurodevelopment in Bangladesh and find a nonlinear, interactive relationship between arsenic and manganese. Full Article
en Bayesian factor models for probabilistic cause of death assessment with verbal autopsies By projecteuclid.org Published On :: Wed, 15 Apr 2020 22:05 EDT Tsuyoshi Kunihama, Zehang Richard Li, Samuel J. Clark, Tyler H. McCormick. Source: The Annals of Applied Statistics, Volume 14, Number 1, 241--256.Abstract: The distribution of deaths by cause provides crucial information for public health planning, response and evaluation. About 60% of deaths globally are not registered or given a cause, limiting our ability to understand disease epidemiology. Verbal autopsy (VA) surveys are increasingly used in such settings to collect information on the signs, symptoms and medical history of people who have recently died. This article develops a novel Bayesian method for estimation of population distributions of deaths by cause using verbal autopsy data. The proposed approach is based on a multivariate probit model where associations among items in questionnaires are flexibly induced by latent factors. Using the Population Health Metrics Research Consortium labeled data that include both VA and medically certified causes of death, we assess performance of the proposed method. Further, we estimate important questionnaire items that are highly associated with causes of death. This framework provides insights that will simplify future data collection. Full Article
en Modifying the Chi-square and the CMH test for population genetic inference: Adapting to overdispersion By projecteuclid.org Published On :: Wed, 15 Apr 2020 22:05 EDT Kerstin Spitzer, Marta Pelizzola, Andreas Futschik. Source: The Annals of Applied Statistics, Volume 14, Number 1, 202--220.Abstract: Evolve and resequence studies provide a popular approach to simulate evolution in the lab and explore its genetic basis. In this context, Pearson’s chi-square test, Fisher’s exact test as well as the Cochran–Mantel–Haenszel test are commonly used to infer genomic positions affected by selection from temporal changes in allele frequency. However, the null model associated with these tests does not match the null hypothesis of actual interest. Indeed, due to genetic drift and possibly other additional noise components such as pool sequencing, the null variance in the data can be substantially larger than accounted for by these common test statistics. This leads to $p$-values that are systematically too small and, therefore, a huge number of false positive results. Even, if the ranking rather than the actual $p$-values is of interest, a naive application of the mentioned tests will give misleading results, as the amount of overdispersion varies from locus to locus. We therefore propose adjusted statistics that take the overdispersion into account while keeping the formulas simple. This is particularly useful in genome-wide applications, where millions of SNPs can be handled with little computational effort. We then apply the adapted test statistics to real data from Drosophila and investigate how information from intermediate generations can be included when available. We also discuss further applications such as genome-wide association studies based on pool sequencing data and tests for local adaptation. Full Article
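The spirit of the adjustment, as a minimal sketch under assumptions: compute the usual Pearson chi-square statistic for an allele-count table and deflate it by a variance-inflation (overdispersion) factor before looking up the p-value. The paper derives that factor from drift and pool-sequencing noise; here it is just a user-supplied number, and the counts are hypothetical.

```python
import numpy as np
from scipy.stats import chi2

def adjusted_chisq_pvalue(table, inflation):
    """Pearson chi-square p-value with the statistic deflated by an
    overdispersion (variance-inflation) factor."""
    table = np.asarray(table, dtype=float)
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / table.sum()
    stat = np.sum((table - expected) ** 2 / expected)
    df = (table.shape[0] - 1) * (table.shape[1] - 1)
    return chi2.sf(stat / inflation, df)

# Hypothetical 2x2 allele counts at one SNP across two time points; the inflation
# factor would in practice be estimated, e.g. from drift simulations or replicates.
print(adjusted_chisq_pvalue([[120, 80], [95, 105]], inflation=3.0))
```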
en Surface temperature monitoring in liver procurement via functional variance change-point analysis By projecteuclid.org Published On :: Wed, 15 Apr 2020 22:05 EDT Zhenguo Gao, Pang Du, Ran Jin, John L. Robertson. Source: The Annals of Applied Statistics, Volume 14, Number 1, 143--159.Abstract: Liver procurement experiments with surface-temperature monitoring motivated Gao et al. ( J. Amer. Statist. Assoc. 114 (2019) 773–781) to develop a variance change-point detection method under a smoothly-changing mean trend. However, the spotwise change points yielded from their method do not offer immediate information to surgeons since an organ is often transplanted as a whole or in part. We develop a new practical method that can analyze a defined portion of the organ surface at a time. It also provides a novel addition to the developing field of functional data monitoring. Furthermore, numerical challenge emerges for simultaneously modeling the variance functions of 2D locations and the mean function of location and time. The respective sample sizes in the scales of 10,000 and 1,000,000 for modeling these functions make standard spline estimation too costly to be useful. We introduce a multistage subsampling strategy with steps educated by quickly-computable preliminary statistical measures. Extensive simulations show that the new method can efficiently reduce the computational cost and provide reasonable parameter estimates. Application of the new method to our liver surface temperature monitoring data shows its effectiveness in providing accurate status change information for a selected portion of the organ in the experiment. Full Article
en Efficient real-time monitoring of an emerging influenza pandemic: How feasible? By projecteuclid.org Published On :: Wed, 15 Apr 2020 22:05 EDT Paul J. Birrell, Lorenz Wernisch, Brian D. M. Tom, Leonhard Held, Gareth O. Roberts, Richard G. Pebody, Daniela De Angelis. Source: The Annals of Applied Statistics, Volume 14, Number 1, 74--93.Abstract: A prompt public health response to a new epidemic relies on the ability to monitor and predict its evolution in real time as data accumulate. The 2009 A/H1N1 outbreak in the UK revealed pandemic data as noisy, contaminated, potentially biased and originating from multiple sources. This seriously challenges the capacity for real-time monitoring. Here, we assess the feasibility of real-time inference based on such data by constructing an analytic tool combining an age-stratified SEIR transmission model with various observation models describing the data generation mechanisms. As batches of data become available, a sequential Monte Carlo (SMC) algorithm is developed to synthesise multiple imperfect data streams, iterate epidemic inferences and assess model adequacy amidst a rapidly evolving epidemic environment, substantially reducing computation time in comparison to standard MCMC, to ensure timely delivery of real-time epidemic assessments. In application to simulated data designed to mimic the 2009 A/H1N1 epidemic, SMC is shown to have additional benefits in terms of assessing predictive performance and coping with parameter nonidentifiability. Full Article
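For orientation, a bare-bones bootstrap particle filter (the basic SMC building block) on a toy random-walk state-space model; the paper's algorithm works with an age-stratified SEIR model, multiple imperfect data streams and model-adequacy checks, none of which appear in this sketch.

```python
import numpy as np

def bootstrap_particle_filter(obs, n_particles, rng):
    """Minimal bootstrap particle filter: propagate, weight by the likelihood,
    estimate, then resample."""
    particles = rng.normal(size=n_particles)
    filtered_means = []
    for y in obs:
        particles = particles + rng.normal(scale=0.1, size=n_particles)  # propagate
        logw = -0.5 * (y - particles) ** 2                                # Gaussian obs. log-likelihood (unit sd)
        w = np.exp(logw - logw.max())
        w /= w.sum()
        filtered_means.append(np.sum(w * particles))
        particles = particles[rng.choice(n_particles, size=n_particles, p=w)]  # resample
    return np.array(filtered_means)

rng = np.random.default_rng(5)
y = np.cumsum(rng.normal(scale=0.1, size=50)) + rng.normal(scale=1.0, size=50)  # simulated observations
print(bootstrap_particle_filter(y, n_particles=2000, rng=rng)[-5:])
```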
en Integrative survival analysis with uncertain event times in application to a suicide risk study By projecteuclid.org Published On :: Wed, 15 Apr 2020 22:05 EDT Wenjie Wang, Robert Aseltine, Kun Chen, Jun Yan. Source: The Annals of Applied Statistics, Volume 14, Number 1, 51--73.Abstract: The concept of integrating data from disparate sources to accelerate scientific discovery has generated tremendous excitement in many fields. The potential benefits from data integration, however, may be compromised by the uncertainty due to incomplete/imperfect record linkage. Motivated by a suicide risk study, we propose an approach for analyzing survival data with uncertain event times arising from data integration. Specifically, in our problem deaths identified from the hospital discharge records together with reported suicidal deaths determined by the Office of Medical Examiner may still not include all the death events of patients, and the missing deaths can be recovered from a complete database of death records. Since the hospital discharge data can only be linked to the death record data by matching basic patient characteristics, a patient with a censored death time from the first dataset could be linked to multiple potential event records in the second dataset. We develop an integrative Cox proportional hazards regression in which the uncertainty in the matched event times is modeled probabilistically. The estimation procedure combines the ideas of profile likelihood and the expectation conditional maximization algorithm (ECM). Simulation studies demonstrate that under realistic settings of imperfect data linkage the proposed method outperforms several competing approaches including multiple imputation. A marginal screening analysis using the proposed integrative Cox model is performed to identify risk factors associated with death following suicide-related hospitalization in Connecticut. The identified diagnostics codes are consistent with existing literature and provide several new insights on suicide risk, prediction and prevention. Full Article
en BART with targeted smoothing: An analysis of patient-specific stillbirth risk By projecteuclid.org Published On :: Wed, 15 Apr 2020 22:05 EDT Jennifer E. Starling, Jared S. Murray, Carlos M. Carvalho, Radek K. Bukowski, James G. Scott. Source: The Annals of Applied Statistics, Volume 14, Number 1, 28--50.Abstract: This article introduces BART with Targeted Smoothing, or tsBART, a new Bayesian tree-based model for nonparametric regression. The goal of tsBART is to introduce smoothness over a single target covariate $t$ while not necessarily requiring smoothness over other covariates $x$. tsBART is based on the Bayesian Additive Regression Trees (BART) model, an ensemble of regression trees. tsBART extends BART by parameterizing each tree’s terminal nodes with smooth functions of $t$ rather than independent scalars. Like BART, tsBART captures complex nonlinear relationships and interactions among the predictors. But unlike BART, tsBART guarantees that the response surface will be smooth in the target covariate. This improves interpretability and helps to regularize the estimate. After introducing and benchmarking the tsBART model, we apply it to our motivating example—pregnancy outcomes data from the National Center for Health Statistics. Our aim is to provide patient-specific estimates of stillbirth risk across gestational age $(t)$ and based on maternal and fetal risk factors $(x)$. Obstetricians expect stillbirth risk to vary smoothly over gestational age but not necessarily over other covariates, and tsBART has been designed precisely to reflect this structural knowledge. The results of our analysis show the clear superiority of the tsBART model for quantifying stillbirth risk, thereby providing patients and doctors with better information for managing the risk of fetal mortality. All methods described here are implemented in the R package tsbart . Full Article
en SHOPPER: A probabilistic model of consumer choice with substitutes and complements By projecteuclid.org Published On :: Wed, 15 Apr 2020 22:05 EDT Francisco J. R. Ruiz, Susan Athey, David M. Blei. Source: The Annals of Applied Statistics, Volume 14, Number 1, 1--27.Abstract: We develop SHOPPER, a sequential probabilistic model of shopping data. SHOPPER uses interpretable components to model the forces that drive how a customer chooses products; in particular, we designed SHOPPER to capture how items interact with other items. We develop an efficient posterior inference algorithm to estimate these forces from large-scale data, and we analyze a large dataset from a major chain grocery store. We are interested in answering counterfactual queries about changes in prices. We found that SHOPPER provides accurate predictions even under price interventions, and that it helps identify complementary and substitutable pairs of products. Full Article
en A general theory for preferential sampling in environmental networks By projecteuclid.org Published On :: Wed, 27 Nov 2019 22:01 EST Joe Watson, James V. Zidek, Gavin Shaddick. Source: The Annals of Applied Statistics, Volume 13, Number 4, 2662--2700.Abstract: This paper presents a general model framework for detecting the preferential sampling of environmental monitors recording an environmental process across space and/or time. This is achieved by considering the joint distribution of an environmental process with a site-selection process that considers where and when sites are placed to measure the process. The environmental process may be spatial, temporal or spatio-temporal in nature. By sharing random effects between the two processes, the joint model is able to establish whether site placement was stochastically dependent on the environmental process under study. Furthermore, if stochastic dependence is identified between the two processes, then inferences about the probability distribution of the spatio-temporal process will change, as will predictions made of the process across space and time. The embedding into a spatio-temporal framework also allows for the modelling of the dynamic site-selection process itself. Real-world factors affecting both the size and location of the network can be easily modelled and quantified. Depending upon the choice of the population of locations considered for selection across space and time under the site-selection process, different insights about the precise nature of preferential sampling can be obtained. The general framework developed in the paper is designed to be easily and quickly fit using the R-INLA package. We apply this framework to a case study involving particulate air pollution over the UK where a major reduction in the size of a monitoring network through time occurred. It is demonstrated that a significant response-biased reduction in the air quality monitoring network occurred, namely the relocation of monitoring sites to locations with the highest pollution levels, and the routine removal of sites at locations with the lowest. We also show that the network was consistently unrepresentative of the levels of particulate matter seen across much of GB throughout the operating life of the network. Finally we show that this may have led to a severe overreporting of the population-average exposure levels experienced across GB. This could have great impacts on estimates of the health effects of black smoke levels. Full Article
en Hierarchical infinite factor models for improving the prediction of surgical complications for geriatric patients By projecteuclid.org Published On :: Wed, 27 Nov 2019 22:01 EST Elizabeth Lorenzi, Ricardo Henao, Katherine Heller. Source: The Annals of Applied Statistics, Volume 13, Number 4, 2637--2661.Abstract: Nearly a third of all surgeries performed in the United States occur for patients over the age of 65; these older adults experience a higher rate of postoperative morbidity and mortality. To improve the care for these patients, we aim to identify and characterize high risk geriatric patients to send to a specialized perioperative clinic while leveraging the overall surgical population to improve learning. To this end, we develop a hierarchical infinite latent factor model (HIFM) to appropriately account for the covariance structure across subpopulations in data. We propose a novel Hierarchical Dirichlet Process shrinkage prior on the loadings matrix that flexibly captures the underlying structure of our data while sharing information across subpopulations to improve inference and prediction. The stick-breaking construction of the prior assumes an infinite number of factors and allows for each subpopulation to utilize different subsets of the factor space and select the number of factors needed to best explain the variation. We develop the model into a latent factor regression method that excels at prediction and inference of regression coefficients. Simulations validate this strong performance compared to baseline methods. We apply this work to the problem of predicting surgical complications using electronic health record data for geriatric patients and all surgical patients at Duke University Health System (DUHS). The motivating application demonstrates the improved predictive performance when using HIFM in both area under the ROC curve and area under the PR Curve while providing interpretable coefficients that may lead to actionable interventions. Full Article
en Scalable high-resolution forecasting of sparse spatiotemporal events with kernel methods: A winning solution to the NIJ “Real-Time Crime Forecasting Challenge” By projecteuclid.org Published On :: Wed, 27 Nov 2019 22:01 EST Seth Flaxman, Michael Chirico, Pau Pereira, Charles Loeffler. Source: The Annals of Applied Statistics, Volume 13, Number 4, 2564--2585.Abstract: We propose a generic spatiotemporal event forecasting method which we developed for the National Institute of Justice’s (NIJ) Real-Time Crime Forecasting Challenge (National Institute of Justice (2017)). Our method is a spatiotemporal forecasting model combining scalable randomized Reproducing Kernel Hilbert Space (RKHS) methods for approximating Gaussian processes with autoregressive smoothing kernels in a regularized supervised learning framework. While the smoothing kernels capture the two main approaches in current use in the field of crime forecasting, kernel density estimation (KDE) and self-exciting point process (SEPP) models, the RKHS component of the model can be understood as an approximation to the popular log-Gaussian Cox Process model. For inference, we discretize the spatiotemporal point pattern and learn a log-intensity function using the Poisson likelihood and highly efficient gradient-based optimization methods. Model hyperparameters including quality of RKHS approximation, spatial and temporal kernel lengthscales, number of autoregressive lags and bandwidths for smoothing kernels as well as cell shape, size and rotation, were learned using cross validation. Resulting predictions significantly exceeded baseline KDE estimates and SEPP models for sparse events. Full Article
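The general recipe, random features standing in for the RKHS expansion plus a regularized Poisson fit of the log-intensity on a discretized grid, can be sketched as below. This is not the winning entry's code: the autoregressive smoothing kernels, the spatial grid construction and the cross-validated hyperparameters are all omitted, and the data are simulated placeholders.

```python
import numpy as np
from sklearn.linear_model import PoissonRegressor

def random_fourier_features(X, n_features, lengthscale, rng):
    """Random Fourier features approximating a Gaussian-kernel (RKHS) feature map."""
    W = rng.normal(scale=1.0 / lengthscale, size=(X.shape[1], n_features))
    b = rng.uniform(0, 2 * np.pi, size=n_features)
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)

rng = np.random.default_rng(6)
X = rng.uniform(size=(5000, 3))              # hypothetical (x, y, t) coordinates of grid cells
counts = rng.poisson(lam=1 + 2 * X[:, 0])    # hypothetical event counts per cell
Phi = random_fourier_features(X, n_features=200, lengthscale=0.2, rng=rng)
model = PoissonRegressor(alpha=1.0).fit(Phi, counts)   # regularized Poisson log-intensity fit
print(model.predict(Phi[:3]))
```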
en A simple, consistent estimator of SNP heritability from genome-wide association studies By projecteuclid.org Published On :: Wed, 27 Nov 2019 22:01 EST Armin Schwartzman, Andrew J. Schork, Rong Zablocki, Wesley K. Thompson. Source: The Annals of Applied Statistics, Volume 13, Number 4, 2509--2538.Abstract: Analysis of genome-wide association studies (GWAS) is characterized by a large number of univariate regressions where a quantitative trait is regressed on hundreds of thousands to millions of single-nucleotide polymorphism (SNP) allele counts, one at a time. This article proposes an estimator of the SNP heritability of the trait, defined here as the fraction of the variance of the trait explained by the SNPs in the study. The proposed GWAS heritability (GWASH) estimator is easy to compute, highly interpretable and is consistent as the number of SNPs and the sample size increase. More importantly, it can be computed from summary statistics typically reported in GWAS, not requiring access to the original data. The estimator takes full account of the linkage disequilibrium (LD) or correlation between the SNPs in the study through moments of the LD matrix, estimable from auxiliary datasets. Unlike other proposed estimators in the literature, we establish the theoretical properties of the GWASH estimator and obtain analytical estimates of the precision, allowing for power and sample size calculations for SNP heritability estimates and forming a firm foundation for future methodological development. Full Article
en Empirical Bayes analysis of RNA sequencing experiments with auxiliary information By projecteuclid.org Published On :: Wed, 27 Nov 2019 22:01 EST Kun Liang. Source: The Annals of Applied Statistics, Volume 13, Number 4, 2452--2482.Abstract: Finding differentially expressed genes is a common task in high-throughput transcriptome studies. While traditional statistical methods rank the genes by their test statistics alone, we analyze an RNA sequencing dataset using the auxiliary information of gene length and the test statistics from a related microarray study. Given the auxiliary information, we propose a novel nonparametric empirical Bayes procedure to estimate the posterior probability of differential expression for each gene. We demonstrate the advantage of our procedure in extensive simulation studies and a psoriasis RNA sequencing study. The companion R package calm is available at Bioconductor. Full Article
en Propensity score weighting for causal inference with multiple treatments By projecteuclid.org Published On :: Wed, 27 Nov 2019 22:01 EST Fan Li, Fan Li. Source: The Annals of Applied Statistics, Volume 13, Number 4, 2389--2415.Abstract: Causal or unconfounded descriptive comparisons between multiple groups are common in observational studies. Motivated from a racial disparity study in health services research, we propose a unified propensity score weighting framework, the balancing weights, for estimating causal effects with multiple treatments. These weights incorporate the generalized propensity scores to balance the weighted covariate distribution of each treatment group, all weighted toward a common prespecified target population. The class of balancing weights include several existing approaches such as the inverse probability weights and trimming weights as special cases. Within this framework, we propose a set of target estimands based on linear contrasts. We further develop the generalized overlap weights, constructed as the product of the inverse probability weights and the harmonic mean of the generalized propensity scores. The generalized overlap weighting scheme corresponds to the target population with the most overlap in covariates across the multiple treatments. These weights are bounded and thus bypass the problem of extreme propensities. We show that the generalized overlap weights minimize the total asymptotic variance of the moment weighting estimators for the pairwise contrasts within the class of balancing weights. We consider two balance check criteria and propose a new sandwich variance estimator for estimating the causal effects with generalized overlap weights. We apply these methods to study the racial disparities in medical expenditure between several racial groups using the 2009 Medical Expenditure Panel Survey (MEPS) data. Simulations were carried out to compare with existing methods. Full Article
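A small sketch of the harmonic-mean tilting described in the abstract, under the assumption that the weight for a unit in group $j$ is proportional to $h(x)/e_{j}(x)$ with $h(x)=[\sum_{k}1/e_{k}(x)]^{-1}$; consult the paper for the exact definition and normalization.

```python
import numpy as np

def generalized_overlap_weights(propensities, treatment):
    """Weights proportional to (1/e_j(x)) * h(x), where h(x) = 1 / sum_k 1/e_k(x).
    `propensities` is an n x J matrix of generalized propensity scores (rows sum
    to 1) and `treatment` holds each unit's group label in {0, ..., J-1}."""
    inv = 1.0 / propensities
    h = 1.0 / inv.sum(axis=1)                   # harmonic-mean-type tilting function
    return inv[np.arange(len(treatment)), treatment] * h

# Hypothetical generalized propensity scores for three treatment groups.
e = np.array([[0.6, 0.3, 0.1],
              [0.2, 0.5, 0.3],
              [0.1, 0.1, 0.8]])
z = np.array([0, 1, 2])
print(generalized_overlap_weights(e, z))   # bounded even when some e_j(x) are tiny
```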
en A nonparametric spatial test to identify factors that shape a microbiome By projecteuclid.org Published On :: Wed, 27 Nov 2019 22:01 EST Susheela P. Singh, Ana-Maria Staicu, Robert R. Dunn, Noah Fierer, Brian J. Reich. Source: The Annals of Applied Statistics, Volume 13, Number 4, 2341--2362.Abstract: The advent of high-throughput sequencing technologies has made data from DNA material readily available, leading to a surge of microbiome-related research establishing links between markers of microbiome health and specific outcomes. However, to harness the power of microbial communities we must understand not only how they affect us, but also how they can be influenced to improve outcomes. This area has been dominated by methods that reduce community composition to summary metrics, which can fail to fully exploit the complexity of community data. Recently, methods have been developed to model the abundance of taxa in a community, but they can be computationally intensive and do not account for spatial effects underlying microbial settlement. These spatial effects are particularly relevant in the microbiome setting because we expect communities that are close together to be more similar than those that are far apart. In this paper, we propose a flexible Bayesian spike-and-slab variable selection model for presence-absence indicators that accounts for spatial dependence and cross-dependence between taxa while reducing dimensionality in both directions. We show by simulation that in the presence of spatial dependence, popular distance-based hypothesis testing methods fail to preserve their advertised size, and the proposed method improves variable selection. Finally, we present an application of our method to an indoor fungal community found within homes across the contiguous United States. Full Article
en A latent discrete Markov random field approach to identifying and classifying historical forest communities based on spatial multivariate tree species counts By projecteuclid.org Published On :: Wed, 27 Nov 2019 22:01 EST Stephen Berg, Jun Zhu, Murray K. Clayton, Monika E. Shea, David J. Mladenoff. Source: The Annals of Applied Statistics, Volume 13, Number 4, 2312--2340.Abstract: The Wisconsin Public Land Survey database describes historical forest composition at high spatial resolution and is of interest in ecological studies of forest composition in Wisconsin just prior to significant Euro-American settlement. For such studies it is useful to identify recurring subpopulations of tree species known as communities, but standard clustering approaches for subpopulation identification do not account for dependence between spatially nearby observations. Here, we develop and fit a latent discrete Markov random field model for the purpose of identifying and classifying historical forest communities based on spatially referenced multivariate tree species counts across Wisconsin. We show empirically for the actual dataset and through simulation that our latent Markov random field modeling approach improves prediction and parameter estimation performance. For model fitting we introduce a new stochastic approximation algorithm which enables computationally efficient estimation and classification of large amounts of spatial multivariate count data. Full Article
en Objective Bayes model selection of Gaussian interventional essential graphs for the identification of signaling pathways By projecteuclid.org Published On :: Wed, 27 Nov 2019 22:01 EST Federico Castelletti, Guido Consonni. Source: The Annals of Applied Statistics, Volume 13, Number 4, 2289--2311.Abstract: A signalling pathway is a sequence of chemical reactions initiated by a stimulus which in turn affects a receptor, and then through some intermediate steps cascades down to the final cell response. Based on the technique of flow cytometry, samples of cell-by-cell measurements are collected under each experimental condition, resulting in a collection of interventional data (assuming no latent variables are involved). Usually several external interventions are applied at different points of the pathway, the ultimate aim being the structural recovery of the underlying signalling network which we model as a causal Directed Acyclic Graph (DAG) using intervention calculus. The advantage of using interventional data, rather than purely observational one, is that identifiability of the true data generating DAG is enhanced. More technically a Markov equivalence class of DAGs, whose members are statistically indistinguishable based on observational data alone, can be further decomposed, using additional interventional data, into smaller distinct Interventional Markov equivalence classes. We present a Bayesian methodology for structural learning of Interventional Markov equivalence classes based on observational and interventional samples of multivariate Gaussian observations. Our approach is objective, meaning that it is based on default parameter priors requiring no personal elicitation; some flexibility is however allowed through a tuning parameter which regulates sparsity in the prior on model space. Based on an analytical expression for the marginal likelihood of a given Interventional Essential Graph, and a suitable MCMC scheme, our analysis produces an approximate posterior distribution on the space of Interventional Markov equivalence classes, which can be used to provide uncertainty quantification for features of substantive scientific interest, such as the posterior probability of inclusion of selected edges, or paths. Full Article
en Fitting a deeply nested hierarchical model to a large book review dataset using a moment-based estimator By projecteuclid.org Published On :: Wed, 27 Nov 2019 22:01 EST Ningshan Zhang, Kyle Schmaus, Patrick O. Perry. Source: The Annals of Applied Statistics, Volume 13, Number 4, 2260--2288.Abstract: We consider a particular instance of a common problem in recommender systems, using a database of book reviews to inform user-targeted recommendations. In our dataset, books are categorized into genres and subgenres. To exploit this nested taxonomy, we use a hierarchical model that enables information pooling across similar items at many levels within the genre hierarchy. The main challenge in deploying this model is computational. The data sizes are large and fitting the model at scale using off-the-shelf maximum likelihood procedures is prohibitive. To get around this computational bottleneck, we extend a moment-based fitting procedure proposed for fitting single-level hierarchical models to the general case of arbitrarily deep hierarchies. This extension is an order of magnitude faster than standard maximum likelihood procedures. The fitting method can be deployed beyond recommender systems to general contexts with deeply nested hierarchical generalized linear mixed models. Full Article
en Spatial modeling of trends in crime over time in Philadelphia By projecteuclid.org Published On :: Wed, 27 Nov 2019 22:01 EST Cecilia Balocchi, Shane T. Jensen. Source: The Annals of Applied Statistics, Volume 13, Number 4, 2235--2259.Abstract: Understanding the relationship between change in crime over time and the geography of urban areas is an important problem for urban planning. Accurate estimation of changing crime rates throughout a city would aid law enforcement as well as enable studies of the association between crime and the built environment. Bayesian modeling is a promising direction since areal data require principled sharing of information to address spatial autocorrelation between proximal neighborhoods. We develop several Bayesian approaches to spatial sharing of information between neighborhoods while modeling trends in crime counts over time. We apply our methodology to estimate changes in crime throughout Philadelphia over the 2006-15 period while also incorporating spatially-varying economic and demographic predictors. We find that the local shrinkage imposed by a conditional autoregressive model has substantial benefits in terms of out-of-sample predictive accuracy of crime. We also explore the possibility of spatial discontinuities between neighborhoods that could represent natural barriers or aspects of the built environment. Full Article
en Microsimulation model calibration using incremental mixture approximate Bayesian computation By projecteuclid.org Published On :: Wed, 27 Nov 2019 22:01 EST Carolyn M. Rutter, Jonathan Ozik, Maria DeYoreo, Nicholson Collier. Source: The Annals of Applied Statistics, Volume 13, Number 4, 2189--2212.Abstract: Microsimulation models (MSMs) are used to inform policy by predicting population-level outcomes under different scenarios. MSMs simulate individual-level event histories that mark the disease process (such as the development of cancer) and the effect of policy actions (such as screening) on these events. MSMs often have many unknown parameters; calibration is the process of searching the parameter space to select parameters that result in accurate MSM prediction of a wide range of targets. We develop Incremental Mixture Approximate Bayesian Computation (IMABC) for MSM calibration which results in a simulated sample from the posterior distribution of model parameters given calibration targets. IMABC begins with a rejection-based ABC step, drawing a sample of points from the prior distribution of model parameters and accepting points that result in simulated targets that are near observed targets. Next, the sample is iteratively updated by drawing additional points from a mixture of multivariate normal distributions and accepting points that result in accurate predictions. Posterior estimates are obtained by weighting the final set of accepted points to account for the adaptive sampling scheme. We demonstrate IMABC by calibrating CRC-SPIN 2.0, an updated version of a MSM for colorectal cancer (CRC) that has been used to inform national CRC screening guidelines. Full Article
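The first stage of the procedure, plain rejection ABC against calibration targets, can be sketched as follows; the incremental mixture updates and the final reweighting that give IMABC its name are not shown, and the "microsimulation" here is just a toy Poisson model with assumed targets.

```python
import numpy as np

def rejection_abc(simulate, prior_sample, observed, tol, n_draws, rng):
    """Keep prior draws whose simulated targets fall within `tol` of the observed targets."""
    accepted = []
    for _ in range(n_draws):
        theta = prior_sample(rng)
        if np.all(np.abs(simulate(theta, rng) - observed) <= tol):
            accepted.append(theta)
    return np.array(accepted)

def simulate(theta, rng):
    """Toy stand-in for a microsimulation model: Poisson draws summarized by mean and sd."""
    x = rng.poisson(theta, size=200)
    return np.array([x.mean(), x.std()])

rng = np.random.default_rng(7)
observed = np.array([5.0, 2.2])                       # hypothetical calibration targets
draws = rejection_abc(simulate, lambda rng: rng.uniform(0.1, 20.0),
                      observed, tol=np.array([0.3, 0.3]), n_draws=5000, rng=rng)
print(draws.size, draws.mean() if draws.size else None)
```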
en Prediction of small area quantiles for the conservation effects assessment project using a mixed effects quantile regression model By projecteuclid.org Published On :: Wed, 27 Nov 2019 22:01 EST Emily Berg, Danhyang Lee. Source: The Annals of Applied Statistics, Volume 13, Number 4, 2158--2188.Abstract: Quantiles of the distributions of several measures of erosion are important parameters in the Conservation Effects Assessment Project, a survey intended to quantify soil and nutrient loss on crop fields. Because sample sizes for domains of interest are too small to support reliable direct estimators, model based methods are needed. Quantile regression is appealing for CEAP because finding a single family of parametric models that adequately describes the distributions of all variables is difficult and small area quantiles are parameters of interest. We construct empirical Bayes predictors and bootstrap mean squared error estimators based on the linearly interpolated generalized Pareto distribution (LIGPD). We apply the procedures to predict county-level quantiles for four types of erosion in Wisconsin and validate the procedures through simulation. Full Article
en Joint model of accelerated failure time and mechanistic nonlinear model for censored covariates, with application in HIV/AIDS By projecteuclid.org Published On :: Wed, 27 Nov 2019 22:01 EST Hongbin Zhang, Lang Wu. Source: The Annals of Applied Statistics, Volume 13, Number 4, 2140--2157.Abstract: For a time-to-event outcome with censored time-varying covariates, a joint Cox model with a linear mixed effects model is the standard modeling approach. In some applications such as AIDS studies, mechanistic nonlinear models are available for some covariate process such as viral load during anti-HIV treatments, derived from the underlying data-generation mechanisms and disease progression. Such a mechanistic nonlinear covariate model may provide better-predicted values when the covariates are left censored or mismeasured. When the focus is on the impact of the time-varying covariate process on the survival outcome, an accelerated failure time (AFT) model provides an excellent alternative to the Cox proportional hazard model since an AFT model is formulated to allow the influence of the outcome by the entire covariate process. In this article, we consider a nonlinear mixed effects model for the censored covariates in an AFT model, implemented using a Monte Carlo EM algorithm, under the framework of a joint model for simultaneous inference. We apply the joint model to an HIV/AIDS data to gain insights for assessing the association between viral load and immunological restoration during antiretroviral therapy. Simulation is conducted to compare model performance when the covariate model and the survival model are misspecified. Full Article
en Fire seasonality identification with multimodality tests By projecteuclid.org Published On :: Wed, 27 Nov 2019 22:01 EST Jose Ameijeiras-Alonso, Akli Benali, Rosa M. Crujeiras, Alberto Rodríguez-Casal, José M. C. Pereira. Source: The Annals of Applied Statistics, Volume 13, Number 4, 2120--2139.Abstract: Understanding the role of vegetation fires in the Earth system is an important environmental problem. Although fire occurrence is influenced by natural factors, human activity related to land use and management has altered the temporal patterns of fire in several regions of the world. Hence, for better insight into fire regimes, it is of special interest to analyze where human activity has altered fire seasonality. To this end, multimodality tests are a useful tool for determining the number of annual fire peaks. The periodicity of fires and their complex distributional features motivate the use of nonparametric circular statistics. The unsatisfactory performance of previous circular nonparametric proposals for testing multimodality justifies the introduction of a new approach, considering an adapted version of the excess mass statistic, jointly with a bootstrap calibration algorithm. A systematic application of the test to the Russia–Kazakhstan area is presented in order to determine how many fire peaks can be identified in this region. A False Discovery Rate correction, accounting for the spatial dependence of the data, is also required. Full Article
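For context, the classical (linear-data) excess mass statistic that the circular test adapts can be written as follows; this is the standard formulation shown only to fix ideas, not the paper's circular version:

$$E_{n,k}(\lambda) = \sup_{C_1,\dots,C_k} \sum_{j=1}^{k} \bigl\{\mathbb{P}_n(C_j) - \lambda\,|C_j|\bigr\}, \qquad \Delta_{n,k+1} = \max_{\lambda > 0} \bigl\{E_{n,k+1}(\lambda) - E_{n,k}(\lambda)\bigr\},$$

where the supremum runs over $k$ disjoint closed intervals (arcs in the circular case), $\mathbb{P}_n$ is the empirical measure and $|C_j|$ denotes length. Large values of $\Delta_{n,k+1}$ are evidence against the null hypothesis of at most $k$ modes, and the bootstrap calibration supplies the critical values.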
en Statistical inference for partially observed branching processes with application to cell lineage tracking of in vivo hematopoiesis By projecteuclid.org Published On :: Wed, 27 Nov 2019 22:01 EST Jason Xu, Samson Koelle, Peter Guttorp, Chuanfeng Wu, Cynthia Dunbar, Janis L. Abkowitz, Vladimir N. Minin. Source: The Annals of Applied Statistics, Volume 13, Number 4, 2091--2119.Abstract: Single-cell lineage tracking strategies enabled by recent experimental technologies have produced significant insights into cell fate decisions, but lack the quantitative framework necessary for rigorous statistical analysis of mechanistic models describing cell division and differentiation. In this paper, we develop such a framework with corresponding moment-based parameter estimation techniques for continuous-time, multi-type branching processes. Such processes provide a probabilistic model of how cells divide and differentiate, and we apply our method to study hematopoiesis, the mechanism of blood cell production. We derive closed-form expressions for higher moments in a general class of such models. These analytical results allow us to efficiently estimate parameters of much richer statistical models of hematopoiesis than those used in previous statistical studies. To our knowledge, the method provides the first rate inference procedure for fitting such models to time series data generated from cellular barcoding experiments. After validating the methodology in simulation studies, we apply our estimator to hematopoietic lineage tracking data from rhesus macaques. Our analysis provides a more complete understanding of cell fate decisions during hematopoiesis in nonhuman primates, which may be more relevant to human biology and clinical strategies than previous findings from murine studies. For example, in addition to the previously estimated hematopoietic stem cell self-renewal rate, we are able to estimate fate decision probabilities and to compare structurally distinct models of hematopoiesis using cross validation. These estimates of fate decision probabilities and our model selection results should help biologists compare competing hypotheses about how progenitor cells differentiate. The methodology is transferable to a large class of stochastic compartmental and multi-type branching models, commonly used in studies of cancer progression, epidemiology and many other fields. Full Article
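A hedged sketch of the first-moment identity that moment-based estimators for such processes typically build on (standard multi-type branching theory, not a result specific to this paper):

$$M(t) := \bigl(\mathbb{E}[X_j(t) \mid X(0) = e_i]\bigr)_{ij} = e^{t\Omega}, \qquad \Omega_{ij} = a_i\,(m_{ij} - \delta_{ij}),$$

where $a_i$ is the event rate of a type-$i$ cell, $m_{ij}$ is the expected number of type-$j$ offspring produced at such an event, and $\delta_{ij}$ is the Kronecker delta. Matching expressions of this form (and analogous higher-moment expressions) to empirical moments of barcode counts yields estimating equations for the division and differentiation rates.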
en Robust elastic net estimators for variable selection and identification of proteomic biomarkers By projecteuclid.org Published On :: Wed, 27 Nov 2019 22:01 EST Gabriela V. Cohen Freue, David Kepplinger, Matías Salibián-Barrera, Ezequiel Smucler. Source: The Annals of Applied Statistics, Volume 13, Number 4, 2065--2090.Abstract: In large-scale quantitative proteomic studies, scientists measure the abundance of thousands of proteins from the human proteome in search of novel biomarkers for a given disease. Penalized regression estimators can be used to identify potential biomarkers among a large set of molecular features measured. Yet, the performance and statistical properties of these estimators depend on the loss and penalty functions used to define them. Motivated by a real plasma proteomic biomarkers study, we propose a new class of penalized robust estimators based on the elastic net penalty, which can be tuned to keep groups of correlated variables together in the selected model and maintain robustness against possible outliers. We also propose an efficient algorithm to compute our robust penalized estimators and derive a data-driven method to select the penalty term. Our robust penalized estimators have very good robustness properties and are also consistent under certain regularity conditions. Numerical results show that our robust estimators compare favorably to other robust penalized estimators. Using our proposed methodology for the analysis of the proteomics data, we identify new potentially relevant biomarkers of cardiac allograft vasculopathy that are not found with nonrobust alternatives. The selected model is validated in a new set of 52 test samples and achieves an area under the receiver operating characteristic (AUC) of 0.85. Full Article
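Schematically, a robust elastic-net estimator of the kind this abstract describes minimizes a bounded loss applied to scaled residuals plus the elastic-net penalty; the display below is a generic form for orientation, not the exact estimator or tuning procedure proposed in the paper:

$$\hat{\boldsymbol\beta} = \arg\min_{\beta_0,\,\boldsymbol\beta} \; \sum_{i=1}^{n} \rho\!\left(\frac{y_i - \beta_0 - \mathbf{x}_i^{\top}\boldsymbol\beta}{\hat\sigma}\right) + \lambda\left(\alpha\,\lVert\boldsymbol\beta\rVert_1 + \frac{1-\alpha}{2}\,\lVert\boldsymbol\beta\rVert_2^2\right),$$

where $\rho$ is a bounded robust loss (for example, Tukey's bisquare), $\hat\sigma$ is a robust residual scale, $\lambda$ controls the overall amount of penalization, and $\alpha \in [0,1]$ trades off sparsity ($\ell_1$) against the grouping effect ($\ell_2$) that keeps correlated proteins together in the selected model.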
en Estimating the rate constant from biosensor data via an adaptive variational Bayesian approach By projecteuclid.org Published On :: Wed, 27 Nov 2019 22:01 EST Ye Zhang, Zhigang Yao, Patrik Forssén, Torgny Fornstedt. Source: The Annals of Applied Statistics, Volume 13, Number 4, 2011--2042.Abstract: How to obtain the rate constants of a chemical reaction is a fundamental open problem in both science and industry. Traditional techniques for finding rate constants require either chemical modifications of the reactants or indirect measurements. The rate constant map method is a modern technique to study binding equilibrium and kinetics in chemical reactions. Finding a rate constant map from biosensor data is an ill-posed inverse problem that is usually solved by regularization. In this work, rather than finding a deterministic regularized rate constant map that does not provide uncertainty quantification of the solution, we develop an adaptive variational Bayesian approach to estimate the distribution of the rate constant map, from which some intrinsic properties of a chemical reaction can be explored, including information about rate constants. Our new approach is more realistic than the existing approaches used for biosensors and allows us to estimate the dynamics of the interactions, which are usually hidden in a deterministic approximate solution. We verify the performance of the proposed method by numerical simulations and compare it with the Markov chain Monte Carlo algorithm. The results illustrate that the variational method can reliably capture the posterior distribution in a computationally efficient way. Finally, the developed method is also tested on real biosensor data (parathyroid hormone), where we provide two novel analysis tools, the thresholding contour map and the high-order moment map, to estimate the number of interactions as well as their rate constants. Full Article
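For orientation, the simplest biosensor interaction model (one-to-one Langmuir binding) relates the sensor response to a single pair of rate constants; the rate constant map generalizes this to a distribution over such pairs. The display is the standard textbook kinetic model, not the paper's integral-equation formulation:

$$\frac{dR(t)}{dt} = k_a\,C(t)\,\bigl(R_{\max} - R(t)\bigr) - k_d\,R(t),$$

where $R(t)$ is the sensor response, $C(t)$ the injected analyte concentration, $R_{\max}$ the saturation response, and $(k_a, k_d)$ the association and dissociation rate constants. Heterogeneous surfaces are then described by a map (distribution) over $(k_a, k_d)$, and recovering that map from noisy response curves is the ill-posed inverse problem addressed above.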
en A semiparametric modeling approach using Bayesian Additive Regression Trees with an application to evaluate heterogeneous treatment effects By projecteuclid.org Published On :: Wed, 16 Oct 2019 22:03 EDT Bret Zeldow, Vincent Lo Re III, Jason Roy. Source: The Annals of Applied Statistics, Volume 13, Number 3, 1989--2010.Abstract: Bayesian Additive Regression Trees (BART) is a flexible machine learning algorithm capable of capturing nonlinearities between an outcome and covariates and interactions among covariates. We extend BART to a semiparametric regression framework in which the conditional expectation of an outcome is a function of treatment, its effect modifiers, and confounders. The confounders are allowed to have unspecified functional form, while treatment and effect modifiers that are directly related to the research question are given a linear form. The result is a Bayesian semiparametric linear regression model where the posterior distribution of the parameters of the linear part can be interpreted as in parametric Bayesian regression. This is useful in situations where a subset of the variables is of substantive interest and the others are nuisance variables that we would like to control for. An example of this occurs in causal modeling with the structural mean model (SMM). Under certain causal assumptions, our method can be used as a Bayesian SMM. Our methods are demonstrated with simulation studies and an application to a dataset of adults with HIV/Hepatitis C coinfection who newly initiate antiretroviral therapy. The methods are available in an R package called semibart. Full Article
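One schematic way to write the semiparametric decomposition described above (a sketch of the model class, with illustrative symbols, rather than the authors' exact parameterization):

$$\mathbb{E}(Y_i \mid A_i, X_i, W_i) = \psi_1 A_i + \psi_2 A_i X_i + f(W_i),$$

where $A_i$ is the treatment, $X_i$ are effect modifiers, $W_i$ are confounders, $f$ receives a BART prior, and the posterior for $(\psi_1, \psi_2)$ is read off exactly as in a parametric Bayesian regression, which is what makes the treatment and effect-modification parameters directly interpretable.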
en Radio-iBAG: Radiomics-based integrative Bayesian analysis of multiplatform genomic data By projecteuclid.org Published On :: Wed, 16 Oct 2019 22:03 EDT Youyi Zhang, Jeffrey S. Morris, Shivali Narang Aerry, Arvind U. K. Rao, Veerabhadran Baladandayuthapani. Source: The Annals of Applied Statistics, Volume 13, Number 3, 1957--1988.Abstract: Technological innovations have produced large multi-modal datasets that include imaging and multi-platform genomics data. Integrative analyses of such data have the potential to reveal important biological and clinical insights into complex diseases like cancer. In this paper, we present Bayesian approaches for integrative analysis of radiological imaging and multi-platform genomic data, wherein our goals are to simultaneously identify genomic and radiomic (that is, radiology-based imaging) markers, along with the latent associations between these two modalities, and to detect the overall prognostic relevance of the combined markers. For this task, we propose Radio-iBAG: Radiomics-based Integrative Bayesian Analysis of Multiplatform Genomic Data, a multi-scale Bayesian hierarchical model that involves several innovative strategies: it incorporates integrative analysis of multi-platform genomic data sets to capture fundamental biological relationships; explores the associations of radiomic markers, accompanied by genomic information, with clinical outcomes; and detects genomic and radiomic markers associated with clinical prognosis. We also introduce the use of sparse Principal Component Analysis (sPCA) to extract a sparse set of approximately orthogonal meta-features, each containing information from a set of related individual radiomic features, reducing dimensionality and combining like features. Our methods are motivated by and applied to The Cancer Genome Atlas glioblastoma multiforme data set, wherein we integrate magnetic resonance imaging-based biomarkers along with genomic, epigenomic and transcriptomic data. Our model identifies important magnetic resonance imaging features and the associated genomic platforms that are related to patient survival times. Full Article
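The sparse-PCA step can be illustrated with off-the-shelf tools; the snippet below uses scikit-learn's SparsePCA on a simulated block of correlated radiomic features purely as a stand-in for the paper's pipeline (the data, dimensions, and tuning values are hypothetical).

import numpy as np
from sklearn.decomposition import SparsePCA

# Illustrative only: X_radiomic stands in for an (n_patients, n_features) block of
# correlated radiomic features extracted from imaging.
rng = np.random.default_rng(0)
X_radiomic = rng.normal(size=(100, 40))

spca = SparsePCA(n_components=5, alpha=1.0, random_state=0)
meta_features = spca.fit_transform(X_radiomic)   # sparse, approximately orthogonal scores
loadings = spca.components_                      # each row shows which raw features drive a meta-feature

Because each row of loadings is sparse, a meta-feature can be read off as a small set of related raw radiomic features, which is what keeps the downstream hierarchical regression interpretable.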
en Approximate inference for constructing astronomical catalogs from images By projecteuclid.org Published On :: Wed, 16 Oct 2019 22:03 EDT Jeffrey Regier, Andrew C. Miller, David Schlegel, Ryan P. Adams, Jon D. McAuliffe, Prabhat. Source: The Annals of Applied Statistics, Volume 13, Number 3, 1884--1926.Abstract: We present a new, fully generative model for constructing astronomical catalogs from optical telescope image sets. Each pixel intensity is treated as a random variable with parameters that depend on the latent properties of stars and galaxies. These latent properties are themselves modeled as random. We compare two procedures for posterior inference. One procedure is based on Markov chain Monte Carlo (MCMC) while the other is based on variational inference (VI). The MCMC procedure excels at quantifying uncertainty, while the VI procedure is 1000 times faster. On a supercomputer, the VI procedure efficiently uses 665,000 CPU cores to construct an astronomical catalog from 50 terabytes of images in 14.6 minutes, demonstrating the scaling characteristics necessary to construct catalogs for upcoming astronomical surveys. Full Article
en Incorporating conditional dependence in latent class models for probabilistic record linkage: Does it matter? By projecteuclid.org Published On :: Wed, 16 Oct 2019 22:03 EDT Huiping Xu, Xiaochun Li, Changyu Shen, Siu L. Hui, Shaun Grannis. Source: The Annals of Applied Statistics, Volume 13, Number 3, 1753--1790.Abstract: The conditional independence assumption of the Fellegi and Sunter (FS) model in probabilistic record linkage is often violated when matching real-world data. Ignoring conditional dependence has been shown to seriously bias parameter estimates. However, in record linkage, the ultimate goal is to inform the match status of record pairs, and therefore record linkage algorithms should be evaluated in terms of matching accuracy. In the literature, more flexible models have been proposed to relax the conditional independence assumption, but few studies have assessed whether such accommodations improve matching accuracy. In this paper, we show that incorporating conditional dependence appropriately yields matching accuracy comparable to or better than that of the FS model, using three real-world data linkage examples. Through a simulation study, we further investigate when conditional dependence models provide improved matching accuracy. Our study shows that the FS model is generally robust to the conditional independence assumption and provides matching accuracy comparable to the more complex conditional dependence models. However, when the match prevalence approaches 0% or 100% and conditional dependence exists in the dominating class, it is necessary to address conditional dependence, as the FS model produces suboptimal matching accuracy. The need to address conditional dependence becomes less important when highly discriminating fields are used. Our simulation study also shows that conditional dependence models with a misspecified dependence structure can produce less accurate record matching than the FS model, and therefore we caution against the blind use of conditional dependence models. Full Article
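To fix notation, the FS model under conditional independence scores a record pair by the likelihood ratio of its agreement pattern $\gamma = (\gamma_1,\dots,\gamma_K)$ across $K$ comparison fields; this is the classical formulation that the conditional dependence models relax:

$$R(\gamma) = \frac{P(\gamma \mid M)}{P(\gamma \mid U)} = \prod_{k=1}^{K} \frac{m_k^{\gamma_k}(1 - m_k)^{1-\gamma_k}}{u_k^{\gamma_k}(1 - u_k)^{1-\gamma_k}},$$

where $m_k = P(\gamma_k = 1 \mid M)$ and $u_k = P(\gamma_k = 1 \mid U)$ are the field-level agreement probabilities among true matches ($M$) and nonmatches ($U$), and pairs with $\log R(\gamma)$ above an upper threshold are declared matches. Conditional dependence models replace the product with a joint model for $\gamma$ within each latent class.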
en A hierarchical Bayesian model for single-cell clustering using RNA-sequencing data By projecteuclid.org Published On :: Wed, 16 Oct 2019 22:03 EDT Yiyi Liu, Joshua L. Warren, Hongyu Zhao. Source: The Annals of Applied Statistics, Volume 13, Number 3, 1733--1752.Abstract: Understanding the heterogeneity of cells is an important biological question. The development of single-cell RNA-sequencing (scRNA-seq) technology provides high resolution data for such inquiry. A key challenge in scRNA-seq analysis is the high variability of measured RNA expression levels and frequent dropouts (missing values) due to limited input RNA compared to bulk RNA-seq measurement. Existing clustering methods do not perform well for these noisy and zero-inflated scRNA-seq data. In this manuscript we propose a Bayesian hierarchical model, called BasClu, to appropriately characterize important features of scRNA-seq data in order to more accurately cluster cells. We demonstrate the effectiveness of our method with extensive simulation studies and applications to three real scRNA-seq datasets. Full Article
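A generic way to write down the dropout and overdispersion features this abstract mentions is a zero-inflated count model of the following form; it is offered only as a sketch of the modeling issues, not as BasClu's actual likelihood:

$$P(Y_{gc} = y) = \pi_{gc}\,\mathbf{1}\{y = 0\} + (1 - \pi_{gc})\,\mathrm{NB}(y;\, \mu_{g,z_c},\, \phi_g),$$

where $Y_{gc}$ is the observed count for gene $g$ in cell $c$, $\pi_{gc}$ is a dropout probability (often modeled as decreasing in the underlying expression level), $z_c$ is the latent cluster label of cell $c$, and the negative binomial mean $\mu_{g,z_c}$ carries the cluster structure that drives the clustering.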
en Sequential decision model for inference and prediction on nonuniform hypergraphs with application to knot matching from computational forestry By projecteuclid.org Published On :: Wed, 16 Oct 2019 22:03 EDT Seong-Hwan Jun, Samuel W. K. Wong, James V. Zidek, Alexandre Bouchard-Côté. Source: The Annals of Applied Statistics, Volume 13, Number 3, 1678--1707.Abstract: In this paper, we consider the knot-matching problem arising in computational forestry. The knot-matching problem is an important problem that needs to be solved to advance the state of the art in automatic strength prediction of lumber. We show that this problem can be formulated as a quadripartite matching problem and develop a sequential decision model that admits efficient parameter estimation, along with a sequential Monte Carlo sampler that can be used for rapid sampling of graph matchings. We demonstrate the effectiveness of our methods on 30 manually annotated boards and present findings from various simulation studies to provide further evidence supporting the efficacy of our methods. Full Article
en RCRnorm: An integrated system of random-coefficient hierarchical regression models for normalizing NanoString nCounter data By projecteuclid.org Published On :: Wed, 16 Oct 2019 22:03 EDT Gaoxiang Jia, Xinlei Wang, Qiwei Li, Wei Lu, Ximing Tang, Ignacio Wistuba, Yang Xie. Source: The Annals of Applied Statistics, Volume 13, Number 3, 1617--1647.Abstract: Formalin-fixed paraffin-embedded (FFPE) samples have great potential for biomarker discovery, retrospective studies and diagnosis or prognosis of diseases. Their application, however, is hindered by the unsatisfactory performance of traditional gene expression profiling techniques on damaged RNAs. The NanoString nCounter platform is well suited for profiling of FFPE samples and measures gene expression with high sensitivity, which may greatly facilitate realization of the scientific and clinical value of FFPE samples. However, methodological development for normalization, a critical step when analyzing this type of data, lags far behind. Existing methods designed for the platform use information from different types of internal controls separately and rely on an overly simplified assumption that expression of housekeeping genes is constant across samples for global scaling. Thus, these methods are not optimized for the nCounter system, not to mention that they were not developed for FFPE samples. We construct an integrated system of random-coefficient hierarchical regression models to capture the main patterns and characteristics observed from NanoString data of FFPE samples and develop a Bayesian approach to estimate parameters and normalize gene expression across samples. Our method, labeled RCRnorm, incorporates information from all aspects of the experimental design and simultaneously removes biases from various sources. It eliminates the unrealistic assumption on housekeeping genes and offers great interpretability. Furthermore, it is applicable to fresh-frozen or similar samples, which can generally be viewed as a reduced case of FFPE samples. Simulations and applications showed the superior performance of RCRnorm. Full Article
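To illustrate the flavor of a random-coefficient calibration (a hedged, generic sketch; the full RCRnorm system has additional layers for negative controls, housekeeping genes, and regular genes), the positive-control probes with known input concentrations $x_p$ can anchor sample-specific scales:

$$\log Y_{ip} = a_i + b_i\,\log x_p + \varepsilon_{ip}, \qquad (a_i, b_i) \sim \mathcal{N}\bigl((\mu_a, \mu_b), \Sigma\bigr),$$

where $Y_{ip}$ is the observed count of positive-control probe $p$ in sample $i$. The sample-specific intercepts and slopes $(a_i, b_i)$, shrunk toward common means, can then be used to place the remaining probes in each sample on a comparable scale, which is the general idea behind borrowing strength across samples rather than scaling each one independently.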