it

Consistent selection of the number of change-points via sample-splitting

Changliang Zou, Guanghui Wang, Runze Li.

Source: The Annals of Statistics, Volume 48, Number 1, 413--439.

Abstract:
In multiple change-point analysis, one of the major challenges is to estimate the number of change-points. Most existing approaches attempt to minimize a Schwarz information criterion which balances a term quantifying model fit with a penalization term accounting for model complexity that increases with the number of change-points and limits overfitting. However, different penalization terms are required to adapt to different contexts of multiple change-point problems and the optimal penalization magnitude usually varies from the model and error distribution. We propose a data-driven selection criterion that is applicable to most kinds of popular change-point detection methods, including binary segmentation and optimal partitioning algorithms. The key idea is to select the number of change-points that minimizes the squared prediction error, which measures the fit of a specified model for a new sample. We develop a cross-validation estimation scheme based on an order-preserved sample-splitting strategy, and establish its asymptotic selection consistency under some mild conditions. Effectiveness of the proposed selection criterion is demonstrated on a variety of numerical experiments and real-data examples.




it

The multi-armed bandit problem: An efficient nonparametric solution

Hock Peng Chan.

Source: The Annals of Statistics, Volume 48, Number 1, 346--373.

Abstract:
Lai and Robbins ( Adv. in Appl. Math. 6 (1985) 4–22) and Lai ( Ann. Statist. 15 (1987) 1091–1114) provided efficient parametric solutions to the multi-armed bandit problem, showing that arm allocation via upper confidence bounds (UCB) achieves minimum regret. These bounds are constructed from the Kullback–Leibler information of the reward distributions, estimated from specified parametric families. In recent years, there has been renewed interest in the multi-armed bandit problem due to new applications in machine learning algorithms and data analytics. Nonparametric arm allocation procedures like $epsilon $-greedy, Boltzmann exploration and BESA were studied, and modified versions of the UCB procedure were also analyzed under nonparametric settings. However, unlike UCB these nonparametric procedures are not efficient under general parametric settings. In this paper, we propose efficient nonparametric procedures.




it

Testing for principal component directions under weak identifiability

Davy Paindaveine, Julien Remy, Thomas Verdebout.

Source: The Annals of Statistics, Volume 48, Number 1, 324--345.

Abstract:
We consider the problem of testing, on the basis of a $p$-variate Gaussian random sample, the null hypothesis $mathcal{H}_{0}:oldsymbol{ heta}_{1}=oldsymbol{ heta}_{1}^{0}$ against the alternative $mathcal{H}_{1}:oldsymbol{ heta}_{1} eq oldsymbol{ heta}_{1}^{0}$, where $oldsymbol{ heta}_{1}$ is the “first” eigenvector of the underlying covariance matrix and $oldsymbol{ heta}_{1}^{0}$ is a fixed unit $p$-vector. In the classical setup where eigenvalues $lambda_{1}>lambda_{2}geq cdots geq lambda_{p}$ are fixed, the Anderson ( Ann. Math. Stat. 34 (1963) 122–148) likelihood ratio test (LRT) and the Hallin, Paindaveine and Verdebout ( Ann. Statist. 38 (2010) 3245–3299) Le Cam optimal test for this problem are asymptotically equivalent under the null hypothesis, hence also under sequences of contiguous alternatives. We show that this equivalence does not survive asymptotic scenarios where $lambda_{n1}/lambda_{n2}=1+O(r_{n})$ with $r_{n}=O(1/sqrt{n})$. For such scenarios, the Le Cam optimal test still asymptotically meets the nominal level constraint, whereas the LRT severely overrejects the null hypothesis. Consequently, the former test should be favored over the latter one whenever the two largest sample eigenvalues are close to each other. By relying on the Le Cam’s asymptotic theory of statistical experiments, we study the non-null and optimality properties of the Le Cam optimal test in the aforementioned asymptotic scenarios and show that the null robustness of this test is not obtained at the expense of power. Our asymptotic investigation is extensive in the sense that it allows $r_{n}$ to converge to zero at an arbitrary rate. While we restrict to single-spiked spectra of the form $lambda_{n1}>lambda_{n2}=cdots =lambda_{np}$ to make our results as striking as possible, we extend our results to the more general elliptical case. Finally, we present an illustrative real data example.




it

Sparse high-dimensional regression: Exact scalable algorithms and phase transitions

Dimitris Bertsimas, Bart Van Parys.

Source: The Annals of Statistics, Volume 48, Number 1, 300--323.

Abstract:
We present a novel binary convex reformulation of the sparse regression problem that constitutes a new duality perspective. We devise a new cutting plane method and provide evidence that it can solve to provable optimality the sparse regression problem for sample sizes $n$ and number of regressors $p$ in the 100,000s, that is, two orders of magnitude better than the current state of the art, in seconds. The ability to solve the problem for very high dimensions allows us to observe new phase transition phenomena. Contrary to traditional complexity theory which suggests that the difficulty of a problem increases with problem size, the sparse regression problem has the property that as the number of samples $n$ increases the problem becomes easier in that the solution recovers 100% of the true signal, and our approach solves the problem extremely fast (in fact faster than Lasso), while for small number of samples $n$, our approach takes a larger amount of time to solve the problem, but importantly the optimal solution provides a statistically more relevant regressor. We argue that our exact sparse regression approach presents a superior alternative over heuristic methods available at present.




it

Bootstrap confidence regions based on M-estimators under nonstandard conditions

Stephen M. S. Lee, Puyudi Yang.

Source: The Annals of Statistics, Volume 48, Number 1, 274--299.

Abstract:
Suppose that a confidence region is desired for a subvector $ heta $ of a multidimensional parameter $xi =( heta ,psi )$, based on an M-estimator $hat{xi }_{n}=(hat{ heta }_{n},hat{psi }_{n})$ calculated from a random sample of size $n$. Under nonstandard conditions $hat{xi }_{n}$ often converges at a nonregular rate $r_{n}$, in which case consistent estimation of the distribution of $r_{n}(hat{ heta }_{n}- heta )$, a pivot commonly chosen for confidence region construction, is most conveniently effected by the $m$ out of $n$ bootstrap. The above choice of pivot has three drawbacks: (i) the shape of the region is either subjectively prescribed or controlled by a computationally intensive depth function; (ii) the region is not transformation equivariant; (iii) $hat{xi }_{n}$ may not be uniquely defined. To resolve the above difficulties, we propose a one-dimensional pivot derived from the criterion function, and prove that its distribution can be consistently estimated by the $m$ out of $n$ bootstrap, or by a modified version of the perturbation bootstrap. This leads to a new method for constructing confidence regions which are transformation equivariant and have shapes driven solely by the criterion function. A subsampling procedure is proposed for selecting $m$ in practice. Empirical performance of the new method is illustrated with examples drawn from different nonstandard M-estimation settings. Extension of our theory to row-wise independent triangular arrays is also explored.




it

Spectral and matrix factorization methods for consistent community detection in multi-layer networks

Subhadeep Paul, Yuguo Chen.

Source: The Annals of Statistics, Volume 48, Number 1, 230--250.

Abstract:
We consider the problem of estimating a consensus community structure by combining information from multiple layers of a multi-layer network using methods based on the spectral clustering or a low-rank matrix factorization. As a general theme, these “intermediate fusion” methods involve obtaining a low column rank matrix by optimizing an objective function and then using the columns of the matrix for clustering. However, the theoretical properties of these methods remain largely unexplored. In the absence of statistical guarantees on the objective functions, it is difficult to determine if the algorithms optimizing the objectives will return good community structures. We investigate the consistency properties of the global optimizer of some of these objective functions under the multi-layer stochastic blockmodel. For this purpose, we derive several new asymptotic results showing consistency of the intermediate fusion techniques along with the spectral clustering of mean adjacency matrix under a high dimensional setup, where the number of nodes, the number of layers and the number of communities of the multi-layer graph grow. Our numerical study shows that the intermediate fusion techniques outperform late fusion methods, namely spectral clustering on aggregate spectral kernel and module allegiance matrix in sparse networks, while they outperform the spectral clustering of mean adjacency matrix in multi-layer networks that contain layers with both homophilic and heterophilic communities.




it

Optimal rates for community estimation in the weighted stochastic block model

Min Xu, Varun Jog, Po-Ling Loh.

Source: The Annals of Statistics, Volume 48, Number 1, 183--204.

Abstract:
Community identification in a network is an important problem in fields such as social science, neuroscience and genetics. Over the past decade, stochastic block models (SBMs) have emerged as a popular statistical framework for this problem. However, SBMs have an important limitation in that they are suited only for networks with unweighted edges; in various scientific applications, disregarding the edge weights may result in a loss of valuable information. We study a weighted generalization of the SBM, in which observations are collected in the form of a weighted adjacency matrix and the weight of each edge is generated independently from an unknown probability density determined by the community membership of its endpoints. We characterize the optimal rate of misclustering error of the weighted SBM in terms of the Renyi divergence of order 1/2 between the weight distributions of within-community and between-community edges, substantially generalizing existing results for unweighted SBMs. Furthermore, we present a computationally tractable algorithm based on discretization that achieves the optimal error rate. Our method is adaptive in the sense that the algorithm, without assuming knowledge of the weight densities, performs as well as the best algorithm that knows the weight densities.




it

Model assisted variable clustering: Minimax-optimal recovery and algorithms

Florentina Bunea, Christophe Giraud, Xi Luo, Martin Royer, Nicolas Verzelen.

Source: The Annals of Statistics, Volume 48, Number 1, 111--137.

Abstract:
The problem of variable clustering is that of estimating groups of similar components of a $p$-dimensional vector $X=(X_{1},ldots ,X_{p})$ from $n$ independent copies of $X$. There exists a large number of algorithms that return data-dependent groups of variables, but their interpretation is limited to the algorithm that produced them. An alternative is model-based clustering, in which one begins by defining population level clusters relative to a model that embeds notions of similarity. Algorithms tailored to such models yield estimated clusters with a clear statistical interpretation. We take this view here and introduce the class of $G$-block covariance models as a background model for variable clustering. In such models, two variables in a cluster are deemed similar if they have similar associations will all other variables. This can arise, for instance, when groups of variables are noise corrupted versions of the same latent factor. We quantify the difficulty of clustering data generated from a $G$-block covariance model in terms of cluster proximity, measured with respect to two related, but different, cluster separation metrics. We derive minimax cluster separation thresholds, which are the metric values below which no algorithm can recover the model-defined clusters exactly, and show that they are different for the two metrics. We therefore develop two algorithms, COD and PECOK, tailored to $G$-block covariance models, and study their minimax-optimality with respect to each metric. Of independent interest is the fact that the analysis of the PECOK algorithm, which is based on a corrected convex relaxation of the popular $K$-means algorithm, provides the first statistical analysis of such algorithms for variable clustering. Additionally, we compare our methods with another popular clustering method, spectral clustering. Extensive simulation studies, as well as our data analyses, confirm the applicability of our approach.




it

The phase transition for the existence of the maximum likelihood estimate in high-dimensional logistic regression

Emmanuel J. Candès, Pragya Sur.

Source: The Annals of Statistics, Volume 48, Number 1, 27--42.

Abstract:
This paper rigorously establishes that the existence of the maximum likelihood estimate (MLE) in high-dimensional logistic regression models with Gaussian covariates undergoes a sharp “phase transition.” We introduce an explicit boundary curve $h_{mathrm{MLE}}$, parameterized by two scalars measuring the overall magnitude of the unknown sequence of regression coefficients, with the following property: in the limit of large sample sizes $n$ and number of features $p$ proportioned in such a way that $p/n ightarrow kappa $, we show that if the problem is sufficiently high dimensional in the sense that $kappa >h_{mathrm{MLE}}$, then the MLE does not exist with probability one. Conversely, if $kappa <h_{mathrm{MLE}}$, the MLE asymptotically exists with probability one.




it

Tracy–Widom limit for Kendall’s tau

Zhigang Bao.

Source: The Annals of Statistics, Volume 47, Number 6, 3504--3532.

Abstract:
In this paper, we study a high-dimensional random matrix model from nonparametric statistics called the Kendall rank correlation matrix, which is a natural multivariate extension of the Kendall rank correlation coefficient. We establish the Tracy–Widom law for its largest eigenvalue. It is the first Tracy–Widom law for a nonparametric random matrix model, and also the first Tracy–Widom law for a high-dimensional U-statistic.




it

Joint convergence of sample autocovariance matrices when &#36;p/n o 0&#36; with application

Monika Bhattacharjee, Arup Bose.

Source: The Annals of Statistics, Volume 47, Number 6, 3470--3503.

Abstract:
Consider a high-dimensional linear time series model where the dimension $p$ and the sample size $n$ grow in such a way that $p/n o 0$. Let $hat{Gamma }_{u}$ be the $u$th order sample autocovariance matrix. We first show that the LSD of any symmetric polynomial in ${hat{Gamma }_{u},hat{Gamma }_{u}^{*},ugeq 0}$ exists under independence and moment assumptions on the driving sequence together with weak assumptions on the coefficient matrices. This LSD result, with some additional effort, implies the asymptotic normality of the trace of any polynomial in ${hat{Gamma }_{u},hat{Gamma }_{u}^{*},ugeq 0}$. We also study similar results for several independent MA processes. We show applications of the above results to statistical inference problems such as in estimation of the unknown order of a high-dimensional MA process and in graphical and significance tests for hypotheses on coefficient matrices of one or several such independent processes.




it

Bootstrapping and sample splitting for high-dimensional, assumption-lean inference

Alessandro Rinaldo, Larry Wasserman, Max G’Sell.

Source: The Annals of Statistics, Volume 47, Number 6, 3438--3469.

Abstract:
Several new methods have been recently proposed for performing valid inference after model selection. An older method is sample splitting: use part of the data for model selection and the rest for inference. In this paper, we revisit sample splitting combined with the bootstrap (or the Normal approximation). We show that this leads to a simple, assumption-lean approach to inference and we establish results on the accuracy of the method. In fact, we find new bounds on the accuracy of the bootstrap and the Normal approximation for general nonlinear parameters with increasing dimension which we then use to assess the accuracy of regression inference. We define new parameters that measure variable importance and that can be inferred with greater accuracy than the usual regression coefficients. Finally, we elucidate an inference-prediction trade-off: splitting increases the accuracy and robustness of inference but can decrease the accuracy of the predictions.




it

On testing for high-dimensional white noise

Zeng Li, Clifford Lam, Jianfeng Yao, Qiwei Yao.

Source: The Annals of Statistics, Volume 47, Number 6, 3382--3412.

Abstract:
Testing for white noise is a classical yet important problem in statistics, especially for diagnostic checks in time series modeling and linear regression. For high-dimensional time series in the sense that the dimension $p$ is large in relation to the sample size $T$, the popular omnibus tests including the multivariate Hosking and Li–McLeod tests are extremely conservative, leading to substantial power loss. To develop more relevant tests for high-dimensional cases, we propose a portmanteau-type test statistic which is the sum of squared singular values of the first $q$ lagged sample autocovariance matrices. It, therefore, encapsulates all the serial correlations (up to the time lag $q$) within and across all component series. Using the tools from random matrix theory and assuming both $p$ and $T$ diverge to infinity, we derive the asymptotic normality of the test statistic under both the null and a specific VMA(1) alternative hypothesis. As the actual implementation of the test requires the knowledge of three characteristic constants of the population cross-sectional covariance matrix and the value of the fourth moment of the standardized innovations, nontrivial estimations are proposed for these parameters and their integration leads to a practically usable test. Extensive simulation confirms the excellent finite-sample performance of the new test with accurate size and satisfactory power for a large range of finite $(p,T)$ combinations, therefore, ensuring wide applicability in practice. In particular, the new tests are consistently superior to the traditional Hosking and Li–McLeod tests.




it

A smeary central limit theorem for manifolds with application to high-dimensional spheres

Benjamin Eltzner, Stephan F. Huckemann.

Source: The Annals of Statistics, Volume 47, Number 6, 3360--3381.

Abstract:
The (CLT) central limit theorems for generalized Fréchet means (data descriptors assuming values in manifolds, such as intrinsic means, geodesics, etc.) on manifolds from the literature are only valid if a certain empirical process of Hessians of the Fréchet function converges suitably, as in the proof of the prototypical BP-CLT [ Ann. Statist. 33 (2005) 1225–1259]. This is not valid in many realistic scenarios and we provide for a new very general CLT. In particular, this includes scenarios where, in a suitable chart, the sample mean fluctuates asymptotically at a scale $n^{alpha }$ with exponents $alpha <1/2$ with a nonnormal distribution. As the BP-CLT yields only fluctuations that are, rescaled with $n^{1/2}$, asymptotically normal, just as the classical CLT for random vectors, these lower rates, somewhat loosely called smeariness, had to date been observed only on the circle. We make the concept of smeariness on manifolds precise, give an example for two-smeariness on spheres of arbitrary dimension, and show that smeariness, although “almost never” occurring, may have serious statistical implications on a continuum of sample scenarios nearby. In fact, this effect increases with dimension, striking in particular in high dimension low sample size scenarios.




it

Statistical inference for autoregressive models under heteroscedasticity of unknown form

Ke Zhu.

Source: The Annals of Statistics, Volume 47, Number 6, 3185--3215.

Abstract:
This paper provides an entire inference procedure for the autoregressive model under (conditional) heteroscedasticity of unknown form with a finite variance. We first establish the asymptotic normality of the weighted least absolute deviations estimator (LADE) for the model. Second, we develop the random weighting (RW) method to estimate its asymptotic covariance matrix, leading to the implementation of the Wald test. Third, we construct a portmanteau test for model checking, and use the RW method to obtain its critical values. As a special weighted LADE, the feasible adaptive LADE (ALADE) is proposed and proved to have the same efficiency as its infeasible counterpart. The importance of our entire methodology based on the feasible ALADE is illustrated by simulation results and the real data analysis on three U.S. economic data sets.




it

Additive models with trend filtering

Veeranjaneyulu Sadhanala, Ryan J. Tibshirani.

Source: The Annals of Statistics, Volume 47, Number 6, 3032--3068.

Abstract:
We study additive models built with trend filtering, that is, additive models whose components are each regularized by the (discrete) total variation of their $k$th (discrete) derivative, for a chosen integer $kgeq0$. This results in $k$th degree piecewise polynomial components, (e.g., $k=0$ gives piecewise constant components, $k=1$ gives piecewise linear, $k=2$ gives piecewise quadratic, etc.). Analogous to its advantages in the univariate case, additive trend filtering has favorable theoretical and computational properties, thanks in large part to the localized nature of the (discrete) total variation regularizer that it uses. On the theory side, we derive fast error rates for additive trend filtering estimates, and show these rates are minimax optimal when the underlying function is additive and has component functions whose derivatives are of bounded variation. We also show that these rates are unattainable by additive smoothing splines (and by additive models built from linear smoothers, in general). On the computational side, we use backfitting, to leverage fast univariate trend filtering solvers; we also describe a new backfitting algorithm whose iterations can be run in parallel, which (as far as we can tell) is the first of its kind. Lastly, we present a number of experiments to examine the empirical performance of trend filtering.




it

Inference for the mode of a log-concave density

Charles R. Doss, Jon A. Wellner.

Source: The Annals of Statistics, Volume 47, Number 5, 2950--2976.

Abstract:
We study a likelihood ratio test for the location of the mode of a log-concave density. Our test is based on comparison of the log-likelihoods corresponding to the unconstrained maximum likelihood estimator of a log-concave density and the constrained maximum likelihood estimator where the constraint is that the mode of the density is fixed, say at $m$. The constrained estimation problem is studied in detail in Doss and Wellner (2018). Here, the results of that paper are used to show that, under the null hypothesis (and strict curvature of $-log f$ at the mode), the likelihood ratio statistic is asymptotically pivotal: that is, it converges in distribution to a limiting distribution which is free of nuisance parameters, thus playing the role of the $chi_{1}^{2}$ distribution in classical parametric statistical problems. By inverting this family of tests, we obtain new (likelihood ratio based) confidence intervals for the mode of a log-concave density $f$. These new intervals do not depend on any smoothing parameters. We study the new confidence intervals via Monte Carlo methods and illustrate them with two real data sets. The new intervals seem to have several advantages over existing procedures. Software implementing the test and confidence intervals is available in the R package verb+logcondens.mode+.




it

A unified treatment of multiple testing with prior knowledge using the p-filter

Aaditya K. Ramdas, Rina F. Barber, Martin J. Wainwright, Michael I. Jordan.

Source: The Annals of Statistics, Volume 47, Number 5, 2790--2821.

Abstract:
There is a significant literature on methods for incorporating knowledge into multiple testing procedures so as to improve their power and precision. Some common forms of prior knowledge include (a) beliefs about which hypotheses are null, modeled by nonuniform prior weights; (b) differing importances of hypotheses, modeled by differing penalties for false discoveries; (c) multiple arbitrary partitions of the hypotheses into (possibly overlapping) groups and (d) knowledge of independence, positive or arbitrary dependence between hypotheses or groups, suggesting the use of more aggressive or conservative procedures. We present a unified algorithmic framework called p-filter for global null testing and false discovery rate (FDR) control that allows the scientist to incorporate all four types of prior knowledge (a)–(d) simultaneously, recovering a variety of known algorithms as special cases.




it

Phase transition in the spiked random tensor with Rademacher prior

Wei-Kuo Chen.

Source: The Annals of Statistics, Volume 47, Number 5, 2734--2756.

Abstract:
We consider the problem of detecting a deformation from a symmetric Gaussian random $p$-tensor $(pgeq3)$ with a rank-one spike sampled from the Rademacher prior. Recently, in Lesieur et al. (Barbier, Krzakala, Macris, Miolane and Zdeborová (2017)), it was proved that there exists a critical threshold $eta_{p}$ so that when the signal-to-noise ratio exceeds $eta_{p}$, one can distinguish the spiked and unspiked tensors and weakly recover the prior via the minimal mean-square-error method. On the other side, Perry, Wein and Bandeira (Perry, Wein and Bandeira (2017)) proved that there exists a $eta_{p}'<eta_{p}$ such that any statistical hypothesis test cannot distinguish these two tensors, in the sense that their total variation distance asymptotically vanishes, when the signa-to-noise ratio is less than $eta_{p}'$. In this work, we show that $eta_{p}$ is indeed the critical threshold that strictly separates the distinguishability and indistinguishability between the two tensors under the total variation distance. Our approach is based on a subtle analysis of the high temperature behavior of the pure $p$-spin model with Ising spin, arising initially from the field of spin glasses. In particular, we identify the signal-to-noise criticality $eta_{p}$ as the critical temperature, distinguishing the high and low temperature behavior, of the Ising pure $p$-spin mean-field spin glass model.




it

Semiparametrically point-optimal hybrid rank tests for unit roots

Bo Zhou, Ramon van den Akker, Bas J. M. Werker.

Source: The Annals of Statistics, Volume 47, Number 5, 2601--2638.

Abstract:
We propose a new class of unit root tests that exploits invariance properties in the Locally Asymptotically Brownian Functional limit experiment associated to the unit root model. The invariance structures naturally suggest tests that are based on the ranks of the increments of the observations, their average and an assumed reference density for the innovations. The tests are semiparametric in the sense that they are valid, that is, have the correct (asymptotic) size, irrespective of the true innovation density. For a correctly specified reference density, our test is point-optimal and nearly efficient. For arbitrary reference densities, we establish a Chernoff–Savage-type result, that is, our test performs as well as commonly used tests under Gaussian innovations but has improved power under other, for example, fat-tailed or skewed, innovation distributions. To avoid nonparametric estimation, we propose a simplified version of our test that exhibits the same asymptotic properties, except for the Chernoff–Savage result that we are only able to demonstrate by means of simulations.




it

Doubly penalized estimation in additive regression with high-dimensional data

Zhiqiang Tan, Cun-Hui Zhang.

Source: The Annals of Statistics, Volume 47, Number 5, 2567--2600.

Abstract:
Additive regression provides an extension of linear regression by modeling the signal of a response as a sum of functions of covariates of relatively low complexity. We study penalized estimation in high-dimensional nonparametric additive regression where functional semi-norms are used to induce smoothness of component functions and the empirical $L_{2}$ norm is used to induce sparsity. The functional semi-norms can be of Sobolev or bounded variation types and are allowed to be different amongst individual component functions. We establish oracle inequalities for the predictive performance of such methods under three simple technical conditions: a sub-Gaussian condition on the noise, a compatibility condition on the design and the functional classes under consideration and an entropy condition on the functional classes. For random designs, the sample compatibility condition can be replaced by its population version under an additional condition to ensure suitable convergence of empirical norms. In homogeneous settings where the complexities of the component functions are of the same order, our results provide a spectrum of minimax convergence rates, from the so-called slow rate without requiring the compatibility condition to the fast rate under the hard sparsity or certain $L_{q}$ sparsity to allow many small components in the true regression function. These results significantly broaden and sharpen existing ones in the literature.




it

The two-to-infinity norm and singular subspace geometry with applications to high-dimensional statistics

Joshua Cape, Minh Tang, Carey E. Priebe.

Source: The Annals of Statistics, Volume 47, Number 5, 2405--2439.

Abstract:
The singular value matrix decomposition plays a ubiquitous role throughout statistics and related fields. Myriad applications including clustering, classification, and dimensionality reduction involve studying and exploiting the geometric structure of singular values and singular vectors. This paper provides a novel collection of technical and theoretical tools for studying the geometry of singular subspaces using the two-to-infinity norm. Motivated by preliminary deterministic Procrustes analysis, we consider a general matrix perturbation setting in which we derive a new Procrustean matrix decomposition. Together with flexible machinery developed for the two-to-infinity norm, this allows us to conduct a refined analysis of the induced perturbation geometry with respect to the underlying singular vectors even in the presence of singular value multiplicity. Our analysis yields singular vector entrywise perturbation bounds for a range of popular matrix noise models, each of which has a meaningful associated statistical inference task. In addition, we demonstrate how the two-to-infinity norm is the preferred norm in certain statistical settings. Specific applications discussed in this paper include covariance estimation, singular subspace recovery, and multiple graph inference. Both our Procrustean matrix decomposition and the technical machinery developed for the two-to-infinity norm may be of independent interest.




it

On testing conditional qualitative treatment effects

Chengchun Shi, Rui Song, Wenbin Lu.

Source: The Annals of Statistics, Volume 47, Number 4, 2348--2377.

Abstract:
Precision medicine is an emerging medical paradigm that focuses on finding the most effective treatment strategy tailored for individual patients. In the literature, most of the existing works focused on estimating the optimal treatment regime. However, there has been less attention devoted to hypothesis testing regarding the optimal treatment regime. In this paper, we first introduce the notion of conditional qualitative treatment effects (CQTE) of a set of variables given another set of variables and provide a class of equivalent representations for the null hypothesis of no CQTE. The proposed definition of CQTE does not assume any parametric form for the optimal treatment rule and plays an important role for assessing the incremental value of a set of new variables in optimal treatment decision making conditional on an existing set of prescriptive variables. We then propose novel testing procedures for no CQTE based on kernel estimation of the conditional contrast functions. We show that our test statistics have asymptotically correct size and nonnegligible power against some nonstandard local alternatives. The empirical performance of the proposed tests are evaluated by simulations and an application to an AIDS data set.




it

Convergence complexity analysis of Albert and Chib’s algorithm for Bayesian probit regression

Qian Qin, James P. Hobert.

Source: The Annals of Statistics, Volume 47, Number 4, 2320--2347.

Abstract:
The use of MCMC algorithms in high dimensional Bayesian problems has become routine. This has spurred so-called convergence complexity analysis, the goal of which is to ascertain how the convergence rate of a Monte Carlo Markov chain scales with sample size, $n$, and/or number of covariates, $p$. This article provides a thorough convergence complexity analysis of Albert and Chib’s [ J. Amer. Statist. Assoc. 88 (1993) 669–679] data augmentation algorithm for the Bayesian probit regression model. The main tools used in this analysis are drift and minorization conditions. The usual pitfalls associated with this type of analysis are avoided by utilizing centered drift functions, which are minimized in high posterior probability regions, and by using a new technique to suppress high-dimensionality in the construction of minorization conditions. The main result is that the geometric convergence rate of the underlying Markov chain is bounded below 1 both as $n ightarrowinfty$ (with $p$ fixed), and as $p ightarrowinfty$ (with $n$ fixed). Furthermore, the first computable bounds on the total variation distance to stationarity are byproducts of the asymptotic analysis.




it

Convergence rates of least squares regression estimators with heavy-tailed errors

Qiyang Han, Jon A. Wellner.

Source: The Annals of Statistics, Volume 47, Number 4, 2286--2319.

Abstract:
We study the performance of the least squares estimator (LSE) in a general nonparametric regression model, when the errors are independent of the covariates but may only have a $p$th moment ($pgeq1$). In such a heavy-tailed regression setting, we show that if the model satisfies a standard “entropy condition” with exponent $alphain(0,2)$, then the $L_{2}$ loss of the LSE converges at a rate [mathcal{O}_{mathbf{P}}igl(n^{-frac{1}{2+alpha}}vee n^{-frac{1}{2}+frac{1}{2p}}igr).] Such a rate cannot be improved under the entropy condition alone. This rate quantifies both some positive and negative aspects of the LSE in a heavy-tailed regression setting. On the positive side, as long as the errors have $pgeq1+2/alpha$ moments, the $L_{2}$ loss of the LSE converges at the same rate as if the errors are Gaussian. On the negative side, if $p<1+2/alpha$, there are (many) hard models at any entropy level $alpha$ for which the $L_{2}$ loss of the LSE converges at a strictly slower rate than other robust estimators. The validity of the above rate relies crucially on the independence of the covariates and the errors. In fact, the $L_{2}$ loss of the LSE can converge arbitrarily slowly when the independence fails. The key technical ingredient is a new multiplier inequality that gives sharp bounds for the “multiplier empirical process” associated with the LSE. We further give an application to the sparse linear regression model with heavy-tailed covariates and errors to demonstrate the scope of this new inequality.




it

On deep learning as a remedy for the curse of dimensionality in nonparametric regression

Benedikt Bauer, Michael Kohler.

Source: The Annals of Statistics, Volume 47, Number 4, 2261--2285.

Abstract:
Assuming that a smoothness condition and a suitable restriction on the structure of the regression function hold, it is shown that least squares estimates based on multilayer feedforward neural networks are able to circumvent the curse of dimensionality in nonparametric regression. The proof is based on new approximation results concerning multilayer feedforward neural networks with bounded weights and a bounded number of hidden neurons. The estimates are compared with various other approaches by using simulated data.




it

Bayes and empirical-Bayes multiplicity adjustment in the variable-selection problem

James G. Scott, James O. Berger

Source: Ann. Statist., Volume 38, Number 5, 2587--2619.

Abstract:
This paper studies the multiplicity-correction effect of standard Bayesian variable-selection priors in linear regression. Our first goal is to clarify when, and how, multiplicity correction happens automatically in Bayesian analysis, and to distinguish this correction from the Bayesian Ockham’s-razor effect. Our second goal is to contrast empirical-Bayes and fully Bayesian approaches to variable selection through examples, theoretical results and simulations. Considerable differences between the two approaches are found. In particular, we prove a theorem that characterizes a surprising aymptotic discrepancy between fully Bayes and empirical Bayes. This discrepancy arises from a different source than the failure to account for hyperparameter uncertainty in the empirical-Bayes estimate. Indeed, even at the extreme, when the empirical-Bayes estimate converges asymptotically to the true variable-inclusion probability, the potential for a serious difference remains.




it

interoperability

Ability to work with each other. In the loosely coupled environment of a service-oriented architecture, separate resources don't need to know the details of how they each work, but they need to have enough common ground to reliably exchange messages without error or misunderstanding. Standardized specifications go a long way towards creating this common ground, but differences in implementation may still lead to breakdowns in communication. Interoperability is when services can interact with each other without encountering such problems.




it

granularity

How small the pieces are. When a system is split into components, it's important to get the right degree of componentization. Small, fine-grained components give much greater flexibility in assembling precisely the right combination of functionality, but they are more difficult to co-ordinate. Much larger, coarse-grained components are easier to manage but may become too unwieldy. Performance and management considerations tend to favor the use of more coarsely grained messages in a service oriented architecture, whereas earlier generations of distributed computing have preferred a much finer level of granularity.




it

Correction: Sensitivity analysis for an unobserved moderator in RCT-to-target-population generalization of treatment effects

Trang Quynh Nguyen, Elizabeth A. Stuart.

Source: The Annals of Applied Statistics, Volume 14, Number 1, 518--520.




it

Bayesian mixed effects models for zero-inflated compositions in microbiome data analysis

Boyu Ren, Sergio Bacallado, Stefano Favaro, Tommi Vatanen, Curtis Huttenhower, Lorenzo Trippa.

Source: The Annals of Applied Statistics, Volume 14, Number 1, 494--517.

Abstract:
Detecting associations between microbial compositions and sample characteristics is one of the most important tasks in microbiome studies. Most of the existing methods apply univariate models to single microbial species separately, with adjustments for multiple hypothesis testing. We propose a Bayesian analysis for a generalized mixed effects linear model tailored to this application. The marginal prior on each microbial composition is a Dirichlet process, and dependence across compositions is induced through a linear combination of individual covariates, such as disease biomarkers or the subject’s age, and latent factors. The latent factors capture residual variability and their dimensionality is learned from the data in a fully Bayesian procedure. The proposed model is tested in data analyses and simulation studies with zero-inflated compositions. In these settings and within each sample, a large proportion of counts per microbial species are equal to zero. In our Bayesian model a priori the probability of compositions with absent microbial species is strictly positive. We propose an efficient algorithm to sample from the posterior and visualizations of model parameters which reveal associations between covariates and microbial compositions. We evaluate the proposed method in simulation studies, and then analyze a microbiome dataset for infants with type 1 diabetes which contains a large proportion of zeros in the sample-specific microbial compositions.




it

Measuring human activity spaces from GPS data with density ranking and summary curves

Yen-Chi Chen, Adrian Dobra.

Source: The Annals of Applied Statistics, Volume 14, Number 1, 409--432.

Abstract:
Activity spaces are fundamental to the assessment of individuals’ dynamic exposure to social and environmental risk factors associated with multiple spatial contexts that are visited during activities of daily living. In this paper we survey existing approaches for measuring the geometry, size and structure of activity spaces, based on GPS data, and explain their limitations. We propose addressing these shortcomings through a nonparametric approach called density ranking and also through three summary curves: the mass-volume curve, the Betti number curve and the persistence curve. We introduce a novel mixture model for human activity spaces and study its asymptotic properties. We prove that the kernel density estimator, which at the present time, is one of the most widespread methods for measuring activity spaces, is not a stable estimator of their structure. We illustrate the practical value of our methods with a simulation study and with a recently collected GPS dataset that comprises the locations visited by 10 individuals over a six months period.




it

Estimating and forecasting the smoking-attributable mortality fraction for both genders jointly in over 60 countries

Yicheng Li, Adrian E. Raftery.

Source: The Annals of Applied Statistics, Volume 14, Number 1, 381--408.

Abstract:
Smoking is one of the leading preventable threats to human health and a major risk factor for lung cancer, upper aerodigestive cancer and chronic obstructive pulmonary disease. Estimating and forecasting the smoking attributable fraction (SAF) of mortality can yield insights into smoking epidemics and also provide a basis for more accurate mortality and life expectancy projection. Peto et al. ( Lancet 339 (1992) 1268–1278) proposed a method to estimate the SAF using the lung cancer mortality rate as an indicator of exposure to smoking in the population of interest. Here, we use the same method to estimate the all-age SAF (ASAF) for both genders for over 60 countries. We document a strong and cross-nationally consistent pattern of the evolution of the SAF over time. We use this as the basis for a new Bayesian hierarchical model to project future male and female ASAF from over 60 countries simultaneously. This gives forecasts as well as predictive distributions that can be used to find uncertainty intervals for any quantity of interest. We assess the model using out-of-sample predictive validation and find that it provides good forecasts and well-calibrated forecast intervals, comparing favorably with other methods.




it

Regression for copula-linked compound distributions with applications in modeling aggregate insurance claims

Peng Shi, Zifeng Zhao.

Source: The Annals of Applied Statistics, Volume 14, Number 1, 357--380.

Abstract:
In actuarial research a task of particular interest and importance is to predict the loss cost for individual risks so that informative decisions are made in various insurance operations such as underwriting, ratemaking and capital management. The loss cost is typically viewed to follow a compound distribution where the summation of the severity variables is stopped by the frequency variable. A challenging issue in modeling such outcomes is to accommodate the potential dependence between the number of claims and the size of each individual claim. In this article we introduce a novel regression framework for compound distributions that uses a copula to accommodate the association between the frequency and the severity variables and, thus, allows for arbitrary dependence between the two components. We further show that the new model is very flexible and is easily modified to account for incomplete data due to censoring or truncation. The flexibility of the proposed model is illustrated using both simulated and real data sets. In the analysis of granular claims data from property insurance, we find substantive negative relationship between the number and the size of insurance claims. In addition, we demonstrate that ignoring the frequency-severity association could lead to biased decision-making in insurance operations.




it

Modeling wildfire ignition origins in southern California using linear network point processes

Medha Uppala, Mark S. Handcock.

Source: The Annals of Applied Statistics, Volume 14, Number 1, 339--356.

Abstract:
This paper focuses on spatial and temporal modeling of point processes on linear networks. Point processes on linear networks can simply be defined as point events occurring on or near line segment network structures embedded in a certain space. A separable modeling framework is introduced that posits separate formation and dissolution models of point processes on linear networks over time. While the model was inspired by spider web building activity in brick mortar lines, the focus is on modeling wildfire ignition origins near road networks over a span of 14 years. As most wildfires in California have human-related origins, modeling the origin locations with respect to the road network provides insight into how human, vehicular and structural densities affect ignition occurrence. Model results show that roads that traverse different types of regions such as residential, interface and wildland regions have higher ignition intensities compared to roads that only exist in each of the mentioned region types.




it

Optimal asset allocation with multivariate Bayesian dynamic linear models

Jared D. Fisher, Davide Pettenuzzo, Carlos M. Carvalho.

Source: The Annals of Applied Statistics, Volume 14, Number 1, 299--338.

Abstract:
We introduce a fast, closed-form, simulation-free method to model and forecast multiple asset returns and employ it to investigate the optimal ensemble of features to include when jointly predicting monthly stock and bond excess returns. Our approach builds on the Bayesian dynamic linear models of West and Harrison ( Bayesian Forecasting and Dynamic Models (1997) Springer), and it can objectively determine, through a fully automated procedure, both the optimal set of regressors to include in the predictive system and the degree to which the model coefficients, volatilities and covariances should vary over time. When applied to a portfolio of five stock and bond returns, we find that our method leads to large forecast gains, both in statistical and economic terms. In particular, we find that relative to a standard no-predictability benchmark, the optimal combination of predictors, stochastic volatility and time-varying covariances increases the annualized certainty equivalent returns of a leverage-constrained power utility investor by more than 500 basis points.




it

Feature selection for generalized varying coefficient mixed-effect models with application to obesity GWAS

Wanghuan Chu, Runze Li, Jingyuan Liu, Matthew Reimherr.

Source: The Annals of Applied Statistics, Volume 14, Number 1, 276--298.

Abstract:
Motivated by an empirical analysis of data from a genome-wide association study on obesity, measured by the body mass index (BMI), we propose a two-step gene-detection procedure for generalized varying coefficient mixed-effects models with ultrahigh dimensional covariates. The proposed procedure selects significant single nucleotide polymorphisms (SNPs) impacting the mean BMI trend, some of which have already been biologically proven to be “fat genes.” The method also discovers SNPs that significantly influence the age-dependent variability of BMI. The proposed procedure takes into account individual variations of genetic effects and can also be directly applied to longitudinal data with continuous, binary or count responses. We employ Monte Carlo simulation studies to assess the performance of the proposed method and further carry out causal inference for the selected SNPs.




it

Estimating the health effects of environmental mixtures using Bayesian semiparametric regression and sparsity inducing priors

Joseph Antonelli, Maitreyi Mazumdar, David Bellinger, David Christiani, Robert Wright, Brent Coull.

Source: The Annals of Applied Statistics, Volume 14, Number 1, 257--275.

Abstract:
Humans are routinely exposed to mixtures of chemical and other environmental factors, making the quantification of health effects associated with environmental mixtures a critical goal for establishing environmental policy sufficiently protective of human health. The quantification of the effects of exposure to an environmental mixture poses several statistical challenges. It is often the case that exposure to multiple pollutants interact with each other to affect an outcome. Further, the exposure-response relationship between an outcome and some exposures, such as some metals, can exhibit complex, nonlinear forms, since some exposures can be beneficial and detrimental at different ranges of exposure. To estimate the health effects of complex mixtures, we propose a flexible Bayesian approach that allows exposures to interact with each other and have nonlinear relationships with the outcome. We induce sparsity using multivariate spike and slab priors to determine which exposures are associated with the outcome and which exposures interact with each other. The proposed approach is interpretable, as we can use the posterior probabilities of inclusion into the model to identify pollutants that interact with each other. We utilize our approach to study the impact of exposure to metals on child neurodevelopment in Bangladesh and find a nonlinear, interactive relationship between arsenic and manganese.




it

Bayesian factor models for probabilistic cause of death assessment with verbal autopsies

Tsuyoshi Kunihama, Zehang Richard Li, Samuel J. Clark, Tyler H. McCormick.

Source: The Annals of Applied Statistics, Volume 14, Number 1, 241--256.

Abstract:
The distribution of deaths by cause provides crucial information for public health planning, response and evaluation. About 60% of deaths globally are not registered or given a cause, limiting our ability to understand disease epidemiology. Verbal autopsy (VA) surveys are increasingly used in such settings to collect information on the signs, symptoms and medical history of people who have recently died. This article develops a novel Bayesian method for estimation of population distributions of deaths by cause using verbal autopsy data. The proposed approach is based on a multivariate probit model where associations among items in questionnaires are flexibly induced by latent factors. Using the Population Health Metrics Research Consortium labeled data that include both VA and medically certified causes of death, we assess performance of the proposed method. Further, we estimate important questionnaire items that are highly associated with causes of death. This framework provides insights that will simplify future data




it

Assessing wage status transition and stagnation using quantile transition regression

Chih-Yuan Hsu, Yi-Hau Chen, Ruoh-Rong Yu, Tsung-Wei Hung.

Source: The Annals of Applied Statistics, Volume 14, Number 1, 160--177.

Abstract:
Workers in Taiwan overall have been suffering from long-lasting wage stagnation since the mid-1990s. In particular, there seems to be little mobility for the wages of Taiwanese workers to transit across wage quantile groups. It is of interest to see if certain groups of workers, such as female, lower educated and younger generation workers, suffer from the problem more seriously than the others. This work tries to apply a systematic statistical approach to study this issue, based on the longitudinal data from the Panel Study of Family Dynamics (PSFD) survey conducted in Taiwan since 1999. We propose the quantile transition regression model, generalizing recent methodology for quantile association, to assess the wage status transition with respect to the marginal wage quantiles over time as well as the effects of certain demographic and job factors on the wage status transition. Estimation of the model can be based on the composite likelihoods utilizing the binary, or ordinal-data information regarding the quantile transition, with the associated asymptotic theory established. A goodness-of-fit procedure for the proposed model is developed. The performances of the estimation and the goodness-of-fit procedures for the quantile transition model are illustrated through simulations. The application of the proposed methodology to the PSFD survey data suggests that female, private-sector workers with higher age and education below postgraduate level suffer from more severe wage status stagnation than the others.




it

Surface temperature monitoring in liver procurement via functional variance change-point analysis

Zhenguo Gao, Pang Du, Ran Jin, John L. Robertson.

Source: The Annals of Applied Statistics, Volume 14, Number 1, 143--159.

Abstract:
Liver procurement experiments with surface-temperature monitoring motivated Gao et al. ( J. Amer. Statist. Assoc. 114 (2019) 773–781) to develop a variance change-point detection method under a smoothly-changing mean trend. However, the spotwise change points yielded from their method do not offer immediate information to surgeons since an organ is often transplanted as a whole or in part. We develop a new practical method that can analyze a defined portion of the organ surface at a time. It also provides a novel addition to the developing field of functional data monitoring. Furthermore, numerical challenge emerges for simultaneously modeling the variance functions of 2D locations and the mean function of location and time. The respective sample sizes in the scales of 10,000 and 1,000,000 for modeling these functions make standard spline estimation too costly to be useful. We introduce a multistage subsampling strategy with steps educated by quickly-computable preliminary statistical measures. Extensive simulations show that the new method can efficiently reduce the computational cost and provide reasonable parameter estimates. Application of the new method to our liver surface temperature monitoring data shows its effectiveness in providing accurate status change information for a selected portion of the organ in the experiment.




it

Modeling microbial abundances and dysbiosis with beta-binomial regression

Bryan D. Martin, Daniela Witten, Amy D. Willis.

Source: The Annals of Applied Statistics, Volume 14, Number 1, 94--115.

Abstract:
Using a sample from a population to estimate the proportion of the population with a certain category label is a broadly important problem. In the context of microbiome studies, this problem arises when researchers wish to use a sample from a population of microbes to estimate the population proportion of a particular taxon, known as the taxon’s relative abundance . In this paper, we propose a beta-binomial model for this task. Like existing models, our model allows for a taxon’s relative abundance to be associated with covariates of interest. However, unlike existing models, our proposal also allows for the overdispersion in the taxon’s counts to be associated with covariates of interest. We exploit this model in order to propose tests not only for differential relative abundance, but also for differential variability. The latter is particularly valuable in light of speculation that dysbiosis , the perturbation from a normal microbiome that can occur in certain disease conditions, may manifest as a loss of stability, or increase in variability, of the counts associated with each taxon. We demonstrate the performance of our proposed model using a simulation study and an application to soil microbial data.




it

Efficient real-time monitoring of an emerging influenza pandemic: How feasible?

Paul J. Birrell, Lorenz Wernisch, Brian D. M. Tom, Leonhard Held, Gareth O. Roberts, Richard G. Pebody, Daniela De Angelis.

Source: The Annals of Applied Statistics, Volume 14, Number 1, 74--93.

Abstract:
A prompt public health response to a new epidemic relies on the ability to monitor and predict its evolution in real time as data accumulate. The 2009 A/H1N1 outbreak in the UK revealed pandemic data as noisy, contaminated, potentially biased and originating from multiple sources. This seriously challenges the capacity for real-time monitoring. Here, we assess the feasibility of real-time inference based on such data by constructing an analytic tool combining an age-stratified SEIR transmission model with various observation models describing the data generation mechanisms. As batches of data become available, a sequential Monte Carlo (SMC) algorithm is developed to synthesise multiple imperfect data streams, iterate epidemic inferences and assess model adequacy amidst a rapidly evolving epidemic environment, substantially reducing computation time in comparison to standard MCMC, to ensure timely delivery of real-time epidemic assessments. In application to simulated data designed to mimic the 2009 A/H1N1 epidemic, SMC is shown to have additional benefits in terms of assessing predictive performance and coping with parameter nonidentifiability.




it

Integrative survival analysis with uncertain event times in application to a suicide risk study

Wenjie Wang, Robert Aseltine, Kun Chen, Jun Yan.

Source: The Annals of Applied Statistics, Volume 14, Number 1, 51--73.

Abstract:
The concept of integrating data from disparate sources to accelerate scientific discovery has generated tremendous excitement in many fields. The potential benefits from data integration, however, may be compromised by the uncertainty due to incomplete/imperfect record linkage. Motivated by a suicide risk study, we propose an approach for analyzing survival data with uncertain event times arising from data integration. Specifically, in our problem deaths identified from the hospital discharge records together with reported suicidal deaths determined by the Office of Medical Examiner may still not include all the death events of patients, and the missing deaths can be recovered from a complete database of death records. Since the hospital discharge data can only be linked to the death record data by matching basic patient characteristics, a patient with a censored death time from the first dataset could be linked to multiple potential event records in the second dataset. We develop an integrative Cox proportional hazards regression in which the uncertainty in the matched event times is modeled probabilistically. The estimation procedure combines the ideas of profile likelihood and the expectation conditional maximization algorithm (ECM). Simulation studies demonstrate that under realistic settings of imperfect data linkage the proposed method outperforms several competing approaches including multiple imputation. A marginal screening analysis using the proposed integrative Cox model is performed to identify risk factors associated with death following suicide-related hospitalization in Connecticut. The identified diagnostics codes are consistent with existing literature and provide several new insights on suicide risk, prediction and prevention.




it

BART with targeted smoothing: An analysis of patient-specific stillbirth risk

Jennifer E. Starling, Jared S. Murray, Carlos M. Carvalho, Radek K. Bukowski, James G. Scott.

Source: The Annals of Applied Statistics, Volume 14, Number 1, 28--50.

Abstract:
This article introduces BART with Targeted Smoothing, or tsBART, a new Bayesian tree-based model for nonparametric regression. The goal of tsBART is to introduce smoothness over a single target covariate $t$ while not necessarily requiring smoothness over other covariates $x$. tsBART is based on the Bayesian Additive Regression Trees (BART) model, an ensemble of regression trees. tsBART extends BART by parameterizing each tree’s terminal nodes with smooth functions of $t$ rather than independent scalars. Like BART, tsBART captures complex nonlinear relationships and interactions among the predictors. But unlike BART, tsBART guarantees that the response surface will be smooth in the target covariate. This improves interpretability and helps to regularize the estimate. After introducing and benchmarking the tsBART model, we apply it to our motivating example—pregnancy outcomes data from the National Center for Health Statistics. Our aim is to provide patient-specific estimates of stillbirth risk across gestational age $(t)$ and based on maternal and fetal risk factors $(x)$. Obstetricians expect stillbirth risk to vary smoothly over gestational age but not necessarily over other covariates, and tsBART has been designed precisely to reflect this structural knowledge. The results of our analysis show the clear superiority of the tsBART model for quantifying stillbirth risk, thereby providing patients and doctors with better information for managing the risk of fetal mortality. All methods described here are implemented in the R package tsbart .




it

SHOPPER: A probabilistic model of consumer choice with substitutes and complements

Francisco J. R. Ruiz, Susan Athey, David M. Blei.

Source: The Annals of Applied Statistics, Volume 14, Number 1, 1--27.

Abstract:
We develop SHOPPER, a sequential probabilistic model of shopping data. SHOPPER uses interpretable components to model the forces that drive how a customer chooses products; in particular, we designed SHOPPER to capture how items interact with other items. We develop an efficient posterior inference algorithm to estimate these forces from large-scale data, and we analyze a large dataset from a major chain grocery store. We are interested in answering counterfactual queries about changes in prices. We found that SHOPPER provides accurate predictions even under price interventions, and that it helps identify complementary and substitutable pairs of products.




it

Hierarchical infinite factor models for improving the prediction of surgical complications for geriatric patients

Elizabeth Lorenzi, Ricardo Henao, Katherine Heller.

Source: The Annals of Applied Statistics, Volume 13, Number 4, 2637--2661.

Abstract:
Nearly a third of all surgeries performed in the United States occur for patients over the age of 65; these older adults experience a higher rate of postoperative morbidity and mortality. To improve the care for these patients, we aim to identify and characterize high risk geriatric patients to send to a specialized perioperative clinic while leveraging the overall surgical population to improve learning. To this end, we develop a hierarchical infinite latent factor model (HIFM) to appropriately account for the covariance structure across subpopulations in data. We propose a novel Hierarchical Dirichlet Process shrinkage prior on the loadings matrix that flexibly captures the underlying structure of our data while sharing information across subpopulations to improve inference and prediction. The stick-breaking construction of the prior assumes an infinite number of factors and allows for each subpopulation to utilize different subsets of the factor space and select the number of factors needed to best explain the variation. We develop the model into a latent factor regression method that excels at prediction and inference of regression coefficients. Simulations validate this strong performance compared to baseline methods. We apply this work to the problem of predicting surgical complications using electronic health record data for geriatric patients and all surgical patients at Duke University Health System (DUHS). The motivating application demonstrates the improved predictive performance when using HIFM in both area under the ROC curve and area under the PR Curve while providing interpretable coefficients that may lead to actionable interventions.




it

Scalable high-resolution forecasting of sparse spatiotemporal events with kernel methods: A winning solution to the NIJ “Real-Time Crime Forecasting Challenge”

Seth Flaxman, Michael Chirico, Pau Pereira, Charles Loeffler.

Source: The Annals of Applied Statistics, Volume 13, Number 4, 2564--2585.

Abstract:
We propose a generic spatiotemporal event forecasting method which we developed for the National Institute of Justice’s (NIJ) Real-Time Crime Forecasting Challenge (National Institute of Justice (2017)). Our method is a spatiotemporal forecasting model combining scalable randomized Reproducing Kernel Hilbert Space (RKHS) methods for approximating Gaussian processes with autoregressive smoothing kernels in a regularized supervised learning framework. While the smoothing kernels capture the two main approaches in current use in the field of crime forecasting, kernel density estimation (KDE) and self-exciting point process (SEPP) models, the RKHS component of the model can be understood as an approximation to the popular log-Gaussian Cox Process model. For inference, we discretize the spatiotemporal point pattern and learn a log-intensity function using the Poisson likelihood and highly efficient gradient-based optimization methods. Model hyperparameters including quality of RKHS approximation, spatial and temporal kernel lengthscales, number of autoregressive lags and bandwidths for smoothing kernels as well as cell shape, size and rotation, were learned using cross validation. Resulting predictions significantly exceeded baseline KDE estimates and SEPP models for sparse events.




it

A simple, consistent estimator of SNP heritability from genome-wide association studies

Armin Schwartzman, Andrew J. Schork, Rong Zablocki, Wesley K. Thompson.

Source: The Annals of Applied Statistics, Volume 13, Number 4, 2509--2538.

Abstract:
Analysis of genome-wide association studies (GWAS) is characterized by a large number of univariate regressions where a quantitative trait is regressed on hundreds of thousands to millions of single-nucleotide polymorphism (SNP) allele counts, one at a time. This article proposes an estimator of the SNP heritability of the trait, defined here as the fraction of the variance of the trait explained by the SNPs in the study. The proposed GWAS heritability (GWASH) estimator is easy to compute, highly interpretable and is consistent as the number of SNPs and the sample size increase. More importantly, it can be computed from summary statistics typically reported in GWAS, not requiring access to the original data. The estimator takes full account of the linkage disequilibrium (LD) or correlation between the SNPs in the study through moments of the LD matrix, estimable from auxiliary datasets. Unlike other proposed estimators in the literature, we establish the theoretical properties of the GWASH estimator and obtain analytical estimates of the precision, allowing for power and sample size calculations for SNP heritability estimates and forming a firm foundation for future methodological development.




it

Empirical Bayes analysis of RNA sequencing experiments with auxiliary information

Kun Liang.

Source: The Annals of Applied Statistics, Volume 13, Number 4, 2452--2482.

Abstract:
Finding differentially expressed genes is a common task in high-throughput transcriptome studies. While traditional statistical methods rank the genes by their test statistics alone, we analyze an RNA sequencing dataset using the auxiliary information of gene length and the test statistics from a related microarray study. Given the auxiliary information, we propose a novel nonparametric empirical Bayes procedure to estimate the posterior probability of differential expression for each gene. We demonstrate the advantage of our procedure in extensive simulation studies and a psoriasis RNA sequencing study. The companion R package calm is available at Bioconductor.