ip High-Dimensional Interactions Detection with Sparse Principal Hessian Matrix By Published On :: 2020 In a statistical learning framework with regression, interactions are the contributions to the response variable from the products of the explanatory variables. In high-dimensional problems, detecting interactions is challenging due to combinatorial complexity and limited data information. We consider detecting interactions by exploring their connections with the principal Hessian matrix. Specifically, we propose a one-step synthetic approach for estimating the principal Hessian matrix by a penalized M-estimator. An alternating direction method of multipliers (ADMM) is proposed to efficiently solve the encountered regularized optimization problem. Based on the sparse estimator, we detect the interactions by identifying its nonzero components. Our method directly targets the interactions, and it requires no structural assumption on the hierarchy of the interaction effects. We show that our estimator is theoretically valid, computationally efficient, and practically useful for detecting the interactions in a broad spectrum of scenarios. Full Article
ip Targeted Fused Ridge Estimation of Inverse Covariance Matrices from Multiple High-Dimensional Data Classes By Published On :: 2020 We consider the problem of jointly estimating multiple inverse covariance matrices from high-dimensional data consisting of distinct classes. An $\ell_2$-penalized maximum likelihood approach is employed. The suggested approach is flexible and generic, incorporating several other $\ell_2$-penalized estimators as special cases. In addition, the approach allows specification of target matrices through which prior knowledge may be incorporated and which can stabilize the estimation procedure in high-dimensional settings. The result is a targeted fused ridge estimator that is of use when the precision matrices of the constituent classes are believed to chiefly share the same structure while potentially differing in a number of locations of interest. It has many applications in (multi)factorial study designs. We focus on the graphical interpretation of precision matrices with the proposed estimator then serving as a basis for integrative or meta-analytic Gaussian graphical modeling. Situations are considered in which the classes are defined by data sets and subtypes of diseases. The performance of the proposed estimator in the graphical modeling setting is assessed through extensive simulation experiments. Its practical usability is illustrated by the differential network modeling of 12 large-scale gene expression data sets of diffuse large B-cell lymphoma subtypes. The estimator and its related procedures are incorporated into the R-package rags2ridges. Full Article
ip Causal Discovery Toolbox: Uncovering causal relationships in Python By Published On :: 2020 This paper presents a new open source Python framework for causal discovery from observational data and domain background knowledge, aimed at causal graph and causal mechanism modeling. The cdt package implements an end-to-end approach, recovering the direct dependencies (the skeleton of the causal graph) and the causal relationships between variables. It includes algorithms from the `Bnlearn' and `Pcalg' packages, together with algorithms for pairwise causal discovery such as ANM. Full Article
ip Optimal Bipartite Network Clustering By Published On :: 2020 We study bipartite community detection in networks, or more generally the network biclustering problem. We present a fast two-stage procedure based on spectral initialization followed by the application of a pseudo-likelihood classifier twice. Under mild regularity conditions, we establish the weak consistency of the procedure (i.e., the convergence of the misclassification rate to zero) under a general bipartite stochastic block model. We show that the procedure is optimal in the sense that it achieves the optimal convergence rate that is achievable by a biclustering oracle, adaptively over the whole class, up to constants. This is further formalized by deriving a minimax lower bound over a class of biclustering problems. The optimal rate we obtain sharpens some of the existing results and generalizes others to a wide regime of average degree growth, from sparse networks with average degrees growing arbitrarily slowly to fairly dense networks with average degrees of order $\sqrt{n}$. As a special case, we recover the known exact recovery threshold in the $\log n$ regime of sparsity. To obtain the consistency result, as part of the provable version of the algorithm, we introduce a sub-block partitioning scheme that is also computationally attractive, allowing for distributed implementation of the algorithm without sacrificing optimality. The provable algorithm is derived from a general class of pseudo-likelihood biclustering algorithms that employ simple EM type updates. We show the effectiveness of this general class by numerical simulations. Full Article
ip Skill Rating for Multiplayer Games. Introducing Hypernode Graphs and their Spectral Theory By Published On :: 2020 We consider the skill rating problem for multiplayer games, that is, how to infer player skills from game outcomes in multiplayer games. We formulate the problem as a minimization problem $\arg\min_{s} s^T \Delta s$ where $\Delta$ is a positive semidefinite matrix and $s$ a real-valued function, of which some entries are the skill values to be inferred and other entries are constrained by the game outcomes. We leverage graph-based semi-supervised learning (SSL) algorithms for this problem. We apply our algorithms on several data sets of multiplayer games and obtain very promising results compared to Elo Duelling (see Elo, 1978) and TrueSkill (see Herbrich et al., 2006). As we leverage graph-based SSL algorithms and because games can be seen as relations between sets of players, we then generalize the approach. For this aim, we introduce a new finite model, called hypernode graph, defined to be a set of weighted binary relations between sets of nodes. We define Laplacians of hypernode graphs. Then, we show that the skill rating problem for multiplayer games can be formulated as $\arg\min_{s} s^T \Delta s$ where $\Delta$ is the Laplacian of a hypernode graph constructed from a set of games. From a fundamental perspective, we show that hypernode graph Laplacians are symmetric positive semidefinite matrices with constant functions in their null space. We show that problems on hypernode graphs cannot be solved with graph constructions and graph kernels. We relate hypernode graphs to signed graphs showing that positive relations between groups can lead to negative relations between individuals. Full Article
ip Exact Guarantees on the Absence of Spurious Local Minima for Non-negative Rank-1 Robust Principal Component Analysis By Published On :: 2020 This work is concerned with the non-negative rank-1 robust principal component analysis (RPCA), where the goal is to recover the dominant non-negative principal components of a data matrix precisely, where a number of measurements could be grossly corrupted with sparse and arbitrarily large noise. Most of the known techniques for solving the RPCA rely on convex relaxation methods by lifting the problem to a higher dimension, which significantly increases the number of variables. As an alternative, the well-known Burer-Monteiro approach can be used to cast the RPCA as a non-convex and non-smooth $\ell_1$ optimization problem with a significantly smaller number of variables. In this work, we show that the low-dimensional formulation of the symmetric and asymmetric positive rank-1 RPCA based on the Burer-Monteiro approach has a benign landscape, i.e., 1) it does not have any spurious local solution, 2) has a unique global solution, and 3) its unique global solution coincides with the true components. An implication of this result is that simple local search algorithms are guaranteed to achieve a zero global optimality gap when directly applied to the low-dimensional formulation. Furthermore, we provide strong deterministic and probabilistic guarantees for the exact recovery of the true principal components. In particular, it is shown that a constant fraction of the measurements could be grossly corrupted and yet they would not create any spurious local solution. Full Article
ip Multiparameter Persistence Landscapes By Published On :: 2020 An important problem in the field of Topological Data Analysis is defining topological summaries which can be combined with traditional data analytic tools. In recent work Bubenik introduced the persistence landscape, a stable representation of persistence diagrams amenable to statistical analysis and machine learning tools. In this paper we generalise the persistence landscape to multiparameter persistence modules providing a stable representation of the rank invariant. We show that multiparameter landscapes are stable with respect to the interleaving distance and persistence weighted Wasserstein distance, and that the collection of multiparameter landscapes faithfully represents the rank invariant. Finally we provide example calculations and statistical tests to demonstrate a range of potential applications and how one can interpret the landscapes associated to a multiparameter module. Full Article
ip Researching the Pacific: The Pacific Manuscripts Bureau By feedproxy.google.com Published On :: Mon, 27 Apr 2020 05:25:40 +0000 The State Library holds a superb collection of original documents, illustrations, photographs and books about the Pacifi Full Article
ip Measuring symmetry and asymmetry of multiplicative distortion measurement errors data By projecteuclid.org Published On :: Mon, 04 May 2020 04:00 EDT Jun Zhang, Yujie Gai, Xia Cui, Gaorong Li. Source: Brazilian Journal of Probability and Statistics, Volume 34, Number 2, 370--393.Abstract: This paper studies the measure of symmetry or asymmetry of a continuous variable under the multiplicative distortion measurement errors setting. The unobservable variable is distorted in a multiplicative fashion by an observed confounding variable. First, two direct plug-in estimation procedures are proposed, and the empirical likelihood based confidence intervals are constructed to measure the symmetry or asymmetry of the unobserved variable. Next, we propose four test statistics for testing whether the unobserved variable is symmetric or not. The asymptotic properties of the proposed estimators and test statistics are examined. We conduct Monte Carlo simulation experiments to examine the performance of the proposed estimators and test statistics. These methods are applied to analyze a real dataset for an illustration. Full Article
ip Effects of gene–environment and gene–gene interactions in case-control studies: A novel Bayesian semiparametric approach By projecteuclid.org Published On :: Mon, 03 Feb 2020 04:00 EST Durba Bhattacharya, Sourabh Bhattacharya. Source: Brazilian Journal of Probability and Statistics, Volume 34, Number 1, 71--89.Abstract: Present day bio-medical research is pointing towards the fact that cognizance of gene–environment interactions along with genetic interactions may help prevent or delay the onset of many complex diseases like cardiovascular disease, cancer, type 2 diabetes, autism or asthma by adjustments to lifestyle. In this regard, we propose a Bayesian semiparametric model to detect not only the roles of genes and their interactions, but also the possible influence of environmental variables on the genes in case-control studies. Our model also accounts for the unknown number of genetic sub-populations via finite mixtures composed of Dirichlet processes. An effective parallel computing methodology developed by us harnesses the power of parallel processing technology to increase the efficiencies of our conditionally independent Gibbs sampling and Transformation based MCMC (TMCMC) methods. Applications of our model and methods to simulation studies with biologically realistic genotype datasets and a real, case-control based genotype dataset on early onset of myocardial infarction (MI) have yielded quite interesting results besides providing some insights into the differential effect of gender on MI. Full Article
ip Fractional backward stochastic variational inequalities with non-Lipschitz coefficient By projecteuclid.org Published On :: Mon, 10 Jun 2019 04:04 EDT Katarzyna Jańczak-Borkowska. Source: Brazilian Journal of Probability and Statistics, Volume 33, Number 3, 480--497.Abstract: We prove the existence and uniqueness of the solution of backward stochastic variational inequalities with respect to fractional Brownian motion and with non-Lipschitz coefficient. We assume that $H>1/2$. Full Article
ip Public-private partnerships in Canada : law, policy and value for money By dal.novanet.ca Published On :: Fri, 1 May 2020 19:34:09 -0300 Author: Murphy, Timothy J. (Timothy John), author. Callnumber: KE 1465 M87 2019 ISBN: 9780433457985 (Cloth) Full Article
ip Wilcoxon-Mann-Whitney or t-test? On assumptions for hypothesis tests and multiple interpretations of decision rules By projecteuclid.org Published On :: Thu, 05 Aug 2010 15:41 EDT Michael P. Fay, Michael A. Proschan. Source: Statist. Surv., Volume 4, 1--39.Abstract: In a mathematical approach to hypothesis tests, we start with a clearly defined set of hypotheses and choose the test with the best properties for those hypotheses. In practice, we often start with less precise hypotheses. For example, often a researcher wants to know which of two groups generally has the larger responses, and either a t-test or a Wilcoxon-Mann-Whitney (WMW) test could be acceptable. Although both t-tests and WMW tests are usually associated with quite different hypotheses, the decision rule and p-value from either test could be associated with many different sets of assumptions, which we call perspectives. It is useful to have many of the different perspectives to which a decision rule may be applied collected in one place, since each perspective allows a different interpretation of the associated p-value. Here we collect many such perspectives for the two-sample t-test, the WMW test and other related tests. We discuss validity and consistency under each perspective and discuss recommendations between the tests in light of these many different perspectives. Finally, we briefly discuss a decision rule for testing genetic neutrality where knowledge of the many perspectives is vital to the proper interpretation of the decision rule. Full Article
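To make the contrast between the two tests in the entry above concrete, here is a minimal sketch using only the Python standard library. It is not taken from the paper: the data are hypothetical, the t statistic is the Welch (unequal-variance) form, and the WMW z score uses the plain normal approximation without a tie correction. All function names are illustrative.

```python
import math
import statistics


def welch_t(x, y):
    """Welch's two-sample t statistic (unequal variances allowed)."""
    vx, vy = statistics.variance(x), statistics.variance(y)
    se = math.sqrt(vx / len(x) + vy / len(y))
    return (statistics.mean(x) - statistics.mean(y)) / se


def mann_whitney_u(x, y):
    """U statistic for x: number of pairs with x_i > y_j, ties count 1/2."""
    return sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in x for b in y)


def wmw_z(x, y):
    """Normal-approximation z score for U (no tie correction)."""
    n, m = len(x), len(y)
    mean_u = n * m / 2
    sd_u = math.sqrt(n * m * (n + m + 1) / 12)
    return (mann_whitney_u(x, y) - mean_u) / sd_u


# Hypothetical data: group_a contains one very large response.
group_a = [1.1, 2.3, 2.9, 3.4, 50.0]
group_b = [0.2, 0.9, 1.0, 1.8, 2.0]

print("Welch t:", welch_t(group_a, group_b))
print("U:", mann_whitney_u(group_a, group_b), "z:", wmw_z(group_a, group_b))
```

On these numbers the outlier inflates the variance, so the t statistic is modest, while U counts 23 of the 25 pairs in favour of group_a. The two decision rules respond to different features of the data, which is why their interpretations can differ.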
ip Multi-scale analysis of lead-lag relationships in high-frequency financial markets. (arXiv:1708.03992v3 [stat.ME] UPDATED) By arxiv.org Published On :: We propose a novel estimation procedure for scale-by-scale lead-lag relationships of financial assets observed at high-frequency in a non-synchronous manner. The proposed estimation procedure does not require any interpolation processing of original datasets and is applicable to those with highest time resolution available. Consistency of the proposed estimators is shown under the continuous-time framework that has been developed in our previous work Hayashi and Koike (2018). An empirical application to a quote dataset of the NASDAQ-100 assets identifies two types of lead-lag relationships at different time scales. Full Article
ip Semiparametric Optimal Estimation With Nonignorable Nonresponse Data. (arXiv:1612.09207v3 [stat.ME] UPDATED) By arxiv.org Published On :: When the response mechanism is believed to be not missing at random (NMAR), a valid analysis requires stronger assumptions on the response mechanism than standard statistical methods would otherwise require. Semiparametric estimators have been developed under the model assumptions on the response mechanism. In this paper, a new statistical test is proposed to guarantee model identifiability without using any instrumental variable. Furthermore, we develop optimal semiparametric estimation for parameters such as the population mean. Specifically, we propose two semiparametric optimal estimators that do not require any model assumptions other than the response mechanism. Asymptotic properties of the proposed estimators are discussed. An extensive simulation study is presented to compare with some existing methods. We present an application of our method using Korean Labor and Income Panel Survey data. Full Article
ip On a computationally-scalable sparse formulation of the multidimensional and non-stationary maximum entropy principle. (arXiv:2005.03253v1 [stat.CO]) By arxiv.org Published On :: Data-driven modelling and computational predictions based on maximum entropy principle (MaxEnt-principle) aim at finding as-simple-as-possible - but not simpler than necessary - models that avoid the data overfitting problem. We derive a multivariate non-parametric and non-stationary formulation of the MaxEnt-principle and show that its solution can be approximated through a numerical maximisation of the sparse constrained optimization problem with regularization. Application of the resulting algorithm to popular financial benchmarks reveals memoryless models allowing for simple and qualitative descriptions of the major stock market index data. We compare the obtained MaxEnt-models to the heteroskedastic models from computational econometrics (GARCH, GARCH-GJR, MS-GARCH, GARCH-PML4) in terms of the model fit, complexity and prediction quality. We compare the resulting model log-likelihoods, the values of the Bayesian Information Criterion, posterior model probabilities, the quality of the data autocorrelation function fits as well as the Value-at-Risk prediction quality. We show that all of the considered seven major financial benchmark time series (DJI, SPX, FTSE, STOXX, SMI, HSI and N225) are better described by conditionally memoryless MaxEnt-models with nonstationary regime-switching than by the common econometric models with finite memory. This analysis also reveals a sparse network of statistically-significant temporal relations for the positive and negative latent variance changes among different markets. The code is provided for open access. Full Article
ip Active Learning with Multiple Kernels. (arXiv:2005.03188v1 [cs.LG]) By arxiv.org Published On :: Online multiple kernel learning (OMKL) has provided attractive performance in nonlinear function learning tasks. Leveraging a random feature approximation, the major drawback of OMKL, known as the curse of dimensionality, has been recently alleviated. In this paper, we introduce a new research problem, termed (stream-based) active multiple kernel learning (AMKL), in which a learner is allowed to label selected data from an oracle according to a selection criterion. This is necessary in many real-world applications as acquiring true labels is costly or time-consuming. We prove that AMKL achieves an optimal sublinear regret, implying that the proposed selection criterion indeed avoids unnecessary label requests. Furthermore, we propose AMKL with an adaptive kernel selection (AMKL-AKS) in which irrelevant kernels can be excluded from a kernel dictionary 'on the fly'. This approach can improve the efficiency of active learning as well as the accuracy of a function approximation. Via numerical tests with various real datasets, it is demonstrated that AMKL-AKS yields similar or better performance than the best-known OMKL, with a smaller number of labeled data. Full Article
ip Entries open for $40,000 award for female scriptwriters By feedproxy.google.com Published On :: Thu, 05 Mar 2020 23:11:18 +0000 Friday 6 March 2020 Nominations opened for the 2020 Mona Brand Award for Women Stage and Screen Writers. Full Article
ip Close encounters: a manuscripts workshop By blog.wellcomelibrary.org Published On :: Mon, 23 Apr 2018 15:18:54 +0000 A free manuscripts workshop for PhD students at Wellcome Collection, 01 June 2018 Engaging with an artefact from the past is often a powerful experience, eliciting emotional and sensory, as well as analytical, responses. Researchers in the library at Wellcome… Continue reading Full Article Early Medicine Events and Visits emotions manuscripts materiality senses study visits
ip Wyllie's treatment of epilepsy : principles and practice By dal.novanet.ca Published On :: Fri, 1 May 2020 19:44:43 -0300 Callnumber: Online ISBN: 149639769X Full Article
ip Wine science : principles and applications By dal.novanet.ca Published On :: Fri, 1 May 2020 19:44:43 -0300 Author: Jackson, Ron S., author. Callnumber: Online ISBN: 9780128161180 Full Article
ip Vertebrate and invertebrate respiratory proteins, lipoproteins and other body fluid proteins By dal.novanet.ca Published On :: Fri, 1 May 2020 19:44:43 -0300 Callnumber: Online ISBN: 9783030417697 (electronic bk.) Full Article
ip Tissue engineering : principles, protocols, and practical exercises By dal.novanet.ca Published On :: Fri, 1 May 2020 19:44:43 -0300 Callnumber: Online ISBN: 9783030396985 Full Article
ip The Washington manual internship survival guide By dal.novanet.ca Published On :: Fri, 1 May 2020 19:44:43 -0300 Callnumber: Online ISBN: 9781975116859 Full Article
ip Nanoencapsulation of food ingredients by specialized equipment By dal.novanet.ca Published On :: Fri, 1 May 2020 19:44:43 -0300 Callnumber: Online ISBN: 9780128156728 (electronic bk.) Full Article
ip Maxillofacial cone beam computed tomography : principles, techniques and clinical applications By dal.novanet.ca Published On :: Fri, 1 May 2020 19:44:43 -0300 Callnumber: Online ISBN: 9783319620619 (electronic bk.) Full Article
ip Irwin and Rippe's intensive care medicine By dal.novanet.ca Published On :: Fri, 1 May 2020 19:44:43 -0300 Callnumber: Online ISBN: 9781496306081 hardcover Full Article
ip Health consequences of microbial interactions with hydrocarbons, oils, and lipids By dal.novanet.ca Published On :: Fri, 1 May 2020 19:44:43 -0300 Callnumber: Online ISBN: 9783319724737 (electronic bk.) Full Article
ip Handbook for principles and practice of gynecologic oncology By dal.novanet.ca Published On :: Fri, 1 May 2020 19:44:43 -0300 Callnumber: Online ISBN: 9781975141066 (paperback) Full Article
ip Consequences of microbial interactions with hydrocarbons, oils, and lipids : biodegradation and bioremediation By dal.novanet.ca Published On :: Fri, 1 May 2020 19:44:43 -0300 Callnumber: Online ISBN: 9783319445359 (electronic bk.) Full Article
ip Biscuit, cookie and cracker process and recipes By dal.novanet.ca Published On :: Fri, 1 May 2020 19:44:43 -0300 Author: Sykes, Glyn, author. Callnumber: Online ISBN: 9780128206133 (electronic bk.) Full Article
ip Anaerobic utilization of hydrocarbons, oils, and lipids By dal.novanet.ca Published On :: Fri, 1 May 2020 19:44:43 -0300 Callnumber: Online ISBN: 9783319503912 (electronic bk.) Full Article
ip Wine Retailers Seek Alcohol Shipping Compromise with 18 States By www.prweb.com Published On :: National Association of Wine Retailers Release Letter Delivered to Attorneys General and Alcohol Regulatory Chiefs Concerning Unconstitutional and Unenforceable Wine Shipping Bans (PRWeb April 15, 2020) Read the full story at https://www.prweb.com/releases/wine_retailers_seek_alcohol_shipping_compromise_with_18_states/prweb17050617.htm Full Article
ip New Partnerships Emerge for COVID-19 Relief: Dade County Farm Bureau... By www.prweb.com Published On :: Harvested produce crops feed Florida Department of Corrections’ (FDC) more than 87,000 inmates; action saves food costs while reducing COVID-19 related supply chain impacts. (PRWeb April 20, 2020) Read the full story at https://www.prweb.com/releases/new_partnerships_emerge_for_covid_19_relief_dade_county_farm_bureau_teams_with_state_leaders_to_launch_farm_to_inmate_program/prweb17052045.htm Full Article
ip Efficient estimation of linear functionals of principal components By projecteuclid.org Published On :: Mon, 17 Feb 2020 04:02 EST Vladimir Koltchinskii, Matthias Löffler, Richard Nickl. Source: The Annals of Statistics, Volume 48, Number 1, 464--490.Abstract: We study principal component analysis (PCA) for mean zero i.i.d. Gaussian observations $X_{1},\dots,X_{n}$ in a separable Hilbert space $\mathbb{H}$ with unknown covariance operator $\Sigma$. The complexity of the problem is characterized by its effective rank $\mathbf{r}(\Sigma):=\frac{\operatorname{tr}(\Sigma)}{\|\Sigma\|}$, where $\mathrm{tr}(\Sigma)$ denotes the trace of $\Sigma$ and $\|\Sigma\|$ denotes its operator norm. We develop a method of bias reduction in the problem of estimation of linear functionals of eigenvectors of $\Sigma$. Under the assumption that $\mathbf{r}(\Sigma)=o(n)$, we establish the asymptotic normality and asymptotic properties of the risk of the resulting estimators and prove matching minimax lower bounds, showing their semiparametric optimality. Full Article
ip Testing for principal component directions under weak identifiability By projecteuclid.org Published On :: Mon, 17 Feb 2020 04:02 EST Davy Paindaveine, Julien Remy, Thomas Verdebout. Source: The Annals of Statistics, Volume 48, Number 1, 324--345.Abstract: We consider the problem of testing, on the basis of a $p$-variate Gaussian random sample, the null hypothesis $\mathcal{H}_{0}:\boldsymbol{\theta}_{1}=\boldsymbol{\theta}_{1}^{0}$ against the alternative $\mathcal{H}_{1}:\boldsymbol{\theta}_{1}\neq \boldsymbol{\theta}_{1}^{0}$, where $\boldsymbol{\theta}_{1}$ is the “first” eigenvector of the underlying covariance matrix and $\boldsymbol{\theta}_{1}^{0}$ is a fixed unit $p$-vector. In the classical setup where eigenvalues $\lambda_{1}>\lambda_{2}\geq \cdots \geq \lambda_{p}$ are fixed, the Anderson (Ann. Math. Stat. 34 (1963) 122–148) likelihood ratio test (LRT) and the Hallin, Paindaveine and Verdebout (Ann. Statist. 38 (2010) 3245–3299) Le Cam optimal test for this problem are asymptotically equivalent under the null hypothesis, hence also under sequences of contiguous alternatives. We show that this equivalence does not survive asymptotic scenarios where $\lambda_{n1}/\lambda_{n2}=1+O(r_{n})$ with $r_{n}=O(1/\sqrt{n})$. For such scenarios, the Le Cam optimal test still asymptotically meets the nominal level constraint, whereas the LRT severely overrejects the null hypothesis. Consequently, the former test should be favored over the latter one whenever the two largest sample eigenvalues are close to each other. By relying on Le Cam’s asymptotic theory of statistical experiments, we study the non-null and optimality properties of the Le Cam optimal test in the aforementioned asymptotic scenarios and show that the null robustness of this test is not obtained at the expense of power. Our asymptotic investigation is extensive in the sense that it allows $r_{n}$ to converge to zero at an arbitrary rate. While we restrict to single-spiked spectra of the form $\lambda_{n1}>\lambda_{n2}=\cdots =\lambda_{np}$ to make our results as striking as possible, we extend our results to the more general elliptical case. Finally, we present an illustrative real data example. Full Article
ip New $G$-formula for the sequential causal effect and blip effect of treatment in sequential causal inference By projecteuclid.org Published On :: Mon, 17 Feb 2020 04:02 EST Xiaoqin Wang, Li Yin. Source: The Annals of Statistics, Volume 48, Number 1, 138--160.Abstract: In sequential causal inference, two types of causal effects are of practical interest, namely, the causal effect of the treatment regime (called the sequential causal effect) and the blip effect of treatment on the potential outcome after the last treatment. The well-known $G$-formula expresses these causal effects in terms of the standard parameters. In this article, we obtain a new $G$-formula that expresses these causal effects in terms of the point observable effects of treatments similar to treatment in the framework of single-point causal inference. Based on the new $G$-formula, we estimate these causal effects by maximum likelihood via point observable effects with methods extended from single-point causal inference. We are able to increase precision of the estimation without introducing biases by an unsaturated model imposing constraints on the point observable effects. We are also able to reduce the number of point observable effects in the estimation by treatment assignment conditions. Full Article
ip Two-step semiparametric empirical likelihood inference By projecteuclid.org Published On :: Mon, 17 Feb 2020 04:02 EST Francesco Bravo, Juan Carlos Escanciano, Ingrid Van Keilegom. Source: The Annals of Statistics, Volume 48, Number 1, 1--26.Abstract: In both parametric and certain nonparametric statistical models, the empirical likelihood ratio satisfies a nonparametric version of Wilks’ theorem. For many semiparametric models, however, the commonly used two-step (plug-in) empirical likelihood ratio is not asymptotically distribution-free, that is, its asymptotic distribution contains unknown quantities, and hence Wilks’ theorem breaks down. This article suggests a general approach to restore Wilks’ phenomenon in two-step semiparametric empirical likelihood inferences. The main insight consists in using as the moment function in the estimating equation the influence function of the plug-in sample moment. The proposed method is general; it leads to a chi-squared limiting distribution with known degrees of freedom; it is efficient; it does not require undersmoothing; and it is less sensitive to the first-step than alternative methods, which is particularly appealing for high-dimensional settings. Several examples and simulation studies illustrate the general applicability of the procedure and its excellent finite sample performance relative to competing methods. Full Article
ip Distributed estimation of principal eigenspaces By projecteuclid.org Published On :: Wed, 30 Oct 2019 22:03 EDT Jianqing Fan, Dong Wang, Kaizheng Wang, Ziwei Zhu. Source: The Annals of Statistics, Volume 47, Number 6, 3009--3031.Abstract: Principal component analysis (PCA) is fundamental to statistical machine learning. It extracts latent principal factors that contribute to the most variation of the data. When data are stored across multiple machines, however, communication cost can prohibit the computation of PCA in a central location and distributed algorithms for PCA are thus needed. This paper proposes and studies a distributed PCA algorithm: each node machine computes the top $K$ eigenvectors and transmits them to the central server; the central server then aggregates the information from all the node machines and conducts a PCA based on the aggregated information. We investigate the bias and variance for the resulting distributed estimator of the top $K$ eigenvectors. In particular, we show that for distributions with symmetric innovation, the empirical top eigenspaces are unbiased, and hence the distributed PCA is “unbiased.” We derive the rate of convergence for distributed PCA estimators, which depends explicitly on the effective rank of covariance, eigengap, and the number of machines. We show that when the number of machines is not unreasonably large, the distributed PCA performs as well as the whole sample PCA, even without full access of whole data. The theoretical results are verified by an extensive simulation study. We also extend our analysis to the heterogeneous case where the population covariance matrices are different across local machines but share similar top eigenstructures. Full Article
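The one-round scheme described in the entry above (local top eigenvectors sent to a server, then a PCA on the aggregate) can be sketched as follows. This is a simplified illustration for $K=1$ in pure Python, not the authors' code. The aggregation rule, which averages the local projection matrices $vv^T$, is an assumption about what "aggregated information" means here. The data, dimensions, and function names are hypothetical.

```python
import math
import random


def top_eigenvector(mat, iters=200):
    """Leading eigenvector of a symmetric PSD matrix via power iteration."""
    d = len(mat)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iters):
        w = [sum(mat[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


def sample_covariance(rows):
    """Covariance of mean-zero observations: (1/n) * sum of x x^T."""
    n, d = len(rows), len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) / n for j in range(d)]
            for i in range(d)]


def distributed_top_direction(shards):
    """Each shard sends only its top eigenvector; the server averages the
    projection matrices v v^T (assumed aggregation rule) and returns the
    top eigenvector of that average."""
    local = [top_eigenvector(sample_covariance(s)) for s in shards]
    d = len(local[0])
    avg = [[sum(v[i] * v[j] for v in local) / len(local) for j in range(d)]
           for i in range(d)]
    return top_eigenvector(avg)


# Hypothetical data: 4 machines, 200 mean-zero Gaussian rows each, with
# standard deviation 5 along the first coordinate and 1 along the others.
random.seed(0)
shards = [[[random.gauss(0, 5), random.gauss(0, 1), random.gauss(0, 1)]
           for _ in range(200)] for _ in range(4)]
estimate = distributed_top_direction(shards)
print("alignment with first coordinate axis:", abs(estimate[0]))
```

Averaging $vv^T$ rather than the vectors themselves sidesteps the sign ambiguity of eigenvectors: $v$ and $-v$ give the same projection matrix, so the local estimates cannot cancel each other out.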
ip A unified treatment of multiple testing with prior knowledge using the p-filter By projecteuclid.org Published On :: Fri, 02 Aug 2019 22:04 EDT Aaditya K. Ramdas, Rina F. Barber, Martin J. Wainwright, Michael I. Jordan. Source: The Annals of Statistics, Volume 47, Number 5, 2790--2821.Abstract: There is a significant literature on methods for incorporating knowledge into multiple testing procedures so as to improve their power and precision. Some common forms of prior knowledge include (a) beliefs about which hypotheses are null, modeled by nonuniform prior weights; (b) differing importances of hypotheses, modeled by differing penalties for false discoveries; (c) multiple arbitrary partitions of the hypotheses into (possibly overlapping) groups and (d) knowledge of independence, positive or arbitrary dependence between hypotheses or groups, suggesting the use of more aggressive or conservative procedures. We present a unified algorithmic framework called p-filter for global null testing and false discovery rate (FDR) control that allows the scientist to incorporate all four types of prior knowledge (a)–(d) simultaneously, recovering a variety of known algorithms as special cases. Full Article
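For orientation on what FDR control looks like in code, the sketch below implements the classical Benjamini-Hochberg step-up procedure. It is not the p-filter from the entry above and uses none of the prior-knowledge types (a)–(d). It is offered only as a familiar baseline, with hypothetical p-values and an illustrative function name.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return sorted indices of hypotheses rejected by the BH step-up rule."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k = 0  # largest rank whose p-value falls under its threshold
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= alpha * rank / m:
            k = rank
    # Step-up: reject every hypothesis ranked at or below k.
    return sorted(order[:k])


# Hypothetical p-values for six hypotheses.
p = [0.001, 0.009, 0.04, 0.041, 0.2, 0.7]
print(benjamini_hochberg(p, alpha=0.05))
```

The loop keeps the last rank that passes rather than stopping at the first failure. That is what makes the rule "step-up": a small p-value that misses its own threshold is still rejected if a larger one later passes.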
ip Semiparametrically point-optimal hybrid rank tests for unit roots By projecteuclid.org Published On :: Fri, 02 Aug 2019 22:04 EDT Bo Zhou, Ramon van den Akker, Bas J. M. Werker. Source: The Annals of Statistics, Volume 47, Number 5, 2601--2638.Abstract: We propose a new class of unit root tests that exploits invariance properties in the Locally Asymptotically Brownian Functional limit experiment associated to the unit root model. The invariance structures naturally suggest tests that are based on the ranks of the increments of the observations, their average and an assumed reference density for the innovations. The tests are semiparametric in the sense that they are valid, that is, have the correct (asymptotic) size, irrespective of the true innovation density. For a correctly specified reference density, our test is point-optimal and nearly efficient. For arbitrary reference densities, we establish a Chernoff–Savage-type result, that is, our test performs as well as commonly used tests under Gaussian innovations but has improved power under other, for example, fat-tailed or skewed, innovation distributions. To avoid nonparametric estimation, we propose a simplified version of our test that exhibits the same asymptotic properties, except for the Chernoff–Savage result that we are only able to demonstrate by means of simulations. Full Article
ip Bayes and empirical-Bayes multiplicity adjustment in the variable-selection problem By projecteuclid.org Published On :: Thu, 05 Aug 2010 15:41 EDT James G. Scott, James O. Berger. Source: Ann. Statist., Volume 38, Number 5, 2587--2619.Abstract: This paper studies the multiplicity-correction effect of standard Bayesian variable-selection priors in linear regression. Our first goal is to clarify when, and how, multiplicity correction happens automatically in Bayesian analysis, and to distinguish this correction from the Bayesian Ockham’s-razor effect. Our second goal is to contrast empirical-Bayes and fully Bayesian approaches to variable selection through examples, theoretical results and simulations. Considerable differences between the two approaches are found. In particular, we prove a theorem that characterizes a surprising asymptotic discrepancy between fully Bayes and empirical Bayes. This discrepancy arises from a different source than the failure to account for hyperparameter uncertainty in the empirical-Bayes estimate. Indeed, even at the extreme, when the empirical-Bayes estimate converges asymptotically to the true variable-inclusion probability, the potential for a serious difference remains. Full Article
ip A comparison of principal component methods between multiple phenotype regression and multiple SNP regression in genetic association studies By projecteuclid.org Published On :: Wed, 15 Apr 2020 22:05 EDT Zhonghua Liu, Ian Barnett, Xihong Lin. Source: The Annals of Applied Statistics, Volume 14, Number 1, 433--451.Abstract: Principal component analysis (PCA) is a popular method for dimension reduction in unsupervised multivariate analysis. However, existing ad hoc uses of PCA in both multivariate regression (multiple outcomes) and multiple regression (multiple predictors) lack theoretical justification. The differences in the statistical properties of PCAs in these two regression settings are not well understood. In this paper we provide theoretical results on the power of PCA in genetic association testings in both multiple phenotype and SNP-set settings. The multiple phenotype setting refers to the case when one is interested in studying the association between a single SNP and multiple phenotypes as outcomes. The SNP-set setting refers to the case when one is interested in studying the association between multiple SNPs in a SNP set and a single phenotype as the outcome. We demonstrate analytically that the properties of the PC-based analysis in these two regression settings are substantially different. We show that the lower order PCs, that is, PCs with large eigenvalues, are generally preferred and lead to a higher power in the SNP-set setting, while the higher-order PCs, that is, PCs with small eigenvalues, are generally preferred in the multiple phenotype setting. We also investigate the power of three other popular statistical methods, the Wald test, the variance component test and the minimum $p$-value test, in both multiple phenotype and SNP-set settings. We use theoretical power, simulation studies, and two real data analyses to validate our findings. Full Article
ip Estimating the health effects of environmental mixtures using Bayesian semiparametric regression and sparsity inducing priors By projecteuclid.org Published On :: Wed, 15 Apr 2020 22:05 EDT Joseph Antonelli, Maitreyi Mazumdar, David Bellinger, David Christiani, Robert Wright, Brent Coull. Source: The Annals of Applied Statistics, Volume 14, Number 1, 257--275.Abstract: Humans are routinely exposed to mixtures of chemical and other environmental factors, making the quantification of health effects associated with environmental mixtures a critical goal for establishing environmental policy sufficiently protective of human health. The quantification of the effects of exposure to an environmental mixture poses several statistical challenges. It is often the case that exposure to multiple pollutants interact with each other to affect an outcome. Further, the exposure-response relationship between an outcome and some exposures, such as some metals, can exhibit complex, nonlinear forms, since some exposures can be beneficial and detrimental at different ranges of exposure. To estimate the health effects of complex mixtures, we propose a flexible Bayesian approach that allows exposures to interact with each other and have nonlinear relationships with the outcome. We induce sparsity using multivariate spike and slab priors to determine which exposures are associated with the outcome and which exposures interact with each other. The proposed approach is interpretable, as we can use the posterior probabilities of inclusion into the model to identify pollutants that interact with each other. We utilize our approach to study the impact of exposure to metals on child neurodevelopment in Bangladesh and find a nonlinear, interactive relationship between arsenic and manganese. Full Article
ip A hierarchical Bayesian model for predicting ecological interactions using scaled evolutionary relationships By projecteuclid.org Published On :: Wed, 15 Apr 2020 22:05 EDT Mohamad Elmasri, Maxwell J. Farrell, T. Jonathan Davies, David A. Stephens. Source: The Annals of Applied Statistics, Volume 14, Number 1, 221--240.Abstract: Identifying undocumented or potential future interactions among species is a challenge facing modern ecologists. Recent link prediction methods rely on trait data; however, large species interaction databases are typically sparse and covariates are limited to only a fraction of species. On the other hand, evolutionary relationships, encoded as phylogenetic trees, can act as proxies for underlying traits and historical patterns of parasite sharing among hosts. We show that, using a network-based conditional model, phylogenetic information provides strong predictive power in a recently published global database of host-parasite interactions. By scaling the phylogeny using an evolutionary model, our method allows for biological interpretation often missing from latent variable models. To further improve on the phylogeny-only model, we combine a hierarchical Bayesian latent score framework for bipartite graphs that accounts for the number of interactions per species with host dependence informed by phylogeny. Combining the two information sources yields significant improvement in predictive accuracy over each of the submodels alone. As many interaction networks are constructed from presence-only data, we extend the model by integrating a correction mechanism for missing interactions which proves valuable in reducing uncertainty in unobserved interactions. Full Article
ip Propensity score weighting for causal inference with multiple treatments By projecteuclid.org Published On :: Wed, 27 Nov 2019 22:01 EST Fan Li, Fan Li. Source: The Annals of Applied Statistics, Volume 13, Number 4, 2389--2415.Abstract: Causal or unconfounded descriptive comparisons between multiple groups are common in observational studies. Motivated by a racial disparity study in health services research, we propose a unified propensity score weighting framework, the balancing weights, for estimating causal effects with multiple treatments. These weights incorporate the generalized propensity scores to balance the weighted covariate distribution of each treatment group, all weighted toward a common prespecified target population. The class of balancing weights includes several existing approaches such as the inverse probability weights and trimming weights as special cases. Within this framework, we propose a set of target estimands based on linear contrasts. We further develop the generalized overlap weights, constructed as the product of the inverse probability weights and the harmonic mean of the generalized propensity scores. The generalized overlap weighting scheme corresponds to the target population with the most overlap in covariates across the multiple treatments. These weights are bounded and thus bypass the problem of extreme propensities. We show that the generalized overlap weights minimize the total asymptotic variance of the moment weighting estimators for the pairwise contrasts within the class of balancing weights. We consider two balance check criteria and propose a new sandwich variance estimator for estimating the causal effects with generalized overlap weights. We apply these methods to study the racial disparities in medical expenditure between several racial groups using the 2009 Medical Expenditure Panel Survey (MEPS) data. Simulations were carried out to compare with existing methods. Full Article
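The construction of the generalized overlap weights stated in this abstract (inverse probability weight times the harmonic mean of the generalized propensity scores) can be written out for a single unit as below. This is a small sketch under stated assumptions: the propensity scores are taken as given rather than estimated, the function name `generalized_overlap_weight` is hypothetical, and the constant factor in the harmonic mean may differ from the paper's normalization (it cancels once weights are normalized within each treatment group).

```python
def generalized_overlap_weight(propensities, treatment):
    """Weight for one unit: (1 / e_z) * harmonic mean of (e_1, ..., e_J).

    propensities: generalized propensity scores e_1..e_J for this unit,
                  assumed strictly positive and summing to one.
    treatment:    zero-based index z of the treatment the unit received.
    """
    if any(e <= 0.0 for e in propensities):
        raise ValueError("propensity scores must be strictly positive")
    n_treatments = len(propensities)
    harmonic_mean = n_treatments / sum(1.0 / e for e in propensities)
    inverse_probability_weight = 1.0 / propensities[treatment]
    return inverse_probability_weight * harmonic_mean


# A unit with scores (0.5, 0.25, 0.25) that received the first treatment.
typical = generalized_overlap_weight([0.5, 0.25, 0.25], 0)

# A unit with an extreme score: the plain inverse probability weight is 1000,
# while the overlap weight cannot exceed the number of treatments (3 here).
extreme = generalized_overlap_weight([0.001, 0.4995, 0.4995], 0)
```

The bound follows because the weight equals $J / (1 + \sum_{k \neq z} e_z / e_k)$, which is at most $J$; this is the boundedness property the abstract mentions.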
ip Principal nested shape space analysis of molecular dynamics data By projecteuclid.org Published On :: Wed, 27 Nov 2019 22:01 EST Ian L. Dryden, Kwang-Rae Kim, Charles A. Laughton, Huiling Le. Source: The Annals of Applied Statistics, Volume 13, Number 4, 2213--2234.Abstract: Molecular dynamics simulations produce huge datasets of temporal sequences of molecules. It is of interest to summarize the shape evolution of the molecules in a succinct, low-dimensional representation. However, Euclidean techniques such as principal components analysis (PCA) can be problematic as the data may lie far from a flat manifold. Principal nested spheres gives a fundamentally different decomposition of data from the usual Euclidean subspace-based PCA [Biometrika 99 (2012) 551–568]. Subspaces of successively lower dimension are fitted to the data in a backwards manner with the aim of retaining signal and dispensing with noise at each stage. We adapt the methodology to 3D subshape spaces and provide some practical fitting algorithms. The methodology is applied to cluster analysis of peptides, where different states of the molecules can be identified. Also, the temporal transitions between cluster states are explored. Full Article
ip Estimating abundance from multiple sampling capture-recapture data via a multi-state multi-period stopover model By projecteuclid.org Published On :: Wed, 27 Nov 2019 22:01 EST Hannah Worthington, Rachel McCrea, Ruth King, Richard Griffiths. Source: The Annals of Applied Statistics, Volume 13, Number 4, 2043--2064.Abstract: Capture-recapture studies often involve collecting data on numerous capture occasions over a relatively short period of time. For many study species this process is repeated, for example, annually, resulting in capture information spanning multiple sampling periods. To account for the different temporal scales, the robust design class of models have traditionally been applied providing a framework in which to analyse all of the available capture data in a single likelihood expression. However, these models typically require strong constraints, either the assumption of closure within a sampling period (the closed robust design) or conditioning on the number of individuals captured within a sampling period (the open robust design). For real datasets these assumptions may not be appropriate. We develop a general modelling structure that requires neither assumption by explicitly modelling the movement of individuals into the population both within and between the sampling periods, which in turn permits the estimation of abundance within a single consistent framework. The flexibility of the novel model structure is further demonstrated by including the computationally challenging case of multi-state data where there is individual time-varying discrete covariate information. We derive an efficient likelihood expression for the new multi-state multi-period stopover model using the hidden Markov model framework. 
We demonstrate the significant improvement in parameter estimation using our new modelling approach in terms of both the multi-period and multi-state components through both a simulation study and a real dataset relating to the protected species of great crested newts, Triturus cristatus. Full Article
ip A semiparametric modeling approach using Bayesian Additive Regression Trees with an application to evaluate heterogeneous treatment effects By projecteuclid.org Published On :: Wed, 16 Oct 2019 22:03 EDT Bret Zeldow, Vincent Lo Re III, Jason Roy. Source: The Annals of Applied Statistics, Volume 13, Number 3, 1989--2010.Abstract: Bayesian Additive Regression Trees (BART) is a flexible machine learning algorithm capable of capturing nonlinearities between an outcome and covariates and interactions among covariates. We extend BART to a semiparametric regression framework in which the conditional expectation of an outcome is a function of treatment, its effect modifiers, and confounders. The confounders are allowed to have unspecified functional form, while treatment and effect modifiers that are directly related to the research question are given a linear form. The result is a Bayesian semiparametric linear regression model where the posterior distribution of the parameters of the linear part can be interpreted as in parametric Bayesian regression. This is useful in situations where a subset of the variables are of substantive interest and the others are nuisance variables that we would like to control for. An example of this occurs in causal modeling with the structural mean model (SMM). Under certain causal assumptions, our method can be used as a Bayesian SMM. Our methods are demonstrated with simulation studies and an application to a dataset involving adults with HIV/Hepatitis C coinfection who newly initiate antiretroviral therapy. The methods are available in an R package called semibart. Full Article
ip Radio-iBAG: Radiomics-based integrative Bayesian analysis of multiplatform genomic data By projecteuclid.org Published On :: Wed, 16 Oct 2019 22:03 EDT Youyi Zhang, Jeffrey S. Morris, Shivali Narang Aerry, Arvind U. K. Rao, Veerabhadran Baladandayuthapani. Source: The Annals of Applied Statistics, Volume 13, Number 3, 1957--1988.Abstract: Technological innovations have produced large multi-modal datasets that include imaging and multi-platform genomics data. Integrative analyses of such data have the potential to reveal important biological and clinical insights into complex diseases like cancer. In this paper, we present Bayesian approaches for integrative analysis of radiological imaging and multi-platform genomic data, where-in our goals are to simultaneously identify genomic and radiomic, that is, radiology-based imaging markers, along with the latent associations between these two modalities, and to detect the overall prognostic relevance of the combined markers. For this task, we propose Radio-iBAG: Radiomics-based Integrative Bayesian Analysis of Multiplatform Genomic Data , a multi-scale Bayesian hierarchical model that involves several innovative strategies: it incorporates integrative analysis of multi-platform genomic data sets to capture fundamental biological relationships; explores the associations between radiomic markers accompanying genomic information with clinical outcomes; and detects genomic and radiomic markers associated with clinical prognosis. We also introduce the use of sparse Principal Component Analysis (sPCA) to extract a sparse set of approximately orthogonal meta-features each containing information from a set of related individual radiomic features, reducing dimensionality and combining like features. Our methods are motivated by and applied to The Cancer Genome Atlas glioblastoma multiforme data set, where-in we integrate magnetic resonance imaging-based biomarkers along with genomic, epigenomic and transcriptomic data. 
Our model identifies important magnetic resonance imaging features and the associated genomic platforms that are related to patient survival times. Full Article