clustering

Clustering of Microtubule-based Motor Proteins: The Biological Roles and Mechanical Effects

Event Begins: Thursday, November 14, 2024 3:00pm
Location: Medical Science Unit II
Organized By: Department of Molecular, Cellular, and Developmental Biology


Mentor: Kristen Verhey




clustering

A Clustering Approach for Collaborative Filtering Recommendation Using Social Network Analysis

Collaborative Filtering (CF) is a well-known technique in recommender systems. CF exploits relationships between users and recommends items to the active user according to the ratings of their neighbors. CF suffers from the data sparsity problem, where users rate only a small set of items, which makes the computation of similarity between users imprecise and consequently reduces the accuracy of CF algorithms. In this article, we propose a clustering approach based on the social information of users to derive recommendations. We study the application of this approach in two scenarios: academic venue recommendation based on collaboration information, and trust-based recommendation. In an evaluation using data from the DBLP digital library and Epinions, our clustering-based CF performs better than traditional CF algorithms.
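A minimal sketch of the idea in this abstract, with hypothetical ratings and clusters rather than the paper's data: users are first grouped (e.g., from social links), and the active user's prediction for an unseen item is the mean rating of that item within their cluster.

```python
# Hypothetical toy data: user -> {item: rating}; the clusters would come from
# social-network analysis (e.g., collaboration or trust links).
ratings = {
    "u1": {"a": 5, "b": 4},
    "u2": {"a": 4, "b": 5, "c": 2},
    "u3": {"c": 5, "d": 4},
    "u4": {"c": 4, "d": 5},
}
clusters = [{"u1", "u2"}, {"u3", "u4"}]

def predict(user, item):
    """Mean rating of `item` among the user's cluster-mates (None if unrated)."""
    cluster = next(c for c in clusters if user in c)
    votes = [ratings[u][item] for u in cluster if u != user and item in ratings[u]]
    return sum(votes) / len(votes) if votes else None

print(predict("u1", "c"))  # u2 is u1's only cluster-mate who rated "c" -> 2.0
```

Sparsity is handled implicitly: only cluster-mates who actually rated the item vote, and an item no cluster-mate rated yields no prediction.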




clustering

An English MOOC similar resource clustering method based on grey correlation

To address the low clustering accuracy and efficiency of traditional similar-resource clustering methods, this paper studies an English MOOC similar-resource clustering method based on grey correlation. Principal component analysis is used to extract the features of similar English MOOC resources, and feature selection methods are used to pre-process them. On this basis, the pre-processed features are standardised using the grey correlation method, the correlation degree between different features is calculated, and a correlation matrix is constructed to cluster the similar English MOOC resources. The experimental results show that the silhouette coefficient of the proposed method is close to one, the clustering accuracy of similar resources reaches 94.2%, and the clustering time is only 22.3 ms.
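The grey correlation step can be sketched as follows. This is the standard Deng grey relational degree with distinguishing coefficient ρ = 0.5, applied to hypothetical standardised feature sequences; the paper's exact variant and data are not reproduced here.

```python
def grey_relational_degrees(ref, candidates, rho=0.5):
    """Deng's grey relational degree of each candidate sequence against `ref`.
    dmin/dmax are taken over all candidates, as in grey relational analysis."""
    deltas = [[abs(r - s) for r, s in zip(ref, seq)] for seq in candidates]
    flat = [d for row in deltas for d in row]
    dmin, dmax = min(flat), max(flat)
    if dmax == 0:  # all candidates identical to the reference
        return [1.0] * len(candidates)
    return [sum((dmin + rho * dmax) / (d + rho * dmax) for d in row) / len(row)
            for row in deltas]

reference = [0.2, 0.5, 0.8]                    # hypothetical standardised features
cands = [[0.25, 0.5, 0.75], [0.9, 0.1, 0.2]]   # one similar, one dissimilar
close, far = grey_relational_degrees(reference, cands)
print(round(close, 3), round(far, 3))          # the similar sequence scores higher
```

Pairwise degrees like these fill the correlation matrix that the clustering step then operates on.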




clustering

Advanced Data Clustering Methods of Mining Web Documents




clustering

A Multicluster Approach to Selecting Initial Sets for Clustering of Categorical Data

Aim/Purpose: This article proposes a methodology for selecting the initial sets for clustering categorical data. The main idea is to combine all the different values of every single criterion or attribute to form the first proposal of so-called multiclusters, obtaining in this way the maximum number of clusters for the whole dataset. The multiclusters thus obtained are themselves clustered in a second step, according to the desired final number of clusters.

Background: Popular clustering methods for categorical data, such as the well-known K-Modes, usually select the initial sets by means of some random process, which introduces randomness into the final results. We explore a different application of the clustering methodology for categorical data that overcomes these instability problems and ultimately provides greater clustering efficiency.

Methodology: To assess the performance of the proposed algorithm and compare it with K-Modes, we apply both to categorical databases where the response variable is known but not used in the analysis. In our examples, that response variable can be identified with the real clusters or classes to which the observations belong. With every dataset, we perform a two-step analysis: first we perform the clustering analysis on data where the response variable (the real clusters) has been omitted, and then we use that omitted information to check the efficiency of the clustering algorithm (by comparing the real clusters to those given by the algorithm).

Contribution: Simplicity, efficiency, and stability are the main advantages of the multicluster method.

Findings: The experimental results attained with real databases show that the multicluster algorithm has greater precision and a better grouping effect than the classical K-Modes algorithm.

Recommendations for Practitioners: The method can be useful for researchers working with small and medium-sized datasets, allowing them to detect the underlying structure of the data in an intuitive and reasonable way.

Recommendation for Researchers: The proposed algorithm is slower than K-Modes, since it devotes a lot of time to calculating the initial combinations of attributes. Reducing the computing time is therefore an important research topic.

Future Research: We are concerned with the scalability of the algorithm to large and complex datasets, as well as its application to mixed datasets with both quantitative and qualitative attributes.
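A sketch of the two steps on hypothetical categorical data. The merge criterion in step 2 (Hamming distance between multicluster modes) is an assumption for illustration; the article's exact second-step procedure is not specified in this abstract.

```python
from collections import Counter, defaultdict

data = [("red", "S"), ("red", "S"), ("red", "M"), ("blue", "L"), ("blue", "L")]

# Step 1: one multicluster per distinct combination of attribute values.
multiclusters = defaultdict(list)
for row in data:
    multiclusters[row].append(row)
clusters = list(multiclusters.values())

def mode(rows):  # per-attribute most frequent value
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*rows))

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

# Step 2: agglomerate the multiclusters down to k, repeatedly merging the two
# whose modes are closest in Hamming distance.
k = 2
while len(clusters) > k:
    i, j = min(((i, j) for i in range(len(clusters))
                for j in range(i + 1, len(clusters))),
               key=lambda p: hamming(mode(clusters[p[0]]), mode(clusters[p[1]])))
    clusters[i] += clusters.pop(j)

print(sorted(len(c) for c in clusters))  # -> [2, 3]
```

Because step 1 is fully determined by the data, repeated runs give identical results, which is the stability advantage over randomly seeded K-Modes.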




clustering

IDCUP Algorithm for Classifying Arbitrary Shapes and Densities for Center-based Clustering Performance Analysis

Aim/Purpose: Clustering techniques are normally used to determine significant and meaningful subclasses in datasets. Clustering is an unsupervised type of machine learning (ML) whose objective is to form groups of objects based on their similarity, and it is used to determine the implicit relationships between the different features of the data. Cluster analysis is a significant problem area in data exploration when dealing with arbitrary-shape problems in different datasets. Clustering on large datasets faces the following challenges: (1) clusters with arbitrary shapes; (2) limited knowledge available to decide on the possible input features; (3) scalability to large data sizes. Density-based clustering is known as a dominant method for determining arbitrary-shape clusters.

Background: Existing density-based clustering methods commonly cited in the literature have been examined in terms of their behavior with data sets that contain nested clusters of varying density. The existing methods are not ideal for such data sets, because they typically partition the data into clusters that cannot be nested.

Methodology: A density-based approach to traditional center-based clustering is introduced that assigns a weight to each cluster. The weights are then utilized in calculating the distances from data vectors to centroids by multiplying the distance by the centroid weight.

Contribution: In this paper, we examined different density-based clustering methods for data sets with nested clusters of varying density. Two such data sets were used to evaluate some of the commonly cited algorithms found in the literature. Nested clusters were found to be challenging for the existing algorithms: in most cases, the targeted algorithms either did not detect the largest clusters or simply divided large clusters into non-overlapping regions. It may, however, be possible to detect all clusters by running an algorithm multiple times with different inputs and then combining the results. This work considered three challenges of clustering methods.

Findings: With the proposed weighting, a centroid with a low weight attracts objects from further away than a centroid with a higher weight, which allows dense clusters inside larger clusters to be recognized. The methods are tested experimentally using the K-means, DBSCAN, TURN*, and IDCUP algorithms. The experimental results on different data sets showed that IDCUP is more robust and produces better clusters than DBSCAN, TURN*, and K-means, and that IDCUP shows better scalability compared to TURN*.

Future Research: As future work, we plan to explore further challenges of the knowledge discovery process in clustering, along with more complex data sets. A hybrid approach combining density-based and model-based clustering algorithms should also be compared, in order to achieve maximum accuracy and avoid the problems related to arbitrary shapes, including optimization. We anticipate that such a process will attain improved performance with comparable precision in identifying cluster shapes.
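The weighting idea in the Methodology can be made concrete in a few lines. The centroids and weights below are toy values of my own choosing, not the IDCUP algorithm itself: the effective distance is the Euclidean distance multiplied by the centroid weight, so a low-weight centroid attracts points from further away.

```python
import math

# Hypothetical centroids: (position, weight). A dense inner cluster gets a
# high weight; a sparse surrounding one gets a low weight.
centroids = {"dense": ((0.0, 0.0), 2.0), "sparse": ((4.0, 0.0), 0.5)}

def assign(point):
    """Assign to the centroid minimizing weighted distance = dist * weight."""
    return min(centroids,
               key=lambda name: math.dist(point, centroids[name][0]) * centroids[name][1])

print(assign((1.5, 0.0)))  # nearer to "dense", but the low weight of "sparse" wins
print(assign((0.2, 0.0)))  # very close to "dense" -> stays with "dense"
```

The first point sits only 1.5 from the dense center but 2.5 from the sparse one; after weighting (1.5 × 2.0 = 3.0 vs 2.5 × 0.5 = 1.25) it is claimed by the sparse cluster, which is how nested dense clusters can be carved out of a larger one.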




clustering

Fast fuzzy C-means clustering and deep Q network for personalised web directories recommendation

This paper proposes an efficient solution for personalised web directory recommendation using fast fuzzy C-means (FCM) clustering and a deep Q network (DQN). First, the web directory usage file obtained from the given dataset is fed into the accretion matrix computation module, where the visitor chain matrix, visitor chain binary matrix, directory chain matrix, and directory chain binary matrix are formulated. Directory grouping is accomplished with fast FCM, and matching between the query and a group is conducted using the Kumar-Hassebrook and Kulczynski similarities. The user's preferred directory is retrieved at this stage, and finally personalised web directories are recommended to visitors by means of the DQN. The proposed approach achieved superior results, with a maximum accuracy of 0.910, a minimum mean squared error (MSE) of 0.0206, and a root mean squared error (RMSE) of 0.144. Although the system offered strong outcomes, it could not rank web directories into high-, medium-, and low-interest categories.




clustering

Robust and automatic beamstop shadow outlier rejection: combining crystallographic statistics with modern clustering under a semi-supervised learning strategy

During the automatic processing of crystallographic diffraction experiments, beamstop shadows are often unaccounted for or only partially masked. As a result of this, outlier reflection intensities are integrated, which is a known issue. Traditional statistical diagnostics have only limited effectiveness in identifying these outliers, here termed Not-Excluded-unMasked-Outliers (NEMOs). The diagnostic tool AUSPEX allows visual inspection of NEMOs, where they form a typical pattern: clusters at the low-resolution end of the AUSPEX plots of intensities or amplitudes versus resolution. To automate NEMO detection, a new algorithm was developed by combining data statistics with a density-based clustering method. This approach demonstrates a promising performance in detecting NEMOs in merged data sets without disrupting existing data-reduction pipelines. Re-refinement results indicate that excluding the identified NEMOs can effectively enhance the quality of subsequent structure-determination steps. This method offers a prospective automated means to assess the efficacy of a beamstop mask, as well as highlighting the potential of modern pattern-recognition techniques for automating outlier exclusion during data processing, facilitating future adaptation to evolving experimental strategies.




clustering

Safety in numbers - The benefits of clustering for manufacturers

Emperor penguins huddle together to share warmth and protect each other during the intense winds of the harsh Antarctic storms. Fortunately, it’s not just penguins that can benefit from huddling together. Here, Jonathan Wilkins, marketing director of obsolete industrial parts supplier EU Automation, explains why manufacturers form clusters around the world.




clustering

On the robustness of graph-based clustering to random network alterations

R. Greg Stacey
Nov 4, 2020; RA120.002275v1
Research




clustering

On the robustness of graph-based clustering to random network alterations [Research]

Biological functions emerge from complex and dynamic networks of protein-protein interactions. Because these protein-protein interaction networks, or interactomes, represent pairwise connections within a hierarchically organized system, it is often useful to identify higher-order associations embedded within them, such as multi-member protein complexes. Graph-based clustering techniques are widely used to accomplish this goal, and dozens of field-specific and general clustering algorithms exist. However, interactomes can be prone to errors, especially when inferred from high-throughput biochemical assays. Therefore, robustness to network-level noise is an important criterion for any clustering algorithm that aims to generate robust, reproducible clusters. Here, we tested the robustness of a range of graph-based clustering algorithms in the presence of noise, including algorithms common across domains and those specific to protein networks. Strikingly, we found that all of the clustering algorithms tested here markedly amplified noise within the underlying protein interaction network. Randomly rewiring only 1% of network edges yielded more than a 50% change in clustering results, indicating that clustering markedly amplified network-level noise. Moreover, we found the impact of network noise on individual clusters was not uniform: some clusters were consistently robust to injected noise while others were not. To assist in assessing this, we developed the clust.perturb R package and Shiny web application to measure the reproducibility of clusters by randomly perturbing the network. We show that clust.perturb results are predictive of real-world cluster stability: poorly reproducible clusters as identified by clust.perturb are significantly less likely to be reclustered across experiments. 
We conclude that graph-based clustering amplifies noise in protein interaction networks, but quantifying the robustness of a cluster to network noise can separate stable protein complexes from spurious associations.
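The perturbation test can be sketched as follows. The graph is a toy example, connected components stand in for a real graph-clustering algorithm, and the rewiring fraction is an illustrative choice; this is the idea behind clust.perturb, not its actual implementation.

```python
import random

def components(edges, nodes):
    """Connected components via union-find (a stand-in clustering algorithm)."""
    parent = {v: v for v in nodes}
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for a, b in edges:
        parent[find(a)] = find(b)
    groups = {}
    for v in nodes:
        groups.setdefault(find(v), set()).add(v)
    return list(groups.values())

def rewire(edges, nodes, frac, rng):
    """Replace a random `frac` of the edges with random new endpoints."""
    edges = list(edges)
    for i in rng.sample(range(len(edges)), max(1, int(frac * len(edges)))):
        edges[i] = tuple(rng.sample(nodes, 2))
    return edges

def best_jaccard(cluster, clustering):
    """Score a cluster by its best overlap with any perturbed cluster."""
    return max(len(cluster & c) / len(cluster | c) for c in clustering)

rng = random.Random(0)
nodes = list(range(8))
edges = [(0, 1), (1, 2), (2, 0), (3, 4), (4, 5), (5, 3), (6, 7)]
base = components(edges, nodes)
perturbed = components(rewire(edges, nodes, 0.15, rng), nodes)
for cluster in base:
    print(sorted(cluster), round(best_jaccard(cluster, perturbed), 2))
```

Averaging the Jaccard score over many random rewirings gives a per-cluster reproducibility estimate: triangles with redundant edges tend to survive rewiring, while a cluster held together by a single edge does not.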




clustering

Multicolor circularly polarized luminescence: pendant primary amine/diphenylalanine chiral copolymers with clustering-triggered emission

Mater. Chem. Front., 2024, 8, 3596-3607
DOI: 10.1039/D4QM00228H, Research Article
Ryo Yonenuma, Aoi Takenaka, Tamaki Nakano, Hideharu Mori
Clustering-triggered emission (CTE) materials without π-conjugated chromophores have attracted increasing attention. In this study, we designed CTE-based circularly polarized luminescence (CPL) block and random copolymers, showing multicolor CPL.




clustering

Clustering around Koppal

Aequs Aerospace to create space for large-scale manufacture of toys at Koppal




clustering

Geographic Clustering and Resource Reallocation Across Firms in Chinese Industries [electronic journal].




clustering

Clustering, Growth, and Inequality in China [electronic journal].




clustering

Presentation of Clean-Tech Clustering as an Engine for Local Development: The Negev Region, Israel

The Negev region has the potential to deliver real and tangible benefits for regional development, green growth and social inclusion. The report explains how the region can exploit its existing strengths and competitive advantages, including a niche in research, demonstration and testing in renewable energies and water efficiency.




clustering

DBSCAN Clustering Algorithm in Machine Learning

An introduction to the DBSCAN algorithm and its implementation in Python.
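A compact, self-contained version of the algorithm (pure Python on toy points, rather than the scikit-learn implementation such an introduction would typically use):

```python
import math

def dbscan(points, eps, min_pts):
    """Minimal DBSCAN: returns one label per point; -1 marks noise."""
    labels = [None] * len(points)
    cid = -1

    def region(i):  # indices within eps of point i (including i itself)
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = region(i)
        if len(seeds) < min_pts:
            labels[i] = -1            # noise, may later become a border point
            continue
        cid += 1
        labels[i] = cid
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cid       # former noise -> border point, no expansion
                continue
            if labels[j] is not None:
                continue
            labels[j] = cid
            nbrs = region(j)
            if len(nbrs) >= min_pts:  # j is a core point: expand the cluster
                queue.extend(nbrs)
    return labels

pts = [(0, 0), (0, 0.3), (0.3, 0), (0.2, 0.2), (5, 5), (5, 5.3), (5.3, 5), (10, 10)]
print(dbscan(pts, eps=0.6, min_pts=3))  # -> [0, 0, 0, 0, 1, 1, 1, -1]
```

The two dense groups become clusters 0 and 1 without the number of clusters being specified in advance, and the isolated point at (10, 10) is labeled noise.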




clustering

Getting Started with Spectral Clustering

This post will unravel a practical example to illustrate and motivate the intuition behind each step of the spectral clustering algorithm.
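The steps can be made concrete on a toy weighted graph. Everything below is an illustrative assumption, not the post's example: a 4-node graph with one weak edge, the Fiedler vector extracted by power iteration instead of a full eigensolver, and a final split by sign.

```python
import math

# Weighted adjacency: two tight pairs (0-1 and 2-3) joined by a weak 1-2 edge.
W = [[0.0, 1.0, 0.0, 0.0],
     [1.0, 0.0, 0.1, 0.0],
     [0.0, 0.1, 0.0, 1.0],
     [0.0, 0.0, 1.0, 0.0]]
n = len(W)
deg = [sum(row) for row in W]
L = [[(deg[i] if i == j else 0.0) - W[i][j] for j in range(n)] for i in range(n)]

c = 2 * max(deg)   # shift so eigenvalues of c*I - L are nonnegative
M = [[(c if i == j else 0.0) - L[i][j] for j in range(n)] for i in range(n)]

v = [1.0, 0.5, -0.5, -1.0]             # any non-constant start vector
for _ in range(500):
    mean = sum(v) / n
    v = [x - mean for x in v]          # project out the trivial constant eigenvector
    v = [sum(M[i][j] * v[j] for j in range(n)) for i in range(n)]
    s = math.sqrt(sum(x * x for x in v))
    v = [x / s for x in v]             # v converges to the Fiedler vector of L

groups = [[i for i in range(n) if v[i] >= 0], [i for i in range(n) if v[i] < 0]]
print(groups)  # the weak 1-2 edge is cut: {0, 1} vs {2, 3}
```

The sign pattern of the Fiedler vector (the eigenvector of the Laplacian's second-smallest eigenvalue) recovers the bipartition across the weakest cut, which is the intuition the full algorithm generalizes with k eigenvectors plus k-means.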




clustering

Robustly Clustering a Mixture of Gaussians. (arXiv:1911.11838v5 [cs.DS] UPDATED)

We give an efficient algorithm for robustly clustering a mixture of two arbitrary Gaussians, a central open problem in the theory of computationally efficient robust estimation, assuming only that the means of the component Gaussians are well separated or their covariances are well separated. Our algorithm and analysis extend naturally to robustly clustering mixtures of well-separated strongly logconcave distributions. The mean separation required is close to the smallest possible to guarantee that most of the measure of each component can be separated by some hyperplane (for covariances, it is the same condition in the second-degree polynomial kernel). We also show that for Gaussian mixtures, separation in total variation distance suffices to achieve robust clustering. Our main tools are a new identifiability criterion based on isotropic position and the Fisher discriminant, and a corresponding Sum-of-Squares convex programming relaxation of fixed degree.




clustering

SYSTEMS AND METHODS FOR ONLINE CLUSTERING OF CONTENT ITEMS

Systems, methods, and non-transitory computer-readable media can obtain a first batch of content items to be clustered. A set of clusters can be generated by clustering respective binary hash codes for each content item in the first batch, wherein content items included in a cluster are visually similar to one another. A next batch of content items to be clustered can be obtained. One or more respective binary hash codes for the content items in the next batch can be assigned to a cluster in the set of clusters.
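A sketch of the claimed pipeline. The hash length, threshold, and leader-style assignment rule below are illustrative assumptions, not the patent's specification: each batch's binary hash codes are assigned to the nearest existing cluster center by Hamming distance, or open a new cluster when nothing is close enough.

```python
THRESHOLD = 2  # maximum Hamming distance to join an existing cluster

def hamming(a, b):
    return bin(a ^ b).count("1")

def cluster_batch(codes, centers):
    """Assign each code to a cluster index, extending `centers` in place."""
    assignments = []
    for code in codes:
        if centers:
            best = min(range(len(centers)), key=lambda i: hamming(code, centers[i]))
            if hamming(code, centers[best]) <= THRESHOLD:
                assignments.append(best)
                continue
        centers.append(code)              # nothing close: open a new cluster
        assignments.append(len(centers) - 1)
    return assignments

centers = []
first = cluster_batch([0b00000000, 0b00000001, 0b11110000], centers)
second = cluster_batch([0b11110001, 0b00000011], centers)  # next batch reuses clusters
print(first, second)  # -> [0, 0, 1] [1, 0]
```

The online property is carried entirely by `centers`: a later batch is matched against the clusters built from earlier batches rather than being clustered from scratch.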




clustering

Clustering




clustering

Semiclassical Standing Waves with Clustering Peaks for Nonlinear Schrödinger Equations

Jaeyoung Byeon, KAIST, and Kazunaga Tanaka, Waseda University - AMS, 2013, 89 pp., Softcover, ISBN-13: 978-0-8218-9163-6, List: US$71, All AMS Members: US$56.80, MEMO/229/1076

The authors study the following singularly perturbed problem: $-\epsilon^2\Delta u + V(x)u = f(u)$ in $\mathbf{R}^N$. Their main result is the...




clustering

Model-based clustering with envelopes

Wenjing Wang, Xin Zhang, Qing Mai.

Source: Electronic Journal of Statistics, Volume 14, Number 1, 82--109.

Abstract:
Clustering analysis is an important unsupervised learning technique in multivariate statistics and machine learning. In this paper, we propose a set of new mixture models called CLEMM (short for Clustering with Envelope Mixture Models), based on the widely used Gaussian mixture model assumptions and the nascent research area of envelope methodology. Formulated mostly for regression models, envelope methodology aims for simultaneous dimension reduction and efficient parameter estimation, and includes a very recent formulation of the envelope discriminant subspace for classification and discriminant analysis. Motivated by the envelope discriminant subspace pursuit in classification, we consider parsimonious probabilistic mixture models where the cluster analysis can be improved by projecting the data onto a latent lower-dimensional subspace. The proposed CLEMM framework and the associated envelope-EM algorithms thus provide foundations for envelope methods in unsupervised and semi-supervised learning problems. Numerical studies on simulated data and two benchmark data sets show significant improvement of our proposed methods over classical methods such as Gaussian mixture models, K-means, and hierarchical clustering algorithms. An R package is available at https://github.com/kusakehan/CLEMM.




clustering

A Bayesian approach to disease clustering using restricted Chinese restaurant processes

Claudia Wehrhahn, Samuel Leonard, Abel Rodriguez, Tatiana Xifara.

Source: Electronic Journal of Statistics, Volume 14, Number 1, 1449--1478.

Abstract:
Identifying disease clusters (areas with an unusually high incidence of a particular disease) is a common problem in epidemiology and public health. We describe a Bayesian nonparametric mixture model for disease clustering that constrains clusters to be made of adjacent areal units. This is achieved by modifying the exchangeable partition probability function associated with the Ewens sampling distribution. We call the resulting prior the Restricted Chinese Restaurant Process, as the associated full conditional distributions resemble those associated with the standard Chinese Restaurant Process. The model is illustrated using synthetic data sets and in an application to oral cancer mortality in Germany.
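For reference, the standard (unrestricted) Chinese Restaurant Process that this prior modifies can be simulated in a few lines; the restricted version would additionally constrain each "table" (cluster) to adjacent areal units. The sample size and concentration parameter below are arbitrary.

```python
import random

def crp_table_sizes(n, alpha, rng):
    """Simulate a Chinese Restaurant Process: customer i joins table t with
    probability |t| / (i + alpha), or a new table with alpha / (i + alpha)."""
    sizes = []
    for i in range(n):
        weights = sizes + [alpha]      # existing tables, then the new-table option
        r = rng.uniform(0, i + alpha)  # total weight after i customers is i + alpha
        t = 0
        while r > weights[t]:
            r -= weights[t]
            t += 1
        if t == len(sizes):
            sizes.append(0)
        sizes[t] += 1
    return sizes

rng = random.Random(7)
sizes = crp_table_sizes(50, alpha=1.0, rng=rng)
print(sizes)  # a random partition of 50 customers; larger tables attract more
```

The "rich get richer" weights are what make the CRP exchangeable; the paper's restriction reshapes exactly these weights so that only spatially contiguous seatings have positive probability.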




clustering

$k$-means clustering of extremes

Anja Janßen, Phyllis Wan.

Source: Electronic Journal of Statistics, Volume 14, Number 1, 1211--1233.

Abstract:
The $k$-means clustering algorithm and its variant, the spherical $k$-means clustering, are among the most important and popular methods in unsupervised learning and pattern detection. In this paper, we explore how the spherical $k$-means algorithm can be applied in the analysis of only the extremal observations from a data set. By making use of multivariate extreme value analysis we show how it can be adopted to find “prototypes” of extremal dependence and derive a consistency result for our suggested estimator. In the special case of max-linear models we show furthermore that our procedure provides an alternative way of statistical inference for this class of models. Finally, we provide data examples which show that our method is able to find relevant patterns in extremal observations and allows us to classify extremal events.
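A toy sketch of the paper's idea, with illustrative data, norm threshold, and initial centers: keep only the observations with large norms, project them onto the unit sphere, and run spherical k-means (assignment by cosine similarity) on the angular parts.

```python
import math

data = [(10, 1), (8, 0.5), (1, 9), (0.4, 7), (0.2, 0.1), (0.3, 0.2)]

def norm(v):
    return math.sqrt(sum(x * x for x in v))

# Keep only the extremes and reduce them to directions on the unit sphere.
extremes = [tuple(x / norm(p) for x in p) for p in data if norm(p) > 3]

def spherical_kmeans(points, centers, iters=10):
    for _ in range(iters):
        groups = [[] for _ in centers]
        for p in points:  # assign to the center with the largest inner product
            best = max(range(len(centers)),
                       key=lambda i: sum(a * b for a, b in zip(p, centers[i])))
            groups[best].append(p)
        new_centers = []
        for g, c in zip(groups, centers):
            if not g:
                new_centers.append(c)   # keep an empty cluster's old center
            else:
                s = [sum(coord) for coord in zip(*g)]
                new_centers.append(tuple(x / norm(s) for x in s))
        centers = new_centers
    return centers, groups

centers, groups = spherical_kmeans(extremes, [(1.0, 0.0), (0.0, 1.0)])
print([len(g) for g in groups])  # two directional "prototypes", two points each
```

The resulting centers are the "prototypes" of extremal dependence: directions in which large events tend to occur together, regardless of their magnitude.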




clustering

Modal clustering asymptotics with applications to bandwidth selection

Alessandro Casa, José E. Chacón, Giovanna Menardi.

Source: Electronic Journal of Statistics, Volume 14, Number 1, 835--856.

Abstract:
Density-based clustering relies on the idea of linking groups to some specific features of the probability distribution underlying the data. The reference to a true, yet unknown, population structure allows framing the clustering problem in a standard inferential setting, where the concept of ideal population clustering is defined as the partition induced by the true density function. The nonparametric formulation of this approach, known as modal clustering, draws a correspondence between the groups and the domains of attraction of the density modes. Operationally, a nonparametric density estimate is required and a proper selection of the amount of smoothing, governing the shape of the density and hence possibly the modal structure, is crucial to identify the final partition. In this work, we address the issue of density estimation for modal clustering from an asymptotic perspective. A natural and easy to interpret metric to measure the distance between density-based partitions is discussed, its asymptotic approximation explored, and employed to study the problem of bandwidth selection for nonparametric modal clustering.
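The bandwidth-to-modes link can be illustrated in one dimension, with toy data and bandwidths of my own choosing: a Gaussian KDE's modes are found by mean-shift ascent from every point, and the number of distinct modes (hence clusters) shrinks as the bandwidth h grows.

```python
import math

data = [0.0, 0.2, 0.4, 3.0, 3.1, 3.3]  # two well-separated groups

def mean_shift_mode(x, h, steps=200):
    """Ascend the Gaussian KDE from x via the mean-shift fixed-point update."""
    for _ in range(steps):
        w = [math.exp(-((x - d) / h) ** 2 / 2) for d in data]
        x = sum(wi * di for wi, di in zip(w, data)) / sum(w)
    return round(x, 3)

for h in (0.5, 3.0):
    modes = {mean_shift_mode(d, h) for d in data}
    print(f"h={h}: {len(modes)} mode(s) -> {len(modes)} cluster(s)")
```

With h = 0.5 the density has two modes and mean shift partitions the points into their two basins of attraction; with h = 3.0 the density is oversmoothed into a single mode and all points merge into one cluster, which is why bandwidth selection determines the final partition.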




clustering

Profile likelihood biclustering

Cheryl Flynn, Patrick Perry.

Source: Electronic Journal of Statistics, Volume 14, Number 1, 731--768.

Abstract:
Biclustering, the process of simultaneously clustering the rows and columns of a data matrix, is a popular and effective tool for finding structure in a high-dimensional dataset. Many biclustering procedures appear to work well in practice, but most do not have associated consistency guarantees. To address this shortcoming, we propose a new biclustering procedure based on profile likelihood. The procedure applies to a broad range of data modalities, including binary, count, and continuous observations. We prove that the procedure recovers the true row and column classes when the dimensions of the data matrix tend to infinity, even if the functional form of the data distribution is misspecified. The procedure requires a combinatorial search, which can be expensive in practice. Rather than performing this search directly, we propose a new heuristic optimization procedure based on the Kernighan-Lin heuristic, which has nice computational properties and performs well in simulations. We demonstrate our procedure with applications to congressional voting records and microarray analysis.




clustering

Path-Based Spectral Clustering: Guarantees, Robustness to Outliers, and Fast Algorithms

We consider the problem of clustering with the longest-leg path distance (LLPD) metric, which is informative for elongated and irregularly shaped clusters. We prove finite-sample guarantees on the performance of clustering with respect to this metric when random samples are drawn from multiple intrinsically low-dimensional clusters in high-dimensional space, in the presence of a large number of high-dimensional outliers. By combining these results with spectral clustering with respect to LLPD, we provide conditions under which the Laplacian eigengap statistic correctly determines the number of clusters for a large class of data sets, and prove guarantees on the labeling accuracy of the proposed algorithm. Our methods are quite general and provide performance guarantees for spectral clustering with any ultrametric. We also introduce an efficient, easy to implement approximation algorithm for the LLPD based on a multiscale analysis of adjacency graphs, which allows for the runtime of LLPD spectral clustering to be quasilinear in the number of data points.
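The LLPD itself is just the minimax path metric, and on a small example it can be computed with a Floyd-Warshall-style recursion. The points and the O(n³) recursion are illustrative; the paper's actual algorithm is a faster multiscale approximation.

```python
import math

# An elongated 4-point chain at the bottom and a 2-point group at the top.
pts = [(0, 0), (1, 0), (2, 0), (3, 0), (0, 5), (1, 5)]
n = len(pts)
D = [[math.dist(pts[i], pts[j]) for j in range(n)] for i in range(n)]
for k in range(n):
    for i in range(n):
        for j in range(n):
            # the best path i -> j may route through k; its cost is the longest leg
            D[i][j] = min(D[i][j], max(D[i][k], D[k][j]))

print(D[0][3], D[0][4])  # -> 1.0 5.0
```

Within the elongated chain the LLPD stays at the hop length 1 even though the Euclidean distance between its endpoints is 3, while crossing to the top group still requires a leg of length 5; this is exactly why the metric keeps elongated clusters tight while separating them from each other.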




clustering

Connecting Spectral Clustering to Maximum Margins and Level Sets

We study the connections between spectral clustering and the problems of maximum margin clustering, and estimation of the components of level sets of a density function. Specifically, we obtain bounds on the eigenvectors of graph Laplacian matrices in terms of the between cluster separation, and within cluster connectivity. These bounds ensure that the spectral clustering solution converges to the maximum margin clustering solution as the scaling parameter is reduced towards zero. The sensitivity of maximum margin clustering solutions to outlying points is well known, but can be mitigated by first removing such outliers, and applying maximum margin clustering to the remaining points. If outliers are identified using an estimate of the underlying probability density, then the remaining points may be seen as an estimate of a level set of this density function. We show that such an approach can be used to consistently estimate the components of the level sets of a density function under very mild assumptions.




clustering

Latent Simplex Position Model: High Dimensional Multi-view Clustering with Uncertainty Quantification

High dimensional data often contain multiple facets, and several clustering patterns can co-exist under different variable subspaces, also known as the views. While multi-view clustering algorithms were proposed, the uncertainty quantification remains difficult --- a particular challenge is in the high complexity of estimating the cluster assignment probability under each view, and sharing information among views. In this article, we propose an approximate Bayes approach --- treating the similarity matrices generated over the views as rough first-stage estimates for the co-assignment probabilities; in its Kullback-Leibler neighborhood, we obtain a refined low-rank matrix, formed by the pairwise product of simplex coordinates. Interestingly, each simplex coordinate directly encodes the cluster assignment uncertainty. For multi-view clustering, we let each view draw a parameterization from a few candidates, leading to dimension reduction. With high model flexibility, the estimation can be efficiently carried out as a continuous optimization problem, hence enjoys gradient-based computation. The theory establishes the connection of this model to a random partition distribution under multiple views. Compared to single-view clustering approaches, substantially more interpretable results are obtained when clustering brains from a human traumatic brain injury study, using high-dimensional gene expression data.




clustering

Optimal Bipartite Network Clustering

We study bipartite community detection in networks, or more generally the network biclustering problem. We present a fast two-stage procedure based on spectral initialization followed by the application of a pseudo-likelihood classifier twice. Under mild regularity conditions, we establish the weak consistency of the procedure (i.e., the convergence of the misclassification rate to zero) under a general bipartite stochastic block model. We show that the procedure is optimal in the sense that it achieves the optimal convergence rate that is achievable by a biclustering oracle, adaptively over the whole class, up to constants. This is further formalized by deriving a minimax lower bound over a class of biclustering problems. The optimal rate we obtain sharpens some of the existing results and generalizes others to a wide regime of average degree growth, from sparse networks with average degrees growing arbitrarily slowly to fairly dense networks with average degrees of order $\sqrt{n}$. As a special case, we recover the known exact recovery threshold in the $\log n$ regime of sparsity. To obtain the consistency result, as part of the provable version of the algorithm, we introduce a sub-block partitioning scheme that is also computationally attractive, allowing for distributed implementation of the algorithm without sacrificing optimality. The provable algorithm is derived from a general class of pseudo-likelihood biclustering algorithms that employ simple EM type updates. We show the effectiveness of this general class by numerical simulations.




clustering

Union of Low-Rank Tensor Spaces: Clustering and Completion

We consider the problem of clustering and completing a set of tensors with missing data that are drawn from a union of low-rank tensor spaces. In the clustering problem, given partially sampled tensor data composed of a number of subtensors, each chosen from one of a certain number of unknown tensor spaces, we need to group the subtensors that belong to the same tensor space. We provide a geometrical analysis of the sampling pattern and subsequently derive the sampling rate that guarantees correct clustering with high probability under some assumptions. Moreover, we investigate the fundamental conditions for finite/unique completability for the union of tensor spaces completion problem. Both deterministic and probabilistic conditions on the sampling pattern to ensure finite/unique completability are obtained. For both the clustering and completion problems, our tensor analysis provides a significantly better bound than the bound given by the matrix analysis applied to any unfolding of the tensor data.




clustering

A Bayesian sparse finite mixture model for clustering data from a heterogeneous population

Erlandson F. Saraiva, Adriano K. Suzuki, Luís A. Milan.

Source: Brazilian Journal of Probability and Statistics, Volume 34, Number 2, 323--344.

Abstract:
In this paper, we introduce a Bayesian approach for clustering data using a sparse finite mixture model (SFMM). The SFMM is a finite mixture model with a large, previously fixed number of components $k$, where many components can be empty. In this model, the number of components $k$ can be interpreted as the maximum number of distinct mixture components. We then explore the use of a prior distribution for the weights of the mixture model that takes into account the possibility that the number of clusters $k_{\mathbf{c}}$ (i.e., nonempty components) can be random and smaller than the number of components $k$ of the finite mixture model. In order to determine clusters, we develop an MCMC algorithm called the split-merge allocation sampler. In this algorithm, the split-merge strategy is data-driven and was inserted within the algorithm in order to improve the mixing of the Markov chain with respect to the number of clusters. The performance of the method is verified using simulated datasets and three real datasets. The first real dataset is the benchmark galaxy data, while the second and third are the publicly available datasets on enzyme and acidity, respectively.




clustering

Variable selection methods for model-based clustering

Michael Fop, Thomas Brendan Murphy.

Source: Statistics Surveys, Volume 12, 18--65.

Abstract:
Model-based clustering is a popular approach for clustering multivariate data and has seen applications in numerous fields. Nowadays, high-dimensional data are more and more common, and the model-based clustering approach has adapted to deal with the increasing dimensionality. In particular, the development of variable selection techniques has received a lot of attention and research effort in recent years. Even for small problems, variable selection has been advocated to facilitate the interpretation of the clustering results. This review provides a summary of the methods developed for variable selection in model-based clustering. Existing R packages implementing the different methods are indicated and illustrated on two data analysis examples.




clustering

Finite mixture models and model-based clustering

Volodymyr Melnykov, Ranjan Maitra

Source: Statist. Surv., Volume 4, 80--116.

Abstract:
Finite mixture models have a long history in statistics, having been used to model population heterogeneity, generalize distributional assumptions and, lately, to provide a convenient yet formal framework for clustering and classification. This paper provides a detailed review of mixture models and model-based clustering. Recent trends, as well as open problems in the area, are also discussed.




clustering

Predictive Modeling of ICU Healthcare-Associated Infections from Imbalanced Data. Using Ensembles and a Clustering-Based Undersampling Approach. (arXiv:2005.03582v1 [cs.LG])

Early detection of patients vulnerable to infections acquired in the hospital environment is a challenge for current health systems, given the impact such infections have on patient mortality and healthcare costs. This work focuses on both the identification of risk factors and the prediction of healthcare-associated infections in intensive-care units by means of machine-learning methods. The goal is to support decision making aimed at reducing the incidence rate of infections. In this field, it is necessary to deal with the problem of building reliable classifiers from imbalanced datasets. We propose a clustering-based undersampling strategy to be used in combination with ensemble classifiers. A comparative study with data from 4616 patients was conducted to validate our proposal. We applied several single and ensemble classifiers both to the original dataset and to data preprocessed by means of different resampling methods. The results were analyzed using classic and recent metrics specifically designed for imbalanced data classification, and they reveal that our proposal is more efficient than the other approaches.
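The abstract does not give implementation details, but the general idea of clustering-based undersampling can be sketched as follows: cluster the majority class and draw evenly from each cluster, so the retained subset preserves the majority class's internal structure. A minimal numpy sketch, with hypothetical toy data (this is not the paper's code):

```python
import numpy as np

def kmeans(X, k, rng, iters=50):
    # Plain Lloyd's algorithm; an illustrative stand-in, not the paper's method.
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        labels = ((X[:, None, :] - centers) ** 2).sum(axis=2).argmin(axis=1)
        for j in range(k):
            pts = X[labels == j]
            if len(pts):
                centers[j] = pts.mean(axis=0)
    return labels

rng = np.random.default_rng(0)
# Hypothetical imbalanced data: 900 majority vs. 100 minority samples.
X_maj = rng.normal(0.0, 1.0, (900, 2))
X_min = rng.normal(3.0, 1.0, (100, 2))

# Cluster the majority class, then take an equal quota from each cluster
# until the two classes are (approximately) balanced.
k = 10
labels = kmeans(X_maj, k, rng)
quota = len(X_min) // k
keep = []
for j in range(k):
    idx = np.flatnonzero(labels == j)
    if idx.size:
        keep.extend(rng.choice(idx, min(quota, idx.size), replace=False))
X_balanced = np.vstack([X_maj[keep], X_min])
```

The balanced set `X_balanced` would then feed the ensemble classifiers; unlike random undersampling, no majority-class mode is discarded wholesale.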




clustering

Fair Algorithms for Hierarchical Agglomerative Clustering. (arXiv:2005.03197v1 [cs.LG])

Hierarchical Agglomerative Clustering (HAC) algorithms are extensively utilized in modern data science and machine learning: they seek to partition a dataset into clusters while generating a hierarchical relationship among the data samples themselves. HAC algorithms are employed in a number of applications, such as biology, natural language processing, and recommender systems. It is therefore imperative to ensure that these algorithms are fair: even if the dataset contains biases against certain protected groups, the cluster outputs generated should not be discriminatory against samples from any of these groups. However, recent work on clustering fairness has mostly focused on center-based clustering algorithms, such as k-median and k-means clustering. Therefore, in this paper, we propose fair algorithms for performing HAC that 1) enforce fairness constraints irrespective of the distance linkage criterion used, 2) generalize to any natural measure of clustering fairness for HAC, 3) work for multiple protected groups, and 4) have running times competitive with vanilla HAC. To the best of our knowledge, this is the first work to study fairness for HAC algorithms. We also propose an algorithm with lower asymptotic time complexity than HAC that can rectify existing HAC outputs, making them subsequently fair. Moreover, we carry out extensive experiments on multiple real-world UCI datasets to demonstrate how our algorithms work.
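The paper's fairness-constrained algorithms are not reproduced here; as a baseline to make the fairness notion concrete, the sketch below runs naive (vanilla) average-linkage HAC in numpy and then measures the protected-group balance of each output cluster, the quantity a fair HAC would constrain. Data, names, and the binary protected attribute are our own assumptions:

```python
import numpy as np

def hac_average(X, k):
    # Naive average-linkage agglomerative clustering (O(n^3); illustrative only).
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    clusters = [[i] for i in range(len(X))]
    while len(clusters) > k:
        best, pair = np.inf, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = D[np.ix_(clusters[a], clusters[b])].mean()
                if d < best:
                    best, pair = d, (a, b)
        a, b = pair
        clusters[a] = clusters[a] + clusters.pop(b)
    return clusters

def balance(cluster, groups):
    # Fraction of the rarer protected group inside a cluster
    # (0.5 = perfectly balanced for two groups, 0 = fully homogeneous).
    p = groups[cluster].mean()
    return min(p, 1.0 - p)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(4, 1, (20, 2))])
groups = rng.integers(0, 2, 40)   # hypothetical binary protected attribute

clusters = hac_average(X, k=2)
balances = [balance(c, groups) for c in clusters]
```

A fair variant would keep every `balance(c, groups)` above a chosen floor at each merge, whatever linkage criterion is used.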




clustering

Model assisted variable clustering: Minimax-optimal recovery and algorithms

Florentina Bunea, Christophe Giraud, Xi Luo, Martin Royer, Nicolas Verzelen.

Source: The Annals of Statistics, Volume 48, Number 1, 111--137.

Abstract:
The problem of variable clustering is that of estimating groups of similar components of a $p$-dimensional vector $X=(X_{1},\ldots ,X_{p})$ from $n$ independent copies of $X$. There exists a large number of algorithms that return data-dependent groups of variables, but their interpretation is limited to the algorithm that produced them. An alternative is model-based clustering, in which one begins by defining population-level clusters relative to a model that embeds notions of similarity. Algorithms tailored to such models yield estimated clusters with a clear statistical interpretation. We take this view here and introduce the class of $G$-block covariance models as a background model for variable clustering. In such models, two variables in a cluster are deemed similar if they have similar associations with all other variables. This can arise, for instance, when groups of variables are noise-corrupted versions of the same latent factor. We quantify the difficulty of clustering data generated from a $G$-block covariance model in terms of cluster proximity, measured with respect to two related, but different, cluster separation metrics. We derive minimax cluster separation thresholds, which are the metric values below which no algorithm can recover the model-defined clusters exactly, and show that they are different for the two metrics. We therefore develop two algorithms, COD and PECOK, tailored to $G$-block covariance models, and study their minimax-optimality with respect to each metric. Of independent interest is the fact that the analysis of the PECOK algorithm, which is based on a corrected convex relaxation of the popular $K$-means algorithm, provides the first statistical analysis of such algorithms for variable clustering. Additionally, we compare our methods with another popular clustering method, spectral clustering. Extensive simulation studies, as well as our data analyses, confirm the applicability of our approach.
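The similarity notion above ("similar associations with all other variables") can be made concrete with a small numpy sketch: variables are grouped when their covariance profiles against every other variable nearly coincide. This is a simplified COD-style illustration under a hypothetical threshold `tau`, not the COD or PECOK algorithms themselves; the latent-factor toy data is our own assumption:

```python
import numpy as np

rng = np.random.default_rng(0)
# G-block-style data: two latent factors, each observed through three noisy
# copies (variables in a block are noise-corrupted versions of one factor).
n = 5000
z = rng.normal(0, 1, (n, 2))
X = np.hstack([z[:, [0]] + 0.1 * rng.normal(size=(n, 3)),
               z[:, [1]] + 0.1 * rng.normal(size=(n, 3))])

S = np.cov(X, rowvar=False)
p = S.shape[0]

def cord(a, b):
    # COD-style dissimilarity: a and b are "similar" when their
    # covariances with every *other* variable nearly coincide.
    others = [c for c in range(p) if c not in (a, b)]
    return np.abs(S[a, others] - S[b, others]).max()

# Greedy grouping under a hypothetical threshold tau.
tau = 0.1
groups, unassigned = [], list(range(p))
while unassigned:
    a = unassigned.pop(0)
    g = [a] + [b for b in unassigned if cord(a, b) < tau]
    unassigned = [b for b in unassigned if b not in g]
    groups.append(g)
```

With well-separated blocks and enough samples, the greedy pass recovers the two three-variable blocks; the minimax thresholds in the paper characterize exactly when such recovery becomes impossible.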




clustering

A hierarchical Bayesian model for single-cell clustering using RNA-sequencing data

Yiyi Liu, Joshua L. Warren, Hongyu Zhao.

Source: The Annals of Applied Statistics, Volume 13, Number 3, 1733--1752.

Abstract:
Understanding the heterogeneity of cells is an important biological question. The development of single-cell RNA-sequencing (scRNA-seq) technology provides high resolution data for such inquiry. A key challenge in scRNA-seq analysis is the high variability of measured RNA expression levels and frequent dropouts (missing values) due to limited input RNA compared to bulk RNA-seq measurement. Existing clustering methods do not perform well for these noisy and zero-inflated scRNA-seq data. In this manuscript we propose a Bayesian hierarchical model, called BasClu, to appropriately characterize important features of scRNA-seq data in order to more accurately cluster cells. We demonstrate the effectiveness of our method with extensive simulation studies and applications to three real scRNA-seq datasets.




clustering

Reliable clustering of Bernoulli mixture models

Amir Najafi, Seyed Abolfazl Motahari, Hamid R. Rabiee.

Source: Bernoulli, Volume 26, Number 2, 1535--1559.

Abstract:
A Bernoulli Mixture Model (BMM) is a finite mixture of random binary vectors with independent dimensions. The problem of clustering BMM data arises in a variety of real-world applications, ranging from population genetics to activity analysis in social networks. In this paper, we analyze the clusterability of BMMs from a theoretical perspective, when the number of clusters is unknown. In particular, we stipulate a set of conditions on the sample complexity and dimension of the model in order to guarantee the Probably Approximately Correct (PAC)-clusterability of a dataset. To the best of our knowledge, these findings are the first non-asymptotic bounds on the sample complexity of learning or clustering BMMs.
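The paper's contribution is theoretical (PAC-clusterability bounds), but a concrete sense of what clustering a BMM means can be given by a standard EM sketch for a two-component Bernoulli mixture; the prototypes, flip rate, and all names below are our own toy assumptions, not the authors' setup:

```python
import numpy as np

rng = np.random.default_rng(0)
# Two binary prototypes observed with 10% independent bit flips.
proto = np.array([[1, 1, 1, 1, 0, 0, 0, 0],
                  [0, 0, 0, 0, 1, 1, 1, 1]], dtype=float)
z = rng.integers(0, 2, 300)                       # true cluster labels
X = (rng.random((300, 8)) < np.where(proto[z] == 1, 0.9, 0.1)).astype(float)

k, (n, d) = 2, X.shape
theta = rng.uniform(0.3, 0.7, (k, d))   # per-component Bernoulli parameters
w = np.full(k, 1.0 / k)

for _ in range(50):
    # E-step: responsibilities under independent-Bernoulli likelihoods.
    log_p = X @ np.log(theta).T + (1 - X) @ np.log(1 - theta).T + np.log(w)
    log_p -= log_p.max(axis=1, keepdims=True)
    r = np.exp(log_p)
    r /= r.sum(axis=1, keepdims=True)
    # M-step (theta clipped away from {0, 1} to keep the logs finite)
    nk = r.sum(axis=0)
    w = nk / n
    theta = np.clip((r.T @ X) / nk[:, None], 1e-6, 1 - 1e-6)

labels = r.argmax(axis=1)
```

With well-separated prototypes, EM recovers the planted partition up to label permutation; the paper's conditions on sample complexity and dimension delimit when such recovery is possible at all, with the number of clusters unknown.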




clustering

Clustering of Risk Factors: A Simple Method of Detecting Cardiovascular Disease in Youth

Cardiovascular risk factors predict the development of premature atherosclerosis. As the number of risk factors increases, so does the extent of these lesions. Assessment of cardiovascular risk factors is an accepted practice in adults but is not used in pediatrics.

In this study, the authors discuss how the presence of ≥2 cardiovascular risk factors is associated with vascular changes in adolescents. The findings were compared with the Pathobiological Determinants of Atherosclerosis in Youth risk score to demonstrate that a simple method of clustering is a reliable tool for use in clinical practice.




clustering

Strength Capacity and Cardiometabolic Risk Clustering in Adolescents

Resistance exercise is known to have a robust effect on glycemic control and cardiometabolic health among children and adolescents, even in the absence of weight loss.

Normalized strength capacity is associated with lower cardiometabolic risk clustering in boys and girls, even after adjustment for cardiorespiratory fitness, level of physical activity, and BMI.




clustering

A standardized patient-centered characterization of the phenotypic spectrum of PCDH19 girls clustering epilepsy




clustering

Satellite images reveal fleets of empty cruise ships clustering together in Caribbean & Philippines

At least three groups of cruise ships, with 15 in total, are clustered together off Coco Cay and Great Stirrup Cay in the Bahamas and about 12 are off the coast of Manila, in the Philippines.




clustering

Customer Segmentation and Clustering Using SAS Enterprise Miner, Third Edition / Randall S. Collica

Online Resource




clustering

[ASAP] Clustering a Chemical Inventory for Safety Assessment of Fragrance Ingredients: Identifying Read-Across Analogs to Address Data Gaps

Chemical Research in Toxicology
DOI: 10.1021/acs.chemrestox.9b00518




clustering

Clustering methodology for symbolic data / Lynne Billard (University of Georgia), Edwin Diday (Université de Paris IX-Dauphine)

Dewey Library - QA278.55.B55 2020




clustering

Medical products industries clustering in Tampa Bay