cluster

Method for Network Self-Healing in Cluster-Tree Structured Wireless Communication Networks

Provided is a network self-healing method for restoring a broken link between a parent device and a child device in a wireless communication network of a cluster-tree structure, in which a main communication device (referred to as an access point (AP)) manages network operation, and routers (devices capable of having child devices) and end devices (devices incapable of having child devices) are associated with one another in parent-child relationships. When a router becomes an orphan device, it re-associates with the network in cluster units while maintaining synchronized operation with its child devices, so that the time, energy and signaling burden of network self-healing are greatly reduced.




cluster

Method for mixing short staple and down cluster by a dry processing

A method for mixing short staple and down cluster by dry processing uses an air tool to blow the short staple so that the scattered fibers mix into the down cluster, with stirring blades applied for further stirring. No chemical agents are needed, no pollution is generated, and processing time is reduced because the mixture does not have to be soaked in a chemical agent; both processing time and manufacturing cost are therefore decreased. The proportion of short staple to down cluster is adjustable for different needs and different warmth-retaining effects.




cluster

SYSTEMS AND METHODS FOR ONLINE CLUSTERING OF CONTENT ITEMS

Systems, methods, and non-transitory computer-readable media can obtain a first batch of content items to be clustered. A set of clusters can be generated by clustering respective binary hash codes for each content item in the first batch, wherein content items included in a cluster are visually similar to one another. A next batch of content items to be clustered can be obtained. One or more respective binary hash codes for the content items in the next batch can be assigned to a cluster in the set of clusters.
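The claim describes batch-wise clustering of binary hash codes followed by assigning later batches to the existing clusters. A minimal sketch of that workflow, assuming the hash codes are already available as 0/1 vectors and using Hamming distance to greedy cluster representatives (the thresholds, hashing method and similarity criteria of the actual system are not specified here):

```python
import numpy as np

def hamming(codes, query):
    """Hamming distances from each row of `codes` (0/1 arrays) to `query`."""
    return np.count_nonzero(codes != query, axis=1)

def cluster_batch(codes, threshold, reps=None):
    """Greedy online clustering of binary hash codes.

    Each code joins the nearest existing cluster representative if it is within
    `threshold` bits; otherwise it opens a new cluster.  Passing the
    representatives returned from an earlier call lets a next batch be folded
    into the existing set of clusters, as in the abstract above.
    """
    reps = [] if reps is None else list(reps)
    labels = []
    for code in codes:
        if reps:
            d = hamming(np.array(reps), code)
            j = int(np.argmin(d))
            if d[j] <= threshold:
                labels.append(j)
                continue
        reps.append(code)
        labels.append(len(reps) - 1)
    return np.array(reps), labels

# Usage: cluster a first batch, then assign a next batch to the same clusters.
rng = np.random.default_rng(0)
batch1 = rng.integers(0, 2, size=(100, 64))
batch2 = rng.integers(0, 2, size=(50, 64))
reps, labels1 = cluster_batch(batch1, threshold=20)   # threshold is illustrative
reps, labels2 = cluster_batch(batch2, threshold=20, reps=reps)
```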




cluster

SYSTEM AND METHOD FOR PROVIDING CONTENT RECOMMENDATIONS BASED ON PERSONALIZED MULTIMEDIA CONTENT ELEMENT CLUSTERS

A system and method for generating recommendations based on personalized multimedia content element clusters. The method includes obtaining a personalized multimedia content element cluster associated with a user, wherein the personalized multimedia content element cluster includes a plurality of multimedia content elements related to a common concept, wherein the common concept represents a user interest of the user; analyzing the obtained personalized multimedia content element cluster to determine at least one query; searching, using the determined at least one query, for at least one relevant multimedia content element that is relevant to the user interest of the user; and providing, to the user, at least one recommended multimedia content element of the at least one relevant multimedia content element found during the search.
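A toy sketch of the obtain-cluster, derive-query, search, recommend flow described above, using a hypothetical representation in which each multimedia element carries concept tags; the actual concept extraction and search backend are not described in the abstract and are not modeled here.

```python
from collections import Counter

# Hypothetical data: each multimedia content element carries concept tags.
cluster = [
    {"id": "img1", "concepts": ["surfing", "beach", "ocean"]},
    {"id": "vid7", "concepts": ["surfing", "waves"]},
    {"id": "img9", "concepts": ["surfing", "board", "ocean"]},
]
catalog = [
    {"id": "vid42", "concepts": ["surfing", "competition"]},
    {"id": "img13", "concepts": ["mountains", "hiking"]},
    {"id": "vid77", "concepts": ["ocean", "surfing", "sunset"]},
]

def build_query(cluster, top_k=2):
    """Derive a query from the most common concepts in the personalized cluster."""
    counts = Counter(c for elem in cluster for c in elem["concepts"])
    return [concept for concept, _ in counts.most_common(top_k)]

def search(catalog, query):
    """Rank catalog elements by overlap with the query concepts."""
    scored = [(len(set(e["concepts"]) & set(query)), e) for e in catalog]
    return [e for score, e in sorted(scored, key=lambda t: -t[0]) if score > 0]

query = build_query(cluster)              # e.g. ['surfing', 'ocean']
recommendations = search(catalog, query)  # relevant elements to recommend
print([e["id"] for e in recommendations])
```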





cluster

Cluster Sound releases RX7 free drum pack for Ableton Live

Cluster Sound has announced the release of RX7, a free Ableton Live drum pack based on the vintage RX7 drum machine. The 12-bit drum machine from Yamaha comes equipped with 100 PCM samples. It was used by artists such as Future Sound of London, Massive Attack, Bjork and Nine Inch Nails. Released in 1988 by […]





cluster

10 steps to set up a multi-data center Cassandra cluster on a Kubernetes platform

Learn how to deploy an Apache Cassandra NoSQL database on a Kubernetes cluster that spans multiple data centers across many regions. The benefits of such a setup are automatic live backups to protect the cluster from node- and site-level disasters, and location-aware access to Cassandra nodes for better performance.
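The "location-aware access" piece has a client-side counterpart that can be sketched with the DataStax Python driver; the host name and data-center name below are placeholders for whatever services your Kubernetes deployment exposes, and the cluster-side setup (StatefulSets, seeds, snitch and replication configuration) is outside this snippet.

```python
# Sketch of location-aware client access, assuming the DataStax Python driver
# (pip install cassandra-driver).  Contact point and data-center name are
# placeholders, not values from the article.
from cassandra.cluster import Cluster, ExecutionProfile, EXEC_PROFILE_DEFAULT
from cassandra.policies import DCAwareRoundRobinPolicy, TokenAwarePolicy

profile = ExecutionProfile(
    # Prefer replicas in the local data center, routing by token where possible.
    load_balancing_policy=TokenAwarePolicy(
        DCAwareRoundRobinPolicy(local_dc="dc-us-east")
    )
)

cluster = Cluster(
    contact_points=["cassandra-dc-us-east.default.svc.cluster.local"],
    execution_profiles={EXEC_PROFILE_DEFAULT: profile},
)
session = cluster.connect()
print(session.execute("SELECT release_version FROM system.local").one())
```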




cluster

How a Western Sydney nursing home became one of the country's biggest coronavirus clusters

It started with a "scratchy throat", but now almost every day at Newmarch House brings another death — here's what we know about the nursing home at the centre of a massive coronavirus outbreak.




cluster

Live: NSW Now: Probe into coronavirus cluster at Newmarch House sees worker stood down

MORNING BRIEFING: An aged care operator at the centre of Australia's second biggest coronavirus cluster has requested a worker be stood down after alleged breaches of infection control.




cluster

Tracking the coronavirus spread: The two clusters fuelling the new case tally

Outbreaks at two locations — one in Sydney and one in Melbourne — are behind many of the new COVID-19 cases identified in the past week.




cluster

Tasmanian coronavirus cluster could happen anywhere, doctors warn

As two hospitals close to clean up amid a coronavirus outbreak in Tasmania's north-west, doctors warn there's nothing unique about the region that means similar outbreaks can't happen anywhere else.




cluster

No new cases of coronavirus for Tasmania, as north-west cluster blamed on Ruby Princess

Australia's Chief Medical Officer says a coronavirus cluster in Tasmania's north-west was likely sparked by a passenger from the Ruby Princess cruise ship, as the state marks 24 hours without a new case being found.




cluster

Hospital cluster probe finds staff worked while sick, Ruby Princess source of outbreak

An investigation into a coronavirus cluster in north-west Tasmania finds some staff worked in local hospitals for several days while experiencing symptoms, but the Premier stresses no-one is to blame.




cluster

Victorian school likely had 'unsafe' levels of chemicals in soil, inquiry into possible cancer cluster told

A Senate inquiry into a possible cancer cluster on Victoria's Bellarine Peninsula hears evidence from high-profile lawyer Peter Gordon of a "disturbing number of cancer cases" connected to the early years of Bellarine Secondary College.





cluster

Cluster of coronavirus cases discovered at Melbourne abattoir as paramedic tests positive

The number of coronavirus cases in Victoria continues to inch up as a paramedic tests positive to the virus and health authorities investigate a cluster at a meat processing plant.




cluster

Victoria sees biggest coronavirus tally jump in a fortnight as school closed and abattoir cluster grows

Health Minister Jenny Mikakos announces that Epping's Meadowglen Primary School will be closed for three days as the state confirms 13 new coronavirus cases.




cluster

Coronavirus cluster at Melbourne abattoir jumps to 34 cases, but 'not a risk' to public

Victoria's COVID-19 tally continues on its steepest climb in a fortnight, as Premier Daniel Andrews reveals 13,000 people were screened for the virus in the state's testing blitz on Sunday.




cluster

Coronavirus cluster at Melbourne meatworks grows again, showing COVID-19 battle 'far from over'

Victoria records 17 new coronavirus cases including 11 linked to a cluster at a Melbourne meat processing plant. It comes as Treasurer Tim Pallas announces $491 million in tax relief for Victorian businesses.




cluster

Coronavirus cluster at Melbourne meatworks grows as aged care homes in lockdown

A cluster of coronavirus cases at a Melbourne meatworks rises to 49, as two Victorian aged care homes go into lockdown after workers test positive to the virus.





cluster

Clustering






cluster

Coronavirus case cluster tied to Pasadena party, spurring warning of Mother's Day gatherings

Pasadena is warning against Mother's Day gatherings after its public health department traced a cluster of coronavirus cases to a birthday party.




cluster

CBD News: Statement by the Executive Secretary at the first national meeting of the Satoyama Satoumi Sub-Global Assessment Inter-Cluster Meeting, Ishikawa, 16 September 2008.




cluster

Unistructurality of cluster algebras from unpunctured surfaces

Véronique Bazier-Matte and Pierre-Guy Plamondon
Proc. Amer. Math. Soc. 148 (2020), 2397-2409.
Abstract, references and article information




cluster

Lecture Notes on Cluster Algebras

Robert J. Marsh, University of Leeds - A publication of the European Mathematical Society, 2014, 122 pp., Softcover, ISBN-13: 978-3-03719-130-9, List: US$36, All AMS Members: US$28.80, EMSZLEC/19

Cluster algebras are combinatorially defined commutative algebras which were introduced by S. Fomin and A. Zelevinsky as a tool for studying the dual...




cluster

Semiclassical Standing Waves with Clustering Peaks for Nonlinear Schrodinger Equations

Jaeyoung Byeon, KAIST, and Kazunaga Tanaka, Waseda University - AMS, 2013, 89 pp., Softcover, ISBN-13: 978-0-8218-9163-6, List: US$71, All AMS Members: US$56.80, MEMO/229/1076

The authors study the following singularly perturbed problem: $-\epsilon^{2}\Delta u + V(x)u = f(u)$ in $\mathbf{R}^{N}$. Their main result is the...




cluster

Research found a new way to make functional materials based on polymers of metal clusters

(University of Jyväskylä - Jyväskylän yliopisto) Researchers at the universities of Jyvaskyla and Xiamen discovered a novel way to make functional macroscopic crystalline materials out of nanometer-size 34-atom silver-gold intermetallic clusters. The cluster material has a highly anisotropic electrical conductivity, being a semiconductor in one direction and an electrical insulator in other directions. The research was published in Nature Communications on May 6, 2020.




cluster

Age of NGC 6652 globular cluster specified

(Kazan Federal University) Senior Research Associate Margarita Sharina (Special Astrophysical Observatory) and Associate Professor Vladislav Shimansky (Kazan Federal University) studied the globular cluster NGC 6652.4.05957 and found out that its age is close to 13.6 billion years, which makes it one of the oldest objects in the Milky Way.




cluster

South Korea sees new cluster of COVID-19 cases tied to nightclubs

Just days after South Korea loosened its social distancing guidelines, a new COVID-19 cluster of infections has sprung up in the capital city of Seoul tied to several nightclubs.




cluster

Model-based clustering with envelopes

Wenjing Wang, Xin Zhang, Qing Mai.

Source: Electronic Journal of Statistics, Volume 14, Number 1, 82--109.

Abstract:
Clustering analysis is an important unsupervised learning technique in multivariate statistics and machine learning. In this paper, we propose a set of new mixture models called CLEMM (short for Clustering with Envelope Mixture Models) that is based on the widely used Gaussian mixture model assumptions and the nascent research area of envelope methodology. Formulated mostly for regression models, envelope methodology aims for simultaneous dimension reduction and efficient parameter estimation, and includes a very recent formulation of envelope discriminant subspace for classification and discriminant analysis. Motivated by the envelope discriminant subspace pursuit in classification, we consider parsimonious probabilistic mixture models where the cluster analysis can be improved by projecting the data onto a latent lower-dimensional subspace. The proposed CLEMM framework and the associated envelope-EM algorithms thus provide foundations for envelope methods in unsupervised and semi-supervised learning problems. Numerical studies on simulated data and two benchmark data sets show significant improvement of our proposed methods over classical methods such as Gaussian mixture models, K-means and hierarchical clustering algorithms. An R package is available at https://github.com/kusakehan/CLEMM.
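The core idea of improving a Gaussian mixture fit by working in a lower-dimensional latent subspace can be roughed out as follows. This sketch uses a plain PCA projection as a crude stand-in for the envelope subspace, so it only illustrates the general workflow and is not the CLEMM envelope-EM algorithm itself (the authors' R package is linked above).

```python
# Illustration only: project to a low-dimensional subspace, then fit a Gaussian
# mixture there.  PCA stands in for the envelope subspace estimation.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture
from sklearn.metrics import adjusted_rand_score

X, y = load_iris(return_X_y=True)

full = GaussianMixture(n_components=3, random_state=0).fit_predict(X)
Z = PCA(n_components=2).fit_transform(X)          # latent 2-dimensional subspace
reduced = GaussianMixture(n_components=3, random_state=0).fit_predict(Z)

print("ARI, full space:", adjusted_rand_score(y, full))
print("ARI, projected :", adjusted_rand_score(y, reduced))
```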




cluster

A Bayesian approach to disease clustering using restricted Chinese restaurant processes

Claudia Wehrhahn, Samuel Leonard, Abel Rodriguez, Tatiana Xifara.

Source: Electronic Journal of Statistics, Volume 14, Number 1, 1449--1478.

Abstract:
Identifying disease clusters (areas with an unusually high incidence of a particular disease) is a common problem in epidemiology and public health. We describe a Bayesian nonparametric mixture model for disease clustering that constrains clusters to be made of adjacent areal units. This is achieved by modifying the exchangeable partition probability function associated with the Ewens sampling distribution. We call the resulting prior the Restricted Chinese Restaurant Process, as the associated full conditional distributions resemble those associated with the standard Chinese Restaurant Process. The model is illustrated using synthetic data sets and in an application to oral cancer mortality in Germany.
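For readers unfamiliar with the prior being modified, below is a minimal sketch of drawing a random partition from the standard, unrestricted Chinese Restaurant Process; the paper's contribution is to restrict such partitions so that every cluster is a set of adjacent areal units, which this sketch does not attempt.

```python
import numpy as np

def crp_partition(n, alpha, rng=None):
    """Draw a random partition of n items from an (unrestricted) Chinese
    Restaurant Process with concentration parameter alpha."""
    rng = np.random.default_rng(rng)
    labels = [0]                       # first customer opens the first table
    counts = [1]
    for _ in range(1, n):
        probs = np.array(counts + [alpha], dtype=float)
        probs /= probs.sum()
        table = int(rng.choice(len(probs), p=probs))
        if table == len(counts):       # new table = new cluster
            counts.append(1)
        else:
            counts[table] += 1
        labels.append(table)
    return labels

print(crp_partition(10, alpha=1.0, rng=0))
```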




cluster

$k$-means clustering of extremes

Anja Janßen, Phyllis Wan.

Source: Electronic Journal of Statistics, Volume 14, Number 1, 1211--1233.

Abstract:
The $k$-means clustering algorithm and its variant, the spherical $k$-means clustering, are among the most important and popular methods in unsupervised learning and pattern detection. In this paper, we explore how the spherical $k$-means algorithm can be applied in the analysis of only the extremal observations from a data set. By making use of multivariate extreme value analysis we show how it can be adapted to find “prototypes” of extremal dependence and derive a consistency result for our suggested estimator. In the special case of max-linear models we show furthermore that our procedure provides an alternative way of statistical inference for this class of models. Finally, we provide data examples which show that our method is able to find relevant patterns in extremal observations and allows us to classify extremal events.
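A rough sketch of the pipeline described above, under simplifying assumptions: keep only the observations with the largest Euclidean norm, project them to the unit sphere, and cluster the angular parts. Ordinary k-means on the normalized points is used as a stand-in for spherical k-means, so this illustrates the idea rather than reproducing the authors' estimator.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
X = rng.standard_t(df=2.5, size=(5000, 2))        # heavy-tailed sample

# Keep the most extreme observations (largest norms), e.g. the top 5%.
norms = np.linalg.norm(X, axis=1)
extreme = X[norms >= np.quantile(norms, 0.95)]

# Project the extremes onto the unit sphere and cluster the angular parts.
angles = extreme / np.linalg.norm(extreme, axis=1, keepdims=True)
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(angles)

print("cluster sizes:", np.bincount(km.labels_))
print("prototype directions:\n", km.cluster_centers_)
```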




cluster

Modal clustering asymptotics with applications to bandwidth selection

Alessandro Casa, José E. Chacón, Giovanna Menardi.

Source: Electronic Journal of Statistics, Volume 14, Number 1, 835--856.

Abstract:
Density-based clustering relies on the idea of linking groups to some specific features of the probability distribution underlying the data. The reference to a true, yet unknown, population structure allows framing the clustering problem in a standard inferential setting, where the concept of ideal population clustering is defined as the partition induced by the true density function. The nonparametric formulation of this approach, known as modal clustering, draws a correspondence between the groups and the domains of attraction of the density modes. Operationally, a nonparametric density estimate is required and a proper selection of the amount of smoothing, governing the shape of the density and hence possibly the modal structure, is crucial to identify the final partition. In this work, we address the issue of density estimation for modal clustering from an asymptotic perspective. A natural and easy to interpret metric to measure the distance between density-based partitions is discussed, its asymptotic approximation explored, and employed to study the problem of bandwidth selection for nonparametric modal clustering.
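Operationally, modal clustering is often carried out with a mean-shift style mode-seeking procedure, where the kernel bandwidth controls how many modes, and hence clusters, survive. A small sketch assuming scikit-learn's MeanShift, only to make the bandwidth sensitivity concrete:

```python
import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=600, centers=3, cluster_std=0.8, random_state=0)

# The bandwidth governs the smoothness of the density estimate and thus the
# modal structure: too large merges groups, too small fragments them.
for bw in [None, 0.5, 5.0]:
    bandwidth = estimate_bandwidth(X) if bw is None else bw
    labels = MeanShift(bandwidth=bandwidth).fit_predict(X)
    print(f"bandwidth={bandwidth:.2f} -> {len(np.unique(labels))} clusters")
```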




cluster

Profile likelihood biclustering

Cheryl Flynn, Patrick Perry.

Source: Electronic Journal of Statistics, Volume 14, Number 1, 731--768.

Abstract:
Biclustering, the process of simultaneously clustering the rows and columns of a data matrix, is a popular and effective tool for finding structure in a high-dimensional dataset. Many biclustering procedures appear to work well in practice, but most do not have associated consistency guarantees. To address this shortcoming, we propose a new biclustering procedure based on profile likelihood. The procedure applies to a broad range of data modalities, including binary, count, and continuous observations. We prove that the procedure recovers the true row and column classes when the dimensions of the data matrix tend to infinity, even if the functional form of the data distribution is misspecified. The procedure requires a combinatorial search, which can be expensive in practice. Rather than performing this search directly, we propose a new heuristic optimization procedure based on the Kernighan-Lin heuristic, which has nice computational properties and performs well in simulations. We demonstrate our procedure with applications to congressional voting records and microarray analysis.




cluster

Path-Based Spectral Clustering: Guarantees, Robustness to Outliers, and Fast Algorithms

We consider the problem of clustering with the longest-leg path distance (LLPD) metric, which is informative for elongated and irregularly shaped clusters. We prove finite-sample guarantees on the performance of clustering with respect to this metric when random samples are drawn from multiple intrinsically low-dimensional clusters in high-dimensional space, in the presence of a large number of high-dimensional outliers. By combining these results with spectral clustering with respect to LLPD, we provide conditions under which the Laplacian eigengap statistic correctly determines the number of clusters for a large class of data sets, and prove guarantees on the labeling accuracy of the proposed algorithm. Our methods are quite general and provide performance guarantees for spectral clustering with any ultrametric. We also introduce an efficient, easy to implement approximation algorithm for the LLPD based on a multiscale analysis of adjacency graphs, which allows for the runtime of LLPD spectral clustering to be quasilinear in the number of data points.
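The longest-leg path distance between two points equals the largest edge weight on the path joining them in a Euclidean minimum spanning tree, which gives a simple (if not quasilinear) way to compute it exactly for small data sets. A sketch, assuming SciPy and scikit-learn, that feeds an LLPD-based affinity to spectral clustering; the paper's multiscale approximation and its guarantees are not reproduced here.

```python
import numpy as np
from scipy.spatial import distance_matrix
from scipy.sparse.csgraph import minimum_spanning_tree
from sklearn.cluster import SpectralClustering
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

X, y = make_moons(n_samples=300, noise=0.05, random_state=0)
n = len(X)

# LLPD(i, j) = largest edge on the minimax path between i and j, attained on
# the Euclidean minimum spanning tree.
D = distance_matrix(X, X)
mst = minimum_spanning_tree(D).toarray()
adj = np.maximum(mst, mst.T)                      # symmetric MST adjacency

llpd = np.zeros((n, n))
for s in range(n):                                # traverse the tree from each source
    best = np.full(n, np.inf)
    best[s] = 0.0
    stack = [s]
    while stack:
        u = stack.pop()
        for v in np.nonzero(adj[u])[0]:
            cand = max(best[u], adj[u, v])
            if cand < best[v]:
                best[v] = cand
                stack.append(v)
    llpd[s] = best

# Turn the ultrametric into an affinity and run spectral clustering on it.
sigma = np.median(llpd)
affinity = np.exp(-(llpd / sigma) ** 2)
labels = SpectralClustering(n_clusters=2, affinity="precomputed",
                            random_state=0).fit_predict(affinity)
print("ARI vs. true moons:", adjusted_rand_score(y, labels))
```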




cluster

Connecting Spectral Clustering to Maximum Margins and Level Sets

We study the connections between spectral clustering and the problems of maximum margin clustering, and estimation of the components of level sets of a density function. Specifically, we obtain bounds on the eigenvectors of graph Laplacian matrices in terms of the between cluster separation, and within cluster connectivity. These bounds ensure that the spectral clustering solution converges to the maximum margin clustering solution as the scaling parameter is reduced towards zero. The sensitivity of maximum margin clustering solutions to outlying points is well known, but can be mitigated by first removing such outliers, and applying maximum margin clustering to the remaining points. If outliers are identified using an estimate of the underlying probability density, then the remaining points may be seen as an estimate of a level set of this density function. We show that such an approach can be used to consistently estimate the components of the level sets of a density function under very mild assumptions.




cluster

Latent Simplex Position Model: High Dimensional Multi-view Clustering with Uncertainty Quantification

High dimensional data often contain multiple facets, and several clustering patterns can co-exist under different variable subspaces, also known as the views. While multi-view clustering algorithms have been proposed, uncertainty quantification remains difficult --- a particular challenge is the high complexity of estimating the cluster assignment probability under each view, and of sharing information among views. In this article, we propose an approximate Bayes approach --- treating the similarity matrices generated over the views as rough first-stage estimates for the co-assignment probabilities; in its Kullback-Leibler neighborhood, we obtain a refined low-rank matrix, formed by the pairwise product of simplex coordinates. Interestingly, each simplex coordinate directly encodes the cluster assignment uncertainty. For multi-view clustering, we let each view draw a parameterization from a few candidates, leading to dimension reduction. With high model flexibility, the estimation can be efficiently carried out as a continuous optimization problem, hence enjoys gradient-based computation. The theory establishes the connection of this model to a random partition distribution under multiple views. Compared to single-view clustering approaches, substantially more interpretable results are obtained when clustering brains from a human traumatic brain injury study, using high-dimensional gene expression data.
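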




cluster

Optimal Bipartite Network Clustering

We study bipartite community detection in networks, or more generally the network biclustering problem. We present a fast two-stage procedure based on spectral initialization followed by the application of a pseudo-likelihood classifier twice. Under mild regularity conditions, we establish the weak consistency of the procedure (i.e., the convergence of the misclassification rate to zero) under a general bipartite stochastic block model. We show that the procedure is optimal in the sense that it achieves the optimal convergence rate that is achievable by a biclustering oracle, adaptively over the whole class, up to constants. This is further formalized by deriving a minimax lower bound over a class of biclustering problems. The optimal rate we obtain sharpens some of the existing results and generalizes others to a wide regime of average degree growth, from sparse networks with average degrees growing arbitrarily slowly to fairly dense networks with average degrees of order $\sqrt{n}$. As a special case, we recover the known exact recovery threshold in the $\log n$ regime of sparsity. To obtain the consistency result, as part of the provable version of the algorithm, we introduce a sub-block partitioning scheme that is also computationally attractive, allowing for distributed implementation of the algorithm without sacrificing optimality. The provable algorithm is derived from a general class of pseudo-likelihood biclustering algorithms that employ simple EM type updates. We show the effectiveness of this general class by numerical simulations.
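As a readily available point of comparison for the spectral initialization stage, the biadjacency matrix of a bipartite network can be fed to off-the-shelf spectral co-clustering. This is only a baseline sketch on a simulated bipartite stochastic block model, not the paper's two-stage pseudo-likelihood procedure.

```python
import numpy as np
from sklearn.cluster import SpectralCoclustering

rng = np.random.default_rng(0)

# Simulate a small bipartite stochastic block model: 2 row blocks x 2 column blocks.
row_blocks = np.repeat([0, 1], 60)
col_blocks = np.repeat([0, 1], 80)
P = np.array([[0.30, 0.05],
              [0.05, 0.25]])                              # block edge probabilities
A = rng.binomial(1, P[np.ix_(row_blocks, col_blocks)])    # biadjacency matrix

model = SpectralCoclustering(n_clusters=2, random_state=0).fit(A)
print("row cluster sizes:   ", np.bincount(model.row_labels_))
print("column cluster sizes:", np.bincount(model.column_labels_))
```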




cluster

High-Dimensional Inference for Cluster-Based Graphical Models

Motivated by modern applications in which one constructs graphical models based on a very large number of features, this paper introduces a new class of cluster-based graphical models, in which variable clustering is applied as an initial step for reducing the dimension of the feature space. We employ model assisted clustering, in which the clusters contain features that are similar to the same unobserved latent variable. Two different cluster-based Gaussian graphical models are considered: the latent variable graph, corresponding to the graphical model associated with the unobserved latent variables, and the cluster-average graph, corresponding to the vector of features averaged over clusters. Our study reveals that likelihood based inference for the latent graph, not analyzed previously, is analytically intractable. Our main contribution is the development and analysis of alternative estimation and inference strategies, for the precision matrix of an unobservable latent vector Z. We replace the likelihood of the data by an appropriate class of empirical risk functions, that can be specialized to the latent graphical model and to the simpler, but under-analyzed, cluster-average graphical model. The estimators thus derived can be used for inference on the graph structure, for instance on edge strength or pattern recovery. Inference is based on the asymptotic limits of the entry-wise estimates of the precision matrices associated with the conditional independence graphs under consideration. While taking the uncertainty induced by the clustering step into account, we establish Berry-Esseen central limit theorems for the proposed estimators. It is noteworthy that, although the clusters are estimated adaptively from the data, the central limit theorems regarding the entries of the estimated graphs are proved under the same conditions one would use if the clusters were known in advance. As an illustration of the usage of these newly developed inferential tools, we show that they can be reliably used for recovery of the sparsity pattern of the graphs we study, under FDR control, which is verified via simulation studies and an fMRI data analysis. These experimental results confirm the theoretically established difference between the two graph structures. Furthermore, the data analysis suggests that the latent variable graph, corresponding to the unobserved cluster centers, can help provide more insight into the understanding of the brain connectivity networks relative to the simpler, average-based, graph.
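The cluster-average graph half of the paper lends itself to a simple sketch: cluster the features, average them within clusters, and estimate a sparse precision matrix on the averages. The clustering step and the graphical lasso below are generic stand-ins assuming scikit-learn; the paper's latent-variable estimators and Berry-Esseen-based inference are not reproduced.

```python
import numpy as np
from sklearn.cluster import FeatureAgglomeration
from sklearn.covariance import GraphicalLassoCV
from sklearn.datasets import make_spd_matrix

rng = np.random.default_rng(0)

# Simulate n samples of p correlated features.
p, n = 30, 500
cov = make_spd_matrix(p, random_state=0)
X = rng.multivariate_normal(np.zeros(p), cov, size=n)

# Step 1: group the p features into k clusters (generic feature clustering).
k = 6
agg = FeatureAgglomeration(n_clusters=k).fit(X)

# Step 2: average the features within each cluster.
Xbar = np.column_stack([X[:, agg.labels_ == j].mean(axis=1) for j in range(k)])

# Step 3: sparse precision matrix of the averages = cluster-average graph.
glasso = GraphicalLassoCV().fit(Xbar)
edges = (np.abs(glasso.precision_) > 1e-8) & ~np.eye(k, dtype=bool)
print("estimated edges among cluster averages:\n", edges.astype(int))
```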




cluster

Union of Low-Rank Tensor Spaces: Clustering and Completion

We consider the problem of clustering and completing a set of tensors with missing data that are drawn from a union of low-rank tensor spaces. In the clustering problem, given a partially sampled tensor data that is composed of a number of subtensors, each chosen from one of a certain number of unknown tensor spaces, we need to group the subtensors that belong to the same tensor space. We provide a geometrical analysis on the sampling pattern and subsequently derive the sampling rate that guarantees the correct clustering under some assumptions with high probability. Moreover, we investigate the fundamental conditions for finite/unique completability for the union of tensor spaces completion problem. Both deterministic and probabilistic conditions on the sampling pattern to ensure finite/unique completability are obtained. For both the clustering and completion problems, our tensor analysis provides significantly better bound than the bound given by the matrix analysis applied to any unfolding of the tensor data.




cluster

A Bayesian sparse finite mixture model for clustering data from a heterogeneous population

Erlandson F. Saraiva, Adriano K. Suzuki, Luís A. Milan.

Source: Brazilian Journal of Probability and Statistics, Volume 34, Number 2, 323--344.

Abstract:
In this paper, we introduce a Bayesian approach for clustering data using a sparse finite mixture model (SFMM). The SFMM is a finite mixture model with a large, previously fixed number of components $k$, where many components can be empty. In this model, the number of components $k$ can be interpreted as the maximum number of distinct mixture components. Then, we explore the use of a prior distribution for the weights of the mixture model that takes into account the possibility that the number of clusters $k_{\mathbf{c}}$ (i.e., nonempty components) can be random and smaller than the number of components $k$ of the finite mixture model. In order to determine clusters, we develop an MCMC algorithm denominated the Split-Merge allocation sampler. In this algorithm, the split-merge strategy is data-driven and was inserted within the algorithm in order to increase the mixing of the Markov chain in relation to the number of clusters. The performance of the method is verified using simulated datasets and three real datasets. The first real data set is the benchmark galaxy data, while the second and third are the publicly available data sets on Enzyme and Acidity, respectively.
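A rough analogue of an overfitted ("sparse") finite mixture is available in scikit-learn through a variational Bayesian Gaussian mixture with a large number of components and a small Dirichlet concentration, which drives the weights of unneeded components toward zero. This only illustrates the effect of the prior; the paper's Split-Merge MCMC sampler is not reproduced here.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import BayesianGaussianMixture

X, _ = make_blobs(n_samples=800, centers=3, cluster_std=1.0, random_state=0)

# Overfitted mixture: k = 10 components, small Dirichlet concentration so that
# unnecessary components are emptied out (their weights shrink toward zero).
bgm = BayesianGaussianMixture(
    n_components=10,
    weight_concentration_prior_type="dirichlet_distribution",
    weight_concentration_prior=0.01,
    max_iter=500,
    random_state=0,
).fit(X)

occupied = bgm.weights_ > 0.01
print("component weights:", np.round(bgm.weights_, 3))
print("effective number of clusters:", occupied.sum())
```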




cluster

Variable selection methods for model-based clustering

Michael Fop, Thomas Brendan Murphy.

Source: Statistics Surveys, Volume 12, 18--65.

Abstract:
Model-based clustering is a popular approach for clustering multivariate data which has seen applications in numerous fields. Nowadays, high-dimensional data are more and more common and the model-based clustering approach has adapted to deal with the increasing dimensionality. In particular, the development of variable selection techniques has received a lot of attention and research effort in recent years. Even for small size problems, variable selection has been advocated to facilitate the interpretation of the clustering results. This review provides a summary of the methods developed for variable selection in model-based clustering. Existing R packages implementing the different methods are indicated and illustrated in application to two data analysis examples.




cluster

Finite mixture models and model-based clustering

Volodymyr Melnykov, Ranjan Maitra

Source: Statist. Surv., Volume 4, 80--116.

Abstract:
Finite mixture models have a long history in statistics, having been used to model population heterogeneity, generalize distributional assumptions, and lately, for providing a convenient yet formal framework for clustering and classification. This paper provides a detailed review into mixture models and model-based clustering. Recent trends as well as open problems in the area are also discussed.




cluster

Know Your Clients' behaviours: a cluster analysis of financial transactions. (arXiv:2005.03625v1 [econ.EM])

In Canada, financial advisors and dealers are required by provincial securities commissions, and by the self-regulatory organizations charged with direct regulation over investment dealers and mutual fund dealers, respectively, to collect and maintain Know Your Client (KYC) information, such as their age or risk tolerance, for investor accounts. With this information, investors, under their advisor's guidance, make decisions on their investments which are presumed to be beneficial to their investment goals. Our unique dataset is provided by a financial investment dealer with over 50,000 accounts for over 23,000 clients. We use a modified behavioural finance recency, frequency, monetary model for engineering features that quantify investor behaviours, and machine learning clustering algorithms to find groups of investors that behave similarly. We show that the KYC information collected does not explain client behaviours, whereas trade and transaction frequency and volume are most informative. We believe the results shown herein encourage financial regulators and advisors to use more advanced metrics to better understand and predict investor behaviours.
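A minimal sketch of the recency-frequency-monetary feature engineering plus clustering pipeline the abstract describes, assuming a hypothetical transactions table with account_id, trade_date and amount columns; the authors' proprietary data and their modified RFM model are not available here.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical transaction records: one row per trade.
tx = pd.DataFrame({
    "account_id": [1, 1, 2, 2, 2, 3],
    "trade_date": pd.to_datetime(["2020-01-05", "2020-03-01", "2020-02-10",
                                  "2020-02-20", "2020-04-01", "2019-12-15"]),
    "amount": [500.0, 1200.0, 300.0, 450.0, 800.0, 10000.0],
})

snapshot = tx["trade_date"].max()
rfm = tx.groupby("account_id").agg(
    recency=("trade_date", lambda d: (snapshot - d.max()).days),
    frequency=("trade_date", "count"),
    monetary=("amount", "sum"),
)

# Standardize the behavioural features and group accounts that behave similarly.
Z = StandardScaler().fit_transform(rfm)
rfm["segment"] = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(Z)
print(rfm)
```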




cluster

Predictive Modeling of ICU Healthcare-Associated Infections from Imbalanced Data. Using Ensembles and a Clustering-Based Undersampling Approach. (arXiv:2005.03582v1 [cs.LG])

Early detection of patients vulnerable to infections acquired in the hospital environment is a challenge in current health systems given the impact that such infections have on patient mortality and healthcare costs. This work is focused on both the identification of risk factors and the prediction of healthcare-associated infections in intensive-care units by means of machine-learning methods. The aim is to support decision making addressed at reducing the incidence rate of infections. In this field, it is necessary to deal with the problem of building reliable classifiers from imbalanced datasets. We propose a clustering-based undersampling strategy to be used in combination with ensemble classifiers. A comparative study with data from 4616 patients was conducted in order to validate our proposal. We applied several single and ensemble classifiers both to the original dataset and to data preprocessed by means of different resampling methods. The results were analyzed by means of classic and recent metrics specifically designed for imbalanced data classification. They revealed that the proposal is more efficient in comparison with other approaches.
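A hedged sketch of the general clustering-based undersampling idea on synthetic imbalanced data: cluster the majority class, keep the samples closest to each centroid so the retained majority points still cover its structure, and train an ensemble on the rebalanced set. The paper's resampling ratios, clinical features and evaluation metrics are not reproduced.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, weights=[0.95, 0.05],
                           n_informative=5, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, stratify=y, random_state=0)

# Cluster the majority class and keep the point nearest each centroid, so the
# undersampled majority class still covers its main regions of feature space.
maj, mino = Xtr[ytr == 0], Xtr[ytr == 1]
k = len(mino)                                    # one kept majority point per cluster
km = KMeans(n_clusters=k, n_init=3, random_state=0).fit(maj)
nearest = np.argmin(km.transform(maj), axis=0)   # closest majority point per centroid
maj_kept = maj[np.unique(nearest)]

X_bal = np.vstack([maj_kept, mino])
y_bal = np.concatenate([np.zeros(len(maj_kept)), np.ones(len(mino))])

clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_bal, y_bal)
print("balanced accuracy:", balanced_accuracy_score(yte, clf.predict(Xte)))
```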




cluster

Fair Algorithms for Hierarchical Agglomerative Clustering. (arXiv:2005.03197v1 [cs.LG])

Hierarchical Agglomerative Clustering (HAC) algorithms are extensively utilized in modern data science and machine learning, and seek to partition the dataset into clusters while generating a hierarchical relationship between the data samples themselves. HAC algorithms are employed in a number of applications, such as biology, natural language processing, and recommender systems. Thus, it is imperative to ensure that these algorithms are fair -- even if the dataset contains biases against certain protected groups, the cluster outputs generated should not be discriminatory against samples from any of these groups. However, recent work in clustering fairness has mostly focused on center-based clustering algorithms, such as k-median and k-means clustering. Therefore, in this paper, we propose fair algorithms for performing HAC that 1) enforce fairness constraints irrespective of the distance linkage criteria used, 2) generalize to any natural measures of clustering fairness for HAC, 3) work for multiple protected groups, and 4) have running times competitive with vanilla HAC. To the best of our knowledge, this is the first work that studies fairness for HAC algorithms. We also propose an algorithm with lower asymptotic time complexity than HAC algorithms that can rectify existing HAC outputs and make them fair as a result. Moreover, we carry out extensive experiments on multiple real-world UCI datasets to demonstrate the working of our algorithms.
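For reference, the vanilla HAC that the fair variants above are benchmarked against takes only a few lines with SciPy; the fairness constraints themselves are not implemented in this sketch.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=4, random_state=0)

# Vanilla HAC: build the merge tree with a chosen linkage criterion, then cut
# it to obtain flat clusters.
Z = linkage(X, method="average")
labels = fcluster(Z, t=4, criterion="maxclust")
print("cluster sizes:", np.bincount(labels)[1:])
```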




cluster

Model assisted variable clustering: Minimax-optimal recovery and algorithms

Florentina Bunea, Christophe Giraud, Xi Luo, Martin Royer, Nicolas Verzelen.

Source: The Annals of Statistics, Volume 48, Number 1, 111--137.

Abstract:
The problem of variable clustering is that of estimating groups of similar components of a $p$-dimensional vector $X=(X_{1},\ldots ,X_{p})$ from $n$ independent copies of $X$. There exists a large number of algorithms that return data-dependent groups of variables, but their interpretation is limited to the algorithm that produced them. An alternative is model-based clustering, in which one begins by defining population level clusters relative to a model that embeds notions of similarity. Algorithms tailored to such models yield estimated clusters with a clear statistical interpretation. We take this view here and introduce the class of $G$-block covariance models as a background model for variable clustering. In such models, two variables in a cluster are deemed similar if they have similar associations with all other variables. This can arise, for instance, when groups of variables are noise-corrupted versions of the same latent factor. We quantify the difficulty of clustering data generated from a $G$-block covariance model in terms of cluster proximity, measured with respect to two related, but different, cluster separation metrics. We derive minimax cluster separation thresholds, which are the metric values below which no algorithm can recover the model-defined clusters exactly, and show that they are different for the two metrics. We therefore develop two algorithms, COD and PECOK, tailored to $G$-block covariance models, and study their minimax-optimality with respect to each metric. Of independent interest is the fact that the analysis of the PECOK algorithm, which is based on a corrected convex relaxation of the popular $K$-means algorithm, provides the first statistical analysis of such algorithms for variable clustering. Additionally, we compare our methods with another popular clustering method, spectral clustering. Extensive simulation studies, as well as our data analyses, confirm the applicability of our approach.
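To make the variable clustering problem concrete, here is a naive baseline that groups variables via hierarchical clustering on a correlation-based distance, applied to data simulated from a block structure where variables are noisy copies of a few latent factors. This generic heuristic is only a stand-in and is not the COD or PECOK algorithm analyzed in the paper.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

rng = np.random.default_rng(0)

# Simulate p variables that are noisy copies of 3 latent factors (block structure).
n, groups = 400, [0] * 4 + [1] * 3 + [2] * 5
latent = rng.standard_normal((n, 3))
X = latent[:, groups] + 0.3 * rng.standard_normal((n, len(groups)))

# Distance between variables: 1 - |correlation|, then average-linkage HAC.
corr = np.corrcoef(X, rowvar=False)
dist = 1.0 - np.abs(corr)
np.fill_diagonal(dist, 0.0)
Z = linkage(squareform(dist, checks=False), method="average")
labels = fcluster(Z, t=3, criterion="maxclust")
print("recovered variable clusters:", labels)
```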