Comprehensive K-Means Clustering

DOI: 10.4236/jcc.2024.123009, PP. 146-159

Keywords: K-Means Clustering


Abstract:

The k-means algorithm is a popular data clustering technique owing to its speed and simplicity. However, it is sensitive to the choice of initial seeds and can produce inaccurate clusters when the seeds are poor, particularly on complex datasets or datasets with non-spherical clusters. In this paper, a Comprehensive K-Means Clustering algorithm is presented, in which multiple trials of k-means are performed on a given dataset. The clustering result from each trial is transformed into a set of five-dimensional data points, each containing the scope (range) values of the x and y coordinates of a cluster along with the number of points within that cluster. A graph displaying the configuration of these points is then generated using Principal Component Analysis (PCA), from which the common clustering patterns in the dataset can be observed. The robustness of these patterns is then examined through the variance of the results across trials, where each trial clusters a different subset retaining a fixed percentage of the original data points. By aggregating information from multiple trials, we can distinguish clusters that consistently emerge across runs from those that are unstable or spurious, and hence draw more reliable conclusions about the underlying structure of complex datasets. Our experiments show that the algorithm finds the most common associations between different dimensions of the data over multiple trials, often more accurately than competing algorithms, and can also measure the stability of these clusters, an ability that other k-means variants lack.
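The procedure described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes scikit-learn for k-means and PCA, takes the five per-cluster features to be (x-min, x-max, y-min, y-max, point count), and uses hypothetical parameter names (`n_trials`, `keep_frac`) for the number of trials and the retained fraction of data.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

def comprehensive_kmeans(data, k=3, n_trials=20, keep_frac=0.8, seed=0):
    """Run k-means on random subsets; summarize each cluster as a 5-D point."""
    rng = np.random.default_rng(seed)
    summaries = []
    for t in range(n_trials):
        # Each trial clusters a subset keeping a fixed fraction of the data.
        idx = rng.choice(len(data), size=int(keep_frac * len(data)), replace=False)
        sample = data[idx]
        labels = KMeans(n_clusters=k, n_init=10, random_state=t).fit_predict(sample)
        for c in range(k):
            pts = sample[labels == c]
            # Five features: the x/y scopes (ranges) plus the cluster size.
            summaries.append([pts[:, 0].min(), pts[:, 0].max(),
                              pts[:, 1].min(), pts[:, 1].max(), len(pts)])
    summaries = np.array(summaries)
    # Project the 5-D cluster summaries to 2-D for visual inspection.
    coords = PCA(n_components=2).fit_transform(summaries)
    return summaries, coords

# Clusters that recur across trials form tight groups in `coords`;
# unstable clusters scatter widely.
rng = np.random.default_rng(42)
data = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(6, 1, (100, 2))])
summaries, coords = comprehensive_kmeans(data, k=2)
print(summaries.shape, coords.shape)  # (40, 5) (40, 2)
```

Plotting `coords` (e.g. with a scatter plot) gives the configuration graph from which the common patterns and their variance can be read off.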

