Subspace Clustering of High-Dimensional Data: An Evolutionary Approach

DOI: 10.1155/2013/863146


Abstract:

Clustering high-dimensional data has been a major challenge due to the inherent sparsity of the points. Most existing clustering algorithms become substantially inefficient if the required similarity measure is computed between data points in the full-dimensional space. In this paper, we present a robust multiobjective subspace clustering (MOSCL) algorithm for the challenging problem of high-dimensional clustering. The first phase of MOSCL performs subspace relevance analysis by detecting dense and sparse regions and their locations in the data set. After detecting the dense regions, it eliminates outliers. MOSCL then discovers subspaces in the dense regions of the data set and produces subspace clusters. In thorough experiments on synthetic and real-world data sets, we demonstrate that MOSCL is superior to the PROCLUS clustering algorithm for subspace clustering. We additionally investigate the effect of the first phase, the detection of dense regions, on the quality of the resulting subspace clusters. Our results indicate that removing outliers improves the accuracy of subspace clustering. The clustering results are validated by the clustering error (CE) distance on various data sets. MOSCL can discover clusters in all subspaces with high quality, and its efficiency outperforms that of PROCLUS.

1. Introduction

The clustering problem concerns the discovery of homogeneous groups of data according to a certain similarity measure. Clustering has been studied in statistics [1], machine learning [2–4], bioinformatics [3, 5–7], and, more recently, databases [8–10]. Clustering algorithms find a partition of the points such that points within a cluster are more similar to each other than to points in different clusters [11]. In traditional clustering, each dimension is weighted equally when computing the distance between points. Most such algorithms perform well on low-dimensional data sets [12–15]. In high-dimensional feature spaces, however, their performance and efficiency deteriorate considerably [16]. A further difficulty we face in clustering is therefore the dimensionality of the data itself. In the clustering task, the problem of high dimensionality presents a dual aspect. First, the presence of irrelevant attributes obscures any clustering tendency, because such features cause the algorithm to search for clusters where none exist. This also happens with low-dimensional data, but the likelihood that irrelevant features are present, and their number, grows with the dimension. The second
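The distance-concentration effect behind this deterioration (the phenomenon analyzed by Beyer et al. [17]) can be made concrete in a few lines. The following Python sketch is illustrative material added to this excerpt, not code from the paper; the helper names relative_contrast and subspace_distance are our own. It shows how the gap between the nearest and farthest neighbor collapses as the dimensionality grows, and how a distance restricted to a relevant subspace, the kind of measure a subspace clustering algorithm evaluates, avoids the effect.

import numpy as np

rng = np.random.default_rng(0)

def relative_contrast(n_points, n_dims):
    # (d_max - d_min) / d_min over Euclidean distances from a random
    # query to n_points uniform points; this ratio shrinks as n_dims
    # grows, so "nearest" and "farthest" become nearly indistinguishable.
    points = rng.random((n_points, n_dims))
    query = rng.random(n_dims)
    dists = np.linalg.norm(points - query, axis=1)
    return (dists.max() - dists.min()) / dists.min()

def subspace_distance(x, y, dims):
    # Euclidean distance restricted to the relevant dimensions only.
    idx = np.asarray(sorted(dims))
    return np.linalg.norm(np.asarray(x)[idx] - np.asarray(y)[idx])

for d in (2, 10, 100, 1000):
    print(f"{d:4d} dims: relative contrast = {relative_contrast(1000, d):8.2f}")

# Two points that coincide on a 2-dimensional subspace but differ elsewhere:
x = np.concatenate([[0.5, 0.5], rng.random(98)])
y = np.concatenate([[0.5, 0.5], rng.random(98)])
print("full-space distance:", np.linalg.norm(x - y))          # large
print("subspace distance on dims {0, 1}:", subspace_distance(x, y, {0, 1}))  # 0.0

In the full 100-dimensional space the two points above look far apart even though they agree exactly on dimensions 0 and 1; a subspace-aware measure recovers that agreement, which is precisely what motivates projected and subspace clustering algorithms such as PROCLUS [23] and MOSCL.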

References

[1]  L. C. Hamilton, Modern Data Analysis: A First Course in Applied Statistics, Brooks/Cole, 1990.
[2]  D. H. Fisher, “Iterative optimization and simplification of hierarchical clustering,” in Proceedings of the 1st International Conference on Knowledge Discovery and Data Mining (KDD ’95), pp. 118–123, Montreal, Canada, 1995.
[3]  J. Shi and J. Malik, “Normalized cuts and image segmentation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 8, pp. 888–905, 2000.
[4]  N. Kumar and K. Kummamuru, “Semisupervised clustering with metric learning using relative comparisons,” IEEE Transactions on Knowledge and Data Engineering, vol. 20, no. 4, pp. 496–503, 2008.
[5]  M. H. Dunham, Data Mining: Introductory and Advanced Topics, Prentice Hall, 2003.
[6]  A. Thalamuthu, I. Mukhopadhyay, X. Zheng, and G. C. Tseng, “Evaluation and comparison of gene clustering methods in microarray analysis,” Bioinformatics, vol. 22, no. 19, pp. 2405–2412, 2006.
[7]  K. Wang, Z. Du, Y. Chen, and S. Li, “V3COCA: an effective clustering algorithm for complicated objects and its application in breast cancer research and diagnosis,” Simulation Modelling Practice and Theory, vol. 17, no. 2, pp. 454–470, 2009.
[8]  J. Sander, M. Ester, H. Kriegel, and X. Xu, “Density-based clustering in spatial databases: the algorithm GDBSCAN and its applications,” Data Mining and Knowledge Discovery, vol. 2, no. 2, pp. 169–194, 1998.
[9]  B. R. S. A. Young, Efficient algorithms for data mining with federated databases [Ph.D. thesis], University of Cincinnati, 2007.
[10]  G. Sheikholeslami, S. Chatterjee, and A. Zhang, “WaveCluster: a wavelet-based clustering approach for spatial data in very large databases,” Journal of Very Large Data Bases, vol. 8, no. 3-4, pp. 289–304, 2000.
[11]  J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers, San Francisco, Calif, USA, 2011.
[12]  J. MacQueen, “Some methods for classification and analysis of multivariate observations,” in Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, pp. 281–297, 1967.
[13]  E. M. Voorhees, “Implementing agglomerative hierarchic clustering algorithms for use in document retrieval,” Information Processing and Management, vol. 22, no. 6, pp. 465–476, 1986.
[14]  T. Zhang, R. Ramakrishnan, and M. Livny, “BIRCH: an efficient data clustering method for very large databases,” in Proceedings of the ACM SIGMOD International Conference on Management of Data, vol. 25, no. 2, pp. 103–114, June 1996.
[15]  S. Guha and R. Rastogi, “CURE: an efficient clustering algorithm for large databases,” in Proceedings of the ACM SIGMOD International Conference on Management of Data, vol. 27, no. 2, pp. 73–84, ACM Press, 1998.
[16]  F. Korn, B. Pagel, and C. Faloutsos, “On the ‘dimensionality curse’ and the ‘self-similarity blessing’,” IEEE Transactions on Knowledge and Data Engineering, vol. 13, no. 1, pp. 96–111, 2001.
[17]  K. Beyer, J. Goldstein, R. Ramakrishnan, and U. Shaft, “When is ‘nearest neighbor’ meaningful?” in Proceedings of the 7th International Conference on Database Theory (ICDT '99), pp. 217–235, London, UK, 1999.
[18]  B. W. Silverman, Density Estimation for Statistics and Data Analysis, Chapman and Hall, London, UK, 1986.
[19]  M. Ester, H. P. Kriegel, J. Sander, and X. Xu, “A density-based algorithm for discovering clusters in large spatial databases with noise,” in Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining (KDD '96), pp. 226–231, Portland, Ore, USA, 1996.
[20]  A. Hinneburg and H. Gabriel, “DENCLUE 2.0: fast clustering based on kernel density estimation,” in Advances in Intelligent Data Analysis VII, pp. 70–80, Springer, Berlin, Germany, 2007.
[21]  A. H. Pilevar and M. Sukumar, “GCHL: a grid-clustering algorithm for high-dimensional very large spatial data bases,” Pattern Recognition Letters, vol. 26, no. 7, pp. 999–1010, 2005.
[22]  W. Wang, J. Yang, and R. Muntz, “STING: a statistical information grid approach to spatial data mining,” in Proceedings of the 23rd International Conference on Very Large Data Bases (VLDB '97), vol. 97, pp. 186–195, 1997.
[23]  C. C. Aggarwal, C. Procopiuc, J. L. Wolf, and J. S. Park, “Fast algorithms for projected clustering,” in Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data, vol. 28, no. 2, pp. 61–72, 1999.
[24]  A. Patrikainen and M. Meila, “Comparing subspace clusterings,” IEEE Transactions on Knowledge and Data Engineering, vol. 18, no. 7, pp. 902–916, 2006.
[25]  R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan, “Automatic subspace clustering of high dimensional data for data mining applications,” in Proceedings of the ACM-SIGMOD Conference on the Management of Data, vol. 27, no. 2, pp. 94–105, 1998.
[26]  K. Kailing, H. P. Kriegel, and P. Kröger, “Density-connected subspace clustering for high-dimensional data,” in Proceedings of the 4th SIAM International Conference on Data Mining (SDM '04), 2004.
[27]  C. C. Aggarwal and P. S. Yu, “Finding generalized projected clusters in high dimensional spaces,” in Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, vol. 29, no. 2, pp. 70–81, 2000.
[28]  H. P. Kriegel, P. Kröger, M. Renz, and S. Wurst, “A generic framework for efficient subspace clustering of high-dimensional data,” in Proceedings of the 5th IEEE International Conference on Data Mining (ICDM '05), pp. 250–257, November 2005.
[29]  M. J. Zaki, M. Peters, I. Assent, and T. Seidl, “CLICKS: an effective algorithm for mining subspace clusters in categorical datasets,” Data and Knowledge Engineering, vol. 60, no. 1, pp. 51–70, 2007.
[30]  J. H. Friedman and J. J. Meulman, “Clustering objects on subsets of attributes,” Journal of the Royal Statistical Society B, vol. 66, no. 4, pp. 815–839, 2004.
[31]  K. Woo, J. Lee, M. Kim, and Y. Lee, “FINDIT: a fast and intelligent subspace clustering algorithm using dimension voting,” Information and Software Technology, vol. 46, no. 4, pp. 255–271, 2004.
[32]  Y. Chu, J. Huang, K. Chuang, D. Yang, and M. Chen, “Density conscious subspace clustering for high-dimensional data,” IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 1, pp. 16–30, 2010.
[33]  C. H. Cheng, A. W. Fu, and Y. Zhang, “Entropy-based subspace clustering for mining numerical data,” in Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 84–93, 1999.
[34]  S. Goil, H. S. Nagesh, and A. Choudhary, “MAFIA: efficient and scalable subspace clustering for very large data sets,” in Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 443–452, 1999.
[35]  H. S. Nagesh, S. Goil, and A. Choudhary, “Adaptive grids for clustering massive data sets,” in Proceedings of the 1st SIAM International Conference on Data Mining (SDM '01), pp. 1–17, 2001.
[36]  C. M. Procopiuc, M. Jones, P. K. Agarwal, and T. M. Murali, “A Monte Carlo algorithm for fast projective clustering,” in Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data, pp. 418–427, June 2002.
[37]  B. L. Milenova and M. M. Campos, “O-Cluster: scalable clustering of large high-dimensional data sets,” in Oracle Data Mining Technologies, pp. 290–297, Oracle Corporation, 2002.
[38]  L. Jing, M. K. Ng, and J. Z. Huang, “An entropy weighting k-means algorithm for subspace clustering of high-dimensional sparse data,” IEEE Transactions on Knowledge and Data Engineering, vol. 19, no. 8, pp. 1026–1041, 2007.
[39]  E. K. K. Ng, A. W. Fu, and R. C. Wong, “Projective clustering by histograms,” IEEE Transactions on Knowledge and Data Engineering, vol. 17, no. 3, pp. 369–383, 2005.
[40]  E. Achtert, C. Böhm, H. P. Kriegel, P. Kröger, I. Müller-Gorman, and A. Zimek, “Detection and visualization of subspace cluster hierarchies,” in Advances in Databases: Concepts, Systems and Applications, pp. 152–163, Springer, Berlin, Germany, 2007.
[41]  R. R. Sokal and P. H. A. Sneath, Principles of Numerical Taxonomy, W. H. Freeman, 1963.
[42]  D. E. Goldberg, Genetic Algorithms in Search, Optimization and Machine Learning, Addison-Wesley, 1989.
[43]  E. R. Hruschka, R. J. G. B. Campello, A. A. Freitas, and A. C. P. L. F. de Carvalho, “A survey of evolutionary algorithms for clustering,” IEEE Transactions on Systems, Man and Cybernetics C, vol. 39, no. 2, pp. 133–155, 2009.
[44]  J. Handl and J. Knowles, “An evolutionary approach to multiobjective clustering,” IEEE Transactions on Evolutionary Computation, vol. 11, no. 1, pp. 56–76, 2007.
[45]  T. Kohonen, Self-Organizing Maps, vol. 30, Springer, 2001.
[46]  D. E. Goldberg and K. Deb, “A comparative analysis of selection schemes used in genetic algorithms,” in Foundations of Genetic Algorithms, vol. 1, pp. 69–93, Morgan Kaufmann, 1991.
[47]  J. E. Baker, An analysis of the effects of selection in genetic algorithms [Ph.D. thesis], Graduate School of Vanderbilt University, Nashville, Tenn, USA, 1989.
[48]  J. Yeh and J. C. Fu, “A hierarchical genetic algorithm for segmentation of multi-spectral human-brain MRI,” Expert Systems with Applications, vol. 34, no. 2, pp. 1285–1295, 2008.
[49]  J. H. Holland, Adaptation in Natural and Artificial Systems, University of Michigan Press, Ann Arbor, Mich, USA, 1975.
[50]  D. Thierens and D. Goldberg, “Elitist recombination: an integrated selection recombination GA,” in Proceedings of the 1st IEEE Conference on Evolutionary Computation, IEEE World Congress on Computational Intelligence (ICEC '94), pp. 508–512, June 1994.
[51]  J. H. Friedman, J. L. Bentley, and R. A. Finkel, “An algorithm for finding best matches in logarithmic expected time,” ACM Transactions on Mathematical Software, vol. 3, no. 3, pp. 209–226, 1977.
