We address the problem of protein superfamily classification, in which the family membership of a newly discovered amino acid sequence is predicted. Correct prediction is of great concern to researchers and drug analysts because it helps them discover new drugs. As this problem falls broadly under the category of pattern classification, we optimize feature extraction in the first stage and classifier design in the second stage, with the overall objective of maximizing the accuracy of the classifier. In the feature extraction phase, a genetic algorithm (GA) based wrapper approach is used to select a few eigenvectors from the principal component analysis (PCA) space; the selection is encoded as a binary string in the chromosome. The eigenvectors at the positions of the 1’s in the chromosome are selected to build the transformation matrix, which then maps the original high-dimensional feature space to a lower-dimensional feature space. Using PCA-NSGA-II (nondominated sorting GA), the nondominated solutions obtained from the Pareto front address the trade-off between the number of eigenvectors selected and the accuracy obtained by the classifier. In the second stage, the recursive orthogonal least squares algorithm (ROLSA) is used to train a radial basis function network (RBFN), selecting an optimal number of hidden centres and updating the output layer weight matrix. This approach can be applied to large data sets with much lower computer memory requirements. Thus, very small architectures with few hidden centres are obtained that show a higher level of accuracy.

1. Introduction

Protein superfamily classification deals with predicting the family membership of newly discovered proteins. This helps drug analysts discover the new drugs required to treat new diseases.
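The chromosome-to-transformation-matrix step described in the abstract can be illustrated with a minimal sketch. It assumes the PCA eigenvectors are already available as a list of vectors; the function names `select_eigenvectors` and `project` and the toy data are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence

Vector = List[float]


def select_eigenvectors(eigenvectors: Sequence[Vector],
                        chromosome: Sequence[int]) -> List[Vector]:
    """Keep the eigenvectors whose chromosome bit is 1."""
    if len(eigenvectors) != len(chromosome):
        raise ValueError("chromosome needs one bit per eigenvector")
    return [vec for vec, bit in zip(eigenvectors, chromosome) if bit == 1]


def project(sample: Vector, selected: Sequence[Vector]) -> Vector:
    """Map a (mean-centred) sample onto the selected eigenvectors."""
    return [sum(x * v for x, v in zip(sample, vec)) for vec in selected]


# Toy data (assumption): the standard basis stands in for PCA eigenvectors.
# Real eigenvectors would come from PCA on the training feature vectors.
eigenvectors = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
chromosome = [1, 0, 1]  # keep the first and third eigenvectors

selected = select_eigenvectors(eigenvectors, chromosome)
reduced = project([2.0, 5.0, -1.0], selected)
print(reduced)  # 3-dimensional sample mapped to 2 dimensions
```

In a wrapper setting, each such chromosome would be scored by training the classifier on the reduced features, so that the number of 1 bits and the resulting accuracy form the two objectives traded off on the Pareto front.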
This classification helps in predicting the function and/or structure of an unknown protein sequence, thus avoiding expensive biological (wet) experiments in the laboratory. Once a particular sequence S causing disease D is classified into a superfamily, researchers can design new drugs by trying combinations of existing drugs for that superfamily. Thus, solving this classification problem helps researchers treat diseases by discovering new drugs [1]. Among the major techniques developed in the past, the popular BLAST tool [19] represents the simplest nearest neighbour approach and exploits pairwise local alignments to measure sequence similarity. Another type of direct
References
[1]
K. Blekas, D. I. Fotiadis, and A. Likas, “Motif-based protein sequence classification using neural networks,” Journal of Computational Biology, vol. 12, no. 1, pp. 64–82, 2005.
[2]
R. Durbin, S. R. Eddy, A. Krogh, and G. Mitchison, “Markov chain and hidden Markov model,” in Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids, chapter 3, pp. 47–65, Cambridge University Press, 1998.
[3]
C. Wu, M. Berry, S. Shivakumar, and J. McLarty, “Neural networks for full-scale protein sequence classification: sequence encoding with singular value decomposition,” Machine Learning, vol. 21, no. 1-2, pp. 177–193, 1995.
[4]
E. A. Ferrán, P. Ferrara, and B. Pflugfelder, “Protein classification using neural networks,” in Proceedings of the International Conference on Intelligent Systems for Molecular Biology, vol. 1, pp. 127–135, 1993.
[5]
J. T. L. Wang, Q. Ma, D. Shasha, and C. H. Wu, “New techniques for extracting features from protein sequences,” IBM Systems Journal, vol. 40, no. 2, pp. 426–441, 2001.
[6]
D. Wang, N. Kion Lee, and T. S. Dillon, “Extraction and optimization of fuzzy protein sequences classification rules using GRBF neural networks,” Neural Information Processing-Letters and Reviews, vol. 1, no. 1, pp. 53–59, 2003.
[7]
E. G. Mansoori, M. J. Zolghadri, S. D. Katebi, H. Mohabatkar, R. Boostani, and M. H. Sadreddini, “Generating fuzzy rules for protein classification,” Iranian Journal of Fuzzy Systems, vol. 5, no. 2, pp. 21–33, 2008.
[8]
L. French, A. Ngom, and L. Rueda, “Fast protein superfamily classification using principal component null space analysis,” in Advances in Artificial Intelligence, vol. 3501, pp. 158–169, 2005.
[9]
D. Wang and G. B. Huang, “Protein sequence classification using extreme learning machine,” in Proceedings of the International Joint Conference on Neural Networks (IJCNN '05), vol. 3, pp. 1406–1411, 2005.
[10]
X.-M. Zhao, J.-X. Du, and H.-Q. Wang, “A new technique for extracting features from protein sequences,” in Proceedings of the International Conference on Intelligent Computing, pp. 1223–1232, August 2005.
[11]
E. G. Mansoori, M. J. Zolghadri, and S. D. Katebi, “Protein superfamily classification using fuzzy rule-based classifier,” IEEE Transactions on Nanobioscience, vol. 8, no. 1, pp. 92–99, 2009.
[12]
Z. Zainuddin and M. Kumar, “Radial basis function neural networks in protein sequence classification,” Malaysian Journal of Mathematical Sciences, vol. 2, no. 2, pp. 195–204, 2008.
[13]
S. Bandyopadhyay, “An efficient technique for superfamily classification of amino acid sequences: feature extraction, fuzzy clustering and prototype selection,” Fuzzy Sets and Systems, vol. 152, no. 1, pp. 5–16, 2005.
[14]
X.-M. Zhao, D.-S. Huang, and Y.-M. Cheung, “A novel hybrid GA/RBFNN technique for protein sequences classification,” Protein and Peptide Letters, vol. 12, no. 4, pp. 383–386, 2005.
[15]
X.-M. Zhao, Y.-M. Cheung, and D.-S. Huang, “A novel approach to extracting features from motif content and protein composition for protein sequence classification,” Neural Networks, vol. 18, no. 8, pp. 1019–1028, 2005.
[16]
M. Dash and H. Liu, “Feature selection for classification,” Intelligent Data Analysis, vol. 1, pp. 131–156, 1997.
[17]
C. Wu, G. Whitson, J. McLarty, A. Ermongkonchai, and T.-C. Chang, “Protein classification artificial neural system,” Protein Science, vol. 1, no. 5, pp. 667–677, 1992.
[18]
Y. Bengio and Y. Grandvalet, “No unbiased estimator of the variance of K-fold cross-validation,” Journal of Machine Learning Research, vol. 5, pp. 1089–1105, 2004.
[19]
S. F. Altschul, W. Gish, W. Miller, E. W. Myers, and D. J. Lipman, “Basic local alignment search tool,” Journal of Molecular Biology, vol. 215, no. 3, pp. 403–410, 1990.
[20]
Z. Xing, J. Pei, and E. Keogh, “A brief survey on sequence classification,” ACM SIGKDD Explorations Newsletter, vol. 12, no. 1, pp. 40–48, 2010.
[21]
S. Sharma, V. Kumar, T. S. Rani, S. D. Bhavani, and S. B. Raju, “Application of neural networks for protein sequence classification,” in Proceedings of International Conference on Intelligent Sensing and Information Processing (ICISIP '04), pp. 325–328, January 2004.
[22]
A. K. Jain, R. P. W. Duin, and J. Mao, “Statistical pattern recognition: a review,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 1, pp. 4–37, 2000.
[23]
A. Ghosh and S. Dehuri, “Evolutionary algorithms for multi-criterion optimization: a survey,” International Journal of Computing and Information Sciences, vol. 2, no. 1, pp. 38–57, 2004.
[24]
E. Zitzler, K. Deb, and L. Thiele, “Comparison of multiobjective evolutionary algorithms: empirical results,” Evolutionary Computation, vol. 8, no. 2, pp. 173–195, 2000.
[25]
K. Deb, A. Pratap, S. Agarwal, and T. Meyarivan, “A fast and elitist multiobjective genetic algorithm: NSGA-II,” IEEE Transactions on Evolutionary Computation, vol. 6, no. 2, pp. 182–197, 2002.
[26]
M. Farina, K. Deb, and P. Amato, “Dynamic multiobjective optimization problems: test cases, approximations, and applications,” IEEE Transactions on Evolutionary Computation, vol. 8, no. 5, pp. 425–442, 2004.
[27]
K. Balci and V. Atalay, “PCA for gender estimation: which eigenvectors contribute?” in Proceedings of the 16th International Conference on Pattern Recognition (ICPR’02), pp. 363–366, IEEE Computer Society, Washington, DC, USA, 2002.
[28]
Z. Sun, G. Bebis, and R. Miller, “Object detection using feature subset selection,” Pattern Recognition, vol. 37, no. 11, pp. 2165–2176, 2004.
[29]
R. Neruda and P. Kudová, “Learning methods for radial basis function networks,” Future Generation Computer Systems, vol. 21, no. 7, pp. 1131–1142, 2005.
[30]
S. A. Billings and G. L. Zheng, “Radial basis function network configuration using genetic algorithms,” Neural Networks, vol. 8, no. 6, pp. 877–890, 1995.
[31]
L. Guo, D.-S. Huang, and W. Zhao, “Combining genetic optimisation with hybrid learning algorithm for radial basis function neural networks,” Electronics Letters, vol. 39, no. 22, pp. 1600–1601, 2003.
[32]
J. González, I. Rojas, J. Ortega, H. Pomares, F. J. Fernández, and A. F. Díaz, “Multiobjective evolutionary optimization of the size, shape, and position parameters of radial basis function networks for function approximation,” IEEE Transactions on Neural Networks, vol. 14, no. 6, pp. 1478–1495, 2003.
[33]
T. Hatanaka, N. Kondo, and K. Uosaki, “Multiobjective structure selection for radial basis function networks based on Genetic Algorithm,” in Proceedings of the Congress on Evolutionary Computation, vol. 2, pp. 1095–1100, 2003.
[34]
Z.-Q. Zhao and D.-S. Huang, “A mended hybrid learning algorithm for radial basis function neural networks to improve generalization capability,” Applied Mathematical Modelling, vol. 31, no. 7, pp. 1271–1281, 2007.
[35]
S. N. Qasem and S. M. Shamsuddin, “Generalization improvement of radial basis function network based on multi-objective particle swarm optimization,” Journal of Artificial Intelligence, vol. 3, no. 1, pp. 1–16, 2010.
[36]
M. M. Zirkohi, M. M. Fateh, and A. Akbarzade, “Design of radial basis function network using adaptive particle swarm optimization and orthogonal least squares,” Journal of Software Engineering and Applications, vol. 3, pp. 704–708, 2010.
[37]
S. Chen, C. F. N. Cowan, and P. M. Grant, “Orthogonal least squares learning algorithm for radial basis function networks,” IEEE Transactions on Neural Networks, vol. 2, no. 2, pp. 302–309, 1991.
[38]
S. Chen, P. M. Grant, and C. F. N. Cowan, “Orthogonal least-squares algorithm for training multioutput radial basis function networks,” IEE Proceedings on Radar and Signal Processing, vol. 139, no. 6, pp. 378–384, 1992.
[39]
X. Hong and S. A. Billings, “Givens rotation based fast backward elimination algorithm for RBF neural network pruning,” IEE Proceedings on Control Theory and Applications, vol. 144, no. 5, pp. 381–384, 1997.
[40]
D. L. Yu, J. B. Gomm, and D. Williams, “A recursive orthogonal least squares algorithm for training RBF networks,” Neural Processing Letters, vol. 5, no. 3, pp. 167–176, 1997.
[41]
J. B. Gomm and D. L. Yu, “Selecting radial basis function network centers with recursive orthogonal least squares training,” IEEE Transactions on Neural Networks, vol. 11, no. 2, pp. 306–314, 2000.