Three Gradient-Based Optimization Methods and Their Comparison
Abstract:
In recent years, the rapid development of science and technology has brought artificial intelligence into the public eye. Machine learning and deep learning, as core technologies of artificial intelligence, have attracted considerable attention from scholars. In machine learning, the complexity of a model's loss function makes it difficult to obtain a closed-form expression for the parameter estimates quickly and effectively, so gradient-based optimization methods are widely used to solve such problems; to date, the most mainstream of these is the gradient descent method. In practice, however, as data sets grow ever larger, the training process of traditional gradient descent becomes extremely slow and can no longer solve large-scale machine learning problems quickly and effectively. The stochastic gradient descent algorithm was therefore proposed as an improvement on gradient descent, and stochastic methods are favored in large-scale applications because of their good scaling behavior. This paper first gives a detailed introduction to the basic ideas of three methods, gradient descent, stochastic gradient descent, and mini-batch stochastic gradient descent, and to the specific procedures by which they solve optimization problems. Numerical examples are then designed for simulation experiments, and the strengths and weaknesses of the three methods are compared. The experimental results lead to the following conclusions: gradient descent converges well but has low computational efficiency, stochastic gradient descent has high computational efficiency, and mini-batch stochastic gradient descent falls between the two. Therefore, for large-scale problems, stochastic gradient descent is more effective than the other two methods.
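To make the comparison described above concrete, the following is a minimal sketch, not taken from the paper, that applies the three update rules to a synthetic least-squares problem. The data dimensions, learning rate, step count, and helper names (grad, run) are illustrative assumptions rather than the authors' experimental setup.

```python
# Minimal sketch comparing full-batch gradient descent, stochastic gradient
# descent, and mini-batch stochastic gradient descent on synthetic data.
# All sizes and hyperparameters below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 10
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)

def grad(w, idx):
    # Gradient of the mean-squared error restricted to the rows in idx.
    Xb, yb = X[idx], y[idx]
    return 2.0 / len(idx) * Xb.T @ (Xb @ w - yb)

def run(batch_size, lr=0.05, n_steps=2000):
    # batch_size == n     -> (full) gradient descent
    # batch_size == 1     -> stochastic gradient descent
    # 1 < batch_size < n  -> mini-batch stochastic gradient descent
    w = np.zeros(d)
    for _ in range(n_steps):
        if batch_size < n:
            idx = rng.choice(n, size=batch_size, replace=False)
        else:
            idx = np.arange(n)
        w -= lr * grad(w, idx)
    return np.mean((X @ w - y) ** 2)

for batch_size in (n, 1, 32):
    print(f"batch size {batch_size:4d}: final MSE = {run(batch_size):.4f}")
```

In this sketch the only difference between the three methods is how many rows are used to form each gradient estimate, which is exactly the trade-off the abstract describes: using all n rows gives the exact gradient but the most expensive step, a single row gives the cheapest but noisiest step, and a mini-batch lies between the two.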