%0 Journal Article
%T 基于端到端的复杂场景中文文字识别方法研究 (Research on End-to-End Chinese Text Recognition Method in Complex Scenes)
%A 帅梓涵
%A 胡金蓉
%A 郎子鑫
%A 罗月梅
%A 李桂钢
%J Hans Journal of Data Mining
%P 154-164
%@ 2163-1468
%D 2023
%I Hans Publishing
%R 10.12677/HJDM.2023.132015
%X 近年来,由于成功挖掘了场景文本检测和识别的内在协同作用,端到端场景文本识别引起了人们的极大关注。然而,最近最先进的方法通常仅通过共享主干来结合检测和识别,这些方法由于其尺度和纵横比的极端变化不能很好地处理场景文本。在本文中,我们提出了一种新的端到端场景文本识别框架,称为ES-Transformer。与以往以整体方式学习场景文本的方法不同,我们的方法基于几个代表性特征来执行场景文本识别,这避免了背景干扰并降低了计算成本。具体来说,使用基本特征金字塔网络进行特征提取,然后,我们采用Swin-Transformer来建模采样特征之间的关系,从而有效地将它们划分为合理的组。在提升识别精度的同时降低了计算复杂度,不再依赖于繁杂的后处理模块。对中文数据集的定性和定量实验表明,ES-Transformer优于现有方法。
%X In recent years, end-to-end scene text recognition has attracted great attention owing to the successful exploitation of the inherent synergy between scene text detection and recognition. However, recent state-of-the-art methods usually combine detection and recognition only through a shared backbone, and they cannot handle scene text well due to its extreme variations in scale and aspect ratio. In this paper, we propose a new end-to-end scene text recognition framework called ES-Transformer. Unlike previous methods that learn scene text in a holistic way, our approach performs scene text recognition based on several representative features, which avoids background interference and reduces computational cost. Specifically, we use a basic feature pyramid network for feature extraction, and then employ Swin-Transformer to model the relationships between the sampled features, effectively partitioning them into reasonable groups. ES-Transformer improves recognition accuracy while reducing computational complexity, and no longer relies on complex post-processing modules. Qualitative and quantitative experiments on Chinese datasets show that ES-Transformer outperforms existing methods.
%K End-to-End
%K Text Recognition
%K Transformer
%K Deep Learning
%U http://www.hanspub.org/journal/PaperInformation.aspx?PaperID=64199