OALib Journal (ISSN: 2333-9721)

A Voice Cloning Model for Rhythm and Content De-Entanglement

DOI: 10.12677/AIRR.2024.131018, PP. 166-176

Keywords: Voice Cloning, Zero-Shot, Speaker Representation, Content Enhancement

Abstract:

Voice cloning is a technique that synthesizes speech closely resembling a reference utterance through algorithms such as speech analysis, speaker classification, and speech coding. To improve the transfer of a speaker's individual articulatory characteristics, the MRCD model, which disentangles rhythm from content, is proposed. Rhythm information carried by the speech signal is disentangled by random-threshold resampling in a rhythm random-perturbation module, making the rhythm independent of the other speech factors; a mel content enhancement module extracts content features close to the speaker's own pronunciation; style and cycle-consistency loss functions are added to measure how the generated speech differs from the source speech in spectrogram and speaker identity; and finally the end-to-end speech synthesis model FastSpeech2 performs the voice cloning. For experimental evaluation, the method was applied to the publicly available AISHELL3 dataset for a voice conversion task. Evaluated with both objective and subjective metrics, the converted speech outperforms previous methods in speaker similarity while maintaining comparable naturalness scores.
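
To make the two mechanisms named in the abstract concrete, the sketch below gives a minimal Python (PyTorch) approximation of (a) random-threshold resampling, which cuts a mel-spectrogram into random-length segments and stretches or compresses each one so that the original timing (rhythm) can no longer be recovered, and (b) style and cycle-consistency loss terms that compare the generated speech to the source spectrogram and to a reference speaker embedding. The function names, the segment-length and ratio ranges, and the frozen speaker encoder are assumptions made for illustration; they are not taken from the paper.

```python
import torch
import torch.nn.functional as F

def random_threshold_resample(mel, seg_len_range=(19, 32), ratio_range=(0.5, 1.5)):
    """Hypothetical rhythm perturbation: cut the mel frames (T, n_mels) into
    segments of random length and stretch/compress each segment by a random
    ratio, so a downstream encoder cannot rely on the original timing."""
    T = mel.size(0)
    pieces, start = [], 0
    while start < T:
        seg_len = int(torch.randint(seg_len_range[0], seg_len_range[1] + 1, (1,)).item())
        seg = mel[start:start + seg_len]                       # (L, n_mels), L <= seg_len
        ratio = torch.empty(1).uniform_(*ratio_range).item()   # random time-stretch factor
        new_len = max(1, int(round(seg.size(0) * ratio)))
        seg = F.interpolate(seg.t().unsqueeze(0), size=new_len,
                            mode="linear", align_corners=True)  # (1, n_mels, new_len)
        pieces.append(seg.squeeze(0).t())                       # back to (new_len, n_mels)
        start += seg_len
    return torch.cat(pieces, dim=0)

def style_loss(gen_mel, src_mel):
    """Hypothetical style term: L1 distance between the generated and source
    mel-spectrograms (assumes both have the same number of frames)."""
    return F.l1_loss(gen_mel, src_mel)

def cycle_consistency_loss(speaker_encoder, gen_mel, ref_embedding):
    """Hypothetical speaker-identity cycle term: re-embed the generated speech
    with a frozen speaker encoder and penalize the cosine distance to the
    reference speaker embedding."""
    gen_embedding = speaker_encoder(gen_mel)
    return 1.0 - F.cosine_similarity(gen_embedding, ref_embedding, dim=-1).mean()
```

In training, such terms would typically be weighted and summed with the reconstruction losses of the FastSpeech2 backbone; the weights and the speaker encoder are not specified in the abstract and are left open here.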

References

[1]  Sproat, R.W. and Olive, J.P. (1995) Text-to-Speech Synthesis. AT&T Technical Journal, 74, 35-44.
https://doi.org/10.1002/j.1538-7305.1995.tb00399.x
[2]  Olive, J.P. (1977) Rule Synthesis of Speech from Dyadic Units. IEEE International Conference on Acoustics, Speech, and Signal Processing, Hartford, 9-11 May 1977, 568-570.
https://doi.org/10.1109/ICASSP.1977.1170350
[3]  Zen, H., Tokuda, K. and Black, A.W. (2009) Statistical Parametric Speech Synthesis. Speech Communication, 51, 1039-1064.
https://doi.org/10.1016/j.specom.2009.04.004
[4]  Shen, J., Pang, R., Weiss, R.J., Schuster, M., Jaitly, N., Yang, Z. and Wu, Y. (2018) Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions. 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, 15-20 April 2018, 4779-4783.
https://doi.org/10.1109/ICASSP.2018.8461368
[5]  Wu, Y.C., Hayashi, T., Tobing, P.L., Kobayashi, K. and Toda, T. (2021) Quasi-Periodic WaveNet: An Autoregressive Raw Waveform Generative Model with Pitch-Dependent Dilated Convolution Neural Network. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 29, 1134-1148.
https://doi.org/10.1109/TASLP.2021.3061245
[6]  Prenger, R., Valle, R. and Catanzaro, B. (2019) WaveGlow: A Flow-Based Generative Network for Speech Synthesis. ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, 12-17 May 2019, 3617-3621.
https://doi.org/10.1109/ICASSP.2019.8683143
[7]  Kong, J., Kim, J. and Bae, J. (2020) HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis. Advances in Neural Information Processing Systems, Vol. 33, 17022-17033.
[8]  Ren, Y., Hu, C., Tan, X., Qin, T., Zhao, S., Zhao, Z. and Liu, T.Y. (2020) FastSpeech 2: Fast and High-Quality End-to-End Text to Speech.
[9]  Choi, S., Han, S., Kim, D. and Ha, S. (2020) Attentron: Few-Shot Text-to-Speech Utilizing Attention-Based Variable-Length Embedding. Proceedings Interspeech 2020, Shanghai, 25-29 October 2020, 2007-2011.
https://doi.org/10.21437/Interspeech.2020-2096
[10]  An, X., Soong, F.K. and Xie, L. (2022) Disentangling Style and Speaker Attributes for TTS Style Transfer. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 30, 646-658.
https://doi.org/10.1109/TASLP.2022.3145297
[11]  Zhou, Y., Song, C., Li, X., Zhang, L., Wu, Z., Bian, Y. and Meng, H. (2022) Content-Dependent Fine-Grained Speaker Embedding for Zero-Shot Speaker Adaptation in Text-to-Speech Synthesis. Proceedings Interspeech 2022, Incheon, 18-22 September 2022, 2573-2577.
https://doi.org/10.21437/Interspeech.2022-10054
[12]  Miao, Y. and Metze, F. (2015) On Speaker Adaptation of Long Short-Term Memory Recurrent Neural Networks. Proceedings Interspeech 2015, Dresden, 6-10 September 2015, 1101-1105.
https://doi.org/10.21437/Interspeech.2015-290
[13]  Cooper, E., Lai, C.I., Yasuda, Y., Fang, F., Wang, X., Chen, N. and Yamagishi, J. (2020) Zero-Shot Multi-Speaker Text-to-Speech with State-of-the-Art Neural Speaker Embeddings. ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, 4-8 May 2020, 6184-6188.
https://doi.org/10.1109/ICASSP40776.2020.9054535
[14]  Li, X., Song, C., Li, J., Wu, Z., Jia, J. and Meng, H. (2021) Towards Multi-Scale Style Control for Expressive Speech Synthesis. Proceedings Interspeech 2021, Brno, 30 August-3 September 2021, 4673-4677.
https://doi.org/10.21437/Interspeech.2021-947
[15]  Hsu, W.N., Zhang, Y., Weiss, R.J., et al. (2019) Disentangling Correlated Speaker and Noise for Speech Synthesis via Data Augmentation and Adversarial Factorization. ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, 12-17 May 2019, 5901-5905.
https://doi.org/10.1109/ICASSP.2019.8683561
[16]  Fang, W., Chung, Y.A. and Glass, J. (2019) Towards Transfer Learning for End-to-End Speech Synthesis from Deep Pre-Trained Language Models.
[17]  Qian, K., Zhang, Y., Chang, S., Xiong, J., Gan, C., Cox, D. and Hasegawa-Johnson, M. (2021) Global Prosody Style Transfer without Text Transcriptions. Proceedings of Machine Learning Research, 139, 8650-8660.
[18]  Gibiansky, A., Arik, S., Diamos, G., Miller, J., Peng, K., Ping, W. and Zhou, Y. (2017) Deep Voice 2: Multi-Speaker Neural Text-to-Speech. NIPS’17: Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, 4-9 December 2017, 2966-2974.
[19]  Xue, L., Pan, S., He, L., Xie, L. and Soong, F.K. (2021) Cycle Consistent Network for End-to-End Style Transfer TTS Training. Neural Networks, 140, 223-236.
https://doi.org/10.1016/j.neunet.2021.03.005
[20]  Shi, Y., Bu, H., Xu, X., Zhang, S. and Li, M. (2020) AISHELL-3: A Multi-Speaker Mandarin TTS Corpus and the Baselines. Proceedings Interspeech 2021, Brno, 30 August-3 September 2021, 2756-2760.
https://doi.org/10.21437/Interspeech.2021-755
[21]  Pypinyin.
https://pypi.org/project/pypinyin
[22]  Wan, L., Wang, Q., Papir, A. and Moreno, I.L. (2018) Generalized End-to-End Loss for Speaker Verification. 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, 15-20 April 2018, 4879-4883.
https://doi.org/10.1109/ICASSP.2018.8462665
