Open-Source Boundary-Annotated Qur’an Corpus for Arabic and Phrase Breaks Prediction in Classical and Modern Standard Arabic Text

Keywords: phrase break prediction, prosodic annotation, Tajwid recitation, N-gram and HMM taggers, boundary-annotated and PoS-tagged Qur’an


Abstract:

A phrase break classifier is needed to predict natural prosodic pauses in text to be read out loud by humans or machines. To develop phrase break classifiers, we need a boundary-annotated and part-of-speech tagged corpus. Boundary annotations in English speech corpora are descriptive, delimiting intonation units perceived by the listener; manual annotation must be done by an expert linguist. For Arabic, there are no existing suitable resources. We take a novel approach to phrase break prediction for Arabic, deriving our prosodic annotation scheme from Tajwid (recitation) mark-up in the Qur’an, which we then interpret as additional text-based data for computational analysis. This mark-up is prescriptive, and signifies a widely-used recitation style, and one of seven original styles of transmission. Here we report on version 1.0 of our Boundary-Annotated Qur’an dataset of 77430 words and 8230 sentences, where each word is tagged with prosodic and syntactic information at two coarse-grained levels. We then use this dataset to train, test, and compare two probabilistic taggers (trigram and HMM) for Arabic phrase break prediction, where the task is to predict boundary locations in an unseen test set stripped of boundary annotations by classifying words as breaks or non-breaks. The preponderance of non-breaks in the training data sets a challenging baseline success rate: 85.56%. However, we achieve significant gains in accuracy with a trigram tagger, and significant gains in performance recognition of minority class instances with both taggers via the Balanced Classification Rate metric. This is initial work on a long-term research project to produce annotation schemes, language resources, algorithms, and applications for Classical and Modern Standard Arabic.
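To illustrate why a majority-class baseline can look strong on accuracy while failing on the minority class, the following minimal sketch compares plain accuracy with a Balanced Classification Rate. It assumes the common definition of BCR as the mean of sensitivity (recall on breaks) and specificity (recall on non-breaks); the abstract does not spell out the exact formula used. The labels, function names, and the 20-word toy sequence are hypothetical and are not drawn from the Boundary-Annotated Qur’an dataset.

```python
from typing import Sequence

# Hypothetical label names; the corpus's actual tag set is not given in the abstract.
BREAK, NON_BREAK = "break", "non-break"


def accuracy(gold: Sequence[str], pred: Sequence[str]) -> float:
    """Fraction of words whose predicted label matches the gold label."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and of equal length")
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def balanced_classification_rate(
    gold: Sequence[str], pred: Sequence[str], positive: str = BREAK
) -> float:
    """Mean of sensitivity and specificity (assumed definition of BCR)."""
    pairs = list(zip(gold, pred))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    tn = sum(1 for g, p in pairs if g != positive and p != positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return (sensitivity + specificity) / 2


# Toy gold sequence: 20 words, 3 of them breaks (invented for illustration).
gold = ([NON_BREAK] * 5 + [BREAK]) * 3 + [NON_BREAK] * 2

# Majority-class baseline: label every word as a non-break.
majority_baseline = [NON_BREAK] * len(gold)

# A hypothetical tagger output: misses one break and raises one false alarm.
toy_tagger = list(gold)
toy_tagger[17] = NON_BREAK  # missed break
toy_tagger[0] = BREAK       # false positive

for name, pred in [("baseline", majority_baseline), ("toy tagger", toy_tagger)]:
    print(f"{name}: accuracy={accuracy(gold, pred):.4f}, "
          f"BCR={balanced_classification_rate(gold, pred):.4f}")
```

On this toy sequence the all-non-break baseline reaches 0.85 accuracy but only 0.5 BCR, since it never recognises a break. This is the reason a class-balanced metric is useful alongside accuracy when non-breaks dominate the data.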
