|
PARADOCS: A Language Independent Go-Between for Mating Parallel Documents
PARADOCS : l'entremetteur de documents parallèles indépendant de la langue

Keywords: Parallel corpora, Information Retrieval, Machine Translation

Abstract: Parallel corpora are the bread and butter of a number of machine translation technologies, so considerable effort is regularly spent on acquiring new ones. This task often involves cumbersome manual inspection, and it is difficult to devise a strategy that fits all needs. We therefore developed PARADOCS, a system that aims to do this automatically. Our solution exploits numerical entities in documents in order to pair them. Its main components are a classifier trained to recognize parallel text and an information retrieval engine that controls the search space of candidate pairs. We tested PARADOCS on a number of tasks involving numerous language pairs and report good results.
|
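The idea of pairing documents through their numerical entities can be illustrated with a small sketch. This is not the PARADOCS implementation: the function names (`numeric_entities`, `jaccard`, `pair_documents`), the Jaccard overlap score, and the `threshold` value are all assumptions made for illustration. In particular, the trained classifier and the information retrieval engine mentioned in the abstract are replaced here by an exhaustive comparison and a fixed threshold.

```python
import re

# Assumption: a "numerical entity" is approximated as a run of digits,
# optionally grouped with '.' or ',' separators (e.g. "1,500" or "3.14").
NUMBER_RE = re.compile(r"\d+(?:[.,]\d+)*")


def numeric_entities(text):
    """Return the set of numbers found in text, with separators removed."""
    return {re.sub(r"[.,]", "", match) for match in NUMBER_RE.findall(text)}


def jaccard(a, b):
    """Jaccard overlap of two sets; defined as 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def pair_documents(source_docs, target_docs, threshold=0.5):
    """Pair each source document with its best-scoring target document.

    Both arguments map a document name to its text. A pair is kept only
    if its overlap score reaches the (hypothetical) threshold. Several
    source documents may select the same target document.
    """
    target_nums = {name: numeric_entities(text)
                   for name, text in target_docs.items()}
    pairs = []
    for s_name, s_text in source_docs.items():
        s_nums = numeric_entities(s_text)
        best_name, best_score = None, 0.0
        for t_name, t_nums in target_nums.items():
            score = jaccard(s_nums, t_nums)
            if score > best_score:
                best_name, best_score = t_name, score
        if best_name is not None and best_score >= threshold:
            pairs.append((s_name, best_name, best_score))
    return pairs


if __name__ == "__main__":
    english = {
        "en1": "In 2010, 150 people attended 3 events.",
        "en2": "A budget of 42 million was voted in 1999.",
    }
    french = {
        "fr1": "Un budget de 42 millions a été voté en 1999.",
        "fr2": "En 2010, 150 personnes ont assisté à 3 évènements.",
    }
    for pair in pair_documents(english, french):
        print(pair)
```

Numbers are attractive anchors because they tend to survive translation unchanged, which is what makes the approach language independent. The exhaustive loop above compares every source document with every target document, which is the cost that a retrieval engine restricting the candidate pairs would avoid.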