Blijde Inkomststraat 13 (room 01.05)
Machine translation, alignment, parallel corpora, term candidate extraction
My ongoing PhD research concerns the alignment of divergent structures in translation equivalents, using parse trees enriched with semantic information.
I coordinated the project TermTreffer of the Nederlandse Taalunie for the development of a Dutch term candidate extractor, which will be publicly available in the near future (see INL).
I updated the language pairs of Systran involving Dutch (English-Dutch and vice versa, French-Dutch and vice versa), as part of a collaboration between Systran and the Centre for Computational Linguistics.
Belgisch Staatsblad corpus
A parallel corpus of 5 million French-Dutch sentence pairs in the legislative domain. It was produced by downloading official documents from the online version of the Belgisch Staatsblad/Moniteur belge (the official gazette of the Belgian authorities) and automatically aligning the sentences in documents available in both languages. Of the 5 million sentence pairs, 2.4 million are unique. The corpus covers documents published between 1997 and 2006. See publication: Vanallemeersch (2010).
The corpus is available for research purposes, in the following formats:
- Original documents in HTML
- Pairs of files with French and Dutch sentences (equivalent sentences have the same line number)
- Sentence pairs in TMX format (Translation Memory eXchange)
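As an illustration of how the line-aligned files relate to the TMX format, here is a minimal Python sketch. It is not the code used to build the corpus. The function names, the sample sentences and the TMX header values are assumptions made for the example, and the actual file names and character encoding of the corpus files are not specified here, so the functions work on any iterable of lines.

```python
from itertools import zip_longest
import xml.etree.ElementTree as ET


def read_aligned_pairs(fr_lines, nl_lines):
    """Pair French and Dutch sentences that share the same line number."""
    pairs = []
    for number, (fr, nl) in enumerate(zip_longest(fr_lines, nl_lines), start=1):
        if fr is None or nl is None:
            raise ValueError(f"files differ in length at line {number}")
        pairs.append((fr.rstrip("\n"), nl.rstrip("\n")))
    return pairs


def unique_pairs(pairs):
    """Drop repeated sentence pairs, keeping first occurrences in order."""
    return list(dict.fromkeys(pairs))


def pairs_to_tmx(pairs, src_lang="fr", tgt_lang="nl"):
    """Serialise sentence pairs as a small TMX document (header values are illustrative)."""
    tmx = ET.Element("tmx", version="1.4")
    ET.SubElement(tmx, "header", {
        "creationtool": "example-script",   # hypothetical tool name
        "creationtoolversion": "0.1",
        "segtype": "sentence",
        "o-tmf": "plain-text",
        "adminlang": "en",
        "srclang": src_lang,
        "datatype": "plaintext",
    })
    body = ET.SubElement(tmx, "body")
    for fr, nl in pairs:
        tu = ET.SubElement(body, "tu")      # one translation unit per sentence pair
        for lang, text in ((src_lang, fr), (tgt_lang, nl)):
            tuv = ET.SubElement(tu, "tuv", {"xml:lang": lang})
            ET.SubElement(tuv, "seg").text = text
    return ET.tostring(tmx, encoding="unicode")


if __name__ == "__main__":
    # Invented sample lines; in practice these would be two open file objects.
    french = ["Article 1er.\n", "Le présent arrêté entre en vigueur.\n", "Article 1er.\n"]
    dutch = ["Artikel 1.\n", "Dit besluit treedt in werking.\n", "Artikel 1.\n"]
    pairs = unique_pairs(read_aligned_pairs(french, dutch))
    print(pairs_to_tmx(pairs))
```

Removing duplicates at the level of sentence pairs, rather than single sentences, keeps a French sentence that occurs with two different Dutch translations as two separate pairs.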
Here is a sample of the corpus (a number of documents which were published in 1997):
original documents (French, Dutch, parallel example), pairs of aligned files (French, Dutch, parallel example), TMX file
If you would like to download the whole corpus, send an e-mail to email@example.com.
A tool for aligning words, word fragments and word groups between parallel texts using a bilingual lexicon. The tool is not really fit for distribution: it requires TAWK, a commercial implementation of AWK. See publications: Kockaert et al. (2008), Vanallemeersch and Wermuth (2008).
A tool for statistically extracting bilingual term candidates from a parallel corpus or translation memory, originally designed for a translation agency (see general description). It is based on suffix array comparison, which makes it very fast and able to handle high-volume parallel text (the volume is limited only by the available main memory). If you would like to try it, send an e-mail to firstname.lastname@example.org. See publication: Vanallemeersch and Kockaert (2010).
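To give a rough idea of what suffix-array-based candidate extraction can look like, here is a small Python sketch. It is not the tool's actual algorithm, which is described in the publication. The function names, the boundary marker, the Dice score and the thresholds are all assumptions chosen for illustration, and the suffix array is built naively by sorting, which would be too slow for a large corpus.

```python
from itertools import groupby

BOUNDARY = "<s>"  # hypothetical sentence-boundary marker


def ngram_occurrences(sentences, n):
    """Map each n-gram to the set of sentence ids it occurs in, using a word-level suffix array."""
    tokens, sent_of = [], []
    for sent_id, sentence in enumerate(sentences):
        for word in sentence.lower().split():
            tokens.append(word)
            sent_of.append(sent_id)
        tokens.append(BOUNDARY)
        sent_of.append(sent_id)

    # Naive suffix array: token positions sorted by the suffix starting there.
    suffix_array = sorted(range(len(tokens)), key=lambda i: tokens[i:])

    def prefix(start):
        return tuple(tokens[start:start + n])

    # In a sorted suffix array, suffixes sharing their first n tokens are adjacent,
    # so one pass with groupby collects every occurrence of each n-gram.
    occurrences = {}
    for gram, starts in groupby(suffix_array, key=prefix):
        if len(gram) == n and BOUNDARY not in gram:
            occurrences[gram] = {sent_of[s] for s in starts}
    return occurrences


def term_candidates(src_sentences, tgt_sentences, n_src=2, n_tgt=2,
                    min_sentences=2, min_dice=0.8):
    """Pair source and target n-grams that occur in (nearly) the same sentence pairs."""
    src = ngram_occurrences(src_sentences, n_src)
    tgt = ngram_occurrences(tgt_sentences, n_tgt)
    candidates = []
    for s_gram, s_ids in src.items():
        if len(s_ids) < min_sentences:
            continue
        for t_gram, t_ids in tgt.items():
            if len(t_ids) < min_sentences:
                continue
            # Dice coefficient over the sets of sentence ids (an assumed scoring choice).
            dice = 2 * len(s_ids & t_ids) / (len(s_ids) + len(t_ids))
            if dice >= min_dice:
                candidates.append((" ".join(s_gram), " ".join(t_gram), dice))
    return sorted(candidates, key=lambda c: -c[2])


if __name__ == "__main__":
    # Invented example sentences, aligned by position.
    french = ["le conseil d'état a rendu un avis",
              "l'avis du conseil d'état est publié",
              "le ministre signe l'arrêté"]
    dutch = ["de raad van state heeft een advies gegeven",
             "het advies van de raad van state is bekendgemaakt",
             "de minister ondertekent het besluit"]
    for candidate in term_candidates(french, dutch, n_src=2, n_tgt=3):
        print(candidate)
```

On the invented sentences, the French bigram "conseil d'état" is paired with the Dutch trigram "raad van state", but also with the overlapping trigram "de raad van". A real extractor would need further filtering to separate such competing candidates.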
Vanallemeersch, Tom (2012) 'Parser-independent Semantic Tree Alignment', Proceedings of META-RESEARCH Workshop on Advanced Treebanking, in conjunction with LREC-2012, Istanbul, Turkey. (pdf)
Vanallemeersch, Tom and Hendrik Kockaert (2010) 'Automated detection of inconsistent phraseology translation', Southern African Linguistics and Applied Language Studies 28/3, pp. 283-290. (contact me for reprint)
Vanallemeersch, Tom (2010) 'Tree Alignment through Semantic Role Annotation Projection', Proceedings of Workshop on Annotation and Exploitation of Parallel Corpora (AEPC), Tartu, Estonia, pp. 73-82. (pdf)
Vanallemeersch, Tom (2010) 'Belgisch Staatsblad Corpus: Retrieving French-Dutch Sentences from Official Documents', Proceedings of LREC 7, Malta, pp. 3413-3416. (pdf)
Kockaert, Hendrik J., Tom Vanallemeersch and Frieda Steurs (2008) 'Term-based context extraction in legal terminology: a case study in Belgium', Proceedings of Current Trends in Terminology, International Conference on Terminology, Szombathely, Hungary. (pdf)
Vanallemeersch, Tom and Cornelia Wermuth (2008) 'Linguistics-based word alignment for medical translators', Journal of Specialized Translation (Jostrans), no. 9. (pdf)