Self-Supervised Neural Machine Translation

Dana Ruiter, Cristina España-Bonet & Josef van Genabith

We present a simple new method where an emergent NMT system is used for simultaneously selecting training data and learning internal NMT representations. This is done in a self-supervised way without parallel data, in such a way that both tasks enhance each other during training. The method is language independent, introduces no additional hyper-parameters, and achieves BLEU scores of 29.21 (en2fr) and 27.36 (fr2en) on newstest2014 using English and French Wikipedia data for training.

Query Translation for Cross-lingual Search in the Academic Search Engine PubPsych
Preprint, Presentation

Cristina España-Bonet, Juliane Stiller, Roland Ramthun, Josef van Genabith & Vivien Petras

We describe a lexical resource-based process for query translation of a domain-specific and multilingual academic search engine in psychology, PubPsych. PubPsych queries are diverse in language, with a high proportion of informational queries and technical terminology. We present an approach for translating queries into English, German, French, and Spanish. We build a quadrilingual lexicon with aligned terms in the four languages using MeSH, Wikipedia and Apertium as our main resources. Our results show that, using the quadlexicon together with some simple translation rules, we can automatically translate 85% of translatable tokens in PubPsych queries, with a mean adequacy over all the translatable text of 1.4 when measured on a 3-point scale [0,1,2].
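As a rough illustration of lexicon-based query translation, the sketch below looks up each query token in a tiny quadrilingual lexicon and copies unknown tokens through unchanged. The lexicon entries, the names `LEXICON`, `build_index` and `translate_query`, and the copy-through rule are assumptions made for this example; they are not the PubPsych quadlexicon or its translation rules.

```python
# Toy quadrilingual lexicon: each entry aligns one term across en/de/fr/es.
# The entries are illustrative only, not taken from the actual quadlexicon.
LEXICON = [
    {"en": "depression", "de": "depression", "fr": "dépression", "es": "depresión"},
    {"en": "child", "de": "kind", "fr": "enfant", "es": "niño"},
    {"en": "anxiety", "de": "angst", "fr": "anxiété", "es": "ansiedad"},
]


def build_index(lexicon, source_lang):
    """Map each source-language term to its aligned lexicon entry."""
    return {entry[source_lang]: entry for entry in lexicon}


def translate_query(query, source_lang, target_lang, lexicon=LEXICON):
    """Translate token by token; return the translation and the token coverage."""
    index = build_index(lexicon, source_lang)
    tokens = query.lower().split()
    translated = []
    hits = 0
    for token in tokens:
        entry = index.get(token)
        if entry is None:
            # Assumed fallback: keep tokens that the lexicon does not cover.
            translated.append(token)
        else:
            translated.append(entry[target_lang])
            hits += 1
    coverage = hits / len(tokens) if tokens else 0.0
    return " ".join(translated), coverage
```

Calling `translate_query("Kind Angst Therapie", "de", "en")` translates the two covered tokens and leaves the third as it is, so the coverage value plays the role of the "share of translated tokens" in this toy setting.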

An Empirical Analysis of NMT-Derived Interlingual Embeddings and their Use in Parallel Sentence Identification

Cristina España-Bonet, Ádám Csaba Varga, Alberto Barrón-Cedeño & Josef van Genabith

End-to-end neural machine translation has overtaken statistical machine translation in terms of translation quality for some language pairs, especially those with a large amount of parallel data available. Besides this palpable improvement, neural networks embrace several new properties. A single system can be trained to translate between many languages at almost no additional cost other than training time. Furthermore, internal representations learned by the network serve as a new semantic representation of words (or sentences) which, unlike standard word embeddings, is learned in an essentially bilingual or even multilingual context. In view of these properties, the contribution of the present work is two-fold. First, we systematically study the context vectors, i.e. the output of the encoder, and their prowess as an interlingua representation of a sentence. Their quality and effectiveness are assessed by similarity measures across translations, semantically related, and semantically unrelated sentence pairs. Second, and as an extrinsic evaluation of the first point, we identify parallel sentences in comparable corpora, obtaining F1 = 98.2% on data from a shared task when using only context vectors. F1 reaches 98.9% when complementary similarity measures are used.
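The idea of identifying parallel sentences from vector similarity can be sketched as follows. This is a minimal example under stated assumptions: each sentence is already represented by one fixed-size vector (for instance a pooled encoder output), cosine similarity is the only measure, and the threshold of 0.9 is arbitrary. The function names `cosine` and `find_parallel` are hypothetical, and the abstract does not specify the authors' actual pipeline.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def find_parallel(src_vecs, tgt_vecs, threshold=0.9):
    """Pair each source sentence with its most similar target sentence.

    A pair is kept only if its similarity reaches the (assumed) threshold.
    Returns a list of (source_index, target_index, similarity) tuples.
    """
    pairs = []
    for i, u in enumerate(src_vecs):
        scores = [cosine(u, v) for v in tgt_vecs]
        if not scores:
            continue
        best = max(range(len(scores)), key=scores.__getitem__)
        if scores[best] >= threshold:
            pairs.append((i, best, scores[best]))
    return pairs
```

With toy vectors such as `find_parallel([[1.0, 0.0], [1.0, 1.0]], [[0.9, 0.1]])`, only the first source sentence is paired, because the second one falls below the threshold.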

We discuss the motivation for creating the system, design decisions, connected problems and our solutions, especially concerning multilingual information.


Neural Machine Translation (NMT) depends on large amounts of parallel data, which is scarce for low-resource language pairs and domains. Extracting parallel sentences from a non-parallel source using similarity measures over interlingual representations is our proposed method towards low-resource Machine Translation (MT). In an encoder-decoder NMT system, such representations can be observed, for instance, in the encoder outputs or the word embeddings when the model is trained on multilingual data. To exploit the fact that these representations improve in quality over time, this thesis project aims to develop a framework that extracts parallel data online, i.e., at the same time as the MT model is being trained. In this report, the literature review is presented, followed by an outline of the proposed system and the experiments performed.

Neural machine translation (NMT) is currently considered the state-of-the-art for language pairs with vast amounts of parallel data. In this thesis project, we utilize such systems to provide translations between four languages in the psychology domain, where the biggest challenge is posed by in-domain data scarcity. Therefore, the emphasis of the research is laid on exploring domain adaptation methods in this scenario. We first propose a system for automatically building in-domain adaptation corpora by extracting parallel sentence pairs from comparable articles of Wikipedia. To this end, we use supervised classification and regression methods trained on NMT context vector similarities and complementary textual similarity features. We find that the best method for our purposes is a regression model trained on continuous similarity labels. We rerank the extracted candidates by their similarity feature averages and use the top-N partitions as adaptation corpora. In the second part of the thesis we thoroughly examine multilingual domain adaptation by transfer learning with respect to the adaptation data quality, size, and domain. With clean parallel in-domain adaptation data we achieve significant improvements for most translation directions, including ones with no adaptation data, while the automatically extracted corpora prove beneficial mostly for language pairs with no clean in-domain adaptation set. Particularly in these latter cases, the combination of the two adaptation corpora yields further improvements. We also explore the possibilities of reranking N-best translation lists with in-domain language models and similarity features. We conclude that adapted systems produce candidates that can result in a higher improvement in translation performance than the ones of unadapted models, and that remarkable improvements can be achieved by similarity-based reranking methods.
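The reranking step described above (ordering extracted candidates by the average of their similarity features and keeping the top N) might look like the following sketch. The candidate structure, the feature values, and the name `rerank_top_n` are assumptions for illustration only; the thesis may compute and combine its features differently.

```python
def rerank_top_n(candidates, n):
    """Sort candidates by the mean of their similarity features; keep the top n.

    Each candidate is assumed to be a dict with a non-empty "features" list.
    """
    def feature_average(candidate):
        features = candidate["features"]
        return sum(features) / len(features)

    ranked = sorted(candidates, key=feature_average, reverse=True)
    return ranked[:n]


# Hypothetical extracted sentence pairs with two similarity features each.
candidates = [
    {"pair": ("src a", "tgt a"), "features": [0.2, 0.4]},
    {"pair": ("src b", "tgt b"), "features": [0.9, 0.7]},
    {"pair": ("src c", "tgt c"), "features": [0.5, 0.5]},
]
adaptation_corpus = rerank_top_n(candidates, 2)
```

Here `adaptation_corpus` holds the two candidates with the highest feature averages, in descending order.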

Query Log Analysis of Information Search Behaviour in the Psychological Domain: a Case Study of PubPsych

Jie Yin

This study focused on analysing user search behaviours in the psychological domain and comparing them with those observed for web search engines, other specialised search engines, and former studies in the psychological domain. The specific research questions relate to session length and query length, the language of sessions and queries, language switches, query reformulation patterns and query categories, term frequency and term distribution, and search failure.


This paper describes the UdS-DFKI participation in the multilingual task of the IWSLT Evaluation 2017. Our approach is based on factored multilingual neural translation systems following the small data and zero-shot training conditions. Our systems are designed to fully exploit multilinguality by including factors that increase the number of common elements among languages such as phonetic coarse encodings and synsets, besides shallow part-of-speech tags, stems and lemmas. Document level information is also considered by including the topic of every document. This approach improves a baseline without any additional factor for all the language pairs and even allows beyond-zero-shot translation. That is, the translation from unseen languages is possible thanks to the common elements among languages, especially synsets in our models.

Lump at SemEval-2017 Task 1: Towards an Interlingua Semantic Similarity

Cristina España-Bonet & Alberto Barrón-Cedeño

This is the Lump team participation at SemEval 2017 Task 1 on Semantic Textual Similarity. Our supervised model relies on features which are multilingual or interlingual in nature. We include lexical similarities, cross-language explicit semantic analysis, internal representations of multilingual neural networks and interlingual word embeddings. Our representations allow us to use large datasets in language pairs with many instances to better classify instances in smaller language pairs, avoiding the need to translate into a single language. Hence we can deal with all the languages in the task: Arabic, English, Spanish, and Turkish.