Newsletter 5/2016

Neural Machine Translation

In the last two years, we have witnessed the birth of a new paradigm in machine translation. Having demonstrated significant improvements over the state of the art in tasks such as image and speech recognition, researchers in “deep learning” turned their attention to machine translation. It was not long before machine translation systems based on deep learning (referred to as neural machine translation systems) equaled or surpassed the performance of the previous generation of systems on standard benchmark tasks.

Like earlier statistical models of MT, neural MT (NMT) systems are trained on large quantities of parallel (sentence-aligned) data. Unlike earlier models, however, NMT models do not rely on learning large tables of rules, each of which translates a small segment of text. Instead, an NMT system essentially views translation as a complex mathematical function on matrices, the parameters of which are learnt from the training data. This function is made of various building blocks, and the most popular version can be viewed as an encoder-decoder with attention. This means that the source sentence is first encoded as a sequence of vectors, and then the target sentence (the output) is produced one word at a time by the decoder. To select each output word, the decoder uses the encoded version of the source and the words it has previously generated in the target, and it can selectively attend to portions of the source representation.
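To make the idea of "attending" to the source a little more concrete, here is a minimal sketch of a single attention step in plain Python. It is an illustration only, not the implementation used in any of the systems mentioned in this newsletter: the function names (softmax, attend), the toy vectors, and the choice of a simple dot product as the scoring function are all assumptions made for the example, and real systems learn their vectors and scoring parameters from data.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    highest = max(scores)  # subtracting the maximum keeps exp() numerically stable
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(decoder_state, encoder_states):
    """One attention step: weight each source vector by its relevance.

    Relevance is scored here with a dot product (an assumption for this
    sketch; other scoring functions are possible).
    """
    scores = [
        sum(d * h for d, h in zip(decoder_state, state))
        for state in encoder_states
    ]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    # The context vector is the weighted average of the encoded source vectors.
    context = [
        sum(w * state[i] for w, state in zip(weights, encoder_states))
        for i in range(dim)
    ]
    return weights, context

# Hypothetical toy data: three encoded source words, two dimensions each.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Hypothetical decoder state summarising the target words produced so far.
decoder_state = [0.0, 2.0]

weights, context = attend(decoder_state, encoder_states)
```

In this toy case the second and third source vectors receive equal weight and the first receives less, so the context vector passed to the decoder is dominated by the source positions most similar to the current decoder state.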

The first NMT systems were created separately by researchers at the University of Montreal and at Google, but recent improvements by researchers at the University of Edinburgh have enabled these systems to surpass the state of the art in many language pairs. This was evidenced by the performance of the Edinburgh systems in the shared task on news translation at the WMT16 Conference on Machine Translation.

So why has NMT been so successful, and will it completely replace the current state-of-the-art statistical MT systems? These questions are hard to answer at the moment, especially as the inner workings of NMT systems are very opaque. However, there are some hints in the detailed evaluation results from WMT16. NMT systems appear to do very well on fluency judgements, in particular for languages such as German and Czech, where earlier approaches were not able to represent the long-distance dependencies that these languages require. The complex, many-layered functions used in NMT are able to learn these dependencies, producing much more fluent output. However, there is a downside: current NMT models are not as well anchored in the source sentence as earlier statistical models and occasionally depart from it altogether, producing repetitive nonsense or (perhaps worse) completely fluent but wrong translations. Despite these problems, NMT offers many advantages, such as the ease with which additional information (e.g. context, images, non-linguistic constraints) can be incorporated, and the elegance of its training pipeline. In HimL we see a bright future for NMT, and all the research partners in the project are actively investigating how it could be used in the project.


The three HimL academic partners were all in Berlin in August for the annual meeting of the Association for Computational Linguistics (ACL, the top-ranked conference in this field) and the Conference on Machine Translation (WMT). HimL was well represented in the papers presented, with 2 papers at ACL and a total of 10 at WMT funded wholly or partly by the project. These included several collaborations between HimL partners.

In particular, we presented work on improved models for rule selection in MT, models for determining the scope of negation, and improved handling of verbs and complements in translation into German, and we showed how to improve neural MT using linguistic source annotation. In addition, HimL partners participated in several of the shared tasks at WMT, where they built strong systems for head-to-head comparisons with other research groups.

In the last couple of years, there has been great excitement in the computational linguistics community over the use of “deep learning” or “neural network” methods. These methods have shown significant improvements in performance on many tasks, and have led to a lively debate about what makes such methods work and whether the hype is really justified. At this year’s conference, machine translation was strongly affected by deep learning methods, with state-of-the-art performance demonstrated on well-known benchmark tasks.