HimL Test Sets (himl-test-2017)

Description

The HimL test sets consist of about 3000 sentences of English health information text, translated into Czech, German, Polish and Romanian.

The source sentences were extracted from NHS 24 and Cochrane online content, 50% from each site. The translations were performed by post-editing, with the initial automatic translation created by a Moses phrase-based MT system. During the post-editing we collected detailed information on the edits, which we also intend to distribute.

Each test set is split into test and tuning portions, and further divided by the origin of the source text (NHS 24 or Cochrane).

Licence

The test sets may be used freely for non-commercial research purposes. In detail:

 
  • The Cochrane English source texts are covered by the Cochrane licence, which allows research use only.
  • The NHS 24 English source texts are covered by the Creative Commons BY-NC-SA licence with the following additional clause:

    Copyright © 2010-2016 by NHS 24, all material in this data set is protected by UK and other copyright laws. All rights reserved. Material may be used freely for personal, research, and/or scientific purposes only. No part of this material may be copied, downloaded, stored in a retrieval system, or redistributed for any other purpose without identifying where this information has come from. You may modify the material and create derivative works provided that you identify the original source and state what you have changed. You may use and distribute the modified material and derivative works for use in the training of language technologies, for example machine translation. You may distribute the trained language technologies, provided that the original text in the material cannot be recovered from these.

  • Other data is covered by the Creative Commons BY-NC-SA licence.

Download

The test sets are released in SGML (.sgm) format, as used in the WMT translation tasks. Note that v1 was not released externally.
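
For readers unfamiliar with the format, the following is a minimal Python sketch of how sentences might be pulled out of such a file. It assumes the usual WMT-style layout, in which each document is wrapped in a <doc docid="..."> element and each sentence in a <seg id="N"> element; the exact attributes in the HimL files may differ, and the sample text and the read_sgm helper are hypothetical, written only for illustration.

```python
import html
import re

# Assumed WMT-style markup: <doc docid="..."> blocks containing <seg id="N">text</seg>.
# These patterns are an illustration, not a description of the HimL files themselves.
DOC_RE = re.compile(r'<doc\b([^>]*)>(.*?)</doc>', re.IGNORECASE | re.DOTALL)
SEG_RE = re.compile(r'<seg\b[^>]*\bid="?(\d+)"?[^>]*>(.*?)</seg>',
                    re.IGNORECASE | re.DOTALL)
DOCID_RE = re.compile(r'\bdocid="([^"]*)"', re.IGNORECASE)


def read_sgm(text):
    """Return a list of (docid, seg_id, sentence) tuples from WMT-style sgm text."""
    segments = []
    for doc_match in DOC_RE.finditer(text):
        attrs, body = doc_match.groups()
        docid_match = DOCID_RE.search(attrs)
        docid = docid_match.group(1) if docid_match else None
        for seg_match in SEG_RE.finditer(body):
            seg_id = int(seg_match.group(1))
            # Entities such as &amp; are decoded back to plain characters.
            sentence = html.unescape(seg_match.group(2)).strip()
            segments.append((docid, seg_id, sentence))
    return segments


# Hypothetical sample in the assumed layout (not taken from the HimL data).
sample = """<srcset setid="example" srclang="en">
<doc docid="doc-1" genre="health" origlang="en">
<seg id="1">Drink plenty of fluids &amp; rest.</seg>
<seg id="2">See your GP if symptoms persist.</seg>
</doc>
</srcset>"""

for row in read_sgm(sample):
    print(row)
```

A regular-expression reader is used here because sgm files of this kind are often not well-formed XML, so a strict XML parser may reject them. Source and reference files can then be aligned by document identifier and segment number.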