|Title||Domain Adaptation in Statistical Machine Translation|
|Publication Type||Master's Thesis|
|Year of Publication||2007|
|Number of Pages||52|
|University||The University of Edinburgh|
Human beings can categorize a document by its topic, and computers already perform very well on that task. When translating from one language to another, however, a human translator also uses this knowledge to adapt writing style and vocabulary so that the translation sounds as natural as possible. Statistical Machine Translation (SMT) uses probabilistic machine learning methods to perform translations, but such systems do not perform well in domains that differ from the ones they were trained on. How can the ability to recognize the topic of a document be captured by an SMT system so that it performs better? This thesis explores methodologies for adapting a Statistical Machine Translation system to a specific domain. Two methods are examined. The first mixes translation and language models, weighting them appropriately to improve translation quality. The second uses unsupervised methods to cluster a corpus into sub-corpora, trains on each sub-corpus individually, and decodes each new sentence with the model trained on the cluster that matches the sentence's genre or “domain”. Experiments showed an improvement in translation quality with both methods. Training on a small domain-specific corpus together with a large general one can improve performance when translating documents in the small corpus's domain.
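
The abstract does not say how the models are mixed or how the weights are chosen. As one possible illustration of the first method, the sketch below assumes a linear interpolation of an in-domain and a general unigram language model, with the mixture weight picked by a grid search over perplexity on a small held-out set. The function names, the add-alpha smoothing, and the toy sentences are all hypothetical and are not taken from the thesis.

```python
import math
from collections import Counter


def train_unigram(tokens, vocab, alpha=1.0):
    """Add-alpha smoothed unigram model over a fixed vocabulary (assumed setup)."""
    counts = Counter(tokens)
    total = len(tokens) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}


def interpolate(p_in, p_out, lam):
    """Linear mixture: lam * in-domain + (1 - lam) * general."""
    return {w: lam * p_in[w] + (1.0 - lam) * p_out[w] for w in p_in}


def perplexity(model, tokens):
    """Per-token perplexity of a token sequence under a unigram model."""
    log_prob = sum(math.log(model[w]) for w in tokens)
    return math.exp(-log_prob / len(tokens))


def best_weight(p_in, p_out, dev_tokens, steps=10):
    """Grid-search the mixture weight that minimises held-out perplexity."""
    candidates = [i / steps for i in range(steps + 1)]
    return min(
        candidates,
        key=lambda lam: perplexity(interpolate(p_in, p_out, lam), dev_tokens),
    )


# Toy data, invented purely for illustration.
in_domain = "the patient received the dose".split()
general = "the market fell and the bank closed the branch".split()
dev = "the patient received the dose at the bank".split()
vocab = set(in_domain) | set(general) | set(dev)

p_in = train_unigram(in_domain, vocab)
p_out = train_unigram(general, vocab)
lam = best_weight(p_in, p_out, dev)
mixed = interpolate(p_in, p_out, lam)

print("chosen weight:", lam)
print("dev perplexity (mixed):", perplexity(mixed, dev))
```

Because the candidate weights include 0 and 1, the mixed model is never worse on the held-out set than either component model alone. A real SMT system would apply the same idea to n-gram language models and phrase tables rather than unigram counts.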