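In "How to Learn Natural Language Processing", on the subject of reading the literature, the author of nlpers planned to list "must reads" for a number of areas in later blog posts. That is a good approach and genuinely helpful to beginners, but the nlpers series never got around to covering many areas.
In fact, for many areas and directions of natural language processing, enthusiastic people on the web have already compiled a lot of useful material. Today I will start with the one I know best, statistical machine translation. I once wrote a post here titled "A Few Classic Papers in Statistical Machine Translation", but even within statistical machine translation my own knowledge is limited, and I am in no position to speak for the classic literature of the whole field. Recently, however, I came across David Kauchak's "Statistical Machine Translation Tutorial Reading", written in the winter of 2005, which nicely summarizes the classic papers from the birth of statistical machine translation in the early 1990s up to 2005. I reproduce it here and add a few of my own notes to some of the entries.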
Statistical Machine Translation Tutorial Reading
The following is a list of papers that I think are worth reading for our discussion of machine translation. I've tried to give a short blurb about each of the papers to put them in context. I've included a number of papers that I marked "OPTIONAL" that I think are interesting, but are either supplementary or the material is more or less covered in the other papers.
If anyone would like more information on a particular topic or would like to discuss any of these papers, feel free to e-mail me dkauchakcs ucsd edu
Part 1 (Jan. 19)
A Statistical MT Tutorial Workbook. Kevin Knight. 1999.
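Very good introduction to word-based statistical machine translation. Written in an informal, understandable, tutorial-oriented style.
This is a tutorial by Kevin Knight, a godfather figure of the statistical machine translation community, written for the 1999 Johns Hopkins University summer workshop on machine translation. Without exaggeration, it is the most enjoyable piece of statistical machine translation literature I have read. Perhaps because it is not a formal paper, Kevin Knight even cracks jokes in it! He puts on no airs either, saying right at the start: “I know this looks like a thick workbook, but if you take a day to work through it, you will know almost as much about statistical machine translation as anybody!” In fact, even a newcomer with no notion of statistical machine translation who studies this document carefully from the outset will avoid many detours. As for where the document came from, this passage from Kevin Knight is quite interesting: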
“At the time, I was trying to align English sound sequences with Japanese sound sequences, and I knew that EM could do it for me. It was hard, though. I spent two years reading Brown et al 93. When I finally got it to work, I was pretty fired up, and I told David Yarowsky. He said, “EM is the answer to all the world’s problems.” Wow! I figured everybody should know about it, so I wrote “A Statistical MT Tutorial Workbook”.”
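This passage tells me at least one thing: Brown93 is hard! It made my head spin, and I still feel I have not grasped its essence. But if even Kevin Knight spent two years coming to understand it, why shouldn't I spend some more time on it? Fortunately he wrote this Tutorial Workbook, which showed me that EM is not only the soul of Brown93 but also "the answer to all the world's problems".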
Automating Knowledge Acquisition for Machine Translation. Kevin Knight. 1997.
(OPTIONAL) Another tutorial-oriented paper that steps through how one can learn from bilingual data. Also introduces a number of important concepts for MT.
(Optional) I have not read this one carefully, so I will not comment on it here.
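Foundations of Statistical NLP, chapter 13. Manning and Schutze. 1999.
(OPTIONAL) Must be accessed from UCSD. Overview of statistical MT. Spends a lot of time on sentence and word alignment of bilingual data.
(Optional) Chapter 13 of "Foundations of Statistical Natural Language Processing". It gives an overview of statistical machine translation and concentrates on statistical alignment (sentence alignment and word alignment).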
Foundations of Statistical NLP, chapter 6. Manning and Schutze. 1999.
(OPTIONAL) Must be accessed from UCSD. Discusses n-gram language modeling. Language modeling is crucial for SMT and many other natural language applications. I won't spend much time discussing language modeling, but for those that are interested this is a good introduction.
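(Optional) Chapter 6 of "Foundations of Statistical Natural Language Processing", which covers n-gram language models. It is very well written, and readers who can get hold of it should treat it as required reading!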
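As a rough illustration of the n-gram idea this chapter covers, here is a minimal sketch of a bigram language model with add-one (Laplace) smoothing. The function names, the `<s>` and `</s>` boundary markers, and the toy corpus are all assumptions made for illustration; they are not taken from the book, and real systems use better smoothing methods.

```python
import math
from collections import Counter


def train_bigram_lm(sentences):
    """Count contexts and bigrams over tokenized sentences.

    Each sentence is padded with <s> and </s> (hypothetical marker names).
    """
    contexts, bigrams = Counter(), Counter()
    for tokens in sentences:
        padded = ["<s>"] + tokens + ["</s>"]
        contexts.update(padded[:-1])            # every token that is followed by something
        bigrams.update(zip(padded, padded[1:]))
    # Words that can be predicted: all corpus tokens plus the end marker.
    vocab = {w for s in sentences for w in s} | {"</s>"}
    return contexts, bigrams, vocab


def bigram_prob(prev, word, contexts, bigrams, vocab):
    """Add-one smoothed estimate of P(word | prev)."""
    return (bigrams[(prev, word)] + 1) / (contexts[prev] + len(vocab))


def sentence_logprob(tokens, contexts, bigrams, vocab):
    """Log probability of a whole sentence under the bigram model."""
    padded = ["<s>"] + tokens + ["</s>"]
    return sum(
        math.log(bigram_prob(p, w, contexts, bigrams, vocab))
        for p, w in zip(padded, padded[1:])
    )


# Toy usage: a fluent word order should score higher than a scrambled one.
toy_corpus = [["the", "cat", "sat"], ["the", "dog", "sat"]]
contexts, bigrams, vocab = train_bigram_lm(toy_corpus)
print(sentence_logprob(["the", "cat", "sat"], contexts, bigrams, vocab))
print(sentence_logprob(["sat", "cat", "the"], contexts, bigrams, vocab))
```

In a translation system the language model plays exactly this role: it prefers fluent target-language word orders among the candidates the translation model proposes.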
Part 2 (Jan. 26)
Word models:
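Word-based statistical translation models are no longer mainstream, but they reveal the essence of statistical machine translation and are its foundation. The seminal works here are Brown90 and Brown93. Brown90 is not mentioned in this list; I recommend reading it as well.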
The Mathematics of Statistical Machine Translation: Parameter Estimation. P. F. Brown, S. A. Della Pietra, V. J. Della Pietra and R.L. Mercer. 1993.
(OPTIONAL) All you ever wanted to know about word level models. Describes IBM models 1-5 and parameter estimation for these models. It's about 50 pages and contains a lot of material for the interested reader.
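(Optional) The classic Brown93, which gives a complete description of IBM word models 1-5 and of how to estimate their parameters. I do not know why the author lists it as optional; perhaps he felt it is somewhat difficult for most readers, and it is indeed full of mathematics. Readers who find it hard can first work carefully through "A Statistical MT Tutorial Workbook".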
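To make the role of EM in these word models a little more concrete, here is a minimal sketch of EM training for IBM Model 1, the simplest of the five models. It is a simplified illustration, not the paper's full procedure: the function name, the `<NULL>` token string, the iteration count, and the toy German-English bitext are assumptions chosen for the example.

```python
from collections import defaultdict


def train_ibm_model1(bitext, iterations=10):
    """Estimate lexical translation probabilities t(f | e) with EM.

    bitext: list of (foreign_tokens, english_tokens) sentence pairs.
    Returns a dict-like mapping (f, e) -> t(f | e).
    """
    null = "<NULL>"  # hypothetical name for the empty English word
    f_vocab = {f for fs, _ in bitext for f in fs}
    # Start from a uniform distribution over foreign words.
    t = defaultdict(lambda: 1.0 / len(f_vocab))

    for _ in range(iterations):
        count = defaultdict(float)  # expected count of (f, e) links
        total = defaultdict(float)  # expected count of links from e
        # E-step: distribute each foreign word over the English words
        # of its sentence in proportion to the current t(f | e).
        for fs, es in bitext:
            es_null = [null] + es
            for f in fs:
                z = sum(t[(f, e)] for e in es_null)
                for e in es_null:
                    posterior = t[(f, e)] / z
                    count[(f, e)] += posterior
                    total[e] += posterior
        # M-step: renormalize the expected counts per English word.
        t = defaultdict(float, {(f, e): c / total[e] for (f, e), c in count.items()})
    return t


# Toy usage with a three-sentence bitext.
toy_bitext = [
    ("das Haus".split(), "the house".split()),
    ("das Buch".split(), "the book".split()),
    ("ein Buch".split(), "a book".split()),
]
t = train_ibm_model1(toy_bitext)
for pair in [("das", "the"), ("Haus", "house"), ("Buch", "book")]:
    print(pair, round(t[pair], 3))
```

Even with no dictionary, the co-occurrence pattern alone pushes probability mass toward the intuitive word pairs, which is the core insight the Workbook spends so much time building up to.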
Word model decoding:
I have not read the following two papers on decoding with word models, so I will not comment on them.
Decoding Algorithm in Statistical Machine Translation. Ye-Yi Wang and Alex Waibel. 1997.
Early paper discussing decoding of IBM model 2. The paper provides a fairly good introduction to word-level decoding including multi-stack search (i.e. multiple beams) and rest cost estimation (heuristic functions).
An Efficient A* Search Algorithm for Statistical Machine Translation. Franz Josef Och, Nicola Ueffing, Hermann Ney. 2001.
(OPTIONAL) One of many papers on decoding with word-based SMT. They discuss the basic idea of viewing decoding as state space search and provide one method for doing this. They describe decoding for Model 3 and suggest a few different heuristics that are admissible, leading to few search errors.
Phrase based statistical MT:
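Phrase-based statistical machine translation has been the mainstream approach for years, although its position now looks somewhat precarious.
Statistical Phrase-Based Translation. Philipp Koehn, Franz Josef Och and Daniel Marcu. 2003.
Good, short overview of phrase-based systems. If you want more details, see the paper below.
A paper co-authored by two leading figures of statistical machine translation, Philipp Koehn and Franz Josef Och, during their time at the Information Sciences Institute of the University of Southern California (ISI/USC). It describes the phrase-based statistical machine translation systems that were just emerging at the time. For more details, see the paper below.
The Alignment Template Approach to Statistical Machine Translation. Franz Josef Och and Hermann Ney. 2004.
(OPTIONAL) This is a journal paper discussing one phrase-based statistical system including decoding. This is more or less the system used at ISI and is probably the best current system (though syntax-based systems may beat these in the next few years). Requires acrobat 5 and to be at UCSD.
(Optional) This is a journal article published in "Computational Linguistics" in 2004, so it is naturally weightier and much more detailed.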
Part 3 (Feb. 2)
Phrase-based decoding:
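On decoding for phrase-based statistical machine translation: see the previous paper.
Syntax based translation:
Syntax-based statistical machine translation: please allow me to remain silent!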
What's in a Translation Rule? Galley, Hopkins, Knight and Marcu. 2004.
This is the current system being investigated at ISI and the hope is that these syntax based systems will perform better than phrase based systems. The paper is a bit tough to read since it's a conference paper.
A Syntax-Based Statistical Translation Model. Yamada and Knight. 2001.
(OPTIONAL) Predecessor model to Galley et al., but similar.
Syntax based decoding:
Foundations of Statistical NLP, chapter 12. Manning and Schutze. 1999.
Must be on campus. This is a chapter on parsing (not actually decoding). However, since the above rules are very similar to PCFGs, decoding is very similar to parsing... just with more complications.
A Decoder for Syntax-Based Statistical MT. Kenji Yamada and Kevin Knight. 2001.
(OPTIONAL) Decoder for the above Yamada and Knight model.
Part 4 (Feb. 9)
Discriminative Training:
On discriminative models: one paper is the classic that introduced maximum entropy models into statistical machine translation, and the other is on reranking.
Discriminative Training and Maximum Entropy Models for Statistical Machine Translation. Och and Ney. 2002.
Learning how best to combine the different models (translation model, language model, etc.) using maximum entropy parameter estimation. This line of research is still very important and may be interesting to many of you since it's very machine learningy.
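The model combination described here is usually written as a log-linear model: each candidate translation gets a weighted sum of feature scores, and the highest-scoring candidate wins. The sketch below shows only that scoring step, under stated assumptions: the feature names (`tm`, `lm`), the candidate dictionaries, and every numeric value are made up for illustration, and the learning of the weights, which is the actual subject of the paper, is not shown.

```python
import math


def loglinear_score(features, weights):
    """Weighted sum of feature values: sum_m lambda_m * h_m(e, f)."""
    return sum(weights[name] * value for name, value in features.items())


def best_candidate(candidates, weights):
    """Pick the candidate translation with the highest log-linear score."""
    return max(candidates, key=lambda c: loglinear_score(c["features"], weights))


def posteriors(candidates, weights):
    """Normalize the scores into a distribution over candidates (softmax)."""
    scores = [loglinear_score(c["features"], weights) for c in candidates]
    top = max(scores)                      # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    z = sum(exps)
    return [x / z for x in exps]


# Hypothetical candidates with made-up log-probability features.
candidates = [
    {"text": "candidate A", "features": {"tm": -2.0, "lm": -5.0}},
    {"text": "candidate B", "features": {"tm": -3.0, "lm": -3.0}},
]
equal_weights = {"tm": 1.0, "lm": 1.0}
tm_heavy_weights = {"tm": 2.0, "lm": 0.1}
print(best_candidate(candidates, equal_weights)["text"])
print(best_candidate(candidates, tm_heavy_weights)["text"])
```

Changing the weights changes which candidate wins, which is why estimating them from data matters so much in practice.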
Discriminative Reranking for Machine Translation. Shen, Sarkar and Och. 2004.
(OPTIONAL) Given a ranked output of possible translations from the translation system, this paper uses the perceptron algorithm to learn a reranking of the sentences to improve the top translation.
MT Evaluation:
Automatic evaluation of machine translation: BLEU has almost become the standard!
BLEU: A Method for Automatic Evaluation of Machine Translation. Papineni, Roukos, Ward and Zhu. 2001.
Foundational method for evaluating MT methods and still used currently.
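For readers who want to see what BLEU computes, here is a minimal sketch of corpus-level BLEU: clipped (modified) n-gram precisions combined by a geometric mean, multiplied by a brevity penalty. The function names and the toy sentences are assumptions for illustration, and the sketch simply returns 0 when any n-gram order has no match; for real evaluations an established implementation should be used instead.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Multiset of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references_list, max_n=4):
    """hypotheses: list of token lists; references_list: list of lists of token lists."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, refs in zip(hypotheses, references_list):
        hyp_len += len(hyp)
        # Use the reference length closest to the hypothesis length
        # (ties go to the shorter reference).
        ref_len += min((abs(len(r) - len(hyp)), len(r)) for r in refs)[1]
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp, n)
            max_ref_counts = Counter()
            for r in refs:
                max_ref_counts |= ngrams(r, n)     # element-wise maximum
            clipped = hyp_counts & max_ref_counts   # element-wise minimum
            matches[n - 1] += sum(clipped.values())
            totals[n - 1] += sum(hyp_counts.values())
    if min(matches) == 0:
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity_penalty * math.exp(log_precision)


# Toy usage: repeating a common word is not rewarded because counts are clipped.
hyp = "the the the the the the the".split()
ref = "the cat is on the mat".split()
print(corpus_bleu([hyp], [[ref]], max_n=1))
```

The clipping step is what stops a system from gaming the metric by repeating frequent words, and the brevity penalty stops it from gaming precision by producing very short output.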
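Download links were given for all of the above except "Foundations of Statistical NLP"; I tried them and all could be downloaded directly. As for statistical machine translation literature from 2005 to the present, "Hierarchical Phrase-Based Translation" by Dr. David Chiang (蒋伟) definitely counts as a classic. Readers are welcome to suggest other papers worth reading!
Note: when reposting, please credit the source, "我爱自然语言处理" (52nlp): www.52nlp.cn
Permalink: https://www.52nlp.cn/statistical-machine-translation-tutorial-reading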
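Heh, I happen to be taking part in an MT reading group, and next week we start on syntax-based MT, beginning with "What's in a Translation Rule?".
[Reply]
You are welcome to share your thoughts here when you have time!
[Reply]
The papers we read over the last three weeks: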
What's in a Translation Rule?
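Later known as the GHKM model (after the authors' initials). It seems to be the first paper to discuss extracting tree->string rules. One confusing point is that the naming of this kind of system is a bit messy. The source language is actually a string and the target language a tree, but in training the rules are extracted tree->string and then "reversed" at decoding time. So whenever tree->string or string->tree comes up (tree->tree is unambiguous), take care to check whether it refers to source->target or to the direction of the rules.
Scalable Inference and Training of Context-Rich Syntactic Translation Models. Galley et al., 2006. ACL-06.
This extends the earlier work and, on Arabic-to-English and Chinese-to-English, improves on the GHKM model by 3.63 BLEU points. It is still about 6 BLEU points behind the phrase-based model, though. I believe their follow-up papers improved the language model and gained some more.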
Phrasal cohesion and statistical machine translation. Fox, 2002.
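To be read next week.
[Reply]
Thank you very much!
[Reply]
Hello, I have a question I would like to ask:
I currently want to build a QA system. After consulting some papers I decided to use a Translation Language Model to find similar questions, and the problem I now face is choosing tools. Looking through your blog, what you cover is basically Moses, but it is written in C/C++ and Perl, which I am relatively unfamiliar with (I have never used Perl). It would be ideal if I could move to a combination of Java and Python.
Does 52nlp have any advice on this topic and on the choice of tools?
Sorry to bother you. I am new to this field, so the question may be rather basic. I look forward to your reply, many thanks!
[Reply]
52nlp replied:
April 21, 2010 at 22:36
Sorry, I was busy today and am only replying this late.
You could consider this tool, a Python binding for Moses: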
http://veer66.wordpress.com/2008/12/27/simple-dirty-python-binding-for-moses-smt-decoder/
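I have not tried it, but you can study closely how it calls Moses from Python.
Second, you could consider Joshua, a decoder for hierarchical phrase-based models written in pure Java and developed at Johns Hopkins University: http://joshua.sourceforge.net/Joshua/Welcome.html
This system has now been listed by the ACL 2010 statistical machine translation workshop as a baseline system alongside Moses, and it looks very promising: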
http://www.statmt.org/wmt10/
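[Reply]
ppsly replied:
April 22, 2010 at 00:05
Thank you for your reply.
I realized that all I need is P(f|e), where f is a word in the target language (e.g. an English word) and e is a word in the source language (e.g. Chinese). So is it enough to go only as far as the translation model step? It looks as though "CMU-Cam Language Model Toolkit" + GIZA++ + mkcls + a word segmentation tool would do it, so for now I probably do not need a complete system like Moses.
What does 52nlp think? Sorry for the trouble.
[Reply]
52nlp replied:
April 22, 2010 at 07:34
If you only need it at the word level, GIZA++ + mkcls is enough.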