Natural Language Processing: Probabilistic Language Modeling
Author: Regina Barzilay (MIT, EECS Department, November 15, 2004)
Translator: 52nlp (www.52nlp.cn, January 18, 2009)
3. Evaluation of Language Models
a) Evaluating a Language Model
i. We have n test strings: $s_1, s_2, \ldots, s_n$
ii. Consider the probability of the test strings under our model: $\prod_{i=1}^{n} P(s_i)$
or the log probability: $\log_2 \prod_{i=1}^{n} P(s_i) = \sum_{i=1}^{n} \log_2 P(s_i)$
iii. Perplexity: $\text{Perplexity} = 2^{-x}$, where $x = \frac{1}{W} \sum_{i=1}^{n} \log_2 P(s_i)$
and W is the total number of words in the test data.
iv. Perplexity is a measure of the effective "branching factor"
1. We have a vocabulary V of size N, and the model predicts
P(w) = 1/N for all words in V.
v. What is the perplexity then?
Since every word has probability 1/N, here $x = \frac{1}{W} \sum_{i=1}^{n} \log_2 P(s_i) = \log_2 \frac{1}{N}$,
and so $\text{Perplexity} = 2^{-x} = N$.
vi. Estimate of human performance (Shannon, 1951)
1. Shannon game — humans guess the next letter in a text
2. PP = 142 (1.3 bits/letter), uncased, open vocabulary
vii. Estimate of a trigram language model (Brown et al., 1992)
PP = 790 (1.75 bits/letter), cased, open vocabulary
(Both word-level perplexities follow from the per-letter rates at an average of roughly 5.5 letters per word: 2^{1.3 × 5.5} ≈ 142 and 2^{1.75 × 5.5} ≈ 790.)
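To make the definition concrete, here is a minimal Python sketch (not part of the original lecture) that computes perplexity from per-word log probabilities and checks the "branching factor" claim above: a uniform model over a vocabulary of size N should come out to perplexity exactly N.

```python
import math

def perplexity(test_sentences, word_log2_prob):
    """Perplexity = 2^{-x}, where x = (1/W) * sum_i log2 P(s_i)
    and W is the total number of words in the test data.
    `word_log2_prob(word, history)` returns log2 P(word | history)."""
    total_log2_prob = 0.0
    total_words = 0  # W
    for sentence in test_sentences:
        for i, word in enumerate(sentence):
            total_log2_prob += word_log2_prob(word, sentence[:i])
            total_words += 1
    x = total_log2_prob / total_words
    return 2.0 ** (-x)

# Sanity check: the uniform model P(w) = 1/N from point iv. above.
N = 1000
uniform = lambda word, history: math.log2(1.0 / N)
test_data = [["the", "cat", "sat"], ["dogs", "bark"]]
print(perplexity(test_data, uniform))  # -> 1000.0, i.e. Perplexity = N
```

Summing log probabilities rather than multiplying raw probabilities, as step ii. suggests, also avoids numerical underflow on large test sets.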
To be continued: Part Four
Appendix: MIT page for the course and lecture slides (PDF download):
http://people.csail.mit.edu/regina/6881/
Note: This translation is published in accordance with the MIT OpenCourseWare Creative Commons terms; when reposting, please credit the source "52nlp": www.52nlp.cn
Link to this article:
https://www.52nlp.cn/mit-nlp-third-lesson-probabilistic-language-modeling-third-part/
Reader comment: Support!
admin replied (February 5, 2009, 13:40):
Thanks! Translating is actually tiring work, heh.
I've recently been looking at UMass's MALLET project... and I don't understand it very well. There are many experts here, so I hope someone who has worked with it can give me a brief introduction.
Also, I need P(w|t), the probability of a word appearing under a topic, but MALLET only provides a method for computing the probability of a topic appearing in a document. I found the following approach online:
> On Tue, May 3, 2011 at 10:48 AM, Steven Bethard wrote:
>> TopicInferencer.getSampledDistribution gives you a double[] representing
>> the topic distribution for the entire instance (document). Is there a way
>> to get the per-word topic distributions?
On May 9, 2011, at 6:26 PM, David Mimno wrote:
> It doesn't look like there's an easy way without digging into the
> sampling code. You'd need to add an additional data structure to store
> token-topic distributions, and update it from the "topics" array after
> every sampling round. Once you're done, you'll need a way to pass it
> back -- keeping the token-topic distributions as a state variable and
> adding a callback function to pick up the distribution after every
> document might be the best option.
Thanks for the response. I ended up using the Stanford Topic Modeling Toolbox instead, which supports per-word topic distributions out of the box, but the approach above sounds plausible if I ever end up going back to the Mallet code.
The URL is:
http://article.gmane.org/gmane.comp.ai.mallet.devel/1482/match=getting+topic+distribution
I hope anyone who has made a similar modification can share their experience; I would be very grateful.
[回复]
52nlp 回复:
8 5 月, 2012 at 14:07
这个不太清楚,期待其他同学来回答。
[回复]
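For what it's worth, the bookkeeping David Mimno describes in the quoted thread is easy to sketch outside of MALLET itself. Below is a hypothetical Python illustration, assuming you can observe the token list and each sampling round's topic-assignment array; none of these names correspond to real MALLET APIs, so treat it as a sketch of the idea, not a drop-in patch.

```python
from collections import defaultdict

def estimate_p_w_given_t(tokens, assignment_rounds):
    """tokens: the word type at each token position.
    assignment_rounds: yields, after every Gibbs sampling round, a list of
    topic assignments aligned with `tokens` (Mimno's "topics" array).
    Returns P(w|t) estimated by accumulating counts across rounds."""
    counts = defaultdict(lambda: defaultdict(float))  # counts[topic][word]
    for topics in assignment_rounds:
        for word, topic in zip(tokens, topics):
            counts[topic][word] += 1.0
    # Normalize each topic's accumulated counts into a distribution P(w|t).
    p_w_given_t = {}
    for topic, word_counts in counts.items():
        total = sum(word_counts.values())
        p_w_given_t[topic] = {w: c / total for w, c in word_counts.items()}
    return p_w_given_t

# Toy usage: three token positions, two sampling rounds of assignments.
tokens = ["apple", "banana", "apple"]
rounds = [[0, 1, 0], [0, 1, 1]]
print(estimate_p_w_given_t(tokens, rounds))
# -> {0: {'apple': 1.0}, 1: {'banana': 0.667, 'apple': 0.333}} (approx.)
```

Accumulating over several rounds after burn-in, rather than reading off a single sample, is what makes the estimate stable; that is the point of updating the structure "after every sampling round" in Mimno's suggestion.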