Natural Language Processing: Probabilistic Language Modeling
Author: Regina Barzilay (MIT EECS Department, November 15, 2004)
Translator: 我爱自然语言处理 (www.52nlp.cn, January 18, 2009)

3. Evaluating Language Models
a) Evaluating a Language Model
 i. We have n test strings:
     S_1, S_2, \ldots, S_n
 ii. Consider the probability of these strings under our model:
     \prod_{i=1}^{n} P(S_i)
 or the log probability:
     \log \prod_{i=1}^{n} P(S_i) = \sum_{i=1}^{n} \log P(S_i)
 iii. Perplexity:
     Perplexity = 2^{-x}
   where x = \frac{1}{W} \sum_{i=1}^{n} \log_2 P(S_i)
   and W is the total number of words in the test data.
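  As a concrete illustration (an editorial sketch, not part of the original slides), the computation can be written in a few lines of Python. The function name and the toy uniform model below are assumptions for the example; the per-sentence log-probabilities are taken in base 2 so that 2^{-x} is the perplexity:

    import math

    def perplexity(sentence_log2probs, total_words):
        """Perplexity from per-sentence log2-probabilities under some model."""
        x = sum(sentence_log2probs) / total_words   # average log2-probability per word
        return 2 ** (-x)

    # Toy check: a uniform model over a 10,000-word vocabulary assigns
    # log2 P = -log2(10000) to every word, so the perplexity must be 10,000.
    vocab_size = 10_000
    test_sentences = [["the", "cat", "sat"], ["on", "the", "mat", "today"]]
    log2probs = [len(s) * math.log2(1.0 / vocab_size) for s in test_sentences]
    W = sum(len(s) for s in test_sentences)
    print(perplexity(log2probs, W))   # 10000.0 (up to floating-point error)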
 iv. Perplexity is a measure of the effective "branching factor"
  1. We have a vocabulary V of size N, and the model predicts:
   P(w) = 1/N for all words w in V.
 v. What is the perplexity of this model?
      Perplexity = 2^{-x}
   where x = \log_2 \frac{1}{N}
   so Perplexity = N
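  To make the intermediate step explicit (a short derivation, assuming every test sentence is scored by this uniform model, so that P(S_i) = (1/N)^{|S_i|}):

     x = \frac{1}{W} \sum_{i=1}^{n} \log_2 P(S_i)
       = \frac{1}{W} \sum_{i=1}^{n} |S_i| \log_2 \frac{1}{N}
       = \log_2 \frac{1}{N} = -\log_2 N

     \text{Perplexity} = 2^{-x} = 2^{\log_2 N} = N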
 vi. An estimate of human performance (Shannon, 1951)
  1. The Shannon game: humans guess the next letter in a piece of text
  2. PP = 142 (1.3 bits/letter), uncased, open vocabulary
 vii. An estimate for a trigram language model (Brown et al., 1992)
   PP = 790 (1.75 bits/letter), cased, open vocabulary
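  A side note on how the two ways of reporting relate (an editorial gloss; the slides give both numbers without the conversion): assuming an average English word length of roughly 5.5 characters, including the following space, the word-level perplexities and the per-letter rates are consistent, since 2^{1.3 \times 5.5} = 2^{7.15} \approx 142 and 2^{1.75 \times 5.5} = 2^{9.625} \approx 790.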

To be continued: Part 4

Appendix: the MIT page for the course and the PDF slides:
   http://people.csail.mit.edu/regina/6881/

Note: This translation is published in accordance with the MIT OpenCourseWare Creative Commons terms. When reposting, please credit the source "我爱自然语言处理": www.52nlp.cn

Permalink:
https://www.52nlp.cn/mit-nlp-third-lesson-probabilistic-language-modeling-third-part/

Author: 52nlp

4 comments on "MIT NLP Lecture 3: Probabilistic Language Modeling (Part 3)"
  1. I have recently been looking at the MALLET toolkit, but I do not understand it very
     well. There are many experts here, so if anyone has used it, I would appreciate a
     brief introduction to the project.
     In addition, I need P(w|t), the probability of a word given a topic, but MALLET only
     provides a method for computing the probability of a topic given a document. I found
     the following exchange online:

     > On Tue, May 3, 2011 at 10:48 AM, Steven Bethard wrote:
     >> TopicInferencer.getSampledDistribution gives you a double[] representing
     >> the topic distribution for the entire instance (document). Is there a way
     >> to get the per-word topic distributions?

     On May 9, 2011, at 6:26 PM, David Mimno wrote:
     > It doesn't look like there's an easy way without digging into the
     > sampling code. You'd need to add an additional data structure to store
     > token-topic distributions, and update it from the "topics" array after
     > every sampling round. Once you're done, you'll need a way to pass it
     > back -- keeping the token-topic distributions as a state variable and
     > adding a callback function to pick up the distribution after every
     > document might be the best option.

     Thanks for the response. I ended up using the Stanford Topic Modeling
     Toolbox instead, which supports per-word topic distributions out of the box,
     but the approach above sounds plausible if I ever end up going back to the
     Mallet code.

     The URL is:
     http://article.gmane.org/gmane.comp.ai.mallet.devel/1482/match=getting+topic+distribution
     If anyone has made this kind of modification, I would be very grateful if you could
     share your experience.
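     For readers with the same question, here is a minimal, toolkit-agnostic sketch of
     estimating P(w|t) from sampled token-topic assignments. The arrays and names below
     are illustrative assumptions, not MALLET's API; in MALLET itself the per-token
     assignments would first have to be exposed along the lines described above.

        import numpy as np

        # Hypothetical inputs: for each token, its word id and its sampled topic id.
        # In a Gibbs-sampled topic model these come from the final (or an averaged)
        # sampling state.
        word_ids  = np.array([0, 2, 1, 0, 3, 2, 2])   # token -> word id, vocabulary size V
        topic_ids = np.array([1, 0, 1, 1, 0, 0, 1])   # token -> topic id, T topics
        V, T = 4, 2
        beta = 0.01                                    # symmetric Dirichlet smoothing (assumed)

        # Count how often each word type was assigned to each topic.
        counts = np.zeros((T, V))
        np.add.at(counts, (topic_ids, word_ids), 1)

        # P(w | t): normalize each topic's smoothed word counts.
        p_w_given_t = (counts + beta) / (counts + beta).sum(axis=1, keepdims=True)
        print(p_w_given_t)   # row t is the word distribution of topic t; rows sum to 1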


     52nlp replied:

     I am not sure about this one; I hope another reader can answer.

