Veteran Colleague Blasts Liwei's "Myths" Series for Upsetting the NLP Order; Liwei Stands His Ground

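G is a senior colleague and professional friend of many years, and he and Liwei often trade views on matters inside and outside the field. Both are veterans, so heated exchanges that throw off sparks are common.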

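Yesterday I emailed him the three posts in the "Myths" series, and he phoned right away: "Good grief, you just can't wait to stir up chaos. Reading 'Myths' made me furious. You are rejecting Chinese NLP wholesale, with alarmist, wildly subversive claims. It is extreme, seriously extreme, and misleading. I know what you are saying and what you want to say, but for newcomers just entering the field your 'Myths' are misleading."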

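Hearing that he was furious, I got quite excited: "Criticize all you like, throw your bricks. I stand by what I said. Every argument comes naturally out of years of reflection and experience, and it will hold up. As for young people, they have already been misled plenty by all sorts of 'myths'. At most I am overcorrecting, pushing back against those myths; that is certainly not misleading."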

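Below are edited excerpts of the criticism and my responses, kept for the record. Insiders will see the substance, outsiders can enjoy the show; all are welcome to watch.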

2011/12/28 G

The third one is more to the point. Strictly speaking, it cannot count as a myth; it is more like "superfluous words" that hold true everywhere.

Frankly, the first two are clickbait (标题党) to me. Most of the "supporting evidence" is wrong.

Well, I think I know what you were trying to say. But to most people I believe you are misleading.

No, I was not misleading; this is deliberate overcorrection (矫枉过正).

At least I think you should explain a bit more, and pick your examples carefully.

Take one example. Tokenizing "People's Republic of China" is routinely done by regular expressions (rule-based), using capitalization, the apostrophe, and the preposition as symbolic evidence, NOT by using a dictionary.
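A minimal sketch of the kind of rule G describes, assuming a deliberately simple pattern: capitalized words, an optional possessive `'s`, and the preposition "of" as a joiner. The names `CAP`, `NAME_RE`, and `find_name_chunks` are hypothetical and only illustrate the idea; a production tokenizer would need many more cases.

```python
import re

# Hypothetical building block: one capitalized word, optionally followed by 's.
CAP = r"[A-Z][a-z]+(?:'s)?"

# A run of capitalized words, where "of" may sit between two of them.
NAME_RE = re.compile(rf"{CAP}(?:\s+(?:of\s+)?{CAP})*")


def find_name_chunks(text):
    """Return multi-word capitalized chunks found by the pattern alone (no dictionary)."""
    return [m.group(0) for m in NAME_RE.finditer(text) if " " in m.group(0)]


print(find_name_chunks("He visited the People's Republic of China last year."))
```

Only surface evidence is used here, so a capitalized sentence-initial word directly before a name would be absorbed into the chunk; that is a known limit of this toy rule.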

That is not the point. Yes, maybe I should have chosen a non-name example ("interest rate", 利率, is a better example for both Chinese and English), but the point is that closed compounds can (and should) be looked up in lexicons rather than handled by rules.
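The lexicon-lookup view can be sketched as a greedy longest-match pass that treats English words and Chinese characters the same way. This is an illustrative assumption, not the author's actual system: the toy `LEXICON`, the `longest_match` function, and the `max_len` limit are all hypothetical.

```python
# Toy lexicon; the entries are illustrative, not taken from a real dictionary.
LEXICON = {"interest rate", "利率", "上升"}


def longest_match(units, lexicon, joiner, max_len=4):
    """Greedily group units (words or characters) into the longest lexicon entries.

    A single unit is always accepted as a fallback, so the loop always advances.
    """
    out = []
    i = 0
    while i < len(units):
        for n in range(min(max_len, len(units) - i), 0, -1):
            candidate = joiner.join(units[i:i + n])
            if n == 1 or candidate in lexicon:
                out.append(candidate)
                i += n
                break
    return out


# English: units are space-separated words, joined back with a space.
print(longest_match("the interest rate rose".split(), LEXICON, " "))
# Chinese: units are characters, joined with the empty string.
print(longest_match(list("利率上升"), LEXICON, ""))
```

The same function handles both inputs; only the choice of unit and joiner differs, which is the parallel the "interest rate" / 利率 example points at.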

What you are referring to, I guess, is named entity recognition. Even there, Chinese and English could be significantly different.

No, I was not talking about NE; that is a special topic by itself. I consider it a low-level, solved problem, and do not plan to reinvent the wheel. I will just pick an off-the-shelf API for NE and tolerate its imperfections.

I wouldn't be surprised if you don't do tokenization, as you can well fold that into overall parsing. But for applications like Baidu search, tokenization is the end of text processing and is a must-have.

Chunking words into phrases (syntax) is by nature no different from chunking morphemes (characters) into words (morphology). Parsing with no "word segmentation" is thus possible.

In existing apps like search engines, no big players are using parsing and deep NLP yet (they will; it is only a matter of time), so lexical features from large lexicons may not be necessary. As a result, they may prefer lightweight tokenization without lexicons. That is a different case from what I am addressing here. The NLP discussed in my post series assumes the need to develop a parser as its core.

Your attack on tagging is also misleading. You basically say that if a word has two categories, just tag it with both, without further processing. That is tagging already.

That is not (POS) tagging in the traditional sense: traditional tagging is deterministic and relies on context. Assigning lexical features through lexical lookup is not tagging in that sense. If you want to change the definition, that is off topic.

What others do is merely one step further, saying tag-a has a 90% chance of being correct while tag-b has 10%. I built a rule-based parser before and found that really helpful (at least in terms of speed). I try the high-probability tag first. If it makes sense, I just take it. If not, I come back and try the other. Let me know if you don't do something like that.
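The try-the-likely-tag-first strategy G describes can be sketched as follows. Everything here is a hypothetical illustration: the `RANKED_TAGS` table, its probabilities, and the `accepts` callback standing in for "does the parse make sense with this tag" are assumptions, not part of any parser mentioned in the exchange.

```python
# Hypothetical prior probabilities per tag; the numbers are invented for illustration.
RANKED_TAGS = {
    "learning": [("N", 0.9), ("V", 0.1)],
}


def first_acceptable_tag(word, accepts):
    """Try candidate tags from most to least probable; return the first one accepted.

    `accepts(word, tag)` is a stand-in for the parser's check that the tag
    leads to a sensible analysis. Returns None if no candidate is accepted.
    """
    candidates = sorted(RANKED_TAGS.get(word, []), key=lambda pair: -pair[1])
    for tag, _prob in candidates:
        if accepts(word, tag):
            return tag
    return None


# If the likely tag works, it is taken immediately.
print(first_acceptable_tag("learning", lambda w, t: True))
# If the likely tag fails, the procedure falls back to the next candidate.
print(first_acceptable_tag("learning", lambda w, t: t == "V"))
```

The speed benefit G mentions comes from the common case: when the most probable tag is accepted, the remaining candidates are never examined.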

Parsing can go a long way without context-based POS tagging. But note that at the end I proposed a "one-and-a-half-step" (一步半) approach, i.e. I can do limited, simple context-based tagging for convenience's sake. The later development is adaptive and in principle does not rely on tagging.

Note that here I am not talking about 兼语词, which is essentially another unique tag with its own properties. I know this is not 100% accurate, but I see it in Chinese as something like the gerund (动名词) in English.

In fact, I do not see that as 兼语词, but for the sake of explaining the phenomenon I used that term (logically equivalent, but elaborating on that clearly would require too much space). In my actual system, 学习 ("study") is a verb, only a verb (or logical verb).

This then touches on grammar theory. While we may not really need a new theory, we do need a working theory that is consistent. You may have a good one in mind, but for most people that is not the case. For example, I see you are deeply influenced by head words (中心词) and dependency. But not everyone is even aware of that, let alone agrees with it. So far there is no serious competition, as there is really no large-scale success story yet. We need to wait and see which school (学派) eventually casts the bigger shadow.

Good to be criticized. But I had a point to make there.

【Related posts】

Myths of Chinese Processing, Part 1: Word Segmentation Is Unique to Chinese 2011-12-28

Myths of Chinese Processing, Part 2: POS Tagging Is a Prerequisite for Syntactic Parsing 2011-12-28

Myths of Chinese NLP, Part 3: Major Progress in Chinese Processing Awaits a Theoretical Breakthrough in Chinese Grammar 2011-12-29

Citation URL for this post: http://blog.sciencenet.cn/home.php?mod=space&uid=362400&do=blog&id=523458

Author: liwei999