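How to Compute the Similarity Between Two Documents (Part 3)

By 52nlp

June 7, 2013 #Deep Learning, #Deep Learning open course, #gensim, #nltk, #NLTK Chinese information processing, #NLTK applications, #python, #Natural Language Processing with Python, #topic models, #Brown Corpus, #text similarity, #document similarity, #machine learning, #machine learning open course, #probabilistic graphical models, #probabilistic graphical models open course, #neural networks open course, #natural language processing, #Coursegraph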

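In the previous part we walked through gensim with a simple example. In this part we use real data from Coursegraph (课程图谱) to validate and improve on it, and we use NLTK to preprocess the courses' English text.

III. Coursegraph Experiments

1. Data preparation
So that everyone can follow along, I have prepared a set of Coursera course data, which can be downloaded here: coursera_corpus (Baidu Netdisk link: http://t.cn/RhjgPkv, password: oppc). It contains 379 courses, one per line, and each line has three tab-separated fields: course name\tcourse summary\tcourse details. HTML tags have already been stripped. The sample below shows only the course names: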

Writing II: Rhetorical Composing
Genetics and Society: A Course for Educators
General Game Playing
Genes and the Human Condition (From Behavior to Biotechnology)
A Brief History of Humankind
New Models of Business in Society
Analyse Numérique pour Ingénieurs
Evolution: A Course for Educators
Coding the Matrix: Linear Algebra through Computer Science Applications
The Dynamic Earth: A Course for Educators
...

OK, first let's open Python (the sessions below use Python 2) and load the data:

>>> courses = [line.strip() for line in file('coursera_corpus')]
>>> courses_name = [course.split('\t')[0] for course in courses]
>>> print courses_name[0:10]
['Writing II: Rhetorical Composing', 'Genetics and Society: A Course for Educators', 'General Game Playing', 'Genes and the Human Condition (From Behavior to Biotechnology)', 'A Brief History of Humankind', 'New Models of Business in Society', 'Analyse Num\xc3\xa9rique pour Ing\xc3\xa9nieurs', 'Evolution: A Course for Educators', 'Coding the Matrix: Linear Algebra through Computer Science Applications', 'The Dynamic Earth: A Course for Educators']
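
The session above relies on the Python 2 built-in file(), which no longer exists in Python 3. As a minimal sketch of the same loading step in Python 3, assuming the file is UTF-8 encoded: load_courses is a hypothetical helper name, and the summary and details fields in the stand-in file are made up for illustration.

```python
import os
import tempfile


def load_courses(path):
    """Read one course per line; each line is name, summary, details (tab-separated)."""
    with open(path, encoding="utf-8") as handle:
        # Drop only the trailing newline and skip blank lines.
        return [line.rstrip("\n") for line in handle if line.strip()]


# Tiny stand-in for coursera_corpus: two rows in the same three-field layout.
sample = (
    "General Game Playing\tshort summary\tlonger details\n"
    "Analyse Numérique pour Ingénieurs\trésumé\tdétails\n"
)
with tempfile.NamedTemporaryFile(
    "w", suffix=".tsv", delete=False, encoding="utf-8"
) as tmp:
    tmp.write(sample)

courses = load_courses(tmp.name)
courses_name = [course.split("\t")[0] for course in courses]
os.remove(tmp.name)
print(courses_name)
```

Because the file is decoded on read, the accented course name comes back as ordinary text rather than the escaped bytes shown in the Python 2 output.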

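2. Bringing in NLTK
NLTK is a well-known Python toolkit for natural language processing. It mainly targets English, but since the course data Coursegraph currently handles is mostly English, that is enough. NLTK comes with documentation, corpora, and a book, and someone in China has even generously translated the book into Chinese: 用Python进行自然语言处理 (Natural Language Processing with Python). Sometimes I can't help sighing: people who do English NLP really have it good.

First, as usual, install NLTK. The NLTK site explains in detail how to install it on Mac, Linux, and Windows: http://nltk.org/install.html . The main thing is to install the dependencies NumPy and PyYAML first; the rest is straightforward. Once NLTK is installed, run import nltk to test it. If that works, there is one more very important step: downloading the corpora that NLTK officially provides: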

>>> import nltk
>>> nltk.download()

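A graphical window pops up showing two collections to download, all-corpora and book. It is best to select and download both. This takes a while, and NLTK is only really usable on your machine once the corpora are in place. You can test it with the Brown Corpus: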

>>> from nltk.corpus import brown
>>> brown.readme()
'BROWN CORPUS\n\nA Standard Corpus of Present-Day Edited American\nEnglish, for use with Digital Computers.\n\nby W. N. Francis and H. Kucera (1964)\nDepartment of Linguistics, Brown University\nProvidence, Rhode Island, USA\n\nRevised 1971, Revised and Amplified 1979\n\nhttp://www.hit.uib.no/icame/brown/bcm.html\n\nDistributed with the permission of the copyright holder,\nredistribution permitted.\n'
>>> brown.words()[0:10]
['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an', 'investigation', 'of']
>>> brown.tagged_words()[0:10]
[('The', 'AT'), ('Fulton', 'NP-TL'), ('County', 'NN-TL'), ('Grand', 'JJ-TL'), ('Jury', 'NN-TL'), ('said', 'VBD'), ('Friday', 'NR'), ('an', 'AT'), ('investigation', 'NN'), ('of', 'IN')]
>>> len(brown.words())
1161192

Now let's process the course data from earlier. If we only lowercase the words of each document, as we did before, we get the following:

>>> texts_lower = [[word for word in document.lower().split()] for document in courses]
>>> print texts_lower[0]
['writing', 'ii:', 'rhetorical', 'composing', 'rhetorical', 'composing', 'engages', 'you', 'in', 'a', 'series', 'of', 'interactive', 'reading,', 'research,', 'and', 'composing', 'activities', 'along', 'with', 'assignments', 'designed', 'to', 'help', 'you', 'become', 'more', 'effective', 'consumers', 'and', 'producers', 'of', 'alphabetic,', 'visual', 'and', 'multimodal', 'texts.', 'join', 'us', 'to', 'become', 'more', 'effective', 'writers...', 'and', 'better', 'citizens.', 'rhetorical', 'composing', 'is', 'a', 'course', 'where', 'writers', 'exchange', 'words,', 'ideas,', 'talents,', 'and', 'support.', 'you', 'will', 'be', 'introduced', 'to', 'a', ...

Notice that many punctuation marks are still attached to words, so we bring in NLTK's word_tokenize function and process the data with it:

>>> from nltk.tokenize import word_tokenize
>>> texts_tokenized = [[word.lower() for word in word_tokenize(document.decode('utf-8'))] for document in courses]
>>> print texts_tokenized[0]
['writing', 'ii', ':', 'rhetorical', 'composing', 'rhetorical', 'composing', 'engages', 'you', 'in', 'a', 'series', 'of', 'interactive', 'reading', ',', 'research', ',', 'and', 'composing', 'activities', 'along', 'with', 'assignments', 'designed', 'to', 'help', 'you', 'become', 'more', 'effective', 'consumers', 'and', 'producers', 'of', 'alphabetic', ',', 'visual', 'and', 'multimodal', 'texts.', 'join', 'us', 'to', 'become', 'more', 'effective', 'writers', '...', 'and', 'better', 'citizens.', 'rhetorical', 'composing', 'is', 'a', 'course', 'where', 'writers', 'exchange', 'words', ',', 'ideas', ',', 'talents', ',', 'and', 'support.', 'you', 'will', 'be', 'introduced', 'to', 'a', ...
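
In the output above, a few tokens such as 'texts.' and 'citizens.' still carry a trailing period. One possible cause, stated here as an assumption rather than a confirmed diagnosis, is that the tokenizer in that NLTK release expects one sentence at a time. As a hedged alternative that needs only the standard library, here is a small regex tokenizer in Python 3; it is a swapped-in technique, not what NLTK does internally, and TOKEN_PATTERN and simple_tokenize are hypothetical names.

```python
import re

# A word (optionally with internal hyphens or apostrophes), an ellipsis,
# or any single non-space, non-word character.
TOKEN_PATTERN = re.compile(r"\w+(?:[-']\w+)*|\.\.\.|[^\w\s]")


def simple_tokenize(text):
    """Lowercase the text's tokens, splitting punctuation away from words."""
    return [token.lower() for token in TOKEN_PATTERN.findall(text)]


sentence = (
    "Join us to become more effective writers... and better citizens. "
    "Rhetorical Composing is a course."
)
tokens = simple_tokenize(sentence)
print(tokens)
```

With this pattern the period after "citizens" becomes its own token, so the later punctuation filter can remove it.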

After tokenizing the courses' English text, we need to remove stop words. Fortunately NLTK ships a list of English stop words:

>>> from nltk.corpus import stopwords
>>> english_stopwords = stopwords.words('english')
>>> print english_stopwords
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'now']
>>> len(english_stopwords)
127

There are 127 stop words in total. We first filter them out of the course corpus:
>>> texts_filtered_stopwords = [[word for word in document if not word in english_stopwords] for document in texts_tokenized]
>>> print texts_filtered_stopwords[0]
['writing', 'ii', ':', 'rhetorical', 'composing', 'rhetorical', 'composing', 'engages', 'series', 'interactive', 'reading', ',', 'research', ',', 'composing', 'activities', 'along', 'assignments', 'designed', 'help', 'become', 'effective', 'consumers', 'producers', 'alphabetic', ',', 'visual', 'multimodal', 'texts.', 'join', 'us', 'become', 'effective', 'writers', '...', 'better', 'citizens.', 'rhetorical', 'composing', 'course', 'writers', 'exchange', 'words', ',', 'ideas', ',', 'talents', ',', 'support.', 'introduced', 'variety', 'rhetorical', 'concepts\xe2\x80\x94that', ',', 'ideas', 'techniques', 'inform', 'persuade', 'audiences\xe2\x80\x94that', 'help', 'become', 'effective', 'consumer', 'producer', 'written', ',', 'visual', ',', 'multimodal', 'texts.', 'class', 'includes', 'short', 'videos', ',', 'demonstrations', ',', 'activities.', 'envision', 'rhetorical', 'composing', 'learning', 'community', 'includes', 'enrolled', 'course', 'instructors.', 'bring', 'expertise', 'writing', ',', 'rhetoric', 'course', 'design', ',', 'designed', 'assignments', 'course', 'infrastructure', 'help', 'share', 'experiences', 'writers', ',', 'students', ',', 'professionals', 'us.', 'collaborations', 'facilitated', 'wex', ',', 'writers', 'exchange', ',', 'place', 'exchange', 'work', 'feedback']

The stop words are gone, but the punctuation is still there. That is easy to fix: we first define a list of punctuation marks:
>>> english_punctuations = [',', '.', ':', ';', '?', '(', ')', '[', ']', '&', '!', '*', '@', '#', '$', '%']

Then we filter out these punctuation marks:
>>> texts_filtered = [[word for word in document if not word in english_punctuations] for document in texts_filtered_stopwords]
>>> print texts_filtered[0]
['writing', 'ii', 'rhetorical', 'composing', 'rhetorical', 'composing', 'engages', 'series', 'interactive', 'reading', 'research', 'composing', 'activities', 'along', 'assignments', 'designed', 'help', 'become', 'effective', 'consumers', 'producers', 'alphabetic', 'visual', 'multimodal', 'texts.', 'join', 'us', 'become', 'effective', 'writers', '...', 'better', 'citizens.', 'rhetorical', 'composing', 'course', 'writers', 'exchange', 'words', 'ideas', 'talents', 'support.', 'introduced', 'variety', 'rhetorical', 'concepts\xe2\x80\x94that', 'ideas', 'techniques', 'inform', 'persuade', 'audiences\xe2\x80\x94that', 'help', 'become', 'effective', 'consumer', 'producer', 'written', 'visual', 'multimodal', 'texts.', 'class', 'includes', 'short', 'videos', 'demonstrations', 'activities.', 'envision', 'rhetorical', 'composing', 'learning', 'community', 'includes', 'enrolled', 'course', 'instructors.', 'bring', 'expertise', 'writing', 'rhetoric', 'course', 'design', 'designed', 'assignments', 'course', 'infrastructure', 'help', 'share', 'experiences', 'writers', 'students', 'professionals', 'us.', 'collaborations', 'facilitated', 'wex', 'writers', 'exchange', 'place', 'exchange', 'work', 'feedback']
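
The standard library's string.punctuation constant is an alternative to the hand-written list. The sketch below (Python 3) also drops tokens made up entirely of punctuation, such as '...', and strips punctuation stuck to the ends of words; both behaviours go beyond the filter above and are my own additions, and is_punctuation and strip_punctuation are hypothetical helper names.

```python
import string

PUNCTUATION = set(string.punctuation)


def is_punctuation(token):
    """True when every character of the token is an ASCII punctuation mark."""
    return bool(token) and all(char in PUNCTUATION for char in token)


def strip_punctuation(token):
    """Remove ASCII punctuation from both ends of the token."""
    return token.strip(string.punctuation)


tokens = ["writing", "ii", ":", "texts.", "writers", "...", "citizens.", "support."]
cleaned = [strip_punctuation(t) for t in tokens if not is_punctuation(t)]
print(cleaned)
```

Note that string.punctuation covers ASCII only, so a token such as the one containing an em dash in the output above would need separate handling.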

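Going a step further, we stem these English words. NLTK offers several stemmer interfaces to choose from; see http://nltk.org/api/nltk.stem.html . The options include well-known English stemmers such as the Lancaster Stemmer and the Porter Stemmer. Here we use LancasterStemmer: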

>>> from nltk.stem.lancaster import LancasterStemmer
>>> st = LancasterStemmer()
>>> st.stem('stemmed')
'stem'
>>> st.stem('stemming')
'stem'
>>> st.stem('stemmer')
'stem'
>>> st.stem('running')
'run'
>>> st.stem('maximum')
'maxim'
>>> st.stem('presumably')
'presum'

Let's call this interface to process the course data above:
>>> texts_stemmed = [[st.stem(word) for word in document] for document in texts_filtered]
>>> print texts_stemmed[0]
['writ', 'ii', 'rhet', 'compos', 'rhet', 'compos', 'eng', 'sery', 'interact', 'read', 'research', 'compos', 'act', 'along', 'assign', 'design', 'help', 'becom', 'effect', 'consum', 'produc', 'alphabet', 'vis', 'multimod', 'texts.', 'join', 'us', 'becom', 'effect', 'writ', '...', 'bet', 'citizens.', 'rhet', 'compos', 'cours', 'writ', 'exchang', 'word', 'idea', 'tal', 'support.', 'introduc', 'vary', 'rhet', 'concepts\xe2\x80\x94that', 'idea', 'techn', 'inform', 'persuad', 'audiences\xe2\x80\x94that', 'help', 'becom', 'effect', 'consum', 'produc', 'writ', 'vis', 'multimod', 'texts.', 'class', 'includ', 'short', 'video', 'demonst', 'activities.', 'envid', 'rhet', 'compos', 'learn', 'commun', 'includ', 'enrol', 'cours', 'instructors.', 'bring', 'expert', 'writ', 'rhet', 'cours', 'design', 'design', 'assign', 'cours', 'infrastruct', 'help', 'shar', 'expery', 'writ', 'stud', 'profess', 'us.', 'collab', 'facilit', 'wex', 'writ', 'exchang', 'plac', 'exchang', 'work', 'feedback']

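Before bringing in gensim, there is one more thing to do: remove the low-frequency words that occur only once in the whole corpus. In my tests, leaving them in affected the results somewhat: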

>>> all_stems = sum(texts_stemmed, [])
>>> stems_once = set(stem for stem in set(all_stems) if all_stems.count(stem) == 1)
>>> texts = [[stem for stem in text if stem not in stems_once] for text in texts_stemmed]
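
The session above calls all_stems.count(stem) once per distinct stem, which rescans the whole token list each time. A hedged alternative sketch in Python 3 counts every stem in a single pass with collections.Counter; drop_singletons is a hypothetical helper name and the three short documents are made-up stand-ins for texts_stemmed.

```python
from collections import Counter
from itertools import chain


def drop_singletons(tokenized_docs):
    """Remove tokens that occur exactly once across the whole corpus."""
    counts = Counter(chain.from_iterable(tokenized_docs))
    return [[tok for tok in doc if counts[tok] > 1] for doc in tokenized_docs]


# Made-up stand-in for texts_stemmed.
texts_stemmed = [
    ["writ", "compos", "rhet"],
    ["compos", "cours", "writ"],
    ["cours", "wex"],
]
texts = drop_singletons(texts_stemmed)
print(texts)
```

The result is the same kind of list the later gensim steps consume, with 'rhet' and 'wex' removed because each appears only once.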

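3. Bringing in gensim
With the preprocessing above in place, we can bring in gensim and quickly run the course similarity experiment. The steps below go by fast; see the previous part for a detailed description.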

>>> from gensim import corpora, models, similarities
>>> import logging
>>> logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

>>> dictionary = corpora.Dictionary(texts)
2013-06-07 21:37:07,120 : INFO : adding document #0 to Dictionary(0 unique tokens)
2013-06-07 21:37:07,263 : INFO : built Dictionary(3341 unique tokens) from 379 documents (total 46417 corpus positions)

>>> corpus = [dictionary.doc2bow(text) for text in texts]

>>> tfidf = models.TfidfModel(corpus)
2013-06-07 21:58:30,490 : INFO : collecting document frequencies
2013-06-07 21:58:30,490 : INFO : PROGRESS: processing document #0
2013-06-07 21:58:30,504 : INFO : calculating IDF weights for 379 documents and 3341 features (29166 matrix non-zeros)

>>> corpus_tfidf = tfidf[corpus]
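
To make the TF-IDF step less opaque, here is a pure-Python sketch (Python 3) of one common weighting: raw term frequency times log base 2 of (number of documents / document frequency), followed by L2 normalisation of each document vector. I believe this matches gensim's default settings, but treat that as an assumption and check the TfidfModel documentation; tfidf_vectors is a hypothetical name and the toy documents are made up.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, L2-normalised."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        weights = {
            term: tf * math.log2(n_docs / doc_freq[term])
            for term, tf in Counter(doc).items()
        }
        norm = math.sqrt(sum(w * w for w in weights.values()))
        # Terms present in every document get weight 0 and are dropped.
        vectors.append(
            {t: w / norm for t, w in weights.items() if w > 0} if norm else {}
        )
    return vectors


docs = [["machin", "learn", "cours"], ["learn", "cours"], ["gam", "cours"]]
vectors = tfidf_vectors(docs)
for vec in vectors:
    print(vec)
```

In this toy corpus 'cours' appears in every document, so its weight is zero and it vanishes from all three vectors, which is the behaviour that makes TF-IDF useful for down-weighting ubiquitous words.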

Here we decide, off the top of our heads, to train an LSI model with 10 topics:
>>> lsi = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=10)

>>> index = similarities.MatrixSimilarity(lsi[corpus])
2013-06-07 22:04:55,443 : INFO : scanning corpus to determine the number of features
2013-06-07 22:04:55,510 : INFO : creating matrix for 379 documents and 10 features

The LSI-based course index is now built. Let's take Professor Andrew Ng's machine learning open course as an example. It is on line 211 of our coursera_corpus file, that is:

>>> print courses_name[210]
Machine Learning

Now we can use the lsi model to map this course into the 10-topic space and then compute its similarity to the other courses:
>>> ml_course = texts[210]
>>> ml_bow = dictionary.doc2bow(ml_course)
>>> ml_lsi = lsi[ml_bow]
>>> print ml_lsi
[(0, 8.3270084238788673), (1, 0.91295652151975082), (2, -0.28296075112669405), (3, 0.0011599008827843801), (4, -4.1820134980024255), (5, -0.37889856481054851), (6, 2.0446999575052125), (7, 2.3297944485200031), (8, -0.32875594265388536), (9, -0.30389668455507612)]
>>> sims = index[ml_lsi]
>>> sort_sims = sorted(enumerate(sims), key=lambda item: -item[1])

Take the top 10 courses ranked by similarity:
>>> print sort_sims[0:10]
[(210, 1.0), (174, 0.97812241), (238, 0.96428639), (203, 0.96283489), (63, 0.9605484), (189, 0.95390636), (141, 0.94975704), (184, 0.94269753), (111, 0.93654782), (236, 0.93601125)]

The first course is the query itself:
>>> print courses_name[210]
Machine Learning

The second is the machine learning open course by Pedro Domingos, another big name on Coursera:
>>> print courses_name[174]
Machine Learning

The third is the probabilistic graphical models open course by Professor Daphne Koller, Coursera's other co-founder and likewise a leading figure:
>>> print courses_name[238]
Probabilistic Graphical Models

The fourth is the neural networks open course by Geoffrey Hinton, another giant of the field; some students call it required material for Deep Learning.
>>> print courses_name[203]
Neural Networks for Machine Learning

The results look pretty good. If you find this interesting, try it yourself.
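
For readers who want to see what the similarity index computes, the scores are cosine similarities between topic vectors. The sketch below (Python 3) reproduces the ranking step on three made-up 3-topic vectors standing in for LSI output; cosine is a hypothetical helper and none of the numbers come from the course data.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Made-up topic vectors: documents 0 and 1 point in similar directions.
doc_vectors = [[8.0, 1.0, -0.5], [7.5, 0.8, -0.2], [0.1, 3.0, 2.0]]
query = doc_vectors[0]

sims = [cosine(query, vec) for vec in doc_vectors]
# Same ranking idiom as the session above: highest similarity first.
sort_sims = sorted(enumerate(sims), key=lambda item: -item[1])
print(sort_sims)
```

As in the real experiment, the query document ranks first with a similarity of 1, and the ordering of the rest depends only on the angle between vectors, not on their length.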

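That's it for this series. I had planned to write up experiments on the full English Wikipedia data, but since Coursegraph doesn't need that for now, I'll stop here. Interested readers can go straight to the relevant gensim documentation, which is very detailed. Going forward I will probably focus more on applying NLTK to Chinese information processing; stay tuned.

Note: this is an original article. When reposting, please credit "我爱自然语言处理" (I Love Natural Language Processing): www.52nlp.cn

Permalink: https://www.52nlp.cn/如何计算两个文档的相似度三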

113 comments on "How to Compute the Similarity Between Two Documents (Part 3)"

52nlp said:

March 24, 2017 15:43

I don't quite understand your question. By "the document itself is the query", do you mean the document has already been mapped to a query vector?

道如那 replied:
March 24, 2017 at 20:56
I want to compute the similarity among the documents in the corpus (doc1, doc2, doc3, ...), not between corpus documents and an external query. I reread your article; does this part already compute the similarity between all documents in the corpus?
index = similarities.MatrixSimilarity(lsi[corpus])

52nlp replied:
March 27, 2017 at 16:21
Actually the API you mentioned earlier will do. It yields, one document at a time, that document's similarity to every other document. Here is one of my test results:

In [116]: for similarities in index:
   .....:     print similarities
   .....:
[ 0.99999994 0.99349093 0.99999994 0.99998897 0.99890161 -0.09886808
-0.06243003 -0.03453361 0.1831941 ]
[ 0.99349093 0.99999994 0.99351043 0.99294722 0.99773723 0.01512812
0.0516649 0.07953398 0.29398471]
[ 0.99999994 0.99351043 1. 0.99998826 0.99890959 -0.09869774
-0.06225919 -0.03436255 0.18336238]
[ 0.99998897 0.99294722 0.99998826 1. 0.99867147 -0.10352203
-0.06709816 -0.0392084 0.17859331]
[ 0.99890161 0.99773723 0.99890959 0.99867147 1. -0.05213219
-0.01559597 0.01233324 0.22905678]
[-0.09886808 0.01512812 -0.09869774 -0.10352203 -0.05213219 0.99999994
0.99933177 0.99792129 0.96014822]
[-0.06243003 0.0516649 -0.06225919 -0.06709816 -0.01559597 0.99933177
1. 0.99960995 0.96972233]
[-0.03453361 0.07953398 -0.03436255 -0.0392084 0.01233324 0.99792129
0.99960995 1. 0.97616404]
[ 0.1831941 0.29398471 0.18336238 0.17859331 0.22905678 0.96014822
0.96972233 0.97616404 1. ]

道如那 replied:
March 29, 2017 at 20:49
Got it, thank you for the answer!

Yugeng Liu said:

July 28, 2017 11:38

Why do I get different results on every run?

52nlp replied:
July 28, 2017 at 14:26
I'm not sure. Is the difference large or only slight?

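lijing said:

August 24, 2017 14:25

Hello, I followed your method and trained on 400,000 documents. Each time I query a new article to compute similarity, the first computation takes 319.69 ms and each subsequent query takes 79.52 ms; in other words, on average I can only run 13 queries per minute. How can I do batch queries? Many thanks.

52nlp replied:
August 24, 2017 at 14:36
Once you reach a certain scale, you could try simserver, written by gensim's original author: https://radimrehurek.com/gensim/simserver.html
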
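单世超 said:

January 6, 2018 13:24

Hello:
1. ml_lsi = lsi[ml_bow] Doesn't the text vector need to be converted to tfidf before this step?
2. I don't quite understand the data types of lsi and index here. Can the [ ] operator be used on them to get, respectively, a document's topic vector and the similarity between the query document and the other documents?
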
孙伟 said:

March 12, 2018 16:26

For the punctuation marks you can actually use string.punctuation from the Python standard library.

52nlp replied:
March 12, 2018 at 22:49
Yes, thanks.

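pengyulong said:

August 3, 2018 14:58

Hello, this article is very detailed and reveals many details of corpus processing; thanks for sharing. I still have one question:
The similarity computed here is always between a query and some entry in the corpus. How do I compute the similarity between two pieces of text (sen1, sen2), neither of which is in the corpus? I would be very grateful for an answer.

52nlp replied:
August 4, 2018 at 12:28
Map both texts into the vector space, then compute the similarity with the trained model.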
