
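Note: Tencent's word vectors can now be tried directly on the AINLP WeChat official account. In the chat window, send: 相似词 词条 (that is, the keyword 相似词, "similar words", followed by the term to look up).
I recently tried Word2Vec and GloVe along with their Python versions, gensim word2vec and python-glove, and wanted to test them on a larger corpus. Wikipedia was the natural choice. Wikimedia provides a good official data source at https://dumps.wikimedia.org, where Wikipedia data can be downloaded in many languages and formats. I had previously used gensim on the English Wikipedia corpus to train LSI and LDA models for computing the similarity between two documents, so I wanted to see whether gensim offers a simple way to process Wikipedia data and train a word2vec model for computing semantic similarity between words. Thanks to Google, I found a long thread in the gensim Google group, "training word2vec on full Wikipedia". It essentially settles how to train a word2vec model on the Wikipedia corpus with gensim, and the gensim author, Dr. Radim Řehůřek, who took part in the discussion, even added a small fix to a newer gensim release. All I had to do was verify the procedure. There is also a wiki2vec project on GitHub that does the same thing, but I prefer to solve the problem with Python and gensim.
There is plenty of reference material on word2vec in both English and Chinese. In English, you can read the officially recommended papers as well as the articles written by the gensim author, Dr. Radim Řehůřek. In Chinese, I recommend @licstar's 《Deep Learning in NLP (一)词向量和语言模型》, the Youdao tech salon's 《Deep Learning实战之word2vec》, @飞林沙's 《word2vec的学习思路》, and falao_beiliu's 《深度学习word2vec笔记之基础篇》 and 《深度学习word2vec笔记之算法篇》.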
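1. Word2Vec Test on English Wikipedia
I first tested on the English Wikipedia data. I downloaded the latest compressed XML dump (download date: March 1, 2015), about 11 GB, from: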
https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
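Processing has two stages. The first converts the XML wiki data to plain text, using the following script (process_wiki.py):
Note: many readers commented that they ran into problems under Python 3.x, so the version below has been modified to work with both Python 2.x and Python 3.x. It was tested on Ubuntu 16.04 (2017-05-01).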
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Author: Pan Yang (panyangnlp@gmail.com)
# Copyright 2017

from __future__ import print_function

import logging
import os.path
import six
import sys

from gensim.corpora import WikiCorpus

if __name__ == '__main__':
    program = os.path.basename(sys.argv[0])
    logger = logging.getLogger(program)

    logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s')
    logging.root.setLevel(level=logging.INFO)
    logger.info("running %s" % ' '.join(sys.argv))

    # check and process input arguments
    if len(sys.argv) != 3:
        print("Using: python process_wiki.py enwiki.xxx.xml.bz2 wiki.en.text")
        sys.exit(1)
    inp, outp = sys.argv[1:3]
    space = " "
    i = 0

    output = open(outp, 'w')
    wiki = WikiCorpus(inp, lemmatize=False, dictionary={})
    for text in wiki.get_texts():
        if six.PY3:
            output.write(b' '.join(text).decode('utf-8') + '\n')
        # ###another method###
        #    output.write(
        #        space.join(map(lambda x:x.decode("utf-8"), text)) + '\n')
        else:
            output.write(space.join(text) + "\n")
        i = i + 1
        if (i % 10000 == 0):
            logger.info("Saved " + str(i) + " articles")

    output.close()
    logger.info("Finished Saved " + str(i) + " articles")
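This uses WikiCorpus, gensim's class for processing Wikipedia dumps. Its get_texts method converts each article into one line of text and strips punctuation and similar content. Note that in "wiki = WikiCorpus(inp, lemmatize=False, dictionary={})", lemmatize is set to False mainly so that the pattern module is not used to lemmatize the English words, whether or not pattern is installed on your machine, because using pattern slows the processing down severely.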
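To make the "one article per line" step concrete, here is a minimal sketch of the joining logic on its own, without gensim. The helper name tokens_to_line is hypothetical, and the assumption that tokens may arrive either as bytes or as str (depending on the gensim release) is inferred from the Python 2/3 branch in the script above rather than taken from the gensim documentation.

```python
def tokens_to_line(tokens, sep=" "):
    """Join one article's tokens into a single line of text.

    Assumption: tokens may be bytes (as the PY3 branch of the script
    expects) or already-decoded str, so both cases are handled here.
    """
    decoded = [
        tok.decode("utf-8") if isinstance(tok, bytes) else tok
        for tok in tokens
    ]
    return sep.join(decoded)


# Stand-ins for two articles as get_texts() might yield them.
articles = [
    [b"anarchism", b"is", b"collection"],
    ["movements", "and", "ideologies"],
]
lines = [tokens_to_line(article) for article in articles]
print(lines)
```

Handling both token types in one place would let the write loop stay the same across interpreter versions, at the cost of one isinstance check per token.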
Run "python process_wiki.py enwiki-latest-pages-articles.xml.bz2 wiki.en.text":
2015-03-07 15:08:39,181: INFO: running process_enwiki.py enwiki-latest-pages-articles.xml.bz2 wiki.en.text
2015-03-07 15:11:12,860: INFO: Saved 10000 articles
2015-03-07 15:13:25,369: INFO: Saved 20000 articles
2015-03-07 15:15:19,771: INFO: Saved 30000 articles
2015-03-07 15:16:58,424: INFO: Saved 40000 articles
2015-03-07 15:18:12,374: INFO: Saved 50000 articles
2015-03-07 15:19:03,213: INFO: Saved 60000 articles
2015-03-07 15:19:47,656: INFO: Saved 70000 articles
2015-03-07 15:20:29,135: INFO: Saved 80000 articles
2015-03-07 15:22:02,365: INFO: Saved 90000 articles
2015-03-07 15:23:40,141: INFO: Saved 100000 articles
.....
2015-03-07 19:33:16,549: INFO: Saved 3700000 articles
2015-03-07 19:33:49,493: INFO: Saved 3710000 articles
2015-03-07 19:34:23,442: INFO: Saved 3720000 articles
2015-03-07 19:34:57,984: INFO: Saved 3730000 articles
2015-03-07 19:35:31,976: INFO: Saved 3740000 articles
2015-03-07 19:36:05,790: INFO: Saved 3750000 articles
2015-03-07 19:36:32,392: INFO: finished iterating over Wikipedia corpus of 3758076 documents with 2018886604 positions (total 15271374 articles, 2075130438 positions before pruning articles shorter than 50 words)
2015-03-07 19:36:32,394: INFO: Finished Saved 3758076 articles
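On my Mac (4 cores, 16 GB RAM) this ran for about four and a half hours. After processing 3.75 million articles, we get wiki.en.text, a 12 GB plain-text version of English Wikipedia, which looks like this: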
anarchism is collection of movements and ideologies that hold the state to be undesirable unnecessary or harmful these movements advocate some form of stateless society instead often based on self governed voluntary institutions or non hierarchical free associations although anti statism is central to anarchism as political philosophy anarchism also entails rejection of and often hierarchical organisation in general as an anti dogmatic philosophy anarchism draws on many currents of thought and strategy anarchism does not offer fixed body of doctrine from single particular world view instead fluxing and flowing as philosophy there are many types and traditions of anarchism not all of which are mutually exclusive anarchist schools of thought can differ fundamentally supporting anything from extreme individualism to complete collectivism strains of anarchism have often been divided into the categories of social and individualist anarchism or similar dual classifications anarchism is usually considered radical left wing ideology and much of anarchist economics and anarchist legal philosophy reflect anti authoritarian interpretations of communism collectivism syndicalism mutualism or participatory economics etymology and terminology the term anarchism is compound word composed from the word anarchy and the suffix ism themselves derived respectively from the greek anarchy from anarchos meaning one without rulers from the privative prefix ἀν an without and archos leader ruler cf archon or arkhē authority sovereignty realm magistracy and the suffix or ismos isma from the verbal infinitive suffix...
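With this data in hand, a word2vec model can be trained with either the original word2vec binary or the Python word2vec implementation in gensim. We tried the former and found it slow, so we used the gensim word2vec training script from the Google group thread, with one small change: it also keeps the output in vector text format, which is convenient for debugging. The script, train_word2vec_model.py, is as follows: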
#!/usr/bin/env python
# -*- coding: utf-8 -*-

import logging
import os
import sys
import multiprocessing

from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

if __name__ == '__main__':
    program = os.path.basename(sys.argv[0])
    logger = logging.getLogger(program)

    logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s')
    logging.root.setLevel(level=logging.INFO)
    logger.info("running %s" % ' '.join(sys.argv))

    # check and process input arguments
    if len(sys.argv) < 4:
        # This script has no module docstring, so globals()['__doc__'] is
        # None and cannot be formatted; print an explicit usage line instead.
        print("Usage: python train_word2vec_model.py wiki.en.text wiki.en.text.model wiki.en.text.vector")
        sys.exit(1)
    inp, outp1, outp2 = sys.argv[1:4]

    # Note: gensim 4.x renamed the `size` argument to `vector_size`.
    model = Word2Vec(LineSentence(inp), size=400, window=5, min_count=5,
                     workers=multiprocessing.cpu_count())

    # trim unneeded model memory = use(much) less RAM
    # model.init_sims(replace=True)
    model.save(outp1)
    model.wv.save_word2vec_format(outp2, binary=False)
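The training script reads wiki.en.text through LineSentence, which treats each line as one sentence of whitespace-separated tokens. The sketch below shows that idea in plain Python. The function name iter_sentences and the temporary sample file are made up for illustration, and the real class has extra options (such as a limit on line length) that this sketch ignores.

```python
import io
import os
import tempfile


def iter_sentences(path):
    """Yield one list of tokens per non-empty line of a tokenized text file."""
    with io.open(path, encoding="utf-8") as handle:
        for line in handle:
            tokens = line.split()
            if tokens:
                yield tokens


# Tiny stand-in for wiki.en.text: one article per line.
with tempfile.NamedTemporaryFile("w", suffix=".text", delete=False) as tmp:
    tmp.write("anarchism is collection of movements\n")
    tmp.write("the term anarchism is compound word\n")
    sample_path = tmp.name

sentences = list(iter_sentences(sample_path))
os.remove(sample_path)
print(sentences)
```

Because the function is a generator, the corpus is streamed from disk one line at a time instead of being loaded into memory, which matters for a 12 GB file.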
Run "python train_word2vec_model.py wiki.en.text wiki.en.text.model wiki.en.text.vector":
2015-03-09 22:48:29,588: INFO: running train_word2vec_model.py wiki.en.text wiki.en.text.model wiki.en.text.vector
2015-03-09 22:48:29,593: INFO: collecting all words and their counts
2015-03-09 22:48:29,607: INFO: PROGRESS: at sentence #0, processed 0 words and 0 word types
2015-03-09 22:48:50,686: INFO: PROGRESS: at sentence #10000, processed 29353579 words and 430650 word types
2015-03-09 22:49:08,476: INFO: PROGRESS: at sentence #20000, processed 54695775 words and 610833 word types
2015-03-09 22:49:22,985: INFO: PROGRESS: at sentence #30000, processed 75344844 words and 742274 word types
2015-03-09 22:49:35,607: INFO: PROGRESS: at sentence #40000, processed 93430415 words and 859131 word types
2015-03-09 22:49:44,125: INFO: PROGRESS: at sentence #50000, processed 106057188 words and 935606 word types
2015-03-09 22:49:49,185: INFO: PROGRESS: at sentence #60000, processed 114319016 words and 952771 word types
2015-03-09 22:49:53,316: INFO: PROGRESS: at sentence #70000, processed 121263134 words and 969526 word types
2015-03-09 22:49:57,268: INFO: PROGRESS: at sentence #80000, processed 127773799 words and 984130 word types
2015-03-09 22:50:07,593: INFO: PROGRESS: at sentence #90000, processed 142688762 words and 1062932 word types
2015-03-09 22:50:19,162: INFO: PROGRESS: at sentence #100000, processed 159550824 words and 1157644 word types
......
2015-03-09 23:11:52,977: INFO: PROGRESS: at sentence #3700000, processed 1999452503 words and 7990138 word types
2015-03-09 23:11:55,367: INFO: PROGRESS: at sentence #3710000, processed 2002777270 words and 8002903 word types
2015-03-09 23:11:57,842: INFO: PROGRESS: at sentence #3720000, processed 2006213923 words and 8019620 word types
2015-03-09 23:12:00,439: INFO: PROGRESS: at sentence #3730000, processed 2009762733 words and 8035408 word types
2015-03-09 23:12:02,793: INFO: PROGRESS: at sentence #3740000, processed 2013066196 words and 8045218 word types
2015-03-09 23:12:05,178: INFO: PROGRESS: at sentence #3750000, processed 2016363087 words and 8057784 word types
2015-03-09 23:12:07,013: INFO: collected 8069236 word types from a corpus of 2018886604 words and 3758076 sentences
2015-03-09 23:12:12,230: INFO: total 1969354 word types after removing those with count<5
2015-03-09 23:12:12,230: INFO: constructing a huffman tree from 1969354 words
2015-03-09 23:14:07,415: INFO: built huffman tree with maximum node depth 29
2015-03-09 23:14:09,790: INFO: resetting layer weights
2015-03-09 23:15:04,506: INFO: training model with 4 workers on 1969354 vocabulary and 400 features, using 'skipgram'=1 'hierarchical softmax'=1 'subsample'=0 and 'negative sampling'=0
2015-03-09 23:15:19,112: INFO: PROGRESS: at 0.01% words, alpha 0.02500, 19098 words/s
2015-03-09 23:15:20,224: INFO: PROGRESS: at 0.03% words, alpha 0.02500, 37671 words/s
2015-03-09 23:15:22,305: INFO: PROGRESS: at 0.07% words, alpha 0.02500, 75393 words/s
2015-03-09 23:15:27,712: INFO: PROGRESS: at 0.08% words, alpha 0.02499, 65618 words/s
2015-03-09 23:15:29,452: INFO: PROGRESS: at 0.09% words, alpha 0.02500, 70966 words/s
2015-03-09 23:15:34,032: INFO: PROGRESS: at 0.11% words, alpha 0.02498, 77369 words/s
2015-03-09 23:15:37,249: INFO: PROGRESS: at 0.12% words, alpha 0.02498, 74935 words/s
2015-03-09 23:15:40,618: INFO: PROGRESS: at 0.14% words, alpha 0.02498, 75399 words/s
2015-03-09 23:15:42,301: INFO: PROGRESS: at 0.16% words, alpha 0.02497, 86029 words/s
2015-03-09 23:15:46,283: INFO: PROGRESS: at 0.17% words, alpha 0.02497, 83033 words/s
2015-03-09 23:15:48,374: INFO: PROGRESS: at 0.18% words, alpha 0.02497, 83370 words/s
2015-03-09 23:15:51,398: INFO: PROGRESS: at 0.19% words, alpha 0.02496, 82794 words/s
2015-03-09 23:15:55,069: INFO: PROGRESS: at 0.21% words, alpha 0.02496, 83753 words/s
2015-03-09 23:15:57,718: INFO: PROGRESS: at 0.23% words, alpha 0.02496, 85031 words/s
2015-03-09 23:16:00,106: INFO: PROGRESS: at 0.24% words, alpha 0.02495, 86567 words/s
2015-03-09 23:16:05,523: INFO: PROGRESS: at 0.26% words, alpha 0.02495, 84850 words/s
2015-03-09 23:16:06,596: INFO: PROGRESS: at 0.27% words, alpha 0.02495, 87926 words/s
2015-03-09 23:16:09,500: INFO: PROGRESS: at 0.29% words, alpha 0.02494, 88618 words/s
2015-03-09 23:16:10,714: INFO: PROGRESS: at 0.30% words, alpha 0.02494, 91023 words/s
2015-03-09 23:16:18,467: INFO: PROGRESS: at 0.32% words, alpha 0.02494, 85960 words/s
2015-03-09 23:16:19,547: INFO: PROGRESS: at 0.33% words, alpha 0.02493, 89140 words/s
2015-03-09 23:16:23,500: INFO: PROGRESS: at 0.36% words, alpha 0.02493, 92026 words/s
2015-03-09 23:16:29,738: INFO: PROGRESS: at 0.37% words, alpha 0.02491, 88180 words/s
2015-03-09 23:16:32,000: INFO: PROGRESS: at 0.40% words, alpha 0.02492, 92734 words/s
2015-03-09 23:16:34,392: INFO: PROGRESS: at 0.42% words, alpha 0.02491, 93300 words/s
2015-03-09 23:16:41,018: INFO: PROGRESS: at 0.43% words, alpha 0.02490, 89727 words/s
.......
2015-03-10 05:03:31,849: INFO: PROGRESS: at 99.20% words, alpha 0.00020, 95350 words/s
2015-03-10 05:03:32,901: INFO: PROGRESS: at 99.21% words, alpha 0.00020, 95350 words/s
2015-03-10 05:03:34,296: INFO: PROGRESS: at 99.21% words, alpha 0.00020, 95350 words/s
2015-03-10 05:03:35,635: INFO: PROGRESS: at 99.22% words, alpha 0.00020, 95349 words/s
2015-03-10 05:03:36,730: INFO: PROGRESS: at 99.22% words, alpha 0.00020, 95350 words/s
2015-03-10 05:03:37,489: INFO: reached the end of input; waiting to finish 8 outstanding jobs
2015-03-10 05:03:37,908: INFO: PROGRESS: at 99.23% words, alpha 0.00019, 95350 words/s
2015-03-10 05:03:39,028: INFO: PROGRESS: at 99.23% words, alpha 0.00019, 95350 words/s
2015-03-10 05:03:40,127: INFO: PROGRESS: at 99.24% words, alpha 0.00019, 95350 words/s
2015-03-10 05:03:40,910: INFO: training on 1994415728 words took 20916.4s, 95352 words/s
2015-03-10 05:03:41,058: INFO: saving Word2Vec object under wiki.en.text.model, separately None
2015-03-10 05:03:41,209: INFO: not storing attribute syn0norm
2015-03-10 05:03:41,209: INFO: storing numpy array 'syn0' to wiki.en.text.model.syn0.npy
2015-03-10 05:04:35,199: INFO: storing numpy array 'syn1' to wiki.en.text.model.syn1.npy
2015-03-10 05:11:25,400: INFO: storing 1969354x400 projection weights into wiki.en.text.vector
After about 7 hours, we obtained a word2vec model in gensim's default format and a model in the vector format of the original C version of word2vec, wiki.en.text.vector, which looks like this:
1969354 400
the 0.129255 0.015725 0.049174 -0.016438 -0.018912 0.032752 0.079885 0.033669 -0.077722 -0.025709 0.012775 0.044153 0.134307 0.070499 -0.002243 0.105198 -0.016832 -0.028631 -0.124312 -0.123064 -0.116838 0.051181 -0.096058 -0.049734 0.017380 -0.101221 0.058945 0.013669 -0.012755 0.061053 0.061813 0.083655 -0.069382 -0.069868 0.066529 -0.037156 -0.072935 -0.009470 0.037412 -0.004406 0.047011 0.005033 -0.066270 -0.031815 0.023160 -0.080117 0.172918 0.065486 -0.072161 0.062875 0.019939 -0.048380 0.198152 -0.098525 0.023434 0.079439 0.045150 -0.079479 -0.051441 -0.021556 -0.024981 -0.045291 0.040284 -0.082500 0.014618 -0.071998 0.031887 0.043916 0.115783 -0.174898 0.086603 -0.023124 0.007293 -0.066576 -0.164817 -0.081223 0.058412 0.000132 0.064160 0.055848 0.029776 -0.103420 -0.007541 -0.031742 0.082533 -0.061760 -0.038961 0.001754 -0.023977 0.069616 0.095920 0.017136 0.067126 -0.111310 0.053632 0.017633 -0.003875 -0.005236 0.063151 0.039729 -0.039158 0.001415 0.021754 -0.012540 0.015070 -0.062636 -0.013605 -0.031770 0.005296 -0.078119 -0.069303 -0.080634 -0.058377 0.024398 -0.028173 0.026353 0.088662 0.018755 -0.113538 0.055538 -0.086012 -0.027708 -0.028788 0.017759 0.029293 0.047674 -0.106734 -0.134380 0.048605 -0.089583 0.029426 0.030552 0.141916 -0.022653 0.017204 -0.036059 0.061045 -0.000077 -0.076579 0.066747 0.060884 -0.072817...
...
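The text format shown above is simple enough to read by hand: a header line with the vocabulary size and the vector dimension, then one line per word holding the word and its values. The following is a minimal sketch of a reader for it. The function name read_word2vec_text is hypothetical, and the 3-dimensional sample reuses only the first few numbers from the listing above, truncated for illustration.

```python
import io


def read_word2vec_text(stream):
    """Parse textual word2vec vectors: a 'count dim' header, then one word per line."""
    header = stream.readline().split()
    vocab_size, dim = int(header[0]), int(header[1])
    vectors = {}
    for line in stream:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            raise ValueError("expected %d values, got %d" % (dim, len(parts) - 1))
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    if len(vectors) != vocab_size:
        raise ValueError("header promised %d words, found %d" % (vocab_size, len(vectors)))
    return vectors


# Truncated sample, only to exercise the parser (real rows have 400 values).
sample = io.StringIO(
    u"2 3\n"
    u"the 0.129255 0.015725 0.049174\n"
    u"queen -0.016438 -0.018912 0.032752\n"
)
vectors = read_word2vec_text(sample)
print(sorted(vectors))
print(len(vectors["the"]))
```

A reader like this is mostly useful for spot checks and debugging; for the full 1969354 x 400 file, a plain dict of Python lists would use far more memory than gensim's numpy-backed loader.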
In ipython, we load and test this model with gensim. The model is about 7 GB, so loading takes a little while:
In [2]: import gensim

# Note: because of gensim version updates, if the load below fails, you can use the newer interface: model = gensim.models.word2vec.Word2Vec.load(MODEL_PATH)
In [3]: model = gensim.models.Word2Vec.load_word2vec_format("wiki.en.text.vector", binary=False)

In [4]: model.most_similar("queen")
Out[4]:
[(u'princess', 0.5760838389396667),
 (u'hyoui', 0.5671186447143555),
 (u'janggyung', 0.5598698854446411),
 (u'king', 0.5556215047836304),
 (u'dollallolla', 0.5540223121643066),
 (u'loranella', 0.5522741079330444),
 (u'ramphaiphanni', 0.5310937166213989),
 (u'jeheon', 0.5298476219177246),
 (u'soheon', 0.5243583917617798),
 (u'coronation', 0.5217245221138)]

In [5]: model.most_similar("man")
Out[5]:
[(u'woman', 0.7120707035064697),
 (u'girl', 0.58659827709198),
 (u'handsome', 0.5637181997299194),
 (u'boy', 0.5425317287445068),
 (u'villager', 0.5084836483001709),
 (u'mustachioed', 0.49287813901901245),
 (u'mcgucket', 0.48355430364608765),
 (u'spider', 0.4804879426956177),
 (u'policeman', 0.4780033826828003),
 (u'stranger', 0.4750771224498749)]

In [6]: model.most_similar("woman")
Out[6]:
[(u'man', 0.7120705842971802),
 (u'girl', 0.6736541986465454),
 (u'prostitute', 0.5765659809112549),
 (u'divorcee', 0.5429972410202026),
 (u'person', 0.5276163816452026),
 (u'schoolgirl', 0.5102938413619995),
 (u'housewife', 0.48748138546943665),
 (u'lover', 0.4858251214027405),
 (u'handsome', 0.4773051142692566),
 (u'boy', 0.47445783019065857)]

In [8]: model.similarity("woman", "man")
Out[8]: 0.71207063453821218

In [10]: model.doesnt_match("breakfast cereal dinner lunch".split())
Out[10]: 'cereal'

In [11]: model.similarity("woman", "girl")
Out[11]: 0.67365416785207421

In [13]: model.most_similar("frog")
Out[13]:
[(u'toad', 0.6868536472320557),
 (u'barycragus', 0.6607867479324341),
 (u'grylio', 0.626731276512146),
 (u'heckscheri', 0.6208407878875732),
 (u'clamitans', 0.6150864362716675),
 (u'coplandi', 0.612680196762085),
 (u'pseudacris', 0.6108512878417969),
 (u'litoria', 0.6084023714065552),
 (u'raniformis', 0.6044802665710449),
 (u'watjulumensis', 0.6043726205825806)]
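The scores returned by most_similar and similarity are cosine similarities between word vectors. As a rough illustration of what is being computed, here is a small pure-Python sketch on made-up 3-dimensional vectors. The function names and the toy numbers are assumptions for illustration only; gensim's own implementation works on normalized numpy arrays and is much faster.

```python
import math


def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(vectors, word, topn=3):
    """Rank every other word by cosine similarity to `word`."""
    query = vectors[word]
    scored = [
        (other, cosine(query, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]


# Made-up vectors; real ones from wiki.en.text.vector have 400 dimensions.
toy_vectors = {
    "man": [1.0, 0.0, 0.0],
    "woman": [0.9, 0.1, 0.0],
    "frog": [0.0, 1.0, 0.0],
    "toad": [0.0, 0.9, 0.1],
}
ranking = most_similar(toy_vectors, "man")
print(ranking[0][0])
```

Since cosine similarity is symmetric, the score for ("woman", "man") equals the score for ("man", "woman"), which matches the nearly identical values in the session above.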
Everything works. However, loading the model in gensim's default numpy-based format ran into a problem:
In [1]: import gensim

In [2]: model = gensim.models.Word2Vec.load("wiki.en.text.model")

In [3]: model.most_similar("man")
... RuntimeWarning: invalid value encountered in divide
self.syn0norm = (self.syn0 / sqrt((self.syn0 ** 2).sum(-1))[..., newaxis]).astype(REAL)
Out[3]:
[(u'ahsns', nan),
 (u'ny\xedl', nan),
 (u'indradeo', nan),
 (u'jaimovich', nan),
 (u'addlepate', nan),
 (u'jagello', nan),
 (u'festenburg', nan),
 (u'picatic', nan),
 (u'tolosanum', nan),
 (u'mithoo', nan)]
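This is why I modified the script above. When trained on a smaller dataset, for example the first 100,000 lines of text, the script has no problems with either the original format or the gensim format. After a full run on English Wikipedia, however, the problem appears. I tried several ways to fix it without success. If you have a good suggestion or solution, please share it.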
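The warning comes from dividing each vector by its norm, so one way to narrow the problem down (not a confirmed explanation of it) is to look for vectors that cannot be normalized: rows with a zero norm or with non-finite values. The sketch below does that check in plain Python on a toy dictionary. The function name find_bad_rows is hypothetical, and the numbers attached to the words are invented for illustration.

```python
import math


def find_bad_rows(vectors):
    """Return (word, reason) pairs for vectors that cannot be unit-normalized."""
    bad = []
    for word, vec in vectors.items():
        if any(math.isnan(x) or math.isinf(x) for x in vec):
            bad.append((word, "non-finite value"))
        elif sum(x * x for x in vec) == 0.0:
            bad.append((word, "zero norm"))
    return bad


# Invented values; only the words are borrowed from the output above.
toy = {
    "man": [0.1, -0.2, 0.3],
    "ahsns": [float("nan"), 0.0, 0.0],
    "mithoo": [0.0, 0.0, 0.0],
}
report = sorted(find_bad_rows(toy))
print(report)
```

If a check like this reported nothing on the saved weights, the cause would have to be somewhere else, for example in how the arrays are saved or loaded.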
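2. Word2Vec Test on Chinese Wikipedia
After testing English Wikipedia, I naturally wanted to try the Chinese Wikipedia data. The process is similar to the English one and also has two steps, but the Chinese data needs some special handling: Traditional-to-Simplified conversion, Chinese word segmentation, and removal of non-UTF-8 characters. The Chinese data can be downloaded from: https://dumps.wikimedia.org/zhwiki/latest/zhwiki-latest-pages-articles.xml.bz2.
The Chinese Wikipedia data is fairly small: the whole compressed XML file is only about 1 GB, much smaller than the English data. First, process the compressed XML file with process_wiki.py by running: python process_wiki.py zhwiki-latest-pages-articles.xml.bz2 wiki.zh.text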
2015-03-11 17:39:22,739: INFO: running process_wiki.py zhwiki-latest-pages-articles.xml.bz2 wiki.zh.text
2015-03-11 17:40:08,329: INFO: Saved 10000 articles
2015-03-11 17:40:45,501: INFO: Saved 20000 articles
2015-03-11 17:41:23,659: INFO: Saved 30000 articles
2015-03-11 17:42:01,748: INFO: Saved 40000 articles
2015-03-11 17:42:33,779: INFO: Saved 50000 articles
......
2015-03-11 17:55:23,094: INFO: Saved 200000 articles
2015-03-11 17:56:14,692: INFO: Saved 210000 articles
2015-03-11 17:57:04,614: INFO: Saved 220000 articles
2015-03-11 17:57:57,979: INFO: Saved 230000 articles
2015-03-11 17:58:16,621: INFO: finished iterating over Wikipedia corpus of 232894 documents with 51603419 positions (total 2581444 articles, 62177405 positions before pruning articles shorter than 50 words)
2015-03-11 17:58:16,622: INFO: Finished Saved 232894 articles
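This produced a text-format corpus of a little over 230,000 Chinese articles, wiki.zh.text, roughly 750 MB in size. On inspection, however, besides some English words mixed in, it also contains many Traditional Chinese characters. Following the method in @licstar's 《维基百科简体中文语料的获取》, I installed opencc and converted the Traditional characters in wiki.zh.text to Simplified: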
opencc -i wiki.zh.text -o wiki.zh.text.jian -c zht2zhs.ini
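Next comes word segmentation. This time I used a Chinese word segmentation system that I trained on top of MeCab. It is not yet ready for production use, but its speed and segmentation quality are basically adequate for this experiment: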
mecab -d ../data/ -O wakati wiki.zh.text.jian -o wiki.zh.text.jian.seg -b 10000000
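Note that the data directory here holds the segmentation model and dictionary files trained for mecab. For details, see 《用MeCab打造一套实用的中文分词系统》.
With the segmented Chinese Wikipedia data ready, I assumed I could go straight to training the word2vec model: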
python train_word2vec_model.py wiki.zh.text.jian.seg wiki.zh.text.model wiki.zh.text.vector
But I still ran into a problem. The error was:
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 5394-5395: invalid continuation byte
A quick Google search suggested that the file contains non-UTF-8 characters, so I used iconv to deal with it:
iconv -c -t UTF-8 < wiki.zh.text.jian.seg > wiki.zh.text.jian.seg.utf-8
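For readers without iconv at hand, roughly the same cleanup can be sketched in Python by decoding with errors="ignore", which silently drops bytes that are not valid UTF-8. This assumes the file is meant to be UTF-8 in the first place; the function name drop_invalid_utf8 and the sample bytes are made up for illustration.

```python
# -*- coding: utf-8 -*-


def drop_invalid_utf8(raw_bytes):
    """Drop bytes that are not valid UTF-8 and return clean UTF-8 bytes."""
    return raw_bytes.decode("utf-8", errors="ignore").encode("utf-8")


# A segmented line with one stray invalid byte (0xff) in the middle.
dirty = u"中文 分词".encode("utf-8") + b"\xff" + b" ok\n"
cleaned = drop_invalid_utf8(dirty)
print(cleaned.decode("utf-8"))
```

For a file as large as the segmented corpus, the same function could be applied line by line to a file opened in binary mode, so that the whole file never has to sit in memory.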
After that, things basically work. Run:
python train_word2vec_model.py wiki.zh.text.jian.seg.utf-8 wiki.zh.text.model wiki.zh.text.vector
2015-03-11 18:50:02,586: INFO: running train_word2vec_model.py wiki.zh.text.jian.seg.utf-8 wiki.zh.text.model wiki.zh.text.vector
2015-03-11 18:50:02,592: INFO: collecting all words and their counts
2015-03-11 18:50:02,592: INFO: PROGRESS: at sentence #0, processed 0 words and 0 word types
2015-03-11 18:50:12,476: INFO: PROGRESS: at sentence #10000, processed 12914562 words and 254662 word types
2015-03-11 18:50:20,215: INFO: PROGRESS: at sentence #20000, processed 22308801 words and 373573 word types
2015-03-11 18:50:28,448: INFO: PROGRESS: at sentence #30000, processed 30724902 words and 460837 word types
...
2015-03-11 18:52:03,498: INFO: PROGRESS: at sentence #210000, processed 143804601 words and 1483608 word types
2015-03-11 18:52:07,772: INFO: PROGRESS: at sentence #220000, processed 149352283 words and 1521199 word types
2015-03-11 18:52:11,639: INFO: PROGRESS: at sentence #230000, processed 154741839 words and 1563584 word types
2015-03-11 18:52:12,746: INFO: collected 1575172 word types from a corpus of 156430908 words and 232894 sentences
2015-03-11 18:52:13,672: INFO: total 278291 word types after removing those with count<5
2015-03-11 18:52:13,673: INFO: constructing a huffman tree from 278291 words
2015-03-11 18:52:29,323: INFO: built huffman tree with maximum node depth 25
2015-03-11 18:52:29,683: INFO: resetting layer weights
2015-03-11 18:52:38,805: INFO: training model with 4 workers on 278291 vocabulary and 400 features, using 'skipgram'=1 'hierarchical softmax'=1 'subsample'=0 and 'negative sampling'=0
2015-03-11 18:52:49,504: INFO: PROGRESS: at 0.10% words, alpha 0.02500, 15008 words/s
2015-03-11 18:52:51,935: INFO: PROGRESS: at 0.38% words, alpha 0.02500, 44434 words/s
2015-03-11 18:52:54,779: INFO: PROGRESS: at 0.56% words, alpha 0.02500, 53965 words/s
2015-03-11 18:52:57,240: INFO: PROGRESS: at 0.62% words, alpha 0.02491, 52116 words/s
2015-03-11 18:52:58,823: INFO: PROGRESS: at 0.72% words, alpha 0.02494, 55804 words/s
2015-03-11 18:53:03,649: INFO: PROGRESS: at 0.94% words, alpha 0.02486, 58277 words/s
2015-03-11 18:53:07,357: INFO: PROGRESS: at 1.03% words, alpha 0.02479, 56036 words/s
......
2015-03-11 19:22:09,002: INFO: PROGRESS: at 98.38% words, alpha 0.00044, 85936 words/s
2015-03-11 19:22:10,321: INFO: PROGRESS: at 98.50% words, alpha 0.00044, 85971 words/s
2015-03-11 19:22:11,934: INFO: PROGRESS: at 98.55% words, alpha 0.00039, 85940 words/s
2015-03-11 19:22:13,384: INFO: PROGRESS: at 98.65% words, alpha 0.00036, 85960 words/s
2015-03-11 19:22:13,883: INFO: training on 152625573 words took 1775.1s, 85982 words/s
2015-03-11 19:22:13,883: INFO: saving Word2Vec object under wiki.zh.text.model, separately None
2015-03-11 19:22:13,884: INFO: not storing attribute syn0norm
2015-03-11 19:22:13,884: INFO: storing numpy array 'syn0' to wiki.zh.text.model.syn0.npy
2015-03-11 19:22:20,797: INFO: storing numpy array 'syn1' to wiki.zh.text.model.syn1.npy
2015-03-11 19:22:40,667: INFO: storing 278291x400 projection weights into wiki.zh.text.vector
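Let's see how the trained Chinese Wikipedia word2vec model performs (the session below loads the gensim-format file "wiki.zh.text.model", which was saved alongside "wiki.zh.text.vector"):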
In [1]: import gensim

In [2]: model = gensim.models.Word2Vec.load("wiki.zh.text.model")

In [3]: model.most_similar(u"足球")
Out[3]:
[(u'\u8054\u8d5b', 0.6553816199302673),
 (u'\u7532\u7ea7', 0.6530429720878601),
 (u'\u7bee\u7403', 0.5967546701431274),
 (u'\u4ff1\u4e50\u90e8', 0.5872289538383484),
 (u'\u4e59\u7ea7', 0.5840631723403931),
 (u'\u8db3\u7403\u961f', 0.5560152530670166),
 (u'\u4e9a\u8db3\u8054', 0.5308005809783936),
 (u'allsvenskan', 0.5249762535095215),
 (u'\u4ee3\u8868\u961f', 0.5214947462081909),
 (u'\u7532\u7ec4', 0.5177896022796631)]

In [4]: result = model.most_similar(u"足球")

In [5]: for e in result: print e[0], e[1]
   ....:
联赛 0.65538161993
甲级 0.653042972088
篮球 0.596754670143
俱乐部 0.587228953838
乙级 0.58406317234
足球队 0.556015253067
亚足联 0.530800580978
allsvenskan 0.52497625351
代表队 0.521494746208
甲组 0.51778960228

In [6]: result = model.most_similar(u"男人")

In [7]: for e in result: print e[0], e[1]
   ....:
女人 0.77537125349
家伙 0.617369174957
妈妈 0.567102909088
漂亮 0.560832381248
잘했어 0.540875017643
谎言 0.538448691368
爸爸 0.53660941124
傻瓜 0.535608053207
예쁘다 0.535151124001
mc刘 0.529670000076

In [8]: result = model.most_similar(u"女人")

In [9]: for e in result: print e[0], e[1]
   ....:
男人 0.77537125349
我的某 0.589010596275
妈妈 0.576344847679
잘했어 0.562340974808
美丽 0.555426716805
爸爸 0.543958246708
新娘 0.543640494347
谎言 0.540272831917
妞儿 0.531066179276
老婆 0.528521537781

In [10]: result = model.most_similar(u"青蛙")

In [11]: for e in result: print e[0], e[1]
   ....:
老鼠 0.559612870216
乌龟 0.489831030369
蜥蜴 0.478990525007
猫 0.46728849411
鳄鱼 0.461885392666
蟾蜍 0.448014199734
猴子 0.436584025621
白雪公主 0.434905380011
蚯蚓 0.433413207531
螃蟹 0.4314712286

In [12]: result = model.most_similar(u"姨夫")

In [13]: for e in result: print e[0], e[1]
   ....:
堂伯 0.583935439587
祖父 0.574735701084
妃所生 0.569327116013
内弟 0.562012672424
早卒 0.558042645454
曕 0.553856015205
胤祯 0.553288519382
陈潜 0.550716996193
愔之 0.550510883331
叔父 0.550032019615

In [14]: result = model.most_similar(u"衣服")

In [15]: for e in result: print e[0], e[1]
   ....:
鞋子 0.686688780785
穿着 0.672499775887
衣物 0.67173999548
大衣 0.667605519295
裤子 0.662670075893
内裤 0.662210345268
裙子 0.659705817699
西装 0.648508131504
洋装 0.647238850594
围裙 0.642895817757

In [16]: result = model.most_similar(u"公安局")

In [17]: for e in result: print e[0], e[1]
   ....:
司法局 0.730189085007
公安厅 0.634275555611
公安 0.612798035145
房管局 0.597343325615
商业局 0.597183346748
军管会 0.59476184845
体育局 0.59283208847
财政局 0.588721752167
戒毒所 0.575558543205
新闻办 0.573395550251

In [18]: result = model.most_similar(u"铁道部")

In [19]: for e in result: print e[0], e[1]
   ....:
盛光祖 0.565509021282
交通部 0.548688530922
批复 0.546967327595
刘志军 0.541010737419
立项 0.517836689949
报送 0.510296344757
计委 0.508456230164
水利部 0.503531932831
国务院 0.503227233887
经贸委 0.50156635046

In [20]: result = model.most_similar(u"清华大学")

In [21]: for e in result: print e[0], e[1]
   ....:
北京大学 0.763922810555
化学系 0.724210739136
物理系 0.694550514221
数学系 0.684280991554
中山大学 0.677202701569
复旦 0.657914161682
师范大学 0.656435549259
哲学系 0.654701948166
生物系 0.654403865337
中文系 0.653147578239

In [22]: result = model.most_similar(u"卫视")

In [23]: for e in result: print e[0], e[1]
   ....:
湖南 0.676812887192
中文台 0.626506924629
収蔵 0.621356606483
黄金档 0.582251906395
cctv 0.536769032478
安徽 0.536752820015
非同凡响 0.534517168999
唱响 0.533438682556
最强音 0.532605051994
金鹰 0.531676828861

In [24]: result = model.most_similar(u"习近平")

In [25]: for e in result: print e[0], e[1]
   ....:
胡锦涛 0.809472680092
江泽民 0.754633367062
李克强 0.739740967751
贾庆林 0.737033963203
曾庆红 0.732847094536
吴邦国 0.726941585541
总书记 0.719057679176
李瑞环 0.716384887695
温家宝 0.711952567101
王岐山 0.703570842743

In [26]: result = model.most_similar(u"林丹")

In [27]: for e in result: print e[0], e[1]
   ....:
黄综翰 0.538035452366
蒋燕皎 0.52646958828
刘鑫 0.522252976894
韩晶娜 0.516120731831
王晓理 0.512289524078
王适 0.508560419083
杨影 0.508159279823
陈跃 0.507353425026
龚智超 0.503159761429
李敬元 0.50262516737

In [28]: result = model.most_similar(u"语言学")

In [29]: for e in result: print e[0], e[1]
   ....:
社会学 0.632598280907
人类学 0.623406708241
历史学 0.618442356586
比较文学 0.604823827744
心理学 0.600066184998
人文科学 0.577783346176
社会心理学 0.575571238995
政治学 0.574541330338
地理学 0.573896467686
哲学 0.573873817921

In [30]: result = model.most_similar(u"计算机")

In [31]: for e in result: print e[0], e[1]
   ....:
自动化 0.674171924591
应用 0.614087462425
自动化系 0.611132860184
材料科学 0.607891201973
集成电路 0.600370049477
技术 0.597518980503
电子学 0.591316461563
建模 0.577238917351
工程学 0.572855889797
微电子 0.570086717606

In [32]: model.similarity(u"计算机", u"自动化")
Out[32]: 0.67417196002404789

In [33]: model.similarity(u"女人", u"男人")
Out[33]: 0.77537125129824813

In [34]: model.doesnt_match(u"早餐 晚餐 午餐 中心".split())
Out[34]: u'\u4e2d\u5fc3'

In [35]: print model.doesnt_match(u"早餐 晚餐 午餐 中心".split())
中心
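The doesnt_match call at the end of the session picks the word that fits least well with the others. As far as I understand it, the idea is to normalize each word vector, average them, and return the word whose vector is least similar to that average. The pure-Python sketch below follows that idea on made-up 2-dimensional vectors. The function names and numbers are assumptions for illustration, not gensim's actual code.

```python
# -*- coding: utf-8 -*-
import math


def unit(vec):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def odd_one_out(vectors, words):
    """Return the word whose unit vector is least similar to the group mean."""
    units = [unit(vectors[w]) for w in words]
    dim = len(units[0])
    mean = unit([sum(u[i] for u in units) / len(units) for i in range(dim)])
    scores = [sum(a * b for a, b in zip(u, mean)) for u in units]
    return words[scores.index(min(scores))]


# Made-up vectors: three meal words point one way, 中心 points another.
toy_vectors = {
    u"早餐": [0.9, 0.1],
    u"午餐": [1.0, 0.0],
    u"晚餐": [0.8, 0.2],
    u"中心": [0.1, 0.9],
}
outlier = odd_one_out(toy_vectors, [u"早餐", u"晚餐", u"午餐", u"中心"])
print(outlier)
```

Normalizing before averaging keeps a single long vector from dominating the mean, so the comparison depends on direction rather than magnitude.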
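There are good cases and bad ones, and the bad cases may even outnumber the good ones. This depends on the size of the corpus, the quality of the word segmenter, and so on, but I will stop the experiment here for now. As for what word2vec is good for: beyond computing word similarity, what the industry cares about more is how word2vec performs in concrete application tasks. That is the more interesting part, and everyone is welcome to discuss it.
Note: this is an original article. When reposting, please credit the source, "我爱自然语言处理" (I Love Natural Language Processing): www.52nlp.cn
Permalink: https://www.52nlp.cn/中英文维基百科语料上的word2vec实验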
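Blogger, does gensim's multi-core support not actually work? I set workers to 8, and although all 8 cores are running, each one is only at 30-40%, so the total is only about 300%.
[Reply]
52nlp replied:
17 March 2017 at 14:31
It does work. As for whether the speedup is linear, I haven't tested that in detail. Also note that multi-core support depends on Cython: https://rare-technologies.com/word2vec-tutorial/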
"The workers parameter has only effect if you have Cython installed. Without Cython, you’ll only be able to use one core because of the GIL (and word2vec training will be miserably slow)."
[Reply]
I ran into an error when using your process_wiki.py to process the English Wikipedia dump. My Python version is 2.7.13, and the error message is:
Traceback (most recent call last):
File "process_wiki.py", line 28, in
for text in wiki.get_texts():
File "C:\Users\LLL\Anaconda2\lib\site-packages\gensim\corpora\wikicorpus.py", line 304, in get_texts
for group in utils.chunkize(texts, chunksize=10 * self.processes, maxsize=1):
File "C:\Users\LLL\Anaconda2\lib\site-packages\gensim\utils.py", line 858, in chunkize
for chunk in chunkize_serial(corpus, chunksize, as_numpy=as_numpy):
File "C:\Users\LLL\Anaconda2\lib\site-packages\gensim\utils.py", line 810, in chunkize_serial
wrapped_chunk = [list(itertools.islice(it, int(chunksize)))]
File "C:\Users\LLL\Anaconda2\lib\site-packages\gensim\corpora\wikicorpus.py", line 300, in
texts = ((text, self.lemmatize, title, pageid) for title, text, pageid in extract_pages(bz2.BZ2File(self.fname), self.filter_namespaces))
File "C:\Users\LLL\Anaconda2\lib\site-packages\gensim\corpora\wikicorpus.py", line 212, in extract_pages
for elem in elems:
File "C:\Users\LLL\Anaconda2\lib\site-packages\gensim\corpora\wikicorpus.py", line 197, in
elems = (elem for _, elem in iterparse(f, events=("end",)))
File "", line 100, in next
IOError: invalid data stream
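Blogger, please take a look when you have time.
[Reply]
52nlp replied:
19 March 2017 at 20:57
I'm not sure about this problem on Windows. I googled it, and someone in the gensim group hit a similar problem. The gensim author answered that it may be caused by an incomplete download of the compressed file, so check whether your downloaded archive is complete: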
https://groups.google.com/forum/#!topic/gensim/7k0_ICYuYqg
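[Reply]
xiaowei replied:
19 March 2017 at 21:20
I had found that thread on Google earlier as well, so I downloaded the file again, but it still fails. Specifically, it extracts part of the data and only then raises this error. May I ask which environment you ran it in?
[Reply]
52nlp replied:
19 March 2017 at 22:04
I've run it on both Mac and Ubuntu, just not on Windows. Could you verify the md5 checksum to see whether the download is complete?
Hello~ What is this error, please?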
2017-03-27 16:16:40,927: INFO: running process_wiki.py zhwiki-latest-pages-articles.xml.bz2 wiki.zh.text
Traceback (most recent call last):
File "process_wiki.py", line 23, in
print globals()['__doc__'] % locals()
TypeError: unsupported operand type(s) for %: 'NoneType' and 'dict'
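[Reply]
52nlp replied:
27 March 2017 at 16:28
You're probably using Python 3, right? In Python 3, print needs parentheses. Also, this version targets Python 2 only, so there may be some problems under Python 3.
[Reply]
林二月 replied:
27 March 2017 at 16:29
Pasted the wrong code (lll¬ω¬)
[Reply]
Hello, after downloading the Wikipedia data and converting it to TXT, don't we need to run the process_wiki.py script? Do I just create a new .py file, paste your code into it, and run it? And how does it read the data?
[Reply]
52nlp replied:
13 April 2017 at 15:00
Isn't that stated clearly in the post? Create the .py file and run the following command: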
python process_wiki.py enwiki-latest-pages-articles.xml.bz2 wiki.en.text
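[Reply]
freedomzll replied:
19 April 2017 at 09:54
But running the script gives an error. Also, do the .py file and the Wikipedia data file need to be in a specific location, or can they go anywhere?
[Reply]
freedomzll replied:
19 April 2017 at 10:09
I ran the command directly and the conversion is already under way. Maybe I was fretting too much over whether each step was right. The error when running the script doesn't seem to affect the conversion. Thank you.
[Reply]
52nlp replied:
19 April 2017 at 11:02
You need to brush up on some basic Python and scripting knowledge.
freedomzll replied:
24 April 2017 at 10:08
OK, thank you. One more question: converting the XML to text took me three and a half hours, not very different from your time. But training the word vectors has been running for almost 5 days and is only at 74%, while yours took just 7 hours. I'm using a somewhat newer English Wikipedia corpus, but the gap shouldn't be this large. Something feels off!
52nlp replied:
24 April 2017 at 14:33
That is indeed slow. However, multithreading in gensim's word2vec depends on Cython. If Cython is not installed on your system, multithreading may not be taking effect.
52nlp replied:
24 April 2017 at 14:35
I answered this for another reader earlier. You can refer to the original tutorial:
It does work. As for whether the speedup is linear, I haven't tested that in detail. Also note that multi-core support depends on Cython: https://rare-technologies.com/word2vec-tutorial/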
"The workers parameter has only effect if you have Cython installed. Without Cython, you’ll only be able to use one core because of the GIL (and word2vec training will be miserably slow)."
freedomzll replied:
26 April 2017 at 09:49
The run finished, but there is no wiki.en.text.vector file, only wiki.en.text.model, wiki.en.text.model.syn1neg.npy and wiki.en.text.model.wv.syn0.npy. Why is that? Here is the output at the end of the run:
2017-04-25 20:41:48,924: INFO: PROGRESS: at 100.00% examples, 17501 words/s, in_qsize 5, out_qsize 1
2017-04-25 20:41:49,799: INFO: worker thread finished; awaiting finish of 3 more threads
2017-04-25 20:41:50,046: INFO: worker thread finished; awaiting finish of 2 more threads
2017-04-25 20:41:50,898: INFO: PROGRESS: at 100.00% examples, 17501 words/s, in_qsize 1, out_qsize 1
2017-04-25 20:41:50,900: INFO: worker thread finished; awaiting finish of 1 more threads
2017-04-25 20:41:51,007: INFO: worker thread finished; awaiting finish of 0 more threads
2017-04-25 20:41:51,022: INFO: training on 11632974485 raw words (9432720823 effective words) took 5
38972.6s, 17501 effective words/s
2017-04-25 20:41:51,885: INFO: saving Word2Vec object under wiki.en.text.model, separately None
2017-04-25 20:41:52,539: INFO: not storing attribute syn0norm
2017-04-25 20:41:52,571: INFO: storing np array 'syn0' to wiki.en.text.model.wv.syn0.npy
2017-04-25 21:36:54,424: INFO: storing np array 'syn1neg' to wiki.en.text.model.syn1neg.npy
2017-04-25 21:43:50,877: INFO: not storing attribute cum_table
2017-04-25 22:02:55,186: INFO: saved wiki.en.text.model
Traceback (most recent call last):
File "train_word2vec_model.py", line 33, in <module>
model.save_word2vec_format(outp2, binary=False)
File "C:\ProgramData\Anaconda2\lib\site-packages\gensim\models\word2vec.py", line 1452, in save_word2vec_format
raise DeprecationWarning("Deprecated. Use model.wv.save_word2vec_format instead.")
DeprecationWarning: Deprecated. Use model.wv.save_word2vec_format instead.
52nlp replied:
26 April 2017 at 10:46
As the traceback above says, this interface is deprecated: "DeprecationWarning: Deprecated. Use model.wv.save_word2vec_format instead."
You can use model.wv.save_word2vec_format in its place.
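A minimal sketch of that replacement, written so it works whether the vectors live on `model.wv` (as the deprecation message says) or directly on the model as in very old gensim releases. The helper name `save_text_vectors` is hypothetical:

```python
def save_text_vectors(model, path, binary=False):
    """Write vectors in the word2vec format across gensim versions."""
    # Newer gensim keeps the vectors on model.wv; very old releases
    # exposed save_word2vec_format directly on the model object.
    vectors = getattr(model, "wv", model)
    vectors.save_word2vec_format(path, binary=binary)
    return vectors
```

The log shows that wiki.en.text.model was saved before the exception was raised. It should therefore be possible to load that file with `Word2Vec.load` and call the helper on it, rather than retraining from scratch.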
freedomzll replied:
2 May 2017 at 17:23
-> 1410 model = super(Word2Vec, cls).load(*args, **kwargs)
1411 # update older models
1412 if hasattr(model, 'table'):
C:\ProgramData\Anaconda2\lib\site-packages\gensim\utils.pyc in load(cls, fname, mmap)
269 compress, subname = SaveLoad._adapt_by_suffix(fname)
270
--> 271 obj = unpickle(fname)
272 obj._load_specials(fname, mmap, compress, subname)
273 logger.info("loaded %s", fname)
C:\ProgramData\Anaconda2\lib\site-packages\gensim\utils.pyc in unpickle(fname)
933 return _pickle.load(f, encoding='latin1')
934 else:
--> 935 return _pickle.loads(f.read())
936
937
UnpicklingError: unpickling stack underflow
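OP, what is going on with this stack underflow? Every step seems to go wrong for me. Please help.
52nlp replied:
3 May 2017 at 09:28
I have not run into this kind of problem. Have a look at the Stack Overflow discussion of it: http://stackoverflow.com/questions/23964048/python-unpickling-stack-underflow
"unpickling stack underflow can happen when a pickle ends unexpectedly."
Could your pickle file be incomplete?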
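A quick way to test the "incomplete pickle" hypothesis is to try reading the file to the end and see whether the unpickler gives up early. This is only a sketch: the helper name `pickle_is_complete` is my own, and it assumes the classes stored in the pickle (for example gensim's) are importable and that the file was written by the same major Python version.

```python
import pickle


def pickle_is_complete(path):
    """Return True if the file at `path` unpickles to the end."""
    try:
        with open(path, "rb") as handle:
            pickle.load(handle)
    except (pickle.UnpicklingError, EOFError):
        # Raised when the stream is truncated or otherwise malformed.
        return False
    return True
```

If this returns False for the file being loaded, it is worth checking that the path points at the saved .model file rather than at one of the .npy files or a partially written copy.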
freedomzll replied:
3 May 2017 at 15:05
The error appears when I load and test the model with gensim in IPython. I do get a warning when I import gensim; OP, do you think the stack underflow is related to this warning?
C:\ProgramData\Anaconda2\lib\site-packages\gensim\utils.py:860: UserWarning: detected Windows; aliasing chunkize to chunkize_serial
warnings.warn("detected Windows; aliasing chunkize to chunkize_serial")
52nlp replied:
3 May 2017 at 15:41
I am not sure.
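freedomzll replied:
5 May 2017 at 14:12
Hello OP, loading the .model file does work for me, and I did not run into the problem of training failing on large data.
Do the word vectors have to be reloaded every time I want to look at them? It is quite time-consuming.
52nlp replied:
5 May 2017 at 15:15
With a recent gensim version, that problem indeed no longer occurs after training. To look at word vectors you must load the model first, but loading once is enough, unless you close the session after every lookup.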
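The "load once, query many times" advice can be sketched with a small cache around whatever loading function is in use. The helper name `make_cached_loader` is hypothetical, and the sketch only caches within a single Python process:

```python
from functools import lru_cache


def make_cached_loader(load_fn):
    """Wrap `load_fn` so each path is loaded at most once per process."""
    @lru_cache(maxsize=None)
    def load(path):
        return load_fn(path)
    return load
```

With gensim this might be used as `load_model = make_cached_loader(gensim.models.Word2Vec.load)`, after which repeated calls to `load_model("wiki.en.text.model")` return the same in-memory object instead of reading the file again.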
Hello, and first of all thank you for the article.
I get the following error when using the model at the end:
import gensim
model = gensim.models.word2vec.load("xxxxxx")
The error is: 'module' object has no attribute 'load'
52nlp replied:
16 April 2017 at 09:42
The interface seems to have changed in recent gensim releases. Try:
model = gensim.models.word2vec.Word2Vec.load(MODEL_PATH)
何宇飞 replied:
17 April 2017 at 16:30
Thank you very much.
Hello, I have the same problem as you; it always fails with
'module' object has no attribute 'load'
But why does the code you suggested as correct fail as well?
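Hello OP, is it OK to copy the .py file from the first step into PyCharm and then run it from the console? Running it directly inside PyCharm gave me an error, so that is not the right way, is it?
52nlp replied:
18 April 2017 at 11:49
What is the error message? Have you installed gensim? The first script depends on gensim too.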
OP, what is causing this error when I run the script?
E:\PycharmProjects\test>python process_wiki.py
C:\ProgramData\Anaconda2\lib\site-packages\gensim\utils.py:860: UserWarning: detected Windows; aliasing chunkize to chunkize_serial
warnings.warn("detected Windows; aliasing chunkize to chunkize_serial")
2017-04-18 15:12:55,609: INFO: running process_wiki.py
Traceback (most recent call last):
File "process_wiki.py", line 21, in <module>
print globals()['__doc__'] % locals()
TypeError: unsupported operand type(s) for %: 'NoneType' and 'dict'
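52nlp replied:
18 April 2017 at 15:44
You are probably using Python 3? In Python 3, print needs parentheses. Also, this version targets Python 2 only; there may be some problems under Python 3.
边走边看 replied:
8 May 2017 at 11:35
Hello OP, I have run into the same problem even with the parentheses added. I hope you can help.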
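52nlp replied:
8 May 2017 at 11:45
First: the script invocation is incomplete, which is why it fails. Give the full command:
python process_wiki.py enwiki-latest-pages-articles.xml.bz2 wiki.en.text
Second: on closer look this is not a Python 2 versus Python 3 problem. The code was somewhat misleading; I have revised it above, so you can copy the code again and rerun it.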
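The TypeError in the traceback comes from `globals()['__doc__']` being None: the script has no module docstring, so `None % locals()` fails before any usage message can be printed. That line is only reached when the two file arguments are missing. A minimal sketch of an argument check with an explicit usage string, where the names `USAGE` and `parse_args` are my own:

```python
USAGE = "Usage: python process_wiki.py enwiki.xxx.xml.bz2 wiki.en.text"


def parse_args(argv):
    """Return (input_path, output_path) or raise ValueError with the usage text."""
    # argv[0] is the script name, so exactly two real arguments are expected.
    if len(argv) != 3:
        raise ValueError(USAGE)
    return argv[1], argv[2]
```

In a script this would be called as `parse_args(sys.argv)`, printing the error and exiting with a non-zero status when ValueError is raised.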
OP, what is the error below about? Please help.
E:\PycharmProjects\test>python process_wiki.py
C:\ProgramData\Anaconda2\lib\site-packages\gensim\utils.py:860: UserWarning: detected Windows; aliasing chunkize to chunkize_serial
warnings.warn("detected Windows; aliasing chunkize to chunkize_serial")
2017-04-18 15:12:55,609: INFO: running process_wiki.py
Traceback (most recent call last):
File "process_wiki.py", line 21, in <module>
print globals()['__doc__'] % locals()
TypeError: unsupported operand type(s) for %: 'NoneType' and 'dict'
52nlp replied:
19 April 2017 at 09:35
Same question as above, same answer.
Is python process_wiki.py zhwiki-latest-pages-articles.xml.bz2 wiki.zh.text supposed to be run in a shell?
Why do I get SyntaxError: invalid syntax?
I am on version 2.7.
52nlp replied:
25 April 2017 at 12:14
Yes, in a shell. Is there any more detail on the "SyntaxError: invalid syntax"?
Could the blogger share the processed English wiki txt data, the roughly 12 GB txt file? Thanks.
52nlp replied:
3 May 2017 at 14:29
Sorry, it is too large to share easily.
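一缕清风 replied:
3 May 2017 at 14:51
OK, thanks. I am about to work on some applications of word2vec to text feature extraction and need an English txt corpus of 1 GB or more. The wiki corpus is too large and puts too much strain on my computer, so does the blogger have an English corpus of 100 MB or more? If so, please send a copy to 446810140@qq.com; if not, thanks anyway.
I would also like to ask which models could be compared against word2vec. I plan to use the word similarities from a model comparison to demonstrate word2vec's advantages.
一缕清风 replied:
3 May 2017 at 14:56
Sorry to trouble you. My computer has only 4 GB of memory, so it cannot handle the wiki corpus Q_Q, but the wiki corpus is the only public English corpus above 100 MB. Thanks.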
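52nlp replied:
3 May 2017 at 15:48
Have a look at the Stanford movie review corpus used for sentiment analysis; it is a little over 80 MB.
52nlp replied:
3 May 2017 at 15:41
As I recall, this approach does not need much memory. You could also consider the method in an article I wrote recently: extract the corpus with WikiExtractor and feed the model through a yield-based generator, which is memory-friendly:
https://www.52nlp.cn/%E7%BB%B4%E5%9F%BA%E7%99%BE%E7%A7%91%E8%AF%AD%E6%96%99%E4%B8%AD%E7%9A%84%E8%AF%8D%E8%AF%AD%E7%9B%B8%E4%BC%BC%E5%BA%A6%E6%8E%A2%E7%B4%A2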
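The yield-based idea can be sketched as an iterable that reads one line at a time, so only the current sentence is held in memory. This is an illustration rather than the code from the linked article; the class name `LineSentences` is my own, and it assumes one whitespace-tokenized document per line, as in the wiki.en.text file produced earlier.

```python
import io


class LineSentences(object):
    """Stream one whitespace-tokenized sentence per line of a text file."""

    def __init__(self, path, encoding="utf-8"):
        self.path = path
        self.encoding = encoding

    def __iter__(self):
        # Reopening the file on every pass keeps the iterable restartable,
        # which multi-pass training needs; a bare generator would be
        # exhausted after the first pass.
        with io.open(self.path, encoding=self.encoding) as handle:
            for line in handle:
                tokens = line.split()
                if tokens:
                    yield tokens
```

gensim ships a similar helper, `LineSentence` in `gensim.models.word2vec`, which can be passed to `Word2Vec` in the same way.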
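52nlp replied:
3 May 2017 at 15:49
The wiki corpus is the best choice. For other models, take a look at GloVe, fastText, and the xx2vec variants that add syntactic features; there are many derivatives.
一缕清风 replied:
4 May 2017 at 14:45
Thank you, blogger ^_^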
Is the similarity function usable? Is there any API documentation for it? Please advise.
52nlp replied:
7 May 2017 at 22:32
Yes, it is. For details see the gensim word2vec documentation: https://radimrehurek.com/gensim/models/word2vec.html
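For readers wondering what `similarity` returns: gensim documents it as the cosine similarity between the two word vectors. A plain-Python sketch of that formula follows; the function name `cosine_similarity` and the toy vectors are my own, not gensim's implementation.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)
```

Values close to 1 mean the vectors point in nearly the same direction, and values near 0 mean they are close to orthogonal.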
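OP, with Python 3.6, when I run python process_wiki.py zhwiki-latest-pages-articles.xml.bz2 wiki.zh.text I get this error: File "", line 1
python process_wiki.py zhwiki-latest-pages-articles.xml.bz2 wiki.zh.text
^
SyntaxError: invalid syntax
How should I fix this? Also, OP, could you leave some contact information? I am working on a research project and would like to ask you about it in more detail.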