Yue Yongpeng, Knownsec IA-Lab
Today, the open-source Python packages for English NLP include NLTK, spaCy, StanfordCoreNLP, GATE, and OpenNLP, while Chinese is covered by toolkits such as Jieba, ICTCLAS, THULAC, and HIT's LTP; most of these tools, however, only support a specific set of languages. This article introduces polyglot, a powerful multilingual NLP toolkit for Python that supports pipeline-style processing. The project was open-sourced on GitHub by aboSamoor on March 16, 2015 and has gathered 1,021 stars so far.
- Free software: GPLv3 license
- Documentation: http://polyglot.readthedocs.org
- GitHub: https://github.com/aboSamoor/polyglot
Features
- Language Detection (196 languages)
- Tokenization: sentence and word segmentation (165 languages)
- Named Entity Recognition (40 languages)
- Part-of-Speech Tagging (16 languages)
- Sentiment Analysis (136 languages)
- Word Embeddings (137 languages)
- Transliteration (69 languages)
- Pipelines
Installation
Install or upgrade from PyPI:
$ pip install polyglot
polyglot depends on numpy and libicu-dev. On Ubuntu/Debian Linux distributions you can install these packages with:
$ sudo apt-get install python-numpy libicu-dev
After a successful installation, check the version from a Python shell:
$ python
>>> import polyglot
>>> polyglot.__version__
'16.07.04'
Data
The examples that follow use Chinese, English, and mixed Chinese-English sentences as test data:
text_en = u"Japan's last pager provider has announced it will end its service in September 2019 - bringing a national end to telecommunication beepers, 50 years after their introduction.Around 1,500 users remain subscribed to Tokyo Telemessage, which has not made the devices in 20 years."
text_cn = u"日本最后一家寻呼机服务营业商宣布,将于2019年9月结束服务,标志着日本寻呼业长达50年的历史正式落幕。目前大约还有1500名用户使用东京电信通信公司提供的寻呼服务,该公司在20年前就已停止生产寻呼机。"
text_mixed = text_cn + text_en
Language Detection
polyglot's language detection is built on pycld2 and cld2; cld2 is a multilingual detection library developed by Google.
Example
Import the dependency:
from polyglot.detect import Detector
Detect the language:
>>> Detector(text_cn).language
name: Chinese     code: zh   confidence: 99.0   read bytes: 1996
>>> Detector(text_en).language
name: English     code: en   confidence: 99.0   read bytes: 1144
>>> Detector(text_mixed).language
name: Chinese     code: zh   confidence: 50.0   read bytes: 1996
For the mixed Chinese-English text_mixed, the detected language is Chinese, but with a confidence of only 50. To list every language contained in the text:
>>> for language in Detector(text_mixed):
...     print(language)
name: Chinese     code: zh   confidence: 50.0   read bytes: 1996
name: English     code: en   confidence: 49.0   read bytes: 1144
name: un          code: un   confidence:  0.0   read bytes:    0
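For very short or ambiguous strings, cld2 may fail to reach a reliable verdict and Detector raises an exception; according to the polyglot documentation, passing quiet=True returns a best-effort guess instead. A minimal sketch:

from polyglot.detect import Detector

# Short inputs often carry too little signal for a reliable guess;
# quiet=True silences the detection error and returns a best effort.
detector = Detector(u"寻呼机", quiet=True)
print(detector.language)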
The language types currently detectable by cld2:
>>> Detector.supported_languages()
  1. Abkhazian    2. Afar    3. Afrikaans    4. Akan    5. Albanian    6. Amharic    7. Arabic    8. Armenian
  9. Assamese    10. Aymara    11. Azerbaijani    12. Bashkir    13. Basque    14. Belarusian    15. Bengali    16. Bihari
 17. Bislama    18. Bosnian    19. Breton    20. Bulgarian    21. Burmese    22. Catalan    23. Cebuano    24. Cherokee
 25. Nyanja    26. Corsican    27. Croatian    28. Croatian    29. Czech    30. Chinese    31. Chinese    32. Chinese
 33. Chinese    34. Chineset    35. Chineset    36. Chineset    37. Chineset    38. Chineset    39. Chineset    40. Danish
 41. Dhivehi    42. Dutch    43. Dzongkha    44. English    45. Esperanto    46. Estonian    47. Ewe    48. Faroese
 49. Fijian    50. Finnish    51. French    52. Frisian    53. Ga    54. Galician    55. Ganda    56. Georgian
 57. German    58. Greek    59. Greenlandic    60. Guarani    61. Gujarati    62. Haitian_creole    63. Hausa    64. Hawaiian
 65. Hebrew    66. Hebrew    67. Hindi    68. Hmong    69. Hungarian    70. Icelandic    71. Igbo    72. Indonesian
 73. Interlingua    74. Interlingue    75. Inuktitut    76. Inupiak    77. Irish    78. Italian    79. Ignore    80. Javanese
 81. Javanese    82. Japanese    83. Kannada    84. Kashmiri    85. Kazakh    86. Khasi    87. Khmer    88. Kinyarwanda
 89. Krio    90. Kurdish    91. Kyrgyz    92. Korean    93. Laothian    94. Latin    95. Latvian    96. Limbu
 97. Limbu    98. Limbu    99. Lingala    100. Lithuanian    101. Lozi    102. Luba_lulua    103. Luo_kenya_and_tanzania    104. Luxembourgish
105. Macedonian    106. Malagasy    107. Malay    108. Malayalam    109. Maltese    110. Manx    111. Maori    112. Marathi
113. Mauritian_creole    114. Romanian    115. Mongolian    116. Montenegrin    117. Montenegrin    118. Montenegrin    119. Montenegrin    120. Nauru
121. Ndebele    122. Nepali    123. Newari    124. Norwegian    125. Norwegian    126. Norwegian_n    127. Nyanja    128. Occitan
129. Oriya    130. Oromo    131. Ossetian    132. Pampanga    133. Pashto    134. Pedi    135. Persian    136. Polish
137. Portuguese    138. Punjabi    139. Quechua    140. Rajasthani    141. Rhaeto_romance    142. Romanian    143. Rundi    144. Russian
145. Samoan    146. Sango    147. Sanskrit    148. Scots    149. Scots_gaelic    150. Serbian    151. Serbian    152. Seselwa
153. Seselwa    154. Sesotho    155. Shona    156. Sindhi    157. Sinhalese    158. Siswant    159. Slovak    160. Slovenian
161. Somali    162. Spanish    163. Sundanese    164. Swahili    165. Swedish    166. Syriac    167. Tagalog    168. Tajik
169. Tamil    170. Tatar    171. Telugu    172. Thai    173. Tibetan    174. Tigrinya    175. Tonga    176. Tsonga
177. Tswana    178. Tumbuka    179. Turkish    180. Turkmen    181. Twi    182. Uighur    183. Ukrainian    184. Urdu
185. Uzbek    186. Venda    187. Vietnamese    188. Volapuk    189. Waray_philippines    190. Welsh    191. Wolof    192. Xhosa
193. Yiddish    194. Yoruba    195. Zhuang    196. Zulu
Tokenization
NLP tasks can operate at the character, word, sentence, paragraph, or document level, and tokenization is the job of finding the character, word, sentence, and paragraph boundaries. Paragraphs can be split on \n or \r\n, and character splitting is likewise easy to implement, while sentence and word segmentation are considerably harder; a trivial paragraph-splitting sketch follows, and polyglot handles the hard parts.
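For completeness, a one-line paragraph-splitting sketch that reuses the test data defined above (the \r\n normalization is our own convention, not part of polyglot):

# Normalize Windows line endings, split on newlines, and drop blank lines.
raw_text = text_cn + u"\n" + text_en
paragraphs = [p for p in raw_text.replace(u"\r\n", u"\n").split(u"\n") if p.strip()]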
Example
Import the dependency:
from polyglot.text import Text
Sentence segmentation:
>>> Text(text_cn).sentences
[Sentence("日本最后一家寻呼机服务营业商宣布,将于2019年9月结束服务,标志着日本寻呼业长达50年的历史正式落幕。"),
 Sentence("目前大约还有1500名用户使用东京电信通信公司提供的寻呼服务,该公司在20年前就已停止生产寻呼机。")]
>>> Text(text_en).sentences
[Sentence("Japan's last pager provider has announced it will end its service in September 2019 - bringing a national end to telecommunication beepers, 50 years after their introduction."),
 Sentence("Around 1,500 users remain subscribed to Tokyo Telemessage, which has not made the devices in 20 years.")]
>>> Text(text_mixed).sentences
[Sentence("Japan's last pager provider has announced it will end its service in September 2019 - bringing a national end to telecommunication beepers, 50 years after their introduction."),
 Sentence("Around 1,500 users remain subscribed to Tokyo Telemessage, which has not made the devices in 20 years."),
 Sentence("日本最后一家寻呼机服务营业商宣布,将于2019年9月结束服务,标志着日本寻呼业长达50年的历史正式落幕。"),
 Sentence("目前大约还有1500名用户使用东京电信通信公司提供的寻呼服务,该公司在20年前就已停止生产寻呼机。")]
Word segmentation:
>>> Text(text_cn).words
日本 最后 一家 寻 呼 机 服务 营业 商 宣布 , 将 于 2019 年 9 月 结束 服务 , 标志 着 日本 寻 呼 业 长达 50 年 的 历史 正式 落幕 。 目前 大约 还有 1500 名 用户 使用 东京 电信 通信 公司 提供 的 寻 呼 服务 , 该 公司 在 20 年前 就 已 停止 生产 寻 呼 机 。
>>> Text(text_en).words
Japan's last pager provider has announced it will end its service in September 2019 - bringing a national end to telecommunication beepers , 50 years after their introduction.Around 1,500 users remain subscribed to Tokyo Telemessage , which has not made the devices in 20 years .
>>> Text(text_mixed).words
Japan's last pager provider has announced it will end its service in September 2019 - bringing a national end to telecommunication beepers , 50 years after their introduction.Around 1,500 users remain subscribed to Tokyo Telemessage , which has not made the devices in 20 years . 日本 最后 一家 寻 呼 机 服务 营业 商 宣布 , 将 于 2019 年 9 月 结束 服务 , 标志 着 日本 寻 呼 业 长达 50 年 的 历史 正式 落幕 。 目前 大约 还有 1500 名 用户 使用 东京 电信 通信 公司 提供 的 寻 呼 服务 , 该 公司 在 20 年前 就 已 停止 生产 寻 呼 机 。
Note that the missing space after "introduction." in text_en makes the tokenizer emit "introduction.Around" as a single token.
Named Entity Recognition
Named entity recognition picks out spans of text that carry a specific meaning. Entities commonly fall into three categories:
- Entity names: persons, locations, organizations, products, brands, and so on
- Time expressions: dates and times
- Numeric expressions: birthdays, phone numbers, QQ numbers, and so on
There are likewise three broad families of NER methods:
- Rule-based (linguistic grammar-based techniques): in engineering practice this mostly means writing large numbers of regular expressions (RegEx); it can cover part of the time and numeric entity types (see the sketch after this list).
- Statistical models: today chiefly HMM and CRF models, the most mature approach.
- Deep learning models: currently the most popular approach, especially RNN-family models, which absorb more of the text's semantics and give the best results to date.
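As a tiny illustration of the rule-based approach (the patterns below are hypothetical examples and are not part of polyglot), two regular expressions already cover one time-type and one numeric-type entity:

import re

# Hypothetical rules: a Chinese year-month date and a hyphenated phone number.
DATE_RE = re.compile(u"\\d{4}年\\d{1,2}月")   # matches dates like 2019年9月
PHONE_RE = re.compile(r"\d{3}-\d{4}-\d{4}")   # matches numbers like 138-0000-0000

sample = u"将于2019年9月结束服务,客服电话138-0000-0000。"
print(DATE_RE.findall(sample))    # [u'2019年9月']
print(PHONE_RE.findall(sample))   # [u'138-0000-0000']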
polyglot's NER models are trained on Wikipedia (WIKI) text. The trained models are not included in the initial installation and have to be downloaded separately. polyglot recognizes entity names (persons, locations, organizations) in 40 languages:
>>> from polyglot.downloader import downloader
>>> print(downloader.supported_languages_table("ner2", 3))
1. Polish  2. Turkish  3. Russian  4. Indonesian  5. Czech
6. Arabic  7. Korean  8. Catalan; Valencian  9. Italian  10. Thai
11. Romanian, Moldavian, ...  12. Tagalog  13. Danish  14. Finnish  15. German
16. Persian  17. Dutch  18. Chinese  19. French  20. Portuguese
21. Slovak  22. Hebrew (modern)  23. Malay  24. Slovene  25. Bulgarian
26. Hindi  27. Japanese  28. Hungarian  29. Croatian  30. Ukrainian
31. Serbian  32. Lithuanian  33. Norwegian  34. Latvian  35. Swedish
36. English  37. Greek, Modern  38. Spanish; Castilian  39. Vietnamese  40. Estonian
Model download
Download the English and Chinese NER models, together with the word embeddings they rely on:
$ polyglot download ner2.en ner2.zh embeddings2.zh embeddings2.en
[polyglot_data] Downloading package ner2.en to
[polyglot_data]     /home/user/polyglot_data...
[polyglot_data] Downloading package ner2.zh to
[polyglot_data]     /home/user/polyglot_data...
[polyglot_data] Downloading package embeddings2.zh to
[polyglot_data]     /home/user/polyglot_data...
[polyglot_data] Downloading package embeddings2.en to
[polyglot_data]     /home/user/polyglot_data...
Example
Import the dependency:
>>> from polyglot.text import Text
Entity recognition:
>>> Text(text_cn).entities
[I-ORG([u'东京'])]
>>> Text(text_en).entities
[I-LOC([u'Tokyo'])]
>>> Text(text_mixed).entities
[I-ORG([u'东京'])]
Part-of-Speech Tagging
Part-of-speech tagging attaches a POS label to each token. The commonly used tags are:
- ADJ: adjective
- ADP: adposition
- ADV: adverb
- AUX: auxiliary verb
- CONJ: coordinating conjunction
- DET: determiner
- INTJ: interjection
- NOUN: noun
- NUM: numeral
- PART: particle
- PRON: pronoun
- PROPN: proper noun
- PUNCT: punctuation
- SCONJ: subordinating conjunction
- SYM: symbol
- VERB: verb
- X: other
polyglot's POS tagger is trained on CoNLL corpora. 16 languages are supported; Chinese is not among them:
>>> from polyglot.downloader import downloader
>>> print(downloader.supported_languages_table("pos2"))
1. German  2. Italian  3. Danish  4. Czech
5. Slovene  6. French  7. English  8. Swedish
9. Bulgarian  10. Spanish; Castilian  11. Indonesian  12. Portuguese
13. Finnish  14. Irish  15. Hungarian  16. Dutch
Model download
Download the English POS model:
$ polyglot download pos2.en
[polyglot_data] Downloading package pos2.en to
[polyglot_data]     /home/user/polyglot_data...
Example
Import the dependency:
from polyglot.text import Text
POS tagging:
>>> Text(text_en).pos_tags
[(u"Japan's", u'NUM'), (u'last', u'ADJ'), (u'pager', u'NOUN'), (u'provider', u'NOUN'), (u'has', u'AUX'),
 (u'announced', u'VERB'), (u'it', u'PRON'), (u'will', u'AUX'), (u'end', u'VERB'), (u'its', u'PRON'),
 (u'service', u'NOUN'), (u'in', u'ADP'), (u'September', u'PROPN'), (u'2019', u'NUM'), (u'-', u'PUNCT'),
 (u'bringing', u'VERB'), (u'a', u'DET'), (u'national', u'ADJ'), (u'end', u'NOUN'), (u'to', u'ADP'),
 (u'telecommunication', u'VERB'), (u'beepers', u'NUM'), (u',', u'PUNCT'), (u'50', u'NUM'), (u'years', u'NOUN'),
 (u'after', u'ADP'), (u'their', u'PRON'), (u'introduction.Around', u'NUM'), (u'1,500', u'NUM'), (u'users', u'NOUN'),
 (u'remain', u'VERB'), (u'subscribed', u'VERB'), (u'to', u'ADP'), (u'Tokyo', u'PROPN'), (u'Telemessage', u'PROPN'),
 (u',', u'PUNCT'), (u'which', u'DET'), (u'has', u'AUX'), (u'not', u'PART'), (u'made', u'VERB'),
 (u'the', u'DET'), (u'devices', u'NOUN'), (u'in', u'ADP'), (u'20', u'NUM'), (u'years', u'NOUN'), (u'.', u'PUNCT')]
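A common downstream use is filtering tokens by tag, for example keeping only the nouns and proper nouns; a small sketch built on the output above:

from polyglot.text import Text

# pos_tags yields (word, tag) pairs; keep the tokens tagged NOUN or PROPN.
nouns = [w for w, tag in Text(text_en).pos_tags if tag in (u"NOUN", u"PROPN")]
print(nouns)   # [u'pager', u'provider', u'service', u'September', ...]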
Sentiment Analysis
polyglot's sentiment analysis is word-level: each token is marked +1 (positive), 0 (neutral), or -1 (negative). 136 languages are currently supported:
>>> from polyglot.downloader import downloader
>>> print(downloader.supported_languages_table("sentiment2"))
1. Turkmen  2. Thai  3. Latvian  4. Zazaki  5. Tagalog
6. Tamil  7. Tajik  8. Telugu  9. Luxembourgish, Letzeb...  10. Alemannic
11. Latin  12. Turkish  13. Limburgish, Limburgan...  14. Egyptian Arabic  15. Tatar
16. Lithuanian  17. Spanish; Castilian  18. Basque  19. Estonian  20. Asturian
21. Greek, Modern  22. Esperanto  23. English  24. Ukrainian  25. Marathi (Marāṭhī)
26. Maltese  27. Burmese  28. Kapampangan  29. Uighur, Uyghur  30. Uzbek
31. Malagasy  32. Yiddish  33. Macedonian  34. Urdu  35. Malayalam
36. Mongolian  37. Breton  38. Bosnian  39. Bengali  40. Tibetan Standard, Tib...
41. Belarusian  42. Bulgarian  43. Bashkir  44. Vietnamese  45. Volapük
46. Gan Chinese  47. Manx  48. Gujarati  49. Yoruba  50. Occitan
51. Scottish Gaelic; Gaelic  52. Irish  53. Galician  54. Ossetian, Ossetic  55. Oriya
56. Walloon  57. Swedish  58. Silesian  59. Lombard language  60. Divehi; Dhivehi; Mald...
61. Danish  62. German  63. Armenian  64. Haitian; Haitian Creole  65. Hungarian
66. Croatian  67. Bishnupriya Manipuri  68. Hindi  69. Hebrew (modern)  70. Portuguese
71. Afrikaans  72. Pashto, Pushto  73. Amharic  74. Aragonese  75. Bavarian
76. Assamese  77. Panjabi, Punjabi  78. Polish  79. Azerbaijani  80. Italian
81. Arabic  82. Icelandic  83. Ido  84. Scots  85. Sicilian
86. Indonesian  87. Chinese Word  88. Interlingua  89. Waray-Waray  90. Piedmontese language
91. Quechua  92. French  93. Dutch  94. Norwegian Nynorsk  95. Norwegian
96. Western Frisian  97. Upper Sorbian  98. Nepali  99. Persian  100. Ilokano
101. Finnish  102. Faroese  103. Romansh  104. Javanese  105. Romanian, Moldavian, ...
106. Malay  107. Japanese  108. Russian  109. Catalan; Valencian  110. Fiji Hindi
111. Chinese  112. Cebuano  113. Czech  114. Chuvash  115. Welsh
116. West Flemish  117. Kirghiz, Kyrgyz  118. Kurdish  119. Kazakh  120. Korean
121. Kannada  122. Khmer  123. Georgian  124. Sakha  125. Serbian
126. Albanian  127. Swahili  128. Chechen  129. Sundanese  130. Sanskrit (Saṁskṛta)
131. Venetian  132. Northern Sami  133. Slovak  134. Sinhala, Sinhalese  135. Bosnian-Croatian-Serbian
136. Slovene
Model download
Download the English and Chinese sentiment models:
$ polyglot download sentiment2.en sentiment2.zh
[polyglot_data] Downloading package sentiment2.en to
[polyglot_data]     /home/user/polyglot_data...
[polyglot_data] Downloading package sentiment2.zh to
[polyglot_data]     /home/user/polyglot_data...
Example
Import the dependency:
from polyglot.text import Text
Sentiment analysis:
>>> text = Text("The movie is very good and the actors are prefect, but the cinema environment is very poor.") >>> print(text.words,text.polarity) (WordList([u'The', u'movie', u'is', u'very', u'good', u'and', u'the', u'actors', u'are', u'prefect', u',', u'but', u'the', u'cinema', u'environment', u'is', u'very', u'poor', u'.']), 0.0) >>> print([(w,w.polarity) for w in text.words]) [(u'The', 0), (u'movie', 0), (u'is', 0), (u'very', 0), (u'good', 1), (u'and', 0), (u'the', 0), (u'actors', 0), (u'are', 0), (u'prefect', 0), (u',', 0), (u'but', 0), (u'the', 0), (u'cinema', 0), (u'environment', 0), (u'is', 0), (u'very', 0), (u'poor', -1), (u'.', 0)] >>> text = Text("这部电影故事非常好,演员也非常棒,但是电影院环境非常差。") >>> print(text.words,text.polarity) (WordList([这 部 电影 故事 非常 好 , 演员 也 非常 棒 , 但是 电影 院 环境 非常 差 。]), 0.0) >>> print([(w,w.polarity) for w in text.words]) [(u'\u8fd9', 0), (u'\u90e8', 0), (u'\u7535\u5f71', 0), (u'\u6545\u4e8b', 0), (u'\u975e\u5e38', 0), (u'\u597d', 1), (u'\uff0c', 0), (u'\u6f14\u5458', 0), (u'\u4e5f', 0), (u'\u975e\u5e38', 0), (u'\u68d2', 0), (u'\uff0c', 0), (u'\u4f46\u662f', 0), (u'\u7535\u5f71', 0), (u'\u9662', 0), (u'\u73af\u5883', 0), (u'\u975e\u5e38', 0), (u'\u5dee', -1), (u'\u3002', 0)]
Word Embeddings
In NLP, word embedding is the collective name for a family of language models and feature-learning techniques that map the words or phrases of a vocabulary to vectors of real numbers. Word representations come in two common flavors: discrete and distributed. Discrete methods include one-hot and N-gram representations; their drawbacks are that they capture the similarity between words poorly and suffer from the curse of dimensionality. Distributed representations build on the idea that a word can be characterized by the words around it, which is the idea behind the familiar word2vec. word2vec comprises the Skip-Gram model, which predicts the n surrounding words from the current word, and the CBOW model, which predicts a word from its n context words. Ready-made English vectors are available from GloVe in 50, 100, 200, and 300 dimensions, and Tencent AI Lab recently open-sourced 200-dimensional Chinese word vectors. polyglot can read word vectors from the following sources (a loading sketch follows the list):
- Gensim word2vec objects: (from_gensim method)
- Word2vec binary/text models: (from_word2vec method)
- GloVe models (from_glove method)
- polyglot pickle files: (load method)
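A quick sketch of these loaders (the file paths are placeholder examples, and argument details may differ across polyglot versions):

from polyglot.mapping import Embedding

# Each classmethod reads a different on-disk format; the paths are hypothetical.
glove_emb = Embedding.from_glove("glove.6B.100d.txt")
w2v_emb = Embedding.from_word2vec("vectors.bin", binary=True)
pkl_emb = Embedding.load("/home/user/polyglot_data/embeddings2/zh/embeddings_pkl.tar.bz2")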
The polyglot pickle files themselves provide word vectors for 136 languages:
>>> from polyglot.downloader import downloader
>>> print(downloader.supported_languages_table("embeddings2"))
1. Scots  2. Sicilian  3. Welsh  4. Chuvash  5. Czech
6. Egyptian Arabic  7. Kapampangan  8. Chechen  9. Catalan; Valencian  10. Slovene
11. Sinhala, Sinhalese  12. Bosnian-Croatian-Serbian  13. Slovak  14. Japanese  15. Northern Sami
16. Sanskrit (Saṁskṛta)  17. Croatian  18. Javanese  19. Sundanese  20. Swahili
21. Swedish  22. Albanian  23. Serbian  24. Marathi (Marāṭhī)  25. Breton
26. Bosnian  27. Bengali  28. Tibetan Standard, Tib...  29. Bulgarian  30. Belarusian
31. West Flemish  32. Bashkir  33. Malay  34. Romanian, Moldavian, ...  35. Romansh
36. Esperanto  37. Asturian  38. Greek, Modern  39. Burmese  40. Maltese
41. Malagasy  42. Spanish; Castilian  43. Russian  44. Mongolian  45. Chinese
46. Estonian  47. Yoruba  48. Sakha  49. Alemannic  50. Assamese
51. Lombard language  52. Yiddish  53. Silesian  54. Venetian  55. Azerbaijani
56. Afrikaans  57. Aragonese  58. Amharic  59. Hebrew (modern)  60. Hindi
61. Quechua  62. Haitian; Haitian Creole  63. Hungarian  64. Bishnupriya Manipuri  65. Armenian
66. Gan Chinese  67. Macedonian  68. Georgian  69. Khmer  70. Panjabi, Punjabi
71. Korean  72. Kannada  73. Kazakh  74. Kurdish  75. Basque
76. Pashto, Pushto  77. Portuguese  78. Gujarati  79. Manx  80. Irish
81. Scottish Gaelic; Gaelic  82. Upper Sorbian  83. Galician  84. Arabic  85. Walloon
86. Urdu  87. Norwegian Nynorsk  88. Norwegian  89. Dutch  90. Chinese Character
91. Nepali  92. French  93. Western Frisian  94. Bavarian  95. English
96. Persian  97. Polish  98. Finnish  99. Faroese  100. Italian
101. Icelandic  102. Volapük  103. Ido  104. Waray-Waray  105. Indonesian
106. Interlingua  107. Lithuanian  108. Uzbek  109. Latvian  110. German
111. Danish  112. Cebuano  113. Ukrainian  114. Latin  115. Luxembourgish, Letzeb...
116. Divehi; Dhivehi; Mald...  117. Vietnamese  118. Uighur, Uyghur  119. Limburgish, Limburgan...  120. Zazaki
121. Ilokano  122. Fiji Hindi  123. Malayalam  124. Tatar  125. Kirghiz, Kyrgyz
126. Ossetian, Ossetic  127. Oriya  128. Turkish  129. Tamil  130. Tagalog
131. Thai  132. Turkmen  133. Telugu  134. Occitan  135. Tajik
136. Piedmontese language
Model download
Download the English and Chinese word vectors:
$ polyglot download embeddings2.zh embeddings2.en
[polyglot_data] Downloading package embeddings2.zh to
[polyglot_data]     /home/user/polyglot_data...
[polyglot_data] Downloading package embeddings2.en to
[polyglot_data]     /home/user/polyglot_data...
Example
Import the dependency and load the word vectors:
>>> from polyglot.mapping import Embedding
>>> embeddings = Embedding.load('/home/user/polyglot_data/embeddings2/zh/embeddings_pkl.tar.bz2')
Query a word vector:
>>> print(embeddings.get("中国"))
[ 0.60831094  0.37644583 -0.67009342  0.43529209  0.12993187 -0.07703398 -0.04931475 -0.42763838
 -0.42447501 -0.0219319  -0.52271312 -0.57149178 -0.48139745 -0.31942225  0.12747335  0.34054375
  0.27137381  0.1362032  -0.54999739 -0.39569679  1.01767457  0.12317979 -0.12878017 -0.65476489
  0.18644606  0.2178454   0.18150428  0.18464987  0.29027358  0.21979097 -0.21173042  0.08130789
 -0.77350897  0.66575652 -0.14730017  0.11383133  0.83101833  0.01702038 -0.71277034  0.29339811
  0.3320756   0.25922608 -0.51986367  0.16533957  0.04327472  0.36460632  0.42984027  0.04811303
 -0.16718218 -0.18613082 -0.52108622 -0.47057685 -0.14663117 -0.30221295  0.72923231 -0.54835045
 -0.48428732  0.65475166 -0.34853089  0.03206051  0.2574054   0.07614037  0.32844698 -0.0087136 ]
>>> print(len(embeddings.get("中国")))
64
Query nearest neighbors:
>>> neighbors = embeddings.nearest_neighbors("中国")
>>> print(" ".join(neighbors))
上海 美国 韩国 北京 欧洲 台湾 法国 德国 天津 广州
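Under the hood, neighbors are ranked by distance in the 64-dimensional embedding space. As a minimal numpy sketch of one common metric, cosine similarity (polyglot's own ranking may use a different distance):

import numpy as np

def cosine_similarity(v1, v2):
    # Cosine of the angle between two vectors; values near 1 mean high similarity.
    return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))

print(cosine_similarity(embeddings.get(u"中国"), embeddings.get(u"上海")))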
Transliteration
polyglot's transliteration uses an unsupervised method (see the paper "False-Friend Detection and Entity Matching via Unsupervised Transliteration") and supports 69 languages:
>>> from polyglot.downloader import downloader
>>> print(downloader.supported_languages_table("transliteration2"))
1. Haitian; Haitian Creole  2. Tamil  3. Vietnamese  4. Telugu  5. Croatian
6. Hungarian  7. Thai  8. Kannada  9. Tagalog  10. Armenian
11. Hebrew (modern)  12. Turkish  13. Portuguese  14. Belarusian  15. Norwegian Nynorsk
16. Norwegian  17. Dutch  18. Japanese  19. Albanian  20. Bulgarian
21. Serbian  22. Swahili  23. Swedish  24. French  25. Latin
26. Czech  27. Yiddish  28. Hindi  29. Danish  30. Finnish
31. German  32. Bosnian-Croatian-Serbian  33. Slovak  34. Persian  35. Lithuanian
36. Slovene  37. Latvian  38. Bosnian  39. Gujarati  40. Italian
41. Icelandic  42. Spanish; Castilian  43. Ukrainian  44. Urdu  45. Indonesian
46. Khmer  47. Galician  48. Korean  49. Afrikaans  50. Georgian
51. Catalan; Valencian  52. Romanian, Moldavian, ...  53. Basque  54. Macedonian  55. Russian
56. Azerbaijani  57. Chinese  58. Estonian  59. Welsh  60. Arabic
61. Bengali  62. Amharic  63. Irish  64. Malay  65. Marathi (Marāṭhī)
66. Polish  67. Greek, Modern  68. Esperanto  69. Maltese
Model download
Download the English and Chinese transliteration models:
$ polyglot download transliteration2.zh transliteration2.en
[polyglot_data] Downloading package transliteration2.zh to
[polyglot_data]     /home/user/polyglot_data...
[polyglot_data] Downloading package transliteration2.en to
[polyglot_data]     /home/user/polyglot_data...
Example
Import the dependency:
>>> from polyglot.text import Text
Transliterate English into Chinese:
>>> text = Text(text_en)
>>> print(text_en)
Japan's last pager provider has announced it will end its service in September 2019 - bringing a national end to telecommunication beepers, 50 years after their introduction.Around 1,500 users remain subscribed to Tokyo Telemessage, which has not made the devices in 20 years.
>>> print("".join([t for t in text.transliterate("zh")]))
拉斯特帕格普罗维德尔哈斯安诺乌恩斯德伊特维尔恩德伊特斯塞尔维斯因塞普特艾伯布林吉恩格阿恩阿特伊奥纳尔恩德托特埃莱科姆穆尼卡特伊昂布熙佩尔斯年年耶阿尔斯阿夫特特海尔乌斯尔斯雷马因苏布斯克里贝德托托基奥特埃莱梅斯斯阿格埃惠克赫哈斯诺特马德特赫德耶夫伊斯斯因耶阿尔斯
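polyglot also exposes a word-level interface, polyglot.transliteration.Transliterator, shown in its documentation; a short sketch:

from polyglot.transliteration import Transliterator

# Transliterate individual tokens from English into Chinese script.
transliterator = Transliterator(source_lang="en", target_lang="zh")
print(transliterator.transliterate(u"Japan"))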
As the output shows, this is phonetic transliteration rather than translation, and the English-to-Chinese quality is rather poor, so we will not dwell on it.
Pipelines
A pipeline runs several NLP tasks in sequence, with each task's output serving as the next task's input. In entity and relation extraction, for instance, the pipeline approach first recognizes the entities and then classifies the relations between them; the alternative is a joint model that performs entity recognition and relation extraction together.
Example
Tokenize a file, then count the words that occur at least twice:
$ polyglot --lang en tokenize --input testdata/example.txt | polyglot count --min-count 2
in              10
the             6
.               6
-               5
,               4
of              3
and             3
by              3
South           2
5               2
2007            2
Bermuda         2
which           2
score           2
against         2
Mitchell        2
as              2
West            2
India           2
beat            2
Afghanistan     2
Indies          2
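The same pipeline can be reproduced inside Python with Text and collections.Counter; a sketch assuming the same input file:

import io
from collections import Counter

from polyglot.text import Text

# Tokenize the file with polyglot, then count tokens that occur at least twice.
with io.open("testdata/example.txt", encoding="utf-8") as f:
    counts = Counter(Text(f.read()).words)

for word, n in counts.most_common():
    if n >= 2:
        print(word, n)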