知道创宇 (Knownsec) IA-Lab, 岳永鹏

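For NLP tasks today, open-source toolkits for English that can be used from Python include NLTK, spaCy, Stanford CoreNLP, GATE, and OpenNLP, and open-source toolkits for Chinese include Jieba, ICTCLAS, THULAC, and HIT LTP. Most of these tools, however, support only specific languages. This article introduces polyglot, a powerful multilingual Python toolkit that also supports pipeline-style processing. The project was first open-sourced on GitHub by aboSamoor on March 16, 2015, and has so far collected 1,021 stars on GitHub.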

Features

  • Language Detection (196 languages)
  • Tokenization, including sentence segmentation (165 languages)
  • Named Entity Recognition (40 languages)
  • Part of Speech Tagging (16 languages)
  • Sentiment Analysis (136 languages)
  • Word Embeddings (137 languages)
  • Transliteration (69 languages)
  • Pipelines

Installation

Install or upgrade from PyPI

$ pip install polyglot

polyglot depends on numpy and libicu-dev. On Ubuntu and Debian you can install both with:
$ sudo apt-get install python-numpy libicu-dev
Once installation succeeds, check the version from a Python prompt:

>>> import polyglot
>>> polyglot.__version__
'16.07.04'

Data

The examples that follow use Chinese, English, and mixed Chinese-English sentences as test data.

text_en = u"Japan's last pager provider has announced it will end its service in September 2019 - bringing a national end to telecommunication beepers, 50 years after their introduction.Around 1,500 users remain subscribed to Tokyo Telemessage, which has not made the devices in 20 years."
text_cn = u" 日本最后一家寻呼机服务营业商宣布,将于2019年9月结束服务,标志着日本寻呼业长达50年的历史正式落幕。目前大约还有1500名用户使用东京电信通信公司提供的寻呼服务,该公司在20年前就已停止生产寻呼机。"
text_mixed = text_cn + text_en

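Language Detection

polyglot's language detection relies on pycld2 and cld2; cld2 is a multilingual language detection library developed by Google.

Example

Import the dependency

from polyglot.detect import Detector

Detect the language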

>>> Detector(text_cn).language
 name: Chinese     code: zh       confidence:  99.0 read bytes:  1996

>>> Detector(text_en).language
 name: English     code: en       confidence:  99.0 read bytes:  1144

>>> Detector(text_mixed).language
 name: Chinese     code: zh       confidence:  50.0 read bytes:  1996

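For the mixed Chinese-English text_mixed, the detected language is Chinese, but the confidence is only 50. To list every language detected in the text:

>>> for language in Detector(text_mixed).languages:
...     print(language)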
 name: Chinese     code: zh       confidence:  50.0 read bytes:  1996
 name: English     code: en       confidence:  49.0 read bytes:  1144
 name: un          code: un       confidence:   0.0 read bytes:     0
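To get an intuition for why a mixed text splits the confidence roughly in half, here is a crude standalone sketch. It is not how cld2 works: it only measures what share of the UTF-8 bytes comes from CJK ideographs versus plain Latin letters. The function name `script_byte_shares` and the sample string are made up for illustration.

```python
def script_byte_shares(text):
    """Return the percentage of UTF-8 bytes contributed by CJK vs. Latin letters."""
    counts = {"cjk": 0, "latin": 0}
    for ch in text:
        size = len(ch.encode("utf-8"))
        if "\u4e00" <= ch <= "\u9fff":        # CJK Unified Ideographs block
            counts["cjk"] += size
        elif ch.isascii() and ch.isalpha():    # plain Latin letters only
            counts["latin"] += size
    total = sum(counts.values())
    if total == 0:
        return {name: 0.0 for name in counts}
    return {name: round(100.0 * n / total, 1) for name, n in counts.items()}


# Hypothetical sample: two Chinese characters followed by one English word.
print(script_byte_shares("日本 pager"))
```

A real detector looks at far more than the script, so treat this only as a rough picture of how a text containing two languages ends up with two partial scores.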

cld2 currently supports detecting the following languages:

>>> Detector.supported_languages()

  1. Abkhazian                  2. Afar                       3. Afrikaans                
  4. Akan                       5. Albanian                   6. Amharic                  
  7. Arabic                     8. Armenian                   9. Assamese                 
 10. Aymara                    11. Azerbaijani               12. Bashkir                  
 13. Basque                    14. Belarusian                15. Bengali                  
 16. Bihari                    17. Bislama                   18. Bosnian                  
 19. Breton                    20. Bulgarian                 21. Burmese                  
 22. Catalan                   23. Cebuano                   24. Cherokee                 
 25. Nyanja                    26. Corsican                  27. Croatian                 
 28. Croatian                  29. Czech                     30. Chinese                  
 31. Chinese                   32. Chinese                   33. Chinese                  
 34. Chineset                  35. Chineset                  36. Chineset                 
 37. Chineset                  38. Chineset                  39. Chineset                 
 40. Danish                    41. Dhivehi                   42. Dutch                    
 43. Dzongkha                  44. English                   45. Esperanto                
 46. Estonian                  47. Ewe                       48. Faroese                  
 49. Fijian                    50. Finnish                   51. French                   
 52. Frisian                   53. Ga                        54. Galician                 
 55. Ganda                     56. Georgian                  57. German                   
 58. Greek                     59. Greenlandic               60. Guarani                  
 61. Gujarati                  62. Haitian_creole            63. Hausa                    
 64. Hawaiian                  65. Hebrew                    66. Hebrew                   
 67. Hindi                     68. Hmong                     69. Hungarian                
 70. Icelandic                 71. Igbo                      72. Indonesian               
 73. Interlingua               74. Interlingue               75. Inuktitut                
 76. Inupiak                   77. Irish                     78. Italian                  
 79. Ignore                    80. Javanese                  81. Javanese                 
 82. Japanese                  83. Kannada                   84. Kashmiri                 
 85. Kazakh                    86. Khasi                     87. Khmer                    
 88. Kinyarwanda               89. Krio                      90. Kurdish                  
 91. Kyrgyz                    92. Korean                    93. Laothian                 
 94. Latin                     95. Latvian                   96. Limbu                    
 97. Limbu                     98. Limbu                     99. Lingala                  
100. Lithuanian               101. Lozi                     102. Luba_lulua               
103. Luo_kenya_and_tanzania   104. Luxembourgish            105. Macedonian               
106. Malagasy                 107. Malay                    108. Malayalam                
109. Maltese                  110. Manx                     111. Maori                    
112. Marathi                  113. Mauritian_creole         114. Romanian                 
115. Mongolian                116. Montenegrin              117. Montenegrin              
118. Montenegrin              119. Montenegrin              120. Nauru                    
121. Ndebele                  122. Nepali                   123. Newari                   
124. Norwegian                125. Norwegian                126. Norwegian_n              
127. Nyanja                   128. Occitan                  129. Oriya                    
130. Oromo                    131. Ossetian                 132. Pampanga                 
133. Pashto                   134. Pedi                     135. Persian                  
136. Polish                   137. Portuguese               138. Punjabi                  
139. Quechua                  140. Rajasthani               141. Rhaeto_romance           
142. Romanian                 143. Rundi                    144. Russian                  
145. Samoan                   146. Sango                    147. Sanskrit                 
148. Scots                    149. Scots_gaelic             150. Serbian                  
151. Serbian                  152. Seselwa                  153. Seselwa                  
154. Sesotho                  155. Shona                    156. Sindhi                   
157. Sinhalese                158. Siswant                  159. Slovak                   
160. Slovenian                161. Somali                   162. Spanish                  
163. Sundanese                164. Swahili                  165. Swedish                  
166. Syriac                   167. Tagalog                  168. Tajik                    
169. Tamil                    170. Tatar                    171. Telugu                   
172. Thai                     173. Tibetan                  174. Tigrinya                 
175. Tonga                    176. Tsonga                   177. Tswana                   
178. Tumbuka                  179. Turkish                  180. Turkmen                  
181. Twi                      182. Uighur                   183. Ukrainian                
184. Urdu                     185. Uzbek                    186. Venda                    
187. Vietnamese               188. Volapuk                  189. Waray_philippines        
190. Welsh                    191. Wolof                    192. Xhosa                    
193. Yiddish                  194. Yoruba                   195. Zhuang                   
196. Zulu                     

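Sentence Segmentation and Tokenization

NLP tasks can be divided into character-level, word-level, sentence-level, paragraph-level, and document-level tasks. Tokenization finds the boundaries between characters, words, sentences, and paragraphs. Paragraphs can be split on line-break characters (\n, \r), and splitting into characters is also easy to implement; sentence segmentation and word segmentation are comparatively harder.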
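The two easy cases, paragraph and character splitting, can be sketched in plain Python without polyglot. The helper names `split_paragraphs` and `split_characters` are hypothetical, and treating any run of line breaks as a paragraph boundary is an assumption made for this sketch.

```python
import re


def split_paragraphs(text):
    """Split on runs of line-break characters and drop empty pieces."""
    pieces = re.split(r"[\r\n]+", text)
    return [p.strip() for p in pieces if p.strip()]


def split_characters(text):
    """Return the non-whitespace characters of the text, in order."""
    return [ch for ch in text if not ch.isspace()]


sample = "第一段。\n\nSecond paragraph.\r\nThird."
print(split_paragraphs(sample))
print(split_characters("寻呼机 ok"))
```

Sentence and word boundaries cannot be found this way, especially in Chinese, where words are not separated by spaces; that is the part polyglot handles.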

Example

Import the dependency

from polyglot.text import Text

Sentence segmentation

>>> Text(text_cn).sentences
 [Sentence("日本最后一家寻呼机服务营业商宣布,将于2019年9月结束服务,标志着日本寻呼业长达50年的历史正式落幕。"), Sentence("目前大约还有1500名用户使用东京电信通信公司提供的寻呼服务,该公司在20年前就已停止生产寻呼机。")]

>>> Text(text_en).sentences
 [Sentence("Japan's last pager provider has announced it will end its service in September 2019 - bringing a national end to telecommunication beepers, 50 years after their introduction."), Sentence("Around 1,500 users remain subscribed to Tokyo Telemessage, which has not made the devices in 20 years.")]

>>> Text(text_mixed).sentences
 [Sentence("Japan's last pager provider has announced it will end its service in September 2019 - bringing a national end to telecommunication beepers, 50 years after their introduction."), Sentence("Around 1,500 users remain subscribed to Tokyo Telemessage, which has not made the devices in 20 years."), Sentence("日本最后一家寻呼机服务营业商宣布,将于2019年9月结束服务,标志着日本寻呼业长达50年的历史正式落幕。"), Sentence("目前大约还有1500名用户使用东京电信通信公司提供的寻呼服务,该公司在20年前就已停止生产寻呼机。")]

Tokenization

>>> Text(text_cn).words
 日本 最后 一家 寻 呼 机 服务 营业 商 宣布 , 将 于 20199 月 结束 服务 , 标志 着 日本 寻 呼 业 长达 50 年 的 历史 正式 落幕 。 目前 大约 还有 1500 名 用户 使用 东京 电信 通信 公司 提供 的 寻 呼 服务 , 该 公司 在 20 年前 就 已 停止 生产 寻 呼 机 。

>>> Text(text_en).words
 Japan's last pager provider has announced it will end its service in September 2019 - bringing a national end to telecommunication beepers , 50 years after their introduction.Around 1,500 users remain subscribed to Tokyo Telemessage , which has not made the devices in 20 years .

>>> Text(text_mixed).words
Japan's last pager provider has announced it will end its service in September 2019 - bringing a national end to telecommunication beepers , 50 years after their introduction.Around 1,500 users remain subscribed to Tokyo Telemessage , which has not made the devices in 20 years . 日本 最后 一家 寻 呼 机 服务 营业 商 宣布 , 将 于 20199 月 结束 服务 , 标志 着 日本 寻 呼 业 长达 50 年 的 历史 正式 落幕 。 目前 大约 还有 1500 名 用户 使用 东京 电信 通信 公司 提供 的 寻 呼 服务 , 该 公司 在 20 年前 就 已 停止 生产 寻 呼 机 。

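Named Entity Recognition

Named entity recognition identifies entities with a specific meaning in text. Entities commonly fall into three classes:

  • Entity names: person names, place names, organization names, product names, trademarks, and so on
  • Time expressions: dates, times
  • Numeric expressions: birthdays, phone numbers, QQ numbers, and so on

Approaches to entity recognition can also be divided into three kinds:

  • Rule-based: linguistic grammar-based techniques
    These techniques rely on hand-written rules; in engineering practice that means writing many regular expressions (RegEx). This approach can handle some time and numeric entities.
  • Statistical learning: statistical models
    The main statistical methods are currently HMM and CRF models, which are also the more mature approach.
  • Deep learning: deep learning models
    Deep learning is currently the most popular approach, especially the RNN family of models, which can capture more of the text's semantic information and currently gives the best results.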
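The rule-based approach described in the list above can be sketched with two regular expressions that pick out date expressions. This is only an illustration of the idea, not part of polyglot; the patterns and the helper name `find_dates` are assumptions and cover just the date formats that appear in this article's test sentences.

```python
import re

# Hypothetical rules: one for Chinese "YYYY年M月(D日)" dates,
# one for English "Month YYYY" dates.
DATE_PATTERNS = [
    re.compile(r"\d{4}年\d{1,2}月(?:\d{1,2}日)?"),
    re.compile(
        r"(?:January|February|March|April|May|June|July|August|"
        r"September|October|November|December)\s+\d{4}"
    ),
]


def find_dates(text):
    """Collect every substring matched by any of the date rules."""
    hits = []
    for pattern in DATE_PATTERNS:
        hits.extend(match.group(0) for match in pattern.finditer(text))
    return hits


print(find_dates("将于2019年9月结束服务"))
print(find_dates("it will end its service in September 2019"))
```

Each new date format needs another rule, which is why this approach only covers part of the time and numeric entities and does not extend well to names of people, places, or organizations.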

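polyglot's NER models are trained on Wikipedia. The trained models are not included in the initial installation and must be downloaded separately. polyglot supports recognizing three entity types (person, location, organization) in 40 languages.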

>>> from polyglot.downloader import downloader
>>> print(downloader.supported_languages_table("ner2", 3))

 1. Polish                     2. Turkish                    3. Russian
 4. Indonesian                 5. Czech                      6. Arabic
 7. Korean                     8. Catalan; Valencian         9. Italian
10. Thai                      11. Romanian, Moldavian, ...  12. Tagalog
13. Danish                    14. Finnish                   15. German
16. Persian                   17. Dutch                     18. Chinese
19. French                    20. Portuguese                21. Slovak
22. Hebrew (modern)           23. Malay                     24. Slovene
25. Bulgarian                 26. Hindi                     27. Japanese
28. Hungarian                 29. Croatian                  30. Ukrainian
31. Serbian                   32. Lithuanian                33. Norwegian
34. Latvian                   35. Swedish                   36. English
37. Greek, Modern             38. Spanish; Castilian        39. Vietnamese
40. Estonian

Model download

Download the English and Chinese NER models, together with the word embeddings, from a shell (in IPython or Jupyter, prefix the command with !):

$ polyglot download ner2.en ner2.zh embeddings2.zh embeddings2.en

[polyglot_data] Downloading package ner2.en to
[polyglot_data] Downloading package ner2.zh to
[polyglot_data] Downloading package embeddings2.zh to
[polyglot_data] Downloading package embeddings2.en to
[polyglot_data]  /home/user/polyglot_data...

Example

Import the dependency

>>> from polyglot.text import Text

Entity recognition

>>> Text(text_cn).entities
 [I-ORG([u'东京'])]

>>> Text(text_en).entities
 [I-LOC([u'Tokyo'])]

>>> Text(text_mixed).entities
 [I-ORG([u'东京'])]

Part of Speech Tagging

Part-of-speech tagging assigns a part-of-speech label to each token. Commonly used tags include:

  • ADJ: adjective
  • ADP: adposition
  • ADV: adverb
  • AUX: auxiliary verb
  • CONJ: coordinating conjunction
  • DET: determiner
  • INTJ: interjection
  • NOUN: noun
  • NUM: numeral
  • PART: particle
  • PRON: pronoun
  • PROPN: proper noun
  • PUNCT: punctuation
  • SCONJ: subordinating conjunction
  • SYM: symbol
  • VERB: verb
  • X: other

polyglot's POS taggers are trained on CoNLL datasets. They support 16 languages; Chinese is not among them.

>>> from polyglot.downloader import downloader
>>> print(downloader.supported_languages_table("pos2"))
  1. German                     2. Italian                    3. Danish                   
  4. Czech                      5. Slovene                    6. French                   
  7. English                    8. Swedish                    9. Bulgarian                
 10. Spanish; Castilian        11. Indonesian                12. Portuguese               
 13. Finnish                   14. Irish                     15. Hungarian                
 16. Dutch                    

Model download

Download the English POS tagging model from a shell (in IPython or Jupyter, prefix the command with !):

$ polyglot download pos2.en

[polyglot_data] Downloading package pos2.en to
[polyglot_data]  /home/user/polyglot_data...

Example

Import the dependency

from polyglot.text import Text

POS tagging

>>> Text(text_en).pos_tags
 [(u"Japan's", u'NUM'), (u'last', u'ADJ'), (u'pager', u'NOUN'), (u'provider', u'NOUN'), (u'has', u'AUX'), (u'announced', u'VERB'), (u'it', u'PRON'), (u'will', u'AUX'), (u'end', u'VERB'), (u'its', u'PRON'), (u'service', u'NOUN'), (u'in', u'ADP'), (u'September', u'PROPN'), (u'2019', u'NUM'), (u'-', u'PUNCT'), (u'bringing', u'VERB'), (u'a', u'DET'), (u'national', u'ADJ'), (u'end', u'NOUN'), (u'to', u'ADP'), (u'telecommunication', u'VERB'), (u'beepers', u'NUM'), (u',', u'PUNCT'), (u'50', u'NUM'), (u'years', u'NOUN'), (u'after', u'ADP'), (u'their', u'PRON'), (u'introduction.Around', u'NUM'), (u'1,500', u'NUM'), (u'users', u'NOUN'), (u'remain', u'VERB'), (u'subscribed', u'VERB'), (u'to', u'ADP'), (u'Tokyo', u'PROPN'), (u'Telemessage', u'PROPN'), (u',', u'PUNCT'), (u'which', u'DET'), (u'has', u'AUX'), (u'not', u'PART'), (u'made', u'VERB'), (u'the', u'DET'), (u'devices', u'NOUN'), (u'in', u'ADP'), (u'20', u'NUM'), (u'years', u'NOUN'), (u'.', u'PUNCT')]

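Sentiment Analysis

polyglot's sentiment analysis works at the word level: each token is labeled 1 for positive, 0 for neutral, and -1 for negative. It currently supports 136 languages.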

>>> from polyglot.downloader import downloader
>>> print(downloader.supported_languages_table("sentiment2"))
 1. Turkmen                    2. Thai                       3. Latvian
 4. Zazaki                     5. Tagalog                    6. Tamil
 7. Tajik                      8. Telugu                     9. Luxembourgish, Letzeb...
10. Alemannic                 11. Latin                     12. Turkish
13. Limburgish, Limburgan...  14. Egyptian Arabic           15. Tatar
16. Lithuanian                17. Spanish; Castilian        18. Basque
19. Estonian                  20. Asturian                  21. Greek, Modern
22. Esperanto                 23. English                   24. Ukrainian
25. Marathi (Marāṭhī)         26. Maltese                   27. Burmese
28. Kapampangan               29. Uighur, Uyghur            30. Uzbek
31. Malagasy                  32. Yiddish                   33. Macedonian
34. Urdu                      35. Malayalam                 36. Mongolian
37. Breton                    38. Bosnian                   39. Bengali
40. Tibetan Standard, Tib...  41. Belarusian                42. Bulgarian
43. Bashkir                   44. Vietnamese                45. Volapük
46. Gan Chinese               47. Manx                      48. Gujarati
49. Yoruba                    50. Occitan                   51. Scottish Gaelic; Gaelic
52. Irish                     53. Galician                  54. Ossetian, Ossetic
55. Oriya                     56. Walloon                   57. Swedish
58. Silesian                  59. Lombard language          60. Divehi; Dhivehi; Mald...
61. Danish                    62. German                    63. Armenian
64. Haitian; Haitian Creole   65. Hungarian                 66. Croatian
67. Bishnupriya Manipuri      68. Hindi                     69. Hebrew (modern)
70. Portuguese                71. Afrikaans                 72. Pashto, Pushto
73. Amharic                   74. Aragonese                 75. Bavarian
76. Assamese                  77. Panjabi, Punjabi          78. Polish
79. Azerbaijani               80. Italian                   81. Arabic
82. Icelandic                 83. Ido                       84. Scots
85. Sicilian                  86. Indonesian                87. Chinese Word
88. Interlingua               89. Waray-Waray               90. Piedmontese language
91. Quechua                   92. French                    93. Dutch
94. Norwegian Nynorsk         95. Norwegian                 96. Western Frisian
97. Upper Sorbian             98. Nepali                    99. Persian
100. Ilokano                  101. Finnish                  102. Faroese
103. Romansh                  104. Javanese                 105. Romanian, Moldavian, ...
106. Malay                    107. Japanese                 108. Russian
109. Catalan; Valencian       110. Fiji Hindi               111. Chinese
112. Cebuano                  113. Czech                    114. Chuvash
115. Welsh                    116. West Flemish             117. Kirghiz, Kyrgyz
118. Kurdish                  119. Kazakh                   120. Korean
121. Kannada                  122. Khmer                    123. Georgian
124. Sakha                    125. Serbian                  126. Albanian
127. Swahili                  128. Chechen                  129. Sundanese
130. Sanskrit (Saṁskṛta)      131. Venetian                 132. Northern Sami
133. Slovak                   134. Sinhala, Sinhalese       135. Bosnian-Croatian-Serbian
136. Slovene

Model download

Download the English and Chinese sentiment models from a shell (in IPython or Jupyter, prefix the command with !):

$ polyglot download sentiment2.en sentiment2.zh

[polyglot_data] Downloading package sentiment2.en to
[polyglot_data] Downloading package sentiment2.zh to
[polyglot_data]  /home/user/polyglot_data...

Example

Import the dependency

from polyglot.text import Text

Sentiment analysis

>>> text = Text("The movie is very good and the actors are prefect, but the cinema environment is very poor.")
>>> print(text.words,text.polarity)
 (WordList([u'The', u'movie', u'is', u'very', u'good', u'and', u'the', u'actors', u'are', u'prefect', u',', u'but', u'the', u'cinema', u'environment', u'is', u'very', u'poor', u'.']), 0.0)
>>> print([(w,w.polarity) for w in text.words])
 [(u'The', 0), (u'movie', 0), (u'is', 0), (u'very', 0), (u'good', 1), (u'and', 0), (u'the', 0), (u'actors', 0), (u'are', 0), (u'prefect', 0), (u',', 0), (u'but', 0), (u'the', 0), (u'cinema', 0), (u'environment', 0), (u'is', 0), (u'very', 0), (u'poor', -1), (u'.', 0)]


>>> text = Text("这部电影故事非常好,演员也非常棒,但是电影院环境非常差。")
>>> print(text.words,text.polarity)
 (WordList([这 部 电影 故事 非常 好 , 演员 也 非常 棒 , 但是 电影 院 环境 非常 差 。]), 0.0)
>>> print([(w,w.polarity) for w in text.words])
 [(u'\u8fd9', 0), (u'\u90e8', 0), (u'\u7535\u5f71', 0), (u'\u6545\u4e8b', 0), (u'\u975e\u5e38', 0), (u'\u597d', 1), (u'\uff0c', 0), (u'\u6f14\u5458', 0), (u'\u4e5f', 0), (u'\u975e\u5e38', 0), (u'\u68d2', 0), (u'\uff0c', 0), (u'\u4f46\u662f', 0), (u'\u7535\u5f71', 0), (u'\u9662', 0), (u'\u73af\u5883', 0), (u'\u975e\u5e38', 0), (u'\u5dee', -1), (u'\u3002', 0)]
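Both sentences contain one positive and one negative word, and both get an overall polarity of 0.0. One aggregation that is consistent with that output is to average the non-neutral word scores. The sketch below is an assumption made for illustration, since this article does not state how polyglot computes the text-level score; the function name `text_polarity` is hypothetical.

```python
def text_polarity(word_polarities):
    """Average the non-neutral word scores; return 0.0 if every word is neutral."""
    scores = [score for _, score in word_polarities if score != 0]
    if not scores:
        return 0.0
    return sum(scores) / float(len(scores))


# A few (word, polarity) pairs copied from the English example above.
tagged = [("movie", 0), ("good", 1), ("actors", 0), ("prefect", 0), ("poor", -1)]
print(text_polarity(tagged))
```

Note that the misspelled "prefect" in the example sentence is scored 0, so only "good" and "poor" contribute and they cancel out.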

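Word Embeddings

In NLP, word embedding is the collective name for a set of language-modeling and feature-learning techniques that map the words or phrases of a vocabulary to vectors of real numbers. There are two common families of word representations: discrete and distributed. Discrete methods include one-hot and N-gram representations; their drawbacks are that they capture the relatedness between words poorly and suffer from the curse of dimensionality. The idea behind distributed representations is to represent a word by the words that appear near it, which is the familiar word2vec. word2vec includes the Skip-Gram model, which predicts the n words before and after a given word, and the CBOW model, which predicts a word from the n words of context around it. Pre-trained English vectors include GloVe, which provides 50-, 100-, 200-, and 300-dimensional vectors, and Tencent AI Lab not long ago open-sourced 200-dimensional Chinese word vectors. polyglot can read word vectors from the following sources: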

  • Gensim word2vec objects: (from_gensim method)
  • Word2vec binary/text models: (from_word2vec method)
  • GloVe models (from_glove method)
  • polyglot pickle files: (load method)
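The difference between the Skip-Gram and CBOW training schemes mentioned above is easiest to see in the training examples each one generates. The sketch below only builds those examples from a token list with a window of n words; it does not train anything, and the function names are hypothetical.

```python
def skipgram_pairs(tokens, n):
    """(center, context) pairs: the center word predicts each of its neighbors."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - n), min(len(tokens), i + n + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


def cbow_examples(tokens, n):
    """(context_words, target) examples: the neighbors predict the center word."""
    examples = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - n), min(len(tokens), i + n + 1)
        context = tokens[lo:i] + tokens[i + 1:hi]
        examples.append((context, target))
    return examples


tokens = ["pager", "service", "ends"]
print(skipgram_pairs(tokens, 1))
print(cbow_examples(tokens, 1))
```

Skip-Gram produces one example per (word, neighbor) pair, while CBOW produces one example per word with the whole window as input.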

polyglot's own pickle files provide word vectors for 136 languages.

>>> from polyglot.downloader import downloader
>>> print(downloader.supported_languages_table("embeddings2"))
  1. Scots                      2. Sicilian                   3. Welsh                    
  4. Chuvash                    5. Czech                      6. Egyptian Arabic          
  7. Kapampangan                8. Chechen                    9. Catalan; Valencian       
 10. Slovene                   11. Sinhala, Sinhalese        12. Bosnian-Croatian-Serbian
 13. Slovak                    14. Japanese                  15. Northern Sami            
 16. Sanskrit (Saṁskṛta)       17. Croatian                  18. Javanese                 
 19. Sundanese                 20. Swahili                   21. Swedish                  
 22. Albanian                  23. Serbian                   24. Marathi (Marāṭhī)        
 25. Breton                    26. Bosnian                   27. Bengali                  
 28. Tibetan Standard, Tib...  29. Bulgarian                 30. Belarusian               
 31. West Flemish              32. Bashkir                   33. Malay                    
 34. Romanian, Moldavian, ...  35. Romansh                   36. Esperanto                
 37. Asturian                  38. Greek, Modern             39. Burmese                  
 40. Maltese                   41. Malagasy                  42. Spanish; Castilian       
 43. Russian                   44. Mongolian                 45. Chinese                  
 46. Estonian                  47. Yoruba                    48. Sakha                    
 49. Alemannic                 50. Assamese                  51. Lombard language         
 52. Yiddish                   53. Silesian                  54. Venetian                 
 55. Azerbaijani               56. Afrikaans                 57. Aragonese                
 58. Amharic                   59. Hebrew (modern)           60. Hindi                    
 61. Quechua                   62. Haitian; Haitian Creole   63. Hungarian                
 64. Bishnupriya Manipuri      65. Armenian                  66. Gan Chinese              
 67. Macedonian                68. Georgian                  69. Khmer                    
 70. Panjabi, Punjabi          71. Korean                    72. Kannada                  
 73. Kazakh                    74. Kurdish                   75. Basque                   
 76. Pashto, Pushto            77. Portuguese                78. Gujarati                 
 79. Manx                      80. Irish                     81. Scottish Gaelic; Gaelic  
 82. Upper Sorbian             83. Galician                  84. Arabic                   
 85. Walloon                   86. Urdu                      87. Norwegian Nynorsk        
 88. Norwegian                 89. Dutch                     90. Chinese Character        
 91. Nepali                    92. French                    93. Western Frisian          
 94. Bavarian                  95. English                   96. Persian                  
 97. Polish                    98. Finnish                   99. Faroese                  
100. Italian                  101. Icelandic                102. Volapük                  
103. Ido                      104. Waray-Waray              105. Indonesian               
106. Interlingua              107. Lithuanian               108. Uzbek                    
109. Latvian                  110. German                   111. Danish                   
112. Cebuano                  113. Ukrainian                114. Latin                    
115. Luxembourgish, Letzeb... 116. Divehi; Dhivehi; Mald... 117. Vietnamese               
118. Uighur, Uyghur           119. Limburgish, Limburgan... 120. Zazaki                   
121. Ilokano                  122. Fiji Hindi               123. Malayalam                
124. Tatar                    125. Kirghiz, Kyrgyz          126. Ossetian, Ossetic        
127. Oriya                    128. Turkish                  129. Tamil                    
130. Tagalog                  131. Thai                     132. Turkmen                  
133. Telugu                   134. Occitan                  135. Tajik                    
136. Piedmontese language

Model download

Download the English and Chinese word vectors from a shell (in IPython or Jupyter, prefix the command with !):

$ polyglot download embeddings2.zh embeddings2.en

[polyglot_data] Downloading package embeddings2.zh to
[polyglot_data] Downloading package embeddings2.en to
[polyglot_data]  /home/user/polyglot_data...

Example

Import the dependency and load the word vectors

>>> from polyglot.mapping import Embedding
>>> embeddings = Embedding.load('/home/user/polyglot_data/embeddings2/zh/embeddings_pkl.tar.bz2')

Vector lookup

>>> print(embeddings.get("中国"))
[ 0.60831094  0.37644583 -0.67009342  0.43529209  0.12993187 -0.07703398
 -0.04931475 -0.42763838 -0.42447501 -0.0219319  -0.52271312 -0.57149178
 -0.48139745 -0.31942225  0.12747335  0.34054375  0.27137381  0.1362032
 -0.54999739 -0.39569679  1.01767457  0.12317979 -0.12878017 -0.65476489
  0.18644606  0.2178454   0.18150428  0.18464987  0.29027358  0.21979097
 -0.21173042  0.08130789 -0.77350897  0.66575652 -0.14730017  0.11383133
  0.83101833  0.01702038 -0.71277034  0.29339811  0.3320756   0.25922608
 -0.51986367  0.16533957  0.04327472  0.36460632  0.42984027  0.04811303
 -0.16718218 -0.18613082 -0.52108622 -0.47057685 -0.14663117 -0.30221295
  0.72923231 -0.54835045 -0.48428732  0.65475166 -0.34853089  0.03206051
  0.2574054   0.07614037  0.32844698 -0.0087136 ]
>>> print(len(embeddings.get("中国")))
 64

Nearest-neighbor lookup

>>> neighbors = embeddings.nearest_neighbors("中国")
>>> print(" ".join(neighbors))
 上海 美国 韩国 北京 欧洲 台湾 法国 德国 天津 广州
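Conceptually, a nearest-neighbor query ranks every other word by how close its vector is to the query word's vector. The sketch below does this on a made-up two-dimensional table using Euclidean distance. Both the toy vectors and the choice of metric are assumptions for illustration (cosine similarity is another common choice), and the function is a standalone helper, not polyglot's method.

```python
import math


def nearest_neighbors(vectors, word, top_k=2):
    """Rank the other words by Euclidean distance to `word`, closest first."""
    target = vectors[word]

    def distance(other):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(target, vectors[other])))

    candidates = [w for w in vectors if w != word]
    return sorted(candidates, key=distance)[:top_k]


# Made-up 2-dimensional vectors; real polyglot embeddings have 64 dimensions.
toy_vectors = {
    "china": [1.0, 0.0],
    "korea": [0.9, 0.1],
    "japan": [0.8, 0.3],
    "apple": [0.0, 1.0],
}
print(nearest_neighbors(toy_vectors, "china"))
```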

Transliteration

polyglot's transliteration uses an unsupervised method (see the paper "False-Friend Detection and Entity Matching via Unsupervised Transliteration") and supports 69 languages.

>>> from polyglot.downloader import downloader
>>> print(downloader.supported_languages_table("transliteration2"))
 1. Haitian; Haitian Creole    2. Tamil                      3. Vietnamese               
 4. Telugu                     5. Croatian                   6. Hungarian                
 7. Thai                       8. Kannada                    9. Tagalog                  
10. Armenian                  11. Hebrew (modern)           12. Turkish                  
13. Portuguese                14. Belarusian                15. Norwegian Nynorsk        
16. Norwegian                 17. Dutch                     18. Japanese                 
19. Albanian                  20. Bulgarian                 21. Serbian                  
22. Swahili                   23. Swedish                   24. French                   
25. Latin                     26. Czech                     27. Yiddish                  
28. Hindi                     29. Danish                    30. Finnish                  
31. German                    32. Bosnian-Croatian-Serbian  33. Slovak                   
34. Persian                   35. Lithuanian                36. Slovene                  
37. Latvian                   38. Bosnian                   39. Gujarati                 
40. Italian                   41. Icelandic                 42. Spanish; Castilian       
43. Ukrainian                 44. Urdu                      45. Indonesian               
46. Khmer                     47. Galician                  48. Korean                   
49. Afrikaans                 50. Georgian                  51. Catalan; Valencian       
52. Romanian, Moldavian, ...  53. Basque                    54. Macedonian               
55. Russian                   56. Azerbaijani               57. Chinese                  
58. Estonian                  59. Welsh                     60. Arabic                   
61. Bengali                   62. Amharic                   63. Irish                    
64. Malay                     65. Marathi (Marāṭhī)         66. Polish                   
67. Greek, Modern             68. Esperanto                 69. Maltese  

Model download

Download the English and Chinese transliteration models from a shell (in IPython or Jupyter, prefix the command with !):

$ polyglot download transliteration2.zh transliteration2.en

[polyglot_data] Downloading package transliteration2.zh to
[polyglot_data] Downloading package transliteration2.en to
[polyglot_data]  /home/user/polyglot_data...

Example

Import the dependency

>>> from polyglot.text import Text

Transliterating English into Chinese

>>> text = Text(text_en)
>>> print(text_en)
  Japan's last pager provider has announced it will end its service in September 2019 - bringing a national end to telecommunication beepers, 50 years after their introduction.Around 1,500 users remain subscribed to Tokyo Telemessage, which has not made the devices in 20 years.
>>> print("".join([t for t in text.transliterate("zh")]))
 拉斯特帕格普罗维德尔哈斯安诺乌恩斯德伊特维尔恩德伊特斯塞尔维斯因塞普特艾伯布林吉恩格阿恩阿特伊奥纳尔恩德托特埃莱科姆穆尼卡特伊昂布熙佩尔斯年年耶阿尔斯阿夫特特海尔乌斯尔斯雷马因苏布斯克里贝德托托基奥特埃莱梅斯斯阿格埃惠克赫哈斯诺特马德特赫德耶夫伊斯斯因耶阿尔斯

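The result shows that English-to-Chinese transliteration quality is still poor, so it is not covered further here.

Pipelines

A pipeline runs several NLP tasks in sequence, with the output of one task serving as the input of the next. In entity recognition and entity relation extraction, for example, the pipeline approach first identifies the entities and then identifies the relations between them; the alternative is the joint approach, which handles entity recognition and relation extraction together.

Example

Tokenize first, then count the words that occur at least 2 times.

$ polyglot --lang en tokenize --input testdata/example.txt | polyglot count --min-count 2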
 in  10
the 6
.   6
-   5
,   4
of  3
and 3
by  3
South       2
5   2
2007        2
Bermuda     2
which       2
score       2
against     2
Mitchell    2
as  2
West        2
India       2
beat        2
Afghanistan 2
Indies      2
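The same two-stage idea can be written as two plain Python functions, where the output of the first is the input of the second. This is a rough analogue of the shell pipeline, not polyglot's implementation: the regex tokenizer is a crude stand-in, the function names are hypothetical, and the sample sentence is made up.

```python
import re
from collections import Counter


def tokenize(text):
    """Stage 1: split text into word and punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text)


def count(tokens, min_count):
    """Stage 2: keep tokens seen at least `min_count` times, most frequent first."""
    freq = Counter(tokens)
    return [(tok, n) for tok, n in freq.most_common() if n >= min_count]


text = "India beat Bermuda. Bermuda lost, India won."
result = count(tokenize(text), min_count=2)
for token, n in result:
    print(token, n)
```

Because each stage only depends on the previous stage's output, either one can be swapped out, for example by replacing the regex tokenizer with polyglot's, without touching the other.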

Author: befeng
