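spaCy is a natural language processing toolkit for Python. It first appeared in mid-2014 and bills itself as "Industrial-Strength Natural Language Processing in Python". spaCy makes heavy use of Cython to speed up its core modules, which sets it apart from the more academically oriented Python NLTK and makes it practical for industry use.
Installing and building spaCy is straightforward. On Ubuntu, install it directly with pip: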
sudo apt-get install build-essential python-dev git
sudo pip install -U spacy
sudo python -m spacy.en.download all
# This downloads the models for English tokenization, POS tagging, syntactic parsing, and named entity recognition
python -m spacy.en.download parser
# This downloads the pretrained GloVe word vectors
python -m spacy.en.download glove
textminer@textminer:/usr/local/lib/python2.7/dist-packages/spacy/data$ du -sh *
776M en-1.1.0
774M en_glove_cc_300_1m_vectors-1.0.0
textminer@textminer:/usr/local/lib/python2.7/dist-packages/spacy/data/en-1.1.0$ du -sh *
424M deps
8.0K meta.json
35M ner
12M pos
84K tokenizer
300M vocab
6.3M wordnet
textminer@textminer:~$ python -c "import spacy; spacy.load('en'); print('OK')"
# First, find spaCy's installation path:
python -c "import os; import spacy; print(os.path.dirname(spacy.__file__))"
# Then install pytest:
sudo python -m pip install -U pytest
# Finally, run the tests:
python -m pytest /usr/local/lib/python2.7/dist-packages/spacy --vectors --model --slow
============================= test session starts ==============================
platform linux2 -- Python 2.7.12, pytest-3.0.4, py-1.4.31, pluggy-0.4.0
rootdir: /usr/local/lib/python2.7/dist-packages/spacy, inifile:
collected 318 items
../../usr/local/lib/python2.7/dist-packages/spacy/tests/test_matcher.py ........
../../usr/local/lib/python2.7/dist-packages/spacy/tests/matcher/test_entity_id.py ....
../../usr/local/lib/python2.7/dist-packages/spacy/tests/matcher/test_matcher_bugfixes.py .....
../../usr/local/lib/python2.7/dist-packages/spacy/tests/vocab/test_vocab.py .......Xx
../../usr/local/lib/python2.7/dist-packages/spacy/tests/website/test_api.py x...............
../../usr/local/lib/python2.7/dist-packages/spacy/tests/website/test_home.py ............
============== 310 passed, 5 xfailed, 3 xpassed in 53.95 seconds ===============
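The two manual steps above (look up the install directory, then point pytest at it) can be combined in a few lines of Python. This is a minimal sketch for Python 3, not part of spaCy; the helper names `package_dir` and `pytest_command` are made up for illustration, and the demo uses the standard-library `json` package so it runs even where spaCy is not installed.

```python
import importlib.util
import os
import sys


def package_dir(name):
    """Return the directory an importable package lives in, or None if it is missing."""
    spec = importlib.util.find_spec(name)
    if spec is None or spec.origin is None:
        return None
    return os.path.dirname(spec.origin)


def pytest_command(name, flags=()):
    """Build the argument list for running a package's bundled tests with pytest."""
    path = package_dir(name)
    if path is None:
        raise ImportError("package %r is not installed" % name)
    return [sys.executable, "-m", "pytest", path] + list(flags)


# Demonstrated on "json" so the sketch runs anywhere. For the case in this
# article, pass "spacy" and the flags ["--vectors", "--model", "--slow"].
print(pytest_command("json"))
```

The resulting list can be handed to `subprocess.call` to actually run the tests.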
textminer@textminer:~$ ipython
Python 2.7.12 (default, Jul 1 2016, 15:12:24)
Type "copyright", "credits" or "license" for more information.
IPython 2.4.1 -- An enhanced Interactive Python.
? -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help -> Python's own help system.
object? -> Details about 'object', use 'object??' for extra details.
In [1]: import spacy
# Load the English model data; this takes a moment
In [2]: nlp = spacy.load('en')
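The session here uses the spaCy 1.x API. In spaCy 2.x and later the `spacy.en.download` module is gone, and in 3.x the `'en'` shortcut is gone as well: models are installed as packages (for example `python -m spacy download en_core_web_sm`) and loaded by package name. The sketch below assumes a newer spaCy and Python 3. The function name `load_english_pipeline` is made up for illustration, and the fallback to a tokenizer-only pipeline is a design choice of this example, not something the article does.

```python
def load_english_pipeline():
    """Return a spaCy pipeline, or None when spaCy itself is not installed."""
    try:
        import spacy
    except ImportError:
        return None
    try:
        # Trained pipeline (tagger, parser, NER), installed separately with:
        #   python -m spacy download en_core_web_sm
        return spacy.load("en_core_web_sm")
    except OSError:
        # The model package is missing: fall back to a tokenizer-only pipeline.
        return spacy.blank("en")


nlp = load_english_pipeline()
if nlp is not None:
    doc = nlp("it's word tokenize test for spacy")
    print([token.text for token in doc])
```

Note that the blank pipeline only tokenizes; attributes such as `pos_`, `ents` and `noun_chunks` used later in this post need a trained pipeline.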
Word tokenization (spaCy 1.2 added a Chinese tokenization interface based on Jieba):
In [3]: test_doc = nlp(u"it's word tokenize test for spacy")
In [4]: print(test_doc)
it's word tokenize test for spacy
In [5]: for token in test_doc:
   ...:     print(token)
Sentence segmentation:
In [6]: test_doc = nlp(u'Natural language processing (NLP) deals with the application of computational models to text or speech data. Application areas within NLP include automatic (machine) translation between languages; dialogue systems, which allow a human to interact with a machine using natural language; and information extraction, where the goal is to transform unstructured text into structured (database) representations that can be searched and browsed in flexible ways. NLP technologies are having a dramatic impact on the way people interact with computers, on the way people interact with each other through the use of language, and on the way people access the vast amount of linguistic data now in electronic form. From a scientific viewpoint, NLP involves fundamental questions of how to structure formal models (for example statistical models) of natural language phenomena, and of how to design algorithms that implement these models.')
In [7]: for sent in test_doc.sents:
   ...:     print(sent)
Natural language processing (NLP) deals with the application of computational models to text or speech data.
Application areas within NLP include automatic (machine) translation between languages; dialogue systems, which allow a human to interact with a machine using natural language; and information extraction, where the goal is to transform unstructured text into structured (database) representations that can be searched and browsed in flexible ways.
NLP technologies are having a dramatic impact on the way people interact with computers, on the way people interact with each other through the use of language, and on the way people access the vast amount of linguistic data now in electronic form.
From a scientific viewpoint, NLP involves fundamental questions of how to structure formal models (for example statistical models) of natural language phenomena, and of how to design algorithms that implement these models.
Lemmatization:
In [8]: test_doc = nlp(u"you are best. it is lemmatize test for spacy. I love these books")
In [9]: for token in test_doc:
   ...:     print(token, token.lemma_, token.lemma)
(you, u'you', 472)
(are, u'be', 488)
(best, u'good', 556)
(., u'.', 419)
(it, u'it', 473)
(is, u'be', 488)
(lemmatize, u'lemmatize', 1510296)
(test, u'test', 1351)
(for, u'for', 480)
(spacy, u'spacy', 173783)
(., u'.', 419)
(I, u'i', 570)
(love, u'love', 644)
(these, u'these', 642)
(books, u'book', 1011)
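Each token in the output above carries its lemma twice: `token.lemma_` is the string and `token.lemma` is an integer ID, so "are" and "is" both show u'be' and 488. The same pairing appears for `pos_`/`pos` and `label_`/`label` further down. The idea behind it is string interning: every distinct string is stored once and referred to by number. Below is a toy illustration of that idea only; `ToyStringStore` is a hypothetical class, not spaCy's implementation, and its IDs will not match spaCy's.

```python
class ToyStringStore(object):
    """Hypothetical string-to-integer table; illustrates interning, not spaCy's code."""

    def __init__(self):
        self._ids = {}       # string -> integer ID
        self._strings = []   # integer ID -> string

    def add(self, string):
        """Return the ID for a string, assigning a new one on first sight."""
        if string not in self._ids:
            self._ids[string] = len(self._strings)
            self._strings.append(string)
        return self._ids[string]

    def lookup(self, string_id):
        """Return the string that an ID stands for."""
        return self._strings[string_id]


store = ToyStringStore()
lemma_ids = [store.add(lemma) for lemma in ["you", "be", "good", ".", "it", "be"]]
print(lemma_ids)                              # both "be" entries share one ID
print([store.lookup(i) for i in lemma_ids])   # IDs map back to the strings
```

Comparing integers is cheaper than comparing strings, which is one reason to keep the integer form around for downstream code.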
Part-of-speech (POS) tagging:
In [10]: for token in test_doc:
   ....:     print(token, token.pos_, token.pos)
(you, u'PRON', 92)
(are, u'VERB', 97)
(best, u'ADJ', 82)
(., u'PUNCT', 94)
(it, u'PRON', 92)
(is, u'VERB', 97)
(lemmatize, u'ADJ', 82)
(test, u'NOUN', 89)
(for, u'ADP', 83)
(spacy, u'NOUN', 89)
(., u'PUNCT', 94)
(I, u'PRON', 92)
(love, u'VERB', 97)
(these, u'DET', 87)
(books, u'NOUN', 89)
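A common next step is to summarize the tags, for example by counting how often each one occurs. The sketch below copies the (token, tag) pairs from the output above into a plain list so that it runs without spaCy; with a live `test_doc` the same count would be `Counter(token.pos_ for token in test_doc)`.

```python
from collections import Counter

# (token text, coarse POS tag) pairs copied from the session output above.
tagged = [
    ("you", "PRON"), ("are", "VERB"), ("best", "ADJ"), (".", "PUNCT"),
    ("it", "PRON"), ("is", "VERB"), ("lemmatize", "ADJ"), ("test", "NOUN"),
    ("for", "ADP"), ("spacy", "NOUN"), (".", "PUNCT"),
    ("I", "PRON"), ("love", "VERB"), ("these", "DET"), ("books", "NOUN"),
]

# Count how many tokens received each tag.
counts = Counter(tag for _, tag in tagged)
for tag, n in counts.most_common():
    print(tag, n)
```

The counts also make tagger slips easy to spot: "lemmatize" was tagged ADJ here, which is why ADJ shows up twice.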
Named entity recognition (NER):
In [11]: test_doc = nlp(u"Rami Eid is studying at Stony Brook University in New York")
In [12]: for ent in test_doc.ents:
   ....:     print(ent, ent.label_, ent.label)
(Rami Eid, u'PERSON', 346)
(Stony Brook University, u'ORG', 349)
(New York, u'GPE', 350)
Noun phrase extraction:
In [13]: test_doc = nlp(u'Natural language processing (NLP) deals with the application of computational models to text or speech data. Application areas within NLP include automatic (machine) translation between languages; dialogue systems, which allow a human to interact with a machine using natural language; and information extraction, where the goal is to transform unstructured text into structured (database) representations that can be searched and browsed in flexible ways. NLP technologies are having a dramatic impact on the way people interact with computers, on the way people interact with each other through the use of language, and on the way people access the vast amount of linguistic data now in electronic form. From a scientific viewpoint, NLP involves fundamental questions of how to structure formal models (for example statistical models) of natural language phenomena, and of how to design algorithms that implement these models.')
In [14]: for np in test_doc.noun_chunks:
   ....:     print(np)
Natural language processing
Natural language processing (NLP) deals
the application
computational models
Application areas
automatic (machine) translation
dialogue systems
a human
a machine
natural language
information extraction
the goal
unstructured text
structured (database) representations
flexible ways
NLP technologies
a dramatic impact
the way
the way
the use
the way
the vast amount
linguistic data
electronic form
a scientific viewpoint
fundamental questions
formal models
natural language phenomena
these models
Word vector similarity:
In [15]: test_doc = nlp(u"Apples and oranges are similar. Boots and hippos aren't.")
In [16]: apples = test_doc[0]
In [17]: print(apples)
In [18]: oranges = test_doc[2]
In [19]: print(oranges)
In [20]: boots = test_doc[6]
In [21]: print(boots)
In [22]: hippos = test_doc[8]
In [23]: print(hippos)
In [24]: apples.similarity(oranges)
Out[24]: 0.77809414836023805
In [25]: boots.similarity(hippos)
Out[25]: 0.038474555379008429
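The scores above are cosine similarities between the tokens' word vectors: values near 1 mean the vectors point in almost the same direction, values near 0 mean they are unrelated. The sketch below shows the computation itself in plain Python. The three-dimensional vectors are invented toy values, not the 300-dimensional GloVe vectors downloaded earlier, so the numbers will not match the session output.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector has no direction to compare
    return dot / (norm_a * norm_b)


# Toy vectors, made up for illustration only.
apples = [0.9, 0.8, 0.1]
oranges = [0.8, 0.9, 0.2]
boots = [0.1, 0.2, 0.9]

print(cosine_similarity(apples, oranges))  # close to 1: similar direction
print(cosine_similarity(apples, boots))    # much lower: different direction
```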
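spaCy also covers syntactic parsing and other features not shown here. It is also worth noting that, starting with version 1.0, spaCy added support for hooking in deep learning tools such as TensorFlow and Keras. For details, see the sentiment analysis example in the official documentation: "Hooking a deep learning model into spaCy".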
