spaCy is an open-source software library for advanced natural language processing, written in the programming languages Python and Cython. The library is published under the MIT license and its main developers are Matthew Honnibal and Ines Montani, the founders of the software company Explosion.
Unlike NLTK, which is widely used for teaching and research, spaCy focuses on providing software for production usage. spaCy also supports deep learning workflows that allow connecting statistical models trained by popular machine learning libraries like TensorFlow, PyTorch or MXNet through its own machine learning library Thinc. Using Thinc as its backend, spaCy features convolutional neural network models for part-of-speech tagging, dependency parsing, text categorization and named entity recognition (NER). Prebuilt statistical neural network models to perform these tasks are available for 23 languages, including English, Portuguese, Spanish, Russian and Chinese, and there is also a multi-language NER model. Additional support for tokenization for more than 65 languages allows users to train custom models on their own datasets as well.
History
* Version 1.0 was released on October 19, 2016, and included preliminary support for deep learning workflows by supporting custom processing pipelines. It further included a rule matcher that supported entity annotations, and an officially documented training API.
* Version 2.0 was released on November 7, 2017, and introduced convolutional neural network models for 7 different languages. It also supported custom processing pipeline components and extension attributes, and featured a built-in trainable text classification component.
* Version 3.0 was released on February 1, 2021, and introduced state-of-the-art transformer-based pipelines. It also introduced a new configuration system and training workflow, as well as type hints and project templates. This version dropped support for Python 2.
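The custom processing pipelines, pipeline components and extension attributes mentioned in the version history can be pictured as a chain of callables, each of which receives a document object, annotates it and passes it on. The following is a minimal standard-library sketch of that idea only; it does not use spaCy, and the names `Doc`, `Pipeline`, `whitespace_tokenizer` and `token_counter` are hypothetical stand-ins chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Doc:
    """Hypothetical document container (not spaCy's Doc class)."""
    text: str
    tokens: List[str] = field(default_factory=list)
    # Free-form slot for custom annotations, loosely analogous to extension attributes.
    extensions: Dict[str, object] = field(default_factory=dict)


# A component is any callable that takes a Doc and returns a Doc.
Component = Callable[[Doc], Doc]


def whitespace_tokenizer(doc: Doc) -> Doc:
    """Fill in tokens by splitting the text on whitespace."""
    doc.tokens = doc.text.split()
    return doc


def token_counter(doc: Doc) -> Doc:
    """Custom component: store the token count as an extension value."""
    doc.extensions["n_tokens"] = len(doc.tokens)
    return doc


class Pipeline:
    """Applies registered components to a text in the order they were added."""

    def __init__(self) -> None:
        self.components: List[Component] = []

    def add(self, component: Component) -> "Pipeline":
        self.components.append(component)
        return self  # allow chained calls

    def __call__(self, text: str) -> Doc:
        doc = Doc(text)
        for component in self.components:
            doc = component(doc)
        return doc


nlp = Pipeline().add(whitespace_tokenizer).add(token_counter)
doc = nlp("spaCy focuses on production usage")
print(doc.tokens, doc.extensions["n_tokens"])
```

Because every component shares the same call signature, a new processing step can be added or reordered without changing the others, which is the property that makes such pipelines customizable.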
Main features
* Non-destructive tokenization
* "Alpha tokenization" support for over 65 languages
* Built-in support for trainable pipeline components such as named entity recognition, part-of-speech tagging, dependency parsing, text classification, entity linking and more
* Statistical models for 19 languages
* Multi-task learning with pretrained transformers like BERT
* Support for custom models in PyTorch, TensorFlow and other frameworks
* State-of-the-art speed and accuracy
* Production-ready training system
* Built-in visualizers for syntax and named entities
* Easy model packaging, deployment and workflow management
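"Non-destructive" tokenization in the list above means that splitting a text into tokens discards nothing: each token keeps its original characters and the whitespace that followed it, so the input can be rebuilt exactly. The sketch below illustrates the principle with the Python standard library only. It is an assumption-laden toy, not spaCy's tokenizer: the `Token` class and the `tokenize` and `detokenize` functions are hypothetical names, and the splitting rule is plain whitespace.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Token:
    """Hypothetical token record (not spaCy's Token class)."""
    text: str        # the token's characters, unmodified
    whitespace: str  # whitespace that followed the token in the input
    start: int       # character offset of the token in the input


def tokenize(text: str) -> List[Token]:
    """Split on whitespace while keeping everything needed to rebuild the input."""
    tokens = []
    # Each match is a (possibly empty) run of non-whitespace followed by a
    # (possibly empty) run of whitespace; empty matches are skipped.
    for match in re.finditer(r"(\S*)(\s*)", text):
        if match.group(0):
            tokens.append(Token(match.group(1), match.group(2), match.start(1)))
    return tokens


def detokenize(tokens: List[Token]) -> str:
    """Rebuild the original string from the tokens."""
    return "".join(token.text + token.whitespace for token in tokens)


text = "Hello,  world!\nspaCy keeps   spacing."
tokens = tokenize(text)

# The round trip is lossless, irregular spacing and the newline included.
assert detokenize(tokens) == text
# Every token can be located in the original text through its offset.
assert all(text[t.start:t.start + len(t.text)] == t.text for t in tokens)
```

Storing offsets and trailing whitespace alongside each token is what lets later annotations, such as entity spans, be mapped back to exact character positions in the source text.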
Extensions and visualizers

spaCy comes with several extensions and visualizations that are available as free, open-source libraries:
* Thinc: A machine learning library optimized for CPU usage and deep learning with text input.
* sense2vec: A library for computing word similarities, based on Word2vec (Trask et al., 2015, "sense2vec - A Fast and Accurate Method for Word Sense Disambiguation In Neural Word Embeddings").
* displaCy: An open-source dependency parse tree visualizer built with JavaScript, CSS and SVG.
* displaCy ENT: An open-source named entity visualizer built with JavaScript and CSS.