spaCy ( ) is an

open-source Open source is source code that is made freely available for possible modification and redistribution. Products include permission to use and view the source code, design documents, or content of the product. The open source model is a decentrali ...

software library for advanced

natural language processing Natural language processing (NLP) is a subfield of computer science and especially artificial intelligence. It is primarily concerned with providing computers with the ability to process data encoded in natural language and is thus closely related ...

, written in the programming languages Python and

Cython Cython () is a superset of the programming language Python, which allows developers to write Python code (with optional, C-inspired syntax extensions) that yields performance comparable to that of C. Cython is a compiled language that is ty ...

. The library is published under the

MIT license The MIT License is a permissive software license originating at the Massachusetts Institute of Technology (MIT) in the late 1980s. As a permissive license, it puts very few restrictions on reuse and therefore has high license compatibility. Unl ...

and its main developers are Matthew Honnibal and Ines Montani, the founders of the software company Explosion. Unlike NLTK, which is widely used for teaching and research, spaCy focuses on providing software for production usage. spaCy also supports

deep learning Deep learning is a subset of machine learning that focuses on utilizing multilayered neural networks to perform tasks such as classification, regression, and representation learning. The field takes inspiration from biological neuroscience a ...

workflows that allow connecting statistical models trained by popular

machine learning Machine learning (ML) is a field of study in artificial intelligence concerned with the development and study of Computational statistics, statistical algorithms that can learn from data and generalise to unseen data, and thus perform Task ( ...

libraries like

TensorFlow TensorFlow is a Library (computing), software library for machine learning and artificial intelligence. It can be used across a range of tasks, but is used mainly for Types of artificial neural networks#Training, training and Statistical infer ...

, PyTorch or MXNet through its own machine learning library Thinc. Using Thinc as its backend, spaCy features

convolutional neural network A convolutional neural network (CNN) is a type of feedforward neural network that learns features via filter (or kernel) optimization. This type of deep learning network has been applied to process and make predictions from many different ty ...

models for

part-of-speech tagging In corpus linguistics, part-of-speech tagging (POS tagging, PoS tagging, or POST), also called grammatical tagging, is the process of marking up a word in a text ( corpus) as corresponding to a particular part of speech, based on both its defini ...

, dependency parsing, text categorization and named entity recognition (NER). Prebuilt statistical

neural network A neural network is a group of interconnected units called neurons that send signals to one another. Neurons can be either biological cells or signal pathways. While individual neurons are simple, many of them together in a network can perfor ...

models to perform these tasks are available for 23 languages, including English, Portuguese, Spanish, Russian and Chinese, and there is also a multi-language NER model. Additional support for tokenization for more than 65 languages allows users to train custom models on their own datasets as well.

History

* Version 1.0 was released on October 19, 2016, and included preliminary support for deep learning workflows by supporting custom processing pipelines. It further included a rule matcher that supported

entity An entity is something that Existence, exists as itself. It does not need to be of material existence. In particular, abstractions and legal fictions are usually regarded as entities. In general, there is also no presumption that an entity is Lif ...

annotations, and an officially documented training API. * Version 2.0 was released on November 7, 2017, and introduced convolutional neural network models for 7 different languages. It also supported custom processing pipeline components and extension attributes, and featured a built-in trainable text classification component. * Version 3.0 was released on February 1, 2021, and introduced state-of-the-art

transformer In electrical engineering, a transformer is a passive component that transfers electrical energy from one electrical circuit to another circuit, or multiple Electrical network, circuits. A varying current in any coil of the transformer produces ...

-based pipelines. It also introduced a new configuration system and training workflow, as well as type hints and project templates. This version dropped support for Python 2.

Main features

* Non-destructive tokenization * "Alpha tokenization" support for over 65 languages * Built-in support for trainable pipeline components such as Named entity recognition,

Part-of-speech tagging In corpus linguistics, part-of-speech tagging (POS tagging, PoS tagging, or POST), also called grammatical tagging, is the process of marking up a word in a text ( corpus) as corresponding to a particular part of speech, based on both its defini ...

, dependency parsing, Text classification,

Entity Linking In natural language processing, Entity Linking, also referred to as named-entity disambiguation (NED), named-entity recognition and disambiguation (NERD), named-entity normalization (NEN), or Concept Recognition, is the task of assigning a unique ...

and more *

Statistical model A statistical model is a mathematical model that embodies a set of statistical assumptions concerning the generation of Sample (statistics), sample data (and similar data from a larger Statistical population, population). A statistical model repre ...

s for 19 languages * Multi-task learning with pretrained transformers like BERT * Support for custom models in PyTorch, TensorFlow and other frameworks * State-of-the-art speed and accuracy * Production-ready training system * Built-in visualizers for

syntax In linguistics, syntax ( ) is the study of how words and morphemes combine to form larger units such as phrases and sentences. Central concerns of syntax include word order, grammatical relations, hierarchical sentence structure (constituenc ...

and named entities * Easy model packaging, deployment and workflow management

Extensions and visualizers

spaCy comes with several extensions and visualizations that are available as free,

libraries: * : A

library optimized for CPU usage and

with text input. * : A library for computing word similarities, based on

Word2vec Word2vec is a technique in natural language processing (NLP) for obtaining vector representations of words. These vectors capture information about the meaning of the word based on the surrounding words. The word2vec algorithm estimates these rep ...

.Trask et al. (2015)
sense2vec - A Fast and Accurate Method for Word Sense Disambiguation In Neural Word Embeddings
* : An

dependency

parse tree A parse tree or parsing tree (also known as a derivation tree or concrete syntax tree) is an ordered, rooted tree that represents the syntactic structure of a string according to some context-free grammar. The term ''parse tree'' itself is use ...

visualizer built with

JavaScript JavaScript (), often abbreviated as JS, is a programming language and core technology of the World Wide Web, alongside HTML and CSS. Ninety-nine percent of websites use JavaScript on the client side for webpage behavior. Web browsers have ...

, CSS and SVG. * : An

named entity visualizer built with

and CSS.

References

External links

*
Implementing Spacy Library
{{Natural language processing Free science software Natural language processing toolkits Python (programming language) libraries 2015 software