
Gensim is an open-source library for unsupervised topic modeling, document indexing, retrieval by similarity, and other natural language processing functionalities, using modern statistical machine learning. Gensim is implemented in Python and Cython for performance. It is designed to handle large text collections using data streaming and incremental online algorithms, which differentiates it from most other machine learning software packages that target only in-memory processing.
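
The streaming idea can be illustrated without Gensim itself. The following is a minimal sketch in plain Python, assuming a corpus with one document per line; the names `stream_documents` and `OnlineDocumentFrequency` are hypothetical and are not part of Gensim's API. Documents are yielded one at a time and a running statistic is updated incrementally, so memory use depends on the vocabulary size rather than on the number of documents.

```python
from collections import Counter
from typing import Iterable, Iterator, List


def stream_documents(lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield one tokenized document at a time; nothing is accumulated."""
    for line in lines:
        tokens = line.lower().split()
        if tokens:
            yield tokens


class OnlineDocumentFrequency:
    """Running document-frequency counts, updated one document at a time.

    Hypothetical helper for illustration only, not a Gensim class.
    """

    def __init__(self) -> None:
        self.num_docs = 0
        self.doc_freq: Counter = Counter()

    def update(self, tokens: List[str]) -> None:
        self.num_docs += 1
        # Count each term at most once per document.
        self.doc_freq.update(set(tokens))


# In practice `lines` could be an open file handle, which is read lazily.
lines = [
    "human machine interface",
    "graph of trees",
    "graph minors survey",
]

stats = OnlineDocumentFrequency()
for doc in stream_documents(lines):
    stats.update(doc)

print(stats.num_docs, stats.doc_freq["graph"])
```

Because `stream_documents` is a generator, the same loop works unchanged whether the input is a three-item list or a file far larger than available memory.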


Main Features

Gensim includes streamed parallelized implementations of the fastText, word2vec and doc2vec algorithms, as well as latent semantic analysis (LSA, LSI, SVD), non-negative matrix factorization (NMF), latent Dirichlet allocation (LDA), tf-idf and random projections. Some of the novel online algorithms in Gensim were also published in the 2011 PhD dissertation "Scalability of Semantic Analysis in Natural Language Processing" by Radim Řehůřek, the creator of Gensim.
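
Of the techniques listed above, tf-idf is the simplest to show in a few lines. The sketch below is a textbook formulation in plain Python, not Gensim's implementation: the function name `tfidf`, the base-2 logarithm, and the absence of vector normalization are assumptions made for illustration, and Gensim's own model may weight and normalize differently.

```python
import math
from collections import Counter
from typing import Dict, List


def tfidf(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Return one {term: weight} mapping per document.

    weight = term count in the document * log2(N / document frequency)
    """
    num_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weighted = []
    for doc in docs:
        term_freq = Counter(doc)
        weighted.append({
            term: count * math.log2(num_docs / doc_freq[term])
            for term, count in term_freq.items()
        })
    return weighted


docs = [
    ["graph", "trees"],
    ["graph", "minors", "survey"],
    ["human", "interface"],
    ["human", "graph"],
]
weights = tfidf(docs)

# "trees" occurs in one document of four and outweighs "graph", which occurs in three.
print(weights[0])
```

Terms that appear in many documents receive a small weight, while rare terms receive a large one, which is what makes the weighting useful as a preprocessing step for similarity retrieval.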


Uses of Gensim

The Gensim library had been used and cited in over 1400 commercial and academic applications as of 2018, in a diverse array of disciplines from medicine to insurance claim analysis to patent search. The software has been covered in several news articles, podcasts and interviews.


Free and Commercial Support

The open source code is developed and hosted on GitHub, and a public support forum is maintained on Google Groups and Gitter. Gensim is commercially supported by the company rare-technologies.com, which also provides student mentorships and academic thesis projects for Gensim via its Student Incubator programme.


