ELKI
ELKI (''Environment for Developing KDD-Applications Supported by Index-Structures'') is a data mining (KDD, knowledge discovery in databases) software framework developed for use in research and teaching. It was originally created by the database systems research unit at the Ludwig Maximilian University of Munich, Germany, led by Professor Hans-Peter Kriegel. The project has continued at the Technical University of Dortmund, Germany. It aims at allowing the development and evaluation of advanced data mining algorithms and their interaction with database index structures.


Description

The ELKI framework is written in Java and built around a modular architecture. Most currently included algorithms perform clustering and outlier detection, supported by database index structures. The object-oriented architecture allows the combination of arbitrary algorithms, data types, distance functions, indexes, and evaluation measures. The Java just-in-time compiler optimizes all combinations to a similar extent, making benchmarking results more comparable when they share large parts of the code. When developing new algorithms or index structures, the existing components can easily be reused, and the type safety of Java detects many programming errors at compile time. ELKI is aimed at researchers and students who want to add their own methods and compare different algorithms on large, unlabeled datasets. It has been used in data science to cluster sperm whale codas, for phoneme clustering, for anomaly detection in spaceflight operations, for bike sharing redistribution, and for traffic prediction.


Objectives

The university project is developed for use in ''teaching and research''. The source code is written with extensibility and reusability in mind, but is also optimized for performance. The experimental evaluation of algorithms depends on many environmental factors, and implementation details can have a large impact on the runtime. ELKI aims at providing a shared codebase with comparable implementations of many algorithms. As a research project, it currently does not offer integration with business intelligence applications or an interface to common database management systems via SQL. The copyleft (AGPL) license may also be a hindrance to integration in commercial products; nevertheless, it can be used to evaluate algorithms prior to developing one's own implementation for a commercial product. Furthermore, applying the algorithms requires knowledge of their usage and parameters, and study of the original literature. The audience is students, researchers, data scientists, and software engineers.


Architecture

ELKI is modeled around a database-inspired core, which uses a vertical data layout that stores data in column groups (similar to column families in NoSQL databases). This database core provides nearest neighbor search, range/radius search, and distance query functionality with index acceleration for a wide range of dissimilarity measures. Algorithms based on such queries (e.g. the k-nearest-neighbor algorithm, local outlier factor, and DBSCAN) can be implemented easily and benefit from the index acceleration. The database core also provides fast and memory-efficient collections for object collections and associative structures such as nearest neighbor lists.

ELKI makes extensive use of Java interfaces, so that it can be extended easily in many places. For example, custom data types, distance functions, index structures, algorithms, input parsers, and output modules can be added and combined without modifying the existing code. This includes the possibility of defining a custom distance function and using existing indexes for acceleration. ELKI uses a service loader architecture to allow publishing extensions as separate jar files.

ELKI uses optimized collections for performance rather than the standard Java API. For loops, for example, are written similarly to C++ iterators:

    for (DBIDIter iter = ids.iter(); iter.valid(); iter.advance())

In contrast to typical Java iterators (which can only iterate over objects), this conserves memory, because the iterator can internally use primitive values for data storage. The reduced garbage collection improves the runtime. Optimized collections libraries such as GNU Trove3, Koloboke, and fastutil employ similar optimizations. ELKI includes data structures such as object collections and heaps (for, e.g., nearest neighbor search) using such optimizations.
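The C++-style iterator idiom can be illustrated with a small self-contained sketch. The `IntIter` interface and `IntArray` class below are hypothetical stand-ins modeled on, but not identical to, ELKI's actual `DBIDIter` API; the point is that the iterator hands out primitive values, so no wrapper objects are allocated during iteration.

```java
// Sketch of the C++-style iterator idiom described above. IntIter and
// IntArray are illustrative stand-ins, not ELKI's real DBIDIter API.
public class PrimitiveIterDemo {
    /** Minimal C++-style iterator over primitive ints. */
    interface IntIter {
        boolean valid();  // is the iterator at a valid position?
        void advance();   // move to the next position
        int get();        // current value as a primitive, no boxing
    }

    /** A simple array-backed collection exposing such an iterator. */
    static class IntArray {
        private final int[] data;
        IntArray(int... data) { this.data = data; }
        IntIter iter() {
            return new IntIter() {
                private int pos = 0;
                public boolean valid() { return pos < data.length; }
                public void advance() { pos++; }
                public int get() { return data[pos]; }
            };
        }
    }

    static long sum(IntArray ids) {
        long total = 0;
        // Same loop shape as the ELKI example in the text:
        for (IntIter iter = ids.iter(); iter.valid(); iter.advance()) {
            total += iter.get();  // primitive access, no Integer allocation
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(sum(new IntArray(1, 2, 3, 4)));  // prints 10
    }
}
```

A standard `Iterator<Integer>` would box every element into an `Integer` object on each call to `next()`; the pattern above avoids that garbage entirely, which is what the optimized collections libraries mentioned below also do.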


Visualization

The visualization module uses SVG for scalable graphics output, and Apache Batik for rendering of the user interface as well as lossless export into PostScript and PDF for easy inclusion in scientific publications in LaTeX. Exported files can be edited with SVG editors such as Inkscape. Since cascading style sheets are used, the graphics design can be restyled easily. Batik, however, is rather slow and memory intensive, so the visualizations do not scale well to large data sets; for larger data sets, only a subsample of the data is visualized by default.
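The benefit of CSS-styled SVG, restyling without regenerating any geometry, can be shown with a minimal sketch. This builds the SVG document as a plain string and is not ELKI's actual visualization API (ELKI renders through Apache Batik); the class and method names here are illustrative.

```java
// Minimal illustration of why SVG plus CSS makes restyling easy: the
// geometry (circle elements) and the styling (stylesheet) are separate,
// so swapping the stylesheet recolors the plot without touching shapes.
// Illustrative sketch only; ELKI itself renders via Apache Batik.
public class SvgScatterDemo {
    static String scatter(double[][] points, String css) {
        StringBuilder sb = new StringBuilder();
        sb.append("<svg xmlns=\"http://www.w3.org/2000/svg\" width=\"100\" height=\"100\">\n");
        sb.append("<style>").append(css).append("</style>\n");
        for (double[] p : points) {
            // Each data point carries only a class attribute, no inline style.
            sb.append(String.format(java.util.Locale.ROOT,
                "<circle class=\"pt\" cx=\"%.1f\" cy=\"%.1f\" r=\"2\"/>%n", p[0], p[1]));
        }
        sb.append("</svg>\n");
        return sb.toString();
    }

    public static void main(String[] args) {
        double[][] pts = { {10, 20}, {40, 60}, {80, 30} };
        String red = scatter(pts, ".pt { fill: red; }");
        String blue = scatter(pts, ".pt { fill: blue; }");
        // The two documents differ only in the stylesheet.
        System.out.println(red.equals(blue.replace("blue", "red")));  // prints true
    }
}
```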


Awards

Version 0.4, presented at the "Symposium on Spatial and Temporal Databases" 2011, which included various methods for spatial outlier detection, won the conference's "best demonstration paper award".


Included algorithms

Select included algorithms:
* Cluster analysis:
** K-means clustering (including fast algorithms such as Elkan, Hamerly, Annulus, and Exponion k-means, and robust variants such as k-means--)
** K-medians clustering
** K-medoids clustering (PAM) (including FastPAM and approximations such as CLARA, CLARANS)
** Expectation-maximization algorithm for Gaussian mixture modeling
** Hierarchical clustering (including the fast SLINK, CLINK, NNChain and Anderberg algorithms)
** Single-linkage clustering
** Leader clustering
** DBSCAN (Density-Based Spatial Clustering of Applications with Noise, with full index acceleration for arbitrary distance functions)
** OPTICS (Ordering Points To Identify the Clustering Structure), including the extensions OPTICS-OF, DeLi-Clu, HiSC, HiCO and DiSH
** HDBSCAN
** Mean-shift clustering
** BIRCH clustering
** SUBCLU (Density-Connected Subspace Clustering for High-Dimensional Data)
** CLIQUE clustering
** ORCLUS and PROCLUS clustering
** COPAC, ERiC and 4C clustering
** CASH clustering
** DOC and FastDOC subspace clustering
** P3C clustering
** Canopy clustering algorithm
* Anomaly detection:
** k-Nearest-Neighbor outlier detection
** LOF (Local Outlier Factor)
** LoOP (Local Outlier Probabilities)
** OPTICS-OF
** DB-Outlier (Distance-Based Outliers)
** LOCI (Local Correlation Integral)
** LDOF (Local Distance-Based Outlier Factor)
** EM-Outlier
** SOD (Subspace Outlier Degree)
** COP (Correlation Outlier Probabilities)
* Frequent itemset mining and association rule learning:
** Apriori algorithm
** Eclat
** FP-growth
* Dimensionality reduction:
** Principal component analysis
** Multidimensional scaling
** T-distributed stochastic neighbor embedding (t-SNE)
* Spatial index structures and other search indexes:
** R-tree
** R*-tree
** M-tree
** k-d tree
** X-tree
** Cover tree
** iDistance
** NN descent
** Locality-sensitive hashing (LSH)
* Evaluation:
** Precision and recall, F1 score, average precision
** Receiver operating characteristic (ROC curve)
** Discounted cumulative gain (including NDCG)
** Silhouette index
** Davies–Bouldin index
** Dunn index
** Density-based cluster validation (DBCV)
* Visualization:
** Scatter plots
** Histograms
** Parallel coordinates (also in 3D, using OpenGL)
* Other:
** Statistical distributions and many parameter estimators, including robust MAD-based and L-moment-based estimators
** Dynamic time warping
** Change point detection in time series
** Intrinsic dimensionality estimators
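The baseline behind the k-means family listed above can be sketched in a few lines. This is a generic, didactic Lloyd-style k-means on one-dimensional data, not ELKI's implementation: the Elkan, Hamerly, Annulus, and Exponion variants in ELKI additionally prune distance computations using the triangle inequality, but compute the same kind of result.

```java
import java.util.Arrays;

// Generic Lloyd-style k-means on 1-D data: a didactic sketch of the
// baseline algorithm, NOT ELKI's implementation (ELKI's fast variants
// prune distance computations via the triangle inequality).
public class KMeans1D {
    /** Iterate until assignments stabilize; returns the final centroids. */
    static double[] cluster(double[] data, double[] initialCentroids) {
        double[] centroids = initialCentroids.clone();
        int[] assign = new int[data.length];
        boolean changed = true;
        while (changed) {
            changed = false;
            // Assignment step: each point goes to its nearest centroid.
            for (int i = 0; i < data.length; i++) {
                int best = 0;
                for (int c = 1; c < centroids.length; c++) {
                    if (Math.abs(data[i] - centroids[c]) < Math.abs(data[i] - centroids[best])) {
                        best = c;
                    }
                }
                if (assign[i] != best) { assign[i] = best; changed = true; }
            }
            // Update step: each centroid becomes the mean of its points.
            double[] sum = new double[centroids.length];
            int[] count = new int[centroids.length];
            for (int i = 0; i < data.length; i++) {
                sum[assign[i]] += data[i];
                count[assign[i]]++;
            }
            for (int c = 0; c < centroids.length; c++) {
                if (count[c] > 0) centroids[c] = sum[c] / count[c];
            }
        }
        return centroids;
    }

    public static void main(String[] args) {
        double[] data = {1.0, 1.2, 0.8, 9.0, 9.5, 8.5};
        // Converges to two centroids near 1.0 and 9.0.
        System.out.println(Arrays.toString(cluster(data, new double[]{0.0, 10.0})));
    }
}
```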


Version history

Version 0.1 (July 2008) contained several algorithms from cluster analysis and anomaly detection, as well as some index structures such as the R*-tree. The focus of the first release was on subspace clustering and correlation clustering algorithms. Version 0.2 (July 2009) added functionality for time series analysis, in particular distance functions for time series. Version 0.3 (March 2010) extended the choice of anomaly detection algorithms and visualization modules. Version 0.4 (September 2011) added algorithms for geo data mining and support for multi-relational databases and index structures. Version 0.5 (April 2012) focused on the evaluation of cluster analysis results, adding new visualizations and some new algorithms. Version 0.6 (June 2013) introduced a new 3D adaptation of parallel coordinates for data visualization, apart from the usual additions of algorithms and index structures. Version 0.7 (August 2015) added support for uncertain data types, and algorithms for the analysis of uncertain data. Version 0.7.5 (February 2019) added additional clustering algorithms, anomaly detection algorithms, evaluation measures, and indexing structures. Version 0.8 (October 2022) added automatic index creation, garbage collection, and incremental priority search, as well as many more algorithms such as BIRCH.


Similar applications

* scikit-learn: machine learning library in Python
* Weka: a similar project by the University of Waikato, with a focus on classification algorithms
* RapidMiner: an application available commercially (a restricted version is available as open source)
* KNIME: an open source platform which integrates various components for machine learning and data mining


See also

* Comparison of statistical packages


References


External links

* Official website of ELKI with download and documentation.