Apache SystemDS (previously Apache SystemML) is an open-source machine learning (ML) system for the end-to-end data science lifecycle.
SystemDS's distinguishing characteristics are:
# Algorithm customizability via R-like and Python-like languages.
# Multiple execution modes, including Standalone, Spark Batch, Spark MLContext, Hadoop Batch, and JMLC.
# Automatic optimization based on data and cluster characteristics to ensure both efficiency and scalability.
History
SystemML was created in 2010 by researchers at the IBM Almaden Research Center led by IBM Fellow Shivakumar Vaithyanathan. The researchers observed that data scientists would write machine learning algorithms in languages such as R and Python for small data. To scale to big data, a systems programmer then had to reimplement the algorithm in a language such as Scala. This process typically took days or weeks per iteration, and errors would occur when translating the algorithms to operate on big data. SystemML seeks to simplify this process. A primary goal of SystemML is to automatically scale an algorithm written in an R-like or Python-like language to operate on big data, generating the same answer without the error-prone, multi-iterative translation approach.
On June 15, 2015, at the Spark Summit in San Francisco, Beth Smith, General Manager of IBM Analytics, announced that IBM was open-sourcing SystemML as part of IBM's major commitment to Apache Spark and Spark-related projects. SystemML became publicly available on GitHub on August 27, 2015, and became an Apache Incubator project on November 2, 2015. On May 17, 2017, the Apache Software Foundation Board approved the graduation of Apache SystemML as an Apache Top Level Project.
Key technologies
The following are some of the technologies built into the SystemDS engine.
* Compressed Linear Algebra for Large Scale Machine Learning
* Declarative Machine Learning Language
Examples
Principal Component Analysis
The following code snippet performs a principal component analysis of the input matrix A and returns the eigenvalues and the eigenvectors.
# PCA.dml
# Refer: https://github.com/apache/systemds/blob/master/scripts/algorithms/PCA.dml#L61
N = nrow(A);
D = ncol(A);
# perform z-scoring (centering and scaling)
A = scale(A, center1, scale1);
# co-variance matrix
mu = colSums(A)/N;
C = (t(A) %*% A)/(N-1) - (N/(N-1))*t(mu) %*% mu;
# compute eigen vectors and values
[values, evectors] = eigen(C);
Invocation script
spark-submit SystemDS.jar -f PCA.dml -nvargs INPUT=INPUT_DIR/pca-1000x1000 \
OUTPUT=OUTPUT_DIR/pca-1000x1000-model PROJDATA=1 CENTER=1 SCALE=1
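The covariance step in the script above uses the identity C = AᵀA/(N-1) - (N/(N-1))·μᵀμ, which avoids explicitly subtracting the column means from every row. The following Python sketch is an illustration only, not SystemDS code. The helper names and the small matrix `A` are made up for the example. It checks the shortcut against the textbook sample covariance.

```python
# Illustrative check of the covariance shortcut used in the DML snippet.
# All function names and the sample data are hypothetical.

def transpose(m):
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]

def covariance_shortcut(a):
    """C = (t(A) %*% A)/(N-1) - (N/(N-1)) * t(mu) %*% mu, with mu = colSums(A)/N."""
    n = len(a)
    mu = [[sum(col) / n for col in zip(*a)]]   # 1 x D row vector of column means
    ata = matmul(transpose(a), a)              # D x D
    mtm = matmul(transpose(mu), mu)            # D x D outer product of the means
    d = len(ata)
    return [[ata[i][j] / (n - 1) - (n / (n - 1)) * mtm[i][j] for j in range(d)]
            for i in range(d)]

def covariance_textbook(a):
    """Sample covariance computed from explicitly centered columns."""
    n = len(a)
    mu = [sum(col) / n for col in zip(*a)]
    d = len(mu)
    return [[sum((row[i] - mu[i]) * (row[j] - mu[j]) for row in a) / (n - 1)
             for j in range(d)] for i in range(d)]

A = [[2.0, 1.0], [4.0, 3.0], [6.0, 8.0], [8.0, 8.0]]
C_fast = covariance_shortcut(A)
C_ref = covariance_textbook(A)
```

Both functions return the same 2 x 2 matrix up to floating-point rounding. The shortcut needs only one matrix product and one outer product.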
Database functions
DBSCAN clustering algorithm with Euclidean distance.
X = rand(rows=1780, cols=180, min=1, max=20)
[indices, model] = dbscan(X = X, eps = 2.5, minPts = 360)
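To show what the `eps` and `minPts` parameters control, here is a minimal pure-Python sketch of the DBSCAN idea with Euclidean distance. It is not the SystemDS builtin. The function signature, the use of label 0 for noise, and the sample points are assumptions made for this example.

```python
import math

def dbscan(points, eps, min_pts):
    """Return one label per point: 1, 2, ... for clusters, 0 for noise (sketch only)."""
    n = len(points)
    # Neighborhoods under Euclidean distance; a point counts as its own neighbor.
    neighbors = [[j for j in range(n) if math.dist(points[i], points[j]) <= eps]
                 for i in range(n)]
    labels = [None] * n                  # None means "not yet visited"
    cluster = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_pts:
            labels[i] = 0                # provisionally noise; may become a border point
            continue
        cluster += 1                     # point i is a core point: start a new cluster
        labels[i] = cluster
        frontier = list(neighbors[i])
        while frontier:
            j = frontier.pop()
            if labels[j] == 0:
                labels[j] = cluster      # former noise reachable from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_pts:
                frontier.extend(neighbors[j])   # j is also a core point: keep expanding
    return labels

points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)  # [1, 1, 1, 1, 2, 2, 2, 0]
```

A larger `eps` or a smaller `min_pts` makes more points qualify as core points, so clusters merge more readily and fewer points are labeled as noise.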
Improvements
SystemDS 2.0.0 is the first major release under the new name. This release contains a major refactoring, a few major features, a large number of improvements and fixes, and some experimental features to better support the end-to-end data science lifecycle. In addition, this release removes several outdated features.
* New mechanism for DML-bodied (script-level) builtin functions, and a wealth of new built-in functions for data preprocessing including data cleaning, augmentation and feature engineering techniques, new ML algorithms, and model debugging.
* Several methods for data cleaning have been implemented including multiple imputations with multivariate imputation by chained equations (MICE) and other techniques, SMOTE, an oversampling technique for class imbalance, forward and backward NA filling, cleaning using schema and length information, support for outlier detection using standard deviation and inter-quartile range, and functional dependency discovery.
* A complete framework for lineage tracing and reuse including support for loop deduplication, full and partial reuse, compiler-assisted reuse, and several new rewrites to facilitate reuse.
* New federated runtime backend including support for federated matrices and frames, federated builtins (transform-encode, decode etc.).
* Refactor compression package and add functionalities including quantization for lossy compression, binary cell operations, left matrix multiplication.
* [Experimental] New Python bindings with support for several builtins, matrix operations, federated tensors and lineage traces.
* CUDA implementation of cumulative aggregate operators (cumsum, cumprod etc.)
* New model debugging technique with slice finder.
* New tensor data model (basic tensors of different value types, data tensors with schema)
* [Experimental] Cloud deployment scripts for AWS and scripts to set up and start federated operations.
* Performance improvements with parallel sort, gpu cum agg, append cbind etc.
* Various compiler and runtime improvements including new and improved rewrites, reduced Spark context creation, new eval framework, list operations, and updated native kernel libraries, to name a few.
* New data reader/writer for json frames and support for sql as a data source.
* Miscellaneous improvements: improved documentation, better testing, run/release scripts, improved packaging, Docker container for systemds, support for lambda expressions, bug fixes.
* Removed MapReduce compiler and runtime backend, pydml parser, Java-UDF framework, script-level debugger.
* Deprecated ./scripts/algorithms, as those algorithms will gradually become part of the SystemDS builtins.
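One of the data-cleaning techniques listed above, outlier detection using the inter-quartile range, can be sketched in a few lines of Python. This is an illustration of the general technique, not the SystemDS builtin. The function name, the conventional multiplier of 1.5, and the sample values are assumptions for the example. Quartile definitions also differ between libraries, so the exact bounds may vary.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR] (sketch only)."""
    # statistics.quantiles with n=4 returns the three quartile cut points.
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

data = [10, 12, 11, 13, 12, 11, 95]
outliers = iqr_outliers(data)
print(outliers)  # [95]
```

The standard-deviation variant mentioned in the same item works the same way, with the bounds taken as the mean plus or minus a multiple of the standard deviation.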
Contributions
Apache SystemDS welcomes contributions in code, question and answer, community building, or spreading the word. The contributor guide is available at https://github.com/apache/systemds/blob/main/CONTRIBUTING.md
See also
* Comparison of deep learning software
External links
* Apache SystemML website
* IBM Research - SystemML
* Q & A with Shiv Vaithyanathan, Creator of SystemML and IBM Fellow
* A Universal Translator for Big Data and Machine Learning
* SystemML: Declarative Machine Learning at Scale presentation by Fred Reiss
* SystemML: Declarative Machine Learning on MapReduce
* Hybrid Parallelization Strategies for Large-Scale Machine Learning in SystemML
* SystemML's Optimizer: Plan Generation for Large-Scale Machine Learning Programs
* IBM's SystemML machine learning system becomes Apache Incubator project
* IBM donates machine learning tech to Apache Spark open source community