These datasets are used in machine learning (ML) research and have been cited in peer-reviewed academic journals. Datasets are an integral part of the field of machine learning. Major advances in this field can result from advances in learning algorithms (such as deep learning), computer hardware, and, less intuitively, the availability of high-quality training datasets. High-quality labeled training datasets for supervised and semi-supervised machine learning algorithms are usually difficult and expensive to produce because of the large amount of time needed to label the data. Although they do not need to be labeled, high-quality datasets for unsupervised learning can also be difficult and costly to produce.
Many organizations, including governments, publish and share their datasets. Based on their licenses, the datasets are classified as open data or non-open data. Datasets from various governmental bodies are presented in the List of open government data sites. The datasets are hosted on open data portals, where they can be searched, deposited and accessed through interfaces such as Open API. The datasets are made available as various sorted types and subtypes.
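The split into open and non-open data can be sketched as a simple lookup on a dataset's license identifier. The set of identifiers below is an illustrative assumption (a few licenses widely regarded as open), not an exhaustive or authoritative list, and `classify_by_license` is a hypothetical helper name.

```python
# Illustrative assumption: a small, non-exhaustive set of license
# identifiers that are widely regarded as open.
OPEN_LICENSES = {"CC0-1.0", "CC-BY-4.0", "CC-BY-SA-4.0", "ODbL-1.0", "PDDL-1.0"}


def classify_by_license(license_id: str) -> str:
    """Return "open data" or "non-open data" for a license identifier."""
    # Compare case-insensitively and ignore surrounding whitespace.
    normalized = license_id.strip().upper()
    open_ids = {name.upper() for name in OPEN_LICENSES}
    return "open data" if normalized in open_ids else "non-open data"


if __name__ == "__main__":
    for lic in ("CC-BY-4.0", "CC-BY-NC-4.0", "proprietary"):
        print(lic, "->", classify_by_license(lic))
```

A real portal would typically rely on a maintained license registry rather than a hard-coded set like this one.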
List of sorting used for datasets
Data portals are classified by their type of license. Portals based on open-source licenses are known as open data portals, which are used by many government organizations and academic institutions.
List of open data portals
List of portals suitable for multiple types of applications
Some data portals list a wide variety of dataset subtypes pertaining to many machine learning applications.
List of portals suitable for a specific subtype of applications
Data portals that are suitable for a specific subtype of machine learning application are listed in the subsequent sections.
Image data
Text data
These datasets consist primarily of text for tasks such as natural language processing, sentiment analysis, translation, and cluster analysis.
Reviews
News articles
Messages
Twitter and tweets
Dialogues
Legal
Other text
Sound data
These datasets consist of sounds and sound features used for tasks such as speech recognition and speech synthesis.
Speech
Music
Other sounds
Signal data
These datasets contain electrical signal information that requires some form of signal processing for further analysis.
Electrical
Motion-tracking
Other signals
Chemical data
Datasets from chemical systems.
Chemical Reactions with transition states (TS)
OpenReACT-CHON-EFH
OpenReACT-CHON-EFH (Open Reaction Dataset of Atomic ConfiguraTions comprising C, H, O and N with Energies, Forces and Hessians) is a 2025 open-access benchmark for machine-learning interatomic potentials.
* **RTP set** – 35,087 stationary-point geometries (reactant, transition state and product) drawn from 11,961 elementary reactions, each labeled with density-functional energies, atomic forces and full Hessian matrices at the ωB97X-D/6-31G(d) level.
* **IRC set** – 34,248 structures along 600 minimum-energy reaction paths, used to test extrapolation beyond trained stationary points.
* **NMS set** – 62,527 off-equilibrium geometries generated by normal-mode sampling to probe model robustness under thermal perturbations.
The collection underpins the study ''Does Hessian Data Improve the Performance of Machine Learning Potentials?'' and was used to train and benchmark the machine-learning interatomic potentials reported therein.
The dataset itself is distributed under a CC licence via Figshare.
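To make the three kinds of labels concrete: for a geometry with N atoms, the energy is a single scalar, the forces form an N × 3 array, and the full Hessian is a 3N × 3N matrix. The sketch below only computes these sizes. The class and property names are hypothetical and do not reflect the dataset's actual file layout.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LabelShapes:
    """Sizes of the per-geometry labels for a structure with n_atoms atoms."""

    n_atoms: int

    @property
    def forces(self) -> tuple[int, int]:
        # One Cartesian force vector (x, y, z) per atom.
        return (self.n_atoms, 3)

    @property
    def hessian(self) -> tuple[int, int]:
        # Second derivatives of the energy with respect to all 3N coordinates.
        dof = 3 * self.n_atoms
        return (dof, dof)

    @property
    def unique_hessian_entries(self) -> int:
        # The Hessian is symmetric, so only the upper triangle is independent.
        dof = 3 * self.n_atoms
        return dof * (dof + 1) // 2


if __name__ == "__main__":
    shapes = LabelShapes(n_atoms=5)
    print(shapes.forces, shapes.hessian, shapes.unique_hessian_entries)
```

The quadratic growth of the Hessian with the number of atoms is one reason second-derivative labels are more expensive to store than energies and forces.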
Physical data
Datasets from physical systems.
High-energy physics
Systems
Astronomy
Earth science
Other physical
Biological data
Datasets from biological systems.
Human
Animal
Fungi
Plant
Microbe
Drug discovery
Anomaly data
Question answering data
This section includes datasets that deal with structured data.
Dialog or instruction prompted data
This section includes datasets that contain multi-turn text with at least two actors, a "user" and an "agent". The user makes requests of the agent, which carries them out.
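A minimal sketch of how one such record might be represented, assuming each turn carries a role and a text field. The `Turn` class and `is_valid_dialogue` function are hypothetical names chosen for illustration, and individual datasets use their own schemas.

```python
from dataclasses import dataclass

REQUIRED_ROLES = {"user", "agent"}


@dataclass
class Turn:
    role: str  # for example "user" or "agent"
    text: str


def is_valid_dialogue(turns: list[Turn]) -> bool:
    """Check that both a user and an agent speak at least once."""
    roles_present = {turn.role for turn in turns}
    # Extra roles are allowed, since the data has "at least" two actors.
    return REQUIRED_ROLES <= roles_present


if __name__ == "__main__":
    dialogue = [
        Turn("user", "Set a timer for ten minutes."),
        Turn("agent", "Timer set for ten minutes."),
    ]
    print(is_valid_dialogue(dialogue))
```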
Cybersecurity
Climate and sustainability
Code data
Multivariate data
Financial
Weather
Census
Transit
Internet
Games
Other multivariate
Curated repositories of datasets
As datasets come in myriad formats and can sometimes be difficult to use, there has been considerable work put into curating and standardizing the format of datasets to make them easier to use for machine learning research.
* OpenML: Web platform with Python, R, Java, and other APIs for downloading hundreds of machine learning datasets, evaluating algorithms on datasets, and benchmarking algorithm performance against dozens of other algorithms.
* PMLB: A large, curated repository of benchmark datasets for evaluating supervised machine learning algorithms. Provides classification and regression datasets in a standardized format that are accessible through a Python API.
* Metatext NLP: A web repository (https://metatext.io/datasets) maintained by the community, containing nearly 1000 benchmark datasets. Covers many tasks, from classification to question answering, and various languages, including English, Portuguese and Arabic.
* Appen: Off-the-shelf and open-source datasets hosted and maintained by the company. These biological, image, physical, question answering, signal, sound, text, and video resources number over 250 and can be applied to over 25 different use cases.
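As an illustration of what a standardized format makes possible, the sketch below parses a small tab-separated table into features and labels. The layout (a header row and a label column named `target`) is an assumption made for this example, and `load_table` is a hypothetical helper, not part of any listed repository's API.

```python
import csv
import io


def load_table(text: str, target_column: str = "target"):
    """Split a tab-separated table into feature rows and a label list."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    features, labels = [], []
    for row in reader:
        # Remove the label from the row; the remaining columns are features.
        labels.append(row.pop(target_column))
        features.append({name: float(value) for name, value in row.items()})
    return features, labels


# Made-up sample data, used only to exercise the loader.
SAMPLE = "sepal_length\tsepal_width\ttarget\n5.1\t3.5\t0\n6.2\t2.9\t1\n"

if __name__ == "__main__":
    X, y = load_table(SAMPLE)
    print(X, y)
```

When every dataset in a repository shares one layout, a single loader like this can serve all of them, which is much of the practical value of curation.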
See also
* Comparison of deep learning software
* List of manual image annotation tools
* List of biological databases