Feature engineering, also called feature extraction or feature discovery, is the process of using domain knowledge to extract features (characteristics, properties, attributes) from raw data. The motivation is to use these extra features to improve the quality of results from a machine learning process, compared with supplying only the raw data to the machine learning process.

Process

The feature engineering process is:
* Brainstorming or testing features
* Deciding what features to create
* Creating features
* Testing the impact of the identified features on the task
* Improving your features if needed
* Repeat
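The "create, then test the impact" steps can be illustrated with a small sketch. It assumes a toy data set in which the target depends on the square of the raw input, and it uses the absolute Pearson correlation with the target as a stand-in for "impact on the task". Both the data and the scoring choice are illustrative assumptions; in practice the impact is usually measured by validating a model with and without the feature.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical toy data: the target depends on the square of the raw input.
raw = [-3, -2, -1, 0, 1, 2, 3]
target = [x * x for x in raw]

# Creating a feature: a candidate engineered feature derived from the raw value.
squared = [x * x for x in raw]

# Testing the impact: compare how strongly each candidate tracks the target.
score_raw = abs(pearson(raw, target))
score_squared = abs(pearson(squared, target))
print(f"raw: {score_raw:.3f}, squared: {score_squared:.3f}")
```

Here the raw input is uncorrelated with the target while the engineered feature tracks it exactly, so the engineered feature would be kept and the loop repeated for the next candidate.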

Typical engineered features

The following list provides some typical ways to engineer useful features:
* Numerical transformations (like taking fractions or scaling)
* Category encoders like one-hot or target encoder (for categorical data)
* Clustering
* Group aggregated values
* Principal component analysis (for numerical data)
* Feature construction: building new "physical", knowledge-based parameters relevant to the problem. For example, in physics, construction of dimensionless numbers such as the Reynolds number in fluid dynamics, the Nusselt number in heat transfer, or the Archimedes number in sedimentation; or construction of first approximations of the solution, such as analytical strength of materials solutions in mechanics.
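Several of the techniques listed above can be sketched with the Python standard library alone. The records, field names and the `reynolds_number` helper below are hypothetical and chosen only for illustration; the Reynolds number uses the usual definition Re = ρ·v·L / μ.

```python
from collections import defaultdict

# Hypothetical raw records; all field names are illustrative.
rows = [
    {"customer": "a", "colour": "red", "amount": 10.0},
    {"customer": "a", "colour": "blue", "amount": 30.0},
    {"customer": "b", "colour": "red", "amount": 50.0},
]

# Numerical transformation: min-max scaling of "amount" to the range [0, 1].
amounts = [r["amount"] for r in rows]
lo, hi = min(amounts), max(amounts)
scaled = [(a - lo) / (hi - lo) for a in amounts]

# Category encoding: one-hot vectors over the observed categories.
categories = sorted({r["colour"] for r in rows})
one_hot = [[1 if r["colour"] == c else 0 for c in categories] for r in rows]

# Group aggregated values: mean amount per customer, attached to each row.
groups = defaultdict(list)
for r in rows:
    groups[r["customer"]].append(r["amount"])
group_mean = {key: sum(vals) / len(vals) for key, vals in groups.items()}
mean_per_row = [group_mean[r["customer"]] for r in rows]

# Feature construction: a dimensionless number built from physical quantities.
def reynolds_number(density, velocity, length, viscosity):
    """Re = density * velocity * characteristic length / dynamic viscosity."""
    return density * velocity * length / viscosity
```

Each derived list lines up with `rows`, so the new columns can be appended to the raw records before training.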

Relevance

Features vary in significance. Even relatively insignificant features may contribute to a model.

Feature selection can reduce the number of features to prevent a model from becoming too specific to the training data set (overfitting).
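As a rough illustration, one simple family of feature selection techniques ("filter" methods) ranks features by a relevance score and keeps only the best ones. The sketch below assumes absolute Pearson correlation with the target as the score; the feature names, data and the `select_top_k` helper are hypothetical, and many other selection strategies exist.

```python
import math

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input has zero variance."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)

def select_top_k(features, target, k):
    """Return the names of the k features most correlated with the target."""
    ranked = sorted(
        features,
        key=lambda name: abs(pearson(features[name], target)),
        reverse=True,
    )
    return ranked[:k]

# Hypothetical feature columns and target.
features = {
    "signal": [1.0, 2.0, 3.0, 4.0, 5.0],
    "noisy_signal": [1.0, 3.0, 2.0, 5.0, 4.0],
    "constant": [7.0, 7.0, 7.0, 7.0, 7.0],
}
target = [2.0, 4.0, 6.0, 8.0, 10.0]

selected = select_top_k(features, target, k=2)
print(selected)
```

The constant column carries no information about the target and is dropped, which is the kind of pruning that helps limit overfitting.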

Explosion

Feature explosion occurs when the number of identified features grows inappropriately. Common causes include:
* Feature templates - implementing feature templates instead of coding new features
* Feature combinations - combinations that cannot be represented by a linear system

Feature explosion can be limited via techniques such as regularization, kernel methods, and feature selection.
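To give a sense of how quickly feature combinations grow, the following sketch enumerates all product terms of up to a given degree over a set of base features and compares the result with the closed-form count C(n + d, d) - 1. Treating "feature combinations" as polynomial interaction terms is an assumption made for illustration, and the function names are hypothetical.

```python
from itertools import combinations_with_replacement
from math import comb

def interaction_terms(names, max_degree):
    """Enumerate every product of 1..max_degree base features (with repetition)."""
    terms = []
    for degree in range(1, max_degree + 1):
        terms.extend(combinations_with_replacement(names, degree))
    return terms

def count_interaction_terms(n_features, max_degree):
    """Number of monomials of degree 1..max_degree in n_features variables."""
    return comb(n_features + max_degree, max_degree) - 1

# Three base features already yield nine terms at degree two.
small = interaction_terms(["a", "b", "c"], 2)
print(len(small), small)

# The count grows polynomially in the number of base features.
for n in (10, 100, 1000):
    print(n, count_interaction_terms(n, 3))
```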

Automation

Automation of feature engineering is a research topic that dates back to the 1990s. Machine learning software that incorporates automated feature engineering has been commercially available since 2016. Related academic literature can be roughly separated into two types:
* Multi-relational decision tree learning (MRDTL) uses a supervised algorithm that is similar to a decision tree.
* Deep Feature Synthesis uses simpler methods.

Multi-relational decision tree learning (MRDTL)

MRDTL generates features in the form of SQL queries by successively adding clauses to the queries. For instance, the algorithm might start out with

    SELECT COUNT(*) FROM ATOM t1 LEFT JOIN MOLECULE t2 ON t1.mol_id = t2.mol_id GROUP BY t1.mol_id

The query can then successively be refined by adding conditions, such as "WHERE t1.charge <= -0.392".

However, most MRDTL studies base implementations on relational databases, which results in many redundant operations. These redundancies can be reduced by using techniques such as tuple id propagation. Efficiency can be increased by using incremental updates, which eliminates redundancies.
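The query refinement described above can be tried out with an in-memory SQLite database. This is only a sketch of how a refined query yields a new candidate feature, not an implementation of MRDTL. The table names, the `mol_id` and `charge` columns and the threshold come from the example query; the `atom_id` column, the sample rows and the `feature` helper are assumptions, and `t1.mol_id` is added to the SELECT list so that each count can be matched to its molecule.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE MOLECULE (mol_id INTEGER PRIMARY KEY);
    CREATE TABLE ATOM (atom_id INTEGER PRIMARY KEY, mol_id INTEGER, charge REAL);
""")
# Hypothetical sample data.
conn.executemany("INSERT INTO MOLECULE VALUES (?)", [(1,), (2,)])
conn.executemany(
    "INSERT INTO ATOM VALUES (?, ?, ?)",
    [(1, 1, -0.5), (2, 1, 0.1), (3, 1, -0.4), (4, 2, 0.2), (5, 2, -0.45)],
)

BASE = (
    "SELECT t1.mol_id, COUNT(*) FROM ATOM t1 "
    "LEFT JOIN MOLECULE t2 ON t1.mol_id = t2.mol_id"
)
GROUP = " GROUP BY t1.mol_id"

def feature(conditions=()):
    """Run the base query, refined by the given WHERE conditions."""
    where = (" WHERE " + " AND ".join(conditions)) if conditions else ""
    return dict(conn.execute(BASE + where + GROUP).fetchall())

# Starting query: number of atoms per molecule.
base_feature = feature()
# Refined query: number of atoms per molecule with charge <= -0.392.
refined_feature = feature(["t1.charge <= -0.392"])
print(base_feature, refined_feature)
```

Each refinement produces one candidate feature per molecule. Note that both queries scan and join the same tables again, which is the kind of redundant work that the techniques mentioned above aim to reduce.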

Open-source implementations

There are a number of open-source libraries and tools that automate feature engineering on relational data and time series:
* featuretools is a Python library for transforming time series and relational data into feature matrices for machine learning.
* OneBM or One-Button Machine combines feature transformations and feature selection on relational data with feature selection techniques.
* getML community is an open source tool for automated feature engineering on time series and relational data. It is implemented in C/C++ with a Python interface. It has been shown to be at least 60 times faster than tsflex, tsfresh, tsfel, featuretools or kats.
* tsfresh is a Python library for feature extraction on time series data. It evaluates the quality of the features using hypothesis testing.
* tsflex is an open source Python library for extracting features from time series data. Despite being 100% written in Python, it has been shown to be faster and more memory efficient than tsfresh, seglearn or tsfel.
* seglearn is an extension for multivariate, sequential time series data to the scikit-learn Python library.
* tsfel is a Python package for feature extraction on time series data.
* kats is a Python toolkit for analyzing time series data.

Deep feature synthesis

The deep feature synthesis (DFS) algorithm beat 615 of 906 human teams in a competition.

Feature stores

A feature store is where features are stored and organized for the explicit purpose of being used either to train models (by data scientists) or to make predictions (by applications that have a trained model). It is a central location where users can create or update groups of features derived from multiple data sources, and create or update datasets from those feature groups, either for training models or for use in applications that do not compute the features themselves but retrieve them when needed to make predictions.

A feature store includes the ability to store the code used to generate features, apply that code to raw data, and serve the resulting features to models upon request. Useful capabilities include feature versioning and policies governing the circumstances under which features can be used. Feature stores can be standalone software tools or built into machine learning platforms.
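The three core capabilities (storing feature code, applying it to raw data, and serving the results) together with versioning can be pictured with a toy in-memory class. This is a conceptual sketch only: the `FeatureStore` class, its method names and the sample data are hypothetical and do not correspond to the API of any particular product, and real feature stores add persistence, access policies and low-latency serving.

```python
class FeatureStore:
    """Toy in-memory feature store keyed by feature name and version."""

    def __init__(self):
        self._definitions = {}  # (name, version) -> function computing the feature
        self._values = {}       # (name, version, entity_id) -> computed value

    def register(self, name, version, func):
        """Store the code used to generate a feature."""
        self._definitions[(name, version)] = func

    def materialize(self, name, version, raw_records):
        """Apply the stored code to raw data and keep the results."""
        func = self._definitions[(name, version)]
        for entity_id, record in raw_records.items():
            self._values[(name, version, entity_id)] = func(record)

    def get(self, name, version, entity_id):
        """Serve a precomputed feature value on request."""
        return self._values[(name, version, entity_id)]

# Hypothetical raw data, keyed by entity identifier.
raw = {
    "u1": {"purchases": [10.0, 30.0]},
    "u2": {"purchases": [5.0]},
}

store = FeatureStore()
# Two versions of the same feature can coexist.
store.register("spend", 1, lambda r: sum(r["purchases"]))
store.register("spend", 2, lambda r: sum(r["purchases"]) / len(r["purchases"]))
store.materialize("spend", 1, raw)
store.materialize("spend", 2, raw)

print(store.get("spend", 1, "u1"), store.get("spend", 2, "u1"))
```

An application making predictions would call only `get`, so it never needs access to the raw data or to the feature code.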