In business intelligence, data classification has close ties to data clustering, but where data clustering is ''descriptive'', data classification is ''predictive''.
In essence, data classification consists of using variables with known values to predict the unknown or future values of other variables. It can be used in, for example, direct marketing, insurance fraud detection, or medical diagnosis.
[Kimball, R. et al. (2008). ''The Data Warehouse Lifecycle Toolkit'' (2nd ed.). Wiley.]
The first step in a data classification is to cluster the data set used for category training, in order to create the desired number of categories. An algorithm, called the ''classifier'', is then applied to the categories, creating a descriptive model for each. These models can then be used to categorize new items in the resulting classification system.
[Golfarelli, M. & Rizzi, S. (2009). ''Data Warehouse Design: Modern Principles and Methodologies''. McGraw-Hill Osborne.]
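The procedure described above (cluster the training set, build one model per category, then categorize new items) can be sketched as follows. This is a minimal illustration, not the method of any cited author: it assumes plain k-means for the clustering step and a nearest-centroid rule as the classifier, and the function names, the `(age, annual spend)` features, and the sample values are all hypothetical.

```python
import math

def cluster(points, k, iterations=10):
    """Step 1: plain k-means; returns one category label per training point."""
    # Assumption: deterministic seeds taken at even intervals from the data.
    centroids = [points[i * len(points) // k] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assign every point to its nearest centroid.
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels

def train_classifier(points, labels):
    """Step 2: build one descriptive model (here: the mean profile) per category."""
    models = {}
    for label in set(labels):
        members = [p for p, l in zip(points, labels) if l == label]
        models[label] = tuple(sum(x) / len(members) for x in zip(*members))
    return models

def classify(models, item):
    """Step 3: assign a new item to the category whose model is closest."""
    return min(models, key=lambda label: math.dist(item, models[label]))

# Hypothetical training data: (age, annual spend) per customer.
training = [(22, 300), (25, 350), (24, 320), (58, 1500), (61, 1450), (63, 1600)]
labels = cluster(training, k=2)
models = train_classifier(training, labels)
new_customer_category = classify(models, (60, 1550))
```

In practice the features would normally be scaled before distances are computed, since here the spend column dominates the age column.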
Effectiveness
According to Golfarelli and Rizzi, these are the measures of effectiveness of the classifier:
*''Predictive accuracy'': How well does it predict the categories for new observations?
*''Speed'': What is the computational cost of using the classifier?
*''Robustness'': How well do the models created perform if data quality is low?
*''Scalability'': Does the classifier function efficiently with large amounts of data?
*''Interpretability'': Are the results understandable to users?
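Two of the measures listed above, predictive accuracy and speed, are straightforward to estimate on held-out observations whose categories are already known. The sketch below is an illustration under stated assumptions: the helper names, the threshold classifier, and the sample data are hypothetical and do not come from Golfarelli and Rizzi.

```python
import time

def predictive_accuracy(classifier, observations, true_categories):
    """Fraction of observations whose predicted category matches the known one."""
    hits = sum(classifier(o) == t for o, t in zip(observations, true_categories))
    return hits / len(observations)

def classification_time(classifier, observations):
    """Wall-clock seconds needed to classify all observations once."""
    start = time.perf_counter()
    for observation in observations:
        classifier(observation)
    return time.perf_counter() - start

# Hypothetical stand-in classifier: a fixed threshold on annual spend.
def threshold_classifier(annual_spend):
    return "high" if annual_spend >= 1000 else "low"

# Hypothetical held-out observations with known categories.
holdout = [200, 950, 1200, 1800, 400]
known = ["low", "high", "high", "high", "low"]

accuracy = predictive_accuracy(threshold_classifier, holdout, known)
elapsed = classification_time(threshold_classifier, holdout)
```

Robustness could be estimated in the same way by re-running `predictive_accuracy` on a deliberately degraded copy of the held-out data and comparing the two figures.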
Typical examples of input for data classification include variables such as demographics, lifestyle information, or economic behaviour.
Challenges
There are several challenges in working with data classification. One in particular is that anyone who uses categories on, for example, customers or clients must do the modeling in an iterative process. This ensures that changes in the characteristics of customer groups do not go unnoticed and leave the existing categories outdated and obsolete.
This could be of special importance to insurance or banking companies, where fraud detection is extremely relevant. New fraud patterns may go unnoticed if methods to monitor these changes, and to raise an alert when categories are changing, disappearing, or newly emerging, are not developed and implemented.
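One simple way to monitor such changes is to compare how items are distributed across categories in two periods and to flag categories that have appeared, vanished, or shifted noticeably. The sketch below assumes that each period's items have already been assigned to categories (for instance by re-running the clustering); the function names, the 10-percentage-point threshold, and the claim categories are hypothetical.

```python
from collections import Counter

def category_shares(assignments):
    """Share of items falling into each category."""
    counts = Counter(assignments)
    total = sum(counts.values())
    return {category: n / total for category, n in counts.items()}

def drift_alerts(baseline, current, threshold=0.10):
    """Compare category shares of two periods and describe notable changes."""
    old, new = category_shares(baseline), category_shares(current)
    alerts = []
    for category in sorted(set(old) | set(new)):
        if category not in new:
            alerts.append(f"{category}: disappeared")
        elif category not in old:
            alerts.append(f"{category}: newly emerged")
        elif abs(new[category] - old[category]) > threshold:
            alerts.append(
                f"{category}: share changed from "
                f"{old[category]:.0%} to {new[category]:.0%}"
            )
    return alerts

# Hypothetical category assignments for insurance claims in two quarters.
last_quarter = ["routine"] * 80 + ["staged accident"] * 15 + ["inflated claim"] * 5
this_quarter = ["routine"] * 60 + ["inflated claim"] * 25 + ["identity theft"] * 15

alerts = drift_alerts(last_quarter, this_quarter)
```

Running such a check on a schedule is one way to make the modeling iterative: any alert is a prompt to re-cluster and retrain rather than a conclusion in itself.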