Data Classification (Business Intelligence)

In business intelligence, data classification is "the construction of some kind of a method for making judgments for a continuing sequence of cases, where each new case must be assigned to one of pre-defined classes." Data classification has close ties to data clustering, but where data clustering is ''descriptive'', data classification is ''predictive''. In essence, data classification consists of using variables with known values to predict the unknown or future values of other variables. It can be used in, for example, direct marketing, insurance fraud detection, or medical diagnosis. [Kimball, R. et al. (2008). ''The Data Warehouse Lifecycle Toolkit'' (2nd ed.). Wiley.]

The first step in a data classification is to cluster the data set used for category training, to create the desired number of categories. An algorithm, called the ''classifier'', is then applied to the categories, creating a descriptive model for each. These models can then be used to categorize new items in the created classification system. [Golfarelli, M. & Rizzi, S. (2009). ''Data Warehouse Design: Modern Principles and Methodologies''. McGraw-Hill Osborne.]
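The two-step workflow described above (cluster the training data, then build one model per category and use it on new items) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the customer data is invented, a simple k-means routine stands in for the clustering step, and the "descriptive model" of each category is reduced to its mean, with new items assigned to the nearest one. The cited sources do not prescribe these particular techniques.

```python
import math
import random


def k_means(points, k, iterations=20, seed=0):
    """Group points into k clusters; returns one cluster label per point."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assign every point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Move each centroid to the mean of its current members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = tuple(
                    sum(axis) / len(members) for axis in zip(*members)
                )
    return labels


def train_classifier(points, labels):
    """Build one descriptive model per category: here, just the category mean."""
    models = {}
    for category in set(labels):
        members = [p for p, label in zip(points, labels) if label == category]
        models[category] = tuple(
            sum(axis) / len(members) for axis in zip(*members)
        )
    return models


def classify(models, item):
    """Assign a new item to the category whose model is closest."""
    return min(models, key=lambda category: math.dist(item, models[category]))


# Hypothetical training data: (age, purchases per year) for six customers.
training = [(22, 3), (25, 4), (24, 2), (61, 15), (58, 17), (64, 14)]

labels = k_means(training, k=2)               # step 1: create the categories
models = train_classifier(training, labels)   # step 2: one model per category
new_customer_category = classify(models, (23, 3))
print(labels, new_customer_category)
```

The category numbers themselves are arbitrary; what matters is that the new customer lands in the same category as the training customers it resembles.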


Effectiveness

According to Golfarelli and Rizzi, these are the measures of effectiveness of the classifier:
*''Predictive accuracy'': How well does it predict the categories for new observations?
*''Speed'': What is the computational cost of using the classifier?
*''Robustness'': How well do the models created perform if data quality is low?
*''Scalability'': Does the classifier function efficiently with large amounts of data?
*''Interpretability'': Are the results understandable to users?

Typical examples of input for data classification could be variables such as demographics, lifestyle information, or economic behaviour.
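Two of the measures listed above, predictive accuracy and speed, are straightforward to quantify on observations whose categories are already known. The sketch below is only an illustration: the threshold-based `toy_classifier`, the claim amounts, and the category names are all hypothetical, and accuracy is taken as the plain fraction of correct predictions.

```python
import time


def predictive_accuracy(predicted, actual):
    """Fraction of observations whose predicted category matches the known one."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("at least one observation is required")
    hits = sum(1 for p, a in zip(predicted, actual) if p == a)
    return hits / len(actual)


def timed_predictions(classify, items):
    """Run the classifier over items and return (predictions, elapsed seconds)."""
    start = time.perf_counter()
    predictions = [classify(item) for item in items]
    return predictions, time.perf_counter() - start


# Hypothetical stand-in classifier: flags claims above a fixed amount.
def toy_classifier(claim_amount):
    return "suspicious" if claim_amount > 10_000 else "normal"


# Invented held-out observations with known categories.
held_out_amounts = [500, 12_000, 9_000, 30_000, 11_000]
known_categories = ["normal", "suspicious", "normal", "suspicious", "normal"]

predictions, seconds = timed_predictions(toy_classifier, held_out_amounts)
accuracy = predictive_accuracy(predictions, known_categories)
print(f"accuracy={accuracy:.2f}, time={seconds:.6f}s")
```

Robustness and scalability could be probed with the same two functions by re-running them on deliberately degraded or much larger inputs; interpretability has no comparable numeric shortcut.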


Challenges

There are several challenges in working with data classification. One in particular is that anyone who applies categories to, for example, customers or clients must do the modeling as an iterative process. This ensures that changes in the characteristics of customer groups do not go unnoticed and leave the existing categories outdated. This can be especially important for insurance or banking companies, where fraud detection is highly relevant. New fraud patterns may go unnoticed if methods to monitor these changes, and to alert when categories are changing, disappearing, or newly emerging, are not developed and implemented.
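One simple way to watch for such changes is to compare how items are distributed over categories in two periods and to report categories that have shifted, vanished, or appeared. The following is a minimal sketch under stated assumptions: the category names, the counts, and the 10 percentage point threshold are invented for illustration, and a production system would likely use a more careful statistical test.

```python
from collections import Counter


def category_shares(assignments):
    """Share of items per category, e.g. {'normal': 0.9, 'staged_accident': 0.1}."""
    counts = Counter(assignments)
    total = sum(counts.values())
    return {category: count / total for category, count in counts.items()}


def category_alerts(baseline, current, threshold=0.10):
    """Compare two periods and describe categories that changed noticeably."""
    old, new = category_shares(baseline), category_shares(current)
    alerts = []
    for category in sorted(set(old) | set(new)):
        if category not in new:
            alerts.append(f"{category}: disappeared")
        elif category not in old:
            alerts.append(f"{category}: newly emerged")
        elif abs(new[category] - old[category]) > threshold:
            alerts.append(
                f"{category}: share changed from "
                f"{old[category]:.0%} to {new[category]:.0%}"
            )
    return alerts


# Hypothetical category assignments of insurance claims in two quarters.
last_quarter = ["normal"] * 90 + ["staged_accident"] * 10
this_quarter = ["normal"] * 70 + ["staged_accident"] * 10 + ["ghost_broker"] * 20

alerts = category_alerts(last_quarter, this_quarter)
for alert in alerts:
    print(alert)
```

Running such a comparison each time the categories are rebuilt fits the iterative modeling process described above.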

