Datafly algorithm is an
algorithm
In mathematics and computer science, an algorithm () is a finite sequence of rigorous instructions, typically used to solve a class of specific Computational problem, problems or to perform a computation. Algorithms are used as specificat ...
for providing anonymity in medical data. The algorithm was developed by
Latanya Arvette Sweeney
Latanya Arvette Sweeney is an American computer scientist. She is the Daniel Paul Professor of the Practice of Government and Technology at the Harvard Kennedy School and in the Harvard Faculty of Arts and Sciences at Harvard University. She is t ...
in 1997−98. Anonymization is achieved by automatically generalizing, substituting, inserting, and removing information as appropriate without losing many of the details found within the data. The method can be used on-the-fly in
role-based security within an institution, and in
batch mode for
exporting data from an institution.
Organizations release and receive medical data with all explicit
identifier
An identifier is a name that identifies (that is, labels the identity of) either a unique object or a unique ''class'' of objects, where the "object" or class may be an idea, physical countable object (or class thereof), or physical noncountable ...
s—such as name—removed, in the erroneous belief that
patient confidentiality
A patient is any recipient of health care services that are performed by healthcare professionals. The patient is most often ill or injured and in need of treatment by a physician, nurse, optometrist, dentist, veterinarian, or other health ...
is maintained because the resulting data look anonymous. However the remaining data can be used to re-identify individuals by linking or matching the data to other databases or by looking at unique characteristics found in the
field
Field may refer to:
Expanses of open ground
* Field (agriculture), an area of land used for agricultural purposes
* Airfield, an aerodrome that lacks the infrastructure of an airport
* Battlefield
* Lawn, an area of mowed grass
* Meadow, a grass ...
s and
records of the
database
In computing, a database is an organized collection of data stored and accessed electronically. Small databases can be stored on a file system, while large databases are hosted on computer clusters or cloud storage. The design of databases sp ...
itself.
The Datafly algorithm has been criticized for trying to achieve anonymization by overgeneralization. The algorithm selects the
attribute with the greatest number of distinct
value
Value or values may refer to:
Ethics and social
* Value (ethics) wherein said concept may be construed as treating actions themselves as abstract objects, associating value to them
** Values (Western philosophy) expands the notion of value beyo ...
s as the one to generalize first.
Core algorithm
An outline of the Datafly algorithm is presented below.
Input:
Private
Table
Table may refer to:
* Table (furniture), a piece of furniture with a flat surface and one or more legs
* Table (landform), a flat area of land
* Table (information), a data arrangement with rows and columns
* Table (database), how the table data ...
PT;
quasi-identifier Quasi-identifiers are pieces of information that are not of themselves unique identifiers, but are sufficiently well correlated with an entity that they can be combined with other quasi-identifiers to create a unique identifier.
Quasi-identifiers c ...
QI = ( ''A''
1, ..., ''A''
n ), ''k''-anonymity constraint ''k''; domain generalization hierarchies DGH
Ai, where ''i'' = 1,...,''n'' with accompanying
functions ''f''
Ai, and loss, which is a limit on the percentage of
tuple
In mathematics, a tuple is a finite ordered list (sequence) of elements. An -tuple is a sequence (or ordered list) of elements, where is a non-negative integer. There is only one 0-tuple, referred to as ''the empty tuple''. An -tuple is defi ...
s that can be suppressed. PT
dis the set
of unique identifiers or keys for each tuple.
Output
Output may refer to:
* The information produced by a computer, see Input/output
* An output state of a system, see state (computer science)
* Output (economics), the amount of goods and services produced
** Gross output in economics, the value of ...
:
MGT a generalization of PT
Ithat enforces ''k''-anonymity
Assumes: , PT , ≤ ''k'', and loss * , PT , = ''k''
algorithm Datafly:
// Construct a frequency
list containing unique
sequence
In mathematics, a sequence is an enumerated collection of objects in which repetitions are allowed and order matters. Like a set, it contains members (also called ''elements'', or ''terms''). The number of elements (possibly infinite) is calle ...
s of values across the quasi-identifier in PT,
// along with the number of occurrences of each sequence.
:1. let freq be an expandable and collapsible
vector with no elements initially. Each element is of the form ( QI, frequency, SID ), where SID = ; and, frequency = , SID, . Therefore, freq is also accessible as a table over (QI, frequency, SID).
:2. let pos
0, total
0
:3. while total ≠ , PT, do
::3.1 freq
os ( ''t''
I occurs, SID ) where ''t''
I∈
I ( ''t''
QI __, ___ )
freq; occurs = , PT, - , PT
I– , ; and, SID =
::3.2 pos
pos + 1, total
total + occurs
:// Make a solution by generalizing the attribute with the most number of distinct values
:// and suppressing no more than the allowed number of tuples.
:4. let belowk
0
:5. for pos
1 to , freq, do
::5.1 ( __, count )
freq
os
::5.2 if count < ''k'' then do
:::5.2.1 belowk
belowk + count
:6. if belowk > ''k'' then do: // Note. loss * , PT, = ''k''.
::6.1 freq
generalize(freq)
::6.2 go to step 4
:7. else do
:// assert: the number of tuples to suppress in freq is ≤ loss * , PT,
::7.1 freq
suppress(freq, belowk )
::7.2 MGT
reconstruct(freq)
:8. return MGT.
References
{{reflist
External links
Details of the Datafly algorithm
Privacy
Anonymity
Medical records
Data protection
Datasets
Obfuscation
Articles with example pseudocode