Labeled data is a group of samples that have been tagged with one or more labels. Labeling typically takes a set of unlabeled data and augments each piece of it with informative tags. For example, a data label might indicate whether a photo contains a horse or a cow, which words were uttered in an audio recording, what type of action is being performed in a video, what the topic of a news article is, what the overall sentiment of a tweet is, or whether a dot in an X-ray is a tumor.
Labels can be obtained by asking humans to make judgments about a given piece of unlabeled data. Labeled data is significantly more expensive to obtain than the raw unlabeled data.
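As an illustration of the tagging described above, a labeled dataset can be represented as pairs of a raw sample and its tags. The following is a minimal Python sketch, not a standard format: the `LabeledSample` type, the `attach_labels` helper, and the file names and judgments are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class LabeledSample:
    """One raw sample plus the informative tags attached to it (hypothetical type)."""
    sample: str                                  # e.g. a file name or a piece of text
    labels: dict = field(default_factory=dict)   # tag name -> tag value


def attach_labels(unlabeled, judgments):
    """Augment each unlabeled sample with the tags a human judge supplied.

    `judgments` maps a sample to its tags. Samples without a judgment
    keep an empty label set, i.e. they remain unlabeled.
    """
    return [LabeledSample(s, dict(judgments.get(s, {}))) for s in unlabeled]


# Hypothetical unlabeled data and human judgments.
unlabeled = ["photo_001.jpg", "photo_002.jpg", "tweet_17.txt"]
judgments = {
    "photo_001.jpg": {"animal": "horse"},
    "tweet_17.txt": {"sentiment": "positive"},
}

dataset = attach_labels(unlabeled, judgments)
# Keep only the samples that actually received at least one label.
labeled_only = [d for d in dataset if d.labels]
```

Here `photo_002.jpg` stays unlabeled because no judgment was supplied for it, which mirrors the point that labels have to be obtained separately from the raw data.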
Crowdsourced labeled data
In 2006 Fei-Fei Li, the co-director of the Stanford Human-Centered AI Institute, set out to improve the artificial intelligence models and algorithms for image recognition by significantly enlarging the training data. The researchers downloaded millions of images from the World Wide Web and a team of undergraduates started to apply labels for objects to each image. In 2007 Li outsourced the data labelling work on Amazon Mechanical Turk, an online marketplace for digital piece work. The 3.2 million images that were labelled by more than 49,000 workers formed the basis for ImageNet, one of the largest hand-labeled databases for object recognition.
Automated data labelling
After obtaining a labeled dataset, machine learning models can be trained on the data so that new unlabeled data can be presented to the model and a likely label can be guessed or predicted for that piece of unlabeled data.
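The idea of learning from labeled data and then guessing a label for a new sample can be sketched in a few lines of Python. The example below uses a nearest-centroid classifier, chosen here only because it is short. It is one of many possible models, and the feature values, the `train_centroids` and `predict` names, and the tiny dataset are assumptions made for illustration.

```python
from collections import defaultdict
from math import dist


def train_centroids(labeled):
    """Compute one mean feature vector (centroid) per label."""
    groups = defaultdict(list)
    for features, label in labeled:
        groups[label].append(features)
    return {
        label: tuple(sum(col) / len(rows) for col in zip(*rows))
        for label, rows in groups.items()
    }


def predict(centroids, features):
    """Guess the label whose centroid is closest to the new sample."""
    return min(centroids, key=lambda label: dist(centroids[label], features))


# Hypothetical labeled data: (height in m, weight in kg) -> animal.
labeled = [
    ((1.6, 500.0), "horse"),
    ((1.7, 550.0), "horse"),
    ((1.4, 700.0), "cow"),
    ((1.5, 750.0), "cow"),
]

model = train_centroids(labeled)

# A new, unlabeled sample is presented to the model.
guess = predict(model, (1.65, 520.0))
```

With this toy data, `guess` is `"horse"`, because the new sample lies much closer to the horse centroid than to the cow centroid. The prediction is only a likely label, and its quality depends on the labeled data the model was trained on.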
Data-driven bias
Algorithmic decision-making is subject to programmer-driven bias as well as data-driven bias. Training data that relies on biased labeled data will result in prejudices and omissions in a predictive model, even if the machine learning algorithm itself is legitimate. The labelled data used to train a specific machine learning algorithm needs to be a statistically representative sample so as not to bias the results. Because the labeled data available to train facial recognition systems has not been representative of a population, underrepresented groups in the labeled data are later often misclassified. In 2018 a study by Joy Buolamwini and Timnit Gebru demonstrated that two facial analysis datasets that have been used to train facial recognition algorithms, IJB-A and Adience, are composed of 79.6% and 86.2% lighter-skinned humans respectively.
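One simple way to check whether labeled data is representative is to compare each group's share of the dataset with a reference share for the population. The Python sketch below illustrates that comparison under stated assumptions. It is not the method used in the 2018 study, and the group names, the ten-sample dataset, the 50/50 reference shares, and the `tolerance` parameter are hypothetical.

```python
from collections import Counter


def group_shares(groups):
    """Return each group's share of the labeled samples."""
    counts = Counter(groups)
    total = sum(counts.values())
    return {g: n / total for g, n in counts.items()}


def underrepresented(shares, reference, tolerance=0.05):
    """List groups whose dataset share falls short of the reference
    share by more than `tolerance`."""
    return sorted(
        g for g, ref in reference.items()
        if shares.get(g, 0.0) < ref - tolerance
    )


# Hypothetical group annotation for ten labeled samples.
groups = ["lighter"] * 8 + ["darker"] * 2

# Assumed reference shares for the population of interest.
reference = {"lighter": 0.5, "darker": 0.5}

shares = group_shares(groups)
flagged = underrepresented(shares, reference)
```

In this toy dataset the `"darker"` group makes up 20% of the samples against an assumed reference of 50%, so it is the only group in `flagged`. A check like this only reveals imbalance in the data. It does not by itself measure how often a trained model misclassifies each group.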