The field of population informatics is the systematic study of populations via secondary analysis of massive data collections (termed "big data") about people. Scientists in the field refer to this massive data collection as the social genome, denoting the collective digital footprint of our society. Population informatics applies data science to social genome data to answer fundamental questions about human society and population health, much like bioinformatics applies data science to human genome data to answer questions about individual health. It is an emerging research area at the intersection of SBEH (Social, Behavioral, Economic, & Health) sciences, computer science, and statistics in which quantitative methods and computational tools are used to answer fundamental questions about our society.
Introduction
History
The term was first used in August 2012 when the Population Informatics Lab was founded at the University of North Carolina at Chapel Hill by Dr. Hye-Chung Kum. The term was first defined in a peer-reviewed article in 2013 and further elaborated on in another article in 2014. The first Workshop on Population Informatics for Big Data was held at the ACM SIGKDD conference in Sydney, Australia, in August 2015.
Goals
Population informatics aims to study the social, behavioral, economic, and health sciences using the massive data collections about people, also known as social genome data. The primary goal of population informatics is to increase the understanding of social processes by developing and applying computationally intensive techniques to the social genome data.
Some of the important sub-disciplines are:
* Business analytics
* Social computing: social network data analysis
* Policy informatics
* Public health informatics
* Computational journalism
* Computational transportation science
* Computational epidemiology
* Computational economics
* Computational sociology
* Computational social science
Approaches
Record linkage, the task of finding records in a dataset that refer to the same entity across different data sources, is a major activity in the population informatics field because most of the digital traces about people are fragmented across many heterogeneous databases that need to be linked before analysis can be done.
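As a minimal sketch of what linking two sources can look like, the following Python example pairs records that agree exactly on date of birth and approximately on name. The records, field names, the `link_records` helper, and the 0.85 threshold are all hypothetical choices made for illustration; real linkage systems typically use more fields and more careful similarity and decision models.

```python
from difflib import SequenceMatcher

# Hypothetical toy records from two separate sources (not real data).
clinic = [
    {"id": "c1", "name": "Maria Gonzalez", "dob": "1980-02-14"},
    {"id": "c2", "name": "John Smith", "dob": "1975-07-01"},
]
benefits = [
    {"id": "b1", "name": "Maria Gonzales", "dob": "1980-02-14"},
    {"id": "b2", "name": "Jon Smyth", "dob": "1990-01-01"},
]


def name_similarity(a, b):
    """Return a similarity score in [0, 1] between two names, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def link_records(left, right, threshold=0.85):
    """Return (left_id, right_id, score) for pairs that agree on date of
    birth and whose names are at least `threshold` similar."""
    links = []
    for l_rec in left:
        for r_rec in right:
            # Require exact agreement on date of birth before comparing names.
            if l_rec["dob"] != r_rec["dob"]:
                continue
            score = name_similarity(l_rec["name"], r_rec["name"])
            if score >= threshold:
                links.append((l_rec["id"], r_rec["id"], round(score, 3)))
    return links


links = link_records(clinic, benefits)
print(links)
```

Here the two spellings of the same name are linked despite the one-letter difference, while the second pair is rejected because the dates of birth disagree. Comparing every pair of records, as this sketch does, does not scale to large databases.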
Once relevant datasets are linked, the next task is usually to develop valid, meaningful measures to answer the research question. Because the data were collected for other purposes, with no intention of answering the question at hand, developing measures often involves iterating between inductive and deductive approaches with the data and research question until usable measures emerge. Developing meaningful and useful measures from existing data is a major challenge in many research projects. In computational fields, these measures are often called features.
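The step from linked records to measures can be sketched as follows. This is only an illustration under assumed inputs: the event tuples, the `build_features` helper, and the three measures (`n_events`, `n_sources`, `span_days`) are hypothetical, and which measures are valid depends entirely on the research question.

```python
from collections import defaultdict
from datetime import date

# Hypothetical linked event records: (person_id, source, event_date).
events = [
    ("p1", "clinic", date(2014, 1, 10)),
    ("p1", "benefits", date(2014, 3, 1)),
    ("p1", "clinic", date(2014, 6, 20)),
    ("p2", "benefits", date(2014, 2, 5)),
]


def build_features(events):
    """Derive one row of simple measures (features) per person."""
    by_person = defaultdict(list)
    for person_id, source, event_date in events:
        by_person[person_id].append((source, event_date))

    features = {}
    for person_id, rows in by_person.items():
        dates = [d for _, d in rows]
        features[person_id] = {
            # Total number of recorded events for this person.
            "n_events": len(rows),
            # Number of distinct data sources the person appears in.
            "n_sources": len({s for s, _ in rows}),
            # Days between the first and last recorded event.
            "span_days": (max(dates) - min(dates)).days,
        }
    return features


features = build_features(events)
print(features)
```

The resulting per-person rows form the kind of analytic dataset that later analysis steps operate on.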
Finally, with the datasets linked and required measures developed, the analytic dataset is ready for analysis. Common analysis methods include traditional hypothesis-driven research as well as more inductive approaches such as data science and predictive analytics.
Relation to other fields
Computational social science refers to the academic sub-disciplines concerned with computational approaches to the social sciences. This means that computers are used to model, simulate, and analyze social phenomena. Fields include computational economics and computational sociology. The seminal article on computational social science is by Lazer et al. (2009), which was a summary of a workshop held at Harvard with the same title. However, the article does not define the term computational social science precisely.
In general, computational social science is a broader field and encompasses population informatics. Besides population informatics, it also includes complex simulations of social phenomena. Complex simulation models often use results from population informatics to set their parameters to real-world values.
Data Science for Social Good (DSSG) is another similar, emerging field. Like computational social science, DSSG is broader than population informatics: it applies data science to any social problem, which includes the study of human populations but also many problems that do not use any data about people.
Population reconstruction is a multi-disciplinary field that reconstructs specific (historical) populations by linking data from diverse sources, producing rich novel resources for study by social scientists.
Related groups and workshops
The first Workshop on Population Informatics for Big Data was held at the ACM SIGKDD conference in Sydney, Australia, in 2015. The workshop brought together computer science researchers, as well as public health practitioners and researchers. This Wikipedia page started at the workshop.
The International Population Data Linkage Network (IPDLN) facilitates communication between centres that specialize in data linkage and users of the linked data. The producers and users alike are committed to the systematic application of data linkage to produce community benefit in the population and health-related domains.
Challenges
Three major challenges specific to population informatics are:
# Preserving privacy of the subjects of the data – due to increasing concerns about privacy and confidentiality, sharing or exchanging sensitive data about the subjects across different organizations is often not allowed. Therefore, population informatics needs to be applied to encrypted data or in a privacy-preserving setting.
# The need for error bounds on the results – since real-world data often contain errors and variations, error bounds need to be used (for approximate matching) so that real decisions that have direct impact on people can be made based on these results. Research on error propagation in the full data pipeline from data integration to final analysis is also important.
# Scalability – databases are continuously growing in size, which makes population informatics computationally expensive in terms of the size and number of data sources (Thilina Ranbaduge, Dinusha Vatsalan, and Peter Christen, "Clustering-Based Scalable Indexing for Multi-party Privacy-Preserving Record Linkage", PAKDD: 549-561 (Springer) 201, doi: 10.1007/978-3-319-18032-8_43). Scalable algorithms need to be developed to provide efficient and practical population informatics applications in real-world contexts.
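To make the privacy and scalability challenges concrete, here is a minimal sketch of one simple approach: each organization replaces identifying fields with a keyed hash (HMAC) before sharing, and matching is done on the resulting tokens. This is not the method of the work cited above; the shared key, the `normalize` and `encode` helpers, and the records are hypothetical. The approach only supports exact matching after normalization, so it does not address the error-bound challenge, and a real deployment would need careful key management and a fuller threat analysis.

```python
import hashlib
import hmac

# Assumption: both organizations agreed on this secret key out of band.
SHARED_KEY = b"hypothetical-shared-secret"


def normalize(name, dob):
    """Lowercase the name, collapse whitespace, and join with date of birth."""
    return f"{' '.join(name.lower().split())}|{dob}"


def encode(name, dob, key=SHARED_KEY):
    """Return a keyed hash token that stands in for the identifying fields."""
    message = normalize(name, dob).encode("utf-8")
    return hmac.new(key, message, hashlib.sha256).hexdigest()


# Each organization maps tokens to its own local record IDs and shares
# only the tokens, never the names or dates of birth.
org_a = {
    encode("Maria Gonzalez", "1980-02-14"): "a1",
    encode("John Smith", "1975-07-01"): "a2",
}
org_b = {
    encode("maria  gonzalez", "1980-02-14"): "b7",
    encode("Ann Lee", "1990-01-01"): "b8",
}

# Tokens present in both sets identify the same person. Set intersection
# uses hash lookups, avoiding a comparison of every pair of records.
matches = sorted((org_a[t], org_b[t]) for t in org_a.keys() & org_b.keys())
print(matches)
```

Because matching reduces to a set intersection over tokens, the cost grows roughly linearly with the number of records rather than with the number of record pairs, at the price of missing any pair whose identifying fields differ by even one character.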
See also
* Record linkage
* Social genome
References
External links
Population Informatics Lab
Privacy Preserving Interactive Record Linkage (PPIRL)
First International Workshop on Population Informatics for Big Data
International Population Data Linkage Network (IPDLN)
Public Health Informatics page at AMIA
Data Science for Social Good