Flow cytometry bioinformatics is the application of bioinformatics to flow cytometry data, which involves storing, retrieving, organizing and analyzing flow cytometry data using extensive computational resources and tools. Flow cytometry bioinformatics requires extensive use of and contributes to the development of techniques from computational statistics and machine learning. Flow cytometry and related methods allow the quantification of multiple independent biomarkers on large numbers of single cells.

The rapid growth in the multidimensionality and throughput of flow cytometry data, particularly in the 2000s, has led to the creation of a variety of computational analysis methods, data standards, and public databases for the sharing of results. Computational methods exist to assist in the preprocessing of flow cytometry data, identifying cell populations within it, matching those cell populations across samples, and performing diagnosis and discovery using the results of previous steps. For preprocessing, this includes compensating for spectral overlap, transforming data onto scales conducive to visualization and analysis, assessing data for quality, and normalizing data across samples and experiments. For population identification, tools are available to aid traditional manual identification of populations in two-dimensional scatter plots (gating), to use dimensionality reduction to aid gating, and to find populations automatically in higher-dimensional space in a variety of ways. It is also possible to characterize data in more comprehensive ways, such as the density-guided binary space partitioning technique known as probability binning, or by combinatorial gating. Finally, diagnosis using flow cytometry data can be aided by supervised learning techniques, and discovery of new cell types of biological importance by high-throughput statistical methods, as part of pipelines incorporating all of the aforementioned methods.

Open standards, data and software are also key parts of flow cytometry bioinformatics. Data standards include the widely adopted Flow Cytometry Standard (FCS) defining how data from cytometers should be stored, but also several new standards under development by the International Society for Advancement of Cytometry (ISAC) to aid in storing more detailed information about experimental design and analytical steps. Open data is slowly growing with the opening of the CytoBank database in 2010, and FlowRepository in 2012, both of which allow users to freely distribute their data, and the latter of which has been recommended as the preferred repository for MIFlowCyt-compliant data by ISAC. Open software is most widely available in the form of a suite of Bioconductor packages, but is also available for web execution on the GenePattern platform.

Data collection

Flow cytometers operate by hydrodynamically focusing suspended cells so that they separate from each other within a fluid stream. The stream is interrogated by one or more lasers, and the resulting fluorescent and scattered light is detected by photomultipliers. By using optical filters, particular fluorophores on or within the cells can be quantified by peaks in their emission spectra. These may be endogenous fluorophores such as chlorophyll or transgenic green fluorescent protein, or they may be artificial fluorophores covalently bonded to detection molecules such as antibodies for detecting proteins, or hybridization probes for detecting DNA or RNA.

The ability to quantify these has led to flow cytometry being used in a wide range of applications, including but not limited to:
* Monitoring of CD4 count in HIV
* Diagnosis of various cancers
* Analysis of aquatic microbiomes
* Sperm sorting
* Measuring telomere length

Until the early 2000s, flow cytometry could only measure a few fluorescent markers at a time. Through the late 1990s into the mid-2000s, however, rapid development of new fluorophores resulted in modern instruments capable of quantifying up to 18 markers per cell. More recently, the new technology of mass cytometry replaces fluorophores with rare-earth elements detected by time of flight mass spectrometry, achieving the ability to measure the expression of 34 or more markers. At the same time, microfluidic qPCR methods are providing a flow cytometry-like method of quantifying 48 or more RNA molecules per cell. The rapid increase in the dimensionality of flow cytometry data, coupled with the development of high-throughput robotic platforms capable of assaying hundreds to thousands of samples automatically, has created a need for improved computational analysis methods.

Data

Flow cytometry data is in the form of a large matrix of intensities over M wavelengths by N events. Most events correspond to a single cell, although some may be doublets (pairs of cells which pass the laser closely together). For each event, the measured fluorescence intensity over a particular wavelength range is recorded. The measured fluorescence intensity indicates the amount of that fluorophore in the cell, which in turn indicates the amount that has bound to detector molecules such as antibodies. Therefore, fluorescence intensity can be considered a proxy for the amount of detector molecules present on the cell. A simplified, if not strictly accurate, way of considering flow cytometry data is as a matrix of M measurements by N cells, where each element corresponds to the amount of a particular molecule.

Steps in computational flow cytometry data analysis

The process of moving from primary flow cytometry (FCM) data to disease diagnosis and biomarker discovery involves four major steps:
# Data pre-processing (including compensation, transformation and normalization)
# Cell population identification (a.k.a. gating)
# Cell population matching for cross-sample comparison
# Relating cell populations to external variables (diagnosis and discovery)

Saving the steps taken in a particular flow cytometry workflow is supported by some flow cytometry software, and is important for the reproducibility of flow cytometry experiments. However, saved workspace files are rarely interchangeable between software packages. An attempt to solve this problem is the development of the Gating-ML XML-based data standard (discussed in more detail under the standards section), which is slowly being adopted in both commercial and open source flow cytometry software. The CytoML R package also helps fill the gap by importing and exporting Gating-ML that is compatible with FlowJo, CytoBank and FACS Diva software.

Data pre-processing

Prior to analysis, flow cytometry data must typically undergo pre-processing to remove artifacts and poor quality data, and to be transformed onto an optimal scale for identifying cell populations of interest. Below are various steps in a typical flow cytometry preprocessing pipeline.

Compensation

When more than one fluorochrome is used with the same laser, their emission spectra frequently overlap. Each particular fluorochrome is typically measured using a bandpass optical filter set to a narrow band at or near the fluorochrome's emission intensity peak. The result is that the reading for any given fluorochrome is actually the sum of that fluorochrome's peak emission intensity, and the intensity of all other fluorochromes' spectra where they overlap with that frequency band. This overlap is termed spillover, and the process of removing spillover from flow cytometry data is called compensation.

Compensation is typically accomplished by running a series of representative samples each stained for only one fluorochrome, to give measurements of the contribution of each fluorochrome to each channel. The total signal to remove from each channel can be computed by solving a system of linear equations based on this data to produce a spillover matrix, which when inverted and multiplied with the raw data from the cytometer produces the compensated data. The processes of computing the spillover matrix, or applying a precomputed spillover matrix to compensate flow cytometry data, are standard features of flow cytometry software.
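The matrix step can be illustrated with a minimal two-colour sketch in plain Python. The spillover values and event intensities below are invented for illustration, and the layout (rows are fluorochromes, columns are detector channels, each event is a row vector) is an assumed convention rather than a requirement of any particular software package.

```python
def invert_2x2(matrix):
    """Return the inverse of a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = matrix
    det = a * d - b * c
    if det == 0:
        raise ValueError("spillover matrix is singular")
    return [[d / det, -b / det], [-c / det, a / det]]


def multiply_events(events, matrix):
    """Multiply each event (a row vector of 2 channels) by a 2x2 matrix."""
    return [
        [sum(event[k] * matrix[k][j] for k in range(2)) for j in range(2)]
        for event in events
    ]


# Hypothetical spillover matrix: 15% of fluorochrome 0 leaks into channel 1,
# and 5% of fluorochrome 1 leaks into channel 0.
spillover = [[1.00, 0.15],
             [0.05, 1.00]]

# Made-up "true" per-fluorochrome signals for three events.
true_signal = [[1000.0, 0.0], [0.0, 500.0], [200.0, 300.0]]

# What the detectors would record: true signal mixed by the spillover matrix.
raw = multiply_events(true_signal, spillover)

# Compensation: multiply the raw data by the inverted spillover matrix.
compensated = multiply_events(raw, invert_2x2(spillover))
```

Here `compensated` recovers `true_signal` up to floating-point error. With real data the spillover matrix is estimated from single-stained controls, so the recovery is only approximate and can produce small negative values.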

Transformation

Cell populations detected by flow cytometry are often described as having approximately log-normal expression. As such, they have traditionally been transformed to a logarithmic scale. In early cytometers, this was often accomplished even before data acquisition by use of a log amplifier. On modern instruments, data is usually stored in linear form, and transformed digitally prior to analysis.

However, compensated flow cytometry data frequently contains negative values due to compensation, and cell populations do occur which have low means and normal distributions. Logarithmic transformations cannot properly handle negative values, and poorly display normally distributed cell types. Alternative transformations which address this issue include the log-linear hybrid transformations Logicle and Hyperlog, as well as the hyperbolic arcsine and the Box–Cox. A comparison of commonly used transformations concluded that the biexponential and Box–Cox transformations, when optimally parameterized, provided the clearest visualization and least variance of cell populations across samples. However, a later examination of the flowTrans package used in that comparison indicated that it did not parameterize the Logicle transformation in a manner consistent with other implementations, potentially calling those results into question.
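The difference between a logarithmic and a hyperbolic arcsine transformation can be sketched in a few lines of Python. The `cofactor` of 150 is an assumed illustrative scaling value, not a recommendation from the text, and the intensities are made up.

```python
import math


def arcsinh_transform(value, cofactor=150.0):
    """Hyperbolic arcsine transform; roughly linear near zero, log-like far from it."""
    return math.asinh(value / cofactor)


def log_transform(value):
    """Plain log10 transform; undefined for zero and negative values."""
    return math.log10(value) if value > 0 else None


# Made-up compensated intensities, including negatives produced by compensation.
intensities = [-200.0, -10.0, 0.0, 10.0, 1000.0, 100000.0]

arcsinh_values = [arcsinh_transform(v) for v in intensities]
log_values = [log_transform(v) for v in intensities]  # None where log is undefined
```

The arcsine-transformed values stay ordered and symmetric around zero, while the logarithm has no value for the first three events. For large intensities the two behave alike up to a constant, since asinh(y) approaches ln(2y).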

Quality control

Particularly in newer, high-throughput experiments, there is a need for visualization methods to help detect technical errors in individual samples. One approach is to visualize summary statistics, such as the empirical distribution functions of single dimensions of technical or biological replicates, to ensure they are similar. For more rigor, the Kolmogorov–Smirnov test can be used to determine if individual samples deviate from the norm. Grubbs's test for outliers may be used to detect samples deviating from the group.

A method for quality control in higher-dimensional space is to use probability binning with bins fit to the whole data set pooled together. Then the standard deviation of the number of cells falling in the bins within each sample can be taken as a measure of multidimensional similarity, with samples that are closer to the norm having a smaller standard deviation. With this method, higher standard deviation can indicate outliers, although this is a relative measure as the absolute value depends partly on the number of bins.

With all of these methods, the cross-sample variation is being measured. However, this is the combination of technical variations introduced by the instruments and handling, and the actual biological information that is to be measured. Disentangling the technical and the biological contributions to between-sample variation can be difficult or even impossible.
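As an illustration of the one-dimensional checks, the following sketch computes the two-sample Kolmogorov–Smirnov statistic (the largest gap between two empirical distribution functions) in plain Python. It computes only the statistic, not a p-value, and the simulated replicates, sample sizes and shift are assumptions chosen for the example.

```python
import bisect
import random


def ecdf(sample):
    """Return the empirical cumulative distribution function of a sample."""
    ordered = sorted(sample)
    n = len(ordered)

    def cdf(x):
        # Fraction of observations less than or equal to x.
        return bisect.bisect_right(ordered, x) / n

    return cdf


def ks_statistic(sample_a, sample_b):
    """Largest absolute difference between the two empirical CDFs."""
    cdf_a, cdf_b = ecdf(sample_a), ecdf(sample_b)
    # Both CDFs are step functions, so checking every observed value suffices.
    return max(abs(cdf_a(x) - cdf_b(x)) for x in list(sample_a) + list(sample_b))


# Simulated single-channel data: two comparable replicates and one shifted sample.
random.seed(0)
replicate_a = [random.gauss(0.0, 1.0) for _ in range(500)]
replicate_b = [random.gauss(0.0, 1.0) for _ in range(500)]
shifted = [random.gauss(1.5, 1.0) for _ in range(500)]

d_replicates = ks_statistic(replicate_a, replicate_b)
d_shifted = ks_statistic(replicate_a, shifted)
```

A sample whose statistic against the other replicates is much larger than the replicate-to-replicate values, as `d_shifted` is relative to `d_replicates` here, would be a candidate for closer inspection.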

Normalization

Particularly in multi-centre studies, technical variation can make biologically equivalent populations of cells difficult to match across samples. Normalization methods to remove technical variance, frequently derived from image registration techniques, are thus a critical step in many flow cytometry analyses. Single-marker normalization can be performed using landmark registration, in which peaks in a kernel density estimate of each sample are identified and aligned across samples.
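A heavily simplified sketch of the landmark idea follows: estimate a Gaussian kernel density on a grid, take the highest peak as the single landmark, and shift the sample so that its landmark matches a reference. Published landmark registration methods align several peaks with smooth warping functions; the single-peak shift, the bandwidth, the grid size and the data here are all assumptions made for brevity.

```python
import math


def kernel_density(sample, grid, bandwidth):
    """Gaussian kernel density estimate of `sample` evaluated at each grid point."""
    norm = 1.0 / (len(sample) * bandwidth * math.sqrt(2.0 * math.pi))
    return [
        norm * sum(math.exp(-0.5 * ((g - x) / bandwidth) ** 2) for x in sample)
        for g in grid
    ]


def highest_peak(sample, bandwidth=0.3, steps=200):
    """Location of the highest density peak, searched on an evenly spaced grid."""
    low, high = min(sample), max(sample)
    grid = [low + (high - low) * i / steps for i in range(steps + 1)]
    density = kernel_density(sample, grid, bandwidth)
    return grid[density.index(max(density))]


def align_to_reference(sample, reference_peak, bandwidth=0.3):
    """Shift a sample so that its highest peak lands on the reference peak."""
    shift = reference_peak - highest_peak(sample, bandwidth)
    return [x + shift for x in sample]


# Made-up transformed intensities for one marker in two samples;
# sample_b is sample_a with a technical offset of 0.5.
sample_a = [1.8, 1.9, 2.0, 2.0, 2.1, 2.2, 4.0]
sample_b = [x + 0.5 for x in sample_a]

peak_a = highest_peak(sample_a)
aligned_b = align_to_reference(sample_b, peak_a)
```

After alignment, `aligned_b` lies within one grid step of `sample_a`. A pure shift cannot correct differences in scale or in the spacing between several peaks, which is why warping-based approaches are used in practice.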

Identifying cell populations

The complexity of raw flow cytometry data (dozens of measurements for thousands to millions of cells) makes answering questions directly using statistical tests or supervised learning difficult. Thus, a critical step in the analysis of flow cytometric data is to reduce this complexity to something more tractable while establishing common features across samples. This usually involves identifying multidimensional regions that contain functionally and phenotypically homogeneous groups of cells. This is a form of cluster analysis. There is a range of methods by which this can be achieved, detailed below.

Gating

The data generated by flow cytometers can be plotted in one or two dimensions to produce a histogram or scatter plot. The regions on these plots can be sequentially separated, based on fluorescence intensity, by creating a series of subset extractions, termed "gates". These gates can be produced using software, e.g. FCS Express, WinMDI, CytoPaint (aka Paint-A-Gate), VenturiOne, Cellcion, CellQuest Pro, Cytospec, Kaluza, or flowCore.

In datasets with a low number of dimensions and limited cross-sample technical and biological variability (e.g., clinical laboratories), manual analysis of specific cell populations can produce effective and reproducible results. However, manual exploratory analysis of a large number of cell populations in a high-dimensional dataset is not feasible. In addition, manual analysis in less controlled settings (e.g., cross-laboratory studies) can increase the overall error rate of the study. In one study, several computational gating algorithms performed better than manual analysis in the presence of some variation. However, despite the considerable advances in computational analysis, manual gating remains the main solution for the identification of specific rare cell populations that are not well-separated from other cell types.
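The idea of sequential subset extraction can be sketched as a pair of rectangular gates applied one after the other. The channel names, gate boundaries and event values below are hypothetical and chosen only to make the hierarchy visible.

```python
def rectangle_gate(events, x_channel, y_channel, x_range, y_range):
    """Keep the events that fall inside a rectangle on a two-channel plot."""
    return [
        event for event in events
        if x_range[0] <= event[x_channel] <= x_range[1]
        and y_range[0] <= event[y_channel] <= y_range[1]
    ]


# Made-up events: scatter channels on a linear scale, markers on a transformed scale.
events = [
    {"FSC": 520, "SSC": 180, "CD3": 3.1, "CD4": 2.9},
    {"FSC": 560, "SSC": 210, "CD3": 3.4, "CD4": 0.4},
    {"FSC": 900, "SSC": 650, "CD3": 0.2, "CD4": 1.1},
    {"FSC": 480, "SSC": 150, "CD3": 0.3, "CD4": 0.2},
    {"FSC": 610, "SSC": 240, "CD3": 2.8, "CD4": 3.3},
    {"FSC": 120, "SSC": 40, "CD3": 3.0, "CD4": 3.0},  # small, debris-like event
]

# First gate: a scatter gate selecting lymphocyte-sized events.
lymphocytes = rectangle_gate(events, "FSC", "SSC", (400, 700), (100, 300))

# Second gate, applied only to the output of the first: CD3+ CD4+ events.
cd4_t_cells = rectangle_gate(lymphocytes, "CD3", "CD4", (2.0, 5.0), (2.0, 5.0))
```

The last event would pass the second gate on its own but is excluded because it fails the scatter gate, which is the point of applying gates as a hierarchy. Manual gates are usually polygons or ellipses drawn by an analyst; rectangles are used here only to keep the example short.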

Gating guided by dimension reduction

The number of scatter plots that need to be investigated increases with the square of the number of markers measured (or faster since some markers need to be investigated several times for each group of cells to resolve high-dimensional differences between cell types that appear to be similar in most markers). To address this issue, principal component analysis has been used to summarize the high-dimensional datasets using a combination of markers that maximizes the variance of all data points. However, PCA is a linear method and is not able to preserve complex and non-linear relationships. More recently, two-dimensional minimum spanning tree layouts have been used to guide the manual gating process. Density-based down-sampling and clustering was used to better represent rare populations and control the time and memory complexity of the minimum spanning tree construction process. More sophisticated dimension reduction algorithms are yet to be investigated.

[Figure: Spade plot]

Automated gating

Developing computational tools for identification of cell populations has been an area of active research only since 2008. Many individual clustering approaches have recently been developed, including model-based algorithms (e.g., flowClust and FLAME), density-based algorithms (e.g., FLOCK and SWIFT), graph-based approaches (e.g., SamSPECTRAL) and, most recently, hybrids of several approaches (flowMeans and flowPeaks). These algorithms differ in their memory and time complexity, their software requirements, their ability to automatically determine the required number of cell populations, and their sensitivity and specificity. The FlowCAP (Flow Cytometry: Critical Assessment of Population Identification Methods) project, with active participation from most academic groups with research efforts in the area, is providing a way to objectively cross-compare state-of-the-art automated analysis approaches. Other surveys have also compared automated gating tools on several datasets.
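To give a feel for what clustering-based gating does, the sketch below runs a generic k-means loop on a handful of two-marker events. It is a stand-in for illustration and not an implementation of any of the named tools; the data, the starting centroids and the fixed number of clusters are all assumptions.

```python
import math


def kmeans(points, initial_centroids, iterations=20):
    """Generic k-means: alternate assigning points and recomputing centroids."""
    centroids = [tuple(c) for c in initial_centroids]
    labels = []
    for _ in range(iterations):
        # Assign each point to its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        # Recompute each centroid as the mean of its members.
        new_centroids = []
        for k, old in enumerate(centroids):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:
                new_centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                new_centroids.append(old)  # keep an empty cluster where it was
        if new_centroids == centroids:
            break  # assignments are stable
        centroids = new_centroids
    return labels, centroids


# Made-up transformed intensities for two markers: a dim and a bright population.
points = [(0.1, 0.2), (0.3, 0.1), (0.2, 0.4), (3.0, 3.1), (3.2, 2.9), (2.8, 3.3)]

labels, centroids = kmeans(points, initial_centroids=[(0.0, 0.0), (1.0, 1.0)])
```

Each label plays the role of a gate membership. Plain k-means needs the number of populations in advance and favours compact, similarly sized clusters, which is one reason the approaches listed above use mixture models, density estimates or cluster merging instead.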

Probability binning methods

Probability binning is a non-gating analysis method in which flow cytometry data is split into quantiles on a univariate basis. The locations of the quantiles can then be used to test for differences between samples (in the variables not being split) using the chi-squared test. This was later extended into multiple dimensions in the form of frequency difference gating, a binary space partitioning technique where data is iteratively partitioned along the median. These partitions (or bins) are fit to a control sample. Then the proportion of cells falling within each bin in test samples can be compared to the control sample by the chi-squared test. Finally, cytometric fingerprinting uses a variant of frequency difference gating to set bins and measure for a series of samples how many cells fall within each bin. These bins can be used as gates and used for subsequent analysis similarly to automated gating methods.
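The univariate case can be sketched as follows: bin edges are placed at quantiles of a control sample so that each bin holds the same number of control events, test events are counted in the same bins, and the counts are compared. The comparison below is the ordinary Pearson chi-squared statistic with the control proportions as the expectation, which is an assumption of this sketch rather than the exact statistic of the published method; the data are synthetic.

```python
import bisect


def quantile_edges(control, n_bins):
    """Bin edges that split the sorted control sample into equal-count bins."""
    ordered = sorted(control)
    return [ordered[len(ordered) * i // n_bins] for i in range(1, n_bins)]


def bin_counts(sample, edges):
    """Count how many values of a sample fall into each bin."""
    counts = [0] * (len(edges) + 1)
    for value in sample:
        counts[bisect.bisect_right(edges, value)] += 1
    return counts


def chi_squared(test_counts, control_counts):
    """Pearson chi-squared statistic, using control proportions as expectation."""
    n_test, n_control = sum(test_counts), sum(control_counts)
    statistic = 0.0
    for observed, control in zip(test_counts, control_counts):
        expected = n_test * control / n_control
        statistic += (observed - expected) ** 2 / expected
    return statistic


# Synthetic single-channel data: a control sample and a shifted test sample.
control = list(range(100))
test_sample = [value + 30 for value in control]

edges = quantile_edges(control, n_bins=4)
control_counts = bin_counts(control, edges)
test_counts = bin_counts(test_sample, edges)
statistic = chi_squared(test_counts, control_counts)
```

With these numbers the control fills every bin with 25 events, the shifted sample fills them with 0, 20, 25 and 55, and the statistic is 62. The multidimensional variants apply the same counting to bins obtained by repeatedly splitting the control data at the median.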

Combinatorial gating

High-dimensional clustering algorithms are often unable to identify rare cell types that are not well separated from other major populations. Matching these small cell populations across multiple samples is even more challenging. In manual analysis, prior biological knowledge (e.g., biological controls) provides guidance to reasonably identify these populations. However, integrating this information into the exploratory clustering process (e.g., as in semi-supervised learning) has not been successful. An alternative to high-dimensional clustering is to identify cell populations using one marker at a time and then combine them to produce higher-dimensional clusters. This functionality was first implemented in FlowJo. The flowType algorithm builds on this framework by allowing markers to be excluded. This enables the development of statistical tools (e.g., RchyOptimyx) that can investigate the importance of each marker and exclude high-dimensional redundancies.
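A minimal sketch of the combinatorial idea: each marker is gated on its own with a single threshold, and every combination of positive, negative or excluded markers defines a candidate population. This is written in the spirit of the description above and is not the flowType implementation; the marker names, thresholds and events are hypothetical.

```python
from itertools import product


def marker_states(event, thresholds):
    """Classify each marker of one event as '+' or '-' using a 1-D threshold."""
    return {
        marker: "+" if event[marker] >= cutoff else "-"
        for marker, cutoff in thresholds.items()
    }


def phenotype_counts(events, thresholds):
    """Count events in every combination of positive, negative or excluded markers."""
    markers = list(thresholds)
    states = [marker_states(event, thresholds) for event in events]
    counts = {}
    # None means the marker is excluded from that phenotype definition.
    for combo in product(("+", "-", None), repeat=len(markers)):
        name = "".join(m + s for m, s in zip(markers, combo) if s is not None)
        counts[name or "all cells"] = sum(
            all(s is None or state[m] == s for m, s in zip(markers, combo))
            for state in states
        )
    return counts


# Made-up transformed intensities and per-marker thresholds.
thresholds = {"CD3": 1.0, "CD4": 1.0, "CD8": 1.0}
events = [
    {"CD3": 3.0, "CD4": 2.5, "CD8": 0.2},
    {"CD3": 3.2, "CD4": 0.3, "CD8": 2.8},
    {"CD3": 0.1, "CD4": 0.2, "CD8": 0.1},
    {"CD3": 2.9, "CD4": 2.7, "CD8": 0.4},
]

counts = phenotype_counts(events, thresholds)
```

Three markers give 27 phenotypes, from `"all cells"` through `"CD3+"` to `"CD3+CD4+CD8-"`. Because the number of phenotypes grows as three to the power of the number of markers, tools that rank markers by importance become necessary for larger panels.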

Diagnosis and discovery

After identification of the cell population of interest, a cross sample analysis can be performed to identify phenotypical or functional variations that are correlated with an external variable (e.g., a clinical outcome). These studies can be partitioned into two main groups:

Diagnosis

In these studies, the goal is usually to diagnose a disease (or a sub-class of a disease) using variations in one or more cell populations. For example, one can use multidimensional clustering to identify a set of clusters, match them across all samples, and then use supervised learning to construct a classifier for prediction of the classes of interest (e.g., this approach can be used to improve the accuracy of the classification of specific lymphoma subtypes). Alternatively, all the cells from the entire cohort can be pooled into a single multidimensional space for clustering before classification. This approach is particularly suitable for datasets with high biological variation (in which cross-sample matching is challenging) but requires technical variation to be carefully controlled.

Discovery

In a discovery setting, the goal is to identify and describe cell populations correlated with an external variable (as opposed to the diagnosis setting, in which the goal is to combine the predictive power of multiple cell types to maximize the accuracy of the results). As in the diagnosis use case, cluster matching in high-dimensional space can be used for exploratory analysis, but the descriptive power of this approach is very limited, as it is hard to characterize and visualize a cell population in a high-dimensional space without first reducing the dimensionality. Combinatorial gating approaches, by contrast, have been particularly successful in exploratory analysis of FCM data. Simplified Presentation of Incredibly Complex Evaluations (SPICE) is a software package that can use the gating functionality of FlowJo to statistically evaluate a wide range of different cell populations and visualize those that are correlated with the external outcome. flowType and RchyOptimyx (as discussed above) expand this technique by adding the ability to explore the impact of independent markers on the overall correlation with the external outcome. This enables the removal of unnecessary markers and provides a simple visualization of all identified cell types. In one analysis of a large (n=466) cohort of HIV+ patients, this pipeline identified three correlates of protection against HIV, only one of which had been previously identified through extensive manual analysis of the same dataset.

Data formats and interchange

Flow Cytometry Standard

Flow Cytometry Standard (FCS) was developed in 1984 to allow recording and sharing of flow cytometry data. Since then, FCS has become the standard file format supported by all flow cytometry software and hardware vendors. The FCS specification has traditionally been developed and maintained by the International Society for Advancement of Cytometry (ISAC). Over the years, updates were incorporated to adapt to technological advancements in both flow cytometry and computing technologies, with FCS 2.0 introduced in 1990, FCS 3.0 in 1997, and the most current specification, FCS 3.1, in 2010. FCS used to be the only widely adopted file format in flow cytometry. Recently, additional standard file formats have been developed by ISAC.

netCDF

ISAC is considering replacing FCS with a flow cytometry specific version of the Network Common Data Form (netCDF) file format. netCDF is a set of freely available software libraries and machine-independent data formats that support the creation, access, and sharing of array-oriented scientific data. In 2008, ISAC drafted the first version of netCDF conventions for storage of raw flow cytometry data (International Society for the Advancement of Cytometry, 2008, Analytical Cytometry Standard NetCDF Conventions for List Mode Binary Data File Component).

Archival Cytometry Standard (ACS)

The Archival Cytometry Standard (ACS) is being developed to bundle data with different components describing cytometry experiments. It captures relations among data, metadata, analysis files and other components, and includes support for audit trails, versioning and digital signatures. The ACS container is based on the ZIP file format with an XML-based table of contents specifying relations among files in the container. The XML Signature W3C Recommendation has been adopted to allow for digital signatures of components within the ACS container. An initial draft of ACS was designed in 2007 and finalized in 2010. Since then, ACS support has been introduced in several software tools including FlowJo and Cytobank.
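
The general pattern of a ZIP container with an XML table of contents can be sketched with the Python standard library. The file name `toc.xml` and the element and attribute names (`toc`, `file`, `relation`, `from`, `type`, `to`) are hypothetical and are not taken from the ACS specification; digital signatures are omitted.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

def build_container(files, relations):
    """Bundle files into an in-memory ZIP with an XML table of contents.

    files: dict mapping path -> bytes.
    relations: list of (source_path, relation_name, target_path) tuples.
    All element and attribute names are hypothetical, not the ACS schema.
    """
    toc = ET.Element("toc")
    for path in sorted(files):
        ET.SubElement(toc, "file", {"path": path})
    for source, name, target in relations:
        ET.SubElement(toc, "relation", {"from": source, "type": name, "to": target})

    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("toc.xml", ET.tostring(toc, encoding="utf-8"))
        for path, payload in files.items():
            archive.writestr(path, payload)
    return buffer.getvalue()

def read_relations(container_bytes):
    """Recover the declared relations from the table of contents."""
    with zipfile.ZipFile(io.BytesIO(container_bytes)) as archive:
        toc = ET.fromstring(archive.read("toc.xml"))
    return [(r.get("from"), r.get("type"), r.get("to")) for r in toc.iter("relation")]

# Toy usage: a data file and a gating file that refers to it.
blob = build_container(
    {"sample.fcs": b"placeholder data", "gates.xml": b"<gates/>"},
    [("gates.xml", "describes-gating-of", "sample.fcs")],
)
relations = read_relations(blob)
```

Keeping the relations in a separate table of contents means a reader can discover how components relate without opening or parsing the components themselves.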

Gating-ML

The lack of gating interoperability has traditionally been a bottleneck preventing reproducibility of flow cytometry data analysis and the use of multiple analytical tools. To address this shortcoming, ISAC developed Gating-ML, an XML-based mechanism to formally describe gates and related data (scale) transformations. The draft recommendation version of Gating-ML was approved by ISAC in 2008, and it is partially supported by tools like FlowJo, the flowUtils and CytoML libraries in R/Bioconductor, and FlowRepository. It supports rectangular gates, polygon gates, convex polytopes, ellipsoids, decision trees and Boolean collections of any of the other types of gates. In addition, it includes dozens of built-in public transformations that have been shown to be potentially useful for display or analysis of cytometry data. In 2013, Gating-ML version 2.0 was approved by ISAC's Data Standards Task Force as a Recommendation. This new version offers slightly less flexibility in terms of the power of gating description; however, it is also significantly easier to implement in software tools.
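
To make the gate types concrete, the following Python sketch implements rectangular gates, polygon gates, and Boolean combinations as predicates over events. It does not read or write Gating-ML XML; the function names are hypothetical, and the boundary convention of the rectangle (lower bound inclusive, upper bound exclusive) is a choice made for this sketch.

```python
def rectangle_gate(bounds):
    """bounds: dict mapping parameter name -> (low, high).

    Sketch convention: low is inclusive, high is exclusive.
    """
    return lambda e: all(lo <= e[p] < hi for p, (lo, hi) in bounds.items())

def polygon_gate(x, y, vertices):
    """Even-odd (ray casting) test on parameters x and y."""
    def inside(e):
        px, py, hit = e[x], e[y], False
        edges = zip(vertices, vertices[1:] + vertices[:1])
        for (x1, y1), (x2, y2) in edges:
            # The first condition guarantees y1 != y2, so the division is safe.
            if (y1 > py) != (y2 > py) and px < x1 + (py - y1) * (x2 - x1) / (y2 - y1):
                hit = not hit
        return hit
    return inside

def and_gate(*gates):
    return lambda e: all(g(e) for g in gates)

def or_gate(*gates):
    return lambda e: any(g(e) for g in gates)

def not_gate(gate):
    return lambda e: not gate(e)

# Toy usage: events inside a scatter rectangle but outside a triangle.
scatter = rectangle_gate({"FSC": (100, 500)})
triangle = polygon_gate("CD3", "CD4", [(0, 0), (4, 0), (0, 4)])
combined = and_gate(scatter, not_gate(triangle))
events = [
    {"FSC": 200, "CD3": 1, "CD4": 1},  # inside the triangle
    {"FSC": 200, "CD3": 3, "CD4": 3},  # outside the triangle
    {"FSC": 50, "CD3": 3, "CD4": 3},   # outside the scatter rectangle
]
membership = [combined(e) for e in events]
```

Representing each gate as a predicate lets Boolean gates compose any other gate type, which mirrors the "Boolean collections" described above.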

Classification Results (CLR)

The Classification Results (CLR) File Format has been developed to exchange the results of manual gating and algorithmic classification approaches in a standard way so that classifications can be reported and processed. CLR is based on the commonly supported CSV file format, with columns corresponding to different classes and cell values containing the probability of an event being a member of a particular class. These are captured as values between 0 and 1. Simplicity of the format and its compatibility with common spreadsheet tools have been the major requirements driving the design of the specification. Although it was originally designed for the field of flow cytometry, it is applicable in any domain that needs to capture either fuzzy or unambiguous classifications of virtually any kind of object.
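
A table of this shape can be written and read with Python's `csv` module. The sketch below assumes a single header row of class names followed by one row per event; the exact header conventions of the CLR specification are not reproduced here, and the function names are hypothetical.

```python
import csv
import io

def write_clr(class_names, probabilities):
    """Serialize per-event class membership probabilities as CSV text.

    probabilities: one row per event, one value in [0, 1] per class.
    Assumption: a single header row of class names (not the exact CLR layout).
    """
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(class_names)
    for row in probabilities:
        if len(row) != len(class_names) or not all(0.0 <= p <= 1.0 for p in row):
            raise ValueError("each row needs one probability in [0, 1] per class")
        writer.writerow(row)
    return out.getvalue()

def read_clr(text):
    """Parse the CSV text back into class names and probability rows."""
    rows = list(csv.reader(io.StringIO(text)))
    return rows[0], [[float(v) for v in row] for row in rows[1:]]

def hard_labels(class_names, probabilities):
    """Pick the most probable class for each event."""
    return [
        class_names[max(range(len(row)), key=row.__getitem__)]
        for row in probabilities
    ]

# Toy usage: one unambiguous event and one fuzzy event.
text = write_clr(["T cells", "B cells"], [[1, 0], [0.25, 0.75]])
names, probs = read_clr(text)
labels = hard_labels(names, probs)
```

An unambiguous classification is simply a row containing only 0s and 1s, so the same table covers both hard gating results and probabilistic classifier output.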

Public data and software

As in other bioinformatics fields, development of new methods has primarily taken the form of free open source software, and several databases have been created for depositing open data.

AutoGate

AutoGate performs compensation, gating, preview of clusters, exhaustive projection pursuit (EPP), multi-dimensional scaling and phenogram analysis, and produces a visual dendrogram to express HiD readiness. It is free to researchers and clinicians at academic, government, and non-profit institutions.

Bioconductor

The Bioconductor project is a repository of free open source software, mostly written in the R programming language. As of July 2013, Bioconductor contained 21 software packages for processing flow cytometry data. These packages cover most of the range of functionality described earlier in this article.

GenePattern

GenePattern is a predominantly genomic analysis platform with over 200 tools for analysis of gene expression, proteomics, and other data. A web-based interface provides easy access to these tools and allows the creation of automated analysis pipelines enabling reproducible research. A GenePattern Flow Cytometry Suite has been developed to bring advanced flow cytometry data analysis tools to experimentalists without programming skills. It contains close to 40 open source GenePattern flow cytometry modules covering methods from basic processing of Flow Cytometry Standard (FCS) files to advanced algorithms for automated identification of cell populations, normalization and quality assessment. Internally, most of these modules leverage functionality developed in Bioconductor, packaged for use with the GenePattern workflow system.

FACSanadu

FACSanadu is an open source portable application for visualization and analysis of FCS data. Unlike Bioconductor, it is an interactive program aimed at non-programmers for routine analysis. It supports standard FCS files as well as COPAS profile data.

hema.to

hema.to is a web service for the classification of flow cytometry data of patients suspected to have lymphoma. The artificial intelligence within the tool uses a deep convolutional neural network to recognize patterns of distinct subtypes. All data and code are open access. It processes raw data, which makes gating unnecessary. For best performance on new data, fine tuning by knowledge transfer is required.

Public databases

The Minimum Information about a Flow Cytometry Experiment (MIFlowCyt) standard requires that any flow cytometry data used in a publication be available, although this does not include a requirement that it be deposited in a public database. Thus, although the journals Cytometry Part A and B, as well as all journals from the Nature Publishing Group, require MIFlowCyt compliance, there is still relatively little publicly available flow cytometry data. Some efforts have been made towards creating public databases, however. Firstly, Cytobank, which is a complete web-based flow cytometry data storage and analysis platform, has been made available to the public in a limited form. Using the Cytobank code base, FlowRepository was developed in 2012 with the support of ISAC to be a public repository of flow cytometry data. FlowRepository facilitates MIFlowCyt compliance, and as of July 2013 contained 65 public data sets.

Datasets

In 2012, the flow cytometry community started to release a set of publicly available datasets. A subset of these datasets representing the existing data analysis challenges is described below. For comparison against manual gating, the FlowCAP-I project released five datasets manually gated by human analysts, two of which were gated by eight independent analysts. The FlowCAP-II project included three datasets for binary classification and also reported several algorithms that were able to classify these samples perfectly. FlowCAP-III included two larger datasets for comparison against manual gates as well as one more challenging sample classification dataset. As of March 2013, public release of FlowCAP-III was still in progress. The datasets used in FlowCAP-I, II, and III have either a low number of subjects or a low number of parameters. However, several more complex clinical datasets have since been released, including a dataset of 466 HIV-infected subjects, which provides both 14-parameter assays and sufficient clinical information for survival analysis. Another class of datasets is higher-dimensional mass cytometry assays. A representative of this class is a study that includes analysis of two bone marrow samples using more than 30 surface or intracellular markers under a wide range of different stimulations. The raw data for this dataset are publicly available as described in the manuscript, and manual analyses of the surface markers are available upon request from the authors.

Open problems

Despite rapid development in the field of flow cytometry bioinformatics, several problems remain to be addressed. Variability across flow cytometry experiments arises from biological variation among samples, technical variations across instruments used, as well as methods of analysis. In 2010, a group of researchers from Stanford University and the National Institutes of Health pointed out that while technical variation can be ameliorated by standardizing sample handling, instrument setup and choice of reagents, solving variation in analysis methods will require similar standardization and computational automation of gating methods. They further opined that centralization of both data and analysis could aid in decreasing variability between experiments and in comparing results. This was echoed by another group of Pacific Biosciences and Stanford University researchers, who suggested that cloud computing could enable centralized, standardized, high-throughput analysis of flow cytometry experiments. They also emphasised that ongoing development and adoption of standard data formats could continue to aid in reducing variability across experiments. They also proposed that new methods will be needed to model and summarize results of high-throughput analysis in ways that can be interpreted by biologists, as well as ways of integrating large-scale flow cytometry data with other high-throughput biological information, such as gene expression, genetic variation, metabolite levels and disease states.
