Sequence graph, also called an alignment graph, breakpoint graph, or adjacency graph, are bidirected graphs used in
comparative genomics
Comparative genomics is a field of biological research in which the genomic features of different organisms are compared. The genomic features may include the DNA sequence, genes, gene order, regulatory sequences, and other genomic structural ...
. The structure consists of multiple graphs or
genomes
In the fields of molecular biology and genetics, a genome is all the genetic information of an organism. It consists of nucleotide sequences of DNA (or RNA in RNA viruses). The nuclear genome includes protein-coding genes and non-coding gen ...
with a series of edges and vertices represented as
adjacencies between segments in a
genome
In the fields of molecular biology and genetics, a genome is all the genetic information of an organism. It consists of nucleotide sequences of DNA (or RNA in RNA viruses). The nuclear genome includes protein-coding genes and non-coding ...
and
DNA segments respectively. Traversing a connected component of segments and adjacency edges (called a ''thread'') yields a sequence, which typically represents a genome or a section of a genome. The segments can be thought of as
synteny
In genetics, the term synteny refers to two related concepts:
* In classical genetics, ''synteny'' describes the physical co-localization of genetic loci on the same chromosome within an individual or species.
* In current biology, ''synteny'' mo ...
blocks, with the edges dictating how to arrange these blocks in a particular genome, and the labelling of the adjacency edges representing bases that are not contained in synteny blocks.
Construction
Before constructing a sequence graph, there must be at least two genomes represented as directed graphs with edges as threads (adjacency edges) and vertices as DNA segments. The genomes should be labeled ''P'' and ''Q,'' while the sequence graph is labeled as BreakpointGraph(''P, Q'').
The directional vertices of Q and their edges are arranged in the order of P. Once completed, the edges of Q are reconnected to their original vertices. After all edges have been matched the vertex directions are removed and instead each vertex is labeled as v
h (vertex head) and v
t (vertex tail).
Similarity between genomes is represented by the number of cycles (independent systems) within the sequence graph. The number of cycles is equal to cycles (P, Q). The max number of cycles possible is equal to the number of vertices in the sequence graph.
Example
Figure example.
Upon receiving genomes P (+a +b -c) and Q (+a +b -c),
Q should be realigned to follow the direction edges (red) of P. The vertices should be renamed from a, b, c to a
h a
t, b
h b
t, c
h c
t and the edges of P and Q should be connected to their original vertices (P edges = black, Q edges = green). Remove the directional edges (red). The number of cycles in G(P, Q) is 1 while the max possible is 3.
Applications
Reconstruction of ancestral genomes
Alekseyev and Pevzner use sequence graphs to create their own algorithm to study the genome rearrangement history of several mammals, as well as a way to overcome problems with current ancestral reconstruction of genomes.
Multiple sequence alignment
Sequence graphs can be used to represent
multiple sequence alignment
Multiple sequence alignment (MSA) may refer to the process or the result of sequence alignment of three or more biological sequences, generally protein, DNA, or RNA. In many cases, the input set of query sequences are assumed to have an evolution ...
s with the addition of a new kind of edge representing
homology between segments. For a set of genomes, one can create an acyclic breakpoint graph with a thread for each genome. For two segments
and
, where
,
,
, and
represent the endpoints of the two segments, homology edges can be created from
to
and
to
or from
to
and
to
- representing the two possible orientations of the homology. The advantage of representing a multiple sequence alignment this way is that it is possible to include inversions and other structural rearrangements that wouldn't be allowable in a matrix representation.
Representing variation
If there are multiple possible paths when traversing a thread in a sequence graph, multiple sequences can be represented by the same thread. This means it is possible to create a sequence graph that represents a population of individuals with slightly different genomes - with each genome corresponding to one path through the graph. These graphs have been proposed as a replacement for the
reference
Reference is a relationship between objects in which one object designates, or acts as a means by which to connect to or link to, another object. The first object in this relation is said to ''refer to'' the second object. It is called a '' name'' ...
human genome
The human genome is a complete set of nucleic acid sequences for humans, encoded as DNA within the 23 chromosome pairs in cell nuclei and in a small DNA molecule found within individual mitochondria. These are usually treated separately as the ...
.
References
{{reflist
Bioinformatics
Genomics
Evolutionary biology
Application-specific graphs