Rfam is a database containing information about non-coding RNA (ncRNA) families and other structured RNA elements. It is an annotated, open access database originally developed at the Wellcome Trust Sanger Institute in collaboration with Janelia Farm, and currently hosted at the European Bioinformatics Institute. Rfam is designed to be similar to the Pfam database for annotating protein families.
Unlike proteins, ncRNAs often have similar secondary structure without sharing much similarity in the primary sequence. Rfam divides ncRNAs into families based on evolution from a common ancestor. Producing multiple sequence alignments (MSA) of these families can provide insight into their structure and function, similar to the case of protein families. These MSAs become more useful with the addition of secondary structure information. Rfam researchers also contribute to Wikipedia's RNA WikiProject.
Uses
The Rfam database can be used for a variety of functions. For each ncRNA family, the interface allows users to: view and download multiple sequence alignments; read annotation; and examine species distribution of family members. There are also links provided to literature references and other RNA databases.
Rfam also provides links to Wikipedia so that entries can be created or edited by users.
The interface at the Rfam website allows users to search ncRNAs by keyword, family name, or genome as well as to search by ncRNA sequence or EMBL accession number.
The database can also be downloaded, installed, and used locally with the INFERNAL software package.
The INFERNAL package can also be used with Rfam to annotate sequences (including complete genomes) for homologues to known ncRNAs.
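As a rough sketch of what such an annotation run might look like, the Python snippet below assembles a command line for Infernal's cmscan program. The file names (Rfam.cm, genome.fa, hits.tbl) are hypothetical placeholders, and the sketch assumes Infernal is installed and that the covariance model file has already been indexed with cmpress. It only builds and prints the command; the actual call is left commented out.

```python
import shlex


def build_cmscan_command(cm_db: str, fasta: str, tblout: str) -> list[str]:
    """Return an argv list for an Infernal cmscan run against a CM database."""
    return [
        "cmscan",
        "--cut_ga",          # apply each family's own gathering threshold
        "--tblout", tblout,  # write a tabular summary of the hits
        cm_db,               # hypothetical path to a cmpress-ed CM file
        fasta,               # hypothetical path to the sequences to annotate
    ]


cmd = build_cmscan_command("Rfam.cm", "genome.fa", "hits.tbl")
print(shlex.join(cmd))

# To actually run the search (requires Infernal on PATH):
# import subprocess
# subprocess.run(cmd, check=True)
```

Building the argument list separately from running it keeps the example usable on machines without Infernal and makes the chosen options easy to inspect.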
Methods

In the database, secondary structure and primary sequence information, represented by the MSA, is combined in statistical models called profile stochastic context-free grammars (SCFGs), also known as covariance models. These are analogous to the hidden Markov models used for protein family annotation in the Pfam database.
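The value of combining structure with sequence can be illustrated with a toy example. The sketch below is not a covariance model; it only shows the kind of signal such a model can exploit. It reads a bracket-notation consensus structure, finds which alignment columns are paired, and lists the canonical base pairs seen at each paired position. The three sequences and the structure string are invented for illustration.

```python
CANONICAL = {"AU", "UA", "GC", "CG", "GU", "UG"}  # Watson-Crick and wobble pairs


def paired_columns(structure: str) -> list[tuple[int, int]]:
    """Return (i, j) column pairs from a bracket-notation structure string."""
    openers = {"(": ")", "<": ">", "[": "]", "{": "}"}
    closers = {close: open_ for open_, close in openers.items()}
    stacks = {open_: [] for open_ in openers}
    pairs = []
    for j, ch in enumerate(structure):
        if ch in openers:
            stacks[ch].append(j)
        elif ch in closers:
            i = stacks[closers[ch]].pop()  # matches the most recent opener
            pairs.append((i, j))
    return sorted(pairs)


def pair_support(alignment: list[str], structure: str) -> dict:
    """For each paired column, list the distinct canonical pairs observed."""
    support = {}
    for i, j in paired_columns(structure):
        observed = {seq[i] + seq[j] for seq in alignment}
        support[(i, j)] = sorted(observed & CANONICAL)
    return support


structure = "<<<....>>>"
alignment = [
    "GGCAUUAGCC",
    "AUCGAAAGAU",
    "CCGUUCGCGG",
]
for columns, pairs in pair_support(alignment, structure).items():
    print(columns, pairs)
```

In this toy alignment the three sequences differ at every stem position, yet each paired column still holds a canonical base pair. A sequence-only profile would see little conservation here, while a model that scores paired columns jointly can still recognise the family.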
Each family in the database is represented by two multiple sequence alignments in Stockholm format and an SCFG.
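For readers unfamiliar with Stockholm files, the following minimal Python sketch reads the aligned sequences and the per-column annotation lines (such as the SS_cons consensus structure) from a single alignment. The embedded alignment is a made-up example, and the parser deliberately ignores the per-file, per-sequence, and per-residue annotation lines. It is an illustration of the file layout, not a replacement for an established parser.

```python
def parse_stockholm(text: str):
    """Parse one Stockholm alignment into (sequences, column annotations)."""
    sequences: dict[str, str] = {}
    column_annotations: dict[str, str] = {}
    for line in text.splitlines():
        line = line.rstrip()
        if not line or line.startswith("# STOCKHOLM"):
            continue  # blank line or format header
        if line == "//":
            break  # end of the alignment
        if line.startswith("#=GC"):
            _, tag, value = line.split(None, 2)
            # Concatenate so that interleaved blocks are joined correctly.
            column_annotations[tag] = column_annotations.get(tag, "") + value
        elif line.startswith("#"):
            continue  # skip #=GF, #=GS, #=GR lines and comments
        else:
            name, seq = line.split(None, 1)
            sequences[name] = sequences.get(name, "") + seq.strip()
    return sequences, column_annotations


example = """# STOCKHOLM 1.0
#=GF ID toy_hairpin
seq1         GGCAUUAGCC
seq2         AUCGAAAGAU
#=GC SS_cons <<<....>>>
//
"""

seqs, annotations = parse_stockholm(example)
print(seqs)
print(annotations["SS_cons"])
```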
The first MSA is the "seed" alignment. It is a hand-curated alignment that contains representative members of the ncRNA family and is annotated with structural information. This seed alignment is used to create the SCFG, which is used with the Rfam software INFERNAL to identify additional family members and add them to the alignment. A family-specific threshold value is chosen to avoid false positives.
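The effect of a family-specific threshold can be sketched in a few lines of Python. The family names, scores, and threshold values below are invented for illustration, and the sketch assumes that each hit carries a bit score that is compared against the threshold of its own family.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    family: str       # family identifier (hypothetical names below)
    sequence: str     # name of the sequence that was hit
    bit_score: float  # score reported by the search


def apply_family_thresholds(hits, thresholds):
    """Keep hits scoring at or above their own family's threshold."""
    return [h for h in hits if h.bit_score >= thresholds[h.family]]


thresholds = {"famA": 40.0, "famB": 55.0}  # invented values
hits = [
    Hit("famA", "seq1", 62.3),
    Hit("famA", "seq2", 31.0),
    Hit("famB", "seq3", 50.0),
    Hit("famB", "seq4", 71.5),
]
kept = apply_family_thresholds(hits, thresholds)
print([h.sequence for h in kept])
```

Note that seq3 scores higher than famA's threshold but is still rejected, because it is judged against the stricter threshold of famB. A single global cutoff could not express this.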
Until release 12, Rfam used an initial BLAST filtering step because profile SCFGs were too computationally expensive. However, the latest versions of INFERNAL are fast enough that the BLAST step is no longer necessary.
The second MSA is the "full" alignment, and is created as a result of a search using the covariance model against the sequence database. All detected homologs are aligned to the model, giving the automatically produced full alignment.
History
Version 1.0 of Rfam was launched in 2003 and contained 25 ncRNA families and annotated about 50 000 ncRNA genes. In 2005, version 6.1 was released and contained 379 families annotating over 280 000 genes. In August 2012, version 11.0 contained 2208 RNA families, while the current version (14.9, released in November 2022) annotates 4108 families.
Major releases and publications
* 2003 - Rfam: an RNA family database.
* 2005 - Rfam: annotating non-coding RNAs in complete genomes.
* 2008 - The RNA WikiProject: community annotation of RNA families.
* 2008 - Rfam: updates to the RNA families database.
* 2011 - Rfam: Wikipedia, clans and the “decimal” release.
* 2012 - Rfam 11.0: 10 years of RNA families.
* 2014 - Rfam 12.0: updates to the RNA families database.
* 2017 - Rfam 13.0: shifting to a genome-centric resource for non-coding RNA families.
* 2020 - Rfam 14: expanded coverage of metagenomic, viral and microRNA families.
Problems
# The genomes of higher eukaryotes contain many ncRNA-derived pseudogenes and repeats. Distinguishing these non-functional copies from functional ncRNA is a formidable challenge.
# Introns are not modeled by covariance models.
External links
Rfam website at the European Bioinformatics Institute
INFERNAL software package
miRBase