WormBase is an online biological database about the biology and genome of the nematode model organism ''Caenorhabditis elegans'' and contains information about other related nematodes. WormBase is used by the ''C. elegans'' research community both as an information resource and as a place to publish and distribute their results. The database is regularly updated, with new versions being released every two months. WormBase is one of the organizations participating in the Generic Model Organism Database (GMOD) project. It is also part of the Alliance of Genome Resources.


Contents

WormBase comprises the following main data sets:
* The annotated genomes of ''Caenorhabditis elegans'', ''Caenorhabditis briggsae'', ''Caenorhabditis remanei'', ''Caenorhabditis brenneri'', ''Caenorhabditis angaria'', ''Pristionchus pacificus'', ''Haemonchus contortus'', ''Meloidogyne hapla'', ''Meloidogyne incognita'', ''Brugia malayi'' and ''Onchocerca volvulus'';
* Hand-curated annotations describing the function of ~20,500 ''C. elegans'' protein-coding genes and ~16,000 ''C. elegans'' non-coding genes;
* Gene families;
* Orthologies;
* Genomic transcription factor binding sites;
* Comprehensive information on mutant alleles and their phenotypes;
* Whole-genome RNAi (''RNA interference'') screens;
* Genetic maps, markers and polymorphisms;
* The ''C. elegans'' physical map;
* Gene expression profiles (stage, tissue and cell) from microarrays, SAGE analysis and GFP promoter fusions;
* The complete cell lineage of the worm;
* The wiring diagram of the worm nervous system;
* Protein-protein interaction data;
* Genetic regulatory relationships;
* Details of intra- and inter-specific sequence homologies (with links to other Model Organism Databases).

In addition, WormBase contains an up-to-date searchable bibliography of ''C. elegans'' research and is linked to the WormBook project.


Tools

WormBase offers many ways of searching and retrieving data from the database:
* WormMart - was a tool for retrieving varied information on many genes (or the sequences of those genes). This was the WormBase implementation of BioMart.
* WormMine - as of 2016, the primary data mining facility. This is the WormBase implementation of InterMine.
* Genome Browser - browse the genes of ''C. elegans'' (and other species) in their genomic context.
* Textpresso - a search tool that queries published ''C. elegans'' literature (including meeting abstracts) and a subset of nematode literature.


Sequence curation

Sequence curation at WormBase refers to the maintenance and annotation of the primary genomic sequence and a consensus gene set.


Genome sequence

Even though the ''C. elegans'' genome sequence is the most accurate and complete eukaryotic genome sequence, it has continually needed refinement as new evidence has become available. Many of these changes were single-nucleotide insertions or deletions; however, several large mis-assemblies have also been uncovered. For example, in 2005 a 39 kb cosmid had to be inverted. Other improvements have come from comparing genomic DNA to cDNA sequences and from analysis of high-throughput RNASeq data. When differences between the genomic sequence and transcripts are identified, re-analysis of the original genomic data often leads to modifications of the genomic sequence. These changes make it difficult to compare chromosomal coordinates of data derived from different releases of WormBase. A coordinate re-mapping program and mapping data are available to aid these comparisons.
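The idea behind coordinate re-mapping can be illustrated with a minimal sketch. It assumes a hypothetical list of sequence changes, each recorded as a position in the old release and the net number of bases inserted (positive) or deleted (negative) immediately after that position. This is not the input format or the logic of the actual WormBase re-mapping program, and it does not handle inversions or coordinates that fall inside a deleted region.

```python
from bisect import bisect_left


def remap_coordinate(pos, changes):
    """Shift an old-release coordinate into new-release coordinates.

    `changes` is a hypothetical list of (old_position, delta) pairs, sorted
    by old_position. `delta` is the net number of bases inserted (+) or
    deleted (-) immediately after old_position.
    """
    positions = [p for p, _ in changes]
    # Only changes strictly upstream of `pos` move it.
    n_upstream = bisect_left(positions, pos)
    return pos + sum(delta for _, delta in changes[:n_upstream])


# Illustrative data only: one base inserted after position 100,
# two bases deleted after position 500.
example_changes = [(100, 1), (500, -2)]
print(remap_coordinate(101, example_changes))
```

Keeping the change list sorted allows a binary search for the upstream changes; for many lookups, a running total of the deltas could be precomputed instead of summed each time.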


Gene structure models

All the gene sets of the WormBase species were initially generated by gene prediction programs. Gene prediction programs give a reasonable set of gene structures, but the best of them only predict about 80% of the complete gene structures correctly. They have difficulty predicting genes with unusual structures, as well as single-exon genes and genes with a weak translation start signal or weak splice sites. They can incorrectly predict a coding gene model where the gene is a pseudogene, and they predict the isoforms of a gene poorly, if at all. The gene models of ''C. elegans'', ''C. briggsae'', ''C. remanei'', and ''C. brenneri'' are manually curated. The majority of gene structure changes have been based on transcript data from large-scale projects such as Yuji Kohara's EST libraries, Marc Vidal's ORFeome project (worfdb.dfci.harvard.edu/), Waterston and Hillier's Illumina data and Makedonka Mitreva's 454 data. However, other data types (e.g. protein alignments, ''ab initio'' prediction programs, trans-splice leader sites, poly-A signals and addition sites, SAGE and TEC-RED transcript tags, peptides identified by mass spectrometry, and conserved protein domains) are useful in refining the structures, especially where expression is low and transcripts are therefore not sufficiently available. When genes are conserved between the available nematode species, comparative analysis can also be very informative.

WormBase encourages researchers to report evidence for an incorrect gene structure via the help desk. Any cDNA or mRNA sequence evidence for the change should be submitted to EMBL/GenBank/DDBJ; this helps to confirm and document the gene model, as WormBase routinely retrieves sequence data from these public databases. It also makes the data public, allowing appropriate reference and acknowledgement to the researchers.

When any change is made to a CDS (or Pseudogene), the old gene model is preserved as a ‘history’ object. This has a name with a suffix, like “AC3.5:wp119”, where ‘AC3.5’ is the name of the CDS and ‘119’ refers to the database release in which the change was made. The reason for the change and the evidence for it are added to the annotation of the CDS; these can be seen in the Visible/Remark section of the CDS's ‘Tree Display’ section on the WormBase web site.
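As an illustration of the history naming convention, the following sketch splits a history object name into its CDS name and release number. The pattern is inferred only from the single example “AC3.5:wp119” given above, and the function name is hypothetical; it is not part of any WormBase software.

```python
import re

# Inferred from the example "AC3.5:wp119": <CDS name>:wp<release number>.
HISTORY_NAME = re.compile(r"^(?P<cds>.+):wp(?P<release>\d+)$")


def parse_history_name(name):
    """Return (cds_name, release_number) for a history object name."""
    match = HISTORY_NAME.match(name)
    if match is None:
        raise ValueError(f"not a history object name: {name!r}")
    return match.group("cds"), int(match.group("release"))


print(parse_history_name("AC3.5:wp119"))
```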


Gene nomenclature


Genes

In WormBase, a Gene is a region that is expressed or a region that has been expressed and is now a Pseudogene. Genes have unique identifiers like ‘WBGene00006415’. All ''C. elegans'' WormBase genes also have a Sequence Name, which is derived from the cosmid, fosmid or YAC clone on which they reside, for instance F38H4.7, indicating that the gene is on the cosmid ‘F38H4’ and that there are at least six other genes on that cosmid. If a gene produces a protein that can be classified as a member of a family, the gene may also be assigned a CGC name like tag-30, indicating that this is the 30th member of the tag gene family. Assignment of gene family names is controlled by WormBase, and requests for names should be submitted to WormBase before publication. There are a few exceptions to this format, like the genes cln-3.1, cln-3.2, and cln-3.3, which are all equally similar to the human gene CLN3. CGC names for non-elegans species in WormBase have the 3-letter species code prepended, like Cre-acl-5, Cbr-acl-5, Cbn-acl-5. A gene can be a Pseudogene, or can express one or more non-coding RNA genes (ncRNA) or protein-coding sequences (CDS).
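The three kinds of gene name described above can be told apart with simple patterns. The sketch below is based only on the examples in this section (WBGene00006415, F38H4.7, tag-30, cln-3.1, Cre-acl-5); the exact digit count of the WBGene identifier and the 3 to 4 letter family name are assumptions, and the function is hypothetical rather than a WormBase API.

```python
import re

# All three patterns are assumptions generalised from the examples in the text.
WBGENE_ID = re.compile(r"^WBGene\d{8}$")
CGC_NAME = re.compile(
    r"^(?:(?P<species>[A-Z][a-z]{2})-)?"    # optional species code, e.g. Cre
    r"(?P<family>[a-z]{3,4})-"              # gene family, e.g. tag
    r"(?P<member>\d+(?:\.\d+)?)$"           # member number, e.g. 30 or 3.1
)
SEQUENCE_NAME = re.compile(r"^(?P<clone>[A-Z0-9]+)\.(?P<number>\d+)$")


def classify_gene_name(name):
    """Return (kind, parts) where parts is a dict of the named components."""
    for kind, pattern in (
        ("wbgene_id", WBGENE_ID),
        ("cgc_name", CGC_NAME),
        ("sequence_name", SEQUENCE_NAME),
    ):
        match = pattern.match(name)
        if match:
            return kind, match.groupdict()
    return "unknown", {}


for example in ("WBGene00006415", "F38H4.7", "tag-30", "Cre-acl-5"):
    print(example, classify_gene_name(example))
```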


Pseudogenes

Pseudogenes are genes that do not produce a reasonable, functional transcript. They may be pseudogenes of coding genes or of non-coding RNA genes, may be whole genes or fragments of a gene, and may or may not express a transcript. What is considered a ''reasonable'' coding transcript is sometimes subjective because, in the absence of other evidence, the use of weak splice sites or short exons can often produce a putative, though unsatisfactory, model of a CDS. Pseudogenes and genes with a problematic structure are constantly under review in WormBase, and new evidence is used to try to resolve their status.


CDSs

Coding Sequences (CDSs) are the only part of a Gene's structure that is manually curated in WormBase. The structure of the Gene and its transcripts are derived from the structure of their CDSs. CDSs have a Sequence Name that is derived from the same Sequence Name as their parent Gene object, so the gene ‘F38H4.7’ has a CDS called ‘F38H4.7’. The CDS specifies coding exons in the gene from the START (Methionine) codon up to (and including) the STOP codon. A gene can code for multiple proteins as a result of alternative splicing. These isoforms have a name that is formed from the Sequence Name of the gene with a unique letter appended. In the case of the gene bli-4 there are 6 known CDS isoforms, called K04F10.4a, K04F10.4b, K04F10.4c, K04F10.4d, K04F10.4e and K04F10.4f. It is common to refer to isoforms in the literature using the CGC gene family name with a letter appended, for example pha-4a; however, this has no meaning within the WormBase database, and searches for pha-4a in WormBase will not return anything. The correct name of this isoform is either the CDS/Transcript name, F38A6.1a, or, even better, the Protein name, WP:CE15998.


Gene transcripts

The transcripts of a gene in WormBase are automatically derived by mapping any available cDNA or mRNA alignments onto the CDS model. These gene transcripts will therefore often include the UTR exons surrounding the CDS. If there are no available cDNA or mRNA transcripts, then the gene transcripts will have exactly the same structure as the CDS that they are modelled on. Gene transcripts are named after the Sequence Name of the CDS used to create them, for example, F38H4.7 or K04F10.4a. However, if there is alternative splicing in the UTRs, which would not change the protein sequence, the alternatively spliced transcripts are named with a digit appended, for example: K04F10.4a.1 and K04F10.4a.2. If there are no isoforms of the coding gene, for example AC3.5, but there is alternative splicing in the UTRs, there will be multiple transcripts named AC3.5.1 and AC3.5.2, etc. If there are no alternate UTR transcripts the single coding_transcript is named the same as the CDS and does not have the .1 appended, as in the case of K04F10.4f.
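The transcript naming rules above can be summarised in a short sketch that splits a transcript name into the gene's Sequence Name, the optional CDS isoform letter and the optional UTR-variant number. The regular expression is inferred from the examples in this section only (F38H4.7, K04F10.4a, K04F10.4a.1, AC3.5.1), and the type and function names are hypothetical.

```python
import re
from typing import NamedTuple, Optional


class TranscriptName(NamedTuple):
    gene: str                    # Sequence Name of the gene, e.g. K04F10.4
    isoform: Optional[str]       # CDS isoform letter, e.g. "a", or None
    utr_variant: Optional[int]   # digit for alternative UTR splicing, or None

    @property
    def cds(self):
        """Name of the CDS the transcript is modelled on."""
        return self.gene + (self.isoform or "")


# Assumed pattern: <clone>.<number>[isoform letter][.<UTR variant>]
TRANSCRIPT = re.compile(
    r"^(?P<gene>[A-Z0-9]+\.\d+)(?P<isoform>[a-z])?(?:\.(?P<variant>\d+))?$"
)


def parse_transcript_name(name):
    match = TRANSCRIPT.match(name)
    if match is None:
        raise ValueError(f"unrecognised transcript name: {name!r}")
    variant = match.group("variant")
    return TranscriptName(
        match.group("gene"),
        match.group("isoform"),
        int(variant) if variant is not None else None,
    )


for example in ("F38H4.7", "K04F10.4f", "K04F10.4a.1", "AC3.5.1"):
    parsed = parse_transcript_name(example)
    print(example, parsed, parsed.cds)
```

Under these assumptions, two transcripts that share the same `cds` value differ only in their UTRs and therefore encode the same protein.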


Operons

Groups of genes which are co-transcribed as operons are curated as Operon objects. These have names like CEOP5460 and are manually curated using evidence from the SL2 trans-spliced leader sequence sites.


Non-coding RNA genes

There are several classes of non-coding RNA genes in WormBase:
* tRNA genes are predicted by the program ‘tRNAscan-SE’.
* rRNA genes are predicted by homology with other species.
* snRNA genes are mainly imported from Rfam.
* piRNA genes are from an analysis of the characteristic motif in these genes.
* miRNA genes have mainly been imported from miRBase. They have the primary transcript and the mature transcript marked up. The primary transcript will have a Sequence name like W09G3.10 and the mature transcript will have a letter added to this name like W09G3.10a (and if there are alternative mature transcripts, W09G3.10b, etc.).
* snoRNA genes are mainly imported from Rfam or from papers.
* ncRNA genes that have no obvious other function but which are obviously not protein-coding and are not pseudogenes are curated. Many of these have conserved homology with genes in other species. A few of these are expressed on the reverse sense to protein-coding genes.

There is also one scRNA gene.


Transposons

Transposons are not classed as genes and so do not have a parent gene object. Their structure is curated as a Transposon_CDS object with a name like C29E6.6.


Other species

The non-elegans species in WormBase have genomes that have been assembled from sequencing technologies that do not involve sequencing cosmids or YACs. These species therefore do not have sequence names for CDSs and gene transcripts that are based on cosmid names. Instead they have unique alphanumeric identifiers constructed like the names in the table below.


Proteins

The protein products of genes are created by translating the CDS sequences. Each unique protein sequence is given a unique identifying name like WP:CE40440. Examples of the protein identifier names for each species in WormBase are given in the table below. It is possible for two CDS sequences from separate genes, within a species, to be identical, and so it is possible to have identical proteins coded for by separate genes. When this happens, a single, unique identifying name is used for the protein even though it is produced by two genes.
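The rule that identical protein sequences share one identifier can be sketched as follows. The CDS names, sequences and the "DEMO" identifier prefix are placeholders invented for illustration; they are not real WormBase identifiers, and the function is not how WormBase actually assigns WP:CE names.

```python
def assign_protein_ids(cds_translations, prefix="DEMO"):
    """Give each unique protein sequence exactly one identifier.

    `cds_translations` maps a CDS name to its translated protein sequence.
    Returns a dict mapping each CDS name to a protein identifier; CDSs with
    identical translations receive the same identifier.
    """
    id_by_sequence = {}
    protein_of_cds = {}
    for cds_name, sequence in cds_translations.items():
        if sequence not in id_by_sequence:
            # First time this sequence is seen: mint a new placeholder ID.
            id_by_sequence[sequence] = f"{prefix}{len(id_by_sequence) + 1:05d}"
        protein_of_cds[cds_name] = id_by_sequence[sequence]
    return protein_of_cds


# Invented example: two CDSs from separate genes with identical translations.
example = {"cdsA": "MKTLLV", "cdsB": "MKTLLV", "cdsC": "MALWTR"}
print(assign_protein_ids(example))
```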


ParaSite

WormBase ParaSite is a sub-portal for approximately 100 draft genomes of parasitic helminths (nematodes and platyhelminthes) developed at the European Bioinformatics Institute and Wellcome Trust Sanger Institute. All genomes are assembled and annotated. Additional information such as protein domains and Gene Ontology terms is also available. Gene trees allow the alignment of orthologues between parasitic worms, other nematodes and non-worm comparator species. A BioMart data-mining tool is offered to permit large-scale access to the data.


WormBase management

WormBase is a collaboration among the European Bioinformatics Institute, Wellcome Trust Sanger Institute, Ontario Institute for Cancer Research, Washington University in St. Louis, and the California Institute of Technology. It is supported by the grant P41-HG002223 from the National Institutes of Health and the grant G0701197 from the British Medical Research Council. Caltech carries out the biological curation and develops the underlying ontologies, the EBI carries out sequence curation and computation as well as database builds, the Sanger is primarily involved in curation and display of parasitic nematode genomes and genes, and the OICR develops the website and main data mining tools.


See also

* FlyBase
* Xenbase


External links


* WormBase
* WormBase ParaSite
* The WormBook website - the online textbook companion to WormBase.
* Textpresso - search engine for ''C. elegans'' and other biological literature.
* WormBase Wiki
* Release notes - details of the latest WormBase release.
* WormBase: better software, richer content - Nucleic Acids Research article describing WormBase (2006).