A sequence profiling tool in bioinformatics is a type of software that presents information related to a genetic sequence, gene name, or keyword input. Such tools generally take a query such as a DNA, RNA, or protein sequence or a keyword and search one or more databases for information related to that query. Summaries and aggregate results are provided in a standardized format, sparing the user the visits to many smaller sites or the direct literature searches that would otherwise be needed to compile them. Many sequence profiling tools are software portals or gateways that simplify the process of finding information about a query in the large and growing number of bioinformatics databases. These tools are accessed either through the web or as locally downloadable executables.
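The query-and-aggregate pattern described above can be sketched roughly as follows. This is a minimal illustration only: the two in-memory "databases", their record contents, the identifier `GENE1`, and the `profile` function are all hypothetical stand-ins, not the interface of any real service.

```python
from typing import Callable, Dict, List

# Hypothetical in-memory stand-ins for external databases; a real tool
# would query remote services instead.
SEQUENCE_DB = {"GENE1": "example sequence record"}
STRUCTURE_DB = {"GENE1": "example structure record"}


def search_sequences(query: str) -> List[str]:
    """Look the query up in the mock sequence database."""
    return [SEQUENCE_DB[query]] if query in SEQUENCE_DB else []


def search_structures(query: str) -> List[str]:
    """Look the query up in the mock structure database."""
    return [STRUCTURE_DB[query]] if query in STRUCTURE_DB else []


# Each source is registered under a name so results are reported uniformly.
SOURCES: Dict[str, Callable[[str], List[str]]] = {
    "sequences": search_sequences,
    "structures": search_structures,
}


def profile(query: str) -> Dict[str, List[str]]:
    """Send one query to every source and collect the hits in one report."""
    return {name: search(query) for name, search in SOURCES.items()}


report = profile("GENE1")
for source, hits in report.items():
    print(f"{source}: {len(hits)} hit(s) {hits}")
```

Registering each source under a name keeps the report format the same no matter how many databases are added, which is the point of a standardized summary.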

Introduction and usage

The "post-genomics" era has given rise to a range of web-based tools and software to compile, organize, and deliver large amounts of primary sequence information, as well as protein structures, gene annotations, sequence alignments, and the results of other common bioinformatics tasks. In general, there are three types of databases and service providers. The first includes the popular public-domain or open-access databases supported by funding and grants, such as NCBI, ExPASy, Ensembl, and PDB. The second includes smaller or more specific databases organized and compiled by individual research groups; examples include the Yeast Genome Database and RNA database.
The third and final type includes private corporate or institutional databases that require payment or institutional affiliation to access. Such examples are rare given the globalization of public databases, unless the service is still in development or the end point of the analysis is of commercial value. A profiling approach becomes relevant particularly for the first two groups, where researchers commonly wish to combine information derived from several sources about a single query or target sequence. For example, users might use the sequence alignment and search tool BLAST to identify homologs of their gene of interest in other species, and then use these results to locate a solved protein structure for one of the homologs. Similarly, they might want to know the likely secondary structure of the mRNA encoding the gene of interest, or whether a company sells a DNA construct containing the gene.

Sequence profiling tools automate and integrate the process of seeking such disparate information by making the search of several different external databases transparent to the user. Many public databases are already extensively linked so that complementary information in another database is easily accessible; for example, GenBank and the PDB are closely intertwined. However, specialized tools organized and hosted by specific research groups can be difficult to integrate into this linkage effort because they are narrowly focused, are frequently modified, or use custom versions of common file formats. Advantages of sequence profiling tools include the ability to use several of these specialized tools in a single query and present the output with a common interface, the ability to direct the output of one set of tools or database searches into the input of another, and the capacity to distribute hosting and compilation obligations across a network of research groups and institutions rather than a single centralized repository.
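Directing the output of one search into the input of another, as in the homolog-then-structure scenario above, might look roughly like the following sketch. The lookup tables, gene identifiers, and function names are made up for illustration; no real alignment search or structure database is called.

```python
from typing import Dict, List

# Hypothetical lookup tables standing in for a homology search and a
# structure database; all identifiers are invented for this example.
HOMOLOGS: Dict[str, List[str]] = {"geneA_human": ["geneA_mouse", "geneA_yeast"]}
STRUCTURES: Dict[str, str] = {"geneA_yeast": "structure_0001"}


def find_homologs(gene: str) -> List[str]:
    """Step 1: return homologs of the query gene (mock alignment search)."""
    return HOMOLOGS.get(gene, [])


def find_structures(genes: List[str]) -> Dict[str, str]:
    """Step 2: keep only the genes that have a solved structure on record."""
    return {g: STRUCTURES[g] for g in genes if g in STRUCTURES}


def profile_gene(gene: str) -> Dict[str, str]:
    """Feed the output of the homology step directly into the structure step."""
    return find_structures(find_homologs(gene))


print(profile_gene("geneA_human"))
```

Because each step takes plain data and returns plain data, further steps could be appended to the chain without changing the earlier ones.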

Keyword based profilers

Most of the profiling tools available on the web today fall into this category. On visiting the site or tool, the user enters relevant information such as a keyword (e.g. dystrophy or diabetes), a GenBank accession number, or a PDB ID. The relevant hits are presented in a format specific to each tool’s main focus. Profiling tools based on keyword searches are essentially search engines that are highly specialized for bioinformatics work, thereby eliminating the clutter of irrelevant or non-scholarly hits that might occur with a traditional search engine like Google. Most keyword-based profiling tools accept flexible types of keyword input, including accession numbers from indexed databases as well as traditional keyword descriptors.

Each profiling tool has its own focus and area of interest. For example, the NCBI search engine Entrez segregates its hits by category, so that users looking for protein structure information can screen out sequences with no corresponding structure, while users interested in perusing the literature on a subject can view abstracts of papers published in scholarly journals without distraction from gene or sequence results. The PubMed biosciences literature database is a popular tool for literature searches, though this service is nearly equaled by the more general Google Scholar.

Keyword-based data aggregation services like the Bioinformatic Harvester provide reports from a variety of third-party servers in an "as-is" format so that users need not visit the website or install the software for each individual component service. This is particularly valuable given the rapid emergence of sites providing different sequence analysis and manipulation tools. Another aggregative web portal, the Human Protein Reference Database (HPRD), contains manually annotated and curated entries for human proteins. The information provided is thus both selective and comprehensive, and the query format is flexible and intuitive. The advantages of manually curated databases include the presentation of proofread material and the concept of ‘molecule authorities’ who take responsibility for specific proteins. The disadvantage is that such databases are typically slower to update and may not contain very new or disputed data.
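Accepting accession numbers and free-text keywords in the same input box implies some way of telling them apart. The sketch below shows one possible approach using regular expressions. The two patterns are deliberately simplified assumptions and do not cover every identifier format in use; `classify_query` is a hypothetical helper, not part of any tool named in this article.

```python
import re

# Simplified patterns, assumed for illustration only; real identifier
# formats have more variants than are covered here.
PDB_ID = re.compile(r"^[0-9][A-Za-z0-9]{3}$")                      # e.g. "1ABC"
GENBANK_ACC = re.compile(r"^[A-Za-z]{1,2}[0-9]{5,6}(\.[0-9]+)?$")  # e.g. "U12345"


def classify_query(text: str) -> str:
    """Guess whether a query is a PDB ID, a GenBank-style accession, or a keyword."""
    token = text.strip()
    if PDB_ID.match(token):
        return "pdb_id"
    if GENBANK_ACC.match(token):
        return "genbank_accession"
    return "keyword"


for q in ["1ABC", "U12345", "dystrophy", "diabetes"]:
    print(q, "->", classify_query(q))
```

Anything that matches neither pattern falls through to an ordinary keyword search, so an unrecognized identifier degrades gracefully rather than failing.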

Sequence data based profilers

A typical sequence profiling tool carries this further by taking an actual DNA, RNA, or protein sequence as input and allowing the user to visit different web-based analysis tools to obtain the desired information. Such tools are also commonly supplied with commercial laboratory equipment like gene sequencers, or sold as software applications for molecular biology. In a public-database example, the BLAST sequence search report from NCBI links from its alignment report to other relevant information in NCBI's own databases, if such information exists. For example, a retrieved record that contains a human sequence carries a separate link to its location on a human genome map; a record that contains a sequence for which a 3-D structure has been solved carries a link to the corresponding entry in the structure database.

Sequerome, a public service tool, links the entire BLAST report to many third-party servers and sites that provide highly specific sequence manipulation services such as restriction enzyme maps, open reading frame analyses for nucleotide sequences, and secondary structure prediction. The tool also maintains a research log of the operations performed by the user, which can then be archived using its 'mail', 'print' or 'save' functions. An entire course of research on a sequence, drawing on different research tools, can thus be completed within one browser interface. Future generations of sequence profiling tools may add the ability to collaborate online with other researchers to share project logs and research tools, annotate the results of sequence analysis or lab work, and customize and automate the processing of sets of sequence data.

InstaSeq is a Google-powered search tool that allows the user to enter a sequence directly and search the entire World Wide Web. This search engine, the only one of its kind, contrasts with tools that search specific databases such as GenBank. As a result, the user can end up with a privately hosted document or a page from a lesser-known database from just about anywhere in the world. Although sequence-based profilers are currently few and far between, their key role will become evident when huge amounts of sequence data need to be cross-processed across portals and domains.
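A research log of the kind mentioned above for Sequerome could, in principle, be as simple as an ordered list of recorded steps. The sketch below is an assumption about how such a log might be structured; the `ResearchLog` class and its methods are hypothetical and do not describe Sequerome's actual implementation.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ResearchLog:
    """Ordered record of the operations performed on a sequence in one session."""
    entries: List[str] = field(default_factory=list)

    def record(self, tool: str, note: str) -> None:
        """Append one step, numbered in the order it was performed."""
        self.entries.append(f"{len(self.entries) + 1}. [{tool}] {note}")

    def export(self) -> str:
        """Return the whole log as text, ready to be saved, printed, or mailed."""
        return "\n".join(self.entries)


log = ResearchLog()
log.record("alignment", "searched query sequence against a database")
log.record("restriction map", "mapped cut sites on the top hit")
print(log.export())
```

Keeping the log as plain text makes the same export usable for all three archiving routes.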

Future growth and directions

The proliferation of bioinformatics tools for genetic analysis aids researchers in identifying and categorizing genes and gene sets of interest in their work; however, the large variety of tools that perform substantially similar aggregative and analytical functions can also confuse and frustrate new users. The decentralization encouraged by aggregative tools allows individual research groups to maintain specialized servers dedicated to specific types of data analysis, in the expectation that their output will be collected into a larger report on a gene or protein of interest to other researchers. Data produced by microarray experiments, two-hybrid screening, and other high-throughput biological experiments are voluminous and difficult to analyze by hand; the efforts of structural genomics collaborations that aim to quickly solve large numbers of highly varied protein structures also increase the need for integration between sequence and structure databases and portals. This impetus toward more comprehensive and more user-friendly methods of sequence profiling makes it an active area of research among genomics researchers.

