Biopython is an open-source collection of non-commercial Python tools for computational biology and bioinformatics.
Refer to the Biopython website for other papers describing Biopython and a list of over one hundred publications using/citing Biopython.
It contains classes to represent biological sequences and sequence annotations, and it can read and write a variety of file formats. It also provides programmatic access to online databases of biological information, such as those at NCBI. Separate modules extend Biopython's capabilities to sequence alignment, protein structure, population genetics, phylogenetics, sequence motifs, and machine learning. Biopython is one of a number of Bio* projects designed to reduce code duplication in computational biology.
History
Biopython development began in 1999 and it was first released in July 2000.
It was developed over a similar time frame to, and with goals analogous to, other projects that added bioinformatics capabilities to their respective programming languages, including BioPerl, BioRuby and BioJava. Early developers on the project included Jeff Chang, Andrew Dalke and Brad Chapman, though over 100 people have made contributions to date.
In 2007, a similar Python project, namely PyCogent, was established.
The initial scope of Biopython involved accessing, indexing and processing biological sequence files. While this is still a major focus, over the following years added modules have extended its functionality to cover additional areas of biology (see Key features and examples).
As of version 1.77, Biopython no longer supports Python 2.
Design
Wherever possible, Biopython follows the conventions used by the Python programming language to make it easier for users familiar with Python. For example, Seq and SeqRecord objects can be manipulated via slicing, in a manner similar to Python's strings and lists. It is also designed to be functionally similar to other Bio* projects, such as BioPerl.
Biopython is able to read and write most common file formats for each of its functional areas, and its license is permissive and compatible with most other software licenses, which allows Biopython to be used in a variety of software projects.
Key features and examples
Sequences
A core concept in Biopython is the biological sequence, and this is represented by the Seq class. A Biopython Seq object is similar to a Python string in many respects: it supports the Python slice notation, can be concatenated with other sequences and is immutable. In addition, it includes sequence-specific methods and specifies the particular biological alphabet used.
>>> # This script creates a DNA sequence and performs some typical manipulations
>>> # Note: Bio.Alphabet is available only in Biopython 1.77 and earlier
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> dna_sequence = Seq("AGGCTTCTCGTA", IUPAC.unambiguous_dna)
>>> dna_sequence
Seq('AGGCTTCTCGTA', IUPACUnambiguousDNA())
>>> dna_sequence[2:7]
Seq('GCTTC', IUPACUnambiguousDNA())
>>> dna_sequence.reverse_complement()
Seq('TACGAGAAGCCT', IUPACUnambiguousDNA())
>>> rna_sequence = dna_sequence.transcribe()
>>> rna_sequence
Seq('AGGCUUCUCGUA', IUPACUnambiguousRNA())
>>> rna_sequence.translate()
Seq('RLLV', IUPACProtein())
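To make concrete what the sequence methods in the session above compute, the following is a minimal plain-Python sketch that does not use Biopython. The function names and the four-entry codon table are assumptions made for illustration (a complete table has 64 codons), so this shows the operations themselves rather than Biopython's implementation.

```python
# Plain-Python illustration; not Biopython's implementation.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

# Hypothetical partial codon table: only the codons needed for this example.
CODON_TABLE = {"AGG": "R", "CUU": "L", "CUC": "L", "GUA": "V"}


def reverse_complement(dna: str) -> str:
    """Complement each base, then reverse the strand."""
    return dna.translate(COMPLEMENT)[::-1]


def transcribe(dna: str) -> str:
    """Coding-strand transcription: replace every T with U."""
    return dna.replace("T", "U")


def translate(rna: str) -> str:
    """Translate complete codons using the partial table above."""
    usable = len(rna) - len(rna) % 3  # ignore a trailing partial codon
    codons = (rna[i:i + 3] for i in range(0, usable, 3))
    return "".join(CODON_TABLE[codon] for codon in codons)


dna = "AGGCTTCTCGTA"
print(dna[2:7])                     # slicing works as on any Python string
print(reverse_complement(dna))
print(translate(transcribe(dna)))
```

The results agree with the Seq session above, which is the point of the comparison: a Seq object behaves like a string that also knows these biological operations.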
Sequence annotation
The SeqRecord class describes sequences, along with information such as name, description and features in the form of SeqFeature objects. Each SeqFeature object specifies the type of the feature and its location. Feature types can be ‘gene’, ‘CDS’ (coding sequence), ‘repeat_region’, ‘mobile_element’ or others, and the position of features in the sequence can be exact or approximate.
>>> # This script loads an annotated sequence from file and views some of its contents.
>>> from Bio import SeqIO
>>> seq_record = SeqIO.read("pTC2.gb", "genbank")
>>> seq_record.name
'NC_019375'
>>> seq_record.description
'Providencia stuartii plasmid pTC2, complete sequence.'
>>> seq_record.features[14]
SeqFeature(FeatureLocation(ExactPosition(4516), ExactPosition(5336), strand=1), type='mobile_element')
>>> seq_record.seq
Seq("GGATTGAATATAACCGACGTGACTGTTACATTTAGGTGGCTAAACCCGTCAAGC...GCC", IUPACAmbiguousDNA())
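The relationship between an annotated record and its features can be pictured with a small plain-Python model. The class names `Feature` and `Record`, the half-open coordinate convention and the toy sequence are all assumptions made for this sketch; they are not Biopython classes and only mimic the idea of a typed feature with a location and a strand.

```python
from dataclasses import dataclass, field


@dataclass
class Feature:
    """Hypothetical stand-in for an annotated region of a sequence."""
    type: str
    start: int       # 0-based, inclusive
    end: int         # 0-based, exclusive (Python slice convention)
    strand: int = 1  # +1 for the forward strand, -1 for the reverse strand

    def extract(self, sequence: str) -> str:
        """Return the part of `sequence` that this feature covers."""
        region = sequence[self.start:self.end]
        if self.strand == -1:
            # Reverse-strand features are read as the reverse complement.
            region = region.translate(str.maketrans("ACGT", "TGCA"))[::-1]
        return region


@dataclass
class Record:
    """Hypothetical stand-in for a sequence plus its annotation."""
    name: str
    description: str
    seq: str
    features: list = field(default_factory=list)


record = Record("toy", "made-up example sequence", "ATGGCCATTGTAATGGGCCGC")
record.features.append(Feature("CDS", 0, 9))
record.features.append(Feature("repeat_region", 12, 18, strand=-1))

for feature in record.features:
    print(feature.type, feature.extract(record.seq))
```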
Input and output
Biopython can read and write to a number of common sequence formats, including FASTA, FASTQ, GenBank, Clustal, PHYLIP and NEXUS. When reading files, descriptive information in the file is used to populate the members of Biopython classes, such as SeqRecord. This allows records of one file format to be converted into others.
Very large sequence files can exceed a computer's memory resources, so Biopython provides various options for accessing records in large files. They can be loaded entirely into memory in Python data structures, such as lists or dictionaries, providing fast access at the cost of memory usage. Alternatively, the files can be read from disk as needed, with slower performance but lower memory requirements.
>>> # This script loads a file containing multiple sequences and saves each one in a different format.
>>> from Bio import SeqIO
>>> genomes = SeqIO.parse("salmonella.gb", "genbank")
>>> for genome in genomes:
...     SeqIO.write(genome, genome.id + ".fasta", "fasta")
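The trade-off between loading records into memory and reading them from disk on demand can be sketched in plain Python. This is an illustration of the idea only, not Biopython's code: the helper names `load_all`, `build_offset_index` and `fetch`, and the tiny FASTA file written to a temporary location, are assumptions made for the example.

```python
import os
import tempfile


def load_all(path):
    """Read every FASTA record into a dict: fast lookups, high memory use."""
    records, name = {}, None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                name = line[1:].split()[0]
                records[name] = ""
            elif name is not None:
                records[name] += line
    return records


def build_offset_index(path):
    """Keep only each record's byte offset: low memory use, slower lookups."""
    offsets = {}
    with open(path, "rb") as handle:
        while True:
            position = handle.tell()
            line = handle.readline()
            if not line:
                break
            if line.startswith(b">"):
                offsets[line[1:].split()[0].decode()] = position
    return offsets


def fetch(path, offsets, name):
    """Seek to one record on disk and read just that record."""
    parts = []
    with open(path, "rb") as handle:
        handle.seek(offsets[name])
        handle.readline()  # skip the header line
        for line in handle:
            if line.startswith(b">"):
                break
            parts.append(line.strip().decode())
    return "".join(parts)


# Small made-up FASTA file for demonstration.
with tempfile.NamedTemporaryFile("w", suffix=".fasta", delete=False) as tmp:
    tmp.write(">seq1 first\nACGT\nACGT\n>seq2 second\nGGCC\n")
    fasta_path = tmp.name

in_memory = load_all(fasta_path)
offsets = build_offset_index(fasta_path)
on_disk = fetch(fasta_path, offsets, "seq1")
os.remove(fasta_path)

print(in_memory["seq2"], on_disk)
```

The dictionary from `load_all` holds every sequence at once, whereas the offset index stores one integer per record and rereads the file for each lookup.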
Accessing online databases
Through the Bio.Entrez module, users of Biopython can download biological data from NCBI databases. Each of the functions provided by the Entrez search engine is available through functions in this module, including searching for and downloading records.
>>> # This script downloads genomes from the NCBI Nucleotide database and saves them in a FASTA file.
>>> from Bio import Entrez
>>> from Bio import SeqIO
>>> output_file = open("all_records.fasta", "w")
>>> Entrez.email = "user@example.org"  # placeholder: use your own address
>>> records_to_download = ["FO834906.1", "FO203501.1"]
>>> for record_id in records_to_download:
...     handle = Entrez.efetch(db="nucleotide", id=record_id, rettype="gb")
...     seqRecord = SeqIO.read(handle, format="gb")
...     handle.close()
...     output_file.write(seqRecord.format("fasta"))
...
>>> output_file.close()
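NCBI exposes Entrez programmatically through its E-utilities web service. As a rough illustration of the kind of request a fetch corresponds to, the sketch below assembles an efetch URL with the standard library and does not send it. The helper name `build_efetch_url` and the email address are placeholders invented for this example, and the exact parameters that Bio.Entrez sends are not shown here.

```python
from urllib.parse import urlencode

# Address of NCBI's E-utilities efetch endpoint.
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(db, record_id, rettype, email):
    """Hypothetical helper: assemble (but do not send) an efetch request."""
    params = {
        "db": db,            # database to query, e.g. "nucleotide"
        "id": record_id,     # accession or identifier of the record
        "rettype": rettype,  # requested record type, e.g. "gb" for GenBank
        "retmode": "text",   # plain-text response
        "email": email,      # contact address supplied with the request
    }
    return EFETCH_URL + "?" + urlencode(params)


url = build_efetch_url("nucleotide", "FO834906.1", "gb", "user@example.org")
print(url)
```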
Phylogeny

The Bio.Phylo module provides tools for working with and visualising phylogenetic trees. A variety of file formats are supported for reading and writing, including Newick, NEXUS and phyloXML. Common tree manipulations and traversals are supported via the Tree and Clade objects. Examples include converting and collating tree files, extracting subsets from a tree, changing a tree's root, and analysing branch features such as length or score.
Rooted trees can be drawn in ASCII or using matplotlib (see Figure 1), and the Graphviz library can be used to create unrooted layouts (see Figure 2).
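The kinds of traversal and branch analysis described above can be sketched with a small recursive data structure. The `Node` class below is a hypothetical plain-Python stand-in written for this illustration; it is not the Bio.Phylo Tree or Clade class, and its method names are assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical tree node: a name, a branch length and child nodes."""
    name: Optional[str] = None
    branch_length: float = 0.0
    children: List["Node"] = field(default_factory=list)

    def terminals(self):
        """Depth-first traversal that yields the leaf nodes."""
        if not self.children:
            yield self
        for child in self.children:
            yield from child.terminals()

    def total_branch_length(self):
        """Sum this node's branch length and all branch lengths below it."""
        return self.branch_length + sum(
            child.total_branch_length() for child in self.children
        )

    def as_text(self, depth=0):
        """Render the tree as indented lines, one node per line."""
        lines = ["  " * depth + (self.name or "*")]
        for child in self.children:
            lines.extend(child.as_text(depth + 1))
        return lines


# Same topology as the Newick string ((A:1,B:2):0.5,C:3);
tree = Node(children=[
    Node(branch_length=0.5, children=[Node("A", 1.0), Node("B", 2.0)]),
    Node("C", 3.0),
])

print([leaf.name for leaf in tree.terminals()])
print(tree.total_branch_length())
print("\n".join(tree.as_text()))
```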
Genome diagrams

The GenomeDiagram module provides methods of visualising sequences within Biopython. Sequences can be drawn in a linear or circular form (see Figure 3), and many output formats are supported, including PDF and PNG. Diagrams are created by making tracks and then adding sequence features to those tracks. By looping over a sequence's features and using their attributes to decide if and how they are added to the diagram's tracks, one can exercise much control over the appearance of the final diagram. Cross-links can be drawn between different tracks, allowing one to compare multiple sequences in a single diagram.
Macromolecular structure
The Bio.PDB module can load molecular structures from PDB and mmCIF files, and was added to Biopython in 2003.
The Structure object is central to this module, and it organises macromolecular structure in a hierarchical fashion: Structure objects contain Model objects which contain Chain objects which contain Residue objects which contain Atom objects. Disordered residues and atoms get their own classes, DisorderedResidue and DisorderedAtom, that describe their uncertain positions.
Using Bio.PDB, one can navigate through individual components of a macromolecular structure file, such as examining each atom in a protein. Common analyses can be carried out, such as measuring distances or angles, comparing residues and calculating residue depth.
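The containment hierarchy and a simple distance measurement can be illustrated with plain dataclasses. The classes below merely borrow the names used in the text and are hypothetical stand-ins, not the Bio.PDB classes; the coordinates are made up, and `math.dist` requires Python 3.8 or later.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Atom:
    name: str
    coord: tuple  # (x, y, z) coordinates, here in angstroms

    def distance_to(self, other):
        """Euclidean distance between two atoms."""
        return math.dist(self.coord, other.coord)


@dataclass
class Residue:
    name: str
    atoms: list = field(default_factory=list)


@dataclass
class Chain:
    id: str
    residues: list = field(default_factory=list)


@dataclass
class Model:
    chains: list = field(default_factory=list)


@dataclass
class Structure:
    id: str
    models: list = field(default_factory=list)

    def atoms(self):
        """Walk the hierarchy from models down to individual atoms."""
        for model in self.models:
            for chain in model.chains:
                for residue in chain.residues:
                    yield from residue.atoms


# Toy structure: one model, one chain, two residues with one atom each.
structure = Structure("toy", [Model([Chain("A", [
    Residue("GLY", [Atom("CA", (0.0, 0.0, 0.0))]),
    Residue("ALA", [Atom("CA", (3.0, 4.0, 0.0))]),
])])])

first, second = list(structure.atoms())
print(first.distance_to(second))
```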
Population genetics
The Bio.PopGen module adds support to Biopython for Genepop, a software package for statistical analysis of population genetics. This allows for analyses of Hardy–Weinberg equilibrium, linkage disequilibrium and other features of a population's allele frequencies.
This module can also carry out population genetic simulations using coalescent theory with the fastsimcoal2 program.
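As background for the Hardy–Weinberg analysis mentioned above: for a locus with two alleles at frequencies p and q = 1 − p, the expected genotype proportions under equilibrium are p², 2pq and q². The sketch below compares made-up observed genotype counts with those expectations using a simple chi-square statistic. It is a plain-Python illustration with a hypothetical function name, and it is not the procedure that Genepop or Bio.PopGen applies.

```python
def hardy_weinberg_chi_square(n_AA, n_Aa, n_aa):
    """Return allele frequency p, expected counts and a chi-square statistic.

    Assumes both alleles are present, so no expected count is zero.
    """
    n = n_AA + n_Aa + n_aa
    p = (2 * n_AA + n_Aa) / (2 * n)  # frequency of allele A
    q = 1 - p                        # frequency of allele a
    expected = (p * p * n, 2 * p * q * n, q * q * n)
    observed = (n_AA, n_Aa, n_aa)
    chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return p, expected, chi_square


# Made-up genotype counts for a single biallelic locus.
p, expected, chi_square = hardy_weinberg_chi_square(n_AA=36, n_Aa=48, n_aa=16)
print(p, expected, chi_square)
```

A statistic near zero means the observed counts are close to the equilibrium expectation; larger values indicate a departure from it.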
Wrappers for command line tools
Many of Biopython's modules contain command line wrappers for commonly used tools, allowing these tools to be used from within Biopython. These wrappers include BLAST, Clustal, PhyML, EMBOSS and SAMtools. Users can subclass a generic wrapper class to add support for any other command line tool.
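The pattern of subclassing a generic wrapper can be sketched with the standard library alone. The `CommandLineWrapper` base class below is a hypothetical design written for this illustration and is not Biopython's wrapper class; the demo subclass wraps the Python interpreter itself so that the example runs without any bioinformatics tool installed.

```python
import subprocess
import sys


class CommandLineWrapper:
    """Hypothetical generic wrapper: subclasses name a program and its options."""
    program = None  # executable to run
    options = {}    # maps a Python keyword to a command line flag

    def __init__(self, **settings):
        unknown = set(settings) - set(self.options)
        if unknown:
            raise ValueError(f"unsupported options: {sorted(unknown)}")
        self.settings = settings

    def command(self):
        """Build the argument list without running anything."""
        args = [self.program]
        for key, value in self.settings.items():
            args += [self.options[key], str(value)]
        return args

    def run(self):
        """Run the tool and return its standard output as text."""
        result = subprocess.run(
            self.command(), capture_output=True, text=True, check=True
        )
        return result.stdout


class PythonOneLiner(CommandLineWrapper):
    """Demo subclass that wraps the Python interpreter's -c option."""
    program = sys.executable
    options = {"code": "-c"}


tool = PythonOneLiner(code="print('hello from a wrapped tool')")
print(tool.command())
print(tool.run())
```

Declaring the supported options on the subclass lets the base class reject misspelled arguments before any external program is started.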
See also
* Open Bioinformatics Foundation
* BioPerl
* BioRuby
* BioJS
* BioJava
External links
* Official website: https://biopython.org/
* Biopython Tutorial and Cookbook (PDF)
* Biopython source code on GitHub