I-sites are short sequence-structure motifs that are mined from the

Protein Data Bank The Protein Data Bank (PDB) is a database for the three-dimensional structural data of large biological molecules such as proteins and nucleic acids, which is overseen by the Worldwide Protein Data Bank (wwPDB). This structural data is obtained a ...

(PDB) that correlate strongly with three-dimensional structural elements. These sequence-structure motifs are used for the local structure prediction of proteins. Local structure can be expressed as fragments or as backbone angles. Locations in the protein sequence that have high confidence I-sites predictions may be the initiation sites of folding. I-sites have also been identified as discrete models for folding pathways. I-sites consist of about 250 motifs. Each motif has an amino acid profile, a fragment structure (represented by a "paradigm" fragment chosen from a protein in the PDB) and optionally, a 4-dimensional tensor of pairwise sequence covariance.

Construction of I-site Library

The sequence and structure database The database initially consisted of 471 protein sequence families from the HSSP database, with an average of 47 aligned sequences per family. Each family contained a single known structure (parent) from the Brookhaven protein Data Bank. These were a subset of the PDBSelect-25 list, having no more than 25% sequence identity between any two alignments. Disordered loops were omitted. Gaps and insertions in the sequence were ignored. Clustering of sequence segments Each position in the database is described by a weighted amino acid frequency. A

similarity measure In statistics and related fields, a similarity measure or similarity function or similarity metric is a real-valued function that quantifies the similarity between two objects. Although no single definition of a similarity exists, usually such mea ...

in sequence space between a segment (p) and a cluster of segments (q) is defined as:

D_ = \sum_ log \left \dfrac \right log \left \dfrac \right /math> 

where Pij(p) is the frequency of amino acid i in position j within the segment p. Nq is the number of sequence segments k in the cluster q. Fi is the frequency of amino acid type i in the database overall. The optimal values of a and a0 were determined empirically to be 0.5 and 15, respectively. Using this similarity measure, segments of a given length (3 to 15) were clustered via the k-means algorithm .

Assessing structure within a cluster; choice of paradigm

The structural similarity between any two peptide segments was evaluated using a combination of the RMS distance matrix error (dme): dme = \sqrt where ai->j is the distance between a-carbon atoms i and j in the segment s1 of length L, and the maximum deviation in backbone torsion angles (mda) over the length of the segment is given by: mda(L) = max_ (\Delta \Phi_ , \Delta \Psi_i) The paradigm structure for a cluster was chosen from the top-scoring 20 segments in the database as that with the smallest sum of mda values to the other 19. Other structural measures were tried before settling on these two: RMS deviation of a-carbon atoms (rmsd), dme alone, and a structural filter that looked for specific conserved contacts. The latter worked best in discriminating true and false positives, but could not be easily automated. The rmsd and dme were found to be poor discriminators of the two types of helix cap. The mda-dme combined filter best simulates the conserved contacts filter and is rapidly computed.

References

External links

I-sites Library
{{Webarchive, url=https://web.archive.org/web/20120328040522/http://www.bioinfo.rpi.edu/bystrc/Isites2/ , date=2012-03-28
I-sites/HMMSTR Prediction Server
Protein structural motifs