The pseudo K-tuple nucleotide composition, or PseKNC, is a method for converting a nucleotide sequence (DNA or RNA) into a numerical vector so that it can be used in pattern recognition techniques. Generally, the K-tuple can refer to a dinucleotide (when K=2) or a trinucleotide (when K=3); in these cases the technique is also called PseDNC or PseTNC, respectively. The method was derived from an analogous method in proteomics known as PseAAC (Pseudo Amino Acid Composition), which is applied to protein sequences.

Background

PseAAC

PseKNC was derived from an analogous method in proteomics known as PseAAC (Pseudo Amino Acid Composition). Previously, predictions of protein properties relied either on a sequential model, which works with the full amino acid sequence, or on a discrete model, which in its simplest case is just the amino acid composition of the protein: a vector of twenty elements, each representing the frequency of one amino acid in the protein. The discrete model, however, fails to account for sequence-order information. The PseAAC model extends the 20-element vector of the discrete model with λ additional components, each of which captures some sequence-order information, and this extended vector becomes the basis for making predictions.

Analogous problem in genomics

Analogously, a discrete model of a nucleotide sequence based on its dinucleotide composition would involve a vector of 16 elements, each representing the frequency of one dinucleotide in the sequence:

\mathbf{D} = \left[ f(AA)\; f(AC)\; f(AG)\; f(AT)\; \cdots\; f(TT) \right]^{\mathbf{T}}

Where D is the DNA sequence, T is the transpose operator, and f(AA) is the normalized occurrence frequency of AA in the DNA sequence. A trinucleotide representation can be denoted as:

\mathbf{D} = \left[ f(AAA)\; f(AAC)\; f(AAG)\; f(AAT)\; \cdots\; f(TTT) \right]^{\mathbf{T}}
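The discrete model described above can be illustrated with a minimal Python sketch. The function name `knc_vector` and its parameters are hypothetical, chosen for illustration only; the sketch assumes a DNA alphabet in the order A, C, G, T and simply skips any window containing other symbols.

```python
from itertools import product


def knc_vector(seq, k=2, alphabet="ACGT"):
    """Return the normalized K-tuple frequency vector of seq (len(alphabet)**k entries)."""
    seq = seq.upper()
    # Fixed ordering of all K-tuples: AA, AC, AG, AT, CA, ..., TT for k=2.
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    # Slide a window of width k along the sequence, one position at a time.
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in counts:  # assumption: windows with other symbols (e.g. N) are skipped
            counts[kmer] += 1
            total += 1
    if total == 0:
        return [0.0] * len(kmers)
    # Normalize counts so that the entries sum to 1.
    return [counts[m] / total for m in kmers]


if __name__ == "__main__":
    vec = knc_vector("AACG", k=2)
    print(len(vec), vec[0])
```

Because only counts are kept, any two sequences with the same K-tuple counts map to the same vector, which is the loss of sequence-order information that the text refers to.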

As can be seen, these discrete models fail to consider any global or long-range sequence-order information. To address this for both DNA and RNA sequences, the pseudo K-tuple nucleotide composition, or PseKNC, was proposed.

PseKNC

PseKNC extends the discrete model by adding λ components that represent the sequence-order and physico-chemical properties of the nucleotide sequence. The original KNC model has 4^K components; in the dinucleotide case, where K = 2, this gives 4² = 16 components. The extension by PseKNC results in (4^K + λ) components.
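The text does not spell out how the λ extra components are computed, so the following Python sketch is only one plausible reading, assuming the commonly described parallel-correlation form for K = 2: the j-th extra component averages the squared difference in physicochemical property values between dinucleotides that lie j positions apart, and a weight `w` balances it against the 16 frequencies. The function name `pse_dnc`, the parameters `lam` and `w`, and the `TOY_PROPS` table are all hypothetical; the property values are toy numbers derived from the letters themselves, not measured physicochemical data.

```python
from itertools import product

DINUCS = ["".join(p) for p in product("ACGT", repeat=2)]

# Hypothetical toy properties (NOT measured values):
# (GC fraction, purine fraction) of each dinucleotide.
TOY_PROPS = {
    d: (sum(b in "GC" for b in d) / 2, sum(b in "AG" for b in d) / 2)
    for d in DINUCS
}


def pse_dnc(seq, lam=2, w=0.5, props=TOY_PROPS):
    """Return a vector of 16 + lam components (a sketch, assuming K = 2)."""
    seq = seq.upper()
    if any(b not in "ACGT" for b in seq):
        raise ValueError("sequence must contain only A, C, G, T")
    # Overlapping dinucleotides: a sequence of length L yields L - 1 of them.
    dinucs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    n = len(dinucs)
    if lam >= n:
        raise ValueError("lam must be smaller than the number of dinucleotides")
    # The 16 normalized dinucleotide frequencies (the discrete model).
    freqs = [dinucs.count(d) / n for d in DINUCS]
    # Correlation factor for each lag j = 1..lam.
    thetas = []
    for j in range(1, lam + 1):
        diffs = [
            sum((a - b) ** 2 for a, b in zip(props[x], props[y])) / len(props[x])
            for x, y in zip(dinucs, dinucs[j:])
        ]
        thetas.append(sum(diffs) / len(diffs))
    # Joint normalization so that all 16 + lam components sum to 1.
    denom = sum(freqs) + w * sum(thetas)
    return [f / denom for f in freqs] + [w * t / denom for t in thetas]


if __name__ == "__main__":
    vec = pse_dnc("ACGTACGTAC", lam=3, w=0.5)
    print(len(vec))  # 16 frequency components plus 3 correlation components
```

In this sketch a larger `lam` captures correlations over longer ranges, but it must stay below the number of dinucleotides in the shortest sequence being encoded.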

Applications

The PseKNC method has been used in a wide range of applications. For example, it has become an integral component of many algorithms designed to predict the locations of recombination hotspots and coldspots from sequence information.

Web servers

For the convenience of the scientific community, a freely available web server called PseKNC and an open-source package called PseKNC-General were developed in 2013 and 2014, respectively. Both can convert large-scale sequence datasets into pseudo nucleotide compositions with numerous choices of physicochemical property combinations. PseKNC-General can generate several modes of pseudo nucleotide composition, including the conventional k-tuple nucleotide composition, Moreau–Broto autocorrelation coefficients, Moran autocorrelation coefficients, Geary autocorrelation coefficients, Type I PseKNC and Type II PseKNC. Another web server, Pse-in-One, lets users choose among the existing PseAAC and PseKNC methods for protein, RNA, and DNA sequences, together with any of the physicochemical property combinations available for those methods.

References

Chou, Kuo-Chen (2001). "Prediction of protein cellular attributes using pseudo-amino acid composition". Proteins: Structure, Function, and Genetics. 43 (3): 246–55. doi:10.1002/prot.1035. PMID 11288174.

Chen, Wei; Lei, Tian-Yu; Jin, Dian-Chuan; Lin, Hao; Chou, Kuo-Chen (2014). "PseKNC: A flexible web server for generating pseudo K-tuple nucleotide composition". Analytical Biochemistry. 456: 53–60. doi:10.1016/j.ab.2014.04.001. PMID 24732113.

Chen, Wei; Zhang, Xitong; Brooker, Jordan; Lin, Hao; Zhang, Liqing; Chou, Kuo-Chen (2015). "PseKNC-General: A cross-platform package for generating various modes of pseudo nucleotide compositions". Bioinformatics. 31 (1): 119–20. doi:10.1093/bioinformatics/btu602. PMID 25231908.

Chen, Wei; Lin, Hao; Chou, Kuo-Chen (2015). "Pseudo nucleotide composition or PseKNC: An effective formulation for analyzing genomic sequences". Molecular BioSystems. 11 (10): 2620–34. doi:10.1039/c5mb00155b. PMID 26099739.