machine learning Machine learning (ML) is a field of inquiry devoted to understanding and building methods that 'learn', that is, methods that leverage data to improve performance on some set of tasks. It is seen as a part of artificial intelligence. Machine ...

and data mining, a string kernel is a

kernel function In operator theory, a branch of mathematics, a positive-definite kernel is a generalization of a positive-definite function or a positive-definite matrix. It was first introduced by James Mercer in the early 20th century, in the context of solving ...

that operates on strings, i.e. finite sequences of symbols that need not be of the same length. String kernels can be intuitively understood as functions measuring the similarity of pairs of strings: the more similar two strings ''a'' and ''b'' are, the higher the value of a string kernel ''K''(''a'', ''b'') will be. Using string kernels with kernelized learning algorithms such as

support vector machine In machine learning, support vector machines (SVMs, also support vector networks) are supervised learning models with associated learning algorithms that analyze data for classification and regression analysis. Developed at AT&T Bell Laboratories ...

s allow such algorithms to work with strings, without having to translate these to fixed-length, real-valued

feature vector In machine learning and pattern recognition, a feature is an individual measurable property or characteristic of a phenomenon. Choosing informative, discriminating and independent features is a crucial element of effective algorithms in pattern ...

s. String kernels are used in domains where sequence data are to be clustered or

classified Classified may refer to: General *Classified information, material that a government body deems to be sensitive *Classified advertising or "classifieds" Music *Classified (rapper) (born 1977), Canadian rapper *The Classified, a 1980s American roc ...

, e.g. in

text mining Text mining, also referred to as ''text data mining'', similar to text analytics, is the process of deriving high-quality information from text. It involves "the discovery by computer of new, previously unknown information, by automatically extract ...

and

gene analysis In biology, the word gene (from , ; "...Wilhelm Johannsen coined the word gene to describe the Mendelian units of heredity..." meaning ''generation'' or ''birth'' or ''gender'') can have several different meanings. The Mendelian gene is a ba ...

Informal introduction

Suppose one wants to compare some text passages automatically and indicate their relative similarity. For many applications, it might be sufficient to find some keywords which match exactly. One example where exact matching is not always enough is found in

spam detection Various anti-spam techniques are used to prevent email spam (unsolicited bulk email). No technique is a complete solution to the spam problem, and each has trade-offs between incorrectly rejecting legitimate email (false positives) as opposed to ...

. Another would be in computational gene analysis, where

homologous Homology may refer to: Sciences Biology *Homology (biology), any characteristic of biological organisms that is derived from a common ancestor *Sequence homology, biological homology between DNA, RNA, or protein sequences * Homologous chrom ...

genes In biology, the word gene (from , ; "... Wilhelm Johannsen coined the word gene to describe the Mendelian units of heredity..." meaning ''generation'' or ''birth'' or ''gender'') can have several different meanings. The Mendelian gene is a b ...

have

mutated In biology, a mutation is an alteration in the nucleic acid sequence of the genome of an organism, virus, or extrachromosomal DNA. Viral genomes contain either DNA or RNA. Mutations result from errors during DNA or viral replication, mitos ...

, resulting in common subsequences along with deleted, inserted or replaced symbols.

Motivation

Since several well-proven data clustering, classification and information retrieval methods (for example support vector machines) are designed to work on vectors (i.e. data are elements of a vector space), using a string kernel allows the extension of these methods to handle sequence data. The string kernel method is to be contrasted with earlier approaches for text classification where feature vectors only indicated the presence or absence of a word. Not only does it improve on these approaches, but it is an example for a whole class of kernels adapted to data structures, which began to appear at the turn of the 21st century. A survey of such methods has been compiled by Gärtner. In bioinformatics string kernels are used especially to transform biological sequences such as proteins or DNA into vectors for further use in machine learning models. An example of a string kernel used for that purpose is the profile kernel.

Definition

kernel Kernel may refer to: Computing * Kernel (operating system), the central component of most operating systems * Kernel (image processing), a matrix used for image convolution * Compute kernel, in GPGPU programming * Kernel method, in machine lea ...

on a domain

D

is a function

K: D \times D \rightarrow \mathbb

satisfying some conditions (being

symmetric Symmetry (from grc, συμμετρία "agreement in dimensions, due proportion, arrangement") in everyday language refers to a sense of harmonious and beautiful proportion and balance. In mathematics, "symmetry" has a more precise definit ...

in the arguments,

continuous Continuity or continuous may refer to: Mathematics * Continuity (mathematics), the opposing concept to discreteness; common examples include ** Continuous probability distribution or random variable in probability and statistics ** Continuous g ...

and positive semidefinite in a certain sense).

Mercer's theorem In mathematics, specifically functional analysis, Mercer's theorem is a representation of a symmetric positive-definite function on a square as a sum of a convergent sequence of product functions. This theorem, presented in , is one of the most n ...

asserts that

K

can then be expressed as

K(x,y)=\varphi(x)\cdot \varphi(y)

with

\varphi

mapping the arguments into an

inner product space In mathematics, an inner product space (or, rarely, a Hausdorff pre-Hilbert space) is a real vector space or a complex vector space with an operation called an inner product. The inner product of two vectors in the space is a scalar, often ...

. We can now reproduce the definition of a string subsequence kernel on strings over an

alphabet An alphabet is a standardized set of basic written graphemes (called letters) that represent the phonemes of certain spoken languages. Not all writing systems represent language in this way; in a syllabary, each character represents a s ...

\Sigma

. Coordinate-wise, the mapping is defined as follows: :

\varphi_u :
\left\{
\begin{array}{l}
\Sigma^n \rightarrow \mathbb{R}^{\Sigma^n} \\
 s \mapsto \sum_{\mathbf{i} : u=s_{\mathbf{i} \lambda^{l(\mathbf{i})}
\end{array}
\right.

The

\mathbf{i}

are

multiindices Multi-index notation is a mathematical notation that simplifies formulas used in multivariable calculus, partial differential equations and the theory of distributions, by generalising the concept of an integer index to an ordered tuple of indices. ...

and

u

is a string of length

n

: subsequences can occur in a non-contiguous manner, but gaps are penalized. The multiindex

\mathbf{i}

gives the positions of the characters matching

u

s

l(\mathbf{i})

is the difference between the first and last entry in

\mathbf{i}

, that is: how far apart in

s

the subsequence matching

u

is. The parameter

\lambda

may be set to any value between

0

(gaps are not allowed, as only

0^0

is not

0

but

1

) and

1

(even widely-spread "occurrences" are weighted the same as appearances as a contiguous substring, as

1^{l(\mathbf{i})}=1

). For several relevant algorithms, data enters into the algorithm only in expressions involving an inner product of feature vectors, hence the name

kernel methods In machine learning, kernel machines are a class of algorithms for pattern analysis, whose best known member is the support-vector machine (SVM). The general task of pattern analysis is to find and study general types of relations (for example c ...

. A desirable consequence of this is that one does not need to explicitly calculate the transformation

\phi(x)

, only the inner product via the kernel, which may be a lot quicker, especially when approximated.

References

{{Reflist Algorithms on strings Kernel methods for machine learning Natural language processing String metrics