String operations
   HOME

TheInfoList



OR:

In
computer science Computer science is the study of computation, automation, and information. Computer science spans theoretical disciplines (such as algorithms, theory of computation, information theory, and automation) to Applied science, practical discipli ...
, in the area of formal language theory, frequent use is made of a variety of
string functions String functions are used in computer programming languages to manipulate a string or query information about a string (some do both). Most programming languages that have a string datatype will have some string functions although there may be ...
; however, the notation used is different from that used for
computer programming Computer programming is the process of performing a particular computation (or more generally, accomplishing a specific computing result), usually by designing and building an executable computer program. Programming involves tasks such as anal ...
, and some commonly used functions in the theoretical realm are rarely used when programming. This article defines some of these basic terms.


Strings and languages

A string is a finite sequence of characters. The
empty string In formal language theory, the empty string, or empty word, is the unique string of length zero. Formal theory Formally, a string is a finite, ordered sequence of characters such as letters, digits or spaces. The empty string is the special c ...
is denoted by \varepsilon. The concatenation of two string s and t is denoted by s \cdot t, or shorter by s t. Concatenating with the empty string makes no difference: s \cdot \varepsilon = s = \varepsilon \cdot s. Concatenation of strings is associative: s \cdot (t \cdot u) = (s \cdot t) \cdot u. For example, (\langle b \rangle \cdot \langle l \rangle) \cdot (\varepsilon \cdot \langle ah \rangle) = \langle bl \rangle \cdot \langle ah \rangle = \langle blah \rangle. A
language Language is a structured system of communication. The structure of a language is its grammar and the free components are its vocabulary. Languages are the primary means by which humans communicate, and may be conveyed through a variety of ...
is a finite or infinite set of strings. Besides the usual set operations like union, intersection etc., concatenation can be applied to languages: if both S and T are languages, their concatenation S \cdot T is defined as the set of concatenations of any string from S and any string from T, formally S \cdot T = \. Again, the concatenation dot \cdot is often omitted for brevity. The language \ consisting of just the empty string is to be distinguished from the empty language \. Concatenating any language with the former doesn't make any change: S \cdot \ = S = \ \cdot S, while concatenating with the latter always yields the empty language: S \cdot \ = \ = \ \cdot S. Concatenation of languages is associative: S \cdot (T \cdot U) = (S \cdot T) \cdot U. For example, abbreviating D = \, the set of all three-digit decimal numbers is obtained as D \cdot D \cdot D. The set of all decimal numbers of arbitrary length is an example for an infinite language.


Alphabet of a string

The alphabet of a string is the set of all of the characters that occur in a particular string. If ''s'' is a string, its
alphabet An alphabet is a standardized set of basic written graphemes (called letters) that represent the phonemes of certain spoken languages. Not all writing systems represent language in this way; in a syllabary, each character represents a syllab ...
is denoted by :\operatorname(s) The alphabet of a language S is the set of all characters that occur in any string of S, formally: \operatorname(S) = \bigcup_ \operatorname(s). For example, the set \ is the alphabet of the string \langle cacao \rangle, and the above D is the alphabet of the above language D \cdot D \cdot D as well as of the language of all decimal numbers.


String substitution

Let ''L'' be a
language Language is a structured system of communication. The structure of a language is its grammar and the free components are its vocabulary. Languages are the primary means by which humans communicate, and may be conveyed through a variety of ...
, and let Σ be its alphabet. A string substitution or simply a substitution is a mapping ''f'' that maps characters in Σ to languages (possibly in a different alphabet). Thus, for example, given a character ''a'' ∈ Σ, one has ''f''(''a'')=''L''''a'' where ''L''''a'' ⊆ Δ * is some language whose alphabet is Δ. This mapping may be extended to strings as :''f''(ε)=ε for the
empty string In formal language theory, the empty string, or empty word, is the unique string of length zero. Formal theory Formally, a string is a finite, ordered sequence of characters such as letters, digits or spaces. The empty string is the special c ...
ε, and :''f''(''sa'')=''f''(''s'')''f''(''a'') for string ''s'' ∈ ''L'' and character ''a'' ∈ Σ. String substitutions may be extended to entire languages as :f(L)=\bigcup_ f(s) Regular languages are closed under string substitution. That is, if each character in the alphabet of a regular language is substituted by another regular language, the result is still a regular language. Similarly, context-free languages are closed under string substitution.Although every regular language is also context-free, the previous theorem is not implied by the current one, since the former yields a shaper result for regular languages. A simple example is the conversion ''f''uc(.) to uppercase, which may be defined e.g. as follows: For the extension of ''f''uc to strings, we have e.g. * ''f''uc(‹Straße›) = ⋅ ⋅ ⋅ ⋅ ⋅ = , * ''f''uc(‹u2›) = ⋅ = , and * ''f''uc(‹Go!›) = ⋅ ⋅ = . For the extension of ''f''uc to languages, we have e.g. * ''f''uc() = ∪ ∪ = .


String homomorphism

A string homomorphism (often referred to simply as a
homomorphism In algebra, a homomorphism is a morphism, structure-preserving map (mathematics), map between two algebraic structures of the same type (such as two group (mathematics), groups, two ring (mathematics), rings, or two vector spaces). The word ''homo ...
in formal language theory) is a string substitution such that each character is replaced by a single string. That is, f(a)=s, where s is a string, for each character a.Strictly formally, a homomorphism yields a language consisting of just one string, i.e. f(a) = . String homomorphisms are monoid morphisms on the free monoid, preserving the empty string and the
binary operation In mathematics, a binary operation or dyadic operation is a rule for combining two elements (called operands) to produce another element. More formally, a binary operation is an operation of arity two. More specifically, an internal binary op ...
of string concatenation. Given a language L, the set f(L) is called the homomorphic image of L. The inverse homomorphic image of a string s is defined as f^(s) = \ while the inverse homomorphic image of a language L is defined as f^(L) = \ In general, f(f^(L)) \neq L, while one does have f(f^(L)) \subseteq L and L \subseteq f^(f(L)) for any language L. The class of regular languages is closed under homomorphisms and inverse homomorphisms. Similarly, the context-free languages are closed under homomorphismsThis follows from the above-mentioned closure under arbitrary substitutions. and inverse homomorphisms. A string homomorphism is said to be ε-free (or e-free) if f(a) \neq \varepsilon for all ''a'' in the alphabet \Sigma. Simple single-letter substitution ciphers are examples of (ε-free) string homomorphisms. An example string homomorphism ''g''uc can also be obtained by defining similar to the above substitution: ''g''uc(‹a›) = ‹A›, ..., ''g''uc(‹0›) = ε, but letting ''g''uc be undefined on punctuation chars. Examples for inverse homomorphic images are * ''g''uc−1() = , since ''g''uc(‹sss›) = ''g''uc(‹sß›) = ''g''uc(‹ßs›) = ‹SSS›, and *''g''uc−1() = , since ''g''uc(‹a›) = ‹A›, while ‹bb› cannot be reached by ''g''uc. For the latter language, ''g''uc(''g''uc−1()) = ''g''uc() = ≠ . The homomorphism ''g''uc is not ε-free, since it maps e.g. ‹0› to ε. A very simple string homomorphism example that maps each character to just a character is the conversion of an EBCDIC-encoded string to
ASCII ASCII ( ), abbreviated from American Standard Code for Information Interchange, is a character encoding standard for electronic communication. ASCII codes represent text in computers, telecommunications equipment, and other devices. Because ...
.


String projection

If ''s'' is a string, and \Sigma is an alphabet, the string projection of ''s'' is the string that results by removing all characters that are not in \Sigma. It is written as \pi_\Sigma(s)\,. It is formally defined by removal of characters from the right hand side: :\pi_\Sigma(s) = \begin \varepsilon & \mbox s=\varepsilon \mbox \\ \pi_\Sigma(t) & \mbox s=ta \mbox a \notin \Sigma \\ \pi_\Sigma(t)a & \mbox s=ta \mbox a \in \Sigma \end Here \varepsilon denotes the
empty string In formal language theory, the empty string, or empty word, is the unique string of length zero. Formal theory Formally, a string is a finite, ordered sequence of characters such as letters, digits or spaces. The empty string is the special c ...
. The projection of a string is essentially the same as a
projection in relational algebra In relational algebra, a projection is a unary operation written as \Pi_( R ), where R is a relation and a_1,...,a_n are attribute names. Its result is defined as the set obtained when the components of the tuples in R are restricted to the set \ ...
. String projection may be promoted to the projection of a language. Given a
formal language In logic, mathematics, computer science, and linguistics, a formal language consists of words whose letters are taken from an alphabet and are well-formed according to a specific set of rules. The alphabet of a formal language consists of sym ...
''L'', its projection is given by :\pi_\Sigma (L)=\


Right quotient

The right quotient of a character ''a'' from a string ''s'' is the truncation of the character ''a'' in the string ''s'', from the right hand side. It is denoted as s/a. If the string does not have ''a'' on the right hand side, the result is the empty string. Thus: :(sa)/ b = \begin s & \mbox a=b \\ \varepsilon & \mbox a \ne b \end The quotient of the empty string may be taken: :\varepsilon / a = \varepsilon Similarly, given a subset S\subset M of a monoid M, one may define the quotient subset as :S/a=\ Left quotients may be defined similarly, with operations taking place on the left of a string. Hopcroft and Ullman (1979) define the quotient ''L''1/''L''2 of the languages ''L''1 and ''L''2 over the same alphabet as ''L''1/''L''2 = . This is not a generalization of the above definition, since, for a string ''s'' and distinct characters ''a'', ''b'', Hopcroft's and Ullman's definition implies / yielding , rather than . The left quotient (when defined similar to Hopcroft and Ullman 1979) of a singleton language ''L''1 and an arbitrary language ''L''2 is known as
Brzozowski derivative In theoretical computer science, in particular in formal language theory, the Brzozowski derivative u^S of a set S of strings and a string u is the set of all strings obtainable from a string in S by cutting off the prefix u, as illustrated in ...
; if ''L''2 is represented by a regular expression, so can be the left quotient.


Syntactic relation

The right quotient of a subset S\subset M of a monoid M defines an
equivalence relation In mathematics, an equivalence relation is a binary relation that is reflexive, symmetric and transitive. The equipollence relation between line segments in geometry is a common example of an equivalence relation. Each equivalence relatio ...
, called the right
syntactic relation In mathematics and computer science, the syntactic monoid M(L) of a formal language L is the smallest monoid that recognizes the language L. Syntactic quotient The free monoid on a given set is the monoid whose elements are all the strings of ...
of ''S''. It is given by :\sim_S \;\,=\, \ The relation is clearly of finite index (has a finite number of equivalence classes) if and only if the family right quotients is finite; that is, if :\ is finite. In the case that ''M'' is the monoid of words over some alphabet, ''S'' is then a regular language, that is, a language that can be recognized by a finite state automaton. This is discussed in greater detail in the article on syntactic monoids.


Right cancellation

The right cancellation of a character ''a'' from a string ''s'' is the removal of the first occurrence of the character ''a'' in the string ''s'', starting from the right hand side. It is denoted as s\div a and is recursively defined as :(sa)\div b = \begin s & \mbox a=b \\ (s\div b)a & \mbox a \ne b \end The empty string is always cancellable: :\varepsilon \div a = \varepsilon Clearly, right cancellation and projection
commute Commute, commutation or commutative may refer to: * Commuting, the process of travelling between a place of residence and a place of work Mathematics * Commutative property, a property of a mathematical operation whose result is insensitive to th ...
: :\pi_\Sigma(s)\div a = \pi_\Sigma(s \div a )


Prefixes

The prefixes of a string is the set of all
prefixes A prefix is an affix which is placed before the stem of a word. Adding it to the beginning of one word changes it into another word. For example, when the prefix ''un-'' is added to the word ''happy'', it creates the word ''unhappy''. Particula ...
to a string, with respect to a given language: :\operatorname_L(s) = \ where s\in L. The prefix closure of a language is :\operatorname (L) = \bigcup_ \operatorname_L(s) = \left\ Example:
L=\left\\mbox \operatorname(L)=\left\ A language is called prefix closed if \operatorname (L) = L. The prefix closure operator is idempotent: :\operatorname (\operatorname (L)) =\operatorname (L) The prefix relation is a binary relation \sqsubseteq such that s\sqsubseteq t if and only if s \in \operatorname_L(t). This relation is a particular example of a prefix order.


See also

*
Comparison of programming languages (string functions) String functions are used in computer programming languages to manipulate a string or query information about a string (some do both). Most programming languages that have a string datatype will have some string functions although there may be ...
* Levi's lemma *
String (computer science) In computer programming, a string is traditionally a sequence of characters, either as a literal constant or as some kind of variable. The latter may allow its elements to be mutated and the length changed, or it may be fixed (after creation) ...
— definition and implementation of more basic operations on strings


Notes


References

* ''(See chapter 3.)'' {{reflist Formal languages Relational algebra Operations