theoretical linguistics Theoretical linguistics is a term in linguistics that, like the related term general linguistics, can be understood in different ways. Both can be taken as a reference to the theory of language, or the branch of linguistics that inquires into the ...

and

computational linguistics Computational linguistics is an interdisciplinary field concerned with the computational modelling of natural language, as well as the study of appropriate computational approaches to linguistic questions. In general, computational linguistics ...

, probabilistic context free grammars (PCFGs) extend

context-free grammar In formal language theory, a context-free grammar (CFG) is a formal grammar whose production rules can be applied to a nonterminal symbol regardless of its context. In particular, in a context-free grammar, each production rule is of the fo ...

s, similar to how

hidden Markov model A hidden Markov model (HMM) is a Markov model in which the observations are dependent on a latent (or ''hidden'') Markov process (referred to as X). An HMM requires that there be an observable process Y whose outcomes depend on the outcomes of X ...

s extend regular grammars. Each

production Production may refer to: Economics and business * Production (economics) * Production, the act of manufacturing goods * Production, in the outline of industrial organization, the act of making products (goods and services) * Production as a stat ...

is assigned a probability. The probability of a derivation (parse) is the product of the probabilities of the productions used in that derivation. These probabilities can be viewed as parameters of the model, and for large problems it is convenient to learn these parameters via

machine learning Machine learning (ML) is a field of study in artificial intelligence concerned with the development and study of Computational statistics, statistical algorithms that can learn from data and generalise to unseen data, and thus perform Task ( ...

. A probabilistic grammar's validity is constrained by context of its training dataset. PCFGs originated from grammar theory, and have application in areas as diverse as

natural language processing Natural language processing (NLP) is a subfield of computer science and especially artificial intelligence. It is primarily concerned with providing computers with the ability to process data encoded in natural language and is thus closely related ...

to the study the structure of

RNA Ribonucleic acid (RNA) is a polymeric molecule that is essential for most biological functions, either by performing the function itself (non-coding RNA) or by forming a template for the production of proteins (messenger RNA). RNA and deoxyrib ...

molecules and design of

programming language A programming language is a system of notation for writing computer programs. Programming languages are described in terms of their Syntax (programming languages), syntax (form) and semantics (computer science), semantics (meaning), usually def ...

s. Designing efficient PCFGs has to weigh factors of scalability and generality. Issues such as grammar ambiguity must be resolved. The grammar design affects results accuracy. Grammar parsing algorithms have various time and memory requirements.

Definitions

Derivation: The process of recursive generation of strings from a grammar.

Parsing Parsing, syntax analysis, or syntactic analysis is a process of analyzing a String (computer science), string of Symbol (formal), symbols, either in natural language, computer languages or data structures, conforming to the rules of a formal gramm ...

: Finding a valid derivation using an automaton. Parse Tree: The alignment of the grammar to a sequence. An example of a parser for PCFG grammars is the

pushdown automaton In the theory of computation, a branch of theoretical computer science, a pushdown automaton (PDA) is a type of automaton that employs a stack. Pushdown automata are used in theories about what can be computed by machines. They are more capab ...

. The algorithm parses grammar nonterminals from left to right in a stack-like manner. This brute-force approach is not very efficient. In RNA secondary structure prediction variants of the Cocke–Younger–Kasami (CYK) algorithm provide more efficient alternatives to grammar parsing than pushdown automata. Another example of a PCFG parser is the Stanford Statistical Parser which has been trained using

Treebank In linguistics, a treebank is a parsed text corpus that annotates syntactic or semantic sentence structure. The construction of parsed corpora in the early 1990s revolutionized computational linguistics, which benefitted from large-scale empi ...

Formal definition

Similar to a CFG, a probabilistic context-free grammar can be defined by a quintuple: :

G = (M, T, R, S, P)

where * is the set of non-terminal symbols * is the set of terminal symbols * is the set of production rules * is the start symbol * is the set of probabilities on production rules

Relation with hidden Markov models

PCFGs models extend

s the same way as

s extend regular grammars. The Inside-Outside algorithm is an analogue of the Forward-Backward algorithm. It computes the total probability of all derivations that are consistent with a given sequence, based on some PCFG. This is equivalent to the probability of the PCFG generating the sequence, and is intuitively a measure of how consistent the sequence is with the given grammar. The Inside-Outside algorithm is used in model parametrization to estimate prior frequencies observed from training sequences in the case of RNAs. Dynamic programming variants of the

CYK algorithm In computer science, the Cocke–Younger–Kasami algorithm (alternatively called CYK, or CKY) is a parsing algorithm for context-free grammars published by Itiroo Sakai in 1961. The algorithm is named after some of its rediscoverers: John Coc ...

find the Viterbi parse of a RNA sequence for a PCFG model. This parse is the most likely derivation of the sequence by the given PCFG.

Grammar construction

Context-free grammars are represented as a set of rules inspired from attempts to model natural languages. The rules are absolute and have a typical syntax representation known as

Backus–Naur form In computer science, Backus–Naur form (BNF, pronounced ), also known as Backus normal form, is a notation system for defining the Syntax (programming languages), syntax of Programming language, programming languages and other Formal language, for ...

. The production rules consist of terminal

\left \

and non-terminal symbols and a blank

\epsilon

may also be used as an end point. In the production rules of CFG and PCFG the left side has only one nonterminal whereas the right side can be any string of terminal or nonterminals. In PCFG nulls are excluded. An example of a grammar: :

S \to aS, S \to bS, S \to \epsilon

This grammar can be shortened using the ', ' ('or') character into: :

S\to aS ,  bS , \epsilon

Terminals in a grammar are words and through the grammar rules a non-terminal symbol is transformed into a string of either terminals and/or non-terminals. The above grammar is read as "beginning from a non-terminal the emission can generate either or or

\epsilon

". Its derivation is: :

S \Rightarrow aS\Rightarrow abS\Rightarrow abbS \Rightarrow abb

Ambiguous grammar In computer science, an ambiguous grammar is a context-free grammar for which there exists a string (computer science), string that can have more than one leftmost derivation or parse tree. Every non-empty context-free language admits an ambiguous ...

may result in ambiguous parsing if applied on

homograph A homograph (from the , and , ) is a word that shares the same written form as another word but has a different meaning. However, some dictionaries insist that the words must also be pronounced differently, while the Oxford English Dictionar ...

s since the same word sequence can have more than one interpretation. Pun sentences such as the newspaper headline "Iraqi Head Seeks Arms" are an example of ambiguous parses. One strategy of dealing with ambiguous parses (originating with grammarians as early as

Pāṇini (; , ) was a Sanskrit grammarian, logician, philologist, and revered scholar in ancient India during the mid-1st millennium BCE, dated variously by most scholars between the 6th–5th and 4th century BCE. The historical facts of his life ar ...

) is to add yet more rules, or prioritize them so that one rule takes precedence over others. This, however, has the drawback of proliferating the rules, often to the point where they become difficult to manage. Another difficulty is overgeneration, where unlicensed structures are also generated. Probabilistic grammars circumvent these problems by ranking various productions on frequency weights, resulting in a "most likely" (winner-take-all) interpretation. As usage patterns are altered in

diachronic Synchrony and diachrony are two complementary viewpoints in linguistic analysis. A ''synchronic'' approach - from ,("together") + ,("time") - considers a language at a moment in time without taking its history into account. In contrast, a ''diac ...

shifts, these probabilistic rules can be re-learned, thus updating the grammar. Assigning probability to production rules makes a PCFG. These probabilities are informed by observing distributions on a training set of similar composition to the language to be modeled. On most samples of broad language, probabilistic grammars where probabilities are estimated from data typically outperform hand-crafted grammars. CFGs when contrasted with PCFGs are not applicable to RNA structure prediction because while they incorporate sequence-structure relationship they lack the scoring metrics that reveal a sequence structural potential

Weighted context-free grammar

A weighted context-free grammar (WCFG) is a more general category of

, where each production has a numeric weight associated with it. The weight of a specific

parse tree A parse tree or parsing tree (also known as a derivation tree or concrete syntax tree) is an ordered, rooted tree that represents the syntactic structure of a string according to some context-free grammar. The term ''parse tree'' itself is use ...

in a WCFG is the product (or sum ) of all rule weights in the tree. Each rule weight is included as often as the rule is used in the tree. A special case of WCFGs are PCFGs, where the weights are (

logarithm In mathematics, the logarithm of a number is the exponent by which another fixed value, the base, must be raised to produce that number. For example, the logarithm of to base is , because is to the rd power: . More generally, if , the ...

s of )

probabilities Probability is a branch of mathematics and statistics concerning Event (probability theory), events and numerical descriptions of how likely they are to occur. The probability of an event is a number between 0 and 1; the larger the probab ...

. An extended version of the

can be used to find the "lightest" (least-weight) derivation of a string given some WCFG. When the tree weight is the product of the rule weights, WCFGs and PCFGs can express the same set of

probability distribution In probability theory and statistics, a probability distribution is a Function (mathematics), function that gives the probabilities of occurrence of possible events for an Experiment (probability theory), experiment. It is a mathematical descri ...

Applications

RNA structure prediction

Since the 1990s, PCFG has been applied to model

RNA structure Nucleic acid structure refers to the structure of nucleic acids such as DNA and RNA. Chemically speaking, DNA and RNA are very similar. Nucleic acid structure is often divided into four different levels: primary, secondary, tertiary, and quaterna ...

s. Energy minimization and PCFG provide ways of predicting RNA secondary structure with comparable performance. However structure prediction by PCFGs is scored probabilistically rather than by minimum free energy calculation. PCFG model parameters are directly derived from frequencies of different features observed in databases of RNA structures rather than by experimental determination as is the case with energy minimization methods. The types of various structure that can be modeled by a PCFG include long range interactions, pairwise structure and other nested structures. However, pseudoknots can not be modeled. PCFGs extend CFG by assigning probabilities to each production rule. A maximum probability parse tree from the grammar implies a maximum probability structure. Since RNAs preserve their structures over their primary sequence, RNA structure prediction can be guided by combining evolutionary information from comparative sequence analysis with biophysical knowledge about a structure plausibility based on such probabilities. Also search results for structural homologs using PCFG rules are scored according to PCFG derivations probabilities. Therefore, building grammar to model the behavior of base-pairs and single-stranded regions starts with exploring features of structural

multiple sequence alignment Multiple sequence alignment (MSA) is the process or the result of sequence alignment of three or more biological sequences, generally protein, DNA, or RNA. These alignments are used to infer evolutionary relationships via phylogenetic analysis an ...

of related RNAs. :

S \to aSa ,  bSb ,  aa ,  bb

The above grammar generates a string in an outside-in fashion, that is the basepair on the furthest extremes of the terminal is derived first. So a string such as

aabaabaa

is derived by first generating the distal 's on both sides before moving inwards: :

S \Rightarrow aSa \Rightarrow aaSaa \Rightarrow aabSbaa \Rightarrow aabaabaa

A PCFG model extendibility allows constraining structure prediction by incorporating expectations about different features of an RNA . Such expectation may reflect for example the propensity for assuming a certain structure by an RNA. However incorporation of too much information may increase PCFG space and memory complexity and it is desirable that a PCFG-based model be as simple as possible. Every possible string a grammar generates is assigned a probability weight

P(x, \theta)

given the PCFG model

\theta

. It follows that the sum of all probabilities to all possible grammar productions is

\sum_P(x, \theta)=1

. The scores for each paired and unpaired residue explain likelihood for secondary structure formations. Production rules also allow scoring loop lengths as well as the order of base pair stacking hence it is possible to explore the range of all possible generations including suboptimal structures from the grammar and accept or reject structures based on score thresholds.

Implementations

RNA secondary structure implementations based on PCFG approaches can be utilized in : * Finding consensus structure by optimizing structure joint probabilities over MSA. * Modeling base-pair covariation to detecting homology in database searches. * pairwise simultaneous folding and alignment. Different implementation of these approaches exist. For example, Pfold is used in secondary structure prediction from a group of related RNA sequences, covariance models are used in searching databases for homologous sequences and RNA annotation and classification, RNApromo, CMFinder and TEISER are used in finding stable structural motifs in RNAs.

Design considerations

PCFG design impacts the secondary structure prediction accuracy. Any useful structure prediction probabilistic model based on PCFG has to maintain simplicity without much compromise to prediction accuracy. Too complex a model of excellent performance on a single sequence may not scale. A grammar based model should be able to: * Find the optimal alignment between a sequence and the PCFG. * Score the probability of the structures for the sequence and subsequences. * Parameterize the model by training on sequences/structures. * Find the optimal grammar parse tree (CYK algorithm). * Check for ambiguous grammar (Conditional Inside algorithm). The resulting of multiple

s per grammar denotes grammar ambiguity. This may be useful in revealing all possible base-pair structures for a grammar. However an optimal structure is the one where there is one and only one correspondence between the parse tree and the secondary structure. Two types of ambiguities can be distinguished. Parse tree ambiguity and structural ambiguity. Structural ambiguity does not affect thermodynamic approaches as the optimal structure selection is always on the basis of lowest free energy scores. Parse tree ambiguity concerns the existence of multiple parse trees per sequence. Such an ambiguity can reveal all possible base-paired structures for the sequence by generating all possible parse trees then finding the optimal one. In the case of structural ambiguity multiple parse trees describe the same secondary structure. This obscures the CYK algorithm decision on finding an optimal structure as the correspondence between the parse tree and the structure is not unique. Grammar ambiguity can be checked for by the conditional-inside algorithm.

Building a PCFG model

A probabilistic context free grammar consists of terminal and nonterminal variables. Each feature to be modeled has a production rule that is assigned a probability estimated from a training set of RNA structures. Production rules are recursively applied until only terminal residues are left. A starting non-terminal

\mathbf

produces loops. The rest of the grammar proceeds with parameter

\mathbf

that decide whether a loop is a start of a stem or a single stranded region and parameter

\mathbf

that produces paired bases. The formalism of this simple PCFG looks like: :

\mathit

\mathit

\mathit

The application of PCFGs in predicting structures is a multi-step process. In addition, the PCFG itself can be incorporated into probabilistic models that consider RNA evolutionary history or search homologous sequences in databases. In an evolutionary history context inclusion of prior distributions of RNA structures of a

structural alignment Structural alignment attempts to establish homology between two or more polymer structures based on their shape and three-dimensional conformation. This process is usually applied to protein tertiary structures but can also be used for large R ...

in the production rules of the PCFG facilitates good prediction accuracy. A summary of general steps for utilizing PCFGs in various scenarios: * Generate production rules for the sequences. * Check ambiguity. * Recursively generate parse trees of the possible structures using the grammar. * Rank and score the parse trees for the most plausible sequence.

Algorithms

Several algorithms dealing with aspects of PCFG based probabilistic models in RNA structure prediction exist. For instance the inside-outside algorithm and the CYK algorithm. The inside-outside algorithm is a recursive dynamic programming scoring algorithm that can follow expectation-maximization paradigms. It computes the total probability of all derivations that are consistent with a given sequence, based on some PCFG. The inside part scores the subtrees from a parse tree and therefore subsequences probabilities given an PCFG. The outside part scores the probability of the complete parse tree for a full sequence. CYK modifies the inside-outside scoring. Note that the term 'CYK algorithm' describes the CYK variant of the inside algorithm that finds an optimal parse tree for a sequence using a PCFG. It extends the actual

used in non-probabilistic CFGs. The inside algorithm calculates

\alpha(i,j,v)

probabilities for all

i, j, v

of a parse subtree rooted at

W_v

for subsequence

x_i,...,x_j

. Outside algorithm calculates

\beta(i,j,v)

probabilities of a complete parse tree for sequence from root excluding the calculation of

x_i,...,x_j

. The variables and refine the estimation of probability parameters of an PCFG. It is possible to reestimate the PCFG algorithm by finding the expected number of times a state is used in a derivation through summing all the products of and divided by the probability for a sequence given the model

P(x, \theta)

. It is also possible to find the expected number of times a production rule is used by an expectation-maximization that utilizes the values of and . The CYK algorithm calculates

\gamma(i, j, v)

to find the most probable parse tree

\hat

and yields

\log P(x, \hat, \theta)

. Memory and time complexity for general PCFG algorithms in RNA structure predictions are

O(L^2M)

and

O(L^3M^3)

respectively. Restricting a PCFG may alter this requirement as is the case with database searches methods.

PCFG in homology search

Covariance models (CMs) are a special type of PCFGs with applications in database searches for homologs, annotation and RNA classification. Through CMs it is possible to build PCFG-based RNA profiles where related RNAs can be represented by a consensus secondary structure. The RNA analysis package Infernal uses such profiles in inference of RNA alignments. The Rfam database also uses CMs in classifying RNAs into families based on their structure and sequence information. CMs are designed from a consensus RNA structure. A CM allows

indel Indel (insertion-deletion) is a molecular biology term for an insertion or deletion of bases in the genome of an organism. Indels ≥ 50 bases in length are classified as structural variants. In coding regions of the genome, unless the lengt ...

s of unlimited length in the alignment. Terminals constitute states in the CM and the transition probabilities between the states is 1 if no indels are considered. Grammars in a CM are as follows: :;

P \to aWb

:probabilities of pairwise interactions between 16 possible pairs :;

L \to aW

:probabilities of generating 4 possible single bases on the left :;

R \to Wa

:probabilities of generating 4 possible single bases on the right :;

B \to SS

:bifurcation with a probability of 1 :;

S \to W

:start with a probability of 1 :;

E \to \epsilon

:end with a probability of 1 The model has 6 possible states and each state grammar includes different types of secondary structure probabilities of the non-terminals. The states are connected by transitions. Ideally current node states connect to all insert states and subsequent node states connect to non-insert states. In order to allow insertion of more than one base insert states connect to themselves. In order to score a CM model the inside-outside algorithms are used. CMs use a slightly different implementation of CYK. Log-odds emission scores for the optimum parse tree -

\log \hat

- are calculated out of the emitting states

P,~L,~R

. Since these scores are a function of sequence length a more discriminative measure to recover an optimum parse tree probability score-

\log\text(x, \hat, \theta)

- is reached by limiting the maximum length of the sequence to be aligned and calculating the log-odds relative to a null. The computation time of this step is linear to the database size and the algorithm has a memory complexity of

O(M_aD+M_bD^2)

Example: Using evolutionary information to guide structure prediction

The KH-99 algorithm by Knudsen and Hein lays the basis of the Pfold approach to predicting RNA secondary structure. In this approach the parameterization requires evolutionary history information derived from an alignment tree in addition to probabilities of columns and mutations. The grammar probabilities are observed from a training dataset.

= Estimate column probabilities for paired and unpaired bases

= In a structural alignment the probabilities of the unpaired bases columns and the paired bases columns are independent of other columns. By counting bases in single base positions and paired positions one obtains the frequencies of bases in loops and stems. For basepair and an occurrence of

XY

is also counted as an occurrence of

YX

. Identical basepairs such as

XX

are counted twice.

= Calculate mutation rates for paired and unpaired bases

= By pairing sequences in all possible ways overall mutation rates are estimated. In order to recover plausible mutations a sequence identity threshold should be used so that the comparison is between similar sequences. This approach uses 85% identity threshold between pairing sequences. First single base positions differences -except for gapped columns- between sequence pairs are counted such that if the same position in two sequences had different bases the count of the difference is incremented for each sequence. For unpaired bases a 4 X 4 mutation rate matrix is used that satisfies that the mutation flow from X to Y is reversible: :

PX^rXY = PY^rYX

For basepairs a 16 X 16 rate distribution matrix is similarly generated. The PCFG is used to predict the prior probability distribution of the structure whereas posterior probabilities are estimated by the inside-outside algorithm and the most likely structure is found by the CYK algorithm.

= Estimate alignment probabilities

= After calculating the column prior probabilities the alignment probability is estimated by summing over all possible secondary structures. Any column in a secondary structure

\sigma

for a sequence of length such that

D=(C_1,~C_2, ...C_l )

can be scored with respect to the alignment tree and the mutational model . The prior distribution given by the PCFG is

P(\sigma, M)

. The phylogenetic tree, can be calculated from the model by maximum likelihood estimation. Note that gaps are treated as unknown bases and the summation can be done through dynamic programming. :

P(D, T,M)

=\sum P (D, \sigma , T,M)

=\sum P(D, \sigma, T, M) P(\sigma, T,M)

=\sum P(D, \sigma,T,M) P(\sigma, M)

= Assign production probabilities to each rule in the grammar

= Each structure in the grammar is assigned production probabilities devised from the structures of the training dataset. These prior probabilities give weight to predictions accuracy. The number of times each rule is used depends on the observations from the training dataset for that particular grammar feature. These probabilities are written in parentheses in the grammar formalism and each rule will have a total of 100%. For instance: :

S \to LS (80\%) , L (20\%)

L \to s (70\%) ,  dFd (30\%)

F \to dFd (60.4\%),  LS (39.6\%)

= Predict the structure likelihood

= Given the prior alignment frequencies of the data the most likely structure from the ensemble predicted by the grammar can then be computed by maximizing

P(\sigma, D,T,M)

through the CYK algorithm. The structure with the highest predicted number of correct predictions is reported as the consensus structure. :

\sigma_= \arg\underset\max P(D,  \sigma,T^ML, M) P(\sigma, M)

= Pfold improvements on the KH-99 algorithm

= PCFG based approaches are desired to be scalable and general enough. Compromising speed for accuracy needs to as minimal as possible. Pfold addresses the limitations of the KH-99 algorithm with respect to scalability, gaps, speed and accuracy. *In Pfold gaps are treated as unknown. In this sense the probability of a gapped column equals that of an ungapped one. *In Pfold the tree is calculated prior to structure prediction through neighbor joining and not by maximum likelihood through the PCFG grammar. Only the branch lengths are adjusted to maximum likelihood estimates. *An assumption of Pfold is that all sequences have the same structure. Sequence identity threshold and allowing a 1% probability that any nucleotide becomes another limit the performance deterioration due to alignment errors.

Protein sequence analysis

Whereas PCFGs have proved powerful tools for predicting RNA secondary structure, usage in the field of protein sequence analysis has been limited. Indeed, the size of the

amino acid Amino acids are organic compounds that contain both amino and carboxylic acid functional groups. Although over 500 amino acids exist in nature, by far the most important are the 22 α-amino acids incorporated into proteins. Only these 22 a ...

alphabet and the variety of interactions seen in proteins make grammar inference much more challenging. As a consequence, most applications of

formal language theory In logic, mathematics, computer science, and linguistics, a formal language is a set of string (computer science), strings whose symbols are taken from a set called "#Definition, alphabet". The alphabet of a formal language consists of symbol ...

to protein analysis have been mainly restricted to the production of grammars of lower expressive power to model simple functional patterns based on local interactions. Since protein structures commonly display higher-order dependencies including nested and crossing relationships, they clearly exceed the capabilities of any CFG. Still, development of PCFGs allows expressing some of those dependencies and providing the ability to model a wider range of protein patterns.

References

{{reflist, 30em, refs={{cite journal , author=Chomsky, Noam, title= Three models for the description of language. , journal=IRE Transactions on Information Theory, volume=2, issue= 3 , pages=113–124, year=1956, doi= 10.1109/TIT.1956.1056813, s2cid= 19519474 {{cite journal , author=Chomsky, Noam , title=On certain formal properties of grammars , journal=Information and Control , date=June 1959 , volume=2 , issue=2 , pages=137–167 , doi=10.1016/S0019-9958(59)90362-6, doi-access=free {{cite book , editor=Noam Chomsky, title=Syntactic Structures., publisher= Mouton & Co. Publishers, Den Haag, Netherlands, year= 1957 {{cite journal , author1=Eddy S. R. , author2=Durbin R. , name-list-style=amp , title=RNA sequence analysis using covariance models, journal=Nucleic Acids Research, volume=22, pages=2079–2088, year=1994, doi=10.1093/nar/22.11.2079, pmid=8029015, pmc=308124, issue=11 {{cite journal , author=Sakakibara Y. , author2=Brown M. , author3=Hughey R. , author4=Mian I. S. , display-authors=etal , title=Stochastic context-free grammars for tRNA modelling , journal=Nucleic Acids Research , volume=22 , issue= 23 , pages=5112–5120 , year=1994 , doi=10.1093/nar/22.23.5112, pmid=7800507 , pmc=523785 {{cite journal, author=Grat, L., title=Automatic RNA secondary structure determination with stochastic context-free grammars, journal=In Rawlings, C., Clark, D., Altman, R., Hunter, L., Lengauer, T and Wodak, S. Proceedings of the Third International Conference on Intelligent Systems for Molecular Biology, AAAI Press, pages=136–144, url=http://www.aaai.org/Papers/ISMB/1995/ISMB95-017.pdf, year=1995, access-date=2017-08-03, archive-date=2015-12-04, archive-url=https://web.archive.org/web/20151204013359/http://www.aaai.org/Papers/ISMB/1995/ISMB95-017.pdf, url-status=dead {{cite book , author=Lefebvre, F , contribution=An optimized parsing algorithm well suited to RNA folding , editor=Rawlings, C. , editor2=Clark, D. , editor3=Altman, R. , editor4=Hunter, L. , editor5=Lengauer, T. , editor6=Wodak, S. , title=Proceedings of the Third International Conference on Intelligent Systems for Molecular Biology , publisher=AAAI Press , pages=222–230 , date=1995, url=https://www.aaai.org/Papers/ISMB/1995/ISMB95-027.pdf {{cite book , last=Lefebvre , first=F. , contribution=A grammar-based unification of several alignment and folding algorithms , editor=States, D. J. , editor2=Agarwal, P. , editor3=Gaasterlan, T. , editor4=Hunter, L. , editor5=Smith R. F. , title=Proceedings of the Fourth International Conference on Intelligent Systems for Molecular Biology , publisher=AAAI Press , pages=143–153 , date=1996, url=https://www.aaai.org/Papers/ISMB/1996/ISMB96-016.pdf {{cite book , author1=R. Durbin , author2=S. Eddy , author3=A. Krogh , author4=G. Mitchinson , title=Biological sequence analysis: probabilistic models of proteins and nucleic acids , url=https://books.google.com/books?id=R5P2GlJvigQC , publisher=Cambridge University Press , year=1998 , isbn=978-0-521-62971-3 {{cite journal , author=Dowell R. , author2=Eddy S., name-list-style=amp , title=Evaluation of several lightweight stochastic context-free grammars for RNA secondary structure prediction, journal=BMC Bioinformatics , volume=5, issue=71, pages=71 , doi=10.1186/1471-2105-5-71, pmid=15180907 , pmc=442121 , year=2004 , doi-access=free {{cite journal , author=McCaskill J. S., title= The Equilibrium Partition Function and Base Pair Binding Probabilities for RNA Secondary Structure , journal=Biopolymers, volume=29, pages=1105–19, year=1990, doi=10.1002/bip.360290621, pmid=1695107, issue=6–7, hdl= 11858/00-001M-0000-0013-0DE3-9 , s2cid= 12629688 , hdl-access=free {{cite journal , author=Juan V. , author2=Wilson C. , title= RNA Secondary Structure Prediction Based on Free Energy and Phylogenetic Analysis, journal=J. Mol. Biol., volume= 289, issue= 4, pages=935–947, year=1999, doi=10.1006/jmbi.1999.2801, pmid= 10369773 {{cite journal , author=Zuker M, title= Calculating Nucleic Acid Secondary Structure, journal=Curr. Opin. Struct. Biol., volume=10, issue= 3, pages=303–310, year=2000, doi=10.1016/S0959-440X(00)00088-9, pmid= 10851192 {{cite journal , author1=Mathews D. H. , author2=Sabina J. , author3=Zuker M. , author4=Turner D. H. , title= Expanded sequence dependence of thermodynamic parameters improves prediction of RNA secondary structure, journal=J. Mol. Biol., volume=288, pages=911–940, year=1999, doi=10.1006/jmbi.1999.2700, pmid=10329189, issue=5, s2cid=19989405 , doi-access=free {{cite journal , author1=B. Knudsen , author2=J. Hein. , name-list-style=amp , title= Pfold: RNA secondary structure prediction using stochastic context-free grammars , journal=Nucleic Acids Research , volume=31, issue=13, pages=3423–3428, year=2003, doi=10.1093/nar/gkg614 , pmid=12824339 , pmc=169020 {{cite journal , author=Knudsen B. , author2=Hein J. , title= RNA Secondary Structure Prediction Using Stochastic Context-Free Grammars and Evolutionary History , journal=Bioinformatics, volume=15, pages=446–454, year=1999, doi=10.1093/bioinformatics/15.6.446, pmid=10383470, issue=6, doi-access=free {{cite journal , author=Rivas E. , author2=Eddy S. R. , title= Noncoding RNA Gene Detection Using Comparative Sequence Analysis , journal=BMC Bioinformatics, volume=2, year=2001, doi=10.1186/1471-2105-2-8, pmid=11801179, pmc=64605, issue=1, pages=8 , doi-access=free {{cite book, author=Holmes I., author2=Rubin G. M., title=Pairwise RNA Structure Comparison with Stochastic Context-Free Grammars, journal=In. Pac. Symp. Biocomput., page
163–174
year=2002, doi=10.1142/9789812799623_0016, pmid=11928472, isbn=978-981-02-4777-5, url-access=registration, url=https://archive.org/details/pacificsymposium00paci/page/163 {{cite journal , author1=P. P. Gardner , author2=J. Daub , author3=J. Tate , author4=B. L. Moore , author5=I. H. Osuch , author6=S. Griffiths-Jones , author7=R. D. Finn , author8=E. P. Nawrocki , author9=D. L. Kolbe , author10=S. R. Eddy , author11=A. Bateman. , title=Rfam: Wikipedia, clans and the "decimal" release , journal=Nucleic Acids Research, year=2011, volume=39, doi=10.1093/nar/gkq1129, pmid=21062808, pmc=3013711, issue=Suppl 1, pages=D141–D145 {{cite journal , author=Yao Z. , author2=Weinberg Z. , author3=Ruzzo W. L. , title=CMfinder-a covariance model based RNA motif finding algorithm , journal=Bioinformatics, volume=22 , pages=445–452, year=2006, doi=10.1093/bioinformatics/btk008, pmid=16357030, issue=4, doi-access=free {{cite journal , author=Rabani M. , author2=Kertesz M. , author3=Segal E. , title= Computational prediction of RNA structural motifs involved in post-transcriptional regulatory processes , journal= Proc. Natl. Acad. Sci. USA, volume=105, issue= 39 , pages=14885–14890, year=2008, doi=10.1073/pnas.0803169105 , pmid= 18815376 , pmc=2567462, bibcode=2008PNAS..10514885R , doi-access=free {{cite journal , author=Goodarzi H. , author2=Najafabadi H. S. , author3=Oikonomou P. , author4=Greco T. M. , author5=Fish L. , author6=Salavati R. , author7=Cristea I. M. , author8=Tavazoie S. , title= Systematic discovery of structural elements governing stability of mammalian messenger RNAs , journal=Nature , volume=485 , issue= 7397 , pages=264–268 , year=2012 , doi=10.1038/nature11013 , pmid=22495308 , pmc=3350620, bibcode=2012Natur.485..264G {{cite book , author=Sipser M., title= Introduction to Theory of Computation, publisher=Brooks Cole Pub Co., year=1996 {{cite book , author=Michael A. Harrison , title= Introduction to Formal Language Theory , publisher=Addison-Wesley, year=1978, author-link= Michael A. Harrison {{cite book , author1=Hopcroft J. E. , author2=Ullman J. D. , title= Introduction to Automata Theory, Languages, and Computation , publisher=Addison-Wesley, year=1979 {{cite book , author=Giegerich R., title=Combinatorial Pattern Matching , chapter=Explaining and Controlling Ambiguity in Dynamic Programming , publisher= In Proceedings of the 11th Annual Symposium on Combinatorial Pattern Matching 1848 Edited by: Giancarlo R., Sankoff D. Montréal, Canada: Springer-Verlag, Berlin, series=Lecture Notes in Computer Science, pages=46–59, year=2000, volume=1848, doi=10.1007/3-540-45123-4_6, isbn=978-3-540-67633-1, s2cid=17088251 {{cite journal , author1=Lari K. , author2=Young S. J. , title= The estimation of stochastic context-free grammars using the inside-outside algorithm , journal=Computer Speech and Language, volume=4, pages=35–56, year=1990, doi=10.1016/0885-2308(90)90022-X {{cite journal , author1=Lari K. , author2=Young S. J. , title= Applications of stochastic context-free grammars using the inside-outside algorithm, journal=Computer Speech and Language, volume=5, issue=3 , pages=237–257, year=1991, doi=10.1016/0885-2308(91)90009-F {{cite journal, author=Nawrocki E. P., Eddy S. R., title=Infernal 1.1:100-fold faster RNA homology searches, journal=Bioinformatics, volume=29, pages=2933–2935, year=2013, doi=10.1093/bioinformatics/btt509, pmid=24008419, pmc=3810854, issue=22 {{cite journal , author=Tavaré S., title= Some probabilistic and statistical problems in the analysis of DNA sequences. , journal=Lectures on Mathematics in the Life Sciences. American Mathematical Society, volume=17, pages=57–86, year=1986 {{cite journal , author=Muse S. V. , title=Evolutionary analyses of DNA sequences subject to constraints of secondary structure., journal=Genetics, volume=139 , issue=3 , pages=1429–1439, year=1995, doi=10.1093/genetics/139.3.1429, pmc=1206468 , pmid=7768450 {{cite journal , doi=10.1121/1.2017061, title=Trainable grammars for speech recognition, journal=The Journal of the Acoustical Society of America, volume=65, issue=S1, pages=S132, year=1979, last1=Baker, first1=J. K., bibcode=1979ASAJ...65Q.132B, doi-access=free {{cite journal , author1=Schöniger M. , author2=von Haeseler A. , title=A stochastic model for the evolution of autocorrelated DNA sequences , volume=3 , issue=3 , pages=240–7 , year=1994, journal = Mol. Phylogenet. Evol., doi=10.1006/mpev.1994.1026, pmid=7529616 , bibcode=1994MolPE...3..240S

External links

Rfam DatabaseInfernalThe Stanford Parser: A statistical parserpyStatParser
Bioinformatics Formal languages Language modeling Natural language parsing Statistical natural language processing Probabilistic models