frequency analysis
   HOME

TheInfoList



OR:

In
cryptanalysis Cryptanalysis (from the Greek ''kryptós'', "hidden", and ''analýein'', "to analyze") refers to the process of analyzing information systems in order to understand hidden aspects of the systems. Cryptanalysis is used to breach cryptographic se ...
, frequency analysis (also known as counting letters) is the study of the frequency of letters or groups of letters in a
ciphertext In cryptography, ciphertext or cyphertext is the result of encryption performed on plaintext using an algorithm, called a cipher. Ciphertext is also known as encrypted or encoded information because it contains a form of the original plaintext ...
. The method is used as an aid to breaking
classical cipher In cryptography, a classical cipher is a type of cipher that was used historically but for the most part, has fallen into disuse. In contrast to modern cryptographic algorithms, most classical ciphers can be practically computed and solved by hand ...
s. Frequency analysis is based on the fact that, in any given stretch of written language, certain letters and combinations of letters occur with varying frequencies. Moreover, there is a characteristic distribution of letters that is roughly the same for almost all samples of that language. For instance, given a section of
English language English is a West Germanic language that developed in early medieval England and has since become a English as a lingua franca, global lingua franca. The namesake of the language is the Angles (tribe), Angles, one of the Germanic peoples th ...
, , , and are the most common, while , , and are rare. Likewise, , , , and are the most common pairs of letters (termed ''
bigram A bigram or digram is a sequence of two adjacent elements from a string of tokens, which are typically letters, syllables, or words. A bigram is an ''n''-gram for ''n''=2. The frequency distribution of every bigram in a string is commonly used f ...
s'' or ''digraphs''), and , , , and are the most common repeats. The nonsense phrase " ETAOIN SHRDLU" represents the 12 most frequent letters in typical English language text. In some ciphers, such properties of the natural language plaintext are preserved in the ciphertext, and these patterns have the potential to be exploited in a
ciphertext-only attack In cryptography, a ciphertext-only attack (COA) or known ciphertext attack is an attack model for cryptanalysis where the attacker is assumed to have access only to a set of ciphertexts. While the attacker has no channel providing access to the p ...
.


Frequency analysis for simple substitution ciphers

In a simple
substitution cipher In cryptography, a substitution cipher is a method of encrypting in which units of plaintext are replaced with the ciphertext, in a defined manner, with the help of a key; the "units" may be single letters (the most common), pairs of letters, t ...
, each letter of the
plaintext In cryptography, plaintext usually means unencrypted information pending input into cryptographic algorithms, usually encryption algorithms. This usually refers to data that is transmitted or stored unencrypted. Overview With the advent of comp ...
is replaced with another, and any particular letter in the plaintext will always be transformed into the same letter in the ciphertext. For instance, if all occurrences of the letter turn into the letter , a ciphertext message containing numerous instances of the letter would suggest to a cryptanalyst that represents . The basic use of frequency analysis is to first count the frequency of ciphertext letters and then associate guessed plaintext letters with them. More s in the ciphertext than anything else suggests that corresponds to in the plaintext, but this is not certain; and are also very common in English, so might be either of them. It is unlikely to be a plaintext or , which are less common. Thus the cryptanalyst may need to try several combinations of mappings between ciphertext and plaintext letters. More complex use of statistics can be conceived, such as considering counts of pairs of letters (''bigrams''), triplets (''trigrams''), and so on. This is done to provide more information to the cryptanalyst, for instance, and nearly always occur together in that order in English, even though itself is rare.


An example

Suppose
Eve Eve is a figure in the Book of Genesis in the Hebrew Bible. According to the origin story, "Creation myths are symbolic stories describing how the universe and its inhabitants came to be. Creation myths develop through oral traditions and there ...
has intercepted the
cryptogram A cryptogram is a type of puzzle that consists of a short piece of encrypted text. Generally the cipher used to encrypt the text is simple enough that the cryptogram can be solved by hand. Substitution ciphers where each letter is replaced by ...
below, and it is known to be encrypted using a simple substitution cipher: For this example, uppercase letters are used to denote ciphertext, lowercase letters are used to denote plaintext (or guesses at such), and ~ is used to express a guess that ciphertext letter represents the plaintext letter . Eve could use frequency analysis to help solve the message along the following lines: counts of the letters in the cryptogram show that is the most common single letter, most common
bigram A bigram or digram is a sequence of two adjacent elements from a string of tokens, which are typically letters, syllables, or words. A bigram is an ''n''-gram for ''n''=2. The frequency distribution of every bigram in a string is commonly used f ...
, and is the most common trigram. is the most common letter in the English language, is the most common bigram, and is the most common trigram. This strongly suggests that ~, ~ and ~. The second most common letter in the cryptogram is ; since the first and second most frequent letters in the English language, and are accounted for, Eve guesses that ~, the third most frequent letter. Tentatively making these assumptions, the following partial decrypted message is obtained. Using these initial guesses, Eve can spot patterns that confirm her choices, such as "". Moreover, other patterns suggest further guesses. "" might be "", which would mean ~. Similarly "" could be guessed as "", yielding ~ and ~. Furthermore, "" might be "", giving ~. Filling in these guesses, Eve gets: In turn, these guesses suggest still others (for example, "" could be "", implying ~) and so on, and it is relatively straightforward to deduce the rest of the letters, eventually yielding the plaintext. At this point, it would be a good idea for Eve to insert spaces and punctuation: Hereupon Legrand arose, with a grave and stately air, and brought me the beetle from a glass case in which it was enclosed. It was a beautiful scarabaeus, and, at that time, unknown to naturalists—of course a great prize in a scientific point of view. There were two round black spots near one extremity of the back, and a long one near the other. The scales were exceedingly hard and glossy, with all the appearance of burnished gold. The weight of the insect was very remarkable, and, taking all things into consideration, I could hardly blame Jupiter for his opinion respecting it. In this example from " The Gold-Bug", Eve's guesses were all correct. This would not always be the case, however; the variation in statistics for individual plaintexts can mean that initial guesses are incorrect. It may be necessary to backtrack incorrect guesses or to analyze the available statistics in much more depth than the somewhat simplified justifications given in the above example. It is possible that the plaintext does not exhibit the expected distribution of letter frequencies. Shorter messages are likely to show more variation. It is also possible to construct artificially skewed texts. For example, entire novels have been written that omit the letter altogether — a form of literature known as a
lipogram A lipogram (from , ''leipográmmatos'', "leaving out a letter" is a kind of constrained writing or word game consisting of writing paragraphs or longer works in which a particular letter or group of letters is avoided.McArthur, Tom (1992). ''The ...
.


History and usage

The first known recorded explanation of frequency analysis (indeed, of any kind of cryptanalysis) was given in the 9th century by
Al-Kindi Abū Yūsuf Yaʻqūb ibn ʼIsḥāq aṣ-Ṣabbāḥ al-Kindī (; ; ; ) was an Arab Muslim polymath active as a philosopher, mathematician, physician, and music theorist Music theory is the study of theoretical frameworks for understandin ...
, an
Arab Arabs (,  , ; , , ) are an ethnic group mainly inhabiting the Arab world in West Asia and North Africa. A significant Arab diaspora is present in various parts of the world. Arabs have been in the Fertile Crescent for thousands of years ...
polymath A polymath or polyhistor is an individual whose knowledge spans many different subjects, known to draw on complex bodies of knowledge to solve specific problems. Polymaths often prefer a specific context in which to explain their knowledge, ...
, in ''A Manuscript on Deciphering Cryptographic Messages''. It has been suggested that a close textual study of the
Qur'an The Quran, also romanized Qur'an or Koran, is the central religious text of Islam, believed by Muslims to be a revelation directly from God ('' Allāh''). It is organized in 114 chapters (, ) which consist of individual verses ('). Besides ...
first brought to light that
Arabic Arabic (, , or , ) is a Central Semitic languages, Central Semitic language of the Afroasiatic languages, Afroasiatic language family spoken primarily in the Arab world. The International Organization for Standardization (ISO) assigns lang ...
has a characteristic letter frequency. Its use spread, and similar systems were widely used in European states by the time of the
Renaissance The Renaissance ( , ) is a Periodization, period of history and a European cultural movement covering the 15th and 16th centuries. It marked the transition from the Middle Ages to modernity and was characterized by an effort to revive and sur ...
. By 1474,
Cicco Simonetta Francesco (Cicco) Simonetta (1410 – 30 October 1480) was an Italian Renaissance statesman who composed an early treatise on cryptography. Biography Francesco, nicknamed Cicco, was born in Caccuri, Calabria, and received a fine educatio ...
had written a manual on deciphering encryptions of
Latin Latin ( or ) is a classical language belonging to the Italic languages, Italic branch of the Indo-European languages. Latin was originally spoken by the Latins (Italic tribe), Latins in Latium (now known as Lazio), the lower Tiber area aroun ...
and
Italian Italian(s) may refer to: * Anything of, from, or related to the people of Italy over the centuries ** Italians, a Romance ethnic group related to or simply a citizen of the Italian Republic or Italian Kingdom ** Italian language, a Romance languag ...
text. Several schemes were invented by cryptographers to defeat this weakness in simple substitution encryptions. These included: * '' Homophonic substitution'': Use of ''homophones'' — several alternatives to the most common letters in otherwise monoalphabetic substitution ciphers. For example, for English, both X and Y ciphertext might mean plaintext E. * '' Polyalphabetic substitution'', that is, the use of several alphabets — chosen in assorted, more or less devious, ways ( Leone Alberti seems to have been the first to propose this); and * '' Polygraphic substitution'', schemes where pairs or triplets of plaintext letters are treated as units for substitution, rather than single letters, for example, the Playfair cipher invented by
Charles Wheatstone Sir Charles Wheatstone (; 6 February 1802 – 19 October 1875) was an English physicist and inventor best known for his contributions to the development of the Wheatstone bridge, originally invented by Samuel Hunter Christie, which is used to m ...
in the mid-19th century. A disadvantage of all these attempts to defeat frequency counting attacks is that it increases complication of both enciphering and deciphering, leading to mistakes. Famously, a British Foreign Secretary is said to have rejected the Playfair cipher because, even if school boys could cope successfully as Wheatstone and Playfair had shown, "our attachés could never learn it!". The
rotor machine In cryptography, a rotor machine is an electro-mechanical stream cipher device used for encrypting and decrypting messages. Rotor machines were the cryptographic state-of-the-art for much of the 20th century; they were in widespread use from ...
s of the first half of the 20th century (for example, the
Enigma machine The Enigma machine is a cipher device developed and used in the early- to mid-20th century to protect commercial, diplomatic, and military communication. It was employed extensively by Nazi Germany during World War II, in all branches of the W ...
) were essentially immune to straightforward frequency analysis. However, other kinds of analysis ("attacks") successfully decoded messages from some of those machines. Frequency analysis requires only a basic understanding of the statistics of the plaintext language and some problem-solving skills, and, if performed by hand, tolerance for extensive letter bookkeeping. During
World War II World War II or the Second World War (1 September 1939 – 2 September 1945) was a World war, global conflict between two coalitions: the Allies of World War II, Allies and the Axis powers. World War II by country, Nearly all of the wo ...
, both the
British British may refer to: Peoples, culture, and language * British people, nationals or natives of the United Kingdom, British Overseas Territories and Crown Dependencies. * British national identity, the characteristics of British people and culture ...
and the
Americans Americans are the Citizenship of the United States, citizens and United States nationality law, nationals of the United States, United States of America.; ; Law of the United States, U.S. federal law does not equate nationality with Race (hu ...
recruited codebreakers by placing
crossword A crossword (or crossword puzzle) is a word game consisting of a grid of black and white squares, into which solvers enter words or phrases ("entries") crossing each other horizontally ("across") and vertically ("down") according to a set of cl ...
puzzles in major newspapers and running contests for who could solve them the fastest. Several of the ciphers used by the
Axis powers The Axis powers, originally called the Rome–Berlin Axis and also Rome–Berlin–Tokyo Axis, was the military coalition which initiated World War II and fought against the Allies of World War II, Allies. Its principal members were Nazi Ge ...
were breakable using frequency analysis, for example, some of the consular ciphers used by the Japanese. Mechanical methods of letter counting and statistical analysis (generally
IBM International Business Machines Corporation (using the trademark IBM), nicknamed Big Blue, is an American Multinational corporation, multinational technology company headquartered in Armonk, New York, and present in over 175 countries. It is ...
card type machinery) were first used in World War II, possibly by the US Army's SIS. Today, the work of letter counting and analysis is done by
computer A computer is a machine that can be Computer programming, programmed to automatically Execution (computing), carry out sequences of arithmetic or logical operations (''computation''). Modern digital electronic computers can perform generic set ...
software Software consists of computer programs that instruct the Execution (computing), execution of a computer. Software also includes design documents and specifications. The history of software is closely tied to the development of digital comput ...
, which can carry out such analysis in seconds. With modern computing power, classical ciphers are unlikely to provide any real protection for confidential data.


Frequency analysis in fiction

Frequency analysis has been described in fiction.
Edgar Allan Poe Edgar Allan Poe (; January 19, 1809 – October 7, 1849) was an American writer, poet, editor, and literary critic who is best known for his poetry and short stories, particularly his tales involving mystery and the macabre. He is widely re ...
's " The Gold-Bug" and Sir Arthur Conan Doyle's
Sherlock Holmes Sherlock Holmes () is a Detective fiction, fictional detective created by British author Arthur Conan Doyle. Referring to himself as a "Private investigator, consulting detective" in his stories, Holmes is known for his proficiency with obser ...
tale " The Adventure of the Dancing Men" are examples of stories which describe the use of frequency analysis to attack simple substitution ciphers. The cipher in the Poe story is encrusted with several deception measures, but this is more a literary device than anything significant cryptographically.


See also

*
Index of coincidence In cryptography, coincidence counting is the technique (invented by William F. Friedman) of putting two texts side-by-side and counting the number of times that identical letters appear in the same position in both texts. This count, either as a r ...
*
Topics in cryptography The following outline is provided as an overview of and topical guide to cryptography: Cryptography (or cryptology) – practice and study of hiding information. Modern cryptography intersects the disciplines of mathematics, computer scie ...
* Zipf's law *
A Void ''A Void'', translated from the original French ( "The Disappearance"), is a 300-page French lipogrammatic novel, written in 1969 by Georges Perec, entirely without using the letter '' e'', following Oulipo constraints. Perec would go on to ...
, a novel by
Georges Perec Georges Perec (; 7 March 1936 – 3 March 1982) was a French novelist, filmmaker, documentalist, and essayist. He was a member of the Oulipo group. His father died as a soldier early in the Second World War and his mother was killed in the Ho ...
. The original French text is written without the letter ''e'', as is the English translation. The Spanish version contains no ''a''. *
Gadsby (novel) ''Gadsby'' is a 1939 in literature, 1939 novel by Ernest Vincent Wright, written without words that contain the letter E (letter), E, the most common letter in English. A work that deliberately avoids certain letters is known as a lipogram. The ...
, a novel by Ernest Vincent Wright. The novel is written as a
lipogram A lipogram (from , ''leipográmmatos'', "leaving out a letter" is a kind of constrained writing or word game consisting of writing paragraphs or longer works in which a particular letter or group of letters is avoided.McArthur, Tom (1992). ''The ...
, which does not include words that contain the letter E.


Further reading

* Helen Fouché Gaines, "Cryptanalysis", 1939, Dover. . * Abraham Sinkov, "Elementary Cryptanalysis: A Mathematical Approach", The Mathematical Association of America, 1966. .


References


External links


Online frequency analysis tool

Character
an
syllable
frequencies of 41 languages and a portable tool to create frequency and syllable distributions
Arabic letter frequency analysis



Czech letter/bigram/trigram frequency
{{DEFAULTSORT:Frequency Analysis Cryptographic attacks Frequency distribution Arab inventions Quantitative linguistics