
In cryptanalysis, frequency analysis (also known as counting letters) is the study of the frequency of letters or groups of letters in a ciphertext. The method is used as an aid to breaking classical ciphers.

Frequency analysis is based on the fact that, in any given stretch of written language, certain letters and combinations of letters occur with varying frequencies. Moreover, there is a characteristic distribution of letters that is roughly the same for almost all samples of that language. For instance, in a section of English text, E, T, A and O are the most common letters, while Z, Q, X and J are rare. Likewise, TH, ER, ON, and AN are the most common pairs of letters (termed ''bigrams'' or ''digraphs''), and SS, EE, TT, and FF are the most common repeats. The nonsense phrase "ETAOIN SHRDLU" represents the 12 most frequent letters in typical English language text.

In some ciphers, such properties of the natural language plaintext are preserved in the ciphertext, and these patterns have the potential to be exploited in a ciphertext-only attack.
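As a minimal sketch of the counting step, the following Python function tallies the relative frequency of each letter in a text. The function name `letter_frequencies` and the choice to ignore everything outside A-Z are assumptions made for illustration, not part of any standard tool.

```python
from collections import Counter
import string


def letter_frequencies(text):
    """Return {letter: relative frequency} for the letters A-Z in text,
    ordered from most to least frequent. Non-letters are ignored."""
    letters = [ch for ch in text.upper() if ch in string.ascii_uppercase]
    if not letters:
        return {}
    total = len(letters)
    return {letter: count / total
            for letter, count in Counter(letters).most_common()}


if __name__ == "__main__":
    # Sample sentence taken from the Gold-Bug passage quoted in this article.
    sample = ("The scales were exceedingly hard and glossy, "
              "with all the appearance of burnished gold.")
    for letter, freq in letter_frequencies(sample).items():
        print(f"{letter}: {freq:.3f}")
```

On a sample this short the ranking will only loosely resemble the typical English ordering; the characteristic distribution becomes clearer as the text grows.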


Frequency analysis for simple substitution ciphers

In a simple substitution cipher, each letter of the plaintext is replaced with another, and any particular letter in the plaintext will always be transformed into the same letter in the ciphertext. For instance, if all occurrences of the letter e turn into the letter X, a ciphertext message containing numerous instances of the letter X would suggest to a cryptanalyst that X represents e.

The basic use of frequency analysis is to first count the frequency of ciphertext letters and then associate guessed plaintext letters with them. More Xs in the ciphertext than anything else suggests that X corresponds to e in the plaintext, but this is not certain; t and a are also very common in English, so X might be either of them instead. It is unlikely to be a plaintext z or q, which are less common. Thus the cryptanalyst may need to try several combinations of mappings between ciphertext and plaintext letters.

More complex use of statistics can be conceived, such as considering counts of pairs of letters (''bigrams''), triplets (''trigrams''), and so on. This provides more information to the cryptanalyst; for instance, Q and U nearly always occur together in that order in English, even though Q itself is rare.
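The counting of single letters, bigrams and trigrams, followed by a first guess at the mapping, could be sketched as follows. The helper names `ngram_counts` and `rank_guess` are hypothetical, and pairing ciphertext letters with plaintext letters purely by frequency rank is a simplifying assumption: it yields a starting point to be revised, not a solution.

```python
from collections import Counter

# Typical English frequency order, taken from the phrase "ETAOIN SHRDLU".
ENGLISH_ORDER = "ETAOINSHRDLU"


def ngram_counts(text, n):
    """Count overlapping n-grams of the letters A-Z in text.
    Spaces and punctuation are dropped first, so n-grams may span
    word boundaries (as in a cryptogram written without spaces)."""
    letters = "".join(ch for ch in text.upper() if "A" <= ch <= "Z")
    return Counter(letters[i:i + n] for i in range(len(letters) - n + 1))


def rank_guess(ciphertext, english_order=ENGLISH_ORDER):
    """Pair the most frequent ciphertext letters with the most frequent
    English letters, rank by rank. Returns {CIPHER: plain_guess}."""
    ranked = [letter for letter, _ in ngram_counts(ciphertext, 1).most_common()]
    return {c: p.lower() for c, p in zip(ranked, english_order)}


if __name__ == "__main__":
    fragment = "XLIXLIAB"  # made-up ciphertext fragment for illustration
    print(ngram_counts(fragment, 3).most_common(1))  # [('XLI', 2)]
    print(rank_guess("IIIXXL"))  # {'I': 'e', 'X': 't', 'L': 'a'}
```

Because several English letters have similar frequencies, a rank-based guess is often wrong beyond the first few letters; bigram and trigram counts help decide between competing candidates.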


An example

Suppose Eve has intercepted a cryptogram that is known to be encrypted using a simple substitution cipher. For this example, uppercase letters are used to denote ciphertext, lowercase letters are used to denote plaintext (or guesses at such), and X~t is used to express a guess that ciphertext letter X represents the plaintext letter t.

Eve could use frequency analysis to help solve the message along the following lines: counts of the letters in the cryptogram show that I is the most common single letter, XL the most common bigram, and XLI the most common trigram. e is the most common letter in the English language, th is the most common bigram, and the is the most common trigram. This strongly suggests that X~t, L~h and I~e. The second most common letter in the cryptogram is E; since the first and second most frequent letters in the English language, e and t, are accounted for, Eve guesses that E~a, the third most frequent letter. Tentatively making these assumptions, Eve obtains a partially decrypted message.

Using these initial guesses, Eve can spot patterns that confirm her choices, such as "that". Moreover, other patterns suggest further guesses. "Rtate" might be "state", which would mean R~s. Similarly "atthattMZe" could be guessed as "atthattime", yielding M~i and Z~m. Furthermore, "heVe" might be "here", giving V~r. Filling in these guesses gives Eve a more complete partial decryption. In turn, these guesses suggest still others (for example, "remarA" could be "remark", implying A~k) and so on, and it is relatively straightforward to deduce the rest of the letters, eventually yielding the plaintext. At this point, it would be a good idea for Eve to insert spaces and punctuation:

Hereupon Legrand arose, with a grave and stately air, and brought me the beetle from a glass case in which it was enclosed. It was a beautiful scarabaeus, and, at that time, unknown to naturalists—of course a great prize in a scientific point of view. There were two round black spots near one extremity of the back, and a long one near the other. The scales were exceedingly hard and glossy, with all the appearance of burnished gold. The weight of the insect was very remarkable, and, taking all things into consideration, I could hardly blame Jupiter for his opinion respecting it.

In this example from The Gold-Bug, Eve's guesses were all correct. This would not always be the case, however; the variation in statistics for individual plaintexts can mean that initial guesses are incorrect. It may be necessary to backtrack from incorrect guesses or to analyze the available statistics in much more depth than the somewhat simplified justifications given in the above example.

It is also possible that the plaintext does not exhibit the expected distribution of letter frequencies. Shorter messages are likely to show more variation. It is also possible to construct artificially skewed texts. For example, entire novels have been written that omit the letter "e" altogether, a form of literature known as a lipogram.
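The bookkeeping in such an example, with uppercase for unsolved ciphertext and lowercase for guessed plaintext, can be sketched in a few lines of Python. The function name `apply_guesses`, the guess table and the fragments below are illustrative assumptions in the spirit of the example, not a transcription of the cryptogram.

```python
def apply_guesses(ciphertext, guesses):
    """Replace each ciphertext letter that has a guess with its lowercase
    plaintext guess; letters without a guess stay uppercase."""
    return "".join(guesses.get(ch, ch) for ch in ciphertext)


if __name__ == "__main__":
    # Hypothetical initial guesses, written X~t, L~h, I~e, E~a in the
    # notation used above.
    guesses = {"X": "t", "L": "h", "I": "e", "E": "a"}

    print(apply_guesses("XLEX", guesses))   # that
    print(apply_guesses("RXEXI", guesses))  # Rtate

    # A pattern such as "Rtate" suggests a further guess, R~s. Working on
    # a copy keeps the earlier table intact in case the guess must be undone.
    extended = dict(guesses, R="s")
    print(apply_guesses("RXEXI", extended))  # state
```

Keeping each round of guesses in its own dictionary makes backtracking cheap: an incorrect guess is discarded by returning to the previous table.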


History and usage

The first known recorded explanation of frequency analysis (indeed, of any kind of cryptanalysis) was given in the 9th century by Al-Kindi, an Arab polymath, in ''A Manuscript on Deciphering Cryptographic Messages''. It has been suggested that close textual study of the Qur'an first brought to light that Arabic has a characteristic letter frequency. Its use spread, and similar systems were widely used in European states by the time of the Renaissance. By 1474, Cicco Simonetta had written a manual on deciphering encryptions of Latin and Italian text.

Several schemes were invented by cryptographers to defeat this weakness in simple substitution encryptions. These included:

* ''Homophonic substitution'': use of ''homophones'', several alternatives to the most common letters in otherwise monoalphabetic substitution ciphers. For example, for English, both X and Y ciphertext might mean plaintext E.
* ''Polyalphabetic substitution'': the use of several alphabets, chosen in assorted, more or less devious, ways (Leone Alberti seems to have been the first to propose this).
* ''Polygraphic substitution'': schemes where pairs or triplets of plaintext letters are treated as units for substitution, rather than single letters, for example, the Playfair cipher invented by Charles Wheatstone in the mid-19th century.

A disadvantage of all these attempts to defeat frequency counting attacks is that they complicate both enciphering and deciphering, leading to mistakes. Famously, a British Foreign Secretary is said to have rejected the Playfair cipher because, even if schoolboys could cope successfully as Wheatstone and Playfair had shown, "our attachés could never learn it!".

The rotor machines of the first half of the 20th century (for example, the Enigma machine) were essentially immune to straightforward frequency analysis. However, other kinds of analysis ("attacks") successfully decoded messages from some of those machines.

Frequency analysis requires only a basic understanding of the statistics of the plaintext language and some problem-solving skills, and, if performed by hand, tolerance for extensive letter bookkeeping. During World War II (WWII), both the British and the Americans recruited codebreakers by placing crossword puzzles in major newspapers and running contests for who could solve them the fastest. Several of the ciphers used by the Axis powers were breakable using frequency analysis, for example, some of the consular ciphers used by the Japanese. Mechanical methods of letter counting and statistical analysis (generally IBM card type machinery) were first used in World War II, possibly by the US Army's SIS.

Today, the hard work of letter counting and analysis has been replaced by computer software, which can carry out such analysis in seconds. With modern computing power, classical ciphers are unlikely to provide any real protection for confidential data.


Frequency analysis in fiction

Frequency analysis has been described in fiction. Edgar Allan Poe's "The Gold-Bug" and Sir Arthur Conan Doyle's Sherlock Holmes tale "The Adventure of the Dancing Men" are examples of stories which describe the use of frequency analysis to attack simple substitution ciphers. The cipher in the Poe story is encrusted with several deception measures, but this is more a literary device than anything significant cryptographically.


See also

* ETAOIN SHRDLU
* Letter frequencies
* Arabic Letter Frequency
* Index of coincidence
* Topics in cryptography
* Zipf's law
* A Void, a novel by Georges Perec. The original French text is written without the letter ''e'', as is the English translation. The Spanish version contains no ''a''.
* Gadsby, a novel by Ernest Vincent Wright. The novel is written as a lipogram and does not include any words that contain the letter E.


Further reading

* Helen Fouché Gaines, "Cryptanalysis", 1939, Dover.
* Abraham Sinkov, "Elementary Cryptanalysis: A Mathematical Approach", The Mathematical Association of America, 1966.

