
The internationalized domain name (IDN) homograph attack (sometimes written as homoglyph attack) is a method used by malicious parties to deceive computer users about what remote system they are communicating with, by exploiting the fact that many different characters look alike (i.e., they rely on homoglyphs to deceive visitors). For example, the Cyrillic, Greek and Latin alphabets each have a letter that has the same shape but represents different sounds or phonemes in their respective writing systems.
This kind of spoofing attack is also known as script spoofing.
Unicode incorporates numerous scripts (writing systems), and, for a number of reasons, similar-looking characters such as Greek Ο, Latin O, and Cyrillic О were not assigned the same code. Their incorrect or malicious usage opens the door to security attacks. Thus, for example, a regular user of a website may be lured to click unquestioningly on an apparently familiar link to it, unaware that one of its letters is not the Latin character "a" but rather the Cyrillic character "а", and that the link therefore leads to an entirely different domain from the intended one.
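As a small illustration of the point about separate codes, the following Python sketch prints the code point and Unicode name of the three look-alike capitals. The variable names are hypothetical and chosen only for this example.

```python
import unicodedata

# Three visually similar capital letters, written as escapes so that
# the script of each one is unambiguous in the source code.
lookalikes = [
    "\u039f",  # Greek capital omicron
    "\u004f",  # Latin capital O
    "\u041e",  # Cyrillic capital O
]

for ch in lookalikes:
    # Print the code point in U+XXXX form next to the official name.
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")

# The three strings compare unequal even though the glyphs often coincide.
all_distinct = len(set(lookalikes)) == 3
```

A program comparing identifiers sees three different characters here, which is exactly the mismatch with human perception that the attack relies on.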
The registration of homographic domain names is akin to typosquatting, in that both forms of attack use a name similar to that of a more established domain to fool a user. The major difference is that in typosquatting the perpetrator attracts victims by relying on natural typographical errors commonly made when manually entering a URL, while in homograph spoofing the perpetrator deceives the victims by presenting visually indistinguishable hyperlinks. Indeed, it would be a rare accident for a web user to type, for example, a Cyrillic letter within an otherwise English word, turning "bank" into "bаnk". There are cases in which a registration can be both typosquatting and homograph spoofing; the pairs l/I, i/j, and 0/O are all both close together on keyboards and, depending on the typeface, may be difficult or impossible to distinguish visually.
History
An early nuisance of this kind, pre-dating the Internet and even text terminals, was the confusion between "l" (lowercase letter "L") / "1" (the number "one") and "O" (capital letter for vowel "o") / "0" (the number "zero"). Some typewriters in the pre-computer era even combined the L and the one; users had to type a lowercase L when the number one was needed. The zero/o confusion gave rise to the tradition of crossing zeros, so that a computer operator would type them correctly.
Unicode may contribute to this greatly with its combining characters, accents, several types of hyphen, etc., often due to inadequate rendering support, especially with smaller font sizes and the wide variety of fonts ("Unicode Security Considerations", Technical Report #36, 2010-04-28).
Even earlier, handwriting provided rich opportunities for confusion. A notable example is the etymology of the word "zenith". The translation from the Arabic "samt" included the scribe's confusing of "m" into "ni". This was common in medieval blackletter, which did not connect the vertical strokes of the letters i, m, n, or u, making them difficult to distinguish when several were in a row. This confusion, as well as "rn"/"m"/"rri" ("RN"/"M"/"RRI") confusion, is still possible for a human eye even with modern advanced computer technology.
Intentional look-alike character substitution with different alphabets has also been known in various contexts. For example, Faux Cyrillic has been used as an amusement or attention-grabber, and "Volapuk encoding", in which Cyrillic script is represented by similar Latin characters, was used in the early days of the Internet as a way to overcome the lack of support for the Cyrillic alphabet. Another example is that vehicle registration plates can be read both as Cyrillic (for domestic usage in Cyrillic script countries) and as Latin (for international driving) when they use only letters whose shapes the two scripts share. Registration plates that are issued in Greece are limited to using letters of the Greek alphabet that have homoglyphs in the Latin alphabet, as European Union regulations require the use of Latin letters.
Homographs in ASCII
ASCII has several characters or pairs of characters that look alike and are known as ''homographs'' (or ''homoglyphs''). Spoofing attacks based on these similarities are known as homograph spoofing attacks. Examples include 0 (the number) and O (the letter), and "l" (lowercase "L") and "I" (uppercase "i").
In a typical example of a hypothetical attack, someone could register a domain name that appears almost identical to an existing domain but goes somewhere else. For example, the domain "rnicrosoft.com" begins with "r" and "n", not "m".
Another example is ''G00GLE.COM'', which looks much like ''GOOGLE.COM'' in some fonts.
Using a mix of uppercase and lowercase characters, ''googIe.com'' (capital ''i'', not small ''L'') looks much like ''google.com'' in some fonts. PayPal was a target of a phishing scam exploiting this, using the domain PayPaI.com. In certain narrow-spaced fonts such as Tahoma (the default in the address bar in Windows XP), placing a c in front of a j, l or i will produce homoglyphs such as cl cj ci (d g a).
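The ASCII look-alikes above can be folded to a common form before two names are compared. The sketch below is a minimal heuristic, not an established algorithm: the function names `ascii_skeleton` and `looks_like` are hypothetical, and the folding table covers only the pairs mentioned in this section.

```python
def ascii_skeleton(name: str) -> str:
    """Fold a few ASCII look-alikes to one representative form."""
    s = name.lower()          # domain names are case-insensitive
    s = s.replace("rn", "m")  # "rn" can read as "m"
    s = s.replace("0", "o")   # digit zero vs. letter O
    s = s.replace("1", "l")   # digit one vs. lowercase L
    s = s.replace("i", "l")   # capital I (now lowercased) vs. lowercase L
    return s


def looks_like(candidate: str, trusted: str) -> bool:
    """True if candidate is a different name that folds to the same skeleton."""
    if candidate.lower() == trusted.lower():
        return False  # same domain, possibly in a different letter case
    return ascii_skeleton(candidate) == ascii_skeleton(trusted)


for suspect, real in [
    ("rnicrosoft.com", "microsoft.com"),
    ("G00GLE.COM", "google.com"),
    ("PayPaI.com", "paypal.com"),
]:
    print(suspect, "resembles", real, ":", looks_like(suspect, real))
```

Because the folding is deliberately coarse, unrelated names such as "mail" and "mall" also collapse together, so a check like this is better suited to raising a warning than to blocking a name outright.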
Homographs in internationalized domain names
In multilingual computer systems, different logical characters may have identical appearances.
For example, Unicode character U+0430, Cyrillic small letter a ("а"), can look identical to Unicode character U+0061, Latin small letter a ("a"), which is the lowercase "a" used in English. Hence wikipediа.org (xn--wikipedi-86g.org; the Cyrillic version) can be displayed in place of wikipedia.org (the Latin version).
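The two names from this example can be told apart by converting each to its ASCII (Punycode) form. The sketch below uses Python's built-in `idna` codec, which implements the older IDNA 2003 rules; the variable names are hypothetical.

```python
latin = "wikipedia.org"
spoof = "wikipedi\u0430.org"  # the final "a" is U+0430, Cyrillic small letter a

# Convert each name to the ASCII form that is actually used in DNS.
ace_latin = latin.encode("idna").decode("ascii")
ace_spoof = spoof.encode("idna").decode("ascii")

print(ace_latin)  # wikipedia.org
print(ace_spoof)  # xn--wikipedi-86g.org

# The strings differ as far as the computer is concerned,
# even when the rendered glyphs look the same.
print(latin == spoof)
```

Displaying the `xn--` form instead of the Unicode form is one way software can make the difference visible to the user.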
The problem arises from the different treatment of the characters in the user's mind and the computer's programming. From the viewpoint of the user, a Cyrillic "а" within a Latin string ''is'' a Latin "a"; there is no difference in the glyphs for these characters in most fonts. However, the computer treats them differently when processing the character string as an identifier. Thus, the user's assumption of a one-to-one correspondence between the visual appearance of a name and the named entity breaks down.
Internationalized domain names provide a backward-compatible way for domain names to use the full Unicode character set, and this standard is already widely supported. However, this system expanded the character repertoire from a few dozen characters in a single alphabet to many thousands of characters in many scripts; this greatly increased the scope for homograph attacks.
This opens a rich vein of opportunities for phishing and other varieties of fraud. An attacker could register a domain name that ''looks'' just like that of a legitimate website, but in which some of the letters have been replaced by homographs in another alphabet. The attacker could then send e-mail messages purporting to come from the original site, but directing people to the bogus site. The spoof site could then record information such as passwords or account details, while passing traffic through to the real site. The victims may never notice the difference, until suspicious or criminal activity occurs with their accounts.
In December 2001 Evgeniy Gabrilovich and Alex Gontmakher, both from Technion, Israel, published a paper titled "The Homograph Attack" (Communications of the ACM, 45(2):128, February 2002), which described an attack that used Unicode URLs to spoof a website URL. To prove the feasibility of this kind of attack, the researchers successfully registered a variant of the domain name ''microsoft.com'' which incorporated Cyrillic characters.
Problems of this kind were anticipated before IDN was introduced, and guidelines were issued to registries to try to avoid or reduce the problem. For example, it was advised that registries only accept characters from the Latin alphabet and that of their own country, not all Unicode characters, but this advice was neglected by major TLDs.
On February 6, 2005, Cory Doctorow reported that this exploit was disclosed by 3ric Johanson at the hacker conference Shmoocon. Web browsers supporting IDNA appeared to direct the URL http://www.pаypal.com/, in which the first ''a'' character is replaced by a Cyrillic ''а'', to the site of the well-known payment service PayPal, but actually led to a spoofed web site with different content. Popular browsers continued to have problems properly displaying international domain names through April 2017.
The following alphabets have characters that can be used for spoofing attacks. Only the most obvious and common are listed; given artistic license and how much risk of detection the spoofer will accept, the possibilities are far more numerous than can be listed here:
Cyrillic
Cyrillic is, by far, the most commonly used alphabet for homoglyphs, largely because it contains 11 lowercase glyphs that are identical or nearly identical to Latin counterparts.
The Cyrillic letters а, с, е, о, р, х and у have optical counterparts in the basic Latin alphabet and look close or identical to a, c, e, o, p, x and y. Cyrillic З, Ч and б resemble the numerals 3, 4 and 6. Italic type generates more homoglyphs: ''д т п и'' (д т п и in standard type), resembling d m n u (in some fonts д can be used, since its italic form resembles a lowercase g; however, in most mainstream fonts, д instead resembles a partial differential sign, ∂).
If capital letters are counted, А В С Е Н І Ј К М О Р Ѕ Т Х can substitute A B C E H I J K M O P S T X, in addition to the capitals for the lowercase Cyrillic homoglyphs.
Cyrillic non-Russian problematic letters are і and i, ј and j, ԛ and q, ѕ and s, ԝ and w, Ү and Y, while Ғ and F, Ԍ and G bear some resemblance to each other. Cyrillic ӓ ё ї ӧ can also be used if an IDN itself is being spoofed, to fake ä ë ï ö.
While Komi De (ԁ), shha (һ), palochka (Ӏ) and izhitsa (ѵ) bear strong resemblance to Latin d, h, l and v, these letters are either rare or archaic and are not widely supported in most standard fonts (they are not included in the WGL-4). Attempting to use them could cause a ransom note effect.
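The lowercase Cyrillic look-alikes listed in this section can be folded to their Latin counterparts to test whether a label imitates a trusted name. This is a minimal sketch: the table `CYRILLIC_TO_LATIN` and both function names are hypothetical, and the table covers only the lowercase letters named above, not the full set of confusable characters.

```python
# Cyrillic code points (as escapes) mapped to the Latin letters they resemble.
CYRILLIC_TO_LATIN = str.maketrans({
    "\u0430": "a",  # Cyrillic a
    "\u0441": "c",  # Cyrillic es
    "\u0435": "e",  # Cyrillic ie
    "\u043e": "o",  # Cyrillic o
    "\u0440": "p",  # Cyrillic er
    "\u0445": "x",  # Cyrillic ha
    "\u0443": "y",  # Cyrillic u
    "\u0456": "i",  # Byelorussian-Ukrainian i
    "\u0458": "j",  # Cyrillic je
    "\u051b": "q",  # Cyrillic qa
    "\u0455": "s",  # Cyrillic dze
    "\u051d": "w",  # Cyrillic we
})


def fold_cyrillic(label: str) -> str:
    """Replace each listed Cyrillic look-alike with its Latin counterpart."""
    return label.translate(CYRILLIC_TO_LATIN)


def is_cyrillic_lookalike(candidate: str, trusted: str) -> bool:
    """True if candidate differs from trusted only by listed look-alikes."""
    return candidate != trusted and fold_cyrillic(candidate) == trusted


print(is_cyrillic_lookalike("p\u0430ypal", "paypal"))
print(is_cyrillic_lookalike("paypal", "paypal"))
```

A production check would more likely rely on a maintained confusables table than on a hand-written dictionary like this one.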
Greek
From the Greek alphabet, only omicron (ο) and sometimes nu (ν) appear identical to a Latin alphabet letter in the lowercase used for URLs. Fonts that are in italic type will feature Greek alpha (''α'') looking like a Latin ''a''.
This list increases if close matches are also allowed (such as Greek ε ι κ η ρ τ υ ω χ γ for e i k n p t u w x y). Using capital letters, the list expands greatly. Greek Α Β Ε Η Ι Κ Μ Ν Ο Ρ Τ Χ Υ Ζ looks identical to Latin A B E H I K M N O P T X Y Z. Greek Α Γ Β Ε Η Κ Μ Ο Π Ρ Τ Φ Χ looks similar to Cyrillic А Г В Е Н К М О П Р Т Ф Х (as do Cyrillic Л л and Greek Λ in certain geometric sans-serif fonts), and Greek letters κ and ο look similar to Cyrillic к and о. Besides this, Greek τ, φ can be similar to Cyrillic т, ф in some fonts, Greek δ looks like Cyrillic б, and the Cyrillic ''а'' also italicizes the same as its Latin counterpart, making it possible to substitute it for alpha or vice versa. The lunate form of sigma, Ϲ ϲ, resembles both Latin C c and Cyrillic С с. Especially in contemporary typefaces, Cyrillic л is rendered with a glyph indistinguishable from Greek π.
If an IDN itself is being spoofed, Greek beta β can be a substitute for German eszett ß in some fonts (and in fact, code page 437 treats them as equivalent), as can Greek end-of-word-variant sigma ς for ç; accented Greek substitutes ''ό ί ά'' can usually be used for ''ó í á'' in many fonts, with the last of these (alpha) again only resembling ''a'' in italic type.
Armenian
The Armenian alphabet can also contribute critical characters: several Armenian characters like օ, ո, ս, as well as capital Տ and Լ, are often completely identical to Latin characters in modern fonts. Others are similar enough to pass, such as ցհոօզս, which look like ghnoqu; յ, which resembles j (albeit dotless); and ք, which can resemble either p or f depending on the font; ա can resemble Cyrillic ш. However, the use of Armenian is, luckily, a bit less reliable: not all standard fonts feature Armenian glyphs (whereas Greek and Cyrillic glyphs are widely included); Windows prior to Windows 7 rendered Armenian in a distinct font, Sylfaen, so a mix of Armenian and Latin would look obviously different when using a font other than Sylfaen or a Unicode typeface. (This is known as a ransom note effect.) The current version of Tahoma, used in Windows 7, supports Armenian (previous versions did not). Furthermore, this font differentiates Latin g from Armenian ց.
Two letters in Armenian (Ձշ) also can resemble the number 2, Յ resembles 3, while another (վ) sometimes resembles the number 4.
Hebrew
Hebrew spoofing is generally rare. Only three letters from that alphabet can reliably be used: samekh (ס), which sometimes resembles o, vav with diacritic (וֹ), which resembles an i, and heth (ח), which resembles the letter n. Less accurate approximants for some other alphanumerics can also be found, but these are usually only accurate enough to use for the purposes of foreign branding and not for substitution. Furthermore, the Hebrew alphabet is written from right to left and trying to mix it with left-to-right glyphs may cause problems.
Thai
Though the Thai script has historically had a distinct look with numerous loops and small flourishes, modern Thai typography, beginning with Manoptica in 1973 and continuing through IBM Plex in the modern era, has increasingly adopted a simplified style in which Thai characters are represented with glyphs strongly resembling Latin letters. ค (A), ท (n), น (u), บ (U), ป (J), พ (W), ร (S), and ล (a) are among the Thai glyphs that can closely resemble Latin.
Chinese
The Chinese language can be problematic for homographs as many characters exist as both traditional (regular script) and simplified Chinese characters. In the .org domain, registering one variant renders the other unavailable to anyone; in .biz a single Chinese-language IDN registration delivers both variants as active domains (which must have the same domain name server and the same registrant). .hk (.香港) also adopts this policy.
Other scripts
Other Unicode scripts in which homographs can be found include Number Forms (Roman numerals), CJK Compatibility and Enclosed CJK Letters and Months (certain abbreviations), Latin (certain digraphs), Currency Symbols, Mathematical Alphanumeric Symbols, and Alphabetic Presentation Forms (typographic ligatures).
Accented characters
Two names which differ only in an accent on one character may look very similar, particularly when the substitution involves the dotted letter i; the tittle (dot) on the i can be replaced with a diacritic (such as a grave accent or acute accent; both ì and í are included in most standard character sets and fonts) that can only be detected with close inspection. In most top-level domain registries, wíkipedia.tld (xn--wkipedia-c2a.tld) and wikipedia.tld are two different names which may be held by different registrants. One exception is .ca, where reserving the plain-ASCII version of the domain prevents another registrant from claiming an accented version of the same name.
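One way a registry or client could notice that an accented name shadows a plain-ASCII one is to decompose the label and drop the combining marks. The sketch below assumes this approach purely for illustration (it is not described as any registry's actual procedure), and `strip_accents` is a hypothetical helper name.

```python
import unicodedata


def strip_accents(label: str) -> str:
    """Decompose to NFD and drop combining marks such as the acute accent."""
    decomposed = unicodedata.normalize("NFD", label)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


accented = "w\u00edkipedia"  # U+00ED, Latin small letter i with acute
plain = "wikipedia"

# Both names fold to the same ASCII base once the accent is removed.
print(strip_accents(accented) == plain)

# Their DNS (Punycode) forms nevertheless remain different names.
print(accented.encode("idna").decode("ascii"))  # xn--wkipedia-c2a
```

This only handles letters that decompose into a base letter plus a combining mark; characters without such a decomposition pass through unchanged.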
Non-displayable characters
Unicode includes many characters which are not displayed by default, such as the zero-width space. In general, ICANN prohibits any domain with these characters from being registered, regardless of TLD.
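A label can be screened for such characters before it is displayed or accepted. The sketch below approximates "not displayed by default" with the Unicode general category Cf (format characters), which includes the zero-width space; that approximation and the function name `invisible_chars` are assumptions made for this example.

```python
import unicodedata


def invisible_chars(label: str) -> list:
    """Return the code points of format (Cf) characters found in label."""
    return [
        f"U+{ord(ch):04X}"
        for ch in label
        if unicodedata.category(ch) == "Cf"
    ]


clean = "example"
tainted = "exam\u200bple"  # contains U+200B, the zero-width space

print(invisible_chars(clean))
print(invisible_chars(tainted))
```

The two strings render identically in most contexts, yet only the second returns a non-empty list.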
Known homograph attacks
In 2011, an unknown source (registering under the name "Completely Anonymous") registered a domain name homographic to that of television station KBOI-TV to create a fake news website. The sole purpose of the site was to spread an April Fool's Day joke regarding the Governor of Idaho issuing a supposed ban on the sale of music by Justin Bieber.
In September 2017, security researcher Ankit Anubhav discovered an IDN homograph attack where the attackers registered adoḅe.com to deliver the Betabot trojan.
Defending against the attack
Client-side mitigation
The simplest defense is for web browsers not to support IDNA or other similar mechanisms, or for users to turn off whatever support their browsers have. That could mean blocking access to IDNA sites, but generally browsers permit access and just display IDNs in Punycode. Either way, this amounts to abandoning non-ASCII domain names.
* Mozilla Firefox versions 22 and later display IDNs if either the TLD prevents homograph attacks by restricting which characters can be used in domain names or labels do not mix scripts for different languages. Otherwise, IDNs are displayed in Punycode.
* Google Chrome versions 51 and later use an algorithm similar to the one used by Firefox. Previous versions display an IDN only if all of its characters belong to one (and only one) of the user's preferred languages. Chromium and Chromium-based browsers such as Microsoft Edge (since 2020) and Opera also use the same algorithm.
* Safari's approach is to render problematic character sets as Punycode. This can be changed by altering the settings in Mac OS X's system files.
* Internet Explorer versions 7 and later allow IDNs except for labels that mix scripts for different languages; such labels are displayed in Punycode. There are exceptions for locales where ASCII characters are commonly mixed with localized scripts. Internet Explorer 7 was capable of using IDNs, but it imposed restrictions on displaying non-ASCII domain names based on a user-defined list of allowed languages, and it provided an anti-phishing filter that checked suspicious websites against a remote database of known phishing sites.
* Microsoft Edge Legacy converts all Unicode into Punycode.
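The mixed-script rule that several of these browsers apply can be approximated as follows. This is only a sketch: it guesses a character's script from the first word of its Unicode character name (for example `CYRILLIC` or `LATIN`), which is a rough heuristic and not the algorithm any browser actually ships. The function names are hypothetical.

```python
import unicodedata
from typing import Set


def scripts_in_label(label: str) -> Set[str]:
    """Approximate the set of scripts used by the letters of one label.

    Heuristic: the first word of the Unicode character name, e.g.
    "CYRILLIC SMALL LETTER A" -> "CYRILLIC". Digits and hyphens are ignored.
    """
    return {
        unicodedata.name(ch, "UNKNOWN").split()[0]
        for ch in label
        if ch.isalpha()
    }


def display_form(domain: str) -> str:
    """Return Punycode if any label mixes scripts, else the name unchanged."""
    if any(len(scripts_in_label(label)) > 1 for label in domain.split(".")):
        return domain.encode("idna").decode("ascii")
    return domain


if __name__ == "__main__":
    print(scripts_in_label("p\u0430ypal"))   # contains both LATIN and CYRILLIC
    print(display_form("p\u0430ypal.com"))   # shown as an "xn--" name
    print(display_form("example.com"))       # example.com
```

A label written entirely in a single script passes this check unchanged, so the rule on its own does not stop whole-script homographs.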
As an additional defense, Internet Explorer 7, Firefox 2.0 and above, and Opera 9.10 include phishing filters that attempt to alert users when they visit malicious websites. As of April 2017, several browsers (including Chrome, Firefox, and Opera) were displaying IDNs consisting purely of Cyrillic characters normally (not as Punycode), allowing spoofing attacks. Chrome tightened IDN restrictions in version 59 to prevent this attack.
These methods of defense only extend to within a browser. Homographic URLs that house malicious software can still be distributed, without being displayed as Punycode, through e-mail, social networking or other websites without being detected until the user actually clicks the link. While the fake link will show in Punycode when it is clicked, by this point the page has already begun loading into the browser.
Server-side/registry operator mitigation
The IDN homographs database is a Python library that allows developers to defend against this using machine learning-based character recognition.
ICANN has implemented a policy prohibiting any potential internationalized TLD from choosing letters that could resemble an existing Latin TLD and thus be used for homograph attacks. The proposed IDN TLDs .бг (Bulgaria), .укр (Ukraine) and .ελ (Greece) were rejected or stalled because of their perceived resemblance to Latin letters. All three (as well as Serbian .срб and Mongolian .мон) were later accepted. Three-letter TLDs are considered safer than two-letter TLDs, since they are harder to match to normal Latin ISO 3166 country domains. Although the potential to match new generic domains remains, registering such a generic domain is far more expensive than registering a second- or third-level domain address, making it cost-prohibitive to register a homoglyphic TLD for the sole purpose of making fraudulent domains (which itself would draw ICANN scrutiny).
The Russian registry operator Coordination Center for TLD RU only accepts Cyrillic names for the top-level domain .рф, forbidding a mix with Latin or Greek characters. However, the problem in .com and other gTLDs remains open.
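A registration-time rule of this kind can be sketched as an allow-list check. The code below is an illustration only and does not reproduce the registry's actual policy: it assumes that a label is acceptable when every character is either a Cyrillic letter (judged by its Unicode character name), an ASCII digit, or a hyphen. The function name `is_cyrillic_only` is hypothetical.

```python
import unicodedata


def is_cyrillic_only(label: str) -> bool:
    """Return True if the label contains only Cyrillic letters,
    ASCII digits, and hyphens (assumed policy, for illustration)."""
    if not label:
        return False
    for ch in label:
        if ch == "-" or (ch.isascii() and ch.isdigit()):
            continue
        # Unicode names of Cyrillic letters start with "CYRILLIC".
        if not unicodedata.name(ch, "").startswith("CYRILLIC"):
            return False
    return True


if __name__ == "__main__":
    print(is_cyrillic_only("\u043f\u0440\u0438\u043c\u0435\u0440"))  # True
    print(is_cyrillic_only("p\u0430ypal"))                           # False
```

Because the check rejects any Latin or Greek letter outright, a mixed-script spoof such as a Latin word with one Cyrillic letter cannot be registered under such a rule.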
Research-based mitigations
In their 2019 study, Suzuki et al. introduced ShamFinder, a program for recognizing IDN homographs, shedding light on their prevalence in real-world scenarios. Similarly, Chiba et al. (2019) designed DomainScouter, a system for detecting diverse homograph IDNs in domains. By analyzing an estimated 4.4 million registered IDNs across 570 top-level domains (TLDs), it identified 8,284 IDN homographs, including many previously unidentified cases targeting brands in languages other than English.
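The general idea of matching an IDN against known brand names can be illustrated with a toy example. This sketch is not the method used by ShamFinder or DomainScouter; it assumes a tiny hand-written homoglyph table (`HOMOGLYPHS`, hypothetical and far from complete) and strips combining marks after Unicode NFD normalization before comparing the result with a brand list.

```python
import unicodedata
from typing import Optional, Set

# Hypothetical, deliberately tiny table mapping look-alikes to Latin letters.
HOMOGLYPHS = {
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u0435": "e",  # CYRILLIC SMALL LETTER IE
    "\u043e": "o",  # CYRILLIC SMALL LETTER O
    "\u0440": "p",  # CYRILLIC SMALL LETTER ER
    "\u0441": "c",  # CYRILLIC SMALL LETTER ES
    "\u03bf": "o",  # GREEK SMALL LETTER OMICRON
}


def skeleton(label: str) -> str:
    """Lowercase, drop combining marks, and map known look-alikes to Latin."""
    decomposed = unicodedata.normalize("NFD", label.lower())
    base = (ch for ch in decomposed if not unicodedata.combining(ch))
    return "".join(HOMOGLYPHS.get(ch, ch) for ch in base)


def spoofed_brand(label: str, brands: Set[str]) -> Optional[str]:
    """Return the brand a label imitates, or None if it imitates none."""
    folded = skeleton(label)
    if folded != label.lower() and folded in brands:
        return folded
    return None


if __name__ == "__main__":
    brands = {"adobe", "paypal"}
    print(spoofed_brand("ado\u1e05e", brands))   # adobe
    print(spoofed_brand("p\u0430ypal", brands))  # paypal
    print(spoofed_brand("paypal", brands))       # None
```

A real detector needs a much larger confusables table and some notion of visual similarity; the sketch only shows the compare-after-folding step.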
See also
* Security issues in Unicode
* Internationalized domain name
* Homoglyph
* Faux Cyrillic
* Metal umlaut
* Duplicate characters in Unicode
* Unicode equivalence
* Typosquatting
* Leet
* Gyaru-moji
* Yaminjeongeum
* Martian language