
In orthography and typography, a homoglyph is one of two or more graphemes, characters, or glyphs with shapes that appear identical or very similar but may have differing meaning. The designation is also applied to sequences of characters sharing these properties.
In 2008, the Unicode Consortium published its Technical Report #36 on a range of issues deriving from the visual similarity of characters, both within single scripts and between different scripts.
Examples of homoglyphic symbols are (a) the diaeresis and umlaut (both a pair of dots, but with different meaning, although encoded with the same code points); and (b) the hyphen and minus sign (both a short horizontal stroke, but with different meaning, although often encoded with the same code point). Among digits and letters, digit 1 and lowercase l, and likewise digit 0 and capital O, are always encoded separately but in many typefaces are given very similar glyphs. Virtually every example of a homoglyphic pair of characters can potentially be differentiated graphically with clearly distinguishable glyphs and separate code points, but this is not always done.
Typefaces that do not emphatically distinguish the one/el and zero/oh homoglyphs are considered unsuitable for writing formulas, URLs, source code, IDs and other text where characters cannot always be differentiated without context. Fonts which distinguish glyphs by means of a slashed zero, for example, are preferred for those uses.
Related terms
The term ''homograph'' is sometimes misused synonymously with ''homoglyph'', but in the usual linguistic sense, homographs are words that are spelled the same but have different meanings, a property of words, not characters.
Allographs are typeface design variants that look different but mean the same thing, for example a dollar sign with one or two strokes. The term ''synoglyph'' has a similar but slightly more abstract meaning: for example, the symbol ⟨£⟩ and the letter ⟨L⟩ (in Lsd) both mean the pound sterling, but only in that context. Allographs and synoglyphs are also known informally as ''display variants''.
Trema for umlaut and diaeresis
The ''trema'' is used to indicate umlaut or diaeresis. In the days of early mechanical typewriters it was typed with the same key (using the "backspace and over-type" technique) used for a double inverted comma. However, the umlaut originated specifically as a pair of short vertical lines, not two dots (see Sütterlin). Incidentally, the two dots above the letter ⟨ë⟩ in Albanian are described as a ''diaeresis'' but do not fulfil the function of a diaeresis.
0 and O; 1, l and I
Two common and important sets of homoglyphs in use today are the digit zero ⟨0⟩ and the capital letter ⟨O⟩; and the digit one ⟨1⟩, the lowercase letter ''L'' ⟨l⟩ and the uppercase ''i'' ⟨I⟩. In the early days of mechanical typewriters it was common to omit keys for the digits ⟨1⟩ and ⟨0⟩, and the keys for the letters ⟨l⟩ and ⟨O⟩ produced glyphs used for both characters. As typists who had used such typewriters transitioned in the 1970s and 1980s to being computer keyboard operators, their old keyboarding habits continued with them and were an occasional source of confusion.
Most current type designs carefully distinguish between these homoglyphs, usually by drawing the digit zero narrower and drawing the digit one with prominent serifs. Early computer print-outs went even further and marked the zero with a slash or dot, which led to a new conflict involving the Scandinavian letter "Ø" and the Greek letter Φ (phi). The redesigning of character types to differentiate these characters has meant less confusion.
Some type designs conform to the DIN 1450 legibility standard by carefully designing such characters to be easy to distinguish: a slashed zero to distinguish it from capital ⟨O⟩; a lowercase ⟨l⟩ with a tail and an uppercase ⟨I⟩ with serifs to distinguish them from the digit ⟨1⟩; a numeral ⟨5⟩ that is distinct from the capital ⟨S⟩; and so on.
An example of confusion due to near-homoglyphs arose from the use of a ⟨y⟩ to represent a ⟨þ⟩ (thorn). Early English typesetters imported Dutch typesets that did not contain the latter character, so used the letter ⟨y⟩ instead because (in Blackletter typeface) they look sufficiently similar.
It has led in modern times to such phenomena as ''Ye olde shoppe'', implying incorrectly that the word ''the'' was formerly written ''ye'' rather than ''þe''. The spelling of the name Menzies (pronounced ''Mengis'' and originally spelled ''Menȝies'') arose for the same reason: the letter ⟨z⟩ was substituted for ⟨ȝ⟩ (yogh).
Multi-letter homoglyphs
Some other combinations of letters look similar, for instance ⟨rn⟩ looks similar to ⟨m⟩, ⟨cl⟩ looks similar to ⟨d⟩, and ⟨vv⟩ looks similar to ⟨w⟩.
In certain narrow-spaced fonts (such as Tahoma), placing the letter ⟨c⟩ next to a letter such as ⟨j⟩, ⟨l⟩ or ⟨i⟩ will create a homoglyph, such as ⟨cj cl ci⟩ (⟨g d a⟩).
When some characters are placed next to each other, seen together at a glance they give the visual impression of another, unrelated character. A more precise way of saying this is that some typographic ligatures can look similar to standalone glyphs. For example, the ⟨ﬁ⟩ ligature (of ⟨f⟩ and ⟨i⟩) can look similar to ⟨A⟩ in some typefaces or fonts. This potential for confusion is sometimes an argument made against the use of ligatures.
Canonicalization
Homoglyphs of all kinds can be detected through a process called 'dual canonicalization'.
The first step in this process is to identify homoglyph sets, namely characters appearing the same to a given observer. From here, a single token is specified to represent the homoglyph set. This token is called a canon. The next step is to convert each character in the text to the corresponding canon in a process called canonicalization. If the canons of two runs of text are the same but the original text is different, then a homoglyph exists in the text.
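The steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the homoglyph sets below are a small, hand-picked table invented for the example, the choice of the lowest code point as the canon is arbitrary, and the function names are hypothetical.

```python
# Hypothetical, hand-picked homoglyph sets. A real table would be far larger
# and would depend on the typeface and the observer.
HOMOGLYPH_SETS = [
    {"0", "O"},
    {"1", "l", "I"},
    {"a", "\u0430"},  # Latin a, Cyrillic a
    {"e", "\u0435"},  # Latin e, Cyrillic ie
    {"o", "\u043e"},  # Latin o, Cyrillic o
]


def build_canon_table(sets):
    """Map every character in a homoglyph set to that set's canon."""
    table = {}
    for group in sets:
        canon = min(group)  # arbitrary choice: lowest code point in the set
        for ch in group:
            table[ch] = canon
    return table


CANON = build_canon_table(HOMOGLYPH_SETS)


def canonicalize(text):
    """Replace each character by its canon; other characters pass through."""
    return "".join(CANON.get(ch, ch) for ch in text)


def is_homoglyph_pair(a, b):
    """True if the texts differ but their canonical forms are identical."""
    return a != b and canonicalize(a) == canonicalize(b)


if __name__ == "__main__":
    print(is_homoglyph_pair("paypal", "p\u0430yp\u0430l"))  # Cyrillic a twice
    print(is_homoglyph_pair("paypal", "paypal"))
```

Because this sketch works character by character, it does not catch multi-letter homoglyphs such as ⟨rn⟩ for ⟨m⟩; handling those would require canons for character sequences.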
Homoglyph prevention
Homoglyph attacks can be mitigated through a combination of user awareness and proactive technical measures. Users should be made aware of the risks and encouraged to inspect URLs carefully before clicking. Security tools that scan for homoglyph variations of domain names can automate the detection and blocking of potential threats. In addition, domain name monitoring and strict registration policies can help identify and neutralize look-alike names promptly.
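As a rough illustration of what scanning for homoglyph variations of a domain name might involve, the following Python sketch generates look-alike candidates for a label by substituting one character at a time. The substitution table and the function name are assumptions made for this example only; they are not taken from any particular security product.

```python
# Hypothetical substitution table: each key maps to characters that may look
# similar in some typefaces. Real tools use much larger tables.
SUBSTITUTES = {
    "a": ["\u0430"],        # Cyrillic a
    "e": ["\u0435"],        # Cyrillic ie
    "o": ["\u043e", "0"],   # Cyrillic o, digit zero
    "l": ["1"],             # digit one
    "i": ["\u00ed"],        # i with acute accent
}


def single_substitution_variants(label):
    """Return every label obtained by swapping exactly one character."""
    variants = set()
    for i, ch in enumerate(label):
        for sub in SUBSTITUTES.get(ch, []):
            variants.add(label[:i] + sub + label[i + 1:])
    return variants


if __name__ == "__main__":
    for candidate in sorted(single_substitution_variants("apple")):
        # Print escaped forms so the substituted code points stay visible.
        print(candidate.encode("unicode_escape").decode("ascii"))
```

A monitoring process could compare such candidates against newly registered names. Variants with more than one substitution are left out here because their number grows quickly.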
Unicode homoglyphs
Unicode has code points for many strongly homoglyphic characters, known as "confusables". These present security risks in a variety of situations (addressed in UTR#36) and were called to particular attention in regard to internationalized domain names. In theory at least, one might deliberately spoof a domain name by replacing one character with its homoglyph, thus creating a second domain name, not readily distinguishable from the first, that can be exploited in phishing (''see main article IDN homograph attack'').
In many typefaces, the Greek letter ⟨Α⟩, the Cyrillic letter ⟨А⟩ and the Latin letter ⟨A⟩ are visually identical, as are the Latin letter ⟨a⟩ and the Cyrillic letter ⟨а⟩ (the same can be applied to the Latin letters "aBceHKopTxy" and the Cyrillic letters "аВсеНКорТху"). A domain name can be spoofed simply by substituting one of these forms for another in a separately registered name. There are also many examples of near-homoglyphs within the same script, such as ⟨í⟩ (with an acute accent) and ⟨i⟩ (with a tittle); ⟨É⟩ (''E''-acute), ⟨Ė⟩ (⟨E⟩ with dot above) and ⟨È⟩ (''E''-grave); and ⟨Í⟩ (capital ⟨I⟩ with an acute accent) and ⟨ĺ⟩ (lowercase ⟨L⟩ with acute accent). When discussing this specific security issue, any two sequences of similar characters may be assessed in terms of their potential to be taken as a ''homoglyph pair'', or if the sequences clearly appear to be words, as ''pseudo-homographs'' (noting again that these terms may themselves cause confusion in other contexts).
In the Chinese language, many simplified Chinese characters are homoglyphs of the corresponding traditional Chinese characters.
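The Latin, Greek and Cyrillic look-alikes mentioned above can be told apart programmatically even when a typeface renders them identically. The short Python sketch below uses the standard `unicodedata` module to list the code point and Unicode name of each character; the helper name `describe` is made up for this example.

```python
import unicodedata


def describe(text):
    """Return (code point, Unicode name) for each character in text."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for ch in text
    ]


if __name__ == "__main__":
    # Latin A, Greek Alpha, Cyrillic A: visually identical in many typefaces.
    for code_point, name in describe("A\u0391\u0410"):
        print(code_point, name)
```

Printing escaped code points or names in this way is one simple means of inspecting a suspicious string without relying on how it is displayed.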
Efforts by TLD registries and Web browser designers aim to minimize the risks of homoglyphic confusion. Commonly, this is achieved by prohibiting names which mix character sets from multiple languages (toys-Я-us.org, using the Cyrillic letter ⟨Я⟩, would be invalid, but wíkipedia.org and wikipedia.org still exist as different websites); Canada's .ca registry goes one step further by requiring names which differ only in diacritics to have the same owner and same registrar. The handling of Chinese characters varies: in .org and .info registration of one variant renders the other unavailable to anyone, while in .biz the traditional and simplified versions of the same name are delivered as a two-domain bundle which both point to the same domain name server.
Relevant documentation will be found both on the developers' Web sites, and on an IDN Forum provided by ICANN.
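The rule of rejecting names that mix character sets can be approximated in Python as follows. This is a heuristic sketch, not how any registry or browser actually implements the check: the standard library does not expose the Unicode Script property, so the script is guessed here from the first word of each letter's Unicode name, and the function names are hypothetical.

```python
import unicodedata


def scripts_in(label):
    """Guess the set of scripts used by the letters of a label.

    Assumption: the first word of a letter's Unicode name (for example
    "LATIN" or "CYRILLIC") is used as a stand-in for its script.
    """
    scripts = set()
    for ch in label:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                scripts.add(name.split()[0])
    return scripts


def is_mixed_script(label):
    """True if the letters of the label appear to come from several scripts."""
    return len(scripts_in(label)) > 1


if __name__ == "__main__":
    print(is_mixed_script("toys-\u042f-us"))   # Latin letters plus Cyrillic Ya
    print(is_mixed_script("w\u00edkipedia"))   # Latin only, one accented letter
```

As the article notes, a check of this kind rejects the mixed-script name but still accepts names that differ only in diacritics, which is why some registries apply additional rules.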

The Cyrillic letter ⟨С⟩ not only looks like Latin ⟨C⟩, but also occupies the same button in JCUKEN-QWERTY hybrid layout keyboards. This design nuance can be seen on the C/С button represented in the Keyboard Monument in Yekaterinburg.
See also
* Vehicle registration plates of Bosnia and Herzegovina use only numbers and letters that look the same in the Latin and Cyrillic alphabets.
* A South Korean language game of intentionally substituting Hangul characters for homoglyphs.
External links
https://www.unicode.org/Public/security/latest/confusables.txt - recommended confusable mapping for IDN.