Punycode

Punycode is a representation of Unicode with the limited ASCII character subset used for Internet hostnames. Using Punycode, host names containing Unicode characters are transcoded to a subset of ASCII consisting of letters, digits, and hyphens, which is called the letter–digit–hyphen (LDH) subset. For example, the German ''München'' (English: ''Munich'') is encoded as ''Mnchen-3ya''.

While the Domain Name System (DNS) technically supports arbitrary sequences of octets in domain name labels, the DNS standards recommend the use of the LDH subset of ASCII conventionally used for host names, and require that string comparisons between DNS domain names be case-insensitive. The Punycode syntax is a method of encoding strings containing Unicode characters, such as internationalized domain names (IDNs), into the LDH subset of ASCII favored by DNS. It is specified in IETF Request for Comments (RFC) 3492: ''Punycode: A Bootstring encoding of Unicode for Internationalized Domain Names in Applications (IDNA)'', A. Costello, The Internet Society (March 2003).


Origin of the name

The RFC author, Adam Costello, is reported to have written:
Why “Punycode”? It rhymes with Unicode and is intended to encode Unicode strings. It is “puny” in three senses: The repertoire of characters used in the encoded strings is small, the encoded strings are short, and the implementation is small.


Description

As stated in RFC 3492, "Punycode is an instance of a more general algorithm called ''Bootstring'', which allows strings composed from a small set of 'basic' code points to uniquely represent any string of code points drawn from a larger set." Punycode defines parameters for the general Bootstring algorithm to match the characteristics of Unicode text. This section demonstrates the procedure for Punycode encoding, using as an example the German string "bücher" (English: ''books''), which is encoded as the label "bcher-kva".

To keep the encoding and decoding algorithms simple, no attempt has been made to prevent some encoded values from encoding inadmissible Unicode values; these should be checked for and detected during decoding.

Punycode is designed to work across all scripts, and to be self-optimizing by adapting to the character set ranges within the string as it operates. It is optimized for the case where the string is composed of zero or more ASCII characters plus characters from only one other script, but it will cope with any arbitrary Unicode string.

For DNS use, the domain name string is assumed to have been normalized using nameprep and (for top-level domains) filtered against an officially registered language table before being punycoded. The DNS protocol also sets limits on the acceptable lengths of the output Punycode string.
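As a rough illustration of what "defines parameters for the general Bootstring algorithm" means, the constants below are the values RFC 3492 assigns for Punycode. They are not listed in the text above; the Python names and the `is_basic` helper are chosen here purely for illustration.

```python
# Bootstring parameter values that RFC 3492 fixes for Punycode.
# The constant names are illustrative, not part of any library.
BASE = 36            # number of digit symbols (a-z, 0-9)
TMIN = 1             # smallest threshold
TMAX = 26            # largest threshold
SKEW = 38            # used when adapting the bias
DAMP = 700           # extra damping applied to the first delta
INITIAL_BIAS = 72    # starting bias for threshold computation
INITIAL_N = 128      # first non-basic code point (0x80)
DELIMITER = "-"      # separates basic characters from the encoded part


def is_basic(ch: str) -> bool:
    """Return True for 'basic' code points, i.e. ASCII characters."""
    return ord(ch) < INITIAL_N
```

With these values, every character of "bcher" is basic, while "ü" (code point 252) is not and has to be encoded.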


Separation of ASCII characters

First, all ASCII characters in the string are copied from input to output, skipping over any other characters. For example, "bücher" is copied to "bcher". If any characters were copied, i.e. if there was at least one ASCII character in the input, an ASCII hyphen is appended to the output (e.g., "bücher" → "bcher-", but "ü" → "").

Hyphens are themselves ASCII characters, so they can be present in the input and, if so, are copied to the output. This causes no ambiguity: if the output contains hyphens, the one that was added is always the last one. It marks the end of the ASCII characters.
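The copying step can be sketched in a few lines of Python. The function name `copy_basic` is hypothetical, and the sketch covers only this first step, not the encoding of the non-ASCII characters.

```python
def copy_basic(label: str) -> str:
    """Copy the ASCII characters of a label, then append the delimiter.

    The hyphen is appended only when at least one ASCII character
    was copied, as described above.
    """
    basic = "".join(ch for ch in label if ord(ch) < 128)
    return basic + "-" if basic else basic


print(copy_basic("bücher"))   # ASCII part followed by the delimiter
print(repr(copy_basic("ü")))  # nothing copied, so no delimiter
```

An input that already contains a hyphen, such as "a-ü", yields "a--": the first hyphen is copied from the input and the last one is the added delimiter.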


Encoding the non-ASCII characters

The non-ASCII characters are sorted by Unicode value, lowest first (if a character occurs more than once, the occurrences are sorted by position). Each is then encoded as a single number. This single number defines both the location at which to insert the character and which character to insert. It is built from three values:

* An index ''i'' into the result at which to insert the character, starting at 0 (for insertion at the start).
* The number of possible insertion points ''n'' (current length of the result plus one).
* The reduced code point ''c'', which is the Unicode code point to insert minus 128.

The encoded number is ''n'' × ''c'' + ''i''. By dividing by ''n'' and also taking the remainder, a decoder can determine ''c'' and ''i''.

There are 6 possible insertion points for a character in the string "bcher" (including before the first character and after the last one). "ü" is Unicode code point 0xFC, or 252 (see Latin-1 Supplement), and the reduced code point is 252 − 128, or 124. The "ü" is inserted at position 1, after the "b". Thus the encoder will add the number 6 × 124 + 1 = 745, and the decoder can retrieve the two values by 745 ÷ 6 = 124 (integer division) and 745 mod 6 = 1.

These numbers are strictly increasing. For the second and subsequent inserted characters, the difference between the number and the previous one is written.
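The arithmetic for the first inserted character can be checked with a short sketch. The function names are hypothetical, and the offset of 128 (the first non-ASCII code point) is the value that makes the "bücher" example come out to 745. Only the first insertion is covered; later insertions are written as differences.

```python
def insertion_number(code_point: int, index: int, length: int) -> int:
    """Combine 'which character' and 'where' into one number.

    length is the current length of the output string, so there are
    length + 1 possible insertion points.
    """
    points = length + 1
    reduced = code_point - 128
    return points * reduced + index


def split_insertion_number(number: int, length: int) -> tuple[int, int]:
    """Recover (code point, index) by division with remainder."""
    reduced, index = divmod(number, length + 1)
    return reduced + 128, index


number = insertion_number(ord("ü"), 1, len("bcher"))
print(number)                                       # 745
print(split_insertion_number(number, len("bcher")))  # (252, 1)
```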


Variable-length number encoding

The number is encoded using the letters a through z and the digits 0 through 9. It is not base-36 but a more complex scheme, generalized variable-length integers, which allows the numbers to be concatenated with nothing separating them. This is how "kva" is used to represent the code number 745:

A number system with little-endian ordering is used, which allows variable-length codes without separate delimiters: a digit lower than a threshold value marks that it is the most-significant digit, hence the end of the number. The threshold value depends on the position in the number and also on previous insertions, to increase efficiency. Correspondingly, the weights of the digits vary.

In this case a number system with 36 symbols is used, with the case-insensitive 'a' through 'z' equal to the decimal numbers 0 through 25, and '0' through '9' equal to the decimal numbers 26 through 35. Thus "kva" corresponds to the decimal number sequence "10 21 0".

To decode this string of symbols, a sequence of thresholds is needed; in this case it is (1, 1, 26, 26, ...). The weight (or place value) of the least-significant digit is always 1: 'k' (=10) with a weight of 1 equals 10. After this, the weight of each digit depends on the preceding threshold: for any ''n'', the weight of the (''n''+1)-th digit is ''w'' × (36 − ''t''), where ''w'' is the weight of the ''n''-th digit and ''t'' is the threshold of the ''n''-th digit. So the second symbol has a place value of 36 minus the first threshold of 1, which equals 35, and the first two symbols 'k' (=10) and 'v' (=21) sum to 10 × 1 + 21 × 35. Since the second symbol is not less than its threshold of 1, there is more to come. The third symbol, 'a' (=0), is below its threshold of 26 and therefore ends the number; because its value is 0, its weight (35 × 35 = 1225) does not need to be computed. Therefore "kva" represents the decimal number (10 × 1) + (21 × 35) = 745.

Conversely, the number 745 is encoded as 10 + 21 × 35 + 0 (a weight of 35 is used for the second digit, and the most-significant digit 0 is needed as terminator): 10 → 'k', 21 → 'v', 0 → 'a', so "bücher" → "bcher-kva".

The thresholds themselves are determined for each successive encoded character by an algorithm keeping them between 1 and 26 inclusive. Because the symbols are case-insensitive, the case of the letters can be used to provide information about the original case of the string.

Because the encoding algorithm sorts special characters by code point, for the insertion of a second special character in "bücher" the first possibility is "büücher" with code "bcher-kvaa", the second "bücüher" with code "bcher-kvab", etc. After "bücherü" with code "bcher-kvae" come the codes representing insertion of ý, the Unicode character following ü, starting with "ýbücher" with code "bcher-kvaf" (different from "übücher", coded "bcher-jvab"), etc.
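The "kva" ↔ 745 correspondence can be reproduced with a small sketch of the variable-length integer scheme. The function names are hypothetical. The threshold rule used here (position `k` minus a bias, clamped to the range 1 to 26, with an initial bias of 72) follows RFC 3492 rather than the text above, and the bias is held fixed, so this applies only to the first number in a label.

```python
DIGITS = "abcdefghijklmnopqrstuvwxyz0123456789"
BASE = 36


def thresholds(bias: int = 72):
    """Yield the threshold for each digit position: 1, 1, 26, 26, ..."""
    k = BASE
    while True:
        yield min(max(k - bias, 1), 26)
        k += BASE


def encode_number(value: int, bias: int = 72) -> str:
    """Write value as little-endian digits; a digit below t terminates."""
    out = []
    for t in thresholds(bias):
        if value < t:
            out.append(DIGITS[value])
            return "".join(out)
        out.append(DIGITS[t + (value - t) % (BASE - t)])
        value = (value - t) // (BASE - t)


def decode_number(text: str, bias: int = 72) -> int:
    """Sum digit * weight until a digit falls below its threshold."""
    value, weight = 0, 1
    for ch, t in zip(text.lower(), thresholds(bias)):
        digit = DIGITS.index(ch)
        value += digit * weight
        if digit < t:
            return value
        weight *= BASE - t
    raise ValueError("number is not terminated")


print(encode_number(745))    # kva
print(decode_number("kva"))  # 745
```

Because the terminating digit is recognizable on its own, several such numbers can be written back to back without separators.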


ACE prefix for internationalized domain names

To prevent hyphens in non-international domain names from triggering a Punycode decoding, the string "xn--" is prepended to Punycode sequences in internationalized domain names. This is called the ACE (ASCII Compatible Encoding) prefix. Thus the domain name "bücher.tld" would be represented in a URL as "xn--bcher-kva.tld".
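A minimal sketch of applying and removing the prefix label by label, using Python's built-in "punycode" codec. The helper names `to_ace` and `from_ace` are hypothetical, and the sketch deliberately skips nameprep normalization, label-length limits, and any validation that a real IDNA implementation would perform.

```python
ACE_PREFIX = "xn--"


def to_ace(domain: str) -> str:
    """Punycode-encode each non-ASCII label and add the ACE prefix."""
    labels = []
    for label in domain.split("."):
        if label.isascii():
            labels.append(label)  # plain ASCII labels are left untouched
        else:
            encoded = label.encode("punycode").decode("ascii")
            labels.append(ACE_PREFIX + encoded)
    return ".".join(labels)


def from_ace(domain: str) -> str:
    """Decode only the labels that carry the ACE prefix."""
    labels = []
    for label in domain.split("."):
        if label.lower().startswith(ACE_PREFIX):
            payload = label[len(ACE_PREFIX):]
            labels.append(payload.encode("ascii").decode("punycode"))
        else:
            labels.append(label)
    return ".".join(labels)


print(to_ace("bücher.tld"))           # xn--bcher-kva.tld
print(from_ace("xn--bcher-kva.tld"))  # bücher.tld
```

A hyphenated ASCII label such as "my-site" passes through both functions unchanged, which is the point of the prefix.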


Examples

The following table shows examples of Punycode encodings for different types of input. The Punycode in this table was created using the built-in "punycode" codec of the Python programming language, version 3.8 (s.encode("punycode")).
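For comparison with the built-in codec, the whole encoding procedure can be sketched from scratch. This follows the encoding procedure and bias adaptation function given in RFC 3492; the constants, the `adapt` helper, and the function names are taken from or modeled on that document rather than on the text above, and the sketch omits the overflow, length, and nameprep handling a production implementation would need.

```python
BASE, TMIN, TMAX, SKEW, DAMP = 36, 1, 26, 38, 700
INITIAL_BIAS, INITIAL_N = 72, 128
DIGITS = "abcdefghijklmnopqrstuvwxyz0123456789"


def adapt(delta: int, num_points: int, first_time: bool) -> int:
    """Bias adaptation function as given in RFC 3492."""
    delta = delta // DAMP if first_time else delta // 2
    delta += delta // num_points
    k = 0
    while delta > ((BASE - TMIN) * TMAX) // 2:
        delta //= BASE - TMIN
        k += BASE
    return k + ((BASE - TMIN + 1) * delta) // (delta + SKEW)


def punycode_encode(text: str) -> str:
    """Encode a single label (illustrative sketch, no validation)."""
    basic = [ch for ch in text if ord(ch) < INITIAL_N]
    out = list(basic)
    if basic:
        out.append("-")
    n, delta, bias = INITIAL_N, 0, INITIAL_BIAS
    handled = len(basic)
    while handled < len(text):
        # Next code point to insert: the smallest one not yet handled.
        m = min(ord(ch) for ch in text if ord(ch) >= n)
        delta += (m - n) * (handled + 1)
        n = m
        for ch in text:
            cp = ord(ch)
            if cp < n:
                delta += 1
            elif cp == n:
                # Emit delta as a generalized variable-length integer.
                q, k = delta, BASE
                while True:
                    t = min(max(k - bias, TMIN), TMAX)
                    if q < t:
                        break
                    out.append(DIGITS[t + (q - t) % (BASE - t)])
                    q = (q - t) // (BASE - t)
                    k += BASE
                out.append(DIGITS[q])
                bias = adapt(delta, handled + 1, handled == len(basic))
                delta = 0
                handled += 1
        delta += 1
        n += 1
    return "".join(out)


for word in ("bücher", "München"):
    builtin = word.encode("punycode").decode("ascii")
    print(word, punycode_encode(word), builtin)
```

The loop at the end prints the sketch's output next to the codec's output so the two can be compared directly.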


See also

* Emoji domain
* UTF-5
* UTF-6
* Website spoofing


External links


* IETF Punycode standard
* ICU IDNA Demonstration: an online demonstration of how ICU performs IDN operations
* List of TLDs considered by the Mozilla developers to have an effective anti-spoofing policy for name registration
* IDN and Punycode in IE7
* Simple Punycode converter