The Internationalized Resource Identifier (IRI) is an
internet protocol standard which builds on the
Uniform Resource Identifier
A Uniform Resource Identifier (URI), formerly Universal Resource Identifier, is a unique sequence of characters that identifies an abstract or physical resource, such as resources on a webpage, mail address, phone number, books, real-world obje ...
(URI) protocol by greatly expanding the set of permitted characters.
It was defined by the
Internet Engineering Task Force
The Internet Engineering Task Force (IETF) is a standards organization for the Internet standard, Internet and is responsible for the technical standards that make up the Internet protocol suite (TCP/IP). It has no formal membership roster ...
(IETF) in 2005 in RFC 3987. While URIs are limited to a subset of the
US-ASCII
ASCII ( ), an acronym for American Standard Code for Information Interchange, is a character encoding standard for representing a particular set of 95 (English language focused) printable character, printable and 33 control character, control c ...
character set (characters outside that set must be mapped to octets according to some unspecified character encoding, then
percent-encoded), IRIs may additionally contain most characters from the
Universal Character Set
The Universal Coded Character Set (UCS, Unicode) is a standard set of characters defined by the international standard ISO/ IEC 10646, ''Information technology — Universal Coded Character Set (UCS)'' (plus amendments to that standard), w ...
(Unicode/
ISO 10646
The International Organization for Standardization (ISO ; ; ) is an independent, non-governmental, international standard development organization composed of representatives from the national standards organizations of member countries.
Mem ...
), including
Chinese,
Japanese,
Korean, and
Cyrillic
The Cyrillic script ( ) is a writing system used for various languages across Eurasia. It is the designated national script in various Slavic, Turkic, Mongolic, Uralic, Caucasian and Iranic-speaking countries in Southeastern Europe, Ea ...
characters.
Syntax
IRIs extend URIs by using the
Universal Character Set
The Universal Coded Character Set (UCS, Unicode) is a standard set of characters defined by the international standard ISO/ IEC 10646, ''Information technology — Universal Coded Character Set (UCS)'' (plus amendments to that standard), w ...
, where URIs were limited to
ASCII
ASCII ( ), an acronym for American Standard Code for Information Interchange, is a character encoding standard for representing a particular set of 95 (English language focused) printable character, printable and 33 control character, control c ...
, with far fewer characters. IRIs may be represented by a sequence of octets but by definition are defined as a sequence of characters, because IRIs may be spoken or written by hand.
Compatibility
IRIs are mapped to URIs to retain backwards-compatibility with systems that do not support the new format.
For applications and protocols that do not allow direct consumption of IRIs, the IRI should first be converted to Unicode using
canonical composition normalization (NFC), if not already in Unicode format.
All non-ASCII code points in the IRI should next be encoded as
UTF-8
UTF-8 is a character encoding standard used for electronic communication. Defined by the Unicode Standard, the name is derived from ''Unicode Transformation Format 8-bit''. Almost every webpage is transmitted as UTF-8.
UTF-8 supports all 1,112,0 ...
, and the resulting bytes
percent-encoded, to produce a valid URI.
Example: The IRI https://en.wiktionary.org/wiki/Ῥόδος becomes the URI https://en.wiktionary.org/wiki/%E1%BF%AC%CF%8C%CE%B4%CE%BF%CF%82
ASCII code points that are invalid URI characters ''may'' be encoded the same way, depending on implementation.
This conversion is easily reversible; by definition, converting an IRI to an URI and back again will yield an IRI that is semantically equivalent to the original IRI, even though it may differ in exact representation.
Some protocols may impose further transformations; e.g.
Punycode
Punycode is a representation of Unicode with the limited ASCII character subset used for Internet hostnames. Using Punycode, host names containing Unicode characters are transcoded to a subset of ASCII consisting of letters, digits, and hyphens, w ...
for
DNS labels.
Advantages
There are reasons to see URIs displayed in different languages; mostly, it makes it easier for users who are unfamiliar with the Latin (A–Z) alphabet. Assuming that it isn't too difficult for anyone to replicate arbitrary Unicode on their keyboards, this can make the
URI system more accessible.
Disadvantages
Mixing IRIs and
ASCII
ASCII ( ), an acronym for American Standard Code for Information Interchange, is a character encoding standard for representing a particular set of 95 (English language focused) printable character, printable and 33 control character, control c ...
URIs can make it much easier to execute
phishing
Phishing is a form of social engineering and a scam where attackers deceive people into revealing sensitive information or installing malware such as viruses, worms, adware, or ransomware. Phishing attacks have become increasingly sophisticate ...
attacks that trick someone into believing they are on a different site than they really are. For example, one can replace an ASCII "a" in
www.myfictionalbank.com
with the Unicode look-alike "
α" to give
www.myfictionαlbank.com
and point that IRI to a malicious site. This is known as an
IDN homograph attack
The internationalized domain name (IDN) homograph attack (sometimes written as homoglyph attack) is a method used by malicious parties to deceive computer users about what remote system they are communicating with, by exploiting the fact that man ...
.
While a URI does not provide people with a way to specify web resources using their own alphabets, an IRI does not make clear how web resources can be accessed with keyboards that are not capable of generating the requisite internationalized characters. This means that IRIs are now handled in a way very similar to many other software which might require the use of a non-keyboard
input method
An input method (or input method editor, commonly abbreviated IME) is an operating system component or program that enables users to generate characters not natively available on their input devices by using sequences of characters (or mouse oper ...
when dealing with texts in various languages.
See also
*
IDN (Internationalized Domain Name)
*
Semantic Web
The Semantic Web, sometimes known as Web 3.0, is an extension of the World Wide Web through standards set by the World Wide Web Consortium (W3C). The goal of the Semantic Web is to make Internet data machine-readable.
To enable the encoding o ...
*
Punycode
Punycode is a representation of Unicode with the limited ASCII character subset used for Internet hostnames. Using Punycode, host names containing Unicode characters are transcoded to a subset of ASCII consisting of letters, digits, and hyphens, w ...
*
XRI (Extensible Resource Identifier)
References
External links
RFC 3987: Proposed Standard of Internationalized Resource Identifiers (IRIs)W3C Internationalization ActivityIANA List of Registered URI Schemes
{{URI scheme
Application layer protocols
Internet protocols
Internet Standards
URL
Web services