Trojan Source is a software vulnerability that abuses Unicode's bidirectional characters to display source code differently from how it is actually executed. The exploit takes advantage of how writing scripts with different reading directions are displayed and encoded on computers. It was discovered by Nicholas Boucher and Ross Anderson at the University of Cambridge in late 2021.
Background
Unicode is an encoding standard for representing text, symbols, and glyphs. It is the dominant encoding on computers, used on over 98% of websites.
Because Unicode supports many languages, it must also support different methods of writing text. This requires support for both left-to-right languages, such as English and Russian, and right-to-left languages, such as Hebrew and Arabic. Since Unicode aims to allow more than one writing system in the same text, it must be able to mix scripts with different display orders and resolve conflicting orders. As a solution, Unicode contains characters called ''bidirectional characters'' (''Bidi'') that describe how text is displayed and represented. These characters can be abused so that text is displayed in a different order from the one in which it is stored and interpreted; because the characters themselves are often invisible, the reordering leaves no visible trace.
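As a rough illustration of which characters are involved, the following Python sketch lists the explicit directional formatting characters from the Unicode Bidirectional Algorithm and prints their official names. The dictionary name `BIDI_CONTROLS` and the helper `describe` are hypothetical names chosen for this example.

```python
import unicodedata

# Explicit directional formatting characters defined by the Unicode
# Bidirectional Algorithm (UAX #9), keyed by their usual abbreviations.
# BIDI_CONTROLS is a hypothetical name used only in this sketch.
BIDI_CONTROLS = {
    "LRE": "\u202A",
    "RLE": "\u202B",
    "PDF": "\u202C",
    "LRO": "\u202D",
    "RLO": "\u202E",
    "LRI": "\u2066",
    "RLI": "\u2067",
    "FSI": "\u2068",
    "PDI": "\u2069",
}


def describe(char):
    """Return 'U+XXXX NAME' for a single character."""
    return f"U+{ord(char):04X} {unicodedata.name(char)}"


for abbreviation, char in BIDI_CONTROLS.items():
    # General category "Cf" (format) marks characters that have no
    # visible glyph of their own, which is why they are easy to overlook.
    print(abbreviation, describe(char), unicodedata.category(char))
```

Every character in the table has the general category `Cf`, so a typical renderer draws nothing for it.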
Methodology
In the exploit, bidirectional characters are abused to visually reorder text in source code so that the code executes in a different order from the one displayed. Bidirectional characters can be inserted in the parts of source code that accept arbitrary text, such as string literals; in practice this usually means documentation, variables, or comments.
In one example, an RLI (right-to-left isolate) mark placed inside a triple-quoted string forces the text that follows it to be interpreted differently from how it is displayed: the triple quote comes first (ending the string), followed by a semicolon (starting a new statement), and finally a premature return (returning and ignoring any code below it). The newline terminates the effect of the RLI mark, preventing it from flowing into the code below. Because of the Bidi character, some source code editors and IDEs rearrange the code for display without any visual indication that it has been rearranged, so a human code reviewer would not normally detect the change. However, when the code is passed to a compiler, the compiler may ignore the Bidi character and process the characters in a different order than the one displayed. The compiled program could then execute code that visually appeared to be non-executable.
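The early-return pattern can be sketched in Python. This is an illustrative reconstruction, not the discoverers' proof of concept: the function name `total` and its docstring are hypothetical, and the RLI character is written as an escape so that it stays visible here.

```python
RLI = "\u2067"  # RIGHT-TO-LEFT ISOLATE, written as an escape so it is visible

# Hypothetical victim function. In logical (stored) order, the docstring is
# closed by the triple quote right after the RLI mark, and ";return" is then
# a real statement. A Bidi-aware editor may instead draw "return" as if it
# were part of the docstring text.
source = (
    "def total(a, b):\n"
    "    '''Add a and b. " + RLI + "''' ;return\n"
    "    return a + b\n"
)

namespace = {}
exec(source, namespace)  # the interpreter reads the logical order

result = namespace["total"](2, 3)
print(result)  # None: the hidden early return runs before "return a + b"
```

The interpreter never reaches `return a + b`, even though a reviewer looking at the rendered file may see only a harmless docstring above it.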
Formatting marks can be combined multiple times to create complex attacks.
Impact and mitigation
Programming languages that support Unicode strings and follow Unicode's Bidi algorithm are vulnerable to the exploit. This includes languages such as Java, Go, C, C++, C#, Python, and JavaScript.
While the attack is not strictly an error, many compilers, interpreters, and websites added warnings or mitigations for the exploit. Both GCC and LLVM received requests to deal with the exploit. Marek Polacek submitted a patch to GCC shortly after the exploit was published that implemented a warning for potentially unsafe directional characters; this functionality was merged for GCC 12 under the -Wbidi-chars flag. LLVM also merged similar patches.
Rust fixed the exploit in 1.56.1, rejecting code that includes the characters by default. The developers of Rust found no vulnerable packages prior to the fix.
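A check of this kind can be approximated with a small scanner. The sketch below is not how GCC, LLVM, or Rust implement their checks; it only reports where Bidi control characters occur in a piece of source text, and the names `BIDI_CONTROL_CHARS` and `find_bidi_controls` are hypothetical.

```python
# Explicit Bidi formatting characters: LRE, RLE, PDF, LRO, RLO,
# LRI, RLI, FSI, and PDI. The set name is hypothetical.
BIDI_CONTROL_CHARS = frozenset(
    "\u202A\u202B\u202C\u202D\u202E\u2066\u2067\u2068\u2069"
)


def find_bidi_controls(source):
    """Return (line, column, code point) for each Bidi control character.

    Lines and columns are 1-based; columns count characters, not bytes.
    """
    findings = []
    for line_number, line in enumerate(source.splitlines(), start=1):
        for column, char in enumerate(line, start=1):
            if char in BIDI_CONTROL_CHARS:
                findings.append((line_number, column, f"U+{ord(char):04X}"))
    return findings


sample = "x = 1\ny = '\u2067abc\u2069'\n"
print(find_bidi_controls(sample))
# [(2, 6, 'U+2067'), (2, 10, 'U+2069')]
```

A build step could treat any non-empty result as a failure, which mirrors the reject-by-default approach in spirit, though legitimate right-to-left text in strings or comments would then need an explicit exception.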
Many source code editors and IDEs now make these potentially unsafe characters more visible. Visual Studio Code now renders control characters by default. Notepad++ and Vim already made these characters more visible, as noted in the research paper.
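The idea of making the characters visible can be sketched as a text filter that replaces each Bidi control character with a printable tag. This is only an illustration of the approach, not the rendering logic of any particular editor; the helper name `reveal_bidi` and the `<U+XXXX>` tag format are assumptions made for this example.

```python
import re

# Matches LRE..RLO (U+202A to U+202E) and LRI..PDI (U+2066 to U+2069).
_BIDI_PATTERN = re.compile("[\u202A-\u202E\u2066-\u2069]")


def reveal_bidi(text):
    """Replace each Bidi control character with a visible <U+XXXX> tag."""
    return _BIDI_PATTERN.sub(lambda m: f"<U+{ord(m.group()):04X}>", text)


# Hypothetical line of source code containing a hidden RLI mark.
line = "'''Add a and b. \u2067''' ;return"
print(reveal_bidi(line))
# '''Add a and b. <U+2067>''' ;return
```

Once the tag is shown in place of the invisible character, the text is laid out in its stored order, so the stray `;return` after the closing triple quote becomes obvious to a reviewer.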
Red Hat issued an advisory on its website, labeling the exploit as "moderate". GitHub published a warning on its blog and updated its website to show a dialog box when Bidi characters are detected in a repository's code.
External links
* https://trojansource.codes/, site by the discoverers, Nicholas Boucher and Ross Anderson
* Proof of concept code
* Trojan Source full research paper
* NIST National Vulnerability Database & CVE (Common Vulnerabilities and Exposures)
** CVE-2021-42574: NIST, CVE (BIDI exploit)
** CVE-2021-42694: NIST, CVE (homoglyph attack)
* UAX 9 from the Unicode Consortium, about bidirectional characters and formatting
* Unicode UTR 36 from the Unicode Consortium, which describes the vulnerability in Unicode
* CERT/CC vulnerability report