HOME

TheInfoList



OR:

A decompiler is a
computer program A computer program is a sequence or set of instructions in a programming language for a computer to execute. Computer programs are one component of software, which also includes documentation and other intangible components. A computer progra ...
that translates an
executable In computing, executable code, an executable file, or an executable program, sometimes simply referred to as an executable or binary, causes a computer "to perform indicated tasks according to encoded instructions", as opposed to a data fil ...
file to a high-level source file which can be recompiled successfully. It does therefore the opposite of a typical
compiler In computing, a compiler is a computer program that translates computer code written in one programming language (the ''source'' language) into another language (the ''target'' language). The name "compiler" is primarily used for programs that ...
, which translates a high-level language to a low-level language. Decompilers are usually unable to perfectly reconstruct the original source code, thus frequently will produce
obfuscated code In software development, obfuscation is the act of creating source or machine code that is difficult for humans or computers to understand. Like obfuscation in natural language, it may use needlessly roundabout expressions to compose statemen ...
. Nonetheless, decompilers remain an important tool in the
reverse engineering Reverse engineering (also known as backwards engineering or back engineering) is a process or method through which one attempts to understand through deductive reasoning how a previously made device, process, system, or piece of software accompli ...
of
computer software Software is a set of computer programs and associated documentation and data. This is in contrast to hardware, from which the system is built and which actually performs the work. At the lowest programming level, executable code consists ...
.


Introduction

The term ''decompiler'' is most commonly applied to a program which
translate Translation is the communication of the meaning of a source-language text by means of an equivalent target-language text. The English language draws a terminological distinction (which does not exist in every language) between ''transla ...
s
executable In computing, executable code, an executable file, or an executable program, sometimes simply referred to as an executable or binary, causes a computer "to perform indicated tasks according to encoded instructions", as opposed to a data fil ...
programs (the output from a
compiler In computing, a compiler is a computer program that translates computer code written in one programming language (the ''source'' language) into another language (the ''target'' language). The name "compiler" is primarily used for programs that ...
) into
source code In computing, source code, or simply code, is any collection of code, with or without comment (computer programming), comments, written using a human-readable programming language, usually as plain text. The source code of a Computer program, p ...
in a (relatively) high level language which, when compiled, will produce an executable whose behavior is the same as the original executable program. By comparison, a
disassembler A disassembler is a computer program that translates machine language into assembly language—the inverse operation to that of an assembler. A disassembler differs from a decompiler, which targets a high-level language rather than an assembl ...
translates an executable program into assembly language (and an assembler could be used for assembling it back into an executable program). Decompilation is the act of using a decompiler, although the term can also refer to the output of a decompiler. It can be used for the recovery of lost source code, and is also useful in some cases for
computer security Computer security, cybersecurity (cyber security), or information technology security (IT security) is the protection of computer systems and networks from attack by malicious actors that may result in unauthorized information disclosure, t ...
,
interoperability Interoperability is a characteristic of a product or system to work with other products or systems. While the term was initially defined for information technology or systems engineering services to allow for information exchange, a broader def ...
and
error correction In information theory and coding theory with applications in computer science and telecommunication, error detection and correction (EDAC) or error control are techniques that enable reliable delivery of digital data over unreliable communic ...
. The success of decompilation depends on the amount of information present in the code being decompiled and the sophistication of the analysis performed on it. The bytecode formats used by many virtual machines (such as the
Java Virtual Machine A Java virtual machine (JVM) is a virtual machine that enables a computer to run Java programs as well as programs written in other languages that are also compiled to Java bytecode. The JVM is detailed by a specification that formally describ ...
or the
.NET Framework The .NET Framework (pronounced as "''dot net"'') is a proprietary software framework developed by Microsoft that runs primarily on Microsoft Windows. It was the predominant implementation of the Common Language Infrastructure (CLI) until bein ...
Common Language Runtime The Common Language Runtime (CLR), the virtual machine component of Microsoft .NET Framework, manages the execution of .NET programs. Just-in-time compilation converts the managed code (compiled intermediate language code) into machine instruc ...
) often include extensive metadata and high-level features that make decompilation quite feasible. The application of debug data, i.e. debug-symbols, may enable to reproduce the original names of variables and structures and even the line numbers.
Machine language In computer programming, machine code is any low-level programming language, consisting of machine language instructions, which are used to control a computer's central processing unit (CPU). Each instruction causes the CPU to perform a ver ...
without such metadata or debug data is much harder to decompile. Some compilers and post-compilation tools produce
obfuscated code In software development, obfuscation is the act of creating source or machine code that is difficult for humans or computers to understand. Like obfuscation in natural language, it may use needlessly roundabout expressions to compose statemen ...
(that is, they attempt to produce output that is very difficult to decompile, or that decompiles to confusing output). This is done to make it more difficult to
reverse engineer Reverse engineering (also known as backwards engineering or back engineering) is a process or method through which one attempts to understand through deductive reasoning how a previously made device, process, system, or piece of software accompli ...
the executable. While decompilers are normally used to (re-)create source code from binary executables, there are also decompilers to turn specific binary data files into human-readable and editable sources.


Design

Decompilers can be thought of as composed of a series of phases each of which contributes specific aspects of the overall decompilation process.


Loader

The first decompilation phase loads and parses the input machine code or
intermediate language An intermediate representation (IR) is the data structure or code used internally by a compiler or virtual machine to represent source code. An IR is designed to be conducive to further processing, such as optimization and translation. A "good" ...
program's
binary file A binary file is a computer file that is not a text file. The term "binary file" is often used as a term meaning "non-text file". Many binary file formats contain parts that can be interpreted as text; for example, some computer document fi ...
format. It should be able to discover basic facts about the input program, such as the architecture (Pentium, PowerPC, etc.) and the entry point. In many cases, it should be able to find the equivalent of the main function of a C program, which is the start of the user written code. This excludes the runtime initialization code, which should not be decompiled if possible. If available the symbol tables and debug data are also loaded. The front end may be able to identify the libraries used even if they are linked with the code, this will provide library interfaces. If it can determine the compiler or compilers used it may provide useful information in identifying code idioms.


Disassembly

The next logical phase is the
disassembly A disassembler is a computer program that translates machine language into assembly language—the inverse operation to that of an assembler. A disassembler differs from a decompiler, which targets a high-level language rather than an assembly l ...
of machine code instructions into a machine independent intermediate representation (IR). For example, the
Pentium Pentium is a brand used for a series of x86 architecture-compatible microprocessors produced by Intel. The original Pentium processor from which the brand took its name was first released on March 22, 1993. After that, the Pentium II and P ...
machine instruction mov eax, bx+0x04 might be translated to the IR eax := m bx+4


Idioms

Idiomatic machine code sequences are sequences of code whose combined semantics are not immediately apparent from the instructions' individual semantics. Either as part of the disassembly phase, or as part of later analyses, these idiomatic sequences need to be translated into known equivalent IR. For example, the x86 assembly code: cdq eax ; edx is set to the sign-extension≠edi,edi +(tex)push xor eax, edx sub eax, edx could be translated to eax := abs(eax); Some idiomatic sequences are machine independent; some involve only one instruction. For example, clears the eax register (sets it to zero). This can be implemented with a machine independent simplification rule, such as a = 0. In general, it is best to delay detection of idiomatic sequences if possible, to later stages that are less affected by instruction ordering. For example, the instruction scheduling phase of a compiler may insert other instructions into an idiomatic sequence, or change the ordering of instructions in the sequence. A pattern matching process in the disassembly phase would probably not recognize the altered pattern. Later phases group instruction expressions into more complex expressions, and modify them into a canonical (standardized) form, making it more likely that even the altered idiom will match a higher level pattern later in the decompilation. It is particularly important to recognize the compiler idioms for
subroutine In computer programming, a function or subroutine is a sequence of program instructions that performs a specific task, packaged as a unit. This unit can then be used in programs wherever that particular task should be performed. Functions ma ...
calls,
exception handling In computing and computer programming, exception handling is the process of responding to the occurrence of ''exceptions'' – anomalous or exceptional conditions requiring special processing – during the execution of a program. In general, a ...
, and
switch statement In computer programming languages, a switch statement is a type of selection control mechanism used to allow the value of a variable or expression to change the control flow of program execution via search and map. Switch statements function som ...
s. Some languages also have extensive support for
string String or strings may refer to: *String (structure), a long flexible structure made from threads twisted together, which is used to tie, bind, or hang other objects Arts, entertainment, and media Films * ''Strings'' (1991 film), a Canadian anim ...
s or
long integer In computer science, an integer is a datum of integral data type, a data type that represents some range of mathematical integers. Integral data types may be of different sizes and may or may not be allowed to contain negative values. Integers ar ...
s.


Program analysis

Various program analyses can be applied to the IR. In particular, expression propagation combines the semantics of several instructions into more complex expressions. For example, mov eax, bx+0x04 add eax, bx+0x08 sub bx+0x0Ceax could result in the following IR after expression propagation: m bx+12 := m bx+12- (m bx+4+ m bx+8; The resulting expression is more like high level language, and has also eliminated the use of the machine register eax. Later analyses may eliminate the ebx register.


Data flow analysis

The places where register contents are defined and used must be traced using data flow analysis. The same analysis can be applied to locations that are used for temporaries and local data. A different name can then be formed for each such connected set of value definitions and uses. It is possible that the same local variable location was used for more than one variable in different parts of the original program. Even worse it is possible for the data flow analysis to identify a path whereby a value may flow between two such uses even though it would never actually happen or matter in reality. This may in bad cases lead to needing to define a location as a union of types. The decompiler may allow the user to explicitly break such unnatural dependencies which will lead to clearer code. This of course means a variable is potentially used without being initialized and so indicates a problem in the original program.


Type analysis

A good machine code decompiler will perform type analysis. Here, the way registers or memory locations are used result in constraints on the possible type of the location. For example, an and instruction implies that the operand is an integer; programs do not use such an operation on
floating point In computing, floating-point arithmetic (FP) is arithmetic that represents real numbers approximately, using an integer with a fixed precision, called the significand, scaled by an integer exponent of a fixed base. For example, 12.345 can be r ...
values (except in special library code) or on pointers. An add instruction results in three constraints, since the operands may be both integer, or one integer and one pointer (with integer and pointer results respectively; the third constraint comes from the ordering of the two operands when the types are different). Various high level expressions can be recognized which trigger recognition of structures or arrays. However, it is difficult to distinguish many of the possibilities, because of the freedom that machine code or even some high level languages such as C allow with casts and pointer arithmetic. The example from the previous section could result in the following high level code: struct T1 *ebx; struct T1 ; ebx->v000C -= ebx->v0004 + ebx->v0008;


Structuring

The penultimate decompilation phase involves structuring of the IR into higher level constructs such as while loops and if/then/else conditional statements. For example, the machine code xor eax, eax l0002: or ebx, ebx jge l0003 add eax, bx mov ebx, bx+0x4 jmp l0002 l0003: mov
x10040000 X1, X-1 or X-one may refer to: Transportation Aircraft * Bell X-1, the first aircraft to exceed the speed of sound in controlled level flight Automobiles * BMW X1, a 2009–present German subcompact luxury SUV * Geely Yuanjing X1, a 2017–2021 ...
eax
could be translated into: eax = 0; while (ebx < 0) v10040000 = eax; Unstructured code is more difficult to translate into structured code than already structured code. Solutions include replicating some code, or adding boolean variables.


Code generation

The final phase is the generation of the high level code in the back end of the decompiler. Just as a compiler may have several back ends for generating machine code for different architectures, a decompiler may have several back ends for generating high level code in different high level languages. Just before code generation, it may be desirable to allow an interactive editing of the IR, perhaps using some form of
graphical user interface The GUI ( "UI" by itself is still usually pronounced . or ), graphical user interface, is a form of user interface that allows User (computing), users to Human–computer interaction, interact with electronic devices through graphical icon (comp ...
. This would allow the user to enter comments, and non-generic variable and function names. However, these are almost as easily entered in a post decompilation edit. The user may want to change structural aspects, such as converting a while loop to a for loop. These are less readily modified with a simple text editor, although source
code refactoring In computer programming and software design, code refactoring is the process of restructuring existing computer code—changing the '' factoring''—without changing its external behavior. Refactoring is intended to improve the design, structu ...
tools may assist with this process. The user may need to enter information that failed to be identified during the type analysis phase, e.g. modifying a memory expression to an array or structure expression. Finally, incorrect IR may need to be corrected, or changes made to cause the output code to be more readable.


Legality

The majority of computer programs are covered by
copyright A copyright is a type of intellectual property that gives its owner the exclusive right to copy, distribute, adapt, display, and perform a creative work, usually for a limited time. The creative work may be in a literary, artistic, education ...
laws. Although the precise scope of what is covered by copyright differs from region to region, copyright law generally provides the author (the programmer(s) or employer) with a collection of exclusive rights to the program. These rights include the right to make copies, including copies made into the computer’s RAM (unless creating such a copy is essential for using the program). Since the decompilation process involves making multiple such copies, it is generally prohibited without the authorization of the copyright holder. However, because decompilation is often a necessary step in achieving software
interoperability Interoperability is a characteristic of a product or system to work with other products or systems. While the term was initially defined for information technology or systems engineering services to allow for information exchange, a broader def ...
, copyright laws in both the United States and Europe permit decompilation to a limited extent. In the United States, the copyright
fair use Fair use is a doctrine in United States law that permits limited use of copyrighted material without having to first acquire permission from the copyright holder. Fair use is one of the limitations to copyright intended to balance the intere ...
defence has been successfully invoked in decompilation cases. For example, in '' Sega v. Accolade'', the court held that Accolade could lawfully engage in decompilation in order to circumvent the software locking mechanism used by Sega's game consoles. Additionally, the
Digital Millennium Copyright Act The Digital Millennium Copyright Act (DMCA) is a 1998 United States copyright law that implements two 1996 treaties of the World Intellectual Property Organization (WIPO). It criminalizes production and dissemination of technology, devices, or ...
(PUBLIC LAW 105–304) has proper exemptions for both Security Testing and Evaluation in §1201(i), and Reverse Engineering in §1201(f). In Europe, the 1991 Software Directive explicitly provides for a right to decompile in order to achieve interoperability. The result of a heated debate between, on the one side, software protectionists, and, on the other, academics as well as independent software developers, Article 6 permits decompilation only if a number of conditions are met: * First, a person or entity must have a licence to use the program to be decompiled. * Second, decompilation must be necessary to achieve
interoperability Interoperability is a characteristic of a product or system to work with other products or systems. While the term was initially defined for information technology or systems engineering services to allow for information exchange, a broader def ...
with the target program or other programs. Interoperability information should therefore not be readily available, such as through manuals or API documentation. This is an important limitation. The necessity must be proven by the decompiler. The purpose of this important limitation is primarily to provide an incentive for developers to document and disclose their products' interoperability information. * Third, the decompilation process must, if possible, be confined to the parts of the target program relevant to interoperability. Since one of the purposes of decompilation is to gain an understanding of the program structure, this third limitation may be difficult to meet. Again, the burden of proof is on the decompiler. In addition, Article 6 prescribes that the information obtained through decompilation may not be used for other purposes and that it may not be given to others. Overall, the decompilation right provided by Article 6 codifies what is claimed to be common practice in the software industry. Few European lawsuits are known to have emerged from the decompilation right. This could be interpreted as meaning one of three things: #) the decompilation right is not used frequently and the decompilation right may therefore have been unnecessary, #) the decompilation right functions well and provides sufficient legal certainty not to give rise to legal disputes or #) illegal decompilation goes largely undetected. In a report of 2000 regarding implementation of the Software Directive by the European member states, the
European Commission The European Commission (EC) is the executive of the European Union (EU). It operates as a cabinet government, with 27 members of the Commission (informally known as "Commissioners") headed by a President. It includes an administrative body ...
seemed to support the second interpretation.


Tools

Decompilers usually target a specific binary format. Some are native instruction sets (eg Intel x86, ARM, MIPS), others are bytecode for virtual machines (Dalvik, Java class files, WebAssembly, Ethereum). Due to information loss during compilation, decompilation is almost never perfect, and not all decompilers perform equally well for a given binary format. There are studies comparing the performance of different decompilers.


See also

*
Binary recompiler A binary recompiler is a compiler that takes executable binary files as input, analyzes their structure, applies transformations and optimizations, and outputs new optimized executable binaries. The foundation to the concepts of binary recompilat ...
*
Linker (computing) In computing, a linker or link editor is a computer system program that takes one or more object files (generated by a compiler or an assembler) and combines them into a single executable file, library file, or another "object" file. A simp ...
* Abstract interpretation *
Resource editor The resource fork is a fork or section of a file on Apple's classic Mac OS operating system, which was also carried over to the modern macOS for compatibility, used to store structured data along with the unstructured data stored within the data fo ...


Java decompilers

* Mocha decompiler * JD Decompiler * JAD decompiler


Other decompilers

*
.NET Reflector .NET Reflector is a class browser, decompiler and static analyzer for software created with .NET Framework, originally written by Lutz Roeder. MSDN Magazine named it as one of the Ten Must-Have utilities for developers, and Scott Hanselman li ...
* JEB Decompiler (Android Dalvik, Intel x86, ARM, MIPS, WebAssembly, Ethereum) * Ghidra


References


External links

* {{Authority control * Utility software types Reverse engineering