Bootstrapping (compilers)
   HOME

TheInfoList



OR:

In
computer science Computer science is the study of computation, information, and automation. Computer science spans Theoretical computer science, theoretical disciplines (such as algorithms, theory of computation, and information theory) to Applied science, ...
, bootstrapping is the technique for producing a self-compiling compiler – that is, a
compiler In computing, a compiler is a computer program that Translator (computing), translates computer code written in one programming language (the ''source'' language) into another language (the ''target'' language). The name "compiler" is primaril ...
(or assembler) written in the source
programming language A programming language is a system of notation for writing computer programs. Programming languages are described in terms of their Syntax (programming languages), syntax (form) and semantics (computer science), semantics (meaning), usually def ...
that it intends to compile. An initial core version of the compiler (the ''bootstrap compiler'') is generated in a different language (which could be assembly language); successive expanded versions of the compiler are developed using this minimal subset of the language. The problem of compiling a self-compiling compiler has been called the chicken-or-egg problem in compiler design, and bootstrapping is a solution to this problem. Bootstrapping is a fairly common practice when creating a programming language. Many compilers for many programming languages are bootstrapped, including compilers for
ALGOL ALGOL (; short for "Algorithmic Language") is a family of imperative computer programming languages originally developed in 1958. ALGOL heavily influenced many other languages and was the standard method for algorithm description used by the ...
,
BASIC Basic or BASIC may refer to: Science and technology * BASIC, a computer programming language * Basic (chemistry), having the properties of a base * Basic access authentication, in HTTP Entertainment * Basic (film), ''Basic'' (film), a 2003 film ...
, C,
Common Lisp Common Lisp (CL) is a dialect of the Lisp programming language, published in American National Standards Institute (ANSI) standard document ''ANSI INCITS 226-1994 (S2018)'' (formerly ''X3.226-1994 (R1999)''). The Common Lisp HyperSpec, a hyperli ...
, D, Eiffel,
Elixir An elixir is a sweet liquid used for medical purposes, to be taken orally and intended to cure one's illness. When used as a dosage form, pharmaceutical preparation, an elixir contains at least one active ingredient designed to be taken orall ...
, Go,
Haskell Haskell () is a general-purpose, statically typed, purely functional programming language with type inference and lazy evaluation. Designed for teaching, research, and industrial applications, Haskell pioneered several programming language ...
,
Java Java is one of the Greater Sunda Islands in Indonesia. It is bordered by the Indian Ocean to the south and the Java Sea (a part of Pacific Ocean) to the north. With a population of 156.9 million people (including Madura) in mid 2024, proje ...
,
Modula-2 Modula-2 is a structured, procedural programming language developed between 1977 and 1985/8 by Niklaus Wirth at ETH Zurich. It was created as the language for the operating system and application software of the Lilith personal workstation. It w ...
, Nim,
Oberon Oberon () is a king of the fairy, fairies in Middle Ages, medieval and Renaissance literature. He is best known as a character in William Shakespeare's play ''A Midsummer Night's Dream'', in which he is King of the Fairies and spouse of Titania ...
,
OCaml OCaml ( , formerly Objective Caml) is a General-purpose programming language, general-purpose, High-level programming language, high-level, Comparison of multi-paradigm programming languages, multi-paradigm programming language which extends the ...
, Pascal,
PL/I PL/I (Programming Language One, pronounced and sometimes written PL/1) is a procedural, imperative computer programming language initially developed by IBM. It is designed for scientific, engineering, business and system programming. It has b ...
, Python,
Rust Rust is an iron oxide, a usually reddish-brown oxide formed by the reaction of iron and oxygen in the catalytic presence of water or air moisture. Rust consists of hydrous iron(III) oxides (Fe2O3·nH2O) and iron(III) oxide-hydroxide (FeO(OH) ...
, Scala, Scheme,
TypeScript TypeScript (abbreviated as TS) is a high-level programming language that adds static typing with optional type annotations to JavaScript. It is designed for developing large applications and transpiles to JavaScript. It is developed by Micr ...
, Vala, Zig and more.


Process

A typical bootstrap process works in three or four stages: * Stage 0: preparing an environment for the ''bootstrap compiler'' to work with. This is where the source language and output language of the bootstrap compiler are chosen. In the case of a "
bare machine In information technology, bare machine (or bare-metal computer) is a computer which has no operating system. The software executed by a bare machine, commonly called a "bare metal program" or "bare metal application", is designed to interact dir ...
" (one which has no compiler for any language) the source and output are written as binary
machine code In computer programming, machine code is computer code consisting of machine language instructions, which are used to control a computer's central processing unit (CPU). For conventional binary computers, machine code is the binaryOn nonb ...
, or may be created by cross compiling on some other machine than the target. Otherwise, the bootstrap compiler is to be written in one of the programming languages which does exist on the target machine, and that compiler will generate something which can execute on the target, including a
high-level programming language A high-level programming language is a programming language with strong Abstraction (computer science), abstraction from the details of the computer. In contrast to low-level programming languages, it may use natural language ''elements'', be ea ...
, an
assembly language In computing, assembly language (alternatively assembler language or symbolic machine code), often referred to simply as assembly and commonly abbreviated as ASM or asm, is any low-level programming language with a very strong correspondence bet ...
, an
object file An object file is a file that contains machine code or bytecode, as well as other data and metadata, generated by a compiler or assembler from source code during the compilation or assembly process. The machine code that is generated is kno ...
, or even machine code. * Stage 1: the bootstrap compiler is produced. This compiler is enough to translate its own source into a program which can be executed on the target machine. At this point, all further development is done using the language defined by the bootstrap compiler, and stage 2 begins. * Stage 2: a full compiler is produced by the bootstrap compiler. This is typically done in stages as needed, e.g. compiler for version X of the language will be able to compile features from version X+1, but that compiler does not actually use those features. Once this compiler has been tested and can compile itself, now version X+1 features may be used by subsequent releases of the compiler. * Stage 3: a full compiler is produced by the stage 2 full compiler. If more features are to be added, work resumes at stage 2, with the current stage 3 full compiler replacing the bootstrap compiler. The full compiler is built twice in order to compare the outputs of the two stages. If they are different, either the bootstrap or the full compiler contains a bug.


Advantages

Bootstrapping a compiler has the following advantages: * It is a non-trivial test of the language being compiled, and as such is a form of dogfooding. * Compiler developers and bug reporters only need to know the language being compiled. * Compiler development can be performed in the higher-level language being compiled. * Improvements to the compiler's back-end improve not only general-purpose programs but also the compiler itself. * It is a comprehensive consistency check as it should be able to reproduce its own object code. Note that some of these points assume that the language runtime is also written in the same language.


Methods

If one needs to compile a compiler for language X written in language X, there is the issue of how the first compiler can be compiled. The different methods that are used in practice include: * Implementing an
interpreter Interpreting is translation from a spoken or signed language into another language, usually in real time to facilitate live communication. It is distinguished from the translation of a written text, which can be more deliberative and make use o ...
or
compiler In computing, a compiler is a computer program that Translator (computing), translates computer code written in one programming language (the ''source'' language) into another language (the ''target'' language). The name "compiler" is primaril ...
for language X in language Y.
Niklaus Wirth Niklaus Emil Wirth ( IPA: ) (15 February 1934 – 1 January 2024) was a Swiss computer scientist. He designed several programming languages, including Pascal, and pioneered several classic topics in software engineering. In 1984, he won the Tu ...
reported that he wrote the first Pascal compiler in Fortran. * Another interpreter or compiler for X has already been written in another language Y; this is how Scheme is often bootstrapped. * Earlier versions of the compiler were written in a subset of X for which there existed some other compiler; this is how some supersets of
Java Java is one of the Greater Sunda Islands in Indonesia. It is bordered by the Indian Ocean to the south and the Java Sea (a part of Pacific Ocean) to the north. With a population of 156.9 million people (including Madura) in mid 2024, proje ...
,
Haskell Haskell () is a general-purpose, statically typed, purely functional programming language with type inference and lazy evaluation. Designed for teaching, research, and industrial applications, Haskell pioneered several programming language ...
, and the initial
Free Pascal Free Pascal Compiler (FPC) is a compiler for the closely related programming-language dialects Pascal and Object Pascal. It is free software released under the GNU General Public License, witexception clausesthat allow static linking against it ...
compiler are bootstrapped. * A compiler supporting non-standard language extensions or optional language features can be written without using those extensions and features, to enable it being compiled with another compiler supporting the same base language but a different set of extensions and features. The main parts of the C++ compiler
clang Clang () is a compiler front end for the programming languages C, C++, Objective-C, Objective-C++, and the software frameworks OpenMP, OpenCL, RenderScript, CUDA, SYCL, and HIP. It acts as a drop-in replacement for the GNU Compiler ...
were written in a subset of C++ that can be compiled by both g++ and Microsoft Visual C++. Advanced features are written with some GCC extensions. * The compiler for X is cross compiled from another architecture where there exists a compiler for X; this is how compilers for C are usually ported to other platforms. Also this is the method used for
Free Pascal Free Pascal Compiler (FPC) is a compiler for the closely related programming-language dialects Pascal and Object Pascal. It is free software released under the GNU General Public License, witexception clausesthat allow static linking against it ...
after the initial bootstrap. * Writing the compiler in X; then hand-compiling it from source (most likely in a non-optimized way) and running that on the code to get an optimized compiler.
Donald Knuth Donald Ervin Knuth ( ; born January 10, 1938) is an American computer scientist and mathematician. He is a professor emeritus at Stanford University. He is the 1974 recipient of the ACM Turing Award, informally considered the Nobel Prize of comp ...
used this for his
WEB Web most often refers to: * Spider web, a silken structure created by the animal * World Wide Web or the Web, an Internet-based hypertext system Web, WEB, or the Web may also refer to: Computing * WEB, a literate programming system created by ...
literate programming Literate programming (LP) is a programming paradigm introduced in 1984 by Donald Knuth in which a computer program is given as an explanation of how it works in a natural language, such as English, interspersed (embedded) with snippets of macr ...
system. Methods for distributing compilers in source code include providing a portable
bytecode Bytecode (also called portable code or p-code) is a form of instruction set designed for efficient execution by a software interpreter. Unlike human-readable source code, bytecodes are compact numeric codes, constants, and references (normal ...
version of the compiler, so as to ''bootstrap'' the process of compiling the compiler with itself. The T-diagram is a
notation In linguistics and semiotics, a notation system is a system of graphics or symbols, Character_(symbol), characters and abbreviated Expression (language), expressions, used (for example) in Artistic disciplines, artistic and scientific disciplines ...
used to explain these compiler bootstrap techniques. In some cases, the most convenient way to get a complicated compiler running on a system that has little or no software on it involves a series of ever more sophisticated assemblers and compilers.


History

Assemblers were the first language tools to bootstrap themselves. The first high-level language to provide such a bootstrap was
NELIAC The Navy Electronics Laboratory International ALGOL Compiler (NELIAC) is a dialect and compiler implementation of the programming language ALGOL 58, developed by the Navy Electronics Laboratory (NEL) in 1958. It was designed for numeric and lo ...
in 1958. The first widely used languages to do so were Burroughs B5000 Algol in 1961 and
LISP Lisp (historically LISP, an abbreviation of "list processing") is a family of programming languages with a long history and a distinctive, fully parenthesized Polish notation#Explanation, prefix notation. Originally specified in the late 1950s, ...
in 1962. Hart and Levin wrote a LISP compiler in LISP at MIT in 1962, testing it inside an existing LISP interpreter. Once they had improved the compiler to the point where it could compile its own source code, it was self-hosting. This technique is only possible when an interpreter already exists for the very same language that is to be compiled. It borrows directly from the notion of running a program on itself as input, which is also used in various proofs in
theoretical computer science Theoretical computer science is a subfield of computer science and mathematics that focuses on the Abstraction, abstract and mathematical foundations of computation. It is difficult to circumscribe the theoretical areas precisely. The Associati ...
, such as the variation of the proof that the
halting problem In computability theory (computer science), computability theory, the halting problem is the problem of determining, from a description of an arbitrary computer program and an input, whether the program will finish running, or continue to run for ...
is undecidable that uses Rice's Theorem.


Current efforts

Due to security concerns regarding the Trusting Trust Attack (which involves a compiler being maliciously modified to introduce covert backdoors in programs it compiles or even further replicate the malicious modification in future versions of the compiler itself, creating a perpetual cycle of distrust) and various attacks against binary trustworthiness, multiple projects are working to reduce the effort for not only bootstrapping from source but also allowing everyone to verify that source and executable correspond. These include the Bootstrappable builds project and the Reproducible builds project.


See also

* Self-hosting * Self-interpreter * Indirect self-modification * Tombstone diagram *
Metacompiler In computer science, a compiler-compiler or compiler generator is a programming tool that creates a parser, interpreter, or compiler from some form of formal description of a programming language and machine. The most common type of compiler- ...
*
Falsework Falsework consists of temporary structures used in construction to support a permanent structure until its construction is sufficiently advanced to support itself. For arches, this is specifically called centering. Falsework includes temporary ...


References

{{DEFAULTSORT:Bootstrapping (Compilers) Compilers Compiler construction Compiler theory