llama.cpp is an open source software library that performs inference on various large language models such as Llama.
It is co-developed alongside the GGML project, a general-purpose tensor library. Command-line tools are included with the library, alongside a server with a simple web interface.
Background
Towards the end of September 2022, Georgi Gerganov started work on the GGML library, a C library implementing tensor algebra. Gerganov designed the library with strict memory management and multi-threading in mind. The creation of GGML was inspired by Fabrice Bellard's work on LibNC.
Before llama.cpp, Gerganov worked on a similar library called whisper.cpp, which implemented Whisper, a speech-to-text model by OpenAI.
Development
Development of llama.cpp began in March 2023, when Georgi Gerganov implemented the Llama inference code in pure C/C++ with no dependencies. This improved performance on computers without a GPU or other dedicated hardware, which was a goal of the project.
llama.cpp gained traction with users who lacked specialized hardware, as it could run on just a CPU, including on Android devices.
Though initially designed for CPUs, the project later gained GPU inference support.
As of November 2024, it has more than 67,000 stars on GitHub.
In March 2024, Justine Tunney introduced new optimized matrix multiplication kernels for x86 and ARM CPUs, improving prompt evaluation performance for FP16 and 8-bit quantized data types. These improvements were committed upstream to llama.cpp.
Tunney also created a tool called llamafile that bundles models and llama.cpp into a single file that runs on multiple operating systems. It relies on the Cosmopolitan Libc library, also created by Tunney, which makes C/C++ programs portable across operating systems.
Architecture
llama.cpp supports multiple hardware targets including x86, ARM, CUDA, Metal, Vulkan (version 1.2 or greater) and SYCL.
These back-ends make up the GGML tensor library, which is used by the front-end model-specific llama.cpp code.
llama.cpp supports ahead-of-time model quantization as opposed to on-the-fly quantization.
llama.cpp makes use of several CPU extensions for optimization: AVX, AVX2 and AVX-512 for x86-64, and Neon on ARM.
Apple silicon is an important target for the project.
It supports grammar-based output formatting, such as constraining model output to valid JSON.
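llama.cpp compiles such grammars (written in its GBNF format) into a sampler that masks out tokens the grammar does not allow at each step. The following is a minimal Python sketch of that masking idea; the toy token set, logit values and sample_constrained helper are illustrative assumptions, not llama.cpp's actual API.

    import math, random

    def sample_constrained(logits, allowed_tokens):
        """Sample the next token, restricted to the set the grammar allows."""
        # Drop every token the grammar forbids (equivalent to a -inf logit).
        masked = {t: l for t, l in logits.items() if t in allowed_tokens}
        # Softmax over the remaining tokens.
        m = max(masked.values())
        weights = {t: math.exp(l - m) for t, l in masked.items()}
        total = sum(weights.values())
        probs = {t: w / total for t, w in weights.items()}
        # Draw one token from the renormalized distribution.
        return random.choices(list(probs), weights=list(probs.values()))[0]

    # Example: after emitting '{', a JSON grammar might only allow '"' or '}'.
    logits = {'"': 2.1, '}': 1.3, 'cat': 3.0, '{': 0.2}
    print(sample_constrained(logits, allowed_tokens={'"', '}'}))

Even though the unconstrained model prefers 'cat' here, the grammar guarantees the output stays syntactically valid.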
It also supports speculative decoding.
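In speculative decoding, a small draft model cheaply proposes several tokens which the full model then verifies, so that more than one token can be accepted per expensive forward pass. The following is a conceptual Python sketch of a greedy-verification variant, assuming toy draft and target callables rather than llama.cpp's real API.

    def speculative_step(target, draft, prompt, n_draft=4):
        # 1. The cheap draft model greedily proposes n_draft tokens.
        proposed, ctx = [], list(prompt)
        for _ in range(n_draft):
            tok = draft(ctx)
            proposed.append(tok)
            ctx.append(tok)
        # 2. The expensive target model checks the proposals; keep the longest
        #    agreeing prefix. (Shown as a loop here; in practice the target
        #    scores all proposed positions in a single batched pass.)
        accepted, ctx = [], list(prompt)
        for tok in proposed:
            if target(ctx) == tok:
                accepted.append(tok)
                ctx.append(tok)
            else:
                break
        # 3. On a mismatch, emit the target's own token so progress is always made.
        if len(accepted) < len(proposed):
            accepted.append(target(ctx))
        return accepted

    # Toy character-level "models" that agree for a while, then diverge.
    draft  = lambda ctx: 'a' if len(ctx) < 4 else 'b'
    target = lambda ctx: 'a' if len(ctx) < 4 else 'c'
    print(speculative_step(target, draft, list("hi")))   # ['a', 'a', 'c']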
GGUF file format
The GGUF (GGML Universal File) file format is a binary format that stores both tensors and metadata in a single file, and is designed for fast saving and loading of model data.
It was introduced in August 2023 by the llama.cpp project to better maintain backwards compatibility as support was added for other model architectures.
It superseded previous formats used by the project such as GGML.
GGUF files are typically created by converting models developed with a different machine learning library such as PyTorch.
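The fixed-size GGUF header can be read in a few lines of code. The following is a minimal Python sketch following the field layout in the GGUF specification (ggml/docs/gguf.md); the path model.gguf is a placeholder.

    import struct

    # Read the fixed-size GGUF header: magic, version, tensor count, KV count.
    with open("model.gguf", "rb") as f:
        magic = f.read(4)
        assert magic == b"GGUF", "not a GGUF file"
        version,   = struct.unpack("<I", f.read(4))  # format version (uint32)
        n_tensors, = struct.unpack("<Q", f.read(8))  # number of tensors (uint64)
        n_kv,      = struct.unpack("<Q", f.read(8))  # metadata key/value pairs (uint64)

    print(f"GGUF v{version}: {n_tensors} tensors, {n_kv} metadata entries")
    # The metadata key/value pairs (tokenizer vocabulary, context length, etc.)
    # and the tensor descriptors follow this header.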
Design
The format focuses on quantization, the act of reducing precision in the model weights. This can lead to reduced memory usage and increased speed, at the expense of lower model accuracy.
GGUF supports 2-bit to 8-bit quantized integer types; common floating-point data formats such as float32, float16, and bfloat16; and 1.56-bit quantization.
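As a worked illustration of the trade-off, the Python sketch below applies generic symmetric 8-bit quantization: weights are stored as int8 plus one shared scale, cutting memory roughly fourfold at the cost of rounding error. This simplified scheme is for illustration only and is not one of llama.cpp's actual quantization formats.

    weights = [0.82, -1.94, 0.03, 1.27]            # float32: 4 bytes per weight

    scale = max(abs(w) for w in weights) / 127     # map the largest magnitude to 127
    q = [round(w / scale) for w in weights]        # int8: 1 byte per weight

    dequant = [qi * scale for qi in q]             # approximate reconstruction
    print(q)                                       # [54, -127, 2, 83]
    print([round(d, 3) for d in dequant])          # [0.825, -1.94, 0.031, 1.268]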
The file format also contains the information necessary for running a GPT-like language model, such as the tokenizer vocabulary, context length, tensor info and other attributes.
Supported models
* LLaMA
* Llama 2
* Llama 3
* Mistral 7B
* Mixtral 8x7B
* Mixtral 8x22B
* DBRX
* BERT
* GPT-2
* BLOOM
* Gemma
* Grok-1
* Mamba
* GPT-NeoX
* Flan T5
* DeepSeek