Parallel Thread Execution (PTX or NVPTX) is a low-level parallel thread execution virtual machine and instruction set architecture used in Nvidia's Compute Unified Device Architecture (CUDA) programming environment. The Nvidia CUDA Compiler (NVCC) translates code written in CUDA, a C++-like language, into PTX instructions (an intermediate language, IL), and the graphics driver contains a compiler that translates PTX instructions into executable binary code, which runs on the processing cores of Nvidia graphics processing units (GPUs). The GNU Compiler Collection and LLVM can also generate PTX. Inline PTX assembly can be used in CUDA.
Registers
PTX uses an arbitrarily large processor register set; the output from the compiler is almost pure static single-assignment form, with consecutive lines generally referring to consecutive registers. Programs start with declarations of the form:
.reg .u32 %r<335>; // declare 335 registers %r0, %r1, ..., %r334 of type unsigned 32-bit integer
PTX is a three-argument assembly language, and almost all instructions explicitly list the data type (sign and width) on which they operate. Register names are preceded by a % character and constants are written as literals, e.g.:
shr.u64 %rd14, %rd12, 32; // shift right an unsigned 64-bit integer from %rd12 by 32 positions, result in %rd14
cvt.u64.u32 %rd142, %r112; // convert an unsigned 32-bit integer to 64-bit
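The two instructions above can be modelled on the host as a rough sketch. The C++ function names `shr_u64` and `cvt_u64_u32` are hypothetical and chosen only to mirror the PTX mnemonics; they are not part of any Nvidia API. The sketch assumes the documented PTX behaviour that shift amounts larger than the register width are clamped, which is why the shift is guarded (an unguarded C++ shift by 64 or more would be undefined).

```cpp
#include <cstdint>

// Hypothetical host-side model of `shr.u64 d, a, b`:
// logical right shift of an unsigned 64-bit value by a 32-bit amount.
// PTX clamps over-wide shift amounts, so the result is 0 for b >= 64.
constexpr std::uint64_t shr_u64(std::uint64_t a, std::uint32_t b) {
    return b >= 64 ? 0 : (a >> b);
}

// Hypothetical host-side model of `cvt.u64.u32 d, a`:
// zero-extend an unsigned 32-bit value to 64 bits.
constexpr std::uint64_t cvt_u64_u32(std::uint32_t a) {
    return static_cast<std::uint64_t>(a);
}

// Compile-time checks of the model.
static_assert(shr_u64(0x123456789ABCDEF0ULL, 32) == 0x12345678ULL,
              "shifting right by 32 keeps the upper half");
static_assert(shr_u64(1, 64) == 0, "over-wide shifts yield zero");
static_assert(cvt_u64_u32(0xFFFFFFFFu) == 0x00000000FFFFFFFFULL,
              "conversion zero-extends");
```

The type suffix in each mnemonic plays the role that the fixed-width C++ types play here: the operation is fully determined by the instruction, not inferred from the operands.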
There are predicate registers, but compiled code in shader model 1.0 uses these only in conjunction with branch commands; the conditional branch is:
@%p14 bra $label; // branch to $label if predicate register %p14 is true
The setp.cc.type instruction sets a predicate register to the result of comparing two registers of appropriate type. There is also a set instruction: set.le.u32.u64 %r101, %rd12, %rd28 sets the 32-bit register %r101 to 0xffffffff if the 64-bit register %rd12 is less than or equal to the 64-bit register %rd28; otherwise %r101 is set to 0x00000000.
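The difference between the two comparison forms can be sketched in host-side C++. The function names `setp_le_u64` and `set_le_u32_u64` are hypothetical, chosen to mirror the mnemonics; the sketch only models the result values described above and makes no claim about how the hardware implements them.

```cpp
#include <cstdint>

// Hypothetical model of `setp.le.u64 p, a, b`:
// the result is a one-bit predicate.
constexpr bool setp_le_u64(std::uint64_t a, std::uint64_t b) {
    return a <= b;
}

// Hypothetical model of `set.le.u32.u64 d, a, b`:
// the result is a 32-bit mask, all ones if the comparison holds, else zero.
constexpr std::uint32_t set_le_u32_u64(std::uint64_t a, std::uint64_t b) {
    return a <= b ? 0xffffffffu : 0x00000000u;
}

// Compile-time checks of the model.
static_assert(setp_le_u64(3, 7), "3 <= 7 holds");
static_assert(set_le_u32_u64(3, 7) == 0xffffffffu, "true yields all ones");
static_assert(set_le_u32_u64(9, 7) == 0x00000000u, "false yields zero");
```

An all-ones mask is convenient for branch-free code, since it can be combined with other values through bitwise AND and OR, whereas a predicate is what a guarded instruction such as `@%p bra` consumes.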
There are a few predefined identifiers that denote pseudoregisters. Among others, %tid, %ntid, %ctaid, and %nctaid contain, respectively, thread indices, block dimensions, block indices, and grid dimensions.
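A common use of these pseudoregisters is to derive a unique per-thread index along one dimension. The sketch below models that arithmetic in host-side C++. The function and parameter names are hypothetical stand-ins for the x components of %tid, %ntid, %ctaid and %nctaid, and the formula is the conventional one used in CUDA code, not something the PTX specification mandates.

```cpp
#include <cstdint>

// Hypothetical model: global x index of a thread, computed from
// its index within the block (%tid.x), the block size (%ntid.x)
// and the block's index within the grid (%ctaid.x).
constexpr std::uint32_t global_index_x(std::uint32_t tid_x,
                                       std::uint32_t ntid_x,
                                       std::uint32_t ctaid_x) {
    return ctaid_x * ntid_x + tid_x;
}

// Hypothetical model: total number of threads along x, from the
// block size (%ntid.x) and the number of blocks (%nctaid.x).
constexpr std::uint32_t total_threads_x(std::uint32_t ntid_x,
                                        std::uint32_t nctaid_x) {
    return ntid_x * nctaid_x;
}

// Thread 5 of block 3, with 256 threads per block, has global index 773.
static_assert(global_index_x(5, 256, 3) == 773, "3 * 256 + 5");
// Four blocks of 256 threads give 1024 threads along x.
static_assert(total_threads_x(256, 4) == 1024, "256 * 4");
```

In PTX the same computation would typically read the pseudoregisters into ordinary registers first and then use a multiply-add, since the pseudoregisters themselves are read-only.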
State spaces
Load (ld) and store (st) commands refer to one of several distinct state spaces (memory banks), e.g. ld.param.
There are eight state spaces:
; .reg : registers
; .sreg : special, read-only, platform-specific registers
; .const : shared, read-only memory
; .global : global memory, shared by all threads
; .local : local memory, private to each thread
; .param : parameters passed to the kernel
; .shared : memory shared between threads in a block
; .tex : global texture memory (deprecated)
Shared memory is declared at the start of the PTX file with lines of the form:
.shared .align 8 .b8 pbatch_cache[15744]; // define 15,744 bytes, aligned to an 8-byte boundary
Writing kernels in PTX requires explicitly registering PTX modules via the CUDA Driver API, which is typically more cumbersome than using the CUDA Runtime API and Nvidia's CUDA compiler, nvcc. The GPU Ocelot project provided an API to register PTX modules alongside CUDA Runtime API kernel invocations, though GPU Ocelot is no longer actively maintained.
See also
* Standard Portable Intermediate Representation (SPIR)
* CUDA binary (cubin) – a type of fat binary
External links
PTX ISA page on NVIDIA Developer Zone