Ternary Large Language Model
A 1.58-bit large language model (1.58-bit LLM, also ternary LLM) is a version of a transformer large language model whose weights use only three values: −1, 0, and +1. This restriction theoretically allows the model to replace costly multiplications with additions and to reduce the memory needed to store the weights. Since the end-task performance and perplexity of 1.58-bit LLMs, at least for smaller model sizes (up to 3–4B parameters), are close to those of their "full precision" (16-bit FP16 or BF16) counterparts, this design allows reaching the same artificial intelligence goals with much lower hardware requirements, latency, and training effort. The name comes from the fact that a single trit, the ternary equivalent of a bit, which can take three values, carries log2 3 ≈ 1.58 bits of information. 1.58-bit LLMs are also called 1-bit LLMs (though true 1-bit models also exist). BitNet In 2024, Ma et al., ...
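The arithmetic saving can be illustrated with a minimal Python sketch (illustrative names, not the exact BitNet training procedure): weights are quantized to {−1, 0, +1} with the absmean scheme reported for BitNet b1.58, after which a matrix-vector product needs only additions and subtractions plus one final rescaling.

    import numpy as np

    def absmean_ternarize(w, eps=1e-8):
        """Quantize weights to {-1, 0, +1} by absmean scaling (illustrative)."""
        scale = np.abs(w).mean() + eps              # per-tensor absmean scale
        return np.clip(np.round(w / scale), -1, 1), scale

    def ternary_matvec(wq, scale, x):
        """Ternary matrix times vector: +1 entries add x, -1 entries subtract it,
        0 entries contribute nothing; no multiplications until the final rescale."""
        pos = np.where(wq == 1, x, 0.0).sum(axis=1)
        neg = np.where(wq == -1, x, 0.0).sum(axis=1)
        return scale * (pos - neg)

    w = np.random.randn(4, 8)
    x = np.random.randn(8)
    wq, s = absmean_ternarize(w)
    print(np.allclose(ternary_matvec(wq, s, x), s * (wq @ x)))  # True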



Transformer (Deep Learning Architecture)
The transformer is a deep learning architecture based on the multi-head attention mechanism, in which text is converted to numerical representations called tokens, and each token is converted into a vector via lookup from a word embedding table. At each layer, each token is then contextualized within the scope of the context window with other (unmasked) tokens via a parallel multi-head attention mechanism, allowing the signal for key tokens to be amplified and less important tokens to be diminished. Transformers have the advantage of having no recurrent units, therefore requiring less training time than earlier recurrent neural architectures (RNNs) such as long short-term memory (LSTM). Later variations have been widely adopted for training large language models (LLM) on large (language) datasets. The modern version of the transformer was proposed in the 2017 paper " Attention Is All You Need" by researchers at Google. Transformers were first developed as an improvement ov ...
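As a toy sketch of the core computation (single-head, no learned projections, plain NumPy), scaled dot-product attention mixes each token's value vector with those of the other tokens according to softmax-normalized affinities:

    import numpy as np

    def scaled_dot_product_attention(Q, K, V, mask=None):
        """softmax(Q K^T / sqrt(d_k)) V for one head; Q, K, V: (seq, d_k)."""
        scores = Q @ K.T / np.sqrt(Q.shape[-1])     # pairwise token affinities
        if mask is not None:
            scores = np.where(mask, scores, -1e9)   # hide masked tokens
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
        return weights @ V                          # weighted mix of values

    rng = np.random.default_rng(0)
    Q = K = V = rng.standard_normal((3, 4))         # three tokens, d_k = 4
    print(scaled_dot_product_attention(Q, K, V).shape)  # (3, 4)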



Large Language Model
A large language model (LLM) is a language model trained with self-supervised machine learning on a vast amount of text, designed for natural language processing tasks, especially language generation. The largest and most capable LLMs are generative pretrained transformers (GPTs), which are largely used in generative chatbots such as ChatGPT or Gemini. LLMs can be fine-tuned for specific tasks or guided by prompt engineering. These models acquire predictive power regarding syntax, semantics, and ontologies inherent in human language corpora, but they also inherit inaccuracies and biases present in the data they are trained on. History Before the emergence of transformer-based models in 2017, some language models were considered large relative to the computational and data constraints of their time. In the early 1990s, IBM's statistical models pioneered word alignment techniques for machine translation, laying the groundwork for corpus-based language modeling. A sm ...


Perplexity (LLM)
In information theory, perplexity is a measure of uncertainty in the value of a sample from a discrete probability distribution. The larger the perplexity, the less likely it is that an observer can guess the value which will be drawn from the distribution. Perplexity was originally introduced in 1977 in the context of speech recognition by Frederick Jelinek, Robert Leroy Mercer, Lalit R. Bahl, and James K. Baker. Perplexity of a probability distribution The perplexity ''PP'' of a discrete probability distribution ''p'' is a concept widely used in information theory, machine learning, and statistical modeling. It is defined as :\mathit{PP}(p) = \prod_x p(x)^{-p(x)} = b^{-\sum_x p(x) \log_b p(x)} where ''x'' ranges over the events, where 0^0 is defined to be 1, and where the value of the base ''b'' does not affect the result; ''b'' can be chosen to be 2, 10, ''e'', or any other positive value other than 1. In some contexts, this measure is also referred to as the ''(order-1 true) diversity''. The logarithm of the perplexity is the entropy of the d ...
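The definition can be checked numerically (a small sketch; the term for any zero-probability event is taken as 0, per the 0^0 = 1 convention):

    import math

    def perplexity(p, base=2.0):
        """Perplexity of a discrete distribution p: base ** entropy(p)."""
        entropy = -sum(px * math.log(px, base) for px in p if px > 0)
        return base ** entropy

    uniform = [0.25] * 4
    print(perplexity(uniform))                  # 4.0: like guessing a fair 4-sided die
    print(perplexity(uniform, base=10))         # 4.0: the base does not affect the result
    print(perplexity([0.9, 0.05, 0.03, 0.02]))  # ~1.53: much easier to guess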


FP16
In computing, half precision (sometimes called FP16 or float16) is a binary floating-point computer number format that occupies 16 bits (two bytes in modern computers) in computer memory. It is intended for storage of floating-point values in applications where higher precision is not essential, in particular image processing and neural networks. Almost all modern uses follow the IEEE 754-2008 standard, where the 16-bit base-2 format is referred to as binary16, and the exponent uses 5 bits. This can express values in the range ±65,504, with the minimum value above 1 being 1 + 1/1024. Depending on the computer, half-precision can be over an order of magnitude faster than double precision, e.g. 550 PFLOPS for half-precision vs 37 PFLOPS for double precision on one cloud provider. History Several earlier 16-bit floating point formats have existed including that of Hitachi's HD61810 DSP of 1982 (a 4-bit exponent and a 12-bit mantissa), Thomas J. Scott's WIF of 1991 (5 e ...
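The stated range and precision are easy to verify with NumPy's IEEE 754 binary16 type (a small check, assuming a standard np.float16):

    import numpy as np

    print(np.finfo(np.float16).max)   # 65504.0: largest finite half-precision value
    print(np.finfo(np.float16).eps)   # 0.000977 = 1/1024: the gap above 1, so the
                                      # smallest representable value above 1 is 1 + 1/1024
    print(np.float16(70000))          # inf: overflows the +/-65504 range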


BF16
The bfloat16 (brain floating point) floating-point format is a computer number format occupying 16 bits in computer memory; it represents a wide dynamic range of numeric values by using a floating radix point. This format is a shortened (16-bit) version of the 32-bit IEEE 754 single-precision floating-point format (binary32) with the intent of accelerating machine learning and near-sensor computing. It preserves the approximate dynamic range of 32-bit floating-point numbers by retaining 8 exponent bits, but supports only an 8-bit precision rather than the 24-bit significand of the binary32 format. More so than single-precision 32-bit floating-point numbers, bfloat16 numbers are unsuitable for integer calculations, but this is not their intended use. Bfloat16 is used to reduce the storage requirements and increase the calculation speed of machine learning algorithms. The bfloat16 format was developed by Google Brain, an artificial intelligence research group at Google. It is ...
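Because bfloat16 is simply the top half of a binary32 encoding, a conversion can be sketched in a few lines of Python (truncation shown for brevity; hardware typically rounds to nearest even):

    import struct

    def to_bfloat16_bits(x):
        """Keep the top 16 bits (sign, 8-bit exponent, 7 fraction bits) of binary32."""
        return struct.unpack('>I', struct.pack('>f', x))[0] >> 16

    def from_bfloat16_bits(bits):
        """Re-expand a bfloat16 bit pattern to binary32 by zero-padding."""
        return struct.unpack('>f', struct.pack('>I', bits << 16))[0]

    print(from_bfloat16_bits(to_bfloat16_bits(3.14159265)))  # 3.140625: ~3 decimal digits
    print(from_bfloat16_bits(to_bfloat16_bits(1e38)))        # ~1e38: float32-like range, no overflow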



Artificial Intelligence
Artificial intelligence (AI) is the capability of computational systems to perform tasks typically associated with human intelligence, such as learning, reasoning, problem-solving, perception, and decision-making. It is a field of research in computer science that develops and studies methods and software that enable machines to perceive their environment and use learning and intelligence to take actions that maximize their chances of achieving defined goals. High-profile applications of AI include advanced web search engines (e.g., Google Search); recommendation systems (used by YouTube, Amazon, and Netflix); virtual assistants (e.g., Google Assistant, Siri, and Alexa); autonomous vehicles (e.g., Waymo); generative and creative tools (e.g., ChatGPT and AI art); and superhuman play and analysis in strategy games (e.g., ...




Ternary Numeral System
A ternary numeral system (also called base 3 or trinary) has three as its base. Analogous to a bit, a ternary digit is a trit (trinary digit). One trit is equivalent to log2 3 (about 1.58496) bits of information. Although ''ternary'' most often refers to a system in which the three digits are all non-negative numbers, specifically 0, 1, and 2, the adjective also lends its name to the balanced ternary system, comprising the digits −1, 0 and +1, used in comparison logic and ternary computers. Comparison to other bases Representations of integer numbers in ternary do not get uncomfortably lengthy as quickly as in binary. For example, decimal 365 or senary 1405 corresponds to binary 101101101 (nine bits) and to ternary 111112 (six digits). However, they are still far less compact than the corresponding representations in bases such as decimal – see below for a compact way to codi ...
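The example above can be reproduced with a short base-conversion sketch in Python:

    def to_ternary(n):
        """Digits of a non-negative integer in base 3, most significant first."""
        digits = ''
        while n:
            n, r = divmod(n, 3)
            digits = str(r) + digits
        return digits or '0'

    print(bin(365)[2:])     # 101101101: nine binary digits
    print(to_ternary(365))  # 111112: six ternary digits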




