In computer science, string interning is a method of storing only one copy of each distinct
string
String or strings may refer to:
*String (structure), a long flexible structure made from threads twisted together, which is used to tie, bind, or hang other objects
Arts, entertainment, and media Films
* ''Strings'' (1991 film), a Canadian anim ...
value, which must be
immutable
In object-oriented (OO) and functional programming, an immutable object (unchangeable object) is an object whose state cannot be modified after it is created.Goetz et al. ''Java Concurrency in Practice''. Addison Wesley Professional, 2006, Secti ...
. Interning strings makes some string processing tasks more time-efficient or space-efficient at the cost of requiring more time when the string is created or interned. The distinct values are stored in a string intern pool.
The single copy of each string is called its ''intern'' and is typically looked up by a method of the string class, for example String.intern() in
Java
Java is one of the Greater Sunda Islands in Indonesia. It is bordered by the Indian Ocean to the south and the Java Sea (a part of Pacific Ocean) to the north. With a population of 156.9 million people (including Madura) in mid 2024, proje ...
. All compile-time constant strings in Java are automatically interned using this method.
String interning is supported by some modern
object-oriented
Object-oriented programming (OOP) is a programming paradigm based on the concept of '' objects''. Objects can contain data (called fields, attributes or properties) and have actions they can perform (called procedures or methods and impleme ...
programming language
A programming language is a system of notation for writing computer programs.
Programming languages are described in terms of their Syntax (programming languages), syntax (form) and semantics (computer science), semantics (meaning), usually def ...
s, including Java,
Python
Python may refer to:
Snakes
* Pythonidae, a family of nonvenomous snakes found in Africa, Asia, and Australia
** ''Python'' (genus), a genus of Pythonidae found in Africa and Asia
* Python (mythology), a mythical serpent
Computing
* Python (prog ...
,
PHP
PHP is a general-purpose scripting language geared towards web development. It was originally created by Danish-Canadian programmer Rasmus Lerdorf in 1993 and released in 1995. The PHP reference implementation is now produced by the PHP Group. ...
(since 5.4),
Lua
and
.NET languages.
Lisp
Lisp (historically LISP, an abbreviation of "list processing") is a family of programming languages with a long history and a distinctive, fully parenthesized Polish notation#Explanation, prefix notation.
Originally specified in the late 1950s, ...
,
Scheme,
Julia,
Ruby
Ruby is a pinkish-red-to-blood-red-colored gemstone, a variety of the mineral corundum ( aluminium oxide). Ruby is one of the most popular traditional jewelry gems and is very durable. Other varieties of gem-quality corundum are called sapph ...
and
Smalltalk
Smalltalk is a purely object oriented programming language (OOP) that was originally created in the 1970s for educational use, specifically for constructionist learning, but later found use in business. It was created at Xerox PARC by Learni ...
are among the languages with a
symbol
A symbol is a mark, Sign (semiotics), sign, or word that indicates, signifies, or is understood as representing an idea, physical object, object, or wikt:relationship, relationship. Symbols allow people to go beyond what is known or seen by cr ...
type that are basically interned strings. The library of the
Standard ML of New Jersey
Standard ML of New Jersey (SML/NJ; Standard Meta-Language of New Jersey) is a compiler and integrated development environment for the programming language Standard ML. It is written in Standard ML, except for the runtime system in C language. It ...
contains an
atom
type that does the same thing.
Objective-C
Objective-C is a high-level general-purpose, object-oriented programming language that adds Smalltalk-style message passing (messaging) to the C programming language. Originally developed by Brad Cox and Tom Love in the early 1980s, it was ...
's selectors, which are mainly used as method names, are interned strings.
Objects other than strings can be interned. For example, in Java, when primitive values are
boxed into a
wrapper object, certain values (any
boolean
, any
byte
, any
char
from 0 to 127, and any
short
or
int
between −128 and 127) are interned, and any two boxing conversions of one of these values are guaranteed to result in the same object.
History
Lisp
Lisp (historically LISP, an abbreviation of "list processing") is a family of programming languages with a long history and a distinctive, fully parenthesized Polish notation#Explanation, prefix notation.
Originally specified in the late 1950s, ...
introduced the notion of interned strings for its
symbols
A symbol is a mark, sign, or word that indicates, signifies, or is understood as representing an idea, object, or relationship. Symbols allow people to go beyond what is known or seen by creating linkages between otherwise different concep ...
. Historically, the data structure used as a string intern pool was called an ''oblist'' (when it was implemented as a linked list) or an ''obarray'' (when it was implemented as an array).
Modern Lisp dialects typically distinguish symbols from strings; interning a given string returns an existing symbol or creates a new one, whose ''name'' is that string. Symbols often have additional properties that strings do not such as storage for associated values, or namespacing. The distinction is also useful to prevent accidentally comparing an interned string with a not-necessarily-interned string, which could lead to intermittent failures depending on usage patterns.
Motivation
String interning speeds up string comparisons, which are sometimes a performance bottleneck in applications (such as
compiler
In computing, a compiler is a computer program that Translator (computing), translates computer code written in one programming language (the ''source'' language) into another language (the ''target'' language). The name "compiler" is primaril ...
s and
dynamic programming language
A dynamic programming language is a type of programming language that allows various operations to be determined and executed at runtime. This is different from the compilation phase. Key decisions about variables, method calls, or data types are ...
runtimes) that rely heavily on
associative array
In computer science, an associative array, key-value store, map, symbol table, or dictionary is an abstract data type that stores a collection of (key, value) pairs, such that each possible key appears at most once in the collection. In math ...
s with string keys to look up the attributes and methods of an object. Without interning, comparing two distinct strings may involve examining every character of both. This is slow for several reasons: it is inherently
O(n) in the length of the strings; it typically requires reads from several regions of
memory
Memory is the faculty of the mind by which data or information is encoded, stored, and retrieved when needed. It is the retention of information over time for the purpose of influencing future action. If past events could not be remembe ...
, which take time; and the reads fill up the processor cache, meaning there is less cache available for other needs. With interned strings, a simple
object identity test suffices after the original intern operation; this is typically implemented as a pointer equality test, normally just a single machine instruction with no memory reference at all.
String interning also reduces memory usage if there are many instances of the same string value; for instance, it is read from a
network
Network, networking and networked may refer to:
Science and technology
* Network theory, the study of graphs as a representation of relations between discrete objects
* Network science, an academic field that studies complex networks
Mathematics
...
or from
storage. Such strings may include
magic numbers or
network protocol
A communication protocol is a system of rules that allows two or more entities of a communications system to transmit information via any variation of a physical quantity. The protocol defines the rules, syntax, semantics, and synchronization of ...
information. For example, XML parsers may intern names of tags and attributes to save memory. Network transfer of objects over Java RMI serialization object streams can transfer strings that are interned more efficiently, as the String object's handle is used in place of duplicate objects upon serialization.
Issues
Multithreading
If the interned strings are not immutable, one source of drawbacks is that string interning may be problematic when mixed with
multithreading. In many systems, string interns are required to be global across all threads within an address space (or across any contexts which may share pointers), thus the intern pool(s) are global resources that should be synchronized for safe concurrent access. While this only affects string creation (where the intern pool must be checked and modified if necessary), and
double-checked locking
In software engineering, double-checked locking (also known as "double-checked locking optimization") is a software design pattern used to reduce the overhead of acquiring a lock by testing the locking criterion (the "lock hint") before acquiring t ...
may be used on platforms where this is a safe optimization, the need for mutual exclusion when modifying the intern pool can be expensive.
Contention can also be reduced by partitioning the string space into multiple pools, which can be synchronized independently of one another.
Reclaiming unused interned strings
Many implementations of interned strings do not attempt to reclaim (manually or otherwise) strings that are no longer used. For applications where the number of interned strings is small or fixed, or which are short-lived, the loss of system resources may be tolerable. But for long-running systems where large numbers of string interns are created at runtime, the need to reclaim unused interns may arise. This task can be handled by a
garbage collector
A waste collector, also known as a garbage man, garbage collector, trashman (in the U.S), binman or dustman (in the UK), is a person employed by a public or private enterprise to collect and dispose of municipal solid waste (refuse) and recycla ...
, though for this to work correctly
weak reference
In computer programming, a weak reference is a reference that does not protect the referenced object from collection by a garbage collector, unlike a strong reference. An object referenced ''only'' by weak references – meaning "every chain of re ...
s to string interns must be stored in the intern pool.
See also
*
Flyweight pattern
In computer programming, the flyweight software design pattern refers to an Object (computer science), object that minimizes Computer memory, memory usage by sharing some of its data with other similar objects. The flyweight pattern is one of twe ...
References
{{Reflist
External links
Visual J# String class.NET String ClassGuava Java Library - Interner - Non-permgen String.intern and supports other immutable types with weak and strong referenced implementationsUnderstanding Java's intern() Method for Strings
Software optimization
Interning