Etymology, terminology and definition
The term is borrowed from control theory, where the " observability" of a system measures how well its state can be determined from its outputs. Similarly, software observability measures how well a system's state can be understood from the obtained telemetry (metrics, logs, traces, profiling). The definition of observability varies by vendor: The term is frequently referred to as itsObservability vs. monitoring
Observability and monitoring are sometimes used interchangeably. As tooling, commercial offerings and practices evolved in complexity, "monitoring" was re-branded as observability in order to differentiate new tools from the old. The terms are commonly contrasted in that systems are ''monitored'' using predefined sets of ''telemetry'', and monitored systems may be ''observable''. Majors et al. suggest that engineering teams that only have monitoring tools end up relying on expert foreknowledge (seniority), whereas teams that have observability tools rely on exploratory analysis (curiosity).Telemetry types
Observability relies on three main types of telemetry data: metrics, logs and traces. Those are often referred to as "pillars of observability".Metrics
A metric is a point in time measurement ( scalar) that represents some system state. Examples of common metrics include: * number of HTTP requests per second; * total number of query failures; * database size in bytes; * time in seconds since last garbage collection. Monitoring tools are typically configured to emit alerts when certain metric values exceed set thresholds. Thresholds are set based on knowledge about normal operating conditions and experience. Metrics are typically tagged to facilitate grouping and searchability. Application developers choose what kind of metrics to instrument their software with, before it is released. As a result, when a previously unknown issue is encountered, it is impossible to add new metrics without shipping new code. Furthermore, theirLogs
Logs, or log lines, are generally free-form, unstructured text blobs that are intended to be human readable. Modern logging is structured to enable machine parsability. As with metrics, an application developer must instrument the application upfront and ship new code if different logging information is required. Logs typically include a timestamp and severity level. An event (such as a user request) may be fragmented across multiple log lines and interweave with logs from concurrent events.Traces
Distributed traces
A cloud native application is typically made up of distributed services which together fulfill a single request. A distributed trace is an interrelated series of discrete events (also called spans) that track the progression of a single user request. A trace shows the causal and temporal relationships between the services that interoperate to fulfill a request. Instrumenting an application with traces means sending span information to a tracing backend. The tracing backend correlates the received spans to generate presentable traces. To be able to follow a request as it traverses multiple services, spans are labeled withContinuous profiling
Continuous profiling is another telemetry type used to precisely determine how an application consumes resources.Instrumentation
To be able to observe an application, telemetry about the application's behavior needs to be collected or exported. Instrumentation means generating telemetry alongside the normal operation of the application. Telemetry is then collected by an independent backend for later analysis. Instrumentation can be automatic, or custom. Automatic instrumentation offers blanket coverage and immediate value; custom instrumentation brings higher value but requires more intimate involvement with the instrumented application. Instrumentation can be native - done in-code (modifying the code of the instrumented application) - or out-of-code (e.g. sidecar, eBPF). Verifying new features in production by shipping them together with custom instrumentation is a practice called "observability-driven development"."Pillars of observability"
Metrics, logs and traces are most commonly listed as the pillars of observability. Majors et al. suggest that the pillars of observability are high cardinality, high-dimensionality, and explorability, arguing that runbooks and dashboards have little value because "modern systems rarely fail in precisely the same way twice."Self monitoring
Self monitoring is a practice where observability stacks monitor each other, in order to reduce the risk of inconspicuous outages. Self monitoring may be put in place in addition to high availability and redundancy to further avoid correlated failures.See also
* Application performance management (APM) * OpenTelemetry (OTel) * Real user monitoring (RUM) *External links
Bibliography
* * * *References
{{Reflist Distributed computing Software engineering