Logging is not unfamiliar to most developers.
Logs are an important means of recording an application's operational status. They help us understand how the application is running, troubleshoot issues, and even monitor performance. In web backend applications, logs are generally treated as indispensable: they record user requests, application errors, warnings, and other information.
But are they really indispensable? After reading this article, I think my answer will be: no. Logs take a single form, plain text, which makes them inherently verbose and makes it hard to extract the information we need. Even if you hand them to an AI like ChatGPT, you will not necessarily get a constructive answer: for an application you built yourself, ChatGPT neither knows nor can truly comprehend all of your code, and that is where the problems start.
Why Logs Are Not Indispensable#
Because logs take a single, plain-text form, the information they present is often redundant and hard to extract, and even AI-assisted analysis may not yield clear insights. The main issues with logs are:
- Redundancy and Large Data Volume: Logs often contain a lot of useless information, making it time-consuming to find key information for specific issues.
- Lack of Contextual Relationships: A single log entry can hardly convey the call relationships and context spanning multiple services, especially in a microservices architecture.
- Insufficient Real-time Insights: Relying solely on logs makes it hard to gain timely insights into performance issues and real-time status.
For example, suppose you encounter the following logs while debugging:
```
@ahdark-blog/web:dev: GET /category/Backend-Development 500 in 8714ms
@ahdark-blog/web:dev: ⨯ dist/compiled/react-server-dom-webpack/cjs/react-server-dom-webpack-server.edge.development.js (2105:18) @ stringify
@ahdark-blog/web:dev: ⨯ TypeError: Converting circular structure to JSON
@ahdark-blog/web:dev: --> starting at object with constructor '_Document'
@ahdark-blog/web:dev: | property 'tags' -> object with constructor '_Query'
@ahdark-blog/web:dev: | property 'data' -> object with constructor 'Array'
@ahdark-blog/web:dev: | ...
@ahdark-blog/web:dev: | property 'data' -> object with constructor 'Array'
@ahdark-blog/web:dev: --- index 0 closes the circle
@ahdark-blog/web:dev: at stringify (<anonymous>)
@ahdark-blog/web:dev: at stringify (<anonymous>)
@ahdark-blog/web:dev: digest: "14907588"
@ahdark-blog/web:dev: 2103 | function outlineConsoleValue(request, counter, model) {
@ahdark-blog/web:dev: 2104 | "object" === typeof model && null !== model && doNotLimit.add(model);
@ahdark-blog/web:dev: > 2105 | var json = stringify(model, function (parentPropertyName, value) {
@ahdark-blog/web:dev: | ^
@ahdark-blog/web:dev: 2106 | try {
@ahdark-blog/web:dev: 2107 | return renderConsoleValue(
@ahdark-blog/web:dev: 2108 | request,
@ahdark-blog/web:dev: ⨯ dist/compiled/react-server-dom-webpack/cjs/react-server-dom-webpack-server.edge.development.js (2105:18) @ stringify
@ahdark-blog/web:dev: ⨯ TypeError: Converting circular structure to JSON
@ahdark-blog/web:dev: --> starting at object with constructor '_Document'
@ahdark-blog/web:dev: | property 'tags' -> object with constructor '_Query'
@ahdark-blog/web:dev: | property 'data' -> object with constructor 'Array'
@ahdark-blog/web:dev: | ...
@ahdark-blog/web:dev: | property 'data' -> object with constructor 'Array'
@ahdark-blog/web:dev: --- index 0 closes the circle
@ahdark-blog/web:dev: at stringify (<anonymous>)
@ahdark-blog/web:dev: at stringify (<anonymous>)
@ahdark-blog/web:dev: digest: "14907588"
@ahdark-blog/web:dev: 2103 | function outlineConsoleValue(request, counter, model) {
@ahdark-blog/web:dev: 2104 | "object" === typeof model && null !== model && doNotLimit.add(model);
@ahdark-blog/web:dev: > 2105 | var json = stringify(model, function (parentPropertyName, value) {
@ahdark-blog/web:dev: | ^
@ahdark-blog/web:dev: 2106 | try {
@ahdark-blog/web:dev: 2107 | return renderConsoleValue(
@ahdark-blog/web:dev: 2108 | request,
```
Because of this issue, I abandoned my Next.js frontend rewrite and switched to xLog.
From the logs, we can see that `JSON.stringify` cannot handle circular references, but the logs do not tell us why this issue occurred, how it arose, or how to avoid it. Of course, you could analyze the stack, but stacks are complex, and in many scenarios the place where an error is thrown is not the actual source of the error. This is why we need more observability.
What is Observability#
Observability refers to our ability to monitor and debug applications through metrics, logging, tracing, and other means. It matters because it helps us understand the operational status of applications, troubleshoot issues, and monitor performance. In earlier internet applications, logs were the primary means of improving observability. Nowadays, however, large companies tend to run large-scale distributed tracing systems, centered on tracing and metrics with logs as a supplement, to achieve full-link tracing of their applications.
In other words, improvements in observability are trending towards holistic, full-link approaches rather than isolated, fragmented logging. This shift is driven mainly by the prevalence of microservices and the growth of software scale: a large company's system often consists of thousands of services, and at that scale standalone logs are not continuous enough to pinpoint issues accurately. More tracing and metrics are therefore needed to help business systems with monitoring, troubleshooting, and performance analysis.
Let's first summarize the three main components of observability:
- Logging: Provides event information, mainly used for fine-grained troubleshooting, but generates a large amount of data, making it difficult to capture system behavior comprehensively.
- Metrics: Displays system performance and trends on a macro level, facilitating real-time monitoring and alerts.
- Tracing: Records the complete path of requests in a fine-grained manner, particularly suitable for microservices architecture, enabling cross-service request tracking and precise identification of delays and failures.
Tracing#
Tracing refers to our ability to track the processing of each request in an application, allowing us to understand the application's operational status. In web backend applications, tracing can help us know how much time a request consumes, and better relate the context of function calls and service calls.
The implementation of tracing mainly relies on distributed tracing systems, which act as standardized, structured logs, providing more comprehensive, continuous, and complete information on a unified platform. This frees developers from having to hunt for text logs in every individual service (infeasible at the scale of large companies); instead, they can view all services' trace information on one platform, such as Jaeger or Zipkin.
Thanks to the rich data that tracing standards provide, when a fluctuation in some metric catches our attention, we can directly retrieve the traces associated with it and examine their execution details across the various microservices, perhaps discovering that an underlying microservice panicked while handling the request and that the error propagated upward, degrading the service-level agreement (SLA).
The collection of traces and the propagation of context across service processes are usually handled automatically by microservice frameworks and other infrastructure, but achieving the best results still requires the understanding and cooperation of every developer.
Developers should actively maintain the passing of context through all code paths. For example, in Golang, `context.Context` needs to be passed as a parameter in every function call. In Java, `ThreadLocal` is typically used as the storage medium for context, but in multi-threaded or asynchronous scenarios developers must pass the context along manually and explicitly; otherwise the chain is broken, making effective tracing and monitoring impossible. Rust's OpenTelemetry integration builds on the Tokio Tracing library, which keeps context in thread-local variables; in async calls, a separate `instrument` call is needed to inject the span so the context is carried across, as sketched below.
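To make this concrete, here is a minimal Rust sketch (the function and span names are illustrative) of carrying a span across a `tokio::spawn` boundary with the `Instrument` trait; without `.instrument(...)`, the spawned task would lose the tracing context, because it may run on a different thread:
```rust
use tracing::Instrument;

async fn handle_request() {
    let span = tracing::info_span!("background_work");
    tokio::spawn(
        async move {
            // Everything awaited here is recorded under `background_work`,
            // even though the task may run on another thread.
            tracing::info!("doing work inside the span");
        }
        .instrument(span),
    );
}
```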
Metrics#
Metrics refer to our ability to monitor application performance through indicators such as CPU usage, memory usage, and disk usage. Metrics cover performance data as well as business data and are useful for understanding the operational status of applications at a macro level. Compared to tracing, which records each request in detail, metrics focus on overall performance. They include counters, gauges, histograms, summaries, and so on; most are just single values, but under large-scale data they can quickly reveal how an application is doing. For example, a drop in SLA data can quickly be narrowed down to a specific service or function call, identifying the problem fast.
The implementation of metrics mainly relies on monitoring systems such as Prometheus, Grafana, etc. Metrics are more like data analysis within applications, possessing more macro characteristics compared to logs, but correspondingly, they require a large amount of data as support, so their role is not significant in small-scale applications.
Common metrics scenarios include CPU usage, memory usage, disk usage, network traffic, etc. These data are usually collected by exporters such as `node_exporter` and `cAdvisor`, then stored and displayed through a monitoring system like Prometheus. In applications, we can also record custom metrics, such as request processing time and request success rate; these data help us formulate better optimization strategies.
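As an illustration, here is a minimal sketch of recording such custom metrics with the `prometheus` crate (one common choice in this stack); the metric names and the `once_cell` registration pattern are my own assumptions, not taken from any particular application:
```rust
use once_cell::sync::Lazy;
use prometheus::{
    register_histogram, register_int_counter, Encoder, Histogram, IntCounter, TextEncoder,
};

// Register the metrics once with the default registry.
static REQUESTS_TOTAL: Lazy<IntCounter> = Lazy::new(|| {
    register_int_counter!("http_requests_total", "Total number of HTTP requests").unwrap()
});
static REQUEST_DURATION: Lazy<Histogram> = Lazy::new(|| {
    register_histogram!("http_request_duration_seconds", "Request processing time in seconds")
        .unwrap()
});

fn handle_request() {
    let timer = REQUEST_DURATION.start_timer();
    // ... handle the request ...
    timer.observe_duration();
    REQUESTS_TOTAL.inc();
}

// Expose everything in the Prometheus text format, e.g. behind a `/metrics` endpoint.
fn render_metrics() -> String {
    let mut buf = Vec::new();
    TextEncoder::new()
        .encode(&prometheus::gather(), &mut buf)
        .unwrap();
    String::from_utf8(buf).unwrap()
}
```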
Specific Implementations for Improving Observability in Code#
In code, we can improve observability by introducing some libraries, with OpenTelemetry being the most commonly used today. OpenTelemetry is an open-source observability framework that provides a standardized API for tracing the operational status of applications, and it supports multiple languages, such as Golang, Java, Python, and Rust.
In this article, we mainly introduce the use of OpenTelemetry in Rust. I won't elaborate too much on the core concepts of OpenTelemetry; you can refer to the official documentation for more information: OpenTelemetry Concepts. I will briefly outline a few key concepts.
- Span: The basic unit in OpenTelemetry, representing the execution of a single operation. A span typically includes a start time, an end time, an operation name, and some attributes. Spans can be nested, forming a tree structure that represents the execution process of an operation.
- Trace: A collection of spans representing the entire execution process of a request. Each span carries a trace ID identifying the trace it belongs to, plus its own span ID and attributes.
- Context: The carrier used to pass span and trace information (trace ID, span ID, and attributes) along the execution path. A context can be serialized and passed across processes using a propagator. Note that for a remote incoming context that has already been recorded, you need to use a separate construction method; see the sketch after this list and the "Passing Remote Spans" section below.
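As a rough sketch of how these concepts relate (the tracer and span names here are illustrative), using the raw `opentelemetry` API:
```rust
use opentelemetry::global;
use opentelemetry::trace::{TraceContextExt, Tracer};

fn concepts_demo() {
    let tracer = global::tracer("concepts-demo");
    // A span records one operation; `in_span` also makes it the
    // active span in the current context.
    tracer.in_span("parent_operation", |cx| {
        // The context carries the active span; both spans share the same
        // trace ID and together form a single trace.
        println!("trace id: {}", cx.span().span_context().trace_id());
        tracer.in_span("child_operation", |_cx| {
            // nested span, recorded as a child of `parent_operation`
        });
    });
}
```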
Using `tracing` and `opentelemetry` Together#
`tracing` is a Rust tracing library from the creator of Tokio. It offers a style of tracing that aligns more closely with Rust's design philosophy, making it more elegant than Go's `context.Context`; by leveraging Rust's procedural macros and zero-cost abstractions, it also provides better performance. To add tracing to a function, you simply add `#[tracing::instrument]` before it.
```rust
#[tracing::instrument]
fn my_function() {
    // do something
}

// Only record this span at DEBUG level
#[tracing::instrument(level = "debug")]
fn inner_function() {
    // do something
}

// Automatically record the return value and any error
#[tracing::instrument(ret, err)]
async fn function_with_data() -> Result<String, Box<dyn std::error::Error>> {
    Ok("Hello, world!".to_string())
}

// Automatically record errors
#[tracing::instrument(err)]
async fn function_with_error() -> Result<(), Box<dyn std::error::Error>> {
    // do something
    Ok(())
}
```
Combining `tracing` and `opentelemetry` allows for better tracing of the application's operational status: `tracing` provides the more ergonomic instrumentation, while `opentelemetry` provides the storage and display of the tracing data.
```rust
// Note: import paths assume the opentelemetry 0.21-era crates
// (opentelemetry, opentelemetry_sdk, opentelemetry_otlp, tracing_opentelemetry).
use opentelemetry::{global, trace::TracerProvider as _};
use opentelemetry_otlp::{HttpExporterBuilder, WithExportConfig};
use opentelemetry_sdk::{propagation::TraceContextPropagator, trace::Sampler, Resource};
use tracing_subscriber::{layer::SubscriberExt, EnvFilter, Registry};

pub fn init_tracer(config: &ObservabilityConfig, resource: Resource) {
    let exporter = HttpExporterBuilder::default()
        .with_endpoint(config.endpoint.clone().expect("OTEL_EXPORTER_ENDPOINT not set"));

    let provider = opentelemetry_sdk::trace::TracerProvider::builder()
        .with_config(
            opentelemetry_sdk::trace::Config::default()
                .with_sampler(Sampler::TraceIdRatioBased(config.sample_rate.unwrap_or(1.0)))
                .with_resource(resource),
        )
        .with_batch_exporter(
            exporter.build_span_exporter().expect("Failed to build exporter"),
            opentelemetry_sdk::runtime::Tokio,
        )
        .build();

    // Propagate context across services in the W3C Trace Context format.
    global::set_text_map_propagator(TraceContextPropagator::new());

    // Bridge OpenTelemetry into the `tracing` subscriber stack as a layer.
    let telemetry = tracing_opentelemetry::layer().with_tracer(provider.tracer("system-tracing"));
    let env_filter = EnvFilter::try_from_default_env().unwrap_or(EnvFilter::new("INFO"));
    let subscriber = Registry::default()
        .with(telemetry)
        .with(env_filter)
        .with(tracing_subscriber::fmt::layer());

    tracing::subscriber::set_global_default(subscriber)
        .expect("Failed to install `tracing` subscriber.");
    global::set_tracer_provider(provider);
}
```
This code initializes the tracer provider, bridges it into the Tokio Tracing layer/subscriber stack, and sets it as the global default tracer provider. It is worth noting that we do not use the thread-local context stored in `opentelemetry::Context::current()` (that data may not even exist); instead, the Tokio Tracing span is converted into an OpenTelemetry span when the data is submitted. This is done automatically by the library.
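For completeness, a hypothetical call site might look like the following; the `ObservabilityConfig` fields are inferred from the snippet above, and the endpoint is only an example value:
```rust
use opentelemetry::KeyValue;
use opentelemetry_sdk::Resource;

#[tokio::main]
async fn main() {
    let config = ObservabilityConfig {
        endpoint: Some("http://localhost:4318".to_string()),
        sample_rate: Some(1.0),
    };
    let resource = Resource::new(vec![KeyValue::new("service.name", "my-service")]);
    init_tracer(&config, resource);

    // ... run the application ...

    // Flush any batched spans before the process exits.
    opentelemetry::global::shutdown_tracer_provider();
}
```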
One issue is that `tracing` requires span names to be static (`&'static str`), which fundamentally prohibits dynamic span names. This may be unacceptable in certain scenarios, such as an HTTP service where we might need to generate span names dynamically from the request URL. In such cases, we can use the `otel.name` attribute to specify the span name. Other attributes are also available, such as `otel.status_code`:
- `otel.name`: Span Name
- `otel.status_code`: Span Status Code
- `otel.kind`: Span Kind, including `client`, `server`, `producer`, `consumer`, `internal`, etc.
For example, I built the tracing middleware for Volo like this:
```rust
#[volo::service]
impl<Cx, ReqBody, ResBody, ResErr, S> Service<Cx, Request<ReqBody>> for InnerTracingExtractLayer<S> {
    async fn call(&self, cx: &mut Cx, req: Request<ReqBody>) -> Result<S::Response, S::Error> {
        let method = cx.rpc_info().method().as_str();
        let span = tracing::span!(
            tracing::Level::INFO,
            "rpc_call",
            rpc.method = method,
            otel.name = format!("RPC {}", method),
            otel.kind = "server",
        );
        // ...
    }
}
```
Using in Frameworks like Actix#
Web servers are naturally suited to tracing because they have linear request contexts. In Actix, we can use tracing-actix-web[^2] to integrate tracing.
```rust
use actix_web::{App, web, HttpServer};
use tracing_actix_web::TracingLogger;

fn main() {
    // Init your `tracing` subscriber here!
    let server = HttpServer::new(|| {
        App::new()
            // Mount `TracingLogger` as a middleware
            .wrap(TracingLogger::default())
            .service( /* */ )
    });
}
```
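`TracingLogger` creates a root span for each incoming request and stores it in the request extensions as a `RootSpan`; that is the same `RootSpan` the middleware example later in this article retrieves.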
Other frameworks have similar middleware that you can find. Since OpenTelemetry is the mainstream tracing system in most enterprises, frameworks often ship related support, for example: https://www.cloudwego.io/docs/hertz/tutorials/observability/open-telemetry/
Passing Remote Spans#
In distributed systems, we often need to pass spans to remote services. OpenTelemetry provides a standardized context-passing mechanism; in Rust, however, the complexity of combining context passing with `tracing` means we need some additional handling.
```rust
use opentelemetry::trace::{SpanContext, SpanId, TraceContextExt, TraceId};
use tracing_opentelemetry::OpenTelemetrySpanExt; // provides `set_parent`

// Extract the caller's trace/span IDs from the request metadata and
// attach them as the remote parent of the current `tracing` span.
if let (Some(trace_id), Some(span_id)) = (
    req.metadata().get("trace_id").and_then(|s| s.to_str().ok()),
    req.metadata().get("span_id").and_then(|s| s.to_str().ok()),
) {
    let cx = opentelemetry::Context::new().with_remote_span_context(SpanContext::new(
        TraceId::from_hex(trace_id).unwrap(),
        SpanId::from_hex(span_id).unwrap(),
        Default::default(), // trace flags
        true,               // remote
        Default::default(), // trace state
    ));
    span.set_parent(cx);
}
```
There are some details here:
- We create a remote span context using the `with_remote_span_context` method. Unlike a locally created span context, a remote span context is not submitted to the local tracer; it is converted into a remote span when the data is finally submitted. If you tried to pass the Tokio Tracing span ID instead, the local system would throw an error, because it cannot find that span: it comes from an external source.
- We call the `set_parent` method (provided by `tracing_opentelemetry::OpenTelemetrySpanExt`) on the `tracing` span. This method associates the `tracing` span with the `opentelemetry` span, thus enabling the passing of remote spans. Note that the context we construct is the `opentelemetry` context, not the `tracing` context: a `tracing` context cannot be reconstructed from remote data, and the data we pass over the wire is based on the `opentelemetry` span context.
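The code above only shows the receiving side. For reference, a sketch of the sending side (assumed, with a plain `HashMap` standing in for the RPC metadata) would write the current span's IDs under the same `trace_id` / `span_id` keys:
```rust
use std::collections::HashMap;

use opentelemetry::trace::TraceContextExt;
use tracing_opentelemetry::OpenTelemetrySpanExt; // provides `context()`

fn inject_span_ids(metadata: &mut HashMap<String, String>) {
    // Convert the current `tracing` span into an OpenTelemetry context.
    let cx = tracing::Span::current().context();
    let span_context = cx.span().span_context().clone();
    if span_context.is_valid() {
        // TraceId/SpanId display as hex, matching `from_hex` on the receiving side.
        metadata.insert("trace_id".into(), span_context.trace_id().to_string());
        metadata.insert("span_id".into(), span_context.span_id().to_string());
    }
}
```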
This highlights a trade-off: although OpenTelemetry's adaptation for each language may not be perfect, it emphasizes standardized handling across languages and services. Rust's `tracing`, on the other hand, hews closer to Rust's design philosophy, providing better performance and developer experience, but it cannot cross service boundaries on its own. This is why we need to combine OpenTelemetry and Tracing, even when the microservices system I write is entirely in Rust.
Using `instrument` in Async Tasks#
In Rust, we typically use `tokio::spawn` to start an asynchronous task. However, inside asynchronous tasks we cannot rely on thread-local context for tracing: the lifecycle of an async task is uncertain, and it will likely not execute on the current thread. Therefore, we need to attach tracing information to async tasks manually. For example, when building Actix middleware, we can use the `instrument` method provided by the `tracing::Instrument` trait to attach tracing information to the async task:
```rust
impl<S, B> Service<ServiceRequest> for UserMiddlewareService<S>
where
    S: Service<ServiceRequest, Response = ServiceResponse<B>, Error = actix_web::Error> + 'static,
    S::Future: 'static,
    B: 'static,
{
    // `Response`/`Error`/`Future` associated types and `poll_ready` are elided here.
    fn call(&self, req: ServiceRequest) -> Self::Future {
        let parent_span: Span = req
            .extensions()
            .get::<RootSpan>()
            .map(|s| s.clone().into())
            .unwrap_or(Span::none());
        let span = tracing::info_span!(parent: parent_span, "user_middleware.call");

        let _enter = span.enter(); // Enter the span
        // some actions...
        drop(_enter); // Exit the span

        // Assuming the inner service is stored as an `Rc<S>`.
        let service = self.service.clone();
        async move {
            // some actions...
            service.call(req).await
        }
        // `instrument` comes from `tracing::Instrument`; `boxed_local` from `futures::FutureExt`.
        .instrument(span)
        .boxed_local()
    }
}
```
Here, we create a `Span` using `tracing::info_span!`, then use `instrument` to inject this span as the parent for further tracing inside the asynchronous task.
Conclusion#
Observability is a very important concept: it helps us understand the operational status of applications, troubleshoot issues, and monitor performance. I have been using distributed tracing in Go programs since 2022, and I place great importance on it, not only because its detailed content improves work efficiency, but also because it enhances the completeness and stability of systems. In my view, distributed tracing is a hallmark of modern applications: no application at scale can do without it, and in large companies an entire disaster-recovery plan may hinge on a single metric derived from distributed tracing. We should pay more attention to observability and adopt methods like these in our applications, thereby improving the accuracy and efficiency of problem localization.
The methods and example code provided in this article can help you achieve more efficient monitoring and troubleshooting in actual projects.
Footnotes#
[^2]: tracing-actix-web is a community-maintained Actix Web Tracing middleware.