Monitoring and Security

Traces, Logs, and Metrics


Learning Objectives

  • You understand the importance of observability and know how traces, logs, and metrics contribute to it.
  • You know why traces, logs, and metrics are all needed when monitoring and diagnosing issues in web applications.
  • You understand the role of traces, logs, and metrics in alerting systems.

Observability is the ability to understand the internal state of a system based on the data it produces. It is crucial for diagnosing issues, monitoring performance, and ensuring system reliability. In Web Software Development, observability is achieved through the collection and analysis of data, including traces, logs, and metrics.

Traces

Trace data captures the path of a request as it travels through the different components and services of an application, making it possible to understand how those components and services interact. A trace consists of segments, where each segment is known as a span. A span records the start and end times of an operation, along with relevant information such as error messages or status codes.

To capture traces, each request is given a unique identifier, known as a trace id. This id is passed along with the request as it moves through different services. When collecting trace data, the trace id is added to each span, allowing the spans to be connected into a complete trace of the request's path. This enables developers to pinpoint performance bottlenecks and errors with precision.

As an example, imagine an e-commerce web application with multiple microservices. When a user adds an item to their cart, the request goes from the frontend service to the authentication service, then to the inventory service, and finally to the payment service. Each step generates a span tagged with the request's trace id. These spans are then merged together and visualized.
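
As a rough sketch of how spans and trace ids could fit together in code, consider the TypeScript snippet below. It is illustrative rather than tied to any particular tracing library; the Span shape, the withSpan helper, and the sleep stand-ins for the services are made up for this example.

// A span records the start and end times of one operation,
// tagged with the trace id shared by the whole request.
interface Span {
  traceId: string;
  name: string;
  startMs: number;
  endMs: number;
}

const spans: Span[] = [];

// Runs an operation and records a span carrying the shared trace id.
async function withSpan<T>(
  traceId: string,
  name: string,
  op: () => Promise<T>,
): Promise<T> {
  const startMs = Date.now();
  try {
    return await op();
  } finally {
    spans.push({ traceId, name, startMs, endMs: Date.now() });
  }
}

// Stand-in for the work done by each service.
const sleep = (ms: number) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// One trace id per incoming request, generated at the edge.
const traceId = crypto.randomUUID();

await withSpan(traceId, "authentication", () => sleep(20));
await withSpan(traceId, "inventory-check", () => sleep(120));
await withSpan(traceId, "payment", () => sleep(40));

// Spans sharing the same trace id can now be merged into a single trace.
console.log(spans);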

At a high level, omitting the user's request to the frontend and the final response to the user, such a trace might consist of the spans outlined in Figure 1.

Figure 1 — A trace consisting of multiple spans. The request would have a unique trace id, which could be used to combine the spans together.

In the above example, the trace consists of three spans: authentication, inventory check, and payment processing. When studying the trace, we notice that the inventory check took the longest time, pointing to a potential bottleneck in the system.

Trace visualization software typically shows the time spent in each span, making it easier to identify bottlenecks and performance issues.

While traces are typically collected per request, they can also be captured at a session level. This provides a comprehensive view of the user’s interactions with the application, allowing you to look into prior actions that may have led to an error.

Logs

Logs capture time-stamped records of events that occur within an application or service. They offer a detailed account of everything from routine operations to critical errors. Logs typically come in two forms: unstructured and structured. Unstructured logs are plain text, which may require extra effort to parse. Structured logs are formatted in a consistent way, e.g. using JSON or JSONL, making log data easier to search, filter, and analyze.

In addition to the event message itself, logs often include information such as log levels (debug, info, warning, error), process identifiers, and sometimes user or session identifiers.

Logs can also serve as audit trails for tracking user actions and system events. They provide information on what happened before, during, and after an event within individual systems. During incidents or outages, logs provide information that helps responders address the problem quickly and effectively.

In scalable web applications, log data is often collected from multiple services and aggregated into a centralized service. This allows for centralized monitoring and analysis of logs, making it easier to identify patterns, anomalies, and potential issues. Depending on the type of log data, log entries can also include the trace id, which allows logs to be connected with traces.

As an example, consider a user trying to log in to an application. If the login fails due to invalid credentials, a log entry is generated with a timestamp, log level, message, user id, and trace id. This log entry can be used to trace the user’s login attempt and identify the cause of the failure.

{
  "timestamp": "2025-03-04T12:00:00Z",
  "level": "error",
  "message": "Login failed due to invalid credentials",
  "userId": "9876",
  "traceId": "abc123def456"
}
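
A log entry like the one above could be produced by a small structured logger. The sketch below is illustrative; the log helper and its signature are not from any particular logging library, and the field names simply follow the example entry.

// A minimal structured logger sketch. Writing one JSON object per line
// (JSONL) keeps the output easy to search, filter, and aggregate.
type Level = "debug" | "info" | "warning" | "error";

function log(level: Level, message: string, fields: Record<string, string> = {}) {
  console.log(JSON.stringify({
    timestamp: new Date().toISOString(),
    level,
    message,
    ...fields,
  }));
}

// Produces an entry like the example above.
log("error", "Login failed due to invalid credentials", {
  userId: "9876",
  traceId: "abc123def456",
});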

Metrics

Metrics are numerical data points that capture specific performance characteristics of a system, such as latency, throughput, error rates, CPU usage, and memory consumption. These data points are often aggregated into time series, allowing system performance to be viewed over time.

As metrics are quantitative, they can be easily graphed and analyzed, making it simpler to track trends, identify anomalies, and assess whether the system is meeting performance expectations. Metrics are key for continuous performance monitoring, capacity planning, and automated alerting based on predefined metric thresholds.

When working with metrics, it is important to select appropriate aggregation and sampling methods. As an example, if metrics are collected only once per hour, important details about the system's performance may be lost. On the other hand, collecting metrics too frequently can lead to data overload, making it difficult to identify meaningful patterns.

As an example, collecting the CPU usage of a service tens of times per second would likely produce too much data. Instead, the CPU usage could be sampled every minute to provide a higher-level view that still captures the necessary performance characteristics. At the same time, if short-lived spikes in CPU usage needed to be identified, more frequent sampling might be necessary.
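
As a sketch of such sampling, the snippet below reads a value on a fixed timer and appends it to a time series. Here readCpuUsage is a hypothetical placeholder for a real measurement, and the interval is the knob discussed above.

// One data point in a metric time series.
interface Sample {
  timestamp: string;
  value: number;
}

const cpuSeries: Sample[] = [];

// Placeholder reading in percent; a real value would come from the
// runtime or the operating system.
function readCpuUsage(): number {
  return Math.random() * 100;
}

// Sample once per minute. A shorter interval would catch short-lived
// spikes, at the cost of producing more data to store and analyze.
setInterval(() => {
  cpuSeries.push({ timestamp: new Date().toISOString(), value: readCpuUsage() });
}, 60_000);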

The key to working with metrics is instrumenting services to collect the right data points, and then aggregating and visualizing the data in a way that provides actionable insights.

Alerting

Alerting systems analyze traces, logs, and metrics to detect anomalies or failures. Alerts are notifications that are sent to responsible parties whenever a predefined condition is met. For example, an alert can be triggered when the error rate exceeds a certain threshold, when the latency of a service increases beyond a specified limit, or when the hard drive space is running low.

Alerting rules can also be based on composite metrics. For example, if a service’s CPU usage exceeds 80% and the memory usage is above 75% for a sustained period, an alert is triggered.
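
Such a rule could be evaluated over a sliding window of recent resource samples. The sketch below is illustrative; the sample shape, thresholds, and window size are assumptions rather than a specific alerting system's API.

// One sample of the resource metrics the rule looks at.
interface ResourceSample {
  cpuPercent: number;
  memoryPercent: number;
}

// A composite rule: both conditions must hold for every sample in the
// window before an alert fires (the length guard avoids firing on an
// empty window, since every() is true for empty arrays).
function shouldAlert(window: ResourceSample[]): boolean {
  return window.length > 0 &&
    window.every((s) => s.cpuPercent > 80 && s.memoryPercent > 75);
}

// With five-minute samples, this window represents a sustained 15 minutes.
const recentSamples: ResourceSample[] = [
  { cpuPercent: 85, memoryPercent: 78 },
  { cpuPercent: 90, memoryPercent: 80 },
  { cpuPercent: 88, memoryPercent: 76 },
];

if (shouldAlert(recentSamples)) {
  console.log("ALERT: sustained high CPU and memory usage");
}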

Larger companies have dedicated teams that are responsible for monitoring and responding to alerts. These teams are often on call, ready to respond to incidents 24/7. In smaller organizations, the responsibility for monitoring and responding to alerts may fall on the developers or the DevOps team.

For more on dedicated teams responsible for monitoring and responding to alerts, see the Site Reliability Engineering site by Google.
