Learning Notes #65 – Application Logs, Metrics, MDC

21 January 2025 at 05:45

I am a big fan of logs. I would like to log everything: every request and response of an API. But is that correct? Though logs helped our team greatly during this new year, I wanted to know whether there is a better approach to logging. That search made this blog. In this blog I jot down notes on logging. Let’s log it.

Throughout this blog, I try to generalize things and avoid bias toward any particular language, though here and there you can see me lean towards Python. Also, this is my opinion, not a hard rule.

Which is the best logger?

I’m not here to argue about which logger is the best; they all have their problems. But the worst one is usually the one you build yourself. Sure, existing loggers aren’t perfect, but trying to create your own is often a much bigger mistake.

1. Why Logging Matters

Logging provides visibility into your application’s behavior, helping you to:

  • Diagnose and troubleshoot issues (this is the most common use case)
  • Monitor application health and performance (metrics)
  • Meet compliance and auditing requirements (audit logs)
  • Enable debugging in production environments (we all do this)

However, poorly designed logging strategies can lead to excessive log volumes, higher costs, and difficulty in pinpointing actionable insights.

2. Logging Best Practices

a. Use Structured Logs

Long story short, instead of unstructured plain text, use JSON or other structured formats. This makes parsing and querying easier, especially in log aggregation tools.


{
  "timestamp": "2025-01-20T12:34:56Z",
  "level": "INFO",
  "message": "User login successful",
  "userId": 12345,
  "sessionId": "abcde12345"
}
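For reference, here is a minimal sketch of emitting such structured logs with Python’s standard library only (the field names mirror the example above; packages like python-json-logger do this more thoroughly):

import json
import logging

# A minimal JSON formatter using only the standard library
class JsonFormatter(logging.Formatter):
    def format(self, record):
        payload = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # Pick up extra fields passed via logger.info(..., extra={...})
        for key in ("userId", "sessionId"):
            if hasattr(record, key):
                payload[key] = getattr(record, key)
        return json.dumps(payload)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("structured")
logger.setLevel(logging.INFO)
logger.addHandler(handler)

logger.info("User login successful", extra={"userId": 12345, "sessionId": "abcde12345"})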

b. Leverage Logging Levels

Define and adhere to appropriate logging levels to avoid log bloat:

  • DEBUG: Detailed information for debugging.
  • INFO: General operational messages.
  • WARNING: Indications of potential issues.
  • ERROR: Application errors that require immediate attention.
  • CRITICAL: Severe errors leading to application failure.
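In Python’s logging module these levels map directly; a minimal sketch of how the configured level filters output:

import logging

logging.basicConfig(level=logging.INFO)  # DEBUG is suppressed, INFO and above pass
log = logging.getLogger(__name__)

log.debug("Cache lookup key=%s", "user:12345")      # hidden at INFO level
log.info("User login successful")
log.warning("Retrying request, attempt=%d", 2)
log.error("Payment failed for user_id=%s", 12345)
log.critical("Database connection pool exhausted")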

c. Avoid Sensitive Data

Sanitize your logs to exclude sensitive information like passwords, PII, or API keys. Instead, mask or hash such data. Don’t add tokens, even for testing.
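A sketch of masking and hashing before logging; the helpers mask and hash_pii are hypothetical, not from any library:

import hashlib
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

def mask(value, visible=4):
    # Hypothetical helper: hide all but the last few characters
    return "*" * max(len(value) - visible, 0) + value[-visible:]

def hash_pii(value):
    # Hashing lets you correlate records without exposing the raw value
    return hashlib.sha256(value.encode()).hexdigest()[:12]

logger.info("Card added card=%s email_hash=%s",
            mask("4111111111111111"), hash_pii("user@example.com"))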


d. Include Contextual Information

Incorporate metadata like request IDs, user IDs, or transaction IDs to trace specific events effectively.
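For a one-off, Python’s extra parameter can carry this metadata (the MDC section below shows how a middleware attaches it automatically). A minimal sketch:

import logging

# Every log call must then supply request_id via extra (or a Filter must inject it)
logging.basicConfig(format="%(asctime)s %(levelname)s [req=%(request_id)s] %(message)s")
log = logging.getLogger(__name__)

log.warning("Payment retry scheduled", extra={"request_id": "req-8f2a"})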


3. Log Ingestion at Scale

As applications scale, log ingestion can become a bottleneck. Here’s how to manage it:

a. Centralized Logging

Stream logs to centralized systems like Elasticsearch, Logstash, Kibana (ELK), or cloud-native services like AWS CloudWatch, Azure Monitor, or Google Cloud Logging.

b. Optimize Log Volume

  • Log only necessary information.
  • Use log sampling to reduce verbosity in high-throughput systems.
  • Rotate logs to limit disk usage (see the sketch below).
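Log rotation, for instance, is built into Python’s standard library; a minimal sketch:

import logging
from logging.handlers import RotatingFileHandler

# Keep at most 5 files of 10 MB each; the oldest is deleted automatically
handler = RotatingFileHandler("app.log", maxBytes=10 * 1024 * 1024, backupCount=5)
logger = logging.getLogger("app")
logger.addHandler(handler)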

c. Use Asynchronous Logging

Asynchronous loggers improve application performance by delegating logging to separate threads or processes. (Not suitable all the time; it has its own problems, e.g. records still in the queue can be lost if the process crashes.)
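In Python, the standard library’s QueueHandler/QueueListener pair gives you asynchronous logging without third-party code; a minimal sketch:

import logging
import queue
from logging.handlers import QueueHandler, QueueListener

log_queue = queue.Queue(-1)  # unbounded; a bounded queue may drop records under load

# The application thread only enqueues records, which is cheap
logger = logging.getLogger("async_app")
logger.setLevel(logging.INFO)
logger.addHandler(QueueHandler(log_queue))

# A background thread drains the queue and does the slow I/O
listener = QueueListener(log_queue, logging.StreamHandler())
listener.start()

logger.info("Handled request in %d ms", 42)
listener.stop()  # flushes remaining records; logs still in flight are lost on a hard crash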

d. Method return values are usually important

If you have a log in a method and don’t include the method’s return value, you’re missing important information. Make an effort to include it, even at the expense of slightly less elegant-looking code.

e. Include filename in error messages

Mention the path/to/file:line-number to pinpoint the location of the issue.
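Python’s logging can include the call site automatically via the %(pathname)s and %(lineno)d record attributes; a minimal sketch:

import logging

# pathname and lineno point at the exact line that emitted the log
logging.basicConfig(format="%(levelname)s %(pathname)s:%(lineno)d %(message)s")
logging.getLogger(__name__).error("Failed to parse config")
# Example output: ERROR /path/to/app.py:5 Failed to parse config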

4. Logging Don’ts

a. Don’t Log Everything at the Same Level

Logging all messages at the INFO or DEBUG level creates noise and makes it difficult to identify critical issues.

b. Don’t Hardcode Log Messages

Avoid static, vague, or generic log messages. Use dynamic and descriptive messages that include relevant context.

# Bad Example
Error occurred.

# Good Example
Error occurred while processing payment for user_id=12345, transaction_id=abc-6789.

c. Don’t Log Sensitive or Regulated Data

Exposing personally identifiable information (PII), passwords, or other sensitive data in logs can lead to compliance violations (e.g., GDPR, HIPAA).

d. Don’t Ignore Log Rotation

Failing to implement log rotation can result in disk-space exhaustion, especially in high-traffic systems (log retention).

e. Don’t Overlook Log Correlation

Logs without request IDs, session IDs, or contextual metadata make it difficult to correlate related events.

f. Don’t Forget to Monitor Log Costs

Logging everything without considering storage and processing costs can lead to financial inefficiency in large-scale systems.

g. Keep the log message short

Long, verbose messages have a cost: reading time for humans and ingestion time for the pipeline.

h. Never log inside a loop

This might seem obvious, but just to be clear: logging inside a loop, even if the log level isn’t visible by default, can still hurt performance. It’s best to avoid this whenever possible.

If you absolutely need to log something at a hidden level and decide to break this guideline, keep it short and straightforward; a single summary line after the loop, as sketched below, is usually enough.
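A sketch of the summary pattern; batch and process are hypothetical stand-ins:

import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

def process(item):
    # Hypothetical per-item work
    return item % 2 == 0

batch = [1, 2, 3, 4, 5]

# One summary line after the loop instead of one line per iteration
failed = [item for item in batch if not process(item)]
logger.info("Processed batch size=%d failed=%d", len(batch), len(failed))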

i. Log items you already “have”

We should avoid this:


logger.info("Reached X and value of method is {}", method());

Here, just for logging purposes, we call method() again. Even if the method is cheap, it runs regardless of the effective logging level, because arguments are evaluated before the logger ever checks the level. A safer pattern is sketched below.
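A sketch of the safer pattern, with method() as a stand-in for the example above:

import logging

logging.basicConfig(level=logging.DEBUG)
logger = logging.getLogger(__name__)

def method():
    # Stand-in for the method from the example above
    return 42

# Call once, reuse the value for both the logic and the log
value = method()
logger.info("Reached X and value of method is %s", value)

# Guard genuinely expensive computations behind an explicit level check
if logger.isEnabledFor(logging.DEBUG):
    logger.debug("State dump: %r", {"value": value})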

j. Don’t log iterables

Even if it’s a small list, the concern is that the list might grow and “overcrowd” the log. Writing the contents of a list to the log can balloon it and slow processing noticeably. It also wastes time during debugging. Log a size and a bounded sample instead, as sketched below.
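A sketch of the size-and-sample pattern; projects is a hypothetical list:

import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

projects = [f"project-{i}" for i in range(1000)]

# The log line stays the same size no matter how large the list grows
logger.info("Fetched %d projects (sample: %s)", len(projects), projects[:3])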

k. Don’t Log What the Framework Logs for You

There are great things to log. E.g. the name of the current thread, the time, etc. But those are already written into the log by default almost everywhere. Don’t duplicate these efforts.

l. Don’t log Method Entry/Exit

Log only important events in the system. Entering or exiting a method isn’t an important event. E.g., if I have a method that enables feature X, the log should be “Feature X enabled” and not “enable_feature_X entered”. I have done this a lot.

m. Don’t fill the method

A complex method might include multiple points of failure, so it seems sensible to place logs at multiple points so we can detect failure along the way. Unfortunately, this leads to duplicate logging and verbosity.

Errors typically map to error-handling code, which should log them generically, so all error conditions should already be covered.

Sometimes it is worth changing the flow or behavior of the code so that one log in the shared error path covers all the failure cases, and the logging becomes more elegant.

n. Don’t use AOP logging

AOP (Aspect-Oriented Programming) logging allows you to automatically add logs at specific points in your application, such as when methods are entered or exited.

In Python, AOP-style logging can be implemented using decorators or middleware that inject logs into specific points, such as method entry and exit. While it might seem appealing for detailed tracing, the same problems apply as in other languages like Java.


import functools
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

def log_method_entry_exit(func):
    @functools.wraps(func)  # preserve the wrapped function's name and docstring
    def wrapper(*args, **kwargs):
        logger.info("Entering: %s with args=%s kwargs=%s", func.__name__, args, kwargs)
        result = func(*args, **kwargs)
        logger.info("Exiting: %s with result=%s", func.__name__, result)
        return result
    return wrapper

# Example usage
@log_method_entry_exit
def example_function(x, y):
    return x + y

example_function(5, 3)

Why Avoid AOP Logging in Python

  1. Performance Impact:
    • Injecting logs into every method increases runtime overhead, especially if used extensively in large-scale systems.
    • In Python, where function calls already add some overhead, this can significantly affect performance.
  2. Log Verbosity:
    • If this decorator is applied to every function or method in a system, it produces an enormous amount of log data.
    • Debugging becomes harder because the meaningful logs are lost in the noise of entry/exit logs.
  3. Limited Usefulness:
    • During local development, tools like Python debuggers (pdb), profilers (cProfile, line_profiler), or tracing libraries like trace are far more effective for inspecting function behavior and performance.
  4. CI Issues:
    • Enabling such verbose logging during CI test runs can make tracking test failures more difficult because the logs are flooded with entry/exit messages, obscuring the root cause of failures.

Use Python-specific tools like pdb, ipdb, or IDE-integrated debuggers to inspect code locally.

o. Don’t Double Log

It’s pretty common to log an error right before throwing an exception. However, since most error-handling code is generic, there’s likely already a log in the generic handler, and the same failure ends up logged twice. Log it once, in the handler, as sketched below.
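A sketch of logging once in the generic handler; PaymentError and charge are hypothetical:

import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

class PaymentError(Exception):
    pass

def charge(user_id):
    # Don't log here before raising; the handler below logs it once
    raise PaymentError(f"card declined for user_id={user_id}")

try:
    charge(12345)
except PaymentError:
    # logger.exception records the message and the full stack trace
    logger.exception("Payment failed")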

5. Ensuring Scalability

To keep your logging system robust and scalable:

  • Monitor Log Storage: Set alerts for log storage thresholds.
  • Implement Compression: Compress log files to reduce storage costs (see the sketch below).
  • Automate Archival and Deletion: Regularly archive old logs and purge obsolete data.
  • Benchmark Logging Overhead: Measure the performance impact of logging on your application.
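Compression, for example, can be hooked into Python’s rotation via the standard namer/rotator hooks; a minimal sketch:

import gzip
import logging.handlers
import os
import shutil

handler = logging.handlers.TimedRotatingFileHandler(
    "app.log", when="midnight", backupCount=30
)

# Standard hooks: rename the rotated file and gzip it in place
handler.namer = lambda name: name + ".gz"

def rotator(source, dest):
    with open(source, "rb") as f_in, gzip.open(dest, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)
    os.remove(source)

handler.rotator = rotator
logging.getLogger("app").addHandler(handler)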

6. Logging for Metrics

Below is a list of items I wish to log for metrics.

General API Metrics

  1. General API Metrics on HTTP methods, status codes, latency/duration, request size.
  2. Total requests per endpoint over time. Requests per minute/hour.
  3. Frequency and breakdown of 4XX and 5XX errors.
  4. User ID or API client making the request.

{
  "timestamp": "2025-01-20T12:34:56Z",
  "endpoint": "/projects",
  "method": "POST",
  "status_code": 201,
  "user_id": 12345,
  "request_size_bytes": 512,
  "response_size_bytes": 256,
  "duration_ms": 120
}
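A sketch of a FastAPI middleware that emits a subset of these fields per request (request/response sizes are omitted, as they require reading the bodies):

import json
import logging
import time
from fastapi import FastAPI, Request

app = FastAPI()
metrics_log = logging.getLogger("metrics")
metrics_log.setLevel(logging.INFO)
metrics_log.addHandler(logging.StreamHandler())

@app.middleware("http")
async def log_request_metrics(request: Request, call_next):
    start = time.perf_counter()
    response = await call_next(request)
    # One structured metrics record per request
    metrics_log.info(json.dumps({
        "endpoint": request.url.path,
        "method": request.method,
        "status_code": response.status_code,
        "duration_ms": round((time.perf_counter() - start) * 1000),
    }))
    return response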

Business Specific Metrics

  1. Object (session) creations: number of projects created (daily/weekly).
  2. Average success/failure rate.
  3. Average time to create a session.
  4. Frequency of each action on top of session.

{
  "timestamp": "2025-01-20T12:35:00Z",
  "endpoint": "/projects/12345/actions",
  "action": "edit",
  "status_code": 200,
  "user_id": 12345,
  "duration_ms": 98
}

Performance Metrics

  1. Database query metrics on execution time, no. of queries per request.
  2. Third party service metrics on time spent, success/failure rates of external calls.

{
  "timestamp": "2025-01-20T12:37:15Z",
  "endpoint": "/projects/12345",
  "db_query_time_ms": 45,
  "external_api_time_ms": 80,
  "status_code": 200,
  "duration_ms": 130
}

Scalability Metrics

  1. Concurrency metrics on max requests handled.
  2. Request queue times during load.
  3. System Metrics on CPU and Memory usage during request processing (this will be auto captured).

Usage Metrics

  1. Traffic analysis on peak usage times.
  2. Most/Least used endpoints.

7. Mapped Diagnostic Context (MDC)

MDC is the feature I longed for most. I also ran into trouble by implementing it without a middleware.

Mapped Diagnostic Context (MDC) is a feature provided by many logging frameworks, such as Logback, Log4j, and SLF4J. It allows developers to attach contextual information (key-value pairs) to the logging events, which can then be automatically included in log messages.

This context helps in differentiating and correlating log messages, especially in multi-threaded applications.

Why Use MDC?

  1. Enhanced Log Clarity: By adding contextual information like user IDs, session IDs, or transaction IDs, MDC enables logs to provide more meaningful insights.
  2. Easier Debugging: When logs contain thread-specific context, tracing the execution path of a specific transaction or user request becomes straightforward.
  3. Reduced Log Ambiguity: MDC ensures that logs from different threads or components do not get mixed up, avoiding confusion.

Common Use Cases

  1. Web Applications: Logging user sessions, request IDs, or IP addresses to trace the lifecycle of a request.
  2. Microservices: Propagating correlation IDs across services for distributed tracing.
  3. Background Tasks: Tracking specific jobs or tasks in asynchronous operations.

Limitations (curated from other blogs; I haven’t tried these yet)

  1. Thread Boundaries: MDC is thread-local, so its context does not automatically propagate across threads (e.g., in asynchronous executions). For such scenarios, you may need to manually propagate the MDC context.
  2. Overhead: Adding and managing MDC context introduces a small runtime overhead, especially in high-throughput systems.
  3. Configuration Dependency: Proper MDC usage often depends on correctly configuring the logging framework.


2025-01-21 14:22:15.123 INFO  [thread-1] [userId=12345, transactionId=abc123] Starting transaction
2025-01-21 14:22:16.456 DEBUG [thread-1] [userId=12345, transactionId=abc123] Processing request
2025-01-21 14:22:17.789 ERROR [thread-1] [userId=12345, transactionId=abc123] Error processing request: Invalid input
2025-01-21 14:22:18.012 INFO  [thread-1] [userId=12345, transactionId=abc123] Transaction completed

In FastAPI, we can implement this via a middleware. Python’s logging has no built-in MDC, but contextvars plus a logging.Filter give the same effect:


import logging
import uuid
from contextvars import ContextVar

from fastapi import FastAPI, Request
from starlette.middleware.base import BaseHTTPMiddleware

# Context variables act as the MDC; unlike thread-locals, they propagate
# correctly across async/await boundaries
user_id_var: ContextVar[str] = ContextVar("user_id", default="unknown")
transaction_id_var: ContextVar[str] = ContextVar("transaction_id", default="-")

# A logging.Filter that copies the MDC values onto every log record
class ContextFilter(logging.Filter):
    def filter(self, record):
        record.user_id = user_id_var.get()
        record.transaction_id = transaction_id_var.get()
        return True

# Configure the logger with MDC placeholders in the format string
logger = logging.getLogger("app")
logger.setLevel(logging.INFO)
formatter = logging.Formatter(
    "%(asctime)s %(levelname)s [%(threadName)s] "
    "[userId=%(user_id)s, transactionId=%(transaction_id)s] %(message)s"
)
console_handler = logging.StreamHandler()
console_handler.setFormatter(formatter)
console_handler.addFilter(ContextFilter())
logger.addHandler(console_handler)

# FastAPI application
app = FastAPI()

# Custom middleware that populates the MDC for the lifetime of each request
class RequestContextMiddleware(BaseHTTPMiddleware):
    async def dispatch(self, request: Request, call_next):
        user_id_var.set(request.headers.get("X-User-ID", "default-user"))
        transaction_id_var.set(str(uuid.uuid4()))

        logger.info("Request started")
        response = await call_next(request)
        logger.info("Request finished")
        return response

# Add the custom middleware to the FastAPI app
app.add_middleware(RequestContextMiddleware)

@app.get("/")
async def read_root():
    logger.info("Handling the root endpoint.")
    return {"message": "Hello, World!"}

@app.get("/items/{item_id}")
async def read_item(item_id: int):
    logger.info("Fetching item with ID %s", item_id)
    return {"item_id": item_id}

Hope this gave you a better idea of logging.
