Learning Notes #65 - Application Logs, Metrics, MDC
I am a big fan of logs. I would like to log everything: every request and response of an API. But is that correct? Though logs helped our team greatly during this new year, I wanted to know whether there is a better approach to logging things. That search became this blog, where I jot down my notes on logging. Let's log it.
Throughout this blog, I try to generalize things and stay unbiased toward any particular language, though here and there you can see me leaning towards Python. Also, this is my opinion, not a hard rule.
Which is the best logger?
I'm not here to argue about which logger is the best; they all have their problems. But the worst one is usually the one you build yourself. Sure, existing loggers aren't perfect, but trying to create your own is often a much bigger mistake.
1. Why Logging Matters
Logging provides visibility into your application's behavior, helping you to:
- Diagnose and troubleshoot issues (the most common use case)
- Monitor application health and performance (Metrics)
- Meet compliance and auditing requirements (Audit Logs)
- Enable debugging in production environments (we all do this.)
However, poorly designed logging strategies can lead to excessive log volumes, higher costs, and difficulty in pinpointing actionable insights.
2. Logging Best Practices
a. Use Structured Logs
Long story short, instead of unstructured plain text, use JSON or other structured formats. This makes parsing and querying easier, especially in log aggregation tools.
```json
{
  "timestamp": "2025-01-20T12:34:56Z",
  "level": "INFO",
  "message": "User login successful",
  "userId": 12345,
  "sessionId": "abcde12345"
}
```
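For Python, here is a minimal sketch of emitting such lines using only the standard library (the JsonFormatter class and the field names are just illustrative; libraries such as python-json-logger do the same with less boilerplate):

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render every record as a single JSON line."""
    def format(self, record):
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # Pick up context passed via extra={...} if present
        for key in ("userId", "sessionId"):
            if hasattr(record, key):
                payload[key] = getattr(record, key)
        return json.dumps(payload)

logger = logging.getLogger("app")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("User login successful", extra={"userId": 12345, "sessionId": "abcde12345"})
```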
b. Leverage Logging Levels
Define and adhere to appropriate logging levels to avoid log bloat:
- DEBUG: Detailed information for debugging.
- INFO: General operational messages.
- WARNING: Indications of potential issues.
- ERROR: Application errors that require immediate attention.
- CRITICAL: Severe errors leading to application failure.
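A quick sketch of how levels filter output in Python; only records at or above the configured level are emitted:

```python
import logging

logging.basicConfig(level=logging.WARNING)
logger = logging.getLogger("app")

logger.debug("Cache lookup details for key=abc")   # suppressed at WARNING
logger.info("User login successful")               # suppressed at WARNING
logger.warning("Disk usage above 85%")             # emitted
logger.error("Payment provider timed out")         # emitted
```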
c. Avoid Sensitive Data
Sanitize your logs to exclude sensitive information like passwords, PII, or API keys. Instead, mask or hash such data. Don't add tokens even for testing.
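A rough sketch of enforcing this with a logging filter (the regex and field names are only illustrative; real redaction usually needs to match your own payload shapes):

```python
import logging
import re

SENSITIVE = re.compile(r"(password|api_key|token)=\S+", re.IGNORECASE)

class RedactFilter(logging.Filter):
    """Mask obvious secrets in the rendered message before any handler sees it."""
    def filter(self, record):
        record.msg = SENSITIVE.sub(r"\1=***", record.getMessage())
        record.args = ()  # the message is already rendered, drop the args
        return True

logger = logging.getLogger("app")
logger.addHandler(logging.StreamHandler())
logger.addFilter(RedactFilter())
logger.setLevel(logging.INFO)

logger.info("Login attempt with password=hunter2")  # emits: Login attempt with password=***
```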
d. Include Contextual Information
Incorporate metadata like request IDs, user IDs, or transaction IDs to trace specific events effectively.
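In Python, one lightweight way to do this is a LoggerAdapter (a sketch; the keys are just examples, and every record reaching this formatter must carry them or formatting will fail):

```python
import logging

logging.basicConfig(
    format="%(asctime)s %(levelname)s [request_id=%(request_id)s user_id=%(user_id)s] %(message)s",
    level=logging.INFO,
)

# Bind the context once per request; every message through the adapter carries it
request_logger = logging.LoggerAdapter(
    logging.getLogger("app"),
    {"request_id": "req-42", "user_id": 12345},
)

request_logger.info("Fetching project list")
request_logger.warning("Project quota almost reached")
```

The MDC section later in this post shows how to populate such context automatically per request.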
3. Log Ingestion at Scale
As applications scale, log ingestion can become a bottleneck. Here's how to manage it:
a. Centralized Logging
Stream logs to centralized systems like Elasticsearch, Logstash, Kibana (ELK), or cloud-native services like AWS CloudWatch, Azure Monitor, or Google Cloud Logging.
b. Optimize Log Volume
- Log only necessary information.
- Use log sampling to reduce verbosity in high-throughput systems (a sketch follows this list).
- Rotate logs to limit disk usage.
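A rough sketch of sampling with a standard-library filter (the 1% rate is arbitrary; warnings and above are always kept):

```python
import logging
import random

class SampleFilter(logging.Filter):
    """Pass through only a fraction of low-severity records."""
    def __init__(self, rate: float = 0.01):
        super().__init__()
        self.rate = rate

    def filter(self, record):
        if record.levelno >= logging.WARNING:
            return True  # never drop warnings/errors
        return random.random() < self.rate

logger = logging.getLogger("app")
logger.addHandler(logging.StreamHandler())
logger.addFilter(SampleFilter(rate=0.01))  # keep roughly 1 in 100 INFO/DEBUG records
logger.setLevel(logging.INFO)
```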
c. Use Asynchronous Logging
Asynchronous loggers improve application performance by delegating logging tasks to separate threads or processes. (Not suitable all the time; it has its own problems.)
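In Python, the standard library's QueueHandler/QueueListener pair is one way to push the actual I/O onto a background thread (a minimal sketch; you still have to stop the listener on shutdown, and a crash can lose queued records, which is one of those "own problems"):

```python
import logging
import queue
from logging.handlers import QueueHandler, QueueListener

log_queue = queue.Queue(-1)  # unbounded queue between the app and the log writer

# Application code only enqueues records, which is fast...
logger = logging.getLogger("app")
logger.setLevel(logging.INFO)
logger.addHandler(QueueHandler(log_queue))

# ...while a background thread performs the slow file I/O
file_handler = logging.FileHandler("app.log")
file_handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
listener = QueueListener(log_queue, file_handler)
listener.start()

logger.info("This call returns quickly; writing happens on the listener thread")

listener.stop()  # flush and join the background thread on shutdown
```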
d. Method return values are usually important
If you have a log in a method and don't include the method's return value, you're missing important information. Make an effort to include it, even at the expense of slightly less elegant-looking code.
e. Include filename in error messages
Mention the path/to/file:line-number to pinpoint the location of the issue.
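In Python, the log record already carries this information; it only needs to appear in the format string:

```python
import logging

logging.basicConfig(
    format="%(asctime)s %(levelname)s %(pathname)s:%(lineno)d - %(message)s",
    level=logging.INFO,
)

logging.getLogger("app").error("Failed to parse config")
# prints something like:
# 2025-01-20 12:34:56,789 ERROR /path/to/your_module.py:8 - Failed to parse config
```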
4. Logging Don'ts
a. Don't Log Everything at the Same Level
Logging all messages at the INFO or DEBUG level creates noise and makes it difficult to identify critical issues.
b. Don't Hardcode Log Messages
Avoid static, vague, or generic log messages. Use dynamic and descriptive messages that include relevant context.
```text
# Bad Example
Error occurred.

# Good Example
Error occurred while processing payment for user_id=12345, transaction_id=abc-6789.
```
c. Don't Log Sensitive or Regulated Data
Exposing personally identifiable information (PII), passwords, or other sensitive data in logs can lead to compliance violations (e.g., GDPR, HIPAA).
d. Don't Ignore Log Rotation
Failing to implement log rotation can result in disk space exhaustion, especially in high traffic systems (Log Retention).
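Python's standard library covers basic rotation out of the box (a sketch; the size and backup count are arbitrary, and many teams rotate via logrotate or the platform instead):

```python
import logging
from logging.handlers import RotatingFileHandler

handler = RotatingFileHandler(
    "app.log",
    maxBytes=10 * 1024 * 1024,  # start a new file after ~10 MB
    backupCount=5,              # keep app.log.1 ... app.log.5 and delete older files
)
handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))

logger = logging.getLogger("app")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
```

TimedRotatingFileHandler does the same by time interval instead of size.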
e. Don't Overlook Log Correlation
Logs without request IDs, session IDs, or contextual metadata make it difficult to correlate related events.
f. Don't Forget to Monitor Log Costs
Logging everything without considering storage and processing costs can lead to financial inefficiency in large-scale systems.
g. Keep the log message short
Long and verbose messages are a cost. The cost is in reading time and ingestion time.
h. Never log inside a loop
This might seem obvious, but just to be clear: logging inside a loop, even if the log level isn't visible by default, can still hurt performance. It's best to avoid this whenever possible.
If you absolutely need to log something at a hidden level and decide to break this guideline, keep it short and straightforward.
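If you do need visibility into a loop, prefer one summary line after it (a small sketch with made-up data):

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("app")

items = list(range(1000))

# Noisy: one record per iteration; even at a hidden level the call itself costs time
# for item in items:
#     logger.debug("Processing item %s", item)

# Better: aggregate and emit a single summary after the loop
failed = [item for item in items if item % 97 == 0]  # stand-in for real failures
logger.info("Processed %d items, %d failed (first few: %s)", len(items), len(failed), failed[:5])
```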
i. Log items you already 'have'
We should avoid this:
```python
logger.info("Reached X and value of method is %s", method())
```
Here, just for logging purposes, we call method() again. Even if the method is cheap, the argument is evaluated and the method runs regardless of the logging level!
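If the value really is expensive to compute and only useful at a hidden level, guard the call explicitly (Python sketch; expensive_summary() is a made-up placeholder):

```python
import logging

logger = logging.getLogger("app")

def expensive_summary():
    # placeholder for something you would not want to run on every request
    return "..."

# The argument is only evaluated when DEBUG is actually enabled
if logger.isEnabledFor(logging.DEBUG):
    logger.debug("State summary: %s", expensive_summary())
```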
j. Don't log iterables
Even if it's a small list, the concern is that the list might grow and overcrowd the log. Writing the contents of the list can balloon the log and slow processing noticeably. It also wastes time during debugging.
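Log the size (and maybe a tiny sample) instead of the whole collection; a small sketch with made-up data:

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("app")

projects = [{"id": i} for i in range(5000)]

# Risky: the whole list ends up in the log and grows with the data
# logger.info("Loaded projects: %s", projects)

# Safer: constant-size line no matter how large the collection gets
logger.info("Loaded %d projects (first ids: %s)", len(projects), [p["id"] for p in projects[:3]])
```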
k. Don't Log What the Framework Logs for You
There are great things to log, e.g. the name of the current thread, the time, etc. But those are already written into the log by default almost everywhere. Don't duplicate these efforts.
l. Don't log Method Entry/Exit
Log only important events in the system. Entering or exiting a method isn't an important event. E.g., if I have a method that enables feature X, the log should be 'Feature X enabled' and not 'enable_feature_X entered'. I have done this a lot.
m. Don't fill the method
A complex method might include multiple points of failure, so it makes sense to place logs at multiple points in the method to detect a failure along the way. Unfortunately, this leads to duplicate logging and verbosity.
Errors typically map to error-handling code, which should log generically, so all error conditions should already be covered.
This sometimes means changing the flow or behavior of the code so that logging can be more elegant.
n. Don't use AOP logging
AOP (Aspect-Oriented Programming) logging allows you to automatically add logs at specific points in your application, such as when methods are entered or exited.
In Python, AOP-style logging can be implemented using decorators or middleware that inject logs into specific points, such as method entry and exit. While it might seem appealing for detailed tracing, the same problems apply as in other languages like Java.
```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

def log_method_entry_exit(func):
    def wrapper(*args, **kwargs):
        logger.info(f"Entering: {func.__name__} with args={args} kwargs={kwargs}")
        result = func(*args, **kwargs)
        logger.info(f"Exiting: {func.__name__} with result={result}")
        return result
    return wrapper

# Example usage
@log_method_entry_exit
def example_function(x, y):
    return x + y

example_function(5, 3)
```
Why Avoid AOP Logging in Python
- Performance Impact:
  - Injecting logs into every method increases runtime overhead, especially if used extensively in large-scale systems.
  - In Python, where function calls already add some overhead, this can significantly affect performance.
- Log Verbosity:
  - If this decorator is applied to every function or method in a system, it produces an enormous amount of log data.
  - Debugging becomes harder because the meaningful logs are lost in the noise of entry/exit logs.
- Limited Usefulness:
  - During local development, tools like Python debuggers (pdb), profilers (cProfile, line_profiler), or tracing libraries like trace are far more effective for inspecting function behavior and performance.
- CI Issues:
  - Enabling such verbose logging during CI test runs can make tracking test failures more difficult because the logs are flooded with entry/exit messages, obscuring the root cause of failures.
Use Python-specific tools like pdb, ipdb, or IDE-integrated debuggers to inspect code locally.
o. Don't double log
It's pretty common to log an error when we're about to throw one. However, since most error handling is generic, it's likely there's already a log in the generic error handling code.
5. Ensuring Scalability
To keep your logging system robust and scalable:
- Monitor Log Storage: Set alerts for log storage thresholds.
- Implement Compression: Compress log files to reduce storage costs.
- Automate Archival and Deletion: Regularly archive old logs and purge obsolete data.
- Benchmark Logging Overhead: Measure the performance impact of logging on your application.
6. Logging for Metrics
Below is a list of items that I wish to log for metrics.
General API Metrics
- General API Metrics on HTTP methods, status codes, latency/duration, request size.
- Total requests per endpoint over time. Requests per minute/hour.
- Frequency and breakdown of 4XX and 5XX errors.
- User ID or API client making the request.
```json
{
  "timestamp": "2025-01-20T12:34:56Z",
  "endpoint": "/projects",
  "method": "POST",
  "status_code": 201,
  "user_id": 12345,
  "request_size_bytes": 512,
  "response_size_bytes": 256,
  "duration_ms": 120
}
```
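A rough sketch of how such lines could be produced from a FastAPI app (the field names mirror the example above and are only illustrative; the X-User-ID header is an assumption about the client):

```python
import json
import logging
import time

from fastapi import FastAPI, Request

logging.basicConfig(level=logging.INFO, format="%(message)s")
metrics_logger = logging.getLogger("api.metrics")

app = FastAPI()

@app.middleware("http")
async def log_api_metrics(request: Request, call_next):
    start = time.perf_counter()
    response = await call_next(request)
    metrics_logger.info(json.dumps({
        "endpoint": request.url.path,
        "method": request.method,
        "status_code": response.status_code,
        "user_id": request.headers.get("X-User-ID"),  # assumed header
        "duration_ms": round((time.perf_counter() - start) * 1000),
    }))
    return response

@app.get("/projects")
async def list_projects():
    return []
```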
Business Specific Metrics
- Objects (session) creations: No. of projects created (daily/weekly)
- Average success/failure rate.
- Average time to create a session.
- Frequency of each action on top of session.
```json
{
  "timestamp": "2025-01-20T12:35:00Z",
  "endpoint": "/projects/12345/actions",
  "action": "edit",
  "status_code": 200,
  "user_id": 12345,
  "duration_ms": 98
}
```
Performance Metrics
- Database query metrics on execution time, no. of queries per request.
- Third party service metrics on time spent, success/failure rates of external calls.
```json
{
  "timestamp": "2025-01-20T12:37:15Z",
  "endpoint": "/projects/12345",
  "db_query_time_ms": 45,
  "external_api_time_ms": 80,
  "status_code": 200,
  "duration_ms": 130
}
```
Scalability Metrics
- Concurrency metrics on the maximum number of requests handled.
- Request queue times during load.
- System Metrics on CPU and Memory usage during request processing (this will be auto captured).
Usage Metrics
- Traffic analysis on peak usage times.
- Most/Least used endpoints.
7. Mapped Diagnostic Context (MDC)
MDC is the one I longed for most. I also ran into trouble by implementing it without a middleware.
Mapped Diagnostic Context (MDC) is a feature provided by many logging frameworks, such as Logback, Log4j, and SLF4J. It allows developers to attach contextual information (key-value pairs) to the logging events, which can then be automatically included in log messages.
This context helps in differentiating and correlating log messages, especially in multi-threaded applications.
Why Use MDC?
- Enhanced Log Clarity: By adding contextual information like user IDs, session IDs, or transaction IDs, MDC enables logs to provide more meaningful insights.
- Easier Debugging: When logs contain thread-specific context, tracing the execution path of a specific transaction or user request becomes straightforward.
- Reduced Log Ambiguity: MDC ensures that logs from different threads or components do not get mixed up, avoiding confusion.
Common Use Cases
- Web Applications: Logging user sessions, request IDs, or IP addresses to trace the lifecycle of a request.
- Microservices: Propagating correlation IDs across services for distributed tracing.
- Background Tasks: Tracking specific jobs or tasks in asynchronous operations.
Limitations (curated from other blogs; I haven't tried these yet)
- Thread Boundaries: MDC is thread-local, so its context does not automatically propagate across threads (e.g., in asynchronous executions). For such scenarios, you may need to manually propagate the MDC context.
- Overhead: Adding and managing MDC context introduces a small runtime overhead, especially in high-throughput systems.
- Configuration Dependency: Proper MDC usage often depends on correctly configuring the logging framework.
```text
2025-01-21 14:22:15.123 INFO [thread-1] [userId=12345, transactionId=abc123] Starting transaction
2025-01-21 14:22:16.456 DEBUG [thread-1] [userId=12345, transactionId=abc123] Processing request
2025-01-21 14:22:17.789 ERROR [thread-1] [userId=12345, transactionId=abc123] Error processing request: Invalid input
2025-01-21 14:22:18.012 INFO [thread-1] [userId=12345, transactionId=abc123] Transaction completed
```
In FastAPI, we can implement this via a middleware and contextvars (Python's request-scoped equivalent of MDC's thread-local map):
```python
import logging
import uuid
from contextvars import ContextVar

from fastapi import FastAPI, Request
from starlette.middleware.base import BaseHTTPMiddleware

# Request-scoped context: contextvars survive across async tasks,
# unlike plain thread-locals
user_id_ctx: ContextVar[str] = ContextVar("user_id", default="unknown")
transaction_id_ctx: ContextVar[str] = ContextVar("transaction_id", default="unknown")

# Filter that copies the context onto every log record so the formatter can use it
class ContextFilter(logging.Filter):
    def filter(self, record: logging.LogRecord) -> bool:
        record.user_id = user_id_ctx.get()
        record.transaction_id = transaction_id_ctx.get()
        return True

# Configure the application logger with MDC placeholders in the format
logger = logging.getLogger("app")
logger.setLevel(logging.INFO)
console_handler = logging.StreamHandler()
console_handler.setFormatter(logging.Formatter(
    "%(asctime)s %(levelname)s [%(threadName)s] "
    "[userId=%(user_id)s, transactionId=%(transaction_id)s] %(message)s"
))
console_handler.addFilter(ContextFilter())
logger.addHandler(console_handler)

# FastAPI application
app = FastAPI()

# Custom middleware that populates the MDC context for each request
class RequestContextMiddleware(BaseHTTPMiddleware):
    async def dispatch(self, request: Request, call_next):
        user_id_ctx.set(request.headers.get("X-User-ID", "default-user"))
        transaction_id_ctx.set(str(uuid.uuid4()))

        logger.info("Request started")
        response = await call_next(request)
        logger.info("Request finished")
        return response

# Add the custom middleware to the FastAPI app
app.add_middleware(RequestContextMiddleware)

@app.get("/")
async def read_root():
    logger.info("Handling the root endpoint.")
    return {"message": "Hello, World!"}

@app.get("/items/{item_id}")
async def read_item(item_id: int):
    logger.info(f"Fetching item with ID {item_id}")
    return {"item_id": item_id}
```
Hope you got a better idea of logging.