Logging Strategies for Observability and Troubleshooting

As I've been between jobs, I've taken some time to start collecting various personal projects, clean them up, and get them packaged and made available to the general public, if I feel they could be of use to anyone out there. My first effort is a little Python package that provides a decorator/context-manager class that makes capturing and logging process execution times easier and more consistent. I picked this one to start with for a variety of reasons:

  • The code involved was relatively minimal, so I could write it in a couple of afternoons, including testing, giving me time to focus on learning the newest standard processes involved in creating Python packages.
  • That also allowed me to work through the CI/CD processes that I wanted to build for any future Python packaging projects, specifically Bitbucket's Pipelines.
  • It integrates nicely, I think, with standard AWS CloudWatch logs → metrics-filters → dashboards/alarms processes and resources.
  • It helps keep logs that are used to capture metrics consistent and more efficient (though in fairness, that efficiency gain is, perhaps, trivial).

The back-story behind this package is fairly straightforward, though I've abstracted the circumstances and details for various reasons. Assume that you have a system that runs in AWS. It could be an API Gateway backed by one to many Lambda Functions, a Fargate task, or anything, really, with inner functions or methods whose successes, errors, and elapsed execution times you need visibility into for every request. For the sake of illustration, let's assume a Fargate task that:

  • Accepts some identifiers for data that needs to be retrieved from APIs that you have no control over. For purposes of illustration, I'll call them Person, Place and Thing APIs.
  • Retrieves the actual data for the identified items from their relevant APIs.
  • Does some manipulation of the data to assemble it into a payload for a call to another API.
  • Makes that API call to analyze the collected data.
  • Writes that analysis to a data store.

Putting all of that together, a rough asyncio sketch of that structure, with the external API calls and the data-store write stubbed out and all names purely illustrative, looks something like this:
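
import asyncio

# Illustrative stand-ins for the real calls; each would accept whatever
# identifiers the task was given and call the external API it wraps.
async def get_person_data(): ...
async def get_place_data(): ...
async def get_thing_data(): ...
async def analyze_data(payload): ...
async def write_data(payload): ...

async def async_main():
    # Retrieve the three source data-sets concurrently
    person_data, place_data, thing_data = await asyncio.gather(
        get_person_data(), get_place_data(), get_thing_data()
    )
    # Assemble the payload, have it analyzed, then persist the results
    payload = {
        'person_data': person_data,
        'place_data': place_data,
        'thing_data': thing_data,
    }
    analysis = await analyze_data(payload)
    return await write_data(analysis)

def main():
    # The synchronous entry-point just hands off to the asyncio event-loop
    return asyncio.run(async_main())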

There are, I think, two main categories of logging-related messages to be concerned with. The first is messages for troubleshooting purposes: information that would help an engineer track down which parts of a process are succeeding and which are failing. The approach that I have advocated for over the last several years is a fairly simple rule-set (a short sketch applying it follows the list):

  • Entry-point functions or methods should info-log their initial call: my_entry_point called, for example.
  • Entry-point functions or methods should debug-log their inputs. In Python, this can be as simple as logging f'{vars()}' at the top of the code, but other, more sophisticated approaches are possible (or even desirable) too.
  • All functions or methods should log their exit state. That takes two basic forms:

  1. For successful completion without errors, an info-level log-message is sufficient: my_child_function completed successfully, for example. In addition, debug-logging what will be returned is advised.
  2. Any errors raised should be captured and logged, with the inputs that led to the error. Logging the stack-trace at a debug level is also advised.
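
As a rough sketch, applying that rule-set to a hypothetical entry-point function looks something like the following (my_entry_point and do_the_actual_work are purely illustrative names, not part of any package):

import logging

def do_the_actual_work(person_id, place_id, thing_id):
    # Stand-in for the real processing
    ...

def my_entry_point(person_id, place_id, thing_id):
    # Capture the inputs before any other local names exist
    inputs = vars()
    # Info-log the initial call
    logging.info('my_entry_point called')
    # Debug-log the inputs
    logging.debug(f'my_entry_point inputs: {inputs}')
    try:
        result = do_the_actual_work(**inputs)
    except Exception as error:
        # Log any error raised, with the inputs that led to it
        logging.error(f'{error.__class__.__name__} in my_entry_point: {error}')
        logging.debug(f'my_entry_point inputs: {inputs}')
        raise
    # Info-log successful completion and debug-log what will be returned
    logging.info('my_entry_point completed successfully')
    logging.debug(f'{result}')
    return result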

When everything is running according to plan, that leads to one log-message per function or method called, plus one at the start of the entry-point, each of which is intentionally small to reduce the amount of logged data being paid for. For example (these examples are all asyncio-based code, unless noted otherwise):

[INFO]  12:05:57,008: main called
[INFO]  12:06:01,857: get_place_data completed without error
[INFO]  12:06:02,241: get_person_data completed without error
[INFO]  12:06:02,811: get_thing_data completed without error
[INFO]  12:06:05,387: analyze_data completed without error
[INFO]  12:06:05,748: write_data completed without error
[INFO]  12:06:05,748: async_main completed
[INFO]  12:06:05,750: main completed        

This captures the entire process wrapped by the main function, and shows that all of the relevant sub-processes ran to completion without error. This is, if not the optimum level of logging information for troubleshooting purposes, at least very close to it. In situations where there are no explicit errors being raised, but the results of a process are suspect or just wrong, this level of logging can show at least two things over and above the "there were no errors" scenario:

  • Whether the process as a whole was executing as expected.
  • Which child processes executed (if there are any), and which did not.

It also affords the ability to simply turn debug-level logging on to see more details:

[INFO]  14:08:08,408: main called
[DEBUG] 14:08:08,408: main: {}
[DEBUG] 14:08:08,408: async_main called: {}
[DEBUG] 14:08:11,206: get_person_data called: {}
[DEBUG] 14:08:11,206: get_place_data called: {}
[DEBUG] 14:08:11,206: get_thing_data called: {}
[INFO]  14:08:13,873: get_thing_data completed without error
[DEBUG] 14:08:13,874: {'thing_data': ('the Revolver', {})}
[INFO]  14:08:13,977: get_place_data completed without error
[DEBUG] 14:08:13,978: {'place_data': ('The Library', {})}
[INFO]  14:08:14,001: get_person_data completed without error
[DEBUG] 14:08:14,001: {'person_data': ('Professor Plum', {})}
[DEBUG] 14:08:14,001: analyze_data called: {'payload': {'person_data': {'Professor Plum': {}}, 'place_data': {'The Library': {}}, 'thing_data': {'the Revolver': {}}}}
[INFO]  14:08:16,537: analyze_data completed without error
[DEBUG] 14:08:16,537: {'person_data': {'Professor Plum': {}}, 'place_data': {'The Library': {}}, 'thing_data': {'the Revolver': {}}}
[DEBUG] 14:08:16,537: write_data called: {'payload': {'person_data': {'Professor Plum': {}}, 'place_data': {'The Library': {}}, 'thing_data': {'the Revolver': {}}}}
[INFO]  14:08:17,145: write_data completed without error
[DEBUG] 14:08:17,145: True
[INFO]  14:08:17,145: async_main completed
[INFO]  14:08:17,146: main completed        
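
Switching over is typically just a configuration change rather than a code change. A minimal sketch, assuming the log level is driven by a LOG_LEVEL environment variable (that variable name and the format string are just assumptions):

import logging
import os

# Read the desired level from the environment, defaulting to INFO, so that
# debug-level output can be enabled without touching the code
logging.basicConfig(
    level=os.environ.get('LOG_LEVEL', 'INFO').upper(),
    format='[%(levelname)s] %(asctime)s: %(message)s',
)

With that in place, redeploying with LOG_LEVEL=DEBUG set on the task is all that's needed to get the more verbose output above.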

In situations where errors arise, this approach will usually log the source error, any "intermediate" errors that happen as a result, and a "final" error, making the process of tracking down the source of the error, and the inputs that led to it, relatively simple:

[INFO]  14:02:13,003: main called
[ERROR] 14:02:15,672: RuntimeError in get_thing_data: with {}: Thing API failure
[INFO]  14:02:18,108: get_person_data completed without error
[INFO]  14:02:18,294: get_place_data completed without error
[ERROR] 14:02:18,295: RuntimeError in async_main: 1 child processes failed: 
                      [RuntimeError('Thing API failure')]
[INFO]  14:02:18,296: main completed        

This still provides a reasonable view of a process' execution — usually enough for an engineer to be able to replicate the error locally, since the inputs that caused the error are captured. But, if more detail is needed, the same switch over to debug-level logging can provide more detail.

The second main category of logging messages is concerned with providing data-points that can be picked up by metrics filters and used to provide observability into the performance of a system or its processes. In an AWS CloudWatch context, these data-points can be collected with a Metric Filter, and the collected metrics can drive displays in CloudWatch Dashboards and be used to set up Alarms that notify someone when performance falls outside an expected range or crosses some value threshold. A metric filter uses one of several filter syntaxes to identify names in the log-messages of the associated log group, and defines how the value for each of those names is determined.
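
For a concrete sense of what such a definition involves, the same kind of filter can also be created programmatically with boto3. This is just a sketch: the log-group name, namespace, and elapsed_ms field below are all hypothetical, and a CloudFormation version for the actual package output appears later on.

import boto3

logs = boto3.client('logs')

# Extract a numeric value from JSON-formatted log-messages whenever the
# (hypothetical) elapsed_ms field is present and greater than zero
logs.put_metric_filter(
    logGroupName='/my-application/data-writer',   # hypothetical log group
    filterName='data-writer-latency-metric-filter',
    filterPattern='{ $.elapsed_ms > 0 }',
    metricTransformations=[
        {
            'metricName': 'data-writer-latency',
            'metricNamespace': 'my-application',  # hypothetical namespace
            'metricValue': '$.elapsed_ms',
            'unit': 'Milliseconds',
        },
    ],
)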

If storage of logs is an ongoing cost concern, note that the metrics extracted from logs persist after the logs they were extracted from have been deleted. That allows the log retention policy to be set for a much shorter time-frame — just long enough that the passage of a long weekend won't completely eliminate them, for example — without affecting the metrics extracted in any meaningful fashion.
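
As a sketch of what that looks like in practice, assuming boto3 and a hypothetical log-group name, the retention policy is a single call:

import boto3

logs = boto3.client('logs')

# Keep the raw log-messages for one week; metrics extracted from them
# will outlive this retention window
logs.put_retention_policy(
    logGroupName='/my-application/data-writer',  # hypothetical log group
    retentionInDays=7,
)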

The metrics themselves will be captured in one-minute buckets by default, and as they age those buckets are rolled up into coarser data-sets. Per AWS's documented retention schedule, one-minute data points are kept for 15 days, five-minute data points for 63 days, and one-hour data points for 455 days, after which they expire; that lines up with my own observation that the higher-resolution time scales eventually stop being available.

One of the more efficient logging output formats, from a metric filter standpoint, is to log data as a JSON object, with fields and values that filter definitions can leverage. By way of example, capturing the execution times of the functions in the hypothetical Fargate task above might issue a log-message with the following values (pretty-printed for ease of reading; there's no need for it to be pretty in the log message itself):

{
	"async_main": 2.4285507202148438,
	"get_place_data": 2.770256280899048,
	"get_person_data": 2.859272003173828,
	"get_thing_data": 2.9412269592285156,
	"async_results": 2.9431209564208984,
	"analyze_data": 2.139224052429199,
	"write_data": 0.7172739505767822,
	"main": 8.23118805885315
}        

Functionally, there's no reason that these metrics values would have to be logged in a single line. Logging each value independently is only trivially less efficient, but in my experience, troubleshooting efforts that need access to the raw metrics data are easier to deal with when everything is in one place.
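
For a rough sense of the mechanics involved, a hand-rolled version of that single-line timing capture, using only the standard library (all names here are placeholders, not part of any package), might look something like this:

import json
import logging
import time
from contextlib import contextmanager

# Accumulates elapsed times, keyed by process name, for the current request
_timings = {}

@contextmanager
def timed(name):
    # Record how long the wrapped block took, in milliseconds
    start = time.perf_counter()
    try:
        yield
    finally:
        _timings[name] = round((time.perf_counter() - start) * 1000, 3)

def log_timings():
    # Write the entire collection out as a single JSON log-line
    logging.info(json.dumps(_timings))

Each process of interest would wrap its work in a with timed('some_name'): block, and the entry-point would call log_timings() once, just before returning.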

With all of those factors in mind, I wrote a small Python package to capture elapsed time values with associated names, and write the entire collection of them out to a single log-line: the goblinfish-metrics-trackers package. It also provides a mechanism for adding arbitrary metrics to that output, allowing a single log-message to capture both "latencies" (elapsed times for any processes of interest) and other points of interest (like error-counts). To reduce the amount of data storage in the logs and allow for better consistency in the metrics filters later on, the elapsed times are output as milliseconds, with no more than three decimal places of precision (e.g. 1234.567). An example of the final output for the Fargate structure above, pretty-printed, is:

{
   "latencies":{
      "async_main":2869.3,
      "get_place_data":2324.621,
      "get_thing_data":2427.277,
      "get_person_data":2871.321,
      "async_results":2872.099,
      "analyze_data":2199.275,
      "main":7942.849
   },
   "metrics":{
      "write_data_errors":1
   }
}        

The corresponding general logging output for the test-run that generated this log message was:

[INFO]  09:38:18,861: main called
[INFO]  09:38:23,246: get_thing_data completed without error
[INFO]  09:38:23,545: get_place_data completed without error
[INFO]  09:38:23,859: get_person_data completed without error
[INFO]  09:38:26,426: analyze_data completed without error
[ERROR] 09:38:26,426: RuntimeError in write_data: Write data failure
[ERROR] 09:38:26,426: RuntimeError in async_main: Write data failure
[INFO]  09:38:26,427: main completed        

The function that raised the error (an asynchronous function in this case) incorporates the logging policy described earlier, uses a WRITE_FAIL_RATE value of 0.99 to force a fake error, and captures that error metric:

# Imports needed for this excerpt
import asyncio
import logging
import random

# ProcessTracker comes from the goblinfish-metrics-trackers package
# (the exact import path shown here is an assumption)
from goblinfish.metrics.trackers import ProcessTracker

# Module "Constants" and Other Attributes
tracker = ProcessTracker()
WRITE_FAIL_RATE = 0.99  # Forces the simulated write failure
# ...
async def write_data(payload):
    inputs = vars()
    logging.debug(f'write_data called: {inputs}')
    try:
        if random.random() < WRITE_FAIL_RATE:
            raise RuntimeError('Write data failure')
        sleep_for = random.randrange(250, 1_000) / 1000
        # Only the simulated write itself is timed, under the 'write_data' name
        with tracker.timer('write_data'):
            await asyncio.sleep(sleep_for)
    except Exception as error:
        logging.error(
            f'{error.__class__.__name__} in write_data: {error}'
        )
        logging.debug(f'write_data inputs: {inputs}')
        # Here is the error-count metric
        tracker.set_metric('write_data_errors', 1)
        raise
    else:
        result = True
        logging.info('write_data completed without error')
        logging.debug(f'{result}')
        return result

The synchronous equivalent of this function can simply use the @tracker.track decorator. When that decorator is applied to an async function it still runs without error, but it only captures the elapsed time needed for the coroutine object to be created, not the time the function body actually takes to run, which is not terrifically useful in general. Apart from that, it is nearly identical:

# Imports are as in the asynchronous version (minus asyncio), plus the
# standard-library time module for the blocking sleep
import time

# Module "Constants" and Other Attributes
tracker = ProcessTracker()
# ...
@tracker.track
def write_data(payload):
    inputs = vars()
    logging.debug(f'write_data called: {inputs}')
    try:
        if random.random() < WRITE_FAIL_RATE:
            raise RuntimeError('Write data failure')
        sleep_for = random.randrange(250, 1_000) / 1000
        time.sleep(sleep_for)
    except Exception as error:
        logging.error(
            f'{error.__class__.__name__} in write_data: {error}'
        )
        logging.debug(f'write_data inputs: {inputs}')
        # Here is the error-count metric
        tracker.set_metric('write_data_errors', 1)
        raise
    else:
        result = True
        logging.info('write_data completed without error')
        logging.debug(f'{result}')
        return result        

The general logging output also looks pretty much the same as the output from the asynchronous version, though in the test-run I made while writing this, it also included a stack trace.

The CloudFormation resource definitions for the related metric filters don't need to be terribly complicated: only seven properties are used for either of them, and one of those (FilterName) is more a convenience item than a requirement. The CloudFormation YAML required for the latency and error-count metrics capture for the data-writer function ends up looking like this:

# ...
Resources:
  # ...
  DataWriterLatencyMetricFilter:
    Type: AWS::Logs::MetricFilter
    Properties:
      FilterName: data-writer-latency-metric-filter
      # This is what triggers picking up the value
      FilterPattern: '{ $.latencies.write_data > 0 }'
      # This is the source of the log-messages to be evaluated
      LogGroupName: !Ref DataWriterLogGroup
      MetricTransformations:
        - MetricName: data-writer-latency
          # This is a grouping name, so that all the metrics for the
          # application will be kept in the same namespace
          MetricNamespace: !Ref ApplicationName
          # This is the value that will be picked up
          MetricValue: '$.latencies.write_data'
          # This is the unit-type for the metric
          Unit: Milliseconds
  # ...
  DataWriterErrorCountMetricFilter:
    Type: AWS::Logs::MetricFilter
    Properties:
      FilterName: data-writer-error-count-metric-filter
      FilterPattern: '{ $.metrics.write_data_errors > 0 }'
      LogGroupName: !Ref DataWriterLogGroup
      MetricTransformations:
        - MetricName: data-writer-error-count
          MetricNamespace: !Ref ApplicationName
          MetricValue: '$.metrics.write_data_errors'
          Unit: Count
Correction: This YAML did not originally include the .latencies and .metrics segments in the FilterPattern and MetricValue entries. Those have been corrected.

Full examples of applying the decorators and context managers are available in the package repository, both modeled to simulate the Fargate example above.
