The APEROS Code Review Elements
This article presents to engineers and engineering managers the APEROS (Availability, Performance, Extensibility, Resiliency, Observability, Scalability) code review elements. Using these elements offers a different, more efficient approach to reviewing code changes for distributed systems.
A look at Code Reviews
All developers out there know what Code Reviews are about. They are a widely used mechanism to: drive better code quality and maintainability, transfer knowledge between engineers, share ownership of the changes, find better solutions for the problems the change is trying to solve, and find potential gaps in the solution.
At tech companies, engineers review code multiple times a week. It’s a necessary, time-consuming task that every engineer should take part in. There are currently a few debates in the industry about the value added by Code Reviews. A few years ago, Google wrote a thorough paper on the subject. Here’s an interesting finding from it: “Despite years of refinement, code review at Google still faces breakdowns. These are mostly linked to the complexity of the interactions that occur around the reviews. Yet, code review is strongly considered a valuable process by developers, who spend around 3 hours a week reviewing.”
Code reviews are not something engineers are taught at university, nor something they are coached on early in their careers. There’s a mixed bag of styles when it comes to reviewing a change, and the approach is usually not consistent within a team, organisation or company.
That’s why, as an Engineering Manager, I’ve always spoken to my teams about how to add more value to code reviews, as a way to make the most of the time engineers dedicate to them every week.
How can we make the most of Code Reviews?
Code reviews help us validate that we’re doing the right thing as a team and that we have considered different opinions before moving ahead with a deployment. But they can become something automatic that engineers do without much thought behind it, without considering the value the review will bring and what really should be assessed case by case.
A common mistake made by engineers (usually more often by less tenured ones) is to focus on a line-by-line review and forget about the big picture. In my personal opinion (and I have no data to support this), folks tend to lose sight of the grand scheme of things (such as which compute the software runs on, what its upstreams and downstreams are, and what the purpose of the application is) and over-index on semantics, naming conventions, code style and error messages. AKA nitpicks - https://blog.danlew.net/2021/02/23/stop-nitpicking-in-code-reviews/ .
I’m not saying those things are not important; they are, and if not taken care of we may end up with spaghetti, unpleasant or unmaintainable code (or all of the above). But high-performing teams must find ways of automating most of these validations instead of spending review effort on them. Well-thought-out code style checks and static analysers are great ways to do that in a non-opinionated way. Quoting the aforementioned Google case study: “A qualitative study of 88 Mozilla developers found that static analysis integration was the most commonly-requested feature for code review. Automated analyses allow reviewers to focus on the understandability and maintainability of changes, instead of getting distracted by trivial comments (e.g., about formatting)”
In an environment where distributed systems make up the majority of the components (like most of the applications running across the globe nowadays, with the exception of pure UX or embedded/hardware-driven software), the focus must be somewhere else. Code reviews of software that runs as distributed applications need much more focus on the critical components, on what can go wrong and on what can be very expensive to deal with later, rather than on semantics and formatting. A quick read through the Fallacies of Distributed Computing is a great way to understand how the analysis you need to do for non-distributed software differs from the analysis needed for a distributed system.
As a target, we want our distributed applications to be reliable, performant, scalable, observable and easy to maintain. It’s a major challenge to get all of those elements right, and at a high bar, in a single application. But I believe we can all agree that these are important pillars of modern distributed systems. So the challenge is how to embed these elements, which are usually thought through during design, into implementation and code reviews.
Working backwards from the fallacies and everything that can go wrong in distributed systems, I encourage my teams to use the APEROS (Availability, Performance, Extensibility, Resiliency, Observability, Scalability) elements as the handbook for reviewing code for distributed systems.
Using the APEROS Elements
During your code reviews, write down the APEROS elements in front of you. Based on the description of the review and the nature of the change, pick which of these elements are relevant to the change - most of the time you will be assessing against two or three of the elements rather than all six. Sometimes all six apply; sometimes none of them do.
When writing your comments, add the element at the beginning, e.g. “[Scalability] This looks like a stateful check that will break if this service is running as multiple instances”. It helps the owner of the change get into the right mindset when addressing it.
The sections below contain a non-exhaustive list of things to look out for in each of the elements. The examples are deliberately absurd, but they give the reader a taste of what to look for.
Availability
Clients and applications have availability SLAs they agree upon. Most applications will aim for the highest possible uptime (happier clients, happier engineers, lower operational load). Engineers should aim to never leave the client hanging for longer than expected, to bubble up downstream failures and to avoid too many retries (as the fallacies remind us, networks are not perfect and you should always be ready for your downstreams to fail). When assessing Availability, look out for issues like the ones below.
Examples
1) Horrendous retry policy - multiple instances querying a problematic downstream until starvation
function readFromDownstream() {
    var response = null;
    do {
        try {
            response = callServiceA();
        } catch (E) {
            log.err("Service call failed, retry");
        }
    } while (response == null) // keeps retrying forever, hammering the downstream
}
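For contrast, here is a minimal Java sketch of what a bounded retry could look like (callServiceA, the retry budget and the backoff values are illustrative assumptions, not the article's original code): cap the attempts, back off between them, and bubble the failure up instead of spinning forever.
class BoundedRetrySketch {
    // Hypothetical downstream call, assumed to throw on failure
    static String callServiceA() throws Exception { return "ok"; }

    static String readFromDownstream() throws Exception {
        int maxAttempts = 3;          // illustrative retry budget
        long backoffMillis = 100;     // illustrative starting backoff
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return callServiceA();
            } catch (Exception e) {
                if (attempt == maxAttempts) {
                    throw e;          // bubble the downstream failure up to the caller
                }
                Thread.sleep(backoffMillis);
                backoffMillis *= 2;   // exponential backoff between attempts
            }
        }
        throw new IllegalStateException("unreachable");
    }
}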
2) Resource exhaustion threat if upstream load is not capped
function processRequest() {
    var future = launchThreadAndAsynchProcessRequest();
    if (future.complete) {
        return 200;
    } else if (future.err) {
        return 500;
    }
}
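A hedged sketch of one way to cap that load, assuming a plain Java thread pool fits the workload (the pool size, queue depth and status codes are illustrative): a bounded executor sheds excess requests explicitly instead of exhausting threads and memory.
import java.util.concurrent.*;

class BoundedProcessingSketch {
    // Bounded pool and bounded queue: excess load is rejected instead of exhausting resources
    private final ExecutorService pool = new ThreadPoolExecutor(
            4, 4, 0L, TimeUnit.MILLISECONDS,
            new ArrayBlockingQueue<>(100),
            new ThreadPoolExecutor.AbortPolicy());

    int processRequest(Runnable work) {
        try {
            pool.submit(work);
            return 202;   // accepted for asynchronous processing
        } catch (RejectedExecutionException e) {
            return 429;   // shed load explicitly when the queue is full
        }
    }
}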
Performance
Reviewing performance during code reviews is, most of the time, like putting the cart before the horse. You won’t know how your service performs until you probe it, until you run performance tests, until you exhaust components and measure throughput, latency, CPU and memory consumption, downstream latency and the rest of that long list of important performance metrics. But there’s still value in checking for obvious opportunities for improvement or obvious signs of early optimisation (AKA the root of all evil - good articles here and here ). When assessing Performance, look out for issues like the ones below.
Examples
1) Early optimisation - unless well documented and justified, the engineer at this point would not have a clue whether 10 threads are enough, too many, or whether this requires multithreading at all
function interpolateData() {
    try {
        var executionService = newFixedThreadPool(10); // 10 threads!
        var runnable = goThroughData();
        executionService.execute(runnable);
    } catch (E) {
        log.err("not good");
    } finally {
        executionService.close()
    }
}
2) Missed multithreading opportunity - large data sets will most probably benefit from parallel streams if the operation is atomic
function processList() {
    var result = veryLargeList.stream().map(Bla::doSomethingAtomic).collect(toList());
}
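If profiling confirms the per-item operation is independent and CPU-bound, the fix can be as small as switching to a parallel stream. A minimal Java sketch, with doSomethingAtomic standing in for the real operation:
import java.util.List;
import java.util.stream.Collectors;

class ParallelProcessingSketch {
    static List<String> processList(List<String> veryLargeList) {
        // parallelStream() lets independent, side-effect-free mappings use multiple cores
        return veryLargeList.parallelStream()
                .map(ParallelProcessingSketch::doSomethingAtomic)
                .collect(Collectors.toList());
    }

    // Stand-in for the real atomic operation
    static String doSomethingAtomic(String item) {
        return item.trim();
    }
}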
3) Immutability candidate - avoid read-then-write round trips if possible; consider whether the data model is right for the use case and whether immutability would be a better option
function updateItem(item) {
    var existingItem = readFromDB(item.key);
    if (item.this > existingItem.that) {
        updateInDB(item);
    } else {
        // do nothing
    }
}
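One possible direction, assuming the datastore offers conditional writes (many key-value stores do; the putIfNewer API below is hypothetical): push the comparison into the write itself, so there is no read-modify-write round trip and no race between instances.
class ConditionalWriteSketch {
    interface ItemStore {
        // Hypothetical conditional write: persists only if the stored version is older
        boolean putIfNewer(String key, long version, String payload);
    }

    private final ItemStore store;

    ConditionalWriteSketch(ItemStore store) {
        this.store = store;
    }

    void updateItem(String key, long version, String payload) {
        // Single round trip; the store enforces the condition instead of the application
        store.putIfNewer(key, version, payload);
    }
}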
4) Missed batching opportunity - most of the DBs offer batch operations for a reason
function persistList() {
    mediumSizedList.forEach(item -> { myDBClient.persist(item); });
}
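A small Java sketch of the batched alternative, with a hypothetical persistBatch API standing in for whatever bulk operation your database client offers:
import java.util.ArrayList;
import java.util.List;

class BatchPersistSketch {
    interface DbClient {
        void persistBatch(List<String> items);   // hypothetical bulk write API
    }

    static void persistList(DbClient dbClient, List<String> mediumSizedList, int batchSize) {
        // One round trip per chunk instead of one round trip per item
        for (int i = 0; i < mediumSizedList.size(); i += batchSize) {
            int end = Math.min(i + batchSize, mediumSizedList.size());
            dbClient.persistBatch(new ArrayList<>(mediumSizedList.subList(i, end)));
        }
    }
}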
Extensibility
When we build, we want to make life easier for the next person who touches this piece of code, not harder than it was for us when we started. You should aim to make your code, and the overall application code, easy to maintain - which often means easy to extend. The old but gold SOLID principles have a lot to say about extensibility, and it’s always good to revisit them from time to time. When assessing Extensibility, look out for issues like the ones below.
Examples
1) Hard to extend DB class - if a new table is added in the future, the class won’t be reusable and will have to be rewritten
class DBHandler {
    constant DB_URI = "db://";
    constant TABLE_NAME = "table";
    function connect() {..}
    function close() {..}
    function queryItem(key) {
        // execute select * from TABLE_NAME where index = 'key'
    }
}
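A minimal sketch of a more extensible shape (in Java, with the query body left as a placeholder): the endpoint and table name are injected, so adding a new table means constructing a new instance rather than rewriting the class.
class DbHandler {
    private final String dbUri;
    private final String tableName;

    // The endpoint and table are parameters, so new tables reuse the same class
    DbHandler(String dbUri, String tableName) {
        this.dbUri = dbUri;
        this.tableName = tableName;
    }

    String queryItem(String key) {
        // Placeholder for: select * from <tableName> where index = :key
        return "query " + dbUri + "/" + tableName + " for " + key;
    }
}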
2) Components with too many responsibilities - if tomorrow there’s a new parser, new file format, new output destination or a new audit demand, this class will have to be decomposed
class JackOfAllTrades {
    ...
    JackOfAllTrades(fileName, outputQueueTopic);
    function readFromFile() {..}
    function parseFile() {..}
    function connectToQueue() {..}
    function persistToQueue() {..}
    function logResultsToAudit() {..}
}
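A hedged sketch of one way to decompose it (the interface names are illustrative): each responsibility lives behind its own small interface, and a thin coordinator wires them together, so a new parser or a new destination is a new implementation rather than a rewrite.
import java.util.List;

class PipelineSketch {
    interface Reader    { String read(String source); }
    interface Parser    { List<String> parse(String raw); }
    interface Publisher { void publish(List<String> records); }
    interface Auditor   { void audit(List<String> records); }

    // The coordinator only sequences the steps; each collaborator has one reason to change
    static void run(Reader reader, Parser parser, Publisher publisher, Auditor auditor, String source) {
        List<String> records = parser.parse(reader.read(source));
        publisher.publish(records);
        auditor.audit(records);
    }
}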
Resiliency
Although resiliency feeds into Availability, it has enough breadth to be a separate topic. Resilient software avoids failure at all times, and when it does fail, it fails gracefully. The rule is simple - no fatals, and there should always be enough resources left to react properly. When assessing Resiliency, look out for issues like the ones below.
Examples
1) Careless usage of input data
function myAPIX(request, response) {
    try {
        // request.mylist may be null or empty - nothing validates it first
        var result = processInput(request.mylist.get(FIRST).foo)
        metrics.emit(APPLICATIONA_APIX_SUCCESS)
    } catch (IOException) {
        log.err("Failure");
        metrics.emit(APPLICATIONA_PROCESSINPUT_FAILURE);
    }
}
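A small Java sketch of a more defensive version (the request shape, status codes and metric names are assumptions): validate the input before dereferencing nested fields, and treat a malformed request as a client error rather than an unhandled crash.
import java.util.List;

class DefensiveInputSketch {
    interface Metrics { void emit(String name); }   // hypothetical metrics client

    record Request(List<String> myList) {}

    int myApiX(Metrics metrics, Request request) {
        // Validate before touching nested fields; a bad request is a 4xx, not an exception
        if (request == null || request.myList() == null || request.myList().isEmpty()) {
            metrics.emit("APPLICATIONA_APIX_BAD_INPUT");
            return 400;
        }
        processInput(request.myList().get(0));
        metrics.emit("APPLICATIONA_APIX_SUCCESS");
        return 200;
    }

    void processInput(String foo) { /* placeholder for the real work */ }
}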
2) Missed implicit exception - NullPointerException is unchecked and, if not caught upstream, would crash the application and/or cause a fatal
function foo(fileName) {
    try {
        var file = readFile(fileName);
        // returns null
        file.read();
    } catch (IOException e) {
        log.err("Not found");
    }
    // Uncaught NPE
}
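A minimal sketch of handling the missing file explicitly (using java.nio, with the fallback behaviour as an assumption): check for the absent case instead of letting an implicit NullPointerException escape.
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

class SafeReadSketch {
    static String readIfPresent(String fileName) throws IOException {
        Path path = Path.of(fileName);
        // Handle the missing file explicitly rather than relying on an implicit NullPointerException
        if (!Files.exists(path)) {
            return "";   // or throw a documented, domain-specific exception
        }
        return Files.readString(path);
    }
}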
3) Resources not closed - resource/memory leak threat
function interpolateData() {
    try {
        var executionService = newFixedThreadPool(10); // 10 threads!
        var runnable = goThroughData();
        executionService.execute(runnable);
    } catch (IOException E) {
        log.err("not good");
        executionService.close()
    }
    // Missing finally block
}
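A minimal Java sketch of the same flow with the executor always released (the pool size is still illustrative): the shutdown happens in a finally block, so the threads are returned whether the work succeeds or fails.
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class CleanShutdownSketch {
    static void interpolateData(Runnable goThroughData) {
        ExecutorService executionService = Executors.newFixedThreadPool(10);
        try {
            executionService.execute(goThroughData);
        } finally {
            executionService.shutdown();   // always release the threads, success or failure
        }
    }
}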
Observability
Distributed systems have pushed observability capabilities to a very high level in modern applications. We rely on emitted metrics to make decisions - automated or not. We use them to analyse health, to derive actions from trends and to improve our software. They are a key piece of our operations, and having the right metrics allows you to react quickly. Distributed applications must be observable from the get-go. When assessing Observability, look out for issues like the ones below.
Examples
1) Misleading failure metric - not enough granularity: the two failure modes emit the same metric, so you have to rely on logs to tell them apart
function myAPIX(request, response) {
    try {
        this.processRequest(request, response)
        metrics.emit(APPLICATIONA_APIX_SUCCESS)
    } catch (E1) {
        log.err("Failure 1");
        metrics.emit(APPLICATIONA_APIX_FAILURE);
    } catch (E2) {
        log.err("Failure 2");
        metrics.emit(APPLICATIONA_APIX_FAILURE); // Same as the previous exception
    }
}
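A hedged sketch of a more granular version (the exception types and metric names are illustrative): each failure mode gets its own metric, so dashboards and alarms can tell them apart without digging through logs.
class GranularMetricsSketch {
    interface Metrics { void emit(String name); }   // hypothetical metrics client

    void myApiX(Metrics metrics, Runnable processRequest) {
        try {
            processRequest.run();
            metrics.emit("APPLICATIONA_APIX_SUCCESS");
        } catch (IllegalArgumentException e) {
            metrics.emit("APPLICATIONA_APIX_VALIDATION_FAILURE");   // one metric per failure mode...
        } catch (RuntimeException e) {
            metrics.emit("APPLICATIONA_APIX_INTERNAL_FAILURE");     // ...so dashboards can tell them apart
        }
    }
}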
2) Duplicated metrics - two layers emit a failure metric for the same error, which may confuse your dashboard setup or just be completely ignored
function processInput(input) {
    try {
        var data = getDataFromDB(input)
    } catch (E) {
        log.err("Failure");
        metrics.emit(APPLICATIONA_PROCESSINPUT_FAILURE); // second metric for the same error
    }
}

function getDataFromDB(input) {
    try {
        var data = dbClient.read(input)
        return data;
    } catch (E) {
        log.err("Failure");
        metrics.emit(APPLICATIONA_GETDATADB_FAILURE);
        throw E; // the caller emits its own failure metric for the same error
    }
}
3) Bad-input metric missing - input metrics allow you to understand who is calling you with bad inputs and are very helpful when you need to migrate versions or make breaking changes
function myAPIX(request, response) {
    try {
        validateInput(request)
        var response = processInput(request)
        metrics.emit(APPLICATIONA_APIX_SUCCESS)
    } catch (ValidateEx) {
        log.err("Bad input");
        // Missing wrong input metric
    } catch (RuntimeE) {
        log.err("Failure");
        metrics.emit(APPLICATIONA_PROCESSINPUT_FAILURE);
    }
}
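A small sketch of the missing piece (the validation rule and metric name are assumptions): emit a dedicated bad-input metric, ideally dimensioned by caller, so you can see who sends malformed requests before you make a breaking change.
class BadInputMetricSketch {
    interface Metrics { void emit(String name); }   // hypothetical metrics client

    int myApiX(Metrics metrics, String callerId, String request) {
        if (request == null || request.isBlank()) {
            // Hypothetical per-caller dimension: shows exactly who sends bad input
            metrics.emit("APPLICATIONA_APIX_BAD_INPUT." + callerId);
            return 400;
        }
        metrics.emit("APPLICATIONA_APIX_SUCCESS");
        return 200;
    }
}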
Scalability
When we create a new service, we need to start from the basic idea that it will be running as multiple instances that are not aware of each other. A service will not scale well if one instance has to know about, or communicate with, another instance of the same service (the cases where this is not true are very rare). When assessing Scalability, look out for issues like the ones below.
Examples
1) Stateful - multiple instances will not behave as expected
function iDontScale(var input) {
    if (input.a.equals(DO_THIS)) {
        setNextOperation(OPERATION_A);
    } else if (input.a.equals(DO_THAT)) {
        setNextOperation(OPERATION_B);
    }
    switch (getNextOperation()) {
        case OPERATION_A:
            // do this
            break;
        case OPERATION_B:
            // do that
            break;
        default:
            // log I can't scale like that
    }
}
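A minimal stateless sketch in Java: the decision is derived from the request and returned, never stored on the instance between calls, so any instance can serve any request.
class StatelessSketch {
    enum Operation { OPERATION_A, OPERATION_B }

    // No instance state: the operation is a pure function of the input
    static Operation decide(String input) {
        return "DO_THIS".equals(input) ? Operation.OPERATION_A : Operation.OPERATION_B;
    }

    static void handle(String input) {
        switch (decide(input)) {
            case OPERATION_A -> doThis();
            case OPERATION_B -> doThat();
        }
    }

    static void doThis() { /* placeholder */ }
    static void doThat() { /* placeholder */ }
}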
2) Non-atomic event consumption - multiple instances may read the same event before the event is flagged as read
function readEventDB(var data) {
    // read event from DB
    var event = readEvent()
    // mark event as read from DB
    markEventAsRead(event.id)
}
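One possible shape of the fix, assuming the event store supports an atomic claim such as a conditional update (the tryClaim API below is hypothetical): only the instance that wins the claim processes the event.
class AtomicClaimSketch {
    interface EventStore {
        // Hypothetical conditional update: returns true only for the single caller
        // that flips the event from UNREAD to IN_PROGRESS
        boolean tryClaim(String eventId);
        String load(String eventId);
        void markProcessed(String eventId);
    }

    static void consume(EventStore store, String eventId) {
        if (!store.tryClaim(eventId)) {
            return;   // another instance already owns this event
        }
        process(store.load(eventId));
        store.markProcessed(eventId);
    }

    static void process(String event) { /* placeholder */ }
}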
The “so what”?
Although there is no rigorous paper, written from a vast amount of data, about the results of using the APEROS elements for code reviews, my empirical observations over the many years I have spent building and delivering software have been positive.
I have no idea if this technique brings value to every single software development team out there, but all I’ve seen so far convinces me that it’s at least worth trying.
Carlos Silvestre is a Software Development Manager at Amazon. The opinions of this article are his and don’t represent his Company’s.