The APEROS Code Review Elements
This article presents to engineers and engineering managers the APEROS (Availability, Performance, Extensibility, Resiliency, Observability, Scalability) code review elements. Using these elements offers a different, more efficient approach to reviewing code changes for distributed systems.
A look at Code Reviews
All developers out there know what Code Reviews are about. They are a widely used mechanism to: drive better code quality and maintainability, transfer knowledge between engineers, share ownership of the changes, find better solutions for the problems the change is trying to solve, and find potential gaps in the solution.
At tech companies, engineers review code multiple times a week. It’s a necessary, time-consuming task that every engineer should take part in. There are currently a few debates in the industry about the value added by Code Reviews. A few years ago, Google wrote a thorough paper on the subject. Here’s an interesting finding from it: “Despite years of refinement, code review at Google still faces breakdowns. These are mostly linked to the complexity of the interactions that occur around the reviews. Yet, code review is strongly considered a valuable process by developers, who spend around 3 hours a week reviewing.”
Code reviews are not something engineers are taught at university, nor something they are coached on early in their careers. There’s a mixed bag of styles when it comes to reviewing a change, and the approach is usually not consistent within a team, organisation or company.
That’s why, as an Engineering Manager, I’ve always spoken to my teams about how to add more value to code reviews, as a way to make the most of the time engineers dedicate to them every week.
How can we make the most of Code Reviews?
Code reviews help us validate that we’re doing the right thing as a team and that we have considered different opinions before moving ahead with a deployment. But they can become something automatic that engineers do without much thought behind it, without considering the value the review will bring and what really should be assessed case by case.
A common mistake made by engineers (usually more often by less tenured ones) is to focus on a line-by-line review and forget about the big picture. In my personal opinion (and I have no data to support this), folks tend to lose sight of the grand scheme of things (such as which compute the software runs on, what its upstreams and downstreams are, and what the purpose of the application is) and over-index on semantics, naming conventions, code style and error messages. AKA nitpicks - https://blog.danlew.net/2021/02/23/stop-nitpicking-in-code-reviews/ .
I’m not saying those things are not important; they are, and if not taken care of we may end up with spaghetti, unpleasant or unmaintainable code (or all of the above). But high-performing teams must find ways of automating most of these validations instead of spending review effort on them. Well-thought-out code style checks and static analysers are great ways to do that in a non-opinionated way. Quoting the aforementioned Google case study: “A qualitative study of 88 Mozilla developers found that static analysis integration was the most commonly-requested feature for code review. Automated analyses allow reviewers to focus on the understandability and maintainability of changes, instead of getting distracted by trivial comments (e.g., about formatting)”
In an environment where distributed systems make up the majority of the components (like most of the applications running across the globe nowadays, with the exception of pure UX or embedded/hardware-driven software), the focus must be somewhere else. Code reviews of software that runs as distributed applications need much more focus on the critical components, on what can go wrong and on what can be very expensive to deal with later, rather than on semantics and formatting. A quick read through the Fallacies of Distributed Computing is a great way to understand how the analysis you need to do for non-distributed software differs from the analysis needed for a distributed system.
As a target, we want our distributed applications to be reliable, performant, scalable, observable and easy to maintain. It’s a major challenge to get all of those elements right, and at a high bar, in a single application. But I believe we can all agree that these are important pillars of modern distributed systems. So the challenge is how to embed these elements, which are usually thought through during design, into implementation and code reviews.
Working backwards from the fallacies and everything that can go wrong in distributed systems, I encourage my teams to use the APEROS (Availability, Performance, Extensibility, Resiliency, Observability, Scalability) elements as the handbook for reviewing code for distributed systems.
Using the APEROS Elements
During your code reviews, write down the APEROS elements in front of you. Based on the description of the review and the nature of the change, pick which of these elements are relevant to the change - most of the time you will be assessing against two or three of the elements rather than all six. Sometimes all six apply; sometimes none of them do.
When writing your comments, add the element at the beginning, e.g. “[Scalability] This looks like a stateful check that will break if this service is running as multiple instances”. It helps the owner of the change get into the right mindset when addressing it.
The sections below contain a non-exhaustive list of things to look out for in each of the elements. The examples are deliberately absurd, but they give the reader a taste of what to look for.
Availability
Clients and applications have availability SLAs they agree upon. Most applications will aim for the highest possible uptime (happier clients, happier engineers, lower operational load). Engineers should aim to never leave the client hanging for longer than expected, to bubble up downstream failures and to avoid too many retries (as the fallacies remind us, networks are not perfect and you should always be ready for your downstreams to fail). When assessing Availability, look out for issues like the ones below.
Examples
1) Horrendous retry policy - multiple instances querying a problematic downstream until starvation
function readFromDownstream() {
    var response = null;
    do {
        try {
            response = callServiceA();
        } catch (E) {
            log.err("Service call failed, retry");
        }
    } while (response == null) // keeps retrying forever, hammering the downstream
}
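For contrast, here is a minimal Java sketch of what a bounded retry could look like (callServiceA, the retry budget and the backoff values are illustrative assumptions, not the article's original code): cap the attempts, back off between them, and bubble the failure up instead of spinning forever.
class BoundedRetrySketch {
    // Hypothetical downstream call, assumed to throw on failure
    static String callServiceA() throws Exception { return "ok"; }

    static String readFromDownstream() throws Exception {
        int maxAttempts = 3;          // illustrative retry budget
        long backoffMillis = 100;     // illustrative starting backoff
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return callServiceA();
            } catch (Exception e) {
                if (attempt == maxAttempts) {
                    throw e;          // bubble the downstream failure up to the caller
                }
                Thread.sleep(backoffMillis);
                backoffMillis *= 2;   // exponential backoff between attempts
            }
        }
        throw new IllegalStateException("unreachable");
    }
}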
2) Resource exhaustion threat if upstream load is not capped
function processRequest() {
    var future = launchThreadAndAsynchProcessRequest();
    if (future.complete) {
        return 200;
    } else if (future.err) {
        return 500;
    }
}
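A hedged sketch of one way to cap that load, assuming a plain Java thread pool fits the workload (the pool size, queue depth and status codes are illustrative): a bounded executor sheds excess requests explicitly instead of exhausting threads and memory.
import java.util.concurrent.*;

class BoundedProcessingSketch {
    // Bounded pool and bounded queue: excess load is rejected instead of exhausting resources
    private final ExecutorService pool = new ThreadPoolExecutor(
            4, 4, 0L, TimeUnit.MILLISECONDS,
            new ArrayBlockingQueue<>(100),
            new ThreadPoolExecutor.AbortPolicy());

    int processRequest(Runnable work) {
        try {
            pool.submit(work);
            return 202;   // accepted for asynchronous processing
        } catch (RejectedExecutionException e) {
            return 429;   // shed load explicitly when the queue is full
        }
    }
}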
Performance
Reviewing performance during code reviews is, most of the time, like putting the cart before the horse. You won’t know how your service performs until you probe it, until you run performance tests, until you exhaust components and measure throughput, latency, CPU and memory consumption, downstream latency and the rest of that long list of important performance metrics. But there’s still value in checking for obvious opportunities for improvement or obvious signs of early optimisation (AKA the root of all evil - good articles here and here ). When assessing Performance, look out for issues like the ones below.
Examples
1) Early optimisation - unless well documented and justified, the engineer at this point would not have a clue whether 10 threads are enough, too many, or whether this requires multithreading at all
function interpolateData() {
    try {
        var executionService = newFixedThreadPool(10); // 10 threads!
        var runnable = goThroughData();
        executionService.execute(runnable);
    } catch (E) {
        log.err("not good");
    } finally {
        executionService.close()
    }
}
2) Missed multithreading opportunity - large data sets will most probably benefit from parallel streams if the operation is atomic
function processList() {
    var result = veryLargeList.stream().map(Bla::doSomethingAtomic).collect(toList());
}
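If profiling confirms the per-item operation is independent and CPU-bound, the fix can be as small as switching to a parallel stream. A minimal Java sketch, with doSomethingAtomic standing in for the real operation:
import java.util.List;
import java.util.stream.Collectors;

class ParallelProcessingSketch {
    static List<String> processList(List<String> veryLargeList) {
        // parallelStream() lets independent, side-effect-free mappings use multiple cores
        return veryLargeList.parallelStream()
                .map(ParallelProcessingSketch::doSomethingAtomic)
                .collect(Collectors.toList());
    }

    // Stand-in for the real atomic operation
    static String doSomethingAtomic(String item) {
        return item.trim();
    }
}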
3) Immutability candidate - avoid read-then-write round trips if possible; consider whether the data model is right for the use case and whether immutability would be a better option
function updateItem(item) {
    var existingItem = readFromDB(item.key);
    if (item.this > existingItem.that) {
        updateInDB(item);
    } else {
        // do nothing
    }
}
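One possible direction, assuming the datastore offers conditional writes (many key-value stores do; the putIfNewer API below is hypothetical): push the comparison into the write itself, so there is no read-modify-write round trip and no race between instances.
class ConditionalWriteSketch {
    interface ItemStore {
        // Hypothetical conditional write: persists only if the stored version is older
        boolean putIfNewer(String key, long version, String payload);
    }

    private final ItemStore store;

    ConditionalWriteSketch(ItemStore store) {
        this.store = store;
    }

    void updateItem(String key, long version, String payload) {
        // Single round trip; the store enforces the condition instead of the application
        store.putIfNewer(key, version, payload);
    }
}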
4) Missed batching opportunity - most of the DBs offer batch operations for a reason
function persistList() {
    mediumSizedList.forEach(item -> { myDBClient.persist(item); });
}
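A small Java sketch of the batched alternative, with a hypothetical persistBatch API standing in for whatever bulk operation your database client offers:
import java.util.ArrayList;
import java.util.List;

class BatchPersistSketch {
    interface DbClient {
        void persistBatch(List<String> items);   // hypothetical bulk write API
    }

    static void persistList(DbClient dbClient, List<String> mediumSizedList, int batchSize) {
        // One round trip per chunk instead of one round trip per item
        for (int i = 0; i < mediumSizedList.size(); i += batchSize) {
            int end = Math.min(i + batchSize, mediumSizedList.size());
            dbClient.persistBatch(new ArrayList<>(mediumSizedList.subList(i, end)));
        }
    }
}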
Extensibility
When we build, we want to make life easier for the next person who touches this piece of code, not harder than it was for us when we started. You should aim to make your code, and the overall application code, easy to maintain - which often means easy to extend. The old but gold SOLID principles have a lot to say about extensibility, and it’s always good to revisit them from time to time. When assessing Extensibility, look out for issues like the ones below.
Examples
1) Hard to extend DB class - if a new table is added in the future, the class won’t be reusable and will have to be rewritten
class DBHandler {
    constant DB_URI = "db://";
    constant TABLE_NAME = "table";
    function connect() {..}
    function close() {..}
    function queryItem(key) {
        // execute select * from TABLE_NAME where index = 'key'
    }
}
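A minimal sketch of a more extensible shape (in Java, with the query body left as a placeholder): the endpoint and table name are injected, so adding a new table means constructing a new instance rather than rewriting the class.
class DbHandler {
    private final String dbUri;
    private final String tableName;

    // The endpoint and table are parameters, so new tables reuse the same class
    DbHandler(String dbUri, String tableName) {
        this.dbUri = dbUri;
        this.tableName = tableName;
    }

    String queryItem(String key) {
        // Placeholder for: select * from <tableName> where index = :key
        return "query " + dbUri + "/" + tableName + " for " + key;
    }
}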
2) Components with too many responsibilities - if tomorrow there’s a new parser, new file format, new output destination or a new audit demand, this class will have to be decomposed
class JackOfAllTrades {
    ...
    JackOfAllTrades(fileName, outputQueueTopic);
    function readFromFile() {..}
    function parseFile() {..}
    function connectToQueue() {..}
    function persistToQueue() {..}
    function logResultsToAudit() {..}
}
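A hedged sketch of one way to decompose it (the interface names are illustrative): each responsibility lives behind its own small interface, and a thin coordinator wires them together, so a new parser or a new destination is a new implementation rather than a rewrite.
import java.util.List;

class PipelineSketch {
    interface Reader    { String read(String source); }
    interface Parser    { List<String> parse(String raw); }
    interface Publisher { void publish(List<String> records); }
    interface Auditor   { void audit(List<String> records); }

    // The coordinator only sequences the steps; each collaborator has one reason to change
    static void run(Reader reader, Parser parser, Publisher publisher, Auditor auditor, String source) {
        List<String> records = parser.parse(reader.read(source));
        publisher.publish(records);
        auditor.audit(records);
    }
}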
Resiliency
Although resiliency feeds into Availability, it has enough breadth to be a separate topic. Resilient software avoids failure at all times, and when it does fail, it fails gracefully. The rule is simple - no fatals, and there should always be enough resources left to react properly. When assessing Resiliency, look out for issues like the ones below.
Examples
1) Careless usage of input data
function myAPIX(request, response) {
    try {
        // request.mylist may be null or empty - nothing validates it first
        var result = processInput(request.mylist.get(FIRST).foo)
        metrics.emit(APPLICATIONA_APIX_SUCCESS)
    } catch (IOException) {
        log.err("Failure");
        metrics.emit(APPLICATIONA_PROCESSINPUT_FAILURE);
    }
}
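A small Java sketch of a more defensive version (the request shape, status codes and metric names are assumptions): validate the input before dereferencing nested fields, and treat a malformed request as a client error rather than an unhandled crash.
import java.util.List;

class DefensiveInputSketch {
    interface Metrics { void emit(String name); }   // hypothetical metrics client

    record Request(List<String> myList) {}

    int myApiX(Metrics metrics, Request request) {
        // Validate before touching nested fields; a bad request is a 4xx, not an exception
        if (request == null || request.myList() == null || request.myList().isEmpty()) {
            metrics.emit("APPLICATIONA_APIX_BAD_INPUT");
            return 400;
        }
        processInput(request.myList().get(0));
        metrics.emit("APPLICATIONA_APIX_SUCCESS");
        return 200;
    }

    void processInput(String foo) { /* placeholder for the real work */ }
}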
2) Missed implicit exception - NullPointerException is unchecked and, if not caught upstream, would crash the application and/or cause a fatal
function foo(fileName) {
    try {
        var file = readFile(fileName);
        // returns null
        file.read();
    } catch (IOException e) {
        log.err("Not found");
    }
    // Uncaught NPE
}
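A minimal sketch of handling the missing file explicitly (using java.nio, with the fallback behaviour as an assumption): check for the absent case instead of letting an implicit NullPointerException escape.
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

class SafeReadSketch {
    static String readIfPresent(String fileName) throws IOException {
        Path path = Path.of(fileName);
        // Handle the missing file explicitly rather than relying on an implicit NullPointerException
        if (!Files.exists(path)) {
            return "";   // or throw a documented, domain-specific exception
        }
        return Files.readString(path);
    }
}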
3) Resources not closed - resource/memory leak threat
function interpolateData() {
    try {
        var executionService = newFixedThreadPool(10); // 10 threads!
        var runnable = goThroughData();
        executionService.execute(runnable);
    } catch (IOException E) {
        log.err("not good");
        executionService.close()
    }
    // Missing finally block
}
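A minimal Java sketch of the same flow with the executor always released (the pool size is still illustrative): the shutdown happens in a finally block, so the threads are returned whether the work succeeds or fails.
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class CleanShutdownSketch {
    static void interpolateData(Runnable goThroughData) {
        ExecutorService executionService = Executors.newFixedThreadPool(10);
        try {
            executionService.execute(goThroughData);
        } finally {
            executionService.shutdown();   // always release the threads, success or failure
        }
    }
}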
Observability
Distributed systems have pushed observability capabilities to a very high level in modern applications. We rely on emitted metrics to make decisions - automated or not. We use them to analyse health, to derive actions from trends and to improve our software. They are a key piece of our operations, and having the right metrics allows you to react quickly. Distributed applications must be observable from the get-go. When assessing Observability, look out for issues like the ones below.
Examples
1) Misleading failure metric - not enough granularity: the two failure modes emit the same metric, so you have to rely on logs to tell them apart
function myAPIX(request, response) {
    try {
        this.processRequest(request, response)
        metrics.emit(APPLICATIONA_APIX_SUCCESS)
    } catch (E1) {
        log.err("Failure 1");
        metrics.emit(APPLICATIONA_APIX_FAILURE);
    } catch (E2) {
        log.err("Failure 2");
        metrics.emit(APPLICATIONA_APIX_FAILURE); // Same as the previous exception
    }
}
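A hedged sketch of a more granular version (the exception types and metric names are illustrative): each failure mode gets its own metric, so dashboards and alarms can tell them apart without digging through logs.
class GranularMetricsSketch {
    interface Metrics { void emit(String name); }   // hypothetical metrics client

    void myApiX(Metrics metrics, Runnable processRequest) {
        try {
            processRequest.run();
            metrics.emit("APPLICATIONA_APIX_SUCCESS");
        } catch (IllegalArgumentException e) {
            metrics.emit("APPLICATIONA_APIX_VALIDATION_FAILURE");   // one metric per failure mode...
        } catch (RuntimeException e) {
            metrics.emit("APPLICATIONA_APIX_INTERNAL_FAILURE");     // ...so dashboards can tell them apart
        }
    }
}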
2) Duplicated metrics - two layers emit a failure metric for the same error, which may confuse your dashboard setup or just be completely ignored
function processInput(input) {
    try {
        var data = getDataFromDB(input)
    } catch (E) {
        log.err("Failure");
        metrics.emit(APPLICATIONA_PROCESSINPUT_FAILURE); // second metric for the same error
    }
}

function getDataFromDB(input) {
    try {
        var data = dbClient.read(input)
        return data;
    } catch (E) {
        log.err("Failure");
        metrics.emit(APPLICATIONA_GETDATADB_FAILURE);
        throw E; // the caller emits its own failure metric for the same error
    }
}
3) Bad-input metric missing - input metrics allow you to understand who is calling you with bad inputs and are very helpful when you need to migrate versions or make breaking changes
function myAPIX(request, response) {
    try {
        validateInput(request)
        var response = processInput(request)
        metrics.emit(APPLICATIONA_APIX_SUCCESS)
    } catch (ValidateEx) {
        log.err("Bad input");
        // Missing wrong input metric
    } catch (RuntimeE) {
        log.err("Failure");
        metrics.emit(APPLICATIONA_PROCESSINPUT_FAILURE);
    }
}
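A small sketch of the missing piece (the validation rule and metric name are assumptions): emit a dedicated bad-input metric, ideally dimensioned by caller, so you can see who sends malformed requests before you make a breaking change.
class BadInputMetricSketch {
    interface Metrics { void emit(String name); }   // hypothetical metrics client

    int myApiX(Metrics metrics, String callerId, String request) {
        if (request == null || request.isBlank()) {
            // Hypothetical per-caller dimension: shows exactly who sends bad input
            metrics.emit("APPLICATIONA_APIX_BAD_INPUT." + callerId);
            return 400;
        }
        metrics.emit("APPLICATIONA_APIX_SUCCESS");
        return 200;
    }
}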
Scalability
When we create a new service, we need to start from the basic idea that it will be running as multiple instances that are not aware of each other. A service will not scale well if one instance has to know about, or communicate with, another instance of the same service (the cases where this is not true are very rare). When assessing Scalability, look out for issues like the ones below.
Examples
1) Stateful - multiple instances will not behave as expected
function iDontScale(var input) {
    if (input.a.equals(DO_THIS)) {
        setNextOperation(OPERATION_A);
    } else if (input.a.equals(DO_THAT)) {
        setNextOperation(OPERATION_B);
    }
    switch (getNextOperation()) {
        case OPERATION_A:
            // do this
            break;
        case OPERATION_B:
            // do that
            break;
        default:
            // log I can't scale like that
    }
}
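A minimal stateless sketch in Java: the decision is derived from the request and returned, never stored on the instance between calls, so any instance can serve any request.
class StatelessSketch {
    enum Operation { OPERATION_A, OPERATION_B }

    // No instance state: the operation is a pure function of the input
    static Operation decide(String input) {
        return "DO_THIS".equals(input) ? Operation.OPERATION_A : Operation.OPERATION_B;
    }

    static void handle(String input) {
        switch (decide(input)) {
            case OPERATION_A -> doThis();
            case OPERATION_B -> doThat();
        }
    }

    static void doThis() { /* placeholder */ }
    static void doThat() { /* placeholder */ }
}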
2) Non-atomic event consumption - multiple instances may read the same event before the event is flagged as read
function readEventDB(var data) {
    // read event from DB
    var event = readEvent()
    // mark event as read from DB
    markEventAsRead(event.id)
}
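One possible shape of the fix, assuming the event store supports an atomic claim such as a conditional update (the tryClaim API below is hypothetical): only the instance that wins the claim processes the event.
class AtomicClaimSketch {
    interface EventStore {
        // Hypothetical conditional update: returns true only for the single caller
        // that flips the event from UNREAD to IN_PROGRESS
        boolean tryClaim(String eventId);
        String load(String eventId);
        void markProcessed(String eventId);
    }

    static void consume(EventStore store, String eventId) {
        if (!store.tryClaim(eventId)) {
            return;   // another instance already owns this event
        }
        process(store.load(eventId));
        store.markProcessed(eventId);
    }

    static void process(String event) { /* placeholder */ }
}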
The “so what”?
Although there is no rigorous paper, written from a vast amount of data, about the results of using the APEROS elements for code reviews, my empirical observations over the many years I have spent building and delivering software have been positive.
I have no idea if this technique brings value to every single software development team out there, but all I’ve seen so far convinces me that it’s at least worth trying.
Carlos Silvestre is a Software Development Manager at Amazon. The opinions of this article are his and don’t represent his Company’s.