The Path to Creating an Order Broker Service
Potter Rafed
One of our recent projects was to create a service that can receive, translate and forward requests for sales order transactions between our systems.
The service is part of the plan to introduce an Enterprise Service Bus architecture across the Dunelm systems, aiming to standardise our integration platform.
Make way for the Order Broker Service.
In this article we are going to go through some of the challenges we faced while creating this service and how we overcame them.
- Designing the infrastructure to cater for the needs of the application.
- Meeting our Service Level Agreements (SLAs) with a working solution.
- Utilising job queues to remove the need for a database and to improve app response time.
- Managing persistent job queues and persistent logs within an ephemeral container environment.
- Exploring methods to optimise our app response time.
- Implementing static analysis tools to help write clean and standardised code.
The Requirements
The aim was simple to start with – receive an order through an API call, transform it and send it to another API endpoint.
I am sure all of you can create a system like that in a few days.
However, there were a few more details in our SLAs…
- The Order Broker must be able to receive orders in bulk: a single API call could contain 1 or 100+ orders (a rough sketch of such a payload follows these lists).
- For each order, it must be able to transform the order to match the specification of its destination.
- A single order must be able to be sent to one or more destinations, based on who sent the order.
- A basic validation of the order has to be made (e.g. required fields being present).
- In addition to orders, the Order Broker must be able to handle notifications, cancellations and credit memos (various order updates), each with its own format, destinations and validation.
And those are just the ones coming from business. We knew that if we were going to create a reliable and stable service we needed a few other things:
- There must be a strong logging mechanism in place to allow for order tracing and debugging.
- We should be able to resend orders if, for one reason or another, an order has failed to send.
- All message (order, notification etc.) destinations should be easily configurable and changeable.
- We needed to have a resilient autoscaling infrastructure to allow us to deploy updates and fixes with zero downtime.
- Our app response time for a request containing 1 order should be no more than 20ms.
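To make the shape of the problem a bit more concrete, here is a rough sketch of the kind of bulk payload the broker receives. The field names and structure are illustrative only, not our real contract.

```php
<?php
// Illustrative only: a decoded bulk request body.
// Field names are hypothetical, not the actual Order Broker contract.
$request = [
    'sender' => 'web-store',   // used to work out the destination(s)
    'type'   => 'order',       // order | notification | cancellation | credit-memo
    'messages' => [
        ['orderId' => 'A1001', 'lines' => [/* ... */], 'total' => 59.99],
        ['orderId' => 'A1002', 'lines' => [/* ... */], 'total' => 120.00],
        // ...anywhere from 1 to 100+ of these per call
    ],
];
```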
Now this no longer seems like a three-day task, does it?
So let's dive in and see how a team of 2.5 developers and 1 QA tackled this in 3 months. (The 0.5 comes from the fact that we had one person only partly involved, not because they were half-decent developers!)
Infrastructure
Since we already had a well established infrastructure in AWS, this part wasn't that much of a problem. We created a Dockerised environment and deployed it to EC2 Container Service (ECS) in AWS.
We are running at least 2 t2.medium instances in front of a load balancer at all times.
The load balancer is multi-availability-zone and we can scale to n identical instances in a matter of minutes – simple but effective!
Database?
We could have had a database… but we didn’t want one.
Why bother with the additional overhead of entities, configuration, an entity manager and potential migrations when, after all, the Order Broker is just a message-passing system?
All we needed was a good Job Queue and a well-thought-out configuration to control it.
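To give an idea of what that configuration might look like, routing can be expressed as a simple map from sender and message type to destinations and transformers. This is a hedged sketch, not our actual config format; the class names and URLs are made up.

```php
<?php
// Hypothetical routing configuration: sender + message type => destinations.
// The real config format isn't shown in this post; this is just the idea.
return [
    'routes' => [
        'web-store' => [
            'order' => [
                'destinations' => ['warehouse-api', 'finance-api'],
                'transformer'  => 'App\Transformer\WarehouseOrderTransformer',        // illustrative class
            ],
            'cancellation' => [
                'destinations' => ['warehouse-api'],
                'transformer'  => 'App\Transformer\WarehouseCancellationTransformer', // illustrative class
            ],
        ],
    ],
    'destinations' => [
        'warehouse-api' => ['url' => 'https://warehouse.example.com/orders'],
        'finance-api'   => ['url' => 'https://finance.example.com/orders'],
    ],
];
```

Changing a message's destinations then becomes a configuration change rather than a code change, which is what the "easily configurable and changeable" requirement above is really asking for.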
Job Queue
We’ve had lots of good experience with the in-memory queueing service Beanstalkd so that was the obvious candidate for us here.
We created a local beanstalkd server, running in a separate container as part of our service. Each running instance had its own Beanstalkd server and queued its jobs there.
After the first few iterations however, we came across a problem – any jobs that hadn’t been cleared were being wiped out when we did a release.
When you think about it, it makes sense – with every new release we did, all of our containers were destroyed (along with the jobs information) and new ones were being spawned instead. This is one of the challenges that containerisation presents.
We decided we needed a better solution for this – one that would allow us to continue developing in isolation without any dependent services, but that would also preserve our jobs when running a Staging or Production environment.
To achieve this we created a Centralised Beanstalkd Service.
It has its own load balancer and a service running on EC2 instances with an EFS volume attached. All we had to do then was mount the EFS volume into the beanstalkd container, so that the job data (beanstalkd's binlog) is written outside the container and destroying containers and EC2 instances no longer deletes it. Perfect!
Since this is a centralised system, other teams can use this Beanstalkd service to submit jobs and run workers against as well.
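In practice this just meant the queue host became configuration: locally, each instance talks to its own beanstalkd container, while Staging and Production point at the centralised service behind its load balancer. Something along these lines (the environment variable names are made up for illustration):

```php
<?php
// Hypothetical: pick the local beanstalkd container in development,
// the centralised Beanstalkd Service everywhere else.
$beanstalkdHost = getenv('BEANSTALKD_HOST') ?: '127.0.0.1';
$beanstalkdPort = (int) (getenv('BEANSTALKD_PORT') ?: 11300);
```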
Queue Management
We've been using Beanstalkd for the last 3-4 years now and we've had multiple implementations and wrappers around the Pheanstalk library, but never a single abstract implementation of a queue system.
Each time, we had to create all the classes for reading jobs from the queue, adding jobs to the queue, deleting jobs and so on… AGAIN! It was time to end this nonsense, so we created a separate abstract queue library which has our queue namespacing conventions built in.
It comes with an adapter for Beanstalkd and it is easily extendable if we ever wanted to use another queue service like RabbitMQ or SQS.
It has already made some of our colleagues’ lives easier by not having to write their own (yet another) queue management solution.
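To illustrate the shape of that library, here is a simplified sketch, not the actual code: the interface and class names are invented, and it assumes a Pheanstalk v3-style API on PHP 7.1.

```php
<?php
// Simplified sketch of an abstract queue with a Beanstalkd adapter.
// Names are illustrative, not the real library.

use Pheanstalk\Pheanstalk;
use Pheanstalk\Job;

interface QueueInterface
{
    public function put(string $queue, string $payload);
    public function reserve(string $queue);   // returns a job or null
    public function delete($job);
}

final class BeanstalkdQueue implements QueueInterface
{
    /** @var Pheanstalk */
    private $client;

    public function __construct(Pheanstalk $client)
    {
        $this->client = $client;
    }

    public function put(string $queue, string $payload)
    {
        // Our queue namespacing conventions (e.g. "order-broker.routing") would be applied here.
        $this->client->useTube($queue)->put($payload);
    }

    public function reserve(string $queue)
    {
        $job = $this->client->watch($queue)->ignore('default')->reserve();

        return $job instanceof Job ? $job : null;
    }

    public function delete($job)
    {
        $this->client->delete($job);
    }
}
```

Swapping Beanstalkd for RabbitMQ or SQS then only means writing another adapter against the same interface.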
The Routing Queue
We needed to distinguish between the different message types (order, cancellation, notification) and determine the sender, in order to work out where each message should be distributed.
If we were to wait for a response from another system before replying to the sender our response time would depend on that other system. And we definitely did not want that.
An example of this flow would look like this:
- Sender sends a message
- Order Broker (OB) validates it
- OB finds its destination based on the sender and type of message
- OB transforms the message based on the destination and type of message
- OB sends the message to each destination
- OB returns the response to the sender
You can see how we would have never been able to respond under 20ms in such a scenario.
We needed to do the bare minimum while processing the request – the validation – and "outsource" the more time-consuming work to another process.
This is where the Routing Queue comes into play.
Once we had validated the message we would put it in the Routing Queue and, if that was successful, return a success response to the sender.
Within the Routing Queue we had all the time in the world (within reason!) to do everything else: find the destinations, transform the message and push it onto one or more outbound queues, which would then send it on to the various destinations.
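Put together, the request/worker split looks roughly like this. It is only a sketch: class, tube and method names are made up, it reuses the QueueInterface sketched earlier, and JsonResponse stands in for the framework's PSR-7 JSON response (Zend Diactoros ships one).

```php
<?php
// A sketch of the split; all names are illustrative.

use Zend\Diactoros\Response\JsonResponse;

// 1. Request side (inside the HTTP handler): validate, enqueue, respond.
function handle(array $message, $validator, QueueInterface $queue): JsonResponse
{
    $errors = $validator->validate($message);                    // required fields etc.
    if ($errors !== []) {
        return new JsonResponse(['errors' => $errors], 422);
    }

    $queue->put('order-broker.routing', json_encode($message));  // hand off to the Routing Queue
    return new JsonResponse(['status' => 'accepted'], 202);
}

// 2. Worker side: a long-running process consuming the Routing Queue.
function work(QueueInterface $queue, $router, $transformer)
{
    while ($job = $queue->reserve('order-broker.routing')) {
        $message      = json_decode($job->getData(), true);
        $destinations = $router->destinationsFor($message['sender'], $message['type']);

        foreach ($destinations as $destination) {
            $payload = $transformer->transformFor($destination, $message);
            // Per-destination outbound queues; separate workers do the actual HTTP sends.
            $queue->put('order-broker.outbound.' . $destination, json_encode($payload));
        }

        $queue->delete($job);
    }
}
```

The request only ever pays for validation plus one queue put, which is what makes the 20ms target realistic.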
Responding under 20ms
Our average response time was around 190ms. That's not terrible, but we needed to achieve <20ms!
We wanted a way of measuring our progress while tackling this, so we started by setting up a local Webgrind with Xdebug to profile our application. We also set up Siege so we could test with a lot of requests per second.
One of the first things we did to optimise the app was to update our Nginx version to 1.11.13, which significantly reduced our response time to around 150ms.
We then played around with composer install by adding --no-dev on our production and staging builds, but saw no improvement. We also optimised the autoloader dump by adding --classmap-authoritative (the Composer documentation covers these autoloader optimisations in more detail).
Since we were running PHP 7.1, we obviously also enabled OpCache, and surprisingly there wasn't much of an improvement there either.
We then went into deeper analysis to find out what exactly was taking so long. It turned out that most of the time was spent including various libraries and dependencies: around 90% of it went on the autoloader and Zend Expressive loading their classes.
We found our app’s pure execution time was around 2ms and we had no bottlenecks.
So we somehow needed to get those classes to load more quickly.
This led to further investigation of autoload dump optimisations, upgrading to Zend Expressive 2 and even downgrading to Zend Expressive 1.0 (a different project of ours that was meeting its SLAs was on ZE 1.0), but we still couldn't find a solution.
Lastly we decided to look into PHP-FPM optimisations to somehow try and cache those files.
It was then that we realised we had the following setting in one of our .ini files.
opcache.enable=0
WHAT?
Yes… it turned out we had never actually had OpCache running.
It was a misconfiguration on our part: the switch we had in place to control OpCache had a bug and had never worked.
After enabling OpCache our app response time dropped to ~18ms!
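A trivial sanity check like the one below, run against the FPM/web SAPI, would have caught the misconfiguration much earlier.

```php
<?php
// Quick check that OPcache is actually enabled for the web/FPM SAPI.
// (The CLI SAPI has its own opcache.enable_cli setting, so run this via a request.)
var_dump(ini_get('opcache.enable'));   // "1" when enabled
var_dump(function_exists('opcache_get_status')
    ? opcache_get_status(false)        // cache stats, false if OPcache is off
    : 'opcache extension not loaded');
```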
Here is a recent snapshot of our average response times from New Relic.
Centralised Logging
One morning a colleague came to me and asked me to check whether we'd received a request the previous day, because they had not received the message at the other end. So I logged in to the server, only to find that the oldest log entry we had was from an hour earlier!
Being a system with no database, the logs were the only place where we could trace messages and check for errors.
Remember when we were losing our jobs data whenever we did releases? Well the same thing was happening here – we were saving the logs locally (in the container) and when a new release was being pushed the container (along with the logs) was being destroyed.
Darn it! We put our heads down and came across something called Fluentd.
Fluentd is an open source data collector for unified logging layer
We installed it in a separate container running in our build and had our logs folder mounted to it. It then asynchronously streamed all of our logs to AWS CloudWatch.
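On the application side nothing needs to change: logs are still just files written into that shared folder. As a rough illustration only (this post doesn't actually say which logging library we use; Monolog and the path below are assumptions), writing structured JSON lines keeps them easy for Fluentd to parse:

```php
<?php
// Hypothetical application-side logging: JSON lines written to the folder
// that the Fluentd container has mounted. Monolog and the path are assumptions.

use Monolog\Logger;
use Monolog\Handler\StreamHandler;
use Monolog\Formatter\JsonFormatter;

$handler = new StreamHandler('/var/log/order-broker/app.log', Logger::INFO);
$handler->setFormatter(new JsonFormatter());

$logger = new Logger('order-broker');
$logger->pushHandler($handler);

$logger->info('Message routed', [
    'messageType'  => 'order',
    'sender'       => 'web-store',
    'destinations' => ['warehouse-api'],
]);
```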
We could then make all the releases we wanted without our logs being lost. That was really cool!
Wrap Up with GrumPHP
One of the last things we did to the project (perhaps it should have been the first?) was to investigate and introduce GrumPHP.
If you haven’t heard of GrumPHP here is what they say about themselves:
This composer plugin will register some git hooks in your package repository. When somebody commits changes, GrumPHP will run some tests on the committed code. If the tests fail, you won’t be able to commit your changes.
Essentially it is a tool that runs other static analysis tools against the code you are trying to commit.
We configured it to check for basic PSR-1 and PSR-2 coding standards, along with syntax checks, cyclomatic and NPath complexity checks, other types of code smells and commit message standards.
We will eventually spend some more time writing our own checkers that look for and enforce some of our own standards, but for now GrumPHP off the shelf works pretty amazingly too!
Overall, the Order Broker Service is our 4th (micro)service, and we've shown that we can produce a high-quality, maintainable and reliable service in a reasonable amount of time with the tools and knowledge we've been building – something we can all be proud of!
Many thanks to Vasil Dakov, Oras Al-Kubaisi, Valentine Necsescu and Fabian Soto for their efforts in achieving this.
— Potter Rafed, Lead Software Developer
PS: Don’t forget your unit tests!