Using bot technology for enterprise legacy systems

General

In today's IT world, we as technology managers need to provide simple and inexpensive solutions to complex and costly problems. Fortunately, many technological tools can help us with this task. Here I will share a real case in which, using a smart process, we implemented a simple solution to a complicated problem.

The business process

We had a capacity problem in one of our legacy platforms, which provides various services to users. Although it is not defined as a production-critical system, it is important to the organization: it serves about 45,000 different customers a day.

Because of its legacy design and original planning, the platform was built to support around 10,000 users. After rapid growth, it now supports over 45,000 users (more than 300% growth).

The capacity problem overloaded the system and caused the service to crash. The best-practice solution for this kind of problem is to rewrite the application and adapt it to the new capacity plan. That path, however, consumes a lot of time, money and other resources.

Keeping in mind the need for a complete and thorough solution to handle the new capacity, we first aimed at a "simple" solution: one that is easy to implement and light on resources, even if it is not the "best" solution and does not follow best practices to the letter. In parallel, we were able to work on a long-term solution that would improve the platform's performance and eventually reduce the need for the short-term fix.


The bot solution 

We started to monitor the platform closely and identified the following key things:

  1. The set of events that occurred before each crash.
  2. The set of actions the support team had to take to handle those crashes.
  3. The set of tests we needed to run against the service once it was back online.

By consolidating all this data, we wrote an automated bot mechanism that runs a long list of actions and tests and, when needed, gracefully restarts the problematic service without any impact on end users.
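To make the idea concrete, here is a minimal sketch of such a bot in Python. It is not the code we ran in production. The FakeService class, its method names and the 9,500-session threshold are all hypothetical placeholders, chosen only to show how the three sets above (warning signs, recovery actions, post-restart tests) can be wired together.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class FakeService:
    """Stand-in for the legacy service; every field here is illustrative."""
    open_sessions: int = 0
    accepting_traffic: bool = True
    restarts: int = 0

    def near_capacity(self) -> bool:
        # Hypothetical warning sign: session count close to an assumed limit.
        return self.open_sessions >= 9_500

    def drain(self) -> None:
        # Stop routing new users to this instance before touching it.
        self.accepting_traffic = False

    def restart(self) -> None:
        # In this toy model a restart clears the accumulated sessions.
        self.open_sessions = 0
        self.restarts += 1

    def resume(self) -> None:
        # Put the instance back into rotation.
        self.accepting_traffic = True

    def responds(self) -> bool:
        # Hypothetical post-restart test.
        return self.accepting_traffic and self.open_sessions < 9_500


@dataclass
class RestartBot:
    warning_signs: List[Callable[[], bool]]   # 1. events seen before a crash
    recovery_steps: List[Callable[[], None]]  # 2. actions the support team took
    health_tests: List[Callable[[], bool]]    # 3. tests after the restart
    log: List[str] = field(default_factory=list)

    def run_once(self) -> str:
        # Do nothing unless at least one known warning sign is present.
        if not any(sign() for sign in self.warning_signs):
            return "healthy"
        # Replay the manual recovery procedure, in order.
        for step in self.recovery_steps:
            step()
            self.log.append(f"ran {step.__name__}")
        # Only report success if every post-restart test passes.
        if all(test() for test in self.health_tests):
            return "recovered"
        return "escalate"  # hand over to a human operator


service = FakeService(open_sessions=9_800)
bot = RestartBot(
    warning_signs=[service.near_capacity],
    recovery_steps=[service.drain, service.restart, service.resume],
    health_tests=[service.responds],
)
print(bot.run_once())
```

In a real deployment, run_once would be called on a schedule, and each placeholder would be replaced by the actual checks, commands and tests collected during monitoring. Returning "escalate" when a test fails keeps a human in the loop for cases the bot cannot fix.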

As a result of this mechanism, the platform was up 100% of the time, and users received a higher level of service with no interruptions.

