What We Do in the Shadows: QIC's Experience in Monitoring and Error Analytics
Hi! My name is Daria Izmodenova and I am a Systems Analyst in the Motor team at QIC digital hub. I am responsible for the online service for purchasing car insurance policies in Oman — a new country for us where the app has recently launched. At this stage, our team is focused on fixing technical issues, improving conversion rates, and driving traffic. In this article, I will share the tools we used to track user behavior and identify errors in the service, as well as discuss the advantages and disadvantages of each.
Product Development Features
Before joining QIC, I worked in custom development for 3 years, and transitioning to a product team has been a new and exciting experience for me. When you work on a single product, you gain a better understanding of how releases impact customer behavior and revenue, which underscores the importance of collecting and analyzing user data. Even a seemingly small application involves a lot of work: every feature is carefully tagged and logged. Additionally, teams are not afraid to revisit previously implemented solutions if doing so helps improve business metrics.
Simplified Process of Working on Functionality:
As shown above, our team works iteratively. When we release an improvement, we try to analyze it, and that analysis leads to new hypotheses that are added to the backlog for further development. To properly assess the service's performance, it is essential to track business and product metrics as well as technical indicators and errors. This is the responsibility of the System and Product Analysts. Let's start with product analytics.
Google Analytics
For analyzing user behavior, we use Google Analytics. Product Analysts are responsible for creating specifications for tagging the service, maintaining technical documentation, and collecting and analyzing data before and after releases.
Google Analytics is an affordable tool that is relatively easy to set up and maintain. One of its advantages is built-in integration with additional services such as Google Ads, BigQuery, and Looker Studio. There are drawbacks, though: limits on tracked events force us to balance quality and quantity, and inaccuracies in event reporting have to be accounted for during analysis. Furthermore, Google displays sampled rather than complete data, so analysts often have to spend extra time querying BigQuery or building dashboards. Overall, Google Analytics lets us systematically analyze the app's performance and highlight the areas where users run into technical errors; for the errors themselves, we rely on other tools.
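To make the tagging concrete, here is a minimal sketch of how a funnel step could be reported to Google Analytics 4 via gtag. The event and parameter names are invented for illustration and are not our real tagging specification.

```typescript
// Minimal sketch: reporting a hypothetical funnel step to Google Analytics 4.
// The event name and parameters are illustrative, not our actual spec.

declare global {
  // gtag is injected by the standard GA snippet on the page
  interface Window {
    gtag: (...args: unknown[]) => void;
  }
}

interface QuoteStepParams {
  step_name: string;    // e.g. "vehicle_details"
  policy_type?: string; // e.g. "comprehensive"
  error_code?: string;  // set only when the step failed
}

export function trackQuoteStep(params: QuoteStepParams): void {
  // GA4 limits the number of distinct events and parameters,
  // which is why we balance the quality and quantity of what we track.
  window.gtag("event", "quote_step_completed", params);
}

// Usage: trackQuoteStep({ step_name: "vehicle_details", policy_type: "comprehensive" });
```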
Log Tables
Our application integrates with a master system that handles most of the business logic, and the requests and responses exchanged with it are written to log tables. Once we export the data for the desired period, we can view and analyze the requests and responses, filter them by session, and follow user actions chronologically. Unfortunately, production data is only available with a one-day lag; otherwise, this is an excellent way to troubleshoot issues.
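As a rough illustration of how such an export can be sliced offline, here is a small sketch; the record shape is an assumption made for the example, not the real export format.

```typescript
// Sketch: reconstructing one user's session from an exported log table.
// The LogRecord shape is assumed for illustration.

interface LogRecord {
  timestamp: string;  // ISO 8601
  session_id: string;
  endpoint: string;
  request: unknown;
  response: unknown;
}

// Filter the export down to a single session and order it chronologically,
// which is usually enough to see what the user did and where things broke.
export function sessionTimeline(records: LogRecord[], sessionId: string): LogRecord[] {
  return records
    .filter((r) => r.session_id === sessionId)
    .sort((a, b) => a.timestamp.localeCompare(b.timestamp));
}
```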
Initially, instead of tables, we only had a Telegram chat into which request logs were exported in real time. This allows for quick, ad-hoc monitoring, but without alerting, the responsibility falls on whoever happens to check the chat. For more systematic work, a proper process is needed; more on that below.
Error Codes
Our team was tasked with "transitioning to error codes," specifically moving all application logic from the frontend to the backend. Here's an example:
A client can have only one active policy per VIN (the vehicle's unique identification number). If such a policy is found, the purchase becomes impossible, and the user sees an error screen. Previously, the frontend sent the VIN to the backend, which queried a third party and returned the data to the frontend; the frontend then checked a specific parameter to see whether the client had an active policy and, if so, displayed the error screen. We rewrote the logic so that the backend checks for an active policy itself and sends the corresponding error_code to the frontend.
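Here is a simplified sketch of that contract; the response shape and the specific error_code value are illustrative assumptions rather than our exact API.

```typescript
// Sketch of the "error codes" approach: the backend decides, the frontend only
// maps error_code to a screen. Shapes and values are hypothetical.

type QuoteResponse =
  | { status: "ok"; quote_id: string }
  | { status: "error"; error_code: string };

export function screenFor(res: QuoteResponse): string {
  if (res.status === "error") {
    switch (res.error_code) {
      case "ACTIVE_POLICY_EXISTS": // the client already has a policy for this VIN
        return "screen:active-policy-error";
      default:
        return "screen:generic-error";
    }
  }
  return "screen:quote-summary";
}
```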
Similar work was done throughout the application, and it was important not to miss or break anything along the way. In the process, our team ran into a number of challenges.
Despite all these challenges, the exercise proved very beneficial. Clear markers were established for all edge scenarios, laying the groundwork for connecting additional monitoring systems.
Sentry
Sentry is a tool for tracking errors in applications. Its capabilities also include performance monitoring, user feedback collection, analysis and recommendations for code improvements, session replay, and incident management. For more about Sentry, read another article by Dastan Abdiraiym, "How to Work with Sentry: A Developer's Guide." Here I will only describe our team's experience of implementing it.
When we first started driving traffic, it was important to respond to errors quickly to prevent "budget drain." For this, our team tried to use Sentry. The idea was that upon receiving an error code from the backend, the frontend would send data to Sentry, which tracks errors and generates alerts as needed. We were inspired by the example of a neighboring team, which had also set up notifications in Slack, and we hurried to implement the same thing.
However, we did not get the desired outcome. There were so many errors with HTTP status 4xx that we had to disable notifications to keep them from flooding the channel, and as a result, the genuinely problematic requests never made it into the statistics. Here's how our developer, Maxim Dolgikh, commented on the situation:
"Sentry serves for developers to monitor unforeseen errors in the application. The idea that analysts would track something through these errors was initially flawed—it was an attempt to 'hammer nails with a saw.'"
Currently, Sentry is used for frontend errors and for requests with status 5xx. There are also plans to move some requests to status code 422, which we use to indicate a violation of business logic.
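As an illustration of the current setup, here is a minimal sketch using the Sentry browser SDK in which 5xx responses are reported while 4xx business-logic responses stay out of Sentry; the DSN, wrapper function, and message format are placeholders, not our actual code.

```typescript
// Sketch: report 5xx responses to Sentry, keep 4xx business-logic responses
// out of it. The DSN and the wrapper are placeholders.
import * as Sentry from "@sentry/browser";

Sentry.init({ dsn: "https://examplePublicKey@o0.ingest.sentry.io/0" });

export async function apiFetch(input: RequestInfo, init?: RequestInit): Promise<Response> {
  const response = await fetch(input, init);
  if (response.status >= 500) {
    // Server-side failures are genuinely unexpected, so they belong in Sentry.
    Sentry.captureMessage(`API ${response.status} on ${response.url}`, "error");
  }
  // 4xx (and, in the future, 422 business-logic violations) are handled by the
  // application itself and analyzed through logs rather than Sentry alerts.
  return response;
}
```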
We learned an important lesson: take the time to study how other teams actually use a tool before adopting it yourself. In hindsight, it turned out that the neighboring team used the business-logic monitoring scheme in Sentry as an exception, for a single feature that needed close monitoring right after launch. Once everyone was sure there were no critical errors, Sentry was used as intended: to track frontend errors in the application.
Elasticsearch
Elasticsearch, or rather the ELK stack (Elasticsearch, Logstash, Kibana), has long been the standard for working with logs. Implementing it takes time and significant effort, so we could not use it right away; the rollout required several preparatory steps.
In our product, ELK became a real lifesaver. Marking up all the methods and preparing them for logging took the analyst a single day; the only extra work was data masking. The developers quickly added the necessary methods to ELK, and soon we could view the first logs in the test environment.
Among the advantages of Elasticsearch is very fast search across large volumes of data, especially in contrast to the limited access to production data we have otherwise. ELK also offers a convenient aggregation mechanism for log analysis: you can build reports and dashboards, visualize data, and filter by application, session, and so on, which is incredibly convenient.
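As an example of what this looks like in practice, here is a rough sketch of pulling a single session's logs out of Elasticsearch with the official Node.js client; the index name and field names are assumptions, not our real mapping.

```typescript
// Sketch: fetch one session's log entries from Elasticsearch, oldest first.
// Index and field names are assumed for illustration.
import { Client } from "@elastic/elasticsearch";

const client = new Client({ node: "https://elk.example.internal:9200" });

export async function fetchSessionLogs(sessionId: string) {
  const result = await client.search({
    index: "motor-app-logs-*",
    query: { term: { "session.id": sessionId } },
    sort: [{ "@timestamp": { order: "asc" } }],
    size: 500,
  });
  return result.hits.hits.map((hit) => hit._source);
}
```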
Unfortunately, at the moment there are also downsides: there is a significant gap between method calls and their appearance in logs in production. Our developers are doing everything possible to fix this. Part of the team is also experiencing access difficulties due to the nuances of remote work and complicated combinations of several VPNs. These issues are already being addressed by the DevOps department.
Nevertheless, with ELK we have potentially rapid access to logs and the ability to quickly respond to incidents and analyze problematic cases. The efforts invested in implementing Elasticsearch will definitely not be in vain.
Roles and Responsibilities
Tools are great; working, documented processes are even better. When implementing monitoring, it is important to agree at the outset on who will be responsible for what. Our team made that mistake: we ended up holding a large meeting and deciding on the fly how we would monitor the situation in production and how to divide up the responsibilities.
Conclusions
Our team continues to refine our approaches to driving traffic. Recently, we released a major update and switched organic traffic to our new portal. In a short time, we tested various tools for traffic analysis and selected those suitable for our product. We managed to eliminate critical errors, and now our focus is on improving business metrics.
The good news is that a support department recently emerged within the company, and its head was tasked with building a system for monitoring errors and handling incidents. Soon we will have a clear work process, a board in JIRA, and even our own email address for support requests! Through the collaborative efforts of the teams, a foundation has been laid for the support staff, and we now look forward to a future of stable services and clear processes, which will make it much easier to launch new products.
I would like to thank Maxim Dolgikh, Danil Malich, Maria Polyakova, Alexander Gordienko, and Vladislav Shcherbakov for their assistance in writing this article.