What We Do in the Shadows: QIC's Experience in Monitoring and Error Analytics

Hi! My name is Daria Izmodenova and I am a Systems Analyst in the Motor team at QIC digital hub. I am responsible for the online service for purchasing car insurance policies in Oman — a new country for us where the app has recently launched. At this stage, our team is focused on fixing technical issues, improving conversion rates, and driving traffic. In this article, I will share the tools we used to track user behavior and identify errors in the service, as well as discuss the advantages and disadvantages of each.

Product Development Features

Before joining QIC, I worked in custom development for three years, and transitioning to a product team has been a new and exciting experience for me. When you work on a single product, you gain a better understanding of how releases affect customer behavior and revenue, which underscores the importance of collecting and analyzing user data. Even a seemingly small application involves a lot of work, since every feature is carefully tagged and logged. Teams are also not afraid to revisit previously implemented solutions if doing so helps improve business metrics.

Simplified Process of Working on Functionality:

As the scheme above shows, our team works iteratively. When we release an improvement, we analyze its effect, which leads to new hypotheses that go into the backlog for further development. To properly assess how the service performs, we need to track business and product metrics as well as technical indicators and errors. This is the responsibility of the System and Product Analysts. Let's start with product analytics.

Google Analytics

For analyzing user behavior, we use Google Analytics. Product Analysts are responsible for creating specifications for tagging the service, maintaining technical documentation, and collecting and analyzing data before and after releases.

Google Analytics is affordable and relatively easy to set up and maintain. One of its advantages is out-of-the-box integration with related services such as Google Ads, BigQuery, and Looker Studio. There are drawbacks, however: limits on tracked events force us to balance quality against quantity, and inaccuracies in event reporting have to be accounted for during analysis. Furthermore, Google displays sampled rather than complete data, so analysts often spend extra time querying BigQuery or building dashboards. Still, Google Analytics lets us systematically analyze the app's performance and highlight the places where users run into technical errors, which we then investigate with other tools.
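To give a sense of what the tagging looks like in practice, here is a minimal sketch assuming gtag.js with GA4; the event name and parameters are hypothetical and do not reflect our actual tagging specification.

```typescript
// Minimal sketch of GA4 event tagging via gtag.js.
// The event name and parameters are illustrative, not our real tagging plan.
declare function gtag(...args: unknown[]): void; // injected by the GA snippet on the page

function trackQuoteStarted(vehicleType: string, trafficChannel: string): void {
  // Custom GA4 event; analysts can later slice it by these parameters.
  gtag('event', 'quote_started', {
    vehicle_type: vehicleType,       // e.g. 'sedan'
    traffic_channel: trafficChannel, // e.g. 'organic'
  });
}

trackQuoteStarted('sedan', 'organic');
```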

Log Tables

Our application integrates with a master system that handles most of the business logic, and the requests and responses exchanged between them end up in log tables. Once we have an export for the desired period, we can review and analyze those requests and responses, filter them by session, and follow user actions chronologically. Unfortunately, production data is only available with a one-day lag; otherwise, this is an excellent way to troubleshoot issues.
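To illustrate the kind of analysis this enables, here is a small sketch that groups exported rows by session and orders them chronologically; the row shape (sessionId, timestamp, endpoint) is an assumption for illustration, not our real export schema.

```typescript
// Illustrative only: the exported log-row shape below is assumed, not the real schema.
interface LogRow {
  sessionId: string;
  timestamp: string; // ISO 8601
  endpoint: string;
  requestBody: string;
  responseBody: string;
}

// Group rows by session and sort each session's requests chronologically,
// so a user's journey can be read top to bottom.
function groupBySession(rows: LogRow[]): Map<string, LogRow[]> {
  const sessions = new Map<string, LogRow[]>();
  for (const row of rows) {
    const bucket = sessions.get(row.sessionId) ?? [];
    bucket.push(row);
    sessions.set(row.sessionId, bucket);
  }
  for (const bucket of sessions.values()) {
    bucket.sort((a, b) => a.timestamp.localeCompare(b.timestamp));
  }
  return sessions;
}
```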

Initially, instead of tables, we only had a Telegram chat into which request logs were exported in real time. That made it possible to keep an eye on the situation quickly, but without alerting the responsibility fell on whoever happened to check the chat. Systematic work requires a proper process, and that came later.

Error Codes

Our team was tasked with "transitioning to error codes," specifically moving all application logic from the frontend to the backend. Here's an example:

A client can have only one active policy per VIN (Vehicle Identification Number). If such a policy is found, the purchase becomes impossible and the user sees an error screen. Previously, the frontend sent the VIN to the backend, which queried a third party and returned the data to the frontend; the frontend then checked a specific parameter to determine whether the client already had an active policy and, if so, displayed the error screen. We rewrote the logic so that the backend checks for an active policy itself and sends the corresponding error_code to the frontend.
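As a rough sketch (all names here, such as findActivePolicyByVin and POLICY_ALREADY_ACTIVE, are hypothetical), the reworked backend check could look like this:

```typescript
// Sketch of the reworked check: the backend decides and returns an error_code
// instead of handing raw third-party data to the frontend.
interface QuoteStartResponse {
  success: boolean;
  error_code?: string; // e.g. 'POLICY_ALREADY_ACTIVE'
  quote_id?: string;
}

// Placeholder integrations standing in for the real master-system calls.
async function findActivePolicyByVin(vin: string): Promise<{ policyId: string } | null> {
  return null; // hypothetical stub
}
async function createQuote(vin: string): Promise<string> {
  return 'quote-123'; // hypothetical stub
}

async function startQuote(vin: string): Promise<QuoteStartResponse> {
  // Business rule: only one active policy per VIN.
  const existing = await findActivePolicyByVin(vin);
  if (existing !== null) {
    return { success: false, error_code: 'POLICY_ALREADY_ACTIVE' };
  }
  const quoteId = await createQuote(vin);
  return { success: true, quote_id: quoteId };
}
```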

We applied the same approach throughout the application, and it was important not to miss or break anything along the way. During this process, our team ran into several challenges:

  • We debated for a long time whether such responses should return HTTP 200 or a 4xx status, and how the frontend should react in each case.
  • Each method returned a boolean parameter called success, indicating whether the method executed successfully. It turned out the frontend did not always take this flag into account: there were responses with HTTP 200 and success: false, and vice versa, which broke the logic (see the sketch after this list).
  • We did not always keep error codes mutually exclusive, so situations arose where several codes could apply at once. Some of that logic would have been better expressed through statuses or other response parameters rather than error codes; unfortunately, this realization did not come immediately.
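Here is the sketch mentioned above: a single frontend handler that interprets the HTTP status, the success flag, and the error_code together, so a mismatch like 200 + success: false cannot slip through unnoticed. The screen names and codes are hypothetical.

```typescript
// Sketch: interpret status, success, and error_code in one place.
// Error codes and screen names are hypothetical.
interface ApiResult {
  success: boolean;
  error_code?: string;
}

type Screen = 'next_step' | 'active_policy_error' | 'generic_error';

function resolveScreen(httpStatus: number, body: ApiResult): Screen {
  if (httpStatus >= 500) return 'generic_error'; // server failure, reported elsewhere
  if (!body.success || httpStatus >= 400) {
    switch (body.error_code) {
      case 'POLICY_ALREADY_ACTIVE':
        return 'active_policy_error';
      default:
        return 'generic_error'; // unknown or missing code
    }
  }
  return 'next_step';
}
```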

Despite all these challenges, the exercise proved very beneficial. Clear markers were established for all edge scenarios, laying the groundwork for connecting additional monitoring systems.

Sentry

Sentry is an error-tracking platform for applications. Its capabilities also include performance monitoring, user feedback collection, analysis and recommendations for code improvements, session replay, and incident management. For more about Sentry, read another article by Dastan Abdiraiym, "How to Work with Sentry: A Developer's Guide." Here I will only describe our team's experience of adopting it.

When we first launched traffic, it was important to react to errors quickly to prevent "budget drain." For this, our team tried Sentry. The idea was that, upon receiving an error code from the backend, the frontend would send the event to Sentry, which would track errors and raise alerts as needed. We were inspired by a neighboring team that had also set up Slack notifications, and we hurried to implement the same.


However, we did not achieve the desired outcome. Because of the sheer number of errors with HTTP status 4xx, we had to disable notifications to keep them from cluttering the information flow, and as a result the genuinely problematic requests never made it into the statistics. Here's how our developer, Maxim Dolgikh, commented on the situation:

"Sentry serves for developers to monitor unforeseen errors in the application. The idea that analysts would track something through these errors was initially flawed—it was an attempt to 'hammer nails with a saw.'"

Currently, Sentry is used for frontend errors and requests with status 5xx. There are also plans to move some requests to status code 422 (Unprocessable Entity), which signals a violation of business logic.
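As a sketch of the current approach, assuming the standard Sentry browser SDK and a simple fetch wrapper (the DSN and the wrapper itself are placeholders): only 5xx responses and unexpected frontend exceptions are reported, while 4xx responses remain a business-logic concern.

```typescript
import * as Sentry from '@sentry/browser';

// One-time SDK setup; the DSN below is a placeholder.
Sentry.init({ dsn: 'https://examplePublicKey@o0.ingest.sentry.io/0' });

// Sketch of a fetch wrapper: report only 5xx and network failures to Sentry,
// and leave 4xx / error_code responses to normal business-logic handling.
async function apiFetch(url: string, init?: RequestInit): Promise<Response> {
  try {
    const response = await fetch(url, init);
    if (response.status >= 500) {
      Sentry.captureMessage(`API ${response.status} on ${url}`, 'error');
    }
    return response;
  } catch (err) {
    Sentry.captureException(err); // network or other unexpected frontend error
    throw err;
  }
}
```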

We learned an important lesson: take the time to understand the context in which other teams use a tool before adopting it yourself. In hindsight, the neighboring team had used the business-logic monitoring scheme through Sentry as a one-off exception, for a single feature that needed close watching right after launch. Once everyone was confident there were no critical errors, Sentry went back to being used as intended: tracking frontend errors in the application.

Elasticsearch

Elasticsearch, usually deployed as part of the ELK stack (Elasticsearch, Logstash, Kibana), has long been the standard for working with logs. Implementing it takes time and considerable effort, so we could not use it right away. Rolling out ELK involved the following steps:

  • Coordination of ELK usage with the leads of departments.
  • Infrastructure setup.
  • Agreement on the necessary log structure and calculation of potential data volume across all products with a several-month retention depth.
  • Study of the ELK API and description of log requirements in the pilot product.
  • Coordination with the security department on data-masking requirements, such as writing *** instead of client phone numbers (a sketch follows this list).
  • Development, testing, and debugging of the service.
  • Scaling to other products.
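To illustrate the masking requirement mentioned above, here is a minimal sketch; the list of sensitive fields is an assumption, not our actual masking specification.

```typescript
// Illustrative masking before a log document is shipped to ELK.
// The set of sensitive fields is an assumption, not the real masking spec.
const SENSITIVE_FIELDS = new Set(['phone', 'client_phone']);

function maskSensitive(payload: Record<string, unknown>): Record<string, unknown> {
  const masked: Record<string, unknown> = {};
  for (const [key, value] of Object.entries(payload)) {
    masked[key] = SENSITIVE_FIELDS.has(key) ? '***' : value;
  }
  return masked;
}

// maskSensitive({ client_phone: '+96890000000', vin: 'ABC123' })
// -> { client_phone: '***', vin: 'ABC123' }
```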

In our product, ELK became a real lifesaver. Marking up all the methods and preparing them for logging took the analyst a single day; the only extra work was the data masking. The developers also quickly added the required methods to ELK, and soon we could see the first logs in the test environment.

Among the advantages of Elasticsearch is very fast search across large volumes of data, especially in contrast to the current limitations on accessing production data. ELK also has a convenient aggregation mechanism for log analysis: you can build reports and dashboards, visualize data, and filter by application, session, and so on.
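For example, a typical aggregation might look like the sketch below, written with the official JavaScript client (v8-style API); the index name and field names are assumptions, not our real mapping.

```typescript
import { Client } from '@elastic/elasticsearch';

// The index name and field names below are assumptions for illustration.
const client = new Client({ node: 'http://localhost:9200' });

// Count error codes for one application over the last 24 hours.
async function errorCodeBreakdown(appName: string) {
  const result = await client.search({
    index: 'app-logs-*',
    size: 0,
    query: {
      bool: {
        filter: [
          { term: { 'app.keyword': appName } },
          { range: { '@timestamp': { gte: 'now-24h' } } },
        ],
      },
    },
    aggs: {
      by_error_code: { terms: { field: 'error_code.keyword', size: 20 } },
    },
  });
  return result.aggregations;
}
```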

Unfortunately, there are still downsides at the moment: there is a noticeable delay between a method being called and its appearance in the production logs, and our developers are doing everything they can to fix this. Part of the team also has access difficulties due to the nuances of remote work and awkward combinations of several VPNs; the DevOps department is already addressing these issues.

Nevertheless, with ELK we have potentially rapid access to logs and the ability to quickly respond to incidents and analyze problematic cases. The efforts invested in implementing Elasticsearch will definitely not be in vain.

Roles and Responsibilities

Tools are great; working, documented processes are even better. When introducing monitoring, it is important to agree at the outset on who is responsible for what. Our team skipped that step and ended up holding a large meeting to decide on the fly how we would monitor the situation in production. Here is how we divided the responsibilities:

  • The Product Owner, Project Manager, and Performance Marketing Specialist are responsible for launching and stopping traffic and notifying the team about these events.
  • The Business Analyst prepares functionalities for monitoring before the release (error codes, Elastic Search, logging in the database) and sets tasks for the Product Analyst for marking the service in Google Analytics. After the release, the BA assists with data analysis, error investigation, and task assignment for changes.
  • The Product Analyst writes the markup for events in Google Analytics, collects and analyzes data before and after the release.
  • The Product Owner is responsible for decision-making based on data and requests further markings of the service if clarifications are needed.
  • The Project Manager oversees processes and helps with error investigations.
  • Developers and QA monitor errors in production and also assist with error investigation.

Conclusions

Our team continues to refine our approaches to driving traffic. Recently, we released a major update and switched organic traffic to our new portal. In a short time, we tested various tools for traffic analysis and selected those suitable for our product. We managed to eliminate critical errors, and now our focus is on improving business metrics.

The good news is that a support department has recently been set up within the company, and its head has been tasked with building a system for monitoring errors and handling incidents. Soon we will have a clear work process, a board in Jira, and even our own email address for support requests! Through the teams' joint efforts, a foundation has been laid for the support staff, and we now look forward to stable services and clear processes that will make launching new products much easier.


I would like to thank Maxim Dolgikh, Danil Malich, Maria Polyakova, Alexander Gordienko, and Vladislav Shcherbakov for their assistance in writing this article.
