5 tips to help you learn and verify data set accuracy faster
By Mor Markovich, CPA, Data Analyst


There are 2 reasons why this article may be useful for you:

1. You're required to learn data sets and tables as you go.

2. You're responsible for building queries and tables with complex logic. 

As an analyst, I often found myself learning the data as I went, understanding the important data keys a little too late, and writing complex queries for complex tables, yet still struggling with the QA testing later. To solve this problem, I adopted work methods that made those tasks much easier for me.

1. Counting rows - It might sound obvious, but I bet most of you don't do it. Recently, I found myself in a very uncomfortable situation, explaining to the analysis requester why the metric I had calculated for him should be fixed. I was very familiar with the relevant table, which contained data about dates and customers over a period of 4 years, and I couldn't have guessed that this table would suddenly contain data from only 2 years. To validate any table's integrity, even if you are sure that you know the table thoroughly, make sure to count its rows (in any familiar tool). This will tell you about the table's size, the run time of your query, and whether anything has changed in the table since the last time you dealt with it (even if that was yesterday). To make this validation even more efficient, add relevant conditions for more comprehensive testing, as shown below. This simple action can prevent so many careless mistakes.

select count(*) from table where year(date) = 2019
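
To catch the kind of surprise described above, where whole years silently disappear, a quick variation is to count rows per year. A minimal sketch, using the article's placeholder names "table" and "date":

-- one row per year; a missing or shrunken year stands out immediately
select year(date) as yr, count(*) as row_count
from table
group by 1
order by 1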

2. Check the primary keys - This is especially important when you're writing a query with many joined tables. In that case, I want to be sure that I won't sabotage the results by joining 2 tables that are at different key levels. I'm sure it has happened to all of us more than once: a join exposing a bug by multiplying rows. Most of the time, the reason lies in the primary keys. To prevent this annoying situation, or at least detect it faster, check the integrity of the primary keys after any sub-query / sub-table / temporary table you write. The best way to do that is to compare the result of counting all rows with the count of distinct primary key values. Both numbers should be identical. If they're not, you need to check the primary key. If you implement this habit during your query writing process, it will make you much more confident about your results later.

select count(*), count(distinct id) from table
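
If the two counts differ, a natural next step is to find which key values are duplicated. A minimal sketch, again assuming "id" is the intended primary key:

-- list every id that appears more than once
select id, count(*) as occurrences
from table
group by id
having count(*) > 1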

3. Check the distinct values, NULLs and their frequency - A few weeks ago I worked on a project which required me to deal with a lot of databases and columns. Some of the columns were new to me and provided by a third party. I had to build an algorithm that would answer a question we were curious about, but the results looked unreasonable to me and to my colleagues who were familiar with the data from the 'other' side (the UI). After a little QA testing, we found out that the data had been badly transmitted through the ETL. The time we wasted on analyzing the data, only to learn later that it was corrupted from the beginning, led me to realize that I need to check the data myself and not trust it blindly. Since then, I have made a habit of checking every column I'm going to deal with: the frequency of NULLs, the possible values in it, and how many times each value repeats. In addition, it is important to align with the person in charge of the data on the UI / business side.

select column, count(*)
from table
group by 1
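
For the NULL frequency specifically, a hedged sketch using the same placeholder names: count(column) skips NULLs, so its difference from count(*) gives the NULL count directly.

-- total rows vs. non-null rows; the gap is the number of NULLs
select count(*) as total_rows,
       count(column) as non_null_rows,
       count(*) - count(column) as null_rows
from table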

4. Use min/max functions - These functions shouldn't be used only for calculations, but also to check the data's integrity and feasibility. I work with many tables that should be updated daily or that are based on dates. It could be a revenue table or a table with administrative data. Let's say it is a currency rate table which I should join or left join with a revenue table. How can I trust my query to find all the updated revenues and present them in the correct currency? A little check of the minimum and maximum dates in the tables will give us an idea of the expected results. We could also use the LEAD analytical function to check that we have a one-day gap between consecutive dates (see the sketch after the query below), and check the period available in the table (max minus min). We can also use max and min to QA value columns. For example, should it be possible to get a negative value in the revenue column? These checks will help you understand the data and trust it.

select max(date) - min(date)
from table
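
And here is a minimal sketch of the LEAD check mentioned above, assuming one row per date and a dialect where subtracting two dates yields a number of days (PostgreSQL behaves this way; other engines need datediff or similar):

-- any row returned marks a gap of more than one day between consecutive dates
select date, next_date
from
(
select date, lead(date) over (order by date) as next_date
from table
) t
where next_date - date > 1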

5. Use sum/average functions - Once, when I joined a new project with a new DB and new terms, I dealt with a revenue table. My mission was to find the daily revenue per fund with some complex calculations. When I presented the results to my manager, he said that something seemed incorrect: the revenue was too high. No matter how many times I checked my query and asked my colleague to debug it, I couldn't find the issue. Eventually, I came back to the most basic check, summing the total revenue in the table. This led me to find that every revenue unit appeared twice in the table. In other words, the issue was in the table's logic, not in my query. Since then, I always ask a few questions in order to validate the table, its logic and its accuracy: What is the average daily revenue? How much money was earned in 1 week? How many products were sold in 1 week? What is the average age of the employees? Do the results make sense to me? This is a simple and quick way to help you focus on the goal and not waste precious time.

select avg(monthly_revenue)
from
(
select month(date) as month, sum(revenue) as monthly_revenue
from table
where year(date) = 2020
group by 1
) x
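
For the "how much money was earned in 1 week?" style of question, a hedged sketch along the same lines. The week() function exists in MySQL, which the article's year()/month() style suggests; other engines use date_trunc or similar.

-- weekly revenue totals for a quick sanity scan of the numbers
select year(date) as yr, week(date) as wk, sum(revenue) as weekly_revenue
from table
group by 1, 2
order by 1, 2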

The bottom line is that all these queries are a piece of cake for analysts; it will take you no more than 10 minutes to check everything. And trust me, it will save you so much time later.
