Shift Left - How I carried out ETL & data migration testing activities (10 mins read)
Husny Jeffrey
Senior Test Developer @ Grab | Ensuring Quality of Million-Dollar Features
Recently I got the opportunity to work on an ETL and data migration project. In this article I would like to share how I carried out the data migration testing activities, along with a checklist for verifying a data migration.
First, let’s understand what data migration is and why we need it.
What is data migration?
Simply put, it is transferring data from one database to another.
Why?
A data migration requirement can arise for several reasons. Below are two of them.
- Decommissioning legacy applications (re-building the application using a newer architecture)
- Changing the database vendor (e.g. from Oracle to MySQL), etc.
In a microservice architecture, each microservice should have its own dedicated database, so that changes to one service’s database do not impact any other service.
In our case, we are moving from a monolithic architecture to a microservice architecture. Therefore, we have to migrate the service-related data from a large legacy database into a new micro database.
To explain the topic in a practical way, let’s take an example.
Assume our source database is DB2 and our destination database is PostgreSQL.
There is a table in the source database (DB2) which contains customer-related data that the customer microservice needs to consume to perform its functions.
Table description in source database (DB2)
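For reference, the source customer table can be assumed to look roughly like this (inferred from the destination model and column-size discussion below, so treat the exact types as an assumption):
- id – VARCHAR(10), primary key
- name – VARCHAR(30)
- credit_limit – INTEGER, not null
- registered_date – DATE
- registered_time – TIME
- email – VARCHAR(30)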
STEP ONE – We have the source database (DB2) table; now we need to finalize the destination database model
For that, the team came up with the database table model below.
The team decided to concatenate "registered_date" and "registered_time" into a single "registered_date_time" column, and to add an extra column "is_deleted" to support soft deletion.
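Summarised from the CREATE TABLE statement in step two, the proposed customer table looks like this:
- id – VARCHAR(10), primary key
- name – VARCHAR(30)
- credit_limit – INT, not null
- registered_date_time – BIGINT (epoch timestamp)
- email – VARCHAR(40)
- is_deleted – BOOLEAN, default FALSE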
Developers are going to create this table in the new micro database by referring to this document. At this point, as testers, we must review this document and make sure the proposed database model is correct.
I would highly recommend that testers get involved from the design phase itself, so that you can provide your input at an early stage; this also reduces review time later.
To review the document with the proposed database model, I have created my review checklist.
Database model document review - Checklist
Number of columns - Verify whether the proposed database model includes all the columns of the source table (customer) in DB2 that we have planned to migrate to the micro database.
- In the above example - We have discussed and agreed to concatenate the date and time columns and to add a new column (is_deleted).
- Result - Pass
Column names - Verify whether the column names in the proposed database model describe the DB2 columns correctly.
- In the above example - All the column names in the proposed model describe the DB2 columns correctly.
- Result - Pass
Note: It’s not necessary to name the customer’s name column “customer_name” in the customer table. If you define the column as “name”, it already means the customer name, since the column lives in the customer table anyway.
Column data types – If you are NOT applying any transformation logic, the data types in the source and destination should match.
- In the above example - We have decided to transform the date-time format to an epoch timestamp. Other than that, the column data types match the DB2 columns’ data types.
- Result - Pass
Column sizes – The column size in the proposed model should be greater than or equal to the DB2 column size.
- I gave an input on this point. Since we have defined “name” as VARCHAR(30), it is better to increase the size of the "email" column from VARCHAR(30) to VARCHAR(40), because an email address is quite likely to contain both the first name and the last name. Since an email address also includes a domain name (e.g. @gmail.com), the "email" field should be larger than the "name" field.
- Result – Pass
Primary Key / Foreign Key – The primary key and foreign keys (if any) should be specified in the proposed database model document.
- Result - Pass
Not Null – Not-null columns defined in the source (DB2) should be specified in the proposed database model document.
- Result - Pass
Database best practices –
- Table names and column names should be lower case, with words separated by underscores (do not use spaces in table names).
- Use singular words for table names (customer instead of customers).
- No unnecessary prefixes or suffixes (customer instead of tbl_customer), etc.
- Result – Pass
Consistency - Columns referring to the same data should be identical everywhere in the database (e.g. the data type and size of the customer id in the customer table should be the same as the data type and size of the customer id foreign key in the order table; see the sketch after this checklist).
- Result – N/A
Default value – Discussed and agreed default values should be mentioned in the proposed database model document.
- Result – Pass
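To illustrate the consistency check, here is a minimal sketch of a hypothetical order table (not part of the original example) where the customer_id foreign key uses exactly the same type and size as customer.id:

-- Hypothetical table for illustration only; named customer_order
-- because ORDER is a reserved word in SQL.
CREATE TABLE customer_order (
    id VARCHAR(10),
    customer_id VARCHAR(10) NOT NULL, -- same type and size as customer.id
    total_amount INT,
    PRIMARY KEY (id),
    FOREIGN KEY (customer_id) REFERENCES customer (id)
);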
Since I spent some quality time reviewing the proposed database model document, I identified several mistakes early that could otherwise have caused issues later.
Okay, now we have finalized the database model.
STEP TWO – Create the finalized table in the PostgreSQL database
Developers create this table in the new micro database in accordance with the proposed database model document. Once they commit the SQL scripts, they add testers as reviewers on their pull requests to the remote repository. As testers, we should be familiar with SQL and review the pull request against the proposed database model (a defect prevention activity).
According to the above example, the SQL query should be:
CREATE TABLE customer (
    id VARCHAR(10),
    name VARCHAR(30),
    credit_limit INT NOT NULL,
    registered_date_time BIGINT, -- epoch timestamp (concatenated registered_date + registered_time)
    email VARCHAR(40),
    is_deleted BOOLEAN DEFAULT FALSE, -- new column to support soft deletion
    PRIMARY KEY (id)
);
The column names, data types, sizes, and data validations applied in this query can be verified against the document, and the pull request can then be approved from the tester’s end.
In this case you can approve the pull request containing the above query, since it will create exactly the table described in the proposed database model.
Okay, step two is also done. Once everything is fine with the DB scripts, these scripts can be executed in the QA dedicated PostgreSQL DB instance.
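As a quick sanity check once the script has run, the created table can be compared against the model document by querying PostgreSQL’s information_schema catalog views (one possible approach, shown as a sketch):

-- Inspect the created customer table: column names, types, sizes,
-- nullability and defaults, in the order they were defined.
SELECT column_name, data_type, character_maximum_length, is_nullable, column_default
FROM information_schema.columns
WHERE table_name = 'customer'
ORDER BY ordinal_position;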
STEP THREE – Review the ETL transformation logic
Before moving into step three, let’s explain ETL briefly.
ETL stands for Extract, Transform and Load.
Extract – Extract data from the source database (in our case, DB2).
Transform – Apply transformation logic to change the source data if needed (in our case, we have written logic to concatenate "registered_date" and "registered_time" into "registered_date_time").
Load – Load the data into the destination database (in our case, PostgreSQL).
We can create ETL jobs using tools such as Talend and migrate data from source to destination by running those jobs.
As testers, we should review the transformation logic in the ETL rather than waiting to verify the migrated data. That way we can prevent data mismatches or incorrect data in the destination database.
As an example, we can have a discussion with the developers and verify that all the transformation logic identified in the design phase is implemented in the ETL (see the sketch below).
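A minimal sketch of that logic, expressed in PostgreSQL syntax. The real implementation lives inside the ETL tool, and customer_source is a hypothetical staging copy of the DB2 table, so treat the names here as assumptions for illustration:

-- Concatenate the source date and time into one timestamp,
-- then convert it to an epoch value for the BIGINT column.
SELECT
    id,
    EXTRACT(EPOCH FROM (registered_date + registered_time))::BIGINT AS registered_date_time
FROM customer_source;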
Okay, we have reviewed transformation logic as well.
STEP FOUR (final step) – Run the ETL job, migrate the data from source to destination, and verify data accuracy to make sure the migration is successful
Once the ETL job has completed, we need to make sure all the data in the source database (DB2) is also present in the destination database (PostgreSQL). We call this data accuracy testing, and it is carried out to make sure the data migration was successful.
For that, I have a checklist to verify destination data against source data.
Data accuracy testing - Checklist
- Verify record counts
- Verify record counts for a specific duration (e.g. total customers registered in 2017)
- Latest date in the table
- Oldest date in the table.
- Sum of an integer column
- Max of an integer column
- Min of an integer column
- Number of customers who are using Gmail (%@gmail.com) etc.
- more...
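Several of these checks translate naturally into SQL that can be run against both databases and the results compared. A minimal sketch against the destination customer table (the equivalent DB2 queries would use the original registered_date and registered_time columns):

-- Total record count
SELECT COUNT(*) FROM customer;

-- Customers registered in 2017 (registered_date_time is stored as an epoch value)
SELECT COUNT(*) FROM customer
WHERE registered_date_time BETWEEN EXTRACT(EPOCH FROM TIMESTAMP '2017-01-01')::BIGINT
                               AND EXTRACT(EPOCH FROM TIMESTAMP '2017-12-31 23:59:59')::BIGINT;

-- Oldest and latest registrations, plus aggregates on an integer column
SELECT MIN(registered_date_time), MAX(registered_date_time),
       SUM(credit_limit), MAX(credit_limit), MIN(credit_limit)
FROM customer;

-- Number of customers using Gmail
SELECT COUNT(*) FROM customer WHERE email LIKE '%@gmail.com';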
There is more to discuss: testing the performance of the ETL, fault tolerance testing, challenges faced during data migration, and so on. Let’s discuss those soon.
Enjoy!
Husny Jiffry
Senior Quality Engineer - Sysco LABS, Sri Lanka