Handling Data Format Changes in Databases: A Practical Example
Pradosh Jena
GenAI | LangChain | AWS | Python | Big Data | Snowflake | DBT Cloud Data Architect | Post Graduate in Data Science | 5x Certified | Writer @Medium
In the world of software development, data formats evolve as applications grow and requirements change. One of the challenges developers face is ensuring that their applications can handle different data formats, especially when switching from one schema design to another. This situation is common in document databases, which are often praised for their schema flexibility. However, this flexibility can sometimes lead to complications when reading and writing data. In this blog, we will specifically focus on a practical example of handling a data format change in a document database.
The Scenario: Changing User Data Format
Imagine you have a document database storing user information. Initially, the user data includes a single field for the full name:
{
  "user_id": "123",
  "name": "John Doe"
}
Later, you decide to separate the full name into two fields: first_name and last_name. This change could be due to various reasons, such as the need for better data manipulation, improved search capabilities, or more personalized user interactions. The new format looks like this:
{
  "user_id": "123",
  "first_name": "John",
  "last_name": "Doe"
}
Implementing the Change: Schema-on-Read vs. Schema-on-Write
Document databases are often described as "schemaless," which is a bit of a misnomer. In reality, there is an implicit schema defined by how the application expects the data to be structured. This concept is known as schema-on-read, where the structure is interpreted when the data is read, as opposed to schema-on-write, where the schema is enforced when data is written, typical in relational databases.
To implement this change in a document database, you don't need to alter an existing schema since the database does not enforce one. Instead, you write code to handle both the old and new data formats. Here's an example in JavaScript:
if (user && user.name && !user.first_name) {
  // Documents written before the change do not have first_name and last_name
  let nameParts = user.name.split(" ");
  user.first_name = nameParts[0];
  user.last_name = nameParts.slice(1).join(" ");
}
Let's break down this code:
1. user: checks that the user object exists; a basic null check to avoid runtime errors.
2. user.name: checks that the name field exists, indicating the old data format.
3. !user.first_name: checks that the first_name field does not exist, confirming that the document is indeed in the old format and still needs to be updated.
This conditional block ensures that the application can handle data regardless of whether it is in the old or new format. If the data is in the old format (full name in a single field), it splits the name into first_name and last_name. This approach is dynamic and requires careful handling to ensure data integrity.
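Wrapped into a reusable helper, the same logic might look like the following minimal sketch. The function name normalizeUser is hypothetical, not part of any library; it simply applies the conditional block above to any document you read:

```javascript
// Hypothetical helper: upgrade an old-format user document in place.
// Documents written before the change have "name" but no "first_name".
function normalizeUser(user) {
  if (user && user.name && !user.first_name) {
    const nameParts = user.name.split(" ");
    user.first_name = nameParts[0];
    // Everything after the first word becomes the last name, so
    // "Mary Jane Watson" -> first_name "Mary", last_name "Jane Watson".
    user.last_name = nameParts.slice(1).join(" ");
  }
  return user;
}

// An old-format document is upgraded on read...
const oldDoc = normalizeUser({ user_id: "123", name: "John Doe" });
// ...while a new-format document passes through untouched.
const newDoc = normalizeUser({
  user_id: "456",
  first_name: "Jane",
  last_name: "Smith",
});
```

Calling this helper at every read path gives the application one consistent view of a user, regardless of when the document was written.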
Handling the Migration in Relational Databases
In contrast, relational databases use schema-on-write, where schema changes are explicitly defined and enforced. To migrate the data, you would typically add new columns to the table and update existing records. For example, in SQL:
ALTER TABLE users ADD COLUMN first_name VARCHAR(255);
ALTER TABLE users ADD COLUMN last_name VARCHAR(255);

-- PostgreSQL syntax: split on the first space. Note that single-word
-- names end up duplicated in both columns and may need special handling.
UPDATE users SET
  first_name = split_part(name, ' ', 1),
  last_name = substring(name from position(' ' in name) + 1);
Here, the schema change is enforced at the database level, ensuring that all new and existing data conforms to the new structure. This approach provides strict data consistency but requires a coordinated update to both the database schema and the application logic.
The Trade-Offs
The main advantage of schema-on-read is flexibility. Applications can evolve without needing immediate changes to the database schema. This flexibility, however, comes with the trade-off of having to manage multiple data formats within the application, which can increase complexity and the risk of errors.
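One common way to contain this complexity is lazy migration: upgrade old-format documents as they are read, and write the upgraded version back so the old format gradually disappears from the database. The sketch below uses a plain Map as a stand-in for the document store; the getUser function and the store itself are illustrative assumptions, not a real database API:

```javascript
// Stand-in for a document database: user_id -> document.
const store = new Map([
  ["123", { user_id: "123", name: "John Doe" }], // old format
  ["456", { user_id: "456", first_name: "Jane", last_name: "Smith" }], // new format
]);

// Read a user, upgrading old-format documents and writing them back,
// so each document is migrated at most once.
function getUser(userId) {
  const user = store.get(userId);
  if (user && user.name && !user.first_name) {
    const nameParts = user.name.split(" ");
    user.first_name = nameParts[0];
    user.last_name = nameParts.slice(1).join(" ");
    delete user.name; // drop the legacy field
    store.set(userId, user); // write back the new format
  }
  return user;
}
```

With this approach, frequently accessed documents migrate quickly on their own, and a one-off background job can sweep up whatever is never read.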
Schema-on-write, while more rigid, provides consistency and integrity. The database ensures that all data adheres to the defined schema, simplifying the application's data handling logic. However, changes require more coordination and can lead to downtime or complex migration procedures.
Conclusion
Choosing between schema-on-read and schema-on-write depends on the specific needs of your application, including factors like data consistency requirements, development agility, and maintenance overhead. In our example, handling a change from storing a full name to separate first and last names illustrates the practical implications of these two approaches. By understanding the strengths and weaknesses of each, you can make informed decisions about how to manage your data and schema evolution effectively.
Food For Thought
What kind of database would you choose if you wanted to store someone's LinkedIn profile information?
What kind of database would you choose to give a user a seamless browsing experience on an e-commerce site?
Aren't caches and queues also a type of storage mechanism?
Please share your thoughts in the comments.
This is a short, concise article highlighting a key point to take into account when choosing the right kind of database for your application. For more in-depth reading, see Designing Data-Intensive Applications: https://learning.oreilly.com/library/view/designing-data-intensive-applications/
I will publish many more articles on such design principles in similarly simple language.