Data Platform: The Successful Paths
Miguel Garcia Lorenzo
Hello! I am a seasoned technology expert with a wealth of experience and a passion for building successful teams and creating business value by orchestrating innovative technologies.
Introduction
I've been working as a Solution Architect for many years, and I've seen the same mistakes made very often. Usually, companies want to evolve their data platform because the current solution doesn't cover their needs, which is a good reason. But many times they start from the wrong starting point.
I think the reasons are simple, and at the same time very difficult to change.
Making mistakes and identifying them is all part of the learning process. It is a good indicator that we are working to improve and evolve our solutions. But we have to analyze them deeply and understand the reasons (the what, the why, and so on) to avoid making the same mistakes every time.
There are no magical data products, or at least I don't know of any. A global Data Platform has many use cases, and not all of them can be solved by the same data product. I remember when RDBMS databases were the solution for every use case, so many companies invested a lot of money and effort in very large RDBMS (OLTP/OLAP) databases such as Teradata, Oracle Exadata, or IBM DB2. After those came the Big Data ecosystem and NoSQL, and again everybody began to design Big Data Lakes, even as Data Warehouses, using solutions such as Impala and Hive. Many of these solutions failed because they were only technological changes following the trends of the moment, and almost nobody had analyzed the underlying causes.
Today, we are in a more complex scenario. There are more data products (and more marketing) combined with a new cloud approach. The cloud requires, more than ever, that we change our culture, vision, and methodology.
Having several products in our Data Platform increases the solution's complexity in terms of operations, coexistence, and integration, but at the same time it gives us the flexibility to provide different qualities of service and to avoid vendor lock-in scenarios. We have to take advantage of the benefits of the cloud. One of them is the reduction of operational costs and effort, which allows us to provide a better solution in terms of the variety of services.
Logical Data Fabric and Data Virtualization
This architecture provides a single layer that gives users and applications access to the data while decoupling it from the data's physical location and the specific repository technology. This approach improves the data integration experience for users and allows us to evolve the data repository layer without impacting the operation of the systems. This capability is also the foundation for making our BI layer more agile. A Logical Data Fabric helps us adapt quickly to changing business needs, allowing us to add new technology capabilities while reducing integration effort and time to market.
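To make the idea concrete, here is a minimal, purely illustrative sketch of such a logical access layer. The names (`LogicalDataLayer`, `PostgresRepository`, `CloudWarehouseRepository`) are hypothetical and not any vendor's API; the point is that consumers ask for a dataset by its logical name and never need to know which physical repository serves it.

```python
# Illustrative sketch (hypothetical names, not a product API): consumers request a
# dataset by logical name, and the layer decides which physical repository serves it,
# so repositories can be swapped without touching consumer code.

from typing import Protocol


class Repository(Protocol):
    def read(self, dataset: str) -> list[dict]:
        ...


class PostgresRepository:
    def read(self, dataset: str) -> list[dict]:
        # In a real system this would run a SQL query against the on-premise database.
        return [{"source": "postgres", "dataset": dataset}]


class CloudWarehouseRepository:
    def read(self, dataset: str) -> list[dict]:
        # In a real system this would query the cloud data warehouse.
        return [{"source": "cloud-warehouse", "dataset": dataset}]


class LogicalDataLayer:
    """Maps logical dataset names to the physical repository that serves them."""

    def __init__(self, catalog: dict[str, Repository]):
        self.catalog = catalog

    def read(self, dataset: str) -> list[dict]:
        return self.catalog[dataset].read(dataset)


# Consumers only know the logical name "customers"; we can later re-point it to
# another repository without changing any consumer code.
layer = LogicalDataLayer({
    "customers": PostgresRepository(),
    "clickstream": CloudWarehouseRepository(),
})
print(layer.read("customers"))
```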
Today, Hybrid Multi-Cloud solutions are generating a lot of interest. Companies like Denodo have been working for many years to provide solutions focused on data fabric and data virtualization.
Joining a Logical Data Fabric with Data Virtualization is a powerful tool, but it can become very complex and concentrate all the logic in one place. It is something to take into account.
Currently, one of the most common scenarios is to migrate On-Premise solutions to Cloud or Multi-Cloud solutions. In these cases, data migration and data integration are some of the challenges.
These are not easy tasks, and we have to perform them every time we change one of our data repositories. One of the keys to success is to provide a logical data layer that serves the data from different sources based on defined criteria.
These capabilities give us a lot of flexibility to design our migration and coexistence strategies.
For example, we can start by replicating the same dataset on both platforms (On-Premise and Cloud). The logical layer will then retrieve the dataset from one of the data repositories based on timestamps.
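As a small, hypothetical sketch of that routing criterion (the cutover date and function names are invented for illustration, not part of any product), the logical layer could decide between repositories like this:

```python
# Illustrative coexistence pattern: both platforms hold the same dataset, and the
# logical layer routes each request to On-Premise or Cloud depending on a cutover
# timestamp. Names and the criterion itself are assumptions for this example.

from datetime import datetime, timezone

# Records older than the cutover are served from the legacy platform; newer data is
# already replicated to (and served from) the cloud repository.
CUTOVER = datetime(2021, 6, 1, tzinfo=timezone.utc)


def resolve_repository(requested_from: datetime) -> str:
    """Return which repository should serve data starting at `requested_from`."""
    return "cloud" if requested_from >= CUTOVER else "on_premise"


print(resolve_repository(datetime(2021, 7, 15, tzinfo=timezone.utc)))  # -> cloud
print(resolve_repository(datetime(2020, 12, 1, tzinfo=timezone.utc)))  # -> on_premise
```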
Avoid Coupling All Logic in the Data Repository
In the last few years, there has been a lot of effort to uncouple the data and the logic into different layers. In my opinion, this is the right way to go.
Having the logic and the data in the same layer leads to problems such as silos, monoliths, vendor lock-in, or poor performance. Often the data repository is the most expensive layer. I remember spending a lot of effort to move miles of PL/SQL out to other, more cost-efficient distributed solutions.
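As a purely illustrative sketch (the function and data shapes are invented for this example), keeping a business rule in ordinary application code rather than in a stored procedure means the logic stays portable when the underlying repository changes:

```python
# Illustrative only: the aggregation rule lives in portable application code instead of
# a stored procedure, so moving to a different database does not mean rewriting the logic.

def monthly_revenue(orders: list[dict]) -> dict[str, float]:
    """Aggregate order amounts by month, regardless of which store the orders came from."""
    totals: dict[str, float] = {}
    for order in orders:
        month = order["order_date"][:7]  # "YYYY-MM"
        totals[month] = totals.get(month, 0.0) + order["amount"]
    return totals


orders = [
    {"order_date": "2021-01-15", "amount": 120.0},
    {"order_date": "2021-01-20", "amount": 80.0},
    {"order_date": "2021-02-03", "amount": 50.0},
]
print(monthly_revenue(orders))  # {'2021-01': 200.0, '2021-02': 50.0}
```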
I think that many times cloud data solutions promise unrealistic scalability. Of course, we can scale up our solution, but at what cost? Currently, some development teams are less focused on process optimization because, unlike On-Premise, the cloud can scale without limits. The problem is the cost, and getting out of that situation often requires a lot of effort and time.
We should not forget that distributed databases existed before the cloud. Cloud solutions give us better performance in many cases, pay-as-you-go pricing, fewer operational tasks, and the ability to scale resources only when we need them. But in many cases it is the same technology that we already have On-Premise.
I see again how many teams are going back to applying the same approach massively, just as they did years ago. That doesn't mean we cannot use streams/tasks in Snowflake or PL/SQL in Oracle Autonomous Database, but we should not use them as a global approach for all use cases. We run the risk of building a monolith again.
Don't get me wrong: Snowflake, BigQuery, Oracle Autonomous Database, and other new data products are great products with great functionality, but the key is to use them correctly within our data strategy.
Avoid Building a Sandcastle
Poor performance in terms of latency or data-processing concurrency is one of the reasons to design and build a new data platform. We have to avoid creating dependencies between the current solution and the new one because, as we know, "a chain is only as strong as its weakest link."
A common scenario is to design the new solution on top of the current one, which implies a strong dependency between them. We will propagate problems from one platform to the other and increase the risk of impacting the current solution. For instance, many times the data source of the new data platform is the current one.
Another common scenario is to change the data repository but keep the same ingestion approach and data model. In many companies, ETL is the main replication process, and it often involves performance or data-quality issues. It doesn't matter that we change the data repository: replication based on ETL will carry the same problems with it.
Conclusion
Designing a successful new Data Platform usually starts with understanding why we have arrived at this point. Knowing the past is important for understanding the present, designing the future, and avoiding the same mistakes again.
New technological improvements and architectures require a change in the way we work and think. As a team, we must evolve to avoid applying the same methodologies from On-Premise environments in the cloud. Team culture and architecture patterns are more important than the products.
At the beginning, it can seem like a great strategy to build a new solution on top of something that is currently not working, from the point of view of timing requirements. Often, though, it will generate more issues, consuming team effort and generating dissatisfaction among the team and the users.