Data Architecture
Uli Bethke
This article on #dataarchitecture is an experiment. It will be crowdwritten over the next few days. Each day I post a new part and people can leave their comments. Where applicable I will work the comments into the article and summarise any discussions.
What is data architecture?
Data architecture is about making sure that an organisation meets its data analytics objectives while taking into account the organisation's preferences and culture. Data architecture ensures that these objectives and agreements are met in the most efficient way. You can read more about this aspect of data architecture in Martijn ten Napel's excellent post Architecting is driving toward coherency.
In other words: data architecture is the gatekeeper and guardian that prevents data anarchy from breaking out in an organisation.
Data architecture is needed whenever a requirement for information and data analysis comes up, e.g. the need to implement a dashboard with KPIs or an operational reporting requirement. As a first step, data architecture needs to understand the exact nature of the requirement before looking at the toolbox of common design patterns. I call these design patterns the logical architecture.
Logical architecture
I frequently come across data architecture diagrams that are riddled with vendor names, tools, and technologies. Tools and technologies have a place in data architecture, but theirs is not the primary role. They are a supporting act, not the main act.
First and foremost, data architecture is abstracted from tools and technologies. This is what I call logical data architecture. It is purely conceptual and not tied to a particular tool, design pattern (e.g. ETL vs ELT), or technology. It is universal. You can translate the logical architecture into any combination of tools and technologies. A good architecture should be applicable both in the cloud and on premises. How and where you implement the architecture is important, but it is a secondary concern.
The second level of architecture is the physical data architecture. It brings the logical architecture to life and deals with the tools and technologies that implement it.
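To make the distinction concrete, here is a minimal Python sketch of one logical architecture translated into two different physical architectures. The layer names and tool choices are purely illustrative, not a prescribed stack:

```python
# The logical architecture is defined once, independent of any tool.
LOGICAL_LAYERS = ["landing", "staging", "integration", "presentation"]

# One possible cloud mapping of the logical layers ...
physical_cloud = {
    "landing": "S3 bucket",
    "staging": "Snowflake transient tables",
    "integration": "Snowflake permanent tables",
    "presentation": "Snowflake views + BI tool",
}

# ... and an on-premises mapping of the very same layers.
physical_on_prem = {
    "landing": "SFTP file share",
    "staging": "PostgreSQL schema 'staging'",
    "integration": "PostgreSQL schema 'dwh'",
    "presentation": "PostgreSQL views + BI tool",
}

# The logical layers stay fixed; only the physical mapping changes.
for layer in LOGICAL_LAYERS:
    print(f"{layer:<13} cloud: {physical_cloud[layer]:<28} on-prem: {physical_on_prem[layer]}")
```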
When we map the logical architecture to tools and vendors we need to take the context and requirements of the organisation into account.
The types of tools you select depend heavily on your requirements, your organisation, the skills you have in house, preferences (e.g. build vs. buy), budget, existing vendor relationships, software licensing model, cloud strategy, and much more. One size does not fit all.
Let’s go through an example. Separating the conceptual architecture from the implementation details will prevent you from making silly statements such as “Hadoop will replace the data warehouse”. The data warehouse is a concept, whereas Hadoop is a technology. You can implement a data warehouse on Hadoop, but it does not make sense to say that a technology will replace a concept. I see this mistake made frequently. Another example: when people hear “data warehouse”, they instinctively think “relational database”. There is no hard connection between the two.
Having said that, some tools are unquestionably a better fit for certain use cases than others. For example, while Hadoop was a good fit for processing unstructured data in batch (it was built for this purpose), it was a poor fit for BI-style queries (even though it can be shoehorned into running them).
Separation of concerns
One core principle of logical architecture is the separation of concerns, a term coined by Dijkstra for splitting a computer program into distinct sections, each of which addresses a separate concern. In #dataarchitecture, the sections are typically the layers and design patterns used when building a data warehouse. Each layer has a specific concern or purpose.
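As an illustration, here is a minimal Python sketch of separation of concerns in a pipeline: each function owns exactly one layer and knows nothing about the others. The layer names and transformations are assumptions made for the example, not a prescribed design:

```python
def land(raw_records):
    """Landing: persist source data exactly as received, no transformation."""
    return list(raw_records)

def stage(landed):
    """Staging: type the data and standardise names, nothing else."""
    return [{"customer_id": int(r["id"]), "name": r["name"].strip()} for r in landed]

def integrate(staged):
    """Integration: apply business rules, e.g. deduplicate on the business key."""
    deduplicated = {row["customer_id"]: row for row in staged}
    return list(deduplicated.values())

# Each concern can be tested, replaced, or re-implemented in isolation.
raw = [{"id": "1", "name": " Alice "}, {"id": "1", "name": "Alice"}]
print(integrate(stage(land(raw))))   # [{'customer_id': 1, 'name': 'Alice'}]
```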
There is no single correct way to design the logical architecture. The image I have posted here represents the logical architecture that we use at Sonra, with one concern per layer.
Let's take a closer look at the data source layer.
Relational Databases
Relational databases are the most common data source type. Databases remain the heart of most business processes and transactions. Common data extraction methods include full extracts, incremental (delta) extracts driven by a timestamp or key, and log-based change data capture (CDC).
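As a sketch of the incremental approach, the example below uses an in-memory SQLite table and a high-watermark column. The table and column names are invented for illustration; in practice you would use your own database driver:

```python
import sqlite3

# A toy source table standing in for an operational database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, updated_at TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, "2024-01-01T10:00:00"), (2, "2024-01-02T09:30:00")],
)

def extract_incremental(conn, last_watermark):
    """Pull only the rows changed since the previous run."""
    rows = conn.execute(
        "SELECT id, updated_at FROM orders WHERE updated_at > ? ORDER BY updated_at",
        (last_watermark,),
    ).fetchall()
    # Advance the watermark so the next run skips what we just extracted.
    new_watermark = rows[-1][1] if rows else last_watermark
    return rows, new_watermark

rows, watermark = extract_incremental(conn, "2024-01-01T12:00:00")
print(rows, watermark)   # only order 2 qualifies; the watermark advances
```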
Text files
Files, such as CSV or XML, are typically provided by third parties or by DBAs who restrict direct database access. JSON documents often come from querying APIs or NoSQL databases such as MongoDB. XML and JSON are semi-structured data types, best loaded to the Landing layer and converted to structured tables en route to Staging. For complex JSON/XML documents, Flexter, an enterprise XML conversion tool by Sonra, converts any XML into a readable, relational format in seconds: https://lnkd.in/eXMbEEKS
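Flexter itself is a commercial tool, but the general idea of relationalising a semi-structured document can be sketched in a few lines of Python. The document shape and key names below are invented for illustration:

```python
import json

doc = json.loads("""
{"order_id": 42,
 "customer": {"id": 7, "name": "Acme"},
 "lines": [{"sku": "A1", "qty": 2}, {"sku": "B2", "qty": 1}]}
""")

# Parent table: one row per order, with the nested customer flattened out.
orders = [{
    "order_id": doc["order_id"],
    "customer_id": doc["customer"]["id"],
    "customer_name": doc["customer"]["name"],
}]

# Child table: one row per order line, carrying a foreign key to the parent.
order_lines = [{"order_id": doc["order_id"], **line} for line in doc["lines"]]

print(orders)       # ready to load into a structured Staging table
print(order_lines)
```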
Excel files
Excel files, while occasionally valid as a data source, may indicate a larger problem. They are often used as a cheap, temporary replacement for operational systems, or to upload reference data or target KPIs. Strategically, consider migrating these shadow IT systems to a proper web application built in a low-code/no-code environment. Using Excel to analyse data is a legitimate scenario in my opinion; Excel as a data source, however, points to a shadow IT problem.
Comment from a reader (Head of Analytics at MarketingLens | BigQuery guy): I'll be curious to see how this unfolds. I have to admit I'm also guilty of putting vendors on such diagrams, although sometimes more as examples, or to make the diagram prettier than just having boxes, or when I draw it retrospectively and we know which tool/method we used in the end. But I would be very happy to harvest some best practices from this!