Data Architecture
Uli Bethke
This article on #dataarchitecture is an experiment. It will be crowdwritten over the next few days. Each day I post a new part and people can leave their comments. Where applicable I will work the comments into the article and summarise any discussions.
What is data architecture?
Data architecture is about making sure that an organisation meets its data analytics objectives while taking into account the organisation's preferences and culture. Data architecture ensures that these objectives and agreements are met in the most efficient way. You can read more about this aspect of data architecture in Martijn ten Napel's excellent post Architecting is driving toward coherency.
In other words: data architecture is the gatekeeper and guardian that prevents data anarchy from breaking out in an organisation.
Data architecture is needed whenever a requirement for information and data analysis comes up, e.g. the need to implement a dashboard with KPIs or an operational reporting requirement. As a first step, data architecture needs to understand the exact nature of the requirement before looking at the toolbox of common design patterns. I call these design patterns the logical architecture.
Logical architecture
I frequently come across data architecture diagrams that are riddled with vendor names, tools, and technologies. Tools and technologies have a place in data architecture, but theirs is not the primary role. They are a supporting act, not the main act.
First and foremost, data architecture is abstracted from tools and technologies. This is what I call logical data architecture. It is purely conceptual and not tied to a particular tool, design pattern (e.g. ETL vs ELT), or technology. It is universal. You can translate the logical architecture into any combination of tools and technologies. A good architecture should be applicable both in the cloud and on premises. How and where you implement the architecture is important, but it is a secondary concern.
The second level of architecture is the physical data architecture. It brings the logical architecture to life and deals with the tools and technologies that implement it.
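To make the distinction concrete, here is a minimal Python sketch of one logical architecture translated into two different physical architectures. The layer names and tool choices are purely illustrative, not a prescribed stack:

```python
# The logical architecture is defined once, independent of any tool.
LOGICAL_LAYERS = ["landing", "staging", "integration", "presentation"]

# One possible cloud mapping of the logical layers ...
physical_cloud = {
    "landing": "S3 bucket",
    "staging": "Snowflake transient tables",
    "integration": "Snowflake permanent tables",
    "presentation": "Snowflake views + BI tool",
}

# ... and an on-premises mapping of the very same layers.
physical_on_prem = {
    "landing": "SFTP file share",
    "staging": "PostgreSQL schema 'staging'",
    "integration": "PostgreSQL schema 'dwh'",
    "presentation": "PostgreSQL views + BI tool",
}

# The logical layers stay fixed; only the physical mapping changes.
for layer in LOGICAL_LAYERS:
    print(f"{layer:<13} cloud: {physical_cloud[layer]:<28} on-prem: {physical_on_prem[layer]}")
```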
When we map the logical architecture to tools and vendors we need to take the context and requirements of the organisation into account.
The types of tools you select depend heavily on your requirements, your organisation, the skills you have in house, preferences (e.g. build vs. buy), budget, existing vendor relationships, software licensing model, cloud strategy, and much more. One size does not fit all.
Let’s go through an example. Separating the conceptual architecture from the implementation details will prevent you from making silly statements such as “Hadoop will replace the data warehouse”. The data warehouse is a concept, whereas Hadoop is a technology. You can implement a data warehouse on Hadoop, but it does not make sense to say that a technology will replace a concept. I see this mistake made frequently. Another example: when people hear “data warehouse”, they instinctively think “relational database”. There is no hard connection between the two.
Having said that, some tools are unquestionably a better fit for certain use cases than others. For example, while Hadoop was a good fit for processing unstructured data in batch (it was built for this purpose), it was a poor fit for BI-style queries (even though it can be shoehorned into running them).
Separation of concerns
One core principle of logical architecture is the separation of concerns, a term coined by Dijkstra for splitting a computer program into distinct sections, each of which addresses a separate concern. In #dataarchitecture, the sections are typically the layers and design patterns used when building a data warehouse. Each layer has a specific concern or purpose.
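As an illustration, here is a minimal Python sketch of separation of concerns in a pipeline: each function owns exactly one layer and knows nothing about the others. The layer names and transformations are assumptions made for the example, not a prescribed design:

```python
def land(raw_records):
    """Landing: persist source data exactly as received, no transformation."""
    return list(raw_records)

def stage(landed):
    """Staging: type the data and standardise names, nothing else."""
    return [{"customer_id": int(r["id"]), "name": r["name"].strip()} for r in landed]

def integrate(staged):
    """Integration: apply business rules, e.g. deduplicate on the business key."""
    deduplicated = {row["customer_id"]: row for row in staged}
    return list(deduplicated.values())

# Each concern can be tested, replaced, or re-implemented in isolation.
raw = [{"id": "1", "name": " Alice "}, {"id": "1", "name": "Alice"}]
print(integrate(stage(land(raw))))   # [{'customer_id': 1, 'name': 'Alice'}]
```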
There is no single correct way to design the logical architecture. The image I have posted here represents the logical architecture that we use at Sonra, with one concern per layer.
Let's take a closer look at the data source layer.
Relational Databases
Relational databases are the most common data source type. Databases remain the heart of most business processes and transactions. Common data extraction methods include full extracts, incremental (delta) extracts driven by a timestamp or key, and log-based change data capture (CDC).
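As a sketch of the incremental approach, the example below uses an in-memory SQLite table and a high-watermark column. The table and column names are invented for illustration; in practice you would use your own database driver:

```python
import sqlite3

# A toy source table standing in for an operational database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, updated_at TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, "2024-01-01T10:00:00"), (2, "2024-01-02T09:30:00")],
)

def extract_incremental(conn, last_watermark):
    """Pull only the rows changed since the previous run."""
    rows = conn.execute(
        "SELECT id, updated_at FROM orders WHERE updated_at > ? ORDER BY updated_at",
        (last_watermark,),
    ).fetchall()
    # Advance the watermark so the next run skips what we just extracted.
    new_watermark = rows[-1][1] if rows else last_watermark
    return rows, new_watermark

rows, watermark = extract_incremental(conn, "2024-01-01T12:00:00")
print(rows, watermark)   # only order 2 qualifies; the watermark advances
```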
Text files
Files, such as CSV or XML, are typically provided by third parties or by DBAs who restrict direct database access. JSON documents often come from querying APIs or NoSQL databases such as MongoDB. XML and JSON are semi-structured data types, best loaded to the Landing layer and converted to structured tables en route to Staging. For complex JSON/XML documents, Flexter, an enterprise XML conversion tool by Sonra, converts any XML into a readable, relational format in seconds: https://lnkd.in/eXMbEEKS
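Flexter itself is a commercial tool, but the general idea of relationalising a semi-structured document can be sketched in a few lines of Python. The document shape and key names below are invented for illustration:

```python
import json

doc = json.loads("""
{"order_id": 42,
 "customer": {"id": 7, "name": "Acme"},
 "lines": [{"sku": "A1", "qty": 2}, {"sku": "B2", "qty": 1}]}
""")

# Parent table: one row per order, with the nested customer flattened out.
orders = [{
    "order_id": doc["order_id"],
    "customer_id": doc["customer"]["id"],
    "customer_name": doc["customer"]["name"],
}]

# Child table: one row per order line, carrying a foreign key to the parent.
order_lines = [{"order_id": doc["order_id"], **line} for line in doc["lines"]]

print(orders)       # ready to load into a structured Staging table
print(order_lines)
```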
Excel files
Excel files, while occasionally valid as a data source, may indicate a larger problem. They are often used as a cheap, temporary replacement for operational systems, or to upload reference data or target KPIs. Strategically, consider migrating these shadow IT systems to a proper web application built in a low-code/no-code environment. Using Excel to analyse data is a legitimate scenario in my opinion; Excel as a data source, however, points to a shadow IT problem.
Comment from a reader (Head of Analytics at MarketingLens | BigQuery guy): I'll be curious to see how this unfolds. I have to admit I'm also guilty of putting vendors on such diagrams, although sometimes more as examples, or to make the diagram prettier than just having boxes, or when I draw it retrospectively and we know which tool/method we used in the end. But I would be very happy to harvest some best practices from this!