Data warehouse automation explained. Benefits and use cases
What is data warehouse automation?
I have been struggling to find a good article or post that explains data warehouse automation. To address this gap I have decided to write up my own post. I hope it helps. As always, feedback is welcome.
Data warehouse automation is not some sort of wizardry. It does not automagically build, model, and load a data warehouse for you. This level of automation is (for the time being at least) wishful thinking. As with any other type of automation, we identify recurring patterns and, instead of repeating them manually over and over, we automate them.
Let's look at ETL as an example. In data integration we come across the same patterns over and over again. Using data warehouse automation we can convert these design patterns into reusable code templates. The template itself contains placeholders. These are populated from metadata when we run (instantiate) the code template.
Let me give you a simple example to illustrate the point. A very common (and simple) data integration pattern is to truncate-load a table. The truncate-load operation requires two steps:
- Truncate the target table
- Insert the data from one or more source tables into the target table
These steps are always the same. What changes from one load to the next are the objects the template refers to: What is the name of our target table? What are the names of our source tables? What are the mappings between source and target columns? All of this information can be stored in a metadata catalog. From there it can be retrieved at runtime to populate the placeholders in our code template. With this information our engine can generate the code that implements and executes the data integration pattern.
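To make this concrete, here is a minimal sketch of such a code template in Python. All names here (the metadata entries, table and column names) are hypothetical examples for illustration; a real automation engine would read them from its metadata catalog rather than a hard-coded dictionary.

```python
# Hypothetical metadata catalog entry, retrieved at runtime in a real engine
metadata = {
    "target_table": "dwh.customer",
    "source_tables": ["staging.crm_customer"],
    "mappings": {
        "customer_id": "crm_id",
        "customer_name": "full_name",
    },
}

# The reusable truncate-load template with placeholders
TEMPLATE = """
TRUNCATE TABLE {target};
INSERT INTO {target} ({target_cols})
SELECT {source_cols}
FROM {source};
"""

def render(meta: dict) -> str:
    """Populate the template placeholders from the metadata catalog entry."""
    target_cols = ", ".join(meta["mappings"].keys())
    source_cols = ", ".join(meta["mappings"].values())
    return TEMPLATE.format(
        target=meta["target_table"],
        target_cols=target_cols,
        source_cols=source_cols,
        source=", ".join(meta["source_tables"]),
    )

if __name__ == "__main__":
    # The generated SQL would then be handed to the execution engine
    print(render(metadata))
```

The same template can now be instantiated for every truncate-load in the warehouse; only the metadata differs.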
If you are looking for a practical example of data warehouse automation and code templates in action, have a look at our blog post that shows how to create a code template for loading data from S3 to Redshift.
Similar to data integration, you can also automate certain rules to auto-generate a dimensional model from a normalised model. The approach works reasonably well: it will not produce a production-ready model, but it gives you a starting point for your model and adds agility to the process.
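As a simplified illustration of the kind of rules such a generator might apply, here is a sketch that classifies tables as fact or dimension candidates from foreign-key metadata. The heuristic (tables with many outgoing foreign keys are fact candidates, referenced tables are dimension candidates) and the sample tables are assumptions for illustration, not a production algorithm.

```python
# Hypothetical metadata: table -> list of tables it references via foreign keys
foreign_keys = {
    "order_line": ["order", "product", "customer", "date"],
    "order": ["customer"],
    "product": ["product_category"],
    "customer": [],
    "product_category": [],
    "date": [],
}

def classify(fks: dict, fact_threshold: int = 3) -> dict:
    """Tag tables as fact or dimension candidates based on FK metadata."""
    referenced = {t for refs in fks.values() for t in refs}
    roles = {}
    for table, refs in fks.items():
        if len(refs) >= fact_threshold:
            roles[table] = "fact candidate"       # many outgoing FKs
        elif table in referenced:
            roles[table] = "dimension candidate"  # referenced by others
        else:
            roles[table] = "review manually"
    return roles

print(classify(foreign_keys))
```

The output is a starting point for a human modeller, not a finished star schema, which matches the point above.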
One crucial pillar for any data warehouse automation effort is the availability of a rich set of metadata.
What are the benefits of data warehouse automation?
By encapsulating recurring patterns in a code template and automating common tasks we gain a wide variety of benefits:
- Increased productivity. The ability to reuse the same code template over and over again increases productivity. Our engineers have to write a lot less boilerplate code.
- Using a code template will result in fewer bugs, a higher level of consistency, and higher quality of code.
- Changes can be rolled out at the speed of light. Let's assume our data integration pattern needs to be modified. We need to add a third step to the truncate-load pattern, e.g. to collect table statistics after the INSERT has completed. All we need to do is add this step to the code template (see the sketch after this list) and, as if by magic, it is rolled out to all of the instances where the template is used.
- Using an automated approach to code generation aligns well with automated approaches to testing.
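To illustrate the point about rolling out changes, here is how the hypothetical template from the earlier sketch could be extended with the statistics step. ANALYZE is used as an example statement; the exact syntax depends on your database.

```python
# Extended truncate-load template: one edit here propagates to every
# pipeline the engine generates from this template.
TEMPLATE = """
TRUNCATE TABLE {target};
INSERT INTO {target} ({target_cols})
SELECT {source_cols}
FROM {source};
ANALYZE {target};  -- new third step: collect table statistics after the load
"""
```

No individual pipeline is edited by hand; the next time the engine instantiates the template, every instance picks up the new step.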
Data warehouse automation. A sample use case
I have been working in data warehousing for the last 20 years. I have seen a lot of things come and go. I have seen technologies get hyped and then disappear or become legacy. There are, however, some fundamental truths. Projects that involve complex XML as a data source either fail or run over time and budget. There are many reasons for this. I have listed the main ones here:
- Many XML files are based on industry data standards. A data standard typically covers and standardises many different business processes. The business processes themselves contain many different entities. We have seen standards with hundreds or thousands of entities. You are probably working with one of these standards on a daily basis without knowing it: Office Open XML. It is the standard that underpins Microsoft Office documents. The documentation for this standard runs to more than 8,000 pages. We have written in detail about this standard elsewhere on this blog: Liberating data from spreadmarts and Excel. As you can imagine, it takes a lot of time to understand such a standard, and mapping the concepts back to XML is not straightforward. As a result, data analysts spend a lot of time trying to make sense of the standard.
- Data engineers are good at working with databases, SQL, and scripting languages. They typically lack niche skills such as XSLT or XQuery for working with XML files. They don't have any interest or incentive to acquire these esoteric skills. Rightly so, in my opinion.
- Standard data integration tools have limited support for working with XML. They typically just provide a GUI on top of XPath. The whole process is still very manual and time consuming. The approach works reasonably well for simple XML, but not for complex data standards. Apart from the lack of automation, in our experience these tools also perform poorly. We have seen ETL jobs run for more than 24 hours on a relatively modest 50,000 XML documents.
- All of these issues lead to long analysis and development lifecycles, poor quality of code, badly performing data pipelines, and a lack of agility. In summary, there are significant risks to these projects.
The case for data warehouse automation for industry data standards and XML
Data warehouse automation for XML addresses all of these issues. As a result, data warehouse analysts and engineers can focus on adding value to the enterprise rather than converting XML from one format to another. Let's look at some of the typical tasks that can be automated:
- The analysis phase can be automated. We can collect information and intelligence such as data types, relationships between elements and types, data profiles, and data distribution from the source XML files (see the sketch after this list).
- The generation of the target data model to a database can be automated.
- Using a metadata layer, the documentation can be auto-generated. Think of data lineage or source to target maps for the data analysts or ER diagrams of the target data models for the data engineers.
- Logging of errors and issues can be automated.
- Rogue XML files with unexpected structures can be auto-detected and parked for inspection.
- Relationships and globally unique keys can be automatically defined.
- Last but not least, the data warehouse target schema can be auto-populated.
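As an illustration of automating the analysis phase, here is a minimal sketch that walks a set of XML files and collects element paths, occurrence counts, and rough type guesses. The input location and the type heuristic are assumptions for illustration; a real tool would go much further and also derive relationships, profiles, and distributions.

```python
from collections import Counter, defaultdict
from xml.etree import ElementTree
import glob

def profile(files):
    """Collect element paths, counts, and rough type guesses from XML files."""
    counts = Counter()
    types = defaultdict(set)

    def guess_type(text):
        # Very rough type inference for illustration only
        if text is None or not text.strip():
            return "empty"
        try:
            float(text)
            return "numeric"
        except ValueError:
            return "string"

    def walk(elem, path):
        counts[path] += 1
        types[path].add(guess_type(elem.text))
        for child in elem:
            walk(child, f"{path}/{child.tag}")

    for name in files:
        root = ElementTree.parse(name).getroot()
        walk(root, root.tag)
    return counts, types

# Hypothetical input location
counts, types = profile(glob.glob("source_xml/*.xml"))
for path, n in counts.most_common():
    print(path, n, sorted(types[path]))
```

The resulting profile is exactly the kind of metadata that can then drive the generation of the target data model and the documentation mentioned above.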
With automation of these steps we get the usual benefits:
- Code is consistent and of high quality
- Performance is optimised
- Testing can be automated
- Increased productivity and fewer bugs. Data engineers and analysts can focus on adding real value to your company.
Our own product Flexter is a data warehouse automation solution for complex data standards, XML, JSON, and APIs.
You can try it out online and find answers to FAQs on our website.
We have recently released a SaaS version of the product and are looking for beta testers.
Our enterprise edition can be installed on a single node or, for very large volumes of XML, on a cluster of servers.
The discussion should not be whether or not you should use DWA, but which DWA environment is the best fit for your organization. The reasons mentioned above are certainly true. But distrust and not-invented-here thinking are a common problem within a lot of IT departments in general. And don't forget that the concept of information logistics is still alien to most.
Thanks Uli!
Data Warehouse Automation (DWA) is a grossly underused technology in analytics. About five years ago I saw a slight surge in the use of DWA but for some reason it never really caught on in the way it logically should have. Here are my thoughts on why DWA has not caught on:

1) System integrator firm self-interest. The business model of many system integrators is based on maximising the number of billable hours for the largest number of developers. DWA is basically kryptonite to this business model, so system integrators have shied away from DWA, and when it is suggested they will find reasons why DWA is not suitable for a given project.

2) Developer self-interest. When the average ETL developer sees DWA they quickly realise that DWA will lead to fewer ETL developers being required. Moreover, many ETL developers define their role by the tool they use (e.g. SSIS, Informatica, etc.) and not the value they provide to the business. The use of DWA tends to push the DW skill set more toward requirements gathering, business analysis, and data modelling. If you are an ETL developer and have a narrow skill set you will feel threatened by DWA and you will find ways to prevent or hinder its implementation.

3) The rise of the Python-wielding data engineer. The recent rise to almost god-like status of data scientists and data engineers has led to what is effectively a 20+ year regression in ETL development methods. In the early days of ETL development there were very few tools like SSIS or Informatica, so all ETL code was lovingly hand-crafted in the preferred language of the developer. With the rise of various ETL platforms, progressively less hand-coding was done. I now regularly see the scenario where the data science team is charged with building a data warehouse and the response is usually "There's no need to buy an expensive ETL platform. We can easily write all of the ETL code using Python, the Swiss army knife of software languages". Now don't get me wrong. Python is a very useful language but it is not designed to be an efficient ETL development platform.

I suppose all we can do is continue to highlight the cost, speed and quality benefits of DWA to managers and project owners and hope they change their minds.