Data warehouse automation explained. Benefits and use cases
What is data warehouse automation?
I have been struggling to find a good article or post that explains data warehouse automation. To address this gap I have decided to write up my own post. I hope it helps. As always, feedback is welcome.
Data warehouse automation is not some sort of wizardry. It does not automagically build, model, and load a data warehouse for you. This level of automation is (for the time being at least) wishful thinking. As with any other type of automation, we identify recurring patterns and, instead of repeating them manually over and over, we automate them.
Let's look at ETL as an example. In data integration we come across the same patterns over and over again. Using data warehouse automation we can convert these design patterns into reusable code templates. The template itself contains placeholders. These are populated from metadata when we run (instantiate) the code template.
Let me give you a simple example to illustrate the point. A very common (and simple) data integration pattern is to truncate-load a table. The truncate-load operation requires two steps:
- Truncate the target table
- Insert the data from one or more source tables into the target table
These steps are always the same. What changes from one load to the next are the objects the template refers to: What is the name of our target table? What are the names of our source tables? What are the mappings between source and target columns? All of this information can be stored in a metadata catalog. From there it can be retrieved at runtime to populate the placeholders in our code template. With this information our engine can generate the code that implements and executes the data integration pattern.
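To make this concrete, here is a minimal sketch of such a code template in Python. All names here (the metadata entries, table and column names) are hypothetical examples for illustration; a real automation engine would read them from its metadata catalog rather than a hard-coded dictionary.

```python
# Hypothetical metadata catalog entry, retrieved at runtime in a real engine
metadata = {
    "target_table": "dwh.customer",
    "source_tables": ["staging.crm_customer"],
    "mappings": {
        "customer_id": "crm_id",
        "customer_name": "full_name",
    },
}

# The reusable truncate-load template with placeholders
TEMPLATE = """
TRUNCATE TABLE {target};
INSERT INTO {target} ({target_cols})
SELECT {source_cols}
FROM {source};
"""

def render(meta: dict) -> str:
    """Populate the template placeholders from the metadata catalog entry."""
    target_cols = ", ".join(meta["mappings"].keys())
    source_cols = ", ".join(meta["mappings"].values())
    return TEMPLATE.format(
        target=meta["target_table"],
        target_cols=target_cols,
        source_cols=source_cols,
        source=", ".join(meta["source_tables"]),
    )

if __name__ == "__main__":
    # The generated SQL would then be handed to the execution engine
    print(render(metadata))
```

The same template can now be instantiated for every truncate-load in the warehouse; only the metadata differs.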
If you are looking for a practical example of data warehouse automation and code templates in action, have a look at our blog post that shows how to create a code template for loading data from S3 to Redshift.
Similar to data integration, you can also automate certain rules to auto-generate a dimensional model from a normalised model. The approach works reasonably well: it will not produce a production-ready model, but it gives you a starting point for your model and adds agility to the process.
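As a simplified illustration of the kind of rules such a generator might apply, here is a sketch that classifies tables as fact or dimension candidates from foreign-key metadata. The heuristic (tables with many outgoing foreign keys are fact candidates, referenced tables are dimension candidates) and the sample tables are assumptions for illustration, not a production algorithm.

```python
# Hypothetical metadata: table -> list of tables it references via foreign keys
foreign_keys = {
    "order_line": ["order", "product", "customer", "date"],
    "order": ["customer"],
    "product": ["product_category"],
    "customer": [],
    "product_category": [],
    "date": [],
}

def classify(fks: dict, fact_threshold: int = 3) -> dict:
    """Tag tables as fact or dimension candidates based on FK metadata."""
    referenced = {t for refs in fks.values() for t in refs}
    roles = {}
    for table, refs in fks.items():
        if len(refs) >= fact_threshold:
            roles[table] = "fact candidate"       # many outgoing FKs
        elif table in referenced:
            roles[table] = "dimension candidate"  # referenced by others
        else:
            roles[table] = "review manually"
    return roles

print(classify(foreign_keys))
```

The output is a starting point for a human modeller, not a finished star schema, which matches the point above.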
One crucial pillar for any data warehouse automation effort is the availability of a rich set of metadata.
What are the benefits of data warehouse automation?
By encapsulating recurring patterns in a code template and automating common tasks we gain a wide variety of benefits:
- Increased productivity. The ability to reuse the same code template over and over again increases productivity. Our engineers have to write a lot less boilerplate code.
- Using a code template will result in fewer bugs, a higher level of consistency, and higher quality of code.
- Changes can be rolled out at the speed of light. Let's assume our data integration pattern needs to be modified. We need to add a third step to the truncate-load pattern, e.g. to collect table statistics after the INSERT has completed. All we need to do is add this step to the code template (see the sketch after this list) and, as if by magic, it is rolled out to all of the instances where the template is used.
- Using an automated approach to code generation aligns well with automated approaches to testing.
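To illustrate the point about rolling out changes, here is how the hypothetical template from the earlier sketch could be extended with the statistics step. ANALYZE is used as an example statement; the exact syntax depends on your database.

```python
# Extended truncate-load template: one edit here propagates to every
# pipeline the engine generates from this template.
TEMPLATE = """
TRUNCATE TABLE {target};
INSERT INTO {target} ({target_cols})
SELECT {source_cols}
FROM {source};
ANALYZE {target};  -- new third step: collect table statistics after the load
"""
```

No individual pipeline is edited by hand; the next time the engine instantiates the template, every instance picks up the new step.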
Data warehouse automation. A sample use case
I have been working in data warehousing for the last 20 years. I have seen a lot of things come and go. I have seen technologies get hyped and then disappear or become legacy. There are, however, some fundamental truths. Projects that involve complex XML as a data source either fail or run over time and budget. There are many reasons for this. I have listed the main ones here:
- Many XML files are based on industry data standards. A data standard typically covers and standardises many different business processes. The business processes themselves contain many different entities. We have seen standards with hundreds or thousands of entities. You are probably working with one of these standards on a daily basis without knowing it: Office Open XML. It is the standard that underpins Microsoft Office documents. The documentation for this standard runs to more than 8,000 pages. We have written in detail about this standard elsewhere on this blog: Liberating data from spreadmarts and Excel. As you can imagine, it takes a lot of time to understand such a standard, and mapping the concepts back to XML is not straightforward. As a result, data analysts spend a lot of time trying to make sense of the standard.
- Data engineers are good at working with databases, SQL, and scripting languages. They typically lack niche skills such as XSLT or XQuery for working with XML files. They don't have any interest or incentive to acquire these esoteric skills. Rightly so, in my opinion.
- Standard data integration tools have limited support for working with XML. They typically just provide a GUI on top of XPath. The whole process is still very manual and time consuming. The approach works reasonably well for simple XML, but not for complex data standards. Apart from the lack of automation, in our experience these tools also perform poorly. We have seen ETL jobs run for more than 24 hours on a relatively modest 50,000 XML documents.
- All of these issues lead to long analysis and development lifecycles, poor quality of code, badly performing data pipelines, and a lack of agility. In summary, there are significant risks to these projects.
The case for data warehouse automation for industry data standards and XML
Data warehouse automation for XML addresses all of these issues. As a result, data warehouse analysts and engineers can focus on adding value to the enterprise rather than converting XML from one format to another. Let's look at some of the typical tasks that can be automated:
- The analysis phase can be automated. We can collect information and intelligence such as data types, relationships between elements and types, data profiles, and data distribution from the source XML files (see the sketch after this list).
- The generation of the target data model to a database can be automated.
- Using a metadata layer, the documentation can be auto-generated. Think of data lineage or source to target maps for the data analysts or ER diagrams of the target data models for the data engineers.
- Logging of errors and issues can be automated.
- Rogue XML files with unexpected structures can be auto-detected and parked for inspection.
- Relationships and globally unique keys can be automatically defined.
- Last but not least, the data warehouse target schema can be auto-populated.
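As an illustration of automating the analysis phase, here is a minimal sketch that walks a set of XML files and collects element paths, occurrence counts, and rough type guesses. The input location and the type heuristic are assumptions for illustration; a real tool would go much further and also derive relationships, profiles, and distributions.

```python
from collections import Counter, defaultdict
from xml.etree import ElementTree
import glob

def profile(files):
    """Collect element paths, counts, and rough type guesses from XML files."""
    counts = Counter()
    types = defaultdict(set)

    def guess_type(text):
        # Very rough type inference for illustration only
        if text is None or not text.strip():
            return "empty"
        try:
            float(text)
            return "numeric"
        except ValueError:
            return "string"

    def walk(elem, path):
        counts[path] += 1
        types[path].add(guess_type(elem.text))
        for child in elem:
            walk(child, f"{path}/{child.tag}")

    for name in files:
        root = ElementTree.parse(name).getroot()
        walk(root, root.tag)
    return counts, types

# Hypothetical input location
counts, types = profile(glob.glob("source_xml/*.xml"))
for path, n in counts.most_common():
    print(path, n, sorted(types[path]))
```

The resulting profile is exactly the kind of metadata that can then drive the generation of the target data model and the documentation mentioned above.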
With automation of these steps we get the usual benefits:
- Code is consistent and of high quality
- Performance is optimised
- Testing can be automated
- Increased productivity and fewer bugs. Data engineers and analysts can focus on adding real value to your company.
Our own product Flexter is a data warehouse automation solution for complex data standards, XML, JSON, and APIs.
You can try it out online and find answers to FAQs on our website.
We have recently released a SaaS version of the product and are looking for beta testers.
Our enterprise edition can be installed on a single node or, for very large volumes of XML, on a cluster of servers.
The discussion should not be whether or not you should use DWA, but which DWA environment is the best fit for your organization. The reasons mentioned above are certainly true. But distrust and not-invented-here thinking are a common problem within a lot of IT departments in general. And don't forget that the concept of information logistics is still alien to most.
Thanks Uli!
Data Warehouse Automation (DWA) is a grossly underused technology in analytics. About five years ago I saw a slight surge in the use of DWA but for some reason it never really caught on in the way it logically should have. Here are my thoughts on why DWA has not caught on:

1) System integrator firm self-interest. The business model of many system integrators is based on maximising the number of billable hours for the largest number of developers. DWA is basically kryptonite to this business model, so system integrators have shied away from DWA, and when it is suggested they will find reasons why DWA is not suitable for a given project.

2) Developer self-interest. When the average ETL developer sees DWA they quickly realise that DWA will lead to fewer ETL developers being required. Moreover, many ETL developers define their role by the tool they use (e.g. SSIS, Informatica, etc.) and not the value they provide to the business. The use of DWA tends to push the DW skill set more toward requirements gathering, business analysis, and data modelling. If you are an ETL developer and have a narrow skill set you will feel threatened by DWA and you will find ways to prevent or hinder its implementation.

3) The rise of the Python-wielding data engineer. The recent rise to almost god-like status of data scientists and data engineers has led to what is effectively a 20+ year regression in ETL development methods. In the early days of ETL development there were very few tools like SSIS or Informatica, so all ETL code was lovingly hand-crafted in the preferred language of the developer. With the rise of various ETL platforms, progressively less hand-coding was done. I now regularly see the scenario where the data science team is charged with building a data warehouse and the response is usually "There's no need to buy an expensive ETL platform. We can easily write all of the ETL code using Python, the Swiss army knife of software languages". Now don't get me wrong. Python is a very useful language but it is not designed to be an efficient ETL development platform.

I suppose all we can do is continue to highlight the cost, speed and quality benefits of DWA to managers and project owners and hope they change their minds.