Boost BI & Development with Metadata-Driven ETL Framework: Data First, Insight-Led Organisations with Metadata-Driven Data ETL Architecture.
Ryan Julyan
Senior Data Engineer @ BSG | Data and Analytics Assets | Senior Software Developer
Abstract:
All businesses must quickly, effectively, and efficiently source and report the data that drives daily operations.
In today’s economy, a key trend for Insurance (Health, Property and Casualty, Life), Healthcare (Hospitals, Clinics, Doctors Groups), and Other Financial Services (Banking and Investment Management) organisations is to continue to grow and adapt, with success measured in speed. The faster analysts and decision-makers can reach actionable insights, the greater the chance that information can be translated into value for the organisation.
To implement an agile approach, organisations must reconsider the divided handoffs from one division to another that characterise traditional Enterprise Data Warehouse (EDW) development and prevent open communication between business users, developers, and architects. A more suitable method incorporates metadata to shorten development cycles and make them iterative.
Why Metadata-Driven Architecture:
Organisations must provide standard, accessible, and generic services and data analytics, enabling each team level to report on relevant data in a near real-time manner. However, traditional methods can keep organisations from growing and adapting because they are tied to a particular technology or to predefined structures that are too difficult or expensive to change, not to mention the challenges of integrating business systems and data from similar organisations or divisions.
Providing standard, accessible, near real-time services and reports on relevant data becomes easier and more efficient with a flexible Enterprise Data Warehouse (EDW)/Analytics Architecture that implements a metadata-driven Extract, Transform and Load (ETL) framework and a simplified Service-oriented architecture (SOA) services approach. Such an architecture provides visibility of the procedures and processes in the EDW/Analytics Architecture: all users in an organisation can quickly manage and control their data without going into the code itself, which reduces the dependency on strong technical knowledge to deliver business solutions.
Definition of Metadata-Driven Data ETL Architecture:
Traditionally, extracting the required data meant using a variety of tools. However, using those tools effectively requires strong technical knowledge and experience with each software vendor’s toolset. This dependency on solid technical understanding makes integrating new data sources challenging, often requiring complicated, time-consuming and error-prone customisation.
Metadata-driven ETL frameworks simplify the technology by abstracting layers, establishing an easier-to-use and more flexible method to implement new data sources. This means the framework is easier to learn and adopt, and implementation time is reduced.
Metadata-driven ETL frameworks create templates for data structures (including structural artefacts with entity relationships and data formats that define the architecture of the EDW), data migration controls, exception/error handling, and rules management. In addition, transformation and integration rules can be created via templates made available in accessible, ubiquitous tools such as Excel spreadsheets by non-technical, domain specialist users.
Similarly, data source locations, schemas, error handling logic, and job control parameters can be stored in physical configuration files that the Framework can easily maintain and process to generate executable ETL jobs. Data schemas should be created using data models to build the schema dynamically, allowing different data storage engines to be used without breaking functionality. A configuration file can generate the code for the specific storage engine.
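As a minimal sketch of building a schema dynamically from a data model, the example below renders a CREATE TABLE statement for two storage engines from one shared definition. The entity, the column names, the logical type names, and the TYPE_MAP table are all hypothetical and chosen for illustration only; a real framework would read the model from its configuration files.

```python
# Hypothetical data model: entity and column names are illustrative only.
CUSTOMER_MODEL = {
    "entity": "customer",
    "columns": [
        {"name": "customer_id", "type": "integer", "nullable": False},
        {"name": "full_name", "type": "string", "nullable": False},
        {"name": "joined_on", "type": "date", "nullable": True},
    ],
}

# Assumed mapping from logical types to engine-specific column types.
TYPE_MAP = {
    "postgres": {"integer": "INTEGER", "string": "TEXT", "date": "DATE"},
    "sqlserver": {"integer": "INT", "string": "NVARCHAR(255)", "date": "DATE"},
}


def generate_ddl(model: dict, engine: str) -> str:
    """Render a CREATE TABLE statement for one entity and one storage engine."""
    types = TYPE_MAP[engine]
    lines = []
    for col in model["columns"]:
        null_clause = "NULL" if col["nullable"] else "NOT NULL"
        lines.append(f"  {col['name']} {types[col['type']]} {null_clause}")
    return f"CREATE TABLE {model['entity']} (\n" + ",\n".join(lines) + "\n);"


if __name__ == "__main__":
    print(generate_ddl(CUSTOMER_MODEL, "postgres"))
    print(generate_ddl(CUSTOMER_MODEL, "sqlserver"))
```

Because the model only names logical types, supporting another storage engine in this sketch means adding one entry to TYPE_MAP rather than changing the model.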
A metadata repository stores and manages the data structures, data migration controls, exception/error handling, and business rules. There are several challenges with sharing and administering the metadata. As such, different architectures have emerged to combat these challenges. The three most common metadata architectural approaches are:
How to implement:
The simple configuration and rule-based implementation streamline loading data into traditional EDWs and make data available sooner for analytics, reporting, and use by other applications. This is achieved by removing the bottleneck of highly specialised technical staff and allowing domain specialists to create and maintain their own pipelines. The technical team can then support standardised, easy-to-review code for their platform, while non-technical staff can quickly assemble flexible, streamlined applications that give end users access to the data through Portals, Dashboards, and other applications. New data sources and business logic can be replicated and added quickly and effectively, without the technical overhead.
A metadata-driven ETL framework provides an abstraction layer that makes it easy to define and reuse mappings, define multiple sources and targets of data, and define and reuse transformation rules. As a result, a metadata-driven ETL framework makes it faster and easier to process, load, and transform data by managing the EDW environments and quickly replicating standard reporting services and applications.
Regardless of the application’s architecture, code can be generated for each layer by using suitable metadata attributes. This means an “n-tier architecture” is achievable and can be augmented by a metadata-driven ETL framework. Each layer, from the UI layer through the service, persistence, and data access layers to the storage layer, can be augmented using layer-wise patterns and practices.
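To make the idea of reusable mappings and transformation rules concrete, here is a small sketch in which the mapping is plain data and the rules are looked up by name. The field names, the rule names, and the apply_mapping function are assumptions made for this example and do not come from any particular framework.

```python
from typing import Callable

# Reusable transformation rules, registered by name so metadata can refer to them.
RULES: dict[str, Callable[[str], str]] = {
    "trim": str.strip,
    "upper": str.upper,
    "title": str.title,
}

# Hypothetical mapping metadata: source field, target field, and rule names.
MAPPING = [
    {"source": "cust_nm", "target": "customer_name", "rules": ["trim", "title"]},
    {"source": "ctry_cd", "target": "country_code", "rules": ["trim", "upper"]},
]


def apply_mapping(row: dict, mapping: list[dict]) -> dict:
    """Build a target row by applying each entry's rules in the listed order."""
    out = {}
    for entry in mapping:
        value = row[entry["source"]]
        for rule_name in entry["rules"]:
            value = RULES[rule_name](value)
        out[entry["target"]] = value
    return out


if __name__ == "__main__":
    print(apply_mapping({"cust_nm": "  jane doe ", "ctry_cd": " za"}, MAPPING))
```

In this sketch, adding a new source means adding entries to the mapping data; the Python code changes only when a genuinely new rule is needed.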
Advantages:
A metadata-driven ETL framework provides an approach for standardising incoming and outgoing data by simplifying complicated processes. The uniform, generic way of ingesting data makes it easy to review existing configurations or add new ones. A metadata-driven ETL framework provides unique agility in developing or changing configurations. Changes typically do not require any new code, which aids scalability, since new sources, configurations, and environments are created by adding metadata and configuration rules. Maintenance time and effort are reduced by the ease and accessibility of the metadata and configurations, from business logic to data flow, through ubiquitous tools such as Excel spreadsheets.
Many of the advantages listed assume a no-compilation architecture, meaning the elements can be loaded at runtime. Any functionality designed can be instantly previewed and published, making it available to the end users and testers sooner and without delays. This means users can benefit from the systems and data immediately, are empowered and can take ownership of delivering insight-led decisions.
Metadata-driven ETL frameworks do not need to replace one’s existing ETL platforms. Instead, a metadata-driven ETL framework can be an accelerator or code generator for rapid development in the native ETL platform. Furthermore, since the metadata-driven ETL framework provides the configuration and instructions, translating the configurations into functional code allows existing platforms to be used and extended without much impact.
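The accelerator role can be sketched as a small generator that turns a job configuration into SQL for an existing platform to run. The table names, column expressions, and the generate_load_sql function below are hypothetical. The sketch assumes the configuration comes from a trusted, validated metadata store, since identifiers are interpolated directly into the statement.

```python
# Hypothetical job configuration; table and column names are illustrative only.
JOB = {
    "source_table": "staging.policy_raw",
    "target_table": "dw.policy",
    "columns": [
        {"target": "policy_id", "expression": "CAST(pol_no AS INTEGER)"},
        {"target": "holder_name", "expression": "TRIM(holder_nm)"},
    ],
}


def generate_load_sql(job: dict) -> str:
    """Translate a job configuration into an INSERT ... SELECT statement."""
    targets = ", ".join(col["target"] for col in job["columns"])
    selects = ",\n  ".join(
        f"{col['expression']} AS {col['target']}" for col in job["columns"]
    )
    return (
        f"INSERT INTO {job['target_table']} ({targets})\n"
        f"SELECT\n  {selects}\n"
        f"FROM {job['source_table']};"
    )


if __name__ == "__main__":
    print(generate_load_sql(JOB))
```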
Using a metadata-driven ETL framework over traditional development is estimated to reduce time to deployment by 30% when integrating new data sources into a data warehousing and business analytics environment.
Performance Concerns:
Generating code from a configuration does not mean that the code is inefficient.
Iterations and reviews of changes can be given more focus, allowing similar tasks to be grouped. This is possible because users can access the information they require themselves, freeing the technical staff to review and monitor the system instead of delivering business functionality and reports. This process improvement should be worked into the life-cycle and approval process (which can be systemised and, in some instances, automated). These processes can be put in place to protect sensitive data (considering GDPR and POPIA) by ensuring that only people who require the data get access to it and that business users do not submit duplicate requests for the same data. In addition, these processes can vastly improve the performance times of business logic, bringing the system much closer to real-time communication and reports, and so allowing all users in an organisation to make far more effective, insight-led business decisions.
Implementation Concerns:
With every user in an organisation now having the ability to update and access the information from the EDW, managing which specific information each user can and cannot access is critical. Since the rules and data are decoupled, access should be enforced not only in the design of the reports but also in the ETL system at runtime. Because each metadata point can now be accessed, more granular attribute-based user access rights should be implemented. These access rights could initially be set up and controlled at a role level and inherited by users, but more granular control will become paramount to the success of the metadata-driven ETL framework. These permissions should extend beyond the standard Create/Read/Update/Delete (CRUD) permissions to include a data entity access level (potentially even an attribute in a data entity) as well as elements on a form. As such, attribute-based access control lends itself well to the metadata-driven ETL framework ecosystem.
Another concern is that users may not create complete, accurate, or optimised rules, because templated tools such as Excel spreadsheets do not enforce data integrity. The metadata-driven ETL framework should therefore validate the configuration and rule-based information before processing, to prevent long run times and broken queries inside the EDW.
Metadata-driven ETL frameworks are not an all-or-nothing approach. A hybrid system will provide the ability to reverse engineer and generate dynamic code where applicable while allowing for customisations when needed. This means the dependency on the metadata-driven ETL framework should not create vendor lock-in or prevent growth and development in IT or business areas.
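A simplified sketch of such a check is shown below: permissions are defined per role, and each grant is narrowed to an action, a data entity, and a set of attributes within that entity. The Policy class, the role names, and the is_allowed function are assumptions made for illustration; a production system would typically delegate this to a dedicated access-control service.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    role: str
    entity: str
    actions: frozenset      # subset of {"create", "read", "update", "delete"}
    attributes: frozenset   # attributes the role may touch; "*" means all


# Hypothetical policies; roles, entities, and attributes are illustrative only.
POLICIES = [
    Policy("claims_analyst", "claim", frozenset({"read"}),
           frozenset({"claim_id", "amount"})),
    Policy("data_steward", "claim", frozenset({"create", "read", "update"}),
           frozenset({"*"})),
]


def is_allowed(roles, action, entity, attribute, policies=POLICIES) -> bool:
    """Return True if any of the user's roles grants the action on the attribute."""
    for policy in policies:
        if policy.role in roles and policy.entity == entity and action in policy.actions:
            if "*" in policy.attributes or attribute in policy.attributes:
                return True
    return False


if __name__ == "__main__":
    print(is_allowed({"claims_analyst"}, "read", "claim", "amount"))
    print(is_allowed({"claims_analyst"}, "read", "claim", "diagnosis"))
```

Running the same check at ETL runtime, rather than only when a report is designed, keeps it in force even when rules and data change independently.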
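A pre-process validation step might look like the following sketch, which checks mapping entries for missing keys, unknown rule names, and duplicate targets before any job runs. The required keys, the set of known rules, and the validate_mapping function are hypothetical and would be defined by the framework in practice.

```python
# Assumed set of rule names the framework knows how to execute.
KNOWN_RULES = {"trim", "upper", "title"}
# Assumed keys every mapping entry must provide.
REQUIRED_KEYS = {"source", "target", "rules"}


def validate_mapping(mapping: list) -> list:
    """Return a list of human-readable problems; an empty list means valid."""
    errors = []
    seen_targets = set()
    for index, entry in enumerate(mapping):
        missing = REQUIRED_KEYS - entry.keys()
        if missing:
            errors.append(f"entry {index}: missing keys {sorted(missing)}")
            continue  # the remaining checks need all keys present
        unknown = [rule for rule in entry["rules"] if rule not in KNOWN_RULES]
        if unknown:
            errors.append(f"entry {index}: unknown rules {unknown}")
        if entry["target"] in seen_targets:
            errors.append(f"entry {index}: duplicate target '{entry['target']}'")
        seen_targets.add(entry["target"])
    return errors


if __name__ == "__main__":
    problems = validate_mapping([
        {"source": "a", "target": "b", "rules": ["lower"]},
        {"source": "c", "target": "b", "rules": []},
        {"target": "x"},
    ])
    for problem in problems:
        print(problem)
```

Collecting every problem in one pass, instead of stopping at the first, lets a domain specialist correct a spreadsheet in a single round.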
A good metadata-driven ETL framework will implement version control, meaning every change to the metadata files is archived as a historical version, enabling rollbacks when necessary.
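The archiving and rollback behaviour can be sketched with an in-memory history, as below. The VersionedMetadata class and the batch_size setting are hypothetical; a real framework would more likely persist versions to files, a database, or an existing version control system.

```python
import copy


class VersionedMetadata:
    """Keeps every version of a metadata document so changes can be rolled back."""

    def __init__(self, initial: dict):
        self._history = [copy.deepcopy(initial)]

    @property
    def version(self) -> int:
        # Versions are numbered from 1.
        return len(self._history)

    @property
    def current(self) -> dict:
        # Return a copy so callers cannot mutate archived versions.
        return copy.deepcopy(self._history[-1])

    def update(self, new_config: dict) -> int:
        """Archive a new version and return its version number."""
        self._history.append(copy.deepcopy(new_config))
        return self.version

    def rollback(self, version: int) -> int:
        """Restore an earlier version by recording it as the newest version."""
        if not 1 <= version <= len(self._history):
            raise ValueError(f"unknown version {version}")
        # History is never rewritten, so the rollback itself is auditable.
        self._history.append(copy.deepcopy(self._history[version - 1]))
        return self.version


if __name__ == "__main__":
    store = VersionedMetadata({"batch_size": 100})
    store.update({"batch_size": 500})
    store.rollback(1)
    print(store.version, store.current)
```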
Conclusion:
Metadata-driven ETL frameworks provide visibility of the procedures and processes in the Enterprise Data Warehouse/Analytics Architecture by allowing all users in an organisation to quickly manage and control their data without going into the code itself, reducing the dependency on strong technical knowledge to deliver business solutions.
Metadata-driven ETL frameworks create templates for data structures, data migration controls, exception/error handling, and rules management. The simple configuration and rule-based implementation streamline loading data into traditional EDWs and make data available sooner for analytics, reporting, and use by other applications. In addition, data source locations, schemas, error handling logic, and job control parameters can be stored in physical configuration files that the Framework can easily maintain and process to generate executable ETL jobs.
A metadata-driven ETL framework provides the following:
Using a metadata-driven ETL framework over traditional development is estimated to reduce time to deployment by 30% when integrating new data sources into a data warehousing and business analytics environment. In addition, using metadata-driven ETL frameworks allows all users in an organisation to quickly manage and control their data without going into the code itself, reducing the dependency on strong technical knowledge to deliver business solutions.