Data-Driven Compliance: Data Architecture
Oussama KIASSI
Data & AI Leader @ IBM | 12x Microsoft | 11x Databricks | 1x Snowflake
Over time, governments have played a central role in shaping industries by formulating and enforcing regulations to promote ethics and sustainability. This function finds expression in regulatory reporting, which covers both financial and non-financial (ESG) data. However, explosive data growth has made it harder for organizations to report data that is consistent and has integrity. Data is voluminous (high storage), fast (generated quickly), complex (difficult to process), and sometimes lacks integrity. Organizations therefore need solid pillars to handle this surge in data volume while reporting compliance correctly.
To help understand and analyze the breadth of government action by quantifying policy texts, the Mercatus Center at George Mason University offers RegData, a dataset covering all U.S. industries and federal regulations [2]. This dataset was used in 2014 to identify highly regulated industries [3], among them Basic Chemical Manufacturing, Pharmaceutical and Medicine Manufacturing [4], and Utilities. A critical question for these industries is how to design a reliable enterprise data architecture for regulatory reporting. The chosen architecture should meet regulatory reporting needs such as accuracy, veracity, consistency, and auditability. From a regulatory reporting perspective, we can identify three candidate architectures: Two-tier architecture, Data Lakehouse, and Data Mesh.
Two-tier Architecture (Modern Data Warehouse)
As its name implies, Two-tier architecture, also called Modern Data Warehouse, is the coexistence of a data lake and a data warehouse. The data lake stores data in its native format at scale. The data is then processed and loaded into the data warehouse for structured storage and querying. Supported by a solid data governance culture and a reliable usage audit system (granular logging), this architecture fulfills its purpose for the regulatory reporting of mid-size companies and strategic business units. However, it has a drawback: the same data lives in two systems, so it may be duplicated and become inconsistent between them. The figure below illustrates regulatory reporting using Azure Modern Data Warehouse, which relies on the Azure Analytics stack to extract value from regulation data and share a view database with regulators: Azure Data Factory (raw ingestion), Azure Data Lake (native storage), Azure Databricks (data processing and preparation), Azure SQL Data Warehouse (data standardization and serving for regulators), Azure Analysis Services (data modeling), and Power BI (reporting).
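The consistency risk between the two tiers can be monitored with a simple reconciliation check. The sketch below is a minimal illustration in plain Python, not part of any Azure service: the function names, the field names (`trade_id`, `amount`), and the sample rows are all hypothetical. It compares a row count and an order-independent content digest computed on both sides.

```python
import hashlib
import json


def fingerprint(rows, fields):
    """Return (row_count, digest) for dict rows, independent of row order."""
    row_digests = sorted(
        hashlib.sha256(
            json.dumps({f: row[f] for f in fields}, sort_keys=True).encode("utf-8")
        ).hexdigest()
        for row in rows
    )
    combined = hashlib.sha256("".join(row_digests).encode("utf-8")).hexdigest()
    return len(rows), combined


def reconcile(lake_rows, warehouse_rows, fields):
    """Compare the lake copy and the warehouse copy of the same dataset."""
    lake_count, lake_digest = fingerprint(lake_rows, fields)
    wh_count, wh_digest = fingerprint(warehouse_rows, fields)
    return {
        "row_count_match": lake_count == wh_count,
        "content_match": lake_digest == wh_digest,
    }


# Hypothetical sample data: the same two trades as stored in each tier.
FIELDS = ["trade_id", "amount"]
lake = [
    {"trade_id": 1, "amount": "100.00"},
    {"trade_id": 2, "amount": "250.50"},
]
warehouse_ok = [  # same content, different row order
    {"trade_id": 2, "amount": "250.50"},
    {"trade_id": 1, "amount": "100.00"},
]
warehouse_drifted = [  # one amount no longer agrees with the lake
    {"trade_id": 1, "amount": "100.00"},
    {"trade_id": 2, "amount": "250.00"},
]

ok_result = reconcile(lake, warehouse_ok, FIELDS)
drift_result = reconcile(lake, warehouse_drifted, FIELDS)
print(ok_result)
print(drift_result)
```

In this example the reordered warehouse copy passes both checks, while the drifted copy keeps the same row count but fails the content check, which is the kind of silent divergence a regulator-facing report cannot afford. In practice such a check would run inside the processing layer against real tables rather than in-memory lists.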
Data Lakehouse
Data Lakehouse combines the best of two worlds, Data Lakes and Data Warehouses, while addressing their shortcomings. This architecture offers several advantages, including storage flexibility (structure and format), compatibility with multiple programming languages, cost-efficiency for computing and storage, scalability through distributed computing and native storage, and support for ACID transactions. Undoubtedly, this is a valuable asset for the regulatory reporting of mid-size companies and strategic business units. The figure below depicts financial regulatory reporting using the Databricks Lakehouse capabilities of medallion architecture (Lakehouse layers organizing data based on quality and granularity), Delta Live Tables (robust and reliable data pipelines), dashboards (reporting), and Delta Sharing (an open protocol for enterprise data exchange), in conjunction with the FIRE (Financial Regulatory data standard) data model, which unifies data specifications between regulatory systems in finance.
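To make the medallion idea concrete, here is a minimal sketch of the layering logic in plain Python. It is an illustration only: it does not use the Delta Live Tables API or the FIRE schema, and the record fields (`loan_id`, `country`, `amount`) are hypothetical. Raw rows play the role of the bronze layer, validated and de-duplicated rows form silver, and a per-country aggregate forms gold.

```python
from collections import defaultdict
from decimal import Decimal, InvalidOperation


def to_silver(bronze_rows):
    """Validate, type, and de-duplicate raw rows; keep rejects for audit."""
    silver, rejected, seen = [], [], set()
    for row in bronze_rows:
        try:
            amount = Decimal(row["amount"])
        except (InvalidOperation, KeyError, TypeError):
            rejected.append(row)  # unparseable or missing amount
            continue
        loan_id = row.get("loan_id")
        if not loan_id or loan_id in seen:
            rejected.append(row)  # missing identifier or duplicate
            continue
        seen.add(loan_id)
        silver.append(
            {
                "loan_id": loan_id,
                "country": row.get("country", "UNKNOWN"),
                "amount": amount,
            }
        )
    return silver, rejected


def to_gold(silver_rows):
    """Aggregate clean rows into a report-ready total per country."""
    totals = defaultdict(Decimal)  # Decimal() starts each total at 0
    for row in silver_rows:
        totals[row["country"]] += row["amount"]
    return dict(totals)


# Hypothetical bronze data, stored exactly as ingested.
bronze = [
    {"loan_id": "L1", "country": "FR", "amount": "1000.50"},
    {"loan_id": "L1", "country": "FR", "amount": "1000.50"},  # duplicate
    {"loan_id": "L2", "country": "FR", "amount": "200"},
    {"loan_id": "L3", "country": "DE", "amount": "n/a"},  # invalid amount
    {"loan_id": "L4", "country": "DE", "amount": "75.25"},
]

silver, rejected = to_silver(bronze)
gold = to_gold(silver)
print(gold)
```

Keeping the rejected rows instead of dropping them silently matters for auditability: a reviewer can trace every record that never reached the reported figures. `Decimal` is used rather than `float` so that monetary totals do not pick up binary rounding errors.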
Data Mesh
Large organizations often struggle with siloed data repositories at scale, leading to inconsistencies and duplications across strategic business units and/or subsidiaries. This weakness can cause discrepancies in regulatory reporting, tarnishing the company’s reputation and lowering its market value. Hence the need for a decentralized data architecture, Data Mesh, that organizes data by separate domains and federates governance to enhance the consistency, adoption, and interoperability of data assets. Per its creator’s definition [7], this architecture is built on four principles: Domain Ownership (domain teams take responsibility for their data), Data as a Product (applying product development philosophy to data), Self-serve Data Infrastructure Platform (applying platform thinking to data infrastructure, with a data platform team coexisting with data product teams), and Federated Governance (interoperability of all data products through adherence to organizational rules and industry regulations). The figure below shows that Data Mesh organizes data by domain while allowing data virtualization and interoperability. Regulators can still access view databases exposed by business nodes for audit purposes.
In today's landscape, regulators have become increasingly demanding. Therefore, companies must make informed decisions when choosing their data architectures for regulatory reporting. A Two-tier architecture can suit mid-size public companies that already have one of the two systems (Data Lake or Data Warehouse): it requires less investment but carries high operational expenses. Data Lakehouse is more efficient than the Two-tier architecture but demands more investment. Data Mesh is more suitable for larger organizations that need data interoperability and consistency across nodes. More importantly, data governance culture, our next topic, is even more critical in the face of regulations because it enables the standardization and integrity of data and processes.
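The interplay between domain-owned data products and federated governance can be sketched as a small contract check. This is a hypothetical illustration in plain Python rather than any Data Mesh product or standard API: the classes, column names, and tags below are assumptions made for the example. Each domain publishes a product descriptor, and a shared policy lists what every product must expose.

```python
from dataclasses import dataclass, field


@dataclass
class DataProduct:
    """Descriptor published by a domain team for one data product."""
    name: str
    domain: str
    owner: str
    schema: dict  # column name -> type name
    tags: set = field(default_factory=set)


@dataclass
class GovernancePolicy:
    """Organization-wide rules that every data product must satisfy."""
    required_columns: set
    required_tags: set


def violations(product, policy):
    """Return a sorted, human-readable list of policy violations."""
    problems = []
    if not product.owner:
        problems.append("missing owner")
    for column in sorted(policy.required_columns - set(product.schema)):
        problems.append(f"missing column: {column}")
    for tag in sorted(policy.required_tags - product.tags):
        problems.append(f"missing tag: {tag}")
    return problems


# Hypothetical federated policy shared by all domains.
policy = GovernancePolicy(
    required_columns={"reporting_date", "legal_entity_id"},
    required_tags={"pii-reviewed"},
)

loans = DataProduct(
    name="loans",
    domain="retail-lending",
    owner="lending-data-team",
    schema={"reporting_date": "date", "legal_entity_id": "string", "balance": "decimal"},
    tags={"pii-reviewed"},
)
trades = DataProduct(
    name="trades",
    domain="capital-markets",
    owner="markets-data-team",
    schema={"reporting_date": "date", "notional": "decimal"},
)

loans_issues = violations(loans, policy)
trades_issues = violations(trades, policy)
print(loans_issues)
print(trades_issues)
```

Here the `loans` product passes, while `trades` is flagged for a missing `legal_entity_id` column and a missing `pii-reviewed` tag. The design point is that the policy is defined once at the organizational level, while each domain remains responsible for making its own product compliant.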
References
[2] “RegHub” –QuantGov
[3] “RegData 2.2: a panel dataset on US federal regulations” –Patrick A. McLaughlin & Oliver Sherouse
[4] “Chemical Manufacturing | North American Industry Classification System (NAICS)” –U.S. Census Bureau
[5] “High-Performance Modern Data Warehousing with Azure Databricks and Azure SQL Data Warehouse” –Databricks
[7] “Data Mesh: Part I - What Is Data Mesh?” –Zhamak Dehghani
Further readings
“Data Lakehouse Architecture and AI Company” –Databricks
“Data Mesh” –Zhamak Dehghani
“AI for better health” –Oussama KIASSI
“Artificial Intelligence for Enterprise” –Oussama KIASSI
“Why Data is the New Superpower?” –Oussama KIASSI