Data Nugget August 2023

31 August, 2023

As summer fades away, we hope you enjoyed a splendid vacation. The end of August is an opportune moment to resume our tradition of curating captivating updates from the realm of data management. I extend a warm invitation for you to explore our latest highlights.

First, we have an insightful review of Inga Strümke's book 'Maskiner som tenker.' Second, we have a nugget on data management strategy. Third, we have a guest contribution about data platform urbanism. Fourth, we present a nugget that sheds light on figuring out the perfect data strategy. And last but not least, we bring the next episode of the podcast series on data governance in a mesh.

Enjoy reading!

Let's grow Data Nugget together. Forward it to a friend. They can sign up here to get a fresh version of Data Nugget on the last day of every month.



Book review: "Maskiner som tenker" - Putting AI in context

Nugget by Winfried Adalbert Etzel

“Maskiner som tenker” is a book for anyone interested in the state of #AI, influenced by AI FOMO, or even concerned about the power of AI.

Inga Strümke really hit a nerve with this book, making the field of AI accessible for a broad audience and cutting through the AI hype with a clear message: AI is already here, and we need to manage and develop AI in a responsible manner.

I found it particularly powerful how Inga embedded AI as a discipline, or field of study, into the broader picture of many other disciplines, from physics to cybernetics, from history to philosophy. It becomes evident that AI must be viewed as multidisciplinary, and that the influence of AI on our society needs to be understood and contextualized better.

So here are the reasons for reading this book:

  • Putting AI in a broad context of society, history, philosophy, ethics and politics.
  • Broad perspective on AI ethics and intelligence.
  • Making AI accessible.
  • Introducing the big thinkers in AI.
  • Explaining common AI terminology and concepts.


Structure

The book is structured in three main sections:

  1. Past
  2. Present
  3. Future

Just by looking at the table of contents, you understand that the book is about context and process, setting AI in a historical flow of time.

Inga made the book personal by sprinkling it with autobiographical stories about where she was at different stages of the field's development. This made it so much more of a personalized read. So, when talking about putting the field of AI in context, here are my takeaways:

AI in historical context

Inga put AI development into the context of historical industrial development over the last centuries. She also portrayed the AI hype cycles in a great way, putting the recent AI hype in the context of earlier waves, such as the 1960s and the 1980s, and described the recurring AI winters, periods when AI faded from popular discussion, perhaps revealing a pattern that could predict future AI development.

Without diving too deep into a processual interpretation of history or attributing it to inherent logic, I think it is important to have that historical perspective as a basis for any discussion, also when it comes to technological development.

AI in social context

In the intersection between social and historical context, focusing on certain advancements in the Industrial Revolution, Inga portrayed the impact that these technological and industrial changes had on society, work security, financial independence, etc. We can see that we are in similar debates today when discussing the impact of AI on our work and life.

There are three aspects Inga described in more detail that are important impacts of AI on our society today.

  • Mental health

AI has already taken a central spot in our society, and through the possibility of contextualizing and structuring data to find patterns, AI can gain insight into, for example, our mental health from otherwise seemingly unconnected data points. Inga discussed a 2018 project that put Facebook status messages in context, gaining insight into how we are feeling mentally and showing that an ML model can understand the knowledge we share about our lives better than we do ourselves.

  • Privacy

How can we as individuals safeguard our #privacy in a society built on sharing, where data collection is vital for important services? I was happy to see that Inga included privacy, as well as the debate around GDPR, in the book. What is interesting to note is that the Norwegian Data Protection Authority (Datatilsynet) not only supervises #GDPR compliance but is de facto the authority for everything related to data, algorithms, and AI.

  • Explainable AI

One of the vital parts of our western democracy is the right to receive explanations for actions by state authorities. This is how we can contribute to our society and ensure that public administration is by the people, for the people. With the rapidly growing use of AI in public administration, this right is under threat, because we cannot always trace decisions through a model and understand which criteria led to which decision. Explainable AI is therefore an important field for creating democracy, transparency, and trust in AI.

AI in philosophical context

When we talk about AI, we must talk about philosophy, not only in the ethical AI debate but for all the general concepts of AI, starting with the definition of Artificial Intelligence. At the core of many philosophical debates and writings over the last 300 years, but also dating back to Aristotle, we find that intelligence is not an easy concept to define.

So, if we struggle to define intelligence for humans, how can we tell if we have achieved Artificial Intelligence? Maybe this is the reason for the rather circular definition of Artificial Intelligence we meet at the end of Chapter 1: (…) #kunstigIntelligens er et fagfelt innen #datavitenskap med formål å utvikle maskiner som evner å oppføre seg intelligent. ((...) artificial intelligence is a field within data science with the aim of developing machines that are capable of behaving intelligently.)

So, what is intelligence? Inga gave some hints throughout the book, pointing towards creativity, understanding of context, or empathy. At the same time, she also named some of the great philosophical works on this topic, like Gottlob Frege's 'Über Sinn und Bedeutung' or Immanuel Kant's notion of transzendentale Selbstbestimmung, or autonomy.

Inga made it clear that philosophy is vital for our understanding of AI, and the link between the disciplines needs to be strengthened, whilst not going into any deep philosophical debate in the book itself.

AI in ethical context

Ethical dilemmas and constructs around morality and reason are fundamental for the discussion on AI. Inga didn't just dedicate a chapter to the ethical debate but added thoughts on ethical topics within AI throughout the book. In Chapter 8, she added the five principles for ethical AI by Luciano Floridi and Josh Cowls:

  • Beneficence
  • Non-maleficence
  • Autonomy
  • Justice
  • Explicability

Morality and ethics are described from two perspectives: how we can ensure that we make the right considerations when creating, implementing, and acting on AI-generated information, and how we can view AI systems as artificial moral actors that have a sense of morality and can make ethical considerations autonomously.

This brings an entirely new set of challenges, such as ethical colonialism and moral superiority, that need to be considered, also when we talk about AI as ‘intelligent.’ There is some certainty in the notion that AI ethics, as a subfield of #DataEthics, will become extremely important going forward, also challenging our established ethical mesh.

AI in political context

The intention of the book to spark political debate and regulatory action is not a secret, especially after Inga handed out her book to Members of the Norwegian Parliament (Stortinget).

In the book, she also advocated for more and clearer regulation of AI, building further on the coming AI Act proposed by the European Union. To ensure development that is ethical and contributes to societal advancement, Inga is clear about the need for guardrails through regulation. This is distinctly important to minimize ethical dilemmas in the development process of AI in public administration, but also for private companies navigating this field.

AI cannot be understood as something apolitical; its effects on society are so vast that it needs to be on the political agenda. Even if we are heading towards a new AI winter, this debate is important to have.

My recommendation

Inga Strümke's “Maskiner som tenker” came out at the right time, filling a void between public opinion and the discipline of #ArtificialIntelligence. I think this is how you can understand this book: not as a scientific work, but as an introduction that makes AI broadly accessible.

In the field of AI, one can easily see signs of a technological hubris: the belief that everything can be solved via statistical methods.

I would love to see this book as a first step towards a public debate about AI, but also an academic debate where the other disciplines described above take more ownership. The greatest advantage of this book is that Inga challenges other disciplines while conveying the message and knowledge in a way that is understandable for a broad audience.

With my background in history, political science, and law, this book is a step towards establishing the importance of these disciplines in the previously STEM-dominated field of AI.


Data management strategy

Nugget by Gaurav Sood. Source of the nugget.

Data management is a process that requires both administrative and governance capabilities to acquire, validate, store and process data. With the growth of #BigData, companies of all sizes are generating and consuming vast quantities of data to create business insights into trends, customer behavior, and new opportunities. And this gives rise to the importance of an effective #DataManagementStrategy.

Data management strategies help companies avoid many of the pitfalls associated with data handling, including duplicate or missing data, poorly documented data sources, and low-value, resource-intensive processes. An enterprise data management strategy can help organizations perform better within the markets they serve.

Data gets transformed into information, information into knowledge, and knowledge into key decisions by and for the organization. It is important to remember these stages, as they are responsible for the transformation of data into insight. Data management is an organization-level capability for which the entire company, not only the IT/data team, is responsible. Data management strategy and master data management strategy should follow data science best practices. These practices can be seen in, and underpin, data engineering, data analytics, machine learning, deep learning, and artificial intelligence disciplines.

Data management strategy is enabled by people, processes, technology, and partners, and is supported by the business and IT strategy goals. The strategy itself is a plan or roadmap with an overall budget that supports the organization’s other strategic projects.

Data management strategy approach

Following are some of the steps for defining a good strategy approach:

  • Identify business objectives by performing a business assessment to understand business strategy, business technological direction, operating model, and policies and procedures.
  • Decide how IT supports business objectives with the data strategy by doing a strategic assessment: understand the demand for data management; understand strengths, weaknesses, opportunities, and threats; define the market spaces a data management strategy is needed for; identify all organizational gaps in capability and resources for success.
  • Generate the strategy. Establish priorities. Establish objectives. Draft a strategic plan. Define expected return on investment (ROI) and total cost of ownership (TCO).
  • Execute the strategy. Create a strategic plan. Create a risk assessment. Get business buy-in, alignment, and integration with the plan. Deploy IT assets: people, technology, processes, partners. Decide and define governance and compliance capabilities. Hire a chief data officer to be accountable for strategy execution. Create standards, policies, and procedures. Support execution of the plan across organizational functional units.
  • Obtain feedback from #DataManagement architecture and design. Find the right technology that supports the budget and objectives. Establish data management practice. Establish data governance.
  • Prepare and execute organizational change management (OCM) for data management. Train and educate employees and other stakeholders. Get feedback.
  • Deliver and operate data management strategy with feedback.

The strategic plan is followed by plans, projects, and then operational implementation. The strategy supports operations, and operations should support the strategy of the organization’s data management initiatives.

Benefits of data management strategy

Here are some benefits of having a data strategy:

  • Overall better decision-making across the organization.
  • A better understanding of organizational strengths, weaknesses, opportunities, and threats in all business areas.
  • A reduction in bad data (inconsistent, incompatible, duplicate, or missing data) used for decision support across the organization.
  • Cost management and efficiency. Enablement of the business to spend its budget better on needed capabilities and resources to support customer outcomes.
  • Better data governance, compliance, and master data management strategy. Critical data is managed better.
  • Improvements in running the business, innovating the business, and addressing customer fixes and wishes.

These benefits can increase overall business value for the business and its customers. All data management strategies and projects should focus on benefits and value to the organization.

Challenges of data management strategy

Some of the biggest challenges are:

  • Alignment with business needs is not understood.
  • Lack of resources to implement the strategy successfully.
  • Ineffective communication between leaders and team members about the vision and execution of the data management strategy.
  • Lack of training and resistance from people to adapt to the change.

The benefits of a data management strategy, when effectively executed, usually outweigh the challenges.


Data platform urbanism: Sustainable plans for your data work

Guest contribution by Pål de Vibe, Head of AI and Data Engineering at Knowit Objectnet.

This article proposes some suggestions for structuring your data work on the data platform service Databricks, including naming and organising data products in Unity Catalog and git. Databricks is available as a meta-platform on Amazon Web Services (AWS), Microsoft Azure, and Google Cloud (GCP). This article is technical, intended for data platform engineers and architects, ML engineers, data engineers, and full-stack data scientists. Even if Databricks is the focus here, a lot of the ideas apply to other data platforms as well and should be easy to understand and adapt.

It is becoming increasingly easy to get rolling with Databricks, as with other data platform services like Snowflake, Google Dataplex, and Microsoft Fabric Services. However, how can you avoid building a data favela without plumbing, proper transport infrastructure and security? How can you avoid building a huge urban sprawl, clogging up infrastructure and killing spaces for shared culture and learning?

Even if decentralization is core to Democratizing Data, Data Drivenness, #DataMesh, enabling AI, etc., some structural regulations and guidelines are useful, the same way pre-emptive urban planning and shared building norms are powerful tools for building functional and soul-soothing cities.

1. Desirable Qualities of Data Work

Based on the growing body of experience with data platforms, we can point to some core desirable qualities of data work, which should be reflected in data platform urbanism, that is, the design principles implemented in the structure plans and other aspects of the data platform engineering. Done right, these principles will prevent chaos, prescribe conventions, and educate users by the power of example.

Usability

It should be easy to work on the data platform, even for less technical users. Onboarding should also be self-service and pedagogical. For an organization to become data-driven, usability is crucial, and easy comprehension of the structure and ways of working is part of it.

Sustainability

Even if the platform should be easy to use, data work ought to be sustainable, both in terms of respecting the space of other data workers and the resources of the company. Even more important, however, is that the data work can be maintained and understood over time. By sustainable, I here mean technologically and organizationally sustainable, not ecologically sustainable, which is a much larger question.

Cooperation

Cooperation and knowledge sharing across teams can be enabled by following common examples and conventions for organizing code, data, and pipelines. In addition, low-threshold arenas for knowledge sharing must be enabled, like Slack and Q&A fora. Without cooperation, building a much-needed data culture is hard to achieve.

Autonomy

Each team and data worker should have space, resources, and freedom to go about solving their data needs.

Security

Responsible access management should be enabled by design. By having clarity and comprehension in the structure plan, security becomes much easier to implement and maintain.

Privacy

Solid examples of privacy-respecting data handling should be provided, at least as soon as privacy-sensitive data is to be processed. Audit logging is a great tool for capturing who has queried which data and for implementing security practices.
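
As an illustration of what audit-log-based follow-up can look like, the sketch below queries Unity Catalog's system audit table. Table and column names should be verified against the current Databricks documentation, as system table schemas evolve; this is a sketch, not a definitive recipe.

```python
# Hedged sketch: list recent Unity Catalog access events from the audit
# log. Assumes system tables are enabled and exposed as
# system.access.audit; verify table and column names in the Databricks
# documentation. `spark` is the notebook-provided SparkSession.
spark.sql("""
    SELECT event_time, user_identity.email AS user, action_name, request_params
    FROM system.access.audit
    WHERE service_name = 'unityCatalog'
    ORDER BY event_time DESC
    LIMIT 100
""").show(truncate=False)
```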

Data worker user experience

The data worker should be easily able to start doing data work. It should be easy to:

  • get access to and obtain data.
  • know where exploratory code and data assets will be stored, and how to name them.
  • follow branch name conventions and commit code.
  • share code across projects and departments.
  • deploy code to automated execution (production).
  • apply data quality checks.
  • perform access management.
  • investigate failing pipelines.
  • test run data pipelines.
  • have access to production data where it is permitted, to do analysis and build ML on real data.
  • where the use case requires it, easily handle data consumption and production in a privacy-respectful manner.

2. Environments in the data platform

This section will explain why App Engineering and Data Engineering should treat environment separation differently.

App Engineering vs. Data engineering

App engineering normally deals with precise data. A typical data interaction is to query the application's database for precise data to display, change, or transact on behalf of a specific user. For example: find the start dates of the courses of the student with student_id=1.

On the contrary, data engineering normally focuses on patterns in data. A typical data interaction is to query different databases/data sources for all the data in some tables, also across databases, to study patterns. For example, find the popularity of all course types across all course choices for all students across databases from many different universities for the last 10 years.
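
To make the contrast concrete, here is a minimal sketch with hypothetical courses and enrollments tables (not from the article): the first query is the app-style lookup, the second the data-engineering-style pattern query.

```python
# Illustrative sketch with hypothetical tables (courses, enrollments).
# `spark` is the SparkSession provided in a Databricks notebook.

# App-style interaction: a precise lookup for one entity.
spark.sql("""
    SELECT c.course_id, c.start_date
    FROM enrollments e
    JOIN courses c ON c.course_id = e.course_id
    WHERE e.student_id = 1
""").show()

# Data-engineering-style interaction: a pattern across all the data,
# e.g. course-type popularity over the last 10 years.
spark.sql("""
    SELECT c.course_type, count(*) AS n_enrollments
    FROM enrollments e
    JOIN courses c ON c.course_id = e.course_id
    WHERE e.enrolled_at >= date_sub(current_date(), 3650)
    GROUP BY c.course_type
    ORDER BY n_enrollments DESC
""").show()
```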

Given the difference in the need for real data, there are also differences in the environmental separation for the cloud infrastructure when building a data platform as opposed to an application platform.

In an application platform, it is often useful to distinguish between environments for development, testing, staging and production, and the data used in these usually have clear distinctions. Development and testing will usually use synthetic data, while staging and production use real data. It can, therefore, be useful to have a totally separate environment for development and testing. In a data platform, however, there is a need to have access to real data already in the development phase.

Testing app functionality vs. data functionality

Given the differences in data interaction, testing app code often fulfils a different role than testing data code.

Typical app test

Given a student with three courses starting on August 15, 17, and 18, return a list of exactly those courses and start dates. If the test fails, cancel deployment of the app. Typical test tools: Pytest, JUnit.

Typical data test

For data engineering, data quality tests are often more useful. Check that the number of course categories across the databases is a minimum of 90% or a maximum of 120% of the number of categories from the previous year. Check that all courses have at least one course category. Courses which lack categories should be rejected or quarantined. Typical test tools: Great Expectations, DBT, Delta Live Tables
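
A minimal sketch of the category check as a Delta Live Tables expectation, assuming hypothetical table names (courses_raw, courses_clean). The aggregate 90%-120% year-over-year check is not a row-level rule and would typically run as a separate validation job.

```python
# Minimal Delta Live Tables sketch; table names are hypothetical.
# `dlt` is only importable inside a Delta Live Tables pipeline.
import dlt


@dlt.table(comment="Courses with a validated category")
@dlt.expect_or_drop("has_category", "course_category IS NOT NULL")
def courses_clean():
    # Rows without a category are dropped; quarantining them instead
    # would mean a second table keeping the inverse of the predicate.
    return dlt.read("courses_raw")
```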

Environment proposal

Here is a proposal for defining environments in a data platform.

Cloud environments (AWS accounts / Azure Subscriptions)

Infra dev: Use for cloud engineering, i.e., building the data platform itself, adding new orchestration tools. Will not be accessed by data workers, only by data platform engineers.

Infra prod: Used for developing data pipelines and running these pipelines in automated production. Will be used by all data workers.

Infra Prod can contain several Databricks workspaces, or there can be one Infra Prod account per workspace. I would suggest keeping one workspace per account, to reduce the blast radius and the clutter.

Data environments

Dev: Exploratory data work (ML, ETL, analysis, data pipeline development).

Staging: Staging environments are used to run and test a new version of a data pipeline or task before it goes into production.

Prod: Pipelines here should run in an automated, stable, and secure manner, without errors.

All of these should reside in Infra Prod.

Note that Dev needs access to real data (i.e., production data). In some cases, this should be filtered or masked production data, and there might be cases where completely synthetic data must be used. However, this challenge must be solved specifically in order to build data pipelines that produce useful data, even if data visibility is reduced during development. It will not be solved by simple cloud environment separation, unlike for app platforms.

3. Planning for quality: Structure plans and proposals

Aiming at enabling the core qualities of data work and considering the specific characteristics of data platforms, concrete suggestions will now follow for organizing the data platform engineering.

3.1 Structure plans for decentralized data work

Data mesh has become a common concept in data platform engineering. It is a comprehensive topic with disagreements or variations on recommended architectural patterns. These will not be covered here, but we will focus on decentralization of ownership of data, which is a core feature that platform engineers agree upon. Each department should own the data products coming out of it. In this article, we will mainly focus on the decentralization of data work, enabling autonomy, but also cooperation and sustainability.

A data product structure plan is necessary to enable cooperation, autonomy, sustainability, and data UX across a large organization.

Structural dimensions

A data mesh data product hierarchy could be broken down into these levels:

  • Organization (Acme Inc)
  • Domain (Sales)
  • Project (Customer Analysis)
  • Data Product (Customer Classification)
  • Tables (customer_classification, classification_codes)
  • Data product version

This enables us not to mix up each project's data assets and code. This data product hierarchy must be reflected in components of the data platform:

  • Unity Catalog, schema, and table naming
  • Git-repository structure
  • Workspace distribution
  • CICD environment handling, e.g., naming of development and staging pipelines
  • Data Pipeline naming

Thus, the structure plans must cater to a number of dimensions:

  • Data product hierarchies
  • Distinct types of data work: ML, Data Engineering, Analysis. Each of these might require different file types and pipeline ops, sometimes warranting structural separation.
  • Data access management
  • Data environment separation (dev, staging, prod)
  • Data development experience (where do I put my experimental tables and code? Where do I test run my new data pipeline?)

Let's now look at a concrete structure proposal to accommodate these dimensions.

3.2 Unity catalog structure plan

In the data asset hierarchy, we need the six levels of the data product hierarchy plus one level of environment separation. Unfortunately, Unity Catalog only provides three levels of hierarchy: catalog, schema, and table. Databricks strongly suggests using only one metastore if data is to be shared, so a fourth level is not available; we only have three. To accommodate all levels, we will need to use prefixes or postfixes.

A number of choices have been made which might be disputed, but here is the reasoning:

A data product needs two levels of hierarchy. Since a data product is often made up of several data assets or tables, e.g., a star schema, we need to preserve the schema level of Unity Catalog for the data product. We need the tables on one level and their grouping, i.e., the data product, on the schema level. This already consumes two of our three levels.

Area as catalog

The catalog should correspond to an area to simplify access management, which then can be granted, with Unity Catalog grant statements, on either area level or data product level. One area can contain multiple data products.
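
As a sketch of what this looks like in practice, grants can then be issued on the whole area (catalog) or on a single data product (schema). The catalog, schema, and group names below are hypothetical.

```python
# Hedged sketch of Unity Catalog grants (names are hypothetical),
# run from a notebook or a CICD job. `spark` is the notebook session.

# Read access to a whole area (catalog):
spark.sql("GRANT USE CATALOG, SELECT ON CATALOG prod_sales TO `sales_analysts`")

# Read access to a single data product (schema) only; consumers still
# need USE CATALOG on prod_sales to resolve the schema:
spark.sql("GRANT USE SCHEMA, SELECT ON SCHEMA prod_sales.customer_classification_v1 TO `marketing_team`")
```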

Environment prefixing

The environment prefixing at catalog level gives a crystal-clear separation between environments. However, one could argue for postfixing instead, since sorting on org+domain could be more relevant, so putting env as postfix is not a bad alternative.

In any case, by keeping env on the top level together with area, a clear overview of official data products can be seen by filtering on the prod_ prefix. Access management during CICD and job runs also becomes clearer and, therefore, more secure.

Another advantage of keeping env in catalog name, and not pulling environment into the schema name (level below), is that queries can be developed and then deployed to production without having to change the schema name.

Explicit versioning of data products

Versioning should be an explicit part of the data product name, to avoid version lookups in metadata when building and running pipelines. It also permits producing separate versions of a data product in parallel, which might be needed by downstream consumers.
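
A tiny sketch of the resulting naming convention, combining environment prefix, area, data product, and version. The format itself is a local convention proposed here, not a Databricks requirement.

```python
# Naming sketch following the conventions above; the exact prefix
# format is a local convention, not a Databricks requirement.
def qualified_table_name(env: str, area: str, data_product: str,
                         version: int, table: str) -> str:
    """Build <env>_<area>.<data_product>_v<version>.<table>."""
    catalog = f"{env}_{area}"               # e.g. prod_sales
    schema = f"{data_product}_v{version}"   # e.g. customer_classification_v1
    return f"{catalog}.{schema}.{table}"


# 'prod_sales.customer_classification_v1.customer_classification'
print(qualified_table_name("prod", "sales", "customer_classification", 1,
                           "customer_classification"))
```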

When developing new tables, a data worker would normally name them something like testpaul2_classifications. Chaos spreads quicker than street vendors on an unregulated street, so having and facilitating a convention for naming development tables makes everyone's life easier. A suitable prefix could be dev[issue id][shortened title].

It could also be made an enforced practice that all data work should happen in the context of a git branch.

Staging environments

Instead of putting staging in the env name, we use pr to signal that it is a pull request run, but stg could also work. The env prefix format would be pr[ticket number][shortened title][short commit hash], e.g., pr745enrich1a6f396.

This enables parallel execution of multiple pipelines consuming the same data product. By beginning with pr, we clearly state that it is a staging environment. It is a good idea to include the commit hash also to avoid conflicts with parallel runs of the same branch.
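
A small sketch of how a CICD step might derive this prefix; the input values are placeholders for whatever your CI system exposes, and the exact format is a convention.

```python
import re


def staging_env_prefix(ticket: str, title: str, commit_hash: str) -> str:
    """Build pr<ticket><shortened title><short hash>, e.g. pr745enrich1a6f396.
    The inputs are placeholders for whatever the CI system exposes."""
    short_title = re.sub(r"[^a-z0-9]", "", title.lower())[:6]
    return f"pr{ticket}{short_title}{commit_hash[:7]}"


print(staging_env_prefix("745", "Enrich customer data", "1a6f396abcdef"))
# -> pr745enrich1a6f396
```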

Environment-agnostic code

A pyspark utility function or SQL constant should be used to automatically get the catalog name, including the environment, based on the execution context. Databricks has great utility functions to deduce which runtime context a notebook is running in (workspace name, cluster name and type, interactive notebook, etc.). Note that upstream data assets should be read from production or masked production data, while the data assets being created by the pipeline under development should be written in dev or staging before the branch is merged to main and deployed.
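
One minimal way to do this, assuming the deployed job passes the environment as a job parameter exposed as a notebook widget (the Databricks context utilities mentioned above could be used instead). The catalog and fallback names are hypothetical.

```python
# Minimal sketch, assuming the deployed job passes the environment
# ("prod", "pr745enrich1a6f396", ...) as a job parameter exposed as a
# notebook widget. `dbutils` and `spark` are provided by Databricks.
def current_catalog(area: str) -> str:
    try:
        env = dbutils.widgets.get("env")  # set by the deployed job / CICD
    except Exception:
        env = "dev745enrich"              # hypothetical fallback for interactive work
    return f"{env}_{area}"


# Upstream data is always read from production (or masked production):
customers = spark.read.table(
    "prod_sales.customer_classification_v1.customer_classification"
)
# The pipeline's own outputs go to the environment-specific catalog:
spark.sql(f"CREATE SCHEMA IF NOT EXISTS {current_catalog('sales')}.enriched_customers_v1")
```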

Streams and ML models

Beyond tables, data assets can also be ML models or streams. They might not always fully fit into the implementation suggestions outlined here, but the higher-level principles can still be applied.

3.3 Git-versioned data work

A git repo should be used as a workbench instead of an unversioned folder under "Workspace". The simplest approach is for each user to check out a monorepo, facilitated by the platform team, so that all data work takes place there, with support and examples for the desired structure and ways of working out of the box.

An example of such a repo can be seen here. At Knowit Objectnet, we use a monorepo for the internal data platform, where all the data domains' code is located unless something else is explicitly necessary.

In the repo structure, projects correspond to the area level of data mesh, in this case, customer_analysis. The flows folder has separate folders for ML and prep (#ETL) flows (pipelines). Shared code can either be put within the project's libs folder, domain libs folder or in the root libs folder, depending on how universal the python functionality in the module is.
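
An illustrative layout of such a monorepo; folder names beyond those mentioned in the text (projects, explore, flows, ml, prep, libs) and the exact placement of domain-level libs are assumptions.

```
projects/
  customer_analysis/        # project, corresponding to the area level
    explore/                # experimental notebooks and ad hoc work
    flows/
      ml/                   # ML pipelines
      prep/                 # ETL pipelines
    libs/                   # code shared within the project
  <other projects>/
domains/<domain>/libs/      # code shared within a domain (placement is illustrative)
libs/                       # code shared across the whole repo
```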

More details can be found in the monorepo example. There is also code in the repo example for unit testing pyspark code, which can be done as part of a CICD pipeline to solidify pipelines. Note that Spark just released more functionality for unit testing, which has not yet been incorporated in the example. Using a shared repo becomes a catalyst for cooperation and knowledge sharing and enables a very easy and direct way for people to gradually build their tools with git versioning. It is also a great place to document data pipelines.
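
A minimal pytest sketch of such a unit test; the transform function is hypothetical, and assertDataFrameEqual assumes PySpark 3.5 or newer (the newer testing functionality referred to above).

```python
# Minimal pytest sketch; the transform under test is hypothetical and
# assertDataFrameEqual requires PySpark >= 3.5.
from pyspark.sql import SparkSession, functions as F
from pyspark.testing import assertDataFrameEqual


def add_course_count(df):
    """Hypothetical transform: count courses per student."""
    return df.groupBy("student_id").agg(F.count("course_id").alias("course_count"))


def test_add_course_count():
    spark = SparkSession.builder.master("local[1]").getOrCreate()
    input_df = spark.createDataFrame(
        [(1, "math"), (1, "physics"), (2, "history")],
        ["student_id", "course_id"],
    )
    expected = spark.createDataFrame([(1, 2), (2, 1)], ["student_id", "course_count"])
    assertDataFrameEqual(add_course_count(input_df), expected)
```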

Documentation close to the code

The documentation for the data work should be close to the code, e.g., in README.md files inside the flows folder producing the data assets. Easy to find, easy to purge when the code dies.

Git-versioned orchestration directly in the Databricks UI

All data work, including orchestration, should be possible to do directly in the Databricks UI. A good way to achieve this is to use yaml files to define Databricks Jobs for orchestration and let CICD automatically deploy these as Databricks Jobs on git events, using Databricks' good APIs. For example, you can trigger a test run when a new branch commit happens, a staging run on a new commit to a pull request, and a production deploy on merge to master. Advanced users are free to pull the code down onto their own laptops if they want more advanced editor functionality, e.g., for refactoring, and use Databricks Connect to interact with clusters and data.
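
A hedged sketch of the deployment step: a yaml job definition from the repo is posted to the Databricks Jobs API from a CICD job. The yaml layout and file path are assumptions for illustration, not the article's repository format, and the Databricks CLI or Terraform provider can achieve the same.

```python
# Hedged sketch: deploy a yaml-defined job via the Databricks Jobs API
# from a CICD step. The yaml structure and file path are assumptions
# for illustration; the Databricks CLI or Terraform can do the same.
import os
import requests
import yaml

with open("projects/customer_analysis/flows/prep/enrich_customers.job.yaml") as f:
    # Expected to match the Jobs API payload, e.g. {"name": ..., "tasks": [...]}.
    job_spec = yaml.safe_load(f)

resp = requests.post(
    f"{os.environ['DATABRICKS_HOST']}/api/2.1/jobs/create",
    headers={"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"},
    json=job_spec,
    timeout=30,
)
resp.raise_for_status()
print("Created job", resp.json()["job_id"])
# Subsequent deploys of the same job would update it (e.g. via the
# jobs/reset endpoint) instead of creating duplicates.
```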

When beginning with Databricks, it is often tempting to use the Users/username folder or the shared folder under Workspace as a workbench for data work. Instead, you should check out a monorepo and begin working within a project, under the explore folder (e.g., projects/nyctaxi/explore/diff_taxi_aggregations). This way, the code being created already has a hierarchical position corresponding to its organizational position. The naming of the data assets (schema and table names) should follow a naming pattern corresponding to the code. Everything becomes much clearer. Fewer choices need to be made, and less chaos is created. It also enables git version controlling your experiments and sharing them with others.

Your personal Repos folder is a lot like a virtual laptop. In your personal repo folder, you check out repos. Only you work in it (though you can share access to it). Collaboration happens through branches, just as it would with code on your physical laptop, and branches are also how you share code with your physical laptop (e.g., for refactoring) or with other team members.

In general, it is very similar to checking out a repo and working on it on your physical laptop. The same repo can be checked out multiple times. The repo-integration (git UI) in Databricks is an easy way to use git for non-developers.

Pull requests as proposals to put code into production

Pull requests are great for representing a proposed addition or change to production pipelines. The code diff is easy to study, comments can be made, CICD validations implemented, four-eyes approval regimes enforced. My experience is that non-developers understand Pull Requests fairly easily.

3.4 Democratize the data work and hide cloud engineering

To become a data-driven organization, it is key to empower data workers at many levels and in many roles to do their job as much as possible on their own, in an efficient but responsible way. Ways of working must be cultivated to make it easy for less technical workers to use the data platform. The git repo described above is a fitting tool to achieve this, but here are a couple of additional proposals to enable it.

To simplify data work, especially for non-developers, the data platform should isolate data workers from direct cloud engineering. They should not have to relate to the entire apparatus of cloud engineering, e.g., the entire terraform setup and terraform state. Therefore, all data work should be possible to do right in the Databricks UI. More technical users can still download the Databricks repo onto their own machine and work via Databricks Connect and Databricks' APIs.

Leverage technologies empowering less technical data workers

Tools such as Notebooks, Delta Live Tables (DLT), Databricks SQL Dashboards and DBT empower less technically experienced employees. DBT is a very good tool, but it can also lead to splitting the user interface into several platforms, both Databricks and DBT Cloud, and it is, therefore, often best not to start with DBT. DLT is partly inspired by DBT and provides similar data quality checks, although DBT's opinionated file structure by default enables cleaner implementations of pipelines.

3.5 Let Unity Catalog manage cloud resources and tables

Unity Catalog is the new brain of Databricks and is designed partly to simplify access control to cloud resources, which can be a complex task. In addition, there are several other efficiency-enhancing functions that are gradually being added to the Unity Catalog, e.g., column-based lineage.

It is complex to keep track of cloud resources with the help of IAM roles alone, and it also means that data work gets mixed with cloud engineering, for example when managing access to a dataset.

In general, one should avoid building functionality that Databricks itself solves or is about to solve, e.g., support for DBT and Airflow orchestration, which is now available in Databricks Jobs. Another example is Databricks cost dashboards for monitoring data work spending.

Please note that #UnityCatalog, as of today, does not necessarily solve all the needs one wants solved by a data catalog. An easy way to get started with a data product catalog is to simply use one or more Delta Lake tables as the catalog. This provides speed, easy automation, plenty of room to learn, and integration opportunities. Eventually, you will get a clearer idea of what extra catalog functionality is needed.
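
A minimal sketch of such a table; the catalog/schema location and the columns are one possible choice made up for illustration, not a standard.

```python
# Hedged sketch: a plain Delta table acting as a simple data product
# catalog. The location (prod_governance.data_product_catalog) and the
# columns are illustrative choices, not a standard.
spark.sql("""
    CREATE TABLE IF NOT EXISTS prod_governance.data_product_catalog.products (
        area             STRING,
        data_product     STRING,
        version          INT,
        owner_team       STRING,
        description      STRING,
        full_table_names ARRAY<STRING>,
        registered_at    TIMESTAMP
    )
""")

spark.sql("""
    INSERT INTO prod_governance.data_product_catalog.products VALUES (
        'sales', 'customer_classification', 1, 'customer-analysis-team',
        'Customer classification for campaign targeting',
        array('prod_sales.customer_classification_v1.customer_classification'),
        current_timestamp()
    )
""")
```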

3.6 Getting started examples for core tasks

In general, providing example code and pipelines for core tasks and use cases, including good naming practices, is extremely useful to avoid unnecessary diversity in ways of working, naming, and coding practices. Enforcing Black formatting is also very useful; linting checks can be added in the CICD pipeline.

3.7 Automated data quality checks

The shared repo should have some examples of data quality jobs and pipelines, with Delta Live Tables or other frameworks like Great Expectations. Automated data quality is essential for good data work, and good examples, working out of the box, will help to get automated data quality checks into use. DBDemos can be a place to start looking for examples to incorporate.

3.8 Workspace distribution

How many domains should share a single Databricks workspace? My general recommendation is to limit each domain to a single workspace. This way, the blast radius for data access is naturally limited, and the number of artefacts (notebooks, folders, models, repos, jobs) created in a workspace stays manageable without losing the overview.

3.9 CICD

Here are some suggestions for managing automatic deployment to production with CICD. A new PR run is created per pull request and per commit hash, with a unique catalog to write the data to. All data work code that is merged to the main branch is automatically deployed to production. Production runs on job clusters, in the same workspace as development, but with tighter access control on the data.

Self-service orchestration: pipelines/jobs/flows are defined in, e.g., yaml in the repo and are deployed automatically with CICD. PR pipelines run on job clusters on each new commit to a PR. This way, the yaml definitions of jobs can be tested iteratively before they are put into production.

Since you can end up pushing a commit before a PR test is finished, it is best to use the commit hash as part of the schema name. Automatic validations should be implemented in CICD: incoming and outgoing table schema and data quality verifications, black linting, automatic privacy alerts, etc. Non-compliant code can be blocked from deployment.

DataOps and MLOps might have different needs, but CICD and PRs are great places to handle a lot of the Ops.

3.10 Central data platform enabling team

As suggested by Data Mesh theory, the organization must commit to staffing a central data platform enabling team to cultivate the tools, structures, and ways of working that democratize the data work. Internal training and knowledge sharing must also be prioritized.

4. Conclusion: A sustainable development plan

A number of proposals for structuring your Data Work have been presented. Much like a city, the consequences of bad or missing planning will manifest over time and become increasingly hard to change. On the flip side, by creating livable and even beautiful cities, we enable people and communities to thrive and flourish, and the same goes for data platforms. Few challenges are more important for a company than this, as we enter the Age of AI. Even the ancients apparently agree with this focus, and the crucial impact it has on culture and people.

5. Vocabulary

For clarity, here are some definitions of relevant terms:

Data platform: A collection of tools and services, preferably in the public cloud, that make it easy to transform datasets and build data products.

Data product: One or more data assets which it makes sense to deliver and access-manage jointly, e.g., the tables in a star model, a Kafka stream, or a machine learning model.

Data asset: A table, machine learning model, Kafka stream, dashboard or similar.

Data work: All work with data assets on a data platform, e.g., data engineering/ETL, Machine learning, MLOps, orchestration of data pipelines, SQL analysis, dashboard building, dashboard analysis.

Data worker: Those who use the data platform, e.g., data engineers, analysts, and data scientists.

Data mesh: A methodology for structuring data work, systems, and data products in an organization, with decentralized data product ownership.

Data area: A category for grouping related data products, for example, the data products studying private customer sales.

Data domain: A team, a department or other part of the organization that produces data products which the domain will take ownership of.

Data ownership: To take responsibility for delivery, versioning, design, and documentation of a data product.

Cloud engineering: Building and maintaining cloud infrastructure.

Application development: Development of applications with user interfaces which are not focused on data work.

Application platform: A collection of tools and services, preferably in the public cloud, that make it easier to build applications.

Data development: Data work on a data platform to produce data products or insights.

Synthetic data: Fake data created to represent real data.

Real/production data: Real data used in production systems.

Ways of working: An opinionated way of going about a common data work task.

Data Pipeline: A sequence of data assets produced together with dependencies between them.

Structure plan: A plan prescribing where different teams and data activities should take place, within git-repositories and the lakehouse structure (Unity Catalog), includes naming conventions and examples.


Luckily, the perfect data strategy has yet to be found!

Nugget by Isa Oxenaar

During the summer of 2023, I participated in the Data Summer School organized by Carruthers & Jackson. The weekly sessions focus on how to be, or become, a good chief data officer. One of the latest sessions highlighted the fact that good data management is still relatively new to a lot of larger companies. Hearing that, to date, no one has completely figured out what the perfect data strategy is creates headspace for one's own creativity and thoughts on a successful data strategy.

Looking at some of the recurring problems while developing a new data strategy can serve as a skeleton for tailoring the strategy to a specific company. It is, for example, important for a #CDO to relate to ‘business as usual’ in some productive way, so that the changes thought out will actually be implemented. Dealing with ‘business as usual’ mostly means dealing with everything legacy: legacy data environments, IT departments, systems, business processes, and legacy transformation processes.

The Carruthers & Jackson way of dividing the new strategy into three parts is listed below:

1. Urgent Data Strategy (UDS): the one that extinguishes present fires, tackles high-profile problems, and starts laying a foundation.

2. Immediate Data Strategy (IDS): This is the tactical approach to deliver support for business as usual, gain quick wins and temporary fixes, and prepare for the second part of the data strategy. This strategy focuses on stability, existing data initiatives and data governance, security, exploitation and performance.

3. Target Data Strategy (TDS): The strategic approach. Once the immediate data strategy is in play, the CDO needs to be preparing the organization for the changes that are coming with the implementation of the target data strategy. This strategy entails goals and a vision. The focus lies on adding value.


MetaDAMA 2#11: Data Governance in a Mesh

Nugget by Winfried Adalbert Etzel

Data Mesh promises so much, so of course everyone is talking about it.

I had the pleasure of chatting with Karin Håkansson about Data Governance in a Mesh. Karin has worked with Data Governance and Data Mesh and is really active in the Data Mesh community, speaking on podcasts and moderating the LinkedIn group 'Data Governance - Data Mesh'.

Here are my key takeaways from our conversation:

Data in retail

  • The culture in retail is about innovation, experimentation, and new products, so governance has to adapt to this environment in order to be successful.
  • If retail did what we do in data, a fashion retailer would sell yarn instead of t-shirts.
  • Retail knows what the customer wants before the customer wants it. What would happen if we in data thought like retailers?
  • It is more about understanding the business better than making the business data literate.

Data Governance

  • The data governance best practices in the DMBOK are still relevant, also in a Data Mesh setting.
  • Data governance has been on a journey from compliance driven to business value driven.
  • Centralized data governance creates a bottleneck. Decentralized governance creates silos. So federated data governance is the middle ground.
  • Create incentives to create trust.
  • If you utilize your platform correctly, you can have high expectations towards computational governance.

Data Mesh

  • Data Mesh comes with a cost - you need to invest in Data Mesh.
  • But more than anything, Data Mesh implementation is an enormous change effort.
  • If you do not know why you are doing Data Mesh, you will end up implementing something else.
  • Implement Data Mesh in an agile way: start small, fail fast and iterate.
  • To start with Data Mesh, work with a business team that is eager to get started and sees the benefits. You have to have business onboard, otherwise it is not going to work.
  • Always check if you get the value that you expected.
  • When you do it, make sure you get governance, business and tech teams to work together and are aligned on the why.
  • Make sure to upskill for Data Mesh - it is fundamentally different: talk about it, have debates, run book clubs.
  • The four elements of Data Mesh: can you implement them in sequence, or should you look at them as a unit to implement within a limited scope?
  • Start finding ways for people to work together, e.g., a common goal and an environment where it is fine to share.
  • A good first step is to find an example of data with a certain issue or limitation and talk with the business user about exactly this.
  • Data Governance, as much as Data Mesh, is about change management: You need to get close to the business and collaborate actively.
  • Your first two steps should be:

  1. start with one business unit, an early adopter.
  2. find their most critical data and talk about actual data.

MDM and Data Mesh

  • Are we still hunting for that golden record? How do we work with MDM in a mesh? This is not solved yet.
  • You can refer to data instead of collecting data in an MDM system.
  • Maybe the best approach so far is global IDs to track data across domains, but how you link your data might become the new MDM.
  • You still need to connect the data, but you do not need to collect the data. That is MDM in a Mesh.

Domains, federation and responsibility

  • If you federate responsibility to the domains, they also need the resources and competency to fulfill these responsibilities.
  • If the domain data teams are successful in abstracting the complexity, it will become easy to create data products.
  • If you scale too fast (faster than your data platform), you might end up having to duplicate teams.

You can listen to the podcast here, or on any of the common streaming services (Apple Podcast, Google Podcast, Spotify, etc.).


Thank you for reading this edition of Data Nugget. We hope you liked it.

Data Nugget was delivered with vision, zeal, and courage from the editors and the collaborators.

You can visit our website here, or write to us at [email protected].

I would love to hear your feedback and ideas.

Nazia Qureshi

Data Nugget Head Editor
