Top 10 Data Governance Anti-Patterns for Analytics
At the Microsoft Technology Center (MTC) we talk to a lot of data leaders who are struggling to apply their traditional data governance approaches in the Age of Advanced Analytics. Formerly tried-and-true governance approaches no longer work, and they can't quite put their finger on why. Analytics has changed and data governance has not kept pace. Microsoft and the MTC have patterns and processes that can help you modernize your governance and achieve your Digital Transformation goals.
What is Data Governance?
There are entire books on data governance that you should refer to if you need a refresher. Data governance is the overall management of the availability, usability, integrity, and security of an enterprise's data assets. Most companies will have a data governance body that sets the rules, defines the procedures, and ensures the various data teams execute on the plan.
The Big Problem with Data Governance
Data has changed radically over the past few years and governance best practices have not kept up. Years ago we only had operational systems, where the notions of governance were easy and not too controversial. For example: few users had the ability to query the data directly for fear of bringing the system down, therefore the default posture was to DENY access. When we built the first generation of analytical systems (namely, data warehouses) the governance was only slightly more complex. Users now wanted the ability to query the data themselves, which usually meant exporting the data to Excel because we still feared that a user might issue a rogue query and bring down the analytics system. Every company governed these access requests differently based on its culture ... some trusted their users and allowed access, others did not.
Today, data sources have exploded and business users are demanding Self-Service Analytics. They want to look at data on their own terms and do Exploratory Data Analytics without needing IT to broker access. Data lakes and lakehouses have exploded as well, which further encourages users to be curious about their data assets. With a data lake, users merely wire up their own compute instance (which might be a data warehouse, Spark, Power BI, or Excel) to the data asset and do whatever they want. Dashboarding and visual query tools like Power BI are easy to use, and more business users are comfortable writing a modicum of SQL, the lingua franca of data.
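This "wire up your own compute" idea is easier to see with a small sketch. Everything below is illustrative, not prescriptive: the storage account, container, and folder names are made up, and I'm assuming a Spark environment already configured to reach the lake. The point is that the data sits in the lake once and any authenticated engine reads it in place.

```python
# Minimal sketch: a user attaches their own Spark compute to a lake path.
# Account, container, and folder names are placeholders; Spark is assumed to be
# configured with connectivity to the storage account.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("byoc-sketch").getOrCreate()

sales = spark.read.parquet(
    "abfss://curated@contosodatalake.dfs.core.windows.net/sales/2023/"
)
sales.groupBy("region").sum("amount").show()
```

The same path could just as easily be opened from Power BI or a SQL engine; the governance question becomes "who can read this folder", not "which tool are you using".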
Data governance teams have not kept pace. They are not accustomed to granting access to data for experimentation. They want to know what the data will be used for. Their default posture is to limit access to data. It certainly makes sense to keep business users from having direct query access to operational systems that likely contain PHI and PII, but why keep a business user away from valuable data that could be monetized?
Too many business people equate "data governance" with being told NO. If that's your company, you need to start changing that culture NOW. Companies that CANNOT leverage their data assets to make data-driven decisions will soon find themselves marginalized as their competition embraces the promise of Digital Transformation. Data governance should enable innovation, not be its roadblock.
Data Governance Anti-Patterns
Here are some common Data Governance Anti-Patterns we see at the MTC, and our recommendations for solutions.
Anti-Pattern 01: Data governance as an IT function
Is your data governance team composed solely (or even mostly) of IT personnel? If so, I'll wager your data governance is not aligned with your business goals. A company's data is not OWNED by IT; it is owned by the business. IT is merely the steward. Business units should have representation on all governance committees. Data governance is multi-disciplinary and should be an enabler of business value.
Data governance is not a function of IT. It should be driven by the business in support of company objectives.
Anti-Pattern 02: Relying on a Tools-based Governance Approach
Too many times I hear "we want to use Azure Purview to do data governance". Azure Purview is great for that. The problem is that you've given me a TECHNOLOGY answer. When I probe deeper into the company's strategic goals for its data assets, I don't get clear answers. And I never hear: "Well, we talked to our business users and what they would really like to see is...".
The data governance goals should drive the governance tool choices, the tool should not drive the governance goals.
Data governance is a journey, not a destination
Anti-Pattern 03: Putting PII/PHI/sensitive data in the lake (therefore necessitating governance)
Data lakes are the best and most often used tools for analytics. But I see technologists copying EVERY piece of operational data to the lake. Why? I can see no good reason to put credit card numbers or SSNs in a data lake. They serve no analytical purpose. The more copies you make of sensitive data, the greater your risk of a breach.
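One way to keep this data out of the lake is to drop or irreversibly hash the sensitive columns during ingestion. The sketch below is illustrative only: the column names, paths, and hashing choice are assumptions, and note that a hash of an SSN is pseudonymization (a stable join key), not full anonymization.

```python
# Hedged sketch: strip sensitive fields before data lands in the lake.
# Column names and lake paths are placeholders.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("deidentify-on-ingest").getOrCreate()

raw = spark.read.parquet("abfss://landing@contosodatalake.dfs.core.windows.net/customers/")

safe = (
    raw.drop("credit_card_number")                             # no analytical value: drop it
       .withColumn("customer_key", F.sha2(F.col("ssn"), 256))  # keep a stable, non-reversible join key
       .drop("ssn")
)

safe.write.mode("overwrite").parquet(
    "abfss://raw@contosodatalake.dfs.core.windows.net/customers/"
)
```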
I've heard data leaders say they need some sensitive data in the lake so the lake can be "the single source of truth" and support operational reporting that requires this data. If that's the case, then you have two choices:
Anti-Pattern 04: Using the same governance rules for all lake zones and analytics datasets
Data lakes (or whatever you use for your analytics sandbox) have "zones". Everyone calls the zones by different names -- "landing", "raw", "curated", "bronze", "silver", "gold". There are two key reasons why we have zones. Based on the zone name we should implicitly know:
Different areas of the data lake should have different governance approaches.
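One way to make "different zones, different governance" concrete is to write the posture down as a simple policy map. The zone and group names below are assumptions, not a standard; the point is that each zone gets its own default rather than one blanket rule for the whole lake.

```python
# Illustrative policy map: each zone has its own read/write posture.
# Zone and group names are placeholders for whatever your lake uses.
zone_policy = {
    "landing": {"read": ["ingestion-jobs"],              "write": ["ingestion-jobs"]},
    "raw":     {"read": ["data-scientists"],             "write": ["ingestion-jobs"]},
    "curated": {"read": ["data-scientists", "analysts"], "write": ["transform-jobs"]},
    "gold":    {"read": ["all-analytics-users"],         "write": ["transform-jobs"]},
    "sandbox": {"read": ["sandbox-owner"],               "write": ["sandbox-owner"]},
}
```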
Self-service analytics REQUIRES a more "open" approach to data access. There is no other way.
Anti-Pattern 05: Insisting on "Early" Governance
Sometimes when I help customers with a difficult analytics problem I'll have someone interrupt me and want to discuss the data governance implications of the outputs of our work. Said differently, they want to control the data we are trying to produce BEFORE WE'VE EVEN PRODUCED IT.
This is the wrong approach.
The better approach is to allow the analytics team to find "the nuggets of gold" first, and solve the business problems. Then, as part of a review process, let's have a thoughtful conversation about how we should govern our new insights.
I call this model "late governance". We are deferring all governance decisions until we are sure we have generated a valuable business insight. I'll say it again, data governance should not stifle innovation.
"Early" data governance is diametrically opposed to self-service analytics, which is a business goal for every company I talk to.
Similarly, most analytics projects I work on will require the team to ingest new datasets. Proponents of "early governance" will insist that this new data be controlled, before we've even experimented to see if it provides lift. This stifles the experimentation process.
Anti-Pattern 06: Following a "One-Size-Fits-All Personas" Policy
Some analytics personas will want access to the most raw, dirty data so they can spot trends and anomalies. Think: data scientists. They will want access to the "raw" or "bronze" area of the data lake. But this data would be overwhelming to, and easily misinterpreted by, other personas (like business analysts). Let those folks see that the data is available (via the catalog), but make them request access and explain their use case, because you want to understand why they want access to raw, possibly confusing, data. Other personas need "self-service" analytics but don't really have SQL skills. These users should have access to summarized and "certified" data that they can use in Power BI or Excel; they would have access to the "gold" and "platinum" zones, which map closely to the semantic tier and data warehouse facts and dims. "Business analyst" personas that understand SQL might need access to more granular, yet "clean", data. This maps closely to the "silver" or "curated" zones in most data lakes.
This structure allows far more flexibility. We give everyone access to what we think they need based on their persona, and they always have visibility into ALL data through the catalog and can request additional access if needed.
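A rough sketch of those persona defaults might look like the following. The persona and zone names are invented for illustration; nothing here is prescriptive, it just shows that defaults differ by persona while everything stays discoverable.

```python
# Hedged sketch: per-persona default access, with everything else visible in the
# catalog and available on request. Persona and zone names are illustrative.
persona_defaults = {
    "data_scientist": {"default": ["raw", "bronze", "silver"]},
    "sql_analyst":    {"default": ["silver", "curated"]},
    "business_user":  {"default": ["gold", "platinum"]},
}

def access_decision(persona: str, zone: str) -> str:
    """Grant the persona's defaults; everything else is catalog-visible and requestable."""
    if zone in persona_defaults[persona]["default"]:
        return "granted by default"
    return "visible in catalog; grant on request with a stated use case"

print(access_decision("business_user", "raw"))
# -> visible in catalog; grant on request with a stated use case
```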
Never force your data scientists to use data warehouse data (facts and dims). It almost never works for their needs. They'll get frustrated and you will stifle their innovation. Data lakes were originally built to store the data these personas need to do their jobs. Since then, everyone else has seen the value of using the lake for analytics, where the data is easier to manipulate than traditional facts and dims.
Anti-Pattern 07a: Denying write access to users
One of the goals of a data lake is to be "an analytics sandbox". This means users will need to make copies of data for experimentation... they'll need to semantically enrich that data (usually via SQL) and save copies of it somewhere they can reference and continue to build upon as they search for business value. To do that they need write access to the sandbox area of the lake. The sandbox is very much like a temp table in SQL Server.
Human users, whether business analysts or data engineers, should never have write access to ANY area of the lake except the sandbox. The remainder of the lake should only be writable by the scheduler/job account. Not even the Ops Team should be allowed to write to the lake. This is a core lake governance concept.
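If your lake is ADLS Gen2, one way to express this is with POSIX-style ACLs on the zone folders. The sketch below uses the azure-storage-file-datalake SDK; the account name, container, folder names, and object IDs are placeholders, and default/recursive ACLs are omitted for brevity, so treat it as a starting point rather than a complete implementation.

```python
# Hedged sketch: humans write only to the sandbox; the job account writes everywhere else.
# All names and object IDs below are placeholders.
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

service = DataLakeServiceClient(
    account_url="https://contosodatalake.dfs.core.windows.net",
    credential=DefaultAzureCredential(),
)
fs = service.get_file_system_client("lake")

JOB_SP_ID = "00000000-0000-0000-0000-000000000000"       # scheduler / job service principal
ANALYSTS_GROUP = "11111111-1111-1111-1111-111111111111"  # AAD group for human users

# Curated zone: the job account writes, analysts read, no other access.
fs.get_directory_client("curated").set_access_control(
    acl=f"user::rwx,group::r-x,mask::rwx,other::---,"
        f"user:{JOB_SP_ID}:rwx,group:{ANALYSTS_GROUP}:r-x"
)

# Sandbox zone: analysts may write their own experiments here.
fs.get_directory_client("sandbox").set_access_control(
    acl=f"user::rwx,group::r-x,mask::rwx,other::---,group:{ANALYSTS_GROUP}:rwx"
)
```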
In the past the DBA never gave write access to the warehouse to common users. This made analytics very difficult since most analysts are not able to do all of the data enrichment they need in a single SQL statement. This is why it was always easier for a user to export the data to Excel and do the analytics there.
Anti-Pattern 07b: Allowing Shared Sandboxes
While we want users to be able to have a writable sandbox, we do NOT want them to have a shareable sandbox. Why?
Imagine your analyst, Annie, found a valuable business insight and has it persisted in her sandbox. She tells the Operations team she would like to move it to the data warehouse so it can eventually be added to a Power BI dashboard. The governance team finds out and wants to hold various meetings to understand the data better. The data warehouse team says they'll need 3 months to integrate the data into the fact table. The dashboarding team wants to meet with Annie to understand...ugh...isn't this exhausting? Annie found something valuable and she's being punished. What does she do? She shares the data in her sandbox with her team...and likely later she shares it with everyone via Power BI. Now the data is "published" in various dashboards and has zero governance. That's not good.
Make people follow the governance process by not allowing ad hoc sharing. If users complain about stifled innovation then that means your governance process needs an overhaul. FIX IT!
You've probably heard the old aphorism: Without governance the data lake soon turns into a data swamp. Not true. If you follow all of the above advice there is NO WAY you'll ever have a swamp. But make sure your processes aren't so onerous that you are a drag on innovation.
How can you spot an organization that struggles with data governance? They have obviously inefficient business processes.
Anti-Pattern 08: Not Managing By Exception
Business users will find all kinds of valid reasons to do things that violate governance rules. It's OK to make exceptions to your governance plan.
Here's an example I've seen many times: the governance team mandates that only Spark, Synapse, and other "approved" tools can be used to query the data lake. But eventually a department will purchase a query tool they want to use with THEIR data. Too many organizations will disallow this. Why? It really doesn't make any sense. The compute engine (i.e., the "tool") is not important; the data is.
With a data lake, the model should be "Bring Your Own Compute" (BYOC). Use the tool you want to use.
This is just one example. Governance teams need to be more open to making exceptions.
In so many companies I work with, the users equate "data governance team" with NO. Don't let this be your company.
Anti-Pattern 09: Putting the Permissions at the Compute Layer
Always do the ACLing in the data lake itself. This allows users to BYOC (see above). The tool no longer matters: the user simply logs in, the tool passes the credential through to the lake, and the lake determines access.
Never apply the permissions at the compute tier, do all permissioning strictly at the data persistence tier...the lake.
Anti-Pattern 10: Mandating All Data Be in the Official Corporate Data Lake
This never works. Why? For the same reasons departmental data marts sprang up 20 years ago in spite of the corporate data warehouse: the business can't wait for IT, so it builds its own solutions. We used to call these "skunkworks" projects or "Shadow IT". These projects solve the immediate business problem, but at the long-term expense of data governance, creating more and more data silos. (This is why you need to allow business units to have their own sandbox in YOUR data lake, rather than watching those teams build their own data lakes out of frustration.)
Every customer I talk to has multiple data lakes ("ponds?"). Most are structured around business units. My advice is not to stifle this innovation simply to conform to a corporate governance standard. Instead, assist these business units with their governance efforts by providing the tools and templates you are using for the corporate lake. Help them, don't hinder them.
Data governance is not a "project". Data is constantly changing, and so is the data management field. Data governance should be viewed as an ongoing corporate "program".
How the MTC can help
Is your data governance team embracing modern analytics notions like self-service analytics, "late" governance, persona-based access, writable sandboxes, bring-your-own-compute, and managing by exception?
Data governance is a bigger problem today than ever. Companies are using data for more novel use cases, and regulators are taking notice. But your governance strategy still has to foster innovation. Your company's view of its data estate and its risk tolerance has likely evolved in the last few years. Self-Service Analytics is impossible to achieve with the outmoded governance anti-patterns I've outlined above. This is a delicate balancing act that the MTC understands very well. If you want to achieve Digital Transformation then you must realize these are issues of culture.
The Microsoft Technology Center is a service that helps customers on their Digital Transformation journey. We know that successful data governance efforts are less about the technology and more about modern processes...and people. Data governance is changing to support dual mandates of heightened regulatory burdens and self-service initiatives. At the MTC, we've been doing this for years. We are thought leaders, conference speakers, and former consultants and executives. We've learned the patterns that will help you transform your governance programs. And with the Azure cloud and our governance technologies like Azure Purview, we can execute in hours-to-days instead of months.
Does this sound compelling? SUCCESS for the MTC is solving challenging problems for respected companies and their talented staff. Does that sound like folks you can trust with your data? The Digital Transformation is here, and we know how to help. Would you like to engage?