The Importance of Data Governance for Artificial Intelligence
Photo credit: Bing Image Creator

by Faye Murray (EMBA), Chief Data Officer, Emrys Group

In today's data-driven landscape, artificial intelligence (AI) has emerged as a transformative force, revolutionising industries, creating efficiencies, and driving innovation across sectors. However, as a recent OpenAI communication makes clear (full credit to James Betker), the success of any AI model is contingent on the quality and integrity of the dataset it is trained on:

“It implies that model behavior is not determined by architecture, hyperparameters, or optimizer choices. It’s determined by your dataset, nothing else.”

An online search for data governance and AI generates a succession of articles focussing on supporting organisational data governance efforts through deploying AI tools and solutions. Yet there is little on the central role data governance plays regarding enterprise AI, despite it being the cornerstone for the effective implementation and management of AI systems. In this article I delve into why data governance is indispensable for AI and how organisations can utilise it to unlock the full potential of their AI investments.

To find out more about how AI and IoT Solutions offered by Emrys can add value to your business, visit: https://emrys.group/ai-iot-solutions/

What is Data Governance?

A holistic definition of data governance is all the activities, processes and technology implemented to ensure data is secure, private, clean, accurate, available, up-to-date, and usable. Understanding the full data lifecycle at an organisation means promoting transparency and documentation around data collection and processing methods; recording and monitoring the provenance, quality and integrity of data; and understanding and documenting the uses, biases and limitations of data organisation-wide. Data governance initiatives and frameworks are, at their core, about organisational and cultural transformation, which is why concepts like data ownership and stewardship are key, as is executive-level support.

Data governance can be boiled down to four key questions about the data:

  1. Do we know what is happening?
  2. Is it safe?
  3. Is it trustworthy?
  4. Can it be leveraged?

As data becomes increasingly central to business activities, whether because of its applications or the insights it can provide, organisations are recognising the importance of investing in data governance and good data practices. Data governance is important for all data operations and analytics, strategy, and decision-making. However, AI is especially sensitive to poorly governed data, and without addressing data governance, the potential offered by AI will remain mostly untapped.

To find out more about Data Management at Emrys, visit: https://emrys.group/big-data/

Why Data Governance is Essential for Effective AI

1. Data Quality Assurance

The value of an AI system depends heavily on the quality and volume of the data it is trained on and learns from. A common convention when building AI systems is the 80:10:10 rule, where:

  • 80% of the dataset will be used to train the AI model – the model learns to recognise patterns, correlations, and relationships within the data, enabling it to make accurate predictions or classifications.
  • 10% will be used for validation (tuning model hyperparameters and evaluating performance during development).
  • The final 10% will be used to test the trained model. This allows the model's generalisation ability to be tested, along with its effectiveness in making predictions on new, unseen data.
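
The 80:10:10 split described above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the function name `split_dataset`, the fixed seed and the 1,000-record example are assumptions made for the sketch, not part of any particular framework.

```python
import random

def split_dataset(records, train=0.8, validation=0.1, seed=42):
    """Shuffle records and split them into train, validation and test sets."""
    if not 0 < train < 1 or not 0 <= validation < 1 or train + validation >= 1:
        raise ValueError("train and validation fractions must leave room for a test set")
    shuffled = list(records)               # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_train = int(len(shuffled) * train)
    n_val = int(len(shuffled) * validation)
    return (
        shuffled[:n_train],                 # training set
        shuffled[n_train:n_train + n_val],  # validation set
        shuffled[n_train + n_val:],         # test set (whatever remains)
    )

# Hypothetical dataset of 1,000 record IDs.
train_set, val_set, test_set = split_dataset(range(1000))
print(len(train_set), len(val_set), len(test_set))  # 800 100 100
```

Recording the seed and the split proportions alongside the dataset is itself a small governance practice: it makes the split reproducible and auditable.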

As a model is completely reliant on the data on which it has been trained, low-quality, low-integrity datasets lead to inaccurate models. By maintaining data quality and integrity as part of data governance programmes, organisations can enhance the performance and predictive capabilities of their AI systems, driving better outcomes and insights.
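
As a rough illustration of what maintaining data quality can mean in practice, the sketch below profiles a dataset for missing values and exact duplicate rows before it is used for training. The `profile` function, the field names and the sample records are all hypothetical; a real data governance programme would cover many more checks (validity ranges, freshness, provenance and so on).

```python
def profile(records, required_fields):
    """Summarise missing values per field and count exact duplicate rows."""
    missing = {
        field: sum(1 for r in records if r.get(field) in (None, ""))
        for field in required_fields
    }
    seen, duplicates = set(), 0
    for r in records:
        key = tuple(r.get(field) for field in required_fields)
        if key in seen:
            duplicates += 1   # an identical row has already been seen
        else:
            seen.add(key)
    return {"rows": len(records), "missing": missing, "duplicates": duplicates}

# Hypothetical customer records with one duplicate row and some gaps.
records = [
    {"customer_id": 1, "country": "UK", "spend": 120.0},
    {"customer_id": 2, "country": "", "spend": 80.5},
    {"customer_id": 2, "country": "", "spend": 80.5},
    {"customer_id": 3, "country": "FR", "spend": None},
]
report = profile(records, ["customer_id", "country", "spend"])
print(report)
```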

2. Mitigating Model Bias

The dataset an AI model is trained on must be representative, otherwise the AI model may perpetuate and exacerbate any biases which exist within the dataset. Unrepresentative, noisy or insufficient data also contributes to two common modelling failures: ‘overfitting’ and ‘underfitting’.

Overfitting

Overfitting is when the model learns irrelevant, spurious patterns or ‘noise’ in the training data, rather than the underlying patterns that generalise well to new, unseen data. As a result, the model performs well on its training data but poorly when introduced to unseen data. Robust data governance practices, policies and procedures help detect and mitigate biases within the dataset before it is used to train the AI model: ensuring data quality (minimising errors and inconsistencies in the data), detailing the provenance of data, encouraging collaborative data sharing, and profiling datasets. Mitigating bias before model training entails augmenting the dataset (adding more data and data points) so it is more realistic and representative. Other methods, applied during or after training, include appropriate sampling and regularisation techniques, alongside re-weighting strategies and debiasing algorithms.
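
One simple re-weighting strategy is to give each class a weight inversely proportional to how often it appears, so that under-represented classes are not drowned out during training. The sketch below uses the common "balanced" heuristic, weight = n_samples / (n_classes × class_count); the function name and the loan-decision labels are hypothetical and chosen only for illustration.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Return a weight per class: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

# Hypothetical, heavily imbalanced labels: 90 approvals and 10 declines.
labels = ["approved"] * 90 + ["declined"] * 10
weights = balanced_class_weights(labels)
print(weights)
```

With these weights, each class contributes the same total weight to training (90 × 0.56 ≈ 10 × 5.0 = 50), which is the point of the heuristic. It does not fix a dataset that is unrepresentative in other ways.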

Underfitting

Underfitting occurs when the model is too simple to capture the underlying patterns or correlations in the data. This means the model can’t make accurate classifications and predictions, and therefore performs poorly on both training data and new, unseen data, leading to limited utility in real-world applications. Biases in the underlying dataset can exacerbate underfitting as the model is unable to capture the complexity of underlying patterns due to unrepresentative or insufficient data. The solutions to underfitting can be similar to those detailed for overfitting, reinforcing the centrality of data governance in creating representative, accurate and high-performing AI models.
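
The difference between overfitting and underfitting shows up when training error is compared with error on held-out data. The sketch below is a toy illustration on synthetic data (a noisy straight line, generated here purely as an assumption for the example). It fits three deliberately simple models: one that is too simple, one that matches the data, and one that memorises the training set.

```python
import random

def mse(model, points):
    """Mean squared error of a model over (x, y) pairs."""
    return sum((model(x) - y) ** 2 for x, y in points) / len(points)

def fit_mean(train):
    """Too simple: always predicts the average training target."""
    mean_y = sum(y for _, y in train) / len(train)
    return lambda x: mean_y

def fit_line(train):
    """Ordinary least squares for a straight line."""
    n = len(train)
    mean_x = sum(x for x, _ in train) / n
    mean_y = sum(y for _, y in train) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in train)
             / sum((x - mean_x) ** 2 for x, _ in train))
    return lambda x: mean_y + slope * (x - mean_x)

def fit_nearest(train):
    """Memorises the training data: returns the target of the closest training x."""
    return lambda x: min(train, key=lambda p: abs(p[0] - x))[1]

# Synthetic data: y = 2x plus Gaussian noise (an assumption for this sketch).
rng = random.Random(0)
points = [(i / 10, 2 * (i / 10) + rng.gauss(0, 1)) for i in range(200)]
rng.shuffle(points)
train, validation = points[:160], points[160:]

errors = {}
for name, fit in [("mean", fit_mean), ("line", fit_line), ("nearest", fit_nearest)]:
    model = fit(train)
    errors[name] = (mse(model, train), mse(model, validation))
    print(f"{name:8s} train={errors[name][0]:.2f} validation={errors[name][1]:.2f}")
```

The mean model has high error on both sets (underfitting). The nearest-neighbour model has zero training error, because it has memorised the noise, but a non-zero validation error (the signature of overfitting). The straight line has a far lower error than the mean model on both sets.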

3. Assisting the Scaling and Adoption of AI Organisation-wide

As noted previously, properly implemented data governance frameworks enable organisations to efficiently manage the large volumes of data required for successful AI model development and deployment. Organising, categorising and cataloguing data also makes it easier to scale AI initiatives across the organisation. However, there is another aspect of data governance which makes the introduction of enterprise AI much more likely to be successful: central to data governance is promoting a strong data culture which treats data as a strategic asset whilst breaking down data silos and fostering collaboration between various stakeholders including IT, data engineers, data scientists, data analysts, legal and compliance teams, domain experts and more. It’s exactly this kind of collaborative working and culture which can ensure that AI initiatives are successful, aligning with business goals whilst complying with regulations and meeting the requirements of various stakeholders.

4. Supporting Compliance, Regulation and Trust

Data governance plays a multifaceted role in supporting compliance, regulation, and trust in AI systems, while concurrently protecting privacy and minimising reidentification risk. Properly instituted data governance programmes promote transparency and the provision of documentation around organisational data collection, processing and usage, including how data is handled throughout the AI lifecycle. Additionally, data governance frameworks include mechanisms for accountability, access and permissions control, and auditing, which help ensure adherence to regulatory standards like GDPR whilst mitigating the risk of non-compliance. Data governance also advocates for data minimisation, anonymisation and pseudonymisation. Accordingly, the risk of privacy breaches and unauthorised access is minimised, as the exposure of sensitive data is limited. This is particularly relevant to minimising reidentification risk for individuals or organisations whose data have been used in the development or deployment of AI applications. Data governance involves the continuous monitoring of AI systems and data usage to detect and mitigate reidentification risks in near real-time, thereby safeguarding individual privacy.
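
Two of the ideas above, pseudonymisation and monitoring reidentification risk, can be sketched with the Python standard library. The example replaces a direct identifier with a keyed hash (HMAC-SHA256) and then measures the size of the smallest group of records sharing the same quasi-identifiers, a simple k-anonymity style check. The key, field names and records are hypothetical, and this is an illustration rather than a compliance-grade implementation.

```python
import hashlib
import hmac
from collections import Counter

# Hypothetical key for illustration; in practice it would be held in a secrets manager.
SECRET_KEY = b"replace-with-a-managed-secret"

def pseudonymise(identifier: str, key: bytes = SECRET_KEY) -> str:
    """Replace a direct identifier with a keyed hash (HMAC-SHA256)."""
    return hmac.new(key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()

def smallest_group(records, quasi_identifiers):
    """Return k: the size of the smallest group sharing the same quasi-identifier values."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values())

# Hypothetical records containing a direct identifier and two quasi-identifiers.
records = [
    {"email": "a@example.com", "age_band": "30-39", "region": "North"},
    {"email": "b@example.com", "age_band": "30-39", "region": "North"},
    {"email": "c@example.com", "age_band": "40-49", "region": "South"},
]
safe = [{**r, "email": pseudonymise(r["email"])} for r in records]
k = smallest_group(safe, ["age_band", "region"])
print(k)  # 1: one person is unique on (age_band, region)
```

Here k is 1 even after pseudonymisation, which shows why removing direct identifiers is not enough on its own: the combination of age band and region still singles out one individual.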

5. Ethical AI Model Development and Deployment

Overseeing any data governance programme is a Data Governance Council or Data Governance Board, which plays a critical role in promoting and ensuring ethical AI model development and deployment. This is achieved through defining ethical guidelines which encompass principles of fairness, transparency, accountability and more during all stages of the AI lifecycle. The Data Governance Council/Data Governance Board should also institute an ethical review process to assess the potential ethical implications of AI models before their development and deployment. This includes stakeholder consultation and ethical impact assessments, after which decisions are made and recommendations provided.

Black Box vs White Box Models

White box models are ‘transparent’ and allow you to understand exactly how decisions and predictions are made; a common example is a decision tree. As the name suggests, black box AI models are opaque and often more complex than white box AI models; neural networks are one example. Black box AI models can generate accurate predictions; however, it is not possible to unpick exactly how those predictions have been arrived at, meaning that data governance is especially important. Why? Because it documents the data and processes behind the model, and provides assurance that data quality issues and bias have been detected and addressed and that potential ethical issues have been carefully considered. As such, data governance helps stakeholders understand and trust the model’s outputs. Beyond this, black box models can inadvertently reveal sensitive information contained in the training data, posing privacy risks at the individual and organisational levels; this is a particular risk for LLMs. Including data governance as part of the black box AI model development and deployment process affords the opportunity to implement privacy-enhancing technologies like homomorphic encryption, differential privacy or federated learning.
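
To give a flavour of one of these privacy-enhancing technologies, the sketch below applies the Laplace mechanism, a standard building block of differential privacy, to a simple counting query. The function name, the `epsilon` value and the list of ages are assumptions made for illustration; production systems should use a vetted differential privacy library rather than hand-rolled noise.

```python
import random

def private_count(values, predicate, epsilon, rng=random):
    """Noisy count via the Laplace mechanism (a counting query has sensitivity 1)."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    true_count = sum(1 for v in values if predicate(v))
    # The difference of two exponential draws with rate epsilon is
    # Laplace-distributed with scale 1 / epsilon.
    noise = rng.expovariate(epsilon) - rng.expovariate(epsilon)
    return true_count + noise

# Hypothetical ages; the query asks how many people are over 40.
ages = [25, 41, 52, 38, 60]
noisy = private_count(ages, lambda a: a > 40, epsilon=0.5, rng=random.Random(7))
print(round(noisy, 2))
```

A smaller `epsilon` adds more noise and gives stronger privacy at the cost of accuracy; choosing and documenting that trade-off is the kind of decision a data governance framework would own.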

If you would like to find out more about the consultancy services Emrys offers around Data Strategy, AI and Data Governance, then please email [email protected]
