Data Governance
Introduction
Technological advancements have ushered in an era where pivotal decisions are increasingly automated, driven by sophisticated software algorithms that leverage machine learning, data analytics, and artificial intelligence (AI). This paradigm shift towards data-driven systems has exposed gaps between conventional governance processes and the realities of software-mediated decision-making. As these intelligent systems continue to pervade various aspects of our lives, from approving credit applications to assessing criminal risk profiles, there is a pressing societal interest in ensuring their ethical governance and fostering accountable algorithms.
The Transparency Conundrum
The dominant discourse within legal and policy circles has moved beyond the notion that mere transparency can resolve the challenges posed by automated decision systems. Disclosing source code is neither a prerequisite for effective oversight nor a guarantee of public comprehension and participation in governance. Moreover, transparency is often objectionable to entities that profit from proprietary methods not protected by patents or copyrights. Paradoxically, detailed system knowledge can facilitate adversarial activities like gaming or exploiting vulnerabilities.
Crucially, the pivotal role of data in machine learning, data analytics, and AI systems implies that source code disclosure alone is insufficient to reveal their inner workings fully. The processes of data collection, normalization, exploration, and cleaning significantly influence system functionality, underscoring the need for a more holistic approach to transparency.
Data Governance: A Multifaceted Responsibility
Businesses deploying data-driven systems, from rudimentary descriptive analytics to cutting-edge deep learning models, must navigate a complex landscape of data governance requirements. Key considerations include information security and privacy, minimizing data collection and retention, the risk of re-identification, fairness and systematic bias, and mechanisms for human oversight and redress.
Failure to address these concerns can result in legal non-compliance, reputational damage, and erosion of consumer trust – risks no organization can afford to overlook.
Robust Data Governance Practices
To confidently navigate the treacherous waters of data governance, organizations must adopt smart policies and practices. This article outlines essential best practices for responsible data stewardship.
Prioritize Information Security and Privacy
A robust data governance strategy must prioritize strong information security and privacy measures. Any collected and retained data pose a risk of breach, necessitating a minimalist approach – limiting data collection to only what is essential and avoiding unnecessary retention.
Retained data should be securely encrypted, with access control mechanisms restricting sensitive customer information to authorized personnel with verified needs. Regular auditing and monitoring of sensitive data stores are crucial to ensure compliance and identify potential misuse.
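The access-control and auditing requirements above can be sketched as a small wrapper around a data store that checks the requester's role and logs every access attempt, allowed or denied. This is a minimal illustration, not a production design; the role names and record layout are invented for the example.

```python
import datetime

class AuditedStore:
    """Sketch of an access-controlled store for sensitive records.

    Every read attempt is appended to an audit log so that later
    monitoring can detect misuse. Roles and fields are illustrative.
    """

    def __init__(self, authorized_roles):
        self._records = {}
        self._authorized = set(authorized_roles)
        self.audit_log = []  # every access attempt, allowed or denied

    def put(self, key, value):
        self._records[key] = value

    def get(self, key, requester, role):
        allowed = role in self._authorized
        # Log before enforcing, so denied attempts are also captured.
        self.audit_log.append({
            "time": datetime.datetime.utcnow().isoformat(),
            "requester": requester,
            "role": role,
            "key": key,
            "allowed": allowed,
        })
        if not allowed:
            raise PermissionError(f"role {role!r} may not read {key!r}")
        return self._records[key]
```

A periodic job reviewing `audit_log` for denied attempts or unusual access patterns would implement the regular monitoring described above.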
Minimize Data Collection and Retention
When data retention is unavoidable, organizations should explore scrubbing or aggregating retained data to reduce sensitivity levels. For instance, instead of raw visitor logs, a website hosting company could retain visitor counts stratified by geographic region, device type, or network operator – information sufficient for most operational needs while minimizing privacy risks.
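The website-hosting example above can be made concrete: instead of retaining raw visitor records, the company keeps only counts stratified by region and device. The field names and sample records below are invented for illustration.

```python
from collections import Counter

def aggregate_visits(raw_log):
    """Collapse raw visitor records into counts keyed by (region, device),
    discarding IP addresses and any other per-visitor identifiers."""
    return Counter((v["region"], v["device"]) for v in raw_log)

raw = [
    {"ip": "203.0.113.7", "region": "EU", "device": "mobile"},
    {"ip": "198.51.100.2", "region": "EU", "device": "mobile"},
    {"ip": "192.0.2.9", "region": "US", "device": "desktop"},
]
counts = aggregate_visits(raw)
# `counts` preserves the operational signal (traffic by segment)
# while the identifying raw log can be discarded.
```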
Additionally, understanding data retention rationales can inform when data can be profitably discarded, anonymized, pseudonymized, or otherwise transformed to mitigate risks.
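One common transformation is pseudonymization via a keyed hash: the same user always maps to the same stable pseudonym, so analyses over time still work, but reversing the mapping requires the secret key, which is held separately. This is a sketch of the general technique, not a specific product's API.

```python
import hashlib
import hmac

def pseudonymize(user_id: str, secret_key: bytes) -> str:
    """Replace a direct identifier with an HMAC-SHA256 pseudonym.

    The pseudonym is deterministic for a given key, enabling joins and
    longitudinal analysis, but cannot be reversed without the key.
    Destroying the key later approximates anonymization of the records.
    """
    return hmac.new(secret_key, user_id.encode(), hashlib.sha256).hexdigest()
```

Note that pseudonymized data can still be re-identifying in combination with other attributes, which motivates the next section.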
Consider the Risk of Re-identification
Classifying data as personally identifiable information (PII) or non-sensitive is an oversimplification that fails to account for the risk of re-identification. Research has shown that seemingly innocuous data can become identifying when combined with other datasets or contextual information.
Responsible data governance must consider the potential for retained data to be re-identified, intentionally or inadvertently, and evaluate whether formal privacy-preserving techniques like differential privacy are necessary to safeguard sensitive information securely.
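As a taste of what a formal technique guarantees, the Laplace mechanism releases a count with calibrated random noise: a counting query changes by at most 1 when any one person is added or removed (sensitivity 1), so noise with scale 1/ε yields ε-differential privacy for the released value. This is an instructional sketch; production use should rely on a vetted differential-privacy library.

```python
import math
import random

def laplace_noise(scale: float) -> float:
    """Sample from Laplace(0, scale) via inverse-CDF sampling."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def dp_count(true_count: int, epsilon: float) -> float:
    """Release a count with epsilon-differential privacy.

    A counting query has sensitivity 1, so Laplace noise with scale
    1/epsilon suffices; smaller epsilon means stronger privacy and
    noisier answers.
    """
    return true_count + laplace_noise(1.0 / epsilon)
```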
Establish a Data Use Review Board
Data scientists must continually scrutinize the ethics of their methods and findings, prepared to forgo analyses that violate laws, privacy norms, contractual requirements, or consumer trust. To facilitate this, organizations should designate review boards empowered to approve or deny proposed data collection efforts, analyses, and novel uses of existing data.
These cross-functional boards should comprise stakeholders from data science, information security, legal, compliance, marketing, and other relevant domains. Their diverse perspectives can uncover potential issues and provide valuable insights into responsible data use.
Produce Data-Focused Social Impact Statements
Impact statements offer a structured process for investigating potential issues in data practices and provide a digestible view of the risks associated with data processing. Organizations concerned about the equity of data analysis should include such concerns and mitigations in their privacy impact assessments and consider producing similar social impact statements for data-driven systems and processes.
These statements enhance transparency about organizational risk acknowledgment and mitigation techniques without preemptively foreclosing specific activities. They can be used internally to convince leadership of an analysis's value and risk profile or published externally to engage consumer trust and solicit feedback from civil society groups.
Strive for Explainability and Ongoing Validation
To foster trust in data-driven processes, both data scientists and decision subjects must understand them. This understanding is supported by the ability to explain how a system operates, why it produced a particular outcome, and which factors most influenced that outcome.
Many explainable data analysis methods exist, and the question of what constitutes a useful explanation is an active research area. However, explainability is not a panacea; explanations must be supported by sufficient evidence, target the intended audience, and adequately engage with the task at hand to avoid lending credence to incorrect models.
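For one simple class of models, an explanation falls out of the arithmetic: a linear score decomposes exactly into per-feature contributions (weight × value), which can be ranked to show which factors drove a decision. The weights and feature names below are hypothetical.

```python
def explain_linear(weights, features, feature_names):
    """Decompose a linear model's score into per-feature contributions.

    For a linear scoring model, score = sum(w_i * x_i), so each term is
    an exact, additive explanation of that feature's effect.
    """
    contributions = {
        name: w * x for name, w, x in zip(feature_names, weights, features)
    }
    score = sum(contributions.values())
    # Rank by absolute impact so the dominant factors come first.
    ranked = sorted(contributions.items(), key=lambda kv: -abs(kv[1]))
    return score, ranked

score, ranked = explain_linear(
    weights=[0.5, -2.0, 0.1],
    features=[10, 1, 30],
    feature_names=["income", "missed_payments", "age"],
)
```

Complex models require approximate techniques instead, which is precisely where the caution in the next paragraph applies: an approximate explanation can lend false credence to a bad model.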
In addition to providing explanations, data-driven systems can be made more transparent through the disclosure of analysis methods and underlying datasets, when feasible. However, data release must be carefully considered, as datasets are often sensitive to their collection context and methodology, risking misuse when repurposed without proper context.
Facilitate Ongoing Auditing and Assumption Challenging
Interrogating a data-driven system's fidelity is an ongoing effort, driven by the twin risks of modeling error (unwarranted assumptions baked into data or normalization methods) and concept drift (changes in the world that invalidate assumptions).
Data scientists must continually validate their predictions and monitor system performance post-deployment. Auditing, especially by groups potentially affected by bias, is critically important for investigating unfairness. External audits by trusted academics, journalists, civil society groups, or the public can also be facilitated when appropriate.
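Post-deployment monitoring can start from something as simple as watching a monitored score's distribution drift away from its validation-time baseline. The sketch below uses a crude mean-shift check with an invented threshold; real deployments would use proper statistical tests (e.g. Kolmogorov-Smirnov or population stability index).

```python
def mean_shift_alarm(baseline, recent, threshold=0.2):
    """Flag possible concept drift when the mean of a monitored score
    moves more than `threshold` (absolute) from its baseline.

    Deliberately simplistic: a real monitor would use a statistical
    test and track multiple summary statistics.
    """
    base_mean = sum(baseline) / len(baseline)
    recent_mean = sum(recent) / len(recent)
    return abs(recent_mean - base_mean) > threshold
```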
Systems could be modified to support querying on synthetic data, demonstrating how outputs would change under hypothetical input variations. Stronger forms of testing, including white-box methods that consider a model's structure, should also be considered.
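The synthetic-query idea above can be sketched as a black-box probe: hold one hypothetical input fixed, vary a field at a time, and report whether the decision flips. The loan-style fields and the stand-in model below are invented for illustration.

```python
def probe(model, base_input, variations):
    """Query a black-box decision function with hypothetical variations.

    `model` is any callable taking a feature dict; `variations` is a list
    of (field, value) pairs. Returns the baseline output and, for each
    variation, the new output and whether it differs from the baseline.
    """
    baseline = model(base_input)
    report = []
    for field, value in variations:
        changed = dict(base_input, **{field: value})
        out = model(changed)
        report.append((field, value, out, out != baseline))
    return baseline, report

# Stand-in model: approve when income clears a threshold.
approve = lambda applicant: applicant["income"] >= 40000
base = {"income": 30000, "age": 40}
baseline, report = probe(approve, base, [("income", 50000), ("age", 25)])
```

Here the probe would reveal that income, not age, changes the outcome for this applicant, without requiring access to the model's internals.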
Identify and Mitigate Systematic Biases
Systematic bias can enter datasets and analysis methods at several levels, from how the problem is framed to how results are aggregated.
Data scientists define the problem, choose methods, measure success criteria, optimize for specific values, and select parameters – all context-specific decisions that can introduce bias.
It is particularly important to rule out discriminatory behavior in black-box models such as random forests and deep neural networks, which may implicitly reconstruct protected attributes as new classification features.
The way data are considered also matters; patterns that exist in aggregated groups may disappear or reverse when subgroups are analyzed separately (Simpson's paradox).
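Simpson's paradox is easy to demonstrate with a few numbers. In the invented approval counts below, group B outperforms group A within every subgroup, yet the aggregate rates reverse because the groups are distributed differently across an "easy" and a "hard" pool.

```python
def rate(successes, total):
    """Simple success rate."""
    return successes / total

# Hypothetical approval counts, invented purely to exhibit the reversal:
# (successes, applicants) per pool.
a_easy, a_hard = (80, 100), (2, 10)
b_easy, b_hard = (9, 10), (30, 100)

# Within every subgroup, B's rate beats A's...
assert rate(*b_easy) > rate(*a_easy)   # 0.90 > 0.80
assert rate(*b_hard) > rate(*a_hard)   # 0.30 > 0.20

# ...yet in aggregate A appears better, because most of B's
# applicants fall in the hard pool (Simpson's paradox).
a_total = rate(a_easy[0] + a_hard[0], a_easy[1] + a_hard[1])
b_total = rate(b_easy[0] + b_hard[0], b_easy[1] + b_hard[1])
assert a_total > b_total
```

This is why analyses should be repeated on meaningful subgroups before conclusions are drawn from aggregate figures.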
Mitigating systematic biases requires understanding the nature and source of the bias and the best way to respond. Fortunately, techniques exist to provide guarantees against certain types of bias, although their practical implementation remains an active research area.
Examine Error Distributions and Feedback Loops
Responsible data governance considers not only the fairness of correct predictions but also of errors. Disproportionately harmful errors for individuals or protected groups can undermine equitable decision-making, even if the process considered appropriate factors.
False positives are particularly important in contexts like criminal risk assessment, where misclassifications as high-risk can lead to harsher treatment, potentially creating a negative feedback loop that increases the likelihood of future adverse outcomes.
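Checking for disparate error burdens can be as direct as computing the false-positive rate per group: among people who are actually low risk, what fraction does the system wrongly label high risk? The record format below is an assumption for the sketch.

```python
from collections import defaultdict

def false_positive_rates(records):
    """Compute the per-group false-positive rate.

    `records` is an iterable of (group, predicted, actual) tuples where
    predicted/actual are booleans (True = labeled/actually high risk).
    A large gap between groups means one group disproportionately bears
    the cost of erroneous high-risk labels.
    """
    fp = defaultdict(int)         # predicted high risk, actually low risk
    negatives = defaultdict(int)  # actually low risk
    for group, predicted, actual in records:
        if not actual:
            negatives[group] += 1
            if predicted:
                fp[group] += 1
    return {g: fp[g] / negatives[g] for g in negatives}
```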
Evaluating unfairness must consider how the outputs of data analysis will translate into real-world actions and their consequences.
Enable Human Challenges and Corrections
While the goal of automated decision-making is efficiency and scalability, responsible data governance must define mechanisms for humans to challenge and correct erroneous outcomes.
Externally visible processes should allow engagement with automated results, and an internal role should be designated to own the outcomes and human-mediated escalation process. This role must be prepared to address individual-level and broader societal claims of unfairness.
Systems should produce sufficient operational evidence to allow decision subjects to determine if decisions were correct and facilitate a review process to pinpoint what happened and why. Careful system design can ensure the reproducibility and oversight of machine actions.
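The "sufficient operational evidence" above can be captured as a structured decision record: the inputs, model version, output, and reason codes, plus a content hash so later review can detect whether the record was altered. Field names are illustrative; real systems would also control who can read these records, since they contain personal data.

```python
import hashlib
import json

def decision_record(model_version, inputs, output, reason_codes):
    """Build a reviewable record of one automated decision.

    The digest is computed over the canonical JSON form so that any
    later tampering with the stored record is detectable.
    """
    record = {
        "model_version": model_version,
        "inputs": inputs,
        "output": output,
        "reason_codes": reason_codes,
    }
    payload = json.dumps(record, sort_keys=True)
    record["digest"] = hashlib.sha256(payload.encode()).hexdigest()
    return record

def verify_record(record):
    """Recompute the digest to confirm the record is unchanged."""
    body = {k: v for k, v in record.items() if k != "digest"}
    payload = json.dumps(body, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest() == record["digest"]
```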
Manage Human Oversight While Capturing Efficiencies
Introducing human review into automated processes raises concerns about compromising speed and scale benefits. However, this trade-off can be managed by carefully defining escalation triggers, akin to customer support workflows.
The cost of human escalations incentivizes developing high-fidelity decision processes, closing the feedback loop for investigating model accuracy. Tracking situations requiring review can identify areas for process improvement and examine potential abuse favoring or disfavoring particular individuals or groups.
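The escalation-trigger idea can be sketched as a routing rule: automate clear-cut, high-confidence decisions and send borderline or low-confidence ones to a human reviewer. The thresholds below are placeholders that a real process would tune against its error costs and review capacity.

```python
def route_decision(score, confidence, confidence_floor=0.8, band=(0.45, 0.55)):
    """Decide whether an automated outcome ships or escalates.

    Escalates when the model is unsure of itself (confidence below the
    floor) or when the score falls in the borderline band; otherwise the
    decision is automated. Thresholds are illustrative.
    """
    if confidence < confidence_floor or band[0] <= score <= band[1]:
        return "human_review"
    return "approve" if score > band[1] else "deny"
```

Counting how often each route fires then feeds the process-improvement loop the paragraph describes: a rising escalation rate is itself a signal worth investigating.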
Emerging Standards and Principles
To advance responsible data governance and foster fairness, accountability, transparency, and ethics in data-driven systems, many organizations have proposed statements of principles or evaluation standards. While highly contextual and dependent on the system, data, and populations involved, these documents serve as guideposts for evaluation and measurement.
Dagstuhl Principles for Accountable Algorithms
This framework provides light-touch guidance for building "accountable algorithms," centered around five pillars: responsibility, explainability, accuracy, auditability, and fairness. Each pillar includes a statement of advice and guiding questions to enable data scientists to implement it meaningfully.
ACM US Policy Council Principles
The ACM US Policy Council statement defines algorithms, analytics, automated decision-making, and associated risks, presenting seven principles for "algorithmic transparency and accountability": awareness, access and redress, accountability, explanation, data provenance, auditability, and validation and testing.
Center for Democracy and Technology Digital Decision Project
This project's report identifies four major principles for responsible automated decision systems: fairness, explainability, auditability, and reliability (the ability to trust a system's behavior and monitor deviations). An interactive tool guides system designers through key questions to evaluate consistency with these principles.
IEEE Global Initiative for Ethical AI and Autonomous Systems
The IEEE initiative aims to build consensus around ethical issues in AI and autonomous systems, supporting human values in system design. Its flagship report defines a vision for "Ethically Aligned Design" centered on three principles: embodying human rights ideals, prioritizing maximum benefit to humanity and the environment, and mitigating risks as AI/autonomous systems evolve.
IEEE Standard P7003 – Algorithmic Bias Considerations
This ongoing standards-setting process defines certification for communicating the use of best practices in algorithm design, testing, and evaluation to avoid unjustified differential impact on users. It provides benchmarking procedures for data quality, guidelines for minimizing concept drift, and approaches for managing model output interpretation.
Data Ethics and the GDPR: A Regulatory Lens
The EU's General Data Protection Regulation (GDPR) provides a timely policy lens for examining responsible data governance practices. While not prescribing specific compliance regimes, the GDPR encourages ideal behavior and improves data stewardship.
Articles 13 and 14 establish strong notice rights for personal data processing, including information about automated decision-making logic and repurposing data for new uses. Articles 15-17 provide rights to access, correct, and erase personal data.
Article 22 grants EU data subjects the right to demand human intervention in important automated decisions, raising questions about what constitutes personal information and a meaningful explanation in data-driven systems.
The GDPR's data protection impact assessment requirement aligns with recommendations for producing social impact statements, engaging stakeholders, and maintaining decision records.
Article 21 allows objections to personal data processing, while Article 22 permits demands for human decision-making over solely automated processing – supporting practices of enabling human challenges, review, and override mechanisms.
Recital 69 suggests data controllers bear the burden of justifying continued data use against a subject's erasure request, presenting interesting implications for managing models trained on such data.
Overall, the GDPR incentivizes organizations to develop robust data governance regimes, aligning with many best practices outlined in this article.
Conclusion
As data-driven systems become increasingly pervasive, realizing their full potential hinges on building and operating them responsibly to earn consumer and regulatory trust. The practices and principles discussed here provide a solid foundation for organizations to navigate the complex landscape of data governance confidently.
By prioritizing information security, minimizing data collection, considering re-identification risks, empowering data review boards, producing impact statements, striving for explainability, facilitating auditing, mitigating biases, examining error distributions, enabling human oversight, and aligning with emerging standards, businesses can confidently field data-driven solutions that uphold ethical norms and human values.
Responsible data governance is not just a compliance exercise but an opportunity for organizations to demonstrate their commitment to fairness, accountability, transparency, and ethics – cornerstones of trustworthy AI and the key to unlocking its transformative potential.