Data Governance
Introduction
Technological advancements have ushered in an era where pivotal decisions are increasingly automated, driven by sophisticated software algorithms that leverage machine learning, data analytics, and artificial intelligence (AI). This paradigm shift towards data-driven systems has exposed gaps between conventional governance processes and the realities of software-mediated decision-making. As these intelligent systems continue to pervade various aspects of our lives, from approving credit applications to assessing criminal risk profiles, there is a pressing societal interest in ensuring their ethical governance and fostering accountable algorithms.
The Transparency Conundrum
The dominant discourse within legal and policy circles has moved beyond the notion that mere transparency can resolve the challenges posed by automated decision systems. Disclosing source code is neither a prerequisite for effective oversight nor a guarantee of public comprehension and participation in governance. Moreover, transparency is often objectionable to entities that profit from proprietary methods not protected by patents or copyrights. Paradoxically, detailed system knowledge can facilitate adversarial activities like gaming or exploiting vulnerabilities.
Crucially, the pivotal role of data in machine learning, data analytics, and AI systems implies that source code disclosure alone is insufficient to reveal their inner workings fully. The processes of data collection, normalization, exploration, and cleaning significantly influence system functionality, underscoring the need for a more holistic approach to transparency.
Data Governance: A Multifaceted Responsibility
Businesses deploying data-driven systems, from rudimentary descriptive analytics to cutting-edge deep learning models, must navigate a complex landscape of data governance requirements. Key considerations include information security and privacy, minimizing data collection and retention, the risk of re-identification, fairness and systematic bias, and mechanisms for human oversight and redress.
Failure to address these concerns can result in legal non-compliance, reputational damage, and erosion of consumer trust – risks no organization can afford to overlook.
Robust Data Governance Practices
To confidently navigate the treacherous waters of data governance, organizations must adopt smart policies and practices. This article outlines essential best practices for responsible data stewardship.
Prioritize Information Security and Privacy
A robust data governance strategy must prioritize strong information security and privacy measures. Any collected and retained data pose a risk of breach, necessitating a minimalist approach – limiting data collection to only what is essential and avoiding unnecessary retention.
Retained data should be securely encrypted, with access control mechanisms restricting sensitive customer information to authorized personnel with verified needs. Regular auditing and monitoring of sensitive data stores are crucial to ensure compliance and identify potential misuse.
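The access-control and auditing requirements above can be sketched as a small wrapper around a data store that checks the requester's role and logs every access attempt, allowed or denied. This is a minimal illustration, not a production design; the role names and record layout are invented for the example.

```python
import datetime

class AuditedStore:
    """Sketch of an access-controlled store for sensitive records.

    Every read attempt is appended to an audit log so that later
    monitoring can detect misuse. Roles and fields are illustrative.
    """

    def __init__(self, authorized_roles):
        self._records = {}
        self._authorized = set(authorized_roles)
        self.audit_log = []  # every access attempt, allowed or denied

    def put(self, key, value):
        self._records[key] = value

    def get(self, key, requester, role):
        allowed = role in self._authorized
        # Log before enforcing, so denied attempts are also captured.
        self.audit_log.append({
            "time": datetime.datetime.utcnow().isoformat(),
            "requester": requester,
            "role": role,
            "key": key,
            "allowed": allowed,
        })
        if not allowed:
            raise PermissionError(f"role {role!r} may not read {key!r}")
        return self._records[key]
```

A periodic job reviewing `audit_log` for denied attempts or unusual access patterns would implement the regular monitoring described above.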
Minimize Data Collection and Retention
When data retention is unavoidable, organizations should explore scrubbing or aggregating retained data to reduce sensitivity levels. For instance, instead of raw visitor logs, a website hosting company could retain visitor counts stratified by geographic region, device type, or network operator – information sufficient for most operational needs while minimizing privacy risks.
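The website-hosting example above can be made concrete: instead of retaining raw visitor records, the company keeps only counts stratified by region and device. The field names and sample records below are invented for illustration.

```python
from collections import Counter

def aggregate_visits(raw_log):
    """Collapse raw visitor records into counts keyed by (region, device),
    discarding IP addresses and any other per-visitor identifiers."""
    return Counter((v["region"], v["device"]) for v in raw_log)

raw = [
    {"ip": "203.0.113.7", "region": "EU", "device": "mobile"},
    {"ip": "198.51.100.2", "region": "EU", "device": "mobile"},
    {"ip": "192.0.2.9", "region": "US", "device": "desktop"},
]
counts = aggregate_visits(raw)
# `counts` preserves the operational signal (traffic by segment)
# while the identifying raw log can be discarded.
```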
Additionally, understanding data retention rationales can inform when data can be profitably discarded, anonymized, pseudonymized, or otherwise transformed to mitigate risks.
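One common transformation is pseudonymization via a keyed hash: the same user always maps to the same stable pseudonym, so analyses over time still work, but reversing the mapping requires the secret key, which is held separately. This is a sketch of the general technique, not a specific product's API.

```python
import hashlib
import hmac

def pseudonymize(user_id: str, secret_key: bytes) -> str:
    """Replace a direct identifier with an HMAC-SHA256 pseudonym.

    The pseudonym is deterministic for a given key, enabling joins and
    longitudinal analysis, but cannot be reversed without the key.
    Destroying the key later approximates anonymization of the records.
    """
    return hmac.new(secret_key, user_id.encode(), hashlib.sha256).hexdigest()
```

Note that pseudonymized data can still be re-identifying in combination with other attributes, which motivates the next section.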
Consider the Risk of Re-identification
Classifying data as personally identifiable information (PII) or non-sensitive is an oversimplification that fails to account for the risk of re-identification. Research has shown that seemingly innocuous data can become identifying when combined with other datasets or contextual information.
Responsible data governance must consider the potential for retained data to be re-identified, intentionally or inadvertently, and evaluate whether formal privacy-preserving techniques like differential privacy are necessary to safeguard sensitive information securely.
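As a taste of what a formal technique guarantees, the Laplace mechanism releases a count with calibrated random noise: a counting query changes by at most 1 when any one person is added or removed (sensitivity 1), so noise with scale 1/ε yields ε-differential privacy for the released value. This is an instructional sketch; production use should rely on a vetted differential-privacy library.

```python
import math
import random

def laplace_noise(scale: float) -> float:
    """Sample from Laplace(0, scale) via inverse-CDF sampling."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def dp_count(true_count: int, epsilon: float) -> float:
    """Release a count with epsilon-differential privacy.

    A counting query has sensitivity 1, so Laplace noise with scale
    1/epsilon suffices; smaller epsilon means stronger privacy and
    noisier answers.
    """
    return true_count + laplace_noise(1.0 / epsilon)
```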
Establish a Data Use Review Board
Data scientists must continually scrutinize the ethics of their methods and findings, prepared to forgo analyses that violate laws, privacy norms, contractual requirements, or consumer trust. To facilitate this, organizations should designate review boards empowered to approve or deny proposed data collection efforts, analyses, and novel uses of existing data.
These cross-functional boards should comprise stakeholders from data science, information security, legal, compliance, marketing, and other relevant domains. Their diverse perspectives can uncover potential issues and provide valuable insights into responsible data use.
Produce Data-Focused Social Impact Statements
Impact statements offer a structured process for investigating potential issues in data practices and provide a digestible view of the risks associated with data processing. Organizations concerned about the equity of data analysis should include such concerns and mitigations in their privacy impact assessments and consider producing similar social impact statements for data-driven systems and processes.
These statements enhance transparency about organizational risk acknowledgment and mitigation techniques without preemptively foreclosing specific activities. They can be used internally to convince leadership of an analysis's value and risk profile or published externally to engage consumer trust and solicit feedback from civil society groups.
Strive for Explainability and Ongoing Validation
To foster trust in data-driven processes, both data scientists and decision subjects must understand them. This understanding is supported by the ability to explain how a system operates, why it produced a particular outcome, and which factors most influenced that outcome.
Many explainable data analysis methods exist, and the question of what constitutes a useful explanation is an active research area. However, explainability is not a panacea; explanations must be supported by sufficient evidence, target the intended audience, and adequately engage with the task at hand to avoid lending credence to incorrect models.
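For one simple class of models, an explanation falls out of the arithmetic: a linear score decomposes exactly into per-feature contributions (weight × value), which can be ranked to show which factors drove a decision. The weights and feature names below are hypothetical.

```python
def explain_linear(weights, features, feature_names):
    """Decompose a linear model's score into per-feature contributions.

    For a linear scoring model, score = sum(w_i * x_i), so each term is
    an exact, additive explanation of that feature's effect.
    """
    contributions = {
        name: w * x for name, w, x in zip(feature_names, weights, features)
    }
    score = sum(contributions.values())
    # Rank by absolute impact so the dominant factors come first.
    ranked = sorted(contributions.items(), key=lambda kv: -abs(kv[1]))
    return score, ranked

score, ranked = explain_linear(
    weights=[0.5, -2.0, 0.1],
    features=[10, 1, 30],
    feature_names=["income", "missed_payments", "age"],
)
```

Complex models require approximate techniques instead, which is precisely where the caution in the next paragraph applies: an approximate explanation can lend false credence to a bad model.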
In addition to providing explanations, data-driven systems can be made more transparent through the disclosure of analysis methods and underlying datasets, when feasible. However, data release must be carefully considered, as datasets are often sensitive to their collection context and methodology, risking misuse when repurposed without proper context.
Facilitate Ongoing Auditing and Assumption Challenging
Interrogating a data-driven system's fidelity is an ongoing effort, driven by the twin risks of modeling error (unwarranted assumptions baked into data or normalization methods) and concept drift (changes in the world that invalidate assumptions).
Data scientists must continually validate their predictions and monitor system performance post-deployment. Auditing, especially by groups potentially affected by bias, is critically important for investigating unfairness. External audits by trusted academics, journalists, civil society groups, or the public can also be facilitated when appropriate.
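Post-deployment monitoring can start from something as simple as watching a monitored score's distribution drift away from its validation-time baseline. The sketch below uses a crude mean-shift check with an invented threshold; real deployments would use proper statistical tests (e.g. Kolmogorov-Smirnov or population stability index).

```python
def mean_shift_alarm(baseline, recent, threshold=0.2):
    """Flag possible concept drift when the mean of a monitored score
    moves more than `threshold` (absolute) from its baseline.

    Deliberately simplistic: a real monitor would use a statistical
    test and track multiple summary statistics.
    """
    base_mean = sum(baseline) / len(baseline)
    recent_mean = sum(recent) / len(recent)
    return abs(recent_mean - base_mean) > threshold
```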
Systems could be modified to support querying on synthetic data, demonstrating how outputs would change under hypothetical input variations. Stronger forms of testing, including white-box methods that consider a model's structure, should also be considered.
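The synthetic-query idea above can be sketched as a black-box probe: hold one hypothetical input fixed, vary a field at a time, and report whether the decision flips. The loan-style fields and the stand-in model below are invented for illustration.

```python
def probe(model, base_input, variations):
    """Query a black-box decision function with hypothetical variations.

    `model` is any callable taking a feature dict; `variations` is a list
    of (field, value) pairs. Returns the baseline output and, for each
    variation, the new output and whether it differs from the baseline.
    """
    baseline = model(base_input)
    report = []
    for field, value in variations:
        changed = dict(base_input, **{field: value})
        out = model(changed)
        report.append((field, value, out, out != baseline))
    return baseline, report

# Stand-in model: approve when income clears a threshold.
approve = lambda applicant: applicant["income"] >= 40000
base = {"income": 30000, "age": 40}
baseline, report = probe(approve, base, [("income", 50000), ("age", 25)])
```

Here the probe would reveal that income, not age, changes the outcome for this applicant, without requiring access to the model's internals.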
Identify and Mitigate Systematic Biases
Systematic bias can enter datasets and analysis methods at several levels, from how the problem is framed to how results are aggregated.
Data scientists define the problem, choose methods, measure success criteria, optimize for specific values, and select parameters – all context-specific decisions that can introduce bias.
It is particularly important to rule out discriminatory behavior in black-box models such as random forests and deep neural networks, which may implicitly reconstruct protected attributes as new classification features.
The way data are considered also matters; patterns that exist in aggregated groups may disappear or reverse when subgroups are analyzed separately (Simpson's paradox).
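Simpson's paradox is easy to demonstrate with a few numbers. In the invented approval counts below, group B outperforms group A within every subgroup, yet the aggregate rates reverse because the groups are distributed differently across an "easy" and a "hard" pool.

```python
def rate(successes, total):
    """Simple success rate."""
    return successes / total

# Hypothetical approval counts, invented purely to exhibit the reversal:
# (successes, applicants) per pool.
a_easy, a_hard = (80, 100), (2, 10)
b_easy, b_hard = (9, 10), (30, 100)

# Within every subgroup, B's rate beats A's...
assert rate(*b_easy) > rate(*a_easy)   # 0.90 > 0.80
assert rate(*b_hard) > rate(*a_hard)   # 0.30 > 0.20

# ...yet in aggregate A appears better, because most of B's
# applicants fall in the hard pool (Simpson's paradox).
a_total = rate(a_easy[0] + a_hard[0], a_easy[1] + a_hard[1])
b_total = rate(b_easy[0] + b_hard[0], b_easy[1] + b_hard[1])
assert a_total > b_total
```

This is why analyses should be repeated on meaningful subgroups before conclusions are drawn from aggregate figures.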
Mitigating systematic biases requires understanding the nature and source of the bias and the best way to respond. Fortunately, techniques exist to provide guarantees against certain types of bias, although their practical implementation remains an active research area.
Examine Error Distributions and Feedback Loops
Responsible data governance considers not only the fairness of correct predictions but also of errors. Disproportionately harmful errors for individuals or protected groups can undermine equitable decision-making, even if the process considered appropriate factors.
False positives are particularly important in contexts like criminal risk assessment, where misclassifications as high-risk can lead to harsher treatment, potentially creating a negative feedback loop that increases the likelihood of future adverse outcomes.
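Checking for disparate error burdens can be as direct as computing the false-positive rate per group: among people who are actually low risk, what fraction does the system wrongly label high risk? The record format below is an assumption for the sketch.

```python
from collections import defaultdict

def false_positive_rates(records):
    """Compute the per-group false-positive rate.

    `records` is an iterable of (group, predicted, actual) tuples where
    predicted/actual are booleans (True = labeled/actually high risk).
    A large gap between groups means one group disproportionately bears
    the cost of erroneous high-risk labels.
    """
    fp = defaultdict(int)         # predicted high risk, actually low risk
    negatives = defaultdict(int)  # actually low risk
    for group, predicted, actual in records:
        if not actual:
            negatives[group] += 1
            if predicted:
                fp[group] += 1
    return {g: fp[g] / negatives[g] for g in negatives}
```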
Evaluating unfairness must consider how the outputs of data analysis will translate into real-world actions and their consequences.
Enable Human Challenges and Corrections
While the goal of automated decision-making is efficiency and scalability, responsible data governance must define mechanisms for humans to challenge and correct erroneous outcomes.
Externally visible processes should allow engagement with automated results, and an internal role should be designated to own the outcomes and human-mediated escalation process. This role must be prepared to address individual-level and broader societal claims of unfairness.
Systems should produce sufficient operational evidence to allow decision subjects to determine if decisions were correct and facilitate a review process to pinpoint what happened and why. Careful system design can ensure the reproducibility and oversight of machine actions.
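The "sufficient operational evidence" above can be captured as a structured decision record: the inputs, model version, output, and reason codes, plus a content hash so later review can detect whether the record was altered. Field names are illustrative; real systems would also control who can read these records, since they contain personal data.

```python
import hashlib
import json

def decision_record(model_version, inputs, output, reason_codes):
    """Build a reviewable record of one automated decision.

    The digest is computed over the canonical JSON form so that any
    later tampering with the stored record is detectable.
    """
    record = {
        "model_version": model_version,
        "inputs": inputs,
        "output": output,
        "reason_codes": reason_codes,
    }
    payload = json.dumps(record, sort_keys=True)
    record["digest"] = hashlib.sha256(payload.encode()).hexdigest()
    return record

def verify_record(record):
    """Recompute the digest to confirm the record is unchanged."""
    body = {k: v for k, v in record.items() if k != "digest"}
    payload = json.dumps(body, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest() == record["digest"]
```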
Manage Human Oversight While Capturing Efficiencies
Introducing human review into automated processes raises concerns about compromising speed and scale benefits. However, this trade-off can be managed by carefully defining escalation triggers, akin to customer support workflows.
The cost of human escalations incentivizes developing high-fidelity decision processes, closing the feedback loop for investigating model accuracy. Tracking situations requiring review can identify areas for process improvement and examine potential abuse favoring or disfavoring particular individuals or groups.
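The escalation-trigger idea can be sketched as a routing rule: automate clear-cut, high-confidence decisions and send borderline or low-confidence ones to a human reviewer. The thresholds below are placeholders that a real process would tune against its error costs and review capacity.

```python
def route_decision(score, confidence, confidence_floor=0.8, band=(0.45, 0.55)):
    """Decide whether an automated outcome ships or escalates.

    Escalates when the model is unsure of itself (confidence below the
    floor) or when the score falls in the borderline band; otherwise the
    decision is automated. Thresholds are illustrative.
    """
    if confidence < confidence_floor or band[0] <= score <= band[1]:
        return "human_review"
    return "approve" if score > band[1] else "deny"
```

Counting how often each route fires then feeds the process-improvement loop the paragraph describes: a rising escalation rate is itself a signal worth investigating.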
Emerging Standards and Principles
To advance responsible data governance and foster fairness, accountability, transparency, and ethics in data-driven systems, many organizations have proposed statements of principles or evaluation standards. While highly contextual and dependent on the system, data, and populations involved, these documents serve as guideposts for evaluation and measurement.
Dagstuhl Principles for Accountable Algorithms
This framework provides light-touch guidance for building "accountable algorithms," centered around five pillars: responsibility, explainability, accuracy, auditability, and fairness. Each pillar includes a statement of advice and guiding questions to enable data scientists to implement it meaningfully.
ACM US Policy Council Principles
The ACM US Policy Council statement defines algorithms, analytics, automated decision-making, and associated risks, presenting seven principles for "algorithmic transparency and accountability": awareness, access and redress, accountability, explanation, data provenance, auditability, and validation and testing.
Center for Democracy and Technology Digital Decision Project
This project's report identifies four major principles for responsible automated decision systems: fairness, explainability, auditability, and reliability (the ability to trust a system's behavior and monitor deviations). An interactive tool guides system designers through key questions to evaluate consistency with these principles.
IEEE Global Initiative for Ethical AI and Autonomous Systems
The IEEE initiative aims to build consensus around ethical issues in AI and autonomous systems, supporting human values in system design. Its flagship report defines a vision for "Ethically Aligned Design" centered on three principles: embodying human rights ideals, prioritizing maximum benefit to humanity and the environment, and mitigating risks as AI/autonomous systems evolve.
IEEE Standard P7003 – Algorithmic Bias Considerations
This ongoing standards-setting process defines certification for communicating the use of best practices in algorithm design, testing, and evaluation to avoid unjustified differential impact on users. It provides benchmarking procedures for data quality, guidelines for minimizing concept drift, and approaches for managing model output interpretation.
Data Ethics and the GDPR: A Regulatory Lens
The EU's General Data Protection Regulation (GDPR) provides a timely policy lens for examining responsible data governance practices. While not prescribing specific compliance regimes, the GDPR encourages ideal behavior and improves data stewardship.
Articles 13 and 14 establish strong notice rights for personal data processing, including information about automated decision-making logic and repurposing data for new uses. Articles 15-17 provide rights to access, correct, and erase personal data.
Article 22 grants EU data subjects the right to demand human intervention in important automated decisions, raising questions about what constitutes personal information and a meaningful explanation in data-driven systems.
The GDPR's data protection impact assessment requirement aligns with recommendations for producing social impact statements, engaging stakeholders, and maintaining decision records.
Article 21 allows objections to personal data processing, while Article 22 permits demands for human decision-making over solely automated processing – supporting practices of enabling human challenges, review, and override mechanisms.
Recital 69 suggests data controllers bear the burden of justifying continued data use against a subject's erasure request, presenting interesting implications for managing models trained on such data.
Overall, the GDPR incentivizes organizations to develop robust data governance regimes, aligning with many best practices outlined in this article.
Conclusion
As data-driven systems become increasingly pervasive, realizing their full potential hinges on building and operating them responsibly to earn consumer and regulatory trust. The practices and principles discussed here provide a solid foundation for organizations to navigate the complex landscape of data governance confidently.
By prioritizing information security, minimizing data collection, considering re-identification risks, empowering data review boards, producing impact statements, striving for explainability, facilitating auditing, mitigating biases, examining error distributions, enabling human oversight, and aligning with emerging standards, businesses can confidently field data-driven solutions that uphold ethical norms and human values.
Responsible data governance is not just a compliance exercise but an opportunity for organizations to demonstrate their commitment to fairness, accountability, transparency, and ethics – cornerstones of trustworthy AI and the key to unlocking its transformative potential.