Best Practices in AI Documentation: The Imperative of Evidence from Practice

By: CDT's Amy Winecoff & Miranda Bogen

Recent AI incidents have underscored the urgent need for robust governance to mitigate risks and ensure responsible development of AI-powered systems. For example, 4chan users leveraged AI tools to create violent and explicit images of female celebrities, and Google’s Gemini generated offensive images of “historical” figures. These incidents are part of a long history of AI failures, from chatbots spewing hate speech to algorithms exacerbating racial disparities in healthcare. Ongoing AI incidents raise a crucial question: Why do AI failures persist? The answer, while complex, centers on the inadequacy of current AI governance procedures.

Effective risk management and oversight of AI hinge on a critical, yet underappreciated tool: comprehensive documentation. Often, documentation is conceptualized as a tool for achieving transparency into AI systems, enabling accountability to external oversight bodies and the public. However, third-party visibility and accountability are only two of the many goals that documentation can facilitate. Documentation also serves as the backbone for effective AI risk management and governance, and helps practitioners assess potential failure modes and proactively address these issues throughout the development and deployment lifecycle. Well-maintained documentation offers organizations ongoing insights into their systems’ strengths and weaknesses, fostering iterative improvements. And documentation informs decisions about whether to launch systems at all, given the potential benefits and risks they stand to pose. In essence, documentation is a tool that has the potential to — and is indeed necessary to — facilitate both external accountability and internal risk management practices.

Notwithstanding that potential, approaches that seem beneficial in theory are not always successful in practice. To ensure documentation can fully support robust AI governance, researchers, policymakers, and advocacy groups should consider insights from public and private-sector practitioners experienced in creating and using documentation, as well as evidence of its efficacy in real-world AI contexts.

The Theory of Documentation

In its ideal form, AI documentation records fundamental details about AI systems, including the sources of training data, the hardware and software used to train the component AI models, and the evaluation methodologies used to assess the systems for efficacy and errors. Documentation can also describe the procedures a company has followed in designing, developing, and deploying these systems, such as the original motivation for developing the system, whether the system underwent an impact assessment or ethics review, and how training or evaluation data were labeled.
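
To make this concrete, here is a minimal sketch of what such a documentation record might look like as a structured object. It is purely illustrative: the field names and example values are our own assumptions, not a standardized schema from any of the frameworks discussed below.

    # A minimal, hypothetical documentation record, loosely inspired by
    # model cards and datasheets. All field names are illustrative.
    from dataclasses import dataclass
    from typing import List

    @dataclass
    class ModelDocumentation:
        name: str
        intended_uses: List[str]               # tasks the model was designed for
        out_of_scope_uses: List[str]           # uses the developers advise against
        training_data_sources: List[str]       # provenance of training data
        evaluation_metrics: dict               # metric name -> score on held-out data
        known_limitations: List[str]           # failure modes observed in testing
        ethics_review_completed: bool = False  # whether an internal review occurred

    doc = ModelDocumentation(
        name="quality-scoring-model-v2",
        intended_uses=["ranking articles for editorial review"],
        out_of_scope_uses=["automated content removal"],
        training_data_sources=["internal editor ratings, 2021-2023"],
        evaluation_metrics={"accuracy": 0.91, "false_positive_rate": 0.04},
        known_limitations=["underperforms on non-English content"],
    )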

Good documentation can provide insight into an AI system’s risks and improve AI development more generally. For example, Wikipedia developers participating in a research study were asked to use a documentation framework to guide their development of a machine learning system for predicting quality in community content moderation applications. Through their engagement in the documentation process, practitioners identified accuracy metrics that were more closely aligned with the priorities of the system’s target users. Navigating this process also provided them with a deeper understanding of the system they were working on, which could facilitate more efficient development in the future.

On the other hand, when AI systems or components are not sufficiently documented before deployment, errors or problems with these components may go undetected, potentially deteriorating system performance and posing ethical and legal risks. For instance, when one group of AI researchers documented BookCorpus — a previously undocumented dataset used to train popular large language models (LLMs) — they revealed numerous duplications of books and an overrepresentation of specific genres, like romance, which could potentially skew outputs of models trained on that dataset. Researchers also found that BookCorpus may have violated copyright restrictions, highlighting significant legal and ethical implications of using the dataset. This case illustrates the importance of thorough documentation in identifying hidden problems and promoting responsible AI development and use.
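
To give a flavor of the kind of analysis such a documentation effort involves, the sketch below computes two of the summary statistics the researchers reported, duplicate entries and genre distribution, over a small invented dataset; none of these records are real BookCorpus data.

    # Hypothetical sketch of checks a dataset documentation effort might run:
    # counting duplicate records and summarizing genre balance.
    from collections import Counter

    books = [
        {"title": "Example Romance A", "genre": "romance"},
        {"title": "Example Romance A", "genre": "romance"},  # duplicate entry
        {"title": "Example Thriller B", "genre": "thriller"},
    ]

    title_counts = Counter(b["title"] for b in books)
    duplicates = {t: n for t, n in title_counts.items() if n > 1}
    genre_share = Counter(b["genre"] for b in books)

    print("duplicated titles:", duplicates)    # {'Example Romance A': 2}
    print("genre distribution:", genre_share)  # Counter({'romance': 2, 'thriller': 1})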

Numerous AI researchers and governance groups have proposed frameworks for AI documentation. As a part of an ongoing effort to understand themes that emerge from documentation research, we identified and reviewed 37 different approaches to documenting AI data, models, systems, and processes that have been proposed in the academic and gray literature. These proposals have significantly influenced academic researchers and policymakers seeking to define best practices for responsible AI development. For instance, the concept of model cards, a method adopted by a number of AI developers for documenting AI models, has been cited nearly 1,800 times in academic publications in the last five years. The National Institute of Standards and Technology (NIST) references datasheets for datasets, a method for documenting AI training data, 26 times in its guide for companies on implementing effective AI risk management. The technical documentation requirements for high-risk AI systems in the European Union’s AI Act draw on a number of AI documentation proposals, including both datasheets and model cards.

Documentation can improve governance outcomes both through the artifacts practitioners create for data, models, and systems, and through practitioners’ active participation in the documentation process. Specific outputs like datasheets, model cards, and system cards can help downstream stakeholders understand the intended and unintended uses of these components. These artifacts enable stakeholders to assess whether their planned uses comply with organizational or legal requirements, preventing non-compliant development efforts. They can also alert downstream stakeholders to instances where risk mitigation techniques may be necessary before systems can be safely deployed.
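
As a rough illustration of that gating role, and building on the hypothetical record sketched above, a downstream team might screen a planned use against a model’s documented uses and escalate anything the documentation does not cover:

    # Hypothetical pre-deployment check against a documentation artifact.
    # Exact string matching stands in for what is, in practice, human review.
    def screen_planned_use(doc: ModelDocumentation, planned_use: str) -> str:
        if planned_use in doc.out_of_scope_uses:
            return "blocked: documented as out of scope"
        if planned_use in doc.intended_uses:
            return "approved: matches a documented intended use"
        return "escalate: use not covered by documentation; needs review"

    print(screen_planned_use(doc, "automated content removal"))
    # -> blocked: documented as out of scope

The simplicity is the point: the artifact gives downstream teams something concrete to check a planned use against before a system is built or shipped.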

Beyond the benefits of the artifacts themselves, the documentation process can foster a healthy risk management culture within an organization. Regularly documenting risks helps practitioners better understand responsible AI principles and practices, influencing their behavior beyond the documentation process. Documentation can potentially serve as a forcing function, encouraging practitioners to follow best practices in software development and adopt more rigorous scientific approaches, since documentation opens their work to increased scrutiny and accountability from other internal stakeholders. Moreover, documentation can facilitate collaboration among different stakeholders by establishing a common knowledge base about systems. This shared understanding helps cross-functional teams collectively examine the strengths, weaknesses, and risks of their approaches from multiple perspectives.

From Theory to Practice

While proposed documentation frameworks have laid a foundation for AI development norms, the emerging policy attention to defining more specific documentation standards necessitates a careful evaluation of which practices do, or will, most effectively support governance goals in real-world settings. Documentation is a tool, and the success of that tool depends on the human and social context in which it is used. Without adequate evidence to support understanding of the social, organizational, and institutional dynamics that could influence the utility documentation provides, unproven and potentially less effective methods could end up becoming a norm, while more robust approaches that better serve governance and accountability goals go unadopted.

To contribute to an evidence-based understanding of effective documentation strategies, the AI Governance Lab reviewed 21 research papers that present empirical findings related to documentation and convened stakeholders from technology companies, AI governance and compliance consulting firms, nonprofit organizations, and government agencies to identify insights. We also consulted individually with stakeholders who use documentation in their work. In each effort, we sought to identify challenges and opportunities for translating the theoretical approaches to documentation that researchers have proposed into real-world practice.

Empirical studies have identified a variety of challenges to implementing documentation effectively.

Our convening and consultations with documentation stakeholders likewise underscored that implementing documentation in real-world development environments is more complex than it may appear, and the assumptions of proposed documentation frameworks often diverge from the conditions that shape real-world practice. Considerations that our workshop participants raised included:

  • Dynamic development lifecycles: Proposed documentation frameworks often assume AI development is linear, with clear decision points and distinct system components. However, AI development in practice can be much more complicated, and distinctions among system components and decision processes are often blurred. This makes determining what to document and when to document it a non-trivial question.
  • System and organizational complexity: Larger AI developers are often dealing with hundreds or even thousands of datasets, models, and systems, while the kinds of documentation approaches proposed in the academic literature tend to focus on how a single dataset, model, or system should be documented. Both those producing documentation and those who rely on it must navigate high volumes of information, particularly as this information may be distributed across multiple technical artifacts and teams.
  • General-purpose systems: General-purpose models, such as the one behind ChatGPT, can be adapted for a wide range of downstream tasks. Consequently, documentation on any aspect of these models, including evaluations of their capabilities and risks, may not be applicable to every specific use case. Deployers of relatively closed general-purpose models, which are accessible only via APIs, argue that the lack of detailed documentation on the models’ safety guardrails prevents them from effectively mitigating context-specific risks in their own systems. This issue is further complicated by insufficient documentation on the data used to train these models, since information about training data can help deployers improve downstream system performance and identify areas where risk mitigation may be necessary.
  • Relationship with existing documentation requirements: AI practitioners in government already face substantial documentation requirements for privacy, cybersecurity, data governance, and other factors, and while documentation requirements have increased with new legislation, government entities have not always received commensurate increases in resources to meet them. To fulfill documentation and transparency requirements, government practitioners need to find ways to streamline and harmonize diverse documentation requirements — but how to do this effectively and efficiently is an open question.
  • Navigating legal and reputational considerations: Documentation frameworks typically assume that detailed information contained within documentation will help downstream practitioners make sound choices, but some actors may be reluctant to record information that could create liability or be damaging if it publicly surfaced. In light of these concerns, lack of clarity around what must be documented could dissuade developers or their organizations from producing documentation that is detailed enough to be useful.

To be sure, none of these concerns should be an excuse for poor governance practices. But they are factors for organizations to consider when designing and implementing documentation practices to maximize their likelihood of success, and for policymakers to be aware of when weighing what guidance or requirements would effectively incentivize useful documentation practices.

The Path Forward

Given the urgency of preventing AI-driven harms, stakeholders will need to weigh uncertainty about which practices best support risk management against the consequences of continued inconsistency and insufficiency in many current documentation approaches. In the meantime, the successes and failures of past AI documentation research and implementations offer valuable lessons. One important lesson is that proposed approaches to documentation are more valuable when they are accompanied by empirical evidence about how these approaches can and can’t respond to applied development contexts. While extensive real-world investigations in multiple contexts may not be practical in the short term, smaller qualitative studies and focus groups can still provide helpful insights. For example, researcher Karen Boyd conducted a study to assess the effectiveness of data documentation in raising ethical awareness among AI practitioners. In her study, 23 AI practitioners participated, but only 11 were given data documentation artifacts to consult. Results indicated that those with access to data documentation were more likely to recognize ethical issues than those without it. Although the findings are not definitive, Boyd’s study represents some of the best available evidence on the impact of data documentation on practitioners’ ethical deliberation.

In some instances, empirical evidence on documentation exists but remains unpublished, which is regrettable. Take, for example, documentation frameworks developed through an approach known as “co-design.” In co-design studies, researchers iterate between proposing a framework, gathering feedback from relevant stakeholders or gathering data from pilot implementations, and updating the framework design until they converge on a final product. Co-design is effective for developing empirically-informed frameworks that are responsive to organizational context. However, the data collected during these studies has not generally been shared in publications. This is most likely because the authors believe the primary contribution of their work is the framework rather than the findings, but findings can provide crucial context as to why the authors adapted their initial designs or where the framework might best succeed. Without this context, practitioners and policymakers might not understand the rationale behind specific design choices that can be necessary for effective implementation.

Moving forward, researchers should prioritize evaluating proposed approaches in practice and publicly sharing findings to provide insight into where there may be subtle tradeoffs with significant implications for governance and accountability efforts. Yet, even if researchers embrace empirical evaluations as a necessary step when proposing documentation approaches, reaching a consensus on evidence-based best practices will likely take time — and in the meantime, lack of or ambiguity in evidence for such proposals risks being weaponized by actors looking to water down rules they are ultimately expected to follow.

Nevertheless, policymakers and government agencies should scrutinize AI documentation proposals with a weaker empirical basis more carefully than those informed by more robust evidence. They should also build in processes to review and revisit the effectiveness of recommendations or guidelines over time to ensure gaps are spotted and filled. Doing so will help documentation as a risk mitigation strategy live up to its promise. Policymakers and government entities like NIST and the EU AI Office should also prioritize enhancing the evidence base for documentation practices. Guidance to companies, for instance, could recommend mechanisms for evaluating and disclosing the success of particular documentation frameworks. As companies adopt documentation frameworks to manage AI risks, they could then assess both the usability and effectiveness of these practices — both of which are instrumental to the success of any responsible AI tool.

Grantmaking organizations like the National Science Foundation (NSF) and AI safety research initiatives could promote evidence-based research on documentation by requiring grant applicants to detail how they will empirically evaluate their proposed approaches. Just as academic conferences and journals have pushed researchers to embrace open science and to consider the potential ethical implications of their work, publication venues could further encourage empirical evaluation by adjusting review guidelines for research on documentation. Reviewers could then place significant emphasis on the presence or absence of empirical evaluations when making acceptance decisions.

As AI systems become increasingly integral to technology products, establishing effective governance practices is critical. Robust documentation practices are essential to these efforts. While the need to document AI systems is evident, it is equally important for AI companies to adopt empirically validated methods that improve outcomes in real-world development and risk management settings. Researchers, policymakers, and other stakeholders defining best practices for AI governance can learn from the successes and failures of past efforts, empirical studies on documentation, and insights from practitioners. By doing so, they can develop and refine documentation approaches that fulfill their theoretical potential in practice. Moving forward, collaborative efforts that emphasize evidence-based practices will be crucial in harnessing the benefits of AI while minimizing its risks. At the AI Governance Lab, we will continue to investigate this important area and highlight promising practices within policy and practitioner communities.
