Data Catalogue - the Excalibur Sword for Scaling AI
It is a pleasure to present the latest article from Oscar and Guillermo, who both work at Capgemini in Madrid. Thank you for sharing your thoughts with the IE Big Data Club.
The hard truth of scaling Artificial Intelligence (AI)
The competitive advantages of AI are generated by establishing valuable solutions and scaling them across the organization. Prototypes need to be developed in an environment that enables free and flexible handling of data and AI concepts, and tested with end users to prove value on concrete business decisions under a “fail fast and fail often” approach.
However, companies struggle to implement AI use cases throughout their organizations. Several challenges block the scaling of AI and thus the efficient generation of value from data: a very limited overview of available systems and their changes; inconsistencies in the information about available data, its meaning and its source systems; and incomplete information on data access and usage.
This situation specifically hinders the roles an organization needs to implement AI use cases: the use case owner, the data scientist and the data steward. The use case owner is the functional source of any AI use case and assesses its feasibility. The data scientist is responsible for extracting knowledge and insights from structured and unstructured data. The data steward, despite not being associated with any specific AI use case, is a vital support function that uses the organization’s data governance processes to ensure the fitness of data elements.
The solution? The Data Catalogue
To combat the lack of transparency and usability in AI use cases, a Data Catalogue initiative should be considered: it stores all relevant metadata and provides an overview of available systems, the data they store and the existing data flows within an organization’s system landscape.
The main features of a Data Catalogue support these roles in finding and accessing the right data for their work. The use case owner and the data scientist in particular rely on this aspect of a Data Catalogue to answer their needs.
A key feature of a Data Catalogue is the business glossary. In contrast to a data dictionary, which stores a system’s technical metadata, the business glossary is a framework to create, nurture and promote a common vocabulary across an organization. For data to be meaningful, people across the organization need to share a common understanding of its definition, lineage and validity. In addition, a Data Catalogue can visualize the relationships between technical and functional metadata as well as the data lineage from source systems to target systems.
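To make the distinction concrete, here is a minimal sketch of how a business glossary term could be linked to technical metadata and lineage. The class and field names (GlossaryTerm, TechnicalAsset, upstream systems) are illustrative assumptions for the example, not the schema of any particular Data Catalogue product.

```python
# Minimal sketch (Python): a hypothetical model linking a business glossary
# term to technical metadata and lineage. All names are illustrative only.
from dataclasses import dataclass, field
from typing import List

@dataclass
class TechnicalAsset:
    """Technical metadata, as a data dictionary would record it."""
    system: str       # system that physically stores the data
    table: str
    column: str
    data_type: str

@dataclass
class GlossaryTerm:
    """Business (functional) metadata: the shared vocabulary."""
    name: str
    definition: str
    owner: str                                             # accountable data steward
    mapped_assets: List[TechnicalAsset] = field(default_factory=list)
    upstream: List[str] = field(default_factory=list)      # lineage: source systems

# Example: the business term "Customer Churn Flag" mapped to its physical column
churn_flag = GlossaryTerm(
    name="Customer Churn Flag",
    definition="Indicates whether a customer cancelled within the last 12 months.",
    owner="crm.data.steward@example.com",
    mapped_assets=[TechnicalAsset("CRM", "customers", "churn_flag", "BOOLEAN")],
    upstream=["Billing system", "Contract management system"],
)
```

In a real Data Catalogue this mapping is maintained in the tool itself, but the structure is the same: one business definition, linked to the technical assets that implement it and to the systems it originates from.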
A Data Catalogue also supports collaboration, allowing different roles to work together on the maintenance and consumption of data assets. Such features are relevant for all roles engaged in depth in the implementation of AI use cases, i.e. the data scientist and the data steward. In addition, a Data Catalogue offers in-tool data access management, where requests, approvals and documentation are handled in a compliance-proof manner.
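As an illustration of what in-tool access management could look like, the sketch below keeps a request, its approval and the audit trail in one place. The workflow states and field names are assumptions made for the example, not the API of a specific tool.

```python
# Minimal sketch (Python): a hypothetical access request with an audit trail.
# States and fields are illustrative assumptions, not a product API.
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List

@dataclass
class AccessRequest:
    requester: str
    data_asset: str
    purpose: str
    status: str = "REQUESTED"                     # REQUESTED -> APPROVED / REJECTED
    audit_log: List[str] = field(default_factory=list)

    def log(self, event: str) -> None:
        """Every step is documented with a timestamp for compliance."""
        self.audit_log.append(f"{datetime.now(timezone.utc).isoformat()} {event}")

    def approve(self, approver: str) -> None:
        self.status = "APPROVED"
        self.log(f"approved by {approver}")

req = AccessRequest("data.scientist@example.com", "CRM.customers",
                    "Churn prediction use case")
req.log("request submitted")
req.approve("data.steward@example.com")
print(req.status, req.audit_log)
```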
Lastly, a Data Catalogue simplifies the import and maintenance of metadata and supports the relevant roles in cataloguing, enriching and quality-assuring data. This feature is especially relevant for the data steward who wants to actively manage available data through a Data Catalogue.
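A rough sketch of what quality-assuring imported metadata could look like in practice is given below. The required fields and the checks are assumed for illustration and would differ per organization.

```python
# Minimal sketch (Python): hypothetical checks a data steward might run over
# newly imported catalogue entries. The rules are illustrative assumptions.
from typing import Dict, List

REQUIRED_FIELDS = ["name", "definition", "owner", "source_system"]

def quality_issues(entry: Dict[str, str]) -> List[str]:
    """Return a list of metadata quality issues for one catalogue entry."""
    issues = [f"missing field: {f}" for f in REQUIRED_FIELDS if not entry.get(f)]
    if entry.get("definition") and len(entry["definition"]) < 20:
        issues.append("definition too short to be meaningful")
    return issues

imported = [
    {"name": "churn_flag", "definition": "", "owner": "", "source_system": "CRM"},
    {"name": "customer_id",
     "definition": "Unique identifier of a customer record.",
     "owner": "crm.data.steward@example.com", "source_system": "CRM"},
]

for entry in imported:
    problems = quality_issues(entry)
    print(f"{entry['name']}: {'OK' if not problems else '; '.join(problems)}")
```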
The maintenance of data assets is crucial for organizations to become data-driven, and one of its biggest success factors is empowered data stewards. A Data Catalogue enables the efficient use of data and thus the scaling of AI: data stewards proactively manage data rules, reactively monitor data quality, and establish organization-specific workflows with other stakeholders that guarantee the maintenance as well as the enhancement of metadata.
Written by Oscar Alonso Llombart and Guillermo Blanco Muñoz from Capgemini