Second commandment of data-driven companies: You shall not walk the road or leave the data adrift
Jéimar Arias Vélez
Engineering Manager | Tech Lead | Data Architect | Data Engineer | CDO | Director of Data and Analytics | Director of Technology | Development and Innovation | Cloud | Big data | IoT | Blockchain | AI | Cybersecurity
Before setting out to adopt a #datadriven culture, a company must first define where it wants to go and define a management and governance strategy for its data assets. This strategy can be framed and prioritized within the 11 knowledge areas defined by DAMA International in the wheel model below:
Data Governance: The data governance process sits at the center of the wheel. It defines the rules, policies, and controls that determine how data assets are managed across the other 10 knowledge areas of the model.
Data Architecture: Architecture is the art and technique of projecting, designing, and building [https://es.wikipedia.org/wiki/Arquitectura]. In this context, data architecture comprises defining a model of how data should be organized in the enterprise, including the physical and logical models for the collection, storage, processing, analysis, and use of data and business information. Data architecture is vital for good data management and must be framed within the enterprise architecture.
Data Quality: This area of knowledge covers the good practices a company must adopt to continuously improve the quality of its data. The data quality process should be managed as a program framed by the Deming cycle (Plan-Do-Check-Act), with the aim of delivering highly reliable data and information to different audiences. The data quality cycle can be summarized as follows: measure the quality of the data, make an improvement plan, execute the plan, validate the improvement, take corrective measures, and start the cycle again.
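The "measure" step of the data quality cycle can be sketched as follows. This is only a minimal illustration: completeness is one possible quality metric chosen here as an assumption, and the records, field names, and target value are hypothetical.

```python
def completeness(records, field):
    """Share of records where `field` is present and non-empty (0.0 to 1.0)."""
    if not records:
        return 0.0
    filled = sum(1 for r in records if r.get(field) not in (None, ""))
    return filled / len(records)


# Hypothetical customer records; the field names are illustrative only.
customers = [
    {"id": 1, "email": "ana@example.com"},
    {"id": 2, "email": ""},
    {"id": 3, "email": None},
    {"id": 4, "email": "luis@example.com"},
]

TARGET = 0.95  # assumed quality target agreed in the improvement plan

before = completeness(customers, "email")  # measure the current quality
needs_plan = before < TARGET               # decide whether a new cycle is needed
print(f"email completeness: {before:.0%}, improvement needed: {needs_plan}")
```

Running the same measurement again after the improvement plan has been executed gives the "validate the improvement" step of the cycle.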
Data Modeling & Design: Data modeling and design refers to how the data entities of an information system and their attributes are physically and logically stored, and how they relate to each other, to provide the data model that best fits the business requirements.
Data Storage & Operations: This area refers to managing the storage and operation of data throughout its lifecycle to maximize its value.
Data Security: Data security should be a top priority for every company. An event such as the hacking of its information can bankrupt a company. This area of knowledge refers to implementing policies, norms, procedures, standards, automation, etc., to ensure that information is stored, processed, and used with integrity and reliability by the audiences that can access it, according to their level of authorization.
Data Integration & Interoperability: This area refers to the different processes (interfaces) that must be implemented to ensure the correct storage and flow of data within and between applications. This concept even covers the exchange of data services between organizations.
Documents & Content: Document and content management involves controlling the capture, storage, access, and use of data and information stored outside of relational databases [DMBoK2 2017].
Reference & Master Data: Master data is data that is shared between different business areas, processes, or information systems. Examples: customers, suppliers, users, products, and locations. Reference data consists of codes that are shared throughout the organization, including international codes that apply to a particular industry, e.g., country codes, economic activity codes, and political division codes within each country. Master and reference data management comprises implementing different processes to ensure a single source of this information at all levels of the organization.
Metadata: Metadata is data that describes other data. Managing it comprises building and maintaining a semantic layer that can be consulted at all organizational levels. Metadata management must include a dictionary that defines each piece of data from the business point of view and allows the lineage of the data to be queried. Metadata is of three types: business, technical, and operational. Examples by type:
● Business metadata: Business rules, quality rules, security levels, data lineage, etc.
● Technical metadata: Database models, column descriptions, network design documentation, ETLs, etc.
● Operational metadata: Error logs, resource metrics, data retention rules, SLAs, roles, etc.
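The idea of a dictionary that holds a business definition for each piece of data and lets you query its lineage can be sketched as follows. This is a minimal illustration, not a real metadata tool: the class, the function, and the data element names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class DictionaryEntry:
    name: str
    business_definition: str
    # Direct upstream data elements this one is derived from (assumed structure).
    sources: list[str] = field(default_factory=list)


def lineage(name, dictionary):
    """Return every upstream element of `name`, nearest sources first."""
    ordered = []
    queue = list(dictionary[name].sources)
    while queue:
        current = queue.pop(0)
        if current in ordered:
            continue  # already visited; also protects against cycles
        ordered.append(current)
        if current in dictionary:
            queue.extend(dictionary[current].sources)
    return ordered


# Hypothetical entries for illustration only.
dictionary = {
    "monthly_revenue": DictionaryEntry(
        "monthly_revenue", "Sum of invoiced amounts per calendar month", ["invoice_amount"]
    ),
    "invoice_amount": DictionaryEntry(
        "invoice_amount", "Amount billed to a customer on one invoice", ["erp.invoices.amount"]
    ),
}

print(lineage("monthly_revenue", dictionary))
```

Elements that have no entry of their own, such as the hypothetical source column above, are treated as the origin of the lineage.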
Data Warehousing & Business Intelligence: Since the concept of data lakes emerged, a question has arisen: does a data lake replace the data warehouse? My answer is that it does not; the two are complementary. A data warehouse stores highly structured data and is designed to provide business intelligence for decision-making. These databases typically use columnar storage, in contrast to transactional relational databases, which store data by record (row).
Big Data & Data Science: Although this process is not in the wheel model, DAMA covers it in a separate chapter of its book DMBoK2, and it is therefore no less important than the previous ones. Big data refers to the collection, storage, processing, analysis, and use of large volumes of data, taken from different sources, at high speeds, and with high reliability, to generate value in an organization. The information is classified as structured, semi-structured, or unstructured and can be processed in batches or in streaming ([near] real time). Data Science, on the other hand, is a scientific discipline focused on the analysis of large data sources to extract information, understand reality, and discover basic patterns for making sound decisions [https://www.masterdatascienceucm.com/que-es-data-science/]. This means that Data Science relies on Big Data to generate valuable information for decision-making, as a complement to business intelligence.
The following are the steps that I consider for starting a good data governance and management strategy:
1. Make an assessment of data processes
To design a data management strategy, it is necessary to make an objective assessment of the state of the processes, mainly those defined in the wheel model. It is recommended that this assessment be carried out by an entity external to the organization, to avoid the bias that internal interests may produce. The assessment must cover the 12 processes, including Data Governance, for each business unit (domain) and its subdomains. For this assessment, you could use the DMMA methodology (Data Management Maturity Assessment) defined by CMMI (Capability Maturity Model Integration), which consists of six levels: Level 0: Does not exist; Level 1: Beginner; Level 2: Repeatable; Level 3: Defined; Level 4: Managed; Level 5: Optimized. This model allows you to assess the maturity level of a company's data processes. For more information on how to perform the assessment and for a clear definition of the maturity levels, you can consult Chapter 15, Data Management Maturity Assessment, of DAMA's DMBoK2.
From the ratings, you can build a spider chart, like the one below, by business unit, sub-domain, and/or merged for the organization:
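As a rough sketch, the ratings behind such a spider chart can be kept per business unit and then merged for the organization. The business units, processes, and ratings below are hypothetical, and averaging is only one assumed way of merging them.

```python
from statistics import mean

# Maturity levels as listed in the text (0 to 5).
MATURITY_LEVELS = {
    0: "Does not exist",
    1: "Beginner",
    2: "Repeatable",
    3: "Defined",
    4: "Managed",
    5: "Optimized",
}

# Hypothetical assessment results: {business unit: {process: level}}.
ratings = {
    "Sales": {"Data Governance": 1, "Data Quality": 2, "Metadata": 0},
    "Finance": {"Data Governance": 3, "Data Quality": 2, "Metadata": 1},
}


def merged_profile(ratings):
    """Average each process across business units (assumed merging rule)."""
    processes = sorted({p for unit in ratings.values() for p in unit})
    return {
        p: mean(unit[p] for unit in ratings.values() if p in unit)
        for p in processes
    }


profile = merged_profile(ratings)
for process, score in profile.items():
    print(f"{process}: {score:.1f}")
```

Each value in `profile` would become one axis of the merged spider chart, while the entries of `ratings` give the charts per business unit.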
2. Create the strategic data plan
The next step is to elaborate the strategic data plan, which must be aligned with the strategic objectives of the company and its business units, and must be sustainable over time. Besides the processes, this plan needs to consider the people and technology that will be used for its implementation. To build the plan, I suggest developing a prioritization matrix for each of the above processes using the RICE model. This model is not the only one, but it is the one I recommend, since its formula defines a Benefit/Cost ratio. The RICE model [Source: https://www.productplan.com/glossary/rice-scoring-model/], which stands for Reach, Impact, Confidence, Effort, consists of four variables that must be scored for each process by reviewing its current maturity level and the level to which you want to take it. For each process, the following formula is applied:
Score = R * I * C / E
Where:
Reach = Scope. This first factor can be rated from 1 to 100 or from 1 to 1000, as long as the scale is proportional, objective, and the same for all processes. It represents the number of people, transactions, etc., that can be reached in a given period of time. Example: if the maximum score is defined as 1000 and this represents 1 million transactions, then 300,000 transactions score 300.
Impact = It can be a quantitative measure or a factor that objectively represents the number of new users, the number of transactions, etc., that result from applying the improvement to the process. If it cannot be calculated quantitatively, I suggest the following rating scale:
3 = Massive impact
2 = High impact
1 = Medium impact
0.5 = Low impact
0.25 = Very low impact
Confidence = Trust level. It is the percentage of confidence the team has in achieving the objective. I suggest the following percentages:
100% = High confidence
75% = Medium confidence
50% = Low confidence
25% = Minimum confidence
0% = Total distrust and its effect would be zero on the rating.
Effort = It corresponds to the resources, estimated in hours, amount invested, etc., that will be allocated to implementing the process in the established period. It can be expressed in a different unit from the one used for Reach, and for easier comprehension a smaller scale is recommended. Example: if Reach was defined from 1 to 1000, the scale for Effort could be defined from 1 to 10. So, if the deployment consumes 2000 hours out of a total of 10000 in the set time period, the value of effort would be 20%, equivalent to E = 2.
The RICE framework can be seen as a benefit/cost ratio. This means that the processes with the highest numerator and the lowest denominator obtain the highest priority and would be the first to be implemented (quick wins).
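The formula and the two scaling examples above can be sketched in a few lines. The class and function names are hypothetical, and the two sample processes use made-up values that serve only to show the ranking.

```python
from dataclasses import dataclass


@dataclass
class RiceItem:
    process: str
    reach: float       # e.g. on a 1 to 1000 scale
    impact: float      # 3, 2, 1, 0.5 or 0.25
    confidence: float  # fraction: 1.0, 0.75, 0.5, 0.25 or 0.0
    effort: float      # e.g. on a 1 to 10 scale

    def score(self) -> float:
        """Score = R * I * C / E."""
        if self.effort <= 0:
            raise ValueError("effort must be greater than zero")
        return self.reach * self.impact * self.confidence / self.effort


def scale_reach(transactions, max_transactions=1_000_000, max_score=1000):
    """Map a transaction count onto the Reach scale proportionally."""
    return transactions * max_score / max_transactions


def scale_effort(hours, total_hours, max_scale=10):
    """Map consumed hours onto the Effort scale proportionally."""
    return hours * max_scale / total_hours


# Illustrative values only; they do not come from a real assessment.
items = [
    RiceItem("Big Data & Data Science", reach=scale_reach(300_000),
             impact=2, confidence=0.5, effort=8),
    RiceItem("Data Governance", reach=800, impact=3, confidence=1.0,
             effort=scale_effort(2000, 10000)),
]

ranked = sorted(items, key=lambda item: item.score(), reverse=True)
for item in ranked:
    print(f"{item.process}: {item.score():.1f}")
```

Sorting by score in descending order puts the quick wins at the top of the list.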
The table below is an example of how the processes can be objectively prioritized using the RICE framework:
From the previous table you can generate a bubble chart like the following:
The X axis shows the numerator (profit or value) and the Y axis shows the effort. The size of the bubble represents the Score. Quadrant 1 contains the processes that generate the most value and demand the least effort (quick wins). The quadrant number marks the order of implementation.
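Assigning a process to a quadrant can be sketched as below. Only quadrant 1 is defined in the text; the meaning of quadrants 2 to 4 and the threshold values are assumptions made for this illustration.

```python
def quadrant(value, effort, value_threshold, effort_threshold):
    """Return the quadrant (1 to 4) for a process.

    value  : numerator of the RICE formula (R * I * C)
    effort : denominator of the RICE formula (E)
    Quadrant 1 = high value, low effort (quick wins), as in the text.
    Quadrants 2 to 4 follow an assumed order: high value / high effort,
    low value / low effort, low value / high effort.
    """
    high_value = value >= value_threshold
    low_effort = effort <= effort_threshold
    if high_value and low_effort:
        return 1
    if high_value:
        return 2
    if low_effort:
        return 3
    return 4


# Hypothetical thresholds: the midpoints of the value and effort scales.
print(quadrant(value=2400, effort=2, value_threshold=1500, effort_threshold=5))
print(quadrant(value=300, effort=8, value_threshold=1500, effort_threshold=5))
```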
Note: This is an illustrative example and does not reflect reality. In this exercise the Big Data & Data Science process receives a low priority, which could happen, for instance, when it applies to only one business unit and demands a lot of effort relative to the benefit its implementation provides. All data management processes should be implemented; what we want to show here are the implementation priorities.
Governance is a process that can be implemented in parallel with the other processes, as the management of each of them is put in place. In this illustrative example, I intentionally gave the three processes in quadrant 1 the highest rating: Governance, Architecture, and Reference & Master Data, because, in my opinion, this is where you should start. However, each company establishes its plan depending on the result of the maturity assessment of its processes and the financial resources it can invest.
The plan must be aligned with the company's strategic plan, and I recommend using the OKR (Objectives and Key Results) method for this. Likewise, the plan must be accompanied by a roadmap for its implementation in the short, medium, and long term. This roadmap is dynamic, especially in the medium and long term, since it must be reviewed periodically and priorities may change as it is implemented.
3. Data Governance
The second commandment says, "Do not leave data adrift," meaning that data must be well governed. There are several definitions of Data Governance, but the following one covers all aspects: "Data governance consists of the formulation and compliance of policies to optimize, secure and empower data as organizational assets by aligning the objectives of different organizational functions; Data governance requires interdepartmental cooperation to timely and faithfully deliver data with maximum value for organizational decision-making" [Soares 2014]. This definition covers alignment with the company's strategy, data security, joint work between different areas, and the use of data for decision-making, among other aspects.
Previously, data was governed and managed by the technology area, headed by the CTO or Technology Manager. Many companies still follow this model today. However, when a company realizes the potential of its data and sees it as an asset, the next question arises: should data continue to be governed by technology or by an independent area? The answer depends on the resources the company is willing to invest. Whoever governs or manages the data, what matters is to define the data governance and data management roles and formally assign them to the right people with their respective responsibilities.
In my opinion and given that data is already being seen as a strategic issue, the Data area should be independent, and report directly to the CEO of the company, under the following model:
This model shows the three independent areas, but there is always joint work (intersections between the areas), and the intersection of the three areas is the teamwork that generates value for the company. The Data area, from my perspective, should have a single head under the name of CDO (Chief Data Officer), CDAO (Chief Data & Analytics Officer), or whatever name you want to give it. What I mean is that the Data leader should own the company's entire data strategy, including Data Science, because if these functions are separate, they may end up misaligned and pursuing different objectives.
To implement good data governance, companies must establish a data operating model. The following models are defined in DAMA International's DMBoK2: Decentralized, Network, Centralized, Hybrid, and Federated. The company chooses the operating model that best suits its strategic plan and the resources it is willing to invest. However, I quite like the federated model, since it combines centralized and decentralized functions and can be applied to an organization of any size. The federated model distributes data ownership and data stewardship from the data area to the business areas, while remaining governed by a data governance office headed by the CDO. The following image is the federated model taken from DAMA International's DMBoK2:
The model entails defining a steering committee for Data Management, which delegates the functions of data governance and administration to a COE (Center of Excellence) headed by the CDO, and from there the guidelines are given for the proper management of data in the business and technology units called Data Management Groups. "A federated model provides a centralized strategy with decentralized execution" [DMBoK2 2017].
The following are the main roles that should exist in a data operating model, regardless of the type of model. A role does not imply a position. Depending on the size of the company, a role may be assigned to one dedicated person (in a large company), while in a small company a single person may fulfill the functions and responsibilities of several roles.
Data Sponsors: They are generally people with significant decision-making power in the organization. They are committed to supporting data governance and management initiatives.
Data Management Steering Committee (DMSC): The DMSC is the highest data governance and management body. It provides direction and defines overall priorities for the different data governance and management initiatives. It makes high-level decisions and is the body that approves the strategic data plan. This committee delegates functions and responsibilities to the COE headed by the CDO, who in turn should have a seat on the DMSC. The committee should be made up of the company's high-level managers: CEO, CDO, COO, CIO, CTO, CFO, and some managers of the business units.
Chief Data Officer (CDO): From my perspective, the CDO is responsible for all processes related to data and business information, including analytics. The CDO is responsible for managing all the processes previously defined in the wheel model, which I summarize as follows: data governance, data management, analytics, operation, innovation, and usability.
Data Governance Office (DGO): The data governance office is an area under the CDO's umbrella and is the one that defines the governance policies for the correct management of data in the company. It works together with the business and IT units to define the policies that best suit their requirements. It can be made up of Data Stewards.
Data Owners: They are the owners of the data. This role is assumed by the director of a business area, called a domain. Data Owners delegate to Data Stewards the responsibility for managing the data of the modules they manage, called subdomains.
Data Stewards: Data Stewards are the administrators of data. In a federated model, there can be two types: Data Stewards in the DGO and Data Stewards in the business units, the latter called Business Data Stewards. The Data Stewards in the DGO define the cross-cutting data governance policies for the entire company, while Business Data Stewards define business rules, manage metadata, and define quality rules for the subdomain they are in charge of.
Data Custodians: They are the technical counterpart of Data Stewards. They are responsible for implementing the business rules, quality rules, etc., defined by Data Stewards. Accordingly, they are the custodians of the transport, storage, security, and technical processing of data.
Data Architect: The role responsible for the architecture and integration of data. It must be aligned with Enterprise Architecture and should be under the umbrella of the CDO. The Data Architect defines the best components for a data analysis solution, seeking the solution that best fits the business requirements. It is a cross-cutting role for the company.
Data Modeler: While the Data Architect defines the best components, the Data Modeler performs the physical and logical design of the data model that best suits the business requirements. The Data Modeler must be an expert in structured, semi-structured, and unstructured data models. This role should be integrated into the development squads (SDLC), either part-time or full-time (FTE), depending on the progress of the development project.
Data Engineer: The Data Engineer handles the development and implementation of “automated” data flows (data pipelines) that perform collection, storage, and processing, leaving the data ready for consumption by Data Analysts, by Data Scientists, and, in general, by the different audiences that consume data for decision-making. These tasks include the entire “Data Cleansing” process.
Data Analyst: The Data Analyst extracts, processes, and analyzes large amounts of data that will shape the strategic management of the company. The Data Analyst identifies patterns in users, translates data into understandable language, and adds value to the company [https://www.edix.com/es/instituto/data-analyst/]. Data Analysts must have sufficient knowledge of data visualization tools such as Power BI, Amazon QuickSight, Tableau, Looker, and QlikView, among others.
Data Scientist: A Data Scientist is a person who knows statistical methods very well and also knows technology well enough to apply them and get value from the data. While Data Analysts concentrate on analytical processes related to Business Intelligence, Data Scientists are usually dedicated to building forecasting and/or machine learning models for automated decision-making.
Other roles: Analytics/Report Developer, DBA, Data Security Administrator.
Finally, in relation to this commandment, good data governance requires defining the responsibilities of the roles over the different data management processes. For this, I recommend building a RACI matrix (Responsible, Accountable, Consulted, Informed) by business unit (domain) and by subdomain, as in the following example:
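A RACI matrix can also be kept as a simple data structure and checked automatically. The sketch below is only illustrative: the assignments are hypothetical, and the rule it checks (exactly one Accountable and at least one Responsible per process) is a common RACI convention assumed here, not something stated in the article.

```python
from collections import Counter

# Hypothetical RACI matrix for one subdomain: {process: {role: letter}}.
raci = {
    "Data Quality": {
        "CDO": "A",
        "Business Data Steward": "R",
        "Data Custodian": "C",
        "Data Owner": "I",
    },
    "Metadata": {
        "Data Owner": "A",
        "Business Data Steward": "R",
        "Data Architect": "C",
    },
}


def check_raci(matrix):
    """Return a list of problems found under the assumed convention."""
    problems = []
    for process, assignments in matrix.items():
        counts = Counter(assignments.values())
        if counts["A"] != 1:
            problems.append(f"{process}: expected exactly one Accountable, found {counts['A']}")
        if counts["R"] < 1:
            problems.append(f"{process}: no Responsible role assigned")
    return problems


print(check_raci(raci) or "RACI matrix is consistent")
```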
Parent article: https://www.dhirubhai.net/pulse/10-commandments-data-driven-companies-jeimar-de-jesus-arias-velez/
Link 1st Commandment: https://www.dhirubhai.net/pulse/1st-commandment-data-driven-companies-you-love-data-arias-velez/
References
DAMA International. DMBoK2: Data Management Body of Knowledge, Second Edition, 2017.