Data Nugget November 2023
Data Management Association Norway (DAMA)
Accelerating Data Management in Norway
30 November 2023
Welcome back to the new episode of our Data Nugget.
First, we have a review of the book "Data Teams" by Jesse Anderson. Second, we have an interesting nugget about the importance of data management platforms. Last but not least, we have the next podcast, focusing on a fundamental conflict between the fashion world and Machine Learning.
Enjoy the reading!
Let's grow Data Nugget together. Forward it to a friend. They can sign up here to get a fresh version of Data Nugget on the last day of every month.
Book review: Data Teams
Nugget by Winfried Adalbert Etzel
Jesse Anderson’s book "Data Teams: A Unified Management Model for Successful Data-Focused Teams" will definitely make you feel self-conscious: you realize through the examples that you have been there, and it is useful that someone puts it into words.
The book is not the result of an academic study but is experience-based, and that is its value. Jesse's goal is to establish and document best practices for creating data teams. And best practices are experience-based: what have we learned that can lead to success or failure in organizing data teams?
So who is this book for? Anyone working in a data organization. It can put the work you are doing in perspective and maybe broaden your view on the work of Data Science, Data Engineering and the importance of Data Operations. However, I would say that management will get the most benefit from reading this book, especially from the lessons learnt and experiences shared. The book is also written in a way that addresses management and tries to explain and convey some of the 'truths' in data that have been, and still are, not easy to comprehend outside the data world.
Big Data and distributed systems
It might be tempting to reorganize according to the best practice described in this book, but the book will not give you a silver bullet for all your data problems: you need to adapt it to your situation. You cannot reorganize yourself away from your problems. Right at the start, Jesse narrows the focus of the book in a natural way to a pragmatic approach to structuring work with Big Data and distributed systems.
I enjoyed that the book cleans up the misconceptions around what Big Data is, from the 3 Vs to the nVs, which have certainly confused the definition of Big Data more than they have helped. But at the same time, I found the negative (can’t) definition that Jesse offered a bit unsatisfying: "When asked to do a task with data, the person or team says they can’t do it, usually due to a technical limitation."
But once Jesse put this in the context of distributed systems, it made more sense: "A distributed system is a task broken up and run on several computers at once. This could also mean data broken up and stored on multiple computers. Big data frameworks and technologies are examples of distributed systems."
I would still like to see a more thorough explanation of the complexity of, e.g., distributed systems, which is mentioned many times throughout the book.
Two definitions in the book that I found helpful and that brought context to the book were:
"A data pipeline is a way of making data available: bringing it into an organization, transferring it to another team, and so on—but usually transforming the data along the way to make it more useful."
"A data product takes in a dataset, organizes the data in a way that is consumable by others, and exposes in a form that’s usable by others."
The Data Teams
Now this is the heart of this book, and the magic lies in the simplicity of it. There are only three teams that are essential: Data Science, Data Engineering, and Operations.
The Data Science team
Data Science Teams are the consumers of data products through data pipelines. Often, a Data Scientist is the first hire for a data team, which ends with Data Scientists spending their time on the entire data lifecycle, mainly on tough data engineering tasks.
A good Data Scientist often has a math degree, preferably a PhD, and can work exploratively with data. But these 'good' Data Scientists are rare, and many other roles are gravitating towards this 'sexiest role in data'.
Here are some of the focus points you need to ensure as a manager:
The Data Engineering team
Data Engineering Teams create the data pipeline (…) that feeds data to the rest of the organization, including the data scientists. This also includes the creation of data products. Data Engineers often come from Software Engineering and bring a mindset that imposes a specific, functionality-driven structure on data (e.g., DevOps, feature design, etc.). With this comes an engineering mindset that looks to solve problems with concrete solutions, the opposite of the exploratory Data Science mindset.
Some focus points for managers:
The Operations team
Operations Teams are often neglected but are at the core of keeping things running. There is a general notion that you can rationalize away these teams in the age of cloud, automation, managed services, etc. But these are the professionals who keep the cluster software and other big data technologies you have operationalized running smoothly. A Data Operations Engineer often comes from an operations or software engineering background and is specialised in Big Data, with an understanding of data and programming.
What to look for as a manager?
In production, you find out how good or bad the work from Data Scientists and Data Engineers is.
Troubleshooting and reverse engineering can be a time-consuming endeavour, and shouldn’t be underestimated.
Models like DevOps or DataOps can optimize the E2E lifecycle of data products and ensure operational focus.
These three teams are described comprehensively enough for a manager or leader to understand their differences, complexity, tasks, skills, profiles, etc. This is where the 'simplicity' pays off, by structuring the best practice into these three teams and their interaction internally and externally.
Jesse provides an understanding of, e.g., the differences between a Data Scientist and a Data Engineer, based not just on the outcome but also on skills, ways of working, recruitment, task delivery, and, most interestingly, mindset. This has enormous value for management and can lead to better use of people and resources.
What I was a bit surprised to see is that the roles of Data Governance and Data Architecture were only mentioned marginally. In my opinion, there is particular value in Data Governance for building bridges and creating a better understanding between these teams, the business and management.
The real world
This is one of the last parts of the book, but it should not be overlooked. Jesse interviewed data practitioners and management to build a set of case studies that showcase the importance of understanding and utilizing data professionals according to their strengths.
I appreciate the real-world take on many of the issues within data. Case studies are an important communication tool, and Jesse used them excellently.
This gave the book a great frame that reinforces the lessons learnt, in keeping with the best-practice character of the book.
My recommendation
This book is a great read for anyone working with data who wants to better understand the differences and mindsets within data. Hurray for showing the complexity of working with data, especially with fixing data problems!
Reading this book feels like getting 'behind the scenes' access to many relatable challenges and getting to know how a very experienced consultant would solve them. A consultant who has seen and heard much and has been in many 'I told you so' situations.
Read this book if you are looking to broaden your horizon on realistic challenges faced when putting together a data-focused team.
Importance of Data Management Platform
Nugget by Gaurav Sood
Data is omnipresent in today’s digital world. It gets transformed into information, information into knowledge and knowledge into key decisions by and for the organization. Data Management is an organizational-level capability that the whole company must be responsible for and not only the IT/Data team. Data Management, in turn, depends on various methods, strategies, and tools to derive the information we seek from raw data.
Data Management tools are used to develop and monitor practices as well as organize, process, and analyze an organization’s data. These tools are designed to arrange and harmonize data and should provide a high degree of efficiency and effectiveness. Data Management tools also support privacy, security, and the elimination of data redundancy. Effective Data Management uses a combination of software tools and best practices to control and organize data resources.
Businesses today rely on a range of Data Management tools to effectively manage and utilize their data. These tools often become part of the Data Management Platform, which can be either on-prem or in the cloud. Some of the latest DM tools are open source as well. A DM platform should be able to do data cleansing, ETL, data consolidation, and more. An intelligent Data Management strategy can protect an organization from becoming an environment of chaos and confusion. Knowing which tools are needed to operate a specific business is necessary when selecting a platform. For example, the Data Management tools used by an online retail business are different from those used by an educational website. There is no one-size-fits-all approach to choosing the right DM platform.
Categories of data management tools
Below is a list of basic Data Management tools and their descriptions.
Data Cleansing Tools – The main purpose of cleansing tools is to find inaccurate, corrupt, and irrelevant data and to eventually correct it. The process is also referred to as ‘data scrubbing’. It should be one of the first steps after data ingestion and is often one of the most important steps towards a successful data platform setup. Imagine the damage that wrong or bad data can do to an organization’s image and business. Cleansed data, by contrast, boosts the reliability and value of the organization’s data.
Data Integration Tools – These tools perform the mapping and transformation of data. Data integration tools support analytics by aligning and merging data. They consolidate data from a variety of sources into a single storage area. They help turn raw data into useful information that promotes faster and better decision-making. These tools help to understand and retain customers and support collaboration between departments. The process typically uses four layers of technology: an ETL data pipeline, data sources, business intelligence (BI) tools, and a data warehouse destination.
ETL (Extract, Transform and Load) Tools – These tools speed up data consolidation and automate the extract, transform, and load process. They “extract” structured and unstructured data and consolidate it into a repository. The transformation process includes cleansing, standardization, and deduplication. The last step of the ETL process is loading the transformed data into the target. It can be loaded all at once, called a “full load”, or at scheduled intervals, called an “incremental load”.
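To make the tool categories above a little more concrete, here is a minimal Python sketch of the transform and load steps. It is an illustration only and not how any particular Data Management product works: the record layout (`customer_id`, `email`, `updated_at`), the validity rule for e-mail addresses, the in-memory `warehouse` dictionary, and all function names are assumptions made for this example.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Record:
    # Hypothetical record layout, chosen only for this illustration.
    customer_id: int
    email: str
    updated_at: datetime


def transform(rows):
    """Cleanse, standardize, and deduplicate raw rows (plain dicts)."""
    latest = {}
    for row in rows:
        # Standardization: trim whitespace and lowercase the e-mail.
        email = (row.get("email") or "").strip().lower()
        # Cleansing: skip rows without an id or with an unusable e-mail.
        if row.get("customer_id") is None or "@" not in email:
            continue
        rec = Record(int(row["customer_id"]), email, row["updated_at"])
        # Deduplication: keep only the newest record per customer.
        current = latest.get(rec.customer_id)
        if current is None or rec.updated_at > current.updated_at:
            latest[rec.customer_id] = rec
    return list(latest.values())


def full_load(target, records):
    """Full load: replace everything in the target at once."""
    target.clear()
    target.update({r.customer_id: r for r in records})


def incremental_load(target, records, since):
    """Incremental load: only write records changed after `since`."""
    changed = [r for r in records if r.updated_at > since]
    for r in changed:
        target[r.customer_id] = r
    return len(changed)


raw = [
    {"customer_id": 1, "email": " Ann@Example.com ", "updated_at": datetime(2023, 11, 1)},
    {"customer_id": 1, "email": "ann@example.com", "updated_at": datetime(2023, 11, 5)},
    {"customer_id": 2, "email": "not-an-email", "updated_at": datetime(2023, 11, 2)},
    {"customer_id": None, "email": "x@example.com", "updated_at": datetime(2023, 11, 3)},
    {"customer_id": 3, "email": "bo@example.com", "updated_at": datetime(2023, 11, 20)},
]

warehouse = {}  # stands in for a data warehouse table
full_load(warehouse, transform(raw))

# A later scheduled run sees one new customer and loads only that change.
later = raw + [
    {"customer_id": 4, "email": "cy@example.com", "updated_at": datetime(2023, 11, 25)},
]
loaded = incremental_load(warehouse, transform(later), since=datetime(2023, 11, 20))
```

In this sketch the full load rewrites the whole target, while the incremental load only touches records newer than the last run, which is typically why scheduled incremental loads are cheaper on large datasets.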
Guiding a business through turbulent times requires good and reliable information, which data integration tools can help provide. Having all the pertinent information available supports new opportunities and makes decision-making much easier. Some of the most common features of all good Data Management tools are:
Scalability: A good system should allow you to increase or decrease its performance in response to the constantly changing needs of applications and system processing demands. For example, a system with a growing number of users needs a database that can increase its processing power to keep up with the increased demands.
Data Backup and Disaster Recovery: The purpose of a backup is to store a copy of the data so it can be recovered after a system failure. Data backup and disaster recovery tools/features are necessary for easy access and retrieval of data after a system goes down.
Choosing the right data management tool should depend on your business needs and the data complexity and not on the latest trends and news in the market. Getting your data to talk is your responsibility and selecting the correct data management tool will enable you to do that.
You can read the full article here.
MetaDAMA 2#14: When ML meets Fashion
Nugget by Winfried Adalbert Etzel
There is a fundamental conflict between the essence of fashion and Machine Learning.
Fashion is always about change, innovation and identity, whilst ML is good at making predictions based on historical patterns, not on change. How do those go together?
I had a fantastic chat with Celine Xu, who is Head Data Scientist at H&M Group, with a mixed background in Applied Mathematics and Business.
Here are my key takeaways:
Three things data can achieve:
Focus on ML in Fashion
ML in Product Design (some examples)
The Challenges of ML on behavioural and cultural data
You can listen to the podcast here or on any of the common streaming services (Apple Podcast, Google Podcast, Spotify, etc.). Note: The podcasts in our monthly newsletters are behind the actual airtime of the MetaDAMA podcast series.
Thank you for reading this edition of Data Nugget. We hope you liked it.
Data Nugget was delivered with vision, zeal and courage by the editors and the collaborators.
You can visit our website here, or write us at [email protected].
I would love to hear your feedback and ideas.
Data Nugget Head Editor