What is a Data Fabric?
Richard Schreiber, MPA
Operations Research Analyst @ GSA | Data Science, Federal Contracting
Data Fabric is a data strategy and management concept. It is less tangible than a Data Lake or Data Warehouse. It is a system that makes data storage, extraction, access, and analysis more efficient. A Data Fabric integrates an organization's ingestion and storage processes, adds machine learning to improve its performance and insight gathering, and delivers everything for easy access and consumption.
Background
First, you have to understand what a Data Lake and a Data Warehouse are. Both are ways to store big data. (Industry characterizes Big Data by its large volume, high frequency, and many sources.) However, they have significant differences, the most notable being data structure: Lakes hold raw data, while Warehouses hold processed data. (Learn more about Lakes and Warehouses here.)
Then you have to understand the concept of metadata. The prefix "meta" comes from Greek and, in modern usage, means "about the thing itself." Metadata, therefore, is data that describes data: the names, sources, types, and sizes of your data files. A book title is another example. The title describes what is within.
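To make that concrete, here is a minimal sketch (in Python) of what a metadata record for a single file might look like. The field names and values are illustrative, not drawn from any standard.

```python
# Illustrative only: a minimal metadata record describing one data file.
# The field names and values are hypothetical, not from any particular standard.
file_metadata = {
    "name": "contract_awards_2023.csv",    # what the file is called
    "source": "procurement_system",        # where it came from
    "type": "csv",                         # file format
    "size_bytes": 48_230_114,              # how big it is
    "ingested_at": "2024-01-15T09:30:00Z"  # when it entered the lake
}

# The metadata tells you about the file without opening it,
# the same way a book title tells you what is inside the book.
print(file_metadata["name"], file_metadata["size_bytes"])
```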
The Problem?
The underlying need for a Data Fabric strategy derives from data's innate tendency to be "dirty."
Dirty data is hard to manage. Many organizations aren't prepared for how dirty their data can get when they expand, add more data sources, and adopt more products. They end up filling their Data Lakes and Warehouses with data full of errors, inconsistent input types, and null values. They turn their Data Lake into the dreaded... Data "Swamp."
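As a small illustration of what "dirty" means in practice, here is a rough pandas sketch that surfaces nulls, mixed types, and inconsistent formatting in a toy table. The column names and values are invented for the example.

```python
import pandas as pd

# Toy example of "dirty" data: nulls, mixed types, and inconsistent formatting
# in the same columns. Column names and values are invented for illustration.
df = pd.DataFrame({
    "award_amount": ["1000", 2500.0, None, "N/A"],  # mixed strings and numbers
    "agency_code": ["GSA", "gsa", "GSA ", None],    # inconsistent formatting
})

# Count missing values per column
print(df.isna().sum())

# Coerce award_amount to numeric; anything unparseable becomes NaN,
# which surfaces the "N/A" string as a data-quality problem.
df["award_amount"] = pd.to_numeric(df["award_amount"], errors="coerce")
print(df["award_amount"].isna().sum(), "values failed numeric conversion")

# Normalize agency_code so "GSA", "gsa", and "GSA " count as one value
df["agency_code"] = df["agency_code"].str.strip().str.upper()
print(df["agency_code"].value_counts(dropna=False))
```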
Additionally, data can be difficult to access. There can be many gates, logins, and permissions needed just to touch the sweet, sweet data. One thing I hate is needing a different login for every application. So many passwords!
Finally, it is time-intensive to analyze all the various structures, forms, and systems an organization might have around its data storage. The time investment is massive, and the likelihood of human error is high with a problem as complex as Big Data management.
The Solution
Enter, stage left: the Data Fabric. I see it as a two-pronged approach.
First, you must create your systems and standard operating procedures. This includes creating a naming system, ingestion process, cleaning process, and universal storage rules for the metadata that enters your data lake. Having organized data will lead to identifiable habits, workflows, and outcomes and make life a lot easier.
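Here is a minimal sketch of what such rules could look like in code, assuming a hypothetical naming convention and a hypothetical set of required metadata fields. Your organization's actual rules will differ; the point is that they are written down and enforced at ingestion.

```python
import re

# Hypothetical ingestion rules: a naming convention and the metadata fields
# every file must carry before it is allowed into the lake.
NAME_PATTERN = re.compile(r"^[a-z0-9_]+_\d{4}\.(csv|parquet|json)$")  # e.g. contract_awards_2023.csv
REQUIRED_FIELDS = {"name", "source", "type", "size_bytes", "ingested_at"}

def validate_ingestion(metadata: dict) -> list:
    """Return a list of rule violations; an empty list means the file may be ingested."""
    problems = []
    missing = REQUIRED_FIELDS - metadata.keys()
    if missing:
        problems.append(f"missing metadata fields: {sorted(missing)}")
    if "name" in metadata and not NAME_PATTERN.match(metadata["name"]):
        problems.append(f"file name '{metadata['name']}' breaks the naming convention")
    return problems

# Usage: reject or quarantine files that break the rules instead of
# letting them quietly turn the lake into a swamp.
print(validate_ingestion({"name": "Contract Awards.xlsx", "source": "email"}))
```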
The second prong is adding machine learning to the management system. You've gone in and manually laid the road. Now, you can let a machine drive down it. Look at your processes, automate as much as possible, and apply models that predict decisions on storing and analyzing ingested data.
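As one small illustration of this second prong, the sketch below trains a toy classifier on metadata features to predict which storage zone a newly ingested file should land in. The features, labels, and choice of scikit-learn are assumptions made for the example, not part of any particular product.

```python
from sklearn.ensemble import RandomForestClassifier

# Toy training data: metadata features for past files and the storage zone
# a human chose for each one. Features and labels are invented for illustration.
# Features: [size_mb, has_nulls (0/1), is_structured (0/1)]
X = [
    [5,    0, 1],
    [800,  1, 0],
    [12,   0, 1],
    [1500, 1, 0],
    [3,    0, 1],
    [950,  0, 0],
]
y = ["warehouse", "lake", "warehouse", "lake", "warehouse", "lake"]

# Learn the routing decisions humans have been making by hand
model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Predict where a new file's metadata says it should land
new_file = [[250, 1, 0]]  # 250 MB, has nulls, unstructured
print(model.predict(new_file))
```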
The goal is to build machines that can discover new insights from your metadata.
There may be correlations your organization hasn't noticed. A proper Data Fabric will recognize these correlations and surface the insights for decision-making. The machine learning threads ensure your organization consistently enhances and empowers its analytical capabilities and professionals.
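To give a flavor of what surfacing correlations from metadata could look like, the sketch below computes a simple correlation matrix over a few made-up metadata attributes and flags strongly related pairs. Real fabric tooling does this continuously and at far larger scale.

```python
import pandas as pd

# Made-up metadata attributes collected across many datasets
meta = pd.DataFrame({
    "size_mb":     [5, 800, 12, 1500, 3, 950, 40, 600],
    "null_rate":   [0.0, 0.3, 0.01, 0.4, 0.0, 0.25, 0.02, 0.2],
    "query_count": [120, 4, 300, 2, 210, 6, 90, 10],
})

# Correlation matrix over the metadata itself
corr = meta.corr()

# Flag strongly correlated attribute pairs so an analyst can investigate,
# e.g. "the datasets nobody queries are also the dirtiest ones."
threshold = 0.8
for a in corr.columns:
    for b in corr.columns:
        if a < b and abs(corr.loc[a, b]) > threshold:
            print(f"{a} and {b} are strongly correlated ({corr.loc[a, b]:.2f})")
```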
What you have now is an automated, continuously improving, and manageable system to capture, clean, store, and access big data from any point in the organization. What a mouthful! But that mouthful is exactly what a Data Fabric is.
It's called a Data "Fabric" because it integrates all the parts of a data ecosystem like woven fabric. Every piece and every person connects through threads that span the entire system.
The Data Journey
Flowcharts and process maps help me understand complex subjects. The one below, from TIBCO, helped me visualize how a data fabric works, and I hope it helps you too.
Below are the stages that I believe data goes through when your organization uses a data fabric. These could be more specific but suffice for basic comprehension.
Quotes and Helpful Links
"A data fabric is a data management architecture that can optimize access to distributed data and intelligently curate and orchestrate it for self-service delivery to data consumers. It automates data discovery, governance, and consumption, delivering business-ready data for analytics and AI."
"Gartner defines data fabric as a design concept that serves as an integrated layer (fabric) of data and connecting processes. A data fabric utilizes continuous analytics over existing, discoverable, and inferenced metadata assets to support the design, deployment, and utilization of integrated and reusable data across all environments, including hybrid and multi-cloud platforms."