A huge mistake we see companies make with data contracts is believing that implementing them requires numerous components and high upfront costs before any value is realized. That makes the process seem overwhelming and a hard sell to leadership. It is usually the result of anchoring on the end state rather than recognizing the small wins available as you implement individual components of data contracts over time. The diagram below, from the upcoming book on the topic, highlights the building blocks and stages of maturity involved in implementing data contracts. Even if your organization implements only a fraction of these building blocks, each step in the journey delivers value. Furthermore, companies that are already mature in their data processes often have many of these building blocks in place individually, so the work is simply connecting them. What do you believe are the most critical building blocks below? #data #AI
Gable
Data Infrastructure and Analytics
Seattle, Washington · 6,114 followers
The collaboration, communication, and change management platform for data teams operating at scale
About us
Gable is a B2B data infrastructure SaaS that provides a collaboration platform to author and enforce data contracts. "Data contracts" refer to API-based agreements between the software engineers who own upstream data sources and the data engineers and analysts who consume that data to build machine learning models and analytics. These agreements are defined, enforced, and discovered through the Gable platform.
- Website: https://gable.ai
- Industry: Data Infrastructure and Analytics
- Company size: 11-50 employees
- Headquarters: Seattle, Washington
- Type: Privately held
- Founded: 2023
Locations
- Primary: Seattle, Washington 98101, US
Updates
-
Feeling overwhelmed by the disconnect between your team's data needs and the workflows of upstream developers? You're not alone. Many businesses struggle with this challenge, but a robust data management framework and deliberate communication can help alleviate the pain. Here's how you can ensure your framework supports your business objectives and gets upstream engineers engaged: 1. Align with Business Goals: Ensure your data management framework advances your business objectives, not just the IT department's. 2. Manage Risk: Design your framework to handle changes in data complexity so that other teams won't worry about eventually being stuck with tech debt (and won't block your proposed projects). 3. Integration: Integrate your framework with existing systems to avoid data silos; this is why we are strong believers in data quality checks being embedded within the CI/CD pipelines engineering already uses (see the sketch below this post). 4. Data Quality and Governance: Maintain high standards of data accuracy, consistency, and security, and make sure those standards align with the business rather than just stating, "This is best practice" (see point 1). 5. Preempt Challenges: Proactively address common implementation challenges before the data itself is changed. We believe data contracts are the way to do this, but if you are still early in your data maturity, an early win could be getting the data team involved in the planning and scoping phase of upstream engineering projects. Anything else you would add? ----- You can read more in our blog "The Pragmatist's Guide to Data Management Frameworks" via the following link: https://lnkd.in/g228JUuz
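As an illustration of point 3, here is a minimal, hedged sketch of what a data-quality gate embedded in an existing CI pipeline might look like. It assumes a SQLite test database and a hypothetical `orders` table with made-up business rules; it is not Gable's implementation, just the generic pattern of a script whose non-zero exit code fails the pipeline.

```python
# Hypothetical data-quality gate run as a CI step. Table name and rules are
# illustrative; any failure exits non-zero, which fails the pipeline.
import sqlite3
import sys


def run_checks(db_path: str = "test_warehouse.db") -> list[str]:
    conn = sqlite3.connect(db_path)
    failures = []

    # Business-aligned rule: every order must reference a customer.
    orphans = conn.execute(
        "SELECT COUNT(*) FROM orders WHERE customer_id IS NULL"
    ).fetchone()[0]
    if orphans:
        failures.append(f"{orphans} orders have no customer_id")

    # Validity rule: order amounts must be non-negative.
    negative = conn.execute(
        "SELECT COUNT(*) FROM orders WHERE amount_cents < 0"
    ).fetchone()[0]
    if negative:
        failures.append(f"{negative} orders have a negative amount")

    return failures


if __name__ == "__main__":
    problems = run_checks()
    for problem in problems:
        print(f"DATA QUALITY FAILURE: {problem}")
    sys.exit(1 if problems else 0)  # non-zero blocks the merge
```

In a real pipeline this would simply be one more step in the existing CI config, sitting right next to the unit tests.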
-
"Treating data like a mere resource sets companies up for failure, as data is viewed as a means to an end for the organization. Just numbers for rarely viewed dashboards, just requirements for CRUD operations, or just a cost of running a business. Yet this perspective misses three significant attributes of data: 1. Data is an asset rather than a resource. 2. Not all data is of the same value. 3. The value of a data asset changes over time. Coupled with the reality that data is in a constant state of decay, without efforts to maintain your data asset, it will slowly decrease in value. As an analogy, we can compare the “asset” of data to the asset of real estate. While the initial asset holds inherent value, it requires additional effort and upgrades to increase the value and make a profit. The opposite is also true, and much more powerful, as I’m sure you have seen the swiftness of disarray setting in among abandoned properties." This is why we strongly believe data contracts are a mechanism to shift a company's data culture. Specifically, assigning a contract to a data asset forces one to determine the value you are trying to protect and forces a conversation across the organization about the value of this data and who owns it. #data #ai ----- ?? This illustration and excerpt comes from the following article "The True Cost of Data Debt" which you can read here: https://lnkd.in/gVeyi3FR
-
You keep hearing about data contracts within the developer workflow, but how do they actually work? Below is a sneak-peek diagram from Chad and Mark's upcoming O'Reilly book on data contracts. To summarize: 1. A pull request (PR) is created in which the code changes the schema of a database. 2. The new branch kicks off CI/CD workflows in which a Docker container is generated. 3. The Docker container has a test database that provides table-name and schema metadata. 4. This metadata is used to check whether a data contract is in place and, if so, to validate against the Kafka schema registry. 5. If validation shows the schema differs from expectations, the CI/CD check fails and merging the branch into main is blocked (a rough sketch of this flow follows below). How can you see this fitting within your developer workflow? #data #dataengineering ----- Interested in learning more? You can download the early-release chapters of the O'Reilly book for free here: https://lnkd.in/gEhQeTxv
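To make steps 3-5 concrete, here is a rough, hedged sketch of that check, not the book's or Gable's actual implementation. It assumes a SQLite test database inside the CI container and a Confluent-style schema registry reachable at a hypothetical host; the table name, subject name, and database path are placeholders, and type-compatibility checking is deliberately elided (only column names are compared).

```python
# Hedged sketch of steps 3-5: read schema metadata from the container's test
# database, look up the registered contract, and fail CI on divergence.
import json
import sqlite3
import sys
import urllib.request

REGISTRY_URL = "http://schema-registry:8081"  # hypothetical CI-network host


def table_columns(db_path: str, table: str) -> set[str]:
    """Step 3: column names from the test database in the Docker container."""
    conn = sqlite3.connect(db_path)
    rows = conn.execute(f"PRAGMA table_info({table})").fetchall()
    return {row[1] for row in rows}  # row[1] is the column name


def registered_fields(subject: str) -> set[str]:
    """Step 4: field names from the latest schema registered for the subject."""
    with urllib.request.urlopen(
        f"{REGISTRY_URL}/subjects/{subject}/versions/latest"
    ) as resp:
        schema = json.loads(json.load(resp)["schema"])  # Avro schema as JSON
    return {field["name"] for field in schema.get("fields", [])}


def main() -> int:
    actual = table_columns("test_db.sqlite", "orders")
    expected = registered_fields("orders-value")
    missing = expected - actual
    if missing:
        # Step 5: a diverging schema fails the check and blocks the merge.
        print(f"Contract violation: columns removed or renamed: {sorted(missing)}")
        return 1
    print("Schema matches the registered contract.")
    return 0


if __name__ == "__main__":
    sys.exit(main())
```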
-
The data engineering team's ability to maintain high data quality and governance standards is constantly being challenged. Data inconsistencies, compliance issues, and a lack of clarity in roles can lead to significant setbacks for any organization. So, how does an already constrained team combat these challenges? 1. Establish clear expectations for data management between teams. By setting standards for data quality, governance, and compliance, everyone stays aligned and data mishaps are prevented, fostering accountability and reliability. One of the key benefits of this structured approach is its ability to ensure data consistency. Predefined rules for data handling minimize the risk of errors and discrepancies, leading to more reliable and accurate data. This is crucial for making informed business decisions and maintaining trust in your data systems. 2. Further enhance data governance by clearly defining roles and responsibilities. This identifies the data owners, data producers and consumers, and other key stakeholders, ensuring that each party understands their duties and the standards they need to uphold. That clarity reduces the likelihood of unexpected breaking changes and ensures compliance with regulatory requirements (think financial data). 3. Align data practices with strategic business goals. By integrating expectations and quality standards with key business objectives, data management practices can be directly tied to the organization's most important endeavors. Implementing these formal agreements might seem daunting, but we believe data contracts can make reaching and maintaining this state as simple as possible through automation (see the sketch below this post). Furthermore, iterations on this system happen at the PR level within existing developer workflows: simply add a step to the CI/CD config file. How has your team built buy-in for data quality? #data #dataengineering ----- You can learn more in our latest blog, "How Data Contracts Impact Data Engineering Best Practices." https://lnkd.in/gQejVfA5
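As a hedged illustration of points 1 and 2, here is one way expectations, ownership, and business purpose could be captured in a single reviewable artifact. The dataclasses, field names, and team names are hypothetical examples rather than Gable's contract format.

```python
# Hypothetical contract-as-code artifact: expectations plus clearly named
# owner and consumers, reviewable in a pull request like any other change.
from dataclasses import dataclass, field


@dataclass
class ColumnExpectation:
    name: str
    dtype: str
    nullable: bool = False


@dataclass
class DataContract:
    dataset: str
    owner: str                      # upstream team accountable for changes
    consumers: list[str]            # downstream teams to notify on changes
    business_purpose: str           # ties the contract to a business goal
    columns: list[ColumnExpectation] = field(default_factory=list)


orders_contract = DataContract(
    dataset="orders",
    owner="checkout-service-team",
    consumers=["analytics", "ml-forecasting"],
    business_purpose="Revenue reporting and demand forecasting",
    columns=[
        ColumnExpectation("order_id", "int"),
        ColumnExpectation("amount_cents", "int"),
        ColumnExpectation("created_at", "timestamp"),
        ColumnExpectation("promo_code", "string", nullable=True),
    ],
)
```

Because the contract lives in code, changes to it show up in pull requests, where owners and consumers can review them alongside the schema change that prompted them.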
-
No matter how good your data practices are... data will find a way to become a nightmare. There are simply too many edge cases to cover every possible scenario. Thus, prevention isn't about predicting the future. Instead, it's about having a system in place to quickly address new issues and then account for each newly known threat to prevent future instances. Note that we emphasize "address" rather than "alert." While awareness of an issue is important, it's not the bottleneck in resolving data quality issues. What's far more time-consuming is... a) Knowing the root cause. b) Knowing who is impacted. c) Knowing who to talk to in order to resolve the issue. This is why we are strong proponents of data contracts embedded within the developer workflow. The moment a pull request is created, it goes through CI/CD checks, and if it violates a contract, the developer instantly knows what the issue is, who it impacts, and who can help, all within their pull request (a sketch of what that could look like follows below). What's the scariest data nightmare you have faced? #data #dataengineering
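Here is a hedged sketch of how a contract violation could surface the root cause, the impacted consumers, and the owning team directly in a pull request. The contract structure, table name, and PR_NUMBER environment variable are hypothetical; real tooling would post this message back through the code host's API as a PR comment.

```python
# Hypothetical CI helper: turn a contract violation into an actionable
# message naming the root cause, impacted consumers, and owning team.
import os

CONTRACTS = {
    "orders": {
        "owner": "checkout-service-team",
        "consumers": ["analytics", "ml-forecasting"],
        "expected_columns": {"order_id", "amount_cents", "created_at"},
    }
}


def violation_report(table: str, actual_columns: set[str]) -> str | None:
    contract = CONTRACTS.get(table)
    if contract is None:
        return None  # no contract on this table, nothing to enforce
    missing = contract["expected_columns"] - actual_columns
    if not missing:
        return None
    pr = os.environ.get("PR_NUMBER", "unknown")  # set by the CI runner
    return (
        f"Data contract violation in PR #{pr} on table '{table}'\n"
        f"  Root cause: columns removed or renamed: {sorted(missing)}\n"
        f"  Impacted consumers: {', '.join(contract['consumers'])}\n"
        f"  Who to talk to: {contract['owner']}"
    )


if __name__ == "__main__":
    report = violation_report("orders", {"order_id", "created_at"})
    if report:
        print(report)  # in practice, posted back to the pull request
```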
-
Downstream data teams feel the pain of upstream data quality issues immensely... but are scared of addressing it... "This is just how things are; we can't fix this." "The upstream engineering team doesn't care." "They would never allow us to add additional CI/CD tests." Yet something interesting happens once we get an upstream engineer in the room to talk about data contracts: "Wait... the data team isn't already doing this? We can put this into our existing CI/CD pipeline? Notifications happen directly in my GitHub pull request? We should have been doing this yesterday." Despite the challenges of data quality, upstream engineers and downstream data teams are far more aligned than most think. It's the silos between transactional and analytical databases that make communicating this alignment so hard. #data #dataengineering ----- Want to learn more? Check out our article "OLTP Vs. OLAP: How Professional POVs Cause Data Problems" https://lnkd.in/g_8cHS7h
-
An introduction to the medallion architecture for data lakehouses:
Bronze:
- All the data. Is it useful data? TBD...
- It's an extremely accurate... depiction of your data swamp.
- Where you first find out that product engineers changed the schema.
Silver:
- Analogous to an Italian restaurant, as there are spaghetti DAGs everywhere.
- There is a whisper of a data model, but it's muffled by all the CASE WHENs.
- Essentially a giant game of "telephone" to replicate upstream business logic.
Gold:
- Practically speaking, the staging area for data to be replicated into Excel.
- Aggregate tables that power the CEO's dashboard (looks at it once).
- Assumes the data in the previous steps are correct...
What did we miss? #data #ai
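For anyone newer to the pattern, here is a minimal, tongue-in-cheek sketch of the bronze/silver/gold flow; the column names and cleanup rules are hypothetical, and pandas stands in for whatever lakehouse engine you actually use.

```python
# Minimal bronze -> silver -> gold illustration; data and rules are made up.
import pandas as pd

# Bronze: everything lands as-is. Useful? TBD.
bronze = pd.DataFrame(
    {"order_id": ["1", "2", "2", None], "amount": ["10.5", "oops", "3.0", "4.2"]}
)

# Silver: a whisper of a data model (drop junk rows, dedupe, fix types).
silver = (
    bronze.dropna(subset=["order_id"])
    .drop_duplicates(subset=["order_id"])
    .assign(amount=lambda df: pd.to_numeric(df["amount"], errors="coerce"))
    .dropna(subset=["amount"])
)

# Gold: the aggregate that powers the dashboard (viewed once).
gold = pd.DataFrame({"total_revenue": [silver["amount"].sum()]})
print(gold)
```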
-
Snowflake: APACHE ICEBERG. Databricks: APACHE ICEBERG. Data engineers: Wait... another technology!? The data world is not just excited about generative AI. A few days ago, Snowflake announced Polaris, a vendor-neutral open catalog implementation for Apache Iceberg. The following day, Databricks shared that it had acquired Tabular, a company founded by the team behind Apache Iceberg, for $1B+. You may be wondering what counts as a true shift in the industry versus hype, and we are here to help you navigate these new trends. A few weeks ago, we shared an article detailing the underlying architecture of Apache Iceberg and how you can use it in conjunction with data contracts. You can read more here: https://lnkd.in/gwTnmDM3 #apacheiceberg
-
Data engineers are given limited resources to solve problems that impact everyone in the company. As a result, the team becomes a bottleneck through no fault of their own. Think about it. The work of data engineers impacts the following teams: - Software engineers who need to integrate their events into the DB. - Data scientists who rely on the data to develop ML models. - ML Engineers who need to surface model predictions to the product. - Leaders who are looking at key dashboards built on the data warehouse. - Data analysts who are building reports for key decisions. Despite their wide impact, the data engineering team is often the smallest in the technology department. Think about your own company. How many software engineers, and how many data engineers? There are many tools to help data engineers scale their work downstream in the data lake and warehouse. Yet, going upstream, there is little visibility—the same area where substantially more staff are making changes. This is why we focus on "shifting data left" to help scale the data engineer's work before the bottleneck. How have you seen this in your own work? #data #dataengineering