Technology Landscape to Democratize Data

Every organization today is aspiring to become data-driven! Organizations that are data-driven are 162% more likely to surpass revenue goals and 58% more likely to beat these goals compared to their non-data-driven counterparts.

Most organizations today are data-rich but information-poor.

This is because extracting insights today is bottlenecked by the specialized data talent required to make data consistent, interpretable, accurate, timely, standardized, and sufficient. One of the key goals of data platform modernization (especially by leveraging the cloud) is to make data self-service for both technical users (data scientists, analysts) and business users (marketers, product managers).

While there are 100s of tools and frameworks popping up in a rapidly evolving data landscape, architects and technology leaders find it extremely difficult to navigate the plethora of technologies that are all positioned as the "next silver bullet."

Teams often get attracted to “shiny new technologies” instead of finding the right-fit building blocks for self-service data based on their current use cases, processes, technology, team skills, and data literacy.

This post helps you understand the technology landscape. My approach is to view the technology landscape in the context of the data user’s journey map as they transform raw data into insights. This post (hopefully) becomes your starting point to correlate the needs and pain points of your data users to available technologies in the landscape.

Addressing the specific needs of your data users and making them self-service is your golden path to becoming data-driven. Unfortunately, there are no shortcuts or one-size-fits-all solutions!

The rest of the blog gets into the details of the landscape of available technologies to make data self-service.

Journey map from Raw Data to Insights to Impact

The journey map for any data insights can be divided into four key phases: discover, prep, build, and operationalize.

One of the primary goals of data platform modernization is to minimize the Time to Insight (ToI) for both technical and business users.

ToI represents the time it takes to turn raw data (e.g., a customer support call log) into an insight (e.g., products with the highest call volume, least satisfaction, geographic distribution). Minimizing ToI requires automating the various tasks within the journey map, as engineering complexity today limits accessibility for data users. Even advanced technical users such as data scientists spend a significant amount of time on engineering activities related to aligning systems for data collection, defining metadata, wrangling data to feed ML algorithms, deploying pipelines and models at scale, and so on. These activities are outside of their core insight-extracting skills and are bottlenecked by dependency on data engineers and platform IT engineers who typically lack the necessary business context.

Several enterprises have identified the need to automate and democratize the journey by making data self-service. Google’s TensorFlow Extended (TFX), Uber’s Michelangelo, Facebook’s FBLearner Flow, and Airbnb’s Bighead are examples of self-service data+ML platforms.

None of these frameworks is a silver bullet, and none can be applied as-is. The right solution for an organization depends on use cases, type of data, existing technology building blocks, dataset quality, processes, data culture, and people skills.

The journey map can be further broken into 12 milestones:

  1. Find: Discover the existing datasets along with their metadata details
  2. Aggregate: Collect new structured, semi-structured, or unstructured data from applications and third-party sources
  3. Standardize: Re-use standardized metrics that provide a single source of truth across insights
  4. Wrangle: Cleanse and transform the data for building reliable insights
  5. Govern: Ensure that data usage and access are compliant
  6. Model: Manage the global data namespace to effectively update and share data
  7. Process: Analyze data across multiple data stores
  8. Visualize: Build dashboards and reports for visual analysis and information extraction
  9. Orchestrate: Set up end-to-end transformation pipelines from raw data into insights
  10. Deploy: Continuously integrate and roll out transformation pipelines
  11. Observe: Proactively monitor the performance, cost, and quality of data applications and pipelines
  12. Experiment: Run A/B tests to ensure insights lead to the right business impact

Automating each milestone should be considered as crawl, walk, run.

It is important to treat self-service as having multiple levels, analogous to different levels of self-driving cars that vary in terms of the levels of human intervention required to operate them. For instance, a level-2 self-driving car accelerates, steers, and brakes by itself under driver supervision, while level 5 is fully automated and requires no human supervision.

The rest of the post walks through each of the milestones and the 2022 technology landscape that supports it.

Discover Phase

1. Find datasets

As data grows and teams scale, silos are created across business lines, leading to no single source of truth. Data users today need to effectively navigate a sea of data resources of varying quality, complexity, relevance, and trustworthiness.

The landscape of existing technologies focuses on data catalogs, metadata management, and crawling of datasets. The technologies aim to automate the following key tasks within this milestone:

  • Locating new datasets: Locating datasets and artifacts across silos is currently an ad-hoc process; team knowledge in the form of cheat sheets, wikis, anecdotal experiences, and so on is used to get information about datasets and artifacts.
  • Tracking Lineage: For a given dataset, lineage includes all the dependent input tables, derived tables, and output models and dashboards. It includes the jobs that implement the transformation logic to derive the final output. For example, if a job J reads dataset D1 and produces dataset D2, then the lineage metadata for D1 contains D2 as one of its downstream datasets, and vice versa (a minimal catalog-entry sketch follows this list).
  • Extracting technical metadata of datasets: Technical metadata is extracted by crawling the individual data source without necessarily correlating across multiple sources. Technical metadata includes logical and physical metadata: physical metadata covers details related to physical layout and persistence, such as creation and modification timestamps, physical location and format, storage tiers, and retention details. Logical metadata includes dataset schema, data source details, the process of generating the dataset, and owners and users of the dataset.
  • Tracking Operational metadata: Tracking execution stats that capture the completion times, data processed, and errors associated with the pipelines. It also covers the state of the dataset in terms of compliance, personally identifiable information (PII) data fields, data encryption requirements, and so on. Operational metadata is not generated by connecting to the data source but rather by stitching together the metadata state across multiple systems.
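
To make the metadata types above concrete, here is a minimal sketch (in Python) of what a single catalog entry combining technical, operational, and lineage metadata might look like. The field names are hypothetical; production catalogs such as Amundsen, DataHub, or Atlas have much richer models.

    from dataclasses import dataclass, field
    from typing import List, Optional

    @dataclass
    class DatasetEntry:
        """Hypothetical catalog record combining the metadata types above."""
        name: str                              # e.g., "curated.call_metrics"
        # Technical metadata (crawled from the data source)
        schema: dict                           # column name -> type
        location: str                          # physical path or table URI
        owner: str
        created_at: str
        # Operational metadata (stitched together from pipeline runs)
        last_run_status: Optional[str] = None  # "success" / "failed"
        rows_processed: Optional[int] = None
        pii_fields: List[str] = field(default_factory=list)
        # Lineage edges (upstream inputs, downstream outputs)
        upstream: List[str] = field(default_factory=list)
        downstream: List[str] = field(default_factory=list)

    # Job J reads raw.support_calls (D1) and produces curated.call_metrics (D2),
    # so D1 records D2 as downstream and D2 records D1 as upstream.
    d1 = DatasetEntry(name="raw.support_calls", schema={"call_id": "string"},
                      location="s3://lake/raw/support_calls", owner="support-team",
                      created_at="2022-01-15", downstream=["curated.call_metrics"])
    d2 = DatasetEntry(name="curated.call_metrics", schema={"product": "string", "calls": "bigint"},
                      location="s3://lake/curated/call_metrics", owner="analytics",
                      created_at="2022-01-16", upstream=["raw.support_calls"])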

Depending on your current state, look for solutions in the landscape that get you to the next higher level in the crawl, walk, run hierarchy. Start with a basic data and metadata catalog and add capabilities incrementally.

2. Aggregate data

Datasets from applications and transactional sources sit across different application silos and need to be moved into a centralized repository like a data lake or data mesh. Moving data involves orchestrating the data movement across heterogeneous systems, verifying data correctness, and adapting to any schema or configuration changes that occur on the data source.

The landscape of existing technologies focuses on EL (Extract-Load) of the traditional ELT process by simplifying CDC (Change Data Capture), periodic batch ingestion, or real-time streaming event ingestion. The technologies aim to automate the following key tasks within this milestone:

  • Ingesting data from data stores/apps: Data must be read from the source datastore and written to a target datastore/data lake/data mesh. A technology-specific adapter is required to read and write the data from and to the datastore.
  • Managing source schema changes: After the initial configuration, changes to the schema and configuration can occur at the source and target datastores. The goal is to automatically manage schema changes such that the downstream analytics are not impacted by the change. In other words, we want to reuse existing queries against evolving schemas and avoid schema mismatch errors during querying. There are different kinds of schema changes, such as renaming columns; adding columns at the beginning, middle, or end of the table; removing columns; reordering columns; and changing column data types.
  • Ensuring compliance for PII (Personally Identifiable Information): Before the data can be moved across systems, it must be verified for regulatory compliance. For example, if the source datastore is under regulatory compliance laws like PCI, the data movement must be documented with clear business justification. For data with PII attributes, those attributes must be encrypted in transit and on the target datastore. Data rights laws further limit the data that can be moved from source datastores for analytics.
  • Verifying data quality of ingested data: Data movement needs to ensure that source and target are in parity (a minimal parity-check sketch follows this list). In real-world deployments, quality errors can arise for a multitude of reasons, such as source errors, adapter failures, aggregation issues, and so on. Monitoring of data parity during movement is a must-have to ensure that data quality errors don’t go unnoticed and impact the correctness of business metrics and ML models. During data movement, data at the target may not exactly resemble the data at the source. The data at the target may be filtered, aggregated, or a transformed view of the source data. For instance, if the application data is sharded across multiple clusters, a single aggregated materialized view may be required on the target. Transformations need to be defined and verified before deploying in production.
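
As a simple illustration of the parity check mentioned above, the sketch below compares row counts and an order-insensitive checksum between a source and a target table. It assumes plain Python DB-API cursors, and the table and key column names are illustrative; when the target is a filtered or aggregated view, the comparison would be against the expected transformed result instead.

    import hashlib

    def table_fingerprint(cursor, table, key_column):
        """Return (row_count, checksum) for a table. The checksum XORs per-row
        hashes of the key column, so it is insensitive to physical row order."""
        cursor.execute(f"SELECT COUNT(*) FROM {table}")
        row_count = cursor.fetchone()[0]
        cursor.execute(f"SELECT {key_column} FROM {table}")
        digest = 0
        for (key,) in cursor.fetchall():
            digest ^= int(hashlib.md5(str(key).encode()).hexdigest(), 16)
        return row_count, digest

    def verify_parity(source_cursor, target_cursor, table="orders", key_column="order_id"):
        source = table_fingerprint(source_cursor, table, key_column)
        target = table_fingerprint(target_cursor, table, key_column)
        if source != target:
            raise ValueError(f"Parity check failed for {table}: source={source}, target={target}")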

Depending on your current state, look for solutions in the landscape that get you to the next higher level in the crawl, walk, run hierarchy. Start with traditional batch ingestion and move to the higher level based on use-case needs and data model maturity.

3. Standardize Metrics

The goal is to provide a repository of well-documented, governed, versioned, and curated metrics. Data users should be able to search and use metrics to build dashboards and models with minimal data engineering.

The landscape of existing technologies focuses on providing a Metrics Store. This is also similar to a Feature Store for standardizing features used in ML projects. The technologies aim to automate the following key tasks within this milestone:

  • Standardizing metrics computation: Pipelines extract the data from the source datastores and transform them into metrics. These pipelines have multiple transformations and need to handle corner cases that arise in production. Managing large datasets at scale requires distributed programming optimizations for scaling and performance.
  • Backfilling metrics: Whenever the business definition changes, a data backfill is required for calculating the new metric values for historic data.
  • Tracking business definitions: A version-controlled repository of business definitions ensures there is a single source of truth. Instead of implementing one-off and inconsistent metrics, a Domain Specific Language (DSL) is used to define metrics and dimensions. The DSL gets translated into an internal representation, which can then be compiled into SQL and other big data programming languages. This approach makes adding new metrics lightweight and self-service for a broad range of users (a minimal DSL-to-SQL sketch follows this list).
  • Serving metrics offline and online: Metrics can be consumed offline (batch computed) or online (computed in real-time).
  • Cataloging business vocabulary: Organizing data objects and metrics in a business-intuitive hierarchy, along with the business rules used to generate the metrics from the raw datasets.
  • Ensuring operational robustness: Handling scenarios such as uncoordinated source schema changes, changes in data element properties, ingestion issues, source and target systems with out-of-sync data, processing failures, and incorrect business definitions for generating metrics. Automation tracks changes to metrics over time and alerts metric owners when things change. If the lineage or definition of a metric changes, the owner is immediately notified and can see why and how the change occurred. If a spike or a dip in the data is present, the owner already knows it happened and has answers for stakeholders, with no more tracking down anomalies in the data warehouse.
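
The sketch below illustrates the DSL idea from the "Tracking business definitions" item: a declarative, version-controlled metric definition is compiled into SQL. The definition format and the compiler are hypothetical; real metrics layers (e.g., dbt metrics or MetricFlow) also handle dimensions, joins, and dialect differences.

    # Hypothetical declarative metric definition, kept in a version-controlled repository.
    metric = {
        "name": "weekly_active_users",
        "description": "Distinct users with at least one session in the week",
        "source_table": "analytics.sessions",
        "measure": "COUNT(DISTINCT user_id)",
        "time_column": "session_date",
        "filters": ["is_bot = FALSE"],
    }

    def compile_to_sql(definition, grain="week"):
        """Translate the declarative definition into a SQL query."""
        where = " AND ".join(definition["filters"]) or "TRUE"
        return (
            f"SELECT DATE_TRUNC('{grain}', {definition['time_column']}) AS period, "
            f"{definition['measure']} AS {definition['name']} "
            f"FROM {definition['source_table']} WHERE {where} GROUP BY 1"
        )

    print(compile_to_sql(metric))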

Depending on your current state, look for solutions in the landscape that get you to the next higher level in the crawl, walk, run hierarchy. Start with tracking business definitions and move higher from there.

Prep Phase

4. Wrangle

Data is seldom in the exact form required for consumption. It needs to be transformed via an iterative process of wrangling: correcting errors and outliers, handling missing and imbalanced data, imputing values, and encoding data. Applying wrangling transformations requires writing idiosyncratic scripts in programming languages like Python, Perl, and R, or engaging in tedious manual editing using tools like Microsoft Excel. Given the growing volume, velocity, and variety of the data, data users need low-level coding skills to apply these transformations at scale in an efficient, reliable, and recurring fashion. They also need to operate the transformations reliably on a day-to-day basis and proactively prevent transient issues from impacting data quality.

The landscape of existing technologies focuses on simplifying the process of wrangling for technical and business users. The technologies aim to automate the following key tasks within this milestone:

  • Scoping: The metadata catalog is used to understand the properties of data and schema and the wrangling transformations required for analytic explorations. It is difficult for non-expert users to determine the required transformations. The process also involves record matching—i.e., finding the relationship between multiple datasets, even when those datasets do not share an identifier or when the identifier is not very reliable.
  • Validating: There are multiple dimensions of validation, including verifying whether the values of a data field adhere to syntactic constraints like Boolean true/false as opposed to 1/0. Distributional constraints verify value ranges for data attributes. Cross-attribute checks verify cross-database referential integrity—for example, a credit card updated in the customer database being correctly updated in the subscription billing database.
  • Structuring: Data comes in all shapes and sizes. There are different data formats that may not match the requirements for downstream analysis. For example, a customer shopping transaction log may have records with one or more items while individual records of the purchased items might be required for inventory analysis. Another example is standardizing particular attributes like zip codes, state names, and so on. Similarly, ML algorithms often do not consume data in raw form and typically require encoding, such as categories encoded using one-hot encoding.
  • Cleaning: There are different aspects of cleaning. The most common form is removing outliers, missing values, null values, and imbalanced data that can distort the generated insights. Cleaning requires knowledge about data quality and consistency—i.e., knowing how various data values might impact your final analysis. Another aspect is the deduplication of records within the dataset.
  • Enriching: This involves joining with other datasets, such as enriching customer profile data. For instance, agricultural firms may enrich production predictions with weather forecasts. Another aspect is deriving new forms of data from the dataset (a minimal pandas sketch of these wrangling steps follows this list).
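
A minimal pandas sketch of the wrangling steps above (cleaning, structuring, and enriching) is shown below. The column names and thresholds are illustrative only; at larger scale the same logic would typically run on a distributed engine such as Spark.

    import pandas as pd

    def wrangle(transactions: pd.DataFrame, weather: pd.DataFrame) -> pd.DataFrame:
        df = transactions.copy()
        # Cleaning: deduplicate, fill missing values, clip outliers
        df = df.drop_duplicates(subset=["transaction_id"])
        df["amount"] = df["amount"].fillna(df["amount"].median())
        df["amount"] = df["amount"].clip(upper=df["amount"].quantile(0.99))
        # Structuring: standardize attributes and one-hot encode categories
        df["state"] = df["state"].str.strip().str.upper()
        df = pd.get_dummies(df, columns=["payment_type"])
        # Enriching: join with an external dataset (e.g., weather by date and region)
        return df.merge(weather, on=["date", "region"], how="left")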

Depending on your current state, look for solutions in the landscape that get you to the next higher level in the crawl, walk, run hierarchy. Start with basic exploratory tools and then move higher to interactive and ML-based recommendations.

5. Govern

There is a growing number of data rights regulations like GDPR, CCPA, Brazilian General Data Protection Act, India Personal Data Protection Bill, and several others. These laws require customer data to be collected, used, and deleted based on their preferences. Data scientists and other data users want an easy way to locate all the available data for a given use case without having to worry about compliance violations. Data engineers have to ensure they have located all the customer data copies correctly and execute the rights of users in a comprehensive, timely, and auditable fashion. In addition to compliance, ensuring the right level of access control for the datasets is required.

The landscape of existing technologies focuses on simplifying data discovery, classification, access control, and enforcement of data rights requests. They aim to automate the following tasks:

  • Enforcing data rights for data deletion: Delete personal data from backups and third parties when it’s no longer necessary or when consent has been withdrawn. You need the ability to delete a specific subset of data or all data associated with a specific customer from all systems. Given immutable storage formats, erasing data is difficult and requires an understanding of formats and namespace organization. The deletion operation has to cycle through the datasets asynchronously within the compliance SLA without affecting the performance of running jobs. Records that can’t be deleted need to be quarantined and the exception records need to be manually triaged. This processing needs to scale to tens of PBs of data as well as for third-party data.
  • Enforcing customer preferences for the usage of data: Manage the customer’s preferences for which data is collected, behavioral data tracking, approved use cases, and “do not sell” requests. There are different levels of access restrictions, ranging from basic restriction (where access to a dataset is based on business need), to privacy by default (where data users shouldn’t get raw PII access by default), to consent-based access (where access to data attributes is only available if the user has consented to the particular use case); a minimal consent-check sketch follows this list.
  • Tracking the lifecycle of customer data: This includes tracking how data is collected from the customer, how the data is stored and identified, how the customer preferences are persisted, how data is shared with third-party processors, and how data is transformed by different pipelines.
  • Continuously discovering usage violations: For example, datasets containing PII or highly confidential data that are incorrectly accessible to specific data users or specific use cases. Also, discover datasets that have not been purged within the SLA, as well as access requests from unauthorized roles.
  • Ensuring Authentication, Access control, and Audit tracking: Independent of compliance requirements, ensuring that data access is secure and verified is critical, as is creating audit rules.
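
The sketch below illustrates consent-based access in its simplest form: before a PII attribute is released for a given use case, the request is checked against the customer's recorded preferences. The record structure and purpose names are hypothetical; real platforms enforce this in the query or access layer.

    from dataclasses import dataclass

    @dataclass
    class ConsentRecord:
        customer_id: str
        allowed_purposes: set          # e.g., {"billing", "support"}
        do_not_sell: bool = False

    def can_access(consent: ConsentRecord, attribute_is_pii: bool, purpose: str) -> bool:
        """Privacy by default: non-PII falls back to role-based checks elsewhere;
        PII access additionally requires consent for the specific purpose."""
        if not attribute_is_pii:
            return True
        if purpose == "sale" and consent.do_not_sell:
            return False
        return purpose in consent.allowed_purposes

    consent = ConsentRecord("cust-42", allowed_purposes={"billing", "support"})
    assert can_access(consent, attribute_is_pii=True, purpose="support")
    assert not can_access(consent, attribute_is_pii=True, purpose="marketing")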

Depending on your current state, look for solutions in the landscape that get you to the next higher level in the crawl, walk, run hierarchy.

6. Model

While data lakes have become popular as central data repositories, they lack support for traditional data modeling and life cycle management tasks. Today, multiple workarounds need to be built, leading to several pain points. First, primitive data life cycle tasks have no automated APIs and require engineering expertise for reproducibility and rollback, provisioning data-serving layers, and so on. Second, application workarounds are required to accommodate the lack of consistency in the lake for concurrent read-write operations. Also, incremental updates, such as deleting a customer’s records for compliance, are highly nonoptimized. Third, unified data management combining stream and batch is not possible.

Solutions in this space address these pain points and simplify the organization, sharing, and data management tasks. In particular, they automate the following tasks:

  • Organizing namespace zones: Within a data lake, zones allow the logical and/or physical separation of data. The namespace can be organized into many different zones based on the current workflows, data pipeline process, and dataset properties. Most enterprises use some variation of a standard zone layout (for example, raw, cleansed, and curated zones) to keep the lake secure, organized, and agile.

  • Managing specialized data-serving layers: Data persisted in the lake can be structured, semi-structured, or unstructured. For semi-structured data, there are different data models such as key-value, graph, document, and so on. Depending on the data model, an appropriate datastore should be leveraged for optimal performance and scaling.
  • Data rollback and versioning: Data pipelines write bad data for downstream consumers because of issues ranging from infrastructure instabilities to messy data to bugs in the pipeline. For pipelines with simple appends, rollbacks are addressed by date-based partitioning. When updates and deletes to previous records are involved, rollback becomes very complicated, requiring data engineers to deal with such complex scenarios. Versioning is required for exploration, model training, and resolving corruption due to failed jobs that have left data in an inconsistent state, resulting in a painful recovery process.
  • Managing data partitions: Large tables are inefficient to analyze. Partitioning helps shard the data based on attributes such as time to allow distributed processing.
  • Incremental data updates: Big data formats were originally designed for immutability. With the emergence of data rights compliance, where customers can request that their data be deleted, updating lake data has become a necessity. Because of the immutability of big data formats, deleting a record translates to reading all the remaining records and writing them into a new partition. Given the scale of big data, this can create significant overhead. A typical workaround today is to create fine-grained partitions to speed up the rewriting of data. Solutions such as Databricks Delta Lake, Apache Hudi (Hadoop Upserts Deletes and Incrementals), and Apache Iceberg enable applying mutations to data in the lake on the order of a few minutes (see the sketch after this list).
  • ACID transactions: Implementing atomicity, consistency, isolation, durability (ACID) transactions on the data lake. Today, this is accomplished by painful workarounds.
  • Data sharing: Sharing datasets both internally and externally is an operational nightmare requiring specialized data teams.
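
As an illustration of the incremental-update and versioning tasks above, the PySpark sketch below assumes a Delta Lake table at a hypothetical path; Hudi and Iceberg expose comparable operations. It is a sketch only, and assumes a SparkSession already configured with the Delta Lake packages.

    from pyspark.sql import SparkSession
    from delta.tables import DeltaTable

    # Assumes the session is configured with the Delta Lake packages and extensions.
    spark = SparkSession.builder.appName("lake-management").getOrCreate()
    path = "s3://lake/cleansed/customers"      # hypothetical table location

    # Incremental update: delete one customer's records for a data-rights request
    table = DeltaTable.forPath(spark, path)
    table.delete("customer_id = 'cust-42'")

    # Versioning / rollback: read the table as it was before the change
    previous = spark.read.format("delta").option("versionAsOf", 11).load(path)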

Depending on your current state, look for solutions in the landscape that get you to the next higher level in the crawl, walk, run hierarchy. Namespace management within the data lake is the essential starting point for most organizations.

Build Phase

7. Process

This milestone focuses on effectively executing SQL queries and big data programs at large scale. There are three trends that need to be taken into account for the self-service processing of data. First is the polyglot data models associated with the datasets. For instance, graph data is best persisted and queried in a graph database. Similarly, there are other models, namely key-value, wide-column, document, and so on. Polyglot persistence is applicable both for lake data as well as application transactional data. Second, the decoupling of query engines from data storage persistence allows different query engines to run queries on data persisted in the lake. For instance, short, interactive queries are run on Presto clusters, whereas long-running batch processes run on Hive or Spark. Typically, multiple processing clusters are configured for different combinations of query workloads. Selecting the right cluster types is key. Third, for a growing number of use cases like real-time BI, the data in the lake is joined with the application sources in real time. As insight generation becomes increasingly real-time, there is a need to combine historic data in the lake with real-time data in application datastores.

The landscape of solutions focuses on specialized query engines, unifying batch and stream processing, and processing structured, semi-structured, and unstructured data across polyglot stores. The solutions automate the following tasks related to this milestone:

  • Simplify running query processing engines (batch, interactive, streaming, real-time): Allows running the transformation logic as batch, streaming, or interactive jobs depending on the requirements of the use case. This involves automating the routing of queries across clusters and query engines. The routing is based on tracking the static configuration properties (such as the number of cluster nodes and hardware configuration, namely CPU, disk, storage, and so on) as well as the dynamic load on the existing clusters (average wait time, distribution of query execution times, and so on).
  • Provide federated query support: Analyzing and joining data residing across different datastores in the lake as well as application microservices. Data is typically spread across a combination of relational databases, nonrelational datastores, and data lakes. Some data may be highly structured and stored in SQL databases or data warehouses. Other data may be stored in NoSQL engines, including key-value stores, graph databases, ledger databases, or time-series databases. Data may also reside in the data lake, stored in formats that may lack schema or that may involve nesting or multiple values (e.g., Parquet and JSON).
  • Execute batch+streaming queries: As insights become real-time and predictive, they need to analyze both the ongoing stream of updates and historic data tables. Data users can access the combined streaming and batch data using existing queries with time-window functions. This allows for processing data continuously and incrementally as new data arrives, without having to choose between batch and streaming. Streaming data ingest, batch historic backfill, and interactive queries all need to be supported. Another pattern is appending streaming events to batch tables, allowing data users to simply leverage the existing queries on the table (a minimal Structured Streaming sketch follows this list).
  • Scaling the processing logic: Data users are not engineers. There is a learning curve to efficiently implementing the primitives (aggregates, filters, group by, etc.) in a scalable fashion across different systems. To increase productivity, there is a balance required between low-level and high-level business logic specifications. The low-level constructs are difficult to learn, while the high-level constructs need to be appropriately expressive.
  • Search and analyze unstructured and semi-structured data: Support for log analytics involving searching, analyzing, and visualizing machine data.
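
The batch+streaming pattern described above can be sketched with Spark Structured Streaming: an ongoing event stream is joined with a historic batch table and aggregated over time windows. Paths, schema, and the console sink are illustrative only.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("batch-plus-stream").getOrCreate()

    # Historic (batch) dimension table in the lake
    products = spark.read.parquet("s3://lake/curated/products")

    # Ongoing stream of click events (JSON files landing in a directory, for simplicity)
    clicks = (
        spark.readStream
        .schema("event_time TIMESTAMP, product_id STRING, user_id STRING")
        .json("s3://lake/raw/clicks")
    )

    # Join the stream with the batch table and aggregate over 10-minute windows
    counts = (
        clicks.join(products, "product_id")
        .groupBy(F.window("event_time", "10 minutes"), "category")
        .count()
    )

    query = (
        counts.writeStream
        .outputMode("update")
        .format("console")     # in practice, a table or serving store for dashboards
        .start()
    )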

Depending on your current state, look for solutions in the landscape that get you to the next higher level in the crawl, walk, run hierarchy.

8. Visualize

Visualization is a key approach for analysis and decision-making, especially for business users. Visualization tools come with a few challenges that make self-service difficult. First, visualization is difficult given multiple dimensions and growing scale. For large datasets, enabling rapid linked selections like dynamic aggregate views is challenging. Second, different types of visualizations are best suited to different forms of structured, semi-structured, and unstructured data. Too much time is spent manipulating data just to get analysis and visualization tools to read it. Third, it is not easy for visualization tools to help reason with dirty, uncertain, or missing data. Automated methods can help identify anomalies, but determining the error is context-dependent and requires human judgment. While visualization tools can facilitate this process, analysts must often manually construct the necessary views to contextualize anomalies, requiring significant expertise.

Available tools in the landscape aim to address these pain points by providing no-code features for data slice-&-dice, descriptive and statistical analysis, ML-based predictive analysis, and forecasting. They automate the following tasks related to this milestone:

  • Visual representation: Enabling data analysis with charts, graphs, histograms, and other visual representations.
  • Visual storytelling: Using visual storytelling to share, communicate, and collaborate on insights in the flow of analysis.
  • Reporting: Sharing data analysis to stakeholders for decision-making.
  • Descriptive analytics: Using preliminary data analysis to find out what happened.
  • Statistical analysis: Exploring the data using statistics such as how this trend happened and why.
  • NLP-based summarization: Using natural language processing to match data fields and attributes and describe the contents in a data source. This can help teams understand what the data is telling them, reducing incorrect assumptions.

Depending on your current state, look for solutions in the landscape that get you to the next higher level in the crawl, walk, run hierarchy.

9. Orchestrate

Orchestrating job pipelines for data processing and ML has several pain points. First, defining and managing dependencies between the jobs is ad hoc and error-prone. Data users need to specify these dependencies and version-control them through the life cycle of the pipeline evolution. Second, pipelines invoke services across ingestion, preparation, transformation, training, and deployment. Monitoring and debugging pipelines for correctness, robustness, and timeliness across these services are complex. Third, the orchestration of pipelines is multitenant, supporting multiple teams and business use cases. Orchestration is a balancing act in ensuring pipeline SLAs and efficient utilization of the underlying resources.

The solutions in the technology landscape aim to simplify the dependency authoring of pipelines and make the execution self-healing. They aim to automate the following tasks associated with this milestone:

  • Authoring job dependencies: In an end-to-end pipeline, the job dependencies are represented as a DAG (Directed Acyclic Graph). Missing dependencies can lead to incorrect insights and are a significant challenge in production deployments. Dependencies change along with the code and are difficult to version-control; moreover, even when a dependent job has completed, it may have failed to process the data correctly. In addition to knowing the dependent jobs, production deployments need ways to verify the correctness of the previous steps (i.e., they need circuit breakers based on data correctness). The job dependencies are not constant but evolve during the pipeline life cycle. For instance, a change in the dashboard may create a dependency on a new table that is being populated by another job. The dependency needs to be updated appropriately to reflect the dependency on the new job (a minimal Airflow sketch follows this list).
  • Orchestrating synchronous and asynchronous job types: Orchestrating a wide variety of specialized jobs, such as ingestion, real-time processing, ML constructs, and so on. Deep integration to service-specific APIs can improve job execution and monitoring compared to executing as a vanilla shell request.
  • Checkpointing job execution: For long-running jobs, checkpointing can help recover the jobs instead of restarting. Checkpointing can also help reuse previous results if the job is invoked without any change in data. Typically, if there are long-running jobs with strict SLAs, checkpointing is a must-have.
  • Scaling of resources: The hardware resources allocated to the orchestrator should be able to auto-scale based on the queue depth of the outstanding requests. This is typically applicable in environments with varying numbers and types of pipelines such that static cluster sizing is either not performant or wasteful with respect to resource allocation.
  • Automatic audit and backfill: Configuration changes associated with the pipeline orchestration, such as editing connections, editing variables, and toggling workflows, need to be saved to an audit store that can later be searched for debugging. For environments with evolving pipelines, a generic backfill feature will let data users create and easily manage backfills for any existing pipelines.
  • Distributed execution: Jobs are executed on a distributed cluster of machines allocated to the orchestrator. The pipeline DAGs are continuously evaluated. Applicable jobs across multiple tenants are then queued up for execution and scheduled in a timely fashion to ensure SLAs. The orchestrator scales the underlying resources to match the execution needs. The orchestrator does the balancing act of ensuring pipeline SLAs, optimal resource utilization, and fairness in resource allocation across tenants. Distributed resource management is time-consuming due to a few challenges. First, ensuring isolation across multiple tenants such that a slowdown in one of the jobs doesn’t block other unrelated jobs on the same cluster. Second, as the number of pipelines increases, a single scheduler becomes the bottleneck, causing long wait times for the jobs to be executed. Having an approach to partition the jobs across parallel schedulers allows scaling across the available resources. Third, given the heterogeneous nature of the jobs, there’s a need to leverage a range of custom executors for data movement, schema services, processing, and ML tasks. In addition to resource management, job execution needs to handle appropriate retries for job execution errors, and jobs need to be recovered when machines crash. Finally, the execution needs to fail over and continue with the appropriate leader election. Remembering the state of the pipeline for restart is critical.
  • Production monitoring and alerting: Upon deployment of the pipeline in production, it needs to be monitored to ensure SLAs as well as to proactively alert on issues. In production, several issues can arise, from job errors to underlying hardware problems. Detecting these proactively is critical to meeting SLAs. Trend analysis is used to uncover anomalies proactively, and fine-grained monitoring combined with logging helps distinguish between a long-running job and a stalled job that is not making progress due to errors. Debugging for root-cause analysis requires understanding and correlating logs and metadata across multiple systems.
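
The Apache Airflow sketch below shows job dependencies authored as a version-controlled DAG, including a data-correctness "circuit breaker" step; the task bodies are placeholders and the schedule is illustrative.

    from datetime import datetime
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def ingest(): ...             # pull raw data into the lake
    def validate(): ...           # circuit breaker: fail fast if counts/checksums look wrong
    def transform(): ...          # build the derived table
    def refresh_dashboard(): ...  # update the downstream dashboard

    with DAG(
        dag_id="support_calls_pipeline",
        start_date=datetime(2022, 1, 1),
        schedule_interval="@daily",   # newer Airflow versions use `schedule` instead
        catchup=False,
    ) as dag:
        t_ingest = PythonOperator(task_id="ingest", python_callable=ingest)
        t_validate = PythonOperator(task_id="validate", python_callable=validate)
        t_transform = PythonOperator(task_id="transform", python_callable=transform)
        t_dashboard = PythonOperator(task_id="refresh_dashboard", python_callable=refresh_dashboard)

        # Explicit, version-controlled dependencies
        t_ingest >> t_validate >> t_transform >> t_dashboard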

Depending on your current state, look for solutions in the landscape that get you to the next higher level in the crawl, walk, run hierarchy. Focusing on simplifying dependency authoring and tracking is the key starting point.

Operationalize Phase

10. Deploy

Data applications and pipelines continuously evolve to accommodate changing data schemas and business logic. The process of integrating, testing, deploying, and monitoring after deployment is often bottlenecked by the data platform team. One of the growing trends (known as “reverse ETL”) is to deploy the processed insights from a system of record like a warehouse to systems of action like CRM, ERP, and other SaaS apps to operationalize the data.

The tools in the technology landscape aim to automate the following tasks:

  • Testing ETL changes: Feature pipelines are written as ETL code that reads data from different data sources and transforms them into features. ETL code evolves continuously. Some of the common scenarios are moving to new versions of data processing frameworks like Spark, rewriting from Hive to Spark to improve performance, changes to the source schema, and so on. The ETL changes need to be validated for correctness using a comprehensive suite of unit, functional, regression, and integration tests. These tests ensure the pipeline code is robust and operates correctly for corner cases. As a first step in the integration process, unit tests and a golden test suite of integration tests are run. These are also referred to as smoke tests, as they compare the results of sample input-output data. Ideally, integration tests should use actual production data to test both robustness and performance. Often, scaling issues or inefficient implementations go undetected until production. Today, tests can be written as a part of the code or managed separately. Additionally, if the features are consumed for generating a business metrics dashboard, the data users need to verify the correctness of the results (this is known as user acceptance testing). The approach today is ad hoc, and the validation is typically done using small samples of data that aren’t representative of production data.
  • Validating schema change impact: Data source owners make changes to their source schema and typically do not coordinate with downstream ML pipeline users. These issues are typically detected in production and can have a significant impact. As a part of the change tracking, source schema changes need to be detected and trigger the continuous integration service to validate the effect of these changes proactively.
  • Creating sandbox test environments: Spinning up multiple concurrent environments to smoke-test code changes to data applications and pipelines.
  • Supporting shadow mode deployment: This mode captures the inputs and inference of a new pipeline logic in production without actually serving the insights. The results can be analyzed with no significant consequences if a bug is detected.
  • Canary model deployment: Canary testing allows you to validate a new release with minimal risk by deploying it first for a fraction of your users. It requires mature deployment tooling, but it minimizes the impact of mistakes when they happen. The incoming requests can be split in many ways to determine whether they will be serviced by the old or new model: randomly, based on geolocation or specific user lists, and so on. There is a need for stickiness—i.e., for the duration of the test, designated users must be routed to servers running the new release. This can be achieved by setting a specific cookie for these users, allowing the web application to identify them and send their traffic to the proper servers (a minimal routing sketch follows this list).
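
The sketch below shows deterministic canary routing with stickiness: hashing the user ID (rather than drawing randomly per request) keeps a given user on the same release for the duration of the test. The 5% rollout fraction is illustrative.

    import hashlib

    def route(user_id: str, canary_fraction: float = 0.05) -> str:
        """Deterministically map a user to 'canary' or 'stable'; the same
        user_id always lands in the same bucket (stickiness)."""
        bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 10_000
        return "canary" if bucket < canary_fraction * 10_000 else "stable"

    # Roughly 5% of users see the new release; everyone else stays on the old one.
    assignments = [route(f"user-{i}") for i in range(100_000)]
    print(assignments.count("canary") / len(assignments))   # ~0.05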

Depending on your current state, look for solutions in the landscape that get you to the next higher level in the crawl, walk, run hierarchy.

11. Observe

The goal of observability is to ensure that big data applications and pipelines complete within performance SLAs, stay within cost budgets, and generate reliable results. Data users aren’t engineers, which leads to several pain points for writing performant and cost-effective queries/programs. First, query engines like Hadoop, Spark, and Presto have a plethora of knobs. Understanding which knobs to tune and their impact is nontrivial for most data users and requires a deep understanding of the inner workings of the query engines. There are no silver bullets—the optimal knob values for a query vary based on data models, query types, cluster sizes, concurrent query load, and so on. Given the scale of data, a brute-force approach to experimenting with different knob values is not feasible either. Second, given the petabyte (PB) scale of data, writing queries optimized for distributed data processing best practices is difficult for most data users. Often, data engineering teams have to rewrite the queries to run efficiently in production. Most query engines and datastores have specialized query primitives that are specific to their implementation; leveraging these capabilities requires a learning curve with a growing number of technologies. Third, query optimization is not a one-time activity but rather is ongoing based on the execution pattern. The query execution profile needs to be tuned based on the runtime properties in terms of partitioning, memory and CPU allocation, and so on. Query tuning is an iterative process with decreasing benefits after the initial few iterations targeting low-hanging optimizations.

Another key aspect of observability is cost. With data democratization, where data users can self-serve the journey to extract insights, there is a risk of wasted resources and unbounded costs; data users often spin up resources without actively leveraging them, which leads to low utilization. A single bad query running on high-end GPUs can accumulate thousands of dollars in a matter of hours, typically to the surprise of the data users. Cost management provides the visibility and controls needed to manage and optimize costs. It answers questions such as: How much is spent per application? Which teams are projected to spend more than their allocated budgets? Are there opportunities to reduce spending without impacting performance and availability? Are the allocated resources utilized appropriately?

Finally, several things can go wrong and lead to data quality issues: uncoordinated source schema changes, changes in data element properties, ingestion issues, source and target systems with out-of-sync data, processing failures, incorrect business definitions for generating metrics, and so on. Tracking quality in production pipelines is complex. First, there is no E2E unified and standardized tracking of data quality across multiple sources in the data pipeline. This results in a long delay in identifying and fixing data quality issues. Also, there is currently no standardized platform, so teams have to deploy and manage their own hardware and software infrastructure to address the problem. Second, defining the quality checks and running them at scale requires a significant engineering effort. For instance, a personalization platform requires data quality validation of millions of records each day. Currently, data users rely on one-off checks that are not scalable with large volumes of data flowing across multiple systems. Third, it’s important not just to detect data quality issues, but also to avoid mixing low-quality data records with the rest of the dataset partitions.

The tools in the technology landscape aim to address these pain points by providing a single pane of glass for observability. They aim to automate the following tasks related to observability:

  • Avoiding Cluster Clogs: Consider the scenario of a data user writing a complex query that joins tables with billions of rows on a nonindexed column value. While issuing the query, the data user may be unaware that this may take several hours or days to complete. Also, other SLA-sensitive query jobs can potentially be impacted. This scenario can occur during the exploration and production phases. Poorly written queries can clog the cluster and impact other production jobs.
  • Resolving Runtime Query Issues: An existing query may stop working and fail with out-of-memory (OOM) issues. A number of scenarios can arise at runtime, such as failures, stuck or runaway queries, SLA violations, changed configuration or data properties, or a rogue query clogging the cluster. There can be a range of issues to debug, such as container sizes, configuration settings, network issues, machine degradation, bad joins, bugs in the query logic, unoptimized data layout or file formats, and scheduler settings.
  • Speeding Up applications: An increasing number of applications deployed in production rely on the performance of data queries. Optimizing these queries in production is critical for application performance and responsiveness for end-users. Also, the development of data products requires interactive ad-hoc queries during model creation, which can benefit from faster query runs during exploration phases.
  • Automatically tune queries: For common scenarios, the ability to have the queries automatically tuned. Recommendations for improving the query based on the right primitives, cardinalities of tables, and other heuristics.
  • Monitoring Cost Usage: Cloud processing accounts are usually set up by data engineering and IT teams. A single processing account supports multiple different teams of data scientists, analysts, and users. The account hosts either shared services used by multiple teams (interleaving of requests) or dedicated services provisioned for apps with strict performance SLAs. Budgets are allocated to each team based on business needs. Data users within these teams are expected to be within their monthly budget and ensure the queries are delivering the appropriate cost-benefit. This presents multiple challenges. In a democratized platform, it is important for users to also be responsible for their allocated budgets and be able to make trade-off decisions between budget, business needs, and processing cost. Providing cost visibility to data users is not easy for shared services. Ideally, the user should be able to get the predicted cost of the processing or training at the time they issue their request. Resources spun up by teams are often not tagged, making accountability difficult. A lack of knowledge of the appropriate instance types, such as reserved versus on-demand versus spot-compute instances, can lead to significant money wasted.
  • Continuous Cost Optimization: There are several big data services in the cloud that have different cost models. Data users perform two phases of cost optimizations. The first phase takes place at the time of designing the pipeline. Here, options are evaluated for available pay-as-you-go models that best match the workload and SLA requirements. The second phase happens on an ongoing basis, analyzing the utilization and continuously optimizing the configuration.
  • Automating Quality observability: Analyze data attributes for anomalies, debug the root cause of detected quality issues, and proactively prevent low-quality data from impacting the insights in dashboards and models. Done manually, these tasks slow down the overall time to insight associated with the pipelines.
  • Data profiling and anomaly detection: Statistical analysis and assessment of data values within the dataset and pre-built algorithmic functions to identify events that do not conform to an expected pattern in a dataset (indicating a data quality problem). Assessment of a dataset’s accuracy made using absolute rules based on the data schema properties, value distributions, or business-specific logic.
  • Proactive problem avoidance: Measures to prevent low-quality data records from mixing with the rest of the dataset (a minimal profiling sketch follows this list).
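
The sketch below illustrates profile-based quality checks: a partition's row count and null rate are compared against their historical profile, and the partition is quarantined rather than published if either deviates too far. The thresholds and profile layout are illustrative.

    import statistics

    def is_anomalous(today, history, z_threshold=3.0):
        """Flag a value more than z_threshold standard deviations from its history."""
        if len(history) < 7:
            return False                     # not enough history to judge
        mean = statistics.mean(history)
        stdev = statistics.stdev(history) or 1e-9
        return abs(today - mean) / stdev > z_threshold

    def check_partition(row_count, null_rate, profile):
        bad = (is_anomalous(row_count, profile["row_counts"])
               or is_anomalous(null_rate, profile["null_rates"]))
        # Proactive problem avoidance: keep bad partitions out of the curated dataset
        return "quarantine" if bad else "publish"

    profile = {"row_counts": [10050, 9980, 10120, 10010, 9940, 10200, 10070],
               "null_rates": [0.010, 0.012, 0.009, 0.011, 0.010, 0.013, 0.010]}
    print(check_partition(row_count=4200, null_rate=0.01, profile=profile))  # quarantine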

Depending on your current state, look for solutions in the landscape that get you to the next higher level in the crawl, walk, run hierarchy. Observability has evolved in phases: the initial solutions focused on "what" is going on by tracking key metrics. The next generation focused on "why" it is going on by correlating metrics and logs. With advancements in AI/ML, we now have Observability 3.0 solutions that combine the "what", the "why", and the "how-to" for resolving the problem.

12. Experiment

A new data product or ML model is typically not rolled out across the entire population of users. A/B testing (also known as bucket testing, split testing, or controlled experiment) is becoming a standard approach for evaluating user satisfaction from a product change, a new feature, or any hypothesis related to product growth. It helps compare the performance of different versions of the same feature while monitoring a high-level metric like click-through rate (CTR), conversion rate, and so on.

In A/B testing, the incoming requests can be split in many ways to determine whether they will be serviced by the old or new model: randomly, based on geolocation or specific user lists, and so on. One of the key ingredients is collecting, analyzing, and aggregating behavioral data, known as clickstream data. Clickstream is a sequence of events that represent visitor actions within the application or website. It includes clicks, views, and related context, such as page load time, browser or device used by the visitor, and so on. Clickstream data is critical for business process insights like customer traffic analysis, marketing campaign management, market segmentation, sales funnel analysis, and so on. It also plays a key role in analyzing the product experience, understanding user intent, and personalizing the product experience for different customer segments. A/B testing uses clickstream data streams to compute business lifts or capture user feedback to new changes in the product or website.

The existing tools in the technology landscape aim to make experimentation and customer behavior analysis turn-key. In particular, they automate the following tasks related to this milestone:

  • Instrumentation for gathering customer behavioral data: Standardizing creation of beacons across multiple libraries and collection frameworks. The beacons have to be updated constantly to accommodate third-party integrations, namely email marketing tools, experimentation tools, campaign tools, and so on. The tracking schema typically has inconsistent standards of event properties and attributes, resulting in dirty data.
  • Creating sessions of raw clickstream events: A session is a short-lived and interactive exchange between two or more devices and/or users—for instance, a user browsing and then exiting the website, or an IoT device periodically waking up to perform a job and then going back to sleep. The interactions result in a series of events that occur in sequence, with a start and an end. In web analytics, a session represents a user’s actions during one particular visit to the website. Using sessions enables the answering of questions about things like the most frequent paths to purchase, how users get to a specific page, when and why users leave, whether some acquisition funnels are more efficient than others, and so on. The start and end of a session are difficult to determine and are often defined by a period of time with no relevant events.
  • Identity stitching: Customers today interact using multiple devices. They may start the website exploration on a desktop machine, continue on a mobile device, and make the buy decision using a different device. It is critical to know if this is the same customer or a different one. By tracking all the events in a single pipeline, the customer events can be correlated by matching IP addresses. Another example of a correlation is using cookie IDs when a customer opens an email, then having the cookie track the email address hashcode.
  • Bot filtering: Filtering bot traffic from real user activity, especially for use cases predicting the engagement of users in response to product changes.
  • Summarization of events data over different timeframes: For use cases that vary in their requirements for individual event details versus general user activity trends over longer time periods.
  • Enriching customer events: To effectively extract insights, the clickstream events are enriched with additional contexts, such as user agent details like device type, browser type, and OS version. IP2Geo adds geolocations based on IP address by leveraging lookup services.
  • Executing experiments: The experiment is kicked off and users are assigned to the control or variant experience. Their interaction with each experience is measured, counted, and compared to determine how each performs. While the experiment is running, the analysis must answer two key questions: a) Is the experiment causing unacceptable harm to users? and b) Are there any data quality issues yielding untrustworthy experiment results?
  • Analyzing experiment results: The goal is to analyze the difference between the control and variants and determine whether there is a statistically significant difference. Such monitoring should continue throughout the experiment, checking for a variety of issues, including interactions with other concurrently running experiments. During the experiment, based on the analysis, actions can be suggested, such as stopping the experiment if harm is detected, looking at metric movements, or examining a specific user segment that behaves differently from others. Overall, the analysis needs to ascertain if the data from the experiment is trustworthy, and come to an understanding of why the treatment did better or worse than control. The next steps can be a ship or no-ship recommendation or a new hypothesis to test (a minimal significance-test sketch follows this list).
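
The sketch below analyzes an experiment result with a two-proportion z-test on conversion rates, using made-up counts; production experimentation platforms layer sequential testing, guardrail metrics, and segment analysis on top of this kind of check.

    import math

    def two_proportion_z(conversions_a, n_a, conversions_b, n_b):
        """z-statistic and two-sided p-value for the difference in conversion rates."""
        p_a, p_b = conversions_a / n_a, conversions_b / n_b
        pooled = (conversions_a + conversions_b) / (n_a + n_b)
        se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
        z = (p_b - p_a) / se
        p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
        return z, p_value

    # Control: 2,000 conversions out of 50,000 users; variant: 2,200 out of 50,000
    z, p = two_proportion_z(2000, 50000, 2200, 50000)
    print(f"z={z:.2f}, p={p:.4f}")   # p < 0.05 suggests a statistically significant lift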

Depending on your current state, look for solutions in the landscape that get you to the next higher level in the crawl, walk, run hierarchy. Start with basic customer behavior data collection and analysis before moving higher towards advanced experimentation and real-time optimization.

To summarize, you hopefully now have a better understanding of the data platform landscape compared to when you started reading this post! Looking forward to your comments and to suggestions of companies to add to the landscape!
