Episode 4: From Serverless to Self-Service Analytics by Way of Cloud Technologies


I am Beshoy Gamal, a big data and machine learning geek. I have spent more than 9 years implementing data-driven solutions across countries and across technologies, from on-premises to cloud, and I now work at Vodafone Group as a Senior Data Architect.

In my experience, many organizations have invested in a central data lake and a data team with the expectation of driving their business with data. However, after a few initial quick wins, they notice that the central data team often becomes a bottleneck, as it cannot handle all the analytical questions of management and product owners quickly enough.

So I have decided to write this series of articles about Data Mesh, Data Products, Self-Service, and Data Democratization.

What Exactly is Self-Service Analytics?

It shouldn't be a surprise to anyone that self-service in the data analytics space is hard to define. Benn Stancil has a whole piece where he argues that 'self-service is a feeling' (which I largely agree with), and Stancil says that what self-service analytics is depends on how the org feels about self-serving data from their tools. Do they trust it? Do they feel comfortable getting what they need, without emailing an analyst?

This, Stancil continues, depends on the context of the organization (do they trust the numbers in their data systems?), their data maturity (do they feel comfortable with their BI tool?), and the needs of business users (does the CEO set the tone for metrics consumption?).

So, yes, the organizational context matters when you're talking about self service analytics. A self service setup that works in one company might not be equivalently self-service in another.

But I think we can get more specific than 'self-service is a feeling'.

Instead, I'm going to invert the question and define self-service analytics by what it's not, because I think this is more useful.

In a sentence: I think self-service can be thought of as a business outcome that successfully avoids a common organizational failed state. To put this more concretely, I think self-service analytics is a state where the business is sufficiently data-driven, but the data org does not look like an army of English-to-SQL translators.

This should become more useful in a minute. Let's walk through this together:

You are a small company.

You realize you need a data analytics team, so you hire your first analyst and you use Google Data Studio or Tableau or some other analytics platform. Your analyst churns out reports for management, and all is well for a few months. But eventually, your analyst can't keep up with all the requests she's getting from end users, so you hire another. And another. And another. And then your company grows up, creates departments that report to different leaders, and each department hires its own analysts, and now you have an army of analysts in various parts of the company all writing queries or tuning Excel spreadsheets, just trying to keep up with the business requests your company throws at them.

These analysts are mostly English-to-SQL translators or Excel jockeys. They're all relatively junior. Some are senior, sure. But there's not much career progression for them overall. And many of them are suitably displeased with their jobs, and a reliable percentage of them churn out (read: quit your company) every six months or so. You keep hiring new analysts to keep up with business demand and grit your teeth at the management challenge of constantly churning employees.

This is the failed state.

This is the failed state that self-service analytics is supposed to solve. It is a failed state because it's rather painful to maintain an army of English-to-SQL translators. Ideally you want a smaller group of data folks that can service a much larger number of data consumers. And the only way you can hit that scale is to have some form of 'self-service', that is, some way that business users can get the data they need without going through an analyst on Slack or email.

In other words, self service analytics is valuable as a goal because it increases the operating leverage of your data team. You can serve many more people with fewer analysts. This is an ideal business outcome.

Put differently, self-service analytics is most usefully described as a business outcome: a place that you get to through a combination of tools, processes, and org structure. And the way you get to it is by asking yourself, each step of the way: "does this move bring us closer to or further away from the failed state?"

In such a scenario, the best thing a tool can do is to not get in your way. The best thing a Business Intelligence tool can do is to give you handles when you want to evolve your org away from the failed state.

That being said, knowing what self-service analytics is is different from actually achieving it.

We will now deconstruct and drill into the details with text and supplemental charts, starting with a discussion of business users.

Business Users and Developers

Casual versus Power Users. Business users can be divided into two major classes: casual users and power users. Casual users use information to do their jobs. They make up 90% of knowledge workers in an organization: typically, executives, managers, and front-line workers. In contrast, power users are hired to collect and analyze information on a daily basis. They have working knowledge of databases, query techniques, statistics, and machine-learning tools and techniques. They make up approximately 10% of knowledge workers in an organization.


Types of Casual Users

There are two major subclasses of casual users:

- Data consumers simply want to consume reports and dashboards created for them. Some only view the content, while others interact with it, searching, drilling, sorting, pivoting, and creating snapshots for later viewing.

- Data explorers are data consumers who occasionally want to edit a report or dashboard or create one from scratch without coding. Using a BI or discovery tool, they drag and drop metrics, dimensions, controls, and predefined charts from an object library onto a report canvas to create ad hoc reports and dashboards. They may also create metrics using a point-and-click calculation engine and merge local and external data using integrated data preparation functions. (See "The Conundrum of the Data Explorer," below.)


Types of Power Users

There are three major subclasses of power users:

- Business developers build reports and dashboards for business consumption. Traditionally, they are corporate BI developers, or perhaps business-savvy data engineers. But increasingly, they are tech-savvy business users in each business unit. Ideally, they are BI developers co-located in the business unit as part of a federated organizational model. But business developers can also be data analysts who have time to build business views and reports for colleagues. Increasingly, they are tech-savvy BI analysts who can not only gather and consolidate requirements from business users, but also design the report or dashboard interface using point-and-click development tools or light scripting.

- Data scientists are data analysts with a computer science background who know how to code using languages and tools such as SQL, Java, Python, Hive, and Pig. The best are also conversant with statistics and data mining tools and can create predictive and machine-learning models. Most data scientists want to access raw data at the lowest level of granularity.

- Data engineers do the heavy lifting required to create and manage the information supply chain. We used to call these individuals ETL developers and data architects. They identify source data, map data flows, model databases, define and monitor data transformation jobs, and work with database administrators to create, manage, and tune databases and optimize performance. Some also design business views for business users, especially if they are built within a database.


Self-Service Workflows

Collectively, business users and developers create a self-service analytics environment that refines data for consumption. The environment consists of bidirectional and iterative workflows that enable business users to refine and enrich curated data to meet their needs quickly, while promoting prototypes and requirements to iteratively expand the boundaries of the curated data environment. These workflows create a vibrant, self-reinforcing data environment that accelerates time to insight as well as user productivity.

The right way to implement self-service workflows is serverless analytics, which is enabled by cloud computing technologies.

What is Serverless Computing?

Cloud computing enables self-service provisioning and management of servers. However, as we know in the world of big data, dynamic scaling and cost management are the key factors behind the success of any analytics platform. Serverless architecture addresses exactly that: cloud platforms such as AWS and Microsoft Azure, as well as open-source technologies such as those from Apache, have launched many serverless services in which code execution scales up or down as required, and we pay for infrastructure only for the execution time of our code.
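To make "code execution" concrete, here is a minimal sketch of the kind of function a serverless platform invokes once per request or event. The handler(event, context) shape follows the convention used by AWS Lambda's Python runtime; the "readings" payload and the "elapsedMs" field are hypothetical names chosen for illustration, not part of any platform's API.

```python
import json
import time


def handler(event, context=None):
    """Entry point the platform would invoke once per request or event."""
    started = time.perf_counter()

    # Hypothetical payload: a list of numeric readings to summarize.
    readings = event.get("readings", [])
    average = sum(readings) / len(readings) if readings else 0.0

    # Platforms typically bill for roughly this duration, not for idle time.
    elapsed_ms = (time.perf_counter() - started) * 1000

    return {
        "statusCode": 200,
        "body": json.dumps({"count": len(readings), "average": average}),
        "elapsedMs": elapsed_ms,  # illustrative only
    }


if __name__ == "__main__":
    # Local invocation with a fake event; no cloud account is needed.
    print(handler({"readings": [10, 20, 30]}))
```

There is no server setup anywhere in the function: provisioning, scaling, and teardown are left to the platform, which is the point of the model.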

What is Serverless Analytics Framework?

It is an open-source web framework that is used for building applications on AWS, Microsoft Azure, Kubernetes, etc. It is provider-agnostic, which means that you need only one tool to tap into the power of all the cloud providers.

Serverless for Big Data

Serverless is becoming very popular in the world of big data. Because workloads are managed by serverless platforms, we don't need an extra team to manage our Hadoop/Spark clusters. Let's look at the points to consider when setting up our platforms.

No need for Infra Management

While working on various ETL and analytical platforms, we found that we needed many people to set up the Spark and Hadoop clusters, and nowadays we use Kubernetes clusters, with everything launched in containers. Monitoring them, scaling the resources, and optimizing cost takes a lot of effort and resources. Serverless makes developers' and managers' lives easier, as they don't have to worry about the infrastructure.

Scale on Demand

Serverless platforms continuously monitor the resource usage of our deployed code (or functions) and scale up or down with usage, so the developer doesn't need to worry about scalability. Imagine we have deployed an ETL job on a Spark cluster that runs every hour. At peak times, the number of records to extract from the data source rises to 1 million per hour, and sometimes, at midnight, it falls to only 1k to 10k. The ETL service automatically scales our job up or down according to these requirements. It is similar to what we do in a Kubernetes cluster with autoscaling: we just set the rules for CPU or memory usage, and Kubernetes automatically takes care of scaling the cluster.
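A scaling rule of this kind can be sketched in a few lines. The proportional formula below is the one documented for the Kubernetes Horizontal Pod Autoscaler, desired = ceil(current replicas x current metric / target metric); the function name, the bounds, and the per-worker target of 100,000 records are assumptions made for illustration, and the real autoscaler also applies a tolerance band and stabilization windows that this sketch leaves out.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10):
    """Replica count suggested by the proportional scaling rule."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    raw = math.ceil(current_replicas * current_metric / target_metric)
    # Clamp to the configured bounds so the job never drops below the
    # minimum or grows beyond the budgeted maximum.
    return max(min_replicas, min(max_replicas, raw))


if __name__ == "__main__":
    # Hypothetical target: each worker handles 100,000 records per hour.
    peak = desired_replicas(1, current_metric=1_000_000, target_metric=100_000)
    midnight = desired_replicas(1, current_metric=10_000, target_metric=100_000)
    print(peak, midnight)
```

With these assumed numbers, the peak hour asks for ten workers and the midnight run falls back to one, which mirrors the 1 million versus 10k example above.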


Cost-Effective

Serverless is cost-effective, meaning that we pay only for the execution time of our code. When our deployed function is idle and not being used by any client, we do not pay any infrastructure cost for it. We do not pay a cloud platform by the hour for our infrastructure; instead, the platform launches resources on the fly for us. For example, an ETL platform like Glue launches the Spark jobs according to the scheduled time of our ETL job, so the cloud service charges us only for that particular execution time. Also, imagine you have several endpoints, microservices, or APIs that are used infrequently. Serverless is the best fit for that case, as we are charged only when those APIs are called.
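The difference between the two billing models is simple arithmetic, sketched below. All rates and durations are made-up figures, and the sketch assumes the same unit price in both models; real per-second serverless rates are often higher than the equivalent always-on rate, so the gap depends on how idle the workload really is.

```python
def always_on_cost(hourly_rate, hours):
    """Cost of a cluster that is billed for every hour, busy or idle."""
    return hourly_rate * hours


def pay_per_use_cost(rate_per_second, runs, seconds_per_run):
    """Cost when only the execution time of each run is billed."""
    return rate_per_second * runs * seconds_per_run


if __name__ == "__main__":
    # Hypothetical workload: an hourly ETL job that runs for 5 minutes.
    hours_in_month = 24 * 30
    fixed = always_on_cost(hourly_rate=1.0, hours=hours_in_month)
    on_demand = pay_per_use_cost(rate_per_second=1.0 / 3600,
                                 runs=hours_in_month,
                                 seconds_per_run=300)
    print(f"always-on: {fixed:.2f}, pay-per-use: {on_demand:.2f}")
```

Under these assumptions the job is busy for 5 minutes out of every 60, so the pay-per-use bill comes to about one twelfth of the always-on bill.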

Built-In High Availability and Fault Tolerance

Serverless providers offer built-in high availability, which means our deployed application should stay available even when individual servers fail. It's the same as putting Nginx in front of an application deployed on multiple servers: Nginx automatically routes each request to an available server. In the context of serverless, let's say our Spark ETL job is running and the Spark cluster suddenly fails for some reason. Glue will automatically redeploy our Spark job on a new cluster, and ideally, whenever a job fails, Glue should store a checkpoint of our job and resume it from wherever it failed.
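The checkpoint-and-resume idea can be illustrated with a small, generic sketch. This is not Glue's API or its actual mechanism: run_job, the JSON checkpoint file, and the fail_at switch are all hypothetical names used only to show how a restarted job can skip work that already completed.

```python
import json
import os
import tempfile


def run_job(records, checkpoint_path, process, fail_at=None):
    """Process records in order, resuming after the last checkpoint."""
    start = 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as fh:
            start = json.load(fh)["next_index"]

    for index in range(start, len(records)):
        if fail_at is not None and index == fail_at:
            raise RuntimeError(f"simulated cluster failure at record {index}")
        process(records[index])
        # Persist progress only after the record is fully processed.
        with open(checkpoint_path, "w") as fh:
            json.dump({"next_index": index + 1}, fh)


if __name__ == "__main__":
    done = []
    path = os.path.join(tempfile.mkdtemp(), "checkpoint.json")
    data = ["a", "b", "c", "d"]
    try:
        run_job(data, path, done.append, fail_at=2)
    except RuntimeError:
        pass  # a managed service would reschedule the job at this point
    run_job(data, path, done.append)  # second attempt resumes at index 2
    print(done)
```

Writing the checkpoint after each record keeps the sketch simple; a real job would more likely checkpoint per batch and would need processing to be idempotent in case a failure lands between the work and the checkpoint write.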

What makes self-service analytics successful?

Self-serve analytics is successful when your organization uses a modern BI tool and adopts a collaborative data culture. Successful self-serve analytics will help the data team deliver insights without resorting to manual, repetitive analysis or playing constant catch-up as the business evolves. Here are four components that make self-serve analytics successful.

1. Centralized analytics

With a single, centralized hub for all your analysis, you can source everyone’s shared knowledge in the same place. At this “watering hole” for data, all teams can collaborate more deeply and share more context, allowing them to move faster, together.

2. Collaborative analytics

Building self-serve analytics should be a collaborative process between business teams and data teams. Business teams should be able to work fluently in the same tool as data teams. This allows the data to be informed by everyone’s expertise and helps ensure that the highest-priority questions are getting answered.

For example, product decisions are made collaboratively. To get the most-informed answers to product questions, data scientists, product management, growth teams, and other domain experts should be able to weigh in, easily viewing or editing work, in the same tool.

3. Distributed analytics

Successful self-serve also means that folks across teams get the most up-to-date data in the places they visit the most. This could look like data teams setting reports to refresh automatically every morning, embedding or white-labeling reports in other tools, or delivering them to people's inboxes every week.

4. Context to inform ad hoc analysis

Ad hoc analysis should drive self-serve reporting, and self-serve reporting can and should lead to deeper questions that inspire even better ad hoc analysis. This process should be complementary and easy to carry out in a modern BI tool, like Mode, providing more and more relevant context to everyone over time.

Real-Life Scenarios from Google Cloud Platform

Now imagine having access to all the best expedition equipment in one place. You can start your exploration instantly and have more freedom to experiment and uncover fascinating discoveries that will help humanity! Wouldn’t it be awesome if you too, as a data consumer, could get access to all the data exploration tools in one place? A single unified view that lets you discover and interactively query fully governed, high-quality data, with an option to operationalize your analysis?

This is exactly what the Data exploration workbench in Dataplex offers. It provides a Spark-powered serverless data exploration experience that lets data consumers interactively extract insights from data stored in Google Cloud Storage and BigQuery using Spark SQL scripts and open-source packages in Jupyter Notebooks.

Challenge 1: As a data consumer you spend more time on making different tools work together than on generating insights

Solution: Data exploration workbench provides a single user interface where:

  1. You have 1-click access to run Spark SQL queries using an interactive Spark SQL editor.
  2. You can leverage open-source technologies such as PySpark, Bokeh, and Plotly to visualize data and build machine learning pipelines via JupyterLab Notebooks.
  3. Your queries and notebooks run on fully managed, serverless Apache Spark sessions. Dataplex auto-creates user-specific sessions and manages the session lifecycle.
  4. You can save the scripts and notebooks as content in Dataplex and enable better discovery and collaboration of that content across your organization. You can also govern access to content using IAM permissions.
  5. You can interactively explore data, collaborate over your work, and operationalize it with one-click scheduling of scripts and notebooks.

Challenge 2: Discovering the right datasets needed to kickstart data exploration is often a “manual” process that involves reaching out to other analysts/data owners

Solution: ‘Do we have the right data to embark on further data analysis?’ This is the question that kickstarts the data exploration journey. With Dataplex, you can examine the metadata of the tables you want to query right from within the data exploration workbench. You can further use the indexed Search to understand not only the technical metadata but also the business and operational metadata, along with the data quality scores for your data. And finally, you get deeper insights into your data by interactively querying it using the Workbench.

Challenge 3: Finding the right query snippet to use. Analysts often don’t save and share useful query snippets in an organized or centralized way. Furthermore, once you have access to the code, you still need to recreate the same infrastructure setup to get results.

Solution: Data exploration workbench allows users to save Spark SQL queries and Jupyter notebooks as content and share them across the organization via IAM permissions. It provides a built-in Notebook viewer that helps you examine the output of a shared notebook without starting a Spark session or re-executing the code cells. You can share not only the content of a script or a notebook, but also the environment where the script ran, so that others can run it on the same underlying setup. This way, analysts can seamlessly collaborate and build on the analysis.

