Unveiling data processing in 6 steps
Rodolfo Marcos
Sr Data / Full-Stack Engineer | Tech Writer | Google Cloud certified | AWS @ Freestar
Suppose we work for a small company that sells paper. The business is doing well and we need to start reporting the sales… Where do we start? Let's go through a series of milestones, from the very beginning, that lead to a robust data solution. Just remember that the tools and examples here are not the only options; they represent versatile choices that can adapt to a wide range of needs.
Step 1: Python scripts with Pandas
You can start as simply as writing a local Python script with Pandas. Read the Excel files the sales team sends and generate a simple daily report by e-mail. Pandas is a Python library with built-in data operations such as joins, filters, and group by that will make your life easier as you gain experience with data.
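For instance, a first version of that script could be as small as the sketch below; the file name and column names are illustrative assumptions, and e-mailing the result is left out:

```python
import pandas as pd

# Hypothetical daily export from the sales team, with columns
# "date", "product", and "amount". Reading .xlsx requires openpyxl.
sales = pd.read_excel("sales_2024-06-01.xlsx")

# Total revenue and number of orders per product for the daily report.
daily_report = (
    sales.groupby("product", as_index=False)
         .agg(total_amount=("amount", "sum"),
              orders=("amount", "count"))
         .sort_values("total_amount", ascending=False)
)

# Save the summary; sending it by e-mail can be done with smtplib
# or your e-mail provider's API.
daily_report.to_csv("daily_sales_report.csv", index=False)
```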
Once your script is ready, you can deploy it to a cloud service such as Cloud Functions on GCloud or Lambda functions on AWS, making it easy to productionize your brand-new sales report.
Step 2: Fully managed data processing - PySpark
As the next step, you can translate the script developed earlier into Spark; you can choose between the Java and Python APIs. Spark can handle massive data processing thanks to its parallel execution, and it has many plugins and libraries that even cover machine learning. Sales can grow as much as they like, and you will still be able to handle all the data generated.
I recommend the Python version, since most data tooling and frameworks are built in Python nowadays; it has great community support and is more accessible to new developers on the team. It's also important to choose a fully managed Spark offering so it scales automatically when necessary and you pay as you go.
You can choose Dataproc on GCloud, EMR on AWS, or Databricks to develop and deploy your Spark script in the cloud.
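As a rough idea of what the same job looks like in PySpark (the bucket paths and column names are illustrative), the cluster parallelizes the read and the aggregation for you:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("daily-sales-report").getOrCreate()

# Read all daily sales files at once; Spark distributes the work across the cluster.
sales = spark.read.option("header", True).csv("gs://my-bucket/sales/*.csv")

daily_report = (
    sales.withColumn("amount", F.col("amount").cast("double"))
         .groupBy("product")
         .agg(F.sum("amount").alias("total_amount"),
              F.count("*").alias("orders"))
         .orderBy(F.desc("total_amount"))
)

# Write the result back to cloud storage as Parquet for downstream consumers.
daily_report.write.mode("overwrite").parquet("gs://my-bucket/reports/daily_sales/")
```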
Step 3: Data Warehouse and Data Visualization
There will be a time when a file output as .csv or Excel won't fit the consumers' needs. A data warehouse is where your processed data sits ready to use in datasets and tables. Think of it as an organized, well-kept library with all the books ready to be read. Data warehouses provide high availability and fast query times. Integrate their tables with data visualization tools like Data Studio, Cube.js, or Looker to get it all into nice visual dashboards.
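As one concrete option, here is a minimal sketch of loading the processed report into BigQuery with the official Python client library; the project, dataset, table name, and sample rows are made up for the example:

```python
import pandas as pd
from google.cloud import bigquery

# The processed report from the previous step (illustrative data).
daily_report = pd.DataFrame({
    "product": ["A4 paper", "Letter paper"],
    "total_amount": [1250.0, 980.5],
    "orders": [40, 31],
})

client = bigquery.Client()  # uses your default GCP credentials

# Hypothetical destination table in the warehouse.
table_id = "my-project.sales_dw.daily_sales"

# Load the DataFrame and wait for the job to finish.
client.load_table_from_dataframe(daily_report, table_id).result()
```

From there, the dashboarding tool simply queries the table instead of reading files.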
Now you have something professional coming out! Users can build and share their own reports and charts, filtering and aggregating the data to their needs.
Step 4: Orchestration Tools
It's not all about sales, right? At this point we have data from other areas such as marketing, finance, and so on. Each requires different data processing, and some reports depend on multiple tables. An orchestrator can combine and sequence many data processes, add custom logic, and provide data sensors to produce a final result.
Some orchestration options are the cloud-agnostic Apache Airflow or Data Factory on Azure. I recommend Apache Airflow, since it has connectors for most of the widely used data tools and services.
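To make the idea concrete, here is a minimal Airflow DAG sketch (assuming Airflow 2.4+; the task scripts and IDs are hypothetical) that runs the sales and marketing jobs before building a combined report:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="daily_reports",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    sales = BashOperator(task_id="process_sales",
                         bash_command="python process_sales.py")
    marketing = BashOperator(task_id="process_marketing",
                             bash_command="python process_marketing.py")
    report = BashOperator(task_id="build_final_report",
                          bash_command="python build_report.py")

    # The final report only runs after both upstream jobs succeed.
    [sales, marketing] >> report
```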
Step 5: Data Streaming
Let's take a step further: the head of the company wants to see sales in a beautiful, real-time dashboard. Data streaming is the ability to work with unbounded data that arrives from its source as soon as it is produced. Suppose all the company's stores are equipped with a connection that sends every sale as soon as the payment is approved. Now you can use the Apache Spark streaming component (Structured Streaming) to feed fresh dashboards with real-time sales. Cool, huh? Streaming is a growing trend alongside the Internet of Things.
Of course, not every business needs real-time data, but if yours does, lean on the cloud providers here as well, since they offer high availability and, again, auto-scaling. Most cloud providers already offer a data streaming service, such as Pub/Sub on GCloud or Amazon Kinesis on AWS; if you need something more customizable, choose Apache Kafka.
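As a sketch of what the streaming job could look like with Spark Structured Streaming reading from a Kafka topic (the broker address, topic name, and event schema are assumptions, and the Kafka connector package must be available to Spark):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("realtime-sales").getOrCreate()

# Hypothetical schema of the sale events sent by the stores.
schema = StructType([
    StructField("store_id", StringType()),
    StructField("amount", DoubleType()),
    StructField("sold_at", TimestampType()),
])

# Read the unbounded stream of sales from a Kafka topic.
events = (
    spark.readStream.format("kafka")
         .option("kafka.bootstrap.servers", "broker:9092")
         .option("subscribe", "sales")
         .load()
         .select(F.from_json(F.col("value").cast("string"), schema).alias("sale"))
         .select("sale.*")
)

# Revenue per store over 1-minute windows, feeding the real-time dashboard.
revenue = (
    events.withWatermark("sold_at", "5 minutes")
          .groupBy(F.window("sold_at", "1 minute"), "store_id")
          .agg(F.sum("amount").alias("revenue"))
)

# Writing to the console for illustration; a real job would write to a sink
# the dashboard can read from.
query = revenue.writeStream.outputMode("update").format("console").start()
query.awaitTermination()
```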
Step 6: Data Quality and Observability
Last but not least, data quality guarantees that the information used is accurate, consistent, and timely. As the business grows, ensuring data quality becomes a real concern, and automated tools are the best option for dealing with large data volumes. You can create an autonomous check to detect negative sales amounts, for example.
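Before reaching for a full framework, such a check can start as a small guard in the pipeline itself; the column name and the failure behaviour below are illustrative:

```python
import pandas as pd

def check_no_negative_amounts(sales: pd.DataFrame) -> None:
    """Fail the pipeline if any sale has a negative amount."""
    bad_rows = sales[sales["amount"] < 0]
    if not bad_rows.empty:
        # In a real pipeline you would also send this to your alerting channel.
        raise ValueError(f"Found {len(bad_rows)} sales with a negative amount")

# Illustrative data: the last row should trigger the check.
sales = pd.DataFrame({"product": ["A4", "Letter", "A3"],
                      "amount": [10.0, 25.5, -3.0]})
check_no_negative_amounts(sales)
```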
Data observability complements this by providing real-time monitoring and understanding of data systems and workflows, with alerting that enables quick detection and resolution of issues. Great Expectations is a great framework for creating custom data quality rules, and it also works with Spark. You'll also need to set up monitoring logic in the cloud provider you choose to detect anomalies in your processes or when a system goes down.
Examples are Cloud Monitoring in Google Cloud or Amazon CloudWatch in AWS.
So, here's to a happy and successful data journey! May your data always be clean, your processes efficient, and your insights profound. Happy data processing!