Unveiling data processing in 6 steps.
From simple to complete data processing solutions.


Suppose we work for a small company that sells paper. The business is doing well and we need to start reporting sales… Where do we start? Let's go through a series of milestones, from the very beginning all the way to a robust data solution. Just remember that the tools and examples here are not the only options; they represent versatile choices that can adapt to a wide range of needs.


1. Python scripts with Pandas

You can start as simply as a local Python script using Pandas. Read the Excel files the sales team sends and generate a simple daily report by e-mail. Pandas is a Python library with built-in data operations such as joins, filters, and group-bys that will make your life easier as you gain experience with data.
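As an illustration, a minimal Pandas sketch could look like this (the file name and column names are assumptions for illustration, not a fixed schema):

import pandas as pd

# Read the raw sales spreadsheet sent by the sales team (hypothetical path and columns).
sales = pd.read_excel("sales_2024-06-01.xlsx")  # columns: store, amount, sold_at

# Keep only today's sales and aggregate the total per store.
sales["sold_at"] = pd.to_datetime(sales["sold_at"])
today = sales[sales["sold_at"].dt.date == pd.Timestamp.today().date()]
daily_report = today.groupby("store", as_index=False)["amount"].sum()

# Export the report; attach this file to the daily e-mail.
daily_report.to_csv("daily_sales_report.csv", index=False)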

Once your script is ready, you can deploy it to a cloud service such as Cloud Functions on Google Cloud or AWS Lambda, making it easy to productionize your brand-new sales report.
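If you go with Google Cloud Functions, for example, the entry point could be a small HTTP handler like the sketch below (the report logic is only a stub here; in practice it would be the Pandas code above):

import functions_framework

def generate_daily_report() -> str:
    # Placeholder for the Pandas logic from the previous sketch.
    return "daily_sales_report.csv"

@functions_framework.http
def run_sales_report(request):
    report_path = generate_daily_report()
    return f"Report generated: {report_path}", 200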


2. Fully managed data processing: PySpark

As the next step, you can translate the script you developed earlier into Spark; you can choose between its Java, Scala, or Python APIs. Spark handles massive data volumes through parallel processing and has many plugins and libraries that can even handle machine learning. Sales can grow as much as they like, and you'll still be able to handle all the data generated.

I recommend the Python version, since most data tooling and frameworks are built with the language nowadays. It has great community support and is more accessible to new developers on the team. It's important to choose a fully managed Spark offering so that it scales automatically when necessary and you pay as you go.

You can choose Dataproc on Google Cloud, EMR on AWS, or Databricks to develop and deploy your Spark script in the cloud.

Spark's parallel processing distributes your data operations across multiple worker nodes.
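To make this concrete, here is a hedged PySpark sketch of the same daily aggregation, now over many files at once (the bucket path and column names are placeholders carried over from the Pandas example):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("daily-sales-report").getOrCreate()

# Read every sales file in the bucket; Spark parallelizes the work across worker nodes.
sales = spark.read.option("header", True).csv("gs://my-bucket/sales/*.csv")

daily_report = (
    sales.withColumn("amount", F.col("amount").cast("double"))
         .groupBy("store")
         .agg(F.sum("amount").alias("total_amount"))
)

# Write the aggregated result back to cloud storage for downstream consumers.
daily_report.write.mode("overwrite").parquet("gs://my-bucket/reports/daily_sales/")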

3. Data Warehouse and Data Visualization

There will come a time when a .csv or Excel file output won't fit the consumers' needs. A data warehouse is where your processed data sits ready to use in datasets and tables. Think of it as an organized, well-kept library with all the books ready to be read. Data warehouses provide high availability and fast query times. Integrate their tables with data visualization tools like Data Studio, CubeJS, or Looker to have it all in nice visual dashboards.
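If BigQuery is your warehouse, for example, loading the processed report into a table can be as small as the sketch below (the project, dataset, and table names are placeholders for illustration):

import pandas as pd
from google.cloud import bigquery

client = bigquery.Client(project="my-project")
daily_report = pd.read_csv("daily_sales_report.csv")

# Load today's figures into the reporting table; BI tools query this table directly.
job = client.load_table_from_dataframe(
    daily_report,
    "my-project.sales.daily_sales_report",
)
job.result()  # wait for the load job to finish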

Now you have something professional coming out! The users can build and share their own reports and charts, filtering and aggregating the data to their needs.


4. Orchestration Tools

It's not all about sales, right? At this point we also have data from other areas such as marketing, finance, and so on. Each requires different data processing, and some reports depend on multiple tables. An orchestrator can combine and sequence many data processes, add custom logic, and provide data sensors to produce a final result.

Some orchestration options are the cloud-agnostic Apache Airflow or Data Factory on Azure. I recommend Apache Airflow since it has many available connectors to the most widely used data tools and services.
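A minimal Airflow sketch of this idea could look like the DAG below (the task names and callables are illustrative placeholders, and the schedule syntax shown is Airflow 2.x):

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def load_marketing_data():
    print("Loading marketing data...")  # placeholder for a real ingestion job

def build_sales_report():
    print("Building the daily sales report...")  # placeholder for the Spark/warehouse steps

with DAG(
    dag_id="daily_company_reports",
    start_date=datetime(2024, 1, 1),
    schedule_interval="0 6 * * *",  # every day at 06:00
    catchup=False,
) as dag:
    marketing = PythonOperator(task_id="load_marketing_data", python_callable=load_marketing_data)
    sales = PythonOperator(task_id="build_sales_report", python_callable=build_sales_report)

    # The report only runs after the marketing data is in place.
    marketing >> sales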


5. Data Streaming

Let's take a step further: the head of the company wants to see sales in a beautiful, real-time dashboard. Data streaming is the ability to work with unbounded data that arrives from its source as soon as it is produced. Suppose every store is equipped with a connection that sends each sale as soon as the payment is approved. Now you can use Spark's streaming component to feed dashboards with real-time sales. Cool, huh? Streaming is also a growing trend with the Internet of Things.

Indeed, not every business needs real-time data, but if yours does, lean on cloud providers here as well, since they offer high availability and, again, auto-scaling. Most cloud providers already offer a data streaming service, such as Pub/Sub on Google Cloud or Kinesis on AWS; if you need something more customizable, choose Apache Kafka.
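As a hedged example, a Spark Structured Streaming job reading sale events from Kafka could look like this (the broker address, topic, and event schema are placeholders, and the Kafka connector package must be available on the cluster):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("realtime-sales").getOrCreate()

# Each Kafka message is assumed to be a JSON sale event sent right after payment approval.
event_schema = StructType([
    StructField("store", StringType()),
    StructField("amount", DoubleType()),
])

sales_events = (
    spark.readStream.format("kafka")
         .option("kafka.bootstrap.servers", "broker:9092")
         .option("subscribe", "sales")
         .load()
         .select(F.from_json(F.col("value").cast("string"), event_schema).alias("sale"))
         .select("sale.*")
)

# Running total per store, continuously updated for the real-time dashboard.
totals = sales_events.groupBy("store").agg(F.sum("amount").alias("total_amount"))

query = (
    totals.writeStream
          .outputMode("complete")
          .format("console")  # in practice, write to a sink your dashboard reads from
          .start()
)
query.awaitTermination()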


Modern services rely on data streaming.

6. Data Quality and Observability

Last but not least, data quality guarantees that the information used is accurate, consistent, and timely. As the business grows, ensuring data quality becomes a real concern, and automated tools are the best way to handle it at scale. You can create an automated check to detect negative sales amounts, for example.

Data observability complements this by providing real-time monitoring and understanding of data systems and workflows, with alerting that enables quick detection and resolution of issues. Great Expectations is a great framework for creating custom data quality rules, and it also works with Spark DataFrames. You'll also need to build monitoring logic in the cloud provider you choose to detect anomalies in your processes or to catch a system going down.
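As a hedged sketch using the classic Great Expectations Pandas API (newer "fluent" releases expose a different interface), the negative-sales check mentioned above could look like this:

import great_expectations as ge
import pandas as pd

daily_report = pd.read_csv("daily_sales_report.csv")
validated = ge.from_pandas(daily_report)

# Rule: no sale amount should ever be negative.
validated.expect_column_values_to_be_between("amount", min_value=0)

results = validated.validate()
if not results["success"]:
    # Hook your alerting here (e-mail, Slack, Cloud Monitoring, etc.).
    print("Data quality check failed:", results)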

Examples are Cloud Monitoring in Google Cloud or Amazon CloudWatch in AWS.


So, here's to a happy and successful data journey! May your data always be clean, your processes efficient, and your insights profound. Happy data processing!


To enjoy the journey:

  • Nail down the business needs and start with a single report.
  • Ask for users' feedback.
  • Be patient; you'll learn a lot along the way.
  • Do you know other tools or services that can be used in each stage? Share with us!



