Unveiling data processing in 6 steps
Rodolfo Marcos
Sr Data / Full-Stack Engineer | Tech Writer | Google Cloud certified | AWS @ Freestar
Suppose we work for a small company that sells paper. The business is doing well and we need to start reporting the sales… Where do we start? Let's go through a series of milestones, from the very beginning, that lead to a robust data solution. Just remember that the tools and examples here are not the only options; they represent versatile choices that can adapt to a wide range of needs.
Step 1: Python scripts with Pandas
You can start as simply as writing a local Python script with Pandas. Read the Excel files the sales team sends and generate a simple daily report by e-mail. Pandas is a Python library with built-in data operations such as joins, filters, and group by that will make your life easier as you gain experience with data.
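For instance, a first version of that script could be as small as the sketch below; the file name and column names are illustrative assumptions, and e-mailing the result is left out:

```python
import pandas as pd

# Hypothetical daily export from the sales team, with columns
# "date", "product", and "amount". Reading .xlsx requires openpyxl.
sales = pd.read_excel("sales_2024-06-01.xlsx")

# Total revenue and number of orders per product for the daily report.
daily_report = (
    sales.groupby("product", as_index=False)
         .agg(total_amount=("amount", "sum"),
              orders=("amount", "count"))
         .sort_values("total_amount", ascending=False)
)

# Save the summary; sending it by e-mail can be done with smtplib
# or your e-mail provider's API.
daily_report.to_csv("daily_sales_report.csv", index=False)
```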
Once your script is ready, you can deploy it to a cloud service such as Cloud Functions on GCloud or Lambda functions on AWS, making it easy to productionize your brand-new sales report.
Step 2: Fully managed data processing - PySpark
As the next step, you can translate the script developed earlier into Spark; you can choose between the Java and Python APIs. Spark can handle massive data processing thanks to its parallel execution, and it has many plugins and libraries that even cover machine learning. Sales can grow as much as they like, and you will still be able to handle all the data generated.
I recommend the Python version, since most data tooling and frameworks are built in Python nowadays; it has great community support and is more accessible to new developers on the team. It's also important to choose a fully managed Spark offering so it scales automatically when necessary and you pay as you go.
You can choose Dataproc on GCloud, EMR on AWS, or Databricks to develop and deploy your Spark script in the cloud.
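As a rough idea of what the same job looks like in PySpark (the bucket paths and column names are illustrative), the cluster parallelizes the read and the aggregation for you:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("daily-sales-report").getOrCreate()

# Read all daily sales files at once; Spark distributes the work across the cluster.
sales = spark.read.option("header", True).csv("gs://my-bucket/sales/*.csv")

daily_report = (
    sales.withColumn("amount", F.col("amount").cast("double"))
         .groupBy("product")
         .agg(F.sum("amount").alias("total_amount"),
              F.count("*").alias("orders"))
         .orderBy(F.desc("total_amount"))
)

# Write the result back to cloud storage as Parquet for downstream consumers.
daily_report.write.mode("overwrite").parquet("gs://my-bucket/reports/daily_sales/")
```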
Step 3: Data Warehouse and Data Visualization
There will be a time when a file output as .csv or Excel won't fit the consumers' needs. A data warehouse is where your processed data sits ready to use in datasets and tables. Think of it as an organized, well-kept library with all the books ready to be read. Data warehouses provide high availability and fast query times. Integrate their tables with data visualization tools like Data Studio, Cube.js, or Looker to get it all into nice visual dashboards.
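As one concrete option, here is a minimal sketch of loading the processed report into BigQuery with the official Python client library; the project, dataset, table name, and sample rows are made up for the example:

```python
import pandas as pd
from google.cloud import bigquery

# The processed report from the previous step (illustrative data).
daily_report = pd.DataFrame({
    "product": ["A4 paper", "Letter paper"],
    "total_amount": [1250.0, 980.5],
    "orders": [40, 31],
})

client = bigquery.Client()  # uses your default GCP credentials

# Hypothetical destination table in the warehouse.
table_id = "my-project.sales_dw.daily_sales"

# Load the DataFrame and wait for the job to finish.
client.load_table_from_dataframe(daily_report, table_id).result()
```

From there, the dashboarding tool simply queries the table instead of reading files.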
Now you have something professional coming out! Users can build and share their own reports and charts, filtering and aggregating the data to their needs.
Step 4: Orchestration Tools
It's not all about sales, right? At this point we have data from other areas such as marketing, finance, and so on. Each requires different data processing, and some reports depend on multiple tables. An orchestrator can combine and sequence many data processes, add custom logic, and provide data sensors to produce a final result.
Some orchestration options are the cloud-agnostic Apache Airflow or Data Factory on Azure. I recommend Apache Airflow, since it has connectors for most of the widely used data tools and services.
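To make the idea concrete, here is a minimal Airflow DAG sketch (assuming Airflow 2.4+; the task scripts and IDs are hypothetical) that runs the sales and marketing jobs before building a combined report:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="daily_reports",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    sales = BashOperator(task_id="process_sales",
                         bash_command="python process_sales.py")
    marketing = BashOperator(task_id="process_marketing",
                             bash_command="python process_marketing.py")
    report = BashOperator(task_id="build_final_report",
                          bash_command="python build_report.py")

    # The final report only runs after both upstream jobs succeed.
    [sales, marketing] >> report
```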
Step 5: Data Streaming
Let's take a step further: the head of the company wants to see sales in a beautiful, real-time dashboard. Data streaming is the ability to work with unbounded data that arrives from its source as soon as it is produced. Suppose all the company's stores are equipped with a connection that sends every sale as soon as the payment is approved. Now you can use the Apache Spark streaming component (Structured Streaming) to feed fresh dashboards with real-time sales. Cool, huh? Streaming is a growing trend alongside the Internet of Things.
Of course, not every business needs real-time data, but if yours does, lean on the cloud providers here as well, since they offer high availability and, again, auto-scaling. Most cloud providers already offer a data streaming service, such as Pub/Sub on GCloud or Amazon Kinesis on AWS; if you need something more customizable, choose Apache Kafka.
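As a sketch of what the streaming job could look like with Spark Structured Streaming reading from a Kafka topic (the broker address, topic name, and event schema are assumptions, and the Kafka connector package must be available to Spark):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("realtime-sales").getOrCreate()

# Hypothetical schema of the sale events sent by the stores.
schema = StructType([
    StructField("store_id", StringType()),
    StructField("amount", DoubleType()),
    StructField("sold_at", TimestampType()),
])

# Read the unbounded stream of sales from a Kafka topic.
events = (
    spark.readStream.format("kafka")
         .option("kafka.bootstrap.servers", "broker:9092")
         .option("subscribe", "sales")
         .load()
         .select(F.from_json(F.col("value").cast("string"), schema).alias("sale"))
         .select("sale.*")
)

# Revenue per store over 1-minute windows, feeding the real-time dashboard.
revenue = (
    events.withWatermark("sold_at", "5 minutes")
          .groupBy(F.window("sold_at", "1 minute"), "store_id")
          .agg(F.sum("amount").alias("revenue"))
)

# Writing to the console for illustration; a real job would write to a sink
# the dashboard can read from.
query = revenue.writeStream.outputMode("update").format("console").start()
query.awaitTermination()
```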
Step 6: Data Quality and Observability
Last but not least, data quality guarantees that the information used is accurate, consistent, and timely. As the business grows, ensuring data quality becomes a real concern, and automated tools are the best option for dealing with large data volumes. You can create an autonomous check to detect negative sales amounts, for example.
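Before reaching for a full framework, such a check can start as a small guard in the pipeline itself; the column name and the failure behaviour below are illustrative:

```python
import pandas as pd

def check_no_negative_amounts(sales: pd.DataFrame) -> None:
    """Fail the pipeline if any sale has a negative amount."""
    bad_rows = sales[sales["amount"] < 0]
    if not bad_rows.empty:
        # In a real pipeline you would also send this to your alerting channel.
        raise ValueError(f"Found {len(bad_rows)} sales with a negative amount")

# Illustrative data: the last row should trigger the check.
sales = pd.DataFrame({"product": ["A4", "Letter", "A3"],
                      "amount": [10.0, 25.5, -3.0]})
check_no_negative_amounts(sales)
```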
Data observability complements this by providing real-time monitoring and understanding of data systems and workflows, with alerting that enables quick detection and resolution of issues. Great Expectations is a great framework for creating custom data quality rules, and it also works with Spark. You'll also need to set up monitoring logic in the cloud provider you choose to detect anomalies in your processes or when a system goes down.
Examples are Cloud Monitoring in Google Cloud or Amazon CloudWatch in AWS.
So, here's to a happy and successful data journey! May your data always be clean, your processes efficient, and your insights profound. Happy data processing!