The multi-threading hammer - Part 4

And the process was being orchestrated in Airflow . . . (Part 3, link below)

We expected good results from multi-threading and, when that did not work, from multi-processing. Neither approach delivered. Although we were executing on a Spark cluster, we were using Python's own multi-threading (and then multi-processing) features, which know nothing about the cluster: every thread and process ran on a single machine. When 200 threads or processes are spawned on one machine, things do not work as expected, because the CPU still has only 4, 8, or 16 cores. For CPU-bound work, threads in standard CPython are additionally serialized by the global interpreter lock (GIL). We were clearly limited by the available compute capacity.
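To put rough numbers on that limit, here is a minimal sketch. It assumes purely CPU-bound work and ignores scheduling overhead, so it gives an idealized upper bound rather than a measurement; the function names are made up for illustration.

```python
import os


def effective_parallelism(workers: int, cores: int) -> int:
    """Upper bound on CPU-bound workers that can truly run at the same instant."""
    return min(workers, cores)


def cpu_share_per_worker(workers: int, cores: int) -> float:
    """Idealized fraction of one core that each worker gets on average."""
    return min(1.0, cores / workers)


if __name__ == "__main__":
    cores = os.cpu_count() or 1  # cores on this one machine only
    print("cores on this machine:", cores)
    print("truly parallel workers out of 200:", effective_parallelism(200, cores))
    print("average core share per worker:", cpu_share_per_worker(200, cores))
```

With 200 workers on an 8-core machine, at most 8 run at any instant, and each worker averages about 4% of a core.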

Our next approach was to look for a solution using Airflow, which was already being used for orchestration. I suggested that we define an Airflow Directed Acyclic Graph (DAG) that would perform the EBCDIC to ASCII conversion. We would then launch multiple instances of the same DAG at intervals of, say, one minute. Each DAG run would be given the offset from which to read the file and would write its result to a directory.
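The per-block work that each DAG run would do can be sketched in plain Python. This is only an illustration under several assumptions: the EBCDIC code page is taken to be `cp037`, the file is assumed to contain character data only (packed-decimal fields cannot be decoded with a character codec), and the function names and output file naming are hypothetical. In practice the block size would also need to be a multiple of the record length so that no record is split across blocks.

```python
from pathlib import Path


def plan_blocks(file_size: int, block_size: int) -> list[tuple[int, int]]:
    """Split a file of file_size bytes into (offset, length) pairs."""
    return [
        (offset, min(block_size, file_size - offset))
        for offset in range(0, file_size, block_size)
    ]


def convert_block(src: str, dst_dir: str, offset: int, length: int,
                  codepage: str = "cp037") -> Path:
    """Read one block from src, decode it from EBCDIC, and write it as ASCII."""
    with open(src, "rb") as handle:
        handle.seek(offset)           # jump straight to this block
        raw = handle.read(length)
    text = raw.decode(codepage)       # assumed code page; adjust to the real one
    # Zero-padded offset in the name keeps output files in file order when sorted.
    out_path = Path(dst_dir) / f"block_{offset:012d}.txt"
    out_path.write_bytes(text.encode("ascii", errors="replace"))
    return out_path
```

Each (offset, length) pair is independent of the others, which is what makes it possible to hand the blocks to separate workers.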

When a DAG run is executed, Airflow can place it on any node in the cluster. This would let us spawn multiple DAG instances, spread the load across the cluster, and convert multiple blocks at the same time, in parallel.

The team did not implement the solution in Airflow. Instead, they implemented the same concept using AWS Glue, which they were more familiar with. Initially, the team launched around 200 AWS Glue tasks in parallel, and later tried 400. With this approach, we achieved a significant speed-up in the file conversion. We finally settled on a number between 200 and 400.
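I do not have the team's actual Glue code to share, so the following is only a sketch of how such a fan-out might look with boto3's `start_job_run` call. The job name, the argument names (`--src_path`, `--dst_prefix`, `--offset`, `--length`), and the helper function are all hypothetical. The Glue client is passed in as a parameter so the function can be exercised without AWS access.

```python
def launch_block_jobs(glue_client, job_name, blocks, src_path, dst_prefix):
    """Start one Glue job run per (offset, length) block and return the run IDs."""
    run_ids = []
    for offset, length in blocks:
        response = glue_client.start_job_run(
            JobName=job_name,
            # Glue job arguments are passed as strings; these names are hypothetical.
            Arguments={
                "--src_path": src_path,
                "--dst_prefix": dst_prefix,
                "--offset": str(offset),
                "--length": str(length),
            },
        )
        run_ids.append(response["JobRunId"])
    return run_ids


# Example usage (requires AWS credentials and an existing Glue job):
#   import boto3
#   glue = boto3.client("glue")
#   launch_block_jobs(glue, "ebcdic-to-ascii", [(0, 1024), (1024, 1024)],
#                     "s3://bucket/input.ebc", "s3://bucket/output/")
```

For a few hundred runs to execute at once, the job's maximum concurrent runs setting and the account's service quotas would have to allow it.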

End of story? Not really . . .

Link


#python #multithreading #multi_threading #multiprocessing #multi_processing #parallel #thread #aws #glue #awsglue #aws_glue
