Adding Python wheel dependencies to Glue jobs

Reference 1: Repost article

Reference 2: AWS Glue docs

I am sharing this in case someone faces a similar task. I had to run an AWS Glue PySpark job that needed an import from the fire library (https://pypi.org/project/fire/). Because Glue does not include fire by default, I had to either pip install it on the Glue executors or provide a whl file. The corporate network does not allow downloading Python libraries from the internet on the fly, so pip install via --additional-python-modules was not possible.

Typically, I would find the whl file on pypi.org under the package's download files, upload it to my S3 bucket, and pass the full S3 URI, 's3://bucket-name/folder/package-name.whl', under --extra-py-files.
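As a minimal sketch of that last step, the job argument can be assembled in Python. The helper name glue_wheel_arguments is hypothetical and not part of any AWS library; it only builds a dictionary of the kind you could pass as the job's default arguments (for example through the DefaultArguments parameter of boto3's create_job). It assumes the wheels are already uploaded and relies on --extra-py-files accepting several full S3 paths separated by commas.

```python
def glue_wheel_arguments(wheel_uris):
    """Build the Glue job argument that points at wheel files stored in S3.

    Hypothetical helper for illustration: it validates the URIs and joins
    them the way --extra-py-files expects (one comma-separated string).
    """
    for uri in wheel_uris:
        if not uri.startswith("s3://") or not uri.endswith(".whl"):
            raise ValueError(f"expected a full S3 URI to a .whl file, got: {uri}")
    return {"--extra-py-files": ",".join(wheel_uris)}


# Placeholder bucket and file names, same as in the text above.
args = glue_wheel_arguments(["s3://bucket-name/folder/package-name.whl"])
print(args)
```

Validating the URI up front is a design choice: a typo in the path otherwise only shows up as an import error once the job is already running.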

Because no whl file was published in the download section for fire, I could only download fire-0.5.0.tar.gz, and that was not compatible with the Glue PySpark job.

So I had to build the wheel from the library's source. The library is open sourced by Google, so the source was easy to locate: https://github.com/google/python-fire/tree/master

Note: the repo already has a setup.py file with all the information needed for the build.

All I had to do was download the git repo to my local machine (or an EC2 machine) and run the commands below from the repo root:

pip install setuptools wheel 
python setup.py bdist_wheel        
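The build writes the wheel into a dist directory next to setup.py, and that file is what gets uploaded to S3. Below is a small sketch for locating the newest wheel and reading the tags encoded in its file name. The function names are hypothetical, and the parsing assumes the standard wheel naming scheme (name-version-python tag-abi tag-platform tag.whl). A wheel tagged none-any contains no compiled code, so it does not depend on the platform it was built on.

```python
from pathlib import Path


def newest_wheel(dist_dir="dist"):
    """Return the most recently modified .whl file in dist_dir."""
    wheels = sorted(Path(dist_dir).glob("*.whl"), key=lambda p: p.stat().st_mtime)
    if not wheels:
        raise FileNotFoundError(f"no .whl file found in {dist_dir}")
    return wheels[-1]


def wheel_tags(filename):
    """Split a wheel file name into its python, abi and platform tags."""
    name = Path(filename).name
    if not name.endswith(".whl"):
        raise ValueError(f"not a wheel file: {name}")
    # The last three dash-separated fields are always the compatibility tags.
    parts = name[: -len(".whl")].split("-")
    if len(parts) < 5:
        raise ValueError(f"unexpected wheel file name: {name}")
    return {"python": parts[-3], "abi": parts[-2], "platform": parts[-1]}


def is_pure_python(filename):
    """True when the tags say the wheel has no compiled, platform-specific code."""
    tags = wheel_tags(filename)
    return tags["abi"] == "none" and tags["platform"] == "any"


# Usage after a build, run from the repo root:
#   wheel = newest_wheel("dist")
#   print(wheel.name, is_pure_python(wheel.name))
```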



Creating a wheel file for our own Python code

I also had to do this for my own Python library. Here is a template for a setup.py file:

from setuptools import setup, find_packages

setup(
    name="YourPackageName",
    version="0.1.0",
    author="Your Name",
    author_email="your.email@example.com",
    description="A short description of the project",
    long_description=open('README.md').read(),
    long_description_content_type="text/markdown",
    url="https://github.com/yourusername/yourpackagename",
    packages=find_packages(),
    classifiers=[
        "Programming Language :: Python :: 3",
        "License :: OSI Approved :: MIT License",
        "Operating System :: OS Independent",
    ],
    python_requires='>=3.6',
    install_requires=[
        "numpy",
        "pandas>=1.1.0",
        "scikit-learn==0.24.1",
        "matplotlib",
        "requests"
    ],
)        

After you've created this setup.py file and placed it in the root directory of your project, you can then generate a wheel file by running the following commands in your terminal:

pip install setuptools wheel
python setup.py sdist bdist_wheel        

This will create a dist directory containing your .whl file, which is the wheel package for your project. Remember to ensure your project structure is properly set up, with your Python packages and modules organized in a way that setuptools can recognize.
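One way to check that setuptools picked up your packages is to look inside the wheel before uploading it. A wheel is a zip archive, so the standard library is enough. The sketch below uses a hypothetical helper, top_level_names, and a placeholder file name. If your package folder is missing from the result, find_packages() most likely did not find it (a missing __init__.py is a common cause).

```python
import zipfile


def top_level_names(wheel_path):
    """Return the sorted top-level entries packed into a wheel file."""
    with zipfile.ZipFile(wheel_path) as archive:
        # Keep only the first path component of every archived file.
        return sorted({name.split("/")[0] for name in archive.namelist()})


# Usage with a placeholder file name:
#   print(top_level_names("dist/YourPackageName-0.1.0-py3-none-any.whl"))
# You should see your package folder(s) next to a *.dist-info directory.
```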
