AWS MWAA — Testing AWS managed airflow with a free tier account

Preamble

Let me start this post by answering the most common question:

Is MWAA part of the AWS free tier offering?

No. But it can cost you as little as £3/$3.65 to test it.

And I honestly don’t have any hope it ever will be, since AWS already offers many things as part of the “free tier” that allow us to test many different technologies, and even have small projects deployed and running on the Internet, without ever needing to pay a cent (or a penny).

Also, to do this test (on my private AWS account), I spent $3.65 (less than £3). That is the price of a cappuccino in London.

Final bill on my (usually free) personal AWS account

I am not going to claim to be an expert on Airflow, but I have been using it for a few years now. I first worked with this open source software when it was deployed on AWS by my team at Worldremit, mostly to handle the dependencies that come with implementing a Data Warehouse in the classic Kimball methodology (many dependencies around when something should run).

Mug shots of the 3 developers

Fast-forward to 2024: I am now working for msk.ai (AKA myrecovery), where it was decided that the tech team would have an “innovation day” (AKA Hackathon) every 8 weeks. Oluwafemi, Karl and I (Jose) decided to test MWAA for the data projects we have at this company. The company was already using Airflow before I was hired, as a service running from a Docker container on an EC2 instance (all managed by the data engineers). So, besides MWAA itself, we also wanted to test:

  • Could we create a useful dashboard, using the metrics MWAA sends to CloudWatch, to show successful and failed DAGs?
  • Could we receive alerts on our emails and phones when some DAG failed?

What is Airflow anyway?

Airflow is an orchestration tool widely used by data engineering teams to organise what code should run and when. All in Python.

In Airflow we have the concept of a DAG (Directed Acyclic Graph), where you can define things such as when that DAG should run. Inside a DAG we have tasks, which define the actual graph structure: not only what code should run, but also after which other task it should run.
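As a quick, hedged illustration (the DAG id, task ids and functions below are made up for this post), a minimal DAG with two dependent tasks could look like this:

# A minimal, illustrative DAG: two tasks where "load" only runs after "extract"
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract():
    print("extracting data")


def load():
    print("loading data")


with DAG(
    dag_id="example_dependencies",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",  # when this DAG should run
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = PythonOperator(task_id="load", python_callable=load)

    # This is the "graph" part: load_task only starts after extract_task succeeds
    extract_task >> load_task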


There is much, much more to Airflow, but this post is not about Airflow itself, so I would advise anyone wanting to know more to look at the official documentation.

Step-by-step guide to using MWAA

For the purpose of this 1-day spike/MVP/Hackathon we used the AWS web console and did not write any code to create the infrastructure (CloudFormation, Terraform…).

Our initial plan was:

  1. Create a DAG that sometimes fails and sometimes works (it can be random)
  2. Create an S3 bucket to hold your DAGs
  3. Start MWAA and have a few DAG runs that fail and others that succeed
  4. Go to CloudWatch and see what kind of metrics MWAA is sending there
  5. Create a Dashboard with relevant metrics
  6. Send an email and an SMS message when a DAG fails
  7. Stop MWAA so that I don't spend all my salary on this spike

Create a DAG that sometimes fails and sometimes works (it can be random)

As we were already running Airflow in the company, it was easy for us to create the DAGs and test them on our laptops before starting MWAA on AWS.

We created 3 DAGs without much concern about “code beauty”:

  • 1 DAG ran simple Python code that always succeeded.
  • 1 DAG ran simple Python code that failed (with an exception) about 30% of the time.
  • 1 DAG ran simple Python code that always failed (with an exception).

# Python code used to create the 3 DAGs
# We also tried with 3 different files and that also works
# (to add/remove/change DAGs)
from datetime import datetime
from random import randint

from airflow import DAG
from airflow.operators.python import PythonOperator


def fail_or_success():
    # Raises ZeroDivisionError when randint returns 9 or 10
    value = randint(0, 10)
    return value / 1 if value < 9 else value / 0


def always_fails():
    # Always raises ZeroDivisionError
    return 1 / 0


with DAG(
    dag_id="always_passes",
    start_date=datetime(2024, 1, 1),
    schedule="* * * * *",  # cron schedule: run every minute
) as dag:
    success_task = PythonOperator(
        task_id="always_passes_task1", python_callable=lambda: print("Task ok")
    )

with DAG(
    dag_id="always_fails",
    start_date=datetime(2024, 1, 1),
    schedule="* * * * *",
) as dag:
    fail_task = PythonOperator(
        task_id="always_fails_task1",
        python_callable=always_fails,
    )

with DAG(
    dag_id="fails_sometimes",
    start_date=datetime(2024, 1, 1),
    schedule="* * * * *",
) as dag:
    fail_task = PythonOperator(
        task_id="fails_sometimes_task1",
        python_callable=fail_or_success,
    )

Create an S3 bucket to hold your DAGs

Go to S3 on the web console and create a new bucket to hold your MWAA resources.

Inside the new S3 bucket, create a folder called dags.

I believe other resources (e.g. logs and other Python code) could be in this bucket, but we are keeping it simple for now.

AWS Dashboard of the S3 bucket and location of our first DAG file
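If you prefer to script this step instead of clicking through the console, something like the boto3 sketch below should work. The bucket name and region are placeholders I made up; if I remember correctly, MWAA also expects the bucket to have versioning enabled and public access blocked, so the sketch sets those too.

# Hedged sketch: create the bucket MWAA will read DAGs from.
# Bucket name and region are placeholders - change them for your account.
import boto3

region = "eu-west-2"
bucket = "my-mwaa-test-bucket"

s3 = boto3.client("s3", region_name=region)

s3.create_bucket(
    Bucket=bucket,
    CreateBucketConfiguration={"LocationConstraint": region},
)

# MWAA expects versioning enabled and public access blocked on this bucket
s3.put_bucket_versioning(
    Bucket=bucket, VersioningConfiguration={"Status": "Enabled"}
)
s3.put_public_access_block(
    Bucket=bucket,
    PublicAccessBlockConfiguration={
        "BlockPublicAcls": True,
        "IgnorePublicAcls": True,
        "BlockPublicPolicy": True,
        "RestrictPublicBuckets": True,
    },
)

# The "dags" folder is just a key prefix in S3
s3.put_object(Bucket=bucket, Key="dags/")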

Launch the MWAA resources on AWS

On the AWS console, search for MWAA and select it.

  • You can then select “Create environment”.
  • Fill in what is being asked; after selecting the S3 bucket created in the previous step, new options appear. For our simple test, we only need to select the dags folder, also created in the previous step.
  • Press next and you will need to select an existing VPC. As I didn’t have one already created, I chose “Create MWAA VPC”.
  • For the security group, I also selected “Create new security group”.

Please note that all these resources will be created now, but they will also be deleted later, when we delete the MWAA environment.

Creation of our MWAA, using the AWS console
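We did everything in the console, but for reference, the same environment could be created with boto3 along the lines of the sketch below. Every value here is a placeholder: in particular, the execution role, subnets and security group are the ones the console wizard created for us.

# Hedged sketch: create an MWAA environment with boto3.
# All names/ARNs/IDs are placeholders for the resources the console wizard creates.
import boto3

mwaa = boto3.client("mwaa", region_name="eu-west-2")

mwaa.create_environment(
    Name="mwaa-hackathon-test",
    EnvironmentClass="mw1.small",
    SourceBucketArn="arn:aws:s3:::my-mwaa-test-bucket",
    DagS3Path="dags",
    ExecutionRoleArn="arn:aws:iam::123456789012:role/my-mwaa-execution-role",
    NetworkConfiguration={
        "SubnetIds": ["subnet-aaaa1111", "subnet-bbbb2222"],  # 2 private subnets
        "SecurityGroupIds": ["sg-cccc3333"],
    },
    WebserverAccessMode="PUBLIC_ONLY",  # easiest for a throwaway test
    MaxWorkers=2,
)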

Test the new MWAA

After the console reports that MWAA has been created, it also shows the URL to reach it (when in doubt, choose to have the VPC/subnets open to the world, just for the purpose of testing it). Airflow itself will also have a password that you will need in order to open its web console.

Open the Airflow web interface and you should see no DAGs available yet.

Upload the DAG files into the S3 bucket and dags folder you created before. Wait a minute or two and Airflow should start showing your new DAGs, or an error message explaining why it was not able to import those files.

Whenever a file is added, changed or deleted under that S3 prefix, you should see the changes in MWAA after a couple of minutes.
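Uploading (or removing) the DAG file can also be done with a couple of lines of boto3, which is handy if you want to automate deployments later. The bucket name and file name below are placeholders.

# Hedged sketch: push the DAG file to the dags/ prefix MWAA watches
import boto3

s3 = boto3.client("s3")
s3.upload_file("my_test_dags.py", "my-mwaa-test-bucket", "dags/my_test_dags.py")

# Deleting the object removes the DAGs from MWAA after a couple of minutes
# s3.delete_object(Bucket="my-mwaa-test-bucket", Key="dags/my_test_dags.py")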

Manually trigger a few DAGs to make sure they are running as intended and to have events sent to CloudWatch.

Our Airflow interface with the 3 DAGs
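We triggered them from the Airflow UI, but it is also possible to do it programmatically through the MWAA CLI token endpoint. The sketch below follows the pattern described in the AWS docs; the environment name is the placeholder used earlier and the DAG id is one from this post.

# Hedged sketch: trigger a DAG run without opening the Airflow UI.
# MWAA exposes the Airflow CLI behind a short-lived token.
import base64

import boto3
import requests

mwaa = boto3.client("mwaa", region_name="eu-west-2")
token = mwaa.create_cli_token(Name="mwaa-hackathon-test")

response = requests.post(
    f"https://{token['WebServerHostname']}/aws_mwaa/cli",
    headers={
        "Authorization": f"Bearer {token['CliToken']}",
        "Content-Type": "text/plain",
    },
    data="dags trigger fails_sometimes",
)

# stdout/stderr of the CLI call come back base64 encoded
print(base64.b64decode(response.json()["stdout"]).decode())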

Create Dashboards from CloudWatch Metrics

Now this is where it really starts getting interesting, because we are already running several instances of Airflow, but all inside Docker containers on EC2 instances and all manually managed. MWAA, on the other hand, automatically sends all sorts of metrics to CloudWatch:

Some metrics created by MWAA on CloudWatch

And we can use that to create custom Dashboards like this one that we created to check the number of DAG run failures:

Our Dashboard showing failed and successful DAG runs
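Dashboards can also be defined as code. The sketch below shows the shape of a boto3 put_dashboard call; the metric and dimension names are assumptions on my part, so copy the exact names you see under the AmazonMWAA namespace in your own CloudWatch console (as in the screenshot above).

# Hedged sketch: a one-widget dashboard for failed task instances.
# Replace the metric/dimension names with the ones your environment actually emits.
import json

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="eu-west-2")

dashboard_body = {
    "widgets": [
        {
            "type": "metric",
            "x": 0, "y": 0, "width": 12, "height": 6,
            "properties": {
                "title": "Failed task instances",
                "region": "eu-west-2",
                "stat": "Sum",
                "period": 300,
                "metrics": [
                    ["AmazonMWAA", "TaskInstanceFailures",  # assumed metric name
                     "Environment", "mwaa-hackathon-test"],
                ],
            },
        }
    ]
}

cloudwatch.put_dashboard(
    DashboardName="mwaa-dag-runs", DashboardBody=json.dumps(dashboard_body)
)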

Alerts

Having a Dashboard that shows some insights is, by itself, already great. But if we have something running in production and we want to know when it breaks (either only during office hours or at any time), we need alerts. Using the actions drop-down on a CloudWatch alarm, we have the option to send a notification to an SNS topic. From that topic, we can send a message directly to an email address (the address needs to subscribe to the topic first):

Example of an alert sent to email

Or we can decide that we want to receive that notification immediately and subscribe to receive an SMS message on our phone:

Example of an alert sent to my mobile phone (SMS)
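The wiring behind those screenshots is just an SNS topic with an email and an SMS subscription, plus a CloudWatch alarm whose action points at the topic. A hedged sketch follows; again, the metric name, environment name, email address and phone number are placeholders, so use the values from your own account and dashboard.

# Hedged sketch: SNS topic + subscriptions + a CloudWatch alarm on task failures.
import boto3

sns = boto3.client("sns", region_name="eu-west-2")
cloudwatch = boto3.client("cloudwatch", region_name="eu-west-2")

topic_arn = sns.create_topic(Name="mwaa-dag-failures")["TopicArn"]

# Email subscription (AWS sends a confirmation link that must be clicked)
sns.subscribe(TopicArn=topic_arn, Protocol="email", Endpoint="data-team@example.com")
# SMS subscription for the "tell me immediately" case
sns.subscribe(TopicArn=topic_arn, Protocol="sms", Endpoint="+447700900123")

cloudwatch.put_metric_alarm(
    AlarmName="mwaa-dag-failures",
    Namespace="AmazonMWAA",
    MetricName="TaskInstanceFailures",  # assumed name - check your metrics
    Dimensions=[{"Name": "Environment", "Value": "mwaa-hackathon-test"}],
    Statistic="Sum",
    Period=300,
    EvaluationPeriods=1,
    Threshold=0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=[topic_arn],
    TreatMissingData="notBreaching",
)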

For the sake of completeness, it is also possible to send a message to Slack. We already have that implemented in the company, with a Lambda function that subscribes to an SNS topic and re-sends the message to Slack. We did not implement this part during our 1-day spike, though.

It is also important to note that none of this is a replacement for more advanced products like PagerDuty, which handle escalation policies, alert re-sending and many other great features. But for a simple use case, where you don’t want to (or cannot) spend more money, this native AWS CloudWatch solution works just fine.


Is using MWAA a good idea?

We believe it is, because:

  1. You get Airflow that you don’t really need to manage
  2. Adding/Updating/Removing DAGs is very simple (simply add/change/remove files from a folder on an S3 bucket)
  3. You get automatic metrics that you can use to create simple (but useful) Dashboards as well as alerts that you can receive in different ways.

But how much does it really cost?

I used the AWS calculator to estimate the cost of a “small” MWAA environment in Virginia (usually the cheapest AWS region), with a maximum of 10 workers, and gave a very generous estimate of Airflow actually using those 10 workers for an average of 3 hours per day.

The estimated price for this setup was 524.32 USD per month.

The full breakdown of this estimate:

  • Environment: 0.49 USD per hour x 730 hours per month = 357.70 USD per month
  • Workers: 1 minimum worker - 1 included with the environment = 0 billable minimum workers; 10 maximum workers - 1 minimum worker = 9 billable maximum workers; 9 workers x 91.25 hours at max (3 hours per day) = 821.25 worker hours; 821.25 worker hours x 0.055 USD per hour = 45.17 USD per month
  • Schedulers: 5 schedulers - 2 included with the environment = 3 billable scheduler instances; 3 instances x 730 hours per month x 0.055 USD per hour = 120.45 USD per month
  • Meta database: 10 GB x 0.10 USD per GB-month = 1.00 USD per month
  • Total: 357.70 + 45.17 + 120.45 + 1.00 = 524.32 USD per month
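If you want to play with the numbers yourself (different worker counts or hours per day), the whole estimate boils down to a few lines of arithmetic:

# Reproducing the calculator estimate above (prices quoted for a "small" environment in Virginia)
environment = 0.49 * 730              # environment, per hour x hours per month
workers = (10 - 1) * 91.25 * 0.055    # 9 extra workers, ~3 hours/day at max
schedulers = (5 - 2) * 730 * 0.055    # 3 extra schedulers, always on
database = 10 * 0.10                  # 10 GB of metadata storage

print(round(environment + workers + schedulers + database, 2))  # 524.32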

Using a “small” MWAA environment is completely acceptable and scales really well, as long as you do not use it to run CPU- or memory-intensive workloads directly, but instead run those in Docker containers.

Next steps?

At this moment we are using Airflow roughly the way it was initially created to be used: we run DAGs based on intervals (in our case, timestamps from tables) and, if we need to backfill some table for some reason (e.g. a new column is added to a source or to a destination table), we can re-run that DAG for a specific period or “for all time”.

This is great, but it has its issues. For example, all code runs on the Airflow server, and if one DAG uses too much CPU or memory, it will hurt other DAGs that also need to run.

A common alternative is to use Airflow as an orchestrator, but have all code running on Docker containers.

That is where technologies like AWS Batch, ECS (or EKS) and spot instances come into the mix. But I am getting ahead of myself here. This post is about MWAA and its advantages. The rest I will leave for a later post.
