AWS MWAA — Testing AWS managed airflow with a free tier account
Preamble
Let me start this post by answering the most common question:
Is MWAA part of the AWS free tier offering?
No. But it can cost you as little as £3/$3.65 to test it.
And I honestly do not expect it ever will be, since AWS already offers many things as part of the “free tier” that allow us to test many different technologies, and even to have small projects deployed and running on the Internet, without ever needing to pay a cent (or a penny).
Also, to do this test (on my private AWS account), I spent $3.65 (less than £3). That is the price of a cappuccino in London.
I am not going to claim that I am an expert on Airflow, but I have been using it for a few years now. Being open source software, I initially worked on it as it was deployed on AWS by my team at Worldremit, mostly to handle the necessary dependencies that come with implementing a data warehouse in the classic Kimball methodology (many dependencies around when something should run).
Fast-forward to 2024: I am working for msk.ai (AKA myrecovery), where it was decided that the tech team would have an “innovation day” (AKA hackathon) every 8 weeks, and Oluwafemi, Karl and I (Jose) decided to test MWAA for the data projects we have at this company. The company was already using Airflow even before I was hired, as a service running from a Docker container on an EC2 instance (all managed by the data engineers). So, besides MWAA itself, we also wanted to test the monitoring, alerting and running costs we could build around it (covered in the sections below).
What is Airflow anyway?
Airflow is an orchestration tool widely used by data engineering teams to organise what code should run and when. All in Python.
In Airflow we have the concept of a DAG (Directed Acyclic Graph), where you can define things such as when that DAG should run. Inside a DAG we have tasks, which define the actual graph structure by declaring not only what code should run, but also after which other task it should run (see the minimal sketch below).
There is much, much more to Airflow, but this post is not about Airflow itself, so I would advise anyone wanting to know more to look at the official documentation.
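As a quick illustration of those two ideas (a schedule plus task ordering), here is a minimal, hypothetical DAG with two tasks, where the second only runs after the first succeeds:

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# A hypothetical two-task DAG: "extract" must finish before "load" starts.
with DAG(
    dag_id="minimal_example",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",  # run once per day
) as dag:
    extract = PythonOperator(
        task_id="extract",
        python_callable=lambda: print("extracting..."),
    )
    load = PythonOperator(
        task_id="load",
        python_callable=lambda: print("loading..."),
    )

    # The >> operator defines the edge in the graph: extract, then load.
    extract >> load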
Step by step guide to use MWAA
For the purpose of the 1-day spike/MVP/hackathon we used the AWS web console and did not write any code to create the infrastructure (CloudFormation, Terraform…).
Our initial plan was:
Create a DAG that sometimes fails and sometimes works (it can be random)
As we were already running Airflow in the company, it was easy for us to create the DAGs and run them on our laptops before starting MWAA on AWS (see the local validation sketch after the DAG code below).
We created 3 DAGs without much concern about “code beauty”:
# Python code used to create the 3 DAGs
# We also tried with 3 different files and that also works
# (to add/remove/change DAGs)
from datetime import datetime
from random import randint

from airflow import DAG
from airflow.operators.python import PythonOperator


def fail_or_success():
    # Succeeds for 0-8, raises ZeroDivisionError for 9 or 10
    value = randint(0, 10)
    return value / 1 if value < 9 else value / 0


def always_fails():
    return 1 / 0


with DAG(
    dag_id="always_passes",
    start_date=datetime(2024, 1, 1),
    schedule="* * * * *",  # every minute (assumed; the original schedule string was garbled)
) as dag:
    success_task = PythonOperator(
        task_id="always_passes_task1", python_callable=lambda: print("Task ok")
    )

with DAG(
    dag_id="always_fails",
    start_date=datetime(2024, 1, 1),
    schedule="* * * * *",  # every minute (assumed)
) as dag:
    fail_task = PythonOperator(
        task_id="always_fails_task1",
        python_callable=always_fails,
    )

with DAG(
    dag_id="fails_sometimes",
    start_date=datetime(2024, 1, 1),
    schedule="* * * * *",  # every minute (assumed)
) as dag:
    fail_task = PythonOperator(
        task_id="fails_sometimes_task1",
        python_callable=fail_or_success,
    )
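Since we developed these DAGs on our laptops first, a quick way to validate that a file parses and the DAGs load (without a full Airflow deployment) is Airflow's own DagBag; a minimal sketch, assuming the code above is saved as example_dags.py in the current folder:

# Minimal local check: load the DAG files and report any import errors.
from airflow.models import DagBag

# Assumes the DAG file above was saved as "example_dags.py" in this folder.
dag_bag = DagBag(dag_folder=".", include_examples=False)

print("DAGs found:", list(dag_bag.dags))
print("Import errors:", dag_bag.import_errors)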
Create an S3 bucket to hold your DAGs
Go to S3 on the web console and create a new bucket to hold your MWAA resources.
Inside the new S3 bucket, create a folder called dags.
I believe other resources (e.g. logs and other Python code) could also live in this bucket, but we are keeping it simple for now.
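If you prefer doing this from code instead of the console, here is a minimal boto3 sketch (the bucket name is a hypothetical placeholder; note that MWAA requires versioning to be enabled on its source bucket):

import boto3

s3 = boto3.client("s3", region_name="us-east-1")

# Hypothetical bucket name; S3 bucket names must be globally unique.
bucket = "my-mwaa-test-bucket"

s3.create_bucket(Bucket=bucket)

# MWAA requires versioning to be enabled on its source bucket.
s3.put_bucket_versioning(
    Bucket=bucket,
    VersioningConfiguration={"Status": "Enabled"},
)

# S3 has no real folders; an empty "dags/" key acts as one in the console.
s3.put_object(Bucket=bucket, Key="dags/")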
Launch the MWAA resources on AWS
On the AWS console, search for MWAA and select it.
Please note that all of these resources will be created for you, and will also be deleted later when we choose to delete the MWAA resources.
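For reference, the same environment can also be created from code; a hedged boto3 sketch, where every name, ARN and network ID below is a placeholder (the console wizard normally creates the execution role and VPC for you):

import boto3

mwaa = boto3.client("mwaa", region_name="us-east-1")

# All names, ARNs and IDs below are placeholders for illustration only.
mwaa.create_environment(
    Name="mwaa-test",
    AirflowVersion="2.8.1",
    EnvironmentClass="mw1.small",
    SourceBucketArn="arn:aws:s3:::my-mwaa-test-bucket",
    DagS3Path="dags",
    ExecutionRoleArn="arn:aws:iam::123456789012:role/my-mwaa-execution-role",
    NetworkConfiguration={
        "SubnetIds": ["subnet-aaaa1111", "subnet-bbbb2222"],
        "SecurityGroupIds": ["sg-cccc3333"],
    },
    WebserverAccessMode="PUBLIC_ONLY",  # open to the Internet, for testing only
    MinWorkers=1,
    MaxWorkers=10,
)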
Test the new MWAA
After the console reports that MWAA has been created, it also shows the URL to reach it (when in doubt, choose to have the VPC/subnets open to the world, just for the purpose of testing it). Airflow itself will also have a password that you will need in order to open its web console.
Open the Airflow web interface and you should see no DAGs available yet.
Send the DAG files to the S3 bucket and folder you created before. Wait a minute or two and Airflow should start showing your new DAGs, or an error message explaining why it was not able to import those files.
Whenever a file is added, changed or deleted under that S3 path, you should see the change reflected in MWAA after a couple of minutes.
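Uploading can be done by drag-and-drop in the console, or from code; a small sketch, assuming the hypothetical bucket and file names from earlier:

import boto3

s3 = boto3.client("s3")

# Placeholder names: the local DAG file and the bucket created earlier.
s3.upload_file(
    Filename="example_dags.py",
    Bucket="my-mwaa-test-bucket",
    Key="dags/example_dags.py",
)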
Manually trigger a few DAGs to make sure they are running as intended and to have events sent to CloudWatch.
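We triggered ours from the Airflow UI, but for completeness this can also be scripted through the MWAA CLI endpoint; a sketch, assuming the placeholder environment name from earlier:

import base64

import boto3
import requests

mwaa = boto3.client("mwaa", region_name="us-east-1")

# Exchange AWS credentials for a short-lived Airflow CLI token.
token = mwaa.create_cli_token(Name="mwaa-test")

# Run an Airflow CLI command against the environment's webserver.
resp = requests.post(
    f"https://{token['WebServerHostname']}/aws_mwaa/cli",
    headers={
        "Authorization": f"Bearer {token['CliToken']}",
        "Content-Type": "text/plain",
    },
    data="dags trigger always_passes",
)

# stdout/stderr come back base64-encoded.
print(base64.b64decode(resp.json()["stdout"]).decode())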
Create Dashboards from CloudWatch Metrics
Now this is where it really starts getting interesting, because we already run several instances of Airflow, all inside Docker containers on EC2 instances and all manually managed. MWAA, on the other hand, automatically sends all sorts of metrics into CloudWatch.
And we can use those to create custom dashboards, like the one we created to check the number of DAG run failures.
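Creating such a dashboard can also be scripted; a sketch using boto3, where the metric name and dimension are assumptions (browse the AmazonMWAA namespace in CloudWatch to confirm exactly what your environment publishes):

import json

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

# Metric name and dimension below are assumptions; check the "AmazonMWAA"
# namespace in your account for the exact metrics your environment emits.
widget = {
    "type": "metric",
    "properties": {
        "title": "DAG run failures",
        "region": "us-east-1",
        "stat": "Sum",
        "period": 300,
        "metrics": [
            ["AmazonMWAA", "TaskInstanceFailures", "Environment", "mwaa-test"]
        ],
    },
}

cloudwatch.put_dashboard(
    DashboardName="mwaa-test-dashboard",
    DashboardBody=json.dumps({"widgets": [widget]}),
)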
Alerts
Having a dashboard that shows some insights is, by itself, already great. But then again, if we have something running in production and we want to know about problems (either only during office hours or at any time), we need alerts. Going to the actions drop-down, we have the option to send a notification to an SNS topic. From that topic, we can send a message directly to an email address (which needs to confirm its subscription first).
Or we can decide that we want to receive that notification immediately and subscribe to receive an SMS message on our phone.
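The same wiring can be expressed in a few boto3 calls; a sketch with placeholder names and endpoints (the alarm reuses the assumed metric from the dashboard example):

import boto3

sns = boto3.client("sns", region_name="us-east-1")
cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

# Create the topic and add an email and an SMS subscriber (placeholders).
topic_arn = sns.create_topic(Name="mwaa-alerts")["TopicArn"]
sns.subscribe(TopicArn=topic_arn, Protocol="email", Endpoint="me@example.com")
sns.subscribe(TopicArn=topic_arn, Protocol="sms", Endpoint="+447700900000")

# Alarm on the assumed failure metric; fires into the SNS topic.
cloudwatch.put_metric_alarm(
    AlarmName="mwaa-dag-failures",
    Namespace="AmazonMWAA",
    MetricName="TaskInstanceFailures",  # assumption, as in the dashboard
    Dimensions=[{"Name": "Environment", "Value": "mwaa-test"}],
    Statistic="Sum",
    Period=300,
    EvaluationPeriods=1,
    Threshold=0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=[topic_arn],
)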
For the sake of completeness: it is also possible to send the message to Slack. We already have that implemented in the company, via a Lambda function that subscribes to the SNS topic and re-sends the message to Slack. We did not implement this part on our 1-day spike, though.
It is also important to note that none of this is a replacement for more advanced products like PagerDuty, which handle escalation policies, alert re-sending and many other great features. But for a simple use case, where you do not want to (or cannot) spend more money, this native AWS CloudWatch solution works just fine.
Is using MWAA a good idea?
We believe it is: it removes the manual management of our Docker-on-EC2 setup, it ships metrics into CloudWatch out of the box, and (as shown below) a small environment is affordable.
But how much does it really cost?
I used the AWS calculator to estimate the cost of a “small” MWAA environment in Virginia (usually the cheapest AWS region), with a maximum of 10 workers, and gave a very generous estimate of Airflow using all 10 workers for an average of 3 hours per day.
The estimated price for this setup was 524.32 USD per month.
The full breakdown of this estimate:

Environment:
0.49 USD per hour x 730 hours per month = 357.70 USD
Environment monthly costs: 357.70 USD

Workers:
1 minimum worker - 1 included with environment = 0 billable minimum workers
10 maximum workers - 1 minimum worker = 9 billable maximum workers
9 maximum workers x 91.25 hours at max = 821.25 maximum worker hours
821.25 worker hours x 0.055 USD per hour = 45.17 USD
Worker monthly costs: 45.17 USD

Schedulers:
5 schedulers - 2 included with environment = 3 billable scheduler instances
3 scheduler instances x 730 hours per month x 0.055 USD per hour = 120.45 USD
Scheduler monthly costs: 120.45 USD

Meta database:
10 GB x 0.10 USD per GB-month = 1.00 USD
Meta database monthly costs: 1.00 USD

Total:
357.70 USD + 45.17 USD + 120.45 USD + 1.00 USD = 524.32 USD
Total Managed Workflows with Apache Airflow costs (monthly): 524.32 USD
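For the curious, the arithmetic above is easy to reproduce; a small sketch using the same prices and assumptions:

# Reproduces the AWS calculator estimate above ("small" MWAA, us-east-1).
HOURS_PER_MONTH = 730
MAX_WORKER_HOURS = 3 * 365 / 12  # 3 hours/day at max workers = 91.25 h/month

environment = 0.49 * HOURS_PER_MONTH                 # 357.70 USD
workers = (10 - 1) * MAX_WORKER_HOURS * 0.055        # 45.17 USD
schedulers = (5 - 2) * HOURS_PER_MONTH * 0.055       # 120.45 USD
database = 10 * 0.10                                 # 1.00 USD

total = environment + workers + schedulers + database
print(f"Estimated monthly cost: {total:.2f} USD")    # 524.32 USD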
Using a “small” MWAA environment is completely acceptable and scales really well, as long as you do not use it to run CPU- or memory-intensive workloads directly, but have those run in Docker containers instead.
Next steps?
At this moment we are using Airflow roughly the way it was initially designed to be used: we run DAGs based on intervals (in our case, timestamps from tables), and if we need to backfill some table for some reason (e.g. a new column is added to a source or to a destination table), we can re-run that DAG for a specific period or “for all time”.
This is great, but it has its issues. For example, all code runs on the Airflow server, and if one DAG uses too much CPU or memory, it will hurt the other DAGs that also need to run.
A common alternative is to use Airflow as an orchestrator, but have all code running on Docker containers.
That is where technologies like AWS Batch, ECS (or EKS) and spot instances come into the mix. But I am getting ahead of myself here. This post is about MWAA and its advantages; the rest I will leave for a later post.