Need to analyze massive amounts of data quickly and at low cost? AWS Athena holds the key

What if I told you that you could start analyzing large sets of data in the cloud at almost no cost and have the necessary infrastructure set up in under 1 hour? All you need is an AWS account, an S3 bucket, AWS Athena and some pretty basic knowledge of SQL.

If you are working in the IT industry, the chances that you haven't yet heard about AWS are very low, but if that is your case, let me just briefly summarize what AWS is.

What are Amazon Web Services (AWS)?


AWS is as of today the world's most comprehensive cloud platform that offers flexible, reliable, scalable, easy-to-use and cost-effective computing solutions to both companies and individuals. 

Imagine that instead of investing lots of money upfront into buying, owning and maintaining physical data centers, servers and applications, you can just sign up for AWS and instantly start using its computing power, storage, databases and more than 175 other cloud services on an as-needed basis.

AWS pricing works on a sliding scale - the less you use, the less you pay overall, and the more you use, the less you pay per unit. For some AWS services you pay nothing for the first year of use, and some of them are even offered entirely for free.

What is AWS Athena?


AWS Athena is one of the core AWS services, and it is the one we will use in this article to demonstrate how you can run interactive queries on your data directly in the AWS cloud to analyze it.

As AWS Athena is a serverless query service, you don't need to manage any underlying compute infrastructure if you want to use it - Amazon does it for you. 

What are we going to need to get started with data analysis?

1. Prepare your data Files

I assume you already have the data that you want to analyze.

It can be basically any kind of unstructured, semi-structured or structured data, but to start I recommend using data in a format such as CSV, JSON or LOG, or a columnar one like Apache Parquet or Apache ORC.

In my example I am going to use Apache httpd logs, which you may be familiar with if you have worked with the Apache web server yourself in the past.

This is what the contents of such a log file look like (one row of data):

www.nazdravie.sk 77.75.77.101 - - [30/Nov/2020:00:01:23 +0100] "GET /robots.txt HTTP/1.1" 301 243 "-" "Mozilla/5.0 (compatible; SeznamBot/3.2; +https://napoveda.seznam.cz/en/seznambot-intro/)" 0.002 BYPASS 10.10.174.106:80 0.000

2. Log into your AWS Account or create a new AWS Account

If you don't already have an AWS account, you need to create one.

Navigate to this page and fill in the form. There are no costs associated with creating an account; you pay only for the services you actually use.

Amazon provides plenty of services for free for the first year, and some of them are even free forever. Be sure to check the AWS Free Tier program to learn more.

3. Use your existing S3 Bucket or create a new Bucket


Amazon S3 (Simple Storage Service) is where you need to upload your data files in order for AWS Athena to be able to process them. What you need to do is go to the AWS S3 management console and create a new S3 bucket (unless you already have one).

You will be asked to provide a unique BUCKET NAME (feel free to choose anything; AWS will tell you whether your bucket name is unique or not).

Besides that, you also need to choose a REGION where your S3 bucket will reside physically (it is usually a good idea to choose the region closest to where you are now).

You may skip all other options and proceed to clicking "Create bucket" at the very bottom of the page.

Once your S3 Bucket is created, you will see it appear in the list.

  1. Click on its name. On the new page you will see the list of objects your bucket holds (as it is a brand new one, of course, there will be none).
  2. Click on the "Create folder" button to create a new folder that will hold Athena query results (e.g. athena).
  3. Click on the "Create folder" button again to create a new folder that will hold your data files that you wish to process using Athena (you may name it as you wish e.g. logs).
  4. Click on the folder that you created for holding your data files to get inside it.
  5. Click on the Upload button and upload your data files from your local computer to your S3 bucket. Before you leave, copy the URL location of your S3 bucket and folder somewhere to your notepad, because we will need it later on. You can find this URL below the Destination subtitle; it looks like this: s3://%YOUR-BUCKET-NAME%/%YOUR-DATA-FOLDER%/ (instead of the %placeholders% there should be the names you chose during bucket and folder creation).

4. Configure the DB schema in AWS Athena

Before we can perform our first SQL query on the data we uploaded, we need to navigate to the AWS Management Console. In the top right part of the screen, click on the region selector (the 2nd option from the right) and choose the same region you selected when creating your S3 bucket in the previous step.

Then, under Find Services, type the word Athena into the search input field and press ENTER.

Click on SETTINGS and set the Query result location to point to the folder that you created to hold Athena query results (see step 2 in section 3 above).


After you save the settings, click on the Query editor in the top menu. On the right side of the screen you will see the textarea where you should enter your query. Based on the format of the data you uploaded, you first need to create a DB schema for the data on S3, so your first query will be the schema definition itself.

In my example, we are going to create a schema that fits the format of our Apache httpd logs.

CREATE EXTERNAL TABLE logs(
 hostname STRING,
 remote_addr STRING,
 identity STRING,
 user_id STRING,
 time STRING,
 request STRING,
 status_code STRING,
 size STRING,
 referer STRING,
 user_agent STRING,
 data1 STRING,
 data2 STRING,
 data3 STRING,
 data4 STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
 "input.regex" = "^([^ ]*) ([0-9.]+) ([\\w.-]+) ([\\w.-]+) \\[([A-Za-z0-9:/]+ [+-][0-9]{4})\\] \"(.+?)\" ([0-9]{3}) ([0-9]+) \"(.+?)\" \"(.+?)\" ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*)$",
 "output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s %10$s %11$s %12$s %13$s %14$s"
)
STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://%YOUR-BUCKET-NAME%/%YOUR-DATA-FOLDER%/'

After you have typed this into the New query textarea, click on "Run query" to execute it.

Notes:

  • the columns that you specify in your table must reflect their position in the data file - the positional order is essential. If the order is wrong, values will end up in columns where they don't belong.
  • the input regex may look pretty weird if you haven't worked with regular expressions before, but its function is simply to tell Athena which part of each row belongs to which column - each parenthesized group in the regex maps to one column, in order.
  • instead of s3://%YOUR-BUCKET-NAME%/%YOUR-DATA-FOLDER%/, type in the URL location of your own bucket/folder holding the data - this is the location I asked you to note down earlier in this article.

If you have done everything right, a new table should be created and it will appear on the left side under Tables.
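
If you prefer to check from the query editor itself, you can also list your tables and inspect the column definitions with two simple statements:

SHOW TABLES;
DESCRIBE logs;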

Next to the name of your table, click on the 3-dots icon and choose Preview table to check whether Athena was able to process your data correctly.

You should see the first 10 rows of data listed in the Results section.
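
Under the hood, Preview table just runs a simple query, roughly equivalent to the one below, so you can also type it yourself in the query editor:

SELECT * FROM logs LIMIT 10;

If the preview comes back empty or with values in the wrong columns, the input regex most likely does not match your log format. Since logs is an external table, you can safely drop it, adjust the CREATE EXTERNAL TABLE statement and run it again - dropping the table does not delete your files in S3:

DROP TABLE IF EXISTS logs;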


5. Run your data analysis SQL queries

Once you have set up all of the above, congratulations!

You can now run your own SQL queries on the data you uploaded to your S3 bucket and perform as many analyses as you wish.
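
For example, on the logs table defined above, a couple of simple ad-hoc queries could look like this (just a sketch - adjust the column names and filters to your own data):

-- How many requests ended up with each HTTP status code?
SELECT status_code, COUNT(*) AS requests
FROM logs
GROUP BY status_code
ORDER BY requests DESC;

-- What are the 10 most frequent request lines (method + URL)?
SELECT request, COUNT(*) AS hits
FROM logs
GROUP BY request
ORDER BY hits DESC
LIMIT 10;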

FREQUENTLY ASKED QUESTIONS

1. What if I upload new data to my S3 bucket? Should I let Athena know about it and initiate data reindexing?

No, this is not needed. Once you have set up your DB schema, any time you change the data in your S3 bucket (including adding new files), Athena will pick it up and reflect it in the results the next time you run a SQL query.
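
You can easily verify this yourself: upload another log file into the same S3 folder and re-run a simple count - the number should grow without any extra setup (as long as the new file matches the same format):

SELECT COUNT(*) AS total_rows FROM logs;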

2. How much am I going to pay for the use of AWS Athena? 

As of Dec 5th 2020, its price is just $5.00 per TB of data scanned. See this page to check the current pricing.
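
To put that into perspective: a query that scans 200 GB of data would cost roughly 200 / 1024 × $5 ≈ $0.98 at that rate, and because you are billed for data scanned, storing your data compressed or in a columnar format (such as the Parquet or ORC mentioned earlier) can reduce the cost even further.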

3. Is the Query editor the only way to access Athena and perform queries?

Certainly not, but it is the quickest way to get started when learning how to work with Athena. You may also use the Amazon Athena API or the AWS CLI.

4. Which SQL dialect does Athena speak?


Athena uses the open-source Presto distributed SQL query engine. This enables quick ad-hoc analysis as well as complex queries, including window functions, large joins and aggregations.
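
As a small taste of what the dialect allows, here is a sketch of a window-function query against the logs table from section 4 (a hypothetical analysis - adjust the columns to your own data) that ranks client IP addresses by traffic within each status code:

SELECT remote_addr,
       status_code,
       COUNT(*) AS hits,
       RANK() OVER (PARTITION BY status_code ORDER BY COUNT(*) DESC) AS ip_rank
FROM logs
GROUP BY remote_addr, status_code
ORDER BY status_code, ip_rank;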

5. What is the difference between AWS Athena and AWS Redshift?


AWS Redshift is a data warehouse service and addresses different needs than Athena does. Redshift is used with more complex multi-part SQL queries. It is a better fit if you need to combine data from disparate sources into a common format.

Redshift is great for business intelligence workloads and enterprise reporting, while AWS Athena is better suited for simpler, ad-hoc queries on S3 data.

