Amazon Athena
Amazon Athena is a serverless way to query data living on S3 using SQL. It excels with datasets anywhere up to multiple petabytes in size. Its interface is a simple web page you can access from the AWS console. Queries cost $5 per terabyte of data scanned, with a 10 MB minimum per query. If a query fails you won't be charged for it, but if you cancel a query part-way through you'll be charged for the data scanned up to that point.
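To put that pricing in concrete terms: a query that ends up scanning 100 GB of data works out to roughly $0.49, while a query that only touches a few KB is still billed as if it had scanned 10 MB.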
Data can be stored in CSV, JSON, ORC, Parquet and even Apache web log format. You can also use GZIP-compressed CSV files to lower query costs and improve performance over regular, uncompressed CSVs.
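The GZIP files shouldn't need any special declaration in the table definition; Athena should pick up .gz files sitting in the table's S3 location automatically.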
Athena can execute queries from either the us-east-1 (Northern Virginia) or the us-west-2 (Oregon) region, though the S3 data being queried can live in other parts of the world.
It's important to note that Athena is not a general-purpose database. Under the hood is Presto, a query execution engine that runs on top of the Hadoop stack. Athena's purpose is to answer questions of your data rather than to insert records quickly or update random records with low latency.
That being said, Presto's performance, given it can work on some of the world's largest datasets, is impressive. Presto is used daily by analysts at Facebook on their multi-petabyte data warehouse, so the fact that such a powerful tool is available via a simple web interface with no servers to manage is pretty amazing, to say the least.
Setting Up Athena
When you first open the Athena web interface there is a step-by-step wizard to help you create a table and point it at your data. One of the steps has you add each of the table's fields using buttons and drop-downs. Since I have 51 fields and they can be declared via SQL instead, I created a dummy table just to get past the wizard and into the query editor, where I ran the following to set up my trips table.
CREATE EXTERNAL TABLE trips_csv (
    trip_id INT,
    vendor_id VARCHAR(3),
    pickup_datetime TIMESTAMP,
    dropoff_datetime TIMESTAMP,
    store_and_fwd_flag VARCHAR(1),
    rate_code_id SMALLINT,
    pickup_longitude DECIMAL(18,14),
    pickup_latitude DECIMAL(18,14),
    dropoff_longitude DECIMAL(18,14),
    dropoff_latitude DECIMAL(18,14),
    passenger_count SMALLINT,
    trip_distance DECIMAL(6,3),
    fare_amount DECIMAL(6,2),
    extra DECIMAL(6,2),
    mta_tax DECIMAL(6,2),
    tip_amount DECIMAL(6,2),
    tolls_amount DECIMAL(6,2),
    ehail_fee DECIMAL(6,2),
    improvement_surcharge DECIMAL(6,2),
    total_amount DECIMAL(6,2),
    payment_type VARCHAR(3),
    trip_type SMALLINT,
    pickup VARCHAR(50),
    dropoff VARCHAR(50),
    cab_type VARCHAR(6),
    precipitation SMALLINT,
    snow_depth SMALLINT,
    snowfall SMALLINT,
    max_temperature SMALLINT,
    min_temperature SMALLINT,
    average_wind_speed SMALLINT,
    pickup_nyct2010_gid SMALLINT,
    pickup_ctlabel VARCHAR(10),
    pickup_borocode SMALLINT,
    pickup_boroname VARCHAR(13),
    pickup_ct2010 VARCHAR(6),
    pickup_boroct2010 VARCHAR(7),
    pickup_cdeligibil VARCHAR(1),
    pickup_ntacode VARCHAR(4),
    pickup_ntaname VARCHAR(56),
    pickup_puma VARCHAR(4),
    dropoff_nyct2010_gid SMALLINT,
    dropoff_ctlabel VARCHAR(10),
    dropoff_borocode SMALLINT,
    dropoff_boroname VARCHAR(13),
    dropoff_ct2010 VARCHAR(6),
    dropoff_boroct2010 VARCHAR(7),
    dropoff_cdeligibil VARCHAR(1),
    dropoff_ntacode VARCHAR(4),
    dropoff_ntaname VARCHAR(56),
    dropoff_puma VARCHAR(4)
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://taxis-ath/csv/';
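With the table in place it's worth running a quick sanity check to make sure Athena can see the files; a query along these lines should only scan a tiny amount of data thanks to the LIMIT clause.
SELECT *
FROM trips_csv
LIMIT 10;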
Optimizing Using Columnar Storage
Unlike Redshift and BigQuery, Athena doesn't store your data internally, so it won't sort, compress and organise the data into columns for you. CSV files carry no summaries of the data they contain, neither per column nor per group of rows.
If data is laid out one column at a time then a query can read just the columns it's interested in rather than scanning every row in the dataset to hunt them down.
If that data is also sorted then summaries can be kept every so many records, giving statistics on the values in each section. This allows a query to skip over whole sections of a dataset.
For queries that do need to look at the value of every cell, reading the data off the disk in compressed form means they're not bottlenecked by the storage device as much: reading 1 GB of data off a disk will always be faster than reading 10 GB.
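To make the first of those optimizations concrete, a query like the one below only touches two of the table's 51 columns; with a columnar layout the engine can skip the other 49 entirely, whereas a CSV scan still has to read every byte of every row.
SELECT passenger_count,
       avg(total_amount)
FROM trips_csv
GROUP BY passenger_count;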
The above three optimizations are the kind you get when you store your data in ORC or Parquet format. If you convert your CSV data into one of these formats you will not only lower the cost of each query, you'll see significant speed increases as well. You can think of ORC and Parquet as the ZIP and RAR formats of the Hadoop ecosystem.
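For reference, an ORC-backed trips table would be declared with the same column list as trips_csv but stored as ORC, something along these lines (the s3://taxis-ath/orc/ path below is just illustrative, and only the first two columns are spelled out).
CREATE EXTERNAL TABLE trips_orc (
    trip_id INT,
    vendor_id VARCHAR(3)
    -- ...the remaining 49 columns mirror trips_csv above
) STORED AS ORC
LOCATION 's3://taxis-ath/orc/';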
Normally, in a query execution tool such as Hive on a Hadoop cluster, you could run the following to read the data from a CSV-formatted table into an ORC-formatted one.
INSERT INTO trips_orc
SELECT * FROM trips_csv;
Unfortunately, Athena returns an error stating that this functionality has been restricted in its interface.
Your query has the following error(s):
Queries of this type are not supported (Service: AmazonAthena; Status Code: 400;