Amazon Athena
Amazon Athena is a serverless way to query data living on S3 using SQL. It excels with datasets anywhere up to multiple petabytes in size. Its interface is a simple web page you can access from the AWS console. Queries cost $5 per terabyte of data scanned, with a 10 MB minimum per query. If a query fails you won't be charged for it, but if you cancel a query part-way through you'll be charged for the data scanned up to that point.
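To put that pricing in concrete terms: a query that ends up scanning 100 GB of data works out to roughly $0.49, while a query that only touches a few KB is still billed as if it had scanned 10 MB.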
Data can be stored in CSV, JSON, ORC, Parquet and even Apache web log format. You can also use GZIP-compressed CSV files to lower query costs and improve performance over regular, uncompressed CSVs.
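The GZIP files shouldn't need any special declaration in the table definition; Athena should pick up .gz files sitting in the table's S3 location automatically.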
Athena can execute queries from either the us-east-1 (Northern Virginia) or the us-west-2 (Oregon) region, though the S3 data being queried can live in other parts of the world.
It's important to note that Athena is not a general-purpose database. Under the hood is Presto, a query execution engine that runs on top of the Hadoop stack. Athena's purpose is to answer questions of your data rather than to insert records quickly or update random records with low latency.
That being said, Presto's performance, given it can work on some of the world's largest datasets, is impressive. Presto is used daily by analysts at Facebook on their multi-petabyte data warehouse, so the fact that such a powerful tool is available via a simple web interface with no servers to manage is pretty amazing, to say the least.
Setting Up Athena
When you first open the Athena web interface there is a step-by-step wizard to help you create a table and point it at your data. One of the steps has you add each of the table's fields using buttons and drop-downs. Since I have 51 fields and they can be declared via SQL instead, I created a dummy table just to get past the wizard and into the query editor, where I ran the following to set up my trips table.
CREATE EXTERNAL TABLE trips_csv (
    trip_id INT,
    vendor_id VARCHAR(3),
    pickup_datetime TIMESTAMP,
    dropoff_datetime TIMESTAMP,
    store_and_fwd_flag VARCHAR(1),
    rate_code_id SMALLINT,
    pickup_longitude DECIMAL(18,14),
    pickup_latitude DECIMAL(18,14),
    dropoff_longitude DECIMAL(18,14),
    dropoff_latitude DECIMAL(18,14),
    passenger_count SMALLINT,
    trip_distance DECIMAL(6,3),
    fare_amount DECIMAL(6,2),
    extra DECIMAL(6,2),
    mta_tax DECIMAL(6,2),
    tip_amount DECIMAL(6,2),
    tolls_amount DECIMAL(6,2),
    ehail_fee DECIMAL(6,2),
    improvement_surcharge DECIMAL(6,2),
    total_amount DECIMAL(6,2),
    payment_type VARCHAR(3),
    trip_type SMALLINT,
    pickup VARCHAR(50),
    dropoff VARCHAR(50),
    cab_type VARCHAR(6),
    precipitation SMALLINT,
    snow_depth SMALLINT,
    snowfall SMALLINT,
    max_temperature SMALLINT,
    min_temperature SMALLINT,
    average_wind_speed SMALLINT,
    pickup_nyct2010_gid SMALLINT,
    pickup_ctlabel VARCHAR(10),
    pickup_borocode SMALLINT,
    pickup_boroname VARCHAR(13),
    pickup_ct2010 VARCHAR(6),
    pickup_boroct2010 VARCHAR(7),
    pickup_cdeligibil VARCHAR(1),
    pickup_ntacode VARCHAR(4),
    pickup_ntaname VARCHAR(56),
    pickup_puma VARCHAR(4),
    dropoff_nyct2010_gid SMALLINT,
    dropoff_ctlabel VARCHAR(10),
    dropoff_borocode SMALLINT,
    dropoff_boroname VARCHAR(13),
    dropoff_ct2010 VARCHAR(6),
    dropoff_boroct2010 VARCHAR(7),
    dropoff_cdeligibil VARCHAR(1),
    dropoff_ntacode VARCHAR(4),
    dropoff_ntaname VARCHAR(56),
    dropoff_puma VARCHAR(4)
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://taxis-ath/csv/';
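With the table in place it's worth running a quick sanity check to make sure Athena can see the files; a query along these lines should only scan a tiny amount of data thanks to the LIMIT clause.
SELECT *
FROM trips_csv
LIMIT 10;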
Optimizing Using Columnar Storage
Unlike Redshift and BigQuery, Athena doesn't store your data internally, so it won't sort, compress and organise the data into columns for you. CSV files carry no summaries of the data they contain, neither per column nor per group of rows.
If data is laid out one column at a time then a query can read just the columns it's interested in rather than scanning every row in the dataset to hunt them down.
If that data is also sorted then summaries can be kept every so many records, giving statistics on the values in each section. This allows a query to skip over whole sections of a dataset.
For queries that do need to look at the value of every cell, reading the data off the disk in compressed form means they're not bottlenecked by the storage device as much: reading 1 GB of data off a disk will always be faster than reading 10 GB.
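To make the first of those optimizations concrete, a query like the one below only touches two of the table's 51 columns; with a columnar layout the engine can skip the other 49 entirely, whereas a CSV scan still has to read every byte of every row.
SELECT passenger_count,
       avg(total_amount)
FROM trips_csv
GROUP BY passenger_count;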
The above three optimizations are the kind you get when you store your data in ORC or Parquet format. If you convert your CSV data into one of these formats you will not only lower the cost of each query, you'll see significant speed increases as well. You can think of ORC and Parquet as the ZIP and RAR formats of the Hadoop ecosystem.
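For reference, an ORC-backed trips table would be declared with the same column list as trips_csv but stored as ORC, something along these lines (the s3://taxis-ath/orc/ path below is just illustrative, and only the first two columns are spelled out).
CREATE EXTERNAL TABLE trips_orc (
    trip_id INT,
    vendor_id VARCHAR(3)
    -- ...the remaining 49 columns mirror trips_csv above
) STORED AS ORC
LOCATION 's3://taxis-ath/orc/';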
Normally, in a query execution tool such as Hive on a Hadoop cluster, you could run the following to read the data from a CSV-formatted table into an ORC-formatted one.
INSERT INTO trips_orc
SELECT * FROM trips_csv;
Unfortunately, Athena returns an error stating that this functionality has been restricted in its interface.
Your query has the following error(s):
Queries of this type are not supported (Service: AmazonAthena; Status Code: 400;