Wierd AWS Athena issues and how to solve them
We were having an inability to query on the first column in our CSV files. The problem comes down to the encoding of the CSV file. In short, AWS Glue and Athena currently do not support CSV's encoded in UTF-8-BOM.
For example, let's say we have a CSV file named "employees.csv" with the following content:
Name,Age,Department
Joh? Doe,32,IT
Jane Smith,28,Marketing
Michael Johnson,45,Sales
If you open up this CSV encoded with a Byte Order Mark (BOM) in Excel or Notepad++, it looks like any comma-delimited text file. However, opening it up in a Hex editor reveals the underlying issue. There are a bunch of special characters at the start of the file: ??? i.e. the BOM.
When a UTF-8-BOM CSV file like "employees.csv" is processed in AWS Glue, it retains these special characters, and associates them with the first column name. When you try and query on the first column within Athena, you will generate an error because the column name is not recognized correctly.
There are ways around this on AWS:
To make your life simple, just make sure your CSV's are encoded as UTF-8 without the BOM.
Permission issues - IAM and Athena
Permissions in Athena are managed through IAM, unless you use Lake Formation (which is a topic in itself and not covered here). As mentioned, Athena is not an isolated service, and running a query involves at least three AWS services: Athena, Glue Data Catalog, and S3. This is reflected in the permissions model too.
For example, let's say you want to run a query on a table "sales_data" stored in an S3 bucket "my-athena-bucket". To run this query successfully, Athena will need permissions to:
You will need to grant permissions for all of these actions in IAM for the query to succeed. This is unlike invoking a Lambda function where the function has its own set of permissions that govern its actions. Athena instead proxies your permissions when it performs actions on other services.
Because there are multiple services involved, IAM policies for Athena often have a lot of statements, and they can be hard to get right initially. Each service has its resources and ways of specifying and limiting permissions.
For Athena, you need to ensure the principal (user or role) has permission to the API calls involved in running a query, which means the actions athena:StartQueryExecution, athena:GetQueryExecution, and athena:GetQueryResults for the workgroup that the query runs in.
For S3, you need s3:ListBucket and s3:GetObject permissions for the bucket and objects that will be read (my-athena-bucket and the data files for sales_data), and s3:PutObject and s3:GetObject permissions for the location where the results will be written.
For Glue, you need to grant permissions on all levels of its catalog hierarchy – granting permission to a table is not enough, you also need to grant permission to the database the table is in, and the catalog the database is in.
While the permissions model has a steep learning curve, it is transparent that Athena uses the other services, and it makes the API calls to them in the same way, with the same permissions, as if the principal had done it themselves – which is also reflected in CloudTrail.
HIVE BAD DATA
The component in Athena that is responsible for reading and parsing data is called a serde (serializer/deserializer). If you don't specify anything else when creating an Athena table, you get a serde called LazySimpleSerDe, which is designed for delimited text such as CSV files.
For example, let's say you have a CSV file named "sales_data.csv" with the following content:
id,product,price,date
1,Widget A,9.99,2023-03-01
2,Gadget B,14.95,"2023-03-02"
3,,19.99,2023-03-03
4,Gizmo D,24.99,
When you create an Athena table using this file and the default LazySimpleSerDe, it can handle different delimiters, escape characters, and line endings. The columns of the table must be defined in the same order as they appear in the files.
However, LazySimpleSerDe and OpenCSVSerDe (another commonly used serde) interpret empty fields differently depending on the column data type. For string columns, both serdes interpret an empty field as an empty string. But for other data types, LazySimpleSerDe interprets an empty field as NULL, while OpenCSVSerDe throws an error.
For example, in the "sales_data.csv" file, the third row has an empty product name, and the fourth row has an empty date. If the "product" column is a string, both serdes will treat the empty value as an empty string. But if the "date" column is a date type, LazySimpleSerDe will interpret the empty value as NULL, while OpenCSVSerDe will throw an error.
LazySimpleSerDe can also be configured to interpret specific strings (like "\N" or "-") as NULL values, while OpenCSVSerDe does not have this option.
The serdes also handle non-string column types differently. OpenCSVSerDe parses strings to typed values, while LazySimpleSerDe converts directly from the byte stream, leading to differences in how timestamps and dates are parsed.
Both serdes assume UTF-8 encoding by default, but LazySimpleSerDe can be configured to use other encodings if needed.
In general, if your data has quoted fields, you should use OpenCSVSerDe. If not, the Athena documentation recommends using the default LazySimpleSerDe.