Weird AWS Athena issues and how to solve them

We were unable to query the first column in our CSV files. The problem comes down to the encoding of the CSV file: in short, AWS Glue and Athena currently do not support CSVs encoded as UTF-8 with a byte order mark (UTF-8-BOM).

For example, let's say we have a CSV file named "employees.csv" with the following content:

Name,Age,Department
John Doe,32,IT
Jane Smith,28,Marketing
Michael Johnson,45,Sales


If you open this BOM-encoded CSV in Excel or Notepad++, it looks like any other comma-delimited text file. Opening it in a hex editor, however, reveals the underlying issue: there are three extra bytes at the start of the file, EF BB BF (often rendered as ï»¿), i.e. the BOM.
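If you don't have a hex editor handy, a few lines of Python can run the same check (a minimal sketch, assuming the file above is saved locally as "employees.csv"):

```python
# Read the first three bytes and compare them against the UTF-8 BOM (EF BB BF).
with open("employees.csv", "rb") as f:
    leading_bytes = f.read(3)

if leading_bytes == b"\xef\xbb\xbf":
    print("UTF-8 BOM found at the start of the file")
else:
    print("No UTF-8 BOM")
```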

When a UTF-8-BOM CSV file like "employees.csv" is processed in AWS Glue, the BOM bytes are retained and become part of the first column name. When you then try to query the first column in Athena, the query fails because the column name is not recognized correctly.

There are ways around this on AWS:

  • In AWS Glue, edit the table schema for the table created from "employees.csv": delete the first column, then re-add it with the proper column name "Name", OR
  • In AWS Athena, run SHOW CREATE TABLE to script out the problematic table created from "employees.csv", remove the special characters from the first column name in the generated script, then run the script to create a new table that you can query using the correct column name "Name".

To make your life simple, just make sure your CSVs are encoded as UTF-8 without the BOM.
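One simple way to do that is to re-encode the file before uploading it to S3. Here is a minimal sketch: Python's "utf-8-sig" codec silently consumes a leading BOM when reading, so writing the content back out as plain UTF-8 produces a BOM-free copy (the output file name is just an example):

```python
# Read with utf-8-sig so any leading BOM is stripped, then write plain UTF-8.
with open("employees.csv", "r", encoding="utf-8-sig", newline="") as src:
    content = src.read()

with open("employees_clean.csv", "w", encoding="utf-8", newline="") as dst:
    dst.write(content)
```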

Permission issues - IAM and Athena

Permissions in Athena are managed through IAM, unless you use Lake Formation (which is a topic in itself and not covered here). As mentioned, Athena is not an isolated service, and running a query involves at least three AWS services: Athena, Glue Data Catalog, and S3. This is reflected in the permissions model too.

For example, let's say you want to run a query on a table "sales_data" stored in an S3 bucket "my-athena-bucket". To run this query successfully, Athena will need permissions to:

  1. Use the Glue Data Catalog on your behalf to access the "sales_data" table metadata.
  2. List and read files from the "my-athena-bucket" S3 bucket where the underlying data for "sales_data" is stored.
  3. Write the query results back to a specified location in S3.

You will need to grant permissions for all of these actions in IAM for the query to succeed. This is unlike invoking a Lambda function where the function has its own set of permissions that govern its actions. Athena instead proxies your permissions when it performs actions on other services.

Because there are multiple services involved, IAM policies for Athena often have a lot of statements, and they can be hard to get right initially. Each service has its own resources and its own ways of specifying and limiting permissions.

For Athena, you need to ensure the principal (user or role) has permission for the API calls involved in running a query: the actions athena:StartQueryExecution, athena:GetQueryExecution, and athena:GetQueryResults on the workgroup that the query runs in.
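For illustration, this is roughly what those three calls look like from a client using boto3 (the database name, workgroup, and output location below are placeholders, not values from this article):

```python
import time

import boto3

athena = boto3.client("athena")

# athena:StartQueryExecution - submit the query to a workgroup.
query_id = athena.start_query_execution(
    QueryString="SELECT * FROM sales_data LIMIT 10",
    QueryExecutionContext={"Database": "my_database"},
    WorkGroup="primary",
    ResultConfiguration={"OutputLocation": "s3://my-athena-bucket/results/"},
)["QueryExecutionId"]

# athena:GetQueryExecution - poll until the query reaches a terminal state.
while True:
    status = athena.get_query_execution(QueryExecutionId=query_id)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

# athena:GetQueryResults - fetch the result rows once the query has succeeded.
if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    for row in rows:
        print([col.get("VarCharValue") for col in row["Data"]])
```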

For S3, you need s3:ListBucket and s3:GetObject permissions for the bucket and objects that will be read (my-athena-bucket and the data files for sales_data), and s3:PutObject and s3:GetObject permissions for the location where the results will be written.

For Glue, you need to grant permissions on all levels of its catalog hierarchy – granting permission to a table is not enough, you also need to grant permission to the database the table is in, and the catalog the database is in.
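Putting the three sets of permissions together, a policy for the "sales_data" example might look roughly like the sketch below. It is expressed as a Python dict so it can be dumped to JSON and attached to a role; the region, account ID, database name, and workgroup are placeholders:

```python
import json

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {   # Athena: the API calls involved in running a query, scoped to one workgroup.
            "Effect": "Allow",
            "Action": [
                "athena:StartQueryExecution",
                "athena:GetQueryExecution",
                "athena:GetQueryResults",
            ],
            "Resource": "arn:aws:athena:us-east-1:123456789012:workgroup/primary",
        },
        {   # S3: list and read the data, and write the query results
            # (in practice, scope s3:PutObject to the results location only).
            "Effect": "Allow",
            "Action": ["s3:ListBucket", "s3:GetObject", "s3:PutObject"],
            "Resource": [
                "arn:aws:s3:::my-athena-bucket",
                "arn:aws:s3:::my-athena-bucket/*",
            ],
        },
        {   # Glue: every level of the catalog hierarchy - catalog, database, table.
            "Effect": "Allow",
            "Action": ["glue:GetDatabase", "glue:GetTable", "glue:GetPartitions"],
            "Resource": [
                "arn:aws:glue:us-east-1:123456789012:catalog",
                "arn:aws:glue:us-east-1:123456789012:database/my_database",
                "arn:aws:glue:us-east-1:123456789012:table/my_database/sales_data",
            ],
        },
    ],
}

print(json.dumps(policy, indent=2))
```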

While the permissions model has a steep learning curve, it is at least transparent: Athena uses the other services by making API calls to them in the same way, with the same permissions, as if the principal had made the calls themselves, which is also reflected in CloudTrail.

HIVE_BAD_DATA errors

The component in Athena that is responsible for reading and parsing data is called a serde (serializer/deserializer). If you don't specify anything else when creating an Athena table, you get a serde called LazySimpleSerDe, which is designed for delimited text such as CSV files.

For example, let's say you have a CSV file named "sales_data.csv" with the following content:

id,product,price,date
1,Widget A,9.99,2023-03-01
2,Gadget B,14.95,"2023-03-02"
3,,19.99,2023-03-03
4,Gizmo D,24.99,

When you create an Athena table over this file with the default LazySimpleSerDe, the serde can be configured to handle different delimiters, escape characters, and line endings. The columns of the table must be defined in the same order as they appear in the files.
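As a sketch of what that looks like for "sales_data.csv" (the database name, S3 locations, and the header-skipping property are my assumptions, not part of the example), ROW FORMAT DELIMITED gives you the default LazySimpleSerDe, and the columns are declared in file order:

```python
import boto3

# CREATE TABLE DDL for sales_data.csv using the default LazySimpleSerDe.
# Columns are declared in the same order as they appear in the file.
create_sales_data = """
CREATE EXTERNAL TABLE IF NOT EXISTS my_database.sales_data (
  id INT,
  product STRING,
  price DOUBLE,
  `date` DATE
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION 's3://my-athena-bucket/sales_data/'
TBLPROPERTIES ('skip.header.line.count' = '1')
"""

# Submit the DDL the same way as any other query.
boto3.client("athena").start_query_execution(
    QueryString=create_sales_data,
    ResultConfiguration={"OutputLocation": "s3://my-athena-bucket/results/"},
)
```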

However, LazySimpleSerDe and OpenCSVSerDe (another commonly used serde) interpret empty fields differently depending on the column data type. For string columns, both serdes interpret an empty field as an empty string. But for other data types, LazySimpleSerDe interprets an empty field as NULL, while OpenCSVSerDe throws an error.

For example, in the "sales_data.csv" file, the third row has an empty product name, and the fourth row has an empty date. If the "product" column is a string, both serdes will treat the empty value as an empty string. But if the "date" column is a date type, LazySimpleSerDe will interpret the empty value as NULL, while OpenCSVSerDe will throw an error.

LazySimpleSerDe can also be configured to interpret specific strings (like "\N" or "-") as NULL values, while OpenCSVSerDe does not have this option.

The serdes also handle non-string column types differently. OpenCSVSerDe parses strings to typed values, while LazySimpleSerDe converts directly from the byte stream, leading to differences in how timestamps and dates are parsed.

Both serdes assume UTF-8 encoding by default, but LazySimpleSerDe can be configured to use other encodings if needed.
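Both of those LazySimpleSerDe options are configured through serde properties. As a hedged sketch, the standard Hive property names are serialization.null.format for the NULL string and serialization.encoding for the character set; I believe these are the ones Athena honors, but verify the exact names against the current documentation:

```python
# Variation of the table above: treat the literal string \N as NULL and read
# the files as ISO-8859-1 instead of UTF-8. The property names are assumed to
# match the standard Hive/Athena LazySimpleSerDe options.
create_sales_data_tuned = """
CREATE EXTERNAL TABLE IF NOT EXISTS my_database.sales_data_tuned (
  id INT,
  product STRING,
  price DOUBLE,
  `date` DATE
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
  'field.delim' = ',',
  'serialization.null.format' = '\\N',
  'serialization.encoding' = 'ISO-8859-1'
)
LOCATION 's3://my-athena-bucket/sales_data/'
TBLPROPERTIES ('skip.header.line.count' = '1')
"""
# Submit with start_query_execution, or paste the DDL into the Athena console.
```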

In general, if your data has quoted fields, you should use OpenCSVSerDe. If not, the Athena documentation recommends using the default LazySimpleSerDe.
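As a sketch, the same table defined with OpenCSVSerDe looks roughly like this (names and locations are placeholders again; with OpenCSVSerDe it is common to declare every column as STRING and cast in queries):

```python
# The sales_data table defined with OpenCSVSerDe, which understands quoted
# fields such as the "2023-03-02" value in the sample file.
create_sales_data_quoted = """
CREATE EXTERNAL TABLE IF NOT EXISTS my_database.sales_data_quoted (
  id STRING,
  product STRING,
  price STRING,
  `date` STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  'separatorChar' = ',',
  'quoteChar' = '"'
)
LOCATION 's3://my-athena-bucket/sales_data/'
TBLPROPERTIES ('skip.header.line.count' = '1')
"""
# Submit with start_query_execution, or paste the DDL into the Athena console.
```

With this definition the quoted date in the second data row is read correctly, at the cost of getting every column back as a string.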
