登录查看更多内容

Wierd AWS Athena issues and how to solve them

Ananth Tirumanur

Data Architect

发布日期: 2024年3月11日

We were having an inability to query on the first column in our CSV files. The problem comes down to the encoding of the CSV file. In short, AWS Glue and Athena currently do not support CSV's encoded in UTF-8-BOM.

For example, let's say we have a CSV file named "employees.csv" with the following content:

Name,Age,Department 
Joh? Doe,32,IT 
Jane Smith,28,Marketing 
Michael Johnson,45,Sales

If you open up this CSV encoded with a Byte Order Mark (BOM) in Excel or Notepad++, it looks like any comma-delimited text file. However, opening it up in a Hex editor reveals the underlying issue. There are a bunch of special characters at the start of the file: ??? i.e. the BOM.

When a UTF-8-BOM CSV file like "employees.csv" is processed in AWS Glue, it retains these special characters, and associates them with the first column name. When you try and query on the first column within Athena, you will generate an error because the column name is not recognized correctly.

There are ways around this on AWS:

In AWS Glue, edit the table schema for the table created from "employees.csv" and delete the first column, then reinsert it back with the proper column name "Name", OR
In AWS Athena, execute the SHOW CREATE TABLE DDL to script out the problematic table created from "employees.csv", remove the special character in the generated script for the first column name, then run the script to create a new table which you can query on using the correct column name "Name".

To make your life simple, just make sure your CSV's are encoded as UTF-8 without the BOM.

Permission issues - IAM and Athena

Permissions in Athena are managed through IAM, unless you use Lake Formation (which is a topic in itself and not covered here). As mentioned, Athena is not an isolated service, and running a query involves at least three AWS services: Athena, Glue Data Catalog, and S3. This is reflected in the permissions model too.

For example, let's say you want to run a query on a table "sales_data" stored in an S3 bucket "my-athena-bucket". To run this query successfully, Athena will need permissions to:

Use the Glue Data Catalog on your behalf to access the "sales_data" table metadata.
List and read files from the "my-athena-bucket" S3 bucket where the underlying data for "sales_data" is stored.
Write the query results back to a specified location in S3.

You will need to grant permissions for all of these actions in IAM for the query to succeed. This is unlike invoking a Lambda function where the function has its own set of permissions that govern its actions. Athena instead proxies your permissions when it performs actions on other services.

Because there are multiple services involved, IAM policies for Athena often have a lot of statements, and they can be hard to get right initially. Each service has its resources and ways of specifying and limiting permissions.

领英推荐

Bigquery

Darshika Srivastava 2 年前

66% say AWS is the most required platform in job…

Learn Data Engineering 2 年前

Week 10 (4 Mar - 10 Mar)

Ankur Patel 1 年前

For Athena, you need to ensure the principal (user or role) has permission to the API calls involved in running a query, which means the actions athena:StartQueryExecution, athena:GetQueryExecution, and athena:GetQueryResults for the workgroup that the query runs in.

For S3, you need s3:ListBucket and s3:GetObject permissions for the bucket and objects that will be read (my-athena-bucket and the data files for sales_data), and s3:PutObject and s3:GetObject permissions for the location where the results will be written.

For Glue, you need to grant permissions on all levels of its catalog hierarchy – granting permission to a table is not enough, you also need to grant permission to the database the table is in, and the catalog the database is in.

While the permissions model has a steep learning curve, it is transparent that Athena uses the other services, and it makes the API calls to them in the same way, with the same permissions, as if the principal had done it themselves – which is also reflected in CloudTrail.

HIVE BAD DATA

The component in Athena that is responsible for reading and parsing data is called a serde (serializer/deserializer). If you don't specify anything else when creating an Athena table, you get a serde called LazySimpleSerDe, which is designed for delimited text such as CSV files.

For example, let's say you have a CSV file named "sales_data.csv" with the following content:

id,product,price,date
1,Widget A,9.99,2023-03-01
2,Gadget B,14.95,"2023-03-02"
3,,19.99,2023-03-03
4,Gizmo D,24.99,

When you create an Athena table using this file and the default LazySimpleSerDe, it can handle different delimiters, escape characters, and line endings. The columns of the table must be defined in the same order as they appear in the files.

However, LazySimpleSerDe and OpenCSVSerDe (another commonly used serde) interpret empty fields differently depending on the column data type. For string columns, both serdes interpret an empty field as an empty string. But for other data types, LazySimpleSerDe interprets an empty field as NULL, while OpenCSVSerDe throws an error.

For example, in the "sales_data.csv" file, the third row has an empty product name, and the fourth row has an empty date. If the "product" column is a string, both serdes will treat the empty value as an empty string. But if the "date" column is a date type, LazySimpleSerDe will interpret the empty value as NULL, while OpenCSVSerDe will throw an error.

LazySimpleSerDe can also be configured to interpret specific strings (like "\N" or "-") as NULL values, while OpenCSVSerDe does not have this option.

The serdes also handle non-string column types differently. OpenCSVSerDe parses strings to typed values, while LazySimpleSerDe converts directly from the byte stream, leading to differences in how timestamps and dates are parsed.

Both serdes assume UTF-8 encoding by default, but LazySimpleSerDe can be configured to use other encodings if needed.

In general, if your data has quoted fields, you should use OpenCSVSerDe. If not, the Athena documentation recommends using the default LazySimpleSerDe.

要查看或添加评论，请登录

Ananth Tirumanur的更多文章

How to create S3 Table bucket?

2025年1月13日

How to create S3 Table bucket?

At re:Invent 2024, AWS introduced Amazon S3 Tables, the first cloud object store with built-in Apache Iceberg support…
Avoid These Airflow Mistakes: Best Practices for Reliable Data Pipelines

2025年1月6日

Avoid These Airflow Mistakes: Best Practices for Reliable Data Pipelines

Organizations lose $5 million annually due to data pipeline failures. Lost productivity and missed opportunities make…
10 Years of AWS Lambda: Lessons for Data Engineers

2024年12月6日

10 Years of AWS Lambda: Lessons for Data Engineers

Picture this: It's November 2014, and developers around the world are glued to their screens during AWS re:Invent…

1 条评论
AI is taking your ETL job

2024年11月18日

AI is taking your ETL job

Sorry! that was clickbait! this article is more about advancing ETL Processes with AI. AI is bringing unprecedented…

1 条评论
Masking credit card numbers in the data lake

2024年9月15日

Masking credit card numbers in the data lake

To mask credit card numbers in an AWS data lake using AWS Glue, Python, S3, and Athena, you'll need to create an ETL…

2 条评论
Pulumi vs Terraform for AWS

2024年9月2日

Pulumi vs Terraform for AWS

In my earlier projects, Terraform was my go-to for infrastructure as code. I loved how straightforward it was—just…
Run a llm on your local machine

2024年4月22日

Run a llm on your local machine

In the modern realm of artificial intelligence (AI), language models have been gaining immense popularity for their…

2 条评论
Adding Python wheel dependencies to Glue jobs

2024年3月4日

Adding Python wheel dependencies to Glue jobs

Reference 1: Repost article Reference 2: AWS Glue docs I am sharing this in case someone faces a similar task. I had to…
Troubleshooting executor out of memory error in Pyspark

2024年2月26日

Troubleshooting executor out of memory error in Pyspark

When working with PySpark, encountering an "Executor Out of Memory" error is common, especially when dealing with large…
Tech Focus - Handling PII data in AWS Glue

2024年1月15日

Tech Focus - Handling PII data in AWS Glue

Step-by-step guide to detecting, masking, and redacting PII data using AWS Glue Today, I'm sharing a step-by-step guide…

1 条评论

See all articles

Wierd AWS Athena issues and how to solve them

Ananth Tirumanur

Data Architect

Permission issues - IAM and Athena

领英推荐

HIVE BAD DATA

Ananth Tirumanur的更多文章

社区洞察

其他会员也浏览了

Simplify Your Azure Table Storage Experience with DynamicTableEntity and Reflection

How to open a very big CSV file?

Building a Data Pipeline with AWS Glue, DataBrew, and S3

Building Transaction Datalake with Hudi and Glue PySpark (Insert| Read| Write| Update| Time Travel | Snapshots| Schema Evolution| Incremental Query)

AWS Athena

Effortless Data Cleansing: How AWS DataBrew Simplifies Your Workflow

Learn How to configure your Spark Session to Join Managed (S3 Table Buckets) and Unmanaged Iceberg Tables | Hands on Labs

Amazon Redshift Tutorial: Learning Through Puzzles and Examples (Part One)

Generic Data Ingestion Process in Apache Spark

Permission issues - IAM and Athena

领英推荐

HIVE BAD DATA

Ananth Tirumanur的更多文章

How to create S3 Table bucket?

Avoid These Airflow Mistakes: Best Practices for Reliable Data Pipelines

10 Years of AWS Lambda: Lessons for Data Engineers

AI is taking your ETL job

Masking credit card numbers in the data lake

Pulumi vs Terraform for AWS

Run a llm on your local machine

Adding Python wheel dependencies to Glue jobs

Troubleshooting executor out of memory error in Pyspark

Tech Focus - Handling PII data in AWS Glue

社区洞察

其他会员也浏览了

Simplify Your Azure Table Storage Experience with DynamicTableEntity and Reflection

How to open a very big CSV file?

Building a Data Pipeline with AWS Glue, DataBrew, and S3

Building Transaction Datalake with Hudi and Glue PySpark (Insert| Read| Write| Update| Time Travel | Snapshots| Schema Evolution| Incremental Query)

AWS Athena

Effortless Data Cleansing: How AWS DataBrew Simplifies Your Workflow

Learn How to configure your Spark Session to Join Managed (S3 Table Buckets) and Unmanaged Iceberg Tables | Hands on Labs

Amazon Redshift Tutorial: Learning Through Puzzles and Examples (Part One)

Generic Data Ingestion Process in Apache Spark