Friday Fun - adding columns to a table in Databricks
Bipin Patwardhan
Solution Architect, Solution Creator, Cloud, Big Data, TOGAF 9
In the Big Data world, where we build structured views on top of flat files (in most cases), adding columns after the fact is a painful task. For row-based formats like CSV, Avro, and JSON, such a change can mean creating a new structure (with the new column), copying the existing data into it, removing the old data set, and then saving the new structure.
With column-oriented storage formats like ORC and Parquet, the task is slightly easier. Adding a new column means allocating space for it and updating the schema. In most cases, the addition creates a column filled with null values; after the addition, we update the column with proper values.
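As a sketch of that add-then-backfill workflow, assuming a hypothetical Delta table named sales and an illustrative new column region (neither name is from the original):

```sql
-- Assumed table and column names (sales, region) are for illustration only.
-- Step 1: add the column; existing rows get NULL for it.
ALTER TABLE sales ADD COLUMN region STRING;

-- Step 2: backfill the new column with proper values.
-- (Delta tables in Databricks support UPDATE; plain Parquet tables do not.)
UPDATE sales SET region = 'UNKNOWN' WHERE region IS NULL;
```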
In Databricks, the syntax to add a column to a table is:
alter table [table_name] add column [new_column_name] [data_type]
But this adds the column at the end of the table definition. While that is not a problem in itself, the testing team can make life miserable when the mapping document shows the column after an existing column rather than at the end. I believe Databricks went through this pain too, because it supports specifying the position:
alter table [table_name] add column [new_column_name] [data_type] after [existing_column_name]
What if the new column has to be the first column? That is also supported:
alter table [table_name] add column [new_column_name] [data_type] first
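Putting the two positional variants together on a hypothetical table (the names customers, id, name, city, email, and customer_key are illustrative, not from the original):

```sql
-- Assume a Delta table 'customers' with columns (id, name, city).

-- Insert a column immediately after an existing one:
ALTER TABLE customers ADD COLUMN email STRING AFTER name;

-- Insert a column at the front of the schema:
ALTER TABLE customers ADD COLUMN customer_key BIGINT FIRST;

-- Inspect the resulting column order:
DESCRIBE TABLE customers;
```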
#databricks #parquet #column_format #bigdata #big_data #alter_table