Friday Fun - adding columns to a table in Databricks

In the Big Data world, where we (in most cases) build structured views on top of flat files, adding columns after the fact is a painful task. For row-based formats like CSV, Avro and JSON, it can mean creating a new structure (with the new column), copying the existing data into it, removing the old data set and then saving the new structure.
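The copy-and-swap routine for a row-based file can be sketched in a few lines of Python. This is only an illustration of the approach described above; the file path, column name and default value are all hypothetical.

```python
import csv
import os
import tempfile

def add_column_csv(path, column_name, default=""):
    """Rewrite a CSV file with an extra column appended to every row.

    Illustrates the create-new / copy / swap approach: write the new
    structure to a temp file, then replace the old data set with it.
    """
    fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with open(path, newline="") as src, os.fdopen(fd, "w", newline="") as dst:
        reader = csv.reader(src)
        writer = csv.writer(dst)
        header = next(reader)
        writer.writerow(header + [column_name])  # new structure, with the new column
        for row in reader:                       # copy the existing data across
            writer.writerow(row + [default])
    os.replace(tmp_path, path)                   # old data set gone, new one in place
```

Every row gets touched even though only the schema changed, which is exactly why this is painful at Big Data volumes.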

With column-oriented storage like ORC and Parquet, this task is slightly easier. Adding a new column is largely a schema update: the existing data is left untouched, and in most cases the new column simply reads back as null. After the addition, we update the column with proper values.

In Databricks, the syntax to add a column to a table is

alter table [table_name] add column [new_column_name] [data_type]        
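This append-at-end behaviour is not specific to Databricks; most SQL engines do the same. A quick sketch using SQLite (purely as a local stand-in, with made-up table and column names):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (id INTEGER, name TEXT)")

# A plain ADD COLUMN: the engine appends the new column after all existing ones
conn.execute("ALTER TABLE employees ADD COLUMN department TEXT")

# Inspect the resulting column order
columns = [row[1] for row in conn.execute("PRAGMA table_info(employees)")]
print(columns)  # the new column lands at the end
```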

But this adds the column to the end of the table definition. While that is not a problem in itself, the testing team can make life miserable when the mapping document shows the column after a specific existing column rather than at the end. I believe Databricks went through this pain and has provided support for this situation as below

alter table [table_name] add column [new_column_name] [data_type] after [existing_column_name]
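With hypothetical names filled in, the statement might look like this (assuming a Delta table called `sales` with an existing `order_date` column; the names are illustrative):

```sql
-- Place the new column immediately after order_date instead of at the end
ALTER TABLE sales ADD COLUMN region STRING AFTER order_date;
```

Note that column positioning like this works on Delta tables; plain external Parquet tables may not support it.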

What if the new column has to be the first column? This, too, is supported, as below

alter table [table_name] add column [new_column_name] [data_type] first        
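Again with illustrative names, on the same hypothetical `sales` table:

```sql
-- Put the new surrogate key ahead of every other column
ALTER TABLE sales ADD COLUMN row_id BIGINT FIRST;
```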

#databricks #parquet #column_format #bigdata #big_data #alter_table
