Friday Fun - adding columns to a table in Databricks
Bipin Patwardhan
Solution Architect, Solution Creator, Cloud, Big Data, TOGAF 9
In the Big Data world, where we build structured views on top of flat files (in most cases), adding columns after the fact is a painful task. For row-based formats like CSV, Avro, and JSON, such a change can mean creating a new structure (with the new column), copying the existing data into it, removing the old data set, and then saving the new structure.
With column-oriented storage formats like ORC and Parquet, the task is slightly easier. Adding a new column means allocating space for it and updating the schema. In most cases, the addition creates a column filled with null values; after the addition, we update the column with proper values.
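As a sketch of that add-then-backfill workflow, assuming a hypothetical Delta table named sales and an illustrative new column region (neither name is from the original):

```sql
-- Assumed table and column names (sales, region) are for illustration only.
-- Step 1: add the column; existing rows get NULL for it.
ALTER TABLE sales ADD COLUMN region STRING;

-- Step 2: backfill the new column with proper values.
-- (Delta tables in Databricks support UPDATE; plain Parquet tables do not.)
UPDATE sales SET region = 'UNKNOWN' WHERE region IS NULL;
```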
In Databricks, the syntax to add a column to a table is:
alter table [table_name] add column [new_column_name] [data_type]
But this adds the column at the end of the table definition. While that is not a problem in itself, the testing team can make life miserable when the mapping document shows the column after an existing column rather than at the end. I believe Databricks went through this pain too, because it supports specifying the position:
alter table [table_name] add column [new_column_name] [data_type] after [existing_column_name]
What if the new column has to be the first column? That is also supported:
alter table [table_name] add column [new_column_name] [data_type] first
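Putting the two positional variants together on a hypothetical table (the names customers, id, name, city, email, and customer_key are illustrative, not from the original):

```sql
-- Assume a Delta table 'customers' with columns (id, name, city).

-- Insert a column immediately after an existing one:
ALTER TABLE customers ADD COLUMN email STRING AFTER name;

-- Insert a column at the front of the schema:
ALTER TABLE customers ADD COLUMN customer_key BIGINT FIRST;

-- Inspect the resulting column order:
DESCRIBE TABLE customers;
```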
#databricks #parquet #column_format #bigdata #big_data #alter_table