Databricks - Access file metadata when loading multiple files from a directory
Bipin Patwardhan
Solution Architect, Solution Creator, Cloud, Big Data, TOGAF 9
For a recent customer project, we were using the Databricks Auto Loader feature to load JSON and CSV files. In most situations, we loaded one or more files that matched a pattern. When doing this, we wanted to capture the name of each file and extract the timestamp from it (the file naming convention had the timestamp as yyyymmdd followed by the actual name). In the Auto Loader statement, we split the file name using the '_metadata' column that Databricks makes available.
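As a minimal sketch, assuming a hypothetical landing path and a naming convention such as 20240115_customers.json, the Auto Loader read and the file-name split might look like this:

```python
from pyspark.sql.functions import col, substring, to_date

# Hypothetical paths -- adjust to your workspace layout.
source_path = "/mnt/landing/incoming/"
schema_path = "/mnt/landing/_schema/"

# 'spark' is the SparkSession that Databricks notebooks provide.
df = (
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", schema_path)
    .load(source_path)
    # _metadata is populated per input file by Databricks
    .withColumn("source_file", col("_metadata.file_name"))
    # file names start with a yyyymmdd timestamp, e.g. 20240115_customers.json
    .withColumn("file_date", to_date(substring(col("source_file"), 1, 8), "yyyyMMdd"))
)
```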
As it happened, we were facing some issues while loading data using Auto Loader, so we decided to fall back to the standard 'spark.read.format' mechanism for file loading.
While this is the simplest way to load files, I was initially unsure about using it, primarily because of the metadata feature. But Databricks supports the '_metadata' column for the 'spark.read.format' syntax as well. That makes reading and loading files much easier, as we can build audit entries for every file load. Audit entries go a long way in data loads; one example is our SCD type 2 logic, which uses the file date while processing records.
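Assuming the same hypothetical path and naming convention as above, a batch-read sketch could look like the following. Note that '_metadata' is a hidden column, so it has to be selected explicitly:

```python
from pyspark.sql.functions import col, current_timestamp, substring, to_date

batch_df = (
    spark.read
    .format("json")
    .load("/mnt/landing/incoming/")   # hypothetical landing path
    # _metadata is hidden by default: reference it explicitly to keep it
    .select("*", "_metadata")
    .withColumn("source_file", col("_metadata.file_name"))
    # file names start with a yyyymmdd timestamp
    .withColumn("file_date", to_date(substring(col("source_file"), 1, 8), "yyyyMMdd"))
    .withColumn("loaded_at", current_timestamp())
)

# One row per input file -- a possible shape for an audit table that
# downstream logic (such as SCD type 2 processing) can consult.
audit_df = batch_df.groupBy("source_file", "file_date").count()
```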
#databricks #spark #pyspark #metadata