Databricks - Access file metadata when loading multiple files from a directory

For a recent customer project, we were using the Databricks Auto Loader feature to load JSON and CSV files. In most situations, we were loading one or more files that matched a pattern. When doing this, we wanted to know the name of each file and extract the timestamp from it (the file naming convention had the timestamp as yyyymmdd followed by the actual name). In the Auto Loader statement, we extracted these details by splitting the file name taken from the '_metadata' column that Databricks makes available.
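Here is a minimal sketch of that pattern. The landing path, schema location and naming convention (a yyyymmdd prefix followed by an underscore, e.g. 20240115_orders.json) are illustrative assumptions, not the customer's actual setup:

from pyspark.sql import functions as F

# Incrementally load JSON files with Auto Loader and keep file metadata.
raw = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/mnt/landing/_schemas/orders")  # assumed path
    .load("/mnt/landing/orders/")  # assumed landing directory
    .select(
        "*",
        F.col("_metadata.file_name").alias("source_file_name"),
        # Split the file name on '_' and parse the leading yyyymmdd token.
        F.to_date(
            F.split(F.col("_metadata.file_name"), "_").getItem(0), "yyyyMMdd"
        ).alias("file_date"),
    )
)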

As it happens, we were facing some issues while loading data with Auto Loader, so we decided to fall back to the standard 'spark.read.format' mechanism for file loading.

While this is the simplest way to load files, I was initially unsure about using it, primarily because I thought we would lose the metadata feature. But Databricks supports the '_metadata' column for the 'spark.read.format' syntax as well. That makes reading and loading files so much easier, since we can build audit entries for all the file loads. Audit entries go a long way in data loads. One example: our SCD type 2 logic uses the file date when processing records.
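The batch equivalent looks like this, again a sketch rather than the project's actual code; the glob pattern, column aliases and audit table name are assumptions. Note that '_metadata' is only populated when you reference it explicitly in the select:

from pyspark.sql import functions as F

# Batch load with spark.read.format, capturing the same metadata fields
# ('file_path', 'file_name', 'file_size', 'file_modification_time').
df = (
    spark.read.format("csv")
    .option("header", "true")
    .load("/mnt/landing/orders/*_orders.csv")  # assumed glob pattern
    .select(
        "*",
        F.col("_metadata.file_path").alias("source_file_path"),
        F.col("_metadata.file_name").alias("source_file_name"),
        F.col("_metadata.file_modification_time").alias("source_file_mtime"),
        F.to_date(
            F.split(F.col("_metadata.file_name"), "_").getItem(0), "yyyyMMdd"
        ).alias("file_date"),
    )
)

# One distinct row per file can then feed an audit table (name assumed).
(df.select("source_file_path", "source_file_name", "file_date")
   .distinct()
   .write.mode("append")
   .saveAsTable("audit.file_loads"))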

link: https://docs.databricks.com/en/ingestion/file-metadata-column.html

#databricks #spark #pyspark #metadata
