Versions, Settings and Data types - Databricks
So we happily develop data integration applications with PySpark on Databricks - multiple jobs ingesting a whole bunch of data elements, sometimes from staging files and sometimes from staging tables.
All things work well for days, even months. Then one fine day, BOOM - the job fails with an error: data length issue, data cannot be inserted.
On getting such an error, we are left wondering for a long time what the issue might be. We start by investigating the data itself, then the target table column definitions. All is well, and we're still puzzled. Then we check the definition of the staging table and find that it uses the VARCHAR data type instead of STRING. That is what caused the column in our temporary view to be defined as VARCHAR, and that is what is causing the length issue.
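A quick way to spot this is to describe the staging table and check the declared column types. A minimal sketch, assuming a hypothetical staging table named staging_db.customer_stg:

# 'spark' is the SparkSession that Databricks notebooks provide by default.
# DESCRIBE lists every column with its declared type - a varchar(n) here,
# instead of string, is the tell-tale sign.
spark.sql("DESCRIBE TABLE staging_db.customer_stg").show(truncate=False)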
Sometimes this issue can be resolved with a simple cast operation - casting the column to STRING when creating the temp view, as in the sketch below. In certain cases even that does not resolve the issue, so we dig through Stack Overflow solutions and find the answer: the legacy setting shown right after the sketch.
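A minimal sketch of the cast approach, reusing the hypothetical staging table above and a made-up cust_name column:

# Read the staging table and cast the VARCHAR column to plain STRING
# before exposing it as a temporary view.
stg_df = spark.table("staging_db.customer_stg")
stg_df = stg_df.withColumn("cust_name", stg_df["cust_name"].cast("string"))
stg_df.createOrReplaceTempView("customer_stg_vw")

Downstream code can then select from customer_stg_vw without carrying the VARCHAR length limit along - when the cast is enough to strip it, that is.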
spark.conf.set("spark.sql.legacy.charVarcharAsString", True)
This flag is needed because older versions of Spark treated VARCHAR (and CHAR) as plain STRING, while Spark 3.1 and later enforce the declared semantics: VARCHAR(n) is a length-limited string whose writes fail when a value exceeds n characters, and CHAR(n) is padded to a fixed length. It is this length enforcement that surfaces as the data length error in our jobs.
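To make the version difference concrete, here is a small, hypothetical illustration (the table name and values are made up, and the exact behaviour depends on the Spark / Databricks Runtime version):

# A staging-style table declared with a length-limited VARCHAR column.
spark.sql("CREATE TABLE IF NOT EXISTS stg_demo (code VARCHAR(5))")

# On Spark 3.1 and later the declared length is enforced, so this insert
# fails with a length-limitation error because the value has 8 characters:
# spark.sql("INSERT INTO stg_demo VALUES ('ABCDEFGH')")

# With the legacy flag set, CHAR/VARCHAR are treated as plain STRING again
# (the pre-3.1 behaviour) and the length check is skipped.
spark.conf.set("spark.sql.legacy.charVarcharAsString", True)
spark.sql("INSERT INTO stg_demo VALUES ('ABCDEFGH')")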
#spark #version #databricks #string #varchar #flag