Extracting elements from an SQL query
When you are faced with moving around 17000 SQL statements residing in around 4000 scripts, there is limited documentation for the SQLs and some of the SQL statements span more than 100 lines on the screen, having a tool to extract important elements from each SQL is important.
An SQL parser goes a looooooong way in helping split the SQL into logical elements, allowing us to extract elements like table names, column names and aliases, to name a few.
Earlier, I tried to parse SQL statements using Mozilla SQL Parser. The limitation of this parser is that it works for select statements only. More over, it does not work for complex SQL queries.
ANTLR is an option, but laying your hands on the proper grammar file is just the first hurdle. If we get the grammar, we have to cross the learning curve of getting ANTLR to work and the ability to write code in the proper classes.
领英推荐
Fortunately, Spark Catalyst Parser came to my rescue. Using it, I was able to parse SQL queries and extract table information, column information, derived column information, join information and much more. Using this information, I was able to create lineage between tables in an accurate manner. If the Spark Catalyst Parser is able to successfully able to parse the SQL statement, it has a method that dumps the Abstract Syntax Tree (AST) as a JSON. And parsing JSON in Python is a manageable task. Given that the JSON can be complex, because the SQL was complex, making sense from the JSON still needs an effort, but it becomes a task that where we can make continuous progress.
Once we extract the lineage and dump it to a file, we can easily visualize the dependencies using pyvis.
SQL parsing is a non-trivial task and can neither be achieved using regular expressions nor by searching for the select keyword and then assuming that all text till the from keyword represent a list of comma separated column names and definitely not in three days.
Building #AlphaFi #Crypto #DeFi
2 年What kind of task it was? Simplifying the old queries? Or something else?