Efficient & Robust Analytic Queries
I help large financial institutions set up highly efficient, maintainable and auditable queries.
Often organizations end up with lots of unmaintainable and poorly tested SQL queries that range from hundreds to thousands of lines of code.
Many SQL developers don't have experience with other programming languages or software engineering disciplines, so code quickly gets out of hand.
Typically when entering an organisation I can reduce the amount of SQL code that needs to be maintained by 10x to 100x, while adding more automation and more tests. Automation means that lengthy BAU processes can go from hours to zero. More importantly operational risk is massively reduced. All of this typically takes less time than the original SQL queries took to write in the first place.
This article offers a guide to how I streamline and structure queries to achieve better results with less time and maintenance.
Specific Queries
A query is designed to answer a specific question. The query should aim to be as concise as possible so that it's easy to match the question or requirements with the code.
If a query can be parametrized it can answer more questions and be moved to some sort of self-service option.
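As a rough sketch of what that can look like, a Specific Query can be wrapped in a function whose arguments become the self-service inputs. The table and column names below (transactions, region, fee_amount) are purely illustrative:

```r
library(dplyr)

# Hypothetical example: a Specific Query wrapped in a function so the same
# logic can answer many variations of the question (different regions,
# date ranges, products) and be exposed as a self-service report.
fee_summary <- function(transactions, target_region, from_date, to_date) {
  transactions %>%
    filter(region == target_region,
           txn_date >= from_date,
           txn_date <= to_date) %>%
    group_by(product) %>%
    summarise(total_fees = sum(fee_amount, na.rm = TRUE),
              txn_count  = n(),
              .groups = "drop")
}

# fee_summary(txns, "EMEA", as.Date("2024-01-01"), as.Date("2024-01-31"))
```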
Multiple questions will often use the same tables, joins and data preparation, so this logic should be split out into Common Views to optimize reuse and reduce maintenance and development time.
As a general rule of thumb most joins should exist in the Common Views as opposed to the Specific Queries. The exceptions to this will be discussed in Common Views.
Common Views
When building Common Views it's often beneficial to have layers: Raw Views that map directly to the source systems for reconciliation, and Reporting Views built on top of them with business-friendly naming and categorisation.
Column aliases should be provided by mapping files and applied to the data, rather than being hard coded into the queries. Where possible, use names exactly as they appear in other systems rather than what someone might have informally written down in a specification. This significantly improves maintainability while also providing documentation and an easy way to trace and reconcile name changes between different views and systems. The rename_all() function in R can apply such a mapping to all columns returned by a SQL query.
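A minimal sketch of that approach, assuming a hypothetical two-column mapping file (source_name, report_name) maintained outside the code:

```r
library(dplyr)
library(readr)

# Hypothetical mapping file with one row per column:
#   source_name, report_name
#   cust_no,     CustomerNumber
#   txn_amt,     TransactionAmount
alias_map <- read_csv("column_aliases.csv", show_col_types = FALSE)

# Named vector: names are source-system names, values are report names.
lookup <- setNames(alias_map$report_name, alias_map$source_name)

# Apply the mapping to every column pulled back from SQL;
# unmapped columns keep their original names.
apply_aliases <- function(df) {
  rename_all(df, ~ dplyr::recode(.x, !!!lookup))
}
```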
Bottom Up vs Top Down
Use Top Down for Exploring and Querying Views
Often people start building queries top down and gradually drill into the detail. This is fine if you are exploring an already validated view, but not if you are building one.
Use Bottom Up for Building and Validating Views
When building views I recommend you always start with the transaction-level table that contains the key metrics, and then classify your data with higher-level dimensions / categories. The reason is that once you identify the main transaction / metrics table, you can use it to validate that metric sums and record counts are consistent before and after joins. You should also check that no keys are dropped or duplicated. You can't do these checks if you start with higher-level tables, because every join can alter the record count, either by duplicating records or by dropping records that are not present in all the tables.
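The kind of reconciliation I mean is sketched below. The tables (transactions, accounts, products) and keys are hypothetical and assumed to be local data frames:

```r
library(dplyr)

# Baseline taken from the transaction-level table before any joins.
baseline <- transactions %>%
  summarise(rows = n(),
            keys = n_distinct(txn_id),
            amount = sum(amount, na.rm = TRUE))

# The Common View after joining on higher-level dimensions.
common_view <- transactions %>%
  left_join(accounts, by = "account_id") %>%
  left_join(products, by = "product_id")

after_joins <- common_view %>%
  summarise(rows = n(),
            keys = n_distinct(txn_id),
            amount = sum(amount, na.rm = TRUE))

# Record counts and metric totals should be unchanged by the joins.
stopifnot(isTRUE(all.equal(baseline, after_joins)))

# No transaction keys should be duplicated or dropped by the joins.
stopifnot(nrow(filter(count(common_view, txn_id), n > 1)) == 0)
stopifnot(nrow(anti_join(transactions, common_view, by = "txn_id")) == 0)
```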
Dealing with Duplicates
Joining on tables that have multiple records for a single transaction will cause duplicates in your transactions table and thus overstate your metrics. Do not include these joins in your main views. Avoid them where possible, but if they are necessary, some solutions for dealing with them include aggregating the many-side table down to one record per key before joining, or selecting a single representative record per key (for example the most recent) with a window function.
These options should be discussed properly with stakeholders to understand the implications of each approach. All of these options are computationally expensive and so are often optimised by precomputing and materializing the results.
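A rough illustration of those two approaches in dplyr (the notes table and its columns are hypothetical):

```r
library(dplyr)

# The notes table has many rows per transaction, so joining it directly onto
# the transactions table would duplicate rows and overstate metrics.

# Option 1: aggregate the many-side down to one row per key before joining.
notes_agg <- notes %>%
  group_by(txn_id) %>%
  summarise(note_count = n(), .groups = "drop")

# Option 2: keep a single representative row per key, e.g. the latest note,
# using a window function (dbplyr translates this to ROW_NUMBER()).
latest_note <- notes %>%
  group_by(txn_id) %>%
  filter(row_number(desc(note_date)) == 1) %>%
  ungroup()

# Either result can now be joined without changing the transaction grain:
enriched <- transactions %>%
  left_join(notes_agg, by = "txn_id")
```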
SQL vs R vs Python
SQL
SQL is by far the most common language for interfacing with a data store, whether that is a database such as SQL Server, a data warehouse such as Snowflake or an analytics engine such as Spark.
A SQL data store is often where your data will exist before it can be pulled down into a more specialised analytics tool such as R or Python.
A SQL data store also means you can centralise your data, and the processing of that data, on powerful compute clusters for access by many users.
SQL has many common commands, but each tool has its own variant, which affects the compatibility and therefore the portability of your SQL code.
R and Python
R and Python are generally used for single-user batch operations. Typical use cases are data pipeline transformations, exploratory data analysis, visualizations, ad hoc analysis, data science and machine learning. R and Python are both far more flexible than SQL. Their biggest drawback compared to SQL is that SQL engines are designed to work on very large data and serve many users simultaneously.
SQL is not better or worse; it suits a different use case, and trying to use a tool for the wrong use case leads to significantly more work for an inferior result.
Local In Memory Data Frames (<10 Million Records)
R and Python are most commonly used to handle in-memory data processing on a local machine. In this scenario the capacity for data processing is relatively small, at millions of records rather than the tens of millions to trillions of records a distributed analytics engine can handle.
Local Disk Data Frames + Multicore Processing (<10 Billion Records)
If you know what you're doing it's still possible to use R or Python to process billions of records on a laptop quite quickly, by using disk-based data frames and parallel processing across multiple CPU cores.
These can be a good option if, for example, your data is too large to fit comfortably in memory but you don't have access to a distributed SQL engine.
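One way to do this in R is sketched below, assuming the arrow package and a directory of Parquet files (data.table and duckdb offer similar capabilities):

```r
library(arrow)
library(dplyr)

# Point at a partitioned Parquet dataset on local disk. Nothing is loaded
# into memory yet, and arrow scans the files in parallel across CPU cores.
txns <- open_dataset("data/transactions_parquet")

daily_totals <- txns %>%
  filter(txn_date >= as.Date("2024-01-01")) %>%
  group_by(txn_date, product) %>%
  summarise(total_amount = sum(amount, na.rm = TRUE)) %>%
  collect()  # only the small aggregated result is pulled into memory
```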
Remote Data Frames using SQL Engine (Trillions of Records)
Note that to process this much data you really need an analytical SQL engine such as BigQuery, as opposed to a transactional database such as SQL Server.
Both R and Python have the ability to dynamically generate, parse and execute SQL on remote servers, which can massively streamline and automate workflows.
However, at present R is the only language I'm aware of that can translate highly concise R code into all the main SQL variants, execute it remotely and return the results as a remote lazy data frame via dbplyr. This enables you to remotely execute queries on data far larger than R itself can handle. A preview of the data is returned so you can quickly iterate on your solution without overloading your system, and the full result set is not returned until you explicitly call the collect() function.
This differs from Python, which can execute SQL but returns the results to a local data frame. The local data frame then becomes the bottleneck because of all the limitations discussed earlier.
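A small sketch of that dbplyr workflow, using an in-memory SQLite table created with memdb_frame() (requires the RSQLite package) as a stand-in for a real warehouse connection:

```r
library(dplyr)
library(dbplyr)

# memdb_frame() copies a small demo table into an in-memory SQLite database;
# in practice you would point tbl() at a table in your warehouse.
remote_txns <- memdb_frame(
  region = c("EMEA", "EMEA", "APAC"),
  amount = c(100, 250, 75)
)

# This is a lazy remote data frame: no data is pulled back yet, and the
# dplyr code below is translated into the SQL variant of the backend.
summary_q <- remote_txns %>%
  group_by(region) %>%
  summarise(total_amount = sum(amount, na.rm = TRUE))

show_query(summary_q)   # inspect the generated SQL
collect(summary_q)      # only now is the full result returned to R
```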
Dealing with Missing Data
In data science we often remove, fill or impute missing values before we build models. This step should come after the Common Views are built, so that the Common Views can also serve purposes other than data science, such as reporting and QA.

When building Common Views you should generally leave NULLs as they are. You can potentially fill numeric fields with zeros and categorical fields with 'Other' in the Reporting Views, but not in the Raw Views; this is important so the data can be reconciled back to source systems. The benefit of leaving all NULLs in place is that exception reporting can easily be built for missing data, which can be used to inform the relevant teams to rectify the issue at source. Filling zeros can solve some calculation issues, but it can also skew average calculations, as NULL values are typically excluded from them. The benefit of filling NULL categorical values with 'Other' is that NULL categories are easily dropped by filters and joins in reporting queries, which is a common reason why data goes missing from reports.
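A simple exception report along those lines can be as small as a NULL count per field, routed to the team that owns the source system. A dplyr sketch, where raw_view is a hypothetical Raw View pulled into R:

```r
library(dplyr)
library(tidyr)

# Count missing values per field in the Raw View; any non-zero count is an
# exception to follow up with the team that owns the source system.
null_exceptions <- raw_view %>%
  summarise(across(everything(), ~ sum(is.na(.x)))) %>%
  pivot_longer(everything(), names_to = "field", values_to = "null_count") %>%
  filter(null_count > 0) %>%
  arrange(desc(null_count))
```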
Sometimes people will try to fill missing values from other data sources outside the main system. While this can sometimes be necessary, it's often best avoided, as the alternate data sources usually don't have quality controls or processes that are as robust or automated, which generally means more maintenance and data synchronization issues. Again, exception reporting to fix data issues at source is usually best.
Testing
Do Specific Queries only use joins when necessary and offload common elements to a Common View?
UAT testing often involves stakeholders comparing query results to a source system, both in aggregate and with sampled spot checks. Where possible, column name mapping should match source systems to make reconciliation easier. User spot checks can also be added as automated checks that run each time the code is refreshed or updated.
Does the Common View contain all of the original primary key values from the underlying metric / transaction table, with no duplicates?
Do the record counts and metric totals match?
Upon refresh, are there any diffs when compared to a previous snapshot?
Are there any NULL values for key fields (key fields include any field used for keys, grouping or calculations)? These are often caused by either data quality issues or joins.
Have column aliases been implemented via a structured mapping process that reconciles with a source system or standard?
Unit testing can be used to test the output of functions across a number of common scenarios, as well as scenarios that don't necessarily exist in your sample data. For example, does a function fail if it is passed NULL, zero (divide-by-zero errors), negative values, small fractional numbers, or a value out of range (e.g. a formula looks up a value that doesn't exist)?
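In R this might look like the testthat sketch below, where fee_ratio() is a hypothetical function under test:

```r
library(testthat)

# Hypothetical function under test: ratio of fees to notional.
fee_ratio <- function(fee, notional) {
  if (is.null(fee) || is.null(notional)) stop("inputs must not be NULL")
  if (any(notional == 0)) stop("notional must be non-zero")
  fee / notional
}

test_that("fee_ratio handles common and edge-case inputs", {
  expect_equal(fee_ratio(5, 100), 0.05)      # common scenario
  expect_error(fee_ratio(NULL, 100))         # NULL input
  expect_error(fee_ratio(5, 0))              # divide-by-zero guard
  expect_equal(fee_ratio(-5, 100), -0.05)    # negative values
  expect_equal(fee_ratio(1e-9, 1), 1e-9)     # small fractional numbers
})
```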
What would you add to this list?
Have anything you'd like to add? Let me know in the comments below.