Efficient & Robust Analytic Queries
I help large financial institutions set up highly efficient, maintainable and auditable queries.
Often organizations end up with lots of unmaintainable and poorly tested SQL queries that range from hundreds to thousands of lines of code.
Many SQL developers don't have experience with other programming languages or software engineering disciplines, so code quickly gets out of hand.
Typically when entering an organisation I can reduce the amount of SQL code that needs to be maintained by 10x to 100x, while adding more automation and more tests. Automation means that lengthy BAU processes can go from hours to zero. More importantly operational risk is massively reduced. All of this typically takes less time than the original SQL queries took to write in the first place.
This article offers a guide to how I streamline and structure queries to achieve better results with less time and maintenance.
Specific Queries
A query is designed to answer a specific question. The query should aim to be as concise as possible so that it's easy to match the question or requirements with the code.
If a query can be parametrized it can answer more questions and be moved to some sort of self-service option.
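As a rough sketch of what that can look like, a Specific Query can be wrapped in a function whose arguments become the self-service inputs. The table and column names below (transactions, region, fee_amount) are purely illustrative:

```r
library(dplyr)

# Hypothetical example: a Specific Query wrapped in a function so the same
# logic can answer many variations of the question (different regions,
# date ranges, products) and be exposed as a self-service report.
fee_summary <- function(transactions, target_region, from_date, to_date) {
  transactions %>%
    filter(region == target_region,
           txn_date >= from_date,
           txn_date <= to_date) %>%
    group_by(product) %>%
    summarise(total_fees = sum(fee_amount, na.rm = TRUE),
              txn_count  = n(),
              .groups = "drop")
}

# fee_summary(txns, "EMEA", as.Date("2024-01-01"), as.Date("2024-01-31"))
```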
Multiple questions will often use the same tables, joins and data preparation, so this logic should be split out into Common Views to optimize reuse and reduce maintenance and development time.
As a general rule of thumb most joins should exist in the Common Views as opposed to the Specific Queries. The exceptions to this will be discussed in Common Views.
Common Views
When building Common Views it's often beneficial to have layers: Raw Views that map directly to the source systems for reconciliation, and Reporting Views built on top of them with business-friendly naming and categorisation.
Column aliases should be provided by mapping files and applied to the data, rather than being hard coded into the queries. Where possible, use names exactly as they appear in other systems rather than what someone might have informally written down in a specification. This significantly improves maintainability while also providing documentation and an easy way to trace and reconcile name changes between different views and systems. The rename_all() function in R can apply such a mapping to all columns returned by a SQL query.
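A minimal sketch of that approach, assuming a hypothetical two-column mapping file (source_name, report_name) maintained outside the code:

```r
library(dplyr)
library(readr)

# Hypothetical mapping file with one row per column:
#   source_name, report_name
#   cust_no,     CustomerNumber
#   txn_amt,     TransactionAmount
alias_map <- read_csv("column_aliases.csv", show_col_types = FALSE)

# Named vector: names are source-system names, values are report names.
lookup <- setNames(alias_map$report_name, alias_map$source_name)

# Apply the mapping to every column pulled back from SQL;
# unmapped columns keep their original names.
apply_aliases <- function(df) {
  rename_all(df, ~ dplyr::recode(.x, !!!lookup))
}
```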
Bottom Up vs Top Down
Use Top Down for Exploring and Querying Views
Often people start building queries top down and gradually drill into the detail. This is fine if you are exploring an already validated view, but not if you are building one.
Use Bottom Up for Building and Validating Views
When building views I recommend you always start with the transaction-level table that contains the key metrics, and then classify your data with higher-level dimensions / categories. The reason is that once you identify the main transaction / metrics table, you can use it to validate that metric sums and record counts are consistent before and after joins. You should also check that no keys are dropped or duplicated. You can't do these checks if you start with higher-level tables, because every join can alter the record count, either by duplicating records or by dropping records that are not present in all the tables.
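The kind of reconciliation I mean is sketched below. The tables (transactions, accounts, products) and keys are hypothetical and assumed to be local data frames:

```r
library(dplyr)

# Baseline taken from the transaction-level table before any joins.
baseline <- transactions %>%
  summarise(rows = n(),
            keys = n_distinct(txn_id),
            amount = sum(amount, na.rm = TRUE))

# The Common View after joining on higher-level dimensions.
common_view <- transactions %>%
  left_join(accounts, by = "account_id") %>%
  left_join(products, by = "product_id")

after_joins <- common_view %>%
  summarise(rows = n(),
            keys = n_distinct(txn_id),
            amount = sum(amount, na.rm = TRUE))

# Record counts and metric totals should be unchanged by the joins.
stopifnot(isTRUE(all.equal(baseline, after_joins)))

# No transaction keys should be duplicated or dropped by the joins.
stopifnot(nrow(filter(count(common_view, txn_id), n > 1)) == 0)
stopifnot(nrow(anti_join(transactions, common_view, by = "txn_id")) == 0)
```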
Dealing with Duplicates
Joining on tables that have multiple records for a single transaction will cause duplicates in your transactions table and thus overstate your metrics. Do not include these joins in your main views. Avoid them where possible, but if they are necessary, some solutions for dealing with them include aggregating the many-side table down to one record per key before joining, or selecting a single representative record per key (for example the most recent) with a window function.
These options should be discussed properly with stakeholders to understand the implications of each approach. All of these options are computationally expensive and so are often optimised by precomputing and materializing the results.
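A rough illustration of those two approaches in dplyr (the notes table and its columns are hypothetical):

```r
library(dplyr)

# The notes table has many rows per transaction, so joining it directly onto
# the transactions table would duplicate rows and overstate metrics.

# Option 1: aggregate the many-side down to one row per key before joining.
notes_agg <- notes %>%
  group_by(txn_id) %>%
  summarise(note_count = n(), .groups = "drop")

# Option 2: keep a single representative row per key, e.g. the latest note,
# using a window function (dbplyr translates this to ROW_NUMBER()).
latest_note <- notes %>%
  group_by(txn_id) %>%
  filter(row_number(desc(note_date)) == 1) %>%
  ungroup()

# Either result can now be joined without changing the transaction grain:
enriched <- transactions %>%
  left_join(notes_agg, by = "txn_id")
```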
SQL vs R vs Python
SQL
SQL is by far the most common language for interfacing with a data store, whether that is a database such as SQL Server, a data warehouse such as Snowflake or an analytics engine such as Spark.
A SQL data store is often where your data will exist before it can be pulled down into a more specialised analytics tool such as R or Python.
A SQL data store also means you can centralise your data, and the processing of that data, on powerful compute clusters for access by many users.
SQL has many common commands, but each tool has its own variant, which affects the compatibility and therefore the portability of your SQL code.
R and Python
R and Python are generally used for single-user batch operations. Typical use cases are data pipeline transformations, exploratory data analysis, visualizations, ad hoc analysis, data science and machine learning. R and Python are both far more flexible than SQL. Their biggest drawback compared to SQL is that SQL engines are designed to work on very large data and serve many users simultaneously.
SQL is not better or worse; it suits a different use case, and trying to use a tool for the wrong use case leads to significantly more work for an inferior result.
Local In Memory Data Frames (<10 Million Records)
R and Python are most commonly used to handle in-memory data processing on a local machine. In this scenario the capacity for data processing is relatively small, at millions of records rather than the tens of millions to trillions of records a distributed analytics engine can handle.
Local Disk Data Frames + Multicore Processing (<10 Billion Records)
If you know what you're doing it's still possible to use R or Python to process billions of records on a laptop quite quickly, by using disk-based data frames and parallel processing across multiple CPU cores.
These can be a good option if, for example, your data is too large to fit comfortably in memory but you don't have access to a distributed SQL engine.
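One way to do this in R is sketched below, assuming the arrow package and a directory of Parquet files (data.table and duckdb offer similar capabilities):

```r
library(arrow)
library(dplyr)

# Point at a partitioned Parquet dataset on local disk. Nothing is loaded
# into memory yet, and arrow scans the files in parallel across CPU cores.
txns <- open_dataset("data/transactions_parquet")

daily_totals <- txns %>%
  filter(txn_date >= as.Date("2024-01-01")) %>%
  group_by(txn_date, product) %>%
  summarise(total_amount = sum(amount, na.rm = TRUE)) %>%
  collect()  # only the small aggregated result is pulled into memory
```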
Remote Data Frames using SQL Engine (Trillions of Records)
Note that to process this much data you really need an analytical SQL engine such as BigQuery, as opposed to a transactional database such as SQL Server.
Both R and Python have the ability to dynamically generate, parse and execute SQL on remote servers, which can massively streamline and automate workflows.
However, at present R is the only language I'm aware of that can translate highly concise R code into all the main SQL variants, execute it remotely and return the results as a remote lazy data frame via dbplyr. This enables you to remotely execute queries on data far larger than R itself can handle. A preview of the data is returned so you can quickly iterate on your solution without overloading your system, and the full result set is not returned until you explicitly call the collect() function.
This differs from Python, which can execute SQL but returns the results to a local data frame. The local data frame then becomes the bottleneck because of all the limitations discussed earlier.
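A small sketch of that dbplyr workflow, using an in-memory SQLite table created with memdb_frame() (requires the RSQLite package) as a stand-in for a real warehouse connection:

```r
library(dplyr)
library(dbplyr)

# memdb_frame() copies a small demo table into an in-memory SQLite database;
# in practice you would point tbl() at a table in your warehouse.
remote_txns <- memdb_frame(
  region = c("EMEA", "EMEA", "APAC"),
  amount = c(100, 250, 75)
)

# This is a lazy remote data frame: no data is pulled back yet, and the
# dplyr code below is translated into the SQL variant of the backend.
summary_q <- remote_txns %>%
  group_by(region) %>%
  summarise(total_amount = sum(amount, na.rm = TRUE))

show_query(summary_q)   # inspect the generated SQL
collect(summary_q)      # only now is the full result returned to R
```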
Dealing with Missing Data
In data science we often remove, fill or impute missing values before we build models. This step should come after the Common Views are built, so that the Common Views can also serve purposes other than data science, such as reporting and QA.

When building Common Views you should generally leave NULLs as they are. You can potentially fill numeric fields with zeros and categorical fields with 'Other' in the Reporting Views, but not in the Raw Views; this is important so the data can be reconciled back to source systems. The benefit of leaving all NULLs in place is that exception reporting can easily be built for missing data, which can be used to inform the relevant teams to rectify the issue at source. Filling zeros can solve some calculation issues, but it can also skew average calculations, as NULL values are typically excluded from them. The benefit of filling NULL categorical values with 'Other' is that NULL categories are easily dropped by filters and joins in reporting queries, which is a common reason why data goes missing from reports.
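A simple exception report along those lines can be as small as a NULL count per field, routed to the team that owns the source system. A dplyr sketch, where raw_view is a hypothetical Raw View pulled into R:

```r
library(dplyr)
library(tidyr)

# Count missing values per field in the Raw View; any non-zero count is an
# exception to follow up with the team that owns the source system.
null_exceptions <- raw_view %>%
  summarise(across(everything(), ~ sum(is.na(.x)))) %>%
  pivot_longer(everything(), names_to = "field", values_to = "null_count") %>%
  filter(null_count > 0) %>%
  arrange(desc(null_count))
```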
Sometimes people will try to fill missing values from other data sources outside the main system. While this can sometimes be necessary, it's often best avoided, as the alternate data sources usually don't have quality controls or processes that are as robust or automated, which generally means more maintenance and data synchronization issues. Again, exception reporting to fix data issues at source is usually best.
Testing
Do Specific Queries only use joins when necessary and offload common elements to a Common View?
UAT testing often involves stakeholders comparing query results to a source system, both in aggregate and with sampled spot checks. Where possible, column name mapping should match source systems to make reconciliation easier. User spot checks can also be added as automated checks that run each time the code is refreshed or updated.
Does the Common View contain all of the original primary key values from the underlying metric / transaction table, with no duplicates?
Do the record counts and metric totals match?
Upon refresh, are there any diffs when compared to a previous snapshot?
Are there any NULL values for key fields (key fields include any field used for keys, grouping or calculations)? These are often caused by either data quality issues or joins.
Have column aliases been implemented via a structured mapping process that reconciles with a source system or standard?
Unit testing can be used to test the output of functions across a number of common scenarios, as well as scenarios that don't necessarily exist in your sample data. For example, does a function fail if it is passed NULL, zero (divide-by-zero errors), negative values, small fractional numbers, or a value out of range (e.g. a formula looks up a value that doesn't exist)?
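In R this might look like the testthat sketch below, where fee_ratio() is a hypothetical function under test:

```r
library(testthat)

# Hypothetical function under test: ratio of fees to notional.
fee_ratio <- function(fee, notional) {
  if (is.null(fee) || is.null(notional)) stop("inputs must not be NULL")
  if (any(notional == 0)) stop("notional must be non-zero")
  fee / notional
}

test_that("fee_ratio handles common and edge-case inputs", {
  expect_equal(fee_ratio(5, 100), 0.05)      # common scenario
  expect_error(fee_ratio(NULL, 100))         # NULL input
  expect_error(fee_ratio(5, 0))              # divide-by-zero guard
  expect_equal(fee_ratio(-5, 100), -0.05)    # negative values
  expect_equal(fee_ratio(1e-9, 1), 1e-9)     # small fractional numbers
})
```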
What would you add to this list?
Have anything you'd like to add? Let me know in the comments below.