Building Scalable Analytics Pipelines
Analytics pipelines are the processes through which raw data are transformed into insights and delivered to the end user. The way a project’s pipeline is designed materially affects how much manual intervention is needed, how repeatable and fast the pipeline is, and what the end-user experience is like.
Source data ingestion
Analytics pipelines tend to start with ingestion of source data, e.g. sales transactions or customer calls. Sources that change often will require more maintenance. Design steps to catch these changes – DO NOT rely on DBAs (database administrators) to remember to tell you. Sources that are not refreshed synchronously can cause misalignments, e.g. if customer account information updates after sales, prices may be charged and reported incorrectly.
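For illustration, here is a minimal sketch of such a guard step in Python; the file path, column names and freshness threshold are assumptions for the example, not a real schema:

```python
# Minimal sketch: guard an ingestion step against silent source changes.
# Assumes the source extract lands as a CSV; the path, expected columns and
# staleness threshold below are illustrative placeholders.
from datetime import datetime, timedelta, timezone
from pathlib import Path

import pandas as pd

EXPECTED_COLUMNS = {"sale_id", "customer_id", "sale_date", "amount"}
MAX_STALENESS = timedelta(hours=24)

def validate_extract(path: str) -> pd.DataFrame:
    """Fail loudly if the source schema drifted or the file is stale."""
    file = Path(path)
    age = datetime.now(timezone.utc) - datetime.fromtimestamp(
        file.stat().st_mtime, tz=timezone.utc
    )
    if age > MAX_STALENESS:
        raise RuntimeError(f"{path} is {age} old; the source refresh may have failed")

    df = pd.read_csv(file)
    missing = EXPECTED_COLUMNS - set(df.columns)
    if missing:
        raise RuntimeError(f"{path} is missing expected columns: {sorted(missing)}")
    return df
```

A check like this turns a silent source change into an immediate, explainable failure instead of a quietly wrong report.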
The more stable, synchronous and well-maintained your source data are, the more automatable your pipeline will be. Where manual steps are needed, create SOPs (standard operating procedures) to follow. Don’t rely on your memory.
Data structure design and ETL coding
For a pipeline to work efficiently, a thoughtfully designed data structure is critical. This means studying the source data with respect to the insight needed. Pull the least amount of data, then aggregate and enrich along the pipeline so the end result provides the dimensions and richness of data needed BUT is also fast to generate. Throwing more storage and processing power at the problem is an expensive, lazy and temporary fix for a poorly designed data architecture.
E.g. for customer insight, you rarely need individual sales records; aggregating to monthly would do. If all you need is market-segment-level insight, you can aggregate above the individual customer too. Say you have 100 customers across 3 segments, each making 5,000 purchases a year: your extract could shrink from 500,000 sales lines to 36 (3 segments × 12 months) rows per year.
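A minimal pandas sketch of that aggregation, assuming illustrative column names (customer_id, segment, sale_date, amount, sale_id):

```python
# Collapse line-level sales to one row per segment per month.
# Column names are assumptions for the example, not a prescribed schema.
import pandas as pd

def aggregate_sales(sales: pd.DataFrame, customers: pd.DataFrame) -> pd.DataFrame:
    """Enrich sales with segment, then roll up to segment x month."""
    enriched = sales.merge(customers[["customer_id", "segment"]], on="customer_id")
    enriched["month"] = pd.to_datetime(enriched["sale_date"]).dt.to_period("M")
    return (
        enriched.groupby(["segment", "month"], as_index=False)
        .agg(total_amount=("amount", "sum"), n_sales=("sale_id", "count"))
    )
```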
The ETL job should be scripted in a disciplined way, avoiding expensive joins, convoluted loops, etc. Build it stepwise so you know which steps are time-consuming, which also makes debugging easier. Document liberally, both for record keeping and for self-checking.
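One way to keep a job stepwise and time-aware is a small runner that logs the duration of each named step. The step functions referenced in the commented example are placeholders for your own extract/transform/load code:

```python
# Minimal sketch of a step-wise ETL runner that times and logs each step,
# so slow steps stand out and failures are easy to localise.
import logging
import time

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s")
log = logging.getLogger("etl")

def run_pipeline(steps):
    """Run (name, function) steps in order, passing each step's output to the next."""
    data = None
    for name, step in steps:
        start = time.perf_counter()
        data = step(data)
        log.info("step %-20s finished in %.1fs", name, time.perf_counter() - start)
    return data

# Example wiring (placeholder functions):
# run_pipeline([
#     ("extract", lambda _: extract_sales()),
#     ("aggregate", lambda df: aggregate_sales(df, load_customers())),
#     ("load", load_to_reporting_table),
# ])
```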
Where the source data is large and/or the ETL job is complex, consider creating temp tables on disk to avoid hogging RAM.
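As a sketch of that idea, you could use SQLite as an on-disk staging area, streaming the extract in chunks so only the small aggregated result comes back into memory (file and column names are assumptions):

```python
# Stage a large extract in an on-disk SQLite table instead of holding it all
# in RAM, then let the database do the heavy aggregation.
import sqlite3

import pandas as pd

with sqlite3.connect("staging.db") as conn:
    # Stream the raw extract into a staging table in chunks.
    for chunk in pd.read_csv("sales_extract.csv", chunksize=100_000):
        chunk.to_sql("stg_sales", conn, if_exists="append", index=False)

    # Aggregate on disk; only the small result comes back into memory.
    monthly = pd.read_sql_query(
        """
        SELECT segment,
               strftime('%Y-%m', sale_date) AS month,
               SUM(amount) AS total_amount
        FROM stg_sales
        GROUP BY segment, month
        """,
        conn,
    )
```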
Automate?
A one-off analysis is very quick. Automating the same analysis requires more careful design to navigate multiple scenarios and steps. If a task happens often, relies on stable source data and methodology, AND is of high value to end users, you should consider automating it. Analyses that rely on unstable or asynchronous source data, have frequently changing requirements, OR are of low value should be kept manual (or, in fact, not done at all).
Automation requires a mechanism to schedule ETL jobs, e.g. cron or Jenkins. If you’re using dashboard tools like Tableau, most have integrated DB query functions (or CSV extract file ingestion) which you can put on a schedule. More integrated visualization tools like Looker have this automation built in.
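For illustration, a single crontab entry is often enough to start with; the script path, log path and timing below are placeholders:

```
# Run the ETL script every weekday at 06:00 so the dashboard is refreshed
# before business hours (paths are illustrative).
0 6 * * 1-5  /usr/bin/python3 /opt/pipelines/refresh_sales.py >> /var/log/refresh_sales.log 2>&1
```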
External reference files should be maintained explicitly, and documented as such in the SOP. Resist the temptation to hard-code these elements into your ETL script.
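For example, a sketch of loading a reference mapping from a maintained file rather than hard-coding it (the file name and columns are assumptions):

```python
# Load reference data (e.g. a customer-to-segment mapping) from a versioned
# file kept outside the code; record its location and owner in the SOP.
import pandas as pd

def load_segment_mapping(path: str = "reference/segment_mapping.csv") -> pd.DataFrame:
    """Read the mapping maintained outside the ETL script."""
    mapping = pd.read_csv(path)  # columns assumed: customer_id, segment
    if mapping["customer_id"].duplicated().any():
        raise ValueError("Duplicate customer_id in segment mapping; fix the reference file")
    return mapping
```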
Delivery modalities
Most dashboard tools can schedule email alerts to end users, so you don’t have to remember to send them. It’s 2022; you should not be sending Excel files via email every day!
Watch for slow response/latency in the tool you use for insight delivery. Provided your backend data structure was designed efficiently, your reporting tool should perform as designed. However, server performance issues can still occur, especially if many users hit the same dashboard at the same time each day. E.g. Tableau Server provides latency monitoring and can alert you if performance dips below a threshold.
Where multiple user groups access the same output AND do not need to see each other’s info, use security settings to ring-fence each group’s data. You will have to design these firewalls and access profiles up front, and build the tags that trigger these delineations into your pipeline.
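A minimal sketch of tagging output rows so the BI tool’s row-level security can filter on them; the group names and the region-to-group rule are purely illustrative:

```python
# Tag each output row with the user group allowed to see it, so row-level
# security in the dashboard tool can match the tag against user profiles.
import pandas as pd

REGION_TO_GROUP = {"EMEA": "emea_sales", "APAC": "apac_sales", "AMER": "amer_sales"}

def tag_access_group(report: pd.DataFrame) -> pd.DataFrame:
    """Add an access_group column for the BI tool's security filters."""
    tagged = report.copy()
    tagged["access_group"] = tagged["region"].map(REGION_TO_GROUP)
    if tagged["access_group"].isna().any():
        raise ValueError("Unmapped region found; update REGION_TO_GROUP or the source data")
    return tagged
```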
Maintenance and Iteration
No analytics pipeline lasts forever. You will always have to check that it is working as built. Set up regular office hours with end users to seek feedback, and prioritize enhancements through, say, quarterly updates.