Building Scalable Analytics Pipelines

Analytics pipelines are the processes through which raw data are transformed into insights and delivered to the end user. The way a project's pipeline is designed materially affects how much manual intervention is needed, how repeatable and fast the pipeline is, and what the end-user experience is like.

Source data ingestion

Analytics pipelines tend to start with ingestion of source data, e.g. sales transactions or customer calls. Sources that change often will require more maintenance. You need to design steps to catch these changes – DO NOT rely on DBAs (database administrators) to remember to tell you. Sources that are not refreshed synchronously can cause misalignments; for example, if customer account information updates after sales do, prices may be charged erroneously and reported incorrectly.

The more stable, synchronous, and well maintained your source data are, the more automatable your pipeline will be. Where manual steps are needed, create SOPs (standard operating procedures) to follow. Don't rely on your memory.
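As a minimal sketch of the kind of check that catches these changes (assuming a Python/pandas pipeline; the file, table, and column names here are made up), an ingestion step can validate schema and freshness before anything flows downstream:

```python
import pandas as pd

# Expected columns and refresh window for the source extract (hypothetical names)
EXPECTED_COLUMNS = {"sale_id", "customer_id", "sale_date", "amount"}
MAX_STALENESS_DAYS = 2

def check_source(df: pd.DataFrame) -> list:
    """Return a list of problems found in a freshly ingested extract."""
    # Catch silently empty loads
    if df.empty:
        return ["extract is empty"]

    # Catch schema drift: columns renamed or dropped by the source owner
    missing = EXPECTED_COLUMNS - set(df.columns)
    if missing:
        return ["missing columns: " + ", ".join(sorted(missing))]

    problems = []
    # Catch stale or asynchronous refreshes
    latest = pd.to_datetime(df["sale_date"]).max()
    if (pd.Timestamp.today() - latest).days > MAX_STALENESS_DAYS:
        problems.append(f"latest record is {latest.date()}; extract looks stale")
    return problems

problems = check_source(pd.read_csv("sales_extract.csv"))
if problems:
    raise RuntimeError("Source checks failed: " + "; ".join(problems))
```

A failing check should stop the run loudly rather than letting a half-refreshed extract flow quietly into the reports.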

Data structure design and ETL coding

For a pipeline to work efficiently, a thoughtfully designed data structure is critical. This means studying the source data with respect to the insight needed. Pull the least amount of data, then aggregate and enrich along the pipeline so the end result provides the dimensions and richness needed BUT is also fast to generate. Throwing more storage and processing power at a poorly designed data architecture is an expensive, lazy, and temporary solution.

E.g. for customer insight, you rarely need individual sales records; aggregating to monthly will do. If all you need is a market-segment-level view, you can aggregate above the individual customer too. Say you have 100 customers across 3 segments, each making 5,000 purchases a year: your extract could shrink from 500k sales lines to 3 segments × 12 months = 36 rows per year.
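A hedged illustration of that collapse, assuming a pandas-based extract with hypothetical column names (sale_id, sale_date, amount, segment):

```python
import pandas as pd

# Transaction-level extract: one row per sale (hypothetical file and columns)
sales = pd.read_csv("sales_extract.csv", parse_dates=["sale_date"])

# Collapse ~500k sales lines to one row per segment per month
segment_monthly = (
    sales
    .assign(month=sales["sale_date"].dt.to_period("M"))
    .groupby(["segment", "month"], as_index=False)
    .agg(total_sales=("amount", "sum"),
         n_transactions=("sale_id", "count"))
)
# With 3 segments, the result is 36 rows per year instead of 500,000
```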

The ETL job should be scripted in a disciplined way, avoiding expensive joins, convoluted loops, and the like. Design it stepwise so you know which steps are time consuming, which also makes debugging easier. Document liberally, both for record keeping and for self-checking.
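One way to keep the job stepwise and self-documenting, sketched here in Python with a hypothetical extract file, is to wrap each stage in a timer so slow steps stand out in the log:

```python
import logging
import time
from contextlib import contextmanager

import pandas as pd

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s")

@contextmanager
def step(name):
    """Log wall-clock time per pipeline step so the slow stages stand out."""
    start = time.perf_counter()
    logging.info("starting: %s", name)
    try:
        yield
    finally:
        logging.info("finished: %s (%.1fs)", name, time.perf_counter() - start)

# Each stage is isolated and timed, which makes slow steps and bugs easy to localise
with step("extract sales"):
    sales = pd.read_csv("sales_extract.csv", parse_dates=["sale_date"])
with step("aggregate to segment-month"):
    monthly = (sales.assign(month=sales["sale_date"].dt.to_period("M"))
                    .groupby(["segment", "month"])["amount"].sum())
```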

Where the source data is large, and/or the ETL job is complex, consider creating temp tables on disk rather than hogging RAM.
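A rough sketch of that staging pattern, using SQLite purely as a stand-in for whatever database or scratch storage you actually have (file and table names are illustrative):

```python
import sqlite3

import pandas as pd

# Stage intermediate results on disk instead of holding every DataFrame in RAM
conn = sqlite3.connect("pipeline_scratch.db")

enriched = pd.read_csv("enriched_sales.csv")     # hypothetical intermediate extract
enriched.to_sql("stg_enriched_sales", conn, if_exists="replace", index=False)
del enriched                                     # free the in-memory copy

# Later steps pull back only the slice they need
monthly = pd.read_sql(
    "SELECT segment, month, SUM(amount) AS total_sales "
    "FROM stg_enriched_sales GROUP BY segment, month",
    conn,
)
conn.close()
```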

Automate?

A one-off analysis is very quick. Automating that analysis requires more careful design to navigate multiple scenarios and steps. If a task occurs often, has stable source data and methodology, AND is of high value to end users, you should consider automating it. Analyses that rely on unstable or asynchronous source data, with oft-changing requirements, OR that are of low value should be kept manual (or, in fact, not done at all).

Automation requires a mechanism to schedule ETL jobs, e.g. cron or Jenkins. If you're using dashboard tools like Tableau, most have integrated database query functions (or CSV extract file ingestion) that you can schedule. More integrated visualization tools like Looker have this automation built in.
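If cron is your scheduler, the setup can be as simple as an entry-point script plus one crontab line; the paths and file names below are made up for illustration:

```python
# etl_job.py -- hypothetical entry point for the scheduled pipeline
import logging
import sys

logging.basicConfig(filename="etl_job.log", level=logging.INFO,
                    format="%(asctime)s %(message)s")

def main():
    logging.info("pipeline run started")
    # ... extract, transform, load steps go here ...
    logging.info("pipeline run finished")

if __name__ == "__main__":
    try:
        main()
    except Exception:
        logging.exception("pipeline run failed")  # leave a trail the SOP can point to
        sys.exit(1)

# Example crontab entry (runs every day at 06:00):
# 0 6 * * * /usr/bin/python3 /opt/pipelines/etl_job.py
```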

External reference files should be maintained explicitly, and documented as such in the SOP. Resist the temptation to hard code elements into your ETL script.
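For instance, instead of hard coding a customer-to-segment mapping inside the script, keep it in a reference file that is maintained and documented separately (the file and column names here are hypothetical):

```python
import pandas as pd

# reference/segment_mapping.csv is maintained outside the ETL script and noted in the SOP
# (hypothetical columns: customer_id, segment)
segment_mapping = pd.read_csv("reference/segment_mapping.csv")

# Join the reference data onto the extract instead of embedding the mapping in code
sales = pd.read_csv("sales_extract.csv")
enriched = sales.merge(segment_mapping, on="customer_id", how="left")

# Flag anything the reference file does not cover, so the file gets updated, not the script
unmapped = enriched["segment"].isna().sum()
if unmapped:
    print(f"{unmapped} rows have no segment mapping; update the reference file")
```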

Delivery modalities

Most dashboard tools can schedule email alerts to end users, so you don't have to remember to send them yourself. It's 2022; you should not be sending Excel files via email every day!

Watch for slow response/latency in the tool you use for insight delivery. Provided your backend data structure was designed efficiently, your reporting tool should perform as designed. However, server performance issues can still occur, especially if many users hit the same tool/dashboard at the same time each day. E.g. Tableau Server provides latency monitoring and can alert you if performance dips below a threshold.

Where multiple user groups are accessing the pipeline AND do not need to see each other's info, use security settings to ring-fence each group's data. You will have to design these firewalls and access profiles up front, and build into your pipeline the tags needed to trigger these delineations.
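A rough sketch of building those tags into the pipeline output (the group names and region-based rule are invented for illustration; the actual ring-fencing is enforced by your dashboard tool's row-level security settings):

```python
import pandas as pd

# Map each region to the user group allowed to see it (hypothetical rule)
GROUP_BY_REGION = {"NORTHEAST": "ne_sales_team", "WEST": "west_sales_team"}

# Hypothetical pipeline output with a region column
output = pd.read_csv("segment_monthly.csv")
output["user_group"] = output["region"].map(GROUP_BY_REGION).fillna("hq_only")

# The dashboard tool's row-level security filters on user_group,
# so each team sees only its own slice of the data
output.to_csv("segment_monthly_tagged.csv", index=False)
```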

Maintenance and Iteration

No analytics pipeline is forever. You will always have to check that it's still working as built. Set up regular office hours with end users to seek feedback, and prioritize enhancements through, say, quarterly updates.
