Building Scalable Analytics Pipelines
Analytics pipelines are the processes through which raw data are transformed into insights and delivered to the end user. The way a project’s pipeline is designed materially affects how much manual intervention is needed, how repeatable and fast the pipeline is, and what the end-user experience is like.
Source data ingestion
Analytics pipelines tend to start with ingestion of source data, e.g. sales transactions or customer calls. Sources that change often will require more maintenance. Design steps to catch these changes – DO NOT rely on DBAs (database administrators) to remember to tell you. Sources that are not refreshed synchronously can cause misalignments, e.g. if customer account information updates after sales, prices may be charged and reported incorrectly.
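For illustration, here is a minimal sketch of such a guard step in Python; the file path, column names and freshness threshold are assumptions for the example, not a real schema:

```python
# Minimal sketch: guard an ingestion step against silent source changes.
# Assumes the source extract lands as a CSV; the path, expected columns and
# staleness threshold below are illustrative placeholders.
from datetime import datetime, timedelta, timezone
from pathlib import Path

import pandas as pd

EXPECTED_COLUMNS = {"sale_id", "customer_id", "sale_date", "amount"}
MAX_STALENESS = timedelta(hours=24)

def validate_extract(path: str) -> pd.DataFrame:
    """Fail loudly if the source schema drifted or the file is stale."""
    file = Path(path)
    age = datetime.now(timezone.utc) - datetime.fromtimestamp(
        file.stat().st_mtime, tz=timezone.utc
    )
    if age > MAX_STALENESS:
        raise RuntimeError(f"{path} is {age} old; the source refresh may have failed")

    df = pd.read_csv(file)
    missing = EXPECTED_COLUMNS - set(df.columns)
    if missing:
        raise RuntimeError(f"{path} is missing expected columns: {sorted(missing)}")
    return df
```

A check like this turns a silent source change into an immediate, explainable failure instead of a quietly wrong report.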
The more stable, synchronous and well-maintained your source data are, the more automatable your pipeline will be. Where manual steps are needed, create SOPs (standard operating procedures) to follow. Don’t rely on your memory.
Data structure design and ETL coding
For a pipeline to work efficiently, a thoughtfully designed data structure is critical. This means studying the source data with respect to the insight needed. Pull the least amount of data, then aggregate and enrich along the pipeline so the end result provides the dimensions and richness of data needed BUT is also fast to generate. Throwing more storage and processing power at the problem is an expensive, lazy and temporary fix for a poorly designed data architecture.
E.g. for customer insight, you rarely need individual sales records; aggregating to monthly would do. If all you need is market-segment-level insight, you can aggregate above the individual customer too. Say you have 100 customers across 3 segments, each making 5,000 purchases a year: your extract could shrink from 500,000 sales lines to 36 (3 segments × 12 months) rows per year.
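A minimal pandas sketch of that aggregation, assuming illustrative column names (customer_id, segment, sale_date, amount, sale_id):

```python
# Collapse line-level sales to one row per segment per month.
# Column names are assumptions for the example, not a prescribed schema.
import pandas as pd

def aggregate_sales(sales: pd.DataFrame, customers: pd.DataFrame) -> pd.DataFrame:
    """Enrich sales with segment, then roll up to segment x month."""
    enriched = sales.merge(customers[["customer_id", "segment"]], on="customer_id")
    enriched["month"] = pd.to_datetime(enriched["sale_date"]).dt.to_period("M")
    return (
        enriched.groupby(["segment", "month"], as_index=False)
        .agg(total_amount=("amount", "sum"), n_sales=("sale_id", "count"))
    )
```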
The ETL job should be scripted in a disciplined way, avoiding expensive joins, convoluted loops, etc. Build it stepwise so you know which steps are time-consuming, which also makes debugging easier. Document liberally, both for record keeping and for self-checking.
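One way to keep a job stepwise and time-aware is a small runner that logs the duration of each named step. The step functions referenced in the commented example are placeholders for your own extract/transform/load code:

```python
# Minimal sketch of a step-wise ETL runner that times and logs each step,
# so slow steps stand out and failures are easy to localise.
import logging
import time

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s")
log = logging.getLogger("etl")

def run_pipeline(steps):
    """Run (name, function) steps in order, passing each step's output to the next."""
    data = None
    for name, step in steps:
        start = time.perf_counter()
        data = step(data)
        log.info("step %-20s finished in %.1fs", name, time.perf_counter() - start)
    return data

# Example wiring (placeholder functions):
# run_pipeline([
#     ("extract", lambda _: extract_sales()),
#     ("aggregate", lambda df: aggregate_sales(df, load_customers())),
#     ("load", load_to_reporting_table),
# ])
```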
Where the source data is large and/or the ETL job is complex, consider creating temp tables on disk to avoid hogging RAM.
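As a sketch of that idea, you could use SQLite as an on-disk staging area, streaming the extract in chunks so only the small aggregated result comes back into memory (file and column names are assumptions):

```python
# Stage a large extract in an on-disk SQLite table instead of holding it all
# in RAM, then let the database do the heavy aggregation.
import sqlite3

import pandas as pd

with sqlite3.connect("staging.db") as conn:
    # Stream the raw extract into a staging table in chunks.
    for chunk in pd.read_csv("sales_extract.csv", chunksize=100_000):
        chunk.to_sql("stg_sales", conn, if_exists="append", index=False)

    # Aggregate on disk; only the small result comes back into memory.
    monthly = pd.read_sql_query(
        """
        SELECT segment,
               strftime('%Y-%m', sale_date) AS month,
               SUM(amount) AS total_amount
        FROM stg_sales
        GROUP BY segment, month
        """,
        conn,
    )
```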
Automate?
A one-off analysis is very quick. Automating the same analysis requires more careful design to navigate multiple scenarios and steps. If a task happens often, relies on stable source data and methodology, AND is of high value to end users, you should consider automating it. Analyses that rely on unstable or asynchronous source data, have frequently changing requirements, OR are of low value should be kept manual (or, in fact, not done at all).
Automation requires a mechanism to schedule ETL jobs, e.g. cron or Jenkins. If you’re using dashboard tools like Tableau, most have integrated DB query functions (or CSV extract file ingestion) which you can put on a schedule. More integrated visualization tools like Looker have this automation built in.
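For illustration, a single crontab entry is often enough to start with; the script path, log path and timing below are placeholders:

```
# Run the ETL script every weekday at 06:00 so the dashboard is refreshed
# before business hours (paths are illustrative).
0 6 * * 1-5  /usr/bin/python3 /opt/pipelines/refresh_sales.py >> /var/log/refresh_sales.log 2>&1
```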
External reference files should be maintained explicitly, and documented as such in the SOP. Resist the temptation to hard-code these elements into your ETL script.
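For example, a sketch of loading a reference mapping from a maintained file rather than hard-coding it (the file name and columns are assumptions):

```python
# Load reference data (e.g. a customer-to-segment mapping) from a versioned
# file kept outside the code; record its location and owner in the SOP.
import pandas as pd

def load_segment_mapping(path: str = "reference/segment_mapping.csv") -> pd.DataFrame:
    """Read the mapping maintained outside the ETL script."""
    mapping = pd.read_csv(path)  # columns assumed: customer_id, segment
    if mapping["customer_id"].duplicated().any():
        raise ValueError("Duplicate customer_id in segment mapping; fix the reference file")
    return mapping
```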
Delivery modalities
Most dashboard tools can schedule email alerts to end users, so you don’t have to remember to send them. It’s 2022; you should not be sending Excel files via email every day!
Watch for slow response/latency in the tool you use for insight delivery. Provided your backend data structure was designed efficiently, your reporting tool should perform as designed. However, server performance issues can still occur, especially if many users hit the same dashboard at the same time each day. E.g. Tableau Server provides latency monitoring and can alert you if performance dips below a threshold.
Where multiple user groups access the same output AND do not need to see each other’s info, use security settings to ring-fence each group’s data. You will have to design these firewalls and access profiles up front, and build the tags that trigger these delineations into your pipeline.
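A minimal sketch of tagging output rows so the BI tool’s row-level security can filter on them; the group names and the region-to-group rule are purely illustrative:

```python
# Tag each output row with the user group allowed to see it, so row-level
# security in the dashboard tool can match the tag against user profiles.
import pandas as pd

REGION_TO_GROUP = {"EMEA": "emea_sales", "APAC": "apac_sales", "AMER": "amer_sales"}

def tag_access_group(report: pd.DataFrame) -> pd.DataFrame:
    """Add an access_group column for the BI tool's security filters."""
    tagged = report.copy()
    tagged["access_group"] = tagged["region"].map(REGION_TO_GROUP)
    if tagged["access_group"].isna().any():
        raise ValueError("Unmapped region found; update REGION_TO_GROUP or the source data")
    return tagged
```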
Maintenance and Iteration
No analytics pipeline lasts forever. You will always have to check that it is working as built. Set up regular office hours with end users to seek feedback, and prioritize enhancements through, say, quarterly updates.