Data Analytics: Lessons from Software Development

In data work we often view languages like Python and SQL as simply a means to an end. In reality, they are powerful tools, and learning to use them effectively can help us deliver higher-quality data products faster.

Be careful though: failing to appreciate their complexity can instead create unnecessary burden and expose businesses to additional risk. Here are some lessons I brought over from software development when redeveloping NHS analytical pipelines.

Clarity

The purpose of your code should be clear to anybody reading it, including you!

  • Use sensible names; whether it's variables and functions or filenames and projects - make their purpose obvious!
  • Formatting matters; visually cluttered code is harder to understand and a nightmare to edit - use whitespace to delineate logical blocks
  • Parentheses (are your friends); don't assume operator precedence - make the order of operations explicit by grouping with parentheses
  • Comment correctly; like all good seasonings, not too much and not too little - I try to limit comments to novel implementations or pointers to external resources
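To illustrate the points above (the function, names, and rates here are invented for the example), here is the same small calculation written with sensible names, whitespace, explicit parentheses, and a single comment where it earns its keep:

```python
# Unclear version: def f(x, y): return x*y*0.2 if y > 10 else x*y*0.1
VAT_STANDARD = 0.20   # illustrative rates, not real tax guidance
VAT_REDUCED = 0.10
BULK_THRESHOLD = 10

def vat_due(net_price: float, quantity: int) -> float:
    """Return the VAT owed; bulk orders attract the standard rate."""
    rate = VAT_STANDARD if quantity > BULK_THRESHOLD else VAT_REDUCED
    return (net_price * quantity) * rate   # parentheses make the grouping explicit
```

The one-liner and the named version compute the same thing, but only one of them can be safely edited six months later.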

Stop reinventing the wheel

Teams often end up working in silos, and it's easy to think you're the first people ever to encounter a specific problem.

Work to break down communication barriers between teams and business areas to standardise approaches to common tasks. Does the business favour pure SQL analytics or a mixed-language SQL/Python approach? Do you have a hierarchy of transformations and a preferred order? Just how do you calculate the median in SQL?
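As a sketch of that last question (run here through Python's stdlib sqlite3 against an invented table, since SQLite lacks PERCENTILE_CONT): one common portable answer is the ORDER BY / LIMIT / OFFSET trick, which selects the middle row, or averages the middle pair for even-length sets:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE waits (days INTEGER)")
conn.executemany("INSERT INTO waits VALUES (?)", [(d,) for d in [3, 7, 1, 9, 5]])

# LIMIT is 1 for odd counts, 2 for even; OFFSET skips to the middle of the sorted rows
median = conn.execute("""
    SELECT AVG(days) FROM (
        SELECT days FROM waits ORDER BY days
        LIMIT 2 - (SELECT COUNT(*) FROM waits) % 2
        OFFSET (SELECT (COUNT(*) - 1) / 2 FROM waits)
    )
""").fetchone()[0]
# → 5.0 for the sample above
```

Exactly the sort of snippet worth agreeing on once and sharing, rather than having every team derive it independently.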

Tech debt

Every coding choice we make brings with it the downstream burden of tech debt. Perhaps our implementation of a yearly analysis is fragile and will need updating when the pipeline is next executed, or the library we used for data profiling has been abandoned and no longer interfaces well with our data structures.

Tech debt isn't necessarily a problem; rather, it needs to be documented and managed so it can be correctly mitigated. Sometimes we have to add a bespoke section to a pipeline to meet a tight deadline, and that's fine - just make sure it is recorded and addressed down the line.

DevOps

Data analytics has developed somewhat in parallel to the wider development community, so some of the tooling around development environments and deployment can be missing.

  • Version control; make sure to use well-maintained repositories to track changes and establish a single source of truth for each pipeline
  • IDEs; modern IDEs are a dream to code in compared to basic text editors. Use one, explore its features and extensions, and you'll soon be writing higher-quality code faster
  • Virtual environments; similar to version control, having a single source of truth for a pipeline's environment improves reproducibility while also making your code more portable

There is also the separate topic of how we plan and execute our development, using tools like Jira, and how we document our processes. It's important none of these become ritualistic to the point that they offer no real benefit; failing to understand the philosophy behind these tools leads to uneven adoption and increased burden.

Speak the local language

Each language is designed around one or more central philosophies, and it helps to understand what they are and how they might shape your code.

Python is well suited to Object-Oriented Programming and works best when you exploit this. Learn the basic data structures (lists, tuples, dicts, etc.) and how to work with them, especially iteration. Understanding classes will help you work with common libraries like pandas, while also letting you quickly modify their behaviour through inheritance. Some libraries explicitly require classes to make use of their functionality; pydantic, for example, is built around data models defined as classes.
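A minimal sketch of both ideas, using only the standard library (the `Tally` class and the referral codes are invented for illustration): iterating over a list of tuples with unpacking, then extending a built-in class through inheritance rather than rewriting it:

```python
from collections import Counter

# list of (code, count) tuples - unpacked directly in the loop
records = [("A01", 3), ("A01", 5), ("B02", 2)]

class Tally(Counter):
    """Counter subclass adding one domain-specific convenience method."""
    def top_code(self) -> str:
        # most_common(1) returns [(code, count)]; take the code
        return self.most_common(1)[0][0]

t = Tally()
for code, count in records:   # tuple unpacking while iterating
    t[code] += count
```

The same pattern - subclass, add or override a method - is how you'd tweak the behaviour of a pandas DataFrame or a pydantic model.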

SQL is relatively unusual in being a declarative language: you tell it what you want, not how to compute it. Engines are optimised around data indexing to speed up common operations, and that advantage is lost when you force row-by-row transformations. Keep your code straightforward, make use of intermediate structures, and let the query planner handle the rest.
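For example (run via Python's stdlib sqlite3; the table and thresholds are invented): a CTE names an intermediate structure declaratively, and the planner decides how to execute it - no explicit loop over rows anywhere:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE referrals (patient_id INTEGER, specialty TEXT, wait_days INTEGER);
    INSERT INTO referrals VALUES
        (1, 'cardiology', 30), (2, 'cardiology', 90), (3, 'dermatology', 14);
""")

# WITH defines the intermediate structure; the final SELECT says what we want from it
long_waits = conn.execute("""
    WITH specialty_waits AS (
        SELECT specialty, AVG(wait_days) AS avg_wait
        FROM referrals
        GROUP BY specialty
    )
    SELECT specialty FROM specialty_waits WHERE avg_wait > 21
""").fetchall()
# → [('cardiology',)]
```

The procedural equivalent (cursor over every row, running totals in application code) is longer, slower, and blinds the planner to any indexes.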

Conclusions

These are just some surface-level lessons from working in data analytics after a background in development. I'll be writing up more specific advice in the next week or two and hope it'll be useful to folks - in the meantime, I'll always plug the excellent RAP Community of Practice to see how NHS colleagues are tackling these challenges in a large organisation!

#data #analytics #datascience #python #sql

Sam Hollings

Head of Coding and Data Standards (and RAP guy) at NHS England

Thanks for sharing your insights Luke, all great points! I especially like the last one, of using the right tool for the right job. Many of us have a favourite language, but it would be good if we dabble in many and lean on the expertise (and review!) of our colleagues to ensure our code is OK! This will hopefully mean we can always use whatever language is strongest for each piece of work.
