Data Analytics: Lessons from Software Development
In data, we often view languages like Python and SQL as simply a means to an end. In reality, they are powerful tools, and learning to use them effectively can help us deliver higher-quality data products faster.
Be careful, though: failing to appreciate their potential complexity can instead create unnecessary burden and expose businesses to additional risk. Here are some lessons I brought over from software development when redeveloping NHS analytical pipelines.
Clarity
The purpose of your code should be clear to anybody reading it, including you!
Stop reinventing the wheel
Teams often end up working in a siloed environment and it's easy to think you're the first people to ever encounter a specific problem.
Work to break down communication barriers between teams and business areas to standardise approaches to common tasks. Does the business favour pure SQL analytics or a mixed-language SQL/Python approach? Do you have a hierarchy of transformations and a preferred order? Just how do you calculate the median in SQL?!?!
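As one illustration of why a shared, agreed answer helps, here is a minimal sketch of a median in SQL, run through Python's built-in `sqlite3` module. SQLite is used only because it needs no setup; the table `waits`, the column `wait_days` and the helper `sql_median` are hypothetical names, and other engines may offer a built-in function (for example a percentile function) that your team would standardise on instead.

```python
import sqlite3

# Assumption: a single-column table of numeric values. The query sorts the
# values and averages the middle row (odd count) or two middle rows (even count).
MEDIAN_SQL = """
SELECT AVG(wait_days)
FROM (
    SELECT wait_days
    FROM waits
    ORDER BY wait_days
    -- keep 1 row when the count is odd, 2 rows when it is even
    LIMIT 2 - (SELECT COUNT(*) FROM waits) % 2
    OFFSET (SELECT (COUNT(*) - 1) / 2 FROM waits)
)
"""


def sql_median(values):
    """Load values into an in-memory SQLite table and return their median."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE waits (wait_days REAL NOT NULL)")
        conn.executemany(
            "INSERT INTO waits (wait_days) VALUES (?)",
            [(v,) for v in values],
        )
        (median,) = conn.execute(MEDIAN_SQL).fetchone()
        return median
    finally:
        conn.close()
```

Whichever form a team picks, writing it down once means the next analyst does not have to rediscover it.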
Tech debt
Every coding choice we make brings with it the downstream burden of tech debt. Perhaps our implementation of a yearly analysis is fragile and will need updating when the pipeline is next executed, or the library we used for data profiling has been abandoned and stops interfacing well with our data structures.
Tech debt isn't necessarily a problem, but it does need to be documented and managed so that it is correctly mitigated. Sometimes we have to add a bespoke section to our pipeline in order to meet a tight deadline. That's fine - just make sure it is recorded and addressed down the line.
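To make the fragile yearly analysis concrete, here is a small sketch of deriving the reporting year from the run date instead of hard-coding it. It assumes a financial year running from 1 April to 31 March and that the pipeline reports on the year containing the run date; the function names are hypothetical, not taken from any real pipeline.

```python
from datetime import date

# Fragile version (tech debt): a literal such as date(2023, 4, 1) buried in a
# filter, which someone must remember to edit before the next yearly run.


def financial_year_start(on: date) -> int:
    """Return the calendar year in which the financial year containing `on` began."""
    return on.year if on.month >= 4 else on.year - 1


def financial_year_bounds(start_year: int) -> tuple[date, date]:
    """Return the first and last day of the financial year starting in `start_year`."""
    return date(start_year, 4, 1), date(start_year + 1, 3, 31)


def in_reporting_year(event: date, run_date: date) -> bool:
    """True if `event` falls in the financial year that contains `run_date`."""
    start, end = financial_year_bounds(financial_year_start(run_date))
    return start <= event <= end
```

If a deadline forces the hard-coded version anyway, a comment pointing at the ticket that tracks its removal keeps the debt visible.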
DevOps
Data analytics has developed somewhat in parallel to the wider development community, so some of the tooling around development environments and deployment can be missing.
* Version control: use well-maintained repositories to track changes and establish a single source of truth for each pipeline
* IDEs: modern IDEs are a dream to code in compared to basic text editors. Use them, explore their features and extensions, and you'll soon be writing higher-quality code faster
* Virtual environments: as with version control, having a single source of truth for a pipeline's dependencies improves reproducibility while also making your code more portable
There is also the separate topic of how we plan and execute our development, using tools like Jira, and process documentation. It's important none of these become ritualistic to the point they do not offer any real benefit. Failing to understand the philosophy behind these tools leads to uneven adoption and increased burden.
Speak the local language
Each language is designed around one or more central philosophies, and it helps to understand what these are and how they might impact your code.
Python is well-designed for Object-Oriented Programming and works best when you exploit this. Learn about the basic data structures (lists, tuples, dicts and so on) and how to work with them, especially iteration. Understanding classes will help you work with common libraries like pandas, while also allowing you to quickly modify their behaviour with inheritance. Other libraries explicitly require classes to make use of their functionality; for example, pydantic is based around data models defined as classes.
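A minimal sketch of these ideas using only the standard library: iterating over a list of tuples, a small class-based data model that validates itself, and inheritance to add behaviour to an existing type. The names `Measurement` and `SiteTotals` are hypothetical, and the dataclass is a stand-in for the class-based style, not pydantic's or pandas' actual API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Measurement:
    """A tiny data model: fields are declared once and checked on creation."""
    site: str
    value: float

    def __post_init__(self):
        if not self.site:
            raise ValueError("site must not be empty")
        if self.value < 0:
            raise ValueError("value must not be negative")


class SiteTotals(dict):
    """Inherits all dict behaviour and adds one method for running totals."""

    def add(self, measurement: Measurement) -> None:
        self[measurement.site] = self.get(measurement.site, 0.0) + measurement.value


# Raw rows as a list of tuples, unpacked while iterating.
rows = [("A", 1.5), ("B", 2.0), ("A", 0.5)]
measurements = [Measurement(site, value) for site, value in rows]

totals = SiteTotals()
for measurement in measurements:
    totals.add(measurement)
```

Because `SiteTotals` is still a dict, anything that accepts a dict accepts it too, which is the same reason subclassing a library class can be a quick way to adjust its behaviour.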
SQL is unusual among commonly used languages in being declarative, i.e. you tell it what you want but not how to get it. Database engines are optimised around data indexing to speed up common operations, and this benefit is lost when you force row-by-row transformations. Keep your code straightforward, making use of intermediate structures, and let the query planner handle the rest.
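The contrast can be sketched with Python's built-in `sqlite3` module. Both functions below return the same counts, but the first pulls every row into Python and loops, while the second states the result it wants and leaves the execution strategy to the database. The table `admissions`, its columns and both function names are hypothetical, and a common table expression stands in for the "intermediate structure".

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE admissions (site TEXT, los_days INTEGER)")
conn.execute("CREATE INDEX idx_admissions_los ON admissions (los_days)")
conn.executemany(
    "INSERT INTO admissions (site, los_days) VALUES (?, ?)",
    [("A", 2), ("A", 10), ("B", 8), ("B", 9), ("C", 1)],
)


def long_stay_counts_row_by_row(conn, threshold):
    """Imperative style: fetch every row and decide what to do with each one."""
    counts = {}
    for site, los_days in conn.execute("SELECT site, los_days FROM admissions"):
        if los_days > threshold:
            counts[site] = counts.get(site, 0) + 1
    return counts


def long_stay_counts_declarative(conn, threshold):
    """Declarative style: describe the result; the planner chooses how to get it."""
    query = """
    WITH long_stays AS (
        SELECT site
        FROM admissions
        WHERE los_days > ?
    )
    SELECT site, COUNT(*)
    FROM long_stays
    GROUP BY site
    """
    return dict(conn.execute(query, (threshold,)).fetchall())
```

On five rows the difference is invisible; the point is that only the second form gives the engine the chance to use the index and avoid shipping every row to the client.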
Conclusions
These are just some surface lessons from working in data analytics after a background in development. I'll be writing up more specific advice in the next week or two and hope it'll be useful to folks - in the meantime I'll always plug the excellent RAP Community of Practice to see how NHS colleagues are tackling these challenges in a large organisation!
#data #analytics #datascience #python #sql
Head of Coding and Data Standards (and RAP guy) at NHS England
2 weeks ago: Thanks for sharing your insights Luke, all great points! I especially like the last one, of using the right tool for the right job. Many of us have a favourite language, but it would be good if we dabble in many and lean on the expertise (and review!) of our colleagues to ensure our code is OK! This will hopefully mean we can always use whatever language is strongest for each piece of work.