Important Programming Principles in Data Science
Darko Medin
Data Scientist and a Biostatistician. Developer of ML/AI models. Researcher in the fields of Biology and Clinical Research. Helping companies with Digital products, Artificial intelligence, Machine Learning.
In Data Science, programming languages play a key role. In fact, most Data Science job ads mention at least one of them, or a combination of several. Even the no-code Data Science platforms had to be coded first in order to work :). In this article, I describe the principles that need to be applied to work efficiently in Data Science. In my opinion, they are close to imperative for both present and future Data Scientists and Statistical Programmers.
The first and very important principle is being able to document code efficiently. In my opinion, every good Data Scientist needs a reliable way of communicating their code to other Data Scientists. The # comment lines in the code are not irrelevant: they should give the relevant background behind the code and say what it actually does. Large scripts become hard to navigate if Data Scientists don't leave notes in them, so this is probably one of the most important principles for clean and understandable code. It's not always about the code itself.
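As a minimal sketch of what such notes can look like in Python: the function name `standardize` and the example values are my own assumptions, chosen only for illustration. The docstring carries the background and the # comments explain the individual steps.

```python
from statistics import mean, stdev


def standardize(values):
    """Return z-scores for a list of numeric values.

    Background: many models are sensitive to the scale of their inputs,
    so each value is centered on the sample mean and divided by the
    sample standard deviation.
    """
    # Sample statistics (stdev uses n - 1 in the denominator).
    center = mean(values)
    spread = stdev(values)

    # Guard against constant variables, which would divide by zero.
    if spread == 0:
        raise ValueError("cannot standardize a constant variable")

    return [(v - center) / spread for v in values]


# Hypothetical input, invented for this illustration.
z_scores = standardize([2.0, 4.0, 6.0])
```

A colleague reading this later can tell why the function exists and why the guard is there without reverse-engineering the arithmetic.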
In Data Science, being able to scale is everything. Writing blocks of code that can be easily reused, or applied over many variables at the same time, is the key to advanced Data Science. Languages like R and Python are well suited to functions, loops and object-oriented programming (OOP), which makes scaling the code easy to achieve. Encapsulation in OOP, where data and methods are bound into a single object, is especially effective here. The SAS programming language has macros that enable similar features. It's interesting that one of the most used Data Science tools, Microsoft Excel, also offers coding and macro options, which is very important when working with a large number of datasets or spreadsheets.
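The idea of reusing one block of code over many variables, combined with Encapsulation, could be sketched in Python roughly like this. The names `summarize` and `Dataset`, and the example data, are hypothetical and exist only for this illustration.

```python
from statistics import mean, median


def summarize(values):
    """Return a small dictionary of summary statistics for one variable."""
    return {"n": len(values), "mean": mean(values), "median": median(values)}


class Dataset:
    """Binds the data and the methods that act on it into one object."""

    def __init__(self, columns):
        # columns: dict mapping a variable name to a list of numbers
        self._columns = columns

    def summarize_all(self):
        # The same function is reused for every variable in one pass.
        return {name: summarize(vals) for name, vals in self._columns.items()}


# Hypothetical example data, invented for this illustration.
data = Dataset({"age": [31, 45, 52], "weight": [70.5, 82.0, 64.5]})
summary = data.summarize_all()
```

Adding a hundred more variables changes only the input dictionary, not the code.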
OOP has another fantastic advantage called Abstraction, and in my opinion this principle is essential, especially when building AI/ML products. It is very important to be able to define which parts of the algorithm can change while the rest of the algorithm stays a safe, unchanging environment. This is achieved with object-oriented programming and by storing functions and arguments in the right way, for example in R and Python.
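One possible way to express this in Python is an abstract base class from the standard `abc` module. This is only a sketch under my own assumptions: the class names `Model`, `ThresholdModel` and `LinearModel` are hypothetical. The `predict` method is the stable part, while `predict_one` is the part each model is allowed to change.

```python
from abc import ABC, abstractmethod


class Model(ABC):
    """Fixed workflow; only the prediction rule is allowed to change."""

    @abstractmethod
    def predict_one(self, x):
        """The changeable part: each subclass supplies its own rule."""

    def predict(self, xs):
        # The stable part, shared by every model.
        return [self.predict_one(x) for x in xs]


class ThresholdModel(Model):
    def __init__(self, threshold):
        self.threshold = threshold

    def predict_one(self, x):
        return 1 if x >= self.threshold else 0


class LinearModel(Model):
    def __init__(self, slope, intercept):
        self.slope = slope
        self.intercept = intercept

    def predict_one(self, x):
        return self.slope * x + self.intercept


# Both models are used through the same, unchanged interface.
models = [ThresholdModel(0.5), LinearModel(2.0, 1.0)]
predictions = [m.predict([0.0, 1.0]) for m in models]
```

Swapping in a new model then means writing one small subclass, while the code that calls `predict` stays untouched.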
Going beyond what's typically referred to as object-oriented programming, data structures also let us store a lot of information about models in easily callable objects. Using Python or R makes it really easy to scale up just by storing the data, arguments, functions, models/algorithms or whatever else is needed as very simple and easily callable objects. SAS, SQL and many other programming languages offer this too.
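A rough Python sketch of storing functions and their arguments as simple callable objects might look like this. The `Experiment` dataclass, the `trimmed_mean` helper and the data are hypothetical names and values I introduce purely for illustration.

```python
from dataclasses import dataclass, field
from statistics import mean, median
from typing import Callable


@dataclass
class Experiment:
    """One easily callable object holding a function and its arguments."""
    name: str
    func: Callable
    kwargs: dict = field(default_factory=dict)

    def run(self, data):
        return self.func(data, **self.kwargs)


def trimmed_mean(values, trim=1):
    """Mean after dropping the `trim` smallest and largest values."""
    ordered = sorted(values)
    return mean(ordered[trim:len(ordered) - trim])


# The whole analysis plan is stored as plain data.
registry = [
    Experiment("mean", mean),
    Experiment("median", median),
    Experiment("trimmed", trimmed_mean, {"trim": 1}),
]

# Hypothetical data with one outlier, invented for this illustration.
data = [1, 2, 3, 4, 100]
results = {e.name: e.run(data) for e in registry}
```

Scaling up then means appending another entry to `registry` rather than writing new control flow.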
Another important topic, not as present in the overall discussion as it should be, is model deployment and re-training. These depend not only on the model's accuracy and explainability, but also on the effectiveness of the model itself. Models should be fast enough to be implemented in most software solutions, so it's good to benchmark important blocks of code.
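Benchmarking a block of code can be sketched with Python's standard `timeit` module. The two functions being compared and the `benchmark` helper are hypothetical examples of mine. Actual timings depend on the machine, so none are shown here.

```python
import timeit


def squares_loop(n):
    """Build a list of squares with an explicit loop."""
    result = []
    for i in range(n):
        result.append(i * i)
    return result


def squares_comprehension(n):
    """Build the same list with a list comprehension."""
    return [i * i for i in range(n)]


def benchmark(func, n=10_000, repeats=5, number=20):
    """Best total time in seconds for `number` calls, over `repeats` rounds."""
    timer = timeit.Timer(lambda: func(n))
    return min(timer.repeat(repeat=repeats, number=number))


timings = {f.__name__: benchmark(f) for f in (squares_loop, squares_comprehension)}
for name, seconds in timings.items():
    print(f"{name}: {seconds:.4f} s")
```

Taking the minimum over several rounds is a common choice because it is the measurement least disturbed by other processes on the machine.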
Simplify the code: this is one of the most important principles in creating optimized scripts. The first version of the code should be treated as a candidate for further rounds of simplification. Most often, several rounds of simplification bring the code to an optimal level of complexity. This matters not just for the sake of simplicity: simpler code also tends to perform better, provided it keeps all the functional and data capabilities of the more complex version.
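To illustrate one round of simplification in Python, here is a hypothetical first version of a function next to a simplified version with the same behaviour. The function names and the example rows are assumptions made for this sketch.

```python
from collections import Counter


def count_missing_v1(rows):
    """First version: nested loops and a manually maintained counter."""
    counts = {}
    for row in rows:
        for key in row.keys():
            if row[key] is None:
                if key in counts:
                    counts[key] = counts[key] + 1
                else:
                    counts[key] = 1
    return counts


def count_missing_v2(rows):
    """Simplified version: Counter does the bookkeeping."""
    return dict(Counter(k for row in rows for k, v in row.items() if v is None))


# Hypothetical records with missing values, invented for this illustration.
rows = [
    {"age": 31, "weight": None},
    {"age": None, "weight": None},
    {"age": 40, "weight": 70.5},
]

# Check that the simplification kept the functionality intact.
same_result = count_missing_v1(rows) == count_missing_v2(rows)
```

Keeping the first version around long enough to compare results is a cheap way to confirm that nothing was lost while simplifying.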