Why tooling matters a lot in Data Science and how it fights the Replication Crisis (Part 3)

In the third part about the importance of good tooling in Data Science and the programming languages we use in this field, we'll take a look at the compiler/parser, testing, and benchmarking facilities of the Julia Programming Language.

As you might remember from the prior article, we can use the Julia REPL not only to execute code and look up documentation, but also to manage packages in the context of an environment and replicate that environment with ease. Now we actually start running some code and see what Julia's JIT compiler (sometimes called a "Just-In-Time-Ahead-of-Time" or even "Just-Barely-Ahead-of-Time" compiler)* can do for us when we use it directly at the REPL.


*This would warrant a longer discussion about JIT vs. AOT in Julia[1]. The bottom line is that Julia takes advantage of just-in-time compilation features, such as being fast while staying fully dynamic, allowing for generic code that is then optimized for the actual on-machine implementation by going through layers of abstraction down to LLVM (the compiler infrastructure framework). But it also uses elements of an ahead-of-time approach, e.g. when something identical was already precompiled in a former run and its unique signature is found, say, in the REPL cache, so unnecessary work is avoided.
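A quick way to see this at work is to time the same call twice in a fresh REPL session: the first call includes compilation, the second one reuses the already compiled method. (The function below is just a made-up toy, and the exact timings will of course differ on your machine.)

f(x) = sum(x .^ 2)      # toy function, defined only to have something to compile

@time f(rand(1_000))    # first call: includes JIT compilation time
@time f(rand(1_000))    # second call: the compiled method is reused and runs much faster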


All of this happens automatically, of course. However, you can use this feature yourself with the PackageCompiler package, which lets you do some compilation work upfront and store the results, allowing for a shorter startup time[2] of your own code as well.
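As a hedged sketch (the package names and file paths here are placeholders, not taken from this article), building such a custom system image looks roughly like this:

using PackageCompiler

# Bake frequently used packages into a custom system image so they
# no longer need to be compiled at every startup.
create_sysimage(["DataFrames", "CSV"];
                sysimage_path = "my_sysimage.so",
                precompile_execution_file = "warmup.jl")  # a script exercising typical calls

You would then start Julia with "julia --sysimage my_sysimage.so", and the listed packages should load almost instantly.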

To err is human, therefore errors should be human (readable)

First off, some examples that do not relate to the compiler in the strict sense, but to the parser: since Julia 1.10[3], the parser points out visually where something went wrong in the code instead of giving just the position and an error message.

Consider how the parser points you to the exact place where it "Expected" something to be different from how you wrote it.
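As a minimal sketch (the exact rendering can differ slightly between Julia versions), typing an expression with an unclosed bracket into the REPL yields an error that marks the offending position and states what the parser expected:

julia> [1, 2
# The 1.10 parser points right after the `2` and reports something along
# the lines of "Expected `]`", instead of only printing a line/column
# number and a generic error message.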

This idea of improved readability goes down the tool chain as far as improved stack-trace rendering[4]. Stack traces used to be rather long, spanning several terminal screens, especially when nested parametric types were involved, and they were cluttered with unused variables, internally generated methods, and a way of expressing keyword arguments that differed from what you would see in your own code.

In short: stack traces, which are often huge haystacks with the tiny needle of a single error hidden in them, are now much shorter and easier to rifle through.

Time is precious, and you should know how much you are using

At this point it is fair to assume that a lot of data scientists and statisticians do not consider themselves programmers or developers, but rather people who happen to use certain tools that allow them to write domain-specific programs in domain-affine languages like MATLAB, Mathematica, R, and many more industry-specific languages, say for simulating stress in materials, fluid dynamics, and so on.

While many of these do in fact offer testing and benchmarking facilities, I dare to speculate that they are rarely used. Too pessimistic? Try this:

Next time you talk to a statistician, ask him about his stance towards TDD in general and how proud he was about the last unit test he wrote in particular. Given he can remember the occasion...

My guess is that it amounts to putting in some "time" and "print" statements whenever something does not work as expected. Not ideal.

This might stem from the erroneous notion that writing tests is something that only developers do when they want to get some code production-ready, or that it has to be really complicated. Neither of these is true.

Actually:

  • Writing simple unit tests is usually simple
  • Testing should be done constantly and
  • will let you sleep better.

Writing and frequently running a test suite will not only ensure that your code runs correctly, and thus that your results are more likely to be reproducible, but it will also make your tummy feel fuzzy and warm. And this will make you a better developer (and as a data scientist or statistician you most likely ARE one, like it or not).

Trust me on this one: Reinforcement learning does work, even with the original neural network that is your brain. Seeing all tests turn green is just great!

Now, the thing with modern languages like Go and Julia is that they come with testing packages (in the case of Go right out of the box) that are not only very easy to use but also give you an overview of the time consumption every time you run them. This allows you to see if something is going wrong time-wise, too. Basic unit tests follow a simple syntax like this...

@testset "NAME OF YOUR TEST SET" begin
    @test your_function(TEST_ARGUMENTS) == [EXPECTED_OUTPUT]
end        

... just using the Julia package "Test". I think you see the point: it's not really hard in this case, and so it is for lots of cases.
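To make this concrete, here is a small, self-contained example (the function and the tests are made up for illustration, not taken from this article's original code):

using Test

# Toy function under test: rescale a vector linearly to the range [0, 1].
rescale(x) = (x .- minimum(x)) ./ (maximum(x) - minimum(x))

@testset "rescale" begin
    @test rescale([1.0, 2.0, 3.0]) == [0.0, 0.5, 1.0]
    @test minimum(rescale(randn(100))) == 0.0
    @test maximum(rescale(randn(100))) == 1.0
end

Running this prints a short summary with the number of passed tests and the time the test set took.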

Now, if you need more insights, there is a lot you can do with proper benchmarking. That means, at the very least, running your code several times to compensate for fluctuations in the underlying system. Obviously, you would also vary the inputs to check performance, e.g. for different batch sizes, measure performance in relation to time complexity (Big O notation, anyone?), etc. But this is already advanced: it is good to have when you need it, but it is not always needed.

But as with all suggestions in this article series, even doing a little of it in the simplest way will pay major dividends. Like just this:

julia> using BenchmarkTools
julia> include("randsackHiGHS.jl")        

The benchmark itself is done by using the Julia package "BenchmarkTools" and simply prepending a function call with the "@benchmark" macro.

In this example, we call a function that heats up the CPU a bit by solving the well-known knapsack optimization problem (my code is on GitHub: [5]) a couple of times for 1,000 randomly chosen Float64 numbers, using the HiGHS[6] solver via a wrapper package[7].
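As a hedged sketch of the usage pattern (benchmarking a simple stand-in workload here rather than the actual knapsack code from the repository):

using BenchmarkTools

# @benchmark runs the expression many times and reports timing statistics
# (minimum, median, mean, maximum), the number of allocations, and a histogram.
@benchmark sort(rand(1_000))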

Alternatively, the "@btime" macro provides less verbose output, but still enough to see how your code is doing performance-wise.
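With the same stand-in workload:

using BenchmarkTools

# @btime prints a single line with the minimum run time and the allocations
# of the call, then returns the result of the expression.
@btime sort(rand(1_000));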

So, that is enough for now, I think.

In this exploration, we covered a lot of ground, but there is one more piece: The often overlooked topic of Documentation, which we will cover in the next installment of this article series.

Until then!

[1] https://www.reddit.com/r/Julia/comments/y73l9b/why_does_julia_use_jit/

[2] https://julialang.github.io/PackageCompiler.jl/dev/

[3] https://julialang.org/blog/2023/12/julia-1.10-highlights/#new_parser_written_in_julia

[4] https://julialang.org/blog/2023/12/julia-1.10-highlights/#improvements_in_stacktrace_rendering

[5] https://github.com/BenitoEck/ISCC

[6] https://highs.dev/

[7] https://github.com/jump-dev/HiGHS.jl


Benjamin Eckenfels

Head of Operations at algorithmica technologies, Germany | Senior Implementation Expert in Data Driven Industrial Projects | Machine Learning & AI in the Fields of Geology, Oil & Gas, Chemical Industry
