Why tooling matters a lot in Data Science and how it fights the Replication Crisis (Part 3)

In the third part about the importance of good tooling in Data Science and the programming languages we use in this field, we'll take a look at the compiler/parser, testing, and benchmarking facilities of the Julia Programming Language.

As you might remember from the prior article, we can use the Julia REPL not only to execute code and look up documentation, but also to manage packages in the context of an environment and replicate that environment with ease. Now we actually start running some code and see what Julia's JIT compiler (sometimes called a "Just-In-Time-Ahead-of-Time" or even "Just-Barely-Ahead-of-Time" compiler)* can do for us when we use it directly at the REPL.


*This would warrant a longer discussion about JIT vs. AOT in Julia[1]. The bottom line is that Julia takes advantage of just-in-time compilation features, such as being fast while staying fully dynamic, allowing for generic code that is then optimized for the actual on-machine implementation by going through layers of abstraction down to LLVM (the compiler infrastructure framework). But it also uses elements of an ahead-of-time approach, e.g. when something identical was already precompiled in a former run and its unique signature is found, say, in the REPL cache, so unnecessary work is avoided.
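A quick way to see this at work is to time the same call twice in a fresh REPL session: the first call includes compilation, the second one reuses the already compiled method. (The function below is just a made-up toy, and the exact timings will of course differ on your machine.)

f(x) = sum(x .^ 2)      # toy function, defined only to have something to compile

@time f(rand(1_000))    # first call: includes JIT compilation time
@time f(rand(1_000))    # second call: the compiled method is reused and runs much faster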


All of this happens automatically, of course. However, you can use this feature yourself with the PackageCompiler package, which lets you do some compilation work upfront and store the results, allowing for a shorter startup time[2] of your own code as well.
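As a hedged sketch (the package names and file paths here are placeholders, not taken from this article), building such a custom system image looks roughly like this:

using PackageCompiler

# Bake frequently used packages into a custom system image so they
# no longer need to be compiled at every startup.
create_sysimage(["DataFrames", "CSV"];
                sysimage_path = "my_sysimage.so",
                precompile_execution_file = "warmup.jl")  # a script exercising typical calls

You would then start Julia with "julia --sysimage my_sysimage.so", and the listed packages should load almost instantly.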

To err is human, therefore errors should be human (readable)

First off, some examples that do not relate to the compiler in the strict sense, but to the parser: since Julia 1.10[3], the parser points out visually where something went wrong in the code instead of giving just the position and an error message.

Consider how the parser points you to the exact place where it "Expected" something to be different from how you wrote it.
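As a minimal sketch (the exact rendering can differ slightly between Julia versions), typing an expression with an unclosed bracket into the REPL yields an error that marks the offending position and states what the parser expected:

julia> [1, 2
# The 1.10 parser points right after the `2` and reports something along
# the lines of "Expected `]`", instead of only printing a line/column
# number and a generic error message.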

This idea of improved readability goes down the tool chain as far as improved stack-trace rendering[4]. Stack traces used to be rather long, spanning several terminal screens, especially when nested parametric types were involved, and they were cluttered with unused variables, internally generated methods, and a way of expressing keyword arguments that differed from what you would see in your own code.

In short: stack traces, which are often huge haystacks with the tiny needle of a single error hidden in them, are now much shorter and easier to rifle through.

Time is precious, and you should know how much you are using

At this point it is fair to assume that a lot of data scientists and statisticians do not consider themselves programmers or developers, but rather people who happen to use certain tools that allow them to write domain-specific programs in domain-affine languages like MATLAB, Mathematica, R, and many more industry-specific languages, say for simulating stress in materials, fluid dynamics, and so on.

While many of these do in fact offer testing and benchmarking facilities, I dare to speculate that they are rarely used. Too pessimistic? Try this:

Next time you talk to a statistician, ask him about his stance towards TDD in general and how proud he was about the last unit test he wrote in particular. Given he can remember the occasion...

My guess is that it amounts to putting in some "time" and "print" statements whenever something does not work as expected. Not ideal.

This might stem from the erroneous notion that writing tests is something that only developers do when they want to get some code production-ready, or that it has to be really complicated. Neither of these is true.

Actually:

  • Writing simple unit tests is usually simple
  • Testing should be done constantly and
  • will let you sleep better.

Writing and frequently running a test suite will not only ensure that your code runs correctly, and thus that your results are more likely to be reproducible, but it will also make your tummy feel fuzzy and warm. And this will make you a better developer (and as a data scientist or statistician you most likely ARE one, like it or not).

Trust me on this one: Reinforcement learning does work, even with the original neural network that is your brain. Seeing all tests turn green is just great!

Now, the thing with modern languages like Go and Julia is that they come with testing packages (in the case of Go right out of the box) that are not only very easy to use but also give you an overview of the time consumption every time you run them. This allows you to see if something is going wrong time-wise, too. Basic unit tests follow a simple syntax like this...

@testset "NAME OF YOUR TEST SET" begin
    @test your_function(TEST_ARGUMENTS) == [EXPECTED_OUTPUT]
end        

... just using the Julia package "Test". I think you see the point: it's not really hard in this case, and so it is for lots of cases.
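To make this concrete, here is a small, self-contained example (the function and the tests are made up for illustration, not taken from this article's original code):

using Test

# Toy function under test: rescale a vector linearly to the range [0, 1].
rescale(x) = (x .- minimum(x)) ./ (maximum(x) - minimum(x))

@testset "rescale" begin
    @test rescale([1.0, 2.0, 3.0]) == [0.0, 0.5, 1.0]
    @test minimum(rescale(randn(100))) == 0.0
    @test maximum(rescale(randn(100))) == 1.0
end

Running this prints a short summary with the number of passed tests and the time the test set took.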

Now, if you need more insights, there is a lot you can do with proper benchmarking. That means, at the very least, running your code several times to compensate for fluctuations in the underlying system. Obviously, you would also vary the inputs to check performance, e.g. for different batch sizes, measure performance in relation to time complexity (Big O notation, anyone?), etc. But this is already advanced: it is good to have when you need it, but it is not always needed.

But as with all suggestions in this article series, even doing a little of it in the simplest way will pay major dividends. Like just this:

julia> using BenchmarkTools
julia> include("randsackHiGHS.jl")        

The benchmark itself is done by using the Julia package "BenchmarkTools" and simply prepending a function call with the "@benchmark" macro.

In this example, we call a function that heats up the CPU a bit by solving the well-known knapsack optimization problem (my code is on GitHub: [5]) a couple of times for 1,000 randomly chosen Float64 numbers, using the HiGHS[6] solver via a wrapper package[7].
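As a hedged sketch of the usage pattern (benchmarking a simple stand-in workload here rather than the actual knapsack code from the repository):

using BenchmarkTools

# @benchmark runs the expression many times and reports timing statistics
# (minimum, median, mean, maximum), the number of allocations, and a histogram.
@benchmark sort(rand(1_000))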

Alternatively, the "@btime" macro provides less verbose output, but still enough to see how your code is doing performance-wise.
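With the same stand-in workload:

using BenchmarkTools

# @btime prints a single line with the minimum run time and the allocations
# of the call, then returns the result of the expression.
@btime sort(rand(1_000));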

So, that is enough for now, I think.

In this exploration, we covered a lot of ground, but there is one more piece: The often overlooked topic of Documentation, which we will cover in the next installment of this article series.

Until then!

[1] https://www.reddit.com/r/Julia/comments/y73l9b/why_does_julia_use_jit/

[2] https://julialang.github.io/PackageCompiler.jl/dev/

[3] https://julialang.org/blog/2023/12/julia-1.10-highlights/#new_parser_written_in_julia

[4] https://julialang.org/blog/2023/12/julia-1.10-highlights/#improvements_in_stacktrace_rendering

[5] https://github.com/BenitoEck/ISCC

[6] https://highs.dev/

[7] https://github.com/jump-dev/HiGHS.jl


Benjamin Eckenfels

Head of Operations at algorithmica technologies, Germany | Senior Implementation Expert in Data Driven Industrial Projects | Machine Learning & AI in the Fields of Geology, Oil & Gas, Chemical Industry
