CSL, the new machine learning

So I have been working on and off with machine learning for the last 5 years or so.. I think I must have seen at least 20 courses and finished 5 of them ;-) The options felt overwhelming.. like a kid in a candy store is what I wrote to someone just a couple of minutes ago.

I also wrote some code using machine learning: a traffic sign classifier based on images, a converter from cone penetration tests to soil layers, a way to try and find important levee points on a 2D cross-section.. etc. Now I definitely do not consider myself to be a machine learning (ML from now on) expert.. I know how to use the models, I even know the mathematics behind most of the models (I had to, to earn my Machine Learning Nanodegree at Udacity), but if I look at Kaggle I feel like I have only scratched the surface, and with a busy job and a lot of hobbies I simply lack the time to really dive into all the possibilities.

However.. in recent months I have tended to favor the CSL model.. It feels good, it feels natural and surprisingly.. it is often up to the task without writing all the code and collecting all the data you need for an ML model. So let me tell you about CSL and why I often favor it over ML.

CSL

So what is CSL, you might ask.. a new kernel with advanced stochastic gradient descent? Maybe a new TensorFlow module? Sorry to disappoint you.. CSL stands for Common Sense Learning and I think I am the inventor.

Now that that is out of the way, let me tell you why I think CSL can outperform ML, at least in most of my cases, by showing you some examples where it did.

Characteristic points

I am a geotechnical engineer and I love to work on water safety problems, so I like to find ways to automate the input for -for example- levee stability calculations. Once upon a time, in a land far far away, we had (actually still have) DAM.. dike analysis method by Deltares. And this is the input they required for the cross-sections:

[Image: the characteristic points that DAM requires for each cross-section]

Mmmm.. imagine.. we have a levee of 1000 meters and we like to check this levee every 25m.. so we have 40 cross-sections and we need 16 points for each.. that would mean entering 640 points. Actually I wrote software with a friend of mine to 'click' these points, so we tried to make this task easy for people, but was it still annoying.. oh yes! Especially if you know that we tried to do this for 550km with a cross-section every 10 meters (yeah.. ambitious times)
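
To get a feeling for the scale, here is the same arithmetic as a few lines of Python. This is only a sketch: the function name is made up, it counts cross-sections the same way as above (length divided by spacing), and it assumes the 550km job would also have needed 16 points per cross-section.

```python
def points_to_click(length_m: int, spacing_m: int, points_per_section: int = 16) -> int:
    """Number of characteristic points to enter by hand for one stretch of levee."""
    # Same counting as in the text: 1000 m / 25 m = 40 cross-sections.
    sections = length_m // spacing_m
    return sections * points_per_section


print(points_to_click(1_000, 25))    # 640
print(points_to_click(550_000, 10))  # 880000
```

So the ambitious version would have been close to a million clicks.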

Then came my ML period.. and I thought.. ML can help! So I asked a lot of colleagues from Dutch water boards if they used this tool and could send me the output they had generated, and I got a lot of 'clicked' points. I spent days on collecting the data and getting it ready for easy input, and again days, maybe even weeks, to build and test models.. the results were awful.. Yes, some points were ok but most of them were rubbish.. even worse, the model seemed to generalize towards a specific kind of levee (which we call a secondary levee, for the smaller rivers) and not work at all for the primary levees (like those along the large rivers).. so I had to split up the data, create two models.. etc. etc. In the end the result was not usable and I was a little frustrated.

A couple of years later I was less hyped about ML and came back to the same problem, this time using CSL. The first thing that came to mind was.. why do I need all those points? Well, DAM needs them, but do I need DAM? Turned out, I didn't. So then I thought, what are the points that I really, really need? A lot fewer..

[Image: the same cross-section, with the points I really need marked in yellow and the nice-to-have ones underlined in blue]

I needed the yellow ones, and the blue underlined ones would be nice. Now CSL came to the rescue.. first of all the really OCSL (Obvious common sense learning): if you generate a cross-section then the first point (left to right) will be 'maaiveld buitenwaarts' (ground level on the outer side) and the last one 'maaiveld binnenwaarts' (ground level on the inner side).. oh wow.. already a 100% reliable method ;-)

Looking at the source data we also found that we had raster files (kind of a big matrix with x,y locations and whatever data you attach to those locations) with the land height, the bottom of the levee and the bottom of the ditches. CSL to the rescue.. simply mark each point with the raster it came from, because then it is really easy to find the source of the point and it will be easy to find the ditch points (the ones with 'sloot' in their name in the image). Imagine you have consecutive points marked 'ditch': then you can mark the first one as the left side ('insteek sloot dijkzijde', the ditch edge on the dike side) and the last one as 'insteek sloot polderzijde' (the ditch edge on the polder side). So a simple script using GIS data and CSL fixed those points.
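
The two rules above fit in a handful of lines. The sketch below is not my original script: the tuple layout, the source tag "ditch" and the function name are made up for illustration, and it assumes the points are already sorted from left to right with the ditch on the polder side of the levee.

```python
from typing import Dict, List, Tuple

# (x, z, source) where source says which raster the point came from.
Point = Tuple[float, float, str]


def label_points(points: List[Point]) -> Dict[str, Tuple[float, float]]:
    """Label the two outer points and the edges of the first ditch."""
    if len(points) < 2:
        raise ValueError("need at least two points")
    labels = {
        "maaiveld buitenwaarts": points[0][:2],   # first point, left to right
        "maaiveld binnenwaarts": points[-1][:2],  # last point
    }
    ditch = [i for i, p in enumerate(points) if p[2] == "ditch"]
    if ditch:
        # Walk to the end of the first run of consecutive ditch points.
        end = ditch[0]
        for i in ditch[1:]:
            if i != end + 1:
                break
            end = i
        labels["insteek sloot dijkzijde"] = points[ditch[0]][:2]
        labels["insteek sloot polderzijde"] = points[end][:2]
    return labels


profile = [
    (0.0, 1.0, "surface"), (10.0, 5.0, "levee"), (20.0, 1.0, "surface"),
    (25.0, 0.0, "ditch"), (27.0, -0.5, "ditch"), (29.0, 0.0, "ditch"),
    (35.0, 1.0, "surface"),
]
for name, xz in label_points(profile).items():
    print(name, xz)
```

No training data and no model, just a tag on every point and two rules.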

Now on to a little ACSL (Advanced common sense learning.. and this is the last time I make 'funny' abbreviations ;-)): what about the really, really important yellow marked points? For ACSL you best think out of the box.. this was in the time that I thought that I could make money using automated bitcoin trading (but that's another bedtime story).. and one of the techniques to find sudden changes in prices is the use of Bollinger Bands. Here's a picture:

[Image: example chart of Bollinger Bands around a price line]

So my ACSL kicked in and I thought.. what if the surface of the levee is the stock price / bitcoin price and I look at the intersection of the upper band with this line.. turned out this was quite a nice way to find the points I needed with good accuracy.. way better than ML!

Of course I had to tweak the algorithm a little, but the results were excellent. I can now automagically generate all the characteristic points I need. This makes it possible to generate a complete 2D model of the levees that I am working with.. nice!
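
For the curious, the Bollinger Band idea can be sketched like this. It is the plain textbook version (a trailing moving average plus k standard deviations), not my tweaked algorithm; the window size, the value of k, the function names and the levee profile are all made up for the example.

```python
import numpy as np


def upper_band(z, window=10, k=2.0):
    """Upper Bollinger Band: trailing moving average plus k trailing standard deviations.

    The first window - 1 entries have no full window yet and are set to +inf,
    so nothing can cross the band there.
    """
    z = np.asarray(z, dtype=float)
    upper = np.full(z.shape, np.inf)
    for i in range(window - 1, len(z)):
        chunk = z[i - window + 1 : i + 1]
        upper[i] = chunk.mean() + k * chunk.std()
    return upper


def crossings(z, window=10, k=2.0):
    """Indices where the surface goes from on or below the upper band to above it."""
    z = np.asarray(z, dtype=float)
    above = z > upper_band(z, window, k)
    return np.flatnonzero(above[1:] & ~above[:-1]) + 1


# Made-up profile: flat ground, slope up, crest, slope down, flat ground.
surface = np.concatenate([
    np.zeros(30),
    np.linspace(0.5, 5.0, 10),
    np.full(10, 5.0),
    np.linspace(4.5, 0.0, 10),
    np.zeros(30),
])
print(crossings(surface))  # the result includes index 30, where the slope starts
```

The surface breaks through the upper band where it suddenly starts to rise, which is exactly the kind of point you want to find at the toe of a levee. Because the window looks back from left to right, you would presumably run it in the other direction as well to catch the toe on the other side.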

CPT interpretation

Another example. A CPT is a cone penetration test, which we as geotechnical engineers use to try and understand the soil beneath our feet. Now that's all the detail I am gonna give you, or else I would be lecturing geomechanics, which is not the purpose of this article. Suffice to say that a CPT generates a lot of data and we want to translate that data to soil names like 'clay' or 'peat' or 'sand' etc.

In my ML phase I collected loads of data where CPTs were manually interpreted, so I could find a correlation between the raw values of the CPT and the soil names. I wrote some articles about it because I was really happy with the process, but it turned out that there were always some errors that made the algorithms unusable.

This was also due to a lack of ML knowledge, I am the first to admit it, and later Ritchie Vink wrote an excellent article about using advanced ML where it definitely did work. But still I found some things that I had to improve to be able to actually apply his algorithm.. and it was also behind an API and it always itches if I can't do it myself.

There are loads of well-known correlations for CPTs and I wrote code for some of them, but I also found code online for one well-known correlation. The problem with this correlation was that it simply did not find a very important layer (peat, which is really the nightmare of levee assessments in The Netherlands). Now the ML algorithm did -most of the time- find that layer but again.. that was behind an API and not code I could adjust.

CSL kicked in.. there is a very simple rule that is true most of the time.. a CPT has a special value called friction, and using a simple formula you can generate another value from it; if that value is above a certain threshold you are probably dealing with peat. So CSL told me.. just use the (non-ML) algorithm whose code I did have access to, pick up the result, but add my own rule that everything above that threshold will be marked as peat.. and.. it works! (for the geotechnical engineers among us.. I could never find the lower peat layers)
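
A minimal sketch of that override rule, assuming the 'other value' is the friction ratio (sleeve friction divided by cone resistance, as a percentage). The function names are made up, and the threshold in the example is a placeholder, not the value I actually use.

```python
from typing import List, Sequence


def friction_ratio(qc: float, fs: float) -> float:
    """Friction ratio in percent; qc and fs must be in the same unit."""
    return fs / qc * 100.0


def override_peat(
    base_labels: Sequence[str],
    qc: Sequence[float],
    fs: Sequence[float],
    threshold_pct: float,
) -> List[str]:
    """Keep the labels of the existing correlation, but force 'peat' above the threshold."""
    if not (len(base_labels) == len(qc) == len(fs)):
        raise ValueError("base_labels, qc and fs must have the same length")
    result = []
    for label, q, f in zip(base_labels, qc, fs):
        # Skip readings without a positive cone resistance instead of dividing by zero.
        is_peat = q > 0 and friction_ratio(q, f) > threshold_pct
        result.append("peat" if is_peat else label)
    return result


base = ["clay", "clay", "sand"]   # output of the existing (non-ML) correlation
qc = [0.5, 0.4, 10.0]             # cone resistance in MPa, made-up readings
fs = [0.01, 0.04, 0.05]           # sleeve friction in MPa, made-up readings
print(override_peat(base, qc, fs, threshold_pct=8.0))  # ['clay', 'peat', 'sand']
```

The existing correlation still does all the hard work; the extra rule only overrules it where the friction ratio is high.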

So is ML bad?

Oh dear, no.. that's definitely not the point of this article. ML is a great tool, it has so many interesting options and if you have the right problem it might be the best way to progress.. way better than CSL. But my point is that people might get carried away and try to find ML solutions for problems that can easily be solved with CSL. Just don't believe the next sales pitch that ML will solve all your problems.. It won't.. trust me, your data is most likely not even ready for normal statistics (been there, done that!). So if you feel hyped by ML, do yourself a favor and think CSL first. Think out of the box, be creative and if you still think ML is the way to go.. then it probably is.

Have fun!

Rob


Mark van der Krogt

Senior Researcher and Advisor in Geotechnics, Reliability and Risk at Deltares

3 years ago

Enlightening view Rob. Seems that you are describing the difference between Data Science and Machine Learning ;-)
