登录查看更多内容

Books I considered helpful

Philipp Leser

Founding Partner bei data cybernetics ssc GmbH

发布日期: 2023年12月21日

I don't like most of the "Books you need/should read" articles by "Data-Science influencers". And by "influencers", I don't mean people like Yann LeCun . I mean the people that tell me on LinkedIn how to "ace the next data-science interview". I think you get the picture.

Most often, these lists recommend the same books, like Deep Learning by Gengio and Goodfellow, the Probabilistic Machine Learning Series (Perspective, Intro and Advanced Topics), Deep Learning with Python by Chollet or classics like the Statistical Learning Series (Intro and Elements) by Hastie and colleagues, and Pattern Recognition by Bishop. All great books! For sure! And many of them are now freely available on the web!

But if you work with Machine Learning, you probably already know them.

The other set of book lists comes from the "you have to know the math" gang. They recommend very fundamental books on Linear Algebra, deep texts on probability theory, or even wilder stuff on tensors or graph theory that goes way beyond the basics. And obviously you have to do all the proofs before you are allowed to install python.

Don't get me wrong. It never hurts to know the math. I am at a severe disadvantage because I am not as mathematically savvy as many of my colleagues. And I work on improving there. But lots of the recommended stuff goes way beyond what you need to work with data or do science and engineering. Or add value to a business. It certainly does not hurt to have a deeper knowledge of linear algebra than the Gilbert Strang course. But these super heavy math book lists strike me a bit as virtue signaling.

Another "virtue signaling" cluster is the "data science is big data" crowd. The people that make strict distinctions between "data engineers", "software engineers" or "machine learning engineers". They pretend that a "data scientist" is somebody who does a job that was considered Business Intelligence back in the day. And does it on well curated data from some data lake. Mostly in a Jupyter Notebook. And that this data is provided by an army of open source tools and cloud services. They chain together whatever is available as an Apache project to train, serve and monitor ML models, thrive on harping about DevOps and deploy pipelines. And somehow you supposedly need to know all these tools. Books on ML-Ops get thrown around a lot here. Like this or that one.

IMHO, contrary to the "math fundamentalists", these books can do damage. They are like the NETFLIX developer blog and may often lead to over bloated and expensive solutions. There is a lot to be said about big data tools abused on small data problems (I personally like the big data is dead article by Jordan Tigani ), the "cloudNative" craze, or overreliance on frameworks. But that's for another rant.

So my book list is called "books that I considered helpful". And that's it. These books helped me back then solving practical problems. They might be too basic for you.

These books are not exactly on ML. But they introduce topics that could be usefull to people in software engineering and everybody working with data (like the so-called data scientist). So here we go:

1) Feedback Control for Computer Systems

This was a great intro for me back then. If you are an EE, Aerospace Engineer, mechatronics or robotics person: skip it. You are way beyond. It is actually just a book about PID controllers. If you have never heard of PID controllers, Wikipedia and this book are good starting points. PID controllers are only a tiny subset of the Feedback Control discipline. I try to learn as much as possible from the system control people. Optimal Control has solved a lot of hairy problems. Reinforcement Learning is a way to do Optimal Control.

Feedback control is a big topic, and I don't know as much about it as I like. The Schaum-series book on feedback control was another good and broader intro for me.

But I used PID controllers a couple of times with great success. And I have another idea for one of my clients that I will try out over the holidays. They find applications outside their original domain a lot. Like the load shredder developed by Uber. Or in the paper written by Anastasios Angelopoulos and his fellow researchers on dynamic conformal prediction for time series. If you get an intro to the subject, you start to see PID controllers everywhere.

2) SQL Tuning: Generating Optimal Execution Plans

I was pretty clueless about databases when I started my career. Even though I love Clickhouse, kdb+, InfluxDB and I am a firm believer that you can do a lot of good with a combination of well partioned parquet files and Polars or duckdb, the relational database and SQL is one battle proven way to store and access data and should be your go-to move in many situations. If you work with data, you should know some SQL. And other people (the DB admins) should not be scared of you using it. This one was a great intro for me and is agnostic to the database you are using. Database Internals is also a nice, but it is a more general introduction into databases.

Econometrics

Hayashi was my intro into Econometrics back in the days. Econometrics is trying to do a hard job.

I have been guilty of it myself. Abusing predictive models to answer questions about causality. In my experience, questions about causality appear quite often in business contexst. Often, these questions come directly from management! And as a "data person" you are asked to come up with an answer. There are basically two branches of causal analysis:

on experimental data
on observational data.

Experimental data is the holy grail of science. The gold standard. You have complete control over the variables.

But often you don't have this luxury. No double-blind study. Econometrics is good at this and has been doing it for ages. To be honest, today I would probably suggest the awesome site Causality for the Brave and True, since it also goes into debiased ML stuff and is an overall fun read. And there are modern books like this, this or this. But Hayashi is the book I started with. And it is still a good read.

Think DSP

领英推荐

25 Powerful Resources: What Are Some Popular Libraries…

Ze Learning Labb 12 个月前

Code Smarter, Not Harder: The Speed Benefits of LLMs…

AltaSigma 3 个月前

Essential AI Tools for Aspiring Data Scientists ????

HirePort AI 1 年前

I have witnessed consultants from big firms advocate complex and computationally expensive stuff to clients for problems that are basically solved problems in digital signal proccessing. DSP is a subset of Signal Processing. And like the Optimal Control field, it is a discipline where I learned a lot from, used it in practice, and want to learn more about it. I will try to track down the whole series by Kay in a library. To this day, I think people who work with systems that produce data should know about these things. It certainly helped me a lot.

Mastering System Identification

A bit like the above. Coming from the world of Econometrics, that book was a mind bending read. Probably just basic stuff for the EE. But not for me. A lot of things are the same. Aikaike Information Criterion and Best Linear Estimators. But it was important for me to learn and understand how things change in a hard-science/engineering context. And the whole concept of FRF was new to me (I never used it, but still interesting). There are probably better and more modern books about that. And the book uses MATLAB. Back then, I had a student license and access to the mathwork sysid toolbox. I wonder if sysidentpy is a good replacement.

Operations Research

Even if you have nothing to do with the classical areas where OR is prominent, like supply chain management, transportation or warehouse logistics...I can almost guarantee you that you run into OR like problems all the time in business and engineering. Operations Research is a combination of Linear Programming, Non-Linear Programming, Deterministic/Stochastic Dynamic Programming, Forecasting and Decision Theory (including some Game Theory). To me, that is important:

My partner, Carsten Blank had a case where he had to convince a client that the problem was actually a Linear Program that could be EXACTLY solved. In Excel! So he saved them from inventing wild heuristics to suboptimally solve a problem in more time. In my experience, people tend to discard the classical optimization techniques too early or don't even think about them. Often, the problem at hand can be solved that way, or even the approximate solution is still good or great. And fast.
This is often a hard sell. But the framework of Decision Theory forces you to think straight and ask yourself uncomfortable questions. People don't use it often, because it seems to be too theoretical. And under uncertainty, it only makes a difference if you make a lot of decisions. Same with Game Theory. I never used it in a strict sense for code, but it helps me to think about the world as soon as other people are involved. It also seems to have applications in engineering (even though I never ran into it in the real world) and there some other applications like the Shapely Values available via the SHAP python package for model diagnostics, or as an approach for PCA on big data sets by Google Deepmind (even though it is debated that Game Theory is neccessary here).

There might be better and deeper books on each subtopic. But to me, the book was influential, because it showed me that these things are part of a problem-solving toolbox that can, and often should, be used together.

Risk Assessment and Decision Analysis

Further enhances on the point of Decision Theory in OR and causal analysis in Econometrics.

Is is actually quite verbose, and if you are a Data Scientist or something of that nature, the explanations on statistics and basic probability theory are a bit lengthy. But this book by Norman Fenton and Martin Neil was an important book for me, even though I have never used a Bayesian Network in practice (but probably should have)!! But the book shaped my view on what a usefull decision support system, like an anomaly detection system, needs to provide to be useful.

Algorithmic Learning in a Random World

I think Scott Locklin introduced me to conformal prediction in the early 2000s. I bought the first edition of this book in 2008 or something. And wherever I need intervals around my predictions (which is often...remember: decision theory), I try to use it.

The field has rapidly advanced since then. And I think the book by Valeriy Manokhin, PhD, MBA, CQF is most likely a way more practical reference on the topic now (it just came out, and my hard copy is on the way). He is also the author of the best resource-collection git on that topic.

To me, all these books were helpful to solve practical problems. But probably more important than the concrete books, are the topics they introduce. Hope you have found something interesting (topic and/or book).

Brian Field, FRM

Data Scientist and Consumer Banking Credit Executive

1 年

Philipp Leser FANTASTIC POST! Really resonated with me. Thanks for sharing.

Scott Locklin

1 年

Funny I just read Hayashi (had looked at Wooldridge and Green before); it's a very good book. So weird you read that System Identification in 100 exercises book; Amazon sent me to review it. I pick it up from time to time for a different perspective.

Carsten Blank

Researcher | Quantum Tech | Energy Markets | Entrepreneur

1 年

If this helps at least one person to find their path through good books, this was a fantastic post. So, please share!

1 次回应

查看更多评论

要查看或添加评论，请登录

Books I considered helpful

Philipp Leser

Founding Partner bei data cybernetics ssc GmbH

领英推荐

社区洞察

其他会员也浏览了

Unlocking the Power of Data: A Comprehensive Guide to Our Data Science Course

From Novice to Data Scientist: Your Path to Success

Machine Learning (ML)

How to create your own Synthetic Data- for computer vision applications

Issue #297 - The ML Engineer ??

Future-Proof Your Career: Key Data Science Skills for the AI Era

Exploring Scikit-Learn in 10 Examples

Data Science Technologies

Building 10 Classifier ????Models in Machine?Learning + Notebook

End to End Movie Recommendation System with Flask app