Review of the new book "Machine Learning in Biotechnology and Life Sciences: Build machine learning models using Python and deploy them on the cloud"
I recently read "Machine Learning in Biotechnology and Life Sciences: Build machine learning models using Python and deploy them on the cloud" by Saleh Alkhalifa, which is currently available on Amazon.
This is a great book for beginners to learn machine learning and for professionals to refresh their memories. I am a data science professional who has worked in the healthcare industry for almost 10 years since completing my Ph.D. in Biostatistics and Data Science. For me, this book was a wonderful place to revisit what I already know and to learn some new things.
The book covers a comprehensive list of topics, organized as chapters: Python and the command line, SQL and relational databases, visualization with Python, machine learning (with sub-chapters introducing ML concepts and terms such as overfitting vs. underfitting, unsupervised methods, supervised methods, deep learning using Keras, Natural Language Processing, and Time Series Analysis), deploying models with Python Flask, and deploying applications to the AWS and GCP clouds. I think the order of the chapters is natural because it mirrors the working order of a typical ML/DS project. I also think the coverage is good enough for a beginner to become familiar with all the usual aspects of a machine learning project, while leaving room to extend their learning on each topic, for example by reading key papers and Wikipedia to understand the under-the-hood mechanism of a bidirectional LSTM.
For data science professionals, besides the opportunity to systematically review what they already know, the book also rewards the reading with new material. For example, I am quite well versed in SQL queries and routine database operations, but I never knew the database normalization rules, which the book states in three points. I could not agree more with this passage: “Although database administrators and data engineers tend to spend more time on structuring and normalizing databases, a great deal of time is spent at a data scientist’s end when it comes to understanding these structures and developing effective queries to retrieve data correctly and efficiently. Therefore, a strong foundational understanding of databases will always be useful regardless of the type of database being used.” Other examples: I never knew about “df.query()”, which Pandas provides for easier data frame filtering, or that “sklearn.metrics.classification_report()” can report a series of useful performance metrics at once instead of requiring each metric to be extracted manually.
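To make those two functions concrete, here is a minimal sketch. The toy data frame, its column names, and the values are invented for illustration and do not come from the book; the only library calls used are `DataFrame.query` from Pandas and `classification_report` from `sklearn.metrics`.

```python
import pandas as pd
from sklearn.metrics import classification_report

# Toy data invented for this illustration (not taken from the book).
df = pd.DataFrame({
    "compound": ["a", "b", "c", "d", "e", "f"],
    "weight": [120.5, 310.2, 95.0, 450.8, 210.0, 330.1],
    "y_true": [0, 1, 0, 1, 1, 1],
    "y_pred": [0, 1, 0, 0, 1, 1],
})

# DataFrame.query filters rows using a string expression.
heavy = df.query("weight > 200")

# The same filter written as a boolean mask, for comparison.
heavy_mask = df[df["weight"] > 200]

# classification_report returns precision, recall, F1 and support per class,
# plus overall accuracy; output_dict=True gives a nested dict instead of text.
report = classification_report(df["y_true"], df["y_pred"], output_dict=True)

# The default text form is convenient for a quick look in a notebook.
print(classification_report(df["y_true"], df["y_pred"]))
```

With `output_dict=True`, the per-class metrics are keyed by the label as a string (here "0" and "1"), which makes it easy to pull out a single number when needed.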
I think the visualization chapter could be better. It currently covers a limited variety of basic plots such as bar plots, box plots, and scatter plots, and it does not mention Tableau or Power BI dashboards at all. As a data science professional, I would say it would be better to introduce more plot types, such as funnel plots, forest plots, pie charts, heatmaps, treemaps, and geographical maps. For typical business intelligence analytics, Tableau and Power BI are the two most popular tools a data scientist should master. The book should include a sub-chapter introducing Tableau/Power BI, covering basic operations like drag and drop as well as more advanced skills like custom functions, pivot tables, linked actions on the same dashboard, and level of detail expressions. There is still something to praise in this chapter: the author has included several chemistry- and molecular-domain-specific figure examples and the associated libraries, which could be useful to part of the audience.
I enjoyed the unsupervised machine learning chapter, which covers two main topics: clustering and dimension reduction. The chapter covers the most representative methods, such as K-means, hierarchical clustering, t-SNE, SVD, and PCA, with real data examples. One thing that could be improved is color printing for the clustering figures. In the print version I received, the figures are in grayscale, which makes it hard to tell the dots of two clusters apart.
The supervised machine learning chapter covers the most representative methods, such as decision trees, linear regression, and XGBoost. However, it does not go deep enough to give the audience even a little of the under-the-hood mechanism. In fact, for a few methods the author only mentions the method name and provides a simple flowchart, and that is it. I was expecting the author to introduce, for instance, the L1-norm and L2-norm for penalized regression, and gradient descent, Newton-Raphson, and Taylor expansion for XGBoost. It would be good for the audience to learn a little methodology, for better retention and understanding, before moving on to applications. Worth mentioning is that the author does teach Amazon SageMaker through a step-by-step GUI tutorial, which is very friendly to cloud beginners.
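As an illustration of the kind of methodology I have in mind for the L1-norm and L2-norm, here is a minimal sketch of my own (not from the book). It assumes the simplest possible setting, a single coefficient whose unpenalized least-squares estimate is `z`, so that both penalized problems have a closed-form solution. The function names and the example values are hypothetical.

```python
def soft_threshold(z, lam):
    """Minimizer of 0.5*(b - z)**2 + lam*abs(b): the one-dimensional L1 (lasso) case."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0  # small estimates are set exactly to zero


def ridge_shrink(z, lam):
    """Minimizer of 0.5*(b - z)**2 + 0.5*lam*b**2: the one-dimensional L2 (ridge) case."""
    return z / (1.0 + lam)  # every estimate is scaled toward zero, none becomes zero


# Hypothetical unpenalized estimates and penalty strength, chosen for illustration.
ols_coefs = [3.0, 0.4, -1.5, -0.2]
lam = 0.5

lasso = [soft_threshold(z, lam) for z in ols_coefs]
ridge = [ridge_shrink(z, lam) for z in ols_coefs]

print("lasso:", lasso)
print("ridge:", ridge)
```

The point of the comparison is that the L1 penalty removes small coefficients entirely, which is why it is used for feature selection, while the L2 penalty only shrinks them.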
The deep learning chapter suffers from the same issue as the previous one: it does not explain enough of the methodology. For example, for CNNs the author provides only a minimalist flowchart and says nothing in the text about the CNN architecture. I was expecting to see at least an introduction to the convolutional layer, pooling layer, fully connected layer, padding, and so on. These points should be added and explained well.
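To show how little it takes to convey how these pieces fit together, here is a small sketch of my own (not from the book) that tracks the spatial size of a feature map through convolution and pooling layers using the standard output-size formula. The layer configuration, the 28x28 input, and the 16 output channels are assumptions picked for illustration.

```python
def conv_output_size(n, kernel, stride=1, padding=0):
    """Output width/height of a convolution or pooling layer on an n x n input."""
    return (n + 2 * padding - kernel) // stride + 1


size = 28                                              # assumed 28x28 input image
size = conv_output_size(size, kernel=3, padding=1)     # 3x3 conv with padding 1 keeps the size
after_conv1 = size
size = conv_output_size(size, kernel=2, stride=2)      # 2x2 max pooling halves it
after_pool1 = size
size = conv_output_size(size, kernel=3)                # 3x3 conv without padding shrinks by 2
after_conv2 = size
size = conv_output_size(size, kernel=2, stride=2)      # second 2x2 pooling
after_pool2 = size

channels = 16                                          # assumed number of filters in the last conv layer
flattened = after_pool2 * after_pool2 * channels       # input length of the fully connected layer

print(after_conv1, after_pool1, after_conv2, after_pool2, flattened)
```

Padding is what lets the first convolution preserve the input size, and the flattened length is what the first fully connected layer has to accept.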
When it comes to the application of deep learning, or artificial neural networks, I applaud this chapter. The author provides detailed Keras code with real data examples, and also illustrates how to use MLflow to track experiment performance and versioning. Keras and MLflow are tools that data scientists use daily for development and production. Even without working directly in TensorFlow or PyTorch, these two tools together already enable you to deliver a sound deep learning application to production. I also like the fact that the author provides separate cloud application tutorials for AWS and GCP. Though Azure VMs are not mentioned, I do think AWS and/or GCP should meet the needs of most readers.
I also enjoyed the chapter on NLP. It introduces several real data applications using popular packages such as NLTK, spaCy, and PyMed, all three of which are very practical for NLP problems. The author also shows how to tackle OCR problems through tutorials on AWS Textract and AWS Comprehend, which are used for tasks like NER. The chapter covers topic modeling/clustering using TF-IDF and NMF, and building a search engine using transformers. For more advanced methods in topic modeling, the author points to external resources such as (Bio)BERT, LDA, and Gensim. I think the author should have introduced some of these in the book as well, since 5 to 10 pages would do the job.
The time series chapter is a bit shorter than I thought it should be. The coverage of the LSTM method is very good, but I do think the author could also introduce the concept of autocorrelation and the classical statistical method called ARIMA.
The last few chapters give tutorials on deployment using Python Flask, Docker containers, AWS Lightsail, and GCP. Code chunks paired with step-by-step web GUI tutorials are my favorite format for this kind of learning.
Overall, I gave this book 4 stars on Amazon. I recommend it to both beginners and data science professionals, as both groups will benefit from reading the book and following the application tutorials hands-on.
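To give a sense of what an introduction to autocorrelation could look like, here is a short sketch of my own (not from the book) that computes the sample autocorrelation of a series at a given lag in plain Python. The seasonal toy series with a period of 4 is invented for illustration; inspecting these values across lags is a common first step before fitting a model such as ARIMA.

```python
def autocorrelation(series, lag):
    """Sample autocorrelation of a sequence at the given non-negative lag."""
    n = len(series)
    mean = sum(series) / n
    denom = sum((x - mean) ** 2 for x in series)
    num = sum((series[t] - mean) * (series[t + lag] - mean) for t in range(n - lag))
    return num / denom


# Toy seasonal series that repeats every 4 steps (invented for illustration).
series = [0, 1, 0, -1] * 6

acf = {lag: autocorrelation(series, lag) for lag in range(5)}
for lag, value in acf.items():
    print(f"lag {lag}: {value:+.3f}")
```

For this series the autocorrelation is strongly positive at the seasonal lag of 4 and strongly negative at lag 2, which is exactly the kind of pattern that signals seasonality in real data.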