Beginner's guide: The top 10 Data Science libraries in Python
Photo by Emile Perron on Unsplash

Dear aspiring Data Scientist,

You think that Data Science is the coolest. Yes, you are right! So you decide to pursue Data Science as your career.

After digging a bit deeper, you get lost in all the references that you are supposed to use. Fear not, here are the ultimate 7 references that you will need to master Data Science. Congratulations, you’ve now untangled the references zoo.

And you are now ready to jump to the second zoo, namely the Python packages that you will be using. Here comes your guide for that.

____________________________________________________


I. Data Prep:

This is the very first stage of your work. In this stage, you will load your data, transform it, clean it and summarise it so that it is ready for the work that follows. So which packages should you use for your data prep?

1. Pandas

Pandas is a package that allows you to process, manipulate, summarise and analyse data in Python. Fun fact: the pandas DataFrame was modelled on the data frame functionality that R provides out of the box.

  • When to use: You will use pandas to analyse and preprocess your data.
  • Limitations: Pandas processes the data in memory. That means that all of your data needs to fit into your RAM. On the one hand, this makes the processing fast. On the other hand, pandas cannot be used to process big data.
  • Solution: Use one of the big data packages, mostly PySpark or Dask.
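To make the load, clean, transform and summarise steps concrete, here is a minimal sketch. The data is made up and stands in for a file you would normally load with pd.read_csv; the column names are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical toy data; in practice you would load a file,
# e.g. with pd.read_csv("your_file.csv").
raw = pd.DataFrame(
    {
        "city": ["Berlin", "Berlin", "Cairo", "Cairo", "Cairo"],
        "temperature": [21.0, None, 35.0, 33.0, None],
    }
)

# Clean: drop the rows where the measurement is missing.
clean = raw.dropna(subset=["temperature"])

# Transform: add a derived column (Celsius to Fahrenheit).
clean = clean.assign(temperature_f=clean["temperature"] * 9 / 5 + 32)

# Summarise: average temperature per city.
summary = clean.groupby("city")["temperature"].mean()
print(summary)
```

Everything here happens in memory, which is exactly why this approach stops working once the data no longer fits into your RAM.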


2. PySpark

PySpark is the Python API for Apache Spark, which lets you use Spark directly from a Python script. Apache Spark is one of the most commonly used engines for processing big data.

  • When to use: You will need PySpark to process “big data”, that is, data that does not fit into the memory of your machine.
  • Limitation: You must have Spark installed on your machine, and this might be trickier than you think.
  • Solution: Use Dask, which I personally think is one of the most underrated packages.

____________________________________________________


II. Visualisation:

After you have prepared your data, you are ready to visualise it to spot trends and insights. For that purpose, you will need the following packages:


3. Matplotlib:

Matplotlib is one of the most commonly used visualisation packages in Python. Personally, I find it painful that this package is considered part of the standard Data Science stack, given its non-intuitive syntax and poor visualisation quality. But it is the de facto standard.

  • When to use: Use Matplotlib for quick drafts or prototypes of visualizations.
  • Limitations: The visual quality of the results is not good enough to present to a non-technical customer.
  • Solution: Use Plotly for fancier visualizations.
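A quick draft in Matplotlib might look like the sketch below. The sales numbers are made up, and the "Agg" backend is chosen here only so that the script also runs on a machine without a display.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; an assumption for headless runs
import matplotlib.pyplot as plt

# Made-up data for illustration.
months = ["Jan", "Feb", "Mar", "Apr"]
sales = [120, 135, 150, 170]

# One figure with one set of axes, then a simple line plot on it.
fig, ax = plt.subplots()
ax.plot(months, sales, marker="o")
ax.set_title("Monthly sales (made-up data)")
ax.set_xlabel("Month")
ax.set_ylabel("Units sold")
```

To look at the result, call fig.savefig("draft.png"), or drop the backend line and call plt.show() in an interactive session.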

4. Plotly:

Addressing the limitations of Matplotlib, Plotly was developed to improve the quality of data visualizations in Python. Plotly also offers extensive interactive visualizations.

  • When to use: Use Plotly to present your data in a visually compelling way, especially to external customers.
  • Limitations: You might face performance issues if your data set is big.
  • Solution: Create the initial prototypes of your visualization using Matplotlib, and try to summarize and aggregate your data frames before passing them to Plotly.

____________________________________________________


III. Machine learning:

After having spent 80% of your time with the packages mentioned above, it is now time for the advanced phase, namely Machine Learning.


5. NumPy:

Even though Python is a high-level language that is very attractive in terms of learning curve, readability and community, Python is not the best in terms of performance. And given that Machine Learning relies heavily on complex numeric computing, a higher-performing solution was needed.

Thus, NumPy was developed with its core written in C, a high-performance programming language.

  • What: NumPy is a package that supports large arrays and matrices and provides a large number of mathematical functions that can be applied to them.
  • When to use: You will use NumPy every time you need to work with arrays and matrices directly. This will be a large part of your work if you are implementing ML algorithms from scratch.
  • Limitations: Some of the advanced mathematical operations are not implemented in NumPy.
  • Solution: Use SciPy.
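As a small illustration of working with arrays and matrices directly, the sketch below uses made-up numbers. The variable names are arbitrary; the point is that the arithmetic runs on whole arrays at once instead of in Python loops.

```python
import numpy as np

# A 2x3 matrix and a weight vector with made-up numbers.
X = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])
w = np.array([0.5, -1.0, 2.0])

# Matrix-vector product: one weighted sum per row of X.
y = X @ w

# Column means, then centring each column via broadcasting
# (the 1-D array of means is subtracted from every row).
col_means = X.mean(axis=0)
centered = X - col_means

print(y)
print(centered)
```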

6. SciPy:

Building on the high-performing implementation of NumPy, SciPy offers more advanced operations of linear algebra and calculus, such as integration, Fourier transforms, function minimisation etc.

  • When to use: Use SciPy every time you need operations of linear algebra and calculus. This will mostly happen if you are implementing an algorithm from scratch.
  • Limitations: SciPy does not provide an extensive library of the standard ML algorithms.
  • Solution: You've guessed it: here comes Sklearn.
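Two of the operations mentioned above, integration and function minimisation, can be sketched as follows. The functions being integrated and minimised are arbitrary examples chosen because their exact answers are known.

```python
import numpy as np
from scipy import integrate, optimize

# Integration: area under sin(x) between 0 and pi (the exact answer is 2).
area, abs_error = integrate.quad(np.sin, 0, np.pi)

# Minimisation: find the x that minimises (x - 3)^2 + 1.
# The exact minimum is at x = 3, where the function value is 1.
result = optimize.minimize_scalar(lambda x: (x - 3) ** 2 + 1)

print(area)
print(result.x, result.fun)
```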


7. Sklearn:

Sklearn (scikit-learn) is one of the most beautifully designed and commonly used packages. It implements most of the standard Machine Learning algorithms, extensively optimised using NumPy and Cython.

  • When to use: Use Sklearn each time you need a standard Machine Learning algorithm.
  • Limitations: Even though Sklearn does contain implementations of Neural Networks, it does not provide the flexibility that Deep Learning specialists need.
  • Solution: TensorFlow to the rescue.
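Using a standard algorithm typically follows the same few steps: split the data, fit a model, evaluate it. The sketch below assumes the Iris data set that ships with Sklearn and a random forest classifier; both are arbitrary choices for illustration, not a recommendation.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Load a small built-in data set: 150 flowers, 4 features, 3 classes.
X, y = load_iris(return_X_y=True)

# Hold out a quarter of the data to evaluate the model on unseen samples.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Fit the model on the training part only.
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Accuracy on the held-out part.
accuracy = model.score(X_test, y_test)
print(accuracy)
```

Swapping in another algorithm usually means changing only the line that creates the model, because Sklearn estimators share the same fit and score interface.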

Tip: If you are already familiar with the basics of Sklearn, check out this article: 5 advanced Scikit-learn features that will transform the way you code

8. TensorFlow:

TensorFlow is a library that is used to create Neural Networks and Deep Learning models. Its core is implemented in C++ to increase performance, and it can run on multiple CPUs, GPUs and TPUs.

  • When to use: TensorFlow is the go-to package whenever you need to implement a deep learning model.
  • Limitation: If you will be processing image or text data, which are the common inputs for deep learning models, you will need additional packages to process your data (e.g. spaCy, OpenCV).

____________________________________________________


IV. Production:

And now you have reached the point where you have a trained model that is ready to be used by your customer.

9. Flask:

You now have a model that you want to integrate into your product. One of the most common ways to expose your model is to wrap it in an API that other applications, irrespective of their language, can call directly.

  • When to use: Flask is a very straightforward package that enables you to build an API for your model with little effort.
  • Limitations: Expect that you will need to coordinate with your system and security admins if you want your API to be exposed externally or publicly.
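Wrapping a model in an API can be sketched as below. The predict function is a hypothetical stand-in for a trained model, and the /predict route and the "features" field are names assumed here for illustration.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)


def predict(features):
    """Hypothetical stand-in for a trained model: returns the sum of the inputs."""
    return float(sum(features))


@app.route("/predict", methods=["POST"])
def predict_endpoint():
    # Parse the JSON body; silent=True yields None instead of raising on bad input.
    payload = request.get_json(silent=True)
    features = payload.get("features") if isinstance(payload, dict) else None

    # Reject requests that do not carry a non-empty list of features.
    if not isinstance(features, list) or not features:
        return jsonify(error="expected a JSON body with a 'features' list"), 400

    return jsonify(prediction=predict(features))
```

Assuming the code is saved as app.py, it can be served locally with "flask --app app run", and any client that can send an HTTP POST with a JSON body can then call it, whatever language it is written in.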

10. Dash

As attractive as an API can be for tech geeks, it is just as useless for business stakeholders. Solution? Build a web app that your customer can use to access the results of your model.

  • When to use: Use Dash to build a user-friendly web app. Both the front end and the back end of the web app are written in Python.
  • Limitations: Dash might have a steep learning curve at the beginning.
  • Solution: Streamlit might be simpler to start with, but the framework is not as mature as Dash.

____________________________________________________



Mohamed Samir

Digital Business Consultant

3 years ago

Thanks for the great article.

Karthik Suriyanarayanan

Business Analyst @ BNY | PMP, PSPO, Financial Engineering

3 years ago

Very useful post simplifying the processes with their limitations … thanks a lot Deena Gergis. Very very helpful

Javier Mansilla

CTO at Krino with experience in developing AI-powered SaaS software for startups and enterprise companies. Self-taught Backend Developer and proactive in business technology.

3 years ago

Great!!


Hey Deena Gergis your articles are very informative and valuable ! We would love to invite you for a chat!
