Exploring new dimensions in Data Science

Big data, Hadoop, Apache Spark, MongoDB: these words sound fun and scary at the same time. In my journey as a data scientist with limited programming knowledge, I owe a lot to Python libraries for making the voyage so easy. In this article, I describe the most common Python libraries used in data analytics.

A Python library is simply a collection of functions and methods that lets you perform many tasks without writing the underlying code yourself. These libraries ship with built-in modules that provide different functionality and can be used directly once imported. Python has an extensive set of libraries offering a broad range of facilities. The best part is that the libraries described here are all open source. We can divide them into three main groups.

1. Scientific Computing Libraries: The first group is a collection of software designed specifically for scientific computing in Python. One of the most used packages for manipulating, aggregating and analyzing data (better known as “data wrangling”) is Pandas. Pandas is well suited to structured data that has labeled rows and columns.
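As a small sketch of what data wrangling with Pandas can look like, the snippet below filters and aggregates a tiny table. The sales records and column names are made up for illustration only.

```python
import pandas as pd

# Hypothetical sales records, invented purely for illustration.
df = pd.DataFrame({
    "region": ["North", "South", "North", "South"],
    "product": ["A", "A", "B", "B"],
    "units": [10, 7, 3, 12],
})

# Manipulate: keep only the rows with at least 5 units sold.
large_orders = df[df["units"] >= 5]

# Aggregate: total units sold per region.
units_by_region = df.groupby("region")["units"].sum()

print(large_orders)
print(units_by_region)
```

The row and column labels are what make this readable: the filter and the group-by both refer to columns by name rather than by position.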

The other commonly used library is NumPy, which uses arrays as its inputs and outputs. Its arrays extend naturally to matrices, and operations on whole arrays run much faster than equivalent Python loops. The beauty of NumPy is its ability to turn Python into a high-level language for manipulating numerical data, similar to MATLAB.
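A minimal sketch of the MATLAB-like style NumPy allows is shown here. The arrays are arbitrary example values; each operation works on the whole array at once, with no explicit loop.

```python
import numpy as np

# Arbitrary example values: a 2x2 matrix and a length-2 vector.
a = np.array([[1.0, 2.0], [3.0, 4.0]])
b = np.array([10.0, 20.0])

scaled = a * 2              # element-wise multiplication
shifted = a + b             # b is broadcast across each row of a
product = a @ a             # matrix multiplication
col_means = a.mean(axis=0)  # mean of each column

print(scaled)
print(shifted)
print(product)
print(col_means)
```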

Another well-known package is SciPy, which includes functions for more advanced math problems such as integration and optimization.
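To give a feel for the kind of math problems SciPy handles, here is a short sketch that integrates a function numerically and finds the minimum of another. The two functions are simple examples chosen so the answers are easy to check by hand.

```python
import numpy as np
from scipy import integrate, optimize

# Numerically integrate sin(x) from 0 to pi; the exact answer is 2.
area, abs_error = integrate.quad(np.sin, 0, np.pi)

# Find the minimum of (x - 3)^2 + 1; it lies at x = 3 with value 1.
result = optimize.minimize_scalar(lambda x: (x - 3) ** 2 + 1)

print(area)
print(result.x, result.fun)
```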

2. Python Visualization Libraries: Visualization is the best way to tell a story based on complicated numbers and processes. Python libraries make it easy to create graphs, charts and maps in just a few lines of code. Matplotlib is the best-known library for making highly customized graphs and plots. Another widely used visualization library is Seaborn. It is based on Matplotlib and is used to create plots such as heat maps, time series plots and violin plots. It is mostly used for statistical visualization.
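The sketch below shows a customized Matplotlib plot in a handful of lines. The sine curve, the labels and the choice to render into an in-memory buffer are illustrative assumptions; in a notebook you would typically just display the figure.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import numpy as np

# Example data: one period of a sine wave.
x = np.linspace(0, 2 * np.pi, 100)

fig, ax = plt.subplots(figsize=(6, 4))
ax.plot(x, np.sin(x), color="tab:blue", linestyle="--", label="sin(x)")
ax.set_title("A customized line plot")
ax.set_xlabel("x")
ax.set_ylabel("sin(x)")
ax.legend()

# Render the figure as a PNG into memory instead of writing a file.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
```

Since Seaborn is built on Matplotlib, its plots can be adjusted with the same kind of calls for titles and axis labels.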

3. Algorithmic Libraries: Machine learning algorithms are used to build models that make predictions from data. These libraries handle everything from basic to complex machine learning tasks. The most used one is Scikit-learn. It contains tools for statistical modeling, including regression, classification, clustering and so on. This library is built on NumPy, SciPy and Matplotlib.
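As a rough sketch of a typical Scikit-learn workflow, the example below trains a classifier on the iris dataset that ships with the library. The choice of logistic regression, the 25% test split and the random seed are assumptions made for illustration, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Load the small iris dataset bundled with Scikit-learn.
X, y = load_iris(return_X_y=True)

# Hold out a quarter of the data to evaluate the model.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Fit a classifier on the training data and predict on unseen data.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
predictions = model.predict(X_test)

# Fraction of test samples classified correctly.
accuracy = model.score(X_test, y_test)
print(accuracy)
```

Most Scikit-learn models follow this same pattern of `fit`, then `predict` or `score`, so swapping in a different algorithm usually changes only one line.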

Statsmodels is another Python module in this group. It allows users to explore data, estimate statistical models and perform statistical tests.

In the next post, we will explore these libraries in detail with some examples and enjoy the magical world of data science.

Piyush Agarwal, Ph.D.

Principal Scientist at Pfizer | University of Waterloo | IIT Madras

5 years ago

Perfect! This will surely help me with my transition from MATLAB to Python. Looking forward to your next post Sakshi :)
