Introduction to Pandas

Python was the third most used programming language in 2019, which speaks to how useful, efficient, and clean it is. Some applications of Python include:

  • Web development
  • Web Scraping
  • Testing
  • Data Analysis

Data, as we all know, is one of the most valuable resources in the world today, and it is being generated at a rapid rate. By a widely cited estimate, about 90% of the world's data was generated in the last two years alone, and roughly 2.5 quintillion bytes of data are created each day at the current pace. Its value is hardly a secret: big companies have faced Senate grillings over the privacy of the data they acquire. To sum it all up, data is the new oil. In the words of Clive Humby:

“Data is the new oil. It’s valuable, but if unrefined it cannot really be used. It has to be changed into gas, plastic, chemicals, etc to create a valuable entity that drives profitable activity; so must data be broken down, analyzed for it to have value.”

Data Life-Cycle

  • Data is stored in different formats: it can be a CSV (comma-separated values) file, an Excel file, or an HTML file. So the first need is to convert or transform that data into a single format and store it somewhere. That is where data warehousing plays a major role.
  • Once the data is stored, one can perform further analysis on it. This can include predictive modeling, joining or merging of data, and much more.
  • The human brain works very well with pictures. So the next step is to plot the data in the form of graphs, which is known as data visualization.

Let us get into some more details about data analysis.

What is Data Analysis?

As per Wikipedia, data analysis is a process of inspecting, cleansing, transforming, and modeling data with the goal of discovering useful information, informing conclusions, and supporting decision-making.

For instance, suppose you have a dataset about children who drop out of school in different states of India, covering primary through high school, and you want to analyze the percentage of girls who drop out in a particular state. What should be done? You have to run certain analyses on the dataset, and the result should tell you the percent increase in girl child dropouts for the selected state.
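
The calculation described above can be sketched with pandas, the library introduced in the next section. This is only a rough illustration: the column names, the placeholder state names, and all of the figures are made up for the example and are not real dropout statistics.

```python
import pandas as pd

# Illustrative, made-up figures; these are NOT real dropout statistics.
dropouts = pd.DataFrame({
    "state": ["State A", "State A", "State B", "State B"],
    "year": [2017, 2018, 2017, 2018],
    "girls_enrolled": [1000, 1000, 2000, 2000],
    "girls_dropped_out": [40, 50, 100, 90],
})

# Dropout percentage = girls who dropped out / girls enrolled * 100.
dropouts["dropout_pct"] = (
    dropouts["girls_dropped_out"] / dropouts["girls_enrolled"] * 100
)

# Keep only the selected state, ordered by year.
selected = dropouts[dropouts["state"] == "State A"].sort_values("year")

# Percent increase of the dropout percentage from the first to the last year.
first = selected["dropout_pct"].iloc[0]
last = selected["dropout_pct"].iloc[-1]
pct_increase = (last - first) / first * 100

print(f"Dropout rate went from {first:.1f}% to {last:.1f}% ({pct_increase:+.1f}%)")
```

With a real dataset, the same three steps would apply: compute the rate, filter to the state of interest, and compare across years.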

Pandas

To perform data analysis in Python, we can use a library called pandas. Pandas is a software library written for the Python programming language for data manipulation and analysis. It is built on top of NumPy and works well alongside SciPy and Matplotlib.

Matplotlib is a data visualization library for Python. NumPy is the fundamental package for scientific computing: it provides a powerful n-dimensional array object, tools for integrating C and C++ code, and support for linear algebra, Fourier transforms, random number generation, and more. SciPy is an open-source Python library for scientific and technical computing, with modules for optimization, linear algebra, integration, and so on.

Pandas uses a Series for one-dimensional data and a DataFrame for two-dimensional, tabular data. It provides an efficient way to slice the data and a flexible way to merge, concatenate, or reshape it. It also handles missing data with ease, which is a very prominent headache for data analysts out there.
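
As a minimal sketch of these ideas, the snippet below builds a Series and two DataFrames, merges them, and fills in a missing value. All names and values here (movie_id, title, rating, "Film A", and so on) are assumptions invented for the illustration.

```python
import numpy as np
import pandas as pd

# A Series is a one-dimensional labeled array; NaN marks a missing value.
ratings = pd.Series([8.1, np.nan, 7.4], index=["a", "b", "c"], name="rating")

# A DataFrame is a two-dimensional table with labeled columns.
# The rows below are made up for illustration.
movies = pd.DataFrame({
    "movie_id": [1, 2, 3],
    "title": ["Film A", "Film B", "Film C"],
})
scores = pd.DataFrame({
    "movie_id": [1, 2, 3],
    "rating": ratings.to_numpy(),
})

# Merge the two tables on their shared key column.
merged = movies.merge(scores, on="movie_id")

# Handle missing data: count the gaps, then fill them with the column mean.
missing_count = merged["rating"].isna().sum()
merged["rating_filled"] = merged["rating"].fillna(merged["rating"].mean())

print(merged)
print("Missing ratings:", missing_count)
```

Filling with the mean is only one possible choice; dropping the rows with dropna() is another common option, depending on the analysis.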

Working with pandas

There are a lot of operations that can be performed using pandas. Let us look at one of them, known as slicing. Suppose you are given a movies dataset that looks something like this:

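A small stand-in for such a dataset can be sketched as follows. The column names (movie_id, name, year, rating, duration) and every row are assumptions made up for illustration; they are not taken from the original file, and movies.xlsx is a hypothetical file name.

```python
import pandas as pd

# Hypothetical stand-in for the movies dataset; all rows are made up.
movies = pd.DataFrame({
    "movie_id": [1, 2, 3, 4, 5, 6, 7],
    "name": ["Film A", "Film B", "Film C", "Film D", "Film E", "Film F", "Film G"],
    "year": [1948, 1952, 1957, 1960, 1975, 1994, 2008],
    "rating": [7.9, 8.3, 8.9, 8.0, 8.6, 9.2, 9.0],
    "duration": [95, 103, 96, 109, 133, 142, 152],  # minutes
})

# With a real spreadsheet, pd.read_excel("movies.xlsx") would load the data
# instead (hypothetical file name; reading .xlsx files needs an engine such
# as openpyxl to be installed).

print(movies)
```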

This is an Excel file that contains the movie id, the name of the movie, the year it was released, its rating, and the duration of the film. (Well, a great list for the holidays!)

Coming back to slicing, suppose you want to see only the first five rows of the dataset. This is where head() comes into play:

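A minimal sketch, assuming the data sits in a DataFrame called movies (the frame below is a made-up stand-in with hypothetical column names):

```python
import pandas as pd

# Hypothetical stand-in for the movies dataset; all rows are made up.
movies = pd.DataFrame({
    "name": ["Film A", "Film B", "Film C", "Film D", "Film E", "Film F", "Film G"],
    "year": [1948, 1952, 1957, 1960, 1975, 1994, 2008],
})

# head() returns the first five rows by default; head(n) returns the first n.
first_five = movies.head()
print(first_five)
```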

Similarly, if you want the last two rows, use tail():

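Again as a sketch on a made-up stand-in frame (the movies name and its columns are assumptions):

```python
import pandas as pd

# Hypothetical stand-in for the movies dataset; all rows are made up.
movies = pd.DataFrame({
    "name": ["Film A", "Film B", "Film C", "Film D", "Film E", "Film F", "Film G"],
    "year": [1948, 1952, 1957, 1960, 1975, 1994, 2008],
})

# tail(n) returns the last n rows; without an argument it returns the last five.
last_two = movies.tail(2)
print(last_two)
```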

What if you want to find the names of the movies released between 1950 and 1960? Here is how to do it:

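One way to write this, assuming the columns are called name and year and that "between" includes both 1950 and 1960 (the frame is a made-up stand-in):

```python
import pandas as pd

# Hypothetical stand-in for the movies dataset; all rows are made up.
movies = pd.DataFrame({
    "name": ["Film A", "Film B", "Film C", "Film D", "Film E"],
    "year": [1948, 1952, 1957, 1960, 1975],
    "rating": [7.9, 8.3, 8.9, 8.0, 8.6],
})

# Build a boolean mask: True for rows released from 1950 to 1960 inclusive.
in_fifties = (movies["year"] >= 1950) & (movies["year"] <= 1960)

# Apply the mask to the rows and keep only the "name" column.
names = movies.loc[in_fifties, "name"]
print(names)
```

Each condition needs its own parentheses, and the two are combined with & rather than the Python keyword and, because the comparison runs element by element over the whole column.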

Here we retrieve the movies by applying the condition and then select only the movie name; otherwise the result would include every column, including the rating, duration, and id. To get multiple columns, pass a list of column names, for example:

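A sketch of selecting two columns at once, with the same caveats as before (made-up rows, hypothetical column names):

```python
import pandas as pd

# Hypothetical stand-in for the movies dataset; all rows are made up.
movies = pd.DataFrame({
    "name": ["Film A", "Film B", "Film C", "Film D"],
    "year": [1948, 1952, 1957, 1975],
    "rating": [7.9, 8.3, 8.9, 8.6],
    "duration": [95, 103, 96, 133],
})

# Rows released from 1950 to 1960 inclusive.
in_fifties = (movies["year"] >= 1950) & (movies["year"] <= 1960)

# Passing a list of column names returns a DataFrame with just those columns.
subset = movies.loc[in_fifties, ["name", "rating"]]

# The double-bracket form gives the same result: movies[in_fifties][["name", "rating"]]
print(subset)
```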

A good way to get a feel for the data is to call describe(). It reports the count, mean, standard deviation, minimum, maximum, and percentiles (25%, 50%, and 75% by default) of each numeric column in the dataset.

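A small sketch on made-up numbers (the column names rating and duration are assumptions):

```python
import pandas as pd

# Hypothetical stand-in for the movies dataset; all rows are made up.
movies = pd.DataFrame({
    "name": ["Film A", "Film B", "Film C"],
    "rating": [7.0, 8.0, 9.0],
    "duration": [90, 100, 110],
})

# describe() summarizes the numeric columns; the text column "name" is skipped.
summary = movies.describe()
print(summary)

# Individual statistics can be read from the result by row and column label.
mean_rating = summary.loc["mean", "rating"]
```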

In conclusion, pandas is a useful library for data manipulation and analysis. It provides powerful, easy-to-use data structures, as well as the means to quickly perform operations on them. We will discuss more in the upcoming articles. Till then, enjoy the holidays and analyze your own actions to calculate your percentage of productivity. Make sure it keeps increasing, because procrastination is directly proportional to the number of holidays. Peace out.

By Ankisha Sharma