Introduction to Pandas
Python was the third most used programming language in 2019, a position it owes to being useful, efficient and clean. Some of the applications of Python include:
- Web development
- Web Scraping
- Testing
- Data Analysis
Data, as we all know, is the most important entity in the world today, and it is being generated at a rapid rate. About 90% of the data in the world today was generated over the last two years alone, and at our current pace 2.5 quintillion bytes of data are created each day. And who does not know its value? Big companies have faced Senate hearings over the privacy of the data they acquire. To sum it all up, data is the new oil. In the words of Clive Humby:
“Data is the new oil. It’s valuable, but if unrefined it cannot really be used. It has to be changed into gas, plastic, chemicals, etc to create a valuable entity that drives profitable activity; so must data be broken down, analyzed for it to have value.”
Data Life-Cycle
- Data is stored in different formats. It can be a CSV (comma-separated values) file, an Excel file or an HTML file. So the first need is to convert or transform that data into a single format and store it somewhere. That is where data warehousing plays a major role.
- Once the data is stored somewhere, one can perform further analysis on it. This can include predictive modeling, joining or merging of data and much more.
- The human brain works very well with pictures. So the next step is to visualize the data, that is, to plot it in the form of graphs. This is known as data visualization.
Let us get into some more details about data analysis.
What is Data Analysis?
As per Wikipedia, data analysis is a process of inspecting, cleansing, transforming and modeling data with the goal of discovering useful information, informing conclusions and supporting decision-making.
For instance, suppose you have a dataset that contains data about children who drop out of school in different states of India, from primary to high school. And suppose you want to analyze the percentage of girl child dropouts in a particular state. Now, what should be done? You have to perform certain analyses on the given dataset, and the result should give you the percentage of girl child dropouts for the selected state.
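As a rough sketch of what such an analysis could look like in code, the snippet below computes a dropout percentage for one state. The column names (`state`, `level`, `girls_enrolled`, `girls_dropped_out`) and every number in the table are invented for illustration; they are not real statistics and not taken from any actual dataset.

```python
import pandas as pd

# Invented figures for illustration only; these are NOT real statistics.
dropouts = pd.DataFrame({
    "state": ["State A", "State A", "State B", "State B"],
    "level": ["primary", "high school", "primary", "high school"],
    "girls_enrolled": [1000, 500, 800, 400],
    "girls_dropped_out": [50, 40, 24, 20],
})

# Keep only the rows for the state we are interested in.
selected = dropouts[dropouts["state"] == "State A"].copy()

# Dropout percentage per school level within the selected state.
selected["dropout_pct"] = (
    100 * selected["girls_dropped_out"] / selected["girls_enrolled"]
)

# Overall dropout percentage for the selected state across all levels.
overall_pct = (
    100 * selected["girls_dropped_out"].sum() / selected["girls_enrolled"].sum()
)

print(selected[["level", "dropout_pct"]])
print(f"Overall: {overall_pct:.1f}%")
```

With a real dataset the same three steps would apply: filter to the state, divide dropouts by enrolment, and aggregate.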
Pandas
To perform data analysis in Python we import a library called Pandas. Pandas is a software library written for the Python programming language for data manipulation and analysis. It is built on top of NumPy and works closely with SciPy and Matplotlib.
Matplotlib is a data visualization module for Python. NumPy is a fundamental package for scientific computing: it provides a powerful n-dimensional array object and tools for integrating with C or C++ code, and it supports linear algebra, Fourier transforms, random number generation and more. SciPy is also an open-source Python module used for scientific and technical computing. It contains modules for optimization, linear algebra, integration and so on.
Pandas uses a Series for one-dimensional data and a DataFrame for two-dimensional, tabular data. It provides an efficient way to slice the data and a flexible way to merge, concatenate or reshape it. It also has built-in support for handling missing data, which is a very prominent headache for data analysts out there.
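The following is a minimal sketch of those building blocks: a Series, a DataFrame, a missing value and a merge. All names and values here are made up for illustration.

```python
import numpy as np
import pandas as pd

# A Series is a one-dimensional labeled array; NaN marks a missing value.
ratings = pd.Series([8.5, 9.0, np.nan], index=["a", "b", "c"], name="rating")

# A DataFrame is a two-dimensional table made of labeled columns.
movies = pd.DataFrame({"movie_id": [1, 2, 3], "name": ["A", "B", "C"]})
durations = pd.DataFrame({"movie_id": [1, 2], "duration": [120, 95]})

# Missing data can be counted and, if appropriate, filled in.
n_missing = ratings.isna().sum()
filled = ratings.fillna(ratings.mean())

# merge joins two tables on a shared key; a left join keeps every movie,
# leaving NaN where no duration is available.
merged = movies.merge(durations, on="movie_id", how="left")

print(filled)
print(merged)
```

Filling with the mean is only one option; whether to fill, drop or keep missing values depends on the analysis.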
Working with pandas
There are a lot of operations that can be performed using Pandas. Let us look at one of them, known as slicing. Suppose you are provided with a movies dataset that looks something like this:
This is an Excel file that contains the movie id, the name of the movie, the year it was released, its rating and the duration of the film. (Well, a great list for the holidays!)
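Since the original file is not available here, the sketch below builds a small stand-in table in memory. The column names (`movie_id`, `name`, `year`, `rating`, `duration`), the placeholder titles and all values are assumptions made for illustration. With a real file, something like `pd.read_excel` would load it instead; the file name in the comment is hypothetical.

```python
import pandas as pd

# Stand-in for the movies spreadsheet; every value here is invented.
# With an actual file one would load it instead, for example:
#   movies = pd.read_excel("movies.xlsx")  # hypothetical file name
movies = pd.DataFrame({
    "movie_id": [1, 2, 3, 4, 5, 6, 7],
    "name": ["Movie A", "Movie B", "Movie C", "Movie D",
             "Movie E", "Movie F", "Movie G"],
    "year": [1942, 1954, 1957, 1958, 1960, 1972, 1994],
    "rating": [8.5, 8.6, 9.0, 8.3, 8.5, 9.2, 9.3],
    "duration": [102, 112, 96, 128, 109, 175, 142],  # minutes
})

print(movies)
```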
As we were discussing slicing, suppose you want to see only the first five rows of the dataset. This is where head() comes into play:
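A minimal sketch, using an invented stand-in for the movies table (column names and values are assumptions, not the original data):

```python
import pandas as pd

# Invented stand-in for the movies dataset.
movies = pd.DataFrame({
    "movie_id": [1, 2, 3, 4, 5, 6, 7],
    "name": ["Movie A", "Movie B", "Movie C", "Movie D",
             "Movie E", "Movie F", "Movie G"],
    "year": [1942, 1954, 1957, 1958, 1960, 1972, 1994],
    "rating": [8.5, 8.6, 9.0, 8.3, 8.5, 9.2, 9.3],
    "duration": [102, 112, 96, 128, 109, 175, 142],
})

# head() returns the first five rows by default; head(n) returns the first n.
first_five = movies.head()
print(first_five)
```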
Similarly, if you want the last 2 rows, use tail():
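Again as a sketch on an invented stand-in table (the column names and values are assumptions):

```python
import pandas as pd

# Invented stand-in for the movies dataset.
movies = pd.DataFrame({
    "movie_id": [1, 2, 3, 4, 5, 6, 7],
    "name": ["Movie A", "Movie B", "Movie C", "Movie D",
             "Movie E", "Movie F", "Movie G"],
    "year": [1942, 1954, 1957, 1958, 1960, 1972, 1994],
})

# tail(n) returns the last n rows; without an argument it returns five.
last_two = movies.tail(2)
print(last_two)
```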
What if you want to find the names of the movies released between 1950 and 1960? Here is how to do it:
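One possible way to write this, sketched on an invented stand-in table. The column names are assumptions, and the range is treated as inclusive of both 1950 and 1960, which is also an assumption about what "between" means here.

```python
import pandas as pd

# Invented stand-in for the movies dataset.
movies = pd.DataFrame({
    "movie_id": [1, 2, 3, 4, 5, 6, 7],
    "name": ["Movie A", "Movie B", "Movie C", "Movie D",
             "Movie E", "Movie F", "Movie G"],
    "year": [1942, 1954, 1957, 1958, 1960, 1972, 1994],
    "rating": [8.5, 8.6, 9.0, 8.3, 8.5, 9.2, 9.3],
})

# Boolean mask: True for rows whose year lies in [1950, 1960].
# Series.between is inclusive on both ends by default.
in_fifties = movies["year"].between(1950, 1960)

# Apply the mask to the rows and keep only the "name" column.
names = movies.loc[in_fifties, "name"]
print(names)
```

The same mask can also be written with comparison operators, as `(movies["year"] >= 1950) & (movies["year"] <= 1960)`.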
Here we retrieve the movies by applying the condition and then select only the movie name column. Without that selection, the result would include all the columns, including rating, duration and id. To get multiple columns, pass a list of column names inside the indexing brackets, for example:
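A sketch of selecting several columns at once, again on an invented stand-in table with assumed column names:

```python
import pandas as pd

# Invented stand-in for the movies dataset.
movies = pd.DataFrame({
    "movie_id": [1, 2, 3, 4, 5, 6, 7],
    "name": ["Movie A", "Movie B", "Movie C", "Movie D",
             "Movie E", "Movie F", "Movie G"],
    "year": [1942, 1954, 1957, 1958, 1960, 1972, 1994],
    "rating": [8.5, 8.6, 9.0, 8.3, 8.5, 9.2, 9.3],
})

in_fifties = movies["year"].between(1950, 1960)

# A list of column names selects several columns and returns a DataFrame.
# The outer brackets index the table; the inner brackets are the list.
subset = movies[in_fifties][["name", "rating"]]
print(subset)
```

A single column name returns a Series, while a list of names, even a list of one, returns a DataFrame.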
A good way to get a feel for the data is to call describe(). For each numeric column it reports the count, mean, standard deviation (std), minimum, maximum and the 25th, 50th and 75th percentiles.
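As a short sketch on an invented stand-in table (column names and values are assumptions):

```python
import pandas as pd

# Invented stand-in for the movies dataset.
movies = pd.DataFrame({
    "name": ["Movie A", "Movie B", "Movie C", "Movie D",
             "Movie E", "Movie F", "Movie G"],
    "year": [1942, 1954, 1957, 1958, 1960, 1972, 1994],
    "rating": [8.5, 8.6, 9.0, 8.3, 8.5, 9.2, 9.3],
    "duration": [102, 112, 96, 128, 109, 175, 142],
})

# describe() summarizes the numeric columns; the text column "name"
# is left out by default.
stats = movies.describe()
print(stats)

# Individual statistics can be looked up by row and column label.
longest = stats.loc["max", "duration"]
```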
In conclusion, Pandas is a useful library for data manipulation and analysis. It provides powerful and easy-to-use data structures, as well as the means to quickly perform operations on them. We will discuss more in the upcoming articles. Till then, enjoy the holidays and analyze your actions to calculate your percentage of productivity. Make sure it keeps increasing, because procrastination is directly proportional to the number of holidays. Peace out.