
Introduction to Pandas

Ankisha Sharma

Published: December 29, 2019

Python was the third most used programming language in 2019, a popularity that reflects how useful, efficient, and clean the language is. Some applications of Python include:

Web development
Web Scraping
Testing
Data Analysis

Data, as we all know, is the most important entity in the world today, and it is being generated at a rapid rate. About 90% of the data present in the world today was generated over the course of the last two years, and at our current pace 2.5 quintillion bytes of data are created each day. Its value is no secret either: big companies have faced Senate grillings over the privacy of the data they acquire. To sum it all up, data is the new oil. In the words of Clive Humby:

“Data is the new oil. It’s valuable, but if unrefined it cannot really be used. It has to be changed into gas, plastic, chemicals, etc to create a valuable entity that drives profitable activity; so must data be broken down, analyzed for it to have value.”

Data Life-Cycle

Data is stored in different formats: it can be a CSV (comma-separated values) file, an Excel file, or an HTML file. So the first need is to convert that data into a single format and store it somewhere. That is where data warehousing plays a major role.
Once the data is stored, one can perform further analysis on it. This can include predictive modeling, joining or merging of data, and much more.
The human brain works very well with pictures. So the next step is to plot the data in the form of graphs, which is known as data visualization.
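The first step of the life cycle, bringing differently formatted sources into one format, can be sketched with pandas roughly as follows. This is only an illustration: the two in-memory text sources and their contents are made up, and a tab-separated source stands in for a second format. For Excel and HTML sources, `pd.read_excel` and `pd.read_html` play the same role but need optional dependencies installed.

```python
import io

import pandas as pd

# Two made-up sources holding the same kind of records in different formats.
csv_source = "movie_id,name,year\n1,Movie A,1952\n2,Movie B,1957\n"
tsv_source = "movie_id\tname\tyear\n3\tMovie C\t1960\n"

# Read each source into a DataFrame.
from_csv = pd.read_csv(io.StringIO(csv_source))
from_tsv = pd.read_csv(io.StringIO(tsv_source), sep="\t")

# Combine everything into one table and write it back out in a single format.
combined = pd.concat([from_csv, from_tsv], ignore_index=True)
single_format = combined.to_csv(index=False)

print(combined)
```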

Let us get into some more details about data analysis.

What is Data Analysis?

As per Wikipedia, data analysis is a process of inspecting, cleansing, transforming, and modeling data to discover useful information, inform conclusions, and support decision-making.

For instance, suppose you have a dataset about school dropouts in different states of India, ranging from primary to high school, and you want to analyze the percentage of girl child dropouts in a particular state. What should be done? You have to perform certain analyses on the dataset, and the result should show you the dropout percentage for girls in the selected state and how much it has increased.
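A minimal sketch of that kind of analysis could look like this. The table, its column names, the state names, and every number in it are invented purely for illustration; they are not real dropout statistics.

```python
import pandas as pd

# Made-up figures for illustration only; not real statistics.
dropouts = pd.DataFrame({
    "state": ["State X", "State X", "State Y", "State Y"],
    "year": [2017, 2018, 2017, 2018],
    "girls_enrolled": [1000, 1200, 800, 900],
    "girls_dropped_out": [50, 90, 40, 36],
})

# Keep only the rows for the selected state.
state_x = dropouts[dropouts["state"] == "State X"].copy()

# Dropout percentage per year for that state.
state_x["dropout_pct"] = 100 * state_x["girls_dropped_out"] / state_x["girls_enrolled"]

# Year-over-year change, in percentage points.
state_x["pct_point_change"] = state_x["dropout_pct"].diff()

print(state_x[["year", "dropout_pct", "pct_point_change"]])
```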

Pandas

To perform data analysis in Python, we import a module called Pandas. Pandas is a software library written for the Python programming language for data manipulation and analysis. It is built on top of NumPy and works closely with SciPy and Matplotlib. Matplotlib is a data visualization module for Python. NumPy is the fundamental package for scientific computing: it provides a powerful n-dimensional array object and tools for integrating with C or C++ code, and it supports linear algebra, Fourier transforms, random number generation, and more. SciPy is an open-source Python module for scientific and technical computing, with submodules for optimization, linear algebra, integration, and so on. Pandas uses Series as its one-dimensional data structure and DataFrame as its two-dimensional, tabular data structure. It provides an efficient way to slice the data and a flexible way to merge, concatenate, or reshape it. It also handles missing data easily, which is a very prominent headache for data analysts out there.
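To make these ideas a little more concrete, here is a small sketch of a Series, a DataFrame, missing-data handling, and a merge. All names and values are hypothetical and chosen only for illustration.

```python
import numpy as np
import pandas as pd

# A Series is a one-dimensional labelled array; np.nan marks a missing value.
ratings = pd.Series([8.1, np.nan, 7.4], index=["Movie A", "Movie B", "Movie C"])

# A DataFrame is a two-dimensional table with named columns.
movies = pd.DataFrame({"movie_id": [1, 2, 3],
                       "name": ["Movie A", "Movie B", "Movie C"]})
durations = pd.DataFrame({"movie_id": [1, 2, 3],
                          "duration": [95, 102, 96]})

# Handling missing data: count the gaps, then fill them with the mean
# of the values that are present (mean() skips NaN by default).
n_missing = ratings.isna().sum()
filled = ratings.fillna(ratings.mean())

# Merging: join the two tables on their shared key column.
merged = movies.merge(durations, on="movie_id")

print(filled)
print(merged)
```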

Working with pandas

There are a lot of operations that can be performed using Pandas. Let us look at one of them, known as slicing. Suppose you are provided with a movies dataset in an Excel file that contains the movie ID, the name of the movie, the year it was released, its rating, and the duration of the film. (Well, a great list for the holidays!)

Suppose you want to see only the first five rows of the dataset. This is where head() comes into play; by default it returns the first five rows.
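A minimal sketch, assuming a small stand-in for the movies dataset. The column names (`movie_id`, `name`, `year`, `rating`, `duration`) and all values are hypothetical, since the original file is not shown here.

```python
import pandas as pd

# Hypothetical stand-in for the movies dataset.
movies = pd.DataFrame({
    "movie_id": [1, 2, 3, 4, 5, 6, 7],
    "name": ["Movie A", "Movie B", "Movie C", "Movie D",
             "Movie E", "Movie F", "Movie G"],
    "year": [1948, 1952, 1957, 1960, 1975, 1994, 2008],
    "rating": [7.9, 8.1, 8.5, 8.0, 8.3, 8.9, 9.0],
    "duration": [95, 102, 96, 109, 133, 142, 152],
})

# head() returns the first five rows; head(n) returns the first n.
first_five = movies.head()
print(first_five)
```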

Similarly, if you want the last 2 rows, use tail(2).
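Under the same assumptions (a made-up movies table with hypothetical column names), that could look like this:

```python
import pandas as pd

# Hypothetical stand-in for the movies dataset.
movies = pd.DataFrame({
    "movie_id": [1, 2, 3, 4, 5, 6, 7],
    "name": ["Movie A", "Movie B", "Movie C", "Movie D",
             "Movie E", "Movie F", "Movie G"],
    "year": [1948, 1952, 1957, 1960, 1975, 1994, 2008],
})

# tail(2) returns the last two rows; without an argument it returns five.
last_two = movies.tail(2)
print(last_two)
```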

What if you want to find the names of the movies released between 1950 and 1960? A condition on the release year does the job.
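One way to write it is sketched below. The table and its column names are hypothetical, and "between" is assumed to include both 1950 and 1960.

```python
import pandas as pd

# Hypothetical stand-in for the movies dataset.
movies = pd.DataFrame({
    "movie_id": [1, 2, 3, 4, 5, 6, 7],
    "name": ["Movie A", "Movie B", "Movie C", "Movie D",
             "Movie E", "Movie F", "Movie G"],
    "year": [1948, 1952, 1957, 1960, 1975, 1994, 2008],
    "rating": [7.9, 8.1, 8.5, 8.0, 8.3, 8.9, 9.0],
})

# Build a boolean mask; each comparison needs its own parentheses
# because & binds more tightly than >= and <=.
in_fifties = (movies["year"] >= 1950) & (movies["year"] <= 1960)

# Apply the mask to the rows, then pick only the name column.
names = movies[in_fifties]["name"]
print(names)
```

The same mask can also be written as `movies["year"].between(1950, 1960)`, which includes both endpoints by default.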

Here we retrieve the movies by applying the condition and then select only the movie name; otherwise the result would include all the columns, including rating, duration, and ID. To get multiple columns, we need to pass a list of column names inside the indexing brackets, which is why the syntax ends up with double square brackets.
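A short sketch of selecting several columns at once, again with a made-up table and hypothetical column names:

```python
import pandas as pd

# Hypothetical stand-in for the movies dataset.
movies = pd.DataFrame({
    "movie_id": [1, 2, 3, 4, 5],
    "name": ["Movie A", "Movie B", "Movie C", "Movie D", "Movie E"],
    "year": [1948, 1952, 1957, 1960, 1975],
    "rating": [7.9, 8.1, 8.5, 8.0, 8.3],
    "duration": [95, 102, 96, 109, 133],
})

in_fifties = (movies["year"] >= 1950) & (movies["year"] <= 1960)

# The inner brackets are a plain Python list of column names;
# the outer brackets are the DataFrame indexing operator.
name_and_rating = movies[in_fifties][["name", "rating"]]
print(name_and_rating)
```

Passing a single string returns a Series, while passing a list (even a list with one name) returns a DataFrame.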

A great way to get a feel for the data is to use describe(). For the numeric columns of the dataset it provides the count, mean, standard deviation, minimum, maximum, and the 25th, 50th, and 75th percentiles.
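For example, on a made-up movies table (hypothetical column names and values):

```python
import pandas as pd

# Hypothetical stand-in for the movies dataset.
movies = pd.DataFrame({
    "movie_id": [1, 2, 3, 4, 5, 6, 7],
    "name": ["Movie A", "Movie B", "Movie C", "Movie D",
             "Movie E", "Movie F", "Movie G"],
    "year": [1948, 1952, 1957, 1960, 1975, 1994, 2008],
    "rating": [7.9, 8.1, 8.5, 8.0, 8.3, 8.9, 9.0],
    "duration": [95, 102, 96, 109, 133, 142, 152],
})

# describe() summarizes the numeric columns; the text column "name" is
# left out by default.
summary = movies.describe()
print(summary)

# Individual statistics can be looked up by row label and column name.
longest = summary.loc["max", "duration"]
```

Note that `movie_id` is summarized too because it is numeric, even though its mean carries no real meaning.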

In conclusion, Pandas is a useful library for data manipulation and analysis. It provides powerful and easy-to-use data structures, as well as the means to quickly perform operations on them. We will discuss more in the upcoming articles. Till then, enjoy the holidays and analyze your actions to calculate your percentage of productivity. Make sure it keeps increasing, because procrastination is directly proportional to the number of holidays. Peace out.
