Getting Started With Data Analysis in Python
Pavithra Nagaraj
6G, AI Researcher | Founder - Paaru Wireless | Director - Women in 6G
Python is a great programming language for data analysis, primarily because of its fantastic ecosystem of data-centric packages. pandas is one of those packages: it aims to provide fast and flexible data structures that make working with data easier and more intuitive.
Do you want to load a .csv or Excel file and easily manipulate the data in it?
Do you want to replace missing values in your data or ignore them altogether?
Do you want a quick statistical summary of your data?
Well, pandas has it all covered. It provides a set of tools to make working with data simple and efficient.
The topics in this post will enable you to:
1. Load your data into a pandas DataFrame.
2. Examine the basic statistics of the data.
3. Modify the values.
4. Finally, output the result to a new file.
Loading Data:
Loading data with pandas is quite easy. The library provides methods to load data from Excel files (xls, xlsx), csv, json and other formats. For this example I will be using data stored in a .csv (comma-separated values) file.
To load the data, we'll use the read_csv function. It takes a csv file and returns a DataFrame object, a table-like data structure that makes it easier to manipulate the data set and extract information. From now on, ufo will be the name of our DataFrame.
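As a minimal sketch of the loading step: the original data file is not included in this post, so the snippet below reads a small made-up sample from an in-memory string instead. The column names and values are assumptions for illustration only; with a real file you would pass its path to read_csv.

```python
import io

import pandas as pd

# Made-up sample standing in for the real file.
# The column names and values are assumptions, not the author's data.
csv_text = """City,State,Shape Reported,Duration Minutes
Ithaca,NY,TRIANGLE,5
Willingboro,NJ,OTHER,30
Holyoke,CO,OVAL,
Abilene,KS,DISK,12
"""

# read_csv accepts a file path or any file-like object.
# With a file on disk this would be pd.read_csv("ufo.csv"),
# where "ufo.csv" is a hypothetical file name.
ufo = pd.read_csv(io.StringIO(csv_text))

print(type(ufo))
print(ufo)
```

The empty field in the third row is read as a missing value (NaN), which is handy for trying out the missing-data tools later on.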
Viewing Data:
pandas provides some methods to inspect the data we are working on.
ufo.head()
Shows the first few rows of our DataFrame; the default is 5.
ufo.tail()
Similar to ufo.head(), tail returns the last few rows of our DataFrame.
ufo.describe()
describe shows a quick statistical summary of the numeric columns in your data.
The describe function can also show the core statistics of a particular column. Select the column with a string inside square brackets [] and call describe() on it, e.g. ufo['City'].describe().
ufo.shape and ufo.ndim
The shape attribute gives the size of the data set: it returns a tuple with the number of rows and the number of columns in the DataFrame. Another descriptive property is ndim, which gives the number of dimensions in your data; for a DataFrame this is 2.
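The viewing methods above can be tried together in a short sketch. The DataFrame here is built from made-up values so that the snippet runs on its own; the column names are assumptions.

```python
import pandas as pd

# Made-up sample data; column names and values are assumptions.
ufo = pd.DataFrame({
    "City": ["Ithaca", "Willingboro", "Holyoke", "Abilene",
             "Ithaca", "Valley City", "Crater Lake"],
    "State": ["NY", "NJ", "CO", "KS", "NY", "ND", "CA"],
    "Duration Minutes": [5.0, 30.0, None, 12.0, 3.0, 45.0, 10.0],
})

first_rows = ufo.head()                 # first 5 rows by default
last_two = ufo.tail(2)                  # pass a number to change how many rows
numeric_summary = ufo.describe()        # statistics for the numeric columns only
city_summary = ufo["City"].describe()   # count, unique, top, freq for a text column

print(first_rows)
print(last_two)
print(numeric_summary)
print(city_summary)
print(ufo.shape)   # (number of rows, number of columns)
print(ufo.ndim)    # number of dimensions
```

Note that describe() skips missing values, so the count it reports for a column can be lower than the number of rows.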
Selecting and Manipulating Data:
Selecting Columns
There are 3 main methods of selecting columns in pandas:
- using dot notation, e.g. data.column_name (this only works when the column name is a valid Python identifier),
- using square brackets and the name of the column as a string, e.g. data['column_name'],
- using numeric indexing and the iloc selector, e.g. data.iloc[:, <column_number>].
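A minimal sketch of the three approaches, using a small made-up DataFrame (the column names are assumptions). All three return the same column as a pandas Series.

```python
import pandas as pd

# Made-up sample data; column names and values are assumptions.
ufo = pd.DataFrame({
    "City": ["Ithaca", "Willingboro", "Holyoke"],
    "State": ["NY", "NJ", "CO"],
    "Duration Minutes": [5.0, 30.0, 12.0],
})

by_dot = ufo.City              # dot notation
by_name = ufo["City"]          # square brackets with the column name
by_position = ufo.iloc[:, 0]   # iloc: all rows, column at position 0

print(by_name)
```

Dot notation would not work for "Duration Minutes" because of the space in the name, so square brackets are the safer habit.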
Selecting Multiple Columns
Selecting multiple columns at the same time extracts a new DataFrame from your existing DataFrame. To select multiple columns, the syntax is:
- using square brackets with a list of column names, e.g. data[['column_name_1', 'column_name_2']]
- using numeric indexing with the iloc selector and a list of column numbers, e.g. data.iloc[:, [0,1,20,22]]
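Both forms can be sketched on a small made-up DataFrame (the column names are assumptions). Each selection returns a new DataFrame rather than a Series.

```python
import pandas as pd

# Made-up sample data; column names and values are assumptions.
ufo = pd.DataFrame({
    "City": ["Ithaca", "Willingboro", "Holyoke"],
    "State": ["NY", "NJ", "CO"],
    "Duration Minutes": [5.0, 30.0, 12.0],
})

# A list of names inside the square brackets selects several columns.
by_names = ufo[["City", "State"]]

# iloc with a list of column positions: here the first and third columns.
by_positions = ufo.iloc[:, [0, 2]]

print(by_names)
print(by_positions)
```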
Selecting Rows
Rows in a DataFrame are typically selected using the iloc/loc selection methods, or using logical selectors (selecting based on the value of another column or variable).
The basic methods are:
- numeric row selection using the iloc selector, e.g. data.iloc[0:10, :], to select the first 10 rows.
- label-based row selection using the loc selector, e.g. data.loc[23, :], to select the row whose index label is 23 (with the default index the labels are simply the row numbers, so this is most useful once you have set an "index" on your DataFrame).
- logical-based row selection using evaluated statements, e.g. data[data["City"] == "Ithaca"], to select the rows where the City value is "Ithaca".
We can also filter on multiple values using the built-in isin() method on a column, e.g. ufo[ufo["City"].isin(list_of_cities)].
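The row selection methods above can be sketched as follows. The data is made up and the column names are assumptions; the DataFrame keeps the default integer index, so the loc label happens to match the row position.

```python
import pandas as pd

# Made-up sample data; column names and values are assumptions.
ufo = pd.DataFrame({
    "City": ["Ithaca", "Willingboro", "Holyoke", "Abilene", "Ithaca"],
    "State": ["NY", "NJ", "CO", "KS", "NY"],
})

first_two = ufo.iloc[0:2, :]    # rows at positions 0 and 1
by_label = ufo.loc[3, :]        # row whose index label is 3
ithaca = ufo[ufo["City"] == "Ithaca"]                    # boolean filter
selected = ufo[ufo["City"].isin(["Ithaca", "Abilene"])]  # match any listed value

print(first_two)
print(by_label)
print(ithaca)
print(selected)
```

The filtered results keep their original index labels, which is why loc and iloc can give different rows after filtering.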
Removing or Deleting Data
To delete rows and columns from DataFrames, pandas uses the "drop" method.
Removing columns - To delete a column, or multiple columns, pass the name of the column(s) and specify the "axis" as 1.
Removing rows - To delete a row, or multiple rows, pass the label of the row(s) and specify the "axis" as 0.
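A short sketch of both cases on made-up data (the column names are assumptions). By default drop returns a new DataFrame and leaves the original untouched.

```python
import pandas as pd

# Made-up sample data; column names and values are assumptions.
ufo = pd.DataFrame({
    "City": ["Ithaca", "Willingboro", "Holyoke", "Abilene"],
    "State": ["NY", "NJ", "CO", "KS"],
    "Duration Minutes": [5.0, 30.0, 12.0, 8.0],
})

# axis=1 drops columns, given by name (a single name or a list of names).
without_state = ufo.drop("State", axis=1)
only_city = ufo.drop(["State", "Duration Minutes"], axis=1)

# axis=0 drops rows, given by index label.
without_rows = ufo.drop([0, 2], axis=0)

print(without_state)
print(only_city)
print(without_rows)
```

To keep the change, assign the result back, e.g. ufo = ufo.drop("State", axis=1).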
We can also use ufo.dropna() to remove incomplete data from our DataFrame.
To remove the rows with incomplete or missing data, call ufo.dropna() (axis=0 is the default).
To remove the columns with incomplete or missing data, call ufo.dropna(axis=1).
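Here is a minimal sketch of both dropna calls on made-up data with one missing value (the column names are assumptions). The fillna call at the end is not covered in the text above; it is included only as one possible way to replace missing values instead of dropping them.

```python
import pandas as pd

# Made-up sample data with one missing value; names and values are assumptions.
ufo = pd.DataFrame({
    "City": ["Ithaca", "Willingboro", "Holyoke", "Abilene"],
    "State": ["NY", "NJ", "CO", "KS"],
    "Duration Minutes": [5.0, 30.0, None, 12.0],
})

complete_rows = ufo.dropna()           # drop every row that has a missing value
complete_cols = ufo.dropna(axis=1)     # drop every column that has a missing value

# Alternative (an addition to the article's examples): replace instead of drop.
filled = ufo.fillna({"Duration Minutes": 0})

print(complete_rows)
print(complete_cols)
print(filled)
```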
Exporting and Saving a Pandas DataFrame:
After manipulation, the next step is saving your data back to csv format with the to_csv method. Data output in pandas is as simple as loading data.
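As a sketch of the export step, the snippet below writes a small made-up DataFrame to a temporary folder and reads it back to check the result. The file name ufo_clean.csv is hypothetical, and so are the column names.

```python
import os
import tempfile

import pandas as pd

# Made-up sample data; column names and values are assumptions.
ufo = pd.DataFrame({
    "City": ["Ithaca", "Willingboro", "Holyoke"],
    "State": ["NY", "NJ", "CO"],
    "Duration Minutes": [5.0, 30.0, 12.0],
})

# Hypothetical output file inside a temporary folder.
out_path = os.path.join(tempfile.mkdtemp(), "ufo_clean.csv")

# index=False leaves the row labels out of the file.
ufo.to_csv(out_path, index=False)

# Read the file back to confirm what was written.
reloaded = pd.read_csv(out_path)
print(reloaded)
```

Without index=False, the index is written as an extra first column, which then shows up as an unnamed column when the file is loaded again.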
pandas is a really powerful and fun library for data manipulation and analysis, with easy syntax and fast operations. This introductory article is just the tip of the iceberg; you can do much more with pandas by exploring the rest of its tools.
Happy Learning!! Happy Coding!!
If you like my articles, do share your thoughts in the comments section below as I learn just as much from you as you do from me.