Data Analysis with Python 3.7.2 and Pandas: Data Frame Class

The objective of this article is to introduce Python for the preparation (data munging or wrangling) of massive data sets, which may be too large for the memory (RAM) of a computer. This step is addressed by introducing features of the pandas library and its DataFrame class: reading and writing files; managing a data table and the types of its variables; sampling; discretizing; grouping modalities; elementary univariate and bivariate descriptions; and concatenating and joining tables.

2 Series and DataFrame types

The NumPy library introduces the array type, essential for manipulating matrices in scientific computing, while pandas defines the Series class (time series) and the DataFrame class, or data table, which is crucial in statistics.

2.1 Series

The Series class associates two one-dimensional arrays. The first holds the values; the second is the index that labels them, which is often a series of dates. This type was introduced mainly for applications in econometrics and finance, where Python is widely used.
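As a minimal sketch of this pairing of values and index, the following builds a Series over a daily date index. The dates and prices are made up for illustration and are not part of the article's data.

```python
import pandas as pd

# Hypothetical daily prices, indexed by date (illustrative values only)
dates = pd.date_range("2019-01-01", periods=4, freq="D")
prices = pd.Series([10.0, 10.5, 10.2, 10.8], index=dates, name="price")

# The two underlying one-dimensional arrays
values = prices.values   # NumPy array holding the values
index = prices.index     # DatetimeIndex holding the labels

# Label-based access through the index
second_day = prices.loc[pd.Timestamp("2019-01-02")]

# Index-aware operation: day-to-day change (the first entry is NaN)
change = prices.diff()
```

Because the index travels with the values, operations such as diff() or alignment between two Series work on labels rather than on raw positions.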

2.2 DataFrame

This class is similar to the data frame of the R programming language. It associates a single row index with columns, or variables, each of which has its own type (integer, real, boolean, character). It is a two-dimensional array with row and column indexes, but it can also be seen as a collection of Series sharing the same index. The columns (names of variables) are accessed by key, much like the entries of a dict (dictionary).
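The view of a DataFrame as a collection of Series sharing one index can be sketched as follows. The column names and values are hypothetical and serve only as an illustration.

```python
import pandas as pd

# Two Series sharing the same row index (made-up values)
idx = ["a", "b", "c"]
age = pd.Series([22, 35, 58], index=idx)
member = pd.Series([True, False, True], index=idx)

# A DataFrame built from a dict of Series: the keys become column names
people = pd.DataFrame({"age": age, "member": member})

col = people["age"]                           # each column is a Series
same_index = col.index.equals(people.index)   # columns share the row index
col_names = list(people.columns)              # column labels are held in an Index
types = people.dtypes                         # one dtype per column
```

Each column keeps its own type, which is what distinguishes a DataFrame from a plain NumPy matrix.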

We will execute the lines of code one by one or rather result by result, in an IPython or Jupyter notebook.

# Example of a data frame
import pandas as pd
data = {"state": ["Ohio", "Ohio", "Ohio", "Nevada", "Nevada"],
        "year": [2000, 2001, 2002, 2001, 2002],
        "pop": [1.5, 1.7, 3.6, 2.4, 2.9]}
frame = pd.DataFrame(data)

# order of the columns
pd.DataFrame(data, columns=["year", "state", "pop"])

# row index and missing values (NaN)
frame2 = pd.DataFrame(data, columns=["year", "state", "pop", "debt"],
                      index=["one", "two", "three", "four", "five"])

# list of the columns
frame.columns

# values of a column
frame["state"]
frame.year

# "imputation"
frame2["debt"] = 16.5

# define a new variable
frame2["eastern"] = frame2.state == "Ohio"

# delete the variable
del frame2["eastern"]
2.3 Example

The data on the sinking of the Titanic illustrate the use of pandas. They are either read directly from their URL or loaded from the Python working directory.

# Importations 
import pandas as pd 
import numpy as np 

# Read the first rows; path is the directory (or URL prefix) holding the file
path = ""
df = pd.read_csv(path + "titanic-train.csv", nrows=5)
print(df)
df.tail()

# Read the whole file
df = pd.read_csv(path + "titanic-train.csv")
df.head()
type(df)
df.dtypes

# Some variables are unusable
# Select the useful columns
df = pd.read_csv(path + "titanic-train.csv",
                 usecols=[1, 2, 4, 5, 6, 7, 9, 11], nrows=5)
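Columns can also be selected by name instead of by position, which is easier to read back later. The sketch below uses a small inline text as a stand-in for titanic-train.csv; the header is assumed to follow the usual Titanic training file layout and the three rows are made up.

```python
import io
import pandas as pd

# Inline stand-in for titanic-train.csv (assumed header, made-up rows)
csv_text = """PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Doe, Mr. John",male,22,1,0,T1,7.25,,S
2,1,1,"Roe, Mrs. Jane",female,38,1,0,T2,71.28,C85,C
3,1,3,"Poe, Miss. Ann",female,26,0,0,T3,7.92,,S
"""

# Selection by position
by_position = pd.read_csv(io.StringIO(csv_text), usecols=[1, 2, 4, 5, 9, 11])

# Same selection by column name
by_name = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["Survived", "Pclass", "Sex", "Age", "Fare", "Embarked"],
)

same = by_position.equals(by_name)  # both calls select the same columns
```

In both cases the unselected columns are never loaded, which saves memory on large files.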

At this step, the pandas type called category comes into play. One would expect to declare it in a dictionary when reading the data, as follows:

(dtype = {"Surv": pd.Categorical ...})

However, passing pd.Categorical in this way does not work.

The object type is therefore declared at read time and converted afterwards. It is strongly recommended to assign the correct type to each variable, if only to avoid meaningless operations such as arithmetic on category codes. The qualitative variables are first read as object:

df = pd.read_csv(path + "titanic-train.csv", skiprows=1, header=None,
                 usecols=[1, 2, 4, 5, 9, 11],
                 names=["Surv", "Classe", "Genre", "Age", "Prix", "Port"],
                 dtype={"Surv": object, "Classe": object,
                        "Genre": object, "Port": object})

To avoid the issues mentioned above, the variables are then redefined as categorical:

df["Classe"]=pd.Categorical(df["Classe"], ordered=False) 
df["Genre"]=pd.Categorical(df["Genre"], ordered=False) 
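In recent pandas versions, the string "category" is accepted in the dtype dictionary of read_csv, which gives the categorical type directly at read time. The sketch below assumes such a version and uses a small made-up inline table with the column names used above. It also checks that arithmetic on a categorical column is refused.

```python
import io
import pandas as pd

# Made-up rows using the article's column names
csv_text = """Surv,Classe,Genre,Age
0,3,male,22
1,1,female,38
1,3,female,26
"""

# Declare the qualitative variables as categorical while reading
df_cat = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"Surv": "category", "Classe": "category", "Genre": "category"},
)

kinds = df_cat.dtypes.astype(str).to_dict()      # dtype name per column
levels = list(df_cat["Genre"].cat.categories)    # modalities of Genre

# Arithmetic on a categorical column raises TypeError instead of
# silently computing on the codes
try:
    df_cat["Classe"] + 1
    arithmetic_allowed = True
except TypeError:
    arithmetic_allowed = False
```

With the object or integer type, the same addition could go through without any warning, which is the kind of unreliable operation that correct typing prevents.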

It is also possible to read all the variables first and then drop the unwanted ones:

df = df.drop(["Name", "Ticket", "Cabin"], axis=1)

2.4 Data Sampling

A pandas DataFrame is loaded entirely into memory. If the previous options (selecting columns, choosing the types of variables, etc.) are not enough, it is possible, before looking for a heavier hardware configuration and as a first approximation, to draw a simple random sample according to a uniform distribution. A stratified sample would require more work. This assumes that the number of lines in the file is known, as follows:

# For the Titanic data:
N = 891  # file size (data rows in the standard Titanic training file)
n = 200  # sample size (illustrative value)
lin2skipe = [0]  # do not read the first line (header)
# do not read N-n lines drawn at random
lin2skipe.extend(np.random.choice(np.arange(1, N+1), (N-n), replace=False))
df_small = pd.read_csv(path+"titanic-train.csv", skiprows=lin2skipe, header=None,
                       usecols=[1,2,4,5,9,11],
                       names=["Surv","Classe","Genre","Age", "Prix","Port"])
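When the number of lines is not known in advance, it can be counted by streaming through the file once without loading it. The following sketch applies the same skip-based sampling to a synthetic file; the file name big.csv, its columns, the sample size and the seed are all hypothetical choices made for this illustration.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Build a synthetic CSV file standing in for a large data file
tmp_dir = tempfile.mkdtemp()
csv_path = os.path.join(tmp_dir, "big.csv")
pd.DataFrame({"id": np.arange(1000), "x": np.arange(1000) * 0.5}).to_csv(
    csv_path, index=False
)

# Count the data rows by streaming through the file (header excluded)
with open(csv_path) as fh:
    N = sum(1 for _ in fh) - 1

n = 50                           # sample size
rng = np.random.default_rng(0)   # seeded so the draw is reproducible

# Data rows occupy file lines 1..N (line 0 is the header):
# draw N-n of them without replacement and skip them while reading
skip = rng.choice(np.arange(1, N + 1), size=N - n, replace=False)
sample = pd.read_csv(csv_path, skiprows=sorted(skip.tolist()))
```

Only the n retained rows are parsed into the DataFrame, so the memory footprint depends on the sample size rather than on the file size.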



