Class 13 - DATA TRANSFORMATION, SORTING & VISUALIZATION

Notes from the AI Basic Course by Irfan Malik & Dr Sheraz Naseer (Xeven Solutions)

If you want to excel in AI, keep yourself up to date.

Compete with yourself, don't compete with others.

Be Consistent & Persistent.

Structured data is data organized in tabular form (rows and columns).

Descriptive statistics give a summary of the data.

Features, attributes, and variables are different names for the columns.

Mean:

Average of the data

Median:

Middle value of the sorted data

Mode:

Most frequent value in the data

Mean is sensitive to extreme values.
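
To make the point about extreme values concrete, here is a minimal sketch using Python's standard `statistics` module. The salary figures are made up for illustration; they are not from the course dataset.

```python
import statistics

# Hypothetical salaries (in thousands); not taken from the course data.
salaries = [30, 32, 35, 35, 38, 40]
# Same list with one extreme value appended.
with_outlier = salaries + [500]

def summarize(values):
    """Return (mean, median, mode) for a list of numbers."""
    return (
        statistics.mean(values),
        statistics.median(values),
        statistics.mode(values),
    )

mean_a, median_a, mode_a = summarize(salaries)
mean_b, median_b, mode_b = summarize(with_outlier)

print(f"without outlier: mean={mean_a:.1f}, median={median_a}, mode={mode_a}")
print(f"with outlier:    mean={mean_b:.1f}, median={median_b}, mode={mode_b}")
```

A single extreme value pulls the mean far away from the bulk of the data, while the median and mode stay where they were.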

IQR (Inter Quartile Range):

IQR = Q3-Q1

Standard deviation is the measure of spread used alongside the mean.

IQR is the measure of spread used alongside the median.

If the dispersion in the data is low, the data is easier for a machine to process, but an ML model will learn less from it.

If the dispersion in the data is high, the data is harder for a machine to process, but an ML model will learn more from it.

Data dispersion and data spread mean the same thing; the IQR is one way of measuring it.

How you use the IQR depends on the purpose for which you are going to use the data.

The five-number summary (minimum, Q1, median, Q3, maximum) helps to check the distribution of the data.
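
As a rough illustration of why the two measures are paired this way, the sketch below computes both on a small made-up list, with and without one extreme value. The `iqr` helper and the choice of the "inclusive" quartile method are assumptions made for this example; other quartile conventions give slightly different numbers.

```python
import statistics

def iqr(values):
    """Interquartile range, Q3 - Q1, using the 'inclusive' quartile method."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 - q1

# Hypothetical measurements; not taken from the course data.
data = [10, 12, 13, 15, 16, 18, 20]
data_with_outlier = data + [200]

std_before = statistics.stdev(data)
std_after = statistics.stdev(data_with_outlier)
iqr_before = iqr(data)
iqr_after = iqr(data_with_outlier)

print(f"standard deviation: {std_before:.2f} -> {std_after:.2f}")
print(f"IQR:                {iqr_before:.2f} -> {iqr_after:.2f}")
```

Like the mean, the standard deviation reacts strongly to the extreme value, while the IQR, like the median, changes only a little.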
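
A five-number summary can be sketched with NumPy's `percentile` function. The `five_number_summary` helper and the horsepower values are hypothetical, chosen only to illustrate the idea.

```python
import numpy as np

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) of a sequence of numbers."""
    minimum, q1, median, q3, maximum = np.percentile(values, [0, 25, 50, 75, 100])
    return float(minimum), float(q1), float(median), float(q3), float(maximum)

# Hypothetical horsepower values; not taken from the course data.
horsepower = [100, 120, 130, 150, 160, 180, 200, 210, 400]
summary = five_number_summary(horsepower)

for name, value in zip(["min", "Q1", "median", "Q3", "max"], summary):
    print(f"{name}: {value}")
```

A large gap between Q3 and the maximum, compared with the gap between the minimum and Q1, suggests a long right tail or outliers on the high side.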

Correlation Analysis:

Relationship between two variables.

Correlation analysis is used to determine the relationship between two or more variables in a dataset. It helps us understand how changes in one variable are associated with changes in another variable.

Correlation values close to +1 or -1 indicate a strong relationship.

Correlation values close to 0 indicate a weak or no relationship.

The magnitude of the correlation value represents the strength of the relationship, and its sign gives the direction.

In a boxplot, the box represents the IQR.
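
The sketch below computes the Pearson correlation coefficient by hand for three made-up pairs of variables, to show what values near +1, -1, and 0 look like. The `pearson_r` helper and all the data are assumptions for illustration; in the notebook, `df.corr()` does this for every pair of numeric columns.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the means.
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)

# Hypothetical values; not taken from the course data.
engine_hp = [100, 150, 200, 250, 300]
price = [20, 30, 40, 50, 60]       # rises in step with HP
city_mpg = [35, 30, 25, 20, 15]    # falls in step as HP rises
paint_code = [2, 1, 3, 1, 2]       # no linear relationship with HP

r_price = pearson_r(engine_hp, price)
r_mpg = pearson_r(engine_hp, city_mpg)
r_paint = pearson_r(engine_hp, paint_code)

print(f"HP vs price:      {r_price:+.2f}")
print(f"HP vs city MPG:   {r_mpg:+.2f}")
print(f"HP vs paint code: {r_paint:+.2f}")
```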

Google Colab Link:

https://colab.research.google.com/drive/11cQwJO_oT1GO-mwbamQLjIhuM3B1eDSI#scrollTo=GGyDovL2QDLa

import pandas as pd

import numpy as np

import seaborn as sns #visualisation

import matplotlib.pyplot as plt #visualisation

%matplotlib inline

sns.set(color_codes=True)

df = pd.read_csv("data.csv")

# To display the top 5 rows

df.head(5)

df.tail(5) # To display the bottom 5 rows

df.dtypes

df = df.drop(['Vehicle Size'], axis=1)

df.head(5)

df = df.rename(columns={
    "Engine HP": "HP",
    "Engine Cylinders": "Cylinders",
    "Transmission Type": "Transmission",
    "Driven_Wheels": "Drive Mode",
    "highway MPG": "MPG-H",
    "city mpg": "MPG-C",
    "MSRP": "Price",
})

df.head(5)

df.shape

duplicate_rows_df = df[df.duplicated()]

print("number of duplicate rows: ", duplicate_rows_df.shape[0])

df.count() # Counts the non-null values in each column

df = df.drop_duplicates()

df.head(5)

df.count()

print(df.isnull().sum())

df = df.dropna() # Dropping the missing values.

df.count()

print(df.isnull().sum()) # After dropping the values
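
The cleaning steps above can be tried on a small table without the course's `data.csv`. This is a minimal sketch; the `cars` DataFrame and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table: row 1 duplicates row 0, and row 3 has a missing HP.
cars = pd.DataFrame({
    "Make": ["BMW", "BMW", "Audi", "Ford", "Fiat"],
    "HP": [300.0, 300.0, 220.0, np.nan, 160.0],
    "Price": [46135, 46135, 34500, 27000, 29450],
})

# Count rows that are identical to an earlier row, then remove them.
n_duplicates = int(cars.duplicated().sum())
deduplicated = cars.drop_duplicates()

# Count missing values per column, then drop rows containing any.
missing_per_column = deduplicated.isnull().sum()
cleaned = deduplicated.dropna()

print("duplicate rows:", n_duplicates)
print(missing_per_column)
print("rows before/after cleaning:", len(cars), "->", len(cleaned))
```

Dropping rows is the simplest way to handle missing values; whether it is appropriate depends on how much data would be lost.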

sns.boxplot(x=df['Price'])

sns.boxplot(x=df['HP'])

sns.boxplot(x=df['Cylinders'])

Q1 = df.quantile(0.25, numeric_only=True) # numeric_only skips text columns such as Make

Q3 = df.quantile(0.75, numeric_only=True)

IQR = Q3 - Q1

print(IQR)
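
The notes stop at printing the IQR. One common way to use it, sketched here as an assumption rather than as the course's own step, is to drop rows that lie more than 1.5 × IQR outside the box (often called Tukey's rule). The `cars` table and the 1.5 factor are illustrative choices.

```python
import pandas as pd

# Hypothetical table; 500000 is far above the other prices.
cars = pd.DataFrame({
    "HP": [150, 160, 170, 180, 190, 200, 210],
    "Price": [20000, 22000, 24000, 26000, 28000, 30000, 500000],
})

q1 = cars.quantile(0.25)
q3 = cars.quantile(0.75)
iqr = q3 - q1

# Fences 1.5 * IQR below Q1 and above Q3 (assumed threshold).
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# A row is an outlier if any of its columns falls outside the fences.
is_outlier = ((cars < lower) | (cars > upper)).any(axis=1)
without_outliers = cars[~is_outlier]

print("rows before/after:", len(cars), "->", len(without_outliers))
```

Plotting the boxplots again after such a step is a quick way to check whether the extreme points are gone.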

df.corr(numeric_only=True) # Correlation between the numeric columns only

df.Make.value_counts().nlargest(15).plot(kind='bar', figsize=(10,5)) # value_counts() counts cars per make

plt.title("Number of cars by make (top 15)")

plt.ylabel('Number of cars')

plt.xlabel('Make');

df.Make.value_counts().nlargest(40).plot(kind='bar', figsize=(10,5))

plt.title("Number of cars by make")

plt.ylabel('Number of cars')

plt.xlabel('Make');

df.Year.value_counts().plot(kind='pie')

plt.show()

df

df.Make.value_counts().nlargest(20).plot(kind='bar', figsize=(15, 10)) # Larger figure

plt.title("Number of cars by make (top 20)")

plt.ylabel('Number of cars')

plt.xlabel('Make');

# Adjusting the Size of Figure

plt.figure(figsize=(10,5))

# calculating the Correlation

correlation = df.corr(numeric_only=True)

# Displaying the correlation using a heatmap

sns.heatmap(correlation,cmap="BrBG",annot=True) # BrBG: diverging colormap from brown to blue-green

#correlation

#AI #artificialintelligence #datascience #irfanmalik #drsheraz #xevensolutions #hamzanadeem
