Class 13 - DATA TRANSFORMATION, SORTING & VISUALIZATION Notes from the AI Basic Course by Irfan Malik & Dr Sheraz Naseer (Xeven Solutions)
If you want to excel in AI, keep yourself up to date.
Compete with yourself, not with others.
Be Consistent & Persistent.
Structured data is data organized in tabular form (rows and columns).
Descriptive statistics give a summary of the data.
Features, attributes, and variables are different names for columns.
Mean:
Average of the data
Median:
Middle value of the sorted data
Mode:
Most frequent value in the data
Mean is sensitive to extreme values.
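A minimal sketch of the point above, using Python's built-in statistics module. The price values are made up for illustration and are not from the course dataset.

```python
import statistics

# Hypothetical car prices (in thousands); not taken from the course data.
prices = [20, 22, 22, 25, 27]
# The same prices with one extreme value added.
prices_with_outlier = prices + [500]

mean_before = statistics.mean(prices)
median_before = statistics.median(prices)
mode_before = statistics.mode(prices)

mean_after = statistics.mean(prices_with_outlier)
median_after = statistics.median(prices_with_outlier)
mode_after = statistics.mode(prices_with_outlier)

# The mean jumps once the extreme value is added,
# while the median moves only slightly and the mode does not move at all.
print("mean:  ", mean_before, "->", round(mean_after, 2))
print("median:", median_before, "->", median_after)
print("mode:  ", mode_before, "->", mode_after)
```

This is why the median is usually the safer summary when the data contains extreme values.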
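IQR (Interquartile Range):
IQR = Q3 - Q1, where Q1 and Q3 are the first and third quartiles (the 25th and 75th percentiles).
Standard deviation is the measure of spread used together with the mean.
IQR is the measure of spread used together with the median.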
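The pairing above can be illustrated with a short sketch. It assumes linear interpolation between data points for the quartiles (the "inclusive" method of Python's statistics module), and the price values are hypothetical.

```python
import statistics


def iqr(values):
    """Return Q3 - Q1, interpolating linearly between data points."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 - q1


# Hypothetical prices (in thousands), with and without one extreme value.
prices = [20, 22, 22, 25, 27]
prices_with_outlier = prices + [500]

std_before = statistics.pstdev(prices)
std_after = statistics.pstdev(prices_with_outlier)
iqr_before = iqr(prices)
iqr_after = iqr(prices_with_outlier)

# Standard deviation is built on the mean, so one extreme value inflates it.
# IQR is built on quartiles, so it changes far less.
print("std:", round(std_before, 2), "->", round(std_after, 2))
print("IQR:", iqr_before, "->", iqr_after)
```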
If the data has low dispersion, it is easier for a machine to process, but an ML model has less to learn from it.
If the data has high dispersion, it is harder for a machine to process, but an ML model has more to learn from it.
Data dispersion and data spread mean the same thing, and IQR is one way to measure it.
How you interpret the IQR depends on the purpose for which you are going to use the data.
The 5-number summary (minimum, Q1, median, Q3, maximum) helps to check the distribution of the data.
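As a rough sketch, the five values can be computed with the standard library. The function name five_number_summary and the cylinder counts are hypothetical, and the quartiles assume linear interpolation.

```python
import statistics


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)


# Hypothetical cylinder counts for six cars.
cylinders = [4, 4, 6, 8, 8, 12]
summary = five_number_summary(cylinders)
print(summary)
```

A large gap between Q3 and the maximum, compared with the gap between the minimum and Q1, hints that the data is stretched toward high values.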
Correlation Analysis:
Relationship between two variables.
Correlation analysis is used to determine the relationship between two or more variables in a dataset. It helps us understand how changes in one variable affect another variable.
Correlation values close to +1 or -1 indicate a strong relationship.
Correlation values close to 0 indicate a weak or no relationship.
The magnitude of the correlation value represents the strength of the relationship, and its sign shows the direction.
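To make the +1, -1 and 0 cases concrete, here is a small sketch of the Pearson correlation coefficient, which is the default method used by pandas df.corr(). The pearson helper and all the numbers are hypothetical examples.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (covariance without the 1/n factor).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical values for five cars.
hp = [100, 150, 200, 250, 300]
price = [20, 30, 40, 50, 60]    # rises in step with hp
mpg = [40, 35, 30, 25, 20]      # falls in step with hp
unrelated = [2, 1, 3, 1, 2]     # no linear pattern against hp

r_price = pearson(hp, price)          # close to +1
r_mpg = pearson(hp, mpg)              # close to -1
r_unrelated = pearson(hp, unrelated)  # close to 0
print(r_price, r_mpg, r_unrelated)
```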
In a boxplot, the box represents the IQR.
Google Colab Link:
import pandas as pd
import numpy as np
import seaborn as sns #visualisation
import matplotlib.pyplot as plt #visualisation
%matplotlib inline
sns.set(color_codes=True)
df = pd.read_csv("data.csv")
# To display the top 5 rows
df.head(5)
df.tail(5) # To display the bottom 5 rows
df.dtypes # To check the data type of each column
df = df.drop(['Vehicle Size'], axis=1) # Dropping a column that is not needed
df.head(5)
df = df.rename(columns={
    "Engine HP": "HP",
    "Engine Cylinders": "Cylinders",
    "Transmission Type": "Transmission",
    "Driven_Wheels": "Drive Mode",
    "highway MPG": "MPG-H",
    "city mpg": "MPG-C",
    "MSRP": "Price",
})
df.head(5)
df.shape
duplicate_rows_df = df[df.duplicated()]
print("number of duplicate rows: ", duplicate_rows_df.shape[0])
df.count() # Counts the non-null values in each column
df = df.drop_duplicates()
df.head(5)
df.count()
print(df.isnull().sum())
df = df.dropna() # Dropping the missing values.
df.count()
print(df.isnull().sum()) # After dropping the values
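The duplicate and missing-value steps above can be tried on a tiny table first. This is only a sketch: the toy DataFrame below is hypothetical and stands in for the course's data.csv.

```python
import pandas as pd

# Hypothetical toy data: row 1 repeats row 0, and row 2 has a missing HP.
toy = pd.DataFrame({
    "Make": ["BMW", "BMW", "Audi", "Fiat"],
    "HP": [300.0, 300.0, None, 160.0],
})

# duplicated() marks rows that are identical to an earlier row.
n_duplicates = int(toy.duplicated().sum())
deduped = toy.drop_duplicates()

# isnull().sum() counts the missing values in each column.
missing_per_column = deduped.isnull().sum()
# dropna() removes every row that has at least one missing value.
cleaned = deduped.dropna()

print("duplicates:", n_duplicates)
print(missing_per_column)
print(cleaned)
```

Dropping rows is the simplest option; it is reasonable when only a small share of rows is affected.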
sns.boxplot(x=df['Price'])
sns.boxplot(x=df['HP'])
sns.boxplot(x=df['Cylinders'])
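# numeric_only=True keeps text columns such as Make out of the calculation,
# which recent pandas versions would otherwise reject with an error
Q1 = df.quantile(0.25, numeric_only=True)
Q3 = df.quantile(0.75, numeric_only=True)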
IQR = Q3 - Q1
print(IQR)
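A common next step, not shown in these notes, is to use the IQR to flag outliers. The sketch below assumes the widely used 1.5 * IQR rule (the same rule that usually sets boxplot whiskers) and runs on a hypothetical toy table, not on the course data.

```python
import pandas as pd

# Hypothetical toy data; the HP value 900 is an obvious outlier.
toy = pd.DataFrame({
    "HP": [100, 120, 130, 150, 160, 900],
    "Price": [20000, 22000, 25000, 27000, 30000, 31000],
})

q1 = toy["HP"].quantile(0.25)
q3 = toy["HP"].quantile(0.75)
iqr = q3 - q1

# Assumed rule: anything beyond 1.5 * IQR from the box is treated as an outlier.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

is_outlier = (toy["HP"] < lower_fence) | (toy["HP"] > upper_fence)
filtered = toy[~is_outlier]

print("fences:", lower_fence, upper_fence)
print(filtered)
```

The factor 1.5 is a convention, so it can be loosened or tightened depending on how aggressively outliers should be removed.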
df.corr(numeric_only=True) # Correlation between the numeric columns only
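# value_counts() counts the rows per make, so this chart shows the number of cars, not their price
df.Make.value_counts().nlargest(15).plot(kind='bar', figsize=(10,5))
plt.title("Number of cars by make (top 15)")
plt.ylabel('Number of cars')
plt.xlabel('Make');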
df.Make.value_counts().nlargest(40).plot(kind='bar', figsize=(10,5))
plt.title("Number of cars by make")
plt.ylabel('Number of cars')
plt.xlabel('Make');
df.Year.value_counts().plot(kind='pie')
plt.show()
df
df.Make.value_counts().nlargest(20).plot(kind='bar', figsize=(15, 10)) # Larger figure for the top 20 makes
plt.title("Number of cars by make (top 20)")
plt.ylabel('Number of cars')
plt.xlabel('Make');
# Adjusting the Size of Figure
plt.figure(figsize=(10,5))
# Calculating the correlation (numeric columns only)
correlation = df.corr(numeric_only=True)
# Displaying the correlation using a heat map
sns.heatmap(correlation, cmap="BrBG", annot=True) # BrBG: diverging colormap from brown to blue-green
#correlation