Python for Data Science!
There are many programming languages for data science, so why Python in particular? Because it offers a large collection of pre-built libraries and functions, which makes many data tasks quicker to code than in most other languages.
Intro to Python:
We often use Python for data science and related AI areas because its libraries and pre-built functions make us more productive.
* In data science, we rely on libraries such as NumPy, Pandas, Seaborn, SciPy, and more.
* We will cover some basic Python concepts using NumPy and Pandas.
* All of these are very useful later, in the feature engineering part.
Pandas:
* Pandas is a third-party Python library. It is not part of the standard library and is usually installed with pip or conda.
* Pandas handles most feature engineering work, and much of day-to-day data science is done with it.
* It is one of the most important libraries for data science.
* We will look at some of its most useful features.
1. Indexing Technique:
* Helps to set, sort, and reset the index of a DataFrame.
import pandas as pd  # Import the pandas library
# Create the data
pe = {"aravind": [12,13,14,15,15], "ara": [12,13,14,15,15], "evan": [12,131,41,51,21], "rosy": [12,31,24,25,43]}
df = pd.DataFrame(pe, index = ["english","maths","physics","botany","biology"])
# Indexing Technique
sc = df.set_index('aravind')  # Use the 'aravind' column as the index (this replaces the subject labels)
print(sc)
sc = sc.sort_index()  # Sort the rows by index value
print(sc)
sc = sc.reset_index()  # Move 'aravind' back to a column and restore the default 0..n-1 index
print(sc)
2. Indexing Location:
* Helps to retrieve data by integer position (row and column numbers).
a = sc.iloc[0:3, :]  # Before the comma: rows; after the comma: columns
print(a)
# iloc - integer-location based indexing
# 0:3 - retrieves the first 3 rows (positions 0, 1, 2).
# : - retrieves all the columns.
# Positions start at 0 and end at n - 1, where n is the number of rows (or columns).
# You can extract any block of the data this way.
# Some examples:
b = sc.iloc[0:9, :]  # A slice that runs past the end simply returns all available rows
print(b)
c = sc.iloc[0:3, 1:3]  # First 3 rows, columns at positions 1 and 2
print(c)
# Copy the code and try it locally!
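To see the slices above with concrete numbers, here is a small self-contained sketch. The marks table and the variable names are made up for illustration; only the iloc calls mirror the ones in the text.

```python
import pandas as pd

# Hypothetical marks table, used only for illustration.
marks = pd.DataFrame(
    {
        "aravind": [12, 13, 14, 15, 15],
        "evan": [12, 131, 41, 51, 21],
        "rosy": [12, 31, 24, 25, 43],
    },
    index=["english", "maths", "physics", "botany", "biology"],
)

first_three = marks.iloc[0:3, :]   # rows at positions 0, 1, 2 and every column
block = marks.iloc[0:3, 1:3]       # same rows, columns at positions 1 and 2
past_end = marks.iloc[0:9, :]      # slice runs past the last row: all 5 rows come back

print(first_three)
print(block)
print(past_end.shape)
```

Note that the end of an iloc slice is exclusive, just like a normal Python slice.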
3. Filtering Technique:
* Helps to filter the rows of a DataFrame, a core part of feature engineering.
# Create the data
pe = {"english": [12,13,14,15,15], "tamil": [12,13,14,15,15], "maths": [12,131,41,51,21], "science": [12,31,24,25,43], "results": ["pass","fail","pass","fail","pass"]}
fil = pd.DataFrame(pe, index = ["aravind","raghul","robo","bot","fate"])
aa = fil['results'] == 'pass'  # Boolean mask: True where the result is 'pass'
print(fil[aa])  # Keeps only the rows where the mask is True
print(fil[~aa])  # ~ negates the mask, so this keeps the rows that did not pass
# Passing a Boolean mask inside the brackets is the filtering method.
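Masks can also be combined. The sketch below is one possible way to do that, assuming a second condition on the english marks; the threshold of 14 and the variable names are invented for the example.

```python
import pandas as pd

# Hypothetical results table, used only for illustration.
scores = pd.DataFrame(
    {
        "english": [12, 13, 14, 15, 15],
        "maths": [12, 131, 41, 51, 21],
        "results": ["pass", "fail", "pass", "fail", "pass"],
    },
    index=["aravind", "raghul", "robo", "bot", "fate"],
)

passed = scores["results"] == "pass"      # Boolean mask, one value per row
high_english = scores["english"] >= 14    # assumed threshold, for illustration

both = scores[passed & high_english]          # AND: both conditions hold
either = scores[passed | high_english]        # OR: at least one condition holds
neither = scores[~(passed | high_english)]    # NOT: no condition holds

print(both)
print(either)
print(neither)
```

Use `&`, `|`, and `~` rather than `and`, `or`, and `not`, and wrap each condition in parentheses when you write it inline.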
4. Add and remove rows and columns:
* Helps to add and remove columns and rows.
# Create the data
ne = {'firstname': ['aravind','kowshik','dubuke'], 'lastname': ['r','vasu','kumar'], 'favoritecolor': ['red','blue','green']}
nes = pd.DataFrame(ne)
# Add a new column built from existing columns
nes['fullname'] = nes['firstname'] + " " + nes['lastname']
print(nes['fullname'])
# Drop columns: to drop more than one, pass the names as a list
nes = nes.drop(columns = ['firstname','lastname'])  # drop() returns a new DataFrame
# Drop a row by its index label
nes = nes.drop(index = 0)
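The section title also mentions rows, so here is a minimal sketch that adds and removes both rows and columns. Using pd.concat to append a row is my choice for the example, not something the text prescribes, and the names are illustrative.

```python
import pandas as pd

# Hypothetical starting table with two people.
people = pd.DataFrame({"firstname": ["aravind", "kowshik"], "lastname": ["r", "vasu"]})

# Add a column built from two existing columns.
people["fullname"] = people["firstname"] + " " + people["lastname"]

# Add a row by concatenating a one-row DataFrame.
new_row = pd.DataFrame(
    {"firstname": ["dubuke"], "lastname": ["kumar"], "fullname": ["dubuke kumar"]}
)
people = pd.concat([people, new_row], ignore_index=True)

# Remove a row by its index label, and remove columns by name.
without_first = people.drop(index=0)
names_only = people.drop(columns=["firstname", "lastname"])

print(people)
print(without_first)
print(names_only)
```

Both drop calls return new DataFrames, so `people` itself still has all three rows and all three columns afterwards.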
5. GroupBy:
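* It groups the rows that share the same value in one or more columns.
* Helps to summarize each group, for example with a mean.
# Assumes data is a DataFrame that has 'make' and 'price' columns
data.groupby("make").mean(numeric_only = True)
# groupby('make') - groups the rows by the 'make' column, then takes the mean of each numeric column
data.groupby(['make','price']).mean(numeric_only = True)  # Other aggregations such as std() and var() work the same way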
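Since the data set behind the groupby calls is not shown, here is a self-contained sketch with a tiny invented car table. The column values are made up; the 'make' and 'price' column names follow the text.

```python
import pandas as pd

# Hypothetical car data, used only for illustration.
cars = pd.DataFrame(
    {
        "make": ["audi", "audi", "bmw", "bmw", "bmw"],
        "price": [30000, 34000, 40000, 42000, 44000],
        "mileage": [20, 18, 15, 14, 13],
    }
)

# Mean of every numeric column, one row per make.
mean_by_make = cars.groupby("make").mean(numeric_only=True)

# Several aggregations of one column at once.
price_stats = cars.groupby("make")["price"].agg(["mean", "std", "var"])

print(mean_by_make)
print(price_stats)
```

The grouped column ('make') becomes the index of the result, so individual groups can be looked up with `.loc["audi"]`.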
6. Reading Multiple Files:
* Helps to read files in several formats (CSV, Excel, JSON, HDF5).
import pandas as pd
data = pd.read_csv('data.csv')  # Reads a CSV file
# Include the file extension (.csv) in the file name
data = pd.read_excel('data.xlsx')  # Reads an Excel file (needs the openpyxl package for .xlsx)
data = pd.read_json('data.json')  # Reads a JSON file
data = pd.read_hdf('data.h5')  # Reads an HDF5 (.h5) file, useful for deep learning (needs the PyTables package)
# Each format has its own read_* function; pass the file name with its extension
7. Write Files:
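* Helps to save your data in other file formats.
import pandas as pd
data.to_json('data.json')  # Write the data to a JSON file
data.to_csv('data_copy.csv')  # Write the data to a CSV file
data.to_html('html_page.html')  # Write the data as an HTML table; you choose the file name
data.to_pickle('pickle_data.pkl')  # Write the data to a pickle file
# Called without a file name, to_json(), to_csv(), and to_html() return the result as a string instead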
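A quick way to check that writing and reading work together is a round trip. The sketch below writes a small invented table to a temporary folder and reads it back; it sticks to CSV and JSON so that no extra packages are needed.

```python
import os
import tempfile

import pandas as pd

# Hypothetical table, used only for illustration.
frame = pd.DataFrame({"name": ["a", "b"], "score": [1, 2]})

with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "data.csv")
    json_path = os.path.join(tmp, "data.json")

    frame.to_csv(csv_path, index=False)  # index=False leaves the row labels out of the file
    frame.to_json(json_path)

    from_csv = pd.read_csv(csv_path)
    from_json = pd.read_json(json_path)

print(from_csv)
print(from_json)
```

Without `index=False`, to_csv also writes the index, which then shows up as an extra unnamed column when the file is read back.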
8. Finding Missing Values:
* It helps to find missing values.
import pandas as pd
data = pd.read_csv('data.csv')
data.isnull()  # Returns Boolean values: True where a value is missing, False otherwise
data.isnull().sum()  # Counts the missing values in each column (feature)
data.isna().sum()  # isna() is an alias of isnull(), so this gives the same result
9. Handling Missing Values:
* It helps to manage missing values.
* Handling missing values needs more statistical analysis, but here we will see some general functions!
import pandas as pd
data = pd.read_csv('data.csv')
data.isna().sum()  # Count the missing values per column
data.dropna()  # Returns a copy without the rows that contain missing values
data.fillna(34)  # Returns a copy with missing values replaced by 34; you can use any value
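Here is a small worked sketch of the three calls on an invented table. Filling the numeric column with its mean is just one common choice that I am assuming for the example; the right strategy depends on your data.

```python
import numpy as np
import pandas as pd

# Hypothetical table with gaps, used only for illustration.
raw = pd.DataFrame(
    {
        "age": [25, np.nan, 31, np.nan],
        "city": ["chennai", "madurai", None, "salem"],
    }
)

missing_per_column = raw.isna().sum()   # how many values are missing in each column
dropped = raw.dropna()                  # keeps only the rows with no missing value

# Fill each column with its own replacement: the mean for age, a placeholder for city.
filled = raw.fillna({"age": raw["age"].mean(), "city": "unknown"})

print(missing_per_column)
print(dropped)
print(filled)
```

Neither dropna() nor fillna() changes `raw`; both return a new DataFrame.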
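10. Shuffling:
* Shuffling the rows removes any ordering in the data, which usually helps before splitting it or training a model.
* sample() draws rows at random from the data and returns them in shuffled order.
import pandas as pd
data = pd.read_csv('data.csv')
data.sample(frac = 0.8)  # Returns a random 80% of the rows, in shuffled order
# sample - draws random rows (shuffle)
# frac - the fraction of rows to return; use frac = 1 to shuffle the whole data set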
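The difference between shuffling everything and taking a random part is easy to check on a toy table. In this sketch the table and the random_state value are arbitrary choices of mine; random_state only makes the result repeatable.

```python
import pandas as pd

# Hypothetical table with ten rows, used only for illustration.
rows = pd.DataFrame({"id": range(10)})

# frac=1 returns every row in random order; reset_index renumbers them 0..9.
shuffled = rows.sample(frac=1, random_state=42).reset_index(drop=True)

# frac=0.8 returns a random 80% of the rows (8 of 10), without repeats.
subset = rows.sample(frac=0.8, random_state=42)

print(shuffled)
print(subset)
```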
NumPy:
It's time for NumPy!
* NumPy - Numerical Python.
* In NumPy we work with arrays and matrices.
Let's look at the differences between a list and an array.
* Array: contains elements of a single data type. Mathematical computations on arrays are faster than on lists, and arrays use less memory.
* List: can contain any mix of data types. It does not support element-wise mathematical operations directly, and it uses more memory than an array.
What is an array?
It's a collection of elements of the same type.
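The list versus array difference shows up as soon as you multiply. A minimal sketch, with illustrative variable names:

```python
import numpy as np

numbers_list = [1, 2, 3, 4]
numbers_array = np.array(numbers_list)

doubled_list = numbers_list * 2      # a list is repeated, not multiplied
doubled_array = numbers_array * 2    # an array multiplies every element

# Mixed numbers are converted to one common type, here float.
mixed = np.array([1, 2.5, 3])

print(doubled_list)
print(doubled_array)
print(mixed.dtype)
```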
1. Create an array!
import numpy as np  # Import the NumPy library
# Let's create some arrays
a = np.array([1,2,3,4])  # One-dimensional array
b = np.array([[1,2,3,4],[1,2,3,4]])  # Two-dimensional array (arrays inside an array)
c = np.array([1,2,3,4], dtype = 'float')  # The dtype parameter sets the data type
d = np.array((1,2,3,4))  # A tuple also works for creating an array
e = np.array([[[1,2,3,4],[1,2,3,4]]])  # Three-dimensional array (a 2D array inside an array)
2. Arange function:
* arange = array + range: the array version of Python's range function.
* It returns evenly spaced values within a given interval.
import numpy as np
# Let's create some ranges with arange()
a1 = np.arange(1,10)  # 10 is exclusive, so this returns the 9 numbers 1 to 9
b1 = np.arange(1,10,3)  # 3 is the step, so this returns 1, 4, 7
c1 = np.arange(1,12,dtype = 'float')  # A data type can be given here as well
3. Ones and Zeros:
* They create arrays filled with zeros or with ones.
# Let's create arrays with zeros()
import numpy as np
a2 = np.zeros(1,dtype = 'int')  # The default data type is float
b2 = np.zeros([2,3], dtype = 'int')  # The list gives the dimensions: 2 rows, 3 columns
# Let's create an array with ones()
a3 = np.ones([3,3], dtype = 'int')
4. Linspace:
* Linspace - linearly spaced.
* It creates an array of evenly spaced values. It is similar to arange(), but you give the number of values instead of the step.
# Let's create linearly spaced values
a3 = np.linspace(1,100, num = 5)  # Evenly spaced means the difference between neighboring values is the same
b3 = np.linspace(1,100, 5, endpoint = False)  # Use endpoint = False to leave out the end value
c3 = np.linspace(1,100,6,retstep = True)  # retstep = True returns a tuple: (array, step between values)
d3 = np.linspace(1,129,3,dtype = 'int')  # dtype works here as well
e3 = np.linspace(1,35)  # If num is not given, it defaults to 50 values
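To compare arange and linspace side by side, the sketch below reuses the calls from the text with different variable names of my own.

```python
import numpy as np

# arange: you choose the step, NumPy works out how many values fit.
by_step = np.arange(1, 10, 3)

# linspace: you choose how many values, NumPy works out the step.
by_count = np.linspace(1, 100, num=5)

# retstep=True also hands back the step that was used.
values, step = np.linspace(1, 100, 6, retstep=True)

print(by_step)    # 1, 4, 7
print(by_count)   # 1, 25.75, 50.5, 75.25, 100
print(step)       # about 19.8
```

arange leaves the end value out, while linspace includes it unless `endpoint=False` is passed.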
5. Random:
* Returns arrays of random numbers.
* We will look at four functions from np.random.
i) rand - uniformly distributed values in [0, 1).
ii) randn - values from the standard normal distribution (the n stands for normal).
iii) randint - uniformly distributed integers in a range.
iv) ranf - uniformly distributed floating-point numbers in [0, 1).
import numpy as np
# Let's try rand
a5 = np.random.rand(5,5)  # 5 rows and 5 columns of random numbers from a uniform distribution
# Let's try randn
b5 = np.random.randn(5,5)  # Random values drawn from the standard normal distribution
# Let's try ranf
c5 = np.random.ranf(5)  # 5 random floats
# Let's try randint
d5 = np.random.randint(1,10,size = [5,4])  # The end point (10) is exclusive
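Random output changes on every run, which makes examples hard to follow. One option, which the text does not cover and I am adding as an assumption, is to fix the seed with np.random.seed so the same numbers come back each time.

```python
import numpy as np

np.random.seed(0)                               # fix the seed (arbitrary value)
uniform = np.random.rand(2, 3)                  # floats in [0, 1)
normal = np.random.randn(2, 3)                  # standard normal values
ints = np.random.randint(1, 10, size=[5, 4])    # integers from 1 to 9

np.random.seed(0)                               # same seed again
uniform_again = np.random.rand(2, 3)            # so the same numbers again

print(uniform)
print(np.array_equal(uniform, uniform_again))
```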
6. Attributes:
* They help to check the number of dimensions, the shape, and the data type of an array.
# To check how many dimensions the array has, use ndim
e5 = np.ndim(e)  # e is the three-dimensional array created earlier, so this is 3
# To check the shape of the array, use shape
f5 = np.shape(e)  # Gives the shape of the array: (1, 2, 4)
# To find the data type of the elements, use dtype
g5 = e.dtype
print(g5)
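The same information is also available directly as attributes of the array. A short sketch, rebuilding the three-dimensional array so that the block runs on its own:

```python
import numpy as np

# Same values as the three-dimensional array e from the "Create an array" section.
e = np.array([[[1, 2, 3, 4], [1, 2, 3, 4]]])

print(e.ndim)    # number of dimensions: 3
print(e.shape)   # size along each dimension: (1, 2, 4)
print(e.size)    # total number of elements: 8
print(e.dtype)   # an integer type; the exact width depends on the platform
```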
7. Operations:
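* It helps to do element-wise mathematical operations on arrays.
a8 = np.array([1,2,3,4])  # Example array
a8 * 2  # Multiply every element by 2
a8 ** 2  # Raise every element to the power 2
a8 / 3  # Divide every element by 3
a9 = np.array([1,2,3])
b9 = np.array([1,2,3])
ab = a9 + b9  # Element-wise addition; np.add(a9, b9) gives the same result
# We can also use built-in functions such as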
np.sum(ab)
np.max(ab)
np.min(ab)
np.std(ab)
np.var(ab)
8. Broadcasting
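* It lets you do operations on arrays of different shapes, as long as the shapes are compatible.
z = np.array([[2],[3],[5]])  # Shape (3, 1)
y = np.array([1,2,3])  # Shape (3,)
z + y  # Both are stretched to shape (3, 3) before adding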
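To see what broadcasting does behind the scenes, this sketch compares z + y with the same sum written out on explicitly stretched arrays. np.broadcast_to is used here only to make the stretching visible.

```python
import numpy as np

z = np.array([[2], [3], [5]])   # shape (3, 1): one column
y = np.array([1, 2, 3])         # shape (3,): one row

total = z + y                   # NumPy stretches both to shape (3, 3)

# The same result, with the stretching done by hand.
z_wide = np.broadcast_to(z, (3, 3))   # the column repeated 3 times
y_tall = np.broadcast_to(y, (3, 3))   # the row repeated 3 times
explicit = z_wide + y_tall

print(total)
print(np.array_equal(total, explicit))
```

Each entry of the result is one element of z plus one element of y, so row 0 is 2 + [1, 2, 3].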
9. Manipulation:
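* Helps to manipulate arrays. NumPy has some handy functions for changing the shape and size of an array.
# Reshape
np.reshape(a9,(3,1))  # First the array, then the new shape (rows, columns)
# Resize
a10 = np.array([1,2,3,4,5,6])  # Example array
np.resize(a10,(3,3))  # Returns a new 3 x 3 array, repeating the data to fill it
a10.resize((3,4),refcheck = False)  # Resizes a10 in place, padding with zeros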
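reshape and the two kinds of resize behave differently when the new shape needs more elements than the array has. A small sketch with names of my own:

```python
import numpy as np

a = np.array([1, 2, 3])

# reshape keeps the same 3 elements and only changes the layout.
column = np.reshape(a, (3, 1))

# np.resize returns a new array and repeats the data to fill it.
repeated = np.resize(a, (3, 3))

# The resize method changes the array in place and pads with zeros.
padded = np.array([1, 2, 3])
padded.resize((2, 3), refcheck=False)

print(column)
print(repeated)
print(padded)
```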
10. Argmax and Argmin:
* Argument of the maximum and argument of the minimum.
* They give the index of the largest value and the index of the smallest value.
a10.argmax()
a10.argmin()
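On a two-dimensional array, argmax and argmin count positions as if the array were flattened, unless an axis is given. The sketch below uses a small invented array to show both cases.

```python
import numpy as np

# Hypothetical 2 x 3 array, used only for illustration.
scores = np.array([[3, 9, 1],
                   [7, 2, 8]])

flat_max = scores.argmax()          # position of 9 in the flattened array: 1
flat_min = scores.argmin()          # position of 1 in the flattened array: 2
max_per_row = scores.argmax(axis=1) # column of the largest value in each row

# Convert the flat position back to (row, column).
row, col = np.unravel_index(flat_max, scores.shape)

print(flat_max, flat_min)
print(max_per_row)
print(row, col)
```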
Hope you gained new knowledge!
Thank you!
Name: R. Aravindan
Position: Content Writer.
Company: Artificial Neurons.AI