I want to know what Python Pandas is


Python Zen Master: Jan 26 1121 am

python-tips/python_tips.ipynb at main · CalebCurry/python-tips · GitHub


Reading comma separated values (CSV) into pandas DataFrame | Pythontic.com


Faster Python pandas library May19 2021


styleguide | Style guides for Google-originated open-source projects

Faster learning python

learn in simple way:

How to Think Like a Computer Scientist — How to Think Like a Computer Scientist: Learning with Python 3 (openbookproject.net)

Fantastic and short way to iterate with enumerate and loops

Loop like a native: while, for, iterators, generators - YouTube

Python 3 basic programming code for absolute beginners | aipython

30 python scripts examples (linuxhint.com)

Python Projects for Beginners to Advanced With Source Code (2021) - InterviewBit

Python If and Else | CodesDope


Authors Python : Swathi Arun – Medium

Python Pandas

Fantastic explanation of functions below, covering default arguments, *args and **kwargs

Python Functions - How To Define And Call A Python Function (softwaretestinghelp.com)


Python image processing library : opencv

Important Python Snippets for Image Processing | by Gopalakrishna Adusumilli | Nov, 2021 | Python in Plain English

100 tricks


All Syntax in a Single Page : https://medium.com/@msalmon00/helpful-python-code-snippets-for-data-exploration-in-pandas-b7c5aed5ecb9

#Code snippets for Pandas

import pandas as pd

Reading Files, Selecting Columns, and Summarizing
# reading in a file from the local computer or directly from a URL
# various file formats that can be read in or written out
Format Type     Data Description      Reader           Writer
text                  CSV            read_csv          to_csv
text                 JSON            read_json         to_json
text                 HTML            read_html         to_html
text             Local clipboard  read_clipboard     to_clipboard
binary             MS Excel          read_excel        to_excel
binary            HDF5 Format        read_hdf           to_hdf
binary           Feather Format     read_feather      to_feather
binary              Msgpack         read_msgpack      to_msgpack     (removed in pandas 1.0)
binary               Stata           read_stata        to_stata
binary                SAS             read_sas 
binary        Python Pickle Format   read_pickle       to_pickle
SQL                   SQL             read_sql          to_sql
SQL             Google Big Query      read_gbq          to_gbq

# to read about different file types, and further options for reading in files, visit: https://pandas.pydata.org/pandas-docs/version/0.20/io.html

df = pd.read_csv('local_path/file.csv')
df = pd.read_csv('https://file_path/file.csv')
# when reading in tables, you can specify the separator and name a column to be used as the index; separators can include tabs ("\t"), commas (","), pipes ("|"), etc.
df = pd.read_table('https://file_path/file', sep='|', index_col='column_x')
# examine the df data
df           # print the DataFrame (long ones are truncated to the first and last rows)
type(df)     # DataFrame
df.head()    # print the first 5 rows
df.head(10)  # print the first 10 rows
df.tail()    # print the last 5 rows
df.index     # "the index" (aka "the labels")
df.columns   # column names (which is "an index")
df.dtypes    # data types of each column
df.shape     # number of rows and columns
df.values    # underlying numpy array; DataFrames are backed by numpy arrays for efficiency
# select a column
df['column_y']         # select one column
type(df['column_y'])   # determine datatype of column (e.g., Series)
df.column_y            # select one column using the DataFrame attribute; does not work if the column name contains spaces
# summarize (describe) the DataFrame
df.describe()          # describe all numeric columns
df.describe(include=['object']) # describe all object columns
df.describe(include='all')      # describe all columns
# summarize a Series
df.column_y.describe()   # describe a single column
df.column_z.mean()       # only calculate the mean
df["column_z"].mean()    # alternate method for calculating mean
# count the number of occurrences of each value
df.column_y.value_counts()   # most useful for categorical variables, but can also be used with numeric variables

# filter df by one column, and print out values of another column

# string values need quotation marks; numeric values do not

df[df.column_y == "string_value"].column_z

df[df.column_y == 20].column_z
# display only the number of rows of the 'df' DataFrame
df.shape[0]
# display the 3 most frequent occurrences of a column in 'df'
df.column_y.value_counts().head(3)
Filtering and Sorting
# boolean filtering: only show df with column_z < 20
filter_bool = df.column_z < 20    # create a Series of booleans…
df[filter_bool]                # …and use that Series to filter rows
df[filter_bool].describe()     # describes a data frame filtered by filter_bool
df[df.column_z < 20]           # or, combine into a single step
df[df.column_z < 20].column_x  # select one column from the filtered results
df[df["column_z"] < 20].column_x     # alternate method
df[df.column_z < 20].column_x.value_counts()   # value_counts of resulting Series, can also use .mean(), etc. instead of .value_counts()
# boolean filtering with multiple conditions; indexes are in square brackets, conditions are in parens
df[(df.column_z < 20) & (df.column_y == 'string')] # ampersand for AND condition
df[(df.column_z < 20) | (df.column_z > 60)] # pipe for OR condition
# sorting
df.column_z.sort_values()    # sort a column (the older Series.order() was removed from pandas)
df.sort_values('column_z')   # sort a DataFrame by a single column
df.sort_values('column_z', ascending=False)     # use descending order instead
# Sort dataframe by multiple columns (the older DataFrame.sort() was removed; use sort_values)
df = df.sort_values(['col1', 'col2', 'col3'], ascending=[True, True, False])
# can also filter 'df' using pandas.Series.isin
df[df.column_x.isin(["string_1", "string_2"])]
Renaming, Adding, and Removing Columns
# rename one or more columns
df.rename(columns={'original_column_1': 'column_x', 'original_column_2': 'column_y'}, inplace=True) # saves changes
# replace all column names (in place)
new_cols = ['column_x', 'column_y', 'column_z']
df.columns = new_cols
# replace all column names when reading the file
df = pd.read_csv('df.csv', header=0, names=new_cols)
# add a new column as a function of existing columns
df['new_column_1'] = df.column_x + df.column_y

df['new_column_2'] = df.column_x * 1000 # can create new columns without for loops

# removing columns

df.drop('column_x', axis=1)   # axis=0 for rows, 1 for columns; returns a new DataFrame, does not drop in place
df.drop(['column_x', 'column_y'], axis=1, inplace=True) # drop multiple columns
# Lower-case all DataFrame column names
df.columns = df.columns.str.lower()
# Even more fancy DataFrame column re-naming
# keep only the part of each column name after the last '.' (for example)
df.rename(columns=lambda x: x.split('.')[-1], inplace=True)
Handling Missing Values
# missing values are usually excluded by default
df.column_x.value_counts()             # excludes missing values
df.column_x.value_counts(dropna=False) # includes missing values
# find missing values in a Series
df.column_x.isnull()  # True if missing
df.column_x.notnull() # True if not missing
# use a boolean Series to filter DataFrame rows
df[df.column_x.isnull()]  # only show rows where column_x is missing
df[df.column_x.notnull()] # only show rows where column_x is not missing
# understanding axes
df.sum()       # sums “down” the 0 axis (rows)
df.sum(axis=0) # equivalent (since axis=0 is the default)
df.sum(axis=1) # sums “across” the 1 axis (columns)
# adding booleans
pd.Series([True, False, True])       # create a boolean Series
pd.Series([True, False, True]).sum() # converts False to 0 and True to 1
# find missing values in a DataFrame
df.isnull() # DataFrame of booleans
df.isnull().sum() # count the missing values in each column
# drop missing values
df.dropna(inplace=True)   # drop a row if ANY values are missing; defaults to rows, but can be applied to columns with axis=1
df.dropna(how='all', inplace=True)  # drop a row only if ALL values are missing
# fill in missing values
df['column_x'] = df['column_x'].fillna('NA')
# fill in missing values with 'NA'
# value does not have to be a string; it can be a calculated value like df.column_x.mode()[0], or just a number like 0
# turn off the missing value filter
df = pd.read_csv('df.csv', header=0, names=new_cols, na_filter=False)
Diagram: https://i.imgur.com/yjNkiwL.png
# for each value in column_x, calculate the mean column_y
df.groupby('column_x').column_y.mean()
# for each value in column_x, count the number of occurrences
df.column_x.value_counts()
# for each value in column_x, describe column_y
df.groupby('column_x').column_y.describe()
# similar, but outputs a DataFrame and can be customized
df.groupby('column_x').column_y.agg(['count', 'mean', 'min', 'max'])
df.groupby('column_x').column_y.agg(['count', 'mean', 'min', 'max']).sort_values('mean')
# if you don't specify a column to which the aggregation function should be applied, it is applied to every column; in recent pandas, pass numeric_only=True to limit it to numeric columns
df.groupby('column_x').mean(numeric_only=True)
# can also groupby a list of columns, i.e., for each combination of column_x and column_y, calculate the mean column_z
df.groupby(['column_x', 'column_y']).column_z.mean()

# to take groupby results out of hierarchical index format (e.g., present as table), use the .unstack() method
df.groupby(['column_x', 'column_y']).column_z.mean().unstack()

# conversely, if you want to transform a table into a hierarchical index, use the .stack() method
df.groupby(['column_x', 'column_y']).column_z.mean().unstack().stack()
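
As a rough illustration of the groupby notes, the sketch below uses a small made-up DataFrame (the values are invented; the column names follow the placeholders used in these notes) to show a grouped mean, then unstack() and stack() on a two-level result.

```python
import pandas as pd

# Made-up sample data; column names follow the placeholders used in these notes.
df = pd.DataFrame({
    "column_x": ["a", "a", "b", "b"],
    "column_y": ["p", "q", "p", "q"],
    "column_z": [10, 20, 30, 50],
})

# For each value in column_x, calculate the mean of column_z.
mean_by_x = df.groupby("column_x").column_z.mean()

# Grouping by two columns gives a Series with a hierarchical (two-level) index.
mean_by_xy = df.groupby(["column_x", "column_y"]).column_z.mean()

# unstack() moves the inner index level (column_y) into the columns, giving a plain table.
table = mean_by_xy.unstack()

# stack() reverses that and returns to the hierarchical index.
restacked = table.stack()

print(mean_by_x)
print(table)
```

The table form is usually easier to read, while the stacked form is handier for further grouping or merging.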

Selecting Multiple Columns and Filtering Rows
# select multiple columns
my_cols = ['column_x', 'column_y']  # create a list of column names…
df[my_cols]                   # …and use that list to select columns
df[['column_x', 'column_y']]  # or, combine into a single step; double brackets because a list is used as the index
# use loc to select columns by name
df.loc[:, 'column_x']    # colon means "all rows", then select one column
df.loc[:, ['column_x', 'column_y']]  # select two columns
df.loc[:, 'column_x':'column_y']     # select a range of columns (i.e., selects all columns from the first through the last specified, inclusive)
# loc can also filter rows by "name" (the index)
df.loc[0, :]       # row 0, all columns
df.loc[0:2, :]     # rows 0/1/2, all columns
df.loc[0:2, 'column_x':'column_y'] # rows 0/1/2, range of columns
# use iloc to filter rows and select columns by integer position
df.iloc[:, [0, 3]]     # all rows, columns in position 0/3
df.iloc[:, 0:4]        # all rows, columns in position 0/1/2/3
df.iloc[0:3, :]        # rows in position 0/1/2, all columns
# filtering out and dropping rows based on condition (e.g., where column_x values are null)
drop_rows = df[df["column_x"].isnull()]
new_df = df.drop(drop_rows.index)
Merging and Concatenating Dataframes

# concatenating two dfs together (just smooshes them together, does not pair them in any meaningful way) - axis=1 concats df2 to right side of df1; axis=0 concats df2 to bottom of df1

new_df = pd.concat([df1, df2], axis=1)

# merging dfs based on paired columns; columns do not need to have same name, but should match values; left_on column comes from df1, right_on column comes from df2

new_df = pd.merge(df1, df2, left_on='column_x', right_on='column_y')

# can also merge slices of dfs together, though slices need to include columns used for merging

new_df = pd.merge(df1[['column_x1', 'column_x2']], df2, left_on='column_x2', right_on='column_y')
# merging two dataframes based on shared index values (left is df1, right is df2)
new_df = pd.merge(df1, df2, left_index=True, right_index=True)
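
To make the difference between concat and merge concrete, here is a small sketch with two made-up frames (the data and the name/score columns are invented for the example). It also shows the how argument of pd.merge, which the notes do not mention and which defaults to an inner join.

```python
import pandas as pd

# Made-up frames; the key columns have different names but matching values.
df1 = pd.DataFrame({"column_x": [1, 2, 3], "name": ["ann", "bob", "cat"]})
df2 = pd.DataFrame({"column_y": [2, 3, 4], "score": [80, 90, 70]})

# concat with axis=1 only lines rows up by index label; it does not match keys.
side_by_side = pd.concat([df1, df2], axis=1)

# merge pairs rows whose key values match; the default is an inner join,
# so only keys present in both frames (2 and 3) survive.
inner = pd.merge(df1, df2, left_on="column_x", right_on="column_y")

# how="left" keeps every row of df1 and fills missing matches with NaN.
left = pd.merge(df1, df2, left_on="column_x", right_on="column_y", how="left")

print(side_by_side.shape)
print(inner)
print(left)
```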
Other Frequently Used Features
# map existing values to a different set of values
df['column_x'] = df.column_y.map({'F': 0, 'M': 1})
# encode strings as integer values (automatically starts at 0)
df['column_x_num'] = df.column_x.factorize()[0]
# determine unique values in a column
df.column_x.nunique()   # count the number of unique values
df.column_x.unique()    # return the unique values
# replace all instances of a value in a column (must match entire value)
df['column_y'] = df.column_y.replace('old_string', 'new_string')
# alter values in one column based on values in another column (changes occur in place)

# use .loc (the older .ix indexer was removed in pandas 1.0)

df.loc[df["column_x"] == 5, "column_y"] = 1

df.loc[df.column_x == "string_value", "column_y"] = "new_string_value"
# transpose data frame (i.e. rows become columns, columns become rows)
df.T
# string methods are accessed via 'str'
df.column_y.str.upper() # converts to uppercase
df.column_y.str.contains('value', na=False) # checks for a substring, returns boolean series
# convert a string to the datetime format
df['time_column'] = pd.to_datetime(df.time_column)
df.time_column.dt.hour   # datetime format exposes convenient attributes
(df.time_column.max() - df.time_column.min()).days   # also allows you to do datetime "math"
df[df.time_column > pd.Timestamp(2014, 1, 1)]   # boolean filtering with datetime format
# setting and then removing an index; resetting the index can help remove hierarchical indexes while preserving the table in its basic structure
df.set_index('time_column', inplace=True)
df.reset_index(inplace=True)
# sort a column by its index
df.column_y.sort_index()
# change the data type of a column
df['column_x'] = df.column_x.astype('float')
# change the data type of a column when reading in a file
pd.read_csv('df.csv', dtype={'column_x': float})
# create dummy variables for 'column_x' and exclude first dummy column
column_x_dummies = pd.get_dummies(df.column_x).iloc[:, 1:]
# concatenate two DataFrames (axis=0 for rows, axis=1 for columns)
df = pd.concat([df, column_x_dummies], axis=1)
Less Frequently Used Features
# create a DataFrame from a dictionary
pd.DataFrame({'column_x': ['value_x1', 'value_x2', 'value_x3'], 'column_y': ['value_y1', 'value_y2', 'value_y3']})
# create a DataFrame from a list of lists
pd.DataFrame([['value_x1', 'value_y1'], ['value_x2', 'value_y2'], ['value_x3', 'value_y3']], columns=['column_x', 'column_y'])
# detecting duplicate rows
df.duplicated()       # True if a row is identical to a previous row
df.duplicated().sum() # count of duplicates
df[df.duplicated()]   # only show duplicates
df.drop_duplicates()  # drop duplicate rows
df.column_z.duplicated()   # check a single column for duplicates
df.duplicated(['column_x', 'column_y', 'column_z']).sum()  # specify columns for finding duplicates
# Clean up missing values in multiple DataFrame columns
df = df.fillna({
 'col1': 'missing',
 'col2': '99.999',
 'col3': '999',
 'col4': 'missing',
 'col5': 'missing',
 'col6': '99'
})
# Concatenate two DataFrame columns into a new, single column - (useful when dealing with composite keys, for example)
df['newcol'] = df['col1'].map(str) + df['col2'].map(str)
# Doing calculations with DataFrame columns that have missing values
# In example below, swap in 0 for df['col1'] cells that contain null (needs: import numpy as np)
df['new_col'] = np.where(pd.isnull(df['col1']), 0, df['col1']) + df['col2']
# display a cross-tabulation of two Series
pd.crosstab(df.column_x, df.column_y)
# alternative syntax for boolean filtering (noted as "experimental" in the documentation)
df.query('column_z < 20') # df[df.column_z < 20]
df.query("column_z < 20 and column_y == 'string'")  # df[(df.column_z < 20) & (df.column_y == 'string')]
df.query('column_z < 20 or column_z > 60')        # df[(df.column_z < 20) | (df.column_z > 60)]
# Loop through rows in a DataFrame
for index, row in df.iterrows():
    print(index, row['column_x'])
# Much faster way to loop through DataFrame rows if you can work with tuples
for row in df.itertuples():
    print(row)
# Get rid of non-numeric values throughout a DataFrame:
for col in df.columns.values:
    df[col] = df[col].replace('[^0-9]+.-', '', regex=True)
# Change all NaNs to None (useful before loading to a db)
df = df.where((pd.notnull(df)), None)
# Split delimited values in a DataFrame column into two new columns
df['new_col1'], df['new_col2'] = zip(*df['original_col'].apply(lambda x: x.split(': ', 1)))
# Collapse hierarchical column indexes
df.columns = df.columns.get_level_values(0)
# display the memory usage of a DataFrame
df.info()         # total usage
df.memory_usage() # usage by column
# change a Series to the 'category' data type (reduces memory usage and increases performance)
df['column_y'] = df.column_y.astype('category')
# temporarily define a new column as a function of existing columns
df.assign(new_column = df.column_x + df.column_y + df.column_z)
# limit which rows are read when reading in a file
pd.read_csv('df.csv', nrows=10)        # only read first 10 rows
pd.read_csv('df.csv', skiprows=[1, 2]) # skip the first two rows of data
# randomly sample a DataFrame
train = df.sample(frac=0.75, random_state=1) # will contain 75% of the rows
test = df[~df.index.isin(train.index)] # will contain the other 25%
# change the maximum number of rows and columns printed ('None' means unlimited)
pd.set_option('display.max_rows', None) # default is 60 rows
pd.set_option('display.max_columns', None) # default is 20 columns
print(df)
# reset options to defaults
pd.reset_option('display.max_rows')
pd.reset_option('display.max_columns')
# change the options temporarily (settings are restored when you exit the 'with' block)
with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    print(df)

Pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.

Pandas has two main data structures: the Series and the DataFrame. A Series is one-dimensional, and a DataFrame is two-dimensional (like a table with rows and columns).

Installing Anaconda, an open-source distribution, gives us Jupyter Notebook and Spyder, either of which can be used to write pandas code.

Pandas can read data from, and write data to, CSV or TSV files, SQL databases, and so on.

  • The basic collection data types in Python:

  1. List is a collection which is ordered and changeable. Allows duplicate members. (Written in square brackets, e.g. x = ['a', 'b', 123])
  2. Tuple is a collection which is ordered and unchangeable. Allows duplicate members. (Written as comma-separated values, e.g. x = 1, 2, 3)
  3. Set is a collection which is unordered and unindexed. No duplicate members. (Written in curly braces, e.g. s = {1, 2, 3})
  4. Dictionary is a collection which is changeable and indexed by key, and which keeps insertion order since Python 3.7. No duplicate keys. (Written as letters = {1: 'a', 2: 'b'}, where 1 and 2 are the keys and 'a' and 'b' are the values)
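
A minimal sketch of the four collection types listed above, using the same example values where the notes give them (the extra operations such as append are only there to show the behaviour):

```python
# A list is ordered and changeable, and allows duplicates.
x = ['a', 'b', 123]
x.append('a')

# A tuple is ordered and unchangeable; the parentheses are optional.
t = 1, 2, 3

# A set is unordered and silently drops duplicates.
s = {1, 2, 2, 3}

# A dictionary maps keys to values.
letters = {1: 'a', 2: 'b'}
letters[3] = 'c'

print(x)
print(t)
print(len(s))
print(letters[1])
```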

Best Reference Article : https://medium.freecodecamp.org/python-collection-of-my-favorite-articles-8469b8455939



Practical Python as Starting point for new beginners https://www.w3resource.com/python-exercises/

Below are some snippets of Python pandas for reference.

Know the basic data structures of Python first.

Code : to download a csv data file from a URL
import pandas as pd


df.head(5) - this will display the first 5 rows of the csv file.

Code : to save the downloaded csv to your local drive

df.to_csv('exa.csv') - this will save the above P1.csv data as exa.csv on your local drive

If you want to know where your csv was saved,

use the command pwd - it shows the current working directory, which is where the csv file is stored.

Make use of the below Python Cheat Sheet for reference.

Pandas for New Learners:

Pandas for Beginners:


Start Work with Pandas with below url for understanding


the below link is a collection of Python pandas Cheat sheet 

Pandas for datascience :





Data wrangling


Pandas by SQL


Pandas with excel

A little to know about using Conda (Anaconda):

Some tips to know what Pandas is

Some sample code for exercises:


Advanced Pandas and sample projects

Basic course in Pandas


Very useful for new beginners

Pandas Pivot: https://pbpython.com/pandas-pivot-table-explained.html

Learn Python Step by Step Approach:

when to use Aggregate or filter or transform in pandas

Pandas techniques by Analytics Vidhya


Python Data Visualizations:

python basic level for beginners

Machine Learning Project in Python pandas 
(involves plot visualization like ggplot2 in R)


5 reasons why Python is used for Big Data 

(this covers numpy, scipy, pandas, scikit-learn, PyBrain, tensorflow)


Essential hacks and tricks with Python (useful information about deep learning, and links)

15 essential libraries to know in Python


30 Amazing python project for reference


Sentiment analysis using python

Stock Market Analysis with python

pandas in Jupyter

Pandas with JSON

Processing Huge Dataset in Python

R Interface to Python

Python data science

Some basic commands of Pandas

step 1: Arithmetic in Pandas

import pandas as pd
from pandas import Series, DataFrame
import numpy as np

sk1 = Series([1, 9, -2, 23])
sk2 = Series([5, 2, -10, 4])


check which Series is greater, element by element (both Series must have the same length and index)

sk1 > sk2

To find the square root

np.sqrt(sk1) will return the square root of each value (NaN for negative values such as -2).

sk1.mean() - will get you the average of the data in the Series

lambda function checking whether each value of sk1 > 2 (if it is > 2 then it
        returns the value of sk1, else it returns 2)
sk1.apply(lambda x: x if x > 2 else 2)

sk1 values   sk1.apply values
1             2
9             9
-2            2
23           23
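
The Series operations above can be tried in one runnable sketch. It uses two Series of equal length so they can be compared element by element; clip(lower=2) is an addition of mine, shown as a vectorised alternative to the lambda.

```python
import numpy as np
import pandas as pd

sk1 = pd.Series([1, 9, -2, 23])
sk2 = pd.Series([5, 2, -10, 4])

greater = sk1 > sk2            # element-wise comparison, returns a boolean Series
average = sk1.mean()           # average of the values
floored = sk1.apply(lambda x: x if x > 2 else 2)   # values of 2 or less become 2
clipped = sk1.clip(lower=2)    # vectorised way to get the same result as the lambda

# Take square roots only where the value is non-negative, to avoid NaN results.
roots = np.sqrt(sk1[sk1 >= 0])

print(greater.tolist())
print(average)
print(floored.tolist())
```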

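Program 2: team meeting on the first Friday of every month

import calendar

print("team meeting will be on :")
for m in range(1, 13):
    # monthcalendar returns a list of weeks; days outside the month are 0
    cal = calendar.monthcalendar(2018, m)
    weekone = cal[0]
    weektwo = cal[1]
    # the first Friday is in week one, or in week two if week one has no Friday
    if weekone[calendar.FRIDAY] != 0:
        meetday = weekone[calendar.FRIDAY]
    else:
        meetday = weektwo[calendar.FRIDAY]
    print("%10s %3d" % (calendar.month_name[m], meetday))

Program 3: print a month calendar with Sunday as the first day of the week

import calendar
c = calendar.TextCalendar(calendar.SUNDAY)
st = c.formatmonth(2017, 1, 0, 0)
print(st)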

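Files : open a file, write some text, close the file
program 3

def main():
    # open the file for writing, creating it if it does not exist
    f = open("rocky.txt", "w+")

    # write some lines to the file
    for i in range(10):
        f.write("this is line " + str(i) + "\r\n")

    # close the file
    f.close()

append lines to the file

def main():
    # open the file in append mode
    f = open("rocky.txt", "a")

    for i in range(10):
        f.write("Append - this is line " + str(i) + "\r\n")

    f.close()

program 4:

Read the existing file and print

def main():
    # open the file for reading
    f = open("rocky.txt", "r")

    if f.mode == 'r':
        contents = f.read()
        print(contents)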
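
The same write, append and read steps can also be written with the with statement, which closes the file automatically even if an error occurs. This is a sketch of that style rather than the approach used in these notes; it works in a temporary folder (my assumption, so that no real file is overwritten).

```python
import os
import tempfile

# Work in a temporary directory so the example does not touch real files.
folder = tempfile.mkdtemp()
file_path = os.path.join(folder, "rocky.txt")

# "w" creates (or truncates) the file; the with block closes it automatically.
with open(file_path, "w") as f:
    for i in range(3):
        f.write("this is line " + str(i) + "\n")

# "a" appends to the end of the existing file.
with open(file_path, "a") as f:
    f.write("Append - this is line 3\n")

# "r" opens the file for reading.
with open(file_path, "r") as f:
    contents = f.read()

lines = contents.splitlines()
print(lines)
```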

Working with Files
import os
from os import path
import datetime
from datetime import date, timedelta
import time

check existence and type

def main():
    print("file exists :" + str(path.exists("rocky.txt")))
    print(" Is a file :" + str(path.isfile("rocky.txt")))
    print(" Is a directory :" + str(path.isdir("rocky.txt")))

    # work with the file path

    print("file path :" + str(path.realpath("rocky.txt")))
    print("file and the path :" + str(path.split(path.realpath("rocky.txt"))))

    # get the time when the file was modified

    t = time.ctime(path.getmtime("rocky.txt"))
    print(t)

    # calculate how long ago the file was modified

    td = datetime.datetime.now() - datetime.datetime.fromtimestamp(path.getmtime("rocky.txt"))

    print(" it has been " + str(td) + " since the file was modified")

    print("Or ," + str(td.total_seconds()) + " seconds")
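
Here is a self-contained sketch of the same checks. It creates a throwaway file in a temporary folder first (an assumption of the example) so that it runs anywhere.

```python
import datetime
import tempfile
from os import path

# Create a throwaway file so the checks below have something to inspect.
folder = tempfile.mkdtemp()
file_path = path.join(folder, "rocky.txt")
with open(file_path, "w") as f:
    f.write("hello\n")

exists = path.exists(file_path)
is_file = path.isfile(file_path)
is_dir = path.isdir(file_path)

# split() separates the directory part from the file name.
directory, file_name = path.split(path.realpath(file_path))

# Modification time as a datetime, and the time elapsed since then as a timedelta.
modified = datetime.datetime.fromtimestamp(path.getmtime(file_path))
age = datetime.datetime.now() - modified

print(exists, is_file, is_dir)
print(file_name)
print("modified " + str(age.total_seconds()) + " seconds ago")
```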

shell commands

import os
from os import path
import shutil

def main():
    # make a duplicate of an existing file
    if path.exists("rocky.txt"):
        # get the path to the file in current directory
        src = path.realpath("rocky.txt")

        # now back up a copy by appending ".bak" to the name
        dst = src + ".bak"
        shutil.copy(src, dst)

        # copy over the permissions, modification times and other information
        shutil.copystat(src, dst)

        # rename the original file

make an archive of the file

import os
from os import path
import shutil
from shutil import make_archive  # for creating archives

# now put the file into a zip archive
src = path.realpath("rocky.txt")
root_dir, tail = path.split(src)

from zipfile import ZipFile

# more fine-grained control over Zip files

with ZipFile("testzip.zip", "w") as newzip:
    newzip.write("rocky.txt")
    newzip.write("rocky.txt.bak")
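
A runnable sketch of the backup-and-zip steps, again in a temporary folder (my assumption). The arcname argument is used so that the archive stores plain file names instead of full paths.

```python
import os
import shutil
import tempfile
from zipfile import ZipFile

folder = tempfile.mkdtemp()
src = os.path.join(folder, "rocky.txt")
with open(src, "w") as f:
    f.write("some text\n")

# Back up the file by appending ".bak" to its name.
dst = src + ".bak"
shutil.copy(src, dst)       # copies the contents (and permission bits)
shutil.copystat(src, dst)   # copies modification times and other metadata

# Put both files into a zip archive; arcname stores them without the folder path.
zip_path = os.path.join(folder, "testzip.zip")
with ZipFile(zip_path, "w") as newzip:
    newzip.write(src, arcname="rocky.txt")
    newzip.write(dst, arcname="rocky.txt.bak")

with ZipFile(zip_path) as archive:
    names = archive.namelist()

print(names)
```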

Read data from the Internet

import urllib.request

def main():
    webUrl = urllib.request.urlopen("http://www.google.com")
    print("result code : " + str(webUrl.getcode()))
    data = webUrl.read()
    print(data)

program : Read first 5 lines

>>> with open(r"C:\Users\Sivakumar\Desktop\sam.txt") as myfile:
    firstnline = myfile.readlines()[0:5]
    print(firstnline)

['My Family \t\n',
 ' Ilove my family\n', 
'My the best cook,\n',
 'My the best man ,\n', 
'My s girl,\n']

for line in firstnline:
    print(line)
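
readlines() loads the whole file before slicing. For large files, a possible alternative (not the method used in these notes) is itertools.islice, which stops reading after the first N lines. The sketch writes its own sample file to a temporary folder.

```python
import os
import tempfile
from itertools import islice

# Build a sample file with 100 lines to read from.
folder = tempfile.mkdtemp()
file_path = os.path.join(folder, "sam.txt")
with open(file_path, "w") as f:
    for i in range(100):
        f.write("line " + str(i) + "\n")

N = 5
# islice stops after N lines, so the rest of the file is never read.
with open(file_path) as myfile:
    first_lines = [line.rstrip("\n") for line in islice(myfile, N)]

for line in first_lines:
    print(line)
```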

Reading a csv file

import csv
with open("names.csv", 'r') as csv_file:
    csv_reader = csv.reader(csv_file)

    for line in csv_reader:
        print(line)
## with csv.DictReader, use print(line['columnname'])

Read and write a csv file
import csv
with open("names.csv", 'r') as csv_file:
    csv_reader = csv.reader(csv_file)

    with open("newnames.csv", 'w', newline='') as new_file:
        csv_writer = csv.writer(new_file)

        # csv_writer = csv.writer(new_file, delimiter='\t')

        for line in csv_reader:
            csv_writer.writerow(line)

Reading and writing field names from the csv file

import csv
with open("names.csv", 'r') as csv_file:
    csv_reader = csv.DictReader(csv_file)

    with open("newnames.csv", 'w', newline='') as new_file:

        fieldnames = ['empno', 'empname', 'email']
        csv_writer = csv.DictWriter(new_file, fieldnames=fieldnames, delimiter='\t')
        csv_writer.writeheader()

        for line in csv_reader:
            csv_writer.writerow(line)

Deleting a column from the csv file

import csv
with open("names.csv", 'r') as csv_file:
    csv_reader = csv.DictReader(csv_file)

    with open("newnames.csv", 'w', newline='') as new_file:

        # leave 'email' out of fieldnames so the column is not written
        fieldnames = ['empno', 'empname']
        csv_writer = csv.DictWriter(new_file, fieldnames=fieldnames, delimiter='\t')
        csv_writer.writeheader()
        for line in csv_reader:
            del line['email']
            csv_writer.writerow(line)
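
The column-deletion idea can be checked without any files by using io.StringIO as a stand-in for names.csv and newnames.csv. The sample rows below are made up for the example.

```python
import csv
import io

# In-memory stand-in for names.csv; the rows are made up.
source = io.StringIO(
    "empno,empname,email\n"
    "1,Asha,asha@example.com\n"
    "2,Ravi,ravi@example.com\n"
)
target = io.StringIO()

csv_reader = csv.DictReader(source)

# Leave 'email' out of fieldnames so the output has no such column.
fieldnames = ['empno', 'empname']
csv_writer = csv.DictWriter(target, fieldnames=fieldnames,
                            delimiter='\t', lineterminator='\n')
csv_writer.writeheader()

for line in csv_reader:
    del line['email']          # DictWriter rejects keys that are not in fieldnames
    csv_writer.writerow(line)

output = target.getvalue()
print(output)
```

An alternative to deleting the key is to pass extrasaction='ignore' to csv.DictWriter, which skips any keys that are not listed in fieldnames.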

If logic:

Location = 'Tiruvannamalai'
if Location == 'Chennai':
    print('Location is Chennai')
else:
    print('no match location')

import my_module
subject =['english','tamil','Hindi','Telugu']

index= my_module.find_index(subject,'tamil')

-----Using Excel:

import csv
from xlrd import open_workbook

sheet = open_workbook('/data/sample.xls').sheet_by_index(0)

row = {}

row_values = sheet.row_slice(6, start_colx=2, end_colx=7)

row['empno'] = int(row_values[0].value)
row['phone'] = int(row_values[1].value)
row['phone1'] = int(row_values[2].value)
row['phone2'] = int(row_values[3].value)
row['loc'] = ' '
row['salary'] = int(row_values[4].value)

print(row)

with open('d:/table.csv', 'a', newline='') as f:
    w = csv.DictWriter(f, row.keys())
    w.writeheader()
    w.writerow(row)

Combining two csv files into one csv file
import csv

with open('combined_file.csv', 'w', newline='') as outcsv:
    writer = csv.DictWriter(outcsv, fieldnames = ["Date", "temperature 1", "temperature 2"])

    with open('t1.csv', 'r', newline='') as incsv:
        reader = csv.reader(incsv)
        writer.writerows({'Date': row[0], 'temperature 1': row[1], 'temperature 2': 0.0} for row in reader)

    with open('t2.csv', 'r', newline='') as incsv:
        reader = csv.reader(incsv)
        writer.writerows({'Date': row[0], 'temperature 1': 0.0, 'temperature 2': row[1]} for row in reader)


Combining with two csv with header, handling exceptions:
import csv

csv_file = 'file/path/file_name'

values = ['preschool', 'secondary school']

def csv_header(x):
    # write the header row of the output file for one value
    with open(x + '.csv', 'a', newline='') as myfile:
        csv.writer(myfile).writerow(['id', 'type', 'state', 'location'])

def csv_writer(y, value):
    # append every row that contains the value to <value>.csv
    for row in y:
        if value in row:
            with open(value + '.csv', 'a', newline='') as myfile:
                spamwriter = csv.writer(myfile)
                spamwriter.writerow(row)

def csv_reader(z, value):
    with open(z + '.csv', 'r', newline='') as spam:
        spamreader = csv.reader(spam, delimiter=',', quotechar='|')
        csv_writer(spamreader, value)

for value in values:
    csv_header(value)
    csv_reader(csv_file, value)

practical scenario for deleting records based on a value
csv with the following header
 columns: id, type, state, location, number of students

124, preschool, Pennsylvania, Pittsburgh, 1242
421, secondary school, Ohio, Cleveland, 1244
213, primary school, California, Los Angeles, 3213
155, secondary school, Pennsylvania, Pittsburgh, 2141

import csv

with open('fin.csv', 'r') as fin, open('fout.csv', 'w', newline='') as fout:

    # define reader and writer objects
    reader = csv.reader(fin, skipinitialspace=True)
    writer = csv.writer(fout, delimiter=',')

    # write headers
    writer.writerow(next(reader))

    # iterate and write rows based on condition
    for i in reader:
        if int(i[-1]) > 2000:
            writer.writerow(i)
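
To see which records survive the filter, the sketch below runs the same logic on the four sample rows above. It uses io.StringIO in place of fin.csv and fout.csv; the kept_ids list is only there to inspect the result.

```python
import csv
import io

# In-memory stand-in for fin.csv, using the sample rows above.
fin = io.StringIO(
    "id, type, state, location, number of students\n"
    "124, preschool, Pennsylvania, Pittsburgh, 1242\n"
    "421, secondary school, Ohio, Cleveland, 1244\n"
    "213, primary school, California, Los Angeles, 3213\n"
    "155, secondary school, Pennsylvania, Pittsburgh, 2141\n"
)
fout = io.StringIO()

reader = csv.reader(fin, skipinitialspace=True)
writer = csv.writer(fout, delimiter=',', lineterminator='\n')

# Copy the header row unchanged.
writer.writerow(next(reader))

# Keep only the rows whose last column is above 2000.
kept_ids = []
for row in reader:
    if int(row[-1]) > 2000:
        writer.writerow(row)
        kept_ids.append(row[0])

print(fout.getvalue())
```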

Links that would be useful:

How to connect Python with Oracle - DDL and DML operations

Connect Python to Oracle using cx_Oracle
Connect Python to SQL Server using pyodbc
Connect Python to MS Access Database using pyodbc
Connect Python to MySQL using MySQLdb


The link below has nice information about all DB connectivity:


How to insert data into SQL Server using Python


How to read data from a website


Reading data from the url

import urllib.request
# open a connection to a URL using urllib
webUrl = urllib.request.urlopen('https://docs.tibco.com/pub/spotfire/7.0.0/doc/html/ncfe/ncfe_binning_functions.htm')

#get the result code and print it

print ("result code: " + str(webUrl.getcode()))

# read the data from the URL and print it
data = webUrl.read()
print (data)

Real data web scraping using 

Basic web scraping using Python


A Complete Reference on Web Scraping :


Good Slideshare on Web Scraping


Need to know a few details

Urllib2: a Python 2 module for fetching URLs (in Python 3 its functionality lives in urllib.request).

It defines functions and classes to help with URL actions 

(basic and digest authentication, redirections, cookies, etc).

BeautifulSoup: an incredible tool for pulling information out of a webpage. 

You can use it to extract tables, lists and paragraphs, 

and you can also apply filters to extract specific information from web pages.

The article linked below uses the latest version, BeautifulSoup 4. 

Refer Link : https://www.analyticsvidhya.com/blog/2015/10/beginner-guide-web-scraping-beautiful-soup-python/


Interacting with Excel from Python: an exercise



Python with Docker:


learn python in depth : https://www.whoishostingthis.com/resources/python/

12-08-dec updates:

How to deal with dates in Python


case studies:

case studies on Single well Analysis:



 file operations in C# for prj

case studies: 


Datascience with all

spotfire datascience
Regression Analysis in practical life

some practical python : https://realpython.com (always refer to this if you want real-world scenarios)

ironpython case studies


FullStack Python : https://www.fullstackpython.com/table-of-contents.html


20 Libraries to know in python


Python in a core : https://docs.python-guide.org/

A Python web page with a rich blog, nice to know:


1. https://lnkd.in/ey_5hby

1. Glossary of terms for all Python concepts

2. Python with data science

3. Comparison between the tools

The link covers almost every topic and makes it easy to learn Python from basic level to advanced level.

2. https://lnkd.in/enFZcBJ

The above link lists the top 10 NLP libraries, which are very useful for data science.

Upload data from SQL Server to Redshift ( Using Python Script)

__author__ = 'Abu Shaz'

import sys
from os.path import expanduser
import yaml
import os
from Util.ParseFile import ParseFile
from DataContexts.SqlServerDataContext import SqlServerDataContext
from DataContexts.RedshiftDataContext import RedshiftDataContext
from AWSServices.S3Service import S3Service
from Util.CompressedFile import FileCompression
from Util.FileManipulation import FileManipulation

def main():
    # The directory holding config.yml comes from the first argument, else from the home path
    if len(sys.argv) > 1:
        file_location = os.path.dirname(sys.argv[1]) # Need to work here for space in argv
    else:
        file_location = os.path.dirname(expanduser("~"))

    #print(file_location + "/config.yml")

    #print(os.path.abspath('.'))

    #file_location = os.path.dirname("C:\\Users\khairuzzaman\Desktop\Python Script\\config.yml")

    try:
        config_file = ParseFile.parse_yml_file(file_location + "/config.yml")
        sql_configuration = config_file['SQLServer']
        redshift_configuration = config_file['Redshift']
        s3_configuration = config_file['S3Config']
    except Exception as err:
        print('Config file is not well formatted or not available')
        # Stop here: the steps below cannot run without the configuration
        return

    table_name = sql_configuration["table"]
    file_name = table_name + ".gz"

    # Step 1: export the SQL Server table to <table_name>.txt
    print("Data Loading from SQL Server Start ...\n")
    try:
        sql_context = SqlServerDataContext(config=sql_configuration)
        sql_context.export_data_to_file(table_name, directory=file_location)
    except Exception as err:
        print(err)
        raise

    print("SQL Server Data Loading Complete ...\n")

    # Step 2: gzip the exported text file
    try:
        FileCompression.gzip_compression(table_name, "txt", directory=file_location)
    except Exception as err:
        FileManipulation.remove_file(file_location + "/" + table_name + ".txt")
        print(err)
        raise

    # Step 3: upload the compressed file to S3
    print("Upload File to S3 ...\n")
    try:
        s3_service = S3Service(config=s3_configuration)
        s3_service.upload_file_to_s3(s3_configuration['bucket_name'], file_location=file_location, file_name=file_name)
    except Exception as err:
        FileManipulation.remove_file(file_location + "/" + table_name + ".txt")
        FileManipulation.remove_file(file_location + "/" + file_name)
        print(err)
        raise

    print("File Upload Complete ...\n")

    # Step 4: load the S3 file into Redshift with a COPY command
    print("Start Executing Copy Command ...\n")
    try:
        copy_command = {}
        copy_command['table_name'] = redshift_configuration['table']
        copy_command['bucket_name'] = s3_configuration['bucket_name']
        copy_command['file_name'] = file_name
        copy_command['aws_access_key_id'] = s3_configuration['AWS_ACCESS_KEY_ID']
        copy_command['aws_secret_access_key'] = s3_configuration['AWS_SECRET_ACCESS_KEY']
        copy_command['delimiter'] = '\\t'

        redshift_configuration["port"] = "5439"
        redshift_context = RedshiftDataContext(config=redshift_configuration)
        redshift_context.execute_copy_command(command_param=copy_command)
    except Exception as err:
        FileManipulation.remove_file(file_location + "/" + table_name + ".txt")
        FileManipulation.remove_file(file_location + "/" + file_name)
        print(err)
        raise

    print("Executing Copy Command Finish ...\n")

    # Step 5: remove the local temporary files
    FileManipulation.remove_file(file_location + "/" + table_name + ".txt")
    FileManipulation.remove_file(file_location + "/" + file_name)

    #print(sql_configuration)
    #print(redshift_configuration)

if __name__ == '__main__':
    main()
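
The RedshiftDataContext class is not shown in these notes, so what execute_copy_command does with the command dictionary is not known. As a rough sketch only, assuming it turns that dictionary into a Redshift COPY statement for a gzip-compressed, tab-delimited file in S3, the statement could be built like this. The helper name build_copy_sql and all example values are hypothetical.

```python
def build_copy_sql(params):
    """Build a Redshift COPY statement from a dict of parameters.

    Hypothetical helper: the keys mirror the command dictionary
    assembled in main(), but the real project code is not shown.
    """
    s3_path = "s3://{bucket_name}/{file_name}".format(**params)
    credentials = (
        "aws_access_key_id={aws_access_key_id};"
        "aws_secret_access_key={aws_secret_access_key}"
    ).format(**params)
    # GZIP tells Redshift that the S3 object is gzip-compressed
    return (
        "COPY {table} FROM '{path}' "
        "CREDENTIALS '{creds}' "
        "DELIMITER '{delim}' GZIP;"
    ).format(
        table=params["table_name"],
        path=s3_path,
        creds=credentials,
        delim=params["delimiter"],
    )


# Example values are placeholders, not real credentials
example_sql = build_copy_sql({
    "table_name": "orders",
    "bucket_name": "my-bucket",
    "file_name": "orders.gz",
    "aws_access_key_id": "EXAMPLE_KEY_ID",
    "aws_secret_access_key": "EXAMPLE_SECRET",
    "delimiter": "\\t",
})
print(example_sql)
```

The resulting string would then be passed to a cursor's execute method on a Redshift connection.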

Pull your data from Amazon Redshift and PostgreSQL
url :

Redshift is compatible with PostgreSQL. We can access Redshift using the official
PostgreSQL libraries from Python or R.

for JDBC or ODBC please find the below link


for accessing data from google using python

If the choice of language is Python, we can use either NumPy or pandas.

NumPy : The most fundamental library in Python for scientific computation.
pandas : Python data analytics library that provides high-performance data structures
         for operating on table-like data.

To Get Connected to Redshift with Python:

Of the libraries available for Python, the one that PostgreSQL recommends is Psycopg.
Use the library as below:

import psycopg2
con = psycopg2.connect(dbname='dbname', host='host',
                       port='port', user='user', password='pwd')

The parameters that you need:

Database name
Host name
Port
User name
Password
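
To avoid hard-coding these values, one option is to read them from environment variables. This is only a sketch: the function name connection_params is hypothetical, the variable names follow the standard libpq ones (PGDATABASE, PGHOST, PGPORT, PGUSER, PGPASSWORD), and the default port 5439 is the Redshift port used elsewhere in these notes.

```python
import os


def connection_params(env=None):
    """Collect keyword arguments for psycopg2.connect from a mapping.

    Hypothetical helper; by default it reads from os.environ.
    """
    env = os.environ if env is None else env
    return {
        "dbname": env["PGDATABASE"],
        "host": env["PGHOST"],
        # Fall back to the Redshift port when PGPORT is not set
        "port": int(env.get("PGPORT", "5439")),
        "user": env["PGUSER"],
        "password": env["PGPASSWORD"],
    }


# Placeholder values for illustration only
example_env = {
    "PGDATABASE": "dbname",
    "PGHOST": "host",
    "PGUSER": "user",
    "PGPASSWORD": "pwd",
}
params = connection_params(example_env)
print(params["host"], params["port"])
```

With the real environment variables set, the call would look like psycopg2.connect(**connection_params()).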


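First get a cursor from your DB connection:
cur = con.cursor()

Execute a select query to pull data, where my_table stands for the table
you want to get data from. Do not wrap the name in backticks: that is MySQL
quoting, which PostgreSQL and Redshift reject.

cur.execute("SELECT * FROM my_table;")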

After the successful execution of your query you need to
instruct Psycopg how to fetch your data from the database.
To get every row as a NumPy array:

import numpy as np
data = np.array(cur.fetchall())

where cur is the cursor we created previously. That's all:
your data from Redshift is now a NumPy array.

Of course, after you are done, do not forget to close your
cursor and connection:

cur.close()
con.close()
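
The cursor, fetch, and close steps above can be sketched end to end. This is an illustration under assumptions: sqlite3 from the standard library stands in for psycopg2 so that the code runs without a database server (both follow the Python DB-API), and fetch_all_rows and my_table are hypothetical names.

```python
import sqlite3
from contextlib import closing


def fetch_all_rows(connection, query):
    """Run a query and return every row as a list of tuples."""
    # closing() guarantees cur.close() even if the query fails
    with closing(connection.cursor()) as cur:
        cur.execute(query)
        return cur.fetchall()


# sqlite3 is a stand-in here; with Redshift the connection would
# come from psycopg2.connect(...) instead
with closing(sqlite3.connect(":memory:")) as con:
    con.execute("CREATE TABLE my_table (id INTEGER, name TEXT)")
    con.executemany("INSERT INTO my_table VALUES (?, ?)", [(1, "a"), (2, "b")])
    rows = fetch_all_rows(con, "SELECT * FROM my_table ORDER BY id;")

print(rows)
```

Wrapping both the cursor and the connection in closing() means neither is left open when an exception interrupts the work.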


If instead of NumPy you plan to work with pandas,
you can skip the previous steps altogether.
You can use the read_sql method, with which you can read an
SQL query or database table directly into a DataFrame.
In order to do that you will also need to use SQLAlchemy.

from sqlalchemy import create_engine
import pandas as pd
engine = create_engine('postgresql://scott:tiger@redshift_host:5439/mydatabase')
data_frame = pd.read_sql_query('SELECT * FROM my_table;', engine)

Note : SQLAlchemy (https://www.sqlalchemy.org/) is the Python SQL toolkit
and Object Relational Mapper that gives application developers the full
power and flexibility of SQL.

With the above code you will end up having a pandas data frame 
that contains the results of the SQL query that you have provided to it.

R (Language)

If Python is not your cup of tea and you prefer R instead,
you are still covered. Getting your data from Amazon Redshift or PostgreSQL
is as easy as in Python.
As in Python, we first need to take care of how we will
connect to our database and execute queries against it.
To do that we will need the "RPostgreSQL" package.

install.packages("RPostgreSQL")
library(RPostgreSQL)

With the above code we make the package available and visible to R.
Now we need to create the connection and execute queries.
The last command below closes the connection.
Don't run it before you execute your queries,
but always remember to close the connection when you are done
pulling data out of the database.

drv <- dbDriver("PostgreSQL")
con <- dbConnect(drv, dbname="dbname", host="host", port=1234,
                 user="user", password="password")
dbDisconnect(con)

Getting your data from the database into an
R data frame is just one command away
(my_table stands for the table you want to read):

df_postgres <- dbGetQuery(con, "SELECT * FROM my_table")

Getting your data from Redshift or PostgreSQL for further analysis in Python and R
is really easy. The true power of a database that stores your data, in
comparison with CSV files etc., is that you have SQL as an additional tool.
Invest some time learning how to work with SQL and you will not regret it:
having structured data in a database and using SQL to pre-process
your data before you start building your statistical models will
save you time and resources.

Although in this article we focused mainly on Amazon Redshift and PostgreSQL,
using any other database is equally easy. The main difference will
be the selection of the appropriate library,
for Python in the case of NumPy and for R.
The usage of SQLAlchemy by pandas makes it easier for
you, as the only change required is the configuration string for the database engine.
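
To illustrate that last point, here is a small sketch of how the configuration string could be assembled for different databases. The helper build_engine_url is hypothetical and the host names and credentials are placeholders. Escaping the user name and password with urllib.parse.quote_plus keeps characters such as @ or / from breaking the URL.

```python
from urllib.parse import quote_plus


def build_engine_url(dialect, user, password, host, port, database):
    """Assemble a database URL of the form dialect://user:password@host:port/database."""
    return "{}://{}:{}@{}:{}/{}".format(
        dialect, quote_plus(user), quote_plus(password), host, port, database
    )


# Redshift speaks the PostgreSQL protocol, so the postgresql dialect is used
redshift_url = build_engine_url(
    "postgresql", "scott", "tiger", "redshift_host", 5439, "mydatabase"
)
# Switching to another database only changes the string
mysql_url = build_engine_url(
    "mysql+pymysql", "scott", "p@ss/word", "mysql_host", 3306, "mydatabase"
)
print(redshift_url)
print(mysql_url)
```

Either string can be handed to SQLAlchemy's create_engine, after which the pandas call stays the same.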

cleaning up with AWS Boto3


Accessing S3 Data in Python using Boto3 (published on LinkedIn)



Python SDK "Boto3" for Amazon AWS (videos)

Offload query Using pyspark (Spark python) - to be analyzed and work on next week

Choose the right database in amazon

AWS Pricing Calculator

What is achieved by Data Lake in AWS

About parquet and manifest file

Building Chatbot with python

Building with Django

step by step R and Python for reference - GITHUB also included


Tips : How to learn Python as a Beginner


Good LinkedIn Blog on Python and Data Science


Google sheet : A nice blog https://www.benlcollins.com/

Great Tips by Good blog Authors - Every day tips

Pandas : https://twitter.com/i/moments/1158828895547854849

Good Reference on Data Science


Python visualization


Annamalai Om NamaSivayaa Thunai Sivakumar Ramar Poornima Devi Sasmita Lakshmi

Unnamalai Om Kamatchi Thunai


Oct 29 : Thiruannamalai Deepam

# open the file; the with block closes it automatically
with open('C:\\Users\\Sivakumar\\Dropbox\\My PC (DESKTOP-RKHVE1P)\\Desktop\\myfile.txt', 'r') as f:
    # read the content of the file
    data = f.read()

# get the length of the data
number_of_characters = len(data)
print('Number of characters in text file :', number_of_characters)

# print the text in upper, lower and title case
print("Upper Case")
print(data.upper())
print("Lower Case")
print(data.lower())
print("Title Case")
print(data.title())


Number of characters in text file : 106
Upper Case
ANNAMALAI OM NAMASIVAYAA THUNAI SIVAKUMAR RAMAR POORNIMA DEVI SASMITA LAKSHMI
UNNAMALAI OM KAMATCHI THUNAI
Lower Case
annamalai om namasivayaa thunai sivakumar ramar poornima devi sasmita lakshmi
unnamalai om kamatchi thunai
Title Case
Annamalai Om Namasivayaa Thunai Sivakumar Ramar Poornima Devi Sasmita Lakshmi
Unnamalai Om Kamatchi Thunai

Found an Interesting learning python : My Birthday Gift 13th 2021


Path to Become a Data Engineer


Hope this helps us get started with Python pandas. Your valuable feedback is welcome.

Loop Like A Native | Ned Batchelder

Have a Great Day


