I want to know: What is Python Pandas?
Sivakumar Velayampakkam
Technical Project Manager at iLink Digital
Python Zen Master · Jan 26, 11:21 am
=====================================================
clipboard, dictionary, excel, feather, from_xml, hdf5, html_table, json_to_dataframe,
mysql, numpy array, pickle, postgresql, string, summary, to_csv, to_json, to_latex, to_xml
Faster Python pandas library (May 19, 2021):
https://www.kdnuggets.com/2021/05/vaex-pandas-1000x-faster.html
https://github.com/vaexio/vaex
Faster Python learning, in a simple way:
A fantastic and short way of iterating: enumerate with a loop (see the sketch below)
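A minimal sketch of that enumerate pattern (example data of my own, not from the linked post):
subjects = ['english', 'tamil', 'hindi', 'telugu']
for position, name in enumerate(subjects, start=1):
    print(position, name)   # prints "1 english", "2 tamil", ...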
Python author: Swathi Arun - Medium
Python Pandas
Fantastic explanation of functions below, with passing defaults, *args and **kwargs
https://media-exp1.licdn.com/dms/image/C5622AQFALb1yle8Uuw/feedshare-shrink_1280-alternative/0?e=1609372800&v=beta&t=ixtY47RxPL63lf-9LllFL5bNASpjjn2VEoIEVSC91wc
Python image processing library: OpenCV
100 tricks
https://www.dataschool.io/python-pandas-tips-and-tricks/
All Syntax in a Single Page : https://medium.com/@msalmon00/helpful-python-code-snippets-for-data-exploration-in-pandas-b7c5aed5ecb9
#Code snippets for Pandas
import pandas as pd
‘’’
Reading Files, Selecting Columns, and Summarizing
‘’’
# reading in a file from local computer or directly from a URL
# various file formats that can be read in out wrote out
‘’’
Format Type Data Description Reader Writer
text CSV read_csv to_csv
text JSON read_json to_json
text HTML read_html to_html
text Local clipboard read_clipboard to_clipboard
binary MS Excel read_excel to_excel
binary HDF5 Format read_hdf to_hdf
binary Feather Format read_feather to_feather
binary Msgpack read_msgpack to_msgpack
binary Stata read_stata to_stata
binary SAS read_sas
binary Python Pickle Format read_pickle to_pickle
SQL SQL read_sql to_sql
SQL Google Big Query read_gbq to_gbq
‘’’
#to read about different types of files, and further functionality of reading in files, visit: https://pandas.pydata.org/pandas-docs/version/0.20/io.html
df = pd.read_csv(‘local_path/file.csv’)
df = pd.read_csv(‘https://file_path/file.csv')
# when reading in tables, can specify separators, and note a column to be used as index separators can include tabs (“\t”), commas(“,”), pipes (“|”), etc.
df = pd.read_table(‘https://file_path/file', sep=’|’, index_col=’column_x’)
# examine the df data
df # print the first 30 and last 30 rows
type(df) # DataFrame
df.head() # print the first 5 rows
df.head(10) # print the first 10 rows
df.tail() # print the last 5 rows
df.index # "the index" (aka "the labels")
df.columns # column names (which is "an index")
df.dtypes # data types of each column
df.shape # number of rows and columns
df.values # underlying numpy array; DataFrames are stored as numpy arrays for efficiency
# select a column
df['column_y'] # select one column
type(df['column_y']) # determine datatype of column (e.g., Series)
df.column_y # select one column using the DataFrame attribute; not usable if column names have spaces
# summarize (describe) the DataFrame
df.describe() # describe all numeric columns
df.describe(include=['object']) # describe all object columns
df.describe(include='all') # describe all columns
# summarize a Series
df.column_y.describe() # describe a single column
df.column_z.mean() # only calculate the mean
df["column_z"].mean() # alternate method for calculating the mean
# count the number of occurrences of each value
df.column_y.value_counts() # most useful for categorical variables, but can also be used with numeric variables
# filter df by one column, and print out values of another column
# when using numeric values, no quotation marks
df[df.column_y == "string_value"].column_z
df[df.column_y == 20].column_z
# display only the number of rows of the 'df' DataFrame
df.shape[0]
# display the 3 most frequent occurrences of a column in 'df'
df.column_y.value_counts()[0:3]
‘’’
Filtering and Sorting
‘’’
# boolean filtering: only show df with column_z < 20
filter_bool = df.column_z < 20 # create a Series of booleans…
df[filter_bool] # …and use that Series to filter rows
df[filter_bool].describe() # describes a data frame filtered by filter_bool
df[df.column_z < 20] # or, combine into a single step
df[df.column_z < 20].column_x # select one column from the filtered results
df[df["column_z"] < 20].column_x # alternate method
df[df.column_z < 20].column_x.value_counts() # value_counts of resulting Series; can also use .mean(), etc. instead of .value_counts()
# boolean filtering with multiple conditions; the whole expression goes in square brackets, each condition in parentheses
df[(df.column_z < 20) & (df.column_y == 'string')] # ampersand for AND condition
df[(df.column_z < 20) | (df.column_z > 60)] # pipe for OR condition
# sorting
df.column_z.sort_values() # sort a column (the old Series.order() was removed from pandas)
df.sort_values('column_z') # sort a DataFrame by a single column
df.sort_values('column_z', ascending=False) # use descending order instead
# sort a DataFrame by multiple columns
df = df.sort_values(['col1', 'col2', 'col3'], ascending=[True, True, False])
# can also filter 'df' using pandas.Series.isin
df[df.column_x.isin(["string_1", "string_2"])]
‘’’
Renaming, Adding, and Removing Columns
‘’’
# rename one or more columns
df.rename(columns={'original_column_1':'column_x', 'original_column_2':'column_y'}, inplace=True) # saves changes
# replace all column names (in place)
new_cols = ['column_x', 'column_y', 'column_z']
df.columns = new_cols
# replace all column names when reading the file
df = pd.read_csv('df.csv', header=0, names=new_cols)
# add a new column as a function of existing columns
df['new_column_1'] = df.column_x + df.column_y
df['new_column_2'] = df.column_x * 1000 # can create new columns without for loops
# removing columns
df.drop('column_x', axis=1) # axis=0 for rows, 1 for columns; does not drop in place
df.drop(['column_x', 'column_y'], axis=1, inplace=True) # drop multiple columns
# lower-case all DataFrame column names
df.columns = list(map(str.lower, df.columns))
# even more fancy DataFrame column re-naming
# e.g., keep only the part of each column name after the last dot
df.rename(columns=lambda x: x.split('.')[-1], inplace=True)
‘’’
Handling Missing Values
‘’’
# missing values are usually excluded by default
df.column_x.value_counts() # excludes missing values
df.column_x.value_counts(dropna=False) # includes missing values
# find missing values in a Series
df.column_x.isnull() # True if missing
df.column_x.notnull() # True if not missing
# use a boolean Series to filter DataFrame rows
df[df.column_x.isnull()] # only show rows where column_x is missing
df[df.column_x.notnull()] # only show rows where column_x is not missing
# understanding axes
df.sum() # sums “down” the 0 axis (rows)
df.sum(axis=0) # equivalent (since axis=0 is the default)
df.sum(axis=1) # sums “across” the 1 axis (columns)
# adding booleans
pd.Series([True, False, True]) # create a boolean Series
pd.Series([True, False, True]).sum() # converts False to 0 and True to 1
# find missing values in a DataFrame
df.isnull() # DataFrame of booleans
df.isnull().sum() # count the missing values in each column
# drop missing values
df.dropna(inplace=True) # drop a row if ANY values are missing, defaults to rows, but can be applied to columns with axis=1
df.dropna(how='all', inplace=True) # drop a row only if ALL values are missing
# fill in missing values
df.column_x.fillna(value='NA', inplace=True)
# fill in missing values with 'NA'
# value does not have to be a string; it can be a calculated value like df.column_x.mode(), or just a number like 0
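# for example, a sketch that fills missing values with the most frequent value of the (generic) column:
df.column_x.fillna(df.column_x.mode()[0], inplace=True)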
# turn off the missing value filter
df = pd.read_csv('df.csv', header=0, names=new_cols, na_filter=False)
‘’’
Split-Apply-Combine
Diagram: https://i.imgur.com/yjNkiwL.png
‘’’
# for each value in column_x, calculate the mean column_y
df.groupby(‘column_x’).column_y.mean()
# for each value in column_x, count the number of occurrences
df.column_x.value_counts()
# for each value in column_x, describe column_y
df.groupby(‘column_x’).column_y.describe()
# similar, but outputs a DataFrame and can be customized
df.groupby(‘column_x’).column_y.agg([‘count’, ‘mean’, ‘min’, ‘max’])
df.groupby(‘column_x’).column_y.agg([‘count’, ‘mean’, ‘min’, ‘max’]).sort_values(‘mean’)
# if you don’t specify a column to which the aggregation function should be applied, it will be applied to all numeric columns
df.groupby(‘column_x’).mean()
df.groupby(‘column_x’).describe()
# can also groupby a list of columns, i.e., for each combination of column_x and column_y, calculate the mean column_z
df.groupby([“column_x”,”column_y”]).column_z.mean()
#to take groupby results out of hierarchical index format (e.g., present as table), use .unstack() method
df.groupby(“column_x”).column_y.value_counts().unstack()
#conversely, if you want to transform a table into a hierarchical index, use the .stack() method
df.stack()
‘’’
Selecting Multiple Columns and Filtering Rows
‘’’
# select multiple columns
my_cols = [‘column_x’, ‘column_y’] # create a list of column names…
df[my_cols] # …and use that list to select columns
df[[‘column_x’, ‘column_y’]] # or, combine into a single step — double brackets due to indexing a list.
# use loc to select columns by name
df.loc[:, ‘column_x’] # colon means “all rows”, then select one column
df.loc[:, [‘column_x’, ‘column_y’]] # select two columns
df.loc[:, ‘column_x’:’column_y’] # select a range of columns (i.e., selects all columns including first through last specified)
# loc can also filter rows by “name” (the index)
df.loc[0, :] # row 0, all columns
df.loc[0:2, :] # rows 0/1/2, all columns
df.loc[0:2, ‘column_x’:’column_y’] # rows 0/1/2, range of columns
# use iloc to filter rows and select columns by integer position
df.iloc[:, [0, 3]] # all rows, columns in position 0/3
df.iloc[:, 0:4] # all rows, columns in position 0/1/2/3
df.iloc[0:3, :] # rows in position 0/1/2, all columns
#filtering out and dropping rows based on condition (e.g., where column_x values are null)
drop_rows = df[df["column_x"].isnull()]
new_df = df[~df.isin(drop_rows)].dropna(how='all')
‘’’
Merging and Concatenating Dataframes
‘’’
#concatenating two dfs together (just smooshes them together, does not pair them in any meaningful way) - axis=1 concats df2 to right side of df1; axis=0 concats df2 to bottom of df1
new_df = pd.concat([df1, df2], axis=1)
#merging dfs based on paired columns; columns do not need to have same name, but should match values; left_on column comes from df1, right_on column comes from df2
new_df = pd.merge(df1, df2, left_on='column_x', right_on='column_y')
#can also merge slices of dfs together, though slices need to include columns used for merging
new_df = pd.merge(df1[['column_x1', 'column_x2']], df2, left_on='column_x2', right_on='column_y')
#merging two dataframes based on shared index values (left is df1, right is df2)
new_df = pd.merge(df1, df2, left_index=True, right_index=True)
‘’’
Other Frequently Used Features
‘’’
# map existing values to a different set of values
df['column_x'] = df.column_y.map({'F':0, 'M':1})
# encode strings as integer values (automatically starts at 0)
df['column_x_num'] = df.column_x.factorize()[0]
# determine unique values in a column
df.column_x.nunique() # count the number of unique values
df.column_x.unique() # return the unique values
# replace all instances of a value in a column (must match the entire value)
df.column_y.replace('old_string', 'new_string', inplace=True)
# alter values in one column based on values in another column (changes occur in place)
# use .loc (the older .ix method has been removed from recent pandas versions)
df.loc[df["column_x"] == 5, "column_y"] = 1
df.loc[df.column_x == "string_value", "column_y"] = "new_string_value"
#transpose data frame (i.e. rows become columns, columns become rows)
df.T
# string methods are accessed via ‘str’
df.column_y.str.upper() # converts to uppercase
df.column_y.str.contains('value', na=False) # checks for a substring, returns a boolean Series
# convert a string column to datetime format
df['time_column'] = pd.to_datetime(df.time_column)
df.time_column.dt.hour # the datetime format exposes convenient attributes
(df.time_column.max() - df.time_column.min()).days # also allows you to do datetime "math"
df[df.time_column > pd.Timestamp(2014, 1, 1)] # boolean filtering with the datetime format
# setting and then removing an index; resetting the index can help remove hierarchical indexes while preserving the table in its basic structure
df.set_index('time_column', inplace=True)
df.reset_index(inplace=True)
# sort a column by its index
df.column_y.value_counts().sort_index()
# change the data type of a column
df['column_x'] = df.column_x.astype('float')
# change the data type of a column when reading in a file
pd.read_csv('df.csv', dtype={'column_x':float})
# create dummy variables for ‘column_x’ and exclude first dummy column
column_x_dummies = pd.get_dummies(df.column_x).iloc[:, 1:]
# concatenate two DataFrames (axis=0 for rows, axis=1 for columns)
df = pd.concat([df, column_x_dummies], axis=1)
‘’’
Less Frequently Used Features
‘’’
# create a DataFrame from a dictionary
pd.DataFrame({'column_x':['value_x1', 'value_x2', 'value_x3'], 'column_y':['value_y1', 'value_y2', 'value_y3']})
# create a DataFrame from a list of lists
pd.DataFrame([['value_x1', 'value_y1'], ['value_x2', 'value_y2'], ['value_x3', 'value_y3']], columns=['column_x', 'column_y'])
# detecting duplicate rows
df.duplicated() # True if a row is identical to a previous row
df.duplicated().sum() # count of duplicates
df[df.duplicated()] # only show duplicates
df.drop_duplicates() # drop duplicate rows
df.column_z.duplicated() # check a single column for duplicates
df.duplicated(['column_x', 'column_y', 'column_z']).sum() # specify columns for finding duplicates
# Clean up missing values in multiple DataFrame columns
df = df.fillna({
    'col1': 'missing',
    'col2': '99.999',
    'col3': '999',
    'col4': 'missing',
    'col5': 'missing',
    'col6': '99'
})
# Concatenate two DataFrame columns into a new, single column - (useful when dealing with composite keys, for example)
df['newcol'] = df['col1'].map(str) + df['col2'].map(str)
# Doing calculations with DataFrame columns that have missing values
# In the example below, swap in 0 for df['col1'] cells that contain null (np here is numpy: import numpy as np)
df['new_col'] = np.where(pd.isnull(df['col1']), 0, df['col1']) + df['col2']
# display a cross-tabulation of two Series
pd.crosstab(df.column_x, df.column_y)
# alternative syntax for boolean filtering (noted as “experimental” in the documentation)
df.query('column_z < 20') # df[df.column_z < 20]
df.query("column_z < 20 and column_y == 'string'") # df[(df.column_z < 20) & (df.column_y == 'string')]
df.query('column_z < 20 or column_z > 60') # df[(df.column_z < 20) | (df.column_z > 60)]
# Loop through rows in a DataFrame
for index, row in df.iterrows():
    print(index, row['column_x'])
# Much faster way to loop through DataFrame rows if you can work with tuples
for row in df.itertuples():
    print(row)
# Get rid of non-numeric values throughout a DataFrame:
for col in df.columns.values:
    df[col] = df[col].replace('[^0-9.-]+', '', regex=True) # strip everything except digits, dots, and minus signs
# Change all NaNs to None (useful before loading to a db)
df = df.where((pd.notnull(df)), None)
# Split delimited values in a DataFrame column into two new columns
df['new_col1'], df['new_col2'] = zip(*df['original_col'].apply(lambda x: x.split(': ', 1)))
# Collapse hierarchical column indexes
df.columns = df.columns.get_level_values(0)
# display the memory usage of a DataFrame
df.info() # total usage
df.memory_usage() # usage by column
# change a Series to the ‘category’ data type (reduces memory usage and increases performance)
df['column_y'] = df.column_y.astype('category')
# temporarily define a new column as a function of existing columns
df.assign(new_column = df.column_x + df.column_z + df.column_y)
# limit which rows are read when reading in a file
pd.read_csv('df.csv', nrows=10) # only read first 10 rows
pd.read_csv('df.csv', skiprows=[1, 2]) # skip the first two rows of data
# randomly sample a DataFrame
train = df.sample(frac=0.75, random_state=1) # will contain 75% of the rows
test = df[~df.index.isin(train.index)] # will contain the other 25%
# change the maximum number of rows and columns printed (‘None’ means unlimited)
pd.set_option('max_rows', None) # default is 60 rows
pd.set_option('max_columns', None) # default is 20 columns
print(df)
# reset options to defaults
pd.reset_option('max_rows')
pd.reset_option('max_columns')
# change the options temporarily (settings are restored when you exit the 'with' block)
with pd.option_context('max_rows', None, 'max_columns', None):
    print(df)
Pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.
Pandas data structures are classified as DataFrame and Series. In other terms, a Series is one-dimensional and a DataFrame is two-dimensional (like a table with rows and columns).
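A minimal sketch of the two structures (made-up example data):
import pandas as pd

# a Series is one-dimensional: a single labelled column of values
ages = pd.Series([25, 32, 47], name='age')
print(ages)

# a DataFrame is two-dimensional: rows and columns, like a table
people = pd.DataFrame({'name': ['Asha', 'Ravi', 'Meena'], 'age': [25, 32, 47]})
print(people)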
Installing Anaconda (an open-source distribution) helps us work with Python pandas; Jupyter Notebook or Spyder can be used to write and run the pandas code.
Python pandas lets us read data from, and write data to, CSV or TSV files, SQL databases, and more.
Best Reference Article : https://medium.freecodecamp.org/python-collection-of-my-favorite-articles-8469b8455939
https://www.dhirubhai.net/learning/python-for-data-science-tips-tricks-techniques/what-you-should-know
https://www.guru99.com/python-tutorials.html
Practical Python exercises as a starting point for beginners: https://www.w3resource.com/python-exercises/
Below are some Python pandas snippets for reference.
Know the basic data structures of Python first:
https://www.grapenthin.org/teaching/geop501/lectures/lecture_06_data_structures.pdf
Code: read a CSV data file from a URL
import pandas as pd
df = pd.read_csv("https://censusdata.ire.org/06/all_050_in_06.P1.csv")
df.head(5) - this will display the first 5 rows of the CSV file.
Code: save the downloaded CSV to your local drive
df.to_csv('exa.csv') - this will save the above P1.csv to your local drive
If you want to know where your CSV was downloaded,
use the command pwd - it lets you know where your CSV file is stored.
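If you prefer to check this from Python code rather than the pwd command, a small sketch:
import os

# print the current working directory - this is where to_csv('exa.csv') writes the file
print(os.getcwd())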
Make use of the below Python Cheat Sheet for reference.
Pandas for New Learners:
https://www.webpages.uidaho.edu/~stevel/504/pandas%20dataframe%20notes.pdf
Pandas for Beginners:
https://forums.fast.ai/t/great-pandas-python-dataframes-cheat-sheet/6965
https://www2.bc.edu/lily-tsoi/python/cheatsheets/pandas_basics_cheatsheet.pdf
Start working with pandas using the URL below for understanding:
https://morphocode.com/pandas-cheat-sheet/
The link below is a collection of Python pandas cheat sheets:
https://sinxloud.com/python-cheat-sheet-beginner-advanced/
Pandas for datascience :
https://s3.amazonaws.com/assets.datacamp.com/blog_assets/Python_Pandas_Cheat_Sheet_2.pdf
https://www.utc.fr/~jlaforet/Suppl/python-cheatsheets.pdf
https://www.dataquest.io/blog/large_files/pandas-cheat-sheet.pdf
https://s3.amazonaws.com/assets.datacamp.com/blog_assets/Python_Pandas_Cheat_Sheet_2.pdf
Data wrangling
https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf
Pandas for SQL people
https://hackernoon.com/pandas-cheatsheet-for-sql-people-part-1-2976894acd0
Pandas with excel
https://www.dataquest.io/blog/excel-and-pandas/
A little on using Conda (Anaconda):
https://conda.io/docs/_downloads/conda-cheatsheet.pdf
Some tips on what pandas is:
https://www.shanelynn.ie/using-pandas-dataframe-creating-editing-viewing-data-in-python/
https://towardsdatascience.com/a-quick-introduction-to-the-pandas-python-library-f1b678f34673
Some sample code for exercises:
https://www.machinelearningplus.com/python/101-pandas-exercises-python/
Advanced Pandas and sample projects
Basic course in Pandas
https://www.python-course.eu/pandas.php
Very useful for new beginners:
https://www.tutorialspoint.com/python_pandas/python_pandas_categorical_data.htm
Pandas pivot tables: https://pbpython.com/pandas-pivot-table-explained.html
Learn Python Step by Step Approach:
https://bigdata-madesimple.com/step-by-step-approach-to-perform-data-analysis-using-python/
When to use aggregate, filter, or transform in pandas (see the sketch after the link)
https://pythonforbiologists.com/when-to-use-aggregatefiltertransform-in-pandas/
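A rough sketch of the difference between the three, using made-up column names (the linked article covers it in depth):
import pandas as pd

df = pd.DataFrame({'team': ['a', 'a', 'b', 'b'], 'score': [10, 20, 3, 5]})

# aggregate: one value per group
print(df.groupby('team').score.agg('mean'))

# transform: one value per original row (here, each row gets its group mean)
print(df.groupby('team').score.transform('mean'))

# filter: keep or drop whole groups based on a condition
print(df.groupby('team').filter(lambda g: g.score.mean() > 6))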
Pandas techniques by Analytics Vidhya
https://www.analyticsvidhya.com/blog/2016/01/12-pandas-techniques-python-data-manipulation/
Python Data Visualizations:
https://www.geeksforgeeks.org/data-analysis-visualization-python/
https://www.commonlounge.com/discussion/0545c6f9156a44e6aacce6956c2624ed
python basic level for beginners
https://www.csc2.ncsu.edu/faculty/healey/msa-17/python/index.html
https://www.w3schools.com/python/
Machine Learning Project in Python pandas
(involves visualization plots, like ggplot2 in R)
https://www.hackerearth.com/practice/machine-learning/machine-learning-projects/python-project/tutorial/
5 reasons why Python is used for Big Data
(this covers numpy, scipy, pandas, scikit-learn, pybrain, tensorflow)
https://www.newgenapps.com/blog/python-for-big-data-science-projects-benefit-uses
Essential hacks and tricks with Python (useful information about deep learning and links)
https://heartbeat.fritz.ai/some-essential-hacks-and-tricks-for-machine-learning-with-python-5478bc6593f2
15 essential Python libraries you need to know:
https://www.upwork.com/hiring/data/15-python-libraries-data-science/
30 amazing Python projects for reference
https://medium.mybridge.co/30-amazing-python-projects-for-the-past-year-v-2018-9c310b04cdb3
Sentiment analysis using Python
https://dev.to/rodolfoferro/sentiment-analysis-on-trumpss-tweets-using-python-
Stock market analysis with Python
https://ntguardian.wordpress.com/2016/09/19/introduction-stock-market-data-python-1/
pandas in Jupyter
https://nikgrozev.com/2015/12/27/pandas-in-jupyter-quickstart-and-useful-snippets/
Pandas with JSON
https://okfnlabs.org/blog/2016/08/01/using-data-packages-with-pandas.html
Processing Huge Dataset in Python
https://datascienceplus.com/processing-huge-dataset-with-python/
R Interface to Python
https://blog.rstudio.com/2018/03/26/reticulate-r-interface-to-python/
Python data science
https://www.sascommunity.org/planet/blog/category/python/
Some basic Pandas commands
Step 1: Arithmetic in Pandas
import pandas as pd
from pandas import Series, DataFrame
import numpy as np
sk1 = Series([1,9,-2,23])
sk2 = Series([5,2,-10,4])   # same length as sk1, so the element-wise comparison below works
print(sk1)
Check element-wise which Series is greater:
sk1 > sk2
To find the square root:
np.sqrt(sk1) will return the square root of each element.
sk1.mean() - will get you the average of the data in the Series
Lambda function checking the values of sk1 (if a value is > 2 it is kept,
otherwise 2 is returned):
sk1.apply(lambda x: x if x > 2 else 2)
sk1 values    sk1.apply values
 1             2
 9             9
-2             2
23            23
Program 2: team meeting on the first Friday of every month
import calendar
print("team meeting will be on :")
for m in range(1,13):
    cal = calendar.monthcalendar(2018, m)
    weekone = cal[0]
    weektwo = cal[1]
    if weekone[calendar.FRIDAY] != 0:
        meetday = weekone[calendar.FRIDAY]
    else:
        meetday = weektwo[calendar.FRIDAY]
    print("%10s %3d" % (calendar.month_name[m], meetday))
Program 3: print a month calendar with weeks starting on Sunday
import calendar
c= calendar.TextCalendar(calendar.SUNDAY)
st = c.formatmonth(2017,1,0,0)
print(st)
Files : open a file and write some text, close the file
Program 4:
def main():
    f = open("rocky.txt", "w+")
    # write some lines in the file
    for i in range(10):
        f.write("this is line " + str(i) + "\r\n")
    # close the file
    f.close()
# append lines to the file
def main():
    f = open("rocky.txt", "a")
    for i in range(10):
        f.write("Append - this is line " + str(i) + "\r\n")
    f.close()
Program 5:
Read the existing file and print it
def main():
    f = open("rocky.txt", "r")
    if f.mode == 'r':
        contents = f.read()
        print(contents)
*******************************
Working with Files
*******************************
import os
from os import path
import datetime
from datetime import date,time,timedelta
import time
def main():
    print(os.name)
# check existence and type
def main():
    print("file exists: " + str(path.exists("rocky.txt")))
    print("is a file: " + str(path.isfile("rocky.txt")))
    print("is a directory: " + str(path.isdir("rocky.txt")))
    # work with file paths
    print("file path: " + str(path.realpath("rocky.txt")))
    print("file and the path: " + str(path.split(path.realpath("rocky.txt"))))
    # get the time when the file was modified
    t = time.ctime(path.getmtime("rocky.txt"))
    print(t)
    print(datetime.datetime.fromtimestamp(path.getmtime("rocky.txt")))
    # calculate how long ago the file was modified
    td = datetime.datetime.now() - datetime.datetime.fromtimestamp(path.getmtime("rocky.txt"))
    print("it has been " + str(td) + " since the file was modified")
    print("or, " + str(td.total_seconds()) + " seconds")
*************************
shell commands
*************************
import os
from os import path
import shutil
def main():
    # make a duplicate of an existing file
    if path.exists("rocky.txt"):
        # get the path to the file in the current directory
        src = path.realpath("rocky.txt")
        # now aim to back up a copy by appending "bak" to the name
        dst = src + ".bak"
        # copy over the permissions, modification times, and other information
        shutil.copy(src, dst)
        shutil.copystat(src, dst)
        # rename the original file
        os.rename("rocky.txt", "tummy.txt")
# make an archive of the file
import os
from os import path
import shutil
from shutil import make_archive  # for making archives
# now put the files into a zip archive
root_dir, tail = path.split(src)
shutil.make_archive("archive", "zip", root_dir)
from os import path
import shutil
from shutil import make_archive  # for making archives
from zipfile import ZipFile
# more fine-grained control over zip files
with ZipFile("testzip.zip", "w") as newzip:
    newzip.write("rocky.txt")
    newzip.write("rocky.txt.bak")
********************************
Read data from Internet
********************************
import urllib.request
def main():
    webUrl = urllib.request.urlopen("http://www.google.com")
    print("result code: " + str(webUrl.getcode()))
    data = webUrl.read()
    print(data)
Program: read the first 5 lines of a file
with open(r"C:\Users\Sivakumar\Desktop\sam.txt") as myfile:
    firstnline = myfile.readlines()[0:5]
print(firstnline)
['My Family          \t\n',
 ' Ilove my family\n',
 'My the best cook,\n',
 'My the best man ,\n',
 'My s girl,\n']
N = 10
f = open("file")
for i in range(N):
    line = next(f).strip()
    print(line)
f.close()
Reading a CSV file
****************************
import csv
with open('names.csv', 'r') as csv_file:
    csv_reader = csv.reader(csv_file)
    for line in csv_reader:
        print(line)
        # print(line['column_name'])  # this form works if you use csv.DictReader instead
Read and write a CSV file
****************************
import csv
with open('names.csv', 'r') as csv_file:
    csv_reader = csv.reader(csv_file)
    with open('newnames.csv', 'w', newline='') as new_file:
        csv_writer = csv.writer(new_file)
        # csv_writer = csv.writer(new_file, delimiter='\t')
        for line in csv_reader:
            csv_writer.writerow(line)
Reading and writing field names from the CSV file
**************************************************
import csv
with open('names.csv', 'r') as csv_file:
    csv_reader = csv.DictReader(csv_file)
    with open('newnames.csv', 'w', newline='') as new_file:
        fieldnames = ['empno', 'empname', 'email']
        csv_writer = csv.DictWriter(new_file, fieldnames=fieldnames, delimiter='\t')
        csv_writer.writeheader()
        for line in csv_reader:
            csv_writer.writerow(line)
Deleting a column from the CSV file
*********************************
import csv
with open('names.csv', 'r') as csv_file:
    csv_reader = csv.DictReader(csv_file)
    with open('newnames.csv', 'w', newline='') as new_file:
        fieldnames = ['empno', 'empname']   # 'email' is left out of the output fields
        csv_writer = csv.DictWriter(new_file, fieldnames=fieldnames, delimiter='\t')
        csv_writer.writeheader()
        for line in csv_reader:
            del line['email']
            csv_writer.writerow(line)
If logic:
*********
Location = 'Tiruvannamalai'
if Location == 'Chennai':
    print('Location is Chennai')
else:
    print('no match location')
import my_module
subject =['english','tamil','Hindi','Telugu']
index= my_module.find_index(subject,'tamil')
print(index)
-----Using Excel:
*********************************
import csv
from xlrd import open_workbook
sheet = open_workbook('/data/sample.xls').sheet_by_index(0)
row = {}
row_values = sheet.row_slice(6, start_colx=2, end_colx=7)
row['empno'] = int(row_values[0].value)
row['phone'] = int(row_values[1].value)
row['phone1'] = int(row_values[2].value)
row['phone2'] = int(row_values[3].value)
row['loc'] = ' '
row['salary'] = int(row_values[4].value)
print(row)
with open('d:/table.csv', 'a', newline='') as f:
    w = csv.DictWriter(f, row.keys())
    w.writeheader()
    w.writerow(row)
Combining two CSV files into one CSV file
******************************************
import csv
with open('combined_file.csv', 'w', newline='') as outcsv:
    writer = csv.DictWriter(outcsv, fieldnames=["Date", "temperature 1", "temperature 2"])
    writer.writeheader()
    with open('t1.csv', 'r') as incsv:
        reader = csv.reader(incsv)
        writer.writerows({'Date': row[0], 'temperature 1': row[1], 'temperature 2': 0.0} for row in reader)
    with open('t2.csv', 'r') as incsv:
        reader = csv.reader(incsv)
        writer.writerows({'Date': row[0], 'temperature 1': 0.0, 'temperature 2': row[1]} for row in reader)
https://docs.python.org/2/library/csv.html
Combining two CSVs with a header, handling exceptions:
=========================================================
import csv
csv_file = 'file/path/file_name'
values = ['preschool', 'secondary school']
def csv_header(x):
    with open(x + '.csv', 'a', newline='') as myfile:
        myfile.write("%s %s %s %s \n" % ('id', 'type', 'state', 'location'))
def csv_writer(y, value):
    for row in y:
        if value in row:
            with open(value + '.csv', 'a', newline='') as myfile:
                spamwriter = csv.writer(myfile)
                spamwriter.writerow(row)
def csv_reader(z):
    with open(z + '.csv', 'r') as spam:
        spamreader = csv.reader(spam, delimiter=',', quotechar='|')
        csv_writer(spamreader, value)
for value in values:
    try:
        csv_reader(value)
        csv_reader(csv_file)
    except Exception:
        csv_header(value)
        csv_reader(csv_file)
Practical scenario for deleting records based on a value
*******************************************************
A CSV with the following header:
columns: id, type, state, location, number of students
124, preschool, Pennsylvania, Pittsburgh, 1242
421, secondary school, Ohio, Cleveland, 1244
213, primary school, California, Los Angeles, 3213
155, secondary school, Pennsylvania, Pittsburgh, 2141
import csv
with open('fin.csv', 'r') as fin, open('fout.csv', 'w', newline='') as fout:
    # define reader and writer objects
    reader = csv.reader(fin, skipinitialspace=True)
    writer = csv.writer(fout, delimiter=',')
    # write headers
    writer.writerow(next(reader))
    # iterate and write rows based on condition
    for i in reader:
        if int(i[-1]) > 2000:
            writer.writerow(i)
&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&
The link below could be useful:
https://realpython.com/python-csv/
&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&
How to connect Python with Oracle - DDL and DML operations
$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$\
https://www.orafaq.com/wiki/Python
Connect Python to Oracle using cx_Oracle
Connect Python to SQL Server using pyodbc
Connect Python to MS Access Database using pyodbc
Connect Python to MySQL using MySQLdb
https://datatofish.com/how-to-connect-python-to-an-oracle-database-using-cx_oracle/
The link below has nice information about all DB connectivity:
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
https://ryrobes.com/featured-articles/using-a-simple-python-script-for-end-to-end-data-transformation-and-etl-part-1/
$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
How to insert data into SQL Server using Python
https://datatofish.com/insert-sql-server-python/
How to read data from a website
https://docs.python.org/3.4/howto/urllib2.html
&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&
Reading data from a URL
import urllib.request
# open a connection to a URL using urllib
webUrl = urllib.request.urlopen('https://docs.tibco.com/pub/spotfire/7.0.0/doc/html/ncfe/ncfe_binning_functions.htm')
# get the result code and print it
print("result code: " + str(webUrl.getcode()))
# read the data from the URL and print it
data = webUrl.read()
print(data)
Real-data web scraping example:
https://savvastjortjoglou.com/nba-draft-part01-scraping.html
Basic web scraping using Python
https://t-redactyl.io/blog/2016/01/basic-web-scraping-in-python.html
______________________________________________________________
A Complete Reference on WebScraping :
https://index-of.es/Varios/Ryan%20Mitchell-Web%20Scraping%20with%20Python_%20Collecting%20Data%20from%20the%20Modern%20Web-O'Reilly%20Media%20(2015).pdf
Good SlideShare on web scraping
https://www.slideshare.net/paulschreiber/web-scraping-with-python?next_slideshow=1
A few details worth knowing:
Urllib2: It is a Python module which can be used for fetching URLs.
It defines functions and classes to help with URL actions
(basic and digest authentication, redirections, cookies, etc.).
BeautifulSoup: It is an incredible tool for pulling out information
from a webpage.
You can use it to extract tables, lists, and paragraphs,
and you can also apply filters to extract information from web pages.
In this article, we will use the latest version, BeautifulSoup 4.
Refer Link : https://www.analyticsvidhya.com/blog/2015/10/beginner-guide-web-scraping-beautiful-soup-python/
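A minimal scraping sketch along those lines (assumes the requests and beautifulsoup4 packages are installed; the URL is just an example):
import requests
from bs4 import BeautifulSoup

# fetch a page and parse its HTML
response = requests.get('https://www.python.org/')
soup = BeautifulSoup(response.text, 'html.parser')

# pull out the page title and all link targets
print(soup.title.string)
for link in soup.find_all('a'):
    print(link.get('href'))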
*******************************************************************
Interacting with Excel from Python - an exercise
https://pbpython.com/xlwings-pandas-excel.html
==================================================================
Python with Docker:
https://towardsdatascience.com/a-short-guide-to-using-docker-for-your-data-science-environment-912617b3603e
learn python in depth : https://www.whoishostingthis.com/resources/python/
12-08-dec updates:
How to deal with dates in Python
https://www.guru99.com/date-time-and-datetime-classes-in-python.html
https://www.w3schools.com/python/python_datetime.asp
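A few date basics as a quick sketch (standard-library datetime plus pandas to_datetime; the sample dates are made up):
import datetime
import pandas as pd

today = datetime.date.today()
print(today)
print(today.strftime('%d-%b-%Y'))   # e.g. 08-Dec-2021 style formatting

# pandas can parse strings into datetimes and expose handy .dt attributes
s = pd.to_datetime(pd.Series(['2021-12-08', '2021-12-12']))
print(s.dt.day_name())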
case studies:
Case studies on single-well analysis:
https://www.dhirubhai.net/pulse/scripts-well-modeling-batch-automation-alfonso-r-reyes/
https://www.slideshare.net/AlfonsoReyes4/scripts-applied-to-well-and-network-modeling?from_action=save
***************************
File operations in C# (for a project)
******************************
case studies:
https://www.guru99.com/c-sharp-file-operations.html
https://www.dotnetperls.com/path
**************************************
Data science resources (a bit of everything)
https://semanticommunity.info/Data_Science/Data_Science_Central
********************************************************************
***************************
spotfire datascience
https://semanticommunity.info/Data_Science/TIBCO_Spotfire_6_for_Data_Science
**********************************
*****************************************************************
Regression Analysis in practical life
https://www.dataquest.io/blog/linear-regression-in-real-life/
Some practical Python: https://realpython.com (always refer to this if you want some real-world scenarios)
ironpython case studies
https://community.tibco.com/wiki/ironpython-script-access-data-web-service-using-httpwebrequest-and-parse-returned-json
FullStack Python : https://www.fullstackpython.com/table-of-contents.html
https://www.pythonforthelab.com/blog/how-to-use-hdf5-files-in-python/
20 Libraries to know in python
https://blog.stoneriverelearning.com/20-great-python-libraries-you-must-know/
Python in a core : https://docs.python-guide.org/
A Python website with a rich blog, nice to know:
https://data-flair.training/blogs/python-tutorial/
1. https://lnkd.in/ey_5hby
1. Glossary of terms for all Python concepts
2. Python with data science
3. Comparison between the tools
The link covers almost every topic and makes it easy to learn Python from basic to advanced level.
2. https://lnkd.in/enFZcBJ
The above link provides the top 10 most important NLP libraries and is very useful for data science.
Upload data from SQL Server to Redshift (using a Python script)
__author__ = 'Abu Shaz'
import sys
from os.path import expanduser
import yaml
import os
from Util.ParseFile import ParseFile
from DataContexts.SqlServerDataContext import SqlServerDataContext
from DataContexts.RedshiftDataContext import RedshiftDataContext
from AWSServices.S3Service import S3Service
from Util.CompressedFile import FileCompression
from Util.FileManipulation import FileManipulation
def main():
    if len(sys.argv) > 1:
        file_location = os.path.dirname(sys.argv[1])  # Need to work here for space in argv
    else:
        file_location = os.path.dirname(expanduser("~"))
    #print(file_location + "/config.yml")
    #print(os.path.abspath('.'))
    #file_location = os.path.dirname("C:\\Users\khairuzzaman\Desktop\Python Script\\config.yml")
    try:
        config_file = ParseFile.parse_yml_file(file_location + "/config.yml")
        sql_configuration = config_file['SQLServer']
        redshift_configuration = config_file['Redshift']
        s3_configuration = config_file['S3Config']
    except Exception as err:
        print('Config file is not well formatted or not available')
    table_name = sql_configuration["table"]
    file_name = table_name + ".gz"
    print("Data Loading from SQL Server Start ...\n")
    try:
        sql_context = SqlServerDataContext(config=sql_configuration)
        sql_context.export_data_to_file(table_name, directory=file_location)
    except Exception as err:
        print(err)
        raise
    print("SQL Server Data Loading Complete ...\n")
    try:
        FileCompression.gzip_compression(table_name, "txt", directory=file_location)
    except Exception as err:
        FileManipulation.remove_file(file_location + "/" + table_name + ".txt")
        print(err)
        raise
    print("Upload File to S3 ...\n")
    try:
        s3_service = S3Service(config=s3_configuration)
        s3_service.upload_file_to_s3(s3_configuration['bucket_name'], file_location=file_location, file_name=file_name)
    except Exception as err:
        FileManipulation.remove_file(file_location + "/" + table_name + ".txt")
        FileManipulation.remove_file(file_location + "/" + file_name)
        print(err)
        raise
    print("File Upload Complete ...\n")
    print("Start Executing Copy Command ...\n")
    try:
        copy_command = {}
        copy_command['table_name'] = redshift_configuration['table']
        copy_command['bucket_name'] = s3_configuration['bucket_name']
        copy_command['file_name'] = file_name
        copy_command['aws_access_key_id'] = s3_configuration['AWS_ACCESS_KEY_ID']
        copy_command['aws_secret_access_key'] = s3_configuration['AWS_SECRET_ACCESS_KEY']
        copy_command['delimiter'] = '\\t'
        redshift_configuration["port"] = "5439"
        redshift_context = RedshiftDataContext(config=redshift_configuration)
        redshift_context.execute_copy_command(command_param=copy_command)
    except Exception as err:
        FileManipulation.remove_file(file_location + "/" + table_name + ".txt")
        FileManipulation.remove_file(file_location + "/" + file_name)
        print(err)
        raise
    print("Executing Copy Command Finish ...\n")
    FileManipulation.remove_file(file_location + "/" + table_name + ".txt")
    FileManipulation.remove_file(file_location + "/" + file_name)
    #print(sql_configuration)
    #print(redshift_configuration)

if __name__ == '__main__':
    main()
The script at the link below is for
TURNING AMAZON REDSHIFT QUERIES INTO AUTOMATED E-MAIL REPORTS
USING PYTHON IN MAC OS X
https://artemiorimando.com/2018/07/20/turning-amazon-redshift-queries-into-automated-e-mail-reports-using-python-in-mac-os/
The link below shows how to
pull your data from Amazon Redshift and PostgreSQL
url:
https://www.blendo.co/blog/access-your-data-in-amazon-redshift-and-postgresql-with-python-and-r/
Redshift is compatible with PostgreSQL. We can access Redshift using the official
PostgreSQL libraries from Python or R.
For JDBC or ODBC, please find the link below:
https://www.blendo.co/blog/access-your-data-in-amazon-redshift-and-postgresql-with-python-and-r/
For accessing data from Google BigQuery using Python:
https://www.blendo.co/blog/access-data-google-bigquery-python-r/
If the language of choice is Python, we can use either NumPy or Pandas.
NumPy: the most fundamental library in Python for scientific computation.
Pandas: a Python data analysis library that provides high-performance data structures
for operating on table-like data.
To get connected to Redshift with Python:
Of the available libraries for Python, the one that PostgreSQL recommends
is Psycopg.
Use the library below:
import psycopg2
con = psycopg2.connect(dbname='dbname', host='host',
                       port='port', user='user', password='pwd')
The parameters that you need:
Database name
Host name
Port
User name
password
EXECUTE QUERIES WITH PYTHON USING PSYCOPG
First get a cursor from your DB connection:
cur = con.cursor()
Execute a select query to pull data, where `table` is the table
you want to get data from:
cur.execute("SELECT * FROM `table`;")
After the successful execution of your query you need to
instruct Psycopg how to fetch your data from the database
cur.fetchall()
Of course, after you are done, do not forget to close your
cursor & connection:
cur.close()
con.close()
LOAD DATA TO NUMPY
import numpy as np
data = np.array(cur.fetchall())
Where cur is the cursor we created previously. That's all;
you now have your data from Redshift as a NumPy array.
LOAD DATA TO PANDAS
If instead of NumPy you plan to work with pandas,
you can avoid using the previous steps altogether.
You can use the read_sql method, with which you can read an
SQL query or database table directly into a DataFrame.
In order to do that you will also need to use SQLAlchemy.
from sqlalchemy import create_engine
import pandas as pd
engine = create_engine('postgresql://scott:tiger@redshift_host:5439/mydatabase')
data_frame = pd.read_sql_query('SELECT * FROM `table`;', engine)
Note : SQLAlchemy https://www.sqlalchemy.org/
(Python SQL Toolkit& Object Relational Mapper)
SQLAlchemy is the Python SQL toolkit and Object Relational Mapper
that gives application developers the full power and flexibility of SQL.
With the above code you will end up having a pandas data frame
that contains the results of the SQL query that you have provided to it.
R ( Language)
-------------
If Python is not your cup of tea and you prefer R instead,
you are still covered. Getting your data from Amazon Redshift or PostgreSQL
is equally easy as in Python.
As in Python we again need to first take care of how we will
connect to our database and execute queries to it.
To do that we will need the "RPostgreSQL" package.
install.packages("RPostgreSQL")
require("RPostgreSQL")
With the above code we make the package available and visible by R.
Now we need to proceed in creating connections and executing queries.
At the end comes the command to close the connection;
don't run that before you execute your queries, though,
and always remember to close the connection when you are done
with pulling data out of the database.
drv <- dbDriver("PostgreSQL")
con <-dbConnect(drv,dbname="dbname",host="host",port=1234,
user="user",password="password")
dbDisconnect(con)
Getting your data from the database into an
R data frame is just one command away:
df_postgres <- dbGetQuery(con, "SELECT * from `table`")
getting your data from Redshift or PostgreSQL for further analysis in Python and R
is really easy. The true power of a database that stores your data in
comparison with CSV files etc. is that you have SQL as an additional tool.
Invest some time learning how to work with SQL and you will not regret it,
having structured data in a database and using SQL to pre-process
your data before you start building your statistical models will
save you time and resources.
Although in this article we focused mainly on Amazon Redshift and PostgreSQL,
using any other database is equally easy. The main difference will
be the selection of the appropriate library
for Python (in the NumPy case) and for R.
The usage of SQLAlchemy by Pandas makes it easier for
you as the only change required is the configuration string for the database engine.
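For example, only the connection string changes when you point SQLAlchemy at a different database (hypothetical credentials; the MySQL variant also needs a driver such as pymysql installed):
from sqlalchemy import create_engine
import pandas as pd

# PostgreSQL / Redshift style connection string
pg_engine = create_engine('postgresql://user:pwd@redshift_host:5439/mydatabase')

# MySQL style connection string - the pandas call itself stays the same
mysql_engine = create_engine('mysql+pymysql://user:pwd@mysql_host:3306/mydatabase')

df = pd.read_sql_query('SELECT * FROM some_table;', pg_engine)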
cleaning up with AWS Boto3
https://ranman.com/cleaning-up-aws-with-boto3/
Accessing S3 data in Python using Boto3 (published on LinkedIn)
https://dluo.me/s3databoto3
Python SDK "Boto3" for Amazon AWS (videos)
Offload queries using PySpark (Spark Python) - to be analyzed and worked on next week
Choose the right database in amazon
AWS Pricing Calculator
What is achieved by Data Lake in AWS
About parquet and manifest file
Building Chatbot with python
Building with Django
Step-by-step R and Python for reference - GitHub also included
Tips: how to learn Python as a beginner
Good LinkedIn blog on Python and data science
Google Sheets: a nice blog https://www.benlcollins.com/
Great tips by good blog authors - everyday tips
Good reference on data science
Python visualization
The two lines below are the contents of myfile.txt used in the example that follows:
Annamalai Om NamaSivayaa Thunai Sivakumar Ramar Poornima Devi Sasmita Lakshmi
Unnamalai Om Kamatchi Thunai
Oct 29 : Thiruannamalai Deepam
f=open('C:\\Users\\Sivakumar\\Dropbox\\My PC (DESKTOP-RKHVE1P)\\Desktop\\myfile.txt','r')
#read the content of file
data = f.read()
#get the length of the data
number_of_characters = len(data)
i='-'
print('Number of characters in text file :', number_of_characters)
print(i*100)
print("Upper Case")
print(data.upper())
print(i*100)
print("Lower Case")
print(data.lower())
print("Title Case")
print(i*100)
print(data.title())
Output:
Number of characters in text file : 106
----------------------------------------------------------------------------------------------------
Upper Case
ANNAMALAI OM NAMASIVAYAA THUNAI SIVAKUMAR RAMAR POORNIMA DEVI SASMITA LAKSHMI
UNNAMALAI OM KAMATCHI THUNAI
----------------------------------------------------------------------------------------------------
Lower Case
annamalai om namasivayaa thunai sivakumar ramar poornima devi sasmita lakshmi
unnamalai om kamatchi thunai
Title Case
----------------------------------------------------------------------------------------------------
Annamalai Om Namasivayaa Thunai Sivakumar Ramar Poornima Devi Sasmita Lakshmi
Unnamalai Om Kamatchi Thunai
Found an interesting Python learning resource - my birthday gift, 13th 2021
https://www.youtube.com/watch?v=8DvywoWv6fI
Path to become a Data Engineer
https://www.datacamp.com/community/blog/the-path-to-becoming-a-data-engineer
Hope this will help us get started on Python pandas; looking forward to your valuable feedback.
Have a Great Day