Python for Data Analysis by ganesh kavhar

ganesh kavhar

Python | PySpark | Databricks | ETL | SQL | Unix | Big Data | Data warehouse

Published: March 24, 2019

A friend recently asked me how to get started, and I thought the answer might benefit others if published here. This is for someone new to Python who wants the easiest path from zero to one.

Download the Python 3.x version of the Anaconda distribution for your operating system from the Anaconda website. You will avoid a lot of install-related headaches by choosing this pre-bundled distribution. It comes with most of the important data analysis packages pre-installed.
Once you have it installed, check that the default Python interpreter is the one you’ve just installed. This is important because your system may already have a version of Python installed, but it won’t have all the good stuff in the Anaconda bundle, so you need to make sure the new one is the default. On Mac/Linux this might mean typing which python in the terminal and checking that the path points into your Anaconda directory. Or you can just run the Python interpreter and make sure the version matches what you downloaded. If all went well, the installer will already have set this up. If not, you’ll need to stop here and fix it.
Issue the jupyter notebook command in your shell. This should open a browser window. If not, open a browser and navigate to http://localhost:8888. Once there, create a new Python notebook.
Go to the kernels section of www.kaggle.com (Kaggle has since renamed kernels to notebooks) and filter to Python kernels. These are mostly Jupyter notebooks of other people doing analysis or building models on data sets that are freely available on Kaggle’s website. Look for titles that mention EDA (Exploratory Data Analysis), as opposed to those building predictive models. Find one that’s interesting and start recreating it in your notebook.
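
The interpreter check in the second step can also be done from inside Python itself. Here is a minimal sketch using only the standard library. The function name describe_interpreter is made up for illustration, and looking for "conda" in the path is only a heuristic assumption, since install directories vary.

```python
import sys


def describe_interpreter():
    """Return basic facts about the Python interpreter running this code."""
    path = sys.executable or ""
    return {
        "executable": path,
        "version": "{}.{}.{}".format(*sys.version_info[:3]),
        # Heuristic only: Anaconda/Miniconda installs usually have "conda"
        # somewhere in their path, but a custom install location may not.
        "looks_like_conda": "conda" in path.lower(),
    }


info = describe_interpreter()
print("Running:", info["executable"])
print("Version:", info["version"])
print("Path mentions conda:", info["looks_like_conda"])
```

If the printed path points at a system Python rather than your Anaconda directory, the new install is not the default yet.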

Note: When you try to recreate some of these analyses, you’ll find that you get import errors. This is likely because the authors installed packages that are not bundled in the Anaconda distribution. You’ll eventually need to learn how to interact with the conda package manager, and this will be one of many rabbit holes you’ll go down. Usually it’s as easy as conda install <package_name>, but you’ll need to find the right package name and sometimes you’ll need to specify other details. Other times you’ll need to use pip install <other_package_name>, but you’ll learn all that later.
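
Before recreating a notebook, it can help to check which of its imports are missing from your environment. Below is a rough sketch using the standard library's importlib. The helper name missing_packages and the list of imports are illustrative only. Keep in mind that the import name is not always the name you install: scikit-learn, for instance, is imported as sklearn.

```python
import importlib.util


def missing_packages(import_names):
    """Return the top-level import names that cannot be found."""
    return [
        name
        for name in import_names
        if importlib.util.find_spec(name) is None
    ]


# Hypothetical list of imports copied from the top of someone's notebook.
wanted = ["numpy", "pandas", "seaborn", "sklearn"]

for name in missing_packages(wanted):
    print("Not installed:", name, "(try conda install or pip install)")
```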

High Level Library Summary

Here’s a quick summary of the important libraries you’ll interact with frequently.

NumPy: has a lot of the core functionality for scientific computing. Under the hood it calls compiled C code, so it is much faster than the same functions written in pure Python. Not the most user-friendly.
SciPy: builds on NumPy and adds more tools for sampling from distributions, calculating test statistics, etc.
Matplotlib: the main plotting framework. A necessary evil.
Seaborn: built on top of Matplotlib. Older versions made your plots a lot prettier as soon as you imported it; since version 0.8 you need to call sns.set() (or sns.set_theme() in newer releases) to get the prettier defaults. Also has its own functionality, but I find the coolest stuff runs too slow.
Pandas: largely a layer on top of NumPy that makes it more user-friendly. Ideal for interacting with tables of data, which it calls DataFrames. Also has wrappers around plotting functionality to enable quick plotting while avoiding the complications of Matplotlib. I use Pandas more than anything for manipulating data.
Scikit-learn: has a lot of supervised and unsupervised machine learning algorithms. Also has many metrics for doing model selection, a decomposition module for things like Principal Component Analysis, and a nice preprocessing module for things like encoding categorical variables.
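
To give a feel for how NumPy and Pandas fit together, here is a small sketch. It assumes both libraries are installed, as they are in the Anaconda bundle. The numbers and column names are made up purely for illustration.

```python
import numpy as np
import pandas as pd

# Made-up sample data, purely for illustration.
heights_cm = np.array([150.0, 160.0, 170.0, 180.0])

# NumPy applies the arithmetic to the whole array at once (no Python loop).
heights_m = heights_cm / 100.0

# A DataFrame is a table with named columns.
df = pd.DataFrame(
    {
        "group": ["a", "a", "b", "b"],
        "height_m": heights_m,
    }
)

# Average height per group.
mean_by_group = df.groupby("group")["height_m"].mean()
print(mean_by_group)
```

The same pattern of loading a table, transforming columns, and then grouping and summarizing shows up in most of the EDA notebooks you will read.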

Quick Tips

When in a Jupyter notebook, put a question mark in front of any object before running the cell and it will open up the documentation for it. This is really handy when you’ve forgotten what arguments the function you’re trying to call expects. For example, ?my_dataframe.apply will explain the apply method of the pandas.DataFrame object, represented here by my_dataframe.
You will likely always need to refer to the documentation for whatever library you’re using, so just keep it open in your browser. There are just too many optional arguments and nuances.
When it comes to the inevitable task of troubleshooting, Stack Overflow probably has the answer.
Accept the fact that you’ll be doing things you don’t fully understand for a while, or you’ll get bogged down by details that aren’t that important. Some day you’ll probably need to understand virtual environments, and it’s really not that hard, but there are many detours like this that add unnecessary pain for someone getting started.
Read other people’s code. It’s the best way to learn conventions and best practices. That’s where the Kaggle kernels really help. GitHub also supports the display of Jupyter notebooks in the browser, so there are tons of examples on the internet.
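
The question mark shortcut only works in Jupyter and IPython. In a plain Python script or shell, the built-in help() function and the inspect module give similar information. Here is a small sketch. The names quick_help and scale are made up for illustration.

```python
import inspect


def quick_help(obj):
    """Return the signature (when available) and first docstring line of obj."""
    try:
        signature = str(inspect.signature(obj))
    except (TypeError, ValueError):
        # Some built-ins and non-callables do not expose a signature.
        signature = "(...)"
    doc = inspect.getdoc(obj) or ""
    first_line = doc.splitlines()[0] if doc else ""
    return signature, first_line


def scale(values, factor=2):
    """Multiply every value by factor."""
    return [v * factor for v in values]


print(quick_help(scale))
```

For the full documentation rather than a one-line summary, help(scale) prints everything.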
