Pandas/Python Library
Pandas
Pandas is a Python library for data analysis. Started by Wes McKinney in 2008 out of a need for a powerful and flexible quantitative analysis tool, pandas has grown into one of the most popular Python libraries. It has an extremely active community of contributors.
Pandas builds on two core Python libraries: matplotlib for data visualization and NumPy for mathematical operations. Pandas acts as a wrapper over these libraries, allowing you to access many of matplotlib's and NumPy's methods with less code. For instance, pandas' .plot() combines multiple matplotlib methods into a single method, enabling you to plot a chart in a few lines.
Before pandas, most analysts used Python for data munging and preparation, and then switched to a more domain-specific language like R for the rest of their workflow. Pandas introduced two new types of objects for storing data that make analytical tasks easier and eliminate the need to switch tools: Series, which have a list-like structure, and DataFrames, which have a tabular structure.
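As a rough illustration of the .plot() shortcut, here is a minimal sketch. The column names and numbers are made up for the example, and the "Agg" backend is selected only so the code runs without a display; in a notebook you would normally leave that line out.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, chosen here so no window is needed

import pandas as pd

# Hypothetical monthly figures, invented for this example.
sales = pd.DataFrame({
    "month": ["Jan", "Feb", "Mar", "Apr"],
    "revenue": [120, 135, 150, 170],
})

# One pandas call creates the figure, draws the line, and sets the title.
ax = sales.plot(x="month", y="revenue", kind="line", title="Monthly revenue")

# The returned object is a regular matplotlib Axes, so matplotlib methods still apply.
ax.set_ylabel("revenue")
```

Because .plot() hands back a matplotlib Axes, you can start with the one-line pandas call and fall back to matplotlib only for the details pandas does not expose.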
Pandas tutorials
Here are some analysis-focused pandas tutorials that aren't riddled with technical jargon.
Pandas data structures
Series
You can think of a series as a single column of data. Each value in the series has a label, and these labels are collectively referred to as an index. In a five-element series with the default index, for example, the labels 0 through 4 form the index and the column of numbers to their right contains the values.
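A minimal sketch of such a series, assuming pandas is installed. The values are arbitrary and chosen only for illustration.

```python
import pandas as pd

# Five arbitrary values; pandas assigns the default integer index 0 through 4.
s = pd.Series([10, 20, 30, 40, 50])

print(s)
# 0    10
# 1    20
# 2    30
# 3    40
# 4    50
# dtype: int64

# The index (labels) and the values can be inspected separately.
labels = list(s.index)
values = list(s.values)
```

The left column of the printed output is the index and the right column holds the values.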
DataFrames
While series are useful, most analysts work with the majority of their data in DataFrames. DataFrames store data in the familiar table format of rows and columns, much like a spreadsheet or database. DataFrames make a lot of analytical tasks easier, such as finding the averages per column in a dataset.
You can also think of DataFrames as a collection of series—just as multiple columns combined make up a table, multiple series make up a DataFrame.
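The sketch below shows both ideas: several series combined into one DataFrame, and per-column averages computed in a single call. The city names and measurements are hypothetical.

```python
import pandas as pd

# Three series that share the same default index (0, 1, 2).
city = pd.Series(["Austin", "Boston", "Chicago"])
temp_c = pd.Series([30.0, 20.0, 25.0])
rain_cm = pd.Series([2.0, 4.0, 3.0])

# Each series becomes one column of the DataFrame.
df = pd.DataFrame({"city": city, "temp_c": temp_c, "rain_cm": rain_cm})

# Selecting a single column gives back a Series.
one_column = df["temp_c"]

# Average of each numeric column.
column_means = df[["temp_c", "rain_cm"]].mean()
print(column_means)
```

Here the numeric columns are selected explicitly before calling mean(), which avoids relying on how pandas treats the text column.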
Note: In Mode, the results of your SQL queries are automatically converted into DataFrames and made available in the list variable datasets. To describe or transform the results of Query 1, use datasets[0]; for the results of Query 2, use datasets[1]; and so on.
For more on manipulating pandas data structures, check out Greg Reda's three-part tutorial, which approaches the topic from a SQL perspective.
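To show how that indexing works, the sketch below builds a stand-in datasets list by hand. This is an assumption for illustration only: inside Mode the list is supplied for you, and the column names here are hypothetical.

```python
import pandas as pd

# Stand-in for Mode's `datasets` list (assumption: built by hand here,
# because the variable only exists inside a Mode notebook).
datasets = [
    pd.DataFrame({"user_id": [1, 2, 3], "amount": [10.0, 20.0, 30.0]}),  # "Query 1"
    pd.DataFrame({"plan": ["free", "pro"], "users": [120, 45]}),          # "Query 2"
]

query_1 = datasets[0]  # results of the first query
query_2 = datasets[1]  # results of the second query

# Describe the first result set and peek at the second.
summary = query_1.describe()
print(summary)
print(query_2.head())
```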
Pandas features
Time series analysis
Split-apply-combine
Split-apply-combine is a common strategy used during analysis to summarize data: you split data into logical subgroups, apply some function to each subgroup, and stick the results back together again. In pandas, this is accomplished using the groupby() function and whatever functions you want to apply to the subgroups.
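A minimal sketch of the three steps with groupby(), using a small made-up orders table. The column names and amounts are hypothetical.

```python
import pandas as pd

# Hypothetical order data, invented for this example.
orders = pd.DataFrame({
    "region": ["east", "west", "east", "west", "east"],
    "amount": [100, 200, 300, 400, 500],
})

# Split: group the rows by region.
grouped = orders.groupby("region")["amount"]

# Apply + combine: run a function on each group and collect one result per group.
totals = grouped.sum()
averages = grouped.mean()

print(totals)
print(averages)
```

Each result is a Series indexed by the group key, so totals["east"] gives the summed amount for that region.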
Data visualization
Pivot tables
Working with missing data