Introduction to Pandas for Data Analysis
Rohit Ramteke
Senior Technical Lead @Birlasoft | DevOps Expert | CRM Solutions | Siebel Administrator | IT Infrastructure Optimization | Project Management
What is Pandas?
Pandas is a popular open-source data manipulation and analysis library for the Python programming language. It provides a powerful and flexible set of tools for working with structured data, making it a fundamental tool for data scientists, analysts, and engineers.
Key Features of Pandas:
A DataFrame is a two-dimensional, size-mutable, and potentially heterogeneous tabular data structure with labeled axes (rows and columns).
A Series is a one-dimensional labeled array, essentially a single column or row of data.
Importing Pandas
To use Pandas, you must first import it in your Python script:
import pandas as pd
Data Loading
Pandas can load data from many sources, including CSV files (read_csv()) and Excel files (read_excel()). To load a CSV file into a DataFrame, use read_csv():
import pandas as pd
# Read the CSV file into a DataFrame
df = pd.read_csv('your_file.csv')
Replace 'your_file.csv' with the actual file path.
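As a small self-contained sketch of read_csv(), the example below reads CSV text from an in-memory buffer rather than a file on disk. The column names and values are made up for illustration, and io.StringIO only stands in for a real file path.

```python
import io
import pandas as pd

# Hypothetical CSV content; in practice this would live in a file on disk
csv_text = "Name,Age\nAlice,25\nBob,30\n"

# read_csv() accepts a file path or any file-like object
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)   # (rows, columns)
print(df.dtypes)  # column types inferred by Pandas
```

The first line of the CSV text is treated as the header row, so it supplies the column names.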
What is a Series?
A Series is a one-dimensional labeled array. You can create a Series from lists, NumPy arrays, or dictionaries:
import pandas as pd
# Create a Series from a list
data = [10, 20, 30, 40, 50]
s = pd.Series(data)
print(s)
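The list example covers only one of the three inputs mentioned. A minimal sketch of the other two, a dictionary and a NumPy array, might look like this; the variable names and values are illustrative only.

```python
import numpy as np
import pandas as pd

# From a dictionary: the keys become the index labels
s_dict = pd.Series({'a': 10, 'b': 20, 'c': 30})

# From a NumPy array: a default integer index (0, 1, 2, ...) is assigned
s_array = pd.Series(np.array([1.5, 2.5, 3.5]))

print(s_dict)
print(s_array)
```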
Accessing Elements in a Series
print(s[2]) # Access the element whose index label is 2
print(s.iloc[3]) # Access the element at position 3 (the fourth element)
print(s[1:4]) # Slice by position: elements at positions 1, 2, and 3
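With the default integer index, labels and positions happen to coincide, which can hide the difference between the two. The sketch below uses a hypothetical Series with string labels to show label-based access (.loc) next to position-based access (.iloc).

```python
import pandas as pd

# Hypothetical Series with string labels instead of the default 0..n-1 index
s = pd.Series([10, 20, 30, 40, 50], index=['a', 'b', 'c', 'd', 'e'])

print(s.loc['c'])      # by label
print(s.iloc[2])       # by position (same element as above)
print(s.loc['b':'d'])  # label slice includes both endpoints: b, c, d
print(s.iloc[1:3])     # position slice excludes the end: positions 1 and 2
```

Note that a label slice includes its end label, while a position slice follows the usual Python convention and excludes the end.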
Series Attributes and Methods
print(s.values) # Get values as NumPy array
print(s.index) # Get index labels
print(s.shape) # Get dimensions
print(s.size) # Get number of elements
print(s.mean()) # Get mean of elements
print(s.unique()) # Get unique values
What is a DataFrame?
A DataFrame is a two-dimensional labeled data structure, similar to an Excel spreadsheet or SQL table.
Creating DataFrames from Dictionaries
import pandas as pd
# Creating a DataFrame from a dictionary
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
'Age': [25, 30, 35, 28],
'City': ['New York', 'San Francisco', 'Los Angeles', 'Chicago']}
df = pd.DataFrame(data)
print(df)
Column Selection
print(df['Name']) # Access the 'Name' column
Accessing Rows
print(df.iloc[2]) # Access the third row by position
print(df.loc[1]) # Access the row whose index label is 1 (the second row here)
Slicing
print(df[['Name', 'Age']]) # Select specific columns
print(df[1:3]) # Select rows by position: rows 1 and 2 (the end is excluded)
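Rows and columns can also be selected in a single step. The following sketch rebuilds the example DataFrame and selects the same block twice, once by label with .loc and once by position with .iloc; the names by_label and by_position are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie', 'David'],
                   'Age': [25, 30, 35, 28],
                   'City': ['New York', 'San Francisco', 'Los Angeles', 'Chicago']})

# Label-based: rows labeled 1 through 2 (inclusive), two named columns
by_label = df.loc[1:2, ['Name', 'Age']]

# Position-based: rows at positions 1 and 2, first two columns
by_position = df.iloc[1:3, 0:2]

print(by_label)
print(by_position)
```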
Finding Unique Elements
unique_ages = df['Age'].unique()
print(unique_ages)
Conditional Filtering
above_25 = df[df['Age'] > 25]
print(above_25)
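Filters can be combined. As a sketch using the same example data, the code below applies two conditions at once and also filters by membership in a list; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie', 'David'],
                   'Age': [25, 30, 35, 28],
                   'City': ['New York', 'San Francisco', 'Los Angeles', 'Chicago']})

# Wrap each condition in parentheses; combine with & (and), | (or), ~ (not)
in_range = df[(df['Age'] > 25) & (df['Age'] < 35)]

# isin() keeps rows whose value appears in the given list
west_coast = df[df['City'].isin(['San Francisco', 'Los Angeles'])]

print(in_range)
print(west_coast)
```

The parentheses matter because & and | bind more tightly than comparison operators in Python.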
Saving DataFrames
df.to_csv('data.csv', index=False)
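To check what to_csv() writes, one option is a round trip: save the DataFrame and read it back. The sketch below uses an in-memory buffer (io.StringIO) purely for illustration, so nothing is written to disk; a file path works the same way.

```python
import io
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 30]})

# Write to an in-memory buffer instead of a file on disk (illustration only)
buffer = io.StringIO()
df.to_csv(buffer, index=False)  # index=False omits the row index column

# Rewind the buffer and read the same data back
buffer.seek(0)
restored = pd.read_csv(buffer)

print(restored)
```

Without index=False, the row index is written as an extra unnamed column, which then reappears as a column when the file is read back.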
DataFrame Attributes and Methods
print(df.shape) # Get dimensions
df.info() # Print a summary of the DataFrame (info() prints directly and returns None)
print(df.describe()) # Get summary statistics
print(df.head(2)) # Get first 2 rows
print(df.tail(2)) # Get last 2 rows
print(df.mean(numeric_only=True)) # Calculate the mean of numeric columns only
print(df.sort_values(by='Age')) # Sort by Age
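Sorting combines naturally with head(). As a sketch on the same example data, the code below sorts in descending order and keeps the two oldest people; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie', 'David'],
                   'Age': [25, 30, 35, 28],
                   'City': ['New York', 'San Francisco', 'Los Angeles', 'Chicago']})

# sort_values() returns a new DataFrame; ascending=False sorts largest first
oldest_first = df.sort_values(by='Age', ascending=False)

# Keep only the first two rows of the sorted result
top_two = oldest_first.head(2)

print(top_two)
print(df)  # the original DataFrame keeps its original row order
```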
Conclusion
Pandas is an essential tool for data analysis, built around two flexible data structures: the Series and the DataFrame. Knowing how to create, select, filter, sort, and save them covers most everyday data manipulation tasks, making Pandas a valuable skill for any data-driven professional.