Welcome to Module 2 of our Pandas tutorial series! In this module, we’ll delve into basic data manipulation techniques with Pandas. You’ll learn how to load data from various sources, explore and inspect datasets, and perform indexing and selection operations to extract specific data of interest.
1. Loading Data
Reading Data from Different Sources
Pandas provides versatile tools for reading data from various sources, including CSV, Excel, SQL databases, and more. You can use functions like pd.read_csv(), pd.read_excel(), and pd.read_sql() to load data into Pandas DataFrames.
import pandas as pd
# Reading data from a CSV file
df = pd.read_csv('data.csv')
# Reading data from an Excel file
df = pd.read_excel('data.xlsx')
# Reading data from a SQL database
import sqlite3
conn = sqlite3.connect('mydatabase.db')
df = pd.read_sql('SELECT * FROM mytable', conn)
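The snippet above assumes that data.csv, data.xlsx, and mydatabase.db already exist on disk. As a minimal sketch that runs without any files, the example below reads the same kind of data from in-memory stand-ins. The CSV text, the table contents, and the column names are hypothetical and chosen only for illustration.

```python
import io
import sqlite3

import pandas as pd

# Hypothetical CSV content standing in for a file such as 'data.csv'
csv_text = "ID,name,score\n1,Ana,82\n2,Ben,47\n3,Cy,65\n"
csv_df = pd.read_csv(io.StringIO(csv_text))

# Hypothetical in-memory SQLite database standing in for 'mydatabase.db'
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (ID INTEGER, name TEXT, score INTEGER)")
conn.executemany(
    "INSERT INTO mytable VALUES (?, ?, ?)",
    [(1, "Ana", 82), (2, "Ben", 47), (3, "Cy", 65)],
)
conn.commit()

# Same call as in the tutorial, pointed at the in-memory table
sql_df = pd.read_sql("SELECT * FROM mytable", conn)
conn.close()

print(csv_df.shape)           # (3, 3)
print(list(sql_df.columns))   # ['ID', 'name', 'score']
```

pd.read_csv() accepts any file-like object, which is why io.StringIO works here in place of a path.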
Example:
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/pandas-dev/pandas/main/doc/data/titanic.csv')
print(df)
Setting the Index
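The index labels each row and is what label-based lookups such as .loc[] rely on. You can set a specific column as the index when reading data, or by calling the .set_index() method on an existing DataFrame. Note that .set_index() returns a new DataFrame, so assign the result back.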
# Setting the 'ID' column as the index
df = df.set_index('ID')
# Alternatively, set the index during data reading
df = pd.read_csv('data.csv', index_col='ID')
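To check that both approaches give the same result, here is a small sketch using hypothetical in-memory CSV text in place of data.csv. The column names and values are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical CSV content; 'ID' is the column we want as the index
csv_text = "ID,name,score\n101,Ana,82\n102,Ben,47\n"

# Option 1: read first, then promote 'ID' to the index
df_a = pd.read_csv(io.StringIO(csv_text)).set_index("ID")

# Option 2: set the index while reading
df_b = pd.read_csv(io.StringIO(csv_text), index_col="ID")

print(df_a.equals(df_b))       # True
print(df_a.loc[102, "name"])   # Ben
```

Once 'ID' is the index, it is no longer a regular column, so df_a['ID'] would raise a KeyError. Use df_a.index to access it instead.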
2. Data Exploration and Inspection
Head, Tail, and Sample Data
To quickly inspect your data, use methods like .head(), .tail(), and .sample(). These methods display the first few, last few, or random rows of your DataFrame. By default, .head() and .tail() return 5 rows and .sample() returns 1; pass a number to change that.
# Display the first 5 rows
print(df.head())
# Display the last 5 rows
print(df.tail())
# Display a random sample of 5 rows
print(df.sample(5))
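The following sketch shows these methods on a small hypothetical DataFrame. It also passes random_state to .sample(), which is one way to make a random sample reproducible between runs.

```python
import pandas as pd

# Hypothetical 10-row DataFrame with values 0..9
df = pd.DataFrame({"value": range(10)})

first_three = df.head(3)   # rows 0, 1, 2
last_two = df.tail(2)      # rows 8, 9

# A fixed random_state makes the "random" rows repeatable
sample_a = df.sample(4, random_state=0)
sample_b = df.sample(4, random_state=0)

print(first_three["value"].tolist())   # [0, 1, 2]
print(last_two["value"].tolist())      # [8, 9]
print(sample_a.equals(sample_b))       # True
```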
Info and Describe Methods
The .info() method prints a summary of your DataFrame: column names, data types, non-null counts (which reveal missing values), and memory usage. The .describe() method calculates summary statistics (count, mean, standard deviation, minimum, quartiles, and maximum) for numeric columns.
# Display DataFrame info
df.info()
# Calculate summary statistics
print(df.describe())
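As an illustration, the sketch below builds a small hypothetical DataFrame with one missing value. It shows that .describe() skips non-numeric columns by default and that the count row excludes missing values. The .isna().sum() call is an additional, common way to count missing values per column.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 'age' has one missing value, 'name' is text
df = pd.DataFrame({
    "age": [22.0, 38.0, np.nan, 35.0],
    "name": ["Ana", "Ben", "Cy", "Dee"],
})

# .info() prints its report directly, so no print() is needed
df.info()

# Count missing values per column
missing = df.isna().sum()
print(missing["age"])              # 1

# Summary statistics cover only the numeric 'age' column
stats = df.describe()
print(list(stats.columns))         # ['age']
print(stats.loc["count", "age"])   # 3.0
```

To include text columns in the summary as well, df.describe(include="all") can be used.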
Shape and Dimensions
You can determine the shape of your DataFrame using the .shape attribute, which returns a (rows, columns) tuple. To get just one dimension, use len(df) for the number of rows and len(df.columns) for the number of columns.
# Get the shape of the DataFrame
print(df.shape)
# Get the number of rows
print(len(df))
# Get the number of columns
print(len(df.columns))
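A quick sketch on a hypothetical 3-by-2 DataFrame shows how these values relate to each other. The .size and .ndim attributes are not covered above but belong to the same family.

```python
import pandas as pd

# Hypothetical DataFrame with 3 rows and 2 columns
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# .shape is a tuple, so it can be unpacked directly
n_rows, n_cols = df.shape

print(df.shape)          # (3, 2)
print(len(df))           # 3
print(len(df.columns))   # 2
print(df.size)           # 6 (rows * columns)
print(df.ndim)           # 2
```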
3. Indexing and Selection
Selecting Columns and Rows
Pandas provides multiple ways to select specific columns and rows. Use square brackets [] to select columns by name, .loc[] to select by label, and .iloc[] to select by integer position.
# Select a single column by name
column = df['Column_Name']
# Select multiple columns by names
columns = df[['Column1', 'Column2']]
# Select a row by index label using .loc[] (a single label returns a Series)
selected_rows = df.loc['Label']
# Select a row by integer position using .iloc[] (0 is the first row)
selected_rows = df.iloc[0]
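The difference between label-based and position-based selection is easiest to see on a DataFrame whose index is not 0, 1, 2. The sketch below uses hypothetical city data with string labels. Note that .loc[] slices include the end label, while .iloc[] slices exclude the end position.

```python
import pandas as pd

# Hypothetical data with string labels as the index
df = pd.DataFrame(
    {"city": ["Oslo", "Lima", "Pune"], "temp": [4, 19, 31]},
    index=["a", "b", "c"],
)

row_by_label = df.loc["b"]    # row labelled 'b', returned as a Series
row_by_pos = df.iloc[1]       # second row, which is the same row here
cell = df.loc["c", "temp"]    # single value at row 'c', column 'temp'

label_slice = df.loc["a":"b"]   # end label included: rows 'a' and 'b'
pos_slice = df.iloc[0:1]        # end position excluded: row 'a' only

# Lists of labels select several rows and columns at once
subset = df.loc[["a", "c"], ["temp"]]

print(row_by_label["city"])   # Lima
print(cell)                   # 31
print(len(label_slice))       # 2
print(len(pos_slice))         # 1
```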
Boolean Indexing
You can filter rows based on conditions using Boolean indexing. This allows you to extract rows that meet specific criteria.
# Boolean indexing to filter rows
filtered_data = df[df['Column'] > 50]
# Combine multiple conditions using & (and) or | (or); wrap each condition in parentheses
filtered_data = df[(df['Column1'] > 50) & (df['Column2'] < 30)]
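Here is a small worked sketch with hypothetical values, so the result of each filter can be checked by eye. It stores the condition in a variable named mask (a Boolean Series) and also shows ~ for negation and .isin() for membership tests, which pair well with & and |.

```python
import pandas as pd

# Hypothetical data; the default index is 0, 1, 2, 3
df = pd.DataFrame({
    "Column1": [10, 60, 75, 90],
    "Column2": [45, 25, 40, 10],
})

# A condition produces a Boolean Series with one True/False per row
mask = (df["Column1"] > 50) & (df["Column2"] < 30)

both = df[mask]                                            # rows 1 and 3
either = df[(df["Column1"] > 50) | (df["Column2"] < 30)]   # rows 1, 2, 3
negated = df[~mask]                                        # rows 0 and 2
in_set = df[df["Column1"].isin([10, 90])]                  # rows 0 and 3

print(both.index.tolist())      # [1, 3]
print(either.index.tolist())    # [1, 2, 3]
print(negated.index.tolist())   # [0, 2]
print(in_set.index.tolist())    # [0, 3]
```

Python's and/or keywords do not work element-wise on a Series, which is why & and | are used, and the parentheses are needed because & and | bind more tightly than comparison operators.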
Setting Values
You can set values for specific cells, rows, or columns using assignment.
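# new_value and new_values below are placeholders for your own data
# Set a value for a specific cell (row label, column name)
df.at['Label', 'Column'] = new_value
# Set values for an entire column (a scalar, or a sequence with one item per row)
df['Column'] = new_values
# Set values for multiple columns (the shape of new_values must match the selection)
df[['Column1', 'Column2']] = new_values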
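The sketch below applies these assignments to a hypothetical DataFrame with concrete values. It also adds a conditional assignment through .loc[], which combines Boolean indexing with setting values. Column3 and all the numbers are made up for illustration.

```python
import pandas as pd

# Hypothetical data with string labels as the index
df = pd.DataFrame(
    {"Column1": [1, 2, 3], "Column2": [10, 20, 30]},
    index=["a", "b", "c"],
)

# Set a single cell by row label and column name
df.at["b", "Column1"] = 99

# Replace a whole column; the list has one item per row
df["Column2"] = [100, 200, 300]

# Assigning a scalar to a new name creates a column filled with that value
df["Column3"] = 0

# Conditional assignment: set Column3 to 1 where Column1 exceeds 50
df.loc[df["Column1"] > 50, "Column3"] = 1

print(df["Column1"].tolist())   # [1, 99, 3]
print(df["Column2"].tolist())   # [100, 200, 300]
print(df["Column3"].tolist())   # [0, 1, 0]
```

For conditional updates, prefer a single .loc[rows, column] assignment as shown here. Chained indexing such as df[df['Column1'] > 50]['Column3'] = 1 may operate on a temporary copy and leave the original DataFrame unchanged.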
Congratulations! You’ve completed Module 2 of our Pandas tutorial. In this module, you’ve learned essential techniques for basic data manipulation with Pandas, including loading data from various sources, exploring and inspecting datasets, and performing indexing and selection operations.
In the next module, we’ll dive deeper into data cleaning and preparation with Pandas, including handling missing data and transforming data to make it suitable for analysis.