Mastering Data Analysis with Pandas (93/100 Days of Python)
Pandas is a powerful open-source data analysis and manipulation library for Python. It provides data structures and functions needed to work with structured data seamlessly. In this tutorial, we will explore the core features of pandas and provide real-world examples to demonstrate their usefulness.
To install pandas, simply run the following command in your terminal or command prompt:
pip install pandas
Loading and Previewing Data with Pandas
First, import pandas and load the data into a DataFrame, which is a two-dimensional, size-mutable, and heterogeneous tabular data structure with labeled axes.
import pandas as pd
data = pd.read_csv('example_data.csv') # Load data from a CSV file
print(data.head()) # Preview the first 5 rows of the DataFrame
The data in example_data.csv can be any comma-separated data, such as the following:
name,age,gender,height,weight
Alice,29,Female,165,55
Bob,35,Male,180,85
Charlie,42,Male,176,78
Diana,25,Female,162,52
Eva,31,Female,170,60
Frank,28,Male,185,92
Grace,33,Female,168,58
Henry,46,Male,174,80
Isabel,39,Female,160,54
Jack,24,Male,190,90
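As a minimal sketch of the loading step, the sample rows above can be read from an in-memory string so the example runs without a file on disk. Using io.StringIO in place of the file path is an assumption made for illustration; with a real file you would pass 'example_data.csv' as shown earlier.

```python
import io

import pandas as pd

# The sample rows from above, inlined so no CSV file is needed (illustrative only).
csv_text = """name,age,gender,height,weight
Alice,29,Female,165,55
Bob,35,Male,180,85
Charlie,42,Male,176,78
Diana,25,Female,162,52
Eva,31,Female,170,60
Frank,28,Male,185,92
Grace,33,Female,168,58
Henry,46,Male,174,80
Isabel,39,Female,160,54
Jack,24,Male,190,90
"""

# read_csv accepts a file path or any file-like object.
data = pd.read_csv(io.StringIO(csv_text))

print(data.shape)    # (number of rows, number of columns) -> (10, 5)
print(data.dtypes)   # the type pandas inferred for each column
print(data.head(3))  # head() takes an optional row count; the default is 5
```

Checking shape and dtypes right after loading is a quick way to confirm that numeric columns were parsed as numbers rather than text.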
Data Selection and Indexing
There are multiple ways to select and index data in pandas. Some common methods include:
- Selecting a single column:
data['column_name']
- Selecting multiple columns:
data[['column1', 'column2']]
- Selecting rows by index label (loc) or by integer position (iloc):
data.loc[row_index]
data.iloc[row_position]
- Selecting rows based on conditions:
data[data['column_name'] > value]
Example: To select all rows with an age greater than 30:
data_over_30 = data[data['age'] > 30]
The expression data['age'] > 30 returns a boolean Series holding True or False for every row of the DataFrame. Indexing the DataFrame with that Series keeps only the rows where the value is True.
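The selection methods above can be sketched on a small DataFrame. The four rows are taken from the sample data; the variable names (mask, by_name, and so on) are hypothetical and chosen only for this illustration.

```python
import pandas as pd

data = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie", "Diana"],
    "age": [29, 35, 42, 25],
    "gender": ["Female", "Male", "Male", "Female"],
})

# The comparison yields a boolean Series with one value per row.
mask = data["age"] > 30
print(mask.tolist())  # [False, True, True, False]

# Indexing with the mask keeps the rows where it is True.
over_30 = data[mask]

# Combine conditions with & (and) or | (or); each condition needs parentheses.
men_over_30 = data[(data["age"] > 30) & (data["gender"] == "Male")]

# loc selects by label, iloc by integer position.
by_name = data.set_index("name")
bob_age_by_label = by_name.loc["Bob", "age"]
bob_age_by_position = by_name.iloc[1, 0]  # second row, first column
```

With the default integer index, labels and positions coincide, which is why the difference between loc and iloc only becomes visible after something like set_index.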
Handling Missing Data
Pandas provides methods to handle missing data, such as:
- Drop missing data:
data.dropna(axis=0, how='any', inplace=True)
- Fill in missing data with a specified value:
data.fillna(value, inplace=True)
- Interpolate missing data:
data.interpolate(method='linear', inplace=True)
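To fill in missing values with the mean of each numeric column (numeric_only=True skips text columns such as name and gender, which have no mean):
data.fillna(data.mean(numeric_only=True), inplace=True)
# Or, if you don't want to modify data in place
filled_data = data.fillna(data.mean(numeric_only=True))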
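To make the three strategies concrete, here is a small sketch on a DataFrame with deliberately missing values. The rows and the np.nan placements are invented for illustration, and the non-in-place forms are used so that each result can be compared with the original.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie", "Diana"],
    "height": [165.0, np.nan, 176.0, 162.0],
    "weight": [55.0, 85.0, np.nan, 52.0],
})

# Count the missing values in each column before deciding how to handle them.
print(df.isna().sum())

# Option 1: drop every row that has at least one missing value.
dropped = df.dropna()

# Option 2: fill each numeric column with its own mean.
# numeric_only=True leaves the text column out of the mean calculation.
filled = df.fillna(df.mean(numeric_only=True))

# Option 3: linear interpolation between neighbouring rows (numeric columns only).
interpolated = df[["height", "weight"]].interpolate(method="linear")
```

Interpolation assumes that neighbouring rows are related, as in a time series. For unordered records such as this sample, filling with the mean or dropping rows is usually easier to justify.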
Data Manipulation and Transformation
Pandas offers various functions for data manipulation, including:
- Creating a new column based on existing columns:
data['new_column'] = data['column1'] * data['column2']
- Renaming columns:
data.rename(columns={'old_name': 'new_name'}, inplace=True)
- Sorting data by column values:
data.sort_values(by='column_name', ascending=True, inplace=True)
To calculate the body mass index (BMI) and add it as a new column:
data['bmi'] = data['weight'] / (data['height'] / 100) ** 2
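The following sketch chains the three operations above on a few rows of the sample data. The new column names height_cm and weight_kg are hypothetical, and the code assumes height is in centimetres and weight in kilograms, as the BMI formula above implies.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Alice", "Bob", "Jack"],
    "height": [165, 180, 190],  # assumed to be centimetres
    "weight": [55, 85, 90],     # assumed to be kilograms
})

# BMI = weight (kg) / height (m) squared; the arithmetic is applied row by row.
df["bmi"] = df["weight"] / (df["height"] / 100) ** 2

# rename returns a new DataFrame unless inplace=True is passed.
df = df.rename(columns={"height": "height_cm", "weight": "weight_kg"})

# Sort from highest to lowest BMI and renumber the rows.
ranked = df.sort_values(by="bmi", ascending=False).reset_index(drop=True)
print(ranked.round(1))
```

Assigning the result back instead of using inplace=True keeps each step explicit and makes the operations easy to chain.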
Grouping and Aggregation
Grouping and aggregation are essential when working with large datasets. Some common aggregation functions include sum(), mean(), median(), min(), max(), and count().
To find the average age by gender:
average_age_by_gender = data.groupby('gender')['age'].mean()
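Several aggregations can also be computed in one pass with agg. The sketch below uses the sample rows from earlier; the output column names (mean_age, max_height, people) are hypothetical labels chosen for this example.

```python
import pandas as pd

data = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie", "Diana", "Eva",
             "Frank", "Grace", "Henry", "Isabel", "Jack"],
    "age": [29, 35, 42, 25, 31, 28, 33, 46, 39, 24],
    "gender": ["Female", "Male", "Male", "Female", "Female",
               "Male", "Female", "Male", "Female", "Male"],
    "height": [165, 180, 176, 162, 170, 185, 168, 174, 160, 190],
})

# Named aggregation: new_column=(source_column, aggregation_function)
summary = data.groupby("gender").agg(
    mean_age=("age", "mean"),
    max_height=("height", "max"),
    people=("name", "count"),
)
print(summary)
```

The result is a DataFrame indexed by gender with one row per group, so a single value can be read with summary.loc['Female', 'mean_age'].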
Merging, Joining, and Concatenating Datasets
Pandas provides functions to combine datasets, such as:
- Concatenating datasets:
pd.concat([data1, data2], axis=0)
- Merging datasets:
pd.merge(data1, data2, on='key_column', how='inner')
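- Joining datasets (join matches the on column of data1 against the index of data2, so the key is moved into the index first):
data1.join(data2.set_index('key_column'), on='key_column', how='inner')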
To merge customer and order data on the customer_id column:
merged_data = pd.merge(customers, orders, on='customer_id', how='inner')
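The customers and orders DataFrames are not defined in this tutorial, so the sketch below invents a few rows purely for illustration. It shows how the how parameter changes which rows survive the merge.

```python
import pandas as pd

# Hypothetical data: customer 1 has two orders, customers 2 and 3 have none,
# and one order refers to customer 4, who is missing from the customers table.
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "name": ["Alice", "Bob", "Charlie"],
})
orders = pd.DataFrame({
    "order_id": [101, 102, 103],
    "customer_id": [1, 1, 4],
    "amount": [25.0, 40.0, 15.0],
})

# inner: only customer_id values present in both tables.
inner = pd.merge(customers, orders, on="customer_id", how="inner")

# left: every customer, with NaN in the order columns when there is no match.
left = pd.merge(customers, orders, on="customer_id", how="left")

# outer: all rows from both tables; indicator=True adds a _merge column
# recording whether each row came from the left table, the right, or both.
outer = pd.merge(customers, orders, on="customer_id", how="outer", indicator=True)

print(len(inner), len(left), len(outer))  # 2 4 5
```

Comparing row counts before and after a merge is a cheap sanity check: an unexpected increase usually means the key is duplicated in one of the tables.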
Basic Data Visualization
Pandas integrates with Matplotlib to provide simple data visualization. To get started, install Matplotlib using pip install matplotlib, and then import it in your script.
To create a bar chart showing the average age by gender:
import matplotlib.pyplot as plt
average_age_by_gender = data.groupby('gender')['age'].mean()
average_age_by_gender.plot(kind='bar', title='Average Age by Gender')
plt.xlabel('Gender')
plt.ylabel('Average Age')
plt.show()