Mastering Data Analysis with Pandas (93/100 Days of Python)

Martin Mirakyan
3 min read · Apr 4, 2023

Day 93 of the “100 Days of Python” blog post series covering the pandas library

Pandas is a powerful open-source data analysis and manipulation library for Python. It provides the data structures and functions needed to work with structured, tabular data. In this tutorial, we will explore the core features of pandas and work through a short example of each.

To install pandas, simply run the following command in your terminal or command prompt:

pip install pandas

Loading and Previewing Data with Pandas

First, import pandas and load the data into a DataFrame, which is a two-dimensional, size-mutable, and heterogeneous tabular data structure with labeled axes.

import pandas as pd

data = pd.read_csv('example_data.csv') # Load data from a CSV file
print(data.head()) # Preview the first 5 rows of the DataFrame

The file example_data.csv can contain any comma-separated data. The examples in this post assume the following contents, with height in centimeters and weight in kilograms:

name,age,gender,height,weight
Alice,29,Female,165,55
Bob,35,Male,180,85
Charlie,42,Male,176,78
Diana,25,Female,162,52
Eva,31,Female,170,60
Frank,28,Male,185,92
Grace,33,Female,168,58
Henry,46,Male,174,80
Isabel,39,Female,160,54
Jack,24,Male,190,90
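If you'd like to follow along without creating a file, here is a minimal sketch that loads the same rows from an in-memory string. Using io.StringIO in place of example_data.csv is an assumption made for convenience; read_csv accepts either a path or a file-like object.

```python
import io

import pandas as pd

# Inline copy of the sample rows, so no file on disk is needed.
csv_text = """name,age,gender,height,weight
Alice,29,Female,165,55
Bob,35,Male,180,85
Charlie,42,Male,176,78
Diana,25,Female,162,52
Eva,31,Female,170,60
Frank,28,Male,185,92
Grace,33,Female,168,58
Henry,46,Male,174,80
Isabel,39,Female,160,54
Jack,24,Male,190,90
"""

data = pd.read_csv(io.StringIO(csv_text))

print(data.shape)    # (number of rows, number of columns)
print(data.dtypes)   # column types inferred by pandas
print(data.head(3))  # first 3 rows instead of the default 5
```

Checking shape and dtypes right after loading is a quick way to confirm that numeric columns were not read as text.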

Data Selection and Indexing

There are multiple ways to select and index data in pandas. Some common methods include:

  • Selecting a single column: data['column_name']
  • Selecting multiple columns: data[['column1', 'column2']]
  • Selecting rows by index label or by integer position: data.loc[row_label] or data.iloc[row_position]
  • Selecting rows based on conditions: data[data['column_name'] > value]

Example: To select all rows with an age greater than 30:

data_over_30 = data[data['age'] > 30]

The expression data['age'] > 30 returns a Series of True/False values, one for every row of the DataFrame. Indexing the DataFrame with that boolean Series keeps only the rows where the value is True.
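To make the selection methods above concrete, here is a small sketch on a four-row DataFrame. The rows are a subset of the sample data, and the variable names (mask, males_over_30, and so on) are only illustrative.

```python
import pandas as pd

data = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie", "Diana"],
    "age": [29, 35, 42, 25],
    "gender": ["Female", "Male", "Male", "Female"],
})

# A comparison on a column produces a boolean Series, one value per row.
mask = data["age"] > 30
print(mask.tolist())

# Indexing with the mask keeps only the rows where it is True.
over_30 = data[mask]

# Combine conditions with & (and) or | (or); each condition needs parentheses.
males_over_30 = data[(data["age"] > 30) & (data["gender"] == "Male")]

# .iloc selects by integer position, .loc selects by index label.
first_row = data.iloc[0]
name_at_label_2 = data.loc[2, "name"]

print(over_30)
print(name_at_label_2)
```

With the default integer index, labels and positions happen to coincide; they diverge as soon as you filter, sort, or set a different index.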

Handling Missing Data

Pandas provides methods to handle missing data, such as:

  • Drop missing data: data.dropna(axis=0, how='any', inplace=True)
  • Fill in missing data with a specified value: data.fillna(value, inplace=True)
  • Interpolate missing data: data.interpolate(method='linear', inplace=True)

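To fill in missing values in the numeric columns with the mean of each column:

# numeric_only=True skips text columns such as name and gender,
# which would otherwise raise a TypeError in recent pandas versions
data.fillna(data.mean(numeric_only=True), inplace=True)
# Or if you don't want to do it in-place
new_values = data.fillna(data.mean(numeric_only=True))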
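The sample CSV has no gaps, so here is a minimal sketch with a few values deliberately set to NaN. The small DataFrame below is made up for illustration; it shows how to count missing values, drop incomplete rows, and fill with column means.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie", "Diana"],
    "age": [29.0, np.nan, 42.0, 25.0],
    "weight": [55.0, 85.0, np.nan, 52.0],
})

# Count missing values per column before deciding how to handle them.
missing_per_column = data.isna().sum()
print(missing_per_column)

# Option 1: drop every row that has at least one missing value.
dropped = data.dropna()

# Option 2: fill numeric gaps with each column's mean (text columns are skipped).
filled = data.fillna(data.mean(numeric_only=True))

print(dropped)
print(filled)
```

Both calls return new DataFrames and leave data unchanged, which makes it easier to compare the two strategies side by side.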

Data Manipulation and Transformation

Pandas offers various functions for data manipulation, including:

  • Creating a new column based on existing columns: data['new_column'] = data['column1'] * data['column2']
  • Renaming columns: data.rename(columns={'old_name': 'new_name'}, inplace=True)
  • Sorting data by column values: data.sort_values(by='column_name', ascending=True, inplace=True)

To calculate the body mass index (BMI) and add it as a new column:

data['bmi'] = data['weight'] / (data['height'] / 100) ** 2
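The manipulation steps above can also be chained. The sketch below is one possible way to do it, not the only one: it renames the columns, adds the BMI column with assign, and sorts the result. The new column names height_cm and weight_kg are my own choice for illustration.

```python
import pandas as pd

data = pd.DataFrame({
    "name": ["Alice", "Bob", "Diana"],
    "height": [165, 180, 162],   # centimeters
    "weight": [55, 85, 52],      # kilograms
})

result = (
    data
    # Make the units explicit in the column names.
    .rename(columns={"height": "height_cm", "weight": "weight_kg"})
    # BMI = weight in kg divided by the square of height in meters.
    .assign(bmi=lambda df: df["weight_kg"] / (df["height_cm"] / 100) ** 2)
    # Highest BMI first, then renumber the rows from 0.
    .sort_values(by="bmi", ascending=False)
    .reset_index(drop=True)
)

print(result.round(2))
```

Each method returns a new DataFrame, so the original data is left untouched and no inplace=True is needed.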

Grouping and Aggregation

Grouping splits the rows of a DataFrame by the values of one or more columns, and aggregation reduces each group to a summary value. Some common aggregation functions include sum(), mean(), median(), min(), max(), and count().

To find the average age by gender:

average_age_by_gender = data.groupby('gender')['age'].mean()
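When you need more than one statistic per group, agg with named aggregations is one convenient option. The following is a minimal sketch on four rows of the sample data; the output column names (mean_age, max_height, people) are made up for this example.

```python
import pandas as pd

data = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie", "Diana"],
    "age": [29, 35, 42, 25],
    "gender": ["Female", "Male", "Male", "Female"],
    "height": [165, 180, 176, 162],
})

# Each keyword is an output column: (input column, aggregation function).
summary = data.groupby("gender").agg(
    mean_age=("age", "mean"),
    max_height=("height", "max"),
    people=("name", "count"),
)

print(summary)
```

The result is a DataFrame indexed by gender, so summary.loc["Male", "mean_age"] looks up a single value.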

Merging, Joining, and Concatenating Datasets

Pandas provides functions to combine datasets, such as:

  • Concatenating datasets: pd.concat([data1, data2], axis=0)
  • Merging datasets: pd.merge(data1, data2, on='key_column', how='inner')
  • Joining datasets: data1.join(data2, on='key_column', how='inner'), which matches key_column in data1 against the index of data2

To merge customer and order data on the customer_id column, assuming customers and orders are two DataFrames that both contain that column:

merged_data = pd.merge(customers, orders, on='customer_id', how='inner')
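The customers and orders DataFrames are not defined in this post, so here is a small sketch with made-up rows that shows how the how argument changes the result of a merge.

```python
import pandas as pd

# Hypothetical data: customer 4 has an order but no customer record,
# and customers 2 and 3 have no orders.
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "name": ["Alice", "Bob", "Charlie"],
})
orders = pd.DataFrame({
    "order_id": [101, 102, 103],
    "customer_id": [1, 1, 4],
    "amount": [20.0, 35.5, 12.0],
})

# Inner merge: only customer_id values present in both DataFrames.
inner = pd.merge(customers, orders, on="customer_id", how="inner")

# Left merge: every customer is kept; missing order fields become NaN.
left = pd.merge(customers, orders, on="customer_id", how="left")

print(inner)
print(left)
```

An inner merge silently drops unmatched rows on both sides, so comparing the row counts of inner and left is a simple check for customers without orders.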

Basic Data Visualization

Pandas integrates with Matplotlib to provide simple data visualization. To get started, install Matplotlib using pip install matplotlib, and then import it in your script.

To create a bar chart showing the average age by gender:

import matplotlib.pyplot as plt

average_age_by_gender = data.groupby('gender')['age'].mean()
average_age_by_gender.plot(kind='bar', title='Average Age by Gender')
plt.xlabel('Gender')
plt.ylabel('Average Age')
plt.show()
