Working with Excel Sheets and CSV Files Using Pandas for Data Processing (89/100 Days of Python)
Excel and CSV
files are two of the most commonly used file formats for storing and analyzing data. Pandas is a popular data manipulation library in Python that provides a simple and efficient way to work with these file formats. In this tutorial, we will explore how to read and write Excel and CSV
files in Pandas with real-world examples.
Reading Excel Files in Pandas
Pandas provides a function called read_excel()
to read Excel files. The function takes the file path as the argument and returns a DataFrame
object:
import pandas as pd
# Reading the Excel file
df = pd.read_excel('sales_data.xlsx')
# Displaying the first five rows of the DataFrame
print(df.head())
In this example, we are reading the sales_data.xlsx
file and storing the data in a DataFrame
object named df
. We are then printing the first five rows of the DataFrame
using the head()
function.
Pandas also allows you to read specific sheets from an Excel file by specifying the sheet name or sheet index:
# Reading a specific sheet from the Excel file
df = pd.read_excel('sales_data.xlsx', sheet_name='Sheet1')
# Displaying the first five rows of the DataFrame
print(df.head())
In this example, we are reading the Sheet1
sheet from the sales_data.xlsx
file and storing the data in a DataFrame
object named df
.
Writing Excel Files in Pandas
Pandas provides a function called to_excel()
to write data to Excel files. The function takes the file path and sheet name as arguments:
# Writing the DataFrame to an Excel file
df.to_excel('output.xlsx', sheet_name='Sheet1', index=False)
# Displaying a message after the file has been written
print('Excel file written successfully!')
In this example, we are writing the df
DataFrame
to an Excel file named output.xlsx
and specifying the sheet name as Sheet1
. We are also setting the index
argument to False
to exclude the index column from the output.
Reading CSV Files in Pandas
Pandas provides a function called read_csv()
to read CSV
files. The function takes the file path as the argument and returns a DataFrame
object:
# Reading the CSV file
df = pd.read_csv('sales_data.csv')
# Displaying the first five rows of the DataFrame
print(df.head())
In this example, we are reading the sales_data.csv
file and storing the data in a DataFrame
object named df
. We are then printing the first five rows of the DataFrame
using the head()
function.
Pandas also allows you to read CSV
files with different delimiters, such as tabs or semicolons. You can specify the delimiter using the delimiter
or sep
argument:
# Reading a CSV file with a semicolon delimiter
df = pd.read_csv('sales_data.csv', delimiter=';')
# Displaying the first five rows of the DataFrame
print(df.head())
Here we are reading the sales_data.csv
file with a semicolon delimiter and storing the data in a DataFrame
object named df
.
Writing CSV Files in Pandas
Pandas provides a function called to_csv()
to write data to CSV
files. The function takes the file path as the argument:
# Writing the DataFrame to a CSV file
df.to_csv('output.csv', index=False)
# Displaying a message after the file has been written
print('CSV file written successfully!')
In this example, we are writing the df
DataFrame
to a CSV
file named output.csv
and setting the index
argument to False
to exclude the index column from the output.
Real-World Example: Analyzing Sales Data
Let’s work through a real-world example to demonstrate how to work with Excel and CSV
files in Pandas. Suppose you have a sales data file in Excel format that contains information about sales made by a company. The file has the following columns:
- Date: the date the sale was made
- Product: the name of the product sold
- Quantity: the quantity of the product sold
- Price: the price of each unit sold
- Total: the total amount of the sale
You want to analyze this data to answer the following questions:
- What was the total revenue generated by the company?
- What was the most popular product sold?
- What was the average price per unit sold?
First, let’s read the Excel file into a Pandas DataFrame
:
sales_df = pd.read_excel('sales_data.xlsx')
To answer the first question, we need to calculate the total revenue generated by the company. We can do this by multiplying the quantity and price columns and then summing up the values in the Total column:
# Calculating the total revenue
total_revenue = sales_df['Total'].sum()
print(f'Total revenue generated by the company: ${total_revenue:.2f}')
To answer the second question, we need to find the most popular product sold. We can do this by grouping the DataFrame
by the Product column and then summing up the values in the Quantity column:
# Finding the most popular product sold
popular_product = sales_df.groupby('Product')['Quantity'].sum().idxmax()
print(f'Most popular product sold: {popular_product}')
To answer the third question, we need to calculate the average price per unit sold. We can do this by dividing the sum of the values in the Total column by the sum of the values in the Quantity column:
# Calculating the average price per unit sold
average_price = sales_df['Total'].sum() / sales_df['Quantity'].sum()
print(f'Average price per unit sold: ${average_price:.2f}')
Finally, let’s save the results to a new Excel file:
# Creating a new DataFrame with the results
results_df = pd.DataFrame({
'Metric': ['Total Revenue', 'Most Popular Product', 'Average Price per Unit'],
'Value': [total_revenue, popular_product, average_price]
})
# Writing the results to a new Excel file
results_df.to_excel('sales_data_results.xlsx', index=False)
So, we created a new DataFrame
with the results and then wrote the data to a new Excel file named sales_data_results.xlsx
.
What’s next?
- If you found this story valuable, please consider clapping multiple times (this really helps a lot!)
- Hands-on Practice: Free Python Course
- Full series: 100 Days of Python
- Previous topic: Web Scraping with Scrapy
- Next topic: Working With XML and JSON Data