Chapter 2 Data Preprocessing and Exploratory Data Analysis

Chapter 2 Data Preprocessing and Exploratory Data Analysis#

In this chapter, we’ll dive into the essential steps of data preprocessing and exploratory data analysis (EDA) using Python. We’ll use a common dataset ‘Online Retail dataset’ which is included in the Python package ‘ucimlrepo’ to demonstrate how to clean, preprocess, and explore marketing data using popular Python libraries like Pandas, Matplotlib, and Seaborn. After that, we will introduce several Python packages for streamlined EDA, which include Data Wrangler, and SweetViz.

2.1 Data Cleaning and Preprocessing with Pandas#

Before we can analyze our marketing data, we need to ensure that it’s clean, consistent, and in a suitable format. Pandas, a powerful data manipulation library in Python, provides various functions and methods to clean and preprocess data efficiently.

Let’s start by loading the Online Retail dataset and performing some basic data cleaning steps:

import pandas as pd

# read the csv file into a pandas dataframe 
df = pd.read_csv('data\online_retail.csv')
df.head()

	Invoice	StockCode	Description	Quantity	InvoiceDate	Price	Customer ID	Country
0	489434	85048	15CM CHRISTMAS GLASS BALL 20 LIGHTS	12	2009-12-01 07:45:00	6.95	13085.0	United Kingdom
1	489434	79323P	PINK CHERRY LIGHTS	12	2009-12-01 07:45:00	6.75	13085.0	United Kingdom
2	489434	79323W	WHITE CHERRY LIGHTS	12	2009-12-01 07:45:00	6.75	13085.0	United Kingdom
3	489434	22041	RECORD FRAME 7" SINGLE SIZE	48	2009-12-01 07:45:00	2.10	13085.0	United Kingdom
4	489434	21232	STRAWBERRY CERAMIC TRINKET BOX	24	2009-12-01 07:45:00	1.25	13085.0	United Kingdom

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1067371 entries, 0 to 1067370
Data columns (total 8 columns):
 #   Column       Non-Null Count    Dtype  
---  ------       --------------    -----  
 0   Invoice      1067371 non-null  object 
 1   StockCode    1067371 non-null  object 
 2   Description  1062989 non-null  object 
 3   Quantity     1067371 non-null  int64  
 4   InvoiceDate  1067371 non-null  object 
 5   Price        1067371 non-null  float64
 6   Customer ID  824364 non-null   float64
 7   Country      1067371 non-null  object 
dtypes: float64(2), int64(1), object(5)
memory usage: 65.1+ MB

Like any other programming languages, we need to pay attention to the data type in order to ensure the code runs as intended. We notice that InvoiceDate is not in date format. Thus, we need to convert it into datetime format.

import datetime as dt 
df['InvoiceDate'] = pd.to_datetime(df['InvoiceDate'])

# Check for missing values
print(df.isnull().sum())

# Remove rows with missing values
df = df.dropna()

# Remove duplicates
df = df.drop_duplicates()

# Create a new column for total sales amount
df['Amount'] = df['Quantity'] * df['Price']

Invoice        0
StockCode      0
Description    0
Quantity       0
InvoiceDate    0
Price          0
Customer ID    0
Country        0
dtype: int64

In this code snippet, we first check for missing values using the isnull() and sum() functions. We then remove rows with missing values using dropna() and remove duplicate rows using drop_duplicates().

Next, we convert the ‘InvoiceDate’ column to datetime type using to_datetime() to enable time-based analysis. Finally, we create a new column called ‘Amount’ by multiplying ‘Quantity’ and ‘UnitPrice’ to calculate the total sales amount for each transaction.

2.2 Exploratory Data Analysis with Matplotlib and Seaborn#

Once our data is clean and preprocessed, we can move on to exploratory data analysis (EDA). EDA helps us understand the underlying patterns, relationships, and distributions in our marketing data. Matplotlib and Seaborn are popular Python libraries for creating informative and visually appealing plots and charts.

Let’s explore some key aspects of our dataset using Matplotlib and Seaborn:

Sales Trend Over Time#

import matplotlib.pyplot as plt

# Group by date and sum the sales amount
daily_sales = data.groupby(data['InvoiceDate'].dt.date)['Amount'].sum()

# Plot the daily sales trend
plt.figure(figsize=(12, 6))
plt.plot(daily_sales.index, daily_sales.values)
plt.xlabel('Date')
plt.ylabel('Total Sales')
plt.title('Daily Sales Trend')
plt.show()

This code snippet groups the data by date and calculates the total sales amount for each day. We then create a line plot using Matplotlib to visualize the daily sales trend over time.

Product Category Analysis#

import seaborn as sns

# Extract product categories from the Description column
data['Category'] = data['Description'].str.extract('(^\w+)')

# Group by category and sum the sales amount
category_sales = data.groupby('Category')['Amount'].sum().sort_values(ascending=False)

# Plot the sales distribution by category
plt.figure(figsize=(12, 6))
sns.barplot(x=category_sales.index, y=category_sales.values)
plt.xlabel('Product Category')
plt.ylabel('Total Sales')
plt.title('Sales Distribution by Product Category')
plt.xticks(rotation=45)
plt.show()

In this example, we extract product categories from the ‘Description’ column using the str.extract() function. We then group the data by category and calculate the total sales amount for each category. Finally, we create a bar plot using Seaborn to visualize the sales distribution across different product categories.

Customer Segmentation#

# Group by customer ID and calculate total sales and number of orders
customer_data = data.groupby('CustomerID').agg({'Amount': 'sum', 'InvoiceNo': 'nunique'})

# Rename columns
customer_data.columns = ['TotalSales', 'NumOrders']

# Create a scatter plot
plt.figure(figsize=(10, 8))
sns.scatterplot(x='TotalSales', y='NumOrders', data=customer_data)
plt.xlabel('Total Sales')
plt.ylabel('Number of Orders')
plt.title('Customer Segmentation')
plt.show()

Here, we group the data by customer ID and calculate the total sales and number of orders for each customer. We then create a scatter plot using Seaborn to visualize the relationship between total sales and the number of orders. This plot helps us identify different customer segments based on their purchasing behavior.

These are just a few examples of how Matplotlib and Seaborn can be used for exploratory data analysis in marketing. By creating informative visualizations, we can gain deeper insights into customer behavior, product performance, and sales trends.

2.3 Streamlined Exploratory Data Analysis with Data Wrangler and SweetViz#

In addition to using Pandas, Matplotlib, and Seaborn for data preprocessing and EDA, there are several other Python packages that can make this process even more streamlined. We’ll introduce two of these packages: Data Wrangler and SweetViz.

Data Wrangler#

Data Wrangler is a powerful library for automating data preprocessing tasks. With just a few lines of code, you can perform common data cleaning and preprocessing steps, such as:

Removing duplicate rows
Replacing missing values
Handling outliers
Scaling and normalizing data

To install Data Wrangler, run:

pip install datawrangler

Here’s an example of using Data Wrangler for data cleaning and preprocessing:

import datawrangler as dw

# Load data into a pandas DataFrame
data = dw.load_data('online\_retail.xlsx')

# Automate the data cleaning process
data = dw.preprocess(data)

SweetViz#

SweetViz is an open-source library for generating insights and visualizations from data. It automates the exploratory data analysis process by generating a set of visualizations that summarize the key aspects of the dataset. With SweetViz, you can quickly identify trends, patterns, and relationships in the data without having to write any code.

To install SweetViz, run:

pip install sweetviz

To use SweetViz, you can simply create a report using the following code:

from sweetviz import SweetVizReport

# Create a report based on the data
report = SweetVizReport('data\_retail.xlsx')

# Save the report to an HTML file
report.save('results\_report.html')

By using Data Wrangler and SweetViz alongside Pandas, Matplotlib, and Seaborn, you can streamline your data preprocessing and exploratory data analysis workflows and gain deeper insights into your marketing data.