Chapter 3 Feature Engineering for Marketing Data#
In the previous chapter, we learned how to preprocess and explore our marketing data using Python libraries like Pandas, Matplotlib, and Seaborn. Now, we’ll dive into feature engineering, a crucial step in the data science workflow that involves creating new and informative features from existing data to improve the predictive power of our models.
Feature engineering is particularly important in marketing analytics, as it helps us capture more complex patterns and relationships in customer behavior and market trends. In this chapter, we’ll explore various techniques for creating new features using Python and apply them to our Online Retail dataset.
3.1 Recency, Frequency, and Monetary (RFM) Analysis#
One of the most common feature engineering techniques in marketing is Recency, Frequency, and Monetary (RFM) analysis. RFM analysis segments customers based on three key metrics:
Recency: How recently a customer made a purchase
Frequency: How often a customer makes purchases
Monetary: How much a customer spends
Let’s calculate RFM metrics for our Online Retail dataset using Python:
import pandas as pd
# read the csv file into a pandas dataframe
df = pd.read_csv('data/online_retail.csv')
df['Amount'] = df['Quantity'] * df['Price']
df['InvoiceDate'] = pd.to_datetime(df['InvoiceDate'])
df.head()
|   | Invoice | StockCode | Description | Quantity | InvoiceDate | Price | Customer ID | Country | Amount |
|---|---------|-----------|-------------|----------|-------------|-------|-------------|---------|--------|
| 0 | 489434 | 85048 | 15CM CHRISTMAS GLASS BALL 20 LIGHTS | 12 | 2009-12-01 07:45:00 | 6.95 | 13085.0 | United Kingdom | 83.4 |
| 1 | 489434 | 79323P | PINK CHERRY LIGHTS | 12 | 2009-12-01 07:45:00 | 6.75 | 13085.0 | United Kingdom | 81.0 |
| 2 | 489434 | 79323W | WHITE CHERRY LIGHTS | 12 | 2009-12-01 07:45:00 | 6.75 | 13085.0 | United Kingdom | 81.0 |
| 3 | 489434 | 22041 | RECORD FRAME 7" SINGLE SIZE | 48 | 2009-12-01 07:45:00 | 2.10 | 13085.0 | United Kingdom | 100.8 |
| 4 | 489434 | 21232 | STRAWBERRY CERAMIC TRINKET BOX | 24 | 2009-12-01 07:45:00 | 1.25 | 13085.0 | United Kingdom | 30.0 |
from datetime import datetime
# Calculate the maximum InvoiceDate for each customer
max_date = df.groupby('Customer ID')['InvoiceDate'].max()
max_date
Customer ID
12346.0 2011-01-18 10:17:00
12347.0 2011-12-07 15:52:00
12348.0 2011-09-25 13:13:00
12349.0 2011-11-21 09:51:00
12350.0 2011-02-02 16:01:00
...
18283.0 2011-12-06 12:02:00
18284.0 2010-10-06 12:31:00
18285.0 2010-02-17 10:24:00
18286.0 2010-08-20 11:57:00
18287.0 2011-10-28 09:29:00
Name: InvoiceDate, Length: 5942, dtype: datetime64[ns]
scoring_date = datetime(2011, 12, 10)
# Calculate Recency, Frequency, and Monetary metrics
rfm = df.groupby('Customer ID').agg({
'InvoiceDate': lambda x: (scoring_date - x.max()).days,
'Invoice': 'nunique',
'Amount': 'sum'
})
# Rename columns
rfm.columns = ['Recency', 'Frequency', 'Monetary']
# Assign quintile ranks to each metric
rfm['R_rank'] = pd.qcut(rfm['Recency'], 5, labels=False, duplicates='drop')
rfm['F_rank'] = pd.qcut(rfm['Frequency'], 5, labels=False, duplicates='drop')
rfm['M_rank'] = pd.qcut(rfm['Monetary'], 5, labels=False, duplicates='drop')
# Add 1 to the ranks to start ranking from 1 instead of 0
rfm['R_rank'] += 1
rfm['F_rank'] += 1
rfm['M_rank'] += 1
# Combine R, F, and M ranks into a single RFM score
rfm['RFM_score'] = rfm['R_rank'].astype(str) + rfm['F_rank'].astype(str) + rfm['M_rank'].astype(str)
rfm.head()
| Customer ID | Recency | Frequency | Monetary | R_rank | F_rank | M_rank | RFM_score |
|-------------|---------|-----------|----------|--------|--------|--------|-----------|
| 12346.0 | 325 | 17 | -64.68 | 4 | 4 | 1 | 441 |
| 12347.0 | 2 | 8 | 5633.32 | 1 | 3 | 5 | 135 |
| 12348.0 | 75 | 5 | 2019.40 | 3 | 2 | 4 | 324 |
| 12349.0 | 18 | 5 | 4404.54 | 1 | 2 | 5 | 125 |
| 12350.0 | 310 | 1 | 334.40 | 4 | 1 | 2 | 412 |
In this code snippet, we first calculate the maximum InvoiceDate for each customer. We then compute the Recency, Frequency, and Monetary metrics with the groupby() and agg() functions, assign quintile ranks to each metric with pd.qcut(), and concatenate the three ranks into a single RFM score.

Note the direction of the ranks: pd.qcut() ranks values from smallest to largest, so a low R_rank means a recent purchase, while high F_rank and M_rank indicate frequent purchases and high spending. Under this scheme, a score such as 135 marks a high-value customer who buys recently, frequently, and in large amounts, whereas a score such as 511 marks a lapsed, low-spending customer.

By calculating RFM metrics for the Online Retail dataset, we can segment our customers based on their purchasing behavior and create targeted marketing campaigns. For example, we might send win-back incentives to customers with high Recency ranks (4 or 5), who have not purchased in a long time, and target customers in the highest Monetary ranks (4 or 5) with premium products or services.
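As a simple illustration of such segmentation, we could map the combined ranks to coarse segment labels. The segment names and thresholds below are our own assumptions, not part of the dataset, and the small frame stands in for the `rfm` DataFrame built above:

```python
import pandas as pd

# Toy data shaped like the rfm frame above; in practice use that frame directly.
rfm = pd.DataFrame({
    'R_rank': [1, 4, 1, 3],
    'F_rank': [5, 1, 2, 3],
    'M_rank': [5, 1, 2, 3],
}, index=[12347.0, 12350.0, 12349.0, 12348.0])

def label_segment(row):
    # Recall: with this ranking scheme a LOW R_rank means a recent purchase,
    # while HIGH F and M ranks mean frequent, high-spending customers.
    if row['R_rank'] <= 2 and row['F_rank'] >= 4 and row['M_rank'] >= 4:
        return 'Champion'
    if row['R_rank'] >= 4:
        return 'At risk'
    return 'Regular'

rfm['Segment'] = rfm.apply(label_segment, axis=1)
print(rfm['Segment'].tolist())  # ['Champion', 'At risk', 'Regular', 'Regular']
```

In a real campaign, the thresholds would be tuned to the business context rather than fixed at these illustrative cut-offs.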
Now that we’ve calculated RFM metrics for our dataset, let’s move on to other feature engineering techniques for marketing data.
3.2 Time-based Features#
Time-based features can provide valuable insights into customer behavior over time. For example, we might want to understand how a customer’s purchasing behavior changes throughout the year or the impact of a marketing campaign on sales over time.
To create time-based features for our Online Retail dataset, we might use techniques such as:
Rolling averages and moving averages to capture trends in customer behavior over time
Lagged features to capture the impact of previous purchases or events on current behavior
Year, month, day, and hour-of-the-day features to capture seasonality or time-specific patterns
Let’s create time-based features for our Online Retail dataset:
df.head()
|   | Invoice | StockCode | Description | Quantity | InvoiceDate | Price | Customer ID | Country | Amount |
|---|---------|-----------|-------------|----------|-------------|-------|-------------|---------|--------|
| 0 | 489434 | 85048 | 15CM CHRISTMAS GLASS BALL 20 LIGHTS | 12 | 2009-12-01 07:45:00 | 6.95 | 13085.0 | United Kingdom | 83.4 |
| 1 | 489434 | 79323P | PINK CHERRY LIGHTS | 12 | 2009-12-01 07:45:00 | 6.75 | 13085.0 | United Kingdom | 81.0 |
| 2 | 489434 | 79323W | WHITE CHERRY LIGHTS | 12 | 2009-12-01 07:45:00 | 6.75 | 13085.0 | United Kingdom | 81.0 |
| 3 | 489434 | 22041 | RECORD FRAME 7" SINGLE SIZE | 48 | 2009-12-01 07:45:00 | 2.10 | 13085.0 | United Kingdom | 100.8 |
| 4 | 489434 | 21232 | STRAWBERRY CERAMIC TRINKET BOX | 24 | 2009-12-01 07:45:00 | 1.25 | 13085.0 | United Kingdom | 30.0 |
df.loc[df['Customer ID'] == 18287.0].head()
|   | Invoice | StockCode | Description | Quantity | InvoiceDate | Price | Customer ID | Country | Amount |
|---|---------|-----------|-------------|----------|-------------|-------|-------------|---------|--------|
| 1824 | C489592 | 22301 | COFFEE MUG CAT + BIRD DESIGN | -2 | 2009-12-01 14:19:00 | 2.55 | 18287.0 | United Kingdom | -5.10 |
| 199795 | 508581 | 22243 | HOOK, 5 HANGER ,MAGIC TOADSTOOL RED | 12 | 2010-05-17 11:55:00 | 1.65 | 18287.0 | United Kingdom | 19.80 |
| 199796 | 508581 | 22246 | GARLAND, MAGIC GARDEN 1.8M | 12 | 2010-05-17 11:55:00 | 1.95 | 18287.0 | United Kingdom | 23.40 |
| 199797 | 508581 | 22236 | CAKE STAND 3 TIER MAGIC GARDEN | 1 | 2010-05-17 11:55:00 | 12.75 | 18287.0 | United Kingdom | 12.75 |
| 199798 | 508581 | 21439 | BASKET OF TOADSTOOLS | 12 | 2010-05-17 11:55:00 | 1.25 | 18287.0 | United Kingdom | 15.00 |
# Set the index of the DataFrame to InvoiceDate
df_tbf = df.copy()
df_tbf.set_index('InvoiceDate', inplace=True)
# For each customer and day, compute total sales, number of unique invoices, and total quantity
daily_sales = df_tbf.groupby('Customer ID').resample('D').agg({'Amount': 'sum', 'Invoice': 'nunique', 'Quantity': 'sum'})
daily_sales.head()
| Customer ID | InvoiceDate | Amount | Invoice | Quantity |
|-------------|-------------|--------|---------|----------|
| 12346.0 | 2009-12-14 | 90.0 | 3 | 20 |
| | 2009-12-15 | 0.0 | 0 | 0 |
| | 2009-12-16 | 0.0 | 0 | 0 |
| | 2009-12-17 | 0.0 | 0 | 0 |
| | 2009-12-18 | 23.5 | 2 | 6 |
# create a 7-day rolling average of sales, invoices, and quantity for each customer
rolling_avg = daily_sales.groupby('Customer ID')[['Amount', 'Invoice', 'Quantity']].rolling(window=7).mean()
rolling_avg = rolling_avg.reset_index(drop=True, level=0).reset_index()
# rename the columns to descriptive RollingAvg* names
rolling_avg = rolling_avg.rename(columns={'Amount': 'RollingAvgSales_7days', 'Invoice': 'RollingAvgInvoices_7days', 'Quantity': 'RollingAvgQuantity_7days'})
rolling_avg.tail()
|   | Customer ID | InvoiceDate | RollingAvgSales_7days | RollingAvgInvoices_7days | RollingAvgQuantity_7days |
|---|-------------|-------------|-----------------------|--------------------------|--------------------------|
| 1644576 | 18287.0 | 2011-10-24 | 0.000000 | 0.000000 | 0.000000 |
| 1644577 | 18287.0 | 2011-10-25 | 0.000000 | 0.000000 | 0.000000 |
| 1644578 | 18287.0 | 2011-10-26 | 0.000000 | 0.000000 | 0.000000 |
| 1644579 | 18287.0 | 2011-10-27 | 0.000000 | 0.000000 | 0.000000 |
| 1644580 | 18287.0 | 2011-10-28 | 10.097143 | 0.142857 | 15.428571 |
# Create lagged features for each customer's total sales and number of orders (invoices) for daily_sales
daily_sales['Amount_lag1'] = daily_sales.groupby('Customer ID')['Amount'].shift(1)
daily_sales['Amount_lag2'] = daily_sales.groupby('Customer ID')['Amount'].shift(2)
daily_sales['InvoiceNo_lag1'] = daily_sales.groupby('Customer ID')['Invoice'].shift(1)
daily_sales['InvoiceNo_lag2'] = daily_sales.groupby('Customer ID')['Invoice'].shift(2)
# Create year, month, day, hour-of-the-day, and Day-of-the-week features from the InvoiceDate index in daily_sales
daily_sales['InvoiceYear'] = daily_sales.index.get_level_values('InvoiceDate').year
daily_sales['InvoiceMonth'] = daily_sales.index.get_level_values('InvoiceDate').month
daily_sales['InvoiceDay'] = daily_sales.index.get_level_values('InvoiceDate').day
daily_sales['InvoiceHour'] = daily_sales.index.get_level_values('InvoiceDate').hour  # always 0 after daily resampling
daily_sales['InvoiceDayOfWeek'] = daily_sales.index.get_level_values('InvoiceDate').dayofweek
daily_sales.head()
| Customer ID | InvoiceDate | Amount | Invoice | Quantity | Amount_lag1 | Amount_lag2 | InvoiceNo_lag1 | InvoiceNo_lag2 | InvoiceYear | InvoiceMonth | InvoiceDay | InvoiceHour | InvoiceDayOfWeek |
|-------------|-------------|--------|---------|----------|-------------|-------------|----------------|----------------|-------------|--------------|------------|-------------|------------------|
| 12346.0 | 2009-12-14 | 90.0 | 3 | 20 | NaN | NaN | NaN | NaN | 2009 | 12 | 14 | 0 | 0 |
| | 2009-12-15 | 0.0 | 0 | 0 | 90.0 | NaN | 3.0 | NaN | 2009 | 12 | 15 | 0 | 1 |
| | 2009-12-16 | 0.0 | 0 | 0 | 0.0 | 90.0 | 0.0 | 3.0 | 2009 | 12 | 16 | 0 | 2 |
| | 2009-12-17 | 0.0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 2009 | 12 | 17 | 0 | 3 |
| | 2009-12-18 | 23.5 | 2 | 6 | 0.0 | 0.0 | 0.0 | 0.0 | 2009 | 12 | 18 | 0 | 4 |
In this example, we extracted time-based features from the InvoiceDate column by hand, so that the underlying concepts and calculation logic are explicit. Later, we will show how specialized Python libraries can generate such features automatically for large datasets.
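One caveat worth noting: after daily resampling, the hour component of the index is always midnight, so `InvoiceHour` above is uniformly 0. To capture genuine time-of-day patterns, the same calendar features can be derived at the transaction level with pandas' `.dt` accessor. A minimal sketch on made-up timestamps (in practice this would be applied to `df['InvoiceDate']`):

```python
import pandas as pd

# Made-up transaction timestamps for illustration only.
tx = pd.DataFrame({'InvoiceDate': pd.to_datetime([
    '2009-12-01 07:45:00', '2010-05-17 11:55:00', '2011-10-28 09:29:00'])})

tx['InvoiceYear'] = tx['InvoiceDate'].dt.year
tx['InvoiceMonth'] = tx['InvoiceDate'].dt.month
tx['InvoiceHour'] = tx['InvoiceDate'].dt.hour            # real hour-of-day, unlike the daily resample
tx['InvoiceDayOfWeek'] = tx['InvoiceDate'].dt.dayofweek  # Monday = 0
tx['IsWeekend'] = tx['InvoiceDayOfWeek'] >= 5

print(tx[['InvoiceHour', 'InvoiceDayOfWeek', 'IsWeekend']])
```

Hour-of-day and weekend flags derived this way often reveal, for example, whether a customer shops during working hours or in the evening.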
3.4 Product-based Features#
Product-based features capture information about the products customers purchase and their characteristics. Let’s create some product-based features for our Online Retail dataset:
df.head()
|   | Invoice | StockCode | Description | Quantity | InvoiceDate | Price | Customer ID | Country | Amount | Holiday | HolidayName |
|---|---------|-----------|-------------|----------|-------------|-------|-------------|---------|--------|---------|-------------|
| 0 | 489434 | 85048 | 15CM CHRISTMAS GLASS BALL 20 LIGHTS | 12 | 2009-12-01 07:45:00 | 6.95 | 13085.0 | United Kingdom | 83.4 | False | None |
| 1 | 489434 | 79323P | PINK CHERRY LIGHTS | 12 | 2009-12-01 07:45:00 | 6.75 | 13085.0 | United Kingdom | 81.0 | False | None |
| 2 | 489434 | 79323W | WHITE CHERRY LIGHTS | 12 | 2009-12-01 07:45:00 | 6.75 | 13085.0 | United Kingdom | 81.0 | False | None |
| 3 | 489434 | 22041 | RECORD FRAME 7" SINGLE SIZE | 48 | 2009-12-01 07:45:00 | 2.10 | 13085.0 | United Kingdom | 100.8 | False | None |
| 4 | 489434 | 21232 | STRAWBERRY CERAMIC TRINKET BOX | 24 | 2009-12-01 07:45:00 | 1.25 | 13085.0 | United Kingdom | 30.0 | False | None |
df_pbf = df.groupby('Customer ID').agg({'StockCode': 'nunique', 'Amount': 'sum', 'Quantity': 'sum', 'Invoice': 'nunique'})
# rename the columns to descriptive feature names
df_pbf = df_pbf.rename(columns={'StockCode': 'NumUniqueProducts', 'Amount': 'TotalSales', 'Quantity': 'TotalQuantity', 'Invoice': 'NumInvoices'})
df_pbf.head()
| Customer ID | NumUniqueProducts | TotalSales | TotalQuantity | NumInvoices |
|-------------|-------------------|------------|---------------|-------------|
| 12346.0 | 30 | -64.68 | 52 | 17 |
| 12347.0 | 126 | 5633.32 | 3286 | 8 |
| 12348.0 | 25 | 2019.40 | 2714 | 5 |
| 12349.0 | 139 | 4404.54 | 1619 | 5 |
| 12350.0 | 17 | 334.40 | 197 | 1 |
In this code snippet, we calculate product-based features for each customer: the number of unique products purchased, total sales, total quantity purchased, and the number of invoices. In the next section, we will derive further features from these aggregates, such as the average spend per product and the average quantity per transaction.
These features can help us understand customer preferences and purchasing habits, which can inform product recommendations, cross-selling, and upselling strategies.
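For recommendations in particular, it can also help to know which product a customer buys most often. A small sketch on invented transactions (our own addition; in the book's workflow the groupby would run on `df` itself):

```python
import pandas as pd

# Invented transactions for illustration only.
df_toy = pd.DataFrame({
    'Customer ID': [1, 1, 1, 2, 2],
    'StockCode':   ['A', 'A', 'B', 'C', 'C'],
    'Quantity':    [2, 3, 1, 4, 1],
})

# Most frequently purchased product per customer;
# mode() sorts its result, so ties resolve to the first value in sorted order.
favourite = df_toy.groupby('Customer ID')['StockCode'].agg(lambda s: s.mode().iloc[0])
print(favourite.to_dict())  # {1: 'A', 2: 'C'}
```

A "favourite product" feature like this can seed simple cross-sell rules, e.g. recommending items frequently bought alongside the favourite.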
3.5 Interaction Features#
Interaction features capture the relationship between two or more variables in our dataset. For example, we might want to understand how the relationship between a customer’s total sales and the number of orders they place impacts their overall value to the business.
To create interaction features for our Online Retail dataset, we can use techniques such as:
Multiplying or dividing features to create new features that capture the relationship between them
Polynomial features to capture more complex relationships between variables
3.5.1 Interaction features by division and multiplication#
Let’s create interaction features for our Online Retail dataset:
# Calculate the average spend per unique product for each customer
df_pbf['AvgUnitPrice'] = df_pbf['TotalSales'] / df_pbf['NumUniqueProducts']
# Calculate the average quantity per transaction for each customer
df_pbf['AvgQuantityPerTransaction'] = df_pbf['TotalQuantity'] / df_pbf['NumInvoices']
df_pbf.head()
| Customer ID | NumUniqueProducts | TotalSales | TotalQuantity | NumInvoices | AvgUnitPrice | AvgQuantityPerTransaction |
|-------------|-------------------|------------|---------------|-------------|--------------|---------------------------|
| 12346.0 | 30 | -64.68 | 52 | 17 | -2.156000 | 3.058824 |
| 12347.0 | 126 | 5633.32 | 3286 | 8 | 44.708889 | 410.750000 |
| 12348.0 | 25 | 2019.40 | 2714 | 5 | 80.776000 | 542.800000 |
| 12349.0 | 139 | 4404.54 | 1619 | 5 | 31.687338 | 323.800000 |
| 12350.0 | 17 | 334.40 | 197 | 1 | 19.670588 | 197.000000 |
This code snippet creates two interaction features through division. 'AvgUnitPrice' divides 'TotalSales' (the total revenue generated by each customer) by 'NumUniqueProducts' (the number of distinct products each customer has purchased). Strictly speaking, this is the average spend per unique product rather than a true per-item price, but it still summarizes how a customer's revenue is spread across their product range, which is useful when analyzing customer behavior and value.

The second interaction feature, 'AvgQuantityPerTransaction', divides 'TotalQuantity' by 'NumInvoices' and captures the average number of items a customer buys per order, which is useful when analyzing purchasing patterns.
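Another division-based feature worth considering (our own addition, not part of the snippet above) is the average order value, total sales divided by the number of invoices. A sketch on a toy frame shaped like `df_pbf`:

```python
import pandas as pd

# Toy frame with the same columns as df_pbf; values are illustrative.
df_pbf = pd.DataFrame({
    'TotalSales':  [2019.40, 334.40],
    'NumInvoices': [5, 1],
}, index=pd.Index([12348.0, 12350.0], name='Customer ID'))

# Average revenue per order: another division-based interaction feature.
df_pbf['AvgOrderValue'] = df_pbf['TotalSales'] / df_pbf['NumInvoices']
print(df_pbf['AvgOrderValue'].round(2).tolist())  # [403.88, 334.4]
```

Average order value pairs naturally with frequency: two customers with identical total sales may differ sharply in how that revenue arrives, many small baskets versus a few large ones.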
3.5.2 Polynomial features#
Generating polynomial features involves creating new features by combining existing ones through polynomial functions, such as squares, cubes, or pairwise interactions. Scikit-learn's PolynomialFeatures class automates this. Polynomial features help a model capture non-linear relationships that a linear model would otherwise miss, reducing underfitting; however, they also enlarge the feature space, so they should be used judiciously to keep overfitting in check.
from sklearn.preprocessing import PolynomialFeatures
# Select features for polynomial expansion from the rfm DataFrame built in Section 3.1
features = ['Recency', 'Frequency']
X = rfm[features]
# Create polynomial features of degree 2
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)
# Get the feature names
feature_names = poly.get_feature_names_out(features)
# Create a new DataFrame with the polynomial features
df_poly = pd.DataFrame(X_poly, columns=feature_names)
# Drop the degree-1 columns (already present in rfm) and concatenate the new polynomial terms
df_poly = df_poly.drop(columns=features)
rfm_enhanced = pd.concat([rfm.reset_index(), df_poly], axis=1)
rfm_enhanced.head()

|   | Customer ID | Recency | Frequency | Monetary | R_rank | F_rank | M_rank | RFM_score | Recency^2 | Recency Frequency | Frequency^2 |
|---|-------------|---------|-----------|----------|--------|--------|--------|-----------|-----------|-------------------|-------------|
| 0 | 12346.0 | 325 | 17 | -64.68 | 4 | 4 | 1 | 441 | 105625.0 | 5525.0 | 289.0 |
| 1 | 12347.0 | 2 | 8 | 5633.32 | 1 | 3 | 5 | 135 | 4.0 | 16.0 | 64.0 |
| 2 | 12348.0 | 75 | 5 | 2019.40 | 3 | 2 | 4 | 324 | 5625.0 | 375.0 | 25.0 |
| 3 | 12349.0 | 18 | 5 | 4404.54 | 1 | 2 | 5 | 125 | 324.0 | 90.0 | 25.0 |
| 4 | 12350.0 | 310 | 1 | 334.40 | 4 | 1 | 2 | 412 | 96100.0 | 310.0 | 1.0 |
The code snippet adds the polynomial features 'Recency^2', 'Recency Frequency', and 'Frequency^2', which capture higher-order effects and the interaction between the original 'Recency' and 'Frequency' features. Polynomial features can offer additional signal for marketing and sales models, potentially improving predictive performance. However, it is important to weigh their interpretability and relevance within our specific marketing and sales domain: the ultimate goal of feature engineering is to provide actionable insights for decision-makers.
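Since squared terms can reach very large magnitudes (Recency squared exceeds 100,000 for some customers above), it is often worth standardizing the inputs before the polynomial expansion. A sketch using a scikit-learn pipeline, with toy values standing in for `rfm[['Recency', 'Frequency']]`:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures

# Toy Recency/Frequency values; in the book's workflow, use rfm[['Recency', 'Frequency']].
X = np.array([[325.0, 17.0], [2.0, 8.0], [75.0, 5.0], [18.0, 5.0]])

# Standardize first so the squared and interaction terms stay on comparable scales.
pipe = make_pipeline(StandardScaler(), PolynomialFeatures(degree=2, include_bias=False))
X_poly = pipe.fit_transform(X)
print(X_poly.shape)  # (4, 5): two scaled inputs plus three degree-2 terms
```

Keeping both steps in one pipeline also ensures the same scaling is reapplied consistently when the features are generated for new data.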
3.6 Conclusion#
In this chapter, we explored feature engineering techniques for marketing data using Python. We created new features based on RFM analysis, time-based patterns, product characteristics, and interaction features to capture more complex relationships in our data.
Feature engineering is an iterative process that requires domain knowledge, creativity, and experimentation. By creating informative and relevant features, we can improve the performance of our marketing analytics models and gain deeper actionable insights into customer behavior.
In the next chapter, we’ll dive into predictive analytics and learn how to build machine learning models for marketing tasks such as customer segmentation, churn prediction, and sales forecasting.