Chapter 3 Feature Engineering for Marketing Data#
In the previous chapter, we learned how to preprocess and explore our marketing data using Python libraries like Pandas, Matplotlib, and Seaborn. Now, we’ll dive into feature engineering, a crucial step in the data science workflow that involves creating new and informative features from existing data to improve the predictive power of our models.
Feature engineering is particularly important in marketing analytics, as it helps us capture more complex patterns and relationships in customer behavior and market trends. In this chapter, we’ll explore various techniques for creating new features using Python and apply them to our Online Retail dataset.
3.1 Recency, Frequency, and Monetary (RFM) Analysis#
One of the most common feature engineering techniques in marketing is Recency, Frequency, and Monetary (RFM) analysis. RFM analysis segments customers based on three key metrics:
Recency: How recently a customer made a purchase
Frequency: How often a customer makes purchases
Monetary: How much a customer spends
Let’s calculate RFM metrics for our Online Retail dataset using Python:
import pandas as pd
# read the csv file into a pandas dataframe
df = pd.read_csv('data/online_retail.csv')
df['Amount'] = df['Quantity'] * df['Price']
df['InvoiceDate'] = pd.to_datetime(df['InvoiceDate'])
df.head()
|   | Invoice | StockCode | Description | Quantity | InvoiceDate | Price | Customer ID | Country | Amount |
|---|---------|-----------|-------------|----------|-------------|-------|-------------|---------|--------|
| 0 | 489434 | 85048 | 15CM CHRISTMAS GLASS BALL 20 LIGHTS | 12 | 2009-12-01 07:45:00 | 6.95 | 13085.0 | United Kingdom | 83.4 |
| 1 | 489434 | 79323P | PINK CHERRY LIGHTS | 12 | 2009-12-01 07:45:00 | 6.75 | 13085.0 | United Kingdom | 81.0 |
| 2 | 489434 | 79323W | WHITE CHERRY LIGHTS | 12 | 2009-12-01 07:45:00 | 6.75 | 13085.0 | United Kingdom | 81.0 |
| 3 | 489434 | 22041 | RECORD FRAME 7" SINGLE SIZE | 48 | 2009-12-01 07:45:00 | 2.10 | 13085.0 | United Kingdom | 100.8 |
| 4 | 489434 | 21232 | STRAWBERRY CERAMIC TRINKET BOX | 24 | 2009-12-01 07:45:00 | 1.25 | 13085.0 | United Kingdom | 30.0 |
from datetime import datetime
# Calculate the maximum InvoiceDate for each customer
max_date = df.groupby('Customer ID')['InvoiceDate'].max()
max_date
Customer ID
12346.0 2011-01-18 10:17:00
12347.0 2011-12-07 15:52:00
12348.0 2011-09-25 13:13:00
12349.0 2011-11-21 09:51:00
12350.0 2011-02-02 16:01:00
...
18283.0 2011-12-06 12:02:00
18284.0 2010-10-06 12:31:00
18285.0 2010-02-17 10:24:00
18286.0 2010-08-20 11:57:00
18287.0 2011-10-28 09:29:00
Name: InvoiceDate, Length: 5942, dtype: datetime64[ns]
scoring_date = datetime(2011, 12, 10)
# Calculate Recency, Frequency, and Monetary metrics
rfm = df.groupby('Customer ID').agg({
'InvoiceDate': lambda x: (scoring_date - x.max()).days,
'Invoice': 'nunique',
'Amount': 'sum'
})
# Rename columns
rfm.columns = ['Recency', 'Frequency', 'Monetary']
# Assign quintile ranks to each metric
rfm['R_rank'] = pd.qcut(rfm['Recency'], 5, labels=False, duplicates='drop')
rfm['F_rank'] = pd.qcut(rfm['Frequency'], 5, labels=False, duplicates='drop')
rfm['M_rank'] = pd.qcut(rfm['Monetary'], 5, labels=False, duplicates='drop')
# Add 1 to the ranks to start ranking from 1 instead of 0
rfm['R_rank'] += 1
rfm['F_rank'] += 1
rfm['M_rank'] += 1
# Combine R, F, and M ranks into a single RFM score
rfm['RFM_score'] = rfm['R_rank'].astype(str) + rfm['F_rank'].astype(str) + rfm['M_rank'].astype(str)
rfm.head()
| Customer ID | Recency | Frequency | Monetary | R_rank | F_rank | M_rank | RFM_score |
|-------------|---------|-----------|----------|--------|--------|--------|-----------|
| 12346.0 | 325 | 17 | -64.68 | 4 | 4 | 1 | 441 |
| 12347.0 | 2 | 8 | 5633.32 | 1 | 3 | 5 | 135 |
| 12348.0 | 75 | 5 | 2019.40 | 3 | 2 | 4 | 324 |
| 12349.0 | 18 | 5 | 4404.54 | 1 | 2 | 5 | 125 |
| 12350.0 | 310 | 1 | 334.40 | 4 | 1 | 2 | 412 |
In this code snippet, we first calculate the maximum InvoiceDate for each customer. We then compute the Recency, Frequency, and Monetary metrics with the groupby() and agg() functions, assign quintile ranks to each metric with pd.qcut(), and concatenate the three ranks into a single RFM score.

Note the direction of the ranks: pd.qcut() ranks values from smallest to largest, so a low R_rank means a recent purchase, while high F_rank and M_rank indicate frequent purchases and high spending. Under this scheme, a score such as 135 marks a high-value customer who buys recently, frequently, and in large amounts, whereas a score such as 511 marks a lapsed, low-spending customer.

By calculating RFM metrics for the Online Retail dataset, we can segment our customers based on their purchasing behavior and create targeted marketing campaigns. For example, we might send win-back incentives to customers with high Recency ranks (4 or 5), who have not purchased in a long time, and target customers in the highest Monetary ranks (4 or 5) with premium products or services.
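As a simple illustration of such segmentation, we could map the combined ranks to coarse segment labels. The segment names and thresholds below are our own assumptions, not part of the dataset, and the small frame stands in for the `rfm` DataFrame built above:

```python
import pandas as pd

# Toy data shaped like the rfm frame above; in practice use that frame directly.
rfm = pd.DataFrame({
    'R_rank': [1, 4, 1, 3],
    'F_rank': [5, 1, 2, 3],
    'M_rank': [5, 1, 2, 3],
}, index=[12347.0, 12350.0, 12349.0, 12348.0])

def label_segment(row):
    # Recall: with this ranking scheme a LOW R_rank means a recent purchase,
    # while HIGH F and M ranks mean frequent, high-spending customers.
    if row['R_rank'] <= 2 and row['F_rank'] >= 4 and row['M_rank'] >= 4:
        return 'Champion'
    if row['R_rank'] >= 4:
        return 'At risk'
    return 'Regular'

rfm['Segment'] = rfm.apply(label_segment, axis=1)
print(rfm['Segment'].tolist())  # ['Champion', 'At risk', 'Regular', 'Regular']
```

In a real campaign, the thresholds would be tuned to the business context rather than fixed at these illustrative cut-offs.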
Now that we’ve calculated RFM metrics for our dataset, let’s move on to other feature engineering techniques for marketing data.
3.2 Time-based Features#
Time-based features can provide valuable insights into customer behavior over time. For example, we might want to understand how a customer’s purchasing behavior changes throughout the year or the impact of a marketing campaign on sales over time.
To create time-based features for our Online Retail dataset, we might use techniques such as:
Rolling averages and moving averages to capture trends in customer behavior over time
Lagged features to capture the impact of previous purchases or events on current behavior
Year, month, day, and hour-of-the-day features to capture seasonality or time-specific patterns
Let’s create time-based features for our Online Retail dataset:
df.head()
|   | Invoice | StockCode | Description | Quantity | InvoiceDate | Price | Customer ID | Country | Amount |
|---|---------|-----------|-------------|----------|-------------|-------|-------------|---------|--------|
| 0 | 489434 | 85048 | 15CM CHRISTMAS GLASS BALL 20 LIGHTS | 12 | 2009-12-01 07:45:00 | 6.95 | 13085.0 | United Kingdom | 83.4 |
| 1 | 489434 | 79323P | PINK CHERRY LIGHTS | 12 | 2009-12-01 07:45:00 | 6.75 | 13085.0 | United Kingdom | 81.0 |
| 2 | 489434 | 79323W | WHITE CHERRY LIGHTS | 12 | 2009-12-01 07:45:00 | 6.75 | 13085.0 | United Kingdom | 81.0 |
| 3 | 489434 | 22041 | RECORD FRAME 7" SINGLE SIZE | 48 | 2009-12-01 07:45:00 | 2.10 | 13085.0 | United Kingdom | 100.8 |
| 4 | 489434 | 21232 | STRAWBERRY CERAMIC TRINKET BOX | 24 | 2009-12-01 07:45:00 | 1.25 | 13085.0 | United Kingdom | 30.0 |
df.loc[df['Customer ID'] == 18287.0].head()
|   | Invoice | StockCode | Description | Quantity | InvoiceDate | Price | Customer ID | Country | Amount |
|---|---------|-----------|-------------|----------|-------------|-------|-------------|---------|--------|
| 1824 | C489592 | 22301 | COFFEE MUG CAT + BIRD DESIGN | -2 | 2009-12-01 14:19:00 | 2.55 | 18287.0 | United Kingdom | -5.10 |
| 199795 | 508581 | 22243 | HOOK, 5 HANGER ,MAGIC TOADSTOOL RED | 12 | 2010-05-17 11:55:00 | 1.65 | 18287.0 | United Kingdom | 19.80 |
| 199796 | 508581 | 22246 | GARLAND, MAGIC GARDEN 1.8M | 12 | 2010-05-17 11:55:00 | 1.95 | 18287.0 | United Kingdom | 23.40 |
| 199797 | 508581 | 22236 | CAKE STAND 3 TIER MAGIC GARDEN | 1 | 2010-05-17 11:55:00 | 12.75 | 18287.0 | United Kingdom | 12.75 |
| 199798 | 508581 | 21439 | BASKET OF TOADSTOOLS | 12 | 2010-05-17 11:55:00 | 1.25 | 18287.0 | United Kingdom | 15.00 |
# Set the index of the DataFrame to InvoiceDate
df_tbf = df.copy()
df_tbf.set_index('InvoiceDate', inplace=True)
# For each customer and day, compute total sales, number of unique invoices, and total quantity
daily_sales = df_tbf.groupby('Customer ID').resample('D').agg({'Amount': 'sum', 'Invoice': 'nunique', 'Quantity': 'sum'})
daily_sales.head()
| Customer ID | InvoiceDate | Amount | Invoice | Quantity |
|-------------|-------------|--------|---------|----------|
| 12346.0 | 2009-12-14 | 90.0 | 3 | 20 |
| | 2009-12-15 | 0.0 | 0 | 0 |
| | 2009-12-16 | 0.0 | 0 | 0 |
| | 2009-12-17 | 0.0 | 0 | 0 |
| | 2009-12-18 | 23.5 | 2 | 6 |
# create a 7-day rolling average of sales, invoices, and quantity for each customer
rolling_avg = daily_sales.groupby('Customer ID')[['Amount', 'Invoice', 'Quantity']].rolling(window=7).mean()
rolling_avg = rolling_avg.reset_index(drop=True, level=0).reset_index()
# rename the columns to descriptive RollingAvg* names
rolling_avg = rolling_avg.rename(columns={'Amount': 'RollingAvgSales_7days', 'Invoice': 'RollingAvgInvoices_7days', 'Quantity': 'RollingAvgQuantity_7days'})
rolling_avg.tail()
|   | Customer ID | InvoiceDate | RollingAvgSales_7days | RollingAvgInvoices_7days | RollingAvgQuantity_7days |
|---|-------------|-------------|-----------------------|--------------------------|--------------------------|
| 1644576 | 18287.0 | 2011-10-24 | 0.000000 | 0.000000 | 0.000000 |
| 1644577 | 18287.0 | 2011-10-25 | 0.000000 | 0.000000 | 0.000000 |
| 1644578 | 18287.0 | 2011-10-26 | 0.000000 | 0.000000 | 0.000000 |
| 1644579 | 18287.0 | 2011-10-27 | 0.000000 | 0.000000 | 0.000000 |
| 1644580 | 18287.0 | 2011-10-28 | 10.097143 | 0.142857 | 15.428571 |
# Create lagged features for each customer's total sales and number of orders (invoices) for daily_sales
daily_sales['Amount_lag1'] = daily_sales.groupby('Customer ID')['Amount'].shift(1)
daily_sales['Amount_lag2'] = daily_sales.groupby('Customer ID')['Amount'].shift(2)
daily_sales['InvoiceNo_lag1'] = daily_sales.groupby('Customer ID')['Invoice'].shift(1)
daily_sales['InvoiceNo_lag2'] = daily_sales.groupby('Customer ID')['Invoice'].shift(2)
# Create year, month, day, hour-of-the-day, and Day-of-the-week features from the InvoiceDate index in daily_sales
daily_sales['InvoiceYear'] = daily_sales.index.get_level_values('InvoiceDate').year
daily_sales['InvoiceMonth'] = daily_sales.index.get_level_values('InvoiceDate').month
daily_sales['InvoiceDay'] = daily_sales.index.get_level_values('InvoiceDate').day
daily_sales['InvoiceHour'] = daily_sales.index.get_level_values('InvoiceDate').hour  # always 0 after daily resampling
daily_sales['InvoiceDayOfWeek'] = daily_sales.index.get_level_values('InvoiceDate').dayofweek
daily_sales.head()
| Customer ID | InvoiceDate | Amount | Invoice | Quantity | Amount_lag1 | Amount_lag2 | InvoiceNo_lag1 | InvoiceNo_lag2 | InvoiceYear | InvoiceMonth | InvoiceDay | InvoiceHour | InvoiceDayOfWeek |
|-------------|-------------|--------|---------|----------|-------------|-------------|----------------|----------------|-------------|--------------|------------|-------------|------------------|
| 12346.0 | 2009-12-14 | 90.0 | 3 | 20 | NaN | NaN | NaN | NaN | 2009 | 12 | 14 | 0 | 0 |
| | 2009-12-15 | 0.0 | 0 | 0 | 90.0 | NaN | 3.0 | NaN | 2009 | 12 | 15 | 0 | 1 |
| | 2009-12-16 | 0.0 | 0 | 0 | 0.0 | 90.0 | 0.0 | 3.0 | 2009 | 12 | 16 | 0 | 2 |
| | 2009-12-17 | 0.0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 2009 | 12 | 17 | 0 | 3 |
| | 2009-12-18 | 23.5 | 2 | 6 | 0.0 | 0.0 | 0.0 | 0.0 | 2009 | 12 | 18 | 0 | 4 |
In this example, we extracted time-based features from the InvoiceDate column by hand, so that the underlying concepts and calculation logic are explicit. Later, we will show how specialized Python libraries can generate such features automatically for large datasets.
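One caveat worth noting: after daily resampling, the hour component of the index is always midnight, so `InvoiceHour` above is uniformly 0. To capture genuine time-of-day patterns, the same calendar features can be derived at the transaction level with pandas' `.dt` accessor. A minimal sketch on made-up timestamps (in practice this would be applied to `df['InvoiceDate']`):

```python
import pandas as pd

# Made-up transaction timestamps for illustration only.
tx = pd.DataFrame({'InvoiceDate': pd.to_datetime([
    '2009-12-01 07:45:00', '2010-05-17 11:55:00', '2011-10-28 09:29:00'])})

tx['InvoiceYear'] = tx['InvoiceDate'].dt.year
tx['InvoiceMonth'] = tx['InvoiceDate'].dt.month
tx['InvoiceHour'] = tx['InvoiceDate'].dt.hour            # real hour-of-day, unlike the daily resample
tx['InvoiceDayOfWeek'] = tx['InvoiceDate'].dt.dayofweek  # Monday = 0
tx['IsWeekend'] = tx['InvoiceDayOfWeek'] >= 5

print(tx[['InvoiceHour', 'InvoiceDayOfWeek', 'IsWeekend']])
```

Hour-of-day and weekend flags derived this way often reveal, for example, whether a customer shops during working hours or in the evening.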
3.4 Product-based Features#
Product-based features capture information about the products customers purchase and their characteristics. Let’s create some product-based features for our Online Retail dataset:
df.head()
|   | Invoice | StockCode | Description | Quantity | InvoiceDate | Price | Customer ID | Country | Amount | Holiday | HolidayName |
|---|---------|-----------|-------------|----------|-------------|-------|-------------|---------|--------|---------|-------------|
| 0 | 489434 | 85048 | 15CM CHRISTMAS GLASS BALL 20 LIGHTS | 12 | 2009-12-01 07:45:00 | 6.95 | 13085.0 | United Kingdom | 83.4 | False | None |
| 1 | 489434 | 79323P | PINK CHERRY LIGHTS | 12 | 2009-12-01 07:45:00 | 6.75 | 13085.0 | United Kingdom | 81.0 | False | None |
| 2 | 489434 | 79323W | WHITE CHERRY LIGHTS | 12 | 2009-12-01 07:45:00 | 6.75 | 13085.0 | United Kingdom | 81.0 | False | None |
| 3 | 489434 | 22041 | RECORD FRAME 7" SINGLE SIZE | 48 | 2009-12-01 07:45:00 | 2.10 | 13085.0 | United Kingdom | 100.8 | False | None |
| 4 | 489434 | 21232 | STRAWBERRY CERAMIC TRINKET BOX | 24 | 2009-12-01 07:45:00 | 1.25 | 13085.0 | United Kingdom | 30.0 | False | None |
df_pbf = df.groupby('Customer ID').agg({'StockCode': 'nunique', 'Amount': 'sum', 'Quantity': 'sum', 'Invoice': 'nunique'})
# rename the columns to descriptive feature names
df_pbf = df_pbf.rename(columns={'StockCode': 'NumUniqueProducts', 'Amount': 'TotalSales', 'Quantity': 'TotalQuantity', 'Invoice': 'NumInvoices'})
df_pbf.head()
| Customer ID | NumUniqueProducts | TotalSales | TotalQuantity | NumInvoices |
|-------------|-------------------|------------|---------------|-------------|
| 12346.0 | 30 | -64.68 | 52 | 17 |
| 12347.0 | 126 | 5633.32 | 3286 | 8 |
| 12348.0 | 25 | 2019.40 | 2714 | 5 |
| 12349.0 | 139 | 4404.54 | 1619 | 5 |
| 12350.0 | 17 | 334.40 | 197 | 1 |
In this code snippet, we calculate product-based features for each customer: the number of unique products purchased, total sales, total quantity purchased, and the number of invoices. In the next section, we will derive further features from these aggregates, such as the average spend per product and the average quantity per transaction.
These features can help us understand customer preferences and purchasing habits, which can inform product recommendations, cross-selling, and upselling strategies.
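For recommendations in particular, it can also help to know which product a customer buys most often. A small sketch on invented transactions (our own addition; in the book's workflow the groupby would run on `df` itself):

```python
import pandas as pd

# Invented transactions for illustration only.
df_toy = pd.DataFrame({
    'Customer ID': [1, 1, 1, 2, 2],
    'StockCode':   ['A', 'A', 'B', 'C', 'C'],
    'Quantity':    [2, 3, 1, 4, 1],
})

# Most frequently purchased product per customer;
# mode() sorts its result, so ties resolve to the first value in sorted order.
favourite = df_toy.groupby('Customer ID')['StockCode'].agg(lambda s: s.mode().iloc[0])
print(favourite.to_dict())  # {1: 'A', 2: 'C'}
```

A "favourite product" feature like this can seed simple cross-sell rules, e.g. recommending items frequently bought alongside the favourite.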
3.5 Interaction Features#
Interaction features capture the relationship between two or more variables in our dataset. For example, we might want to understand how the relationship between a customer’s total sales and the number of orders they place impacts their overall value to the business.
To create interaction features for our Online Retail dataset, we can use techniques such as:
Multiplying or dividing features to create new features that capture the relationship between them
Polynomial features to capture more complex relationships between variables
3.5.1 Interaction features by division and multiplication#
Let’s create interaction features for our Online Retail dataset:
# Calculate the average spend per unique product for each customer
df_pbf['AvgUnitPrice'] = df_pbf['TotalSales'] / df_pbf['NumUniqueProducts']
# Calculate the average quantity per transaction for each customer
df_pbf['AvgQuantityPerTransaction'] = df_pbf['TotalQuantity'] / df_pbf['NumInvoices']
df_pbf.head()
| Customer ID | NumUniqueProducts | TotalSales | TotalQuantity | NumInvoices | AvgUnitPrice | AvgQuantityPerTransaction |
|-------------|-------------------|------------|---------------|-------------|--------------|---------------------------|
| 12346.0 | 30 | -64.68 | 52 | 17 | -2.156000 | 3.058824 |
| 12347.0 | 126 | 5633.32 | 3286 | 8 | 44.708889 | 410.750000 |
| 12348.0 | 25 | 2019.40 | 2714 | 5 | 80.776000 | 542.800000 |
| 12349.0 | 139 | 4404.54 | 1619 | 5 | 31.687338 | 323.800000 |
| 12350.0 | 17 | 334.40 | 197 | 1 | 19.670588 | 197.000000 |
This code snippet creates two interaction features through division. 'AvgUnitPrice' divides 'TotalSales' (the total revenue generated by each customer) by 'NumUniqueProducts' (the number of distinct products each customer has purchased). Strictly speaking, this is the average spend per unique product rather than a true per-item price, but it still summarizes how a customer's revenue is spread across their product range, which is useful when analyzing customer behavior and value.

The second interaction feature, 'AvgQuantityPerTransaction', divides 'TotalQuantity' by 'NumInvoices' and captures the average number of items a customer buys per order, which is useful when analyzing purchasing patterns.
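Another division-based feature worth considering (our own addition, not part of the snippet above) is the average order value, total sales divided by the number of invoices. A sketch on a toy frame shaped like `df_pbf`:

```python
import pandas as pd

# Toy frame with the same columns as df_pbf; values are illustrative.
df_pbf = pd.DataFrame({
    'TotalSales':  [2019.40, 334.40],
    'NumInvoices': [5, 1],
}, index=pd.Index([12348.0, 12350.0], name='Customer ID'))

# Average revenue per order: another division-based interaction feature.
df_pbf['AvgOrderValue'] = df_pbf['TotalSales'] / df_pbf['NumInvoices']
print(df_pbf['AvgOrderValue'].round(2).tolist())  # [403.88, 334.4]
```

Average order value pairs naturally with frequency: two customers with identical total sales may differ sharply in how that revenue arrives, many small baskets versus a few large ones.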
3.5.2 Polynomial features#
Generating polynomial features involves creating new features by combining existing ones through polynomial functions, such as squares, cubes, or pairwise interactions. Scikit-learn's PolynomialFeatures class automates this. Polynomial features help a model capture non-linear relationships that a linear model would otherwise miss, reducing underfitting; however, they also enlarge the feature space, so they should be used judiciously to keep overfitting in check.
from sklearn.preprocessing import PolynomialFeatures
# Select features for polynomial expansion from the rfm DataFrame built in Section 3.1
features = ['Recency', 'Frequency']
X = rfm[features]
# Create polynomial features of degree 2
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)
# Get the feature names
feature_names = poly.get_feature_names_out(features)
# Create a new DataFrame with the polynomial features
df_poly = pd.DataFrame(X_poly, columns=feature_names)
# Drop the degree-1 columns (already present in rfm) and concatenate the new polynomial terms
df_poly = df_poly.drop(columns=features)
rfm_enhanced = pd.concat([rfm.reset_index(), df_poly], axis=1)
rfm_enhanced.head()

|   | Customer ID | Recency | Frequency | Monetary | R_rank | F_rank | M_rank | RFM_score | Recency^2 | Recency Frequency | Frequency^2 |
|---|-------------|---------|-----------|----------|--------|--------|--------|-----------|-----------|-------------------|-------------|
| 0 | 12346.0 | 325 | 17 | -64.68 | 4 | 4 | 1 | 441 | 105625.0 | 5525.0 | 289.0 |
| 1 | 12347.0 | 2 | 8 | 5633.32 | 1 | 3 | 5 | 135 | 4.0 | 16.0 | 64.0 |
| 2 | 12348.0 | 75 | 5 | 2019.40 | 3 | 2 | 4 | 324 | 5625.0 | 375.0 | 25.0 |
| 3 | 12349.0 | 18 | 5 | 4404.54 | 1 | 2 | 5 | 125 | 324.0 | 90.0 | 25.0 |
| 4 | 12350.0 | 310 | 1 | 334.40 | 4 | 1 | 2 | 412 | 96100.0 | 310.0 | 1.0 |
The code snippet adds the polynomial features 'Recency^2', 'Recency Frequency', and 'Frequency^2', which capture higher-order effects and the interaction between the original 'Recency' and 'Frequency' features. Polynomial features can offer additional signal for marketing and sales models, potentially improving predictive performance. However, it is important to weigh their interpretability and relevance within our specific marketing and sales domain: the ultimate goal of feature engineering is to provide actionable insights for decision-makers.
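Since squared terms can reach very large magnitudes (Recency squared exceeds 100,000 for some customers above), it is often worth standardizing the inputs before the polynomial expansion. A sketch using a scikit-learn pipeline, with toy values standing in for `rfm[['Recency', 'Frequency']]`:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures

# Toy Recency/Frequency values; in the book's workflow, use rfm[['Recency', 'Frequency']].
X = np.array([[325.0, 17.0], [2.0, 8.0], [75.0, 5.0], [18.0, 5.0]])

# Standardize first so the squared and interaction terms stay on comparable scales.
pipe = make_pipeline(StandardScaler(), PolynomialFeatures(degree=2, include_bias=False))
X_poly = pipe.fit_transform(X)
print(X_poly.shape)  # (4, 5): two scaled inputs plus three degree-2 terms
```

Keeping both steps in one pipeline also ensures the same scaling is reapplied consistently when the features are generated for new data.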
3.6 Conclusion#
In this chapter, we explored feature engineering techniques for marketing data using Python. We created new features based on RFM analysis, time-based patterns, product characteristics, and interaction features to capture more complex relationships in our data.
Feature engineering is an iterative process that requires domain knowledge, creativity, and experimentation. By creating informative and relevant features, we can improve the performance of our marketing analytics models and gain deeper actionable insights into customer behavior.
In the next chapter, we’ll dive into predictive analytics and learn how to build machine learning models for marketing tasks such as customer segmentation, churn prediction, and sales forecasting.