Chapter 4 Regression Analysis for Marketing#
4.1 Introduction to Regression Analysis#
Regression analysis stands as a cornerstone of marketing analytics, offering a powerful toolkit for deciphering the intricate dynamics between various factors influencing market outcomes. This statistical approach enables marketers to navigate through the complexity of consumer behavior, advertising effectiveness, pricing strategies, and more, by establishing quantifiable relationships between dependent variables and one or more independent variables.
Applications in Marketing#
The versatility of regression analysis makes it indispensable in the realm of marketing. Here are some ways it can be applied:
Forecasting Sales: By incorporating variables such as marketing expenditure, seasonal trends, and economic indicators, businesses can predict future sales, enabling better inventory and budget planning.
Advertising Effectiveness: Evaluating the ROI on advertising campaigns by analyzing how different channels and messaging impact customer acquisition and retention.
Customer Insights: Understanding what drives customer satisfaction and loyalty by examining factors such as service quality, product features, and user experience.
Pricing Optimization: Assessing how price changes influence demand and sales, aiding in the development of dynamic pricing strategies to maximize profitability.
4.2 Types of Regression Analysis#
Regression analysis is a versatile statistical tool used to examine the relationship between a dependent (target) variable and one or more independent (predictor) variables. The choice of regression analysis depends on the nature of the target variable and the relationship you wish to investigate. Below, we discuss several commonly used types of regression analysis, each serving distinct analytical needs in marketing analytics.
4.2.1 Linear Regression#
Linear regression is the most straightforward form of regression analysis. It assumes a linear relationship between the dependent variable and one or more independent variables. This model is highly interpretable and is often used as a starting point for understanding relationships between variables.
Usage: Ideal for continuous data and forecasting outcomes like sales volume based on advertising spend.
Python Implementation: Widely supported by libraries such as Scikit-learn (`LinearRegression`) and Statsmodels.
4.2.2 Logistic Regression#
Unlike linear regression, logistic regression is used when the dependent variable is categorical, typically binary. This model estimates probabilities by predicting the log odds of the outcome, making it suitable for classification tasks.
Usage: Commonly applied to predict binary outcomes such as customer churn (yes/no) or conversion success (purchase/no purchase).
Python Implementation: Can be implemented using Scikit-learn's `LogisticRegression` class.
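As a minimal sketch of what this looks like in code (the churn labels below are synthetic, generated purely for illustration):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
import numpy as np

# Synthetic example: predict churn (1 = churned) from two standardized features
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))  # e.g., monthly spend and tenure, standardized
y = (X[:, 0] - X[:, 1] + rng.normal(size=500) > 0).astype(int)  # synthetic labels

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
clf = LogisticRegression().fit(X_train, y_train)

print(clf.predict_proba(X_test[:3]))  # predicted churn probabilities per class
print(clf.score(X_test, y_test))      # classification accuracy on the test set
```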
4.2.3 Poisson Regression#
Poisson regression is a specialized form of regression analysis used when the dependent variable is count data. This type of regression is particularly useful for modeling the number of times an event occurs within a fixed interval of time or space. Given its nature, Poisson regression is a powerful tool for analyzing and predicting behaviors or trends where outcomes are discrete counts.
Usage: Poisson regression is ideal for count-based data scenarios such as predicting the number of website visits, daily sales transactions, or call center calls received in a day. It helps in understanding how various factors or exposures influence the rate at which events occur.
Assumptions: The model assumes that the mean and variance of the distribution of the dependent variable are equal, a condition known as equidispersion. However, real-world data often violate this assumption, leading to overdispersion or underdispersion. In such cases, alternative models like Negative Binomial regression may be more appropriate.
Python Implementation: The `statsmodels` library in Python provides functionality to fit Poisson regression models.
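A minimal sketch of a Poisson model in statsmodels might look like the following (the daily-visits data is synthetic, invented for illustration):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Synthetic example: daily website visits as a function of ad spend
rng = np.random.default_rng(1)
ad_spend = rng.uniform(0, 10, size=200)
visits = rng.poisson(lam=np.exp(0.5 + 0.2 * ad_spend))  # synthetic counts

X = sm.add_constant(pd.DataFrame({'ad_spend': ad_spend}))
poisson_results = sm.GLM(visits, X, family=sm.families.Poisson()).fit()
print(poisson_results.summary())
```

If the counts turned out to be overdispersed, `sm.families.NegativeBinomial()` could be substituted for the Poisson family in the same call.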
4.2.4 Polynomial Regression#
Polynomial regression extends linear regression by introducing polynomial terms (squared, cubic, etc.) of the independent variables. This allows the model to capture nonlinear relationships between the dependent and independent variables.
Usage: Useful when the relationship between variables is not linear but still requires the simplicity and interpretability of regression models.
Python Implementation: Achieved by transforming features into polynomial features (e.g., using Scikit-learn's `PolynomialFeatures`) before applying linear regression.
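A brief sketch, using a synthetic price-response curve with diminishing returns:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# Synthetic example: sales respond to price with a concave (degree-2) shape
rng = np.random.default_rng(2)
price = rng.uniform(1, 20, size=(200, 1))
sales = 100 + 10 * price[:, 0] - 0.4 * price[:, 0] ** 2 + rng.normal(scale=5, size=200)

# Expand price into [1, price, price^2], then fit an ordinary linear regression
poly_model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
poly_model.fit(price, sales)
print(poly_model.predict([[10.0]]))  # predicted sales at a price of 10
```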
4.2.5 Stepwise Regression#
Stepwise regression is a systematic method for adding and removing predictors based on their statistical significance in explaining the variance of the dependent variable. It aims to identify the most parsimonious model that explains the data.
Usage: Effective in scenarios with large numbers of predictors, identifying a subset that offers the best prediction.
Python Implementation: Can be complex to implement directly but is supported through procedures in statistical packages like Statsmodels.
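Statsmodels has no single stepwise function, so one common approach is a small forward-selection loop over p-values. The sketch below is one such implementation, assuming a feature DataFrame `X` and target series `y`; it is illustrative, not a canned library routine:

```python
import statsmodels.api as sm

def forward_selection(X, y, alpha=0.05):
    """Greedily add the predictor with the smallest p-value at each step,
    stopping once no remaining predictor is significant at level alpha."""
    selected, remaining = [], list(X.columns)
    while remaining:
        pvals = {}
        for col in remaining:
            fitted = sm.OLS(y, sm.add_constant(X[selected + [col]])).fit()
            pvals[col] = fitted.pvalues[col]
        best = min(pvals, key=pvals.get)
        if pvals[best] >= alpha:
            break
        selected.append(best)
        remaining.remove(best)
    return selected
```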
4.2.6 Ridge and Lasso Regression#
Both Ridge and Lasso regression are techniques used to analyze data with multicollinearity or when the number of predictors exceeds the number of observations. They introduce a penalty term to the loss function to shrink coefficient estimates.
Ridge Regression (L2 regularization) minimizes the sum of the square of coefficients, effectively reducing model complexity.
Lasso Regression (L1 regularization) can shrink some coefficients to zero, performing variable selection.
Usage: Both are used for feature selection, regularization, and improving model predictions in high-dimensional datasets.
Python Implementation: Scikit-learn offers `Ridge` and `Lasso` classes under its linear models module.
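A minimal sketch on synthetic high-dimensional data (here `alpha` sets the penalty strength and would normally be tuned, e.g., by cross-validation):

```python
from sklearn.linear_model import Ridge, Lasso
from sklearn.datasets import make_regression

# Synthetic data: 100 observations, 50 features, only some of them informative
X, y = make_regression(n_samples=100, n_features=50, noise=10.0, random_state=0)

ridge = Ridge(alpha=1.0).fit(X, y)  # L2 penalty shrinks all coefficients
lasso = Lasso(alpha=1.0).fit(X, y)  # L1 penalty can zero some out entirely

print((ridge.coef_ == 0).sum())  # typically 0: Ridge keeps every feature
print((lasso.coef_ == 0).sum())  # often > 0: Lasso performs variable selection
```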
4.2.7 Generalized Linear Regression (GLM)#
Generalized Linear Models (GLMs) extend traditional linear regression to accommodate response variables that are not normally distributed. GLMs are versatile, allowing for the specification of different distributions for the response variable and the use of a link function to model the relationship between the response variable and the predictors. This flexibility makes GLMs a powerful tool for a wide range of data types and analytical tasks.
Usage: GLMs are broadly applicable in marketing analytics, from modeling binary outcomes like conversion or churn with logistic regression, to modeling count data, such as the number of purchases made by a customer, using Poisson or Negative Binomial regression.
Components: A GLM is characterized by three components: the probability distribution of the response variable (from the exponential family), the linear predictor (a linear combination of the input variables), and the link function (which connects the expected value of the response to the linear predictor).
Python Implementation: Implementing GLMs in Python can be readily done with the `statsmodels` library, which supports a variety of families for the response distribution (e.g., Binomial for binary data, Poisson for count data).
Choosing the Right Model#
Selecting the appropriate regression model depends on several factors, including the nature of the dependent variable, the relationship between the independent and dependent variables, and the specific analytical objectives. Experimentation and validation are key to determining the most effective model for your data.
4.3 Linear Regression#
At the heart of regression analysis is linear regression, a model that assumes a linear relationship between the dependent and independent variables. This simplicity makes it not only a starting point for analysis but also a tool for making predictions and decisions in marketing strategies.
4.3.1 The Linear Regression Model#
The linear regression model can be expressed by the formula:

\[
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n + \epsilon
\]
Here, \(y\) represents the dependent variable we aim to predict, such as sales volume or customer satisfaction scores. The \(x\) variables are the independent variables or predictors, such as advertising spend, product price, or customer demographics. \(\beta\) coefficients quantify the impact of each predictor on the dependent variable, and \(\epsilon\) is the error term, accounting for the variability not explained by the model.
4.3.2 Interpreting Coefficients#
Understanding the coefficients in a linear regression model is crucial for making informed marketing decisions. A positive coefficient for a variable, such as marketing spend, indicates a direct relationship with the dependent variable, suggesting that increasing the marketing budget could lead to higher sales. Conversely, a negative coefficient suggests an inverse relationship, providing insights into factors that might hinder performance.
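One practical way to read the coefficients off a fitted Scikit-learn model is to pair them with the feature names. A small sketch, assuming a fitted `LinearRegression` called `model` and a feature DataFrame `X` like the one built later in this chapter:

```python
import pandas as pd

# Pair each feature with its estimated coefficient, largest effect first
coef_table = pd.Series(model.coef_, index=X.columns).sort_values(ascending=False)
print(coef_table)
print("Intercept:", model.intercept_)
```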
4.3.3 Implementing Linear Regression with Scikit-learn#
Scikit-learn, a comprehensive Python library for machine learning, simplifies the process of implementing linear regression models. It offers an intuitive interface and efficient tools for model fitting, prediction, and evaluation, making it an excellent choice for marketing analytics projects.
Below are the general steps to build a regression model:
Data Preparation: Begin by loading your dataset and selecting the dependent and independent variables. It's important to clean and preprocess your data, handling missing values and encoding categorical variables as needed.
Model Setup: Import the `LinearRegression` class from Scikit-learn's `linear_model` module. Instantiate the model with appropriate parameters.
Model Fitting: Use the `fit` method to train your model on the data. This involves finding the coefficients that minimize the error term.
Prediction and Evaluation: With the model trained, use it to make predictions on new or test data. Evaluate the model's performance using metrics such as R-squared and mean squared error to understand its accuracy and reliability.
Interpretation: Analyze the model's coefficients to draw insights into the relationships between your variables. This step is critical for applying your findings to make informed marketing decisions.
By following these steps, you can leverage linear regression in Scikit-learn to uncover valuable marketing insights, predict outcomes, and optimize strategies for better performance.
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
import pandas as pd
# Import the data
data = pd.read_csv('data/marketing_sales_data.csv')
data.head()
|    | TV     | Radio     | Social Media | Influencer | Sales      |
|----|--------|-----------|--------------|------------|------------|
| 0  | Low    | 3.518070  | 2.293790     | Micro      | 55.261284  |
| 1  | Low    | 7.756876  | 2.572287     | Mega       | 67.574904  |
| 2  | High   | 20.348988 | 1.227180     | Micro      | 272.250108 |
| 3  | Medium | 20.108487 | 2.728374     | Mega       | 195.102176 |
| 4  | High   | 31.653200 | 7.776978     | Nano       | 273.960377 |
# Select the features and target from the marketing dataset
X = data[['TV', 'Radio', 'Social Media', 'Influencer']]  # Independent variables
y = data['Sales']  # Dependent variable

# Separate the features into categorical and numerical features
categorical_features = ['TV', 'Influencer']
numerical_features = ['Social Media', 'Radio']

# Target encoding for categorical features: replace each unique value
# in `column` with the mean of `target` for that value
def target_encoding(data, column, target):
    grouped = data[[column, target]].groupby(column, as_index=False).mean()
    mapping = {}
    for i in range(len(grouped)):
        mapping[grouped.iloc[i, 0]] = grouped.iloc[i, 1]
    data[column] = data[column].map(lambda x: mapping[x])
    return data

# Work on a copy of the full dataset (it already contains 'Sales',
# which the encoder needs as its target column)
X_te = data.copy()
for col in categorical_features:
    target_encoding(X_te, col, 'Sales')

# Separate the encoded data back into X and y
y = X_te['Sales']
X = X_te.drop('Sales', axis=1)
X.head()
|   | TV         | Radio     | Social Media | Influencer |
|---|------------|-----------|--------------|------------|
| 0 | 90.984101  | 3.518070  | 2.293790     | 188.321846 |
| 1 | 90.984101  | 7.756876  | 2.572287     | 194.487941 |
| 2 | 300.853195 | 20.348988 | 1.227180     | 188.321846 |
| 3 | 195.358032 | 20.108487 | 2.728374     | 194.487941 |
| 4 | 300.853195 | 31.653200 | 7.776978     | 191.874432 |
Above, we defined a function for target encoding of the categorical features: it replaces each unique value in the specified column with the mean of the target column for that value.
From the sample rows above, we can see that the features are on very different scales, so we standardize them before estimating the model.
# standardize the numerical features
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
all_features = X.columns
# all the features in X are numerical
X = scaler.fit_transform(X)
X = pd.DataFrame(X, columns=all_features)
X.head()
|   | TV        | Radio     | Social Media | Influencer |
|---|-----------|-----------|--------------|------------|
| 0 | -1.173289 | -1.508439 | -0.465035    | -0.210564  |
| 1 | -1.173289 | -1.051809 | -0.340507    | 1.120998   |
| 2 | 1.331340  | 0.304689  | -0.941962    | -0.210564  |
| 3 | 0.072335  | 0.278781  | -0.270713    | 1.120998   |
| 4 | 1.331340  | 1.522447  | 1.986735     | 0.556614   |
import statsmodels.api as sm
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create a linear regression model using Scikit-learn
model = LinearRegression()
# Fit the model to the training data
model.fit(X_train, y_train)
# Add a constant (intercept) column to the training features for Statsmodels
X_train_sm = sm.add_constant(X_train)
# Create an OLS (Ordinary Least Squares) model using Statsmodels
model_sm = sm.OLS(y_train, X_train_sm)
# Fit the OLS model
results_sm = model_sm.fit()
# Print the detailed model summary from Statsmodels
print(results_sm.summary())
OLS Regression Results
==============================================================================
Dep. Variable: Sales R-squared: 0.905
Model: OLS Adj. R-squared: 0.904
Method: Least Squares F-statistic: 1074.
Date: Thu, 04 Apr 2024 Prob (F-statistic): 3.23e-229
Time: 02:02:08 Log-Likelihood: -2168.0
No. Observations: 457 AIC: 4346.
Df Residuals: 452 BIC: 4367.
Df Model: 4
Covariance Type: nonrobust
================================================================================
coef std err t P>|t| [0.025 0.975]
--------------------------------------------------------------------------------
const 190.1769 1.309 145.249 0.000 187.604 192.750
TV 59.9624 2.250 26.649 0.000 55.540 64.384
Radio 27.2419 2.391 11.393 0.000 22.543 31.941
Social Media 0.6669 1.678 0.398 0.691 -2.630 3.964
Influencer 0.1651 1.329 0.124 0.901 -2.447 2.778
==============================================================================
Omnibus: 50.570 Durbin-Watson: 1.852
Prob(Omnibus): 0.000 Jarque-Bera (JB): 15.172
Skew: 0.086 Prob(JB): 0.000507
Kurtosis: 2.124 Cond. No. 3.68
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
The results above show the output of an Ordinary Least Squares (OLS) regression model. The model attempts to explain the relationship between the dependent variable “Sales” and four independent variables: “TV,” “Radio,” “Social Media,” and “Influencer.” Let’s go through the main parts of the results:
R-squared and Adjusted R-squared:
R-squared (0.905) indicates that approximately 90.5% of the variance in “Sales” can be explained by the independent variables included in the model.
Adjusted R-squared (0.904) is a modified version of R-squared that takes into account the number of predictors in the model. It is slightly lower than R-squared, suggesting that the model’s explanatory power is still high even after accounting for the number of variables.
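For reference, the adjustment uses the standard formula, penalizing R-squared by the number of predictors \(p\) relative to the sample size \(n\); plugging in this model's values reproduces the reported figure:

\[
\bar{R}^2 = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1} = 1 - (1 - 0.905)\,\frac{457 - 1}{457 - 4 - 1} \approx 0.904
\]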
F-statistic and Prob (F-statistic):
The F-statistic (1074.0) tests the overall significance of the regression model. A high F-statistic suggests that the model is statistically significant.
Prob (F-statistic) (3.23e-229) is the p-value associated with the F-statistic. It is extremely low (close to zero), indicating that the model is highly statistically significant.
Coefficients and their significance:
The “coef” column shows the estimated coefficients for each independent variable.
“const” (190.1769) represents the intercept or the expected value of “Sales” when all independent variables are zero.
“TV” (59.9624), “Radio” (27.2419), “Social Media” (0.6669), and “Influencer” (0.1651) are the coefficients for the respective independent variables.
The “std err” column shows the standard errors associated with each coefficient estimate.
The “t” column represents the t-statistic, which measures the statistical significance of each coefficient. Higher absolute values of t-statistic indicate greater significance.
The “P>|t|” column shows the p-value associated with each t-statistic. Lower p-values (typically < 0.05) suggest that the coefficient is statistically significant.
The “[0.025, 0.975]” columns represent the 95% confidence interval for each coefficient estimate.
Model diagnostics:
Omnibus, Prob(Omnibus), Skew, and Kurtosis provide information about the normality of the residuals. In this case, the low p-value for the Omnibus test (0.000) suggests that the residuals may not be normally distributed.
Durbin-Watson (1.852) is a test statistic for autocorrelation in the residuals. A value close to 2 indicates no autocorrelation.
Jarque-Bera (JB) and Prob(JB) are additional tests for normality of the residuals. The low p-value (0.000507) suggests that the residuals may not be normally distributed.
Cond. No. (3.68) is the condition number, which measures the sensitivity of the regression estimates to small changes in the input data. A higher value may indicate multicollinearity among the independent variables.
Business Insights Interpretation: Based on the results, the coefficients for “TV” and “Radio” are statistically significant (p-values < 0.05), indicating that they have a significant impact on “Sales.” The coefficients for “Social Media” and “Influencer” are not statistically significant (p-values > 0.05), suggesting that they may not have a significant effect on “Sales.”
However, the diagnostics raise some concerns about the normality of the residuals, which may affect the validity of the model assumptions. Further investigation and potential model refinements may be necessary.
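As a starting point for that investigation, the residuals of the fitted `results_sm` model can be inspected directly; the sketch below plots residuals against fitted values and recomputes the Jarque-Bera normality test reported in the summary:

```python
import matplotlib.pyplot as plt
from statsmodels.stats.stattools import jarque_bera

residuals = results_sm.resid

# Visual check: residuals vs. fitted values should show no systematic pattern
plt.scatter(results_sm.fittedvalues, residuals, alpha=0.5)
plt.axhline(0, color='red', linestyle='--')
plt.xlabel('Fitted values')
plt.ylabel('Residuals')
plt.show()

# Formal check: Jarque-Bera test for normality of the residuals
jb_stat, jb_pvalue, skew, kurtosis = jarque_bera(residuals)
print(f"Jarque-Bera p-value: {jb_pvalue:.4f}")
```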
4.4 Generalized Linear Regression (GLM)#
To create a Generalized Linear Model (GLM) instead of an Ordinary Least Squares (OLS) model, you can use the `statsmodels.genmod.generalized_linear_model` module in Python. Here's an example of how to create a GLM model using the same data:
import statsmodels.api as sm
from statsmodels.genmod.generalized_linear_model import GLM
from statsmodels.genmod.families import Gaussian
from sklearn.model_selection import train_test_split
# Add a constant term to the training and test sets
X_train = sm.add_constant(X_train)
X_test = sm.add_constant(X_test)
# Create the GLM model with Gaussian family and identity link function
model = GLM(y_train, X_train, family=Gaussian(link=sm.genmod.families.links.identity()))
# Fit the GLM model on the training data
results = model.fit()
# Print the model summary
print("GLM Model Summary:")
print(results.summary())
GLM Model Summary:
Generalized Linear Model Regression Results
==============================================================================
Dep. Variable: Sales No. Observations: 457
Model: GLM Df Residuals: 452
Model Family: Gaussian Df Model: 4
Link Function: identity Scale: 781.34
Method: IRLS Log-Likelihood: -2168.0
Date: Thu, 04 Apr 2024 Deviance: 3.5317e+05
Time: 02:18:05 Pearson chi2: 3.53e+05
No. Iterations: 3 Pseudo R-squ. (CS): 0.9999
Covariance Type: nonrobust
================================================================================
coef std err z P>|z| [0.025 0.975]
--------------------------------------------------------------------------------
const 190.1769 1.309 145.249 0.000 187.611 192.743
TV 59.9624 2.250 26.649 0.000 55.552 64.372
Radio 27.2419 2.391 11.393 0.000 22.555 31.928
Social Media 0.6669 1.678 0.398 0.691 -2.621 3.955
Influencer 0.1651 1.329 0.124 0.901 -2.440 2.771
================================================================================
from sklearn.metrics import mean_squared_error, r2_score
# Make predictions on the test set
y_pred = results.predict(X_test)
# Calculate the mean squared error (MSE) and R-squared on the test set
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
# Print the evaluation metrics
print("Test Set Evaluation:")
print("Mean Squared Error (MSE):", mse)
print("R-squared:", r2)
Test Set Evaluation:
Mean Squared Error (MSE): 802.7904830299971
R-squared: 0.8974988747706325
In this example:
We import the necessary modules from `statsmodels`, including `GLM` from `statsmodels.genmod.generalized_linear_model` and `Gaussian` from `statsmodels.genmod.families`.
We assume that you have a dataset called `data` with 'Sales' as the dependent variable and 'TV', 'Radio', 'Social Media', and 'Influencer' as independent variables.
We add a constant term to the design matrices using `sm.add_constant()`. This is necessary for including the intercept in the model.
We create the GLM model using `GLM()` and specify the dependent variable (`y_train`), the design matrix (`X_train`), and the family and link function. In this case, we use the Gaussian family with the identity link function, which is equivalent to a linear regression model.
We fit the GLM model using the `fit()` method and store the results in the `results` variable.
Finally, we print the model summary using `results.summary()`, which displays the coefficients, standard errors, z-values, p-values, and other model diagnostics.
The output will be similar to the previous OLS model summary, but with some additional information specific to the GLM framework.
Note that in this example, we used the Gaussian family with the identity link function, which makes the GLM model equivalent to a linear regression model. However, GLMs allow for more flexibility in modeling different types of dependent variables and relationships by specifying different families and link functions. For example, you can use the Poisson family for count data or the Binomial family for binary outcomes.
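As an illustration, swapping in the Binomial family turns the same workflow into a logistic-style GLM for binary outcomes. The sketch below uses synthetic 0/1 data, since 'Sales' in this dataset is continuous:

```python
import numpy as np
import statsmodels.api as sm

# Synthetic binary outcome (e.g., converted / did not convert), for illustration
rng = np.random.default_rng(3)
X_demo = sm.add_constant(rng.normal(size=(300, 2)))
converted = (X_demo @ np.array([0.1, 1.0, -0.7]) + rng.normal(size=300) > 0).astype(int)

# Binomial family with its default logit link
binary_results = sm.GLM(converted, X_demo, family=sm.families.Binomial()).fit()
print(binary_results.summary())
```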
Feel free to modify the code and experiment with different families and link functions based on the nature of your data and the specific problem you’re trying to solve.
The results above show the output of a Generalized Linear Model (GLM) regression. The model aims to explain the relationship between the dependent variable “Sales” and four independent variables: “TV,” “Radio,” “Social Media,” and “Influencer.” Let’s go through the main parts of the results:
Model Family and Link Function:
The GLM model assumes a Gaussian (normal) distribution for the dependent variable “Sales.”
The link function used is the identity function, which means that the linear predictor is directly related to the expected value of the response variable.
Goodness of Fit Measures:
Deviance (3.5317e+05) measures the difference between the fitted model and the saturated model. Lower values indicate a better fit.
Pearson chi2 (3.53e+05) is another measure of goodness of fit, comparing the observed and expected values. Lower values indicate a better fit.
Pseudo R-squared (CS) (0.9999) is a measure of the proportion of variance explained by the model. A value close to 1 suggests a good fit.
Coefficients and their significance:
The “coef” column shows the estimated coefficients for each independent variable.
“const” (190.1769) represents the intercept or the expected value of “Sales” when all independent variables are zero.
“TV” (59.9624), “Radio” (27.2419), “Social Media” (0.6669), and “Influencer” (0.1651) are the coefficients for the respective independent variables.
The “std err” column shows the standard errors associated with each coefficient estimate.
The “z” column represents the z-statistic, which measures the statistical significance of each coefficient. Higher absolute values of z-statistic indicate greater significance.
The “P>|z|” column shows the p-value associated with each z-statistic. Lower p-values (typically < 0.05) suggest that the coefficient is statistically significant.
The “[0.025, 0.975]” columns represent the 95% confidence interval for each coefficient estimate.
Model Summary:
No. Observations (457) indicates the total number of data points used in the model.
Df Residuals (452) represents the degrees of freedom for the residuals.
Df Model (4) represents the degrees of freedom for the model (number of independent variables).
Scale (781.34) is an estimate of the scale parameter (dispersion) of the model.
Log-Likelihood (-2168.0) measures the goodness of fit of the model. Higher values indicate a better fit.
No. Iterations (3) indicates the number of iterations required for the model to converge.
Business Insights Interpretation:
Based on the results, the coefficients for “TV” and “Radio” are statistically significant (p-values < 0.05), indicating that they have a significant impact on “Sales.” The coefficients for “Social Media” and “Influencer” are not statistically significant (p-values > 0.05), suggesting that they may not have a significant effect on “Sales.”
The model assumes a Gaussian distribution for the dependent variable, and the identity link function suggests a linear relationship between the predictors and the response variable. The goodness of fit measures (Deviance, Pearson chi2, and Pseudo R-squared) indicate a relatively good fit of the model to the data.