Starbucks - Advertising Promotion Optimization

Éverton Bin



Background Information

The dataset provided in this project was originally used as a take-home assignment provided by Starbucks for their job candidates. The data consists of about 120,000 data points split in a 2:1 ratio between training and test files. In the experiment simulated by the data, an advertising promotion was tested to see if it would bring more customers to purchase a specific product priced at $10. Since it costs the company $0.15 to send out each promotion, it would be best to limit the promotion to those who are most receptive to it. Each data point includes one column indicating whether or not an individual was sent a promotion for the product, and one column indicating whether or not that individual eventually purchased that product. Each individual also has seven additional features associated with them, which are provided abstractly as V1-V7.

Optimization Strategy

The task is to use the training data to understand which patterns in V1-V7 indicate that a promotion should be provided to a user. Specifically, the goal is to maximize the following two metrics, the Incremental Response Rate (IRR) and the Net Incremental Revenue (NIR):

IRR depicts how many more customers purchased the product with the promotion, as compared to if they didn't receive the promotion. Mathematically, it's the ratio of the number of purchasers in the promotion group to the total number of customers in the promotion group (treatment) minus the ratio of the number of purchasers in the non-promotional group to the total number of customers in the non-promotional group (control).

$$ IRR = \frac{purch_{treat}}{cust_{treat}} - \frac{purch_{ctrl}}{cust_{ctrl}} $$

NIR depicts how much is made (or lost) by sending out the promotion. Mathematically, this is 10 times the total number of purchasers that received the promotion minus 0.15 times the number of promotions sent out, minus 10 times the number of purchasers who were not given the promotion.

$$ NIR = (10\cdot purch_{treat} - 0.15 \cdot cust_{treat}) - 10 \cdot purch_{ctrl}$$
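The two formulas above translate directly into code. The following is a minimal sketch; the function names and the example counts are made up for illustration and do not come from the dataset.

```python
def irr(purch_treat, cust_treat, purch_ctrl, cust_ctrl):
    """Incremental Response Rate: treatment purchase rate minus control purchase rate."""
    return purch_treat / cust_treat - purch_ctrl / cust_ctrl


def nir(purch_treat, cust_treat, purch_ctrl, price=10.0, promo_cost=0.15):
    """Net Incremental Revenue: treatment revenue net of promotion costs, minus control revenue."""
    return (price * purch_treat - promo_cost * cust_treat) - price * purch_ctrl


# Made-up counts, for illustration only.
example_irr = irr(purch_treat=20, cust_treat=1000, purch_ctrl=10, cust_ctrl=1000)
example_nir = nir(purch_treat=20, cust_treat=1000, purch_ctrl=10)
print(f"IRR = {example_irr:.4f}, NIR = {example_nir:.2f}")
```

With these illustrative counts the promotion doubles the purchase rate and still loses money, which is why both metrics have to be considered together.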

For a full description of what Starbucks provides to candidates see the instructions available here.

How To Test The Strategy?

Once an optimization strategy has been defined, the promotion_strategy function is passed to the test_results function.
From past data, we know there are four possible outcomes:

Table of actual promotion vs. predicted promotion customers:

                 Actual
Predicted    Yes      No
Yes          I        II
No           III      IV

The metrics are only compared for the individuals we predict should obtain the promotion, that is, quadrants I and II. Since the first set of individuals that receive the promotion (in the training set) receive it randomly, we can expect quadrants I and II to contain approximately equal numbers of participants.

Comparing quadrant I to II then gives an idea of how well the promotion strategy will work in the future.

Each strategy will be tested against the test dataset used in the final test_results function.
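The quadrant logic described above can be sketched as follows. This is not the test_results function itself, only an illustration of the idea; the field names `Promotion` and `purchase`, the function name `score_strategy`, and the toy records are assumptions.

```python
def score_strategy(records, predictions, price=10.0, promo_cost=0.15):
    """Compute (IRR, NIR) over the individuals the strategy would promote.

    records: dicts with 'Promotion' ('Yes'/'No') and 'purchase' (0/1); assumed names.
    predictions: one 'Yes'/'No' decision per record, produced by the strategy.
    """
    selected = [r for r, p in zip(records, predictions) if p == "Yes"]
    treat = [r for r in selected if r["Promotion"] == "Yes"]  # quadrant I
    ctrl = [r for r in selected if r["Promotion"] == "No"]    # quadrant II
    if not treat or not ctrl:
        raise ValueError("need selected individuals in both quadrants I and II")
    purch_treat = sum(r["purchase"] for r in treat)
    purch_ctrl = sum(r["purchase"] for r in ctrl)
    irr = purch_treat / len(treat) - purch_ctrl / len(ctrl)
    nir = (price * purch_treat - promo_cost * len(treat)) - price * purch_ctrl
    return irr, nir


# Toy records, for illustration only.
records = [
    {"Promotion": "Yes", "purchase": 1},
    {"Promotion": "Yes", "purchase": 0},
    {"Promotion": "No", "purchase": 0},
    {"Promotion": "No", "purchase": 0},
    {"Promotion": "Yes", "purchase": 0},  # predicted "No": quadrant III, ignored
    {"Promotion": "No", "purchase": 1},   # predicted "No": quadrant IV, ignored
]
predictions = ["Yes", "Yes", "Yes", "Yes", "No", "No"]
toy_irr, toy_nir = score_strategy(records, predictions)
```

Rows predicted as "No" never enter the computation, so a strategy is judged only on the clients it chooses to target.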

Table of Contents

  1. Checking A/B Test
    1. Invariant Metric
    2. Evaluation Metric
  2. Exploratory Analysis
    1. Correlation
      1. Considerations on Correlation Analysis
    2. Histogram
      1. Considerations on Histogram Analysis
  3. Starting Parameters
    1. All the Clients
    2. Client Segmentation Using Individual Correlation
  4. Testing Different Approaches
    1. Logistic Regression
    2. XGBoost Classifier
      1. Basic Approach
      2. Manipulating Features - Approach 1
      3. Manipulating Features - Approach 2
      4. Manipulating Features - Approach 3
  5. Conclusion

Checking A/B Test

Since we are dealing with the results of an A/B test, we are going to run a statistical test on both the invariant and the evaluation metrics: in this case, the number of clients assigned to the control and experimental groups, and the number of purchases in each group, respectively.

In both tests, we are going to use a significance level of 5%.

Invariant Metric

First, let's check whether or not the invariant metric (number of clients assigned to the control and experimental groups) is statistically equal by computing the p-value when comparing the two groups:
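One way to run this check is a two-sided z-test on the share of clients assigned to the experimental group, using a normal approximation to the binomial. The sketch below assumes an expected 50/50 split; the function name and the counts are hypothetical and are not the figures from the dataset.

```python
import math


def invariant_p_value(n_control, n_treatment, expected_share=0.5):
    """Two-sided p-value for the group split (normal approximation to the binomial)."""
    n_total = n_control + n_treatment
    std = math.sqrt(n_total * expected_share * (1 - expected_share))
    z = (n_treatment - n_total * expected_share) / std
    # P(|Z| >= |z|) for a standard normal equals erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))


# Hypothetical counts, for illustration only.
p_balanced = invariant_p_value(n_control=5000, n_treatment=5050)
p_skewed = invariant_p_value(n_control=4000, n_treatment=6000)
print(f"balanced split: p = {p_balanced:.3f}; skewed split: p = {p_skewed:.3g}")
```

A p-value above the 5% significance level means we cannot reject the hypothesis that clients were assigned evenly to both groups.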

Evaluation Metric

Once the composition of the control and experimental groups is considered statistically sound, we are going to check whether the experiment effectively increased the number of purchases in comparison to the control group:
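A common choice for this comparison is a one-sided two-proportion z-test with a pooled standard error. The following is a minimal sketch under that assumption; the function name and the counts are hypothetical.

```python
import math


def purchase_rate_p_value(purch_ctrl, cust_ctrl, purch_treat, cust_treat):
    """One-sided p-value for H1: treatment purchase rate > control purchase rate."""
    rate_ctrl = purch_ctrl / cust_ctrl
    rate_treat = purch_treat / cust_treat
    pooled = (purch_ctrl + purch_treat) / (cust_ctrl + cust_treat)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / cust_ctrl + 1 / cust_treat))
    z = (rate_treat - rate_ctrl) / std_err
    # Upper-tail probability of a standard normal: P(Z >= z).
    return 0.5 * math.erfc(z / math.sqrt(2))


# Hypothetical counts, for illustration only.
p_effect = purchase_rate_p_value(purch_ctrl=300, cust_ctrl=40000,
                                 purch_treat=700, cust_treat=40000)
p_no_effect = purchase_rate_p_value(purch_ctrl=100, cust_ctrl=1000,
                                    purch_treat=100, cust_treat=1000)
```

A p-value below 0.05 would indicate that the promotion group purchased at a significantly higher rate than the control group.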

Exploratory Analysis

In the experiment, the promotion was randomly sent to Starbucks' clients. The goal is to use the data collected to improve the strategy by sending the promotion only to the ones who are more likely to use it.

In other words, we need to differentiate the customers who would buy the product because of the promotion from those who would buy it even without the promotion, and from those who wouldn't buy it regardless of any promotion.

Correlation

It's interesting to observe that, considering the only difference between control and experimental groups is whether or not they were given a promotion ticket, among the features with higher correlation, the one that was most responsive to the experiment is represented by V4.

V4 already had a positive correlation with the purchase feature, considering the control group. With the promotion event, its positive correlation increased by over 2000%, which is impressive. Considering the ideal experiment, where all the other variables are controlled, we could say that the higher V4 is, the more susceptible the client is to use the promotion and purchase the product. Since V4 is a binary feature (1 or 2), we can translate it as class 2 being more receptive to the promotion event.

V5 behaves the same as V4, but with lower intensity. In the control group, it already showed a positive correlation with the purchase, and this correlation was strengthened by the promotion, increasing by over 300%. V5 represents categories from 1 to 4, indicating that categories represented by higher numbers tend to be more affected by the promotion, in a positive way.

A different response is shown when observing the V3 feature. In the control group, this variable represented the highest positive correlation, while in the experimental group this tendency was reversed: it became the second-highest correlation score, and this time in a negative direction. It looks like clients related to higher values of V3 have a naturally higher propensity for purchasing the product. At the same time, these clients are not quite receptive to promotional events.

Other features like V2 show a great variation when comparing control and experimental groups. However, they don't stand out because their correlation values are too low in comparison to the other features.
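The comparison above relies on computing the correlation between each feature and the purchase flag separately for the control and experimental groups, and then measuring the relative change. A minimal sketch of that computation follows; the field names `Promotion` and `purchase`, the helper names, and the toy rows are assumptions, and the toy values are far more extreme than the real data.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def percent_change(corr_control, corr_experimental):
    """Relative change of the experimental correlation with respect to the control one."""
    return (corr_experimental - corr_control) / abs(corr_control) * 100


def feature_correlation_by_group(rows, feature):
    """Correlation of `feature` with `purchase`, computed separately per group."""
    result = {}
    for group in ("No", "Yes"):
        subset = [r for r in rows if r["Promotion"] == group]
        result[group] = pearson([r[feature] for r in subset],
                                [r["purchase"] for r in subset])
    return result


# Toy rows as (V4, purchase) pairs per group, for illustration only.
control = [(1, 0), (1, 0), (1, 1), (2, 0), (2, 1), (2, 1)]
experimental = [(1, 0), (1, 0), (1, 0), (2, 1), (2, 1), (2, 1)]
rows = [{"Promotion": "No", "V4": v, "purchase": p} for v, p in control]
rows += [{"Promotion": "Yes", "V4": v, "purchase": p} for v, p in experimental]

corrs = feature_correlation_by_group(rows, "V4")
change = percent_change(corrs["No"], corrs["Yes"])
```

Dividing by the absolute value of the control correlation keeps the sign of the change meaningful when the control correlation is negative.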

Considerations on Correlation Analysis

Just by looking at the correlation values for both groups, we could say that, in general, clients that belong to class 2 in feature V4, higher classes in V5, and are related to lower V3 values are the clients with more potential of being positively responsive to the promotion.

Also, we could see that clients associated with higher values of V3 seem to be more willing to purchase the product regardless of any promotional event. In fact, promotional events seem to have a negative impact on their interest in purchasing the product.

At the same time, since even the highest correlation values are in fact low, there must be other important client characteristics, not captured by these features, that affect the positive or negative responses to the promotional event.

Histogram

In the control group, we can see some differences between the ones who purchased the product or not. For example, people related to the class 1 in the V1 feature seem to be more likely to purchase the product in comparison to the other classes.

V3 reinforces what we concluded from the correlation analysis: clients with higher V3 values are more likely to buy the product without any promotional event.

V6 shows a slight advantage for class 2 as a trend for naturally purchasing the product.

In the experimental group we can see some interesting changes. In V1, for example, the trend presented by class 1 in the control group is reversed: with the promotion, class 2 proportionally increases its share among the buyers, while class 1 decreases its share.

V2 shows that clients with values closer to the mean are more responsive to the promotion.

Again, we can see through V3 that its distribution resembles a mirror image of the control group, considering the ones who purchased the product. In other words, lower values of V3 represent consumers that purchase the product because of the promotion, while higher values of V3 represent consumers that are discouraged from buying the product because of the promotion.

Class 1 in V4 tends not to be responsive to the promotional event, as well as class 2 in V6 and, more significantly, in the V5 feature. On the other hand, class 3 in V5 seems to increase its purchase rate with the promotional event.

Considerations on Histogram Analysis

We could reinforce some of the conclusions settled before, and we could see some changes in the behavior of the clients depending on whether they were submitted to the promotional event or not.

Although we don't know exactly what each feature represents, I would guess that V2 represents something like age, given that the subjects who are around their thirties tend to be more responsive to the promotion.

V3 could easily represent the subjects' income, since the ones with higher incomes would buy the product whenever they wanted to, while the ones with lower incomes would increase their purchases once they were exposed to a promotional event.

Starting Parameters

Once we have identified one consumer's potential profile that is more receptive to the promotion offer, let's simply test what would be the test result if we sent the promotion to:

  1. all the clients;
  2. the clients within features' values with higher positive correlation.

Approach 1 - All the Clients

As expected, the strategy of sending the promotion to all the clients optimizes neither the Incremental Response Rate nor the Net Incremental Revenue.

Approach 2 - Client Segmentation Using Individual Correlation

In comparison to the first approach, this simple client segmentation based on individual feature correlation already showed some advances. Still, we need to test approaches that consider the incremental contribution of each one of the features acting together.
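The two starting strategies can be sketched as simple functions that return a "Yes"/"No" decision per client. The rule below follows the profile from the correlation analysis (class 2 in V4, higher classes in V5, lower V3), but the function names, the thresholds `v3_max` and `v5_min`, and the representation of clients as dictionaries are hypothetical choices for illustration, not the exact rule used in the notebook.

```python
def promote_everyone(clients):
    """Baseline: send the promotion to every client."""
    return ["Yes" for _ in clients]


def promote_by_segment(clients, v3_max=0.0, v5_min=3):
    """Send the promotion only to clients matching hand-picked rules.

    The thresholds are hypothetical; they only mirror the direction of the
    correlations discussed in the exploratory analysis.
    """
    decisions = []
    for client in clients:
        receptive = (
            client["V4"] == 2
            and client["V5"] >= v5_min
            and client["V3"] <= v3_max
        )
        decisions.append("Yes" if receptive else "No")
    return decisions


# Toy clients, for illustration only.
clients = [
    {"V3": -1.2, "V4": 2, "V5": 3},
    {"V3": 0.8, "V4": 2, "V5": 4},
    {"V3": -0.5, "V4": 1, "V5": 3},
]
baseline = promote_everyone(clients)
segmented = promote_by_segment(clients)
```

Because each rule looks at one feature in isolation, this kind of segmentation cannot capture interactions between features, which motivates the model-based approaches.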

Testing Different Approaches

Approach 3 - Logistic Regression

In this approach, we are going to train a logistic regression model to predict whether or not a client should be sent the promotion. We are using only the experimental group, and we treat the clients who made a purchase as the group that should receive the promotion.

Logistic Regression improved our previous correlation approach, but it's not a good model yet.

Let's check for the most important features using a Decision Tree algorithm:

As a next step, let's try the same Logistic Regression approach, this time selecting only the 4 most important variables to see if it improves the model:

Feature selection did improve the model, but we haven't achieved our goal yet.

Approach 4 - XGBoost Classifier

Basic Approach

Up until now, we have trained the models using only our experimental group. In this fourth approach, let's try one kind of one-versus-rest classification.

In this attempt, let's consider that only the clients who received the promotion and purchased the product define the feature profile of those who should receive the promotion. All the other clients represent the profile that should not receive the promotion:

The XGBoost Classifier improved our IRR and NIR metrics, but the latter still needs improvement.
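The labeling rule described above can be sketched as follows, before any classifier is trained. The field names `Promotion` and `purchase` and the function name `build_target` are assumptions.

```python
def build_target(rows):
    """Label 1 for clients who received the promotion and purchased; 0 for everyone else."""
    return [
        1 if row["Promotion"] == "Yes" and row["purchase"] == 1 else 0
        for row in rows
    ]


# One toy row per combination of promotion and purchase, for illustration only.
rows = [
    {"Promotion": "Yes", "purchase": 1},
    {"Promotion": "Yes", "purchase": 0},
    {"Promotion": "No", "purchase": 1},
    {"Promotion": "No", "purchase": 0},
]
target = build_target(rows)
```

Unlike the logistic regression approach, this labeling keeps the control group in the training data, so clients who purchase without a promotion become negative examples.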

Manipulating Features - Approach 1

In the next trial, let's transform the continuous numeric variables into classes distributed as observed during the histogram analysis to see if it improves generalization. Once again, let's use only the experimental group:

We can see that this strategy did optimize our metrics. The results are not as good as those of Starbucks itself, but they are still a significant improvement.
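Turning a continuous variable into classes amounts to choosing bin edges and mapping each value to the bin it falls in. The following is a minimal sketch; the helper name `to_class` and the edges are hypothetical and are not the cut points used in the notebook.

```python
from bisect import bisect_right


def to_class(value, edges):
    """Map a continuous value to a 1-based class index given sorted bin edges."""
    return bisect_right(edges, value) + 1


# Hypothetical edges: five cut points produce six classes.
v3_edges = [-1.0, -0.5, 0.0, 0.5, 1.0]
v3_values = [-1.7, -0.2, 0.0, 2.3]
v3_classes = [to_class(v, v3_edges) for v in v3_values]
```

With `bisect_right`, a value equal to an edge falls into the class above that edge, so every value maps to exactly one class.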

Our understanding of the behavior of the features during the exploratory analysis was fundamental to elaborate this strategy. We are going to keep this same approach, now generalizing a little bit more the classes we created before.

Manipulating Features - Approach 2

Considering the V2 feature distribution, the response to the promotion looks similar along its values, with slight differences. Still, the largest proportion of clients is situated around the mean value, as in a traditional Gaussian distribution. With that in mind, let's try to focus on the largest group of clients by 'cutting off' the tails into one single group.

At the same time, we are simplifying the V3 classes in an attempt to make them more general:

We can see that this more generalist strategy came up with similar results.

Manipulating Features - Approach 3

As a next attempt, we are going to distinguish the classes not only by their distributions and importance: each class label will also act as a weight that reflects how relevant the class is among the clients who purchased the product in the experimental group:

In this last approach, we finally achieved results that surpass the ones given by Starbucks. While our IRR is similar to that of Starbucks' model, our NIR showed a great improvement, close to 15%.
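One possible reading of this weighting is to rank the classes by how often they occur among the buyers of the experimental group and use the rank as the class label. The sketch below illustrates that interpretation only; the function name `weight_labels` and the toy data are hypothetical, and the notebook's actual weights may have been chosen differently.

```python
from collections import Counter


def weight_labels(buyer_classes):
    """Map each class to a rank-based weight.

    The class that occurs most often among buyers gets the highest weight;
    ties are broken by class id so the result is deterministic.
    """
    counts = Counter(buyer_classes)
    ordered = sorted(counts, key=lambda cls: (counts[cls], cls))
    return {cls: rank for rank, cls in enumerate(ordered, start=1)}


# Toy class ids of experimental-group buyers, for illustration only.
buyer_classes = [1, 1, 1, 2, 3, 3]
weights = weight_labels(buyer_classes)
relabeled = [weights[cls] for cls in buyer_classes]
```

Encoding the classes this way gives the model an ordinal feature whose magnitude grows with the class's relevance among buyers.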

Conclusion

We could see through this project the importance of data analysis in Data Science projects. Even when it's not clear what a feature represents, through the analysis process we were able to guess what some features could represent, and more importantly how they related to our business problem.

Understanding how differently these features behave when comparing our study groups made it possible to elaborate different feature engineering strategies that resulted in better optimization.

The most successful model came up with this strategy:

  • V2 was separated into classes labeled with the weight of their occurrences in the dataset, following a normal distribution;
  • V3 was separated into 6 classes; lower values were labeled with higher numbers since they represent the clients who are more likely to respond to the promotional event;
  • Besides that, we could guess that V2 probably represents clients' age, while V3 stands for their income.