Capstone Project: Create a Customer Segmentation Report for Arvato Financial Services

Éverton Bin

I. Introduction

In this project, you will analyze demographics data for customers of a mail-order sales company in Germany, comparing it against demographics information for the general population. You'll use unsupervised learning techniques to perform customer segmentation, identifying the parts of the population that best describe the core customer base of the company. Then, you'll apply what you've learned on a third dataset with demographics information for targets of a marketing campaign for the company, and use a model to predict which individuals are most likely to convert into becoming customers for the company. The data that you will use has been provided by our partners at Bertelsmann Arvato Analytics and represents a real-life data science task.

If you completed the first term of this program, you will be familiar with the first part of this project, from the unsupervised learning project. The versions of those two datasets used in this project include many more features and have not been pre-cleaned. You are also free to choose whatever approach you'd like for analyzing the data rather than follow pre-determined steps. In your work on this project, make sure that you carefully document your steps and decisions, since your main deliverable for this project will be a blog post reporting your findings.

Get to Know the Data

There are four data files associated with this project:

- "AZDIAS": demographics data for the general population of Germany;
- "CUSTOMERS": demographics data for customers of the mail-order company;
- "MAILOUT" ("TRAIN"): demographics data for targets of a marketing campaign, including the campaign response;
- "MAILOUT" ("TEST"): demographics data for targets of a marketing campaign, with the response withheld.

Each row of the demographics files represents a single person, but also includes information beyond the individual, such as attributes of their household, building, and neighborhood. Use the information from the first two files to figure out how customers ("CUSTOMERS") are similar to or differ from the general population at large ("AZDIAS"), then use your analysis to make predictions on the other two files ("MAILOUT"), predicting which recipients are most likely to become a customer for the mail-order company.

The "CUSTOMERS" file contains three extra columns ('CUSTOMER_GROUP', 'ONLINE_PURCHASE', and 'PRODUCT_GROUP'), which provide broad information about the customers depicted in the file. The original "MAILOUT" file included one additional column, "RESPONSE", which indicated whether or not each recipient became a customer of the company. For the "TRAIN" subset, this column has been retained, but in the "TEST" subset it has been removed; it is against that withheld column that your final predictions will be assessed in the Kaggle competition.

Otherwise, all of the remaining columns are the same between the three data files. For more information about the columns depicted in the files, you can refer to two Excel spreadsheets provided in the workspace. One of them is a top-level list of attributes and descriptions, organized by informational category. The other is a detailed mapping of data values for each feature in alphabetical order.

In the cell below, we've provided some initial code to load in the first two datasets. Note that all of the .csv data files in this project are semicolon (;) delimited, so an additional argument has been included in the read_csv() call to read in the data properly. Also, considering the size of the datasets, it may take some time for them to load completely.
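A minimal sketch of that loading step, assuming pandas; the commented-out path is only a placeholder for the actual workspace file, and the small in-memory example just demonstrates why `sep=';'` matters:

```python
import io

import pandas as pd

# The real load would look like this (the path is a placeholder):
# azdias = pd.read_csv('path/to/AZDIAS.csv', sep=';')

# Small demonstration of the semicolon-delimited format:
raw = "LNR;AGER_TYP;ALTERSKATEGORIE_GROB\n1;2;3\n2;-1;4\n"
df = pd.read_csv(io.StringIO(raw), sep=';')
print(df.shape)  # (2, 3)
```

Without `sep=';'`, each row would be parsed as a single column, which is the first thing to check if the loaded frame looks wrong.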

You'll notice when the data is loaded in that a warning message will immediately pop up. Before you really start digging into the modeling and analysis, you're going to need to perform some cleaning. Take some time to browse the structure of the data and look over the informational spreadsheets to understand the data values. Make some decisions on which features to keep, which features to drop, and if any revisions need to be made on data formats. It'll be a good idea to create a function with pre-processing steps, since you'll need to clean all of the datasets before you work with them.

II. Table of Contents

Part 1: Customer Segmentation Report

1.1 Data Overview

1.2 Cleaning Data

$\;\;\;\;\;$1.2.1 NaN and Unknown Values

$\;\;\;\;\;$1.2.2 Non-Informative Columns

$\;\;\;\;\;$1.2.3 Columns' Types

$\;\;\;\;\;$1.2.4 Feature Engineering

$\;\;\;\;\;$1.2.5 Correlation Analysis

$\;\;\;\;\;$1.2.6 Applying Data Cleaning on Customer Data

1.3 Exploratory Data Analysis

$\;\;\;\;\;$1.3.1 Age

$\;\;\;\;\;$1.3.2 Youth Movements

$\;\;\;\;\;$1.3.3 Location

$\;\;\;\;\;$1.3.4 Consumer Classification

$\;\;\;\;\;$1.3.5 Income

$\;\;\;\;\;$1.3.6 Habits and Other Curiosities

1.4 The Wise-Conscious Avant-Gardes

1.5 Cluster Analysis

$\;\;\;\;\;$1.5.1 Feature Engineering

$\;\;\;\;\;$1.5.2 NaN Values

$\;\;\;\;\;$1.5.3 Standardizing Data

$\;\;\;\;\;$1.5.4 Dimensionality Reduction

$\;\;\;\;\;$1.5.5 Defining the Number of Clusters

$\;\;\;\;\;$1.5.6 Applying Transformations on Customer Data

$\;\;\;\;\;$1.5.7 Clustering

$\;\;\;\;\;$1.5.8 Evaluating Clusters

$\;\;\;\;\;\;\;\;\;\;$1.5.8.1 Overrepresented Clusters

$\;\;\;\;\;\;\;\;\;\;$1.5.8.2 Underrepresented Clusters


Part 2: Supervised Learning Model

2.1 Data Transformation

2.2 Analyzing Learning Curves

2.3 Training Classifiers

$\;\;\;\;\;$2.3.1 Training on Unbalanced Data

$\;\;\;\;\;$2.3.2 Training on Balanced Data

$\;\;\;\;\;$2.3.3 Using Information Level, PCA, and Truncated SVD

$\;\;\;\;\;$2.3.4 Using PCA Transformation

$\;\;\;\;\;$2.3.5 XGBoost Classifier and Bayesian Optimization

$\;\;\;\;\;$2.3.6 LightGBM and Bayesian Optimization


Part 3: Kaggle Competition

3.1 Attempt 1: Training on Unbalanced Data

3.2 Attempt 2: Training on Balanced Data

3.3 Attempt 3: Information Level, PCA, and TruncatedSVD

3.4 Attempt 4: PCA Transformation

3.5 Attempt 5: XGBoost Classifier and Bayesian Optimization

3.6 Attempt 6: LightGBM and Bayesian Optimization

4. Conclusion


Part 1: Customer Segmentation Report

The main bulk of your analysis will come in this part of the project. Here, you should use unsupervised learning techniques to describe the relationship between the demographics of the company's existing customers and the general population of Germany. By the end of this part, you should be able to describe parts of the general population that are more likely to be part of the mail-order company's main customer base, and which parts of the general population are less so.

1.1 Data Overview

1.2 Cleaning Data

1.2.1 NaN and Unknown Values

Since there is a large number of columns, it's possible to be conservative when eliminating columns with many NaN values. In this case, columns in which more than 35% of the values are missing will be deleted.

First, variables will be mapped in order to check if there are unknown values that are represented by a specific class. In that case, these unknown values will be transformed into NaN values as well.
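The two steps above can be sketched together; note that the `unknown_map` here is illustrative (in the project it would be built from the attribute spreadsheet, where codes such as -1 or 0 mean "unknown" for specific columns):

```python
import numpy as np
import pandas as pd

# Illustrative mapping of column -> codes that mean "unknown".
unknown_map = {'AGER_TYP': [-1, 0], 'HEALTH_TYP': [-1]}

def clean_unknowns(df, unknown_map, nan_threshold=0.35):
    df = df.copy()
    # Turn codes that mean "unknown" into real NaN values, column by column.
    for col, codes in unknown_map.items():
        if col in df.columns:
            df[col] = df[col].replace(codes, np.nan)
    # Keep only columns whose NaN share does not exceed the threshold (35%).
    keep = df.columns[df.isna().mean() <= nan_threshold]
    return df[keep]

demo = pd.DataFrame({'AGER_TYP': [-1, 2, 2, 3],
                     'HEALTH_TYP': [-1, -1, -1, 1],
                     'OK_COL': [1, 2, 3, 4]})
cleaned = clean_unknowns(demo, unknown_map)
print(list(cleaned.columns))  # ['AGER_TYP', 'OK_COL'] (HEALTH_TYP is 75% NaN)
```

Mapping unknown codes to NaN first matters: a column can look complete while actually being mostly "unknown".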

1.2.2 Non-Informative Columns

1.2.3 Columns' Types

1.2.4 Feature Engineering

While going through the features one by one, two of them stood out for encoding more than one apparently important piece of information:

These observations will be used to create new features that may help throughout the analysis. Another transformation to be performed is simplifying the ALTER_HH variable to represent decades.
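One of these splits, PRAEGENDE_JUGENDJAHRE into YOUTH_DECADE and AVANT_GARDE (used later in the analysis), can be sketched as follows. The code-to-decade and code-to-movement mappings are reconstructed from the data dictionary; verify them against the spreadsheet before reusing:

```python
import pandas as pd

# Reconstructed mapping: original codes 1-15 -> youth decade,
# with the avant-garde codes listed separately.
decade_map = {1: 40, 2: 40, 3: 50, 4: 50, 5: 60, 6: 60, 7: 60,
              8: 70, 9: 70, 10: 80, 11: 80, 12: 80, 13: 80, 14: 90, 15: 90}
avantgarde_codes = {2, 4, 6, 7, 9, 11, 13, 15}

def split_jugendjahre(df):
    df = df.copy()
    df['YOUTH_DECADE'] = df['PRAEGENDE_JUGENDJAHRE'].map(decade_map)
    df['AVANT_GARDE'] = (df['PRAEGENDE_JUGENDJAHRE']
                         .isin(avantgarde_codes).astype(int))
    return df.drop(columns=['PRAEGENDE_JUGENDJAHRE'])

demo = pd.DataFrame({'PRAEGENDE_JUGENDJAHRE': [1, 6, 14]})
out = split_jugendjahre(demo)
print(out['YOUTH_DECADE'].tolist(), out['AVANT_GARDE'].tolist())
# [40, 60, 90] [0, 1, 0]
```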

1.2.5 Correlation Analysis

The next step will be to verify the correlation between columns.

Highly correlated features indicate they may represent similar information. Selecting only one of these highly correlated features will help to reduce the number of variables to be considered along in the process.

Since most of the numerical variables represent ordinal classes, the most appropriate correlation analysis would be one using distance or rank approaches. Because of the limited computational power, and considering that these variables represent, in essence, a metric measure, the Pearson correlation will be applied in this task.
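The pruning of highly correlated features can be sketched like this; the 0.95 threshold is illustrative, not necessarily the one used in the project:

```python
import numpy as np
import pandas as pd

def drop_correlated(df, threshold=0.95):
    """Drop one feature from every pair whose absolute Pearson
    correlation exceeds `threshold`."""
    corr = df.corr().abs()
    # Look only at the upper triangle so each pair is checked once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [c for c in upper.columns if (upper[c] > threshold).any()]
    return df.drop(columns=to_drop)

demo = pd.DataFrame({'a': [1, 2, 3, 4],
                     'b': [2, 4, 6, 8],   # perfectly correlated with 'a'
                     'c': [4, 1, 3, 2]})
print(list(drop_correlated(demo).columns))  # ['a', 'c']
```

Keeping only one feature of each correlated pair reduces redundancy before the dimensionality reduction step later on.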

1.2.6 Applying Data Cleaning on Customer Data

With the features cleaned and pre-selected, some exploratory data analysis can be performed.

1.3 Exploratory Data Analysis

Through this exploratory analysis, the goal is to understand the company's customer profile and how this profile relates to the general population. The analysis will be focused on answering a few questions:

1.3.1 Age

After the pre-selection of features, age will be analyzed through the perspective of the YOUTH_DECADE variable, which represents the decade corresponding to the person's youth. The decades go from the '40s to the '90s, and considering the youth period to span ages 15 to 25, this variable can be interpreted as follows:

While in the general population most people have their youth related to the '90s, among customers that is the least representative class.

In the customers' group, the most representative classes are the '50s and the '60s as youth decades. As an approximation, it would be possible to say that this refers to people who are between 65 and 85 years old.

It's interesting to notice that the younger the group is, the less representative it is among clients and the more representative it is in the general population. In other words, older people are overrepresented, while younger people are underrepresented in the customers' group.
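The kind of side-by-side comparison used throughout this section can be sketched with a small helper; the two frames below are toy data standing in for the real datasets:

```python
import pandas as pd

def compare_proportions(general, customers, col):
    """Share of each class of `col` in both datasets, plus the gap
    (positive = overrepresented among customers)."""
    out = pd.DataFrame({
        'general': general[col].value_counts(normalize=True),
        'customers': customers[col].value_counts(normalize=True),
    }).fillna(0.0)
    out['diff'] = out['customers'] - out['general']
    return out.sort_index()

general = pd.DataFrame({'YOUTH_DECADE': [90, 90, 90, 80, 50]})
customers = pd.DataFrame({'YOUTH_DECADE': [50, 50, 60, 90]})
print(compare_proportions(general, customers, 'YOUTH_DECADE'))
```

Plotting the two proportion columns side by side as bars gives exactly the comparison charts described in this analysis.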

This age analysis brings up more questions to be studied and new possibilities to the company:

Age Insights

1.3.2 Youth Movements

PRAEGENDE_JUGENDJAHRE was previously split into the YOUTH_DECADE and the AVANT_GARDE features. Age was analyzed through the first engineered feature, and the second one indicating the dominating movement in the person's youth (avant-garde or mainstream) will be checked now:

If the analysis was only performed on customers' data, it would be possible to say that clients were equally represented by avant-gardes and mainstreamers.

When comparing this distribution to the general population, it's possible to see that actually avant-gardes are more likely to be interested in the products offered by the mail-order company. While people related to the avant-garde movement represent about 20% of the general population, among clients this representation rises up to 50%.

The opposite happens to mainstreamers: they represent almost 80% of the general population and about 50% of the customers.

Another important aspect is the fact that, during the correlation analysis, the features AVANT_GARDE and GREEN_AVANTGARDE presented a perfect positive correlation with each other. It could indicate that, although avant-garde movements may be related to different social/economic aspects through the years, they can always be related in some aspects to the green avant-garde movement.

More studies should be done to prove this theory, but in this brief analysis, it's possible to say that this correlation indicates that the company's clients are more interested in topics related to sustainability, or more concerned about the impacts that people's actions cause on the environment.

Since the green movement has been increasing over the years both in society and also in politics, this could be an important aspect to be explored in order to reach younger generations.

Youth Movements Insights

1.3.3 Location

The BALLRAUM variable is described as the distance to the next urban center, and it could give an indication of where the customers live and how that relates to the general population. Its classes go from 1 to 7:

Most of the clients live between 50 and 100 km from urban centers, which corresponds to class 6. However, this is not a characteristic that specifically defines the company's customers, since it follows the distribution of the general population.

Looking both at the bar plot and the statistics, it's clear that the general population and the customers follow the same distribution, meaning that equal proportions of the population in different urban centers positions are being reached.

The highest difference appears in class 1, which represents people living up to 10 km from the urban center. That makes sense: since it's a mail-order company, people close to urban centers may have more opportunities to buy these products directly in stores.

REGIOTYP classifies people according to their neighbourhood:

Although the proportional distributions don't show great differences between customers and the general population, there's a tendency of overrepresentation among customers for upper-class neighborhoods and underrepresentation for the other neighborhood types.

It becomes clearer when analyzing class 1 (upper-class), which represents about 7% of the general population and about 13% of the customers.

ORTSGR_KLS9 variable represents community size, considering the number of inhabitants:

As an overall view, it would be possible to say that consumers are proportionally equally distributed in the different community sizes.

There's a slight tendency of overrepresentation in cities of up to 50 thousand inhabitants and a tendency of underrepresentation in bigger cities with more than 700 thousand inhabitants, which corroborates the previous BALLRAUM variable analysis.

Location Insights

1.3.4 Consumer Classification

So far, it was possible to identify the company's typical clients as elderly people living up to 50 km from urban centers. Now, the CAMEO Classification will be used in order to better understand customers' consumption and lifestyle habits and compare them to the general population's habits.

CAMEO_INTL_2015 was divided into two variables: CAMEO_INTL_FAM_STATUS and CAMEO_INTL_FAM_COMPOSITION. The first one relates to 5 different classes:

CAMEO_INTL_FAM_COMPOSITION represents:

Wealthy and prosperous classes are the most representative among clients. In fact, wealthy is the most overrepresented in comparison to the general population.

On the other hand, the poorer status appears as the most representative in the general population and the most underrepresented among clients.

Beyond the overrepresentation seen especially in classes 4 and 5, which indicates that older families, mature couples, and elders in retirement are considerably more common among clients than in the general population, what catches the attention is the underrepresentation in class 1, related to pre-families and singles.

This corroborates one aspect seen before: that younger people are less likely to become clients of the company. Whether it's a matter of age, life situation, or any other condition would require deeper research.

CAMEO_DEU_2015 covers similar content to the previous variable, also being part of the CAMEO Classification 2015, this time with a more detailed breakdown:

Once again, there are indications that the distribution of customers along the different segments doesn't follow the general population distribution.

Among customers, the top classification is Fine Management, while among the general population, this position is taken by Petty Bourgeois.

Classes 1A to 2D, 3D, 4A, and 5D tend to be overrepresented among customers in comparison to the general population. The opposite happens in classes 7A and 8A to 9D.

It reinforces that there are specific segments in the population that are especially attracted to the products offered by the company. Although there's no further explanation about these classes, we can deduce by their names that they relate not only to the social class but also to people's behavior and habits.

Consumer Classifications Insights

1.3.5 Income

HH_EINKOMMEN_SCORE indicates the estimated household net income, corresponding to the following code:

Clients classified with highest income and very high income represent more than 50% of the clients. Their proportion among customers is more than twice as high as in the general population. The top customers' class is class 2 (very high income), exceeding 35%, while in the general population it represents about 15% of the people.

Classes 3 and 4, representing high income and average income, are practically equally represented both in the general population and among customers.

Lower income and especially very low income are underrepresented classes: while in the general population very low income represents almost 30% of the people, among clients its representation decreases to 7%.

Again, this can be seen as an opportunity. Since the majority of the population belongs to the lower income classes, if the company had the purpose of reaching a broader audience, it could consider releasing cheaper versions of its products.

Income Insights

1.3.6 Habits and Other Curiosities

To better understand customers' profiles, GFK_URLAUBERTYP indicates people's vacation habits, represented by the following codes:

It is possible to see some clear trends among customers that differ from the general population. There is a huge overrepresentation of nature fans as a vacation habit among customers. It may indicate that, beyond vacation habits, the company's customers have a mindset that values a connection with nature, which makes sense given that the company sells organic products. In a lower proportion, golden agers also seem to be overrepresented in the customers' group.

When looking at underrepresentation, at least four classes catch the attention: those without vacations, active families, family-oriented vacationists, and package-tour travelers. All of these are more representative in the general population than among customers.

LP_LEBENSPHASE_GROB refines the last classification, including information about incomes:

Corroborating with the previous analysis on age and incomes, the top class for customers is the one representing high-income earners of higher age.

The other classes that stand out in comparison to the general population are also related to high incomes, independently of family structure. The only exception is the class representing low-income and average earners of higher age, which seems to be overrepresented among customers, possibly because of the higher-age factor.

As seen before, pre-families and singles show the highest underrepresentation among clients. This time, it's possible to see that class 5, representing single high-income and earner-couples, is overrepresented among clients. It could indicate that income matters more than age or family composition when it comes to becoming a client.

On the other hand, while people classified as single low-income and average earners of younger age represent over 15% of the general population, among clients this percentage barely exceeds 1%.

ZABEOTYP indicates energy consumer types:

Among customers, it's possible to see an overrepresentation of energy consumers of the types green and fair supplied, indicating a tendency of a sense of responsibility with a conscious and sustainable energy consumption habit.

Habits and Curiosities Insights

1.4 The Wise-Conscious Avant-Gardes

If I had to reach out to the public that is more likely to join the company's customers group through a marketing campaign, I would focus on the wisdom related to the elders, but also on the consciousness of the impact that the consumption habits have on the planet.

Given the fact that they may have a special connection with nature, the consumption of organic products can improve the individual's health and also the planet, and that is an aspect that can be explored when reaching out to customers.

It would also be important to highlight the avant-garde profile of these people, of those who think ahead of their time, indicating that the habit of consuming organic products is not just a lifestyle, but a legacy for future generations.

As a big picture, the regular customer of the mail-order company would be:

1.5 Cluster Analysis

Now, unsupervised machine learning techniques will be used in order to identify hidden patterns in the data, clustering the population into different groups, each one composed of people with similar characteristics.

With the defined clusters, it will be possible to perform a new comparison between customers and the general population. The difference is that, this time, the comparison won't be performed over one dimension (one variable), but over the different groups created through the effect that all the variables together have on these groups.

1.5.1 Feature Engineering

First, there will be one more feature engineering process, this time over CAMEO_DEU_2015 feature. Since there are 44 different classifications, they will be grouped according to the behavior presented when comparing customers and the general population, following this code:
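The mechanics of that grouping can be sketched as below. The actual class-to-group assignment was made from the over/underrepresentation observed earlier; the mapping here is illustrative only, with a few example classes:

```python
import pandas as pd

# Illustrative assignment; the real one covers all 44 CAMEO_DEU_2015 classes.
cameo_group = {'1A': 'over', '1B': 'over', '3D': 'over',
               '7A': 'under', '8A': 'under', '9D': 'under'}

def group_cameo(df):
    df = df.copy()
    # Classes not in the mapping fall into a neutral group.
    df['CAMEO_DEU_GROUP'] = (df['CAMEO_DEU_2015']
                             .map(cameo_group).fillna('neutral'))
    return df

demo = pd.DataFrame({'CAMEO_DEU_2015': ['1A', '7A', '5C']})
print(group_cameo(demo)['CAMEO_DEU_GROUP'].tolist())
# ['over', 'under', 'neutral']
```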

Since new features were created during the process, a complementary dictionary will be created to specify the new columns dtypes:

1.5.2 NaN Values

The approach that will be used to fill in nan values is the following:

1.5.3 Standardizing Data

Even after the feature selection process, many columns were left to be analyzed. Because of that, Principal Component Analysis will be applied to the data.

In order to apply the PCA algorithm, the values need to be on the same scale. For that, one function will be defined to fit the model to the data, and another to use the fitted model to transform the data:
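A minimal sketch of that fit/transform split, assuming scikit-learn's StandardScaler (fitting only once, on the general population, so the same scaling can be reused on the customer data):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

def fit_scaler(df):
    """Fit a StandardScaler (zero mean, unit variance) on the data."""
    return StandardScaler().fit(df)

def scale(df, scaler):
    """Apply a previously fitted scaler, keeping the DataFrame structure."""
    return pd.DataFrame(scaler.transform(df),
                        columns=df.columns, index=df.index)

demo = pd.DataFrame({'x': [1.0, 2.0, 3.0], 'y': [10.0, 20.0, 30.0]})
scaler = fit_scaler(demo)
scaled = scale(demo, scaler)
print(scaled.mean().round(6).tolist())  # [0.0, 0.0], both columns centered
```

Separating fit from transform is what lets the customer dataset be scaled with the statistics learned from the general population.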

Since there are different levels of information, according to the informational spreadsheet, the dimensionality reduction, in this case, the PCA technique, will not be applied to the whole dataset at once. Different components will be created for different levels of information.

Because of that, the dataset will be split according to their information level, and the transformations will be applied to these subsets.

Next, all the transformations will be joined together in one function, and then it will be applied to the different levels of information.

Considering the number of features after the transformations, PCA will be applied only on the first four levels of information:

The Community level has only three features, and because of that, they will be kept without further transformations.

1.5.4 Dimensionality Reduction

Once the data is treated and standardized, before actually applying the dimensionality reduction, PCA will be applied with standard parameters in order to decide the number of components to keep.

To decide on the number of components, a scree plot will be created for each information level.

The purpose is to retain most of the data variability with as few components as possible, simplifying the resulting data.
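The component-count decision behind each scree plot can be sketched as a cumulative explained-variance search; the matrix below is synthetic stand-in data, not one of the real information-level subsets:

```python
import numpy as np
from sklearn.decomposition import PCA

def n_components_for(X, target=0.60):
    """Fit a full PCA and return the smallest number of components whose
    cumulative explained variance reaches the target rate."""
    pca = PCA().fit(X)
    cum = np.cumsum(pca.explained_variance_ratio_)
    return int(np.searchsorted(cum, target) + 1)

# Synthetic stand-in for one standardized information-level subset.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
X[:, 1] = 0.9 * X[:, 0] + rng.normal(scale=0.1, size=200)  # correlated pair

print(n_components_for(X, target=0.60))
```

Plotting `cum` against the component index gives the scree plot itself; the function just reads the answer off that curve.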

Since it's a client segmentation problem, the assumption is that the person information level must be more important than the household information level, which in turn is more important than the microcell and macrocell information levels, the last being the most general on this scale.

Because of that, when deciding the number of components to keep, more important levels will have the number of components necessary to explain about 60% of the variance, while less important levels will be allowed to have a lower explained variance rate.

In order to keep about 60% of the explained variance, in the person level 30 components will be kept.

In the household level, 20 components explain over 50% of the data variance.

Because of the scale of importance assumed before, microcell and macrocell levels will be allowed to have their rate of explained variance between 40 and 50%, keeping 10 components for each one of them.

Applying PCA

To better understand how the original features compose these components created through PCA, a function will be defined to return the most important features for each component:
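A sketch of such a function, assuming a fitted scikit-learn PCA object (the data here is synthetic and the feature names are placeholders):

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

def top_features(pca, feature_names, component, n=3):
    """Map one PCA component back to the original features and return its
    n most negative and n most positive weights."""
    weights = pd.Series(pca.components_[component],
                        index=feature_names).sort_values()
    return pd.concat([weights.head(n), weights.tail(n)])

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 6))
pca = PCA(n_components=3).fit(X)
names = ['f0', 'f1', 'f2', 'f3', 'f4', 'f5']
print(top_features(pca, names, component=0))
```

The sign and magnitude of each weight are what the component interpretations below rely on.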

Unfortunately, there was no explanation provided for the feature CJT_TYP_6 (it's only known to be related to customer journey typology). However, as an example, Component 0, related to the person information level, will be interpreted without this specific feature.

ALTERSKATEGORIE_GROB relates to age classification through prename analysis, where higher values represent higher ages, indicating that this component represents older people. This aspect is reinforced by the highest negative weight, related to the YOUTH_DECADE feature.

YOUTH_DECADE indicates the decade in which the person lived his/her youth. In other words, the lower the decade, the older the person is. Essentially, these two variables represent the same information, but their values point in opposite directions, confirming that this component represents older people.

It can also be seen that SEMIO_ERL and SEMIO_LUST are important features that positively represent this component. SEMIO_ERL describes whether the person is event-oriented, while SEMIO_LUST indicates whether the person is sensual-minded. Higher values indicate lower affinity with that specific characteristic (1 - highest affinity, 7 - lowest affinity). This tells that the component represents older people who are neither event-oriented nor sensual-minded.

On the other hand, SEMIO_TRADV and SEMIO_REL represent important aspects in the opposite direction, indicating that the component represents people who are traditional-minded and religious. At the same time, they have a rational mind (SEMIO_RAT).

RETOURTYP_BK_S_5.0 indicates that these people's return type is classified as determined minimal-returner, and FINANZ_ANLEGER indicates a high correlation with the investor financial typology, perhaps hinting that these people may also be associated with higher incomes.

1.5.5 Defining the Number of Clusters

Before classifying the data in different clusters, it's necessary to find the optimal number of clusters. For that, the Elbow Method will be used:
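A sketch of the elbow computation, using synthetic blob data in place of the PCA-transformed population data:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Fit KMeans for a range of k and track the inertia (within-cluster sum
# of squares); the "elbow" in the curve suggests a reasonable k.
X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

inertias = {}
for k in range(2, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = km.inertia_

for k, v in inertias.items():
    print(k, round(v, 1))
```

Plotting `inertias` against `k` gives the elbow curve referred to below.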

Unfortunately, there's no clear elbow indicating the best number of clusters to choose. However, it's possible to see that the most pronounced change in slope happens between 15 and 20 clusters.

With that, the clustering process will be built considering 18 clusters.

1.5.6 Applying Transformations on Customer Data

Before proceeding to the clustering process, data transformation will be applied to the customers data:

1.5.7 Clustering

1.5.8 Evaluating Clusters

Once each observation has been assigned to its corresponding cluster, the job now is to check which clusters are proportionally more representative among clients than in the general population, and likewise which clusters occur more frequently in the general population than in the customers' group.

With that, it will be possible to understand the different combinations of features that result in a person being more likely to become a client or the other way around.

To better understand what these clusters represent, the cluster centers will provide the most important components related to each specific cluster, and then it will be possible to go back to those components to understand the features that best represent them.

That way, it will be possible to develop a cluster overview.
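The over/underrepresentation percentages reported below can be computed with a helper like this; the label arrays here are toy data standing in for the real cluster assignments:

```python
import numpy as np
import pandas as pd

def cluster_representation(pop_labels, cust_labels):
    """Percent over/underrepresentation of each cluster among customers
    relative to the general population (positive = overrepresented)."""
    pop = pd.Series(pop_labels).value_counts(normalize=True)
    cust = pd.Series(cust_labels).value_counts(normalize=True)
    rep = (cust / pop - 1.0) * 100.0
    return rep.sort_values(ascending=False)

pop_labels = [0, 0, 0, 1, 1, 2, 2, 2, 2, 2]
cust_labels = [0, 1, 1, 1, 1, 2]
print(cluster_representation(pop_labels, cust_labels).round(1))
```

A value of +100 means the cluster is twice as frequent among customers as in the general population; -50 means half as frequent.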

1.5.8.1 Overrepresented Clusters

Cluster 2 - GREEN DREAMERS

CLUSTER 2 is the one with the highest overrepresentation, being proportionally over 205% more representative in customers than in the general population.

Analyzing the most important components, that's how this cluster could be described:

Cluster 9 - HIGH-SOCIETY TRADITIONAL ELDERS

CLUSTER 9 is also about 205% more representative in customers than in the general population.

Analyzing the most important components, that's how this cluster could be described:

Cluster 4 - EMPTY NEST, FULL WALLET

CLUSTER 4 is over 100% more representative in customers than in the general population.

Analyzing the most important components, that's how this cluster could be described:

1.5.8.2 Underrepresented Clusters

Cluster 0 - LESS FORTUNATE BEGINNERS

CLUSTER 0 is 95% less representative in customers than in the general population.

Analyzing the most important components, that's how this cluster could be described:

Cluster 16 - MULTI-GENERATION MONEY SAVERS

CLUSTER 16 is also about 95% less representative in customers than in the general population.

Analyzing the most important components, that's how this cluster could be described:

Cluster 14 - SECOND-HAND CAR CELLS

CLUSTER 14 is about 83% less representative in customers than in the general population.

Analyzing the most important components, that's how this cluster could be described:

Part 2: Supervised Learning Model

Now that you've found which parts of the population are more likely to be customers of the mail-order company, it's time to build a prediction model. Each of the rows in the "MAILOUT" data files represents an individual that was targeted for a mailout campaign. Ideally, we should be able to use the demographic information from each individual to decide whether or not it will be worth it to include that person in the campaign.

The "MAILOUT" data has been split into two approximately equal parts, each with almost 43 000 data rows. In this part, you can verify your model with the "TRAIN" partition, which includes a column, "RESPONSE", that states whether or not a person became a customer of the company following the campaign. In the next part, you'll need to create predictions on the "TEST" partition, where the "RESPONSE" column has been withheld.

2.1 Data Transformation

For the supervised learning task, the strategy will be less conservative, eliminating as few features as possible. Because of that, the process of selecting columns will be simplified, increasing the threshold percentage for NaN values.

Next, different pipelines will be built in order to treat NaN values differently, according to column dtype.
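A sketch of such dtype-dependent treatment, assuming scikit-learn's ColumnTransformer: numeric columns get median imputation, categorical columns get most-frequent imputation followed by one-hot encoding. The column lists are illustrative:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

numeric_cols = ['HH_EINKOMMEN_SCORE']
categorical_cols = ['CAMEO_DEU_2015']

preprocess = ColumnTransformer([
    ('num', SimpleImputer(strategy='median'), numeric_cols),
    ('cat', Pipeline([
        ('impute', SimpleImputer(strategy='most_frequent')),
        ('onehot', OneHotEncoder(handle_unknown='ignore')),
    ]), categorical_cols),
])

demo = pd.DataFrame({'HH_EINKOMMEN_SCORE': [1.0, np.nan, 3.0],
                     'CAMEO_DEU_2015': ['1A', '7A', np.nan]})
out = preprocess.fit_transform(demo)
print(out.shape)  # 3 rows; 1 numeric column + 2 one-hot columns
```

Keeping the imputation inside the pipeline avoids leaking test-set statistics into training.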

2.2 Analyzing Learning Curves

To better understand which algorithms would be a better choice, the learning curves related to a few algorithms are going to be drawn:
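A sketch of how such learning curves can be produced with scikit-learn, using synthetic unbalanced data as a stand-in for the MAILOUT training set and GradientBoostingClassifier as one of the compared models:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import learning_curve

# Synthetic, highly unbalanced stand-in data.
X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)

sizes, train_scores, val_scores = learning_curve(
    GradientBoostingClassifier(random_state=0), X, y,
    cv=3, scoring='roc_auc',
    train_sizes=np.linspace(0.2, 1.0, 4))

print(sizes)
print(train_scores.mean(axis=1).round(2))
print(val_scores.mean(axis=1).round(2))
```

Plotting the two mean-score arrays against `sizes` gives the convergence picture discussed next.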

Clearly, the learning curves are not converging: the average score on the training set stays high, while the validation scores are poor.

It means that the XGBClassifier is overfitting: the model is not actually learning to generalize, which explains the low scores on the validation set.

On the other hand, the GradientBoostingClassifier represents a better option, since its learning curves are converging and the validation score consistently improves as the algorithm receives more data.

The AdaBoostClassifier shows a similar pattern to the GradientBoostingClassifier. However, its validation score is not as good.

Considering the models that didn't overfit, GradientBoostingClassifier seems a better option:

2.3 Training Classifiers

Now that the learning curves have been observed for different algorithms, and the GradientBoostingClassifier has been chosen as the best option, a few steps will be followed:

Since the data is highly unbalanced, the evaluation metric will be the roc_auc score.
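The evaluation used in each attempt below can be sketched with cross-validated ROC AUC; the data here is synthetic and unbalanced, standing in for the MAILOUT training set:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Synthetic, highly unbalanced stand-in data (about 5% positives).
X, y = make_classification(n_samples=600, weights=[0.95, 0.05],
                           random_state=0)
scores = cross_val_score(GradientBoostingClassifier(random_state=0), X, y,
                         cv=3, scoring='roc_auc')
print(scores.mean().round(3))
```

ROC AUC is threshold-independent and insensitive to the class ratio, which is why it suits this unbalanced problem better than accuracy.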

2.3.1 Training on Unbalanced Data

In this first attempt, the unbalance seen in the classes will not be treated.

2.3.2 Training on Balanced Data

In this second attempt, the SMOTE technique will be included in the machine learning pipeline.

The purpose is to see if the roc_auc score increases, once the unbalance is treated.
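SMOTE itself (from the imbalanced-learn package) synthesizes new minority-class samples by interpolation; as a dependency-free illustration of the rebalancing idea, here is a plain random-oversampling stand-in:

```python
import numpy as np

def random_oversample(X, y, random_state=0):
    """Duplicate minority-class rows until classes are balanced.
    (SMOTE instead interpolates new synthetic minority samples.)"""
    rng = np.random.default_rng(random_state)
    classes, counts = np.unique(y, return_counts=True)
    n_max = counts.max()
    X_parts, y_parts = [], []
    for cls, cnt in zip(classes, counts):
        idx = np.flatnonzero(y == cls)
        extra = rng.choice(idx, size=n_max - cnt, replace=True)
        take = np.concatenate([idx, extra])
        X_parts.append(X[take])
        y_parts.append(y[take])
    return np.vstack(X_parts), np.concatenate(y_parts)

X = np.arange(20).reshape(10, 2)
y = np.array([0] * 9 + [1])
Xb, yb = random_oversample(X, y)
print(np.bincount(yb))  # [9 9]
```

Either way, the resampling must happen inside the cross-validation pipeline, so validation folds stay untouched.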

This strategy to deal with the class unbalance didn't result in a better score. Because of that, in the next attempts, unbalance will not be treated.

2.3.3 Using Information Level, PCA, and TruncatedSVD

In this third attempt, an approach similar to the one used during the cluster analysis will be applied. In this case, data will be treated differently considering not only the columns' dtypes but also the information level related to the columns.

As an example, person information level will be split into:

At the most generic levels of information, like microcell and macrocell, dimensionality reduction will be applied (PCA for numerical features, and TruncatedSVD for the sparse matrix of categorical columns after the one-hot encoding process).

This way, when applied, the dimensionality reduction will produce components that each represent a single level of information.
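A sketch of this per-level preprocessing with a `ColumnTransformer`. The column names and group sizes here are hypothetical stand-ins for the real Arvato feature groups:

```python
# Hedged sketch: dimensionality reduction applied per information level.
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA, TruncatedSVD
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "micro_num1": rng.normal(size=100),   # hypothetical microcell columns
    "micro_num2": rng.normal(size=100),
    "micro_num3": rng.normal(size=100),
    "macro_cat1": rng.choice(list("ABC"), size=100),  # hypothetical macrocell
    "macro_cat2": rng.choice(list("XY"), size=100),
})

preprocess = ColumnTransformer([
    # PCA on the numerical microcell features...
    ("micro_pca", PCA(n_components=2),
     ["micro_num1", "micro_num2", "micro_num3"]),
    # ...and TruncatedSVD on the one-hot-encoded macrocell categoricals.
    ("macro_svd", Pipeline([
        ("ohe", OneHotEncoder()),
        ("svd", TruncatedSVD(n_components=2)),
    ]), ["macro_cat1", "macro_cat2"]),
])

components = preprocess.fit_transform(df)
print(components.shape)  # 2 PCA + 2 SVD components per row
```

Each output component is built from a single information level, which is the point of this attempt.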

This approach resulted in a better score than the second one, but still not as good as the first strategy.

2.3.4 Using PCA Transformation

The fourth approach is a variation of the first one, but this time dimensionality reduction will be applied to the data. Unlike the third approach, the information level will not be considered.

This way, the PCA algorithm will be applied within the machine learning pipeline, and its components will represent the data as a whole.
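A minimal sketch of this attempt on synthetic stand-in data, with PCA applied to all features inside the same pipeline as the classifier:

```python
# Sketch: PCA over the whole feature set, inside the pipeline.
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=30, random_state=42)

pipe = Pipeline([
    ("scale", StandardScaler()),    # PCA is sensitive to feature scale
    ("pca", PCA(n_components=10)),  # components span the whole data
    ("clf", GradientBoostingClassifier(random_state=42)),
])
scores = cross_val_score(pipe, X, y, cv=3, scoring="roc_auc")
print(scores.mean())
```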

Considering the roc_auc metric, this strategy is the worst so far.

2.3.5 XGBoost Classifier and Bayesian Optimization

This time, not only the algorithm but also the parameter-tuning approach will be changed.

Instead of the Gradient Boosting Classifier, the XGBoost Classifier will be trained on the data, and the parameter tuning will be performed by the BayesSearchCV algorithm. Instead of simply testing all parameter combinations, this algorithm tests different parameters sampled from a range of possible values.

As it observes improvements, the algorithm 'explores' more deeply the regions of the parameter space that have produced better performance.

This strategy resulted in a score similar to the first one.

Since the first one produced the best model so far, this new Bayesian Optimization approach will be applied to a few more algorithms.

2.3.6 LightGBM and Bayesian Optimization

The LightGBM Classifier is similar to the XGBoost Classifier, but is considered faster. Besides that, it grows its trees leaf-wise, rather than depth-wise (level-wise) like most similar algorithms.

The LightGBM Classifier resulted in a slightly lower score than the previous model, but their performances are comparable.

Part 3: Kaggle Competition

Now that you've created a model to predict which individuals are most likely to respond to a mailout campaign, it's time to test that model in competition through Kaggle. If you click on the link here, you'll be taken to the competition page where, if you have a Kaggle account, you can enter. If you're one of the top performers, you may have the chance to be contacted by a hiring manager from Arvato or Bertelsmann for an interview!

Your entry to the competition should be a CSV file with two columns. The first column should be a copy of "LNR", which acts as an ID number for each individual in the "TEST" partition. The second column, "RESPONSE", should be some measure of how likely each individual became a customer – this might not be a straightforward probability. As you should have found in Part 2, there is a large output class imbalance, where most individuals did not respond to the mailout. Thus, predicting individual classes and using accuracy does not seem to be an appropriate performance evaluation method. Instead, the competition will be using AUC to evaluate performance. The exact values of the "RESPONSE" column do not matter as much: only that the higher values try to capture as many of the actual customers as possible, early in the ROC curve sweep.
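The submission file described above can be sketched as follows. Here `model`, the feature matrix, and the `LNR` values are placeholders standing in for the trained classifier and the real "TEST" partition:

```python
# Sketch: building the two-column Kaggle submission (LNR, RESPONSE).
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Stand-in data: in the project, the real TEST features and the
# "LNR" ID column come from the MAILOUT test file.
X, y = make_classification(n_samples=200, n_features=10, random_state=42)
model = GradientBoostingClassifier(random_state=42).fit(X, y)
lnr = range(1000, 1200)  # hypothetical ID numbers

submission = pd.DataFrame({
    "LNR": lnr,
    # Probability of the positive class rather than a hard 0/1 label,
    # since the competition evaluates with AUC.
    "RESPONSE": model.predict_proba(X)[:, 1],
})
submission.to_csv("submission.csv", index=False)
print(submission.head())
```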

3.1 Attempt 1: Training on Unbalanced Data

Considering the Kaggle leaderboard, this first approach could be considered a reasonable model, positioned among the top 150.

3.2 Attempt 2: Training on Balanced Data

3.3 Attempt 3: Information Level and PCA Transformation

3.4 Attempt 4: PCA Transformation

3.5 Attempt 5: XGBoost Classifier and Bayesian Optimization

Although this model's roc_auc score is comparable to the first model's, it represented a great advance when predicting on the test data.

This score positions the model among the top 40 out of 349 data scientists on the Kaggle leaderboard.


3.6 LightGBM and Bayesian Optimization

4. Conclusion

This project represented a great challenge, especially because of the amount of data and different features to consider.

Besides that, it is real data, meaning that it resembles, in many aspects, the challenges of a typical Data Science project at any company.

More than evaluating different algorithms' learning curves, I have to say that my own learning curve increased exponentially during this project, and it is far from overfitting, although its convergence is still in progress. That is how it has to be: as data scientists, we need to embrace ongoing learning as a lifestyle.

There is plenty of room for improvement in the project, and a few possible approaches are listed below:

About the project itself, there are a few things that I would like to highlight: