Marketing on Steroids: Using Cutting-Edge Algorithms to Predict Future Customers

Maurizio Santamicone
10 min readJan 6, 2019

Introduction

Cutting-edge algorithms typically used in data science competitions can be extended to handle real-life problems. In this project, we used unsupervised learning techniques to describe the relationship between the demographics of the company’s existing customers and the general population of Germany.
This allowed us to describe which parts of the general population are more likely to be part of the mail-order company’s main customer base, and which parts are less so (Customer Segmentation).

We then developed a supervised learning model to predict which individuals (leads) are most likely to convert into customers, using a marketing campaign dataset. Tree-based and boosting algorithms usually perform very well on tabular data and are widely employed across data science competitions. After comparing learning curves from four candidate algorithms using stratified k-fold cross-validation, we chose XGBoost and tuned its parameters following a step-by-step strategy rather than a single wide grid search. This allowed us to tune XGBoost in around 4 hours on a MacBook.

The final optimized model was run against a test dataset and achieved a final ROC AUC score of 0.80467, meaning the model ranks a randomly chosen converter above a randomly chosen non-converter roughly 80% of the time. The best score on the leaderboard is 0.80819.

The data for this project was provided to Udacity by Bertelsmann Arvato Analytics and represents a real-life data science task. It includes demographics for customers of a mail-order sales company in Germany, a dataset of general population demographics, a mailout campaign dataset with responses, and a final test dataset used to make predictions for a Kaggle competition.

Metric

The metric used in the model is ROC AUC, the Area Under the Receiver Operating Characteristic Curve. AUC provides an aggregate measure of performance across all possible classification thresholds. One way of interpreting AUC is as the probability that the model ranks a random positive example higher than a random negative example.
AUC ranges in value from 0 to 1. A model whose predictions are 100% wrong has an AUC of 0.0; one whose predictions are 100% correct has an AUC of 1.0. AUC is scale-invariant, i.e. it measures how well predictions are ranked rather than their absolute values. This property makes it very useful for unbalanced datasets, as we will see later in this project.
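
As a quick illustration of the scale-invariance property (a toy example, not project data), two score vectors on very different scales but with the same ordering receive the same AUC:

import numpy as np
from sklearn.metrics import roc_auc_score

# Toy labels and two score vectors on very different scales
# but with the same ordering of the examples.
y_true = np.array([0, 0, 1, 0, 1, 1])
scores_a = np.array([0.10, 0.20, 0.55, 0.40, 0.70, 0.30])  # pseudo-probabilities
scores_b = scores_a * 1000 - 42                            # arbitrary monotone rescaling

# Both give the same AUC (8 of 9 positive/negative pairs are ranked
# correctly, about 0.889), because only the ranking matters.
print(roc_auc_score(y_true, scores_a))
print(roc_auc_score(y_true, scores_b))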

Data Analysis and Preprocessing

We have been supplied the following datasets:

  • Udacity_AZDIAS_052018.csv: Demographics data for the general population of Germany; 891 211 persons (rows) x 366 features (columns).
  • Udacity_CUSTOMERS_052018.csv: Demographics data for customers of a mail-order company; 191 652 persons (rows) x 369 features (columns).
  • Udacity_MAILOUT_052018_TRAIN.csv: Demographics data for individuals who were targets of a marketing campaign; 42 982 persons (rows) x 367 (columns).
  • Udacity_MAILOUT_052018_TEST.csv: Demographics data for individuals who were targets of a marketing campaign; 42 833 persons (rows) x 366 (columns).

An initial investigation of the mailout train dataset showed that the data is highly unbalanced as only 1.24% of the people responded to the mailout campaign. This immediately affected two major decisions:

  • the metric to use: plain accuracy obviously won’t give us the information we need, so we decided to use ROC AUC instead;
  • the cross-validation strategy: we used StratifiedKFold, a variation of KFold where the folds preserve the percentage of samples of each class (a minimal sketch follows this list).
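
A minimal sketch of this cross-validation setup, with placeholder arrays standing in for the preprocessed mailout features and the RESPONSE column:

import numpy as np
from sklearn.model_selection import StratifiedKFold

# Illustrative placeholders for the preprocessed features and labels.
X = np.random.rand(1000, 20)
y = np.random.binomial(1, 0.0124, size=1000)  # roughly 1.24% positive class

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, val_idx in skf.split(X, y):
    # Each fold preserves approximately the same positive rate,
    # so every validation fold sees a similar share of responders.
    print(y[val_idx].mean())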

We have also been supplied with two explanatory Excel spreadsheets:

  • DIAS Information Levels - Attributes 2017, which includes a top-level list of attributes and descriptions, organized by informational category;
  • DIAS Attributes - Values 2017, a detailed mapping of data values for each feature, in alphabetical order.

Unfortunately, out of 366 features, only 290 were documented. The first step was to build a dataframe of feature descriptions (azdias_info) including attribute name, description, meaning (if available), “unknown” values and type. The distribution of variable types is listed below:

Our info dataframe azdias_info has an entry for each of the 366 features:

  • 322 ordinal features
  • 22 categorical features
  • 9 numeric features
  • 8 binary features
  • 5 mixed-type features

At this point we could proceed with the preprocessing, which consisted of two steps:

  • Cleaning
  • Encoding, Imputing null values and Scaling

The cleaning comprised the following steps:

  • converting all “unknown” values to NaN
  • dropping all columns with more than 30% NaN values, for a total of 10 columns
  • dropping some categorical variables with high cardinality
  • dropping variables that were a potential source of ethical bias, like NATIONALITAET_KZ
  • dropping columns with all distinct values, as they are of no use in machine learning
  • dropping rows with more than 50 nulls
  • re-engineering some variables that spread information across more than one dimension, like LP_LEBENSPHASE_FEIN (age, income and family status), CAMEO_INTL_2015 (wealth and life stage), and PRAEGENDE_JUGENDJAHRE (generation, origin and movement); a sketch of the CAMEO_INTL_2015 split is shown after the figures below.
Distribution of columns with nulls
Columns dropped
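
As an example of the re-engineering step mentioned above, CAMEO_INTL_2015 is a two-digit code whose first digit encodes wealth and second digit encodes life stage in the data dictionary; a minimal sketch of the split (the new column names are ours, and we assume the raw column mixes numeric values with placeholder strings):

import pandas as pd

def split_cameo_intl(df: pd.DataFrame) -> pd.DataFrame:
    """Split the two-digit CAMEO_INTL_2015 code into wealth and life stage."""
    # Coerce to numeric first; non-numeric placeholders become NaN.
    cameo = pd.to_numeric(df["CAMEO_INTL_2015"], errors="coerce")
    df["CAMEO_WEALTH"] = cameo // 10      # tens digit: wealth scale
    df["CAMEO_LIFESTAGE"] = cameo % 10    # units digit: life stage
    return df.drop(columns=["CAMEO_INTL_2015"])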

As for encoding and imputing, we decided not to apply one-hot encoding to the categorical variables, as tree-based algorithms tend to underperform when variables are one-hot encoded. Moreover, the distributions of the numerical variables were highly skewed, so we applied a log transformation to them. We followed this imputing strategy:

  • binary_impute for binary categorical variables;
  • median for categorical, numerical and ordinal variables;
  • constant = 0 for a group of variables (D19_…) where imputing the median would have introduced an unacceptable bias in the model.

We wrapped all these steps in a custom transformer class, Features_Transformer, and applied it to the dataset with a fit_transform method. The resulting dimensionality of the dataset was (737288, 288).
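
The imputing, log-transform and scaling logic can also be expressed with standard scikit-learn building blocks; the sketch below conveys the idea under our assumptions (the project wraps these steps in its own Features_Transformer, a most-frequent imputer stands in for binary_impute, and the column lists are illustrative):

import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

# Illustrative column lists; in the project they are derived from azdias_info.
numeric_cols = ["ANZ_HAUSHALTE_AKTIV", "ANZ_PERSONEN"]
ordinal_cols = ["FINANZ_SPARER", "MOBI_REGIO"]
binary_cols = ["GREEN_AVANTGARDE"]
d19_cols = ["D19_KONSUMTYP_MAX"]

preprocessor = ColumnTransformer(
    transformers=[
        # Skewed numeric features: median impute -> log1p -> scale.
        ("numeric", make_pipeline(SimpleImputer(strategy="median"),
                                  FunctionTransformer(np.log1p),
                                  StandardScaler()), numeric_cols),
        # Ordinal and categorical features: median impute -> scale.
        ("ordinal", make_pipeline(SimpleImputer(strategy="median"),
                                  StandardScaler()), ordinal_cols),
        # Binary categoricals: most frequent value.
        ("binary", SimpleImputer(strategy="most_frequent"), binary_cols),
        # D19_* features: missing is treated as "no activity" -> constant 0.
        ("d19", make_pipeline(SimpleImputer(strategy="constant", fill_value=0),
                              StandardScaler()), d19_cols),
    ],
    remainder="drop",
)

# azdias is the cleaned general-population dataframe from the previous step.
X_clean = preprocessor.fit_transform(azdias)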

Customer Segmentation

We applied dimensionality reduction (Principal Component Analysis) to the scaled data to determine how many latent features were required to explain at least 90% of the variance in the data.

PCA explained variance plot for the scaled dataset

The plot shows that over 80% of the variance is explained by the first 90 components and 90% by the first 140 components; thereafter it increases very slowly. Therefore we retained 140 components.
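
A short sketch of this step, assuming X_clean is the scaled output of the preprocessing pipeline and feature_names lists its columns:

import numpy as np
from sklearn.decomposition import PCA

# X_clean is the scaled output of the preprocessing pipeline above.
pca = PCA(n_components=140)
X_pca = pca.fit_transform(X_clean)

# Cumulative explained variance retained by the 140 components (about 0.9).
print(pca.explained_variance_ratio_.cumsum()[-1])

# Weights of the first component, mapped back to the original columns.
first = pca.components_[0]
top_features = [feature_names[i] for i in np.argsort(np.abs(first))[::-1][:10]]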

We then explored the first three components and the weights associated with the features in each component:

  • First Component: explained 8.93% of the variability in the data. Positively correlated with wealth indicators like LP_STATUS and with features such as the number of households in the building (higher density can be read as lower wealth) or the number of family houses in the area (low density indicates wealthier households).
  • Second Component: explained 5.27% of the variability in the data. Positively correlated with indicators of transaction activity in the last 12–24 months, age and customer journey (traditionalist vs online/cross-channel). Higher values point to younger individuals who prefer online and cross-channel shopping and are less inclined to save.
  • Third Component: explained 5.07% of the variability. Positively correlated with upper-middle-class indicators such as owning a BMW, Mercedes or sports car, and with mature drivers.

The next step was to apply clustering to the general population: we used K-Means and the elbow method to determine the number of clusters to consider:

Elbow method applied to K-Means Clustering

The curve does not decrease monotonically because we used MiniBatchKMeans instead of KMeans for performance reasons. However, the trend shows a rapid descent up to about 15 clusters and a levelling-off afterwards, so we chose to retain 15 clusters.
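
A sketch of the elbow computation under these choices (X_pca comes from the PCA step; the range of k is illustrative):

import matplotlib.pyplot as plt
from sklearn.cluster import MiniBatchKMeans

# Average squared distance of points to their closest centroid for k = 2..30.
# MiniBatchKMeans trades some stability for speed, which is why the curve
# is not perfectly monotonic.
ks = range(2, 31)
avg_distances = []
for k in ks:
    km = MiniBatchKMeans(n_clusters=k, random_state=42, n_init=10)
    km.fit(X_pca)
    avg_distances.append(km.inertia_ / len(X_pca))  # mean squared distance

plt.plot(list(ks), avg_distances, marker="o")
plt.xlabel("number of clusters k")
plt.ylabel("average squared distance to centroid")
plt.show()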

The whole pipeline (data cleaning, transformation, dimensionality reduction and clustering) was then applied to the customers dataset:

Customer Segmentation (K-Means)

The plot shows that clusters 0, 2, 4 and 13 are significantly over-represented in the customer dataset compared to the general population, while clusters 1, 3, 6, 9, 10 and 11 are under-represented. We split the clusters into these two groups for the next plot.

The next step was to apply an inverse transformation to the cluster centroids, select the most important features from the first three PCA components and plot their means in the two cluster groups:

Major features in customers’ clusters

The plot shows that the clusters over-represented in the Customers dataset contain data points with higher values on measures of wealth such as LP_STATUS_FEIN and LP_STATUS_GROB. These individuals tend to be less mobile than the general population (MOBI_REGIO, on an inverted scale) and live in buildings with a smaller number of households. They tend to drive luxury cars, do not belong to the youngest generations (GENERATION) and have a more conservative approach to financial investments (FINANZ_SPARER). Their customer journey is that of a conservative, traditionalist shopper rather than an online or cross-channel shopper.
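
A sketch of how the centroids can be inspected, assuming km and pca are the fitted models from the previous steps and feature_names lists the columns that entered the pipeline; note that the inverse-transformed values are still on the standardized scale, so they read as above/below-average levels:

import pandas as pd

# Map the centroids from PCA space back to the (scaled) feature space.
centroids = pd.DataFrame(pca.inverse_transform(km.cluster_centers_),
                         columns=feature_names)

over_represented = [0, 2, 4, 13]
under_represented = [1, 3, 6, 9, 10, 11]
key_features = ["LP_STATUS_FEIN", "LP_STATUS_GROB", "MOBI_REGIO", "FINANZ_SPARER"]

# Compare the average centroid values of the two cluster groups.
comparison = pd.DataFrame({
    "customer_clusters": centroids.loc[over_represented, key_features].mean(),
    "other_clusters": centroids.loc[under_represented, key_features].mean(),
})
print(comparison)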

Supervised Learning Model

It is now time to build a prediction model using the MAILOUT_TRAIN dataset.

As the dataset has only 42,962 data points, we decided not to apply the drop-rows step of preprocessing, in order to retain most of the data. Also, as mentioned earlier, the dataset is highly imbalanced, with only 1.24% of the people responding to the mailout campaign, so we opted for StratifiedKFold as the cross-validation strategy. We ran four tree-based models with default settings, resulting in the following learning curves:

Learning curves for four tree-based models

The learning curves above show that DecisionTree and RandomForest overfit the training data when used without any parameter tuning, and their performance on the CV dataset is quite poor. The boosting methods do much better, especially XGBoost, which shows there is room for improvement by tuning its parameters. However, both boosting algorithms are quite slow (at least on my MacBook), and each iteration on the full dataset using StratifiedKFold with 5 folds takes around 140 seconds. Therefore we must be careful not to run a GridSearchCV over a wide grid, as it could take a very long time: for example, 5 parameters with 5 possible values each would require 3,125 fits to complete the grid search, i.e. several days of computation at that speed, assuming the process doesn’t crash in the meantime.
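
A sketch of how such a comparison can be set up (X_train and y_train stand for the preprocessed mailout features and labels; the exact pair of boosting models used in the project may differ from the ones assumed here):

from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, learning_curve
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
models = {
    "DecisionTree": DecisionTreeClassifier(random_state=42),
    "RandomForest": RandomForestClassifier(random_state=42),
    "GradientBoosting": GradientBoostingClassifier(random_state=42),
    "XGBoost": XGBClassifier(eval_metric="auc", random_state=42),
}

for name, model in models.items():
    sizes, train_scores, val_scores = learning_curve(
        model, X_train, y_train, cv=cv, scoring="roc_auc",
        train_sizes=[0.2, 0.4, 0.6, 0.8, 1.0], n_jobs=-1)
    # Mean train vs validation ROC AUC at each training size.
    print(name, train_scores.mean(axis=1), val_scores.mean(axis=1))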

We therefore decided to follow an alternative tuning strategy, where we tune the parameters in 4 steps:

  1. We choose an initial, relatively high learning rate. Generally a learning rate of 0.1 works, but anything between 0.05 and 0.3 can work for different problems.
  2. We then determine the optimal number of trees for this learning rate and tune the tree-specific parameters (max_depth, min_child_weight, gamma, subsample, colsample_bytree).
  3. We then tune regularization parameters (lambda, alpha) for XGBoost which can help reduce model complexity and enhance performance.
  4. Finally, we lower the learning rate and decide on the final parameters.

This strategy allowed us to increase the CV score from 0.7632 to 0.7674, with 323 estimators.
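
A sketch of how the step-wise search can be organized; the grids, values and variable names below are illustrative, not the ones used in the project:

from sklearn.model_selection import GridSearchCV, StratifiedKFold
from xgboost import XGBClassifier

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Step 1: fix a relatively high learning rate and a reasonable tree count.
xgb = XGBClassifier(learning_rate=0.1, n_estimators=100,
                    eval_metric="auc", random_state=42)

# Step 2: tune tree-specific parameters, a few at a time to keep each grid small.
step2 = GridSearchCV(xgb, {"max_depth": [3, 5, 7],
                           "min_child_weight": [1, 3, 5]},
                     scoring="roc_auc", cv=cv, n_jobs=-1)
step2.fit(X_train, y_train)

# Step 3: tune regularization with the best tree parameters kept fixed.
step3 = GridSearchCV(step2.best_estimator_,
                     {"reg_alpha": [0, 0.01, 0.1, 1],
                      "reg_lambda": [0.1, 1, 10]},
                     scoring="roc_auc", cv=cv, n_jobs=-1)
step3.fit(X_train, y_train)

# Step 4: lower the learning rate and raise the number of trees accordingly.
final_model = step3.best_estimator_.set_params(learning_rate=0.01,
                                               n_estimators=1000)
final_model.fit(X_train, y_train)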

Feature importance

The analysis of the attributes of the best classifier allows us to rank the predictors according to their importance:

  • The most important feature is D_SOZIALES, by a large margin over the others, as it accounts for almost 30% of the weight. Unfortunately we don’t have a description for this variable, although the name suggests it might be related to some form of social status.
  • The second is D19_KONSUMTYP_MAX, which accounts for over 6%. The variable indicates consumption type and is related to the transactional activity of the customer.
  • FINANZ_UNAUFFAELLIGER is an indicator of conservative financial behavior, in line with our cluster analysis.
  • LP_LEBENSPHASE_GROB is an indicator of financial status, so its presence here is expected.
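
The ranking above can be read directly from the fitted classifier; a minimal sketch, with xgb4 as the tuned model and feature_names as the training columns:

import pandas as pd

importances = pd.Series(xgb4.feature_importances_, index=feature_names)
print(importances.sort_values(ascending=False).head(10))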

Kaggle Competition

Finally, the fun part! We used our best model on the Udacity_MAILOUT_052018_TEST dataset:

y_predict = xgb4.predict_proba(X_test)

Note that we use predict_proba instead of predict, as the competition objective is to identify which individuals are most likely to respond to the campaign and become customers of the mail-order company. Since the scoring is based on AUC, the values need not represent actual probabilities; they just need to provide an ordinal ranking in which an individual with a higher RESPONSE value is treated as more likely to become a customer than one with a lower value.
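
A sketch of how the submission can be assembled, assuming mailout_test is the raw test dataframe and its LNR column is the individual identifier expected by the Kaggle competition:

import pandas as pd

# Keep only the probability of the positive class (column 1) as the score.
submission = pd.DataFrame({
    "LNR": mailout_test["LNR"],
    "RESPONSE": xgb4.predict_proba(X_test)[:, 1],
})
submission.to_csv("kaggle_submission.csv", index=False)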

After a few rounds of tuning we got a score of 0.80467, which means the model ranks a randomly chosen responder above a randomly chosen non-responder about 80% of the time. Not bad!

Conclusions

This project shows that cutting-edge algorithms can be applied to real-world problems with great results. The most challenging part of the project, as is often the case in data science, was the data wrangling and the related decisions on whether to drop or keep data, along with the strategies for imputing null values without introducing bias into the model. Attention was also given to the risk of ethical bias, which is the subject of an ongoing debate in machine learning and AI.

The Customer Segmentation report shows that customers of the mail-order company score high on indicators of wealth, tend to be less mobile than the general population, and live in buildings or neighborhoods with a smaller number of households. They tend to drive luxury cars, do not belong to the youngest generations, and have a more conservative approach to financial investments. Their customer journey is that of a conservative, traditionalist shopper rather than an online or cross-channel shopper.

XGBoost was the best supervised learning model for predicting which individuals are most likely to convert into customers from the marketing campaign dataset, with a final CV score of 0.7674 using 323 estimators. The model was then run against the test dataset in the Kaggle competition, achieving a final score of 0.80467.

Recommendations for Improvements

The model we proposed can be improved further:

  • if we had information on the undocumented features, we could process some of them differently instead of using the default imputers;
  • we could re-engineer more features that seem to carry important information but do not appear in the list of most important predictors;
  • we could choose different thresholds for dropping rows and columns, as retaining more data might improve prediction accuracy.

I want to thank Udacity and Bertelsmann Arvato Analytics for giving me the opportunity of working on real-life data, the experience has been invaluable. The source code for this project is hosted on Github.
