Authors: Mauro Venticinque | Angelo Schillaci | Daniele Tambone
GitHub project: Bank-Marketing
Date: 2025-07-02
This study analyzes data from a Portuguese bank’s direct marketing campaigns to identify the key factors influencing customer subscription to term deposits and to develop a predictive model for optimizing future outreach efforts. Using exploratory data analysis and supervised learning techniques, including logistic regression (LASSO), Random Forest, and Boosting, we examine how demographic, behavioral, and macroeconomic variables affect subscription behavior.
Our findings highlight the importance of previous campaign success, age, and macroeconomic indicators such as Euribor rate, CPI, and employment variation. Among the tested models, Random Forest with a tuned threshold offers the best balance between sensitivity and precision, making it the most suitable for minimizing missed opportunities while avoiding unnecessary contacts.
These insights provide actionable recommendations for designing more effective, data-driven marketing strategies tailored to customer profiles and economic contexts.
In this project, we analyze data from a Portuguese banking institution’s direct marketing campaigns to identify key factors influencing customer subscription to term deposits.
A deposit account is a bank account maintained by a financial institution in which a customer can deposit and withdraw money. Deposit accounts include savings accounts, current accounts, and several other account types, such as the term deposits analyzed in this study.
The dataset includes client demographics, previous campaign interactions, and economic indicators. Our goal is to develop insights that will enhance the effectiveness of future marketing strategies. By applying supervised learning techniques, we aim to predict customer responses and optimize outreach efforts for better engagement and conversion rates.
Specifically, the analysis aims to answer three main questions:

- Which classification model best predicts whether a client will subscribe, balancing sensitivity and precision?
- Which variables most influence the decision to subscribe to a term deposit?
- How can future marketing campaigns be improved on the basis of these findings?
The report begins with an Exploratory Data Analysis (EDA), examining the variables and their relationship with the target attribute (subscribed) to identify the most influential factors.
- age (Integer): age of the customer
- job (Categorical): occupation
- marital (Categorical): marital status
- education (Categorical): education level
- default (Binary): has credit in default?
- housing (Binary): has a housing loan?
- loan (Binary): has a personal loan?
- contact (Categorical): contact communication type
- month (Categorical): last contact month of the year
- day_of_week (Integer): last contact day of the week
- duration (Integer): last contact duration, in seconds. Important note: this attribute highly affects the output target (e.g., if duration = 0 then y = 'no'). Yet, the duration is not known before a call is performed, and after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to build a realistic predictive model.
- campaign (Integer): number of contacts performed during this campaign and for this client (includes the last contact)
- pdays (Integer): number of days that passed after the client was last contacted during a previous campaign (999 means the client was not previously contacted)
- previous (Integer): number of contacts performed before this campaign and for this client
- poutcome (Categorical): outcome of the previous marketing campaign ('failure', 'nonexistent', 'success')
- subscribed (Binary): has the client subscribed to a term deposit?

Source: UCI Machine Learning Repository

Note: our dataset does not include the bank balance variable.
| Name | train |
|---|---|
| Number of rows | 32950 |
| Number of columns | 21 |
| Column type frequency: character | 11 |
| Column type frequency: numeric | 10 |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| job | 0 | 1 | 6 | 13 | 0 | 12 | 0 |
| marital | 0 | 1 | 6 | 8 | 0 | 4 | 0 |
| education | 0 | 1 | 7 | 19 | 0 | 8 | 0 |
| default | 0 | 1 | 2 | 7 | 0 | 3 | 0 |
| housing | 0 | 1 | 2 | 7 | 0 | 3 | 0 |
| loan | 0 | 1 | 2 | 7 | 0 | 3 | 0 |
| contact | 0 | 1 | 8 | 9 | 0 | 2 | 0 |
| month | 0 | 1 | 3 | 3 | 0 | 10 | 0 |
| day_of_week | 0 | 1 | 3 | 3 | 0 | 5 | 0 |
| poutcome | 0 | 1 | 7 | 11 | 0 | 3 | 0 |
| subscribed | 0 | 1 | 2 | 3 | 0 | 2 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| age | 0 | 1 | 40.04 | 10.45 | 17.00 | 32.00 | 38.00 | 47.00 | 98.00 | ▅▇▃▁▁ |
| duration | 0 | 1 | 258.66 | 260.83 | 0.00 | 102.00 | 180.00 | 318.00 | 4918.00 | ▇▁▁▁▁ |
| campaign | 0 | 1 | 2.57 | 2.77 | 1.00 | 1.00 | 2.00 | 3.00 | 43.00 | ▇▁▁▁▁ |
| pdays | 0 | 1 | 961.90 | 188.33 | 0.00 | 999.00 | 999.00 | 999.00 | 999.00 | ▁▁▁▁▇ |
| previous | 0 | 1 | 0.17 | 0.49 | 0.00 | 0.00 | 0.00 | 0.00 | 7.00 | ▇▁▁▁▁ |
| emp.var.rate | 0 | 1 | 0.08 | 1.57 | -3.40 | -1.80 | 1.10 | 1.40 | 1.40 | ▁▃▁▁▇ |
| cons.price.idx | 0 | 1 | 93.57 | 0.58 | 92.20 | 93.08 | 93.75 | 93.99 | 94.77 | ▁▆▃▇▂ |
| cons.conf.idx | 0 | 1 | -40.49 | 4.63 | -50.80 | -42.70 | -41.80 | -36.40 | -26.90 | ▅▇▁▇▁ |
| euribor3m | 0 | 1 | 3.62 | 1.74 | 0.63 | 1.34 | 4.86 | 4.96 | 5.04 | ▅▁▁▁▇ |
| nr.employed | 0 | 1 | 5167.01 | 72.31 | 4963.60 | 5099.10 | 5191.00 | 5228.10 | 5228.10 | ▁▁▃▁▇ |
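The summary tables above follow the layout produced by the skimr package in R; below is a minimal sketch of how such a summary can be generated, assuming the training data has been loaded into a data frame named `train` (the object name is an assumption):

```r
# Assumed: the training data is loaded in a data frame called 'train'
library(skimr)

skim(train)              # per-variable summaries like the tables above
table(train$subscribed)  # class balance of the target variable
```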
The dataset includes 21 variables and 32,950 rows, with no
missing values.
Categorical variables like job and
education show good diversity, while
default, loan, and
housing have only 3 unique values.
Among numeric variables, age has a fairly normal
distribution (mean ≈ 40, sd ≈ 10), while
duration and pdays are highly skewed,
with extreme values up to 4918 and 999 respectively.
Some variables (e.g., campaign,
previous) have a low median but long tails, indicating
that most observations are clustered at low values.
Macroeconomic variables such as emp.var.rate,
euribor3m, and nr.employed are more
stable, with tight interquartile ranges, suggesting consistent economic
conditions during data collection.
First, we observe that the dataset is unbalanced, with the majority of clients not having subscribed.
Correlation Matrix
The correlation matrix reveals clear patterns among the numerical variables. Notably, euribor3m, nr.employed, and emp.var.rate are strongly positively correlated with each other, suggesting that these variables capture similar information about the economic environment. This should be taken into account in predictive modeling, as using them together could lead to multicollinearity. In contrast, variables like campaign, pdays, and previous show very weak correlations with most other features, indicating they may contribute more independently to the model.
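For reference, a minimal sketch of how such a correlation matrix can be computed, again assuming a data frame named `train`:

```r
# Correlation matrix of the numeric variables ('train' is an assumed name)
num_vars <- train[sapply(train, is.numeric)]
cor_mat  <- cor(num_vars)
round(cor_mat, 2)

# Optional heat-map style visualization if the corrplot package is installed
# corrplot::corrplot(cor_mat, method = "color", type = "upper")
```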
Scatterplot Matrix by Target
Several variables, such as duration and pdays, show highly skewed distributions, which could influence model performance and may benefit from transformations (e.g., log or binning). While some variables exhibit linear trends (e.g., euribor3m vs nr.employed), many scatterplots show dispersed or nonlinear patterns. This suggests that simple linear models may not fully capture the complexity in the data.
In certain plots, the blue points (subscribed) are concentrated in specific areas, showing the key factors that influenced successful subscriptions.
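A minimal sketch of a scatterplot matrix colored by the target, under the same assumption of a data frame named `train` whose subscribed column is coded "yes"/"no":

```r
# Pairs plot of selected numeric variables, colored by subscription status
# ('train' and the "yes"/"no" coding of subscribed are assumptions)
vars <- c("age", "duration", "campaign", "pdays", "euribor3m", "nr.employed")
cols <- ifelse(train$subscribed == "yes", "blue", "grey70")
pairs(train[, vars], col = cols, pch = 16, cex = 0.4)
```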
Box plot of age
The
boxplot reveals that subscribers generally have a higher average age
than non-subscribers. The rightward shift of the interquartile range
suggests a potential positive relationship between age and the
probability of subscribing.
Box plot of emp.var.rate
Surprisingly, clients who subscribed are mostly associated with negative
values of emp.var.rate, as their interquartile range lies entirely below
zero. On the other hand, non-subscribers show a broader distribution
that includes both negative and positive values, with the majority of
the density concentrated around one. This suggests that
subscription is more likely to occur during periods of economic
downturn, when the employment variation rate is decreasing.
Box plot of euribor3m
Although both groups show a relatively large interquartile range, the
concentration of their distributions reveals an interesting pattern:
clients who subscribed are surprisingly associated with lower values of
euribor3m, while non-subscribers tend to be linked to higher euribor3m
values. This suggests that lower euribor rates may positively
influence the likelihood of subscription.
Distribution of Age
The age distribution is right-skewed, with a peak around 30–40 years old. The proportion of clients who subscribed is higher among those over 60. This may be due to greater financial stability in older age groups.
Distribution of Job
The distribution of occupations is not uniform, with administrative workers ("admin.") forming the largest group; the number of subscribers among them is among the highest of all occupations, probably because they tend to have higher incomes. Students and retired clients, however, show the highest subscription rates, which is consistent with the previous plot showing that older clients, and those with higher education levels, are more likely to subscribe.
Distribution of Education
Regarding education level, the distribution is not uniform, with the majority of clients holding a university degree. The number of subscribers among university graduates is among the highest across all education levels. This is probably because clients with a university degree tend to have higher incomes and are therefore more likely to subscribe.
Distribution of Marital Status
The majority of clients are married, followed by
single and divorced individuals. However, when looking at the
subscription rate by marital status, single clients show a higher
proportion of “yes” responses compared to the other groups. This
suggests that single individuals may be more likely to subscribe to the
term deposit than married or divorced clients.
Distribution of Contact
Most clients were contacted via cellular phone rather than landline.
Moreover, the subscription rate is notably higher among those contacted
by cellular, suggesting that this channel is more effective in reaching
potential subscribers.
Distribution of Contacts
Regarding the previous campaign, while most clients were not previously contacted, the success rate is visibly higher among those who were contacted more than once before or whose previous outcome was successful. This suggests that prior engagement is positively associated with subscription, although such clients represent only a small share of the sample.
Distribution of Days of Week
The distribution of the last contact day of the week is roughly uniform, with Thursday being the most frequent contact day. The proportion of clients who subscribed is highest when the last contact occurs in the middle of the week.
Distribution of Months
In contrast, the distribution of the last contact month of the year is not uniform, with the majority of clients contacted in May. The proportion of subscribers is highest when the last contact occurred in March, September, October, or December. This is probably because people are more willing to subscribe when they have more money available, and less so during the summer.
Distribution of Duration
The duration of the last contact is right-skewed, with a peak around 0-100 seconds. The proportion of subscribers is higher among clients who were contacted for a longer duration, probably because clients who stay on the call longer are more interested in subscribing.
The Exploratory Data Analysis reveals several important insights into the factors that influence the likelihood of subscription in this dataset; the key findings are summarized below.
In summary, the analysis suggests that financial conditions, previous campaign interactions, and macroeconomic indicators are strong predictors of subscription behavior. Demographic factors such as age, occupation, and education level also contribute meaningfully to the outcome.
In the next section, we will use these EDA findings to make a preliminary selection of the most influential variables, based on the visual trends observed in the plots.
In this section, we explore different classification models to predict whether a client will subscribe to a term deposit.
Before training the models, we applied a transformation algorithm to convert categorical variables into numerical format. This is a crucial step in the data preprocessing phase, as many machine learning algorithms require numerical input. We used one-hot encoding to make categorical variables compatible with the classification models. This method represents each category as a binary variable, avoiding the introduction of arbitrary numerical orderings among categories. In this way, we ensure a correct statistical interpretation of qualitative variables and improve the effectiveness of the model training process.
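As an illustration, one-hot encoding can be performed with the caret package (one possible tool; the report does not state which implementation was used). This sketch assumes the data frame is named `train` and that subscribed is the target:

```r
# One-hot encode the categorical predictors ('train' is an assumed name)
library(caret)

# dummyVars() expands every factor level into its own 0/1 column when fullRank = FALSE,
# so no arbitrary numerical ordering is imposed on the categories.
train_f <- as.data.frame(lapply(train, function(x) if (is.character(x)) factor(x) else x))
dv <- dummyVars(~ ., data = train_f[, setdiff(names(train_f), "subscribed")], fullRank = FALSE)
X  <- data.frame(predict(dv, newdata = train_f))
y  <- factor(train_f$subscribed)
```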
However, among the classification models considered, LDA (Linear Discriminant Analysis) and QDA (Quadratic Discriminant Analysis) are not suitable in our case due to the nature of the predictor variables. In particular, most of the independent variables are binary and do not satisfy the fundamental assumption of normal distribution within each class, which is required by both methods. Furthermore, both models rely on covariance structures, whose interpretation becomes limited when applied to dichotomous variables.
For these reasons, we decided not to pursue further analysis with LDA and QDA and instead focused on models that are better aligned with the structure of the data, such as Logistic Regression, Random Forest and Boosting.
Based on the Exploratory Data Analysis (EDA), we selected only the most relevant variables.
With a view to training the models, we apply one-hot encoding and obtain the following dataset (a purely hypothetical reconstruction of these engineered features is sketched after the table):
| Variable | Type |
|---|---|
| age | int |
| single | bool |
| cellular | bool |
| low_call | bool |
| previous | int |
| negative_emp | bool |
| low_cpi | bool |
| high_cci | bool |
| low_euribor | bool |
| university | bool |
| p_course | bool |
| job_student | bool |
| job_retired | bool |
| job_admin | bool |
| month_sep | bool |
| month_oct | bool |
| month_dec | bool |
| month_mar | bool |
| p_failure | bool |
| p_success | bool |
| target | bool |
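The exact cut-offs used to build the binary flags are not reported, so the following reconstruction is purely hypothetical: the thresholds (medians, zero, a maximum number of calls) and the category level names are assumptions, not the authors' definitions.

```r
# Hypothetical reconstruction of the engineered features; every threshold is an assumption
feat <- data.frame(
  age          = train$age,
  single       = train$marital == "single",
  cellular     = train$contact == "cellular",
  low_call     = train$campaign <= 2,                                  # assumed cut-off
  previous     = train$previous,
  negative_emp = train$emp.var.rate < 0,
  low_cpi      = train$cons.price.idx < median(train$cons.price.idx),  # assumed cut-off
  high_cci     = train$cons.conf.idx  > median(train$cons.conf.idx),   # assumed cut-off
  low_euribor  = train$euribor3m      < median(train$euribor3m),       # assumed cut-off
  university   = train$education == "university.degree",
  p_course     = train$education == "professional.course",
  job_student  = train$job == "student",
  job_retired  = train$job == "retired",
  job_admin    = train$job == "admin.",
  month_sep    = train$month == "sep",
  month_oct    = train$month == "oct",
  month_dec    = train$month == "dec",
  month_mar    = train$month == "mar",
  p_failure    = train$poutcome == "failure",
  p_success    = train$poutcome == "success",
  target       = train$subscribed == "yes"
)
```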
Logistic regression is a widely used statistical model for binary classification tasks. It is based on the sigmoid (logistic) function, which maps any real-valued input into the interval (0, 1), making the output interpretable as a probability. Specifically, the probability that an observation belongs to class 1 is given by:
\(P(Y=1|X=x)=p(x)=\frac{e^{\beta_0+\beta_1x_1+\dots + \beta_n x_n}}{1+e^{\beta_0+\beta_1x_1+\dots + \beta_n x_n}}\)
To perform classification, a decision threshold is applied: if \(p(x)\) exceeds the threshold (commonly 0.5), the observation is assigned to class 1; otherwise, it is assigned to class 0.
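Purely to illustrate the model form and the thresholding just described, here is a minimal sketch assuming the encoded data frame from the previous section is called `feat` (an assumed name) with a logical target column:

```r
# Fit a logistic regression on the encoded features ('feat' is an assumed name)
fit <- glm(target ~ ., data = feat, family = binomial)

# Predicted probabilities p(x) and classification at the conventional 0.5 threshold
p_hat <- predict(fit, type = "response")
pred  <- as.integer(p_hat > 0.5)
```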
Before training the model on the training set, we applied two variable selection methods to the entire dataset in order to reduce the large number of predictors. The selection criteria were:
Stepwise selection in both directions, which optimizes the AIC at each step.
LASSO, which, as supported by the literature, often outperforms stepwise methods.
We then checked the variance inflation factors (VIF): neither model showed signs of multicollinearity.
Once the relevant variables were selected using both methods, we employed k-fold cross-validation to obtain stable estimates of accuracy, misclassification error, and other useful metrics, which will be used for comparison. We chose 10 folds, as this offers a good balance between computational efficiency (compared to LOOCV) and reliable model evaluation (compared to 5-fold cross-validation).
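A sketch of these selection and validation steps, under the same assumptions (`feat` data frame, logical target); the exact settings used by the authors are not reported:

```r
library(MASS)    # stepAIC for stepwise selection
library(glmnet)  # LASSO
library(car)     # variance inflation factors

# Stepwise selection in both directions, minimizing AIC at each step
full_fit <- glm(target ~ ., data = feat, family = binomial)
step_fit <- stepAIC(full_fit, direction = "both", trace = FALSE)

# LASSO with the penalty chosen by 10-fold cross-validation
x <- model.matrix(target ~ . - 1, data = feat)
y <- as.numeric(feat$target)
lasso_cv <- cv.glmnet(x, y, family = "binomial", alpha = 1, nfolds = 10)
coef(lasso_cv, s = "lambda.min")   # non-zero coefficients = selected variables

# Multicollinearity check on the stepwise model
vif(step_fit)
```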
The variables we got from stepwise and LASSO are the following ones:
| Full_dataset | Stepwise_Model | Lasso_Model |
|---|---|---|
| age | | |
| single | x | |
| cellular | x | x |
| low_call | x | |
| previous | x | |
| negative_emp | x | x |
| low_cpi | x | x |
| high_cci | x | x |
| low_euribor | x | x |
| university | x | x |
| p_course | x | |
| job_student | x | x |
| job_retired | x | x |
| job_admin | x | |
| month_sep | x | x |
| month_oct | x | x |
| month_dec | x | x |
| month_mar | x | x |
| p_failure | x | x |
| p_success | x | x |
| total | 20 | 14 |
As we can see, LASSO was able to shrink the number of variables more than Stepwise, yielding a simpler model. Let us now compare the two models at different thresholds. The thresholds maximize:
Accuracy, which considers only the overall accuracy of the model.
F1, a metric useful for unbalanced datasets like ours, since it considers both precision (how many predicted positives are actual positives) and recall, or Sensitivity (how many actual positives are detected).
We also computed Specificity (how many actual negatives are detected).
| Model_Threshold | Threshold | Accuracy | Precision | Sensitivity | Specificity | F1 |
|---|---|---|---|---|---|---|
| Stepwise_0.5 | 0.5 | 0.8990 | 0.6635 | 0.2093 | 0.9865 | 0.3182 |
| Stepwise_0.2 | 0.2 | 0.8633 | 0.4209 | 0.5673 | 0.9009 | 0.4833 |
| LASSO_0.5 | 0.5 | 0.8990 | 0.6608 | 0.2126 | 0.9861 | 0.3216 |
| LASSO_0.2 | 0.2 | 0.8528 | 0.3946 | 0.5735 | 0.8883 | 0.4676 |
The table shows that both models perform similarly in terms of accuracy and specificity. However, when the threshold is lowered from 0.5 to 0.2, both models achieve higher F1 scores and sensitivity, though at the expense of reduced accuracy and specificity.
Between the two models, the LASSO-based logistic regression is slightly preferable, as it achieves comparable or slightly better F1 scores while benefiting from variable selection and model simplicity.
Given the imbalance in the dataset, we prioritize a threshold that maximizes the F1 score rather than overall accuracy. This is because both false positives (wasting resources contacting uninterested customers, and potentially harming the bank's reputation) and false negatives (missing likely subscribers) are costly in this context. Therefore, a threshold that better balances precision and recall (Sensitivity), reflected by a higher F1 score, is more appropriate for the bank's decision-making.
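A sketch of how the F1-maximizing threshold can be found by a simple grid search, given predicted probabilities `p_hat` and 0/1 true labels `y_true` from cross-validation (both names are assumptions):

```r
# Grid search for the threshold that maximizes F1 ('p_hat' and 'y_true' are assumed to exist)
f1_at <- function(th, p_hat, y_true) {
  pred <- as.integer(p_hat >= th)
  tp <- sum(pred == 1 & y_true == 1)
  fp <- sum(pred == 1 & y_true == 0)
  fn <- sum(pred == 0 & y_true == 1)
  precision <- tp / (tp + fp)
  recall    <- tp / (tp + fn)
  2 * precision * recall / (precision + recall)
}

thresholds <- seq(0.01, 0.99, by = 0.01)
f1_scores  <- sapply(thresholds, f1_at, p_hat = p_hat, y_true = y_true)
thresholds[which.max(f1_scores)]   # threshold giving the highest F1 score
```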
Now we explore decision tree ensemble methods, which combine multiple decision trees to improve predictive performance and robustness. Unlike a single decision tree, which may suffer from high variance or overfitting, ensemble techniques like Random Forest and Boosting aggregate the predictions of many trees to achieve better generalization. In the following sections, we will train and evaluate models using both approaches, starting with Random Forest.
Let us train a Random Forest classifier. We set mtry = \(\sqrt{n}\), where n is the number of variables, to control the number of predictors randomly selected at each split. The model was trained on the full training set with 500 trees and feature importance enabled.
The model returns an out-of-bag (OOB) error rate of 10.23%, a good level of error that indicates fairly robust generalization without overfitting.
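A sketch of this setup with the randomForest package (one common implementation; the report does not name the library), again assuming the encoded data frame `feat`:

```r
library(randomForest)

# Convert the logical flags to 0/1 and the target to a factor for classification
rf_df        <- data.frame(lapply(feat, as.numeric))
rf_df$target <- as.factor(rf_df$target)

p  <- ncol(rf_df) - 1                      # number of predictors
rf <- randomForest(target ~ ., data = rf_df,
                   mtry = floor(sqrt(p)), ntree = 500, importance = TRUE)

rf               # prints the OOB error rate estimate
varImpPlot(rf)   # MeanDecreaseAccuracy and MeanDecreaseGini plots
```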
The following graphs represent the importance of each variable for accuracy and Gini impurity.
Both measures highlight similar variables as important, suggesting
consistent influence of features such as age,
p_success, and high_cci. However, there are
slight differences in ranking due to how each metric evaluates
importance.
Let us train a Boosting classifier, setting the number of trees to 5000 and the maximum depth of each tree to 3.
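A sketch with the gbm package (one possible implementation, not necessarily the authors' choice), under the same assumptions:

```r
library(gbm)

# gbm's bernoulli loss needs a 0/1 numeric response and numeric/factor predictors,
# so the logical flags in 'feat' (an assumed name) are converted to 0/1 first
feat_b <- data.frame(lapply(feat, as.numeric))

boost <- gbm(target ~ ., data = feat_b,
             distribution = "bernoulli",
             n.trees = 5000, interaction.depth = 3)

summary(boost)   # relative influence of each predictor
```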
The following graph represents the importance of each variable.
For the boosting model, the most important variables are similar to
those identified by the random forest. However, in boosting,
previous appears to be more relevant.
As previously discussed, we calculate, for each model, the threshold that maximizes the F1 score, since this metric provides a balanced measure of precision and recall. This approach is particularly suitable in scenarios with class imbalance, where both false positives and false negatives are critical.
In our case, the optimal threshold was found to be 0.03 for Random Forest and 0.11 for Boosting.
Using these values, we obtain the following performance metrics:
| Model | Threshold | Accuracy | Precision | Sensitivity | Specificity | F1_Score |
|---|---|---|---|---|---|---|
| Random Forest 0.5 | 0.50 | 0.900 | 0.664 | 0.232 | 0.985 | 0.344 |
| Random Forest 0.03 | 0.03 | 0.866 | 0.426 | 0.558 | 0.905 | 0.483 |
| Boosting 0.5 | 0.50 | 0.898 | 0.615 | 0.259 | 0.979 | 0.364 |
| Boosting 0.11 | 0.11 | 0.824 | 0.348 | 0.639 | 0.848 | 0.450 |
At the default threshold (0.50), Random Forest is similar to Boosting. Both models show high specificity (i.e., they are good at detecting the negative class), but very low sensitivity — indicating poor performance in identifying positive cases.
When using the threshold that maximizes the F1 score, both models improve their sensitivity at the cost of some specificity. The Random Forest model achieves the highest F1 score (0.483), providing the best balance between precision and recall. Although the Boosting model reaches a higher sensitivity (0.639), it sacrifices more accuracy and specificity, resulting in a lower F1 score overall.
Random Forest with a threshold of 0.03 is the best-performing model overall, as it achieves the highest F1 score while maintaining a reasonable trade-off between all metrics. This suggests it offers the most balanced generalization ability for this classification task.
In our case study, which focuses on term deposit subscriptions, we need to choose a model that maximizes sensitivity without lowering precision too much.
The sensitivity measures how many of the customers who would have subscribed were correctly identified by the model. The bank prefers to reach everyone who might potentially say “yes” even at the cost of contacting some who will eventually say “no” in order not to miss any possible customers.
On the other hand, we want to avoid decreasing precision too much. This metric measures, among all the instances the model predicts as “Yes”, how many actually said “Yes.” Maintaining high precision is important in our context, because contacting people who are not interested may harm the bank’s credibility and lead to wasted resources.
Therefore, we analyze the performance metrics of each previously selected model in order to answer the first question posed in the introduction:
| Model | Threshold | Accuracy | Precision | Sensitivity | Specificity | F1_Score |
|---|---|---|---|---|---|---|
| LASSO Logistic | 0.20 | 0.8528 | 0.3946 | 0.5735 | 0.8883 | 0.4676 |
| Random Forest | 0.03 | 0.8660 | 0.4260 | 0.5580 | 0.9050 | 0.4830 |
Overall, Random Forest with a threshold of 0.03 appears to be the better-performing model. It achieves higher accuracy, precision, and F1 score, and it maintains strong specificity, offering a solid trade-off between detecting potential subscribers and avoiding unnecessary contacts.
To answer the second question, regarding the influence of variables on term deposit subscriptions, we refer to the variable importance plot from the Random Forest model. According to both MeanDecreaseAccuracy and MeanDecreaseGini, the most influential variables in predicting subscription are:
- p_success (past campaign success)
- age (age of the customer)
- high_cci (high consumer confidence index)
- low_euribor (low Euribor rate)
- negative_emp (negative employment variation rate)
- low_cpi (low Consumer Price Index)

These features significantly contribute to the model's predictive power and provide valuable insights for improving future marketing strategies.
These findings are consistent with the broader economic and social context of the data, which covers the period from 2008 to 2013, during and after the global financial crisis. In such an environment, characterized by high uncertainty and low trust in financial markets, customers tended to adopt conservative financial behaviors.
The strong influence of variables like low Euribor, negative employment variation, and low CPI suggests that clients were highly sensitive to macroeconomic conditions. In times of economic distress, the perceived safety and predictability of term deposits became particularly appealing, even when interest rates were low.
Moreover, age and previous campaign success reflect behavioral and trust-related aspects. Older individuals are generally more risk-averse, while a successful previous interaction with the bank increases the likelihood of continued engagement. Similarly, a high Consumer Confidence Index may indicate a more optimistic sentiment among consumers, making them more receptive to financial products offered by a trusted institution.
Taken together, these variables underline the importance of targeted marketing strategies that consider both economic conditions and individual customer profiles. Understanding these drivers allows banks to more effectively design and promote financial products, especially in times of uncertainty.
Finally, to answer the third question, regarding the improvement of marketing campaigns, we can conclude that the financial institution should leverage the identified key predictors to create more personalized and context-aware marketing strategies. For example:

- prioritize clients with a successful outcome in a previous campaign, since prior engagement is strongly associated with subscription;
- tailor offers to the segments with the highest conversion rates, such as older, retired, and student clients;
- time campaigns to favorable macroeconomic conditions (low Euribor rates, negative employment variation, low CPI) and to the months with the highest observed conversion (March, September, October, December);
- prefer cellular contact, which showed a higher subscription rate than landline.
By aligning campaign efforts with both individual customer characteristics and the broader economic environment, the institution can improve customer targeting, increase conversion rates, and strengthen long-term customer relationships.
Social and economic context attributes:

- emp.var.rate (Numeric): employment variation rate - quarterly indicator
- cons.price.idx (Numeric): consumer price index - monthly indicator
- cons.conf.idx (Numeric): consumer confidence index - monthly indicator
- euribor3m (Numeric): Euribor 3-month rate - daily indicator
- nr.employed (Numeric): number of employees - quarterly indicator