Predict client subscription using Bank Marketing Dataset using SVM

Rajitha Gunathilake
4 min read · Jun 20, 2021

The Bank Marketing Data Set consists of data about the direct marketing campaigns (phone calls) of a Portuguese banking institution. The classification goal is to predict whether the client will subscribe to a term deposit (variable y). The dataset is available in the UCI Machine Learning Repository.

The data consists of 17 fields (‘age’, ‘job’, ‘marital’, ‘education’, ‘default’, ‘balance’, ‘housing’, ‘loan’, ‘contact’, ‘day’, ‘month’, ‘duration’, ‘campaign’, ‘pdays’, ‘previous’, ‘poutcome’, ‘y’).

First, take a look at the data using the pandas head() method.
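The original code for this step is not shown, so the following is only a minimal sketch of what loading and previewing the data could look like. It assumes the semicolon-separated layout of the UCI CSV file. The three rows are illustrative stand-ins rather than the real file, and the name `sample_csv` is hypothetical.

```python
import io

import pandas as pd

# Illustrative three-row excerpt (assumption: the real file is semicolon-separated).
sample_csv = io.StringIO(
    "age;job;marital;education;default;balance;housing;loan;contact;day;month;"
    "duration;campaign;pdays;previous;poutcome;y\n"
    "58;management;married;tertiary;no;2143;yes;no;unknown;5;may;261;1;-1;0;unknown;no\n"
    "44;technician;single;secondary;no;29;yes;no;unknown;5;may;151;1;-1;0;unknown;no\n"
    "33;entrepreneur;married;secondary;no;2;yes;yes;cellular;6;may;76;1;-1;0;unknown;yes\n"
)

# With the real dataset, pass the path of the downloaded CSV instead of sample_csv.
df = pd.read_csv(sample_csv, sep=";")

# head() shows the first rows (five by default) for a quick sanity check.
print(df.head())
print(df.shape)
```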

Visualize the data to get a good idea of what it contains. The dataset consists of both numerical and categorical data. The target is a binary value indicating whether the client subscribed or not, so this is a binary classification problem.

Visualize the numerical data using box plots.

Check for null values. Here we can see that there are no null values in the dataset.
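A null check of this kind is usually a one-liner in pandas. The sketch below uses a small hypothetical frame in place of the real dataset to show the pattern.

```python
import pandas as pd

# Hypothetical stand-in for a few columns of the real dataset.
df = pd.DataFrame({
    "age": [58, 44, 33],
    "job": ["management", "technician", "entrepreneur"],
    "balance": [2143, 29, 2],
})

# Count missing values per column; all zeros means nothing is missing.
null_counts = df.isnull().sum()
print(null_counts)

total_missing = int(null_counts.sum())
print("total missing values:", total_missing)
```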

Next, we will encode the categorical data.
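The article does not say which encoder was used. One common option is scikit-learn's `LabelEncoder`, which maps each category to an integer. The sketch below assumes that choice and uses a tiny hypothetical frame; `categorical_cols` is an illustrative name.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical stand-in for part of the dataset.
df = pd.DataFrame({
    "age": [58, 44, 33],
    "marital": ["married", "single", "married"],
    "housing": ["yes", "yes", "no"],
})

categorical_cols = ["marital", "housing"]

encoded = df.copy()
for col in categorical_cols:
    # Each column gets its own encoder; classes are numbered in sorted order.
    encoded[col] = LabelEncoder().fit_transform(encoded[col])

print(encoded)
```

Integer codes imply an ordering that nominal categories such as job do not really have, so one-hot encoding (for example `pd.get_dummies`) is a possible alternative.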

If we plot the correlation heatmap for this data, we can see that some of the attributes are correlated with each other. To train the model we will use the ‘age’, ‘job’, ‘marital’, ‘education’, ‘housing’ and ‘loan’ attributes.

When we look at the age attribute, we can see that the data is right-skewed, so we will apply a logarithmic transform.

After applying the log transform, the skew in the age data is reduced.
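As a rough illustration of the effect, the sketch below applies `np.log` to a small hypothetical set of right-skewed ages and compares the skewness before and after. The values are made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed ages: most values are low, a few are very high.
ages = pd.Series([22, 25, 30, 31, 35, 38, 42, 60, 85, 95], name="age")

# The natural log compresses large values more than small ones.
log_age = np.log(ages)

print("skew before:", ages.skew())
print("skew after: ", log_age.skew())
```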

Using Principal Component Analysis (PCA), we can reduce the number of features.

Here, PCA is used to reduce the data to two features.
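A minimal sketch of this reduction with scikit-learn's `PCA` follows. The random matrix is only a hypothetical stand-in for the six selected and encoded features.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical stand-in: 100 rows with six numeric features.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))

# Project the six features onto the two directions of largest variance.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)
# Share of the total variance that each retained component explains.
print(pca.explained_variance_ratio_)
```

PCA is sensitive to feature scale, so standardizing the inputs first (for example with `StandardScaler`) is common. The article does not say whether that was done here.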

When we look at the target of this dataset, we can see that it has a class imbalance. This kind of imbalance can lead the model to predict only the majority class.

So we use the Synthetic Minority Oversampling Technique (SMOTE) to oversample the minority class and train the model better.

After applying SMOTE, we can see that it generated 15388 new samples to reduce the class imbalance. To keep the training cost manageable, only 20000 samples are used to train the SVM model.
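To make the idea behind SMOTE concrete, here is a simplified hand-rolled sketch in NumPy. It is not the article's code and not the imbalanced-learn implementation: in practice the `SMOTE` class from the imbalanced-learn package, with its `fit_resample` method, is the usual choice. The function name `smote_sketch` and the data are hypothetical.

```python
import numpy as np


def smote_sketch(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances between all minority samples.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]  # k nearest minority neighbours
    # Pick a random base sample and one of its neighbours for each new point.
    base = rng.integers(0, len(X_min), size=n_new)
    picked = neighbours[base, rng.integers(0, k, size=n_new)]
    # Place the new point at a random position on the segment between the two.
    gap = rng.random((n_new, 1))
    return X_min[base] + gap * (X_min[picked] - X_min[base])


# Hypothetical imbalanced data: 80 majority samples and 20 minority samples.
rng = np.random.default_rng(1)
X_majority = rng.normal(0, 1, size=(80, 2))
X_minority = rng.normal(3, 1, size=(20, 2))

X_new = smote_sketch(X_minority, n_new=60)
X_balanced = np.vstack([X_majority, X_minority, X_new])
y_balanced = np.array([0] * 80 + [1] * 80)

print(np.bincount(y_balanced))
```

Oversampling should be applied to the training split only, so that the test set keeps the original class distribution.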

After training the model and predicting on the test dataset, we can see that both target classes are being predicted, indicating that the model has learned both classes.

After trying different C values and kernels, the best results were obtained using 1000 as the C value and ‘rbf’ as the kernel.
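The sketch below shows one way such a comparison could be run with scikit-learn's `SVC`. The data is a hypothetical two-feature stand-in, and every candidate setting other than C=1000 with the ‘rbf’ kernel is an assumption made for illustration. Scores on this toy data say nothing about the real dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Hypothetical two-feature data with two balanced classes.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(100, 2)), rng.normal(3, 1, size=(100, 2))])
y = np.array([0] * 100 + [1] * 100)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Candidate settings; only (1000, "rbf") comes from the article.
candidates = [(1, "linear"), (1, "rbf"), (1000, "rbf")]
scores = {}
for C, kernel in candidates:
    candidate_model = SVC(C=C, kernel=kernel).fit(X_train, y_train)
    scores[(C, kernel)] = candidate_model.score(X_test, y_test)
print(scores)

# Final model with the values reported in the article.
model = SVC(C=1000, kernel="rbf").fit(X_train, y_train)
y_pred = model.predict(X_test)

print("predicted classes:", sorted(set(y_pred.tolist())))
# The fitted model exposes its support vectors, which can be plotted.
print("support vectors:", model.support_vectors_.shape)
```

For a larger search, `GridSearchCV` from `sklearn.model_selection` can run the same comparison with cross-validation.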

Visualize the confusion matrix.

In the classification report we can see useful metrics about the model.
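Both the confusion matrix and the classification report are available in `sklearn.metrics`. The sketch below uses short hypothetical label lists rather than the article's actual predictions.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Hypothetical true and predicted labels (0 = "no", 1 = "yes").
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 1]
y_pred = [0, 0, 1, 0, 1, 0, 1, 0, 1, 1]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)
print(cm)

# Precision, recall, F1-score and support for each class.
print(classification_report(y_true, y_pred, target_names=["no", "yes"]))
```

For an imbalanced target, per-class recall and precision are usually more informative than overall accuracy.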

Finally, visualize the SVM support vectors.
