Predict client subscription using Bank Marketing Dataset
The Bank Marketing Data Set contains data about the direct marketing campaigns (phone calls) of a Portuguese banking institution. The classification goal is to predict whether a client will subscribe to a term deposit (variable y). The dataset is available in the UCI Machine Learning Repository.
The data consists of 17 fields:
('age', 'job', 'marital', 'education', 'default', 'balance', 'housing',
'loan', 'contact', 'day', 'month', 'duration', 'campaign', 'pdays',
'previous', 'poutcome', 'y')
We first need to visualize the data to get a good overall picture of it.
We remove the duration attribute, because the call duration is only known after the call has ended; the dataset's documentation includes it only for benchmarking purposes and recommends discarding it for a realistic predictive model.
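Loading the data and dropping the column can be sketched as follows. The file name `bank-full.csv` and the `;` separator follow the UCI distribution; a tiny in-memory frame is used here so the snippet is self-contained:

```python
import pandas as pd

# In practice: df = pd.read_csv("bank-full.csv", sep=";")
# A tiny stand-in frame with the same 17 columns:
columns = ['age', 'job', 'marital', 'education', 'default', 'balance',
           'housing', 'loan', 'contact', 'day', 'month', 'duration',
           'campaign', 'pdays', 'previous', 'poutcome', 'y']
df = pd.DataFrame([[30, 'admin.', 'married', 'secondary', 'no', 1500,
                    'yes', 'no', 'cellular', 5, 'may', 120, 1, -1, 0,
                    'unknown', 'no']], columns=columns)

# Drop 'duration': it is only known after the call ends, so keeping it
# would leak information into a realistic predictive model.
df = df.drop(columns=['duration'])
```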
Here we first look for missing values. This dataset has none, so we don't need to impute or replace anything.
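The check itself is a one-liner in pandas (shown on a toy frame here):

```python
import pandas as pd

# Toy frame standing in for the loaded dataset.
df = pd.DataFrame({'age': [30, 45], 'balance': [1500, 200], 'y': ['no', 'yes']})

# Count missing values per column; on the Bank Marketing data every count is 0.
missing = df.isnull().sum()
print(missing)
```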
Next we handle outliers. The boxplot shows that there are many of them.
We use the Inter-Quartile Range (IQR) to rescale the outliers. We only treat the balance column, because the other columns are fairly well distributed.
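A minimal sketch of IQR-based treatment on the balance column; the post says "rescale" without giving details, so capping values to the standard 1.5 * IQR fences (winsorising) is an assumption:

```python
import pandas as pd

# Toy 'balance' column with one extreme value standing in for the real data.
df = pd.DataFrame({'balance': [100, 200, 250, 300, 350, 400, 50000]})

q1 = df['balance'].quantile(0.25)
q3 = df['balance'].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Cap values outside the IQR fences instead of dropping rows.
df['balance'] = df['balance'].clip(lower, upper)
```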
After correcting the outliers, we standardize the data so it trains better.
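Standardization scales each column to zero mean and unit variance; a sketch using scikit-learn's StandardScaler on toy numeric data:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy numeric features (e.g. age, balance) standing in for the real columns.
X = np.array([[30.0, 1500.0], [45.0, 200.0], [25.0, 800.0], [60.0, 50.0]])

scaler = StandardScaler()           # zero mean, unit variance per column
X_std = scaler.fit_transform(X)
```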
Next we encode the binary fields to numerical values and encode the categorical fields with a label encoder, because neural networks can only work with numerical input.
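A sketch of both encodings on toy columns:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    'housing': ['yes', 'no', 'yes'],           # binary field
    'job': ['admin.', 'technician', 'admin.']  # categorical field
})

# Map yes/no to 1/0, and label-encode the multi-valued categoricals.
df['housing'] = df['housing'].map({'yes': 1, 'no': 0})
df['job'] = LabelEncoder().fit_transform(df['job'])
```

Note that label encoding imposes an arbitrary ordering on the categories; one-hot encoding is often preferred for nominal fields, but the label encoder matches what this walkthrough uses.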
After that we need to create a neural network and train it on this data.
The input layer matches the number of input features, which is 15 (the 17 fields minus duration and the target y). Because the output is a binary value, either yes or no, the output layer has a single neuron that predicts the class.
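A possible Keras definition of such a network; the hidden-layer sizes and activations are assumptions, since the post only specifies the input shape and the single sigmoid output:

```python
import numpy as np
from tensorflow import keras

# 15 input features; one sigmoid neuron outputs P(subscribe = yes).
model = keras.Sequential([
    keras.layers.Input(shape=(15,)),
    keras.layers.Dense(32, activation='relu'),    # hidden sizes are assumptions
    keras.layers.Dense(16, activation='relu'),
    keras.layers.Dense(1, activation='sigmoid'),  # single output neuron: yes/no
])
model.compile(optimizer='adam', loss='binary_crossentropy',
              metrics=['accuracy'])

# Sanity check: one probability per input row.
preds = model.predict(np.zeros((4, 15), dtype='float32'), verbose=0)
```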
After creating the model we can train it on the dataset.
After training for 120 epochs we can see that our model reaches about 89% accuracy.
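The training loop can be sketched as below; synthetic features stand in for the prepared bank data, and only a few epochs are run here (the post trains for 120):

```python
import numpy as np
from tensorflow import keras

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 15)).astype('float32')  # stand-in for the 15 features
y = (X[:, 0] > 0).astype('float32')               # stand-in binary target

model = keras.Sequential([
    keras.layers.Input(shape=(15,)),
    keras.layers.Dense(16, activation='relu'),
    keras.layers.Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy',
              metrics=['accuracy'])

# The post uses epochs=120 on the real data; 5 suffices for this toy example.
history = model.fit(X, y, epochs=5, batch_size=32, verbose=0)
```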
Here we can see that the model predicts 'no' most of the time, which is the dominant class in the targets; this is caused by class imbalance. To mitigate this behaviour we can synthetically oversample the minority class using SMOTE (Synthetic Minority Oversampling Technique).
After applying SMOTE we can see that the training set has grown.
If we carry out the training process again, the model now has a harder time learning the patterns because the classes are balanced. By the end of training we get slightly lower accuracy, but a more usable model.
After retraining, the predictions show that the previously dominant class is predicted less often, as the confusion matrix confirms.
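The confusion matrix itself comes straight from scikit-learn; toy labels are used here in place of the real predictions:

```python
from sklearn.metrics import confusion_matrix

# Toy true labels and predictions (0 = 'no', 1 = 'yes').
y_true = [0, 0, 0, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 1, 0, 0, 1, 0]

cm = confusion_matrix(y_true, y_pred)
print(cm)  # rows: actual class, columns: predicted class
```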