We use classification techniques such as CART, Logistic Regression, Random Forest, Perceptron, SVM (Support Vector Machine), KNN (K-Nearest Neighbors), and Naive Bayes to predict a bank loan from the given features.
FLOW OF THE ARTICLE
1. About the data
2. Loading the data and importing the necessary libraries
3. Preprocessing, data visualization, and finding the correlation between the features and the target label
4. Splitting the data into training and testing sets
5. Classification models
6. Analyzing different classification metrics: Accuracy, MSE, RMSE, Precision, Recall
7. Comparing the models and concluding with a final model
1. About the data
I took this data set from Kaggle (link above); it consists of 5,000 rows and 14 columns.
The columns of the data set are:
1. ID
2. Age
3. Experience
4. Income
5. ZIP Code
6. Family
7. CCAvg
8. Education
9. Mortgage
10. Personal Loan
11. Securities Account
12. CD Account
13. Online
14. Credit Card
Our target label is Personal Loan; we need to classify whether a customer takes a personal loan based on the remaining features.
2. Importing the Libraries and Loading the Data
Now let's read the CSV file:
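A minimal sketch of this step with pandas, assuming the Kaggle CSV has been downloaded locally (the filename shown in the comment is my assumption, not from the article):

```python
import pandas as pd

def load_data(path):
    """Read the bank-loan CSV into a pandas DataFrame."""
    return pd.read_csv(path)

# The filename is an assumption; use the name of your Kaggle download.
# df = load_data("Bank_Personal_Loan_Modelling.csv")
# df.shape should come out as (5000, 14), matching the description above.
```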
3. Preprocessing the Data Set
By removing the unnecessary columns, applying one-hot encoding, and normalizing the data, we get a data set free of noise; there are also no missing values in the data set.
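As a sketch of this step (exactly which columns are dropped is my assumption; ID and ZIP Code are identifier-like, and Education is a natural candidate for one-hot encoding), using a tiny stand-in frame so the snippet runs on its own:

```python
import pandas as pd

# Tiny stand-in frame with a few of the real column names
df = pd.DataFrame({
    "ID": [1, 2, 3],
    "ZIP Code": [90210, 94720, 92121],
    "Age": [25, 45, 35],
    "Income": [49, 100, 72],
    "Education": [1, 2, 3],
    "Personal Loan": [0, 1, 0],
})

# Drop identifier-like columns that carry no predictive signal
df = df.drop(columns=["ID", "ZIP Code"])

# One-hot encode the categorical Education level
df = pd.get_dummies(df, columns=["Education"], prefix="Edu", dtype=int)

# Min-max normalize every feature column (the 0/1 target is left out)
features = [c for c in df.columns if c != "Personal Loan"]
df[features] = (df[features] - df[features].min()) / (df[features].max() - df[features].min())
```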
4. Correlation Matrix for the Data Set (Data Visualization)
To see which features are important, i.e., contribute most to predicting the class, we draw a correlation matrix; a feature with a higher absolute correlation to the target contributes more to the prediction.
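A small runnable sketch with toy numbers (not the real data): `df.corr()` computes the matrix, and sorting the target's column by absolute value ranks the features. For the visual heatmap, one would typically pass `df.corr()` to seaborn's `heatmap`.

```python
import pandas as pd

# Toy numeric frame standing in for the preprocessed data
df = pd.DataFrame({
    "Income": [49, 100, 72, 120, 30],
    "CCAvg": [1.6, 2.5, 2.0, 3.9, 0.5],
    "Age": [25, 45, 35, 50, 29],
    "Personal Loan": [0, 1, 0, 1, 0],
})

# Correlation of each feature with the target, strongest (in absolute value) first
corr = df.corr()["Personal Loan"].drop("Personal Loan")
ranked = corr.reindex(corr.abs().sort_values(ascending=False).index)
print(ranked)
```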
5. Splitting the Data into Training and Testing Sets
We use the sklearn library to split the data into training and testing sets in an 80-20 ratio.
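With scikit-learn this is a single call; the snippet below uses a synthetic stand-in for the feature matrix so it runs on its own (11 features, assuming ID, ZIP Code, and the target are excluded from the 14 columns):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the feature matrix and Personal Loan labels
X, y = make_classification(n_samples=1000, n_features=11, random_state=42)

# 80-20 split, matching the ratio used in the article
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)
```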
6. Classification Models
Logistic Regression:
Logistic regression is used to describe data and to explain the relationship between one dependent binary variable and one or more nominal, ordinal, interval, or ratio-level independent variables.
Importing the libraries that are needed.
Calculate the evaluation measures.
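A self-contained sketch of the logistic-regression step together with the metrics listed above (synthetic data stands in for the real features):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, mean_squared_error,
                             precision_score, recall_score)
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=11, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
pred = clf.predict(X_test)

accuracy = accuracy_score(y_test, pred)
precision = precision_score(y_test, pred)
recall = recall_score(y_test, pred)
mse = mean_squared_error(y_test, pred)   # on 0/1 labels, MSE equals the error rate
rmse = np.sqrt(mse)
print(f"acc={accuracy:.3f} prec={precision:.3f} rec={recall:.3f} rmse={rmse:.3f}")
```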
Naive Bayes
Naive Bayes is a simple technique for constructing classifiers: models that assign class labels to problem instances, represented as vectors of feature values, where the class labels are drawn from some finite set.
Now let’s import the necessary libraries:
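A minimal runnable sketch (GaussianNB is an assumption on my part; scikit-learn offers several Naive Bayes variants):

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for the preprocessed bank-loan features
X, y = make_classification(n_samples=1000, n_features=11, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

nb = GaussianNB().fit(X_train, y_train)
acc = accuracy_score(y_test, nb.predict(X_test))
print(f"Naive Bayes accuracy: {acc:.3f}")
```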
Random Forest
A random forest is a classification algorithm consisting of many decision trees. It uses bagging and feature randomness when building each individual tree to try to create an uncorrelated forest of trees whose prediction by committee is more accurate than that of any individual tree.
Now let’s import the necessary libraries:
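A minimal runnable sketch with scikit-learn's RandomForestClassifier (synthetic data stands in for the real features):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=11, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 100 trees, each trained on a bootstrap sample with random feature subsets
rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)
acc = accuracy_score(y_test, rf.predict(X_test))
print(f"Random Forest accuracy: {acc:.3f}")
```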
SVM:
A Support Vector Machine (SVM) is a discriminative classifier formally defined by a separating hyperplane. In other words, given labeled training data (supervised learning), the algorithm outputs an optimal hyperplane which categorizes new examples.
Now let’s import the necessary libraries:
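A minimal runnable sketch with scikit-learn's SVC (the kernel choice is my assumption; the default RBF kernel is shown, while a linear kernel gives the classic separating hyperplane):

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, n_features=11, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

svm = SVC().fit(X_train, y_train)
acc = accuracy_score(y_test, svm.predict(X_test))
print(f"SVM accuracy: {acc:.3f}")
```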
Perceptron:
A Perceptron is a single neural network unit: it computes a weighted sum of the input features and applies a threshold to produce a binary prediction.
Now let’s import the necessary libraries:
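A minimal runnable sketch with scikit-learn's Perceptron (synthetic data stands in for the real features):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import Perceptron
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=11, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

perc = Perceptron(random_state=42).fit(X_train, y_train)
acc = accuracy_score(y_test, perc.predict(X_test))
print(f"Perceptron accuracy: {acc:.3f}")
```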
CART:
CART (Classification and Regression Trees) builds a binary decision tree by repeatedly splitting the data on the feature and threshold that best separate the classes, typically measured by Gini impurity.
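scikit-learn's DecisionTreeClassifier implements an optimized version of CART, so a runnable sketch (on synthetic stand-in data) looks like:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=11, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Gini impurity is the default splitting criterion, as in CART
cart = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
acc = accuracy_score(y_test, cart.predict(X_test))
print(f"CART accuracy: {acc:.3f}")
```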
KNN(K Nearest Neighbor):
The k-nearest neighbors (KNN) algorithm is a simple, easy-to-implement supervised machine learning algorithm that can be used to solve both classification and regression problems.
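A minimal runnable sketch with scikit-learn's KNeighborsClassifier (k = 5, the library default, is my assumption):

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=1000, n_features=11, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Classify each test point by majority vote among its 5 nearest training points
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
acc = accuracy_score(y_test, knn.predict(X_test))
print(f"KNN accuracy: {acc:.3f}")
```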
Comparing the models:
With 80% training and 20% testing, the Random Forest model achieves an accuracy of 98.7%, the highest among all the models.
With 70% training and 30% testing, the Random Forest model achieves an accuracy of 98.1%, again the highest among all the models.
With 60% training and 40% testing, the Random Forest model achieves an accuracy of 98.05%, again the highest among all the models.
Conclusion:
Across all these splits, the accuracy is highest for Random Forest, peaking at 98.7%.
Thank you
Abhishek Veeravelli
Bennett University