Logistic Regression - A walkthrough using the Social Network Ads dataset


Classification techniques are important to understand in machine learning and data mining applications.

Logistic regression is a very useful methodology for solving classification problems, particularly binary classification problems, in which the target can take only two possible classes.

However, logistic regression can also be applied to multinomial classification, where the target can take more than two classes.


We know that a linear equation over input features X1 to Xn can be represented as

                                 f(x) = β0 + β1X1 + β2X2 + ....... + βnXn

where β1, β2, β3 .... βn are the coefficients of the respective input features, β0 is the constant (intercept) term, and f(x) is our predicted target feature.

Logistic regression is an extension of linear regression that passes f(x) through the Sigmoid (aka Logistic) function. The Sigmoid function is represented as

                                 p = 1 / (1 + e^(-f(x)))

The Sigmoid curve is S-shaped, squeezing any real-valued input into a probability between 0 and 1.
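To make this concrete, here is a minimal sketch of the sigmoid in NumPy (the input values below are illustrative only):

# A quick sketch of the sigmoid function
import numpy as np

def sigmoid(z):
    # maps any real-valued input into the (0, 1) interval
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])  # illustrative inputs
print(sigmoid(z))  # sigmoid(0) is 0.5, the decision midpoint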

Logistic regression employs Maximum Likelihood Estimation (MLE) to fit its coefficients. From a statistical perspective, MLE chooses the parameter values that make the observed data most probable under the model.
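To give a feel for what MLE optimizes, the snippet below (a sketch with made-up labels and probabilities) computes the average negative log-likelihood, also known as log loss; fitting the model amounts to choosing coefficients that make this quantity as small as possible:

# Sketch: log loss, the quantity MLE effectively minimizes
import numpy as np

y_true = np.array([1, 0, 1, 1, 0])            # illustrative labels
p_pred = np.array([0.9, 0.2, 0.7, 0.6, 0.1])  # illustrative predicted probabilities

log_loss = -np.mean(y_true * np.log(p_pred) + (1 - y_true) * np.log(1 - p_pred))
print(log_loss)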


Types of Logistic Regression:

Binary Logistic Regression: the target variable has only two possible outcomes, such as Yes or No, Will purchase or Will not purchase.

Multinomial Logistic Regression: the target variable has three or more nominal (unordered) categories, such as predicting the type of liquor (a quick sketch follows this list).

Ordinal Logistic Regression: the target variable has three or more ordinal (ordered) categories, such as a movie rating from 1 to 10.
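As a quick sketch of the multinomial case (using scikit-learn's bundled iris dataset purely for illustration):

# Sketch: LogisticRegression handles a three-class target out of the box
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X_iris, y_iris = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=1000).fit(X_iris, y_iris)
print(clf.predict(X_iris[:3]))  # each prediction is one of three classes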

The following metrics are generally used to evaluate logistic regression models:


Confusion Matrix: for binary classification, a two-dimensional array that cross-tabulates actual versus predicted classes. The fundamental idea is that the numbers of correct and incorrect predictions are summed up class-wise. We will go through this matrix in our example at a later stage.

The confusion matrix is represented as a cross tabulation of the following four counts (a small worked example follows the list):

  • True Positive (TP: actual Success, correctly predicted Success)
  • False Positive (FP: actual Failure, incorrectly predicted Success)
  • True Negative (TN: actual Failure, correctly predicted Failure)
  • False Negative (FN: actual Success, incorrectly predicted Failure)
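Here is a small worked example with made-up labels, just to show the layout scikit-learn produces (rows are actual classes, columns are predicted):

# Sketch: a toy confusion matrix from illustrative labels
from sklearn.metrics import confusion_matrix

actual    = [1, 0, 1, 1, 0, 0, 1, 0]  # made-up ground truth
predicted = [1, 0, 0, 1, 0, 1, 1, 0]  # made-up predictions

# layout: [[TN, FP],
#          [FN, TP]]
print(confusion_matrix(actual, predicted))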


Accuracy: the most intuitive performance measure; it is simply the ratio of correctly predicted observations to the total observations.

Accuracy is measured using the formula (TP + TN) / (TP + FP + FN + TN)

Precision: of all the observations the model predicted as positive, the fraction that are actually positive (i.e., when the model predicts Success, how often it is right).

Precision is calculated using the formula TP / (TP + FP)

Recall: of all the observations that are actually positive, the fraction the model correctly identifies.

Recall is calculated using the formula TP / (TP + FN)

Precision and recall both resemble accuracy, but each looks at only a slice of the data: precision considers only the observations the model predicted positive, while recall considers only the observations that are actually positive.

F1 Score: the harmonic mean of Precision and Recall. Because of this, the score takes both false positives and false negatives into account.

F1 is usually more useful than accuracy, especially if we have an uneven class distribution.

F1 Score is calculated using the formula 2 * (Precision * Recall) / (Recall + Precision)
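To tie these formulas together, here is a quick sketch that computes all four metrics from illustrative counts:

# Sketch: the four metrics computed by hand from made-up counts
TP, FP, TN, FN = 20, 4, 51, 5  # illustrative values

accuracy  = (TP + TN) / (TP + FP + FN + TN)
precision = TP / (TP + FP)
recall    = TP / (TP + FN)
f1        = 2 * (precision * recall) / (precision + recall)

print(accuracy, precision, recall, f1)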

Let's start with our example and walk through logistic regression in practice

# Upload the dataset file (when running in Google Colab)
from google.colab import files
files.upload()

Import the data into Data Frame to start our analysis

# Read the raw data
import pandas as pd
social_network_data = pd.read_csv('Social_Network_Ads.csv', skiprows=1,
                                  names=['User_ID', 'Gender', 'Age', 'EstimatedSalary', 'Purchased'])
social_network_data.head()


Let's check how the attributes are populated

social_network_data.info()

We notice all data is populated and the data types are as expected

social_network_data.Gender.value_counts()

Gender contains only Male and Female. Let's encode these values as integers
(0 for Male and 1 for Female) and convert the data type from object to int64

# Categories sort alphabetically (Female, Male), so this maps Female -> 1, Male -> 0
newGender = pd.Categorical(social_network_data.Gender).rename_categories([1, 0])
social_network_data.Gender = newGender.astype(int)
social_network_data.Gender.value_counts()
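As an aside, an equally valid and arguably more explicit alternative is pandas' map, shown here on a throwaway Series since our real column has already been converted:

# Sketch of an alternative: an explicit value mapping
demo = pd.Series(['Male', 'Female', 'Female', 'Male'])
print(demo.map({'Male': 0, 'Female': 1}))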

social_network_data.info()

We have successfully converted Male to 0 and Female to 1 and also amended the data type to integer

social_network_data.head(10)

Our goal is to predict Purchased, which is categorical and has only two values (0 = Not Purchased, 1 = Purchased).

So logistic regression is well suited to this problem.

As a first step, let's choose the input and output features. We want to predict Purchased, so this alone will be the output feature.

User_ID is just an identifier and should have no impact on the Purchased feature, so we select all other attributes except User_ID as input features.

#split dataset in features and target variable
feature_cols = ['Gender', 'Age', 'EstimatedSalary' ]
X = social_network_data[feature_cols]   # Input Features
y = social_network_data.Purchased       # Target variable

The data supplied has 400 records. Let's use 80% of them to build our model and 20% to test

from sklearn.model_selection import train_test_split

# Hold out 20% of the 400 records for testing
X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size=0.20, random_state=42)
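One refinement worth knowing about: passing stratify=y makes train_test_split keep the Purchased class proportions identical in both splits, which helps when the classes are imbalanced. A quick sketch (the _s names are just to avoid clobbering our variables):

# Sketch: a stratified split preserves class proportions in train and test
X_train_s, X_test_s, Y_train_s, Y_test_s = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=y)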

# import the required model classes
from sklearn.linear_model import LogisticRegression

# instantiate the model (using the default parameters)
purchasereg = LogisticRegression()

# fit the model with data
purchasereg.fit(X_train,Y_train)

# predict the Purchased category for test data

Y_pred = purchasereg.predict( X_test )
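Since logistic regression is probabilistic under the hood, we can also peek at the sigmoid outputs behind these hard 0/1 labels:

# The columns are P(Purchased=0) and P(Purchased=1) for each test row
print(purchasereg.predict_proba(X_test)[:5])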


We have used the built-in modules available in sklearn to predict the Purchased category.

Now, let's evaluate our model

from sklearn import metrics

confusionMatrix = metrics.confusion_matrix( Y_test, Y_pred )

confusionMatrix

# import required modules
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

Purchased_class=[0,1] # Purchased possible values
fig, ax = plt.subplots()
tick_marks = np.arange(len(Purchased_class))
plt.xticks(tick_marks, Purchased_class)
plt.yticks(tick_marks, Purchased_class)

# Create heatmap
sns.heatmap(pd.DataFrame(confusionMatrix), annot=True, cmap="YlGnBu" ,fmt='g')
ax.xaxis.set_label_position("top")
plt.tight_layout()
plt.title('Confusion matrix', y=1.1)
plt.ylabel('Actual')
plt.xlabel('Predicted')

print("Accuracy:",metrics.accuracy_score(Y_test, Y_pred))
print("Precision:",metrics.precision_score(Y_test, Y_pred))
print("Recall:",metrics.recall_score(Y_test, Y_pred))

We notice that the model is predicting only one category for every test sample.

One likely reason for this behavior is the difference in scale among the input features: EstimatedSalary is several orders of magnitude larger than Age and Gender, so it dominates the fit.

Let's try standardizing the data and retry our model

# import the required classes
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

# Fit the scaler on the training data only, then apply the same
# transformation to both the training and test sets
X_train_new = pd.DataFrame(scaler.fit_transform(X_train), columns=X_train.columns)
X_test_new = pd.DataFrame(scaler.transform(X_test), columns=X_test.columns)

# instantiate the model (using the default parameters)
purchasereg = LogisticRegression()

# fit the model with data
purchasereg.fit(X_train_new,Y_train)

# predict the Purchased category for test data

Y_pred = purchasereg.predict( X_test_new )
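As an aside, scikit-learn's Pipeline can bundle the scaler and the model so the scaler is always fit on training data only; here is a sketch equivalent to the manual steps above:

# Sketch: scaling and the model combined in a single Pipeline
from sklearn.pipeline import make_pipeline

pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_train, Y_train)
print(pipe.score(X_test, Y_test))  # accuracy of the pipelined model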

Now it's time to rebuild the confusion matrix and validate our model

from sklearn import metrics

confusionMatrix = metrics.confusion_matrix( Y_test, Y_pred )



Let's find the Accuracy, Precision, and Recall

Purchased_class=[0,1] # Purchased possible values
fig, ax = plt.subplots()
tick_marks = np.arange(len(Purchased_class))
plt.xticks(tick_marks, Purchased_class)
plt.yticks(tick_marks, Purchased_class)

# Create heatmap
sns.heatmap(pd.DataFrame(confusionMatrix), annot=True, cmap="YlGnBu" ,fmt='g')
ax.xaxis.set_label_position("top")
plt.tight_layout()
plt.title('Confusion matrix', y=1.1)
plt.ylabel('Actual')
plt.xlabel('Predicted')

print("Accuracy:",metrics.accuracy_score(Y_test, Y_pred))
print("Precision:",metrics.precision_score(Y_test, Y_pred))
print("Recall:",metrics.recall_score(Y_test, Y_pred))


Accuracy: 0.8875
Precision: 0.9130434782608695
Recall: 0.75

Accuracy is about 89%, which is pretty good. So, I will assume we have a very good prediction model.
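As a sanity check, 5-fold cross-validation over the full dataset gives a more robust estimate than our single train/test split; a quick sketch:

# Sketch: 5-fold cross-validated accuracy of the scaled model
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

scores = cross_val_score(
    make_pipeline(StandardScaler(), LogisticRegression()), X, y, cv=5)
print(scores.mean())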








