Logistic Regression - A Walk-through Using the Social Network Ads Dataset
Classification techniques are very important to understand in the field of machine learning and data mining. Logistic regression is a very useful methodology for solving classification problems, particularly binary classification problems, in which the target takes only one of two possible classes. However, logistic regression can also be applied to multinomial classification, where the target can take more than two classes.
We know that a linear equation over the input features X1 to Xn can be represented as
f(x) = β0 + β1X1 + β2X2 + ... + βnXn
where β1, β2, ..., βn are the coefficients of the respective input features, β0 is the constant (intercept) term, and f(x) is our predicted target.
Logistic regression is an extension of linear regression that passes this linear combination through the sigmoid (aka logistic) function, represented as
σ(z) = 1 / (1 + e^(-z))
The sigmoid curve is S-shaped and squashes any real value into the range (0, 1), which lets us interpret the output as a probability.
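As a quick illustration (a minimal sketch, independent of our dataset), the sigmoid squashes any real number into (0, 1):
import numpy as np
def sigmoid(z):
    # Maps any real value into the open interval (0, 1)
    return 1.0 / (1.0 + np.exp(-z))
print(sigmoid(np.array([-5.0, 0.0, 5.0])))  # approx [0.0067, 0.5, 0.9933]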
Logistic regression employs the Maximum Likelihood Estimation (MLE) methodology to fit its coefficients. From a statistical perspective, MLE determines the specific parameter values for a given model by choosing the values that make the observed data most probable.
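To make this concrete, here is a minimal sketch (the names beta, X, and y are illustrative, not from the original post): MLE fits logistic regression by choosing the coefficients that minimize the negative log-likelihood of the observed labels.
import numpy as np
def neg_log_likelihood(beta, X, y):
    # Predicted probability of class 1 under coefficients beta
    # (assumes the first column of X is the constant feature X0 = 1)
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    # MLE picks the beta that makes the observed labels most probable,
    # i.e., the beta that minimizes this quantity
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))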
Types of Logistic Regression:
- Binary Logistic Regression: the target variable has only two possible outcomes, such as Yes or No, Will purchase or Will not purchase.
- Multinomial Logistic Regression: the target variable has three or more nominal categories, such as predicting the type of liquor (see the sketch after this list).
- Ordinal Logistic Regression: the target variable has three or more ordinal categories, such as a movie rating from 1 to 10.
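For the multinomial case, a minimal sketch (the toy data and names below are illustrative, not from this post): recent versions of scikit-learn's LogisticRegression fit a multinomial (softmax) model automatically when the target has more than two classes; ordinal logistic regression is not built into scikit-learn.
from sklearn.linear_model import LogisticRegression
import numpy as np
# Toy data: one feature, three "liquor type" classes encoded as 0, 1, 2
X_toy = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [20.0], [21.0]])
y_toy = np.array([0, 0, 0, 1, 1, 2, 2])
clf = LogisticRegression().fit(X_toy, y_toy)
print(clf.predict([[2.5], [12.0]]))  # predicts one of the three classes per row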
The following metrics are generally used to evaluate logistic regression models.
Confusion Matrix: a two-dimensional array (in the case of binary classification) showing the cross-tabulation of actual versus predicted values. The fundamental idea of a confusion matrix is that the numbers of correct and incorrect predictions are summed up class-wise. We will go through this matrix in our example at a later stage.
The confusion matrix is represented as a cross-tabulation of the following:
- True Positive (TP): actual success, correctly predicted as success
- False Positive (FP): actual failure, incorrectly predicted as success
- True Negative (TN): actual failure, correctly predicted as failure
- False Negative (FN): actual success, incorrectly predicted as failure
Accuracy: the most intuitive performance measure; it is simply the ratio of correctly predicted observations to the total observations. Accuracy is measured using the formula (TP + TN) / (TP + FP + FN + TN).
Precision: of all the observations the model predicted as positive, the fraction that are actually positive. Precision is calculated using the formula TP / (TP + FP).
Recall: of all the observations that are actually positive, the fraction the model correctly identified. Recall is calculated using the formula TP / (TP + FN).
Precision and recall are both related to accuracy, but each looks at a different slice of the data: precision only takes into account the observations you predicted as positive, while recall only takes into account the observations that are actually positive.
F1 Score: the weighted (harmonic) average of precision and recall; it therefore takes both false positives and false negatives into account. F1 is usually more useful than accuracy, especially if we have an uneven class distribution. F1 Score is calculated using the formula 2 * (Recall * Precision) / (Recall + Precision).
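To tie the formulas together, here is a small sketch; the TP/FP/TN/FN counts are made-up numbers, not from our dataset.
# Illustrative counts only
TP, FP, TN, FN = 40, 10, 30, 20
accuracy = (TP + TN) / (TP + FP + FN + TN)            # 0.70
precision = TP / (TP + FP)                            # 0.80
recall = TP / (TP + FN)                               # ~0.67
f1 = 2 * (recall * precision) / (recall + precision)  # ~0.73
print(accuracy, precision, recall, f1)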
Let's start with our example and walk through logistic regression in practice.
from google.colab import files
files.upload()
Import the data into a DataFrame to start our analysis.
# Read the rawdata
import pandas as pd
social_network_data = pd.read_csv('Social_Network_Ads.csv', skiprows=1, names=['User_ID', 'Gender', 'Age', 'EstimatedSalary', 'Purchased'])
social_network_data.head()
Let's check how the attributes are populated.
social_network_data.info()
We notice all data is populated and the data types are as expected
social_network_data.Gender.value_counts()
Gender contains only Male and Female. Let's encode these values as integers (0 for Male and 1 for Female) and convert the data type from object to int64.
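# Note: pd.Categorical orders categories alphabetically (Female, Male),
# so renaming to [1, 0] maps Female -> 1 and Male -> 0 as intended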
newGender = pd.Categorical(social_network_data.Gender).rename_categories([1, 0])
social_network_data.Gender = newGender.astype(int)
social_network_data.Gender.value_counts()
social_network_data.info()
We have successfully converted Male to 0 and Female to 1, and changed the data type to integer.
social_network_data.head(10)
Our goal is to predict Purchased, which is categorical and has only two values (0 = Not Purchased, 1 = Purchased).
So, logistic regression is well suited to this case.
As a first step, let's choose the input and output features. We want to predict Purchased, so that alone is the output feature.
User_ID, being just an identifier, should have no impact on the Purchased feature. So, we select all other attributes except User_ID as input features.
#split dataset in features and target variable
feature_cols = ['Gender', 'Age', 'EstimatedSalary']
X = social_network_data[feature_cols] # Input Features
y = social_network_data.Purchased # Target variable
The data supplied has 400 records. Let's use 80% of them to build our model and the remaining 20% to test it.
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size=0.20, random_state=42)
# import the required model classes
from sklearn.linear_model import LogisticRegression
# instantiate the model (using the default parameters)
purchasereg = LogisticRegression()
# fit the model with data
purchasereg.fit(X_train,Y_train)
# predict the Purchased category for test data
Y_pred = purchasereg.predict(X_test)
We have used the built-in modules available in sklearn to predict the Purchased category.
Now, let's evaluate our model
from sklearn import metrics
confusionMatrix = metrics.confusion_matrix(Y_test, Y_pred)
confusionMatrix
# import required modules
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
Purchased_class=[0,1] # Purchased possible values
fig, ax = plt.subplots()
tick_marks = np.arange(len(Purchased_class))
plt.xticks(tick_marks, Purchased_class)
plt.yticks(tick_marks, Purchased_class)
# Create heatmap
sns.heatmap(pd.DataFrame(confusionMatrix), annot=True, cmap="YlGnBu", fmt='g')
ax.xaxis.set_label_position("top")
plt.tight_layout()
plt.title('Confusion matrix', y=1.1)
plt.ylabel('Actual')
plt.xlabel('Predicted')
print("Accuracy:",metrics.accuracy_score(Y_test, Y_pred))
print("Precision:",metrics.precision_score(Y_test, Y_pred))
print("Recall:",metrics.recall_score(Y_test, Y_pred))
We notice that the model is always predicting only one category.
One possible reason for this odd behavior is the difference in scale among the input features: EstimatedSalary is a large number compared to Age and Gender, so it can dominate the optimization.
Let's try standardizing the data and retry our model.
# import the model class
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_new = pd.DataFrame(scaler.fit_transform(X_train), columns=X_train.columns)
X_test_new = pd.DataFrame(scaler.transform(X_test), columns=X_test.columns)
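As an optional sanity check (not part of the original flow), each standardized training column should now have mean ~0 and standard deviation ~1:
print(X_train_new.mean().round(3))
print(X_train_new.std().round(3))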
# instantiate the model (using the default parameters)
purchasereg = LogisticRegression()
# fit the model with data
purchasereg.fit(X_train_new,Y_train)
# predict the Purchased category for test data
Y_pred = purchasereg.predict(X_test_new)
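As an aside, predict returns hard 0/1 labels by thresholding the model's sigmoid output at 0.5; to inspect the underlying probabilities directly, a quick sketch:
# Column 0 = P(Not Purchased), column 1 = P(Purchased); first 5 test rows
print(purchasereg.predict_proba(X_test_new)[:5])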
Now it's time to build the confusion matrix again and validate our model.
from sklearn import metrics
confusionMatrix = metrics.confusion_matrix(Y_test, Y_pred)
Let's find the accuracy, precision, and recall.
Purchased_class=[0,1] # Purchased possible values
fig, ax = plt.subplots()
tick_marks = np.arange(len(Purchased_class))
plt.xticks(tick_marks, Purchased_class)
plt.yticks(tick_marks, Purchased_class)
# Create heatmap
sns.heatmap(pd.DataFrame(confusionMatrix), annot=True, cmap="YlGnBu", fmt='g')
ax.xaxis.set_label_position("top")
plt.tight_layout()
plt.title('Confusion matrix', y=1.1)
plt.ylabel('Actual')
plt.xlabel('Predicted')
print("Accuracy:",metrics.accuracy_score(Y_test, Y_pred))
print("Precision:",metrics.precision_score(Y_test, Y_pred))
print("Recall:",metrics.recall_score(Y_test, Y_pred))
Accuracy: 0.8875
Precision: 0.9130434782608695
Recall: 0.75
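Since precision and recall differ noticeably, it is worth also computing the F1 score introduced earlier; a one-line check:
print("F1:", metrics.f1_score(Y_test, Y_pred))
With the precision and recall above, this works out to roughly 0.82.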
Accuracy is about 89% and precision is above 91%, which is pretty good, so we can consider this a solid prediction model. The recall of 0.75, however, shows that the model still misses a quarter of the actual purchasers.